Check whether two datasets are statistically different

I have two data sets (each with datapoints + standard deviation) and want to check whether they are statistically different. What kind of test would be appropriate?
Thank you!

The answer depends. If the blue and red samples are randomly obtained from the same group of items but measured at different times, a paired two-sample t-test applies. If they come from different groups, the unpaired two-sample t-test is suitable. Both choices assume that the blue and red samples are normally distributed, or can be brought to a normal distribution by a logarithmic transformation. Otherwise, you need to use the Mann–Whitney U test. The data values to compare are the Output percentages for the same Input value; the data should be continuous, as in your case.
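As an illustrative sketch of how these three tests might be run (SciPy is assumed, and the sample values below are invented, not from the question):

```python
# Illustrative sketch: choosing between paired, unpaired, and rank-based
# two-sample tests with SciPy. The sample values are made up for demonstration.
from scipy import stats

red = [21.5, 24.1, 19.8, 22.3, 25.0, 23.4]   # hypothetical Output percentages
blue = [18.2, 20.7, 17.9, 19.5, 21.3, 20.1]

# Same items measured twice (e.g. before/after) -> paired t-test
t_paired = stats.ttest_rel(red, blue)

# Independent groups -> unpaired t-test (Welch's version avoids
# assuming equal variances)
t_unpaired = stats.ttest_ind(red, blue, equal_var=False)

# Normality not plausible even after a log transform -> Mann-Whitney U
mw = stats.mannwhitneyu(red, blue, alternative="two-sided")

print(t_paired.pvalue, t_unpaired.pvalue, mw.pvalue)
```

The p-value interpretation is the same in each case: small values indicate the two samples are unlikely to come from the same distribution.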

Related

Is there a metric that can determine spatial and temporal proximity together?

Given a dataset which consists of geographic coordinates and the corresponding timestamps for each record, I want to know if there's any suitable measure that can determine the closeness between two points by taking the spatial and temporal distance into consideration.
The approach I've tried so far is to implement a distance measure between the two coordinate values and to calculate the time difference separately. But in this case, I'd need two threshold values, one for the spatial distance and one for the temporal distance, to determine overall proximity.
I wanted to know whether there's any single function that can take these values as input together and give a single measure of proximity. Ultimately, I want to be able to use this measure to cluster similar records together.
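One common trick (a sketch, not something established in the thread) is to rescale each component by a characteristic scale and combine them into a single metric; the scale parameters below are assumptions you would tune, and positions are taken in metres rather than raw latitude/longitude:

```python
import math

def spatiotemporal_distance(p, q, space_scale=100.0, time_scale=3600.0):
    """Combine spatial and temporal separation into one number.

    p and q are (x_metres, y_metres, t_seconds) tuples. space_scale and
    time_scale are tuning constants (assumed values here): a separation of
    one scale unit on any axis contributes equally to the combined distance.
    """
    dx = (p[0] - q[0]) / space_scale
    dy = (p[1] - q[1]) / space_scale
    dt = (p[2] - q[2]) / time_scale
    return math.sqrt(dx * dx + dy * dy + dt * dt)

# Two records 100 m and one hour apart -> each axis contributes one unit
d = spatiotemporal_distance((0.0, 0.0, 0.0), (100.0, 0.0, 3600.0))
print(d)  # sqrt(1 + 0 + 1) = sqrt(2)
```

With a single combined distance like this, one threshold suffices, and any distance-based clustering method can treat space and time together.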

Why do Excel and Matlab give different results?

I have 352k values and I want to find the most frequent value among them.
Numbers are rounded to two decimal places.
I use the commands mode(a) in Matlab and mode(B1:B352000) in Excel, but the results are different.
Where did I make a mistake, or which one can I believe?
Thanks
//edit: When I use other commands like average, the results are the same.
From Wikipedia:
For a sample from a continuous distribution, such as [0.935..., 1.211..., 2.430..., 3.668..., 3.874...], the concept is unusable in its raw form, since no two values will be exactly the same, so each value will occur precisely once. In order to estimate the mode of the underlying distribution, the usual practice is to discretize the data by assigning frequency values to intervals of equal distance, as for making a histogram, effectively replacing the values by the midpoints of the intervals they are assigned to. The mode is then the value where the histogram reaches its peak. For small or middle-sized samples the outcome of this procedure is sensitive to the choice of interval width if chosen too narrow or too wide.
Thus, it is likely that the two programs use a different interval size, yielding different answers. You can believe both (I presume), as long as you keep in mind that the value returned is an approximation to the true mode of the underlying distribution.
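Another possibility worth checking (an assumption on my part, not something established in the thread): since your values are rounded to two decimal places, several distinct values can tie for the highest count, and different programs break such ties differently. A small plain-Python sketch of how a tie plays out:

```python
from collections import Counter

# Rounded data where two values tie for the most frequent
data = [1.25, 1.25, 3.10, 3.10, 2.00]

counts = Counter(data)
best = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == best)
print(modes)  # both 1.25 and 3.10 are equally valid modes

# A program that reports the smallest tied value says 1.25; one that
# reports the first (or last) value encountered may say 3.10 instead.
smallest_mode = modes[0]
print(smallest_mode)
```

Comparing the full frequency table from both programs, rather than the single reported mode, would reveal whether a tie is the culprit.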

Correlation statistics

Naive Question:
In the attached snapshot, I am trying to understand how correlation behaves when applied to actual values versus a calculation performed on those values that creates a new stream of data.
In the example,
Columns A, B, C, D, E have very different correlations, but when I do a rolling sum on the same columns to get G, H, I, J, K, the correlations are very much the same (either strongly negative or strongly positive).
Are these two different types of correlation, or am I missing something?
Thanks in advance!!
Yes, these are different correlations. It's as if you measured the acceleration over time of 5 automobiles (your first set of data) and correlated those accelerations. Each car accelerates at a different rate over time, leaving your correlations all over the place.
Your second set of data would be the velocity of each car at each point in time. Because each car is accelerating at a fairly constant rate (and doing so in two different directions from the starting point), you get either a big positive or a big negative correlation.
You won't necessarily get that big positive or negative correlation in the second set, but since the data in each list is consistently positive or negative and grows at a consistent rate, each list correlates strongly with the others.
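A small numerical sketch of this effect (the increments are made-up numbers, and the Pearson correlation is computed from its definition in plain Python):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient, computed from the definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def running_sum(x):
    out, total = [], 0.0
    for v in x:
        total += v
        out.append(total)
    return out

# Per-step values (the "actual values"): two weakly related series
a = [1, 3, 2, 5, 1, 4]
b = [2, 1, 4, 1, 3, 2]

raw = pearson(a, b)                            # modest correlation
summed = pearson(running_sum(a), running_sum(b))  # both sums trend upward

print(raw, summed)
```

Because both running sums grow steadily, they correlate almost perfectly even though the underlying increments do not: the correlation is dominated by the shared trend, not by any step-to-step relationship.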

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating onto a common grid before computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, i.e. the interpolated points aren't likely to fall exactly at your measured extremes. This has a greater effect on extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you go this route, you'll need the interp1 and mean functions.
An alternative is a weighted mean, which avoids truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the weighted mean by:
timeStep = diff(t);
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".
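For the interpolation route, here is a minimal sketch in Python/NumPy rather than MATLAB, purely to illustrate the idea (np.interp plays the role of interp1, and the two linear curves are invented test data):

```python
import numpy as np

# Two hypothetical curves sampled at different, non-uniform times
t1 = np.array([0.0, 0.5, 1.2, 2.0, 4.0])
y1 = 2.0 * t1                       # curve 1: y = 2t
t2 = np.array([0.1, 0.9, 1.5, 3.0, 5.0])
y2 = 4.0 * t2                       # curve 2: y = 4t

# Common grid restricted to the overlap of the curves' time ranges,
# so no curve is extrapolated
lo = max(t1.min(), t2.min())
hi = min(t1.max(), t2.max())
grid = np.linspace(lo, hi, 50)

# Resample every curve onto the common grid (piecewise-linear interpolation)
resampled = np.vstack([np.interp(grid, t1, y1),
                       np.interp(grid, t2, y2)])

avg = resampled.mean(axis=0)   # the "average" curve
std = resampled.std(axis=0)    # spread across the curves at each grid time

# Since both inputs are linear, the average curve is y = 3t on the grid
print(avg[0], avg[-1])
```

Plotting avg against grid, with std as error bars or a shaded band, gives the averaged curve with its standard deviation; with more than two curves you simply stack more rows into resampled.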

Comparing and visualising groups of sequences

I have two groups A and B of strings over the letters "AGTE", and I'd like to find some way of comparing them to see whether they are statistically similar. The first group, A, contains real-world observations; B contains predictions. There are 400 or so strings in each group, e.g.
**A**
GTAATEGTTTEAAA
TTEAGE
...
**B**
AGTEAAAAGT
TAT
GGATEAATGGGTEAATG
....
I'd also like to be able to visualise these in some way, really for presentation purposes. Do you have any ideas how I might do that?
I'd suggest you compute the Levenshtein distance between the strings; you can then plot these inter-string distances. Larger values indicate strings that are more dissimilar.
If you don't want to implement the Levenshtein distance calculation yourself, check out these submissions on the MATLAB File Exchange.
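If you'd rather prototype outside MATLAB, the distance itself is a short dynamic program; here is a sketch in Python (the first pair of example strings is taken from the question):

```python
def levenshtein(s, t):
    """Edit distance between strings s and t (insert/delete/substitute, cost 1)."""
    prev = list(range(len(t) + 1))           # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(levenshtein("GTAATEGTTTEAAA", "AGTEAAAAGT"))
print(levenshtein("kitten", "sitting"))  # classic example: 3
```

Computing this for all A-vs-B pairs gives a distance matrix you can visualise, e.g. as a heatmap or a histogram of distances.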
