Correlation statistics - excel

Naive Question:
In the attached snapshot, I am trying to understand how correlation behaves when it is applied to actual values versus a new stream of data calculated from those values.
In the example,
Columns A, B, C, D, E have very different correlations, but when I do a rolling sum on the same columns to get G, H, I, J, K, the correlations are very much the same (negative or positive).
Are these two different types of correlation, or am I missing something?
Thanks in advance!!

Yes, these are different correlations. It's like measuring the acceleration over time of 5 automobiles (your first piece of data) and correlating those accelerations. Each car accelerates at a different rate over time, leaving your correlations all over the place.
Your second set of data would be the velocity of each car at each point in time. Because each car is accelerating at a fairly constant rate (and doing so in one of two directions from the starting point), you get either a big positive or a big negative correlation.
A big positive or big negative correlation in the second set is not guaranteed, but since the data in each of your lists is consistently positive or negative and grows at a consistent rate, each list correlates strongly, positively or negatively, with the others.
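The effect described above can be reproduced with a small sketch. It assumes the "rolling sum" is a running (cumulative) total, and the three series (up_a, up_b, down) are made-up random data, not the columns from the snapshot. Independent series show little correlation, while their running totals correlate strongly because each one drifts steadily in one direction.

```python
import random
from itertools import accumulate
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200
# Hypothetical data: independent series that are consistently positive or negative.
up_a = [rng.uniform(0, 1) for _ in range(n)]
up_b = [rng.uniform(0, 1) for _ in range(n)]
down = [rng.uniform(-1, 0) for _ in range(n)]

# Correlation of the raw values (the "acceleration" in the analogy).
raw_ab = pearson(up_a, up_b)
raw_ad = pearson(up_a, down)

# Correlation of the running totals (the "velocity" in the analogy).
cum_a, cum_b, cum_d = (list(accumulate(s)) for s in (up_a, up_b, down))
cum_ab = pearson(cum_a, cum_b)
cum_ad = pearson(cum_a, cum_d)

print("raw:", raw_ab, raw_ad)
print("running totals:", cum_ab, cum_ad)
```

The running totals of the two positive series rise together, and the total of the negative series falls, so the second pair of correlations ends up close to +1 and -1 even though the raw values are unrelated.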

Related

Check whether two datasets are statistically different

I have two data sets (each with datapoints + standard deviation) and want to check whether they are statistically different. What kind of test would be appropriate?
Thank you!
The answer depends. If the blue and red samples are randomly obtained and come from the same group of items measured at different times, then the paired two-sample t-test applies. If they belong to different groups, the unpaired two-sample t-test is suitable. This decision assumes that both the blue and red samples are normally distributed or can be transformed to a normal distribution by means of a logarithmic transformation. Otherwise, you need to use the Mann-Whitney test (or, for paired samples, the Wilcoxon signed-rank test). The data values to be used are the Output percentages given the same Input value. Data values should be continuous, as in your case.
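To make the paired versus unpaired distinction concrete, here is a minimal sketch that computes only the t statistic and degrees of freedom for each test. The function names and the before/after numbers are made up for illustration, and the unpaired version assumes the classic pooled-variance form. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; SciPy's scipy.stats.ttest_rel, ttest_ind and mannwhitneyu cover the full tests.

```python
from math import sqrt
from statistics import mean, variance

def paired_t(x, y):
    """t statistic and degrees of freedom for a paired two-sample t-test."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / sqrt(variance(diffs) / n)
    return t, n - 1

def unpaired_t(x, y):
    """t statistic and degrees of freedom for an unpaired, pooled-variance t-test."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2

# Hypothetical measurements of the same five items at two different times.
before = [1, 2, 3, 4, 5]
after = [2, 4, 5, 4, 7]

t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(before, after)
print(t_paired, df_paired)
print(t_unpaired, df_unpaired)
```

On the same numbers the paired statistic is much larger in magnitude than the unpaired one, because pairing removes the item-to-item variation. That is why it matters whether the two samples really come from the same items.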

Is there a metric that can determine spatial and temporal proximity together?

Given a dataset which consists of geographic coordinates and the corresponding timestamps for each record, I want to know if there's any suitable measure that can determine the closeness between two points by taking the spatial and temporal distance into consideration.
The approaches I've tried so far include implementing a distance measure between the two coordinate values and calculating the time difference separately. But in this case, I'd need two threshold values, one each for the spatial and temporal distances, to determine their overall proximity.
I wanted to know if there's any single function that can take these values as input together and give a single measure of their closeness. Ultimately, I want to be able to use this measure to cluster similar records together.

Convert GMM-UBM scores to equivalent accuracy percent

I have constructed a GMM-UBM model for speaker recognition. The output of the models adapted for each speaker is a set of scores calculated as log-likelihood ratios. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me, please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentage of your samples will not be uniformly distributed.
Ideally, you need to take a large dataset and collect statistics on the acceptance/rejection rate you get at each score. Once you build a histogram, you can normalize the score difference by that histogram so that, for example, 30% of your subjects are accepted when you see a certain score difference. That normalization will allow you to create uniformly distributed probability percentages. See, for example, "How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes".
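One way to read this normalization is as an empirical percentile. The sketch below assumes you already have a large list of scores collected from a reference dataset; score_to_percent and the sample scores are hypothetical names and values used only for illustration.

```python
from bisect import bisect_right

def score_to_percent(score, sorted_reference_scores):
    """Percentage of reference scores that are less than or equal to `score`."""
    rank = bisect_right(sorted_reference_scores, score)
    return 100.0 * rank / len(sorted_reference_scores)

# Hypothetical log-likelihood-ratio scores from a reference dataset.
reference = sorted([-2.0, -1.0, 0.0, 1.0, 2.0])

percent = score_to_percent(0.5, reference)
print(percent)  # 3 of the 5 reference scores are <= 0.5, so this prints 60.0
```

Because the mapping follows the observed score distribution, the resulting percentages are spread roughly uniformly over 0 to 100 for scores drawn from the same population.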
This problem is rarely solved in speaker identification systems because confidence intervals are not what you actually want to display. You need a simple accept/reject decision, and for that you need to know the false reject and false accept rates. So it is enough to find just a threshold, not build the whole distribution.
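A minimal sketch of picking such a threshold, assuming you have labeled scores for genuine (target speaker) and impostor trials. The function names and sample scores are hypothetical, and balancing the two error rates is only one possible criterion; a real system might weight false accepts and false rejects differently.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false accept rate, false reject rate) when accepting scores >= threshold."""
    false_rejects = sum(score < threshold for score in genuine)
    false_accepts = sum(score >= threshold for score in impostor)
    return false_accepts / len(impostor), false_rejects / len(genuine)

def pick_threshold(genuine, impostor):
    """Pick the observed score at which the two error rates are closest to each other."""
    candidates = sorted(set(genuine) | set(impostor))

    def imbalance(threshold):
        far, frr = error_rates(genuine, impostor, threshold)
        return abs(far - frr)

    return min(candidates, key=imbalance)

# Hypothetical log-likelihood-ratio scores from labeled trials.
genuine_scores = [1.0, 2.0, 3.0, 4.0]
impostor_scores = [-2.0, -1.0, 0.0, 1.5]

threshold = pick_threshold(genuine_scores, impostor_scores)
far, frr = error_rates(genuine_scores, impostor_scores, threshold)
print(threshold, far, frr)
```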

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly distributed, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
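The interpolation route can be sketched in Python as follows (the MATLAB version would use interp1 and mean as noted above). This assumes linear interpolation, strictly increasing times within each curve, and a common grid that lies inside the time range of every curve; the helper names and the two sample curves are made up for illustration.

```python
from statistics import mean, stdev

def interp_linear(t_query, t, x):
    """Linearly interpolate the curve (t, x) at t_query; t must be strictly increasing."""
    for i in range(len(t) - 1):
        if t[i] <= t_query <= t[i + 1]:
            weight = (t_query - t[i]) / (t[i + 1] - t[i])
            return x[i] + weight * (x[i + 1] - x[i])
    raise ValueError("t_query lies outside the measured time range")

def average_curves(curves, grid):
    """Mean and sample standard deviation across curves at each time in grid."""
    means, stds = [], []
    for t_query in grid:
        values = [interp_linear(t_query, t, x) for t, x in curves]
        means.append(mean(values))
        stds.append(stdev(values))
    return means, stds

# Two hypothetical curves sampled at different times.
curves = [
    ([0, 1, 4], [0, 2, 8]),
    ([0, 2, 4], [2, 2, 6]),
]
grid = [0, 1, 2, 3, 4]  # common, evenly spaced time axis

means, stds = average_curves(curves, grid)
print(means)
print(stds)
```

Because every curve contributes exactly one value per grid point, dense sampling near t=0 no longer dominates the average, and two raw points that would have fallen in the same bin are handled by the interpolation itself.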
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);  % time between consecutive samples, in seconds
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % scalar time-weighted mean
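For comparison, here is the same time-weighted mean as a Python sketch with a small made-up example. Each value is weighted by the time until the next sample, so the last value gets no weight; that weighting choice, the function name and the sample data are assumptions for illustration.

```python
def time_weighted_mean(t, x):
    """Mean of x where each value is weighted by the time until the next sample."""
    time_steps = [later - earlier for earlier, later in zip(t, t[1:])]
    weighted_sum = sum(step * value for step, value in zip(time_steps, x))
    return weighted_sum / sum(time_steps)

# Hypothetical unevenly spaced measurements.
t = [0, 1, 3, 6]       # measurement times in seconds
x = [10, 20, 30, 40]   # measured values

weighted = time_weighted_mean(t, x)
plain = sum(x) / len(x)
print(weighted, plain)
```

Here the time steps are 1, 2 and 3 seconds, so the weighted mean is (10*1 + 20*2 + 30*3) / 6, which differs from the plain mean of 25.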
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

From one histogram, create a new histogram from just a mean or median?

Suppose I have a list of values that I can histogram and calculate descriptive statistics on, such as mean, median, max, standard deviation, etc. Perhaps this histogram is bimodal or right skewed. Let’s call this group of data “DataSet1”.
Suppose I had just a mean or median of another set of data. Let's call that DataSet2. I do not have all the raw data for DataSet2, just the median or mean. There is a strong belief that DataSet1 and DataSet2 would show the same variability in values.
If I knew just a single value, either the mean or the median, can I apply the descriptive statistics from DataSet1 to create a new histogram that mirrors the bimodal or right-skewed behavior of DataSet1?
Thanks
Dan
Alternative intent:
I have 3 years of historical data, and the data definitely has a "day of week" trend to it. I am using a Python API to apply seasonal ARIMA to forecast the next 7 days from the 3 years of historical data. The predicted value is great, but it is only 1 value. I would like to use that predicted value as the "mean" and create a histogram from the variability of values shown to exist historically by day of week.
So, today is Thursday. Let's say I predict tomorrow to have a value of 78.6.
I want to sample potential values for tomorrow based on a mean of 78.6 but with variability similar to that shown to exist on all historical Fridays.
If I look at historical Fridays, perhaps they show left-skewed behavior.
So when I sample with a mean of 78.6, if I sampled 100 times, the sampled values, if plotted in a histogram, would also skew to the left.
Hope that helps..
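One possible way to do what is described here is to resample the historical deviations from their own mean and add them to the forecast. This is only a sketch under the stated belief that the variability carries over unchanged; the function name and the historical Friday values are made up, and 78.6 is the forecast from the example above.

```python
import random
from statistics import mean

def sample_around_forecast(historical, forecast, n_samples, seed=0):
    """Resample historical deviations from their mean and re-centre them on forecast."""
    centre = mean(historical)
    deviations = [value - centre for value in historical]
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [forecast + rng.choice(deviations) for _ in range(n_samples)]

# Hypothetical historical Friday values with a long tail to the left.
historical_fridays = [60, 70, 72, 74, 74]

samples = sample_around_forecast(historical_fridays, forecast=78.6, n_samples=100)
print(min(samples), max(samples), mean(samples))
```

The samples keep the shape of the historical values (including any skew or bimodality) because they are the historical deviations themselves, only shifted. Their average will be close to the forecast rather than exactly equal to it.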
