Analyzing how noisy a data set is using Excel

I have a data set of over 15,000 records in Excel that comes from a measurement tool that looks at trends over a large area. I'm not interested in trends across the data as a whole, but rather in records close to each other, to get a sense of how noisy the data is (variation between neighboring records). It's almost as if I want the average standard deviation of the 15,000 or so records taken only 20 records at a time. The hope is that the values trend gradually rather than jumping suddenly from record to record, which would look noisy. If I add a chart and use the "Moving Average" trendline it visually shows, more or less, how noisy the data looks across the 15,000+ records. However, I was hoping to get a numeric value to rate how noisy this data is vs. other datasets. Any ideas on what I could do here with formulas built into Excel, or by adding some add-in? Let me know if I need to explain this any better.

Could you calculate your moving average for your 20-sample window, then use the difference between each point and the expected value to calculate a variance?
Hard to do tables here, but here is a sample of what I mean
Actual  Measured  Expected  Variance
5       5.44      4.49      0.91
6       4.34      5.84      2.26
7       8.45      7.07      1.90
8       6.18      7.84      2.75
9       8.89      9.10      0.04
10      11.98     10.01     3.89
The "measured" values were determined as
measured = actual + (rand() - 0.5) * 4
The "expected" values were calculated from a moving average (the table was pulled from the middle of the data set).
The variance column is simply the square of the difference between expected and measured.
Then you could calculate an average variance as a summary statistic.
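A rough sketch of this, assuming the measured values are in column B starting at B2 and you use a trailing 20-record window (the cell references are illustrative, not from the question):
C21: =AVERAGE(B2:B21)    (moving average of the previous 20 records; fill down)
D21: =(B21-C21)^2        (squared difference between measured and expected; fill down)
Then =AVERAGE(D21:D15021) gives a mean squared deviation, and =SQRT(AVERAGE(D21:D15021)) gives an RMS noise figure in the original units, which you can compare across datasets.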

Moving average is the correct approach, but you need a critical element: order. Do you have a date/time variable or a sequence number?
Use the OFFSET function to set up your window. If you want roughly 20 records, your formula will look something like =AVERAGE(OFFSET(C15,-10,0,21)), which averages a 21-row window centred on C15. This is your moving average.
Relate that to C15 itself, whether additively (a difference) or multiplicatively (a ratio), and you'll have your distance from the local trend. All you need then is your tolerance.
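To condense that into a single noise number, one sketch (again with illustrative cell references) is a rolling standard deviation next to each record, averaged into one figure:
D15: =STDEV(OFFSET(C15,-10,0,21))   (standard deviation of the 21-record window centred on C15; fill down over rows that have a full window above and below)
Summary: =AVERAGE(D15:D14990)       (one number you can compare against other datasets)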

Related

Fill empty values with forecast.linear in excel

I have a column with increasing numbers, and I want to use FORECAST.LINEAR to predict the missing values between the previous values and the next value (G2:G6 and G16).
However, when I run =FORECAST.LINEAR(F14,G2:G13,F2:F13) it outputs 1.60, which is not correct if you consider that it should be something greater than 1.62 and less than 1.89.
UPDATE:
I did this calculation and it seems ok
=IF(AND(G2=0;G3=0;G4<>0;G1<>0)=TRUE;ROUND((G4-G1)/3;2);FALSE)
The correct linear progression for the values your sample shows for 4/1/21 through 15/1/21 should be 1.60424. The final value, 1.62, happens to be a high outlier that is above the linear best fit for the values given. So the function is working correctly. It would not be uncommon for the first or last points to be above or below the linear progression.
The problem is that the function’s range of known Y values ends with 1.62, so the function you entered knows nothing of the 1.89 value.
When I set the problem up to skip a 13th and 14th x and y, but include a 15th value 1.89, I get 1.61 and 1.74 for values 13 and 14, so even when including the 1.89 value, the 13th value is still less than 1.62. It’s a significantly high variation from the linear.
I’m not sure what the best approach is, but this will likely not be an easy problem to solve this way. You end up with a circular reference if the Y value you are trying to forecast is within the formula's range of known Y values. The normal way of solving this is to have separate actual columns and forecast columns, and not mix the two.
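If the goal is simply to fill a gap between a known value above it and a known value below it, a plain linear interpolation in a separate column avoids the circular reference entirely. A hypothetical sketch (the exact cells depend on your layout), with the last known value in G13 and the next known value in G16:
H14: =$G$13 + (ROW()-ROW($G$13)) * ($G$16-$G$13) / (ROW($G$16)-ROW($G$13))
Filled down to the row above the next known value, this steps evenly between the two known values and keeps actual and forecast figures in separate columns, as suggested above.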

Can t-test be calculated on large samples with non-normal distribution?

For example, the number of users in group A is 100K, the number of users in group B is 100K. I want to test whether the average session duration of these two groups is statistically significant.
1st method) We calculated the average session duration of these users on the day after the AB test (DAY1) as
31.2 min for group A
30.2 min for group B.
We know that users in groups A and B have a non-normal distribution of DAY1 session values.
In such a case, would it be correct to use a two-sample t-test to test the DAY1 avg session durations of the two groups? (We will accept n=100K.)
(Some sources say that calculating t-scores for large samples will give accurate results even with non-normal distribution.)
2nd method) Would it be correct to calculate the t-score over the daily average session durations across the days the AB test is open?
E.g., in the scenario below, the daily average session durations of the 100K users in groups A and B are calculated. We will take the number of days here as the number of observations, giving n=30.
We will then run the two-sample t-test over n=30.
Group  day0 avg duration  day1 avg duration  day2 avg duration  ...  day30 avg duration
A      30.2               31.2               32.4               ...  33.2
B      29.1               30.2               30.4               ...  30.1
Do these methods give correct results or is it necessary to apply another method in such scenarios?
Would it make sense to calculate a t-test on large samples in an AB test?
The t-test assumes that the means of different samples taken from a population are normally distributed. It doesn't assume that the population itself is normally distributed.
For a population with finite variance, the central limit theorem suggests that the means of samples from the population are normally distributed. However, the sample size needed for the distribution of means to be approximately normal depends on the degree of non-normality of the population. The t-test is invalid for small samples from non-normal population distributions, but is valid for large samples from non-normal distributions.
Method 1 works because of this reason (large sample size ~100K) and you are correct that calculating t-scores for large samples will give accurate results even with non-normal distribution. [You may also consider using a z-test for the sample sizes you're working with (100K). T-tests are more appropriate for smaller sample sizes, such as n < 30]
Method 2 works because the daily averages should be normally distributed given enough samples per the central limit theorem. Time-spent datasets may be skewed but generally work well.
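For method 1, the test itself is a one-liner in Excel. A sketch, assuming group A's DAY1 session durations sit in A2:A100001 and group B's in B2:B100001:
=T.TEST(A2:A100001, B2:B100001, 2, 3)
This returns the two-tailed p-value for a two-sample test with unequal variances; at n = 100K per group it is effectively indistinguishable from a z-test.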

Formula to determine if growth is increasing/decreasing, smooth, lumpy, etc

I have some results data in around 10 columns (sample in CSV below), and I would like to run a formula or formulas per row to determine whether the trend is:
reasonably steady/predictable, or lumpy (no straight lines expected)
generally trending upwards or downwards over the period
changing trend towards the end (most recent 3 months) or continuing its trend
As the sample shows below, some rows do not have all the data, but still need a determination of the general trend and consistency of the results.
A graph would display these easily enough, but I have thousands of rows to compare, so that is not efficient or feasible.
I've tried a few formulas such as TREND, GROWTH, STDEV, and AVEDEV, but I suspect I might have to use them in combination, which is currently beyond me. I feel like using the percentage difference between neighboring cells will help standardise the results to a degree, rather than using the raw value of each cell. If the percentages are all positive then the trend is upward, but that's the best I've been able to come up with that I'm confident gives a clear answer (a rough sketch of this follows the sample data below).
I'm using google sheets, but might be able to convert an Excel formula.
Any suggestions?
May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Jan,Feb
11.65,11.79,11.96,12.26,12.71,12.6,12.71,13.6,14.1,14.7
0.57,0.65,0.33,0.89,1.03,0.74,1.35,0.81,2.13,2.15
1.85,1.88,1.84,1.92,2.07,2.24,2.56,2.74,2.85,2.92
,,,0.66,0.72,0.78,1.33,1.43,1.47,1.52
,,,,0.64,0.6,0.56,0.55,0.3,
,8.97,8.54,10.46,11.44,8.06,7.42,7.86,7.66,7.1
2.67,1.53,1.84,2.43,2.94,3.43,4.04,7.46,6.25,9.09
Row 1 - Smooth growth
Row 2 - Lumpy growth
Row 3 - Growth
Row 4 - Growth
Row 5 - Declining
Row 6 - Lumpy. Past 3m decline.
Row 7 - Growth. Past 3m lumpy
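A sketch of the percentage-difference idea combined with a couple of built-ins, assuming one row of monthly values in B2:K2 (the formulas work in both Sheets and Excel; rows with missing months may need the ranges and x-constants trimmed to the populated cells):
Month-on-month change:  =C2/B2-1                                (fill across; all positive suggests consistent growth)
Overall direction:      =SLOPE(B2:K2, {1,2,3,4,5,6,7,8,9,10})   (positive means growth, negative means decline)
Smoothness:             =RSQ(B2:K2, {1,2,3,4,5,6,7,8,9,10})     (close to 1 means a steady trend, low means lumpy)
Recent change:          =SLOPE(I2:K2, {1,2,3})                  (compare the last 3 months' slope against the overall slope)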

Making a big data set smaller in Excel

I made a little test machine that accidentally created a 'big' data set:
6 columns with roughly 550,000 rows.
The end result I am looking for is a graph with 6 lines: the horizontal axis showing measurements 1 - 550,000, and vertically the values in the rows (capped at 200 or so). The data is a resistance measurement that should be between 0 and 30, or very big (broken); the software writes 'inf' in those cases.
My skill is limited to excel, so what have I done until now:
Imported it into Excel. The measurements are useful between 0 and 30, and 'inf' is not good for a graph, so I capped them with something like =IF(H1>200,200,H1) (if the value is above 200, use 200; otherwise keep the cell value).
Now making a graph is a slow exercise, Excel does not like this, and the result is not good.
So I would like to take the average of every 60 measurements to reduce the rows to below 10,000, e.g. =AVERAGE(H1:H60).
But I cannot get this to work.
Questions:
How do I reduce this data set and get a good graph?
Should I switch to other software that is more applicable?
FYI: I already changed the software of the testing device to take the average value of a bunch of measurements the next time... But I cannot repeat this test.
Download link of data set comma separated file 17MB
I think you are on the right track; however, my guess is that you only want to get an average once every 60 rows and are unsure how to do this.
Using MOD(number, divisor) inside an IF statement will let you specify that the average should be calculated only once in every x number of cells.
Assuming you'll have one row above your data table for headers, you are looking for something along the lines of:
=IF(MOD(ROW(A61),60) = 1,AVERAGE(H2:H61),"")
Once you have this you can filter your average column to non-blank values and use this to create your graph.
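Another sketch that avoids the filtering step, assuming the raw values are in H2:H550001: put this in an empty column, say J2, and fill it down roughly 9,167 rows, so each row holds the average of the next block of 60 measurements:
=AVERAGE(OFFSET($H$2, (ROW()-2)*60, 0, 60))
Charting column J then plots about 9,167 points instead of 550,000, which Excel handles comfortably.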

How to generate random numbers within a normal distribution using Excel

I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
What is the best way to go about achieving this result using an Excel formula? Alternatively, what is this kind of statistical calculation called, or are there any other pointers I can Google around for?
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (each column for a new month). Each column has 55,000 rows. While the meter readings need to vary for each month, when taken as a time series each meter number should have 7 realistic readings.
The aim is to produce realistic data to turn into heat maps (i.e. flag outlying meter readings)
I don't think there is a formula which would fit your requirements exactly. I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of the generated data by modifying the parameters, for example: =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after the decimal point.
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is the mean and 5 is the standard deviation.
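If you want to hit the exact percentages from the question in a single column instead, one sketch is inverse-transform sampling on a piecewise-uniform distribution: put a plain =RAND() in a helper cell (say A2) and map it using the breakpoints from the question:
B2: =IF(A2<0.8, A2*0.25, IF(A2<0.9, 0.2+(A2-0.8), IF(A2<0.95, 0.3+(A2-0.9)*4, 0.5+(A2-0.95)*10)))
This sends 80% of draws to 0-0.2, the next 10% to 0.2-0.3, the next 5% to 0.3-0.5, and the remaining 5% to 0.5-1.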
