Can a t-test be calculated on large samples with a non-normal distribution?

For example, group A has 100K users and group B has 100K users. I want to test whether the difference in average session duration between these two groups is statistically significant.
1st method) We calculated the average session duration of these users on the day after the AB test started (DAY1) as
31.2 min for group A
30.2 min for group B.
We know that the DAY1 session durations of users in groups A and B are not normally distributed.
In such a case, would it be correct to use a two-sample t-test to compare the DAY1 average session durations of the two groups? (We take n = 100K.)
(Some sources say that t-tests on large samples give accurate results even with a non-normal distribution.)
2nd method) Would it be correct to calculate the t-score over the daily average session durations for the days the AB test is running?
E.g., in the scenario below, the daily average session duration of the 100K users in groups A and B is calculated for each day. We treat the number of days as the number of observations, so n = 30, and run the two-sample t-test on n = 30.
Group | day0 avg duration | day1 avg duration | day2 avg duration | ... | day30 avg duration
A     | 30.2              | 31.2              | 32.4              | ... | 33.2
B     | 29.1              | 30.2              | 30.4              | ... | 30.1
Do these methods give correct results, or is it necessary to apply another method in such scenarios?
Does it make sense to run a t-test on large samples in an AB test?

The t-test assumes that the means of different samples taken from a population are normally distributed. It doesn't assume that the population itself is normally distributed.
For a population with finite variance, the central limit theorem implies that the means of samples from the population are approximately normally distributed. However, the sample size needed for that approximation to be good depends on the degree of non-normality of the population. The t-test is unreliable for small samples from non-normal population distributions, but is valid for large samples from non-normal distributions.
Method 1 works for this reason (a large sample size of ~100K), and you are correct that t-tests on large samples give accurate results even with a non-normal distribution. [You may also consider using a z-test at the sample sizes you're working with (100K); the t-test mainly matters for small samples, such as n < 30, where the t and normal distributions differ noticeably.]
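As a rough sketch of Method 1 (simulated data, not the poster's numbers): Welch's two-sample t-test in Python/SciPy on 100K heavily skewed (exponential) session durations per group, plus the essentially equivalent z-statistic at this sample size.

```python
# Illustrative only: synthetic, skewed "session durations" with assumed means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000
group_a = rng.exponential(scale=31.2, size=n)  # hypothetical group A durations (min)
group_b = rng.exponential(scale=30.2, size=n)  # hypothetical group B durations (min)

# Welch's two-sample t-test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")

# At n = 100K, a z-test on the difference of means gives essentially the same result.
se = np.sqrt(group_a.var(ddof=1) / n + group_b.var(ddof=1) / n)
print(f"z = {(group_a.mean() - group_b.mean()) / se:.2f}")
```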
Method 2 also works because each daily average is itself a mean over many users, so by the central limit theorem the daily averages should be approximately normally distributed. Time-spent data are typically skewed, but their daily averages generally behave well.
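And a minimal sketch of Method 2 under made-up numbers: one daily average per day for 30 days per group (each daily average already being a mean over ~100K users), then a two-sample t-test with n = 30 per group.

```python
# Illustrative only: assumed daily-average values, not real AB-test output.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
days = 30
daily_avg_a = rng.normal(loc=31.2, scale=0.8, size=days)  # assumed day-level spread
daily_avg_b = rng.normal(loc=30.2, scale=0.8, size=days)

t_stat, p_value = stats.ttest_ind(daily_avg_a, daily_avg_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```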

Related

Estimate rolling percentile of a population from windowed sample sets

Calculating a percentile (95th, 99th) over my data set is expensive due to the large number of time series and time ranges spanning weeks to months. The cost incurred is proportional to the number of samples fetched from the data store plus the computational overhead of processing the calculations. I am attempting to optimize the solution by calculating the statistics for smaller time ranges in parts, in-stream as data points are ingested, and then estimating the metrics for the population from those samples. This approach works accurately for the mean and the peak (max) but requires a good approximation for percentiles.
population_mean = mean(sample_mean_t0, sample_mean_t1, ... ,sample_mean_tn)
population_max = max(sample_max_t0, sample_max_t1 , ... , sample_max_tn)
To calculate p95, I am taking the 95th percentile over the 95th percentiles of all samples. Is this a reasonable approximation of the population's 95th percentile? (We are not attempting to solve the problem when there is a high degree of skewness.) Is there a better approximation I can use for calculating percentiles?
population_p95 = p95(sample_p95_t0, sample_p95_t1, ... , sample_p95_tn)
Does taking the average over the sample p95s make more sense? Any reference for approximating this and estimating the error would be helpful.
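For what it's worth, a small simulation along these lines (synthetic lognormal data split into equal-sized, identically distributed windows, which is an assumption the real time series may not satisfy) can show how the two approximations behave against the exact p95:

```python
import numpy as np

rng = np.random.default_rng(1)
# 50 equal-sized windows of latency-like (lognormal) values.
windows = [rng.lognormal(mean=3.0, sigma=0.5, size=10_000) for _ in range(50)]
all_points = np.concatenate(windows)

exact_p95 = np.percentile(all_points, 95)
window_p95s = [np.percentile(w, 95) for w in windows]

print("exact p95:          ", round(exact_p95, 2))
print("p95 of window p95s: ", round(np.percentile(window_p95s, 95), 2))
print("mean of window p95s:", round(float(np.mean(window_p95s)), 2))
```

Under these assumptions the mean of the window p95s tends to land near the exact value, while the p95-of-p95s sits toward the upper end of the estimator's sampling distribution and so overestimates; if windows differ in size or distribution, neither stays reliable, which is why streaming quantile sketches (e.g. t-digest) are often used for this kind of problem.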

Reading response time percentile in Designing Data-Intensive Applications Book

In the book Designing Data-Intensive Applications, there is this sentence:
For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
The confusing part is the claim that 95 of these requests will take less than 1.5 seconds. Isn't it supposed to be that 95 of the requests take 1.5 seconds or less, and the remaining 5 take more than 1.5 seconds? Or does the one percent at the 95th percentile take exactly 1.5 seconds, with the 94th percentile and below taking less than 1.5 and the 96th and above taking more than 1.5? What is the correct reading of these numbers?
I have done some research on this and found several articles. The interesting part is that some say what I say and some don't.
Some of the links read the percentile as "95 of the requests take 1.5 seconds or less":
average 90th percentile response time and average response time
90% percentile is a statistical measurement, in case of JMeter it means that 90% of the sampler response times were smaller than or equal to this time
https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
so 90 percent of the requests are processed in 3.0 seconds or less
https://www.adfpm.com/adf-performance-monitor-monitoring-with-percentiles
If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower.
Other links read the percentile as "95 of the requests take less than 1.5 seconds":
https://www.elastic.co/blog/averages-can-dangerous-use-percentile
In contrast, the 99th percentile says “99% of your values are less than 850ms”, which is a very different picture.
I got the answer from this website, and according to them, both readings are true. It just depends on how the percentile rank is calculated:
The word “percentile” is used informally in the above definition. In common use, the percentile usually indicates that a certain percentage falls below that percentile. For example, if you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called the percentile rank. In statistics, it can get a little more complicated as there are actually three definitions of “percentile.” Here are the first two (see below for definition 3), based on an arbitrary “25th percentile”:
Definition 1: The nth percentile is the lowest score that is greater than a certain percentage ("n") of the scores. In this example, n is 25, so we're looking for the lowest score that is greater than 25% of the scores.
Definition 2: The nth percentile is the smallest score that is greater than or equal to a certain percentage of the scores. To rephrase this, it's the percentage of data that falls at or below a certain observation. This is the definition used in AP statistics. In this example, the 25th percentile is the score that's greater than or equal to 25% of the scores.
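A tiny numeric illustration (made-up response times) of why the distinction only bites when several observations tie exactly at the percentile value; for continuous data with no ties, "less than" and "less than or equal" describe essentially the same share of requests:

```python
import numpy as np

# Hypothetical response times in seconds; three requests tie at 1.5 s.
times = np.array([0.2, 0.4, 0.6, 0.9, 1.1, 1.2, 1.3, 1.5, 1.5, 1.5])

p80 = np.percentile(times, 80)                          # -> 1.5
print("80th percentile:      ", p80)
print("share strictly below: ", np.mean(times < p80))   # 0.7
print("share at or below:    ", np.mean(times <= p80))  # 1.0
```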

What is the minimum number of observations required per group for a One-Way ANOVA?

Not really a code-related question per se, but more statistical. I plan to conduct a one-way ANOVA comparing mean catch rates (from fishing) across 3 locations, where only 4 surveys were conducted at each location to calculate the mean catch rate (North: n=4, East: n=4, West: n=4).
With n=4, is it appropriate to conduct a one-way ANOVA, or is that number of observations too small?
The minimum total sample size is n = k + 1 = 4, i.e., at least one more observation than the number of groups, so that one degree of freedom is left for estimating the error variance. With 4 observations in each of the 3 groups (12 total) the ANOVA can be run, although its power will be low.
You can read about it here
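A minimal sketch with made-up catch rates, just to show that the test runs with k = 3 groups and n = 4 per group (error degrees of freedom = 12 - 3 = 9):

```python
from scipy import stats

# Hypothetical catch rates from 4 surveys at each location.
north = [2.1, 3.4, 2.8, 3.0]
east = [4.2, 3.9, 4.5, 4.1]
west = [2.9, 3.1, 3.3, 2.7]

f_stat, p_value = stats.f_oneway(north, east, west)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```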

How to generate random numbers within a normal distribution using Excel

I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
What is the best way to go about achieving this result using an Excel formula? Alternatively, what is this kind of statistical calculation called, or are there any other pointers I can Google for?
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (each column for a new month). Each column has 55,000 rows. While the meter readings need to vary for each month, when taken as a time series, each meter number should have 7 realistic readings.
The aim is to produce realistic data to turn into heat maps (i.e. flag outlying meter readings)
I don't think there is a formula that fits your requirements exactly, so I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of generated data by modifying the parameters, for example: =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after decimal point.
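If it helps, here is a rough Python equivalent of that piecewise idea (assuming the remaining 5% of values fall in the 0.5 to 1 range): each value is drawn from a uniform sub-range chosen with the desired probability.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 55_000  # rows per column in the use case

segments = np.array([(0.0, 0.2), (0.2, 0.3), (0.3, 0.5), (0.5, 1.0)])
weights = [0.80, 0.10, 0.05, 0.05]  # probabilities per sub-range, sum to 1

idx = rng.choice(len(segments), size=n, p=weights)
values = rng.uniform(segments[idx, 0], segments[idx, 1])
# ~80% of values in [0, 0.2), ~90% in [0, 0.3), ~95% in [0, 0.5)
```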
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is the mean and 5 is the standard deviation.
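The Python analogue of that formula, if useful, would be to push uniform draws through the inverse normal CDF (SciPy's norm.ppf plays the role of NORMINV):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
# Same idea as =NORMINV(RAND(), 20, 5): mean 20, standard deviation 5.
readings = norm.ppf(rng.random(55_000), loc=20, scale=5)
# Equivalent shortcut: rng.normal(loc=20, scale=5, size=55_000)
```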

Analyzing how noisy a data set is using Excel

I have a set of data with over 15,000 records in Excel, from a measurement tool that finds trends over a large area. I'm not interested in trends within the data as a whole, but rather in the data points closest to each other, to get a sense of how noisy the data is (the variation between neighboring records). Essentially, I want something like the average standard deviation of the 15,000 or so records taken only 20 records at a time. The hope is that the data values trend gradually rather than changing suddenly from record to record, which would look noisy. If I add a chart and use the "Moving Average" trendline, it visually shows how noisy the data looks across the 15,000+ records. However, I was hoping to get a numeric value to rate how noisy this data is versus other datasets. Any ideas on what I could do here with formulas built into Excel or by adding some add-in? Let me know if I need to explain this any better.
Could you calculate your moving average for your 20 sample window, then use the difference between each point and the expected value to calculate a variance?
Hard to do tables here, but here is a sample of what I mean
Actual | Measured | Expected | Variance
5      | 5.44     | 4.49     | 0.91
6      | 4.34     | 5.84     | 2.26
7      | 8.45     | 7.07     | 1.90
8      | 6.18     | 7.84     | 2.75
9      | 8.89     | 9.10     | 0.04
10     | 11.98    | 10.01    | 3.89
The "measured" values were determined as
measured = actual + (rand() - 0.5) * 4
The "expected" values were calculated from a moving average (the table was pulled from the middle of the data set).
The variance is simply the square of (expected minus measured).
Then you could calculate an average variance as a summary statistic.
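As a sketch of that idea outside Excel (synthetic data standing in for the 15,000 records, with assumed trend and noise levels), pandas can compute the 20-record moving average, the squared residuals, and their mean as a single noisiness score:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
trend = np.linspace(0, 50, 15_000)                      # gradual underlying trend
measured = trend + rng.normal(scale=2.0, size=15_000)   # record-to-record noise

s = pd.Series(measured)
expected = s.rolling(window=20, center=True).mean()     # local "expected" value
residual_var = (s - expected) ** 2
print("average variance vs. 20-record moving average:", residual_var.mean())
```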
A moving average is the correct approach, but you need a critical element: order. Do you have a date/time variable or a sequence number?
Use the OFFSET function to set up your window. If you want a window of about 20, your formula will look something like AVERAGE(OFFSET(C15,-10,0,21)). This is your moving average.
Relate that to C15 (whether as a difference or a ratio) and you'll have your distance from the trend. All you need now is your tolerance.
