Reading the response-time percentile in the book Designing Data-Intensive Applications

In the book Designing Data-Intensive Applications, there is this sentence:
For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
The confusing part is the claim that 95 of these requests take less than 1.5 seconds. Shouldn't it be that 95 of the requests take 1.5 seconds or less, and the remaining 5 take more than 1.5 seconds? Or that the one percent at the 95th percentile takes exactly 1.5 seconds, the 94th percentile and below take less than 1.5, and the 96th percentile and above take more than 1.5? What is the correct reading of these numbers?
I have done some research on this and found several articles. Interestingly, some read it the way I do and some don't.
Some links read the percentile as "95 of the requests take 1.5 seconds or less":
average 90th percentile response time and average response time
90% percentile is a statistical measurement, in case of JMeter it means that 90% of the sampler response times were smaller than or equal to this time
https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
so 90 percent of the requests are processed in 3.0 seconds or less
https://www.adfpm.com/adf-performance-monitor-monitoring-with-percentiles
If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower.
Other links read the percentile as "95 of the requests take less than 1.5 seconds":
https://www.elastic.co/blog/averages-can-dangerous-use-percentile
In contrast, the 99th percentile says “99% of your values are less than 850ms”, which is a very different picture.

I got the answer from this website, and according to it, both readings are true; it just depends on how the percentile rank is calculated:
The word “percentile” is used informally in the above definition. In common use, the percentile usually indicates that a certain percentage falls below that percentile. For example, if you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called the percentile rank. In statistics, it can get a little more complicated as there are actually three definitions of “percentile.” Here are the first two (see below for definition 3), based on an arbitrary “25th percentile”:
Definition 1: The nth percentile is the lowest score that is greater than a certain percentage ("n") of the scores. In this example, n is 25, so we're looking for the lowest score that is greater than 25% of the scores.
Definition 2: The nth percentile is the smallest score that is greater than or equal to a certain percentage of the scores. To rephrase, it's the percentage of data that falls at or below a certain observation. This is the definition used in AP Statistics. In this example, the 25th percentile is the score that's greater than or equal to 25% of the scores.
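To see how little the two readings differ in practice, here is a small illustrative sketch (hypothetical lognormal response times, NumPy's default percentile interpolation; none of these numbers come from the book):

```python
import numpy as np

# Hypothetical sample of 1,000 response times in seconds (illustrative only).
rng = np.random.default_rng(0)
times = rng.lognormal(mean=-0.5, sigma=0.6, size=1000)

p95 = np.percentile(times, 95)  # NumPy's default linear-interpolation method

strictly_below = np.mean(times < p95)   # the "less than" reading
at_or_below = np.mean(times <= p95)     # the "less than or equal" reading

print(f"p95 = {p95:.3f}s")
print(f"strictly below: {strictly_below:.1%}, at or below: {at_or_below:.1%}")
```

With continuous measurements both fractions come out at about 95%; the distinction only matters when many requests share exactly the percentile value (ties) or when a discrete definition of the percentile is used.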

Related

Estimate rolling percentile of a population from windowed sample sets

Calculating a percentile (95th, 99th) over my data set is expensive due to the large number of time series and time ranges spanning weeks to months. The cost is proportional to the number of samples fetched from the data store plus the computational overhead of processing the calculations. I am trying to optimize by computing the statistics for smaller time ranges in parts, in-stream as data points are ingested, and then estimating the population metrics from those samples. This approach is exact for the mean and peak (max) but requires a good approximation for percentiles.
population_mean = mean(sample_mean_t0, sample_mean_t1, ... ,sample_mean_tn)
population_max = max(sample_max_t0, sample_max_t1 , ... , sample_max_tn)
To calculate p95, I am taking the 95th percentile over the 95th percentiles of all samples. Is this a reasonable approximation of the population's 95th percentile? (We are not attempting to solve the problem when there is a high degree of skewness.) Is there a better approximation I can use?
population_p95 = p95(sample_p95_t0, sample_p95_t1, ... , sample_p95_tn)
Does taking the average over the sample p95s make more sense? Any reference for approximating this and estimating the error would be helpful.
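For intuition, here is a hedged simulation sketch (synthetic lognormal latencies, equal-size i.i.d. windows, not real data) comparing the two combinations against the true population p95. Taking the p95 of the window p95s picks the upper tail of the estimation noise and tends to overestimate, while the average of the window p95s is usually closer when the windows are similar in size and distribution; for a principled mergeable estimate, quantile sketches such as t-digest are the standard tool:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic setup: 50 windows of 1,000 latency samples each, i.i.d. lognormal.
windows = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 1000))

true_p95 = np.percentile(windows, 95)             # p95 over all points pooled
window_p95s = np.percentile(windows, 95, axis=1)  # p95 within each window

print(f"true p95            : {true_p95:.3f}")
print(f"p95 of window p95s  : {np.percentile(window_p95s, 95):.3f}  (biased high)")
print(f"mean of window p95s : {window_p95s.mean():.3f}")
```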

Can t-test be calculated on large samples with non-normal distribution?

For example, the number of users in group A is 100K and the number of users in group B is 100K. I want to test whether the difference in average session duration between these two groups is statistically significant.
1st method) We calculated the average session duration of these users on the day after the A/B test (DAY1) as
31.2 min for group A
30.2 min for group B.
We know that users in groups A and B have a non-normal distribution of DAY1 session values.
In such a case, would it be correct to use a two-sample t-test to compare the DAY1 average session durations of the two groups? (We will accept n=100K.)
(Some sources say that calculating t-scores on large samples gives accurate results even with a non-normal distribution.)
2nd method) Would it be correct to calculate the t-score over the daily average session durations for the days the A/B test is open?
E.g., in the scenario below, the daily average session duration of the 100K users in groups A and B is calculated. We treat the number of days as the number of observations, giving n=30, and run the two-sample t-test over n=30.
Group   day0 avg duration   day1 avg duration   day2 avg duration   ...   day30 avg duration
A       30.2                31.2                32.4                ...   33.2
B       29.1                30.2                30.4                ...   30.1
Do these methods give correct results or is it necessary to apply another method in such scenarios?
Would it make sense to calculate a t-test on large samples in an A/B test?
The t-test assumes that the means of different samples taken from a population are normally distributed. It doesn't assume that the population itself is normally distributed.
For a population with finite variance, the central limit theorem suggests that the means of samples from the population are normally distributed. However, the sample size needed for the distribution of means to be approximately normal depends on the degree of non-normalness of the population. The t-test is invalid for small samples from non-normal population distributions, but is valid for large samples from non-normal distributions.
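As a quick illustration of that convergence (a synthetic sketch, not the session data): draw repeated samples from a heavily skewed exponential population and watch the skewness of the sample means shrink as n grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw 10,000 samples of size n from a skewed (exponential) population
# and measure how skewed the distribution of the sample means is.
for n in (5, 50, 5000):
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:5d}  skewness of sample means: {stats.skew(means):+.3f}")
```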
Method 1 works for this reason (large sample size, ~100K), and you are correct that calculating t-scores on large samples gives accurate results even with a non-normal distribution. [You may also consider using a z-test for the sample sizes you're working with (100K); t-tests are more appropriate for smaller sample sizes, such as n < 30.]
Method 2 works because the daily averages should be normally distributed given enough samples, per the central limit theorem. Time-spent datasets may be skewed, but this approach generally works well.
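A minimal sketch of Method 1 in SciPy, using made-up lognormal durations in place of the real session data; Welch's variant drops the equal-variance assumption. Method 2 would be the same call applied to the two columns of 30 daily averages:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical skewed session durations in minutes, n = 100K per group.
group_a = rng.lognormal(mean=3.40, sigma=0.5, size=100_000)
group_b = rng.lognormal(mean=3.37, sigma=0.5, size=100_000)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```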

Estimating percentiles in a skewed distribution (doesn't need to be exact)

This may be more of a statistics question, and I'd like to find a solution with Excel. I'd rather use simple VBA if any coding is necessary.
Is there a way to estimate the percentile of a specific data point in a skewed distribution? I don't need exact percentiles, only a reasonable estimate. I work on analyses that rely on weighted-average benchmarks reported by multiple sources. All of my sources report the 25th, 50th, 75th, and 90th percentiles as well as the mean and standard deviation. We use these benchmarks to set a target range, and our goal is for our result from a specific analysis to land somewhere within the published percentiles.
I'm often asked to indicate what percentile our specific result falls at, and all I can provide is broad ranges like 25th-50th. So I'm then asked to use simple extrapolation to pin down the specific percentile, and I know that method is inaccurate.
The mean and median differ in 99% of cases in my data set, but the % difference between mean and median is on average only 6%. Only about 10% of cases have a mean and median that differ by more than 10%.
For the 90% of cases with a relatively low % difference between mean and median, can I assume a normal distribution?
For cases with a higher % difference, can I make an assumption that will help me estimate more accurately? For these cases I could just use the normal distribution and send my percentile estimate with a note that it is likely off in one direction or the other, but I'd rather give a better estimate.
Responding to cybernetic.nomad:
First, thanks for commenting! Second, it doesn't seem to work; I think I don't have enough data. The attached image shows an example. The first 5 rows show one set of my weighted-average benchmarks for a single case. Below that, I added two lines: one with my "target" amount. This could be any number but, to test the suggested formula, I entered my 50th-percentile weighted average. The row below that has the result of =percentrank.exc(25th:90th,target). The result should be 0.5 but it isn't, so I don't think this works.
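That result is expected: PERCENTRANK.EXC ranks your target within the four benchmark cells as if they were the raw data set (rank/(n+1), so the 2nd of 4 values gives 0.4), whereas what you want is interpolation along the percentile levels themselves. Here is a minimal Python sketch of that idea with made-up benchmark values; the same piecewise-linear lookup can be rebuilt with Excel formulas or a few lines of VBA:

```python
import numpy as np

# Made-up benchmark: values at the 25th, 50th, 75th and 90th percentiles.
levels = np.array([25.0, 50.0, 75.0, 90.0])
quantiles = np.array([120.0, 150.0, 195.0, 260.0])  # must be increasing

def estimate_percentile(value):
    """Estimate the percentile of `value` by linear interpolation.

    Only defined between the lowest and highest published percentiles;
    np.interp clamps values outside that range to 25 or 90.
    """
    return float(np.interp(value, quantiles, levels))

print(estimate_percentile(150.0))  # 50.0 by construction
print(estimate_percentile(170.0))  # lands between 50 and 75
```

For the skewed cases, fitting a lognormal from the published mean and standard deviation is one option, but the straight-line lookup above is usually a better default than assuming normality outright.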

Excel Solver is messing up my optimization

I have set up an optimization problem, but I must be doing something wrong and could use your help. I have three firms: Alpha, Bravo, Charlie. They each complete three tasks: milling, inspecting, drilling, and each requires a different number of minutes per task. Alpha requires 12 minutes to mill, 5 minutes to inspect, and 10 minutes to drill. Bravo requires 10 minutes to mill, 4 to inspect, and 8 to drill. Charlie requires 8 to mill, 4 to inspect, and 16 to drill. After a firm completes all of these tasks, it earns a certain profit: Alpha earns $2.40, Bravo $2.50, and Charlie $3.00. All three firms have a maximum allotted time of 1200 minutes to mill, 900 to inspect, and 1440 to drill. The goal is to maximize the profit of the three firms.
I have set it up so that the sums of the tasks are subtracted from the available time as Solver changes the cells, and I have set constraints in Solver to cap each task at its allotted time. I must be missing a vital step, however, because it keeps maxing out the allotted time for an individual firm without taking into account the opportunity cost of the other firms. Please help! (shown in the screenshots below)
[Screenshots: Data, Solver setup, after executing Solver]
I have changed the logic a bit in order to take the minimum unit into consideration:
The UNITS portion holds the variable cells. Since the final produced unit count is the minimum of these cells, the formula in E9 is =MIN(B9:D9), copied down.
The TIME portion is the multiplication of Unit Times and Units, so the formula in B14 is =B9*B2, copied down and right.
I9:I11 are the earnings, calculated by multiplying the unit earning by the minimum units.
I12 is our total earning and is our objective cell.
Please also be careful about the constraints: when you do not set an integer constraint, finding a solution becomes more difficult, and of course our units should be integers in any case.
Also, fill cells B9:D11 with some starting values such as 100; otherwise the iteration does not start correctly and Solver ends up with a very small objective cell.
I have just had a go at this and I get a different answer, as I have made the assumption that to achieve the profit a firm must complete a milling process, then inspect, then drill, and once all are complete that counts as one unit for the profit. I hope that is valid.
But if not, this layout may help you anyway. Note that I have set this up as a linear model for Solver, and also note the use of the integer and non-negative options.
It was fun anyway!
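For comparison, here is a hedged sketch of the same model as a linear program in Python with scipy.optimize.linprog, assuming the time budgets are shared across the three firms (if each firm has its own budget, the model decouples into three independent problems). It is often easier to sanity-check a Solver layout against a small script like this:

```python
import numpy as np
from scipy.optimize import linprog

# Units produced by Alpha, Bravo, Charlie (data from the question).
profit = np.array([2.40, 2.50, 3.00])

# Minutes per unit; rows: milling, inspecting, drilling.
minutes = np.array([[12, 10,  8],
                    [ 5,  4,  4],
                    [10,  8, 16]])
capacity = np.array([1200, 900, 1440])  # assumed shared across firms

# linprog minimizes, so negate the profits to maximize total profit.
res = linprog(-profit, A_ub=minutes, b_ub=capacity,
              bounds=[(0, None)] * 3,
              integrality=1,      # whole units only (needs SciPy >= 1.9)
              method="highs")

print("units per firm:", res.x, " total profit:", -res.fun)
```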

Handling Negative Values when Calculating Average & Value

I am working on a tool for Fantasy Football that calculates the average value a player offers per million pounds of cost. It essentially boils down to their average points per game divided by their cost.
So for example, a player who costs £10m and scores an average of 5 points per game offers 0.5 points per game per million. Whereas a player who costs £8m and scores an average of 5 points per game offers 0.625 points per game per million. Clearly the player who costs £8m is better value.
My problem is that players can score negatively, so how do I account for that when calculating a player's value?
To give another example, a player who costs £10m and scores an average of -2 points per game offers -0.2 points per game per million. Whereas a player who costs £8m and scores an average of -2 points per game offers -0.25 points per game per million.
Now the player who costs £10m appears to be better value because their PPG/£m is higher. This shouldn't be true: they can't be better value if they cost more but score the same points. So in a list of players sorted by value calculated this way, some players will incorrectly rank above players that are actually better value.
Is there a way to account for this problem? Or is it just an unfortunate fact of the system I'm using?
One simple trick is to slightly change your formula: define PPG/£m as the ratio of the square of the average points scored to the cost.
If you are particular about the scale, consider its positive square root.
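A tiny sketch of both formulas on the example players from the question (all numbers come from the post). Note one trade-off of squaring: the sign is discarded, so a player averaging -2 and one averaging +2 at the same cost would tie:

```python
def value_naive(avg_ppg, cost):
    return avg_ppg / cost

def value_squared(avg_ppg, cost):
    # Suggested tweak: square the average points before dividing by cost.
    return avg_ppg ** 2 / cost

players = [("£10m, 5 ppg", 5, 10), ("£8m, 5 ppg", 5, 8),
           ("£10m, -2 ppg", -2, 10), ("£8m, -2 ppg", -2, 8)]

for name, ppg, cost in players:
    print(f"{name:>13}  naive: {value_naive(ppg, cost):+.3f}"
          f"  squared: {value_squared(ppg, cost):.3f}")
```

With the squared formula, the £8m player at -2 ppg (0.5) now ranks above the £10m player at -2 ppg (0.4), which fixes the ordering problem for equal scorers.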
