Finding the interval containing 95% of non-normally distributed points around a mean/median

I asked users to tap a location repeatedly. To calculate the size of a target at that location, such that 95% of users will hit it successfully, I usually take 2 standard deviations of the tap offsets from the centroid. That works if the tap offsets are normally distributed, but my data is not normally distributed. How can I find the equivalent of 2 standard deviations around the mean/median?

If you're only measuring in one dimension, the region encompassed by +/-2 standard deviations of a Normal distribution corresponds fairly well to the central 95% of the distribution. It may be worth working with quantiles instead: take the interval between the 2.5th and 97.5th percentiles. This is robust to skew or any other departure from normality.
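For example, a minimal sketch in Python, where the offsets array is just a stand-in for your measured tap offsets:
import numpy as np

# Stand-in for measured 1-D tap offsets; heavy-tailed, so non-normal.
offsets = np.random.standard_t(df=3, size=500)

# The central 95% interval: robust to skew and other non-normality.
lo, hi = np.percentile(offsets, [2.5, 97.5])
print(f"95% of taps fall within [{lo:.2f}, {hi:.2f}]")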

How to aggregate data by period in a rrdtool graph

I have an RRD file with average ping times to a server (GAUGE) every minute, and when the server is offline (which is very frequent, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage of each hour that the server is offline, which I think can be achieved by counting the NaNs within 60 samples and dividing by 60.
So far I have defined a variable that is 1 when the server is offline and 0 otherwise, but having read the docs I still don't know how to aggregate it:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph? Or will I have to store that info in another RRD?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding-window average that shows the percentage of the previous hour that was unknown, and graph that using TREND.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcoffline=offline,3600,TREND
LINE:pcoffline#ff0000:Offline %
This defines avg as the 1-minute time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the server is down (the data point is unknown) and 0 when it is up. Finally, pcoffline is a 1-hour sliding-window average of offline, which is in effect the percentage of the previous hour during which the server was offline.
However, there's a problem: RRDTool will silently summarise the source data before you get your hands on it if there are many data points per pixel in the graph (this won't happen when doing a fetch, of course). To get around that, you'd need the offline calculation done at store time, i.e. a COMPUTE-type DS that is 100 or 0 depending on whether the rtt DS is unknown. Then any averaging will preserve the data (normal averaging omits the unknowns, or the xff setting makes the whole CDP unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Availability
If you are able to modify your RRD and do not need historical data, then using a COMPUTE DS in this way will let you display your data in the 1-hour stepped graph you wanted.

What factors influence the "Avoids Enormous Network Payloads" message?

According to the official documentation, the target is 1600KB:
https://developers.google.com/web/tools/lighthouse/audits/network-payloads
However, even when the payload is larger than the target, the audit sometimes still passes:
a payload of 2405KB still passes
What are the other conditions that allow large payloads to pass the audit?
Lighthouse scores run 0–100 and are based on log-normal distribution math.
1600KB is not a pass threshold; it is approximately the point at which the maximum score of 100 is achieved.
As of right now, the values used for the distribution calculation are a 2500KB point of diminishing returns and a 4000KB median, which correspond to scores of about 93 and 50 respectively.
That puts the 2405KB result at a score of ~94, which is sufficient to pass.
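To make the curve concrete, here is a rough Python sketch. It assumes the score is the complementary CDF of a log-normal distribution calibrated to the two control points above; Lighthouse's exact formula may differ in the details:
import math

def lognormal_score(value_kb, median_kb=4000, podr_kb=2500, podr_score=0.93):
    # Standard normal CDF.
    def phi(z):
        return 0.5 * math.erfc(-z / math.sqrt(2))
    # Calibrate sigma so the point of diminishing returns lands on
    # podr_score: find z with phi(z) = 1 - podr_score by bisection.
    lo, hi = -10.0, 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) < 1 - podr_score:
            lo = mid
        else:
            hi = mid
    sigma = math.log(podr_kb / median_kb) / ((lo + hi) / 2)
    # The score is the fraction of the distribution above this value.
    return 1 - phi(math.log(value_kb / median_kb) / sigma)

print(round(lognormal_score(2405) * 100))  # -> 94, enough to pass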

How to calculate Azure SQL Data Warehouse DWU?

I am analyzing Azure SQL DW and came across the term DWU (Data Warehouse Units). The link on the Azure site only gives a crude definition of DWU. I want to understand how DWU is calculated and how I should scale my system accordingly.
I have also referred to the link, but it does not cover my question.
In addition to the links you found it is helpful to know that Azure SQL DW stores data in 60 different parts called "distributions". If your DW is DWU100 then all 60 distributions are attached to one compute node. If you scale to DWU200 then 30 distributions are detached and reattached to a second compute node. If you scale all the way to DWU2000 then you have 20 compute nodes each with 3 distributions attached. So you see how DWU is a measure of the compute/query power of your DW. As you scale you have more compute operating on less data per compute node.
Update: For Gen2 there are still 60 distributions, but the DWU math is a bit different. DWU500c is one full-size node (playing both compute and control node roles) where all 60 distributions are mounted. Scales smaller than DWU500c are single nodes that are not full size (meaning fewer cores and less RAM than full-size nodes on larger DWUs). DWU1000c is 2 compute nodes, each with 30 distributions mounted, plus a separate control node. DWU1500c is 3 compute nodes and a separate control node. The largest is DWU30000c: 60 compute nodes, each with one distribution mounted.
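A quick sketch of the Gen2 arithmetic described above (gen2_layout is a hypothetical helper; it assumes a DWU size that is a multiple of 500 and divides the 60 distributions evenly):
def gen2_layout(dwu_c):
    # One full-size compute node per 500 DWUc, per the description above.
    nodes = dwu_c // 500
    assert dwu_c % 500 == 0 and 60 % nodes == 0, "not a valid Gen2 size"
    return nodes, 60 // nodes

for dwu in (500, 1000, 1500, 30000):
    nodes, per_node = gen2_layout(dwu)
    print(f"DWU{dwu}c: {nodes} compute node(s), {per_node} distributions each")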
I just found this link, which shows the throughput-to-DWU relation.
You can also check out the dwucalculator. This site walks you through taking a capture of your existing workload and recommends the number of DWUs necessary to fulfill the workload in Azure SQL DW.
http://dwucalculator.azurewebsites.net/
You may also choose a DWU level based on elapsed time and the number of tables.
For example, if 3 tables take 15 minutes at DWU100 and you need the same work done in 3 minutes, you might choose DWU500 (assuming roughly linear scaling).

How to calculate waiting time for a customer based on previous customers' waiting times

Can anyone suggest a method to calculate customer waiting time for a restaurant based on previous waiting times? My system stores the waiting time of each customer, and based on these values I want to predict the waiting time for the next customer.
You can't predict an exact figure.
But a simple statistical approach would be:
average( waiting_time ) + ( 2 * standard_deviation( waiting_time ) )
That is, take the average and add two standard deviations.
Assuming that wait time is normally distributed, the result of the above equation approximates the maximum waiting time that roughly 95% of your customers would experience (strictly, mean plus 2 standard deviations is the ~97.7th percentile, since only the upper tail matters here).
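A minimal sketch in Python, where the waits list stands in for your stored waiting times:
import statistics

def waiting_time_estimate(waits):
    # Mean plus two standard deviations of past waiting times.
    return statistics.mean(waits) + 2 * statistics.stdev(waits)

print(waiting_time_estimate([12, 8, 15, 22, 9, 30, 11]))  # ~31.3 minutes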
A Poisson process is a stochastic process which counts the number of events and the times at which these events occur in a given time interval. The time between each pair of consecutive events, for example customer waiting times, has an Exponential distribution. From wiki:
The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous Poisson process.
Prediction
Using maximum likelihood estimation, the rate parameter λ of the exponential distribution is estimated by the inverse of the sample mean: λ̂ = 1/x̄.
Confidence Interval
From wiki:
A simple and rapid method to calculate an approximate confidence interval for the estimation of λ is based on the application of the central limit theorem. This method provides a good approximation of the confidence interval limits, for samples containing at least 15 – 20 elements. Denoting by N the sample size, the upper and lower limits of the 95% confidence interval are given by: λ_lower = λ̂(1 − 1.96/√N) and λ_upper = λ̂(1 + 1.96/√N).
For more details, see Poisson process and Exponential distribution.
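Putting the estimate and the interval together, a small Python sketch (exponential_rate_ci is a hypothetical helper; waits again stands in for your recorded times):
import math

def exponential_rate_ci(waits, z=1.96):
    # MLE of the exponential rate parameter: the inverse of the sample mean.
    n = len(waits)
    rate = n / sum(waits)
    # CLT-based 95% interval from the quote above; a good approximation
    # once the sample has at least 15-20 elements.
    half = z / math.sqrt(n)
    return rate * (1 - half), rate, rate * (1 + half)

waits = [12, 8, 15, 22, 9, 30, 11, 14, 18, 7, 25, 10, 13, 16, 20]
low, rate, high = exponential_rate_ci(waits)
print(f"lambda ~ {rate:.3f} per minute, 95% CI [{low:.3f}, {high:.3f}]")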

Creating a measure that combines a percentage with a low decimal number?

I'm working on a project in Tableau (which uses functions very similar to Excel's, if that helps) where I need a single measure derived from two different measurements, one of which is a low decimal number (2.95 on the high end, 0.00667 on the low) and the other a percentage (ranging from 29.8 to 100 percent).
Put another way, I have two tables detailing bus punctuality -- one is for high frequency routes and measured in Excess Waiting Time (EWT, in minutes), the other for low frequency routes and measured in terms of percent on time. I have a map of all the routes, and want to colour the lines based on how punctual that route is (thinner lines for routes with a low EWT or a high percentage on time; thicker lines for routes with high EWT or low percentage on time). In preparation for this, I've combined both tables and zeroed out the non-existent values.
I thought I'd do something like log(EWT + PercentOnTime), but am realizing that might not give the value I want (especially because I ultimately need the inverse of one or the other, since a low EWT is favourable and a high % on time is favourable).
Any idea how I'd do this? Thanks!
If you are combining/comparing the metrics evenly and the data is relatively linear, then all you need to do is normalise them.
If you have the expected EWT range (e.g. 0.00667 to 2.95), then an EWT of 2 would be:
(2 - 0.00667)/(2.95 - 0.00667) = 0.67723, but because EWT is semantically the inverse of punctuality we use 1 - 0.67723 = 0.32277.
If you do the same for the punctuality percentage range:
Eg. 80%
(80 - 29.8)/(100 - 29.8) = 0.7151
You can compare these metrics because they are normalised (between 0 and 1; multiply by 100 to get percentages), provided you assume the underlying metrics, inverted EWT and on-time percentage (OTP), are analogous.
Thus you can combine these into a single table. You will want to ignore all zeroed values, as a zero is actually an indication that you have no data at that point.
You'll have to use an if statement, something like:
IF [OTP] > 0 THEN ([OTP] - 29.8)/(100 - 29.8) ELSE 1 - (([EWT] - 0.00667)/(2.95 - 0.00667)) END
Hope this helps.
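For what it's worth, the same normalisation is easy to sanity-check outside Tableau. A Python sketch using the ranges from the question (punctuality_score is a hypothetical helper):
def punctuality_score(ewt=None, otp=None):
    # Min-max normalise either metric onto a shared 0-1 scale.
    # EWT is inverted because a low EWT is good; a high OTP is good.
    if otp is not None and otp > 0:
        return (otp - 29.8) / (100 - 29.8)
    return 1 - (ewt - 0.00667) / (2.95 - 0.00667)

print(punctuality_score(otp=80))  # ~0.7151
print(punctuality_score(ewt=2))   # ~0.3228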
