Calculate approximate trimmed 10% mean using percentile data - statistics

I have access to my services' latency metrics at all percentiles. I need to calculate the trimmed 10% mean of the service's latency now. Is there a way I can approximate the trimmed 10% mean using just the percentiles data? I understand I can simply calculate the mean using a script for the transactions between the 10th percentile and 90th percentile, but since this data is to be used directionally only, I was wondering if there is an easy hack to guesstimate it as doing it at scale would be expensive.

This is really more suitable for stats.stackexchange.com, but in any case you can approximate the trimmed mean, or any other sample statistic, given the percentiles. From the percentiles, construct the equivalent histogram: each bar spans from one percentile value to the next, and its height (probability mass) is the difference between the corresponding percentile levels. (So if you reversed the process and cumulatively added up the bars, you would recover the percentiles again.)
Now calculate the sample statistic from that histogram. The exact value is an integral. An easy approximation is to generate a number of data points from the span of each bar, and then apply the ordinary formula for the statistic to those points. The first thing to try is to just generate data equal to the midpoint of each bar, with the number of values in each bin proportional to the bar height.
I don't know a package to do this, but with this description maybe you can look it up, or work out the details.
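Here is a minimal Python sketch of that midpoint approach, assuming the input is the latency value at each whole percentile (1st through 99th); the function name and argument layout are just illustrative:

import numpy as np

def approx_trimmed_mean(percentile_values, lower=10, upper=90):
    # percentile_values[i] is the latency at the (i + 1)-th percentile, i = 0..98.
    # Adjacent percentile values bound a bin holding 1% of the data, so each bin
    # is represented by its midpoint, and we average the midpoints whose
    # percentile level falls inside the trimming bounds.
    p = np.asarray(percentile_values, dtype=float)
    levels = np.arange(1, len(p) + 1)                    # 1, 2, ..., 99
    mids = (p[:-1] + p[1:]) / 2                          # one midpoint per 1% bin
    bin_levels = (levels[:-1] + levels[1:]) / 2          # percentile level of each bin centre
    keep = (bin_levels >= lower) & (bin_levels <= upper) # drop both 10% tails
    return mids[keep].mean()                             # every kept bin carries equal mass

Because every 1% bin carries the same probability mass, a plain mean of the kept midpoints suffices; with unevenly spaced percentile levels you would weight each midpoint by its bin's mass instead.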

Related

How to quantify magnitude of change in dataset containing base values of 0?

I have a dataset with the low-water and high-water surface area of lakes/ponds within a delta for each year. These lakes can undergo substantial change from year to year, and sometimes can dry out completely. As such, surface area can have values of 0 during the low-water period. I'm trying to quantify the magnitude of spring flooding on the surface areas of these lakes. Given the high interannual variation in surface area, I need to compare the low-water value from the previous year to the high-water value of the following year to quantify this magnitude; comparing to a mean isn't sensitive enough. However, given the low-water surface area of 0 for some lakes, I cannot quantify percent change.
My current idea is to do an "inverse" of percent change (I don't know how else to describe it), where I divide the low-water value by the high-water value. This gives me a scale where large change equals 0 and little change equals 1. However, small changes from a surface area of 0 will again be overrepresented. Any idea how I could accurately compare the magnitude of flooding in such a case?
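For concreteness, a tiny Python illustration of that ratio and of the zero-area edge case; the surface-area numbers are made up:

# Hypothetical (low-water previous year, high-water following year) areas.
pairs = [(12.0, 48.0), (0.0, 0.5), (0.0, 40.0)]
for low, high in pairs:
    ratio = low / high   # 1 means little change, 0 means large change
    print(low, high, round(ratio, 3))
# Both lakes that start at 0 score exactly 0, whether the flood was tiny
# (0 to 0.5) or huge (0 to 40); that is the overrepresentation described above.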

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (i.e. normalisation) and when should I use standardisation (i.e. the z-score) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and the z-score brings it down to roughly -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in the data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, a plain normalization does indeed limit the range of the possible outcomes, but will not help you find outliers, since it just bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
The box and whiskers [of the box plot] show the quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
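As a rough Python illustration of those two routes, here is a z-score cut-off next to a whisker-style threshold using the 2nd and 98th percentiles listed above; the data, the seed, and the cut-off of 3 are only examples:

import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # long-tailed example data

# z-score: how many standard deviations each point lies from the mean.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]                      # common but arbitrary cut-off

# Box-plot style: whiskers at the 2nd and 98th percentiles, points outside are outliers.
lo, hi = np.percentile(data, [2, 98])
whisker_outliers = data[(data < lo) | (data > hi)]

print(len(z_outliers), len(whisker_outliers))

Min-max scaling is deliberately left out: as noted above, it only rescales the data to [0, 1] and does not by itself flag anything.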

Generate positive only distribution based on array

I have an array of data, for example:
[1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
From this data, I want to generate random numbers that fit the same distribution.
I can generate a random number using code like the following:
import numpy as np
import scipy.stats

dist = scipy.stats.gaussian_kde(data)      # kernel density estimate fitted to the sample
randomVar = np.floor(dist.resample()[0])   # draw new values from that estimate
This results in random number generation that includes negative numbers, which I believe I can dump fairly easily without changing the overall shape of the rest of the curve (I just generate sufficient resamples that I still have enough for purpose after dumping the negatives).
However, because the original data consists of positive values only, heaped up against that boundary, I end up with a KDE that is highest a short distance before zero but then drops off sharply as it approaches zero; and that downward tick in the KDE is preventing me from generating appropriate numbers.
I can set the bandwidth lower, in order to get a sharper corner, closer to zero, but then due to the low quantity of the original data it ends up sawtoothing elsewhere. Higher bandwidths unfortunately hide the shape of the curve before they remove the downward tick.
As broadly suggested in the comments by Hilbert's Drinking Problem, the real solution was to find a better-fitting parametric distribution. In my case that was Chi-squared, which fit both the shape of the curve and the fact that it only takes positive values.
However, in the comments Stelios made the good suggestion of using scipy.stats.rv_histogram, which I used and was satisfied with for a while. This let me fit a curve to the data exactly, though it had two problems:
1) It assumes zero density in the absence of data, i.e. if you set it to fit too closely to the data, then during gaps in your data it will drop to zero rather than interpolate.
2) As an extension of point 1, it won't extrapolate beyond the seed data's maximum and minimum (the ranges beyond them are effectively giant gaps, so everything eventually zeroes out).
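For reference, a short Python sketch of both approaches, using the sample array from the question; the bin count and other parameter choices are only illustrative:

import numpy as np
import scipy.stats

data = np.array([1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210,
                 180, 150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3])

# Option 1: fit a parametric, positive-only distribution (Chi-squared here),
# pinning its lower bound at zero.
df, loc, scale = scipy.stats.chi2.fit(data, floc=0)
chi2_samples = scipy.stats.chi2.rvs(df, loc=loc, scale=scale, size=1000)

# Option 2: build a distribution straight from a histogram of the data.
hist = np.histogram(data, bins=10)
hist_dist = scipy.stats.rv_histogram(hist)
hist_samples = hist_dist.rvs(size=1000)   # never leaves the seed data's range

The histogram route reproduces the two limitations above exactly (zero density in empty bins, no extrapolation), while the Chi-squared fit smooths over gaps and respects the positive-only constraint.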

Excel, Determine where data takes a dive

I'm trying to determine where, in a set of measurement data, the data takes a dive...
... so I can plot a vertical line and
... plot a horizontal line in the graph.
I have no problem doing the 2nd and 3rd bullet points above on my own, so that's taken care of.
The problem I need help with is the first bullet point: determining WHERE the data takes a dive, i.e. WHERE the data crosses a threshold that basically says, "Whatever it is you're measuring is no longer performing as it is expected to."
Here's what I'm doing:
I am taking measurements using a measuring device and that device is logging the measurements in its internal memory and allowing me to download that measurement data to my computer into a csv when the test session is complete.
I pull that csv into an xls and plot the data on a graph. (see attached image)
Here's what I want to do:
If you look at the attached image I would like to find the value where the data DEFINITELY crosses BELOW the horizontal line so I can say, "Here is where the device being tested 'gave up the ghost' and was no longer able to perform as desired."
What the data roughly looks like:
Each measurement set will have the rough look and feel of the attached image but slightly different each time. (because each object I am testing will have roughly the same performance characteristics but they all have their own manufacturing defects and variations.)
The data set for the attached image is a data set of 7000 measurements.
I never really know where the horizontal line will be.
Examples of the data sets I have gotten in the past several tests look like this:
(394 to 0)
(390000 to 0)
(3.88 to 0)
(375000 to 0)
(39.55 to 0)
(59200 to 0)
and each data set will have about 1,000 to 7,000 measurements.
Here's how I was trying to solve this issue:
I was using SLOPE() and trying to latch onto where the slope of the line took a dive / started heading towards a steep negative slope (a near-vertical drop), so that when the slope becomes sharply negative it MUST be taking a dive. That didn't really work.
I was looking at using STDEV.P() in Excel and feeding it the entire data set. Then I was looking at doing the same thing but feeding it only the first 10, 30, or 60 measurements; but then I thought, we never really know just how many measurements will come through. Then I thought I would use the first 10% of the measurements and feed that to STDEV.P().
Please let me know what you think of this and please let me know of any ideas you may have.
Thanks.
H
Something like this should work to flag when the decay rate increases.
To find what 'direction' your data is going in you need the derivative.
Excel doesn't have a derivative formula but you can set it up pretty easily by using the (change in y)/(change in x) as demonstrated here:
http://faculty.educ.ubc.ca/sanderson/lab/CLFbiom/demo/diff.htm
I would then add a formula which counts how many data rows you have (=COUNTA(A:A) or similar),
then use that to get a step of 10% of your data,
then check the value of the derivative in a cell against the cell 10% of the data further down. If it's still negative (to account for the slight downhill at first), then you'll know the dive has begun.
The right way to go about this is to model the data with an unknown discontinuity, something like "if time < break_time then (some constant plus noise) else (decaying exponential)". A maximum likelihood estimation for that model might require iteration or other operations which are clumsy in Excel -- maybe you should consider VB or Python or some other programming language. I.e. choose the tool to fit the problem and not the other way around.
See Seber and Wild, "Nonlinear Regression", for an extensive discussion of models with discontinuities.
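A rough Python sketch of that model (flat plateau, then a decaying exponential, with the break point estimated by least squares, which coincides with maximum likelihood under Gaussian noise); the synthetic data and starting guesses are placeholders:

import numpy as np
from scipy.optimize import curve_fit

def piecewise(t, t_break, level, decay):
    # Constant plateau before the break, decaying exponential after it.
    return np.where(t < t_break, level, level * np.exp(-decay * (t - t_break)))

rng = np.random.default_rng(0)
t = np.arange(1000, dtype=float)
y = piecewise(t, 600, 394.0, 0.01) + rng.normal(0, 5, t.size)   # made-up measurements

# p0 is a starting guess for (t_break, level, decay).
params, _ = curve_fit(piecewise, t, y, p0=[t.size / 2, y[:50].mean(), 0.001])
print("estimated break point:", params[0])

If the fit is sensitive to the starting guess, a simple and robust alternative is to scan candidate break points exhaustively and keep the one with the smallest residual sum of squares.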
If your data can be generally characterized as having:
(A) a more or less flat plateau region, followed by
(B) a downward trending region
then a basic strategy could be to start at the end of the data and march towards the beginning one point at a time, checking to see that the values are increasing. Once they stop increasing, you've found the break point.
The strategy assumes (unwisely?) that the downward trending region is smooth/noiseless. To make the solution more robust to noise, you could compare values that are 5 apart, or 10 apart, or whatever interval works to filter out the noise. Or you could use a moving average.
This strategy could potentially be made more efficient by starting the search somewhere in the middle of the data but still in downward trending portion. If you know (based on experience) that any value that is (say) 0.5X the maximum is in the downward trending portion, you could start the search there.
Hope that helps.
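A minimal Python sketch of that backwards march, with a comparison interval to damp noise as suggested; the interval of 10 is just an example:

import numpy as np

def find_break_from_end(values, gap=10):
    # Walk from the end of the data towards the beginning, comparing each point
    # with the one `gap` positions later. While the later value is smaller we are
    # still in the downward-trending region; once that stops holding we have
    # reached the plateau, and the break sits about `gap` points ahead.
    for i in range(len(values) - gap - 1, -1, -1):
        if values[i + gap] >= values[i]:
            return i + gap       # estimated index where the dive begins
    return 0                     # the data was falling from the very start

flat = [394.0] * 60
dive = list(np.linspace(394.0, 0.0, 40))
print(find_break_from_end(np.array(flat + dive)))   # about 60 on this synthetic example

Starting the loop near the middle of the data, or wherever the values have already fallen to roughly half the maximum, gives the speed-up mentioned above.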
It appears as though you want to detect when the slope changes from something near zero to something negative. One way to detect this is to calculate the 2nd derivative of the values (calculate the slope of the slope). The 2nd derivative should be near zero in the flat portion of the data AND in the downward-trending portion of the data. It should go negative at the break point. So finding the minimum (most negative) value of the 2nd derivative should locate the break point.
To implement this, you will probably need to filter noise. So calculate the first derivative (slope) over some suitable window of data, for example with times in column A and raw values in column B:
=SLOPE(B2:B26, A2:A26)   (a moving window of, say, 25 raw values, dragged down the sheet)
Then calculate the second derivative (slope of slope) over the resulting column of slope values, say column C:
=SLOPE(C2:C26, A2:A26)
Then look for the minimum.
Hope that helps.
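The same idea in Python, in case the moving windows become unwieldy in Excel; the 25-point window mirrors the suggestion above and everything else is illustrative:

import numpy as np

def rolling_slope(x, y, window=25):
    # Least-squares slope of y against x over each sliding window.
    out = np.empty(len(y) - window + 1)
    for i in range(len(out)):
        out[i] = np.polyfit(x[i:i + window], y[i:i + window], 1)[0]
    return out

def break_point_index(t, y, window=25):
    d1 = rolling_slope(t, y, window)                 # first derivative (slope)
    d2 = rolling_slope(t[:len(d1)], d1, window)      # second derivative (slope of slope)
    return int(np.argmin(d2)) + window               # most negative curvature, roughly re-centred

The returned index is only approximate, because each smoothing pass shifts the window's reference point; for a directional "where did it give up" answer that is usually good enough.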

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example) the frequency of data points is much lower.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, since the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the weighted mean by:
timeStep = diff(t);                                         % duration each sample represents
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep); % time-weighted average
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".
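A short Python/NumPy sketch of the interpolate-then-average route (MATLAB's interp1 plays the same role); the two curves below are made-up placeholders for your data:

import numpy as np

# Each curve is (times, values) with its own, non-uniform time stamps.
curves = [
    (np.array([0.0, 0.1, 0.3, 1.0, 4.8]), np.array([1.0, 0.9, 0.7, 0.4, 0.1])),
    (np.array([0.0, 0.2, 0.5, 2.0, 5.0]), np.array([1.1, 0.8, 0.6, 0.3, 0.1])),
]

# Common grid spanning only the times covered by every curve.
t_lo = max(t.min() for t, _ in curves)
t_hi = min(t.max() for t, _ in curves)
grid = np.linspace(t_lo, t_hi, 100)

# Interpolate each curve onto the grid, then average across curves.
resampled = np.vstack([np.interp(grid, t, y) for t, y in curves])
mean_curve = resampled.mean(axis=0)   # the "average" curve
std_curve = resampled.std(axis=0)     # standard deviation, for plotting a band around it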
