Estimating percentiles in a skewed distribution (doesn't need to be exact) - excel

This may be more of a statistics question, and I'd like to find a solution with Excel. I'd rather use simple VBA if any coding is necessary.
Is there a way to estimate the percentile of a specific data point in a skewed distribution? I don't need exact percentiles and only need a reasonable estimate. I work on analyses that rely on weighted average benchmarks reported by multiple sources. All of my sources report the 25th, 50th, 75th, and 90th percentiles as well as the mean and standard deviation. We use these benchmarks to set a target range, and our goal is for our results from a specific analysis to land somewhere within the published percentiles. I'm often asked to indicate what percentile our specific result is at, and all I can provide is broad ranges like 25th-50th, etc. So, I'm then asked to use simple extrapolation to determine the specific percentile of the specific result, and I know that using this method is inaccurate.
Mean and median differ in 99% of cases in my data set, but % difference between mean and median on average is only 6%. Only about 10% of cases have mean and median with greater than 10% difference.
For the 90% of cases with relatively low % difference between mean and median, can I assume the normal distribution?
For cases with higher % difference between mean and median, can I make an assumption that will help me estimate more accurately? I could for these cases just use the normal distribution and send my percentile estimate along with a note indicating that the estimate is likely off in one direction or another, but I'd rather give a better estimate.
Responding to cybernetic.nomad:
First, thanks for commenting! Second, it doesn't seem to work. I think I don't have enough data. The attached image shows an example. The first 5 rows show one set of my weighted average benchmarks for a single case. Below that, I added two lines, one of them with my "target" amount. This could be any number but, to test the formula you suggested, I entered my 50th percentile weighted average. The row below that has the result of the formula =percentrank.exc(25th:90th,target). The result should be 0.5 but it's not, so I don't think this works.
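One rough way to turn the four published percentiles into a point estimate is to interpolate between them, and for the skewed cases to pin a two-parameter shape to two of the percentiles. The sketch below is in Python rather than Excel/VBA. The benchmark numbers are made up, both function names are hypothetical, and the lognormal shape is an assumption (it suits right-skewed, strictly positive data), not something the benchmark sources confirm.

```python
from math import log
from statistics import NormalDist

def percentile_by_interpolation(value, benchmarks):
    """Piecewise-linear estimate of the percentile of `value`.

    `benchmarks` is a list of (percentile, benchmark_value) pairs sorted
    ascending. Returns None outside the published range, where there is
    nothing to interpolate between.
    """
    for (p_lo, v_lo), (p_hi, v_hi) in zip(benchmarks, benchmarks[1:]):
        if v_lo <= value <= v_hi:
            if v_hi == v_lo:
                return float(p_lo)
            return p_lo + (p_hi - p_lo) * (value - v_lo) / (v_hi - v_lo)
    return None

def percentile_by_lognormal(value, median, p90):
    """Estimate assuming a lognormal shape pinned to the 50th and 90th
    percentiles. Assumes positive values and p90 > median."""
    mu = log(median)
    sigma = (log(p90) - mu) / NormalDist().inv_cdf(0.90)
    return 100 * NormalDist(mu, sigma).cdf(log(value))

# Hypothetical benchmark set: 25th, 50th, 75th and 90th percentiles.
benchmarks = [(25, 40.0), (50, 50.0), (75, 65.0), (90, 90.0)]
target = 57.5

interp_estimate = percentile_by_interpolation(target, benchmarks)
lognormal_estimate = percentile_by_lognormal(target, median=50.0, p90=90.0)
print(interp_estimate, round(lognormal_estimate, 1))
```

Both estimates return exactly the published percentile when the target equals a published benchmark, which is a quick sanity check that the PERCENTRANK.EXC attempt above does not pass.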

Related

Weighted percentile calculation from group of percentiles

Can we calculate the overall kth percentile if we have kth percentile over 1 minute window for the same time period?
The underlying data is not available. Only the kth percentile and count of underlying data is available.
Are there any existing algorithms available for this?
How approximate will the calculated kth percentile be?
No. If you have only one percentile (and count) for every time period, then you cannot reasonably estimate that same percentile for the entire time period.
This is because percentiles are only summary measures (like means) and do not by themselves tell you enough about the distribution above and below the measured value in each time window. There are a few exceptions to the above.
1. If the percentile that you have is the 50th percentile (i.e., the median), then you can do some extrapolation to the median of the whole period, but it's a bit sketchy and I'm not sure how bad the variance would be.
2. If all of your percentile measures are very close together (compared to the actual range of the measured population), then obviously you can use that as a reasonable estimate of the overall percentile.
3. If you can assume with high assurance that every minute's data is an independent sampling of the exact same population distribution (i.e., there is no time-dependence), then you may be able to combine them, possibly even if the exact distribution is not fully known (it has parameters that are unknown, but still known to be fixed over the time period). Again, I am not sure what the valid functions and variance calculations are for this.
4. If the distribution is known (or can be assumed) to be a specific function or shape with some unknown value or values, and time-dependence has a known role in that function, then you should be able to use weighting and time adjustments to transform it into the same situation as #3 above. For instance, if the distributions were a time-varying exponential distribution with rate k*t, i.e. pdf(x; k, t) = (k*t) * e^(-(k*t)*x), then I believe that you could derive an overall percentile estimate by estimating the value of k, adjusting it for each different minute (t).
Unfortunately I am not a professional statistician. I have Math/CS background, enough to have some idea of what's mathematically possible/reasonable, but not enough to tell exactly how to do it. If you think that your situation falls into one of the above categories, then you might be able to take it to https://stats.stackexchange.com but you will need to also provide the information I mentioned in those categories and/or detailed and specific information about what you are measuring and how you are measuring it.
Based on statistical instinct, the error will be proportional to the standard deviation of the total set if you build an approximation for a longer time span out of the discrete per-window kth percentiles. [This claim would need clarification or proof.]
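To illustrate why a single percentile and a count per window are not enough in general, here is a small Python sketch with two made-up data sets. Both produce identical per-window summaries (same counts, same 90th percentiles), yet their overall 90th percentiles differ. The nearest-rank definition of a percentile and the helper names are assumptions chosen for simplicity.

```python
from math import ceil

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the value at 1-based rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def summarize(windows, pct=90):
    """Return the per-window (count, percentile) pairs and the true overall percentile."""
    per_window = [(len(w), nearest_rank_percentile(w, pct)) for w in windows]
    combined = [x for w in windows for x in w]
    return per_window, nearest_rank_percentile(combined, pct)

# Two hypothetical one-minute windows in each scenario.
scenario_a = [[1] * 8 + [5, 6], [1] * 8 + [50, 60]]
scenario_b = [[5] * 10, [50] * 10]

summary_a, overall_a = summarize(scenario_a)
summary_b, overall_b = summarize(scenario_b)

print(summary_a == summary_b)  # the available information is identical
print(overall_a, overall_b)    # the quantity being asked for is not
```

Any algorithm that sees only the per-window pairs must give the same answer for both scenarios, so it is wrong for at least one of them unless extra assumptions such as those in the exceptions above hold.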

Small data anomaly detection algo

I have the following 3 cases of a numeric metric on a time series(t,t1,t2 etc denotes different hourly comparisons across periods)
If you look at the 3 graphs, t (the period of interest) clearly has a drop-off in image 1 but not so much in images 2 and 3. Assume this is some sort of numeric metric (raw or derived) and I want to create a system/algorithm which specifically catches case 1 but not case 2 or 3, with t being the point of interest. While visually this makes sense and is very intuitive, I am trying to design a way to do this in Python using the dataframes shown in the picture.
Generally the problem is how do I detect when the time series is behaving very differently from any of the prior weeks.
Edit: When I say different, what I really mean is this: my metric trends together across periods t1 to t4, and if a period does not follow that trend and tries to separate out of the envelope, that to me is an anomaly. In chart 1 you can see t trying to split away from the rest of the tn; this is an anomaly for me. In the other cases t stays within the bounds of the other time periods. Hope this helps.
With small data the best is if you can come up with a good transformation into a simpler representation.
In this case I would try the following:
Distance to the median of the other periods along the time axis, followed by a summary of that distance (could be the median, mean squared error, etc.)
Median of the cross-correlation of the signals
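The first suggestion could be sketched in Python roughly as follows. The series values, the function names, and the `factor=3.0` threshold are all hypothetical; the leave-one-out baseline (scoring each prior period against the remaining ones) is one possible way to decide how large a distance is "normal", not something the answer prescribes.

```python
from statistics import mean, median

def envelope_score(series, others):
    """Mean squared distance from `series` to the pointwise median of `others`."""
    center = [median(column) for column in zip(*others)]
    return mean((x - c) ** 2 for x, c in zip(series, center))

def is_anomalous(current, prior, factor=3.0):
    """Flag `current` if it sits much further from the prior periods than
    those periods sit from each other (leave-one-out baseline)."""
    baseline = [
        envelope_score(p, prior[:i] + prior[i + 1:])
        for i, p in enumerate(prior)
    ]
    return envelope_score(current, prior) > factor * max(baseline)

# Hypothetical hourly values for four prior periods (t1..t4).
prior = [
    [10, 12, 14, 13, 11],
    [11, 13, 15, 14, 12],
    [9, 11, 13, 12, 10],
    [10, 13, 14, 12, 11],
]
t_normal = [10, 12, 14, 13, 12]  # stays inside the envelope
t_drop = [10, 12, 9, 5, 3]       # drops away from the other periods

print(is_anomalous(t_normal, prior), is_anomalous(t_drop, prior))
```

With real data in a dataframe, each prior period would be one column or one row of values aligned on the same hours; the threshold would need tuning against known good and bad periods.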

Calculating percentile - Excel vs online

I have a set of data, say {4,7,7,10,10,12,12,14,15,67}, and I want to know the 95th percentile. I used Excel and an online calculator.
Both gave different answers.
In Excel, the formula I used was =PERCENTILE.INC(A1:A10,0.95) and the result was 43.6.
But this online percentile calculator yielded a result of 67.
Which one is right?
First of all, both methods are "right" in the sense that both implement a standard algorithm for computing percentiles. Unlike the mean or median (where all sources use the same approach) there are many different approaches to calculating percentiles. The fundamental issue is that there is no obvious solution to the problem of what to do with percentiles which fall between observations. Do you take the observed value which is closest? Do you interpolate between the two? If so -- with what weighting factors do you do the interpolation? Wikipedia discusses nine (!) with both the Excel approach and the approach from that online percentile calculator making the list. See this paper for a very nice discussion of these algorithms.
You can replicate the functionality of that online percentile function like this:
=SMALL(A1:A10,CEILING.MATH(COUNT(A1:A10)*0.95))
The point of using the function SMALL rather than a direct numerical index is that this approach works even if the data isn't sorted.
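The two results can be reproduced side by side. The Python sketch below implements a nearest-rank rule (the same idea as the SMALL/CEILING.MATH formula above) and a linear interpolation over positions 0 to n-1, which is assumed here to be what PERCENTILE.INC does because it reproduces the 43.6 from the question. The function names are illustrative only.

```python
from math import ceil, floor

def percentile_nearest_rank(values, p):
    """Value at 1-based rank ceil(p * n) of the sorted data."""
    ordered = sorted(values)
    rank = ceil(p * len(ordered))
    return ordered[max(rank, 1) - 1]

def percentile_inclusive(values, p):
    """Linear interpolation at fractional position p * (n - 1) of the sorted data."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [4, 7, 7, 10, 10, 12, 12, 14, 15, 67]

nearest = percentile_nearest_rank(data, 0.95)   # picks an observed value
interpolated = percentile_inclusive(data, 0.95)  # lands between 15 and 67
print(nearest, round(interpolated, 1))
```

For this data the position 0.95 * 9 = 8.55 falls between the 9th and 10th sorted values, so the interpolated result is 15 + 0.55 * (67 - 15), while the nearest-rank rule rounds the rank 9.5 up to 10 and returns the largest value.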

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it will be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice over a built-in function of excel or function snippets are appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1-100 (they are percentage values). As each row represents a different parameter, values in a row represents an algorithm's result for this parameter. The results do not depend on each other.
When I take average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced are 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B and for the rest Algorithm B has higher scores but just because of this one result, the difference in average is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent sample T-Test. Meaning you want to compare the means of two independent data sets.
Excel has a function TTEST, that's what you need.
For your example you should probably use two tails and type 2.
The formula will output a probability value known as the probability of alpha error. This is the error you would make if you assumed the two datasets are different when they aren't. The lower the alpha error probability, the higher the chance your sets are different.
You should only accept the difference of the two datasets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances of the two datasets. If equal variances are not given, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
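The difference between the type 2 (equal variances) and type 3 (unequal variances) tests comes down to how the standard error and the degrees of freedom are computed. The Python sketch below shows only that part, using the standard pooled and Welch formulas; the sample scores and function names are made up, and the final step of turning t and the degrees of freedom into a probability is left to Excel's TTEST or a statistics library, since it needs the t-distribution.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """t statistic and degrees of freedom assuming equal variances (type 2)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """t statistic and approximate degrees of freedom without assuming
    equal variances (type 3, Welch's test)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical results for two algorithms (far fewer than the ~30 values
# per dataset recommended above; kept short for readability).
algorithm_a = [2.0, 4.0, 6.0, 8.0]
algorithm_b = [1.0, 2.0, 3.0, 4.0]

t_pooled, df_pooled = pooled_t(algorithm_a, algorithm_b)
t_welch, df_welch = welch_t(algorithm_a, algorithm_b)
print(round(t_pooled, 3), df_pooled, round(t_welch, 3), round(df_welch, 2))
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ; the Welch version has fewer, which makes the resulting probability more conservative when the variances are unequal.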

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved. Every now and then a user enters a price wrongly (e.g. puts in a zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values.
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
That's a great question but may lead to quite a bit of discussion as the answers could be very varied. It depends on
how much effort are you willing to put into this?
could some answers genuinely differ by +/-20% or whatever test you invent? so will there always be need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said the following are possible alternatives.
1. A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
2. The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values of the last 3 months). A normal or Gaussian distribution would enable you to give each value a degree of certainty as to it being a mistake vs. accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming these. Depending on the language you're using, there are also likely to be functions and/or plugins available to help with this.
3. A more advanced method could be to have some sort of learning algorithm that takes other parameters into account (on top of the last x values). A learning algorithm could take the product type or manufacturer into account, for instance, or even monitor the time of day or the user that has entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code it and also to train the learning algorithm.
I think the second option is the correct one for you. Using standard deviation (a lot of languages contain a function for this) may be a simpler alternative. It is simply a measure of how far the value has deviated from the mean of the x previous values. I'd put the standard deviation option somewhere between options 1 and 2.
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
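A minimal Python sketch of the standard deviation rule, assuming one day's prices are already in a list. The prices, the function name, and the cut-off of 2 standard deviations are illustrative only. Note that a large error also inflates the standard deviation it is measured against, so this simple version only works when the wrong values are few and far from the rest.

```python
from statistics import mean, stdev

def drop_outliers(prices, max_sd=2.0):
    """Keep only prices within `max_sd` sample standard deviations of the mean."""
    center = mean(prices)
    spread = stdev(prices)
    if spread == 0:
        return list(prices)  # all values identical, nothing to exclude
    return [p for p in prices if abs(p - center) <= max_sd * spread]

# Hypothetical prices with one "extra zero" entry error.
prices = [100, 101, 99, 102, 1000, 98, 103, 100, 101, 99, 100, 102]
cleaned = drop_outliers(prices)
print(cleaned)
```

For prices that genuinely trend during the day, the same check could be applied to a rolling window of recent values rather than the whole day, so that a real move is not mistaken for an error.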
Or graph a moving average of prices instead of the actual prices.
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (the data point is x% more than the next point, or the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the outlier. If it is an outlier, it will show a sharp upturn followed by a sharp downturn.
If however you care about overall trend, plotting daily trimmed mean, median, 5% and 95% percentiles will portray history well.
Choose your display methods and how much outlier detection you need to do based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
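As a rough Python illustration of why robust summaries make a stray entry error mostly irrelevant, the sketch below compares the plain mean, a trimmed mean, and the median of one made-up day of prices. The `trimmed_mean` helper and its 5% default are assumptions for the example, and the trim percentage is expected to be below 50.

```python
from statistics import mean, median

def trimmed_mean(values, trim_percent=5):
    """Mean after dropping `trim_percent` percent of the values from each end."""
    ordered = sorted(values)
    k = len(ordered) * trim_percent // 100  # number of values cut per end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

# Hypothetical day of prices with a single "extra zero" error at the end.
day_prices = [100, 101, 99, 102, 98, 103, 100, 101, 99, 100,
              102, 97, 100, 101, 99, 100, 98, 102, 101, 1000]

plain = mean(day_prices)
trimmed = trimmed_mean(day_prices)
middle = median(day_prices)
print(round(plain, 2), round(trimmed, 2), middle)
```

The plain mean is pulled far above every genuine price by the one bad entry, while the trimmed mean and the median stay close to 100, so plotting either of those needs no separate outlier test.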
