Can you estimate percentiles in unordered data? - statistics

Suppose you have a very large list of numbers which would be expensive to sort. They are real numbers/decimals but all lie in the same range, say 0 to n for some integer n. Are there any methods for estimating percentiles that don't require sorting the data, i.e. an algorithm with better complexity than the fastest sorting algorithm?
Note: The tag is quantiles only because there is no existing tag for percentiles and it wouldn't let me create one; my question is not specific to quantiles.

In order to find the p-th percentile of a set of N numbers, essentially you are trying to find the k-th smallest number, where k = N*p/100 rounded to an integer (under the common nearest-rank definition it is rounded up; conventions differ slightly, as the median case shows).
You might try the median of medians algorithm, which can find the k-th smallest of N numbers in O(N) worst-case time.
I don't know where this is implemented in a standard library but a proposed implementation
was posted in one of the answers to this question.
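If you just want something practical in Python, linear-time selection is also available through numpy.partition (which uses an introselect-based algorithm). A minimal sketch, using the simple nearest-rank percentile definition; the helper name percentile_via_selection is just for illustration:

```python
import math
import numpy as np

def percentile_via_selection(values, p):
    """Estimate the p-th percentile by selecting the k-th smallest
    element instead of fully sorting.  np.partition is O(N) on average,
    versus O(N log N) for a full sort."""
    n = len(values)
    # 0-based index of the k-th smallest element under the nearest-rank
    # definition (no interpolation between neighbouring values)
    k = max(0, math.ceil(p / 100 * n) - 1)
    return np.partition(np.asarray(values), k)[k]

# usage: 95th percentile of a million unsorted values in [0, 1000)
rng = np.random.default_rng(0)
data = rng.uniform(0, 1000, size=1_000_000)
print(percentile_via_selection(data, 95))
```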

Related

Weighted percentile calculation from group of percentiles

Can we calculate the overall kth percentile if we have the kth percentile over each 1-minute window for the same time period?
The underlying data is not available. Only the kth percentile and count of underlying data is available.
Are there any existing algorithms available for this?
How approximate will the calculated kth percentile be?
No. If you have only one percentile (and count) for every time period, then you cannot reasonably estimate that same percentile for the entire time period.
This is because percentiles are only semi-numerical measures (like Means) and don't implicitly tell you enough about their distributions above and below their measured values at each measurement time. There are a couple of exceptions to the above.
1. If the percentile that you have is the 50th percentile (i.e., the median), then you can do some extrapolation to the median of the whole period, but it's a bit sketchy and I'm not sure how bad the variance would be.
2. If all of your percentile measures are very close together (compared to the actual range of the measured population), then obviously you can use that as a reasonable estimate of the overall percentile.
3. If you can assume with high assurance that every minute's data is an independent sample of the exact same population distribution (i.e., there is no time-dependence), then you may be able to combine them, possibly even if the exact distribution is not fully known (it has parameters that are unknown, but still known to be fixed over the time period). Again, I am not sure what the valid functions and variance calculations are for this.
4. If the distribution is known (or can be assumed) to be a specific function or shape with some unknown value or values, and where time-dependence has a known role in that function, then you should be able to use weighting and time-adjustments to transform into the same situation as #3 above. So for instance, if the distributions were a time-varying exponential of the form pdf(x, t) = (k*t) * e^(-(k*t)*x), then I believe you could derive an overall percentile estimate by estimating the value of k and adjusting it for each different minute (t).
Unfortunately I am not a professional statistician. I have Math/CS background, enough to have some idea of what's mathematically possible/reasonable, but not enough to tell exactly how to do it. If you think that your situation falls into one of the above categories, then you might be able to take it to https://stats.stackexchange.com but you will need to also provide the information I mentioned in those categories and/or detailed and specific information about what you are measuring and how you are measuring it.
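To see concretely why a single per-window percentile (plus a count) is not enough in general, here is a small simulation sketch with made-up distributions: the two scenarios below have identical per-minute 90th percentiles and counts, yet noticeably different overall 90th percentiles.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # values per minute

# minute 1 is the same in both scenarios: p90 is about 9
minute1 = rng.uniform(0, 10, n)

# minute 2: same 90th percentile (about 90) in both scenarios,
# but a very different shape below that point
minute2_a = np.concatenate([rng.uniform(0, 90, int(0.9 * n)),
                            rng.uniform(90, 100, int(0.1 * n))])
minute2_b = np.concatenate([rng.uniform(89, 90, int(0.9 * n)),
                            rng.uniform(90, 100, int(0.1 * n))])

for label, minute2 in [("scenario A", minute2_a), ("scenario B", minute2_b)]:
    overall = np.concatenate([minute1, minute2])
    print(label,
          "per-minute p90:", round(np.percentile(minute1, 90), 1),
          round(np.percentile(minute2, 90), 1),
          "overall p90:", round(np.percentile(overall, 90), 1))

# Both scenarios report per-minute p90s of roughly (9, 90),
# yet the overall p90 comes out near 80 in A and near 89.9 in B.
```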
Based on statistical instinct, the error of such an approximation will be roughly proportional to the standard deviation of the total set, if you are building an estimate for a longer time span out of the discrete per-window kth percentiles. [Clarification may be needed to prove this claim.]

What's the best way to choose a pivot for quicksort?

Some people told me there was a list of optimized pivots for quicksort, but I searched the net and couldn't find it.
Apparently this list contains a lot of prime numbers, but also many others (nowadays we aren't able to explain why these pivots are the best).
So if you know something about it or have some documentation, I'm interested.
If you know another way to optimize quicksort, I'm interested too.
Thanks in advance
One sort that uses a list of numbers is Shellsort, where the numbers are used as the "gaps":
https://en.wikipedia.org/wiki/Shellsort#Gap_sequences
For quicksort, using median of 3 will help, median of 9 helps a bit more, and median of medians guarantees worst-case O(n log(n)) time complexity, but involves a large constant factor that in most cases results in a slower overall quicksort.
https://en.wikipedia.org/wiki/Median_of_medians
Introsort, which uses a reasonable pivot choice (random, median of 3 or 9, ...) and switches to heapsort if the level of recursion becomes too deep, is a common choice.
https://en.wikipedia.org/wiki/Introsort
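As a rough illustration of the median-of-3 idea, here is a readable Python sketch (not in-place and not production quality; real implementations also use insertion sort for small ranges and a recursion-depth limit):

```python
import random

def quicksort(a):
    """Quicksort with a median-of-three pivot (first, middle, last).
    Uses extra lists for clarity instead of in-place partitioning."""
    if len(a) <= 1:
        return a
    first, mid, last = a[0], a[len(a) // 2], a[-1]
    pivot = sorted([first, mid, last])[1]          # median of three
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

data = [random.randint(0, 100) for _ in range(20)]
print(quicksort(data))
```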
In practice, it is hard to do better than simply picking the middle element of the list as the pivot.
Why?
The ideal way to pick a pivot for a list of numbers is to pick one at random. However, the randomization adds a little overhead of its own.
What if we just select the first element as a pivot and say that's somehow "random" for the list?
If the list is already sorted and we select the first element as the pivot, the algorithm degrades to a time complexity of O(n^2) instead of the average O(n log n).
Therefore, to avoid that extra overhead and to avoid degenerating the algorithm on common inputs, the quickest, easiest, and most common fix is to use the middle element of the list as the pivot. This keeps the running time at O(n log n) in practice; it is very hard for the algorithm to degenerate at that point unless the input is deliberately ordered in a way designed to degenerate it.
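A toy comparison count makes the difference visible on already-sorted input (a sketch with simplified counting, not a benchmark of a real quicksort):

```python
import sys

def comparisons(a, pick_middle):
    """Rough comparison count for a simplified quicksort that picks
    either the first or the middle element as pivot. Illustration only."""
    if len(a) <= 1:
        return 0
    pivot = a[len(a) // 2] if pick_middle else a[0]
    less    = [x for x in a if x < pivot]
    greater = [x for x in a if x > pivot]
    return len(a) + comparisons(less, pick_middle) + comparisons(greater, pick_middle)

sys.setrecursionlimit(5_000)            # first-element pivot recurses n deep on sorted input
already_sorted = list(range(2_000))
print("first element pivot :", comparisons(already_sorted, pick_middle=False))  # ~2,000,000: quadratic
print("middle element pivot:", comparisons(already_sorted, pick_middle=True))   # ~20,000: n log n
```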

Find the median of an unsorted array using a heap

Is there a way to find the median of an unsorted array using a heap? If so, is it more efficient than sorting and then taking the middle element?
The trick here is to use two heaps, one a min-heap and the other a max-heap. I will not go into details, but the following points are sufficient to implement the required algorithm.
The top of the min-heap is the smallest element greater than or equal to the median.
The top of the max-heap is the largest element less than or equal to the median.
Now, coming to your second question: it is only more efficient if you want the running median, i.e. the median recomputed after each new element is inserted into the array.
If you want to calculate the median of all the array elements just once, then sorting will be a good idea.
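A minimal Python sketch of this two-heap approach (heapq only provides a min-heap, so the max-heap side stores negated values; the class name RunningMedian is just for illustration):

```python
import heapq

class RunningMedian:
    """Maintain a running median with two heaps:
    `low` is a max-heap (stored negated) holding the smaller half,
    `high` is a min-heap holding the larger half."""
    def __init__(self):
        self.low = []   # max-heap via negation
        self.high = []  # min-heap

    def add(self, x):
        if not self.low or x <= -self.low[0]:
            heapq.heappush(self.low, -x)
        else:
            heapq.heappush(self.high, x)
        # rebalance so the two sizes differ by at most one
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low) + 1:
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        if len(self.high) > len(self.low):
            return self.high[0]
        return (-self.low[0] + self.high[0]) / 2

rm = RunningMedian()
for x in [5, 2, 8, 1, 9, 3]:
    rm.add(x)
    print(x, "->", rm.median())
```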
Hope this helps.

Find the 10 smallest numbers in a large dataset?

I am coding in Python, but right now I am in the phase of designing pseudocode for an algorithm that takes a data set of n values, with n very large, and picks out the 10 smallest values (or, more generally, the m smallest for some finite m << n). I wish to have an optimally efficient algorithm for these requirements.
My idea:
1) Heapsort the data, then pick the smallest 10 values: O(n log(n)).
2) Alternatively, use a loop that scans the data to identify a 'champion' (the current smallest). Once the champion is found, remove it from the data set and repeat the loop 10 times: O(n) (given that m is small).
Which suggestion would be best, or is there another that beats both?
One approach among many possible:
Grab 10 values and sort them. Now compare the largest of the 10 with the 11th through nth values one at a time. Whenever the new value is smaller, replace the current 10th smallest with it and re-sort your 10 values.
The list of 10 values, the sorting of it, etc. will all stay in cache, so it is fast even with rough code. The whole data set is read through only once, so that will be fast as well.
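In Python, this bounded champion-list idea is essentially what heapq.nsmallest does internally (it keeps a heap of at most m candidates), so a sketch can be very short; ten_smallest is a hypothetical wrapper name:

```python
import heapq
import random

def ten_smallest(values, m=10):
    """Return the m smallest values in O(n log m) time and O(m) extra space.
    heapq.nsmallest maintains a bounded heap of m candidates internally,
    which mirrors the 'keep 10 champions and re-sort' approach above."""
    return heapq.nsmallest(m, values)

data = (random.random() for _ in range(1_000_000))  # works on any iterable, even a stream
print(ten_smallest(data))
```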

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm and the values in rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it will be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice over a built-in function of excel or function snippets are appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent each algorithm's result for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced is 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B while for the rest Algorithm B has higher scores, and only because of that one result the difference in averages is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function TTEST; that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability value known as the probability of an alpha error. This is the error you would make if you concluded the two data sets are different when they actually aren't. The lower the alpha error probability, the higher the chance your sets really are different.
You should only accept that the two data sets differ if this value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per data set to be reliable enough, and that the type 2 test assumes equal variances of the two data sets. If equal variances cannot be assumed, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
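If you ever want to sanity-check the Excel result outside of Excel, the same two-sample t-test is available in Python's scipy. A sketch, assuming the two columns are exported as plain lists (the numbers below are hypothetical); equal_var=True corresponds to Excel's type 2 test, equal_var=False to type 3:

```python
from scipy import stats

# hypothetical exported columns: one result per parameter setting
algorithm_a = [71.2, 68.5, 74.0, 69.9, 72.3, 70.1, 73.8, 68.0]
algorithm_b = [63.1, 65.4, 60.8, 66.2, 61.9, 64.7, 62.5, 65.0]

# two-sided, independent-samples t-test (Excel: =TTEST(range_a, range_b, 2, 2))
t_stat, p_value = stats.ttest_ind(algorithm_a, algorithm_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# a small p-value (e.g. below 0.01) suggests the difference in means is significant
```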
