Demerits of the median-of-medians approach for choosing a quicksort pivot

- The randomized approach does not guarantee better time or space complexity.
- Choosing the median as the pivot guarantees the best worst-case time complexity, the same as that of merge sort (also, arguably, making quicksort better than merge sort, since quicksort is easier to code).
And yet people are always taught to use the randomized approach to pick a pivot, even though it is the worse of the two approaches above.
Why?

Related

Use Excel to optimise KFC order

Don't laugh, but, from time to time, my friends and I host a multiple-course KFC dinner, and I have a spreadsheet to optimise the order. This is to make sure we order the right combination of 'bucket'-type items (i.e. SKUs that contain multiple pieces, often of different types):
- to minimise leftover items
- to reduce the total cost
Here is the spreadsheet I currently use, and here's a screenshot:
To use it, you first specify the number of participants in A2, and what you want each person to have in E2:E6 (we're really only interested in the chicken, so 'sides' are treated as generic to simplify).
Here's the manual part, that I'd like to improve.
The next step is to look at the ideal totals for each item (F2:F6), and to try to set the right quantities (H12:H20) of the 'bucket'-type SKUs that I have recreated (A11:G20), so that the output totals (H21:M21) match the ideal totals (F2:F6).
The optimisation part is to get the deltas (H22:M22) as close to zero as possible, and to get the total cost (N21) as low as possible.
So, my question is: is there a way to do this better? I think Excel has some sort of Solver functionality, but I'm afraid I don't know how I'd go about even starting to use that, as my Excel skills are pretty rudimentary. Oh, and in case it makes a difference regarding functionality, I'm using Excel for Mac v16.37.
Any thoughts gratefully appreciated! :)
I can't take any credit for this, but am happy to say that GSerg left me a couple of comments that pointed me in the right direction, and I now have Solver set up to organise my chicken parties!
Here are the parameters for anyone who is curious:
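For anyone who wants to see the logic spelled out outside Excel, here is a minimal brute-force sketch of the same optimisation in Python. The SKU names, piece counts and prices below are made up purely for illustration (the real ones live in A11:G20 of the spreadsheet), and Solver handles far larger searches than this naive loop.

import itertools

# Hypothetical 'bucket'-type SKUs: pieces of each chicken type they contain, plus price.
# Illustrative numbers only -- substitute your own menu data.
skus = {
    "Bucket A": {"pieces": {"original": 6, "hot": 0, "wings": 4}, "price": 19.95},
    "Bucket B": {"pieces": {"original": 3, "hot": 3, "wings": 0}, "price": 14.95},
    "Bucket C": {"pieces": {"original": 0, "hot": 6, "wings": 6}, "price": 22.95},
}

# Ideal totals per chicken type (the F2:F6 column: participants x pieces per person).
ideal = {"original": 8, "hot": 6, "wings": 6}

best = None
max_qty = 5  # try 0..5 of each SKU, plenty for a small party
for qtys in itertools.product(range(max_qty + 1), repeat=len(skus)):
    totals = {k: 0 for k in ideal}
    cost = 0.0
    for qty, sku in zip(qtys, skus.values()):
        cost += qty * sku["price"]
        for piece, n in sku["pieces"].items():
            totals[piece] += qty * n
    if any(totals[k] < ideal[k] for k in ideal):
        continue  # somebody would go hungry -- reject this combination
    leftover = sum(totals[k] - ideal[k] for k in ideal)
    # Minimise leftovers first, then cost -- mirroring the two goals above.
    if best is None or (leftover, cost) < best[0]:
        best = ((leftover, cost), dict(zip(skus, qtys)))

print("order:", best[1], "leftover/cost:", best[0])

Solver does essentially the same job declaratively: presumably the SKU quantities (H12:H20) are the changing cells, constrained to be non-negative integers, with an objective built from the deltas (H22:M22) and the total cost (N21).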

What's the better way to choose pivot for quicksort?

Some people told me there was a list of optimized pivots for quicksort, but I searched the net and couldn't find it.
Supposedly this list contains a lot of prime numbers, but also many other values (and nowadays nobody is able to explain why these pivots are the best).
So if you know something about it or have some documentation, I'm interested.
If you know another way to optimize quicksort, I'm interested too.
Thanks in advance.
One sort that uses a list of numbers is Shellsort, where the numbers are used for the "gaps":
https://en.wikipedia.org/wiki/Shellsort#Gap_sequences
For quicksort, using median of 3 will help, median of 9 helps a bit more, and median of medians guarantees worst-case O(n log(n)) time complexity, but involves a large constant factor that, in most cases, results in a slower overall quicksort.
https://en.wikipedia.org/wiki/Median_of_medians
Introsort, which uses a reasonable pivot choice (random, median of 3 or 9, ...) and switches to heapsort if the recursion becomes too deep, is a common choice.
https://en.wikipedia.org/wiki/Introsort
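To make that trade-off concrete, here is a minimal (non-optimised) sketch of median-of-medians selection in Python; a production quicksort would do this in place rather than building lists:

def median_of_medians(a):
    # Returns a pivot value guaranteed to sit roughly between the 30th and
    # 70th percentile of a, which is what bounds quicksort's worst case.
    if len(a) <= 5:
        return sorted(a)[len(a) // 2]
    # Take the median of each group of 5, then recurse on those medians.
    medians = [sorted(group)[len(group) // 2]
               for group in (a[i:i + 5] for i in range(0, len(a), 5))]
    return median_of_medians(medians)

The repeated grouping and sorting on every partition step is the "large constant factor" mentioned above, which is why practical quicksorts usually prefer median of 3 or 9 and rely on the introsort fallback for the worst case.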
There's no better simple way to pick a pivot than to use the middle element of the list.
Why?
The ideal way to pick a pivot for a list of numbers is to pick one randomly. However, the randomization step adds its own time or space overhead.
What if we just select the first element as the pivot and say that's somehow "random" for the list?
If the list is already sorted and we select the first element as the pivot, the algorithm degenerates to O(n^2) time instead of the average O(n log n).
Therefore, to avoid any extra overhead while keeping the algorithm from degenerating, the quickest, easiest, and most common fix is to use the middle element of the list as the pivot. With that choice the algorithm runs in O(n log n) in practice; it is extremely difficult to make it degenerate unless the input is purposely ordered in a way that does so.
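As a minimal sketch of that suggestion (illustrative only; a real implementation would partition in place rather than build new lists):

def quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]  # middle element: cheap, and safe on already-sorted input
    left = [x for x in a if x < pivot]
    mid = [x for x in a if x == pivot]
    right = [x for x in a if x > pivot]
    return quicksort(left) + mid + quicksort(right)

On an already-sorted list the middle element is the exact median, so the splits stay balanced; a first-element pivot would leave one side empty at every level and give the O(n^2) behaviour described above.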

Calculating percentile - Excel vs online

I have a set of data, say {4,7,7,10,10,12,12,14,15,67}, and I want to know the 95th percentile. I used Excel and an online calculator.
They gave different answers.
In Excel, the formula I used was =PERCENTILE.INC(A1:A10,0.95), and the result was 43.6.
But this online percentile calculator yielded a result of 67.
Which one is right?
First of all, both methods are "right" in the sense that both implement a standard algorithm for computing percentiles. Unlike the mean or median (where all sources use the same approach), there are many different approaches to calculating percentiles. The fundamental issue is that there is no obvious solution to the problem of what to do with percentiles which fall between observations. Do you take the observed value which is closest? Do you interpolate between the two? If so, with what weighting factors do you do the interpolation? Wikipedia discusses nine (!) methods, with both the Excel approach and the approach from that online percentile calculator making the list. See this paper for a very nice discussion of these algorithms.
You can replicate the functionality of that online percentile calculator like so:
=SMALL(A1:A10,CEILING.MATH(COUNT(A1:A10)*0.95))
For example:
The point of using the function SMALL rather than a direct numerical index is that this approach works even if the data isn't sorted.
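If it helps to see the two conventions side by side outside Excel, here is a small Python sketch that reproduces both results for this data set (it assumes 0 < p < 1, which is all we need here):

import math

data = sorted([4, 7, 7, 10, 10, 12, 12, 14, 15, 67])
p = 0.95
n = len(data)

# Excel's PERCENTILE.INC: linear interpolation between the two nearest ranks.
rank = p * (n - 1)                 # zero-based fractional rank, 8.55 here
lo = int(rank)
frac = rank - lo
inc = data[lo] + frac * (data[lo + 1] - data[lo])

# Nearest-rank method (what the online calculator appears to use).
nearest = data[math.ceil(p * n) - 1]

print(round(inc, 2), nearest)      # 43.6 67

Same data, two standard definitions, two different answers; neither is wrong.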

Optimizing Excel formulas - SUMPRODUCT vs SUMIFS/COUNTIFS

According to a couple of websites, SUMIFS and COUNTIFS are faster than SUMPRODUCT (for example: http://exceluser.com/blog/483/excels-sumifs-or-sumproduct-which-is-faster.html). I have a worksheet with an unknown number of rows (around 200,000) and I'm calculating performance reports from the numbers. I have over 6000 almost identical SUMPRODUCT formulas, with a couple of differences each time (only the conditions change).
Here is an example of what I got:
=IF(AFO4>0,
(SUMPRODUCT((Sheet1!$N:$N=$A4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I<>"self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2))
+SUMPRODUCT((Sheet1!$AJ:$AJ=$C4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I="self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2)))/AFO4,0)
Calculating that thing takes a little bit more than 1 second. Since I have more than 6000 of those formulas, it takes a little bit over an hour to calculate everything.
So, I'm now looking at how I could optimize that formula. Could I convert it to SUMIFS? Would it be faster? All I'm adding up here is 0s and 1s, I'm just counting the number of rows in my data source (Sheet1) where the set of conditions is met. Maybe COUNTIFS would work better?
I would appreciate any help to gain some execution time since we need to execute the formulas every month.
I can use VBA if that helps, but I always heard that Excel formulas were usually faster.
Instead of formulas, why not use a PivotTable to crunch the numbers? You potentially face a longer one-time hit to load the data into the PivotCache, but after that, you should find a PivotTable recalculates much faster in response to filter changes than these computationally expensive formulas. Is there any reason you're not using one?
Here's some content from a book I'm writing, where I compare SUMPRODUCT, SUMIFS, DSUM, PivotTables, the Advanced Filter, and something called Range Slicing (which uses clever combinations of INDEX/MATCH on sorted data) to conditionally sum the records in a table that contains over 1 million sales records, based on choices you make from 10 different dropdowns:
Those dropdowns allow you to filter the database by a combination of the Store, Segment, Species, Gender, Payment, Cust. History, Order Status, Delivery Instructions, Membership Type, and Order Channel columns. So there’s some pretty mammoth filtering and aggregation going on in order to reduce those 1 million records down to just one sum. The file outlines six different ways to achieve this outcome, the first three of which are shown in the screenshot below:
As you’d expect, when all those dropdowns are set to the same settings, you get exactly the same answer out of all six approaches. But what you won’t expect is just how slow SUMPRODUCT is to calculate a new answer if you change one of those dropdowns, compared to the other approaches.
In fact, it turns out that the SUMIFS approach is 15 times faster than the SUMPRODUCT one at coming up with the answer on this mammoth dataset. But that’s nothing: The range slicing approach is 56 times faster!
The Range Slicing approach works by sorting your source data and then using a series of formulas in helper columns to identify exactly where any records of interest sit within that sorted data. This means you can then directly sum just the few records that match, rather than having to do a complex criteria match against hundreds of thousands of rows (or against a million rows, as in the example here).
Here’s how that looks in terms of my sample file. The number in the Rows helper column on the right-hand side shows that through some clever elimination, the SUM function at the bottom has to process only 18 rows of data (rows 292996 through 293014) rather than all 1 million rows. In other words, this is mighty efficient.
And here’s the second group of alternatives:
Yup, you can quite easily use a PivotTable here. And the PivotTable approach seems to be around 6 times faster than SUMPRODUCT—although you get a small amount of extra delay when calling up the filters, and the first time you perform a filter operation it takes quite a bit longer again, as Excel must load the PivotCache into memory. But let’s face it: Setting up the PivotTable in the first place is the easiest of any of these approaches, so it has my vote.
The DSUM approach is 12 times faster than SUMPRODUCT. That’s not as good as SUMIFS, but it’s still a significant improvement. The Advanced Filter approach is only 4 times faster than SUMPRODUCT—which isn’t really surprising because what it does is grab an extract of all records from the source data that match the criteria in that list, dump it into the spreadsheet, and then sum the result.
The first SUMPRODUCT could become:
=COUNTIFS(Sheet1!$N:$N,$A4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"<>self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)
The LEFT part can be handled by a wildcard, as shown. Change the second part along the same lines.
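Applying the same conversion to the second SUMPRODUCT (untested, derived directly from its criteria), it should end up as something like:
=COUNTIFS(Sheet1!$AJ:$AJ,$C4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)
The only differences are the $AJ:$AJ criterion against $C4 and "self-serve" in place of "<>self-serve".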

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
That's a great question, but it may lead to quite a bit of discussion as the answers could be very varied. It depends on:
- how much effort you are willing to put into this;
- whether some answers could genuinely differ by +/-20% (or whatever test you invent), so there will always be a need for some human intervention.
Also, to invent a relevant test I'd need to know far more about the subject matter.
That being said, the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values of the last 3 months); a normal or Gaussian distribution would enable you to give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming this. Depending on the language you're using, there are also likely to be functions and/or plugins available to help.
A more advanced method could be a learning algorithm that takes other parameters into account on top of the last x values: the product type or manufacturer, for instance, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code and to train the learning algorithm.
I think the second option is the correct one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative: it is simply a measure of how far a value deviates from the mean of the previous x values. I'd put the standard deviation option somewhere between options 1 and 2; a rough sketch is below.
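A minimal sketch of that standard-deviation check against the previous x values (the window size and threshold here are arbitrary and would need tuning against your own data):

def looks_erroneous(new_price, recent_prices, threshold=3.0):
    # Flag a price more than `threshold` standard deviations away
    # from the mean of the recent prices.
    n = len(recent_prices)
    if n < 10:
        return False                  # not enough history to judge
    mean = sum(recent_prices) / n
    sd = (sum((p - mean) ** 2 for p in recent_prices) / n) ** 0.5
    if sd == 0:
        return new_price != mean
    return abs(new_price - mean) > threshold * sd

# Example: the last 20 entered prices, then a fat-fingered one.
history = [101.0, 100.5, 102.0, 101.5, 99.8, 100.2, 101.1, 100.9, 101.7, 100.4,
           101.3, 100.8, 101.9, 100.1, 100.6, 101.2, 100.7, 101.4, 100.3, 101.0]
print(looks_erroneous(1010.0, history))   # True, an extra zero was typed
print(looks_erroneous(101.8, history))    # False

In practice you would probably flag rather than silently drop, since a genuine large move could also trip this test.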
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point or the last n points, or is more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the suspect point: if it really is an outlier, it will show a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5th and 95th percentiles will portray the history well.
Choose your display methods and how much outlier detection you need to do based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
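For the plotting side, a short sketch of the trimmed-mean idea on one day's prices (the trim fraction is arbitrary):

def trimmed_mean(prices, trim=0.1):
    # Drop the lowest and highest `trim` fraction of values before averaging,
    # so one or two fat-fingered entries can't drag the daily figure around.
    s = sorted(prices)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if k else s
    return sum(core) / len(core)

day = [101.0, 100.5, 102.0, 1015.0, 100.2, 101.1, 100.9, 101.7, 100.4, 99.8]
print(sum(day) / len(day))   # ~192.3, badly skewed by the bad entry
print(trimmed_mean(day))     # ~101.0, the spike is trimmed away

The 5th/95th percentile band mentioned above is the same idea applied to the plot's envelope instead of its centre line.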

Resources