QuickSelect vs. linear search

I am wondering why QuickSelect is supposed to be such a well-performing algorithm for finding an arbitrary element out of an n-sized, unsorted set. I mean, when you go through all elements one by one until you find the desired one, it takes O(n) comparisons - that's the same as QuickSelect's best case and much easier.
Am I missing something essential here? Is there a case where QuickSelect performs better than linear search?

QuickSelect is not for finding a known element; on average it is better at finding the k-th smallest (or largest) item in an unsorted array, which a single linear scan cannot do for arbitrary k, since you don't know which element has rank k without comparing it against the rest.
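To make the difference concrete, here is a minimal QuickSelect sketch in Python (my own illustration, not from the question; the function and variable names are made up):

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed) of an unsorted list.
    Average time is O(n); worst case is O(n^2) with unlucky pivots."""
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]

    if k < len(smaller):                  # answer lies in the smaller partition
        return quickselect(smaller, k)
    if k < len(smaller) + len(equal):     # the pivot itself is the answer
        return pivot
    # otherwise recurse into the larger partition with an adjusted rank
    return quickselect(larger, k - len(smaller) - len(equal))

print(quickselect([7, 1, 5, 3, 9], 2))  # 5, the 3rd smallest
```

A linear scan can only answer "is x in the list?" or "what is the minimum/maximum?"; selecting an arbitrary rank is where QuickSelect earns its O(n) average.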

Related

Is there any way to get the k smallest elements from a list without sorting it in Python?

I want to retrieve the k smallest elements from a list in Python, but with less than O(n log n) complexity (that is, without sorting the list). Is there any way to do so in Python? If yes, please let me know. Thanks in advance.
I think Quickselect is what you are looking for.
Quickselect uses the same overall approach as quicksort, choosing one element as a pivot and partitioning the data in two based on the pivot, accordingly as less than or greater than the pivot. However, instead of recursing into both sides, as in quicksort, quickselect only recurses into one side – the side with the element it is searching for. This reduces the average complexity from O(n log n) to O(n), with a worst case of O(n^2).
-- https://en.wikipedia.org/wiki/Quickselect
I can think of a few sorting and non-sorting solutions for this problem (a sketch of the heap-based options follows the list):
Rank selection algorithm - like quicksort, find the pivot's rank, then decide whether to go left or right; O(N) average time
Build a min-heap in O(N), then extract k times - O(N + k log N) time
Priority queue - keep a max-heap of size k (remove the biggest element whenever a smaller one arrives) while looping through the entire array - O(N log k) time
Bubble sort - bubble the smallest few elements upwards - O(k*N) time
Use heapq.nsmallest, which performs a partial heap-sort
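For reference, a small sketch of the last two heap-based ideas from the list above (my own example data and variable names):

```python
import heapq

data = [9, 4, 7, 1, 8, 2, 6, 3]
k = 3

# Simplest option: heapq.nsmallest does a partial heap-based selection.
print(heapq.nsmallest(k, data))  # [1, 2, 3]

# Bounded max-heap of size k, built by hand (heapq is a min-heap,
# so store negated values to simulate a max-heap).
heap = []
for x in data:
    if len(heap) < k:
        heapq.heappush(heap, -x)
    elif x < -heap[0]:                 # smaller than the current k-th smallest
        heapq.heapreplace(heap, -x)
print(sorted(-v for v in heap))        # [1, 2, 3]
```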

When is it good to use the KMP algorithm?

I understand that the KMP algorithm depends on a helper (failure) array recording prefixes that are also suffixes.
It won't be efficient when that condition is not fulfilled, i.e. when the helper array contains all zeroes.
Would the runtime still be O(m + n)?
If I am right, what is a better substring algorithm in this case?
To understand when KMP is a good algorithm to use, it's often helpful to ask the question "what's the alternative?"
KMP has the nice advantage that it is guaranteed worst-case efficient. The preprocessing time is always O(n), and the searching time is always O(m). There are no worst-case inputs, no probability of getting unlucky, etc. In cases where you are searching for very long strings (large n) inside of really huge strings (large m), this may be highly desirable compared to other algorithms like the naive one (which can take time Θ(mn) in bad cases), Rabin-Karp (pathological inputs can take time Θ(mn)), or Boyer-Moore (worst-case can be Θ(mn)). You're right that KMP might not be all that necessary in the case where there aren't many overlapping parts of the string, but the fact that you never need to worry about whether there's a bad case is definitely a nice thing to have!
KMP also has the nice property that the processing can be done a single time. If you know you're going to search for the same substring lots and lots of times, you can do the O(n) preprocessing work once and then have the ability to search in any length-m string you'd like in time O(m).
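To make the preprocessing concrete, here is a small KMP sketch in Python (my own function names); the failure table it builds is the "helper array" the question refers to, computed once per pattern in O(n) and reusable across searches:

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail

def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text in O(n + m)."""
    fail = build_failure(pattern)          # O(n) preprocessing, n = len(pattern)
    matches, j = [], 0
    for i, ch in enumerate(text):          # O(m) scan, m = len(text)
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = fail[j - 1]                # keep going to find overlapping matches
    return matches

print(kmp_search("abababc", "abab"))  # [0, 2]
```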

Complexities of insertion sort, selection sort, merge sort and radix sort - which one is the best sorting algorithm and why?

What are the complexities of insertion sort, selection sort, merge sort and radix sort, and which one is the best sorting algorithm and why?
I don't believe in the 'best' sorting algorithm. It depends on what you want to do. For instance, bubble sort is really easy to implement and would be the best if you just want a quick and dirty way of sorting a short array. On the other hand, for larger arrays, the time complexity will really come into play and you will notice considerable runtime difference. If you really value memory, then you probably want to evaluate space complexities of these.
So the short answer is: IMHO, there's no best sorting algorithm. I'll leave the following table for you to evaluate for yourself what you want to use.
Sorting algorithm   Avg time complexity   Space complexity
Quicksort           O(n log n)            O(log n)
Mergesort           O(n log n)            O(n)
Insertion sort      O(n^2)                O(1)
Selection sort      O(n^2)                O(1)
Radix sort          O(nk)                 O(n+k)
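As a rough way to see the table's differences in practice, here is a timing sketch (my own, using Python's timeit; a hand-written O(n^2) insertion sort versus the built-in Timsort):

```python
import random
import timeit

def insertion_sort(a):
    """Plain O(n^2) insertion sort, returning a new sorted list."""
    a = list(a)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

for n in (100, 2000):
    data = [random.random() for _ in range(n)]
    t_ins = timeit.timeit(lambda: insertion_sort(data), number=10)
    t_std = timeit.timeit(lambda: sorted(data), number=10)
    print(f"n={n}: insertion sort {t_ins:.4f}s, built-in sort {t_std:.4f}s")
```

For small n the gap is negligible; for larger n the quadratic algorithm falls far behind, which is exactly the "time complexity will really come into play" point above.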

Do you need to sort inputs for the dynamic programming knapsack?

In every single example I've found of the 0/1 knapsack problem solved with dynamic programming, where the items have weights (costs) and profits, it never explicitly says to sort the items, yet in all the examples they happen to be sorted by both weight and profit (higher weights have higher profits). So my question is: when adding items to the matrix from the item array/list, can I add them in any order, or do I have to add the one with the smallest weight or profit first? From the examples I've found, I can't tell whether it's just a coincidence or whether you do in fact need to feed the smallest weight/profit into the matrix each time.
The dynamic programming solution is nothing but trying all the possibilities (as brute force would) in an efficient way, by saving the values of subproblems for future reference.
Note: we consider all subsets. Whether or not the list is sorted, the total number of subsets is the same, so in the end every subset gets considered.
No, you don't need to sort the weights, because every row gives the maximum possible value under the weight limit of that row. The maximum will appear in the last column of that row.
Maybe you are looking at bottom-up dynamic programming solutions. There is one characteristic of a dynamic programming solution when you solve it with the bottom-up method.
The second approach is the bottom-up method. This approach typically depends on some natural notion of the "size" of a subproblem, such that solving any particular subproblem depends only on solving "smaller" subproblems. We sort the subproblems by size and solve them in size order, smallest first. When solving a particular subproblem, we have already solved all of the smaller subproblems its solution depends upon, and we have saved their solutions. We solve each subproblem only once, and when we first see it, we have already solved all of its prerequisite subproblems.
From: Introduction to Algorithms, CORMEN (3rd edition)
The "smaller problem" in this case is just smaller in terms of the number of available items to choose, not about the profits or weights of these items. If given a 3 item list, the sub-problems will be 2 items, and 1 item to choose from.
The smallest problem is first hardcoded (base case), then at each stage of moving from a smaller to bigger problem, the best profit is enumerated, and the max is chosen. At the end, all 2^n combinations would have been considered and the various stages of repeated max will bubble up the largest solution.
Changing the input order, or putting dominated items into the input (like higher weight and lower profit) may just change which argument of max() won at each stage, but the final max result will come from the same selection of items, albeit being selected at different stages in the algorithm for different sort orders or input characteristics.
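A quick way to convince yourself of this is to run a standard bottom-up 0/1 knapsack over every permutation of the same items; a minimal sketch in Python (my own example data):

```python
from itertools import permutations

def knapsack(weights, profits, capacity):
    """Classic bottom-up 0/1 knapsack. dp[w] = best profit achievable with
    the items seen so far and total weight <= w."""
    dp = [0] * (capacity + 1)
    for wt, p in zip(weights, profits):
        for w in range(capacity, wt - 1, -1):   # go downward so each item is used at most once
            dp[w] = max(dp[w], dp[w - wt] + p)
    return dp[capacity]

# The optimum is the same for every input order.
items = [(3, 4), (2, 3), (4, 5), (1, 2)]        # (weight, profit) pairs, deliberately unsorted
results = {knapsack([w for w, _ in perm], [p for _, p in perm], 5)
           for perm in permutations(items)}
print(results)  # {7} - a single value regardless of ordering
```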
The answer can also be explored with some random-shuffle experiments.
What I found: ascending order is better. Correct me if I'm wrong.
gist: https://gist.github.com/whille/39cf7bf8cf5dcf6ac933063735ae54de
Problem described in "Algorithm Design", ISBN: 9780321295354, chapter 6.4.
Two methods could be used:
as the chapter does, a memo table M is pre-calculated, so no sub-answer has to be recomputed.
a recursive function, which is simple to understand and test. I found that functools.cache (Python 3.9+) can be used to check how many sub-calculations are needed; as my gist shows, ascending order in test_random() gives the smallest currsize, so it's the most efficient, and the approach extends to float values.
results for 10 random weights (1~100) and a knapsack capacity of 200:
[(13.527716157276256, 18.371888775465692), (16.18632175987168, 206.88043031085252), (20.14117982372607, 81.52793937986635), (33.28606671929836, 298.8676699147799), (49.12968642850187, 22.037638580809592), (55.279973594800225, 377.3715225559507), (56.56103181962746, 460.9161412820592), (60.38456825749498, 10.721915577913244), (67.98836121062645, 63.47478755362385), (86.49436333909377, 208.06767811169286)]: reverse: False
CacheInfo(hits=0, misses=832, maxsize=None, currsize=832)
[(86.49436333909377, 208.06767811169286), (67.98836121062645, 63.47478755362385), (60.38456825749498, 10.721915577913244), (56.56103181962746, 460.9161412820592), (55.279973594800225, 377.3715225559507), (49.12968642850187, 22.037638580809592), (33.28606671929836, 298.8676699147799), (20.14117982372607, 81.52793937986635), (16.18632175987168, 206.88043031085252), (13.527716157276256, 18.371888775465692)]: reverse: True
CacheInfo(hits=0, misses=1120, maxsize=None, currsize=1120)
Notes for method 2:
if the capacity is much larger, all random orders end up with the same currsize.
call-stack overflow should be avoided for large N, so the recursive method should be transformed. Generally a two-step method can be used: first map the subproblem dependencies, then calculate. I'll try this later in the gist.
I guess sorting might be required in certain types of knapsack problems. For example, consider the problem "Maximum Earnings From Taxi". Here, the input has to be sorted by the riders' starting points, otherwise we won't get the optimal result.
For example, consider the following input for that problem:
9
[[2,3,1],[2,9,2], [3,6,7],[2,3,6]]
If you apply a typical recursive knapsack implementation without sorting the input, you won't get the optimal solution.

Getting fuzzy string matches from database very fast

I have a database of ~150,000 words and a pattern (any single word), and I want to get all words from the database whose Damerau-Levenshtein distance to the pattern is less than a given number. I need to do this extremely fast. What algorithm would you suggest? If there's no good algorithm for Damerau-Levenshtein distance, plain Levenshtein distance is welcome as well.
Thank you for your help.
P.S. I'm not going to use SOUNDEX.
I would start with a SQL function to calculate the Levenshtein distance (in T-SQL or .NET) (yes, I'm an MS person...) with a maximum-distance parameter that causes an early exit.
This function could then be used to compare your input with each string, checking the distance and moving on to the next row once it breaks the threshold.
I was also thinking you could, for example, set the maximum distance to 2, then pre-filter out words whose length differs by more than 1 when the first letter also differs. With an index this may be slightly quicker.
You could also shortcut by bringing back all strings that are perfect matches first (indexing will speed this up), since these would otherwise take just as long to confirm a Levenshtein distance of 0.
Just some thoughts....
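As a rough sketch of the early-exit idea (in Python rather than T-SQL, with my own function names; the row-minimum check is the bail-out):

```python
def levenshtein_within(a, b, max_dist):
    """Return True if the Levenshtein distance between a and b is <= max_dist,
    bailing out as soon as every cell in the current DP row exceeds the bound."""
    if abs(len(a) - len(b)) > max_dist:        # lengths alone already rule it out
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        if min(curr) > max_dist:               # early exit: the distance can only grow
            return False
        prev = curr
    return prev[-1] <= max_dist

words = ["kitten", "sitting", "mitten", "cat"]
pattern = "kitten"
print([w for w in words if levenshtein_within(pattern, w, 2)])  # ['kitten', 'mitten']
```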
I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:
Make it a very fast enumeration (but this doesn't really scale)
Filter initial candidates somehow (index by first letter, require at least x common letters)
Use an alternative (indexable) algorithm, such as n-grams (however, I don't have details on the result quality of n-grams versus D-L distance); see the sketch after this list.
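A rough sketch of the n-gram pre-filtering idea from the last point (my own code, using trigrams; the min_shared threshold is a tunable assumption): candidates that share few trigrams with the query are discarded before any expensive distance computation.

```python
from collections import defaultdict

def trigrams(word):
    """Set of character trigrams, padded so short words still produce some."""
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# Inverted index: trigram -> set of words containing it.
words = ["hello", "hallo", "help", "world", "word"]
index = defaultdict(set)
for w in words:
    for g in trigrams(w):
        index[g].add(w)

def candidates(query, min_shared=2):
    """Words sharing at least min_shared trigrams with the query.
    Only these need an exact (Damerau-)Levenshtein check afterwards."""
    counts = defaultdict(int)
    for g in trigrams(query):
        for w in index[g]:
            counts[w] += 1
    return [w for w, c in counts.items() if c >= min_shared]

print(candidates("hullo"))  # e.g. ['hello', 'hallo'] share enough trigrams
```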
A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that strings sorted lexicographically would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, and stop when it falls below a certain threshold. I have a feeling that this solution would probably only match strings with the same start character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.
Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.
I have used KNIME for fuzzy string matching and got very fast results. It is also very easy to build visual workflows in it. Just install the free edition of KNIME from https://www.knime.org/ and then use the "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy-matching sample workflow here (the input data comes in from the top and the patterns to search for come in from the bottom in this case):
I would recommend looking into Ankiro.
I'm not certain that it meets your requirements for precision, but it is fast.
