I am coding specifically in Python, but right now I am in the phase of designing pseudocode for an algorithm that takes all the data points in a data set of n values, n being very large, and picks out the 10 smallest values (or, more generally, the m smallest values for some finite m << n). I would like an optimally efficient algorithm for these requirements.
My idea:
1) Heapsort the data, then pick the smallest 10 values: O(n log n).
2) Alternatively, use a loop over the data to identify a 'champion' (the current minimum); once the champion is determined, remove it from the data set and repeat, 10 times in total: O(n) (given that m is small).
Which of these suggestions would be best, or is there another?
One approach among many possible:
Grab 10 values and sort them. Now compare the largest of those 10 with the 11th through nth values one at a time. Whenever the new value is smaller, replace your current 10th smallest with it and re-sort your 10 values.
The list of 10 values, the re-sorting, etc. will all stay in cache, so that part is fast even with rough code. The whole data set is read through only once, so that part is fast as well.
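A minimal Python sketch of this running-top-10 idea, using a size-m heap so the "compare against the current largest of the 10" step costs O(log m) instead of a full re-sort (the function name smallest_m is just illustrative):

```python
import heapq
from itertools import islice

def smallest_m(data, m=10):
    """Return the m smallest values of an iterable in O(n log m) time.

    A max-heap of size m holds the best candidates seen so far; heapq is a
    min-heap, so values are stored negated to get max-heap behaviour.
    """
    it = iter(data)
    heap = [-x for x in islice(it, m)]    # seed with the first m values
    heapq.heapify(heap)
    for x in it:
        if x < -heap[0]:                  # -heap[0] is the largest candidate
            heapq.heapreplace(heap, -x)   # drop it, keep the new value
    return sorted(-v for v in heap)
```

In practice the standard library already packages this pattern as heapq.nsmallest(10, data).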
Given an IIR filter like the one shown below, what is its big-O time complexity?
I can't decide whether it's O(n^2), since to compute each output you need to iterate through all of the previous samples, or O(n), since for each output there is only a fixed number x of computations (x being the size of b and a), given that you already have all of the previous outputs.
You don't need to iterate through every sample, only the last k, both for inputs and outputs, with the corresponding time lag.
So, if you had a second-order IIR filter, that would give 2 coefficients in the numerator and 2 in the denominator (if we're talking about transfer functions). So, for a filter of order k, there are (roughly) 2k multiplications and additions. That's for one sample of the output you're trying to calculate.
Now, you apply the filter to a signal of length n, meaning the overall calculation time rises linearly with both k and n, making the full complexity O(k*n).
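For concreteness, here is a pure-Python sketch of the direct-form difference equation behind that counting argument (the names b, a, x mirror the question); in real code scipy.signal.lfilter(b, a, x) does the same thing in optimized form:

```python
def iir_filter(b, a, x):
    """Direct-form IIR filter: a[0]*y[n] = sum_i b[i]*x[n-i] - sum_{j>=1} a[j]*y[n-j].

    Each of the n output samples touches only the last len(b) inputs and
    len(a)-1 previous outputs, so the total cost is O(k*n), not O(n^2).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):               # feed-forward (numerator) taps
            if n - i >= 0:
                acc += bi * x[n - i]
        for j, aj in enumerate(a[1:], start=1):  # feedback (denominator) taps
            if n - j >= 0:
                acc -= aj * y[n - j]
        y.append(acc / a[0])
    return y
```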
I have two sets of tokenised sentences, A and B, and I want to calculate the overlap between them in terms of common tokens. For example, the overlap between the two individual sentences a1 "today is a good day" and b1 "today I went to a park" is 2 (the shared tokens "today" and "a"). I need a simple string-matching method, without fuzzy or advanced methods. So the result is a matrix between all sentences in A and B with an overlap count for each pair.
The problem is that, while trivial, this is a quadratic operation (|A| × |B| pair-wise comparisons). With large data, the computation gets very slow very quickly. What would be a smart way of computing this that avoids the pair-wise comparisons, or does them very fast? Are there packages or data structures particularly good for this?
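Not from this thread, but one standard trick that fits the question: build binary token-incidence matrices for A and B and take a single sparse matrix product, so the full |A| × |B| count matrix comes out of one optimized multiplication instead of an explicit Python double loop. A sketch, assuming exact token matches and that scipy is available:

```python
import numpy as np
from scipy.sparse import csr_matrix

def overlap_matrix(A, B):
    """|A| x |B| matrix of distinct-common-token counts between two token lists.

    Entry (i, j) is the dot product of the binary token-indicator vectors of
    A[i] and B[j], i.e. the number of tokens the two sentences share.
    """
    vocab = {}

    def incidence(sentences):
        rows, cols = [], []
        for r, sent in enumerate(sentences):
            for tok in set(sent):                 # distinct tokens only
                cols.append(vocab.setdefault(tok, len(vocab)))
                rows.append(r)
        return rows, cols

    ra, ca = incidence(A)
    rb, cb = incidence(B)
    n_vocab = len(vocab)
    MA = csr_matrix((np.ones(len(ra), dtype=np.int32), (ra, ca)), shape=(len(A), n_vocab))
    MB = csr_matrix((np.ones(len(rb), dtype=np.int32), (rb, cb)), shape=(len(B), n_vocab))
    return (MA @ MB.T).toarray()

# overlap_matrix([["today", "is", "a", "good", "day"]],
#                [["today", "I", "went", "to", "a", "park"]])  ->  [[2]]
```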
Let's say we have an OrderedList with int years in it. I have to find the years between year i and year n, but I'm having trouble finding the nearest element bigger than year i. I need something faster than linear time complexity (O(n)).
(Guessing you mean a sorted list rather than an ordered list, since an ordered list wouldn't make much sense here. Otherwise, sort the list before searching.)
Try binary search. What you do is the following:
Say you have a list of length n and are looking for the value k
Take the element at position n/2
If the element is smaller than k, take the right half of the list, starting at n/2 and ending at n
If the element is larger than k, take the left half of the list, starting at 0 and ending at n/2
If the element is equal k, end here.
Repeat the search until your upper and lower limits are only one position apart, or until you hit an element equal to k. In that case take that element as your lower limit and the element right after it as your upper limit. The upper limit is the value you are looking for: the nearest element bigger than k. The lower limit is the largest number less than or equal to k.
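In Python this whole procedure is already packaged in the standard library's bisect module; a sketch for the year-range use case (whether the end points are inclusive is an assumption here):

```python
from bisect import bisect_right

def years_between(years, start_year, end_year):
    """Return the slice of a *sorted* list of years in (start_year, end_year].

    bisect_right finds each insertion point in O(log n), so locating the
    nearest element strictly greater than start_year needs no linear scan.
    """
    lo = bisect_right(years, start_year)  # first index with years[lo] > start_year
    hi = bisect_right(years, end_year)    # first index with years[hi] > end_year
    return years[lo:hi]
```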
For example, if I have a range of numbers from 1 to 27 and I want to split that range into 7 sub-ranges, an answer given n = 27, m = 7 (using tuples to denote the start/end of each range) might be:
[(1,4),(5,8),(9,12),(13,16),(17,20),(21,24),(25,27)]
I've started trying to come up with some good heuristics (based on the size of a mod b) that produce a decent answer each time (i.e. the difference between the largest sub-range and the smallest sub-range is less than some value), but surely this has been implemented in Haskell somewhere before.
My use case is finding a set of ranges for parallel processing of enumerations in a given range, so I want to find the ranges so that all parallel processes finish their processing at approximately the same time.
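Though the question is about Haskell, the even-split arithmetic is small enough to sketch directly in Python (the language used elsewhere on this page; the function name split_range is illustrative). Giving the first n mod m sub-ranges one extra element reproduces the example above and keeps the largest and smallest sub-range within one element of each other:

```python
def split_range(n, m, start=1):
    """Split the integers start..n into m contiguous sub-ranges whose sizes
    differ by at most one, e.g. split_range(27, 7) ->
    [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 24), (25, 27)].
    """
    total = n - start + 1
    size, extra = divmod(total, m)   # base chunk size and leftover elements
    ranges = []
    lo = start
    for k in range(m):
        hi = lo + size - 1 + (1 if k < extra else 0)  # first `extra` chunks get one more
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges
```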
Suppose you have a very large list of numbers which would be expensive to sort. They are real numbers/decimals but all lie in the same range, say 0 to n for some integer n. Are there any methods for estimating percentiles that don't require sorting the data, i.e. an algorithm with better complexity than the fastest sorting algorithm?
Note: The tag is quantiles only because there is no existing tag for percentiles and it wouldn't let me create one; my question is not specific to quantiles.
In order to find the p-th percentile of a set of N numbers, essentially you are trying to find the k-th smallest number, where k = N*p/100 (rounded down, I think; or, on second thought, thinking of the median for example, maybe it's rounded up).
You might try the median-of-medians algorithm, which can find the k-th smallest number among N numbers in O(N) time.
I don't know where this is implemented in a standard library, but a proposed implementation was posted in one of the answers to this question.
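Not median-of-medians itself, but numpy's selection routine np.partition (introselect under the hood) gives a similar O(N)-expected behaviour without a full sort; a sketch, leaving the exact rounding convention for k as the same open question as above:

```python
import numpy as np

def percentile_by_selection(values, p):
    """Estimate the p-th percentile via selection rather than sorting.

    np.partition places only the k-th order statistic in its sorted position,
    with expected O(N) cost instead of the O(N log N) of a full sort.
    """
    a = np.asarray(values, dtype=float)
    k = int(len(a) * p / 100)        # k-th smallest; rounding convention debatable
    k = min(max(k, 0), len(a) - 1)   # clamp to a valid index
    return np.partition(a, k)[k]
```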