searching and sorting - search

If the list has 1024 items (lg 1024 = 10), at what point (in number of searches) does sorting the list first and using binary search pay off, instead of using sequential search? How does your answer change if the list has 2048 items?

Where the "linear access" curve crosses the "binary search" curve depends on how long it takes to access/insert a single item versus how many items there are. This will be different for every combination of compiler, memory and cpu architecture, type of data/node in the list, the distribution of data values, what sort and insertion algorithms you use, etc... But with a "large enough" set of items, the running time can be described by mentioning how its upper bound grows with increasing number of items, even though that "Big-O" bound may not precisely describe any particular run.
You can work it out precisely if you know the specific algorithm you will insert or search with, determine the actual instructions that make up your list accesses, find out how many clock cycles they take to execute, and so on.
Then you can say for sure which one is faster, and at which point. And if you know your data values, you can model them. But if you don't, you have to make assumptions (for example, what if your inserted data values are already ordered? How does that affect your sort or insertion function?).
For example, a single item retrieval may take 1us. Comparing two items may take 0.5us. Doing a sorted list insertion with 100 items in the list might require X number of retrievals, Y number of compares, and Z number of updates/writes.... Whereas an unordered list might require more or less depending on what's already there and what you're inserting.

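If your list is unsorted, each search takes O(n). Sorting with quicksort costs O(n log n) on average, after which each binary search is O(log n). Let x be the number of searches. Ignoring constant factors, the break-even point is where x * n = x * log n + n * log n; by plugging in different values you can estimate the dynamics. With n = 1024 (log n = 10) this gives x = 10240 / 1014 ≈ 10.1, so my rough estimate is that once the number of searches is greater than ~10, it is more efficient to sort first. For n = 2048 (log n = 11) the same equation gives x = 22528 / 2037 ≈ 11.1, so the answer barely changes.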
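The break-even equation above can be checked numerically. The following is a minimal sketch under the same simplified cost model (n per sequential search, log2 n per binary search, n log2 n for the sort, no constant factors); the function name `break_even_searches` is made up for illustration.

```python
import math

def break_even_searches(n):
    """Smallest number of searches x for which sorting first is cheaper.

    Cost model (an assumption, constants ignored):
      sequential:        x * n
      sort then binary:  n * log2(n) + x * log2(n)
    """
    lg = math.log2(n)
    x = 1
    # Keep increasing x while sequential search is still at least as cheap.
    while x * n <= x * lg + n * lg:
        x += 1
    return x

for n in (1024, 2048):
    print(n, break_even_searches(n))
```

Under this model the crossover grows only logarithmically with n, which is why doubling the list size hardly moves it. Real constants (cache behaviour, comparison cost, the sort actually used) can shift the number considerably.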

Related

Data Structure Question: Is there a link between the size of a list in a chaining implementation of hash maps and its load factor?

For example, if I have n keys and m slots in the hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? Again, I'm talking about an average. Thanks in advance!
I'm trying to learn data structures.
As you say, the average size of a single list is generally going to be the table's load factor; but this is assuming that the "Simple Uniform Hashing Assumption" holds with your hash table (more specifically, with its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, as well as independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from this we can conclude that the probability for an item to be sorted into a given bucket is always 1/m, regardless of any other circumstances; from this it directly follows that the expected (average) length of a single bucket's list will be n/m (we insert n elements into the table, and for each one, sort it into this given list at a probability of 1/m).
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason that it is important to choose a good hash function for your use case: without it, the assumption may not hold which could have a harmful effect on your lookup times.
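To make the role of the assumption concrete, here is a small sketch that counts chain lengths for two sets of keys. The toy hash `key % m` and the key sets are assumptions chosen for illustration, not taken from any real hash table implementation.

```python
def chain_lengths(keys, m):
    """Count how many keys land in each of the m buckets (toy hash: key % m)."""
    lengths = [0] * m
    for key in keys:
        lengths[key % m] += 1
    return lengths

n, m = 1000, 100
spread_keys = list(range(n))                # hash evenly across all buckets
clustered_keys = [i * m for i in range(n)]  # engineered: every key hashes to bucket 0

spread = chain_lengths(spread_keys, m)
clustered = chain_lengths(clustered_keys, m)

# The mean over all buckets is n/m in both cases, because it is just n divided by m.
print(sum(spread) / m, max(spread))
print(sum(clustered) / m, max(clustered))
```

The mean over all buckets is always n/m, but only with well-spread keys does each individual chain stay close to that length. With the engineered keys a single chain holds all n elements, so a lookup in that bucket degrades to a linear scan.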

Can I write a hash function say `hashit(a, b, c)` taking no more than O(N) space

Given integers a,b,c such that
-N<=a<=N,
0<=b<=N,
0<=c<=10
Can I write a hash function, say hashit(a, b, c), taking no more than O(N) address space?
My naive thought was to write it as
a+2N*b+10*2N*N*c
That's like O(20N*N) space, so it won't suffice for my needs.
Let me elaborate on my use case: I want the tuple (a,b,c) as the key of a hashmap. Basically a, b, c are arguments to a function which I want to memoize. In Python, @lru_cache does this perfectly without any issue for N=1e6, but when I try to write the hash function myself I get a memory overflow. So how does Python do it?
I am working with N of the order of 10^6.
This code works:
from functools import lru_cache

@lru_cache(maxsize=None)
def myfn(a, b, c):
    # some logic
    return 100
But if I write the hash function myself like this, it doesn't. So how does Python do it?
# N and myhashtable are defined elsewhere in the asker's code
def hashit(a, b, c):
    return a + 2*N*b + 2*N*N*c

def myfn(a, b, c):
    key = hashit(a, b, c)
    if key in myhashtable:
        return myhashtable[key]
    # some logic
    myhashtable[key] = 100
    return myhashtable[key]
To directly answer your question of whether it is possible to find an injective hash function from a set of size Θ(N^2) to a set of size O(N): it isn't. The very existence of an injective function from a finite set A to a set B implies that |B| >= |A|. This is similar to trying to give a unique number out of {1, 2, 3} to each member of a group of 20 people.
However, do note that hash functions do oftentimes have collisions; the hash tables that employ them simply have a method for resolving those collisions. As one simple example for clarification, you could for instance hold an array such that every possible output of your hash function is mapped to an index of this array, and at each index you have a list of elements (so an array of lists where the array is of size O(N)), and then in the case of a collision simply go over all elements in the matching list and compare them (not their hashes) until you find what you're looking for. This is known as chain hashing or chaining. Some rare manipulations (re-hashing) on the hash table based on how populated it is (measured through its load factor) could ensure an amortized time complexity of O(1) for element access, but this could over time increase your space complexity if you actually try to hold ω(N) values, though do note that this is unavoidable: you can't use less space than Θ(X) to hold Θ(X) values without any extra information (for instance: if you hold 1000000 unordered elements where each is a natural number between 1 and 10, then you could simply hold ten counters; but in your case you describe a whole possible set of elements of size 11*(N+1)*(2N+1), so Θ(N^2)).
This method would, however, ensure a space complexity of O(N+K) (equivalent to O(max{N,K})) where K is the amount of elements you're holding; so long as you aren't trying to simultaneously hold Θ(N^2) (or however many you deem to be too many) elements, it would probably suffice for your needs.
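The chaining scheme described above can be sketched as follows. This is a minimal illustration, not how Python itself does it: the class name `ChainedTable` and its methods are hypothetical, there is no re-hashing, and the bucket count is fixed at construction. It relies on Python's built-in `hash()` for tuples.

```python
class ChainedTable:
    """Fixed-size hash table with chaining: an array of lists of (key, value)."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Many different keys may map to the same index; that is expected.
        return self.buckets[hash(key) % len(self.buckets)]

    def get(self, key, default=None):
        # Compare the keys themselves, not their hashes, to resolve collisions.
        for stored_key, value in self._bucket(key):
            if stored_key == key:
                return value
        return default

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                bucket[i] = (key, value)  # overwrite existing entry
                return
        bucket.append((key, value))

table = ChainedTable(num_buckets=8)
table.put((-3, 2, 1), 100)
table.put((5, 0, 10), 200)
print(table.get((-3, 2, 1)), table.get((5, 0, 10)), table.get((0, 0, 0)))
```

Memory use is proportional to the number of buckets plus the number of stored entries, regardless of how large the space of possible (a, b, c) tuples is. CPython's built-in dict, which `functools.lru_cache` uses for storage, also tolerates collisions, although it resolves them with open addressing rather than chaining.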

Is there any way to get the k smallest elements from a list without sorting it in Python?

I want to retrieve the k smallest elements from a list in Python, but I want to achieve this with less than O(n log n) complexity (that is, without sorting the list). Is there any way to do so in Python? If yes, please let me know. Thanks in advance.
I think Quickselect is what you are looking for.
Quickselect uses the same overall approach as quicksort, choosing one element as a pivot and partitioning the data in two based on the pivot, accordingly as less than or greater than the pivot. However, instead of recursing into both sides, as in quicksort, quickselect only recurses into one side – the side with the element it is searching for. This reduces the average complexity from O(n log n) to O(n), with a worst case of O(n^2).
-- https://en.wikipedia.org/wiki/Quickselect
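A quickselect-style routine for the k smallest elements might look like the sketch below. It is an illustrative variant that builds new lists rather than partitioning in place, uses a random pivot, and returns the k smallest elements in no particular order; the function name `k_smallest` is made up.

```python
import random

def k_smallest(items, k):
    """Return the k smallest elements of a list, in arbitrary order.

    Average O(n) time, worst case O(n^2), like quickselect.
    """
    if k <= 0:
        return []
    if k >= len(items):
        return list(items)

    pivot = random.choice(items)
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]

    if k <= len(less):
        # All k answers are on the "less" side; recurse only there.
        return k_smallest(less, k)
    if k <= len(less) + len(equal):
        # Everything in "less" plus some copies of the pivot.
        return less + equal[:k - len(less)]
    # Everything in "less" and "equal", plus the rest from the "greater" side.
    return less + equal + k_smallest(greater, k - len(less) - len(equal))

print(sorted(k_smallest([5, 1, 4, 1, 3, 9, 2], 3)))
```

Only one side is recursed into at each step, which is what brings the average cost down to linear. The final `sorted` call in the usage line is only there to make the printed result deterministic.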
I can think of a few sorting and non-sorting solutions to this problem:
Rank selection algorithm - like quicksort, find the pivot's rank, then decide whether to go left or right - O(N) average time
Build a min-heap in O(N), then extract k times - O(N + k log N) time
Priority queue - loop through the entire array while keeping a max-heap of size k (remove the biggest element whenever a smaller one arrives) - O(N + N log k) time
Bubble sort - bubble the k smallest elements upwards - O(k*N) time
Use heapq.nsmallest, which performs a partial heap sort
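Two of the options listed above can be sketched with the standard `heapq` module. The helper name `k_smallest_bounded_heap` is hypothetical, and the negation trick used to get a max-heap assumes numeric items.

```python
import heapq

def k_smallest_bounded_heap(items, k):
    """Keep a max-heap of at most k items while scanning once: O(N log k)."""
    heap = []  # stores negated values, so heap[0] is minus the largest kept item
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif k > 0 and x < -heap[0]:
            # x is smaller than the largest kept item: swap it in.
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)

data = [5, 1, 4, 1, 3, 9, 2]
print(heapq.nsmallest(3, data))          # library routine
print(k_smallest_bounded_heap(data, 3))  # hand-rolled equivalent
```

For most purposes `heapq.nsmallest(k, data)` is the simplest choice; the hand-rolled version is shown only to make the size-k heap idea explicit.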

fast, semi-accurate sort in linux

I'm going through a huge list of files in Linux, the output of a "find" (directory walk). I want to sort the list by filename, but I'd like to begin processing the files as soon as possible.
I don't need the sort to be 100% correct.
How can I do a "partial sort", that might be off some of the time but will output quickly?
This is StackOverflow, not SuperUser, so an algorithm answer should be enough for you.
Try implementing HeapSort. But instead of sorting the full list of names, do the following.
1. Pick a constant M. The smaller it is, the more "off" it will be and the "faster" the algorithm will start printing the results. In the limiting case where M is equal to the number of all names, it will be an exact sorting algorithm.
2. Load the first M elements, heapify() them.
3. Take the lowest element from the heap, print it. Put next unsorted name into its place, then do siftDown().
4. Repeat until you run out of unsorted names. Do a standard HeapSort on the elements left in the heap.
This algorithm is linear in the number of names (for a constant M) and starts printing names as soon as the first M of them have been read. Step 2 is O(M) == O(1). Step 3 is O(log M) == O(1) and is repeated O(N) times, hence the total is O(N).
This algorithm will try to keep the large elements in the heap as long as possible while pushing the lowest elements from the heap as quickly as possible. Hence the output will look as if it was almost sorted.
IIRC, a variant of this algorithm is actually what GNU sort does before switching to on-disk MergeSort to keep sorted runs of data as long as possible and minimize number of on-disk merges.
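The bounded-heap procedure described in this answer can be sketched as a Python generator. This is an illustration under assumptions: the function name `approx_sorted` is made up, and `heapq.heapreplace` stands in for the "take the lowest, put the next name in its place, sift down" step.

```python
import heapq
from itertools import islice

def approx_sorted(names, m):
    """Yield names in roughly sorted order using a heap of at most m items."""
    if m < 1:
        raise ValueError("m must be at least 1")
    it = iter(names)
    heap = list(islice(it, m))  # load the first m names
    heapq.heapify(heap)
    for name in it:
        # Pop the smallest name and push the next unsorted one in a single step.
        yield heapq.heapreplace(heap, name)
    while heap:
        # Input exhausted: drain what is left in exact sorted order.
        yield heapq.heappop(heap)

print(list(approx_sorted(["d", "a", "c", "b"], 2)))
print(list(approx_sorted(["d", "a", "c", "b"], 4)))
```

Because it is a generator, the first name is available as soon as m names have been read, so it can sit in the middle of a pipeline that reads names from standard input line by line. With m at least as large as the input, the output is fully sorted.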

hashmap remove complexity

So a lot of sources say the hashmap remove function is O(1), but I don't see how this could be unless a hashmap were backed by a linkedlist because list removals are O(n). Could someone explain?
You can view a HashMap as an array. Imagine you want to store objects for all humans on earth somewhere. You could just assign a unique number to everyone and use an array with a dimension of 10*10^20.
If someone is born, she/he gets the next free number and is added to the end. If someone dies, her/his number is used to find the array entry, which is set to null.
You can easily see that to add or remove someone you need only constant time: calculate the array address, done (if you have random access memory).
What does the HashMap add? There are two motivations. On the one hand, you do not want such a big array: if you only want to store 10 people from all over the world, nearly all entries of the array are free. On the other hand, not all data you want to store has a unique number. Sometimes the same number occurs multiple times, some numbers do not show up at all, and sometimes you do not have any number. Therefore, you define a function which takes the big numbers from the input and reduces them to numbers in a smaller range. This reduction should be done in such a way that the resulting number is most likely unique for different inputs.
Example: Let's say you want to store 10 numbers from 1 to 100000000. You could use an array with 100000000 indices. Or you could use an array with 100 indices and the function f(x) = x % 100. If you have the number 1234, f(1234) = 34. Mark 34 as assigned.
Now you could ask, what happens if you have the number 2234? Then we have a collision. You need some strategy to handle this, and there are several. Study some literature or ask specific questions about this.
If you want to store a string, you could imagine using its length or the sum of the ASCII values of its characters.
As you see, we can easily store something and easily access it again. What do we have to do? Calculate the hash with the function (constant time for a good function), access the array (constant time), store or remove (constant time).
In the real world, a good hash function is not that easy. Try to stick with the ones included in Java.
If you want to read more details, the wikipedia article about hash table is a good starting point: http://en.wikipedia.org/wiki/Hash_table
I don't think the remove(key) complexity is strictly O(1). If we have a big hash table with many collisions, it would be O(n) in the worst case. It is very rare to hit the worst case, but we can't neglect the fact that O(1) is not guaranteed.
If your HashMap's bucket array is backed by linked lists, the worst case of the remove function is O(n).
If your HashMap's bucket array is backed by balanced binary trees, the worst case of the remove function is O(log n).
The best case and the average case (expected complexity) of the remove function are O(1).
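To see why removal cost depends on the length of one bucket rather than on the whole table, here is a toy sketch using the f(x) = x % 100 function from the earlier answer with list-based chaining. The class name `ModuloTable` and its methods are hypothetical and only handle non-negative integers.

```python
class ModuloTable:
    """Toy set of non-negative integers using f(x) = x % 100 and chaining."""

    def __init__(self, size=100):
        self.slots = [[] for _ in range(size)]

    def _chain(self, x):
        # Constant time: compute the index and jump straight to that slot.
        return self.slots[x % len(self.slots)]

    def add(self, x):
        chain = self._chain(x)
        if x not in chain:
            chain.append(x)

    def contains(self, x):
        return x in self._chain(x)

    def remove(self, x):
        # Only this one chain is scanned, never the whole table.
        chain = self._chain(x)
        if x in chain:
            chain.remove(x)
            return True
        return False

table = ModuloTable()
for value in (1234, 2234, 7):  # 1234 and 2234 collide in slot 34
    table.add(value)
removed = table.remove(2234)
print(removed, table.contains(1234), table.contains(2234))
```

With few collisions each chain holds a handful of items, so removal is effectively constant time. If every key landed in the same slot, the scan inside `remove` would touch all n items, which is the O(n) worst case mentioned above.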

Resources