Why LinkedList as a bucket implementation for HashMap?

I understand the workings of HashMap - collisions and everything - but I'm trying to understand the deeper mechanics behind the choice of a linked list for the entry bucket rather than, say, an array (which would make it a two-dimensional matrix), since searching through both is an O(n) operation. I assumed one factor that might favour the linked list is its O(1) insertion. Is that a correct assumption?

In Java, the linked list is the bucket: each slot of the table holds a linked list of entry elements, and each node knows the next node in the list, until you reach the end where the next reference is null.
Also see:
How does Java's HashMap work internally?
An array has a predetermined size, so if you used an array for each bucket, every possible slot would be allocated up front and your hash table would become very big. That only makes sense if you have a huge amount of memory; if not, use a linked list and walk through it to find the match.

It's because you do not add at the end of the linked list but at the head, which makes the insertion O(1).
[Answering the question below - no, I'm not looking for a solution :) - I just wanted to understand why the choice was made for buckets to be an internal implementation of a linked list and not an array-based list (I didn't mean a fixed-size array). Linked lists offer a node-based implementation that certainly helps when there are a large number of inserts into the data structure - but when you are adding to the end of the list, shouldn't any growable array-based implementation for the bucket suffice?]
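As a rough illustration of the head-insertion point above, here is a minimal Python sketch (Node and put_in_bucket are made-up names, and the duplicate-key check a real HashMap.put performs is left out):

# Hypothetical sketch: inserting at the head of a bucket's chain is O(1),
# because no traversal of the existing entries is needed.
class Node:
    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node

def put_in_bucket(buckets, index, key, value):
    # The new entry becomes the new head; the old head is linked behind it.
    buckets[index] = Node(key, value, buckets[index])

buckets = [None] * 16
put_in_bucket(buckets, 3, "a", 1)
put_in_bucket(buckets, 3, "b", 2)   # colliding entry, still constant time

(Newer Java versions append at the tail instead and turn long chains into trees, but the constant-time idea for a plain insert is the same.)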

Related

Data Structure Question: Is there a link between the size of a list in a chaining implementation of hash maps and its load factor?

For example, if I have n keys and m slots in the hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? Again, I'm talking about an average. Thanks in advance!
I'm trying to learn data structures.
As you say, the average size of a single list is generally going to be the table's load factor; but this is assuming that the "Simple Uniform Hashing Assumption" holds with your hash table (more specifically, with its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, as well as independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from this we can conclude that the probability for an item to be sorted into a given bucket is always 1/m, regardless of any other circumstances; from this it directly follows that the expected (average) length of a single bucket's list will be n/m (we insert n elements into the table, and for each one, sort it into this given list at a probability of 1/m).
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason that it is important to choose a good hash function for your use case: without it, the assumption may not hold which could have a harmful effect on your lookup times.
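To see the n/m figure concretely, here is a small simulation sketch; random.randrange simply stands in for a hash function that satisfies SUHA by construction:

import random
from collections import Counter

n, m = 100_000, 1_024          # n keys thrown into m buckets
counts = Counter(random.randrange(m) for _ in range(n))
lengths = [counts.get(b, 0) for b in range(m)]

print(sum(lengths) / m)   # exactly n/m (~97.7), the load factor
print(max(lengths))       # the longest chain stays close to that average

With a uniform, independent placement the average chain length is exactly n/m and no single bucket grows dramatically beyond it; with an adversarial input like the "attack" described above, the longest chain would instead approach n.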

Why is it not possible to get dictionary values in O(1) time?

Can we write a data structure which will search directly by taking the values in O(1) time?
For example, in this code in python3, we can get morse code by taking the keys and output the values.
morse = {'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.',
         'F': '..-.', 'G': '--.', 'H': '....', 'I': '..', 'J': '.---',
         'K': '-.-', 'L': '.-..', 'M': '--', 'N': '-.', 'O': '---',
         'P': '.--.', 'Q': '--.-', 'R': '.-.', 'S': '...', 'T': '-',
         'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-', 'Y': '-.--',
         'Z': '--..', '1': '.----', '2': '..---', '3': '...--', '4': '....-',
         '5': '.....', '6': '-....', '7': '--...', '8': '---..', '9': '----.',
         '0': '-----'}
n = input()
n = ''.join(i.upper() for i in n if i != ' ')
for i in n:
    print(morse[i], end=' ')
This gives the output:
>>>
S O S
... --- ...
If we want to search by taking the morse code as input and giving the string as output:
>>>
... --- ...
S O S
how do we do that without making another dictionary of morse code?
Please provide the proper reasoning and what are the limitations if any.
Python dictionaries are hash maps behind the scenes. The keys are hashed to achieve O(1) lookups. The same is not done for values for a few reasons, one of which is the reason @CrakC mentioned: the dict doesn't have to have unique values. Maintaining an automatic reverse lookup would be inconsistent at best. Another reason could be that fundamental data structures are best kept to a minimum set of operations for predictability reasons.
Hence the correct & common pattern is to maintain a separate dict with key-value pairs reversed if you want to have reverse lookups in O(1). If you cannot do that, you'll have to settle for greater time complexities.
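A minimal sketch of that pattern (the Morse table is abbreviated here; it works because the values happen to be unique):

# Build the reverse mapping once: O(n) to build, O(1) per reverse lookup afterwards.
morse = {'A': '.-', 'B': '-...', 'S': '...', 'O': '---'}   # abbreviated table
reverse_morse = {code: letter for letter, code in morse.items()}

print(' '.join(reverse_morse[c] for c in '... --- ...'.split()))   # S O S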
Yes, getting the key from its value in a dictionary is not possible in O(1) time in Python, and the reason is straightforward. The keys in a dictionary are unique, i.e. there cannot be two entries in the dictionary for the same key, but the inverse is not always true: unique keys might map to non-unique values. It should also be noted that the immutable (hashable) nature of the keys is what defines the structure of the dictionary: since they are unique, they can be indexed, so fetching the value of a given key executes in O(1) time. The inverse, as explained above, cannot be realized in O(1) time and will always take O(n) time on average. The most important point here is that Python dictionaries are not meant to be used this way.
Further reading: http://stupidpythonideas.blogspot.in/2014/07/reverse-dictionary-lookup-and-more-on.html
Can we write a data structure which will search directly by taking the values in O(1) time?
The answer to that question would be yes: a hash map (hash table).
Following your example: Python dictionaries are implemented as hash maps, which is why key lookups are O(1). But, as I understand it, your real problem is how to search for the key by its value in O(1) as well. Dictionaries being hash maps, even if Python provided that reverse-search functionality (I am not 100% sure it doesn't), it wouldn't be O(1), because hash maps are not designed to provide it.
You can see this by looking at how hash maps work: you would need a hash function that mapped the key and the value to the same index in the array, which, if not impossible, is pretty hard to achieve.
I guess your best option is to define the inverse dictionary. It's not that uncommon to sacrifice memory to achieve better times.
As CrakC has correctly stated, it is not possible to get the key from its value in O(1) time; you will need to traverse the dictionary once, in O(n) time, to search for it. Since you do not want to create another dictionary, this is your only option.
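For completeness, this is what the O(n) fallback looks like without a second dictionary (key_for_value is just an illustrative helper name):

def key_for_value(d, target):
    # Linear scan over all items: O(n) in the size of the dictionary.
    for key, value in d.items():
        if value == target:
            return key
    return None

print(key_for_value({'S': '...', 'O': '---'}, '---'))   # O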

Mapping arbitrary objects to indices

Let's assume that we have some objects (strings, for example). It is well known that working with indices (i.e. with numbers 1,2,3...) is much more convenient than with arbitrary objects.
Is there any common way of assigning an index to each object? One could create a hash_map and store the index in the value, but that would be memory-expensive when the number of objects is too high to fit in memory.
Thanks.
You can store the string objects in a sorted file.
This way, you are not storing the objects in memory.
Your mapping function can search for the required object in the sorted file.
You can create a hash map to optimize the search.
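A rough sketch of the idea, with an in-memory sorted list standing in for the sorted file (objects and index_of are made-up names; a real implementation would binary-search the file on disk instead of a list):

import bisect

objects = sorted(["apple", "banana", "cherry", "date"])   # stand-in for the sorted file

def index_of(obj):
    # Binary search gives the object's position in the sorted order: O(log n).
    i = bisect.bisect_left(objects, obj)
    if i < len(objects) and objects[i] == obj:
        return i
    raise KeyError(obj)

print(index_of("cherry"))   # 2

The position in the sorted order then serves as the object's index, so no per-object mapping has to be kept in memory.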

How Strings are stored in a VBA Dictionary structure?

As I am currently playing with a huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist), I used a Scripting.Dictionary just for the keyed access feature it has.
Everything was looking fine except that it was somehow slow in loading the strings and that it uses a lot of memory. For an example of 100,000 strings of 128 characters in length, the Task Manager showed approximately 295 MB at the end of the sub, and when setting Dictionary = Nothing a poor 12 MB remained in Excel. Even considering the internal Unicode conversion of strings, 128 * 2 * 100,000 gives 25.6 MB! Can someone explain this big difference?
Here is all the info I could find on the Scripting.Dictionary:
According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary, he occasionally mentions it)
Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and the current max index in the Hash Table. That number is used as the index to store your string into the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning that if you are looking up a string by its key, there is no need to search through the array as you normally would.
The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.
The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). Much of the overhead you see in a Hash Table is therefore the empty part of the array that has to be there to make the hash table efficient.
Additionally, resizing the Hash Table is an expensive operation because all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might get doubled in size to avoid having to do so again soon.
The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.
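To make the above concrete, here is a minimal chained hash table sketch in Python. It is not the Scripting.Dictionary's actual code, just an illustration of chaining, the deliberately empty slots, and the rehash-everything cost of a resize:

class ChainedHashTable:
    def __init__(self, capacity=8, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]   # most slots stay empty on purpose
        self.size = 0
        self.max_load = max_load

    def _index(self, key):
        return hash(key) % len(self.buckets)           # key -> slot in the array

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):            # walk the (short) chain
            if k == key:
                bucket[i] = (key, value)               # overwrite an existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            self._resize()                             # expensive: every key is rehashed

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        self.size = 0
        for bucket in old:
            for k, v in bucket:
                self.put(k, v)

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

d = ChainedHashTable()
d.put("hello", 1)
print(d.get("hello"))   # 1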
If I find anything else out, I'll add it here...

hashmap remove complexity

So a lot of sources say the hashmap remove function is O(1), but I don't see how this could be unless a hashmap were backed by a linkedlist because list removals are O(n). Could someone explain?
You can view a HashMap as an array. Imagine you want to store objects for every human on earth somewhere. You could just give everyone a unique number and use an array with a dimension of 10*10^20.
If someone is born, she/he gets the next free number and is added at the end. If someone dies, her/his number is no longer used and the array entry is set to null.
You can easily see that adding or removing someone takes only constant time: calculate the array address, done (if you have random-access memory).
What does the HashMap add? There are two motivations. On the one hand, you do not want such a big array: if you only want to store 10 people from all over the world, nearly all entries of the array are free. On the other hand, not all data you want to store has a unique number: sometimes the same number occurs multiple times, some numbers do not show up at all, and sometimes you do not have any number. Therefore, you define a function which takes the big numbers from the input and reduces them to numbers in a smaller range. This reduction should be done in such a way that the resulting number is most likely unique for different inputs.
Example: let's say you want to store 10 numbers from 1 to 100000000. You could use an array with 100000000 indices. Or you could use an array with 100 indices and the function f(x) = x % 100. If you have the number 1234, then f(1234) = 34, so mark index 34 as assigned.
Now you could ask what happens if you also have the number 2234: then we have a collision. You need some strategy to handle this; there are several. Study some literature or ask specific questions about it.
If you want to store a string, you could imagine using its length or the sum of the ASCII values of its characters as the number.
As you see, we can easily store something and easily access it again. What do we have to do? Calculate the hash with the function (constant time for a good function), access the array (constant time), and store or remove (constant time).
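A toy version of the f(x) = x % 100 example above, with chaining to handle the 1234/2234 collision (all names are illustrative):

table = [[] for _ in range(100)]   # 100 buckets instead of 100000000 slots

def f(x):
    return x % 100                 # the reducing hash function

def add(x):
    table[f(x)].append(x)          # constant time: hash, then append to the bucket

def remove(x):
    table[f(x)].remove(x)          # the chain is short, so roughly constant time

def contains(x):
    return x in table[f(x)]

add(1234)
add(2234)               # f(1234) == f(2234) == 34: a collision, both share bucket 34
print(table[34])        # [1234, 2234]
remove(1234)
print(contains(2234))   # True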
In the real world, a good hash function is not that easy to write. Try to stick with the ones included in Java.
If you want to read more details, the wikipedia article about hash table is a good starting point: http://en.wikipedia.org/wiki/Hash_table
I don't think the remove(key) complexity is always O(1). If we have a big hash table with many collisions, then it would be O(n) in the worst case. It is very rare to hit the worst case, but we can't neglect the fact that O(1) is not guaranteed.
If your HashMap's buckets are backed by linked lists, the worst case of the remove function is O(n).
If your HashMap's buckets are backed by balanced binary trees, the worst case of the remove function is O(log n).
The best case and the average case (amortized complexity) of the remove function is O(1).
