Why does a HashMap contain a LinkedList instead of an AVL tree?

The instructor in this video explains that hash map implementations usually contain a linked list to chain values in case of collisions. My question is: Why not use something like an AVL tree (that takes O(log n) for insertions, deletions and lookups), instead of a linked list (that has a worst case lookup of O(n))?
I understand that hash functions should be designed such that collisions would be rare. But why not implement AVL trees anyway to optimize those rare cases?

It depends on the language implementing HashMap; I don't think this is a strict rule.
For example in Java:
What your video says is true up to Java 7.
In Java 8, the implementation of HashMap was changed to make use of red-black trees once the bucket grows beyond a certain point.
While a bucket holds 8 entries or fewer, it is a singly linked list. Once it grows beyond 8 entries it is converted to a tree, and it reverts to a singly linked list once it shrinks back to 6.
Why not just use a tree all the time? I guess this is a tradeoff between memory footprint vs lookup complexity within the bucket. Keep in mind that most hash functions will yield very few collisions, so maintaining a tree for buckets that have a size of 3 or 4 would be much more expensive for no good reason.
For reference, this is the Java 8 implementation of HashMap (its comments give quite a good explanation of how the whole thing works, and of why 8 and 6 were chosen as the TREEIFY_THRESHOLD and UNTREEIFY_THRESHOLD values):
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java?av=f
And in Java 7:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/HashMap.java?av=f
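To make the collision case concrete, here is a minimal sketch. BadKey is a hypothetical key type invented for this example: its hashCode() is deliberately constant, so every entry lands in the same bucket. Because the keys are Comparable, a Java 8+ HashMap is expected to turn that bucket into a tree, but the code below only shows that the map stays correct; it does not observe the internal representation.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeysDemo {
    // Hypothetical key type: every instance reports the same hash code,
    // so all keys collide into a single bucket.
    static final class BadKey implements Comparable<BadKey> {
        final int id;

        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }

        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        // Ordering consistent with equals; tree bins can use it to search.
        @Override public int compareTo(BadKey other) {
            return Integer.compare(id, other.id);
        }
    }

    // Builds a map whose n entries all share one bucket.
    static Map<BadKey, Integer> fill(int n) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = fill(10_000);
        System.out.println(map.size());
        System.out.println(map.get(new BadKey(9_999)));
    }
}
```

Dropping the Comparable implementation keeps the map correct as well, but lookups among equal hash codes can then no longer be narrowed down by ordering.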

Related

Why isn't bitwise trie a popular implementation of associative array

I have to store billions of entries with Int64 keys in an ordered map.
If I use a usual BST, each search operation costs log(N) pointer dereferences (20-30 for millions to billions of entries),
whereas a bitwise trie with bitmap reduces this to just Ceil(64/6) = 11 pointer dereferences.
This comes at the cost of an array for all 64 children in each trie node, but I think that applying the usual array list growth strategy to this array and reusing previously allocated but discarded arrays will mitigate some of the space wastage.
I'm aware a variant of this data structure is called HAMT and used as an effective persistent data structure, but this question is about a usual ordered map like std::map in C++, besides I need no deletions of entries.
However there are a few implementations of this data structure on github.
Why aren't bitwise tries as popular as binary search trees?
Disclaimer: I'm the author of https://en.wikipedia.org/wiki/Bitwise_trie_with_bitmap
"Why aren't bitwise tries as popular as binary search trees?"
Good question, and I don't know the answer. Making bitwise tries more popular is exactly the reason I published the Wikipedia article.
"This comes at a cost of an array for all 64 children in each trie node but I think that applying the usual array list growth strategy to this array and reusing previously allocated but discarded arrays will mitigate some problems with space wastage."
Nope: that is exactly where the bitmap comes in. It avoids having an array sized for all 64 possible children.
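A rough sketch of how the bitmap avoids the 64-slot array, under the assumption that each node covers one 6-bit chunk of the key. The class and method names are made up for illustration and are not taken from the Wikipedia article. Only the children that actually exist are stored, and a population count over the lower bits of the bitmap gives a child's position in the compact array.

```java
class BitmapNode {
    // Bit i is set when a child exists for 6-bit slot i (0..63).
    private long bitmap = 0L;
    // Only present children are stored, ordered by slot number.
    private Object[] children = new Object[0];

    // Position of a slot inside the compact array: the number of
    // present children that have a smaller slot number.
    private int indexOf(int slot) {
        return Long.bitCount(bitmap & ((1L << slot) - 1));
    }

    Object get(int slot) {
        if ((bitmap & (1L << slot)) == 0) {
            return null; // no child for this slot
        }
        return children[indexOf(slot)];
    }

    void put(int slot, Object child) {
        int idx = indexOf(slot);
        if ((bitmap & (1L << slot)) != 0) {
            children[idx] = child; // overwrite the existing child
            return;
        }
        // Grow by exactly one element and shift the tail to keep slot order.
        Object[] grown = new Object[children.length + 1];
        System.arraycopy(children, 0, grown, 0, idx);
        grown[idx] = child;
        System.arraycopy(children, idx, grown, idx + 1, children.length - idx);
        children = grown;
        bitmap |= 1L << slot;
    }

    int size() {
        return children.length;
    }

    public static void main(String[] args) {
        BitmapNode node = new BitmapNode();
        node.put(5, "a");
        node.put(63, "z");
        node.put(0, "first");
        System.out.println(node.get(5) + " " + node.get(63) + " " + node.size());
    }
}
```

Growing by one element per insert keeps the array exactly as large as the number of children; a real implementation might trade some of that tightness for fewer allocations.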

Haskell alternative for Doubly-linked-list coupled with Hash-table pattern

There's a useful pattern in imperative programming, namely, a doubly-linked-list coupled with a hash-table for constant time lookup in the linked list.
One application of this pattern is in LRU cache. The head of the doubly-linked-list will contain the least recently used entry in the cache and the last element in the doubly-linked-list will contain the most recently used entry. The keys in the hash-table are keys of the entries and the values are pointers to nodes in the linked-list corresponding to the key/entry. When an entry is queried in the cache, hash-table will be used to point to its node in the linked-list and then the node will be removed from its current location in the linked-list and be placed at the end of the linked-list making it the most-recently-used entry. For eviction, we simply remove entries from the head of the linked-list as they are the least recently used ones. Both lookup and eviction operations will take constant time.
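For reference, the imperative pattern described above can be sketched in Java with java.util.LinkedHashMap, whose documentation describes it as a hash table with a doubly-linked list running through its entries. The LruCache class name and the capacity parameter are assumptions made for this example. With accessOrder set to true, each get moves the entry to the most-recently-used end, and removeEldestEntry evicts from the other end.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // Third argument true = access order: get() and put() move the
        // entry to the most-recently-used end of the internal list.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by put() after an insertion; returning true drops the
    // least-recently-used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // evicts "b", the least recently used entry
        System.out.println(cache.keySet());
    }
}
```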
I can think of implementing this in Haskell using two TreeMaps and I know that the time complexity will be O(log n). But I am a little uncomfortable as the constant factor in the time complexity seems a little high. Specifically, to perform a look-up, first I need to check if the entry exists and save its value, then I need to first delete it from the LRU map and re-insert it with a new key. This means that each lookup will result in a root-to-node traversal three times.
Is there a better way of doing this in Haskell?
As comments indicate, mutable vectors are perfectly acceptable when required. However, I think there's an issue with the way you've stated the question - unless the idea is to duplicate "as closely as possible" (without mutable structures) the imperative code, why bother having 2 treemaps? A single priority search queue (see packages pqueue or PSQueue) would be an appropriate structure whilst maintaining purity. It supports efficiently both priorities (for eviction) and searching (for lookups of your desired cached argument).
On a related note, some structures support operations such as Data.Map's alterF, which effectively provides you with a continuation allowing you to "do something else" depending on the Maybe value at a key while "remembering" where you are, thus avoiding the full cost of re-traversing the structure to subsequently modify it at this key. See also the at lens.

Is the time complexity of HashMap get() and put() operations O(1) at all times? [duplicate]

We are used to saying that HashMap get/put operations are O(1). However it depends on the hash implementation. The default object hash is actually the internal address in the JVM heap. Are we sure it is good enough to claim that the get/put are O(1)?
Available memory is another issue. As I understand from the javadocs, the HashMap load factor should be 0.75. What if we do not have enough memory in JVM and the load factor exceeds the limit?
So, it looks like O(1) is not guaranteed. Does it make sense or am I missing something?
It depends on many things. It's usually O(1), with a decent hash which itself is constant time... but you could have a hash which takes a long time to compute, and if there are multiple items in the hash map which return the same hash code, get will have to iterate over them calling equals on each of them to find a match.
In the worst case, a HashMap has an O(n) lookup due to walking through all entries in the same hash bucket (e.g. if they all have the same hash code). Fortunately, that worst case scenario doesn't come up very often in real life, in my experience. So no, O(1) certainly isn't guaranteed - but it's usually what you should assume when considering which algorithms and data structures to use.
In JDK 8, HashMap has been tweaked so that if keys can be compared for ordering, then any densely-populated bucket is implemented as a tree, so that even if there are lots of entries with the same hash code, the complexity is O(log n). That can cause issues if you have a key type where equality and ordering are different, of course.
And yes, if you don't have enough memory for the hash map, you'll be in trouble... but that's going to be true whatever data structure you use.
It has already been mentioned that hashmaps are O(n/m) on average, if n is the number of items and m is the size. It has also been mentioned that in principle the whole thing could collapse into a singly linked list with O(n) query time. (This all assumes that calculating the hash is constant time.)
However, what isn't often mentioned is that, with probability at least 1-1/n (so for 1000 items that's a 99.9% chance), the largest bucket won't hold more than O(log n) entries, hence matching the average complexity of binary search trees. (And the constant is good; a tighter bound is (log n)*(m/n) + O(1).)
All that's required for this theoretical bound is that you use a reasonably good hash function (see Wikipedia: Universal Hashing; it can be as simple as a*x>>m), and of course that the person giving you the values to hash doesn't know how you have chosen your random constants.
TL;DR: With very high probability, the worst-case get/put complexity of a hashmap is O(log n).
I'm not sure the default hashcode is the address - I read the OpenJDK source for hashcode generation a while ago, and I remember it being something a bit more complicated. Still not something that guarantees a good distribution, perhaps. However, that is to some extent moot, as few classes you'd use as keys in a hashmap use the default hashcode - they supply their own implementations, which ought to be good.
On top of that, what you may not know (again, this is based on reading the source; it's not guaranteed) is that HashMap stirs the hash before using it, to mix entropy from throughout the word into the bottom bits, which is where it's needed for all but the hugest hashmaps. That helps deal with hashes that specifically don't do that themselves, although I can't think of any common cases where you'd see that.
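As an illustration of that stirring step, here is a small sketch. The spread function mirrors the one-line mixing used by OpenJDK 8's HashMap.hash(), which XORs the upper 16 bits into the lower 16; earlier JDK versions used a different formula, and the class and method names here are made up. The table index is taken from the low bits, so two hash codes that differ only in their upper bits collide unless they are mixed first.

```java
class HashSpreadDemo {
    // Fold the upper 16 bits into the lower 16, as OpenJDK 8's
    // HashMap.hash() does with the key's hashCode().
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a power-of-two table size: the low bits of the hash.
    static int index(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        int a = 0x10000;
        int b = 0x20000;
        // Unmixed, both hashes fall into bucket 0 of a 16-slot table.
        System.out.println(index(a, 16) + " " + index(b, 16));
        // After mixing, they fall into different buckets.
        System.out.println(index(spread(a), 16) + " " + index(spread(b), 16));
    }
}
```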
Finally, what happens when the table is overloaded is that it degenerates into a set of parallel linked lists - performance becomes O(n). Specifically, the number of links traversed will on average be half the load factor.
I agree with:
the general amortized complexity of O(1)
a bad hashCode() implementation could result in multiple collisions, which means that in the worst case every object goes to the same bucket, thus O(N) if each bucket is backed by a List.
since Java 8, HashMap dynamically replaces the Nodes (linked list) used in each bucket with TreeNodes (a red-black tree, once a list gets bigger than 8 elements), resulting in a worst-case performance of O(log N).
But, this is not the full truth if we want to be 100% precise. The implementation of hashCode() and the type of key Object (immutable/cached or being a Collection) might also affect real time complexity in strict terms.
Let's assume the following three cases:
HashMap<Integer, V>
HashMap<String, V>
HashMap<List<E>, V>
Do they have the same complexity? Well, the amortised complexity of the 1st one is, as expected, O(1). But, for the rest, we also need to compute hashCode() of the lookup element, which means we might have to traverse arrays and lists in our algorithm.
Let's assume that the size of all of the above arrays/lists is k.
Then, HashMap<String, V> and HashMap<List<E>, V> will have O(k) amortised complexity and, similarly, O(k + log N) worst case in Java 8.
*Note that using a String key is a more complex case, because it is immutable and Java caches the result of hashCode() in a private variable hash, so it's only computed once.
/** Cache the hash code for the string */
private int hash; // Default to 0
But the above also has its own worst case, because Java's String.hashCode() implementation checks whether hash == 0 before computing the hash code. There are non-empty Strings whose hash code is zero, such as "f5a5a608", in which case memoization might not help.
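A small sketch of the O(k) hashing cost and of the zero-hash corner case. The polyHash helper is written for this example; it follows the polynomial documented for String.hashCode() (h = 31 * h + c over all characters), so it visits all k characters of the key. List.hashCode() is documented to use the same scheme over its elements, starting from 1.

```java
import java.util.Arrays;

class KeyHashCostDemo {
    // Same polynomial as the one documented for String.hashCode():
    // one multiply-add per character, hence O(k) for a key of length k.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Matches the built-in hash code.
        System.out.println(polyHash("hello") == "hello".hashCode());
        // A non-empty string whose hash code is zero.
        System.out.println("f5a5a608".hashCode());
        // A List key is hashed by walking all of its elements.
        System.out.println(Arrays.asList(1, 2, 3).hashCode());
    }
}
```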
HashMap performance depends on the hashCode implementation. In the ideal scenario, where a good hash implementation provides a unique hash code for every object (no hash collisions), the best, worst and average cases would all be O(1).
Now consider a scenario where a bad implementation of hashCode always returns 1, or some other hash that collides. In this case the time complexity would be O(n).
As for the second part of the question, about memory: yes, memory constraints are taken care of by the JVM.
In practice it is O(1), but this is actually a terrible and mathematically nonsensical simplification. The O() notation describes how an algorithm behaves as the size of the problem tends to infinity. HashMap get/put behaves like an O(1) algorithm only up to a limited size. That limit is fairly large from the point of view of computer memory and addressing, but it is far from infinity.
When one says that HashMap get/put is O(1), what it should really mean is that the time needed for get/put is more or less constant and does not depend on the number of elements in the map, as long as the map can be represented on the actual computing system. If the problem grows beyond that size and we need larger maps, then eventually the number of bits describing one element must also increase, because we run out of distinct describable elements. For example, if we used a hash map to store 32-bit numbers and later increased the problem size so that the map holds more than 2^32 elements, then the individual elements would have to be described with more than 32 bits.
The number of bits needed to describe the individual elements is log(N), where N is the maximum number of elements, therefore get and put are really O(log N).
If you compare it with a tree set, which is O(log n), then a hash set is O(log(max(n))), and we simply feel that this is O(1) because, on a given implementation, max(n) is fixed (it is the size in bits of the objects we store) and the algorithm calculating the hash code is fast.
Finally, if finding an element in any data structure were O(1), we would be creating information out of thin air. Given a data structure of n elements, I can select one element in n different ways, and with that choice I can encode log(n) bits of information. If I could encode that in zero bits (which is what O(1) would mean), then I would have created an infinitely compressing ZIP algorithm.
In simple words: if each bucket contains only a single node, the time complexity is O(1). If a bucket contains more than one node, the time complexity is O(size of that bucket's linked list), which is at most n and usually far smaller.
Hence the average-case time complexity of put(K,V) is governed by the load:
nodes(n)/buckets(N) = λ (lambda)
Example : 16/16 = 1
Time complexity will be O(1)
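The same arithmetic as a small sketch, together with the point at which a default java.util.HashMap grows its table (capacity 16, load factor 0.75, so the table is resized once the entry count exceeds 12). The helper names are made up for this example.

```java
class LoadFactorDemo {
    // Average number of nodes per bucket: lambda = n / N.
    static double averageChainLength(int entries, int buckets) {
        return (double) entries / buckets;
    }

    // Entry count above which a table of the given capacity is resized.
    static int resizeThreshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(averageChainLength(16, 16));
        System.out.println(resizeThreshold(16, 0.75f));
    }
}
```

Since resizing keeps the load bounded by the load factor, the average chain length stays below 1 no matter how many entries are added, which is where the average-case O(1) comes from.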
Java HashMap time complexity
--------------------------------
get(key), containsKey(key), remove(key)             Best case   Worst case
HashMap before Java 8, using LinkedList buckets     O(1)        O(n)
HashMap since Java 8, using LinkedList buckets      O(1)        O(n)
HashMap since Java 8, using Binary Tree buckets     O(1)        O(log n)

put(key, value)                                     Best case   Worst case
HashMap before Java 8, using LinkedList buckets     O(1)        O(n)
HashMap since Java 8, using LinkedList buckets      O(1)        O(n)
HashMap since Java 8, using Binary Tree buckets     O(1)        O(log n)

(put must first scan the bucket for an existing key, so its worst case on a LinkedList bucket is O(n), the same as get.)
Hints:
Before Java 8, HashMap uses LinkedList buckets only.
Since Java 8, HashMap uses either LinkedList buckets or Binary Tree buckets, depending on the bucket size:
    if (bucket size > TREEIFY_THRESHOLD [8]):
        treeifyBin: the bucket becomes a balanced red-black tree
    if (bucket size <= UNTREEIFY_THRESHOLD [6]):
        untreeify: the bucket becomes a LinkedList again (plain mode)

Haskell data structure that is efficient for swapping elements?

I am looking for a Haskell data structure that stores an ordered list of elements and that is time-efficient at swapping pairs of elements at arbitrary locations within the list. It's not [a], obviously. It's not Vector because swapping creates new vectors. Which data structure is efficient at this?
The most efficient implementations of persistent data structures, which exhibit effectively O(1) updates (strictly O(log32 n) with the usual 32-way branching) as well as appending, prepending, counting and slicing, are based on the Array Mapped Trie algorithm. The Vector data structures of Clojure and Scala are based on it, for instance. The only Haskell implementation of that data structure that I know of is provided by the "persistent-vector" package.
This algorithm is quite young: it was first presented in 2000, which might be why not many people have heard of it. But it turned out to be such a universal solution that it was adapted for hash tables soon after. The adapted version is called the Hash Array Mapped Trie. It is also used in Clojure and Scala to implement the Set and Map data structures, and it is more widespread in Haskell, with packages like "unordered-containers" and "stm-containers" revolving around it.
To learn more about the algorithm I recommend the following links:
http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
Data.Sequence from the containers package would likely be a not-terrible data structure to start with for this use case.
Haskell is a (nearly) pure functional language, so any data structure you update will need to make a new copy of the structure, and re-using the data elements is close to the best you can do. Also, the new list would be lazily evaluated and typically only the spine would need to be created until you need the data. If the number of updates is small compared to the number of elements, you could make a difference list that checks a sparse set of updates first, and only then looks in the original vector.

Is it possible to efficiently search a bit-trie for values less than the key?

I am currently storing a large number of unsigned 32-bit integers in a bit trie (effectively forming a binary tree with a node for each bit in the 32-bit value.) This is very efficient for fast lookup of exact values.
I now want to be able to search for keys that may or may not be in the trie and find the value for the first key less than or equal to the search key. Is this efficiently possible with a bit trie, or should I use a different data structure?
I am using a trie due to its speed and cache locality, and ideally want to sacrifice neither.
For example, suppose the trie has two keys added:
0x00AABBCC
0x00AABB00
and I am now searching for a key that is not present, 0x00AABB11. I would like to find the first key present in the tree with a value <= the search key, which in this case would be the node for 0x00AABB00.
While I've thought of a possible algorithm for this, I am seeking concrete information on if it is efficiently possible and/or if there are known algorithms for this, which will no doubt be better than my own.
We can think of a bit trie as a binary search tree; in fact, it is one. Take the 32-bit trie as an example, and let the left child stand for 0 and the right child for 1. At the root, the left subtree holds the numbers less than 0x80000000 and the right subtree holds the numbers no less than 0x80000000, and so on down the levels. So you can use the same method you would use in a binary search tree to find the largest item not larger than the search key. Don't worry about the backtracking: it is bounded by the depth of the trie, so it does not change the search complexity.
When a match fails in the bit trie, backtrack to the nearest node on the search path at which the key's bit is 1 and a left (0) child exists, then follow the right-most path down that left subtree.
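A minimal sketch of that search, assuming a plain one-bit-per-level trie without deletions. The BitTrie class is written for this example and is not the asker's implementation. Instead of physically walking back up, it remembers the most recent left subtree it passed while taking a right branch, which is the same backtracking target. Keys are unsigned 32-bit values carried in an int, and floor returns -1 when no stored key is less than or equal to the search key.

```java
class BitTrie {
    private static final class Node {
        Node zero; // child for a 0 bit (left)
        Node one;  // child for a 1 bit (right)
    }

    private final Node root = new Node();

    // Inserts a key by creating the full 32-level path, most significant bit first.
    void insert(int key) {
        Node n = root;
        for (int bit = 31; bit >= 0; bit--) {
            if (((key >>> bit) & 1) == 0) {
                if (n.zero == null) n.zero = new Node();
                n = n.zero;
            } else {
                if (n.one == null) n.one = new Node();
                n = n.one;
            }
        }
    }

    // Largest stored key <= key (compared as unsigned), or -1 if there is none.
    long floor(int key) {
        Node n = root;
        Node fallback = null; // nearest left subtree skipped by going right
        int fallbackBit = -1; // bit position at which that subtree branches off
        boolean found = true;
        for (int bit = 31; bit >= 0; bit--) {
            Node next;
            if (((key >>> bit) & 1) == 1) {
                if (n.zero != null) {
                    fallback = n.zero;
                    fallbackBit = bit;
                }
                next = n.one;
            } else {
                next = n.zero;
            }
            if (next == null) {
                found = false;
                break;
            }
            n = next;
        }
        if (found) return key & 0xFFFFFFFFL; // exact match
        if (fallback == null) return -1L;    // every stored key is larger

        // Keep the key's bits above fallbackBit, with fallbackBit itself 0 ...
        long result = (key & 0xFFFFFFFFL) >>> (fallbackBit + 1) << (fallbackBit + 1);
        // ... then take the right-most path through the fallback subtree.
        Node m = fallback;
        for (int bit = fallbackBit - 1; bit >= 0; bit--) {
            if (m.one != null) {
                result |= 1L << bit;
                m = m.one;
            } else {
                m = m.zero;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitTrie trie = new BitTrie();
        trie.insert(0x00AABBCC);
        trie.insert(0x00AABB00);
        System.out.println(Long.toHexString(trie.floor(0x00AABB11)));
    }
}
```

The search touches at most 32 nodes going down and at most 31 more on the right-most descent, so the cost stays proportional to the key width.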
If the data is static--you're not adding or removing items--then I'd take a good look at using a simple array with binary search. You sacrifice cache locality, but that might not be catastrophic. I don't see cache locality as an end in itself, but rather a means of making the data structure fast.
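For the static case, a sketch of the sorted-array approach using java.util.Arrays.binarySearch. The unsigned 32-bit keys are held in a long[] here to sidestep Java's signed int ordering, and the class name and the -1 "not found" convention are assumptions of this example. When the key is absent, binarySearch returns -(insertion point) - 1, and the element just before the insertion point is the answer.

```java
import java.util.Arrays;

class SortedArrayFloor {
    // sortedKeys must be sorted ascending.
    // Returns the largest element <= key, or -1 if there is none.
    static long floor(long[] sortedKeys, long key) {
        int pos = Arrays.binarySearch(sortedKeys, key);
        if (pos >= 0) {
            return sortedKeys[pos]; // exact match
        }
        int insertionPoint = -pos - 1; // index of the first element > key
        return insertionPoint == 0 ? -1L : sortedKeys[insertionPoint - 1];
    }

    public static void main(String[] args) {
        long[] keys = {0x00AABB00L, 0x00AABBCCL};
        System.out.println(Long.toHexString(floor(keys, 0x00AABB11L)));
    }
}
```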
You might get better cache locality by creating a balanced binary tree in an array. Position 0 is the root node, position 1 is left node, position 2 is right node, etc. It's the same structure you'd use for a binary heap. If you're willing to allocate another 4 bytes per node, you could make it a left-threaded binary tree so that if you search for X and end up at the next larger value, following that left thread would give you the next smaller value. All told, though, I don't see where this can outperform the plain array in the general case.
A lot depends on how sparse your data is and what the range is. If you're looking at a few thousand possible values in the range 0 to 4 billion, then the binary search looks pretty attractive. If you're talking about 500 million distinct values, then I'd look at allocating a bit array (500 megabytes) and doing a direct lookup with linear backward scan. That would give you very good cache locality.
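The bit-array idea can be sketched with java.util.BitSet, whose previousSetBit method performs the backward scan. One caveat: BitSet indices are non-negative ints, so this sketch only covers keys up to 2^31 - 1; covering the full unsigned 32-bit range would need a hand-rolled long[] instead. The class name is made up for the example.

```java
import java.util.BitSet;

class BitArrayFloor {
    // Returns the largest set index <= key, or -1 if no such bit is set.
    static int floor(BitSet present, int key) {
        return present.previousSetBit(key);
    }

    public static void main(String[] args) {
        BitSet present = new BitSet();
        present.set(0x00AABB00);
        present.set(0x00AABBCC);
        System.out.println(Integer.toHexString(floor(present, 0x00AABB11)));
    }
}
```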
A bit trie walks 32 nodes in the best case when the item is found.
A million entries in a red-black tree like std::map or java.util.TreeMap would only require log2(1,000,000) or roughly 20 nodes per query, worst case. And you do not always need to go to the bottom of the tree making average case appealing.
When backtracking to find <= the difference is even more pronounced.
The fewer entries you have, the better the case for a red-black tree.
At a minimum, I would compare any solution to a red-black tree.
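As a baseline for such a comparison, java.util.TreeMap (a red-black tree) already offers the "largest key <= search key" query through floorKey and floorEntry. A short sketch with the keys from the question, using Long keys to keep the unsigned 32-bit values in natural order; the class name and the string values are placeholders.

```java
import java.util.TreeMap;

class TreeMapFloor {
    static TreeMap<Long, String> sample() {
        TreeMap<Long, String> map = new TreeMap<>();
        map.put(0x00AABBCCL, "first");
        map.put(0x00AABB00L, "second");
        return map;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> map = sample();
        // floorKey returns the greatest key <= the argument, or null if none.
        System.out.println(Long.toHexString(map.floorKey(0x00AABB11L)));
        System.out.println(map.floorKey(0x00AABAFFL));
    }
}
```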

Resources