Hash-table - Mapping a hash value to an index - hashmap

I haven't considered more than the "MOD PRIME" type of hash functions and am a little confused as to how to use a returned hash value to store a value in a HashMap.
I want to implement a HashMap where the key is a 64-bit int (long long int). I have a hash function that returns a long int. The question is: what is the best way to use this returned hash value to determine the table index, given that my table will obviously be smaller than the range of the hash value?
Are there any guidelines to choose the best table size? Or a best way to map the hash value to the size of the table?
Thank you.

You will need to resize the table at some point. Depending on the method you use, you will either need to rehash all keys during the resize-and-copy operation or use some form of dynamic hashing, such as extendible hashing or linear hashing.
As to the first part of the question: since you have used a prime number for the modulo, you should be able to just use the hash value modulo the table size to get an index (for a 64-bit int and a table of size 2^16, that would be just the 16 least significant bits of your 64-bit hash). As for the table size, choose a size that is big enough to hold all the data plus some spare room (a load factor of 0.75 is commonly used in practice). If you expect a lot of inserts, you will need to give more headroom, otherwise you will be resizing the table all the time. Note that with the dynamic hashing algorithms mentioned above this is not necessary, as all resizing operations are amortized over time.
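For illustration, here is a minimal Java sketch of the two common mappings described above: plain modulo for an arbitrary (for example prime) table size, and bit masking when the table size is a power of two. The class name, method names, and sizes are just placeholders.

public class IndexMapping {
    // Arbitrary (e.g. prime) table size: reduce the hash with modulo.
    static int indexByMod(long hash, int tableSize) {
        // Math.floorMod keeps the result non-negative even for negative hash values.
        return (int) Math.floorMod(hash, (long) tableSize);
    }

    // Power-of-two table size: keep only the low bits of the hash.
    static int indexByMask(long hash, int tableSizePowerOfTwo) {
        return (int) (hash & (tableSizePowerOfTwo - 1));
    }

    public static void main(String[] args) {
        long hash = 0x9E3779B97F4A7C15L;                 // some 64-bit hash value
        System.out.println(indexByMod(hash, 1009));      // prime-sized table
        System.out.println(indexByMask(hash, 1 << 16));  // 2^16 slots: low 16 bits
    }
}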
Also, remember that two items can be stored in the same bucket (at the same hashed location in the hash table), the hash function merely tells you where to start looking. So in practice, you would have an array of entries at each location of your hashtable. Note that this can be avoided if you use open addressing to handle hash collisions.
Of course, sometimes you can do better if you choose a different hash function. Your goal would be to have a perfect hash function for each size of your table (if you allow rehashing upon resizing), using something like dynamic perfect hashing or universal hashing.

Related

Hash Table that tries to hash Strings uniformly?

I am currently in a Data Structures course nearing the end of the semester, and have been assigned a project in which we are implementing a Linked Hash Table to store and retrieve keys. We have been given a pretty large amount of freedom with how we are going to design our hash table implementation, but for bonus points we were told to try and find a hash function that distributes our keys (unique strings) close to uniformly and randomly throughout the table.
I have chosen to use the ELF hash, seen here http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
My question is as follows: with this hash function an integer is returned, but I am having trouble seeing how this can be used to help specify a specific index to put my key in the hash table. I could simply do: index = ELFhash(String key) % tableSize, but does this defeat the purpose of using the ELF hash in the first place?
Also I have chosen my collision resolution strategy to be double hashing. Is there a good way to determine an appropriate secondary hashing function to find your jumps? My hash table is not going to be a constant size (sets of strings will be added and removed from the set of data I am hashing, and I will be rehashing them after each iteration of adding and removing to have a load factor of .75), so it is hard for me to just do something like k % n where n is a number that is relatively prime with my table size.
Thanks for taking the time to read my question, and let me know what you think!
You're correct to think about "wrapping bias," but for most practical purposes, it's not going to be a problem.
If the hash table is of size N and the hash value is in the range [0..M), then let k = floor(M/N). Any hash value in the range [0..k*N) is a "good" one in that, using mod N as a map, each hash bucket is mapped by exactly k hash values. The hash values in [k*N..M) are "bad" in that, if you use them, the corresponding M - k*N lowest hash buckets each map from one additional hash value. Even if the hash function is perfect, these buckets have a higher probability of receiving a given value.
The question, though, is "How much higher?" That depends on M and N. If the hash value is an unsigned int in [0..2^32), and - having read Knuth and others - you decide to pick a prime number of buckets around a thousand, say 1009, what happens?
floor(2^32 / 1009) = 4256657
The number of "bad" values is
2^32 - 4256657 * 1009 = 383
Consequently, all buckets are mapped from 4256657 "good" values, and 383 of them get one additional unwanted "bad" value, for a total of 4256658. Thus the "bias" is 1/4,256,657.
It's very unlikely you'll find a hash function where a 1 in 4 million probability difference between buckets will be noticeable.
Now if you redo the calculation with a million buckets instead of a thousand, then things look a bit different. In that case if you're a bit OC, you might want to switch to a 64-bit hash.
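If you want to redo that arithmetic for other table sizes, it is easy to script. A rough Java sketch, assuming an unsigned 32-bit hash range; the numbers for 1009 and 1,000,000 buckets are printed side by side:

public class ModuloBias {
    // Prints how many hash values each bucket gets and how many buckets get one extra.
    static void bias(long hashRange, long buckets) {
        long perBucket = hashRange / buckets;              // "good" values per bucket
        long extra = hashRange - perBucket * buckets;      // buckets receiving one extra value
        System.out.printf("buckets=%d  per-bucket=%d  buckets with one extra value=%d  bias=1/%d%n",
                buckets, perBucket, extra, perBucket);
    }

    public static void main(String[] args) {
        long M = 1L << 32;   // unsigned 32-bit hash range [0, 2^32)
        bias(M, 1009);       // matches the numbers above: ~1 in 4.26 million
        bias(M, 1_000_000);  // noticeably larger relative bias
    }
}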
One additional thing: the ELF hash is pretty unlikely to give absolutely terrible results, and it's quite fast, but there are much better hash functions. A reasonably well-regarded one you might want to give a try is Murmur 32. (The Wikipedia article mentions that the original algorithm has some weaknesses that can be exploited for DoS attacks, but for your application it will be fine.) I'm sure your prof doesn't want you to copy code, but the Wikipedia page has the complete algorithm. It would be interesting to implement ELF yourself and try it against Murmur to see how they compare.
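For reference, the ELF hash itself is only a few lines. Here is a sketch of the widely published loop transcribed into Java; the table size and key in main are placeholders.

public class ElfHashDemo {
    // ELF (PJW-style) hash: folds each character into a running 32-bit value.
    static int elfHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (h << 4) + key.charAt(i);
            int g = h & 0xF0000000;
            if (g != 0) h ^= g >>> 24;
            h &= ~g;
        }
        return h;
    }

    public static void main(String[] args) {
        int tableSize = 1009;                              // any table size works with floorMod
        String key = "example";
        int index = Math.floorMod(elfHash(key), tableSize);
        System.out.println(key + " -> bucket " + index);
    }
}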

how to get original value from hash value in node.js

I have created hash of some fields and storing in database using 'crypto' npm.
var crypto = require('crypto');
var hashFirtName = crypto.createHash('md5').update(orgFirtName).digest("hex"),
QUESTION: How can I get the original value from the hash value when needed?
The basic definition of a "hash" is that it's one-way. You cannot get the originating value from the hash. This is partly because a given value will always produce the same hash, but a hash isn't necessarily related to only a single value, since hash functions map inputs of arbitrary length to an output of finite/fixed length.
Additional Information
I wanted to provide some additional information, as I felt I may have left this too short.
As #xShirase pointed out in his answer, you can use a table to reverse a Hash. These are known as Rainbow Tables. You can generate them or download them from the internet, usually from nefarious sources [ahem].
To expand on my other statement about a hash value possibly relating to multiple original values, let's take a look at MD5.
MD5 is a 128-bit hash. This means it can represent 2^128 distinct values, or (unsigned) 0 through 340,282,366,920,938,463,463,374,607,431,768,211,455. That's a REALLY big number. So, for any given input you have a 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 chance that it will collide with the same hash result of another input value.
Now, for simple data like passwords, the chances of a collision are astronomically small. And for those purposes, who cares? Most of the time you are simply taking an input, hashing it, then comparing the hashes. For reasons I will not get into, when using hashes for passwords you should ALWAYS store the data already hashed. You don't want to leave plain-text passwords just lying about. Keep in mind that a hash is NOT the same as encryption.
Hashes can also be used for other reasons. For instance, they can be used to create a fast-lookup data structure known as a Hash Table. A Hash Table uses a hash as a sort of "primary key", allowing it to search a huge set of data in a relatively small number of instructions, approaching O(1) (on the order of 1). Depending on the implementation of the Hash Table and the hashing algorithm, you have to deal with collisions, usually by means of a sorted list. This is why the Hash Table isn't "exactly" O(1), but close. If your hash algorithm is bad, the performance of your Hash Table can begin to approach O(n).
Another use for a hash is to tell if a file's contents have been altered, or match an original. You will see many OSS projects provide binary downloads that also have MD5 and/or SHA-2 hash values. This is so you can download the files, compute a hash locally, and compare the results against theirs to make sure the file you are getting is the file they posted. Again, since the odds of one file's hash matching another's are 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456, the odds of a hacker successfully generating a file of the same size with a bad payload that hashes to the exact same MD5/SHA-2 value are pretty low.
Hope this discussion can help either you or someone in the future.
If you could get the original value from the hash, it wouldn't be that secure.
If you need to compare a value to what you have previously stored as a hash, you can create a hash for this value and compare the hashes.
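A minimal sketch of that compare-the-hashes flow, written in Java to match the other sketches in this collection (the node.js crypto call in the question does the same job); the sample values and the MD5 choice simply mirror the question:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashCompare {
    // Hash a value with MD5 and return it as a hex string (mirrors digest("hex")).
    static String md5Hex(String value) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        String storedHash = md5Hex("Alice");      // what was saved in the database earlier
        String candidate = "Alice";               // value to check later
        boolean matches = storedHash.equals(md5Hex(candidate));
        System.out.println("match: " + matches);  // true: same input produces the same hash
    }
}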
In practice there is only one way to 'decrypt' a hash. It involves using a massive database of precomputed hashes and comparing them to yours. An example here

Alternative to hash map which is faster

What is the alternative to a hash map that will provide faster functionality?
I need to put values in as key/value pairs. As per HashMap functionality, whenever we add a new key/value pair it will search the existing pairs for the key and only add the key if it does not exist. I want to omit this search, as in the required data a key will never be repeated.
Can I override the put method of HashMap?
Check your hash function first; it may turn out that the hash map is slow because there are a lot of collisions between hash values for different keys, which leads to inefficient hash table operations.
Also, reserve enough capacity ahead of the insertion operations in order to prevent costly resize operations, and check that your load factor is not too high (1-10 is generally good).
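In Java, for example, you can pass an initial capacity (and a load factor) to the HashMap constructor so the table never resizes during a bulk insert. A minimal sketch, assuming the expected number of entries is known up front:

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedEntries = 100_000;

        // Size the table so expectedEntries fits under the default 0.75 load factor,
        // which avoids any rehashing while the map is being filled.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        Map<Long, String> map = new HashMap<>(initialCapacity, 0.75f);

        for (long key = 0; key < expectedEntries; key++) {
            map.put(key, "value-" + key);   // no resize happens during this loop
        }
        System.out.println("size: " + map.size());
    }
}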

How Strings are stored in a VBA Dictionary structure?

As I am currently playing with a huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist), I used a Scripting.Dictionary just for the keyed access feature that it has.
Everything was looking fine except that it was somehow slow in loading the strings and that it used a lot of memory. For an example of 100,000 strings of 128 characters in length, the Task Manager showed approximately 295 MB at the end of the sub, and after setting Dictionary = Nothing a mere 12 MB remained in Excel. Even considering internal Unicode conversion of the strings, 128 * 2 * 100,000 gives 25.6 MB! Can someone explain this big difference?
Here is all the info I could find on the Scripting.Dictionary:
According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary, he occasionally mentions it)
Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and the current max index in the Hash Table. That number is used as the index to store your string into the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning that if you are looking up a string by its key, there is no need to search through the array as you normally would.
The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.
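A stripped-down sketch of that chained lookup (in Java rather than VBA, just to make the moving parts visible; this is not the Scripting.Dictionary's actual implementation):

import java.util.LinkedList;

// Minimal hash table with chaining: each slot holds a list of (key, value) entries.
public class ChainedTable {
    static class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] slots;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        slots = new LinkedList[capacity];
    }

    private int slotFor(String key) {
        // The same key always hashes to the same slot, so lookups go straight there.
        return Math.floorMod(key.hashCode(), slots.length);
    }

    void put(String key, String value) {
        int i = slotFor(key);
        if (slots[i] == null) slots[i] = new LinkedList<>();
        for (Entry e : slots[i]) {
            if (e.key.equals(key)) { e.value = value; return; }  // key already present
        }
        slots[i].add(new Entry(key, value));                     // empty slot or collision: chain it
    }

    String get(String key) {
        int i = slotFor(key);
        if (slots[i] == null) return null;       // empty slot: no string with this key was added
        for (Entry e : slots[i]) {                // walk the (usually short) chain
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(16);
        t.put("alpha", "first string");
        t.put("beta", "second string");
        System.out.println(t.get("alpha"));  // first string
        System.out.println(t.get("gamma"));  // null
    }
}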
The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). So much of the overhead you would see in a Hash Table are the empty parts of the array that have to be there to make the hash table efficient.
Additionally, resizing the Hash Table is an expensive operation because all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might get doubled in size to avoid having to do so again soon.
The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.
If I find anything else out, I'll add it here...

Hashmap Inserts O(N)?

So, let's say you have a hashmap that uses linear probing.
You first insert a value X with key X, which hashes to location 5, say.
You then insert a value Y with key Y, which also hashes to 5. It will take location 6.
You then insert a value Z with key Z, which also hashes to 5. It will take location 7.
You then delete Y, so the memory looks like "X, null, Z"
You then try to insert a value with key Z: it will check 5, see it's taken, check 6, and then insert it there as it's empty. However, there is already an entry with key Z, so you'll have two entries with key Z, which is against the invariant.
So wouldn't you therefore need to go through the entire map until you find the key itself? If it's not found, then you can insert it into the first null space. Wouldn't all first-time inserts on a certain key therefore be O(N)?
No.
The problem you're running into is caused by the deletion, which you've done incorrectly.
In fact, deletion from a table using linear probing is somewhat difficult -- to the point that many tables built using linear probing simply don't support deletion at all.
That said: at least in theory, nearly all operations on a hash table can end up linear in the worst case (insertion, deletion, lookup, etc.). Regardless of how clever a hash function you write, there are infinitely many inputs that can hash to any particular output. With a sufficiently unfortunate choice of inputs (or just a poor hash function) you can end up with an arbitrarily large percentage of them all producing the same hash code.
Edit: if you insist on supporting deletion with linear probing, the basic idea is that you need to ensure that each "chain" of entries remains contiguous. So, you hash the key, then walk from there all the way to the next empty bucket. You check the hash code for each of those entries, and fill the "hole" with the last contiguous item that hashed to a position before the hole. That, in turn, may create another hole that you have to fill in with the last item that hashed to a position before that hole you're creating (and so on, recursively).
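Here is a sketch of that hole-filling deletion for a bare-bones open-addressing table in Java (the textbook backward-shift approach; keys are plain ints and the table is fixed-size purely for brevity):

import java.util.Arrays;

// Minimal linear-probing table of int keys, with backward-shift deletion.
public class LinearProbing {
    private static final int EMPTY = Integer.MIN_VALUE;  // sentinel for a free slot
    private final int[] slots;

    LinearProbing(int capacity) {
        slots = new int[capacity];
        Arrays.fill(slots, EMPTY);
    }

    private int home(int key) { return Math.floorMod(key, slots.length); }

    void insert(int key) {
        int i = home(key);
        while (slots[i] != EMPTY) {           // no full-table check: this is only a sketch
            if (slots[i] == key) return;      // already present
            i = (i + 1) % slots.length;       // probe the next slot
        }
        slots[i] = key;
    }

    void delete(int key) {
        int i = home(key);
        while (slots[i] != key) {
            if (slots[i] == EMPTY) return;    // key not in the table
            i = (i + 1) % slots.length;
        }
        // Backward shift: keep every probe chain contiguous after removing slots[i].
        slots[i] = EMPTY;
        int j = i;
        while (true) {
            j = (j + 1) % slots.length;
            if (slots[j] == EMPTY) return;    // end of the cluster, nothing left to move
            int k = home(slots[j]);
            // Move slots[j] into the hole unless its home slot k lies cyclically in (i, j].
            boolean kBetweenHoleAndJ = (i <= j) ? (i < k && k <= j) : (i < k || k <= j);
            if (!kBetweenHoleAndJ) {
                slots[i] = slots[j];
                slots[j] = EMPTY;
                i = j;                        // the hole has moved to j; keep scanning
            }
        }
    }

    public static void main(String[] args) {
        LinearProbing t = new LinearProbing(11);
        // 5, 16 and 27 all hash to slot 5 in an 11-slot table.
        t.insert(5); t.insert(16); t.insert(27);
        t.delete(16);                         // 27 is shifted back so it stays findable
        System.out.println(Arrays.toString(t.slots));
    }
}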
Not sure why the village idiot (;)) deleted his post, since he was right -- an overcommitted/unbalanced hash table degenerates into a linear search.
To achieve O(1) performance the table must not be overcommitted (the table must be sufficiently oversized, given the number of entries), and the hash algorithm must do a good job (avoiding imbalance), given the characteristics/statistics of the key value.
It should be noted that there are two basic hash table schemes -- linear probing, where hash synonyms are simply inserted into the next available table slot, and linked lists, where hash synonyms are added to a linked list off the table element for the given hash value. They produce roughly the same statistics until overcommitted/unbalanced, at which point linear probing quickly falls completely apart while linked lists simply degrade slowly. And, as someone else stated, linear probing makes deletions very difficult.
