Is this approach to dealing with hash collisions new/unique? - haskell

When dealing with hash maps, I have seen a few strategies to deal with hash collisions, but we have come up with something different.
I was wondering if this is something new or not.
This version of a hash map only works if the hash function and the data structures being hashed are saltable.
(This is the case for hashable in Haskell, which is where we suggested implementing this approach.)
The idea is that, instead of storing a list or array in each cell of the hash map, you store a recursive hash map. The only difference in this recursive hash map is that you use a different salt.
This way the hash collisions on one level of the hash map are most likely not hash collisions on the next level.
As a result, insertion into such a hash map is no longer O(number of collisions on this hash) but O(number of levels at which those collisions keep recurring), which is most likely better.
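To make the idea concrete, here is a minimal Python sketch; it is not the linked hashable implementation (the salting and table layout there differ), just an illustration of the recursive-salt idea. Each bucket that sees a collision is replaced by a nested map hashing with a fresh salt.

    class SaltedMap:
        """Toy recursive hash map: a bucket holds either a (key, value)
        pair or a nested SaltedMap that hashes with a different salt."""

        def __init__(self, salt=0, size=8):
            self.salt = salt
            self.buckets = [None] * size

        def _index(self, key):
            # Hypothetical salted hash; hashable mixes the salt differently.
            return hash((self.salt, key)) % len(self.buckets)

        def insert(self, key, value):
            i = self._index(key)
            slot = self.buckets[i]
            if slot is None:
                self.buckets[i] = (key, value)
            elif isinstance(slot, SaltedMap):
                slot.insert(key, value)             # recurse with the next salt
            else:
                k, v = slot
                if k == key:
                    self.buckets[i] = (key, value)  # overwrite existing key
                    return
                # Collision: push both entries one level down, where the
                # new salt most likely separates them.
                nested = SaltedMap(salt=self.salt + 1, size=len(self.buckets))
                nested.insert(k, v)
                nested.insert(key, value)
                self.buckets[i] = nested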
A more detailed explanation and an implementation can be found here:
https://github.com/tibbe/unordered-containers/pull/217/files/58af4519ace34c5f7d3c1359907ff75e27b9cdb8#diff-ba23e0f18c79cb873ac5375367524cfaR114

Your idea seems to be effectively the same as the one suggested in the Fredman, Komlós & Szemerédi paper from 1984. As Wikipedia summarizes it:
FKS Hashing makes use of a hash table with two levels in which the top level contains n buckets which each contain their own hash table.
In contrast to your idea, the local hash maps aren't recursive; instead, each of them chooses a salt that makes it a perfect hash. In practice, a suitable salt will (as you say) usually be the first one you try, so the scheme is asymptotically constant-time.
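For comparison, here is a rough Python sketch of the FKS local-table idea; the salted-tuple hash and the function name are illustrative, not from the paper.

    import random

    def build_fks_bucket(keys):
        """Retry salts until the bucket's local table is collision-free
        (a perfect hash for this bucket's keys)."""
        size = max(1, len(keys) ** 2)
        while True:
            salt = random.getrandbits(32)
            table = {}
            for k in keys:
                i = hash((salt, k)) % size
                if i in table:
                    break           # collision: retry with a fresh salt
                table[i] = k
            else:
                return salt, size, table

In FKS the local table has len(keys)**2 slots precisely so that a random salt succeeds with probability greater than 1/2; only a couple of tries are expected, which is why the construction is constant-time in expectation.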

How is it possible to have these reversed hashes available on the web?

If these hashing algorithms are one-way functions, how is it possible to have these reversed hashes available on the web? What is the reverse hashing procedure used by those lookup sites?
When we say that a hash function h is a one-way function, we mean that
given some fixed string w, it's "easy" to compute h(w), but
given h(x) for some randomly-chosen string x, it's "hard" to find a string w where h(w) = h(x).
So in that sense, if you have a hash of a string that you know literally nothing about, there is no easy way to invert that hash.
However, this doesn't mean that, once you hash something, it can never be reversed. For example, suppose I know that you're hashing either the string YES or the string NO. I could then, in advance, precompute h(YES) and h(NO), write the values down, and then compare your hashed string against the two hashed values to figure out which string you hashed. Similarly, if I knew you were hashing a number between 0 and 999,999,999, I could hash all those values, store the results, then compare the hash of your number against my precomputed hashes and see which one you hashed.
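A minimal Python sketch of that precomputation idea (SHA-256 is chosen arbitrarily here): hash every expected candidate up front, then "invert" by table lookup.

    import hashlib

    # Precompute hashes for the inputs we expect (the YES/NO example above).
    candidates = ["YES", "NO"]
    lookup = {hashlib.sha256(s.encode()).hexdigest(): s for s in candidates}

    def invert(hex_digest):
        # Succeeds only if the original input was in our candidate set;
        # nothing about the hash function itself is being reversed.
        return lookup.get(hex_digest)

    print(invert(hashlib.sha256(b"YES").hexdigest()))  # -> YES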
To directly answer your question - the sites that offer tables of reversed hashes don't compute those tables by reversing the hash function, but rather by hashing lots and lots and lots of strings and writing down the results. They might hash strings they expect people to use (for example, the most common weak web passwords), or they may pick random short strings to cover all possible simple strings (along the lines of the number hashing example from above).
Since cryptographic hash functions like SHA-1, SHA-2, BLAKE2, etc., are candidates for one-way functions, there is no way to reverse the hashing.
So how do they achieve this? They may choose among three approaches:
Build a database of pairs (x, hash(x)) by generating the hashes of well-known strings: cracked password lists, the English dictionary, Wikipedia text in all languages, and all strings up to some length bound like 8.
This method has a huge problem: the space needed to store all the pairs of inputs and their hashes.
Build a rainbow table. A rainbow table is a time-memory trade-off. Before starting to build one, select the table parameters so as to cover the target search space.
See RainbowCrack for the details of password cracking.
Combine both. Given the target search space, not all well-known strings, passwords, etc. can be placed in the rainbow table. For those, use the first option.
Don't forget that some of these sites also provide online hashing tools. Once you ask them to hash a value, it is going to enter their database/rainbow table, and when you later visit the site and ask for the pre-image of a hash you stored earlier, surprise, they have it now! If the text is sensitive, don't use online hashing services.
There is no process for reverse hashing. You just guess a password and hash it. You can make big databases of these guesses and hashes for reverse lookup, but it's not reversing the hash itself. Search for "rainbow tables" for more details.
Those websites do not perform any kind of reverse hashing. There are tables called "rainbow tables": precomputed tables for caching the outputs of cryptographic hash functions. The operators take lots and lots of strings and calculate hash values for them, and when someone searches for a hash value they look up the corresponding input in the table and display it.

Hash Table that tries to hash Strings uniformly?

I am currently in a Data Structures course nearing the end of the semester, and have been assigned a project in which we are implementing a Linked Hash Table to store and retrieve keys. We have been given a pretty large amount of freedom with how we are going to design our hash table implementation, but for bonus points we were told to try and find a hash function that distributes our keys (unique strings) close to uniformly and randomly throughout the table.
I have chosen to use the ELF hash, seen here http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
My question is as follows: with this hash function an integer is returned, but I am having trouble seeing how this can be used to help specify a specific index to put my key in the hash table. I could simply do index = ELFhash(String key) % tableSize, but does this defeat the purpose of using the ELF hash in the first place?
Also, I have chosen my collision resolution strategy to be double hashing. Is there a good way to determine an appropriate secondary hashing function to find your jumps? My hash table is not going to be a constant size (sets of strings will be added and removed from the set of data I am hashing, and I will be rehashing after each iteration of adding and removing to keep a load factor of .75), so it is hard for me to just do something like k % n where n is a number that is relatively prime to my table size.
Thanks for taking the time to read my question, and let me know what you think!
You're correct to think about "wrapping bias," but for most practical purposes, it's not going to be a problem.
If the hash table is of size N and the hash value is in the range [0..M), then let k = floor(M/N). Any hash value in the range [0..k*N) is a "good" one in that, using mod N as a map, each hash bucket is mapped by exactly k hash values. The hash values in [k*N..M) are "bad" in that, if you use them, the lowest M - k*N hash buckets each map from one additional hash value. Even if the hash function is perfect, these buckets have a higher probability of receiving a given value.
The question, though, is "How much higher?" That depends on M and N. If the hash value is an unsigned int in [0..2^32), and - having read Knuth and others - you decide to pick prime number of buckets around a thousand, say 1009, what happens?
floor(2^32 / 1009) = 4256657
The number of "bad" values is
2^32 - 4256657 * 1009 = 383
Consequently, all buckets are mapped from 4,256,657 "good" values, and 383 of them get one additional unwanted "bad" value, for 4,256,658. Thus the "bias" is 1/4,256,657.
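The arithmetic is easy to check in a few lines of Python (just a sketch of the calculation above, nothing more):

    M = 2 ** 32          # number of possible 32-bit hash values
    N = 1009             # prime bucket count from the example
    k = M // N           # "good" hash values per bucket
    print(k)             # 4256657
    print(M - k * N)     # 383 "bad" leftover values
    print(1 / k)         # bias of about 2.3e-7, i.e. roughly 1 in 4.26 million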
It's very unlikely you'll find a hash function where a 1 in 4 million probability difference between buckets will be noticeable.
Now if you redo the calculation with a million buckets instead of a thousand, then things look a bit different. In that case if you're a bit OC, you might want to switch to a 64-bit hash.
One additional thing: the ELF hash is pretty unlikely to give absolutely terrible results, and it's quite fast, but there are much better hash functions. A reasonably well-regarded one you might want to give a try is Murmur 32. (The Wiki article mentions that the original algorithm has some weaknesses that can be exploited for DoS attacks, but for your application it will be fine.) I'm sure your prof doesn't want you to copy code, but the Wikipedia page has it complete. It would be interesting to implement ELF yourself and try it against Murmur to see how they compare.
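On the double-hashing part of the question: one standard trick, sketched below in Python under the assumption of a power-of-two table size (the names h1 and h2 are placeholders for your two hash functions), is to derive the jump from a second hash and force it to be odd, so the step is always coprime with the table size and the probe sequence visits every slot.

    def probe_sequence(key, table_size, h1, h2):
        """Double-hashing probe order for `key` in a table of
        `table_size` slots (assumed to be a power of two)."""
        start = h1(key) % table_size
        step = (h2(key) % table_size) | 1   # odd step => coprime with 2^k
        for i in range(table_size):
            yield (start + i * step) % table_size

With a prime table size, any nonzero step works, which is the other standard way to sidestep the coprimality issue.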

how to get original value from hash value in node.js

I have created hashes of some fields and am storing them in the database using the 'crypto' npm module.
var crypto = require('crypto');
var orgFirtName = 'John'; // example value added for completeness; variable names kept as in the question
var hashFirtName = crypto.createHash('md5').update(orgFirtName).digest('hex');
QUESTION: How can I get the original value from the hash value when needed?
The basic definition of a "hash" is that it's one-way. You cannot get the originating value from the hash: a single input value will always produce the same hash, but a given hash isn't necessarily related to a single input value, since hash functions return a string of finite, fixed length.
Additional Information
I wanted to provide some additional information, as I felt I may have left this too short.
As @xShirase pointed out in his answer, you can use a table to reverse a hash. These are known as Rainbow Tables. You can generate them or download them from the internet, usually from nefarious sources [ahem].
To expand on my other statement about a hash value possibly relating to multiple original values, lets take a look at MD5.
MD5 is a 128-bit hash. This means it can take 2^128 distinct values, or (unsigned) 0 through 340,282,366,920,938,463,463,374,607,431,768,211,455. That's a REALLY big number. So, for any given input, the chance that some other specific input collides with the same hash result is 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456.
Now, for simple data like passwords, the chances of an accidental collision are astronomically small. And for those purposes, who cares? Most of the time you are simply taking an input, hashing it, then comparing the hashes. For reasons I will not get into, when using hashes for passwords you should ALWAYS store the data already hashed. You don't want to leave plain-text passwords just lying about. Keep in mind that a hash is NOT the same as encryption.
Hashes can also be used for other reasons. For instance, they can be used to create a fast-lookup data structure known as a Hash Table. A Hash Table uses a hash as a sort of "primary key", allowing it to search a huge set of data in relatively few instructions, approaching O(1) (on the order of 1). Depending on the implementation of the Hash Table and the hashing algorithm, you have to deal with collisions, usually by chaining the colliding entries in a list. This is why the Hash Table isn't "exactly" O(1), but close. If your hash algorithm is bad, the performance of your Hash Table can degrade toward O(n).
Another use for a hash is to tell whether a file's contents have been altered, or match an original. You will see many OSS projects provide binary downloads that also have MD5 and/or SHA-2 hash values. This is so you can download the files, compute a hash locally, and compare the results against theirs to make sure the file you got is the file they posted. Again, since the odds of one file's hash matching another's are 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456, the odds of a hacker successfully generating a file of the same size with a bad payload that hashes to the exact same MD5/SHA-2 value are pretty low.
Hope this discussion can help either you or someone in the future.
If you could get the original value from the hash, it wouldn't be that secure.
If you need to compare a value to what you have previously stored as a hash, you can create a hash for this value and compare the hashes.
In practice there is only one way to 'decrypt' a hash. It involves using a massive database of precomputed hashes and comparing them to yours. An example here
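A minimal sketch of the compare-the-hashes approach (Python's hashlib stands in for node's crypto here; MD5 and the sample value are kept only to match the question):

    import hashlib

    stored = hashlib.md5(b"John").hexdigest()   # what the database holds

    def matches(candidate, stored_hex):
        # Hash the candidate the same way and compare digests;
        # the stored hash itself is never "decrypted".
        return hashlib.md5(candidate.encode()).hexdigest() == stored_hex

    print(matches("John", stored))   # True
    print(matches("Jane", stored))   # False

(For passwords you would also want a salt and a deliberately slow hash, but that's beside the comparison point.)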

How does a reduction function used with rainbow tables work?

I've carefully read about rainbow tables and can't get one thing. In order to build a hash chain, a reduction function is used. It's a function that somehow maps hashes onto passwords. This article says that the reduction function isn't an inverse of the hash; it's just some mapping.
I don't get it - what's the use of a mapping that isn't even an inverse of the hash function? How should such mapping practically work and aid in deducing a password?
A rainbow table is "just" a smart compression method for a big table of precomputed hashes. The idea is that the table can "invert" a hash output if and only if a corresponding input was considered during the table construction.
Each table line ("chain") is a sequence of hash function invocations. The trick is that each input is computed deterministically from the previous output in the chain, so that:
by storing the starting and ending points of the chain, you "morally" store the complete chain, which you can rebuild at will (this is where a rainbow table can be viewed as a compression method);
you can start the chain rebuilding from a hash function output.
The reduction function is the glue which turns a hash function output into an appropriate input (for instance a character string which looks like a genuine password, consisting only of printable characters). Its role is mostly to be able to generate possible hash inputs with more or less uniform probability, given random data to work with (and the hash output will be acceptably random). The reduction function need not have any specific structure, in particular with regard to how the hash function itself works; it must just allow the chain to keep being built without creating too many spurious collisions.
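A small Python sketch of one chain; the reduction function here is made up for illustration. Mixing the chain position into the reduction is what makes the table "rainbow" rather than a classic Hellman table.

    import hashlib

    CHARSET = "abcdefghijklmnopqrstuvwxyz"
    PW_LEN = 6

    def reduce_hash(digest, position):
        # Fold the digest, plus the chain position, back into a plausible
        # 6-letter "password". This is NOT an inverse of MD5, just a mapping.
        n = int.from_bytes(digest, "big") + position
        chars = []
        for _ in range(PW_LEN):
            n, r = divmod(n, len(CHARSET))
            chars.append(CHARSET[r])
        return "".join(chars)

    def build_chain(start_pw, length):
        pw = start_pw
        for pos in range(length):
            pw = reduce_hash(hashlib.md5(pw.encode()).digest(), pos)
        return start_pw, pw   # only the two endpoints are stored

    print(build_chain("secret", 1000))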
The reason the reduction function isn't the inverse of a hash is that the true inverse of a hash would not be a function (remember, the actual definition of "function" requires one output for one input).
Hash functions produce strings which are shorter than their corresponding inputs. By the pigeonhole principle, this means that two inputs can have the same output. If arbitrarily long strings can be hashed, an infinite number of strings can have the same output, in fact. Yet a rainbow table generally only keeps one preimage for each hash - so it can't be a true inverse.
What a table effectively gives you is just some string having the given hash - for instance the shortest one found - not necessarily the original input.
It doesn't matter whether what it produces is the original password: what you get would also work as a password, and you could log in with it just as well as with the original.

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to that question that hash algorithms on strings can be computed in constant time, with one SO member with 25 years of programming experience claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a bound on the string length; that is, some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically is whether a constant-time hash method exists for strings of arbitrary length, as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
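That sampling scheme might look like the following Python sketch (details such as the prefix length and sample count are improvised; seeding from the prefix keeps the choice deterministic per string):

    import random

    def sampled_hash(s, samples=100):
        # O(1) in len(s): seed a PRNG with a fixed-size prefix, then
        # hash a bounded number of pseudo-randomly chosen characters.
        if not s:
            return 0
        rng = random.Random(s[:10])
        picks = tuple(s[rng.randrange(len(s))] for _ in range(samples))
        return hash(picks)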
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
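As a sketch, here is one such capped hash in Python, reusing the djb2-style loop from the hash page linked in the question but reading at most 20 characters:

    def capped_hash(s, cap=20):
        # Cost is O(1) in len(s) because at most `cap` characters are read.
        h = 5381                          # djb2 initial value
        for ch in s[:cap]:
            h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
        return h

Whether the ignored tail causes too many collisions depends entirely on the shape of your keys, which is the "it depends on your domain" caveat above.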
You cannot easily achieve a general constant-time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then someone comes along and tries to hash an array of URLs. The hash function will see "http:/" for every single string.
Similar scenarios may occur for other character selection schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less-than-linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function cannot separate inputs that it has not read, so I wouldn't take the "anything can be hashed in constant time" claim too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of hash values that makes it unlikely that many strings collide - for the domain under consideration. The hash value then allows direct access into a data store. These two combined result in a constant-time lookup - on average.
If such a collision ever occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned' before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed-length) pointer to the interned string, instead of hashing the string itself.
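Here is a Python sketch of the interning idea (the table and function names are made up; note that interning a new string still pays the O(m) hash cost once, when the string is first inserted):

    _intern_table = {}

    def intern_str(s):
        # Return the canonical copy of s; equal strings share one object,
        # so the object's identity can stand in for the string itself.
        return _intern_table.setdefault(s, s)

    a = intern_str("hello " + "world")
    b = intern_str("hello world")
    assert a is b
    bucket = id(a) % 64   # hash the fixed-size "pointer", not the string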
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A = {s : h(s) = y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h' and hash the keys in the set A. There is at least one hash value y' such that the set A' = {s in A : h'(s) = y'} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. QED.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/
