Hash Table that tries to hash Strings uniformly? - string

I am currently in a Data Structures course nearing the end of the semester, and have been assigned a project in which we are implementing a Linked Hash Table to store and retrieve keys. We have been given a pretty large amount of freedom with how we are going to design our hash table implementation, but for bonus points we were told to try and find a hash function that distributes our keys (unique strings) close to uniformly and randomly throughout the table.
I have chosen to use the ELF hash, seen here http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
My question is as follows: With this hash function an integer is returned, but I am having trouble seeing how this can be used to help specify a specific index to put my key in in the hash table. I could simply do: index = ELFhash(String key) % tableSize, but does this defeat the purpose of using the ELF hash in the first place??
Also I have chosen my collision resolution strategy to be double hashing. Is there a good way to determine an appropriate secondary hashing function to find your jumps? My hash table is not going to be a constant size (sets of strings will be added and removed from the set of data I am hashing, and I will be rehashing them after each iteration of adding and removing to have a load factor of .75), so it is hard for me to just do something like k % n where n is a number that is relatively prime with my table size.
Thanks for taking the time to read my question, and let me know what you think!

You're correct to think about "wrapping bias," but for most practical purposes, it's not going to be a problem.
If the hash table is of size N and the hash value is in the range [0..M), then let k = floor(M/N). Any hash value in the range [0..k*N) is a "good" one in that, using mod N as a map, each hash bucket is mapped by exactly k hash values. The hash values in [k*N..M) are "bad" in that if you use them, the corresponding M-K*n lowest hash buckets map from one additional hash value. Even if the hash function is perfect, these buckets have a higher probability of receiving a given value.
The question, though, is "How much higher?" That depends on M and N. If the hash value is an unsigned int in [0..2^32), and - having read Knuth and others - you decide to pick prime number of buckets around a thousand, say 1009, what happens?
floor(2^32 / 1009) = 4256657
The number of "bad" values is
2^32 - 4256657 * 1009 = 383
Consequently, all buckets are mapped from 4256657 "good" values, and 383 get one additional unwanted "bad" value for 4256658. Thus the "bias" for is 1/4,256,657.
It's very unlikely you'll find a hash function where a 1 in 4 million probability difference between buckets will be noticeable.
Now if you redo the calculation with a million buckets instead of a thousand, then things look a bit different. In that case if you're a bit OC, you might want to switch to a 64-bit hash.
On additional thing: The Elf hash is pretty unlikely to give absolutely terrible results, and it's quite fast, but there are much better hash functions. A reasonably well-regarded one you might want give a try is Murmur 32. (The Wiki article mentions that the original alg has some weaknesses that can be exploited for DoS attacks, but for your application it will be fine.) I'm sure your prof doesn't want you to copy code, but the Wikipedia page has it complete. It would be interesting to implement Elf yourself and try it against Murmur to see how they compare.

Related

Hashing and 'brute-force' permutations

So this is a two-part question:
Are there any hashing functions that guarantee that for any combination of the same length, they generate a unique hash? As I remember - most are that way, but I just need to confirm this.
Based on the 1st question - so, given a file hash and a length - is it then theoretically possible to 'brute-force' all byte permutations of that same length until the same hash is generated - ie. the original file has been recreated?
PS. I am aware that this will take ages (if theoretically possible), but I think it would be feasible for small files (sizes < 1KB)
1KB, that'd be 1000^256, right? 1000 possible combinations of bytes (256 configurations each?). It's a real big number. 1 with 768 0s behind it.
If you were to generate all of them, one would be the right one, but you'd have some number of collisions.
According to this security.SE post, the collission rate for md5 (for example) is about 1 in 2^64. So, if we divide our original number by that, we'd get how many possible combinations, right? http://www.wolframalpha.com/input/?i=1000%5E256+%2F+2%5E64
~5.42 × 10^748
That is still a lot of files to check.
I'd feel a lot better if someone critiqued my math here, but the point is that your first point is not true because of collisions. You can use the same sort math for calculating two 1000 character passwords having the same hash. It's the birthday problem. Given 2 people, it is unlikely that we'd have the same birthday, but if you take a room full the probability of any two people having the same birthday increases very quickly. If you take all 1000 character passwords, some of them are going to collide. You are going from X bytes to 16 bytes. You can't fit all of the combinations into 16 bytes.
Expanding upon the response to your first point, one of the points of cryptographic hash functions is unpredictability. A function with zero collisions is a 1-1 (or one-to-one) function, so called because every input has exactly one output and every output has exactly one input.
In order for a function to accept arbitrary length & complexity inputs without generating a collision, it is easy to see that the function must have arbitrary length outputs. As Gray obliquely points out, most hash functions have fixed-length outputs. (There are apparently some new algorithms that support arbitrary length outputs, but they still don't guarantee 0 collisions.) The reason is not stated clearly in the common crypto literature, but consider the difference between hashing and encrypting.
In hashing, you have the message (the unaltered original) and the message digest (the output of the hash function. (Digest here having the meaning "a summation or condensation of a body of information.")
With encryption, you have the plain text and the cipher text. The implication is that the cipher text is of equal length and complexity as the original.
I look at it as a cryptographic hash function with 0 collisions is of equal complexity as encryption. (Note that I'm unsure of what the advantages of a variable-length hash output are, so I asked a question about it.)
Additionally, hash functions are susceptible to attacks by pre-computed rainbow tables, which is why all hash algorithms still considered secure employ extra random inputs, called salts. The reason encryption isn't susceptible to a similar attack is that the encryption key is kept secret and you can't pre-compute output values without knowing the key. Compare symmetric key encryption (where there is one key that must be kept secret) with public key encryption (where the encryption key is public and the decryption key is private).
The other thing that prevents encryption algorithms from pre-computation attacks is that the number of computations for arbitrary-length inputs grows exponentially, and it is literally impossible to store the output from every input you may be interested in.

How Strings are stored in a VBA Dictionary structure?

As I am currently playing with huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist) I used a scripting dictionary just for the feature of the keyed access that it has.
Everything was looking fine except that it was some how slow in loading the strings and that it uses a lot of memory. For an example of 100,000 strings of 128 characters in length, the Task manager showed at the end of the sub approximately 295 MB and when setting Dictionary=Nothing a poor 12 MB was remaining in Excel. Even considering internal Unicode conversion of strings 128 * 2 * 100,000 gives 25.6 MB ! Can someone explain this big difference ?
Here is all the info I could find on the Scripting.Dictionary:
According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary, he occasionally mentions it)
Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and current max index in the Hash Table. That number is used as the index to store your string into the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning if you are looking up a string by its key, there is no need to search through the array as your normally would.
The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.
The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). So much of the overhead you would see in a Hash Table are the empty parts of the array that have to be there to make the hash table efficient.
Additionally, resizing the Hash Table is an expensive operation because the all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might get doubled in size to avoid having to do so again soon.
The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.
If I find anything else out, I'll add it here...

Hash-table - Mapping a hash value to an index

I haven't considered more than using the "MOD PRIME" type of hash functions and am a little confused as how to use a returned hash value to store a value in a HashMap.
I want to implement a HashMap, where the key is a 64-bit int (long long int). I have a hash function that returns a long int. The question is, what is the best way to use this returned hash value to determine the table index. Because my table will obviously be smaller than the range of the hash value.
Are there any guidelines to choose the best table size? Or a best way to map the hash value to the size of the table?
Thank you.
You will need to resize the table at some point. Depending on the method you use, you will either need to rehash all keys during the resize-and-copy operation or use some form of dynamic hashing, such as extendible hashing or linear hashing.
As to answering the first part of the question, as you have a used a prime number for the modulo, you should be able to just use the hash value modulo table size to get an index (for a 64-bit int and a table of size 2^16, that would be just the 16 least significant bits of your 64-bit hash). As for the table size, you choose a size that is big enough to hold all data plus some spare room (a value of 0.75 load is used in practice). If you expect a lot of inserts, you will need to give more headroom otherwise you will be resizing the table all the time. Note that with the dynamic hashing algorithms mentioned above this is not necessary, as all resizing operations are amortized over time.
Also, remember that two items can be stored in the same bucket (at the same hashed location in the hash table), the hash function merely tells you where to start looking. So in practice, you would have an array of entries at each location of your hashtable. Note that this can be avoided if you use open addressing to handle hash collisions.
Of course, sometimes you can do better if you choose a different hash function. Your goal would be to have a perfect hash function for each size of your table (if you allow rehashing upon resizing), using something like dynamic perfect hashing or universal hashing.

Why is it called rainbow table?

Anyone know why it is called rainbow table? Just remembered we have learned there is an attack called "dictionary attack". Why it is not call dictionary?
Because it contains the entire "spectrum" of possibilities.
A dictionary attack is a bruteforce technique of just trying possibilities. Like this (python pseudo code)
mypassworddict = dict()
for password in mypassworddict:
trypassword(password)
However, a rainbow table works differently, because it's for inverting hashes. A high level overview of a hash is that it has a number of bins:
bin1, bin2, bin3, bin4, bin5, ...
Which correspond to binary parts of the output string - that's how the string ends up the length it is. As the hash proceeds, it affects differing parts of the bins in different ways. So the first byte (or whatever input field is accepted) input affects (say, simplistically) bins 3 and 4. The next input affects 2 and 6. And so on.
A rainbow table is a computation of all the possibilities of a given bin, i.e. all the possible inverses of that bin, for every bin... that's why it ends up so large. If the first bin value is 0x1 then you need to have a lookup list of all the values of bin2 and all the values of bin3 working backwards through the hash, which eventually gives you a value.
Why isn't it called a dictionary attack? Because it isn't.
As I've seen your previous question, let me expand on the detail you're looking for there. A cryptographically secure hash needs to be safe ideally from smallish input sizes up to whole files. To precompute the values of a hash for an entire file would take forever. So a rainbow table is designed on a small well understood subset of outputs, for example the permutations of all the characters a-z over a field of say 10 characters.
This is why password advice for defeating dictionary attacks works here. The more subsets of the whole possible set of inputs you put into your input for the hash, the more a rainbow table needs to contain to search it. The data sizes required end up stupidly big and so does the time to search. So, think about it:
If you have an input that is [a-z] for 5-8 characters, that's not too bad a rainbow table.
If you increase the length to 42 characters, that's a massive rainbow table. Each input affects the hash and so the bins of said hash.
If you throw numbers in to your search requirement [a-z][0-9] you've got even more searching to do.
Likewise [A-Za-z0-9]. Finally, stick in [\w] i.e. any printable character you can think of, and again, you're looking at a massive table.
So, making passwords long and complicated makes rainbow tables start taking blue-ray sized discs of data. Then, as per your previous question, you start adding in salting and hash derived functions and you make a general solution to hash cracking hard(er).
The goal here is to stay ahead of the computational power available.
Rainbow is a variant of dictionary attack (Pre-computed dictionary attack to be exact), but it takes less space than full dictionary (at the price of time needed to find a key in table). The other end of this space-memory tradeoff is full search (brute force attack = zero precomputation, a lot of time).
In the rainbow table the precomputed dictionary of pairs key-ciphertext is compressed in chains. Every step in chain is done using different commpression function. And the table has a lot of chains, so it looks like a rainbow.
In this picture different compression functions K1, K2, K3 have a colors like in rainbow:
The table, stored in the file contains only first and last columns, as the middle columns can be recomputed.
I don't know where the name comes from, but the differences are:
A dictionary contains a few selected items (e.g. english words), while a rainbow table contains every possible combination.
A dictionary only contains the input, while the rainbow table contains both the input and the output.
A dictionary is used to test different input to see if the output is valid, while a rainbow table is used for e reverse lookup, i.e. to find which input gives a specific output.
Unfortunately some of the statements are not correct. Contrary to what is bring posted rainbow tables DO NOT contain all the possibilites for a given keyspace well not the ones generated for use that I've seen. They can be generated to cover 99.9 but due to the randomness of a hash function there in no gurantee that EVERY plaintext is covered.
Each chain is made up of links or steps and each step is made of a hashing and reduction function. If your chain was 100 links long you would go that number of hash/reduction functions then discarding everything in between except the start and end.
To find the plain for a given hash you simply perform the reduction / hash x amount of the length of your chain. So you run the step once and check against the endpoint if it's a miss you would repeat... Until you have stepped through the entire length of your chain. If there is a match you can then regenerate the chain from the start point and you may be able to find the plain. If after the regeneration it is not correct then this is a false alarm. This happens due to collisions caused by the reduction hashing function. Since the table contains many chains you can do a large lookup against all the chain endpoints each step, this is essentially where the magic happens allowing speed. This will also lead to false alarms, since you only need to regenerate chains which have matches you save lots of time by skipping unnecessary chains.
They do not contain dictionaries.... Well not the traditional tables there are variants of rainbow tables which incorporate the use of dictionaries though.
That's about it. There are many ways which this process has been optimized including removing merging / duplicate chains and creating perfect tables and also storing them in differing packing to save space and loading time.

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then comes someone and tries to hash an array of URLs. The has function will see "http:/" for every single string.
Similar scenarios may occur for other characters selections schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function can not separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If ever such collision occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h‘ and hash the keys in the set A. There is at least one hash value y‘ such that the set A‘={s is in A: h‘(s)=y‘} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. CQFD.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/

Resources