Constant-time hash for strings? - string

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.
Thanks.

A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.

In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...

You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then comes someone and tries to hash an array of URLs. The has function will see "http:/" for every single string.
Similar scenarios may occur for other characters selections schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.

You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function can not separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.

Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If ever such collision occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.

Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.

You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h‘ and hash the keys in the set A. There is at least one hash value y‘ such that the set A‘={s is in A: h‘(s)=y‘} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. CQFD.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/

Related

Hashing and 'brute-force' permutations

So this is a two-part question:
Are there any hashing functions that guarantee that for any combination of the same length, they generate a unique hash? As I remember - most are that way, but I just need to confirm this.
Based on the 1st question - so, given a file hash and a length - is it then theoretically possible to 'brute-force' all byte permutations of that same length until the same hash is generated - ie. the original file has been recreated?
PS. I am aware that this will take ages (if theoretically possible), but I think it would be feasible for small files (sizes < 1KB)
1KB, that'd be 1000^256, right? 1000 possible combinations of bytes (256 configurations each?). It's a real big number. 1 with 768 0s behind it.
If you were to generate all of them, one would be the right one, but you'd have some number of collisions.
According to this security.SE post, the collission rate for md5 (for example) is about 1 in 2^64. So, if we divide our original number by that, we'd get how many possible combinations, right? http://www.wolframalpha.com/input/?i=1000%5E256+%2F+2%5E64
~5.42 × 10^748
That is still a lot of files to check.
I'd feel a lot better if someone critiqued my math here, but the point is that your first point is not true because of collisions. You can use the same sort math for calculating two 1000 character passwords having the same hash. It's the birthday problem. Given 2 people, it is unlikely that we'd have the same birthday, but if you take a room full the probability of any two people having the same birthday increases very quickly. If you take all 1000 character passwords, some of them are going to collide. You are going from X bytes to 16 bytes. You can't fit all of the combinations into 16 bytes.
Expanding upon the response to your first point, one of the points of cryptographic hash functions is unpredictability. A function with zero collisions is a 1-1 (or one-to-one) function, so called because every input has exactly one output and every output has exactly one input.
In order for a function to accept arbitrary length & complexity inputs without generating a collision, it is easy to see that the function must have arbitrary length outputs. As Gray obliquely points out, most hash functions have fixed-length outputs. (There are apparently some new algorithms that support arbitrary length outputs, but they still don't guarantee 0 collisions.) The reason is not stated clearly in the common crypto literature, but consider the difference between hashing and encrypting.
In hashing, you have the message (the unaltered original) and the message digest (the output of the hash function. (Digest here having the meaning "a summation or condensation of a body of information.")
With encryption, you have the plain text and the cipher text. The implication is that the cipher text is of equal length and complexity as the original.
I look at it as a cryptographic hash function with 0 collisions is of equal complexity as encryption. (Note that I'm unsure of what the advantages of a variable-length hash output are, so I asked a question about it.)
Additionally, hash functions are susceptible to attacks by pre-computed rainbow tables, which is why all hash algorithms still considered secure employ extra random inputs, called salts. The reason encryption isn't susceptible to a similar attack is that the encryption key is kept secret and you can't pre-compute output values without knowing the key. Compare symmetric key encryption (where there is one key that must be kept secret) with public key encryption (where the encryption key is public and the decryption key is private).
The other thing that prevents encryption algorithms from pre-computation attacks is that the number of computations for arbitrary-length inputs grows exponentially, and it is literally impossible to store the output from every input you may be interested in.

Any algorithm that mangles/hashes a string but can be matched against?

Usage case: client needs to send a huge string over HTTP. The server replies whether the string contains some substring. However, huge string is huge. This system is as a result really inefficient. Moreover, huge string contains some sensitive info, so this is really insecure.
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
No.
Let f be such a hash. Consider a string s and non-substring t. Note that s and t are substrings of s + t. Therefore, s and t have the same hash (i.e., f(s) = f(t) = f(s + t)). This is contrary to the requirement that f(s) != f(t) with high probability.
In particular, with s = "", we see that all strings t have f(s) = f(t), so that f is constant and equal to f("").
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
I guess I'll have to explain why this won't happen:
String string = "the quick brown fox jumps over the lazy dog";
That means, according to your request, that every single letter in this will hash to the same value. Hashing algorithms are deterministic. In this example, t -> 5, h -> 5, e -> 5... And so on, but if you have some string:
String string2 = "hello there";
Then now, you want h to hash to something different, and you want e to hash to something different, so given the exact same input, you want a different value. This defeats the definition of a mathematical function.
What does this mean?
Well, without any aspect of determinism in your function, your data has no repeatable mapping between a value and the letter that is being hashed, meaning your data is meaningless.
If you have a constant length for the substrings you could do what many file-sharing programs do and use a list of hashes or something like the tiger-tree hash.
List of hashes: Make a hash for every chunk of the file of some pre-set length (say 64kB), then transmit a list of these hashes so these chunks can be verified.
Tiger-Tree hash: http://en.wikipedia.org/wiki/Merkle_tree#Tiger_tree_hash
Basically build a binary tree of hashes with the leaves being hashes of chunks like in a list of hashes.
If you need to match to every possible substring instead of just pre-defined chunks this isn't going to work though.
All substrings doesn't sound viable, but I imagine you may have some constraints on your substrings you haven't yet told us about.
If you do your substrings block-aligned or whitespace-aligned or something, you might look into using a bloom filter, EG: https://pypi.python.org/pypi/drs-bloom-filter/1.01 . Bloom Filters can store members of a set and be used for testing set membership, at times with as little as one bit per element. They do sometimes give false positives, but with a user-adjustable probability of a false positive.

hashmap remove complexity

So a lot of sources say the hashmap remove function is O(1), but I don't see how this could be unless a hashmap were backed by a linkedlist because list removals are O(n). Could someone explain?
You can view a Hasmap as an array. Imagine, you want to store objects of all humans on earth somewhere. You could just get an unique number for everyone and use an array with a dimension of 10*10^20.
If someone is born, she/he gets the next free number and is added to the end. If someone dies, her/his number is used and the array entry is set to null.
You can easily see, to add some or to remove someone, you need only constant time. calculate array address, done (if you have random access memory).
What is added by the Hashmap? There are 2 motivations. On the one side, you do not want to have such a big array. If you only want to store 10 people from all over the world, nearly all entries of the array are free. On the other side, not all data you want to store somewhere have an unique number. Sometimes there are multiple times the same number, some numbers do now show overall and sometimes you do not have any number. Therefore, you define a function, which uses the big numbers from the input and reduce them to numbers in a smaller range. This reduction should be in a way, that the resulting number is most likely unique for different inputs.
Example: Lets say you want to store 10 numbers from 1 to 100000000. You could use an array with 100000000 indices. Or you could use an array with 100 indices and the function f(x) = x % 100. If you have the number 1234, f(1234) = 34. Mark 34 as assigned.
Now you could ask, what happens if you have the number 2234? We have a collision then. You need some strategy then to handle this, there are several. Study some literature or ask specific questions for this.
If you want to store a string, you could imagine to use the length or the sum of the ascii value from every characters.
As you see, we can easily store something, and easily access it again. What we have to do? Calculate the hash from the function (constant time for a good function), access the array (constant time), store or remove (constant time).
In real world, a good hash function is not that easy. Try to stick with the included ones in java.
If you want to read more details, the wikipedia article about hash table is a good starting point: http://en.wikipedia.org/wiki/Hash_table
I don't think the remove(key) complexity is O(1). If we have a big hash table with many collisions, then it would be O(n) in worst case. It very rare to get the worst case but we can't neglect the fact that O(1) is not guaranteed.
If your HashMap is backed by a LinkedList buckets array
The worst case of the remove function will be O(n)
If your HashMap is backed by a Balanced Binary Tree buckets array
The worst case of the remove function will be O(log n)
The best case and the average case (amortized complexity) of the remove function is O(1)

Comparing long strings by their hashes

Trying to improve the performance of a function that compares strings I decided to compare them by comparing their hashes.
So is there a guarantee if the hash of 2 very long strings are equal to each other then the strings are also equal to each other?
While it's guaranteed that 2 identical strings will give you equal hashes, the other way round is not true : for a given hash, there are always several possible strings which produce the same hash.
This is true due to the PigeonHole principle.
That being said, the chances of 2 different strings producing the same hash can be made infinitesimal, to the point of being considered equivalent to null.
A fairly classical example of such hash is MD5, which has a near perfect 128 bits distribution. Which means that you have one chance in 2^128 that 2 different strings produce the same hash. Well, basically, almost the same as impossible.
In the simple common case where two long strings are to be compared to determine if they are identical or not, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by #wildplasser, the hash requires that all bytes of both strings must be traversed in order to calculate the two hash values, whereas the simple compare is fast, and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by #AdamLiss and #Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by #Cyan if the compare is to be done more than once, or must be stored for later use, then hash may be faster. A case not mentioned by others is if the strings are on different machines connected via a local network or the Internet. Passing a small amount of data between the two machines will generally be much faster. The simplest first check is compare the size of the two, if different, you're done. Else, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine) and again, if different you are done. If the hash values are the same, and if you must have absolute certainty, there is no easy shortcut to that certainty. Using lossless compression on both ends will allow less data to be transferred for comparison. And finally, if the two strings are separated by time, as alluded to by #Cyan, if you want to know if a file has changed since yesterday, and you have stored the hash from yesterday's version, then you can compare today's hash to it.
I hope this will help stimulate some "out of the box" ideas for someone.
I am not sure, if your performance will be improved. Both: building hash + comparing integers and simply comparing strings using equals have same complexity, that lays in O(n), where n is the number of characters.

How does a reduction function used with rainbow tables work?

I've carefully read about rainbow tables and can't get one thing. In order to build a hash chain a reduction function is used. It's a function that somehow maps hashes onto passwords. This article says that reduction function isn't an inverse of hash, it's just some mapping.
I don't get it - what's the use of a mapping that isn't even an inverse of the hash function? How should such mapping practically work and aid in deducing a password?
A rainbow table is "just" a smart compression method for a big table of precomputed hashes. The idea is that the table can "invert" a hash output if and only if a corresponding input was considered during the table construction.
Each table line ("chain") is a sequence of hash function invocations. The trick is that each input is computed deterministically from the previous output in the chain, so that:
by storing the starting and ending points of the chain, you "morally" store the complete chain, which you can rebuild at will (this is where a rainbow table can be viewed as a compression method);
you can start the chain rebuilding from a hash function output.
The reduction function is the glue which turns a hash function output into an appropriate input (for instance a character string which looks like a genuine password, consisting only of printable characters). Its role is mostly to be able to generate possible hash inputs with more or less uniform probability, given random data to work with (and the hash output will be acceptably random). The reduction function needs not have any specific structure, in particular with regards to how the hash function itself works; the reduction function must just allow keeping on building the chain without creating too many spurious collisions.
The reason the reduction function isn't the inverse of a hash is that the true inverse of a hash would not be a function (remember, the actual definition of "function" requires one output for one input).
Hash functions produce strings which are shorter than their corresponding inputs. By the pigeonhole principle, this means that two inputs can have the same output. If arbitrarily long strings can be hashed, an infinite number of strings can have the same output, in fact. Yet a rainbow table generally only keeps one output for each hash - so it can't be a true inverse.
The reduction function most rainbow tables use is "store the shortest string having this hash".
It doesn't matter if what it produces is the password: what you would get would also work as a password, and you could log in with it just as well as with the original password.

Resources