Comparing long strings by their hashes

While trying to improve the performance of a function that compares strings, I decided to compare them by comparing their hashes.
So is there a guarantee that if the hashes of two very long strings are equal to each other, then the strings are also equal to each other?

While it's guaranteed that two identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce the same hash.
This follows from the pigeonhole principle.
That being said, the chance of two different strings producing the same hash can be made infinitesimal, to the point of being considered negligible.
A fairly classical example of such a hash is MD5, which has a near-perfect 128-bit distribution. This means that two different strings have roughly one chance in 2^128 of producing the same hash, which for most practical purposes is as close to impossible as it gets.
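To make the trade-off concrete, here is a minimal Python sketch (the helper name digests_equal is mine; hashlib.md5 is the standard-library MD5):

import hashlib

def digests_equal(a: str, b: str) -> bool:
    # Equal digests almost certainly mean equal strings, but the pigeonhole
    # principle guarantees that distinct strings with the same digest exist.
    da = hashlib.md5(a.encode("utf-8")).digest()
    db = hashlib.md5(b.encode("utf-8")).digest()
    return da == db

If you must be absolutely certain, fall back to a plain == comparison whenever the digests match.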

In the simple common case where two long strings are to be compared to determine whether they are identical, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings be traversed in order to calculate the two hash values, whereas the simple compare only needs to traverse bytes until the first difference is found, which may be far fewer than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that the strings are identical, as pointed out by @AdamLiss and @Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by @Cyan, if the compare is to be done more than once, or the result must be stored for later use, then the hash may be faster.
A case not mentioned by others is when the strings are on different machines connected via a local network or the Internet. Passing a small amount of data between the two machines will generally be much faster. The simplest first check is to compare the sizes of the two; if they differ, you're done. Otherwise, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine), and again, if they differ, you are done. If the hash values are the same, and you must have absolute certainty, there is no easy shortcut to that certainty; using lossless compression on both ends will at least allow less data to be transferred for comparison.
And finally, if the two strings are separated by time, as alluded to by @Cyan: if you want to know whether a file has changed since yesterday, and you have stored the hash of yesterday's version, then you can compare today's hash to it.
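As a sketch of the "separated by space or time" cases (the function name, the use of SHA-256, and passing the remote size and digest as arguments are my assumptions, not part of the answer): compare sizes first, then compare a locally computed digest against a digest computed on the other machine or stored yesterday.

import hashlib
import os

def probably_same_file(local_path, remote_size, remote_sha256_hex):
    # Cheapest check first: different sizes mean different contents.
    if os.path.getsize(local_path) != remote_size:
        return False
    # Stream the file through SHA-256 so only a 64-character digest, not the
    # whole file, needs to travel between machines (or be stored overnight).
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == remote_sha256_hex

A True result means "almost certainly the same"; only a full byte-for-byte transfer and compare can give absolute certainty.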
I hope this will help stimulate some "out of the box" ideas for someone.

I am not sure your performance will be improved. Both approaches, building a hash and then comparing integers, and simply comparing the strings with equals, have the same complexity, which lies in O(n), where n is the number of characters.

Related

Find substring in stream without storing substring in plain text

Let's say I have a large stream of data (for example packets coming in from a network), and I want to determine whether this data contains a certain substring. There are multiple string-searching algorithms, but they require the algorithm to know the plain-text string it is searching for.
Let's say the string being sought is a password, and you do not want to store it in plain text in this search application. It would, however, appear in the stream as plain text. You could, for example, store the hash and length of the password. Then, for every byte in the stream, check whether the next length bytes of the stream hash to the stored password hash; if they do, you have a probable match.
That way you can determine whether the password was in the stream without knowing the password. However, hashing once for every byte is not fast or efficient.
Is there perhaps a clever algorithm that could find the plain-text password in the stream without directly knowing the plain-text password (and instead only some non-reversible equivalent)? Alternatively, could a low-quality version of the password be used, at the risk of false positives? For example, if the search application only knew half the password (in plain text), it could detect the full password, with some error, without knowing it.
Thanks.
P.S. This question comes from a hypothetical discussion I had with some friends, about alerting you if your password was spotted in plain text on a network.
You could use a low-entropy rolling hash to pre-screen each byte so that, for the cost of lg k bits of entropy, you reduce the number of invocations of the cryptographic hash by a factor of k.
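A rough sketch of that pre-screening idea in Python (the polynomial rolling hash, the 16-bit modulus, and SHA-256 as the confirming hash are my choices for illustration, not part of the answer). The stored rolling fingerprint deliberately leaks only about 16 bits of the secret, and the expensive cryptographic hash runs only on windows that pass the cheap filter:

import hashlib

BASE, MOD = 257, 1 << 16   # tiny modulus: the pre-screen leaks roughly 16 bits

def fingerprints(secret: bytes):
    # What the detector stores: the secret's length, a low-entropy rolling
    # hash, and a full cryptographic hash. The plain text itself is discarded.
    r = 0
    for b in secret:
        r = (r * BASE + b) % MOD
    return len(secret), r, hashlib.sha256(secret).digest()

def seen_in_stream(stream: bytes, length: int, roll: int, digest: bytes) -> bool:
    if len(stream) < length:
        return False
    top = pow(BASE, length - 1, MOD)
    r = 0
    for b in stream[:length]:
        r = (r * BASE + b) % MOD
    for i in range(len(stream) - length + 1):
        # Cryptographic hash only on the (roughly 1-in-65536) pre-screen hits.
        if r == roll and hashlib.sha256(stream[i:i + length]).digest() == digest:
            return True
        if i + length < len(stream):
            r = ((r - stream[i] * top) * BASE + stream[i + length]) % MOD
    return False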
SAT is an NP-hard problem. Suppose your password is n characters long. If you could find a way to make a large enough SAT instance that
used a contiguous sequence of m >= n bytes from the data stream as its 8m input bits, and
produced the output 1 if and only if the bits present at its inputs contain your password starting at an offset that is some multiple of 8 bits
then by "operating" this SAT instance as a circuit, you would have a password detector that is (at least potentially) very difficult to "invert".
In some ways, what you want is the opposite of Boolean logic minimisation. You want the biggest, hairiest circuit (ideally for some theoretically justified notions of size and hairiness :) ) that computes the truth table. It's easy enough to come up with truth-table-preserving ways to grow the original CNF propositional logic formula -- e.g., if you have two clauses A and B, then you can always safely add a new clause consisting of all the literals in either A or B -- but it's probably much harder to come up with ways to grow the formula in ways that will confuse a modern SAT solver, since a lot of research has gone into making these programs super-efficient at detecting and exploiting all kinds of structure in the problem.
One possible avenue for injecting "complications" is to make the circuit compute functions that are difficult for circuits to compute, like divisions or square roots, and then test the results of these for equality in addition to the raw inputs. E.g., instead of making the circuit merely test that X[1 .. 8n] = YOUR_PASSWORD, make it test that X[1 .. 8n] = YOUR_PASSWORD AND sqrt(X[1 .. 8n]) = sqrt(YOUR_PASSWORD). If a SAT solver is smart enough to "see" that the first test implies the second then it can immediately dispense with all the clauses corresponding to the second -- but since everything is represented at a very low level with propositional clauses, this relationship is (I hope; as I said, modern SAT solvers are pretty amazing) well obscured. My guess is that it's better to choose functions like sqrt() that are not one-to-one on integers: this will potentially cause a SAT solver to waste time exploring seemingly promising (but ultimately incorrect) solutions.

Why is it called rainbow table?

Does anyone know why it is called a rainbow table? I just remembered we learned about an attack called a "dictionary attack". Why is it not called a dictionary?
Because it contains the entire "spectrum" of possibilities.
A dictionary attack is a brute-force technique of just trying possibilities, like this (Python pseudocode):
# candidate guesses, e.g. loaded from a word list
mypasswordlist = ["123456", "password", "letmein"]
for password in mypasswordlist:
    trypassword(password)   # trypassword stands for the actual login/verification attempt
However, a rainbow table works differently, because it's for inverting hashes. A high-level overview of a hash is that it has a number of bins:
bin1, bin2, bin3, bin4, bin5, ...
which correspond to binary parts of the output string; that's how the string ends up the length it is. As the hash proceeds, it affects different parts of the bins in different ways. So the first byte of input (or whatever input unit is accepted) affects, say, simplistically, bins 3 and 4. The next input affects bins 2 and 6. And so on.
A rainbow table is a computation of all the possibilities of a given bin, i.e. all the possible inverses of that bin, for every bin... that's why it ends up so large. If the first bin value is 0x1, then you need a lookup list of all the values of bin2 and all the values of bin3, working backwards through the hash, which eventually gives you a value.
Why isn't it called a dictionary attack? Because it isn't.
As I've seen your previous question, let me expand on the detail you're looking for there. A cryptographically secure hash needs to be safe ideally from smallish input sizes up to whole files. To precompute the values of a hash for an entire file would take forever. So a rainbow table is designed on a small well understood subset of outputs, for example the permutations of all the characters a-z over a field of say 10 characters.
This is why password advice for defeating dictionary attacks works here. The more subsets of the whole possible set of inputs you put into your input for the hash, the more a rainbow table needs to contain to search it. The data sizes required end up stupidly big and so does the time to search. So, think about it:
If you have an input that is [a-z] for 5-8 characters, that's not too bad a rainbow table.
If you increase the length to 42 characters, that's a massive rainbow table. Each input affects the hash and so the bins of said hash.
If you throw numbers in to your search requirement [a-z][0-9] you've got even more searching to do.
Likewise [A-Za-z0-9]. Finally, stick in [\w] i.e. any printable character you can think of, and again, you're looking at a massive table.
So, making passwords long and complicated makes rainbow tables start taking Blu-ray-sized discs of data. Then, as per your previous question, you start adding in salting and hash-derived functions and you make a general solution to hash cracking hard(er).
The goal here is to stay ahead of the computational power available.
Rainbow is a variant of a dictionary attack (a pre-computed dictionary attack, to be exact), but it takes less space than a full dictionary (at the price of the time needed to find a key in the table). The other end of this time-memory tradeoff is full search (brute-force attack: zero precomputation, a lot of time).
In a rainbow table the precomputed dictionary of key-ciphertext pairs is compressed into chains. Every step in a chain is done using a different reduction function, and the table has a lot of chains, so it looks like a rainbow.
In the original illustration (not reproduced here), the different reduction functions K1, K2, K3 are drawn in different colours, like a rainbow.
The table stored in the file contains only the first and last columns, as the middle columns can be recomputed.
I don't know where the name comes from, but the differences are:
A dictionary contains a few selected items (e.g. English words), while a rainbow table contains every possible combination.
A dictionary only contains the input, while the rainbow table contains both the input and the output.
A dictionary is used to test different inputs to see if the output is valid, while a rainbow table is used for a reverse lookup, i.e. to find which input gives a specific output.
Unfortunately some of the statements are not correct. Contrary to what is being posted, rainbow tables DO NOT contain all the possibilities for a given keyspace, or at least not the ones generated for use that I've seen. They can be generated to cover, say, 99.9% of a keyspace, but due to the randomness of the hash function there is no guarantee that EVERY plaintext is covered.
Each chain is made up of links, or steps, and each step consists of a hashing and a reduction function. If your chain were 100 links long, you would go through that many hash/reduction steps, then discard everything in between except the start and end points.
To find the plaintext for a given hash, you simply perform the reduction/hash step up to as many times as your chain is long. You run the step once and check against the endpoints; if it's a miss, you repeat... until you have stepped through the entire length of your chain. If there is a match, you can then regenerate the chain from its start point, and you may be able to find the plaintext. If after the regeneration it is not correct, then this was a false alarm, which happens due to collisions caused by the reduction function. Since the table contains many chains, you can do a bulk lookup against all the chain endpoints at each step; this is essentially where the magic happens, allowing speed. False alarms will still occur, but since you only need to regenerate chains which have endpoint matches, you save lots of time by skipping unnecessary chains.
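To make the chain mechanics concrete, here is a toy sketch (MD5, a six-character lowercase plaintext space, the reduction function, and the chain length are all arbitrary choices of mine):

import hashlib

CHARSET = "abcdefghijklmnopqrstuvwxyz"
PLAIN_LEN, CHAIN_LEN = 6, 1000

def reduce_hash(digest: bytes, step: int) -> str:
    # Step-dependent reduction: maps a hash value back into the plaintext space.
    n = int.from_bytes(digest, "big") + step
    out = []
    for _ in range(PLAIN_LEN):
        n, r = divmod(n, len(CHARSET))
        out.append(CHARSET[r])
    return "".join(out)

def build_chain(start: str) -> str:
    p = start
    for step in range(CHAIN_LEN):
        p = reduce_hash(hashlib.md5(p.encode()).digest(), step)
    return p   # only (end, start) pairs are stored: table[build_chain(s)] = s

def lookup(table: dict, target_digest: bytes):
    # Assume the target appears at position pos of some chain, walk it forward
    # to an endpoint, and on an endpoint hit regenerate the chain from its
    # start to recover the plaintext (or rule it out as a false alarm).
    for pos in range(CHAIN_LEN):
        p = reduce_hash(target_digest, pos)
        for step in range(pos + 1, CHAIN_LEN):
            p = reduce_hash(hashlib.md5(p.encode()).digest(), step)
        if p in table:
            cand = table[p]
            for step in range(pos):
                cand = reduce_hash(hashlib.md5(cand.encode()).digest(), step)
            if hashlib.md5(cand.encode()).digest() == target_digest:
                return cand
    return None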
They do not contain dictionaries... well, not the traditional tables; there are variants of rainbow tables which incorporate the use of dictionaries, though.
That's about it. There are many ways in which this process has been optimized, including removing merging/duplicate chains, creating perfect tables, and storing them in different packings to save space and loading time.

How safely can I assume unicity of a part of SHA1 hash?

I'm currently using SHA1 to somewhat shorten a URL:
Digest::SHA1.hexdigest("salt-" + url)
How safe is it to use only the first 8 characters of the SHA1 as a unique identifier, as GitHub apparently does for commits?
To calculate the probability of a collision with a given length and the number of hashes that you have, see the birthday problem. I don't know the number of hashes that you are going to have, but here are some examples. 8 hexadecimal characters is 32 bits, so for about 100 hashes the probability of a collision is about 1/1,000,000, for 10,000 hashes it's about 1/100, for 100,000 it's 3/4 etc.
See the table in the Birthday attack article on Wikipedia to find a good hash length that would satisfy your needs. For example if you want the collision to be less likely than 1/1,000,000,000 for a set of more than 100,000 hashes then use 64 bits, or 16 hexadecimal digits.
It all depends on how many hashes you are going to have and what probability of a collision you are willing to accept (because there is always some probability, even if insanely small).
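If you want to reproduce those figures, the standard birthday approximation is easy to compute (the function name is mine):

import math

def collision_probability(num_hashes: int, hash_bits: int) -> float:
    # Birthday approximation: p is about 1 - exp(-n*(n-1) / 2^(bits+1))
    return 1.0 - math.exp(-num_hashes * (num_hashes - 1) / 2.0 ** (hash_bits + 1))

for n in (100, 10_000, 100_000):
    print(n, collision_probability(n, 32))   # roughly 1e-6, 1e-2, 0.7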
If you're talking about a SHA-1 in hexadecimal, then you're only getting 4 bits per character, for a total of 32 bits over 8 characters. By the birthday bound, you can expect a collision after roughly the square root of that many values, i.e. around 65,536 shortened URLs. If your URL shortener gets used much, it probably won't take terribly long before you start to see collisions.
As for alternatives, the most obvious is probably to just maintain a counter. Since you need to store a table of URLs to translate your shortened URL back to the original, you basically just store each new URL in your table. If it was already present, you give its existing number. Otherwise, you insert it and give it a new number. Either way, you give that number to the user.
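A minimal sketch of that counter approach (plain in-memory dictionaries stand in for whatever table you actually persist):

class Shortener:
    def __init__(self):
        self._id_by_url = {}   # original URL -> numeric id
        self._url_by_id = {}   # numeric id -> original URL

    def shorten(self, url: str) -> int:
        # Reuse the existing id if the URL was seen before; otherwise hand out
        # the next counter value. Collisions are impossible by construction.
        if url in self._id_by_url:
            return self._id_by_url[url]
        new_id = len(self._url_by_id)
        self._id_by_url[url] = new_id
        self._url_by_id[new_id] = url
        return new_id

    def expand(self, short_id: int) -> str:
        return self._url_by_id[short_id]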
It depends on what you are trying to accomplish. The output of SHA1 is effectively random with regards to the input (the output of a good hash function changes in half of its bits based on a one-bit change in the input, and SHA1, while not perfect, is pretty good), and by taking a 32-bit (assuming 8 hex digits) subset of the 160-bit output, you reduce the output space from 2^160 to 2^32 values. All things being equal, which they never are, this would significantly reduce the difficulty of finding a collision.
However, if the hash function's input must be a valid URL, that significantly reduces the number of possible inputs. @rsp points out the birthday problem, but given this, I'm not sure exactly how applicable it is, at least in its simple form. Also, it largely assumes that there are no other precautions in place.
I would be more interested in why you are doing this. Is this about URLs that the user will need to remember and type? If so, tacking on a bunch of random hexadecimal digits is probably a bad idea. Is it a URL or URL parameter that will just be passed around programmatically? Then, I wouldn't care much about length. Either way, there are probably better ways to do what you are trying to accomplish.
If you use a binary output for SHA1 and Base64 encode the result, you will get much higher information density per character; you can have the same 8-character names, but rather than only 16^8 (2^32) possibilities, you'll have 64^8 (2^48) possibilities.
Using the assumption that the 50% probability-of-collision scales with 1.177*sqrt(N), using a Base64-style encoding will require 256 times more inputs than the hex-output before reaching the 50% chance of collision probability.
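For example (base64.urlsafe_b64encode is the standard-library encoder; the salt prefix and the 8-character truncation mirror the question):

import base64
import hashlib

def short_token(url: str) -> str:
    digest = hashlib.sha1(("salt-" + url).encode("utf-8")).digest()
    # 8 URL-safe Base64 characters carry 48 bits, versus 32 bits for 8 hex digits.
    return base64.urlsafe_b64encode(digest)[:8].decode("ascii")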

Comparing string distance based on precomputed hashes

I have a large list (over 200,000) of strings that I'd like to compare to a given string.
The given string is inserted by a user, so it may be slightly incorrect.
What I was hoping to do was create some kind of precomputed hash on each string on adding it to the list. This hash would contain information such as string length, addition of all the characters etc.
My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?
Or maybe there's a third option I haven't thought of yet?
Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.
Another thought - if you estimate that the probability of an incorrect entry is low, then you might actually be fine having a direct hit 99.9% of the time, falling back to SOUNDEX which might catch 90% of the remaining cases and then searching the whole list for the remaining 0.01% of the time.
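One way to realise the precomputed-hash idea from the question (length plus a character-count signature is my choice of cheap features; the pruning bound below is a valid Levenshtein lower bound, so it only skips strings that genuinely cannot be within the allowed distance):

from collections import Counter

def signature(s: str):
    # Cheap, precomputable features: length and character counts.
    return len(s), Counter(s)

def lower_bound(sig_a, sig_b) -> int:
    # One edit changes the length by at most 1 and the character multiset by
    # at most 2, so this never exceeds the true Levenshtein distance.
    (len_a, cnt_a), (len_b, cnt_b) = sig_a, sig_b
    multiset_diff = sum((cnt_a - cnt_b).values()) + sum((cnt_b - cnt_a).values())
    return max(abs(len_a - len_b), (multiset_diff + 1) // 2)

def candidates(query: str, corpus_sigs, max_distance: int):
    q = signature(query)
    # Only the survivors need the full (expensive) Levenshtein computation.
    return [i for i, sig in enumerate(corpus_sigs) if lower_bound(q, sig) <= max_distance]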
Also worth checking this discussion:
How to find best fuzzy match for a string in a large string database

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to that question that hash algorithms on strings can be computed in constant time, with one SO member with 25 years of programming experience claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a bound on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically is whether a constant-time hash method exists for strings of arbitrary length, as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
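A sketch of such a bounded hash (FNV-1a restricted to at most the first 20 characters; the constants are the standard 64-bit FNV parameters):

def bounded_hash(s: str, limit: int = 20) -> int:
    # O(1) with respect to len(s), at the cost of colliding on strings that
    # share a long common prefix.
    h = 0xcbf29ce484222325            # FNV-1a 64-bit offset basis
    for ch in s[:limit]:
        h ^= ord(ch)
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF   # FNV prime, kept to 64 bits
    return h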
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then along comes someone who tries to hash an array of URLs, and the hash function sees "http:/" for every single string.
Similar scenarios may occur for other character-selection schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function cannot separate inputs that it has not read, so I wouldn't take the claim that "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If such a collision ever occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
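In Python that corresponds roughly to the following (sys.intern is the standard interning call; hashing id() stands in for hashing the pointer, and in CPython id() is the object's address):

import sys

a = sys.intern("some very long string " * 1000)
b = sys.intern("some very long string " * 1000)

# After interning, equal strings are the same object, so an identity check and
# a hash of the object's id are both constant time regardless of string length.
assert a is b
bucket_key = hash(id(a))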
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A = {s : h(s) = y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h′ and hash the keys in the set A. There is at least one hash value y′ such that the set A′ = {s in A : h′(s) = y′} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times; repeat it H times, and you have an infinite set of strings where all strings collide over all of your H hash functions. QED.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/
