How to hash variable-length strings

I am very much a beginner in encryption/hashing, and I want to know how to hash a variable-length string (say 10 or 100 letters) to a fixed-length code, e.g. a 128-bit value, regardless of the underlying programming language, while keeping collisions spread relatively evenly among the bins.
Specifically, how do I deal with inputs of different lengths, and how do I make the hash code evenly distributed?

There are many different ways to do this.
For non-cryptographic applications, it's common to hash strings by iterating over the characters in sequence and applying some operation to mix in the bits of the new character with the accumulated hash bits. There are many variations on how exactly you'd carry this out. One common approach is shown here:
unsigned int kSmallPrime = /* some small prime */;
unsigned int kLargePrime = /* some large prime */;

unsigned int result = 0;
for (char ch : string) {
    result = (result * kSmallPrime + ch) % kLargePrime;
}
More complex combination steps are possible to get better distributions. These approaches don't require the string to have any specific length; they work for strings of any length. The number of bits you get back depends on what internal storage you use for mixing up the bits, though there's not necessarily a strong theoretical reason (other than empirical evidence) to believe that you get a good distribution.
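As a concrete illustration of the above, here is a minimal compilable sketch that widens the output to 128 bits by running two independent accumulators with different multipliers; the specific constants and the function name hash128 are illustrative choices, not anything canonical:

#include <cstdint>
#include <string>
#include <utility>

// Two independent 64-bit polynomial accumulators; together they give 128 bits.
// The multipliers are arbitrary odd constants, chosen purely for illustration.
std::pair<uint64_t, uint64_t> hash128(const std::string& s) {
    uint64_t lo = 0, hi = 0;
    for (unsigned char ch : s) {
        lo = lo * 1000003u + ch;          // lane 1
        hi = hi * 1099511628211ULL + ch;  // lane 2 (FNV-style multiplier)
    }
    return {lo, hi};
}

The two lanes only widen the output; they do not make the hash cryptographically strong.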
For cryptographic applications, string hash functions are often derived from block ciphers. Constructions like Merkle-Damgard let you start with a secure block cipher and produce a secure hash function. They work by padding the string up to some multiple of the block size using a secure padding scheme (one that ensures that different strings end up different after padding), breaking the string apart into blocks, and hashing them in a chain. The final output is then derived from the underlying block cipher, which naturally outputs a large number of bits, and the nice distribution comes from the strength of the underlying block cipher, which (in principle) should be indistinguishable from random.
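As a rough, deliberately insecure illustration of that pad-split-chain structure (the compression function below is a toy stand-in; a real construction would derive it from a secure block cipher or use a dedicated compression function), the flow looks roughly like this:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Toy compression function: mixes an 8-byte block into the running state.
// A real Merkle-Damgard hash would use a cryptographic compression function.
uint64_t compress(uint64_t state, uint64_t block) {
    uint64_t x = state ^ block;
    x *= 0x9E3779B97F4A7C15ULL;   // arbitrary odd constant, illustration only
    x ^= x >> 29;
    return x;
}

uint64_t toy_md_hash(const std::string& msg) {
    // Merkle-Damgard strengthening: append 0x80, zero-pad to a block boundary,
    // then append the message length, so distinct messages stay distinct.
    std::vector<unsigned char> data(msg.begin(), msg.end());
    data.push_back(0x80);
    while (data.size() % 8 != 0) data.push_back(0x00);
    uint64_t len = msg.size();
    for (int i = 0; i < 8; ++i) data.push_back((len >> (8 * i)) & 0xFF);

    // Process the padded message block by block, chaining the state.
    uint64_t state = 0x0123456789ABCDEFULL;  // fixed initial value (IV)
    for (size_t off = 0; off < data.size(); off += 8) {
        uint64_t block;
        std::memcpy(&block, &data[off], 8);
        state = compress(state, block);
    }
    return state;
}

Real constructions differ in block size, initial value, and how the compression function is built, but the overall shape (pad so that distinct messages stay distinct, then chain block by block) is the same.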

Related

How is SHA unique?

I am trying to understand SHA uniqueness in simple terms.
For example, let us assume there are only messages with a maximum length of 4 bits (binary) in the whole world. The number of possible messages of the different lengths is
2 for single bit length
2^2 for double bit length
2^3 for 3 bit length
2^4 for 4 bit length
that would be 2+4+8+16 = 30 (31 if we consider empty message 2^0 = 1)
Let us consider SHA-3 (for example) with an output length of 3 bits (binary), so the maximum possible number of digests is 8.
How can a digest be unique if we need to map 30 messages to 8 digests, and why is it hard to find a digest collision for 2 distinct messages?
I'm not sure what you mean by "SHA uniqueness". An SHA value (any version) is not unique, it cannot be, because it maps an infinite number of inputs (an input of any length) to a finite number of outputs.
A cryptographic hash function has three important properties (which make it a crypto hash, over a regular hash):
strong collision resistance: it is very difficult (computationally infeasible, i.e. "not practically possible") to find two inputs that produce the same output (even if you can choose both)
weak collision resistance: for a given input, it is computationally infeasible to find another input that gives the same hash value (one input is fixed, and you try to find a second one that matches its output)
preimage resistance: for a hash value, it's computationally infeasible to find an input that produces that output (it's "one-way")
The only problem in your example is the size. With such small numbers it doesn't make sense, of course. But if the hash value is, say, 512 bits, finding a collision suddenly gets really time-consuming and hence practically impossible to brute force.
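To make the pigeonhole argument from the question concrete, here is a small sketch (with a made-up 3-bit toy hash, purely for illustration) that enumerates every message of 1 to 4 bits and counts how many land on each of the 8 possible digests:

#include <array>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Enumerate all bit strings of length 1..4 (30 messages in total).
    std::vector<std::string> messages;
    for (int len = 1; len <= 4; ++len)
        for (int v = 0; v < (1 << len); ++v) {
            std::string m;
            for (int i = len - 1; i >= 0; --i) m += ((v >> i) & 1) ? '1' : '0';
            messages.push_back(m);
        }

    // Toy 3-bit "hash": any function at all must map 30 messages into 8 bins.
    std::array<int, 8> bucket{};
    for (const std::string& m : messages) {
        unsigned h = 0;
        for (char c : m) h = h * 31 + c;   // arbitrary mixing, illustration only
        ++bucket[h % 8];
    }
    for (int b = 0; b < 8; ++b)
        std::printf("digest %d: %d messages\n", b, bucket[b]);
}

With 30 messages and only 8 digests, some digest must cover at least four messages, no matter how clever the hash function is.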
"SHA3 which has digest length of 3bits"
I think this question is based on one bit misunderstanding. SHA-3 is a family of hashes that has the same output bit size as SHA-2. SHA-2 has bit sizes 224, 256, 384 or 512 for SHA-224, SHA-256, SHA-384 and SHA-512 respectively.
Of course, SHA-2 already took those identifiers, so SHA-3 will have SHA3-224, SHA3-256, SHA3-384 and SHA3-512. There were some proposals to use a different acronym, but those failed.
Still, SHA-3 hashes accept practically unlimited input, so there will be many inputs that map to the same hash value. However, since it is not practically possible to reverse any SHA-3 algorithm, it should be infeasible to find a collision. That is, unless SHA-3 is broken, as it is not provably secure.
Any SHA3 variant will have digests with more than 100 bits. The terminology has probably confused you, because SHA256 has 256 bits, while SHA3 is considered the third generation of SHA algorithms (and does NOT have 3 bits of length).
Generally speaking, it's not hard to find a hash collision by brute-forcing (alas, it's time-consuming); what is difficult is producing a collision that is also meaningful in its context. For example, assume you have a source file for an important application that hashes to a digest. If an attacker tried to alter the source file in a way that introduces a vulnerability while also hashing to the same digest, he'd have to introduce a lot of random gibberish, making the attack obvious.

How to uniquely identify a set of strings using an integer

Here is my problem statement:
I have a set of strings that match a regular expression. let's say it matches [A-Z][0-9]{3} (i.e. 1 letter and 3 digits).
I can have any number of strings between 1 and 30. For example I could have:
{A123}
{A123, B456}
{Z789, D752, E147, ..., Q665}
...
I need to generate an integer (actually I can use 256 bits) that would be unique for any set of strings regardless of the number of elements (although the number of elements could be used to generate the integer)
What sort of algorithm could I use?
My first idea would be to convert my strings to numbers and then do operations on them (I thought of hash functions), but I am not sure what formula would give me good results.
Any suggestion?
You have 2^333 possible input sets ((26 * 10^3) choose 30).
This means you would need a 333 bit wide integer to represent all possibilities. You only have a maximum of 256 bits, so there will be collisions.
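If you want to sanity-check that figure, here is a quick back-of-the-envelope sketch using the log-gamma function to evaluate the binomial coefficient:

#include <cmath>
#include <cstdio>

// log2 of (n choose k), computed via the log-gamma function.
double log2_choose(double n, double k) {
    return (std::lgamma(n + 1) - std::lgamma(k + 1) - std::lgamma(n - k + 1))
           / std::log(2.0);
}

int main() {
    // 26 letters * 1000 digit combinations = 26000 possible strings,
    // sets of 30 of them: prints roughly 332, i.e. on the order of 2^333.
    std::printf("%.1f\n", log2_choose(26000, 30));
}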
This is a typical application for a hash function. There are hashes for various purposes, so it's important to select the right type:
A simple hash function for use in bucket-based data structures (dictionaries) must be fast. Collisions are not only tolerated but expected. The hash's size (in bits) is usually small. Due to collisions, this type of hash is not suited for your purpose.
A checksum tries to avoid collisions and is reasonably fast. If it's large enough this might be enough for your case.
Cryptographic hashes have the characteristic that it's not possible (or very hard) to find a collision (even when both input and hash are known). Also they are not invertible (from the hash it's not possible to find the input). These are usually computationally expensive and overkill for your use case.
Hashes meant to uniquely identify arbitrary inputs, like CityHash and SpookyHash, are designed for fast hashing and practically collision-free identification.
SpookyHash seems like a good candidate for your use case. It's 128 bits wide, which means that you need 2^64 differing inputs to get a 50% chance of a single collision.
It's also fast: at about three bytes per cycle it is roughly an order of magnitude faster than MD5 or SHA-1. SpookyHash is available in the public domain (see link above).
To apply any hash to your use case you could convert the items in your set to numbers, but it seems easier to just feed them in as strings. You have to settle on an encoding in this case (ASCII would do).
I usually use UTF-8 when i18n is an issue; then it's sometimes important to take care of canonicalization, but that does not apply to your simple use case.
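A minimal sketch of that idea, assuming the set is unordered (so the elements are sorted into a canonical order first) and using a simple 64-bit FNV-1a as a stand-in where you would plug in SpookyHash or another 128-bit hash:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a, 64-bit: a stand-in; swap in SpookyHash/CityHash for 128-bit output.
uint64_t fnv1a(const std::string& bytes) {
    uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

uint64_t hash_set(std::vector<std::string> items) {
    // Sort so that element order inside the set does not change the result,
    // then join with a separator that cannot appear in "[A-Z][0-9]{3}" items.
    std::sort(items.begin(), items.end());
    std::string joined;
    for (const std::string& s : items) {
        joined += s;
        joined += '\n';
    }
    return fnv1a(joined);
}

Called as hash_set({"A123", "B456"}), it returns the same value regardless of the order in which the elements were supplied.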
A hash is not going to work, since it could produce collisions. Every significant input bit must be mapped to an output bit.
For the letter, you have 26 different values ('A' through 'Z', character codes 65 to 90), so you can use 5 bits to represent the letter.
The 3-digit number has 1000 different values, so you need 10 bits for this.
If you combine these bits, you have a unique mapping from the input to a 15-bit number.
This approach is simple, but it could waste some bits. If the output must be as short as possible, you could map as follows:
output = (L - 'A')*1000 + N
where L is the letter value, 'A' is the value of the letter A, and N is the 3-digit number. Then you can use as few bits as are necessary to represent the complete range of output, whose maximum is 25*1000 + 999 = 25999. Here it is 15 bits again, so the simple approach does not waste space.
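A small sketch of that mapping (the function name encode is just for illustration; no input validation is done):

#include <cstdint>
#include <string>

// Map a string matching [A-Z][0-9]{3} to a number in 0..25999 (fits in 15 bits).
uint16_t encode(const std::string& s) {
    uint16_t letter = s[0] - 'A';                       // 0..25
    uint16_t number = (s[1] - '0') * 100
                    + (s[2] - '0') * 10
                    + (s[3] - '0');                     // 0..999
    return letter * 1000 + number;
}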
If there are fewer output bits than input bits, a hash function is needed. I would strongly recommend mapping the strings to binary data as above and using a simple function to map the input to the output, for this reason:
A general-purpose hash function can not differentiate the input bits, because it knows nothing about their meaning.
For 256 output bits, after hashing 5.7e38 values, the chance of a collision is 75%. Source: Birthday Attack.
5.7e38 seems huge, but it corresponds to only about 129 bits (2^129 ≈ 6.8e38). In this case it means that among all inputs of 9 elements (129/15 ≈ 8.6, rounded up) there is a chance of over 75% that some pair of them collides.
On the other hand, if you use a very simple mapping function like:
truncate the input to 256 bits (use the first 17 elements of 15 bits each)
make a 256 bit xor value of all the 15-bit elements
you can guarantee there is no collision between any two strings with at most 17 elements.
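For the truncation/concatenation variant, here is a sketch of packing up to 17 encoded elements (15 bits each, reusing the encode mapping sketched above) into a 256-bit value held as four 64-bit words. Storing value + 1 is a small tweak so that an absent element (0) cannot be confused with the element A000, and the elements are assumed to already be in a canonical (e.g. sorted) order:

#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

uint16_t encode(const std::string& s);   // the [A-Z][0-9]{3} mapping sketched earlier

// Pack up to 17 encoded elements (15 bits each) into 255 of the 256 available bits.
std::array<uint64_t, 4> pack_set(const std::vector<std::string>& items) {
    std::array<uint64_t, 4> out{};                  // 256 bits, zero-initialized
    size_t bit = 0;
    for (const std::string& s : items) {            // assumes items.size() <= 17
        uint64_t v = uint64_t(encode(s)) + 1;       // 1..26000, still fits in 15 bits
        out[bit / 64] |= v << (bit % 64);
        if (bit % 64 > 49)                          // value straddles a 64-bit word
            out[bit / 64 + 1] |= v >> (64 - bit % 64);
        bit += 15;
    }
    return out;
}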
The hash functions which are optimized for generating unique IDs likely perform better than a general-purpose hash in the comparison above, but I doubt that they can guarantee collision-free hashing of all 256-bit values.
Conclusion: if most of the inputs have fewer than 17 elements, I would prefer this approach to a hash.

Hashing and 'brute-force' permutations

So this is a two-part question:
Are there any hashing functions that guarantee that, for any two different inputs of the same length, they generate different hashes? As I remember, most are that way, but I just need to confirm this.
Building on the first question: given a file's hash and its length, is it then theoretically possible to 'brute-force' all byte permutations of that length until the same hash is generated, i.e. until the original file has been recreated?
PS. I am aware that this will take ages (if theoretically possible), but I think it would be feasible for small files (sizes < 1KB)
1 KB, that'd be 256^1000, right? 1000 bytes, with 256 possible values each. It's a really big number: roughly a 1 with 2408 zeros behind it.
If you were to generate all of them, one would be the right one, but you'd have some number of collisions.
According to this security.SE post, the collision rate for MD5 (for example) is about 1 in 2^64. So, if we divide our original number by that, we'd get how many possible combinations remain, right?
~10^2389
That is still a lot of files to check.
I'd feel a lot better if someone critiqued my math here, but the point is that your first point is not true because of collisions. You can use the same sort of math for calculating two 1000-character passwords having the same hash. It's the birthday problem. Given 2 people, it is unlikely that they'd have the same birthday, but if you take a room full of people, the probability of any two people having the same birthday increases very quickly. If you take all 1000-character passwords, some of them are going to collide. You are going from X bytes to 16 bytes. You can't fit all of the combinations into 16 bytes.
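If you want to play with the birthday-problem intuition, here is a tiny sketch computing the exact collision probability for n uniform samples drawn from d possible values (23 people and 365 birthdays give about 50.7%):

#include <cstdio>

// Probability that at least two of n uniform samples from d values collide.
double birthday(double n, double d) {
    double p_distinct = 1.0;
    for (int i = 0; i < n; ++i) p_distinct *= (d - i) / d;
    return 1.0 - p_distinct;
}

int main() {
    std::printf("23 people, 365 days: %.3f\n", birthday(23, 365));   // ~0.507
}

With 2^128 possible MD5 digests in place of 365 birthdays, the same curve crosses 50% at roughly 2^64 inputs, which is where the 1-in-2^64 figure above comes from.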
Expanding upon the response to your first point: one of the goals of cryptographic hash functions is unpredictability. A function with zero collisions is a 1-1 (or one-to-one) function, so called because every input has exactly one output and no two inputs share an output.
In order for a function to accept arbitrary length & complexity inputs without generating a collision, it is easy to see that the function must have arbitrary length outputs. As Gray obliquely points out, most hash functions have fixed-length outputs. (There are apparently some new algorithms that support arbitrary length outputs, but they still don't guarantee 0 collisions.) The reason is not stated clearly in the common crypto literature, but consider the difference between hashing and encrypting.
In hashing, you have the message (the unaltered original) and the message digest (the output of the hash function; "digest" here means "a summation or condensation of a body of information").
With encryption, you have the plain text and the cipher text. The implication is that the cipher text is of the same length and complexity as the original.
The way I look at it, a cryptographic hash function with 0 collisions would be of comparable complexity to encryption. (Note that I'm unsure what the advantages of a variable-length hash output are, so I asked a question about it.)
Additionally, hash functions are susceptible to attacks by pre-computed rainbow tables, which is why password-hashing schemes that are still considered secure employ extra random inputs, called salts. The reason encryption isn't susceptible to a similar attack is that the encryption key is kept secret and you can't pre-compute output values without knowing the key. Compare symmetric key encryption (where there is one key that must be kept secret) with public key encryption (where the encryption key is public and the decryption key is private).
The other thing that protects encryption from pre-computation attacks is that the number of computations for arbitrary-length inputs grows exponentially, and it is literally impossible to store the output for every input you may be interested in.

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to that question that hashing strings can be done in constant time, with one SO member with 25 years of programming experience claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a bound on the string length, i.e. some constant K dictates the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically is whether a constant-time hash method exists for strings of arbitrary length, as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
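A rough sketch of that sampling idea (illustrative only; the constants 10 and 100 come from the suggestion above, and strings short enough to hash in full are simply hashed in full):

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// Constant-time-ish hash: seed an RNG from the first few characters, then
// hash a fixed number of pseudo-randomly chosen characters of the string.
uint64_t sampled_hash(const std::string& s) {
    if (s.size() <= 100) {                 // short strings: just hash everything
        uint64_t h = 0;
        for (unsigned char c : s) h = h * 31 + c;
        return h;
    }
    uint64_t seed = 0;
    for (size_t i = 0; i < 10; ++i) seed = seed * 31 + (unsigned char)s[i];
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<size_t> pick(0, s.size() - 1);
    uint64_t h = 0;
    for (int i = 0; i < 100; ++i)          // always 100 samples: bounded work
        h = h * 31 + (unsigned char)s[pick(rng)];
    return h;
}

The obvious trade-off: two strings that differ only in positions the sampler never reads will always collide.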
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However, I think that for practical purposes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses min(n, 20) characters to compute a standard hash. Obviously its cost is O(1) in the string size. Will it work reliably? It depends on your domain...
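A minimal sketch of that (the cut-off of 20 characters comes from the paragraph above; the mixing step is just an ordinary FNV-1a-style loop):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Hash only the first min(n, 20) characters: cost is bounded regardless of length.
uint64_t prefix_hash(const std::string& s) {
    size_t limit = std::min<size_t>(s.size(), 20);
    uint64_t h = 14695981039346656037ULL;        // FNV-1a offset basis
    for (size_t i = 0; i < limit; ++i) {
        h ^= (unsigned char)s[i];
        h *= 1099511628211ULL;                   // FNV-1a prime
    }
    return h;
}

As the next answer points out, a prefix-only hash degenerates badly when many keys share a long common prefix (URLs, file paths), so whether it works reliably really does depend on the domain.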
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then someone comes along and tries to hash an array of URLs. The hash function will see "http:/" for every single string.
Similar scenarios may occur for other character-selection schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function cannot separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" claim too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If such a collision ever occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
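A small sketch of interning plus pointer hashing (assuming a global std::unordered_set as the intern table; pointers and references to its elements remain valid across rehashes):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

std::unordered_set<std::string> intern_table;

// Insert the string once; all equal strings map to the same stored object.
const std::string* intern(const std::string& s) {
    return &*intern_table.insert(s).first;
}

// Hash the (fixed-size) pointer instead of the string contents: O(1).
size_t interned_hash(const std::string* p) {
    return std::hash<const std::string*>()(p);
}

Of course, the O(m) work has only been moved into intern(), which must hash and compare the full string once; the win comes when the same strings are looked up many times afterwards.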
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A = {s : h(s) = y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h' and hash the keys in the set A. There is at least one hash value y' such that the set A' = {s in A : h'(s) = y'} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. QED.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/

What is a good hashing algorithm for seeding a prng with a string?

I am looking for a hashing algorithm that produces a 31/32-bit signed/unsigned integer as a digest for a UTF-8 string, with the purpose of using the output for seeding a PRNG, such as a Park-Miller-Carta LCG or a Mersenne Twister.
I have looked into FNV1 and FNV1a, but they produce very close values for similar strings differing in their last character; I would like a low-collision hash that changes radically upon minimal modifications of the input string. Performance is not an issue.
My current approach consists of a dirty LCG that uses the character codes and a prime number as multipliers:
var a = 524287;
for (var i = 0; i < string.length; i++) {
    a = (a * string.charCodeAt(i) * 16807 + 524287) % 2147483647;
}
Please let me know of any better alternatives.
Use SHA-2.
It is a standard, well-vetted hashing algorithm, and it is always advisable to go with standard algorithms.
If you're generating a 32-bit value, consider using classic CRC32. FNV is supposed to be a fast alternative to CRC, and you're saying that performance is not an issue.
Any cryptographically strong hash will have the properties you want but will generate more bits; simple truncation of the result to 32 bits would be fine. I presume cryptographic strength is not an actual requirement, so flawed (but widely used) hash schemes like MD5 would be adequate - and readily available in many libraries.
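For example, a sketch of that truncation approach using SHA-256 (assuming OpenSSL is available; link with -lcrypto):

#include <cstdint>
#include <string>
#include <openssl/sha.h>

// Hash the UTF-8 bytes with SHA-256 and keep the first 4 bytes as a 32-bit seed.
uint32_t seed_from_string(const std::string& utf8) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char*>(utf8.data()), utf8.size(), digest);
    return (uint32_t(digest[0]) << 24) | (uint32_t(digest[1]) << 16) |
           (uint32_t(digest[2]) << 8)  |  uint32_t(digest[3]);
}

Any 32-bit slice of the digest would do equally well, since every output bit of a cryptographic hash is designed to be uniformly distributed, and a one-character change in the input flips about half of them.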
