What does "no collisions have been found yet for this hashing method" even mean?

What does "no collisions have been found yet for this hashing method" even mean? - security

I mean I don't need to look for the actual collisions, to know they exist. If there weren't collisions, then how would you have fixed-length results? That's why I don't understand what people mean when they claim 'md5 is insecure! someone found collisions!', or something like that.
The only thing I can think of, is that the collision search only looks for dictionary words, eg: If 'dog' and 'house' share the same hash, it would be a stupid hashing method IMO. It could also look for strings with a length < X, being X something between 5-10 (passwords that people could remember)
Am I totally wrong?

MD5 is a 128-bit hash, so there are 2^128 possible hashes. If the hash were perfect, then it would in theory require around 2^64 different hash attempts to find a collision (and you would have to store all 2^64 because each new hash would require comparison to all previous values). There isn't 2^64 bits of storage on the planet, so you would be safe.
The attacks on MD5 allow collisions to be found with significantly less than 2^64 hashes and significantly less than 128 x 2^64 bits of storage. That's why MD5 is considered broken.
Currently there are no similar attacks that work on full-strength SHA-1, but it's expected that such attacks will be publicly known within a few years.

As you know, a collision is the term for the situation where two different things (e.g. documents) hash to the same value.
Clearly, collisions are always theoretically possible for a secure hashing algorithm. But the security of secure hashing comes from:
using a large domain of possible hash values, and
using a hashing algorithm with the property that trial and error is close to the best way to produce a document with a given hash.
If both of these criteria are satisfied, then the probability of someone being able to manufacture a collision for a given document is vanishingly small. This is sufficient to make it impractical to (for example) change the content of a document with a digital signature.
The problem is that clever people have figured out a way (or ways) that are a LOT faster than trial and error for creating documents whose MD5 signatures collide. Hence they can defeat digital signatures, and similar uses of MD5 to provide security.
FOLLOWUP
This quote comes from the Wikipedia page on MD5:
MD5 makes only one pass over the data, so if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more likely to be accepted as valid data by the application using it. Furthermore, current collision-finding techniques allow to specify an arbitrary prefix: an attacker can create two colliding files that both begin with the same content. All the attacker needs to generate two colliding files is a template file with a 128-byte block of data aligned on a 64-byte boundary that can be changed freely by the collision-finding algorithm.
I don't completely understand this, but it looks like a recipe for producing files with (different) meaningful content and the same signature.

In practice, it's not about whether a single sample was found, but about a method. These can be either based on some property "if you hash values of length N, ending with ..., etc. you will get the same hash" (silly example), or based on some algorithm "having this hash / value, this is how you get a new value with the same hash".
Collisions will of course always exist, but the interesting problem is how to find them. I'm not sure what is the source of that claim you quoted, but I'm pretty sure it was supposed to actually mean "no practical way to find collisions has been found yet for this hashing method".

When you see "No collisions found" for the SHA-256 hash, for example, it really means that no hash collisions have ever been found. You are right that theoretically collisions exists, and there may already have happened a SHA-256 collision that no-one noticed, but this is irrelevant.
To find a collision by chance, you would need on average 18 quintillion of hash attempts for a MD5 hash, and 340 undecillion attempts for a SHA-256 hash, already accounting for the birthday problem.
As vy32 said, it is computationally unfeasible compute, store and compare so many hashes. So, in order to find a collision, you need a method that is many orders of magnitude faster than the random trial and error one. If there exists such a method for a secure hash, the hash is considered broken, at least in regards to general collision resistance.
So, to say "Someone found a collision in this xxxbit hash" is in fact synonymous of saying "A practical method of finding collisions was found for this hash, making it insecure". The alternative is a cosmically unlikely event, and would be reported in another way.

Related

MD5 hash reversing

I know it's not possible to reverse an MD5 hash back to its original value. But what about generating a set of random characters which would give the exact same value when hashed? Is that possible?

Finding a message that matches a given MD5 hash can happen in three ways:
You guess the original message. For passwords and other low entropy messages this is often relatively easy. That's why we use use key-stretching in such situations. For sufficiently complex messages, this becomes infeasible.
You guess about 2^127 times and get a new message fitting that hash. This is currently infeasible.
You exploit a pre-image attack against that specific hash function, obtained by cryptoanalyzing it. For MD5 there is one, with a workfactor of 2^123, but that's still infeasible.
There is no efficient attack on MD5's pre-image resistance at the moment.
There are efficient collision attacks against MD5, but they only allow an attacker to construct two different messages with the same hash. But it doesn't allow him to construct a message for a given hash.

Yes it is possible to come up with a collision (since you map from a larger space to a smaller this is something that you can assume to happen eventually). Actually MD5 is already considered as "broken" in this respect.
From wiki:
However, it has since been shown that MD5 is not collision
resistant;[3] as such, MD5 is not suitable for applications like SSL
certificates or digital signatures that rely on this property. In
1996, a flaw was found with the design of MD5, and while it was not a
clearly fatal weakness, cryptographers began recommending the use of
other algorithms, such as SHA-1—which has since been found also to be
vulnerable. In 2004, more serious flaws were discovered in MD5, making
further use of the algorithm for security purposes
questionable—specifically, a group of researchers described how to
create a pair of files that share the same MD5 checksum.[4][5] Further
advances were made in breaking MD5 in 2005, 2006, and 2007.[6] In
December 2008, a group of researchers used this technique to fake SSL
certificate validity,[7][8] and US-CERT now says that MD5 "should be
considered cryptographically broken and unsuitable for further
use."[9] and most U.S. government applications now require the SHA-2
family of hash functions.[10]

In one sense, this is possible. If you have strings that are longer than the hash itself, then you will have collisions, so such a string will exist.
However, finding such a string would be equivalent to reversing the hash, as you would be finding a value that hashes to a particular hash, so it would not be any more feasible than reversing a hash any other way.

For MD5 specifically? Yes.
Several years ago, an article was published on an exploit of the MD5 hash that allowed easy generation of data which, when hashed, gave a desired MD5 hash (well, what they actually discovered was an algorithm to find sets of data with the same hash, but you get how that can be used the other way around). You can read an overview of the principle here. No similar algorithm has been found for SHA-2, although that may change in the future.

Yes, what you're talking about is called a collision. A collision in any hashing mechanism is when two different plaintexts create the same hash after being run through a hashing algorithm.

Is there an algorithm for unique "hashes"

I'm interested in finding an algorithm that can encode a piece of data into a sort of hash (as in that is impossible to convert back into the source data, except by brute force), but also has a unique output for every unique input. The size of the output doesn't matter.
It should be able to hash the same input twice though, and give the same output, so regular encryption with a random, discarded key won't suffice. Nor will regular encryption with a known key, or a salt, because they would be exposed to attackers.
Does such a thing exist?
Can it event theoretically exist, or is the data-destroying part of normal hash algorithms critical for the irreversible characteristic?
What use would something like this be? Well, imagine a browser with a list of websites that should be excluded from the history (like NSFW sites). If this list is saved unencoded or encrypted with a key known on the system, it's readable not just by the browser but also by bosses, wives, etc.
If instead the website addresses are stored hashed, they can't be read, but the browser can check if a site is present in the list.
Using a normal hash function could result in false positives (however unlikely).
I'm not building a browser, I have no plan to actually use the answer. I'm just curious and interested in encryption and such.

Given the definition of a hash;
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value.
no - it's not theoretically possible. A hash value is of a fixed length that is generally smaller than the data it is hashing (unless the data being hashed is less than the fixed length of the hash). They will always lose data, and as such there can always be collisions (a hash function is considered good if the risk of collision is low, and infeasible to compute.)

In theory it's impossible for outputs that are shorter than the input. This trivially follows from the pidgeon-hole principle.
You could use asymmetric encryption where you threw away the private key. That way it's technically lossless encryption, but nobody will be able to easily reverse it. Note that this is much slower than normal hashing, and the output will be larger than the input.
But the probability of collision drops exponentially with the hash size. A good 256 bit hash is collision free for all practical purposes. And by that I mean hashing for billions of years with all computers in the world will almost certainly not produce collision.
Your extended question shows two problems.
What use would something like this be? Well, imagine a browser with a list of websites that should be excluded from the history (like NSFW sites). If this list is saved unencoded or encrypted with a key known on the system, it's readable not just by the browser but also by bosses, wives, etc.
If instead the website addresses are stored hashed, they can't be read, but the browser can check if a site is present in the list.
Brute force is trivial in this use case. Just find the list of all domains/the zone file. Wouldn't be surprised if a good list is downloadable somewhere.
Using a normal hash function could result in false positives (however unlikely).
The collision probability of a hash is much lower(especially since you have no attacker that tries to provoke a collision in this scenario) than the probability of hardware error.
So my conclusion is to combine a secret with a slow hash.
byte[] secret=DeriveKeyFromPassword(pwd, salt, enough iterations for this to take perhaps a second)
and then for the actual hash use a KDF again combining the secret and the domain name.

Any form of lossless public encryption where you forget the private key.

Well, any lossless compressor with a password would work.
Or you could salt your input with some known (to you) text. This would give you something as long as the input. You could then run some sort of lossless compression on the result, which would make it shorter.

you can find a hash function with a low probability of that happening, but i think all of them are prone to birthday attack, you can try to use a function with a large size output to minimize that probability

Well what about md5 hash? sha1 hash?

I don't think it can exist; if you can put anything into them and get a different result, it couldn't be a fixed length byte array, and it would lose a lot of its usefulness.
Perhaps instead of a hash what you are looking for is reversible encryption? That should be unique. Won't be fast, but it will be unique.

Given a hashing algorithm, is there a more efficient way to 'unhash' besides bruteforce?

So I have the code for a hashing function, and from the looks of it, there's no way to simply unhash it (lots of bitwise ANDs, ORs, Shifts, etc). My question is, if I need to find out the original value before being hashed, is there a more efficient way than just brute forcing a set of possible values?
Thanks!
EDIT: I should add that in my case, the original message will never be longer than several characters, for my purposes.
EDIT2: Out of curiosity, are there any ways to do this on the run, without precomputed tables?

Yes; rainbow table attacks. This is especially true for hashes of shorter strings. i.e. hashes of small strings like 'true' 'false' 'etc' can be stored in a dictionary and can be used as a comparison table. This speeds up cracking process considerably. Also if the hash size is short (i.e. MD5) the algorithm becomes especially easy to crack. Of course, the way around this issue is combining 'cryptographic salts' with passwords, before hashing them.
There are two very good sources of info on the matter: Coding Horror: Rainbow Hash Cracking and
Wikipedia: Rainbow table
Edit: Rainbox tables can tage tens of gigabytes so downloading (or reproducing) them may take weeks just to make simple tests. Instead, there seems to be some online tools for reversing simple hashes: http://www.onlinehashcrack.com/ (i.e. try to reverse 463C8A7593A8A79078CB5C119424E62A which is MD5 hash of the word 'crack')

"Unhashing" is called a "preimage attack": given a hash output, find a corresponding input.
If the hash function is "secure" then there is no better attack than trying possible inputs until a hit is found; for a hash function with a n-bit output, the average number of hash function invocations will be about 2n, i.e. Way Too Much for current earth-based technology if n is greater than 180 or so. To state it otherwise: if an attack method faster than this brute force method is found, for a given hash function, then the hash function is deemed irreparably broken.
MD5 is considered broken, but for other weaknesses (there is a published method for preimages with cost 2123.4, which is thus about 24 times faster than the brute force cost -- but it is still so far in the technologically unfeasible that it cannot be confirmed).
When the hash function input is known to be part of a relatively small space (e.g. it is a "password", so it could fit in the brain of a human user), then one can optimize preimage attacks by using precomputed tables: the attacker still has to pay the search cost once, but he can reuse his tables to attack multiple instances. Rainbow tables are precomputed tables with a space-efficient compressed representation: with rainbow tables, the bottleneck for the attacker is CPU power, not the size of his hard disks.

Assuming the "normal case", the original message will be many times longer than the hash. Therefore, it is in principle absolutely impossible to derive the message from the hash, simply because you cannot calculate information that is not there.
However, you can guess what's probably the right message, and there exist techniques to accelerate this process for common messages (such as passwords), for example rainbow tables. It is very likely that if something that looks sensible is the right message if the hash matches.
Finally, it may not be necessary at all to find the good message as long as one can be found which will pass. This is the subject of a known attack on MD5. This attack lets you create a different message which gives the same hash.
Whether this is a security problem or not depends on what exactly you use the hash for.

This may sound trivial, but if you have the code to the hashing function, you could always override a hash table container class's hash() function (or similar, depending on your programming language and environment). That way, you can hash strings of say 3 characters or less, and then you can store the hash as a key by which you obtain the original string, which appears to be exactly what you want. Use this method to construct your own rainbow table, I suppose. If you have the code to the program environment in which you want to find these values out, you could always modify it to store hashes in the hash table.

why should a good hash algorithm not allow attackers to find two messages producing the same hash?

I was reading wikipedia, and it says
Cryptographic hash functions are a third type of cryptographic algorithm.
They take a message of any length as input, and output a short,
fixed length hash which can be used in (for example) a digital signature.
For good hash functions, an attacker cannot find two messages that produce the same hash.
But why? What I understand is that you can put the long Macbeth story into the hash function and get a X long hash out of it. Then you can put in the Beowulf story to get another hash out of it again X long.
So since this function maps loads of things into a shorter length, there is bound to be overlaps, like I might put in the story of the Hobit into the hash function and get the same output as Beowulf, ok, but this is inevitable right (?) since we are producing a shorter length output from our input? And even if the output is found, why is it a problem?
I can imagine if I invert it and get out Hobit instead of Beowulf, that would be bad but why is it useful to the attacker?
Best,

Yes, of course there will be collisions for the reasons you describe.
I suppose the statement should really be something like this: "For good hash functions, an attacker cannot find two messages that produce the same hash, except by brute-force".
As for the why...
Hash algorithms are often used for authentication. By checking the hash of a message you can be (almost) certain that the message itself hasn't been tampered with. This relies on it being infeasible to find two messages that generate the same hash.
If a hash algorithm allows collisions to be found relatively easily then it becomes useless for authentication because an attacker could then (theoretically) tamper with a message and have the tampered message generate the same hash as the original.

Yes, it's inevitable that there will be collisions when mapping a long message onto a shorter hash, as the hash cannot contain all possible values of the message. For the same reason you cannot 'invert' the hash to uniquely produce either Beowulf or The Hobbit - but if you generated every possible text and filtered out the ones that had your particular hash value, you'd find both texts (amongst billions of others).
The article is saying that it should be hard for an attacker to find or construct a second message that has the same hash value as a first. Cryptographic hash functions are often used as proof that a message hasn't been tampered with - if even a single bit of data flips then the hash value should be completely different.
A couple of years back, Dutch researchers demonstrated weaknesses in MD5 by publishing a hash of their "prediction" for the US presidential election. Of course, they had no way of knowing the outcome in advance - but with the computational power of a PS3 they constructed a PDF file for each candidate, each with the same hash value. The implications for MD5 - already on its way down - as a trusted algorithm for digital signatures became even more dire...

Cryptographic hashes are used for authentication. For instance, peer-to-peer protocols rely heavily on them. They use them to make sure that an ill-intentioned peer cannot spoil the download for everyone else by distributing packets that contain garbage. The torrent file that describes a download contains the hashes for each block. With this check in place, the victim peer can find out that he has been handled a corrupted block and download it again from someone else.
The attacker would like to replace Beowulf by Hobbit to increase saxon poetry's visibility, but the cryptographic hash that is used in the protocol won't let him.

If it is easy to find collisions then the attacker could create malicious data, and simply prepend it with dummy data until the collision is found. The hash check would then pass for the malicious data. That is why collisions should only be possible via brute force and be as rare as possible.
Alternatively collisions are also a problem with Certificates.

Is it possible to create a forged file which has the same checksums using two different algorithms?

I was a bit inspired by this blog entry http://blogs.technet.com/dmelanchthon/archive/2009/07/23/windows-7-rtm.aspx (German)
The current notion is that md5 and sha1 are both somewhat broken. Not easily and fast, but at least for md5 in the range of a practical possibility. (I'm not at all a crypto expert, so maybe I'm wrong in stuff like that).
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
First, would it be possible at all?
Second, would it be possible in reality, with current hardware/software?
If not, wouldn't be the easiest way to provide assurance of the integrity of a file to use always two different algorithms, even if they have some kind of weakness?
Updated:
Just to clarify: the idea is to have a file A and a file A' which fullfills the conditions:
size(A) == size(A') && md5sum(A) == md5sum(A') && sha1sum(A) == sha1sum(A')

"Would it be possible at all?" - yes, if the total size of the checksums is smaller than the total size of the file, it is impossible to avoid collisions.
"would it be possible in reality, with current hardware/software?" - if it is feasible to construct a text to match a given checksum for each of the checksums in use, then yes.
See wikipedia on concatenation of cryptographic hash functions, which is also a useful term to google for.
From that page:
"However, for Merkle-Damgård hash
functions, the concatenated function
is only as strong as the best
component, not stronger. Joux noted
that 2-collisions lead to
n-collisions: if it is feasible to
find two messages with the same MD5
hash, it is effectively no more
difficult to find as many messages as
the attacker desires with identical
MD5 hashes. Among the n messages with
the same MD5 hash, there is likely to
be a collision in SHA-1. The
additional work needed to find the
SHA-1 collision (beyond the
exponential birthday search) is
polynomial. This argument is
summarized by Finney."

For a naive answer, we'd have make some (incorrect) assumptions:
Both the SHA1 and MD5 hashing algorithms result in an even distribution of hash values for a set of random inputs
Algorithm details aside--a random input string has an equally likely chance of producing any hash value
(Basically, no clumping and nicely distributed domains.)
If the probability of discovering a string that collides with another's SHA1 hash is p1, and similarly p2 for MD5, the naive answer is the probability of finding one that collides with both is p1*p2.
However, the hashes are both broken, so we know our assumptions are wrong.
The hashes have clumping, are more sensitive to changes with some data than others, and in other words, aren't perfect. On the other hand, a perfect, non-broken hashing algorithm will have the above properties, and that's exactly what makes it hard to find collisions. They're random.
The probability intrinsically depends on the properties of the algorithm--basically, since our assumptions aren't valid, we can't "easily" determine how hard it is. In fact, the difficultly of finding input that collides likely depends very strongly on the characteristics of the input string itself. Some may be relatively easy (but still probably impractical on today's hardware), and due to the different nature of the two algorithms, some may actually be impossible.

So I asked myself if it would be
possible to create a file A' which has
the same size, the same md5 sum, and
the same sha1 sum as the original file
A.
Yes, make a copy of the file.
Other than that, not without large amounts of computing resources to check tons of permutations (assuming the file size is non-trivial).
You can think of it like this:
If file size increases by n, the likelihood of a possible fake increases, but the computing costs necessary to test the combinations increases exponentially by 2^n.
So the bigger your file is, the more likely there is a dupe out there, but the less likely you are at finding it.

In theory yes you can have it, in practice it's hell of a collusion. In practice no one even able to create a SHA1 collusion let alone MD5 + SHA1 + Size at the same time. This combination is simply impossible right now without having the whole computer power in the world and run it for a while.
Although in the close future we might see more vulnerabilities in SHA1 and MD5. And with the support of better hardware (especially GPU) why not.

In theory you could do this. In practice, if you started from the two checksums provided by MD5 and SHA1 and tried to create a file that produced the same two checksums - it would be very difficult (many times more difficult than creating a file that produced the same MD5 checksum, or SHA1 checksum in isolation).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string