Is there a way to actually find out whether a hash is MD5 or MD4?
For instance:
10c7ccc7a4f0aff03c915c485565b9da is an MD5 hash
be08c2de91787c51e7aee5dd16ca4f76 is an MD4 hash
I know that there is a difference "security wise" but how can someone determine which hash is which "programming wise" or just by the eyes. Or is there really no way to know for sure?
I was thinking about giving a hash and comparing it to the 2 of them. However, I noticed that they are all identical and there is no way to really check the difference ! The first "alpha-digit" is not necessarily a number in MD5, and it is not necessary a character in MD4. So how can someone determine which hash is being used?
Both MD5 and MD4 output 128-bit digests (16B, or 32B hex strings). The representation of digests (hashes) of both algorithms is undistinguishable from one another unless they are annotated with extra metadata (which would need to be provided by the application).
You wouldn't be able to tell without recomputing the hashes over the original file or original piece of data.
If you have the data, and are given two digests to differentiate, you'll have to recompute with one of the two algorithms, and hope the hash you derive is one of the test-two. In cases where the input data is corrupted, then you will not even re-obtain any of the two.
If you don't have access to the original data, and you had to take a guess today on the nature of just one (not two) 32-hexchar-long digest to go with that data, it would be most likely md5. MD4 was badly broken by 2008, and considered historic by 2011 (RFC 6150).
If you're given a single hash, and that hash is computed over a file, its hashing algorith is in practice indicated by the file extension of the checksum file (.md5sum, .md4sum, .sha1sum, etc.).
Related
I saw different ways of creating checksum for some plain text file but I would like to be able to change mentioned file contents but in same time to keep already known (set) checksum by filling rest of the file with necessary characters. I got this idea when long years ago found some app (ATARI computer I think) able to make disk boot-able after change of its ID, if checksum of boot sector is $1234. Is it possible to achieve in VBA? Thank you.
I would like to be able to change mentioned file contents but in same time to keep already known (set) checksum by filling rest of the file with necessary characters.
You can't do that, at least not with any hashing algorithm worth its salt (crypto pun not intended... I swear!). Well you could, in theory, but then there's no telling how many characters (and how much time and disk space!) you're going to need to add in order to get the hash collision that yields exactly the same hash as the original file.
What you're asking is basically defeating the entire purpose of a checksum.
I don't think that ATARI computer used SHA-1 hashing (160 bits), let alone the SHA-256 or SHA-512 (or 128-bit MD5), or any other algorithm in common use today.
You could implement some of the lower-bitness checksum algorithms, but the smaller the hash, the higher the risk of a hash collision - and the easier it is to get a hash that collides with your checksum value, the more meaningless the checksum is.
By definition, a hashing function isn't reversible, and a salted, cryptographic hash will not even produce the same ouptut given two identical inputs. I'm not familiar with checksum, but if I had to implement one, I would probably go with a high-bitness cryptographic hashing algorithm, in order to reduce the risk of a hash collision down to statistical insignificance.
I know it's not possible to reverse an MD5 hash back to its original value. But what about generating a set of random characters which would give the exact same value when hashed? Is that possible?
Finding a message that matches a given MD5 hash can happen in three ways:
You guess the original message. For passwords and other low entropy messages this is often relatively easy. That's why we use use key-stretching in such situations. For sufficiently complex messages, this becomes infeasible.
You guess about 2^127 times and get a new message fitting that hash. This is currently infeasible.
You exploit a pre-image attack against that specific hash function, obtained by cryptoanalyzing it. For MD5 there is one, with a workfactor of 2^123, but that's still infeasible.
There is no efficient attack on MD5's pre-image resistance at the moment.
There are efficient collision attacks against MD5, but they only allow an attacker to construct two different messages with the same hash. But it doesn't allow him to construct a message for a given hash.
Yes it is possible to come up with a collision (since you map from a larger space to a smaller this is something that you can assume to happen eventually). Actually MD5 is already considered as "broken" in this respect.
From wiki:
However, it has since been shown that MD5 is not collision
resistant;[3] as such, MD5 is not suitable for applications like SSL
certificates or digital signatures that rely on this property. In
1996, a flaw was found with the design of MD5, and while it was not a
clearly fatal weakness, cryptographers began recommending the use of
other algorithms, such as SHA-1—which has since been found also to be
vulnerable. In 2004, more serious flaws were discovered in MD5, making
further use of the algorithm for security purposes
questionable—specifically, a group of researchers described how to
create a pair of files that share the same MD5 checksum.[4][5] Further
advances were made in breaking MD5 in 2005, 2006, and 2007.[6] In
December 2008, a group of researchers used this technique to fake SSL
certificate validity,[7][8] and US-CERT now says that MD5 "should be
considered cryptographically broken and unsuitable for further
use."[9] and most U.S. government applications now require the SHA-2
family of hash functions.[10]
In one sense, this is possible. If you have strings that are longer than the hash itself, then you will have collisions, so such a string will exist.
However, finding such a string would be equivalent to reversing the hash, as you would be finding a value that hashes to a particular hash, so it would not be any more feasible than reversing a hash any other way.
For MD5 specifically? Yes.
Several years ago, an article was published on an exploit of the MD5 hash that allowed easy generation of data which, when hashed, gave a desired MD5 hash (well, what they actually discovered was an algorithm to find sets of data with the same hash, but you get how that can be used the other way around). You can read an overview of the principle here. No similar algorithm has been found for SHA-2, although that may change in the future.
Yes, what you're talking about is called a collision. A collision in any hashing mechanism is when two different plaintexts create the same hash after being run through a hashing algorithm.
I'm interested in finding an algorithm that can encode a piece of data into a sort of hash (as in that is impossible to convert back into the source data, except by brute force), but also has a unique output for every unique input. The size of the output doesn't matter.
It should be able to hash the same input twice though, and give the same output, so regular encryption with a random, discarded key won't suffice. Nor will regular encryption with a known key, or a salt, because they would be exposed to attackers.
Does such a thing exist?
Can it event theoretically exist, or is the data-destroying part of normal hash algorithms critical for the irreversible characteristic?
What use would something like this be? Well, imagine a browser with a list of websites that should be excluded from the history (like NSFW sites). If this list is saved unencoded or encrypted with a key known on the system, it's readable not just by the browser but also by bosses, wives, etc.
If instead the website addresses are stored hashed, they can't be read, but the browser can check if a site is present in the list.
Using a normal hash function could result in false positives (however unlikely).
I'm not building a browser, I have no plan to actually use the answer. I'm just curious and interested in encryption and such.
Given the definition of a hash;
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value.
no - it's not theoretically possible. A hash value is of a fixed length that is generally smaller than the data it is hashing (unless the data being hashed is less than the fixed length of the hash). They will always lose data, and as such there can always be collisions (a hash function is considered good if the risk of collision is low, and infeasible to compute.)
In theory it's impossible for outputs that are shorter than the input. This trivially follows from the pidgeon-hole principle.
You could use asymmetric encryption where you threw away the private key. That way it's technically lossless encryption, but nobody will be able to easily reverse it. Note that this is much slower than normal hashing, and the output will be larger than the input.
But the probability of collision drops exponentially with the hash size. A good 256 bit hash is collision free for all practical purposes. And by that I mean hashing for billions of years with all computers in the world will almost certainly not produce collision.
Your extended question shows two problems.
What use would something like this be? Well, imagine a browser with a list of websites that should be excluded from the history (like NSFW sites). If this list is saved unencoded or encrypted with a key known on the system, it's readable not just by the browser but also by bosses, wives, etc.
If instead the website addresses are stored hashed, they can't be read, but the browser can check if a site is present in the list.
Brute force is trivial in this use case. Just find the list of all domains/the zone file. Wouldn't be surprised if a good list is downloadable somewhere.
Using a normal hash function could result in false positives (however unlikely).
The collision probability of a hash is much lower(especially since you have no attacker that tries to provoke a collision in this scenario) than the probability of hardware error.
So my conclusion is to combine a secret with a slow hash.
byte[] secret=DeriveKeyFromPassword(pwd, salt, enough iterations for this to take perhaps a second)
and then for the actual hash use a KDF again combining the secret and the domain name.
Any form of lossless public encryption where you forget the private key.
Well, any lossless compressor with a password would work.
Or you could salt your input with some known (to you) text. This would give you something as long as the input. You could then run some sort of lossless compression on the result, which would make it shorter.
you can find a hash function with a low probability of that happening, but i think all of them are prone to birthday attack, you can try to use a function with a large size output to minimize that probability
Well what about md5 hash? sha1 hash?
I don't think it can exist; if you can put anything into them and get a different result, it couldn't be a fixed length byte array, and it would lose a lot of its usefulness.
Perhaps instead of a hash what you are looking for is reversible encryption? That should be unique. Won't be fast, but it will be unique.
I believe I can download the code to PHP or Linux or whatever and look directly at the source code for the MD5 function. Could I not then reverse engineer the encryption?
Here's the code - http://dollar.ecom.cmu.edu/sec/cryptosource.htm
It seems like any encryption method would be useless if "the enemy" has the code it was created with. Am I wrong?
That is actually a good question.
MD5 is a hash function -- it "mixes" input data in such a way that it should be unfeasible to do a number of things, including recovering the input given the output (it is not encryption, there is no key and it is not meant to be inverted -- rather the opposite). A handwaving description is that each input bit is injected several times in a large enough internal state, which is mixed such that any difference quickly propagates to the whole state.
MD5 is public since 1992. There is no secret, and has never been any secret, to the design of MD5.
MD5 is considered cryptographically broken since 2004, year of publication of the first collision (two distinct input messages which yield the same output); it was considered "weak" since 1996 (when some structural properties were found, which were believed to ultimately help in building collisions). However, there are other hash functions, which are as public as MD5 is, and for which no weakness is known yet: the SHA-2 family. Newer hash functions are currently being evaluated as part of the SHA-3 competition.
The really troubling part is that there is no known mathematical proof that a hash function may actually exist. A hash function is a publicly described efficient algorithm, which can be embedded as a logic circuit of a finite, fixed and small size. For the practitioners of computational complexity, it is somewhat surprising that it is possible to exhibit a circuit which cannot be inverted. So right now we only have candidates: functions for which nobody has found weaknesses yet, rather than function for which no weakness exists. On the other hand, the case of MD5 shows that, apparently, getting from known structural weaknesses to actual collisions to attacks takes a substantial amount of time (weaknesses in 1996, collisions in 2004, applied collisions -- to a pair of X.509 certificates -- in 2008), so the current trend is to use algorithm agility: when we use a hash function in a protocol, we also think about how we could transition to another, should the hash function prove to be weak.
It is not an encryption, but a one way hashing mechanism. It digests the string and produces a (hopefully) unique hash.
If it were a reversible encryption, zip and tar.gz formats would be quite verbose. :)
The reason it doesn't help hackers too much (obviously knowing how one is made is beneficial) is that if they find a password to a system that is hashed, e.g. 2fcab58712467eab4004583eb8fb7f89, they need to know the original string used to create it, and also if any salt was used. That is because when you login, for obvious reasons, the password string is hashed with the same method as it is generated and then that resulting hash is compared to what is stored.
Also, many developers are migrating to bcrypt which incorporates a work factor, if the hashing takes 1 second as opposed to .01 second, it greatly slows down generating a rainbow table for you application, and those old PHP sites using md5() only become the low hanging fruit.
Further reading on bcrypt.
One of the criteria of good cryptographic operations is that knowledge of the algorithm should not make it easier to break the encryption. So an encryption should not be reversible without knowledge of the algorithm and the key, and a hash function must not be reversible regardless of knowledge of the algorithm (the term used is "computationally infeasible").
MD5 and other hash function (like SHA-1 SHA-256, etc) perform a one-way operation on data that creates a digest or "fingerprint" that is usually much smaller than than the plaintext. This one way function cannot be reversed to retrieve the plaintext, even when you know exactly what the function does.
Likewise, knowledge of an encryption algorithm doesn't make it any easier (assuming a good algorithm) to recover plaintext from ciphertext. The reverse process is "computationally infeasible" without knowledge of the encryption key used.
I was reading wikipedia, and it says
Cryptographic hash functions are a third type of cryptographic algorithm.
They take a message of any length as input, and output a short,
fixed length hash which can be used in (for example) a digital signature.
For good hash functions, an attacker cannot find two messages that produce the same hash.
But why? What I understand is that you can put the long Macbeth story into the hash function and get a X long hash out of it. Then you can put in the Beowulf story to get another hash out of it again X long.
So since this function maps loads of things into a shorter length, there is bound to be overlaps, like I might put in the story of the Hobit into the hash function and get the same output as Beowulf, ok, but this is inevitable right (?) since we are producing a shorter length output from our input? And even if the output is found, why is it a problem?
I can imagine if I invert it and get out Hobit instead of Beowulf, that would be bad but why is it useful to the attacker?
Best,
Yes, of course there will be collisions for the reasons you describe.
I suppose the statement should really be something like this: "For good hash functions, an attacker cannot find two messages that produce the same hash, except by brute-force".
As for the why...
Hash algorithms are often used for authentication. By checking the hash of a message you can be (almost) certain that the message itself hasn't been tampered with. This relies on it being infeasible to find two messages that generate the same hash.
If a hash algorithm allows collisions to be found relatively easily then it becomes useless for authentication because an attacker could then (theoretically) tamper with a message and have the tampered message generate the same hash as the original.
Yes, it's inevitable that there will be collisions when mapping a long message onto a shorter hash, as the hash cannot contain all possible values of the message. For the same reason you cannot 'invert' the hash to uniquely produce either Beowulf or The Hobbit - but if you generated every possible text and filtered out the ones that had your particular hash value, you'd find both texts (amongst billions of others).
The article is saying that it should be hard for an attacker to find or construct a second message that has the same hash value as a first. Cryptographic hash functions are often used as proof that a message hasn't been tampered with - if even a single bit of data flips then the hash value should be completely different.
A couple of years back, Dutch researchers demonstrated weaknesses in MD5 by publishing a hash of their "prediction" for the US presidential election. Of course, they had no way of knowing the outcome in advance - but with the computational power of a PS3 they constructed a PDF file for each candidate, each with the same hash value. The implications for MD5 - already on its way down - as a trusted algorithm for digital signatures became even more dire...
Cryptographic hashes are used for authentication. For instance, peer-to-peer protocols rely heavily on them. They use them to make sure that an ill-intentioned peer cannot spoil the download for everyone else by distributing packets that contain garbage. The torrent file that describes a download contains the hashes for each block. With this check in place, the victim peer can find out that he has been handled a corrupted block and download it again from someone else.
The attacker would like to replace Beowulf by Hobbit to increase saxon poetry's visibility, but the cryptographic hash that is used in the protocol won't let him.
If it is easy to find collisions then the attacker could create malicious data, and simply prepend it with dummy data until the collision is found. The hash check would then pass for the malicious data. That is why collisions should only be possible via brute force and be as rare as possible.
Alternatively collisions are also a problem with Certificates.