I have seen different ways of creating a checksum for a plain text file, but I would like to be able to change that file's contents while keeping a known (preset) checksum, by filling the rest of the file with whatever characters are necessary. I got this idea many years ago from an app (on the ATARI computer, I think) that could make a disk bootable again after its ID was changed, because the boot sector is treated as valid if its checksum is $1234. Is it possible to achieve this in VBA? Thank you.
I would like to be able to change that file's contents while keeping a known (preset) checksum, by filling the rest of the file with whatever characters are necessary.
You can't do that, at least not with any hashing algorithm worth its salt (crypto pun not intended... I swear!). Well, you could in theory, but there's no telling how many characters (and how much time and disk space!) you would need to add in order to produce a hash collision - an input that yields exactly the same hash as the original file.
What you're asking is basically defeating the entire purpose of a checksum.
I don't think that Atari computer used SHA-1 hashing (160 bits), let alone SHA-256 or SHA-512 (or even 128-bit MD5), or any other algorithm in common use today.
You could implement one of the lower-bitness checksum algorithms, but the smaller the hash, the higher the risk of a hash collision - and the easier it is to produce an input that collides with a given checksum value, the more meaningless the checksum is.
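In fact, for a simple additive 16-bit checksum like the Atari boot-sector one, forcing a target value such as $1234 is trivial. Here is a minimal sketch in Python - the word-sum checksum definition is my assumption for illustration, and the same arithmetic ports directly to VBA:

    def checksum16(data: bytes) -> int:
        # Toy checksum: sum of big-endian 16-bit words, modulo 2^16.
        total = 0
        for i in range(0, len(data), 2):
            total = (total + int.from_bytes(data[i:i + 2].ljust(2, b"\0"), "big")) & 0xFFFF
        return total

    def pad_to_checksum(data: bytes, target: int) -> bytes:
        # Pad to an even length so appending doesn't shift word alignment,
        # then append the single word that brings the sum to the target.
        if len(data) % 2:
            data += b"\0"
        fix = (target - checksum16(data)) & 0xFFFF
        return data + fix.to_bytes(2, "big")

    patched = pad_to_checksum(b"any edited file contents", 0x1234)
    assert checksum16(patched) == 0x1234

With a cryptographic hash there is no such closed-form "fix" word to append - which is the point of the rest of this answer.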
By definition, a hashing function isn't reversible, and a salted cryptographic hash won't even produce the same output for two identical inputs unless the same salt is reused. I'm not familiar with checksum algorithms specifically, but if I had to implement one, I would probably go with a high-bitness cryptographic hashing algorithm, in order to reduce the risk of a hash collision down to statistical insignificance.
Related
Is there a way to actually find out whether a hash is MD5 or MD4?
For instance:
10c7ccc7a4f0aff03c915c485565b9da is an MD5 hash
be08c2de91787c51e7aee5dd16ca4f76 is an MD4 hash
I know that there is a difference security-wise, but how can someone determine which hash is which, either programmatically or just by eye? Or is there really no way to know for sure?
I was thinking about taking a hash and comparing it against both. However, I noticed that they look identical and there is no way to really tell the difference! The first hex digit is not necessarily a number in MD5, nor is it necessarily a letter in MD4. So how can someone determine which hash is being used?
Both MD5 and MD4 output 128-bit digests (16 bytes, or 32-character hex strings). The representations of digests (hashes) from the two algorithms are indistinguishable from one another unless they are annotated with extra metadata (which would have to be provided by the application).
You wouldn't be able to tell without recomputing the hashes over the original file or original piece of data.
If you have the data and are given two digests to differentiate, you'll have to recompute with one of the two algorithms and hope the hash you derive matches one of the two under test. If the input data is corrupted, you won't reproduce either of them.
If you don't have access to the original data, and you had to guess today at the nature of just one (not two) 32-hex-character digest that goes with that data, it would most likely be MD5. MD4 was badly broken by 2008, and considered historic by 2011 (RFC 6150).
If you're given a single hash computed over a file, its hashing algorithm is in practice indicated by the file extension of the checksum file (.md5sum, .md4sum, .sha1sum, etc.).
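For completeness, the recompute-and-compare approach looks like this in Python. Note that MD4 availability depends on your OpenSSL build - newer builds move MD4 into the legacy provider, so hashlib.new("md4") may raise an error:

    import hashlib

    def identify_128bit_digest(data: bytes, digest_hex: str) -> str:
        # Recompute the digest with each candidate algorithm and compare.
        for name in ("md5", "md4"):
            try:
                if hashlib.new(name, data).hexdigest() == digest_hex.lower():
                    return name
            except ValueError:
                pass  # algorithm not available in this build
        return "unknown (corrupted data, or some other 128-bit algorithm)"

    print(identify_128bit_digest(b"hello", hashlib.md5(b"hello").hexdigest()))  # md5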
I'm interested in finding an algorithm that can encode a piece of data into a sort of hash (as in, impossible to convert back into the source data except by brute force), but that also has a unique output for every unique input. The size of the output doesn't matter.
It should be able to hash the same input twice though, and give the same output, so regular encryption with a random, discarded key won't suffice. Nor will regular encryption with a known key, or a salt, because they would be exposed to attackers.
Does such a thing exist?
Can it even theoretically exist, or is the data-destroying part of normal hash algorithms critical to their irreversibility?
What use would something like this be? Well, imagine a browser with a list of websites that should be excluded from the history (like NSFW sites). If this list is saved unencoded or encrypted with a key known on the system, it's readable not just by the browser but also by bosses, wives, etc.
If instead the website addresses are stored hashed, they can't be read, but the browser can check if a site is present in the list.
Using a normal hash function could result in false positives (however unlikely).
I'm not building a browser, I have no plan to actually use the answer. I'm just curious and interested in encryption and such.
Given the definition of a hash:
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value.
no - it's not theoretically possible. A hash value has a fixed length that is generally smaller than the data being hashed (unless the data is shorter than the hash's fixed length). Hashes will always lose information, and as such there can always be collisions (a hash function is considered good if the risk of collision is low and collisions are infeasible to find).
In theory it's impossible for outputs that are shorter than the input. This follows trivially from the pigeonhole principle.
You could use asymmetric encryption where you threw away the private key. That way it's technically lossless encryption, but nobody will be able to easily reverse it. Note that this is much slower than normal hashing, and the output will be larger than the input.
But the probability of collision drops exponentially with the hash size. A good 256-bit hash is collision-free for all practical purposes. By that I mean that hashing for billions of years with all the computers in the world will almost certainly not produce a collision.
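To see how quickly collisions appear once the output is small, here is a toy experiment: truncate SHA-256 to 16 bits and collide it by brute force. The birthday bound predicts a collision after only a few hundred attempts on average:

    import hashlib

    def tiny_hash(data: bytes) -> bytes:
        # Deliberately truncated to 16 bits so collisions appear quickly.
        return hashlib.sha256(data).digest()[:2]

    seen = {}
    for i in range(1 << 20):
        msg = str(i).encode()
        h = tiny_hash(msg)
        if h in seen:
            print("collision:", seen[h], "and", msg)
            break
        seen[h] = msg

With the full 256-bit output, the same search would take around 2^128 attempts - which is why it is collision-free in practice.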
Your extended question shows two problems.
What use would something like this be? Well, imagine a browser with a list of websites that should be excluded from the history (like NSFW sites). If this list is saved unencoded or encrypted with a key known on the system, it's readable not just by the browser but also by bosses, wives, etc.
If instead the website addresses are stored hashed, they can't be read, but the browser can check if a site is present in the list.
Brute force is trivial in this use case. Just find the list of all domains/the zone file. Wouldn't be surprised if a good list is downloadable somewhere.
Using a normal hash function could result in false positives (however unlikely).
The collision probability of a hash is much lower (especially since no attacker is trying to provoke a collision in this scenario) than the probability of a hardware error.
So my conclusion is to combine a secret with a slow hash.
byte[] secret = DeriveKeyFromPassword(pwd, salt, iterations);  // enough iterations for this to take perhaps a second
and then, for the actual hash, use a KDF again, combining the secret and the domain name.
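A sketch of that scheme in Python, using PBKDF2 as the slow KDF and HMAC to bind the secret to each domain name - the parameter choices and names here are illustrative assumptions, not a vetted recipe:

    import hashlib, hmac, os

    salt = os.urandom(16)
    # Slow step: tune the iteration count so this takes around a second.
    secret = hashlib.pbkdf2_hmac("sha256", b"user password", salt, 600_000)

    def domain_tag(domain: str) -> bytes:
        # Fast per-domain step: without the secret, an attacker cannot
        # precompute tags for a downloadable list of known domains.
        return hmac.new(secret, domain.encode(), hashlib.sha256).digest()

    blocked = {domain_tag("nsfw.example")}
    print(domain_tag("nsfw.example") in blocked)   # True
    print(domain_tag("news.example") in blocked)   # False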
Any form of lossless public-key encryption where you forget the private key.
Well, any lossless compressor with a password would work.
Or you could salt your input with some text known only to you. That would give you an output as long as the input; you could then run some sort of lossless compression on the result to make it shorter.
You can find a hash function with a low probability of that happening, but I think all of them are prone to birthday attacks. You can try to use a function with a large output size to minimize that probability.
Well, what about an MD5 hash? A SHA-1 hash?
I don't think it can exist; if you could put anything into it and always get a different result, the output couldn't be a fixed-length byte array, and it would lose a lot of its usefulness.
Perhaps instead of a hash what you are looking for is reversible encryption? That should be unique. Won't be fast, but it will be unique.
I mean, I don't need to find actual collisions to know they exist. If there were no collisions, how could the results be fixed-length? That's why I don't understand what people mean when they claim 'MD5 is insecure! Someone found collisions!', or something like that.
The only thing I can think of is that the collision search only looks at dictionary words, e.g. if 'dog' and 'house' shared the same hash, it would be a stupid hashing method IMO. It could also look at strings of length < X, with X somewhere between 5 and 10 (passwords that people could remember).
Am I totally wrong?
MD5 is a 128-bit hash, so there are 2^128 possible hash values. If the hash were perfect, it would in theory require around 2^64 hash attempts to find a collision (and a naive search would have to store all 2^64 digests, because each new hash must be compared against all previous values). Storing and comparing that many digests is far beyond practical reach, so you would be safe.
The attacks on MD5 allow collisions to be found with significantly fewer than 2^64 hash computations and significantly less than 128 x 2^64 bits of storage. That's why MD5 is considered broken.
Currently there are no similar attacks that work on full-strength SHA-1, but it's expected that such attacks will be publicly known within a few years.
As you know, a collision is the term for the situation where two different things (e.g. documents) hash to the same value.
Clearly, collisions are always theoretically possible for a secure hashing algorithm. But the security of secure hashing comes from:
using a large domain of possible hash values, and
using a hashing algorithm with the property that trial and error is close to the best way to produce a document with a given hash.
If both of these criteria are satisfied, then the probability of someone being able to manufacture a collision for a given document is vanishingly small. This is sufficient to make it impractical to (for example) change the content of a document with a digital signature.
The problem is that clever people have figured out ways that are a LOT faster than trial and error for creating documents whose MD5 signatures collide. Hence they can defeat digital signatures, and similar uses of MD5 to provide security.
FOLLOWUP
This quote comes from the Wikipedia page on MD5:
MD5 makes only one pass over the data, so if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more likely to be accepted as valid data by the application using it. Furthermore, current collision-finding techniques allow to specify an arbitrary prefix: an attacker can create two colliding files that both begin with the same content. All the attacker needs to generate two colliding files is a template file with a 128-byte block of data aligned on a 64-byte boundary that can be changed freely by the collision-finding algorithm.
I don't completely understand this, but it looks like a recipe for producing files with (different) meaningful content and the same signature.
In practice, it's not about whether a single colliding sample was found, but about whether a method was found. Such a method can be based on some property ("if you hash values of length N ending with ..., you will get the same hash" - a silly example), or on some algorithm ("given this hash/value, here is how to construct a new value with the same hash").
Collisions will of course always exist, but the interesting problem is how to find them. I'm not sure what the source of the claim you quoted is, but I'm pretty sure it was actually supposed to mean "no practical way of finding collisions has been found yet for this hashing method".
When you see "no collisions found" for the SHA-256 hash, for example, it really means that no hash collision has ever been observed. You are right that collisions exist in theory, and a SHA-256 collision may even have happened already without anyone noticing, but this is irrelevant.
To find a collision by chance you would need, on average, 18 quintillion hash attempts for an MD5 hash and 340 undecillion attempts for a SHA-256 hash, already accounting for the birthday problem.
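Those figures are just the birthday bounds, which a couple of lines of Python can confirm:

    print(f"{2**64:.3e}")    # ~1.845e+19 attempts: 18 quintillion (128-bit MD5)
    print(f"{2**128:.3e}")   # ~3.403e+38 attempts: 340 undecillion (256-bit SHA-256)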
As vy32 said, it is computationally infeasible to compute, store, and compare so many hashes. So, in order to find a collision, you need a method that is many orders of magnitude faster than random trial and error. If such a method exists for a secure hash, the hash is considered broken, at least with regard to general collision resistance.
So, saying "someone found a collision in this xxx-bit hash" is in fact synonymous with saying "a practical method of finding collisions was found for this hash, making it insecure". The alternative is a cosmically unlikely event, and it would be reported differently.
I was reading wikipedia, and it says
Cryptographic hash functions are a third type of cryptographic algorithm. They take a message of any length as input, and output a short, fixed-length hash which can be used in (for example) a digital signature. For good hash functions, an attacker cannot find two messages that produce the same hash.
But why? What I understand is that you can put the whole of Macbeth into the hash function and get a hash of some fixed length X. Then you can put in Beowulf and get another hash, again of length X.
So since this function maps loads of things onto a shorter output, there are bound to be overlaps; I might put The Hobbit into the hash function and get the same output as for Beowulf. This is inevitable, right, since we are producing a shorter output from our input? And even if such a pair is found, why is it a problem?
I can imagine that if I could invert the hash and get The Hobbit out instead of Beowulf, that would be bad, but why is a collision useful to an attacker?
Yes, of course there will be collisions for the reasons you describe.
I suppose the statement should really be something like this: "For good hash functions, an attacker cannot find two messages that produce the same hash, except by brute-force".
As for the why...
Hash algorithms are often used for authentication. By checking the hash of a message you can be (almost) certain that the message itself hasn't been tampered with. This relies on it being infeasible to find two messages that generate the same hash.
If a hash algorithm allows collisions to be found relatively easily then it becomes useless for authentication because an attacker could then (theoretically) tamper with a message and have the tampered message generate the same hash as the original.
Yes, it's inevitable that there will be collisions when mapping a long message onto a shorter hash, as the hash cannot contain all possible values of the message. For the same reason you cannot 'invert' the hash to uniquely produce either Beowulf or The Hobbit - but if you generated every possible text and filtered out the ones that had your particular hash value, you'd find both texts (amongst billions of others).
The article is saying that it should be hard for an attacker to find or construct a second message that has the same hash value as a first. Cryptographic hash functions are often used as proof that a message hasn't been tampered with - if even a single bit of data flips then the hash value should be completely different.
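That "single flipped bit changes everything" property (the avalanche effect) is easy to see for yourself:

    import hashlib

    msg = b"The quick brown fox"
    flipped = bytes([msg[0] ^ 0x01]) + msg[1:]   # flip one bit in the first byte

    print(hashlib.sha256(msg).hexdigest())
    print(hashlib.sha256(flipped).hexdigest())   # bears no resemblance to the first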
A couple of years back, Dutch researchers demonstrated weaknesses in MD5 by publishing a hash of their "prediction" for the US presidential election. Of course, they had no way of knowing the outcome in advance - but with the computational power of a PS3 they constructed a PDF file for each candidate, all with the same hash value. The implications for MD5 - already on its way out - as a trusted algorithm for digital signatures became even more dire...
Cryptographic hashes are used for authentication. For instance, peer-to-peer protocols rely heavily on them. They use them to make sure that an ill-intentioned peer cannot spoil the download for everyone else by distributing packets that contain garbage. The torrent file that describes a download contains the hashes of each block. With this check in place, a victim peer can detect that it has been handed a corrupted block and download it again from someone else.
The attacker would like to replace Beowulf with The Hobbit to increase Saxon poetry's visibility, but the cryptographic hash used in the protocol won't let him.
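The per-block check a peer performs is essentially this - a simplified sketch; real torrent metadata stores a SHA-1 digest per piece:

    import hashlib

    block = b"some block of downloaded data"
    expected = hashlib.sha1(block).hexdigest()   # shipped in the torrent metadata

    def verify(data: bytes, expected_hex: str) -> bool:
        return hashlib.sha1(data).hexdigest() == expected_hex

    print(verify(block, expected))       # True: accept the block
    print(verify(b"garbage", expected))  # False: re-download from another peer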
If it were easy to find collisions, an attacker could create malicious data and simply pad it with dummy data until a collision is found. The hash check would then pass for the malicious data. That is why collisions should only be findable by brute force and be as rare as possible.
Collisions are also a problem for certificates.
I was a bit inspired by this blog entry http://blogs.technet.com/dmelanchthon/archive/2009/07/23/windows-7-rtm.aspx (German)
The current notion is that MD5 and SHA-1 are both somewhat broken. Not easily and quickly, but at least for MD5 it is within the range of practical possibility. (I'm not at all a crypto expert, so maybe I'm wrong about stuff like that.)
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
First, would it be possible at all?
Second, would it be possible in reality, with current hardware/software?
If not, wouldn't the easiest way to provide assurance of the integrity of a file be to always use two different algorithms, even if each has some kind of weakness?
Updated:
Just to clarify: the idea is to have a file A and a file A' which fulfill the conditions:
size(A) == size(A') && md5sum(A) == md5sum(A') && sha1sum(A) == sha1sum(A')
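Expressed as a runnable check, that condition is straightforward (a Python sketch; the helper names are mine):

    import hashlib, os

    def file_digest(path: str, algo: str) -> bytes:
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()

    def same_fingerprint(a: str, b: str) -> bool:
        # size(A) == size(A') && md5sum(A) == md5sum(A') && sha1sum(A) == sha1sum(A')
        return (os.path.getsize(a) == os.path.getsize(b)
                and file_digest(a, "md5") == file_digest(b, "md5")
                and file_digest(a, "sha1") == file_digest(b, "sha1"))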
"Would it be possible at all?" - yes, if the total size of the checksums is smaller than the total size of the file, it is impossible to avoid collisions.
"would it be possible in reality, with current hardware/software?" - if it is feasible to construct a text to match a given checksum for each of the checksums in use, then yes.
See wikipedia on concatenation of cryptographic hash functions, which is also a useful term to google for.
From that page:
"However, for Merkle-Damgård hash
functions, the concatenated function
is only as strong as the best
component, not stronger. Joux noted
that 2-collisions lead to
n-collisions: if it is feasible to
find two messages with the same MD5
hash, it is effectively no more
difficult to find as many messages as
the attacker desires with identical
MD5 hashes. Among the n messages with
the same MD5 hash, there is likely to
be a collision in SHA-1. The
additional work needed to find the
SHA-1 collision (beyond the
exponential birthday search) is
polynomial. This argument is
summarized by Finney."
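For reference, the concatenated construction being discussed is just both digests side by side - and per Joux, its collision resistance is only about that of the stronger component, not the full 288 bits:

    import hashlib

    def md5_sha1(data: bytes) -> bytes:
        # 128-bit MD5 digest followed by 160-bit SHA-1 digest: 288 bits total.
        return hashlib.md5(data).digest() + hashlib.sha1(data).digest()

    print(md5_sha1(b"file contents").hex())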
For a naive answer, we'd have to make some (incorrect) assumptions:
Both the SHA1 and MD5 hashing algorithms result in an even distribution of hash values for a set of random inputs
Algorithm details aside--a random input string has an equally likely chance of producing any hash value
(Basically, no clumping and nicely distributed domains.)
If the probability of discovering a string that collides with another's SHA1 hash is p1, and similarly p2 for MD5, the naive answer is that the probability of finding one that collides with both is p1*p2.
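In numbers, under those (incorrect) independence assumptions:

    # Back-of-the-envelope only: assumes the two hashes behave independently.
    p_sha1 = 2.0 ** -160   # chance a random input matches a fixed SHA1 digest
    p_md5  = 2.0 ** -128   # same for MD5
    print(p_sha1 * p_md5)  # ~2e-87, i.e. 2^-288: the naive joint probability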
However, the hashes are both broken, so we know our assumptions are wrong.
The hashes have clumping and are more sensitive to changes in some data than in others - in other words, they aren't perfect. On the other hand, a perfect, non-broken hashing algorithm would have the above properties, and that's exactly what makes it hard to find collisions: the outputs look random.
The probability intrinsically depends on the properties of the algorithm - basically, since our assumptions aren't valid, we can't "easily" determine how hard it is. In fact, the difficulty of finding an input that collides likely depends very strongly on the characteristics of the input string itself. Some may be relatively easy (but still probably impractical on today's hardware), and due to the different nature of the two algorithms, some may actually be impossible.
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
Yes, make a copy of the file.
Other than that, not without large amounts of computing resources to check tons of permutations (assuming the file size is non-trivial).
You can think of it like this:
If the file size increases by n, the likelihood of a possible fake increases, but the computing cost necessary to test the combinations grows exponentially, by a factor of 2^n.
So the bigger your file is, the more likely it is that a dupe exists out there, but the less likely you are to find it.
In theory, yes, you could have one; in practice, it would be a hell of a collision to find. In practice nobody has even been able to create a single SHA-1 collision, let alone an MD5 + SHA-1 + size collision all at once. That combination is simply impossible right now without taking all the computing power in the world and running it for a while.
Although in the near future we might see more vulnerabilities in SHA-1 and MD5. And with the support of better hardware (especially GPUs), why not?
In theory you could do this. In practice, if you started from the two checksums provided by MD5 and SHA-1 and tried to create a file that produced the same two checksums, it would be very difficult (many times more difficult than creating a file that produced the same MD5 checksum or SHA-1 checksum in isolation).