All the online articles and blogs say MD5 is vulnerable, but how exactly does it produce the same hash from two different inputs? I use an application called fdupes which recursively calculates the MD5 hash of files in a directory and its subdirectories and deletes the duplicates if the delete option is specified. In this case, what are the chances that it produces the same hash for two different files? I would also like to know how to reproduce such behavior.
I have a requirement to compare two Excel/PPT/CSV files which may have exactly the same content but may have been created at different points in time.
I want to compare only the file contents in whatever manner possible using any nodejs package.
But I couldn't figure out an easy way to do it; neither stream comparison nor buffer comparison helped.
I've done more research without much success, and I'm wondering how it would be possible to ignore things such as timestamps and other metadata during the comparison and match on the contents alone.
I've tried stream-compare, stream-equal, file-compare, buff1.equals(buff2) and a few others, but none of them seems to have worked for my requirement.
But I didn't find any node package on the web which does what I am looking for.
Any insights or any suggestions as how it can be achieved?
Thanks in advance; any help would be appreciated.
Search for a package that computes a hash of the document, for example crypto: calculate hashes (SHA-256) for the two docs and compare them. If the hashes match, the document contents are the same (there is still a chance of a hash collision, but it depends on the hash algorithm you are using; SHA-256 will give you decent confidence that the documents are identical). Check this thread for more details: Obtaining the hash of a file using the stream capabilities of crypto module (ie: without hash.update and hash.digest)
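For example, a minimal sketch using Node's built-in crypto and fs modules (the file names here are just placeholders): hash both files and compare the hex digests. The linked thread shows a pure stream variant; this one reads the file as a stream and feeds the chunks to the hash.

const crypto = require('crypto');
const fs = require('fs');

// Compute the SHA-256 hex digest of a file's bytes
function hashFile(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    const stream = fs.createReadStream(path);
    stream.on('error', reject);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('end', () => resolve(hash.digest('hex')));
  });
}

// Same digest => same byte content (filesystem timestamps are not part of the bytes)
Promise.all([hashFile('report-v1.xlsx'), hashFile('report-v2.xlsx')])
  .then(([a, b]) => console.log(a === b ? 'same content' : 'different content'));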
Is there a way to actually find out whether a hash is MD5 or MD4?
For instance:
10c7ccc7a4f0aff03c915c485565b9da is an MD5 hash
be08c2de91787c51e7aee5dd16ca4f76 is an MD4 hash
I know there is a difference security-wise, but how can someone determine which hash is which programmatically, or just by eye? Or is there really no way to know for sure?
I was thinking about taking a hash and comparing it to both of them. However, I noticed that they look identical and there is no way to really tell the difference! The first character is not necessarily a digit in MD5, and it is not necessarily a letter in MD4. So how can someone determine which hash is being used?
Both MD5 and MD4 output 128-bit digests (16 bytes, or 32-character hex strings). The representations of the digests (hashes) of the two algorithms are indistinguishable from one another unless they are annotated with extra metadata (which would need to be provided by the application).
You wouldn't be able to tell without recomputing the hashes over the original file or original piece of data.
If you have the data and are given two digests to differentiate, you will have to recompute with one of the two algorithms and hope the digest you obtain matches one of the two under test. If the input data has been corrupted, you will not reproduce either of them.
If you don't have access to the original data, and you had to guess today at the nature of just one (not two) 32-hex-character digest that goes with that data, it would most likely be MD5. MD4 was badly broken by 2008 and considered historic by 2011 (RFC 6150).
If you're given a single hash, and that hash is computed over a file, its hashing algorithm is in practice indicated by the file extension of the checksum file (.md5sum, .md4sum, .sha1sum, etc.).
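As a quick illustration of why the digests are indistinguishable by format alone, here is a small Node.js sketch (md4 is only available if the underlying OpenSSL build still ships it, so treat that part as an assumption):

const crypto = require('crypto');

// Both digests are 128 bits = 32 hex characters; nothing in the string says which algorithm made it
const md5 = crypto.createHash('md5').update('hello world').digest('hex');
console.log(md5, md5.length); // 32

try {
  const md4 = crypto.createHash('md4').update('hello world').digest('hex');
  console.log(md4, md4.length); // also 32
} catch (e) {
  console.log('md4 is not supported by this OpenSSL build');
}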
I'm trying to find a way to transform a large number of similar strings into unique hashes. The reason is that each string (a URL) is used to generate a file on S3, for later access.
I need to be able to rebuild the hash at a later stage for comparison purposes.
I've used MD5 up until now, but some strings are long and very similar, which gave me duplicates.
I believe that a 256- or 512-bit hash would work, but maybe there's a best practice? I'd just use the URL-encoded string as the filename, but one requirement is that a user shouldn't be able to locate the base file on our server from the name of the S3 file.
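One possible approach, sketched below with Node's crypto module (the URL is made up): use the SHA-256 hex digest of the URL as the S3 key. It is deterministic (so it can be recomputed later for comparison), fixed-length, and does not reveal the original path.

const crypto = require('crypto');

// Deterministic, fixed-length, non-reversible object name derived from the URL
function s3KeyFor(url) {
  return crypto.createHash('sha256').update(url).digest('hex');
}

console.log(s3KeyFor('https://example.com/some/long/and/very/similar/path?page=1'));
// 64 hex characters; recomputing from the same URL always yields the same key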
I'd like to index files in a local database, but I don't understand how I can identify each individual file. For example, if I store the file path in the database then the entry will no longer be valid if the file is moved or deleted. I imagine there is some way of uniquely identifying files no matter what happens to them, but I have had no success with Google.
This will be for *nix/Linux and ext4 in particular, so please nothing specific to Windows or NTFS or anything like that.
In addition to the excellent suggestion above, you might consider using the inode number property of the files, viewable in a shell with ls -i.
Using index.php on one of my boxes:
ls -i
yields
196237 index.php
I then rename the file using mv index.php index1.php, after which the same ls -i yields:
196237 index1.php
(Note the inode number is the same)
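If you want that number programmatically rather than from the shell, a stat call returns it; here is a minimal Node.js sketch (any language with a stat call works, and the file name is just the example above):

const fs = require('fs');

// st.ino is the same inode number that `ls -i` prints; st.dev identifies the device/filesystem.
// The pair survives mv/rename, but not copy or delete-and-recreate.
const st = fs.statSync('index.php');
console.log(st.ino, st.dev);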
Try using a hashing scheme such as MD5, SHA-1, or SHA-2; these will allow you to match the files up by content.
Basically, when you first create the index you hash all the files that you wish to add. The digest is pretty good at telling you whether two files are the same or different. Then, when you need to see if one of the files is already in the index, hash it and compare the generated hash to your table of known hashes.
EDIT: As was said in the comments, it is a good idea to incorporate both pieces of data so that you can more accurately track changes.
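A minimal sketch of that index in Node.js (the directory and file paths are placeholders): hash each file once into a table keyed by digest, then look files up by content later.

const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Content digest of a single file
function hashFileSync(p) {
  return crypto.createHash('sha256').update(fs.readFileSync(p)).digest('hex');
}

// Build the index: digest -> path
const index = new Map();
for (const name of fs.readdirSync('/srv/files')) {
  const p = path.join('/srv/files', name);
  if (fs.statSync(p).isFile()) index.set(hashFileSync(p), p);
}

// Later: has this (possibly moved or renamed) file been indexed already?
const candidate = '/tmp/unknown-file';   // hypothetical path to check
console.log(index.has(hashFileSync(candidate)) ? 'known content' : 'new content');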
If you do not consider files with the same content to be the same file, and only want to track moved/renamed files, then the inode number will do. Otherwise you will have to hash the content.
The only fly in the ointment with inodes is that they can be reassigned after a delete (depending on the platform): you need to record the file creation timestamp as well as the device id to be 100% sure. It's easier on Windows with its user file attributes.
I was a bit inspired by this blog entry http://blogs.technet.com/dmelanchthon/archive/2009/07/23/windows-7-rtm.aspx (German)
The current notion is that MD5 and SHA-1 are both somewhat broken. Not easily and quickly, but at least for MD5 within the range of practical possibility. (I'm not at all a crypto expert, so maybe I'm wrong about that.)
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
First, would it be possible at all?
Second, would it be possible in reality, with current hardware/software?
If not, wouldn't the easiest way to provide assurance of the integrity of a file be to always use two different algorithms, even if each has some kind of weakness?
Updated:
Just to clarify: the idea is to have a file A and a file A' which fulfill the conditions:
size(A) == size(A') && md5sum(A) == md5sum(A') && sha1sum(A) == sha1sum(A')
"Would it be possible at all?" - yes, if the total size of the checksums is smaller than the total size of the file, it is impossible to avoid collisions.
"would it be possible in reality, with current hardware/software?" - if it is feasible to construct a text to match a given checksum for each of the checksums in use, then yes.
See wikipedia on concatenation of cryptographic hash functions, which is also a useful term to google for.
From that page:
"However, for Merkle-Damgård hash
functions, the concatenated function
is only as strong as the best
component, not stronger. Joux noted
that 2-collisions lead to
n-collisions: if it is feasible to
find two messages with the same MD5
hash, it is effectively no more
difficult to find as many messages as
the attacker desires with identical
MD5 hashes. Among the n messages with
the same MD5 hash, there is likely to
be a collision in SHA-1. The
additional work needed to find the
SHA-1 collision (beyond the
exponential birthday search) is
polynomial. This argument is
summarized by Finney."
For a naive answer, we'd have to make some (incorrect) assumptions:
Both the SHA1 and MD5 hashing algorithms result in an even distribution of hash values for a set of random inputs
Algorithm details aside--a random input string has an equally likely chance of producing any hash value
(Basically, no clumping and nicely distributed domains.)
If the probability of discovering a string that collides with another string's SHA-1 hash is p1, and similarly p2 for MD5, the naive answer is that the probability of finding one that collides with both is p1*p2. For ideal hashes that would be roughly 2^-160 * 2^-128 = 2^-288 per candidate input.
However, the hashes are both broken, so we know our assumptions are wrong.
The hashes have clumping and are more sensitive to changes in some data than in others; in other words, they aren't perfect. On the other hand, a perfect, non-broken hashing algorithm would have the above properties, and that's exactly what makes it hard to find collisions: its outputs are effectively random.
The probability intrinsically depends on the properties of the algorithm: since our assumptions aren't valid, we can't "easily" determine how hard it is. In fact, the difficulty of finding an input that collides likely depends very strongly on the characteristics of the input string itself. Some may be relatively easy (but still probably impractical on today's hardware), and due to the different nature of the two algorithms, some may actually be impossible.
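To get a feel for the numbers, here is a small Node.js experiment (the parameters are arbitrary): brute-force a collision on an MD5 digest truncated to 24 bits. It succeeds after a few thousand attempts, and each extra bit roughly doubles the work, which is why a generic birthday search on the full 128-bit digest (on the order of 2^64 hashes), let alone on MD5 and SHA-1 simultaneously, is out of reach this way.

const crypto = require('crypto');

const HEX_CHARS = 6;          // 6 hex chars = 24 bits of digest
const seen = new Map();       // truncated digest -> input that produced it

for (let i = 0; ; i++) {
  const msg = 'input-' + i;
  const h = crypto.createHash('md5').update(msg).digest('hex').slice(0, HEX_CHARS);
  if (seen.has(h)) {
    console.log('collision after', i + 1, 'hashes:', seen.get(h), 'and', msg, '->', h);
    break;
  }
  seen.set(h, msg);
}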
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
Yes, make a copy of the file.
Other than that, not without large amounts of computing resources to check tons of permutations (assuming the file size is non-trivial).
You can think of it like this:
If the file size increases by n, the likelihood that a possible fake exists increases, but the computing cost necessary to test the combinations increases exponentially, by about 2^n.
So the bigger your file is, the more likely there is a dupe out there, but the less likely you are to find it.
In theory yes, you can have it; in practice it is a hell of a collision to find. In practice no one has even been able to create a SHA-1 collision, let alone an MD5 + SHA-1 + size collision at the same time. This combination is simply impossible right now without having all the computing power in the world and running it for a while.
Although in the near future we might see more vulnerabilities in SHA-1 and MD5. And with the support of better hardware (especially GPUs), why not.
In theory you could do this. In practice, if you started from the two checksums provided by MD5 and SHA1 and tried to create a file that produced the same two checksums - it would be very difficult (many times more difficult than creating a file that produced the same MD5 checksum, or SHA1 checksum in isolation).