Is it good practice to use CRC32 for hashing Strings? If not, what are better alternatives?
I have objects whose uniqueness is defined by 2 strings. I would like to add these objects to a MongoDB database, add the calculated hash as a field on each object, create an index on that field, and then search the DB for an object when I have the 2 strings (and can calculate the hash).
Thanks.
It would work, but a CRC is not the best choice for hashing. There are many hash functions that have been developed to be both fast and to minimize several different kinds of collision threats.
A very good example is the CityHash family of algorithms.
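As a rough sketch of that approach in Python (using the standard library's blake2b as a stand-in for CityHash so the example has no dependencies; the function name and digest size are my own choices):

```python
import hashlib

def object_key(s1: str, s2: str) -> str:
    # Length-prefix each string so that different pairs can never
    # concatenate to the same bytes ("ab" + "c" vs. "a" + "bc").
    data = f"{len(s1)}:{s1}|{len(s2)}:{s2}".encode("utf-8")
    # A 16-byte digest keeps the indexed field small while making
    # accidental collisions negligible.
    return hashlib.blake2b(data, digest_size=16).hexdigest()
```

Store the returned hex string as the object's field and index it in Mongo.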
Related
I oftentimes use filepaths to provide some sort of unique id for a software system. Is there any way to take a filepath and turn it into a unique integer in a relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
const fs = require('fs');   // Node.js built-in
fs.statSync(filename).ino   // the file's inode number
djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
Another simple approach is hashing the path into a big integer space. E.g. using a 128-bit MurmurHash (in Java I'd use the Guava Hashing class; there are several JS ports), the chance of a collision among a billion paths is still about 1/2^96. If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
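A minimal sketch of that paranoid variant in Python, assuming the third-party mmh3 package for the 128-bit MurmurHash (all names here are mine):

```python
import mmh3  # MurmurHash3 bindings: pip install mmh3

ids: dict[str, int] = {}   # path -> id already handed out
used: set[int] = set()     # ids in use

def path_id(path: str) -> int:
    # 128-bit hash of the path; on the astronomically unlikely
    # collision with another path's id, rehash with a salt.
    if path in ids:
        return ids[path]
    salt, h = 0, mmh3.hash128(path)
    while h in used:
        salt += 1
        h = mmh3.hash128(f"{path}\x00{salt}")
    ids[path] = h
    used.add(h)
    return h
```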
This is just my comment turned into an answer.
If you run it in memory, you can use one of the standard hashmaps in your language of choice, not just for file names but for any similar situation. Normally, hashmaps in different programming languages resolve collisions with buckets, so the bucket number together with the position within the bucket provides a unique id.
Btw, it is not hard to write your own hashmap, so that you have control over the underlying structure (e.g. to retrieve the number, etc.).
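A toy illustration of that idea in Python (separate chaining, no resizing; note that Python randomizes string hashes per process, so the ids are only stable within one run):

```python
class IdHashMap:
    # Hands out a stable (bucket, slot) id per distinct key.
    def __init__(self, n_buckets: int = 1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def id_for(self, key: str) -> tuple[int, int]:
        b = hash(key) % len(self.buckets)
        bucket = self.buckets[b]
        for slot, existing in enumerate(bucket):
            if existing == key:
                return (b, slot)        # key seen before: same id
        bucket.append(key)
        return (b, len(bucket) - 1)     # new key: next free slot
```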
I wonder: would you consider checking the lengths of database fields to be part of a secure SDLC?
When designing the database, one must consider checking the length of its fields.
Without needing to go into much detail, I may mention the famous password field: what if the developers follow the good security practice of hashing and salting (or even peppering) the password, but the length of the password field in the database is shorter than the length of the hash function's output? Depending on the situation, this may render hashing the passwords almost useless.
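A small Python illustration of how that mismatch bites (the column length is a hypothetical schema limit; a real application would read it from the schema):

```python
import hashlib, os

PASSWORD_COLUMN_LENGTH = 32   # hypothetical VARCHAR(32) in the schema

def encode_password(password: str) -> str:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    encoded = salt.hex() + ":" + digest.hex()   # 32 + 1 + 64 = 97 chars
    # A database that silently truncates this to 32 characters would keep
    # the salt and discard the digest, so no login could ever be verified.
    if len(encoded) > PASSWORD_COLUMN_LENGTH:
        raise ValueError(
            f"hash output is {len(encoded)} chars but the column holds "
            f"only {PASSWORD_COLUMN_LENGTH}; widen the column"
        )
    return encoded
```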
Apart from this, you may run into trouble if you do not check the lengths in question at some point, as explained in the accepted answer to "overstating field size in database design".
In practice, however, checking the lengths of the database fields is not enough if it is not coupled with data validation and data sanitization concepts.
I am working on a small project to keep my skills from completely rusting.
I am generating a lot of hashes (in this case MD5) and I need to check if I've seen a given hash before, so I wanted to keep them in some kind of list.
What's the best way to store them so that I can check whether a hash already exists prior to doing calculations?
The hash itself is already a key of sorts. Your best bet is a hash table. In a properly implemented hash table, you can check for the existence of a key in constant time. Common hash table implementations with this feature are C# Dictionaries, Python's dict type, PHP arrays (which are actually maps, not arrays), Perl's hashes (%) and Ruby's Hash. If you had included details of what language you're working in, an example wouldn't be too hard to look up.
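For instance, in Python (picking one of the languages above), a set of digests gives the constant-time membership test; the function name is made up:

```python
import hashlib

seen: set[str] = set()

def process_once(data: bytes) -> bool:
    # Do the expensive work only the first time this digest appears.
    digest = hashlib.md5(data).hexdigest()
    if digest in seen:        # average O(1) lookup
        return False
    seen.add(digest)
    # ... calculations go here ...
    return True
```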
So I have the code for a hashing function, and from the looks of it, there's no way to simply unhash it (lots of bitwise ANDs, ORs, Shifts, etc). My question is, if I need to find out the original value before being hashed, is there a more efficient way than just brute forcing a set of possible values?
Thanks!
EDIT: I should add that in my case, the original message will never be longer than several characters, for my purposes.
EDIT2: Out of curiosity, are there any ways to do this on the run, without precomputed tables?
Yes: rainbow table attacks. This is especially true for hashes of short strings, i.e. hashes of small strings like 'true', 'false', 'etc' can be stored in a dictionary and used as a comparison table. This speeds up the cracking process considerably. Also, if the hash size is short (e.g. MD5), the algorithm becomes especially easy to crack. Of course, the way around this issue is combining 'cryptographic salts' with passwords before hashing them.
There are two very good sources of info on the matter: Coding Horror: Rainbow Hash Cracking and Wikipedia: Rainbow table.
Edit: Rainbow tables can take tens of gigabytes, so downloading (or reproducing) them may take weeks just to run simple tests. Instead, there seem to be some online tools for reversing simple hashes: http://www.onlinehashcrack.com/ (e.g. try to reverse 463C8A7593A8A79078CB5C119424E62A, which is the MD5 hash of the word 'crack').
"Unhashing" is called a "preimage attack": given a hash output, find a corresponding input.
If the hash function is "secure" then there is no better attack than trying possible inputs until a hit is found; for a hash function with an n-bit output, the average number of hash function invocations will be about 2^n, i.e. Way Too Much for current earth-based technology if n is greater than 180 or so. To state it otherwise: if an attack method faster than this brute-force method is found for a given hash function, then the hash function is deemed irreparably broken.
MD5 is considered broken, but for other weaknesses (there is a published method for preimages with cost 2^123.4, which is thus about 24 times faster than the brute-force cost, but it is still so far into the technologically infeasible range that it cannot be confirmed).
When the hash function input is known to be part of a relatively small space (e.g. it is a "password", so it could fit in the brain of a human user), then one can optimize preimage attacks by using precomputed tables: the attacker still has to pay the search cost once, but he can reuse his tables to attack multiple instances. Rainbow tables are precomputed tables with a space-efficient compressed representation: with rainbow tables, the bottleneck for the attacker is CPU power, not the size of his hard disks.
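As a toy illustration of the precomputed-table idea (a plain Python dictionary rather than a real rainbow table, which adds the space-saving chain compression; the word list is made up):

```python
import hashlib

# Precompute digests for a small input space, then "reverse" any
# hash from that space with a single lookup.
words = ["true", "false", "password", "123456", "letmein"]
table = {hashlib.md5(w.encode()).hexdigest(): w for w in words}

def invert(digest: str) -> str | None:
    return table.get(digest)   # None: outside the precomputed space

print(invert(hashlib.md5(b"true").hexdigest()))   # -> true
```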
Assuming the "normal case", the original message will be many times longer than the hash. Therefore, it is in principle absolutely impossible to derive the message from the hash, simply because you cannot calculate information that is not there.
However, you can guess what's probably the right message, and there exist techniques to accelerate this process for common messages (such as passwords), for example rainbow tables. If the hash matches, it is very likely that something that looks sensible is the right message.
Finally, it may not be necessary at all to find the real message, as long as you can find one which will pass. This is the subject of a known attack on MD5: the attack lets you create a different message which gives the same hash.
Whether this is a security problem or not depends on what exactly you use the hash for.
This may sound trivial, but if you have the code to the hashing function, you could always override a hash table container class's hash() function (or similar, depending on your programming language and environment). That way, you can hash strings of, say, 3 characters or less, and then store each hash as a key by which you obtain the original string, which appears to be exactly what you want. Use this method to construct your own rainbow table, I suppose. If you have the code to the program environment in which you want to find out these values, you could always modify it to store hashes in the hash table.
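For example, in Python you could build the reverse map for every lowercase string of up to 3 characters directly (a sketch; 26 + 26^2 + 26^3 = 18,278 entries):

```python
import hashlib, itertools, string

reverse = {}
for n in (1, 2, 3):
    for chars in itertools.product(string.ascii_lowercase, repeat=n):
        s = "".join(chars)
        reverse[hashlib.md5(s.encode()).hexdigest()] = s

print(reverse[hashlib.md5(b"abc").hexdigest()])   # -> abc
```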
I mean, I don't need to look for actual collisions to know they exist. If there weren't collisions, then how could you have fixed-length results? That's why I don't understand what people mean when they claim 'MD5 is insecure! Someone found collisions!', or something like that.
The only thing I can think of is that the collision search only looks for dictionary words, e.g. if 'dog' and 'house' shared the same hash, it would be a stupid hashing method IMO. It could also look for strings with a length < X, with X being something between 5 and 10 (passwords that people could remember).
Am I totally wrong?
MD5 is a 128-bit hash, so there are 2^128 possible hashes. If the hash were perfect, then in theory it would require around 2^64 different hash attempts to find a collision (and you would have to store all 2^64 of them, because each new hash would require comparison to all previous values). There aren't 2^64 bits of storage on the planet, so you would be safe.
The attacks on MD5 allow collisions to be found with significantly less than 2^64 hashes and significantly less than 128 x 2^64 bits of storage. That's why MD5 is considered broken.
Currently there are no similar attacks that work on full-strength SHA-1, but it's expected that such attacks will be publicly known within a few years.
As you know, a collision is the term for the situation where two different things (e.g. documents) hash to the same value.
Clearly, collisions are always theoretically possible for a secure hashing algorithm. But the security of secure hashing comes from:
using a large domain of possible hash values, and
using a hashing algorithm with the property that trial and error is close to the best way to produce a document with a given hash.
If both of these criteria are satisfied, then the probability of someone being able to manufacture a collision for a given document is vanishingly small. This is sufficient to make it impractical to (for example) change the content of a document with a digital signature.
The problem is that clever people have figured out a way (or ways) that is a LOT faster than trial and error for creating documents whose MD5 signatures collide. Hence they can defeat digital signatures, and similar uses of MD5 to provide security.
FOLLOWUP
This quote comes from the Wikipedia page on MD5:
MD5 makes only one pass over the data, so if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more likely to be accepted as valid data by the application using it. Furthermore, current collision-finding techniques allow to specify an arbitrary prefix: an attacker can create two colliding files that both begin with the same content. All the attacker needs to generate two colliding files is a template file with a 128-byte block of data aligned on a 64-byte boundary that can be changed freely by the collision-finding algorithm.
I don't completely understand this, but it looks like a recipe for producing files with (different) meaningful content and the same signature.
In practice, it's not about whether a single sample was found, but about a method. These can be based either on some property ('if you hash values of length N, ending with ..., etc., you will get the same hash'; a silly example) or on some algorithm ('given this hash/value, this is how you get a new value with the same hash').
Collisions will of course always exist, but the interesting problem is how to find them. I'm not sure what the source of the claim you quoted is, but I'm pretty sure it was supposed to mean "no practical way to find collisions has yet been found for this hashing method".
When you see "No collisions found" for the SHA-256 hash, for example, it really means that no hash collisions have ever been found. You are right that collisions theoretically exist, and a SHA-256 collision may already have happened that no one noticed, but this is irrelevant.
To find a collision by chance, you would need on average 18 quintillion hash attempts for an MD5 hash, and 340 undecillion attempts for a SHA-256 hash, already accounting for the birthday problem.
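Those figures are just 2^64 and 2^128, the birthday bounds for 128-bit and 256-bit hashes; a quick sanity check in Python:

```python
# Birthday bound: a collision in an n-bit hash takes roughly
# 2**(n/2) attempts on average.
for name, bits in [("MD5", 128), ("SHA-256", 256)]:
    print(f"{name}: ~{2 ** (bits // 2):.3e} attempts")
# MD5: ~1.845e+19 (18 quintillion)
# SHA-256: ~3.403e+38 (340 undecillion)
```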
As vy32 said, it is computationally infeasible to compute, store and compare so many hashes. So, in order to find a collision, you need a method that is many orders of magnitude faster than random trial and error. If such a method exists for a secure hash, the hash is considered broken, at least with regard to general collision resistance.
So, saying "someone found a collision in this xxx-bit hash" is in fact synonymous with saying "a practical method of finding collisions was found for this hash, making it insecure". The alternative is a cosmically unlikely event, and would be reported in another way.