How to reconstruct hash value to the original format? - cryptographic-hash-function

I would like to know how I can reconstruct a hash value such as 558f68181d2b0c9d57d41ce7aa36b71d9 to its original format (734).
I have used a code in matlab, which provided me with an hash output, but I tried to revers the operation to obtain the original value but no use. I tired converting from hex to binary but no use.
Are there any built in functions that can help me obtaining the original value?
i have used this code :
http://uk.mathworks.com/matlabcentral/fileexchange/31272-datahash

In general this is impossible. The whole idea of cryptographical hashes (like SHA-1 used above) is to be as unpredictable as possible. The hash of certain data should always be the same (of course) but it should be really hard to predict which data that resulted in a certain hash.
If you have a limited amount of values, you could probably create a lookup-table (hash -> data that made it) but this is actually the exact opposite of how they are supposed to be used.
I think you want to create your own hashing for this problem, where you could inline the data you hash in some particular way.

Related

Hash Table that tries to hash Strings uniformly?

I am currently in a Data Structures course nearing the end of the semester, and have been assigned a project in which we are implementing a Linked Hash Table to store and retrieve keys. We have been given a pretty large amount of freedom with how we are going to design our hash table implementation, but for bonus points we were told to try and find a hash function that distributes our keys (unique strings) close to uniformly and randomly throughout the table.
I have chosen to use the ELF hash, seen here http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
My question is as follows: With this hash function an integer is returned, but I am having trouble seeing how this can be used to help specify a specific index to put my key in in the hash table. I could simply do: index = ELFhash(String key) % tableSize, but does this defeat the purpose of using the ELF hash in the first place??
Also I have chosen my collision resolution strategy to be double hashing. Is there a good way to determine an appropriate secondary hashing function to find your jumps? My hash table is not going to be a constant size (sets of strings will be added and removed from the set of data I am hashing, and I will be rehashing them after each iteration of adding and removing to have a load factor of .75), so it is hard for me to just do something like k % n where n is a number that is relatively prime with my table size.
Thanks for taking the time to read my question, and let me know what you think!
You're correct to think about "wrapping bias," but for most practical purposes, it's not going to be a problem.
If the hash table is of size N and the hash value is in the range [0..M), then let k = floor(M/N). Any hash value in the range [0..k*N) is a "good" one in that, using mod N as a map, each hash bucket is mapped by exactly k hash values. The hash values in [k*N..M) are "bad" in that if you use them, the corresponding M-K*n lowest hash buckets map from one additional hash value. Even if the hash function is perfect, these buckets have a higher probability of receiving a given value.
The question, though, is "How much higher?" That depends on M and N. If the hash value is an unsigned int in [0..2^32), and - having read Knuth and others - you decide to pick prime number of buckets around a thousand, say 1009, what happens?
floor(2^32 / 1009) = 4256657
The number of "bad" values is
2^32 - 4256657 * 1009 = 383
Consequently, all buckets are mapped from 4256657 "good" values, and 383 get one additional unwanted "bad" value for 4256658. Thus the "bias" for is 1/4,256,657.
It's very unlikely you'll find a hash function where a 1 in 4 million probability difference between buckets will be noticeable.
Now if you redo the calculation with a million buckets instead of a thousand, then things look a bit different. In that case if you're a bit OC, you might want to switch to a 64-bit hash.
On additional thing: The Elf hash is pretty unlikely to give absolutely terrible results, and it's quite fast, but there are much better hash functions. A reasonably well-regarded one you might want give a try is Murmur 32. (The Wiki article mentions that the original alg has some weaknesses that can be exploited for DoS attacks, but for your application it will be fine.) I'm sure your prof doesn't want you to copy code, but the Wikipedia page has it complete. It would be interesting to implement Elf yourself and try it against Murmur to see how they compare.

What's the difference between a secure compare and a simple ==(=)

Github's securing webhooks page says:
Using a plain == operator is not advised. A method like secure_compare performs a “constant time” string comparison, which renders it safe from certain timing attacks against regular equality operators.
I use bcrypt.compare('string', 'computed hash') when comparing passwords.
What makes this a "secure compare" and can I do this using the standard crypto library in Node?
The point of a "constant time" string comparison is that the comparison will take the exact same amount of time no matter what the comparison target is (the unknown value). This "constant time" reveals no information to an attacker about what the unknown target value might be. The usual solution is that all characters are compared, even after a mismatch is found so no matter where a mismatch is found, the comparison runs in the same amount of time.
Other forms of comparison might return an answer in a shorter time when certain conditions are true which allows an attacker to learn what they might be missing. For example, in a typical string comparison, the comparison will return false as soon as an unequal character is found. If the first character does not match, then the comparison will return in a shorter amount of time than if it does. A diligent attacker can use this information to make a smarter brute force attack.
A "constant time" comparison eliminates this extra information because no matter how the two strings are unequal, the function will return its value in the same amount of time.
In looking at the nodejs v4 crypto library, I don't see any signs of a function to do constant time comparison and per this post, there is a discussion about the fact that the nodejs crypto library is missing this functionality.
EDIT: Node v6 now has crypto.timingSafeEqual(a, b).
There is also such a constant time comparison function available in this buffer-equal-constant-time module.
jfriend's answer is correct in general, but in terms of this specific context (comparing the output of a bcrypt operation with what is stored in the database), there is no risk with using "==".
Remember, bcrypt is designed to be a one-way function that is specifically built to resist password guessing attacks when the attacker gets hold of the database. If we assume that the attacker has the database, then the attacker does not need timing leak information to know which byte of his guess for the password is wrong: he can check that himself by simply looking at the database. If we assume the attacker does not have the database, then timing leak information could potentially tell us which byte was wrong in his guess in a scenario that is ideal for the attacker (not realistic at all). Even if he could get that information, the one-way property of bcrypt prevents him from exploiting the knowledge gain.
Summary: preventing timing attacks is a good idea in general, but in this specific context, you're not putting yourself in any danger by using "==".
EDIT: The bcrypt.compare( ) function already is programmed to resist timing attacks even though there is absolutely no security risk in not doing this.
Imagine a long block of material to compare. If the first block does not match and the compare function returns right then, you have leaked data to the attacker. He can work on the first block of data until the routine takes longer to return, at which time he will know that the first chunk matched.
2 ways to compare data that are more secure from timing attacks are to hash both sets of data and to compare the hashes, or to XOR all the data and compare the result to 0. If == just scans both blocks of data and returns if and when it finds a discrepancy, it can inadvertently play "warmer / colder" and guide the adversary right in on the secret text he wants to match.

how to get original value from hash value in node.js

I have created hash of some fields and storing in database using 'crypto' npm.
var crypto = require('crypto');
var hashFirtName = crypto.createHash('md5').update(orgFirtName).digest("hex"),
QUESTION: How can I get the original value from the hash value when needed?
The basic definition of a "hash" is that it's one-way. You cannot get the originating value from the hash. Mostly because a single value will always produce the same hash, but a hash isn't always related to a single value, since most hash functions return a string of finite/fixed length.
Additional Information
I wanted to provide some additional information, as I felt I may have left this too short.
As #xShirase pointed out in his answer, you can use a table to reverse a Hash. These are known as Rainbow Tables. You can generate them or download them from the internet, usually from nefarious sources [ahem].
To expand on my other statement about a hash value possibly relating to multiple original values, lets take a look at MD5.
MD5 is a 128-bit hash. This means it can hold 2^128 bits, or (unsigned) 0 through 340,282,366,920,938,463,463,374,607,431,768,211,455. That's a REALLY big number. So, for any given input you have a 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 chance that it will collide with the same hash result of another input value.
Now, for simple data like passwords, the chances are astronomical. And for those purposes, who cares? Most of the time you are simply taking an input, hashing it, then comparing the hashes. For reasons I will not get into, when using hashes for passwords you should ALWAYS store the data already hashed. You don't want to leave plain-text passwords just lying about. Keep in mind that a hash is NOT the same as encryption.
Hashes can also be used for other reasons. For instance, they can be used to create a fast-lookup data structure known as a Hash Table. A Hash Table uses a hash as sort of a "primary key", allowing it to search a huge set of data in relatively few number of instructions, approaching O(1) (On-order of 1). Depending on the implementation of the Hash Table and the hashing algorithm, you have to deal with collisions, usually by means of a sorted list. This is why the Hash Table isn't "exactly" O(1), but close. If your hash algorithm is bad, the performance of your Hash Table can begin to approach O(n).
Another use for a hash it to tell if a file's contents have been altered, or match an original. You will see many OSS project provide binary downloads that also have an MD5 and/or SHA-2 hash values. This is so you can download the files, do a hash locally, and compare the results against theirs to make sure the file you are getting is the file they posted. Again, since the odds of two files matching another is 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456, the odds of a hacker successfully generating a file of the same size with a bad payload that hashes to the exact same MD5/SHA-2 hash is pretty low.
Hope this discussion can help either you or someone in the future.
If you could get the original value from the hash, it wouldn't be that secure.
If you need to compare a value to what you have previously stored as a hash, you can create a hash for this value and compare the hashes.
In practice there is only one way to 'decrypt' a hash. It involves using a massive database of decrypted hashes, and compare them to yours. An example here

How does a reduction function used with rainbow tables work?

I've carefully read about rainbow tables and can't get one thing. In order to build a hash chain a reduction function is used. It's a function that somehow maps hashes onto passwords. This article says that reduction function isn't an inverse of hash, it's just some mapping.
I don't get it - what's the use of a mapping that isn't even an inverse of the hash function? How should such mapping practically work and aid in deducing a password?
A rainbow table is "just" a smart compression method for a big table of precomputed hashes. The idea is that the table can "invert" a hash output if and only if a corresponding input was considered during the table construction.
Each table line ("chain") is a sequence of hash function invocations. The trick is that each input is computed deterministically from the previous output in the chain, so that:
by storing the starting and ending points of the chain, you "morally" store the complete chain, which you can rebuild at will (this is where a rainbow table can be viewed as a compression method);
you can start the chain rebuilding from a hash function output.
The reduction function is the glue which turns a hash function output into an appropriate input (for instance a character string which looks like a genuine password, consisting only of printable characters). Its role is mostly to be able to generate possible hash inputs with more or less uniform probability, given random data to work with (and the hash output will be acceptably random). The reduction function needs not have any specific structure, in particular with regards to how the hash function itself works; the reduction function must just allow keeping on building the chain without creating too many spurious collisions.
The reason the reduction function isn't the inverse of a hash is that the true inverse of a hash would not be a function (remember, the actual definition of "function" requires one output for one input).
Hash functions produce strings which are shorter than their corresponding inputs. By the pigeonhole principle, this means that two inputs can have the same output. If arbitrarily long strings can be hashed, an infinite number of strings can have the same output, in fact. Yet a rainbow table generally only keeps one output for each hash - so it can't be a true inverse.
The reduction function most rainbow tables use is "store the shortest string having this hash".
It doesn't matter if what it produces is the password: what you would get would also work as a password, and you could log in with it just as well as with the original password.

Why is it called rainbow table?

Anyone know why it is called rainbow table? Just remembered we have learned there is an attack called "dictionary attack". Why it is not call dictionary?
Because it contains the entire "spectrum" of possibilities.
A dictionary attack is a bruteforce technique of just trying possibilities. Like this (python pseudo code)
mypassworddict = dict()
for password in mypassworddict:
trypassword(password)
However, a rainbow table works differently, because it's for inverting hashes. A high level overview of a hash is that it has a number of bins:
bin1, bin2, bin3, bin4, bin5, ...
Which correspond to binary parts of the output string - that's how the string ends up the length it is. As the hash proceeds, it affects differing parts of the bins in different ways. So the first byte (or whatever input field is accepted) input affects (say, simplistically) bins 3 and 4. The next input affects 2 and 6. And so on.
A rainbow table is a computation of all the possibilities of a given bin, i.e. all the possible inverses of that bin, for every bin... that's why it ends up so large. If the first bin value is 0x1 then you need to have a lookup list of all the values of bin2 and all the values of bin3 working backwards through the hash, which eventually gives you a value.
Why isn't it called a dictionary attack? Because it isn't.
As I've seen your previous question, let me expand on the detail you're looking for there. A cryptographically secure hash needs to be safe ideally from smallish input sizes up to whole files. To precompute the values of a hash for an entire file would take forever. So a rainbow table is designed on a small well understood subset of outputs, for example the permutations of all the characters a-z over a field of say 10 characters.
This is why password advice for defeating dictionary attacks works here. The more subsets of the whole possible set of inputs you put into your input for the hash, the more a rainbow table needs to contain to search it. The data sizes required end up stupidly big and so does the time to search. So, think about it:
If you have an input that is [a-z] for 5-8 characters, that's not too bad a rainbow table.
If you increase the length to 42 characters, that's a massive rainbow table. Each input affects the hash and so the bins of said hash.
If you throw numbers in to your search requirement [a-z][0-9] you've got even more searching to do.
Likewise [A-Za-z0-9]. Finally, stick in [\w] i.e. any printable character you can think of, and again, you're looking at a massive table.
So, making passwords long and complicated makes rainbow tables start taking blue-ray sized discs of data. Then, as per your previous question, you start adding in salting and hash derived functions and you make a general solution to hash cracking hard(er).
The goal here is to stay ahead of the computational power available.
Rainbow is a variant of dictionary attack (Pre-computed dictionary attack to be exact), but it takes less space than full dictionary (at the price of time needed to find a key in table). The other end of this space-memory tradeoff is full search (brute force attack = zero precomputation, a lot of time).
In the rainbow table the precomputed dictionary of pairs key-ciphertext is compressed in chains. Every step in chain is done using different commpression function. And the table has a lot of chains, so it looks like a rainbow.
In this picture different compression functions K1, K2, K3 have a colors like in rainbow:
The table, stored in the file contains only first and last columns, as the middle columns can be recomputed.
I don't know where the name comes from, but the differences are:
A dictionary contains a few selected items (e.g. english words), while a rainbow table contains every possible combination.
A dictionary only contains the input, while the rainbow table contains both the input and the output.
A dictionary is used to test different input to see if the output is valid, while a rainbow table is used for e reverse lookup, i.e. to find which input gives a specific output.
Unfortunately some of the statements are not correct. Contrary to what is bring posted rainbow tables DO NOT contain all the possibilites for a given keyspace well not the ones generated for use that I've seen. They can be generated to cover 99.9 but due to the randomness of a hash function there in no gurantee that EVERY plaintext is covered.
Each chain is made up of links or steps and each step is made of a hashing and reduction function. If your chain was 100 links long you would go that number of hash/reduction functions then discarding everything in between except the start and end.
To find the plain for a given hash you simply perform the reduction / hash x amount of the length of your chain. So you run the step once and check against the endpoint if it's a miss you would repeat... Until you have stepped through the entire length of your chain. If there is a match you can then regenerate the chain from the start point and you may be able to find the plain. If after the regeneration it is not correct then this is a false alarm. This happens due to collisions caused by the reduction hashing function. Since the table contains many chains you can do a large lookup against all the chain endpoints each step, this is essentially where the magic happens allowing speed. This will also lead to false alarms, since you only need to regenerate chains which have matches you save lots of time by skipping unnecessary chains.
They do not contain dictionaries.... Well not the traditional tables there are variants of rainbow tables which incorporate the use of dictionaries though.
That's about it. There are many ways which this process has been optimized including removing merging / duplicate chains and creating perfect tables and also storing them in differing packing to save space and loading time.

Resources