Shorter output possible - maintaining security properties

I have a problem that I've been trying to solve for a few days now, but it seems I'm stuck!
The problem is: given a piece of data, I need to generate an output that has to be:
Impossible for others to reproduce.
Unique.
Short (since it will be used in a URL).
So I've decided that I would sign the given data with a private key and then hash the result; this way I believe the first two properties are covered.
Now, to get the final result as short as possible, I started reading and came across this, which talks about truncating a hash, and I quote:
As far as truncating a hash goes, that's fine. It's explicitly
endorsed by the NIST, and there are hash functions in the SHA-2 family
that are simple truncated variants of their full brethren:
SHA-256/224, SHA-512/224, SHA-512/256, and SHA-512/384, where SHA-x/y
denotes a full-length SHA-x truncated to y bits.
After reading a bit more about this, I decided that I would use SHA-256 and truncate it to 128 bits.
Since it has to be used in a URL, I encoded the final result in Base62 (Base64 uses characters such as +, / and =, which have a special meaning in a URL environment).
The final result was a string of 24 characters, which I thought was good, but it seems it's still not good enough.
I need to get it even shorter (around 10 characters). I've been going around this for a few days and I am starting to run out of ideas.
Does anyone have any suggestions? Is there something i can try?
Thank you very much!
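For reference, here is a minimal Python sketch of the pipeline described above; the HMAC call merely stands in for the actual private-key signature, and the secret, input data and Base62 alphabet are placeholders rather than the real code in question.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-signing-key"  # placeholder; a real setup would use a private key
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(data):
    # Encode bytes as Base62, avoiding the +, / and = characters of Base64.
    n = int.from_bytes(data, "big")
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out)) or ALPHABET[0]

def short_token(data):
    # "Sign" the data (HMAC as a stand-in for the private-key signature),
    # hash the result with SHA-256, truncate to 128 bits, Base62-encode.
    signature = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    truncated = hashlib.sha256(signature).digest()[:16]  # 128 bits
    return base62(truncated)

print(short_token(b"some piece of data"))  # on the order of 22 characters for 128 bits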

Related

How to reconstruct a hash value to its original format?

I would like to know how I can reconstruct a hash value such as 558f68181d2b0c9d57d41ce7aa36b71d9 to its original format (734).
I have used code in MATLAB, which provided me with a hash output, but I tried to reverse the operation to obtain the original value with no luck. I tried converting from hex to binary, but no use.
Are there any built-in functions that can help me obtain the original value?
I have used this code:
http://uk.mathworks.com/matlabcentral/fileexchange/31272-datahash
In general this is impossible. The whole idea of cryptographic hashes (like the SHA-1 used above) is to be as unpredictable as possible. The hash of certain data should always be the same (of course), but it should be really hard to predict which data resulted in a certain hash.
If you have a limited set of possible values, you could probably create a lookup table (hash -> the data that produced it), but this is actually the exact opposite of how such hashes are supposed to be used.
I think you want to create your own hashing for this problem, where you could embed the data you hash in some recoverable way.
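As a rough illustration of the lookup-table idea above (a sketch only, and only feasible when the set of possible original values is small and known in advance):
import hashlib

def build_reverse_table(candidates):
    # Map hash -> original value for every candidate we might encounter.
    return {hashlib.sha1(str(v).encode()).hexdigest(): v for v in candidates}

table = build_reverse_table(range(10000))   # e.g. every value from 0 to 9999
digest = hashlib.sha1(b"734").hexdigest()
print(table.get(digest))                    # 734, because it was in the candidate set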

How difficult/easy is it to decode a custom base 62 encoding?

Here's what I am going to do to obfuscate database IDs in permalinks:
1) XOR the ID with a lengthy secret key
2) Scramble (rotate, flip, reverse) the bits of the XOR'ed integer a little, in a reversible way
3) Base62-encode the resulting integer with my own secret, scrambled-up sequence of all alphanumeric characters (A-Za-z0-9)
How difficult would it be to convert my Base 62 encoding back to base 10?
Also, how difficult is it to reverse-engineer the whole process (obviously without taking a peek at the source or compiled code)? I know 'only XOR' is pretty susceptible to basic analysis.
EDIT: the result should be no more than 8-9 characters long; 3DES and AES seem to produce very long ciphertexts and can't practically be used in URLs.
Resulting strings look something like:
In [2]: for i in range(1, 11):
   ...:     print code(i)
   ...:
9fYgiSHq
MdKx0tZu
vjd0Dipm
6dDakK9x
Ph7DYBzp
sfRUFaRt
jkmg0hl
dBbX9nHk4
ifqBZwLW
WdaQE630
As you can see, 1 looks nothing like 2, so this seems to work great for obfuscating IDs.
If the attacker is allowed to play around with the input, it will be trivial for a skilled attacker to "decrypt" the data. A crucial property of modern cryptosystems is the avalanche effect, which your scheme lacks. Basically it means that every bit of the output depends on every bit of the input.
If an attacker of your system is allowed to see that, for example, id = 1000 produces the output "AAAAAA", id = 1001 produces "ABAAAA" and id = 1002 produces "ACAAAA", the algorithm can easily be reversed and the value of the key obtained.
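To make that concrete, here is a hedged toy example in Python; the key and alphabet are made up, and the Base62 step is plain positional encoding with the extra bit-scrambling step left out:
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
SECRET = 0x5F3C9A172B44D6E8  # made-up key, for illustration only

def base62(n):
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out)) or "0"

def weak_code(i):
    return base62(i ^ SECRET)

for i in (1000, 1001, 1002):
    print(i, weak_code(i))
# Only the last character or two change between consecutive IDs: there is no
# avalanche effect, so an attacker who can request chosen IDs learns the structure quickly.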
That said, this question is a better fit for https://security.stackexchange.com/ or https://crypto.stackexchange.com/
The standard advice for anyone trying to develop their own cryptography is, "Don't". The advanced advice is to read Bruce Schneier's Memo to the Amateur Cipher Designer and then don't.
You are not the first person to need to obfuscate IDs, so there are already methods available. @CodesInChaos suggested a good method above; you should try that first to see if it meets your needs.

Comparing long strings by their hashes

Trying to improve the performance of a function that compares strings, I decided to compare them by comparing their hashes.
So is there a guarantee that if the hashes of two very long strings are equal to each other, then the strings are also equal to each other?
While it's guaranteed that two identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce the same hash.
This follows from the pigeonhole principle.
That being said, the chances of two different strings producing the same hash can be made infinitesimally small, to the point of being considered negligible.
A fairly classical example of such a hash is MD5, which has a near-perfect 128-bit distribution. This means that two different strings have about one chance in 2^128 of producing the same hash, which is, for all practical purposes, impossible.
In the simple common case where two long strings are to be compared to determine whether they are identical, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings be traversed in order to calculate the two hash values, whereas the simple compare only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by @AdamLiss and @Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by @Cyan, if the compare is to be done more than once, or the result must be stored for later use, then the hash may be faster. A case not mentioned by others is when the strings are on different machines connected via a local network or the Internet: passing a small amount of data between the two machines will generally be much faster. The simplest first check is to compare the sizes of the two; if they differ, you're done. Otherwise, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine), and again, if they differ, you are done.
If the hash values are the same and you must have absolute certainty, there is no easy shortcut to that certainty. Using lossless compression on both ends will allow less data to be transferred for comparison. And finally, as alluded to by @Cyan, if the two strings are separated by time (for example, you want to know whether a file has changed since yesterday and you have stored the hash of yesterday's version), then you can compare today's hash to it.
I hope this will help stimulate some "out of the box" ideas for someone.
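A minimal local sketch of that strategy in Python (on a single machine; over a network you would exchange only the length and the digest rather than the strings themselves):
import hashlib

def probably_equal(a, b, require_certainty=False):
    # Cheap check first: different lengths means different strings.
    if len(a) != len(b):
        return False
    # Hash check: a mismatch proves the strings differ.
    if hashlib.sha256(a).digest() != hashlib.sha256(b).digest():
        return False
    # Matching hashes make equality overwhelmingly likely; fall back to a
    # byte-for-byte compare only if an absolute guarantee is required.
    return a == b if require_certainty else True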
I am not sure your performance will improve. Building a hash and then comparing integers, and simply comparing the strings with equals, both have the same complexity, which lies in O(n), where n is the number of characters.

How safely can I assume unicity of a part of SHA1 hash?

I'm currently using SHA-1 to somewhat shorten a URL:
Digest::SHA1.hexdigest("salt-" + url)
How safe is it to use only the first 8 characters of the SHA-1 as a unique identifier, like GitHub apparently does for commits?
To calculate the probability of a collision for a given length and the number of hashes that you have, see the birthday problem. I don't know the number of hashes that you are going to have, but here are some examples: 8 hexadecimal characters is 32 bits, so for about 100 hashes the probability of a collision is about 1/1,000,000; for 10,000 hashes it's about 1/100; for 100,000 it's about 3/4, etc.
See the table in the Birthday attack article on Wikipedia to find a good hash length that would satisfy your needs. For example, if you want the collision to be less likely than 1/1,000,000,000 for a set of more than 100,000 hashes, then use 64 bits, or 16 hexadecimal digits.
It all depends on how many hashes you are going to have and what probability of a collision you are willing to accept (because there is always some probability, even if an insanely small one).
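A small Python sketch of that calculation, using the standard birthday-problem approximation:
import math

def collision_probability(num_hashes, bits):
    # Birthday approximation: p ~ 1 - exp(-n*(n-1) / 2^(bits+1))
    n = num_hashes
    return 1.0 - math.exp(-n * (n - 1) / 2.0 ** (bits + 1))

for n in (100, 10000, 100000):
    print(n, collision_probability(n, 32))        # 32 bits = 8 hex characters
print(100000, collision_probability(100000, 64))  # 64 bits = 16 hex characters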
If you're talking about a SHA-1 in hexadecimal, then you're only getting 4 bits per character, for a total of 32 bits. By the birthday bound, you can expect collisions once you have on the order of the square root of that maximum value, i.e. around 65,536 entries. If your URL shortener gets used much, it probably won't take terribly long before you start to see collisions.
As for alternatives, the most obvious is probably to just maintain a counter. Since you need to store a table of URLs to translate your shortened URL back to the original anyway, you basically just store each new URL in your table. If it is already present, you give it its existing number; otherwise, you insert it and give it a new number. Either way, you give that number to the user.
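A minimal sketch of that counter approach; the in-memory dicts stand in for whatever table or database you actually use:
class UrlShortener:
    def __init__(self):
        self.by_url = {}   # original URL -> number
        self.by_id = {}    # number -> original URL
        self.counter = 0

    def shorten(self, url):
        # Reuse the existing number if the URL was seen before, otherwise assign a new one.
        if url in self.by_url:
            return self.by_url[url]
        self.counter += 1
        self.by_url[url] = self.counter
        self.by_id[self.counter] = url
        return self.counter

    def expand(self, short_id):
        return self.by_id[short_id]
Because the table is the source of truth, there are no collisions by construction; the number can then be Base62-encoded if you want short identifiers that don't look like plain integers.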
It depends on what you are trying to accomplish. The output of SHA1 is effectively random with regards to the input (the output of a good hash function changes in half of its bits based on a one-bit change in the input, and SHA1, while not perfect, is pretty good), and by taking a 32-bit (assuming 8 hex digits) subset of the 160-bit output, you reduce the output space from 2^160 to 2^32 values. All things being equal, which they never are, this would significantly reduce the difficulty of finding a collision.
However, if the hash function's input must be a valid URL, that significantly reduces the number of possible inputs. @rsp points out the birthday problem, but given this, I'm not sure exactly how applicable it is, at least in its simple form. Also, it largely assumes that there are no other precautions in place.
I would be more interested in why you are doing this. Is this about URLs that the user will need to remember and type? If so, tacking on a bunch of random hexadecimal digits is probably a bad idea. Is it a URL or URL parameter that will just be passed around programmatically? Then, I wouldn't care much about length. Either way, there are probably better ways to do what you are trying to accomplish.
If you use a binary output for SHA1 and Base64 encode the result, you will get much higher information density per character; you can have the same 8-character names, but rather than only 16^8 (2^32) possibilities, you'll have 64^8 (2^48) possibilities.
Using the approximation that the number of inputs needed for a 50% probability of collision scales as 1.177*sqrt(N), a Base64-style encoding will require 256 times more inputs than the hex output before reaching a 50% chance of collision.
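To make the density difference concrete, here is a hedged sketch comparing the two encodings of the same digest (urlsafe_b64encode keeps the result URL-friendly; the salt and URL are placeholders):
import base64
import hashlib

digest = hashlib.sha1(b"salt-" + b"http://example.com/page").digest()

hex_id = digest.hex()[:8]                               # 8 hex characters = 32 bits
b64_id = base64.urlsafe_b64encode(digest).decode()[:8]  # 8 Base64 characters = 48 bits
print(hex_id, b64_id)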

Comparing string distance based on precomputed hashes

I have a large list (over 200,000) of strings that I'd like to compare to a given string.
The given string is inserted by a user, so it may be slightly incorrect.
What I was hoping to do was to precompute some kind of hash for each string as it is added to the list. This hash would contain information such as the string's length, the sum of all its characters, etc.
My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?
Or maybe there's a third option I haven't thought of yet?
Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.
Another thought - if you estimate that the probability of an incorrect entry is low, then you might actually be fine having a direct hit 99.9% of the time, falling back to SOUNDEX which might catch 90% of the remaining cases and then searching the whole list for the remaining 0.01% of the time.
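One hedged sketch of such a prefilter: store each string's length and character histogram up front, and run Levenshtein only on candidates that pass a cheap lower-bound check (the threshold and sample data here are made up):
from collections import Counter

def signature(s):
    # Precomputed per string: length plus a character histogram.
    return len(s), Counter(s)

def could_be_close(sig_a, sig_b, max_distance=2):
    len_a, hist_a = sig_a
    len_b, hist_b = sig_b
    # Length difference is a lower bound on edit distance.
    if abs(len_a - len_b) > max_distance:
        return False
    # A single edit changes the histogram difference by at most 2.
    diff = sum((hist_a - hist_b).values()) + sum((hist_b - hist_a).values())
    return diff <= 2 * max_distance

query = signature("recieve")
candidates = ["receive", "recipe", "believe", "spaghetti"]
survivors = [c for c in candidates if could_be_close(query, signature(c))]
print(survivors)  # "spaghetti" is ruled out before any Levenshtein computation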
Also worth checking this discussion:
How to find best fuzzy match for a string in a large string database
