Security: longer keys versus more available characters

I apologize if this has been answered before, but I was not able to find anything. This question was inspired by a comment on another security-related question here on SO:
How to generate a random, long salt for use in hashing?
The specific comment is as follows (sixth comment of accepted answer):
...Second, and more importantly, this will only return hexadecimal
characters - i.e. 0-9 and A-F. It will never return a letter higher
than an F. You're reducing your output to just 16 possible characters
when there could be - and almost certainly are - many other valid
characters.
– AgentConundrum Oct 14 '12 at 17:19
This got me thinking. Say I had some arbitrary series of bytes, with each byte being randomly distributed over 2^(8). Let this key be A. Now suppose I transformed A into its hexadecimal string representation, key B (ex. 0xde 0xad 0xbe 0xef => "d e a d b e e f").
Some things are readily apparent:
len(B) = 2 len(A)
The symbols in B are limited to 2^(4) discrete values while the symbols in A range over 2^(8)
A and B represent the same 'quantities', just using different encoding.
My suspicion is that, in this example, the two keys will end up being equally secure (otherwise every password-cracking tool would just convert one representation to the other for quicker attacks). Outside this contrived example, however, I suspect there is an important security moral to take away, especially when selecting a source of randomness.
So, in short, which is more desirable from a security standpoint: longer keys or keys whose values cover more discrete symbols?
I am really interested in the theory behind this, so an extra bonus gold star (or at least my undying admiration) to anyone who can also provide the math / proof behind their conclusion.

If the number of different symbols usable in your password is x, and the length is y, then the number of different possible passwords (and therefore the strength against brute-force attacks) is x ** y. So you want to maximize x ** y. Adding to either x or y will do that; which one gives the greater total depends on the actual numbers involved and on your practical limits.
But generally, increasing x gives only polynomial growth while adding to y gives exponential growth. So in the long run, length wins.
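To make the comparison concrete, here is a small sketch that bumps each parameter by one and compares the resulting keyspaces. The helper name `keyspace_bits` and the sample values (26 symbols, length 8) are my own assumptions for illustration, not part of the answer above.

```python
import math

def keyspace_bits(symbols: int, length: int) -> float:
    """Bits of brute-force strength: log2(symbols ** length)."""
    return length * math.log2(symbols)

base = 26 ** 8          # lowercase letters, length 8
more_symbols = 27 ** 8  # one extra symbol in the alphabet
more_length = 26 ** 9   # one extra character of length

print(more_symbols / base)  # growth factor (27/26)**8, about 1.35
print(more_length / base)   # growth factor 26
print(keyspace_bits(26, 8), keyspace_bits(26, 9))
```

For these numbers, one extra character of length multiplies the keyspace by 26, while one extra symbol multiplies it by roughly 1.35.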

Let's start with a binary string of length 8. The possible values are all bit patterns from 00000000 to 11111111. This gives us a keyspace of 2^8, or 256 possible keys. Now let's look at option A:
A: Adding one additional bit.
We now have a 9-bit string, so the possible values are between 000000000 and 111111111, which gives us a keyspace size of 2^9, or 512 keys. We also have option B, however.
B: Adding one additional symbol to the set of values each digit can take (NOT to the keyspace size!):
Now let's pretend we have a trinary system, where the accepted numbers are 0, 1, and 2. Still assuming a string of length 8, we have 3^8, or 6561 keys...clearly much higher.
However! Keys are not stored in trinary in practice!
Let's look at your example. Please be aware I will be clarifying some of it, which you may have been confused about. Begin with a 4-BYTE (or 32-bit) bitstring:
11011110 10101101 10111110 11101111 (this is, btw, the bitstring equivalent to 0xDEADBEEF)
Since our possible values for each digit are 0 or 1, the base of our exponent is 2. Since there are 32 bits, we have 2^32 as the strength of this key. Now let's look at your second key, DEADBEEF. Each "digit" can be a value from 0-9, or A-F. This gives us 16 values. We have 8 "digits", so the keyspace is 16^8...which also equals 2^32! So those keys are equal in strength (unsurprisingly, because they are the same value in two different encodings).
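The equivalence is easy to check directly. This is just a sanity-check sketch; the variable names are mine.

```python
raw = bytes([0xDE, 0xAD, 0xBE, 0xEF])   # key A: 4 bytes
hex_form = raw.hex()                    # key B: 'deadbeef', 8 hex digits

# Same number of possible keys either way.
assert 256 ** len(raw) == 16 ** len(hex_form) == 2 ** 32

# The encoding is reversible, so an attacker loses nothing by converting.
assert bytes.fromhex(hex_form) == raw
```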
But we're talking about REAL passwords, not just those silly little binary things. Consider an alphabetical password with only lowercase letters of length 8: we have 26 possible characters, and 8 of them, so the strength is 26^8, or 208.8 billion (takes about a minute to brute force). Adding one character to the length yields 26^9, or 5.4 trillion combinations: 20 minutes or so.
Let's go back to our 8-char string, but add one symbol to the alphabet: the space character. Now we have 27^8, which is 282 billion....FAR LESS than we get by adding one character to the length!
The proper solution, of course, is to do both: for instance, 27^9 is 7.6 trillion combinations, or about half an hour of cracking. An 8-character password using upper case, lower case, numbers, special symbols, and the space character would take around 20 days to crack....still not nearly strong enough. Add another character, and it's 5 years.
As a reference, I usually make my passwords upwards of 16 characters, and they have at least one Cap, one space, one number, and one special character. Such a password at 16 characters would take several (hundred) trillion years to brute force.
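The cracking times quoted in this answer can be roughly reproduced with a few lines. The guess rate below is an assumption on my part: it is chosen so that 26^8 guesses take one minute, matching the first figure. I also assume that "upper case, lower case, numbers, special symbols, and the space character" means the 95 printable ASCII characters.

```python
# Assumed guess rate: chosen so that 26**8 guesses take one minute,
# matching the "about a minute" figure quoted above.
RATE = 26 ** 8 / 60  # guesses per second (roughly 3.5e9)

def seconds_to_exhaust(symbols: int, length: int, rate: float = RATE) -> float:
    """Worst-case time to try every password of the given shape."""
    return symbols ** length / rate

for symbols, length in [(26, 8), (26, 9), (27, 8), (27, 9), (95, 8), (95, 9)]:
    secs = seconds_to_exhaust(symbols, length)
    print(f"{symbols}^{length}: {secs / 60:.0f} min = {secs / 86400:.2f} days")
```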

Related

How to uniquely identify a set of strings using an integer

Here my problem statement:
I have a set of strings that match a regular expression. let's say it matches [A-Z][0-9]{3} (i.e. 1 letter and 3 digits).
I can have any number of strings between 1 and 30. For example I could have:
{A123}
{A123, B456}
{Z789, D752, E147, ..., Q665}
...
I need to generate an integer (actually I can use 256 bits) that would be unique for any set of strings regardless of the number of elements (although the number of elements could be used to generate the integer)
What sort of algorithm could I use?
My first idea would be to convert my strings to numbers and then do operations (I thought of hash functions) on them, but I am not sure what formula would give me good results.
Any suggestion?

You have 2^333 possible input sets ((26 * 10^3) choose 30).
This means you would need a 333 bit wide integer to represent all possibilities. You only have a maximum of 256 bits, so there will be collisions.
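The count can be checked with exact integer arithmetic. A minimal sketch, assuming the sets contain distinct strings and the order of elements does not matter:

```python
import math

ITEMS = 26 * 10 ** 3   # strings matching [A-Z][0-9]{3}

# Sets of exactly 30 distinct strings, and sets of any size from 1 to 30.
exactly_30 = math.comb(ITEMS, 30)
up_to_30 = sum(math.comb(ITEMS, k) for k in range(1, 31))

print(math.log2(exactly_30))   # a little over 332
print(up_to_30.bit_length())   # bits needed to number every possible set
```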
This is a typical application for a hash function. There are hashes for various purposes, so it's important to select the right type:
A simple hash function for use in bucket based data structures (dictionaries) must be fast. Collisions are not only tolerated but wanted. The hash's size (in bits) usually is small. Due to collisions this type of hash is not suited for your purpose.
A checksum tries to avoid collisions and is reasonably fast. If it's large enough this might be enough for your case.
Cryptographic hashes have the characteristic that it's not possible (or very hard) to find a collision (even when both input and hash are known). Also they are not invertible (from the hash it's not possible to find the input). These are usually computationally expensive and overkill for your use case.
Hashes to uniquely identify arbitrary inputs, like CityHash and SpookyHash are designed for fast hashing and collision free identification.
SpookyHash seems like a good candidate for your use case. It's 128 bits wide, which means that you need 2^64 differing inputs to get a 50% chance of a single collision.
It's also fast: three bytes per cycle is orders of magnitude faster than md5 or sha1. SpookyHash is available in the public domain (see link above).
To apply any hash to your use case you could convert the items in your list to numbers, but it seems easier to just feed them in as strings. You have to settle on an encoding in this case (ASCII would do).
I usually use UTF-8 when I18N is an issue; then canonicalization can matter as well. But this does not apply to your simple use case.
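As a rough sketch of feeding the set to a hash: SpookyHash is not in the Python standard library, so this example swaps in BLAKE2b from `hashlib` as a stand-in with a 256-bit digest. The function name `set_id` and the comma separator are my own choices. Sorting first makes the result independent of element order; collisions remain possible in principle.

```python
import hashlib

def set_id(strings) -> int:
    """Order-independent 256-bit identifier for a set of strings."""
    # Sort and de-duplicate so the same set always yields the same bytes.
    canonical = ",".join(sorted(set(strings))).encode("ascii")
    digest = hashlib.blake2b(canonical, digest_size=32).digest()
    return int.from_bytes(digest, "big")

print(hex(set_id({"A123", "B456"})))
```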

A hash is not going to work, since it could produce collisions. Every significant input bit must be mapped to an output bit.
For the letter, you have 90 - 65 + 1 = 26 different values ('A' is 65 and 'Z' is 90 in ASCII), so you can use 5 bits to represent the letter.
The 3-digit number has 1000 different values, so you need 10 bits for this.
If you combine these bits, you have a unique mapping from the input to a 15-bit number.
This approach is simple, but it could waste some bits. If the output must be as short as possible, you could map as follows:
output = (L - 'A')*1000 + N
where L is the letter value, 'A' is the value of the letter A, N is the 3-digit number. Then you can use as few bits as are necessary to represent the complete range of output, which is 0 to 26*1000 - 1 = 25999. Here it is 15 bits again, so the simple approach does not waste space.
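The mapping can be sketched as a pair of functions. This assumes the input has already been validated against [A-Z][0-9]{3}; counting A through Z inclusive gives 26 letters, so the largest output is 25999. The names `encode` and `decode` are mine.

```python
def encode(item: str) -> int:
    """Map a string like 'A123' to an integer in 0..25999."""
    letter, number = item[0], int(item[1:])
    return (ord(letter) - ord("A")) * 1000 + number

def decode(value: int) -> str:
    """Inverse of encode()."""
    letter_index, number = divmod(value, 1000)
    return f"{chr(ord('A') + letter_index)}{number:03d}"
```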
If there are fewer output bits than input bits, a hash function is needed. I would strongly recommend to map the strings to binary data like above, and use a simple function to map the input to the output, for this reason:
A general-purpose hash function can not differentiate the input bits, because it knows nothing about their meaning.
For 256 output bits, after hashing 5.7e38 values, the chance of a collision is 75%. Source: Birthday Attack.
5.7e38 seems huge, but it corresponds to only 129 bits (2^129 = 6.8e38). In this case it means that there is a chance of over 75% that there is a pair of strings with 9 (129/15 = 8.6) elements that collide.
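The 75% figure follows from the usual birthday approximation p = 1 - exp(-n^2 / (2 * 2^bits)). A quick check, using floating point and therefore only approximate:

```python
import math

def collision_probability(n_values: float, output_bits: int) -> float:
    """Birthday approximation: p = 1 - exp(-n^2 / (2 * 2^bits))."""
    return -math.expm1(-(n_values ** 2) / (2.0 * 2.0 ** output_bits))

print(collision_probability(5.7e38, 256))
```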
On the other hand, if you use a very simple mapping function like:
truncate the input to 256 bits (use the first 17 elements of 15 bits each)
make a 256 bit xor value of all the 15-bit elements
you can guarantee there is no collision between any two strings with at most 17 elements.
The hash functions which are optimized for generating unique IDs likely perform better than a general-purpose hash as compared here, but I would doubt that they can guarantee collision-free hashing of all 256-bit values.
Conclusion: If most of the input strings have at most 17 elements, I would prefer this to a hash.
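One way to realise the collision-free mapping for small sets is sketched below. Note that it is my own variant: it concatenates the 15-bit codes with shift-and-or rather than the "xor" step mentioned above, sorts the elements so that order does not matter, and stores code + 1 in each slot so that sets of different sizes cannot be confused. All names are hypothetical.

```python
def pack_set(strings) -> int:
    """Pack up to 17 distinct 'A123'-style strings into one integer < 2**255."""
    items = sorted(set(strings))
    if not 1 <= len(items) <= 17:
        raise ValueError("this sketch only handles 1 to 17 elements")
    packed = 0
    for item in items:
        code = (ord(item[0]) - ord("A")) * 1000 + int(item[1:])
        # Store code + 1 so that an all-zero slot always means "no element".
        packed = (packed << 15) | (code + 1)
    return packed

def unpack_set(packed: int) -> set:
    """Inverse of pack_set()."""
    items = set()
    while packed:
        code = (packed & 0x7FFF) - 1
        items.add(f"{chr(ord('A') + code // 1000)}{code % 1000:03d}")
        packed >>= 15
    return items
```

Because the mapping is invertible, two different sets of at most 17 elements can never produce the same integer.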

How do password restrictions help security?

On some sites there are certain restrictions on what characters should be used in passwords. For example, it must contain at least 1 digit, 1 alphabet symbol, etc. Does it really make a password harder to guess? It seems that brute-forcing such a password is easier than brute-forcing an arbitrary one. I've looked for similar questions, but those address password length restrictions, which seem reasonable to me (minimum length, of course).

By making passwords meet a larger set of conditions, some feel that they increase the security of their systems. I would argue against that. Let's take a minor example:
Password of 4 characters where 1 must be capitalized (i.e. a letter), 1 must be a number, and all entries are a letter or number. Then you have:
26 letters
10 numbers
62 letters/numbers
62 letters/numbers
That gives
26*10*62*62 combinations (for one ordering)
However, if we simply limit to all letters/numbers only then we get
62*62*62*62 combinations
It's obvious which is larger.
Now, remove the limitation of letters/numbers and allow every UTF-8 character (including space, ofc!) and that gets much larger.
By requiring certain characteristics of a password other than minimum length, the total number of combinations is reduced and that implies the overall security is reduced.
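The reduction can be counted exactly rather than for a single ordering. The sketch below assumes "capitalized" means at least one uppercase letter A-Z and uses inclusion-exclusion over all positions; the constant names are mine.

```python
UPPER, DIGITS, ALNUM = 26, 10, 62
LENGTH = 4

unrestricted = ALNUM ** LENGTH

# Inclusion-exclusion: remove passwords with no uppercase letter, remove
# those with no digit, then add back those missing both (removed twice).
restricted = (
    ALNUM ** LENGTH
    - (ALNUM - UPPER) ** LENGTH
    - (ALNUM - DIGITS) ** LENGTH
    + (ALNUM - UPPER - DIGITS) ** LENGTH
)

print(unrestricted, restricted, restricted / unrestricted)
```

Under these assumptions the rule leaves well under half of the 4-character alphanumeric passwords available.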
EDIT: It helps and does not hurt to have a list of passwords which are disallowed, for example cuss words, common pet names, etc., as those make passwords easier to guess and so decrease security.

In math, it's called Permutation.
http://betterexplained.com/articles/easy-permutations-and-combinations/
For easy examples:
With only 5 digits, there are 10*10*10*10*10 possibilities.
ddddd: 10*10*10*10*10
With only 5 alphanumeric characters (upper case, lower case, digits), there are (26+26+10)^5 possibilities.
xxxxx: (26+26+10)^5
More possibilities mean more time is needed to crack your password.

Encoding name strings into an unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of the name according to its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would get -> 455161, but again here I cannot make out if the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the encoding should be such that the output numeral for any name has a fixed number of digits, i.e., it should be independent of the length of the name. Is this possible?
Thanks
Abhishek S

To get the same width numbers, can't you just zero-pad on the left?
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base 26 (case insensitive, no digits) or 52 (case sensitive, no digits) or 36 (case insensitive with digits) or 62 (case sensitive with digits) number. Compute the value in an int. E.g., for a name of "abc", you'd have 0 * 26^2 + 1 * 26^1 + 2 * 26^0. Sometimes Chinese names may use digits to indicate tonality.
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Goedel numbering :). So "abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes. Factoring the number gives you back the characters. The numbers could get quite large though.
5. Convert to ASCII, if you aren't already using it. Then treat each ordinal of a character as a digit in a base-256 numbering system. So "abc" is 97*256^2 + 98*256^1 + 99*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.
I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
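A minimal sketch of the base-26 idea follows. One caveat it works around: if 'a' is digit 0, then "a", "aa" and "aaa" all encode to 0, so this version uses digits 1 to 26 (bijective numeration), which is my own adjustment. It assumes lowercase a-z input; the function names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

def name_to_int(name: str) -> int:
    """Encode a lowercase a-z name as an integer (a=1 ... z=26)."""
    value = 0
    for ch in name:
        value = value * BASE + ALPHABET.index(ch) + 1
    return value

def int_to_name(value: int) -> str:
    """Inverse of name_to_int()."""
    chars = []
    while value > 0:
        value, digit = divmod(value - 1, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))
```

A fixed-width result then only needs zero-padding to the width of the longest name you expect.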

What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out sha1 for example, that one is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique ids for sure, you will need to do something like NealB suggested, create IDs yourself and connect names and IDs in a Database (you could create them randomly and check for collisions or increment them, starting at 0000000000001 or so).
(improved answer after giving it some thought and reading the first comments)

You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
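The same idea can be sketched in Python with `int.from_bytes`. The 0x01 marker byte is my own addition so that leading NUL bytes survive the round trip, and UTF-8 is an assumed encoding.

```python
def string_to_int(text: str) -> int:
    """Interpret the UTF-8 bytes of text as one big-endian integer."""
    # A 0x01 marker byte is prepended so leading NUL bytes are not lost.
    return int.from_bytes(b"\x01" + text.encode("utf-8"), "big")

def int_to_string(value: int) -> str:
    """Inverse of string_to_int()."""
    raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return raw[1:].decode("utf-8")
```

The integer grows with the length of the string, so this gives uniqueness but not a fixed number of digits.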

I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    score = 0
    depth = 1
    for char in value:
        score += (ord(char)) * depth
        depth /= 256.
    return score
If you are unfamiliar with Python, here's what it does.
The score is initially set to 0 and the depth to 1
For every character, add the ord value * the depth
The ord function returns the character's code point (0-255 for single-byte characters)
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way that it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0-256. This encoding scheme works for binary data as well as there are only 256 possible values in a byte/char.
This method works great for smaller string values, however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java, you can use the 'BigDecimal' and in Python use the 'decimal' module for added precision. A bonus to using this method is that the values returned are in sorted order so they can be searched 'efficiently'.
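The precision problem can be sidestepped with exact rational arithmetic. This is a sketch using `fractions.Fraction` instead of the `decimal` module mentioned above; the function name is mine, and it works on the UTF-8 bytes so that every "digit" is in the range 0-255.

```python
from fractions import Fraction

def hash_string_exact(value: str) -> Fraction:
    """Same weighting as hash_string, but with exact rational arithmetic."""
    score = Fraction(0)
    depth = Fraction(1)
    for byte in value.encode("utf-8"):
        score += byte * depth
        depth /= 256
    return score
```

For strings without NUL bytes, comparing these values gives the same order as comparing the underlying byte strings, however long the strings are.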

Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.

You can translate it, if every character (plus blank, at least) will occupy a position.
Therefore ABC, which is 1,2,3 has to be translated to
1*(2*26+1)² + 2*(53) + 3
This way, you could encode arbitrary strings, but if the length of the input isn't limited (and how could it be?), you aren't guaranteed to have an upper limit for the length of the output.

Minimum password length for maximum entropy

Assuming a SHA 256 hash and a completely random password using the extended ASCII charset, is there a specific length after which additional characters offer no increase in entropy, and if so what is this?
Thanks.

SHA-256 produces 256 bits of output, obviously. The minimum UTF-8 character length is one byte, i.e. 8 bits. Therefore, any password longer than 256/8=32 characters is extremely likely to collide with a shorter one.
Is this what you meant?

A hash doesn't increase entropy, it just, so to speak, distills it. Since SHA256 produces 256 bits of output, if you supply it with a password that's completely unpredictable (i.e., each bit of input represents one bit of entropy) then anything beyond 256 bits of input is more or less wasted.
Other than from a truly random source, however, it's really hard to get input that has one bit of entropy for every bit of input. For typical English text, Shannon's testing showed about one bit of entropy per character.

I have come to roughly the same conclusion as the others did, but with a different rationale.
Generally speaking, a preimage (brute force) attack on SHA-256 requires 2^256 evaluations, regardless of password length. In other words, a hash of a "password" that is thousands of characters long would still take an average of 2^256 tries to duplicate. 2^256 is about 1.2 x 10^77. However, a very short password, where the number of possibilities is less than 2^256, is even easier to break.
The threshold is passed when the number of possibilities is greater than 2^256.
If you are using ISO 8859-1, which has 191 characters, there are 191^n possible random passwords of length n, where n is the length of the password. 191^33 is about 1.9 x 10^75 and 191^34 is about 3.6 x 10^77, so the threshold would be at 33 characters.
If you were using plain ASCII, with 128 characters, there would be 128^n possible random passwords of length n, where n is the length of the password. 128^36 is about 7.2 x 10^75 and 128^37 is about 9.3 x 10^77, so the threshold would be at 36 characters.
Some of the other answers seem to imply that the threshold is always at 32 characters. However, if my logic is correct, the threshold varies, depending on the number of characters you have in your character set.
In fact, suppose that you used only characters a-z and 0-9, you would continue to add password strength up until your password was 49 characters long! (36^49 is about 1.8 x 10^76)
Hopefully this answer gives you a mathematical basis for answering the question.
As a side note, if a birthday (collision) attack were possible on SHA-256, it would theoretically require only 2^128 evaluations (on average), which is about 3.4 x 10^38. In that case, the threshold for ISO 8859-1 would be at only 16 characters (191^16 is about 3.1 x 10^36). Thankfully, such an attack has not yet been publicly demonstrated.
Please see the Wikipedia articles on SHA-2, preimage attacks, and birthday attacks.
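The thresholds above can be recomputed with exact integer arithmetic. In this sketch the function name is mine, and "threshold" is taken to mean the longest length n for which charset_size^n does not exceed 2^hash_bits.

```python
def threshold_length(charset_size: int, hash_bits: int = 256) -> int:
    """Longest length n for which charset_size ** n is still <= 2 ** hash_bits."""
    limit = 2 ** hash_bits
    n = 0
    while charset_size ** (n + 1) <= limit:
        n += 1
    return n

for size in (191, 128, 36):
    print(size, threshold_length(size))
print(191, threshold_length(191, hash_bits=128))
```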

I don't think there is an "effective" limit. A password of any length will be effective if it is created well (the usual rules: no words, mixed numbers, letters, cases and characters). It is best to force users to follow these rules rather than limit length. But a minimum length should be imposed, something like 8-10 characters, to save the users from themselves.

Is there a standard for using PBKDF2 as a password hash?

Join me in the fight against weak password hashes.
A PBKDF2 password hash should contain the salt, the number of iterations, and the hash itself so it's possible to verify later. Is there a standard format, like RFC2307's {SSHA}, for PBKDF2 password hashes? BCRYPT is great but PBKDF2 is easier to implement.
Apparently, there's no spec. So here's my spec.
>>> from base64 import urlsafe_b64decode
>>> password = u"hashy the \N{SNOWMAN}"
>>> salt = urlsafe_b64decode('s8MHhEQ78sM=')
>>> encoded = pbkdf2_hash(password, salt=salt)
>>> encoded
'{PBKDF2}1000$s8MHhEQ78sM=$hcKhCiW13OVhmLrbagdY-RwJvkA='
Update: http://www.dlitz.net/software/python-pbkdf2/ defines a crypt() replacement. I updated my little spec to match his, except his starts with $p5k2$ instead of {PBKDF2}. (I have the need to migrate away from other LDAP-style {SCHEMES}).
That's {PBKDF2}, the number of iterations in lowercase hexadecimal, $, the urlsafe_base64 encoded salt, $, and the urlsafe_base64 encoded PBKDF2 output. The salt should be 64 bits, the number of iterations should be at least 1000, and the PBKDF2 with HMAC-SHA1 output can be any length. In my implementation it is always 20 bytes (the length of a SHA-1 hash) by default.
The password must be encoded to utf-8 before being sent through PBKDF2. No word on whether it should be normalized into Unicode's NFC.
This scheme should be on the order of iterations times more costly to brute force than {SSHA}.
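For reference, a format like this can be sketched with the standard library's `hashlib.pbkdf2_hmac`. This is not the author's `pbkdf2_hash` implementation, only an illustration under assumptions: the iteration count is written in decimal (as it appears in the example output above, although the description says hexadecimal), and `pbkdf2_verify` is a hypothetical helper.

```python
import hashlib
import hmac
import os
from base64 import urlsafe_b64decode, urlsafe_b64encode

def pbkdf2_hash(password, salt=None, iterations=1000):
    """Return '{PBKDF2}iterations$salt$hash' using PBKDF2-HMAC-SHA1."""
    if salt is None:
        salt = os.urandom(8)  # 64-bit random salt
    derived = hashlib.pbkdf2_hmac(
        "sha1", password.encode("utf-8"), salt, iterations, dklen=20
    )
    return "{PBKDF2}%d$%s$%s" % (
        iterations,
        urlsafe_b64encode(salt).decode("ascii"),
        urlsafe_b64encode(derived).decode("ascii"),
    )

def pbkdf2_verify(password, encoded):
    """Recompute the hash from the stored parameters and compare."""
    iterations, salt_b64, _ = encoded[len("{PBKDF2}"):].split("$")
    expected = pbkdf2_hash(password, urlsafe_b64decode(salt_b64), int(iterations))
    return hmac.compare_digest(expected, encoded)
```

Verification re-derives the whole encoded string from the stored salt and iteration count and compares it in constant time.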

There is a specification for the parameters (salt and iterations) of PBKDF2, but it doesn't include the hash. This is included in PKCS #5 version 2.0 (see Appendix A.2). Some platforms have built-in support for encoding and decoding this ASN.1 structure.
Since PBKDF2 is really a key derivation function, it doesn't make sense for it to specify a way to bundle the "hash" (which is really a derived key) together with the derivation parameters; in normal usage, the key must remain secret, and is never stored.
But for usage as a one-way password hash, the hash can be stored in a record with the parameters, but in its own field.

I'll join you in the fight against weak hashes.
OWASP has a Password Storage Cheat Sheet (https://www.owasp.org/index.php/Password_Storage_Cheat_Sheet) with some guidance; they recommend 64,000 PBKDF2 iterations minimum as of 2012, doubling every two years (i.e. 90,510 in 2013).
Note that storing a long, cryptographically random salt per user ID is always a basic requirement.
Note that having a widely variable per-userid number of iterations and storing the number of iterations along with the salt will add some complexity to cracking software, and may help preclude certain optimizations. For instance, "bob" gets hashed with 135,817 iterations, while "alice" uses 95,121 iterations, i.e. perhaps (90510 + RAND(90510)) for 2013.
Note also that all of this is useless if users are allowed to choose weak passwords like "password", "Password1!", "P#$$w0rd", and "P#$$w0rd123", all of which will be found by rules based dictionary attacks very quickly indeed (the latter is simply "password" with the following rules: uppercase first letter, 1337-speak, add a three digit number to the end). Take a basic dictionary list (phpbb, for a good, small starter wordlist) and apply rules like this to it, and you'll crack a great many passwords where people try "clever" tricks.
Therefore, when checking new passwords, don't just apply "all four of upper, lower, number, symbol, at least 11 characters long", since "P#$$w0rd123" complies with this seemingly very tough rule. Instead, use that basic dictionary list and see if basic rules would crack it. It's a lot simpler than actually trying a crack: you can lower-case your list and their word, and then simply write code like "if the last 4 characters are a common year, check all but the last four characters against the wordlist", "if the last 3 characters are digits, check all but the last 3 characters against the wordlist", "check all but the last two characters against the wordlist", and "de-1337 the password (turn # into a, 3 into e, and so on), then check it against the wordlist and try those other rules too."
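The rules described above can be sketched in a few lines. Everything here is illustrative: the substitution table `DE_LEET`, the year pattern (19xx or 20xx) and the function names are my own assumptions, and a real checker would use a proper wordlist and a larger rule set.

```python
import re

# Hypothetical substitution table; a real checker would use a larger one.
DE_LEET = str.maketrans({"#": "a", "@": "a", "$": "s", "0": "o", "3": "e", "1": "l"})

def candidates(password: str):
    """Yield the stripped-down forms the rules above would test."""
    word = password.lower()
    yield word
    if re.fullmatch(r".*(19|20)\d\d", word):
        yield word[:-4]   # trailing 4-digit year
    if re.fullmatch(r".*\d{3}", word):
        yield word[:-3]   # trailing 3 digits
    yield word[:-2]       # any trailing 2 characters

def is_weak(password: str, wordlist) -> bool:
    """True if a simple rule turns the password into a wordlist entry."""
    words = {w.lower() for w in wordlist}
    for form in candidates(password):
        if form in words or form.translate(DE_LEET) in words:
            return True
    return False
```

With a wordlist containing "password", this flags "Password1!", "P#$$w0rd" and "P#$$w0rd123".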
As far as passphrases go, they are in general a great idea, particularly if some other characters are added to the middle of words, but if and only if they're long enough, since you're giving up a lot of possible combinations.
Note that modern machines with GPU's are up to the tens of billions of hash iterations (MD5, SHA1, SHA-256, SHA-512, etc.) per second, even in 2012. As far as word combination "correct horse battery staple" type passwords, this one is at best a very modest password- it's only 4 all lower case English words of length 7 or less with spaces. So, if we go looking for XKCD style passwords with an 18 billion guess a second setup: A modern small american english dictionary has:
6k words of length 5 or less
21k words of length 7 or less
36k words of length 9 or less
46k words of length 11 or less
49k words of length 13 or less
With an XKCD style passphrase, and without bothering to filter words by popularity ("correct" vs. "chair's" vs. "dumpier" vs. "hemorrhaging") we have 21k^4, which is only about 2E17 possibilities. With the 18 billion/sec setup (a single machine with 8 GPU's if we're facing a single SHA1 iteration), that's about 4 months to exhaustively search the keyspace. If we had ten such setups, that's about two weeks. If we excluded unlikely words like "dumpier", that's a lot faster for a quick first pass.
Now, if you get words out of a "huge" linux american english wordlist, like "Balsamina" or "Calvinistically" (both chosen by using the "go to row" feature), then we'd have:
30k words of length 5 or less
115k words of length 7 or less
231k words of length 9 or less
317k words of length 11 or less
362k words of length 13 or less
Even with the 7 length max limit, with this huge dictionary as a base and randomly chosen words, we have 115k^4 ~= 1.8E20 possibilities, or about 12 years if the setup is kept up to date (doubling in power every 18 months). This is extremely similar to a 13 character, lower case + number only password. "300 years" is what most estimates will tell you, but they fail to take Moore's Law into account.
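These estimates can be roughly reproduced as follows. The model for hardware growth is my own assumption: the attacker's guess rate doubles smoothly every 18 months. Under that model the huge-dictionary case comes out somewhat below the 12 years quoted above, but in the same range.

```python
import math

RATE = 18e9                       # guesses per second (figure quoted above)
YEAR = 365.25 * 24 * 3600         # seconds per year
DOUBLING = 1.5                    # assumed years per doubling of attacker speed

def years_fixed(keyspace: float) -> float:
    """Years to exhaust the keyspace at a constant guess rate."""
    return keyspace / RATE / YEAR

def years_with_doubling(keyspace: float) -> float:
    """Years to exhaust the keyspace if the rate doubles every DOUBLING years."""
    # Integrating rate * 2**(t / DOUBLING) from 0 to T and solving for T.
    base_years = years_fixed(keyspace)
    return DOUBLING * math.log2(1 + base_years * math.log(2) / DOUBLING)

print(years_fixed(21_000 ** 4) * 12)        # months, small dictionary
print(years_fixed(115_000 ** 4))            # years, huge dictionary
print(years_with_doubling(115_000 ** 4))    # same, with hardware upgrades
```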
