Security of bcrypt iterations/cost parameter

Fact A. By the pigeonhole principle, every hash function has an infinite number of collisions, even if none has been found yet.
Fact B. Re-hashing a hash, as in hash(hash(password)), is not more secure than hash(password); in fact, hash(hash(password)) opens up a collision attack that is not possible with hash(password).
Fact C. Based on B, by increasing iterations we reach a point where most passwords and salts return the same constant hash value. That is, the probability of a collision becomes high, perhaps even 100%.
Fact D. bcrypt has an iteration/cost parameter that we can increase over time, based on our hardware specifications.
So, combining these facts, can we say that a higher bcrypt cost value decreases security by increasing the probability of collisions?
If the answer is "no", why?

BCrypt does not do naive iterations; it includes the original password and the salt in every iteration. The same goes for PBKDF2, which applies an HMAC in every iteration. Have a look at the pseudocode of BCrypt.
There is a very illustrative answer on Information Security about the effects of collisions with iterative hashing. In practice, as far as I know, collisions are not really a problem for password hashing, even when iterated.
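To make the difference concrete, here is a minimal Python sketch (not BCrypt's actual algorithm; the function names, salt, and round count are purely illustrative) contrasting naive re-hashing with a PBKDF2-style iteration that folds the password back in on every round:

```python
import hashlib
import hmac

def naive_rehash(password: bytes, salt: bytes, rounds: int) -> bytes:
    # Naive chain: after the first round, each step depends only on the
    # previous digest, so the original password is no longer mixed in.
    digest = hashlib.sha256(salt + password).digest()
    for _ in range(rounds - 1):
        digest = hashlib.sha256(digest).digest()
    return digest

def keyed_iteration(password: bytes, salt: bytes, rounds: int) -> bytes:
    # PBKDF2-style idea: the password is the HMAC key in every round,
    # so no iteration ever discards the original input material.
    block = hmac.new(password, salt, hashlib.sha256).digest()
    for _ in range(rounds - 1):
        block = hmac.new(password, block, hashlib.sha256).digest()
    return block

print(naive_rehash(b"hunter2", b"pepper", 10_000).hex())
print(keyed_iteration(b"hunter2", b"pepper", 10_000).hex())
```

In the naive chain, two passwords that ever collide stay collided for every later round; the keyed variant never discards the original input material, which is the property the answer above is pointing at.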

Related

Hashing and 'brute-force' permutations

So this is a two-part question:
Are there any hashing functions that guarantee that, for any two inputs of the same length, they generate a unique hash? As I remember, most are that way, but I just need to confirm this.
Based on the 1st question: given a file hash and a length, is it then theoretically possible to 'brute-force' all byte permutations of that same length until the same hash is generated, i.e. the original file has been recreated?
PS. I am aware that this would take ages (if it is theoretically possible at all), but I think it would be feasible for small files (sizes < 1KB).
1KB is about 1000 bytes, each with 256 possible values, so that'd be 256^1000 combinations, right? It's a really big number, roughly a 1 with 2408 zeros behind it.
If you were to generate all of them, one would be the right one, but you'd have some number of collisions.
According to this security.SE post, you can expect an MD5 collision after roughly 2^64 hashes (the birthday bound for a 128-bit digest). Even if we divide our original number by 2^64, we are left with something on the order of 10^2389 candidates.
That is still a lot of files to check.
I'd feel a lot better if someone critiqued my math here, but the point is that your first point is not true, because of collisions. You can use the same sort of math to calculate the chance of two 1000-character passwords having the same hash.
It's the birthday problem: given 2 people, it is unlikely that they'd share a birthday, but if you take a room full of people, the probability of any two of them having the same birthday increases very quickly. If you take all 1000-character passwords, some of them are going to collide. You are going from X bytes down to 16 bytes; you can't fit all of the combinations into 16 bytes.
Expanding upon the response to your first point: one of the goals of cryptographic hash functions is unpredictability. A function with zero collisions is a 1-1 (or one-to-one) function, so called because every input has exactly one output and every output has exactly one input.
In order for a function to accept inputs of arbitrary length and complexity without ever generating a collision, it is easy to see that the function must have arbitrary-length outputs. As Gray obliquely points out, most hash functions have fixed-length outputs. (There are apparently some newer algorithms that support arbitrary-length outputs, but they still don't guarantee zero collisions.) The reason is not stated clearly in the common crypto literature, but consider the difference between hashing and encrypting.
In hashing, you have the message (the unaltered original) and the message digest (the output of the hash function; "digest" here meaning "a summation or condensation of a body of information").
With encryption, you have the plaintext and the ciphertext. The implication is that the ciphertext is of equal length and complexity as the original.
The way I look at it, a cryptographic hash function with zero collisions would be of equal complexity to encryption. (Note that I'm unsure what the advantages of a variable-length hash output are, so I asked a question about it.)
Additionally, hash functions are susceptible to attacks by pre-computed rainbow tables, which is why all hashing schemes still considered secure for passwords employ extra random inputs, called salts. The reason encryption isn't susceptible to a similar attack is that the encryption key is kept secret, and you can't pre-compute output values without knowing the key. Compare symmetric-key encryption (where there is one key that must be kept secret) with public-key encryption (where the encryption key is public and the decryption key is private).
The other thing that protects encryption algorithms from pre-computation attacks is that the number of computations for arbitrary-length inputs grows exponentially, and it is literally impossible to store the output for every input you may be interested in.
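As a small illustration of the salting point above (a generic sketch, not tied to any particular scheme; real password storage should use a slow KDF such as bcrypt or PBKDF2 rather than a single SHA-256):

```python
import hashlib
import os

# Without a salt, identical passwords always produce identical digests,
# so they can be looked up in a pre-computed (rainbow) table.
unsalted = hashlib.sha256(b"correct horse").hexdigest()

# With a fresh random salt stored alongside the hash, the same password
# maps to a different digest every time, so pre-computation is useless.
salt = os.urandom(16)
salted = hashlib.sha256(salt + b"correct horse").hexdigest()

print(unsalted)
print(salt.hex(), salted)
```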

Decrypted Hash and Encrypted hash

If this password's ( qwqwqw123456 ) hash is $2a$07$sijdbfYKmgWdcGhPPn$$$.C98C0wmy6jsqA3fUKODD0OFBKJkHdn.
What is the password of this hash $2a$07$sijdbfYKmgWdcGhPPn$$$.9PTdICzon3EUNHZvOOXgTY4z.UTQTqG
And can I know which hash algorithm it is?
You could try to guess which algorithm was used, depending on the format and length of the hash, your known value, etc., but there is no definitive way to know it. And the whole purpose of any hash function is that it is NOT reversible/decryptable/whatever.
Depending on some factors you could try to guess the original value too (brute-force attack: hash all possible values and check which hash is equal to yours), but depending on the number of possibilities, the algorithm used, etc., that could take millions of years. (You could also be lucky and get the correct value within a short time, but that's unlikely.)
There are smarter approaches than plain brute-forcing, but in the end it's pretty much impossible to reverse a good hash function.
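For completeness, here is what "hash all possible values and check" looks like in practice, assuming the hash really is bcrypt (the $2a$ prefix is typical of bcrypt) and using the third-party Python bcrypt package; the target and the candidate list are purely illustrative:

```python
import bcrypt  # third-party package: pip install bcrypt

# Illustrative target; in the question this would be the stored hash string.
target = bcrypt.hashpw(b"qwqwqw123456", bcrypt.gensalt(rounds=7))

candidates = [b"password", b"123456", b"qwqwqw123456"]

for guess in candidates:
    # checkpw re-hashes the guess with the salt and cost factor that are
    # embedded in the target hash, then compares the two results.
    if bcrypt.checkpw(guess, target):
        print("match:", guess.decode())
        break
else:
    print("no candidate matched")
```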

Iterating over a hash function even though it reduces the search space

I was reading this article regarding the number of times you should hash your password.
A salt is added to the password before the password is hashed, to safeguard against dictionary attacks and rainbow table attacks.
The commenters on the answer by ORIP stated:
hashing a hash is not something you should do, as the possibility of a hash collision increases with each iteration, which may reduce the search space (salt doesn't help), but this is irrelevant for password-based cryptography. To reach the 256-bit search space of this hash you'd need a completely random password, 40 characters long, from all available keyboard characters (log2(94^40)).
The answer by erickson recommended:
With pre-computation off the table, an attacker has to compute the hash on each attempt. How long it takes to find a password now depends entirely on how long it takes to hash a candidate. This time is increased by iterating the hash function. The number of iterations is generally a parameter of the key derivation function; today, a lot of mobile devices use 10,000 to 20,000 iterations, while a server might use 100,000 or more. (The bcrypt algorithm uses the term "cost factor", which is a logarithmic measure of the time required.)
My questions are:
1) Why do we iterate over the hash function, since each iteration reduces the search space and hence makes it easier to crack the password?
2) What does "search space" mean?
3) Why is the reduction of the search space irrelevant for password-based cryptography?
4) When is the reduction of the search space relevant?
Let's start with the basic question: What is a search space?
A search space is the set of all values that must be searched in order to find the one you want. In the case of AES-256, the total key space is 2^256. This is a really staggeringly large number. This is the number that most people are throwing around when they say that AES cannot be brute forced.
The search space of "8-letter sequences of lowercase letters" is 26^8, or about 200 billion (~2^37), which from a cryptographic point of view is a tiny, insignificant number that can be searched pretty quickly. It's less than 3 days at 1,000,000 checks per second. Real passwords are chosen out of much smaller sets, since most people don't type 8 totally random letters. (You can up this with upper case and numbers and symbols, but people pick from a tiny set of those, too.)
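A quick back-of-the-envelope check of those numbers, as plain Python arithmetic:

```python
import math

keyspace = 26 ** 8                    # 8 random lowercase letters
print(f"{keyspace:,}")                # 208,827,064,576 (~2e11)
print(math.log2(keyspace))            # ~37.6 bits
print(keyspace / 1_000_000 / 86_400)  # ~2.4 days at 1,000,000 checks/second
```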
OK, so people like to type short, easy passwords, but we want to make them hard to brute-force. So we need a way to convert an "easy to guess password" into a "hard to guess key." We call this a Key Derivation Function (KDF). We need two things from it:
The KDF must be "computationally indistinguishable from random." This means that there is no inverse of the hash function that can be computed more quickly than a brute-force search.
The KDF should take non-trivial time to compute, so that brute-forcing the tiny password space is still very difficult. Ideally it should be made as difficult as brute-forcing the entire key space, but it is rare to push it that far.
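As a concrete sketch of such a KDF, here is PBKDF2 from Python's standard library; the salt, iteration count, and output length shown are illustrative rather than a recommendation from this answer:

```python
import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(16)

# 100,000 iterations of HMAC-SHA-256: deliberately slow, so every guess an
# attacker tries costs 100,000 hash computations instead of one.
key = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000, dklen=32)
print(key.hex())
```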
The first point is the answer to your question of "why don't we care about collisions?" It is because collisions, while they could possibly exist, cannot be predicted in a computationally efficient manner. If collisions could be efficiently predicted, then your KDF function would not be indistinguishable from random.
A KDF is not the same as just "repeated hashing." Repeated hashing can be distinguished from random, and is subject to significant attacks (most notably length-extension attacks).
PBKDF2, as a specific KDF example, is proven to be computationally indistinguishable from random, as long as it is provided with a pseudorandom function (PRF). A PRF is defined as itself being computationally indistinguishable from random. PBKDF2 uses HMAC, which is proven to be a PRF as long as it is provided with a hashing function that is at least weakly collision resistant (the requirement is actually a bit weaker than even that).
Note the word "proven" here. Good cryptography lives on top of mathematical security proofs. It is not just "tie a lot of knots and hope it holds."
So that's a little tiny bit of the math behind why we're not worried about collisions, but let's also consider some intuition about it.
The total number of 16-character (absurdly long) passwords that can be easily typed on a common English keyboard is about 95^16 or 2^105 (that doesn't count the 15, 14, 13, etc length passwords, but since 95^16 is almost two orders of magnitude larger than 95^15, it's close enough). Now, consider that for each password, we're going to randomly map it to 10,000 intermediate keys (via 10,000 iterations of PBKDF2). That gets us up to 2^118 random choices that we hope never collide in our hash. What are the chances?
Well, 2^256 (our total space) divided by 2^118 (our keys) is 2^138. That means we're using much less than 10^-41 of the space for all passwords that could even be remotely likely. If we're picking these randomly (and the definition of a PRF says we are), the chances of two colliding are, um, small. And if two somehow did, no attacker would ever be able to predict it.
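Checking that arithmetic with plain Python (independent of the answer itself):

```python
import math

passwords = 95 ** 16            # 16-character keyboard passwords
print(math.log2(passwords))     # ~105.1, i.e. about 2^105

keys = passwords * 10_000       # 10,000 intermediate keys per password
print(math.log2(keys))          # ~118.4, i.e. about 2^118

# Fraction of a 256-bit output space those keys could ever occupy:
print(keys / 2 ** 256)          # ~3.8e-42, comfortably below 10^-41
```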
Takeaway lesson: Use PBKDF2 (or another good KDF like scrypt or bcrypt) to convert passwords into keys. Use a lot of iterations (10,000-100,000 at a minimum). Do not worry about the collisions.
You may be interested in a little more discussion of this in Brute-Forcing Passwords.
As the second snippet said, each iteration makes each "guess" a hacker makes take longer, therefore increasing the total time it will take them to crack an average password.
Search space is all the possible hashes for a password after however many iterations you are using. Each iteration decreases the search space.
Because of #1, as the size of the search space decreases, the time to check each possibility increases, balancing out that negative effect.
According to the second snippet, answers #1 and #3 say it actually isn't.
I hope this makes sense, it's a very complicated topic.
The reason to iterate is to make it harder for an attacker to brute force the hash. If you have a single round of hashing for a value, then in order to precompute a table for cracking that hash, you need to do 1 * keyspace hashes. If you do 1000 hashes of the value, then it would require the work of 1000 * keyspace.
Search space generally refers to the total number of combinations of characters that could make up a password.
I would say that the reduction of the search space is irrelevant because passwords are generally not cracked by attempting 0000000, then 0000001, etc. They are instead cracked using dictionaries and combinatorics. There is essentially a realm of passwords that are likely to get cracked (like "password", "abcdef1", "goshawks", etc.), but creating a larger work factor makes it much more difficult for an attacker to hit all of the likely passwords in that space. Combined with a salt, it means they have to do all of the work for those likely passwords, for every hash they want to crack.
The reduction in search space becomes relevant if you are trying to crack something that is random and could take up any value in the search space.
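To put rough numbers on that work-factor argument (the keyspace and iteration count here are illustrative only):

```python
keyspace = 95 ** 8                   # all 8-character printable-ASCII passwords
iterations = 1_000

print(f"{keyspace:,}")               # ~6.6e15 hashes for a single-round table
print(f"{keyspace * iterations:,}")  # ~6.6e18 hashes once 1,000 iterations are used
```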

Possible collisions when hashing a UUID (CakePHP)

Is it possible to have collisions if I use Security::hash on a uuid() string? I know that uuid() generates a truly unique string, but I need them to be hashed, and I am worried that there is a possibility the hashed string can repeat.
Thanks
Firstly, contrary to the name, a UUID is not a truly unique string. It is a string that is unique with very high probability (high enough that it can, for pretty much all purposes, be treated as unique).
As for your chances of getting a collision, that really depends on which hashing algorithm you are using. Assuming a well-built hashing algorithm that distributes uniformly over its output space, the odds of any two particular hashes colliding are 1 / 2^n, where n is the hash length in bits. The odds of any two hashes colliding in a birthday-attack scenario can be approximated with the formula p(h) ≈ h^2 / (2m), where h is the number of hashes you expect to generate and m is the size of the output space (2^256 in the case of SHA-256, for example).
So, to sum it all up, you will always have a chance of a hash collision regardless of which hashing algorithm you're using. However, with pretty much anything equal to or stronger than SHA-256, the chance is so vanishingly small that it is not worth worrying about. Your time is better spent worrying about the chances of a bus running over your server in the next second.
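Plugging some illustrative numbers into that birthday approximation, say one billion hashed UUIDs against a SHA-256-sized output space:

```python
h = 10 ** 9           # one billion hashed UUIDs
m = 2 ** 256          # size of the SHA-256 output space

p = h ** 2 / (2 * m)  # birthday approximation p(h) ≈ h^2 / (2m)
print(p)              # ~4.3e-60: a negligible collision probability
```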
uuid() can generate duplicates, but the chance is very, very small.
CakePHP's Security::hash looks like it just uses PHP's hash function.
If you use it with sha512 it should be pretty good.

How (if at all) does a predictable random number generator get more secure after SHA-1ing its output?

This article states that
Despite the fact that the Mersenne Twister is an extremely good pseudo-random number generator, it is not cryptographically secure by itself for a very simple reason. It is possible to determine all future states of the generator from the state the generator has at any given time, and either 624 32-bit outputs, or 19,937 one-bit outputs are sufficient to provide that state. Using a cryptographically-secure hash function, such as SHA-1, on the output of the Mersenne Twister has been recommended as one way of obtaining a keystream useful in cryptography.
But there are no references on why digesting the output would make it any more secure. And honestly, I don't see why this should be the case. The Mersenne Twister has a period of 2^19937-1, but I think my reasoning would also apply to any periodic PRNG, e.g. Linear Congruential Generators as well. Due to the properties of a secure one-way function h, one could think of h as an injective function (otherwise we could produce collisions), thus simply mapping the values from its domain into its range in a one-to-one manner.
With this thought in mind I would argue that the hashed values will produce exactly the same periodical behaviour as the original Mersenne Twister did. This means if you observe all values of one period and the values start to recur, then you are perfectly able to predict all future values.
I assume this is related to the same principle that is applied in password-based encryption (PKCS#5): because the domain of passwords does not provide enough entropy, simply hashing passwords doesn't add any additional entropy; that's why you need to salt passwords before you hash them. I think exactly the same principle applies here.
One simple example that finally convinced me: Suppose you have a very bad PRNG that will always produce a "random number" of 1. Then even if SHA-1 would be a perfect one-way function, applying SHA-1 to the output will always yield the same value, thus making the output no less predictable than previously.
Still, I'd like to believe there is some truth to that article, so surely I must have overlooked something. Can you help me out? For the most part, I have left the seed value out of my arguments; maybe this is where the magic happens?
The state of the Mersenne Twister is defined by the previous n outputs, where n is the degree of recurrence (a constant; 624 for the usual 32-bit MT19937). As such, if you give the attacker n outputs straight from a Mersenne Twister, they will immediately be able to predict all future values.
Passing the values through SHA-1 makes it more difficult, as now the attacker must try to reverse the hash to get back at the RNG output. However, for a 32-bit word size, this is unlikely to be a severe impediment to a determined attacker; they can build a rainbow table or use some other standard approach for reversing SHA-1, and in the event of collisions, filter candidates by whether they produce the RNG stream observed. As such, the Mersenne Twister should not be used for cryptographically sensitive applications, SHA-1 masking or not. There are a number of standard CSPRNGs that may be used instead.
An attacker is able to predict the output of MT based on relatively few outputs not because it repeats over such a short period (it doesn't), but because the output leaks information about the internal state of the PRNG. Hashing the output obscures that leaked information. As bdonlan points out, though, if the output size is small (32 bits, for instance), this doesn't help, as the attacker can easily enumerate all valid plaintexts and precalculate their hashes.
Using more than 32 bits of PRNG output as an input to the hash would make this impractical, but a cryptographically secure PRNG is still a much better choice if you need this property.
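A toy sketch of that enumeration attack, shrunk to 16-bit outputs so the table builds instantly (the hash and the "PRNG word" are purely illustrative; the same idea scales to 32-bit words with more time and memory):

```python
import hashlib

# Precompute SHA-1 of every possible 16-bit "PRNG word". The same idea
# scales to 32-bit words with more time and memory, which is why hashing
# a small output does not hide it.
table = {hashlib.sha1(v.to_bytes(2, "big")).digest(): v
         for v in range(2 ** 16)}

observed = hashlib.sha1((4242).to_bytes(2, "big")).digest()  # what the attacker sees
print(table[observed])  # recovers 4242, the raw generator output
```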
