What's the difference between a secure compare and a simple ==(=)

What's the difference between a secure compare and a simple ==(=) - node.js

Github's securing webhooks page says:
Using a plain == operator is not advised. A method like secure_compare performs a “constant time” string comparison, which renders it safe from certain timing attacks against regular equality operators.
I use bcrypt.compare('string', 'computed hash') when comparing passwords.
What makes this a "secure compare" and can I do this using the standard crypto library in Node?

The point of a "constant time" string comparison is that the comparison will take the exact same amount of time no matter what the comparison target is (the unknown value). This "constant time" reveals no information to an attacker about what the unknown target value might be. The usual solution is that all characters are compared, even after a mismatch is found so no matter where a mismatch is found, the comparison runs in the same amount of time.
Other forms of comparison might return an answer in a shorter time when certain conditions are true which allows an attacker to learn what they might be missing. For example, in a typical string comparison, the comparison will return false as soon as an unequal character is found. If the first character does not match, then the comparison will return in a shorter amount of time than if it does. A diligent attacker can use this information to make a smarter brute force attack.
A "constant time" comparison eliminates this extra information because no matter how the two strings are unequal, the function will return its value in the same amount of time.
In looking at the nodejs v4 crypto library, I don't see any signs of a function to do constant time comparison and per this post, there is a discussion about the fact that the nodejs crypto library is missing this functionality.
EDIT: Node v6 now has crypto.timingSafeEqual(a, b).
There is also such a constant time comparison function available in this buffer-equal-constant-time module.

jfriend's answer is correct in general, but in terms of this specific context (comparing the output of a bcrypt operation with what is stored in the database), there is no risk with using "==".
Remember, bcrypt is designed to be a one-way function that is specifically built to resist password guessing attacks when the attacker gets hold of the database. If we assume that the attacker has the database, then the attacker does not need timing leak information to know which byte of his guess for the password is wrong: he can check that himself by simply looking at the database. If we assume the attacker does not have the database, then timing leak information could potentially tell us which byte was wrong in his guess in a scenario that is ideal for the attacker (not realistic at all). Even if he could get that information, the one-way property of bcrypt prevents him from exploiting the knowledge gain.
Summary: preventing timing attacks is a good idea in general, but in this specific context, you're not putting yourself in any danger by using "==".
EDIT: The bcrypt.compare( ) function already is programmed to resist timing attacks even though there is absolutely no security risk in not doing this.

Imagine a long block of material to compare. If the first block does not match and the compare function returns right then, you have leaked data to the attacker. He can work on the first block of data until the routine takes longer to return, at which time he will know that the first chunk matched.
2 ways to compare data that are more secure from timing attacks are to hash both sets of data and to compare the hashes, or to XOR all the data and compare the result to 0. If == just scans both blocks of data and returns if and when it finds a discrepancy, it can inadvertently play "warmer / colder" and guide the adversary right in on the secret text he wants to match.

Related

Program to encrypt / decrypt text string in Assembly MIPS

I want to create a program that reads in input a string of characters, and through a predefined action (I was thinking of a sum with an integer randomly generated) encrypts the string by returning the encrypted string and the key to decode it in a second moment.
Could you give me any suggestions on how to treat the string?
I would like to do so :
li $v0,8
la $a0,buffer
li $a1,1024
syscall
move $s7,$a0
This is the code to read the string.
After that I want to do:
add $t0,$s5,$s3
When I add a random generated integer to the register contain the string.
After knowing the values of the random number and the sum, I can again get the original string with a subtraction.
Is it a proper method?

That depends somewhat on the purpose of the encryption. As I understand this, the approach you're suggesting is basically a form of a Caesar Cipher. While this will protect your string to some degree against casual observers, it will definitely not be suitable for serious security purposes. It is subject to brute-force attacks, known-plaintext and chosen-plaintext attacks, and frequency analysis.
The idea behind a brute-force attack is that, for any given string of a reasonable length, there will almost always be exactly one shift that will make the string make sense, so an attacker could repeatedly try different shifts until he found the one that made the string make sense. The first shift that makes the string make sense is is almost certainly the correct shift.
If you're doing a "classical" Caesar cipher (e.g. C = A, D = B, E = C, etc.), there are only 25 possible shifts, so on average an attacker could guess the plaintext in 12.5 guesses (and 25 guesses in the worst case). In a scheme like yours you'd have to use a very large range of enormous numbers in order to be able to defend against this even slightly. For example, if you were only doing shifts of between 1 - 100 an attacker could reconstruct the plaintext in an average of 50 guesses (and 100 guesses in the worst case), which is obviously not a defense against a motivated attacker, especially since this task lends itself to easy parallelization. Assuming I did my math right, even if you had a trillion possible shifts and it took 100 operations to do and test a particular shift, you could try all of them in under 7 seconds on an Intel i7 if I did my math right and, on average, it would take less than 3.5 seconds to find the correct answer using brute force.
The idea behind frequency analysis is that your text retains the same statistical characteristics as the host language. For example, in English the most common letter is "e," so if you find the most common letter in your ciphertext it probably corresponds to "e." You can then work out how much you shifted the string to get that particular output. For example, if "g" is the most frequent letter in the ciphertext, you can guess that g = e and that they therefore must have shifted the text over by two.
A known-plaintext attack is where an attacker has an example of both the plaintext and its corresponding ciphertext and they can use that information to reconstruct what the key must have been. A chosen-plaintext attack is basically the same thing except that the attacker gets to choose which plaintext he sees the corresponding ciphertext for. (Note that this is only a problem if you're reusing keys, especially if you're doing so in a predictable manner; if you never reuse keys reconstructing the key for the known/chosen plaintext won't give the attacker any information about the key you used for other messages).
I've never tried doing this in assembly language to tell the truth but if you want good security you might want to consider AES. If you're really interested in simplicity of implementation and are willing to go with something less secure, you might also go with XTEA.

Find substring in stream without storing substring in plain text

Lets say I have a large stream of data (for example packets coming in from a network), and I want to determine if this data contains a certain substring. There are multiple string searching algorithms, but they require the algorithm to know the plain text string they are searching for.
Lets say, the string being sought is a password, and you do not want to store it in plain text in this search application. It would however appear in the stream as plain text. You could for example, store the hash and length of the password. Then for every byte in the stream check if the next length byte data from the stream hash to the password hash you have a probable match.
That way you can determine if the password was in the stream, without knowing the password. However, hashing once for every byte is not fast/efficient.
Is there perhaps a clever algorithm that could find the plain text password in the stream, without directly knowing the plain text password (and instead some non-reversible equivalent). Alternatively could a low quality version of the password be used, with the risk of false positives? For example, if the search application only knew half the password (in plain text), it could with some error detect the full password without knowing it.
thanks
P.S This question comes from a hypothetical discussion I had with some friends, about alerting you if your password was spotted in plain text on a network.

You could use a low-entropy rolling hash to pre-screen each byte so that, for the cost of lg k bits of entropy, you reduce the number of invocations of the cryptographic hash by a factor of k.

SAT is an NP-hard problem. Suppose your password is n characters long. If you could find a way to make a large enough SAT instance that
used a contiguous sequence of m >= n bytes from the data stream as its 8m input bits, and
produced the output 1 if and only if the bits present at its inputs contains your password starting at an offset that is some multiple of 8 bits
then by "operating" this SAT instance as a circuit, you would have a password detector that is (at least potentially) very difficult to "invert".
In some ways, what you want is the opposite of Boolean logic minimisation. You want the biggest, hairiest circuit (ideally for some theoretically justified notions of size and hairiness :) ) that computes the truth table. It's easy enough to come up with truth-table-preserving ways to grow the original CNF propositional logic formula -- e.g., if you have two clauses A and B, then you can always safely add a new clause consisting of all the literals in either A or B -- but it's probably much harder to come up with ways to grow the formula in ways that will confuse a modern SAT solver, since a lot of research has gone into making these programs super-efficient at detecting and exploiting all kinds of structure in the problem.
One possible avenue for injecting "complications" is to make the circuit compute functions that are difficult for circuits to compute, like divisions or square roots, and then test the results of these for equality in addition to the raw inputs. E.g., instead of making the circuit merely test that X[1 .. 8n] = YOUR_PASSWORD, make it test that X[1 .. 8n] = YOUR_PASSWORD AND sqrt(X[1 .. 8n]) = sqrt(YOUR_PASSWORD). If a SAT solver is smart enough to "see" that the first test implies the second then it can immediately dispense with all the clauses corresponding to the second -- but since everything is represented at a very low level with propositional clauses, this relationship is (I hope; as I said, modern SAT solvers are pretty amazing) well obscured. My guess is that it's better to choose functions like sqrt() that are not one-to-one on integers: this will potentially cause a SAT solver to waste time exploring seemingly promising (but ultimately incorrect) solutions.

Iterate over hash function though it reduces search space

I was reading this article regarding the number of times you should hash your password
A salt is added to password before the password is hashed to safeguard against dictionary attacks and rainbow table attacks.
The commentors in the answer by ORIP stated
hashing a hash is not something you should do, as the possibility of
hash collision increase with each iteration which may reduce the
search space (salt doesn't help), but this is irrelevant for
password-based cryptography. To reach the 256-bit search space of this
hash you'd need a completely random password, 40 characters long, from
all available keyboard characters (log2(94^40))
The answer by erickson recommended
With pre-computation off the table, an attacker has compute the hash
on each attempt. How long it takes to find a password now depends
entirely on how long it takes to hash a candidate. This time is
increased by iteration of the hash function. The number iterations is
generally a parameter of the key derivation function; today, a lot of
mobile devices use 10,000 to 20,000 iterations, while a server might
use 100,000 or more. (The bcrypt algorithm uses the term "cost
factor", which is a logarithmic measure of the time required.)
My questions are
1) Why do we iterate over the hash function since each iteration reduces the search space and hence make it easier to crack the password
2) What does search space mean ??
3) Why is the reduction of search space irrelevant for password-based cryptography
4) When is reduction of search space relevant ??
.

Let's start with the basic question: What is a search space?
A search space is the set of all values that must be searched in order to find the one you want. In the case of AES-256, the total key space is 2^256. This is a really staggeringly large number. This is the number that most people are throwing around when they say that AES cannot be brute forced.
The search space of "8-letter sequences of lowercase letters" is 26^8, or about 200 billion (~2^37), which from a cryptographic point of view is a tiny, insignificant number that can be searched pretty quickly. It's less than 3 days at 1,000,000 checks per second. Real passwords are chosen out of much smaller sets, since most people don't type 8 totally random letters. (You can up this with upper case and numbers and symbols, but people pick from a tiny set of those, too.)
OK, so people like to type short, easy passwords, but we want to make them hard to brute-force. So we need a way to convert "easy to guess passwords" into "hard to guess key." We call this a Key Derivation Function (KDF). We need two things for it:
The KDF must be "computationally indistinguishable from random." This means that there is no inverse of the hash function that can be computed more quickly than a brute force search.
The KDF should take non-trivial time to compute, so that brute forcing the tiny password space is still very difficult. Ideally it should be made as difficult as brute forcing the entire key space, but it is rare to push it that far.
The first point is the answer to your question of "why don't we care about collisions?" It is because collisions, while they could possibly exist, cannot be predicted in an computationally efficient manner. If collisions could be efficiently predicted, then your KDF function is not indistinguishable from random.
A KDF is not the same as just "repeated hashing." Repeated hashing can be distinguished from random, and is subject to significant attacks (most notably length-extension attacks).
PBKDF2, as a specific KDF example, is proven to be computationally indistinguishable from random, as long as it is provided with a pseudorandom function (PRF). A PRF is defined as itself being computationally indistinguishable from random. PBDFK2 uses HMAC, which is proven to be a PRF as long as it is provided a hashing function that is at least weakly collision resistant (the requirement is actually a bit weaker than even that).
Note the word "proven" here. Good cryptography lives on top of mathematical security proofs. It is not just "tie a lot of knots and hope it holds."
So that's a little tiny bit of the math behind why we're not worried about collisions, but let's also consider some intuition about it.
The total number of 16-character (absurdly long) passwords that can be easily typed on a common English keyboard is about 95^16 or 2^105 (that doesn't count the 15, 14, 13, etc length passwords, but since 95^16 is almost two orders of magnitude larger than 95^15, it's close enough). Now, consider that for each password, we're going to randomly map it to 10,000 intermediate keys (via 10,000 iterations of PBKDF2). That gets us up to 2^118 random choices that we hope never collide in our hash. What are the chances?
Well, 2^256 (our total space) divided by 2^118 (our keys) is 2^138. That means we're using much less than 10^-41 of the space for all passwords that could even be remotely likely. If we're picking these randomly (and the definition of a PRF says we are), the chances of two colliding are, um, small. And if two somehow did, no attacker would ever be able to predict it.
Take away lesson: Use PBKDF2 (or another good KDF like scrypt or bcrypt) to convert passwords into keys. Use a lot of iterations (10,000-100,000 at a minimum). Do not worry about the collisions.
You may be interested in a little more discussion of this in Brute-Forcing Passwords.

As the second snippet said, each iteration makes each "guess" a hacker makes take longer, therefore increasing the total time it will take then to crack an average password.
Search space is all the possible hashes for a password after however many iterations you are using. Each iteration decreases the search space.
Because of #1, as the size of the search space decreases, the time to check each possibility increases, balancing out that negative effect.
According to the second snippet, answers #1 and #3 say it actually isn't.
I hope this makes sense, it's a very complicated topic.

The reason to iterate is to make it harder for an attacker to brute force the hash. If you have a single round of hashing for a value, then in order to precompute a table for cracking that hash, you need to do 1 * keyspace hashes. If you do 1000 hashes of the value, then it would require the work of 1000 * keyspace.
Search space generally refers to the total number of combinations of characters that could make up a password.
I would say that the reduction of search space is irrelevant because passwords are generally not cracked by attempting 0000000, then 0000001, etc. They are instead attempted to be cracked by using dictionaries and combinatorics. There is essentially a realm of passwords that are likely to get cracked (like "password", "abcdef1", "goshawks", etc.), but creating a larger work factor will make it much more difficult for an attacker to hit all of the likely passwords in the space. Combining that with a salt, means they have to do all of the work for those likely passwords, for every hash they want to crack.
The reduction in search space becomes relevant if you are trying to crack something that is random and could take up any value in the search space.

When preventing timing attacks, is it safe to exit on different lengths?

In some libraries, for example flask-bcrypt, we can see that the code exits early if the two strings are different lengths:
def constant_time_compare(val1, val2):
'''Returns True if the two strings are equal, False otherwise.
The time taken is independent of the number of characters that match.
'''
if len(val1) != len(val2):
return False
result = 0
for x, y in zip(val1, val2):
result |= ord(x) ^ ord(y)
return result == 0
Is this really safe? Surely this reveals to an attacker that the two strings were different lengths early and leaks information?

When preventing timing attacks, is it safe to exit on different lengths?
Generally no, but it's really dependent on the situation.
The function itself
This function will leak information with a timing attack regardless of the length comparison because it's running time is always dependent on the length of it's input.
With the length compare, the running time will change when both inputs are the same length.
Without the length compare, the running time will change based on the length of the shorter input (beause of zip). Once the attacker controlled input exceeds the length of the other input, running time will remain constant.
The running time of this function is so short though (unscientific testing shows less than 0.1ms for 32 bytes of input) that, in a real life situation, it would fairly difficult for an attacker to take advantage of this because of other factors such as variance in network latency. The attacker would probably need to already be on the machine where the code is executing to really make use of this weakness.
Concerning flask-bcrypt
In the context of flask-bcrypt though, this function is only used for comparing hashes, not direct user input. Because the hash length that bcrypt outputs is fixed, the return False should never actually execute. Hence, no timing attack exists for this function when used with bcrypt.
Flask-bcrypt uses this function for checking equality because the running time for normal string comparison in python (==) will change based on the content of the strings. Consider two nearly identical strings of the same length, if the first character of the two strings are different, == comparison will complete faster than if the difference occurs at the last character of the strings.
I would argue though that constant time string comparison is really unnecessary in this case. The goal of the attacker is to deduce the stored hash value based on processing time, the attacker needs to know what hash value is produced by their input to achieve this. The only way to know what hash is being produced though is for the attacker to know the workfactor and salt, and if they have this information then they already have the hash as well (because they're all stored together). In which case, there's no reason to perform the attack to begin with.

How (if at all) does a predictable random number generator get more secure after SHA-1ing its output?

This article states that
Despite the fact that the Mersenne Twister is an extremely good pseudo-random number generator, it is not cryptographically secure by itself for a very simple reason. It is possible to determine all future states of the generator from the state the generator has at any given time, and either 624 32-bit outputs, or 19,937 one-bit outputs are sufficient to provide that state. Using a cryptographically-secure hash function, such as SHA-1, on the output of the Mersenne Twister has been recommended as one way of obtaining a keystream useful in cryptography.
But there are no references on why digesting the output would make it any more secure. And honestly, I don't see why this should be the case. The Mersenne Twister has a period of 2^19937-1, but I think my reasoning would also apply to any periodic PRNG, e.g. Linear Congruential Generators as well. Due to the properties of a secure one-way function h, one could think of h as an injective function (otherwise we could produce collisions), thus simply mapping the values from its domain into its range in a one-to-one manner.
With this thought in mind I would argue that the hashed values will produce exactly the same periodical behaviour as the original Mersenne Twister did. This means if you observe all values of one period and the values start to recur, then you are perfectly able to predict all future values.
I assume this to be related to the same principle that is applied in password-based encryption (PKCS#5) - because the domain of passwords does not provide enough entropy, simply hashing passwords doesn't add any additional entropy - that's why you need to salt passwords before you hash them. I think that exactly the same principle applies here.
One simple example that finally convinced me: Suppose you have a very bad PRNG that will always produce a "random number" of 1. Then even if SHA-1 would be a perfect one-way function, applying SHA-1 to the output will always yield the same value, thus making the output no less predictable than previously.
Still, I'd like to believe there is some truth to that article, so surely I must have overlooked something. Can you help me out? To a large part, I have left out the seed value from my arguments - maybe this is where the magic happens?

The state of the mersenne twister is defined by the previous n outputs, where n is the degree of recurrence (a constant). As such, if you give the attacker n outputs straight from a mersenne twister, they will immediately be able to predict all future values.
Passing the values through SHA-1 makes it more difficult, as now the attacker must try to reverse the RNG. However, for a 32-bit word size, this is unlikely to be a severe impediment to a determined attacker; they can build a rainbow table or use some other standard approach for reversing SHA-1s, and in the event of collisions, filter candidates by whether they produce the RNG stream observed. As such, the mersenne twister should not be used for cryptographically sensitive applications, SHA-1 masking or no. There are a number of standard CSPRNGs that may be used instead.

An attacker is able to predict the output of MT based on relatively few outputs not because it repeats over such a short period (it doesn't), but because the output leaks information about the internal state of the PRNG. Hashing the output obscures that leaked information. As #bdonlan points out, though, if the output size is small (32 bits, for instance), this doesn't help, as the attacker can easily enumerate all valid plaintexts and precalculate their hashes.
Using more than 32 bits of PRNG output as an input to the hash would make this impractical, but a cryptographically secure PRNG is still a much better choice if you need this property.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string