In some libraries, for example flask-bcrypt, we can see that the code exits early if the two strings are different lengths:
    def constant_time_compare(val1, val2):
        '''Returns True if the two strings are equal, False otherwise.
        The time taken is independent of the number of characters that match.
        '''
        if len(val1) != len(val2):
            return False
        result = 0
        for x, y in zip(val1, val2):
            result |= ord(x) ^ ord(y)
        return result == 0
Is this really safe? Surely this reveals to an attacker that the two strings were different lengths early and leaks information?
When preventing timing attacks, is it safe to exit on different lengths?
Generally no, but it's really dependent on the situation.
The function itself
This function will leak information with a timing attack regardless of the length comparison because its running time always depends on the length of its input.
With the length compare, the running time will change when both inputs are the same length.
Without the length compare, the running time will change based on the length of the shorter input (because of zip). Once the attacker-controlled input exceeds the length of the other input, running time will remain constant.
The running time of this function is so short though (unscientific testing shows less than 0.1ms for 32 bytes of input) that, in a real-life situation, it would be fairly difficult for an attacker to take advantage of this because of other factors such as variance in network latency. The attacker would probably need to already be on the machine where the code is executing to really make use of this weakness.
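To make the zip behaviour concrete, here is a small sketch. The function compare_without_length_check is a hypothetical variant of the code above with the length test removed; it is not part of flask-bcrypt. It suggests that dropping the check does more than change timing: because zip stops at the shorter input, a strict prefix would be accepted as equal.

```python
def compare_without_length_check(val1, val2):
    """Hypothetical variant of constant_time_compare without the length test."""
    result = 0
    # zip stops as soon as the shorter input is exhausted
    for x, y in zip(val1, val2):
        result |= ord(x) ^ ord(y)
    return result == 0


# Only three pairs are examined, however long the second string is.
pairs_examined = len(list(zip("abc", "abcdef")))

# The prefix "abc" is reported as equal to "abcdef".
prefix_accepted = compare_without_length_check("abc", "abcdef")
```

So the length test is needed for correctness in any case; the open question is only what its early return leaks.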
Concerning flask-bcrypt
In the context of flask-bcrypt though, this function is only used for comparing hashes, not direct user input. Because the hash length that bcrypt outputs is fixed, the return False should never actually execute. Hence, no timing attack exists for this function when used with bcrypt.
Flask-bcrypt uses this function to check equality because the running time of a normal string comparison in Python (==) changes with the content of the strings. Consider two nearly identical strings of the same length: if they differ at the first character, the == comparison completes faster than if they differ at the last character.
I would argue, though, that constant-time string comparison is unnecessary in this case. The attacker's goal is to deduce the stored hash value from processing time, and to do that the attacker needs to know what hash value their input produces. The only way to know that is to know the work factor and salt, and an attacker who has those already has the hash as well (because they're all stored together). In that case there's no reason to perform the attack to begin with.
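For what it's worth, Python 3.3 and later ship a comparison in the standard library that is designed for this purpose, hmac.compare_digest. A minimal sketch follows; the wrapper name hashes_match is made up for illustration and is not flask-bcrypt's API.

```python
import hmac


def hashes_match(stored_hash: bytes, candidate_hash: bytes) -> bool:
    """Compare two digests without short-circuiting on the first mismatch."""
    # compare_digest accepts two bytes-like objects (or two ASCII-only str).
    return hmac.compare_digest(stored_hash, candidate_hash)


same = hashes_match(b"$2b$12$abcdef", b"$2b$12$abcdef")
different = hashes_match(b"$2b$12$abcdef", b"$2b$12$abcdeX")
```

The byte strings here are placeholders, not real bcrypt hashes.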
Related
Github's securing webhooks page says:
Using a plain == operator is not advised. A method like secure_compare performs a “constant time” string comparison, which renders it safe from certain timing attacks against regular equality operators.
I use bcrypt.compare('string', 'computed hash') when comparing passwords.
What makes this a "secure compare" and can I do this using the standard crypto library in Node?
The point of a "constant time" string comparison is that the comparison will take the exact same amount of time no matter what the comparison target is (the unknown value). This "constant time" reveals no information to an attacker about what the unknown target value might be. The usual solution is that all characters are compared, even after a mismatch is found so no matter where a mismatch is found, the comparison runs in the same amount of time.
Other forms of comparison might return an answer in a shorter time when certain conditions are true which allows an attacker to learn what they might be missing. For example, in a typical string comparison, the comparison will return false as soon as an unequal character is found. If the first character does not match, then the comparison will return in a shorter amount of time than if it does. A diligent attacker can use this information to make a smarter brute force attack.
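As a rough illustration of what leaks, the sketch below counts character comparisons instead of measuring time. Both helper names are invented for this example. The early-exit count grows with the length of the matching prefix, which is the signal an attacker would try to observe through timing; the full-scan count does not.

```python
def early_exit_steps(secret: str, guess: str) -> int:
    """Comparisons done by a naive check that stops at the first mismatch."""
    steps = 0
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            break
    return steps


def full_scan_steps(secret: str, guess: str) -> int:
    """Comparisons done when every position is examined regardless of mismatches."""
    steps = 0
    diff = 0
    for s, g in zip(secret, guess):
        steps += 1
        diff |= ord(s) ^ ord(g)  # accumulate differences, never exit early
    return steps


wrong_first_char = early_exit_steps("s3cret", "xxxxxx")   # stops immediately
right_prefix = early_exit_steps("s3cret", "s3xxxx")       # gets further
```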
A "constant time" comparison eliminates this extra information because no matter how the two strings are unequal, the function will return its value in the same amount of time.
In looking at the nodejs v4 crypto library, I don't see any signs of a function to do constant time comparison and per this post, there is a discussion about the fact that the nodejs crypto library is missing this functionality.
EDIT: Node v6 now has crypto.timingSafeEqual(a, b).
There is also such a constant time comparison function available in this buffer-equal-constant-time module.
jfriend's answer is correct in general, but in terms of this specific context (comparing the output of a bcrypt operation with what is stored in the database), there is no risk with using "==".
Remember, bcrypt is designed to be a one-way function that is specifically built to resist password guessing attacks when the attacker gets hold of the database. If we assume that the attacker has the database, then the attacker does not need timing leak information to know which byte of his guess for the password is wrong: he can check that himself by simply looking at the database. If we assume the attacker does not have the database, then timing leak information could potentially tell him which byte of his guess was wrong in a scenario that is ideal for the attacker (not realistic at all). Even if he could get that information, the one-way property of bcrypt prevents him from exploiting the knowledge gained.
Summary: preventing timing attacks is a good idea in general, but in this specific context, you're not putting yourself in any danger by using "==".
EDIT: The bcrypt.compare( ) function already is programmed to resist timing attacks even though there is absolutely no security risk in not doing this.
Imagine a long block of material to compare. If the first block does not match and the compare function returns right then, you have leaked data to the attacker. He can work on the first block of data until the routine takes longer to return, at which time he will know that the first chunk matched.
2 ways to compare data that are more secure from timing attacks are to hash both sets of data and to compare the hashes, or to XOR all the data and compare the result to 0. If == just scans both blocks of data and returns if and when it finds a discrepancy, it can inadvertently play "warmer / colder" and guide the adversary right in on the secret text he wants to match.
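A sketch of the "hash both and compare the hashes" idea, written as a keyed variant: each comparison uses HMAC-SHA256 with a fresh random key, which is my assumption rather than something the answer specifies. With an unpredictable key, the position of the first differing digest byte should tell the attacker nothing useful about the original data. The name hashed_compare is invented for this example.

```python
import hashlib
import hmac
import os


def hashed_compare(a: bytes, b: bytes) -> bool:
    """Compare two byte strings by comparing keyed digests of them."""
    key = os.urandom(32)  # fresh random key per call (assumption of this sketch)
    mac_a = hmac.new(key, a, hashlib.sha256).digest()
    mac_b = hmac.new(key, b, hashlib.sha256).digest()
    # The digests are fixed length, so no length shortcut is involved here.
    return hmac.compare_digest(mac_a, mac_b)


match = hashed_compare(b"secret token", b"secret token")
mismatch = hashed_compare(b"secret token", b"secret tokem")
```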
Let's say I have a large stream of data (for example, packets coming in from a network), and I want to determine whether this data contains a certain substring. There are multiple string searching algorithms, but they require the algorithm to know the plain text string they are searching for.
Let's say the string being sought is a password, and you do not want to store it in plain text in this search application. It would, however, appear in the stream as plain text. You could, for example, store the hash and length of the password. Then, for every byte offset in the stream, hash the next length bytes; if the result equals the stored password hash, you have a probable match.
That way you can determine if the password was in the stream, without knowing the password. However, hashing once for every byte is not fast/efficient.
Is there perhaps a clever algorithm that could find the plain text password in the stream without directly knowing the plain text password (knowing instead some non-reversible equivalent)? Alternatively, could a low-quality version of the password be used, with the risk of false positives? For example, if the search application only knew half the password (in plain text), it could, with some error, detect the full password without knowing it.
thanks
P.S This question comes from a hypothetical discussion I had with some friends, about alerting you if your password was spotted in plain text on a network.
You could use a low-entropy rolling hash to pre-screen each byte so that, for the cost of lg k bits of entropy, you reduce the number of invocations of the cryptographic hash by a factor of k.
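Here is one way that pre-screening could look. Everything in it is an assumption chosen for illustration: a polynomial rolling hash, 8 stored screening bits (so k = 256), SHA-256 as the cryptographic hash, and the names make_detector and find_password. The detector stores only the password length, the 8 screening bits and the digest.

```python
import hashlib

BASE = 257
MOD = (1 << 61) - 1
SCREEN_BITS = 8                      # bits of the rolling hash that get stored
SCREEN_MASK = (1 << SCREEN_BITS) - 1


def rolling_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h


def make_detector(password: bytes) -> dict:
    if not password:
        raise ValueError("password must not be empty")
    return {
        "length": len(password),
        "screen": rolling_hash(password) & SCREEN_MASK,
        "digest": hashlib.sha256(password).digest(),
    }


def find_password(stream: bytes, detector: dict):
    """Return (offsets where the password occurs, number of SHA-256 calls made)."""
    n = detector["length"]
    hits, crypto_calls = [], 0
    if len(stream) < n:
        return hits, crypto_calls
    top = pow(BASE, n - 1, MOD)      # weight of the byte leaving the window
    h = rolling_hash(stream[:n])
    for i in range(len(stream) - n + 1):
        if i > 0:
            # Slide the window: drop stream[i-1], append stream[i+n-1].
            h = ((h - stream[i - 1] * top) * BASE + stream[i + n - 1]) % MOD
        if h & SCREEN_MASK == detector["screen"]:
            crypto_calls += 1        # only screened windows reach SHA-256
            if hashlib.sha256(stream[i:i + n]).digest() == detector["digest"]:
                hits.append(i)
    return hits, crypto_calls


detector = make_detector(b"hunter2")
hits, calls = find_password(b"GET /login?user=bob&pw=hunter2 HTTP/1.1", detector)
```

On random-looking data roughly one window in 256 should pass the screen, at the price of revealing 8 bits about the password to anyone who reads the detector.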
SAT is an NP-hard problem. Suppose your password is n characters long. If you could find a way to make a large enough SAT instance that
used a contiguous sequence of m >= n bytes from the data stream as its 8m input bits, and
produced the output 1 if and only if the bits present at its inputs contains your password starting at an offset that is some multiple of 8 bits
then by "operating" this SAT instance as a circuit, you would have a password detector that is (at least potentially) very difficult to "invert".
In some ways, what you want is the opposite of Boolean logic minimisation. You want the biggest, hairiest circuit (ideally for some theoretically justified notions of size and hairiness :) ) that computes the truth table. It's easy enough to come up with truth-table-preserving ways to grow the original CNF propositional logic formula -- e.g., if you have two clauses A and B, then you can always safely add a new clause consisting of all the literals in either A or B -- but it's probably much harder to come up with ways to grow the formula in ways that will confuse a modern SAT solver, since a lot of research has gone into making these programs super-efficient at detecting and exploiting all kinds of structure in the problem.
One possible avenue for injecting "complications" is to make the circuit compute functions that are difficult for circuits to compute, like divisions or square roots, and then test the results of these for equality in addition to the raw inputs. E.g., instead of making the circuit merely test that X[1 .. 8n] = YOUR_PASSWORD, make it test that X[1 .. 8n] = YOUR_PASSWORD AND sqrt(X[1 .. 8n]) = sqrt(YOUR_PASSWORD). If a SAT solver is smart enough to "see" that the first test implies the second then it can immediately dispense with all the clauses corresponding to the second -- but since everything is represented at a very low level with propositional clauses, this relationship is (I hope; as I said, modern SAT solvers are pretty amazing) well obscured. My guess is that it's better to choose functions like sqrt() that are not one-to-one on integers: this will potentially cause a SAT solver to waste time exploring seemingly promising (but ultimately incorrect) solutions.
Trying to improve the performance of a function that compares strings I decided to compare them by comparing their hashes.
So is there a guarantee if the hash of 2 very long strings are equal to each other then the strings are also equal to each other?
While it's guaranteed that 2 identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce the same hash.
This is true due to the pigeonhole principle.
That being said, the chances of 2 different strings producing the same hash can be made infinitesimal, to the point of being considered equivalent to zero.
A fairly classical example of such a hash is MD5, which has a near-perfect 128-bit distribution. This means you have roughly one chance in 2^128 that 2 different strings produce the same hash by accident. Well, basically, almost the same as impossible. (This only covers accidental collisions: MD5 collisions can be constructed deliberately, so it should not be relied on when someone may choose the strings maliciously.)
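The pigeonhole argument is easy to see with a deliberately tiny hash. In this sketch tiny_hash is an invented one-byte hash (the first byte of SHA-256), so only 256 values exist and any 257 distinct strings must contain a collision.

```python
import hashlib


def tiny_hash(s: str) -> int:
    """Toy hash with only 256 possible values (first byte of SHA-256)."""
    return hashlib.sha256(s.encode("utf-8")).digest()[0]


def find_collision(limit: int = 257):
    """Return two distinct strings with the same tiny_hash, or None."""
    seen = {}
    for i in range(limit):
        s = f"string-{i}"
        h = tiny_hash(s)
        if h in seen:
            return seen[h], s
        seen[h] = s
    return None


collision = find_collision()
```

With a 128-bit or 256-bit hash the same argument still applies; there are just far too many pigeonholes to run into a collision by accident.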
In the simple common case where two long strings are to be compared to determine if they are identical or not, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by #wildplasser, the hash requires that all bytes of both strings must be traversed in order to calculate the two hash values, whereas the simple compare is fast, and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by #AdamLiss and #Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by #Cyan if the compare is to be done more than once, or must be stored for later use, then hash may be faster. A case not mentioned by others is if the strings are on different machines connected via a local network or the Internet. Passing a small amount of data between the two machines will generally be much faster. The simplest first check is compare the size of the two, if different, you're done. Else, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine) and again, if different you are done. If the hash values are the same, and if you must have absolute certainty, there is no easy shortcut to that certainty. Using lossless compression on both ends will allow less data to be transferred for comparison. And finally, if the two strings are separated by time, as alluded to by #Cyan, if you want to know if a file has changed since yesterday, and you have stored the hash from yesterday's version, then you can compare today's hash to it.
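The "has it changed since yesterday" case might be sketched like this, assuming SHA-256 and two made-up helpers, fingerprint and has_changed. Only the small fingerprint needs to be stored or sent over the network, and the size is checked first because it is the cheapest test.

```python
import hashlib


def fingerprint(data: bytes):
    """Small summary of the data: (size, SHA-256 hex digest)."""
    return len(data), hashlib.sha256(data).hexdigest()


def has_changed(stored_fingerprint, current: bytes) -> bool:
    size, digest = stored_fingerprint
    if len(current) != size:
        return True  # different size: no need to hash at all
    return hashlib.sha256(current).hexdigest() != digest


yesterday = fingerprint(b"config v1\n" * 1000)
unchanged = has_changed(yesterday, b"config v1\n" * 1000)
edited = has_changed(yesterday, b"config v2\n" * 1000)
```

A False result means "almost certainly identical", not a guarantee; for absolute certainty the full data still has to be compared.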
I hope this will help stimulate some "out of the box" ideas for someone.
I am not sure your performance will improve. Both approaches, building the hashes and comparing integers, or simply comparing the strings with equals, have the same complexity: O(n), where n is the number of characters.
This article states that
Despite the fact that the Mersenne Twister is an extremely good pseudo-random number generator, it is not cryptographically secure by itself for a very simple reason. It is possible to determine all future states of the generator from the state the generator has at any given time, and either 624 32-bit outputs, or 19,937 one-bit outputs are sufficient to provide that state. Using a cryptographically-secure hash function, such as SHA-1, on the output of the Mersenne Twister has been recommended as one way of obtaining a keystream useful in cryptography.
But there are no references on why digesting the output would make it any more secure. And honestly, I don't see why this should be the case. The Mersenne Twister has a period of 2^19937-1, but I think my reasoning would also apply to any periodic PRNG, e.g. Linear Congruential Generators as well. Due to the properties of a secure one-way function h, one could think of h as an injective function (otherwise we could produce collisions), thus simply mapping the values from its domain into its range in a one-to-one manner.
With this thought in mind I would argue that the hashed values will produce exactly the same periodical behaviour as the original Mersenne Twister did. This means if you observe all values of one period and the values start to recur, then you are perfectly able to predict all future values.
I assume this to be related to the same principle that is applied in password-based encryption (PKCS#5) - because the domain of passwords does not provide enough entropy, simply hashing passwords doesn't add any additional entropy - that's why you need to salt passwords before you hash them. I think that exactly the same principle applies here.
One simple example that finally convinced me: Suppose you have a very bad PRNG that will always produce a "random number" of 1. Then even if SHA-1 would be a perfect one-way function, applying SHA-1 to the output will always yield the same value, thus making the output no less predictable than previously.
Still, I'd like to believe there is some truth to that article, so surely I must have overlooked something. Can you help me out? To a large part, I have left out the seed value from my arguments - maybe this is where the magic happens?
The state of the mersenne twister is defined by the previous n outputs, where n is the degree of recurrence (a constant). As such, if you give the attacker n outputs straight from a mersenne twister, they will immediately be able to predict all future values.
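This can be tried against CPython's random module, which uses MT19937. The sketch below takes 624 raw 32-bit outputs, undoes the output tempering on each, and loads the result into a second generator. The helper names are mine, and the state tuple passed to setstate follows the layout CPython's getstate returns (version 3, 624 words plus an index), which I am treating as an implementation detail rather than a stable API.

```python
import random


def undo_right_shift_xor(y: int, shift: int) -> int:
    """Invert y = x ^ (x >> shift) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x


def undo_left_shift_xor(y: int, shift: int, mask: int) -> int:
    """Invert y = x ^ ((x << shift) & mask) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & 0xFFFFFFFF


def untemper(y: int) -> int:
    """Undo the four MT19937 tempering steps in reverse order."""
    y = undo_right_shift_xor(y, 18)
    y = undo_left_shift_xor(y, 15, 0xEFC60000)
    y = undo_left_shift_xor(y, 7, 0x9D2C5680)
    y = undo_right_shift_xor(y, 11)
    return y


def clone_generator(observed):
    """Build a generator that continues the stream after 624 observed outputs."""
    state = tuple(untemper(v) for v in observed) + (624,)  # index 624: regenerate next
    clone = random.Random()
    clone.setstate((3, state, None))
    return clone


victim = random.Random(2024)
observed = [victim.getrandbits(32) for _ in range(624)]
clone = clone_generator(observed)
predicted = [clone.getrandbits(32) for _ in range(5)]
actual = [victim.getrandbits(32) for _ in range(5)]
```

The attacker never sees the seed; the observed outputs alone are enough.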
Passing the values through SHA-1 makes it more difficult, as now the attacker must try to reverse the hash to recover the RNG output. However, for a 32-bit word size, this is unlikely to be a severe impediment to a determined attacker; they can build a rainbow table or use some other standard approach for reversing SHA-1s, and in the event of collisions, filter candidates by whether they produce the RNG stream observed. As such, the Mersenne Twister should not be used for cryptographically sensitive applications, SHA-1 masking or not. There are a number of standard CSPRNGs that may be used instead.
An attacker is able to predict the output of MT based on relatively few outputs not because it repeats over such a short period (it doesn't), but because the output leaks information about the internal state of the PRNG. Hashing the output obscures that leaked information. As #bdonlan points out, though, if the output size is small (32 bits, for instance), this doesn't help, as the attacker can easily enumerate all valid plaintexts and precalculate their hashes.
Using more than 32 bits of PRNG output as an input to the hash would make this impractical, but a cryptographically secure PRNG is still a much better choice if you need this property.
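The enumeration is simple to sketch. To keep it quick to run, this toy version uses 16-bit words instead of the 32-bit words discussed above, and assumes each word is hashed on its own as 4 big-endian bytes; both choices, and the helper names, are assumptions of the example.

```python
import hashlib

WORD_BITS = 16  # toy size so the table builds quickly; MT words are 32 bits


def sha1_of_word(word: int) -> bytes:
    return hashlib.sha1(word.to_bytes(4, "big")).digest()


def build_lookup_table() -> dict:
    """Precompute the hash of every possible word, mapping digest -> word."""
    return {sha1_of_word(w): w for w in range(1 << WORD_BITS)}


table = build_lookup_table()
leaked_digest = sha1_of_word(4242)   # what the attacker observes
recovered = table[leaked_digest]     # the hidden PRNG output
```

A full 32-bit table is 65536 times larger, which is still practical for a determined attacker, whereas hashing several words together makes the input space too large to enumerate.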
Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to the question that hash algorithms on strings can be conducted in constant time, with one SO member who has 25 years' experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
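A sketch of that sampling idea, assuming Python's random module as the generator and SHA-256 for the final hash; the function name and defaults are made up. The work done is bounded by the two constants regardless of string length (len and indexing are constant time on Python strings).

```python
import hashlib
import random


def sampled_hash(s: str, seed_chars: int = 10, samples: int = 100) -> str:
    """Hash a fixed number of pseudo-randomly chosen characters of s."""
    if not s:
        return hashlib.sha256(b"").hexdigest()
    # Seeding with a str is deterministic across runs in Python 3.
    rng = random.Random(s[:seed_chars])
    picked = "".join(s[rng.randrange(len(s))] for _ in range(samples))
    return hashlib.sha256(picked.encode("utf-8")).hexdigest()


long_text = "lorem ipsum dolor sit amet " * 10_000
h1 = sampled_hash(long_text)
h2 = sampled_hash(long_text)
```

Two long strings that share their first 10 characters and length, and differ only at positions that are not sampled, will collide, so the quality depends heavily on the data.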
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then someone comes along and tries to hash an array of URLs. The hash function will see "http:/" for every single string.
Similar scenarios may occur for other character selection schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
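The URL case takes only a few lines to reproduce. Here prefix_hash is an invented constant-time hash that looks at the first 6 characters only, and the URLs are placeholders.

```python
import hashlib


def prefix_hash(s: str, prefix_len: int = 6) -> str:
    """Constant-time 'hash' that only reads the first prefix_len characters."""
    return hashlib.sha256(s[:prefix_len].encode("utf-8")).hexdigest()


urls = [
    "http://example.com/index.html",
    "http://example.org/about",
    "http://example.net/contact?id=42",
]

# Every URL starts with "http:/", so they all land in the same bucket.
distinct_hashes = {prefix_hash(u) for u in urls}
```

A hash table keyed this way would degrade to a linear scan over all the URLs.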
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function can not separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If ever such collision occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
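In Python the interning step is available as sys.intern, so the idea can be sketched as follows. The helper pointer_hash is made up, and using id() as the "pointer" is a CPython-flavoured assumption that only holds while the interned string is kept alive. Note that interning itself still has to read the whole string once; the constant-time part is every lookup after that.

```python
import sys


def pointer_hash(interned: str) -> int:
    """Hash the object's identity instead of its characters."""
    return hash(id(interned))


# Two equal strings built separately at run time...
a = sys.intern("".join(["time", "stamp"]))
b = sys.intern("".join(["time", "stamp"]))

# ...become the same object after interning, so identity stands in for equality.
same_object = a is b
same_hash = pointer_hash(a) == pointer_hash(b)
```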
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A = {s : h(s) = y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h' and hash the keys in the set A. There is at least one hash value y' such that the set A' = {s in A : h'(s) = y'} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. QED.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/