How long does it take to crack a hash? - security

I want to calculate the time it will take to break a SHA-256 hash. So I did some research and found the following calculation. If I have a password made of lowercase letters with a length of 6 chars, I would have 26^6 passwords, right?
To calculate the time, I have to divide this number by a hashrate, I guess. So if I had one RTX 3090, the hashrate would be 120 MH/s (1.2*10^8 H/s), and then I need to calculate 26^6 / (1.2*10^8) to get the time in seconds, right?
Is this idea right or wrong?

Yes, but a lowercase-Latin 6-character string is also short enough that you would expect to compute this one time and put it into a database so that you could look it up in O(1). It's only a bit over 300M entries. That said, given you're 50% likely to find the answer in the first half of your search, it's so fast to crack that you might not even bother unless you were doing this often. You don't even need a particularly fancy GPU for something on this scale.
Note that in many cases a 6-character string can also be a 5-character string, so you need to add 26^6 + 26^5 + 26^4 + ..., but all of these together only raise the total to around 320M hashes. It's a tiny space.
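For concreteness, here's a rough back-of-the-envelope sketch in Python, using the 1.2*10^8 H/s figure from the question (real GPU hash rates vary widely with hardware and implementation, so treat the result as an order of magnitude only):

# Rough brute-force time estimate for lowercase-only passwords up to 6 chars.
# The 1.2e8 H/s rate is the figure assumed in the question; actual rates vary.
ALPHABET = 26
MAX_LEN = 6
HASH_RATE = 1.2e8  # hashes per second (assumption)

keyspace = sum(ALPHABET ** n for n in range(1, MAX_LEN + 1))
print(f"keyspace : {keyspace:,} candidates")            # ~321 million
print(f"worst case: {keyspace / HASH_RATE:.2f} s")      # ~2.7 s
print(f"expected  : {keyspace / HASH_RATE / 2:.2f} s")  # ~1.3 s on average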
Adding uppercase, numbers and the easily typed symbols gets you up to 96^6 ~ 780B. On the other hand, adding just 3 more lowercase letters (9 total) gets you to 26^9 ~ 5.4T. For brute force on random strings, longer is much more powerful than complicated.
To your specific question, note that it does matter how you implement this. You won't get these kinds of hash rates if you don't write your code in a way to maximize the GPU. For example, writing simple code that sends one value to the GPU to hash at a time, and then compares the result on the CPU could be incredibly slow (in some cases slower than just doing all the work on a CPU). Setting up your memory efficiently and maximizing things the GPU can do in parallel are very important. If you're not familiar with this kind of programming, I recommend using or studying a tool like John the Ripper.

Related

Why is it unsafe to use Linear Congruential Generator to shuffle cards in online casino? How long will it take to crack this system?

I know that a Linear Congruential Generator is not recommended where high randomness and security are needed, but I don't know why. If I use a Linear Congruential Generator to generate a random number for a shuffle algorithm, is it easy to crack? If it is, how long will it take to crack it?
Linear congruential generators have several major flaws:
They have very little internal state. Some generators may have as few as 64 thousand possible starting states -- as a result, using one of these would mean that there are only 64 thousand possible shuffled decks. This makes it very easy to identify which one of those decks is being used at any given point.
Their future and past behavior can be perfectly determined based on their state. Once an attacker is able to gather enough information to guess the state of a LCRNG being used to shuffle cards, they can determine both what all the future shuffles will be, and what any past shuffles were.
They frequently suffer from statistical biases. One common flaw is that low-order bits will follow a short cycle -- for instance, in many generators, the least significant bit of the raw output will flip between 1 and 0 on subsequent outputs.
The first two issues are what will cause you the most trouble here. The exact severity will depend on the size of the LCRNG's state. That being said, if a 32-bit LCRNG is being used, its state can probably be guessed from about 32 bits of output. The position of one card reveals roughly lg(52) ≈ 5.7 bits of information, so the complete state can probably be guessed after seeing about 6 (32 ÷ 5.7 ≈ 5.6) cards.
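To illustrate the second flaw, here is a small Python sketch using a toy 32-bit LCG (the multiplier and increment are the well-known Numerical Recipes constants, chosen purely for illustration): once the internal state is recovered, every future output is fully determined.

# Toy 32-bit LCG; constants from Numerical Recipes, for illustration only.
M, A, C = 2**32, 1664525, 1013904223

def lcg(state):
    return (A * state + C) % M

# "Casino" side: generate a few outputs from a secret seed.
state = 123456789            # secret seed
outputs = []
for _ in range(5):
    state = lcg(state)
    outputs.append(state)

# Attacker side: for this generator the raw output *is* the state, so the
# last observed value lets us predict every future output exactly.
predicted = lcg(outputs[-1])
state = lcg(state)
print(predicted == state)    # True - the next "shuffle" is known in advance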

Find substring in stream without storing substring in plain text

Let's say I have a large stream of data (for example packets coming in from a network), and I want to determine if this data contains a certain substring. There are multiple string searching algorithms, but they require the algorithm to know the plain text string they are searching for.
Let's say the string being sought is a password, and you do not want to store it in plain text in this search application. It would, however, appear in the stream as plain text. You could, for example, store the hash and length of the password. Then, for every byte in the stream, check whether the next length bytes of data hash to the stored password hash; if they do, you have a probable match.
That way you can determine if the password was in the stream, without knowing the password. However, hashing once for every byte is not fast/efficient.
Is there perhaps a clever algorithm that could find the plain-text password in the stream without directly knowing the plain-text password (only some non-reversible equivalent)? Alternatively, could a lower-quality version of the password be used, with the risk of false positives? For example, if the search application only knew half the password (in plain text), it could, with some error, detect the full password without knowing it.
thanks
P.S. This question comes from a hypothetical discussion I had with some friends, about alerting you if your password was spotted in plain text on a network.
You could use a low-entropy rolling hash to pre-screen each byte so that, for the cost of lg k bits of entropy, you reduce the number of invocations of the cryptographic hash by a factor of k.
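A rough Python sketch of that idea (a polynomial rolling hash over a 16-bit space as the cheap pre-screen, SHA-256 as the expensive confirmation; the base, modulus and example strings are arbitrary choices for illustration):

import hashlib

def make_detector(password: bytes):
    # Store only the length, the SHA-256 digest, and a deliberately
    # low-entropy (16-bit) rolling-hash value of the password.
    n = len(password)
    B, MOD = 257, 1 << 16                  # base and 16-bit modulus (arbitrary)

    def roll(data):
        h = 0
        for b in data:
            h = (h * B + b) % MOD
        return h

    target_roll = roll(password)
    target_sha = hashlib.sha256(password).digest()
    top = pow(B, n - 1, MOD)               # weight of the byte leaving the window

    def scan(stream: bytes) -> bool:
        if len(stream) < n:
            return False
        h = roll(stream[:n])
        for i in range(len(stream) - n + 1):
            if h == target_roll:                          # cheap check first
                if hashlib.sha256(stream[i:i + n]).digest() == target_sha:
                    return True                           # expensive check only on pre-screen hits
            if i + n < len(stream):                       # slide the window by one byte
                h = ((h - stream[i] * top) * B + stream[i + n]) % MOD
        return False

    return scan

detect = make_detector(b"hunter2")
print(detect(b"...user=bob&pass=hunter2&..."))   # True
print(detect(b"nothing to see here"))            # False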
SAT is an NP-hard problem. Suppose your password is n characters long. If you could find a way to make a large enough SAT instance that
used a contiguous sequence of m >= n bytes from the data stream as its 8m input bits, and
produced the output 1 if and only if the bits present at its inputs contain your password starting at an offset that is some multiple of 8 bits
then by "operating" this SAT instance as a circuit, you would have a password detector that is (at least potentially) very difficult to "invert".
In some ways, what you want is the opposite of Boolean logic minimisation. You want the biggest, hairiest circuit (ideally for some theoretically justified notions of size and hairiness :) ) that computes the truth table. It's easy enough to come up with truth-table-preserving ways to grow the original CNF propositional logic formula -- e.g., if you have two clauses A and B, then you can always safely add a new clause consisting of all the literals in either A or B -- but it's probably much harder to come up with ways to grow the formula in ways that will confuse a modern SAT solver, since a lot of research has gone into making these programs super-efficient at detecting and exploiting all kinds of structure in the problem.
One possible avenue for injecting "complications" is to make the circuit compute functions that are difficult for circuits to compute, like divisions or square roots, and then test the results of these for equality in addition to the raw inputs. E.g., instead of making the circuit merely test that X[1 .. 8n] = YOUR_PASSWORD, make it test that X[1 .. 8n] = YOUR_PASSWORD AND sqrt(X[1 .. 8n]) = sqrt(YOUR_PASSWORD). If a SAT solver is smart enough to "see" that the first test implies the second then it can immediately dispense with all the clauses corresponding to the second -- but since everything is represented at a very low level with propositional clauses, this relationship is (I hope; as I said, modern SAT solvers are pretty amazing) well obscured. My guess is that it's better to choose functions like sqrt() that are not one-to-one on integers: this will potentially cause a SAT solver to waste time exploring seemingly promising (but ultimately incorrect) solutions.

Iterate over hash function though it reduces search space

I was reading this article regarding the number of times you should hash your password.
A salt is added to the password before the password is hashed, to safeguard against dictionary attacks and rainbow table attacks.
The commenters on the answer by ORIP stated:
hashing a hash is not something you should do, as the possibility of hash collision increases with each iteration which may reduce the search space (salt doesn't help), but this is irrelevant for password-based cryptography. To reach the 256-bit search space of this hash you'd need a completely random password, 40 characters long, from all available keyboard characters (log2(94^40))
The answer by erickson recommended
With pre-computation off the table, an attacker has to compute the hash on each attempt. How long it takes to find a password now depends entirely on how long it takes to hash a candidate. This time is increased by iteration of the hash function. The number of iterations is generally a parameter of the key derivation function; today, a lot of mobile devices use 10,000 to 20,000 iterations, while a server might use 100,000 or more. (The bcrypt algorithm uses the term "cost factor", which is a logarithmic measure of the time required.)
My questions are:
1) Why do we iterate over the hash function, since each iteration reduces the search space and hence makes it easier to crack the password?
2) What does search space mean?
3) Why is the reduction of search space irrelevant for password-based cryptography?
4) When is the reduction of search space relevant?
Let's start with the basic question: What is a search space?
A search space is the set of all values that must be searched in order to find the one you want. In the case of AES-256, the total key space is 2^256. This is a really staggeringly large number. This is the number that most people are throwing around when they say that AES cannot be brute forced.
The search space of "8-letter sequences of lowercase letters" is 26^8, or about 200 billion (~2^37), which from a cryptographic point of view is a tiny, insignificant number that can be searched pretty quickly. It's less than 3 days at 1,000,000 checks per second. Real passwords are chosen out of much smaller sets, since most people don't type 8 totally random letters. (You can up this with upper case and numbers and symbols, but people pick from a tiny set of those, too.)
OK, so people like to type short, easy passwords, but we want to make them hard to brute-force. So we need a way to convert "easy to guess passwords" into "hard to guess key." We call this a Key Derivation Function (KDF). We need two things for it:
The KDF must be "computationally indistinguishable from random." This means that there is no inverse of the hash function that can be computed more quickly than a brute force search.
The KDF should take non-trivial time to compute, so that brute forcing the tiny password space is still very difficult. Ideally it should be made as difficult as brute forcing the entire key space, but it is rare to push it that far.
The first point is the answer to your question of "why don't we care about collisions?" It is because collisions, while they could possibly exist, cannot be predicted in a computationally efficient manner. If collisions could be efficiently predicted, then your KDF is not indistinguishable from random.
A KDF is not the same as just "repeated hashing." Repeated hashing can be distinguished from random, and is subject to significant attacks (most notably length-extension attacks).
PBKDF2, as a specific KDF example, is proven to be computationally indistinguishable from random, as long as it is provided with a pseudorandom function (PRF). A PRF is defined as itself being computationally indistinguishable from random. PBKDF2 uses HMAC, which is proven to be a PRF as long as it is provided a hashing function that is at least weakly collision resistant (the requirement is actually a bit weaker than even that).
Note the word "proven" here. Good cryptography lives on top of mathematical security proofs. It is not just "tie a lot of knots and hope it holds."
So that's a little tiny bit of the math behind why we're not worried about collisions, but let's also consider some intuition about it.
The total number of 16-character (absurdly long) passwords that can be easily typed on a common English keyboard is about 95^16 or 2^105 (that doesn't count the 15, 14, 13, etc length passwords, but since 95^16 is almost two orders of magnitude larger than 95^15, it's close enough). Now, consider that for each password, we're going to randomly map it to 10,000 intermediate keys (via 10,000 iterations of PBKDF2). That gets us up to 2^118 random choices that we hope never collide in our hash. What are the chances?
Well, 2^256 (our total space) divided by 2^118 (our keys) is 2^138. That means we're using much less than 10^-41 of the space for all passwords that could even be remotely likely. If we're picking these randomly (and the definition of a PRF says we are), the chances of two colliding are, um, small. And if two somehow did, no attacker would ever be able to predict it.
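As a quick sanity check, those exponents can be re-derived in a few lines of Python (this just reproduces the arithmetic above, nothing more):

import math
passwords = 16 * math.log2(95)         # ~105 bits of 16-char keyboard passwords
keys = passwords + math.log2(10_000)   # ~118 bits after 10,000 intermediate keys each
print(passwords, keys, 256 - keys)     # head-room of roughly 2^138 in a 256-bit space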
Take away lesson: Use PBKDF2 (or another good KDF like scrypt or bcrypt) to convert passwords into keys. Use a lot of iterations (10,000-100,000 at a minimum). Do not worry about the collisions.
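In Python, for example, the standard library already exposes PBKDF2; a minimal sketch follows (the 16-byte salt, 100,000 iterations, and SHA-256 choice are illustrative defaults, not a recommendation tuned to your hardware):

import hashlib, hmac, os

def derive_key(password: str, salt: bytes = None, iterations: int = 100_000):
    # Random 16-byte salt and 100,000 iterations are illustrative defaults.
    salt = salt if salt is not None else os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations, dklen=32)
    return salt, iterations, key

def verify(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations, dklen=32)
    return hmac.compare_digest(key, expected)   # constant-time comparison

salt, rounds, key = derive_key("correct horse battery staple")
print(verify("correct horse battery staple", salt, rounds, key))  # True
print(verify("Tr0ub4dor&3", salt, rounds, key))                   # False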
You may be interested in a little more discussion of this in Brute-Forcing Passwords.
As the second snippet said, each iteration makes each "guess" a hacker makes take longer, therefore increasing the total time it will take them to crack an average password.
Search space is all the possible hashes for a password after however many iterations you are using. Each iteration decreases the search space.
Because of #1, as the size of the search space decreases, the time to check each possibility increases, balancing out that negative effect.
According to the second snippet, answers #1 and #3 say it actually isn't.
I hope this makes sense, it's a very complicated topic.
The reason to iterate is to make it harder for an attacker to brute force the hash. If you have a single round of hashing for a value, then in order to precompute a table for cracking that hash, you need to do 1 * keyspace hashes. If you do 1000 hashes of the value, then it would require the work of 1000 * keyspace.
Search space generally refers to the total number of combinations of characters that could make up a password.
I would say that the reduction of search space is irrelevant because passwords are generally not cracked by attempting 0000000, then 0000001, etc. They are instead cracked by using dictionaries and combinatorics. There is essentially a realm of passwords that are likely to get cracked (like "password", "abcdef1", "goshawks", etc.), but creating a larger work factor will make it much more difficult for an attacker to hit all of the likely passwords in the space. Combining that with a salt means they have to do all of that work for those likely passwords, for every hash they want to crack.
The reduction in search space becomes relevant if you are trying to crack something that is random and could take up any value in the search space.

Comparing long strings by their hashes

Trying to improve the performance of a function that compares strings, I decided to compare them by comparing their hashes.
So is there a guarantee that if the hashes of 2 very long strings are equal to each other, then the strings are also equal to each other?
While it's guaranteed that 2 identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce the same hash.
This is true due to the pigeonhole principle.
That being said, the chances of 2 different strings producing the same hash can be made infinitesimal, to the point of being considered negligible.
A fairly classic example of such a hash is MD5, which has a near-perfect 128-bit distribution. This means that you have one chance in 2^128 that 2 different strings produce the same hash. Well, basically, almost the same as impossible.
In the simple common case where two long strings are to be compared to determine if they are identical or not, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings must be traversed in order to calculate the two hash values, whereas the simple compare is fast, and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by @AdamLiss and @Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by @Cyan, if the compare is to be done more than once, or must be stored for later use, then the hash may be faster. A case not mentioned by others is if the strings are on different machines connected via a local network or the Internet. Passing a small amount of data between the two machines will generally be much faster. The simplest first check is to compare the sizes of the two; if they differ, you're done. Otherwise, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine), and again, if they differ, you are done. If the hash values are the same, and if you must have absolute certainty, there is no easy shortcut to that certainty. Using lossless compression on both ends will allow less data to be transferred for comparison. And finally, if the two strings are separated by time, as alluded to by @Cyan, if you want to know if a file has changed since yesterday, and you have stored the hash from yesterday's version, then you can compare today's hash to it.
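As a sketch of that "separated by space or time" case, in Python you might compare by size first and then by a stored or transmitted digest (the file names and chunk size below are hypothetical):

import hashlib, os

def fingerprint(path: str, chunk_size: int = 1 << 20):
    # Return (size, sha256 hex digest): only ~40 bytes need to be stored or sent.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return os.path.getsize(path), h.hexdigest()

def probably_identical(path_a: str, path_b: str) -> bool:
    size_a, hash_a = fingerprint(path_a)
    size_b, hash_b = fingerprint(path_b)
    if size_a != size_b:       # cheap check: different sizes, definitely different
        return False
    return hash_a == hash_b    # equal hashes: identical with overwhelming probability

# e.g. print(probably_identical("yesterday.dump", "today.dump"))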
I hope this will help stimulate some "out of the box" ideas for someone.
I am not sure your performance will be improved. Both building a hash and comparing integers, and simply comparing strings with equals, have the same complexity, which lies in O(n), where n is the number of characters.

How safely can I assume uniqueness of part of a SHA1 hash?

I'm currently using SHA1 to somewhat shorten a URL:
Digest::SHA1.hexdigest("salt-" + url)
How safe is it to use only the first 8 characters of the SHA1 as a unique identifier, like GitHub does for commits apparently?
To calculate the probability of a collision with a given length and the number of hashes that you have, see the birthday problem. I don't know the number of hashes that you are going to have, but here are some examples. 8 hexadecimal characters is 32 bits, so for about 100 hashes the probability of a collision is about 1/1,000,000, for 10,000 hashes it's about 1/100, and for 100,000 it's about 3/4, etc.
See the table in the Birthday attack article on Wikipedia to find a good hash length that would satisfy your needs. For example if you want the collision to be less likely than 1/1,000,000,000 for a set of more than 100,000 hashes then use 64 bits, or 16 hexadecimal digits.
It all depends on how many hashes you are going to have and what probability of a collision you are willing to accept (because there is always some probability, even if insanely small).
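Those figures follow from the standard birthday-problem approximation p ≈ 1 − e^(−n(n−1)/2N). A small Python helper to reproduce them (the example counts are the ones quoted above):

import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    # Approximate birthday-problem collision probability for n random values.
    space = 2 ** hash_bits
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))

for n in (100, 10_000, 100_000):
    print(n, collision_probability(n, 32))   # ~1e-6, ~1e-2, ~0.69 for 32-bit prefixes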
If you're talking about a SHA-1 in hexadecimal, then you're only getting 4 bits per character, for a total of 32 bits across the 8 characters. By the birthday bound, you can expect a collision once you have roughly the square root of that many values, i.e. about 65,536 entries. If your URL shortener gets used much, it probably won't take terribly long before you start to see collisions.
As for alternatives, the most obvious is probably to just maintain a counter. Since you need to store a table of URLs to translate your shortened URL back to the original, you basically just store each new URL in your table. If it was already present, you give its existing number. Otherwise, you insert it and give it a new number. Either way, you give that number to the user.
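A minimal in-memory sketch of that counter approach (a real shortener would of course back this with a persistent datastore; the class and example URL are made up):

import string

ALPHABET = string.ascii_letters + string.digits   # base-62 output alphabet

class Shortener:
    def __init__(self):
        self.by_url, self.by_id = {}, {}

    def shorten(self, url: str) -> str:
        if url not in self.by_url:          # reuse the existing number if seen before
            n = len(self.by_url)
            self.by_url[url] = n
            self.by_id[n] = url
        return self._encode(self.by_url[url])

    def expand(self, short: str) -> str:
        return self.by_id[self._decode(short)]

    def _encode(self, n: int) -> str:
        digits = []
        while True:
            n, r = divmod(n, len(ALPHABET))
            digits.append(ALPHABET[r])
            if n == 0:
                return "".join(reversed(digits))

    def _decode(self, s: str) -> int:
        n = 0
        for ch in s:
            n = n * len(ALPHABET) + ALPHABET.index(ch)
        return n

s = Shortener()
print(s.shorten("https://example.com/a/very/long/path"))   # short IDs grow with the counter
print(s.expand(s.shorten("https://example.com/a/very/long/path")))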
It depends on what you are trying to accomplish. The output of SHA1 is effectively random with regards to the input (the output of a good hash function changes in half of its bits based on a one-bit change in the input, and SHA1, while not perfect, is pretty good), and by taking a 32-bit (assuming 8 hex digits) subset of the 160-bit output, you reduce the output space from 2^160 to 2^32 values. All things being equal, which they never are, this would significantly reduce the difficulty of finding a collision.
However, if the hash function's input must be a valid URL, that significantly reduces the number of possible inputs. @rsp points out the birthday problem, but given this, I'm not sure exactly how applicable it is at least in its simple form. Also, it largely assumes that there are no other precautions in place.
I would be more interested in why you are doing this. Is this about URLs that the user will need to remember and type? If so, tacking on a bunch of random hexadecimal digits is probably a bad idea. Is it a URL or URL parameter that will just be passed around programmatically? Then, I wouldn't care much about length. Either way, there are probably better ways to do what you are trying to accomplish.
If you use a binary output for SHA1 and Base64 encode the result, you will get much higher information density per character; you can have the same 8-character names, but rather than only 16^8 (2^32) possibilities, you'll have 64^8 (2^48) possibilities.
Using the assumption that the 50% probability of collision scales with 1.177*sqrt(N), a Base64-style encoding will require 256 times more inputs than the hex output before reaching a 50% collision probability.
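A quick Python illustration of the hex-versus-Base64 density difference (the "salt-" prefix mirrors the question; URL-safe Base64 is used here so the identifier stays usable in a URL, and the example URL is made up):

import base64, hashlib

url = "https://example.com/some/long/path"
digest = hashlib.sha1(("salt-" + url).encode()).digest()

hex_id = digest.hex()[:8]                               # 8 hex chars    -> 32 bits of the hash
b64_id = base64.urlsafe_b64encode(digest).decode()[:8]  # 8 base64 chars -> 48 bits of the hash
print(hex_id, b64_id)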

Resources