How many times can I compose an MD5 function with itself? - security

For studying purposes it would be useful to find out how many times I can compose an MD5 function with itself without getting the same value.
This is a parallel/complementary approach to salting, because this way the value gets harder to crack using brute force.

Seemingly infinite in practice. MD5 also has known collision attacks, but even an ideal 128-bit hash must eventually repeat: the output space is finite (2^128 digests), so by the pigeonhole principle the sequence must cycle at some point, and by the birthday bound a repeat is expected after roughly 2^64 iterations.
The following Ruby code will cyclically apply the MD5 hashing algorithm until a duplicate is detected, at which point it prints the number of cycles required to reach the duplication point. The original string is randomly generated from alphabetical characters.
require 'set'
require 'digest'

keys = Set.new
o = [('a'..'z'), ('A'..'Z')].map(&:to_a).flatten
string = (0...10).map { o[rand(o.length)] }.join
count = 0
until keys.include?(string)
  count += 1
  puts count
  keys << string
  string = Digest::MD5.digest(string)
end
puts count
This continues to run past 15 million cycles... I will update once a duplicate has been found.
Update: due to the limited resources of my machine, I had to halt the above script after 75,933,338 cycles without a collision (the set had allocated ~8 GB of memory).
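Incidentally, the growing set is only needed to detect the repeat; Floyd's tortoise-and-hare cycle detection would do the same in constant memory. A minimal Python sketch (illustrative helper names, not from the original answer; still infeasible to run to completion, since a repeat is only expected after ~2^64 steps):

import hashlib

def md5_step(b):
    # one application of MD5, raw 16-byte digest
    return hashlib.md5(b).digest()

def steps_until_cycle_detected(seed):
    # Floyd's tortoise-and-hare: O(1) memory instead of an ~8 GB set
    tortoise = md5_step(seed)
    hare = md5_step(md5_step(seed))
    steps = 1
    while tortoise != hare:
        tortoise = md5_step(tortoise)
        hare = md5_step(md5_step(hare))
        steps += 1
    return steps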

Related

How concerned should we be about brute force attacks when creating passwords?

Are brute force attacks effective in the real world? Almost every service I have ever used where a password was needed had some type of protection against trying even more than a few passwords. Also, how does this translate to data encryption?
brute force - a.k.a. exhaustive key search - is a method that can work on multiple attack vectors
what you describe is attacking a service by authenticating through the service. The service here acts as a gatekeeper, and can prevent the attack by limiting your attempts, etc.
However that is only one attack vector to the system:
what if an attacker gains access to the user database by other means? will he be able to leverage this knowledge to access the service now?
if the owner of said service was dumb enough to store clear text passwords ... ouch... but let's assume the passwords are stored in a way that uses at least a one-way function like a hash function
let's see what changed...
we can have as many attempts on the password guessing side as our available computational resources allow ...
the only remaining gatekeeper here is the complexity of guessing a password
if we only need to check value == hash(password) we can easily see that there are other ways of efficiently attacking this than brute force ...
but you asked for brute force specifically so let's assume our passwords are salted...
value == hash(password,salt) ... no more rainbow tables... but this is still bad, since we can possibly calculate that very quickly and parallelize ... the limiting factor that remains is hardware ... if we throw in a number of (rented? ... hacked?) number crunching systems with enough cores / GPUs ... how fast can we crack a password by brute force? ... let's do a little math here ...
let's say we have one system that does our work ... doing roughly 1,000,000 password guesses per second ... let's say our password has an entropy of ~42 bits ... that's roughly a 9-letter lowercase password ... how long does it take?
~51 days until a guaranteed positive guess if the attacker knows the allowed char pool (lower-alpha in our case)
you can assume that after 50% of that time the probability of a correct guess is 50% so... not very long...
but it gets worse ... we assumed one machine ... does it scale if we buy/rent/hack more? sure it does ... let's assume 10,000 machines ...
we can simply divide by 10k ... and we are down to roughly 7 minutes 20 seconds ...
what does it cost to rent 10k machines from a cloud provider for 10 minutes? ... as an example (we didn't specify the complexity of our hash function so this example might not be in the range of those guesses per second, but you will see where this goes) ... an a1.xlarge amazon ec2 on-demand instance costs roughly 10 us-cents per hour ... we need it for ... 10 min ... so divide by 6 ... we want 10k nodes ... round about 170 USD ... let's round up ... 200 USD as a rough estimate
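to sanity-check that math, a few lines of Python (same arbitrary assumptions as above, not measurements):

entropy_bits = 42                 # ~9 lowercase letters: 9 * log2(26) ~ 42.3
guesses_per_second = 1_000_000    # one assumed machine
machines = 10_000                 # assumed rented/hacked fleet

total_guesses = 2 ** entropy_bits
one_machine = total_guesses / guesses_per_second   # seconds on one machine
print(one_machine / 86400)              # ~50.9 days for a guaranteed hit
print(one_machine / machines / 60)      # ~7.3 minutes with the fleet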
so... computational resources have become available in vast numbers for relatively little money if we wanted to rent them... now think about malware capturing computational resources
so ... with all those computational resources on the bad guy side ... what can we do?
we can make guessing harder ...
if we increase the password complexity from those 42 bits above to ... let's say lower and upper alphanumeric and a length of 16 chars ... that's an entropy of roughly 96 bits ... and our little machine with 1,000,000 guesses per second will take roughly 2.5 quadrillion years to find our password (guaranteed... if it does not malfunction, has enough power, and you find a way to make it survive the end of our sun in a few billion years... but that's another story)
but back to the real world .. we take another cloud setup that can guess 100,000,000,000,000 passwords per second ... 10k times the size of the 10k machines we calculated earlier with... it will take us 25 million years to have a guaranteed hit ... probable success after 50% of that time ... so why does that not counter the increase of those additional 54bit entropy? ... exponential growth ... each bit more doubles the work...
another thing we can do besides the increase of our entropy is to increase the calculation complexity of a guess
if we chain our hash function like hash(hash(hash(...))) for a variable number of rounds, we can increase the time it takes to calculate the result ...
for a regular login, it does not really matter if your login takes 10 ms or 250 ms, but for an attacker, that difference means he has to afford 25 times the computational resources or it takes 25 times as long... not as mighty as our entropy increase, but still... it's a measurable effect at very little cost on our side
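a rough illustration of such chaining (a naive sketch ... real systems should use a vetted construction like PBKDF2, bcrypt, scrypt, or Argon2 rather than a hand-rolled loop):

import hashlib

def stretched_hash(password, salt, rounds=100_000):
    # each extra round multiplies the attacker's per-guess cost
    digest = hashlib.sha256(salt + password).digest()
    for _ in range(rounds - 1):
        digest = hashlib.sha256(digest).digest()
    return digest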
when it comes to data encryption we usually don't talk about passwords but encryption keys ... a few bytes of entropy... at this level it's basically the same calculation as above ... at around 80 to 90 bits it's rather hard and expensive ... at around 40 ... whoopsy it's open ...
the interesting part is when you have to use a password to encrypt something ... how to get from the password to the needed bunch of key bytes? hashing? probably... but how?
the good thing is ... other folks had that very same problem before us ... and that resulted in well documented and examined functions like PBKDF2 (Password-Based Key Derivation Function 2 ... you can kind of see it as a hash function with variable length output ... it creates a pseudo-random bit stream and it's up to you to take as many bits as you need... typically the key size of a symmetric cipher)
PBKDF2 has a round parameter to increase computation time ... so basically... it's the very same as above
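in Python, for instance, this is a one-liner with hashlib.pbkdf2_hmac (the password and iteration count below are arbitrary example values):

import hashlib, os

salt = os.urandom(16)
key = hashlib.pbkdf2_hmac('sha256', b'my password', salt,
                          600_000,   # rounds: the tunable work factor
                          dklen=32)  # 32 bytes, e.g. for a 256-bit symmetric cipher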

list.count() vs Counter() performance

While trying to find the frequency of a bunch of characters in a string, why does running string.count(character) 4 times for 4 different characters yield a faster execution time (using time.time()) than using a collections.Counter(string)?
Background:
Given a sequence of moves represented by a string. Valid moves are R (right), L (left), U (up), and D (down). Return True if the sequence of moves takes me back to the origin. Otherwise, return false.
# approach 1: iterate 4 times (3.9*10^-6 seconds)
def foo1(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')

# approach 2: iterate once (3.9*10^-5 seconds)
def foo2(moves):
    from collections import Counter
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']

import time
start = time.time()
moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
foo1(moves)
# foo2(moves)
end = time.time()
print("--- %s seconds ---" % (end - start))
These results are the opposite of what I expected. My reasoning is that the first approach should take longer because the string is iterated over four times, whereas in the second approach we iterate only once. Could it be due to the library call overhead?
Counter is faster in theory, but has higher fixed overhead, especially compared to str.count, which can scan the underlying C array with direct memory comparisons (list.count, by contrast, has to do rich comparisons for each element). Converting moves to a list of single characters nearly triples the time for foo1 in local tests, from 448 ns to 1.3 μs, while foo2 actually gets a tiny bit faster, dropping from 5.6 μs to 5.48 μs.
Other problems:
Importing an already imported module uses the cached import, but there is a surprising amount of overhead involved in even a cached import (the loading machinery has a lot of stuff to check to make sure it's okay to do so); in local tests, moving from collections import Counter to the top level reduced the runtime of foo2 by 1.6 μs (5.6 μs with single global import, 7.2 μs with local per-call import). This will vary a lot by environment; on another machine (with less stuff installed in both user and system site-packages), the overhead was only 0.75 μs. Regardless, it's a significant, avoidable disadvantage for foo2.
Counter on modern Python uses a C accelerator to speed up counting, but the accelerator only provides a benefit when the iterable is long enough. If you use the list form of moves, but multiply it by 100 to make a longer sequence, the difference drops, relatively speaking (to 106 µs for foo1 vs. 140 µs for foo2).
You're just not counting very many things; when there are only four things you care about, paying O(n) four times can easily beat paying O(n) once if the former case has lower constant multipliers (which aren't included in big-O notation) than the latter. Counter remains O(n) for any number of unique things being counted; calling .count is O(n) per call, but if you need to know the count of every unique thing in the input, for inputs that are mostly unique, individual .count calls for each will be asymptotically O(n²).
The .count approach is short-circuiting in your specific case, so it isn't even doing O(n) work four times, just twice; the U and D counts don't match, so it never counts L and R at all. Counter doesn't get meaningfully slower if it can't short-circuit (all the cost is paid in the single counting pass), but your foo1, in the same benchmark I used from point #2 (longer input, in list form), goes from 106 µs to 185 µs if I just add a single D to the end of the (pre-multiplication) moves (making the U and D counts the same, and requiring two more count calls); foo2 only goes up to 143 µs (from 140 µs), presumably because moves actually got longer (adding the D before multiplying by 100 meant it went from 2900 elements to count to 3000).
Basically, you had some minor implementation weaknesses, but mostly, you happened to choose a use case that gave all the advantage to .count, none to Counter. If your inputs are always str, and you're only counting them a small, fixed number of times, then sure, repeated calls to count are generally going to win. But for arbitrary input types (especially iterators, where count is impossible, both because it doesn't exist, and because you can only iterate it once), especially larger ones, with more unique things to count, where consistent performance counts (so relying on short-circuiting to reduce the number of count calls isn't acceptable), Counter will win.
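If you want to reproduce comparisons like these yourself, timeit gives far steadier numbers than a one-shot time.time() measurement. A minimal sketch using the functions from the question (with the import moved to module level, per point #2):

import timeit
from collections import Counter

def foo1(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')

def foo2(moves):
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']

moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
print(timeit.timeit(lambda: foo1(moves), number=100_000))  # total seconds for 100k calls
print(timeit.timeit(lambda: foo2(moves), number=100_000))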

How many characters should a session key be for security?

I am generating a session key to be stored in a cookie using the following function:
function getRandomKey($length = 32) {
    $string = '';
    $characters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    for ($i = 0; $i < $length; $i++) {
        $string .= $characters[mt_rand(0, strlen($characters) - 1)];
    }
    return $string;
}
If I were to generate a 1-character key it would have:
26 lowercase + 26 uppercase + 10 digits (0-9) = 62 options.
Therefore an 8-character key would have 62^8, or 218,340,105,584,896 possible combinations.
1) Is there any rule of thumb on how many characters out I should go? The more the better, I know, but is 8 enough or should it be more like 32 characters, 64 etc.?
2) Are there any security concerns when using localStorage?
Thanks in advance!
These are two very different questions.
1) TL;DR: about 16 characters (case-sensitive) is ok for most purposes.
First, please, if you can, avoid implementing session management yourself. It is already done in many frameworks, including session id generation and more - use an existing, well-known implementation if you can, because it is not straightforward to get right.
Now, it's all about entropy. You started out right by calculating the number of possible combinations. If you take log2 of that, you get how many bits of entropy that session id has. (Well, let's not go into entropy here...)
So one case-sensitive alphanumeric character ([a-zA-Z0-9]) has log2(62)=5.9542 bits of entropy, two characters two times more, and so on.
The time required for an attacker to guess a valid session id is:
(2^b + 1) / (2 * n * s)
Where 'b' is the available bits of entropy in the session id, 'n' is the number of guesses the attacker can make every second, and 's' is the number of valid session ids in the system.
In a large, distributed web application, potentially using a botnet, an attacker may be able to make n = 100,000 guesses a second, and there may be s = 1 million valid session ids. You want the result to be several hundred years at the very least, say 300 years (about 9.5 billion seconds). (These are totally arbitrary values.)
This gives about b=70, so you need 70 bits of entropy. If each character has 5.9542 bits of entropy as discussed above, it gives about 12 for the required session id length, but you can just round it up to 16 to make sure. :)
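The same calculation in a few lines of Python, inverting the formula above for the required bits (n, s, and the 300-year target are the arbitrary values just discussed):

import math

n = 100_000                  # attacker guesses per second
s = 1_000_000                # valid session ids in the system
target = 300 * 31_536_000    # 300 years, in seconds

b = math.log2(2 * n * s * target - 1)   # solve t = (2^b + 1) / (2 n s) for b
print(b)                     # ~70.7 bits needed
print(b / math.log2(62))     # ~11.9 characters -> round up to 16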
As a rule of thumb, it is sometimes assumed that the bits of entropy in a session id are half the length (in bits) of that session id. It is mostly a reasonable approximation without any calculation. :) Even more so because session ids are sometimes actual random numbers, base64 or otherwise encoded; different encodings usually give different results though.
Also make sure to use a cryptographic random number generator, otherwise entropy is much less. Note that mt_rand() is not cryptographically random, so the code in your question is vulnerable!
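As an illustration of doing this right (shown in Python; PHP 7+ offers random_bytes/random_int as the CSPRNG equivalents of mt_rand):

import secrets

session_id = secrets.token_urlsafe(16)   # 16 random bytes = 128 bits of entropy
print(session_id)                        # 22 URL-safe base64 characters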
2) TL;DR Yes. (I suppose you mean using local storage for storing the session id.)
The best possible place to store a session id is an httpOnly, Secure cookie without an expiration (non-persistent), because Javascript cannot access it there (so, for example, cross-site scripting cannot steal a victim user's session id), and being non-persistent, it will be removed when the user closes the browser and will not be persisted to disk (well, mostly... but that's a long story).
If you use localStorage, any XSS will directly expose the session id, which is very valuable for an attacker. Also, sessions will survive closing the browser, which is slightly unexpected - user sessions might easily be hijacked on shared computers.
Note though that this depends on the use-case and the risk you want to take. While it would definitely not be ok for a financial application where you can access and manage very sensitive data, it can be ok for less risky applications. You can also let the user decide ("remember me", in which case you put it into localStorage), but most users are not aware of the associated risk, so they can't make an informed decision.
Also note that sessionStorage is a little better, because the session id will be removed from the browser when it is closed, but it is still available to Javascript (XSS).

Why is string manipulation more expensive?

I've heard this so many times that I have taken it for granted. But thinking back on it, can someone help me understand why string manipulation, say comparison, is more expensive than for an integer or some other primitive?
8bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to do 100 + 100, each 100 fits in 1 byte of memory, so I need only two bytes to sum the two numbers. The result will again need 1 byte.
Now let's try with strings: "100" + "100" is 3 bytes plus 3 bytes, and the result, "100100", needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.
The int data type in C# was carefully selected to be a good match with processor design, which can store an int in a CPU register, a storage location that's easily a factor of 3 faster than memory. Comparing two values of type int takes a single CPU instruction; the CMP instruction runs in less than a single CPU cycle, a fraction of a nanosecond.
That doesn't work nearly as well for a string, it is a variable length data type and every single char in the string must be compared to test for equality. So it is automatically proportionally slower by the size of the string. Furthermore, string comparison is afflicted by culture dependent comparison rules. The kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. Nothing subtle to deal with, taken care of by highly optimized table-driven code inside the CLR. It can't beat CMP.
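The size-proportional cost is easy to see in a micro-benchmark (an illustrative Python sketch; exact numbers vary, and CPython's string equality short-circuits on identity and length before the byte-wise scan, so the strings below differ only in their last character to force a full scan):

import timeit

a, b = 123456789, 987654321    # fixed-size compare, close to one machine instruction
s = "x" * 99 + "a"
t = "x" * 99 + "b"             # equal lengths, so ~100 characters are scanned

print(timeit.timeit(lambda: a == b, number=1_000_000))
print(timeit.timeit(lambda: s == t, number=1_000_000))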
I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.
There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (add, subtract, multiply, divide, compare) is most often done by the CPU at the hardware level, in a few (or even one) instructions. When the manipulation is done, the answer fits back in the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require either moving the string to a bigger block of memory, or moving other stuff around in memory to allow the existing block to grow.
Any manipulation that leaves the string the same, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and contents moved.
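The reallocation cost is easy to observe in a language with immutable strings (a Python sketch; repeated += may copy the growing string on each step, while join sizes the result once):

import timeit

parts = ["x"] * 10_000

def concat():
    out = ""
    for p in parts:
        out += p          # may reallocate and copy the growing result
    return out

print(timeit.timeit(concat, number=100))
print(timeit.timeit(lambda: "".join(parts), number=100))   # one allocation, one pass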

File containing its own checksum

Is it possible to create a file that will contain its own checksum (MD5, SHA1, whatever)? And to preempt the jokers: I mean the checksum value in plain form, not a function calculating it.
I created a piece of code in C, then ran bruteforce for less than 2 minutes and got this wonder:
The CRC32 of this string is 4A1C449B
Note there must be no characters (end of line, etc.) after the sentence.
You can check it here:
http://www.crc-online.com.ar/index.php?d=The+CRC32+of+this+string+is+4A1C449B&en=Calcular+CRC32
This one is also fun:
I killed 56e9dee4 cows and all I got was...
Source code (sorry it's a little messy) here: http://www.latinsud.com/pub/crc32/
Yes. It's possible, and it's common with simple checksums. Getting a file to include its own md5sum would be quite challenging.
In the most basic case, create a checksum value which will cause the summed modulus to equal zero. The checksum function then becomes something like
(n1 + n2 ... + CRC) % 256 == 0
The checksum then becomes a part of the file, and is checked itself. A very common example of this is the Luhn algorithm used in credit card numbers: the last digit is a check digit, and is itself part of the 16-digit number.
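A toy version of that zero-sum trick (a hypothetical Python sketch):

def pad_to_zero_sum(data):
    check = (-sum(data)) % 256      # the one byte that makes the total 0 mod 256
    return data + bytes([check])

blob = pad_to_zero_sum(b"hello world")
assert sum(blob) % 256 == 0         # the file now "contains" its own checksum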
Check this:
echo -e '#!/bin/bash\necho My cksum is 918329835' > magic
"I wish my crc32 was 802892ef..."
Well, I thought this was interesting so today I coded a little java program to find collisions. Thought I'd leave it here in case someone finds it useful:
import java.util.zip.CRC32;

public class Crc32_recurse2 {
    public static void main(String[] args) throws InterruptedException {
        long endval = Long.parseLong("ffffffff", 16);
        long startval = 0L;
        // startval = Long.parseLong("802892ef", 16); // uncomment to save yourself some time
        float percent = 0;
        long time = System.currentTimeMillis();
        long updates = 10000000L; // how often to print some status info
        for (long i = startval; i < endval; i++) {
            String testval = Long.toHexString(i);
            String cmpval = getCRC("I wish my crc32 was " + testval + "...");
            if (testval.equals(cmpval)) {
                System.out.println("Match found!!! Message is:");
                System.out.println("I wish my crc32 was " + testval + "...");
                System.out.println("crc32 of message is " + testval);
                System.exit(0);
            }
            if (i % updates == 0) {
                if (i == 0) {
                    continue; // kludge to avoid divide by zero at the start
                }
                long timetaken = System.currentTimeMillis() - time;
                long speed = updates / timetaken * 1000;
                percent = (i * 100.0f) / endval;
                long timeleft = (endval - i) / speed; // in seconds
                System.out.println(percent + "% through - " + "done " + i / 1000000 + "M so far"
                        + " - " + speed + " tested per second - " + timeleft
                        + "s till the last value.");
                time = System.currentTimeMillis();
            }
        }
    }

    public static String getCRC(String input) {
        CRC32 crc = new CRC32();
        crc.update(input.getBytes());
        return Long.toHexString(crc.getValue());
    }
}
The output:
49.825756% through - done 2140M so far - 1731000 tested per second - 1244s till the last value.
50.05859% through - done 2150M so far - 1770000 tested per second - 1211s till the last value.
Match found!!! Message is:
I wish my crc32 was 802892ef...
crc32 of message is 802892ef
Note the dots at the end of the message are actually part of the message.
On my i5-2500 it was going to take ~40 minutes to search the whole crc32 space from 00000000 to ffffffff, doing about 1.8 million tests/second. It was maxing out one core.
I'm fairly new with java so any constructive comments on my code would be appreciated.
"My crc32 was c8cb204, and all I got was this lousy T-Shirt!"
Certainly, it is possible. But one of the uses of checksums is to detect tampering of a file - how would you know if a file has been modified, if the modifier can also replace the checksum?
Sure, you could concatenate the digest of the file itself to the end of the file. To check it, you would calculate the digest of all but the last part, then compare it to the value in the last part. Of course, without some form of encryption, anyone can recalculate the digest and replace it.
edit
I should add that this is not so unusual. One technique is to concatenate a CRC-32 so that the CRC-32 of the whole file (including that digest) is zero. This won't work with digests based on cryptographic hashes, though.
I don't know if I understand your question correctly, but you could make the first 16 bytes of the file the checksum of the rest of the file.
So before writing a file, you calculate the hash, write the hash value first and then write the file contents.
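A minimal sketch of that layout (illustrative Python; the helper names are made up):

import hashlib

def write_with_digest(payload):
    # 16-byte MD5 of the payload goes first, the contents follow
    return hashlib.md5(payload).digest() + payload

def verify(blob):
    return blob[:16] == hashlib.md5(blob[16:]).digest()

assert verify(write_with_digest(b"file contents"))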
There is a neat implementation of the Luhn mod N algorithm in the python-stdnum library (see luhn.py). The calc_check_digit function will calculate a digit or character which, when appended to the file (expressed as a string), will create a valid Luhn mod N string. As noted in many answers above, this gives a sanity check on the validity of the file, but no significant security against tampering. The receiver will need to know which alphabet is being used to define Luhn mod N validity.
If the question is asking whether a file can contain its own checksum (in addition to other content), the answer is trivially yes for fixed-size checksums, because a file could contain all possible checksum values.
If the question is whether a file could consist of its own checksum (and nothing else), it's trivial to construct a checksum algorithm that would make such a file impossible: for an n-byte checksum, take the binary representation of the first n bytes of the file and add 1. Since it's also trivial to construct a checksum that always encodes itself (i.e. do the above without adding 1), clearly there are some checksums that can encode themselves, and some that cannot. It would probably be quite difficult to tell which of these a standard checksum is.
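To make both constructions concrete (a hypothetical Python sketch):

def impossible_checksum(data, n=4):
    # first n bytes as an integer, plus 1: no file can start with its own
    # checksum under this definition
    v = int.from_bytes(data[:n].ljust(n, b"\0"), "big")
    return ((v + 1) % 2 ** (8 * n)).to_bytes(n, "big")

def trivial_checksum(data, n=4):
    # first n bytes verbatim: every file trivially contains this checksum
    return data[:n].ljust(n, b"\0")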
There are many ways to embed information in order to detect transmission errors, etc. CRC checksums are good at detecting runs of consecutive bit-flips and might be added in such a way that the checksum of the whole is always, e.g., 0. These kinds of checksums (including error-correcting codes) are, however, easy to recreate and don't stop malicious tampering.
It is impossible to embed something in the message so that the receiver can verify its authenticity if the receiver knows nothing else about/from the sender. The receiver could, for instance, share a secret key with the sender. The sender can then append a keyed checksum built on a cryptographic hash (MD5 and SHA-1 were traditional choices, though no longer recommended). It is also possible to use asymmetric encryption, where the sender publishes his public key and signs the checksum/hash with his private key. The hash and the signature can then be tagged onto the data as a new kind of checksum. This is done all the time on the internet nowadays.
The remaining problems then are: 1. how can the receiver be sure that he got the right public key, and 2. how secure is all this stuff in reality? The answer to 1 may vary. On the internet it's common to have the public key signed by someone everyone trusts. Another simple solution is that the receiver got the public key at a meeting in person. The answer to 2 may change from day to day, but what's costly to break today will probably be cheap to break some time in the future. By that time, new algorithms and/or enlarged key sizes will hopefully have emerged.
You can, of course, but in that case the SHA digest of the whole file will not be the SHA you included, because it is a cryptographic hash function, so changing a single bit in the file changes the whole hash. What you are looking for is a checksum calculated using the content of the file in a way that matches a set of criteria.
Sure.
The simplest way would be to run the file through an MD5 algorithm and embed that data within the file. You can split up the checksum and place it at known points of the file (based on portions of the file size, e.g. 30%, 50%, 75%) if you wish to try to hide it.
Similarly you could encrypt the file, or encrypt a portion of the file (along with the MD5 checksum) and embed that in the file.
Edit
I forgot to say that you would need to remove the checksum data before verifying the file.
Of course if your file needs to be readily readable by another program e.g. Word then things become a little more complicated as you don't want to "corrupt" the file so that it is no longer readable.
