Is possible to distinguish strings encrypted with different cryptography algorithms?
If i have a set of N encrypted strings that comes from different cryptography algorithms (i.e. 100 from AES, 150 from tripleDES, etc... ) i want to know if is possible with a reasonable error that there is a sort of clustering of the strings (i.e. 111 in the AES-cluster, 139 in tripleDES-cluster) also with the simplification that the keys or the strings that are encrypted are the same and obviously without an a priori knowledge (even if there is a training could be interesting).
There are some works, papers, toy-example about that?
Thank you
Yes, you can distinguish some ciphers based on their ciphertexts, but this doesn't work for all modes of operation.
The key observation is that AES and Triple DES have different block sizes of 128 bit and 64 bit. Which means that a 7 byte message will be 8 bytes long in 3DES and 16 bytes long in AES. But padding does also have a role in this. PKCS#5 padding will add a whole block of padding if the plaintext size is a multiple of the block size. This means that an 8 byte message will be 16 byte long for 3DES and 16 byte long for AES.
For example: if the lengths of the plaintext messages are distributed uniformly, then there is a 50% chance that you can distinguish between the two, because 3DES can have 24 byte ciphertexts, but AES cannot. Or said differently, you can find out if it is 3DES in 50% of the time, but you cannot say for sure if AES was used. This zero padding the probability is the same, but the matching lengths are slightly different.
This holds true for ECB, CBC and some others. In CTR mode on the other hand the length of the ciphertext cannot be used, because the ciphertext has always the same length as the plaintext. CTR mode is essentially a stream cipher.
If the block sizes are not different, then there is no way to distinguish them, because modern ciphers are designed in a way to be indistinguishable from noise.
No, there's no way to distinguish one from another without there being some serious flaw in the algorithm. See here and here for a more detailed explanation.
As far as I can see methods to break a simple Caesar Cipher rely on patterns in the source data, such as the frequency of vowels in the English language. I struggle to see how a simple cipher on binary data could be compromised, assuming the key length is equal to or greater than the length of the data to be encrypted (so the key bytes never repeat) and assuming the source binary data is scrambled so any underlying pattern is removed.
If I offer you a byte value of 152, there is no mathematical way to determine that original data was 52 and the key was 100 without the key.
Are my assumptions here correct and if not how could this simple encryption method be broken?
key length is equal to or greater than the length of the data to be encrypted
You're describing one-time pad. It is secure.
Caesar ciphers have keys that are much shorter than the message.
Many of the encryption techniques I've seen can easily encrypt a simple 8 digit number like "12345678" but the result is often something like "8745b34097af8bc9de087e98deb8707aac8797d097f" (made up but you get the idea).
Is there a way to encrypt this 8 digit number but have the resulting encrypted value be the same or at least only a slightly longer number? An ideal target would be to end up with a 10 digit number or less. Is this possible while still maintaining a fairly strong encryption?
Update: I didn't make the output clear enough - I am wanting an 8-digit number to turn into an 8-digit number, not 8 bytes.
A lot here is going to depend on how seriously you mean your "public-key-encryption" tag. Do you actually want public key encryption, or are you just taking that possibility into account?
If you're willing to use symmetric encryption, producing 8 bytes of output from 8 bytes of input is pretty easy: just run 3DES in ECB (Electronic Code Book) mode, and that's what you'll get. The main weakness of ECB is that a given input will always produce the same result, so if your inputs might repeat an attacker will be able to see that repetition, and may be able to notice a pattern of "encrypted value X leads to action Y", even if they can't/don't break the encryption itself at all. If you can live with that, 3DES/ECB is probably your answer.
If you can't live with that, 3DES in CFB mode is probably the next best. This will produce 16 bytes of output from 8 bytes of input (note that it's not normally doubling the input size, but adding 8 bytes to the input size).
3DES is hardly what anybody would call a cutting edge algorithm, but I'd say it still qualifies as "fairly strong encryption". Part of its weakness as an algorithm stems from its relatively small block size, but that also minimizes expansion of the output.
Edit: Sorry, I forgot to the public-key possibility. With most public-key cryptography, the smallest result is roughly equal to the key size. With RSA encryption, that'll typically mean a minimum of something like 1024 bits (and often considerably more than that). To keep the result smaller, I'd probably use Elliptical Curve Cryptography, for which a ~200 bit key is reasonably secure against known attacks. This will still be larger than 3DES/CFB, but not outrageously so.
Well you could look a stream cipher which encrypts bytes 1:1. With N bytes input, there are N bytes encrypted/decrypted output. Such ciphers are usually based on an algorithm that creates a stream of random numbers, with the encryption key/IV acting as seed.
For some stream ciphers, look at the eSTREAM candidates. I don't know of any relevant attacks on HC-128 and HC-256, for example.
If I have just encrypted some plain-text into cipher-text with CBC and Rijndael, is it insecure to tell the world that the original plain-text had a length of x bytes? It seems that it's always the same as the length of the cipher-text, so, I think it does not matter, but are there some block modes or ciphers where it does matter?
You should use the PKCS5 padding scheme. This scheme always pads with extra information, including the number of extra bytes that have been added. Upon decryption, the last block is examined to see how many bytes should be discarded.
Information about the length of the original message is hard to suppress. Even with padding to block size, you can deduce that the original message is either n, n-1, n-2, ..., or n-blocksize+1 bytes in length. Most crypto protocols make little or no effort to hide the plaintext length.
The length of the output is not always the same. However, if the lengths of the plain-text are similar, the length of the cipher text can be the same due to padding (Rijndael/AES is a block cipher)
http://en.wikipedia.org/wiki/Padding_(cryptography)
And it is weakens your security if you tell the world the length of the plaintext, so unless there is a very good reason, i would advice against it.
Also take a look at how CBC works:
http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Cipher-block_chaining_.28CBC.29
The plain text is not typically the same length as the cipher text. The plain text is padded before encryption. If you are sending one of two different messages 'Yes' or 'No' then sending the original length is probably not a good idea. If all your messages are of the same length then it's OK to publish that length.
Rijndael is a block cipher, so the plain-text will be padded to a complete block. (This might be hidden by your encryption layer).
So the encrypted message text length will be indicative of the plain text length.
If you are leaking information by the message length, then you will have to do your own padding. Indeed the sending of a message might leak information, so you'll need to send no-op messages when you don't have anything to send.
I wrote a short C++ program to do XOR encryption on a file, which I may use for some personal files (if it gets cracked it's no big deal - I'm just protecting against casual viewers). Basically, I take an ASCII password and repeatedly XOR the password with the data in the file.
Now I'm curious, though: if someone wanted to crack this, how would they go about it? Would it take a long time? Does it depend on the length of the password (i.e., what's the big-O)?
The problem with XOR encryption is that for long runs of the same characters, it is very easy to see the password. Such long runs are most commonly spaces in text files. Say your password is 8 chars, and the text file has 16 spaces in some line (for example, in the middle of ASCII-graphics table). If you just XOR that with your password, you'll see that output will have repeating sequences of characters. The attacker would just look for any such, try to guess the character in the original file (space would be the first candidate to try), and derive the length of the password from length of repeating groups.
Binary files can be even worse as they often contain repeating sequences of 0x00 bytes. Obviously, XORing with those is no-op, so your password will be visible in plain text in the output! An example of a very common binary format that has long sequences of nulls is .doc.
I concur with Pavel Minaev's explanation of XOR's weaknesses. For those who are interested, here's a basic overview of the standard algorithm used to break the trivial XOR encryption in a few minutes:
Determine how long the key is. This
is done by XORing the encrypted data
with itself shifted various numbers
of places, and examining how many
bytes are the same.
If the bytes that are equal are
greater than a certain percentage
(6% according to Bruce Schneier's
Applied Cryptography second
edition), then you have shifted the
data by a multiple of the keylength.
By finding the smallest amount of
shifting that results in a large
amount of equal bytes, you find the
keylength.
Shift the cipher text by the
keylength, and XOR against itself.
This removes the key and leaves you
with the plaintext XORed with the
plaintext shifted the length of the
key. There should be enough
plaintext to determine the message
content.
Read more at Encryption Matters, Part 1
XOR encryption can be reasonably* strong if the following conditions are met:
The plain text and the password are about the same length.
The password is not reused for encrypting more than one message.
The password cannot be guessed, IE by dictionary or other mathematical means. In practice this means the bits are randomized.
*Reasonably strong meaning it cannot be broken by trivial, mathematical means, as in GeneQ's post. It is still no stronger than your password.
In addition to the points already mentioned, XOR encryption is completely vulnerable to known-plaintext attacks:
cryptotext = plaintext XOR key
key = cryptotext XOR plaintext = plaintext XOR key XOR plaintext
where XORring the plaintexts cancel each other out, leaving just the key.
Not being vulnerable to known-plaintext attacks is a required but not sufficient property for any "secure" encryption method where the same key is used for more than one plaintext block (i.e. a one-time pad is still secure).
Ways to make XOR work:
Use multiple keys with each key length equal to a prime number but never the same length for keys.
Use the original filename as another key but remember to create a mechanism for retrieving the filename. Then create a new filename with an extension that will let you know it is an encrypted file.
The reason for using multiple keys of prime-number length is that they cause the resulting XOR key to be Key A TIMES Key B in length before it repeats.
Compress any repeating patterns out of the file before it is encrypted.
Generate a random number and XOR this number every X Offset (Remember, this number must also be recreatable. You could use a RANDOM SEED of the Filelength.
After doing all this, if you use 5 keys of length 31 and greater, you would end up with a key length of approximately One Hundred Meg!
For keys, Filename being one (including the full path), STR(Filesize) + STR(Filedate) + STR(Date) + STR(Time), Random Generation Key, Your Full Name, A private key created one time.
A database to store the keys used for each file encrypted but keep the DAT file on a USB memory stick and NOT on the computer.
This should prevent the repeating pattern on files like Pictures and Music but movies, being four gigs in length or more, may still be vulnerable so may need a sixth key.
I personally have the dat file encrypted itself on the memory stick (Dat file for use with Microsoft Access). I used a 3-Key method to encrypt it cause it will never be THAT large, being a directory of the files with the associated keys.
The reason for multiple keys rather than randomly generating one very large key is that primes times primes get large quick and I have some control over the creation of the key and you KNOW that there really is no such thing as a truely random number. If I created one large random number, someone else can generate that same number.
Method to use the keys: Encrypt the file with one key, then the next, then the next till all keys are used. Each key is used over and over again till the entire file is encrypted with that key.
Because the keys are of different length, the overlap of the repeat is different for each key and so creates a derived key the length of Key one time Key two. This logic repeats for the rest of the keys. The reason for Prime numbers is that the repeating would occur on a division of the key length so you want the division to be 1 or the length of the key, hense, prime.
OK, granted, this is more than a simple XOR on the file but the concept is the same.
Lance
I'm just protecting against casual viewers
As long as this assumption holds, your encryption scheme is ok. People who think that Internet Explorer is "teh internets" are not capable of breaking it.
If not, just use some crypto library. There are already many good algorithms like Blowfish or AES for symmetric crypto.
The target of a good encryption is to make it mathematically difficult to decrypt without the key.
This includes the desire to protect the key itself.
The XOR technique is basically a very simple cipher easily broken as described here.
It is important to note that XOR is used within cryptographic algorithms.
These algorithms work on the introduction of mathematical difficulty around it.
Norton's Anti-virus used to use a technique of using the previous unencrypted letter as the key for next letter. That took me an extra half-hour to figure out, if I recall correctly.
If you just want to stop the casual viewer, it's good enough; I've used to hide strings within executables. It won't stand up 10 minutes to anyone who actually tries, however.
That all said, these days there are much better encryption methods readily available, so why not avail yourself of something better. If you are trying to just hide from the "casual" user, even something like gzip would do that job better.
Another trick is to generate a md5() hash for your password. You can make it even more unique by using the length of the protected text as an offset or combining it with your password to provide better distribution for short phrases. And for long phrases, evolve your md5() hash by combining each 16-byte block with the previous hash -- making the entire XOR key "random" and non-repetitive.
RC4 is essentially XOR encryption! As are many stream ciphers - the key is the key (no pun intended!) you must NEVER reuse the key. EVER!
I'm a little late in answering, but since no one has mentioned it yet: this is called a Vigenère cipher.
Wikipedia gives a number of cryptanalysis attacks to break it; even simpler, though, since most file-formats have a fixed header, would be to XOR the plaintext-header with the encrypted-header, giving you the key.
That ">6%" GeneQ mentions is the index of coincidence for English telegraph text - 26 letters, with punctuation and numerals spelled out. The actual value for long texts is 0.0665.
The <4% is the index of coincidence for random text in a 26-character alphabet, which is 1/26, or 0.385.
If you're using a different language or a different alphabet, the specific values will different. If you're using the ASCII character set, Unicode, or binary bytes, the specific values will be very different. But the difference between the IC of plaintext and random text will usually be present. (Compressed binaries may have ICs very close to that of random, and any file encrypted with any modern computer cipher will have an IC that is exactly that of random text.)
Once you've XORed the text against itself, what you have left is equivalent to an autokey cipher. Wikipedia has a good example of breaking such a cipher
http://en.wikipedia.org/wiki/Autokey_cipher
If you want to keep using XOR you could easily hash the password with multiple different salts (a string that you add to a password before hashing) and then combine them to get a larger key.
E.G. use sha3-512 with 64 unique salts, then hash your password with each salt to get a 32768 bit key that you can use to encrypt a 32Kib (Kilibit) (4KiB (kilibyte)) or smaller file. Hashing this many times should be less than a second on a modern CPU.
for something more secure you could try manipulating your key during encryption like AES (Rijndael). AES actually does XOR times and modifies the key each repeat of the key using a switch table. It became an internation standard so its quite secure.