Security of Exclusive-OR (XOR) encryption

XOR encryption is known to be quite weak. But how weak is it if I have a key that is made up of multiple keys of different (ideally prime) lengths which are combined to make a longer key? E.g. I have text keys of lengths 5, 9 and 11. If I just apply the first key using XOR encryption, it should be easy to break, as the encrypting byte will repeat every 5 bytes. However, if I 'overlay' all three of these keys I get an effective non-repeating length of 5*9*11 = 495. This sounds pretty strong to me. If I use a couple of verses of a poem, using each line as a key, then my non-repeating length will be way bigger than most files. How strong would this be (providing the key remains secret! :) )
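To make the idea concrete, here is a minimal Python sketch (with made-up keys of lengths 5, 9 and 11) showing that the overlaid keystream repeats with period 5*9*11 = 495:

from itertools import cycle
from math import lcm

keys = [b"abcde", b"fghijklmn", b"opqrstuvwxy"]   # lengths 5, 9 and 11

def combined_keystream(length: int) -> bytes:
    # XOR the three repeating keys together, byte by byte.
    streams = [cycle(k) for k in keys]
    out = bytearray()
    for _ in range(length):
        out.append(next(streams[0]) ^ next(streams[1]) ^ next(streams[2]))
    return bytes(out)

period = lcm(5, 9, 11)                  # 495, since the lengths are coprime
ks = combined_keystream(2 * period)
assert ks[:period] == ks[period:]       # the overlaid keystream repeats every 495 bytes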

XOR encryption is exactly as strong as its key stream. If you XOR with a one-time pad - a sequence of physically generated random numbers that you use only once - then your encryption is theoretically unbreakable. You do, however, have the problem of hiding and distributing the key.
So your question comes down to - "how secure/random is a keystream made of three text strings?" The answer is "not very secure at all". Probably good enough to keep out your little sister, but not necessarily if you've got a smart little sister like I have.

What about the 'known plaintext' attack? If you know the encrypted and the cleartext versions of the same string, you can retrieve the key.
http://en.wikipedia.org/wiki/XOR_cipher
http://en.wikipedia.org/wiki/Known-plaintext_attack
http://en.wikipedia.org/wiki/Stream_cipher_attack

If P and Q are two independent cryptographic methods, the composite cryptographic function P(Q(x)) won't be any weaker than the stronger of P(x) or Q(x), but it won't necessarily be meaningfully stronger either. In order for a composite cryptographic function to gain any strength, the operations comprising it have to meet certain criteria. Combining weak ciphers arbitrarily, no matter how many one uses, is unlikely to yield a strong cipher.

Related

Hashing and 'brute-force' permutations

So this is a two-part question:
Are there any hashing functions that guarantee that, for any two distinct inputs of the same length, they generate a unique hash? As I remember, most are that way, but I just need to confirm this.
Based on the 1st question - so, given a file hash and a length - is it then theoretically possible to 'brute-force' all byte permutations of that same length until the same hash is generated - ie. the original file has been recreated?
PS. I am aware that this will take ages (if theoretically possible), but I think it would be feasible for small files (sizes < 1KB)
1 KB is 1000 bytes, and each byte has 256 possible values, so that'd be 256^1000 possible files, right? It's a real big number - roughly a 1 with 2408 zeros behind it.
If you were to generate all of them, one would be the right one, but you'd have some number of collisions.
According to this security.SE post, you can expect an MD5 collision (for example) after roughly 2^64 hashes. So, if we divide our original number by that, we'd get how many candidates remain per digest, right?
~10^2389
That is still a lot of files to check.
I'd feel a lot better if someone critiqued my math here, but the point is that your first point is not true because of collisions. You can use the same sort of math to calculate the chance of two 1000-character passwords having the same hash. It's the birthday problem. Given 2 people, it is unlikely that they'd have the same birthday, but if you take a room full of people, the probability of any two of them sharing a birthday increases very quickly. If you take all 1000-character passwords, some of them are going to collide. You are going from X bytes down to 16 bytes; you can't fit all of the combinations into 16 bytes.
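To make the birthday-problem intuition concrete, here is a quick worked example in plain Python (no crypto library needed; the 365-equally-likely-days model is the usual simplification):

def birthday_collision_probability(n: int, days: int = 365) -> float:
    # Probability that at least two of n people share a birthday.
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(birthday_collision_probability(23))   # ~0.507 - better than even odds already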
Expanding upon the response to your first point, one of the points of cryptographic hash functions is unpredictability. A function with zero collisions is a 1-1 (or one-to-one) function, so called because every input has exactly one output and every output has exactly one input.
In order for a function to accept arbitrary length & complexity inputs without generating a collision, it is easy to see that the function must have arbitrary length outputs. As Gray obliquely points out, most hash functions have fixed-length outputs. (There are apparently some new algorithms that support arbitrary length outputs, but they still don't guarantee 0 collisions.) The reason is not stated clearly in the common crypto literature, but consider the difference between hashing and encrypting.
In hashing, you have the message (the unaltered original) and the message digest (the output of the hash function). (Digest here has the meaning "a summation or condensation of a body of information.")
With encryption, you have the plain text and the cipher text. The implication is that the cipher text is of equal length and complexity to the original.
The way I look at it, a cryptographic hash function with zero collisions would be of equal complexity to encryption. (Note that I'm unsure of what the advantages of a variable-length hash output are, so I asked a question about it.)
Additionally, hash functions are susceptible to attacks by pre-computed rainbow tables, which is why password-hashing schemes that are still considered secure employ an extra random input, called a salt. The reason encryption isn't susceptible to a similar attack is that the encryption key is kept secret and you can't pre-compute output values without knowing the key. Compare symmetric-key encryption (where there is one key that must be kept secret) with public-key encryption (where the encryption key is public and the decryption key is private).
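A small illustration of the salting idea just mentioned, assuming SHA-256 purely for brevity (real password storage should use a dedicated password hash such as bcrypt, scrypt or Argon2):

import os, hashlib

def hash_password(password: bytes) -> tuple:
    # A fresh random salt, stored alongside the hash, defeats pre-computed tables.
    salt = os.urandom(16)
    return salt, hashlib.sha256(salt + password).digest()

def verify_password(password: bytes, salt: bytes, stored: bytes) -> bool:
    return hashlib.sha256(salt + password).digest() == stored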
The other thing that prevents encryption algorithms from pre-computation attacks is that the number of computations for arbitrary-length inputs grows exponentially, and it is literally impossible to store the output from every input you may be interested in.

Is there a hash function to generate a hash with a given length?

Is there a function that generates a hash with the exact length I want? I know that MD5 always produces 16 bytes, but I want to define the length of the resulting hash.
Example:
hash('Something', 2) = 'gn'
hash('Something', 5) = 'a5d92'
hash('Something', 20) = 'RYNSl7cMObkPuXCK1GhF'
When the length increases, the result should be more secure from duplicates.
The upcoming SHAKE256 (or SHAKE128 for a security level of 128 bits instead of 256 bits), a so-called extendable-output function (XOF), is exactly what you are looking for. It will be defined alongside SHA-3. There is already a draft online.
If you need an established solution now, follow CodesInChaos' advice and truncate SHA-512 if a maximum of 64 bytes is enough; otherwise, seed a stream cipher with the output of a hash of the original data.
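Both suggestions can be sketched with Python's hashlib, which nowadays ships shake_256 (the function names here are illustrative):

import hashlib

def hash_to_length_xof(data: bytes, length: int) -> bytes:
    # Extendable-output function: ask SHAKE256 for exactly `length` bytes.
    return hashlib.shake_256(data).digest(length)

def hash_to_length_truncated(data: bytes, length: int) -> bytes:
    # Established alternative for up to 64 bytes: truncate SHA-512.
    assert 1 <= length <= 64
    return hashlib.sha512(data).digest()[:length]

print(hash_to_length_xof(b"Something", 5).hex())   # 10 hex characters = 5 bytes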
Technical disclaimer: after an output length of 512 bits, the "security against duplicates" (collision resistance) does not increase any more with longer output, as with SHAKE256 it has then reached the maximum security level against collisions that the primitive supports (256 bits). (Note that because of the birthday paradox, the security level of an ideal hash function with an output length of n bits is only n/2 bits against collisions.) Any higher security level is pretty much meaningless anyway (256 bits is probably already overkill), given that our solar system does not provide enough energy to even count from 0 to 2^256.
Please do not confuse "security levels" with key lengths: with symmetric algorithms one usually expects a security level equal to the key size, but with asymmetric algorithms the numbers are completely unrelated: a 512-bit RSA encryption scheme is far less secure than 128-bit AES (512-bit RSA moduli can already be factored in practice).
If a cryptographic primitive tries to achieve a "security level of n bits", it means that there are supposed to be no attacks against it that are faster than 2^n operations.
BLAKE2 can produce digests of any size between 1 and 64 bytes.
If you want a digest considered cryptographically secure, consider the Birthday problem and what other algorithms use — e.g. SHA-1 uses 20 bytes and is considered insecure, SHA-2 uses 28/32/48/64 bytes and is generally considered secure.
If you just want to avoid accidental collisions, still consider the Birthday problem (above), but 16 or even 8 bytes might be considered sufficient depending on the application (see table).

Initialization vector uniqueness

Best practice is to use unique IVs, but what counts as unique? Is it unique for each record, or absolutely unique (unique for each field too)?
If it's per field, that sounds awfully complicated; how do you manage the storage of so many IVs if you have 60 fields in each record?
I started an answer a while ago, but suffered a crash that lost what I'd put in. What I said was along the lines of:
It depends...
The key point is that if you ever reuse an IV, you open yourself up to cryptographic attacks that are easier to execute than those when you use a different IV every time. So, for every sequence where you need to start encrypting again, you need a new, unique IV.
You also need to look up cryptographic modes - Wikipedia has an excellent illustration of why you should not use ECB. CTR mode can be very beneficial.
If you are encrypting each record separately, then you need to create and record one IV for the record. If you are encrypting each field separately, then you need to create and record one IV for each field. Storing the IVs can become a significant overhead, especially if you do field-level encryption.
However, you have to decide whether you need the flexibility of field level encryption. You might - it is unlikely, but there might be advantages to using a single key but different IVs for different fields. OTOH, I strongly suspect that it is overkill, not to mention stressing your IV generator (cryptographic random number generator).
If you can afford to do encryption at a page level instead of the row level (assuming rows are smaller than a page), then you may benefit from using one IV per page.
Erickson wrote:
You could do something clever like generating one random value in each record, and using a hash of the field name and the random value to produce an IV for that field.
However, I think a better approach is to store a structure in the field that collects an algorithm identifier, the necessary parameters (like the IV) for that algorithm, and the ciphertext. This could be stored as a little binary packet, or encoded into some text like Base-85 or Base-64.
And Chris commented:
I am indeed using CBC mode. I thought about an algorithm to do a 1:many so I can store only 1 IV per record. But now I'm considering your idea of storing the IV with the ciphertext. Can you give me some more advice: I'm using PHP + MySQL, and many of the fields are either varchar or text. I don't have much experience with binary in the database; I thought binary was database-unfriendly, so I always base64_encode when storing binary (like the IV, for example).
To which I would add:
IBM DB2 LUW and Informix Dynamic Server both use a Base-64 encoded scheme for the character output of their ENCRYPT_AES() and related functions, storing the encryption scheme, IV and other information as well as the encrypted data.
I think you should look at CTR mode carefully - as I said before. You could create a 64-bit IV from, say, 48-bits of random data plus a 16-bit counter. You could use the counter part as an index into the record (probably in 16 byte chunks - one crypto block for AES).
I'm not familiar with how MySQL stores data at the disk level. However, it is perfectly possible to encrypt the entire record including the representation of NULL (absence of) values.
If you use a single IV for a record, but use a separate CBC encryption for each field, then each field has to be padded to 16 bytes, and you are definitely indulging in 'IV reuse'. I think this is cryptographically unsound. You would be much better off using a single IV for the entire record and either one unit of padding for the record and CBC mode or no padding and CTR mode (since CTR does not require padding - one of its merits; another is that you only use the encryption mode of the cipher for both encrypting and decrypting the data).
Once again, appendix C of NIST pub 800-38A might be helpful. For example, according to it, you could generate an IV for CBC mode simply by encrypting a unique nonce with your encryption key. Even simpler: if you use OFB, then the IV just needs to be unique.
There is some confusion about what the real requirements are for good IVs in the CBC mode. Therefore, I think it is helpful to look briefly at some of the reasons behind these requirements.
Let's start by reviewing why IVs are even necessary. IVs randomize the ciphertext: if the same message is encrypted twice with the same key (but different IVs), then the ciphertexts are distinct. An attacker who is given two (equally long) ciphertexts should not be able to determine whether the two ciphertexts encrypt the same plaintext or two different plaintexts. This property is usually called ciphertext indistinguishability.
Obviously this is an important property for encrypting databases, where many short messages are encrypted.
Next, let's look at what can go wrong if the IVs are predictable. Let's, for example, take Erickson's proposal:
"You could do something clever like generating one random value in each record, and using a hash of the field name and the random value to produce an IV for that field."
This is not secure. For simplicity, assume that a user Alice has a record in which there exist only two possible values, m1 or m2, for a field F. Let Ra be the random value that was used to encrypt Alice's record. Then the ciphertext for the field F would be
EK(hash(F || Ra) xor m), where m is either m1 or m2.
The random Ra is also stored in the record, since otherwise it wouldn't be possible to decrypt. An attacker Eve, who would like to learn the value of Alice's record can proceed as follows: First, she finds an existing record where she can add a value chosen by her.
Let Re be the random value used for this record and let F' be the field for which Eve can submit her own value v. Since the record already exists, it is possible to predict the IV for the field F', i.e. it is
hash(F' || Re).
Eve can exploit this by selecting her value v as
v = hash(F' || Re) xor hash(F || Ra) xor m1,
let the database encrypt this value, which is
EK(hash(F || Ra) xor m1)
and then compare the result with Alice's record. If the two results match, then she knows that m1 was the value stored in Alice's record; otherwise it is m2.
You can find variants of this attack by searching for "block-wise adaptive chosen plaintext attack" (e.g. this paper). There is even a variant that worked against TLS.
The attack can be prevented, possibly by encrypting the random value before putting it into the record and deriving the IV from that encrypted result. But again, probably the simplest thing to do is what NIST already proposes: generate a unique nonce for every field that you encrypt (this could simply be a counter), encrypt the nonce with your encryption key, and use the result as the IV.
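A minimal sketch of that NIST-style approach, using the Python 'cryptography' package (the helper names, key handling and PKCS#7 padding here are illustrative, not a vetted implementation):

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)   # AES-256 key; in practice this comes from your key management

def derive_iv(key: bytes, counter: int) -> bytes:
    # Encrypt a unique 16-byte counter block to obtain an unpredictable CBC IV.
    nonce = counter.to_bytes(16, "big")          # unique per encrypted field
    enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return enc.update(nonce) + enc.finalize()

def encrypt_field(key: bytes, counter: int, plaintext: bytes) -> bytes:
    iv = derive_iv(key, counter)
    pad_len = 16 - (len(plaintext) % 16)         # PKCS#7 padding for CBC
    padded = plaintext + bytes([pad_len]) * pad_len
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()   # store the IV with the ciphertext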
Also note that the attack above is a chosen-plaintext attack. Even more damaging attacks are possible if the attacker has the ability to mount chosen-ciphertext attacks, i.e. if she can modify your database. Since I don't know how your databases are protected, it is hard to make any claims there.
The requirements for IV uniqueness depend on the "mode" in which the cipher is used.
For CBC, the IV should be unpredictable for a given message.
For CTR, the IV has to be unique, period.
For ECB, of course, there is no IV. If a field is a short, random identifier that fits in a single block, you can use ECB securely.
I think a good approach is to store a structure in the field that collects an algorithm identifier, necessary parameters (like IV) for that algorithm, and the ciphertext. This could be stored as a little binary packet, or encoded into some text like Base-85 or Base-64.

Random access encryption with AES in counter mode using Fortuna PRNG

I'm building file encryption based on AES that has to be able to work in random-access mode (accessing any part of the file). AES in counter (CTR) mode, for example, can be used, but it is well known that we need a unique sequence that is never used twice.
Is it ok to use a simplified Fortuna PRNG in this case (encrypting a counter with a randomly chosen unique key specific to the particular file)? Are there weak points in this approach?
So encryption/decryption can look like this
Encryption of a block at Offset:
rndsubseq = AESEnc(Offset, FileUniqueKey)
xoredplaintext = plaintext xor rndsubseq
ciphertext = AESEnc(xoredplaintext, PasswordBasedKey)
Decryption of a block at Offset:
rndsubseq = AESEnc(Offset, FileUniqueKey)
xoredplaintext = AESDec(ciphertext, PasswordBasedKey)
plaintext = xoredplaintext xor rndsubseq
One observation: I came up with the idea used in Fortuna on my own and, sure enough, later discovered that it had already been invented. But everywhere I read, the key point made about it is security; there is another good point, though: it is a great random-access pseudo-random number generator, so to speak (in simplified form). It is a PRNG that not only produces a very good sequence (I tested it with Ent and Diehard) but also allows access to any sub-sequence if you know the step number. So is it generally OK to use Fortuna as a "random-access" PRNG in security applications?
EDIT:
In other words, what I suggest is to use the Fortuna PRNG as a tweak to form a tweakable AES cipher with random-access ability. I read the work of Liskov, Rivest and Wagner, but could not understand what the main difference is between a cipher in a mode of operation and a tweakable cipher. They said they suggested bringing this approach from the high level down into the cipher itself, but, for example, in my case XORing the plain text with the tweak - is this a tweak or not?
I think you may want to look up how "tweakable block ciphers" work and have a look at how the problem of disc encryption is solved: Disk encryption theory. Encrypting the whole disk is similar to your problem: encryption of each sector must be done independently (you want independent encryption of data at different offsets) and yet the whole thing must be secure. There is a lot of work done on that. Wikipedia seems to give a good overview.
EDITED to add:
Re your edit: Yes, you are trying to make a tweakable block cipher out of AES by XORing the tweak with the plaintext. More concretely, you have Enc(T,K,M) = AES (K, f(T) xor M) where AES(K,...) means AES encryption with the key K and f(T) is some function of the tweak (in your case I guess it's Fortuna). I had a brief look at the paper you mentioned and as far as I can see it's possible to show that this method does not produce a secure tweakable block cipher.
The idea (based on definitions from section 2 of the Liskov, Rivest, Wagner paper) is as follows. We have access to either the encryption oracle or a random permutation and we want to tell which one we are interacting with. We can set the tweak T and the plaintext M and we get back the corresponding ciphertext but we don't know the key which is used. Here is how to figure out if we use the construction AES(K, f(T) xor M).
Pick any two different values T, T' and compute f(T), f(T'). Pick any message M and then compute the second message as M' = M xor f(T) xor f(T'). Now ask the encryption oracle to encrypt M using tweak T and M' using tweak T'. With the considered construction, both inputs to the AES encryption are the same, so the outputs will be identical. With random permutations, the outputs will almost certainly be different: the probability that the two outputs are identical is only 2^-128. The bottom line is that XORing the tweak into the input is not a secure method.
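A tiny demonstration of that distinguisher, with a stand-in tweak function f (SHA-256 here rather than Fortuna) and AES via the Python 'cryptography' package; all names and values are illustrative:

import os, hashlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def f(tweak: bytes) -> bytes:
    return hashlib.sha256(tweak).digest()[:16]   # 16-byte stand-in "tweak keystream"

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def enc(key: bytes, tweak: bytes, block: bytes) -> bytes:
    # The construction under discussion: AES_K(f(T) xor M) on a single block.
    e = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return e.update(xor(f(tweak), block)) + e.finalize()

key = os.urandom(16)                             # unknown to the attacker
t1, t2 = b"tweak-1", b"tweak-2"
m1 = os.urandom(16)
m2 = xor(m1, xor(f(t1), f(t2)))                  # attacker-chosen second block
assert enc(key, t1, m1) == enc(key, t2, m2)      # collision => not a secure tweakable cipher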
The paper gives some examples of constructions they can prove secure. The simplest one seems to be Enc(T,K,M) = AES(K, T xor AES(K, M)). You need two encryptions per block, but they prove the security of this construction. They also mention faster variants, but those require an additional primitive (almost-XOR-universal function families).
Even though I think your approach is secure enough, I don't see any benefit over CTR. You have the exact same problem, which is that you don't inject true randomness into the ciphertext. The offset is a known, systematic input; even though it's encrypted with a key, it's still not random.
Another issue is how you keep the FileUniqueKey secure. Encrypted with a password? A whole bunch of issues are introduced when you use multiple keys.
Counter mode is accepted practice to encrypt random access files. Even though it has all kinds of vulnerabilities, it's all well studied so the risk is measurable.
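As an illustration of that accepted practice, here is a minimal sketch of random access with plain AES-CTR, using the Python 'cryptography' package (the 8-byte nonce / 8-byte block-index counter layout is one common, illustrative choice):

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)
file_nonce = os.urandom(8)       # per-file nonce; never reuse with the same key

def crypt_at(block_index: int, data: bytes) -> bytes:
    # Initial counter block = 8-byte file nonce || 8-byte block index, so any
    # 16-byte block offset can be encrypted or decrypted independently.
    counter_block = file_nonce + block_index.to_bytes(8, "big")
    ctr = Cipher(algorithms.AES(key), modes.CTR(counter_block)).encryptor()
    return ctr.update(data)      # CTR encryption and decryption are the same operation

ciphertext = crypt_at(5, b"sixteen byte blk")          # encrypt block 5
assert crypt_at(5, ciphertext) == b"sixteen byte blk"  # decrypt by re-running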

What's wrong with XOR encryption?

I wrote a short C++ program to do XOR encryption on a file, which I may use for some personal files (if it gets cracked it's no big deal - I'm just protecting against casual viewers). Basically, I take an ASCII password and repeatedly XOR the password with the data in the file.
Now I'm curious, though: if someone wanted to crack this, how would they go about it? Would it take a long time? Does it depend on the length of the password (i.e., what's the big-O)?
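For reference, a minimal Python sketch of the scheme being described (the actual program is in C++; this is only to make the question concrete):

from itertools import cycle

def xor_crypt(data: bytes, password: bytes) -> bytes:
    # XOR each byte of the file with the password, cycled over and over;
    # applying the same function twice restores the original data.
    return bytes(b ^ k for b, k in zip(data, cycle(password)))

secret = xor_crypt(b"attack at dawn", b"pass")
assert xor_crypt(secret, b"pass") == b"attack at dawn"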
The problem with XOR encryption is that for long runs of the same characters, it is very easy to see the password. Such long runs are most commonly spaces in text files. Say your password is 8 chars, and the text file has 16 spaces in some line (for example, in the middle of ASCII-graphics table). If you just XOR that with your password, you'll see that output will have repeating sequences of characters. The attacker would just look for any such, try to guess the character in the original file (space would be the first candidate to try), and derive the length of the password from length of repeating groups.
Binary files can be even worse as they often contain repeating sequences of 0x00 bytes. Obviously, XORing with those is no-op, so your password will be visible in plain text in the output! An example of a very common binary format that has long sequences of nulls is .doc.
I concur with Pavel Minaev's explanation of XOR's weaknesses. For those who are interested, here's a basic overview of the standard algorithm used to break the trivial XOR encryption in a few minutes:
1. Determine how long the key is. This is done by XORing the encrypted data with itself shifted various numbers of places, and examining how many bytes are the same.
2. If the bytes that are equal are greater than a certain percentage (6% according to Bruce Schneier's Applied Cryptography, second edition), then you have shifted the data by a multiple of the key length. By finding the smallest amount of shifting that results in a large number of equal bytes, you find the key length.
3. Shift the cipher text by the key length and XOR it against itself. This removes the key and leaves you with the plaintext XORed with the plaintext shifted by the length of the key. There should be enough plaintext to determine the message content.
Read more at Encryption Matters, Part 1
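A minimal sketch of step 1 above (the coincidence-counting key-length search), purely for illustration:

def guess_key_length(ciphertext: bytes, max_len: int = 40) -> int:
    # Shift the ciphertext against itself and count equal bytes; shifts that
    # are multiples of the key length show a markedly higher coincidence rate.
    best_shift, best_rate = 1, 0.0
    for shift in range(1, max_len + 1):
        matches = sum(a == b for a, b in zip(ciphertext, ciphertext[shift:]))
        rate = matches / max(1, len(ciphertext) - shift)
        if rate > best_rate:
            best_shift, best_rate = shift, rate
    return best_shift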
XOR encryption can be reasonably* strong if the following conditions are met:
The plain text and the password are about the same length.
The password is not reused for encrypting more than one message.
The password cannot be guessed, e.g. by dictionary or other mathematical means. In practice this means the bits are randomized.
*Reasonably strong meaning it cannot be broken by trivial, mathematical means, as in GeneQ's post. It is still no stronger than your password.
In addition to the points already mentioned, XOR encryption is completely vulnerable to known-plaintext attacks:
cryptotext = plaintext XOR key
key = cryptotext XOR plaintext = plaintext XOR key XOR plaintext
where XORing the plaintexts cancels them out, leaving just the key.
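In code, that recovery is a one-liner (illustrative only):

def recover_key(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    # XOR the ciphertext with the known plaintext and the key falls out.
    return bytes(c ^ p for c, p in zip(ciphertext, known_plaintext))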
Not being vulnerable to known-plaintext attacks is a necessary but not sufficient property for any "secure" encryption method where the same key is used for more than one plaintext block (a one-time pad, which never reuses its key, is still secure).
Ways to make XOR work:
Use multiple keys, each with a length equal to a different prime number - never two keys of the same length.
Use the original filename as another key but remember to create a mechanism for retrieving the filename. Then create a new filename with an extension that will let you know it is an encrypted file.
The reason for using multiple keys of prime-number length is that they cause the resulting XOR key to be Key A TIMES Key B in length before it repeats.
Compress any repeating patterns out of the file before it is encrypted.
Generate a random number and XOR this number in at every X offset (remember, this number must also be recreatable; you could use a random seed based on the file length).
After doing all this, if you use 5 keys of length 31 and greater, you would end up with a key length of approximately One Hundred Meg!
For keys: the filename (including the full path), STR(Filesize) + STR(Filedate) + STR(Date) + STR(Time), a randomly generated key, your full name, and a private key created one time.
Use a database to store the keys used for each encrypted file, but keep the DAT file on a USB memory stick and NOT on the computer.
This should prevent the repeating pattern on files like Pictures and Music but movies, being four gigs in length or more, may still be vulnerable so may need a sixth key.
I personally have the DAT file itself encrypted on the memory stick (a DAT file for use with Microsoft Access). I used a 3-key method to encrypt it because it will never be THAT large, being a directory of the files with the associated keys.
The reason for multiple keys rather than randomly generating one very large key is that primes times primes get large quickly, I have some control over the creation of the key, and you KNOW that there really is no such thing as a truly random number. If I created one large random number, someone else could generate that same number.
Method to use the keys: Encrypt the file with one key, then the next, then the next till all keys are used. Each key is used over and over again till the entire file is encrypted with that key.
Because the keys are of different lengths, the overlap of the repeats is different for each key, and so it creates a derived key whose length is key one times key two; this logic repeats for the rest of the keys. The reason for prime numbers is that the repetition would occur at a divisor of the key length, so you want the only divisors to be 1 and the length of the key itself - hence, prime.
OK, granted, this is more than a simple XOR on the file but the concept is the same.
Lance
I'm just protecting against casual viewers
As long as this assumption holds, your encryption scheme is ok. People who think that Internet Explorer is "teh internets" are not capable of breaking it.
If not, just use some crypto library. There are already many good algorithms like Blowfish or AES for symmetric crypto.
The goal of good encryption is to make it mathematically difficult to decrypt without the key; this includes the desire to protect the key itself. The XOR technique is basically a very simple cipher, easily broken as described here. It is important to note that XOR is used within cryptographic algorithms, but those algorithms build mathematical difficulty around it.
Norton's Anti-virus used to use a technique of using the previous unencrypted letter as the key for next letter. That took me an extra half-hour to figure out, if I recall correctly.
If you just want to stop the casual viewer, it's good enough; I've used it to hide strings within executables. It won't stand up for 10 minutes against anyone who actually tries, however.
That all said, these days there are much better encryption methods readily available, so why not avail yourself of something better. If you are trying to just hide from the "casual" user, even something like gzip would do that job better.
Another trick is to generate a md5() hash for your password. You can make it even more unique by using the length of the protected text as an offset or combining it with your password to provide better distribution for short phrases. And for long phrases, evolve your md5() hash by combining each 16-byte block with the previous hash -- making the entire XOR key "random" and non-repetitive.
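A minimal sketch of that hash-chaining idea (MD5 only because the answer mentions it; this is illustrative, not a recommendation):

import hashlib

def evolving_keystream(password: bytes, length: int) -> bytes:
    # Each 16-byte keystream block is derived from the previous hash plus the
    # password, so the XOR key no longer repeats.
    block = hashlib.md5(password).digest()
    out = bytearray()
    while len(out) < length:
        out.extend(block)
        block = hashlib.md5(block + password).digest()
    return bytes(out[:length])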
RC4 is essentially XOR encryption! As are many stream ciphers - the key is the key (no pun intended!), and you must NEVER reuse the key. EVER!
I'm a little late in answering, but since no one has mentioned it yet: this is called a Vigenère cipher.
Wikipedia gives a number of cryptanalysis attacks to break it; even simpler, though, since most file formats have a fixed header, would be to XOR the plaintext header with the encrypted header, giving you the key.
That ">6%" GeneQ mentions is the index of coincidence for English telegraph text - 26 letters, with punctuation and numerals spelled out. The actual value for long texts is 0.0665.
The <4% is the index of coincidence for random text in a 26-character alphabet, which is 1/26, or 0.0385.
If you're using a different language or a different alphabet, the specific values will differ. If you're using the ASCII character set, Unicode, or binary bytes, the specific values will be very different. But the difference between the IC of plaintext and that of random text will usually be present. (Compressed binaries may have ICs very close to that of random data, and any file encrypted with any modern computer cipher will have an IC that is exactly that of random text.)
Once you've XORed the text against itself, what you have left is equivalent to an autokey cipher. Wikipedia has a good example of breaking such a cipher
http://en.wikipedia.org/wiki/Autokey_cipher
If you want to keep using XOR you could easily hash the password with multiple different salts (a string that you add to a password before hashing) and then combine them to get a larger key.
E.g. use SHA3-512 with 64 unique salts, then hash your password with each salt to get a 32768-bit key that you can use to encrypt a 32 Kib (kibibit) (4 KiB (kibibyte)) or smaller file. Hashing this many times should take less than a second on a modern CPU.
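A sketch of that multi-salt construction (the salts here are trivially derived just for illustration; real salts must be generated randomly and stored so the key can be rebuilt for decryption):

import hashlib

def build_xor_key(password: bytes, salts: list) -> bytes:
    # One SHA3-512 digest (64 bytes) per salt, concatenated into one long key;
    # 64 salts give a 4 KiB (32768-bit) key, as suggested above.
    return b"".join(hashlib.sha3_512(salt + password).digest() for salt in salts)

key = build_xor_key(b"correct horse battery staple",
                    [bytes([i]) for i in range(64)])
assert len(key) == 4096   # 64 digests x 64 bytes = 32768 bits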
For something more secure, you could try manipulating your key during encryption, like AES (Rijndael) does. AES actually applies XOR many times and modifies the key on each round using a substitution table. It became an international standard, so it's quite secure.

Resources