Implement PKCS #7 Padding Scheme for AES in Python

I've written a small command line utility to encrypt single files with AES, using Python 3. As I'm sure we all know, AES works on 16-byte blocks, so if I want to encrypt a file that isn't exactly a multiple of 16, then I'll have to pad the file to make it a multiple of 16. PKCS #7 padding scheme says that I should pad the last chunk with N bytes all of value N. This is how I do that in my encryption function.
for chunk in getChunks(plainFile, chunkSizeBytes):
    padLength = (AES.block_size - len(chunk)) % AES.block_size
    # We have to have padding!
    if padLength == 0:
        padLength = 16
    pad = chr(padLength) * padLength
    chunk += pad.encode('utf-8')
    # Write the encrypted chunk to an output file.
    cipherFile.write(en.encrypt(chunk))
However, I'm unsure about how I should read this data from that last chunk of a decrypted file. Is there a way to read in files in reverse order? What's the correct way to do this?

I should pad the last chunk with N bytes all of value N.
In this sentence, the first N is equal to the second N, which means the value of the padding byte tells you how many characters you need to remove when decoding.
For example, if you only have 9 characters in your last chunk, pad with 7 characters of value 7 (7 turns out to be the BEL character, but that doesn't matter).
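To decrypt, you don't need to read the file in reverse. Decrypt everything, then look at the last byte of the result: its value tells you exactly how many bytes to cut off. A minimal sketch, reusing the question's getChunks/cipherFile naming and assuming a matching AES decryptor de:
plaintext = b''.join(de.decrypt(chunk)
                     for chunk in getChunks(cipherFile, chunkSizeBytes))

# PKCS #7: the value of the last byte is the number of padding bytes.
padLength = plaintext[-1]
plaintext = plaintext[:-padLength]  # always strips between 1 and 16 bytes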

Related

Node.JS AES decryption truncates initial result

I'm attempting to replicate some python encryption code in node.js using the built in crypto library. To test, I'm encrypting the data using the existing python script and then attempting to decrypt using node.js.
I have everything working except for one problem, doing the decryption results in a truncated initial decrypted result unless I grab extra data, which then results in a truncated final result.
I'm very new to the security side of things, so apologize in advance if my vernacular is off.
Python encryption logic:
encryptor = AES.new(key, AES.MODE_CBC, IV)
<# Header logic, like including digest, salt, and IV #>
for rec in vect:
    chunk = rec.pack()  # Just adds disparate pieces of data into a contiguous bytearray of length 176
    encChunk = encryptor.encrypt(chunk)
    outfile.write(encChunk)
Node decryption logic:
let offset = 0;
let derivedKey = crypto.pbkdf2Sync(secret, salt, iterations, 32, 'sha256');
let decryptor = crypto.createDecipheriv('aes-256-cbc', derivedKey, iv);
let chunk = data.slice(offset, (offset + RECORD_LEN));
while (chunk.length > 0) {
    let clearChunk = decryptor.update(chunk);
    // unpack clearChunk and do something with that data
    offset += RECORD_LEN;
    chunk = data.slice(offset, offset + RECORD_LEN);
}
I would expect my initial result to print something like this to hex:
54722e34d8b2bf158db6b533e315764f87a07bbfbcf1bd6df0529e56b6a6ae0f123412341234123400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
And it gets close, except it cuts off the final 16 bytes (in the example above, the final 32 "0"s would be missing). This shifts all following decryptions by those 16 bytes, meaning those 32 "0"s are added to the front of the next decrypted chunk.
If I add 16 bytes to the initial chunk size (meaning actually grab more data, not just shift the offset) then this solves everything on the front end, but results in the final chunk losing its last 16 bytes of data.
One thing that seems weird to me: the initial chunk has a length of 176, but the decrypted result has a length of 160. All other chunks have a length of 176 both before and after decryption. I'm assuming I'm doing something wrong with how I'm initializing the decryptor, which is causing it to expect an extra 16 bytes of data at the beginning, but I can't for the life of me figure out what.
I must be close since the decrypted data is correct, minus the mystery shifting, even when reading in large amounts of data. Just need to figure out this final step.
Short version based on your updated code: if you are absolutely certain that every block will be 176 bytes (i.e. a multiple of 16), then you can add cipher.setAutoPadding(false) to your Node code. If that's not true, or for more about why, read on.
At the end of your decryption, you need to call decryptor.final() to get the final block.
If you have all the data together, you can decrypt it in one call:
let clearChunk = Buffer.concat([decryptor.update(chunk), decryptor.final()]);
update() exists so that you can pass data to the decryptor in chunks. For example, if you had a very large file, you may not want a full copy of the encrypted data plus a full copy of the decrypted data in memory at the same time. You can therefore read encrypted data from the file in chunks, pass it to update(), and write out the decrypted data in chunks.
The input data for CBC mode must be a multiple of 16 bytes long. To ensure this, we typically use PKCS7 padding. That will pad out your input data to a multiple of 16; if it's already a multiple of 16, it will add an extra block of 16 bytes. The value of each padding byte is the number of padding bytes, so if your block is 12 bytes long, it will be padded with 04040404. If it's a multiple of 16, then the padding is 16 bytes of 0x10. This scheme lets the decryptor validate that it's removing the right amount of padding, and it's likely what's causing your 176/160 issue.
This padding is also why there's a final() call. The system needs to know which block is the last one so that it can remove the padding, so the first call to update() will always return one fewer block than you pass in: the decryptor holds onto it until it knows whether it's the last block.
Looking at your Python code, I think it's not padding at all (most Python libraries I'm familiar with don't pad automatically). As long as the input is certain to be a multiple of 16, that's ok. But the default for Node is to expect padding. If you know that your size will always be a multiple of 16, then you can change the Node side with cipher.setAutoPadding(false). If you don't know for certain that the input size will always be a multiple of 16, then you need to add a pad() call on the Python side for the final block.
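For example, if the Python side uses PyCryptodome, that library ships a PKCS7 helper; here's a minimal sketch (assuming that library and the question's variable names) of padding the final chunk before encrypting:
from Crypto.Util.Padding import pad  # PyCryptodome's PKCS7 helper

# Only the final chunk needs this: a 12-byte tail becomes ...04040404,
# and an exact multiple of 16 gains a whole extra block of 0x10 bytes.
outfile.write(encryptor.encrypt(pad(chunk, AES.block_size)))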

What's really happening in this Python encryption script?

I'm currently learning to use Python for binary files. I came across this code in the book I'm reading:
FILENAME = 'pc_rose_copy.txt'

def display_contents(filename):
    fp = open(filename, 'rb')
    print(fp.read())
    fp.close()

def encrypt(filename):
    fp = open(filename, 'r+b')
    text = fp.read()
    fp.seek(0)
    for c in text:
        if c <= 128:
            fp.write(bytes([c+128]))
        else:
            fp.write(bytes([c-128]))
    fp.close()

display_contents(FILENAME)
encrypt(FILENAME)
display_contents(FILENAME)
I've several doubts regarding this code for which I can't find an answer in the book:
1) In line 13 ("if c <= 128"), since the file was opened in binary mode, is each character read as its index in the ASCII table (i.e., is this equivalent to 'if ord(c) <= 128' had the file not been opened in binary mode)?
2) If so, then what's the point in checking if any character's index is higher than 128, since this is a .txt with a passage from Romeo and Juliet?
3) This point is more of a curiosity, so pardon naivety. I know this doesn't apply in this case, but say the script encounters a 'c' with a byte value of 128, and so adds 128 to it. What would 256 byte look like -- would it be 11111111 00000001?
What's really happening is that the script is toggling the most significant bit of every byte. This is equivalent to adding/subtracting 128 to each byte. You can see this by looking at the file contents before/after running the script (xxd -b file.txt on linux or mac will let you see the exact bits/bytes).
Here's a run on some sample text:
File Contents Before:
11110000 10011111 10011000 10000100 00001010
File Contents After:
01110000 00011111 00011000 00000100 10001010
Running the script twice (or any even number of times) restores the original text by toggling all of the high bits back to the original values.
Question / Answer:
1) If the file is ASCII-encoded, yes. e.g. for a file abc\n, the values of c are 97, 98, 99, and 10 (newline). You can verify this by adding print(c) inside the loop. This script will also work* on non-ASCII encoded files (the example above is UTF-8).
2) So that we can flip the bits. Even if we were only handling ASCII files (which isn't guaranteed), the bytes we get from encrypting ASCII files will be larger than 128, since we've added 128 to each byte. So we still need to handle that case in order to decrypt our own files.
3) As is, the script crashes, because bytes() requires values in the range 0 <= x < 256 (see documentation). You can create a file that breaks the script with echo -n -e '\x80\x80\x80' > 128.txt. The script should be using < instead to handle this case properly.
* Except for 3)
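Putting 1)-3) together, here is a minimal sketch (not from the book) of the same transformation written as an XOR of the high bit, which sidesteps the range problem entirely:
def toggle_high_bits(filename):
    """Flip the most significant bit of every byte in the file, in place.

    c ^ 0x80 is equivalent to the add/subtract-128 logic, but the result
    always stays in range(256), so bytes() can never crash. Running the
    function twice restores the original file.
    """
    with open(filename, 'r+b') as fp:
        data = fp.read()
        fp.seek(0)
        fp.write(bytes(c ^ 0x80 for c in data))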
I think that the encrypt function is also meant to be a decrypt function.
The encrypt goes from a text file to a binary file with only high bytes. But the else clause is for going back from high byte to text. I think that if you added an extra encrypt(FILENAME) you'd get the original file back.
'c' cannot really be 128 in a text file; the highest value there would be 126 (~), and 127 is the DEL "character". But if c were 128, adding 128 would give 0 as a byte (wrap-around), since we'd be working modulo 256. In C this would be the case (for an unsigned char).
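A quick interpreter check of both behaviors (Python 3):
>>> (128 + 128) % 256     # the unsigned-char wrap-around described above
0
>>> bytes([128 + 128])    # without the modulo, Python refuses instead
Traceback (most recent call last):
  ...
ValueError: bytes must be in range(0, 256)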

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequences of markers. Each Marker starts with FF xx, where xx is a specific Marker-ID.
So my idea, was to simply read in the image in binary format, and seek for the corresponding character-combinations in the binary stream. This should enable me to split the header in the corresponding marker-fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does python not return me nice chunks of 2 bytes (in hex-format)?
Something like this is what I would expect:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those aren't characters of the hex representation I'd expect from a byte string (only 0-9 and a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two-bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each 2 bytes or use unpack (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because (I assume) these are the conventional ways to represent JPEG 2-byte codes. Check the documentation if you want to deviate from this.
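If what you ultimately want is to split the header into its marker fields, here is a minimal sketch of a parser (assuming a well-formed baseline JPEG; the function name and 'picture.jpg' are made up for illustration):
import struct

def iter_jpeg_segments(path):
    with open(path, 'rb') as f:
        if f.read(2) != b'\xff\xd8':       # every JPEG starts with the SOI marker
            raise ValueError('not a JPEG file')
        while True:
            prefix, marker = f.read(2)     # a 2-byte read unpacks to two ints
            if prefix != 0xFF:
                raise ValueError('expected an FF xx marker')
            # Each segment stores its length as a big-endian 2-byte word
            # that includes the length field itself.
            (length,) = struct.unpack('>H', f.read(2))
            yield marker, f.read(length - 2)
            if marker == 0xDA:             # SOS: entropy-coded data follows
                return

for marker, payload in iter_jpeg_segments('picture.jpg'):
    print('marker FF%02X, %d payload bytes' % (marker, len(payload)))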

Why do SHA256 hashes finish with "="?

I've made a web service which returns a security token after a successful authentication.
However, when debugging, I noticed that every hash the web service returned finishes with "=", such as:
"tINH0JxmryvB6pRkEii1iBYP7FRedDqIEs0Ppbw83oc="
"INv7q72C1HvIixY1qmt5tNASFBEc0PnXRSb780Y5aeI="
"QkM8Kog8TtCczysDmKu6ZOjwwYlcR2biiUzxkb3uBio="
"6eNuCU6RBkwKMmVV6Mhm0Q0ehJ8Qo5SqcGm3LIl62uQ="
"dAPKN8aHl5tgKpmx9vNoYvXfAdF+76G4S+L+ep+TzU="
"O5qQNLEjmmgCIB0TOsNOPCHiquq8ALbHHLcWvWhMuI="
"N9ERYp+i7yhEblAjaKaS3qf9uvMja0odC7ERYllHCI="
"wsBTpxyNLVLbJEbMttFdSfOwv6W9rXba4GGodVVxgo="
"sr+nF83THUjYcjzRVQbnDFUQVTkuZOZYe3D3bmF1D8="
"9EosvgyYOG5a136S54HVmmebwiBJJ8a3qGVWD878j5k="
"8ORZmAXZ4dlWeaMOsyxAFphwKh9SeimwBzf8eYqTis="
"gVepn2Up5rjVplJUvDHtgIeaBL+X6TPzm2j9O2JTDFI="
Why such behavior?
This is because you don't see the raw bytes of the hash but rather the Base64 encoding.
Base64 encoding converts each block of 3 bytes into a block of 4 characters. This works well if the number of bytes is divisible by 3. If it is not, you use a padding character so that the number of resulting characters is still divisible by 4.
So:
(no of bytes)%3 = 0 => no padding needed
(no of bytes)%3 = 1 => pad with ==
(no of bytes)%3 = 2 => pad with =
A SHA256 hash is 256 bits, that's 32 bytes. So you will get 40 characters for the first 30 bytes, 3 characters for the last 2 bytes, and the padding will always be one =.
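You can verify this with a couple of lines (sketched in Python here; the question is language-agnostic):
import base64
import hashlib

digest = hashlib.sha256(b'any input').digest()
token = base64.b64encode(digest).decode('ascii')
print(len(digest), len(token), token[-1])   # 32 44 = (exactly one pad character)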
These strings are encoded using Base64; = characters are used as padding, so that the last block of a Base64 string contains four characters.
The following Ruby code could be used to get base64 decoded string:
require 'base64'
s = "tINH0JxmryvB6pRkEii1iBYP7FRedDqIEs0Ppbw83oc="
puts Base64.decode64(s).bytes.map{|e| '%02x' % e}.join
Output: b48347d09c66af2bc1ea94641228b588160fec545e743a8812cd0fa5bc3cde87

Why does ToBase64String change a 16 byte string to 24 bytes

I have the following code. When I check the value of variable i it is 16 bytes, but when the output is converted to Base64 it is 24 bytes.
byte[] bytOut = ms.GetBuffer();
int i = 0;
for (i = 0; i < bytOut.Length; i++)
    if (bytOut[i] == 0)
        break;
// convert into Base64 so that the result can be used in xml
return System.Convert.ToBase64String(bytOut, 0, i);
Is this expected? I am trying to cut down storage and this is one of my problems.
Base64 expresses an input string made of 8-bit bytes using 64 human-readable characters (64 characters = 6 bits of information per character).
The key to the answer of your question is that the encoding works in 24-bit chunks, so every 24 bits or fraction thereof results in 4 characters of output.
16 bytes * 8 bits = 128 bits of information
128 bits / 24 bits per chunk = 5.333 chunks
So the final output will be 6 chunks or 24 characters.
The fractional chunks are handled with equal signs, which represent the trailing "null bits". In your case, the output will always end in '=='.
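A quick check of that arithmetic (sketched in Python for brevity; Convert.ToBase64String behaves the same way):
import base64

data = bytes(16)                    # any 16 bytes, as in the question
encoded = base64.b64encode(data)
print(len(encoded), encoded[-2:])   # 24 b'==' -- six 4-character chunks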
Yes, you'd expect to see some expansion. You're representing your data in a base with only 64 characters. All those unprintable ASCII characters still need a way to be encoded though. So you end up with slight expansion of the data.
Here's a link that explains how much: Base64: What is the worst possible increase in space usage?
Edit: Based on your comment above, if you need to reduce size, you should look at compressing the data before you encrypt. This will get you the max benefit from compression. Compressing encrypted binary does not work.
This is because a Base64 string can contain only 64 distinct characters (so that it remains displayable), whereas a byte has 256 possible values and can therefore carry more information.
Base64 is a great way to represent binary data in a string using only standard, printable characters. It is not, however, a good way to represent string data because it takes more characters than the original string.
