What's really happening in this Python encryption script? - python-3.x

I'm currently learning to work with binary files in Python. I came across this code in the book I'm reading:
FILENAME = 'pc_rose_copy.txt'

def display_contents(filename):
    fp = open(filename, 'rb')
    print(fp.read())
    fp.close()

def encrypt(filename):
    fp = open(filename, 'r+b')
    text = fp.read()
    fp.seek(0)
    for c in text:
        if c <= 128:
            fp.write(bytes([c+128]))
        else:
            fp.write(bytes([c-128]))
    fp.close()

display_contents(FILENAME)
encrypt(FILENAME)
display_contents(FILENAME)
I have several doubts about this code for which I can't find an answer in the book:
1) In line 13 ("if c <= 128"): since the file was opened in binary mode, is each character read as its index in the ASCII table (i.e., is this equivalent to 'if ord(c) <= 128' had the file not been opened in binary mode)?
2) If so, then what's the point in checking if any character's index is higher than 128, since this is a .txt with a passage from Romeo and Juliet?
3) This point is more of a curiosity, so pardon the naivety. I know this doesn't apply in this case, but say the script encounters a 'c' with a byte value of 128, and so adds 128 to it. What would a byte value of 256 look like -- would it be 11111111 00000001?

What's really happening is that the script is toggling the most significant bit of every byte. This is equivalent to adding/subtracting 128 to each byte. You can see this by looking at the file contents before/after running the script (xxd -b file.txt on linux or mac will let you see the exact bits/bytes).
Here's a run on some sample text:
File Contents Before:
11110000 10011111 10011000 10000100 00001010
File Contents After:
01110000 00011111 00011000 00000100 10001010
Running the script twice (or any even number of times) restores the original text by toggling all of the high bits back to the original values.
Question / Answer:
1) If the file is ASCII-encoded, yes. e.g. for a file abc\n, the values of c are 97, 98, 99, and 10 (newline). You can verify this by adding print(c) inside the loop. This script will also work* on non-ASCII encoded files (the example above is UTF-8).
2) So that we can flip the bits back. Even if we were only handling ASCII files (which isn't guaranteed), the bytes we get from encrypting an ASCII file will be 128 or larger, since we've added 128 to each byte. So we still need to handle that case in order to decrypt our own files.
3) As is, the script crashes, because bytes() requires values in the range 0 <= x < 256 (see the documentation), and 128 + 128 = 256 is out of range. You can create a file that breaks the script with echo -n -e '\x80\x80\x80' > 128.txt. The script should be using < instead to handle this case properly (or toggle the bit with XOR, as sketched below).
* Except for 3)
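As an aside, a variant that sidesteps the range problem from 3) entirely (my own sketch, not from the book) is to toggle the high bit with XOR, which needs no if/else at all:
def toggle_high_bit(filename):
    # c ^ 0x80 flips the most significant bit of each byte,
    # so running this twice restores the original file.
    fp = open(filename, 'r+b')
    text = fp.read()
    fp.seek(0)
    fp.write(bytes(c ^ 0x80 for c in text))
    fp.close()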

I think that the encrypt function is also meant to be a decrypt function.
encrypt() goes from a text file to a binary file containing only high bytes, and the else clause handles going back from high bytes to text. I think that if you added an extra encrypt(FILENAME) call you'd get the original file back.
'c' cannot really be 128 in a plain text file; the highest value there would be 126 (~), and 127 is the DEL "character". In C (with an unsigned char), 128 + 128 would simply wrap around to 0, since the arithmetic is modulo 256. In Python, though, bytes([256]) raises a ValueError rather than wrapping.
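A quick interactive check of that last point (Python 3):
>>> bytes([(128 + 128) % 256])   # explicit modulo 256, like C's unsigned char wrap-around
b'\x00'
>>> bytes([128 + 128])           # Python does not wrap; 256 is out of range
Traceback (most recent call last):
  ...
ValueError: bytes must be in range(0, 256)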

Related

Advice for decoding binary/hex WAV file metadata - Pro Tools UMID chunk

Pro Tools (AVID's DAW software) has a process for managing and linking to all of its unique media using a Unique ID field, which gets embedded into the WAV file in the form of a umid metadata chunk. Examining a particular file inside Pro Tools, I can see that the file's Unique ID comes in the form of an 11-character string, looking like: rS9ipS!x6Tf.
When I examine the raw data inside the WAV file, I find a 32-byte block of data - 4 bytes for the chars 'umid'; 4 bytes for the size of the following data block - 24; then the 24-byte data block, which, when examined in Hex Fiend, looks like this:
00000000 0000002A 5B7A5FFB 0F23DB11 00000000 00000000
As you can see, there are only 9 bytes that contain any non-zero information, but this is somehow being used to store the 11-char Unique ID field. It looks to me as if something is being done to interpret this raw data to retrieve that Unique ID string, but all my attempts to decode the raw data have not been at all fruitful. I have tried using https://gchq.github.io/CyberChef/ to run it through all the different formats that would make sense, but nothing is pointing me in the right direction. I have also tried looking at the data in 6-bit increments to see if it's being compressed in some way (9 bytes * 8 bits == 72 == 12 blocks * 6 bits) but have not had any luck stumbling on a pattern yet.
So I'm wondering if anyone has any specific tips/tricks/suggestions on how best to figure out what might be happening here - how to unpack this data in such a way that I might be able to end up with enough information to generate those 11 chars, of what I'm guessing would most likely be UTF-8.
Any and all help/suggestions welcome! Thanks.
It seems to be a base64-style encoding, only with a slightly different character map. Here is my Python implementation, which I find best matches Pro Tools.
char_map = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#!"

def encode_unique_id(uint64_value):
    # unique id is a uint64_t, clamp to 64 bits
    value = uint64_value & 0xFFFFFFFFFFFFFFFF
    if value == 0:
        return ""
    # calculate the minimum number of bytes needed to store the value
    byte_length = 0
    tmp = value
    while tmp:
        tmp = tmp >> 8
        byte_length += 1
    # calculate the number of chars needed to store the encoding
    char_total, remainder = divmod(byte_length * 8, 6)
    if remainder:
        char_total += 1
    s = ""
    for i in range(char_total):
        value, index = divmod(value, 64)
        s += char_map[index]
    return s
Running encode_unique_id(0x2A5B7A5FFB0F23DB11) should give you rS9ipS!x6Tf
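For completeness, here is the inverse mapping (my own sketch, not anything documented by Pro Tools), which takes one of these strings back to an integer; note that the first character carries the least significant 6 bits:
def decode_unique_id(s):
    # Inverse of encode_unique_id above: undo the base-64 digits,
    # least significant group first.
    value = 0
    for i, c in enumerate(s):
        value += char_map.index(c) << (6 * i)
    return value
decode_unique_id('rS9ipS!x6Tf') returns 0x5B7A5FFB0F23DB11, i.e. the value after the 64-bit clamp, so the leading 0x2A byte seen in the chunk is not recovered.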

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequences of markers. Each Marker starts with FF xx, where xx is a specific Marker-ID.
So my idea was to simply read in the image in binary format and search for the corresponding byte combinations in the binary stream. This should enable me to split the header into the corresponding marker fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return me nice chunks of 2 bytes (in hex format)?
Something like this I would expect:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are not characters I would expect in the hex representation of a byte string (only 0-9, a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two-bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each 2 bytes or use unpack (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because (I assume) these are the conventional ways to represent JPEG 2-byte codes; JPEG stores multi-byte values big-endian. Check the documentation if you want to deviate from this.
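Once you can read the bytes, walking the marker segments is mostly a matter of reading a 2-byte marker, a 2-byte length, and skipping ahead. A minimal sketch (my own illustration, assuming a well-formed file; it stops at SOS, after which entropy-coded image data follows):
import struct

def list_markers(path):
    # Print (marker, segment length) pairs for the header segments of a JPEG.
    with open(path, 'rb') as f:
        if f.read(2) != b'\xff\xd8':                  # SOI must come first
            raise ValueError('not a JPEG file')
        while True:
            marker, = struct.unpack('>H', f.read(2))
            length, = struct.unpack('>H', f.read(2))  # length includes these 2 bytes
            print(hex(marker), length)
            if marker == 0xffda:                      # SOS: compressed data follows
                break
            f.seek(length - 2, 1)                     # skip the segment payload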

Python 3- check if buffered out bytes form a valid char

I am porting some code from Python 2.7 to 3.4.2, and I am stuck on the bytes vs. string complication.
I read this 3rd point in the wolf's answer
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in binary mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
So, when I buffer-read a file (say, 1 byte at a time) and the very first character happens to be a 6-byte Unicode character, how do I figure out how many more bytes need to be read? Because if I do not read the complete character, it will be skipped from processing, as the next read(x) will read x bytes relative to the last position (i.e. from halfway through that character's byte sequence).
I tried the following approach:
import sys, os

def getBlocks(inputFile, chunk_size=1024):
    while True:
        try:
            data = inputFile.read(chunk_size)
            if data:
                yield data
            else:
                break
        except IOError as strerror:
            print(strerror)
            break

def isValid(someletter):
    try:
        someletter.decode('utf-8', 'strict')
        return True
    except UnicodeDecodeError:
        return False

def main(src):
    aLetter = bytearray()
    with open(src, 'rb') as f:
        for aBlock in getBlocks(f, 1):
            aLetter.extend(aBlock)
            if isValid(aLetter):
                # print("char is now a valid one")  # just for acknowledgement
                # do more
                pass
            else:
                aLetter.extend(getBlocks(f, 1))
Questions:
Am I doomed if I try fileHandle.seek(-ve_value_here, 1)?
Python must have something built in to deal with this; what is it?
How can I really test whether the program meets its purpose of ensuring complete characters are read (right now I only have simple English files)?
How can I determine the best chunk_size to make the program faster? I mean, reading 1024 bytes where the first 1023 bytes are single-byte characters and the last one starts a 6-byte character leaves me with the only option of reading 1 byte at a time.
Note: I can't avoid buffered reading, as I do not know the range of input file sizes in advance.
The answer to #2 will solve most of your issues. Use an IncrementalDecoder via codecs.getincrementaldecoder. The decoder maintains state and only outputs fully decoded sequences:
#!python3
import codecs
import sys

byte_string = '\u5000\u5001\u5002'.encode('utf8')

# Get the UTF-8 incremental decoder.
decoder_factory = codecs.getincrementaldecoder('utf8')
decoder_instance = decoder_factory()

# Simple example, read two bytes at a time from the byte string.
result = ''
for i in range(0, len(byte_string), 2):
    chunk = byte_string[i:i+2]
    result += decoder_instance.decode(chunk)
    print('chunk={} state={} result={}'.format(chunk, decoder_instance.getstate(), ascii(result)))
result += decoder_instance.decode(b'', final=True)
print(ascii(result))
Output:
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result=''
chunk=b'\x80\xe5' state=(b'\xe5', 0) result='\u5000'
chunk=b'\x80\x81' state=(b'', 0) result='\u5000\u5001'
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result='\u5000\u5001'
chunk=b'\x82' state=(b'', 0) result='\u5000\u5001\u5002'
'\u5000\u5001\u5002'
Note after the first two bytes are processed the internal decoder state just buffers them and appends no characters to the result. The next two complete a character and leave one in the internal state. The last call with no additional data and final=True just flushes the buffer. It will raise an exception if there is an incomplete character pending.
Now you can read your file in whatever chunk size you want, pass them all through the decoder and be sure that you only have complete code points.
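Applied to your situation, reading the file in chunks of whatever size and yielding only complete characters could look roughly like this (a sketch, assuming a UTF-8 encoded file at path src; chunk_size can be whatever you like):
import codecs

def decoded_chunks(src, chunk_size=1024):
    # Raw bytes go in; only fully decoded characters come out.
    # Incomplete sequences stay buffered inside the decoder between reads.
    decoder = codecs.getincrementaldecoder('utf-8')()
    with open(src, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            text = decoder.decode(data)
            if text:
                yield text
        # final=True flushes the buffer; it raises if a character was left incomplete.
        tail = decoder.decode(b'', final=True)
        if tail:
            yield tail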
Note that with Python 3, you can just open the file and declare the encoding. Each chunk you read will then be complete Unicode code points, processed using an IncrementalDecoder internally:
input.txt (saved in UTF-8 without BOM)
我是美国人。
Normal text.
code
with open('input.txt', encoding='utf8') as f:
    while True:
        data = f.read(2)  # reads 2 Unicode code points, not bytes.
        if not data:
            break
        print(ascii(data))
Result:
'\u6211\u662f'
'\u7f8e\u56fd'
'\u4eba\u3002'
'\nN'
'or'
'ma'
'l '
'te'
'xt'
'.'

Python3: Converting or 'Casting' byte array string from string to bytes array

New to this python thing.
A little while ago I saved off output from an external device that was connected to a serial port as I would not be able to keep that device. I read in the data at the serial port as bytes with the intention of creating an emulator for that device.
The data was saved one 'byte' per line to a file, as in the example extract below:
b'\x9a'
b'X'
b'}'
b'}'
b'x'
b'\x8c'
I would like to read in each line from the data capture and append what would have been the original byte to a byte array.
I have tried various append() and concatenation (+=) operations on a bytearray, but the above lines are Python string objects and these operations fail.
Is there an easy way (a built-in way?) to add each of the original byte values of these lines to a byte array?
Thanks.
M
Update
I came across the .encode() string method and have created a crude function to meet my needs.
def string2byte(str):
    # the 'byte' string will consist of 5, 6 or 8 characters including the newline
    # as in b',' or b'\r' or b'\x0A'
    if len(str) == 5:
        return str[2].encode('Latin-1')
    elif len(str) == 6:
        return str[3].encode('Latin-1')
    else:
        return str[4:6].encode('Latin-1')
...well, it is functional.
If anyone knows of a more elegant solution perhaps you would be kind enough to post this.
b'\x9a' is a literal representation of the byte 0x9a in Python source code. If your file literally contains these seven characters b'\x9a', that is wasteful, because you could have saved it using only one byte. You could convert it back to a byte using ast.literal_eval():
import ast

with open('input') as file:
    a = b"".join(map(ast.literal_eval, file))  # assume 1 byte literal per line

Implement PKCS #7 Padding Scheme for AES in Python

I've written a small command-line utility to encrypt single files with AES, using Python 3. As I'm sure we all know, AES works on 16-byte blocks, so if I want to encrypt a file that isn't exactly a multiple of 16, then I'll have to pad the file to make it a multiple of 16. The PKCS #7 padding scheme says that I should pad the last chunk with N bytes, all of value N. This is how I do that in my encryption function:
for chunk in getChunks(plainFile, chunkSizeBytes):
    padLength = ((AES.block_size - len(chunk)) % AES.block_size)
    # We have to have padding!
    if padLength == 0:
        padLength = 16
    pad = chr(padLength) * padLength
    chunk += pad.encode('utf-8')
    # Write the encrypted chunk to an output file.
    cipherFile.write(en.encrypt(chunk))
However, I'm unsure about how I should read this data from that last chunk of a decrypted file. Is there a way to read in files in reverse order? What's the correct way to do this?
I should pad the last chunk with N bytes all of value N.
In this sentence, the first N is equal to the second N, which means the value of the padding byte tells you how many bytes you need to remove when decrypting.
For example, if you only have 9 bytes in your last chunk, pad with 7 bytes of value 7 (7 happens to be the BEL character, but that doesn't matter).
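So there is no need to read the file in reverse order: decrypt everything first, then strip the padding from the end of the resulting plaintext, since only the last block carries padding. A sketch of the unpadding step (the function name is my own):
def pkcs7_unpad(data, block_size=16):
    # The last byte of the plaintext says how many padding bytes were added.
    pad_length = data[-1]
    if not 1 <= pad_length <= block_size:
        raise ValueError('invalid padding')
    if data[-pad_length:] != bytes([pad_length]) * pad_length:
        raise ValueError('invalid padding')
    return data[:-pad_length]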
