How to get a byte given by its code in Python 3?

I have the code of a byte and I want to get the corresponding bytes object. For example, if the code is 65, the result should be b'A'. I know there is a simple way to do it:
b = chr(65).encode()
print(b) # b'A'
But it seems too slow and overcomplicated because of the intermediate conversion to a string. Is there a fast and elegant way to do the same in Python 3?

Use the bytes constructor:
>>> bytes([65])
b'A'
There is also the to_bytes method, mostly useful if the integer represents multiple bytes but it also works for 1:
>>> (65).to_bytes(1, 'big') # big or little endian makes no difference for 1
b'A'
To go the other way, from a bytes object back to the integer code, just index it:
>>> b'A'[0]
65
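As a minimal sketch of the constructor approach (the helper name byte_from_code is only illustrative, not a library function): the argument has to be an iterable of integers, because passing a bare integer creates that many zero bytes instead.

```python
def byte_from_code(code):
    """Return a length-1 bytes object for an integer in range(256)."""
    # bytes() takes an iterable of ints; each must satisfy 0 <= value < 256,
    # otherwise a ValueError is raised.
    return bytes([code])

single = byte_from_code(65)   # one byte with value 65
several = bytes([72, 105])    # the same constructor handles several codes
zero_filled = bytes(3)        # pitfall: a bare int gives 3 zero bytes
round_trip = single[0]        # indexing returns the integer code again
```

Unlike chr(code).encode(), this never goes through str, so codes from 128 to 255 still produce exactly one byte.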

Related

Data Being Read as Strings instead of Floats

A Pytorch program, which I don't fully understand, produced an output and wrote it into weight.txt. I'm trying to do some further calculations based on this output.
I'd like the output to be interpreted as a list of length 3, each entry of which is a list of floats of length 240.
I use this to load in the data
w=open("weight.txt","r")
weight=[]
for number in w:
    weight.append(number)
print(len(weight)) yields 3. So far so good.
But then print(len(weight[0])) yields 6141. That's bad!
On closer inspection, it's because weight[0] is being read character-by-character instead of number-by-number. So for example, print(weight[0][0]) yields - instead of -1.327657848596572876e-01. These numbers are separated by single spaces, which are also being read as characters.
How do I fix this?
Thank you
Edit: I tried making a repair function:
def repair(S):
    numbers=[]
    num=''
    for i in range(len(S)):
        if S[i]!=' ':
            num+=S[i]
        elif S[i]==' ':
            num=float(num)
            numbers.append(num)
            num=''
        elif i==len(S)-1:
            num+=S[i]
            num=float(num)
            numbers.append(num)
    return numbers
Unfortunately, print(repair('123 456')) returns [123.0] instead of the desired [123.0, 456.0].
You haven't told us what your input file looks like, so it's hard to give an exact answer. But, assuming it looks like this:
123 312.8 12
2.5 12.7 32
the following program:
w=open("weight.txt","r")
weight=[]
for line in w:
    for n in line.split():
        weight.append(float(n))
print(weight)
will print:
[123.0, 312.8, 12.0, 2.5, 12.7, 32.0]
which is closer to what you're looking for, I presume?
The crux of the issue here is that for number in w in your program simply goes through each line: You have to have another loop to split that line into its constituents and then convert appropriately.
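If the goal is one list of floats per line (as in the question) rather than a single flat list, the same idea can be sketched as a nested comprehension. The helper name load_rows and the in-memory sample are assumptions for illustration; with the real file you would pass the opened file object instead.

```python
import io

def load_rows(lines):
    """Return one list of floats per non-empty line of whitespace-separated numbers."""
    return [[float(tok) for tok in line.split()] for line in lines if line.strip()]

# Stand-in for open("weight.txt"), using the sample layout assumed above.
sample = io.StringIO("123 312.8 12\n2.5 12.7 32\n")
rows = load_rows(sample)
```

With a real file, `with open("weight.txt") as w: rows = load_rows(w)` also closes the file when done.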

XOR two strings of different length

So I am trying to XOR two strings together but am unsure if I am doing it correctly when the strings are different length.
The method I am using is as follows.
def xor_two_str(a,b):
    xored = []
    for i in range(max(len(a), len(b))):
        xored_value = ord(a[i%len(a)]) ^ ord(b[i%len(b)])
        xored.append(hex(xored_value)[2:])
    return ''.join(xored)
I get output like so.
abc XOR abc: 000
abc XOR ab: 002
ab XOR abc: 5a
space XOR space: 0
I know something is wrong and I will eventually want to convert the hex value to ascii so am worried the foundation is wrong. Any help would be greatly appreciated.
Your code looks mostly correct (assuming the goal is to reuse the shorter input by cycling back to the beginning), but your output has a minor problem: it's not fixed width per character, so you could get the same output from two pairs of characters with a small (< 16) difference as from a single pair of characters with a large difference.
Assuming you're only working with "bytes-like" strings (all inputs have ordinal values below 256), you'll want to pad your hex output to a fixed width of two with leading zeroes, changing:
xored.append(hex(xored_value)[2:])
to:
xored.append('{:02x}'.format(xored_value))
which saves a temporary string (hex + slice makes the longer string then slices off the prefix, when format strings can directly produce the result without the prefix) and zero-pads to a width of two.
There are other improvements possible for more Pythonic/performant code, but that should be enough to make your code produce usable results.
Side-note: When running your original code, xor_two_str('abc', 'ab') and xor_two_str('ab', 'abc') both produced the same output, 002, which is what you'd expect (since xor-ing is commutative, and you cycle the shorter input, reversing the arguments to any call should produce the same results). Not sure why you think it produced 5a. My fixed code just makes the outputs 000000, 000002, 000002, and 00; padded properly, but otherwise unchanged from your results.
As far as other improvements to make, manually converting character by character, and manually cycling the shorter input via remainder-and-indexing is a surprisingly costly part of this code, relative to the actual work performed. You can do a few things to reduce this overhead, including:
Convert from str to bytes once, up-front, in bulk (runs in roughly one seventh the time of the fastest character by character conversion)
Determine up front which string is shortest, and use itertools.cycle to extend it as needed, and zip to directly iterate over paired byte values rather than indexing at all
Together, this gets you:
from itertools import cycle
def xor_two_str(a,b):
    # Convert to bytes so we iterate by ordinal, determine which is longer
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = []
    for x, y in zip(long, cycle(short)):
        xored_value = x ^ y
        xored.append('{:02x}'.format(xored_value))
    return ''.join(xored)
or to make it even more concise/fast, we just make the bytes object without converting to hex (and just for fun, use map+operator.xor to avoid the need for Python level loops entirely, pushing all the work to the C layer in the CPython reference interpreter), then convert to hex str in bulk with the (new in 3.5) bytes.hex method:
from itertools import cycle
from operator import xor
def xor_two_str(a,b):
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = bytes(map(xor, long, cycle(short)))
    return xored.hex()
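Since the question also mentions converting the hex value back later, here is a minimal sketch of the round trip, assuming the fixed-width hex output described above. The helper name xor_bytes is hypothetical; it works on bytes directly so the result can be fed back in.

```python
from itertools import cycle
from operator import xor

def xor_bytes(data, key):
    """XOR two bytes objects, cycling the shorter one over the longer one."""
    short, long = sorted((data, key), key=len)
    return bytes(map(xor, long, cycle(short)))

hex_out = xor_bytes(b'abc', b'ab').hex()              # fixed width: two hex digits per byte
recovered = xor_bytes(bytes.fromhex(hex_out), b'ab')  # XOR with the same key undoes it
text = recovered.decode('latin-1')                    # back to str, mirroring the encode step
```

bytes.fromhex only works reliably here because every byte was written as exactly two hex digits; the unpadded output from the original code cannot be decoded unambiguously.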

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequences of markers. Each Marker starts with FF xx, where xx is a specific Marker-ID.
So my idea, was to simply read in the image in binary format, and seek for the corresponding character-combinations in the binary stream. This should enable me to split the header in the corresponding marker-fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)?
I would expect something like this:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are not characters I expect in the hex representation of a byte string (only 0-9 and a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (with a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two-bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each 2 bytes or use unpack (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because (I assume) these are the conventional ways to represent JPEG 2-byte codes. Check the struct documentation if you want to deviate from this.
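To connect this back to the original goal of splitting the header into marker fields, here is a rough sketch, not a complete JPEG parser. It assumes the usual segment layout (0xFF, a marker ID, then a 2-byte big-endian length that counts itself but not the marker) and stops at the start-of-scan marker 0xDA; fill bytes and markers without a length field are not handled. The name iter_segments is made up for this example.

```python
import struct

def iter_segments(data):
    """Yield (marker_id, payload) for each header segment after the SOI marker."""
    if data[:2] != b'\xff\xd8':
        raise ValueError('missing SOI marker (FF D8)')
    pos = 2
    while pos + 4 <= len(data):
        # One byte 0xFF, one byte marker ID, one big-endian unsigned short length.
        prefix, marker, length = struct.unpack('>BBH', data[pos:pos + 4])
        if prefix != 0xFF or marker == 0xDA:
            break  # not a marker, or start of scan: compressed data follows
        # The length includes its own two bytes, so the payload is length - 2 bytes.
        yield marker, data[pos + 4:pos + 2 + length]
        pos += 2 + length

# The 20 bytes shown earlier: SOI followed by one APP0 (JFIF) segment.
sample = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
segments = list(iter_segments(sample))
```

For a real file, read more than 20 bytes (or the whole file) so that every segment before the scan data is available.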

Fastest way to convert gmpy2 mpz to a bigendian byte-array (bytes)?

I tried several different approaches, from
def ToByteArray(x):
    x = int(x)
    return x.to_bytes((x.bit_length() + 7) // 8, byteorder='big')
or dividing x by 256 and building a new bytearray in a loop but it just feels slow compared to the conversion of a normal python int or in gmpy2 c++.
Isn't there something like an mpz_export in c++? What is the fastest way to accomplish this?
Edit: The reason I need to convert it to bytes is that hashlib cannot hash mpz. If there is another fast way to get a strong cryptographic (sha256) hash of an mpz, without having to convert it to bytes first, that might help as well!
I think gmpy2.to_binary() will do what you need. It converts a gmpy2 object to a portable byte sequence. It uses mpz_export to convert the underlying mpz_t to a sequence of bytes. A short header containing the gmpy2 type and the length is placed at the beginning of the byte sequence. For the gmpy2.mpz type (and assuming the value is not 0), the header is two bytes long.
>>> gmpy2.to_binary(gmpy2.mpz(123456**7))
b'\x01\x01\x00\x00\x00\x00\x00\xe4\x9f\xcc\xfb\xad\xe5\x1f\xec.T'
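Judging from this sample output alone, the bytes after the two-byte header appear least-significant first, so a big-endian result needs the payload reversed. The sketch below checks that reading with plain Python on the bytes shown above; it does not call gmpy2, and the header length is taken from the description above rather than verified against other values.

```python
# Sample output copied from the to_binary() call above.
blob = b'\x01\x01\x00\x00\x00\x00\x00\xe4\x9f\xcc\xfb\xad\xe5\x1f\xec.T'

header, magnitude = blob[:2], blob[2:]       # assumes the two-byte mpz header
value = int.from_bytes(magnitude, 'little')  # decode the payload as little-endian
big_endian = magnitude[::-1]                 # reversed payload is big-endian
```

Whether slicing and reversing is faster than int(x).to_bytes(...) for your sizes is something to measure rather than assume.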

Most concise, Pythonic way to represent a null byte as a binary (bytes) string in Python 3

I needed to add some padding to a bytes string. This is what I came up with:
if padding_len > 0:
    data += bytes.fromhex('00') * padding_len
Is there a nicer way of representing a null byte in Python 3 than bytes.fromhex('00')?
>>> bytes.fromhex('00') == b'\x00'
True
>>> b'\x00' * 10
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
EDIT:
As hobbs points out in a comment below, it is certainly possible to use b'\0' instead of b'\x00' and as j-f-sebastian points out, it is not confusing in this instance.
But I wouldn't do it, anyway. It certainly works in this context (and it saves you two characters, if that's important). It will even work in the most common other context, where you are building strings for C and putting a null byte at the end.
But in the most general case, it can lead to confusion, because the compiler's interpretation of b'\0' is highly data dependent. In other words, it changes according to what comes after that zero.
>>> b'\0ABC' == b'\00ABC'
True
>>> b'\0ABC' == b'\000ABC'
True
>>> b'\0ABC' == b'\0000ABC'
False
If you are debugging late at night when not all your brain cells are functioning, it is highly frustrating to have the length of a string change because you replaced a character in the string. All you have to do to avoid this is to always use two extra characters. It doesn't matter whether you use \x00 (hexadecimal) or \000 (octal): both of those will work properly no matter the value of the following character.
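A short sketch of the pitfall: an octal escape consumes up to three octal digits, so a short \0 followed by digit characters is read as a single, different byte. The variable names are only for illustration.

```python
one_byte = b'\012'      # read as octal 012, i.e. a single newline byte
padded_hex = b'\x0012'  # NUL byte followed by the characters '1' and '2'
padded_oct = b'\00012'  # same three bytes, using a full three-digit octal escape

lengths = (len(one_byte), len(padded_hex), len(padded_oct))
```

Only the first literal collapses to one byte; the two fully written escapes keep the NUL and the following digits separate.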
