Fastest way to convert gmpy2 mpz to a bigendian byte-array (bytes)? - python-3.x

I tried several different approaches, from
def ToByteArray(x):
x = int(x)
return x.to_bytes((x.bit_length() + 7) // 8, byteorder='big')
or diving x by 256 and building a new bytearray in a loop but it just feels slow compared to the conversion of a normal python int or in gmpy2 c++.
Isn't there something like an mpz_export in c++? What is the fastest way to accomplish this?
Edit: The reason I need to convert it to bytes is that hashlib cannot hash mpz. If there is another fast way to get a strong cryptographic (sha256) hash of an mpz, without having to convert it to bytes first, that might help aswell!

I think gmpy2.to_binary() will do what you need. It converts a gmpy2 object to portable byte sequence. It uses mpz_export to convert the underlying mpz_t to a sequence of bytes. A short header containing the gmpy2 type and the length is placed at the beginning of the byte sequence. For the gmpy2.mpz type (and assuming the value is not 0), the header is two bytes long.
>>> gmpy2.to_binary(gmpy2.mpz(123456**7))
b'\x01\x01\x00\x00\x00\x00\x00\xe4\x9f\xcc\xfb\xad\xe5\x1f\xec.T'

Related

How to convert Hex into Signed Integer using python3

what is the simplest way to print the result as follows using pyhton3
I have a Hex string s="FFFC"
In python if using this command line: print(int(s,16))
The result I'm expecting is -4 (which is in signed format). But this is not the case, It displays the Unsigned format which the result is 65,532.
How can I convert this the easiest way?
Thank you in advance.
There are several ways, but you could just do the math explicitly (assuming s has no more than 4 characters, otherwise use s[-4:]):
i = int(s, 16)
if i >= 0x8000:
i -= 0x10000
You can use the bytes.fromhex and int.from_bytes class methods.
s = bytes.fromhex('FFFC')
i = int.from_bytes(s, 'big', signed=True)
print(i)
Pretty self-explanatory, the only thing that might need clarification is the 'big' argument, but that just means that the byte array s has the most significant byte first.

XOR two strings of different length

So I am trying to XOR two strings together but am unsure if I am doing it correctly when the strings are different length.
The method I am using is as follows.
def xor_two_str(a,b):
xored = []
for i in range(max(len(a), len(b))):
xored_value = ord(a[i%len(a)]) ^ ord(b[i%len(b)])
xored.append(hex(xored_value)[2:])
return ''.join(xored)
I get output like so.
abc XOR abc: 000
abc XOR ab: 002
ab XOR abc: 5a
space XOR space: 0
I know something is wrong and I will eventually want to convert the hex value to ascii so am worried the foundation is wrong. Any help would be greatly appreciated.
Your code looks mostly correct (assuming the goal is to reuse the shorter input by cycling back to the beginning), but your output has a minor problem: It's not fixed width per character, so you could get the same output from two pairs characters with a small (< 16) difference as from a single pair of characters with a large difference.
Assuming you're only working with "bytes-like" strings (all inputs have ordinal values below 256), you'll want to pad your hex output to a fixed width of two, with padding zeroes changing:
xored.append(hex(xored_value)[2:])
to:
xored.append('{:02x}'.format(xored_value))
which saves a temporary string (hex + slice makes the longer string then slices off the prefix, when format strings can directly produce the result without the prefix) and zero-pads to a width of two.
There are other improvements possible for more Pythonic/performant code, but that should be enough to make your code produce usable results.
Side-note: When running your original code, xor_two_str('abc', 'ab') and xor_two_str('ab', 'abc') both produced the same output, 002 (Try it online!), which is what you'd expect (since xor-ing is commutative, and you cycle the shorter input, reversing the arguments to any call should produce the same results). Not sure why you think it produced 5a. My fixed code (Try it online!) just makes the outputs 000000, 000002, 000002, and 00; padded properly, but otherwise unchanged from your results.
As far as other improvements to make, manually converting character by character, and manually cycling the shorter input via remainder-and-indexing is a surprisingly costly part of this code, relative to the actual work performed. You can do a few things to reduce this overhead, including:
Convert from str to bytes once, up-front, in bulk (runs in roughly one seventh the time of the fastest character by character conversion)
Determine up front which string is shortest, and use itertools.cycle to extend it as needed, and zip to directly iterate over paired byte values rather than indexing at all
Together, this gets you:
from itertools import cycle
def xor_two_str(a,b):
# Convert to bytes so we iterate by ordinal, determine which is longer
short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
xored = []
for x, y in zip(long, cycle(short)):
xored_value = x ^ y
xored.append('{:02x}'.format(xored_value))
return ''.join(xored)
or to make it even more concise/fast, we just make the bytes object without converting to hex (and just for fun, use map+operator.xor to avoid the need for Python level loops entirely, pushing all the work to the C layer in the CPython reference interpreter), then convert to hex str in bulk with the (new in 3.5) bytes.hex method:
from itertools import cycle
from operator import xor
def xor_two_str(a,b):
short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
xored = bytes(map(xor, long, cycle(short)))
return xored.hex()

How to get a byte given by its code in Python 3?

I have a code of a byte and I want to get corresponding bytes' sequence. For example, the code is 65, the sequence should be b'A'. I know there is a simple way to do it:
b = chr(65).encode()
print(b) # b'A'
But it seems to be too slow and overcharged because of converting to string in the middle. Is there a fast and elegant way to do same in Python 3?
Use the bytes constructor:
>>> bytes([65])
b'A'
There is also the to_bytes method, mostly useful if the integer represents multiple bytes but it also works for 1:
>>> (65).to_bytes(1, 'big') # big or little endian makes no difference for 1
b'A'
The other way is just indexing:
>>> b'A'[0]
65

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequences of markers. Each Marker starts with FF xx, where xx is a specific Marker-ID.
So my idea, was to simply read in the image in binary format, and seek for the corresponding character-combinations in the binary stream. This should enable me to split the header in the corresponding marker-fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does python not return me nice chunks of 2 bytes (in hex-format).
Somthing like this I would expect:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are no characters of a hex representation I expect from a byte string (only: 0-9, a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two-bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each 2 bytes or use unpack (there are actually several ways):
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the little-endian specifier < and unsigned short H in unpack, because (I assume) these are the conventional ways to represent JPEG 2-byte codes. Check the documentation if you want to derive from this.

what does it mean to convert octet strings to nonnegative integers in RSA?

I am trying to implement RSA PKCS #1 based on this spec
http://www.emc.com/emc-plus/rsa-labs/pkcs/files/h11300-wp-pkcs-1v2-2-rsa-cryptography-standard.pdf
However, I am not sure what the purpose of OS2IP is in page 9. Assume my message is integer 258 and private key is e. Also assume we don't do any other formatting besides OS2IP.
So I will convert the 258 into octet strings and store it into char buf[2] = {0x02, 0x01}. Now before I compute the exponentiation 258^e. I need to call OS2IP to reverse the byte order and save it to buf_new[2] = {0x01, 0x02}. Now 0x0102 = 258.
However, if I initially stored 258 as buf[2] = {0x01, 0x02}. Then there is no need to call OS2IP, correct? or is this the convention that I have to save it into {0x02, 0x01}?
OS2IP encodes an non negative integer into it's big endian representation.
However, if I initially stored 258 as buf[2] = {0x01, 0x02}. Then there is no need to call OS2IP, correct?
That is correct. 258 is already encoded in big endian although depending the length you choosed (!=2) you might be missing the leading zeros.
or is this the convention that I have to save it into {0x02, 0x01}?
I don't understand your question :/

Resources