Python3 decoding binary string with hex numbers higher than \x7f

I'm trying to port some bmv2 Thrift Python 2 code to Python 3 and have the following problem.
Python 2:
import struct

def demo(byte_list):
    # 'B' packs each list element as one unsigned byte.
    f = 'B' * len(byte_list)
    r = struct.pack(f, *byte_list)
    return r

demo([255, 255])  # returns "\xff\xff"
Ported to Python 3, it returns a bytes object b"\xff\xff" because the struct module changed.
If I try to decode it with r.decode(), an exception is raised because \xff is not a valid UTF-8 start byte:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
The easiest solution would be to build the string myself. I tried with a hand-made string like "\x01" and it works; if I try "\xff", it does not work with Thrift. I think that is because "\xff" is "ÿ" in Unicode and the Thrift server expects the raw byte "\xff".
I tried different encodings and raw strings.
TL;DR: Is there any way in Python 3 to decode a binary string containing \xff, or in general bytes higher than \x7f (ordinal 127)? That is, b"\xff" => "\xff". Or can I use the old Python 2 struct behaviour?
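For reference, a minimal sketch of one such round trip, using the latin-1 codec (the trick described in the last answer below):

r = b"\xff\xff"
# latin-1 maps bytes 0x00-0xFF one-to-one onto code points U+0000-U+00FF,
# so the round trip is lossless.
s = r.decode('latin-1')          # '\xff\xff', i.e. the str 'ÿÿ'
assert s.encode('latin-1') == r  # back to the original bytes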

Related

What's the correct way of using logging functions with Unicode on PyQt5?

The following code:
from PyQt5.QtCore import qCritical
s = 'Negociação'
qCritical(s)
raises a UnicodeEncodeError:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
The Qt C++ documentation says qCritical() expects a const char*, so I think that's why PyQt5's qCritical() doesn't accept a plain str, even though its function signature takes a str.
Encoding it to UTF-8 works, although that produces a bytes object, not a str:
qCritical(s.encode('utf-8'))
Is the signature of qCritical() in PyQt5 wrong, or should I do this differently? Must I call str.encode() every time I log a str that contains Unicode characters?
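One way to avoid repeating that at every call site is a small wrapper. This is only a sketch of the encode workaround from the question; log_critical is a hypothetical name:

from PyQt5.QtCore import qCritical

def log_critical(message):
    # Hypothetical helper: UTF-8-encode before passing the text on,
    # since qCritical() rejected the plain str above.
    qCritical(message.encode('utf-8'))

log_critical('Negociação')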

Converting 16-digit hexadecimal string into double value in Python 3

I want to convert 16-digit hexadecimal numbers into doubles. I actually did the reverse of this before, which worked fine:
import struct

def double_to_hex(doublein):
    # Reinterpret the double's 8 bytes as an unsigned 64-bit integer,
    # then render that integer as hex.
    return hex(struct.unpack('<Q', struct.pack('<d', doublein))[0])

# modified_list, encoded_list and print_command are defined elsewhere.
for i in modified_list:
    encoded_list.append(double_to_hex(i))
modified_list.clear()
encoded_msg = ''.join(encoded_list).replace('0x', '')
encoded_list.clear()
print_command('encode', encoded_msg)
And now I want to sort of do the reverse. I tried this without success:
from textwrap import wrap
import struct
import binascii

MESSAGE = 'c030a85d52ae57eac0129263c4fffc34'

# Split the message into 16-character (8-byte) chunks.
MSGLIST = wrap(MESSAGE, 16)
doubles = []
print(MSGLIST)
for msg in MSGLIST:
    doubles.append(struct.unpack('d', binascii.unhexlify(msg)))
print(doubles)
However, when I run this, I get crazy values, which are of course not what I put in:
[(-1.8561629252326087e+204,), (1.8922789420412524e-53,)]
Were your original numbers -16.657673995556173 and -4.642958715557189?
If so, then the problem is that your hex strings contain the big-endian (most-significant byte first) representation of the double, but the 'd' format string in your unpack call specifies conversion using your system's native format, which happens to be little-endian (least-significant byte first). The result is that unpack reads and processes the bytes of the unhexlify'ed string from the wrong end. Unsurprisingly, that will produce the wrong value.
To fix, do one of:
convert the hex string into little-endian format (reverse the bytes, so c030a85d52ae57ea becomes ea57ae525da830c0) before passing it to binascii.unhexlify, or
reverse the bytes produced by unhexlify (change binascii.unhexlify(msg) to binascii.unhexlify(msg)[::-1]) before you pass them to unpack, or
tell unpack to do the conversion using big-endian order (replace the format string 'd' with '>d')
I'd go with the last one, replacing the format string.
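Putting that together, a corrected version of the decoding loop (a sketch, reusing the MESSAGE from the question) would look like this:

import binascii
import struct
from textwrap import wrap

MESSAGE = 'c030a85d52ae57eac0129263c4fffc34'

# '>d' interprets each 8-byte chunk as a big-endian double.
doubles = [struct.unpack('>d', binascii.unhexlify(chunk))[0]
           for chunk in wrap(MESSAGE, 16)]
print(doubles)  # [-16.657673995556173, -4.642958715557189]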

Using Protocol Buffer to Serialize Bytes Python3

I am trying to serialize a bytes object, which is the initialization vector for my program's encryption, but Google Protocol Buffers only accepts strings. It seems like the error starts with casting the bytes to a string. Am I using the correct method to do this? Thank you for any help or guidance!
Or also, can I make the Initialization Vector a string object for AES-CBC mode encryption?
Code
Cast the bytes to a string:
string_iv = str(bytes_iv, 'utf-8')
Serialize the string using SerializeToString():
serialized_iv = IV.SerializeToString()
Use ParseFromString() to recover the string:
IV.ParseFromString( serialized_iv )
And finally, UTF-8 encode the string back to bytes:
bytes_iv = bytes(IV.string_iv, encoding= 'utf-8')
Error
string_iv = str(bytes_iv, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 3: invalid start byte
If you must cast an arbitrary bytes object to str, these are your options:
simply call str() on the object. It will turn it into repr form, i.e. something that could be parsed as a bytes literal, e.g. "b'abc\x00\xffabc'"
decode with "latin1". This will always work, even though it technically makes no sense if the data isn't text encoded with Latin-1.
use base64 or base85 encoding (the standard library has a base64 module which covers both); see the sketch below
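As a sketch of that last option, a base64 round trip keeps the IV intact through any string-only field (the 16-byte IV here is a made-up example):

import base64
import os

bytes_iv = os.urandom(16)  # hypothetical 16-byte AES IV

string_iv = base64.b64encode(bytes_iv).decode('ascii')  # ASCII-safe str
recovered = base64.b64decode(string_iv)                 # back to bytes
assert recovered == bytes_iv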

Yet another person who can't figure how to use unicode in Python

I know there are tons of questions about this, but somehow I could not find a solution to my problem (in Python 3):
toto="//\udcc3\udca0"
fp = open('cool', 'w')
fp.write(toto)
I get:
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 2: surrogates not allowed
How can I make it work?
One clarification: the string "//\udcc3\udca0" is given to me and I have no control over it. '\udcc3\udca0' is supposed to represent the character 'à'.
'\udcc3\udca0' is supposed to represent the character 'à'
The proper way to write 'à' using Python Unicode escapes is '\u00E0'. Its UTF-8 encoding is b'\xc3\xa0'.
It seems that whatever process produced your string was trying to use the UTF-8 representation, but instead of properly converting it to a Unicode string, it put the individual bytes in the U+DCxx range used by Python 3's surrogateescape convention.
>>> 'à'.encode('UTF-8').decode('ASCII', 'surrogateescape')
'\udcc3\udca0'
To fix the string, invert the operations that mangled it.
toto="//\udcc3\udca0"
toto = toto.encode('ASCII', 'surrogateescape').decode('UTF-8')
# At this point, toto == '//à', as intended.
fp = open('cool', 'w')
fp.write(toto)
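If the string cannot be repaired this way (i.e. the surrogates stand for arbitrary bytes rather than valid UTF-8), another option is to let the file codec pass the original bytes through unchanged. A sketch using the same surrogateescape handler:

toto = "//\udcc3\udca0"
# errors='surrogateescape' turns \udcc3 and \udca0 back into the raw
# bytes 0xc3 0xa0 on output, so the file contains '//à' in UTF-8.
with open('cool', 'w', encoding='utf-8', errors='surrogateescape') as fp:
    fp.write(toto)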

Converting octet strings to Unicode strings, Python 3

I'm trying to convert a string with octal-escaped Unicode back into a proper Unicode string as follows, using Python 3:
"training\345\256\214\346\210\220\345\276\214.txt" is the read-in string.
"training完成後.txt" is the string's actual representation, which I'm trying to obtain.
However, after skimming SO, the suggested solution for Python 3 that I found almost everywhere was the following:
decoded_string = bytes(myString, "utf-8").decode("unicode_escape")
Unfortunately, that seems to yield the wrong Unicode string when applied to my sample:
'trainingå®Â\x8cæÂ\x88Â\x90å¾Â\x8c.txt'
This seems easy to do with byte literals, as well as in Python 2, but unfortunately doesn't seem as easy with strings in Python 3. Help much appreciated, thanks! :)
Assuming your starting string is a Unicode string with literal backslashes, you first need a byte string to use the unicode-escape codec, but the octal escapes are UTF-8, so you'll need to convert it again to a byte string and then decode as UTF-8:
>>> s = r'training\345\256\214\346\210\220\345\276\214.txt'
>>> s
'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1')
b'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1').decode('unicode-escape')
'trainingå®\x8cæ\x88\x90å¾\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'training\xe5\xae\x8c\xe6\x88\x90\xe5\xbe\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'training完成後.txt'
Note that the latin1 codec does a direct translation of Unicode codepoints U+0000 to U+00FF to bytes 00-FF.
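Wrapped up as a small helper (the function name is just for illustration), the whole chain from the session above reads:

def unescape_octal_utf8(s):
    # str with literal backslash escapes -> bytes -> code points -> UTF-8 text
    return (s.encode('latin1')
             .decode('unicode-escape')
             .encode('latin1')
             .decode('utf8'))

print(unescape_octal_utf8(r'training\345\256\214\346\210\220\345\276\214.txt'))
# training完成後.txt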
