Python 3: converting bytes to string

I have code written in Python 2.7 that I need to convert to Python 3.6. In the code, the zmq function recv_multipart outputs an array of bytes:
msg = b'\x80\x03csome_message'
that I need to convert to a string. If I do
msg.decode()
I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
It does not seem that zmq can output a string directly, so what can I do to convert the output to a string?
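The frame in the example is not valid UTF-8, so .decode() with the default codec will always fail. Below is a minimal sketch of the usual workarounds; the \x80\x03 prefix suggests the payload may actually be pickled data, which would need pickle.loads rather than decoding, but that is an assumption about the sender.
msg = b'\x80\x03csome_message'
# If the sending side pickled the payload (\x80\x03 is the pickle protocol 3
# header), pickle.loads(msg) would be the right call; this sample frame is
# truncated, so it is only treated as raw bytes here.
# Lossless bytes -> str mapping: every byte value 0-255 is valid Latin-1.
as_text = msg.decode('latin-1')
# Or keep UTF-8 and substitute U+FFFD for undecodable bytes.
as_text_lossy = msg.decode('utf-8', errors='replace')
print(repr(as_text))
print(repr(as_text_lossy))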

Related

Python 3.6 ASCII inside hex bytes

I am receiving binary data, such as
data = b'\xaa\x44\x12\x1c\x2a'
When I try to parse each byte, what I actually see is
b'\xaaD\x12\x1c*'
Is there a reason why bytes 44 and 2a are displayed as ASCII instead of hex?
Is there a way to prevent this conversion?
I have tried:
data = data.hex()
print(data)
# print output: aa44121c2a
which somewhat maintains the format, but converts the data to a string, so I can no longer iterate over bytes, only over characters.
Any suggestions?
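The escape-vs-ASCII mix is only how repr() displays the bytes; nothing is actually converted. A minimal sketch, assuming the goal is one integer per byte, showing that iterating over the original bytes object is unaffected by the display:
data = b'\xaa\x44\x12\x1c\x2a'
# repr() shows printable bytes as ASCII (0x44 -> 'D', 0x2a -> '*'), but the values are unchanged.
print(data)  # b'\xaaD\x12\x1c*'
# Iterating over a bytes object yields one integer per byte.
for b in data:
    print(f'{b:#04x}')  # 0xaa 0x44 0x12 0x1c 0x2a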

converting byte object to utf-8

k = b'\xf2-\x92\xe7\x98\x90#\xddF\xbf\x13I4\x92\x0f\xc5'
I tried decoding it as 'utf-8', but I am getting an error:
'utf-8' codec can't decode byte 0xf2 in position 0: invalid continuation byte
How can I properly convert this to a string object?
Update:
OK, I had to look at your bytes; it was the wrong encoding. You need to use ISO-8859-1:
encoding = 'ISO-8859-1'
k = b'\xf2-\x92\xe7\x98\x90#\xddF\xbf\x13I4\x92\x0f\xc5'.decode(encoding)
print(type(k))
That will fix the issue.
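For context (an addition, not part of the original answer): ISO-8859-1 maps every byte value 0-255 to the code point of the same value, so the decode can never fail, and encoding back with the same codec recovers the original bytes exactly:
k_bytes = b'\xf2-\x92\xe7\x98\x90#\xddF\xbf\x13I4\x92\x0f\xc5'
k = k_bytes.decode('ISO-8859-1')
print(type(k))  # <class 'str'>
# Round trip: ISO-8859-1 is a 1:1 byte <-> code point mapping for 0-255.
assert k.encode('ISO-8859-1') == k_bytes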

Python3 decoding binary string with hex numbers higher than \x7f

I am trying to port some bmv2 Thrift Python 2 code to Python 3 and have the following problem:
Python 2:
import struct
def demo(byte_list):
    f = 'B' * len(byte_list)
    r = struct.pack(f, *byte_list)
    return r
demo([255, 255])
"\xff\xff"
Ported to Python 3, it returns a bytes object b"\xff\xff" because the struct module changed.
If I try to decode it with r.decode(), an exception is thrown because \xff can never appear in valid UTF-8.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
The easiest solution would be to build the string myself. I tried with a hand-made string like "\x01" and it works; if I try "\xff" it does not work with Thrift. I think this is because "\xff" becomes "ÿ" in Unicode and the Thrift server expects "\xff".
I tried different encodings and raw strings.
TL;DR: Is there any way in Python 3 to decode a byte string containing \xff, or in general any byte above \x7f (ordinal 127)? That is, b"\xff" => "\xff", or can I keep the old Python 2 struct behaviour?
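For what it's worth, here is a sketch of the latin-1 route for byte values above \x7f; whether the Thrift/bmv2 layer then accepts the resulting str is an assumption, not something verified here:
import struct

def demo(byte_list):
    f = 'B' * len(byte_list)
    return struct.pack(f, *byte_list)

r = demo([255, 255])       # b'\xff\xff' under Python 3
# latin-1 maps byte 0xff to code point U+00FF ('ÿ'), so decoding never fails...
s = r.decode('latin-1')
print(s)                   # ÿÿ
# ...and encoding with latin-1 restores the exact original bytes.
assert s.encode('latin-1') == r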

Using Protocol Buffer to Serialize Bytes Python3

I am trying to serialize a bytes object, which is an initialization vector for my program's encryption. But the Google Protocol Buffer only accepts strings. It seems like the error starts with casting bytes to string. Am I using the correct method to do this? Thank you for any help or guidance!
Or also, can I make the Initialization Vector a string object for AES-CBC mode encryption?
Code
Cast the bytes to a string:
string_iv = str(bytes_iv, 'utf-8')
Serialize the string using SerializeToString():
serialized_iv = IV.SerializeToString()
Use ParseFromString() to recover the string:
IV.ParseFromString( serialized_iv )
And finally, UTF-8 encode the string back to bytes:
bytes_iv = bytes(IV.string_iv, encoding= 'utf-8')
Error
string_iv = str(bytes_iv, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 3: invalid start byte
If you must cast an arbitrary bytes object to str, these are your options:
simply call str() on the object. It will turn it into repr form, i.e. something that could be parsed as a bytes literal, e.g. "b'abc\x00\xffabc'"
decode with "latin1". This will always work, even though it technically makes no sense if the data isn't text encoded with Latin-1.
use base64 or base85 encoding (the standard library has a base64 module which covers both); see the sketch below.
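A minimal sketch of the base64 option, leaving the protobuf message definition aside (bytes_iv and the 16-byte length are stand-ins for the real initialization vector):
import base64
import os

bytes_iv = os.urandom(16)  # stand-in for the real AES-CBC IV
# base64 turns the raw bytes into ASCII-safe text that fits a protobuf string field.
string_iv = base64.b64encode(bytes_iv).decode('ascii')
# After ParseFromString() on the receiving side, reverse the encoding.
recovered_iv = base64.b64decode(string_iv)
assert recovered_iv == bytes_iv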

python 3, unicode conversion, two \u0000 as one character

My Python 3 script receives strings from a C++ program via a pipe.
The strings are encoded as Unicode code points, and I need to decode them correctly.
For example, consider a string that contains Cyrillic characters: 'тест test'
Encoding this string in Python 3 with print('тест test'.encode()) gives b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
The C++ program encodes this string as: b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
The encoded strings look very similar: Python 3 uses \x escapes (2 hex digits) and the C++ program uses \u escapes (4 hex digits).
But I can't figure out how to convert b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test' to 'тест test'.
The main problem is that Python 3 treats b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082' as an 8-character string, even though it should contain only 4 characters.
If the string you receive from C++ is the following in Python:
s = b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
Then this will decode it:
result = s.decode('unicode-escape').encode('latin1').decode('utf8')
print(result)
Output:
тест test
The first stage converts the byte string received into a Unicode string:
>>> s1 = s.decode('unicode-escape')
>>> s1
'Ñ\x82еÑ\x81Ñ\x82 test'
Unfortunately, the Unicode codepoints are really UTF-8 byte values. The latin1 encoding is a 1:1 mapping of the first 256 Unicode codepoints, so encoding with this codec converts the codepoints back to byte values in a byte string:
>>> s2 = s1.encode('latin1')
>>> s2
b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
Now the byte string can be decoded to the correct Unicode string:
>>> s3 = s2.decode('utf8')
>>> s3
'тест test'
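Putting the three stages together, a small helper function (a sketch; it assumes the C++ side always emits \u00XX escapes for raw UTF-8 bytes, as in the example above):
def decode_cpp_escapes(raw: bytes) -> str:
    # 1. Interpret the literal \u00XX escapes: one code point per original byte.
    codepoints = raw.decode('unicode-escape')
    # 2. latin1 maps code points 0-255 back onto the raw UTF-8 byte values.
    utf8_bytes = codepoints.encode('latin1')
    # 3. Decode those bytes as the UTF-8 text they really are.
    return utf8_bytes.decode('utf8')

# rb'' keeps the backslashes literal, matching what arrives over the pipe.
s = rb'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
print(decode_cpp_escapes(s))  # тест test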
