Converting a bytes object to UTF-8 (Python 3.x)

k = b'\xf2-\x92\xe7\x98\x90#\xddF\xbf\x13I4\x92\x0f\xc5'
I tried decoding it as 'utf-8', but I am getting an error:
'utf-8' codec can't decode byte 0xf2 in position 0: invalid continuation byte
How can I properly convert this to a string object?

#Update
OK, I looked at your bytes: they are not valid UTF-8, so 'utf-8' is the wrong codec for them. Decoding with ISO-8859-1 (Latin-1) works, because that codec accepts every byte value:
encoding = 'ISO-8859-1'
k = b'\xf2-\x92\xe7\x98\x90#\xddF\xbf\x13I4\x92\x0f\xc5'.decode(encoding)
print(type(k))
That will fix the issue
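ISO-8859-1 never fails because it maps each byte value 0-255 to the code point with the same number, so the decode is reversible even when the result is not meaningful text. A minimal sketch using the bytes from the question (the names raw, text and restored are only illustrative):

```python
# Bytes from the question; they look like binary data rather than encoded text.
raw = b'\xf2-\x92\xe7\x98\x90#\xddF\xbf\x13I4\x92\x0f\xc5'

# Latin-1 maps byte n to code point n, so this decode cannot fail.
text = raw.decode('ISO-8859-1')

# Encoding with the same codec restores the original bytes exactly.
restored = text.encode('ISO-8859-1')

print(type(text), len(text))
print(restored == raw)
```

If the str is only a container for the data, this round trip is enough; if it is meant to be shown to a user, the characters will most likely look like noise.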

Related

Python3 decoding binary string with hex numbers higher than \x7f

I am trying to port some bmv2 thrift Python 2 code to Python 3 and have the following problem.
Python 2:
import struct

def demo(byte_list):
    f = 'B' * len(byte_list)
    r = struct.pack(f, *byte_list)
    return r

demo([255, 255])
"\xff\xff"
Ported to Python 3, it returns a bytes object, b"\xff\xff", because struct.pack now returns bytes.
If I try to decode it with r.decode(), an exception is raised because the byte 0xff is never valid in UTF-8:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
The easiest solution would be to build the string myself. I tried a hand-made string like "\x01" and it works, but "\xff" does not work with thrift. I think this is because "\xff" is "ÿ" in Unicode and the thrift server expects the raw byte \xff.
I tried different encodings and raw strings.
TL;DR: Is there any way to decode a binary string containing \xff, or in general bytes higher than \x7f (ord 127), in Python 3, so that b"\xff" => "\xff"? Or can I use the old Python 2 struct import?
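One possible approach, sketched under the assumption that the receiving API really needs a str whose code points equal the byte values, is to decode the packed bytes with latin-1. Whether the thrift client should get that str or the bytes object directly is not something the question settles.

```python
import struct

def demo(byte_list):
    # struct.pack returns bytes in Python 3
    fmt = 'B' * len(byte_list)
    return struct.pack(fmt, *byte_list)

packed = demo([255, 255])
print(packed)                      # b'\xff\xff'

# latin-1 maps each byte to the code point with the same value,
# so bytes above 0x7f decode without error.
as_text = packed.decode('latin-1')
print([ord(c) for c in as_text])   # [255, 255]

# The mapping is reversible.
print(as_text.encode('latin-1') == packed)
```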

Using Protocol Buffer to Serialize Bytes Python3

I am trying to serialize a bytes object - which is an initialization vector for my program's encryption. But, the Google Protocol Buffer only accepts strings. It seems like the error starts with casting bytes to string. Am I using the correct method to do this? Thank you for any help or guidance!
Or also, can I make the Initialization Vector a string object for AES-CBC mode encryption?
Code
Cast the bytes to a string
string_iv = str(bytes_iv, 'utf-8')
Serialize the string using SerializeToString():
serialized_iv = IV.SerializeToString()
Use ParseFromString() to recover the string:
IV.ParseFromString( serialized_iv )
And finally, UTF-8 encode the string back to bytes:
bytes_iv = bytes(IV.string_iv, encoding= 'utf-8')
Error
string_iv = str(bytes_iv, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 3: invalid start byte
If you must cast an arbitrary bytes object to str, these are your options:
Simply call str() on the object. This turns it into its repr form, i.e. something that could be parsed as a bytes literal, e.g. "b'abc\x00\xffabc'".
Decode with "latin1". This always works, even though it technically makes no sense if the data isn't text encoded with Latin-1.
Use base64 or base85 encoding (the standard library's base64 module covers both).
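The three options can be compared side by side. This is a rough sketch with a made-up bytes_iv value (not the asker's real IV), showing that each form can be turned back into the original bytes:

```python
import ast
import base64

# Hypothetical IV-like value; the byte 0x9b at index 3 makes UTF-8 decoding fail.
bytes_iv = b'\x01\x02\x03\x9b\x00\xffabc'

# Option 1: repr form (str(bytes_iv) produces the same text).
as_repr = repr(bytes_iv)
from_repr = ast.literal_eval(as_repr)

# Option 2: latin-1, one character per byte.
as_latin1 = bytes_iv.decode('latin-1')
from_latin1 = as_latin1.encode('latin-1')

# Option 3: base64, plain ASCII text.
as_b64 = base64.b64encode(bytes_iv).decode('ascii')
from_b64 = base64.b64decode(as_b64)

print(as_repr)
print(as_b64)
print(from_repr == from_latin1 == from_b64 == bytes_iv)
```

Base64 is usually the safest choice when the string has to pass through systems that expect printable text; latin-1 keeps the length unchanged but can contain control characters.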

How to Turn string into bytes?

I'm using Python 3 and I have a string that is displayed as escaped bytes:
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to turn it into readable Chinese characters by decoding it:
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print(result)
which shows like this and it is what I want
strategyName=百度
But if I store it in a str instead, it behaves differently:
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦
Please help me understand why this is happening and how I can fix it.
Thanks a lot.
Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.

Python 3 Conversion from bytes to string

I have code written in Python 2.7 that I need to port to Python 3.6. In the code, the zmq function recv_multipart returns bytes objects such as:
msg = b'\x80\x03csome_message'
that I need to convert to a string. If I do
msg.decode()
I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
It does not seem that zmq can directly output a string, so what can I do to convert the output to a string?
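If the goal is only to get some str out of these bytes, the standard decode error handlers and latin-1 are options. The sketch below uses the msg value from the question. Note also that the leading b'\x80\x03' matches the header of a pickle protocol 3 stream, so the payload may be pickled data rather than text; that is a guess from the bytes alone, and it would mean unpickling (only for data from a trusted sender) rather than decoding.

```python
msg = b'\x80\x03csome_message'

# Replace undecodable bytes with U+FFFD (lossy).
replaced = msg.decode('utf-8', errors='replace')

# Keep undecodable bytes visible as \xNN escapes (Python 3.5+).
escaped = msg.decode('utf-8', errors='backslashreplace')

# Map every byte to the code point with the same value (reversible).
latin = msg.decode('latin-1')

print(repr(replaced))
print(repr(escaped))
print(latin.encode('latin-1') == msg)
```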

Decoding of a encoded base64 string

I have a base64-encoded string S="aGVsbG8=". Now I want to decode the string into ASCII, UTF-8, UTF-16, UTF-32, CP-1256, ISO-8859-1, ISO-8859-2, ISO-8859-6, ISO-8859-15 and Windows-1252. How can I decode the string into the mentioned formats? For UTF-16 I tried the following code, but it gave the error "'bytes' object has no attribute 'deocde'".
base64.b64decode(encodedBase64String).deocde('utf-8')
The AttributeError itself comes from a typo: the method is decode, not deocde. Beyond that, please read the doc or docstring for the 3.x base64 module. The module works with bytes, not text. So your base64-encoded 'string' would be a byte string B = b"aGVsbG8=". The result of base64.decodebytes(B) is bytes: binary data with whatever encoding it has (text or image or ...). In this case, it is b'hello', which can be viewed as ASCII-encoded text. To change to other encodings, first decode to Unicode text and then encode to bytes in whatever other encoding you want. Most of the encodings you list above will produce the same bytes.
>>> import base64
>>> B=b"aGVsbG8="
>>> b = base64.decodebytes(B)
>>> b
b'hello'
>>> t = b.decode()
>>> t
'hello'
>>> t.encode('utf-8')
b'hello'
>>> t.encode('utf-16')
b'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>> t.encode('utf-32')
b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'
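The same decode-then-encode step can be run over the whole list from the question. This is a sketch only; the codec names are the Python spellings assumed to correspond to the encodings the asker listed, and the variable names are illustrative.

```python
import base64

S = "aGVsbG8="
raw = base64.b64decode(S)    # bytes produced by the base64 step
text = raw.decode('ascii')   # the payload here happens to be ASCII text

# Python codec names assumed to match the encodings named in the question.
codecs_wanted = ['ascii', 'utf-8', 'utf-16', 'utf-32', 'cp1256',
                 'iso-8859-1', 'iso-8859-2', 'iso-8859-6',
                 'iso-8859-15', 'cp1252']

# Encode the same text once per target encoding.
encoded = {name: text.encode(name) for name in codecs_wanted}

for name, data in encoded.items():
    print(name, data)
```

For plain ASCII text like this, only the UTF-16 and UTF-32 results differ from b'hello'; the single-byte encodings all share the ASCII range.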
