How do I decode a dictionary of bytes to utf-8? - python-3.x

I'm trying to figure out how to convert the values of a dictionary from bytes to strings as the backend only supports primitive types.
oledata = {
'macros': macros,
'data': analysis
}
s = str(oledata)
save_data_to_s3(json.dumps(s), ['olevba3'])
As you can see, the values of this dict are bytes. This code runs without errors on my test sample, but the output has the b' prefix in front of the values (data), which will break the database. Dicts also have no decode() method, which is why I used str(), but it must be doing something wrong since the values still come out with the b' prefix. Which leads to my general question: how do you decode the values of a dictionary to UTF-8 strings?

my_str = b"Hello" # the b prefix makes this a bytes literal
new_str = my_str.decode('utf-8') # decode the bytes using the UTF-8 encoding
print(new_str)
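To apply the same decode() call to every value of a dictionary, a dict comprehension is one option. The following is only a minimal sketch: the helper name decode_values and the sample byte values are assumptions standing in for the question's macros and analysis, and it assumes every bytes value really is valid UTF-8.

```python
import json


def decode_values(data, encoding="utf-8"):
    """Return a copy of `data` with every bytes value decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in data.items()
    }


# Hypothetical stand-ins for the question's `macros` and `analysis` values.
oledata = {"macros": b"Sub AutoOpen()", "data": b"analysis result"}

decoded = decode_values(oledata)

# No str() call is needed: the dict now holds only str values,
# so json.dumps can serialize it directly.
payload = json.dumps(decoded)
print(payload)
```

Calling str() on the whole dict only produces its display form, which is why the b' prefix ended up inside the saved text.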

Related

How to convert a variable to a raw string?

If I have a string, "foo; \n", I can turn this into a raw string with r"foo; \n". If I have a variable x = "foo; \n", how do I convert x into a raw string? I tried y = rf"{x}" but this did not work.
Motivation:
I have a python string variable, res. I compute res as
big_string = """foo; ${bar}"""
from string import Template
t = Template(big_string)
res = t.substitute(bar="baz")
As such, res is a variable. I'd like to convert this variable into a raw string. The reason is I am going to POST it as JSON, but I am getting json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 620 (char 619). Previously, when testing I fixed this by converting my string to a raw string with: x = r"""foo; baz""" (keeping in line with the example above). Now I am not dealing with a big raw string. I am dealing with a variable that is a JSON representation of a string where I have replaced a single variable, bar above, with a list for a query, and now I want to convert this string into a raw string (e.g. r"foo; baz", yes I realize this is not valid JSON).
Update: As per this question I need a raw string. The question and answer flagged in the comments as duplicate do not work (res.encode('unicode_escape')).
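For what it's worth, the r prefix only changes how a literal in source code is parsed; once the string exists, there is no separate "raw" type to convert a variable into. The sketch below illustrates that, and shows one possible way to keep substituted values valid inside JSON by escaping them with json.dumps. The template text and the query value are made up for illustration and are not taken from the question.

```python
import json
from string import Template

x = "foo; \n"   # normal literal: ends with one real newline character
y = r"foo; \n"  # raw literal: ends with a backslash followed by 'n'

assert type(x) is type(y) is str  # both are plain str objects
assert len(x) == 6 and len(y) == 7

# Hypothetical template: let json.dumps escape the value being substituted,
# so control characters such as newlines become valid JSON escapes.
t = Template('{"query": ${bar}}')
res = t.substitute(bar=json.dumps("line one\nline two"))

assert json.loads(res) == {"query": "line one\nline two"}
```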

How to Turn string into bytes?

I'm using Python 3 and I've got a string that is displayed as bytes:
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to turn it into readable Chinese characters by decoding it:
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print(result)
This shows the following, which is what I want:
strategyName=百度
But if I store it in a str instead, it behaves differently:
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©ç­ç¥
Please help me understand why this is happening and how I can fix it.
Thanks a lot
Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.
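As a quick check of the reasoning above, this small sketch compares the two encodings of the question's str. The variable names raw, fixed and double_encoded are only illustrative.

```python
# The str from the question: each \xNN escape here is a code point
# below 256, not a byte.
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'

# latin-1 maps code points 0-255 to the byte with the same value,
# so this recovers the intended UTF-8 byte sequence.
raw = str0.encode('latin-1')
assert raw == b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'

fixed = raw.decode('utf-8')
assert fixed == 'strategyName=\u767e\u5ea6'

# Encoding the same str as UTF-8 instead turns each of the six
# non-ASCII code points into two bytes, which is where the garbled
# output in the question comes from.
double_encoded = str0.encode('utf-8')
print(len(raw), len(double_encoded))
```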

Why is this error appearing?

AttributeError: 'builtin_function_or_method' object has no attribute 'encode'
I'm trying to make a text to code converter as an example for an assignment and this is some code based off of some I found in my research,
import binascii
text = input('Message Input: ')
data = binascii.b2a_base64.encode(text)
text = binascii.a2b_base64.encode(data)
print (text), "<=>", repr(data)
data = binascii.b2a_uu(text)
text = binascii.a2b_uu(data)
print (text), "<=>", repr(data)
data = binascii.b2a_hqx(text)
text = binascii.a2b_hqx(data)
print (text), "<=>", repr(data)
can anyone help me get it working? it's supposed to take an input in and then convert it into hex and others and display those...
I am using Python 3.6 but I am also a little out of practice...
TL;DR:
data = binascii.b2a_base64(text.encode())
text = binascii.a2b_base64(data).decode()
print (text, "<=>", repr(data))
You've hit a common problem in Python 3: the str object vs the bytes object. A bytes object contains a sequence of bytes. One byte can hold any number from 0 to 255. The values 0 to 127 are commonly displayed through the ASCII table as characters such as English letters. In Python you should generally use bytes when working with binary data.
On the other hand, a str object contains a sequence of code points. One code point usually represents one character printed on your screen when you call print. To store or transmit a str it has to be encoded into bytes; in UTF-8, for example, the Chinese character 的 becomes a sequence of 3 bytes.
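The difference between code points and bytes can be seen directly with len(). A small illustration, assuming UTF-8 as the encoding:

```python
s = "的"                 # str: one code point
b = s.encode("utf-8")    # bytes: the UTF-8 encoding of that code point

assert len(s) == 1
assert len(b) == 3
assert b == b"\xe7\x9a\x84"

# Iterating over bytes yields integers in the range 0-255.
assert list(b) == [231, 154, 132]

# Decoding with the same encoding gives the original str back.
assert b.decode("utf-8") == s
```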
Now to your problem. The function requires a bytes object as input, but input() gave you a str object. To convert a str into bytes, call the str.encode() method on it.
data = binascii.b2a_base64(text.encode())
Your original call binascii.b2a_base64.encode(text) means: call the method encode of the function object binascii.b2a_base64 with the argument text. Function objects have no encode attribute, which is exactly what the AttributeError reports.
The function binascii.b2a_base64 returns a bytes object containing the original input encoded with the base64 algorithm. To get the original str back from the encoded data, call this:
# Take base64 encoded data and return it decoded as bytes object
decoded_data = binascii.a2b_base64(data)
# Convert bytes object into str
text = decoded_data.decode()
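It can be written as one line:
text = binascii.a2b_base64(data).decode()
WARNING: Your call to print does not do what you expect in Python 3. print (text), "<=>", repr(data) prints only text and then builds a tuple that is thrown away (the interactive console echoes that tuple, which makes it look as if it works). Pass everything to print as arguments instead: print(text, "<=>", repr(data)).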
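Putting the pieces together, here is a minimal, self-contained sketch of the base64 round trip. It uses a fixed sample string instead of input() so it can run unattended; the names restored and the sample text are assumptions for illustration.

```python
import binascii

text = "Hello, World"  # stand-in for the value returned by input()

# The original mistake: looking up .encode on the function itself.
try:
    binascii.b2a_base64.encode(text)
except AttributeError as exc:
    print("AttributeError:", exc)

# Correct order: encode the str to bytes, then pass the bytes to the function.
data = binascii.b2a_base64(text.encode("utf-8"))      # bytes in, bytes out
restored = binascii.a2b_base64(data).decode("utf-8")  # bytes back to str

print(restored, "<=>", repr(data))
```

Note that b2a_base64 appends a newline to its output by default, which is why repr(data) ends with \n.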

Incorporate Base64 encoded data in Python Web Service call

I am trying to make a web service call in Python 3. A subset of the request includes a base64 encoded string, which is coming from a list of Python dictionaries.
So I dump the list and encode the string:
j = json.dumps(dataDictList, indent=4, default = myconverter)
encodedData = base64.b64encode(j.encode('ASCII'))
Then, when I build my request, I add in that string. Because it comes back in bytes I need to change it to string:
...
\"data\": \"''' + str(encodedData) + '''\"
...
The response I'm getting from the web service is that my request is malformed. When I print out str(encodedData) I get:
b'WwogICAgewogICAgICAgICJEQVlfREFURSI6ICIyMDEyLTAzLTMxIDAwOjAwOjAwIiwKICAgICAgICAiQ0FMTF9DVFJfSUQiOiA1LAogICAgICAgICJUT1RfRE9MTEFSX1NBTEVTIjogMTk5MS4wLAogICAgICAgICJUT1RfVU5JVF9TQUxFUyI6IDQ0LjAsCiAgICAgICAgIlRPVF9DT1NUIjogMTYxOC4xMDM3MDAwMDAwMDA2LAogICAgICAgICJHUk9TU19ET0xMQVJfU0FMRVMiOiAxOTkxLjAKICAgIH0KXQ=='
If I copy this into a base64 decoder, I get gibberish until I remove the b' at the beginning as well as the last single quote. I think those are causing my request to fail. According to this note, though, I would think that the b' is ignored: What does the 'b' character do in front of a string literal?
I'll appreciate any advice.
Thank you.
Passing a bytes object to str() formats it for display; it does not decode the bytes into a string (you need to know the encoding for that to work):
In [1]: x = b'hello'
In [2]: str(x)
Out[2]: "b'hello'"
Note that str(x) actually starts with b' and ends with '. If you want to decode the bytes into a string, use bytes.decode:
In [5]: x = base64.b64encode(b'hello')
In [6]: x
Out[6]: b'aGVsbG8='
In [7]: x.decode('ascii')
Out[7]: 'aGVsbG8='
You can safely decode the base64 bytes as ASCII. Also, your JSON should be encoded as UTF-8, not ASCII. The following changes should work:
j = json.dumps(dataDictList, indent=4, default=myconverter)
encodedData = base64.b64encode(j.encode('utf-8')).decode('ascii')
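As a rough end-to-end sketch, the decoded base64 string can then be placed into the request with json.dumps rather than by string concatenation, which avoids hand-escaping quotes. The sample list and the one-field request envelope are assumptions; only the "data" field name comes from the question.

```python
import base64
import json

# Hypothetical stand-in for the question's dataDictList.
data_dict_list = [{"DAY_DATE": "2012-03-31 00:00:00", "CALL_CTR_ID": 5}]

j = json.dumps(data_dict_list, indent=4)

# b64encode returns bytes; decoding as ASCII yields a plain str
# without the b'...' wrapper that str() would add.
encoded_data = base64.b64encode(j.encode("utf-8")).decode("ascii")

# Assumed request envelope containing only the "data" field.
request_body = json.dumps({"data": encoded_data})

# Round trip: what the receiving side would do to recover the list.
received = json.loads(request_body)["data"]
recovered = json.loads(base64.b64decode(received).decode("utf-8"))
print(recovered == data_dict_list)
```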

bytes() initializer adding an additional byte?

I initialize a utf-8 encoding string in python3:
bytes('\xc2', encoding="utf-8", errors="strict")
but on writing it out I get two bytes!
>>> s = bytes('\xc2', encoding="utf-8", errors="strict")
>>> s
b'\xc3\x82'
Where is this additional byte coming from? Why should I not be able to encode any hex value up to 254 (I can understand that 255 is potentially reserved to extend to utf-16)?
The Unicode code point "\xc2" (which can also be written as "Â") is two bytes long when encoded as UTF-8, because UTF-8 uses a single byte only for code points up to U+007F. If you were expecting the single byte b'\xc2', you probably want a different encoding, such as "latin-1":
>>> s = bytes("\xc2", encoding="latin-1", errors="strict")
>>> s
b'\xc2'
If you are really creating "\xc2" directly with a literal though, there's no need to mess around with the bytes constructor to turn it into a bytes instance. Just use the b prefix on the literal to create the bytes directly:
s = b"\xc2"
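The following short sketch shows where the one-byte limit of UTF-8 lies and why a lone b"\xc2" is not itself valid UTF-8. It is only an illustration of the behaviour described above.

```python
ch = "\xc2"  # the single code point U+00C2
assert ch == "\u00c2"

# The same code point yields different bytes under different encodings.
assert ch.encode("utf-8") == b"\xc3\x82"
assert ch.encode("latin-1") == b"\xc2"

# UTF-8 uses one byte only up to U+007F; from U+0080 on it needs two or more.
assert "\x7f".encode("utf-8") == b"\x7f"
assert "\x80".encode("utf-8") == b"\xc2\x80"

# A lone 0xC2 byte is the start of a two-byte sequence, so decoding it fails.
try:
    b"\xc2".decode("utf-8")
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc)
```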
