Decoding of a encoded base64 string - python-3.x

I have a base64 encoded string S="aGVsbG8=", now i want to decode the string into ASCII, UTF-8, UTF-16, UTF-32, CP-1256, ISO-8659-1, ISO-8659-2, ISO-8659-6, ISO-8659-15 and Windows-1252, How i can decode the string into the mentioned format. For UTF-16 I tried following code, but it was giving error "'bytes' object has no attribute 'deocde'".
base64.b64decode(encodedBase64String).deocde('utf-8')

Please read the doc or docstring for the 3.x base64 module. The module works with bytes, not text. So your base64 encoded 'string' would be a byte string B = b"aGVsbG8". The result of base64.decodebytes(B) is bytes; binary data with whatever encoding it has (text or image or ...). In this case, it is b'hello', which can be viewed as ascii-encoded text. To change to other encodings, first decode to unicode text and then encode to bytes in whatever other encoding you want. Most of the encodings you list above will have the same bytes.
>>> B=b"aGVsbG8="
>>> b = base64.decodebytes(B)
>>> b
b'hello'
>>> t = b.decode()
>>> t
'hello'
>>> t.encode('utf-8')
b'hello'
>>> t.encode('utf-16')
b'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>> t.encode('utf-32')
b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'

Related

How to Turn string into bytes?

Using python3 and I've got a string which displayed as bytes
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to change it into readable chinese letter through decode
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print()
which shows like this and it is what I want
strategyName=百度
But if I save it in another string,it works different
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©ç­ç¥
Please help me about why this happening,and how can I fix it.
Thanks a lot
Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.

Why are all my strings printed as string literals? [duplicate]

How do I print a bytes string without the b' prefix in Python 3?
>>> print(b'hello')
b'hello'
Use decode:
>>> print(b'hello'.decode())
hello
If the bytes use an appropriate character encoding already; you could print them directly:
sys.stdout.buffer.write(data)
or
nwritten = os.write(sys.stdout.fileno(), data) # NOTE: it may write less than len(data) bytes
If the data is in an UTF-8 compatible format, you can convert the bytes to a string.
>>> print(str(b"hello", "utf-8"))
hello
Optionally, convert to hex first if the data is not UTF-8 compatible (e.g. data is raw bytes).
>>> from binascii import hexlify
>>> print(hexlify(b"\x13\x37"))
b'1337'
>>> print(str(hexlify(b"\x13\x37"), "utf-8"))
1337
>>> from codecs import encode # alternative
>>> print(str(encode(b"\x13\x37", "hex"), "utf-8"))
1337
According to the source for bytes.__repr__, the b'' is baked into the method.
One workaround is to manually slice off the b'' from the resulting repr():
>>> x = b'\x01\x02\x03\x04'
>>> print(repr(x))
b'\x01\x02\x03\x04'
>>> print(repr(x)[2:-1])
\x01\x02\x03\x04
To show or print:
<byte_object>.decode("utf-8")
To encode or save:
<str_object>.encode('utf-8')
I am a little late but for Python 3.9.1 this worked for me and removed the -b prefix:
print(outputCode.decode())
It's so simple...
(With that, you can encode the dictionary and list bytes, then you can stringify it using json.dump / json.dumps)
You just need use base64
import base64
data = b"Hello world!" # Bytes
data = base64.b64encode(data).decode() # Returns a base64 string, which can be decoded without error.
print(data)
There are bytes that cannot be decoded by default(pictures are an example), so base64 will encode those bytes into bytes that can be decoded to string, to retrieve the bytes just use
data = base64.b64decode(data.encode())
Use decode() instead of encode() for converting bytes to a string.
>>> import curses
>>> print(curses.version.decode())
2.2

Python, base64, float

Do you have any idea how to encode and decode a float number with base64 in Python.
I am trying to use
response='64.000000'
base64.b64decode(response)
the expected output is 'AAAAAAAALkA=' but i do not get any output for float numbers.
Thank you.
Base64 encoding is only defined for byte strings, so you have to convert your number into a sequence of bytes using struct.pack and then base64 encode that. The example you give looks like a base64 encoded little-endian double. So (for Python 2):
>>> import struct
>>> struct.pack('<d', 64.0).encode('base64')
'AAAAAAAAUEA=\n'
For the reverse direction you base64 decode and then unpack it:
>>> struct.unpack('<d', 'AAAAAAAALkA='.decode('base64'))
(15.0,)
So it looks like your example is 15.0 rather than 64.0.
For Python 3 you need to also use the base64 module:
>>> import struct
>>> import base64
>>> base64.encodebytes(struct.pack('<d', 64.0))
b'AAAAAAAAUEA=\n'
>>> struct.unpack('<d', base64.decodebytes(b'AAAAAAAALkA='))
(15.0,)

Convert string of encoded escape sequences to Unicode in Python 3

I'm attempting to decode a Python string containing a series of Shift-JIS escape sequences in Python. When I create a bytes literal containing the sequences, I can use decode('shift-jis') to get the expected result.
>>> seq = b'\201u\202\240\202\246\202\244\202\242\202\250\201v'
>>> seq.decode("shift-jis")
'「あえういお」'
The problem is that the sequences are passed in as a plain Python string. When I use str.encode, the sequence is interpreted as Unicode and extra bytes of \xc2 are inserted:
>>> seq = "\201u\202\240\202\246\202\244\202\242\202\250\201v"
>>> str.encode(seq)
b'\xc2\x81u\xc2\x82\xc2\xa0\xc2\x82\xc2\xa6\xc2\x82\xc2\xa4\xc2\x82\xc2\xa2\xc2\x82\xc2\xa8\xc2\x81v'
Is there a way to directly convert a Python string containing encoded escape sequences into a bytes literal, in the same way as placing a b in front of a string produces a bytes literal with the escaped characters?
Str.encode defaults to using utf-8 encoding. hence you get the utf-8 \xc2 prefixes. (Check Wikipedia for details if you want.) What you want instead is for codepoints 0 to 255 to be turned into bytes 0 to 255. In others words, the same data in an object of a different class. Latin-1 does this.
>>> seqb = seq.encode('latin-1')
>>> seqb.decode('shift-jis')
'「あえういお」'

python 3, unicode conversion, two \u0000 as one character

My python3 script receives strings from c++ program via pipe.
Strings encoded via Unicode code points. I need to decode it correctly.
For example, consider string that contain cyrillic symbols: 'тест test'
Try to encode this string using python3: print('тест test'.encode()). We got b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
C++ program encodes this string like: b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
Encoded strings looks very similar - python3 uses \x (2bits) and c++ program uses \u (4bits).
But I can't figure out how to convert b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test' to 'тест test'.
Main problem - python3 consider b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082' as 8-character string, but it contain only 4 characters
If the string you receive from C++ is the following in Python:
s = b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
Then this will decode it:
result = s.decode('unicode-escape').encode('latin1').decode('utf8')
print(result)
Output:
тест test
The first stage converts the byte string received into a Unicode string:
>>> s1 = s.decode('unicode-escape')
>>> s1
'Ñ\x82еÑ\x81Ñ\x82 test'
Unfortunately, the Unicode codepoints are really UTF-8 byte values. The latin1 encoding is a 1:1 mapping of the first 256 Unicode codepoints, so encoding with this codec converts the codepoints back to byte values in a byte string:
>>> s2 = s1.encode('latin1')
>>> s2
b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
Now the byte string can be decoded to the correct Unicode string:
>>> s3 = s2.decode('utf8')
>>> s3
'тест test'

Resources