Convert Stringed Bytes Back Into Bytes - python-3.x

I'm working on a project which saves some bytes as a string, but I can't seem to figure out how to get the string back to actual bytes!
I've got this string:
"b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
As you can see, the type() function on the data returns a string instead of bytes:
<class 'str'>
How can I convert this string back to bytes?
Any help is appreciated, Thanks!

Try:
x="b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
y=x[2:-1].encode("latin-1")  # latin-1 maps code points 0-255 one-to-one onto bytes; utf-8 would turn '\x80' into b'\xc2\x80'
>>> print(y)
b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'
>>> print(type(y))
<class 'bytes'>
Your string is just the repr of a bytes object stored as a regular string, so it carries the redundant b'...' wrapper. You just need to drop the wrapper and encode what is left, and Python will do the rest for you ;)
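If the saved string is exactly what str() or repr() produced from a bytes object (so the text contains literal backslashes), another route is ast.literal_eval, which parses the b'...' literal without requiring you to pick an encoding. This is a minimal sketch under that assumption; the variable names are illustrative.

```python
import ast

original = b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'

# str() on a bytes object yields its repr, i.e. text that starts with b' and
# contains escape sequences such as \x80 spelled out as four characters.
stringed = str(original)

# literal_eval only parses Python literals; it does not execute arbitrary code.
restored = ast.literal_eval(stringed)

print(type(restored))        # <class 'bytes'>
print(restored == original)  # True
```

If the string was built some other way (for example, with real control characters inside it rather than backslash escapes), this will not apply as-is.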

In Python 3:
>>> a=b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>> b = list(a)
>>> b
[0, 0, 0, 0, 7, 128, 0, 3]
>>> c = bytes(b)
>>> c
b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>>

Related

python3: bytes vs bytearray, and converting to and from strings

I'd like to understand about python3's bytes and bytearray classes. I've seen documentation on them, but not a comprehensive description of their differences and how they interact with string objects.
bytes and bytearrays are similar...
python3's bytes and bytearray classes both hold arrays of bytes, where each byte can take on a value between 0 and 255. The primary difference is that a bytes object is immutable, meaning that once created, you cannot modify its elements. By contrast, a bytearray object allows you to modify its elements.
Both bytes and bytearray provide ways to encode and decode strings.
bytes and encoding strings
A bytes object can be constructed in a few different ways:
>>> bytes(5)
b'\x00\x00\x00\x00\x00'
>>> bytes([116, 117, 118])
b'tuv'
>>> b'tuv'
b'tuv'
>>> bytes('tuv')
TypeError: string argument without an encoding
>>> bytes('tuv', 'utf-8')
b'tuv'
>>> 'tuv'.encode('utf-8')
b'tuv'
>>> 'tuv'.encode('utf-16')
b'\xff\xfet\x00u\x00v\x00'
>>> 'tuv'.encode('utf-16-le')
b't\x00u\x00v\x00'
Note the difference between the last two: 'utf-16' specifies a generic UTF-16
encoding, so its encoded form starts with a two-byte "byte order mark" (BOM),
here [0xff, 0xfe] because the machine is little-endian. When specifying an
explicit ordering of 'utf-16-le', as in the latter example, the encoded form
omits the byte order mark.
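To see the byte order mark explicitly, here is a short sketch using the constants in the standard codecs module. Because the BOM emitted by plain 'utf-16' follows the machine's native byte order, the check is written against codecs.BOM_UTF16 rather than a fixed value.

```python
import codecs

text = 'tuv'

generic = text.encode('utf-16')      # BOM followed by native byte order
little = text.encode('utf-16-le')    # no BOM, little-endian
big = text.encode('utf-16-be')       # no BOM, big-endian

# codecs.BOM_UTF16 is the BOM for the platform's native byte order
print(generic.startswith(codecs.BOM_UTF16))  # True
print(len(generic) - len(little))            # 2
print(little)                                # b't\x00u\x00v\x00'
print(big)                                   # b'\x00t\x00u\x00v'

# Decoding with plain 'utf-16' consumes the BOM and uses it to pick the order
print(generic.decode('utf-16'))              # tuv
```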
Because a bytes object is immutable, attempting to change one of its elements
results in an error:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a[0] = 115
TypeError: 'bytes' object does not support item assignment
bytearray and encoding strings
Like bytes, a bytearray can be constructed in a number of ways:
>>> bytearray(5)
bytearray(b'\x00\x00\x00\x00\x00')
>>> bytearray([1, 2, 3])
bytearray(b'\x01\x02\x03')
>>> bytearray('tuv')
TypeError: string argument without an encoding
>>> bytearray('tuv', 'utf-8')
bytearray(b'tuv')
>>> bytearray('tuv', 'utf-16')
bytearray(b'\xff\xfet\x00u\x00v\x00')
>>> bytearray('tuv', 'utf-16-le')
bytearray(b't\x00u\x00v\x00')
Because a bytearray is mutable, you can modify its elements:
>>> a = bytearray('tuv', 'utf-8')
>>> a
bytearray(b'tuv')
>>> a[0]=115
>>> a
bytearray(b'suv')
appending bytes and bytearrays
bytes and bytearray objects may be concatenated with the + operator:
>>> a = bytes(3)
>>> a
b'\x00\x00\x00'
>>> b = bytearray(4)
>>> b
bytearray(b'\x00\x00\x00\x00')
>>> a+b
b'\x00\x00\x00\x00\x00\x00\x00'
>>> b+a
bytearray(b'\x00\x00\x00\x00\x00\x00\x00')
Note that the concatenated result takes on the type of the first operand, so a+b produces a bytes object and b+a produces a bytearray.
converting bytes and bytearray objects into strings
bytes and bytearray objects can be converted to strings using the decode method. It only gives back the original text if you pass the same encoding that was used to encode it. For example:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a.decode('utf-8')
'tuv'
>>> b = bytearray('tuv', 'utf-16-le')
>>> b
bytearray(b't\x00u\x00v\x00')
>>> b.decode('utf-16-le')
'tuv'
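As a rough illustration of why the encodings must match, the sketch below decodes with the wrong codec. Depending on the bytes, a mismatch either silently produces the wrong text or raises UnicodeDecodeError. The 'é' sample is an assumption added for the demonstration.

```python
data = 'tuv'.encode('utf-16-le')   # b't\x00u\x00v\x00'

# Matching codec: round-trips cleanly
print(data.decode('utf-16-le'))          # tuv

# Mismatched codec that happens to accept the bytes: wrong text, no error
print(repr(data.decode('utf-8')))        # 't\x00u\x00v\x00'

# Mismatched codec that rejects the bytes: UnicodeDecodeError
try:
    'é'.encode('latin-1').decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                            # True
```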

(beginner Q) type error in a lambda

I am getting the following error:
cube_list=lambda i,x=0 : list(map(x**3, range(0,i)))
TypeError: 'int' object is not callable
Goal of this line was to produce a list of cubed numbers by giving the last number that is supposed to be cubed as i.
x is set to 0, but can be changed to swap the starting number.
This is probably pretty easy to fix but I just don't see it as I am just starting to learn programming
Thank you very much in advance! and happy coding everyone
The first argument to map must be a function.
>>> cube_list = lambda n: list(map(lambda x: x**3, range(n)))
>>> cube_list(3)
[0, 1, 8]
You can write this more simply.
>>> cube_list_2 = lambda n: [x**3 for x in range(n)]
>>> cube_list_2(3)
[0, 1, 8]
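The question also wanted a configurable starting number. One possible sketch is a regular function with a default argument; the parameter names stop and start are illustrative, and stop is exclusive, as with range.

```python
def cube_list(stop, start=0):
    """Return the cubes of start, start + 1, ..., stop - 1."""
    return [x ** 3 for x in range(start, stop)]

print(cube_list(4))      # [0, 1, 8, 27]
print(cube_list(4, 2))   # [8, 27]
```

If the last number itself should be included, range(start, stop + 1) would be the adjustment.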

why can't I see the decoded string?

I have a base64 string and I'm trying to figure out what it was, but I can't see anything. What am I doing wrong? Is this
>>> import base64
>>> b = base64.b64decode("FAAAAAMAAAAGAAAACQAAAAwAAAA=")
>>> b
b'\x14\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00\x0c\x00\x00\x00'
>>> print(b.decode("utf16"))
>>> print(b.decode("utf8"))
>>>
If it is Base 64 encoding, then it is neither UTF-16 nor UTF-8. Have a look at RFC 3548; Base 64 is described on page 4 of the document.
Actually, the very purpose is different. The UTF-x encodings exist to encode a Unicode string into a binary stream. That is, the abstract string is the decoded form. On the other hand, Base X and similar encodings exist to encode the original binary into a stream of selected ASCII characters (basically small integers) so that binary content can be transferred through channels, such as e-mail, that accept only text. The binary is the decoded, original form.
In your case, it looks as if a series of 32-bit integers was transferred: 20, 3, 6, 9, and 12.
Updated later to answer the comment below: How I got the values...
b'\x14\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00\x0c\x00\x00\x00'
The b prefix says that the literal has a value of type bytes. A bytes object is a sequence of small integers, one byte each, that is, from 0 to 255. When displayed as a literal, a byte that has no printable ASCII character is shown in hexadecimal notation: \x followed by two hexadecimal digits. The \t is the representation of the tab character, which has the ordinal value 9.
However, you can also convert it to the list of integers:
>>> list(b)
[20, 0, 0, 0, 3, 0, 0, 0, 6, 0, 0, 0, 9, 0, 0, 0, 12, 0, 0, 0]
Now it is more apparent. The zeros are padding: each value is small enough to fit into a single byte but is stored in four. The least significant byte comes first (little-endian), which is typically the native byte order of the machine that wrote the data. So, written in hexadecimal as five 32-bit integers, the data is:
00000014 00000003 00000006 00000009 0000000c
Which is:
20 3 6 9 12
In other words, the b'\x14\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00\x0c\x00\x00\x00' is actually not a string. It is a bytes literal that captures the value of 5 * 4 bytes. The bytes is a sequence of small integers, not of characters. It is more apparent when you try:
>>> for value in b:
...     print(value)
...
20
0
0
0
3
0
0
0
6
0
0
0
9
0
0
0
12
0
0
0
>>> type(b)
<class 'bytes'>
>>> type(b[0])
<class 'int'>
>>>
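The manual reading above can be reproduced with the standard struct module. This is a sketch that assumes the layout guessed in the answer (five unsigned 32-bit little-endian integers); the data itself does not state its own format.

```python
import base64
import struct

raw = base64.b64decode("FAAAAAMAAAAGAAAACQAAAAwAAAA=")

# '<' = little-endian, '5I' = five unsigned 32-bit integers (assumed layout)
values = struct.unpack('<5I', raw)
print(values)   # (20, 3, 6, 9, 12)

# Same result without struct, four bytes at a time
also = [int.from_bytes(raw[i:i + 4], 'little') for i in range(0, len(raw), 4)]
print(also)     # [20, 3, 6, 9, 12]
```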

Encode, length of characters, and width of display in Python 3

>>> line="你好".encode("gbk").rjust(10)
>>> print(line)
b'      \xc4\xe3\xba\xc3'
>>> print(line.decode("gbk"))
      你好
>>> print("你好".rjust(10))
        你好
>>> len("你好".rjust(10))
10
>>> len(line.decode("gbk"))
8
>>> len("你好".encode("gbk").rjust(10).decode("gbk"))
8
It is strange that len("你好".rjust(10)) is 10 but len("你好".encode("gbk").rjust(10).decode("gbk")) is 8. Encoding and decoding seems to shrink the string by two characters in width.
What you are seeing is the difference between bytes and code points. When you take the len of a bytes object, you get the number of bytes. When you take the len of a str object, you get the number of Unicode code points.
line is a bytes object, composed of 10 bytes:
>>> line
b'      \xc4\xe3\xba\xc3'
>>> list(line)
[32, 32, 32, 32, 32, 32, 196, 227, 186, 195]
>>> len(line)
10
When you decode the bytes to a str, the str is composed of 8 code points:
>>> line.decode("gbk")
'      你好'
>>> list(line.decode("gbk"))
[' ', ' ', ' ', ' ', ' ', ' ', '你', '好']
>>> len(line.decode("gbk"))
8
The two bytes b'\xc4\xe3' get decoded to one code point:
>>> b'\xc4\xe3'.decode('gbk')
'你'
And the same goes for b'\xba\xc3'.
Note that code points are not exactly the same as characters. A code point might be a combining accent mark, for example:
>>> print(u'a\u0300')
à
>>> len(u'a\u0300')
2
Some combining marks can be composed with another code point to form one code point. Indeed, that's the case with the example above:
>>> import unicodedata as UD
>>> UD.normalize('NFKC', 'a\u0300')
'à'
>>> len(UD.normalize('NFKC', 'a\u0300'))
1
However, not all combining marks can be so composed:
>>> UD.normalize('NFKC', 'a\u030b')
'a̋'
>>> len(UD.normalize('NFKC', 'a\u030b'))
2
So even if you normalize, you cannot assume that the number of characters you see is the number of code points in the str.
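For padding by display columns rather than by code points or bytes, one rough approach uses unicodedata.east_asian_width. This is only a sketch: display_width and rjust_display are hypothetical helpers, and the rule of thumb (wide and fullwidth code points take two columns, combining marks take none) is an approximation that a given terminal or font may not follow.

```python
import unicodedata

def display_width(text):
    """Rough column count: wide/fullwidth code points take 2, combining marks 0."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # combining marks occupy no column of their own
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2                    # wide or fullwidth
        else:
            width += 1
    return width

def rjust_display(text, columns):
    """Right-justify by display columns instead of code points."""
    return ' ' * max(0, columns - display_width(text)) + text

print(display_width('你好'))            # 4
print(len(rjust_display('你好', 10)))   # 8
print(display_width('a\u0300'))         # 1
```

Under this approximation, six spaces plus two wide characters fill ten columns, which matches what the bytes-based rjust happened to produce for GBK.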

Python 3 byte string subscription

In Python, I am trying to use byte strings to handle some 8-bit character strings. I find that a byte string does not necessarily behave in a string-like way. When subscripted, it returns a number instead of a byte string of length 1.
In [243]: s=b'hello'
In [244]: s[1]
Out[244]: 101
In [245]: s[1:2]
Out[245]: b'e'
This makes it really difficult to iterate over. For example, this code works with a string but fails for a byte string.
In [260]: d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}
In [261]: list(map(d.get, s))
Out[261]: [None, None, None, None, None]
This breaks some code from Python 2. I also find this irregularity really inconvenient. Does anyone have any insight into what's going on with byte strings?
Byte strings store byte values in the range 0-255. The repr of bytes is just a convenience to view them, but they are storing data not text. Observe:
>>> x=bytes([104,101,108,108,111])
>>> x
b'hello'
>>> x[0]
104
>>> x[1]
101
>>> list(x)
[104, 101, 108, 108, 111]
Use strings for text. If starting with bytes, decode them appropriately:
>>> s=b'hello'.decode('ascii')
>>> d = dict(zip('hello','HELLO'))
>>> list(map(d.get,s))
['H', 'E', 'L', 'L', 'O']
But if you want to work with bytes:
>>> d=dict(zip(b'hello',b'HELLO'))
>>> d
{104: 72, 108: 76, 101: 69, 111: 79}
>>> list(map(d.get,b'hello'))
[72, 69, 76, 76, 79]
>>> bytes(map(d.get,b'hello'))
b'HELLO'
You can simply decode the byte string, get the element you want, and encode it back:
s=b'hello'
t = s.decode()
print(t[1]) # This gives a str of length 1: e
print(t[1].encode()) # This gives a bytes object: b'e'
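If the goal is to keep working with length-1 bytes objects, as the dictionary in the question expects, slicing is one option, because a slice of bytes is still bytes. For plain byte-to-byte substitution, bytes.maketrans with bytes.translate is another. The sketch below reuses the question's s and d; the names pieces and table are illustrative.

```python
s = b'hello'
d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}

# Slicing (unlike indexing) keeps the bytes type, so slice one byte at a time
pieces = [s[i:i + 1] for i in range(len(s))]
print(pieces)                          # [b'h', b'e', b'l', b'l', b'o']
print(b''.join(map(d.get, pieces)))    # b'HELLO'

# For byte-to-byte substitution, a translation table avoids the dict entirely
table = bytes.maketrans(b'helo', b'HELO')
print(s.translate(table))              # b'HELLO'
```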
