bytes() initializer adding an additional byte? - python-3.x

I initialize a utf-8 encoding string in python3:
bytes('\xc2', encoding="utf-8", errors="strict")
but on writing it out I get two bytes!
>>> s = bytes('\xc2', encoding="utf-8", errors="strict")
>>> s
b'\xc3\x82'
Where is this additional byte coming from? Why should I not be able to encode any hex value up to 254 (I can understand that 255 is potentially reserved to extend to utf-16)?

The Unicode codepoint "\xc2" (which can also be written as "Â"), is two bytes long when encoded with the utf-8 encoding. If you were expecting it to be the single byte b'\xc2', you probably want to use a different encoding, such as "latin-1":
>>> s = bytes("\xc2", encoding="latin-1", errors="strict")
>>> s
b'\xc2'
If you area really creating "\xc2" directly with a literal though, there's no need to mess around with the bytes constructor to turn it into a bytes instance. Just use the b prefix on the literal to create the bytes directly:
s = b"\xc2"

Related

How to Turn string into bytes?

Using python3 and I've got a string which displayed as bytes
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to change it into readable chinese letter through decode
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print()
which shows like this and it is what I want
strategyName=百度
But if I save it in another string,it works different
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©ç­ç¥
Please help me about why this happening,and how can I fix it.
Thanks a lot
Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.

Python bytes representation

I'm writing a hex viewer on python for examining raw packet bytes. I use dpkt module.
I supposed that one hex byte may have value between 0x00 and 0xFF. However, I've noticed that python bytes representation looks differently:
b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
I don't understand what do these symbols mean. How can I translate these symbols to original 1-byte values which could be shown in hex viewer?
The \xhh indicates a hex value of hh. i.e. it is the Python 3 way of encoding 0xhh.
See https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
The b at the start of the string is an indication that the variables should be of bytes type rather than str. The above link also covers that. The \n is a newline character.
You can use bytearray to store and access the data. Here's an example using the byte string in your question.
example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
encoded_array = bytearray(example_bytes)
print(encoded_array)
>>> bytearray(b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1')
# Print the value of \x8a which is 138 in decimal.
print(encoded_array[0])
>>> 138
# Encode value as Hex.
print(hex(encoded_array[0]))
>>> 0x8a
Hope this helps.

How to get single backslash instead of double backslash with encode("unicode-escape")?

Get unicode point of character Ä.
Python3 version.
>>> str="Ä"
>>> str.encode("unicode-escape")
b'\\xc4'
How to get the single backslash format b'\xc4' instead of b'\\xc4' as my output ?
It's not entirely clear to me what you want, so I'll give you a few options.
Get the (Unicode) code point of a character as an integer:
>>> ord('Ä')
196
Display the integer in hex notation:
>>> hex(ord('Ä'))
'0xc4'
or with string formatting:
>>> '{:X}'.format(ord('Ä'))
'C4'
However, you talk about backslashes and show the bytestring b'\xc4'.
This is the Latin-1 encoding of 'Ä' (all characters with a Unicode codepoint below 256 can be encoded with Latin-1, and their byte value equals the Unicode codepoint).
>>> 'Ä'.encode('latin-1')
b'\xc4'
This is a bytestring of length 1.
It is displayed in a way in which you could type this character, ie. using an escape sequence with backslash-x and a two-digit hex number.
The "unicode-escape" codec produces these four ASCII characters (\, x, c 4), but not as str, but as a bytes object (because str.encode() returns bytes by definition).
To get a backslash in a str/bytes literal, you need to type two backslashes, so the representation form also uses two backslashes:
>>> 'Ä'.encode('unicode-escape')
b'\\xc4'
The "unicode-escape" codec is very Python-specific and I don't see a lot of applications; maybe if you want to write your own pickle protocol or parse fragments of Python source code.

Raw byte values vs Unicode text?

I am a beginner in python and came across a chapter, which read :
In Python 3.X, the normal str string handles Unicode text (including ASCII, which is just a simple kind of Unicode); a distinct bytes string type represents raw byte values (including media and encoded text);
I understand what is unicode text, but what values are the raw bytes??
Raw bytes can be anything you want them to be. A single byte is limited to 0-255 (hexadecimal 00-FF) so more than one has to be interpreted by a program to something meaningful.
Given the byte string b'\x41\x42\x43\x44', this could be a little-endian integer:
>>> int.from_bytes(raw,'little')
1145258561
>>> hex(int.from_bytes(raw,'little'))
'0x44434241'
Or a big-ending integer:
>>> hex(int.from_bytes(raw,'big'))
'0x41424344'
Or a UTF-8-encoded Unicode string:
>>> raw.decode('utf8')
'ABCD'
Or two little-endian 16-bit unsigned integers:
>>> struct.unpack('HH',raw)
(16961, 17475)
>>> list(map(hex,struct.unpack('HH',raw)))
['0x4241', '0x4443']
It's just data. It's up to a program decide what the data means.
Byte strings can be transmitted across a TCP socket or read or written to a file. Unicode text cannot. It must be encoded to bytes first.

Convert string of encoded escape sequences to Unicode in Python 3

I'm attempting to decode a Python string containing a series of Shift-JIS escape sequences in Python. When I create a bytes literal containing the sequences, I can use decode('shift-jis') to get the expected result.
>>> seq = b'\201u\202\240\202\246\202\244\202\242\202\250\201v'
>>> seq.decode("shift-jis")
'「あえういお」'
The problem is that the sequences are passed in as a plain Python string. When I use str.encode, the sequence is interpreted as Unicode and extra bytes of \xc2 are inserted:
>>> seq = "\201u\202\240\202\246\202\244\202\242\202\250\201v"
>>> str.encode(seq)
b'\xc2\x81u\xc2\x82\xc2\xa0\xc2\x82\xc2\xa6\xc2\x82\xc2\xa4\xc2\x82\xc2\xa2\xc2\x82\xc2\xa8\xc2\x81v'
Is there a way to directly convert a Python string containing encoded escape sequences into a bytes literal, in the same way as placing a b in front of a string produces a bytes literal with the escaped characters?
Str.encode defaults to using utf-8 encoding. hence you get the utf-8 \xc2 prefixes. (Check Wikipedia for details if you want.) What you want instead is for codepoints 0 to 255 to be turned into bytes 0 to 255. In others words, the same data in an object of a different class. Latin-1 does this.
>>> seqb = seq.encode('latin-1')
>>> seqb.decode('shift-jis')
'「あえういお」'

Resources