python 3, unicode conversion, two \u0000 as one character - string

My Python 3 script receives strings from a C++ program via a pipe.
The strings are encoded as Unicode code point escapes, and I need to decode them correctly.
For example, consider a string that contains Cyrillic symbols: 'тест test'
Encoding this string in Python 3 with print('тест test'.encode()) gives b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
The C++ program encodes the same string as: b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
The encoded strings look very similar: Python 3 uses \x with 2 hex digits and the C++ program uses \u with 4 hex digits.
But I can't figure out how to convert b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test' to 'тест test'.
The main problem: Python 3 treats b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082' as an 8-character string, but it contains only 4 characters.

If the string you receive from C++ is the following in Python:
s = b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
Then this will decode it:
result = s.decode('unicode-escape').encode('latin1').decode('utf8')
print(result)
Output:
тест test
The first stage converts the byte string received into a Unicode string:
>>> s1 = s.decode('unicode-escape')
>>> s1
'Ñ\x82ÐµÑ\x81Ñ\x82 test'
Unfortunately, the Unicode codepoints are really UTF-8 byte values. The latin1 encoding is a 1:1 mapping of the first 256 Unicode codepoints, so encoding with this codec converts the codepoints back to byte values in a byte string:
>>> s2 = s1.encode('latin1')
>>> s2
b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
Now the byte string can be decoded to the correct Unicode string:
>>> s3 = s2.decode('utf8')
>>> s3
'тест test'
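As a minimal sketch, the three steps above can be wrapped in one helper. The name decode_escaped_utf8 is hypothetical (it is not part of any library), and the sketch assumes every escape in the input lies in the range \u0000-\u00FF, so that the latin1 step cannot fail. The sample is written as a raw bytes literal (rb'...') so that the backslashes stay literal.

```python
def decode_escaped_utf8(raw: bytes) -> str:
    """Undo \\uXXXX escapes whose code points are really UTF-8 byte values.

    Hypothetical helper; assumes all escaped code points are below 256.
    """
    # Step 1: turn the ASCII escape sequences into code points U+0000..U+00FF.
    text = raw.decode('unicode-escape')
    # Step 2: latin1 maps each of those code points to the byte of the same value.
    utf8_bytes = text.encode('latin1')
    # Step 3: the recovered bytes are ordinary UTF-8.
    return utf8_bytes.decode('utf8')


s = rb'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
print(decode_escaped_utf8(s))  # тест test
```

If the input can contain escapes above \u00FF, step 2 raises UnicodeEncodeError, which is a useful signal that the data is not in the format assumed here.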

Related

Cannot decode using utf-8 after encoding with utf-8

I had to store data as utf-8, and now when I fetch it and call decode('utf-8') on it, it simply does not work. Consider the line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to the human readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However, I could not find a way to convert the string to a bytestring without corrupting it. I tried the following methods, but all of them failed:
.decode("utf-8")
.decode()
.bytes()
Up to this point I could not find a solution on SO or elsewhere. I'd appreciate any help.
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh Character with hex value hh (2, 3)
2 Unlike in Standard C, exactly two hex digits are
required.
3 In a bytes literal, hexadecimal and octal escapes denote
the byte with the given value. In a string literal, these escapes
denote a Unicode character with the given value.
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nØ³Ø§Ù\x82Û\x8câ\x80\x8cÙ\x86Ø§Ù\x85Ù\x87'
>>> print( xstr)
Ø³Ø§ÙÛâÙØ§Ù
Ù
>>>
The print( xstr) output does not resemble a word in any known language. However, all of its characters belong (by definition) to the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 Unicode code points, which is exactly the range covered by iso-8859-1, aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() function, and then decode it as UTF-8 (the default for bytes.decode):
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه
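A small sketch of the same round trip with the intermediate values kept in variables, so each step can be inspected. The variable names raw and word are chosen here for illustration. The all(...) check makes the precondition explicit: latin1 can only encode characters whose code point is below 256.

```python
xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"

# Precondition: every character is in U+0000..U+00FF, so latin1 is lossless here.
fits_latin1 = all(ord(ch) < 256 for ch in xstr)
print(fits_latin1)  # True

# Code points become bytes of the same value ...
raw = xstr.encode("latin1")
# ... and those bytes are the UTF-8 data that was stored originally.
word = raw.decode("utf-8")
print(word.strip())
```

If fits_latin1 were False, xstr.encode("latin1") would raise UnicodeEncodeError instead of silently corrupting the data.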

How to get single backslash instead of double backslash with encode("unicode-escape")?

Get unicode point of character Ä.
Python3 version.
>>> str="Ä"
>>> str.encode("unicode-escape")
b'\\xc4'
How can I get the single-backslash form b'\xc4' instead of b'\\xc4' as my output?
It's not entirely clear to me what you want, so I'll give you a few options.
Get the (Unicode) code point of a character as an integer:
>>> ord('Ä')
196
Display the integer in hex notation:
>>> hex(ord('Ä'))
'0xc4'
or with string formatting:
>>> '{:X}'.format(ord('Ä'))
'C4'
However, you talk about backslashes and show the bytestring b'\xc4'.
This is the Latin-1 encoding of 'Ä' (all characters with a Unicode codepoint below 256 can be encoded with Latin-1, and their byte value equals the Unicode codepoint).
>>> 'Ä'.encode('latin-1')
b'\xc4'
This is a bytestring of length 1.
It is displayed in a way in which you could type this character, i.e. using an escape sequence with backslash-x and a two-digit hex number.
The "unicode-escape" codec produces these four ASCII characters (\, x, c, 4), not as a str but as a bytes object (because str.encode() returns bytes by definition).
To get a backslash in a str/bytes literal, you need to type two backslashes, so the representation form also uses two backslashes:
>>> 'Ä'.encode('unicode-escape')
b'\\xc4'
The "unicode-escape" codec is very Python-specific and I don't see a lot of applications; maybe if you want to write your own pickle protocol or parse fragments of Python source code.
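To make the difference concrete, here is a short sketch comparing the two results by length and byte values rather than by their repr. The variable names are illustrative only.

```python
ch = 'Ä'

latin1_bytes = ch.encode('latin-1')      # one byte with value 0xC4
escaped = ch.encode('unicode-escape')    # four ASCII bytes: \ x c 4

print(len(latin1_bytes), list(latin1_bytes))  # 1 [196]
print(len(escaped), list(escaped))            # 4 [92, 120, 99, 52]

# Decoding the escaped form to str and printing it shows a single backslash,
# because print() displays the text itself rather than its repr.
print(escaped.decode('ascii'))                # \xc4
```

So the doubled backslash exists only in the repr of the four-byte object; the data itself contains one backslash.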

Converting octet strings to Unicode strings, Python 3

I'm trying to convert a string with octal-escaped Unicode back into a proper Unicode string as follows, using Python 3:
"training\345\256\214\346\210\220\345\276\214.txt" is the read-in string.
"training完成後.txt" is the string's actual representation, which I'm trying to obtain.
However, after skimming SO, the solution suggested almost everywhere I looked for Python 3 was the following:
decoded_string = bytes(myString, "utf-8").decode("unicode_escape")
Unfortunately, that seems to yield the wrong Unicode string when applied to my sample:
'trainingÃ¥Â®Â\x8cÃ¦Â\x88Â\x90Ã¥Â¾Â\x8c.txt'
This seems easy to do with byte literals, as well as in Python 2, but unfortunately doesn't seem as easy with strings in Python 3. Help much appreciated, thanks! :)
Assuming your starting string is a Unicode string with literal backslashes, you first need a byte string to use the unicode-escape codec, but the octal escapes are UTF-8, so you'll need to convert it again to a byte string and then decode as UTF-8:
>>> s = r'training\345\256\214\346\210\220\345\276\214.txt'
>>> s
'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1')
b'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1').decode('unicode-escape')
'trainingå®\x8cæ\x88\x90å¾\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'training\xe5\xae\x8c\xe6\x88\x90\xe5\xbe\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'training完成後.txt'
Note that the latin1 codec does a direct translation of Unicode codepoints U+0000 to U+00FF to bytes 00-FF.
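The chain above can be collected into one function. This is only a sketch: decode_octal_utf8 is a hypothetical name, and it assumes the input contains literal backslash escapes, that all of its characters are below U+0100, and that the escaped byte values form valid UTF-8.

```python
def decode_octal_utf8(text: str) -> str:
    """Turn literal \\ooo escapes holding UTF-8 byte values into real text.

    Hypothetical helper; assumes every character of `text` is below U+0100.
    """
    # str -> bytes, so that the unicode-escape codec can be applied.
    raw = text.encode('latin1')
    # Interpret the escapes; each one yields a code point in U+0000..U+00FF.
    unescaped = raw.decode('unicode-escape')
    # Map those code points back to bytes, then decode the bytes as UTF-8.
    return unescaped.encode('latin1').decode('utf8')


name = decode_octal_utf8(r'training\345\256\214\346\210\220\345\276\214.txt')
print(name)  # training完成後.txt
```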

Convert string of encoded escape sequences to Unicode in Python 3

I'm attempting to decode a Python string containing a series of Shift-JIS escape sequences. When I create a bytes literal containing the sequences, I can use decode('shift-jis') to get the expected result.
>>> seq = b'\201u\202\240\202\246\202\244\202\242\202\250\201v'
>>> seq.decode("shift-jis")
'「あえういお」'
The problem is that the sequences are passed in as a plain Python string. When I use str.encode, the sequence is interpreted as Unicode and extra bytes of \xc2 are inserted:
>>> seq = "\201u\202\240\202\246\202\244\202\242\202\250\201v"
>>> str.encode(seq)
b'\xc2\x81u\xc2\x82\xc2\xa0\xc2\x82\xc2\xa6\xc2\x82\xc2\xa4\xc2\x82\xc2\xa2\xc2\x82\xc2\xa8\xc2\x81v'
Is there a way to directly convert a Python string containing encoded escape sequences into a bytes literal, in the same way as placing a b in front of a string produces a bytes literal with the escaped characters?
str.encode defaults to UTF-8, hence the \xc2 prefixes in the output. (Check Wikipedia for details if you want.) What you want instead is for code points 0 to 255 to be turned into bytes 0 to 255. In other words, the same data in an object of a different class. Latin-1 does this.
>>> seqb = seq.encode('latin-1')
>>> seqb.decode('shift-jis')
'「あえういお」'
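The claim that Latin-1 maps code points 0 to 255 onto bytes 0 to 255 can be checked directly. The sketch below first verifies the mapping for all 256 values and then applies it to the string from the question; the variable names are illustrative.

```python
# Every possible byte value, decoded as Latin-1 ...
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')

# ... gives the code point with the same number, and encodes back unchanged.
round_trip_ok = (
    [ord(ch) for ch in as_text] == list(range(256))
    and as_text.encode('latin-1') == all_bytes
)
print(round_trip_ok)  # True

# Applied to the str from the question: recover the bytes, then decode them.
seq = "\201u\202\240\202\246\202\244\202\242\202\250\201v"
print(seq.encode('latin-1').decode('shift-jis'))  # 「あえういお」
```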

Convert hexadecimal to normal string

I'm using Python 3.3.2 and I want to convert a hex value to a string.
This is my code:
junk = "\x41" * 50 # A
eip = pack("<L", 0x0015FCC4)
buffer = junk + eip
I've tried using
>>> binascii.unhexlify("4142")
b'AB'
... but I want the output "AB", not "b'AB'". What can I do?
Edit:
buffer = junk + binascii.unhexlify(eip).decode('ascii')
binascii.Error: Non-hexadecimal digit found
The problem is I can't concatenate junk + eip.
Thank you.
The b prefix denotes an object of the bytes class, i.e. a string of bytes. If you want to convert it into a str, use the decode method.
>>> type(binascii.unhexlify(b"4142"))
<class 'bytes'>
>>> binascii.unhexlify(b"4142").decode('ascii')
'AB'
This results in a string, which is a string of unicode characters.
Edit:
If you want to work purely with binary data, don't decode; stick with the bytes type. So in your edited example:
>>> from struct import pack
>>> #- junk = "\x41" * 50 # A
>>> junk = b"\x41" * 50 # A
>>> eip = pack("<L", 0x0015FCC4)
>>> buffer = junk + eip
>>> buffer
b'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\xc4\xfc\x15\x00'
Note the b in b"\x41", which marks the literal as a byte string (the standard string type in Python 2), i.e. literally a string of bytes rather than a string of Unicode characters. These are two completely different things.
That's just a literal representation. Don't worry about the b, as it's not actually part of the string itself.
See What does the 'b' character do in front of a string literal?
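For reference, a self-contained sketch of both approaches discussed above: keeping everything as bytes for the buffer, and decoding only when text is actually wanted. It assumes pack in the question refers to struct.pack; the variable name text is illustrative.

```python
import binascii
from struct import pack  # assumption: this is the pack() used in the question

# Binary data: keep both operands as bytes so they can be concatenated.
junk = b"\x41" * 50                 # 50 bytes of 'A'
eip = pack("<L", 0x0015FCC4)        # 4 bytes, little-endian unsigned long
buffer = junk + eip
print(len(buffer), buffer[-4:])     # 54 b'\xc4\xfc\x15\x00'

# Text: decode the bytes when a str such as "AB" is needed.
text = binascii.unhexlify("4142").decode("ascii")
print(text)                         # AB
```

The original error came from passing eip (already raw bytes, not hex digits) to unhexlify; no hex conversion is needed for it at all.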

Resources