How to Turn string into bytes? - python-3.x

Using python3 and I've got a string which displayed as bytes
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to change it into readable chinese letter through decode
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print()
which shows like this and it is what I want
strategyName=百度
But if I save it in another string,it works different
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©ç­ç¥
Please help me about why this happening,and how can I fix it.
Thanks a lot

Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.

Related

can not decoed using utf-8 after encoding with utf-8

In a situation I had to store data as utf-8 and now when I want to fetch and decode('utf-8') data it's just simply does not work. Consider line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to the human readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However could not find a way to convert the string to bytestring without corrupting the string. I tried following methods but all of them failed:
.decode("utf-8")
.decode()
.bytes()
Up until this point I could not find solution in OS or other places. Appreciate any help.
x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh Character with hex value hh (2, 3)
2 Unlike in Standard C, exactly two hex digits are
required.
3 In a bytes literal, hexadecimal and octal escapes denote
the byte with the given value. In a string literal, these escapes
denote a Unicode character with the given value.
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nساÙ\x82Û\x8câ\x80\x8cÙ\x86اÙ\x85Ù\x87'
>>> print( xstr)
ساÙÛâÙاÙ
Ù
>>>
Apparently, the print( xstr) output does not resemble a word in any known language however all its characters belong (by definition) to Unicode range r'[\u0000-\u00ff]' i.e. the first 256 of characters in Unicode, and voila - it's iso-8859-1 aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using str.encode method or built-in bytes() function. Then
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه

Using Protocol Buffer to Serialize Bytes Python3

I am trying to serialize a bytes object - which is an initialization vector for my program's encryption. But, the Google Protocol Buffer only accepts strings. It seems like the error starts with casting bytes to string. Am I using the correct method to do this? Thank you for any help or guidance!
Or also, can I make the Initialization Vector a string object for AES-CBC mode encryption?
Code
Cast the bytes to a string
string_iv = str(bytes_iv, 'utf-8')
Serialize the string using SerializeToString():
serialized_iv = IV.SerializeToString()
Use ParseToString() to recover the string:
IV.ParseFromString( serialized_iv )
And finally, UTF-8 encode the string back to bytes:
bytes_iv = bytes(IV.string_iv, encoding= 'utf-8')
Error
string_iv = str(bytes_iv, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 3: invalid start byte
If you must cast an arbitrary bytes object to str, these are your option:
simply call str() on the object. It will turn it into repr form, ie. something that could be parsed as a bytes literal, eg. "b'abc\x00\xffabc'"
decode with "latin1". This will always work, even though it technically makes no sense if the data isn't text encoded with Latin-1.
use base64 or base85 encoding (the standard library has a base64 module wich covers both)

Python bytes representation

I'm writing a hex viewer on python for examining raw packet bytes. I use dpkt module.
I supposed that one hex byte may have value between 0x00 and 0xFF. However, I've noticed that python bytes representation looks differently:
b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
I don't understand what do these symbols mean. How can I translate these symbols to original 1-byte values which could be shown in hex viewer?
The \xhh indicates a hex value of hh. i.e. it is the Python 3 way of encoding 0xhh.
See https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
The b at the start of the string is an indication that the variables should be of bytes type rather than str. The above link also covers that. The \n is a newline character.
You can use bytearray to store and access the data. Here's an example using the byte string in your question.
example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
encoded_array = bytearray(example_bytes)
print(encoded_array)
>>> bytearray(b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1')
# Print the value of \x8a which is 138 in decimal.
print(encoded_array[0])
>>> 138
# Encode value as Hex.
print(hex(encoded_array[0]))
>>> 0x8a
Hope this helps.

bytes() initializer adding an additional byte?

I initialize a utf-8 encoding string in python3:
bytes('\xc2', encoding="utf-8", errors="strict")
but on writing it out I get two bytes!
>>> s = bytes('\xc2', encoding="utf-8", errors="strict")
>>> s
b'\xc3\x82'
Where is this additional byte coming from? Why should I not be able to encode any hex value up to 254 (I can understand that 255 is potentially reserved to extend to utf-16)?
The Unicode codepoint "\xc2" (which can also be written as "Â"), is two bytes long when encoded with the utf-8 encoding. If you were expecting it to be the single byte b'\xc2', you probably want to use a different encoding, such as "latin-1":
>>> s = bytes("\xc2", encoding="latin-1", errors="strict")
>>> s
b'\xc2'
If you area really creating "\xc2" directly with a literal though, there's no need to mess around with the bytes constructor to turn it into a bytes instance. Just use the b prefix on the literal to create the bytes directly:
s = b"\xc2"

Perl's default string encoding and representation

In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The x{FB01} and x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character è which has the codepoint \x{FB01} is part of the string of $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if(Encode::is_utf8($str)) {
print "YES str IS UTF8!\n";
}
else {
print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quiet, the numeric values inside the braces are codepoints. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the common one is the simple string literal. You might as well write:
use utf8;
my $string = "Can you find my résumé?\n";
# ↑ ↑ ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret", the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly-named function that doesn't mean what you think it means or have anything to do with that. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result of that will be a string that is encoded, and has a known encoding

Resources