python3 str to bytes conversion problem - python-3.x

In Python 3.8.5 I convert a number to a one-character string and then try to convert that string to bytes:
>>> a=chr(128)
>>> a
'\x80'
>>> type(a)
<class 'str'>
But when I try the reverse conversion:
>>> a.encode()
b'\xc2\x80'
What is the \xc2 byte? Why does it appear?
Thanks for any response!

This is the UTF-8 encoding, which is where the \xc2 comes from; take a look here.
In a Python string, \x80 means Unicode codepoint #128 (Padding Character). When we encode that codepoint in UTF-8, it takes two bytes.
The original ASCII encoding only had 128 different characters, but there are many thousands of Unicode code points, and a single byte can only represent 256 different values. A lot of computing is based on ASCII, and we’d like that stuff to keep working, but we need non-English speakers to be able to use computers too, so we need to be able to represent their characters.
The answer is UTF-8, a scheme that encodes the first 128 Unicode code points (0-127, the ASCII characters) as a single byte – so text that only uses those characters is completely compatible with ASCII. The next 1920 characters, containing the most common non-English characters (U+80 up to U+7FF) are spread across two bytes.
So, in exchange for being slightly less efficient with some characters that could fit in a one-byte encoding (such as \x80), we gain the ability to represent every character from every written language.
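As a quick sanity check in a Python 3 shell, you can watch the byte count grow at exactly those boundaries:
>>> chr(0x7f).encode()
b'\x7f'
>>> chr(0x80).encode()
b'\xc2\x80'
>>> chr(0x7ff).encode()
b'\xdf\xbf'
>>> chr(0x800).encode()
b'\xe0\xa0\x80'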
For more reading, try this SO question
For example, if you want to get rid of the \xc2, encode your string as latin-1:
a = chr(128)
print(repr(a))
# '\x80'
print(a.encode())
# b'\xc2\x80'
print(a.encode('latin-1'))
# b'\x80'

Related

What exactly does "encoding-independent" mean

While reading the Strings and Characters chapter of the official Swift documentation, I found the following sentence:
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question: What exactly does "encoding-independent" mean?
From my reading of Advanced Swift by Chris and other experience, this sentence is trying to convey two things.
First, the various Unicode representations:
UTF-8 : compatible with ASCII
UTF-16
UTF-32
The number on the right-hand side is the size of one code unit, the building block each encoding uses when a character is represented or stored: 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32.
A single character may need several code units. For example, a Chinese character that fits in one UTF-32 code unit may not always fit in one UTF-16 code unit, and a character above U+FFFF takes four code units (bytes) in UTF-8.
Then comes the storing part. When you store a character in the String, it doesn't matter how you want to read it later.
Take the documentation sentence again:
Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations
This means you can compose a String any way you like, and it won't affect how the string is represented when it is read in the various Unicode encoding forms such as UTF-8, UTF-16, or UTF-32.
For example, when I load a Japanese character that takes 24 bits to store, the same character is displayed irrespective of my choice of encoding.
However, the count value will differ. There are other concepts to consider, such as the code units and code points that make up these Strings.
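The question is about Swift, but the same effect is easy to see in Python, for example, where encoding one character into each form gives different code-unit counts:
>>> s = 'あ'                    # one Japanese character, U+3042
>>> len(s.encode('utf-8'))      # three 8-bit code units
3
>>> len(s.encode('utf-16-be'))  # one 16-bit code unit, two bytes
2
>>> len(s.encode('utf-32-be'))  # one 32-bit code unit, four bytes
4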
For the Unicode encoding variants, I would highly recommend reading this article, which goes much deeper into the String API in Swift:
Detail View of String API in swift

YOaf/MrA - What character encoding?

What kind of character encoding are the strings below?
KDLwuq6IC
YOaf/MrAT
0vGzc3aBN
SQdLlM8G7
https://en.wikipedia.org/wiki/Character_encoding
Character encoding is the encoding of strings into bytes (or numbers). You are only showing us the characters themselves; they don't have any encoding by themselves.
Some character encodings have a different range of characters that they can encode. Your characters are all in the ASCII range at least. So they would also be compatible with any scheme that incorporates ASCII as a subset such as Windows-1252 and of course Unicode (UTF-8, UTF-16LE, UTF-16BE etc).
Note that your strings look a lot like base 64. Base 64 is not a character encoding though; it is the encoding of bytes into characters (so the other way around). Base 64 can usually be recognized by looking for / and + characters in the text, as well as the text consisting of blocks that are a multiple of 4 characters (as 4 characters encode 3 bytes).
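For instance, a quick Python illustration (these are my own example bytes, not the strings from the question):
>>> import base64
>>> base64.b64encode(b'\x00\xffhi')   # bytes in, ASCII characters out
b'AP9oaQ=='
>>> base64.b64decode(b'AP9oaQ==')     # and back again
b'\x00\xffhi'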
Judging by the text, you are probably looking for an encoding scheme rather than a character-encoding scheme.

What is the difference between a unicode and binary string?

I am on Python 3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concepts of Unicode and binary strings are confusing to me. How can I change b'\\u4f60' into b'\x4f\x60'?
First - there is no difference between unicode literals and string literals in python 3. They are one and the same - you can drop the u up front. Just write strings. So instantly you should see that the literal u'4f60' is just like writing actual '4f60'.
A bytes literal - aka b'some literal' - is a series of bytes. Bytes in the printable ASCII range (32 through 126) are displayed as their corresponding glyph; the rest are displayed as the \x escaped version. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
A string literal is a string literal. It can contain unicode codepoints. There is far too much to cover to explain how unicode works here, but basically a codepoint represents a glyph (essentially, a character - a graphical representation of a letter/digit), it does not specify how the machine needs to represent it. In fact there are a great many different ways.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
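A minimal illustration of that mapping, using the same Han character that comes up below:
>>> '你'.encode('utf-8')             # str -> bytes: encoding
b'\xe4\xbd\xa0'
>>> b'\xe4\xbd\xa0'.decode('utf-8')  # bytes -> str: decoding
'你'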
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
How can I change b'\\u4f60' into b'\x4f\x60'?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60' #note, unicode-escaped
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph. \x4f\x60 is, if we represent it in ASCII (or UTF-8, actually), the letter O (\x4f) followed by a backtick (\x60).
I can ask Python to turn that unicode-escaped byte sequence into a valid string with the corresponding Unicode glyph:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can I change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that unicode codepoint. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60' - utf-16be.
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: b'O`'
The reason this works is that UTF-16 is a variable-length encoding. For code points that fit in 16 bits it just uses the code point value directly as the 2-byte encoding, and for code points above that it uses something called "surrogate pairs", which I won't get into.
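To recap in a plain Python 3 shell (same character, two encodings):
>>> '你'.encode('utf-16-be')   # code point 0x4f60 stored directly as two bytes
b'O`'
>>> '你'.encode('utf-8')       # the same character under a different encoding
b'\xe4\xbd\xa0'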

Python 3 - if a string contains only ASCII, is it equal to the string as bytes?

Consider Python 3 SMTPD - the data received is contained in a string. http://docs.python.org/3.4/library/smtpd.html quote: "and data is a string containing the contents of the e-mail"
Facts (correct?):
Strings in Python 3 are Unicode.
Emails are always ASCII.
Pure ASCII is valid Unicode.
Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?
Thus my question: if I encode the SMTPD DATA string to ASCII, or convert the DATA string to bytes, is this equivalent to the bytes of the actual email message that arrived via SMTP?
Context (and perhaps a better question): "How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?" My concern is that when DATA goes through a string-to-bytes conversion it has somehow been changed from the original bytes that arrived via SMTP.
EDIT: it seems the Python developers think SMTPD should be returning binary data anyway. Doesn't seem to have been fixed... http://bugs.python.org/issue19662
if a string contains only ASCII, is it equal to the string as bytes?
No. It is not equal in Python 3:
>>> '1' == b'1'
False
A bytes object is not equal to a str (Unicode string) object, in a similar way that an integer is not equal to a string:
>>> '1' == 1
False
In some programming languages the above comparisons are true e.g., in Python 2:
>>> b'1' == u'1'
True
and 1 == '1' in Perl:
$ perl -e "print qq(True\n) if 1 == q(1)"
True
Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.
Strings in Python 3 are Unicode.
Yes. Strings are immutable sequences of Unicode code points in Python 3.
Emails are always ASCII.
Most emails are transported as 7-bit messages (ASCII range: hex 00-7F), though "virtually all modern email servers are 8-bit clean," i.e., 8-bit content won't be corrupted, and the 8BITMIME extension sanctions the passing of some 8-bit content.
In other words: emails are not always ASCII.
Pure ASCII is valid Unicode.
ASCII is a character encoding. You can decode some byte sequences to Unicode using US-ASCII character encoding. Unicode strings have no associated character encoding i.e., you can encode them into bytes using any character encoding that can represent corresponding Unicode code points.
Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?
If input is in ascii range then data.decode('ascii', 'strict').encode('ascii') == data.
Though Lib/smtpd.py does some conversions to the input data (according to RFC 5321), so the content that you get as data may be different even if the input is pure ASCII.
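For example, a quick round-trip check:
>>> raw = b'Subject: hi\r\n\r\nHello.\r\n'   # pure-ASCII input
>>> raw.decode('ascii', 'strict').encode('ascii') == raw
True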
"How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?"
my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form that they arrived.
The bug that you've linked (smtpd.py should not decode utf-8) makes smtpd.py not 8-bit clean.
You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as-is.
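A rough, untested sketch of that override (raw_chunks is just an illustrative attribute name; note that smtpd and its asyncore/asynchat machinery are legacy and were removed in Python 3.12):
import smtpd

class RawSMTPChannel(smtpd.SMTPChannel):
    def collect_incoming_data(self, data):
        # asynchat hands us the raw bytes here, before smtpd decodes them
        if not hasattr(self, 'raw_chunks'):
            self.raw_chunks = []
        self.raw_chunks.append(bytes(data))   # keep an untouched copy
        super().collect_incoming_data(data)   # then let smtpd carry on as usual

class RawSMTPServer(smtpd.SMTPServer):
    channel_class = RawSMTPChannel            # Python 3's smtpd uses this hook to build channels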
"A string of ASCII text is also valid UTF-8 text."
It is true. It is a nice property of UTF-8 encoding. If you can decode a byte sequence into Unicode using US-ASCII character encoding then you can also decode the bytes using UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).
smtpd.py should have used either latin-1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ascii byte) instead of utf-8 (which allows some non-ascii bytes -- bad).
Keep in mind:
some emails may have bytes outside ascii range
de-transparency according to RFC 5321 doesn't preserve input bytes as-is even if they are all in ascii range

example of a utf-8 format octet string

I'm working w/ a function that expects a string formatted as a utf-8 encoded octet string. Can someone give me an example of what a utf-8 encoded octet string would look like?
Put another way, if I convert 'foo' to bytes, I get 112, 111, 111. What would these char codes look like as a utf-8 encoded octet string? Would it be "0x70 0x6f 0x6f"?
The context of my question is the process of generating an openid signature as described in the openid spec: "The message MUST be encoded in UTF-8 to produce a byte string." I'm looking for an example of what this would look like.
Thanks
No. UTF-8 characters can span multiple bytes. If you want to learn about UTF-8, you should start with its article on Wikipedia, which has a good description.
I think you may have made some mistakes in encoding your example, but in any case, my guess is that the answer you really need is that UTF-8 is a superset of ASCII (the standard way to encode characters into bytes).
So, if you pass an ASCII-encoded string to a function that expects a UTF-8 encoded string, it should work just fine.
However, the opposite isn't true at all. UTF-8 can represent a lot of characters that ASCII cannot, so giving a UTF-8 encoded string to a function that expects an ASCII (i.e. 'normal') string is dangerous (unless you're positive that all the characters are part of the ASCII subset).
The string "foo" gets encoded as 66 6F 6F, but it's like that in nearly all ASCII derivatives. That's one of the biggest features of UTF-8: Backwards compatibility with 7-bit ASCII. If you're only dealing with ASCII, you don't have to do anything special.
Other characters are encoded with up to 4 bytes. Specifically, the bits of the Unicode code point are broken up into one of the patterns:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
with the requirement of using the shortest sequence that fits. So, for example, the Euro sign ('€' = U+20AC = binary 10 000010 101100) gets encoded as 1110 0010, 10 000010, 10 101100 = E2 82 AC.
So, it's just a simple matter of going through the Unicode code points in a string and encoding each one in UTF-8.
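For instance, checking the Euro-sign arithmetic above in Python:
>>> '\u20ac'.encode('utf-8')
b'\xe2\x82\xac'
>>> '\u20ac'.encode('utf-8').hex()
'e282ac'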
The hard part is figuring out what encoding your string is in to begin with. Most modern languages (e.g., Java, C#, Python 3.x) have distinct types for "byte array" and "string", where "strings" always have the same internal encoding (UTF-16 or UTF-32), and you have to call an "encode" function if you want to convert it to an array of bytes in a specific encoding.
Unfortunately, older languages like C conflate "characters" and "bytes". (IIRC, PHP is like this too, but it's been a few years since I used it.) And even if your language does support Unicode, you still have to deal with disk files and web pages with unspecified encodings. For more details, search for "chardet".
