If I have 2 strings of the same text, one UTF-8, and the other UTF-16.
Is it safe to assume the UTF-8 string will always be smaller than, or the same size as, the UTF-16 one (byte-wise)?
No, while the UTF-8 text will usually be shorter, it's not always the case.
Anything between U+0000 and U+FFFF will be represented with 2 bytes (one UTF-16 code unit) in UTF-16.
Characters between U+0800 and U+FFFF will be represented with 3 bytes in UTF-8.
Therefore a text that contains only (or mostly) characters in that range can easily be longer when represented in UTF-8 than in UTF-16.
Put differently:
U+0000 - U+007F: UTF-8 is shorter (1 < 2)
U+0080 - U+07FF: Both are the same size (2 = 2)
U+0800 - U+FFFF: UTF-8 is longer (3 > 2)
U+10000 - U+10FFFF: Both are the same size (4 = 4)
Note that 5 and 6 byte sequences used to be defined in UTF-8 but are no longer valid according to the newest standard and were never necessary to represent Unicode codepoints.
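A quick way to check these sizes in Python (a rough sketch; the sample characters are arbitrary picks, one from each range):
samples = ["\u0041", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, 😀
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # -le so the 2-byte BOM is not counted
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
This prints 1 vs 2, 2 vs 2, 3 vs 2 and 4 vs 4 bytes, matching the table above.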
No. UTF-8 sometimes uses 3 or more bytes for a single character depending on how many bits it takes to represent the code point (number) of the character.
In Python 3.8.5 I try to convert some bytes to a string and then the string back to bytes:
>>> a=chr(128)
>>> a
'\x80'
>>> type(a)
<class 'str'>
But when I try to convert it back:
>>> a.encode()
b'\xc2\x80'
What is the \xc2 byte? Why does it appear?
Thanks for any response!
This is the UTF-8 encoding; that is where the \xc2 comes from.
In a Python string, \x80 means Unicode codepoint #128 (Padding Character). When we encode that codepoint in UTF-8, it takes two bytes.
The original ASCII encoding only had 128 different characters, but there are many thousands of Unicode codepoints, and a single byte can only represent 256 different values. A lot of computing is based on ASCII, and we’d like that stuff to keep working, but we need non-English speakers to be able to use computers too, so we need to be able to represent their characters.
The answer is UTF-8, a scheme that encodes the first 128 Unicode code points (0-127, the ASCII characters) as a single byte – so text that only uses those characters is completely compatible with ASCII. The next 1920 characters, containing the most common non-English characters (U+0080 up to U+07FF), are spread across two bytes.
So, in exchange for being slightly less efficient with some characters that could fit in a one-byte encoding (such as \x80), we gain the ability to represent every character from every written language.
For example, if you want to avoid the \xc2, try encoding your string as latin-1:
a = chr(128)
print(repr(a))              # '\x80'
print(a.encode())           # b'\xc2\x80' (UTF-8 is the default encoding)
print(a.encode('latin-1'))  # b'\x80'
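To see where the two bytes come from at the bit level, here is a small sketch of the standard 2-byte UTF-8 layout (110xxxxx 10xxxxxx), applied to code point 128:
cp = 128                              # the code point of '\x80'
byte1 = 0b11000000 | (cp >> 6)        # leading byte: 110xxxxx -> 0xC2
byte2 = 0b10000000 | (cp & 0b111111)  # continuation: 10xxxxxx -> 0x80
print(hex(byte1), hex(byte2))         # 0xc2 0x80
print(repr(bytes([byte1, byte2]).decode("utf-8")))  # '\x80'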
What kind of character encoding are the strings below?
KDLwuq6IC
YOaf/MrAT
0vGzc3aBN
SQdLlM8G7
https://en.wikipedia.org/wiki/Character_encoding
Character encoding is the encoding of strings to bytes (or numbers). You are only showing us the characters themselves. They don't have any encoding by themselves.
Some character encodings have a different range of characters that they can encode. Your characters are all in the ASCII range at least. So they would also be compatible with any scheme that incorporates ASCII as a subset such as Windows-1252 and of course Unicode (UTF-8, UTF-16LE, UTF-16BE etc).
Note that your text looks a lot like Base64. Base64 is not a character encoding though; it is an encoding of bytes into characters (so the other way around). Base64 can usually be recognized by the presence of / and + characters in the text, and by the text consisting of blocks that are a multiple of 4 characters long (as 4 characters encode 3 bytes).
Looking at the text you are probably looking for an encoding scheme rather than a character-encoding scheme.
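To illustrate the direction of that mapping (bytes to characters), here is a minimal sketch using Python's base64 module; the three input bytes are arbitrary:
import base64

raw = b"\x28\x32\xf0"               # three arbitrary bytes
encoded = base64.b64encode(raw)     # 3 bytes -> 4 Base64 characters
print(encoded)                       # b'KDLw' (as it happens, the first four characters of the first sample string)
print(base64.b64decode(encoded))     # round-trips to the original bytes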
A in UTF-8 is U+0041 LATIN CAPITAL LETTER A. A in ASCII is 065.
How is UTF-8 backwards-compatible with ASCII?
ASCII uses only the first 7 bits of an 8-bit byte. So all combinations from 00000000 to 01111111. All 128 values in this range are mapped to a specific character.
UTF-8 keeps these exact mappings. The character represented by 01101011 in ASCII is also represented by the same byte in UTF-8. All other characters are encoded in sequences of multiple bytes in which each byte has the highest bit set; i.e. every byte of all non-ASCII characters in UTF-8 is of the form 1xxxxxxx.
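As a quick check of that in Python (a small sketch), the byte 01101011 decodes to the same character under both ASCII and UTF-8:
b = bytes([0b01101011])
print(b.decode("ascii"))   # 'k'
print(b.decode("utf-8"))   # 'k' -- the same single-byte mapping in both encodings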
Unicode is backward compatible with ASCII because ASCII is a subset of Unicode. Unicode simply uses all character codes in ASCII and adds more.
Although character codes are generally written out as U+0041 in Unicode, the character codes are numeric, so hexadecimal 0041 is the same value as decimal 65.
UTF-8 is not a character set but an encoding used with Unicode. It happens to be compatible with ASCII too, because the byte values used for multi-byte encodings lie in the part of the byte range that the ASCII character set leaves unused.
Note that it's only the 7-bit ASCII character set that is compatible with Unicode and UTF-8; the 8-bit character sets based on ASCII, like IBM850 and windows-1250, use the part of the byte range where UTF-8 has codes for multi-byte encodings.
Why:
Because everything was already in ASCII, having a backwards-compatible Unicode format made adoption much easier. It's much easier to convert a program to use UTF-8 than to UTF-16, and that program inherits the backwards-compatible nature by still working with ASCII.
How:
ASCII is a 7-bit encoding, but it is always stored in bytes, which are 8 bits. That means 1 bit has always been unused.
UTF-8 simply uses that extra bit to signify non-ASCII characters.
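A small sketch of that in Python: pure ASCII text produces identical bytes under both encodings, and every byte belonging to a non-ASCII character has that extra (high) bit set:
print("hello".encode("utf-8") == "hello".encode("ascii"))  # True
for b in "héllo".encode("utf-8"):
    print(f"{b:08b}", "ASCII byte" if b < 0x80 else "part of a multi-byte sequence")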
Suppose we have a Unicode string encoded in UTF-8, which is just bytes in memory.
If a computer wants to convert these bytes to their corresponding Unicode codepoints (numbers), how can it know where one character ends and another one begins? Some characters are represented by 1 byte, others by up to 6 bytes. So if you have
00111101 10111001
This could represent 2 characters, or 1. How does the computer decide which it is in order to interpret it correctly? Is there some sort of convention by which we can tell from the first byte how many bytes the current character uses?
The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:
0xxxxxxx is a character on its own;
10xxxxxx is a continuation of a multibyte character;
110xxxxx is the first byte of a 2-byte character;
1110xxxx is the first byte of a 3-byte character;
11110xxx is the first byte of a 4-byte character.
Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.
So, the example posed in the question has one ASCII character and one continuation byte that doesn't encode a character on its own.
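Here is a minimal sketch of that convention in Python (the function name is made up; it inspects only the leading byte and does not validate the continuation bytes):
def utf8_sequence_length(first_byte):
    # Number of bytes in a UTF-8 sequence, judged from its first byte only.
    if first_byte < 0b10000000:
        return 1                      # 0xxxxxxx: a character on its own
    if first_byte < 0b11000000:
        raise ValueError("continuation byte, not the start of a character")
    if first_byte < 0b11100000:
        return 2                      # 110xxxxx
    if first_byte < 0b11110000:
        return 3                      # 1110xxxx
    if first_byte < 0b11111000:
        return 4                      # 11110xxx
    raise ValueError("not a valid UTF-8 leading byte")

print(utf8_sequence_length(0b00111101))   # 1: the '=' from the example above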
Is a "1055912799" ASCII string equivalent to "1055912799" Unicode string?
Yes, the digit characters 0 to 9 in Unicode are defined to be the same characters as in Ascii. More generally, all printable Ascii characters are coded in Unicode, too (and with the same code numbers, by the way).
Whether the internal representations as sequences of bytes are the same depends on the character encoding. The UTF-8 encoding for Unicode has been designed so that Ascii characters have the same coded representation as bytes as in the only encoding currently used for Ascii (which maps each Ascii code number to an 8-bit byte, with the first bit set to zero).
UTF-16 encoded representation for characters in the Ascii range could be said to be “equivalent” to the Ascii encoding in the sense that there is a simple mapping: in UTF-16, each Ascii character appears as two bytes, one zero byte and one byte containing the Ascii number. (The order of these bytes depends on endianness of UTF-16.) But such an “equivalence” concept is normally not used and would not be particularly useful.
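A small sketch showing both points for the string from the question:
s = "1055912799"
print(s.encode("ascii") == s.encode("utf-8"))  # True: identical bytes
print(s.encode("utf-16-le"))                   # each digit becomes two bytes: the Ascii byte plus a zero byte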
Because ASCII is a subset of Unicode, any ASCII string will be the same in Unicode, assuming of course you encode it with UTF-8. Clearly a UTF-16 or UTF-32 encoding will cause it to be fairly bloated.