Is the [0xff, 0xfe] prefix required on utf-16 encoded strings? - python-3.x

Rewritten question!
I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII based, so I thought this would be the way to translate my string into the vendor's string:
>>> b1 = 'abc'.encode('utf-16')
But examining the result, I see that there's a leading [0xff, 0xfe] on the bytearray:
>>> [hex(b) for b in b1]
['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
Since the vendor's device is not expecting the [0xff, 0xfe], I can strip it off...
>>> b2 = 'abc'.encode('utf-16')[2:]
>>> [hex(b) for b in b2]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
... which is what I want.
But what really surprises me is that I can decode b1 and b2, and they both reconstitute to the original string:
>>> b1.decode('utf-16') == b2.decode('utf-16')
True
So my two intertwined questions:
What is the significance of the 0xff, 0xfe on the head of the encoded bytes?
Is there any hazard in stripping off the 0xff, 0xfe prefix, as with b2 above?

This observation
... what really surprises me is that I can decode b1 and b2 and they both reconstitute to the original string:
b1.decode('utf-16') == b2.decode('utf-16')
True
suggests there is a built-in default, because there are two possible arrangements for the 16-bit wide UTF-16 codes: Big and Little Endian.
Normally, Python deduces the endianness to use from the BOM when reading – and so it always adds one when writing. If you want to force a specific endianness, you can use the explicit encodings utf-16-le and utf-16-be:
… when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
(https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data)
But if you do not use a specific ordering, then what default gets used? The original Unicode proposal, PEP 100, warns
Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.
(https://www.python.org/dev/peps/pep-0100/, my emph.)
Yet it works for you. If we look up in the Python source code how this is managed, we find this comment in _codecsmodule.c:
/* This version provides access to the byteorder parameter of the
builtin UTF-16 codecs as optional third argument. It defaults to 0
which means: use the native byte order and prepend the data with a
BOM mark.
*/
and deeper, in unicodeobject.c,
/* Check for BOM marks (U+FEFF) in the input and adjust current
byte order setting accordingly. In native mode, the leading BOM
mark is skipped, in all other modes, it is copied to the output
stream as-is (giving a ZWNBSP character). */
So initially the byte order is set to your system's default, and when you start decoding UTF-16 data and a BOM follows, the byte order is set to whatever the BOM specifies. The "native mode" in this last comment covers the case where no byte order has been explicitly declared and none has been encountered by way of a BOM; only then is your system's endianness used.
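A minimal sketch (assuming CPython 3) that makes this native-order default visible; the comments describe what you would see on a typical little-endian machine:

import sys

# 'utf-16' prepends a BOM that matches the machine's native byte order,
# which is what the codec comment above is describing.
data = 'abc'.encode('utf-16')
print(sys.byteorder)                 # 'little' on most desktop CPUs
print(data[:2])                      # b'\xff\xfe' if little-endian,
                                     # b'\xfe\xff' if big-endian

# Decoding BOM-less bytes with plain 'utf-16' falls back to native order.
print(data[2:].decode('utf-16'))     # 'abc' only because the order happens to match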

This is the byte order mark. It's a prefix to a UTF document that indicates what endianness the document uses. It does this by encoding the code point U+FEFF in the document's byte order - in this case, little-endian (less significant byte first). Anything trying to read it the other way around, as big-endian (more significant byte first), will read the first character as U+FFFE, a code point that is deliberately defined never to be a valid character, informing the reader that it needs to error out or switch endianness for the rest of the document.
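For example (my own sketch, not part of the original claim): decoding the same bytes with the wrong explicit byte order surfaces that U+FFFE noncharacter instead of the intended text.

data = 'abc'.encode('utf-16')           # BOM + little-endian units on most machines

print(data.decode('utf-16'))            # 'abc' - BOM consumed, byte order detected
print(repr(data.decode('utf-16-be')))   # on a little-endian machine:
                                        # '\ufffe\u6100\u6200\u6300' - forced wrong
                                        # order, the would-be BOM becomes U+FFFE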

It is the byte order mark, a.k.a. BOM: see https://en.wikipedia.org/wiki/UTF-16 (look at the subheading "Byte order encoding schemes").
Its purpose is to allow the decoder to detect whether the encoding is little-endian or big-endian.

It is the Unicode byte order mark encoded in UTF-16. Its purpose is to communicate the byte order to a reader expecting text encoded with a Unicode character encoding.
You can omit it if the reader otherwise knows or comes to know the byte order.
'abc'.encode('utf-16-le')

The answers, and especially the comment from usr2564301, are helpful: the 0xff 0xfe prefix is the byte order mark (BOM), and it carries the endianness information along with the byte string. If you know which endianness you want, you can specify utf-16-le or utf-16-be as part of the encoding.
This makes it clear:
>>> 'abc'.encode('utf-16').hex()
'fffe610062006300'
>>> 'abc'.encode('utf-16-le').hex()
'610062006300'
>>> 'abc'.encode('utf-16-be').hex()
'006100620063'

Related

What's the advantage of UTF-8 over highest-bit encoding such as ULEB128 or LLVM bitcode?

Highest-bit encoding means
if a byte starts with 0 it's the final byte, if a byte starts with 1 it's not the final byte
Notice that this is also ASCII-compatible; in fact, this is what LLVM bitcode does.
The key property that UTF-8 has which ULEB128 lacks, is that no UTF-8 character encoding is a substring of any other UTF-8 character. Why does this matter? It allows UTF-8 to satisfy an absolutely crucial design criterion: you can apply C string functions that work on single-byte encodings like ASCII and Latin-1 to UTF-8 and they will work correctly. This is not the case for ULEB128, as we'll see. That in turn means that most programs written to work with ASCII or Latin-1 will also "just work" when passed UTF-8 data, which was a huge benefit for the adoption of UTF-8 on older systems.
First, here's a brief primer on how ULEB128 encoding works: each code point is encoded as a sequence of bytes terminated by a byte without the high bit set—we'll call that a "low byte" (values 0-127); each leading byte has the high bit set—we'll call that a "high byte" (values 128-255). Since ASCII characters are 0-127, they are all low bytes and thus are encoded the same way in ULEB128 (and Latin-1 and UTF-8) as in ASCII: i.e. any ASCII string is also a valid ULEB128 string (and Latin-1 and UTF-8), encoding the same sequence of code points.
Note the subcharacter property: any ULEB128 character encoding can be prefixed with any of the 127 high byte values to encode longer, higher code points. This cannot occur with UTF-8 because the lead byte encodes how many following bytes there are, so no UTF-8 character can be contained in any other.
Why does this property matter? One particularly bad case is that the ULEB128 encoding of any character whose code point is divisible by 128 will contain a trailing NUL byte (0x00). For example, the character Ā (capital "A" with a bar over it) has code point U+0100, which would be encoded in ULEB128 as the bytes 0x82 and 0x00. Since many C string functions interpret the NUL byte as a string terminator, they would incorrectly treat the second byte as the end of the string and interpret the 0x82 byte as an invalid dangling high byte. In UTF-8, on the other hand, a NUL byte only ever occurs in the encoding of the NUL code point, U+0000, so this issue cannot occur.
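A small Python sketch of that NUL-byte problem; uleb_encode is a hypothetical helper implementing the big-endian, high-bit-terminated scheme described above, not anything from a real library:

def uleb_encode(cp):
    # Encode a code point in the big-endian, high-bit-terminated scheme
    # described above (for illustration only).
    groups = [cp & 0x7F]                   # final "low" byte
    cp >>= 7
    while cp:
        groups.append((cp & 0x7F) | 0x80)  # leading "high" bytes
        cp >>= 7
    return bytes(reversed(groups))

print(uleb_encode(ord('Ā')).hex())         # '8200' - contains a NUL byte
print('Ā'.encode('utf-8').hex())           # 'c480' - no NUL byte anywhere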
Similarly, if you use strchr(str, c) to search a ULEB128-encoded string for an ASCII character, you can get false positives. Here's an example. Suppose you have the following very Unicode file path: 🌯/☯. This is presumably a file in your collection of burrito recipes for the burrito you call "The Yin & Yang". This path name has the following sequence of code points:
🌯 — Unicode U+1F32F
/ — ASCII/Unicode U+2F
☯ — Unicode U+262F
In ULEB128 this would be encoded with the following sequence of bytes:
10000111 – leading byte of 🌯 (U+1F32F)
11100110 – leading byte of 🌯 (U+1F32F)
00101111 – final byte of 🌯 (U+1F32F)
00101111 – only byte of / (U+2F)
11001100 – leading byte of ☯ (U+262F)
00101111 – final byte of ☯ (U+262F)
If you use strchr(path, '/') to search for slashes to split this path into path components, you'll get bogus matches for the last byte in 🌯 and the last byte in ☯, so to the C code it would look like this path consists of an invalid directory name \x87\xE6 followed by two slashes, followed by another invalid directory name \xCC and a final trailing slash. Compare this with how this path is encoded in UTF-8:
11110000 — leading byte of 🌯 (U+1F32F)
10011111 — continuation byte of 🌯 (U+1F32F)
10001100 — continuation byte of 🌯 (U+1F32F)
10101111 — continuation byte of 🌯 (U+1F32F)
00101111 — only byte of / (U+2F)
11100010 — leading byte of ☯ (U+262F)
10011000 — continuation byte of ☯ (U+262F)
10101111 — continuation byte of ☯ (U+262F)
The encoding of / is 0x2F which doesn't—and cannot—occur anywhere else in the string since the character does not occur anywhere else—the last bytes of 🌯 and ☯ are 0xAF rather than 0x2F. So using strchr to search for / in the UTF-8 encoding of the path is guaranteed to find the slash and only the slash.
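The same false-positive effect can be sketched in Python, with bytes.count and bytes.split standing in for strchr; the hypothetical encoding has no real codec, so its bytes are written out by hand from the list above:

# The path 🌯/☯ in the two encodings.
uleb = bytes([0x87, 0xE6, 0x2F, 0x2F, 0xCC, 0x2F])
utf8 = '🌯/☯'.encode('utf-8')    # b'\xf0\x9f\x8c\xaf/\xe2\x98\xaf'

print(uleb.count(b'/'))    # 3 - the real slash plus two false positives
print(utf8.count(b'/'))    # 1 - only the real slash

print(uleb.split(b'/'))    # [b'\x87\xe6', b'', b'\xcc', b''] - bogus path components
print(utf8.split(b'/'))    # [b'\xf0\x9f\x8c\xaf', b'\xe2\x98\xaf'] - correct split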
The situation is similar with using strstr(haystack, needle) to search for substrings: using ULEB128 this would find actual occurrences of needle, but it would also find occurrences of needle where the first character is replaced by any character whose encoding extends the actual first character. Using UTF-8, on the other hand, there cannot be false positives—every place where the encoding of needle occurs in haystack it can only encode needle and cannot be a fragment of the encoding of some other string.
There are other pleasant properties that UTF-8 has—many as a corollary of the subcharacter property—but I believe this is the really crucial one. A few of these other properties:
You can predict how many bytes you need to read in order to decode a UTF-8 character by looking at its first byte. With ULEB128, you don't know you're done reading a character until you've read the last byte.
If you embed a valid UTF-8 string in any data, valid or invalid, it cannot affect the interpretation of the valid substring. This is a generalization of the property that no character's encoding is a substring of any other character's encoding.
If you start reading a UTF-8 stream in the middle and you can't look backwards, you can tell if you're at the start of a character or in the middle of one. With ULEB128, you might decode the tail end of a character as a different character that looks valid but is incorrect, whereas with UTF-8 you know to skip ahead to the start of a whole character.
If you start reading a UTF-8 stream in the middle and you want to skip backwards to the start of the current character, you can do so without reading past the start of the character (see the sketch at the end of this answer). With ULEB128 you'd have to read the byte before the start of a character (if possible) to know where the character starts.
To address the previous two points you could use a little-endian variant of ULEB128 where you encode the bits of each code point in groups of seven from least to most significant. Then the first byte of each character's encoding is a low byte and the trailing bytes are high bytes. However, this encoding sacrifices another pleasant property of UTF-8, which is that if you sort encoded strings lexicographically by bytes, they are sorted into the same order as if you had sorted them lexicographically by characters.
An interesting question is whether UTF-8 is inevitable given this set of pleasant properties. I suspect some variation is possible, but not very much.
You want to anchor either the start or end of the character with some marker that cannot occur anywhere else. In UTF-8 this is a byte with 0, 2, 3 or 4 leading ones, which indicates the start of a character.
You want to encode the length of a character somehow, not just an indicator of when a character is done, since this prevents extending one valid character to another one. In UTF-8 the number of leading bits in the leading byte indicates how many bytes are in the character.
You want the start/end marker and the length encoding at the beginning, so that in the middle of a stream you know whether you're at the start of a character, and so that you know how many bytes you need to complete the character.
There are probably some variations that can satisfy all of this and the lexicographical ordering property, but with all these constraints UTF-8 starts to seem pretty inevitable.
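Here is the sketch of the backwards-resynchronization property promised above (plain Python, illustrative only): in UTF-8 you back up to the start of the current character just by skipping continuation bytes, i.e. bytes matching 10xxxxxx.

def start_of_char(buf, i):
    # Back up from offset i to the first byte of the UTF-8 character
    # containing it (assumes buf is valid UTF-8).
    while i > 0 and (buf[i] & 0xC0) == 0x80:    # 10xxxxxx = continuation byte
        i -= 1
    return i

s = '🌯/☯'.encode('utf-8')
print(start_of_char(s, 2))    # 0 - byte 2 is in the middle of the 4-byte 🌯
print(start_of_char(s, 4))    # 4 - the '/' is a whole character by itself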
If you drop a byte in a stream, or jump into the middle of a stream, UTF-8 can synchronize itself without ambiguity. If all you have is a single "last/not-last" flag, then you can't detect that a byte has been lost, and will silently decode the sequence incorrectly. In UTF-8, this can be converted to REPLACEMENT CHARACTER (U+FFFD) to indicate a non-decodable section, or the entire character will be lost. But it's much rarer to get the wrong character due to dropped bytes.
It also provides a clear memory limit on decoding. From the first byte, you immediately know how many bytes you need to read to complete the decode. This is important for security, since the scheme you describe could be used to exhaust memory unless you include an arbitrary limit on the length of a sequence. (This problem does exist for Unicode combining characters, and there's a special Stream-Safe Text Format that imposes just such arbitrary limits, but this only impacts parsers that require normalization.) This also can help performance, since you know immediately how many bytes to load to decode the entire code point.
You also know how many bytes you can safely skip in a searching operation. If the first byte is a mismatch, you can jump to the next character in a single step without reading the intermediate bytes.
It is also fairly unambiguous with other encodings since it's sparse. A random sequence of bytes, and particularly an extended ASCII encoding, is almost certainly not valid UTF-8. There are a few values that can never occur, so if found, you can rule out UTF-8 entirely, and there are rules for the values that can occur. The scheme you're describing is dense. Every sequence is legal, just like extended ASCII. That makes it harder to accurately detect the encoding.
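A rough Python sketch of the first two points: the lead byte alone gives the length (assuming it really is a valid lead byte), and a dropped byte turns into replacement characters rather than silently shifting the rest of the text. The helper name is made up.

def utf8_len(lead):
    # Bytes in a UTF-8 character, from its lead byte alone
    # (assumes lead is a valid lead byte, not a continuation byte).
    if lead < 0x80: return 1      # 0xxxxxxx
    if lead < 0xE0: return 2      # 110xxxxx
    if lead < 0xF0: return 3      # 1110xxxx
    return 4                      # 11110xxx

print(utf8_len('é'.encode('utf-8')[0]))    # 2
print(utf8_len('☯'.encode('utf-8')[0]))    # 3
print(utf8_len('🌯'.encode('utf-8')[0]))   # 4

# A dropped byte becomes U+FFFD with errors='replace' instead of silently
# shifting every later character.
damaged = '🌯/☯'.encode('utf-8')[1:]       # lose the lead byte of 🌯
print(damaged.decode('utf-8', errors='replace'))   # '���/☯'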
There are trade-offs in most encodings.
Note: you are proposing something between an encoding and an escaping scheme, which makes some things difficult to compare.
UTF-8 gained popularity because it is quick to decode, it is fully compatible with ASCII, it is self-synchronizing, and you can encode the whole BMP (code points up to 0xFFFF) with at most 3 bytes (so not much overhead).
Your encoding is also self-synchronizing: a 0 in the high bit marks the end of a code point. It is quick: just shifts (plus comparing and masking the high bits). And it is efficient: you can encode code points up to 0x7FFF with just two bytes (UTF-8 only reaches 0x07FF with two). Such encodings were not unknown when UTF-8 was developed.
But the problem is that you are using an "escaping" model. The byte 0x41 (ASCII letter A) can be read either as the letter A or as the final byte of a longer sequence (and so not a letter A at all), which can be a security problem when naively comparing strings. It is easy to correct the code (you only need to check one byte backwards). Note: UTF-8 also has security issues of its own: overlong sequences (which are illegal) may slip past simple string comparisons.
Note: I do not think arbitrarily long sequences are the real problem. UTF-8 also has a length limit (which was lowered compared with the first version), and in practice you convert to an integer of limited size (so it should not spill into other bytes). On the other hand, as with UTF-8, one should check the size and reject invalid sequences/code points (e.g. above 0x10FFFF).
So I do not know the exact reason, and UTF-8 also "sucks" in that it is not compatible with ISO-10646 (C1 control codes as single bytes). On the other hand, the UTF-1 design is also bad (and it is now an obsolete encoding).
I suspect such encodings were ruled out because of the "escaping" problem: easy to fix, but not interoperable (in a secure way) with old programs. UTF-8-unaware programs just work without surprises (apart from possibly reporting errors because they do not know the meaning of high-bit characters). [Note: your proposal and UTF-8 both fail ISO-10646, but it allows UTF-8 with a special consideration: disregard the bytes 0x80-0x9F (the C1 bytes), so terminals may need to be UTF-8 aware.]
I think efficiency was not so high on the list of priorities: CJK characters under ISO-10646 were encoded with longer sequences than in UTF-8, and getting rid of most of the ISO-10646 encoding handling would simplify most programs. [ISO-10646 also has additional (non-encoding) functionality.]
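To illustrate the overlong-sequence remark above (a sketch relying only on CPython's standard UTF-8 codec): the two-byte overlong form of '/' is rejected outright by a strict decoder.

overlong_slash = b'\xc0\xaf'     # an (illegal) two-byte encoding of U+002F '/'

try:
    overlong_slash.decode('utf-8')
except UnicodeDecodeError as exc:
    print('rejected:', exc)      # strict decoders must refuse overlong forms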

How to save bytes to file as binary mode

I have a bytes-like object something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it to a file in binary mode. The code is below, but it does not work well:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file looks like this: (screenshot omitted)
But what I want is this, with the Unicode characters stored as encoded bytes in the binary data: (screenshot omitted)
How can I convert the bytes so they are saved in the file that way?
\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of characters, and they do not straightforwardly map to a serialization format (consisting of bytes or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
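Putting it together, a sketch of the corrected script (the file name and the choice of UTF-16-BE are just assumptions carried over from the question and answer):

if __name__ == "__main__":
    aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')   # a str literal, then encode
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)

    # Read it back and decode with the same encoding to check the round trip.
    with open('database.bin', "rb") as datafile:
        print(datafile.read().decode('utf-16-be'))      # abc混合def.mp3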

How does UTF16 encode characters?

EDIT
Since it seems I'm not going to get an answer to the general question, I'll restrict it to one detail: is my understanding of the following correct?
That surrogates work as follows:
If the first pair of bytes is not between D800 and DBFF, there will not be a second pair.
If it is between D800 and DBFF: a) there will be a second pair; b) the second pair will be in the range DC00 to DFFF.
There is no single-pair UTF-16 character with a value between D800 and DBFF.
There is no single-pair UTF-16 character with a value between DC00 and DFFF.
Is this right?
Original question
I've tried reading about UTF16 but I can't seem to understand it. What are "planes" and "surrogates" etc.? Is a "plane" the first 5 bits of the first byte? If so, then why not 32 planes since we're using those 5 bits anyway? And what are surrogates? Which bits do they correspond to?
I do understand that UTF16 is a way to encode Unicode characters, and that it sometimes encodes characters using 16 bits, and sometimes 32 bits, no more no less. I assume that there is some list of values for the first 2 bytes (which are the most significant ones?) which indicates that a second 2 bytes will be present.
But instead of me going on about what I don't understand, perhaps someone can make some order in this?
Yes on all four.
To clarify, the term "pair" in UTF-16 refers to two UTF-16 code units, the first in the range D800-DBFF, the second in DC00-DFFF.
A code unit is 16-bits (2 bytes), typically written as an unsigned integer in hexadecimal (0x000A). The order of the bytes (0x00 0x0A or 0x0A 0x00) is specified by the author or indicated with a BOM (0xFEFF) at the beginning of the file or stream. (The BOM is encoded with the same algorithm as the text but is not part of the text. Once the byte order is determined and the bytes are reordered to the native ordering of the system, it typically is discarded.)
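For completeness, a small Python sketch of the surrogate arithmetic (the helper is hypothetical; the built-in codec at the end confirms the result):

def to_surrogates(cp):
    # Split a code point above U+FFFF into a UTF-16 surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                                  # leaves a 20-bit value
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

hi, lo = to_surrogates(0x1F32F)
print(hex(hi), hex(lo))                            # 0xd83c 0xdf2f
print('\U0001F32F'.encode('utf-16-be').hex())      # d83cdf2f - the same pair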

How are Linux shells and filesystem Unicode-aware?

I understand that the Linux filesystem stores file names as byte sequences, which is meant to be Unicode-encoding-independent.
But encodings other than UTF-8 or Enhanced UTF-8 may very well use a 0 byte as part of a multibyte representation of a Unicode character that can appear in a file name. And everywhere in Linux filesystem C code, strings are terminated with a 0 byte. So how does the Linux filesystem support Unicode? Does it assume all applications that create filenames use UTF-8 only? But that is not true, is it?
Similarly, the shells (such as bash) use * in patterns to match any number of filename characters. I can see in shell C code that it simply uses the ASCII byte for * and goes byte-by-byte to delimit the match. Fine for UTF-8 encoded names, because it has the property that if you take the byte representation of a string, then match some bytes from the start with *, and match the rest with another string, then the bytes at the beginning in fact matched a string of whole characters, not just bytes.
But other encodings do not have that property, do they? So again, do shells assume UTF-8?
It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).
It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:
The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be at least 128 characters that are encoded using only a single byte, namely:
The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)
UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.
Some programs may also require an additional property:
The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.
UTF-8 has this property, but (I think) EUC-JP doesn't.
It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)
There is no encoding of filenames in Linux (in the ext family of filesystems, at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else. The filesystem doesn't care.
POSIX stipulates that the shell obeys the locale environment variables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much, as such encodings are not commonly supported by existing locales. UTF-8, on the other hand, seems to be supported well: in my experiments bash correctly matches the ? pattern character against a single Unicode character, rather than a single byte, in the filename (given a UTF-8 locale), as prescribed by POSIX.
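A quick Python sketch (Linux/POSIX only, file name invented) showing that names really are opaque bytes to the filesystem:

import os

name = b'caf\xe9.txt'             # Latin-1 bytes for 'café.txt'; not valid UTF-8

with open(name, 'wb') as f:       # the kernel accepts any bytes except NUL and '/'
    f.write(b'hello')

print(name in os.listdir(b'.'))   # True - pass bytes in, get bytes back
print(repr(os.fsdecode(name)))    # 'caf\udce9.txt' - Python smuggles the bad byte
                                  # through as a surrogate escape
os.remove(name)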

Does a string's length equal the byte size?

Exactly that: does a string's length equal its byte size? Does it depend on the language?
I think it does, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding: it could be 2 or 4 bytes per character, or even a mix of 1, 2 and 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, surrogate pairs require two UTF-16 code units for a full Unicode character in the range U+10000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) use two bytes per char.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will be linear in n rather than constant time, as it is with one byte per character.
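A concrete Python illustration (not the asker's PHP/MySQL setup) of how character count, byte count, and indexing diverge once you leave ASCII:

s = 'naïve'

print(len(s))                      # 5 characters
print(len(s.encode('utf-8')))      # 6 bytes  (ï takes two)
print(len(s.encode('utf-16-le')))  # 10 bytes (two per code unit)

print(s[2])                        # 'ï' - the 3rd character
print(s.encode('utf-8')[2:4])      # b'\xc3\xaf' - that character's two bytes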
