What is a Pascal-style string?

The Photoshop file format documentation mentions Pascal strings without explaining what they are.
So, what are they, and how are they encoded?

A Pascal-style string has one leading byte (length), followed by length bytes of character data.
This means that Pascal-style strings can only encode strings of between 0 and 255 characters in length (assuming single-byte character encodings such as ASCII).
As an aside, another popular string encoding is C-style strings which have no length specifier, but use a zero-byte to denote the end of the string. They therefore have no length limit.
Yet other encodings may use a greater number of prefix bytes to facilitate longer strings. Terminator bytes/sentinels may also be used along with length prefixes.
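As a rough illustration of the layout described above, here is a minimal Python sketch. The function names pascal_encode and pascal_decode are made up for this example, and it assumes a single-byte encoding such as ASCII; it also ignores any padding that a particular file format might add after the string.

```python
def pascal_encode(text: str, encoding: str = "ascii") -> bytes:
    """Prefix the encoded text with a single length byte."""
    data = text.encode(encoding)
    if len(data) > 255:
        raise ValueError("a one-byte length prefix cannot describe more than 255 bytes")
    return bytes([len(data)]) + data


def pascal_decode(buf: bytes, offset: int = 0, encoding: str = "ascii"):
    """Return (text, offset of the byte just past the string)."""
    length = buf[offset]            # the leading length byte
    start = offset + 1
    end = start + length
    if end > len(buf):
        raise ValueError("buffer is shorter than the length prefix claims")
    return buf[start:end].decode(encoding), end


encoded = pascal_encode("Layer 1")
print(encoded)                  # b'\x07Layer 1'
print(pascal_decode(encoded))   # ('Layer 1', 8)
```

Returning the end offset makes it easy to keep reading whatever follows the string in a larger binary record.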

Related

What's the advantage of UTF-8 over highest-bit encoding such as ULEB128 or LLVM bitcode?

Highest-bit encoding means:
if a byte starts with 0, it's the final byte; if a byte starts with 1, it's not the final byte.
Notice that this is also ASCII compatible; in fact, this is what LLVM bitcode does.
The key property that UTF-8 has which ULEB128 lacks, is that no UTF-8 character encoding is a substring of any other UTF-8 character. Why does this matter? It allows UTF-8 to satisfy an absolutely crucial design criterion: you can apply C string functions that work on single-byte encodings like ASCII and Latin-1 to UTF-8 and they will work correctly. This is not the case for ULEB128, as we'll see. That in turn means that most programs written to work with ASCII or Latin-1 will also "just work" when passed UTF-8 data, which was a huge benefit for the adoption of UTF-8 on older systems.
First, here's a brief primer on how ULEB128 encoding works: each code point is encoded as a sequence of bytes terminated by a byte without the high bit set—we'll call that a "low byte" (values 0-127); each leading byte has the high bit set—we'll call that a "high byte" (values 128-255). Since ASCII characters are 0-127, they are all low bytes and thus are encoded the same way in ULEB128 (and Latin-1 and UTF-8) as in ASCII: i.e. any ASCII string is also a valid ULEB128 string (and Latin-1 and UTF-8), encoding the same sequence of code points.
Note the subcharacter property: any ULEB128 character encoding can be prefixed with any of the 127 high byte values to encode longer, higher code points. This cannot occur with UTF-8 because the lead byte encodes how many following bytes there are, so no UTF-8 character can be contained in any other.
Why does this property matter? One particularly bad case of this is that the ULEB128 encoding of any character whose code point is divisible by 128 will contain a trailing NUL byte (0x00). For example, the character Ā (capital "A" with bar over it) has code point U+0100 which would be encoded in ULEB128 as the bytes 0x82 and 0x00. Since many C string functions interpret the NUL byte as a string terminator, they would incorrectly interpret the second byte to be the end of the string and interpret the 0x82 byte as an invalid dangling high byte. In UTF-8 on the other hand, a NUL byte only ever occurs in the encoding of the NUL byte code point, U+00, so this issue cannot occur.
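To make the NUL problem concrete, here is a small Python sketch of the high-bit scheme as described in the primer above (most significant 7-bit group first). The name highbit_encode is just an illustrative helper, not a standard function.

```python
def highbit_encode(code_point: int) -> bytes:
    """7-bit groups, most significant first; every byte except the last has the high bit set."""
    groups = [code_point & 0x7F]                    # final byte: high bit clear
    code_point >>= 7
    while code_point:
        groups.append(0x80 | (code_point & 0x7F))   # leading bytes: high bit set
        code_point >>= 7
    return bytes(reversed(groups))


a_macron = "\u0100"                                 # Ā
print(highbit_encode(ord(a_macron)).hex())          # 8200
print(a_macron.encode("utf-8").hex())               # c480
print(b"\x00" in a_macron.encode("utf-8"))          # False
```

The trailing 00 in the first result is exactly the byte that a NUL-terminated string function would treat as the end of the string.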
Similarly, if you use strchr(str, c) to search a ULEB128-encoded string for an ASCII character, you can get false positives. Here's an example. Suppose you have the following very Unicode file path: 🌯/☯. This is presumably a file in your collection of burrito recipes for the burrito you call "The Yin & Yang". This path name has the following sequence of code points:
🌯 — Unicode U+1F32F
/ — ASCII/Unicode U+2F
☯ — Unicode U+262F
In ULEB128 this would be encoded with the following sequence of bytes:
10000111 – leading byte of 🌯 (U+1F32F)
11100110 – leading byte of 🌯 (U+1F32F)
00101111 – final byte of 🌯 (U+1F32F)
00101111 – only byte of / (U+2F)
11001100 – leading byte of ☯ (U+262F)
00101111 – final byte of ☯ (U+262F)
If you use strchr(path, '/') to search for slashes to split this path into path components, you'll get bogus matches for the last byte in 🌯 and the last byte in ☯, so to the C code it would look like this path consists of an invalid directory name \x87\xE6 followed by two slashes, followed by another invalid directory name \xCC and a final trailing slash. Compare this with how this path is encoded in UTF-8:
11110000 — leading byte of 🌯 (U+1F32F)
10011111 — continuation byte of 🌯 (U+1F32F)
10001100 — continuation byte of 🌯 (U+1F32F)
10101111 — continuation byte of 🌯 (U+1F32F)
00101111 — only byte of / (U+2F)
11100010 — leading byte of ☯ (U+262F)
10011000 — continuation byte of ☯ (U+262F)
10101111 — continuation byte of ☯ (U+262F)
The encoding of / is 0x2F which doesn't—and cannot—occur anywhere else in the string since the character does not occur anywhere else—the last bytes of 🌯 and ☯ are 0xAF rather than 0x2F. So using strchr to search for / in the UTF-8 encoding of the path is guaranteed to find the slash and only the slash.
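The path example can be checked with a few lines of Python. This is only a sketch: highbit_encode is a hypothetical helper implementing the high-bit scheme from the primer, and bytes.count stands in for repeated strchr calls.

```python
def highbit_encode(code_point: int) -> bytes:
    """7-bit groups, most significant first; only the last byte has the high bit clear."""
    groups = [code_point & 0x7F]
    code_point >>= 7
    while code_point:
        groups.append(0x80 | (code_point & 0x7F))
        code_point >>= 7
    return bytes(reversed(groups))


path = "\U0001F32F/\u262F"                     # 🌯/☯
highbit = b"".join(highbit_encode(ord(ch)) for ch in path)
utf8 = path.encode("utf-8")

print(highbit.hex(" "))        # 87 e6 2f 2f cc 2f
print(highbit.count(b"/"))     # 3
print(utf8.hex(" "))           # f0 9f 8c af 2f e2 98 af
print(utf8.count(b"/"))        # 1
```

A byte-level search sees three slashes in the high-bit encoding but only the one real slash in UTF-8.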
The situation is similar with using strstr(haystack, needle) to search for substrings: using ULEB128 this would find actual occurrences of needle, but it would also find occurrences of needle where the first character is replaced by any character whose encoding extends the actual first character. Using UTF-8, on the other hand, there cannot be false positives—every place where the encoding of needle occurs in haystack it can only encode needle and cannot be a fragment of the encoding of some other string.
There are other pleasant properties that UTF-8 has—many as a corollary of the subcharacter property—but I believe this is the really crucial one. A few of these other properties:
You can predict how many bytes you need to read in order to decode a UTF-8 character by looking at its first byte. With ULEB128, you don't know you're done reading a character until you've read the last byte.
If you embed a valid UTF-8 string in any data, valid or invalid, it cannot affect the interpretation of the valid substring. This is a generalization of the property that no character is a subsequence of any other character.
If you start reading a UTF-8 stream in the middle and you can't look backwards, you can tell if you're at the start of a character or in the middle of one. With ULEB128, you might decode the tail end of a character as a different character that looks valid but is incorrect, whereas with UTF-8 you know to skip ahead to the start of a whole character.
If you start reading a UTF-8 stream in the middle and you want to skip backwards to the start of the current character, you can do so without reading past the start of the character. With ULEB128 you'd have to read one byte before the start of a character (if possible) to know where the character starts.
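The first-byte and resynchronization properties listed above can be sketched in Python as follows. The helper names are invented for this example, and the code deliberately does no validation (it does not reject overlong forms or lead bytes that can never appear in well-formed UTF-8).

```python
def utf8_length_from_lead(byte: int) -> int:
    """Number of bytes in the character that starts with this lead byte."""
    if byte < 0x80:
        return 1            # 0xxxxxxx
    if 0xC0 <= byte < 0xE0:
        return 2            # 110xxxxx
    if 0xE0 <= byte < 0xF0:
        return 3            # 1110xxxx
    if 0xF0 <= byte < 0xF8:
        return 4            # 11110xxx
    raise ValueError("not a lead byte")


def is_continuation(byte: int) -> bool:
    return (byte & 0xC0) == 0x80    # 10xxxxxx


def char_start(buf: bytes, index: int) -> int:
    """Walk backwards from index to the first byte of the character containing it."""
    while index > 0 and is_continuation(buf[index]):
        index -= 1
    return index


data = "a\u262F\U0001F32F".encode("utf-8")    # 1 + 3 + 4 bytes
print(utf8_length_from_lead(data[1]))         # 3
print(char_start(data, 6))                    # 4
```

Note that char_start never needs to look at a byte before the character's own lead byte, which is the point made in the last item above.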
To address the previous two points you could use a little-endian variant of ULEB128 where you encode the bits of each code point in groups of seven from least to most significant. Then the first byte of each character's encoding is a low byte and the trailing bytes are high bytes. However, this encoding sacrifices another pleasant property of UTF-8, which is that if you sort encoded strings lexicographically by bytes, they are sorted into the same order as if you had sorted them lexicographically by characters.
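Here is a quick Python check of the sorting property. The function le_variant is a hypothetical implementation of the little-endian variant just described (low byte first, then high bytes); Python compares str values by code point, which serves as the reference order.

```python
def le_variant(text: str) -> bytes:
    """Hypothetical little-endian scheme: low byte first, then high bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        out.append(cp & 0x7F)                 # first byte: high bit clear
        cp >>= 7
        while cp:
            out.append(0x80 | (cp & 0x7F))    # trailing bytes: high bit set
            cp >>= 7
    return bytes(out)


words = ["b", "\u0100", "a", "\u262F", "\U0001F32F"]
by_code_point = sorted(words)                             # str compares by code point
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))  # compare encoded bytes
by_le = sorted(words, key=le_variant)

print(by_utf8 == by_code_point)   # True
print(by_le == by_code_point)     # False
```

In the little-endian variant, Ā (U+0100) starts with the byte 0x00 and therefore sorts before "a", even though its code point is larger.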
An interesting question is whether UTF-8 is inevitable given this set of pleasant properties? I suspect that some variation is possible, but not very much.
You want to anchor either the start or end of the character with some marker that cannot occur anywhere else. In UTF-8 this is a byte with 0, 2, 3 or 4 leading ones, which indicates the start of a character.
You want to encode the length of a character somehow, not just an indicator of when a character is done, since this prevents extending one valid character to another one. In UTF-8 the number of leading one bits in the leading byte indicates how many bytes are in the character.
You want the start/end marker and the length encoding at the beginning, so that in the middle of a stream you know whether you're at the start of a character or not, and so that you know how many bytes you need to complete the character.
There are probably some variations that can satisfy all of this and the lexicographical ordering property, but with all these constraints UTF-8 starts to seem pretty inevitable.
If you drop a byte in a stream, or jump into the middle of a stream, UTF-8 can synchronize itself without ambiguity. If all you have is a single "last/not-last" flag, then you can't detect that a byte has been lost, and will silently decode the sequence incorrectly. In UTF-8, this can be converted to REPLACEMENT CHARACTER (U+FFFD) to indicate a non-decodable section, or the entire character will be lost. But it's much rarer to get the wrong character due to dropped bytes.
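A small Python illustration of the dropped-byte case, using the built-in UTF-8 decoder with errors="replace". The sample string is arbitrary.

```python
original = "caf\u00e9!".encode("utf-8")     # b'caf\xc3\xa9!'
damaged = original[:4] + original[5:]       # drop the continuation byte 0xA9

print(damaged)                              # b'caf\xc3!'

# The decoder notices the incomplete sequence and substitutes U+FFFD,
# then carries on with the next character undisturbed.
recovered = damaged.decode("utf-8", errors="replace")
print(ascii(recovered))                     # 'caf\ufffd!'
```

The damage stays confined to the one character that lost a byte; the "!" after it still decodes correctly.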
It also provides a clear memory limit on decoding. From the first byte, you immediately know how many bytes you need to read to complete the decode. This is important for security, since the scheme you describe could be used to exhaust memory unless you include an arbitrary limit on the length of a sequence. (This problem does exist for Unicode combining characters, and there's a special Stream-Safe Text Format that imposes just such arbitrary limits, but this only impacts parsers that require normalization.) This also can help performance, since you know immediately how many bytes to load to decode the entire code point.
You also know how many bytes you can safely skip in a searching operation. If the first byte is a mismatch, you can jump to the next character in a single step without reading the intermediate bytes.
It is also fairly unambiguous with other encodings since it's sparse. A random sequence of bytes, and particularly an extended ASCII encoding, is almost certainly not valid UTF-8. There are a few values that can never occur, so if found, you can rule out UTF-8 entirely, and there are rules for the values that can occur. The scheme you're describing is dense. Every sequence is legal, just like extended ASCII. That makes it harder to accurately detect the encoding.
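The sparseness is easy to observe from Python. The helper looks_like_utf8 below is a made-up name for a simple strict-decode test, not a full encoding detector.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes form well-formed UTF-8 according to Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("caf\u00e9".encode("utf-8")))     # True
print(looks_like_utf8("caf\u00e9".encode("latin-1")))   # False
print(looks_like_utf8(b"\xff\xfe"))                     # False
```

With a dense scheme where every byte sequence is legal, a check like this would always return True and tell you nothing.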
There are trade-offs in most encodings.
Note: you are proposing something between encoding and escaping, which makes some things difficult to compare.
UTF-8 gained popularity because it is quick to decode, it is fully compatible with ASCII, it is self-synchronizing, and it can encode the BMP (code points up to 0xFFFF) in at most 3 bytes (so not much overhead).
Your encoding is also self-synchronizing: a 0 in the high bit indicates the end of a code point. It is quick: decoding is just a shift operation (plus a comparison and masking of the high bits). And it is efficient: you encode characters up to 0x3FFF with just two bytes (UTF-8 only reaches 0x07FF). Such encodings were not unknown when UTF-8 was developed.
But the problem is that you are using an "escaping" model. The byte 0x41 (ASCII letter A) can be read as the letter A, or as the final byte of a longer sequence (so not the letter A). This may be a security problem when simply comparing strings. It is easy to correct the code (you only need to look back a single byte). Note: UTF-8 also has some security issues: overlong sequences (which are illegal) may slip past simple string comparisons.
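To illustrate the overlong-sequence remark, here is a short Python sketch. The two bytes 0xC0 0xAF are the classic overlong form that carries the bits of "/" (U+002F); a strict decoder such as Python's rejects them.

```python
overlong_slash = b"\xc0\xaf"      # 110 00000, 10 101111: the bits of U+002F in two bytes

try:
    overlong_slash.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

print(accepted)                   # False
print(b"/" in overlong_slash)     # False: a byte search for 0x2F sees nothing
print("/".encode("utf-8"))        # b'/'
```

A lenient decoder that accepted the overlong form would produce a slash that an earlier byte-level check never saw, which is why such sequences must be rejected.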
Note: I do not think an overly long sequence is a problem. UTF-8 also has a limit (which was lowered compared to the first version), and in practical use we convert to an integer, which has a limited size (so it should not spill into other bytes). OTOH, as with UTF-8, one should check the size and reject invalid sequences/code points (e.g. above 0x10FFFF).
So, I do not know the exact reason, and UTF-8 also "sucks" because it is not compatible with ISO-10646 (C1 control codes encoded as a single byte). OTOH the UTF-1 design is also bad (and so it is now an obsolete encoding).
I suspect such encodings were ruled out because of the "escaping" problem: easy to fix, but not interoperable (in a secure way) with old programs. UTF-8-unaware programs just work without surprises (but possibly with errors from not knowing the meaning of high-bit characters). [Note: your proposal and UTF-8 both fail ISO-10646 (but it allows UTF-8 with special consideration: disregard bytes 0x80-0x9F (C1 bytes)), so terminals may need to be UTF-8 aware.]
I think efficiency was not so high on the list of priorities: CJK characters in ISO-10646 were coded with longer sequences than in UTF-8, and getting rid of most ISO-10646 encoding handling would simplify most programs. [ISO-10646 also has additional (non-encoding) functionality.]

Python 3 str to bytes conversion problem

In Python 3.8.5 I try to convert some bytes to a string and then the string back to bytes:
>>> a=chr(128)
>>> a
'\x80'
>>> type(a)
<class 'str'>
But when I try to convert it back:
>>> a.encode()
b'\xc2\x80'
What is the \xc2 byte? Why does it appear?
Thanks for any response!
This is the UTF-8 encoding, and that is where the \xc2 comes from.
In a Python string, \x80 means Unicode codepoint #128 (Padding Character). When we encode that codepoint in UTF-8, it takes two bytes.
The original ASCII encoding only had 128 different characters, there are many thousands of Unicode codepoints, and a single byte can only represent 256 different values. A lot of computing is based on ASCII, and we’d like that stuff to keep working, but we need non-English-speakers to be able to use computers too, so we need to be able to represent their characters.
The answer is UTF-8, a scheme that encodes the first 128 Unicode code points (0-127, the ASCII characters) as a single byte – so text that only uses those characters is completely compatible with ASCII. The next 1920 characters, containing the most common non-English characters (U+80 up to U+7FF) are spread across two bytes.
So, in exchange for being slightly less efficient with some characters that could fit in a one-byte encoding (such as \x80), we gain the ability to represent every character from every written language.
For more reading, try this SO question
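If it helps, the two-byte case can be worked out by hand. The sketch below (utf8_two_byte is a made-up helper name) packs a code point into the 110xxxxx 10xxxxxx pattern and compares the result with Python's own encoder.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF by hand: 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("outside the two-byte range")
    lead = 0xC0 | (code_point >> 6)              # marker 110 plus the top 5 bits
    continuation = 0x80 | (code_point & 0x3F)    # marker 10 plus the low 6 bits
    return bytes([lead, continuation])


print(utf8_two_byte(0x80))                                # b'\xc2\x80'
print(utf8_two_byte(0x80) == chr(128).encode("utf-8"))    # True
print(0x7FF - 0x80 + 1)                                   # 1920
```

For code point 128 the top five bits are 00010, so the lead byte is 11000010, which is the \xc2 from the question.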
For example, if you want to get rid of the \xc2, encode your string as latin-1:
a = chr(128)
print(repr(a))
# '\x80'
print(a.encode())
# b'\xc2\x80'
print(a.encode('latin-1'))
# b'\x80'
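More generally, Latin-1 maps byte N to code point N, so it round-trips arbitrary bytes through str. A brief sketch:

```python
raw = bytes(range(256))                 # every possible byte value once
text = raw.decode("latin-1")            # byte N becomes code point N

print(text.encode("latin-1") == raw)    # True
print(len(text.encode("utf-8")))        # 384
```

The UTF-8 form is longer because the 128 code points from U+0080 to U+00FF each take two bytes there.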

How are Linux shells and filesystem Unicode-aware?

I understand that Linux filesystem stores file names as byte sequences which is meant to be Unicode-encoding-independent.
But, encodings other than UTF-8 or Enhanced UTF-8 may very well use 0 byte as part of a multibyte representation of a Unicode character that can appear in a file name. And everywhere in Linux filesystem C code you terminate strings with 0 byte. So how does Linux filesystem support Unicode? Does it assume all applications that create filenames use UTF-8 only? But that is not true, is it?
Similarly, the shells (such as bash) use * in patterns to match any number of filename characters. I can see in shell C code that it simply uses the ASCII byte for * and goes byte-by-byte to delimit the match. Fine for UTF-8 encoded names, because it has the property that if you take the byte representation of a string, then match some bytes from the start with *, and match the rest with another string, then the bytes at the beginning in fact matched a string of whole characters, not just bytes.
But other encodings do not have that property, do they? So again, do shells assume UTF-8?
It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).
It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:
The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be at least 128 characters that are encoded using only a single byte, namely:
The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)
UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.
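As a rough check of these rules in Python, the sketch below looks at the katakana letter ソ (U+30BD) in a few codecs. Shift-JIS and UTF-16 are my own choice of counterexamples here; they are encodings that break the second or first rule respectively.

```python
katakana_so = "\u30bd"                          # ソ

sjis = katakana_so.encode("shift_jis")
print(sjis.hex())                               # 835c
print(0x5C in sjis)                             # True: the second byte is the ASCII backslash value

# UTF-16 uses 16-bit units, so ASCII characters pick up zero bytes.
print("a".encode("utf-16-le"))                  # b'a\x00'

# In UTF-8 and EUC-JP every byte of a non-ASCII character is >= 0x80.
for codec in ("utf-8", "euc_jp"):
    print(codec, all(b >= 0x80 for b in katakana_so.encode(codec)))
# utf-8 True
# euc_jp True
```

A program that scans bytes for backslashes or slashes would misread the Shift-JIS form, but not the UTF-8 or EUC-JP forms.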
Some programs may also require an additional property:
The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.
UTF-8 has this property, but (I think) EUC-JP doesn't.
It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)
There is no encoding of filenames in Linux (in ext family of filesystems at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else. The filesystem doesn't care.
POSIX stipulates that the shell obeys the locale environment variables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much as such encodings are not commonly supported by existing locales. UTF-8 on the other hand seems to be supported well: in my experiments bash correctly matches the ? character with a single Unicode character, rather than a single byte, in the filename (given a UTF-8 locale) as prescribed by POSIX.
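The difference between matching ? against a character and against a byte can be imitated with Python's fnmatch module. This is only an analogy to what the shell does, not bash's actual code: matching str values treats ? as one character, while matching bytes treats it as one byte.

```python
import fnmatch

name = "\u262f.txt"                   # ☯.txt
name_bytes = name.encode("utf-8")     # ☯ occupies three bytes

print(fnmatch.fnmatchcase(name, "?.txt"))            # True: ? matched one character
print(fnmatch.fnmatchcase(name_bytes, b"?.txt"))     # False: ? can only cover one byte
print(fnmatch.fnmatchcase(name_bytes, b"???.txt"))   # True
print(fnmatch.fnmatchcase(name_bytes, b"*.txt"))     # True
```

The * case works on raw bytes either way, which matches the observation in the question that byte-by-byte matching of * is harmless for UTF-8 names.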

Are these ASCII / Unicode strings equivalent?

Is a "1055912799" ASCII string equivalent to "1055912799" Unicode string?
Yes, the digit characters 0 to 9 in Unicode are defined to be the same characters as in ASCII. More generally, all printable ASCII characters are coded in Unicode, too (and with the same code numbers, by the way).
Whether the internal representations as sequences of bytes are the same depends on the character encoding. The UTF-8 encoding for Unicode has been designed so that ASCII characters have the same coded representation as bytes as in the only encoding currently used for ASCII (which maps each ASCII code number to an 8-bit byte, with the first bit set to zero).
The UTF-16 encoded representation for characters in the ASCII range could be said to be “equivalent” to the ASCII encoding in the sense that there is a simple mapping: in UTF-16, each ASCII character appears as two bytes, one zero byte and one byte containing the ASCII number. (The order of these bytes depends on the endianness of UTF-16.) But such an “equivalence” concept is normally not used and would not be particularly useful.
Because ASCII is a subset of Unicode, any ASCII string will be the same in Unicode, assuming of course you encode it with UTF-8. Clearly a UTF-16 or UTF-32 encoding will cause it to be fairly bloated.
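This is easy to confirm in Python for the string from the question. The little-endian codec names are used here only to avoid a byte order mark in the output.

```python
s = "1055912799"

print(s.encode("ascii") == s.encode("utf-8"))   # True
print(s.encode("utf-16-le")[:4])                # b'1\x000\x00'
print(len(s.encode("utf-8")),
      len(s.encode("utf-16-le")),
      len(s.encode("utf-32-le")))               # 10 20 40
```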

Do I have to take the encoding into account when performing Boyer-Moore pattern matching?

I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?
Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?
In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.
So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?
A multi-byte encoding can be used with a byte-oriented search routine IF it is self-synchronizing.
So, you can safely use Boyer-Moore with:
CESU-8
UTF-8
UTF-EBCDIC
But can NOT use it with:
Big5
EUC-JP
GBK / GB18030
ISO 2022
Johab
Punycode
Shift-JIS
UTF-7
UTF-16
UTF-32
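For what it's worth, here is a minimal sketch of the Sunday variant working directly on bytes with a 256-entry shift table. The function name sunday_search is my own, and the example assumes both text and pattern are UTF-8; it is an illustration under those assumptions, not a tuned implementation.

```python
def sunday_search(text: bytes, pattern: bytes) -> int:
    """Return the byte offset of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # shift[b]: how far to move the window when byte b is the one just past it.
    shift = [m + 1] * 256
    for i, b in enumerate(pattern):
        shift[b] = m - i              # later occurrences overwrite earlier ones
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m >= n:              # no byte past the window: nothing left to try
            break
        pos += shift[text[pos + m]]
    return -1


text = "na\u00efve caf\u00e9 \u262f \U0001F32F/\u262f"
pattern = "\U0001F32F/"
encoded = text.encode("utf-8")
offset = sunday_search(encoded, pattern.encode("utf-8"))

# Decoding the prefix turns the byte offset back into a character index.
print(len(encoded[:offset].decode("utf-8")) == text.find(pattern))   # True
```

Because UTF-8 is self-synchronizing, a byte-level match always starts on a character boundary, so the prefix before the match decodes cleanly.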

Resources