Are these ASCII / Unicode strings equivalent? - string

Is a "1055912799" ASCII string equivalent to "1055912799" Unicode string?

Yes, the digit characters 0 to 9 in Unicode are defined to be the same characters as in Ascii. More generally, all printable Ascii characters are coded in Unicode, too (and with the same code numbers, by the way).
Whether the internal representations as sequences of bytes are the same depends on the character encoding. The UTF-8 encoding for Unicode has been designed so that Ascii characters have the same coded representation as bytes as in the only encoding currently used for Ascii (which maps each Ascii code number to an 8-bit byte, with the first bit set to zero).
UTF-16 encoded representation for characters in the Ascii range could be said to be “equivalent” to the Ascii encoding in the sense that there is a simple mapping: in UTF-16, each Ascii character appears as two bytes, one zero byte and one byte containing the Ascii code number. (The order of these bytes depends on the endianness of UTF-16.) But such an “equivalence” concept is normally not used and would not be particularly useful.

Because ASCII is a subset of Unicode, any ASCII string will be the same in Unicode, assuming of course you encode it with UTF-8. Clearly a UTF-16 or UTF-32 encoding will cause it to be fairly bloated.
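A minimal check of both answers, sketched in Python: the example string from the question has identical bytes under ASCII and UTF-8, while UTF-16 interleaves a zero byte with each ASCII byte.

```python
s = "1055912799"

ascii_bytes = s.encode("ascii")
utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")  # little-endian, no BOM

# UTF-8 reuses the ASCII byte values exactly.
assert ascii_bytes == utf8_bytes

# UTF-16 uses two bytes per ASCII character: the ASCII byte plus a zero byte.
print(len(ascii_bytes))   # 10
print(len(utf16_bytes))   # 20
print(utf16_bytes[:4])    # b'1\x000\x00'
```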

Related

How are Linux shells and filesystem Unicode-aware?

I understand that Linux filesystem stores file names as byte sequences which is meant to be Unicode-encoding-independent.
But encodings other than UTF-8 or Enhanced UTF-8 may very well use a 0 byte as part of the multibyte representation of a Unicode character that can appear in a file name. And everywhere in Linux filesystem C code, strings are terminated with a 0 byte. So how does the Linux filesystem support Unicode? Does it assume that all applications that create filenames use only UTF-8? But that is not true, is it?
Similarly, the shells (such as bash) use * in patterns to match any number of filename characters. I can see in shell C code that it simply uses the ASCII byte for * and goes byte-by-byte to delimit the match. Fine for UTF-8 encoded names, because it has the property that if you take the byte representation of a string, then match some bytes from the start with *, and match the rest with another string, then the bytes at the beginning in fact matched a string of whole characters, not just bytes.
But other encodings do not have that property, do they? So again, do shells assume UTF-8?
It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).
It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:
The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be 128 characters that are encoded using only a single byte, namely:
The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)
UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.
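The ASCII-compatibility property above can be probed directly, for instance in Python: every byte 0x00 through 0x7F must decode to its ASCII character under any ASCII-compatible encoding, while a wide encoding such as UTF-16 fails the test.

```python
# All 128 single bytes that ASCII defines.
ascii_bytes = bytes(range(128))
reference = ascii_bytes.decode("ascii")

# Each of these encodings maps bytes 0x00-0x7F exactly as ASCII does.
for enc in ["utf-8", "iso-8859-1", "iso-8859-15", "cp1252", "euc-jp"]:
    assert ascii_bytes.decode(enc) == reference

# A wide encoding pairs bytes up into 16-bit units, so it is not
# ASCII-compatible: the same 128 bytes decode to something else entirely.
assert ascii_bytes.decode("utf-16-le") != reference
```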
Some programs may also require an additional property:
The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.
UTF-8 has this property, but (I think) EUC-JP doesn't.
It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)
There is no encoding of filenames in Linux (in ext family of filesystems at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else. The filesystem doesn't care.
POSIX stipulates that the shell obeys the locale environment variables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much, as such encodings are not commonly supported by existing locales. UTF-8, on the other hand, seems to be supported well: in my experiments bash correctly matches the ? pattern character against a single Unicode character, rather than a single byte, in the filename (given a UTF-8 locale), as prescribed by POSIX.
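The byte-versus-character distinction in ? matching can be illustrated without a shell, using Python's fnmatch module as a stand-in (it implements the same glob syntax): with str input ? matches one character, with bytes input it matches one byte, so a two-byte UTF-8 character needs two ?s.

```python
import fnmatch

name = "é"  # one character, but two bytes in UTF-8 (b'\xc3\xa9')

# Character-level matching: '?' matches the single character.
assert fnmatch.fnmatchcase(name, "?")

# Byte-level matching: b'?' matches exactly one byte, so it fails here...
assert not fnmatch.fnmatchcase(name.encode("utf-8"), b"?")

# ...and two ?s are needed to cover the two bytes.
assert fnmatch.fnmatchcase(name.encode("utf-8"), b"??")
```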

YOaf/MrA - What character encoding?

What kind of character encoding are the strings below?
KDLwuq6IC
YOaf/MrAT
0vGzc3aBN
SQdLlM8G7
https://en.wikipedia.org/wiki/Character_encoding
Character encoding is the encoding of strings into bytes (or numbers). You are only showing us the characters themselves; they don't have any encoding by themselves.
Some character encodings have a different range of characters that they can encode. Your characters are all in the ASCII range at least. So they would also be compatible with any scheme that incorporates ASCII as a subset such as Windows-1252 and of course Unicode (UTF-8, UTF-16LE, UTF-16BE etc).
Note that your strings look a lot like base 64. Base 64 is not a character encoding though; it is an encoding of bytes into characters (so the other way around). Base 64 can usually be recognized by the / and + characters in the text, and by the text consisting of blocks that are a multiple of 4 characters (as each 4 characters encode 3 bytes).
Looking at the text you are probably looking for an encoding scheme rather than a character-encoding scheme.
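A small Python sketch of the bytes-to-characters direction described above: every 3 input bytes become 4 output characters drawn from A-Z, a-z, 0-9, + and /. (Note that a 9-character block is not a whole number of base 64 groups, so the strings in the question could at most be fragments of base 64 output.)

```python
import base64

# 3 bytes in, 4 characters out.
data = bytes([0x00, 0x10, 0x83])        # 00000000 00010000 10000011
encoded = base64.b64encode(data)        # split into 6-bit groups: 0, 1, 2, 3
print(encoded)                          # b'ABCD'

# Decoding goes characters -> bytes, the reverse direction.
assert base64.b64decode(encoded) == data
```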

What is a Pascal-style string?

The Photoshop file format documentation mentions Pascal strings without explaining what they are.
So, what are they, and how are they encoded?
A Pascal-style string has one leading byte (length), followed by length bytes of character data.
This means that Pascal-style strings can only encode strings of between 0 and 255 characters in length (assuming single-byte character encodings such as ASCII).
As an aside, another popular string encoding is C-style strings which have no length specifier, but use a zero-byte to denote the end of the string. They therefore have no length limit.
Yet other encodings may use a greater number of prefix bytes to facilitate longer strings. Terminator bytes/sentinels may also be used along with length prefixes.
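As a sketch of the layout described above (in Python; the helper name is illustrative, not anything from the Photoshop documentation), reading a Pascal-style string means reading one length byte and then that many bytes of character data:

```python
import struct

def read_pascal_string(buf, offset=0):
    """Read a length-prefixed (Pascal-style) string starting at offset."""
    (length,) = struct.unpack_from("B", buf, offset)  # one unsigned length byte
    start = offset + 1
    return buf[start:start + length], start + length  # (data, next offset)

# 0x05 length prefix, 5 bytes of data, then unrelated trailing bytes.
data = b"\x05HelloXYZ"
s, end = read_pascal_string(data)
print(s)    # b'Hello'
print(end)  # 6
```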

Why Utf8 is compatible with ascii

A in UTF-8 is U+0041 LATIN CAPITAL LETTER A. A in ASCII is 65.
How is UTF-8 backwards-compatible with ASCII?
ASCII uses only the first 7 bits of an 8-bit byte, so all combinations from 00000000 to 01111111. All 128 values in this range are mapped to a specific character.
UTF-8 keeps these exact mappings. The character represented by 01101011 in ASCII is also represented by the same byte in UTF-8. All other characters are encoded in sequences of multiple bytes in which each byte has the highest bit set; i.e. every byte of every non-ASCII character in UTF-8 is of the form 1xxxxxxx.
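Both claims are easy to check, for example in Python: an ASCII character keeps its single-byte encoding, and every byte of a non-ASCII character has the high bit (0x80) set.

```python
# 0b01101011 == 0x6b is 'k' in both ASCII and UTF-8.
assert "k".encode("utf-8") == b"\x6b"
assert "k".encode("utf-8") == "k".encode("ascii")

# A non-ASCII character encodes to multiple bytes, all of the form 1xxxxxxx.
euro = "€".encode("utf-8")
print(euro)                    # b'\xe2\x82\xac'
for b in euro:
    assert b & 0x80            # high bit set on every byte
```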
Unicode is backward compatible with ASCII because ASCII is a subset of Unicode. Unicode simply uses all character codes in ASCII and adds more.
Although character codes are generally written out as 0041 in Unicode, the character codes are numeric so 0041 is the same value as (hexadecimal) 41.
UTF-8 is not a character set but an encoding used with Unicode. It happens to be compatible with ASCII too, because the byte values used in multibyte encodings all lie in the range that 7-bit ASCII leaves unused.
Note that it's only the 7-bit ASCII character set that is compatible with Unicode and UTF-8; the 8-bit character sets based on ASCII, like IBM850 and windows-1250, use the part of the byte range where UTF-8 has codes for its multibyte sequences.
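The conflict with 8-bit character sets can be demonstrated in Python: windows-1250 encodes é as the single byte 0xE9, but in UTF-8 that byte value may only appear inside a multibyte sequence, so on its own it is not valid UTF-8.

```python
b = "é".encode("cp1250")       # windows-1250
print(b)                       # b'\xe9': a single byte with the high bit set

# The same byte is rejected by a UTF-8 decoder, because 0xE9 announces
# a 3-byte sequence and no continuation bytes follow.
try:
    b.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```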
Why:
Because everything was already in ASCII, and having a backwards-compatible Unicode format made adoption much easier. It's much easier to convert a program to UTF-8 than to UTF-16, and that program inherits the backwards-compatible nature by still working with ASCII.
How:
ASCII is a 7 bit encoding, but is always stored in bytes, which are 8 bit. That means 1 bit has always been unused.
UTF-8 simply uses that extra bit to signify non-ASCII characters.

When converting a utf-8 encoded string from bytes to characters, how does the computer know where a character ends?

Given a Unicode string encoded in UTF-8, which is just bytes in memory.
If a computer wants to convert these bytes to its corresponding Unicode codepoints (numbers), how can it know where one character ends and another one begins? Some characters are represented by 1 byte, others by up to 6 bytes. So if you have
00111101 10111001
This could represent 2 characters, or 1. How does the computer decide which it is to interpret it correctly? Is there some sort of convention from which we can know from the first byte how many bytes the current character uses or something?
The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:
0xxxxxxx is a character on its own;
10xxxxxx is a continuation of a multibyte character;
110xxxxx is the first byte of a 2-byte character;
1110xxxx is the first byte of a 3-byte character;
11110xxx is the first byte of a 4-byte character.
Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.
So, the example posed in the question has one ASCII character and one continuation byte that doesn't encode a character on its own.
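The table above can be turned into a small decoder sketch (in Python; the function name is illustrative) that reads the sequence length off the first byte:

```python
def utf8_length(lead):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, a character on its own
        return 1
    if lead < 0xC0:            # 10xxxxxx: continuation byte, not a lead byte
        raise ValueError("continuation byte cannot start a character")
    if lead < 0xE0:            # 110xxxxx
        return 2
    if lead < 0xF0:            # 1110xxxx
        return 3
    if lead < 0xF8:            # 11110xxx
        return 4
    raise ValueError("more than 4 leading 1-bits: invalid in UTF-8")

# The two bytes from the question: one ASCII character, one stray
# continuation byte that cannot start a character.
print(utf8_length(0b00111101))   # 1
try:
    utf8_length(0b10111001)
except ValueError as e:
    print(e)                     # continuation byte cannot start a character
```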