YOaf/MrA - What character encoding?

What kind of character encoding are the strings below?
KDLwuq6IC
YOaf/MrAT
0vGzc3aBN
SQdLlM8G7
https://en.wikipedia.org/wiki/Character_encoding

Character encoding is the encoding of strings into bytes (or numbers). You are only showing us the characters themselves; they don't have any encoding by themselves.
Different character encodings can encode different ranges of characters. Your characters are all in the ASCII range, at least, so they would be compatible with any scheme that incorporates ASCII as a subset, such as Windows-1252 and of course the Unicode encodings (UTF-8, UTF-16LE, UTF-16BE, etc.).
Note that your strings look a lot like Base64. Base64 is not a character encoding, though; it is an encoding of bytes into characters (so the other way around). Base64 can usually be recognized by the / and + characters in the text, and by the text consisting of blocks whose length is a multiple of 4 characters (as 4 characters encode 3 bytes).
Judging from the text, you are probably looking for an encoding scheme of that kind rather than a character encoding.
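As a minimal sketch (Python used here purely for illustration), this is what "4 characters encode 3 bytes" looks like in practice:

    import base64

    raw = b"hi!"                      # three arbitrary bytes
    encoded = base64.b64encode(raw)   # 3 bytes -> 4 Base64 characters
    print(encoded)                    # b'aGkh'
    print(base64.b64decode(encoded))  # b'hi!' -- back to the original bytes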

Related

How are Linux shells and filesystem Unicode-aware?

I understand that Linux filesystem stores file names as byte sequences which is meant to be Unicode-encoding-independent.
But encodings other than UTF-8 or Enhanced UTF-8 may very well use a 0 byte as part of the multibyte representation of a Unicode character that can appear in a file name, and everywhere in Linux filesystem C code strings are terminated with a 0 byte. So how does the Linux filesystem support Unicode? Does it assume that all applications that create filenames use UTF-8 only? But that is not true, is it?
Similarly, shells (such as bash) use * in patterns to match any number of filename characters. In the shell C code I can see that it simply uses the ASCII byte for * and goes byte by byte to delimit the match. That is fine for UTF-8-encoded names, because UTF-8 has the property that if you take the byte representation of a string, match some bytes from the start with *, and match the rest against another string, then the bytes at the beginning did in fact match a string of whole characters, not just bytes.
But other encodings do not have that property, do they? So again, do shells assume UTF-8?
It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).
It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:
The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be at least 128 characters that are encoded using only a single byte, namely:
The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)
UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.
Some programs may also require an additional property:
The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.
UTF-8 has this property, but (I think) EUC-JP doesn't.
It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)
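As an illustrative sketch of these properties (Python is used here only for demonstration, and the filename is just an assumed example): in UTF-8 the bytes 0x2F (/) and 0x00 (NUL) appear only where the text really contains those characters, while UTF-16 puts 0x00 bytes all through ordinary ASCII text, which is exactly why it cannot be used for pathnames:

    name = "é-fichier"                   # hypothetical filename containing one non-ASCII character
    print(name.encode("utf-8"))          # b'\xc3\xa9-fichier' -- no 0x00 bytes, no stray 0x2f
    print(name.encode("utf-16-le"))      # b'\xe9\x00-\x00f\x00...' -- full of 0x00 bytes
    print(0x2f in name.encode("utf-8"))  # False: a '/' byte can only come from an actual '/'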
There is no encoding of filenames in Linux (in the ext family of filesystems, at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else; the filesystem doesn't care.
POSIX stipulates that the shell obeys the locale environment variables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much, as such encodings are not commonly supported by existing locales. UTF-8, on the other hand, seems to be supported well: in my experiments bash correctly matches the ? character with a single Unicode character, rather than a single byte, in the filename (given a UTF-8 locale), as prescribed by POSIX.
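To make the byte-versus-character distinction concrete, here is a small Python sketch (not bash itself, just an illustration of the same idea): a shell ? pattern, translated to a regular expression, matches one character when applied to decoded text but only one byte when applied to the raw UTF-8 bytes:

    import re

    # The shell glob "?.txt" corresponds roughly to the regular expression ".\.txt"
    print(re.fullmatch(r".\.txt", "é.txt") is not None)                   # True: ? matches one character
    print(re.fullmatch(rb".\.txt", "é.txt".encode("utf-8")) is not None)  # False: é is two bytes in UTF-8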

What exactly does "encoding-independent" mean?

While reading the Strings and Characters chapter of the official Swift document I found the following sentence
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question What exactly do encoding-independent mean?
From my reading on Advanced Swift By Chris and other experiences, the thing that this sentence is trying to convey can be 2 folds.
First, what are various unicode representations:
UTF-8: compatible with ASCII
UTF-16
UTF-32
The number on the right-hand side is the size, in bits, of a single code unit in that representation. A character may take one or more code units: an ASCII character needs a single 8-bit code unit in UTF-8, while every character takes exactly one 32-bit code unit in UTF-32.
For example, a Chinese character always fits in one UTF-32 code unit, but it might not fit in a single UTF-16 code unit (characters outside the Basic Multilingual Plane need a surrogate pair), and such a character has a count of 4 code units in UTF-8.
Then comes the storage part. When you store a character in a String, it doesn't matter how you want to read it back later.
This means you can compose a String any way you like, and that won't affect how the same text reads back through the various Unicode encoding forms such as UTF-8, UTF-16 or UTF-32.
For example: when I load a Japanese character that takes up 24 bits (3 bytes) in UTF-8, the same character is displayed irrespective of my choice of encoding.
However, the count value will differ between representations. There are other concepts to consider too, such as the code units and code points that make up these strings.
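The following Python sketch (illustrative only; the Japanese character is just an assumed example) mirrors what Swift's count, utf8, utf16 and unicodeScalars views expose. The character is the same, but the number of code units depends on the representation:

    s = "日"                                # one character; 3 bytes (24 bits) in UTF-8
    print(len(s))                           # 1  (character count)
    print(len(s.encode("utf-8")))           # 3  (UTF-8 code units)
    print(len(s.encode("utf-16-le")) // 2)  # 1  (UTF-16 code units)
    print(len(s.encode("utf-32-le")) // 4)  # 1  (UTF-32 code units)

    g = "𝄞"                                 # U+1D11E, outside the Basic Multilingual Plane
    print(len(g.encode("utf-8")))           # 4  (UTF-8 code units)
    print(len(g.encode("utf-16-le")) // 2)  # 2  (UTF-16 surrogate pair)
    print(len(g.encode("utf-32-le")) // 4)  # 1  (UTF-32 code unit)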
For the Unicode encoding variants, I would highly recommend reading this article, which goes much deeper into the String API in Swift:
Detail View of String API in swift

Why UTF-8 is compatible with ASCII

A in UTF-8 is U+0041 LATIN CAPITAL LETTER A. A in ASCII is 065.
How is UTF-8 backwards-compatible with ASCII?
ASCII uses only the lower 7 bits of an 8-bit byte, so all bit combinations from 00000000 to 01111111. Each of the 128 byte values in this range is mapped to a specific character.
UTF-8 keeps these exact mappings: the character represented by 01101011 in ASCII is represented by the same byte in UTF-8. All other characters are encoded as sequences of multiple bytes in which each byte has the highest bit set; i.e. every byte of every non-ASCII character in UTF-8 is of the form 1xxxxxxx.
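A quick Python sketch (used here only as an illustration) makes this visible: the ASCII character keeps its single-byte value, while every byte of a non-ASCII character's UTF-8 encoding has the high bit set:

    print(list("A".encode("utf-8")))                    # [65] -- identical to the ASCII byte
    print(list("é".encode("utf-8")))                    # [195, 169] -- two bytes, both >= 0x80
    print(all(b >= 0x80 for b in "é".encode("utf-8")))  # True: the high bit marks non-ASCII bytes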
Unicode is backward compatible with ASCII because ASCII is a subset of Unicode. Unicode simply uses all character codes in ASCII and adds more.
Although character codes are generally written out as 0041 in Unicode, the character codes are just numbers, so 0041 is the same value as hexadecimal 41, i.e. decimal 65.
UTF-8 is not a character set but an encoding used with Unicode. It happens to be compatible with ASCII too, because the byte values used for multi-byte sequences lie in the range that ASCII leaves unused.
Note that it is only the 7-bit ASCII character set that is compatible with UTF-8 in this way; the 8-bit character sets based on ASCII, like IBM850 and windows-1250, use the byte range in which UTF-8 places its multi-byte sequences.
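For example (a Python sketch, with windows-1250 from the note above; é is just an assumed sample character): the accented letter is a single high byte in windows-1250, and that lone byte is not valid UTF-8 on its own, which is exactly where the two schemes diverge:

    b = "é".encode("windows-1250")   # a single byte in the 0x80-0xFF range
    print(b)
    try:
        b.decode("utf-8")            # in UTF-8, bytes >= 0x80 may only appear inside multi-byte sequences
    except UnicodeDecodeError as err:
        print("not valid UTF-8 on its own:", err)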
Why:
Because so much existing text and software was already ASCII, having a backwards-compatible Unicode format made adoption much easier. It is much easier to convert a program to UTF-8 than to UTF-16, and such a program inherits the backwards compatibility by still working with plain ASCII.
How:
ASCII is a 7-bit encoding, but it is always stored in 8-bit bytes. That means one bit has always been unused.
UTF-8 simply uses that extra bit to signify non-ASCII characters.

Are these ASCII / Unicode strings equivalent?

Is a "1055912799" ASCII string equivalent to "1055912799" Unicode string?
Yes, the digit characters 0 to 9 in Unicode are defined to be the same characters as in Ascii. More generally, all printable Ascii characters are coded in Unicode, too (and with the same code numbers, by the way).
Whether the internal representations as sequences of bytes are the same depends on the character encoding. The UTF-8 encoding for Unicode has been designed so that Ascii characters have the same coded representation as bytes as in the only encoding currently used for Ascii (which maps each Ascii code number to an 8-bit byte, with the first bit set to zero).
UTF-16 encoded representation for characters in the Ascii range could be said to be “equivalent” to the Ascii encoding in the sense that there is a simple mapping: in UTF-16, each Ascii characters appears as two bytes, one zero byte and one byte containing the Ascii number. (The order of these bytes depends on endianness of UTF-16.) But such an “equivalence” concept is normally not used and would not be particularly useful.
Because ASCII is a subset of Unicode, any ASCII string will be byte-for-byte the same in Unicode, assuming of course that you encode it with UTF-8. A UTF-16 or UTF-32 encoding, on the other hand, will make it considerably larger.
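A small Python sketch of this point (illustrative only): the ASCII and UTF-8 byte representations of the digit string are identical, while UTF-16 doubles the size by inserting a zero byte per character:

    s = "1055912799"
    print(s.encode("ascii") == s.encode("utf-8"))  # True: byte-for-byte identical
    print(s.encode("utf-16-le"))                   # b'1\x000\x005\x00...' -- one zero byte per character
    print(len(s.encode("utf-8")),
          len(s.encode("utf-16-le")),
          len(s.encode("utf-32-le")))              # 10 20 40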

Do I have to take the encoding into account when performing Boyer-Moore pattern matching?

I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?
Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?
In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.
So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?
A multi-byte encoding can be used with a byte-oriented search routine IF it is self-synchronizing, i.e. the encoded bytes of one character can never be mistaken for part of another character (see the sketch after the lists below).
So, you can safely use Boyer-Moore with:
CESU-8
UTF-8
UTF-EBCDIC
But can NOT use it with:
Big5
EUC-JP
GBK / GB18030
ISO 2022
Johab
Punycode
Shift-JIS
UTF-7
UTF-16
UTF-32
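Here is the sketch referred to above (Python, purely illustrative): a raw byte search in Shift-JIS can produce a false hit, because the second byte of a two-byte character can equal an ASCII byte, while the same search over UTF-8 bytes cannot:

    text = "表"                                # U+8868; its Shift-JIS second byte is 0x5C, the ASCII backslash
    needle = "\\".encode("shift_jis")          # b'\\' -- the pattern, searched for as raw bytes

    print(needle in text.encode("shift_jis"))  # True: a false positive, the text contains no backslash
    print(needle in text.encode("utf-8"))      # False: UTF-8 is self-synchronizing, so no false match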
