What exactly does "encoding-independent" mean for a string?

While reading the Strings and Characters chapter of the official Swift documentation, I found the following sentence:
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question: What exactly does "encoding-independent" mean?

From my reading of Advanced Swift by Chris and other experience, the point this sentence is trying to convey is twofold.
First, what are the various Unicode representations?
UTF-8 : compatible with ASCII
UTF-16
UTF-32
The number on the right-hand side is the size in bits of a single code unit, the basic building block of the encoding; it is not the size of a character.
A character may occupy one or more code units: an ASCII character needs only one 8-bit code unit in UTF-8, while every character occupies exactly one 32-bit code unit in UTF-32.
However, a Chinese character that fits in one UTF-32 code unit might not always fit in one block of UTF-16 memory. A character beyond U+FFFF needs two UTF-16 code units (a surrogate pair), and in UTF-8 it will have a count of 4.
Then comes the storing part. When you store a character in a String, it does not matter which encoding you later read it with:
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
This means you can compose a String any way you like, and this won't affect what you get when reading it through the various Unicode encoding views such as UTF-8, UTF-16, or UTF-32.
For example, when I load a Japanese character that takes 24 bits (3 bytes) to store in UTF-8, the same character is displayed irrespective of my choice of encoding.
However, the count value will differ between views. There are other concepts, such as code units and code points, that make up these Strings.
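The same idea can be illustrated outside Swift; here is a minimal Python sketch (chosen only because it is easy to run) showing one stored character viewed through three encodings:

```python
# The Japanese character "あ" (U+3042) stored once, viewed through
# three encodings. The character is the same; only the code-unit
# counts differ.
ch = "\u3042"  # あ

utf8 = ch.encode("utf-8")       # 3 bytes (three 8-bit code units)
utf16 = ch.encode("utf-16-be")  # 2 bytes (one 16-bit code unit)
utf32 = ch.encode("utf-32-be")  # 4 bytes (one 32-bit code unit)

print(len(utf8), len(utf16), len(utf32))  # 3 2 4

# Reading back with the matching encoding always yields the same character.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == utf32.decode("utf-32-be") == ch
```

The character itself is encoding-independent; only the number of code units you see through each view changes.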
For the Unicode encoding variants, I would highly recommend reading this article, which goes much deeper into the String API in Swift:
Detail View of String API in swift

Related

Can lowercasing a UTF-8 string cause it to grow?

I'm helping out with someone writing some code to compare UTF-8 strings in a case-insensitive way. The scheme they are using is to uppercase the strings and then compare. The input strings can all fit in a 255 byte array. The output string similarly must fit in a 255 byte array.
I'm not a UTF-8 or Unicode expert, but I think this scheme can't work for all strings. My understanding is that either lowercasing or uppercasing a UTF-8 string can result in the output string being longer (byte-array-wise), so changing case is probably not the best way to attack this problem. I'm trying to demonstrate the difficulty by giving a few strings that will not work with this design.
For example, take a string of the character U+0587 repeated 100 times. U+0587 takes two bytes in UTF-8, so the overall length of the byte array for the string is 200 bytes (ignoring the trailing null for now). If that string is uppercased, however, it becomes U+0535 U+0552, and each of those takes two bytes, for a total of 4 bytes. The 200 byte array is now 400 bytes, and cannot be stored in the limited space available.
So here's my question: I gave an example of a lowercase character needing more space to store when uppercased. Are there any examples of an uppercase character needing more space to store when lowercased? The locale is always en_US.UTF-8 in this case.
Thanks for any help.
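For what it's worth, the U+0587 example above can be reproduced with a short Python sketch (illustrative only; the actual code under discussion is C):

```python
# U+0587 (Armenian small ligature ech-yiwn) is 2 bytes in UTF-8,
# but uppercases to the two-character sequence U+0535 U+0552 (4 bytes).
s = "\u0587" * 100

assert len(s.encode("utf-8")) == 200          # fits in a 255-byte array
assert s.upper() == "\u0535\u0552" * 100
assert len(s.upper().encode("utf-8")) == 400  # no longer fits
```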
Yes. Examples from my environment:
U+023A Ⱥ (lowercases to U+2C65 ⱥ, which grows from 2 to 3 bytes in UTF-8)
U+023E Ⱦ (lowercases to U+2C66 ⱦ, likewise 2 to 3 bytes)
There are several related factors that could cause variation. You already pointed out one:
The locale that you specify will affect the casing of characters that the locale is concerned with.
The version of the Unicode Common Locale Data Repository that your library uses.
The version of the Unicode Character Database that your library uses.
These aren't fixed targets because we can expect that there will be future versions and that there will be users using characters from them.
Ultimately, this comes down to your environment and to the practical purpose the comparison serves.
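A quick way to check those examples (Python here purely as illustration; the results depend on the Unicode Character Database version your runtime ships):

```python
# Uppercase letters whose lowercase forms take more bytes in UTF-8.
for ch in ("\u023A", "\u023E"):  # Ⱥ, Ⱦ
    lowered = ch.lower()         # U+2C65 ⱥ, U+2C66 ⱦ
    print(f"{ch} ({len(ch.encode('utf-8'))} bytes) -> "
          f"{lowered} ({len(lowered.encode('utf-8'))} bytes)")
# Ⱥ (2 bytes) -> ⱥ (3 bytes)
# Ⱦ (2 bytes) -> ⱦ (3 bytes)

assert len("\u023A".lower().encode("utf-8")) > len("\u023A".encode("utf-8"))
```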

What is the difference between a unicode and binary string?

I am using Python 3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concept of Unicode and binary strings is confusing to me. How can I change b'\\u4f60' into b'\x4f\x60'?
First - there is no difference between unicode literals and string literals in Python 3. They are one and the same - you can drop the u up front. Just write strings. So you should immediately see that the literal u'4f60' is just like writing '4f60'.
A bytes literal - aka b'some literal' - is a series of bytes. Bytes between 32 and 127 (aka ASCII) can be displayed as their corresponding glyph, the rest are displayed as the \x escaped version. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
A string literal is a string literal. It can contain unicode codepoints. There is far too much to cover to explain how unicode works here, but basically a codepoint represents a glyph (essentially, a character - a graphical representation of a letter/digit), it does not specify how the machine needs to represent it. In fact there are a great many different ways.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
How can I change b'\\u4f60' into b'\x4f\x60'?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60' #note, unicode-escaped
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph (你). \x4f\x60 is, if we interpret it as ASCII (or UTF-8, actually), the letter O (\x4f) followed by a backtick (\x60).
I can ask python to turn that unicode-escaped bytes sequence into a valid string with the according unicode glyph:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can I change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that unicode codepoint. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60' - utf-16be.
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: b'O`'
The reason this works is that UTF-16 is a variable-length encoding. For code points below U+10000 it directly uses the code point's value as the 2-byte encoding, and for code points above that it uses something called "surrogate pairs", which I won't get into.
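Putting the walkthrough together in one runnable sketch (the emoji at the end is an assumption added here only to show where the 2-byte rule stops):

```python
# Round trip: unicode-escaped bytes -> str -> UTF-16BE bytes.
escaped = b"\\u4f60"                      # the literal characters \u4f60
text = escaped.decode("unicode-escape")   # '你' (U+4F60)
assert text.encode("utf-16-be") == b"\x4f\x60"

# Code points below U+10000 take one 16-bit unit (2 bytes)...
assert len("\u4f60".encode("utf-16-be")) == 2
# ...while those above need a surrogate pair (4 bytes).
assert len("\U0001F600".encode("utf-16-be")) == 4  # 😀
```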

Proper encoding for fixed-length storage of Unicode strings?

I'm going to be working on software (in C#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects to store each string as a fixed-length byte array in one binary file so he can quickly access any string using an index (index * length = starting offset, then read the fixed-length number of bytes). I understand that .NET internally uses a UTF-16 encoding, which I believe is technically a variable-length encoding (depending upon the number of the Unicode code point). I'm fairly certain that English, German and Spanish would all use two bytes/character when encoded using UTF-16, but I'm not so sure about Arabic. It looks like there might be some Arabic characters that could possibly require more than two bytes each in UTF-16, and that would seem to break the firmware developer's plan to store the strings at a fixed length.
First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!
Unicode terminology:
Each entry in the Unicode character set is a code point
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8 bit code units; UTF-16 uses 16 bit code units)
The user-visible grapheme might consist of a sequence of code points
So:
A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be less than the number of code points
So, if you want to support the entire Unicode range, you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding. (I'm assuming unused bytes will be set to 0x00 and appended or trimmed during I/O.)
In terms of communicating length restrictions via a user interface you'll probably want to decide on some compromise based on a code unit size and the typical customer rather than try to find the width of the most complicated grapheme you can build.
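Although the question is about C#, the code-unit widths above are easy to verify with a short Python sketch (the sample characters are illustrative assumptions):

```python
samples = {
    "A": "\u0041",       # ASCII letter
    "beh": "\u0628",     # Arabic letter ب
    "emoji": "\U0001F600",  # 😀, outside the BMP
}
for name, ch in samples.items():
    print(name,
          len(ch.encode("utf-8")),      # 1-4 bytes
          len(ch.encode("utf-16-le")),  # 2 or 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes

# Arabic letters are still 2 bytes in UTF-16; only code points
# above U+FFFF (like emoji) need 4. UTF-32 alone is fixed-width.
assert len("\u0628".encode("utf-16-le")) == 2
assert len("\U0001F600".encode("utf-16-le")) == 4
assert all(len(c.encode("utf-32-le")) == 4 for c in samples.values())
```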

Does a string's length equal the byte size?

Exactly that: does a string's length equal the byte size? Does it depend on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding. It could be 2 or 4 bytes per character, or even a mix of 1, 2, and 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. Surrogate pairs require two UTF-16 code units for a full Unicode character in the range U+10000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) commonly use two bytes per code unit, and some characters need more than one code unit.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will take linear rather than constant time for large n, unlike with one byte per character.
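A short Python sketch makes the character-count vs. byte-count distinction concrete (illustrative only; PHP's strlen, for comparison, counts bytes):

```python
s = "héllo"                        # 'é' is a single character...
print(len(s))                      # 5 characters
print(len(s.encode("utf-8")))      # 6 bytes: 'é' takes 2 bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 10 bytes: 2 bytes per code unit

# Length in characters and size in bytes are different quantities.
assert len(s) != len(s.encode("utf-8"))
```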

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).
Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8
A character encoding consists of a sequence of codes that each look up a symbol from a given character set. Please see this good article on Wikipedia on character encoding.
UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte scheme works:
The most significant bit of a single-byte character is always 0.
The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
A UTF-8 stream contains neither the byte FE nor FF, which ensures that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (byte-order mark).
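Those bit rules can be checked directly; here is a small Python sketch (the sample character is an arbitrary choice):

```python
b = "你".encode("utf-8")  # U+4F60 -> 3-byte UTF-8 sequence
assert len(b) == 3
assert b[0] >> 4 == 0b1110      # lead byte of a 3-byte sequence starts 1110
assert b[1] >> 6 == 0b10        # continuation bytes start 10
assert b[2] >> 6 == 0b10
assert 0xFE not in b and 0xFF not in b  # FE/FF never appear in UTF-8
```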
The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
UTF16
Uses 2 bytes or 4 bytes for each symbol. (The older UCS2 is a fixed 2-byte encoding and cannot represent code points outside the BMP.)
UTF32 (UCS4)
Always uses 4 bytes for each symbol.
char just means a byte of data and is not an actual encoding. It is not analogous to UTF8/UTF16/ascii. A char* pointer can refer to any type of data and any encoding.
STL:
Both STL's std::wstring and std::string are not designed for variable-length character encodings like UTF-8 and UTF-16.
How to implement:
Take a look at the iconv library. iconv is a powerful character encoding conversion library used by projects such as libxml (the XML C parser of GNOME).
Other great resources on character encoding:
tbray.org's Characters vs. Bytes
IANA character sets
www.cs.tut.fi's A tutorial on code issues
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by @Dylan Beattie)
Received wisdom suggests that Spolsky's article misses a couple of important points.
This article is recommended as being more complete:
The Unicode® Standard: A Technical Introduction
This article is also a good introduction: Unicode Basics
The latter in particular gives an overview of the character encoding forms and schemes for Unicode.
The various UTF standards are ways to encode "code points". A code point is an index into the Unicode character set.
Another encoding is UCS2, which is always 16-bit and thus doesn't support the full Unicode range.
It is also good to know that one code point isn't always equal to one character. For example, a character such as å can be represented either as a single code point (U+00E5) or as two code points, one for the a (U+0061) and one for the combining ring (U+030A).
Comparing two Unicode strings thus requires normalization to get the canonical representation before comparison.
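For example, in Python (a sketch; any language exposing Unicode normalization behaves equivalently):

```python
import unicodedata

precomposed = "\u00e5"   # å as a single code point
combining = "a\u030a"    # a + combining ring above

assert precomposed != combining  # naive comparison fails
# Normalizing both sides to a canonical form makes them compare equal.
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```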
There is also the issue of fonts. There are two ways to handle fonts: either you use a gigantic font with glyphs for all the Unicode characters you need (I think recent versions of Windows come with one or two such fonts), or you use some library capable of combining glyphs from various fonts dedicated to subsets of the Unicode standard.
