What is the difference between a unicode and binary string? - python-3.x

I am using Python 3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concepts of Unicode strings and binary strings are confusing to me. How can I change b'\\u4f60' into b'\x4f\x60'?

First - there is no difference between unicode literals and string literals in Python 3. They are one and the same - you can drop the u prefix. Just write strings. So instantly you should see that the literal u'4f60' is just like writing actual '4f60'.
A bytes literal - aka b'some literal' - is a series of bytes. Bytes in the printable ASCII range (32 to 126) are displayed as their corresponding glyph; the rest are displayed in their \x-escaped form. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
A string literal is a string literal. It can contain Unicode code points. There is far too much to cover to explain how Unicode works here, but basically a code point represents a glyph (essentially, a character - a graphical representation of a letter/digit); it does not specify how the machine needs to represent it. In fact there are a great many different ways.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
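As a quick illustration of that mapping (a minimal sketch of my own, not from the original answer): one code point, U+4F60 (你), has several different byte representations depending on which encoding you choose.
text = "\u4f60"                                # the str (glyph) domain: U+4F60, 你
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, text.encode(enc))               # the bytes (machine) domain differs per encoding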
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
How can I change b'\\u4f60' into b'\x4f\x60'?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60' #note, unicode-escaped
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph. \x4f\x60 is, if we represent it in ascii (or utf-8, actually), the letter O (\x4f) followed by backtick.
I can ask Python to turn that unicode-escaped bytes sequence into a valid string with the corresponding Unicode glyph:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can I change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that unicode codepoint. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60' - utf-16be.
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: b'O`'
The reason this works is that UTF-16 is a variable-length encoding. For code points below U+10000 it directly uses the code point as the 2-byte encoding, and for code points above that it uses something called "surrogate pairs", which I won't get into.
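A minimal sketch of that difference (my own addition): a BMP code point like U+4F60 is stored directly in 2 bytes, while a code point above U+FFFF (here U+1F600) is stored as a 4-byte surrogate pair.
for ch in ("\u4f60", "\U0001F600"):
    data = ch.encode("utf-16-be")
    print("U+%04X -> %s (%d bytes)" % (ord(ch), data.hex(), len(data)))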

Related

Can lowercasing a UTF-8 string cause it to grow?

I'm helping someone write some code to compare UTF-8 strings in a case-insensitive way. The scheme they are using is to uppercase the strings and then compare. The input strings can all fit in a 255-byte array. The output string similarly must fit in a 255-byte array.
I'm not a UTF-8 or Unicode expert, but I think this scheme can't work for all strings. My understanding is that either lowercasing or uppercasing a UTF-8 string can result in the output string being longer (byte-array-wise), and as such changing case is probably not the best way to attack this problem. I'm trying to demonstrate the difficulty by giving a few strings that will not work with this design.
For example, take a string of the character U+0587 repeated 100 times. U+0587 takes two bytes in UTF-8, so the overall length of the byte array for the string is 200 bytes (ignoring the trailing null for now). If that string is uppercased, however, it becomes U+0535 U+0552, and each of those takes two bytes, for a total of 4 bytes. The 200 byte array is now 400 bytes, and cannot be stored in the limited space available.
So here's my question: I gave an example of a lowercase character needing more space to store when uppercased. Are there any examples of an uppercase character needing more space to store when lowercased? The locale is always en_US.UTF-8 in this case.
Thanks for any help.
Yes. Examples from my environment:
U+023A Ⱥ U+023E Ⱦ
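A minimal Python sketch (my addition, not part of the original answer) that measures the UTF-8 byte lengths for the questioner's U+0587 case and for the two examples above:
for ch in ("\u0587", "\u023A", "\u023E"):
    up, low = ch.upper(), ch.lower()
    print(ch, len(ch.encode("utf-8")), "bytes; upper", up, len(up.encode("utf-8")),
          "bytes; lower", low, len(low.encode("utf-8")), "bytes")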
There are several related factors that could cause variation. You already pointed out one:
The locale you specify will affect the casing of whichever characters that locale is concerned with.
The version of the Unicode Common Locale Data Repository that your library uses.
The version of the Unicode Character Database that your library uses.
These aren't fixed targets because we can expect that there will be future versions and that there will be users using characters from them.
Ultimately, this comes down to your environment and to the practical purpose the comparison serves.

What exactly does "encoding-independent" mean?

While reading the Strings and Characters chapter of the official Swift document I found the following sentence
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question: What exactly does "encoding-independent" mean?
From my reading of Advanced Swift by Chris and other experience, what this sentence is trying to convey is twofold.
First, what are various unicode representations:
UTF-8 : compatible with ASCII
UTF-16
UTF-32
The number on the right-hand side is the size, in bits, of one code unit in that encoding.
A character may need one or more code units: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses a single 32-bit code unit.
So a Chinese character that fits in one UTF-32 code unit might not fit in one UTF-16 code unit, and a code point that needs the full 32-bit range will have a count of 4 bytes in UTF-8.
Then comes the storing part. When you store a character in the String, it doesn't matter how you want to read it later.
For example:
Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations
This means you can compose a String any way you like, and this won't affect its representation when you read it back in the various Unicode encoding forms such as UTF-8, UTF-16 or UTF-32.
This becomes clear with an example (see the sketch below): when I load a Japanese character that takes up 24 bits to store, the same character is displayed irrespective of my choice of encoding.
However, the count value will differ. There are other concepts to consider, like the code units and code points that make up these Strings.
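A rough illustration of that point (sketched here in Python rather than Swift, and not taken from the original answer): the same character round-trips through any of the encodings, while the byte and code-unit counts differ.
ch = "\u3042"                                  # a Japanese character: 3 bytes (24 bits) in UTF-8
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    data = ch.encode(enc)
    print(enc, len(data), data.decode(enc))    # different byte counts, same character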
For Unicode Encoding variants
I would highly recommend reading this article, which goes much deeper into the String API in Swift.
Detail View of String API in swift

How to flip text horizontally?

I need to write a function that will flip all the characters of a string left-to-right.
e.g.:
Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.
should become
.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT
I can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).
Naive solution
A naive solution might try to flip all the things (e.g. word-for-word, where a word is 16 bits; I would have said byte-for-byte if we could assume that a byte was 16 bits. I could also say character-for-character, where character is the data type Char, which represents a single code point):
String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
foreach (Char c in original)   // iterate the original string's UTF-16 code units
{
    flipped = c + flipped;     // prepend each code unit to build the reversed string
}
Results in the incorrectly flipped text:
ɗỉf̴ḟếr̆ęnͥt
̨tͥnę̆rếḟ̴fỉɗ
This is because one "character" takes multiple "code points".
ɗỉf̴ḟếr̆ęnͥt
ɗ ỉ f ˜ ḟ ế r ˘ ę n i t ˛
and flipping each "code point" gives:
˛ t i n ę ˘ r ế ḟ ˜ f ỉ ɗ
Which not only is not a valid UTF-16 encoding, it's not the same characters.
Failure
The problem happens in UTF-16 encoding when there is:
combining diacritics
characters in another (supplementary) plane, i.e. outside the Basic Multilingual Plane
Those same issues happen in UTF-8 encoding, with the additional case
any character outside the 0..127 ASCII range
I can limit myself to the simpler UTF-16 encoding (since that's the encoding used by the languages I'm working in, e.g. C#, Delphi).
The problem, it seems to me, is discovering if a number of subsequent code points are combining characters, and need to come along with the base glyph.
It's also fun to watch an online text reverser site fail to take this into account.
Note:
any solution should assume that we don't have access to a UTF-32 encoding library (mainly because I don't have access to any UTF-32 encoding library)
access to a UTF-32 encoding library would solve the UTF-8/UTF-16 supplementary-plane problem, but not the combining diacritics problem
The term you're looking for is “grapheme cluster”, as defined in Unicode TR29 Cluster Boundaries.
Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.
You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example in ICU you might use a CharacterIterator (which is misleadingly named as it works on grapheme clusters, not ‘characters’ as Unicode knows it).
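For instance, a minimal Python sketch of the library route, assuming the third-party regex module is available (its \X pattern matches one extended grapheme cluster):
import regex                                  # third-party module: pip install regex

def flip(text):
    clusters = regex.findall(r"\X", text)     # split into extended grapheme clusters
    return "".join(reversed(clusters))        # reverse the cluster order, not the code points

print(flip("Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog."))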
If you work in UTF-32, you solve the non-base-plane issue. Converting from UTF-8 or UTF-16 to UTF-32 (and back) is relatively simple bit twiddling (see Wikipedia). You don't have to have a library for it.
Most of the combining characters are in a few ranges. You could determine those ranges by scanning the Unicode database (see Unicode.org). Hardcode those ranges into your application. With that, you can determine the groups of codepoints that represent a single character. (The drawback is that new combining marks could be introduced in the future, and you'd need to update your table.)
Segment appropriately, reverse the order (segment by segment), and convert back to UTF-8 or UTF-16 (or whatever you want).
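A minimal sketch of that hand-rolled approach (my own illustration, in Python): instead of hardcoded ranges it asks the standard unicodedata module for the combining class, which covers combining diacritics but not every kind of grapheme cluster.
import unicodedata

def flip_combining(text):
    segments = []
    for ch in text:
        if segments and unicodedata.combining(ch):
            segments[-1] += ch                # keep the combining mark attached to its base
        else:
            segments.append(ch)               # start a new segment at each base character
    return "".join(reversed(segments))

print(flip_combining("ɗỉf̴ḟếr̆ęnͥt"))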
Text Mechanic's Text Generator seems to do this in JavaScript. I'm sure it would be possible to translate the JS into another language after obtaining the author's consent (if you can find a 'contact' link for that site).

Does a string's length equal the byte size?

Exactly that: does a string's length equal its byte size? Does it depend on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need to know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding. It could be 2 or 4 bytes per character, or even a mix of 1, 2 and 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, surrogate pairs require two UTF-16 code units for a full Unicode character in the range U+10000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
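A minimal sketch of that point (using Python purely as an illustration): the character count stays the same while the byte size depends entirely on the encoding you choose.
s = "naïve ☃"
print(len(s))                                 # character (code point) count: always the same
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(s.encode(enc)))            # byte size: depends entirely on the encoding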
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) use two bytes per code unit, so most characters take two bytes and some take four.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will be inefficient for large n, O(n) instead of constant time, as it is with one byte per character.
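A minimal sketch of that indexing point (in Python, as an illustration): indexing the decoded string gives the nth character, while indexing the raw bytes gives the nth byte, and the two diverge as soon as a character needs more than one byte.
s = "naïve"
b = s.encode("utf-8")
print(len(s), len(b))    # 5 characters, 6 bytes
print(s[3], b[3])        # 'v' versus 175, a trailing byte of the two-byte 'ï'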

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).
Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8
A character encoding consists of a sequence of codes that each look up a symbol from a given character set. Please see this good article on Wikipedia on character encoding.
UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte scheme works:
The most significant bit of a single-byte character is always 0.
The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: these bits are 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
A UTF-8 stream contains neither the byte FE nor FF, which makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (the byte-order mark).
The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
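Those leading-bit rules are easy to see by dumping the bytes; a minimal Python sketch (my own illustration, not from the original answer):
for ch in ("A", "é", "你", "😀"):              # 1-, 2-, 3- and 4-byte examples
    print(ch, ["{0:08b}".format(b) for b in ch.encode("utf-8")])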
UTF-16 (an extension of the older fixed-width UCS-2)
Uses 2 or 4 bytes for each symbol.
UTF-32 (UCS-4)
Always uses 4 bytes for each symbol.
char just means a byte of data and is not an actual encoding. It is not analogous to UTF8/UTF16/ascii. A char* pointer can refer to any type of data and any encoding.
STL:
Neither the STL's std::string nor std::wstring is designed for variable-length character encodings like UTF-8 and UTF-16.
How to implement:
Take a look at the iconv library. iconv is a powerful character encoding conversion library used by such projects as libxml (XML C parser of Gnome)
Other great resources on character encoding:
tbray.org's Characters vs. Bytes
IANA character sets
www.cs.tut.fi's A tutorial on code issues
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by Dylan Beattie)
Received wisdom suggests that Spolsky's article misses a couple of important points.
This article is recommended as being more complete:
The Unicode® Standard: A Technical Introduction
This article is also a good introduction: Unicode Basics
The latter in particular gives an overview of the character encoding forms and schemes for Unicode.
The various UTF standards are ways to encode "code points". A code point is an index into the Unicode character set.
Another encoding is UCS-2, which is always 16 bits and thus doesn't support the full Unicode range.
It is also good to know that one code point isn't necessarily equal to one character. For example, a character such as å can be represented either as a single code point or as two code points, one for the a and one for the combining ring.
Comparing two unicode strings thus requires normalization to get the canonical representation before comparison.
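A minimal sketch of that å example (my illustration, in Python), using the standard unicodedata module:
import unicodedata

precomposed = "\u00E5"                         # å as a single code point
decomposed = "a\u030A"                         # 'a' followed by COMBINING RING ABOVE

print(precomposed == decomposed)                                 # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True after normalization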
There is also the issue of fonts. There are two ways to handle fonts: either you use a gigantic font with glyphs for all the Unicode characters you need (I think recent versions of Windows come with one or two such fonts), or you use some library capable of combining glyphs from various fonts dedicated to subsets of the Unicode standard.
