Is an ASCII-only Unicode string always normalized?

Imagine a string consisting of the single ASCII character i (U+0069). In Turkish and similar writing systems, ı (U+0131) exists as well. Can Unicode normalization split U+0069 (i) into U+0131 U+0307 (ı̇)? Is it locale-dependent, so that it might vary by environment?

The normalization forms defined by Unicode are not locale-specific; they take no input other than the sequence of code points to be normalized.
The Unicode website has a user-friendly chart of all characters which differ between the standardized normalization forms.
Unfortunately, it is grouped by script, not by block, so we can't quickly check all the characters in the "Basic Latin" block (which matches the 128 characters of ASCII).
Searching for "0069" specifically, we see that it appears as the result of normalising certain code points - either as part of a "decomposition" in NFD, or as a compatibility replacement in forms NFKC and NFKD. However, it doesn't appear in the input column, because it doesn't change when converted to any of the normalization forms.
I have not checked the other Basic Latin characters, but I would be extremely surprised if any of them normalized to anything other than themselves. So to answer your original question: yes, I believe a string that only uses code points U+0000 to U+007F (the code points inherited from the 7-bit ASCII standard) will not change in any of the normalization forms defined by Unicode.
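If you want to convince yourself for a concrete string, the check is trivial; here is a minimal Perl sketch using the core Unicode::Normalize module (any language with a normalization API would do the same, and the sample string is just an arbitrary ASCII-only value):
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

# A string restricted to code points U+0000..U+007F.
my $ascii = q{The quick brown fox jumps over the lazy dog 0123456789 !"#$%&'()*};

for my $form (\&NFC, \&NFD, \&NFKC, \&NFKD) {
    die "normalization changed the string\n" unless $form->($ascii) eq $ascii;
}
print "invariant under NFC, NFD, NFKC and NFKD\n";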

Related

How to use Unicode::Normalize to create most compatible windows-1252 encoded string?

I have a legacy app in Perl processing XML that is most likely encoded in UTF-8, and which needs to store some data from that XML in a database that uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need them anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl's encoding of the Unicode string to windows-1252 fail with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü), which maps perfectly fine to windows-1252. That led to other problems of course, e.g. in the case of the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1/2, U+0031 U+2044 U+0032) for that and Perl dies again:
"\x{2044}" does not map to cp1252
According to normalization rules, this is perfectly fine for NFKC. I used that because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters provide a normalization compatible with windows-1252 in that case.
This approach becomes even more problematic for characters that do have a normalization compatible with windows-1252, just not through NFC. One example is LATIN SMALL LIGATURE FI (fi, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while using NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding as windows-1252 as is; if that fails I use NFC and try again, if that fails I use NFKC and try again, and if that fails I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in a string at the same time. There's always one character left that results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks, and when.
BUT the important point is that each character by itself could be normalized to something compatible with windows-1252. It only seems that there's no one-shot solution.
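To make the three cases concrete, here is a quick check (a Perl sketch using Unicode::Normalize; it only prints what NFC and NFKC produce for each of the characters above):
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my @examples = (
    [ "u\x{0308}" => 'u + COMBINING DIAERESIS'  ],
    [ "\x{00BD}"  => 'VULGAR FRACTION ONE HALF' ],
    [ "\x{FB01}"  => 'LATIN SMALL LIGATURE FI'  ],
);

for my $e (@examples) {
    my ($s, $label) = @$e;
    printf "%-26s NFC: %-18s NFKC: %s\n", $label,
        join(' ', map { sprintf 'U+%04X', ord } split //, NFC($s)),
        join(' ', map { sprintf 'U+%04X', ord } split //, NFKC($s));
}

# u + COMBINING DIAERESIS    NFC: U+00FC             NFKC: U+00FC
# VULGAR FRACTION ONE HALF   NFC: U+00BD             NFKC: U+0031 U+2044 U+0032
# LATIN SMALL LIGATURE FI    NFC: U+FB01             NFKC: U+0066 U+0069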
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
It sounds like I would need to process each string Unicode-character by Unicode-character, normalize each individually with whatever is most compatible with windows-1252, and then concatenate the results again. Is there an incremental Unicode-character parser available that already deals with combining characters and the like? Does a simple Unicode-character based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.
I don't think you're missing an API, because a best-effort approach is rather involved. I'd try something like the following:
Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
If the codepoint can be converted to Windows-1252, do so.
Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
Otherwise ignore the character.
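Put together, a sketch of those steps in Perl might look like this (the function names to_cp1252 and try_cp1252 are mine, chosen for illustration; Encode and Unicode::Normalize are core modules):
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFC NFD NFKC);

# Returns the windows-1252 bytes for $s, or undef if some code point won't map.
sub try_cp1252 {
    my ($s) = @_;   # copy, so any in-place fiddling by Encode doesn't matter
    return eval { Encode::encode('cp1252', $s, Encode::FB_CROAK) };
}

# Best-effort conversion following the steps above; a sketch, not production code.
sub to_cp1252 {
    my ($text) = @_;
    my $out = '';
    # 1. Compose decomposed sequences (u + combining diaeresis becomes ü),
    # 2. then walk the remaining code points, dropping stray combining marks (\PM).
    for my $cp (NFC($text) =~ /\PM/g) {
        # 3. Keep the code point if windows-1252 represents it directly.
        if (defined(my $bytes = try_cp1252($cp))) { $out .= $bytes; next; }
        # 4. Fall back to NFKC (handles ligatures such as U+FB01) and recurse.
        my $kc = NFKC($cp);
        if ($kc ne $cp) { $out .= to_cp1252($kc); next; }
        # 5. Last resort: strip diacritics via NFD and keep the base character;
        #    anything still unmappable is silently dropped.
        $out .= try_cp1252(substr(NFD($cp), 0, 1)) // '';
    }
    return $out;
}

print to_cp1252("u\x{0308}ber \x{00BD} \x{FB01}"), "\n";   # über ½ fi, as cp1252 bytes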
There are of course other approaches that convert unsupported characters to ones that look similar, but those require creating mappings manually.
Since it seems that you can convert individual characters as needed (to the cp1252 encoding), one way is to process a word character by character, as you proposed, once that word fails the whole-string procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks and such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand list is in perluniprops.
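A small, self-contained illustration of the \X loop (the sample string is mine and just mixes a decomposed ü, the vulgar fraction ½ and the fi ligature):
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

my $word = "u\x{0308}ber \x{00BD} \x{FB01}";

# Each \X match is one extended grapheme cluster, so "u" + COMBINING DIAERESIS
# comes back as a single logical character.
while ($word =~ /(\X)/g) {
    my $cluster = $1;
    printf "%s  =>  %s\n", $cluster,
        join ' ', map { sprintf 'U+%04X', ord } split //, $cluster;
}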

What exactly does encoding-independent mean?

While reading the Strings and Characters chapter of the official Swift documentation, I found the following sentence:
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question: What exactly does encoding-independent mean?
From my reading of Advanced Swift by Chris and other experience, what this sentence is trying to convey is twofold.
First, what are the various Unicode representations?
UTF-8 : compatible with ASCII
UTF-16
UTF-32
The number on the right-hand side is the size, in bits, of a single code unit in that encoding: UTF-8 uses 8-bit code units while UTF-32 uses 32-bit code units, and a single character may need one or more code units depending on the encoding.
For example, a Chinese character that always fits in one UTF-32 code unit might not fit in a single UTF-16 code unit, and a character outside the Basic Multilingual Plane takes four code units in UTF-8.
Then comes the storing part. When you store a character in a String, it doesn't matter how you intend to read it back later.
Take the documentation's sentence again:
Every string is composed of encoding-independent Unicode characters, and provides support for accessing those characters in various Unicode representations
This means you can compose the String any way you like, and that won't affect what you get back when you read it through the various Unicode representations such as UTF-8, UTF-16 or UTF-32.
For example, when I load a Japanese character that takes up 24 bits to store in UTF-8, the same character is displayed irrespective of my choice of encoding.
Only the count value will differ. There are other concepts, such as code units and code points, that make up these strings.
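Swift aside, the underlying point that one and the same character occupies a different number of code units in each encoding form can be checked in any language. A minimal Perl sketch (語, U+8A9E, is just an arbitrary example character; Encode is a core module):
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{8A9E}";   # one Unicode scalar value (a Japanese character)

printf "UTF-8:  %d code units\n", length(encode('UTF-8',    $char));      # 3 (one byte each)
printf "UTF-16: %d code units\n", length(encode('UTF-16BE', $char)) / 2;  # 1
printf "UTF-32: %d code units\n", length(encode('UTF-32BE', $char)) / 4;  # 1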
For the Unicode encoding variants,
I would highly recommend reading this article, which goes much deeper into the String API in Swift:
Detail View of String API in Swift

What's the difference between a character, a code point, a glyph and a grapheme?

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.
Seeing how these terms get used in documents like Matthias Bynens' JavaScript has a unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.
The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:
Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...
...
Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...
...
Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.
...
Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...
Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.
So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman character (☃, U+2603) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
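To see the grapheme versus code point distinction in running code, here is a small Perl sketch (the choice of ä is arbitrary; \X is Perl's match for one extended grapheme cluster, and Unicode::Normalize's NFC maps the two-code-point spelling to the single-code-point one):
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{00E4}";    # LATIN SMALL LETTER A WITH DIAERESIS
my $decomposed  = "a\x{0308}";   # 'a' followed by COMBINING DIAERESIS

my $clusters = () = $decomposed =~ /\X/g;   # count of extended grapheme clusters

print length($precomposed), "\n";   # 1  (code point)
print length($decomposed),  "\n";   # 2  (code points)
print $clusters, "\n";              # 1  (grapheme cluster)
if (NFC($decomposed) eq $precomposed) {
    print "equal after NFC\n";      # both spellings denote the same grapheme
}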
Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.
A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).
Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.
A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.
A Reply to Mark Amery
First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?
Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: a grapheme is the smallest distinct component in a written text.
One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable" would be better stated in the context of Unicode as "characters are composable".
Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)
There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.
A Reply to T S
Chapter 1 of the
standard states: "The Unicode character encoding treats alphabetic characters,
ideographic characters, and symbols equivalently, which means they can be used
in any mixture and with equal facility". Given this statement, we should be
prepared for some conflation of terms in the standard. Sometimes the proper
terminology only becomes clear in retrospect as a standard develops.
It often happens in formal definitions of a language that two fundamental
things are defined in terms of each other. For example, in
XML an element is defined as a starting tag
possibly followed by content, followed by an ending tag. Content is defined in
turn as either an element, character data, or a few other possible things. A
pattern of self-referential definitions is also implicit in the Unicode
standard:
A grapheme is a code point or a character.
A character is composed from a sequence of one or more graphemes.
When first confronted with these two definitions the reader might object to the
first definition on the grounds that a code point is a character, but
that's not always true. A sequence of two code points sometimes encodes a
single code point under
normalization, and that
encoded code point represents the character, as illustrated in
figure 2.7: sequences of code points that encode other code points. This is
getting a little tricky, and we haven't even reached the layer where character
encoding schemes such as UTF-8 are used to encode code points into byte
sequences.
In some contexts, for example a scholarly article on
diacritics, an individual
part of a character might show up in the text by itself. In that context, the
individual character part could be considered a character, so it makes sense
that the Unicode standard remain flexible as well.
As Mark Amery pointed out, a character can be composed into a more complex
thing. That is, each character can serve as a grapheme if desired. The
final result of all composition is a thing that "the user thinks of as a
character". There doesn't seem to be any real resistance, either in the
standard or in this discussion, to the idea that at the highest level there are
these things in the text that the user thinks of as individual characters. To
avoid overloading that term, we can use "grapheme" in all cases where we want
to refer to parts used to compose a character.
At times the Unicode standard is all over the place with its terminology. For
example, Chapter 3
defines UTF-8 as an "encoding form" whereas the glossary defines "encoding
form" as something else, and UTF-8 as a "Character Encoding Scheme". Another
example is "Grapheme_Base" and "Grapheme_Extend", which are
acknowledged to be
mistakes but that persist because purging them is a bit of a task. There is
still work to be done to tighten up the terminology employed by the standard.
The Proposal for addition of COMBINING GRAPHEME
JOINER got it
wrong when it stated that "Graphemes are sequences of one or more encoded
characters that correspond to what users think of as characters." It should
instead read, "A sequence of one or more graphemes composes what the user
thinks of as a character." Then it could use the term "grapheme sequence"
distinctly from the term "character sequence". Both terms are useful.
"grapheme sequence" neatly implies the process of building up a character from
smaller pieces. "character sequence" means what we all typically intuit it to
mean: "A sequence of things the user thinks of as characters."
Sometimes a programmer really does want to operate at the level of grapheme
sequences, so mechanisms to inspect and manipulate those sequences should be
available, but generally, when processing text, it is sufficient to operate on
"character sequences" (what the user thinks of as a character) and let the
system manage the lower-level details.
In every case covered so far in this discussion, it's cleaner to use "grapheme"
to refer to the indivisible components and "character" to refer to the composed
entity. This usage also better reflects the long-established meanings of both
terms.

Is there a Unicode range that is a copy of the first 128 characters?

I would like to be able to put quotation marks and other characters into a text without them being interpreted by the computer. So I was wondering: is there a range that is defined as mapping to the same glyphs etc. as the range 0-0x7F (the ASCII range)?
Please note I state that the range 0-0x7F is the same as ASCII, so the question is not what range maps to ASCII.
I am asking whether there is another range that also maps to the same glyphs, i.e. when rendered it will look the same, but when interpreted it may be seen as a different code.
So I can write:
print "hello "world""
where the characters in bold (the inner quotes) avoid the 0-0x7F (ASCII) range.
Additional:
I meant homographic and behaviourally identical, everything the same except for a different code point. I was hoping for the whole 128-character ASCII set, directly mapped (a single offset added to them all).
The reason: to avoid interpretation by any language that uses some of the ASCII characters as part of its syntax but allows any Unicode character in literal strings, e.g. (when UTF-8 encoded) C, HTML, CSS, …
I was trying to retrofit the idea of "no reserved words" / "word colours" (string literals one colour, keywords another, variables another, numbers another, etc.) so that a string literal or variable name (though not in this case) can contain any character.
I interpret the question to mean "is there a set of code points which are homographic with the low 7-bit ASCII set". The answer is no.
There are some code points which are conventionally rendered homographically (e.g. Cyrillic uppercase А U+0410 looks identical to ASCII 65 in many fonts, and quite similar in most fonts which support this code point) but they are different code points with different semantics. Similarly, there are some code points which basically render identically, but have a specific set of semantics, like the non-breaking space U+00A0 which renders identically to ASCII 32 but is specified as having a particular line-breaking property; or the RIGHT SINGLE QUOTATION MARK U+2019 which is an unambiguous quotation mark, as opposed to its twin ASCII 39, the "apostrophe".
But in summary, there are many symbols in the basic ASCII block which do not coincide with a homograph in another code block. You might be able to find homographs or near-homographs for your sample sentence, though; I would investigate the IPA phonetic symbols and the Greek and Cyrillic blocks.
The answer to the question asked is “No”, as #tripleee described, but the following note might be relevant if the purpose is trickery or fun of some kind:
The printable ASCII characters excluding the space have been duplicated at U+FF01 to U+FF5E, but these are fullwidth characters intended for use in CJK texts. Their shape is (and is meant to be) different: ｈｅｌｌｏ　ｗｏｒｌｄ. (Your browser may be unable to render them.) So they are not really homographic with ASCII characters but could be used for some special purposes. (I have no idea of what the purpose might be here.)
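If the fullwidth forms happen to be acceptable, the mapping is a single fixed offset; a minimal Perl sketch (the to_fullwidth name is mine, and U+0020 is mapped to IDEOGRAPHIC SPACE U+3000, since the space has no counterpart in the U+FF01..U+FF5E run):
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Map printable ASCII (except space) to the fullwidth forms at U+FF01..U+FF5E
# by adding the fixed offset 0xFEE0; map the space to IDEOGRAPHIC SPACE U+3000.
sub to_fullwidth {
    my ($s) = @_;
    $s =~ s/([\x21-\x7E])/chr(ord($1) + 0xFEE0)/ge;
    $s =~ s/\x20/\x{3000}/g;
    return $s;
}

print to_fullwidth(q{print "hello world"}), "\n";
# ｐｒｉｎｔ　＂ｈｅｌｌｏ　ｗｏｒｌｄ＂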
It depends on which Unicode encoding form you use.
In UTF-8, the first 128 characters are encoded with exactly the same byte values they have in ASCII (one byte each). In UTF-16, those same 128 characters are encoded as code units 0x0000 to 0x007F (two bytes each).
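A quick way to see both statements (a Perl sketch using the core Encode module; the output is the encoded bytes in hex):
use strict;
use warnings;
use Encode qw(encode);

# 'A' (U+0041) keeps its ASCII byte value in UTF-8, but takes a two-byte code unit in UTF-16.
printf "UTF-8:    %s\n", unpack('H*', encode('UTF-8',    'A'));   # 41
printf "UTF-16BE: %s\n", unpack('H*', encode('UTF-16BE', 'A'));   # 0041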

How to flip text horizontally?

I need to write a function that will flip all the characters of a string left-to-right.
e.g.:
Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.
should become
.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT
I can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).
Naive solution
A naive solution might try to flip all the things (e.g. word-for-word, where a word is 16 bits - I would have said byte-for-byte if we could assume that a byte was 16 bits. I could also say character-for-character, where character is the data type Char, which represents a single code point):
String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
foreach (Char c in original)
{
flipped = c + flipped;
}
Results in the incorrectly flipped text:
ɗỉf̴ḟếr̆ęnͥt
̨tͥnę̆rếḟ̴fỉɗ
This is because one "character" takes multiple "code points".
ɗỉf̴ḟếr̆ęnͥt
ɗ ỉ f ˜ ḟ ế r ˘ ę n i t ˛
and flipping each "code point" gives:
˛ t i n ę ˘ r ế ḟ ˜ f ỉ ɗ
Not only is this not a valid UTF-16 encoding, it's also not the same characters.
Failure
The problem happens in UTF-16 encoding when there is:
combining diacritics
characters outside the Basic Multilingual Plane
Those same issues happen in UTF-8 encoding, with the additional case
any character outside the 0..127 ASCII range
I can limit myself to the simpler UTF-16 encoding (since that's the encoding that the languages I'm using have, e.g. C#, Delphi).
The problem, it seems to me, is discovering if a number of subsequent code points are combining characters, and need to come along with the base glyph.
It's also fun to watch an online text reverser site fail to take this into account.
Note:
any solution should assume that I don't have access to a UTF-32 encoding library (mainly because I don't have access to any UTF-32 encoding library)
access to a UTF-32 encoding library would solve the UTF-8/UTF-16 supplementary-plane problem, but not the combining diacritics problem
The term you're looking for is “grapheme cluster”, as defined in Unicode TR29 Cluster Boundaries.
Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.
You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example in ICU you might use a CharacterIterator (which is misleadingly named as it works on grapheme clusters, not ‘characters’ as Unicode knows it).
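For what it's worth, in a language whose regex engine already exposes extended grapheme clusters the whole recipe collapses to a couple of lines. A Perl sketch (not C# or Delphi, but it shows the shape of the solution; \X is Perl's grapheme cluster match):
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Reverse by extended grapheme clusters, so combining marks stay attached
# to their base characters.
sub flip {
    my ($s) = @_;
    return join '', reverse($s =~ /\X/g);
}

print flip("Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog."), "\n";
# .goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT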
If you work in UTF-32, you solve the non-base-plane issue. Converting from UTF-8 or UTF-16 to UTF-32 (and back) is relatively simple bit twiddling (see Wikipedia). You don't have to have a library for it.
Most of the combining characters are in a few ranges. You could determine those ranges by scanning the Unicode database (see Unicode.org). Hardcode those ranges into your application. With that, you can determine the groups of codepoints that represent a single character. (The drawback is that new combining marks could be introduced in the future, and you'd need to update your table.)
Segment appropriately, reverse the order (segment by segment), and convert back to UTF-8 or UTF-16 (or whatever you want).
Text Mechanic's Text Generator seems to do this in JavaScript. I'm sure it would be possible to translate the JS into another language after obtaining the author's consent (if you can find a 'contact' link for that site).

Resources