How to use Unicode::Normalize to create most compatible windows-1252 encoded string? - string

I have a legacy app in Perl that processes XML, most likely encoded in UTF-8, and needs to store some data from that XML in a database which uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need to anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl break the existing encoding of the Unicode string to windows-1252 with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü), which maps to windows-1252 without problems. That led to another problem of course, e.g. with the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1/2, U+0031 U+2044 U+0032) for that and Perl dies again:
"\x{2044}" does not map to cp1252
According to normalization rules, this is perfectly fine for NFKC. I used that because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters provide a normalization compatible with windows-1252 in that case.
This approach becomes even more problematic for characters that do have a normalization compatible with windows-1252, just not the one produced by NFC. One example is LATIN SMALL LIGATURE FI (fi, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding as windows-1252 as is; if that fails I apply NFC and try again; if that fails I apply NFKC and try again; and if that fails I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in a string at the same time. There's always one character then which results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks when.
BUT the important point is that each character by itself could be normalized to something being compatible with windows-1252. It only seems that there's no one-shot-solution.
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
Sounds like I would need to process each string Unicode character by Unicode character, normalize each one individually with whatever is most compatible with windows-1252 and then concatenate the results again. Is there some incremental Unicode-character parser available which already deals with combining characters and such? Does a simple Unicode-character based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.

I don't think you're missing an API because a best-effort approach is rather involved. I'd try something like the following:
Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
If the codepoint can be converted to Windows-1252, do so.
Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
Otherwise ignore the character.
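The steps above can be sketched as follows. This is a minimal illustration in Python rather than Perl, using only the standard `unicodedata` module and Python's `cp1252` codec (which may differ from Perl's Encode in a few undefined byte positions). The function names `to_cp1252_best_effort` and `_convert_char` are made up for this sketch.

```python
import unicodedata


def _convert_char(ch: str) -> bytes:
    """Convert a single non-mark code point, following the fallback chain."""
    try:
        return ch.encode("cp1252")            # direct mapping exists
    except UnicodeEncodeError:
        pass
    nfkc = unicodedata.normalize("NFKC", ch)
    if nfkc != ch:
        # Compatibility mapping differs (e.g. ligatures): start over on it.
        return to_cp1252_best_effort(nfkc)
    # Bonus step: keep only the base letter of the canonical decomposition.
    base = unicodedata.normalize("NFD", ch)[0]
    try:
        return base.encode("cp1252")
    except UnicodeEncodeError:
        return b""                            # give up: ignore the character


def to_cp1252_best_effort(text: str) -> bytes:
    out = bytearray()
    for ch in unicodedata.normalize("NFC", text):
        # Drop combining marks that NFC could not compose away.
        if unicodedata.category(ch).startswith("M"):
            continue
        out += _convert_char(ch)
    return bytes(out)


sample = "u\u0308 \u00bd \ufb01 \u011c"       # u + diaeresis, one half, fi ligature, G circumflex
encoded = to_cp1252_best_effort(sample)
```

With the three characters from the question plus Ĝ in one string, `encoded` comes out as the cp1252 bytes for "ü ½ fi G". Note that characters whose compatibility mapping contains something unmappable (such as the FRACTION SLASH in ⅓) silently lose that part, which is the price of the "ignore" step.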
There are of course other approaches that convert unsupported characters to ones that look similar, but they require creating the mappings manually.

Since it seems that you can convert individual characters as needed (to cp-1252 encoding), one way is to process character by character, as proposed, once a word fails the procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks or such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand, list is in perluniprops.
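As a rough illustration of the per-character idea, here is a sketch in Python rather than Perl. Python's standard `re` module has no `\X`, so the sketch approximates a logical character as one code point plus any combining marks that follow it, which is closer to the hand-built pattern mentioned above than to full extended grapheme clusters. The names `clusters`, `encode_cluster` and `encode_per_cluster` are invented for the example, and the as-is / NFC / NFKC order is the one described in the question.

```python
import unicodedata


def clusters(text):
    """Yield each base code point together with the combining marks after it."""
    current = ""
    for ch in text:
        if current and unicodedata.category(ch).startswith("M"):
            current += ch                     # mark attaches to the current base
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def encode_cluster(cluster, encoding="cp1252"):
    """Try the cluster as is, then NFC, then NFKC; raise if none encodes."""
    for form in (None, "NFC", "NFKC"):
        candidate = cluster if form is None else unicodedata.normalize(form, cluster)
        try:
            return candidate.encode(encoding)
        except UnicodeEncodeError:
            continue
    raise ValueError(f"no {encoding} representation for {cluster!r}")


def encode_per_cluster(text, encoding="cp1252"):
    return b"".join(encode_cluster(c, encoding) for c in clusters(text))


mixed = encode_per_cluster("u\u0308\u00bd\ufb01")
```

Because each logical character picks its own normalization form, the three problem characters from the question can coexist in one string.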

Related

Is ASCII-only Unicode string always normalized?

Imagine a string consisting of the single ASCII character i (U+0069). Turkish and related writing systems also have ı (U+0131). Can Unicode normalization split U+0069 (i) into U+0131 U+0307 (ı̇)? Is it locale-dependent, so that it might vary with the environment?
The normalization forms defined by Unicode are not locale-specific; they have no input other than the sequence of code points to be normalized.
The Unicode website has a user-friendly chart of all characters which differ between the standardized normalization forms.
Unfortunately, it is grouped by script, not by block, so we can't quickly check all the characters in the "Basic Latin" block (which matches the 128 characters of ASCII).
Searching for "0069" specifically, we see that it appears as the result of normalising certain code points - either as part of a "decomposition" in NFD, or as a compatibility replacement in forms NFKC and NFKD. However, it doesn't appear in the input column, because it doesn't change when converted to any of the normalization forms.
I have not checked the other Basic Latin characters, but would be extremely surprised if any of them normalize to anything other than themselves. So to answer your original question: yes, I believe a string that only uses code points U+0000 to U+007F (the 128 code points inherited from the 7-bit ASCII standard) will not change in any of the normalization forms defined by Unicode.
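The remaining Basic Latin characters can be checked mechanically instead of by reading the chart. The following is a small sketch in Python using the standard `unicodedata` module; the helper name `changed_by_normalization` is made up here, and the result reflects whatever Unicode version the Python build ships with.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def changed_by_normalization(chars):
    """Map each normalization form to the characters it would alter."""
    return {
        form: [c for c in chars if unicodedata.normalize(form, c) != c]
        for form in FORMS
    }


ascii_chars = [chr(cp) for cp in range(0x80)]          # U+0000 .. U+007F
ascii_changes = changed_by_normalization(ascii_chars)

# For contrast: the dotted capital I used in Turkish does decompose,
# but into ASCII I plus COMBINING DOT ABOVE, never the other way round.
dotted_capital_i = unicodedata.normalize("NFD", "\u0130")
```

Every list in `ascii_changes` is empty, and no locale is passed anywhere, which matches the point that normalization depends only on the code points.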

Case-insensitive string comparison in Julia

I'm sure this has a simple answer, but how does one compare two strings and ignore case in Julia? I've hacked together a rather inelegant solution:
function case_insensitive_match(a::S, b::S) where {S<:AbstractString}
    lowercase(a) == lowercase(b)
end
There must be a better way!
Efficiency Issues
The method that you have selected will indeed work well in most settings. If you are looking for something more efficient, you're not apt to find it. The reason is that capital vs. lowercase letters are stored with different bit encoding. Thus it isn't as if there is just some capitalization field of a character object that you can ignore when comparing characters in strings. Fortunately, the difference in bits between capital vs. lowercase is very small, and thus the conversions are simple and efficient. See this SO post for background on this:
How do uppercase and lowercase letters differ by only one bit?
Accuracy Issues
In most settings, the method that you have will work accurately. But if you encounter characters such as capital vs. lowercase Greek letters, it could fail. For that, you would be better off with the normalize function (see docs for details) with the casefold option:
normalize("ad", casefold=true)
See this SO post in the context of Python which addresses the pertinent issues here and thus need not be repeated:
How do I do a case-insensitive string comparison?
Since it's talking about the underlying issues with utf encoding, it is applicable to Julia as well as Python.
See also this Julia Github discussion for additional background and specific examples of places where lowercase() can fail:
https://github.com/JuliaLang/julia/issues/7848
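To make the difference between lowercasing and case folding concrete, here is a sketch in Python (not Julia), since the linked discussion is about Python and the underlying Unicode rules are the same. The helper `caseless_equal` is a name invented for this example; it applies case folding between two NFD normalizations, which is one common way to get a canonical caseless comparison.

```python
import unicodedata


def caseless_key(s: str) -> str:
    """Normalize, case-fold, and normalize again so equivalent strings agree."""
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFD", folded)


def caseless_equal(a: str, b: str) -> bool:
    return caseless_key(a) == caseless_key(b)


# Plain lowercasing keeps the sharp s, so these two compare as different.
lower_matches = "Straße".lower() == "STRASSE".lower()

# Case folding maps the sharp s to "ss" and final sigma to sigma.
fold_matches = caseless_equal("Straße", "STRASSE")
sigma_matches = caseless_equal("\u03c2", "\u03a3")
```

Here `lower_matches` is false while `fold_matches` and `sigma_matches` are true.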

How does punycode distinguish similar IRIs?

I've been looking into internationalised resource identifiers and there's one thing bugging me.
My understanding is that, for each label in a domain name (xyzzy.plugh.com has three labels, xyzzy, plugh and com), the following process is performed to translate it into ASCII representation so that it can be processed okay by all legacy software:
If it consists solely of ASCII characters, it's copied as is.
Otherwise:
First we output xn-- followed by all the ASCII characters (skipping non-ASCII).
Then, if the final character isn't -, we output - to separate the ASCII from non-ASCII.
Finally, we encode each of the non-ASCII characters using punycode so that they appear to be ASCII.
My question then is: how do we distinguish between the following two Unicode URIs?
http://aa☃.net/
http://☃aa.net/
It seems to me that both of these will encode to:
http://xn--aa-nfh.net/
simply because the sequencing information has been lost for the label as a whole.
Or am I missing something in the specification?
According to one punycode encoder, they are encoded differently:
aa☃.net -> xn--aa-gsx.net
☃aa.net -> xn--aa-esx.net
Note that the two results differ only in the first character after the final hyphen (g versus e).
The relevant RFC 3492 details why this is the case. First, it provides clues in the introduction:
Uniqueness: There is at most one basic string that represents a given extended string.
Reversibility: Any extended string mapped to a basic string can be recovered from that basic string.
That means there must be differentiable one-to-one mapping for every single basic/extended string pair.
Understanding how it differentiates the two possibilities requires an understanding of how the decoder (the thing that turns the basic string back into an extended one, with all its Unicode glory) works.
The decoder begins with just the basic characters of the label, aa, and a pointer to the first a, then applies a series of deltas, such as gsx or esx.
The delta actually encodes two things. The first is the number of non-insertions to be done and the second is the actual insertion.
So, gsx (the delta in aa☃.net) would encode two non-insertions (to skip the aa) followed by an insertion of ☃. The esx delta (for ☃aa.net) would encode zero non-insertions followed by an insertion of ☃.
That is how position is encoded into the basic strings.
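The encodings and the position arithmetic can be checked with a short Python sketch. Python ships a `punycode` codec, which is used for the round trip; the functions `decode_delta` and `first_insertion` are hand-written for illustration and only handle the first delta of a label (initial bias 72, as in RFC 3492), not the bias adaptation needed for later ones.

```python
def decode_delta(digits: str, bias: int = 72) -> int:
    """Decode one generalized variable-length integer (RFC 3492, base 36)."""
    value, weight, k = 0, 1, 36
    for ch in digits.lower():
        # a-z stand for 0..25, 0-9 stand for 26..35
        digit = ord(ch) - ord("a") if ch.isalpha() else ord(ch) - ord("0") + 26
        value += digit * weight
        threshold = min(max(k - bias, 1), 26)
        if digit < threshold:
            break                              # last digit of this integer
        weight *= 36 - threshold
        k += 36
    return value


def first_insertion(basic: str, digits: str):
    """Return (character, position) described by the first delta of a label."""
    delta = decode_delta(digits)
    # The delta counts steps through (code point, position) states; with
    # len(basic) + 1 possible positions, divmod separates the two parts.
    offset, position = divmod(delta, len(basic) + 1)
    return chr(128 + offset), position


encoded_end = "aa\u2603".encode("punycode")    # snowman after "aa"
encoded_start = "\u2603aa".encode("punycode")  # snowman before "aa"
insert_end = first_insertion("aa", "gsx")
insert_start = first_insertion("aa", "esx")
```

Both deltas name the same code point, U+2603, but `gsx` places it at position 2 and `esx` at position 0, so the sequencing information survives.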

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text"? What defines "internal processing" and why is it best left to NFD? And finally, never minding what is "best," are the two forms interchangeable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the general common-sense form that you should use; ä is one code point there, and that makes sense.
NFD is good for certain internal processing - if you want to make accent-insensitive searches or sorting, having your string in NFD makes it much easier and faster. Another usage is making more robust slug titles. These are just the most obvious ones, I am sure there are plenty of more uses.
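As a small illustration of why NFD helps with these two uses, here is a sketch in Python with the standard `unicodedata` and `re` modules. The helper names `strip_marks`, `accent_insensitive_contains` and `slugify` are made up for the example, and dropping every nonspacing mark is a simplification that is not linguistically appropriate for all languages.

```python
import re
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop the nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def accent_insensitive_contains(haystack: str, needle: str) -> bool:
    return strip_marks(needle).casefold() in strip_marks(haystack).casefold()


def slugify(title: str) -> str:
    """Build an ASCII-ish slug: strip accents, lowercase, hyphenate the rest."""
    plain = strip_marks(title).lower()
    return re.sub(r"[^a-z0-9]+", "-", plain).strip("-")


found = accent_insensitive_contains("Ärger im Büro", "buro")
slug = slugify("Crème Brûlée!")
```

In NFD the accents are separate code points, so removing them is a simple filter; in NFC the same operation would need a lookup table per precomposed letter.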
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?
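That property can be checked directly. The sketch below uses Python's standard `unicodedata` module; `canonically_equivalent` is a name chosen for the example. It also shows the ano teleia case mentioned earlier and a pair that is only compatibility-equivalent.

```python
import unicodedata


def canonically_equivalent(x: str, y: str) -> bool:
    """Two strings are canonical equivalents if their NFD forms are equal."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


precomposed = "\u00e4"        # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"        # a + COMBINING DIAERESIS

same_nfc = unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)
same_nfd = canonically_equivalent(precomposed, decomposed)

# GREEK ANO TELEIA is replaced by MIDDLE DOT under NFC.
ano_teleia_nfc = unicodedata.normalize("NFC", "\u0387")

# The fi ligature is only compatibility-equivalent to "fi", so NFC/NFD keep it.
ligature_equivalent = canonically_equivalent("\ufb01", "fi")
```

So for comparison purposes either form works, as long as both sides use the same one; the forms differ in what the stored text looks like afterwards.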

How to flip text horizontally?

i need to write a function that will flip all the characters of a string left-to-right.
e.g.:
Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.
should become
.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT
i can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).
Naive solution
A naive solution might try to flip all the things (e.g. word-for-word, where a word is 16-bits - i would have said byte for byte if we could assume that a byte was 16-bits. i could also say character-for-character where character is the data type Char which represents a single code-point):
String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
// Prepend each UTF-16 code unit, reversing the string unit by unit.
foreach (Char c in original)
{
    flipped = c + flipped;
}
Results in the incorrectly flipped text:
ɗỉf̴ḟếr̆ęnͥt
̨tͥnę̆rếḟ̴fỉɗ
This is because one "character" takes multiple "code points".
ɗỉf̴ḟếr̆ęnͥt
ɗ ỉ f ˜ ḟ ế r ˘ ę n i t ˛
and flipping each "code point" gives:
˛ t i n ę ˘ r ế ḟ ˜ f ỉ ɗ
Which not only is not a valid UTF-16 encoding, it's not the same characters.
Failure
The problem happens in UTF-16 encoding when there is:
combining diacritics
characters in another lingual plane
Those same issues happen in UTF-8 encoding, with the additional case
any character outside the 0..127 ASCII range
i can limit myself to the simpler UTF-16 encoding (since that's the encoding that the language that i'm using has, e.g. C#, Delphi)
The problem, it seems to me, is discovering if a number of subsequent code points are combining characters, and need to come along with the base glyph.
It's also fun to watch an online text reverser site fail to take this into account.
Note:
any solution should assume that i don't have access to a UTF-32 encoding library (mainly because i don't have access to any UTF-32 encoding library)
access to a UTF-32 encoding library would solve the UTF-8/UTF-16 lingual planes problem, but not the combining diacritics problem
The term you're looking for is “grapheme cluster”, as defined in Unicode TR29 Cluster Boundaries.
Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.
You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example in ICU you might use a character BreakIterator (created with BreakIterator::createCharacterInstance), which is misleadingly named as it works on grapheme clusters, not ‘characters’ as Unicode knows them.
If you work in UTF-32, you solve the non-base-plane issue. Converting from UTF-8 or UTF-16 to UTF-32 (and back) is relatively simple bit twiddling (see Wikipedia). You don't have to have a library for it.
Most of the combining characters are in a few ranges. You could determine those ranges by scanning the Unicode database (see Unicode.org). Hardcode those ranges into your application. With that, you can determine the groups of codepoints that represent a single character. (The drawback is that new combining marks could be introduced in the future, and you'd need to update your table.)
Segment appropriately, reverse the order (segment by segment), and convert back to UTF-8 or UTF-16 (or whatever you want).
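The segment-then-reverse pipeline can be sketched as follows, in Python rather than C# or Delphi. To mirror the UTF-16 situation, the input is a list of 16-bit code units. Instead of a hardcoded range table or the full Grapheme_Cluster_Break rules, the sketch asks Python's `unicodedata` for the general category and treats every mark (Mn, Mc, Me) as belonging to the preceding base, which is an approximation. All function names are invented for the example.

```python
import unicodedata


def utf16_units(text: str):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(units):
    """Combine surrogate pairs into single code points."""
    points, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            points.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            points.append(u)
            i += 1
    return points


def flip(units) -> str:
    """Group code points into base-plus-marks segments and reverse the segments."""
    segments = []
    for cp in code_points(units):
        ch = chr(cp)
        if segments and unicodedata.category(ch).startswith("M"):
            segments[-1] += ch                 # keep the mark with its base
        else:
            segments.append(ch)
    return "".join(reversed(segments))


flipped = flip(utf16_units("a\u0301b\U0001F600"))
```

The accent stays on the a and the surrogate pair for U+1F600 stays intact, so `flipped` is the emoji, then b, then á. Sequences joined by ZERO WIDTH JOINER, Hangul jamo and similar cases need the real TR29 rules.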
Text Mechanic's Text Generator seems to do this in JavaScript. I'm sure it would be possible to translate the JS into another language after obtaining the author's consent (if you can find a 'contact' link for that site).
