Why doesn't SameText work? - string

Why does
if SameText(ListBox1.Items[i],Edit1.Text)=true then
not work? The comparison behaves as if it were case-sensitive (it fails when the strings differ in case), but it should not. The strings are Unicode. It works if the strings have the same case.
Thanks!

According to SysUtils.pas (Delphi-XE), SameText "has the same 8-bit limitations as CompareText", and in CompareText "the compare operation is based on the 8-bit ordinal value of each character, after converting 'a'..'z' to 'A'..'Z', and is not affected by the current user locale."
So it seems that you are trying to compare some characters that are outside the 8 bit range.
Edit: you should try AnsiSameText.
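To illustrate the limitation quoted above, here is a minimal Python sketch that models a comparison which only converts 'a'..'z' to 'A'..'Z'. It is not Delphi's actual implementation; the helper names `ascii_only_upper` and `same_text_ascii` are made up for this example.

```python
def ascii_only_upper(s):
    # Convert only 'a'..'z' to 'A'..'Z'; every other character is left untouched.
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def same_text_ascii(a, b):
    # Hypothetical model of a case-insensitive compare limited to ASCII letters.
    return ascii_only_upper(a) == ascii_only_upper(b)


print(same_text_ascii("Hello", "HELLO"))      # ASCII letters differ only in case
print(same_text_ascii("привет", "ПРИВЕТ"))    # Cyrillic letters are not converted
print("привет".upper() == "ПРИВЕТ")           # a Unicode-aware conversion handles them
```

Under this model, strings that differ only in the case of non-ASCII letters never compare equal, which matches the behaviour described in the question.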

Related

Lexicographical order of numbers

I'm currently learning about lexicographical sorting, but I can't find much about how it applies to numbers. The example I found is based on What is lexicographical order?
In the example, it is said that
1 10 2
are in lexicographical order. The answer stated that "10 comes after 2 in numerical order but 10 comes before 2 in alphabetical order". I would like to know what "10 comes before 2 in alphabetical order" really means. Is 10 represented as a character in ASCII or something? I'm really confused.
Would it be something like this in Python:
ord(10)
Yes, lexicographic implies textual. I would fault the typography. When discussing a text string, that is usually made clear by using the literal text string syntax (for some programming language). "10" comes before "2".
There is no text but encoded text.
So that implies a character encoding of a character set. A character set is a mapping between a character and a codepoint (integer). An encoding maps between a codepoint and a sequence of code units for that encoding. A code unit is an integer of a fixed size. When an integer of a fixed size is stored as a sequence of bytes, it has a byte order (unless the size is 1).
Lexicographic could refer to ordering by the sequence of:
codepoint values
code unit values
byte values
For some character sets and encodings, these orders would all be the same. For some of those, the values would all be the same.
(Not sure why you would mention ASCII. You are almost certainly not using a programming environment that uses ASCII natively. You should look that up for your environment to avoid ASCII-splaining. Python 3.)
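A short Python 3 sketch may make this concrete. It assumes nothing beyond the standard library; the two sample characters (U+FF5E and U+1F600) are chosen only for illustration, because they sort differently by code point and by UTF-16 code unit.

```python
words = ["2", "10", "1"]

# Strings are compared code point by code point, so "10" sorts before "2".
as_text = sorted(words)
# Converting to integers gives the numerical order instead.
as_numbers = sorted(words, key=int)
print(as_text, as_numbers)

a = "\uFF5E"        # U+FF5E, one UTF-16 code unit (0xFF5E)
b = "\U0001F600"    # U+1F600, two UTF-16 code units (0xD83D 0xDE00)

by_code_point = sorted([b, a])
by_utf8_bytes = sorted([b, a], key=lambda s: s.encode("utf-8"))
# Big-endian bytes compare in the same order as the 16-bit code units.
by_utf16_units = sorted([b, a], key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)    # these two orders agree here
print(by_code_point == by_utf16_units)   # this one differs for the chosen pair
```

For plain digits and ASCII letters all three orders coincide, which is why the distinction rarely shows up in simple examples.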

How to use Unicode::Normalize to create most compatible windows-1252 encoded string?

I have a legacy app in Perl that processes XML, most likely encoded in UTF-8, and needs to store some of that XML's data in a database which uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need to anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl break the existing encoding of the Unicode string to windows-1252 with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü), which maps to windows-1252 perfectly well. That led to another problem, of course, e.g. with the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1⁄2, U+0031 U+2044 U+0032) for it and Perl dies again:
"\x{2044}" does not map to cp1252
According to normalization rules, this is perfectly fine for NFKC. I used that because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters provide a normalization compatible with windows-1252 in that case.
This approach becomes even more problematic for characters that do have a normalization compatible with windows-1252, just not under NFC. One example is LATIN SMALL LIGATURE FI (ﬁ, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding as windows-1252 as is; if that fails I use NFC and try again; if that fails I use NFKC and try again; and if that fails I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in a string at the same time. There's always one character then which results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks when.
BUT the important point is that each character by itself could be normalized to something being compatible with windows-1252. It only seems that there's no one-shot-solution.
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
Sounds like I would need to process each string Unicode character by Unicode character, normalize each one individually with whatever is most compatible with windows-1252, and then concatenate the results again. Is there some incremental Unicode-character parser available which already deals with combining characters and such? Does a simple Unicode-character-based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.
I don't think you're missing an API because a best-effort approach is rather involved. I'd try something like the following:
1. Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
2. Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
   - If the codepoint can be converted to Windows-1252, do so.
   - Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
   - As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
   - Otherwise ignore the character.
There are of course other approaches that convert unsupported characters to ones that look similar, but they require creating mappings manually.
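The best-effort steps described in this answer can be sketched as follows. This is a Python translation for illustration only (the question is about Perl), it uses Python's `unicodedata` module and `cp1252` codec, and the function name `to_cp1252` is made up for the example.

```python
import unicodedata


def to_cp1252(text: str) -> bytes:
    """Best-effort conversion of `text` to Windows-1252 bytes (a sketch)."""
    out = bytearray()
    # NFC first, so decomposed sequences like u + U+0308 are combined.
    for ch in unicodedata.normalize("NFC", text):
        # Drop combining marks that survived NFC.
        if unicodedata.category(ch).startswith("M"):
            continue
        try:
            out += ch.encode("cp1252")
            continue
        except UnicodeEncodeError:
            pass
        # Try the compatibility mapping; recurse if it changes the character.
        compat = unicodedata.normalize("NFKC", ch)
        if compat != ch:
            out += to_cp1252(compat)
            continue
        # Bonus step: keep only the base character of the decomposition.
        base = unicodedata.normalize("NFD", ch)[0]
        try:
            out += base.encode("cp1252")
        except UnicodeEncodeError:
            pass  # no reasonable mapping found: ignore the character
    return bytes(out)


# All the problem characters from the question in one string.
print(to_cp1252("u\u0308 \u00bd \ufb01 \u011c"))
```

Whether silently dropping unmappable characters is acceptable depends on the application; raising an error or substituting "?" would be equally easy variations.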
Since it seems that you can convert individual characters as needed (to cp-1252 encoding), one way is to process character by character, as proposed, once a word fails the procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks or such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand, list is in perluniprops.
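As a rough illustration of grouping a base character with its combining marks, here is a Python sketch. It is only an approximation: Python's standard `re` module has no `\X`, and this grouping by general category is much simpler than real extended grapheme clusters. The function name `clusters` is an assumption made for the example.

```python
import unicodedata


def clusters(text):
    """Group each code point with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch      # attach the mark to the preceding group
        else:
            groups.append(ch)     # start a new group
    return groups


# 'u' followed by COMBINING DIAERESIS stays together as one logical character.
print(clusters("u\u0308ber"))
```

Each returned group could then be passed to whatever per-character conversion procedure already works.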

Can lowercasing a UTF-8 string cause it to grow?

I'm helping out with someone writing some code to compare UTF-8 strings in a case-insensitive way. The scheme they are using is to uppercase the strings and then compare. The input strings can all fit in a 255 byte array. The output string similarly must fit in a 255 byte array.
I'm not a UTF-8 or Unicode expert, but I think this scheme can't work for all strings. My understanding is that either lowercasing or uppercasing a UTF-8 string can make the output string longer (in bytes), so changing case is probably not the best way to attack this problem. I'm trying to demonstrate the difficulty by giving a few strings that will not work with this design.
For example, take a string of the character U+0587 repeated 100 times. U+0587 takes two bytes in UTF-8, so the overall length of the byte array for the string is 200 bytes (ignoring the trailing null for now). If that string is uppercased, however, it becomes U+0535 U+0552, and each of those takes two bytes, for a total of 4 bytes. The 200 byte array is now 400 bytes, and cannot be stored in the limited space available.
So here's my question: I gave an example of a lowercase character needing more space to store when uppercased. Are there any examples of an uppercase character needing more space to store when lowercased? The locale is always en_US.UTF-8 in this case.
Thanks for any help.
Yes. Examples from my environment:
U+023A Ⱥ U+023E Ⱦ
There are several related factors that could cause variation. You already pointed out one:
The locale that you specify will affect the casing of characters that that locale is concerned with.
The version of the Unicode Common Locale Data Repository that your library uses.
The version of the Unicode Character Database that your library uses.
These aren't fixed targets because we can expect that there will be future versions and that there will be users using characters from them.
Ultimately, this comes down to your environment and to the practical purpose this has.
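The examples from the question and this answer can be checked with a few lines of Python. This is a sketch that relies on the case mappings built into the Python interpreter, so, as noted above, results may depend on the Unicode version it ships with; `utf8_len` is a helper defined only for this example.

```python
def utf8_len(s):
    # Number of bytes the string occupies when encoded as UTF-8.
    return len(s.encode("utf-8"))


# Uppercasing: U+0587 becomes the two-character sequence U+0535 U+0552.
armenian = "\u0587"
print(utf8_len(armenian), utf8_len(armenian.upper()))

# Lowercasing: U+023A and U+023E map to characters that need more bytes.
for ch in ("\u023A", "\u023E"):
    low = ch.lower()
    print(f"U+{ord(ch):04X} -> U+{ord(low):04X}", utf8_len(ch), utf8_len(low))
```

A string of 100 such characters would therefore no longer fit in a buffer sized for the original.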

Is it dangerous to use special (0-31) ASCII character in a string?

I am building a string compressor and for simplicity reasons, I wanted to use some non-printable characters.
1) Is it in some way "bad" to use the 0-31 ASCII characters?
2) Can these characters occur in a normal text string?
If the answer is "partially":
3) Which of them are better to use in this case? I think I will need at most 9 of them.
Well the answer is that it depends on how you're using it. If you're treating the "string" as binary, then binary by definition can have any value. However if it is meant to be read/printed, it could cause serious problems to use characters 0-31.
It isn't too big a deal for the most part, except that 0 means "end of string" on many platforms. Though again, it depends entirely on how you're using it. My advice would be to avoid character 0 at the very least. If you want the user to be able to copy and paste the string, then none of these would be suitable. They must be printable characters, in other words.
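One way to act on this advice is to pick marker characters that avoid 0 and are verified not to occur in the input. The Python sketch below is an assumption-laden illustration: the function name `free_control_chars` is made up, and skipping tab, line feed and carriage return is an extra precaution because those commonly appear in ordinary text.

```python
def free_control_chars(text, needed=9):
    """Return `needed` control characters (codes 1-31) that do not occur in `text`."""
    used = set(text)
    candidates = [
        chr(code)
        for code in range(1, 32)          # start at 1: never use NUL
        if chr(code) not in "\t\n\r"      # skip whitespace controls common in text
        and chr(code) not in used         # skip anything already in the input
    ]
    if len(candidates) < needed:
        raise ValueError("not enough unused control characters")
    return candidates[:needed]


markers = free_control_chars("some\ttext with \x01 inside")
print([ord(m) for m in markers])
```

The resulting output is still only safe to treat as binary data, not as text meant to be displayed or pasted.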

Case-insensitive string comparison in Julia

I'm sure this has a simple answer, but how does one compare two strings and ignore case in Julia? I've hacked together a rather inelegant solution:
function case_insensitive_match(a::S, b::S) where {S<:AbstractString}
    lowercase(a) == lowercase(b)
end
There must be a better way!
Efficiency Issues
The method that you have selected will indeed work well in most settings. If you are looking for something more efficient, you're not apt to find it. The reason is that capital vs. lowercase letters are stored with different bit encoding. Thus it isn't as if there is just some capitalization field of a character object that you can ignore when comparing characters in strings. Fortunately, the difference in bits between capital vs. lowercase is very small, and thus the conversions are simple and efficient. See this SO post for background on this:
How do uppercase and lowercase letters differ by only one bit?
Accuracy Issues
In most settings, the method that you have will work accurately. But, if you encounter characters such as capital vs. lowercase Greek letters, it could fail. For that, you would be better off with the Unicode.normalize function from the Unicode standard library (see docs for details) with the casefold option:
using Unicode
Unicode.normalize("ad", casefold=true)
See this SO post in the context of Python which addresses the pertinent issues here and thus need not be repeated:
How do I do a case-insensitive string comparison?
Since it discusses the underlying issues with Unicode case folding, it is applicable to Julia as well as Python.
See also this Julia Github discussion for additional background and specific examples of places where lowercase() can fail:
https://github.com/JuliaLang/julia/issues/7848
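For comparison with the Python discussion referenced above, here is a minimal Python sketch of a case-folding comparison. It is not Julia code, the function name `caseless_equal` is made up, and normalizing to NFC after folding is a simplification of the full Unicode caseless-matching rules.

```python
import unicodedata


def caseless_equal(a, b):
    # Case-fold first, then normalize so composed and decomposed forms agree.
    def fold(s):
        return unicodedata.normalize("NFC", s.casefold())
    return fold(a) == fold(b)


# lower() keeps the sharp s, so a plain lowercase comparison fails here.
print("Straße".lower() == "STRASSE".lower())
print(caseless_equal("Straße", "STRASSE"))

# Final and non-final Greek sigma fold to the same character.
print(caseless_equal("\u03c2", "\u03a3"))
```

The same idea applies in Julia: fold and normalize both strings, then compare the results.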

Resources