String.equalsIgnoreCase - UpperCase v. LowerCase - string

I was browsing through the openjdk and noticed a weird code path in String.equalsIgnoreCase, specifically the method regionMatches:
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
I understand the comment about adjusting for a specific alphabet to check the lower case equality, but was wondering why even have the upper case check? Why not just do all lower case?

Now that the question is re-opened, I transfer my answer here.
The short answer to "Why do they not just compare only lowercase instead of both upper and lower case, if it matches more cases than uppercase?": It does not match more character pairs, it merely matches different pairs.
Comparing only uppercase is not enough, e.g. the ASCII letter "I" and the capital I with dot "İ" ((char)304, used in Turkish alphabet) have different uppercase (they are already uppercase), but they have the same lowercase letter "i". (Note that the Turkish language considers i with dot and i without dot as different letters, not just an accented letter, similar to German with its Umlauts ä/ö/ü vs. a/o/u.)
Comparing only lowercase is not enough, e.g. the ASCII letter "i" and the small dotless i "ı" ((char)305). They have different lowercase (they are already lowercase), but they have the same uppercase letter "I".
And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") match, but the lowercase of their uppercase is the same ("I"). I found another case if this phenomenon, in the greek letters "ϴ" and "ϑ" (char 1012 and 977).
So a true case insensitive comparison can not even check uppercases and lowercases of the original characters, but must check the lowercases of the uppercases.

Related

Python: lower() method generates wrong letter in a string

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem about string. I have a sentence like a text. When I check a specific word in this sentence-splitted list, I can find "ÇEKİM" word (prints yes). However, while I make search by lowering sentence, I can not find in the list because it changes "i" letter. What is the reason of it (encoding/decoding) ? Why "lower()" method changes string in addition to lowering ? Btw, it is a turkish word. Upper:ÇEKİM - Lower:çekim
Turkish i and English i are treated differently. Capitalized Turkish i is İ, while capitalized English i is I. To differentiate Unicode has rules for converting to lower and upper case. Lowercase Turkish i has a combining mark. Also, converting the lower case version to upper case leaves the characters in a decomposed form, so proper comparison needs to normalize the string to a standard form. You can't compare a decomposed form to a composed form. Note the differences in the strings below:
#coding:utf8
import unicodedata as ud
def dump_names(s):
print('string:',s)
for c in s:
print(f'U+{ord(c):04X} {ud.name(c)}')
turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC',turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals also have display issues. My system displays '' with the dot over the m, not the i. For example, on the Chrome browser, below displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But on one of my editors it displays as:
So it appears something like this is what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
print("yes-2")

The correct way to identify a regular expression of the sort [variableName].add(

I'm looking for a clean way to identify occurrences of [variableName] followed by the exact string .add(.
A variable name is a string which contains one or more characters from a-z, A-Z, 0-9 and an underscore.
One more thing is that it cannot start with any of the characters from 0-9, but I don't mind ignoring this condition because there are no such cases in the text that I need to parse anyway.
I've been following several tutorials, but the farthest I got was finding all occurrences of what I've referred to above as "variableName":
import re
txt = "The _rain() in+ Spain5"
x = re.split("[^a-zA-Z0-9_]+", txt)
print(x)
What is the right way to do it?
You may use
re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
The regex matches:
\w+ - 1+ word chars (due to re.ASCII, it only matches [A-Za-z0-9_] chars)
(?=\.add\() - a positive lookahead that matches a location immediately followed with .add( substring.

Prolog DCG Building/Recognizing Word Strings from Alphanumeric Characters

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In all almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as the one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }, that is to say in the first case a letter followed by zero or more alphanumeric characters, or i
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason for the string to be split into lists of words rather than lists of letters is for recognizing keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string or a string consisting of a letter followed by alphanumeric characters.
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can used predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
[L],
{ letter(L) },
identifier_rest.
identifier_rest --> [].
identifier_rest -->
[I],
{ letter_or_digit(I) },
identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
Further, you can of course introduce an argument to relate such lists to atoms.

Checking if all letters in a string (from any major spoken language) are upper-casee

I simply want to check if all the letters that occur in a string are upper-case (if they have lower- and upper-case variants). Tcl's built-in procs don't behave quite as desired, e.g.,
string is upper "123A"
returns false, but I would want it to return true. I would also want it to return true if the A were replaced with, say, an upper-case Cyrillic letter, or a letter from another popular alphabet that doesn't have a case. I could simply filter out all non-letters from the string, but that's not so simple I think when you're trying to handle letters from languages other than just English.
In this case, you don't want string is upper as that checks if the string is just upper case letters. (Numbers aren't letters.)
Instead, you want to do:
set str "123A"
if {$str eq [string toupper $str]} {
# It's upper-case by your definition...
}

Can I use setlocale() and isalpha() to determine if character belongs to alphabet of current locale?

Is it possible to do setlocale(LC_CTYPE, "ru_RU.utf8") and for each symbol of string "рус eng" do isaplha() check and to get as result following:
р alpha
у alpha
с alpha
not alpha
e not alpha
n not alpha
g not alpha
now when I am setting locale ru_RU.utf8 all symbols except space symbol are alpha
The isalpha function asks the question:
The isalpha() function shall test whether c is a character of class alpha in the program's current locale.
and goes on to note:
The c argument is an int, the value of which the application shall ensure is representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.
Which means that it only works for ascii characters.
The test is pretty much is the character in the ranges [A-Z] or [a-z], nothing more.
Noe if you want to test characters outside of this range, then you need to use one of the wide character variants such as iswalpha.
What it looks like you're asking is if you can perform a test that will reject characters that are not explicit cyrillic letters? That's not going to work with the iswalpha() test because it assumes all alpha characters from pretty much all character sets are alpha characters - if you read the locale definition of ru_RU (glibc source localedata/locales/ru_RU), which uses the i18n file as it's data source for character types determines what is considered an alpha.
If the input data is truly only from the russian alphabet, then you can check if the character is non-ascii and if that is the case then accept it as a valid character; unfortunately there is a good chance that some characters that are typed e.g. е (i.e. CYRILLIC SMALL LETTER IE Unicode: U+0435, UTF-8: D0 B5) will be entered using the latin character e (i.e. LATIN SMALL LETTER E Unicode: U+0065, UTF-8: 65) and so would be missed by this test.
if you want to test for those cyrillic characters explicitly, then you need to test for the character ranges:
% CYRILLIC/
<U0400>..<U042F>;<U0460>..(2)..<U047E>;/
<U0480>;<U048A>..(2)..<U04BE>;<U04C0>;<U04C1>..(2)..<U04CD>;/
<U04D0>..(2)..<U04FE>;/
% CYRILLIC SUPPLEMENT/
<U0500>..(2)..<U0522>;/
% CYRILLIC SUPPLEMENT 2/
<UA640>..(2)..<UA65E>;<UA662>..(2)..<UA66C>;<UA680>..(2)..<UA696>;/
% CYRILLIC/
<U0430>..<U045F>;<U0461>..(2)..<U047F>;/
<U0481>;<U048B>..(2)..<U04BF>;<U04C2>..(2)..<U04CE>;/
<U04CF>;/
<U04D1>..(2)..<U0523>;/
% CYRILLIC SUPPLEMENT 2/
<UA641>..(2)..<UA65F>;<UA663>..(2)..<UA66D>;<UA681>..(2)..<UA697>;/

Resources