Python: lower() method generates wrong letter in a string - python-3.x

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem about string. I have a sentence like a text. When I check a specific word in this sentence-splitted list, I can find "ÇEKİM" word (prints yes). However, while I make search by lowering sentence, I can not find in the list because it changes "i" letter. What is the reason of it (encoding/decoding) ? Why "lower()" method changes string in addition to lowering ? Btw, it is a turkish word. Upper:ÇEKİM - Lower:çekim

Turkish i and English i are treated differently. Capitalized Turkish i is İ, while capitalized English i is I. To differentiate Unicode has rules for converting to lower and upper case. Lowercase Turkish i has a combining mark. Also, converting the lower case version to upper case leaves the characters in a decomposed form, so proper comparison needs to normalize the string to a standard form. You can't compare a decomposed form to a composed form. Note the differences in the strings below:
#coding:utf8
import unicodedata as ud
def dump_names(s):
print('string:',s)
for c in s:
print(f'U+{ord(c):04X} {ud.name(c)}')
turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC',turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals also have display issues. My system displays '' with the dot over the m, not the i. For example, on the Chrome browser, below displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But on one of my editors it displays as:
So it appears something like this is what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
print("yes-2")

Related

How to find arabic character all occurrences with its dialect?

I am trying to find the occurence of arabic character with its harakat in string such as "رَّ" in "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ".
Arabic characters can take harakat for example "ر" is the original arabic character but can have harakat so it can look something like this "رَّ"> I am using Python 3 to find the character occurence with a specific harakat but could not do that. I have tried for loop and tried converting the string to unicode but could not do that.
str = "مرة رجل حكيم قال بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
i=0
for s in str:
if s == "رَّ":
i = i + 1
print(i)
Expected output is 2 but 0 is what I get.
len("رَّ") returns 3, which means the glyph is represented by three characters. Your loop checks a single character at a time and so never finds a match.
You need to be looking for substrings, which is exactly what .count() is for.
i = str.count('رَّ')

Python3 and combining Diacritics

I've been having a problem with Unicode in python3 and I can't seem to understand why that's happening.
symbol= "ῇ̣"
print(len(symbol))
>>>>2
This letter comes from a word: ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ where I have combining diacritical marks. I want to do the statistical analysis in Python 3 and store the results in a database, the thing is that I also store the character's position (index) in the text. The database-application correctly counts the symbol-variable in the example as one-character, whereas Python counts it as two - throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python3 has unicode as default for strings I'm a bit dumbfounded by this.
I have tried to use the base(), strip(), and strip_length() method from Greek-accentuation: https://pypi.org/project/greek-accentuation/ but that's not helping either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
data = csv.reader(txt)
for row in data:
text = row[1]
### Here I have some string manipulation (lowering everything, replacing the predefined set of strings by equal-length '-',...)
###then I use the ad-module to detect the language by looping over my characters, this is where it goes wrong.
for letter in text:
lang = ad.detect_alphabet(letter)
If I use the word: ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as example with a forloop; my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
... print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?
The string has 2 in length, so this is correct: two code point:
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']
So you should not use len to count the characters.
You could count the characters that are non-combining, so:
>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to python3).
But this is also not the optimal solution, depending on the scope of counting characters. I think in your case it is enough, but fonts could merge characters into ligatures. On some languages, that are visually new (and very different) characters (and not like ligature in western languages).
As last comment: I think you should normalize strings. With above code, in this case it doesn't matter, but in other cases, you may get different results. Especially if someone used combatibility characters (e.g. mu for units, or Eszett, instead of the true Greek characters).

Python 3.5: Is it possible to align punctuation (e.g. £, $) to the left side of a word using regex?

As part of my code, I need to align things like the pound sign to the left of a string. For example my code starts with:
"A price of £ 8 is roughly the same as $ 10.23!"
and needs to end with:
"A price of £8 is roughly the same as $10.23!"
I've created the following function to solve this however I feel that it is very inefficient and was wondering if there was a way to do this with regular expressions in Python?
for i in sentence:
if i == "(" or i == "{" or i == "[" or i == "£" or i == "$":
if i != len(sentence):
corrected_sentence.append(" ")
corrected_sentence.append(i)
else:
corrected_sentence.append(i)
What this is doing right now is going through the 'sentence' list where I have split up all of the words and punctuation and t then reforming this followed by a space EXPECT where the listed characters are used and adding to another list to be made into a single string again.
I only want to do this with the characters I have listed above (so I need to ignore things like full stops or exclamation marks etc).
Thanks!
I'm not sure what you want to do with the brackets, but from the description you can use a regex to find and replace whitespace preceded by the characters (lookbehind) and followed by a digit (lookahead).
>>> print(re.sub(r"(?<=[\{\[£\$])\s+(?=\d)", "", "A price of £ 8 is roughly the same as $ 10.23!"))
A price of £8 is roughly the same as $10.23!

How to isolate non english words separated by spaces in Lua?

I have this string
"Hello there, this is some line-aa."
how to slice it into an array like this?
Hello
there,
this
is
some
line-aa.
this is what I have tried so far
function sliceSpaces(arg)
local list = {}
for k in arg:gmatch("%w+") do
print(k)
table.insert(list, k)
end
return list
end
local sentence = "مرحبا يا اخوتي"
print("sliceSpaces")
print(sliceSpaces(sentence))
this code works for English text, but not for arabic, how can I make it work for arabic too?
Lua strings are sequences of bytes, not Unicode characters. The pattern %w matches alphanumeric characters, but it applies to ASCII only.
Instead, use %S to match a non-whitespace character:
for k in arg:gmatch("%S+") do

String.equalsIgnoreCase - UpperCase v. LowerCase

I was browsing through the openjdk and noticed a weird code path in String.equalsIgnoreCase, specifically the method regionMatches:
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
I understand the comment about adjusting for a specific alphabet to check the lower case equality, but was wondering why even have the upper case check? Why not just do all lower case?
Now that the question is re-opened, I transfer my answer here.
The short answer to "Why do they not just compare only lowercase instead of both upper and lower case, if it matches more cases than uppercase?": It does not match more character pairs, it merely matches different pairs.
Comparing only uppercase is not enough, e.g. the ASCII letter "I" and the capital I with dot "İ" ((char)304, used in Turkish alphabet) have different uppercase (they are already uppercase), but they have the same lowercase letter "i". (Note that the Turkish language considers i with dot and i without dot as different letters, not just an accented letter, similar to German with its Umlauts ä/ö/ü vs. a/o/u.)
Comparing only lowercase is not enough, e.g. the ASCII letter "i" and the small dotless i "ı" ((char)305). They have different lowercase (they are already lowercase), but they have the same uppercase letter "I".
And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") match, but the lowercase of their uppercase is the same ("I"). I found another case if this phenomenon, in the greek letters "ϴ" and "ϑ" (char 1012 and 977).
So a true case insensitive comparison can not even check uppercases and lowercases of the original characters, but must check the lowercases of the uppercases.

Resources