Python3 and combining Diacritics - python-3.x

I've been having a problem with Unicode in Python 3 and I can't seem to understand why it's happening.
symbol= "ῇ̣"
print(len(symbol))
>>>>2
This letter comes from the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, which contains combining diacritical marks. I want to do statistical analysis in Python 3 and store the results in a database, and I also store each character's position (index) in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python 3 uses Unicode strings by default, I'm a bit dumbfounded by this.
I have tried the base(), strip(), and strip_length() methods from the greek-accentuation package (https://pypi.org/project/greek-accentuation/), but they don't help either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector

ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything, replacing the
        ### predefined set of strings by equal-length '-', ...),
        ### then I use the ad module to detect the language by looping over my characters;
        ### this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)
If I use the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example with a for loop, my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
... print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?

The string has length 2, which is correct: it contains two code points:
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']
So you should not use len to count the characters.
You could count the characters that are non-combining, so:
>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to Python 3).
But this is also not the optimal solution; it depends on why you are counting characters. I think in your case it is enough, but fonts can merge characters into ligatures, and in some languages those render as visually new (and very different) characters, not just ligatures as in Western languages.
As a last comment: I think you should normalize the strings. With the above code it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. the micro sign for units, or Eszett, instead of the true Greek characters).
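If you also need positions that line up with what the database counts, one option is to iterate over clusters instead of code points. Below is a minimal sketch (my own, not from any package) that attaches each combining mark to the preceding base character; it only handles base-plus-combining-mark sequences, not full Unicode grapheme clusters, for which the third-party regex module's \X pattern is the more complete tool:
import unicodedata

def graphemes(text):
    # Group each base character with the combining marks that follow it,
    # so a letter plus its dot below comes back as a single item.
    clusters = []
    for ch in unicodedata.normalize('NFC', text):
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch   # attach the mark to the previous base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
print(graphemes(word))
print(len(graphemes(word)))  # counts user-perceived characters
The index of each item in that list is then a cluster position you can store, rather than a raw code-point offset.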

Related

How to substitute a repeating character with the same number of a different character in regex python?

Assume there's a string
"An example striiiiiing with other words"
I need to replace the 'i's with '*'s, like 'str******ng'. The number of '*'s must be the same as the number of 'i's. This replacement should happen only if there are three or more consecutive 'i's; if the number of 'i's is less than 3, there is a different rule for that. I can hard-code it:
import re
text = "An example striiiiing with other words"
out_put = re.sub(re.compile(r'i{3}', re.I), r'*'*3, text)
print(out_put)
# An example str***iing with other words
But number of i could be any number greater than 3. How can we do that using regex?
The i{3} pattern only matches exactly iii anywhere in the string. You need i{3,} to match three or more i's. However, to make it all work, you need to pass a callable as the replacement argument to re.sub; there you can get the length of the matched text and multiply correctly.
Also, it is advisable to declare the regex outside of re.sub, or just use a string pattern since patterns are cached.
Here is the code that fixes the issue:
import re
text = "An example striiiiing with other words"
rx = re.compile(r'i{3,}', re.I)
out_put = rx.sub(lambda x: r'*'*len(x.group()), text)
print(out_put)
# => An example str*****ng with other words

Python: lower() method generates wrong letter in a string

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem with strings. I have a sentence stored as text. When I check for a specific word in the list produced by splitting the sentence, I can find the word "ÇEKİM" (it prints yes). However, when I search after lowercasing the sentence, I cannot find it in the list, because lower() changes the "İ" letter. What is the reason for this (encoding/decoding)? Why does lower() change the string beyond just lowercasing it? By the way, it is a Turkish word. Upper: ÇEKİM, lower: çekim.
Turkish i and English i are treated differently: the capitalized Turkish i is İ, while the capitalized English i is I. To differentiate them, Unicode has rules for converting to lower and upper case. The lowercase form that Python produces for İ contains a combining mark. Also, converting that lowercase version back to upper case leaves the characters in a decomposed form, so a proper comparison needs to normalize the string to a standard form; you can't compare a decomposed form to a composed form. Note the differences in the strings below:
#coding:utf8
import unicodedata as ud

def dump_names(s):
    print('string:', s)
    for c in s:
        print(f'U+{ord(c):04X} {ud.name(c)}')

turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC', turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals also have display issues. My system displays the lowercased string with the dot over the m, not the i. For example, on the Chrome browser, the following displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But on one of my editors the combining dot is drawn over the wrong letter, so it appears something like this is what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
print("yes-2")

How to replace several characters in a string using Julia

I'm essentially trying to solve this problem: http://rosalind.info/problems/revc/
I want to replace all occurrences of A, C, G, T with their complements T, G, C, A; in other words, all A's will be replaced with T's, all C's with G's, and so on.
I had previously used the replace() function to replace all occurrences of 'T' with 'U', and I was hoping that replace would take a list of characters to replace with another list of characters, but I haven't been able to make it work, so it might not have that functionality.
I know I could solve this easily using the BioJulia package and have done so using the following:
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
using Bio.Seq
s = dna"AAAACCCGGT"
t = reverse(complement(s))
println("$t")
But I'd like to not have to rely on the package.
Here's the code I have so far, if someone could steer me in the right direction that'd be great.
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
s = open("nt.txt") # open file containing sequence
t = reverse(s) # reverse the sequence
final = replace(t, r'[ACGT]', '[TGCA]') # this is probably incorrect
# replace characters ACGT with TGCA
println("$final")
It seems that replace doesn't do translations quite like, say, tr in Bash. So instead, here are a couple of approaches using a dictionary mapping (the BioJulia package also appears to make similar use of dictionaries):
compliments = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
Then if str = "AAAACCCGGT", you could use join like this:
julia> join([compliments[c] for c in str])
"TTTTGGGCCA"
Another approach could be to use a function and map:
function translate(c)
    compliments[c]
end
Then:
julia> map(translate, str)
"TTTTGGGCCA"
Strings are iterable objects in Julia; each of these approaches reads one character at a time, c, and passes it to the dictionary to get back the complementary character. A new string is built up from these complementary characters.
Julia's strings are also immutable: you can't swap characters around in place, rather you need to build a new string.

Pulling valid data from bytestring in Python 3

Given the following bytestring, how can I remove any characters matching \xFF, and create a list object from what's left (by splitting on removed areas)?
b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
Desired result:
["~", "pts/5", "/5", "user"]
The above string is just an example - I'd like to remove any \x.. (non-decoded) bytes.
I'm using Python 3.2.3, and would prefer to use standard libraries only.
>>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
>>> import re
>>> re.findall(rb"[^\x00-\x1f\x7f-\xff]+", a)
[b'~', b'pts/5', b'/5', b'user']
The results are still bytes objects. If you want the results to be strings:
>>> [i.decode("ascii") for i in re.findall(rb"[^\x00-\x1f\x7f-\xff]+", a)]
['~', 'pts/5', '/5', 'user']
Explanation:
[^\x00-\x1f\x7f-\xff]+ matches one or more (+) bytes that are not in the set ([^...]) of values between 0 and 31 (\x00-\x1f) or between 127 and 255 (\x7f-\xff).
Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, €, etc.) from strings encoded in an 8-bit codepage like Latin-1, and it will effectively destroy strings encoded in UTF-8 and other Unicode encodings, because those contain byte values in the ranges 0-31 and 127-255 as parts of their character codes.
Of course, you can always manually fine-tune the exact ranges you want to remove according to the example given in this answer.
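Since the question mentions splitting on the removed areas, an equivalent approach (a sketch under the same ASCII-only assumption) is to re.split on runs of the unwanted bytes and drop the empty pieces:
import re

data = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"

# Split on runs of control bytes (0-31) and high bytes (127-255),
# then decode the non-empty pieces.
parts = [p.decode("ascii") for p in re.split(rb"[\x00-\x1f\x7f-\xff]+", data) if p]
print(parts)  # ['~', 'pts/5', '/5', 'user']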

Defining a function name that starts with a number (in Python 3)?

I have tried creating the following function:
def 3utr():
    do_something()
However, I get a SyntaxError. Replacing the "3" by "three" fixes the problem.
My questions are:
Why is it a syntax error?
Is there a way to have a function name start with a number in Python 3?
It is a syntax error because the language specification does not allow identifiers to start with a digit. So it’s not possible to have function names (which are identifiers) that start with digits in Python.
identifier ::= (letter|"_") (letter | digit | "_")*
Python 2 Language Reference
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Python 3 Language Reference
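As a quick check against the same rule, str.isidentifier() reports whether a string would be a valid Python identifier:
print("3utr".isidentifier())   # False: starts with a digit
print("_3utr".isidentifier())  # True
print("utr3".isidentifier())   # True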
One workaround is to use Roman numerals:
>>> def xxiv():
...     print("ok\n")
...
>>> xxiv()
ok
If you really want to be distinctive.
You can add '_' in front of the identifier. For instance:
def _3utr():
Then call the function:
_3utr()
