SQLite Query for a character with a prefix and a suffix - string

Okay, actually I'm writing a program for parsing Japanese/Chinese text, but I'll try to map it to an English example. No, I don't want to use it to create password lists :).
Suppose there is a text without spaces (spaces are not used in most East Asian languages), like:
helloiamwritingproperenglish!
Given a specific character position in the text, like the r in proper:
helloiamwritingproperenglish!
^
so the text can be decomposed into prefix + 'r' + suffix.
Additionally there is a dictionary stored in SQLite containing character combinations (words) like:
sqlite> SELECT writingKey FROM dic_writings;
writingKey
----------
A, Aa, ...
I want to find all words in the dictionary that contain the selected character 'r' together with a (possibly empty) substring of the prefix and suffix, like:
sqlite> FindCandidates('helloiamwritingp','r','operenglish!');
R, Pro, Rope, Prop, Proper
A query to find all words contained in the input text could be:
SELECT * FROM dic_writings WHERE (text LIKE ('%'||writingKey||'%'));
but this approach is not very fast, and I still need to filter for words that actually cover the selected 'r' (simply checking whether a word contains an 'r' is not enough). Does anybody have an idea? Thank you for your time!
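One way to speed this up, sketched below under some assumptions (dictionary entries are at most max_len characters long; case handling is ignored since it does not arise for CJK text; the dic_writings/writingKey names are taken from the question): enumerate every substring of prefix + char + suffix that covers the selected position and look those up directly, so SQLite can use an index on writingKey instead of scanning with LIKE.
import sqlite3

def find_candidates(conn, prefix, ch, suffix, max_len=8):
    # Any matching word must equal a substring of prefix+ch+suffix that
    # covers the selected character, so enumerate those substrings
    # (bounded by max_len) and look them up with an indexed IN query.
    text = prefix + ch + suffix
    pos = len(prefix)  # index of the selected character
    candidates = set()
    for start in range(max(0, pos - max_len + 1), pos + 1):
        for end in range(pos + 1, min(len(text), start + max_len) + 1):
            candidates.add(text[start:end])
    placeholders = ",".join("?" * len(candidates))
    sql = f"SELECT writingKey FROM dic_writings WHERE writingKey IN ({placeholders})"
    return [row[0] for row in conn.execute(sql, tuple(candidates))]
Calling find_candidates(conn, 'helloiamwritingp', 'r', 'operenglish!') then returns only the dictionary entries that actually contain the selected character.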

Related

Python: lower() method generates wrong letter in a string

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
    print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
    print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem with strings. I have a sentence as text. When I check for a specific word in the list produced by splitting this sentence, I can find "ÇEKİM" (it prints yes). However, when I search after lowercasing the sentence, I cannot find the word in the list, because lowering changes the letter "i". What is the reason for this (encoding/decoding)? Why does the lower() method change the string beyond just lowercasing it? By the way, it is a Turkish word: upper ÇEKİM, lower çekim.
Turkish i and English i are treated differently: the capitalized Turkish i is İ, while the capitalized English i is I. To differentiate them, Unicode has rules for converting to lower and upper case. The lowercase form of İ (without Turkish locale tailoring) is an ordinary i followed by a combining dot above. Converting that lowercase version back to upper case leaves the characters in a decomposed form, so a proper comparison needs to normalize the string to a standard form; you can't compare a decomposed form to a composed form. Note the differences in the strings below:
#coding:utf8
import unicodedata as ud

def dump_names(s):
    print('string:', s)
    for c in s:
        print(f'U+{ord(c):04X} {ud.name(c)}')

turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC', turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals and editors also have display issues. My system displays the lowered string with the combining dot over the m, not over the i. In the Chrome browser, for example, the following displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But one of my editors renders it with the combining dot in the wrong place, and something like this appears to be what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
print("yes-2")

Python3 and combining Diacritics

I've been having a problem with Unicode in Python 3 and I can't seem to understand why it's happening.
symbol = "ῇ̣"
print(len(symbol))
# prints 2
This letter comes from the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, where I have combining diacritical marks. I want to do the statistical analysis in Python 3 and store the results in a database; the thing is that I also store the character's position (index) in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python3 has unicode as default for strings I'm a bit dumbfounded by this.
I have tried to use the base(), strip(), and strip_length() methods from greek-accentuation (https://pypi.org/project/greek-accentuation/), but that's not helping either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector

ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        # Here I have some string manipulation (lowering everything, replacing
        # the predefined set of strings with equal-length '-', ...).
        # Then I use the ad module to detect the alphabet by looping over the
        # characters; this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)
If I use the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example with a for loop, my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
... print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?
The string has length 2, and that is correct: it contains two code points:
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']
So you should not use len to count the characters.
You could count the characters that are non-combining, so:
>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to Python 3).
But this is also not necessarily the optimal solution; it depends on what exactly you mean by counting characters. I think in your case it is enough, but fonts can merge characters into ligatures, and in some languages those are visually new (and very different) characters, not mere ligatures as in Western languages.
As a last comment: I think you should normalize the strings. With the above code it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. the micro sign for units, or Eszett, instead of the true Greek characters).
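If you need to iterate over user-perceived characters rather than code points, one possible approach (a sketch that relies on the third-party regex module, since the standard library re has no grapheme support) is to split the string into grapheme clusters with \X:
import regex  # third-party: pip install regex (stdlib 're' has no \X)

word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
# \X matches one extended grapheme cluster: a base character together
# with any combining marks that follow it, so ῇ̣ counts as a single unit.
graphemes = regex.findall(r'\X', word)
print(len(graphemes))
for g in graphemes:
    print(g)
This gives you stable grapheme indices, though they may still differ from what another database or application counts unless both sides agree on the same definition of a character.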

Algorithms for "shortening" strings?

I am looking for elegant ways to "shorten" the (user-provided) names of objects. More precisely:
my users can enter free text (used as the "name" of some object); they can use up to 64 chars (including whitespace, punctuation marks, ...)
in addition to that "long" name, we also have a "reduced" name (exactly 8 characters), required for some legacy interface
Now I am looking for thoughts on how to generate these "reduced" names, based on the 64-char name.
With "elegant" I am wondering about any useful ideas that "might" allow the user to recognize something with value within the shortened string.
Like, if the name is "Production Test Item A5"; then maybe "PTIA5" might (or might not) tell the user something useful.
Apply a substring method to the long version, trim it in case there are any whitespace characters at the end, optionally remove any special characters from the very end (such as dashes), and finally add a dot if you want to mark the abbreviation that way.
Just a quick hack to get you started:
String longVersion = "Aswaghtde-5d";
// Get substring 0..8 characters
String shortVersion = longVersion.substring(0, (longVersion.length() < 8 ? longVersion.length() : 8));
// Remove whitespace characters from end of String
shortVersion = shortVersion.trim();
// Remove any non-characters from end of String
shortVersion = shortVersion.replaceAll("[^a-zA-Z0-9\\s]+$", "");
// Add dot to end
shortVersion = shortVersion.substring(0, (shortVersion.length() < 8 ? shortVersion.length() : shortVersion.length() - 1)) + ".";
System.out.println(shortVersion);
I needed to shorten names to function as column names in a database. Ideally, the names should be recognizable to users. I set up a dictionary of patterns for commonly occurring words with corresponding "abbreviations". This was applied ONLY to those names which were over the limit of 30 characters.
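A rough sketch of that idea in Python (the abbreviation table and the 30-character limit here are purely illustrative; real mappings would come from your own domain vocabulary):
ABBREVIATIONS = {            # illustrative entries only
    "production": "Prod",
    "test": "Tst",
    "item": "Itm",
}

def shorten(name, limit=30):
    # Leave short names untouched; abbreviate known words in long ones,
    # then hard-truncate as a last resort.
    if len(name) <= limit:
        return name
    words = [ABBREVIATIONS.get(w.lower(), w) for w in name.split()]
    return " ".join(words)[:limit].rstrip()

print(shorten("Production Test Item A5 with an extra descriptive suffix"))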

Finding mean of ascii values in a string MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are two blanks indicating spaces (ASCII 32) in each row. I need to find the mean ASCII value in each column without taking the spaces (32) into account. So first I would convert with double(scrap1), but then how do I find the mean without taking the spaces into account?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
You can remove the intermediate spaces in the string with scrap1(scrap1 == ' ') = '';, which replaces every space in the input with an empty string (note that deleting elements flattens the character matrix into a vector). Then you can convert to double and average the result.
Probably you can use a regular expression to find the non-space characters and keep only those, e.g. with '\S':
str = scrap1(:)';                     % flatten the char matrix into a vector
idx = regexp(str, '\S');              % indices of non-whitespace characters
result = mean(double(str(idx)));      % mean ASCII value, spaces ignored
% You can read more about regexp by typing: doc regexp

MATLAB string handling

I want to calculate the frequency of each word in a string. For that I need to turn the string into an array (matrix) of words.
For example take "Hello world, can I ask you on a date?" and turn it into
['Hello' 'world,' 'can' 'I' 'ask' 'you' 'on' 'a' 'date?']
Then I can go over each entry and count every appearance of a particular word.
Is there a way to make an array (matrix) of words in MATLAB, instead of array of just chars?
Here is a slightly simpler regexp:
words = regexp(s,'\w+','match');
\w here matches any character that can appear in a word (letters, digits, and the underscore).
Notice that the final question mark will not be included. Do you actually need it for counting the words?
Regular expressions
s = 'Hello world, can I ask you on a date?'
slist = regexp(s, '[^ ]*', 'match')
yields
slist =
'Hello' 'world,' 'can' 'I' 'ask' 'you' 'on' 'a' 'date?'
Another way to do it is like this:
s = cell(java.lang.String('Hello world, can I ask you on a date?').split('[^\w]+'));
I.e. by creating a Java String object and using its methods to do the work, then converting back to a cell array of strings. Not necessarily the best way to do a job this simple, but Java has a rich library of string handling methods & classes that can come in handy.
MATLAB's ability to switch into Java at the drop of a hat can come in handy sometimes, for example when parsing and writing XML.
