How to sort Latin after local language in Python 3?

There are many situations where the user's language is not written in a Latin script (examples include Greek, Russian, Chinese). In most of these cases sorting is done by
first sorting the special characters and numbers (numbers in the local language, though...),
secondly the words in the local language/script,
and at the end any non-native characters, such as French, English or German "imported" words, in a general Unicode collation.
Or, to be even more specific about the rest: is it possible to select the sort order based on script?
Example 1: Chinese script first, then Latin-Greek-Arabic (or even more...)
Example 2: Greek script first, then Latin-Arabic-Chinese (or even more...)
What is the most effective and Pythonic way to create a sort like any of these? (By "any" I mean either the simple "selected script first" with the rest in Unicode order, or the more complicated "selected script first" followed by a specified order for the remaining scripts.)

Interesting question. Here’s some sample code that classifies strings
according to the writing system of the first character.
import unicodedata

words = ["Japanese",        # English
         "Nihongo",         # Japanese, rōmaji
         "にほんご",          # Japanese, hiragana
         "ニホンゴ",          # Japanese, katakana
         "日本語",            # Japanese, kanji
         "Японский язык",    # Russian
         "जापानी भाषा"        # Hindi (Devanagari)
        ]

def wskey(s):
    """Return a sort key that is a tuple (n, s), where n is an int based
    on the writing system of the first character, and s is the passed
    string. Writing systems not addressed (Devanagari, in this example)
    go at the end."""
    sort_order = {
        # We leave gaps to make later insertions easy
        'CJK'      : 100,
        'HIRAGANA' : 200,
        'KATAKANA' : 200,   # hiragana and katakana at same level
        'CYRILLIC' : 300,
        'LATIN'    : 400
    }
    name = unicodedata.name(s[0], "UNKNOWN")
    first = name.split()[0]
    n = sort_order.get(first, 999999)
    return (n, s)

words.sort(key=wskey)
for s in words:
    print(s)
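For reference, running the snippet above should print the words grouped kanji first, then kana, then Cyrillic, Latin, and Devanagari last (this listing is my own expected output, not part of the original answer):
日本語
にほんご
ニホンゴ
Японский язык
Japanese
Nihongo
जापानी भाषा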
In this example, I am sorting hiragana and katakana (the two Japanese
syllabaries) at the same level, which means pure-katakana strings will
always come after pure-hiragana strings. If we wanted to sort them such
that the same syllable (e.g., に and ニ) sorted together, that would be
trickier.
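One possible sketch (my own suggestion, not part of the original answer) is to fold katakana onto hiragana in the secondary part of the sort key, so the same syllable in either script compares equal. The 0x60 code-point offset between the two Unicode blocks is standard; the helper name KATA_TO_HIRA and the tie-breaking choice are assumptions:
import unicodedata

# Katakana (U+30A1..U+30F6) sits exactly 0x60 above hiragana (U+3041..U+3096),
# so a translation table can fold one block onto the other.
KATA_TO_HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}

def wskey_folded(s):
    """Like wskey, but katakana is folded to hiragana for comparison,
    so にほんご and ニホンゴ end up next to each other."""
    sort_order = {'CJK': 100, 'HIRAGANA': 200, 'KATAKANA': 200,
                  'CYRILLIC': 300, 'LATIN': 400}
    name = unicodedata.name(s[0], "UNKNOWN")
    n = sort_order.get(name.split()[0], 999999)
    # Keep the original string as a final tie-breaker so folded collisions
    # still sort deterministically (hiragana before katakana).
    return (n, s.translate(KATA_TO_HIRA), s)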

Related

Zapier Formatter - Hashtags Words In A String With Exemptions

I am looking to automate my social-media hashtags in Zapier, dependent on the post title.
Input: High School English As A Second Language Teacher
Output: #High #School #English #Second #Language #Teacher
I found the regex (I think), which is \b(\w), to select the first letter of each word. However, I'm not sure it is valid Python syntax. I would also need exceptions, to skip words like "A", "As", "The", etc.
While this is possible, it becomes very tricky and error prone once there's any punctuation or other characters. Nevertheless, here's a simple first pass:
import re
title = input_data['title']
# 'High School English As A Second Language Teacher'
words = re.findall(r'\w{3,}', title)
# ['High', 'School', 'English', 'Second', 'Language', 'Teacher']
result = ' '.join(['#' + word for word in words])
# '#High #School #English #Second #Language #Teacher'
return {'result': result}
That finds all words that are 3 or more characters long, adds a # to each, and joins them all into one big string.
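If the three-character cutoff is too blunt (it would still hashtag "The" and would drop short words you might want to keep), a sketch with an explicit exemption list might look like this; the stopword set and the hard-coded title are assumptions for illustration:
import re

STOPWORDS = {'a', 'an', 'as', 'the', 'and', 'of'}  # adjust to taste

title = 'High School English As A Second Language Teacher'
words = re.findall(r'\w+', title)
hashtags = ['#' + w for w in words if w.lower() not in STOPWORDS]
result = ' '.join(hashtags)
# '#High #School #English #Second #Language #Teacher'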

How to quote some special words (registry numbers) to be not tokenized with Spacy?

I have some numbers inside my text which I would like to keep as single tokens (not split apart). Some of them:
7-2017-19121-B
7-2016-26132
wd/2012/0616
JLG486-01
H14-0890-12
How can I prevent them from being split into separate tokens? I already use a regex in a custom tokenizer so that words with dashes are never split, but it only works with letters, not with numbers. I don't want to change the default regex, which is big and very complicated. How can I do this easily?
What I have done already is use that "hyphen protector". For 7-2014-1721-Y I get the tokens [7, -, 2014, -, 1721-Y], so the last part is not divided but the previous ones are. As I said, the code is complicated, and I would like to add the same behaviour for number-number combinations as well.
This is the function:
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

def custom_tokenizer(nlp):
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # changing the default infixes
    def_infx = nlp.Defaults.infixes
    cur_infx = (d.replace('-|–|—|', '') for d in def_infx)
    infix_re = compile_infix_regex(cur_infx)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search, suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer, token_match=None)
Maybe there's some easier way, apart from modifying it? I've tried to quote these "plates" with escape characters like {7-2017-19121-B}, but it doesn't work.
By the way, there's a regex which matches these special "numbers". A workaround for me might be just removing them from the text (which I'll try later), but for now I'm asking whether I have any chance here.
["(?=[^\d\s]*\d)(?:[a-zA-Z\d]+(?:/[a-zA-Z\d]+)+)", "(?:[[A-Z\d]+(?:[-][A-Z\d]+)+)"]
Hint: I found out that changing 7-2017-19121-B to 7/2017/19121/B works as needed. The question (for me to check) is how I can adapt this to my current code while keeping the performance I have now.
You may add them as "special cases" (ORTH is imported from spacy.symbols):
from spacy.symbols import ORTH

nlp.tokenizer.add_special_case("7-2017-19121-B", [{ORTH: "7-2017-19121-B"}])
...
nlp.tokenizer.add_special_case("H14-0890-12", [{ORTH: "H14-0890-12"}])
Test:
print([w.text for w in nlp("Got JLG486-01 and 7-2017-19121-B codes.")])
# => ['Got', 'JLG486-01', 'and', '7-2017-19121-B', 'codes', '.']
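Since listing every code by hand doesn't scale, an alternative sketch (my own adaptation, not part of the answer above) is to pass the question's regex as the tokenizer's token_match callback; spaCy keeps any substring matched by token_match as a single token (the exact precedence against prefix/suffix rules varies a little between spaCy versions):
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

# The question's two patterns, combined into one regex (stray '[' removed).
CODE_RE = re.compile(
    r"(?=[^\d\s]*\d)(?:[a-zA-Z\d]+(?:/[a-zA-Z\d]+)+)"   # e.g. wd/2012/0616
    r"|(?:[A-Z\d]+(?:-[A-Z\d]+)+)"                      # e.g. 7-2017-19121-B
)

def custom_tokenizer(nlp):
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = compile_infix_regex(nlp.Defaults.infixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=CODE_RE.match)  # matched codes stay whole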

Using regular expressions to isolate the words with ei or ie in them

How do I use regular expressions to isolate the words with ei or ie in them?
import re
value = ("How can one receive one who over achieves while believing that he/she cannot be deceived.")
list = re.findall("[ei,ie]\w+", value)
print(list)
it should print ['receive', 'achieves', 'believing', 'deceived'], but I get ['eceive', 'er', 'ieves', 'ile', 'elieving', 'eceived'] instead.
The character-class syntax [] matches individual characters, so use a non-capturing group (?:) instead, with the alternatives ei and ie separated by |. This works like a group, but it doesn't capture a match group the way () would. You also want \w on either side so that the whole word is matched.
import re
value = ("How can one receive one who over achieves while believing that he/she cannot be deceived.")
words = re.findall(r"\w*(?:ei|ie)\w*", value)
print(words)
['receive', 'achieves', 'believing', 'deceived']
(I'm assuming you meant "achieves", not "achieve" since that's the word that actually appears here.)

What is Natural Language Processing Doing Exactly in This Code?

I am new to natural language processing and I want to use it to write a news aggregator (in Node.js, in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found this one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up in the NLTK source as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetic or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace; this matches any run of punctuation
See the documentation of Python's re module for a reference on regular expression syntax.
I have not dug into RegexpTokenizer, but I assume it is set up so that the tokenize function returns an iterator that searches the string for the first match of the regular expression, then the next, and so on.
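To see what that pattern does without pulling in NLTK at all, here is a quick sketch using the standard re module directly (the sample sentence is made up):
import re

TOKEN_RE = r'\w+|[^\w\s]+'   # the same pattern WordPunctTokenizer uses

sample = "Hello, world -- it's 2012!"
print(re.findall(TOKEN_RE, sample))
# ['Hello', ',', 'world', '--', 'it', "'", 's', '2012', '!']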

Generate sensible strings using a pattern

I have a table of strings (about 100,000) in following format:
pattern , string
e.g. -
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume that a * denotes a vowel, a consonant stands unchanged, and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is defined as a string in which every consecutive two-letter substring that was not specified in the pattern appears somewhere in the data-set.
e.g. -
h*ll* --> hallo, hello, holla ...
'hallo' is sensible because 'ha', 'al' and 'lo' can be seen in the data-set, as in the words 'have', 'also' and 'low'. The two letters 'll' are not considered because they were specified in the pattern.
What are the simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I have no specific language in mind but would prefer to use Java for this program.
This is particularly well suited to Python itertools, set and re operations:
import re
import itertools

VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET = '/usr/share/dict/words'
SENSIBLES = set()

def digraphs(word, digraph=r'..'):
    '''
    >>> digraphs('bar')
    {'ar', 'ba'}
    '''
    base = re.findall(digraph, word)
    base.extend(re.findall(digraph, word[1:]))
    return set(base)

def expand(pattern, wildcard, elements):
    '''
    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = re.split(re.escape(wildcard), pattern)
    results = set()
    for perm in itertools.permutations(elements, len(tokens)):
        results.add(''.join([l for p in zip(tokens, perm) for l in p][:-1]))
    return sorted(results)

def enum(pattern):
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
for word in open(DATASET, 'r').readlines():
    for digraph in digraphs(word.rstrip()):
        SENSIBLES.add(digraph)

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
As there aren't many possibilities for two-letter substrings, you can go through your dataset and generate a table that contains the count for every two-letter substring, so the table will look something like this:
ee 1024 times
su 567 times
...
xy 45 times
xz 0 times
The table will be small as you'll only have about 26*26 = 676 values to store.
You have to do this only once for your dataset (or update the table every time it changes, if the dataset is dynamic) and can then use the table to evaluate possible strings. For example, for your pattern, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).
Note that the scoring can be improved by checking longer substrings, e.g. three letters, but this will also result in larger tables.
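A minimal sketch of that table-and-score idea (the word-list path, the candidate strings, and the helper names are assumptions for illustration):
from collections import Counter

DATASET = '/usr/share/dict/words'   # assumed location of the word list

# Count every two-letter substring in the data-set once, up front.
digraph_counts = Counter()
with open(DATASET) as f:
    for word in f:
        word = word.strip().lower()
        digraph_counts.update(word[i:i + 2] for i in range(len(word) - 1))

def score(candidate, ignore=()):
    """Sum the counts of the candidate's digraphs, skipping any digraph
    that was already fixed by the pattern (e.g. 'll' in 'h*ll*')."""
    pairs = (candidate[i:i + 2] for i in range(len(candidate) - 1))
    return sum(digraph_counts[p] for p in pairs if p not in ignore)

# Rank some candidate expansions of 'h*ll*' by their score.
candidates = ['hallo', 'hello', 'holla', 'hyllu']
ranked = sorted(candidates, key=lambda c: score(c, ignore={'ll'}), reverse=True)
print(ranked)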
