Extract acronyms and Māori (non-English) words in a dataframe, and put them in adjacent columns within the dataframe - python-3.x

Regular expressions are a steep learning curve for me. I have a dataframe containing text (up to 300,000 rows). The text in the outcome column of a dummy file named foo_df.csv is a mixture of English words, acronyms and Māori words. foo_df.csv looks like this:
outcome
0 I want to go to DHB
1 Self Determination and Self-Management Rangatiratanga
2 mental health wellness and AOD counselling
3 Kai on my table
4 Fishing
5 Support with Oranga Tamariki Advocacy
6 Housing pathway with WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services
The result I desire is a table like the one below, with Abbreviation and Māori_word columns:
outcome Abbreviation Māori_word
0 I want to go to DHB DHB
1 Self Determination and Self-Management Rangatiratanga Rangatiratanga
2 mental health wellness and AOD counselling AOD
3 Kai on my table Kai
4 Fishing
5 Support with Oranga Tamariki Advocacy Oranga Tamariki
6 Housing pathway with WINZ WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services Owaraika
The approach I am using is to extract the acronyms with a regular expression and the Māori words with the nltk module.
I have been able to extract the acronyms using a regular expression with this code:
pattern = r'(\b[A-Z](?:[\.&]?[A-Z]){1,7}\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
I have been able to extract non-English words from a single sentence using the code below:
import nltk
nltk.download('words')
from nltk.corpus import words

words = set(nltk.corpus.words.words())
sent = "Self Determination and Self-Management Rangatiratanga"
# keep tokens that are not in the English word list (or are not purely alphabetic)
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if not w.lower() in words or not w.isalpha())
However, I got a TypeError: expected string or bytes-like object when I tried to apply the above code to the dataframe. The iteration I tried is below:
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
             if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis=1)
print(foo_df)
Any help in python3 will be appreciated. Thanks.

You can't magically tell if a word is English/Māori/abbreviation with a simple short regex. Actually, it is quite likely that some words can be found in multiple categories, so the task itself is not binary (or trinary in this case).
What you want to do is natural language processing; there are several libraries for language detection in Python. What you'll get is a probability that the input is in a given language. This is usually run on full texts, but you could apply it to single words.
Another approach is to use Māori and abbreviation dictionaries (i.e. exhaustive or curated lists of words), craft a function that tells whether a word is in one of them, and assume English otherwise.
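A minimal sketch of that dictionary route, reusing the question's acronym regex; the maori_words set below is a tiny placeholder you would replace with a real Māori word list:
import re
import pandas as pd

foo_df = pd.read_csv("foo_df.csv")

# Acronym pattern from the question: runs of 2-8 capital letters.
acronym_pattern = r'(\b[A-Z](?:[\.&]?[A-Z]){1,7}\b)'
foo_df['Abbreviation'] = foo_df['outcome'].str.extract(acronym_pattern)

# Placeholder dictionary; in practice load an exhaustive Māori word list here.
maori_words = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}

def find_maori(text):
    """Return the words of `text` found in the Māori word list."""
    if not isinstance(text, str):
        return ""
    hits = [w for w in re.findall(r"[A-Za-zāēīōū]+", text)
            if w.lower() in maori_words]
    return " ".join(hits)

foo_df['Māori_word'] = foo_df['outcome'].apply(find_maori)
print(foo_df)
Whichever approach you take, note that the function handed to apply must return the joined string; the question's no_english never returns a value, so the new column would stay empty even once the TypeError is fixed.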

Related

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated by random users. I am trying to detect products mentioned in the text while taking into account spelling variation. For example the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- I max-tokenized the product names and the users' text (I split words by punctuation and digits, so s8 is tokenized into 's' and '8'). Then I checked each token in the user's text to see whether it is in my vocabulary with a Damerau-Levenshtein distance <= 1, to allow for spelling variation. Once I detected a sequence of tokens that exist in the vocabulary, I searched for the product matching that query, checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exists in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as a result dates are detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. Results were okay, but not great.
None of the above approaches achieved reliable performance. I tried regular expressions, but with so many combinations to consider it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results if I trained on more data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not strictly an NLP task, and there are guides on how to implement spell checking).
2) Once you have done that pre-processing (spell checking), use your NER algorithm.
It should improve your accuracy.
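As a rough sketch of step 1, the standard library's difflib can approximate the fuzzy dictionary lookup; the product vocabulary and cutoff below are illustrative assumptions, and you could swap in a Damerau-Levenshtein library if you need true edit-distance-1 matching:
import difflib

# Illustrative product vocabulary; replace with your real product-name tokens.
product_vocab = ["samsung", "galaxy", "s8", "iphone", "pixel"]

def correct_tokens(text, vocab, cutoff=0.8):
    """Map each token to its closest vocabulary entry, if one is close enough."""
    corrected = []
    for token in text.lower().split():
        match = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)

print(correct_tokens("i am interested in galxy s8", product_vocab))
# -> "i am interested in galaxy s8" (then feed the corrected text to the NER step)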

NLP: How to get an exact number of sentences for a text summary using Gensim

I am trying to summarise some text using Gensim in Python and want exactly 3 sentences in my summary. There doesn't seem to be an option to do this, so I have done the following workaround:
with open('speeches//' + speech, "r") as myfile:
    speech = myfile.read()

sentences = speech.count('.')
x = gensim.summarization.summarize(speech, ratio=3.0/sentences)
However, this code is only giving me two sentences. Furthermore, as I incrementally increase the 3 up to 5, nothing changes.
Any help would be most appreciated.
You may not be able to use 'ratio' for this. If you give ratio=0.3 and you have 10 sentences (assuming each sentence has the same word count), your output will have 3 sentences, 6 for 20 sentences, and so on.
As per the gensim docs:
ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
Instead you might want to try using word_count: summarize(speech, word_count=60)
This question is a bit old; in case you found a better solution, please share.
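A minimal sketch of the word_count suggestion, assuming a gensim version before 4.0 (the gensim.summarization module was removed in 4.x); the 60-word budget is an illustrative stand-in for roughly three sentences and will need tuning for your texts:
from gensim.summarization import summarize

# `speech` already holds the text read from the file, as in the question's snippet.
summary = summarize(speech, word_count=60)   # ask for a word budget instead of a ratio
print(summary)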

Set of rules for textual analysis - Natural language processing

Does there exist a guide with a set of rules for textual analysis / natural language processing?
Do you have some specific developed package (e.g. in Python) for textual sentiment analysis?
Here is the application I am faced with:
Let's say I have two dictionaries, A and B. A contains "negative" words, and B contains "positive" words. What I can do is count the number of negative and positive words.
This creates some issues, such as the following: let's suppose that "exceptionally" is a positive word and "serious" is a negative word.
If the two words follow each other, I have "exceptionally serious". In such a case the two words cancel each other out, which means I count 1 negative and 1 positive word. This is not right, because in reality the phrase is a double negative.
So, my question is: is there a set of rules I can apply to improve my code, or is there software that already takes such mechanisms into account and applies textual sentiment analysis? Is there an implementation to which I can feed the dictionaries and which gives me the textual sentiment after applying a set of rules, such as handling double negatives?
We did sentiment analysis at San Diego State using nltk with Python. Really fun and easy! See http://text-processing.com/demo/sentiment/ for an example; I entered "exceptionally serious" and it knows that it is NEG.
An easy enough example to follow: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
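If you want to stay within nltk, here is a minimal sketch using its bundled VADER analyzer, which layers intensifier and negation rules on top of a lexicon; note that VADER's lexicon is general-purpose, not the custom dictionaries A and B from the question:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
for phrase in ["exceptionally serious", "not bad at all"]:
    # polarity_scores returns neg/neu/pos proportions plus a compound score in [-1, 1]
    print(phrase, sia.polarity_scores(phrase))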

Distinguishing well formed English sentences from "word salad"

I'm looking for a library easily usable from C++, Python or F#, which can distinguish well formed English sentences from "word salad". I tried The Stanford Parser and unfortunately, it parsed this:
Some plants have with done stems animals with exercise that to predict?
without a complaint. I'm not looking for something very sophisticated that can handle all possible corner cases. I only need to filter out obvious nonsense.
Here is something I just stumbled upon:
A general-purpose sentence-level nonsense detector, by a Stanford student named Ian Tenney.
Here is the code from the project, undocumented but available on GitHub.
If you want to develop your own solution based on this, I think you should pay attention to the 4th group of features used, i.e. the language model, under section 3, "Features and preprocessing".
It might not suffice, but I think getting a probability score for each subsequence of length n is a good start. 3-grams like "plants have with", "have with done", "done stems animals", "stems animals with" and "that to predict" seem rather improbable, which could lead to a "nonsense" label on the whole sentence.
This method has the advantage of relying on a learned model rather than on a set of hand-made rules, which as far as I know is your other option. Many people would point you to Chapter 8 of NLTK's manual, but I think that developing your own context-free grammar for general English is asking a bit much.
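As a rough illustration of the n-gram scoring idea, here is a sketch using nltk's language-model module; the tiny training corpus is a placeholder, and in practice you would train on a large collection of well-formed sentences and threshold the average log score:
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Placeholder training data; use a large corpus of well-formed sentences in practice.
train_sents = [s.lower().split() for s in
               ["some plants have stems", "animals need exercise to stay healthy"]]

n = 3
train_grams, vocab = padded_everygram_pipeline(n, train_sents)
lm = Laplace(n)              # add-one smoothing so unseen trigrams still score > 0
lm.fit(train_grams, vocab)

test = list(pad_both_ends("plants have with done stems animals".lower().split(), n))
trigrams = list(ngrams(test, n))
# Average per-trigram log probability; very negative values hint at "word salad".
avg_logscore = sum(lm.logscore(w[-1], w[:-1]) for w in trigrams) / len(trigrams)
print(avg_logscore)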
The paper was useful, but goes into too much depth for solving this problem. Here is the author's basic approach, heuristically:
Baseline sentence heuristic: first letter is capitalized and the line ends with one of .?! (1 feature).
Number of characters, words, punctuation, digits, and named entities (from the Stanford CoreNLP NER tagger), plus versions normalized by text length (10 features).
Part-of-speech distribution: (# tag / # words) for each Penn Treebank tag (45 features).
Indicators for the part-of-speech tag of the first and last token in the text (45 x 2 = 90 features).
Language model raw score (s_lm = log p(text)) and normalized score (s_lm / # words) (2 features).
However, after a lot of searching, the GitHub repo only includes the tests and visualizations; the raw training and test data are not there. Here is his function for calculating these features:
(note: this uses pandas dataframes as df)
import re

def make_basic_features(df):
    """Compute basic features."""
    df['f_nchars'] = df['__TEXT__'].map(len)
    df['f_nwords'] = df['word'].map(len)
    punct_counter = lambda s: sum(1 for c in s
                                  if (not c.isalnum())
                                  and not c in [" ", "\t"])
    df['f_npunct'] = df['__TEXT__'].map(punct_counter)
    df['f_rpunct'] = df['f_npunct'] / df['f_nchars']
    df['f_ndigit'] = df['__TEXT__'].map(lambda s: sum(1 for c in s
                                                      if c.isdigit()))
    df['f_rdigit'] = df['f_ndigit'] / df['f_nchars']
    upper_counter = lambda s: sum(1 for c in s if c.isupper())
    df['f_nupper'] = df['__TEXT__'].map(upper_counter)
    df['f_rupper'] = df['f_nupper'] / df['f_nchars']
    df['f_nner'] = df['ner'].map(lambda ts: sum(1 for t in ts
                                                if t != 'O'))
    df['f_rner'] = df['f_nner'] / df['f_nwords']

    # Check standard sentence pattern:
    # if starts with capital, ends with .?!
    def check_sentence_pattern(s):
        ss = s.strip(r"""`"'""").strip()
        return ss[0].isupper() and (ss[-1] in '.?!')

    df['f_sentence_pattern'] = df['__TEXT__'].map(check_sentence_pattern)

    # Normalize any LM features
    # by dividing logscore by number of words
    lm_cols = {c: re.sub("_lmscore_", "_lmscore_norm_", c)
               for c in df.columns if c.startswith("f_lmscore")}
    for c, cnew in lm_cols.items():
        df[cnew] = df[c] / df['f_nwords']
    return df
So I guess that's a function you can use in this case. For the minimalist version:
raw = ["This is is a well-formed sentence","but this ain't a good sent","just a fragment"]
import pandas as pd
df = pd.DataFrame([{"__TEXT__":i, "word": i.split(), 'ner':[]} for i in raw])
The function seems to want a list of the words, plus the named entities recognized (NER) using the Stanford CoreNLP library, which is written in Java. You can pass in nothing (an empty list []) and the function will calculate everything else. You'll get back a dataframe (like a matrix) with all the features of the sentences, which you can then use to decide what to call "well formed" by the rules given.
Also, you don't HAVE to use pandas here. A list of dictionaries will also work. But the original code used pandas.
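For completeness, running the function above on that minimal dataframe would look something like this; the column selection is just illustrative:
df = make_basic_features(df)
# Inspect a few of the computed features
print(df[['__TEXT__', 'f_nwords', 'f_npunct', 'f_sentence_pattern']])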
Because this example involved a lot of steps, I've created a gist where I run through an example up to the point of producing a clean list of sentences and a dirty list of not-well-formed sentences.
My gist: https://gist.github.com/marcmaxson/4ccca7bacc72eb6bb6479caf4081cefb
This replaces the Stanford CoreNLP Java library with spacy, a newer and easier-to-use Python library that fills in the missing metadata, such as sentiment, named entities, and parts of speech, used to determine whether a sentence is well-formed. It runs under Python 3.6 but could work under 2.7; all the libraries are backwards compatible.
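As a rough sketch of that substitution (not the gist itself), spacy can supply the word and ner columns the feature function expects; the en_core_web_sm model is assumed to be installed, and mapping empty entity types to 'O' mimics the CoreNLP-style tags the code checks for:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")   # python -m spacy download en_core_web_sm

raw = ["This is a well-formed sentence.", "plants have with done stems"]
rows = []
for text in raw:
    doc = nlp(text)
    rows.append({
        "__TEXT__": text,
        "word": [t.text for t in doc],
        # spacy leaves ent_type_ empty for non-entities; map that to 'O'
        "ner": [t.ent_type_ or "O" for t in doc],
    })

df = make_basic_features(pd.DataFrame(rows))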

Find most repeated phrase on huge text

I have huge text data. My entire database is in text format, UTF-8 encoded.
I need a list of the most repeated phrases across my whole text data.
For example, my desired output is something like this:
{
'a': 423412341,
'this': 423412341,
'is': 322472341,
'this is': 222472341,
'this is a': 122472341,
'this is a my': 5235634
}
Processing and storing each phrase takes a huge amount of database space, for example when stored in MySQL or MongoDB.
The question is: is there a more efficient database or algorithm for finding this result? Solr, Elasticsearch, etc.?
I think a maximum of 10 words in each phrase would be good for me.
I'd suggest combining ideas from two fields, here: Streaming Algorithms, and the Apriori Algorithm From Market-Basket Analysis.
Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).
Now to the problem of the k most frequent phrases (of possibly multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.
A Short Description of The Sampling Algorithm
This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.
In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.
In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.
Output the top k of the items resulting from the second stage.
Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
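A toy sketch of the two-stage algorithm, assuming the whole corpus fits in a list here for simplicity (a real implementation would stream over the text twice); the k and f values are illustrative:
import math
import random
from collections import Counter

def top_k_words(corpus_words, k, f):
    """Two-stage sampling: select ~log(n)/f candidate words at random, then count them exactly."""
    words = list(corpus_words)           # in practice, stream the corpus instead
    n = len(words)
    sample_size = max(1, int(math.log(n) / f))

    # Stage 1: a random sample contains the frequent words with high probability.
    candidates = set(random.sample(words, min(sample_size, n)))

    # Stage 2: exact counts, but only for the candidates (cheap second pass).
    counts = Counter(w for w in words if w in candidates)
    return counts.most_common(k)

print(top_k_words("this is a my text this is a my".split(), k=3, f=0.2))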
If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:
http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100
Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:
http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
Have you considered using MapReduce?
Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens of up to 10 words. I don't think that's a big deal. The outcome of the MR job will be token -> frequency pairs, which you can pass to another job to sort by frequency (one option). I would suggest reading up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.
Original paper on MapReduce by Google.
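To make the job's shape concrete, here is a tiny in-process sketch of the mapper and reducer (real Hadoop code would split these across mapper/reducer classes; the 10-word cap follows the question):
from collections import defaultdict

def map_phrases(line, max_len=10):
    """Mapper: emit (phrase, 1) for every 1..max_len word window in the line."""
    tokens = line.split()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), 1

def reduce_counts(pairs):
    """Reducer: sum the 1s per phrase key."""
    counts = defaultdict(int)
    for phrase, one in pairs:
        counts[phrase] += one
    return counts

lines = ["this is a my text", "this is a my data"]
counts = reduce_counts(pair for line in lines for pair in map_phrases(line))
print(sorted(counts.items(), key=lambda kv: -kv[1])[:5])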
Tokenize the text into 1-to-10-word tokens and insert them into 10 SQL tables by token length. Make sure to use a hash index on the column with the string tokens. Then just call SELECT token, COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.
EDIT: that would be infeasible for large datasets; instead, for each N-gram, update the count by +1 or insert a new row into the table (in MySQL the useful query would be INSERT ... ON DUPLICATE KEY UPDATE). You should definitely still use hash indexes, though.
After that, just sort by number of occurrences and merge the data from these 10 tables (you could do that in a single step, but that would put more strain on memory).
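A rough sketch of the EDIT's update-in-place idea, using SQLite's upsert syntax as a stand-in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE (one bigram table shown instead of ten; assumes SQLite 3.24+):
import sqlite3

conn = sqlite3.connect("ngrams.db")
conn.execute("CREATE TABLE IF NOT EXISTS grams2 (token TEXT PRIMARY KEY, cnt INTEGER)")

def add_bigrams(line):
    tokens = line.split()
    for i in range(len(tokens) - 1):
        token = " ".join(tokens[i:i + 2])
        # Increment the count if the bigram exists, otherwise insert it.
        conn.execute(
            "INSERT INTO grams2 (token, cnt) VALUES (?, 1) "
            "ON CONFLICT(token) DO UPDATE SET cnt = cnt + 1",
            (token,),
        )

add_bigrams("this is a my text this is a my")
conn.commit()
print(conn.execute("SELECT token, cnt FROM grams2 ORDER BY cnt DESC LIMIT 5").fetchall())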
Be wary of heuristic methods like the one suggested by Ami Tavory: if you select the wrong parameters, you can get wrong results (a flaw of the sampling algorithm can be seen with some classic terms or phrases, e.g. "habeas corpus": neither "habeas" nor "corpus" will be selected as frequent by itself, but as a two-word phrase it may very well rank higher than some phrases you get by appending/prepending to a common word). There is surely no need to use them for tokens of lesser length; you could use them only when classic methods fail (take too much time or memory).
The top answer, by Ami Tavory, states:
Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.
While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams is bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):
Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.
Looking at the most frequent single words, we get:
1-Gram Frequency
------ ---------
a 12
will 6
corpus 5
tricksy 4
or 3
from 2
it 2
the 2
very 2
you 2
The method suggested by Ami Tavori would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:
2-Gram Frequency
------ ---------
corpus will 5
tricksy corpus 4
or a 3
a very 2
And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:
3-Gram Frequency
------ ---------
tricksy corpus will 4
To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
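A minimal sketch of that pruning idea: count n-grams level by level, keep only those that repeat, and extend only the survivors (the toy corpus and min_count=2 threshold are illustrative):
from collections import Counter

def repeated_ngrams(tokens, max_n=10, min_count=2):
    """Apriori-style: only extend n-grams whose (n-1)-word prefix already repeats."""
    results = {}
    survivors = None                      # (n-1)-grams that met the threshold
    for n in range(1, max_n + 1):
        counts = Counter(
            tuple(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)
            if survivors is None or tuple(tokens[i:i + n - 1]) in survivors
        )
        frequent = {g: c for g, c in counts.items() if c >= min_count}
        if not frequent:
            break                         # nothing repeats at this length; stop
        results.update({" ".join(g): c for g, c in frequent.items()})
        survivors = set(frequent)
    return results

tokens = "this is a my text and this is a my data".split()
print(repeated_ngrams(tokens, max_n=5))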
This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.
