Frequency of ngrams (strings) in tokenized text - string

I have a set of unique ngrams (list called ngramlist) and ngram tokenized text (list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that is equal to that element of ngramlist. I wrote the following code that gives the correct output, but I wonder if there is a way to optimize it:
freqlist = [
sum(int(ngram == ngram_condidate)
for ngram_condidate in ngrams) / len(ngrams)
for ngram in ngramlist
]
I imagine there is a function in nltk or elsewhere that does this faster but I am not sure which one.
Thanks!
Edit: for what it's worth the ngrams are producted as joined output of nltk.util.ngrams and ngramlist is just a list made from set of all found ngrams.
Edit2:
Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about)
from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd
articles = ['New York City','Moscow','Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()
data={'article':[],'treebank_tokenizer':[]}
for article in articles:
data['article' ].append(wikipedia.page(article).content)
data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))
df=pd.DataFrame(data)
df['ngrams-3']=df['treebank_tokenizer'].map(
lambda x: [' '.join(t) for t in ngrams(x,3)])
ngramlist = list(set([trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist]))
df['freqlist']=df['ngrams-3'].map(lambda ngrams_: [sum(int(ngram==ngram_condidate) for ngram_condidate in ngrams_)/len(ngrams_) for ngram in ngramlist])

You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.
freqlist = [
sum(int(ngram == ngram_candidate)
for ngram_candidate in ngrams) / len(ngrams)
for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams will make this algorighm O(n) instead of the O(n2) one you have now. Remember, shorter code is not necessarily better or more efficient code:
from collections import Counter
...
counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
To use this function properly, you would have to write a def function instead of a lambda:
def count_ngrams(ngrams):
counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
return freqlist
df['freqlist'] = df['ngrams-3'].map(count_ngrams)

Firstly, don't pollute your imported functions by overriding them and using them as variables, keep the ngrams name as the function, and use something else as variable.
import time
from functools import partial
from itertools import chain
from collections import Counter
import wikipedia
import pandas as pd
from nltk import word_tokenize
from nltk.util import ngrams
Next the steps before the line you're asking in the original question might be a little inefficient, you can clean them up, make them easier to read and measure them as such:
# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')
And then:
# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')
Then:
# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')
Finally, to the last line
# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))
# More often than not, you don't need to keep all the
# zeros in the vectors (aka dense vector),
# you could actually get the non-zero sparse vectors
# as a dict as such
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)
# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
nonzero_features = Counter(list_of_ngrams) & all_trigrams
total = len(list_of_ngrams)
return {ng:count/total for ng, count in nonzero_features.items()}
df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)

Related

faster method for comparing two lists element-wise

I am building a relational DB using python. So far I have two tables, as follows:
>>> df_Patient.columns
[1] Index(['NgrNr', 'FamilieNr', 'DosNr', 'Geslacht', 'FamilieNaam', 'VoorNaam',
'GeboorteDatum', 'PreBirth'],
dtype='object')
>>> df_LaboRequest.columns
[2] Index(['RequestId', 'IsComplete', 'NgrNr', 'Type', 'RequestDate', 'IntakeDate',
'ReqMgtUnit'],
dtype='object')
The two tables are quite big:
>>> df_Patient.shape
[3] (386249, 8)
>>> df_LaboRequest.shape
[4] (342225, 7)
column NgrNr on df_LaboRequest if foreign key (FK) and references the homonymous column on df_Patient. In order to avoid any integrity error, I need to make sure that all the values under df_LaboRequest[NgrNr] are in df_Patient[NgrNr].
With list comprehension I tried the following (to pick up the values that would throw an error):
[x for x in list(set(df_LaboRequest['NgrNr'])) if x not in list(set(df_Patient['NgrNr']))]
Though this is taking ages to complete. Would anyone recommend a faster method (method as a general word, as synonym for for procedure, nothing to do with the pythonic meaning of method) for such a comparison?
One-liners aren't always better.
Don't check for membership in lists. Why on earth would you create a set (which is the recommended data structure for O(1) membership checks) and then cast it to a list which has O(N) membership checks?
Make the set of df_Patient once outside the list comprehension and use that instead of making the set in every iteration
patients = set(df_Patient['NgrNr'])
lab_requests = set(df_LaboRequest['NgrNr'])
result = [x for x in lab_requests if x not in patients]
Or, if you like to use set operations, simply find the difference of both sets:
result = lab_requests - patients
Alternatively, use pandas isin() function.
patients = patients.drop_duplicates()
lab_requests = lab_requests.drop_duplicates()
result = lab_requests[~lab_requests.isin(patients)]
Let's test how much faster these changes make the code:
import pandas as pd
import random
import timeit
# Make dummy dataframes of patients and lab_requests
randoms = [random.randint(1, 1000) for _ in range(10000)]
patients = pd.DataFrame("patient{0}".format(x) for x in randoms[:5000])[0]
lab_requests = pd.DataFrame("patient{0}".format(x) for x in randoms[2000:8000])[0]
# Do it your way
def fun1(pat, lr):
return [x for x in list(set(lr)) if x not in list(set(pat))]
# Do it my way: Set operations
def fun2(pat, lr):
pat_s = set(pat)
lr_s = set(lr)
return lr_s - pat_s
# Or explicitly iterate over the set
def fun3(pat, lr):
pat_s = set(pat)
lr_s = set(lr)
return [x for x in lr_s if x not in pat_s]
# Or using pandas
def fun4(pat, lr):
pat = pat.drop_duplicates()
lr = lr.drop_duplicates()
return lr[~lr.isin(pat)]
# Make sure all 3 functions return the same thing
assert set(fun1(patients, lab_requests)) == set(fun2(patients, lab_requests)) == set(fun3(patients, lab_requests)) == set(fun4(patients, lab_requests))
# Time it
timeit.timeit('fun1(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun1', number=100)
# Output: 48.36615000000165
timeit.timeit('fun2(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun2', number=100)
# Output: 0.10799920000044949
timeit.timeit('fun3(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun3', number=100)
# Output: 0.11038020000069082
timeit.timeit('fun4(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun4', number=100)
# Output: 0.32021789999998873
Looks like we have a ~150x speedup with pandas and a ~500x speedup with set operations!
I don't have a pandas installed right now to try this. But you could try removing the list(..) cast. I don't think it provides anything meaningful to the program and sets are much faster for lookup, e.g. x in set(...), than lists.
Also you could try doing this with the pandas API rather than lists and sets, sometimes this faster. Try searching for unique. Then you could compare the size of the two columns and if it is the same, sort them and do an equality check.

Using NLTK, how to search for concepts in a text

I'm novice to both Python and NLTK. So, I'm trying to see the representation of some concepts in text using NLTK. I have a CSV file which looks like this image
And I want to see how frequent, e.g., Freedom, Courage, and all other concepts are. I also want to know how to make sure the code looks for bi and trigrams. However, the code I have below only allows me to look for a single list of words in a text (Preps.txt like this ).
The output I expect is something like:
Concept = Frequency in text, i.e., Freedom = 10, Courage = 20
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/Muhsa/Myfolder/Concepts' #this is where the texts I want to study are located
Concepts= PlaintextCorpusReader(corpus_root, '.*')
Concepts.fileids()
for fileid in Concepts.fileids():
text3 = Concepts.words(fileid)
from nltk import word_tokenize
from nltk import FreqDist
text3 = Concepts.words(fileid)
preps = open('preps.txt', encoding="utf-8")
rawpreps = preps.read() #preps refer to the file that has the list of words
tokens = word_tokenize(rawpreps)
texty = nltk.Text(tokens)
fdist = nltk.FreqDist(w.lower() for w in text3)
for m in texty:
print(m + ':', fdist[m], end=' ')
I reorganised your code a little bit. I assumed you had 1 file per concept words, and that 'preps.txt' only contained the courage words but not the others.
I hope it is easy to understand.
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import word_tokenize
from nltk import FreqDist
# Load the courage vocabulary
with open('preps.txt', encoding="utf-8") as file:
content = file.read() #preps refer to the file that has the list of words
courage_words = content.split('\n') # This is a list of words
# load freedom and development words in the same fashion
# Load the corpus
corpus_root = '/Users/Muhsa/Myfolder/Concepts' #this is where the texts I want to study are located
corpus = PlaintextCorpusReader(corpus_root, '.*')
# Count the number of word in the whole corpus that are also in the courage vocabulry
courage_freq = len([w for w in corpus.words() if w in courage_words])
print('Corpus contains {} courage words'.format(courage_freq))
# For each file in the corpus
for file_id in corpus.fileids():
# Count the number of word in the file that are also in courage word
file_freq = len([w for w in corpus.words(file_id) if w in courage_words])
print(file_id, file_freq)
Or better
# Load concept vocabulary in different files, in a python dictionary
concept_voc = {}
for file_path in ['courage.txt', 'freedom.txt', 'development.txt']:
concept_name = file_path.replace('.txt', '')
with open(file_path) as f:
voc = f.read().split('\n')
concept_voc[concept_name] = voc
# Load concept vocabulary in a csv file, each column is one vocabulary, the first line is the "name"
df = pd.read_csv('to_dict.csv')
convept_voc = df.to_dict('columns')
# concept_voc['courage'] returns the list of courage words
# And then for each concept compute the frequency as before
for concept in concept_voc:
voc = concept_voc[concept]
corpus_freq = len([w for w in corpus.words() if w in voc])
print(concept, '=', corpus_freq)

Having issues computing the average of compound sentiment values for each text file in a folder

# below is the sentiment analysis code written for sentence-level analysis
import glob
import os
import nltk.data
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize
# Next, VADER is initialized so I can use it within the Python script
sid = SentimentIntensityAnalyzer()
# I will also initialize the 'english.pickle' function and give it a short
name
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#Each of the text file is listed from the folder speeches
files = glob.glob(os.path.join(os.getcwd(), 'cnn_articles', '*.txt'))
text = []
#iterate over the list getting each file
for file in files:
#open the file and then call .read() to get the text
with open(file) as f:
text.append(f.read())
text_str = "\n".join(text)
# This breaks up the paragraph into a list of strings.
sentences = tokenizer.tokenize(text_str )
sent = 0.0
count = 0
# Iterating through the list of sentences and extracting the compound scores
for sentence in sentences:
count +=1
scores = sid.polarity_scores(sentence)
sent += scores['compound'] #Adding up the overall compound sentiment
# print(sent, file=open('cnn_compound.txt', 'a'))
if count != 0:
sent = float(sent / count)
print(sent, file=open('cnn_compound.txt', 'a'))
With these lines of code, I have been able to get the average of all the compound sentiment values for all the text files. What I really want is the
average compound sentiment value for each text file, such that if I have 10
text files in the folder, I will have 10 floating point values representing
each of the text file. So that I can plot these values against each other.
Kindly assist me as I am very new to Python.
# below is the sentiment analysis code written for sentence-level analysis
import os, string, glob, pandas as pd, numpy as np
import nltk.data
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize
# Next, VADER is initialized so I can use it within the Python
script
sid = SentimentIntensityAnalyzer()
exclude = set(string.punctuation)
# I will also initialize the 'english.pickle' function and give
it a short
name
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#Each of the text file is listed from the folder speeches
files = glob.glob(os.path.join(os.getcwd(), 'cnn_articles',
'*.txt'))
text = []
sent = 0.0
count = 0
cnt = 0
#iterate over the list getting each file
for file in files:
f = open(file).read().split('.')
cnt +=1
count = (len(f))
for sentence in f:
if sentence not in exclude:
scores = sid.polarity_scores(sentence)
print(scores)
break
sent += scores['compound']
average = round((sent/count), 4)
t = [cnt, average]
text.append(t)
break
df = pd.DataFrame(text, columns=['Article Number', 'Average
Value'])
#
#df.to_csv(r'Result.txt', header=True, index=None, sep='"\t\"
+"\t\"', mode='w')
df.to_csv('cnn_result.csv', index=None)

Python (NLTK) - more efficient way to extract noun phrases?

I've got a machine learning task involving a large amount of text data. I want to identify, and extract, noun-phrases in the training text so I can use them for feature construction later on in the pipeline.
I've extracted the type of noun-phrases I wanted from text but I'm fairly new to NLTK, so I approached this problem in a way where I can break down each step in list comprehensions like you can see below.
But my real question is, am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?
import nltk
import pandas as pd
myData = pd.read_excel("\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label == 'NP')] for tree in phrases]
flatten the list of lists of lists of tuples that we've ended up with, into
just a list of lists of tuples
leaves = [tupls for sublists in leaves for tupls in sublists]
Join the extracted terms into one bigram
nounphrases = [unigram[0][1]+' '+unigram[1][0] in leaves]
Take a look at Why is my NLTK function slow when processing the DataFrame?, there's no need to iterate through all rows multiple times if you don't need intermediate steps.
With ne_chunk and solution from
NLTK Named Entity recognition to a Python list and
How can I extract GPE(location) using NLTK ne_chunk?
[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
def get_continuous_chunks(text, chunk_func=ne_chunk):
chunked = chunk_func(pos_tag(word_tokenize(text)))
continuous_chunk = []
current_chunk = []
for subtree in chunked:
if type(subtree) == Tree:
current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
elif current_chunk:
named_entity = " ".join(current_chunk)
if named_entity not in continuous_chunk:
continuous_chunk.append(named_entity)
current_chunk = []
else:
continue
return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks((sent)))
[out]:
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
To use the custom RegexpParser :
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)
def get_continuous_chunks(text, chunk_func=ne_chunk):
chunked = chunk_func(pos_tag(word_tokenize(text)))
continuous_chunk = []
current_chunk = []
for subtree in chunked:
if type(subtree) == Tree:
current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
elif current_chunk:
named_entity = " ".join(current_chunk)
if named_entity not in continuous_chunk:
continuous_chunk.append(named_entity)
current_chunk = []
else:
continue
return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[out]:
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
I suggest referring to this prior thread:
Extracting all Nouns from a text file using nltk
They suggest using TextBlob as the easiest way to achieve this (if not the one that is most efficient in terms of processing) and the discussion there addresses your question.
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)
The above methods didn't give me the required results. Following is the function that I would suggest
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import re
def get_noun_phrases(text):
pos = pos_tag(word_tokenize(text))
count = 0
half_chunk = ""
for word, tag in pos:
if re.match(r"NN.*", tag):
count+=1
if count>=1:
half_chunk = half_chunk + word + " "
else:
half_chunk = half_chunk+"---"
count = 0
half_chunk = re.sub(r"-+","?",half_chunk).split("?")
half_chunk = [x.strip() for x in half_chunk if x!=""]
return half_chunk
The Constituent-Treelib library, which can be installed via: pip install constituent-treelib does excatly what you are looking for in few lines of code. In order to extract noun (or any other) phrases, perform the following steps.
from constituent_treelib import ConstituentTree
# First, we have to provide a sentence that should be parsed
sentence = "I've got a machine learning task involving a large amount of text data."
# Then, we define the language that should be considered with respect to the underlying models
language = ConstituentTree.Language.English
# You can also specify the desired model for the language ("Small" is selected by default)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Next, we must create the neccesary NLP pipeline.
# If you wish, you can instruct the library to download and install the models automatically
nlp = ConstituentTree.create_pipeline(language, spacy_model_size) # , download_models=True
# Now, we can instantiate a ConstituentTree object and pass it the sentence and the NLP pipeline
tree = ConstituentTree(sentence, nlp)
# Finally, we can extract the phrases
tree.extract_all_phrases()
Result...
{'S': ["I 've got a machine learning task involving a large amount of text data ."],
'PP': ['of text data'],
'VP': ["'ve got a machine learning task involving a large amount of text data",
'got a machine learning task involving a large amount of text data',
'involving a large amount of text data'],
'NML': ['machine learning'],
'NP': ['a machine learning task involving a large amount of text data',
'a machine learning task',
'a large amount of text data',
'a large amount',
'text data']}
If you only want the noun phrases, just pick them out with tree.extract_all_phrases()['NP']
['a machine learning task involving a large amount of text data',
'a machine learning task',
'a large amount of text data',
'a large amount',
'text data']

Print only topic name using LDA with python

I need to print only the topic word (only one word). But it contains some number, But I can not get only the topic name like "Happy". My String word is "Happy", why it shows "Happi"
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import string
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
fr = open('Happy DespicableMe.txt','r')
doc_a = fr.read()
fr.close()
doc_set = [doc_a]
texts = []
for i in doc_set:
raw = i.lower()
tokens = tokenizer.tokenize(raw)
stopped_tokens = [i for i in tokens if not i in en_stop]
stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word = dictionary, passes=20)
rafa = ldamodel.show_topics(num_topics=1, num_words=1, log=False , formatted=False)
print(rafa)
It only shows [(0, '0.142*"happi"')]. But I want to print only the word.
You are plagued by a misunderstanding:
Stemming extracts the stem of a word through a series of transformation rules stripping off common suffixes and prefixes. Indeed, the resulting stem is not necessarily an actual English word. The purpose use of stemming is to normalize words for comparison. E.g.
stem_word('happy') == stem_word('happier')
What you need is a Lemmatizer (e.g. nltk.stem.wordnet) to lookup lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
After you have install the corpus/wordnet you can use it like this:
from nltk.corpus import wordnet
syns = wordnet.synsets("happier")
print(syns[0].lemmas()[0].name())
Output:
happy

Resources