I'm new to Python and nltk, so I would really appreciate your input on the following problem.
Goal:
I want to search and count the occurrence of specific terminology in tokenized sentences which are stored in a pandas DataFrame. The terms I'm searching for are stored in a list of strings. The output should be saved in a new column.
Since the words I'm searching for are grammatically inflected (e.g. cats instead of cat), I need a solution that matches more than exact forms. Stemming the data and searching for specific stems would probably be a proper approach, but let's assume this is not an option here, as we would still have semantic overlaps.
What I tried so far:
To prepare the data for further handling, I preprocessed it with these steps:
Put everything in lower case
Remove punctuation
Tokenization
Remove stop words
I tried searching for single terms with str.count('cat'), but this doesn't do the trick: after tokenization the column contains lists rather than strings, so the .str accessor marks the data as missing with NaN. Additionally, I don't know how to iterate over the search-word list in an efficient way while using pandas.
My code so far:
import numpy as np
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Function to remove punctuation
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)
# Target data where strings should be searched and counted
data = {'txt_body': ['Ab likes dogs.', 'Bc likes cats.',
'De likes cats and dogs.', 'Fg likes cats, dogs and cows.',
'Hi has two grey cats, a brown cat and two dogs.']}
df = pd.DataFrame(data=data)
# Search words stored in a list of strings
search_words = ['dog', 'cat', 'cow']
# Store stopwords from nltk.corpus
stop_words = set(stopwords.words('english'))
# Data preprocessing
df['txt_body'] = df['txt_body'].apply(lambda x: x.lower())
df['txt_body'] = df['txt_body'].apply(remove_punctuation)
df['txt_body'] = df['txt_body'].fillna("").map(word_tokenize)
df['txt_body'] = df['txt_body'].apply(lambda x: [word for word in x if word not in stop_words])
# Here is the problem space
df['search_count'] = df['txt_body'].str.count('cat')
print(df.head())
Expected output:
txt_body search_count
0 [ab, likes, dogs] 1
1 [bc, likes, cats] 1
2 [de, likes, cats, dogs] 2
3 [fg, likes, cats, dogs, cows] 3
4 [hi, two, grey, cats, brown, cat, two, dogs] 3
A very simple solution would be this:
def count_occurence(l, s):
    counter = 0
    for item in l:
        if s in item:
            counter += 1
    return counter

df['search_count'] = df.apply(lambda row: count_occurence(row.txt_body, 'cat'), axis=1)
You can then refine the count_occurence function as needed. To search for the whole search_words list, something like this will do the job, although it is probably not the most efficient approach:
def count_search_words(l, search_words):
    counter = 0
    for s in search_words:
        counter += count_occurence(l, s)
    return counter

df['search_count'] = df.apply(lambda row: count_search_words(row.txt_body, search_words), axis=1)
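If you prefer to avoid the two nested helpers, a more compact sketch is shown below. It assumes, as above, that txt_body already holds token lists; note that it counts each token at most once, even if the token happens to contain several search words.

import pandas as pd

search_words = ['dog', 'cat', 'cow']
df = pd.DataFrame({'txt_body': [['ab', 'likes', 'dogs'],
                                ['hi', 'two', 'grey', 'cats', 'brown', 'cat', 'two', 'dogs']]})

# A token counts as a hit if it contains any search word as a substring,
# so inflected forms such as "dogs" or "cats" are matched as well.
df['search_count'] = df['txt_body'].apply(
    lambda tokens: sum(any(s in tok for s in search_words) for tok in tokens)
)
print(df)  # search_count: 1 and 3 for the two example rows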
Related
Basically the title.
I'm relearning NLP and tried to use some data I found on Kaggle to make a "cheat sheet"; however, it has been an odyssey trying to convert the "Review" column into strings and tokenize it correctly so that I can then remove stop words.
Whenever I display the variables I set, I keep getting the same review row, even when I set the entire range of the data.
This is the code on Google Colaboratory
Comments are activated in the notebook so feel free to add your tips.
Thank you!
Use:
def filterstop(row):
    return [word for word in word_tokenize(row) if word not in vacias]
Then apply it to the column.
Demonstration:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
data = pd.DataFrame({'reviews':['the first sent', 'second sent', 'it is the third sent', 'fourth sent']})
vacias = stopwords.words('english')
def filterstop(row):
    return [word for word in word_tokenize(row) if word not in vacias]
data['tokenized reviews'] = data['reviews'].apply(filterstop)
data['tokenized reviews'].head(3)
Output:
0 [first, sent]
1 [second, sent]
2 [third, sent]
Name: tokenized reviews, dtype: object
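One caveat (an observation of mine, since the sample reviews above are already lower case): the NLTK stopword list is all lower case, so the membership test is case-sensitive. Lowercasing each review first, as in this small sketch, also removes capitalized stopwords such as "It" or "The":

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

vacias = stopwords.words('english')

def filterstop(row):
    # lower-case first so that "It", "The", ... are recognised as stopwords too
    return [word for word in word_tokenize(row.lower()) if word not in vacias]

print(filterstop('It is the THIRD sent'))  # ['third', 'sent']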
I have a data frame, df, with text, cleaned_text, and nouns as column names. text and cleaned_text contain string documents, and nouns is a list of nouns extracted from the cleaned_text column. df.shape = (1927, 3).
I am trying to calculate TF-IDF values for all documents within df only for nouns, excluding spaCy stopwords.
What I have tried:
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# Subclass to modify the stop word lists (recommended from spaCy version 3.0 onwards)
excluded_stop_words = {'down'}
included_stop_words = {'dear', 'regards'}

class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words.copy()
    stop_words -= excluded_stop_words
    stop_words |= included_stop_words

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

# Function to extract nouns from the cleaned_text column, excluding spaCy stopwords.
nlp = CustomEnglish()

def nouns(text):
    doc = nlp(text)
    return [t for t in doc if t.pos_ in ['NOUN'] and not t.is_stop and not t.is_punct]

# Calculate TF-IDF values for nouns, excluding spaCy stopwords.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = df.cleaned_text
tfidf = TfidfVectorizer(stop_words=CustomEnglish)
X = tfidf.fit_transform(documents)
What I am expecting:
I am expecting to have an output as a list of tuples ranked in descending order;
nouns = [('noun_1', tf-idf_1), ('noun_2', tf-idf_2), ...]. All nouns in nouns should match those in df.nouns (this is to check whether I am on the right track).
What is my issue?
I am confused about how to apply TfidfVectorizer so that it calculates TF-IDF values only for the nouns extracted from cleaned_text. I am also not sure whether scikit-learn's TfidfVectorizer can calculate TF-IDF the way I am expecting.
Not sure if you're still looking for a solution, but here is an option you might want to go with.
First of all, by default TF-IDF takes into account the entire set of words, not just nouns. Hence, you would need to implement a custom TF-IDF function to restrict the results to nouns. The following is a good reference on how TF-IDF works internally: https://www.askpython.com/python/examples/tf-idf-model-from-scratch
Instead of running the tf_idf function (as implemented in the above URL) over all words of a sentence/document, you can run it only on the list of nouns you've extracted, i.e. change the code from:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec
to:
def tf_idf(sentence, nouns):
    values = []
    for word in nouns:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        values.append(value)
    # return only the per-noun values (tf_idf_vec no longer exists in this version)
    return values
You now have a "values" list corresponding to the list of "nouns" for each sentence. Hope this makes sense.
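If you would rather stay inside scikit-learn instead of writing a custom TF-IDF, one alternative sketch (my own suggestion, not part of the answer above) is to restrict TfidfVectorizer's vocabulary to the nouns you already extracted and then rank the resulting weights. It assumes df.cleaned_text holds the documents and df.nouns holds lists of noun strings:

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the vocabulary from the previously extracted nouns.
nouns_vocab = sorted({str(noun) for noun_list in df.nouns for noun in noun_list})

# Only the listed nouns become features; all other words are ignored.
tfidf = TfidfVectorizer(vocabulary=nouns_vocab)
X = tfidf.fit_transform(df.cleaned_text)

# (noun, tf-idf) tuples for the first document, ranked in descending order.
feature_names = tfidf.get_feature_names_out()  # get_feature_names() on older scikit-learn
weights = X[0].toarray().ravel()
ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
print(ranked[:10])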
Case study
Task 1
Import text corpus brown
Extract the list of words associated with text collections belonging to the news genre. Store the result in the variable news_words.
Convert each word of the list news_words into lower case, and store the result in lc_news_words.
Compute the length of each word present in the list lc_news_words, and store the result in the list len_news_words.
Compute bigrams of the list len_news_words. Store the result in the variable news_len_bigrams.
Compute the conditional frequency of news_len_bigrams, where the condition and the event refer to the lengths of the words. Store the result in cfd_news.
Determine the frequency of 6-letter words appearing next to a 4-letter word.
Task 2
Compute bigrams of the list lc_news_words, and store it in the variable lc_news_bigrams.
From lc_news_bigrams, filter bigrams where both words contain only alphabet characters. Store the result in lc_news_alpha_bigrams.
Extract the list of words associated with the corpus stopwords. Store the result in stop_words.
Convert each word of the list stop_words into lower case, and store the result in lc_stop_words.
Filter only the bigrams from lc_news_alpha_bigrams where the words are not part of lc_stop_words. Store the result in lc_news_alpha_nonstop_bigrams.
Print the total number of filtered bigrams.
Task 1 passed, but Task 2 is failing. Please help me figure out where I am going wrong!
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
news_words = brown.words(categories = 'news')
lc_news_words = [word.lower() for word in news_words]
len_news_words = [len(word) for word in lc_news_words]
news_len_bigrams = nltk.bigrams(len_news_words)
cfd_news = nltk.ConditionalFreqDist(news_len_bigrams)
print(cfd_news[4][6])
lc_news_bigrams = nltk.bigrams(lc_news_words)
lc_news_alpha_bigrams = [ (w1, w2) for w1, w2 in lc_news_bigrams if w1.isalpha() and w2.isalpha()]
stop_words = stopwords.words('english')
lc_stop_words = [word.lower() for word in stop_words]
lc_news_alpha_nonstop_bigrams = [(n1, n2) for n1, n2 in lc_news_alpha_bigrams if n1 not in lc_stop_words and n2 not in lc_stop_words]
print(len(lc_news_alpha_nonstop_bigrams))
Results
With 'english' in the code, i.e. stop_words = stopwords.words('english'):
1084
17704
Without 'english', i.e. stop_words = stopwords.words():
1084
16876
The fix:
stop_words = set(stopwords.words())
Everything else was fine; just use the unique set built from the list of stopwords. Also, dropping the 'english' argument increases the number of stopwords (it returns the stopwords for all languages), and that larger set is the one to be considered here.
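As a quick sanity check (a side note, not part of the graded tasks), you can compare the two stopword collections to see why the bigram counts differ:

from nltk.corpus import stopwords

english_only = set(stopwords.words('english'))
all_languages = set(stopwords.words())

# The multilingual set is a superset of the English one,
# so it filters out more bigrams (16876 vs. 17704 above).
print(len(english_only), len(all_languages))
print(english_only <= all_languages)  # True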
I am trying to look for keywords in sentences which is stored as a list of lists. The outer list contains sentences and the inner list contains words in sentences. I want to iterate over each word in each sentence to look for keywords defined and return me the values where found.
This is what my token_sentences looks like.
I took help from this post: How to iterate through a list of lists in Python? However, I am getting an empty list in return.
This is the code I have written.
import nltk
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
text = "MDCT SCAN OF THE CHEST: HISTORY: Follow-up LUL nodule. TECHNIQUES: Non-enhanced and contrast-enhanced MDCT scans were performed with a slice thickness of 2 mm. COMPARISON: Chest CT dated on 01/05/2018, 05/02/207, 28/09/2016, 25/02/2016, and 21/11/2015. FINDINGS: Lung parenchyma: There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015). Also further increased size of two ground-glass nodules at apicoposterior segment of the LUL (SE 3; IM 37), and superior segment of the LLL (SE 3; IM 58), now measuring about 1 cm (previously size 0.4 cm in 2015), and 1.1 cm (previously size 0.7 cm in 2015) in greatest transaxial dimension, respectively."
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in nltk.sent_tokenize(text)]
nodule_keywords = ["nodules","nodule"]
count_nodule = []

def GetNodule(sentence, keyword_list):
    s1 = sentence.split(' ')
    return [i for i in s1 if i in keyword_list]

for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
    count_nodule.append(result_calcified_nod)
However, I am getting an empty list as a result in the variable count_nodule.
This is the value of the first two rows of token_sentences.
token_sentences = [['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':', 'Follow-up', 'LUL', 'nodule', '.'],['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT', 'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of', '2', 'mm', '.']]
Please help me figure out where I am going wrong!
You need to remove s1 = sentence.split(' ') from GetNodule because sentence has already been tokenized (it is already a List).
Remove the [0] from GetNodule(sub_list[0], nodule_keywords). Not sure why you would want to pass the first word of each sentence into GetNodule!
The error is here:
for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
You are looping over each sub_list in tokens_sentences, but only passing the first word sub_list[0] to GetNodule.
This type of error is fairly common, and somewhat hard to catch, because Python code which expects a list of strings will happily accept and iterate over the individual characters in a single string instead if you call it incorrectly. If you want to be defensive, maybe it would be a good idea to add something like
assert not all(len(x)==1 for x in sentence)
And of course, as @dyz notes in their answer, if you expect sentence to already be a list of words, there is no need to split anything inside the function. Just loop over the sentence.
return [w for w in sentence if w in keyword_list]
As an aside, you probably want to extend the final result with the list result_calcified_nod rather than append it.
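Putting those fixes together, a corrected sketch of the question's loop (keeping the original names, with tokens_sentences built as in the question) could look like this:

nodule_keywords = ["nodules", "nodule"]
count_nodule = []

def GetNodule(sentence, keyword_list):
    # sentence is already a list of tokens, so no splitting is needed
    return [w for w in sentence if w in keyword_list]

for sub_list in tokens_sentences:
    # pass the whole tokenized sentence (not just its first word) and
    # extend so the result stays one flat list of matched keywords
    count_nodule.extend(GetNodule(sub_list, nodule_keywords))

print(count_nodule)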
I have a set of words for which I have to check whether they are present in the documents.
WordList = [w1, w2, ..., wn]
Another set has the list of documents in which I have to check whether these words are present or not.
How do I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, with each row representing a particular document and each column holding the number of times the corresponding word from the list appears in that document?
Ok. I get it.
The code is given below:
from sklearn.feature_extraction.text import CountVectorizer
# Count the number of times each word (unigram) appears in a document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First set the vocabulary.
vectorizer = vectorizer.fit(WordList)
# Now transform the text contained in each document (Document_List is a list of text).
tfMatrix = vectorizer.transform(Document_List).toarray()
This will output the term-document matrix with features taken from WordList only.
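A slightly shorter variant (my suggestion, not required by the approach above) is to pass the word list through CountVectorizer's vocabulary parameter, which pins the features to exactly those words:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for WordList and Document_List.
WordList = ['dog', 'cat', 'cow']
Document_List = ['the dog chased the cat', 'a cow and a cat', 'no animals here']

# vocabulary= fixes the features (and their column order) to the given words.
vectorizer = CountVectorizer(vocabulary=WordList)
tfMatrix = vectorizer.fit_transform(Document_List).toarray()
print(tfMatrix)
# [[1 1 0]
#  [0 1 1]
#  [0 0 0]]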
For custom documents, you can use the CountVectorizer approach:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() #make object of Count Vectorizer
corpus = [
'This is a cat.',
'It likes to roam in the garden',
'It is black in color',
'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
#print(X) to see count given to words
vectorizer.get_feature_names() == (
    ['black', 'cat', 'color', 'does', 'dog', 'garden',
     'in', 'is', 'it', 'like', 'likes', 'not',
     'roam', 'the', 'this', 'to'])
X.toarray()
#used to convert X into numpy array
vectorizer.transform(['A new cat.']).toarray()
# Checking it for a new document
Other vectorizers can also be used, such as TfidfVectorizer. TfidfVectorizer is often a better approach, as it not only gives the number of occurrences of a word in a particular document but also reflects the importance of the word.
It is calculated from TF (term frequency) and IDF (inverse document frequency).
Term frequency is the number of times a word appears in a particular document, and IDF depends on how many documents in the corpus contain that word.
For example, if the documents are related to football, the word "the" gives no insight, but the word "messi" says something about the context of the document.
IDF is calculated as the log of the total number of documents divided by the number of documents containing the word. For example, suppose there are 10 documents, "the" appears in all 10 of them, "messi" appears in 3, and in one particular document the term frequencies are 10 and 5. Then (using a base-10 log):
tf("the") = 10
tf("messi") = 5
idf("the") = log(10/10) = 0
idf("messi") = log(10/3) ≈ 0.52
tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = tf("messi") * idf("messi") = 5 * 0.52 = 2.6
These weights help the algorithm identify the important words in the documents, which later helps to derive semantics from them.
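As a minimal sketch on the same small corpus, TfidfVectorizer can be swapped in for CountVectorizer in exactly the same way:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]

# fit_transform learns the vocabulary and IDF weights and returns the weighted matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Pair each feature with its TF-IDF weight in the first document.
weights = dict(zip(tfidf.get_feature_names_out(), X[0].toarray().ravel()))  # get_feature_names() on older scikit-learn
print(weights['cat'], weights['this'])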