Create a dataframe with NLTK synonyms - python-3.x

Good Morning,
I am using NLTK to get synonyms out of a frame of words.
col_1 col_2
Book 5
Pen 5
Pencil 6
def get_synonyms(df, column_name):
    df_1 = df["col_1"]
    for i in df_1:
        syn = wn.synsets(i)
        for synset in list(wn.all_synsets('n'))[:2]:
            print(i, "-->", synset)
            for lemma in synset.lemmas():
                ci =
And it does work, but I would like to get the following dataframe, with the first "n" synonyms, of each word in "col_1":
col_1 synonym
Book booklet
Book album
Pen cage
I tried initializing an empty list, before both synset and lemma loop, and appending, but in both cases it didn't work; for example:
synonyms = []
for lemma in synset.lemmas():
    ci =

You can use:
import pandas as pd
from nltk.corpus import wordnet
from itertools import chain

def get_synonyms(df, column_name, N):
    L = []
    for i in df[column_name]:
        syn = wordnet.synsets(i)
        # flatten all lists by chain, remove duplicates by set
        lemmas = list(set(chain.from_iterable([w.lemma_names() for w in syn])))
        for j in lemmas[:N]:
            # append to final list
            L.append([i, j])
    # create DataFrame
    return pd.DataFrame(L, columns=['word', 'syn'])

# add number of filtered synonyms
df1 = get_synonyms(df, 'col_1', 3)
print(df1)
word syn
0 Book record_book
1 Book book
2 Book Word
3 Pen penitentiary
4 Pen compose
5 Pen pen
6 Pencil pencil
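One caveat with the answer above: set() does not preserve order, so lemmas[:N] picks an arbitrary N synonyms that can differ between runs. A minimal sketch of an order-preserving alternative, assuming the names come from [w.lemma_names() for w in syn] as above; the helper name first_n_unique and the sample lists are hypothetical:

```python
from itertools import chain, islice

def first_n_unique(lists_of_names, n):
    """Return the first n distinct names, keeping their original order."""
    # dict.fromkeys drops duplicates while preserving insertion order (Python 3.7+)
    unique = dict.fromkeys(chain.from_iterable(lists_of_names))
    return list(islice(unique, n))

# Hypothetical lemma lists, standing in for [w.lemma_names() for w in syn]
sample = [["book"], ["book", "volume"], ["record", "record_book", "book"]]
print(first_n_unique(sample, 3))  # ['book', 'volume', 'record']
```

Inside get_synonyms this would replace the list(set(...)) line, with lemmas = first_n_unique([w.lemma_names() for w in syn], N).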


How to simplify text comparison for big data-set where text meaning is same but not exact - deduplicate text data

I have a text data set (menu items such as chocolate, cake, coke, etc.) of around 1.8 million records that belong to 6 different categories (A, B, C, D, E, F). One of the categories has around 700k records. Many menu items appear in categories they do not belong to; for example, cake belongs to category 'A' but is also found in categories 'B' and 'C'.
I want to identify those misclassified items and report them to the responsible personnel. The challenge is that the item name is not always correct because it is entirely human-typed text. For example, Chocolate might be entered as hot chclt, sweet choklate, chocolat, etc. There can also be items like chocolate cake ;)
To handle this, I tried a simple method that uses cosine similarity to compare items category by category and flag the anomalies, but it takes a lot of time because I am comparing each item against 1.8 million records (sample code is shown below). Can anyone suggest a better way to deal with this problem?
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def cos_similarity(a, b):
    X = a
    Y = b
    # tokenization
    X_list = word_tokenize(X)
    Y_list = word_tokenize(Y)
    # sw contains the list of stopwords
    sw = stopwords.words('english')
    l1 = []; l2 = []
    # remove stop words from the string
    X_set = {w for w in X_list if not w in sw}
    Y_set = {w for w in Y_list if not w in sw}
    # form a set containing keywords of both strings
    rvector = X_set.union(Y_set)
    for w in rvector:
        if w in X_set: l1.append(1)  # create a vector
        else: l1.append(0)
        if w in Y_set: l2.append(1)
        else: l2.append(0)
    c = 0
    # cosine formula
    for i in range(len(rvector)):
        c += l1[i] * l2[i]
    if float((sum(l1) * sum(l2)) ** 0.5) > 0:
        cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
    else:
        # avoid division by zero when either string has no keywords left
        cosine = 0
    return cosine
# Base code
cos_sim_list = []
for i in category_B.index:
    ln_cosdegree = 0
    ln_degsem = []
    for j in category_A.index:
        ln_j = str(category_A['item_name'][j])
        ln_i = str(category_B['item_name'][i])
        degreeOfSimilarity = cos_similarity(ln_j, ln_i)
        if degreeOfSimilarity>0.5:
Assume the text has already been cleaned.
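Because the vectors in cos_similarity contain only 0s and 1s, the cosine reduces to |X ∩ Y| / sqrt(|X| · |Y|), so the loop over rvector is not needed. A minimal sketch of that shortcut; the function name set_cosine is hypothetical, and plain str.split stands in for word_tokenize and stop-word removal:

```python
from math import sqrt

def set_cosine(a, b):
    """Cosine similarity of two strings treated as sets of words."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    # the dot product of two 0/1 vectors is the size of the intersection
    return len(x & y) / sqrt(len(x) * len(y))

print(set_cosine("hot chocolate", "chocolate cake"))  # 0.5
```

This only trims the cost of each comparison; the number of pairwise comparisons stays the same.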
I used nearest neighbors (KNeighbor) and cosine similarity to solve this case. I still have to run the code several times to compare category against category, but that is workable because there are only a few categories. Please suggest a better solution if one is available.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

cat_A_clean = category_A['item_name'].unique()
print('Vectorizing the data - this could take a few minutes for large datasets...')
# ngrams is a custom analyzer function that is not shown in this post
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(cat_A_clean)
print('Vectorizing completed...')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
unique_B = set(category_B['item_name'].values)

def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

import time
t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_B)
t = time.time() - t1
print("COMPLETED IN:", t)
unique_B = list(unique_B)
print('finding matches...')
matches = []
for i, j in enumerate(indices):
    # cat_A_clean is a NumPy array, so index it directly with the neighbor position
    temp = [round(distances[i][0], 2), cat_A_clean[j[0]], unique_B[i]]
    matches.append(temp)
print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)','ITEM_A','ITEM_B'])
def clean_string(text):
    text = str(text)
    text = text.lower()

def cosine_sim_vectors(vec1, vec2):
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

def cos_similarity(sentences):
    cleaned = list(map(clean_string, sentences))
    vectorizer = CountVectorizer().fit_transform(cleaned)
    vectors = vectorizer.toarray()

cos_sim_list = []
for ind in matches.index:
    a = matches['Match confidence (lower is better)'][ind]
    b = matches['ITEM_A'][ind]
    c = matches['ITEM_B'][ind]
    degreeOfSimilarity = cos_similarity([b, c])
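The vectorizer above relies on an ngrams analyzer that the post does not show. One plausible shape for it, offered as an assumption rather than the author's actual function, is a character n-gram splitter; character n-grams tolerate typos such as chocolat vs. chocolate better than whole-word tokens. The overlap helper is also hypothetical and only illustrates the effect:

```python
import re

def ngrams(text, n=3):
    """Split a string into overlapping character n-grams (hypothetical analyzer)."""
    text = re.sub(r"[^a-z0-9 ]", "", str(text).lower())
    # collapse whitespace and pad with one space so word boundaries form n-grams too
    text = " " + re.sub(r"\s+", " ", text).strip() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def overlap(a, b):
    """Jaccard overlap of the two n-gram sets, between 0 and 1."""
    x, y = set(ngrams(a)), set(ngrams(b))
    return len(x & y) / len(x | y) if x | y else 0.0

print(ngrams("ab"))                       # [' ab', 'ab ']
print(overlap("chocolate", "chocolat"))   # 0.7
print(overlap("chocolate", "cake"))       # 0.0
```

A function of this shape can be passed as analyzer=ngrams to TfidfVectorizer, which accepts a callable analyzer.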

How to remove key error from the program?

I am building a recommendation algorithm. As you can see in the image, my data has an ID column, but I still get a key error:
KeyError: <built-in function id>
I am sharing the link to the article.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
ds = pd.read_csv("test1.csv") #you can plug in your own list of products or movies or books here as csv file#
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
#ngram explanation begins#
#ngram (1,3) can be explained as follows#
#ngram(1,3) encompasses uni gram, bi gram and tri gram
#consider the sentence "The ball fell"
#ngram (1,3) would be the, ball, fell, the ball, ball fell, the ball fell
#ngram explanation ends#
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {} # dictionary created to store the result in a dictionary format (ID :
for idx, row in ds.iterrows(): #iterates through all the rows
    # the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending
    # order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and 1 means perfect similarity#
    similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
    #stores 5 most similar books, you can change it as per your needs
    similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
    results[row['ID']] = similar_items[1:]
#below code 'function item(id)' returns a row matching the id along with Book Title. Initially it is a dataframe, then we convert it to a list#
def item(id):
    return ds.loc[ds['ID'] == id]['Book Title'].tolist()[0]
def recommend(id, num):
    if (num == 0):
        print("Unable to recommend any book as you have not chosen the number of book to be
    elif (num==1):
        print("Recommending " + str(num) + " book similar to " + item(id))
    else :
        print("Recommending " + str(num) + " books similar to " + item(id))
    recs = results[id][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
#the first argument in the below function to be passed is the id of the book, second argument is the number of books you want to be recommended#
The code runs successfully up to the results variable; after that I get the error.
This happens because id is the name of a Python built-in function, and that built-in is what ends up being used as the dictionary key.
Declare a different identifier instead of id; I think this is the only reason for the KeyError.
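As the error suggests, id is a built-in function in Python 3. The message KeyError: <built-in function id> means the built-in function itself was used as the dictionary key, which happens when recommend is called with the bare name id rather than an actual ID value. If you rename the id parameter in def item(id) and def recommend(id, num), update all references, and call the function with a real ID, the code should work.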
After changing the id and correcting the indentation, an example could look like this:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
ds = pd.read_csv("test1.csv") # you can plug in your own list of products or movies or books here as csv file
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
# ngram explanation begins#
# ngram (1,3) can be explained as follows#
# ngram(1,3) encompasses uni gram, bi gram and tri gram
# consider the sentence "The ball fell"
# ngram (1,3) would be the, ball, fell, the ball, ball fell, the ball fell
# ngram explanation ends#
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {} # dictionary created to store the result in a dictionary format (ID : (Score,item_id))
for idx, row in ds.iterrows(): # iterates through all the rows
    # the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending
    # order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and
    # 1 means perfect similarity
    similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
    # [:-5:-1] keeps the 4 highest-scoring entries (the first is the book itself); change the slice as per your needs
    similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
    results[row['ID']] = similar_items[1:]
# below code 'function item(ID)' returns a row matching the id along with Book Title. Initially it is a dataframe,
# then we convert it to a list#
def item(ID):
    return ds.loc[ds['ID'] == ID]['Book Title'].tolist()[0]

def recommend(ID, num):
    if num == 0:
        print("Unable to recommend any book as you have not chosen the number of book to be recommended")
    elif num == 1:
        print("Recommending " + str(num) + " book similar to " + item(ID))
    else:
        print("Recommending " + str(num) + " books similar to " + item(ID))
    recs = results[ID][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# the first argument in the below function to be passed is the id of the book, second argument is the number of books
# you want to be recommended
recommend(5, 2)
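To see where KeyError: <built-in function id> comes from, here is a minimal sketch that reproduces it without pandas. The results dictionary and its contents are hypothetical stand-ins for the one built above:

```python
# Hypothetical stand-in for the results dictionary: {book ID: [(score, other ID), ...]}
results = {5: [(0.42, 7), (0.31, 2)]}

def recommend(book_id, num):
    return results[book_id][:num]

# Passing a real ID works
print(recommend(5, 1))  # [(0.42, 7)]

# Passing the bare name id hands over Python's built-in function,
# which is not a key in the dictionary
try:
    recommend(id, 1)
except KeyError as err:
    message = repr(err)
print(message)
```

The second call fails because id was never assigned a value, so the name resolves to the built-in function.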

not produce empty list of lists in pandas

1) I have the following code to create a df
import pandas as pd
word_list = ['crayons', 'cars', 'camels']
l = ['there are many different crayons in the bright blue box',
'i like a lot of sports cars because they go really fast',
'the middle east has many camels to ride and have fun']
df = pd.DataFrame(l, columns=['Text'])
0 there are many different crayons in the bright blue box
1 i like a lot of sports cars because they go really fast
2 the middle east has many camels to ride and have fun
2) And I have the following code to create a function
def find_next_words(row, word_list):
    sentence = row[0]
    # trigger words are the elements in the word_list
    trigger_words = []
    next_words = []
    last_words = []
    for keyword in word_list:
        words = sentence.split()
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                # get the 3 words that follow trigger word
                next_words.append(words[index + 1:index + 4])
                # get the 3 words that come before trigger word
                last_words.append(words[index - 1:index - 4])
    return pd.Series([trigger_words, last_words, next_words], index = ['TriggerWords','LastWords', 'NextWords'])
3) This function uses the words in the word_list from above to find the 3 words that come before and after the "trigger_words" in the word_list
4) I then use the following code
df = df.join(df.apply(lambda x: find_next_words(x, word_list), axis=1))
5) And it produces the following df, which is close to what I want
Text TriggerWords LastWords NextWords
0 there are many different crayons [crayons] [[]] [[in, the, bright]]
1 i like a lot of sports cars [cars] [[]] [[because, they, go]]
2 the middle east has many camels [camels] [[]] [[to, ride, and]]
6) However, the LastWords column is an empty list of lists [[]]. I think the problem is this line of code, last_words.append(words[index - 1:index - 4]), taken from the find_next_words function above.
7) This is a bit confusing to me because the NextWords column uses very similar code, next_words.append(words[index + 1:index + 4]), taken from the same function, and it works.
8) How do I fix my code so it does not produce the empty list of lists [[]] and instead it gives me the 3 words that come before the words in the word_list?
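I think it should be words[max(index - 4, 0):max(index - 1, 0)] in the code. The original slice words[index - 1:index - 4] is empty because its start is greater than its stop, and a slice with the default step only returns elements when the start comes before the stop.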
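Note that a slice ending at index - 1 stops one word short of the trigger word. A minimal sketch, assuming the three words wanted are the ones immediately before the trigger; the helper name words_around is hypothetical:

```python
def words_around(sentence, keyword, n=3):
    """Return the n words before and after each occurrence of keyword."""
    words = sentence.split()
    before, after = [], []
    for index, word in enumerate(words):
        if word == keyword:
            # max(..., 0) keeps the start from going negative near the sentence start
            before.append(words[max(index - n, 0):index])
            after.append(words[index + 1:index + 1 + n])
    return before, after

sentence = 'there are many different crayons in the bright blue box'
print(sentence.split()[3:0])  # [] because the start is greater than the stop
print(words_around(sentence, 'crayons'))
# ([['are', 'many', 'different']], [['in', 'the', 'bright']])
```

Using enumerate over all words also covers a trigger word in the last position, which range(0, len(words) - 1) skips.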

Extract top words for each cluster

I have done K-means clustering for text data
#K-means clustering
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
km.fit(features)
clusters = km.labels_.tolist()
where features is the tf-idf vector
#preprocessing text - converting to a tf-idf vector form
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=0.01,max_df=0.75, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.keywrds).toarray()
labels = df.CD
Then I added the cluster label to original dataset
df['clusters'] = clusters
And indexed the dataframe by clusters
pd.DataFrame(df,index = [clusters])
How do I fetch the top words for each cluster?
This does not really give the top words in each cluster; it orders the words in each row by frequency. You can then use the first word as a group label instead of a cluster number.
Build a dict with all feature names and their idf scores:
from collections import defaultdict
featurenames = defaultdict(list)
# on scikit-learn 1.0+, use tfidf.get_feature_names_out() instead
for f, w in zip(tfidf.get_feature_names(), tfidf.idf_):
    featurenames[len(f.split(' '))].append((f, w))
featurenames = dict(featurenames[1])
rounded off feature idf values cause they were a little long
featurenames = dict(zip(featurenames.keys(), [round(v, 4) for v in featurenames.values()]))
converted dict to df
dffeatures = pd.DataFrame.from_dict(featurenames, orient='index').reset_index() \
.rename(columns={'index': 'featurename',0:'featureid'})
dffeatures = dffeatures.round(4)
combined feature word with id and created a new dictionary. I did this to accommodate for duplicate id's.
dffeatures['combined'] = dffeatures.apply(lambda x:'%s:%s' % (x['featureid'],x['featurename']),axis=1)
featurenamesnew = pd.Series(dffeatures.combined.values, index=dffeatures.featurename).to_dict()
{'cat': '2.3863:cat', 'cow': '3.0794:cow', 'dog': '2.674:dog'....}
created a new col in the df and replaced all word with idf:feature value
df['temp'] = df['inputdata'].replace(featurenamesnew, regex=True)
ordered the df idf:feature value ascending so most frequent words appear first
df['temp'] = df['temp'].str.split().apply(lambda x: sorted(set(x), reverse=False)).str.join(' ').to_frame()
reverse map idf:featurevalue with the words
inv_map = {v: k for k, v in featurenamesnew.items()}
df['cluster_top_n_words'] = df['temp'].replace(inv_map, regex=True)
finally keep top n words in the new df col
df['cluster_top_n_words'] = df['cluster_top_n_words'].apply(lambda x: ' '.join(x.split()[:3]))
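A different, commonly used way to answer the original question is to rank the terms of each K-means centroid by weight, since each centroid coordinate corresponds to one tf-idf feature. A minimal sketch with hand-written numbers; the function name top_terms_per_cluster and the sample data are hypothetical. In the question's setup, centers would be km.cluster_centers_ and terms the vectorizer's feature names:

```python
def top_terms_per_cluster(centers, terms, n=3):
    """Map each cluster index to its n highest-weighted terms."""
    top = {}
    for cluster_id, center in enumerate(centers):
        # rank feature positions by their weight in this centroid, largest first
        ranked = sorted(range(len(terms)), key=lambda i: center[i], reverse=True)
        top[cluster_id] = [terms[i] for i in ranked[:n]]
    return top

# Hypothetical feature names and centroid weights
terms = ["cat", "cow", "dog", "milk"]
centers = [
    [0.9, 0.0, 0.7, 0.1],
    [0.0, 0.8, 0.1, 0.6],
]
print(top_terms_per_cluster(centers, terms, n=2))
# {0: ['cat', 'dog'], 1: ['cow', 'milk']}
```

This assumes the model was fitted on the same matrix the vectorizer produced, so that column i of each centroid lines up with term i.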

Python (NLTK) - more efficient way to extract noun phrases?

I've got a machine learning task involving a large amount of text data. I want to identify, and extract, noun-phrases in the training text so I can use them for feature construction later on in the pipeline.
I've extracted the type of noun-phrases I wanted from text but I'm fairly new to NLTK, so I approached this problem in a way where I can break down each step in list comprehensions like you can see below.
But my real question is, am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?
import nltk
import pandas as pd
# raw string so the backslashes are not read as escape sequences
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
# label is a method on NLTK 3 trees, so it has to be called
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label() == 'NP')] for tree in phrases]
# flatten the list of lists of lists of tuples that we've ended up with, into
# just a list of lists of tuples
leaves = [tupls for sublists in leaves for tupls in sublists]
# Join the extracted terms into one bigram
nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]
Take a look at "Why is my NLTK function slow when processing the DataFrame?": there's no need to iterate through all rows multiple times if you don't need the intermediate steps.
With ne_chunk and the solutions from "NLTK Named Entity recognition to a Python list" and "How can I extract GPE(location) using NLTK ne_chunk?":
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # reset the buffer once a chunk has been closed
            current_chunk = []
    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks((sent)))
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
To use the custom RegexpParser :
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
# Defining a grammar & Parser
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # reset the buffer once a chunk has been closed
            current_chunk = []
    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
I suggest referring to this prior thread:
Extracting all Nouns from a text file using nltk
They suggest using TextBlob as the easiest way to achieve this (if not the one that is most efficient in terms of processing) and the discussion there addresses your question.
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
The above methods didn't give me the required results. Following is the function that I would suggest
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import re
def get_noun_phrases(text):
    pos = pos_tag(word_tokenize(text))
    count = 0
    half_chunk = ""
    for word, tag in pos:
        if re.match(r"NN.*", tag):
            count += 1
            if count >= 1:
                half_chunk = half_chunk + word + " "
        else:
            # a non-noun token ends the current run of nouns
            half_chunk = half_chunk + "---"
            count = 0
    half_chunk = re.sub(r"-+", "?", half_chunk).split("?")
    half_chunk = [x.strip() for x in half_chunk if x != ""]
    return half_chunk
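The idea in get_noun_phrases, joining runs of consecutive noun-tagged tokens, can also be sketched with itertools.groupby. This is an alternative formulation, not the answerer's code; the function name noun_runs is hypothetical and the tags are hand-written for illustration, whereas in practice they would come from pos_tag(word_tokenize(text)):

```python
from itertools import groupby

def noun_runs(tagged):
    """Join each run of consecutive NN* tokens into one phrase."""
    phrases = []
    # groupby splits the sequence wherever the "is a noun" flag changes
    for is_noun, group in groupby(tagged, key=lambda pair: pair[1].startswith("NN")):
        if is_noun:
            phrases.append(" ".join(word for word, _ in group))
    return phrases

# Hand-written (word, tag) pairs for illustration
tagged = [("machine", "NN"), ("learning", "NN"), ("is", "VBZ"),
          ("fun", "JJ"), ("for", "IN"), ("data", "NNS"), ("people", "NNS")]
print(noun_runs(tagged))  # ['machine learning', 'data people']
```

Unlike the string-based version, this keeps a hyphenated token such as "e-mail" intact, because no separator characters are inserted into the text.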
The Constituent-Treelib library, which can be installed via pip install constituent-treelib, does exactly what you are looking for in a few lines of code. To extract noun (or any other) phrases, perform the following steps.
from constituent_treelib import ConstituentTree
# First, we have to provide a sentence that should be parsed
sentence = "I've got a machine learning task involving a large amount of text data."
# Then, we define the language that should be considered with respect to the underlying models
language = ConstituentTree.Language.English
# You can also specify the desired model for the language ("Small" is selected by default)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Next, we must create the necessary NLP pipeline.
# If you wish, you can instruct the library to download and install the models automatically
nlp = ConstituentTree.create_pipeline(language, spacy_model_size) # , download_models=True
# Now, we can instantiate a ConstituentTree object and pass it the sentence and the NLP pipeline
tree = ConstituentTree(sentence, nlp)
# Finally, we can extract the phrases
tree.extract_all_phrases()
{'S': ["I 've got a machine learning task involving a large amount of text data ."],
'PP': ['of text data'],
'VP': ["'ve got a machine learning task involving a large amount of text data",
'got a machine learning task involving a large amount of text data',
'involving a large amount of text data'],
'NML': ['machine learning'],
'NP': ['a machine learning task involving a large amount of text data',
'a machine learning task',
'a large amount of text data',
'a large amount',
'text data']}
If you only want the noun phrases, just pick them out with tree.extract_all_phrases()['NP']
['a machine learning task involving a large amount of text data',
'a machine learning task',
'a large amount of text data',
'a large amount',
'text data']
