word not in vocabulary - python-3.x

This is my first time using word2vec, and the file I am working with is in XML format. I want to iterate through the patents to find each Title, then apply word2vec to see if there are similar words (to indicate similar titles).
So far I have parsed the XML file using ElementTree to retrieve each title, then applied sent_tokenize followed by TweetTokenizer to return a list of sentences where each word has been tokenized (not sure if this was the best method). I then passed the tokenized sentences into my word2vec model and tested with one word to see if it returned a vector. This seems to only work for a word in the first sentence, so I'm not sure it is recognising all the sentences.
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()

for child in root.iter("Title"):
    Patent_Title = child.text
    sentence = Patent_Title
    stopWords = set(stopwords.words('english'))
    tokens = nltk.sent_tokenize(sentence)
    print(tokens)

    tokenizer_words = TweetTokenizer()
    tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
    #print(tokens_sentences)

    model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
    words = list(model.wv.vocab)
    print(words)
    print(model['Solar'])
I would expect it to identify the word 'Solar' in a sentence and print out the vector, and then I could look for similar words. Instead I am receiving the error:
"word 'Solar' not in vocabulary"

Just handle the error as an exception when it first occurs in the loop:
# print(model['Solar'])
try:
    print(model['Solar'])
except Exception as e:
    pass
Working code:
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize

tree = ET.parse('6785.xml')
root = tree.getroot()

for child in root.iter("Title"):
    Patent_Title = child.text
    sentence = Patent_Title
    stopWords = set(stopwords.words('english'))
    tokens = nltk.sent_tokenize(sentence)
    print(tokens)

    tokenizer_words = TweetTokenizer()
    tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
    #print(tokens_sentences)

    model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
    words = list(model.wv.vocab)
    print(words)

    try:
        print(model['Solar'])
    except Exception as e:
        pass

It is simply because 'Solar' is not in your corpus.
Word2Vec generates word vectors only for the words that appear in your tokens_sentences. If the training corpus did not include the word/token you try to look up, word2vec has no vector for that word, which is why you got the error.
Advice: make your text data case-insensitive, i.e. lower-case all the text before training and before looking words up (upper case would work too, but lower case is the convention).
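A minimal sketch of that advice, reusing the tokens_sentences list and the gensim 3.x API from the question (size= rather than the newer vector_size=):

lowered_sentences = [[token.lower() for token in sent] for sent in tokens_sentences]
model = gensim.models.Word2Vec(lowered_sentences, min_count=1, size=32)

if 'solar' in model.wv.vocab:          # query with the lower-cased form
    print(model.wv['solar'])
    print(model.wv.most_similar('solar'))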

Related

Deal with Out of vocabulary word with Gensim pretrained GloVe

I am working on an NLP assignment and loaded the GloVe vectors provided by Gensim:
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-twitter-25')
I am trying to get the word embedding for each word in a sentence, but some of them are not in the vocabulary.
What is the best way to deal with this when working with the Gensim API?
Thanks!
Load the model:
import gensim.downloader as api
model = api.load("glove-twitter-25")  # load the GloVe vectors
# model.most_similar("cat")  # show words similar to 'cat'
There is a very simple way to find out if the words exist in the model's vocabulary.
print('Word exists' if word in model.wv.vocab else 'Word does not exist')  # in gensim 4.x: word in model.key_to_index
Apart from that, I have used the following logic to create a sentence embedding (25-dim) from N tokens:
import numpy as np
from Levenshtein import ratio as lev_ratio

def vocab_check(model, word):
    # Among the model's nearest neighbours of the word, keep the one whose
    # spelling is closest (highest Levenshtein ratio) and return the
    # similarity to it as the word's weight.
    similar_words = model.most_similar(word)
    match_ratio = 0.
    match_word = ''
    for sim_word, sim_score in similar_words:
        ratio = lev_ratio(word, sim_word)
        if ratio > match_ratio:
            match_ratio = ratio      # remember the best ratio seen so far
            match_word = sim_word
    if match_word == '':
        return similar_words[0][1]
    return model.similarity(word, match_word)

def sentence2vector(model, sent, dim=25):
    # Weighted sum of the word vectors of the sentence, returned as float16.
    words = [w.strip() for w in sent.split(' ')]
    emb = [model[w] for w in words]
    weights = [1. if w in model.wv.vocab else vocab_check(model, w) for w in words]
    if len(emb) == 0:
        sent_vec = np.zeros(dim, dtype=np.float16)
    else:
        sent_vec = np.dot(weights, emb)
        sent_vec = sent_vec.astype("float16")
    return sent_vec
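If all you need is a usable sentence vector, a simpler and more robust fallback (not from the answer above, just a common pattern) is to skip out-of-vocabulary tokens and average the rest. A minimal sketch against the glove-twitter-25 vectors loaded earlier, using gensim 3.x attribute names (in gensim 4.x the membership test becomes w in model.key_to_index):

import numpy as np

def average_sentence_vector(model, sentence, dim=25):
    # Keep only tokens the model knows and average their vectors;
    # unknown tokens are simply dropped.
    vectors = [model[w] for w in sentence.lower().split() if w in model.vocab]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

sent_vec = average_sentence_vector(model, 'the cat sat on the mat')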

Remove punctuation and stop words from a data frame

My data frame looks like this:

State            text
Delhi            170 kw for330wp, shipping and billing in delhi...
Gujarat          4kw rooftop setup for home Photovoltaic Solar...
Karnataka        language barrier no requirements 1kw rooftop ...
Madhya Pradesh   Business PartnerDisqualified Mailed questionna...
Maharashtra      Rupdaypur, panskura(r.s) Purba Medinipur 150kw...
I want to remove punctuation and stop words from this data frame. I have written the following code, but it is not working:
import nltk
nltk.download('stopwords')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
import re

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

df['text'] = df['text'].apply(message_cleaning)
AttributeError: 'set' object has no attribute 'words'
Problem: I believe you have a name conflict for stopwords. There is probably a line somewhere in your notebook where you assign:
stopwords = stopwords.words("english")
That would explain the issue: after such an assignment, stopwords refers to the resulting collection of words (your traceback shows a set) rather than the nltk.corpus module, so stopwords.words(...) raises AttributeError.
Solution: make things unambiguous.
First, assign the stop words to their own variable (that will also be faster than calling stopwords.words() every time):
from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))
Use that in your function:
Test_punc_removed_join_clean = [
    word for word in Test_punc_removed_join.split()
    if word.lower() not in english_stop_words
]
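Putting it together, the cleaned-up function might look like this (a sketch based on the question's message_cleaning, with the stop-word set hoisted out of the function; df is the question's data frame):

import string
from nltk.corpus import stopwords

english_stop_words = set(stopwords.words("english"))

def message_cleaning(message):
    # Drop punctuation characters, then drop stop words.
    no_punc = ''.join(char for char in message if char not in string.punctuation)
    return [word for word in no_punc.split() if word.lower() not in english_stop_words]

df['text'] = df['text'].apply(message_cleaning)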

NameError: name 'clean_text' is not defined

I am learning how to implement NLP, so I started with data cleaning and am now trying to vectorize the data using bag-of-words. This is my code:
import pandas as pd
import numpy as np
import string
import re
import nltk
stopword=nltk.corpus.stopwords.words('english')
wn=nltk.WordNetLemmatizer()
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer(analyzer=clean_text)
x_count=count_vect.fit_transform(lematizing_words)
print(x_count.shape)
But when I run this code I get the following error:
NameError: name 'clean_text' is not defined
How can I solve this?
I have referred to this blog for the NLP implementation.
def cleanText(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

stopwords = nltk.corpus.stopwords.words('english')
Here is the function that Badreesh put on GitHub but that is not in the blog. Note that it is named cleanText while your CountVectorizer refers to clean_text, and it expects a stemmer ps (e.g. ps = nltk.PorterStemmer()) to be defined beforehand.
The error message describes your problem well. What is clean_text? Have you defined a clean_text function, or imported the Python module containing it?
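For completeness, a minimal sketch of wiring such a function into CountVectorizer (hypothetical names: raw_texts stands in for whatever list or pandas column of documents you are vectorizing, and the Porter stemmer is an assumption):

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

ps = PorterStemmer()                     # assumed stemmer
stop_words = stopwords.words('english')

def clean_text(text):
    # Lower-case, strip punctuation, split on non-word characters, stem, drop stop words.
    text = "".join([ch.lower() for ch in text if ch not in string.punctuation])
    tokens = re.split(r'\W+', text)
    return [ps.stem(word) for word in tokens if word and word not in stop_words]

raw_texts = ["This is a sample document.", "Another small example text!"]  # hypothetical input
count_vect = CountVectorizer(analyzer=clean_text)  # pass the function itself, not a string
x_count = count_vect.fit_transform(raw_texts)
print(x_count.shape)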

How to calculate word coverage in the gutenberg corpus using the Python library nltk?

Compute the word coverage of all file IDs associated with the text corpus gutenberg.
What is the right code for this?
import nltk
from nltk.corpus import gutenberg
from decimal import Decimal

for fileid in gutenberg.fileids():
    n_chars = len(gutenberg.raw(fileid))
    n_words = len(gutenberg.words(fileid))
    print(round(Decimal(n_chars / n_words), 7), fileid)
Word coverage here is the ratio of the total number of words in a file to the number of distinct words, i.e. how many times each unique word is used on average:

import nltk
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    total_unique_words = len(set(gutenberg.words(fileid)))
    total_words = len(gutenberg.words(fileid))
    print(total_words / total_unique_words, fileid)

How do I remove stopwords from amazon_baby.csv in python [duplicate]

This question already has an answer here:
Why is my NLTK function slow when processing the DataFrame?
(1 answer)
Closed 4 years ago.
I want to remove stop words and punctuation from Amazon_baby.csv.

import pandas as pd

data = pd.read_csv('amazon_baby.csv')
data.fillna(value='', inplace=True)
data.head()

import string
from nltk.corpus import stopwords

def text_process(msg):
    no_punc = [char for char in msg if char not in string.punctuation]
    no_punc = ''.join(no_punc)
    return [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]

data['review'].apply(text_process)
This code executes on up to 10k rows, but if I apply it to the entire dataset the kernel always shows as busy and the cell never finishes executing.
Please help with this.
Find the data set here.
You are processing the data character by character, which is extremely slow.
It is also because of the vast size of the data (~183,531 rows): each row has to be processed individually, which pushes the complexity towards O(n²).
I have implemented a slightly different approach using word_tokenize below:

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_punctuation_and_stopwords(msg):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(msg)
    # keep tokens that are neither stop words nor punctuation
    filtered_words = [w for w in word_tokens if w not in stop_words and w not in string.punctuation]
    new_sentence = ' '.join(filtered_words)
    return new_sentence

I tried running it for 6 minutes and it processed 136,322 rows; I am sure that with around 10 minutes it would have completed successfully.
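The biggest cost in the question's text_process is that stopwords.words('english') is re-read for every single word and the text is filtered character by character. As a hedged sketch (not from either answer), precomputing the stop-word set once and stripping punctuation with str.translate is usually much faster:

import string
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))             # built once, not per word
PUNCT_TABLE = str.maketrans('', '', string.punctuation)  # deletes punctuation characters

def text_process_fast(msg):
    # Strip punctuation in one pass, then drop stop words.
    no_punc = msg.translate(PUNCT_TABLE)
    return [w for w in no_punc.split() if w.lower() not in STOP_WORDS]

# data['review'] = data['review'].apply(text_process_fast)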
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def text_clean(msg):
    tokens = word_tokenize(msg)
    tokens = [w.lower() for w in tokens]
    stop_words = set(stopwords.words('english'))
    # keep tokens that are neither punctuation nor stop words
    no_punc_and_stop_words = [w for w in tokens if w not in string.punctuation and w not in stop_words]
    return no_punc_and_stop_words
