I am learning how to implement NLP, so I started with data cleaning and now I am trying to vectorize the data using bag-of-words. This is my code:
import pandas as pd
import numpy as np
import string
import re
import nltk
stopword=nltk.corpus.stopwords.words('english')
wn=nltk.WordNetLemmatizer()
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer(analyzer=clean_text)
x_count=count_vect.fit_transform(lematizing_words)
print(x_count.shape)
But when I run this code, I get the following error:
NameError: name 'clean_text' is not defined
How can I solve this?
I have referred to this blog for the NLP implementation.
ps = nltk.PorterStemmer()  # stemmer the function relies on (not defined in the original snippet)
stopwords = nltk.corpus.stopwords.words('english')

def cleanText(text):
    # lowercase and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # split on any non-word character
    tokens = re.split(r'\W+', text)
    # stem and remove stop words
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text
Here is the function that Badreesh put on GitHub but that is not in the blog.
The error message describes your problem well. What is clean_text? Have you defined a clean_text function, or imported the Python module containing it?
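For example, a minimal sketch of how the pieces could fit together, assuming the cleanText function from the blog (renamed clean_text to match the vectorizer call) is what was intended; lematizing_words stands for your already-cleaned corpus:

import re
import string
import nltk
from sklearn.feature_extraction.text import CountVectorizer

ps = nltk.PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')

def clean_text(text):
    # lowercase, strip punctuation, split, stem, drop stop words
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    return [ps.stem(word) for word in tokens if word not in stopwords]

# the function must exist before it is handed to the vectorizer
count_vect = CountVectorizer(analyzer=clean_text)
x_count = count_vect.fit_transform(lematizing_words)
print(x_count.shape)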
I am doing everything in a single Jupyter notebook file.
I am trying to predict a new store description's category using a logistic regression classification model and a count vectorizer.
All the code below is in sequence, whether used or unused.
Below is my code:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(stop_words='english', ngram_range=(1,1))
X_train_cv=cv.fit_transform(X_train.values.astype('str'))
X_test_cv=cv.transform(X_test.values.astype('str'))
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(solver='lbfgs')
lr.fit(X_train_cv,y_train)
y_pred_cv=lr.predict(X_test_cv)
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred_cv,target_names=['electronics','fashion','F&B','services']))
# I never use this code below, as I am not working across 2 notebooks
import os
import pickle
from datetime import datetime
model_path=['drive','mydrive','I125','models']
time=datetime.now().strftime("%Y-%m-%d")
filename='lr-{}.pkl'.format(time)
templist=[]
templist.append(filename)
path1=os.sep.join(model_path+templist)
filename='countvectorizer-{}.pkl'.format(time)
templist=[]
templist.append(filename)
path2=os.sep.join(model_path+templist)
with open(path1, 'wb') as f1:
    pickle.dump(lr, f1)
with open(path2, 'wb') as f2:
    pickle.dump(cv, f2)
I am trying to predict a new description using the classifier I currently have, but I only know how to use the current classifier to predict a new description when it is in a separate notebook.
This is the code I have to predict a new description:
# I never use this code below, as I am not working across 2 notebooks
import os
import pickle
from google.colab import drive
drive.mount('/content/drive')
model_path = ['drive', 'mydrive', 'I125', 'models']
lr_filename = ['lr-2022-10-10.pkl']
cv_filename = ['countvectorizer-2022-10-10.pkl']
path2 = os.sep.join(model_path + cv_filename)
with open(path2, 'rb') as f:
    trained_cv = pickle.load(f)
path1 = os.sep.join(model_path + lr_filename)
with open(path1, 'rb') as f:
    model = pickle.load(f)
# I used this code below
import re
import string
def preprocess(text):
    pattern_alphanumeric = r"\w*\d\w*"  # tokens containing digits
    pattern_punctuation = "[" + re.escape(string.punctuation) + "]"
    text = re.sub(pattern_alphanumeric, '', text)
    text = re.sub(pattern_punctuation, '', text).lower()
    return text
new_text="This clothes so nice"
new_text_processed=preprocess(new_text)
def encode_text_to_vector(cv, text):
    text_vector = cv.transform([text])
    return text_vector

new_text_vector = encode_text_to_vector(trained_cv, new_text_processed)  # <-- line with error
print(new_text_vector)
Error:
trained_cv is undefined. (trained_cv is supposed to be the saved count vectorizer, and model the saved logistic regression, if I had used a different Jupyter notebook.)
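Since everything is in one notebook, one option (a minimal sketch, reusing the cv and lr objects already fitted above instead of the pickled copies) is to skip the save/load round-trip entirely:

# cv and lr are still in memory in the same notebook,
# so the new description can be encoded and classified directly
new_text = "This clothes so nice"
new_text_processed = preprocess(new_text)
new_text_vector = cv.transform([new_text_processed])
prediction = lr.predict(new_text_vector)
print(prediction)

The pickle files only matter when a later session (or another notebook) needs the fitted objects back; in that case the loading block has to actually run before trained_cv and model exist.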
My data frame looks like this:
State text
Delhi 170 kw for330wp, shipping and billing in delhi...
Gujarat 4kw rooftop setup for home Photovoltaic Solar...
Karnataka language barrier no requirements 1kw rooftop ...
Madhya Pradesh Business PartnerDisqualified Mailed questionna...
Maharashtra Rupdaypur, panskura(r.s) Purba Medinipur 150kw...
I want to remove punctuation and stop words from this data frame. I wrote the following code, but it's not working:
import nltk
nltk.download('stopwords')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import string
import re
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
%matplotlib inline
def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean
df['text'] = df['text'].apply(message_cleaning)
AttributeError: 'set' object has no attribute 'words'
Problem: I believe you have a name conflict for stopwords. There is probably a line somewhere in your notebook where you assign:
stopwords = set(stopwords.words("english"))
That would explain the issue: the name stopwords becomes ambiguous, referring to your variable and not the nltk module anymore (and a set has no .words attribute, hence the error).
Solution: make things unambiguous.
First, assign a distinctly named variable for the stop words (as a bonus, that is faster than calling stopwords.words('english') for every row):
from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))
Use that in your function:
Test_punc_removed_join_clean = [
word for word in Test_punc_removed_join.split()
if word.lower() not in english_stop_words
]
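Put together, a sketch of the full fix with the original function (names unchanged):

from nltk.corpus import stopwords
import string

english_stop_words = set(stopwords.words("english"))  # computed once, reused for every row

def message_cleaning(message):
    # strip punctuation characters, then drop stop words
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    return [word for word in Test_punc_removed_join.split() if word.lower() not in english_stop_words]

df['text'] = df['text'].apply(message_cleaning)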
I have a document in a text file, shown below. I wanted to apply tf-idf to it, but I get the error shown below. I am not sure where the int object in my file is; why would it throw this error?
Env:
Jupyter notebook, Python 3.7
Error:
AttributeError: 'int' object has no attribute 'lower'
file.txt:
Random person from the random hill came to a running mill and I have a count of the hill. This is my house.
A person is from a great hill and he loves to run a mill.
Sub-disciplines of biology are defined by the research methods employed and the kind of system studied: theoretical biology uses mathematical methods to formulate quantitative models while experimental biology performs empirical experiments.
The objects of our research will be the different forms and manifestations of life, the conditions and laws under which these phenomena occur, and the causes through which they have been effected. The science that concerns itself with these objects we will indicate by the name biology.
Code:
import pandas as pd
import spacy
import csv
import collections
import sys
import itertools
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
from nltk.tokenize import sent_tokenize
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.stem import PorterStemmer
data = pd.read_csv('file.txt', sep="\n", header=None)
data.dtypes
# 0    object
# dtype: object

data.shape
# (4, 1)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(X)
I solved it by reading the file like this:
with open('file.txt') as f:
    lines = [line.rstrip() for line in f]
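For reference, the likely root cause: iterating over a pandas DataFrame yields its column labels, and with header=None the only column label is the integer 0, which the vectorizer then tries to call .lower() on. So an alternative sketch is to pass the column of documents rather than the whole frame:

# pass the Series of strings, not the DataFrame itself
X = vectorizer.fit_transform(data[0])
print(X)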
Following the documentation of gensim.models.ldamodel, I want to train an LdaModel and (following this SO answer) create a wordcloud from it. I am using the following code from both sources:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
import gensim
import matplotlib.pyplot as plt
from wordcloud import WordCloud
common_dictionary = Dictionary(common_texts) # create corpus
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
lda = gensim.models.LdaModel(common_corpus, num_topics=10) # train model on corpus
for t in range(lda.num_topics):
    plt.figure()
    plt.imshow(WordCloud().fit_words(lda.show_topic(t, 200)))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()
However, I get an AttributeError: 'list' object has no attribute 'items' on the line plt.imshow(...)
Can someone help me out here? (Answers to similar questions have not been working for me and I am trying to compile a minimal pipeline with this.)
From the docs, the method WordCloud.fit_words() expects a dictionary as input.
Your error seems to highlight that it's looking for an attribute 'items', typically an attribute of dictionaries, but instead finds a list object.
So the problem is: lda.show_topic(t, 200) returns a list instead of a dictionary. Use dict() to cast it!
Finally:
plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))
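With that one change, the loop from the question becomes:

for t in range(lda.num_topics):
    plt.figure()
    # show_topic returns [(word, prob), ...]; dict() turns it into {word: prob}
    plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()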
First time using word2vec, and the file I am working with is in XML format. I want to iterate through the patents to find each title, then apply word2vec to see if there are similar words (to indicate similar titles).
So far I have parsed the XML file using ElementTree to retrieve each title, then I have applied sent_tokenize followed by TweetTokenizer to return a list of sentences in which each word has been tokenized (not sure if this was the best method). I then put the tokenized sentences into my word2vec model and tested with one word to see if it returned a vector. This only seems to work for a word in the first sentence, so I'm not sure it is recognising all the sentences.
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text

sentence = Patent_Title
stopWords = set(stopwords.words('english'))
tokens = nltk.sent_tokenize(sentence)
print(tokens)

tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
#print(tokens_sentences)

model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
words = list(model.wv.vocab)
print(words)
print(model['Solar'])
I would expect it to identify the word 'Solar' in a sentence and print out the vector, and then I could look for similar words. Instead I am receiving the error:
KeyError: "word 'Solar' not in vocabulary"
Just handle the error as an exception when it first occurs in the loop:
# print(model['Solar'])
try:
    print(model['Solar'])
except Exception as e:
    pass
Working code:
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text

sentence = Patent_Title
stopWords = set(stopwords.words('english'))
tokens = nltk.sent_tokenize(sentence)
print(tokens)

tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
#print(tokens_sentences)

model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
words = list(model.wv.vocab)
print(words)

try:
    print(model['Solar'])
except Exception as e:
    pass
It is simply because Solar is not in your corpus.
Word2Vec generates word vectors only for the words that appear in your tokens_sentences. If the training corpus didn't include the word/token you look up, word2vec has no vector for it, and that is why you got the error.
Advice: make your text data case-insensitive, i.e. lower-case all the text (upper case works too, but it is not the convention).
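A minimal sketch of that advice, reusing the names from the question's code:

# lower-case every token before training so lookups are case-insensitive
tokens_sentences = [[word.lower() for word in sent] for sent in tokens_sentences]
model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
print(model['solar'])  # look up the lower-cased form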