Remove punctuation and stop words from a data frame - python-3.x

My data frame looks like this:
State text
Delhi 170 kw for330wp, shipping and billing in delhi...
Gujarat 4kw rooftop setup for home Photovoltaic Solar...
Karnataka language barrier no requirements 1kw rooftop ...
Madhya Pradesh Business PartnerDisqualified Mailed questionna...
Maharashtra Rupdaypur, panskura(r.s) Purba Medinipur 150kw...
I want to remove punctuation and stop words from this data frame. I have written the following code, but it's not working:
import nltk
nltk.download('stopwords')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
%matplotlib inline
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re
def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

df['text'] = df['text'].apply(message_cleaning)
AttributeError: 'set' object has no attribute 'words'

Problem: I believe you have a name conflict for stopwords. There is probably a line somewhere in your notebook where you assign:
stopwords = stopwords.words("english")
That would explain the issue: the name stopwords becomes ambiguous, and you end up referring to the variable rather than the nltk.corpus module.
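A minimal reproduction of the shadowing (the traceback mentions a set, so presumably the reassignment wrapped the list in set()):
from nltk.corpus import stopwords
stopwords = set(stopwords.words("english"))  # the name now refers to a set, not the corpus reader
stopwords.words("english")                   # AttributeError: 'set' object has no attribute 'words'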
Solution: Make things unambiguous:
First, assign a variable holding the stop words (that will also be faster than calling stopwords.words() every time):
from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))
Use that in your function:
Test_punc_removed_join_clean = [
    word for word in Test_punc_removed_join.split()
    if word.lower() not in english_stop_words
]
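Putting it together, a sketch of the full cleaning step with the unambiguous name (assuming df is the data frame from the question):
import string
from nltk.corpus import stopwords

english_stop_words = set(stopwords.words("english"))

def message_cleaning(message):
    # drop punctuation characters, then filter out English stop words
    no_punct = "".join(char for char in message if char not in string.punctuation)
    return [word for word in no_punct.split() if word.lower() not in english_stop_words]

df['text'] = df['text'].apply(message_cleaning)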

Related

Unsatisfactory output from Tf-Idf

I have a document in a text file, shown below. I wanted to apply tf-idf to it and I get the error shown below. I am not sure where the int object is in my file; why would it throw this error?
Env:
Jupyter notebook, Python 3.7
Error:
AttributeError: 'int' object has no attribute 'lower'
file.txt:
Random person from the random hill came to a running mill and I have a count of the hill. This is my house.
A person is from a great hill and he loves to run a mill.
Sub-disciplines of biology are defined by the research methods employed and the kind of system studied: theoretical biology uses mathematical methods to formulate quantitative models while experimental biology performs empirical experiments.
The objects of our research will be the different forms and manifestations of life, the conditions and laws under which these phenomena occur, and the causes through which they have been effected. The science that concerns itself with these objects we will indicate by the name biology.
Code:
import pandas as pd
import spacy
import csv
import collections
import sys
import itertools
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
from nltk.tokenize import sent_tokenize
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.stem import PorterStemmer
data = pd.read_csv('file.txt', sep="\n", header=None)
data.dtypes
0 object
dtype: object
data.shape
(4, 1)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(X)
I solved it by reading the file like this:
with open('file.txt') as f:
    lines = [line.rstrip() for line in f]
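For context, a sketch of why this works: iterating over a DataFrame yields its column labels (here the integer 0, since header=None), which is what TfidfVectorizer tried to call .lower() on. Passing the list of lines (or the column data[0]) gives it actual documents:
from sklearn.feature_extraction.text import TfidfVectorizer

with open('file.txt') as f:
    lines = [line.rstrip() for line in f]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(lines)  # iterates over documents, not column labels
print(X.shape)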

NameError: name 'clean_text' is not defined

I am learning how to implement NLP, so I started with data cleaning and now I am trying to vectorize the data using bag-of-words. This is my code:
import pandas as pd
import numpy as np
import string
import re
import nltk
stopword=nltk.corpus.stopwords.words('english')
wn=nltk.WordNetLemmatizer()
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer(analyzer=clean_text)
x_count=count_vect.fit_transform(lematizing_words)
print(x_count.shape)
But when I run this code I get the following error:
NameError: name 'clean_text' is not defined
How can I solve this?
I have referred to this blog for the NLP implementation.
ps = nltk.PorterStemmer()  # the stemmer used inside the function (missing from the original snippet)
stopwords = nltk.corpus.stopwords.words('english')

def cleanText(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text
Here is the function that Badreesh put on GitHub but which is not in the blog.
The error message describes your problem well. What is clean_text? Have you defined a clean_text function, or imported the Python module that contains it?
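For completeness, a minimal sketch of wiring the pieces together (assuming lematizing_words is the list of documents from the question); the name passed to analyzer= must match a function that actually exists in your namespace:
from sklearn.feature_extraction.text import CountVectorizer

# reuse the cleaning function defined above; pass the function object itself
count_vect = CountVectorizer(analyzer=cleanText)
x_count = count_vect.fit_transform(lematizing_words)
print(x_count.shape)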

How to calculate word coverage in the Gutenberg corpus with the Python library NLTK?

Compute the word coverage of all file IDs associated with the text corpus gutenberg.
What is the right code for this?
import nltk
from nltk.corpus import gutenburg
from decimal import Decimal
for fileid in gutenburg.fileids():
    n_chars = len(gutenburg.raw(fileid))
    n_words = len(gutenburg.words(fileids))
    print(round(Decimal(n_chars/n_words), 7), fileids)
import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    total_unique_words = len(set(gutenberg.words(fileid)))
    total_words = len(gutenberg.words(fileid))
    print(total_words/total_unique_words, fileid)
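If you also want the rounding the question attempted, a sketch using Decimal as in the original code (word coverage here meaning total words divided by distinct words per file):
from decimal import Decimal
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    total_words = len(gutenberg.words(fileid))
    total_unique_words = len(set(gutenberg.words(fileid)))
    # average number of occurrences of each distinct word in the file
    print(round(Decimal(total_words) / Decimal(total_unique_words), 7), fileid)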

When I applied RandomForest in Python, ValueError: Found input variables with inconsistent numbers of samples: [2883, 1236]

File "D:\Users\Watson Rockstar\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError:
Found input variables with inconsistent numbers of samples: [2883, 1236]
This dataset has 4119 rows in total; Xtrain shape = (2883, 18), Xtest shape = (1236, 18).
I have tried to use LabelEncoder and OneHotEncoder to solve the problem, but it did not help:
# Ignore the warnings
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')
# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
import missingno as msno
#configure
# sets matplotlib to inline and displays graphs below the corressponding cell.
#import the necessary modelling algos.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
telebanking = pd.read_csv('bank-additional.csv')
telebank = telebanking.drop(['duration','default'],axis =1)
def transform(feature):
    le = LabelEncoder()
    telebank[feature] = le.fit_transform(telebank[feature])
    print(le.classes_)

cat_telebank = telebank.select_dtypes(include='object')
cat_telebank.columns
for col in cat_telebank.columns:
    transform(col)
scaler=StandardScaler()
scaled_telebank=scaler.fit_transform(telebank.drop('y',axis=1))
X=scaled_telebank
Y=telebank['y'].as_matrix()
Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,Y,test_size=0.3)
def compare(model):
    clf = model
    clf.fit(Xtrain,Ytrain)
    pred = clf.predict(Xtrain)
    acc.append(accuracy_score(pred,Ytest))
    prec.append(precision_score(pred,Ytest))
    rec.append(recall_score(pred,Ytest))
    auroc.append(roc_auc_score(pred,Ytest))

acc=[]
prec=[]
rec=[]
auroc=[]
models=[RandomForestClassifier(),DecisionTreeClassifier()]
model_names=['RandomForestClassifier','DecisionTreeClassifier']
for model in range(len(models)):
    compare(models[model])
d={'Modelling Algo':model_names,'Accuracy':acc,'Precision':prec,'Recall':rec,'Area Under ROC Curve':auroc}
met_telebank=pd.DataFrame(d)
met_telebank
The traceback at the top is the detail of the first error.
The unpacking order
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3)
is correct for scikit-learn. The mismatch comes from the compare function: pred = clf.predict(Xtrain) produces 2883 predictions, which are then scored against Ytest's 1236 labels. Predict on the test set instead:
pred = clf.predict(Xtest)
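A minimal sketch of the corrected compare function (keeping the question's metric lists, and assuming the sklearn.metrics imports noted above):
def compare(model):
    clf = model
    clf.fit(Xtrain, Ytrain)
    pred = clf.predict(Xtest)  # predict on the test set so prediction and label counts match
    acc.append(accuracy_score(Ytest, pred))
    prec.append(precision_score(Ytest, pred))
    rec.append(recall_score(Ytest, pred))
    auroc.append(roc_auc_score(Ytest, pred))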

word not in vocabulary

This is my first time using word2vec, and the file I am working with is in XML format. I want to iterate through the patents to find each Title, then apply word2vec to see if there are similar words (to indicate similar titles).
So far I have parsed the XML file using ElementTree to retrieve each title, then applied sent_tokenize followed by TweetTokenizer to return a list of sentences where each word has been tokenized (not sure if this was the best method). I then put the tokenized sentences into my word2vec model and tested with one word to see if it returned a vector. This seems to work only for a word in the first sentence, so I'm not sure it is recognising all the sentences.
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
Patent_Title = child.text
sentence = Patent_Title
stopWords = set(stopwords.words('english'))
tokens = nltk.sent_tokenize(sentence)
print(tokens)
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
#print(tokens_sentences)
model = gensim.models.Word2Vec(tokens_sentences, min_count=1,size=32)
words = list(model.wv.vocab)
print(words)
print(model['Solar'])
I would expect it to identify the word 'solar' in a sentence and print out the vector, so that I could then look for similar words. Instead I am receiving the error:
KeyError: "word 'Solar' not in vocabulary"
Just handle the error as an exception on its first occurrence in the loop.
# print(model['Solar'])
try:
    print(model['Solar'])
except Exception as e:
    pass
Working code:
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text
    sentence = Patent_Title
    stopWords = set(stopwords.words('english'))
    tokens = nltk.sent_tokenize(sentence)
    print(tokens)
    tokenizer_words = TweetTokenizer()
    tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
    #print(tokens_sentences)
    model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
    words = list(model.wv.vocab)
    print(words)
    try:
        print(model['Solar'])
    except Exception as e:
        pass
It is simply because Solar is not in your corpus.
Word2Vec tries to generate word vectors for each word in your tokens_sentences. If the training corpus didn't include the word/token that you try to look up, word2vec would not have the word vector for that word and that is why you got an error.
Advice: make your text data case-insensitive. That is, lower-case all the text (upper case works too, but it is not the convention).
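Following that advice, a small sketch (assuming tokens_sentences is the list of tokenized titles built above, and the gensim 3.x API used in the question):
# lower-case the tokens before training so lookups are case-insensitive
tokens_sentences = [[w.lower() for w in sent] for sent in tokens_sentences]
model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)

if 'solar' in model.wv.vocab:
    print(model.wv['solar'])               # vector for 'solar'
    print(model.wv.most_similar('solar'))  # words with the most similar vectors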
