Error in applying lemmatization - python-3.x

Why am I getting this error? Please help.
I am a newbie to machine learning.
This is my code; I'm applying lemmatization to the 20 newsgroups dataset.
The code aims to get the 500 words with the highest counts while applying filtering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

def letters_only(astr):
    return astr.isalpha()

cv = CountVectorizer(stop_words="english", max_features=500)
groups = fetch_20newsgroups()
cleaned = []
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

for post in groups.data:
    cleaned.append(' '.join([lemmatizer.lemmatize(word.lower()
                             for word in post.split()
                             if letters_only(word) and word not in all_names)]))

transformed = cv.fit_transform(cleaned)
print(cv.get_feature_names())
Error:
Traceback (most recent call last):
File "<ipython-input-91-7158a74bae71>", line 18, in <module>
for word in post.split()
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
forms = apply_rules([form])
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
for form in forms
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
if form.endswith(old)]
AttributeError: 'generator' object has no attribute 'endswith'

I'm not sure why, but turning that one-liner into a regular for loop solved the problem:
for post in groups.data:
    for word in post.split():
        if letters_only(word) and word not in all_names:
            cleaned.append(' '.join([lemmatizer.lemmatize(word.lower())]))
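The underlying cause, going by the traceback, is that the closing parenthesis of lemmatize() sits after the if clause, so the whole generator expression is passed to lemmatize() as word, hence 'generator' object has no attribute 'endswith'. Note also that the loop version above appends one word per list entry rather than one cleaned post. A minimal sketch of the original one-liner with the parenthesis moved, so each word is lemmatized individually and each post stays a single string:

for post in groups.data:
    cleaned.append(' '.join(
        lemmatizer.lemmatize(word.lower())
        for word in post.split()
        if letters_only(word) and word not in all_names
    ))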

Related

Why am I getting "KeyError: 'Context'" in Python 3.8 (Anaconda/Spyder)

I'm trying to test a text-normalisation function from an AI chatbot tutorial I'm following (https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc, under the 'Steps involved' section), but I keep getting KeyError: 'Context' when I copy this line from the tutorial into Spyder.
I've re-read the tutorial and carefully spell-checked my imports to see if I've missed anything, but I still haven't figured out why the key is missing, so I was hoping someone here could please help.
My code
import pandas as pd
import nltk
from nltk import pos_tag # for parts of speech
from nltk import word_tokenize # to create tokens
from nltk.stem import wordnet # to perform lemmatization
from nltk.corpus import stopwords # for stop words to end program
import numpy as np
import re
from sklearn.metrics import pairwise_distances # to perform cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer # to perform tfidf
from sklearn.feature_extraction.text import CountVectorizer # to perform bow
df=pd.read_excel(r'C:\Users\mecha\Documents\Comp Sci - Year 3\ISYS30221 - Artificial Intel\New Try - AI with revisions\dialog_talk_agent.xlsx') # excel file of predetermined questions and answers
df.ffill(axis = 0, inplace=True) # fills all null values with previous value in dataset (NaN = null values)
df1 = df.head(10)
def step1(x):
    for i in x:
        a=str(i).lower()
        p=re.sub(r'[^a-z0-9]', ' ', a)
        print(p)
Code snippet I run in the console after running the earlier code
step1(df1['Context'])
Error feedback in the console
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2646, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Context'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-8-6335e79211e5>", line 1, in <module>
step1(df1['Context'])
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Context'
I've researched on KnowledgeHut and I understand that the KeyError means my program can't find a 'Context' key, but I've been following a fairly recent tutorial closely, so I can't tell why I'm getting the error. Or maybe it's because I'm missing some library?
I was hoping someone here could help me out with this while I learn the basics before starting my AI chatbot project for school.
If you look at the top of the tutorial page, the Excel file has the column names 'Context' and 'Text Response', so those are probably missing from your file or spelled differently; that's the only way this lookup wouldn't work.
step1(df1) should then just work fine.
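As a quick sanity check (a minimal sketch, reusing the df loaded above), print the column names pandas actually read from the spreadsheet and compare them with 'Context':

# Inspect the headers pandas read from the Excel file (df from the code above)
print(df.columns.tolist())
# A stray space such as 'Context ' would also raise KeyError: 'Context';
# stripping whitespace from the headers guards against that:
df.columns = df.columns.str.strip()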

KeyError: '[...] not in index' occurs when train/test sets are split manually into two files

I get the error KeyError: '[...] not in index' when running an sklearn/hyperopt regression example on my dataset.
I have seen other answers to this problem where the solution was, e.g., that X_train should be set with X_train = X.iloc[train_indices] and the lack of .iloc was the issue. But in my case I have manually split my dataset into two files, so I don't need to do any slicing or indexing myself. I used a separate script to split a large dataset into a train-set file and a test-set file. These files have no index columns and contain only numeric values. If you are wondering about the dataset, it is the UCI protein physicochemical properties (CASP) dataset.
from hpsklearn import HyperoptEstimator, any_regressor, xgboost_regression
from sklearn.datasets import load_iris
from hyperopt import tpe
import numpy as np
import pandas as pd
# Download the data and split into training and test sets
X_train = pd.read_csv('data2/CASP_train.csv')
X_test = pd.read_csv('data2/CASP_test.csv')
y_train = X_train['Y']
y_test = X_test['Y']
X_train.drop('Y',axis=1,inplace=True)
X_test.drop('Y',axis=1,inplace=True)
print(list(X_test))
#X_train.drop(list(X_train)[0],axis=1,inplace=True)
#X_test.drop(list(X_test)[0],axis=1,inplace=True)
print(list(X_test))
print(X_train)
# Instantiate a HyperoptEstimator with the search space and number of evaluations
estim = HyperoptEstimator(regressor=xgboost_regression('xgreg'),
                          preprocessing=('my_pre'),
                          algo=tpe.suggest,
                          max_evals=100,
                          trial_timeout=120)
estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
print(estim.best_model())
The full traceback is as follows:
Traceback (most recent call last):
File "PRSAXGB.py", line 30, in <module>
estim.fit(X_train, y_train)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 783, in fit
fit_iter.send(increment)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 693, in fit_iter
return_argmin=False, # -- in case no success so far
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 389, in fmin
show_progressbar=show_progressbar,
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/base.py", line 643, in fmin
show_progressbar=show_progressbar)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 408, in fmin
rval.exhaust()
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 262, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 227, in run
self.serial_evaluate()
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 141, in serial_evaluate
result = self.domain.evaluate(spec, ctrl)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/base.py", line 848, in evaluate
rval = self.fn(pyll_rval)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 656, in fn_with_timeout
raise fn_rval[1]
KeyError: '[ 0 1 2 ... 29264 29265 29266] not in index'
The solution was to pass NumPy arrays instead of pandas objects:
estim.fit(X_train.values, y_train.values)
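A plausible explanation (an assumption about hpsklearn internals, not confirmed here): the estimator carves out an internal validation split by indexing X with a plain integer array, which pandas interprets as a label lookup on a DataFrame, producing the '[0 1 2 ...] not in index' KeyError; handing over the underlying NumPy arrays sidesteps the pandas lookup. A minimal sketch:

# Hand hpsklearn plain NumPy arrays instead of pandas objects
estim.fit(X_train.values, y_train.values)            # or X_train.to_numpy()
print(estim.score(X_test.values, y_test.values))
print(estim.best_model())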

TypeError: a bytes-like object is required, not 'str' (file opening)

I've been unsuccessful with adding a "b" prefix for byte strings in front of all the strings, even though that should solve the problem according to lots of older threads.
Here's my code:
# -*- coding: utf-8 -*-
__author__ = 'chrispaulson'
import numpy as np
import math as m
import os
import pickle
import pyKriging

class samplingplan():
    def optimallhc(self, n, population=30, iterations=30, generation=False):
        if not generation:
            # Check for existing LHC sampling plans
            if os.path.isfile(b'{0}lhc_{1}_{2}.pkl'.format(self.path, self.k, n)):
                # codecs.open(filename,'r',encoding='utf8')
                X = pickle.load(open(b'{0}lhc_{1}_{2}.pkl'.format(self.path, self.k, n), 'rb'))
                return X
            else:
                print(self.path)
                print('SP not found on disk, generating it now.')
When calling the optimallhc method of the samplingplan class from the main code,
import pyKriging
from pyKriging.krige import kriging
from pyKriging.samplingplan import samplingplan
# The Kriging model starts by defining a sampling plan, we use an optimal Latin Hypercube here
sp = samplingplan(2)
X = sp.optimallhc(20)
it returns the following traceback message:
Traceback (most recent call last):
File "<ipython-input-1-e9ff50d9f93a>", line 1, in <module>
runfile('C:/Andreas Luckert/Kriging-plot-Ehingen.py', wdir='C:/Andreas Luckert')
File "C:\Users\Müller\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "C:\Users\Müller\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Andreas Luckert/Kriging-plot-Ehingen.py", line 14, in <module>
X = sp.optimallhc(20)
File "C:\Users\Müller\Anaconda3\lib\site-packages\pyKriging\samplingplan.py", line 72, in optimallhc
if os.path.isfile(b'{0}lhc_{1}_{2}.pkl'.format(self.path,self.k, n)):
AttributeError: 'bytes' object has no attribute 'format'
As shown above, I already added the "b" prefix for bytes handling everywhere, in various combinations as well, but it never worked.
I would appreciate it very much if somebody knew what to do about it.
Thanks in advance!
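For what it's worth, the new traceback is telling: in Python 3, format() is a str method, not a bytes method, so the b prefix cannot work in these calls. os.path.isfile() and open() both accept ordinary str paths, so the format strings can stay plain strings; the original "a bytes-like object is required, not 'str'" error usually points at a file handle opened in the wrong mode (for example unpickling through a handle not opened with 'rb'), not at the path strings. A small illustration, not pyKriging's actual code:

import os
import pickle

template = '{0}lhc_{1}_{2}.pkl'             # keep this a normal str, no b prefix
path = template.format('some/dir/', 2, 20)  # str.format works
# b'{0}lhc_{1}_{2}.pkl'.format(...)         # AttributeError: 'bytes' object has no attribute 'format'

if os.path.isfile(path):                    # str paths are fine here
    with open(path, 'rb') as fh:            # binary mode is what pickle.load expects
        X = pickle.load(fh)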

Why is the following tfidf vectorization failing?

Hello, I am running the following experiment. First I created a vectorizer:
tfidf_vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,3),analyzer='word',max_features=500)
Then I vectorized the following list:
tfidf = tfidf_vectorizer.fit_transform(listComments)
My list of comments looks as follows:
listComments = ["hello this is a test","the car is red",...]
I tried to save the model as follows:
#Saving tfidf
with open('vectorizerTFIDF.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
I would like to use my vectorizer to apply the same tfidf to the following list:
lastComment = ["this is a car"]
Opening Model:
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)
vector = tdf.transform(lastComment)
However I am getting:
Traceback (most recent call last):
File "C:/Users/LDA_test/ldaTest.py", line 141, in <module>
vector = tdf.transform(lastComment)
File "C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 559, in __getattr__
raise AttributeError(attr + " not found")
AttributeError: transform not found
I hope someone can help me with this issue. Thanks in advance.
You've pickled the vectorized matrix, not the vectorizer itself; you need pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL).
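A minimal sketch of the intended round trip (assuming listComments is the list shown above): pickle the fitted vectorizer, then reload it and call transform() on new text.

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 3), analyzer='word', max_features=500)
tfidf = tfidf_vectorizer.fit_transform(listComments)   # sparse matrix -- this is what was pickled by mistake

with open('vectorizerTFIDF.pickle', 'wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)   # pickle the fitted vectorizer

with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)

vector = tdf.transform(["this is a car"])   # transform() now exists on the loaded object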

nltk stemmer: string index out of range

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django app view.
However, when stemming the documents inside the Django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. For example, running the following:
# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')
raises the mentioned error:
Traceback (most recent call last):
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
response = self.process_exception_by_middleware(e, request)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
s.stem('oed')
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
stem = self._step1b(stem)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
lambda stem: (self._measure(stem) == 1 and
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
if suffix == '*d' and self._ends_double_consonant(word):
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
word[-1] == word[-2] and
IndexError: string index out of range
Now what is really odd is that running the same stemmer on the same string outside Django (be it a separate Python file or an interactive Python console) produces no error. In other words:
# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')
followed by:
python test.py
# successfully prints 'o'
What is causing this issue?
This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.
I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running
pip install -U nltk
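Since the error shows up only inside the Django view, one possibility (an assumption, not something stated in the question) is that the Django project's Anaconda environment and the standalone script are using different NLTK versions; a quick check in both places narrows this down:

import nltk
print(nltk.__version__)   # run this both inside the Django view's environment and alongside test.py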
I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:
>>> rule
(u'at', u'ate', None)
>>> word
u'o'
At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.
If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:
def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)
As far as I can see, the len(word) < 2 check is missing in the new version.
Changing _ends_double_consonant() to something like this should work:
def _ends_double_consonant(self, word):
    """Implements condition *d from the paper
    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
I just proposed this change in the related NLTK issue.
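Either way (upgrading NLTK or patching the method), a quick check is that the problematic string now stems without raising, matching the behaviour the question saw outside Django:

from nltk.stem.porter import PorterStemmer
print(PorterStemmer().stem('oed'))   # expected output: 'o'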
