nltk stemmer: string index out of range - nlp

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django app view.
However, when stemming the documents inside the Django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. For example, running the following:
# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')
raises the mentioned error:
Traceback (most recent call last):
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
response = self.process_exception_by_middleware(e, request)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
s.stem('oed')
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
stem = self._step1b(stem)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
lambda stem: (self._measure(stem) == 1 and
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
if suffix == '*d' and self._ends_double_consonant(word):
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
word[-1] == word[-2] and
IndexError: string index out of range
Now, what is really odd is that running the same stemmer on the same string outside Django (be it a separate Python file or an interactive Python console) produces no error. In other words:
# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')
followed by:
python test.py
# successfully prints 'o'
What is causing this issue?

This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.
I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running
pip install -U nltk
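To confirm the upgrade took effect, a minimal sanity check along these lines should do (just a sketch; it assumes nothing beyond a standard NLTK install):
import nltk
from nltk.stem.porter import PorterStemmer

# On 3.2.3 or later this should print the version and a stem for 'oed'
# without raising IndexError.
print(nltk.__version__)
print(PorterStemmer().stem('oed'))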

I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:
>>> rule
(u'at', u'ate', None)
>>> word
u'o'
At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.
If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:
def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)
As far as I can see, the len(word) < 2 check is missing in the new version.
Changing _ends_double_consonant() to something like this should work:
def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
I just proposed this change in the related NLTK issue.
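If you're stuck on 3.2.2 and can't patch the installed package, a defensive wrapper is a possible stopgap; this is only a sketch, and safe_stem is a hypothetical helper, not part of NLTK:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def safe_stem(token):
    # Fall back to the raw token if the 3.2.2 stemmer trips over a short
    # intermediate stem, as it does for 'oed'.
    try:
        return stemmer.stem(token)
    except IndexError:
        return token

print(safe_stem('oed'))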

Related

I am getting a KeyError raised in my Python script. How can I resolve this?

I am hoping someone can help me with this. After having a nightmare installing numpy on a Raspberry Pi, I am stuck again!
The gist of what I am trying to do: I have an Arduino that sends numbers (bib race numbers entered by hand) over LoRa to the RX of the Raspberry Pi.
This script is supposed to read the incoming data and print it so I can see it in the terminal. Pandas is then supposed to compare the number against a txt/csv file, and if it matches a value in the bib number column, append the matched row to a new file.
Now, the first bit (capturing the data and printing it) works, and on my Windows PC the second bit worked when I was testing with a fixed number rather than incoming data.
I have basically tried my best to mash the two together so the incoming number is compared instead.
I should also state that the error happened after I pressed 3 on the Arduino (which then printed on the terminal of the Raspberry Pi before erroring), which is probably why it is KeyError: '3'.
My code is here:
#!/usr/bin/env python3
import serial
import csv
import pandas as pd
#import numpy as np

if __name__ == '__main__':
    ser = serial.Serial('/dev/ttyS0', 9600, timeout=1)
    ser.flush()

    while True:
        if ser.in_waiting > 0:
            line = ser.readline().decode('utf-8').rstrip()
            print(line)
            with open("test_data.csv", "a") as f:
                writer = csv.writer(f, delimiter=",")
                writer.writerow([line])
            df = pd.read_csv("data.txt")
            #out = (line)
            filtered_df = df[line]
            print('Original Dataframe\n---------------\n', df)
            print('\nFiltered Dataframe\n------------------\n', filtered_df)
            filtered_df.to_csv("data_amended.txt", mode='a', index=False, header=False)
            #print(df.to_string())
And my error is here:
Python 3.7.3 (/usr/bin/python3)
>>> %Run piserialmashupv1.py
3
Traceback (most recent call last):
File "/home/pi/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '3'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/pi/piserialmashupv1.py", line 20, in <module>
filtered_df = df[line]
File "/home/pi/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 3455, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/pi/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: '3'
>>>
I have been asked to post the first few lines of data.txt:
_id,firstname,surname,team,info
1, Peter,Smith,,Red Walk (70 miles- 14 mile walk/run + 56 mile cycle)
2, Samantha,Grey,Team Grey,Blue walk (14 mile walk/run)
3, Gary,Parker,,Red Walk (70 miles- 14 mile walk/run + 56 mile cycle)
I think it must be the way I am referencing the incoming rx number?
Any help very much appreciated!
Dave
I have it working, see the final code below.
I think Pandas just didn't like the way the data was being input originally.
This fixes it. I also had to make sure it knew it was dealing with an integer when filtering; in my first attempt I didn't, and it couldn't filter the data properly.
import serial
import csv
import time
import pandas as pd

if __name__ == '__main__':
    ser = serial.Serial('/dev/ttyS0', 9600, timeout=1)
    ser.flush()

    while True:
        if ser.in_waiting > 0:
            line = ser.readline().decode('utf-8').rstrip()
            print(line)
            with open("test_data.txt", "w") as f:
                writer = csv.writer(f, delimiter=",")
                writer.writerow([line])
            time.sleep(0.1)
            ser.write("Y".encode())
            df = pd.read_csv("data.txt")
            out = df['_id'] == int(line)
            filtered_df = df[out]
            print('Original Dataframe\n---------------\n', df)
            print('\nFiltered Dataframe\n---------\n', filtered_df)
            filtered_df.to_csv("data_amended.txt", mode='a',
                               index=False, header=False)
            time.sleep(0.1)
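To spell out the key change: df[line] treats the serial value as a column label, which is why it raised KeyError: '3', whereas comparing the _id column to int(line) builds a boolean mask that selects matching rows. A stripped-down sketch, with line hard-coded to '3' as a stand-in for the serial input:
import pandas as pd

line = "3"  # stand-in for the value read from the serial port

df = pd.read_csv("data.txt")

# df[line] would look up a *column* named "3" and raise KeyError.
# A boolean mask on the _id column selects the matching *row* instead.
filtered_df = df[df['_id'] == int(line)]
print(filtered_df)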

Editing PDF metadata fields with Python3 and pdfrw

I'm trying to edit the metadata Title field of PDFs, to include the ASCII equivalents when possible. I'm using Python 3 and the pdfrw module.
How can I do string operations that replace the metadata fields?
My test code is here:
from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata

def edit_title_metadata(inpdf):
    trailer = PdfReader(inpdf)
    # this statement is breaking pdfrw
    trailer.Info.Title = unicode_normalize(trailer.Info.Title)
    # also have tried:
    #trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":
    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
And the traceback is:
Traceback (most recent call last):
File "get_metadata.py", line 68, in <module>
main()
File "get_metadata.py", line 54, in main
edit_title_metadata(pdf)
File "get_metadata.py", line 11, in edit_title_metadata
trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
File "get_metadata.py", line 18, in unicode_normalize
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types
Notes:
This issue at GitHub may be related.
FWIW, I'm also getting the same error with Python 3.6.
I've shared the PDF (which has non-ASCII hyphens, unicode char \u2010):
wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf
You have to use the .decode() method on the metadata fields:
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
And the full working code:
from pdfrw import PdfReader, PdfWriter
import unicodedata

def edit_title_metadata(inpdf):
    trailer = PdfReader(inpdf)
    trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":
    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
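As a quick way to confirm the normalized title actually made it into the output, you can read the file back with pdfrw (a small sketch, assuming test.pdf was just written by the code above):
from pdfrw import PdfReader

# Read back the file written above and inspect the rewritten Title field.
check = PdfReader("test.pdf")
print(check.Info.Title)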

Error while using chatterbot

I'm trying to make a simple QnA program using Python chatterbot.
# -*- coding: utf-8 -*-
from chatterbot import ChatBot

bot = ChatBot(
    "SQLMemoryTerminal",
    storage_adapter='chatterbot.storage.SQLStorageAdapter',
    logic_adapters=[
        {
            "import_path": "chatterbot.logic.BestMatch",
            "statement_comparison_function":
                "chatterbot.comparisons.levenshtein_distance"
        },
        {
            'import_path': 'chatterbot.logic.LowConfidenceAdapter',
            'threshold': 0.3,
            'default_response': "Sorry. I can not find the exact answer."
        },
        'chatterbot.logic.multi_adapter.MultiLogicAdapter',
    ],
    input_adapter="chatterbot.input.TerminalAdapter",
    output_adapter="chatterbot.output.TerminalAdapter",
    read_only=True
)

print("input question")

while True:
    try:
        print("Q : ", end="")
        bot_input = bot.get_response(None)
    except (KeyboardInterrupt, EOFError, SystemExit):
        break
However, when I try to use the MultiLogicAdapter functionality built into chatterbot, I get an error.
Traceback (most recent call last):
File "C:/Users/KPvoice/PycharmProjects/Contact/ChatterbotTest.py", line
30, in <module>
bot_input = bot.get_response(None)
File "C:\Python36\lib\site-packages\chatterbot\chatterbot.py", line 113,
in get_response
statement, response = self.generate_response(input_statement,
conversation_id)
File "C:\Python36\lib\site-packages\chatterbot\chatterbot.py", line 132,
in generate_response
response = self.logic.process(input_statement)
File "C:\Python36\lib\site-packages\chatterbot\logic\multi_adapter.py",
line 52, in process
output = adapter.process(statement)
File "C:\Python36\lib\site-packages\chatterbot\logic\multi_adapter.py",
line 89, in process
result.confidence = max_confidence
AttributeError: 'NoneType' object has no attribute 'confidence'
I do not know how to solve it.
The working environment is Windows 10, Python 3.7
The MultiLogicAdapter typically doesn't get used directly in this way.
Each logic adapter that you add to logic_adapters=[] gets processed by the MultiLogicAdapter internally by ChatterBot, so there is no need to specify it explicitly.
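In other words, dropping the explicit MultiLogicAdapter entry and keeping just the two real adapters should be enough. A sketch of the trimmed configuration, otherwise identical to the code in the question:
from chatterbot import ChatBot

bot = ChatBot(
    "SQLMemoryTerminal",
    storage_adapter='chatterbot.storage.SQLStorageAdapter',
    logic_adapters=[
        {
            "import_path": "chatterbot.logic.BestMatch",
            "statement_comparison_function":
                "chatterbot.comparisons.levenshtein_distance"
        },
        {
            'import_path': 'chatterbot.logic.LowConfidenceAdapter',
            'threshold': 0.3,
            'default_response': "Sorry. I can not find the exact answer."
        },
        # no 'chatterbot.logic.multi_adapter.MultiLogicAdapter' entry here;
        # ChatterBot wraps the adapters above in it internally
    ],
    input_adapter="chatterbot.input.TerminalAdapter",
    output_adapter="chatterbot.output.TerminalAdapter",
    read_only=True
)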

Error in applying lemmatization

Why am I getting this error? Please help.
I am a newbie to machine learning.
This is my code, where I've applied lemmatization to the 20 newsgroups dataset.
The code aims to get the 500 words with the highest counts while applying filtering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

def letters_only(astr):
    return astr.isalpha()

cv = CountVectorizer(stop_words="english", max_features=500)
groups = fetch_20newsgroups()
cleaned = []
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

for post in groups.data:
    cleaned.append(' '.join([lemmatizer.lemmatize(word.lower()
                             for word in post.split()
                             if letters_only(word) and word not in all_names)]))

transformed = cv.fit_transform(cleaned)
print(cv.get_feature_names())
Error:
Traceback (most recent call last):
File "<ipython-input-91-7158a74bae71>", line 18, in <module>
for word in post.split()
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
forms = apply_rules([form])
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
for form in forms
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
if form.endswith(old)]
AttributeError: 'generator' object has no attribute 'endswith'
I'm not sure why, but turning that one-liner into a regular for loop solved the problem:
for post in groups.data:
    for word in post.split():
        if letters_only(word) and word not in all_names:
            cleaned.append(' '.join([lemmatizer.lemmatize(word.lower())]))
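For what it's worth, the traceback points to the likely culprit in the original one-liner: the closing parenthesis of lemmatize() was misplaced, so the whole generator expression was passed to lemmatize() instead of each word. A sketch with the parenthesis moved, which keeps one cleaned string per post as in the original code:
for post in groups.data:
    cleaned.append(' '.join(
        lemmatizer.lemmatize(word.lower())
        for word in post.split()
        if letters_only(word) and word not in all_names
    ))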

TypeError: Can't convert 'bytes' object to str implicitly for tweepy

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

ckey = ''
csecret = ''
atoken = ''
asecret = ''

class listener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, listener())
twitterStream.filter(track="cricket")
This code filters the Twitter stream based on the track keyword, but I am getting the following traceback after running it. Can somebody please help?
Traceback (most recent call last):
File "lab.py", line 23, in <module>
twitterStream.filter(track="car".strip())
File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 430, in filter
self._start(async)
File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 346, in _start
self._run()
File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 286, in _run
raise exception
File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 255, in _run
self._read_loop(resp)
File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 298, in _read_loop
line = buf.read_line().strip()
File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 171, in read_line
self._buffer += self._stream.read(self._chunk_size)
TypeError: Can't convert 'bytes' object to str implicitly
I'm assuming you're using tweepy 3.4.0. The issue you've raised is 'open' on GitHub (https://github.com/tweepy/tweepy/issues/615).
Two work-arounds:
1) In streaming.py, I changed line 161 to
self._buffer += self._stream.read(read_len).decode('UTF-8', 'ignore')
and line 171 to
self._buffer += self._stream.read(self._chunk_size).decode('UTF-8', 'ignore')
and then reinstalled via python3 setup.py install on my local copy of tweepy.
2) Remove the tweepy 3.4.0 module and install 3.3.0 with: pip install -I tweepy==3.3.0
Hope that helps,
-A
You can't do twitterStream.filter(track="car".strip()). Why are you adding the strip()? It's serving no purpose there.
track must be a str type before you invoke a connection to Twitter's Streaming API, and tweepy is preventing that connection because you're trying to add strip().
If for some reason you need it, you can do track_word = 'car'.strip() and then track=track_word, but even that is unnecessary because:
>>> print('car'.strip())
car
Also, the error you're getting does not match the code you have listed; the code in your question should work fine.
