NP-chunker ValueError (Python NLTK) - python-3.x

I am building an NLP pipeline based on the Python NLTK book (chapter 7). The first segment of code correctly preprocesses the data, but I am unable to run its output through my NP-chunker:
import nltk, re, pprint
#Import Data
data = 'This is a test sentence to check if preprocessing works'
#Preprocessing
def preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences
tagged = preprocess(data)
print(tagged)
#regular expression-based NP chunker
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar) #chunk parser
chunked = []
for s in tagged:
    chunked.append(cp.parse(tagged))
print(chunked)
This is the traceback I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/u0084411/Box Sync/Procesmanager DH/Text Mining/Tools/NLP_pipeline.py", line 24, in <module>
chunked.append(cp.parse(tagged))
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 1202, in parse
chunk_struct = parser.parse(chunk_struct, trace=trace)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 1017, in parse
chunkstr = ChunkString(chunk_struct)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 95, in __init__
tags = [self._tag(tok) for tok in self._pieces]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 95, in <listcomp>
tags = [self._tag(tok) for tok in self._pieces]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 105, in _tag
raise ValueError('chunk structures must contain tagged '
ValueError: chunk structures must contain tagged tokens or trees
What is my mistake here? 'Tagged' is tokenized, so why does the program not recognize this?
Many thanks!
Tom

You'll slap your forehead when you see this. Instead of this
for s in tagged:
    chunked.append(cp.parse(tagged))
it should have been this:
for s in tagged:
    chunked.append(cp.parse(s))
You were getting the error because you were not passing cp.parse() a tagged sentence, but a list of them.
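Put together, a minimal corrected version of the whole script looks like this (a sketch; it assumes the NLTK punkt and averaged_perceptron_tagger data packages are already downloaded):
import nltk

data = 'This is a test sentence to check if preprocessing works'
# One tagged token list per sentence
tagged = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in nltk.sent_tokenize(data)]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
# Parse each tagged sentence individually, not the whole list at once
chunked = [cp.parse(s) for s in tagged]
print(chunked)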

Related

Import Excel xlsx to Python using Pandas - Error Message - How to resolve?

import pandas as pd
data = pd.read_excel (r'C:\Users\royli\Downloads\Product List.xlsx',sheet_name='Sheet1' )
df = pd.DataFrame(data, columns= ['Product'])
print (df)
Error Message
Traceback (most recent call last):
File "main.py", line 3, in <module>
data = pd.read_excel (r'C:\Users\royli\Downloads\Product List.xlsx',sheet_name='Sheet1' )
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/xlrd/__init__.py", line 111, in open_workbook
with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\royli\\Downloads\\Product List.xlsx'


Generally when I get that problem, I change the \ symbols to \\ symbols, and that usually solves it. Try it.
I had this problem in Visual Studio Code.
table = pd.read_excel('Sales.xlsx')
When running the program in PyCharm, there were no errors.
When trying to run the same program in Visual Studio Code, it showed an error, without any changes.
To fix it, I had to write the path with \\ (double backslashes). Ex:
table = pd.read_excel('C:\\Users\\paste\\Desktop\\archives\\Sales.xlsx')
I am using PyCharm, and after reviewing the post and replies I was able to get this resolved (thanks very much). I didn't need to specify a worksheet, as there is only one sheet in the Excel file I am reading.
I had to add the r (raw string) prefix, and I also removed the drive specification C::
data = pd.read_excel(r'\folder\subfolder\filename.xlsx')
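To sum up the path advice in these answers, any of the following spellings avoids the backslash-escape problem (a sketch reusing the asker's path; pick one style and stick with it):
import pandas as pd

df1 = pd.read_excel('C:\\Users\\royli\\Downloads\\Product List.xlsx')  # escaped backslashes
df2 = pd.read_excel(r'C:\Users\royli\Downloads\Product List.xlsx')     # raw string
df3 = pd.read_excel('C:/Users/royli/Downloads/Product List.xlsx')      # forward slashes also work on Windows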

fastai data_block API: show_batch causes CUDA unknown error

I am a newbie trying to learn the fastai data block API.
Here is the problem; the code is exactly the same as in the tutorial:
coco = untar_data(URLs.COCO_TINY)
path=coco/'train.json'
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o:img2bbox[o.name]
data = (ObjectItemList.from_folder(coco)
        .split_by_rand_pct()
        .label_from_func(get_y_func)
        .transform(get_transforms(), tfm_y=True)
        .databunch(bs=1, num_workers=0, collate_fn=bb_pad_collate))
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))
Then the error is:
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\IPython\core\interactiveshell.py", line
3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-25e60680c0ba>", line 15, in <module>
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\fastai\basic_data.py", line 185, in show_batch
x,y = self.one_batch(ds_type, True, True)
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\fastai\basic_data.py", line 168, in one_batch
try: x,y = next(iter(dl))
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\fastai\basic_data.py", line 75, in __iter__
for b in self.dl: yield self.proc_batch(b)
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\dataloader.py", line 348,__next__
data = _utils.pin_memory.pin_memory(data)
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line
55, in pin_memory
return [pin_memory(sample) for sample in data]
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line
55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line
47, in pin_memory
return data.pin_memory()
RuntimeError: CUDA error: unknown error
Regarding this error, the forum posts I found all suggest changing the DataLoader parameters, but the DataLoader does not seem to be used directly here.
How would I go about this?
I also got this error. For me it was fixed by adding the following two lines at the top of the notebook:
import os
os.environ['CUDA_VISIBLE_DEVICES']='2'
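A note on that value: '2' selects the third GPU on the answerer's machine. On a single-GPU setup you would typically want '0' instead, and the assignment must run before torch/fastai first touches CUDA (a hedged sketch, not fastai-specific advice):
import os

# Must be set before the first CUDA call; '0' = first GPU
os.environ['CUDA_VISIBLE_DEVICES'] = '0'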

Fuzzy String Matching With Pandas and FuzzyWuzzy; KeyError: 'name'

I have a data file which looks like this -
And I have another data file which has all the correct country names.
To match the two files, I am using the code below:
from fuzzywuzzy import process
import pandas as pd
names_array = []
ratio_array = []

def match_names(wrong_names, correct_names):
    for row in wrong_names:
        x = process.extractOne(row, correct_names)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array
#Wrong country names dataset
df=pd.read_csv("wrong-country-names.csv",encoding="ISO-8859-1")
wrong_names=df['name'].dropna().values
#Correct country names dataset
choices_df=pd.read_csv("country-names.csv",encoding="ISO-8859-1")
correct_names=choices_df['name'].values
name_match,ratio_match=match_names(wrong_names,correct_names)
df['correct_country_name']=pd.Series(name_match)
df['country_names_ratio']=pd.Series(ratio_match)
df.to_csv("string_matched_country_names.csv")
print(df[['name','correct_country_name','country_names_ratio']].head(10))
I get the below error:
runfile('C:/Users/Drashti Bhatt/Desktop/untitled0.py', wdir='C:/Users/Drashti Bhatt/Desktop')
Traceback (most recent call last):
File "<ipython-input-155-a1fd87d9f661>", line 1, in <module>
runfile('C:/Users/Drashti Bhatt/Desktop/untitled0.py', wdir='C:/Users/Drashti Bhatt/Desktop')
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Drashti Bhatt/Desktop/untitled0.py", line 17, in <module>
wrong_names=df['name'].dropna().values
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2927, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'name'
Any help on this will be much appreciated! Thanks much!
Your code contains wrong_names=df['name'].dropna().values (it is mentioned in your traceback as the "offending" line).
Now look at the picture presenting your DataFrame:
it does not contain a name column,
it contains a Country column.
Go back to the traceback: at the very end there is the error message
KeyError: 'name'.
So you attempted to access a non-existing column.
I also noticed another detail: the values attribute contains the underlying NumPy array, whereas process.extractOne expects "ordinary" Python lists (of strings) to perform the match.
So you should probably change the instruction above to:
wrong_names=df['Country'].dropna().values.tolist()
Do the same for the other instruction.
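Putting both changes together, the two reading lines would become something like this (a sketch assuming both CSVs use a Country column):
wrong_names = df['Country'].dropna().values.tolist()
correct_names = choices_df['Country'].values.tolist()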

Stanford dependency parser with NLTK: UnicodeDecodeError

I am trying to run the following lines of code:
import os
os.environ['JAVAHOME'] = 'path/to/java.exe'
os.environ['STANFORD_PARSER'] = 'path/to/stanford-parser.jar'
os.environ['STANFORD_MODELS'] = 'path/to/stanford-parser-3.8.0-models.jar'
from nltk.parse.stanford import StanfordDependencyParser
dep_parser = StanfordDependencyParser(model_path="path/to/englishPCFG.ser.gz")
sentence = "sample sentence ..."
# Dependency Parsing:
print("Dependency Parsing:")
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
and at the line:
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
I get the following issues:
Traceback (most recent call last):
File "C:/Users/Norbert/PycharmProjects/untitled/StanfordDependencyParser.py", line 21, in
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 134, in raw_parse
return next(self.raw_parse_sents([sentence], verbose))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 152, in raw_parse_sents
return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 218, in _execute
stdout=PIPE, stderr=PIPE)
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 135, in java
print(_decode_stdoutdata(stderr))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 737, in _decode_stdoutdata
return stdoutdata.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 3097: invalid start byte
Any idea what could be wrong? I am not even dealing with any non-UTF-8 text.
I can print a few things by doing this; maybe it's not what you wanted, but it's a start.
print("Dependency Parsing:")
result = dep_parser.raw_parse(sentence)
#print(next(result))
dep = next(result)
print(list(dep.triples()))
Uncomment the line print(next(result)) if you want to see the entire output. Note that raw_parse returns an iterator, so if you uncomment it, the later next(result) call will advance past the first parse.
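If you need both the trees and the triples, one option is to materialize the iterator once (a sketch, reusing dep_parser and sentence from the question):
# list() consumes the iterator a single time, so each parse can be reused
parses = list(dep_parser.raw_parse(sentence))
for dep in parses:
    print(dep.tree())           # the parse tree
    print(list(dep.triples()))  # (governor, relation, dependent) triples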

Converting a supposed Excel file to CSV in Python

I am having an issue trying to use some code to convert a file to CSV.
I am using the code below as a start:
directory = 'C:\OI Data'
filename = 'OpenInterest08-24-16'
data_xls = pd.read_excel(os.path.join(directory,filename), 'Sheet1', index_col=None)
data_xls.to_csv(os.path.join(directory,filename,'.csv'), encoding='utf-8')
and I am getting the following error:
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Public/Documents/Python Scripts/work.py", line 26, in
data_xls = pd.read_excel(os.path.join(directory,filename), 'Sheet1', index_col=None)
File "C:\Anaconda2\lib\site-packages\pandas\io\excel.py", line 170, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Anaconda2\lib\site-packages\pandas\io\excel.py", line 227, in init
self.book = xlrd.open_workbook(io)
File "C:\Anaconda2\lib\site-packages\xlrd__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "C:\Anaconda2\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Anaconda2\lib\site-packages\xlrd\book.py", line 1230, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Anaconda2\lib\site-packages\xlrd\book.py", line 1224, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\n\n\n\n\n '
I am struggling to figure out the format of the file I am using:
https://www.theice.com/marketdata/reports/icefuturesus/PreliminaryOpenInterest.shtml?futuresExcel=&tradeDate=8%2F24%2F16
Opening the file myself, I get the following (screenshot omitted).
I am still a beginner at Python, and some help would be much appreciated.
Thanks
You can start by fixing this part:
data_xls.to_csv(os.path.join(directory,filename,'.csv'), encoding='utf-8')
What happens when you do that is:
'C:\OI Data\\OpenInterest08-24-16\\.csv'
Which is not what you want. Instead do:
os.path.join(directory,filename+'.csv')
Which will give you:
'C:\OI Data\\OpenInterest08-24-16.csv'
Also, this is not a problem here, but in general be careful with this because a single backslash and a character can indicate an escape sequence, e.g. \n is a newline:
directory = 'C:\OI Data'
Instead escape the backslash like so:
directory = 'C:\\OI Data'
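For example, this sketch shows both fixes side by side (raw string for the directory, string concatenation for the extension):
import os

directory = r'C:\OI Data'  # raw string: backslashes stay literal
filename = 'OpenInterest08-24-16'
print(os.path.join(directory, filename + '.csv'))  # C:\OI Data\OpenInterest08-24-16.csv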
