Stanford dependency parser with NLTK: UnicodeDecodeError (Python 3.x)

I am trying to run the following lines of code:
import os
os.environ['JAVAHOME'] = 'path/to/java.exe'
os.environ['STANFORD_PARSER'] = 'path/to/stanford-parser.jar'
os.environ['STANFORD_MODELS'] = 'path/to/stanford-parser-3.8.0-models.jar'
from nltk.parse.stanford import StanfordDependencyParser
dep_parser = StanfordDependencyParser(model_path="path/to/englishPCFG.ser.gz")
sentence = "sample sentence ..."
# Dependency Parsing:
print("Dependency Parsing:")
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
and at the line:
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
I get the following issues:
Traceback (most recent call last):
File "C:/Users/Norbert/PycharmProjects/untitled/StanfordDependencyParser.py", line 21, in
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 134, in raw_parse
return next(self.raw_parse_sents([sentence], verbose))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 152, in raw_parse_sents
return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 218, in _execute
stdout=PIPE, stderr=PIPE)
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 135, in java
print(_decode_stdoutdata(stderr))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 737, in _decode_stdoutdata
return stdoutdata.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 3097: invalid start byte
Any idea what could be wrong? I am not even dealing with any non-UTF-8 text.

I can print part of the result by doing this. It may not be exactly what you wanted, but it is a start:
print("Dependency Parsing:")
result = dep_parser.raw_parse(sentence)
# print(next(result))
dep = next(result)
print(list(dep.triples()))
Uncomment the line print(next(result)) if you want to see the entire output. Note that each call to next() consumes one parse, so the following dep = next(result) may then have nothing left to read.
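The traceback in the question ends inside nltk's _decode_stdoutdata, which is decoding the stderr of the Java process, so the offending byte most likely comes from Java's console output rather than from the sentence being parsed. Below is a minimal sketch of that failure mode. The bytes in raw_stderr are made up for illustration, and decode_console_output is a hypothetical helper, not part of NLTK:

```python
def decode_console_output(data, encoding="utf-8"):
    """Decode subprocess output, substituting U+FFFD for undecodable bytes."""
    return data.decode(encoding, errors="replace")


# Made-up stand-in for Java's stderr; 0xac is not a valid UTF-8 start byte.
raw_stderr = b"Loading parser \xac done"

try:
    raw_stderr.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    # This is the same kind of failure as in the traceback above.
    strict_ok = False
    print("strict decode failed:", exc.reason)

# A lenient decode keeps the readable part of the message.
print(decode_console_output(raw_stderr))
```

Whether the real stderr was written in a Windows console code page is an assumption. The sketch only shows why a strict UTF-8 decode can fail even when the input sentence is plain ASCII.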

Related

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3965: invalid start byte" when using Pyinstaller

I am trying to create an executable from two python scripts. One script defines the GUI for the other backend script. The backend is reading in excel files, creating DataFrames with them for manipulation, then outputting a new excel file. This is the code that reads in the excel file, where "user_path, userAN, userRev1, userRev2" are grabbed as user input from the GUI:
import pandas as pd
import numpy as np
import string
from tkinter import messagebox
import os
def generate_BOM(user_path, userAN, userRev1, userRev2):
    ## Append filepath with '/' if it does not include directory separator
    if not (user_path.endswith('/') or user_path.endswith('\\')):
        user_path = user_path + '/'
    ## Set filepath to current directory if user inputted path does not exist
    if not os.path.exists(user_path):
        user_path = './'  # keep the trailing separator so user_path+file stays valid
    fileFormat1 = userAN + '_' + userRev1 + '.xls'
    fileFormat2 = userAN + '_' + userRev2 + '.xls'
    for file in os.listdir(path=user_path):
        if file.endswith(fileFormat1):
            df1 = pd.read_excel(user_path+file, index_col=None)
        if file.endswith(fileFormat2):
            df2 = pd.read_excel(user_path+file, index_col=None)
When running the two scripts through Spyder, everything works perfectly. To create the exe, I am using Pyinstaller with the following command:
pyinstaller --onefile Delta_BOM_Creator.py
This results in the following error:
Traceback (most recent call last):
File "c:\users\davhar\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\davhar\anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\davhar\Anaconda3\Scripts\pyinstaller.exe\__main__.py", line 7, in <module>
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\__main__.py", line 114, in run
run_build(pyi_config, spec_file, **vars(args))
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\__main__.py", line 65, in run_build
PyInstaller.building.build_main.main(pyi_config, spec_file, **kwargs)
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 737, in main
build(specfile, kw.get('distpath'), kw.get('workpath'), kw.get('clean_build'))
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 684, in build
exec(code, spec_namespace)
File "C:\Users\davhar\.spyder-py3\DELTA_BOM_Creator\Delta_BOM_Creator.spec", line 7, in <module>
a = Analysis(['Delta_BOM_Creator.py'],
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 242, in __init__
self.__postinit__()
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\datastruct.py", line 160, in __postinit__
self.assemble()
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 414, in assemble
priority_scripts.append(self.graph.run_script(script))
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\depend\analysis.py", line 303, in run_script
self._top_script_node = super(PyiModuleGraph, self).run_script(
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\lib\modulegraph\modulegraph.py", line 1411, in run_script
contents = fp.read() + '\n'
File "c:\users\davhar\anaconda3\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3965: invalid start byte
I've tried everything I could find that seemed somewhat related to this issue. To list just a few:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 15: invalid start byte
https://www.dlology.com/blog/solution-pyinstaller-unicodedecodeerror-utf-8-codec-cant-decode-byte/
Pandas read _excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
I've never used Pyinstaller, or created an executable from python at all, so apologies for being a big time noob.
SOLUTION: I found a solution. I went into the codecs.py file mentioned in the error and replaced self.errors with 'ignore' on line 322:
(result, consumed) = self._buffer_decode(data, 'ignore', final)
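An alternative that leaves codecs.py untouched is to fix the script that PyInstaller is reading. Byte 0x93 is a left curly quotation mark in Windows-1252, so one plausible cause (an assumption, not confirmed in the post) is that the .py file was saved in that encoding. Here is a sketch of re-encoding such content as UTF-8. The helper names is_valid_utf8 and to_utf8 and the sample bytes are illustrative only:

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data, source_encoding="cp1252"):
    """Re-encode bytes as UTF-8, assuming they were written in source_encoding."""
    if is_valid_utf8(data):
        return data  # nothing to do
    return data.decode(source_encoding).encode("utf-8")


# Made-up line containing Windows-1252 curly quotes (0x93 and 0x94).
sample = b'label = "\x93BOM\x94"\n'
fixed = to_utf8(sample)
print(is_valid_utf8(sample), is_valid_utf8(fixed))
```

For a real file, one would read it with open(path, 'rb'), pass the bytes through to_utf8, and write the result back in binary mode after making a backup copy.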

Zipfile / shutil.make_archive throws UnicodeEncodeError on German umlauts

I'm trying to zip a folder in Python 3 with the module zipfile.
Since I'm German, I have some filenames containing umlauts (äöü).
While zipping, I get a UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 95: surrogates not allowed.
The character in question is an ü.
How can I get zipfile to zip all my files?
The relevant code is this:
import os
import zipfile
def zipdir(path, ziph):
    # Add every file below path to the open ZipFile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))
if __name__ == '__main__':
    zipf = zipfile.ZipFile('path/to/destination', 'w', zipfile.ZIP_DEFLATED)
    zipdir('path/to/folder', zipf)
    zipf.close()
Edit:
I've got the same error when I'm using shutil.make_archive.
import shutil
shutil.make_archive('/path/to/destination', 'zip', '/path/to/folder')
Full stacktrace of shutil.make_archive():
Traceback (most recent call last):
File "/usr/lib64/python3.7/zipfile.py", line 452, in _encodeFilenameFlags
return self.filename.encode('ascii'), self.flag_bits
UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 59: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 39, in <module>
archive_dir(path, zip_fullpath)
File "run.py", line 19, in archive_dir
shutil.make_archive(dest, 'zip', source)
File "/home/sean/.local/share/virtualenvs/backup-script-QUcRKrDQ/lib/python3.7/shutil.py", line 822, in make_archive
filename = func(base_name, base_dir, **kwargs)
File "/home/sean/.local/share/virtualenvs/backup-script-QUcRKrDQ/lib/python3.7/shutil.py", line 720, in _make_zipfile
zf.write(path, path)
File "/usr/lib64/python3.7/zipfile.py", line 1746, in write
with open(filename, "rb") as src, self.open(zinfo, 'w') as dest:
File "/usr/lib64/python3.7/zipfile.py", line 1473, in open
return self._open_to_write(zinfo, force_zip64=force_zip64)
File "/usr/lib64/python3.7/zipfile.py", line 1586, in _open_to_write
self.fp.write(zinfo.FileHeader(zip64))
File "/usr/lib64/python3.7/zipfile.py", line 442, in FileHeader
filename, flag_bits = self._encodeFilenameFlags()
File "/usr/lib64/python3.7/zipfile.py", line 454, in _encodeFilenameFlags
return self.filename.encode('utf-8'), self.flag_bits | 0x800
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 59: surrogates not allowed
Full stacktrace of zipfile:
Traceback (most recent call last):
File "/usr/lib64/python3.7/zipfile.py", line 452, in _encodeFilenameFlags
return self.filename.encode('ascii'), self.flag_bits
UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 95: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 41, in <module>
zipdir(path, zipf)
File "run.py", line 16, in zipdir
ziph.write(filepath)
File "/usr/lib64/python3.7/zipfile.py", line 1746, in write
with open(filename, "rb") as src, self.open(zinfo, 'w') as dest:
File "/usr/lib64/python3.7/zipfile.py", line 1473, in open
return self._open_to_write(zinfo, force_zip64=force_zip64)
File "/usr/lib64/python3.7/zipfile.py", line 1586, in _open_to_write
self.fp.write(zinfo.FileHeader(zip64))
File "/usr/lib64/python3.7/zipfile.py", line 442, in FileHeader
filename, flag_bits = self._encodeFilenameFlags()
File "/usr/lib64/python3.7/zipfile.py", line 454, in _encodeFilenameFlags
return self.filename.encode('utf-8'), self.flag_bits | 0x800
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 95: surrogates not allowed
Update:
I've tried some of the solutions that seemed to work for others at the posted link. This is what I got.
With
ziph.write(filepath.encode('utf8','surrogateescape').decode('ISO-8859-1')) I got:
Traceback (most recent call last):
File "run.py", line 41, in <module>
zipdir(path, zipf)
File "run.py", line 16, in zipdir
ziph.write(filepath.encode('utf8','surrogateescape').decode('ISO-8859-1'))
File "/usr/lib64/python3.7/zipfile.py", line 1713, in write
zinfo = ZipInfo.from_file(filename, arcname)
File "/usr/lib64/python3.7/zipfile.py", line 506, in from_file
st = os.stat(filename)
FileNotFoundError: [Errno 2] No such file or directory: '/some/path/to/documents/DIS_Broschüre_DE.pdf'
So the encoding/decoding returned something that can not be found in the file system.
The other option: ziph.write(filepath.encode('utf8','surrogateescape').decode('utf-8')) got me
Traceback (most recent call last):
File "run.py", line 41, in <module>
zipdir(path, zipf)
File "run.py", line 16, in zipdir
ziph.write(filepath.encode('utf8','surrogateescape').decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 96: invalid start byte
OK, I've found the problem.
The files in question were not the ones I thought they were. Ordinary umlauts work fine. The filenames themselves were actually corrupt, like this:
ls in one of the dirs gives:
2e_geh�usetechnologie_flyer_qrcode.pdf
Command line auto completion gives me:
2e_geh$'\344'usetechnologie_flyer_qrcode.pdf
Since these files were uploaded via a web interface, I can only imagine that they were created on Windows or another non-UNIX OS and the web server couldn't handle the names.
Other uploaded files had correct umlauts. I'm not sure what happened there, but I'm glad that neither Python nor the Linux filesystem is to blame.
Thanks for all the tips.
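The '\udcfc' and $'\344' above fit together: when a filename contains a byte that is not valid in the filesystem encoding, Python represents that byte as a lone surrogate in the range U+DC80..U+DCFF (the surrogateescape error handler), and such a string cannot be encoded as UTF-8 for the archive. Here is a minimal sketch of building an archive name for such files, assuming the stray bytes are Latin-1 characters (octal 344 is 0xe4, which is 'ä' in Latin-1). The function name unescape_surrogates is made up for illustration:

```python
def unescape_surrogates(name):
    """Replace each surrogate-escaped byte with the Latin-1 character for that byte."""
    return "".join(
        chr(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in name
    )


# How Python sees the corrupt filename from the ls output above.
broken = "2e_geh\udce4usetechnologie_flyer_qrcode.pdf"
clean = unescape_surrogates(broken)

# The cleaned name can be encoded as UTF-8, so zipfile can store it.
clean.encode("utf-8")
```

The original path still has to be used to open the file, so only the name stored in the archive changes, for example ziph.write(filepath, arcname=unescape_surrogates(filepath)).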

How to use special character (æ,ø or å) in a urllib.request.urlopen in python 3.5.2? [duplicate]

I'm using Python 3.5.2 and I'm trying to automatically open a URL with parameters (many of them read from a CSV file). My problem is that one of the parameters contains the Norwegian letter "ø" in "Møre 2013" (see ...projects:"Møre%202013", where %20 is used to include a space between Møre and 2013), which causes an error message.
A .bat file runs lesurl.py with input parameters from a CSV file.
My code in lesurl.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys # to be able to read variables
import urllib.request # to read the URL
fn=sys.argv[1]
minx = sys.argv[2]
miny = sys.argv[3]
maxx = sys.argv[4]
maxy = sys.argv[5]
pnavn = sys.argv[6]
c = sys.argv[7]
u = sys.argv[8]
pwd = sys.argv[9]
f = sys.argv[10]
bestilling = urllib.request.urlopen('https://tjenester.norgeibilder.no/REST/StartExport.ashx?request={username:"'+u+'",password:"'+pwd+'",copyEmail:"",comment:"'+fn+'",coordInput:{type:"Polygon",coordinates:[[['+minx+','+maxy+'],['+maxx+','+maxy+'],['+maxx+','+miny+'],['+minx+','+miny+'],['+minx+','+maxy+']]},inputWkid:'+c+',cutNationalBorder:0,format:'+f+',resolution:'+r+',outputWkid:'+c+',fillColor:255,projects:"Møre%202013",imagemosaic:2}').read()
print(bestilling)
"Møre%202013" seems to cause an error:
Traceback (most recent call last):
File "b_of_lesurl.py", line 28, in <module>
bestilling = urllib.request.urlopen('https://tjenester.norgeibilder.no/REST/StartExport.ashx?request={username:"'+u+'",password:"'+pwd+'
",copyEmail:"",comment:"'+fn+'",coordInput:{type:"Polygon",coordinates:[[['+minx+','+maxy+'],['+maxx+','+maxy+'],['+maxx+','+miny+'],['+minx
+','+miny+'],['+minx+','+maxy+']]},inputWkid:'+c+',cutNationalBorder:0,format:'+f+',resolution:'+r+',outputWkid:'+c+',fillColor:255,projects
:"Møre%202013",imagemosaic:2}').read()
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 466, in open
response = self._open(req, data)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 484, in _open
'_open', req)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
result = func(*args)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1297, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1254, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1106, in request
self._send_request(method, url, body, headers)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1141, in _send_request
self.putrequest(method, url, **skips)
File "C:\Users\ban\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 983, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 344: ordinal not in range(128)
I have tried different variants of encode('utf-8'), such as the following (and using ...projects:"'+s+'",... in urlopen):
s="Møre%202013"
print(s)
s=s.encode('utf-8')
print(s)
giving
Møre%202013
b'M\xc3\xb8re%202013'
b'M\xc3\xb8re%202013'
and still giving the encode error.
How do I include "ø" correctly? (Btw, e.g. "Oslo 2015" works fine.)
The quick answer is that you need to percent-encode the string.
Here is an example:
>>> import urllib.parse
>>> s='Møre 2013'
>>> urllib.parse.quote(s)
'M%C3%B8re%202013'
>>> urllib.parse.unquote('M%C3%B8re%202013')
'Møre 2013'
The longer answer is that the set of valid URL characters is very limited.
See this answer for more details:
https://stackoverflow.com/a/13500078/3776268
Also, from the linked answer: the reasons why the characters are restricted are clearly spelled out in RFC 1738 (URLs) and RFC 2396 (URIs). Note the newer RFC 3986 (an update to RFC 1738).
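Applied to the URL in the question, the idea is to percent-encode the value of the request parameter before it reaches urlopen. The sketch below uses a shortened request string for illustration, and build_export_url is a hypothetical helper name. Whether the service accepts percent-encoded braces and quotes is an assumption that has not been tested here:

```python
from urllib.parse import quote

BASE_URL = "https://tjenester.norgeibilder.no/REST/StartExport.ashx"


def build_export_url(request_text):
    """Percent-encode the whole request value so the URL is pure ASCII."""
    return BASE_URL + "?request=" + quote(request_text, safe="")


# Shortened request string; note the plain space instead of a hand-written %20.
url = build_export_url('{projects:"Møre 2013",imagemosaic:2}')

# http.client encodes the request line as ASCII; this now succeeds.
url.encode("ascii")
print(url)
```

If the server expects the braces, colons and quotes unencoded, the safe argument of quote can list the characters to leave alone while still encoding "ø" and the space.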

NP-chunker value error (Python nltk)

I am building an NLP pipeline based on the Python NLTK book (chapter 7). The first segment of code correctly preprocesses the data, but I am unable to run its output through my NP chunker:
import nltk, re, pprint
#Import Data
data = 'This is a test sentence to check if preprocessing works'
#Preprocessing
def preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return(sentences)
tagged = preprocess(data)
print(tagged)
#regular expression-based NP chunker
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar) #chunk parser
chunked = []
for s in tagged:
    chunked.append(cp.parse(tagged))
    print(chunked)
This is the traceback I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/u0084411/Box Sync/Procesmanager DH/Text Mining/Tools/NLP_pipeline.py", line 24, in <module>
chunked.append(cp.parse(tagged))
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 1202, in parse
chunk_struct = parser.parse(chunk_struct, trace=trace)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 1017, in parse
chunkstr = ChunkString(chunk_struct)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 95, in __init__
tags = [self._tag(tok) for tok in self._pieces]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 95, in <listcomp>
tags = [self._tag(tok) for tok in self._pieces]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\chunk\regexp.py", line 105, in _tag
raise ValueError('chunk structures must contain tagged '
ValueError: chunk structures must contain tagged tokens or trees
>>>
What is my mistake here? 'Tagged' is tokenized, so why does the program not recognize this?
Many thanks!
Tom
You'll slap your forehead when you see this. Instead of this
for s in tagged:
    chunked.append(cp.parse(tagged))
it should have been this:
for s in tagged:
    chunked.append(cp.parse(s))
You were getting the error because you were not passing cp.parse() a tagged sentence, but a list of them.
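The difference between the two calls comes down to the shape of the argument. The sketch below uses plain Python with a hand-written tagged sentence, so it needs no NLTK models. The tags are typed in by hand for illustration, and is_tagged_sentence is a hypothetical helper, not an NLTK function:

```python
def is_tagged_sentence(obj):
    """True if obj is a flat list of (word, tag) tuples, i.e. one tagged sentence."""
    return isinstance(obj, list) and all(
        isinstance(tok, tuple) and len(tok) == 2 for tok in obj
    )


# Same shape as the output of preprocess(): a list with one tagged sentence in it.
tagged = [[("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NN")]]

print(is_tagged_sentence(tagged))     # the whole list: its only element is a list, not a tuple
print(is_tagged_sentence(tagged[0]))  # a single sentence: every element is a (word, tag) tuple
```

Passing tagged hands the chunker a list whose elements are lists, which is why it complains that the structure does not contain tagged tokens. Passing s hands it one sentence at a time.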

spyder unicode decode error in startup

I was using the Spyder IDE while parsing a Tumblr page (with the author's permission), and at some point everything just crashed. Even my Linux system froze. To cut to the chase, I can no longer start Spyder; it gives me the following error when I type spyder in my terminal:
Traceback (most recent call last):
File "/home/dk/anaconda3/bin/spyder", line 2, in <module>
from spyderlib import start_app
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/start_app.py", line 13, in <module>
from spyderlib.config import CONF
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/config.py", line 736, in <module>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 215, in __init__
self.load_from_ini()
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 265, in load_from_ini
self.read(self.filename(), encoding='utf-8')
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 696, in read
self._read(fp, filename)
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 1012, in _read
for lineno, line in enumerate(fp, start=1):
File "/home/dk/anaconda3/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
I tried the solution here and received the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/spyder.py", line 107, in <module>
from spyderlib.utils.qthelpers import qapplication
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/utils/qthelpers.py", line 24, in <module>
from spyderlib.guiconfig import get_shortcut
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/guiconfig.py", line 22, in <module>
from spyderlib.config import CONF
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/config.py", line 736, in <module>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 215, in __init__
self.load_from_ini()
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 265, in load_from_ini
self.read(self.filename(), encoding='utf-8')
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 696, in read
self._read(fp, filename)
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 1012, in _read
for lineno, line in enumerate(fp, start=1):
File "/home/dk/anaconda3/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
I tried uninstalling and reinstalling Anaconda, but that doesn't seem to help. I am open to suggestions. I am very new to Python, so I would also appreciate a simple explanation of the possible causes of the error.
Thanks in advance
Well, here is how I solved the issue.
I opened this: spyderlib/userconfig.py
and changed this: self.read(self.filename(), encoding='utf-8')
to this: self.read(self.filename(), encoding='latin-1')
It gave me a warning (File contains no section headers) but started Spyder anyway. After that, I closed Spyder, opened the terminal, ran spyder --reset, and then restarted Spyder. It seems to work now.
Here is what you should not do under any circumstances for this problem: tinker with these files. I learned my lesson the hard way:
python3.5/configparser.py
python3.5/codecs.py
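As for why switching to latin-1 let Spyder start: Latin-1 assigns a character to every possible byte value, so decoding with it can never raise, whereas UTF-8 rejects byte sequences that break its rules. The corrupt bytes simply turn into arbitrary characters, which fits the "no section headers" warning. Here is a small sketch with made-up bytes standing in for a damaged config file (try_decode is an illustrative helper name):

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Made-up damaged header: 0xc3 announces a two-byte sequence, but 0x28 cannot continue it.
corrupt = b"\xc3\x28main]\nkey = value\n"

print(try_decode(corrupt, "utf-8"))          # decoding fails
print(repr(try_decode(corrupt, "latin-1")))  # decoding succeeds, but the header is garbage
```

This is a way to get past the crash and reach spyder --reset, not a repair of the file itself.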
