Issue loading a model of Spanish data - python-3.x

I'm trying to load a model that contains Spanish words using gensim-1.0 in Python 3.5, but when I call gensim.models.KeyedVectors.load_word2vec_format(mymodel), the CLI says this:
Traceback (most recent call last):
File "./prueba.py", line 30, in <module>
model = KeyedVectors.load_word2vec_format('./data/WikiModelEsp/wiki.size.800.window.5.mincount.50.new.model', binary=True)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 192, in load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
File "/usr/local/lib/python3.5/dist-packages/gensim/utils.py", line 231, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I tried calling the load function with encoding='latin1' and binary=True, but it still doesn't work.

Did you try the plain load function instead? Like this:
model = KeyedVectors.load(path_model)
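One way to see why load may be the right call: files written by Python's pickle protocol 2 or later begin with byte 0x80, the same byte the traceback complains about, whereas the word2vec binary format begins with a plain-text header line. The sketch below checks that first byte before choosing a loader. The helper names looks_like_pickle and load_vectors are hypothetical, and it assumes the file was written with gensim's save method; if a full Word2Vec model was saved rather than its vectors, Word2Vec.load may be the matching call.

```python
import pickle
import tempfile
from pathlib import Path


def looks_like_pickle(path):
    """Return True if the file starts with the pickle protocol 2+ marker (0x80)."""
    with open(path, "rb") as handle:
        return handle.read(1) == b"\x80"


def load_vectors(path):
    """Pick a gensim loader based on the first byte of the file (hypothetical helper)."""
    # Imported lazily so looks_like_pickle() works without gensim installed.
    from gensim.models import KeyedVectors

    if looks_like_pickle(path):
        # Assumption: the file was written with gensim's save(), so use load().
        return KeyedVectors.load(str(path))
    # Otherwise assume the word2vec C binary format with its text header.
    return KeyedVectors.load_word2vec_format(str(path), binary=True)


# Demonstrate the first-byte check on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    pickled = Path(tmp) / "sample.model"
    pickled.write_bytes(pickle.dumps({"hola": [0.1, 0.2]}, protocol=2))
    is_pickle = looks_like_pickle(pickled)

    header_only = Path(tmp) / "vectors.bin"
    header_only.write_bytes(b"2 3\n")  # word2vec-style "vocab_size vector_size" header
    header_is_pickle = looks_like_pickle(header_only)
```

With the real model, load_vectors('./data/WikiModelEsp/wiki.size.800.window.5.mincount.50.new.model') would then dispatch to the appropriate loader.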

Related

Loading a dictionary saved as a msgpack with symspell

I am trying to use symspell in Python to spellcheck some old Spanish texts. Since they are all old texts, I need a dictionary that has old Spanish words, so I downloaded the large dictionary they share here, which is a msgpack file.
According to the basic usage, I can load a dictionary using this code
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary = pkg_resources.resource_filename(
"symspellpy", "dictionary.txt"
)
sym_spell.load_dictionary(dictionary, term_index=0, count_index=1)
as shown here
But when I try it with the msgpack file like this
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary = pkg_resources.resource_filename(
"symspellpy", "large_es.msgpack"
)
sym_spell.load_dictionary(dictionary, term_index=0, count_index=1)
I get this error
Traceback (most recent call last):
File ".../utils/quality_check.py", line 24, in <module>
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
File ".../lib/python3.8/site-packages/symspellpy/symspellpy.py", line 346, in load_dictionary
return self._load_dictionary_stream(
File ".../lib/python3.8/site-packages/symspellpy/symspellpy.py", line 1122, in _load_dictionary_stream
for line in corpus_stream:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
I know this means the file is expected to be a txt file, but does anyone have an idea how I can load a frequency dictionary stored in a msgpack file with symspell in Python?
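One possible approach, sketched under a strong assumption: load_dictionary reads text lines, so the msgpack content could be decoded first and rewritten as a plain "term count" file. The layout of large_es.msgpack is not shown in the question, so the code below ASSUMES it decodes to a mapping of term to count; inspect the decoded object before relying on this. The helper names read_msgpack and write_frequency_file are hypothetical, and read_msgpack needs the third-party msgpack package.

```python
import tempfile
from pathlib import Path


def read_msgpack(path):
    """Decode a msgpack file (requires the third-party `msgpack` package)."""
    import msgpack  # imported lazily; the conversion below does not need it

    with open(path, "rb") as handle:
        return msgpack.unpack(handle, raw=False)


def write_frequency_file(term_counts, out_path):
    """Write 'term count' lines, the text layout load_dictionary() parses."""
    out_path = Path(out_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for term, count in term_counts.items():
            out.write(f"{term} {int(count)}\n")
    return out_path


# Stand-in for the decoded msgpack content (ASSUMED shape: term -> count).
decoded = {"fazer": 120, "vuestra": 95, "merced": 80}

with tempfile.TemporaryDirectory() as tmp:
    text_path = write_frequency_file(decoded, Path(tmp) / "large_es.txt")
    lines = text_path.read_text(encoding="utf-8").splitlines()
```

The resulting text file could then be passed to sym_spell.load_dictionary(str(text_path), term_index=0, count_index=1), exactly as in the basic usage above.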

Python code using the geograpy module gives an error

I solved some nltk errors, but these remain:
import geograpy
url = 'http://www.bbc.com/news/world-europe-26919928'
places = geograpy.get_place_context(url=url)
These are the generated errors:
Traceback (most recent call last):
File "C:\Users\Monika\Desktop\p.py", line 3, in <module>
places = geograpy.get_place_context(url=url)
File "C:\Users\Monika\AppData\Local\Programs\Python\Python37\lib\site-packages\geograpy\__init__.py", line 11, in get_place_context
pc.set_cities()
File "C:\Users\Monika\AppData\Local\Programs\Python\Python37\lib\site-packages\geograpy\places.py", line 137, in set_cities
self.populate_db()
File "C:\Users\Monika\AppData\Local\Programs\Python\Python37\lib\site-packages\geograpy\places.py", line 30, in populate_db
for row in reader:
File "C:\Users\Monika\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 274: character maps to <undefined>
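The traceback shows the CSV rows being decoded with cp1252, the default text encoding on many Windows installations, and byte 0x8d has no mapping in that codec. The sketch below reproduces the failure independently of geograpy and shows that naming the encoding explicitly avoids it. It assumes the data file is UTF-8, which the traceback alone does not prove; the file name and contents are made up for the demonstration.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "cities.csv"
    # U+00CD is encoded as bytes C3 8D in UTF-8; 0x8D is undefined in cp1252.
    csv_path.write_text("city,country\n\u00cdllar,Spain\n", encoding="utf-8")

    # Reading with the Windows default codec reproduces the error.
    try:
        with open(csv_path, newline="", encoding="cp1252") as handle:
            list(csv.reader(handle))
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Naming the actual encoding reads the rows cleanly.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
```

In the geograpy case the open call lives inside the installed package (places.py, populate_db), so applying this would mean passing an encoding where that file is opened.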

Using Python 3 & RoboBrowser to submit a form to facebook & getting UnicodeDecodeError

I am trying to make a script that will automatically upload images to Facebook for me, using RoboBrowser to navigate the mbasic.facebook.com website, and I am getting a strange error when submitting the image form:
Traceback (most recent call last):
File "C:\Users\Admin\OneDrive\Facebook Project\facebook.py", line 55, in <module>
browser.submit_form(form, submit=form["add_photo_done"])
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\robobrowser\browser.py", line 343, in submit_form
response = self.session.request(method, url, **send_args)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 498, in request
prep = self.prepare_request(req)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 441, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 312, in prepare
self.prepare_body(data, files, json)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 500, in prepare_body
(body, content_type) = self._encode_files(files, data)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 159, in _encode_files
fdata = fp.read()
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 57: character maps to <undefined>
Why would this be? I am guessing it has something to do with how I am submitting the image file to the form, but I'm stumped as to why. My relevant code is below:
form = browser.get_forms()
form = form[0]
image = os.path.dirname(os.path.realpath(__file__)) + r"\test.png"
form['file1'] = image
browser.submit_form(form, submit=form["add_photo_done"])
print(browser.parsed())
Edit:
I do not believe this is a repost of UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>, as I am not reading any files in my code; the error seems to occur while submitting the form. In any case, I had already seen that post and could not work out how to use it to solve my issue.
I figured out that I need to send an open image file object, not the path to the image. The solution code is below, in case anyone looks at this down the line:
image = open(os.path.dirname(os.path.realpath(__file__)) + r"\test.jpg", 'rb')
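The traceback shows fp.read() passing through the cp1252 decoder, which suggests the file was being read in text mode. A minimal sketch of the difference, using a made-up byte sequence rather than a real image: text mode tries to decode every byte and fails on 0x8f, while binary mode returns the bytes untouched.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    image_path = Path(tmp) / "test.png"
    # PNG signature followed by a byte (0x8F) that cp1252 cannot decode.
    image_path.write_bytes(b"\x89PNG\r\n\x1a\n\x8f")

    # Text mode: read() decodes the bytes and fails on 0x8F.
    try:
        with open(image_path, "r", encoding="cp1252") as handle:
            handle.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: read() returns the raw bytes unchanged.
    with open(image_path, "rb") as handle:
        payload = handle.read()
```

A file handle opened with 'rb', as in the solution line above, is what gets assigned to form['file1'].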

Stanford dependency parser with NLTK: UnicodeDecodeError

I am trying to run the following lines of code:
import os
os.environ['JAVAHOME'] = 'path/to/java.exe'
os.environ['STANFORD_PARSER'] = 'path/to/stanford-parser.jar'
os.environ['STANFORD_MODELS'] = 'path/to/stanford-parser-3.8.0-models.jar'
from nltk.parse.stanford import StanfordDependencyParser
dep_parser = StanfordDependencyParser(model_path="path/to/englishPCFG.ser.gz")
sentence = "sample sentence ..."
# Dependency Parsing:
print("Dependency Parsing:")
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
and at the line:
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
I get the following issues:
Traceback (most recent call last):
File "C:/Users/Norbert/PycharmProjects/untitled/StanfordDependencyParser.py", line 21, in <module>
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 134, in raw_parse
return next(self.raw_parse_sents([sentence], verbose))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 152, in raw_parse_sents
return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 218, in _execute
stdout=PIPE, stderr=PIPE)
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 135, in java
print(_decode_stdoutdata(stderr))
File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 737, in _decode_stdoutdata
return stdoutdata.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 3097: invalid start byte
Any idea what could be wrong? I am not even dealing with any non-UTF-8 text.
I can print a few things by doing this; maybe it is not what you wanted, but it is a start.
print("Dependency Parsing:")
result = dep_parser.raw_parse(sentence)
dep = next(result)
# print(dep)
print(list(dep.triples()))
Uncomment the print(dep) line if you want to see the entire output.

Spyder Unicode decode error on startup

I was using the Spyder IDE while parsing a Tumblr page with the permission of the author, and at some point everything just crashed. Even my Linux system froze. To cut to the chase, I now cannot start Spyder; it gives me the following error after I type spyder in my terminal:
Traceback (most recent call last):
File "/home/dk/anaconda3/bin/spyder", line 2, in <module>
from spyderlib import start_app
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/start_app.py", line 13, in <module>
from spyderlib.config import CONF
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/config.py", line 736, in <module>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 215, in __init__
self.load_from_ini()
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 265, in load_from_ini
self.read(self.filename(), encoding='utf-8')
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 696, in read
self._read(fp, filename)
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 1012, in _read
for lineno, line in enumerate(fp, start=1):
File "/home/dk/anaconda3/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
I tried the solution here and received the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/spyder.py", line 107, in <module>
from spyderlib.utils.qthelpers import qapplication
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/utils/qthelpers.py", line 24, in <module>
from spyderlib.guiconfig import get_shortcut
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/guiconfig.py", line 22, in <module>
from spyderlib.config import CONF
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/config.py", line 736, in <module>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 215, in __init__
self.load_from_ini()
File "/home/dk/anaconda3/lib/python3.5/site-packages/spyderlib/userconfig.py", line 265, in load_from_ini
self.read(self.filename(), encoding='utf-8')
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 696, in read
self._read(fp, filename)
File "/home/dk/anaconda3/lib/python3.5/configparser.py", line 1012, in _read
for lineno, line in enumerate(fp, start=1):
File "/home/dk/anaconda3/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
I tried uninstalling and reinstalling Anaconda, and it doesn't seem to work. I am open to suggestions. I am very new to Python, so I would appreciate a simple explanation of the possible causes of the error too.
Thanks in advance.
Well, here is how I solved the issue.
I opened this: spyderlib/userconfig.py
and changed this: self.read(self.filename(), encoding='utf-8')
to this: self.read(self.filename(), encoding='latin-1')
It gave me a warning (File contains no section headers) but started Spyder anyway. After that, I closed Spyder, opened the terminal, entered spyder --reset, and then restarted Spyder. It seems to work now.
Here is what you should not do under any circumstances for this problem: tinkering with these files. I learned my lesson the hard way:
python3.5/configparser.py
python3.5/codecs.py
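A different technique from the edit described above, sketched under assumptions: instead of changing library source, a config file that can no longer be decoded as UTF-8 could be moved aside so the application regenerates it on the next start. The helper quarantine_if_undecodable is hypothetical, the path of Spyder's ini file is not given in the post and must be supplied by the reader, and whether Spyder recreates its defaults after the file is moved is an assumption.

```python
import tempfile
from pathlib import Path


def quarantine_if_undecodable(config_path, encoding="utf-8"):
    """Rename the file to <name>.bak if it cannot be decoded.

    Returns the backup path, or None if the file decoded cleanly.
    """
    config_path = Path(config_path)
    try:
        config_path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        backup = config_path.with_name(config_path.name + ".bak")
        config_path.replace(backup)
        return backup
    return None


# Demonstration on throwaway files, not on a real Spyder configuration.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.ini"
    good.write_text("[main]\nversion = 1\n", encoding="utf-8")
    bad = Path(tmp) / "bad.ini"
    # 0xC3 followed by 0x28 is an invalid UTF-8 continuation sequence.
    bad.write_bytes(b"\xc3\x28 corrupted after a crash")

    good_result = quarantine_if_undecodable(good)
    bad_result = quarantine_if_undecodable(bad)
    bad_still_exists = bad.exists()
    backup_name = bad_result.name
```

Keeping the backup rather than deleting the file leaves the damaged content available for inspection.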
