Loading a dictionary saved as a msgpack with symspell - python-3.x

I am trying to use symspell in Python to spellcheck some old Spanish texts. Since the texts are old, I need a dictionary that includes old Spanish words, so I downloaded the large dictionary they share here, which is a msgpack file.
According to the basic usage example, I can load a dictionary with this code
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary = pkg_resources.resource_filename(
    "symspellpy", "dictionary.txt"
)
sym_spell.load_dictionary(dictionary, term_index=0, count_index=1)
as shown here
But when I try it with the msgpack file like this
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary = pkg_resources.resource_filename(
    "symspellpy", "large_es.msgpack"
)
sym_spell.load_dictionary(dictionary, term_index=0, count_index=1)
I get this error
Traceback (most recent call last):
File ".../utils/quality_check.py", line 24, in <module>
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
File ".../lib/python3.8/site-packages/symspellpy/symspellpy.py", line 346, in load_dictionary
return self._load_dictionary_stream(
File ".../lib/python3.8/site-packages/symspellpy/symspellpy.py", line 1122, in _load_dictionary_stream
for line in corpus_stream:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
I know this means load_dictionary expects a plain text file, but does anyone have an idea how I can load a frequency dictionary stored in a msgpack file with symspell in Python?
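For what it's worth, a rough sketch of a possible workaround (an assumption on my part, not from the question): load_dictionary only parses delimited text, but symspellpy also exposes create_dictionary_entry, so the msgpack could be deserialized separately and its entries inserted one by one, assuming it holds a term-to-count mapping:
# Sketch only: assumes large_es.msgpack deserializes to a {term: count} mapping,
# which may not match the file's actual layout.
import msgpack
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

with open("large_es.msgpack", "rb") as f:   # msgpack is binary, so open in "rb"
    data = msgpack.unpack(f, raw=False)     # decode keys and string values as str

for term, count in data.items():
    sym_spell.create_dictionary_entry(term, count)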

Related

How to enforce encoding for sqlcipher/sqlite export?

I'm trying to work with the JSON output from an sqlcipher export (taken from here)
sqlcipher/sqlcipher -json -noheader "db.sqlite" "PRAGMA key = \"x'"MY-ENCRYPTION-KEY"'\";PRAGMA encoding = \"UTF-8\";select * from messages_fts_data;" > messages_fts_data.json
but when I try to load the content with a Python3 script I'm getting problems with the encoding:
>>> json.load(open("messages_fts_data.json"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.7/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 32: invalid start byte
(I'm cheating a bit here: in my 'real' code I'm stripping off b'[{"ok":"ok"}]\n' first, but since I'm getting the error when reading the file, it's definitely not just a JSON error.)
When I try to handle the encoding manually by specifying encoding="utf-8" in the load() or open() commands, I'm getting the same error.
What am I doing wrong? I thought I told sqlcipher to generate UTF-8 encoded output, but that does not seem to work.
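One way to narrow this down (just a diagnostic sketch, not from the question): read the file in binary mode and look at the raw bytes around the failing offset before deciding on an encoding; if the exported table contains binary column data, the output may simply not be valid UTF-8 text at all.
# Diagnostic sketch: inspect the raw bytes; the reported failure is at position 32.
with open("messages_fts_data.json", "rb") as f:
    raw = f.read()
print(raw[:64])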

Error while reading a csv file by using csv module in python3

When I try to read a csv file, I get this error:
Traceback (most recent call last):
File "/root/Downloads/csvafa.py", line 4, in <module>
for i in a:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The code that I used:
import csv
with open('Book1.csv') as f:
    a = csv.reader(f)
    for i in a:
        print(i)
I even tried to change the encoding to latin1:
import csv
with open('Book1.csv', encoding='latin1') as f:
    a = csv.reader(f)
    for i in a:
        print(i)
After that I get this error message:
Traceback (most recent call last):
File "/root/Downloads/csvafa.py", line 4, in <module>
for i in a:
_csv.Error: line contains NUL
I am a beginner at Python.
This error is raised when Python tries to decode bytes that are not valid in the expected encoding. When a byte sequence can't be decoded as UTF-8, Python raises a UnicodeDecodeError. You can try encoding='latin-1' or 'iso-8859-1'.
import pandas as pd
dataset = pd.read_csv('Book1.csv', encoding='ISO-8859-1')
It can also be that the data is compressed. Have a look at this answer.
I would try reading the file with utf-8 encoding.
Another solution might be this answer.
It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get.
You could try decompressing the data on the fly:
import gzip
import pandas as pd

with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)
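To confirm the compression guess before changing any code, a quick check of the first two bytes (a small sketch, assuming the asker's Book1.csv) would look like:
# gzip files start with the magic number 0x1f 0x8b.
with open('Book1.csv', 'rb') as f:
    magic = f.read(2)
print(magic == b'\x1f\x8b')  # True means the file is gzip-compressed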

What is this error when I try to parse a simple pcap file?

import dpkt
f = open('gtp.pcap')
pcap = dpkt.pcap.Reader(f)
for ts, buf in pcap:
    eth = dpkt.ethernet.Ethernet(buf)
    print(eth)
Traceback (most recent call last):
File "new.py", line 4, in <module>
pcap = dpkt.pcap.Reader(f)
File "/home/user/gtp_gaurang/venv/lib/python3.5/site-packages/dpkt/pcap.py", line 244, in __init__
buf = self.__f.read(FileHdr.__hdr_len__)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 16: invalid start byte
I am running this simple pcap parser code, but it shows the above error. Can anyone please help?
Can you please check this link?
Related Answer
According to that answer, the UTF-8 decoder encounters an invalid byte that it cannot decode. If you read the file in binary mode, no decoding happens and the file contents stay as bytes, so the error does not occur.
Open the file in binary mode
f = open('gtp.pcap', 'rb')
pcap = dpkt.pcap.Reader(f)
...
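Putting that together with the original snippet, the corrected version (a sketch, using the same file name) would be:
import dpkt

# Open the capture in binary mode so dpkt reads raw bytes instead of decoded text.
with open('gtp.pcap', 'rb') as f:
    pcap = dpkt.pcap.Reader(f)
    for ts, buf in pcap:
        eth = dpkt.ethernet.Ethernet(buf)
        print(eth)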

How to convert large binary file into pickle dictionary in python?

I am trying to convert a large binary file that contains Arabic words with 300-dimensional vectors into a pickle dictionary.
What I have written so far is:
import pickle
ArabicDict = {}
with open('cc.ar.300.bin', encoding='utf-8') as lex:
    for token in lex:
        for line in lex.readlines():
            data = line.split()
            ArabicDict[data[0]] = float(data[1])
pickle.dump(ArabicDict, open("ArabicDictionary.p", "wb"))
The error which I am getting is:
Traceback (most recent call last):
File "E:\Dataset", line 4, in <module>
for token in lex:
File "E:\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
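For reference, a hedged sketch of an alternative approach, assuming cc.ar.300.bin is a fastText binary model (which its name suggests): a binary model cannot be iterated as UTF-8 text, but it can be loaded with the fasttext library and the vectors pickled from there.
# Sketch only: assumes cc.ar.300.bin is a fastText binary model.
import pickle
import fasttext

model = fasttext.load_model("cc.ar.300.bin")
ArabicDict = {word: model.get_word_vector(word) for word in model.words}

with open("ArabicDictionary.p", "wb") as out:
    pickle.dump(ArabicDict, out)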

Issue loading a model of Spanish data

I'm trying to load a model that contains Spanish words using gensim-1.0 in python3.5, but when I do gensim.models.KeyedVectors.load_word2vec_format(mymodel), the CLI says this:
Traceback (most recent call last):
File "./prueba.py", line 30, in <module>
model = KeyedVectors.load_word2vec_format('./data/WikiModelEsp/wiki.size.800.window.5.mincount.50.new.model', binary=True)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 192, in load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
File "/usr/local/lib/python3.5/dist-packages/gensim/utils.py", line 231, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I tried calling the load function with encoding='latin1' and binary=True, but it still doesn't work.
Did you try with just the load function? Like this one:
model = KeyedVectors.load(path_model)
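For context (an assumption about the file, since the thread does not say how it was produced): KeyedVectors.load expects a model saved with gensim's own save(), while load_word2vec_format expects the original word2vec text or binary format, so which call works depends on how the wiki model file was written.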
