How to convert a large binary file into a pickle dictionary in Python?

I am trying to convert a large binary file containing Arabic words with 300-dimension vectors into a pickle dictionary.
What I have written so far is:
import pickle
ArabicDict = {}
with open('cc.ar.300.bin', encoding='utf-8') as lex:
    for token in lex:
        for line in lex.readlines():
            data = line.split()
            ArabicDict[data[0]] = float(data[1])
pickle.dump(ArabicDict, open("ArabicDictionary.p", "wb"))
The error which I am getting is:
Traceback (most recent call last):
File "E:\Dataset", line 4, in <module>
for token in lex:
File "E:\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
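The .bin file is a binary fastText model and cannot be decoded as UTF-8 text, which is exactly what the traceback reports. fastText also distributes a plain-text .vec file whose lines are "word v1 v2 …". Below is a minimal sketch of parsing that text format into a dictionary of vectors and pickling it; the two-word sample data is made up for illustration and stands in for the real cc.ar.300.vec file.

```python
import io
import pickle

def load_vectors(stream):
    """Parse fastText .vec-style lines ('word v1 v2 ...') into a dict.

    Real .vec files begin with a 'count dimension' header line,
    which should be skipped before calling this.
    """
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Made-up sample standing in for the real vector file
sample = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vectors = load_vectors(sample)

# Round-trip through pickle, as the question intends
payload = pickle.dumps(vectors)
restored = pickle.loads(payload)
```

Note that the original code stored only data[1] (a single float) per word; keeping the whole slice data[1:] preserves the full 300-dimension vector.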

Related

Loading a dictionary saved as a msgpack with symspell

I am trying to use SymSpell in Python to spell-check some old Spanish texts. Since they are all texts, I need a dictionary that has old Spanish words, so I downloaded the large dictionary they share here, which is a msgpack.
According to the basic usage, I can load a dictionary using this code
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary = pkg_resources.resource_filename(
    "symspellpy", "dictionary.txt"
)
sym_spell.load_dictionary(dictionary, term_index=0, count_index=1)
as shown here
But when I try it with the msgpack file like this
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary = pkg_resources.resource_filename(
    "symspellpy", "large_es.msgpack"
)
sym_spell.load_dictionary(dictionary, term_index=0, count_index=1)
I get this error
Traceback (most recent call last):
File ".../utils/quality_check.py", line 24, in <module>
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
File ".../lib/python3.8/site-packages/symspellpy/symspellpy.py", line 346, in load_dictionary
return self._load_dictionary_stream(
File ".../lib/python3.8/site-packages/symspellpy/symspellpy.py", line 1122, in _load_dictionary_stream
for line in corpus_stream:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
I know this means the file is supposed to be a txt file, but does anyone have an idea how I can load a frequency dictionary stored in a msgpack file with SymSpell in Python?
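One hedged workaround, assuming you can first deserialize the msgpack file into a {term: count} mapping (e.g. with the third-party msgpack package): load_dictionary expects plain-text "term count" lines, so you can rewrite the unpacked mapping in that format and load it normally. A stdlib-only sketch of the rewriting step, with made-up frequencies standing in for the unpacked msgpack contents:

```python
def to_frequency_lines(freq):
    """Render a {term: count} dict as the 'term count' lines that
    symspellpy's load_dictionary (term_index=0, count_index=1) expects."""
    return "\n".join(f"{term} {count}" for term, count in freq.items())

# Made-up frequencies standing in for the unpacked msgpack contents
freq = {"hola": 120, "mundo": 75}
text = to_frequency_lines(freq)
# text could now be written to a .txt file and passed to load_dictionary
```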

Error while reading a CSV file using the csv module in Python 3

When I try to read a CSV file I get this error:
Traceback (most recent call last):
File "/root/Downloads/csvafa.py", line 4, in <module>
for i in a:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The code that I used:
import csv
with open('Book1.csv') as f:
    a = csv.reader(f)
    for i in a:
        print(i)
I even tried to change the encoding to latin1:
import csv
with open('Book1.csv', encoding='latin1') as f:
    a = csv.reader(f)
    for i in a:
        print(i)
After that I get this error message:
Traceback (most recent call last):
File "/root/Downloads/csvafa.py", line 4, in <module>
for i in a:
_csv.Error: line contains NUL
I am a beginner in Python.
This error is raised when the file's bytes are not valid UTF-8. When a byte sequence can't be decoded with the assumed encoding (UTF-8), Python raises a UnicodeDecodeError. You can try encoding='latin-1' or 'iso-8859-1', which accept any byte value:
import pandas as pd
dataset = pd.read_csv('Book1.csv', encoding='ISO-8859-1')
It can also be that the data is compressed. Have a look at this answer.
I would try reading the file in utf-8 encoding.
Another solution might be this answer.
It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get.
You could try decompressing the data on the fly:
import gzip
import pandas as pd

with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)
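To illustrate the magic-number check mentioned above, here is a self-contained sketch that compresses a tiny made-up CSV (standing in for the real Book1.csv) and sniffs the first two bytes before deciding how to read it:

```python
import gzip

# Compress a tiny made-up CSV so we have gzipped bytes to inspect
raw = gzip.compress(b"a,b\n1,2\n")

# gzip streams always start with the magic bytes 0x1f 0x8b
is_gzipped = raw[:2] == b"\x1f\x8b"
text = gzip.decompress(raw).decode("utf-8") if is_gzipped else raw.decode("utf-8")
```

The 0x8b in the original traceback ("can't decode byte 0x8b in position 1") is the second byte of exactly this signature, which is why gzipped data is the likely culprit.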

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 756

I'm unable to retrieve the data from a Microsoft Excel document. I've tried using encoding 'Latin-1' or 'UTF-8', but it gives me hundreds of \x00's in the terminal. Is there any way I can retrieve the data and output it to a text file?
This is what I'm running on the terminal and the error I get:
PS C:\Users\Andy-\Desktop> python.exe SRT411-Lab2.py Lab2Data.xlsx
Traceback (most recent call last):
File "SRT411-Lab2.py", line 9, in <module>
lines = file.readlines()
File "C:\ProgramFiles\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 756: character maps to <undefined>
Any help is greatly appreciated!
#!/usr/bin/python3
import sys
filename = sys.argv[1]
print(filename)
file = open(filename, 'r')
lines = file.readlines()
file.close()
print(lines)
I'd probably convert the Excel file to a CSV file and use pandas to parse it.
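The underlying issue is that .xlsx is a ZIP container holding XML parts, not a text file, which is why readlines() with a text codec fails on it. A small stdlib sketch demonstrating the container format (building a toy archive in memory rather than reading the real Lab2Data.xlsx); for actual parsing, pandas.read_excel or openpyxl is the usual route:

```python
import io
import zipfile

# Build a toy in-memory archive shaped like an .xlsx container
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")

# zipfile recognises it as a ZIP archive, as it would a real .xlsx
is_zip = zipfile.is_zipfile(io.BytesIO(buf.getvalue()))
```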

What is this error when I try to parse a simple pcap file?

import dpkt

f = open('gtp.pcap')
pcap = dpkt.pcap.Reader(f)
for ts, buf in pcap:
    eth = dpkt.ethernet.Ethernet(buf)
    print(eth)
Traceback (most recent call last):
File "new.py", line 4, in <module>
pcap = dpkt.pcap.Reader(f)
File "/home/user/gtp_gaurang/venv/lib/python3.5/site-packages/dpkt/pcap.py", line 244, in __init__
buf = self.__f.read(FileHdr.__hdr_len__)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 16: invalid start byte
I am running this simple pcap parser code, but it shows the above error. Can anyone please help?
Can you please check this link.
Related Answer
According to the answer's suggestion, UTF-8 encounters an invalid byte which it cannot decode. If you read your file in binary mode, this error will not occur, because no decoding happens and the file contents remain bytes.
Open the file in binary mode
f = open('gtp.pcap', 'rb')
pcap = dpkt.pcap.Reader(f)
...
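To see why text mode fails, here is a sketch that builds a minimal little-endian pcap global header with struct (the field values are illustrative): the leading magic number 0xa1b2c3d4 serializes to bytes that are not valid UTF-8, so the file must be read in binary mode.

```python
import struct

# Little-endian pcap global header: magic, version major/minor,
# thiszone, sigfigs, snaplen, network (values illustrative)
header = struct.pack("<IHHiIII", 0xa1b2c3d4, 2, 4, 0, 0, 65535, 1)

# The magic number round-trips cleanly when read back as binary
(magic,) = struct.unpack("<I", header[:4])

# The same bytes are not valid UTF-8, which is what the traceback reports
try:
    header.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```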

Some 'utf-8' codec can't decode byte

I get an error when I download a website using wget.
Code:
import threading
import urllib.request
import os
import re
import time
import json
def wget(url):
    #self.url = url
    data = os.popen('wget -qO- %s' % url).read()
    return data

print(wget("http://jamesholm.se/dj.php"))
Error:
Traceback (most recent call last):
File "stand-alone-check-url.py", line 13, in <module>
print (wget("http://jamesholm.se/dj.php"))
File "stand-alone-check-url.py", line 10, in wget
data = os.popen('wget -qO- %s'% url).read()
File "/usr/local/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 13133: invalid start byte
How to overcome this error?
You can't decode arbitrary byte sequences as utf-8 encoded text:
>>> b'\xa9'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
The page indicates that it uses utf-8 but the actual data that the server sends is not utf-8. It happens.
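If you just need the text to survive stray bytes, the stdlib's error handlers are one option: decode(..., errors="replace") substitutes the U+FFFD replacement character for undecodable bytes instead of raising. A sketch using the byte from this traceback:

```python
# 0x9a is the byte the traceback complains about; with errors="replace"
# it becomes the U+FFFD replacement character instead of raising
data = b"S\x9aben - Ostwind Rec"
text = data.decode("utf-8", errors="replace")
```

This loses the original character (unlike the UnicodeDammit approach below, which can often recover it), but it never fails.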
There is bs4.UnicodeDammit that allows you to handle data with inconsistent encodings:
import bs4 # $ pip install beautifulsoup4
print(bs4.UnicodeDammit.detwingle(b'S\x9aben - Ostwind Rec').decode('utf-8'))
# -> Sšben - Ostwind Rec
Instead of wget, use the requests Python module.
>>> import requests
>>> data = requests.get("http://jamesholm.se/dj.php").text
>>> print(data)
