How do I use importlib.resources with pickle files?

I'm trying to load a pickle file with importlib.resources, but I'm getting the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The bit that is raising the error is:
with importlib.resources.open_text("directory_with_pickle_file", "pickle_file.pkl") as f:
    data = pickle.load(f)
I'm certain that the file (pickle_file.pkl) was created with pickle.dump.
What am I doing wrong?

Through lots of trial and error I figured out that importlib.resources has a read_binary function which can be used to read pickled files like so:
raw = importlib.resources.read_binary("directory_with_pickle_file", "pickle_file.pkl")
data = pickle.loads(raw)
Here, data is the unpickled object.
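On newer Python versions (3.9+), the files() API is the recommended replacement for read_binary; a minimal sketch, assuming the same package and file names as above:

import importlib.resources
import pickle

# Open the resource in binary mode so pickle sees raw bytes, not decoded text.
resource = importlib.resources.files("directory_with_pickle_file").joinpath("pickle_file.pkl")
with resource.open("rb") as f:
    data = pickle.load(f)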

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake
rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
File "extract-keywords.py", line 11, in <module>
text = infile.read()
File "c:\python36\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th character of the file is a "u", so I don't know where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We have to know about the code which wrote the files in the first place to tell the correct way to decode. However, the ’ character is 0x92 byte in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
text = infile.read()
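If the encoding varies from file to file, a detection library can help guess each one. A sketch using the third-party chardet package (an assumption: it is installed, and its per-file guess is good enough):

import chardet

# Read raw bytes first, then decode with the detected encoding.
with open("files/" + filename.name, "rb") as infile:
    raw = infile.read()
guess = chardet.detect(raw)           # e.g. {'encoding': 'Windows-1252', ...}
text = raw.decode(guess["encoding"])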
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder whenever possible; try that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace":
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
...     f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
...     print(f.read())  # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
...     print(f.read())  # using incorrect encoding and replacing errors
...
this is a right quote: �
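Besides "replace", Python's codecs offer other built-in error handlers that may suit batch processing; for instance, "ignore" silently drops undecodable bytes, and "backslashreplace" keeps them visible as escape sequences:

with open("/tmp/f.txt", encoding="utf-8", errors="ignore") as f:
    print(f.read())    # this is a right quote:
with open("/tmp/f.txt", encoding="utf-8", errors="backslashreplace") as f:
    print(f.read())    # this is a right quote: \x92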

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte while reading a text file

I am training a word2vec model, using about 700 text files as my corpus. But when I start reading the files after the preprocessing step, I get the error mentioned in the title. The code is as follows:
class MyCorpus(object):
    def __iter__(self):
        for i in ceo_path:  # ceo_path contains the absolute paths of all text files
            file = open(i, 'r', encoding='utf-8')
            text = file.read()
            # ...
            # ... text preprocessing steps
            # ...
            yield final_text  # returns the preprocessed text
sentences = MyCorpus()
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

# training the model
cores = multiprocessing.cpu_count()
w2v_model = Word2Vec(min_count=5,
                     iter=30,
                     window=3,
                     size=200,
                     sample=6e-5,
                     alpha=0.025,
                     min_alpha=0.0001,
                     negative=20,
                     workers=cores-1,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
w2v_model.save('ceo1.model')
The error that I am getting is:
Traceback (most recent call last):
File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 131, in <module>
w2v_model.build_vocab(sentences)
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\base_any2vec.py", line 921, in build_vocab
total_words, corpus_count = self.vocabulary.scan_vocab(
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1403, in scan_vocab
total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1372, in _scan_vocab
for sentence_no, sentence in enumerate(sentences):
File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 65, in __iter__
text = file.read()
File "C:\Users\name\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I am not able to understand the error, as I am new to this. I wasn't getting the error when reading the text files before I switched to the __iter__ approach that sends the data in chunks, as I am doing currently.
It looks like one of your files doesn't have proper utf-8-encoded text.
(Your Word2Vec-related code probably isn't necessary for hitting the error, at all. You could probably trigger the same error with just: sentences_list = list(MyCorpus()).)
To find which file, two different possibilities might be:
Change your MyCorpus class so that it prints the path of each file before it tries to read it.
Add a Python try: ... except UnicodeDecodeError: ... statement around the read, and when the exception is caught, print the offending filename, as sketched below.
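A minimal sketch of that second option, reusing the question's ceo_path list:

class MyCorpus(object):
    def __iter__(self):
        for path in ceo_path:
            try:
                with open(path, 'r', encoding='utf-8') as f:
                    text = f.read()
            except UnicodeDecodeError:
                print("undecodable file:", path)  # the offending filename
                continue
            # ... text preprocessing steps ...
            yield text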
Once you know the file involved, you may want to fix the file, or change the code to be able to handle the files you have.
Maybe they're not really in utf-8 encoding, in which case you'd specify a different encoding.
Maybe just one or a few have problems, and it'd be OK to just print their names for later investigation, and skip them. (You could use the exception-handling approach above to do that.)
Maybe, those that aren't utf-8 are always in some other platform-specific encoding, so when utf-8 fails, you could try a 2nd encoding.
Separately, when you solve the encoding issue, your iterable MyCorpus is not yet returning what the Word2Vec class expects.
It doesn't want full text plain strings. It needs those texts to already be broken up into individual word-tokens.
(Often, simply performing a .split() on a string is close-enough-to-real-tokenization to try as a starting point, but usually, projects use some more-sophisticated punctuation-aware tokenization.)
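For example, a variant of MyCorpus that yields token lists rather than raw strings, using plain .split() as the stand-in tokenizer:

class MyCorpus(object):
    def __iter__(self):
        for path in ceo_path:
            with open(path, 'r', encoding='utf-8') as f:
                yield f.read().split()  # a list of word-tokens, as Word2Vec expects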

Trying to convert CSV to Excel file in python

I have tried different code and checked online for a solution, but I'm not having any success with the code below.
df_new = pd.read_csv(path+'output.csv')
writer = pd.ExcelWriter(path+'output.xlsx')
df_new.to_excel(writer, index = False)
writer.save()
I am getting the below error when I try to execute it. I have tried adding encoding as latin, but it is not working. Please guide me. When I use ignore_error it runs, but does not produce any result.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
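One clue worth noting: 0xff at position 0 is never valid UTF-8, but it is exactly how a UTF-16 byte-order mark (FF FE) begins, so the CSV may be UTF-16-encoded rather than latin. A minimal sketch under that assumption, reusing the question's path variable:

import pandas as pd

df_new = pd.read_csv(path + 'output.csv', encoding='utf-16')  # assumes a UTF-16 file
df_new.to_excel(path + 'output.xlsx', index=False)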

Need to open and read a .bin file in Python. Getting error: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte

I am trying to read and convert binary into text that anyone could read. I am having trouble with the error message:
'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
I have gone through: Reading binary file and looping over each byte
trying multiple ways of opening and reading the binary file. After reading about this error message, most people either had trouble with .csv files, or had to change utf-8 to utf-16. But reading up on https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes , it seems Python does not use UTF-16 anymore.
Also, if I add encoding = utf-16/32, the error states: binary mode doesn't take an encoding argument
Here is my code:
with open(b"P:\Projects\2018\1809-0068-R\Bin_Files\snap-pac-eb1-R10.0d.bin", "rb") as f:
byte = f.read(1)
while byte != b"":
byte = f.read(1)
print(f)
I am expecting to be able to read and write to the binary file. I would like to translate it to Hex and then to text (or to legible text somehow), but I think I have to go through this step before. If anyone could help with what I am missing, that would be greatly appreciated! Any way to open and read a binary file would be accepted. Thank you for your time!
I am not sure but this might help:
import binascii
with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)                                 # first 6 bytes of the file
b = bytearray(header)
binary = [bin(i)[2:].zfill(8) for i in b]              # each byte as an 8-bit string
n = int('0b' + ''.join(binary), 2)                     # all 48 bits as one integer
nn = binascii.unhexlify('%x' % n)                      # the integer back to bytes
nnn = nn.decode("ascii")[0:-1]                         # decode as ASCII, dropping the last byte
result = '.'.join(str(ord(c)) for c in nnn[0:-1])      # join the character codes with dots
print(result)
Output:
16.0.8.0
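If the goal is a general hex-plus-text view of the whole file rather than just the header, here is a minimal hexdump sketch using only the standard library:

with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    offset = 0
    while True:
        chunk = f.read(16)
        if not chunk:
            break
        hex_part = ' '.join('%02x' % b for b in chunk)
        # show printable ASCII as-is and everything else as '.'
        text_part = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        print('%08x  %-47s  %s' % (offset, hex_part, text_part))
        offset += 16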

Python 3.x, pandas, csv, utf-8 error

I am trying to import a dataset using pandas and getting the following error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10: invalid start byte
I read about encodings and tried to use one as:
df=pd.read_csv("file.csv",encoding="ISO-xxxx")
It showed an invalid syntax error.
I am sharing the link to my data if you guys want to have a look: https://www.kaggle.com/venkatramakrishnan/india-water-quality-data
import pandas as pd
df = pd.read_csv('IndiaAffectedWaterQualityAreas.csv',encoding = 'latin-1')
The above code is one of the solutions, written in Python 3.6 and pandas 0.20.1.
Why does this problem occur? The file contains some special characters that the default utf-8 codec cannot decode. If you have the raw data, try writing the csv with pandas using the following code:
df.to_csv('IndiaAffectedWaterQualityAreas.csv',encoding = 'latin-1')
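A note on why latin-1 tends to "work" here: it maps every possible byte value to a character, so decoding never fails, even when the resulting characters are wrong. The 0xa0 byte from the traceback, for example, decodes to a non-breaking space:

print(b'\xa0'.decode('latin-1'))                    # '\xa0', a non-breaking space
print(b'\xa0'.decode('utf-8', errors='replace'))    # '�' under utf-8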
