Reading NULL bytes using pandas.read_csv() - python-3.x

I have a CSV file which contains NULL bytes (maybe 0x84) in each line. I need to read this file using the C engine of pd.read_csv().
These values cause an error while reading: 'utf-8' codec can't decode byte 0x84 in position 14.
Is there any way to fix this without changing the file?

Try one of these options and see if it helps:
Option 1:
Set the engine as python.
pd.read_csv(filename, engine='python')
Option 2:
Try UTF-16, since the error could also mean the file is encoded in UTF-16. Or change the encoding to whatever is actually correct for the file, for example:
encoding = "cp1252"
encoding = "ISO-8859-1"
Option 3:
Read the file as bytes:
with open(filename, 'rb') as f:
    data = f.read()
The b in the mode specifier tells open() to treat the file as binary, so the contents stay bytes; no decoding attempt happens this way.
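If you need text afterwards, you can decode explicitly and choose the error handling yourself; for example, errors='replace' substitutes the U+FFFD replacement character for each undecodable byte:
text = data.decode('utf-8', errors='replace')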
Alternatively, you can use the open() function from the codecs module to read the file:
import codecs
with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()
This strips out (ignores) the undecodable characters and returns the string without them. Only use this if you want to drop those characters, not convert them.
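Note that in Python 3 the built-in open() accepts the same errors parameter, so the codecs module is not required:
with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()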
https://docs.python.org/3/howto/unicode.html#the-string-type

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake

rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
            rake_nltk_var.extract_keywords_from_text(text)
            keyword_extracted = rake_nltk_var.get_ranked_phrases()
            results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
  File "extract-keywords.py", line 11, in <module>
    text = infile.read()
  File "c:\python36\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th char of the file is a "u" so IDK where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We would have to know about the code that wrote the files in the first place to tell the correct way to decode them. However, the ’ character is the 0x92 byte in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
    text = infile.read()
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible; try that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace":
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
...     f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
...     print(f.read())  # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
...     print(f.read())  # using incorrect encoding and replacing errors
...
this is a right quote: �
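Applied to the loop from the question, that would be (unchanged apart from the added errors parameter):
with open("files/" + filename.name, encoding="utf-8", errors="replace") as infile:
    text = infile.read()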

Cannot save byte literals to csv file with Python

I am trying to write a program that encrypts text data input by the user and saves the encrypted information to a csv file. To use the stream cipher, I first convert the string data to bytes literals and then try to save it in this format. The problem comes when I re-read the csv file the next time I open the program: the data I saved as bytes type has been converted to string type, including the b''. Please refer to the code below.
IN:
from Crypto.Cipher import Salsa20
import pandas as pd

df = pd.DataFrame({'col1': ['secret info', 'more secret info'], 'col2': ['top secret stuff', 'hide from prying eyes']})
key = b'*Thirty-two byte (256 bits) key*'
nonce = b'*8 byte*'
cipher = Salsa20.new(key=key, nonce=nonce)
for col in df.columns:
    df[col] = df[col].apply(lambda a: a.encode('utf-8'))
    df[col] = df[col].apply(lambda a: cipher.encrypt(a))
print(f"Format of data in dataframe pre saving: {type(df.iloc[0, 0])}")
df.to_csv('my_data.csv', encoding='utf-8')
encrypted_df = pd.read_csv('my_data.csv', encoding='utf-8', index_col=0)
print(f"Format of data in re-read dataframe: {type(encrypted_df.iloc[0, 0])}")
OUT:
Format of data in dataframe pre saving: <class 'bytes'>
Format of data in re-read dataframe: <class 'str'>
Is there a way to read the csv file so that the data is of bytes type and not a string so that I can easily decrypt it?
I have tried:
Decoding the data back to strings prior to writing to the csv file; however, this raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 0: invalid start byte.
Stripping the b'' from the strings and then encoding back to bytes type; however, the encoder string-escapes the data, adding loads of backslashes, so I then can't decrypt the text.
I'm relatively new to coding and very new to encryption so simple answers would be highly appreciated.
As you have found, CSV format can only deal with strings. A common way to convert bytes to a string is Base64. Change your bytes into a Base64 string which you can put into your CSV file.
When you read the CSV file you will get your Base64 string back; you then need to convert the Base64 back into the original bytes.
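A minimal sketch of that round trip with the standard library's base64 module, reusing the dataframe and file name from the question:
import base64

# after encrypting: turn each bytes value into a Base64 text string
for col in df.columns:
    df[col] = df[col].apply(lambda b: base64.b64encode(b).decode('ascii'))
df.to_csv('my_data.csv', encoding='utf-8')

# after re-reading: turn the Base64 strings back into the original bytes
encrypted_df = pd.read_csv('my_data.csv', encoding='utf-8', index_col=0)
for col in encrypted_df.columns:
    encrypted_df[col] = encrypted_df[col].apply(base64.b64decode)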

Need to open and read a .bin file in Python. Getting error: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte

I am trying to read and convert binary into text that anyone could read. I am having trouble with the error message:
'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
I have gone through Reading binary file and looping over each byte,
trying multiple versions of opening and reading the binary file in some way. After reading about this error message, most people either had trouble with .csv files or had to change UTF-8 to UTF-16. But from reading https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes , Python does not use UTF-16 anymore.
Also, if I add encoding = utf-16/32, the error states: binary mode doesn't take an encoding argument
Here is my code:
with open(b"P:\Projects\2018\1809-0068-R\Bin_Files\snap-pac-eb1-R10.0d.bin", "rb") as f:
byte = f.read(1)
while byte != b"":
byte = f.read(1)
print(f)
I am expecting to be able to read and write to the binary file. I would like to translate it to Hex and then to text (or to legible text somehow), but I think I have to go through this step before. If anyone could help with what I am missing, that would be greatly appreciated! Any way to open and read a binary file would be accepted. Thank you for your time!
I am not sure but this might help:
import binascii
with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)

b = bytearray(header)
binary = [bin(i)[2:].zfill(8) for i in b]
n = int('0b' + ''.join(binary), 2)
nn = binascii.unhexlify('%x' % n)
nnn = nn.decode("ascii")[0:-1]
result = '.'.join(str(ord(c)) for c in nnn[0:-1])
print(result)
Output:
16.0.8.0
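For reference, a simpler equivalent: iterating over a bytes object already yields integers, so the same output can be produced without the binary-string and unhexlify round trip (assuming the same file and header layout as above):
with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)
print('.'.join(str(b) for b in header[:4]))  # e.g. 16.0.8.0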

Read .txt with emoji characters in python

I am trying to read a chat history with smilies in it, but I get the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 38: character maps to <undefined>
My code looks like this:
file_name = "chat_file.txt"
chat = open(file_name)
chatText = chat.read() # read data
chat.close()
print(chatText)
I am pretty certain that it's because of elements like: ❤
How can I use the correct Transformation Format / what is the correct file encoding so Python can read these elements?
Never open text files without specifying their encoding.
Also, use with blocks; they automatically call .close() so you don't have to.
file_name = "chat_file.txt"
with open(chat_file, encoding="utf8") as chat:
chat_text = chat.read()
print(chat_text)
iso-8859-1 is a legacy encoding, which means it cannot represent emoji. For emoji the text file has to be Unicode, and the most common encoding for Unicode is UTF-8.
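A quick REPL check of why a legacy codec fails here (the heart is U+2764, which has no mapping in cp1252/iso-8859-1; note that its UTF-8 form contains the 0x9d byte from the error above):
>>> '❤'.encode('utf-8')
b'\xe2\x9d\xa4'
>>> '❤'.encode('cp1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u2764' in position 0: character maps to <undefined>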

Python 3: Persist strings without b'

I am confused. This talk explains that you should only use unicode strings in your code, and when strings leave your code you should turn them into bytes. I did this for a csv file:
import csv

with open('keywords.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p.encode("utf-8"), ', '.join(keywords).encode("utf-8")])
This leads to an annoying effect where b' is added in front of every string; this didn't happen for me in Python 2.7. When I don't encode the strings before writing them into the csv file, the b' is not there, but don't I need to turn them into bytes when persisting? How do I write bytes into a file without this b' annoyance?
Stop trying to encode the individual strings; instead, specify the encoding for the entire file:
import csv

with open('keywords.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p, ', '.join(keywords)])
The reason your code goes wrong is that writerow expects strings but you're passing bytes, so it writes the repr() of the bytes, which has the extra b'...' around it. If you pass it strings and use the encoding parameter when you open the file, the strings will be encoded correctly for you.
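You can see that effect directly; the csv writer stringifies non-string values, and the string form of a bytes object includes the b'...':
>>> str(b'hello')
"b'hello'"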
See the csv documentation examples. One of these shows you how to set the encoding.
