Need to open and read a .bin file in Python. Getting error: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte - python-3.x

I am trying to read and convert binary into text that anyone could read. I am having trouble with the error message:
'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
I have gone through: Reading binary file and looping over each byte
and tried multiple ways of opening and reading the binary file. After reading about this error message, most people either had trouble with .csv files or had to change UTF-8 to UTF-16. But reading up on https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes , it seems Python does not use UTF-16 anymore.
Also, if I add encoding='utf-16' (or 'utf-32'), the error states: binary mode doesn't take an encoding argument
Here is my code:
with open(b"P:\Projects\2018\1809-0068-R\Bin_Files\snap-pac-eb1-R10.0d.bin", "rb") as f:
byte = f.read(1)
while byte != b"":
byte = f.read(1)
print(f)
I am expecting to be able to read and write the binary file. I would like to translate it to hex and then to text (or to legible text somehow), but I think I have to get through this step first. If anyone could help with what I am missing, that would be greatly appreciated! Any way to open and read a binary file would be accepted. Thank you for your time!

I am not sure but this might help:
import binascii

with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)

b = bytearray(header)
binary = [bin(i)[2:].zfill(8) for i in b]
n = int('0b' + ''.join(binary), 2)
nn = binascii.unhexlify('%x' % n)
nnn = nn.decode("ascii")[0:-1]
result = '.'.join(str(ord(c)) for c in nnn[0:-1])
print(result)
Output:
16.0.8.0
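If the aim is just to see all of the raw bytes as hex alongside whatever readable text they contain, a plain hex dump may be enough. This is only a minimal sketch: it reuses the file name from the question, and the 16-bytes-per-row layout and formatting are illustrative choices, not part of the original answer:
import string

# Read the whole file as bytes; no decoding is attempted in binary mode.
with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    data = f.read()

# Print offset, hex values, and a printable-ASCII column, 16 bytes per row.
for offset in range(0, len(data), 16):
    chunk = data[offset:offset + 16]
    hex_part = ' '.join(f'{b:02x}' for b in chunk)
    text_part = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
    print(f'{offset:08x}  {hex_part:<47}  {text_part}')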

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake

rake_nltk_var = Rake()
directory = 'files'
results = {}

for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
            rake_nltk_var.extract_keywords_from_text(text)
            keyword_extracted = rake_nltk_var.get_ranked_phrases()
            results[filename.name] = keyword_extracted

with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
  File "extract-keywords.py", line 11, in <module>
    text = infile.read()
  File "c:\python36\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th char of the file is a "u" so IDK where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We would have to know about the code which wrote the files in the first place to tell the correct way to decode. However, the ’ character is byte 0x92 in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
text = infile.read()
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible; try that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace"
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
... f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
... print(f.read()) # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
... print(f.read()) # using incorrect encoding and replacing errors
this is a right quote: �
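Applied to the loop from the question, that could look like the sketch below. It reuses rake_nltk_var, directory and results from the question's code, and whether to keep errors="replace" or switch the encoding to cp1252 depends on how the files were really produced:
for filename in os.scandir(directory):
    if filename.is_file():
        # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
        with open("files/" + filename.name, encoding="utf-8", errors="replace") as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        results[filename.name] = rake_nltk_var.get_ranked_phrases()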

Reading NULL bytes using pandas.read_csv()

I have a file in CSV format which contains NULL bytes (maybe 0x84) in each line. I need to read this file using the C engine of pd.read_csv().
These values cause an error while reading: 'utf-8' codec can't decode byte 0x84 in position 14.
Is there any way to fix it without changing the file?
Try these options and see if one helps:
Option 1:
Set the engine as python.
pd.read_csv(filename, engine='python')
Option 2:
Try UTF-16, because the error could also mean the file is encoded in UTF-16. Or change the encoding to whatever the correct format is, for example:
encoding = "cp1252"
encoding = "ISO-8859-1"
Option 3:
Read the file as bytes
with open(filename, 'rb') as f:
    data = f.read()
The b in the mode specifier passed to open() states that the file should be treated as binary, so the contents remain bytes. No decoding attempt happens this way.
Alternatively, you can use the open method from the codecs module to read in the file:
import codecs

with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()
This will strip out (ignore) the undecodable characters and return the string without them. Only use this if you need to strip them, not convert them.
https://docs.python.org/3/howto/unicode.html#the-string-type
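For completeness, the encoding can also be passed straight to pd.read_csv while keeping the C engine, and recent pandas versions (1.3 and later) accept an encoding_errors argument mirroring open()'s errors parameter. A sketch, assuming the file turns out to be cp1252:
import pandas as pd

# Decode with an explicit single-byte encoding instead of the UTF-8 default.
df = pd.read_csv(filename, encoding="cp1252")

# Or keep UTF-8 and replace undecodable bytes (pandas >= 1.3).
df = pd.read_csv(filename, encoding="utf-8", encoding_errors="replace")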

Trying to convert CSV to Excel file in python

I have tried different code and checked online for a solution, but I am not getting anywhere with the code below.
df_new = pd.read_csv(path+'output.csv')
writer = pd.ExcelWriter(path+'output.xlsx')
df_new.to_excel(writer, index = False)
writer.save()
I am getting the error below when I try to execute it. I have tried adding encoding='latin', but it is not working. Please guide me. When I ignore the errors it runs, but does not produce any result.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
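A 0xff byte at position 0 often means the CSV starts with a UTF-16 byte-order mark rather than being Latin-1. A minimal sketch of passing that encoding through; the path and file names are taken from the question, and the UTF-16 guess is an assumption to verify against the actual file:
import pandas as pd

# 0xff 0xfe at the very start of a file is the UTF-16 little-endian BOM,
# so try decoding the CSV as UTF-16 before writing the Excel file.
df_new = pd.read_csv(path + 'output.csv', encoding='utf-16')

with pd.ExcelWriter(path + 'output.xlsx') as writer:
    df_new.to_excel(writer, index=False)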

Read .txt with emoji characters in python

I try to read a chat history with smilies in it, but I get the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 38: character maps to <undefined>
My code looks like this:
file_name = "chat_file.txt"
chat = open(file_name)
chatText = chat.read() # read data
chat.close()
print(chatText)
I am pretty certain that it's because of elements like: ❤
How can I apply the correct transformation format / what is the correct file encoding so Python can read these elements?
Never open text files without specifying their encoding.
Also, use with blocks; these automatically call .close() so you don't have to.
file_name = "chat_file.txt"
with open(chat_file, encoding="utf8") as chat:
chat_text = chat.read()
print(chat_text)
iso-8859-1 is a legacy encoding, which means it cannot represent emoji. For emoji, the text file has to be in a Unicode encoding, and the most common Unicode encoding is UTF-8.
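As a quick check, here is a minimal sketch that writes an emoji-bearing line with UTF-8 and reads it back; the file name and sample text are just illustrative:
text = "I ❤ this chat \U0001F600"

# Write and read back using UTF-8, which can represent any Unicode character.
with open("chat_file.txt", "w", encoding="utf8") as f:
    f.write(text)

with open("chat_file.txt", encoding="utf8") as f:
    print(f.read())  # I ❤ this chat 😀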

Trying to output hex data as readable text in Python 3.6

I am trying to read hex values from specific offsets in a file, and then show that as normal text. Upon reading the data from the file and saving it to a variable named uName, and then printing it, this is what I get:
Card name is: b'\x95\xdc\x00'
Here's the code:
cardPath = str(input("Enter card path: "))
print("Card name is: ", end="")

with open(cardPath, "rb+") as f:
    f.seek(0x00000042)
    uName = f.read(3)
    print(uName)
How can I remove the 'b' I am getting at the beginning? And how can I remove the '\x'es so that b'\x95\xdc\x00' becomes 95dc00? If I can do that, then I guess I can convert it to text using binascii.
I am sorry if my mistake is really really stupid because I don't have much experience with Python.
Strings that start with b in Python are byte strings.
Usually, you can use decode() or str(byte_string, 'UTF-8') to decode a byte string (i.e. a string starting with b') into a regular string.
EXAMPLE
>>> str(b'\x70\x79\x74\x68\x6F\x6E', 'UTF-8')
'python'
>>> b'\x70\x79\x74\x68\x6F\x6E'.decode()
'python'
However, in your case this raises a UnicodeDecodeError during decoding:
>>> str(b'\x95\xdc\x00', 'UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I guess you need to find out the encoding for your file and then specify it when you open the file, like below:
open("u.item", encoding="THE_ENCODING_YOU_FOUND")
