binary input with an ASCII text header, read from stdin - python-3.x

I want to read a binary PNM image file from stdin. The file contains a header which is encoded as ASCII text, and a payload which is binary. As a simplified example of reading the header, I have created the following snippet:
#! /usr/bin/env python3
import sys
header = sys.stdin.readline()
print("header=["+header.strip()+"]")
I run it as "test.py" (from a Bash shell), and it works fine in this case:
$ printf "P5 1 1 255\n\x41" |./test.py
header=[P5 1 1 255]
However, a small change in the binary payload breaks it:
$ printf "P5 1 1 255\n\x81" |./test.py
Traceback (most recent call last):
File "./test.py", line 3, in <module>
header = sys.stdin.readline()
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
Is there an easy way to make this work in Python 3?

To read binary data, use a binary stream, e.g. one obtained with the TextIOBase.detach() method:
#!/usr/bin/env python3
import sys
sys.stdin = sys.stdin.detach() # convert to binary stream
header = sys.stdin.readline().decode('ascii') # b'\n'-terminated
print(header, end='')
print(repr(sys.stdin.read()))

From the docs, it is possible to read binary data (as type bytes) from stdin with sys.stdin.buffer.read():
To write or read binary data from/to the standard streams, use the
underlying binary buffer object. For example, to write bytes to
stdout, use sys.stdout.buffer.write(b'abc').
So this is one direction that you can take -- read the data in binary mode. readline() and various other functions still work. Once you have captured the ASCII string, it can be converted to text, using decode('ASCII'), for additional text-specific processing.
Alternatively, you can use io.TextIOWrapper() to indicate the use of the latin-1 character set on the input stream. With this, the implicit decode operation is essentially a pass-through -- the data will be of type str (which represents text), but each character maps 1-to-1 to an input byte (although a str may use more than one byte of storage per input byte).
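The 1-to-1 mapping described above can be checked directly. This is only a small sketch to illustrate the point; the variable names are made up for the example:

```python
# Every byte value 0..255 decodes to the code point with the same number.
raw = bytes(range(256))
text = raw.decode("latin-1")

assert len(text) == 256
assert [ord(ch) for ch in text] == list(raw)

# Encoding back with the same codec restores the original bytes exactly.
round_tripped = text.encode("latin-1")
assert round_tripped == raw
```

Because the round trip is lossless, a latin-1 str can carry arbitrary binary data, at the cost of having to call ord() to get the byte values back.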
Here's code that works in either mode:
#! /usr/bin/python3
import sys, io
BINARY=True ## either way works
if BINARY: istream = sys.stdin.buffer
else: istream = io.TextIOWrapper(sys.stdin.buffer,encoding='latin-1')
header = istream.readline()
if BINARY: header = header.decode('ASCII')
print("header=["+header.strip()+"]")
payload = istream.read()
print("len="+str(len(payload)))
for i in payload: print( i if BINARY else ord(i) )
Test every possible 1-pixel payload with the following Bash command:
for i in $(seq 0 255) ; do printf "P5 1 1 255\n\x$(printf %02x $i)" |./test.py ; done
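Building on the binary-mode approach, here is a minimal sketch of reading the header and the pixel data together. It assumes the simplified one-line header from the question ("P5 width height maxval"); real PNM files may spread the header over several lines and include comments, which this does not handle. The function name read_simple_pgm is made up for the example, and io.BytesIO stands in for sys.stdin.buffer so the snippet runs on its own:

```python
import io

def read_simple_pgm(stream):
    """Read a P5 image whose header is one line: 'P5 <width> <height> <maxval>'."""
    header = stream.readline().decode("ascii")
    magic, width, height, maxval = header.split()
    if magic != "P5":
        raise ValueError("not a binary PGM (P5) file: %r" % magic)
    width, height, maxval = int(width), int(height), int(maxval)
    # One byte per pixel when maxval < 256, otherwise two.
    bytes_per_pixel = 1 if maxval < 256 else 2
    expected = width * height * bytes_per_pixel
    payload = stream.read(expected)
    if len(payload) != expected:
        raise ValueError("truncated pixel data")
    return width, height, maxval, payload

# In a script this would be called as read_simple_pgm(sys.stdin.buffer).
demo = io.BytesIO(b"P5 1 1 255\n\x81")
result = read_simple_pgm(demo)
print(result)
```

The payload stays as bytes throughout, so no byte value can trigger a UnicodeDecodeError.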

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake
rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode = 'r') as infile:
            text = infile.read()
            rake_nltk_var.extract_keywords_from_text(text)
            keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
File "extract-keywords.py", line 11, in <module>
text = infile.read()
File "c:\python36\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th char of the file is a "u" so IDK where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We have to know about the code which wrote the files in the first place to tell the correct way to decode. However, the ’ character is 0x92 byte in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
    text = infile.read()
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible, so try and do that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace":
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
...     f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
...     print(f.read())  # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
...     print(f.read())  # using incorrect encoding and replacing errors
...
this is a right quote: �
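When a large batch of files mixes encodings, one option is to try strict decoders first and only fall back to replacement as a last resort. The sketch below is one possible way to do that, not a recommendation from the answer above; the helper name decode_with_fallback and the candidate list ("utf-8", then "cp1252") are assumptions for illustration. Note that a successful cp1252 decode does not prove the file really is cp1252, since most byte sequences decode under it:

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly, then fall back to lossy UTF-8 decoding."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: substitute U+FFFD for undecodable bytes.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"

sample = "this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}".encode("cp1252")
text, used = decode_with_fallback(sample)
print(used)
print(text)
```

To use it on files, open them in binary mode ("rb"), read the bytes, and pass them to the helper.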

Need to open and read a .bin file in Python. Getting error: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte

I am trying to read and convert binary into text that anyone could read. I am having trouble with the error message:
'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
I have gone through "Reading binary file and looping over each byte",
trying multiple ways of opening and reading the binary file. From reading about this error message, most people either had trouble with .csv files or had to change utf-8 to utf-16. But from reading https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes, my understanding is that Python does not use UTF-16 anymore.
Also, if I add encoding="utf-16" or encoding="utf-32", the error states: binary mode doesn't take an encoding argument
Here is my code:
with open(b"P:\Projects\2018\1809-0068-R\Bin_Files\snap-pac-eb1-R10.0d.bin", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        byte = f.read(1)
        print(f)
I am expecting to be able to read and write to the binary file. I would like to translate it to Hex and then to text (or to legible text somehow), but I think I have to go through this step before. If anyone could help with what I am missing, that would be greatly appreciated! Any way to open and read a binary file would be accepted. Thank you for your time!
I am not sure but this might help:
import binascii
with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)
    b = bytearray(header)
    binary=[bin(i)[2:].zfill(8) for i in b]
    n = int('0b'+''.join(binary), 2)
    nn = binascii.unhexlify('%x' % n)
    nnn=nn.decode("ascii")[0:-1]
    result='.'.join(str(ord(c)) for c in nnn[0:-1])
    print(result)
Output:
16.0.8.0
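Since iterating over a bytes object in Python 3 already yields integers, the same kind of dotted output can be produced without the round trip through a bit string. This is a simplified sketch, not the answer's exact algorithm; the sample bytes are made up to reproduce the output shown above, and the actual content of the .bin file is unknown:

```python
def dotted_decimal(data):
    """Join the integer value of each byte with dots."""
    return ".".join(str(b) for b in data)

header = b"\x10\x00\x08\x00"  # made-up sample bytes, not read from the real file
print(header.hex())            # 10000800
print(dotted_decimal(header))  # 16.0.8.0
```

With a real file, header would come from f.read(n) on a file opened with mode "rb".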

Trying to output hex data as readable text in Python 3.6

I am trying to read hex values from specific offsets in a file, and then show that as normal text. Upon reading the data from the file and saving it to a variable named uName, and then printing it, this is what I get:
Card name is: b'\x95\xdc\x00'
Here's the code:
cardPath = str(input("Enter card path: "))
print("Card name is: ", end="")
with open(cardPath, "rb+") as f:
    f.seek(0x00000042)
    uName = f.read(3)
print(uName)
How can I remove the 'b' I am getting at the beginning? And how can I remove the '\x'es so that b'\x95\xdc\x00' becomes 95dc00? If I can do that, then I guess I can convert it to text using binascii.
I am sorry if my mistake is really really stupid because I don't have much experience with Python.
In Python, a string literal that starts with b is a byte string (type bytes).
Usually, you can use decode() or str(byte_string, 'UTF-8') to decode a byte string (i.e. a string that starts with b') to a regular string.
EXAMPLE
str(b'\x70\x79\x74\x68\x6F\x6E','UTF-8')
'python'
b'\x70\x79\x74\x68\x6F\x6E'.decode()
'python'
However, in your case, decoding raises a UnicodeDecodeError.
str(b'\x95\xdc\x00','UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I guess you need to find out the encoding for your file and then specify it when you open the file, like below:
open("u.item", encoding="THE_ENCODING_YOU_FOUND")
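If the goal is only the hex digits (95dc00) rather than decoded text, no encoding is needed at all. A short sketch using the bytes from the question; the variable names are illustrative:

```python
import binascii

u_name = b"\x95\xdc\x00"

# bytes.hex() (Python 3.5+) returns the hex digits as a str, without b'' or \x.
hex_text = u_name.hex()
print(hex_text)  # 95dc00

# binascii.hexlify() gives the same digits as bytes, so decode them to str.
same = binascii.hexlify(u_name).decode("ascii")

# The reverse direction turns hex digits back into the original bytes.
restored = bytes.fromhex(hex_text)
```

The b prefix only appears because print() shows the repr of a bytes object; it is not part of the data.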

subprocess stdout string decoding not working

I'm using the following subprocess call to run a command-line tool. The tool doesn't print its output in one go; it writes to the console progressively, over multiple lines and over a period of time. The tool is bs1770gain and the command would be "path\to\bs1770gain.exe" "-i" "\path\to\audiofile.wav". By using the --loglevel parameter you can include more data, but you cannot remove the progressive results being written to stdout.
I need stdout to return a human readable string (hence the stdout_formatted operation):
with subprocess.Popen(list_of_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
    stdout, stderr = proc.communicate()
    stdout_formatted = stdout.decode('UTF-8')
    stderr_formatted = stderr.decode('UTF-8')
However I can only view the variable as a human readable string if I print it e.g.
In [23]: print(stdout_formatted )
nalyzing ... [1/2] "filename.wav":
integrated: -2.73 LUFS / -20.27 LU [2/2]
"filename2.wav":
integrated: -4.47 LUFS / -18.53 LU
[ALBUM]:
integrated: -3.52 LUFS / -19.48 LU done.
In [24]: stdout_formatted
Out[24]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\.......
In [6]: stdout
Out[6]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\......
In [4]: type(stdout)
Out[4]: bytes
In [5]: type(stdout_formatted)
Out[5]: str
If you look carefully, the human-readable chars are in the string (the first word is "analyzing").
I guessed that the stdout value needs decoding/encoding so I tried different ways:
stdout_formatted.encode("ascii")
Out[18]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g
stdout_formatted.encode("utf-8")
Out[17]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("utf-8")
Out[15]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("ascii")
Out[14]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
bytes(stdout).decode("ascii")
Out[13]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
I used a library called chardet to check the encoding of stdout:
import chardet
chardet.detect(stdout)
Out[26]: {'confidence': 1.0, 'encoding': 'ascii', 'language': ''}
I'm working on Windows 10 and am using Python 3.6 (the Anaconda package and its integrated Spyder IDE).
I'm kind of clutching at straws now - is it possible to capture what is displayed in the console when print is called in a variable or remove the unwanted bytecode in the stdout string?
You don't have UTF-8 data. You have UTF-16 data. UTF-16 uses at least two bytes for every character; characters in the ASCII and Latin-1 ranges (such as a) still use 2 bytes, but one of those bytes is always a \x00 NUL byte.
Because UTF-16 works in 2-byte units, the order of the bytes within each unit starts to matter. Encoders can pick between the two options; one is called Little Endian, the other Big Endian. Normally, encoders then include a Byte Order Mark at the very start, so that the decoder knows which of the two order options to use when decoding.
Your posted data doesn't appear to include the BOM (I don't see the 0xFF and 0xFE bytes), but your data does look like it is using little-endian ordering. That fits with this being Windows; Windows always uses little-endian ordering for its UTF-16 output.
If your data does have the BOM present, you can just decode as 'utf-16'. If the BOM is missing, use 'utf-16-le':
>>> sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> sample.decode('utf-16-le')
'analyzin'
>>> import codecs
>>> (codecs.BOM_UTF16_LE + sample)
b'\xff\xfea\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> (codecs.BOM_UTF16_LE + sample).decode('utf-16')
'analyzin'
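The two cases can be wrapped in one small helper that checks for a BOM before choosing the codec. This is a sketch under the assumption, taken from the answer above, that BOM-less output is little-endian; the function name decode_utf16_output is made up for the example:

```python
import codecs

def decode_utf16_output(raw):
    """Decode UTF-16 bytes, honouring a BOM if present, else assuming little-endian."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic 'utf-16' codec reads the BOM, picks the byte order, and drops it.
        return raw.decode("utf-16")
    return raw.decode("utf-16-le")

sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
print(decode_utf16_output(sample))
print(decode_utf16_output(codecs.BOM_UTF16_LE + sample))
```

In the question's code this would replace stdout.decode('UTF-8'), e.g. stdout_formatted = decode_utf16_output(stdout).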
I had the same problem with the progress information added by bs1770gain. It is not said in the man page but there is a "--suppress-progress" flag !
So I just used this command line and everything was OK ;)
bs1770gain --xml --suppress-progress "$filename"
Have you tried :
str(stdout_formatted).replace('\x00','')
?

How to create a dump file in hex format from python

I have an array of integers which I want to dump to a binary file (a HEX file, to be specific) using a Python script.
I have written the following code:
MemDump = Debug.readMemory(ic.IConnectDebug.fRealTime, 0, 0xB0009CC4, 0xCFF, 1)
MemData = MemDump[:3321]
# Create New file in binary mode and open for writing
fp = open("MON.dmp", 'w')
sys.stdout = fp
for byte in MemData:
    print(byte)
Here MemDump contains an array of integer values. I want to dump the first 3321 bytes of this array to a file.
I am getting the output in the file MON.dmp, but in ASCII format,
and if I create the file in binary mode using
fp = open("MON.dmp", 'wb')
print(byte) command gives me an error saying
'str' does not support the buffer interface
Thank you in Advance.
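You need to convert the integers to a bytes-like object before you can write them to a file opened in 'wb' mode, and write them with fp.write() rather than print(), which always produces text. Note that bytearray(byte) with a single integer creates that many zero bytes, so convert the whole sequence instead. Assuming every value in MemData is in the range 0-255, you can use:
with open("MON.dmp", "wb") as fp:
    fp.write(bytearray(MemData))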
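To check the result, the file can be read back in binary mode and rendered as hex text. The following is a self-contained sketch: mem_data is a made-up stand-in for MemData (the Debug.readMemory call from the question is not available here), and the file is written to a temporary directory rather than the working directory:

```python
import os
import tempfile

mem_data = [0x00, 0x41, 0x81, 0xFF]  # stand-in for MemData; values must be 0..255

path = os.path.join(tempfile.mkdtemp(), "MON.dmp")

# Write the raw bytes; bytes(list_of_ints) makes one byte per integer.
with open(path, "wb") as fp:
    fp.write(bytes(mem_data))

# Read the file back and format it as hex text, 16 bytes per line.
with open(path, "rb") as fp:
    written = fp.read()
hex_lines = [written[i:i + 16].hex() for i in range(0, len(written), 16)]
print("\n".join(hex_lines))
```

If a textual hex dump is what is actually wanted, hex_lines can be written to a file opened in ordinary text mode ('w') instead.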
