subprocess stdout string decoding not working

I'm using the following subprocess call to run a command-line tool. The tool doesn't print its output in one go; it writes results progressively, over multiple lines, as it runs. The tool is bs1770gain and the command would be "path\to\bs1770gain.exe" "-i" "\path\to\audiofile.wav". By using the --loglevel parameter you can include more data, but you cannot remove the progressive results being written to stdout.
I need stdout to return a human readable string (hence the stdout_formatted operation):
with subprocess.Popen(list_of_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
    stdout, stderr = proc.communicate()
    stdout_formatted = stdout.decode('UTF-8')
    stderr_formatted = stderr.decode('UTF-8')
However, I can only view the variable as a human-readable string if I print it, e.g.:
In [23]: print(stdout_formatted )
nalyzing ... [1/2] "filename.wav":
integrated: -2.73 LUFS / -20.27 LU [2/2]
"filename2.wav":
integrated: -4.47 LUFS / -18.53 LU
[ALBUM]:
integrated: -3.52 LUFS / -19.48 LU done.
In [24]: stdout_formatted
Out[24]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\.......
In [6]: stdout
Out[6]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\......
In [4]: type(stdout)
Out[4]: bytes
In [5]: type(stdout_formatted)
Out[5]: str
If you look carefully, the human-readable characters are in the string (the first word is "analyzing").
I guessed that the stdout value needs decoding/encoding, so I tried different ways:
stdout_formatted.encode("ascii")
Out[18]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g
stdout_formatted.encode("utf-8")
Out[17]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("utf-8")
Out[15]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("ascii")
Out[14]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
bytes(stdout).decode("ascii")
Out[13]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
I used a library called chardet to check the encoding of stdout:
import chardet
chardet.detect(stdout)
Out[26]: {'confidence': 1.0, 'encoding': 'ascii', 'language': ''}
I'm working on Windows 10 and am using Python 3.6 (the Anaconda package and its integrated Spyder IDE).
I'm kind of clutching at straws now. Is it possible to capture in a variable what is displayed in the console when print is called, or to remove the unwanted NUL bytes from the stdout string?

You don't have UTF-8 data. You have UTF-16 data. UTF-16 uses at least two bytes for every character; characters in the ASCII and Latin-1 ranges (such as a) still use 2 bytes, but one of those bytes is always a \x00 NUL byte.
Because UTF-16 works in 2-byte units, the order of those bytes starts to matter. Encoders can pick between the two options; one is called Little Endian, the other Big Endian. Normally, encoders then include a Byte Order Mark (BOM) at the very start, so that the decoder knows which of the two order options to use when decoding.
Your posted data doesn't appear to include the BOM (I don't see the 0xFF and 0xFE bytes), but your data does look like it is using little-endian ordering. That fits with this being Windows; Windows always uses little-endian ordering for its UTF-16 output.
If your data does have the BOM present, you can just decode as 'utf-16'. If the BOM is missing, use 'utf-16-le':
>>> sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> sample.decode('utf-16-le')
'analyzin'
>>> import codecs
>>> (codecs.BOM_UTF16_LE + sample)
b'\xff\xfea\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> (codecs.BOM_UTF16_LE + sample).decode('utf-16')
'analyzin'
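As a minimal sketch of the two cases above, the helper below picks the codec based on whether a BOM is present. The name decode_utf16_output is hypothetical, and the fallback to little-endian is an assumption that matches the Windows output described here, not something the tool guarantees.

```python
import codecs


def decode_utf16_output(raw: bytes) -> str:
    """Decode UTF-16 bytes, assuming little-endian when no BOM is present."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic 'utf-16' codec reads the BOM and strips it.
        return raw.decode("utf-16")
    # Assumption: BOM-less data from a Windows tool is little-endian.
    return raw.decode("utf-16-le")


sample = b"a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\x00"
print(decode_utf16_output(sample))
print(decode_utf16_output(codecs.BOM_UTF16_LE + sample))
```

In the question's code, this would be applied to the stdout bytes returned by proc.communicate() instead of calling stdout.decode('UTF-8').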

I had the same problem with the progress information added by bs1770gain. It is not mentioned in the man page, but there is a "--suppress-progress" flag!
So I just use this command line and everything was OK ;)
bs1770gain --xml --suppress-progress "$filename"

Have you tried:
str(stdout_formatted).replace('\x00','')
?

Related

Trying to output hex data as readable text in Python 3.6

I am trying to read hex values from specific offsets in a file, and then show that as normal text. Upon reading the data from the file and saving it to a variable named uName, and then printing it, this is what I get:
Card name is: b'\x95\xdc\x00'
Here's the code:
cardPath = str(input("Enter card path: "))
print("Card name is: ", end="")
with open(cardPath, "rb+") as f:
    f.seek(0x00000042)
    uName = f.read(3)
print(uName)
How can I remove the 'b' I am getting at the beginning? And how can I remove the '\x'es so that b'\x95\xdc\x00' becomes 95dc00? If I can do that, then I guess I can convert it to text using binascii.
I am sorry if my mistake is really really stupid because I don't have much experience with Python.
A string that starts with b in Python is a byte string.
Usually, you can use decode() or str(byte_string, 'UTF-8') to decode a byte string (i.e. a string that starts with b') to a string.
EXAMPLE
str(b'\x70\x79\x74\x68\x6F\x6E','UTF-8')
'python'
b'\x70\x79\x74\x68\x6F\x6E'.decode()
'python'
However, in your case, it raises a UnicodeDecodeError during decoding.
str(b'\x95\xdc\x00','UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I guess you need to find out the encoding for your file and then specify it when you open the file, like below:
open("u.item", encoding="THE_ENCODING_YOU_FOUND")
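If the goal is only the hex text 95dc00 asked for in the question, rather than decoded characters, no encoding is needed. A small sketch, using the three bytes shown in the question as sample data:

```python
import binascii

raw = b"\x95\xdc\x00"  # the three bytes printed in the question

# bytes.hex() (Python 3.5+) returns the hex digits as a str, without b'' or \x.
hex_text = raw.hex()
print(hex_text)

# binascii.hexlify() gives the same digits as bytes; decode them to get a str.
hex_via_binascii = binascii.hexlify(raw).decode("ascii")

# bytes.fromhex() reverses the conversion.
round_trip = bytes.fromhex(hex_text)
```

Whether those bytes can also be turned into readable text still depends on knowing the encoding the file actually uses.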

binary input with an ASCII text header, read from stdin

I want to read a binary PNM image file from stdin. The file contains a header which is encoded as ASCII text, and a payload which is binary. As a simplified example of reading the header, I have created the following snippet:
#! /usr/bin/env python3
import sys
header = sys.stdin.readline()
print("header=["+header.strip()+"]")
I run it as "test.py" (from a Bash shell), and it works fine in this case:
$ printf "P5 1 1 255\n\x41" |./test.py
header=[P5 1 1 255]
However, a small change in the binary payload breaks it:
$ printf "P5 1 1 255\n\x81" |./test.py
Traceback (most recent call last):
  File "./test.py", line 3, in <module>
    header = sys.stdin.readline()
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
Is there an easy way to make this work in Python 3?
To read binary data, you should use a binary stream, e.g. via the TextIOBase.detach() method:
#!/usr/bin/env python3
import sys
sys.stdin = sys.stdin.detach() # convert to binary stream
header = sys.stdin.readline().decode('ascii') # b'\n'-terminated
print(header, end='')
print(repr(sys.stdin.read()))
From the docs, it is possible to read binary data (as type bytes) from stdin with sys.stdin.buffer.read():
To write or read binary data from/to the standard streams, use the
underlying binary buffer object. For example, to write bytes to
stdout, use sys.stdout.buffer.write(b'abc').
So this is one direction that you can take -- read the data in binary mode. readline() and various other functions still work. Once you have captured the ASCII string, it can be converted to text, using decode('ASCII'), for additional text-specific processing.
Alternatively, you can use io.TextIOWrapper() to indicate the use of the latin-1 character set on the input stream. With this, the implicit decode operation will essentially be a pass-through operation -- so the data will be of type str (which represents text), but each character maps 1-to-1 to an input byte (although it could be using more than one storage byte per input byte).
Here's code that works in either mode:
#! /usr/bin/python3
import sys, io
BINARY=True ## either way works
if BINARY: istream = sys.stdin.buffer
else: istream = io.TextIOWrapper(sys.stdin.buffer,encoding='latin-1')
header = istream.readline()
if BINARY: header = header.decode('ASCII')
print("header=["+header.strip()+"]")
payload = istream.read()
print("len="+str(len(payload)))
for i in payload: print( i if BINARY else ord(i) )
Test every possible 1-pixel payload with the following Bash command:
for i in $(seq 0 255) ; do printf "P5 1 1 255\n\x$(printf %02x $i)" |./test.py ; done
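The binary-mode approach can also be tried without a shell pipe by substituting an in-memory stream for sys.stdin.buffer. This is only a sketch: split_pnm is a hypothetical helper, and it assumes the whole header sits on one line, as in the question's simplified example.

```python
import io


def split_pnm(stream):
    """Read one ASCII header line, then the remaining bytes, from a binary stream."""
    header = stream.readline().decode("ascii").strip()
    payload = stream.read()  # stays as bytes, so any byte value is accepted
    return header, payload


# io.BytesIO stands in for sys.stdin.buffer here.
header, payload = split_pnm(io.BytesIO(b"P5 1 1 255\n\x81"))
print(header)
print(list(payload))
```

Because the payload is never decoded, the 0x81 byte that triggered the UnicodeDecodeError passes through untouched.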

Seek from end of file in python 3

One of the changes in python 3 has been to remove the ability to seek from the end of the file in normal text mode. What is the generally accepted alternative to this?
For example in python 2.7 I would enter file.seek(-3,2)
I've read a bit about why they did this so please don't just link to a PEP. I know that using 'rb' would allow me to seek, but this makes my text file read in the wrong format.
In Python 2, the file data wasn't being decoded while reading. Seeking backwards and multi-byte encodings don't mix well (you can't know where would the next character start), which is why it is disabled for Python 3.
You can still seek on the underlying buffer object, via the TextIOBase.buffer attribute, but then you'll have to reattach a new TextIOBase wrapper, as the current wrapper will no longer know where it is at:
import io
file.buffer.seek(-3, 2)
file = io.TextIOWrapper(
    file.buffer, encoding=file.encoding, errors=file.errors,
    newline=file.newlines)
I've copied across any encoding and line handling information to the io.TextIOWrapper() object.
Take into account that decoding could break for UTF-16, UTF-32, UTF-8 and other multi-byte codecs, because the seek may land in the middle of a character.
Demo:
>>> import io
>>> with open('demo.txt', 'w') as out:
...     out.write('Demonstration\nfor seeking from the end')
...
38
>>> with open('demo.txt') as inf:
...     print(inf.readline())
...     inf.buffer.seek(-3, 2)
...     inf = io.TextIOWrapper(inf.buffer)
...     print(inf.readline())
...
Demonstration
35
end
You could wrap this up in a utility function:
import io
def textio_seek(fobj, amount, whence=0):
    fobj.buffer.seek(amount, whence)
    return io.TextIOWrapper(
        fobj.buffer, encoding=fobj.encoding, errors=fobj.errors,
        newline=fobj.newlines)
and use this as:
with open(somefile) as file:
    # ...
    file = textio_seek(file, -3, 2)  # 3 bytes back from the end (whence=2)
    # ...
Using the file object as a context manager still works, as the original file object reference is still attached to the original file buffer object and thus can still be used to close the file.
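When only the tail of the file is needed, a different technique from the wrapper approach above is to open the file in binary mode, seek, and decode the bytes yourself. The sketch below assumes UTF-8 and uses a hypothetical helper name, read_text_tail; note that it does not perform newline translation.

```python
import os
import tempfile


def read_text_tail(path, nbytes, encoding="utf-8"):
    """Decode and return the last `nbytes` bytes of the file at `path`."""
    with open(path, "rb") as raw:
        size = raw.seek(0, os.SEEK_END)  # seek() returns the new position
        raw.seek(max(size - nbytes, 0))
        # errors="replace" keeps a cut-off multi-byte character from raising.
        return raw.read().decode(encoding, errors="replace")


# Demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write("Demonstration\nfor seeking from the end")
    tail = read_text_tail(demo_path, 3)

print(tail)
```

The offset is counted in bytes, not characters, so with multi-byte encodings the first decoded character may come out as a replacement character.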

How to display chinese character in 65001 in python?

I am in win7 +python3.3.
import os
os.system("chcp 936")
fh=open("test.ch","w",encoding="utf-8")
fh.write("你")
fh.close()
os.system("chcp 65001")
fh=open("test.ch","r",encoding="utf-8").read()
print(fh)
Äã
>>> print(fh.encode("utf-8"))
b'\xe4\xbd\xa0'
How can I display the Chinese character 你 in 65001?
If your terminal is capable of displaying the character directly (which it may not be due to font issues) then it should Just Work(tm).
>>> hex(65001)
'0xfde9'
>>> u"\ufde9"
'\ufde9'
>>> print(u"\ufde9")
﷩
To avoid the use of literals, note that in Python 3, at least, the chr() function will take a code point and return the associated Unicode character. So this works too, avoiding the need to do hex conversions.
>>> print(chr(65001))
﷩

Why should I give `savetxt` a file opened in binary rather than text mode?

I was bitten by the following numpy behaviour:
In [234]: savetxt(open('/tmp/a.dat', 'wt'), array([1, 2, 3]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-234-2adef92da877> in <module>()
----> 1 savetxt(open('/tmp/a.dat', 'wt'), array([1, 2, 3]))
/local/gerrit/python3.2/lib/python3.2/site-packages/numpy/lib/npyio.py in savetxt(fname, X, fmt, delimiter, newline)
1007 else:
1008 for row in X:
-> 1009 fh.write(asbytes(format % tuple(row) + newline))
1010 finally:
1011 if own_fh:
TypeError: must be str, not bytes
In [235]: savetxt(open('/tmp/a.dat', 'wb'), array([1, 2, 3]))
# success
I find this strange. I'm trying to save my array to a text file. Then why should I open the file in binary mode?
Because your data is bytes (ie binary) data.
What comes out is still a text file. Don't worry. :-) A "text" file is defined as something that contains only human-readable text, not by the mode in which you open it. The mode just makes a difference in how the file object handles the data it is given.
Text mode means it expects Unicode data, and it will encode it into bytes format for you. Binary mode means it expects data in bytes, and will not encode it.
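The difference between the two modes can be seen with in-memory streams. This is a sketch, not numpy's code: the sample line and the UTF-8 encoding are assumptions chosen for illustration.

```python
import io

line = "1.000000e+00\n"

# Text mode: the wrapper accepts str and encodes it for us.
text_backing = io.BytesIO()
text_stream = io.TextIOWrapper(text_backing, encoding="utf-8", newline="\n")
text_stream.write(line)
text_stream.flush()
via_text = text_backing.getvalue()

# Binary mode: we must hand over bytes that are already encoded.
binary_stream = io.BytesIO()
binary_stream.write(line.encode("utf-8"))
via_binary = binary_stream.getvalue()

print(via_text == via_binary)

# Passing bytes to a text stream fails with the same kind of TypeError
# as in the traceback from the question.
try:
    text_stream.write(b"1.0\n")
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
```

Either way the bytes that end up on disk are the same; only the type the stream expects from the caller differs.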
Most likely because the numpy maintainers have not updated this function to be fully compatible with Python 3. The name "savetxt" certainly implies that a text-mode file would be adequate, and there's nothing preventing them from calling fh.write((format % tuple(row) + newline).encode()).
There's nothing wrong with using binary mode, either, except that it leads to a surprise in some cases, as you've discovered. I consider it a bug in API design if nothing else.
