Calling decode() raises an error when I try to decode a line of output read through subprocess.PIPE. The error message is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 8: invalid start byte
0x84 should be the letter 'ä'. The line that fails reads as follows:
b' Datentr\x84ger in Laufwerk C: ist System'
I can't nail it down. I already checked the encoding using sys.stdout.encoding, which is utf-8.
import subprocess
import re
prc = subprocess.Popen(["cmd.exe"], shell=False, stdout=subprocess.PIPE, stdin=subprocess.PIPE)
prc.stdin.write(b"dir\n")
outp, inp = prc.communicate()
regex = re.compile(r"^.*(\d\d:\d\d).*$")
for line in outp.splitlines():
    match = regex.match(line.decode('utf-8'))  # <--- decode fails here.
    if match:
        print(match.groups())
prc.stdin.close()
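cmd.exe does not write UTF-8. It encodes its output using the console's OEM code page, for example cp850 (or cp437) on a German Windows installation, where byte 0x84 is 'ä'. So the text that comes through the PIPE needs to be decoded with that code page, even though Python reports utf-8 for sys.stdout.encoding.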
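As a minimal sketch of that point, the failing line from the question can be decoded with an OEM code page. The choice of cp850 is an assumption about the console in the question (byte 0x84 maps to 'ä' in both cp437 and cp850); the active code page can be checked with the chcp command.

```python
# The exact bytes quoted in the question.
raw_line = b" Datentr\x84ger in Laufwerk C: ist System"

# UTF-8 rejects 0x84 as a start byte, which is the reported error.
try:
    raw_line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the console uses OEM code page 850.
decoded = raw_line.decode("cp850")
print(utf8_ok, decoded)
```

In the original loop this would mean calling line.decode('cp850') instead of line.decode('utf-8'), adjusted to whatever code page the console actually reports.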
If you don't know the encoding, one robust way to handle this is to pass the errors parameter to bytes.decode, e.g.:
import subprocess
p = subprocess.run(['echo', b'Evil byte: \xe2'], stdout=subprocess.PIPE)
p.stdout.decode(errors='backslashreplace')
Output:
'Evil byte: \\xe2\n'
The list of possible values can be found here:
https://docs.python.org/3/library/codecs.html#codecs.register_error
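For comparison, here is a short sketch of how three of the built-in error handlers treat the same undecodable byte. The sample bytes are taken from the example above; which handler is appropriate depends on whether the bad bytes need to be preserved.

```python
raw = b"Evil byte: \xe2"

# Decode the same bytes with different error handlers.
results = {
    handler: raw.decode("utf-8", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

for handler, text in results.items():
    print(f"{handler:>16}: {text!r}")
```

"ignore" silently drops the byte, "replace" substitutes U+FFFD, and "backslashreplace" keeps the byte value visible as an escape sequence.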
Related
import serial
# Keyword arguments must come after positional ones, so the baud rate is named as well.
s = serial.Serial(port='Com3', baudrate=9600, timeout=2)
data = s.readline().decode().rstrip("\r\n")
So basically, when I try to read the data I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 20: invalid start byte
According to the documentation of the instrument I am trying to communicate with, the data has this form:
..ss3422/34/54--1.8E+03<,>…..0.7E+03<,...1.71E-09<,<√.<*.<
I was able to fix the problem by changing the encoding passed to decode() to "ISO-8859-1".
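A small sketch of why that works: ISO-8859-1 assigns a character to every one of the 256 byte values, so decoding with it cannot raise, whereas UTF-8 rejects 0xfb as a start byte. The sample reading below is hypothetical and not taken from the instrument.

```python
# Hypothetical instrument line containing the byte 0xfb from the error message.
raw = b"1.8E+03<,>\xfb0.7E+03\r\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps each byte to the code point with the same number.
text = raw.decode("ISO-8859-1").rstrip("\r\n")
every_byte = bytes(range(256)).decode("ISO-8859-1")
print(utf8_ok, repr(text), len(every_byte))
```

Note that this only guarantees that decoding succeeds. Whether the resulting characters are meaningful depends on what the instrument actually sends.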
I have a Schneider power meter with RS485 support. I am using Python with pymodbus to read registers and decode the payload from it, and that works. Now I want to do the same with Node.js. I can get the raw data, but I don't know how to decode it. I tried several methods, but the results are wrong.
This is my Python code:
from pymodbus.client.sync import ModbusSerialClient as ModbusClient
from pymodbus.constants import Endian
from pymodbus.payload import BinaryPayloadDecoder
def validator(instance):
    # .isError() is implemented in pymodbus 1.4.0 and above.
    if not instance.isError():
        decoder = BinaryPayloadDecoder.fromRegisters(
            instance.registers,
            byteorder=Endian.Big, wordorder=Endian.Little
        )
        return float(decoder.decode_32bit_float())
    else:
        # Error handling.
        return None
# For a response whose registers are [5658, 17242], the result is 218.1
When I use Node.js it returns a Buffer, and I tried to decode it with:
let buf = Buffer.from([0xd6, 0xd4, 0x42, 0x47]);
let payload = buf.readFloatBE(0); // Returns a different float, not 218.1
Can anyone help me? Thanks!
I'm trying to load .npy files from my Google Cloud Storage bucket into my model. I followed the example in "Load numpy array in google-cloud-ml job",
but I get this error:
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Can you help me, please?
Here is a sample of the code.
Here I read the file:
with file_io.FileIO(metadata_filename, 'r') as f:
    self._metadata = [line.strip().split('|') for line in f]
And here I start processing it:
if self._offset >= len(self._metadata):
    self._offset = 0
    random.shuffle(self._metadata)
meta = self._metadata[self._offset]
self._offset += 1
text = meta[3]
if self._cmudict and random.random() < _p_cmudict:
    text = ' '.join([self._maybe_get_arpabet(word) for word in text.split(' ')])
input_data = np.asarray(text_to_sequence(text, self._cleaner_names), dtype=np.int32)
f = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[0])))
linear_target = tf.Variable(initial_value=np.load(f), name='linear_target')
s = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[1])))
mel_target = tf.Variable(initial_value=np.load(s), name='mel_target')
return (input_data, mel_target, linear_target, len(linear_target))
and this is a sample from the data sample
This is likely because your file doesn't contain UTF-8 encoded text: .npy is a binary format whose header starts with the byte 0x93, which is exactly the byte the error complains about at position 0.
You may need to open the file_io.FileIO instance in binary mode using mode='rb', or set binary_mode=True in the call to read_file_to_string.
This will cause the data that is read to be returned as a sequence of bytes rather than a string.
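The following standard-library sketch illustrates the point without needing TensorFlow or NumPy. The payload is a hypothetical stand-in for a real file: only the six magic bytes are taken from the .npy format, the rest is made up.

```python
import io

NPY_MAGIC = b"\x93NUMPY"  # the first six bytes of a .npy file

# Hypothetical stand-in for the bytes returned by a binary-mode read.
raw = NPY_MAGIC + b"\x01\x00"

# Treating the payload as UTF-8 text fails on the very first byte.
try:
    raw.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError as exc:
    decodes_as_text = False
    failing_position = exc.start

# Binary data belongs in a bytes buffer, not a text buffer.
buffer = io.BytesIO(raw)
print(decodes_as_text, failing_position, buffer.read(6))
```

In the code from the question, this would suggest wrapping the bytes in io.BytesIO instead of StringIO before handing them to np.load, which accepts file-like objects.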
I'm trying to find and replace some special chars in a file encoded in ISO-8859-1, then write the result to a new file encoded in UTF-8:
package inv
class MigrationScript {
    static main(args) {
        new MigrationScript().doStuff();
    }
    void doStuff() {
        def dumpfile = "path to input file";
        def newfileP = "path to output file"
        def file = new File(dumpfile)
        def newfile = new File(newfileP)
        def x = [
            "þ":"ş",
            "ý":"ı",
            "Þ":"Ş",
            "ð":"ğ",
            "Ý":"İ",
            "Ð":"Ğ"
        ]
        def r = file.newReader("ISO-8859-1")
        def w = newfile.newWriter("UTF-8")
        r.eachLine{
            line ->
            x.each {
                key, value ->
                if(line.find(key)) println "found a special char!"
                line = line.replaceAll(key, value);
            }
            w << line + System.lineSeparator();
        }
        w.close()
    }
}
My input file content is:
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"
Problem is my code never finds the specified characters. The groovy script file itself is encoded in UTF-8. I'm guessing that may be the cause of the problem, but then I can't encode it in ISO-8859-1 because then I can't write "Ş" "Ğ" etc in it.
I took your code sample, ran it with an input file encoded in ISO-8859-1, and it worked as expected. Can you double-check that your input file is actually encoded in ISO-8859-1? Here is what I did:
I took the file content from your question and saved it (using SublimeText) to a file /tmp/test.txt using Save -> Save with Encoding -> Western (ISO 8859-1)
I checked the file encoding with the following Linux command:
file -i /tmp/test.txt
/tmp/test.txt: text/plain; charset=iso-8859-1
I set the dumpfile variable to /tmp/test.txt and the newfileP variable to /tmp/test_2.txt
I ran your code and saw this in the console:
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
I checked encoding of the Groovy file in IntelliJ IDEA - it was UTF-8
I checked encoding of the output file:
file -i /tmp/test_2.txt
/tmp/test_2.txt: text/plain; charset=utf-8
I checked the content of the output file:
cat /tmp/test_2.txt
"ş": "ı": "Ş":" "ğ":" "İ":" "Ğ":"
I don't think it matters, but I used the most recent Groovy, 2.4.13
I'm guessing that your input file is not encoded the way you expect. Do double-check the file's encoding: when I save the same content with UTF-8 encoding, your program does not work as expected and I don't see any "found a special char!" entries in the console. When I display the contents of the ISO-8859-1 file, I see something like this:
cat /tmp/test.txt
"�": "�": "�":" "�":" "�":" "�":"%
If I save the same content with UTF-8, I see the readable content of the file:
cat /tmp/test.txt
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"%
Hope it helps in finding the source of the problem.
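The effect described above can be reproduced in a few lines. This is a Python sketch rather than Groovy, and it only models the decoding step: if the input file is really UTF-8 but is read as ISO-8859-1, each special character turns into two different characters, so a search for the original character finds nothing.

```python
original = "þ"

utf8_bytes = original.encode("utf-8")         # two bytes: 0xC3 0xBE
latin1_bytes = original.encode("iso-8859-1")  # one byte: 0xFE

# Reading UTF-8 bytes with an ISO-8859-1 reader yields two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")
# Reading ISO-8859-1 bytes with an ISO-8859-1 reader yields the original character.
correct = latin1_bytes.decode("iso-8859-1")

print(repr(misread), original in misread)
print(repr(correct), original in correct)
```

So the replacement map only matches when the reader's charset agrees with the file's actual encoding.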
I am trying to store variable string expressions from a file which contains special characters, like ø, æ , and å. Here is my code:
import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However the text is not stored properly. The data stored contains text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.
With:
import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see and interpret the strings as Unicode, both when writing and when reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
DATASET "text" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): "\37777777703\37777777670", "\37777777703\37777777646",
(2): "\37777777703\37777777645"
}
ATTRIBUTE "1" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
}
}
}
}
}
Note that in both cases the datatype is marked as UTF-8:
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
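Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character. Those escapes are the UTF-8 bytes 0xC3 0xB8 of 'ø', printed as sign-extended octal numbers: the low eight bits of octal 37777777703 are 0xC3, and those of 37777777670 are 0xB8.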
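To check this by hand, the escapes in the dump can be converted back to text. The sketch below assumes that each escape is a backslash followed by either 11 or 3 octal digits and that only the low eight bits matter. That format is inferred from the output shown above, not from h5dump documentation, and decode_h5dump_escapes is a hypothetical helper.

```python
import re

# Assumption: an escape is a backslash plus 11 (sign-extended) or 3 octal digits.
_ESCAPE = re.compile(r"\\([0-7]{11}|[0-7]{3})")


def decode_h5dump_escapes(dumped: str) -> str:
    """Turn octal escapes from the dump back into UTF-8 text."""
    raw = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(dumped):
        # Copy the plain text before the escape unchanged.
        raw += dumped[pos:match.start()].encode("utf-8")
        # Keep only the low byte of the sign-extended value.
        raw.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    raw += dumped[pos:].encode("utf-8")
    return raw.decode("utf-8")


dump_line = (
    r"some text with \37777777703\37777777670, "
    r"\37777777703\37777777646, \37777777703\37777777645"
)
print(decode_h5dump_escapes(dump_line))
```

The result matches what h5py prints, which is consistent with the data being stored as valid UTF-8.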
You should try storing your data in UTF-8 format by doing the following:
To encode in UTF-8 (before storing with h5py) do:
u"æ".encode("utf-8")
which returns:
b'\xc3\xa6'
Then to decode you can use the bytes decode method like this:
b'\xc3\xa6'.decode("utf-8")
which returns:
æ
Hope it helps!
EDIT
When you open a file and want it to be read as UTF-8, you can use the encoding parameter of open():
f = open(fname, encoding="utf-8")
This should help decode the original file properly.
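As a small illustration, the round trip below writes a string to a temporary file as UTF-8 and reads it back with the same encoding. The file name is hypothetical, and the sample text is the one from the question.

```python
import os
import tempfile

text = "some text with ø, æ, å"

with tempfile.TemporaryDirectory() as tmp_dir:
    fname = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # Write the text as UTF-8.
    with open(fname, "w", encoding="utf-8") as f:
        f.write(text)

    # Inspect the raw bytes: each special character takes two bytes.
    with open(fname, "rb") as f:
        raw = f.read()

    # Read it back, naming the encoding explicitly.
    with open(fname, encoding="utf-8") as f:
        round_trip = f.read()

print(len(text), len(raw), round_trip == text)
```

Naming the encoding on both sides avoids depending on the platform default, which differs between systems.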
Source: python-notes