UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5 - python-3.x

I am trying to read a file using the following code.
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'))
features = precomputed['features']
But I am getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5: ordinal not in range(128)
The file I am trying to read contains image features which are extracted using deep neural networks. The file content looks like below.
(dp0
S'imageIds'
p1
(lp2
I262145
aI131074
aI131075
aI393221
aI393223
aI393224
aI524297
aI393227
aI393228
aI262146
aI393230
aI262159
aI524291
aI322975
aI131093
aI524311
....
....
....
Please note that this is a big file, about 2.8 GB in size.
I know this is a duplicate question, but I followed the solutions suggested in other Stack Overflow posts and still couldn't solve it. Any help would be appreciated!

Finally I found the solution. The problem was unpickling a Python 2 object with Python 3, which I did not realize at first: the pickle file I had been given was written by a Python 2 program.
Thanks to this answer, which solved the problem. All I needed to do was set the encoding parameter of pickle.load() to latin1. latin1 works for any input because it maps the byte values 0-255 directly to the first 256 Unicode code points.
So the following worked for me:
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'), encoding='latin1')
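The failure and the fix can be reproduced without the 2.8 GB file. The sketch below uses a hand-written protocol 0 pickle of the kind Python 2 produces for a byte string; the PY2_PICKLE constant and the load_py2_pickle helper are made up for illustration and are not part of the original program.

```python
import pickle

# Hand-written protocol 0 pickle of a Python 2 str containing the byte 0xcc.
# Illustrative data only, not taken from the real feature file.
PY2_PICKLE = b"S'abc\\xcc'\np0\n."


def load_py2_pickle(data, encoding="latin1"):
    """Unpickle Python 2 data, decoding its 8-bit strings with `encoding`."""
    return pickle.loads(data, encoding=encoding)


try:
    pickle.loads(PY2_PICKLE)  # the default encoding is ASCII
except UnicodeDecodeError as exc:
    print("default encoding fails:", exc)

# latin1 maps every byte value to a code point, so decoding cannot fail.
print(repr(load_py2_pickle(PY2_PICKLE)))

# encoding="bytes" keeps the 8-bit strings as bytes objects instead.
print(repr(load_py2_pickle(PY2_PICKLE, encoding="bytes")))
```

The Python documentation for pickle notes that encoding='latin1' is required for NumPy arrays pickled by Python 2, which is likely what a file of extracted image features contains.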

Related

double encoding through cp1252 and base 64

From a client I am getting a PDF file that is encoded in cp1252 and, for transfer, additionally encoded in Base64. Until now, a shell program has restored the file to its original form with this line of code:
output= [System.Text.Encoding]::GetEncoding(1252).GetString([System.Convert]::FromBase64String(input))
and this works.
Now I am implementing a Python version to supersede this implementation. It looks roughly like this:
enc_file = read_from_txt.open_file(location_of_file)
plain_file = base64.b64decode(enc_file)
with open('filename', 'w') as writer:
    writer.write(plain_file.decode('cp1252'))
where read_from_txt.open_file just does this:
with open(file_location, 'rb') as fileReader:
    read = fileReader.read()
return read
But for some reason I am getting an error at plain_file.decode('cp1252'), which cannot decode a byte in the file. As far as I understand, though, the Python program should do exactly the same as the PowerShell one.
The concrete error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 188: character maps to undefined
Any help is appreciated.
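Python's cp1252 codec has no character assigned to byte 0x81, so a strict decode raises exactly this error. The following is a minimal sketch, assuming the goal is only to recover the original bytes of the PDF; restore_file and has_undefined_cp1252_bytes are hypothetical helper names. It skips the text decoding step and writes the Base64-decoded bytes in binary mode. Whether the result is byte-for-byte identical to what the PowerShell line produces has not been verified here.

```python
import base64


def restore_file(b64_payload, out_path):
    """Decode a Base64 payload and write the resulting bytes unchanged."""
    raw = base64.b64decode(b64_payload)
    # Binary mode: no codec is involved, so no byte can fail to decode.
    with open(out_path, "wb") as writer:
        writer.write(raw)


def has_undefined_cp1252_bytes(raw):
    """Return True if a strict cp1252 decode of `raw` would raise."""
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False


print(has_undefined_cp1252_bytes(b"\x81"))
print(has_undefined_cp1252_bytes(b"plain ASCII"))
```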

Streaming data from the R&S rto oscilloscope - UnicodeDecodeError python3.6

I'm trying to get the signal data for a specific channel on the Rohde & Schwarz RTO oscilloscope. I'm using the vxi11 Python (3.6) library to communicate with the scope.
On my first try, I was able to extract all the data of the scope channel I was querying without any errors (using the query command CHAN1:WAV1:DATA?), but soon afterwards I started getting this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
The weird thing is that I'm still able to get the header of the data without any issues. It's only when I request the entire data that I see this error.
I've tried changing the format of the data between REAL (binary) and ASCii, but to no avail.
Another weird thing is that when I switch the encoding used for the received data to 'latin-1', it works fine for a moment (giving me a strange character string, which I assume is the data I want, just in another format) and then crashes.
The entire output looks as follows:
****IDN : Rohde&Schwarz,RTO,1329.7002k04/100938,4.20.1.0
FORM[:DATA]ASCii : None
CHAN1:WAV1:DATA:HEAD? : -0.2008,0.1992,10000000,1
'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
Traceback (most recent call last):
  File "testing_rtodto.py", line 21, in ask_query
    logger.debug(print(query+" :",str(conn._ask(query))))
  File "../lib_maxiv_rtodto/client.py", line 187, in _ask
    response = self.instrument.ask(data)#, encoding="latin-1")
  File "/usr/lib/python3.6/site-packages/vxi11/vxi11.py", line 743, in ask
    return self.read(num, encoding)
  File "/usr/lib/python3.6/site-packages/vxi11/vxi11.py", line 731, in read
    return self.read_raw(num).decode(encoding).rstrip('\r\n')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
Alrighty, I found a fix, thanks mostly to this thread: https://github.com/pyvisa/pyvisa/issues/306
Though I'm not using the same communication library as they are, the problem seemed to be the way I was querying the data, not how the library was reading it.
It turns out you have to follow R&S's instrument instructions very, very closely (although their documentation is confusing and hard to find, not to mention the lack of example query strings for important query functions).
Essentially, the query command that worked was FORM ASC;:CHAN1:DATA?. This explicitly sets the data format to ASCii before the data is returned to the communicating library.
I also found some sample Python scripts that R&S have provided (https://cdn.rohde-schwarz.com/pws/service_support/driver_pagedq/files_1/directscpi/DirectSCPI_PyCharm_Python_Examples.zip).
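As a rough sketch of how that query could be wrapped: read_channel_ascii below is a hypothetical helper that takes any object with an ask() method, such as a vxi11.Instrument. It assumes the ASCII response is a comma-separated list of numbers, which matches the shape of the header line shown above but is not confirmed for the full waveform.

```python
def read_channel_ascii(instrument, channel=1):
    """Query one channel's waveform as ASCII text and parse it into floats."""
    # Setting the format in the same message as the query mirrors the
    # command string that was reported to work.
    response = instrument.ask("FORM ASC;:CHAN{}:DATA?".format(channel))
    # Assumption: the reply is a comma-separated list of numbers.
    return [float(value) for value in response.split(",") if value.strip()]


# Usage against real hardware (needs the vxi11 package and a reachable scope;
# the address is a placeholder):
#
#     import vxi11
#     scope = vxi11.Instrument("192.0.2.10")
#     samples = read_channel_ascii(scope, channel=1)
```

Because the data arrives as plain ASCII digits, the library's default utf-8 decoding has nothing to trip over.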

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding texts.
I am trying to write code that finds a 'string' or 'bytes' in a file and returns the path of the file.
Since the files I am opening are reported as encoded in 'windows-1252' or 'cp-1252', I have been trying to:
1. encode my string into bytes using the encoding of the file
2. match the bytes against the file and get the path of that file
I have a file, say 'f', that has the encoding 'windows-1252' or 'cp-1252'. It includes a text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text)) # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you can see, the 'binary' text for [跑Online農場] is [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I convert the string into bytes directly, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp-1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
and I am not sure how I would 'translate' '[跑Online農場]' into '[¶]Online¹A³õ]'. An answer to this may also solve the problem.
What should I do to correctly 'encode' the Chinese/foreign characters so that they match the 'rb' bytes that Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.
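The trial and error described above can be automated. This is a minimal sketch; matching_codecs and file_contains are illustrative names, and the candidate list is an arbitrary selection rather than a recommendation.

```python
def matching_codecs(raw, expected, candidates):
    """Return the candidate codecs that decode `raw` to exactly `expected`."""
    matches = []
    for codec in candidates:
        try:
            decoded = raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes are invalid in this codec, or the codec is unknown
        if decoded == expected:
            matches.append(codec)
    return matches


def file_contains(path, text, codec):
    """Check whether `text`, encoded with `codec`, occurs in the file's bytes."""
    with open(path, 'rb') as f:
        return text.encode(codec) in f.read()


raw = b'[\xb6]Online\xb9A\xb3\xf5]'
print(matching_codecs(raw, '[跑Online農場]', ['cp1252', 'utf-8', 'gbk', 'cp950']))
```

Once a codec has been identified, file_contains(path, '[跑Online農場]', 'cp950') performs the byte-level search from the question without opening the file in text mode.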

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>

I have code to convert docx files to plain text:
import docx
import glob
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
for file in glob.glob('*.docx'):
    outfile = open(file.replace('.docx', '-out.txt'), 'w', encoding='utf8')
    for line in open(file):
        print(getText(filename), end='', file=outfile)
    outfile.close()
However, when I execute it, there is the following error:
Traceback (most recent call last):
  File "C:\Users\User\Desktop\add spaces docx\converting docx to pure text.py", line 16, in <module>
    for line in open(file):
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am using Python 3.5.2.
Can anyone help to resolve this issue?
Thanks in advance.
Although I do not know the docx module very well, I think I can find a solution.
According to fileformat, the Unicode character U+008F is a control character. In cp1252, the byte 0x8f (which the charmap codec couldn't decode, resulting in the UnicodeDecodeError) has no character assigned at all.
You should be aware of such bytes when reading files as text: a .docx file is a ZIP archive, that is, binary data, and Python cannot decode arbitrary binary data as text.
If you ever need to read a .docx file yourself, use open(filename, "rb") so that Python does not try to decode it. Here you do not need to, because docx.Document(filename) already opens the file for you. The loop for line in open(file), which is the line the traceback points at, can be dropped, and getText(file) can be called directly (filename is not defined outside the function).
Changing the encoding is unlikely to help. As the traceback shows (encodings\cp1252.py), open() used cp1252, the default encoding on this Windows system, instead of UTF-8. Since the file is not text, passing encoding='utf-8' would most likely just produce a different decoding error.
NOTE: Sorry for the lack of links. This is because I do not have higher than 10 reputation (because I am new to Stack Overflow).
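A minimal sketch of the conversion loop without the text-mode open(), assuming the python-docx package is installed. The names get_text, output_path and convert_all are illustrative, and the import sits inside get_text only so that the other helpers can be loaded without the package.

```python
import glob
from pathlib import Path


def get_text(filename):
    """Return the paragraphs of a .docx file joined by newlines."""
    import docx  # third-party python-docx; opens the file in binary itself

    doc = docx.Document(filename)
    return '\n'.join(para.text for para in doc.paragraphs)


def output_path(docx_path):
    """Map 'name.docx' to 'name-out.txt' in the same directory."""
    path = Path(docx_path)
    return str(path.with_name(path.stem + '-out.txt'))


def convert_all(pattern='*.docx'):
    """Convert every matching .docx file and return how many were converted."""
    count = 0
    for docx_file in glob.glob(pattern):
        # Only the output file is opened as text; the .docx is never
        # passed to a text-mode open().
        with open(output_path(docx_file), 'w', encoding='utf8') as outfile:
            outfile.write(get_text(docx_file))
        count += 1
    return count
```

Calling convert_all() in the folder that holds the documents writes one -out.txt file per .docx file.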

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte

I'm running a large number of OCRs on screenshots with Pytesseract. This works well in most cases, but a small number of images cause this error:
pytesseract.image_to_string(image,None, False, "-psm 6")
Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>
I'm using Python 3.4. Any suggestions how I can prevent this error from happening (other than just a try/except) would be very helpful.
Use Unidecode:
from unidecode import unidecode
from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))
strs = unidecode(strs)
print(strs)
Make sure you are using the right decoding options.
See https://docs.python.org/3/library/codecs.html#standard-encodings
bytes.decode('utf-8')
bytes.decode('cp950') for Traditional Chinese, etc.
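To illustrate the decoding options: the sketch below assumes you have the raw OCR output as bytes (for example, read from Tesseract's output file yourself). decode_output is a hypothetical helper, not part of pytesseract. The sample bytes are the UTF-8 encoding of a right double quotation mark, whose third byte is 0x9d. That would fit the "byte 0x9d in position 2" in the error above, although this is only one plausible reading.

```python
def decode_output(raw, encoding="utf-8"):
    """Decode bytes, substituting U+FFFD for undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


sample = b"\xe2\x80\x9d"  # UTF-8 encoding of U+201D

# ascii() keeps the printed result safe on consoles with a narrow encoding.
print(ascii(decode_output(sample)))            # decoded with the matching codec
print(ascii(decode_output(sample, "cp1252")))  # wrong codec: mojibake, no crash

try:
    sample.decode("cp1252")  # strict decoding with the wrong codec raises
except UnicodeDecodeError as exc:
    print(exc)
```

Choosing the right codec preserves the text, while errors="replace" only keeps a wrong guess from crashing the batch.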
