Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte - python-3.x

I'm running a large number of OCRs on screenshots with Pytesseract. This works well in most cases, but a small number of them cause this error:
pytesseract.image_to_string(image, None, False, "-psm 6")
Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>
I'm using Python 3.4. Any suggestions on how I can prevent this error (other than just a try/except) would be very helpful.

Use Unidecode to strip the problematic non-ASCII characters from the OCR output:
from unidecode import unidecode
from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))
strs = unidecode(strs)
print(strs)

Make sure you are using the right decoding options; see https://docs.python.org/3/library/codecs.html#standard-encodings for the list of standard encodings. Note that in Python 3 it is bytes, not str, that has a decode() method:
raw_bytes.decode('utf-8')
raw_bytes.decode('cp950')  # for Traditional Chinese, etc.
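For instance, a minimal sketch of explicit decoding in Python 3 (the byte string here is illustrative):
raw = b'Caf\xc3\xa9'                           # bytes holding UTF-8 data
print(raw.decode('utf-8'))                     # Café
print(raw.decode('utf-8', errors='replace'))   # 'errors' controls what happens on undecodable bytes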

Related

NLTK access local files; UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 834: character maps to <undefined>

This is my first post and I have little to no experience but I like to learn.
I hope this post is comprehensible but please feel free to ask for further details.
I am working with Cygwin, and occasionally I use IDLE with Python 3.9 to complete some tasks for university.
Currently I am trying to use the NLTK module to tokenize a text.
The first thing I do is open Python (through Cygwin or directly from IDLE, though I've mostly been using Cygwin).
>>> import nltk
>>> from nltk import word_tokenize
>>> from nltk.book import *
at which point a library of different books is downloaded for me to access. I don't really need them, though, because I need to access a local file in a folder called "Tint".
The command that I HAVE managed to run, but cannot replicate, is
>>> Rev = open("/Users/acer/OneDrive - Università di Pavia/Desktop/Tint/amazon_jamon.no_alterations.txt", "r").read()
The first problem I ran into was backslashes being treated as escape sequences, but once I switched to forward slashes it worked. Now that I am trying to access a similar .txt file in the same "Tint" folder with the same kind of command, I get a different error.
>>> desc = open("/Users/acer/OneDrive - Università di Pavia/Desktop/Tint/salame_P.txt", "r").read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python\Python 395\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 834: character maps to <undefined>
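The traceback shows Python falling back to the Windows default codec (cp1252), which has no mapping for byte 0x9d. A minimal sketch of the usual fix, assuming the file is actually UTF-8 encoded, is to pass the encoding to open() explicitly:
desc = open("/Users/acer/OneDrive - Università di Pavia/Desktop/Tint/salame_P.txt",
            "r", encoding="utf-8").read()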

Streaming data from the R&S RTO oscilloscope - UnicodeDecodeError python3.6

I'm trying to get the signal data for a specific channel on the Rohde & Schwarz RTO oscilloscope. I'm using the vxi11 Python (3.6) library to communicate with the scope.
On my first try, I was able to extract all the data of the scope channel I was querying without any errors (using the query command CHAN1:WAV1:DATA?), but soon after I started getting this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
The weird thing is that I'm still able to get the head of the data without any issues. It's only when I request the entire data set that I see this error.
I've tried changing the format of the data between REAL (binary) and ASCii, but to no avail.
Another weird thing is that when I switch the encoding of the received data to 'latin-1', it works for a moment (giving me a strange character string, which I assume is the data I want, just in another format) and then crashes.
The entire output looks as follows:
*IDN? : Rohde&Schwarz,RTO,1329.7002k04/100938,4.20.1.0
FORM[:DATA]ASCii : None
CHAN1:WAV1:DATA:HEAD? : -0.2008,0.1992,10000000,1
'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
Traceback (most recent call last):
File "testing_rtodto.py", line 21, in ask_query
logger.debug(print(query+" :",str(conn._ask(query))))
File "../lib_maxiv_rtodto/client.py", line 187, in _ask
response = self.instrument.ask(data)#, encoding="latin-1")
File "/usr/lib/python3.6/site-packages/vxi11/vxi11.py", line 743, in ask
return self.read(num, encoding)
File "/usr/lib/python3.6/site-packages/vxi11/vxi11.py", line 731, in read
return self.read_raw(num).decode(encoding).rstrip('\r\n')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
Alrighty, I found a fix, thanks mostly to this thread: https://github.com/pyvisa/pyvisa/issues/306
Though I'm not using the same communication library as they are, the problem turned out to be the way I was querying the data, not how the library was reading it.
It turns out you have to follow R&S's instrument instructions very, very, VERY closely (although their documentation is extremely confusing and hard to find, not to mention the lack of example query strings for important query functions).
Essentially, the query command that worked was FORM ASC;:CHAN1:DATA?. This explicitly converts the data to ASCii format before returning it to the communicating library.
I also found some sample python scripts that R&S have provided (https://cdn.rohde-schwarz.com/pws/service_support/driver_pagedq/files_1/directscpi/DirectSCPI_PyCharm_Python_Examples.zip).
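A minimal sketch of that query flow with the vxi11 library (the IP address and channel are placeholders, not from the original post):
import vxi11

instr = vxi11.Instrument("192.168.0.10")    # hypothetical scope address
print(instr.ask("*IDN?"))                   # identification query
# Force ASCII transfer before requesting the waveform, as described above:
reply = instr.ask("FORM ASC;:CHAN1:DATA?")
samples = [float(x) for x in reply.split(",")]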

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>

I have some code to convert docx files to plain text:
import docx
import glob

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

for file in glob.glob('*.docx'):
    outfile = open(file.replace('.docx', '-out.txt'), 'w', encoding='utf8')
    for line in open(file):
        print(getText(filename), end='', file=outfile)
    outfile.close()
However, when I execute it, there is the following error:
Traceback (most recent call last):
File "C:\Users\User\Desktop\add spaces docx\converting docx to pure text.py", line 16, in <module>
for line in open(file):
File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am using Python 3.5.2.
Can anyone help to resolve this issue?
Thanks in advance.
Although I do not know the docx module well, I think I can suggest a solution.
According to fileformat.info, byte 0x8f (which is what the charmap codec couldn't decode, resulting in the UnicodeDecodeError) maps to a control character.
When reading files (which seems to be what the docx module is doing here), you have to watch out for control characters, because sometimes Python can't decode them.
One solution is to give up on the docx module, learn how .docx files work and are formatted, and open the file with open(filename, "rb") so Python reads raw bytes and never has to decode anything.
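A minimal sketch of that binary-mode read; in "rb" mode Python returns raw bytes and never attempts a charmap decode (the filename is a placeholder):
with open("some_document.docx", "rb") as f:
    raw = f.read()          # bytes object; no decoding happens
print(raw[:4])              # b'PK\x03\x04' -- a .docx file is really a ZIP archive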
However, this might not be the problem. As you can see from the traceback, Python is decoding the file with cp1252 (its default on Windows, found in the encodings directory) instead of utf-8. Try switching the file to utf-8 by specifying the encoding when it is opened.
NOTE: Sorry for the lack of links. This is because I do not have higher than 10 reputation (because I am new to Stack Overflow).

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5

I am trying to read a file using the following code.
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'))
features = precomputed['features']
But I am getting this error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5: ordinal not in range(128)
The file I am trying to read contains image features extracted by a deep neural network. The file content looks like this:
(dp0
S'imageIds'
p1
(lp2
I262145
aI131074
aI131075
aI393221
aI393223
aI393224
aI524297
aI393227
aI393228
aI262146
aI393230
aI262159
aI524291
aI322975
aI131093
aI524311
....
....
....
Please note that this is a big file, about 2.8 GB in size.
I know this is a duplicate question, but I followed the suggested solutions in other Stack Overflow posts and couldn't solve it. Any help would be appreciated!
Finally I found the solution. The problem was actually about unpickling a Python 2 object with Python 3, which I didn't understand at first because the pickle file I got was written by a Python 2 program.
Thanks to this answer which solved the problem. All I needed to do was set the encoding parameter of pickle.load() to latin1, because latin1 works for any input: it maps the byte values 0-255 directly onto the first 256 Unicode code points.
So, the following worked for me!
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'), encoding='latin1')
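The same call wrapped in a context manager so the file handle gets closed; note that encoding='latin1' only tells pickle how to decode Python 2 str instances, it does not alter the numeric data:
import pickle

with open('test/vgg16_features.p', 'rb') as f:
    precomputed = pickle.load(f, encoding='latin1')
features = precomputed['features']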

Python's handling of shell strings

I still do not completely understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same issue.
What I know:
str is an older beast that saves strings encoded by one of the way too many encodings that history has forced us to work with.
unicode is a more standardised way of representing strings using a huge table of all possible characters, emojis, little pictures of dog poop and so on.
The decode method transforms a str into unicode; encode goes the other way around.
If I, in python's shell, simply say:
>>> my_string = "some string"
then my_string is a str variable encoded in ascii (and, because ascii is a subset of utf-8, it is also encoded in utf-8).
Therefore, for example, I can convert this into a unicode variable by saying one of the lines:
>>> my_string.decode('ascii')
u'some string'
>>> my_string.decode('utf-8')
u'some string'
What I don't know:
How does Python handle non-ascii strings that are passed in the shell, and, knowing this, what is the correct way of saving the word "kožušček"?
For example, I can say
>>> s1 = 'kožušček'
In which case s1 becomes a str instance that I am unable to convert into unicode:
>>> s1='kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')
Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)
Now, naturally I can't decode the string with ascii, but what encoding should I use? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when fed the line s1 = 'kožušček'?
Another thought I had was to say
>>> s2 = u'kožušček'
But then, when I printed s2, I got
>>> print s2
kouèek
which means that Python lost a whole letter. Can someone explain this to me?
str objects contain bytes. What those bytes represent Python doesn't dictate. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they contain bytes representing UTF-8 data they can be decoded as such. If they contain bytes representing an image, then you can decode that information and display an image somewhere. When you use repr() on a str object Python will leave any bytes that are ASCII printable as such, the rest are converted to escape sequences; this keeps debugging such information practical even in ASCII-only environments.
Your terminal or console in which you are running the interactive interpreter writes bytes to the stdin stream that Python reads from when you type. Those bytes are encoded according to the configuration of that terminal or console.
In your case, your console encoded the input you typed to a Windows codepage, most likely. You'll need to figure out the exact codepage and use that codec to decode the bytes. Codepage 1252 seems to fit:
>>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
kožušèek
When you print those same bytes, your console is reading those bytes and interpreting them in the same codec it is already configured with.
Python can tell you what codec it thinks your console is set to; it tries to detect this information for Unicode literals, where the input has to be decoded for you. It uses the locale.getpreferredencoding() function to determine this, and the sys.stdin and sys.stdout objects have an encoding attribute; mine is set to UTF-8:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> 'kožušèek'
'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
>>> u'kožušèek'
u'ko\u017eu\u0161\xe8ek'
>>> print u'kožušèek'
kožušèek
Because my terminal has been configured for UTF-8 and Python has detected this, using a Unicode literal u'...' works. The data is automatically decoded by Python.
Why exactly your console lost a whole letter I don't know; I'd have to have access to your console and do some more experiments, see the output of print repr(s2), and test all bytes between 0x00 and 0xFF to see if this is on the input or output side of the console.
I recommend you read up on Python and Unicode:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Your system does not necessarily use the sys.getdefaultencoding() encoding; it is merely the default used when you convert without telling it the encoding, as in:
>>> sys.getdefaultencoding()
'ascii'
>>> unicode(s1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
Python's idea of your system locale is in the locale module:
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
And using this we can decode the string:
>>> u1 = s1.decode(locale.getdefaultlocale()[1])
>>> u1
u'ko\u017eu\u0161\u010dek'
>>> print u1
kožušček
There's a chance the locale has not been set up, as is the case for the 'C' locale. That may cause the reported encoding to be None even though the default is 'ascii'. Normally figuring this out is the job of setlocale, which getpreferredencoding will call automatically. I would suggest calling it once at program startup and saving the returned value for all further use. The encoding used for filenames may be yet another case, reported by sys.getfilesystemencoding().
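A minimal startup sketch of that suggestion (Python 2, matching the question):
import locale
import sys

locale.setlocale(locale.LC_ALL, '')          # initialise the locale from the user's environment
preferred = locale.getpreferredencoding()    # cache this once at startup for all further use
filesystem = sys.getfilesystemencoding()     # filenames may use yet another encoding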
The Python-internal default encoding is set up by the site module, which contains:
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii"  # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)  # Needs Python Unicode build !
So if you want it set by default in every run of Python, you can change that first if 0 to if 1.
