UnicodeDecodeError When Opening a tar File in Python 3

I'm using Linux Mint 18.1 and Python 3.5.2.
I have a library that currently works using Python 2.7. I need to use the library for a Python 3 project. I'm updating it and have run into a unicode problem that I can't seem to fix.
First, a file is created via tar cvjf tarfile.tbz2 (on a Linux system) and is later opened in the Python library as open(tarfile).
If I run the code as is, using Python 3, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 11: invalid start byte
My first attempt at a fix was to open it as open(tarfile, encoding='utf-8') as I was under the impression that tar would just use what the file system gave it. When I do this, I get the same error (the byte value changes).
If I try with another encoding, say latin-1, I get the following error:
TypeError: Unicode-objects must be encoded before hashing
Which leads me to believe that utf-8 is correct, but I might be misunderstanding.
Can anyone provide suggestions?

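I was going down the wrong path thinking this was some strange encoding problem, when it was just a simple problem with the fact that open() defaults to reading as text ('r'). In Python 3, text mode decodes the bytes into str, which fails on binary data; in Python 2 on Linux, text mode is effectively a no-op.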
The fix is to open(tarfile, 'rb').
The head fake with unicode...should have seen this one coming. :facepalm:
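A minimal sketch of the difference, assuming the library goes on to hash the file contents (the hashing error in the question suggests this, but it is a guess). The payload bytes and the use of SHA-256 are made up for illustration; byte 0xc1 stands in for compressed data that is not valid UTF-8.

```python
import hashlib
import os
import tempfile

# Hypothetical stand-in for the .tbz2 archive: arbitrary binary data
# containing 0xc1, a byte that can never appear in valid UTF-8.
payload = b"BZh91AY&SY\x00\xc1\xff\x00binary-data"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tarfile.tbz2")
    with open(path, "wb") as f:
        f.write(payload)

    # Text mode with an explicit UTF-8 encoding: decoding the bytes fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: read() returns bytes, which hashlib accepts directly.
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()

print(text_mode_failed, data == payload, digest)
```

Reading with latin-1 would not raise on decode, since every byte maps to a character, but the result is a str, which is why the hashing step then complains.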

Related

Pygrib UnicodeEncodeError

I use Python 3.9.1 on macOS Big Sur with an M1 chip.
I would like to open a GRIB-format file provided by the Japan Meteorological Agency.
So I tried to use the pygrib library as below:
import pygrib
gpv_file = pygrib.open("Z__C_RJTD_20171205000000_GSM_GPV_Rjp_Lsurf_FD0000-0312_grib2.bin")
But I got an error like this:
----> 1 gpv_file = pygrib.open("Z__C_RJTD_20171205000000_GSM_GPV_Rjp_Lsurf_FD0000-0312_grib2.bin")
pygrib/_pygrib.pyx in pygrib._pygrib.open.__cinit__()
pygrib/_pygrib.pyx in pygrib._pygrib._strencode()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 50-51: ordinal not in range(128)
I asked other people to run the same code, and it somehow worked.
I am not sure what the problem is and how to fix it.

Why do different Python versions have different behaviors when printing to standard output?

Python 3.4 and Python 3.8/3.9 behave differently when I try to execute the statement below:
print('\u212B')
Python 3.8/3.9 can print it correctly.
Å
Python 3.4 will report an exception:
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print('\u212B')
UnicodeEncodeError: 'gbk' codec can't encode character '\u212b' in position 0: illegal multibyte sequence
According to this page, I can avoid the exception by overwriting sys.stdout with the statement:
sys.stdout = io.TextIOWrapper(buffer=sys.stdout.buffer,encoding='utf-8')
But Python 3.4 still prints a different character, as below:
鈩?
So my questions are:
Why do different Python versions have different behaviors when printing to standard output?
How can I print the correct character Å in Python 3.4?
Edit 1:
I guess the difference is caused by PEP 528 -- Change Windows console encoding to UTF-8. But I still don't understand the mechanism of console encoding or how I can print the correct character in Python 3.4.
Edit 2:
One more difference, sys.getfilesystemencoding() will get utf-8 in Python 3.8/3.9 and get mbcs in Python 3.4.
Why?
Regarding the rationale behind the stdout encoding you can read more in the answers here: Changing default encoding of Python?
In short, Python 3.4 uses your OS's encoding (gbk, judging by your traceback) as the default for stdout, whereas Python 3.8 uses UTF-8.
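The mismatch can be reproduced without a console. The sketch below is only an illustration: the helper name can_encode is hypothetical, and decoding the UTF-8 bytes as gbk is an assumption about what a gbk console does with them, which would explain the garbled output in the question.

```python
ANGSTROM = "\u212B"

def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# gbk has no mapping for U+212B, which is what the traceback reports.
gbk_ok = can_encode(ANGSTROM, "gbk")
utf8_ok = can_encode(ANGSTROM, "utf-8")

# UTF-8 always succeeds and produces three bytes for this character.
utf8_bytes = ANGSTROM.encode("utf-8")

# Assumption: a console that still expects gbk interprets those bytes
# with the wrong codec, so the result is not the original character.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(gbk_ok, utf8_ok, utf8_bytes, ascii(garbled))
```

So changing only the encoding Python writes with is not enough; the console has to interpret the bytes with the same encoding.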
How to fix this?
You can use the reconfigure() method, introduced in Python 3.7:
sys.stdout.reconfigure(encoding='utf-8')
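A small sketch of what reconfigure() does, using an in-memory stream as a stand-in for sys.stdout so it can be run anywhere (Python 3.7 or later). The variable names are illustrative only.

```python
import io

# Stand-in for sys.stdout on a system whose default encoding is gbk.
stream = io.TextIOWrapper(io.BytesIO(), encoding="gbk")

# Switch the encoder on the existing stream instead of replacing it.
stream.reconfigure(encoding="utf-8")

stream.write("\u212B")
stream.flush()

# The underlying binary buffer now holds the UTF-8 bytes for the character.
written = stream.buffer.getvalue()
print(stream.encoding, written)
```

This only changes how Python encodes the text; whether the character displays correctly still depends on the terminal decoding the bytes as UTF-8.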
Typically, you can try setting the environment variable PYTHONIOENCODING to utf-8 (shown here in Windows cmd syntax; use export instead of set in a POSIX shell):
set PYTHONIOENCODING=utf8
This works on most operating systems. On Windows with Python 3.6 or later, another environment variable must also be set for it to apply to the console:
set PYTHONLEGACYWINDOWSSTDIO=1
On Python versions before 3.7, you can install the win-unicode-console package, which handles Unicode issues transparently on Windows:
pip install win-unicode-console
If you are not running the code directly from a console there is a possibility that your IDE configuration is interfering.

Converting with ImageMagick - illegal parameter

I am trying to convert a PDF to JPG. The main goal is to have a thumbnail so that the user can preview the PDF within the application.
Apparently ImageMagick is a good way to do this, yet so far I have failed to get it to convert the file.
import subprocess
params = ['convert', '-density 300 -resize 220x205', 'dummy.pdf', 'thumb.jpg']
subprocess.check_call(params)
So this is what I am getting instead of having the file converted (the first line is German for "Invalid parameter - 300"):
Unzulässiger Parameter - 300
Traceback (most recent call last):
  File "pdf_preview_test.py", line 4, in <module>
    subprocess.check_call(params)
  File "C:\Users\EliasMessner\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['convert', '-density 300 -resize 220x205', 'dummy.pdf', 'thumb.jpg']' returned non-zero exit status 4.
I couldn't find any info about the problem. It looks like the "300" parameter is illegal, but IMHO it is used correctly. Help would be much appreciated. Thanks.
As you are on Windows, the code could be invoking the built-in Windows convert program instead of ImageMagick. You do not say what version of ImageMagick you are using; V7 uses magick as opposed to convert, which can prevent this problem.
If you allowed ImageMagick to add convert to the PATH environment variable on install, it would probably not be a problem; it never has been for me on multiple installs.
You could try changing convert to the full path to ImageMagick's convert, surrounding the path in " ". Alternatively, you can rename the convert program, e.g. to myconvert, and use that in the program rather than convert.
I would also try an ImageMagick command on the command line first to prove that it works.
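Separately from the name clash, subprocess passes each list element as one argument, so '-density 300 -resize 220x205' reaches the program as a single string rather than four tokens. A hedged sketch of building the command, assuming ImageMagick 7 is installed under the name magick; the helper build_thumbnail_command is hypothetical, and placing -resize after the input file is a common convention rather than something stated above.

```python
import shutil

def build_thumbnail_command(pdf_path, jpg_path, executable="magick"):
    """Build an argument list with every option and value as its own element."""
    return [
        executable,
        "-density", "300",     # read setting, given before the input file
        pdf_path,
        "-resize", "220x205",  # applied to the image once it has been read
        jpg_path,
    ]

command = build_thumbnail_command("dummy.pdf", "thumb.jpg")

# shutil.which shows which executable a bare name would resolve to,
# or None if it is not on PATH.
resolved = shutil.which(command[0])
print(command, resolved)

# To actually run it (requires ImageMagick and an existing dummy.pdf):
#   import subprocess
#   subprocess.check_call(command)
```

Passing the full path to the ImageMagick binary as executable is one way to rule out the Windows convert program entirely.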

Unicode character causing error with bdist_wininst on python 3 but not python 2

I'm compiling Windows installers for my Python code. I mostly write language-related tools and include examples that require UTF-8 strings in my documentation, including the README file.
I'm slowly moving from Python 2 to Python 3 and recently found that the command
python setup.py bdist_wininst
works fine with Python 2 but not with Python 3.
I've traced the problem to the inclusion of Unicode characters in my README file. The README file gets read into setup.py.
The bug occurs in Python36\lib\distutils\command\bdist_wininst.py
The error is:
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
In python 2.7 the relevant code in bdist_wininst.py is
if isinstance(cfgdata, str):
    cfgdata = cfgdata.encode("mbcs")
In python 3.6 the equivalent code in bdist_wininst.py is
try:
    unicode
except NameError:
    pass
else:
    if isinstance(cfgdata, unicode):
        cfgdata = cfgdata.encode("mbcs")
Here is my readme file:
https://github.com/timmahrt/pysle/blob/master/README.rst
And here is my setup.py file that reads in the README file
https://github.com/timmahrt/pysle/blob/master/setup.py
And the relevant line from setup.py:
long_description=codecs.open('README.rst', 'r', encoding="utf-8").read()
My question:
Is there a way to make python 3 happy in this case?

Type Error in Python 3.5.2

I'm trying to make a WSGI server following the tutorial Let's Build a Web Server.
But I get the error TypeError: initial_value must be str or None, not bytes when using the code below in Python 3.5.2.
import io
env['wsgi.input'] = io.StringIO(self.request_data)
How can I fix the issue? Thanks in advance.
First, that tutorial uses Python 2, so you may run into additional problems if you try to apply it directly to Python 3.
According to PEP 3333 (the WSGI specification updated for Python 3), the wsgi.input environ variable should be a byte stream, not a text stream, so you should use io.BytesIO(), not io.StringIO().
The error you currently get is because your self.request_data is bytes, but io.StringIO() requires its argument to be a str.
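A short sketch of the difference, with a made-up request standing in for self.request_data (the request text and the plain env dict are assumptions for illustration, not the tutorial's code):

```python
import io

# Hypothetical raw request as it would arrive from a socket: bytes, not str.
request_data = b"POST /hello HTTP/1.1\r\nHost: localhost\r\n\r\nname=world"

# io.StringIO only accepts str, so passing bytes reproduces the TypeError.
try:
    io.StringIO(request_data)
    stringio_rejected = False
except TypeError:
    stringio_rejected = True

# io.BytesIO wraps the bytes as a binary stream, as wsgi.input expects.
env = {}
env["wsgi.input"] = io.BytesIO(request_data)
body = env["wsgi.input"].read()

print(stringio_rejected, body == request_data)
```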
