I'm updating an older application for Python 3 while trying to maintain compatibility with Python 2.7 if possible. One of the issues I've encountered involves inconsistencies in ast.literal_eval() between Python 2 and 3 when handling a UTF-8 string.
Specifically, one of the functions my application performs involves:
Reading a string from a UTF-8 encoded text file that represents a Python list of file names
Converting that UTF-8 string to a Python list via literal_eval()
Using that list to access those files and perform other processing.
My test .txt file has this string:
['FileName1.txt', 'CP1252-1-àlacrème.txt', 'dUTF8-1-木兰辞.txt']
I'm using this brief test script to emulate what the larger application does:
import io
from ast import literal_eval

with io.open('z.txt', 'r', encoding='utf_8') as inFile:
    inStr = inFile.read()

print('Input string is length ' + str(len(inStr)))
fileList = literal_eval(inStr)
print(fileList)
Now, when I run this test script on Python 3, I get the following (all OK and as expected) result:
Input string is length 61
['FileName1.txt', 'CP1252-1-àlacrème.txt', 'dUTF8-1-???.txt']
(The question marks are expected: this is a Windows CMD window, and it doesn't handle non-Latin-1 characters.)
But anyway, when I run the same script with the same file on Python 2.7, I get this result:
Input string is length 61
['FileName1.txt', 'CP1252-1-\xc3\xa0lacr\xc3\xa8me.txt', 'dUTF8-1-\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e.txt']
So literal_eval() isn't maintaining the UTF-8 encoding in the resulting list. (Or, I guess, it's trying to maintain the encoding but the best it can do is represent the non-ASCII data as individual byte values.)
My question is: is there any way to make the Python 2 literal_eval() give the same result as the Python 3 version? Or am I stuck with this as a limitation?
As mentioned in the comments, ast.literal_eval() parses the input differently in Python 2 and 3. It's better not to write Python source as a data file; use a module like pandas with .csv files instead.
If the input is a UTF-8 file with content:
FileName1.txt,CP1252-1-àlacrème.txt,dUTF8-1-木兰辞.txt
Then pandas can read it with:
import pandas as pd

data = pd.read_csv('test.txt', encoding='utf8', header=None)
print(data)
Output (Windows terminal, Python 3; needs an appropriate font):
               0                      1                 2
0  FileName1.txt  CP1252-1-àlacrème.txt  dUTF8-1-木兰辞.txt
Output (Windows IDLE; Python 2 in a console needs an appropriate code page to view the ideographs):
               0                      1                 2
0  FileName1.txt  CP1252-1-àlacrème.txt  dUTF8-1-木兰辞.txt
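If pandas is too heavy a dependency for this, the standard-library csv module can read the same file on Python 3; a minimal sketch (note that Python 2's csv module works on byte strings, so it would need an extra encode/decode step there):

import csv

# Read the single row of comma-separated file names (Python 3)
with open('test.txt', 'r', encoding='utf_8', newline='') as inFile:
    reader = csv.reader(inFile)
    fileList = next(reader)

print(fileList)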
Related
I create a JSON file with VS Code (or Notepad++). It contains an array of strings, and one of the strings is "GRÜN". Then I read the file with Python 3:
import codecs
import json

with codecs.open(file, 'r', 'iso-8859-15') as infile:
    dictionary = json.load(infile)
If I print the array (inside "dictionary") to the console, I see "GRÃ\x9cN".
How can I convert "GRÃ\x9cN" back to "GRÜN"?
I tried reading the JSON file with the codec "iso-8859-1" too, but the issue still occurs.
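The string looks like classic mojibake: the file is actually UTF-8 but was decoded as ISO-8859-15. A minimal sketch of both the repair and the proper fix (assuming the file really is UTF-8; 'data.json' stands in for the real path):

import json

# Repair an already-mangled string: undo the wrong decode, redo the right one
s = 'GRÃ\x9cN'
print(s.encode('iso-8859-15').decode('utf-8'))  # -> GRÜN

# Better: read the file with the correct encoding in the first place
with open('data.json', 'r', encoding='utf-8') as infile:  # hypothetical path
    dictionary = json.load(infile)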
I am upgrading some code from Python 2 to Python 3.
There is a function that opens and reads files. In Python 2 there was no need to specify whether to open in binary or text mode, while in Python 3 I should specify the mode.
The Python 2 code is:
with open(f_path, mode=open_mode) as fp:
    content = fp.read()
This is causing me problems as it is called by various other functions where I don't necessarily know the file type in advance. (Sometimes the data is written to a zip file, other times the data is returned via an HTTP endpoint).
I expect most data will be binary image files, though CSV and text files will also be present.
What would be the best way of opening a file of unknown type and detecting if it is binary or string data?
Is it possible for example to open a file in binary mode, then detect that it contains text and convert it (or alternatively generate an exception and open it in string mode instead)?
You might try the binaryornot library.
pip install binaryornot
Then in the code:
from binaryornot.check import is_binary
is_binary(f_path)
Here is their documentation:
https://pypi.org/project/binaryornot/
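A minimal sketch of how that check might drive the mode choice (the read_any helper and the UTF-8 fallback are my assumptions, not part of the library):

from binaryornot.check import is_binary

def read_any(f_path):
    # Heuristically decide between binary and text mode (hypothetical helper)
    if is_binary(f_path):
        with open(f_path, 'rb') as fp:
            return fp.read()  # bytes
    with open(f_path, 'r', encoding='utf-8', errors='replace') as fp:
        return fp.read()  # str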
In this simple tutorial, written in Python 2.7, they have a line that loads the numpy array:
train_data = np.load(open('../musicnet.npz','rb'))
Then they get the data by indexing with different keys:
X,Y = train_data['2494']
Everything works well in Python 2.7.
The data type of train_data is numpy.lib.npyio.NpzFile.
My problem
However, whenever I try to do the same in Python 3.5, most of the lines work fine, except that the line X,Y = train_data['2494'] just freezes forever. I would like to use Python 3.5 because my other projects are written in Python 3.5.
How can I rewrite this line so that it runs in Python 3.5?
Error Message
I finally managed to get the error message in a terminal.
It only seemed to freeze because there is a ton of output right after the error message; my Jupyter notebook just cannot handle that much information.
Solution
Change the encoding to 'bytes':
train_data = np.load('../musicnet.npz', encoding='bytes')
Then everything works fine.
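One caveat worth flagging (my note, not from the tutorial): encoding='bytes' only affects Python 2 strings stored inside pickled object arrays; those come back as bytes in Python 3 and may need decoding downstream. A sketch:

import numpy as np

train_data = np.load('../musicnet.npz', encoding='bytes')
print(list(train_data.keys()))  # archive member names are ordinary str keys
X, Y = train_data['2494']       # any pickled Py2 strings inside arrive as bytes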
You first said things crashed; now you say it freezes when trying to access a specific array. numpy has the same syntax in 3.5 as in 2.7, so you shouldn't have to rewrite anything.
np.load does have a couple of parameters that deal with differences between Py2 and Py3. But I'm not sure these are an issue for you.
fix_imports : bool, optional
Only useful when loading Python 2 generated pickled files on Python 3,
which includes npy/npz files containing object arrays. If `fix_imports`
is True, pickle will try to map the old Python 2 names to the new names
used in Python 3.
encoding : str, optional
What encoding to use when reading Python 2 strings. Only useful when
loading Python 2 generated pickled files in Python 3, which includes
npy/npz files containing object arrays. Values other than 'latin1',
'ASCII', and 'bytes' are not allowed, as they can corrupt numerical
data. Default: 'ASCII'
Try
print(list(train_data.keys()))
This should show the array names that were saved to the zip archive. Do they match the names in the Py2 load? Do they include the '2494' name?
A couple of things are unusual about:
X,Y = train_data['2494']
Naming an array in the zip archive by a string number, and unpacking the load into two variables.
Do you know anything about how this was saved (presumably with np.savez)? What was saved?
Another question - are you loading this file from the same machine that Py2 worked on? Or has the file been transferred from another machine, and possibly corrupted?
As those parameters indicate, there are differences in the pickle code between Py2 and Py3. If the original save included object dtype arrays, or non-array objects, then they will be pickled and there might be incompatibilities in the pickle versions.
Try this:

with np.load('../musicnet.npz') as train_data:
    X, Y = train_data['2494']
There are two ways out, from my point of view:
Re-edit your code from
train_data = np.load(open('../musicnet.npz','rb'))
to
train_data = np.load(open('../musicnet.npz','r'))
because the difference between the 'r' and 'rb' modes in Python 2.7 versus 3.5 matters in your situation.
Use the default debugger to pinpoint the significant error. (That usually works, in my experience.)
I am a newbie in Python 3.
I have a question about writing a string to a file.
The string below is what I tried to write into a file:
ÀH \x10\x08\x81\x00 (in hex, c0 48 20 10 08 81 00)
When I checked the file with the xxd command, I could see a difference between the string and the file contents:
00000000: c380 4820 1008 c281 00 ..H .....
This is the code I wrote:

s = 'ÀH \x10\x08\x81\x00'
with open('test', 'w') as f:
    f.write(s)
The question is: how can I write this string into the file exactly as intended?
It seems that you want to write binary data. In that case, you should use the bytes type instead of str as this gives you full control over the binary content of the sequence.
When dealing with strings, you have to take into account that Python 3 strings are Unicode and that text files are written with an encoding (UTF-8 by default), so when you enter something like À, the file encoding decides which bytes are actually written. You can always encode() a string to look at its bytes:
>>> 'ÀH \x10\x08\x81\x00'.encode()
b'\xc3\x80H \x10\x08\xc2\x81\x00'
You can convert this to hex using the binascii module for a more readable hex string of those bytes:
>>> import binascii
>>> binascii.hexlify('ÀH \x10\x08\x81\x00'.encode())
b'c38048201008c28100'
As you can see, this is the same that was written to your file. So Python already does the correct thing. It’s just that the input is not what you want it to be.
So instead, use a bytes string and write to the file in binary mode:
# use a bytes string (the values the question intended: c0 48 20 10 08 81 00)
s = b'\xc0\x48\x20\x10\x08\x81\x00'
# open the file in binary mode
with open('test', 'wb') as f:
    f.write(s)
By the way, if you look at the encoded string from the beginning, you can already see that you had a different encoding in mind than Python when you entered that string. You expected À to be 0xc0, which is somewhat correct since that is its Latin-1 representation. But if you look up its other representations, you can see that in UTF-8, which is what Python uses by default, it is 0xc380 instead, which is again the value we got when encoding it in Python.
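You can check that in the same interpreter-session style as above: Latin-1 maps every code point below 0x100 to that single byte, so it produces exactly the bytes the question expected:

>>> 'ÀH \x10\x08\x81\x00'.encode('latin-1')
b'\xc0H \x10\x08\x81\x00'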
You have to set the source encoding to UTF-8, and also use a raw string because the string contains \ escape characters. So add the coding declaration and put r before the string to make it raw:

# -*- coding: utf-8 -*-
s = r'ÀH \x10\x08\x81\x00'
with open('test.txt', 'w') as f:
    f.write(s)
The problem is that for some archives or files uploaded to the Python application, ZipFile's namelist() returns badly decoded strings:

from zipfile import ZipFile

for name in ZipFile('zipfile.zip').namelist():
    print('Listing zip files: %s' % name)

How can I fix that code so that file names are always decoded correctly (so Chinese, Russian, and other languages are supported)?
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.
How can I fix that code so that file names are always decoded correctly (so Chinese, Russian, and other languages are supported)?
Automatically? You can't. Filenames in a basic ZIP file are strings of bytes with no attached encoding information, so unless you know what the encoding was on the machine that created the ZIP you can't reliably get a human-readable filename back out.
There is an extension to the flags on modern ZIP files to tell you that the filename is UTF-8. Unfortunately, files you receive from Windows users typically don't have it, so you're left guessing with inherently unreliable methods like chardet.
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.
Python 2 would just give you raw bytes back. In Python 3 the new behaviour is:
if the UTF-8 flag is set, it decodes the filenames using UTF-8 and you get the correct string value back
otherwise, it decodes the filenames using DOS code page 437, which is pretty unlikely to be what was intended. However, you can re-encode the string back to the original bytes and then try to decode again using the code page you actually want, e.g., name.encode('cp437').decode('cp1252').
Unfortunately (again, because the unfortunatelies never end where ZIP is concerned), ZipFile does this decoding silently without telling you what it did. So if you want to switch and only do the transcode step when the filename is suspect, you have to duplicate the logic for sniffing whether the UTF-8 flag was set:
import chardet
from zipfile import ZipFile

ZIP_FILENAME_UTF8_FLAG = 0x800

for info in ZipFile('zipfile.zip').infolist():
    filename = info.filename
    if info.flag_bits & ZIP_FILENAME_UTF8_FLAG == 0:
        filename_bytes = filename.encode('cp437')
        guessed_encoding = chardet.detect(filename_bytes)['encoding'] or 'cp1252'
        filename = filename_bytes.decode(guessed_encoding, 'replace')
    ...
Here's the code in zipfile.py that decodes filenames; per the zip spec, only the cp437 and utf-8 character encodings are supported:
if flags & 0x800:
    # UTF-8 file names extension
    filename = filename.decode('utf-8')
else:
    # Historical ZIP filename encoding
    filename = filename.decode('cp437')
As you can see, if the 0x800 flag is not set, i.e., if utf-8 is not used in your input zipfile.zip, then cp437 is used, and therefore the result for "Chinese, Russian and other languages" is likely to be incorrect.
In practice, ANSI or OEM Windows codepages may be used instead of cp437.
If you know the actual character encoding, e.g., cp866 (the OEM/console codepage) might be used on Russian Windows, then you can re-encode the filenames to recover the originals:
filename = corrupted_filename.encode('cp437').decode('cp866')
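A minimal sketch of applying that repair across a whole archive (assuming the names were written on a Russian Windows box using cp866; the flag check mirrors the zipfile.py logic above):

from zipfile import ZipFile

with ZipFile('archive.zip') as zf:
    for info in zf.infolist():
        if info.flag_bits & 0x800 == 0:  # UTF-8 flag not set
            # Undo zipfile's cp437 decode, then decode with the real codepage
            fixed = info.filename.encode('cp437').decode('cp866')
            print(fixed)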
The best option is to create the zip archive using utf-8 so that you can support multiple languages in the same archive:
c:\> 7z.exe a -tzip -mcu archive.zip <files>..
or
$ python -mzipfile -c archive.zip <files>..
I got the same problem, but with a known language (Russian).
The simplest solution is to convert the archive with this utility: https://github.com/vlm/zip-fix-filename-encoding
For me it works on 98% of archives (it failed on 317 files from a corpus of 11,388).
A more complex solution: use the Python module chardet together with zipfile. But it depends on which Python version (2 or 3) you use, since zipfile differs between them. For Python 3 I wrote this code:
import chardet

original_name = name
try:
    name = name.encode('cp437')
except UnicodeEncodeError:
    name = name.encode('utf8')
encoding = chardet.detect(name)['encoding']
name = name.decode(encoding)
This code tries to treat the name as an old-style zip entry (encoded as cp437 and simply garbled); if that fails, the archive appears to be new-style (UTF-8). After determining the proper encoding, you can extract the files with code like:
from shutil import copyfileobj

with archive.open(original_name) as fp, open(name, 'wb') as fp_out:
    copyfileobj(fp, fp_out)
In my case, this resolved the last 2% of failed files.