shutil.copyfileobj but without headers or skip first line [duplicate]

shutil.copyfileobj but without headers or skip first line [duplicate] - python-3.x

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
#import csv files from folder
path =r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_,index_col=None,)
list_.append(df)
stockstats_data = pd.concat(list_)
print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.

If you don't need the CSV in memory, just copying from input to output, it'll be a lot cheaper to avoid parsing at all, and copy without building up in memory:
import shutil
import glob
#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort() # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
for i, fname in enumerate(allFiles):
with open(fname, 'rb') as infile:
if i != 0:
infile.readline() # Throw away header on all but first file
# Block copy rest of file from input to output without parsing
shutil.copyfileobj(infile, outfile)
print(fname + " has been imported.")
That's it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python level work to parse and reserialize. Don't omit the `allFiles.sort()!†
This assumes all the CSV files have the same format, encoding, line endings, etc., the encoding encodes such that newlines appear as a single byte equivalent to ASCII \n and it's the last byte in the character (so ASCII and all ASCII superset encodings work, as does UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE) and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
For the cases where the encoding's version of a newline doesn't look enough like an ASCII newline, or where the input files are in one encoding, and the output file should be in a different encoding, you can add the work of encoding and decoding without adding CSV parsing/serializing work, with (adding a from io import open if on Python 2, to get Python 3-like efficient encoding-aware file objects, and defining known_input_encoding to some string representing the known encoding for input files, e.g. known_input_encoding = 'utf-16-le', and optionally a different encoding for output files):
# Other imports and setup code prior to first with unchanged from before
# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
for i, fname in enumerate(allFiles):
# Decode with known encoding, disabling line-ending translation
# for same reasons as above
with open(fname, encoding=known_input_encoding, newline='') as infile:
if i != 0:
infile.readline() # Throw away header on all but first file
# Block copy rest of file from input to output without parsing
# just letting the file object decode from input and encode to output
shutil.copyfileobj(infile, outfile)
print(fname + " has been imported.")
This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It's also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn't match the assumed self-checking encoding, it's highly unlikely to decode validly, so you'll get an exception rather than silent misbehavior.
Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:
The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).
Still portable, but only works for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj's default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can't fully avoid for large buffers. You'd replace shutil.copyfileobj(infile, outfile) with (3.8+'s walrus operator, :=, used for brevity) the following code adapted from CPython 3.10's implementation of shutil._copyfileobj_readinto (which you could always use directly if you don't mind using non-public APIs):
buf_length = 1 << 20 # 1 MiB buffer; tweak to preference
# Using a memoryview gets zero copy performance when short reads occur
with memoryview(bytearray(buf_length)) as mv:
while n := infile.readinto(mv):
if n < buf_length:
with mv[:n] as smv:
outfile.write(smv)
else:
outfile.write(mv)
Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:
a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)
To make it signal resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent + skipped and the number remaining, looping until you've copied them all (these are low level system calls, they can be interrupted).
b. On any system Python 3.8+ built with glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace sendfile with copy_file_range:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)
With similar caveats about checking for copying fewer bytes than expected and retrying.
c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API shutil.copyfile uses, posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don't do this; it's likely to break across even minor Python releases):
infile.seek(header_len_bytes) # Skip past header
posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
which assumes fcopyfile pays attention to the seek position (docs aren't 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on the results, and for reproducible results, you'll want to impose some ordering (it wouldn't be great if the same files, with the same names and data, produced an output file in a different order simply because in-between runs, a file got moved out of the directory, then back in, and changed the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read from a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world, see details at A Code Glitch May Have Caused Errors In More Than 100 Published Studies.

Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done

You don't need pandas for this, just the simple csv module would work fine.
import csv
df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'wb') as fout:
writer = csv.writer(fout)
for filename in allFiles:
with open(filename) as fin:
reader = csv.reader(fin)
headers = reader.next()
if write_headers:
write_headers = False # Only write headers once.
writer.writerow(headers)
writer.writerows(reader) # Write all remaining rows.

Here's a simpler approach - you can use pandas (though I am not sure how it will help with RAM usage)-
import pandas as pd
import glob
path =r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_)
stockstats_data = pd.concat((df, stockstats_data), axis=0)

Related

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the xml files in it. It contains several small and one 5 GB xml file. I'm trying to remove certain characters from the xml files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.

You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily include writing the file to a storage. You might be able to do the changes you like in a streaming fashion, i.e. that everything is only done in memory without ever having the complete decompressed file somewhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this but you also specified Python in the tags, and with the tarfile module you can achieve this:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile
tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
reader = tar_in.extractfile(tar_info)
if tar_info.path == 'a': # my example file names are "a" and "b"
# now comes the code which makes our change:
# we just skip the first two bytes in each file:
reader.read(2) # skip two bytes
tar_info.size -= 2 # reduce size in info object as well
# add the (maybe changed) file to the output:
tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in a the first two bytes will be skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, and so it needs to write the size of the entry files into the tar info datagram which precedes every entry file in the resulting file, so I see no way around this. With a compressed output it also isn't possible to skip back after writing all output and adjust the file size.
But as you phrased your question, this might be possible in your case.
All you will have to do is provide a file-like object (could be a Popen object's output stream) like reader in my simple example case.

Output data from subprocess command line by line

I am trying to read a large data file (= millions of rows, in a very specific format) using a pre-built (in C) routine. I want to then yeild the results of this, line by line, via a generator function.
I can read the file OK, but where as just running:
<command> <filename>
directly in linux will print the results line by line as it finds them, I've had no luck trying to replicate this within my generator function. It seems to output the entire lot as a single string that I need to split on newline, and of course then everything needs reading before I can yield line 1.
This code will read the file, no problem:
import subprocess
import config
file_cmd = '<command> <filename>'
for rec in (subprocess.check_output([file_cmd], shell=True).decode(config.ENCODING).split('\n')):
yield rec
(ENCODING is set in config.py to iso-8859-1 - it's a Swedish site)
The code I have works, in that it gives me the data, but in doing so, it tries to hold the whole lot in memory. I have larger files than this to process which are likely to blow the available memory, so this isn't an option.
I've played around with bufsize on Popen, but not had any success (and also, I can't decode or split after the Popen, though I guess the fact I need to split right now is actually my problem!).

I think I have this working now, so will answer my own question in the event somebody else is looking for this later ...
proc = subprocess.Popen(shlex.split(file_cmd), stdout=subprocess.PIPE)
while True:
output = proc.stdout.readline()
if output == b'' and proc.poll() is not None:
break
if output:
yield output.decode(config.ENCODING).strip()

python: How do bz2 "incremental" and "one-shot" (de)compression differ from the "regular" method? [duplicate]

I am trying to use the bz2 and/or lzma packages in python. I am trying to compress a database dump in csv format and then put it to a zip file. I got it to work with one-shot compression with both the packages.
Code for which looks like this:
with ZipFile('something.zip', 'w') as zf:
content = bz2.compress(bytes(csv_string, 'UTF-8')) # also with lzma
zf.writestr(
'something.csv' + '.bz2',
content,
compress_type=ZIP_DEFLATED
)
When I try to use incremental compression then it creates a .zip file which when I try to extract keeps giving some archive file recursively.
Code for which looks like this:
with ZipFile('something.zip', 'w') as zf:
compressor = bz2.BZ2Compressor()
content = compressor.compress(bytes(csv_string, 'UTF-8')) # also with lzma
zf.writestr(
'something.csv' + '.bz2',
content,
compress_type=ZIP_DEFLATED
)
compressor.flush()
I went through the documentation and also look for information about the compression techniques, and there seems to be no comprehensive information about what one-shot and incremental compression are.

The difference between one-shot and incremental is that with one-shot mode you need to have the entire data in memory; if you are compressing a 100 gigabyte file, you ought to have loads of RAM.
With the incremental encoder your code can feed the compressor 1 megabyte or 1 kilobyte at a time and write whatever data results, into a file as soon as it is available. Another benefit is that an incremental compressor you can use to stream data - you can start writing compressed data before all uncompressed data is available!
Your second code is incorrect and it will cause you to lose your data. The flush may return more data that you need to save as well. Here I am compressing a string of 1000 'a' characters in Python 3; the result from compress is an empty string; the actual compressed data is returned from flush.
>>> c = bz2.BZ2Compressor()
>>> c.compress(b'a' * 1000)
b''
>>> c.flush()
b'BZh91AY&SYI\xdcOc\x00\x00\x01\x81\x01\xa0\x00\x00\x80\x00\x08 \x00
\xaamA\x98\xba\x83\xc5\xdc\x91N\x14$\x12w\x13\xd8\xc0'
Thus your second code should be:
compressor = bz2.BZ2Compressor()
content = compressor.compress(bytes(csv_string, 'UTF-8')) # also with lzma
content += compressor.flush()
But actually you're still doing the one-shot compression, in a very complicated manner.

How do I save a scipy distribution in a list or array to call? [duplicate]

I wonder, how to save and load numpy.array data properly. Currently I'm using the numpy.savetxt() method. For example, if I got an array markers, which looks like this:
I try to save it by the use of:
numpy.savetxt('markers.txt', markers)
In other script I try to open previously saved file:
markers = np.fromfile("markers.txt")
And that's what I get...
Saved data first looks like this:
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
But when I save just loaded data by the use of the same method, ie. numpy.savetxt() it looks like this:
1.398043286095131769e-76
1.398043286095288860e-76
1.396426376485745879e-76
1.398043286055061908e-76
1.398043286095288860e-76
1.182950697433698368e-76
1.398043275797188953e-76
1.398043286095288860e-76
1.210894289234927752e-99
1.398040649781712473e-76
What am I doing wrong? PS there are no other "backstage" operation which I perform. Just saving and loading, and that's what I get. Thank you in advance.

The most reliable way I have found to do this is to use np.savetxt with np.loadtxt and not np.fromfile which is better suited to binary files written with tofile. The np.fromfile and np.tofile methods write and read binary files whereas np.savetxt writes a text file.
So, for example:
a = np.array([1, 2, 3, 4])
np.savetxt('test1.txt', a, fmt='%d')
b = np.loadtxt('test1.txt', dtype=int)
a == b
# array([ True, True, True, True], dtype=bool)
Or:
a.tofile('test2.dat')
c = np.fromfile('test2.dat', dtype=int)
c == a
# array([ True, True, True, True], dtype=bool)
I use the former method even if it is slower and creates bigger files (sometimes): the binary format can be platform dependent (for example, the file format depends on the endianness of your system).
There is a platform independent format for NumPy arrays, which can be saved and read with np.save and np.load:
np.save('test3.npy', a) # .npy extension is added if not given
d = np.load('test3.npy')
a == d
# array([ True, True, True, True], dtype=bool)

np.save('data.npy', num_arr) # save
new_num_arr = np.load('data.npy') # load

The short answer is: you should use np.save and np.load.
The advantage of using these functions is that they are made by the developers of the Numpy library and they already work (plus are likely optimized nicely for processing speed).
For example:
import numpy as np
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
np.save(path/'x', x)
np.save(path/'y', y)
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
print(x is x_loaded) # False
print(x == x_loaded) # [[ True True True True True]]
Expanded answer:
In the end it really depends in your needs because you can also save it in a human-readable format (see Dump a NumPy array into a csv file) or even with other libraries if your files are extremely large (see best way to preserve numpy arrays on disk for an expanded discussion).
However, (making an expansion since you use the word "properly" in your question) I still think using the numpy function out of the box (and most code!) most likely satisfy most user needs. The most important reason is that it already works. Trying to use something else for any other reason might take you on an unexpectedly LONG rabbit hole to figure out why it doesn't work and force it work.
Take for example trying to save it with pickle. I tried that just for fun and it took me at least 30 minutes to realize that pickle wouldn't save my stuff unless I opened & read the file in bytes mode with wb. It took time to google the problem, test potential solutions, understand the error message, etc... It's a small detail, but the fact that it already required me to open a file complicated things in unexpected ways. To add to that, it required me to re-read this (which btw is sort of confusing): Difference between modes a, a+, w, w+, and r+ in built-in open function?.
So if there is an interface that meets your needs, use it unless you have a (very) good reason (e.g. compatibility with matlab or for some reason your really want to read the file and printing in Python really doesn't meet your needs, which might be questionable). Furthermore, most likely if you need to optimize it, you'll find out later down the line (rather than spending ages debugging useless stuff like opening a simple Numpy file).
So use the interface/numpy provide. It might not be perfect, but it's most likely fine, especially for a library that's been around as long as Numpy.
I already spent the saving and loading data with numpy in a bunch of way so have fun with it. Hope this helps!
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
Some comments on what I learned:
np.save as expected, this already compresses it well (see https://stackoverflow.com/a/55750128/1601580), works out of the box without any file opening. Clean. Easy. Efficient. Use it.
np.savez uses a uncompressed format (see docs) Save several arrays into a single file in uncompressed .npz format. If you decide to use this (you were warned about going away from the standard solution so expect bugs!) you might discover that you need to use argument names to save it, unless you want to use the default names. So don't use this if the first already works (or any works use that!)
Pickle also allows for arbitrary code execution. Some people might not want to use this for security reasons.
Human-readable files are expensive to make etc. Probably not worth it.
There is something called hdf5 for large files. Cool! https://stackoverflow.com/a/9619713/1601580
Note that this is not an exhaustive answer. But for other resources check this:
For pickle (guess the top answer is don't use pickle, use np.save): Save Numpy Array using Pickle
For large files (great answer! compares storage size, loading save and more!): https://stackoverflow.com/a/41425878/1601580
For matlab (we have to accept matlab has some freakin' nice plots!): "Converting" Numpy arrays to Matlab and vice versa
For saving in human-readable format: Dump a NumPy array into a csv file

np.fromfile() has a sep= keyword argument:
Separator between items if file is a text file. Empty (“”) separator means the file should be treated as binary. Spaces (” ”) in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace.
The default value of sep="" means that np.fromfile() tries to read it as a binary file rather than a space-separated text file, so you get nonsense values back. If you use np.fromfile('markers.txt', sep=" ") you will get the result you are looking for.
However, as others have pointed out, np.loadtxt() is the preferred way to convert text files to numpy arrays, and unless the file needs to be human-readable it is usually better to use binary formats instead (e.g. np.load()/np.save()).

namelist() from ZipFile returns strings with an invalid encoding

The problem is that for some archives or files up-loaded to the python application, ZipFile's namelist() returns badly decoded strings.
from zip import ZipFile
for name in ZipFile('zipfile.zip').namelist():
print('Listing zip files: %s' % name)
How to fix that code so i always decode file names in unicode (so Chineeze, Russian and other languages supported)?
I've seen some samples for Python 2, but since string's nature is changed in python3, i have no clue how to re-encode it, or apply chardet on it.

How to fix that code so i always decode file names in unicode (so Chineeze, Russian and other languages supported)?
Automatically? You can't. Filenames in a basic ZIP file are strings of bytes with no attached encoding information, so unless you know what the encoding was on the machine that created the ZIP you can't reliably get a human-readable filename back out.
There is an extension to the flags on modern ZIP files to tell you that the filename is UTF-8. Unfortunately files you receive from Windows users typically don't have it, so you'll left guessing with inherently unreliable methods like chardet.
I've seen some samples for Python 2, but since string's nature is changed in python3, i have no clue how to re-encode it, or apply chardet on it.
Python 2 would just give you raw bytes back. In Python 3 the new behaviour is:
if the UTF-8 flag is set, it decodes the filenames using UTF-8 and you get the correct string value back
otherwise, it decodes the filenames using DOS code page 437, which is pretty unlikely to be what was intended. However you can re-encode the string back to the original bytes, and then try to decode again using the code page you actually want, eg name.encode('cp437').decode('cp1252').
Unfortunately (again, because the unfortunatelies never end where ZIP is concerned), ZipFile does this decoding silently without telling you what it did. So if you want to switch and only do the transcode step when the filename is suspect, you have to duplicate the logic for sniffing whether the UTF-8 flag was set:
ZIP_FILENAME_UTF8_FLAG = 0x800
for info in ZipFile('zipfile.zip').filelist():
filename = info.filename
if info.flag_bits & ZIP_FILENAME_UTF8_FLAG == 0:
filename_bytes = filename.encode('437')
guessed_encoding = chardet.detect(filename_bytes)['encoding'] or 'cp1252'
filename = filename_bytes.decode(guessed_encoding, 'replace')
...

Here's the code that decodes filenames in zipfile.py according to the zip spec that supports only cp437 and utf-8 character encodings:
if flags & 0x800:
# UTF-8 file names extension
filename = filename.decode('utf-8')
else:
# Historical ZIP filename encoding
filename = filename.decode('cp437')
As you can see, if 0x800 flag is not set i.e., if utf-8 is not used in your input zipfile.zip then cp437 is used and therefore the result for "Chineeze, Russian and other languages" is likely to be incorrect.
In practice, ANSI or OEM Windows codepages may be used instead of cp437.
If you know the actual character encoding e.g., cp866 (OEM (console) codepage) may be used on Russian Windows then you could reencode filenames to get the original filenames:
filename = corrupted_filename.encode('cp437').decode('cp866')
The best option is to create the zip archive using utf-8 so that you can support multiple languages in the same archive:
c:\> 7z.exe a -tzip -mcu archive.zip <files>..
or
$ python -mzipfile -c archive.zip <files>..`

Got the same problem, but with defined language (Russian).
Most simple solution is just to convert it with this utility: https://github.com/vlm/zip-fix-filename-encoding
For me it works on 98% of archives (failed to run on 317 files from corpus of 11388)
More complex solution: use python module chardet with zipfile. But it depends on python version (2 or 3) you use - it has some differences on zipfile. For python 3 I wrote a code:
import chardet
original_name = name
try:
name = name.encode('cp437')
except UnicodeEncodeError:
name = name.encode('utf8')
encoding = chardet.detect(name)['encoding']
name = name.decode(encoding)
This code try to work with old style zips (having encoding CP437 and just has it broken), and if fails, it seems that zip archive is new style (UTF-8). After determining proper encoding, you can extract files by code like:
from shutil import copyfileobj
fp = archive.open(original_name)
fp_out = open(name, 'wb')
copyfileobj(fp, fp_out)
In my case, this resolved last 2% of failed files.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string