Strange "BadZipfile: Bad CRC-32" problem - zip

This code is a simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:
#!/usr/bin/env python
import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('Could not import the `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'StringIO':
        data = file(filename).read()
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'BytesIO':
        data = file(filename).read()
        return zipfile.ZipFile(io.BytesIO(data))

def process_zip_file(filename, method, open_defaults_file):
    zip_file = get_zip_file(filename, method)
    items_file = zip_file.open('items.csv')
    csv_file = csv.DictReader(items_file)
    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']
            if open_defaults_file:
                z = zip_file.open('defaults.csv')
                z.close()
        sys.stdout.write('Processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('Processing failed on item %d\n\n%s'
                         % (idx, traceback.format_exc()))

process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))
Pretty simple. We open the zip file and one or two CSV files inside the zip file.
What's weird is that if I run this with a large zip file (~13 MB), have it instantiate the ZipFile from a StringIO.StringIO or an io.BytesIO (perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile, or even from a file object created by calling os.tmpfile() and shutil.copyfileobj()), and have it open TWO csv files rather than just one, then it fails towards the end of processing. Here's the output that I see on a Linux system:
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.
Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of the BadZipfile exception, it seems to read corrupted data and gets very confused.
This all suggests to me that I am doing something in this code that you are not supposed to do -- e.g., calling ZipFile.open on one file while already having another file within the same ZipFile object open? This doesn't seem to be a problem when using ZipFile(filename), but perhaps it's problematic when passing ZipFile a file-like object, because of some implementation details in the zipfile module?
Perhaps I missed something in the zipfile docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile module?

I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile module with a hacked one of my own (called myzipfile here).
$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py 2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
 import binascii, cStringIO, stat
 import io
 import re
+import copy
 
 try:
     import zlib # We may need its compression method
@@ -877,7 +878,7 @@
         # Only open a new file for instances where we were not
         # given a file object in the constructor
         if self._filePassed:
-            zef_file = self.fp
+            zef_file = copy.copy(self.fp)
         else:
             zef_file = open(self.filename, 'rb')
The problem in the standard zipfile module is that, when passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are being called on the same file, so opening multiple files within the zip file causes the file position to be shared, and multiple open calls step all over each other. In contrast, when passed a filename, open opens a new file object. My solution: when a file object is passed in, instead of using that file object directly, I create a copy of it.
This change to zipfile fixes the problems I was seeing:
$ ./test_zip_file.py ~/data.zip StringIO 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
but I don't know if it has other negative impacts on zipfile...
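For anyone who would rather not patch zipfile at all, a patch-free workaround seems possible: read each member completely with ZipFile.read() instead of holding two ZipFile.open() handles at once, so no two readers ever share the underlying file pointer. A minimal sketch under that assumption, mirroring the test script above:
import csv, io, zipfile

def process_zip_data(fileobj):
    zip_file = zipfile.ZipFile(fileobj)
    # read() consumes each member in one shot, so no ZipExtFile handle
    # stays open while another member is read
    items_data = zip_file.read('items.csv')
    defaults_data = zip_file.read('defaults.csv')
    for idx, row in enumerate(csv.DictReader(io.BytesIO(items_data))):
        image_filename = row['image1']
The trade-off is holding each decompressed member in memory, which should be fine for CSV files of this size.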
EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open, it says:
Note: If the ZipFile was created by passing in a file-like object as the first argument to the
constructor, then the object returned by open() shares the ZipFile’s file pointer. Under these
circumstances, the object returned by open() should not be used after any additional operations
are performed on the ZipFile object. If the ZipFile was created by passing in a string (the
filename) as the first argument to the constructor, then open() will create a new file
object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.

What I did was update setuptools and then re-download it, and it works now:
https://pypi.python.org/pypi/setuptools/35.0.1

In my case, this solved the problem:
pip uninstall pillow

Could it be that you had the file open on your desktop? It has happened to me sometimes, and the solution was just to run the code without having the files open outside of the Python session.

Related

Saving tkinter variables to txt file error

I am trying to make my program save some tkinter string variables to a txt file.
Here is the code:
def saveFile():
    file = filedialog.asksaveasfile(mode='w')
    if file != None:
        file.write(seat1, seat2, seat3, seat4, seat5)
        file.close()
Then I get an error when I try to save the file:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Thonny\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "E:\Teacher Plan\seat-plan.py", line 64, in saveFile
file.write(seat1, seat2, seat3, seat4, seat5)
TypeError: write() takes exactly one argument (5 given)
Any ideas?
Make ALL your variables into one variable (seat[0], seat[1], seat[2], seat[3], seat[4]) and then save seat.
I've used this in one of my projects and it works fine.
First you define your main variable at the start:
ess = [[], [], [], [], [], []]
Then you do your work with the variables; when you're done, you save (or append) them into the single variable and write it to a file:
ess[0].append(ess_e), ess[1].append(essv), ess[2].append(essp), ess[3].append(essq), ess[4].append(ess_s), ess[5].append(ess_d)
file = open("relais.txt", "w")
file.write(repr(ess) + "\n")
file.close()
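Alternatively, the error message itself points at the direct fix: write() takes exactly one string, so join the values first. A minimal sketch, assuming seat1 through seat5 are tkinter StringVar objects (if they are plain strings, drop the .get() calls):
from tkinter import filedialog

def saveFile():
    file = filedialog.asksaveasfile(mode='w')
    if file is not None:
        seats = [seat1, seat2, seat3, seat4, seat5]
        # write() accepts a single string, so join the values into one
        file.write('\n'.join(s.get() for s in seats) + '\n')
        file.close()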

Async version of built-in print (stdout)

I have a problem understanding some of the limitations of using print inside an async function. Basically, this is my code:
#!/usr/bin/env python
import sys
import asyncio
import aiohttp

async def amain(loop):
    session = aiohttp.ClientSession(loop=loop)
    try:
        # using session to fetch a large json file which is stored
        # in obj
        print(obj)  # for debugging purposes
    finally:
        await session.close()

def main():
    loop = asyncio.get_event_loop()
    res = 1
    try:
        res = loop.run_until_complete(amain(loop))
    except KeyboardInterrupt:
        # silence traceback when pressing ctrl+c
        pass
    loop.close()
    return res

if __name__ == '__main__':
    sys.exit(main())
If I execute this, the json object is printed on stdout and then it suddenly dies with this error:
$ dwd-get-sensor-file ; echo $?
Traceback (most recent call last):
File "/home/yanez/anaconda/py3/envs/mondas/bin/dwd-get-sensor-file", line 11, in <module>
load_entry_point('mondassatellite', 'console_scripts', 'dwd-get-sensor-file')()
File "/home/yanez/projects/mondassatellite/mondassatellite/mondassatellite/bin/dwd_get_sensor_file.py", line 75, in main
res = loop.run_until_complete(amain(loop, args))
File "/home/yanez/anaconda/py3/envs/mondas/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
return future.result()
File "/home/yanez/projects/mondassatellite/mondassatellite/mondassatellite/bin/dwd_get_sensor_file.py", line 57, in amain
print(obj)
BlockingIOError: [Errno 11] write could not complete without blocking
1
The funny thing is that when I execute my code redirecting stdout to a file like this
$ dwd-get-sensor-file > output.txt ; echo $?
0
the exception doesn't happen and the whole output is correctly redirected to output.txt.
For testing purposes I converted the json object to a string, and instead of doing print(obj) I do sys.stdout.write(obj_as_str). Then I get this exception:
BlockingIOError: [Errno 11] write could not complete without blocking
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
I've searched for this BlockingIOError exception, but all the threads I found have something to do with network sockets or CI builds. However, I found one interesting GitHub comment:
The make: write error is almost certainly EAGAIN from stdout. Pretty much every command line tool expects stdout to be in blocking mode, and does not properly retry when in nonblocking mode.
So when I executed this
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'
I get 2048, which means blocking (or is this the other way round? I'm confused). After executing this
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'
I don't get the BlockingIOError exceptions anymore, but I don't really like this solution.
So, my question is: how should we deal with writing to stdout inside an async function? If I know that I'm dealing with stdout, should I set it to non-blocking and revert it when my program exits? Is there a specific strategy for this?
Give aiofiles a try, using stdout FD as the file object.
aiofiles helps with this by introducing asynchronous versions of files that support delegating operations to a separate thread pool.
In terms of actually using aiofiles with an FD directly, you could probably extend the aiofiles.os module, using wrap(os.write).
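One way to act on this suggestion without relying on aiofiles internals is to delegate the blocking write to the event loop's default executor; a minimal sketch (aprint is a hypothetical helper name, not a library function):
import asyncio
import sys

async def aprint(text):
    loop = asyncio.get_event_loop()
    # run the potentially blocking stdout write in a worker thread,
    # so the event loop itself never blocks on a full pipe buffer
    await loop.run_in_executor(None, sys.stdout.write, text + '\n')

async def amain():
    await aprint('large json blob goes here')

asyncio.get_event_loop().run_until_complete(amain())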

How to read zipfile from bytes in Python 3.x?

I'm reading a gzip file from bytes, which I have loaded from AWS S3. I have tried the code below to read it:
gzip_bytes = s3.get_file()  # for example, bytes I have loaded from S3
gzip_file = BytesIO(gzip_bytes)
with GzipFile(gzip_file, mode="rb") as file:
    # TODO: something
I'm getting the error below:
Traceback (most recent call last):
...
with GzipFile(BytesIO(pre_file_bytes), mode="rb") as pre_zip_file:
File "/usr/lib/python3.6/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
How can I resolve this issue? Or maybe I'm missing something.
Many thanks!
The GzipFile constructor takes:
class gzip.GzipFile(filename=None, mode=None, compresslevel=9, fileobj=None, mtime=None)
However, you are passing a BytesIO object instead of a string as the filename.
This is explained by the error message:
expected str, bytes or os.PathLike object, not _io.BytesIO
It looks like you should download the file, then provide the filename to the downloaded file.
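A sketch of that suggestion: write the downloaded bytes to a temporary file and hand GzipFile the resulting path (gzip_bytes stands in for the question's S3 payload):
import tempfile
from gzip import GzipFile

# persist the raw gzip bytes so GzipFile can open them by name
with tempfile.NamedTemporaryFile(suffix='.gz', delete=False) as tmp:
    tmp.write(gzip_bytes)
    tmp_path = tmp.name

with GzipFile(tmp_path, mode='rb') as f:
    data = f.read()  # decompressed contents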
To decompress and read a downloaded gzip file, perhaps from a URL, pass the corresponding io.BytesIO object to the fileobj parameter instead of the default filename parameter. For example,
from io import BytesIO
from gzip import GzipFile
import urllib.request

url = urllib.request.urlopen("https://oeis.org/names.gz")
with GzipFile(fileobj=BytesIO(url.read()), mode='rb') as f:
    # now you may treat f as an uncompressed file
    # for example, print the first line of the file
    for l in f:
        print(l)
        break
(This was already pointed out by the OP in a comment. I'm putting it in an answer so it is easier to find.)
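Applied to the question's case, the same fileobj= idea looks like this (s3.get_file() is the asker's placeholder for fetching the raw gzip bytes):
from io import BytesIO
from gzip import GzipFile

gzip_bytes = s3.get_file()  # raw gzip bytes from S3, per the question
with GzipFile(fileobj=BytesIO(gzip_bytes), mode="rb") as f:
    data = f.read()  # decompressed bytes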

gmail API: TypeError: sequence item 0: expected str instance, bytes found

I'm trying to download one message using the GMail API. Below is my traceback:
pdiracdelta@pdiracdelta-Laptop:~/GMail Metadata$ ./main.py
<oauth2client.client.OAuth2Credentials object at 0x7fd6306c4d30>
False
Traceback (most recent call last):
File "./main.py", line 105, in <module>
main()
File "./main.py", line 88, in main
service = discovery.build('gmail', 'v1', http=http)
File "/usr/lib/python3/dist-packages/oauth2client/util.py", line 137, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/lib/python3/dist-packages/googleapiclient/discovery.py", line 197, in build
resp, content = http.request(requested_url)
File "/usr/lib/python3/dist-packages/oauth2client/client.py", line 562, in new_request
redirections, connection_type)
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 1138, in request
headers = self._normalize_headers(headers)
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 1106, in _normalize_headers
return _normalize_headers(headers)
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 194, in _normalize_headers
return dict([ (key.lower(), NORMALIZE_SPACE.sub(value, ' ').strip()) for (key, value) in headers.items()])
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 194, in <listcomp>
return dict([ (key.lower(), NORMALIZE_SPACE.sub(value, ' ').strip()) for (key, value) in headers.items()])
TypeError: sequence item 0: expected str instance, bytes found
And below is a snippet of code which produces the credential object and boolean print just before the Traceback. It confirms that the credentials object is valid and is being used as suggested by Google:
credentials = get_credentials()
print(credentials)
print(str(credentials.invalid))
http = credentials.authorize(httplib2.Http())
service = discovery.build('gmail', 'v1', http=http)
What is going wrong here? It seems to me that I am not at fault, since the problem can be traced back to service = discovery.build('gmail', 'v1', http=http), which uses nothing but valid information (implying that one of the packages further down the stack cannot handle this valid information). Is this a bug, or am I doing something wrong?
UPDATE: it seems that the _normalize_headers function has since been patched. Updating your Python version should fix the problem (I'm using 3.6.7 now).
Solved with help from Padraic Cunningham, who identified the problem as an encoding issue. I solved it by applying .decode('utf-8') to the header keys and values (headers is a dict) whenever they are bytes objects (which are apparently UTF-8 encoded), transforming them into Python 3 strings. This is probably due to some Python 2/3 mixing in the Google API.
The fix also includes changing all code from the Google API examples to Python 3 code (e.g. exception handling), but most importantly my workaround involves editing /usr/lib/python3/dist-packages/httplib2/__init__.py at lines 193-194, redefining the _normalize_headers(headers) function as:
def _normalize_headers(headers):
    # iterate over a snapshot of the keys, since entries may be deleted
    for key in list(headers):
        # if not encoded as a string, it is ASSUMED to be encoded as UTF-8,
        # as it used to be in python2.
        if not isinstance(key, str):
            newkey = key.decode('utf-8')
            headers[newkey] = headers[key]
            del headers[key]
            key = newkey
        if not isinstance(headers[key], str):
            headers[key] = headers[key].decode('utf-8')
    return dict([(key.lower(), NORMALIZE_SPACE.sub(value, ' ').strip()) for (key, value) in headers.items()])
WARNING: this workaround is obviously quite dirty as it involves editing files from the httplib2 package. If someone finds a better fix, please post it here.
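For reference, the same decode idea can live outside httplib2, so no library file has to be edited; a minimal sketch (decode_headers is a hypothetical helper, not part of any library):
def decode_headers(headers):
    """Return a copy of headers with bytes keys and values decoded as UTF-8."""
    out = {}
    for key, value in headers.items():
        if isinstance(key, bytes):
            key = key.decode('utf-8')
        if isinstance(value, bytes):
            value = value.decode('utf-8')
        out[key] = value
    return out

print(decode_headers({b'Content-Type': b'text/plain'}))
# {'Content-Type': 'text/plain'}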

Reopening a closed StringIO object in Python 3

So, I create a StringIO object to treat my string as a file:
>>> a = 'Me, you and them\n'
>>> import io
>>> f = io.StringIO(a)
>>> f.read(1)
'M'
And then I proceed to close the 'file':
>>> f.close()
>>> f.closed
True
Now, when I try to open the 'file' again, Python does not permit me to do so:
>>> p = open(f)
Traceback (most recent call last):
File "<pyshell#166>", line 1, in <module>
p = open(f)
TypeError: invalid file: <_io.StringIO object at 0x0325D4E0>
Is there a way to 'reopen' a closed StringIO object? Or do I have to create a new one with io.StringIO()?
Thanks!
I have a nice hack which I am currently using for testing (since my code performs I/O operations, and handing it a StringIO is a nice workaround).
If this problem is a one-time thing:
st = StringIO()
close = st.close          # keep a reference to the real close
st.close = lambda: None   # make close a no-op
f(st)           # some function which performs I/O and finally closes st
st.getvalue()   # this is available now
close()         # if you don't want to store the close function you can also:
StringIO.close(st)
If this is a recurring thing, you can also define a context manager:
import contextlib

@contextlib.contextmanager
def uncloseable(fd):
    """
    Context manager which turns the fd's close operation into a no-op for the duration of the context.
    """
    close = fd.close
    fd.close = lambda: None
    yield fd
    fd.close = close
which can be used in the following way:
st = StringIO()
with uncloseable(st):
    f(st)
# Now st is still open!!!
I hope this helps you with your problem, and if not, I hope you will find the solution you are looking for.
Note: This should work exactly the same for other file-like objects.
No, there is no way to re-open an io.StringIO object. Instead, just create a new object with io.StringIO().
Calling close() on an io.StringIO object throws away the "file contents" data, so re-opening couldn't give access to that anyways.
If you need the data, call getvalue() before closing.
See also the StringIO documentation here:
The text buffer is discarded when the close() method is called.
and here:
getvalue()
Return a str containing the entire contents of the buffer.
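For example, a minimal demonstration of saving the buffer before closing and then building a fresh object from it:
import io

f = io.StringIO('Me, you and them\n')
f.read(1)
saved = f.getvalue()  # grab the full contents before closing
f.close()

p = io.StringIO(saved)  # a new, independent object; the closest thing to "reopening"
print(p.read())  # Me, you and them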
The builtin open() creates a file object (i.e. a stream), but in your example, f is already a stream.
That's the reason why you get TypeError: invalid file
After the method close() has executed, any stream operation will raise ValueError.
And the documentation does not mention about how to reopen a closed stream.
Maybe you should not close() the stream yet if you want to use it again later.
When you f.close(), the buffer is thrown away; you're then basically dereferencing something that no longer exists, asking for data at a memory location that is gone.
Here is what you could do instead:
import io

a = 'Me, you and them\n'
f = io.StringIO(a)
f.read(1)
f.close()

# Put the text from a, without the first char, into a new StringIO.
p = io.StringIO(a[1:])
# do some work with p.
I think your confusion comes from thinking of io.StringIO as a file on the block device. If you had used open() instead of StringIO, then you would be correct in your example and you could reopen the file. StringIO is not a file; it's the idea of a file object in memory. A file object does have a buffer, but it also exists physically on the block device. A StringIO is just that buffer, a staging area in memory for the data within it. When you call open(), a buffer is created too, but the data still remains on the block device.
Perhaps this is more what you want
fo = open('f.txt', 'w+')
fo.write('Me, you and them\n')
fo.seek(0)  # rewind so the read starts at the beginning
fo.read(1)
fo.close()

# reopen the now-closed file f.txt
p = open('f.txt', 'r')
# do stuff with p
p.close()
Here we are writing the string to the block device, so that when we close the file, the information written to it remains after it's closed. Because this creates a file in the directory the program is run in, it may be a good idea to give the file an extension. For example, you could name the file f.txt instead of f.
