I dumped a Jupyter Notebook session using dill.dump_session(filename), and at one point it told me that the disk storage was full. I made some space on the disk and tried again, but now I am unable to load the session back using dill.load_session(filename).
I get the following error:
~/.local/lib/python3.6/site-packages/dill/_dill.py in load_session(filename, main)
408 unpickler._main = main
409 unpickler._session = True
--> 410 module = unpickler.load()
411 unpickler._session = False
412 main.__dict__.update(module.__dict__)
EOFError: Ran out of input
The file (i.e. filename) is about 30 GB in size.
How can I retrieve my data from the file?
BTW, I’m running all this on Google Cloud, and it’s costing me a fortune to keep the instance up and running.
I have tried using undill, and other unpickle methods.
For example I tried this:
open(file, 'a').close()
try:
    with open(file, "rb") as Score_file:
        unpickler = pickle.Unpickler(Score_file)
        scores = unpickler.load()
        return scores
But got this error:
6 with open(file, "rb") as Score_file:
7 unpickler = pickle.Unpickler(Score_file);
----> 8 scores = unpickler.load();
9
10 return scores
ModuleNotFoundError: No module named '__builtin__'
I know this probably isn't the answer you want to hear, but it sounds like you have a corrupt pickle file, most likely truncated when the disk filled up during the dump. If that's the case, you can get the data back only by editing the file by hand, which requires understanding what the pickled strings are and how they are structured. Note that there are some very rare cases where an object will dump but not load; however, it's much more likely that you have a corrupt file. Either way, the resolution is the same: a hand edit is the only way to potentially save what you have pickled.
Also, note that if you use dump_session, you really should use load_session (it performs a sequence of steps on top of a standard load, reversing what is done in dump_session). That is not the cause of this issue, however; the issue is most likely an incomplete or corrupt pickle file.
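Before attempting a hand edit, it may help to confirm that the file really is truncated and to find roughly where it stops. The following is a minimal sketch using the standard-library pickletools.genops, which walks the opcode stream without executing it. The helper name scan_pickle is hypothetical, and the small in-memory pickle stands in for the real session file (the sketch assumes the file is an ordinary pickle opcode stream).

```python
import io
import pickle
import pickletools


def scan_pickle(stream):
    """Walk the opcodes of a pickle stream without unpickling it.

    Returns (complete, last_offset): complete is True only if a STOP
    opcode was reached; last_offset is the position of the last opcode
    that could be read.
    """
    last_offset = 0
    try:
        for opcode, _arg, pos in pickletools.genops(stream):
            last_offset = pos
            if opcode.name == "STOP":
                return True, last_offset
    except Exception:
        # genops raises (typically ValueError) when the data runs out
        # or an unknown opcode is encountered.
        return False, last_offset
    return False, last_offset


# Small in-memory stand-in for a session file.
data = pickle.dumps({"a": list(range(100))}, protocol=2)
truncated = data[: len(data) // 2]

print(scan_pickle(io.BytesIO(data)))       # intact stream
print(scan_pickle(io.BytesIO(truncated)))  # simulated truncation
```

For a file on disk, the same function could be given an object opened with open(filename, "rb"), since genops reads sequentially from any file-like object. It only reports where the stream ends; it does not recover any objects.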
I built an hdf5 dataset using pytables. It contains thousands of nodes, each node being an image stored without compression (of shape 512x512x3). When I run a deep learning training loop (with a Pytorch dataloader) on it it randomly crashes, saying that the node does not exist. However, it is never the same node that is missing and when I open the file myself to verify if the node is here it is ALWAYS here.
I am running everything sequentially, as I thought that multithreaded/multiprocess access to the file might have been at fault, but that did not fix the problem. I have tried a LOT of things, and none of them worked.
Does anyone have an idea about what to do? Should I add something like a timer between calls to give the machine time to reallocate the file?
Initially I was working with pytables only, but in an attempt to solve my problem I tried loading the file with h5py instead. Unfortunately it did not work better.
Here is the error I get with h5py: "RuntimeError: Unable to get link info (bad symbol table node signature)"
The exact error may change but every time it says "bad symbol table node signature"
PS: I cannot share the full code because it is huge and part of a bigger codebase that is my company's property. I can still share the part below to show how I load the images:
with h5py.File(dset_filepath, "r", libver='latest', swmr=True) as h5file:
    node = h5file["/train_group_0/sample_5"] # <- this line breaks
    target = node.attrs.get('TITLE').decode('utf-8')
    img = Image.fromarray(np.uint8(node))
    return img, int(target.strip())
Before accessing the dataset (node), add a test to confirm that it exists. While you're adding checks, do the same for the attribute 'TITLE'. If you are going to use hard-coded path names (like 'group_0'), you should check that every node in the path exists (for example, does 'group_0' exist?). Alternatively, use one of the recursive visitor functions (.visit() or .visititems()) to be sure you only access existing nodes.
Modified h5py code with rudimentary checks looks like this:
sample = 'sample_5'
with h5py.File(dset_filepath, 'r', libver='latest', swmr=True) as h5file:
    if sample not in h5file['/train_group_0'].keys():
        print(f'Dataset Read Error: {sample} not found')
        return None, None
    else:
        node = h5file[f'/train_group_0/{sample}'] # <- this line breaks
        img = Image.fromarray(np.uint8(node))
        if 'TITLE' not in node.attrs.keys():
            print(f'Attribute Read Error: TITLE not found')
            return img, None
        else:
            target = node.attrs.get('TITLE').decode('utf-8')
            return img, int(target.strip())
You said you were working with PyTables. Here is code to do the same with PyTables package:
import tables as tb
sample = 'sample_5'
with tb.File(dset_filepath, 'r', libver='latest', swmr=True) as h5file:
    if sample not in h5file.get_node('/train_group_0'):
        print(f'Dataset Read Error: {sample} not found')
        return None, None
    else:
        node = h5file.get_node(f'/train_group_0/{sample}') # <- this line breaks
        img = Image.fromarray(np.uint8(node))
        if 'TITLE' not in node._v_attrs:
            print(f'Attribute Read Error: TITLE not found')
            return img, None
        else:
            target = node._v_attrs['TITLE'].decode('utf-8')
            return img, int(target.strip())
I have a python script that "streams" a very large gzip file using urllib3 and feeds it into a zlib.decompressobj. This zlib decompression object is configured to read gzip compression. If this initial http connection is interrupted then the zlib.decompressobj begins to throw errors after the connection is "resumed". See my source code below if you want to cut to the chase.
These errors occur despite the fact that the script initiates a new http connection using the Range header to specify the number of bytes previously read. It resumes from the completed read point present when the connection was broken. I believe this arbitrary resume point is the source of my problem.
If I don't try to decompress the chunks of data being read in by urllib3, but instead just write them to a file, everything works just fine. Without trying to decompress the stream everything works even when there is an interruption. The completed archive is valid, it is the same size as one downloaded by a browser and the MD5 hash of the .gz file is the same as if I had downloaded the file directly with Chrome.
On the other hand, if I try to decompress the chunks of data coming in after the interruption, even with the Range header specified, the zlib library throws all kinds of errors. The most recent was Error -3 while decompressing data: invalid block type
Additional Notes:
The site that I am using sets the Accept-Ranges header to bytes, meaning that I am able to submit modified Range headers to the server.
I am not using the requests library in this script as it ultimately manages urllib3. I am instead using urllib3 directly in an attempt to cut out the middle man.
This script is an oversimplification of my ultimate goal, which is to stream the compressed data directly from where it is hosted, enrich it and store it in a MySQL database on the local network.
I am heavily resource constrained inside of the docker container where this processing will occur.
The genesis of this question is present in a question I asked almost 3 weeks ago: requests.iter_content() thinks file is complete but it's not
The most common problem I am encountering with the urllib3 (and requests) library is the IncompleteRead(self._fp_bytes_read, self.length_remaining) error.
This error only appears if the urllib3 library has been patched to raise an exception when an incomplete read occurs.
My best guess:
I am guessing that the break in the data stream being fed to zlib.decompressobj is causing zlib to somehow lose context and start attempting to decompress the data again at an odd location. Sometimes it will resume, but the data stream is garbled, which makes me believe that the byte location used in the new Range header fell at the front of some bytes that are then incorrectly interpreted as headers. I do not know how to counteract this, and I have been trying to solve it for several weeks. The fact that the data are still valid when downloaded whole (without being decompressed prior to completion), even when an interruption occurs, makes me believe that some "loss of context" within zlib is the cause.
Source Code: (Has been updated to include a "buffer")
This code is a little bit slapped together so forgive me. Also, this target gzip file is quite a lot smaller than the actual file I will be using. Additionally, the target file in this example will no longer be available from Rapid7 in about a month's time. You may choose to substitute a different .gz file if that suits you.
import urllib3
import certifi
import inspect
import os
import time
import zlib
def patch_urllib3():
    """Set urllib3's enforce_content_length to True by default."""
    previous_init = urllib3.HTTPResponse.__init__
    def new_init(self, *args, **kwargs):
        previous_init(self, *args, enforce_content_length = True, **kwargs)
    urllib3.HTTPResponse.__init__ = new_init

#Patch the urllib3 module to throw an exception for IncompleteRead
patch_urllib3()
#Set the target URL
url = "https://opendata.rapid7.com/sonar.http/2021-11-27-1638020044-http_get_8899.json.gz"
#Set the local filename
local_filename = '2021-11-27-1638020044-http_get_8899_script.json.gz'
#Configure the PoolManager to handle https (I think...)
http = urllib3.PoolManager(ca_certs=certifi.where())
#Initiate start bytes at 0 then update as download occurs
sum_bytes_read=0
session_bytes_read=0
total_bytes_read=0
#Dummy variable to silence console output from file write
writer=0
#Set zlib window bits to 16 bits for gzip decompression
decompressor = zlib.decompressobj(zlib.MAX_WBITS|16)
#Build a buffer list
buf_list=[]
i=0
while True:
    print("Building request. Bytes read:",total_bytes_read)
    resp = http.request(
        'GET',
        url,
        timeout=urllib3.Timeout(connect=15, read=40),
        preload_content=False)
    print("Setting headers.")
    #This header should cause the request to resume at "total_bytes_read"
    resp.headers['Range'] = 'bytes=%s' % (total_bytes_read)
    print("Local filename:",local_filename)
    #If file already exists then append to it
    if os.path.exists(local_filename):
        print("File already exists.")
        try:
            print("Starting appended download.")
            with open(local_filename, 'ab') as f:
                for chunk in resp.stream(2048):
                    buf_list.append(chunk)
                    #Use i to offset the chunk being read from the "buffer"
                    #I.E. load 3 chunks (0,1,2) in the buffer list before starting to read from it
                    if i >2:
                        buffered_chunk=buf_list.pop(0)
                        writer=f.write(buffered_chunk)
                        #Comment out the below line to stop the error from occurring.
                        #File download should complete successfully even if interrupted when the following line is commented out.
                        decompressed_chunk=decompressor.decompress(buffered_chunk)
                    #Increment i so that the buffer list will fill before reading from it
                    i=i+1
                    session_bytes_read = resp._fp_bytes_read
                    #Sum bytes read is an updated value that isn't stored. It is only used for console print
                    sum_bytes_read = total_bytes_read + session_bytes_read
                    print("[+] Bytes read:",str(format(sum_bytes_read, ",")), end='\r')
            print("\nAppended download complete.")
            break
        except Exception as e:
            print(e)
            #Add to total bytes read to current session bytes each time the loop needs to repeat
            total_bytes_read=total_bytes_read+session_bytes_read
            print("Bytes Read:",total_bytes_read)
            #Mod the total_bytes back to the nearest chunk size so it can be - re-requested
            total_bytes_read=total_bytes_read-(total_bytes_read%2048)-2048
            print("Rounded bytes Read:",total_bytes_read)
            #Pop the last entry off of the buffer since it may be incomplete
            buf_list.pop()
            #reset i so that the buffer has to rebuilt
            i=0
            print("Sleeping for 30 seconds before re-attempt...")
            time.sleep(30)
    #If file doesn't already exist then write to it directly
    else:
        print("File does not exist.")
        try:
            print("Starting initial download.")
            with open(local_filename, 'wb') as f:
                for chunk in resp.stream(2048):
                    buf_list.append(chunk)
                    #Use i to offset the chunk being read from the "buffer"
                    #I.E. load 3 chunks (0,1,2) in the buffer list before starting to read from it
                    if i > 2:
                        buffered_chunk=buf_list.pop(0)
                        #print("Buffered Chunk",str(i-2),"-",buffered_chunk)
                        writer=f.write(buffered_chunk)
                        decompressed_chunk=decompressor.decompress(buffered_chunk)
                    #Increment i so that the buffer list will fill before reading from it
                    i=i+1
                    session_bytes_read = resp._fp_bytes_read
                    print("[+] Bytes read:",str(format(session_bytes_read, ",")), end='\r')
            print("\nInitial download complete.")
            break
        except Exception as e:
            print(e)
            #Set the total bytes read equal to the session bytes since this is the first failure
            total_bytes_read=session_bytes_read
            print("Bytes Read:",total_bytes_read)
            #Mod the total_bytes back to the nearest chunk size so it can be - re-requested
            total_bytes_read=total_bytes_read-(total_bytes_read%2048)-2048
            print("Rounded bytes Read:",total_bytes_read)
            #Pop the last entry off of the buffer since it may be incomplete
            buf_list.pop()
            #reset i so that the buffer has to rebuilt
            i=0
            print("Sleeping for 30 seconds before re-attempt...")
            time.sleep(30)
    print("Looping...")
#Finish writing from buffer into file
#BE SURE TO SET TO "APPEND" with "ab" or you will overwrite the start of the file
f = open(local_filename, 'ab')
print("[+] Finishing write from buffer.")
while not len(buf_list) == 0:
    buffered_chunk=buf_list.pop(0)
    writer=f.write(buffered_chunk)
    decompressed_chunk=decompressor.decompress(buffered_chunk)
#Flush and close the file
f.flush()
f.close()
resp.release_conn()
Reproducing the error
To reproduce the error perform the following actions:
Run the script and let the download start
Be sure that line 65 decompressed_chunk=decompressor.decompress(chunk) is not commented out
Turn off your network connection until an exception is raised
Turn your network connection back on immediately.
If the decompressor.decompress(chunk) line is removed from the script then it will download the file and the data can be successfully decompressed from the file itself. However, if line 65 is present and an interruption occurs, the zlib library will not be able to continue decompressing the data stream. I need to decompress the data stream as I cannot store the actual file I am trying to use.
Is there some way to prevent this from occurring? I have now attempted to add a "buffer" list that stores the chunks; the script discards the last chunk after a failure and moves back to a point in the file that preceded the "failed" chunk. I am able to re-establish the connection and even pull back all the data correctly but even with a "buffer" my ability to decompress the stream is interrupted. I must not be smoothly recovering the data back to the buffer somehow.
Visualization:
I put this together very quickly in an attempt to better describe what I am trying to do...
I bet Mark Adler is hiding out there somewhere...
r+b doesn't append. You would need to use ab for that. It appears that on the re-try, you are reading the entire gzip file again from the start. With r+b, that file is written correctly to your output file, by overwriting what was read before.
However, you are feeding the initial read to the decompressor, and then the start of the file again. Not surprisingly, the decompressor then soon detects invalid compressed data.
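To illustrate the point, here is a minimal offline sketch of a resume that keeps the decompressor consistent: one decompressobj is kept for the whole download, the number of compressed bytes actually fed to it is tracked, and each retry restarts at exactly that offset. The fetch function, the simulated ConnectionError and the in-memory payload are stand-ins invented for this example. With urllib3, the equivalent would be to send the offset as a request header, for example headers={'Range': 'bytes=%d-' % fed} passed to http.request, rather than setting it on the response object.

```python
import gzip
import hashlib
import zlib

# Poorly compressible test payload, so the gzip stream spans many chunks.
payload = b"".join(hashlib.sha256(str(n).encode()).digest() for n in range(2000))
compressed = gzip.compress(payload)


def fetch(offset, fail_at=None, chunk_size=2048):
    """Stand-in for an HTTP GET carrying 'Range: bytes=<offset>-'.

    Yields chunks of the compressed stream starting at offset and raises
    ConnectionError once the position reaches fail_at (if given).
    """
    pos = offset
    while pos < len(compressed):
        if fail_at is not None and pos >= fail_at:
            raise ConnectionError("simulated network drop")
        yield compressed[pos:pos + chunk_size]
        pos += chunk_size


# One decompressor for the whole download; it is never re-created on retry.
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
fed = 0                          # compressed bytes handed to the decompressor
out = bytearray()
resumes = 0
fail_at = len(compressed) // 2   # drop the "connection" halfway through

while True:
    try:
        for chunk in fetch(fed, fail_at):
            out += decompressor.decompress(chunk)
            fed += len(chunk)    # advance only after a successful feed
        break
    except ConnectionError:
        resumes += 1
        fail_at = None           # the retry is allowed to finish

out += decompressor.flush()
print(resumes, bytes(out) == payload)
```

The key design choice is that the resume offset is derived from what the decompressor has consumed, not from what the socket reported as read, so no byte is fed twice or skipped.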
I have a 1.5 terabyte HDF5 file hosted on Amazon Simple Storage Service at the link below. I have neither the disk space to save it nor the memory to read it. Accordingly, I want to read it chunk by chunk, process each chunk, and discard the part already read. I was hoping to use pandas' read_hdf to read it, but it does not support URLs. Neither does the h5py library, it seems. h5py does mention a ros3 driver, but I haven't been able to get it to work yet. I also tried the response to this question, but the chunks cannot be read by h5py, or at least I have not found a way yet. So I am left with no idea how to process this file. Does anyone have any idea how to do so? The link to the file is this:
https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5
After having this exact same issue, I believe I've cobbled together a working solution for this using fsspec:
import h5py
import fsspec
URL = "..." # Assuming a publicly accessible url
remote_f = fsspec.open(URL, mode="rb")
if hasattr(remote_f, "open"):
    remote_f = remote_f.open()
f = h5py.File(remote_f)
# Do regular hdf5 things...
I've confirmed, using your link above, that this does not read the data into memory, just as if it were a local file:
import h5py
import fsspec
URL = "https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5"
remote_f = fsspec.open(URL, mode="rb")
if hasattr(remote_f, "open"):
    remote_f = remote_f.open()
f = h5py.File(remote_f)
f.visititems(print)
# 1. README <HDF5 dataset "1. README": shape (), type "|O">
# 2. Resources <HDF5 group "/2. Resources" (2 members)>
# 2. Resources/2.1. Building Models <HDF5 group "/2. Resources/2.1. Building Models" (9 members)>
...
I have set up a Google Cloud account.
I want to run my deep learning much faster in a Jupyter notebook, but
I cannot find a way to read my CSV file.
I downloaded it with wget from my GitHub account and afterwards I tried
dataset = pd.read_csv('/home/user/.jupyter/SIEMENSTRAIN.csv')
but I get the following error
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
Why? When I read it on my laptop using my jupyter notebooks, everything runs well
Any suggestions?
I tried the recommended solutions for this error and I got the next warning
/home/user/anaconda3/lib/python3.5/site-packages/ipykernel/main.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
if __name__ == '__main__':
When I ran dataset.head() this is what appeared
Any help please?
There are a number of possibilities that could be causing the problem. I would first make sure that your pandas version is up to date and compatible with the rest of your environment.
The more likely cause is that the CSV itself is malformed, so pd.read_csv() cannot parse it (hence the parser error). This may have something to do with the headers, though I'm not sure what your original CSV file looks like. It's worth experimenting with the read_csv arguments, for example:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
This changes two things: the delimiter, and whether pandas reads a header row from the CSV.
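Before changing parser options, it can help to look at what the file actually contains. The sketch below, using only the standard library, counts how many fields each row has for a given delimiter and checks whether the text looks like an HTML page. The HTML check reflects an assumption that is not confirmed by the post: that wget may have fetched a web page instead of the raw CSV. The helper names and the sample text are made up for illustration.

```python
import csv
import io
from collections import Counter


def field_count_report(text, delimiter=","):
    """Map number-of-fields -> number of rows that have that many fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return Counter(len(row) for row in reader)


def looks_like_html(text):
    """Rough check for an HTML page saved under a .csv name."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Made-up sample: two rows with 2 fields and one row with 4 fields.
sample = "a,b\n1,2\n3,4,5,6\n"
report = field_count_report(sample)
print(report)
print(looks_like_html(sample))
```

On the real file, the text could be obtained with open(path, encoding="utf-8").read(), or only the first few thousand characters if the file is large. A report dominated by one field count with a few outliers points at specific bad rows, while an HTML result suggests the download itself is the problem.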
I go through some pd.read_csv() stuff in my book about Stock Prediction (another cool Machine Learning problem) and Deep Learning, feel free to check it out.
Good Luck!
I tried what you proposed and this is what I got
So, any suggestions?
I suppose that the path is ok, but it just won't be read properly, or am I wrong?
I'm trying to load xlsx files with openpyxl.load_workbook directly from a compressed zip file, but it doesn't work. The following code fails at openpyxl.load_workbook with "BadZipfile: File is not a zip file":
with zipfile.ZipFile(os.path.join(root, raw)) as z:
    for file_info in z.infolist():
        wb = openpyxl.load_workbook(z.open(file_info), read_only=True)
There is nothing wrong with the archive or the Excel file in it; if I extract it to disk, the following works:
with open('report.xlsx') as f:
    wb = openpyxl.load_workbook(f, read_only=True)
I can go with this solution, temporarily extracting the file somewhere and loading the xlsx from there, but I would like to understand whether it is possible to load it straight from the zip file.
The problem is that read_only=True does not do quite what you think it does. According to the docs:
Fortunately, there are two modes that enable you to read and write unlimited amounts of data with (near) constant memory consumption.
While not explicitly stated, I would assume that this involves some equivalent to a memory-mapped file (because of "constant memory consumption") and random access (because of the range of allowed operations).
Either way, setting read_only=True does not indicate that you only intend to read a workbook (that's all load_workbook can do anyway; you have to overwrite the existing file to make any "changes"). It indicates that you want to access the file directly on disk, without loading the entire contents.
It seems pretty clear (and intuitively expected) that ZipFile.open does not provide a random-access file:
Note: The file-like object is read-only and provides the following methods: read(), readline(), readlines(), __iter__(), next().
The fact that seek is not mentioned in this list is quite telling (pun only somewhat intended).
You can get more information about the exception by splitting the offending line into two (a useful general debugging technique for nested function calls):
x = z.open(file_info)
wb = openpyxl.load_workbook(x, read_only=True)
You will notice that the error occurs on the second of those two lines. This is because pretty much all the Microsoft open-document formats are actually just fancy zip files. The problem is most likely that openpyxl cannot open your file in random access mode, not that it's actually an invalid zip file.
Either way, this is a bunch of very educated guesswork that leads to a simple, one-keyword-deletion solution:
TL;DR
Get rid of read_only=True when reading non-random-access data like a compressed zip entry:
wb = openpyxl.load_workbook(z.open(file_info))
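A different technique from the one above, offered only as a sketch: if a seekable file object is needed, the zip entry can be read fully into memory and wrapped in io.BytesIO. The example below uses only the standard library and demonstrates the idea with a nested zip, where the inner archive stands in for an .xlsx file (the file names are made up). The resulting BytesIO object is the kind of file-like object that could then be handed to openpyxl.load_workbook; that last step is not exercised here.

```python
import io
import zipfile

# Build an inner zip in memory; it stands in for an .xlsx file,
# since xlsx workbooks are themselves zip archives.
inner_buf = io.BytesIO()
with zipfile.ZipFile(inner_buf, "w") as inner:
    inner.writestr("xl/workbook.xml", "<workbook/>")

# Wrap it in a compressed outer archive, as in the question.
outer_buf = io.BytesIO()
with zipfile.ZipFile(outer_buf, "w", zipfile.ZIP_DEFLATED) as outer:
    outer.writestr("report.xlsx", inner_buf.getvalue())

outer_buf.seek(0)
with zipfile.ZipFile(outer_buf) as z:
    for file_info in z.infolist():
        # z.read() returns the decompressed bytes of the entry;
        # BytesIO makes them seekable (random access).
        member = io.BytesIO(z.read(file_info))
        with zipfile.ZipFile(member) as nested:
            names = nested.namelist()

print(names)
```

The trade-off is memory: the whole entry is held in RAM, which gives up the near-constant memory consumption that read-only mode is meant to provide.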
Appendix
You should get in the habit of writing minimal programs that demonstrate your issue so that people answering your question can focus on doing their job instead of getting irritated and closing down what would otherwise be a perfectly good question. I liked your question enough to do that for you, so here is a minimal program that demonstrates your issue and requires nothing more than copy-and-paste to run:
import openpyxl, zipfile
from openpyxl.workbook.workbook import Workbook

# Create a small workbook and save it to disk
wb = Workbook()
wb.active['A1'] = 12
wb.active['A2'] = 13
wb.save('report.xlsx')

# Put the workbook into a zip archive
with zipfile.ZipFile('test.zip', 'w') as z:
    z.write('report.xlsx')

# Load the workbook from a regular file (opened in binary mode)
with open('report.xlsx', 'rb') as f:
    wb = openpyxl.load_workbook(f, read_only=True)
    print(wb.active['A1'].value)
    print(wb.active['A2'].value)

# Load the same workbook from the zip entry (the case from the question)
with zipfile.ZipFile('test.zip', 'r') as z:
    for file_info in z.infolist():
        x = z.open(file_info, 'r')
        wb = openpyxl.load_workbook(x, read_only=True)
        print(wb.active['A1'].value)
        print(wb.active['A2'].value)