Cannot access hdf5 array from a thread - multithreading

I am using h5py processing the array in several threads:
def process(start, end, dataset):
for i in xrange(start, end):
# Do something with dataset[i]
f = h5py.File(path, 'r')
dataset = f[...]
worker = [threading.Thread(target=process, args=(start, end, dataset)) \
for start, end in ...]
I get an error when accessing the array from the thread
File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 367, in __getitem__
if self._local.astype is not None:
AttributeError: 'thread._local' object has no attribute 'astype'
I have really no clue why this happens :/, I can access dtype and shape, but on access of any slice I get this error.
Edit:
Calling
process(0, len(dataset), dataset)
in the main thread works as expected.

Can you provide a full example that reproduces the error? Please also mention the versions of Python, h5py, and the hdf5 library that you're using, as well as your OS.
The following code works with my setup, does it work on yours?
import threading
import h5py
import numpy as np
db = h5py.File('/tmp/test.h5', 'w')
dataset = db.create_dataset("mydata", data=np.random.random((10,)))
def process(start, end, dataset):
for i in xrange(start, end):
print(dataset[i])
workers = [threading.Thread(target=process, args=(start, end, dataset))
for start, end in [[1,2], [3,4]]]
workers[0].start()
workers[1].start()
db.close()
May I ask why you are using threads to process your hdf5 file? Note that hdf5 does not provide thread-level concurrency. Although the above example "works" (premise: hdf5 compiled with "threadsafe" option), the two operations will run sequentially. Hdf5 operations are blocking and do not release the global interpreter lock during I/O, which prevents threads from running in parallel.
If you want your code to be executed in parallel, you have to use processes instead of threads. Note, however, that parallel reading and writing is safe only with the MPI version of h5py/hdf5.

I had the same problem.
It was solved by updating h5py (from 2.2 to to 2.6).

Related

Pytorch data pipeline

I am trying to implement a bounded buffer like solution where data generator and the model work as two separate processes. The data generator preprocess the data and stores in a shared queue (with predefined max size to limit the memory usage). The model on the other hand consumes data from this queue at its own pace until the queue is empty. Below is the snippet of my implementation.
'''
self._buffer is an object of multiprocessing.Queue
'''
def produce(self):
for obj in self._generator:
self._buffer.put(obj=obj, block=True, timeout=None)
self._buffer.put(obj=None)
def consume(self):
while True:
dat = self._buffer.get(block=True, timeout=None)
if dat is None:
break
else:
# Train model on `dat`
def run(self):
pt = multiprocessing.Process(target=self.produce)
ct = multiprocessing.Process(target=self.consume)
pt.start()
ct.start()
pt.join()
ct.join()
However, the solution above does not work. I used the torch.multiprocessing as instructed the documentation. I also set torch.multiprocessing.set_start_method('spawn') in order to avoid "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
But now I get "TypeError: cannot pickle 'generator' object". How this can be fixed?
Since you work with pytorch you should use the Dataset and Dataloader approach. This handles all problems with multiprocessing, shared memory and so on for you.
You can have map style datasets or things like iterable-style.Best to read the official documentation, what is what and how they work.
In your case you probably are fine with an iterable-style dataset. I used both approaches for similar cases. You can have the iterable style dataset, which you might need if you don't know how much samples you will be processing. For other cases I had a map-style dataset, where I knew the total number of my samples beforehand (e.g. processing all images in a directory) and could use a sequential sampler to give me the elements in order.
Regarding one of your problems. All errors like this TypeError: cannot pickle 'generator' object happen when you have objects which can't be serialized. For serialization pickle is used. In your case self._generator seems to be an object which can't be serialized for some reason. Without code it is not possible to say why. I had cases where used wrapped c++ packages created with pybind where objects were not serializable or I had some mutex variables somewhere.

How is it possible to execute python code during deserialization?

I was reading about pickling in the context of persisting instances, and ran across this snippet:
Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary python when you try to de-pickle it. [1]
My understanding is that pickling turns a data-structure into an array of bytes, and the pickle library also contains methods to take a pickled byte array and rebuild a python instance from it.
I tested some code to see if simply putting code into the class or init method would run it:
import pickle
class A:
print('class')
def __init__(self):
print('instance')
a = A()
print('pickling...')
with open('/home/usrname/Desktop/pfile', 'wb') as pfile:
pickle.dump(a, pfile, pickle.HIGHEST_PROTOCOL)
print('de-pickling...')
with open('/home/usrname/Desktop/pfile', 'rb') as pfile:
a2 = pickle.load(pfile)
However this only yields
class
instance
pickling...
de-pickling...
suggesting that the __ init__ method doesn't actually get run when the instance is unpickled. So I'm still confused how you would make code run during that process.
Really thorough writeup here: https://intoli.com/blog/dangerous-pickles/
From what I understand, it has to do with how pickles are interpreted by the Pickle Machine (PM) and run. You can craft a pickle file that will cause it to evaluate using eval() the statements provided.

Multiprocessing hangs when applying function to pandas dataframe Python 3.7.1

I am trying to parallelize a function on my pandas dataframe and I'm running into an issue where it seems that the multiprocessing library is hanging. I am doing this all within a Jupyter notebook with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?
Surprisingly, this piece of code has worked previously on my Windows 7 machine with the same version of python. I have just copied the file over to my Mac laptop.
I also use tqdm so I can monitor the progress, the behavior is the same with or without it.
#This function hands the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import tqdm
def parallelize_dataframe(df, func):
num_partitions = cpu_count()*2 # number of partitions to split dataframe
num_cores = cpu_count() # number of cores on your machine
df_split = np.array_split(df, num_partitions)
pool = Pool(num_cores)
return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split),total=num_partitions)))
#My function that I am applying to the dataframe is in another file
#myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe
from myFuctions import myFunction
#Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF,myFunction)
The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This is worked in the past, but now the process seems to hang indefinitely without any error messages.
Q : Can someone point out what I am doing wrong here?
You just expected the MacOS to use the same mechanism for process-instantiations as the WinOS did in past.
The multiprocessing module does not do the same set of things on either of the supported O/S-es and even reported some methods to be dangerous and also had changed the default behaviour on MacOS- and Linux-based systems.
Next steps to try to move forward :
re-read how to do the explicit setup of the call-signatures in multiprocessing documentation ( avoid hidden dependency of the code-behaviour on "new" default values )
test if may avoid the cases where multiprocessing will spawn the full-copy of the python-interpreter process, that many times as you instruct ( memory allocations could soon get devastatingly large, if many replicas try to get instantiated beyond the localhost RAM-footprint, just due to a growing number of CPU-cores )
test if the "worker"-code is not computing intensive but rather network-remote API-call latency driven. In such a case asyncio/await decorated tools will help more with latency-masking than going into in the case of IO-latency dominated use-cases inefficient multiprocessing spawned and rather expensive full-copy concurrency of many python-processes (that just stay waitin for receiving remote-API answers ).
last but not least - performance-sensitive code best runs outside any mediating-ecosystem, like the interactivity-focused Jupyter-notebooks are.

Memory use if multiprocessing queue is not used by two separate processes

I have a thread in my python program that acquires images from a webcam and puts them in a multiprocessing queue. A separate process then takes these images from the queue and does some processing. However, if I try to empty the queue from the image acquisition (producer) thread I do not free any memory, and the program eventually uses all the available memory and crashes the machine (Python 3.6.6 / Ubuntu 18.04 64bit / Linux 4.15.0-43-generic)
I have a simple working example that reproduces the problem.
import multiprocessing
import time
import numpy as np
queue_mp = multiprocessing.Queue(maxsize=500)
def producer(q):
while True:
# Generate object to put in queue
dummy_in = np.ones((1000,1000))
# If the queue is full, get the oldest object (FIFO),
# to make space for the latest incoming object.
if q.full():
__ = q.get()
q.put(dummy_in)
def consumer(q):
while True:
# Get object from queue
dummy_out = q.get()
# Do some processing on the object, which we simulate here by time.sleep
time.sleep(3)
producer_process = multiprocessing.Process(target=producer,
args=(queue_mp,),
daemon=False)
consumer_process = multiprocessing.Process(target=consumer,
args=(queue_mp,),
daemon=False)
# Start producer and consumer processes
producer_process.start()
consumer_process.start()
I can rewrite my code to avoid this problem, but I'd like to understand what is happening. Is there a general rule that producers and consumers of a multiprocessing queue must be running in separate processes?
If anyone understands why this happens, or what exactly is happening behind the scenes of multiprocessing queues that would explain this memory behavior I would appreciate it. The docs did not go into a lot of detail.
I figured out what was happening, so I'll post it here for the benefit of anyone that stumbles across question.
My memory problem resulted from a numpy bug in numpy version 1.16.0. Reverting to numpy version 1.13.3 resolved the problem.
To answer the basic question: No, there is no need to worry which thread/process is doing the consuming (get) and which thread/process is doing the producing (put) for multiprocessing queues. There is nothing special about multiprocessing queues with respect to garbage collection. As kindall explains in response to a similar question:
When there are no longer any references to an object, the memory it occupies is freed immediately and can be reused by other Python objects
I hope that helps someone. In any case, the numpy bug should be resolved in the 1.16.1 release.

Tensorflow memory leak in tf.decode_csv function

So I am running a DNN that is based upon the iris Model located here:https://www.tensorflow.org/get_started/estimator and the textlineReader advice located here: https://www.tensorflow.org/api_guides/python/reading_data
It is having a memory leak problem, and I have narrowed down the leak to these few lines of code:
import numpy as np
import tensorflow as tf
def main():
filename_queue = tf.train.string_input_producer(file_path)
defaults = [[0.],[0.],[0.],[0.],[0]]
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
for i in range(50000):
columns = tf.decode_csv(value, record_defaults=defaults)
if __name__ == "__main__":
main()
Where the .csv file referred to by file_path contains 1 line:
5.9,3.0,4.2,1.5,1
When I run the program this is my system usage over 60 seconds:
Interestingly, all of the memory gets deallocated when I kill the program, or when the OOM manager does.
Anyway, I have to use batch processing in my program because of the size of the training dataset, so I have to perform the decoding of the .csv file in batches as well.
Is there a way to circumvent this leak, or is this a bug that should be reported?
Any information or suggestions are welcome.
Sort of obviously, the leak is coming from calling the decode_csv function, which is allocating some space that isn't deallocated until the program returns. The solution is to call the tf.decode_csv function outside of the for loop when getting a batch. As unintuitive as this sounds, I have been able to verify that it still shuffles the data with consecutive reads.
More importantly, this gives insight into the nature of what I believe are called graph operation in Tensorflow. One allocation no where near the session and it still works. I guess it is more like setting up a pipeline, and then feeding data through that pipelinne.
My code actually runs faster now too without all those mallocs!

Resources