PyTorch data pipeline - python-3.x

I am trying to implement a bounded-buffer-like solution where the data generator and the model work as two separate processes. The data generator preprocesses the data and stores it in a shared queue (with a predefined max size to limit memory usage). The model, on the other hand, consumes data from this queue at its own pace until the queue is empty. Below is a snippet of my implementation.
'''
self._buffer is an object of multiprocessing.Queue
'''
def produce(self):
    for obj in self._generator:
        self._buffer.put(obj=obj, block=True, timeout=None)
    self._buffer.put(obj=None)  # sentinel: tells the consumer to stop

def consume(self):
    while True:
        dat = self._buffer.get(block=True, timeout=None)
        if dat is None:
            break
        # Train model on `dat`

def run(self):
    pt = multiprocessing.Process(target=self.produce)
    ct = multiprocessing.Process(target=self.consume)
    pt.start()
    ct.start()
    pt.join()
    ct.join()
However, the solution above does not work. I used torch.multiprocessing as instructed by the documentation. I also set torch.multiprocessing.set_start_method('spawn') in order to avoid "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method".
But now I get "TypeError: cannot pickle 'generator' object". How can this be fixed?

Since you work with PyTorch, you should use the Dataset and DataLoader approach. It handles all the multiprocessing and shared-memory problems for you.
You can have map-style datasets or iterable-style ones. It is best to read the official documentation on what is what and how they work.
In your case you are probably fine with an iterable-style dataset. I have used both approaches for similar cases. The iterable-style dataset is what you need if you don't know in advance how many samples you will be processing. For other cases I had a map-style dataset, where I knew the total number of samples beforehand (e.g. processing all images in a directory) and could use a sequential sampler to give me the elements in order.
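A minimal sketch of that iterable-style approach, assuming your preprocessing can be exposed as a generator factory (GeneratorDataset and make_samples are hypothetical names). Passing a callable instead of a live generator also sidesteps the pickling error you hit, since each worker process builds its own generator:
import torch
from torch.utils.data import IterableDataset, DataLoader

class GeneratorDataset(IterableDataset):
    def __init__(self, generator_fn):
        # Store a callable, not a live generator: callables pickle fine.
        self.generator_fn = generator_fn

    def __iter__(self):
        # Note: with num_workers > 1 every worker replays the full generator;
        # shard via torch.utils.data.get_worker_info() if that matters.
        return iter(self.generator_fn())

def make_samples():
    for i in range(100):
        yield torch.tensor([float(i)])

loader = DataLoader(GeneratorDataset(make_samples), batch_size=8, num_workers=1)
for batch in loader:
    pass  # train the model on `batch`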
Regarding one of your problems: all errors like TypeError: cannot pickle 'generator' object happen when you have objects which can't be serialized. Pickle is used for the serialization, and generators can never be pickled; with the 'spawn' start method, everything handed to a child process must be picklable, which is why self._generator fails here. I had similar cases where objects from wrapped C++ packages created with pybind11 were not serializable, or where some mutex variables were hidden somewhere.

Related

How to pass a shared value to processes whose jit/njit functions read and modify it?

I am trying to have an integer value shared across a multiprocessing program, where each process runs a jit function that reads and modifies the value.
I came across multiprocessing.Manager().Value, which passes a shared value to each process, but numba.jit does not accept this type.
Is there any solution to work around it?
import numba
import multiprocessing

@numba.jit()
def jj(o, ii):
    print(o.value)
    o.value = ii
    print(o.value)

if __name__ == '__main__':
    o = multiprocessing.Manager().Value('i', 0, lock=False)
    y1 = multiprocessing.Process(target=jj, args=(o, 10))
    y1.daemon = True
    y2 = multiprocessing.Process(target=jj, args=(o, 20))
    y2.daemon = True
    y1.start()
    y2.start()
    y1.join()
    y2.join()
You cannot modify a CPython object from an njit function, so the function will (almost) not benefit from Numba (the only optimization Numba could do is loop-lifting, but it cannot be used here anyway). What you are trying to achieve is not possible with multiprocessing + njitted Numba functions. Numba can be fast because it operates on native types rather than CPython ones, but multiprocessing's managers operate only on CPython types. You can use the very experimental objmode scope of Numba to execute pure Python inside a Numba function, but be aware that this is slow (and it currently sometimes just crashes).
Another big issue is that shared CPython objects are protected by the global interpreter lock (GIL), which basically prevents any parallel speed-up inside a process (except for IO-bound code and similar things). The GIL is designed to protect the interpreter from race conditions on the internal state of objects. AFAIK, managers can transfer pure-Python objects between processes thanks to pickling (which is slow), but using lock=False is unsafe and can also cause a race condition (not at the interpreter level, thanks to the GIL).
Note that the Numba function has to be recompiled in each process, which is slow (caching can help subsequent runs, but not the first one, because of concurrent compilation in multiple processes).
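For completeness, a sketch of what the experimental objmode escape hatch could look like here; the module-level shared handle and the worker wrapper are assumptions of this sketch, not part of the question. The numeric work stays in nopython mode, while the Manager access runs as slow object-mode Python:
import multiprocessing
import numba
from numba import objmode

shared = None  # hypothetical per-process handle, set in worker() below

@numba.njit
def jj(ii):
    total = 0
    for k in range(ii):         # numeric work stays in nopython mode
        total += k
    with objmode(old='int64'):  # CPython-level access runs as plain Python
        old = shared.value
        shared.value = total
    return old

def worker(o, ii):
    global shared
    shared = o                  # stash the Manager proxy where objmode code can see it
    print(jj(ii))

if __name__ == '__main__':
    o = multiprocessing.Manager().Value('i', 0)  # keep the default lock
    procs = [multiprocessing.Process(target=worker, args=(o, n)) for n in (10, 20)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()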

How is it possible to execute Python code during deserialization?

I was reading about pickling in the context of persisting instances, and ran across this snippet:
Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary python when you try to de-pickle it. [1]
My understanding is that pickling turns a data structure into an array of bytes, and the pickle library also contains methods to take a pickled byte array and rebuild a Python instance from it.
I tested some code to see if simply putting code into the class or init method would run it:
import pickle

class A:
    print('class')
    def __init__(self):
        print('instance')

a = A()

print('pickling...')
with open('/home/usrname/Desktop/pfile', 'wb') as pfile:
    pickle.dump(a, pfile, pickle.HIGHEST_PROTOCOL)

print('de-pickling...')
with open('/home/usrname/Desktop/pfile', 'rb') as pfile:
    a2 = pickle.load(pfile)
However this only yields
class
instance
pickling...
de-pickling...
suggesting that the __init__ method doesn't actually get run when the instance is unpickled. So I'm still confused about how you would make code run during that process.
Really thorough writeup here: https://intoli.com/blog/dangerous-pickles/
From what I understand, it has to do with how pickles are interpreted and run by the Pickle Machine (PM). A pickle is essentially a program for this little virtual machine, and you can craft a pickle file whose opcodes make the PM call an arbitrary callable (such as eval()) with arguments you provide.
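A minimal sketch of the mechanism from the Python side: __reduce__ lets an object tell the unpickler which callable to invoke with which arguments, so loading the payload runs the command (a harmless echo here):
import os
import pickle

class Malicious:
    def __reduce__(self):
        # On unpickling, the Pickle Machine calls os.system('echo owned').
        return (os.system, ('echo owned',))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # runs the shell command; __init__ is never called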

Multiprocessing hangs when applying function to pandas dataframe Python 3.7.1

I am trying to parallelize a function on my pandas dataframe, and I'm running into an issue where the multiprocessing library seems to hang. I am doing this all within a Jupyter notebook, with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?
Surprisingly, this piece of code worked previously on my Windows 7 machine with the same version of Python. I have just copied the file over to my Mac laptop.
I also use tqdm so I can monitor the progress; the behavior is the same with or without it.
# This function handles the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd
import tqdm

def parallelize_dataframe(df, func):
    num_partitions = cpu_count() * 2  # number of partitions to split dataframe
    num_cores = cpu_count()           # number of cores on your machine
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split), total=num_partitions)))

# The function I am applying to the dataframe lives in another file.
# myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe.
from myFuctions import myFunction

# Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF, myFunction)
The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This worked in the past, but now the process seems to hang indefinitely without any error messages.
Q: Can someone point out what I am doing wrong here?
You just expected MacOS to use the same mechanism for process instantiation as WinOS did in the past.
The multiprocessing module does not do the same set of things on each of the supported O/S-es; some methods have been reported as dangerous, and the default behaviour has changed on MacOS- and Linux-based systems.
Next steps to try to move forward:
re-read how to do the explicit setup of the call-signatures in the multiprocessing documentation, so as to avoid a hidden dependency of the code-behaviour on "new" default values (a sketch follows after this list)
test whether you can avoid the cases where multiprocessing spawns a full copy of the python-interpreter process as many times as you instruct (memory allocations can soon get devastatingly large if many replicas get instantiated beyond the localhost RAM-footprint, simply due to a growing number of CPU-cores)
test whether the "worker"-code is compute-intensive or rather dominated by the latency of remote API calls over the network. In such IO-latency-dominated use-cases, asyncio/await-decorated tools will help more with latency-masking than multiprocessing, whose spawned, rather expensive full copies of many python-processes just sit waiting for remote-API answers
last but not least - performance-sensitive code best runs outside any mediating ecosystem, like the interactivity-focused Jupyter-notebooks are.
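A minimal sketch of that explicit setup, assuming myFunction stays importable from the existing myFuctions.py file (the stand-in myDF below is hypothetical). Spawn-based platforms such as newer MacOS/Python combinations require both a module-level worker function and the __main__ guard:
import multiprocessing as mp
import numpy as np
import pandas as pd

from myFuctions import myFunction  # must live in a .py module, not in the notebook

def parallelize_dataframe(df, func, num_partitions=None):
    num_partitions = num_partitions or mp.cpu_count()
    df_split = np.array_split(df, num_partitions)
    ctx = mp.get_context('spawn')           # explicit, portable start method
    with ctx.Pool(mp.cpu_count()) as pool:  # context manager tears the pool down
        parts = pool.map(func, df_split)
    return pd.concat(parts)

if __name__ == '__main__':
    myDF = pd.DataFrame({'id': range(100)})  # hypothetical stand-in data
    finalDF = parallelize_dataframe(myDF, myFunction)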

Tensorflow memory leak in tf.decode_csv function

So I am running a DNN based on the iris model located here: https://www.tensorflow.org/get_started/estimator and the TextLineReader advice located here: https://www.tensorflow.org/api_guides/python/reading_data
It has a memory leak, and I have narrowed the leak down to these few lines of code:
import numpy as np
import tensorflow as tf

def main():
    # file_path: a list with the path(s) to the input .csv file(s)
    filename_queue = tf.train.string_input_producer(file_path)
    defaults = [[0.], [0.], [0.], [0.], [0]]
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    for i in range(50000):
        columns = tf.decode_csv(value, record_defaults=defaults)

if __name__ == "__main__":
    main()
Where the .csv file referred to by file_path contains 1 line:
5.9,3.0,4.2,1.5,1
When I run the program, this is my system usage over 60 seconds: [memory-usage graph omitted from the original post; usage climbs steadily until the process is killed]
Interestingly, all of the memory gets deallocated when I kill the program, or when the OOM manager does.
Anyway, I have to use batch processing in my program because of the size of the training dataset, so I have to perform the decoding of the .csv file in batches as well.
Is there a way to circumvent this leak, or is this a bug that should be reported?
Any information or suggestions are welcome.
Sort of obviously, the leak comes from calling tf.decode_csv inside the loop: each call adds a new op to the graph, allocating space that isn't deallocated until the program exits. The solution is to call tf.decode_csv once, outside of the for loop, when getting a batch. As unintuitive as this sounds, I have been able to verify that it still shuffles the data across consecutive reads.
More importantly, this gives insight into the nature of what are called graph operations in TensorFlow: tf.decode_csv does not decode anything by itself, it defines a node. One allocation nowhere near the session, and it still works. It is more like setting up a pipeline, and then feeding data through that pipeline.
My code actually runs faster now too, without all those mallocs!
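A sketch of the corrected pattern under the old TF 1.x queue-based input API (iris.csv is a hypothetical stand-in for file_path): the decode op is added to the graph once, then executed repeatedly inside a session:
import tensorflow as tf

file_path = ["iris.csv"]  # hypothetical stand-in for the real input file
defaults = [[0.], [0.], [0.], [0.], [0]]

# Graph construction happens once, outside any loop.
filename_queue = tf.train.string_input_producer(file_path)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
columns = tf.decode_csv(value, record_defaults=defaults)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(50000):
        sess.run(columns)  # feeds data through the pipeline; no new graph nodes
    coord.request_stop()
    coord.join(threads)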

Cannot access hdf5 array from a thread

I am using h5py and processing the array in several threads:
def process(start, end, dataset):
    for i in xrange(start, end):
        pass  # Do something with dataset[i]

f = h5py.File(path, 'r')
dataset = f[...]

worker = [threading.Thread(target=process, args=(start, end, dataset))
          for start, end in ...]
I get an error when accessing the array from the thread
I get an error when accessing the array from the thread
File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 367, in __getitem__
if self._local.astype is not None:
AttributeError: 'thread._local' object has no attribute 'astype'
I really have no clue why this happens. I can access dtype and shape, but accessing any slice raises this error.
Edit:
Calling
process(0, len(dataset), dataset)
in the main thread works as expected.
Can you provide a full example that reproduces the error? Please also mention the versions of Python, h5py, and the hdf5 library that you're using, as well as your OS.
The following code works with my setup, does it work on yours?
import threading
import h5py
import numpy as np

db = h5py.File('/tmp/test.h5', 'w')
dataset = db.create_dataset("mydata", data=np.random.random((10,)))

def process(start, end, dataset):
    for i in xrange(start, end):
        print(dataset[i])

workers = [threading.Thread(target=process, args=(start, end, dataset))
           for start, end in [[1, 2], [3, 4]]]
workers[0].start()
workers[1].start()
workers[0].join()
workers[1].join()

db.close()
May I ask why you are using threads to process your hdf5 file? Note that hdf5 does not provide thread-level concurrency. Although the above example "works" (premise: hdf5 compiled with the "threadsafe" option), the two operations will run sequentially: hdf5 operations are blocking and do not release the global interpreter lock during I/O, which prevents threads from running in parallel.
If you want your code to be executed in parallel, you have to use processes instead of threads. Note, however, that parallel reading and writing is safe only with the MPI version of h5py/hdf5.
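A sketch of what the process-based variant could look like for independent read-only access (reusing the file and dataset name from the example above); each worker opens its own file handle, since open h5py objects should not be shared across process boundaries:
import multiprocessing
import h5py

PATH = '/tmp/test.h5'  # the file created in the example above

def process(start, end):
    # Each worker opens its own handle instead of inheriting one.
    with h5py.File(PATH, 'r') as f:
        dataset = f['mydata']
        for i in range(start, end):
            print(dataset[i])

if __name__ == '__main__':
    workers = [multiprocessing.Process(target=process, args=(s, e))
               for s, e in [(1, 2), (3, 4)]]
    for w in workers:
        w.start()
    for w in workers:
        w.join()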
I had the same problem.
It was solved by updating h5py (from 2.2 to 2.6).
