Multiprocessing hangs when applying function to pandas dataframe Python 3.7.1 - python-3.x

I am trying to parallelize a function on my pandas dataframe and I'm running into an issue where it seems that the multiprocessing library is hanging. I am doing this all within a Jupyter notebook with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?
Surprisingly, this piece of code has worked previously on my Windows 7 machine with the same version of python. I have just copied the file over to my Mac laptop.
I also use tqdm so I can monitor the progress; the behavior is the same with or without it.
# This function handles the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd
import tqdm

def parallelize_dataframe(df, func):
    num_partitions = cpu_count() * 2  # number of partitions to split the dataframe into
    num_cores = cpu_count()           # number of cores on your machine
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split), total=num_partitions)))
# My function that I am applying to the dataframe is in another file.
# myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe.
from myFuctions import myFunction

# Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF, myFunction)
The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This worked in the past, but now the process seems to hang indefinitely without any error messages.

Q : Can someone point out what I am doing wrong here?
You expected macOS to use the same process-instantiation mechanism that Windows used in the past.
The multiprocessing module does not do the same set of things on every supported O/S: some start methods have been reported as problematic, and the default behaviour has changed over time on macOS- and Linux-based systems.
Next steps to try to move forward:
Re-read the multiprocessing documentation and set up the call signatures (including the start method) explicitly, so the code's behaviour has no hidden dependency on "new" default values (a minimal sketch follows this list).
Test whether you can avoid the cases where multiprocessing spawns a full copy of the Python interpreter as many times as you instruct it to (memory allocations can soon get devastatingly large if many replicas get instantiated beyond the localhost RAM footprint, simply due to a growing number of CPU cores).
Test whether the "worker" code is actually compute-intensive or rather dominated by the latency of remote network API calls. In IO-latency-dominated use cases, asyncio/await-based tools mask latency far better than the inefficient and rather expensive full-copy concurrency of many multiprocessing-spawned Python processes that just sit waiting for remote-API answers.
Last but not least, performance-sensitive code runs best outside any mediating ecosystem such as the interactivity-focused Jupyter notebooks.
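As an illustration only, here is a minimal sketch of pinning the start method explicitly; re-using the structure of the question's parallelize_dataframe is my assumption, and "spawn" is simply the more conservative choice on macOS:

import multiprocessing as mp
import numpy as np
import pandas as pd

def parallelize_dataframe(df, func):
    ctx = mp.get_context("spawn")              # explicit start method, no reliance on the platform default
    num_partitions = mp.cpu_count() * 2
    df_split = np.array_split(df, num_partitions)
    with ctx.Pool(mp.cpu_count()) as pool:     # Pool built from the explicit context
        return pd.concat(pool.map(func, df_split))

With "spawn", the function handed to the pool must be importable from a module (myFunction already lives in a separate .py file, so that holds), and the call should sit under an if __name__ == '__main__': guard when run as a script.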

Related

Is it possible to limit memory usage by writing to disk?

I cannot understand if what I want to do in Dask is possible...
Currently, I have a long list of heavy files.
I am using the multiprocessing library to process every entry of the list. My function opens an entry, operates on it, saves the result to disk as a binary file, and returns None. Everything works fine. I did this essentially to reduce RAM usage.
I would like to do "the same" in Dask, but I cannot figure out how to save binary data in parallel. In my mind, it should be something like:
for element in list:
    new_value = func(element)
    new_value.tofile('filename.binary')
where there can only be N elements loaded at once, where N is the number of workers, and each element is used and forgotten at the end of each cycle.
Is it possible?
Thanks a lot for any suggestion!
That does sound like a feasible task:
from dask import delayed, compute

@delayed
def myfunc(element):
    new_value = func(element)
    new_value.tofile('filename.binary')  # you might want to
                                         # change the destination for each element...

delayeds = [myfunc(e) for e in list]
results = compute(delayeds)
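Since the comment above hints at it, one way to give each element its own destination could look like the following sketch; the enumerate-based result_{i}.binary naming scheme is my assumption for illustration, not part of the original answer:

from dask import delayed, compute

@delayed
def myfunc(element, path):
    new_value = func(element)   # func is the user's own processing function
    new_value.tofile(path)      # one output file per element

# hypothetical naming scheme: one numbered binary file per input element
delayeds = [myfunc(e, 'result_{}.binary'.format(i)) for i, e in enumerate(list)]
results = compute(delayeds)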
If you want fine control over tasks, you might want to explicitly specify the number of workers by starting a LocalCluster:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
There is a lot more that can be done to customize the settings/workflow, but perhaps the above will work for your use case.

Multiprocessing code does not work when trying to initialize dataframe columns

I am trying to use the multiprocessing module to initialize each column of a dataframe on a separate CPU core in Python 3.6, but my code doesn't work. Does anybody know the issue with this code? I appreciate your help.
My laptop has Windows 10 and its CPU is Core i7 8th Gen:
import time
import pandas as pd
import numpy as np
import multiprocessing

df = pd.DataFrame(index=range(10), columns=["A", "B", "C", "D"])

def multiprocessing_func(col):
    for i in range(0, df.shape[0]):
        df.iloc[i, col] = np.random.rand()   # fill each cell with a random value
    print("column " + str(col) + " is completed")

if __name__ == '__main__':
    starttime = time.time()
    processes = []
    for i in range(0, df.shape[1]):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print('That took {} seconds'.format(time.time() - starttime))
When you start a Process, it is basically a copy of the parent process. (I'm skipping over some details here, but they shouldn't matter for the explanation).
Unlike threads, processes don't share data. (Processes can use shared memory, but this is not automatic. To the best of my knowledge, the mechanisms in multiprocessing for sharing data cannot handle a dataframe.)
So what happens is that each of the worker processes is modifying its own copy of the dataframe, not the dataframe in the parent process.
For this to work, you'd have to send the new data back to the parent process. You could do that by, e.g., returning it from the worker function, and then putting the returned data into the original dataframe.
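A minimal sketch of that return-based variant, assuming a Pool and the same 10x4 dataframe as in the question (this is an illustration, not the answerer's code):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def make_column(col):
    # compute the values for one column and hand them back to the parent
    return col, np.random.rand(10)

if __name__ == '__main__':
    df = pd.DataFrame(index=range(10), columns=["A", "B", "C", "D"])
    with Pool(processes=df.shape[1]) as pool:
        for col, values in pool.map(make_column, range(df.shape[1])):
            df.iloc[:, col] = values   # parent puts the returned data into the original dataframe
    print(df)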
It only makes sense to use multiprocessing like this if the work of generating the data takes significantly longer than launching a new worker process, sending the data back to the parent process and putting it into the dataframe. Since you are basically filling the columns with random data, I don't think that is the case here.
So I don't see why you would use multiprocessing here.
Edit: Based on your comment that it takes days to calculate each column, I would propose the following.
Use Process like you have been doing, but have each worker process save the numbers it produces in a file whose name includes the value of i. Have the workers return a status code so you can determine whether they succeeded or failed. In case of failure, also return some kind of index of how much data was successfully completed, so you don't have to recalculate it.
The file format should be simple and preferably readable, e.g. one number per line.
Wait for all processes to finish, read the files and fill the dataframe.
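A rough sketch of that file-based workflow; the column_{i}.txt naming and the use of np.savetxt for the one-number-per-line format are assumptions made for illustration:

import numpy as np
import pandas as pd
from multiprocessing import Process

def worker(i, n_rows):
    # placeholder for the long-running per-column calculation
    values = np.random.rand(n_rows)
    np.savetxt('column_{}.txt'.format(i), values)   # one number per line

if __name__ == '__main__':
    n_rows, n_cols = 10, 4
    procs = [Process(target=worker, args=(i, n_rows)) for i in range(n_cols)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # all workers finished: read the files back and fill the dataframe
    df = pd.DataFrame({i: np.loadtxt('column_{}.txt'.format(i)) for i in range(n_cols)})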

Memory use if multiprocessing queue is not used by two separate processes

I have a thread in my python program that acquires images from a webcam and puts them in a multiprocessing queue. A separate process then takes these images from the queue and does some processing. However, if I try to empty the queue from the image acquisition (producer) thread I do not free any memory, and the program eventually uses all the available memory and crashes the machine (Python 3.6.6 / Ubuntu 18.04 64bit / Linux 4.15.0-43-generic)
I have a simple working example that reproduces the problem.
import multiprocessing
import time
import numpy as np

queue_mp = multiprocessing.Queue(maxsize=500)

def producer(q):
    while True:
        # Generate object to put in queue
        dummy_in = np.ones((1000, 1000))
        # If the queue is full, get the oldest object (FIFO),
        # to make space for the latest incoming object.
        if q.full():
            __ = q.get()
        q.put(dummy_in)

def consumer(q):
    while True:
        # Get object from queue
        dummy_out = q.get()
        # Do some processing on the object, which we simulate here by time.sleep
        time.sleep(3)

producer_process = multiprocessing.Process(target=producer,
                                           args=(queue_mp,),
                                           daemon=False)
consumer_process = multiprocessing.Process(target=consumer,
                                           args=(queue_mp,),
                                           daemon=False)

# Start producer and consumer processes
producer_process.start()
consumer_process.start()
I can rewrite my code to avoid this problem, but I'd like to understand what is happening. Is there a general rule that producers and consumers of a multiprocessing queue must be running in separate processes?
If anyone understands why this happens, or what exactly is happening behind the scenes of multiprocessing queues that would explain this memory behavior I would appreciate it. The docs did not go into a lot of detail.
I figured out what was happening, so I'll post it here for the benefit of anyone that stumbles across this question.
My memory problem resulted from a numpy bug in numpy version 1.16.0. Reverting to numpy version 1.13.3 resolved the problem.
To answer the basic question: No, there is no need to worry which thread/process is doing the consuming (get) and which thread/process is doing the producing (put) for multiprocessing queues. There is nothing special about multiprocessing queues with respect to garbage collection. As kindall explains in response to a similar question:
When there are no longer any references to an object, the memory it occupies is freed immediately and can be reused by other Python objects
I hope that helps someone. In any case, the numpy bug should be resolved in the 1.16.1 release.
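If you run into the same symptom, a quick sanity check on the installed numpy version (illustrative only):

import numpy as np
print(np.__version__)   # 1.16.0 exhibited the leak here; 1.13.3 did not, and 1.16.1 should contain the fix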

Tensorflow memory leak in tf.decode_csv function

So I am running a DNN that is based upon the iris model located here: https://www.tensorflow.org/get_started/estimator and the TextLineReader advice located here: https://www.tensorflow.org/api_guides/python/reading_data
It is having a memory leak problem, and I have narrowed down the leak to these few lines of code:
import numpy as np
import tensorflow as tf

def main():
    filename_queue = tf.train.string_input_producer(file_path)
    defaults = [[0.], [0.], [0.], [0.], [0]]
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    for i in range(50000):
        columns = tf.decode_csv(value, record_defaults=defaults)

if __name__ == "__main__":
    main()
Where the .csv file referred to by file_path contains 1 line:
5.9,3.0,4.2,1.5,1
When I run the program this is my system usage over 60 seconds:
Interestingly, all of the memory gets deallocated when I kill the program, or when the OOM manager does.
Anyway, I have to use batch processing in my program because of the size of the training dataset, so I have to perform the decoding of the .csv file in batches as well.
Is there a way to circumvent this leak, or is this a bug that should be reported?
Any information or suggestions are welcome.
Somewhat obviously, the leak comes from calling tf.decode_csv inside the loop: each call adds another operation to the graph, allocating space that isn't deallocated until the program returns. The solution is to call tf.decode_csv outside of the for loop when getting a batch. As unintuitive as this sounds, I have been able to verify that it still shuffles the data with consecutive reads.
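A hedged sketch of that fix using the TF 1.x queue-based pipeline from the question (the session and queue-runner boilerplate here is my assumption about the surrounding code, which the question does not show, and file_path is assumed to be defined as in the question):

import tensorflow as tf

def main():
    filename_queue = tf.train.string_input_producer([file_path])
    defaults = [[0.], [0.], [0.], [0.], [0]]
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    # Build the decode op ONCE; it is a graph node, not a per-row computation.
    columns = tf.decode_csv(value, record_defaults=defaults)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        for i in range(50000):
            row = sess.run(columns)   # feed data through the pipeline repeatedly
        coord.request_stop()
        coord.join(threads)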
More importantly, this gives insight into the nature of what I believe are called graph operations in TensorFlow. The op is allocated nowhere near the session and it still works. I guess it is more like setting up a pipeline and then feeding data through that pipeline.
My code actually runs faster now too, without all those allocations!

using dask distributed computing via jupyter notebook

I am seeing strange behavior from dask when using it from jupyter notebook. So I am initiating a local client and giving it a list of jobs to do. My real code is a bit complex so I am putting a simple example for you here:
from dask.distributed import Client

def inc(x):
    return x + 1

if __name__ == '__main__':
    c = Client()
    futures = [c.submit(inc, i) for i in range(1, 10)]
    result = c.gather(futures)
    print(len(result))
The problem is that I notice the following:
1. Dask initiates more than 9 processes for this example.
2. After the code has run and is done (nothing in the notebook is running), the processes created by dask are not killed (and the client is not shut down). When I run top, I can see all those processes still alive.
I saw in the docs that there is a client.close() option, but interestingly enough, such functionality does not exist in 0.15.2.
The only time the dask processes are killed is when I stop the jupyter notebook. This issue is causing strange and unpredictable performance behavior. Is there any way the processes can be killed or the client shut down when no code is running in the notebook?
The default Client accepts optional parameters which are passed on to LocalCluster (see the docs), allowing you to specify, for example, the number of worker processes you want. It is also a context manager, which will close itself and end its processes when you are done.
with Client(n_workers=2) as c:
    futures = [c.submit(inc, i) for i in range(1, 10)]
    result = c.gather(futures)
    print(len(result))
When this ends, the processes will be terminated.
