Is it possible to limit memory usage by writing to disk? - python-3.x

I cannot work out whether what I want to do is possible in Dask...
Currently, I have a long list of heavy files.
I am using the multiprocessing library to process every entry of the list. My function opens an entry, operates on it, saves the result to disk as a binary file, and returns None. Everything works fine. I did this essentially to reduce RAM usage.
I would like to do "the same" in Dask, but I cannot figure out how to save binary data in parallel. In my mind, it should be something like:
for element in list:
    new_value = func(element)
    new_value.tofile('filename.binary')
with only N elements loaded at once, where N is the number of workers, and each element used and forgotten at the end of each cycle.
Is it possible?
Thanks a lot for any suggestion!

That does sound like a feasible task:
from dask import delayed, compute

@delayed
def myfunc(element):
    new_value = func(element)
    # you might want to change the destination for each element...
    new_value.tofile('filename.binary')

delayeds = [myfunc(e) for e in list]
results = compute(*delayeds)
If you want fine control over tasks, you might want to explicitly specify the number of workers by starting a LocalCluster:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
There is a lot more that can be done to customize the settings/workflow, but perhaps the above will work for your use case.
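If Dask turns out not to be a hard requirement, the same bounded-memory pattern can also be sketched with the standard library's concurrent.futures. Everything below is a stand-in for your own objects: `func`, the element list, and the `result_<i>.binary` filenames are all hypothetical. Since the saving step is dominated by disk I/O, a thread pool is used here; ProcessPoolExecutor is a near drop-in swap if `func` is CPU-bound.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def func(element):
    # stand-in for your real processing step
    return np.asarray(element, dtype=np.float64) * 2

def process_and_save(index, element):
    new_value = func(element)
    new_value.tofile(f'result_{index}.binary')  # one file per element
    return None  # nothing heavy is kept around after each task

elements = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# max_workers bounds how many elements are being processed at once
with ThreadPoolExecutor(max_workers=2) as ex:
    list(ex.map(process_and_save, range(len(elements)), elements))
```

Because each task writes its result to disk and returns None, memory usage stays roughly proportional to `max_workers` rather than to the length of the list.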

Related

Array assignment using multiprocessing

I have a uniform 2D coordinate grid stored in a numpy array. The values of this array are assigned by a function that looks roughly like the following:
def update_grid(grid):
    n, m = grid.shape
    for i in range(n):
        for j in range(m):
            # assignment
Calling this function takes 5-10 seconds for a 100x100 grid, and it needs to be called several hundred times during the execution of my main program. This function is the rate-limiting step in my program, so I want to reduce the processing time as much as possible.
I believe that the assignment expression inside can be split up in a manner which accommodates multiprocessing. The value at each gridpoint is independent of the others, so the assignments can be split something like this:
def update_grid(grid):
    n, m = grid.shape
    for i in range(n):
        for j in range(m):
            p = Process(target=#assignment)
            p.start()
So my questions are:
Does the above loop structure ensure each process will only operate on a single gridpoint? Do I need anything else to allow each process to write to the same array, even if they're writing to different places in that array?
The assignment expression requires a set of parameters. These are constant, but each process will be reading them at the same time. Is this okay?
To explicitly write the code I've structured above, I would need to define my assignment expression as another function inside of update_grid, correct?
Is this actually worthwhile?
Thanks in advance.
Edit:
I still haven't figured out how to speed up the assignment, but I was able to avoid the problem by changing my main program. It no longer needs to update the entire grid with each iteration, and instead tracks changes and only updates what has changed. This cut the execution time of my program down from an hour to less than a minute.
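The change-tracking approach described in that edit can be sketched as follows. The `assignment` function and the dirty-set bookkeeping are hypothetical stand-ins for the real program's logic:

```python
import numpy as np

def assignment(i, j):
    # hypothetical stand-in for the real per-gridpoint computation
    return i * 10 + j

def update_grid(grid, dirty):
    # recompute only the gridpoints recorded as changed, then clear the set
    for i, j in dirty:
        grid[i, j] = assignment(i, j)
    dirty.clear()

grid = np.zeros((100, 100))
dirty = {(1, 2), (3, 7)}      # populated by the main program as values change
update_grid(grid, dirty)
```

Instead of recomputing all n*m cells per iteration, only the cells in the dirty set are touched, which is where the hour-to-under-a-minute speedup comes from.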

Multiprocessing code does not work when trying to initialize dataframe columns

I am trying to use the multiprocessing module to initialize each column of a dataframe using a separate CPU core in Python 3.6, but my code doesn't work. Does anybody know the issue with this code? I appreciate your help.
My laptop has Windows 10 and its CPU is Core i7 8th Gen:
import time
import pandas as pd
import numpy as np
import multiprocessing

df = pd.DataFrame(index=range(10), columns=["A", "B", "C", "D"])

def multiprocessing_func(col):
    for i in range(0, df.shape[0]):
        df.iloc[i, col] = np.random.random()
    print("column " + str(col) + " is completed")

if __name__ == '__main__':
    starttime = time.time()
    processes = []
    for i in range(0, df.shape[1]):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print('That took {} seconds'.format(time.time() - starttime))
When you start a Process, it is basically a copy of the parent process. (I'm skipping over some details here, but they shouldn't matter for the explanation).
Unlike threads, processes don't share data. (Processes can use shared memory, but this is not automatic. To the best of my knowledge, the mechanisms in multiprocessing for sharing data cannot handle a dataframe.)
So what happens is that each of the worker processes is modifying its own copy of the dataframe, not the dataframe in the parent process.
For this to work, you'd have to send the new data back to the parent process. You could do that by e.g. return-ing it from the worker function, and then putting the returned data into the original dataframe.
It only makes sense to use multiprocessing like this if the work of generating the data takes significantly longer than launching a new worker process, sending the data back to the parent process, and putting it into the dataframe. Since you are basically filling the columns with random data, I don't think that is the case here.
So I don't see why you would use multiprocessing here.
Edit: Based on your comment that it takes days to calculate each column, I would propose the following.
Use Process like you have been doing, but have each of the worker processes save the numbers it produces in a file whose name includes the value of i. Have the workers return a status code so you can determine whether they have succeeded or failed. In case of failure, also return some kind of index of the amount of data successfully completed, so you don't have to recalculate it.
The file format should be simple and preferably readable, e.g. one number per line.
Wait for all processes to finish, read the files and fill the dataframe.
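A minimal sketch of this file-based pattern is below. The filenames and the per-column computation are made up, and the fork start method is assumed, so on Windows (or under the spawn start method) the worker function would need to live in an importable module:

```python
import multiprocessing
import numpy as np
import pandas as pd

N_ROWS, COLUMNS = 10, ["A", "B", "C", "D"]

def fill_column(col_index):
    # stand-in for the long-running per-column computation
    values = np.random.random(N_ROWS)
    with open(f"col_{col_index}.txt", "w") as f:
        for v in values:          # simple, readable: one number per line
            f.write(f"{v}\n")
    return 0                      # status code: 0 = success

ctx = multiprocessing.get_context("fork")  # assumes a POSIX platform
with ctx.Pool(len(COLUMNS)) as pool:
    statuses = pool.map(fill_column, range(len(COLUMNS)))

# back in the parent: read the files and fill the dataframe
df = pd.DataFrame(index=range(N_ROWS), columns=COLUMNS, dtype=float)
for i, name in enumerate(COLUMNS):
    with open(f"col_{i}.txt") as f:
        df[name] = [float(line) for line in f]
```

The key point is that each worker only communicates through its own file and a small status code, so the parent's dataframe is filled in exactly once, after all the heavy work has finished.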

Multiprocessing hangs when applying function to pandas dataframe Python 3.7.1

I am trying to parallelize a function on my pandas dataframe and I'm running into an issue where it seems that the multiprocessing library is hanging. I am doing this all within a Jupyter notebook with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?
Surprisingly, this piece of code has worked previously on my Windows 7 machine with the same version of python. I have just copied the file over to my Mac laptop.
I also use tqdm so I can monitor the progress, the behavior is the same with or without it.
# This function handles the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd
import tqdm

def parallelize_dataframe(df, func):
    num_partitions = cpu_count() * 2  # number of partitions to split dataframe
    num_cores = cpu_count()           # number of cores on your machine
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split), total=num_partitions)))
#My function that I am applying to the dataframe is in another file
#myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe
from myFuctions import myFunction
#Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF,myFunction)
The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This worked in the past, but now the process seems to hang indefinitely without any error messages.
Q : Can someone point out what I am doing wrong here?
You just expected MacOS to use the same mechanism for process instantiation as Windows did in the past.
The multiprocessing module does not do the same set of things on each of the supported operating systems; some methods have been reported as unsafe, and the default behaviour has changed on MacOS- and Linux-based systems.
Next steps to try to move forward:
re-read how to do the explicit setup of the call signatures in the multiprocessing documentation (avoid hidden dependency of the code's behaviour on "new" default values)
test whether you can avoid the cases where multiprocessing spawns a full copy of the python-interpreter process as many times as you instruct (memory allocations could soon get devastatingly large if many replicas get instantiated beyond the localhost RAM footprint, just due to a growing number of CPU cores)
test whether the "worker" code is not compute-intensive but rather dominated by the latency of network-remote API calls. In such a case, asyncio/await tools will help more with latency-masking than multiprocessing will: spawning rather expensive full copies of many python processes that just sit waiting for remote-API answers is inefficient for IO-latency-dominated use cases.
last but not least, performance-sensitive code best runs outside any mediating ecosystem, such as the interactivity-focused Jupyter notebooks.
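If myFunction really is dominated by the per-ID API latency, the asyncio route can be sketched like this. The fetch coroutine is a hypothetical stand-in for the real HTTP call, simulated here with asyncio.sleep:

```python
import asyncio

async def fetch_record(record_id):
    # hypothetical stand-in for an HTTP call to the API
    await asyncio.sleep(0.01)          # simulated network latency
    return {"id": record_id, "payload": f"data-{record_id}"}

async def fetch_all(ids):
    # all requests wait on the network concurrently in one process
    return await asyncio.gather(*(fetch_record(i) for i in ids))

records = asyncio.run(fetch_all(range(20)))
```

Because the waiting overlaps, 20 simulated 10 ms calls complete in roughly 10 ms total, without paying the cost of spawning 20 interpreter processes.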

Using multiple map_async (Multiprocessing) in Python3

I have sample code that uses map_async in multiprocessing with Python 3. What I'm trying to figure out is how I can run map_async(a, c) and map_async(b, d) concurrently. But the second map_async(b, d) call only seems to run when the first one is about to finish. Is there a way I can run the two map_async calls at the same time? I tried to search online but didn't get the answer that I wanted. Below is the sample code. If you have other suggestions, I'm very happy to listen to those as well. Thank you all for the help!
from multiprocessing import Pool
import time
import os

def a(i):
    print('First:', i)
    return

def b(i):
    print('Second:', i)
    return

if __name__ == '__main__':
    c = range(100)
    d = range(100)
    pool = Pool(os.cpu_count())
    pool.map_async(a, c)
    pool.map_async(b, d)
    pool.close()
    pool.join()
map_async simply splits the iterable into a set of chunks and sends those chunks via an os.pipe to the workers. Therefore, two subsequent calls to map_async appear to the workers as a single list composed of the two sets of chunks joined together.
This is the correct behaviour, as the workers really don't care which map_async call a chunk belongs to. Running two map_async calls in parallel would not bring any improvement in terms of speed or throughput.
If for any reason you really need the two calls to be executed in parallel, the only way is to create two different Pool objects. I would nevertheless recommend against such an approach, as it would make things much more unpredictable.
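For completeness, the two-Pool variant would look roughly like this. It is a sketch only: the fork start method is assumed (so the worker functions can live anywhere on a POSIX platform), and the small return values replace the original print calls so the results can be inspected:

```python
from multiprocessing import get_context

def a(i):
    return ('First', i)

def b(i):
    return ('Second', i)

ctx = get_context("fork")          # assumes a POSIX platform
pool1 = ctx.Pool(2)
pool2 = ctx.Pool(2)

# each map_async call is now serviced by an independent set of workers
r1 = pool1.map_async(a, range(5))
r2 = pool2.map_async(b, range(5))

first = r1.get()
second = r2.get()

for p in (pool1, pool2):
    p.close()
    p.join()
```

With two pools the chunks of a and b genuinely run side by side, but as noted, the total throughput on a fixed number of cores is no better than with a single pool.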

Threading in Python 3

I am writing Python 3 code in which I have 2 functions. The first function insertBlock() inserts data in MongoDB collection 1; the second function insertTransactionData() takes data from collection 1 and inserts it into collection 2. The amount of data is very large, so I use threading to increase performance. But when I use threading it takes more time to insert data than without threading. I am confused about how exactly threading works in my code and how to increase performance. Here is the main function:
if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock())
    t1.start()
    t2 = threading.Thread(target=insertTransactionData())
    t2.start()
From the python documentation for threading:
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
So the correct usage is
threading.Thread(target=insertBlock)
(without the () after insertBlock), because otherwise insertBlock is called immediately, executed normally (blocking the main thread), and target is set to its return value None. This causes t1.start() not to do anything, and you don't get any performance improvement.
Warning:
Be aware that multithreading gives you no guarantee about the order of execution across threads. You cannot rely, inside insertTransactionData, on the data that insertBlock has inserted into the database, because at the time insertTransactionData uses this data you cannot be sure it has already been inserted. So maybe multithreading does not work at all for this code, or you need to restructure your code and only parallelize those parts that do not depend on each other.
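A corrected sketch of the main block follows: the callables are passed un-called, and an explicit join ensures the dependent function only runs once the first one has finished. The two insert functions here are stand-ins that record their order instead of touching MongoDB:

```python
import threading

inserted = []

def insertBlock():
    # stand-in for the real insert into MongoDB collection 1
    inserted.append("block")

def insertTransactionData():
    # stand-in for the real insert into collection 2 (depends on collection 1)
    inserted.append("transaction")

t1 = threading.Thread(target=insertBlock)   # note: no () after the name
t1.start()
t1.join()                                   # wait: t2 depends on t1's data
t2 = threading.Thread(target=insertTransactionData)
t2.start()
t2.join()
```

Of course, joining t1 before starting t2 serializes the two functions again, which is exactly why restructuring (as described above) is needed before threading can actually help here.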
I solved this problem by merging these two functionalities into one new function, insertBlockAndTransaction(startrange, endrange). As the two functionalities depend on each other, I insert the transaction information immediately after the point where the block information is inserted (the block number was common and needed for both). Then I did multithreading by creating 10 threads for the single function:
for i in range(10):
    print('thread:', i)
    t1 = threading.Thread(target=insertBlockAndTransaction, args=(5000000 + i*10000, 5000000 + (i+1)*10000))
    t1.start()
This helped me keep the execution time under control for more than 100,000 (1 lakh) records.
