I need to run two consecutive pool.map calls in Python, but the second one depends on the results of the first. So before running the second pool.map, I need to make sure function1 has been executed for all args. Can anyone show me how to do that?
# The first multiprocessing unit
pool = Pool(processes=num_p)
new_args=dict(pool.map(function1, args))
# The second multiprocessing unit
pool.map(function2, new_args)
Thanks
Surely pool.map will block until the results are done. How else could it return them?
You can also confirm this fact from the documentation.
It blocks until the result is ready.
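A minimal sketch of the pattern, in case it helps. The bodies of function1 and function2 are made up for illustration. It uses multiprocessing.pool.ThreadPool, which has the same map() method as Pool, only so that it runs as a flat script; the blocking behaviour of map() is the same with multiprocessing.Pool.

```python
from multiprocessing.pool import ThreadPool  # same map() API as multiprocessing.Pool


def function1(x):
    # Hypothetical first stage: return a (key, value) pair so dict() can consume it.
    return (x, x * x)


def function2(item):
    # Hypothetical second stage: works on one (key, value) pair from stage one.
    key, value = item
    return key + value


def run_both_stages(args, num_p=4):
    with ThreadPool(processes=num_p) as pool:
        # map() returns only after function1 has finished for every element.
        new_args = dict(pool.map(function1, args))
        # So new_args is complete by the time the second map() starts.
        return pool.map(function2, new_args.items())


print(run_both_stages([1, 2, 3]))
```

No extra synchronization is needed between the two calls, because the first map() has already collected every result before it returns.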
I am measuring the metrics of an encryption algorithm that I designed. I have declared 2 functions and a brief sample is as follows:
import sys, random, timeit, psutil, os, time
from multiprocessing import Process
from subprocess import check_output
pid=0
def cpuUsage():
    global running
    while pid == 0:
        time.sleep(1)
    running=true
    p = psutil.Process(pid)
    while running:
        print(f'PID: {pid}\t|\tCPU Usage: {p.memory_info().rss/(1024*1024)} MB')
        time.sleep(1)
def Encryption()
    global pid, running
    pid = os.getpid()
    myList=[]
    for i in range(1000):
        myList.append(random.randint(-sys.maxsize,sys.maxsize)+random.random())
    print('Now running timeit function for speed metrics.')
    p1 = Process(target=metric_collector())
    p1.start()
    p1.join()
    number=1000
    unit='msec'
    setup = '''
import homomorphic,random,sys,time,os,timeit
myList={myList}
'''
    enc_code='''
for x in range(len(myList)):
    myList[x] = encryptMethod(a, b, myList[x], d)
'''
    dec_code='''
\nfor x in range(len(myList)):
    myList[x] = decryptMethod(myList[x])
'''
    time=timeit.timeit(setup=setup,
                       stmt=(enc_code+dec_code),
                       number=number)
    running=False
    print(f'''Average Time:\t\t\t {time/number*.0001} seconds
Total time for {number} Iters:\t\t\t {time} {unit}s
Total Encrypted/Decrypted Values:\t {number*len(myList)}''')
    sys.exit()
if __name__ == '__main__':
    print('Beginning Metric Evaluation\n...\n')
    p2 = Process(target=Encryption())
    p2.start()
    p2.join()
I am sure there's an implementation error in my code, I'm just having trouble grabbing the PID for the encryption method and I am trying to make the overhead from other calls as minimal as possible so I can get an accurate reading of just the functionality of the methods being called by timeit. If you know a simpler implementation, please let me know. Trying to figure out how to measure all of the metrics has been killing me softly.
I've tried acquiring the pid a few different ways, but I only want to measure performance when timeit is run. Good chance I'll have to break this out separately and run it that way (instead of multiprocessing) to evaluate the function properly, I'm guessing.
There are at least three major problems with your code. The net result is that you are not actually doing any multiprocessing.
The first problem is here, and in a couple of other similar places:
p2 = Process(target=Encryption())
What this code passes to Process is not the function Encryption but the returned value from Encryption(). It is exactly the same as if you had written:
x = Encryption()
p2 = Process(target=x)
What you want is this:
p2 = Process(target=Encryption)
This code tells Python to create a new Process and execute the function Encryption() in that Process.
The second problem has to do with the way Python handles memory for Processes. Each Process lives in its own memory space. Each Process has its own local copy of global variables, so you cannot set a global variable in one Process and have another Process be aware of this change. There are mechanisms to handle this important situation, documented in the multiprocessing module. See the section titled "Sharing state between processes." The bottom line here is that you cannot simply set a global variable inside a Process and expect other Processes to see the change, as you are trying to do with pid. You have to use one of the approaches described in the documentation.
The third problem is this code pattern, which occurs for both p1 and p2.
p2 = Process(target=Encryption)
p2.start()
p2.join()
This tells Python to create a Process and to start it. Then you immediately wait for it to finish, which means that your current Process must stop at that point until the new Process is finished. You never allow two Processes to run at once, so there is no performance benefit. The only reason to use multiprocessing is to run two things at the same time, which you never do. You might as well not bother with multiprocessing at all since it is only making your life more difficult.
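The second and third points can be sketched together. This is only an illustration: the function names are hypothetical, and it assumes a POSIX system where the "fork" start method is available, so that it runs as a flat script. With the default start method you would keep the usual if __name__ == '__main__': guard instead. One child writes to a plain global, the other writes to a multiprocessing Value, and both are started before either is joined.

```python
import multiprocessing as mp
import os

plain_global = 0  # every process gets its own copy of this


def set_plain_global():
    # Child 1: changes only the child's private copy of the global.
    global plain_global
    plain_global = os.getpid()


def set_shared_value(shared_pid):
    # Child 2: writes into shared memory that the parent can read.
    shared_pid.value = os.getpid()


def demo():
    # Assumption: the POSIX "fork" start method is available.
    ctx = mp.get_context("fork")
    shared_pid = ctx.Value("i", 0)
    p1 = ctx.Process(target=set_plain_global)  # function object, no ()
    p2 = ctx.Process(target=set_shared_value, args=(shared_pid,))
    p1.start()
    p2.start()  # both children are started before either join
    p1.join()
    p2.join()
    return plain_global, shared_pid.value, p2.pid


if __name__ == "__main__":
    print(demo())
```

In the parent, plain_global is still 0 afterwards, while the shared Value holds the second child's PID.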
Finally, I am not sure why you decided to use multiprocessing in the first place. The functions that measure memory usage and execution time are almost certainly very fast, and I would expect them to be much faster than any method of synchronizing one Process with another. If you're worried about errors due to the time used by the diagnostic functions themselves, I doubt that you can make things better by multiprocessing. Why not just start with a simple program and see what results you get?
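As a rough idea of what such a simple program could look like, here is a single-process sketch. encrypt_decrypt is a made-up stand-in for your real encryptMethod/decryptMethod round trip, and tracemalloc is swapped in for psutil: it only tracks Python-level allocations, not the RSS figure your code prints, so treat the memory number as a different metric.

```python
import timeit
import tracemalloc


def encrypt_decrypt(values):
    # Hypothetical stand-in for the real encrypt/decrypt round trip.
    encrypted = [v * 3 + 7 for v in values]
    return [(e - 7) / 3 for e in encrypted]


def measure(values, number=1000):
    # Time: timeit calls the function `number` times and returns total seconds.
    total_seconds = timeit.timeit(lambda: encrypt_decrypt(values), number=number)
    # Memory: tracemalloc reports the peak of Python-level allocations in bytes.
    tracemalloc.start()
    encrypt_decrypt(values)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total_seconds / number, peak_bytes


avg_seconds, peak_bytes = measure(list(range(1000)), number=100)
print(f"average: {avg_seconds:.6f} s, peak allocation: {peak_bytes} bytes")
```

The timing and the memory measurement are taken in separate runs, so the memory tracing overhead does not leak into the timing.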
I have sample code that uses map_async from multiprocessing in Python 3. What I'm trying to figure out is how I can run map_async(a, c) and map_async(b, d) concurrently. It seems that the second map_async(b, d) statement only runs when the first one is about to finish. Is there a way I can run the two map_async calls at the same time? I tried to search online but didn't find the answer I wanted. Below is the sample code. If you have other suggestions, I'm very happy to hear those as well. Thank you all for the help!
from multiprocessing import Pool
import time
import os
def a(i):
    print('First:', i)
    return
def b(i):
    print('Second:', i)
    return
if __name__ == '__main__':
    c = range(100)
    d = range(100)
    pool = Pool(os.cpu_count())
    pool.map_async(a, c)
    pool.map_async(b, d)
    pool.close()
    pool.join()
map_async simply splits the iterable into a set of chunks and sends those chunks via an os.pipe to the workers. Therefore, two subsequent calls to map_async will appear to the workers as a single list formed by concatenating the two sets of chunks.
This is the correct behaviour, as the workers really don't care which map_async call a chunk belongs to. Running two map_async calls in parallel would not bring any improvement in terms of speed or throughput.
If for any reason you really need the two calls to be executed in parallel, the only way is to create two different Pool objects. I would nevertheless recommend against such an approach, as it would make things much more unpredictable.
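A small sketch of the single-pool case, assuming you want the return values back. It uses multiprocessing.pool.ThreadPool, which has the same map_async() API as Pool, only so that it runs as a flat script; a and b here return tuples instead of printing.

```python
from multiprocessing.pool import ThreadPool  # same map_async() API as Pool


def a(i):
    return ('First', i)


def b(i):
    return ('Second', i)


def run_both(c, d, workers=4):
    with ThreadPool(workers) as pool:
        # Neither call blocks: each returns an AsyncResult immediately,
        # and the chunks from both calls go to the same set of workers.
        first = pool.map_async(a, c)
        second = pool.map_async(b, d)
        # get() waits for that call's results and returns them in input order.
        return first.get(), second.get()


first_results, second_results = run_both(range(3), range(3))
print(first_results)
print(second_results)
```

Both calls are submitted up front, and the results are collected afterwards with get().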
I am writing Python 3 code with two functions. The first function, insertBlock(), inserts data into MongoDB collection 1; the second function, insertTransactionData(), takes data from collection 1 and inserts it into collection 2. There is a very large amount of data, so I use threading to increase performance. But with threading it takes more time to insert the data than without. I am confused about how exactly threading works in my code and how to increase performance. Here is the main function:
if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock())
    t1.start()
    t2 = threading.Thread(target=insertTransactionData())
    t2.start()
From the python documentation for threading:
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
So the correct usage is
threading.Thread(target=insertBlock)
(without the () after insertBlock), because otherwise insertBlock is called and executed normally (blocking the main thread), and target is set to its return value, None. This causes t1.start() to do nothing, and you don't get any performance improvement.
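The difference is easy to see in a small sketch. work is a hypothetical function that records the name of the thread it actually ran in.

```python
import threading

calls = []


def work():
    # Record which thread actually executed this function.
    calls.append(threading.current_thread().name)


# Wrong: work() runs right here in the calling thread and returns None,
# so the Thread gets target=None and start() has nothing to run.
t_wrong = threading.Thread(target=work(), name='worker-wrong')
t_wrong.start()
t_wrong.join()

# Right: pass the function object; start() runs it in the new thread.
t_right = threading.Thread(target=work, name='worker-right')
t_right.start()
t_right.join()

print(calls)
```

The first entry is the name of the calling thread, never 'worker-wrong'; only the second call runs in the thread that was created for it.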
Warning:
Be aware that multithreading gives you no guarantee about the order of execution in different threads. Inside insertTransactionData you cannot rely on the data that insertBlock inserts into the database, because at the time insertTransactionData uses this data, you cannot be sure that it has already been inserted. So maybe multithreading does not work at all for this code, or you need to restructure your code and only parallelize those parts that do not depend on each other.
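One possible restructuring, sketched under assumptions: the two functions below are hypothetical stand-ins with no MongoDB calls, and a queue.Queue is used to hand each block over only after it has been "inserted", so the second thread never reads data that is not there yet.

```python
import queue
import threading


def insert_block(block_numbers, handoff):
    # Hypothetical producer: "insert" each block, then announce it is ready.
    for n in block_numbers:
        handoff.put(n)
    handoff.put(None)  # sentinel: no more blocks


def insert_transaction_data(handoff, processed):
    # Hypothetical consumer: only touches blocks the producer has announced.
    while True:
        n = handoff.get()  # blocks until a block number is available
        if n is None:
            break
        processed.append(n)


def run(block_numbers):
    handoff = queue.Queue()
    processed = []
    t1 = threading.Thread(target=insert_block, args=(block_numbers, handoff))
    t2 = threading.Thread(target=insert_transaction_data, args=(handoff, processed))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return processed


print(run([1, 2, 3]))
```

Both threads still run at the same time, but the dependency between them is made explicit through the queue instead of being left to timing.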
I solved this problem by merging these two functionalities into one new function, insertBlockAndTransaction(startrange,endrange). Since the two functionalities depend on each other, I insert the transaction information immediately after the block information is inserted (the block number is common to both and needed by both). I then did the multithreading by creating 10 threads for this single function:
for i in range(10):
    print('thread:',i)
    t1 = threading.Thread(target=insertBlockAndTransaction,args=(5000000+i*10000,5000000+(i+1)*10000))
    t1.start()
This helped me keep the execution time down for more than 1 lakh (100,000) records.
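A variant of the same loop that also waits for all threads to finish, as a sketch. The body of insertBlockAndTransaction here is a hypothetical stand-in that only records the range it was given; the real one would do the MongoDB inserts.

```python
import threading

results = []
results_lock = threading.Lock()


def insertBlockAndTransaction(startrange, endrange):
    # Hypothetical stand-in for the real inserts: just record the range.
    with results_lock:
        results.append((startrange, endrange))


def run_all(base=5000000, step=10000, num_threads=10):
    threads = []
    for i in range(num_threads):
        t = threading.Thread(target=insertBlockAndTransaction,
                             args=(base + i * step, base + (i + 1) * step))
        t.start()
        threads.append(t)  # keep every thread, not just the last one
    for t in threads:
        t.join()  # wait until every range has been processed


run_all()
print(len(results))
```

Keeping the threads in a list makes it possible to join them, so the program knows when all ten ranges are done.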
So I've got this multithreaded, recursive application. It's coded in Pharo Smalltalk, but the logical solution to the issue is likely to be the same across most languages.
I have 4 of the same process running relatively simultaneously. It's the last iteration of a recursive call. I'd like to print the result calculated by my recursive function (it's a dictionary being modified in the argument of the recursive function/message). The issue I'm facing right now is that the print is called in the base case terminator of the recursion, so the result is printed 4 times.
I tried setting a global variable which allows for me to print the result of the process which finishes first, but of course that means that the result is wrong. It needs to print the result of the last process to execute of all the processes in that last iteration of the recursion.
How could I go about this without going too deep into the Process class? Thanks for any help.
Do you know the number of threads? (Supposedly, 4)
Then you can use an atomic long (in Java, for example):
AtomicLong myAtomicLong = new AtomicLong(0);
...
...
// do my work
if (myAtomicLong.incrementAndGet() == totalThreadCount)
{
    // my print
}
The increment-and-get is atomic, so only the last thread to reach this point sees the condition become true, after all the other threads have finished their jobs. Note that it is important to place the increment and the check after the job is done.
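The same idea sketched in Python, which has no AtomicLong, so a threading.Lock guards the counter instead. The class and function names are made up for illustration, and appending to a list stands in for printing the result.

```python
import threading


class LastOneOut:
    """Counts finished workers; reports True only to the last one."""

    def __init__(self, total):
        self._total = total
        self._done = 0
        self._lock = threading.Lock()

    def finish(self):
        # The increment and the comparison happen under one lock,
        # so exactly one caller sees the count reach the total.
        with self._lock:
            self._done += 1
            return self._done == self._total


def run(total_threads=4):
    gate = LastOneOut(total_threads)
    printed = []

    def worker(idx):
        # ... the real (recursive) work would go here ...
        if gate.finish():        # checked only after the work is done
            printed.append(idx)  # stands in for "print the result"

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(total_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return printed


print(len(run(4)))
```

Whichever worker finishes last is the only one that gets True back from finish(), so the result is reported exactly once.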
I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value) # in real version, would iterate over
                                 # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
import multiprocessing
def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4) # num workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        try:
            yield results.next(9999999) # large time-out
        except StopIteration:
            return # no more results; end the generator
Note that you could also iterate over results directly (i.e. for item in results: yield item) without a while True loop; however, calling results.next() with a time-out value works around this multiprocessing keyboard interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that results.next() raises StopIteration when it has no more items to return. In older Python versions it was legal to let that exception propagate out of a generator function, but since Python 3.7 (PEP 479) a StopIteration escaping a generator body is turned into a RuntimeError, so the function catches it and returns instead, which ends the generator normally.
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
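Putting the thread-based variant together with the postprocess step from the question, as a sketch: slow_io_call and postprocess are hypothetical stand-ins (a short sleep simulates the IO wait), and the results are iterated directly rather than through results.next().

```python
import multiprocessing.dummy as multiprocessing  # thread-backed Pool
import time


def slow_io_call(task):
    # Hypothetical IO-bound call: sleep to simulate waiting on the network.
    time.sleep(task * 0.01)
    return task


def postprocess(value):
    # Hypothetical post-processing step.
    return value * 10


def processed_values(list_of_io_tasks, workers=4):
    with multiprocessing.Pool(workers) as pool:
        # Results are yielded as each call finishes, not in input order.
        for value in pool.imap_unordered(slow_io_call, list_of_io_tasks):
            yield postprocess(value)


print(sorted(processed_values([3, 1, 2])))
```

Since the order of the yielded values depends on which call finishes first, the example sorts them before printing.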