I have a problem with Python multiprocessing queues.
I'm doing some heavy computation on some data. I created a few processes to lower the calculation time, and the data is split evenly before being sent to the processes. This decreases the calculation time nicely, but when I want to return data from a process through a multiprocessing.Queue it takes ages, and the whole thing ends up slower than calculating in the main thread.
processes = []
proc = 8
for i in range(proc):
    processes.append(multiprocessing.Process(target=self.calculateTriangles, args=(inData[i], outData, timer)))
for p in processes:
    p.start()

results = []
for i in range(proc):
    results.append(outData.get())
print("killing threads")
print(datetime.datetime.now() - timer)

for p in processes:
    p.join()
print("Finish Threads")
print(datetime.datetime.now() - timer)
All of the worker processes print their finish time when they are done. Here is example output of this code:
0:00:00.017873 CalcDone
0:00:01.692940 CalcDone
0:00:01.777674 CalcDone
0:00:01.780019 CalcDone
0:00:01.796739 CalcDone
0:00:01.831723 CalcDone
0:00:01.842356 CalcDone
0:00:01.868633 CalcDone
0:00:05.497160 killing threads
60968 calculated triangles
As you can see, everything is quite simple until this code:
for i in range(proc):
    results.append(outData.get())
print("killing threads")
print(datetime.datetime.now() - timer)
Here are some observations I have made on my computer and a slower one:
https://docs.google.com/spreadsheets/d/1_8LovX0eSgvNW63-xh8L9-uylAVlzY4VSPUQ1yP2F9A/edit?usp=sharing. On the slower one there isn't any improvement, as you can see.
Why does it take so much time to get items from the queue after a process has finished? Is there a way to speed this up?
So I have solved it myself. The calculations are fast, but copying objects from one process to another takes ages. I just made a method that clears all the unnecessary fields in the objects; also, using pipes is faster than multiprocessing queues. It brought the time on my slower computer down from 29 seconds to 15 seconds.
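For illustration, a minimal sketch of that approach, returning one bulk result per worker over a Pipe. The names calculate_triangles, strip_fields and in_data_chunks are made-up placeholders, not the actual project code:

import multiprocessing

def strip_fields(triangle):
    # Illustrative: drop attributes the parent never reads, so far less
    # data has to be pickled and copied between processes.
    triangle.neighbours = None
    triangle.debug_info = None
    return triangle

def worker(chunk, conn):
    result = [strip_fields(t) for t in calculate_triangles(chunk)]
    conn.send(result)   # one bulk send per worker instead of many queue puts
    conn.close()

if __name__ == "__main__":
    pipes, processes = [], []
    for chunk in in_data_chunks:                  # pre-split input data
        recv_conn, send_conn = multiprocessing.Pipe(duplex=False)
        p = multiprocessing.Process(target=worker, args=(chunk, send_conn))
        p.start()
        pipes.append(recv_conn)
        processes.append(p)
    results = [conn.recv() for conn in pipes]     # receive before joining
    for p in processes:
        p.join()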
This time is mainly spent on putting the objects into the Queue one at a time and spiking up the semaphore count. If you are able to bulk-insert all the data into the Queue at once, you can cut the time down to about a tenth of what it was.
I've dynamically assigned a new method to Queue, based on the existing one. Go to the multiprocessing module for your Python version:
/usr/lib/pythonX.X/multiprocessing/queues.py
Copy the "put" method of the Queue class into your project, e.g. for Python 3.7:
def put(self, obj, block=True, timeout=None):
    assert not self._closed, "Queue {0!r} has been closed".format(self)
    if not self._sem.acquire(block, timeout):
        raise Full
    with self._notempty:
        if self._thread is None:
            self._start_thread()
        self._buffer.append(obj)
        self._notempty.notify()
modify it:
def put_bla(self, obj, block=True, timeout=None):
    assert not self._closed, "Queue {0!r} has been closed".format(self)
    for el in obj:
        if not self._sem.acquire(block, timeout):  # spike the semaphore count
            raise Full
    with self._notempty:
        if self._thread is None:
            self._start_thread()
        self._buffer += obj  # obj is a collections.deque; extend the buffer in one step
        self._notempty.notify()
The last step is to add the new method to the class. multiprocessing.Queue is a method of the default context which returns a Queue object, so it is easier to inject the method directly into the class of the created object:
from collections import deque
from multiprocessing import Queue

queue = Queue()
queue.__class__.put_bulk = put_bla  # injecting new method
items = (500, 400, 450, 350) * count  # (500, 400, 450, 350, 500, 400...)
queue.put_bulk(deque(items))
Unfortunately multiprocessing.Pool was always faster by about 10%, so just stick with that if you don't need everlasting workers to process your tasks. It is based on multiprocessing.SimpleQueue, which is based on multiprocessing.Pipe, and I have no idea why it is faster, because my own SimpleQueue solution wasn't, and it is not bulk-injectable. :) Break that and you'll have the fastest worker ever. :)
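For comparison, the Pool variant that last paragraph recommends might look roughly like this (calculate_triangles and the chunked input are stand-ins for the real work):

import multiprocessing

def calculate_triangles(chunk):
    # stand-in for the real CPU-bound work on one chunk of input
    return [x * x for x in chunk]

if __name__ == "__main__":
    chunks = [range(10_000)] * 8      # stand-in for the pre-split data
    with multiprocessing.Pool(processes=8) as pool:
        # Pool handles result passing internally (SimpleQueue on top of a Pipe),
        # so no per-item Queue.put is needed in user code.
        results = pool.map(calculate_triangles, chunks)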
I have an event that I am listening to every minute; it returns a list that could be empty, have one element, or more. With the elements in that list, I'd like to run a function that monitors an event on each element every minute for 10 minutes.
For that, I wrote this script:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio

import Client

client = Client()

def handle_event(event):
    for i in range(10):
        client.get_info(event)
        sleep(60)

async def main():
    while True:
        entires = client.get_new_entry()
        if len(entires) > 0:
            with ThreadPoolExecutor(max_workers=len(entires)) as executor:
                executor.map(handle_event, entires)
        await asyncio.sleep(60)

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
However, instead of continuing to monitor new entries, it blocks while the previous entries are still being monitored.
Any idea how I could do that, please?
First let me explain why your program doesn't work the way you want it to: It's because you use the ThreadPoolExecutor as a context manager, which will not close until all the threads started by the call to map are finished. So main() waits there, and the next iteration of the loop can't happen until all the work is finished.
There are ways around this. Since you are using asyncio already, one approach is to move the creation of the Executor into a separate task. Each iteration of the main loop starts one copy of this task, which runs as long as it takes to finish. It's an async def function, so many copies of this task can run concurrently.
I changed a few things in your code. Instead of Client I just used some simple print statements. I pass a list of integers, of random length, to handle_event. I increment a counter each time through the while True: loop, and add 10 times the counter to every integer in the list. This makes it easy to see how old calls continue for a time, mixing with new calls. I also shortened your time delays. All of these changes were for convenience and are not important.
The important change is to move ThreadPoolExecutor creation into a task. To make it cooperate with other tasks, it must contain an await expression, and for that reason I use executor.submit rather than executor.map. submit returns a concurrent.futures.Future, which provides a convenient way to await the completion of all the calls. executor.map, on the other hand, returns an iterator; I couldn't think of any good way to convert it to an awaitable object.
To convert a concurrent.futures.Future to an asyncio.Future, an awaitable, there is a function asyncio.wrap_future. When all the futures are complete, I exit from the ThreadPoolExecutor context manager. That will be very fast since all of the Executor's work is finished, so it does not block other tasks.
import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio

def handle_event(event):
    for i in range(10):
        print("Still here", event)
        sleep(2)

async def process_entires(counter, entires):
    print("Counter", counter, "Entires", entires)
    x = [counter * 10 + a for a in entires]
    with ThreadPoolExecutor(max_workers=len(entires)) as executor:
        futs = []
        for z in x:
            futs.append(executor.submit(handle_event, z))
        await asyncio.gather(*(asyncio.wrap_future(f) for f in futs))

async def main():
    counter = 0
    while True:
        entires = [0, 1, 2, 3, 4][:random.randrange(5)]
        if len(entires) > 0:
            counter += 1
            asyncio.create_task(process_entires(counter, entires))
        await asyncio.sleep(3)

if __name__ == "__main__":
    asyncio.run(main())
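As a side note (not part of the original answer): on Python 3.9+ the same hand-off to a thread can be written with asyncio.to_thread, which avoids wrapping concurrent.futures futures by hand. A sketch, reusing handle_event from the code above:

import asyncio

async def process_entires(counter, entires):
    # asyncio.to_thread runs handle_event in the default thread pool and
    # returns an awaitable, so no explicit executor or wrap_future is needed.
    await asyncio.gather(
        *(asyncio.to_thread(handle_event, counter * 10 + a) for a in entires)
    )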
I have a gzipped file (10 GB compressed, 100 GB uncompressed) which contains some reports separated by demarcations, and I have to parse it.
Parsing and processing the data takes a long time, so this is a CPU-bound problem (not an IO-bound one). I am therefore planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream the uncompressed data into the parent process:
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() so that child-process-1 reads/parses one report and then releases the lock, then child-process-2 reads/parses the next report, then control switches back to child-process-1 for the report after that. However, when I share process.stdout as an argument with the child processes, I get a pickling error.
I have tried creating a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is way too slow (in fact it is way slower than doing it in a single thread, i.e. serially).
Any thoughts/examples about sending data to child processes efficiently would help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
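A rough sketch of that idea, assuming the reports are separated by a known demarcation line and that parse_report is your CPU-bound parsing code (both are placeholders here):

import multiprocessing as mp
import subprocess

DELIM = "xxxxx\n"   # placeholder for whatever demarcation the real file uses

def worker(worker_id, nworkers):
    # Each worker runs its own gunzip and parses only every nworkers-th report.
    proc = subprocess.Popen("gunzip --keep --stdout big-file.gz",
                            shell=True, text=True, stdout=subprocess.PIPE)
    total = 0
    report_index = 0
    report = []
    for line in proc.stdout:
        if line == DELIM:
            if report_index % nworkers == worker_id:
                total += parse_report(report)   # parse_report: your parsing code
            report = []
            report_index += 1
        else:
            report.append(line)
    proc.stdout.close()
    return total

if __name__ == "__main__":
    NWORKERS = 4
    with mp.Pool(NWORKERS) as pool:
        results = pool.starmap(worker, [(i, NWORKERS) for i in range(NWORKERS)])
    print(sum(results))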
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third of the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overheads of passing chunks of stdout to worker processes are all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)
I am still quite new to asyncio and struggling a bit with how to deal with loops within loops:
import asyncio
import concurrent.futures
import logging
import sys
import time

sub_dict = {
    1: ['one', 'commodore', 'apple', 'linux', 'windows'],
    2: ['two', 'commodore', 'apple', 'linux', 'windows'],
    3: ['three', 'commodore', 'apple', 'linux', 'windows'],
    4: ['four', 'commodore', 'apple', 'linux', 'windows'],
    5: ['five', 'commodore', 'apple', 'linux', 'windows'],
    6: ['six', 'commodore', 'apple', 'linux', 'windows'],
    7: ['seven', 'commodore', 'apple', 'linux', 'windows'],
    8: ['eight', 'commodore', 'apple', 'linux', 'windows']
}

def blocks(key, value):
    for v in value:
        log = logging.getLogger('blocks({} {})'.format(key, v))
        log.info('running')
        log.info('done')
        time.sleep(5)
    return key, v

async def run_blocking_tasks(executor, sub_dict2):
    log = logging.getLogger('run_blocking_tasks')
    log.info('starting')
    log.info('creating executor tasks')
    loop = asyncio.get_event_loop()
    blocking_tasks = [
        loop.run_in_executor(executor, blocks, key, value)
        for key, value in sub_dict2.items()
    ]
    log.info('waiting for executor tasks')
    completed, pending = await asyncio.wait(blocking_tasks)
    results = [t.result() for t in completed]
    log.info('results: {!r}'.format(results))
    log.info('exiting')

def new_func():
    logging.basicConfig(
        level=logging.INFO,
        format='%(threadName)10s %(name)18s: %(message)s',
        stream=sys.stderr,
    )
    executor = concurrent.futures.ThreadPoolExecutor(
        max_workers=8,
    )
    event_loop = asyncio.get_event_loop()
    event_loop.run_until_complete(
        run_blocking_tasks(executor, sub_dict)
    )
    event_loop.close()

new_func()
Here you can see that all the value items for each element are assigned to the same thread. For example, all values of element '1' are on thread zero.
I know enough to understand that this is because my for v in value loop is not plugged into asyncio properly.
My desired output is: if I assigned five workers, each value item for element '1' would be on its own thread, numbered 0-4, giving five threads in total. This would then repeat for elements 2 through 8.
Or should I assign 40 threads (8 dictionary elements * 5 value items per element), one unique thread for each dictionary item?
Hope that makes sense...
Something about asking a question on SO always seems to trigger extra layers of IQ in me. The answer is as follows, should anyone be interested:
def blocks(key, v):
    #for v in value:
    log = logging.getLogger('blocks({} {})'.format(key, v))
    log.info('running')
    log.info('done')
    time.sleep(30)
    return v

async def run_blocking_tasks(executor, sub_dict2):
    log = logging.getLogger('run_blocking_tasks')
    log.info('starting')
    log.info('creating executor tasks')
    for key, value in sub_dict2.items():
        loop = asyncio.get_event_loop()
        blocking_tasks = [
            loop.run_in_executor(executor, blocks, key, v)
            for v in value
        ]
        log.info('waiting for executor tasks')
        completed, pending = await asyncio.wait(blocking_tasks)
        results = [t.result() for t in completed]
    log.info('results: {!r}'.format(results))
    log.info('exiting')

def new_func():
    logging.basicConfig(
        level=logging.INFO,
        format='%(threadName)10s %(name)18s: %(message)s',
        stream=sys.stderr,
    )
    sub_dict2 = dict(list(sub_dict.items())[0:8])
    executor = concurrent.futures.ThreadPoolExecutor(
        max_workers=5,
    )
    event_loop = asyncio.get_event_loop()
    event_loop.run_until_complete(
        run_blocking_tasks(executor, sub_dict2)
    )
    event_loop.close()

new_func()
EDIT:
Here is a rough layout of my present concurrent.futures code, as per the comment thread below. It is a layout of the logical order rather than the full code, as the full version has a few lengthy pre-step functions...
# some code here that chunks a bigger dictionary using slicing in a for loop.
# sub_dict: a 20-element subset of a bigger dictionary
# slices are parameterised in the real code
sub_dict = dict(list(fin_dict.items())[0:20])

# set 20 workers
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # submit to the executor an enumerator/counter and my sub dict
    future_to_pair = {executor.submit(function_name, v, i): (i, v) for i, v in enumerate(sub_dict.items(), 1)}
    # await results
    for future in concurrent.futures.as_completed(future_to_pair):
        pair = future_to_pair[future]
        data = future.result()

# function that is being called by concurrent.futures
# am happy for all the v's in value to be on a single thread
# i want each key to be on an individual thread
# this will process 20 keys simultaneously, but wait for the slowest one before clearing
def function_name(sub_dict, i):
    for key, value in sub_dict:
        for v in value:
            # using subprocess, execute some stuff
            # dictionary loops provide parameters for the executables.
I think you have missed a key concept: waiting. An async def function without an await is perfectly legal but rather pointless. The purpose of asyncio programming is to handle situations where your program must wait for something, and the program has something useful it could be doing in the meantime. Otherwise it has very little utility.* It's also not easy to find simple examples that illustrate how useful this is.
Python offers several types of concurrency: processes, which use multiple CPU cores; threads, which use multiple lines of execution within a single process; and asyncio tasks, which use multiple units of execution within a thread. These can be combined in various ways, and have different characteristics.
Threads allow your program to block in one place while waiting for a resource, but continue to execute in another place. But synchronizing between threads is often tricky because scheduling CPU time between threads is pre-emptive. It's not under your direct control.
Tasks also allow your program to stop in one place and continue in another, but switching between tasks is cooperative. It's under your control. When a task encounters an "await" expression, it stops there and allows another task to run. That task continues until it bumps into an await expression, and so on. If this solves a problem you have, great. It's a fantastic tool.
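A toy illustration of that hand-off (not from the question; the two tasks below only swap control at their await expressions):

import asyncio

async def task(name):
    for i in range(3):
        print(name, "step", i)
        await asyncio.sleep(0)   # control passes to the other task here

async def main():
    await asyncio.gather(task("A"), task("B"))

asyncio.run(main())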
It seems, based on reading a number of SO questions, that programmers sometimes get the impression that asyncio will make their programs run faster by sending them off to some sort of Never-Never Land where they execute without consuming any CPU cycles, and the result will come floating back on the breeze. Sorry, that won't happen. The main use case is what I described: you have to wait for something but you've got something else to do.
*Remark added for completeness: I have used the cross-threading capabilities of asyncio as means of coordinating between threads. For example, create an event loop in Thread B, and cause it to execute a function on demand by Thread A using the "call_soon_threadsafe" method, or the "run_coroutine_threadsafe" method. This is a handy capability even if it doesn't require the use of an await expression.
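A minimal sketch of that cross-thread pattern (the coroutine and message are invented for illustration): one thread runs the event loop, and another thread schedules work onto it.

import asyncio
import threading

async def handle(msg):
    print("handled in the loop's thread:", msg)

def run_loop(loop):
    asyncio.set_event_loop(loop)
    loop.run_forever()

# Thread B owns the event loop; the main thread plays the role of Thread A.
loop = asyncio.new_event_loop()
threading.Thread(target=run_loop, args=(loop,), daemon=True).start()

future = asyncio.run_coroutine_threadsafe(handle("hello"), loop)
future.result()                          # block until the coroutine has run
loop.call_soon_threadsafe(loop.stop)     # ask the loop thread to shut down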
At the bottom is the code I have now. It seems to work fine. However, I don't completely understand it. I thought that without .join(), I'd risk the code going on to the next for-loop iteration before the pool finishes executing. Wouldn't we need those 3 commented-out lines?
On the other hand, if I were to go with the .close() and .join() approach, is there any way to 'reopen' that closed pool instead of creating a new Pool(6) every time?
import multiprocessing as mp
import random as rdm
from statistics import stdev, mean
import time

def mesh_subset(population, n_chosen=5):
    chosen = rdm.choices(population, k=n_chosen)
    return mean(chosen)

if __name__ == '__main__':
    population = [x for x in range(20)]
    N_iteration = 10
    start_time = time.time()
    pool = mp.Pool(6)
    for i in range(N_iteration):
        print([round(x, 2) for x in population])
        print(stdev(population))
        # pool = mp.Pool(6)
        population = pool.map(mesh_subset, [population]*len(population))
        # pool.close()
        # pool.join()
    print('run time:', time.time() - start_time)
A pool of workers is a relatively costly thing to set up, so it should be done (if possible) only once, usually at the beginning of the script.
The pool.map command blocks until all the tasks are completed. After all, it returns a list of the results. It couldn't do that unless mesh_subset has been called on all the inputs and has returned a result for each. In contrast, methods like pool.apply_async do not block. apply_async returns an ApplyResult object with a get method which blocks until it obtains a result from a worker process.
pool.close sets the worker handler's state to CLOSE. This causes the handler to signal the workers to terminate.
The pool.join blocks until all the worker processes have been terminated.
So you don't need to call -- in fact you shouldn't call -- pool.close and pool.join until you are finished with the pool. Once the workers have been sent the signal to terminate (by pool.close), there is no way to "reopen" them. You would need to start a new pool instead.
In your situation, since you do want the loop to wait until all the tasks are completed, there would be no advantage to using pool.apply_async instead of pool.map. But if you were to use pool.apply_async, you could obtain the same result as before by calling get instead of resorting to closing and restarting the pool:
# you could do this, but using pool.map is simpler
for i in range(N_iteration):
    apply_results = [pool.apply_async(mesh_subset, [population]) for i in range(len(population))]
    # the call to result.get() blocks until its worker process (running
    # mesh_subset) returns a value
    population = [result.get() for result in apply_results]
When the loops complete, len(population) is unchanged.
If you did NOT want each loop to block until all the tasks are completed, you could use apply_async's callback feature:
N_pop = len(population)
result = []
for i in range(N_iteration):
    for i in range(N_pop):
        pool.apply_async(mesh_subset, [population],
                         callback=result.append)
pool.close()
pool.join()
print(result)
Now, when any mesh_subset returns a return_value, result.append(return_value) is called. The calls to apply_async do not block, so N_iteration * N_pop tasks are pushed into the pool's task queue all at once. But since the pool has 6 workers, at most 6 calls to mesh_subset are running at any given time. As the workers complete the tasks, whichever worker finishes first calls result.append(return_value). So the values in result are unordered. This is different from pool.map, which returns a list whose return values are in the same order as its corresponding list of arguments.
Barring an exception, result will eventually contain N_iteration * N_pop return values once all the tasks complete. Above, pool.close() and pool.join() were used to wait for all the tasks to complete.
I'm trying to leverage concurrent.futures.ProcessPoolExecutor in Python 3 to process a large matrix in parallel. The general structure of the code is:
class X(object):
    self.matrix

    def f(self, i, row_i):
        <cpu-bound process>

    def fetch_multiple(self, ids):
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.f, i, self.matrix.getrow(i)) for i in ids]
            return [f.result() for f in as_completed(futures)]
self.matrix is a large scipy csr_matrix. f is my concurrent function that takes a row of self.matrix and applies a CPU-bound process to it. Finally, fetch_multiple is a function that runs multiple instances of f in parallel and returns the results.
The problem is that after running the script, all CPU cores are less than 50% busy (see the following screenshot):
Why are all the cores not fully busy?
I think the problem is the large self.matrix object and the passing of row vectors between processes. How can I solve this problem?
Yes.
The overhead should not be that big - but it is likely the cause of your CPUs appearing idle (although they should be busy passing the data around anyway).
But try the recipe here to pass a "pointer" to the object to the subprocesses using shared memory:
http://briansimulator.org/sharing-numpy-arrays-between-processes/
Quoting from there:
from multiprocessing import sharedctypes
size = S.size
shape = S.shape
S.shape = size
S_ctypes = sharedctypes.RawArray('d', S)
S = numpy.frombuffer(S_ctypes, dtype=numpy.float64, count=size)
S.shape = shape
Now we can send S_ctypes and shape to a child process in multiprocessing, and convert it back to a numpy array in the child process as follows:
from numpy import ctypeslib
S = ctypeslib.as_array(S_ctypes)
S.shape = shape
It might be tricky to take care of reference counting, but I suppose numpy.ctypeslib takes care of that - so just coordinate the passing of the actual row numbers to the sub-processes in a way that they don't work on the same data.
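A condensed sketch of that recipe applied to this question's shape, assuming the matrix can be held as a dense numpy array (a csr_matrix would need its data/indices/indptr arrays shared the same way); only the row index is sent to each worker:

import numpy
from multiprocessing import Pool, sharedctypes
from numpy import ctypeslib

def init_worker(shared, shape):
    # Runs once per worker: rebuild a numpy view onto the shared buffer.
    global MATRIX
    MATRIX = ctypeslib.as_array(shared)
    MATRIX.shape = shape

def f(i):
    row = MATRIX[i]             # no row data is pickled, only the index i
    return float(row.sum())     # placeholder for the real CPU-bound work

if __name__ == "__main__":
    S = numpy.random.rand(1000, 1000)
    shape = S.shape
    S_ctypes = sharedctypes.RawArray('d', S.ravel())
    with Pool(initializer=init_worker, initargs=(S_ctypes, shape)) as pool:
        results = pool.map(f, range(shape[0]))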