Why is the performance of concurrent.futures.ProcessPoolExecutor very low? - python-3.x

I'm trying to leverage concurrent.futures.ProcessPoolExecutor in Python3 to process a large matrix in parallel. The general structure of the code is:
class X(object):
    self.matrix
    def f(self, i, row_i):
        <cpu-bound process>
    def fetch_multiple(self, ids):
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.f, i, self.matrix.getrow(i)) for i in ids]
            return [f.result() for f in as_completed(futures)]
self.matrix is a large scipy csr_matrix. f is my concurrent function; it takes a row of self.matrix and applies a CPU-bound process to it. Finally, fetch_multiple is a function that runs multiple instances of f in parallel and returns the results.
The problem is that after running the script, all CPU cores are less than 50% busy.
Why aren't all cores busy?
I think the problem is the large self.matrix object and passing row vectors between processes. How can I solve this problem?

Yes.
The overhead should not be that big, but it is likely the reason your CPUs appear idle (although they should still be busy passing the data around).
But try the recipe here to pass a "pointer" to the object to the subprocess using shared memory.
http://briansimulator.org/sharing-numpy-arrays-between-processes/
Quoting from there:
from multiprocessing import sharedctypes
size = S.size
shape = S.shape
S.shape = size
S_ctypes = sharedctypes.RawArray('d', S)
S = numpy.frombuffer(S_ctypes, dtype=numpy.float64, count=size)
S.shape = shape
Now we can send S_ctypes and shape to a child process in
multiprocessing, and convert it back to a numpy array in the child
process as follows:
from numpy import ctypeslib
S = ctypeslib.as_array(S_ctypes)
S.shape = shape
It can be tricky to take care of reference counting, but I suppose numpy.ctypeslib takes care of that. So just coordinate which row numbers you pass to the subprocesses, so that they don't work on the same data.
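For the csr_matrix in the question, a minimal sketch of the same idea might look like the following. It assumes the three arrays that back a CSR matrix (the data, indices and indptr attributes of a scipy csr_matrix) are copied once into shared buffers, and that each task carries only a row number. The helper names (to_shared, init_worker, row_task, fetch_multiple) and the sum-of-squares computation are placeholders for illustration, not part of the original code.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import sharedctypes

_csr = {}  # per-process numpy views onto the shared buffers


def to_shared(arr):
    """Copy a 1-D numpy array into a lock-free shared ctypes buffer."""
    arr = np.ascontiguousarray(arr)
    raw = sharedctypes.RawArray(np.ctypeslib.as_ctypes_type(arr.dtype), arr.size)
    np.frombuffer(raw, dtype=arr.dtype)[:] = arr
    return raw, arr.dtype.str


def init_worker(data, indices, indptr):
    """Runs once per worker: wrap the shared buffers without copying them."""
    for name, (raw, dtype) in zip(("data", "indices", "indptr"),
                                  (data, indices, indptr)):
        _csr[name] = np.frombuffer(raw, dtype=dtype)


def row_task(i):
    """Placeholder CPU-bound work: sum of squares of the stored values in row i."""
    start, end = _csr["indptr"][i], _csr["indptr"][i + 1]
    values = _csr["data"][start:end]
    return i, float(np.dot(values, values))


def fetch_multiple(data, indices, indptr, ids, max_workers=None):
    """Only row numbers cross the process boundary; the matrix itself is shared."""
    shared = tuple(to_shared(a) for a in (data, indices, indptr))
    with ProcessPoolExecutor(max_workers=max_workers,
                             initializer=init_worker,
                             initargs=shared) as executor:
        return list(executor.map(row_task, ids, chunksize=64))
```

With a scipy csr_matrix m this would be called as fetch_multiple(m.data, m.indices, m.indptr, ids), from under an if __name__ == "__main__": guard. Returning the row number with each result keeps the results identifiable regardless of how tasks are scheduled.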

Related

Shared memory and how to access a global variable from within a class in Python, with multiprocessing?

I am currently developing some code that deals with big multidimensional arrays. Of course, Python gets very slow if you try to perform these computations in a serialized manner. Therefore, I got into code parallelization, and one of the possible solutions I found has to do with the multiprocessing library.
What I have come up with so far is to first divide the big array into smaller chunks and then perform some operation on each of those chunks in parallel, using a Pool of workers from multiprocessing. For that to be efficient, and based on this answer, I believe that I should use a shared memory array object defined as a global variable, to avoid copying it every time a process from the pool is called.
Here is a minimal example of what I'm trying to do, to illustrate the issue:
import numpy as np
from functools import partial
import multiprocessing as mp
import ctypes
class Trials:
    # Perform computation along first dimension of shared array, representing the chunks
    def Compute(i, shared_array):
        shared_array[i] = shared_array[i] + 2
    # The function you actually call
    def DoSomething(self):
        # Initializer function for Pool, should define the global variable shared_array
        # I have also tried putting this function outside DoSomething, as a part of the class,
        # with the same results
        def initialize(base, State):
            global shared_array
            shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + State
        base = mp.Array(ctypes.c_float, 125*100*100) # Create base array
        state = np.random.rand(125,100,100) # Create seed
        # Initialize pool of workers and perform calculations
        with mp.Pool(processes = 10,
                     initializer = initialize,
                     initargs = (base, state,)) as pool:
            run = partial(self.Compute,
                          shared_array = shared_array) # Here the error says that shared_array is not defined
            pool.map(run, np.arange(125))
            pool.close()
            pool.join()
        print(shared_array)
if __name__ == '__main__':
    Trials = Trials()
    Trials.DoSomething()
The trouble I am encountering is that when I define the partial function, I get the following error:
NameError: name 'shared_array' is not defined
From what I understand, that means I cannot access the global variable shared_array. I'm sure that the initialize function is executing, because a print statement inside it produces output in the terminal.
What am I doing incorrectly? Is there any way to solve this issue?
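One likely reading, offered as a hedged sketch rather than a confirmed diagnosis: the initializer runs only inside the worker processes, so the global it creates never exists in the parent process, which is where the partial is built. In addition, the expression `... .reshape(125, 100, 100) + State` allocates a new array, so the result is no longer a view of the shared buffer. The sketch below keeps the worker function at module level and lets it look up the global inside the worker. It swaps mp.Array for mp.RawArray (an assumption: no lock is needed because each task touches a different chunk), and the names SHAPE, as_ndarray, init_worker, compute and do_something are made up for the illustration.

```python
import ctypes
import multiprocessing as mp

import numpy as np

SHAPE = (125, 100, 100)
_shared = None  # set inside each worker by init_worker


def as_ndarray(raw):
    """View the shared buffer as a numpy array without copying it."""
    return np.frombuffer(raw, dtype=np.float32).reshape(SHAPE)


def init_worker(raw):
    """Pool initializer: runs in every worker, never in the parent."""
    global _shared
    _shared = as_ndarray(raw)


def compute(i):
    """Update chunk i in place; the global is resolved inside the worker."""
    _shared[i] += 2


def do_something(processes=4):
    raw = mp.RawArray(ctypes.c_float, int(np.prod(SHAPE)))
    view = as_ndarray(raw)
    view[...] = np.random.rand(*SHAPE)  # seed in place so the data stays shared
    with mp.Pool(processes, initializer=init_worker, initargs=(raw,)) as pool:
        pool.map(compute, range(SHAPE[0]))
    return view  # the parent sees the workers' updates through its own view
```

Calling do_something() from under an if __name__ == "__main__": guard should return the seeded array with 2 added to every element, since the parent's view and the workers' views share one buffer.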

Simulating many agents in PyTorch using multiprocessing

I want to simulate multiple reinforcement learning agents that are coded using PyTorch. The agents do not share any data dynamically, so I expect that the task should be "embarrassingly parallel". I need a lot of simulations (I want to see what distribution my agents converge to), so I hope to speed it up using multiprocessing.
I have a model class that stores all the parameters of my agents (which are the same across agents) and the environment. I can simulate N agents over T periods using
model.simulate(N = 10, T = 50)
My class would then run simulation loops and store all networks and simulation histories. I am very new to parallel programming, and I (naively) tried the following:
import torch.multiprocessing as mp
num_processes = 6
processes = []
for _ in range(num_processes):
    p = mp.Process(target=model.simulate(N = 10, T = 50), args= ())
    p.start()
    processes.append(p)
for p in processes:
    p.join()
For now I do not even try to store results, I just want to see some speed-up. But the time it takes to run the code above is roughly the same as when I simply run a loop and do 6 simulations consecutively:
for _ in range(num_processes):
    model.simulate(N = 10, T = 50)
I also tried to make processes for different instances of the model class, but it did not help.
It looks like your problem is in this line
p = mp.Process(target=model.simulate(N = 10, T = 50), args= ())
The part model.simulate(N = 10, T = 50) is executed first, then the result (I'm assuming None if there is no return from this method) is passed to the mp.Process as the target parameter. So you are doing all the computation sequentially, and not performing it on the new processes.
What you need to do instead is to pass the simulate function (without executing it) and provide the args separately.
i.e. something like...
p = mp.Process(target=model.simulate, args=(10, 50))
Providing target=model.simulate passes a reference to the function itself rather than executing it and passing the result. This way it will be executed in the new process and you should achieve the parallelism.
See the official docs for an example.
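A minimal sketch of the corrected pattern, using the standard library multiprocessing module and a made-up simulate function standing in for model.simulate (the function body and the helper names build_processes and run_all are assumptions for illustration). Since the original call uses keyword arguments, the sketch passes them through the kwargs parameter of Process.

```python
import multiprocessing as mp


def simulate(n, t):
    """Hypothetical stand-in for model.simulate: a small CPU-bound loop."""
    total = 0
    for agent in range(n):
        for step in range(t):
            total += agent * step
    return total


def build_processes(num_processes, n, t):
    """Create the processes without running anything yet."""
    # target is the function object itself; the arguments travel separately,
    # so simulate() is only called once start() launches the child process.
    return [mp.Process(target=simulate, kwargs={"n": n, "t": t})
            for _ in range(num_processes)]


def run_all(num_processes=6, n=10, t=50):
    """Start every process, wait for all of them, and return the exit codes."""
    processes = build_processes(num_processes, n, t)
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]
```

run_all() should be called from under an if __name__ == "__main__": guard. Note that the return value of simulate is discarded by Process; collecting results would need a Pool or a Queue.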

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (10GB compressed, 100GB uncompressed) that contains reports separated by delimiters, and I have to parse it.
Parsing and processing the data takes a long time, so this is a CPU-bound problem (not an IO-bound one). I am therefore planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data to the child processes efficiently. I am using subprocess.Popen to stream in the uncompressed data in the parent process.
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() to read/parse one report in child-process-1, then release the lock and switch to child-process-2 to read/parse the next report, and then switch back to child-process-1 to read/parse the one after that. When I pass process.stdout as args to the child processes, I get a pickling error.
I have tried to create multiprocessing.Queue() and multiprocessing.Pipe() to send data to child processes, but this is way too slow (in fact it is way slower than doing it in single thread ie serially).
Any thoughts/examples about sending data to child processes efficiently will help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
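A rough sketch of that idea, assuming reports are separated by a delimiter line (here "xxxxx\n", an assumed value) and that each worker reads the file through gzip.open rather than a gunzip subprocess. The function names and the "work" done per report (counting lines) are placeholders.

```python
import gzip

DELIM = "xxxxx\n"  # assumed delimiter line between reports


def iter_records(lines, delim=DELIM):
    """Yield each report as a list of lines, split on delimiter lines."""
    record = []
    for line in lines:
        if line == delim:
            yield record
            record = []
        else:
            record.append(line)
    if record:  # trailing report with no closing delimiter
        yield record


def my_share(lines, worker_id, nworkers, delim=DELIM):
    """Yield only the reports assigned to this worker, round-robin."""
    for index, record in enumerate(iter_records(lines, delim)):
        if index % nworkers == worker_id:
            yield record


def worker(path, worker_id, nworkers):
    """Decompress the whole file locally and process every nworkers-th report."""
    # No report data crosses a process boundary; only the small result does.
    with gzip.open(path, "rt") as f:
        return sum(len(rec) for rec in my_share(f, worker_id, nworkers))
```

Each worker could then be launched with something like pool.starmap(worker, [(path, k, n) for k in range(n)]). The trade-off is that every worker pays the full decompression cost, so this only helps when parsing dominates decompression.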
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed
seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overheads of passing chunks of stdout to worker processes are all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp
NWORKERS = 3
DELIM = "xxxxx\n"
def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout
# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None
def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total
if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now
    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)
    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)

Python avoiding large array allocation multiple times

I have to compute a function many many times.
To compute this function the elements of an array must be computed.
The array is quite large.
How can I avoid allocating the array in every function call?
The code I have tried goes something like this:
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it
        Let's say that we do
        self.data = data
        """
    def function(self, point):
        return numpy.sum(numpy.array([somecomputations(item) for item in self.data]))
Well, maybe my concern is unfounded, so I have first this question.
Question: Is it true that the array [somecomputations(item) for item in data] is being allocated and deallocated for every call to function?
Thinking that that is the case I have tried
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it
        Let's say that we do
        self.data = data
        """
        self.number_of_data = range(0, len(data))
        self.my_array = numpy.zeros(len(data))
    def function(self, point):
        for i in self.number_of_data:
            self.my_array[i] = somecomputations(self.data[i])
        return numpy.sum(self.my_array)
This is slower than the previous version. I assume that the list comprehension in the first version can run entirely in C, while in the second version only smaller parts of the script can be translated into optimized C code.
I have very little idea of how Python works inside.
Question: Is there a good way to skip the array allocation in every function call and at the same time take advantage of a well optimized loop on the array?
I am using Python3.5
Looping over the array is unnecessary and crosses from Python to C many times, hence the slowdown. The beauty of numpy arrays is that functions work on them element by element. I think the fastest would be:
return numpy.sum(somecomputations(self.data))
somecomputations may need a bit of modification, but often it will work right off the bat. Also, you're not using point, among other things.
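To make the comparison concrete, here is a small sketch under the assumption that somecomputations is a simple elementwise formula (x * x + 1.0 is invented for the example). It shows the original list-comprehension form, the vectorized form suggested above, and a variant that reuses a preallocated scratch buffer through the out= parameter of numpy ufuncs, which is one way to avoid a fresh allocation per call.

```python
import numpy as np


def somecomputations(x):
    """Hypothetical elementwise computation; accepts scalars or arrays."""
    return x * x + 1.0


class FunctionCalculator:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        # Allocated once here and reused by function_no_alloc on every call.
        self.scratch = np.empty_like(self.data)

    def function_loop(self):
        """Original approach: one Python-level call per element."""
        return np.sum(np.array([somecomputations(item) for item in self.data]))

    def function_vectorized(self):
        """Whole-array call; numpy still allocates temporaries internally."""
        return np.sum(somecomputations(self.data))

    def function_no_alloc(self):
        """Same formula written with in-place ufunc calls on the scratch buffer."""
        np.multiply(self.data, self.data, out=self.scratch)
        self.scratch += 1.0
        return self.scratch.sum()
```

Whether the scratch-buffer variant is measurably faster than the plain vectorized one depends on the array size and the formula, so it is worth timing both before committing to the less readable version.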

Python Multiprocessing Queue Slow

I have a problem with python multiprocessing Queues.
I'm doing some heavy computation on some data. I have created a few processes to lower the calculation time, and the data is split evenly before it is sent to the processes. This decreases the calculation time nicely, but when I want to return data from a process through multiprocessing.Queue it takes ages, and the whole thing is slower than calculating in the main thread.
processes = []
proc = 8
for i in range(proc):
    processes.append(multiprocessing.Process(target=self.calculateTriangles, args=(inData[i],outData,timer)))
for p in processes:
    p.start()
results = []
for i in range(proc):
    results.append(outData.get())
print("killing threads")
print(datetime.datetime.now() - timer)
for p in processes:
    p.join()
print("Finish Threads")
print(datetime.datetime.now() - timer)
All of the threads print their finish time when they are done. Here is example output of this code:
0:00:00.017873 CalcDone
0:00:01.692940 CalcDone
0:00:01.777674 CalcDone
0:00:01.780019 CalcDone
0:00:01.796739 CalcDone
0:00:01.831723 CalcDone
0:00:01.842356 CalcDone
0:00:01.868633 CalcDone
0:00:05.497160 killing threads
60968 calculated triangles
As you can see, everything is quite simple until this code:
for i in range(proc):
    results.append(outData.get())
print("killing threads")
print(datetime.datetime.now() - timer)
Here are some observations I have made on my computer and on a slower one:
https://docs.google.com/spreadsheets/d/1_8LovX0eSgvNW63-xh8L9-uylAVlzY4VSPUQ1yP2F9A/edit?usp=sharing . On the slower one there isn't any improvement, as you can see.
Why does it take so much time to get items from the queue when the process has finished? Is there a way to speed this up?
So I have solved it myself. The calculations are fast, but copying objects from one process to another takes ages. I just made a method that clears all unnecessary fields in the objects; also, using pipes is faster than multiprocessing queues. This took the time on my slower computer down from 29 seconds to 15 seconds.
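One way to "clear unnecessary fields" without mutating the live object is to define __getstate__, which pickle calls when the object is sent to another process. The sketch below is only an illustration of that mechanism: the Triangle class and its cache field are invented, not taken from the code above.

```python
import pickle


class Triangle:
    """Hypothetical result object with one bulky, recomputable field."""

    def __init__(self, vertices):
        self.vertices = vertices
        self.cache = [v * 2 for v in vertices] * 1000  # large scratch data

    def __getstate__(self):
        # Called by pickle: send a copy of the attributes minus the bulky part.
        state = self.__dict__.copy()
        state["cache"] = None
        return state


def payload_size(obj):
    """Number of bytes pickle would produce for obj."""
    return len(pickle.dumps(obj))
```

The object in the sending process keeps its cache; only the pickled copy that travels through the Queue or Pipe is slimmed down.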
This time is mainly spent on putting another object into the Queue and bumping the Semaphore count. If you are able to bulk insert all the data into the Queue at once, you cut the time down to 1/10 of what it was.
I've dynamically assigned a new method to Queue, based on the old one. Go to the multiprocessing module for your Python version:
/usr/lib/pythonx.x/multiprocessing/queues.py
Copy the "put" method of the class to your project e.g. for Python 3.7:
def put(self, obj, block=True, timeout=None):
    assert not self._closed, "Queue {0!r} has been closed".format(self)
    if not self._sem.acquire(block, timeout):
        raise Full
    with self._notempty:
        if self._thread is None:
            self._start_thread()
        self._buffer.append(obj)
        self._notempty.notify()
modify it:
def put_bla(self, obj, block=True, timeout=None):
    # Full comes from the standard library: from queue import Full
    assert not self._closed, "Queue {0!r} has been closed".format(self)
    for el in obj:
        if not self._sem.acquire(block, timeout): #spike the semaphore count
            raise Full
    with self._notempty:
        if self._thread is None:
            self._start_thread()
        self._buffer += obj # adding a collections.deque object
        self._notempty.notify()
The last step is to add the new method to the class. The multiprocessing.Queue is a DefaultContext method which returns a Queue object. It is easier to inject the method directly to the class of the created object. So:
from collections import deque
queue = Queue()
queue.__class__.put_bulk = put_bla # injecting new method
items = (500, 400, 450, 350) * count # (500, 400, 450, 350, 500, 400...)
queue.put_bulk(deque(items))
Unfortunately, multiprocessing.Pool was always faster by 10%, so just stick with that if you don't need long-lived workers to process your tasks. It is based on multiprocessing.SimpleQueue, which is based on multiprocessing.Pipe, and I have no idea why it is faster, because my SimpleQueue solution wasn't, and it is not bulk-injectable :) Break that and you'll have the fastest worker ever :)
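A less invasive alternative, offered as a sketch of a different technique rather than the patch described above: batch the items on the caller's side and put one list per batch, so that each put costs one pickle and one pipe write for many items, without touching private attributes of Queue. The helper names and the None sentinel are assumptions for the example.

```python
import multiprocessing as mp


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def producer(queue, items, batch_size):
    """Send items in batches; one put() per batch instead of per item."""
    for batch in chunked(items, batch_size):
        queue.put(batch)
    queue.put(None)  # sentinel: tells the consumer there is nothing more


def drain(queue):
    """Collect every item from the queue until the sentinel arrives."""
    results = []
    while True:
        batch = queue.get()
        if batch is None:
            return results
        results.extend(batch)


def make_queue():
    """A regular multiprocessing queue; any object with put/get would do."""
    return mp.Queue()
```

In a real program, producer would run in a worker process and drain in the parent, sharing the queue from make_queue(). The batch size is a tuning knob: larger batches mean fewer puts but bigger individual messages.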
