Slow multiprocessing when using torch - python-3.x

I have a function func that takes on average 9 sec to run. But when I try to use multiprocessing to parallelize it (even using torch.multiprocessing), each inference takes on average 20 sec. Why is that?
func is an inference function which takes in a patient_name and runs a torch model in inference on that patient's data.
import numpy as np
import torch
from torch.multiprocessing import Pool

device = torch.device('cpu')

def func(patient_name):
    # Load this patient's data and model weights from disk
    data = np.load(my_dict[patient_name]['data_path'])
    model_state = torch.load(my_dict[patient_name]['model_state_path'], map_location='cpu')
    model = my_net(my_dict[patient_name]['HPs'])
    model = model.to(device)
    model.load_state_dict(model_state)
    model.eval()
    result = model(torch.FloatTensor(data).to(device))
    return result

core_cnt = 10
pool = Pool(core_cnt)
out = pool.starmap(func, pool_args)

Please check whether a single inference with your model and data already uses substantial computational power; PyTorch runs CPU operations on multiple threads by default. If it does, ten worker processes compete for the same cores, and the OS scheduler has to switch between them, which takes additional time.
Also, you load the model from disk on every call to func. It is generally faster to copy an object between processes than to read it from disk each time, so load the model once and pass it to the workers (the usual strategy with multiprocessing), or share the model between processes with torch.multiprocessing.

Related

Calling VGG many times causes an out of memory error

I want to extract the VGG features of a set of images and keep them in memory in a dictionary. The dictionary ends up holding 8091 tensors, each of shape (1,4096), but my machine crashes with an out of memory error about 6% of the way through. Does anybody have a clue why this is happening and how to prevent it?
In fact, this seems to be triggered by the call to VGG rather than the memory space, since storing the VGG classification is sufficient to trigger the error.
Below is the simplest code I've found to reproduce the error. Once a helper function is defined:
import torch, torchvision
from tqdm import tqdm

vgg = torchvision.models.vgg16(weights='DEFAULT')

def try_and_crash(gen_data):
    store_out = {}
    for i in tqdm(range(8091)):
        my_output = gen_data(torch.randn(1,3,224,224))
        store_out[i] = my_output
    return store_out
Calling it to quickly produce a large tensor doesn't cause a fuss
just_fine = try_and_crash(lambda x: torch.randn(1,4096))
but calling it to use vgg causes the machine to crash:
will_crash = try_and_crash(vgg)
The problem is that each element store_out[i] of the dictionary is still attached to the autograd graph that produced it, including the intermediate activations saved for the backward pass, so it ends up occupying much more memory than a plain 1x4096 tensor.
Running the code under torch.no_grad(), or equivalently under torch.set_grad_enabled(False), solves the issue. We can test it by slightly changing the helper function:
def try_and_crash_grad(gen_data, grad_enabled):
    store_out = {}
    for i in tqdm(range(8091)):
        with torch.set_grad_enabled(grad_enabled):
            my_output = gen_data(torch.randn(1,3,224,224))
        store_out[i] = my_output
    return store_out
Now the following works
works_fine = try_and_crash_grad(vgg, False)
while the following throws an out of memory error
crashes = try_and_crash_grad(vgg, True)
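The difference can be checked without VGG. The sketch below uses a tiny illustrative nn.Sequential (an assumption, chosen only so it runs quickly) and inspects grad_fn, which is set whenever an output is still attached to the autograd graph. It also shows .detach() as an alternative when the forward pass must stay in grad mode.

```python
import torch
from torch import nn

# Tiny illustrative model; it stands in for vgg16 so the sketch runs quickly.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(1, 8)

# Default mode: the output keeps a handle on the graph that produced it.
out_tracked = model(x)
print(out_tracked.grad_fn is not None)  # True

# Option 1: do not record a graph at all.
with torch.no_grad():
    out_no_grad = model(x)
print(out_no_grad.grad_fn is None)  # True

# Option 2: record the graph, but store only a detached tensor.
# The graph is freed once nothing else references the tracked output.
out_detached = model(x).detach()
print(out_detached.requires_grad)  # False
```

Storing out_no_grad or out_detached in the dictionary keeps only the output values alive, not the intermediate activations.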

Shared memory and how to access a global variable from within a class in Python, with multiprocessing?

I am currently developing some code that deals with big multidimensional arrays. Of course, Python gets very slow if you try to perform these computations serially. Therefore, I got into code parallelization, and one of the possible solutions I found has to do with the multiprocessing library.
What I have come up with so far is first dividing the big array into smaller chunks and then doing some operation on each of those chunks in parallel, using a Pool of workers from multiprocessing. For that to be efficient, and based on this answer, I believe that I should use a shared memory array object defined as a global variable, to avoid copying it every time a process from the pool is called.
Here I add some minimal example of what I'm trying to do, to illustrate the issue:
import numpy as np
from functools import partial
import multiprocessing as mp
import ctypes

class Trials:
    # Perform computation along first dimension of shared array, representing the chunks
    def Compute(i, shared_array):
        shared_array[i] = shared_array[i] + 2

    # The function you actually call
    def DoSomething(self):
        # Initializer function for Pool, should define the global variable shared_array
        # I have also tried putting this function outside DoSomething, as a part of the class,
        # with the same results
        def initialize(base, State):
            global shared_array
            shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + State

        base = mp.Array(ctypes.c_float, 125*100*100) # Create base array
        state = np.random.rand(125,100,100) # Create seed
        # Initialize pool of workers and perform calculations
        with mp.Pool(processes = 10,
                     initializer = initialize,
                     initargs = (base, state,)) as pool:
            run = partial(self.Compute,
                          shared_array = shared_array) # Here the error says that shared_array is not defined
            pool.map(run, np.arange(125))
            pool.close()
            pool.join()
        print(shared_array)

if __name__ == '__main__':
    Trials = Trials()
    Trials.DoSomething()
The trouble I am encountering is that when I define the partial function, I get the following error:
NameError: name 'shared_array' is not defined
From what I understand, that means I cannot access the global variable shared_array. I'm sure that the initialize function is executing, as putting a print statement inside of it gives back a result in the terminal.
What am I doing incorrectly? Is there any way to solve this issue?

Simulating many agents in PyTorch using multiprocessing

I want to simulate multiple reinforcement learning agents that are coded using PyTorch. The agents do not share any data dynamically, so I expect that the task should be "embarrassingly parallel". I need a lot of simulations (I want to see what distribution my agents converge to), so I hope to speed it up using multiprocessing.
I have a model class that stores all the parameters of my agents (which are the same across agents) and the environment. I can simulate N agents over T periods using
model.simulate(N = 10, T = 50)
My class would then run simulation loops and store all networks and simulation histories. I am very new to parallel programming, and I (naively) try the following:
import torch.multiprocessing as mp

num_processes = 6
processes = []
for _ in range(num_processes):
    p = mp.Process(target=model.simulate(N = 10, T = 50), args= ())
    p.start()
    processes.append(p)
for p in processes:
    p.join()
For now I do not even try to store results, I just want to see some speed-up. But the time it takes to run the code above is roughly the same as when I simply run a loop and do 6 simulations consecutively:
for _ in range(num_processes):
    model.simulate(N = 10, T = 50)
I also tried to make processes for different instances of the model class, but it did not help.
It looks like your problem is in this line
p = mp.Process(target=model.simulate(N = 10, T = 50), args= ())
The part model.simulate(N = 10, T = 50) is executed first, then the result (I'm assuming None if there is no return from this method) is passed to the mp.Process as the target parameter. So you are doing all the computation sequentially, and not performing it on the new processes.
What you need to do instead is to pass the simulate function (without executing it) and provide the args separately.
i.e. something like...
p = mp.Process(target=model.simulate, args=(10, 50))
Providing target=model.simulate passes a reference to the function itself rather than executing it and passing the result. This way it will be executed in the new process and you should achieve the parallelism.
See the official docs for an example.
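A self-contained sketch of the corrected pattern, using the standard multiprocessing module. simulate here is a hypothetical module-level stand-in for model.simulate, and make_processes is an illustrative helper, not part of any library.

```python
import multiprocessing as mp


def simulate(N, T):
    # Hypothetical stand-in for model.simulate: some CPU work over N agents, T periods.
    total = 0
    for agent in range(N):
        for step in range(T):
            total += agent * step
    return total


def make_processes(num_processes, N, T):
    # target is the function object itself; args are supplied separately,
    # so nothing is computed until start() is called.
    return [mp.Process(target=simulate, args=(N, T)) for _ in range(num_processes)]


if __name__ == "__main__":
    processes = make_processes(6, 10, 50)
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print([p.exitcode for p in processes])
```

Keyword arguments can be passed the same way with kwargs={"N": 10, "T": 50}. The return value of the target is discarded by Process, so collecting results needs a Queue, a Pool, or similar.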

Why the performance of concurrent.futures.ProcessPoolExecutor is very low?

I'm trying to leverage concurrent.futures.ProcessPoolExecutor in Python3 to process a large matrix in parallel. The general structure of the code is:
class X(object):
    self.matrix

    def f(self, i, row_i):
        <cpu-bound process>

    def fetch_multiple(self, ids):
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.f, i, self.matrix.getrow(i)) for i in ids]
            return [f.result() for f in as_completed(futures)]
self.matrix is a large scipy csr_matrix. f is my concurrent function that takes a row of self.matrix and applies a CPU-bound process to it. Finally, fetch_multiple is a function that runs multiple instances of f in parallel and returns the results.
The problem is that after running the script, all CPU cores are less than 50% busy (screenshot not shown).
Why are not all cores busy?
I think the problem is the large self.matrix object and the passing of row vectors between processes. How can I solve this problem?
Yes.
The overhead should not be that big, but it is likely the cause of your CPUs appearing idle (although they should be busy passing the data around anyway).
But try the recipe here to pass a "pointer" to the object to the subprocess using shared memory.
http://briansimulator.org/sharing-numpy-arrays-between-processes/
Quoting from there:
from multiprocessing import sharedctypes
size = S.size
shape = S.shape
S.shape = size
S_ctypes = sharedctypes.RawArray('d', S)
S = numpy.frombuffer(S_ctypes, dtype=numpy.float64, count=size)
S.shape = shape
Now we can send S_ctypes and shape to a child process in multiprocessing, and convert it back to a numpy array in the child process as follows:
from numpy import ctypeslib
S = ctypeslib.as_array(S_ctypes)
S.shape = shape
Taking care of reference counting could be tricky, but I suppose numpy.ctypeslib handles that. So just coordinate the row numbers you pass to the subprocesses so that no two of them work on the same data.
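A Python 3 sketch of the quoted recipe combined with ProcessPoolExecutor, assuming a dense float64 array. to_shared, init_worker and row_sum are hypothetical names. Each task receives only a row index and reads the row from the shared buffer, so no row data is pickled per task. A csr_matrix is not a single dense buffer; its data, indices and indptr arrays would each have to be shared in this way, which is not shown here.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import sharedctypes

import numpy as np

_S = None  # per-process view of the shared buffer, set by init_worker


def to_shared(array):
    # Copy a dense array into a lock-free shared buffer exactly once.
    flat = np.ascontiguousarray(array, dtype=np.float64).ravel()
    raw = sharedctypes.RawArray('d', flat.size)
    np.frombuffer(raw, dtype=np.float64)[:] = flat
    return raw, array.shape


def init_worker(raw, shape):
    # Wrap the shared buffer as a NumPy array; this does not copy the data.
    global _S
    _S = np.frombuffer(raw, dtype=np.float64).reshape(shape)


def row_sum(i):
    # Stand-in for the CPU-bound work on row i.
    return float(_S[i].sum())


if __name__ == "__main__":
    S = np.arange(12, dtype=np.float64).reshape(3, 4)
    raw, shape = to_shared(S)
    with ProcessPoolExecutor(max_workers=2, initializer=init_worker,
                             initargs=(raw, shape)) as executor:
        print(list(executor.map(row_sum, range(shape[0]))))
```

RawArray has no lock, which is fine for read-only access like this; concurrent writes to the same rows would need explicit synchronisation.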

How do I get improved pymongo performance using threading?

I'm trying to see performance improvements on pymongo, but I'm not observing any.
My sample db has 400,000 records. Essentially I'm seeing threaded and single threaded performance be equal - and the only performance gain coming from multiple process execution.
Does pymongo not release the GIL during queries?
Single Perf: real 0m0.618s
Multiproc: real 0m0.144s
Multithread: real 0m0.656s
Regular code:
from pymongo import MongoClient

choices = ['foo','bar','baz']

def regular_read(db, sample_choice):
    rows = db.test_samples.find({'choice':sample_choice})
    return 42 # done to remove calculations from the picture

def main():
    client = MongoClient('localhost', 27017)
    db = client['test-async']
    for sample_choice in choices:
        regular_read(db, sample_choice)

if __name__ == '__main__':
    main()
$ time python3 mongotest_read.py
real 0m0.618s
user 0m0.085s
sys 0m0.018s
Now when I use multiprocessing I can see some improvement.
from random import randint, choice
import functools
from pymongo import MongoClient
from concurrent import futures

choices = ['foo','bar','baz']
MAX_WORKERS = 4

def regular_read(sample_choice):
    client = MongoClient('localhost', 27017,connect=False)
    db = client['test-async']
    rows = db.test_samples.find({'choice':sample_choice})
    #return sum(r['data'] for r in rows)
    return 42

def main():
    f = functools.partial(regular_read)
    with futures.ProcessPoolExecutor(MAX_WORKERS) as executor:
        res = executor.map(f, choices)
    results = list(res)  # consume the iterator once; a second list(res) would be empty
    print(results)
    return len(results)

if __name__ == '__main__':
    main()
$ time python3 mongotest_proc_read.py
[42, 42, 42]
real 0m0.144s
user 0m0.106s
sys 0m0.041s
But when you switch from ProcessPoolExecutor to ThreadPoolExecutor the speed drops back to single threaded mode.
...
def main():
    client = MongoClient('localhost', 27017,connect=False)
    f = functools.partial(regular_read, client)
    with futures.ThreadPoolExecutor(MAX_WORKERS) as executor:
        res = executor.map(f, choices)
    results = list(res)  # consume the iterator once; a second list(res) would be empty
    print(results)
    return len(results)
$ time python3 mongotest_thread_read.py
[42, 42, 42]
real 0m0.656s
user 0m0.111s
sys 0m0.024s
...
PyMongo uses the standard Python socket module, which does drop the GIL while sending and receiving data over the network. However, it's not MongoDB or the network that's your bottleneck: it's Python.
CPU-intensive Python processes do not scale by adding threads; indeed they slow down slightly due to context-switching and other inefficiencies. To use more than one CPU in Python, start subprocesses.
I know it doesn't seem intuitive that a "find" should be CPU intensive, but the Python interpreter is slow enough to contradict our intuition. If the query is fast and there's no latency to MongoDB on localhost, MongoDB can easily outperform the Python client. The experiment you just ran, substituting subprocesses for threads, confirms that Python performance is the bottleneck.
To ensure maximum throughput, make sure you have C extensions enabled: pymongo.has_c() == True. With that in place, PyMongo runs as fast as a Python client library can. To get more throughput, move to multiprocessing.
If your expected real-world scenario involves more time-consuming queries, or a remote MongoDB with some network latency, multithreading may give you some performance increase.
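The threads-versus-processes effect can be reproduced without MongoDB. The sketch below is an illustration under assumptions: cpu_bound is a hypothetical pure-Python workload standing in for the client-side work, and timed_map is an illustrative helper. On a standard CPython build with the GIL, the thread pool is expected to take about as long as running the tasks one after another, while the process pool can use several cores. Actual timings depend on the machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    # Pure-Python loop: holds the GIL for its whole duration.
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(executor_cls, inputs, max_workers=4):
    # Run cpu_bound over inputs with the given executor class and time it.
    start = time.perf_counter()
    with executor_cls(max_workers=max_workers) as executor:
        results = list(executor.map(cpu_bound, inputs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    inputs = [2_000_000] * 4
    _, t_threads = timed_map(ThreadPoolExecutor, inputs)
    _, t_procs = timed_map(ProcessPoolExecutor, inputs)
    print(f"threads: {t_threads:.2f}s  processes: {t_procs:.2f}s")
```

If the work were dominated by waiting on the network instead, the thread pool would be expected to do much better, which matches the remark about remote or slow queries.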
You can also use MongoDB indexes to optimize your queries:
https://docs.mongodb.com/manual/tutorial/optimize-query-performance-with-indexes-and-projections/
