Simulating many agents in PyTorch using multiprocessing

I want to simulate multiple reinforcement learning agents that are coded using PyTorch. The agents do not share any data dynamically, so I expect the task to be "embarrassingly parallel". I need a lot of simulations (I want to see what distribution my agents converge to), so I hope to speed them up using multiprocessing.
I have a model class that stores all the parameters of my agents (which are the same across agents) and the environment. I can simulate N agents over T periods using
model.simulate(N = 10, T = 50)
My class would then run the simulation loops and store all the networks and simulation histories. I am very new to parallel programming, so I (naively) tried the following:
import torch.multiprocessing as mp

num_processes = 6
processes = []
for _ in range(num_processes):
    p = mp.Process(target=model.simulate(N = 10, T = 50), args=())
    p.start()
    processes.append(p)
for p in processes:
    p.join()
For now I do not even try to store results; I just want to see some speed-up. But the time it takes to run the code above is roughly the same as when I simply run a loop and do 6 simulations consecutively:
for _ in range(num_processes):
    model.simulate(N = 10, T = 50)
I also tried to make processes for different instances of the model class, but it did not help.

It looks like your problem is in this line:
p = mp.Process(target=model.simulate(N = 10, T = 50), args= ())
The part model.simulate(N = 10, T = 50) is executed first, and then the result (presumably None, if the method has no return value) is passed to mp.Process as the target parameter. So you are doing all the computation sequentially, not performing it on the new processes.
What you need to do instead is to pass the simulate function (without executing it) and provide the args separately.
i.e. something like...
p = mp.Process(target=model.simulate, args=(10, 50))
Providing target=model.simulate passes a reference to the function itself rather than executing it and passing the result. This way it will be executed on the new process and you should achieve the parallelism.
See the official docs for an example.
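Putting it together, a minimal sketch of the corrected launcher (the Model class here is a hypothetical stand-in for the question's model class; keyword arguments go through kwargs):

import torch.multiprocessing as mp

class Model:
    # hypothetical stand-in for the question's model class
    def simulate(self, N, T):
        pass  # run N agents for T periods

if __name__ == '__main__':
    model = Model()
    num_processes = 6
    processes = []
    for _ in range(num_processes):
        # pass the bound method itself; do not call it here
        p = mp.Process(target=model.simulate, kwargs={'N': 10, 'T': 50})
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # wait for all workers only after starting them all

Each worker gets its own copy of model, which is fine here since the agents share no data dynamically.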

Related

Multiprocessing and lists in Python

I have a list of jobs, but due to a certain condition not all of the jobs should run in parallel at the same time, because sometimes it is important that a finishes before I start b, or vice versa (actually it's not important which one runs first, just that they do not run at the same time). So I thought I would keep a list of the currently running threads, and whenever a new one starts, it checks this list of currently running threads to see whether it can proceed or not. I wrote some sample code for that:
from time import sleep
from multiprocessing import Pool

def square_and_test(x):
    print(running_list)
    if not x in running_list:
        running_list = running_list.append(x)
        sleep(1)
        result_list = result_list.append(x**2)
        running_list = running_list.remove(x)
    else:
        print(f'{x} is currently worked on')

task_list = [1,2,3,4,1,1,4,4,2,2]
running_list = []
result_list = []
pool = Pool(2)
pool.map(square_and_test, task_list)
print(result_list)
This code fails with UnboundLocalError: local variable 'running_list' referenced before assignment, so I guess my workers don't have access to global variables. Is there a way around this? If not, is there another way to solve this problem?
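A minimal sketch of one possible workaround, assuming shared state via multiprocessing.Manager is acceptable here: Manager lists are proxy objects that worker processes can actually mutate, and they are passed to the worker explicitly instead of relying on globals. (This only illustrates the sharing mechanism; the check-then-append is still racy without a lock.)

from time import sleep
from multiprocessing import Manager, Pool

def square_and_test(x, running_list, result_list):
    if x not in running_list:
        running_list.append(x)   # mutate the proxy in place, no rebinding
        sleep(1)
        result_list.append(x ** 2)
        running_list.remove(x)
    else:
        print(f'{x} is currently worked on')

if __name__ == '__main__':
    with Manager() as manager:
        running_list = manager.list()
        result_list = manager.list()
        task_list = [1, 2, 3, 4, 1, 1, 4, 4, 2, 2]
        with Pool(2) as pool:
            pool.starmap(square_and_test, [(x, running_list, result_list) for x in task_list])
        print(list(result_list))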

Slow multiprocessing when using torch

I have a function func that takes about 9 seconds to run on average. But when I try to use multiprocessing to parallelize it (even using torch.multiprocessing), each inference takes about 20 seconds on average. Why is that?
func is an inference function that takes a patient_name and runs a torch model in inference mode on that patient's data.
import numpy as np
import torch
from torch.multiprocessing import Pool

device = torch.device('cpu')

def func(patient_name):
    data = np.load(my_dict[patient_name]['data_path'])
    model_state = torch.load(my_dict[patient_name]['model_state_path'], map_location='cpu')
    model = my_net(my_dict[patient_name]['HPs'])
    model = model.to(device)
    model.load_state_dict(model_state)
    model.eval()
    result = model(torch.FloatTensor(data).to(device))
    return result

core_cnt = 10
pool = Pool(core_cnt)
out = pool.starmap(func, pool_args)
Please check whether inference on your model architecture, with the data you provide, already uses substantial computational power. If so, the OS process scheduler has to switch between the processes, which takes additional time.
Also, you load the model every time inside your function. It is almost always faster to copy a data object between processes than to read it from disk (that is the strategy with plain multiprocessing), or better, share the model altogether with torch.multiprocessing.
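A rough sketch of that second point, reusing my_net, my_dict and patient_names from the question as placeholders: load the model once in the parent, move its parameters to shared memory, and let the workers reuse it (this assumes the fork start method, e.g. on Linux).

import numpy as np
import torch
import torch.multiprocessing as mp

# my_net, my_dict and patient_names are placeholders from the question
model = my_net(my_dict['HPs'])
model.load_state_dict(torch.load(my_dict['model_state_path'], map_location='cpu'))
model.eval()
model.share_memory()  # move parameters into shared memory

def func(patient_name):
    data = np.load(my_dict[patient_name]['data_path'])
    with torch.no_grad():  # inference only
        return model(torch.FloatTensor(data))

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:  # forked workers inherit the shared model
        out = pool.map(func, patient_names)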

How to change global variables when using parallel programming

I am using multiprocessing in my code to do some things in parallel. Actually, in a simplified version of my goal, I want to change some global variables from two different processes running in parallel.
But at the end of the run, the result obtained from mp.Queue is correct, yet the variables themselves are not changed.
Here is a simple version of the code:
import multiprocessing as mp

a = 3
b = 5

# define an example function
def f(length, output):
    global a
    global b
    if length == 5:
        a = length + a
        output.put(a)
    if length == 3:
        b = length + b
        output.put(b)

if __name__ == '__main__':
    # Define an output queue
    output = mp.Queue()
    # Set up a list of processes that we want to run
    processes = []
    processes.append(mp.Process(target=f, args=(5, output)))
    processes.append(mp.Process(target=f, args=(3, output)))
    # Run processes
    for p in processes:
        p.start()
    # Exit the completed processes
    for p in processes:
        p.join()
    # Get process results from the output queue
    results = [output.get() for p in processes]
    print(results)
    print("a:", a)
    print("b:", b)
And below is the output:
[8, 8]
a: 3
b: 5
How can I apply the results of the processes to the global variables? Or how can I run this code with multiprocessing and get the same answer as a simple threaded version?
When you use threading, the two (or more) threads are created within the same process and share its memory (globals).
When you use multiprocessing, a whole new process is created and each one gets its own copy of the memory (globals).
You could look at multiprocessing's Value/Array or Manager to allow pseudo-globals, i.e. shared objects.
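For instance, a minimal sketch of the Value approach applied to the question's example; the shared values are passed as arguments so it also works with the spawn start method:

import multiprocessing as mp

def f(length, a, b, output):
    if length == 5:
        with a.get_lock():   # guard the read-modify-write
            a.value += length
        output.put(a.value)
    if length == 3:
        with b.get_lock():
            b.value += length
        output.put(b.value)

if __name__ == '__main__':
    a = mp.Value('i', 3)     # shared integer, initial value 3
    b = mp.Value('i', 5)
    output = mp.Queue()
    processes = [mp.Process(target=f, args=(5, a, b, output)),
                 mp.Process(target=f, args=(3, a, b, output))]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print([output.get() for _ in processes])  # [8, 8]
    print("a:", a.value)  # a: 8
    print("b:", b.value)  # b: 8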

Why is the performance of concurrent.futures.ProcessPoolExecutor so low?

I'm trying to leverage concurrent.futures.ProcessPoolExecutor in Python3 to process a large matrix in parallel. The general structure of the code is:
from concurrent.futures import ProcessPoolExecutor, as_completed

class X(object):
    # self.matrix: a large scipy csr_matrix, set elsewhere

    def f(self, i, row_i):
        ...  # <cpu-bound process>

    def fetch_multiple(self, ids):
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.f, i, self.matrix.getrow(i)) for i in ids]
            return [f.result() for f in as_completed(futures)]
self.matrix is a large scipy csr_matrix. f is my concurrent function that takes a row of self.matrix and applies a CPU-bound process to it. Finally, fetch_multiple is a function that runs multiple instances of f in parallel and returns the results.
The problem is that after running the script, all CPU cores are less than 50% busy.
Why are all the cores not busy?
I think the problem is the large self.matrix object and the passing of row vectors between processes. How can I solve this problem?
Yes.
The overhead should not be that big, but it is likely the cause of your CPUs appearing idle (although they should be busy passing the data around anyway).
But try the recipe here to pass a "pointer" to the object to the subprocesses using shared memory:
http://briansimulator.org/sharing-numpy-arrays-between-processes/
Quoting from there:
import numpy
from multiprocessing import sharedctypes

# S is an existing float64 numpy array
size = S.size
shape = S.shape
S.shape = size
S_ctypes = sharedctypes.RawArray('d', S)
S = numpy.frombuffer(S_ctypes, dtype=numpy.float64, count=size)
S.shape = shape
Now we can send S_ctypes and shape to a child process in multiprocessing, and convert it back to a numpy array in the child process as follows:
from numpy import ctypeslib
S = ctypeslib.as_array(S_ctypes)
S.shape = shape
It could be tricky to take care of the reference counting, but I suppose numpy.ctypeslib takes care of that. So, just coordinate the passing of the actual row numbers to the sub-processes in such a way that they don't work on the same data.
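To make the recipe concrete, here is a hedged sketch with a dense float64 matrix (the question's matrix is sparse, so this shows the mechanics only): the shared buffer is handed to a Pool initializer once, and only row indices cross process boundaries.

import numpy
from multiprocessing import Pool, sharedctypes
from numpy import ctypeslib

def init_worker(S_ctypes, shape):
    # rebuild a numpy view over the shared buffer inside each worker
    global S
    S = ctypeslib.as_array(S_ctypes)
    S.shape = shape

def process_row(i):
    return S[i].sum()  # stand-in for the CPU-bound work

if __name__ == '__main__':
    M = numpy.random.rand(100, 1000)
    shape = M.shape
    M.shape = M.size  # flatten, as in the recipe
    S_ctypes = sharedctypes.RawArray('d', M)
    with Pool(initializer=init_worker, initargs=(S_ctypes, shape)) as pool:
        results = pool.map(process_row, range(shape[0]))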

What can be slowing down my program when I use multithreading?

I'm writing a program that downloads data from a website (eve-central.com). It returns XML when I send a GET request with some parameters. The problem is that I need to make about 7080 such requests, because I can't specify the typeid parameter more than once.
import urllib3
import xmltodict

def get_data_eve_central(typeids, system, hours, minq=1, thread_count=1):
    pool = urllib3.HTTPConnectionPool('api.eve-central.com')
    for typeid in typeids:
        r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
        answer = xmltodict.parse(r.data)
It was really slow when I just connected to the website and made all the requests, so I decided to use multiple threads at a time (I read that if the process involves a lot of waiting (I/O, HTTP requests), it can be sped up a lot with multithreading). I rewrote it using multiple threads, but it somehow isn't any faster (a bit slower, in fact). Here's the code rewritten using multithreading:
import threading
import time
import urllib3
import xmltodict

def get_data_eve_central(all_typeids, system, hours, minq=1, thread_count=1):
    if thread_count > len(all_typeids): raise NameError('TooManyThreads')

    def requester(typeids):
        pool = urllib3.HTTPConnectionPool('api.eve-central.com')
        for typeid in typeids:
            r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
            answer = xmltodict.parse(r.data)['evec_api']['quicklook']
            answers.append(answer)

    def chunkify(items, quantity):
        chunk_len = len(items) // quantity
        rest_count = len(items) % quantity
        chunks = []
        for i in range(quantity):
            chunk = items[:chunk_len]
            items = items[chunk_len:]
            if rest_count and items:
                chunk.append(items.pop(0))
                rest_count -= 1
            chunks.append(chunk)
        return chunks

    t = time.clock()
    threads = []
    answers = []
    for typeids in chunkify(all_typeids, thread_count):
        threads.append(threading.Thread(target=requester, args=[typeids]))
        threads[-1].start()
        threads[-1].join()  # join() blocks here, so each thread finishes before the next one starts
    print(time.clock() - t)
    return answers
What I do is divide all the typeids into as many chunks as the number of threads I want to use, and create a thread for each chunk to process it. The question is: what can be slowing it down? (I apologise for my bad English.)
Python has a Global Interpreter Lock (GIL), and it can be your problem: CPython cannot run Python threads in a genuinely parallel way. You could switch to another language, or stay with Python and use process-based parallelism for your task. Here is a nice presentation: Inside the Python GIL.
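A minimal sketch of that process-based route, with fetch_one as a hypothetical stand-in for one request-plus-parse from the question's requester. (Note also that the original loop joins each thread right after starting it, so the threads actually run one at a time; a pool starts and joins its workers for you.)

from multiprocessing import Pool
import urllib3
import xmltodict

def fetch_one(typeid):
    pool = urllib3.HTTPConnectionPool('api.eve-central.com')
    r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid})
    return xmltodict.parse(r.data)

if __name__ == '__main__':
    typeids = [34, 35, 36]  # example ids
    with Pool(processes=8) as p:
        answers = p.map(fetch_one, typeids)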
