Second GPU not utilized in multi-GPU PyTorch script

I'm trying to use two processes to speed up a script that runs on a sequence of images (each image is its own optimization problem).
I'm using torch.multiprocessing to spawn two processes. Each process initializes its tensors, models, and optimizers on a different GPU:
import numpy as np
import torch.multiprocessing as mp
if __name__ == '__main__':
    num_processes = 2
    processes = []
    img_list = [...]
    img_indices = np.arange(0, len(img_list))
    for gpu_idx in range(num_processes):
        # Round-robin split: process k gets images k, k + num_processes, ...
        subindices = img_indices[gpu_idx::num_processes]
        p = mp.Process(target=my_single_gpu_optimization_func, args=(img, img_list, subindices, gpu_idx))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
Inside my_single_gpu_optimization_func, I define the target device as:
device = f'cuda:{gpu}'
model = MyModel(device=device)
The idea is that each GPU processes half of the images.
So when running, I expect to see both GPUs loaded, but in practice, I see that the memory usage on the first GPU doubles, compared to the single-GPU use case, and the runtime halves. The second GPU seems to be idle.
Why am I unable to utilize both GPUs and double my throughput?

What seems to work is to set CUDA_VISIBLE_DEVICES inside each process/thread function.
So:
os.environ['CUDA_VISIBLE_DEVICES'] = f'{gpu}'
my_model.to('cuda:0')
This seems very crude. I might as well just run two instances of my code from the command line this way. Is there a cleaner way of doing this without setting environment variables?
(BTW, I'm not sure overriding environment variables would work with threads; I really do need to fork a separate process for this solution to work.)
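If the environment-variable route is kept, one way to make it less ad hoc is to set the variable at the very top of each worker, before anything in that process initializes CUDA. The sketch below is deliberately framework-free so it runs anywhere: `shard`, `worker`, and the queue-based reporting are illustrative names, not part of the original script, and the place where the real optimization would go is only a comment. It shows that each child process has its own copy of the environment, so siblings and the parent are unaffected.

```python
import multiprocessing as mp
import os


def shard(items, rank, world_size):
    """Round-robin split: worker `rank` gets items rank, rank + world_size, ..."""
    return items[rank::world_size]


def worker(rank, world_size, items, queue):
    # Must happen before the first CUDA call in this process; the
    # framework then sees a single device and addresses it as cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    my_items = shard(items, rank, world_size)
    # ... build the model on 'cuda:0' and optimize each item here ...
    queue.put((rank, os.environ["CUDA_VISIBLE_DEVICES"], my_items))


if __name__ == "__main__":
    items = list(range(7))  # stand-in for the image list
    world_size = 2
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, world_size, items, queue))
             for r in range(world_size)]
    for p in procs:
        p.start()
    # Drain the queue before joining; the timeout avoids waiting forever.
    reports = sorted(queue.get(timeout=30) for _ in procs)
    for p in procs:
        p.join()
    for rank, visible, my_items in reports:
        print(f"worker {rank}: CUDA_VISIBLE_DEVICES={visible}, items={my_items}")
```

For a route that does not touch the environment, PyTorch also provides torch.cuda.set_device, which can be called with the GPU index at the start of each worker. Whether that alone fixes the original script depends on how MyModel uses its device argument, which the question does not show.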

Related

Python Multiprocessing GPU receiving SIGSEGV message

I am looking to run two processes at the same time. The processes use AI models, one of which is almost 1 GB. I have researched this, and it seems that the best way is to use multiprocessing. This is a Linux server with an 8-core CPU and one GPU. Because of the model size, I need the GPU to process these files. archivo_diar is the path to the file and modelo is loaded beforehand. The code goes like this:
from multiprocessing import Process
def diariza(archivo_diar, pipeline):
    dz = pipeline(archivo_diar, pipeline)
def transcribe_archivo(archivo_diar, modelo):
    resultado = modelo.transcribe(archivo_diar)
    print(resultado)
p1 = Process(target=transcribe_archivo, args=(archivo_diar, modelo))
p1.start()
After p1.start() is run, I get the following message:
SIGSEGV received at time = 16766367473 on cpu 7*
PC: # 0x7fb2c29705 144 GOMP_parallel
What I have found so far is that it is a memory-related problem, but I have not seen any case related to multiprocessing. As I understand it, the child process inherits memory from the main process, and modelo, which is the heavy file, is already loaded in memory, so that should not be the cause.
As you can see, the two processes (in the functions) are different. From what I read, the best approach in those cases is to use Pool. I also tried using Pool like this:
pool = Pool(processes=4)
pool.imap_unordered(transcribe_archivo, [archivo_diar, modelo])
And I got the following error.
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use 'spawn' start method.
I tried using
multiprocessing.set_start_method('spawn')
and when I call pool.join() it hangs forever.
Does anyone know the reason for this?
Thanks.
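The RuntimeError above asks for the 'spawn' start method. A minimal sketch of that pattern follows, under stated assumptions: `load_model` is a hypothetical stand-in for loading the real model, and `transcribe_worker` and `run` are illustrative names. The idea shown is that a spawned child starts from a fresh interpreter, so the model is loaded inside the child rather than inherited, only picklable arguments are passed, and the launch sits under an `if __name__ == "__main__":` guard. It is not a diagnosis of the SIGSEGV or of the hang.

```python
import multiprocessing as mp


def load_model():
    """Hypothetical stand-in for loading the real (CUDA) model."""
    return str.upper


def transcribe_worker(path, queue):
    # Load inside the child: under 'spawn' nothing CUDA-related is
    # inherited from the parent, and only picklable arguments cross over.
    model = load_model()
    queue.put((path, model(path)))


def run(paths):
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=transcribe_worker, args=(p, queue)) for p in paths]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so a child is never blocked on put().
    results = dict(queue.get(timeout=60) for _ in procs)
    for proc in procs:
        proc.join()
    return results


if __name__ == "__main__":
    print(run(["a.wav", "b.wav"]))
```

Using `get_context("spawn")` keeps the choice local to this code instead of changing the start method for the whole program.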

Multiprocessing with Multiple Functions: Need to add a function to the pool from within another function

I am measuring the metrics of an encryption algorithm that I designed. I have declared 2 functions and a brief sample is as follows:
import sys, random, timeit, psutil, os, time
from multiprocessing import Process
from subprocess import check_output
pid=0
def cpuUsage():
    global running
    while pid == 0:
        time.sleep(1)
    running=True
    p = psutil.Process(pid)
    while running:
        print(f'PID: {pid}\t|\tCPU Usage: {p.memory_info().rss/(1024*1024)} MB')
        time.sleep(1)
def Encryption():
    global pid, running
    pid = os.getpid()
    myList=[]
    for i in range(1000):
        myList.append(random.randint(-sys.maxsize,sys.maxsize)+random.random())
    print('Now running timeit function for speed metrics.')
    p1 = Process(target=metric_collector())
    p1.start()
    p1.join()
    number=1000
    unit='msec'
    setup = '''
import homomorphic,random,sys,time,os,timeit
myList={myList}
'''
    enc_code='''
for x in range(len(myList)):
    myList[x] = encryptMethod(a, b, myList[x], d)
'''
    dec_code='''
\nfor x in range(len(myList)):
    myList[x] = decryptMethod(myList[x])
'''
    time=timeit.timeit(setup=setup,
                       stmt=(enc_code+dec_code),
                       number=number)
    running=False
    print(f'''Average Time:\t\t\t {time/number*.0001} seconds
Total time for {number} Iters:\t\t\t {time} {unit}s
Total Encrypted/Decrypted Values:\t {number*len(myList)}''')
    sys.exit()
if __name__ == '__main__':
    print('Beginning Metric Evaluation\n...\n')
    p2 = Process(target=Encryption())
    p2.start()
    p2.join()
I am sure there's an implementation error in my code, I'm just having trouble grabbing the PID for the encryption method and I am trying to make the overhead from other calls as minimal as possible so I can get an accurate reading of just the functionality of the methods being called by timeit. If you know a simpler implementation, please let me know. Trying to figure out how to measure all of the metrics has been killing me softly.
I've tried acquiring the pid a few different ways, but I only want to measure performance when timeit is run. Good chance I'll have to break this out separately and run it that way (instead of multiprocessing) to evaluate the function properly, I'm guessing.
There are at least three major problems with your code. The net result is that you are not actually doing any multiprocessing.
The first problem is here, and in a couple of other similar places:
p2 = Process(target=Encryption())
What this code passes to Process is not the function Encryption but the returned value from Encryption(). It is exactly the same as if you had written:
x = Encryption()
p2 = Process(target=x)
What you want is this:
p2 = Process(target=Encryption)
This code tells Python to create a new Process and execute the function Encryption() in that Process.
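A tiny, self-contained illustration of the difference (the function name `work` is made up for the example):

```python
import multiprocessing as mp


def work():
    print("work() is running")


if __name__ == "__main__":
    # Wrong: work() runs right here in the parent, and its return
    # value (None) becomes the target, so the child has nothing to do.
    wrong = mp.Process(target=work())
    # Right: pass the function object itself; it is called in the child.
    right = mp.Process(target=work)
    wrong.start()
    right.start()
    wrong.join()
    right.join()
```

The message is printed twice, but only the second print comes from a child process; the first one happens in the parent while the `wrong` Process object is still being constructed.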
The second problem has to do with the way Python handles memory for Processes. Each Process lives in its own memory space. Each Process has its own local copy of global variables, so you cannot set a global variable in one Process and have another Process be aware of this change. There are mechanisms to handle this important situation, documented in the multiprocessing module. See the section titled "Sharing state between processes." The bottom line here is that you cannot simply set a global variable inside a Process and expect other Processes to see the change, as you are trying to do with pid. You have to use one of the approaches described in the documentation.
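As a rough sketch of what "sharing state" can look like here, the example below uses multiprocessing.Value for the PID and multiprocessing.Event for the stop signal. The function names and the sleep that stands in for the timed workload are assumptions for illustration, and the monitor only counts samples where a real one would query the process.

```python
import multiprocessing as mp
import os
import time


def encryption(pid_holder, done):
    pid_holder.value = os.getpid()  # published through shared memory
    time.sleep(0.2)                 # stand-in for the timed workload
    done.set()                      # tell the monitor to stop


def monitor(pid_holder, done, sample_count):
    # Wait until the worker has published its PID (or already finished).
    while pid_holder.value == 0 and not done.is_set():
        time.sleep(0.01)
    while not done.is_set():
        sample_count.value += 1     # a real monitor would sample the PID here
        time.sleep(0.05)


if __name__ == "__main__":
    pid_holder = mp.Value("i", 0)   # shared C int, 0 means "not set yet"
    sample_count = mp.Value("i", 0)
    done = mp.Event()
    procs = [
        mp.Process(target=encryption, args=(pid_holder, done)),
        mp.Process(target=monitor, args=(pid_holder, done, sample_count)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(f"worker PID {pid_holder.value}, samples taken: {sample_count.value}")
```

Unlike a plain global, the Value and the Event are backed by memory and synchronization primitives that both processes can see, which is why they have to be passed to each Process explicitly.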
The third problem is this code pattern, which occurs for both p1 and p2.
p2 = Process(target=Encryption)
p2.start()
p2.join()
This tells Python to create a Process and to start it. Then you immediately wait for it to finish, which means that your current Process must stop at that point until the new Process is finished. You never allow two Processes to run at once, so there is no performance benefit. The only reason to use multiprocessing is to run two things at the same time, which you never do. You might as well not bother with multiprocessing at all since it is only making your life more difficult.
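To see the effect of the start/join ordering, here is a small timing sketch. The `task` function just sleeps as a stand-in for real work, and all names are made up for the example.

```python
import multiprocessing as mp
import time


def task(seconds):
    time.sleep(seconds)  # stand-in for real work


def run_serial(n, seconds):
    """start() then join() back to back: the children never overlap."""
    t0 = time.perf_counter()
    for _ in range(n):
        p = mp.Process(target=task, args=(seconds,))
        p.start()
        p.join()
    return time.perf_counter() - t0


def run_overlapped(n, seconds):
    """Start everything first, then join: the children run concurrently."""
    t0 = time.perf_counter()
    procs = [mp.Process(target=task, args=(seconds,)) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - t0


if __name__ == "__main__":
    print(f"serial:     {run_serial(3, 0.5):.2f} s")
    print(f"overlapped: {run_overlapped(3, 0.5):.2f} s")
```

The overlapped run should take roughly the time of one task plus process startup, while the serial run should take roughly three times as long.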
Finally, I am not sure why you have decided to try to use multiprocessing in the first place. The functions that measure memory usage and execution time are almost certainly very fast, and I would expect them to be much faster than any method of synchronizing one Process to another. If you're worried about errors due to the time used by the diagnostic functions themselves, I doubt that you can make things better by multiprocessing. Why not just start with a simple program and see what results you get?
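A simple single-process starting point might look like the sketch below. `encrypt` and `decrypt` are hypothetical stand-ins for the real methods, and memory is measured with the standard-library tracemalloc, which tracks Python allocations rather than the process RSS that psutil reports, so the numbers are not directly comparable to the original approach.

```python
import random
import sys
import timeit
import tracemalloc


def encrypt(x):   # hypothetical stand-in for encryptMethod
    return x * 3 + 1


def decrypt(y):   # hypothetical stand-in for decryptMethod
    return (y - 1) / 3


def round_trip(values):
    return [decrypt(encrypt(v)) for v in values]


def measure(values, number=100):
    """Return (seconds per call, peak traced bytes) for round_trip."""
    total = timeit.timeit(lambda: round_trip(values), number=number)
    # Memory is traced in a separate run so tracing overhead does not
    # distort the timing above.
    tracemalloc.start()
    round_trip(values)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total / number, peak


if __name__ == "__main__":
    data = [random.randint(-sys.maxsize, sys.maxsize) + random.random()
            for _ in range(1000)]
    per_call, peak = measure(data)
    print(f"{per_call * 1e3:.3f} ms per call, peak {peak / 1024:.1f} KiB")
```

Passing a callable to timeit.timeit avoids building the workload as source-code strings, which also removes the need to import the module inside a setup string.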

Why does implementing multiprocessing make my program slower?

I'm trying to implement multiprocessing in my code to make it faster.
To make it easier to understand I will just say the program fits an observed curve using a linear combination of a library of curves and from that measures properties of the observed curve.
I have to do this for over 400 curves and in order to estimate the errors of these properties I perform a Monte Carlo simulation, which means I have to iterate a number of times each calculation.
This takes a lot of time and work, and since I believe it is a CPU-bound task, I figured I'd use multiprocessing in the error estimation step. Here's a simplification of my code:
Without multiprocessing
import numpy as np
import fitting_package
import multiprocessing
from collections import defaultdict
def estimate_errors(best_fit_curve, signal_to_noise, fit_kwargs, iterations=100):
    results = defaultdict(list)
    def fit(best_fit_curve, signal_to_noise, fit_kwargs, results):
        # Here noise is added to simulate a new curve (Monte Carlo simulation)
        noise = best_fit_curve/signal_to_noise
        simulated_curve = np.random.normal(best_fit_curve, noise)
        # The arguments from the original fit (outside the error estimation) are passed to the fitting
        fit_kwargs.update({'curve' : simulated_curve})
        # The fit is performed and it returns the properties packed together
        solutions = fitting_package(**fit_kwargs)
        # There are more properties so this is a simplification
        property_1, property_2 = solutions
        aux_dict = {'property_1' : property_1, 'property_2' : property_2}
        for key, value in aux_dict.items():
            results[key].append(value)
    for _ in range(iterations):
        fit(best_fit_curve, signal_to_noise, fit_kwargs, results)
    return results
With multiprocessing
def estimate_errors(best_fit_curve, signal_to_noise, fit_kwargs, iterations=100):
    def fit(best_fit_curve, signal_to_noise, fit_kwargs, queue):
        results = queue.get()
        noise = best_fit_curve/signal_to_noise
        simulated_curve = np.random.normal(best_fit_curve, noise)
        fit_kwargs.update({'curve' : simulated_curve})
        solutions = fitting_package(**fit_kwargs)
        property_1, property_2 = solutions
        aux_dict = {'property_1' : property_1, 'property_2' : property_2}
        for key, value in aux_dict.items():
            results[key].append(value)
        queue.put(results)
    process_list = []
    queue = multiprocessing.Queue()
    queue.put(defaultdict(list))
    for _ in range(iterations):
        process = multiprocessing.Process(target=fit, args=(best_fit_curve, signal_to_noise, fit_kwargs, queue))
        process.start()
        process_list.append(process)
    for p in process_list:
        p.join()
    results = queue.get()
    return results
I thought using multiprocessing would save time, but it actually takes more than twice as long as the other way. Why is this? Is there any way I can make it faster with multiprocessing?
I thought using multiprocessing would save time, but it actually takes more than double than the other way to do it. Why is this?
Starting a process takes a long time (at least in computer terms). It also uses a lot of memory.
In your code, you are starting 100 separate Python interpreters in 100 separate OS processes. That takes a really long time, so unless each process runs a very long time, the time it takes to start the process is going to dominate the time it actually does useful work.
In addition to that, unless you actually have 100 un-used CPU cores, those 100 processes will just spend most of their time waiting for each other to finish. Even worse, since they all have the same priority, the OS will try to give each of them a fair amount of time, so it will run them for a bit of time, then suspend them, run others for a bit of time, suspend them, etc. All this scheduling also takes time.
Having more parallel workloads than parallel resources cannot speed up your program, since they have to wait to be executed one-after-another anyway.
Parallelism will only speed up your program if the time for the parallel tasks is not dominated by the time of creating, managing, scheduling, and re-joining the parallel tasks.
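One common way to act on this is a fixed-size pool: a handful of worker processes is started once and reused for all iterations, instead of one process per iteration. The sketch below is only an illustration under assumptions: `simulate_once` is a made-up stand-in for "add noise and refit", it returns a single number instead of the real fitted properties, and it uses the standard library instead of NumPy so that it runs on its own. The worker function lives at module level so it can be sent to the pool.

```python
import multiprocessing as mp
import random
import statistics


def simulate_once(args):
    """One Monte Carlo draw; hypothetical stand-in for noise + refit."""
    seed, best_fit, signal_to_noise = args
    rng = random.Random(seed)  # per-job seed keeps draws independent
    noisy = [rng.gauss(y, abs(y) / signal_to_noise) for y in best_fit]
    return statistics.fmean(noisy)  # pretend this is a fitted property


def estimate_errors(best_fit, signal_to_noise, iterations=100, workers=None):
    jobs = [(seed, best_fit, signal_to_noise) for seed in range(iterations)]
    # workers=None lets Pool pick a size from the CPU count it detects;
    # the same few processes then handle all the jobs.
    with mp.Pool(processes=workers) as pool:
        values = pool.map(simulate_once, jobs)
    return statistics.stdev(values)


if __name__ == "__main__":
    print(estimate_errors([1.0, 2.0, 3.0, 4.0], signal_to_noise=20.0))
```

Whether this beats the serial loop still depends on how long one fit takes compared with the cost of shipping its inputs and outputs between processes, so it is worth timing both.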

Why does the PyTorch .backward() method occupy two more CPU threads when I want to restrict it to only one CPU thread?

I am trying to deploy my service on CPU with PyTorch. I have been asked to restrict the single-process service to one CPU thread, so I call torch.set_num_interop_threads(1) and torch.set_num_threads(1). However, when I check the actual number of threads with pstree <pid>, I see three. I later found that the extra threads come from a.backward(): if I delete that line I get one thread, and if I keep it I get three. Why does this happen, and what should I do to keep the process at one thread?
I wrote a simple python3 script to reproduce the problem:
import torch
import time
while(1):
    w = torch.Tensor([2])
    w = torch.autograd.Variable(w, requires_grad=True)
    x = torch.rand([1])
    x = torch.autograd.Variable(x, requires_grad=True)
    y = x**w
    y.backward()
    time.sleep(3)
Run it and use pstree <pid> to check the threads: I get python───2*[{python}], but if I delete y.backward() I get only python. I believe the two extra threads in this script have the same cause as in my service code.

Python multiprocessing: with and without pooling

I'm trying to understand Python's multiprocessing, and have devised the following code to test it:
import multiprocessing
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1)+F(n-2)
def G(n):
    print(f'Fibonacci of {n}: {F(n)}')
processes = []
for i in range(25, 35):
    processes.append(multiprocessing.Process(target=G, args=(i, )))
for pro in processes:
    pro.start()
When I run it, it tells me that the computing time was roughly 6.65 s.
I then wrote the following code, which I thought to be functionally equivalent to the first:
from multiprocessing.dummy import Pool as ThreadPool
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1)+F(n-2)
def G(n):
    print(f'Fibonacci of {n}: {F(n)}')
in_data = [i for i in range(25, 35)]
pool = ThreadPool(10)
results = pool.map(G, in_data)
pool.close()
pool.join()
and its running time was almost 12 s.
Why does the second take almost twice as long as the first? Aren't they supposed to be equivalent?
(NB. I'm running Python 3.6, but also tested a similar code on 3.52 with same results.)
The reason the second takes twice as long as the first is likely due to the CPython Global Interpreter Lock.
From http://python-notes.curiousefficiency.org/en/latest/python3/multicore_python.html:
[...] the GIL effectively restricts bytecode execution to a single core, thus rendering pure Python threads an ineffective tool for distributing CPU bound work across multiple cores.
As you know, multiprocessing.dummy is a wrapper around the threading module, so you're creating threads, not processes. With a CPU-bound task like this one, the Global Interpreter Lock means the threaded version is not much different from simply executing your Fibonacci calculations sequentially in a single thread (except that you've added some thread-management/context-switching overhead).
With the "true multiprocessing" version, you only have a single thread in each process, each of which is using its own GIL. Hence, you can actually make use of multiple processors to improve the speed.
For this particular processing task, there is no significant advantage to using multiple threads over multiple processes. If you only have a single processor, there is no advantage to using either multiple processes or multiple threads over a single thread/process (in fact, both merely add context-switching overhead to your task).
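For anyone who wants to reproduce the comparison, here is a small sketch that runs the same CPU-bound function through a thread pool and a process pool using the identical map call. The helper name `timed_map`, the input range, and the worker count are arbitrary choices for illustration.

```python
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def timed_map(pool_cls, inputs, workers=4):
    """Run fib over inputs with the given pool class; return (results, seconds)."""
    t0 = time.perf_counter()
    with pool_cls(workers) as pool:
        results = pool.map(fib, inputs)
    return results, time.perf_counter() - t0


if __name__ == "__main__":
    inputs = list(range(20, 26))
    for name, cls in (("threads", ThreadPool), ("processes", Pool)):
        _, seconds = timed_map(cls, inputs)
        print(f"{name:9s}: {seconds:.2f} s")
```

On a standard CPython build with the GIL and several free cores, the process pool should pull ahead as the inputs grow; for inputs this small, process startup overhead may still dominate, so try larger values of n before drawing conclusions.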
(FWIW: A join in the true multiprocessing version is apparently being done automatically by the python runtime so adding an explicit join doesn't seem to make any difference in my tests using time(1). And, by the way, if you did want to add join, you should add a second loop for the join processing. Adding join to the existing loop will simply serialize your processes.)
