Python Multiprocessing GPU receiving SIGSEGV message - python-3.x

I am looking to run 2 processes at the same time. The processes use AI models; one of them is almost 1 GB. I have researched and it seems that the best way is to use multiprocessing. This is a Linux server with an 8-core CPU and one GPU. Due to their size, I need the GPU to process these files. archivo_diar is the path to the file and modelo is loaded beforehand. The code goes like this:
from multiprocessing import Process

def diariza(archivo_diar, pipeline):
    dz = pipeline(archivo_diar, pipeline)

def transcribe_archivo(archivo_modelo, modelo):
    resultado = modelo.transcribe(archivo_diar)
    print(resultado)

p1 = Process(target=transcribe_archivo, args=(archivos_diar, modelo))
p1.start()
After p1.start() is run, I get the following message:
SIGSEGV received at time = 16766367473 on cpu 7
PC: @ 0x7fb2c29705144 GOMP_parallel
What I have found so far is that it is a problem related to memory, but I have not seen any case related to multiprocessing. As I understand it, the child process inherits memory from the main process, and modelo, which is the heavy object, is already loaded in memory, so that should not be the issue.
As you can see, the 2 processes (in the functions) are different; what I read is that in those cases the best approach is to use Pool. I also tried using Pool like this:
pool = Pool(processes=4)
pool.imap_unordered(transcribe_archivo, [archivo_diar, modelo])
And I got the following error.
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use 'spawn' start method.
I tried using
multiprocessing.set_start_method('spawn')
and when I do pool.join() it hangs forever.
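For reference, this is roughly the spawn-based pattern I was trying (a sketch using the same names as above; I am also not sure whether modelo can even be pickled and sent to a spawned worker, which may be related to the hang):

# Sketch of the spawn-based Pool attempt; archivo_diar and modelo are the
# already-loaded path and model from above, not defined here.
import multiprocessing as mp

def transcribe_archivo(archivo_diar, modelo):
    resultado = modelo.transcribe(archivo_diar)
    print(resultado)

if __name__ == '__main__':
    ctx = mp.get_context('spawn')           # spawn instead of fork, as the CUDA error demands
    with ctx.Pool(processes=2) as pool:
        pool.starmap(transcribe_archivo, [(archivo_diar, modelo)])  # one args tuple per task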
Does anyone know the reason for this?
Thanks.

Related

Second gpu not utilized in multi-GPU Pytorch script

I'm trying to use two processes to speed up a script that runs on a sequence of images (each image is its own optimization problem).
I'm using torch.multiprocessing to spawn two processes. Each process initializes tensors, models and optimizers on a different GPU:
import numpy as np
import torch.multiprocessing as mp

if __name__ == '__main__':
    num_processes = 2
    processes = []
    img_list = [...]
    img_indices = np.arange(0, len(img_list))
    for gpu_idx in range(num_processes):
        subindices = img_indices[gpu_idx::num_processes]
        p = mp.Process(target=my_single_gpu_optimization_func, args=(img_list, subindices, gpu_idx))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
Inside my_single_gpu_optimization_func, I define the target device as:
device = f'cuda:{gpu}'
model = MyModel(device=device)
The idea is that each GPU processes half of the images.
So when running, I expect to see both GPUs loaded, but in practice, I see that the memory usage on the first GPU doubles, compared to the single-GPU use case, and the runtime halves. The second GPU seems to be idle.
Why am I unable to utilize both GPUs and double my throughput?
What seems to work is to set CUDA_VISIBLE_DEVICES inside each process/thread function.
So:
os.environ['CUDA_VISIBLE_DEVICES'] = f'{gpu}'
my_model.to('cuda:0')
This seems very crude. I might as well just run two instances of my code from the command line this way. Is there a cleaner way of doing this without setting environment variables?
(BTW, I'm not sure overriding environment variables would work with threads, I really do need to fork a separate process for this solution to work)
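One possibly cleaner sketch (untested, and optimize_single_image below is a hypothetical stand-in for the per-image optimization step) is to pin the device explicitly at the top of each worker with torch.cuda.set_device and use the spawn start method, instead of overriding CUDA_VISIBLE_DEVICES:

# Sketch: pin each worker process to its own GPU explicitly.
import torch
import torch.multiprocessing as mp

def my_single_gpu_optimization_func(img_list, subindices, gpu_idx):
    torch.cuda.set_device(gpu_idx)        # make cuda:<gpu_idx> the current device in this process
    device = f'cuda:{gpu_idx}'
    model = MyModel(device=device)        # MyModel as in the question, built on that device
    for i in subindices:
        optimize_single_image(model, img_list[i], device)  # hypothetical per-image step

if __name__ == '__main__':
    mp.set_start_method('spawn')          # avoids re-initializing CUDA in forked children
    # ...create and start the two Process objects exactly as in the loop above...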

Multiprocessing with Multiple Functions: Need to add a function to the pool from within another function

I am measuring the metrics of an encryption algorithm that I designed. I have declared 2 functions and a brief sample is as follows:
import sys, random, timeit, psutil, os, time
from multiprocessing import Process
from subprocess import check_output

pid = 0

def cpuUsage():
    global running
    while pid == 0:
        time.sleep(1)
    running = True
    p = psutil.Process(pid)
    while running:
        print(f'PID: {pid}\t|\tCPU Usage: {p.memory_info().rss/(1024*1024)} MB')
        time.sleep(1)

def Encryption():
    global pid, running
    pid = os.getpid()
    myList = []
    for i in range(1000):
        myList.append(random.randint(-sys.maxsize, sys.maxsize) + random.random())
    print('Now running timeit function for speed metrics.')
    p1 = Process(target=metric_collector())
    p1.start()
    p1.join()
    number = 1000
    unit = 'msec'
    setup = '''
import homomorphic,random,sys,time,os,timeit
myList={myList}
'''
    enc_code = '''
for x in range(len(myList)):
    myList[x] = encryptMethod(a, b, myList[x], d)
'''
    dec_code = '''
\nfor x in range(len(myList)):
    myList[x] = decryptMethod(myList[x])
'''
    time = timeit.timeit(setup=setup,
                         stmt=(enc_code + dec_code),
                         number=number)
    running = False
    print(f'''Average Time:\t\t\t {time/number*.0001} seconds
Total time for {number} Iters:\t\t\t {time} {unit}s
Total Encrypted/Decrypted Values:\t {number*len(myList)}''')
    sys.exit()

if __name__ == '__main__':
    print('Beginning Metric Evaluation\n...\n')
    p2 = Process(target=Encryption())
    p2.start()
    p2.join()
I am sure there's an implementation error in my code; I'm just having trouble grabbing the PID for the encryption method, and I am trying to keep the overhead from other calls as small as possible so I can get an accurate reading of just the methods being called by timeit. If you know a simpler implementation, please let me know. Trying to figure out how to measure all of the metrics has been killing me softly.
I've tried acquiring the pid a few different ways, but I only want to measure performance when timeit is run. Good chance I'll have to break this out separately and run it that way (instead of multiprocessing) to evaluate the function properly, I'm guessing.
There are at least three major problems with your code. The net result is that you are not actually doing any multiprocessing.
The first problem is here, and in a couple of other similar places:
p2 = Process(target=Encryption())
What this code passes to Process is not the function Encryption but the returned value from Encryption(). It is exactly the same as if you had written:
x = Encryption()
p2 = Process(target=x)
What you want is this:
p2 = Process(target=Encryption)
This code tells Python to create a new Process and execute the function Encryption() in that Process.
The second problem has to do with the way Python handles memory for Processes. Each Process lives in its own memory space. Each Process has its own local copy of global variables, so you cannot set a global variable in one Process and have another Process be aware of this change. There are mechanisms to handle this important situation, documented in the multiprocessing module. See the section titled "Sharing state between processes." The bottom line here is that you cannot simply set a global variable inside a Process and expect other Processes to see the change, as you are trying to do with pid. You have to use one of the approaches described in the documentation.
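For example, a minimal sketch (not the original code) of the PID handoff using a multiprocessing.Value, which both Processes can see:

# Sketch: publish the worker's PID through a shared Value instead of a global.
from multiprocessing import Process, Value
import os, time

def Encryption(shared_pid):
    shared_pid.value = os.getpid()     # visible to the parent and to sibling Processes
    # ... do the timed work ...

def cpuUsage(shared_pid):
    while shared_pid.value == 0:       # wait until the worker has published its PID
        time.sleep(0.1)
    print('monitoring PID', shared_pid.value)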
The third problem is this code pattern, which occurs for both p1 and p2.
p2 = Process(target=Encryption)
p2.start()
p2.join()
This tells Python to create a Process and to start it. Then you immediately wait for it to finish, which means that your current Process must stop at that point until the new Process is finished. You never allow two Processes to run at once, so there is no performance benefit. The only reason to use multiprocessing is to run two things at the same time, which you never do. You might as well not bother with multiprocessing at all since it is only making your life more difficult.
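Continuing the sketch above, the pattern that actually overlaps the two Processes starts both before joining either:

# Continuation of the sketch: start both Processes before joining either.
if __name__ == '__main__':
    pid = Value('i', 0)                          # 'i' = C int, initialized to 0
    p1 = Process(target=cpuUsage, args=(pid,))
    p2 = Process(target=Encryption, args=(pid,))
    p1.start()
    p2.start()                                   # both are now running concurrently
    p2.join()                                    # wait for the timed work to finish
    p1.join()                                    # then wait for the monitor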
Finally I am not sure why you have decided to try to use multiprocessing in the first place. The functions that measure memory usage and execution time are almost certainly very fast, and I would expect them to be much faster than any method of synchronizing one Process to another. If you're worried about errors due to the time used by the diagnostic functions themselves, I doubt that you can make things better by multiprocessing. Why not just start with a simple program and see what results you get?

Why does calling the multiprocessing module with Process create the same instance?

My platform info:
uname -a
Linux debian 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
python3 --version
Python 3.9.2
Note: adding a lock can make the Singleton class work correctly; that is not my issue, so there is no need to discuss process locks here.
The common parts, which will be executed in the two different scenarios below:
Singleton class
class Singleton(object):
    def __init__(self):
        pass

    @classmethod
    def instance(cls, *args, **kwargs):
        if not hasattr(Singleton, "_instance"):
            Singleton._instance = Singleton(*args, **kwargs)
        return Singleton._instance
The working function run inside each process:
import time, os

def task():
    print("start the process %d" % os.getpid())
    time.sleep(2)
    obj = Singleton.instance()
    print(hex(id(obj)))
    print("end the process %d" % os.getpid())
Creating multiple processes with a process Pool:
from multiprocessing.pool import Pool

with Pool(processes=4) as pool:
    [pool.apply_async(func=task) for item in range(4)]
    # same effect with pool.apply, pool.map, pool.map_async in this example; I have verified it, you can try it yourself
    pool.close()
    pool.join()
The result is as below:
start the process 11986
start the process 11987
start the process 11988
start the process 11989
0x7fd8764e04c0
end the process 11986
0x7fd8764e05b0
end the process 11987
0x7fd8764e0790
end the process 11989
0x7fd8764e06a0
end the process 11988
Each sub-process has its own memory; they don't share address space with each other and don't know that another process has already created an instance, so the output shows different instances.
Creating multiple processes with Process directly:
import multiprocessing

for i in range(4):
    t = multiprocessing.Process(target=task)
    t.start()
The result is as below:
start the process 12012
start the process 12013
start the process 12014
start the process 12015
0x7fb288c21910
0x7fb288c21910
end the process 12014
end the process 12012
0x7fb288c21910
end the process 12013
0x7fb288c21910
end the process 12015
Why does it create the same instance this way? What is the working principle inside the multiprocessing module?
@Reed Jones, I have read the related post you provided many times.
In lmjohns3's answer:
So the net result is actually the same, but in the first case you're guaranteed to run foo and bar on different processes.
The first case is the Process sub-module; Process is guaranteed to run on different processes, so in my case:
import multiprocessing

for i in range(4):
    t = multiprocessing.Process(target=task)
    t.start()
It should result in several (maybe 4, or at least more than 1) different instances instead of the same instance.
I am sure that the material can't explain my case.
As already explained in this answer, the id implementation is platform specific and is not a good way to guarantee unique identifiers across multiple processes.
In CPython specifically, id returns the pointer to the object within its own process address space. Most modern OSes abstract the computer memory using a methodology known as virtual memory.
What you are observing are actually different objects. Nevertheless, they appear to have the same identifiers because each process allocated its object at the same offset of its own memory address space.
The reason why this does not happen with the Pool is most likely due to the fact that the Pool allocates several resources in the worker processes (pipes, counters, etc.) before running the task function. Hence, it randomizes the process address space utilization enough that the object IDs appear different across sibling processes.
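A short sketch (not from the original post) that makes this concrete: forked children can report identical id values for objects that are clearly distinct, because each id is only an address inside that child's own virtual address space.

# Sketch: identical id() values in different forked processes still mean different objects.
import multiprocessing, os

def show_id():
    obj = object()                     # a fresh object created in this process only
    print(os.getpid(), hex(id(obj)))   # the same hex value can appear in several children

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=show_id) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()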

Why does the PyTorch .backward() method occupy two more CPU threads when I want to restrict it to use only one CPU thread?

I am trying to deploy my service using CPU and PyTorch. I am asked to restrict the single-process service to use only one CPU thread. I have set torch.set_num_interop_threads(1) and torch.set_num_threads(1). The problem is that when I check the actual CPU thread count using pstree <pid>, I get three. Later I found that the problem happens in a.backward(): if I delete this line, I get 1 thread; if I keep it, I get 3 CPU threads. Why does this happen and what should I do to keep the CPU thread count at one?
I wrote a simple python3 script to reproduce the problem:
import torch
import time

while True:
    w = torch.Tensor([2])
    w = torch.autograd.Variable(w, requires_grad=True)
    x = torch.rand([1])
    x = torch.autograd.Variable(x, requires_grad=True)
    y = x**w
    y.backward()
    time.sleep(3)
Run it and use pstree <pid> to check the CPU threads: I get python───2*[{python}], but if I delete y.backward() I get only python. I believe the extra two CPU threads in this script come from the same problem as in my service code.

When running Numba code on a CUDA GPU, I notice one of my CPU cores stays at 100%. Is this limiting performance?

I have test code that is computationally intense and I run it on the GPU using Numba. I noticed that while it is running, one of my CPU cores goes to 100% and stays there the whole time. The GPU seems to be at 100% as well (screenshot omitted).
My benchmark code is as follows:
from numba import *
import numpy as np
from numba import cuda
import time
def benchmark():
input_list = np.random.randint(10, size=3200000).astype(np.intp)
output_list = np.zeros(input_list.shape).astype(np.intp)
d_input_array = cuda.to_device(input_list)
d_output_array = cuda.to_device(output_list)
run_test[32, 512](d_input_array, d_output_array)
out = d_output_array.copy_to_host()
print('Result: ' + str(out))
#cuda.jit("void(intp[::1], intp[::1])", fastmath=True)
def run_test(d_input_array, d_output_array):
array_slice_len = len(d_input_array) / (cuda.blockDim.x * cuda.gridDim.x)
thread_coverage = cuda.threadIdx.x * array_slice_len
slice_start = thread_coverage + (cuda.blockDim.x * cuda.blockIdx.x * array_slice_len)
for step in range(slice_start, slice_start + array_slice_len, 1):
if step > len(d_input_array) - 1:
return
count = 0
for item2 in d_input_array:
if d_input_array[step] == item2:
count = count + 1
d_output_array[step] = count
if __name__ == '__main__':
import timeit
# make_multithread(benchmark, 64)
print(timeit.timeit("benchmark()", setup="from __main__ import benchmark", number=1))
You can run the code above to reproduce it if you have Python 3.7, Numba and cudatoolkit installed. I'm on Linux Mint 20.
I have 32 cores - it doesn't seem right to have one at 100% while all the others sit idle.
I'm wondering why that is, and whether there is a way to have the other cores help with whatever is being done, to increase performance.
How can I investigate what is taking 100% of a single core and know what is going on?
CUDA kernel launches (and some other operations) are asynchronous from the point of view of the host thread. And as you say, you're running the computationally intense portion of the work on the GPU.
So the host thread has nothing to do, other than launch some work and wait for it to be finished. The waiting process here is a spin-wait which means the CPU thread is in a tight loop, waiting for a status condition to change.
The CPU thread will hit that spin-wait here:
out = d_output_array.copy_to_host()
which is the line of code after your kernel launch, and it expects to copy (valid) results back from the GPU to the CPU. In order for this to work, the CPU thread must wait there until the results are ready. Numba implements this with a blocking sync operation, between GPU and CPU activity. Therefore, for most of the duration of your program, the CPU thread is actually waiting at that line of code.
This waiting takes up 100% of that thread's activity, and thus one core is seen as fully utilized.
There wouldn't be any sense or reason to try to "distribute" this "work" to multiple threads/cores, so this is not a "performance" issue in the way you are suggesting.
Any CPU profiler that shows hotspots or uses PC sampling should be able to give you a picture of this. That line of code should show up near the top of the list of lines of code most heavily visited by your CPU/core/thread.
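If you want to see this directly, here is a small sketch (not part of the original code) that separates the launch from the wait with numba.cuda.synchronize(); a sampling profiler such as py-spy should then attribute the busy core to the synchronize line rather than to copy_to_host:

# Sketch: split the kernel launch from the wait so the spin-wait sits on its own line.
from numba import cuda

def benchmark_explicit_sync(d_input_array, d_output_array):
    run_test[32, 512](d_input_array, d_output_array)   # asynchronous launch; returns immediately
    cuda.synchronize()                                  # host thread waits (spins) here for the kernel
    return d_output_array.copy_to_host()                # results are already valid, so the copy is short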
