How to use child kernels (CUDA dynamic parallelism) using PyCUDA - python-3.x

My Python code has a GPU kernel function that is called multiple times in a for loop from the host, like this:
for i in range(n):
    gpu_kernel_func(blocksize, grid)
Since each call requires communication between the host and the GPU device, which is not efficient, I want to move the loop into the kernel, like this:
gpu_kernel_function() {
    for (...) {
        computation;
    }
}
But this requires an extra step to make sure all the blocks in the grid stay in sync. With dynamic parallelism, launching a dummy child kernel should ensure that every thread (in the whole grid) finishes that child kernel before the code continues running. So I defined another kernel just like gpu_kernel_function and tried this:
GPUcode = '''
__global__ void gpu_kernel_function() { ... }
__global__ void dummy_child_kernel() { ... }
'''
gpu_kernel_function() {
    for (...) {
        computation;
    }
    dummy_child_kernel();
}
But I am getting this error: "nvcc fatal : Option '--cubin (-cubin)' is not allowed when compiling for a virtual compute architecture"
I am using a Tesla P100 (compute capability 6.0), Python 3.5, and CUDA 8.0.44. I am compiling my SourceModule like this:
mod = SourceModule(GPUcode, options=['-rdc=true', '-lcudart', '-lcudadevrt', '--machine=64'], arch='compute_60')
I tried compute_35 too, which gives the same error.

The error message is explicitly telling you what the issue is. compute_60 is a virtual architecture. You can't statically compile virtual architectures to machine code. They are intended for producing PTX (virtual machine assembler) for JIT translation to machine code by the runtime. PyCUDA compiles code to a binary payload ("cubin") using the CUDA toolchain and then loads it via the driver API into the CUDA context. Thus the error.
You can fix the error by specifying a valid physical GPU target architecture. So you should modify the source module constructor call to something like this:
mod = SourceModule(GPUcode,
                   options=['-rdc=true', '-lcudart', '-lcudadevrt', '--machine=64'],
                   arch='sm_60')
This should fix the compiler error.
However, note that using dynamic parallelism requires device code linkage, and I am 99% sure that PyCUDA still doesn't support this, so you likely won't be able to do what you are asking about via a SourceModule. You could link your own cubin by hand using the compiler outside of PyCUDA and then load that cubin inside PyCUDA. You will find many examples of how to compile dynamic parallelism correctly if you search for them.
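For illustration only, here is a minimal sketch of that by-hand route, not a documented PyCUDA workflow: it assumes you have already produced a device-linked cubin outside PyCUDA (for example with nvcc, using -rdc=true and linking against cudadevrt; the exact invocation depends on your toolkit version), and the file name kernels.cubin, the kernel name, and the launch configuration below are placeholders:
import pycuda.autoinit            # creates a context on the default device
import pycuda.driver as drv

# Load the pre-linked binary via the driver API, then launch as usual.
mod = drv.module_from_file("kernels.cubin")   # hypothetical pre-built cubin
gpu_kernel_function = mod.get_function("gpu_kernel_function")
gpu_kernel_function(block=(256, 1, 1), grid=(64, 1))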

Related

Performance degradation due to loading a shared library with thread local storage

I am writing a Python wrapper around a large Fortran program with pybind11, exposed as a Python module. The Fortran program is a large simulation tool that uses OpenMP for multithreading. My initial work was to reproduce the Fortran executable from a Python function. That yielded (as expected) exactly the same results and the same performance. But when I started to add more functions, I observed a large performance degradation (about 50% to 100% longer runtimes).
Tracking the cause in pybind11
I could track it down to a call of the pybind11 macro PYBIND11_NUMPY_DTYPE, which internally loads the numpy library numpy.core._multiarray_umath. I could reproduce the performance degradation with the following code:
import ctypes
import time
# This is the Fortran code, compiled to a shared library, with a subroutine
# modulemain that resembles the main program.
fcode = ctypes.CDLL("./libfcode.so")
# Merely loading the library degrades the performance of the Fortran code.
import numpy.core._multiarray_umath
t = time.time()
fcode.modulemain()
print("runtime: ", time.time()-t)
Tracking the cause in numpy
After finding that the cause of my bad performance lies just in loading the numpy.core._multiarray_umath library, I dug into it further. Ultimately I could track it down to two lines in that library, where two variables with thread local storage are defined.
// from numpy 1.21.5, numpy/core/src/multiarray/multiarraymodule.c:4011
static NPY_TLS int sigint_buf_init = 0;
static NPY_TLS NPY_SIGJMP_BUF _NPY_SIGINT_BUF;
where NPY_TLS is defined as
#define NPY_TLS __thread
So the inclusion of a shared object with __thread TLS is the root cause of my performance degradation. This leads me to my two questions:
Why?
Is there any way to prevent it? Not using PYBIND11_NUMPY_DTYPE is not an option, as loading the numpy library after my module triggers the problem as well!
Minimal working example
My problem involves a large and heavy Fortran code that I wanted to expose to Python via pybind11. But in the end it boils down to using OpenMP thread local storage and then loading, in the Python interpreter, a library that exports a variable with __thread thread local storage. I could create a minimal working example that reproduces the behavior.
The worker program work.f90
module data
  integer, parameter :: N = 10000
  real :: X(1:N)
  !$omp threadprivate(X)
end module

subroutine work() bind(C, name="worker")
  use data, only: X, N
  !$omp parallel
  X(1) = 0.131
  do i = 2, N
    do j = 1, i-1
      X(i) = X(i) + 0.431*sin(X(i-1))
    end do
  end do
  !$omp end parallel
end subroutine
The bad library tl.c
__thread int badVariable = 3;
A Python script that shows the effect, run.py
import ctypes
import time
work = ctypes.CDLL("./libwork.so")
# first worker run without loaded libtl.so. Good performance!
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
# load the bad library
bad = ctypes.CDLL("./libtl.so")
# second worker with degraded performance
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
The Makefile
FLAGS = -fPIC -shared
all: libwork.so libtl.so
libwork.so: work.f90
	gfortran-11 $(FLAGS) work.f90 -fopenmp -o $@
libtl.so: tl.c
	gcc-11 $(FLAGS) tl.c -o $@
The worker is so simple that enabling optimization will hide the effect. I guess it could be a call to access the thread local storage area that is easily optimized out here. But in a real program, the effect is there even with optimization.
Setup
I have the problem on an Ubuntu 22.04 LTS computer with an x86 CPU (Xeon 8280M). gcc is Ubuntu 11.3.0-1ubuntu1~22.04 (I tried others down to 7.5.0 with the same effect). Python is version 3.10.6.
The problem is not Fortran-specific; I can easily write a worker in plain C with the same effect. And I also tried this on a Raspberry Pi, with the same result! (ARM, GCC 8.3.0, Python 2.7.16)

When would I use model.to("cuda:1") as opposed to model.to("cuda:0")?

I have a user with two GPUs: the first one is an AMD GPU, which can't run CUDA, and the second one is a CUDA-capable NVIDIA GPU. I am using the code model.half().to("cuda:0"). I'm not sure whether the invocation successfully used the GPU, nor am I able to test it, because I don't have a spare computer with more than one GPU lying around.
In this case, does "cuda:0" mean the first device which can run CUDA, so it would've worked even if their first device was AMD? Or would I need to say "cuda:1" instead? How would I detect which number is the first CUDA-capable device?
The nvidia-ml-py3 package (Python bindings to NVML, the library behind nvidia-smi) can help to track GPU memory while running your code.
To install, run pip install nvidia-ml-py3. Take a look at this code snippet:
import nvidia_smi

cuda_idx = 0                   # edit device index that you want to track
to_cuda = f'cuda:{cuda_idx}'   # 'cuda:0' in this case

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(cuda_idx)

def B2G(num):
    return round(num / (1024**3), 2)

def print_memory(name, handle, pre_used):
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    used = info.used
    print(f'{name}: {B2G(used)}')
    print(f'This step use: {B2G(used - pre_used)}')
    print('------------')
    return used

# start
mem = print_memory('Start', handle, 0)
model = ...  # init your model
model.to(to_cuda)
mem = print_memory('Init model', handle, mem)
The above is an example with nvidia-ml-py3 that tracks the memory needed for each part of the model and prints it in GB.
Edited: To check the list of GPUs:
import torch

def check_gpu():
    for i in range(torch.cuda.device_count()):
        device_name = f'cuda:{i}'
        print(f'{i} device name: {torch.cuda.get_device_name(torch.device(device_name))}')
I tested it, and as I suspected, model.half().to("cuda:0") will put your model on the first available GPU with CUDA support, i.e. the NVIDIA GPU in your case. The AMD GPU isn't visible as a CUDA device, so feel safe to assume cuda:0 refers to a CUDA-enabled GPU; the AMD GPU won't be seen by your program.
Have a good day.
There are plenty of methods of torch.cuda to query and monitor GPU devices.
For example, you can check the type of each device:
torch.cuda.get_device_name(torch.device('cuda:0'))
# or
torch.cuda.get_device_name(torch.device('cuda:1'))
In my case, the output of get_device_name returns:
'Quadro RTX 6000'
If you want a more programmatic way to explore the properties of your devices, you can use torch.cuda.get_device_properties.
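As a small illustrative sketch (not part of the original answer), the properties object returned by torch.cuda.get_device_properties exposes, among other things, the device name, compute capability, and total memory:
import torch

props = torch.cuda.get_device_properties(torch.device('cuda:0'))
print(props.name)                           # e.g. 'Quadro RTX 6000'
print(props.major, props.minor)             # compute capability
print(props.total_memory / 1024**3, 'GB')   # total device memory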
Once you are working with a device (or believe you are), you can use torch.cuda's memory management functions to monitor GPU memory usage.
For instance, you can get a very detailed account of the current state of your device's memory using:
torch.cuda.memory_stats(torch.device('cuda:0'))
# or
torch.cuda.memory_stats(torch.device('cuda:1'))
If you want nvidia-smi-like stats on utilization, you can use torch.cuda.utilization.
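A hedged usage sketch (torch.cuda.utilization relies on the pynvml package being installed, and the device index here is just an example):
import torch

# Percent of time one or more kernels were executing on device 0 over the
# last sample period, as reported by NVML.
print(torch.cuda.utilization(0))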

TensorFlow Lite Android Crashes on GPU Compute only when Input Size is >1

I've been working on an Android Studio app which uses TensorFlow Lite's GPU delegate to speed up inference. It uses a model which takes an input array of size [n]x[384] and outputs an array of size [n]x[1], with n being the number of 384-sized inputs I wish to feed in at a given time. Output n is only dependent on input n. For n=1, I have no problems: TF Lite's CPU and GPU inference both work fine (although GPU does take longer, potentially because of the smaller input size?). When I increase n so that it is greater than 1 and run my model, CPU compute works fine, but GPU compute crashes my program. When I run the program on an emulated Pixel 3 XL I get this error message:
E/AndroidRuntime: FATAL EXCEPTION: main
Process: com.example.mlptest, PID: 10405
java.lang.IllegalArgumentException: Internal error: Failed to apply delegate: OpenCL library not loaded - dlopen failed: library "libOpenCL-pixel.so" not found
Falling back to OpenGL
TfLiteGpuDelegate Init: OpenGL ES 3.1 or above is required to use OpenGL inference.
TfLiteGpuDelegate Prepare: delegate is not initialized
Node number 4 (TfLiteGpuDelegateV2) failed to prepare.
When I run GPU compute on my personal phone, a Motorola Moto G7 Power, I get this error message:
E/AndroidRuntime: FATAL EXCEPTION: main
Process: com.example.mlptest, PID: 16906
java.lang.IllegalStateException: Internal error: Unexpected failure when preparing tensor allocations: TfLiteGpuDelegate Init: Index is out of range
TfLiteGpuDelegate Prepare: delegate is not initialized
Node number 4 (TfLiteGpuDelegateV2) failed to prepare.
This crash happens as soon as the GPU delegate's interpreter runs. I'm creating the delegate using these lines of code:
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
Initializing the interpreter with the options and then running it:
Interpreter tfliteGPU = new Interpreter(loadedFile, options);
And finally closing the delegate after my computation:
delegate.close();
The original TensorFlow model I am using was made in TensorFlow 1.x and converted from a frozen graph using the tflite_convert command. I'm running the app off of TF Lite 2.2.0 and TF Lite GPU 2.2.0:
implementation 'org.tensorflow:tensorflow-lite:2.2.0'
implementation 'org.tensorflow:tensorflow-lite-gpu:2.2.0'
I've looked at TF Lite's Android API reference and their page on the GPU Delegate and have not found any relevant solutions. Any help is appreciated!
After a recommendation to try out the TensorFlow Nightly implementation:
implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly'
implementation 'org.tensorflow:tensorflow-lite-gpu:0.0.0-nightly'
I switched my implementation in build.gradle to use 0.0.0-nightly and my problem went away. I can't speak to what may have originally caused it, but this is what solved it.

How to find TLS segments of the current thread on linux amd64?

I'm looking for a way to find out the memory addresses of the TLS segments for the current thread on Linux, amd64. Bonus points for a solution that works on OS X.
I looked into various language runtimes and GCs (like Boehm), but couldn't get through the multiple layers of abstraction needed to support all kinds of systems so far. Any help appreciated.
Did you have a look at the solution Martin and I came up with in druntime?
What we do there boils down to scanning the segments in the corresponding dl_phdr_info (obtained by looking for the correct one using dl_iterate_phdr) for the segment with type PT_TLS, and storing its module id and size.
You can then get the start of the address range on the current thread by calling __tls_get_addr for offset 0 and the module id (there is an offset on some archs), and the end by simply adding the size you determined to that. If you do not need to support shared libraries, you can also simply use fs/gs on x86 for that (might be required if you want to link a static executable).
This works for Linux and FreeBSD (and probably other ELF platforms), but not OS X. There, the best I could come up with so far is this:
void _d_dyld_getTLSRange(void* arbitraryTLSSymbol, void** start, size_t* size) {
    dyld_enumerate_tlv_storage(
        ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
            assert(state == dyld_tlv_state_allocated);
            if (info->tlv_addr <= arbitraryTLSSymbol &&
                arbitraryTLSSymbol < (info->tlv_addr + info->tlv_size)
            ) {
                // Found the range we are looking for.
                *start = info->tlv_addr;
                *size = info->tlv_size;
            }
        }
    );
}
The naive implementation currently used in LDC's druntime does not quite handle shared libraries, though, and dyld_enumerate_tlv_storage is from dyld_priv.h, which might or might not be a problem for App Store publishing.
On Linux, the thread-specific segment is set up via the arch_prctl(ARCH_SET_FS, <addr>) call. You can find out what it was set to in the current thread via arch_prctl(ARCH_GET_FS, ...).
Bonus point for a solution that works on OSX.
OS X is a completely different OS, and it uses a completely different mechanism for its TLS support.

Pthreads & Multicore compiler

I'm working with an SMP-enabled kernel: Snapgear 2.6.21.
I have created 4 threads in my C application, and I am trying to set thread 1 to run on CPU 1, thread 2 on CPU 2, etc.
However, the compiler sparc-linux-gcc does not recognize these functions:
CPU_SET (int cpu, cpu_set_t * set);
CPU_ZERO (cpu_set_t * set);
and this type: cpu_set_t
It always gives me these errors:
implicit declaration of function 'CPU_ZERO'
implicit declaration of function 'CPU_SET'
'cpu_set_t' undeclared (first use in this function)
Here is my code to bind active thread to processor 0:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);   // bind to processor 0
sched_setaffinity(0, sizeof(mask), &mask);
I have included and defined at the top:
#define _GNU_SOURCE
#include <sched.h>
But I always get the same errors. Can you help me, please?
You should read sched_setaffinity(2) carefully and test its result (and display errno on failure, e.g. with perror).
Actually, I believe you should use pthread_setaffinity_np(3) instead (and of course test its failure, etc...)
Even more, I believe that you should not bother to explicitly set the affinity. Recent Linux kernels are often quite good at dispatching running threads on different CPUs.
So simply use pthreads and don't bother about affinity, unless you see actual issues when benchmarking.
BTW, passing the -H flag to your GCC (cross-)compiler could be helpful; it shows you the included files. Perhaps also look into the preprocessed form obtained with gcc -C -E; it looks like some header files are missing or not found (maybe a missing -I include-directory at compilation time, or some missing headers on your development system).
BTW, your kernel version looks ancient. Can't you upgrade your kernel to something newer (3.15.x or some 3.y)?
