Is there a kernel queue inside a CUDA-enabled GPU? - pytorch

When multiple PyTorch processes are running inference on the same Nvidia GPU, what happens when two kernel requests (cuLaunchKernel) from different contexts are handled by CUDA? Does the GPU keep a FIFO queue for those kernel requests?
I have no idea how to measure the state of CUDA while running my PyTorch program. Any advice on how to profile an Nvidia GPU when running multiple concurrent jobs would be helpful!

Kernels from different contexts never run at the same time; they run in a time-sharing fashion (unless MPS is used).
Within the same CUDA context, kernels launched on the same CUDA stream never run at the same time. Instead, they are serialized in launch order and the GPU executes them one at a time, so a CUDA stream behaves like a queue within its context. Kernels launched on different CUDA streams (in the same context) have the potential to run concurrently.
PyTorch uses one CUDA stream by default. You can use its APIs to work with multiple streams: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams
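
To make the stream/queue behavior concrete, here is a minimal CUDA C++ sketch (kernel and variable names are illustrative): two launches on the same stream are serialized in launch order, while launches on separate streams may overlap on the GPU.

    // Two kernels on separate streams may run concurrently;
    // on the same stream they would be serialized in launch order.
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        scaleKernel<<<n / 256, 256, 0, s1>>>(a, n);  // queued on stream 1
        scaleKernel<<<n / 256, 256, 0, s2>>>(b, n);  // queued on stream 2: may overlap

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }

In PyTorch the same mechanism is exposed through torch.cuda.Stream and the torch.cuda.stream() context manager described in the linked notes.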

Related

Tensorflow GPU Threading

I'm trying to run a TensorFlow app using the GPU, but I'm not sure whether the GPU handles threading automatically. Can I control the GPU threads through the TensorFlow framework? How can I be sure the GPU is working efficiently, i.e. that nearly all threads are busy?

Multithreaded applications on different CPUs

If, for example, there is an embedded application which runs on a single-core CPU, and that application is then ported to a multi-core CPU, would the app run on a single core or on multiple cores?
To be more specific, I am interested in ARM CPUs (but not only those) and in toolchain specifics, e.g. the standard C/C++ libraries.
The intention of this question is this: is it the CPU's responsibility to "decide" to execute on multiple cores, or that of the compiler toolchain, the developer, and the standard platform-specific libraries? I am also interested in the tendencies of other systems out there.
There are plenty of applications and operating systems (for example Linux) that run on different CPUs of the same architecture, so does that mean they are compiled differently?
Generally speaking, single-threaded code will always run on one core. To take advantage of multiple cores you need either multiple processes, multiple threads, or both.
There's nothing your compiler can do to help you here. This is an architectural consideration.
If you have multiple threads, for example, most multi-core systems will run them on whatever cores are available if the operating system you're running is properly compiled to support that. Running an OS that's been compiled single-core only will obviously limit your options here.
A single threaded program will run in one thread. It is theoretically possible for the thread to be scheduled to move to a different core, but the scheduler cannot turn a single thread into multiple threads and give you any parallel processing.
EDIT
I misunderstood your question. If there are multiple threads in the application, and that application is binary compatible with the new multicore CPU, the threads will indeed be scheduled to run on different CPUs, if the OS scheduler deems it appropriate.
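
As a concrete illustration of the above (a minimal C++ sketch, not specific to any platform): a single-threaded program never spreads across cores by itself, but once it creates threads, the OS scheduler is free to place them on different cores.

    // The OS scheduler may place each thread on a different core.
    #include <cstdio>
    #include <thread>
    #include <vector>

    void work(int id) {
        std::printf("thread %d running\n", id);  // runs on whichever core the OS picks
    }

    int main() {
        unsigned cores = std::thread::hardware_concurrency();  // may return 0 if unknown
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < (cores ? cores : 4); ++i)
            pool.emplace_back(work, i);
        for (auto &t : pool)
            t.join();
        return 0;
    }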
Well, it all depends on whether the software wants to utilize the other cores (if present). Let's take the example of Linux on Arm's Cortex-A53.
Initially a vendor-provided bootloader runs: the FSBL (First Stage Bootloader). It passes control to Arm Trusted Firmware (ATF), and ATF then runs U-Boot. All of these run on a single core. U-Boot then loads the Linux kernel and passes control to it. Linux initializes some things and checks its options, first looking in the bootargs for the smp or nosmp flags. With smp, it gets the number of CPUs assigned to it from the device tree (DTB) and then, using SMC calls to ATF, starts the other cores and assigns work to them to provide a true multiprocessing environment. This is normally called load balancing, and in Linux it is mostly done in fair.c.

CUDA and Graphics Kernels Order of Execution

I have code that goes something like this:
1) Host: Launch Graphics Kernels
2) Host: Launch CUDA Kernels (all async calls)
3) Host: Do a bunch of number crunching on the host
4) Back to step 1
My question is this: the CUDA API guarantees that CUDA kernels, even if they are async, are executed in order of launch. Does this apply to the rendering? Let's say I have some rendering-related calculations in progress on the GPU. If I launch async CUDA calls, will they only be executed once the rendering is complete, or will the two operations overlap?
Also, if I call a CUDA device synchronize after step 2, it certainly forces the device to complete the CUDA-related function calls. What about rendering? Does it stall the host until the rendering-related operations are complete as well?
Launching CUDA kernels effectively locks the GPU, so any other concurrent usage of the GPU is not supported. Each host process has to execute device code in a specific context, and only one context can be active on a single device at a time.
Calling cudaDeviceSynchronize(); blocks the calling host code. After the device code on all streams has finished executing, control is returned to the calling host code.
EDIT:
See this very comprehensive but somewhat out-of-date answer, and you can study this paper to see what the latest devices are capable of. In short, launching a CUDA kernel, or even calling cudaSetDevice(), on a device that is being concurrently utilized by another thread fails with an error. If you would like to utilize your GPU from concurrent CUDA processes, there is a possibility (on Linux-only machines) to use a kind of intermediate layer (called MPS) between the host threads and the CUDA API calls. This is described in my second link.
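
A minimal sketch of the blocking behavior described above (kernel and variable names are illustrative); note that the kernel launch itself returns immediately, so step 3's host-side number crunching can overlap with GPU work until the synchronize:

    #include <cuda_runtime.h>

    __global__ void computeKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        computeKernel<<<(n + 255) / 256, 256>>>(d, n);  // async: returns at once
        // ... host-side number crunching can proceed here ...
        cudaDeviceSynchronize();  // blocks the host until all device work completes

        cudaFree(d);
        return 0;
    }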

CUDA performance penalty when running in Windows

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: For whatever reason, the Windows Nvidia driver (version 331.65) does not immediately dispatch a CUDA kernel when invoked via the runtime API.
To illustrate the problem I profiled the mergeSort application (from the examples that ship with CUDA 5.5).
Consider first the kernel launch time when running in Linux, and then the launch time when running in Windows (profiler screenshots omitted here).
This post suggests the problem might have something to do with the Windows driver batching the kernel launches. Is there any way I can disable this batching?
I am running with a GTX 690 GPU, Windows 7, and version 331.65 of the Nvidia driver.
There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.
As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.
The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but it is only supported on Tesla GPUs and certain members of the Quadro family of GPUs -- i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)
AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver, but unofficially, according to Greg@NV in this link, the call to issue after the CUDA kernel launch is cudaEventQuery(0);, which may/should cause the WDDM batch queue to "flush" to the GPU.
As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.
EDIT: moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
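
A minimal sketch of the flush pattern described in this answer (WDDM only; kernel and variable names are illustrative):

    #include <cuda_runtime.h>

    __global__ void computeKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        computeKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        cudaStreamQuery(stream);  // "low-impact" hint to flush the WDDM batch queue
        // cudaEventQuery(0);     // the older suggestion mentioned above

        cudaDeviceSynchronize();
        cudaStreamDestroy(stream);
        cudaFree(d);
        return 0;
    }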
EDIT2: Using recent drivers on Windows, you should be able to place Titan-family GPUs in TCC mode, assuming you have some other GPU set up for the primary display. The nvidia-smi tool will allow you to switch modes (run nvidia-smi --help for more info).
Additional info about the TCC driver model can be found in the Windows installation guide, including the fact that it may reduce the latency of kernel launches.
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of TCC support on your particular GPU.
Even though it has been almost 3 years since this issue was active, I still consider it necessary to provide my findings.
I was in the same situation: the same CUDA program took 5 ms on Ubuntu (CUDA 8.0) but over 30 ms on Windows 10 (CUDA 10.1), both with a GTX 1080 Ti.
However, on Windows, when I changed from building in Visual Studio to compiling with nvcc on the command line, the program suddenly ran at the same speed as the Linux one.
This suggests that the problem may come from Visual Studio.

Can I share cuda GPU device memory between host processes?

Is it possible to have two or more Linux host processes that can access the same device memory?
I have two processes streaming data between them at a high rate, and I don't want to bring the data out of the GPU to the host in process A just to pass it to process B, which would then memcpy it host-to-device back into the GPU.
Combining the multiple processes into a single process is not an option.
My understanding of the CUDA APIs is that this cannot be done. Device pointers are relative to a given CUDA context, and there is no way to share those between processes.
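
For completeness: on Linux, CUDA 4.1 and later expose an IPC API (cudaIpcGetMemHandle / cudaIpcOpenMemHandle) that does allow a device allocation made in one process to be mapped into another, so on newer setups the round trip through host memory can be avoided. A minimal sketch, assuming the handle is passed between processes over some ordinary host IPC channel, with error handling omitted:

    #include <cuda_runtime.h>

    // Process A: export a handle for an existing device allocation.
    cudaIpcMemHandle_t exportAlloc(void *devPtr) {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, devPtr);
        // ... send `handle` to process B (pipe, socket, shared memory, ...) ...
        return handle;
    }

    // Process B: map the same allocation into this process.
    void *importAlloc(cudaIpcMemHandle_t handle) {
        void *devPtr = nullptr;
        cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
        return devPtr;  // unmap later with cudaIpcCloseMemHandle(devPtr)
    }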
