Is it possible to have two or more linux host processes that can access the same device memory?
I have two processes streaming data between them at a high rate, and I don't want to bring the data out of the GPU to the host in process A just to pass it to process B, which will then memcpy it host-to-device back onto the GPU.
Combining the multiple processes into a single process is not an option.
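To make the cost concrete, the round trip I want to avoid looks roughly like the sketch below (the buffer size is arbitrary, and the host-side transport between the two processes is not shown):

```
// Sketch of the round trip to avoid. The two halves would live in
// different processes; the IPC transport between them (socket, pipe,
// shared host memory, ...) is elided.
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> staging(n);        // host bounce buffer

    float *devA, *devB;
    cudaMalloc(&devA, n * sizeof(float)); // buffer owned by "process A"
    cudaMalloc(&devB, n * sizeof(float)); // buffer owned by "process B"

    // Process A: pull the data off the GPU ...
    cudaMemcpy(staging.data(), devA, n * sizeof(float), cudaMemcpyDeviceToHost);

    // ... hand the host buffer over to process B (IPC mechanism not shown) ...

    // Process B: push the same data back onto the GPU.
    cudaMemcpy(devB, staging.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(devA);
    cudaFree(devB);
    return 0;
}
```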
My understanding of the CUDA APIs is that this cannot be done: device pointers are relative to a given CUDA context, and there is no way to share them between processes.
When multiple PyTorch processes are running inference on the same Nvidia GPU, I would like to know what happens when two kernel requests (cuLaunchKernel) from different contexts are handled by CUDA. Does the GPU keep a FIFO queue for those kernel requests?
I have no idea how to measure the state of CUDA while my PyTorch program is running. Any advice on how to profile an Nvidia GPU running multiple concurrent jobs would be helpful!
Kernels from different contexts never run at the same time; they run in a time-sliced fashion (unless MPS is used).
Within the same CUDA context, kernels launched on the same CUDA stream never run at the same time. They are serialized in launch order, and the GPU executes them one at a time, so a CUDA stream behaves like a queue within its context. Kernels launched on different CUDA streams (in the same context) have the potential to run concurrently.
PyTorch uses a single CUDA stream by default. You can use its APIs to work with multiple streams: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams
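As a minimal CUDA-level sketch of these stream semantics (the kernel and sizes are made up for illustration):

```
// Two launches on the same stream are serialized; a launch on another
// stream may overlap with them, resources permitting.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same stream: executed one after the other, in launch order.
    busyKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busyKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);

    // Different stream: may run concurrently with the work queued on s1.
    busyKernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```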
At start-up time, the kernel needs to load device drivers to initialize hardware such as the CPU clock, but at that point the kernel itself has not finished initializing. Can a mutex already be used at that stage (device objects use a mutex as their protection mechanism)? When do mutexes become available to use?
For this, you need a quick look at the Linux kernel initialisation process.
The kernel is kicked off by a single process, running on a single core.
It detects the number of CPUs available and some other stuff, and configures the scheduler. It then triggers the scheduler.
Any driver loading or so will only happen after this point.
In fact, drivers are loaded way after the scheduler has been started up.
Some great insights into the topic of Linux initialisation:
Linux inside.
When writing a device driver that lets the device and the user-space code share some memory, are there any benefits in using reserved memory for this task?
Some things I can think of:
If the reserved memory is assigned to this particular device alone, other processes cannot use up the available memory and leave nothing for the driver. This can be especially useful when the driver requires a large contiguous chunk of memory.
The driver could always use the same fixed (physical) memory address, which might be something the device requires (see the sketch below).
Are there any other good reasons (not) to use reserved memory for the driver in this situation?
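To illustrate the second point, here is a rough sketch of the user-space side of such a scheme. The physical address and size are hypothetical, a real driver would more likely expose the region through its own mmap handler, and mapping /dev/mem requires root and a kernel that permits it:

```
// Hypothetical user-space view of a fixed, reserved physical region.
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVED_PHYS_ADDR 0x3f000000UL  /* hypothetical reserved region */
#define RESERVED_SIZE      (1UL << 20)   /* 1 MiB, hypothetical */

int main() {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, RESERVED_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, RESERVED_PHYS_ADDR);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Driver, device, and user space all agree on the same fixed physical
    // addresses, so data can be exchanged through this mapping.
    volatile uint8_t *mem = (volatile uint8_t *)map;
    mem[0] = 0xAB;
    printf("first byte: 0x%02x\n", mem[0]);

    munmap(map, RESERVED_SIZE);
    close(fd);
    return 0;
}
```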
My main question: is there any piece of code running in the X server's process memory (excluding drivers, which we all know can be written in different ways) that directly accesses memory on the GPU card?
Or does it rely on drivers and DRM, or some other interface, to communicate with the GPU and queue draw/render/clear/... commands?
I know the question seems lame, but I am interested in the specifics.
EDIT:
More specifically: to my understanding, the kernel communicates with the hardware with the assistance of drivers and exposes an API to everything else (if I am wrong, please correct me).
In this context, can the X server bypass the DMA API in the kernel (I am only guessing that DMA I/O is responsible for communication with peripherals) and communicate and exchange data with the GPU card directly, without anyone's assistance, i.e. without the kernel, drivers, ...?
And what would be the bare minimum the X server needs in order to communicate with the GPU? I am trying to understand how this communication works at a low level.
It is entirely possible that on Linux a given X server accesses part of the video card memory directly as a framebuffer. It's not the most efficient way of displaying things, but it works.
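As a rough sketch of that direct-framebuffer style of access (done here from user space via /dev/fb0 rather than inside the X server itself, and assuming a 32-bit-per-pixel mode):

```
// Map the Linux framebuffer device and write pixels directly.
#include <fcntl.h>
#include <linux/fb.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/fb0", O_RDWR);
    if (fd < 0) { perror("open /dev/fb0"); return 1; }

    struct fb_var_screeninfo vinfo;
    struct fb_fix_screeninfo finfo;
    ioctl(fd, FBIOGET_VSCREENINFO, &vinfo);
    ioctl(fd, FBIOGET_FSCREENINFO, &finfo);

    size_t size = (size_t)finfo.line_length * vinfo.yres;
    uint8_t *fb = (uint8_t *)mmap(NULL, size, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Fill the visible screen with a solid grey (assumes 32 bpp).
    for (unsigned y = 0; y < vinfo.yres; ++y)
        for (unsigned x = 0; x < vinfo.xres; ++x)
            *(uint32_t *)(fb + y * finfo.line_length + x * 4) = 0x00808080;

    munmap(fb, size);
    close(fd);
    return 0;
}
```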
I have code that goes something like this:
1) Host: Launch Graphics Kernels
2) Host: Launch CUDA Kernels (all async calls)
3) Host: Do a bunch of number crunching on the host
4) Back to step 1
My question is this: the CUDA API guarantees that CUDA kernels, even when launched asynchronously, are executed in the order they are launched. Does this apply to rendering as well? Let's say I have some rendering-related calculations in progress on the GPU. If I launch async CUDA calls, will they only execute once the rendering is complete, or will the two operations overlap?
Also, if I call a CUDA device synchronize after step 2, it certainly forces the device to complete the CUDA-related calls. What about rendering? Does it stall the host until the rendering-related operations are complete as well?
Launching CUDA kernels effectively locks the GPU, so any other concurrent use of the GPU is not supported. Each host process has to execute its device code in a specific context, and only one context can be active on a single device at a time.
Calling cudaDeviceSynchronize() blocks the calling host code; once all streams of device work have finished executing, control is returned to the calling host code.
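A small sketch of that blocking behaviour (the kernel and sizes are only illustrative):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Asynchronous launches: these calls return to the host immediately.
    step<<<(n + 255) / 256, 256, 0, s1>>>(d, n);
    step<<<(n + 255) / 256, 256, 0, s2>>>(d, n);

    // Host-side number crunching could run here, overlapping with the GPU.

    // Blocks the calling host thread until ALL outstanding device work,
    // on every stream of this context, has finished.
    cudaDeviceSynchronize();

    printf("all device work complete\n");
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;
}
```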
EDIT:
See this very comprehensive but somewhat out-of-date answer, and you can study this paper to see what the latest devices are capable of. In short, launching a CUDA kernel, or even calling cudaSetDevice(), on a device that is concurrently being used by another thread fails with an error. If you would like to use your GPU from concurrent CUDA processes, there is a possibility (on Linux machines only) to put an intermediate layer (called MPS) between the host threads and the CUDA API calls. This is described in my second link.