CUDA and Graphics Kernels Order of Execution

I have code that goes something like this:
1) Host: Launch Graphics Kernels
2) Host: Launch CUDA Kernels (all async calls)
3) Host: Do a bunch of number crunching on the host
4) Back to step 1
My question is this. The CUDA API guarantees that CUDA kernels, even if they are async, are executed in the order in which they were launched. Does this apply to rendering as well? Let's say I have some rendering-related calculations in progress on the GPU. If I launch async CUDA calls, will they only be executed once the rendering is complete, or will the two operations overlap?
Also, if I call a CUDA device synchronize after step 2, it certainly forces the device to complete the CUDA-related function calls. What about rendering? Does it stall the host until the rendering-related operations are complete as well?

Launching CUDA kernels effectively occupies the GPU for the active context, so other uses of the GPU do not overlap with them. Each host process has to execute its device code in a specific context, and only one context can be active on a single device at a time.
Calling cudaDeviceSynchronize() blocks the calling host code. Once all streams of device code have finished executing, control returns to the calling host code.
EDIT:
See this very comprehensive but somewhat out-of-date answer, and study this paper to see what the latest devices are capable of. In short, launching a CUDA kernel, or even calling cudaSetDevice(), on a device that is concurrently being used by another thread fails with an error. If you would like to share your GPU between concurrent CUDA processes, it is possible (on Linux machines only) to use an intermediate layer called MPS between the host threads and the CUDA API calls. This is described in my second link.

Related

Is there a kernel queue inside a CUDA-enabled GPU?

When multiple PyTorch processes are running inference on the same Nvidia GPU, what happens when two kernel requests (cuLaunchKernel) from different contexts are handled by CUDA? Does the GPU have a FIFO queue for those kernel requests?
I have no idea how to measure the state of CUDA while running my PyTorch program. Any advice on how to profile an Nvidia GPU when running multiple concurrent jobs would be helpful!

Kernels from different contexts never run at the same time; they share the GPU in a time-sliced way (unless MPS is used).
Within the same CUDA context, kernels launched on the same CUDA stream never run at the same time. Instead, they are serialized in launch order and the GPU executes them one at a time. So a CUDA stream is similar to a queue within the CUDA context. Kernels launched on different CUDA streams (in the same context) have the potential to run concurrently.
PyTorch uses one CUDA stream by default. You can use its APIs to work with multiple streams: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams
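The queue analogy above can be sketched with a toy model in plain Python. This is not CUDA or PyTorch code: `ToyStream` is a hypothetical class in which each "stream" is a FIFO queue drained by one host thread, and a "kernel" is just a short sleep. It only illustrates the ordering rule, namely that work on one stream completes in launch order while two streams may interleave freely.

```python
import queue
import threading
import time


class ToyStream:
    """Toy stand-in for a CUDA stream: a FIFO queue drained by one worker thread."""

    def __init__(self, name, log, log_lock):
        self.name = name
        self._log = log
        self._log_lock = log_lock
        self._work = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def launch(self, kernel_name, duration=0.01):
        """Enqueue a pretend kernel and return immediately, like an async launch."""
        self._work.put((kernel_name, duration))

    def synchronize(self):
        """Block until everything enqueued on this stream so far has finished."""
        self._work.join()

    def _drain(self):
        while True:
            kernel_name, duration = self._work.get()
            time.sleep(duration)  # pretend the kernel is executing
            with self._log_lock:
                self._log.append((self.name, kernel_name))
            self._work.task_done()


log, log_lock = [], threading.Lock()
stream_a = ToyStream("A", log, log_lock)
stream_b = ToyStream("B", log, log_lock)

for i in range(3):
    stream_a.launch(f"a{i}")
    stream_b.launch(f"b{i}")

stream_a.synchronize()
stream_b.synchronize()

order_a = [kernel for stream, kernel in log if stream == "A"]
order_b = [kernel for stream, kernel in log if stream == "B"]
print(order_a)  # launch order within stream A is preserved
print(order_b)  # launch order within stream B is preserved
print(log)      # interleaving between A and B may differ from run to run
```

On a real GPU, whether kernels from different streams actually overlap also depends on available hardware resources, which this model does not attempt to capture.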

How to emulate the interrupt mechanism on Linux

I am developing a program for a microcontroller using FreeRTOS. My microcontroller has a CAN driver and uses hardware interrupts. An interrupt is fired when the CAN driver finishes transmitting a CAN frame.
For simplicity I am developing and testing some parts on Linux (Ubuntu 20). I am using SocketCAN on Linux, with a virtual CAN port.
Is it possible to mimic the hardware interrupts on Linux?
I was thinking of using POSIX signals. What do you think?
Thanks

I found the solution.
The solution is to run a task in parallel and have it call the function that the interrupt vector would normally point to.
While working this out I noticed that using POSIX signals between FreeRTOS tasks in the POSIX_GCC port may lead to problems with Linux system calls.
I also found that FreeRTOS tasks monopolize CPU time if they are used alongside ordinary pthreads.
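A rough sketch of that approach follows. It is written in Python for brevity rather than in C against the FreeRTOS POSIX port, and every name in it (`FakeCanController`, `on_tx_complete`) is hypothetical. A background thread stands in for the parallel task: it takes frames off a queue and then calls the handler that the interrupt vector would point to on real hardware.

```python
import queue
import threading


class FakeCanController:
    """Hypothetical stand-in for a CAN peripheral with a TX-complete interrupt."""

    def __init__(self, tx_complete_isr):
        self._isr = tx_complete_isr  # function the interrupt vector would point to
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def transmit(self, frame):
        """Queue a frame and return immediately, like starting a hardware transmission."""
        self._pending.put(frame)

    def close(self):
        """Stop the parallel task after all queued frames have been handled."""
        self._pending.put(None)
        self._worker.join()

    def _run(self):
        while True:
            frame = self._pending.get()
            if frame is None:
                break
            # A real test setup would write the frame to a SocketCAN socket here.
            self._isr(frame)  # "fire" the interrupt from the parallel task


completed = []


def on_tx_complete(frame):
    """Plays the role of the TX-complete interrupt service routine."""
    completed.append(frame)


can = FakeCanController(on_tx_complete)
can.transmit(b"\x01\x02")
can.transmit(b"\x03")
can.close()
print(completed)
```

As with a real ISR, the handler runs on a different thread of execution than the code that called `transmit()`, so any state it shares with the rest of the program needs the same care it would need on the target.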

Can we use a mutex in device drivers, etc. during startup?

At startup, the kernel needs to load device drivers to initialize things such as the CPU clock, but at that point the kernel has not been completely initialized yet. Can we use a mutex at this time (device objects use a mutex as a protection mechanism)? When do mutexes become available?

For this, you need a quick look at the Linux kernel initialisation process.
The kernel starts out as a single thread of execution, running on a single core.
It detects the number of CPUs available, along with some other things, and configures the scheduler. It then starts the scheduler.
Driver loading only happens after this point.
In fact, drivers are loaded well after the scheduler has been started.
Some great insights into the topic of Linux initialisation:
Linux inside.

Multithreaded applications on different CPUs

Suppose there is an embedded application that runs on a single-core CPU, and that application is then ported to a multi-core CPU. Would the app run on a single core or on multiple cores?
To be more specific, I am interested in ARM CPUs (but not only) and in toolchain specifics, e.g. the standard C/C++ libraries.
The intention of this question is this: is it the CPU's responsibility to "decide" to execute on multiple cores, or that of the compiler toolchain, the developer and the platform-specific standard libraries? Again, I am also interested in the tendencies of other systems out there.
There are plenty of applications and operating systems (for example Linux) that run on different CPUs of the same architecture, so does that mean that they are compiled differently?

Generally speaking, single-threaded code will only ever run on one core at a time. To take advantage of multiple cores you need to have either multiple processes, multiple threads, or both.
There's nothing your compiler can do to help you here. This is an architectural consideration.
If you have multiple threads, for example, most multi-core systems will run them on whatever cores are available if the operating system you're running is properly compiled to support that. Running an OS that's been compiled single-core only will obviously limit your options here.

A single threaded program will run in one thread. It is theoretically possible for the thread to be scheduled to move to a different core, but the scheduler cannot turn a single thread into multiple threads and give you any parallel processing.
EDIT
I misunderstood your question. If there are multiple threads in the application, and that application is binary compatible with the new multicore CPU, the threads will indeed be scheduled to run on different CPUs, if the OS scheduler deems it appropriate.

It all depends on whether the software wants to use the other cores (if present). Let's take Linux on ARM's Cortex-A53 as an example.
Initially a vendor-provided boot loader runs, the FSBL (first-stage bootloader). It then passes control to Arm Trusted Firmware (ATF). ATF then runs U-Boot. All of these run on a single core. U-Boot then loads the Linux kernel and passes control to it. Linux initializes some things and checks its options, starting with the bootargs for the smp or nosmp flags. If smp is set, it gets the number of CPUs assigned to it from the dtb, starts the other cores using SMC calls to ATF, and then assigns work to those cores to provide a true multiprocessing environment. This distribution of work is normally called load balancing, and in Linux it is mostly done in the fair.c file.

Node.JS kernel mode threading

I'm trying to figure out how Node.js (its Windows version) works behind the scenes.
I know there is user mode and kernel mode threads, and I know the processing model looks like this:
I also know that moving from a kernel-mode thread to a user-mode thread is considered to be a context switch.
Are the Node.js C++ non-blocking worker threads kernel-mode threads? And where does the single event-loop thread live: in kernel mode or user mode?

As you know, node.js has a single-threaded architecture. The JavaScript environment and the event loop are managed by a single thread only; internally, all the other threads are handled by a C++-level thread pool (for example, asynchronous I/O is handled by libuv threads).
To answer your question, these node.js C++ non-blocking worker threads are not kernel mode. They are user mode. The event-loop thread is also user mode. The threads request kernel mode as and when needed.
When the CPU is in kernel mode, it is assumed to be executing trusted software. Kernel mode is the highest privilege level and the code has full access to all devices. In Windows, only select components written by Windows developers run entirely in kernel mode. All user-mode software must request use of the kernel by means of a system call in order to perform privileged operations, such as process creation or I/O.
All processes begin execution in user mode, and they switch to kernel mode only when obtaining a service provided by the kernel. This change in mode is termed a mode switch, not a context switch, which is the switching of the CPU from one process to another.
I hope it is clear that even user-mode threads can carry out privileged operations (such as network access) via system calls, and return to user mode when the required task is finished. Node.js simply uses system calls.
Source : http://www.linfo.org/kernel_mode.html
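The split between user mode and kernel mode can be observed from a process's own CPU accounting. The sketch below uses Python rather than Node.js, and the helper names (`cpu_delta`, `pure_computation`, `many_system_calls`) are made up for illustration. `os.times()` reports user and system CPU time separately; a loop of arithmetic should accumulate mostly user time, while a loop of `os.write()` calls enters the kernel on every iteration and should accumulate noticeably more system time. The exact figures depend on the machine and operating system, so none are shown.

```python
import os


def cpu_delta(work):
    """Return (user_seconds, system_seconds) of CPU time consumed while running work()."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def pure_computation():
    # Mostly stays in user mode: plain arithmetic needs no kernel services.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def many_system_calls():
    # Each os.write() is a system call: the thread switches to kernel mode and back.
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        for _ in range(200_000):
            os.write(fd, b"x")
    finally:
        os.close(fd)


compute_user, compute_sys = cpu_delta(pure_computation)
io_user, io_sys = cpu_delta(many_system_calls)
print(f"computation:  user={compute_user:.3f}s  system={compute_sys:.3f}s")
print(f"system calls: user={io_user:.3f}s  system={io_sys:.3f}s")
```

Both loops run on the same thread throughout, which matches the point above: entering the kernel for a system call is a mode switch and does not by itself hand the CPU to another thread.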
Update
I should have mentioned that mode switch does not always mean context switch. Quoting the wiki:
When a transition between user mode and kernel mode is required in an
operating system, a context switch is not necessary; a mode transition
is not by itself a context switch. However, depending on the operating
system, a context switch may also take place at this time.
You are also correct that a mode switch can cause a context switch, but it does not always happen. It is not desirable to have a context switch (a heavy performance penalty) every time a mode switch happens. What happens inside Windows is difficult to say, but most likely a mode switch does not cause a context switch every time.
Regarding the one-to-one thread model: both Windows and Linux follow it. So for each user thread (like the node.js event-loop thread) the OS provides a kernel thread, which takes care of the system calls. Node.js can only trigger a mode switch, through system calls. Context switches are controlled only by the kernel (the thread scheduler).
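The one-to-one model can be seen from within a program, sketched here in Python rather than Node.js. CPython threads are ordinary operating-system threads, and `threading.get_native_id()` (Python 3.8+) returns the ID the kernel assigned to the calling thread. The barrier keeps all workers alive at the same time so that their IDs cannot be reused; `record_native_id` and the other names are illustrative.

```python
import threading


def record_native_id(ids, lock, barrier):
    """Store the kernel-assigned ID of the calling thread, then wait for the others."""
    with lock:
        ids.append(threading.get_native_id())
    barrier.wait()  # keep every worker alive until all have recorded their ID


worker_count = 3
ids, lock = [], threading.Lock()
barrier = threading.Barrier(worker_count)

threads = [
    threading.Thread(target=record_native_id, args=(ids, lock, barrier))
    for _ in range(worker_count)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_id = threading.get_native_id()
distinct_ids = set(ids) | {main_id}
print(f"main thread id: {main_id}")
print(f"worker thread ids: {ids}")
print(f"distinct kernel threads: {len(distinct_ids)}")
```

Each user-level thread shows up as its own kernel-schedulable thread, which is why it is the kernel's scheduler, not the application, that decides when a context switch between them happens.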
Update 2
Yes, HTTP.SYS executes in kernel mode. But there is more to it. Node.js does not have many threads, so less context switching happens between threads than in IIS. The context switching (mode switching) per request is definitely lower with HTTP.SYS. It is an improvement on the past (which happened to be a disaster), see here. However, the context switching caused by multiple threads far outweighs the reduction gained by using HTTP.SYS. So overall node.js has fewer context switches.
HTTP.SYS also has other advantages over node's own HTTP implementation that help IIS. It may be possible (in the future) to use HTTP.SYS from node itself to gain those advantages. But for now, I don't think HTTP.SYS/IIS comes anywhere near node.js.
