How to manage the same CUDA kernel call from multiple CPU threads?

I have a CUDA kernel which works fine when called from a single CPU thread. However, when the same kernel is called from multiple CPU threads (~100), most of the kernel launches seem not to be executed at all, as the results come out all zeros. Can someone please guide me on how to resolve this problem?
In the current version of the kernel I am using cudaDeviceSynchronize() at the end of the kernel call. Will adding a sync command before cudaMalloc() and the kernel call be of any help in this case?
There is another thing which needs some clarification: if two CPU threads execute the same cudaMalloc() command, will the latter overwrite the former in GPU memory, or will each get its own allocation?
Thanks in advance for your help.

Usually one CPU thread is used for calling a CUDA kernel. However, since CUDA 4.0, multiple CPU threads can share a context. You can use cuCtxSetCurrent to bind the context to the current thread; see the CUDA Driver API documentation for details on this function.
Another workaround for this is to create a GPU worker thread that holds the context and pass any CUDA request to that thread.
Regarding your other question: without setting the context for the proper thread, I remember that cudaMalloc would not even execute (I work with JCuda, so the behavior may be a little different). But if the context is set correctly for the calling thread, the allocations will not overwrite each other; each cudaMalloc returns its own device pointer.
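As a rough sketch (not a drop-in fix): the example below assumes the CUDA runtime API, a made-up fillKernel, and only 8 worker threads rather than your ~100. Since CUDA 4.0 the runtime shares one context per device across host threads, so each thread only needs its own allocation plus error checking; an unchecked launch failure is the usual reason results come back as all zeros.

    #include <cstdio>
    #include <thread>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical kernel: fills an array with a per-thread value.
    __global__ void fillKernel(int *data, int value, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = value;
    }

    // Work done by each CPU thread. Since CUDA 4.0 the runtime API shares a
    // single context per device across host threads, so no cuCtxSetCurrent
    // is needed here; each thread gets its own allocation.
    void cpuWorker(int id)
    {
        const int n = 1024;
        int *d_data = nullptr;

        if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) {
            std::fprintf(stderr, "thread %d: cudaMalloc failed\n", id);
            return;
        }

        fillKernel<<<(n + 255) / 256, 256>>>(d_data, id, n);

        // Always check for launch errors and synchronize before using results;
        // silent launch failures are what typically produce all-zero output.
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            std::fprintf(stderr, "thread %d: %s\n", id, cudaGetErrorString(err));

        cudaFree(d_data);
    }

    int main()
    {
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; ++i)
            threads.emplace_back(cpuWorker, i);
        for (auto &t : threads)
            t.join();
        return 0;
    }

With ~100 threads you will also hit per-thread launch overhead and possibly exhaust device memory, which is one more argument for funnelling work through the single GPU worker thread mentioned above.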

Related

Is sched_getcpu() reliable on Linux?

I'm trying to debug some performance issues with pthreads on Linux and I think sched_getcpu() may be lying to me. It reports a constant CPU for each thread, whereas profiling experiments seem to suggest the threads are actually migrating from one core to another during their lifetime.
I wonder if sched_getcpu() just reports the first CPU that the thread started running on, and is oblivious to thread migration? Has anyone else noticed this, or seen any evidence that the return value of sched_getcpu() might change? If it's not reliable, are there any other methods for tracking the current CPU (use CPUID maybe?)?
http://man7.org/linux/man-pages/man2/getcpu.2.html indicates sched_getcpu() is just a wrapper for getcpu(), and it also suggests that the information provided is accurate, because an old caching option is no longer used:
The tcache argument is unused since Linux 2.6.24...it specified a
pointer to a caller-allocated buffer in thread-local storage that was
used to provide a caching mechanism for getcpu(). Use of the cache
could speed getcpu() calls, at the cost that there was a very small
chance that the returned information would be out of date. The caching
mechanism was considered to cause problems when migrating threads
between CPUs, and so the argument is now ignored.
So unless you are using a pre-2.6.24 kernel it seems unlikely you could be seeing old/cached information.
Calling sched_getcpu has two problems:
It only tells you where the thread is running when it executes the call,
Calling a system routine could cause a thread to migrate.
If you are using the Intel OpenMP runtime you could set KMP_AFFINITY=verbose, as it will print the same information (formatted differently) on stderr when the program executes its first parallel section.
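If you want to observe migration directly, here is a minimal sketch (the thread count and busy loop are arbitrary) where each thread repeatedly calls sched_getcpu() and prints whenever the value changes; you should see the reported CPU change if the thread migrates.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE        // for sched_getcpu() with glibc
    #endif
    #include <sched.h>
    #include <unistd.h>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Each thread repeatedly reports which CPU it is currently running on.
    // If the printed value changes, the thread has migrated, so sched_getcpu()
    // is clearly not just reporting the first CPU the thread started on.
    void watcher(int id)
    {
        int last = -1;
        for (int i = 0; i < 50; ++i) {
            int cpu = sched_getcpu();           // returns -1 on error
            if (cpu != last) {
                std::printf("thread %d now on CPU %d\n", id, cpu);
                last = cpu;
            }
            volatile double x = 0;              // burn a little CPU so the
            for (int k = 0; k < 1000000; ++k)   // scheduler has a reason to move us
                x += k;
            usleep(10000);
        }
    }

    int main()
    {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i)
            threads.emplace_back(watcher, i);
        for (auto &t : threads)
            t.join();
        return 0;
    }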

cudaDeviceSynchronize is very slow

I have to use the result of a CUDA kernel in subsequent host (CPU) code, so just below the kernel launch I call cudaDeviceSynchronize(). Then execution gets very slow, so the time saved by using the kernel is gone.
The kernel originally reduced the execution time to below 100 ms, but cudaDeviceSynchronize() takes 150 ms. It makes me wonder whether it is justified to use CUDA in this case. Please clarify if I'm wrong.
cudaDeviceSynchronize() will wait until the kernel is complete, so it's more likely that the performance of your kernel is slow and the synchronise is simply waiting for the kernel to complete. You should profile your code with the Visual Profiler to see whether it is actually your kernel that is taking the time. The profiler should also help you to understand why the kernel is slow and to optimise it.

If I don't call clRelease*, will it cause a memory leak?

I want to add some OpenCL support to Chromium, so I used APIs like clCreateCommandQueue(), but I can't find a proper place in Chromium to do cleanup.
So, if I don't call APIs like clReleaseCommandQueue(), will the OS reclaim the memory after the process terminates? Or do I need to call it at the exit point of the process?
PS: the command queue is needed for the whole life of the process, so I just want to make sure it will not cause a memory leak after the process terminates.
Thank you for your help.
Since all the OpenCL objects are, ultimately, held by the device driver, you can't expect them to be automatically released once the application terminates. That is always your job.
If you use the OpenCL C++ wrapper (cl.hpp), the wrapper's reference counting cleans up your objects for you: the clRelease* calls are made from the destructors when the referring objects go out of scope.
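For illustration, a minimal sketch using the C++ wrapper (the variable names are made up); the wrapper objects release their OpenCL handles in their destructors, so no explicit clRelease* call appears:

    #include <CL/cl.hpp>    // legacy C++ bindings; newer code uses <CL/opencl.hpp>
    #include <vector>

    int main()
    {
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        if (platforms.empty())
            return 1;

        std::vector<cl::Device> devices;
        platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);

        cl::Context context(devices);
        cl::CommandQueue queue(context, devices[0]);

        // ... enqueue work with `queue` for the lifetime of the process ...

        // No explicit clReleaseCommandQueue() here: the wrapper objects hold
        // references and call clRelease* from their destructors when they go
        // out of scope, i.e. when main returns.
        return 0;
    }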

What is the relationship between assembly and multi-core?

This is hard to word/ask so please bear with me:
When we see the output of assembly, this is what is going to be executed on the core(s) of the CPU. However, if a CPU has multiple cores, is all of the assembly executed on the same core? At what point would the assembly from the same program begin executing on a different core?
So if I had (assembly pseudo):
ADD x, y, z
SUB p, x, q
how will I know whether ADD and SUB will execute on the same core? Is this linked to affinity? I thought affinity only pinned a process to a CPU, not a core?
I am asking this because I want to understand whether you can reasonably predict whether consecutive assembly instructions execute on the same core, and whether I can control that they only execute on the same core. I am trying to understand how the decision is made to move execution of the same program code from one core to a different core.
If execution can move (even when using affinity) from CPU A core 1 to core 2, is this where QPI link speed takes effect, and does it matter whether the caches are shared amongst the different CPU cores?
This is a rough overview that hopefully will provide you with the details you need.
Assembly code is translated into machine code, i.e. binary data, that is run by a CPU.
A CPU is the same as a core on a multi-core processor; i.e. a CPU is not the same as a processor (chip).
Every CPU has an instruction pointer that points to the instruction to execute next. This is incremented for every instruction executed.
So in a multi-core processor you would have one instruction pointer per core. To support more processes than there are available CPUs (or cores), the operating system will interrupt running processes and store their state (including the instruction pointer) at regular intervals. It will then restore the state of already interrupted processes and let them execute for a bit.
What core the execution is continued on is up to the operating system to decide, and is controlled by the affinity of the running thread (and probably some other settings also).
So to answer your question, there is no way of knowing if two adjacent assembly statements will run on the same core or not.
I'm talking mostly about Linux, but I guess what I am saying should be applicable to other OSes. However, without access to the Windows source code, no one can reliably say how it behaves in detail.
I think your "abstraction" of what a computer is doing is inadequate. Basically, a (mono-threaded) process (or just a thread) is running on some "virtual" CPU, whose instruction set is the unprivileged x86 machine instructions augmented by the ability to enter the kernel through syscalls (usually via a special instruction like SYSENTER). So from an application point of view, system calls to the Linux kernel are "atomic".
Indeed, the processor is getting interrupts at arbitrary instants (on Linux, cat /proc/interrupts run twice with a one-second delay will show you how often it is getting interrupted, typically many thousand times per second), and these interrupts are handled by the kernel. The kernel schedules tasks (e.g. threads or processes) preemptively: they can be interrupted and restarted by the kernel at any time.
From an application point of view, interrupts don't really exist (but the kernel can send signals to the process).
Cores, interrupts and caches are handled by the hardware and/or the kernel, so from the application point of view they don't really exist, except by "slowing down" the process. Cache coherency is mostly dealt with in hardware, and together with out-of-order execution it makes the execution time of even a tiny binary program unpredictable. In other words, you cannot statically predict exactly how many milliseconds some given routine or loop will need; you can only measure it dynamically (read more about worst-case execution time).
Reading the Advanced Linux Programming book and the Linux Assembly Howto would help.
You cannot normally predict where each individual instruction will execute. As long as an individual thread is executing continuously, it will run inside the same core/processor, but you cannot predict at which instruction the thread will be switched out. The OS makes that decision, along with the decision of when to switch it back in and on which core/processor to put it, based on the workload of the system and priority levels, among other things.
You can usually ask the OS to keep a thread on the same core; this is called affinity (a minimal sketch follows below). It is normally a bad idea and should only be done when absolutely necessary, because it takes away the OS's flexibility to decide what to run where based on the workload; affinity will almost always result in a performance penalty.
Requesting processor-affinity is an extraordinary request that requires extraordinary proof that it would result in better performance. Don't try to outsmart the OS; the OS knows things about the current running environment that you don't know about.
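For completeness, here is a minimal Linux-specific sketch of requesting affinity with pthread_setaffinity_np() (compile with g++ -pthread); treat it as an illustration of the mechanism, not a recommendation:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE        // for cpu_set_t and pthread_setaffinity_np with glibc
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <unistd.h>
    #include <cstdio>

    // Worker that reports which core it runs on; once the pin below takes
    // effect, the printed value should stay constant.
    void *worker(void *)
    {
        for (int i = 0; i < 5; ++i) {
            std::printf("running on CPU %d\n", sched_getcpu());
            usleep(100000);
        }
        return nullptr;
    }

    int main()
    {
        pthread_t t;
        pthread_create(&t, nullptr, worker, nullptr);

        // Pin the thread to core 0. This is a Linux-specific, non-portable call.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
            std::fprintf(stderr, "pthread_setaffinity_np failed\n");

        pthread_join(t, nullptr);
        return 0;
    }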

Linux Kernel Threads - scheduler

Is the Linux kernel scheduler part of the init process? My understanding is that it is part of the kernel threads, managed internally and not visible to the user through either top or ps. Please correct my understanding.
Is it possible to view standard kernel threads through any kernel debugger to see how they occupy CPU time?
-Kartlee
Kernel threads can be seen through "top" and "ps" and can be distinguished by having zero VM size (they have no userspace, so no userspace memory map).
These are created by kernel_thread() (or its friends). Some facilities create one thread per CPU and tie it to a CPU, so you see entries like aio/0 and aio/1 in the ps output.
Also some work is done through the several deferred execution mechanisms and gets attributed to other tasks, typically something called "events/0" (one per CPU). Time spent "really" in interrupts isn't counted anywhere (it just runs at the expense of whatever task happened to be on that CPU at the time).
1) Is Linux Kernel scheduler a part of init process?
-> No. The scheduler is a kernel subsystem; the init process is just a process (a special one) and is itself scheduled by the scheduler.
2) My understanding is that it is part of Kernel threads managed internally not visible to user by either top or ps. Please correct my understanding.
-> It is kernel-internal and typically not shown to the user.
3) Is it possible to view standard kernel threads through any kernel debugger to see how standard threads occupy cpu activity?
-> yes!
Use ps aux; kernel threads' names are shown in square brackets, e.g. [kthreadd].
Kernel threads are created with the kthread_create() function, and the creation is ultimately handled by kthreadd, i.e. the PID 2 thread in the kernel.
All kernel threads are forked/copied/cloned by kthreadd (PID 2), not by init (PID 1).
The source code is here: https://elixir.bootlin.com/linux/latest/source/kernel/kthread.c
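As a small illustration of the "no userspace memory map" point above, this sketch walks /proc and prints the tasks whose cmdline is empty, which is essentially how ps decides to show the bracketed kernel-thread names (note that zombie processes also show an empty cmdline):

    #include <dirent.h>
    #include <cctype>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <string>

    // List tasks whose /proc/<pid>/cmdline is empty: these are kernel threads
    // (no userspace image), the same ones ps shows in [brackets].
    int main()
    {
        DIR *proc = opendir("/proc");
        if (!proc)
            return 1;

        while (dirent *entry = readdir(proc)) {
            std::string name = entry->d_name;
            if (name.empty() || !std::isdigit(static_cast<unsigned char>(name[0])))
                continue;                       // not a PID directory

            std::ifstream cmdline("/proc/" + name + "/cmdline");
            std::string args((std::istreambuf_iterator<char>(cmdline)),
                             std::istreambuf_iterator<char>());
            if (!args.empty())
                continue;                       // ordinary userspace process

            std::ifstream comm("/proc/" + name + "/comm");
            std::string threadName;
            std::getline(comm, threadName);
            std::printf("%5s  [%s]\n", name.c_str(), threadName.c_str());
        }
        closedir(proc);
        return 0;
    }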
