cudaDeviceSynchronize is very slow - visual-studio-2012

I have to use the result of a CUDA kernel in subsequent host (CPU) code, so immediately after the kernel launch I call cudaDeviceSynchronize(). Then the execution gets very slow, and the time saving gained by using the kernel is gone.
Originally the execution time was reduced to below 100 ms by the CUDA kernel, but cudaDeviceSynchronize() takes 150 ms. It makes me wonder whether using CUDA is justified in this case. Please clarify if I'm wrong.

cudaDeviceSynchronize() will wait until the kernel is complete, and kernel launches are asynchronous: the launch call returns immediately and the cost only becomes visible at the next synchronisation point. So it's more likely that your kernel itself is slow and the synchronise is simply waiting for it to finish. You should profile your code with the Visual Profiler to see whether it is actually your kernel that is taking the time. The profiler should also help you understand why the kernel is slow and how to optimise it.
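To confirm this, CUDA events let you time the kernel on its own. The sketch below is only illustrative (the kernel, its name and the problem size are made up, since the original code isn't shown); the point is that the elapsed time reported between the two events is the same time that otherwise appears to be "spent" inside cudaDeviceSynchronize():

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Stand-in for the real kernel from the question (hypothetical). */
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data = NULL;
        cudaEvent_t start, stop;
        float ms = 0.0f;

        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);                          /* queued before the kernel */
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);   /* launch returns immediately */
        cudaEventRecord(stop);                           /* queued after the kernel */

        cudaEventSynchronize(stop);                      /* waits, just like cudaDeviceSynchronize() */
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }

If the reported kernel time is close to 150 ms, the synchronisation is not the problem; the kernel is.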

Related

What do we mean by "Non-preemptive Kernel"?

I read that the Linux kernel is preemptive, which is different from most Unix kernels. So, what does it really mean for a kernel to be preemptive?
Some analogies or examples would be better than pure theoretical explanation.
ADD 1 -- 11:00 AM 12/7/2018
Preemption is just one paradigm of multi-tasking. There are others, such as cooperative multi-tasking. A better understanding can be achieved by comparing them.
Prior to Linux kernel version 2.5.4, the Linux kernel was not preemptive, which means that a process running in kernel mode could not be moved off the processor until it left the processor itself or started waiting for some I/O operation to complete.
Generally a process in user mode can enter kernel mode using system calls. When the kernel was non-preemptive, a lower-priority process could effectively cause priority inversion, denying a higher-priority process access to the processor by repeatedly making system calls and remaining in kernel mode. Even if the lower-priority process's timeslice expired, it would continue running until it completed its work in the kernel or voluntarily relinquished control. If the higher-priority process waiting to run is a text editor in which the user is typing or an MP3 player ready to refill its audio buffer, the result is poor interactive performance. In this way the non-preemptive kernel was a major drawback at the time.
Imagine the simple view of preemptive multi-tasking. We have two user tasks, both of which are running all the time without using any I/O or performing kernel calls. Those two tasks don't have to do anything special to be able to run on a multi-tasking operating system. The kernel, typically based on a timer interrupt, simply decides that it's time for one task to pause to let another one run. The task in question is completely unaware that anything happened.
However, most tasks make occasional requests of the kernel via syscalls. When this happens, the same user context exists, but the CPU is running kernel code on behalf of that task.
Older Linux kernels would never allow preemption of a task while it was busy running kernel code. (Note that I/O operations always voluntarily re-schedule. I'm talking about a case where the kernel code has some CPU-intensive operation like sorting a list.)
If the system allows that task to be preempted while it is running kernel code, then we have what is called a "preemptive kernel." Such a system is immune to unpredictable delays that can be encountered during syscalls, so it might be better suited for embedded or real-time tasks.
For example, without kernel preemption, if on a particular CPU there are two tasks available, and one makes a syscall that takes 5 ms to complete while the other is an MP3 player application that needs to feed the audio pipe every 2 ms, you might hear stuttering audio.
The argument against preemption is that all kernel code that might be called in task context must be able to survive preemption-- there's a lot of poor device driver code, for example, that might be better off if it's always able to complete an operation before allowing some other task to run on that processor. (With multi-processor systems the rule rather than the exception these days, all kernel code must be re-entrant, so that argument isn't as relevant today.) Additionally, if the same goal could be met by improving the syscalls with bad latency, perhaps preemption is unnecessary.
A compromise is CONFIG_PREEMPT_VOLUNTARY, which allows a task-switch at certain points inside the kernel, but not everywhere. If there are only a small number of places where kernel code might get bogged down, this is a cheap way of reducing latency while keeping the complexity manageable.
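As a purely illustrative sketch (not from the original answer), a voluntary preemption point is essentially an explicit cond_resched() call inside a long-running kernel loop; with CONFIG_PREEMPT_VOLUNTARY the kernel's existing might_sleep() annotations become rescheduling points as well. The kernel thread below is hypothetical:

    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/err.h>

    /* Hypothetical CPU-bound kernel thread, for illustration only. */
    static int crunch_thread(void *data)
    {
        while (!kthread_should_stop()) {
            long i;

            for (i = 0; i < 1000000; i++) {
                /* ... some CPU-intensive kernel work ... */
            }

            /*
             * Voluntary preemption point: if a higher-priority task is
             * runnable, the task switch happens here rather than at some
             * unpredictable instruction in the middle of the loop.
             */
            cond_resched();
        }
        return 0;
    }

    static struct task_struct *worker;

    static int __init demo_init(void)
    {
        worker = kthread_run(crunch_thread, NULL, "crunch_demo");
        return IS_ERR(worker) ? PTR_ERR(worker) : 0;
    }

    static void __exit demo_exit(void)
    {
        kthread_stop(worker);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");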
Traditional Unix kernels had a single lock, which was held by a thread while kernel code was running, so no other kernel code could interrupt that thread.
This made designing the kernel easier, since you knew that while one thread was using kernel resources, no other thread was, so the different threads could not mess up each other's work.
In single processor systems this doesn't cause too many problems.
However, in multiprocessor systems you could have a situation where several threads on different processors or cores all want to run kernel code at the same time. This means that, depending on the type of workload, you could have lots of processors that all spend most of their time waiting for each other.
In Linux 2.6, the kernel resources were divided up into much smaller units, protected by individual locks, and the kernel code was reviewed to make sure that locks were only held while the corresponding resources were in use. So now different processors only have to wait for each other if they want access to the same resource (for example, the same hardware device).
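To make "individual locks" concrete, here is a toy sketch (mine, not part of the answer) of per-resource locking; the structure and field names are invented:

    #include <linux/spinlock.h>

    /* Hypothetical per-device state: each instance carries its own lock,
     * so two CPUs only wait for each other when they touch the SAME device,
     * not whenever either of them happens to be inside the kernel. */
    struct demo_device {
        spinlock_t   lock;
        unsigned int pending;
    };

    static void demo_device_init(struct demo_device *dev)
    {
        spin_lock_init(&dev->lock);
        dev->pending = 0;
    }

    static void demo_submit(struct demo_device *dev)
    {
        unsigned long flags;

        spin_lock_irqsave(&dev->lock, flags);   /* lock only this device */
        dev->pending++;
        spin_unlock_irqrestore(&dev->lock, flags);
    }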
Preemption allows the kernel to give the IMPRESSION of parallelism: you've got only one processor (let's say a decade ago), but you feel like all your processes are running simultaneously. That's because the kernel preempts (i.e., takes the execution away from) one process to give it to the next one (possibly according to their priority).
EDIT Non-preemptive kernels wait for processes to hand control back (i.e., during syscalls), so if your process computes a lot of data and never calls any kind of yield function, the other processes won't be able to execute their own calls. Such systems are said to be cooperative because they rely on the cooperation of the processes to ensure fair sharing of execution time.
EDIT 2 The main goal of preemption is to improve the responsiveness of the system among multiple tasks, which is good for end-users, whereas servers, on the other hand, aim for the highest throughput and so don't need it (from the Linux kernel configuration):
Preemptible kernel (low-latency desktop)
Voluntary kernel preemption (desktop)
No forced preemption (server)
The Linux kernel is monolithic and gives a small slice of computing time to each running process in turn. This means that the processes (e.g., the programs) do not truly run concurrently; instead, each is regularly given a timespan in which to execute its logic. The main problem is that some piece of logic can take a long time to finish and prevent the kernel from giving time to the next process. This results in system "lags".
A preemptive kernel has the ability to switch context: it can stop a "hanging" process even if it is not finished and give the computing time to the next process as expected. The "hanging" process will continue to execute when its turn comes, without any problem.
Practically, it means that the kernel can serve tasks with (soft) real-time requirements, which is particularly interesting for audio recording and editing.
The Ubuntu Studio distribution packages a preemptive kernel as well as a bunch of quality free software devoted to audio and video editing.
It means that the operating system scheduler is free to suspend the execution of the running process and give the CPU to another process whenever it wants; the normal way to do this is to give each process that is waiting for the CPU a "quantum" of CPU time to run. After the quantum has expired, the scheduler takes back control (and the running process cannot prevent this) and gives another quantum to another process.
This method is often contrasted with cooperative multitasking, in which processes keep the CPU for as long as they need, without being interrupted; to let other applications run, they have to explicitly call some kind of "yield" function. Naturally, to avoid giving the feeling that the system is stuck, well-behaved applications yield the CPU often. Still, if there's a bug in an application (e.g., an infinite loop without yield calls), the whole system will hang, since the CPU is completely kept by the faulty program.
Almost all recent desktop OSes use preemptive multitasking, which, even if it's more expensive in terms of resources, is in general more stable (it's harder for a single faulty app to hang the whole system, since the OS is always in control). On the other hand, when resources are tight and applications are expected to be well-behaved, cooperative multitasking is used. Windows 3 was a cooperative multitasking OS; a more recent example is Rockbox, an open-source PMP firmware replacement.
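As a small user-space illustration of that "yield" call (my own sketch, not from the answer): on a preemptive OS like Linux, sched_yield() is merely a hint to the scheduler, but it is the same shape of call that a purely cooperative system relies on for fairness:

    #include <sched.h>
    #include <stdio.h>

    /* Toy "task": does a small slice of work, then voluntarily yields,
     * the way a well-behaved cooperative task is expected to. */
    static void do_slice(int task_id, int slice)
    {
        printf("task %d: slice %d\n", task_id, slice);
    }

    int main(void)
    {
        int slice;

        for (slice = 0; slice < 5; slice++) {
            do_slice(1, slice);
            sched_yield();      /* cooperative yield point (only a hint on Linux) */
            do_slice(2, slice);
            sched_yield();
        }
        return 0;
    }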
I think everyone has done a good job of explaining this, but I'm just going to add a little more information in the context of Linux IRQs, interrupts and the kernel scheduler.
The process scheduler is the component of the OS that is responsible for deciding whether the currently running job/process should continue to run and, if not, which process should run next.
A preemptive scheduler is a scheduler that allows a running process to be interrupted; the interrupted process changes state and another process is then allowed to run.
On the other hand, a non-preemptive (aka cooperative) scheduler can't take the CPU away from a process.
FYI, the word "cooperative" can be confusing because its meaning does not clearly indicate what the scheduler actually does.
For example, older versions of Windows such as 3.1 had cooperative schedulers.
Full credit to wonderful article here
I think the kernel became preemptive from 2.6. Preemptive means that when a new process is ready to run, the CPU can be allocated to the new process; it doesn't require the currently running process to cooperate and give up the CPU.
Saying the Linux kernel is preemptive means that the kernel supports preemption.
For example, suppose there are two processes, P1 (higher priority) and P2 (lower priority), which are making read system calls and running in kernel mode. Suppose P2 is currently running in kernel mode when P1 becomes ready to run.
If kernel preemption is available, then preemption can happen at the kernel level: P2 can be preempted and put to sleep, and P1 can run.
If kernel preemption is not available, then since P2 is in kernel mode, the system simply waits until P2 finishes its work in the kernel (or blocks), and only then can P1 be scheduled.

How to manage same CUDA kernel call from multiple CPU threads?

I have a CUDA kernel which works fine when called from a single CPU thread. However, when the same kernel is called from multiple CPU threads (~100), most of the kernel launches seem not to be executed at all, as the results come out to be all zeros. Can someone please guide me on how to resolve this problem?
In the current version of the code I am calling cudaDeviceSynchronize() at the end of the kernel call. Will adding a sync call before cudaMalloc() and the kernel call be of any help in this case?
There is another thing that needs clarification: if two CPU threads execute the same cudaMalloc() call, will the later one overwrite the former in GPU memory, or will they each get their own memory?
Thanks in advance for your help
Usually a single CPU thread is used for calling a CUDA kernel. However, since CUDA 4.0, multiple CPU threads can share the same context. You can use cuCtxSetCurrent to bind that context to the current thread before launching the kernel. More information about this API function can be found here.
Another workaround is to create a GPU worker thread that holds the context and passes any CUDA requests to that thread.
Regarding your other question, without setting the context for the proper thread, I remember that cudaMalloc would not even execute (I work with JCuda, so the behaviour may be a little different). But if the context is correctly set in the calling thread, the allocations will not overwrite each other; each cudaMalloc returns its own region of GPU memory.
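For what it's worth, with the runtime API (CUDA 4.0 and later) the context is shared per device within a process, so host threads usually don't need cuCtxSetCurrent at all: each thread can allocate its own buffer, use its own stream, and synchronise only its own work. The sketch below is hypothetical (the kernel, sizes and thread count are invented, and error checking is minimal):

    #include <stdio.h>
    #include <pthread.h>
    #include <cuda_runtime.h>

    /* Placeholder kernel; the asker's real kernel is not shown. */
    __global__ void scaleKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    /* Each host thread gets its own allocation and its own stream.
     * Two cudaMalloc() calls never return overlapping memory. */
    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        const int n = 1 << 16;
        float *d_buf = NULL;
        cudaStream_t stream;
        cudaError_t err;

        cudaSetDevice(0);                     /* same device, shared context (CUDA >= 4.0) */
        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaStreamCreate(&stream);

        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        err = cudaStreamSynchronize(stream);  /* wait for this thread's work only */
        printf("thread %d: %s\n", id, cudaGetErrorString(err));

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        return NULL;
    }

    int main(void)
    {
        enum { NTHREADS = 8 };
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

Checking the error returned by the per-thread synchronisation, as above, is also the quickest way to find out why launches silently produce zeros.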

What is the relationship between assembly and multi-core?

This is hard to word/ask so please bear with me:
When we see the assembly output, this is what is going to be executed on the core(s) of the CPU. However, if a CPU has multiple cores, is all of the assembly executed on the same core? At what point would the assembly from the same program begin executing on a different core?
So if I had (assembly pseudo):
ADD x, y, z
SUB p, x, q
how will I know whether ADD and SUB will execute on the same core? Is this linked to affinity? I thought affinity only pinned a process to a CPU, not a core?
I am asking this because I want to understand whether you can reasonably predict whether consecutive assembly instructions execute on the same core, and whether I can control that they only execute on the same core. I am trying to understand how the decision is made to move execution of the same program code from one core to a different core.
If execution can move (even when using affinity) from core 1 of CPU A to core 2, is this where QPI link speed comes into play, and does it also matter whether the caches are shared amongst the different CPU cores?
This is a rough overview that hopefully will provide you with the details you need.
Assembly code is translated into machine code; ie binary data, that is run by a CPU.
A CPU is the same as a core on a multi-core processor; ie a CPU is not the same as a processor (chip).
Every CPU has an instruction pointer that points to the instruction to execute next. This is incremented for every instruction executed.
So in a multi-core processor you would have one instruction pointer per core. To support more processes than there are available CPUs (or cores), the operating system will interrupt running processes and store their state (including the instruction pointer) at regular intervals. It will then restore the state of already interrupted processes and let them execute for a bit.
What core the execution is continued on is up to the operating system to decide, and is controlled by the affinity of the running thread (and probably some other settings also).
So to answer your question, there is no way of knowing if two adjacent assembly statements will run on the same core or not.
I'm talking mostly about Linux, but I guess what I am saying should be applicable to other OSes. However, without access to the Windows source code, no one can reliably say how it behaves in detail.
I think your "abstraction" of what a computer is doing is inadequate. Basically, a (mono-threaded) process (or just a thread) is running on some "virtual" CPU, whose instruction set is the unprivileged x86 machine instructions augmented by the ability to enter the kernel through syscalls (usually through a special instruction like SYSENTER). So from an application point of view, system calls to the Linux kernel are "atomic". See this and that answers.
Indeed, the processor is getting interrupts at arbitrary instants (on Linux, running cat /proc/interrupts twice with a one-second delay would show you how often it is getting interrupted, basically many thousands of times per second), and these interrupts are handled by the kernel. The kernel schedules tasks (e.g. threads or processes) preemptively (they can be interrupted and restarted by the kernel at any time).
From an application point of view, interrupts don't really exist (but the kernel can send signals to the process).
Cores, interrupts and caches are handled by the hardware and/or the kernel, so from the application point of view they don't really exist, except by "slowing down" the process. Cache coherency is mostly dealt with in hardware, and together with out-of-order execution it makes the execution time of even a tiny binary program unpredictable. (In other words, you cannot statically predict exactly how many milliseconds some given routine or loop will need; you can only measure it dynamically; read more about worst-case execution time.)
Reading the Advanced Linux Programming book and the Linux Assembly Howto would help.
You cannot normally predict where each individual instruction will execute. As long as an individual thread is executing continuously, it will run on the same core/processor, but you cannot predict at which instruction the thread will be switched out. The OS makes that decision, along with the decision of when to switch it back in and on which core/processor to put it, based on the workload of the system and priority levels, among other things.
You can usually specifically request from the OS that a thread should always run on the same core; this is called affinity. It is normally a bad idea and should only be done when absolutely necessary, because it takes away from the OS the flexibility to decide what to run where based on the workload; affinity will almost always result in a performance penalty.
Requesting processor-affinity is an extraordinary request that requires extraordinary proof that it would result in better performance. Don't try to outsmart the OS; the OS knows things about the current running environment that you don't know about.
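For completeness, if you do decide you need it, this is roughly what the request looks like on Linux (a hedged sketch; pinning to core 0 is an arbitrary choice, and sched_getcpu()/sched_setaffinity() are Linux-specific):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        /* Where is the scheduler currently running this thread? */
        printf("currently on core %d\n", sched_getcpu());

        /* Ask the kernel to keep this thread on core 0 only. */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* From here on, every instruction of this thread runs on core 0
         * (until the affinity mask is changed again). */
        printf("now pinned to core %d\n", sched_getcpu());
        return 0;
    }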

Are GPIO APIs in linux deterministic in time taken?

I need to call gpio_get_value, gpio_set_value and gpio_direction_input/output in my driver, and there is a timing requirement that these calls return in less than 5 µs.
Can gpiolib meet this requirement, or is it not deterministic? If not, what could be the solution? Directly accessing the GPIO registers?
Thanks a lot.
Calling one of those functions involves executing a Linux kernel function, so there are at least two context switches: one to run the kernel code and one to come back to userland.
These switches mean that time can be lost if interrupts or signals come in and have to be handled.
Anyway, if you need deterministic deadlines you need to switch to a real-time patched kernel; see the Real-Time Linux wiki homepage.
I don't know whether 5 µs is feasible; it depends on the system load and on the active drivers (e.g. is there a filesystem doing reads and writes?), but you can test.
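A rough way to "test" from inside the driver (my sketch; the GPIO number is hypothetical, and the line is assumed to be already requested and configured as an output) is to bracket a single call with ktime_get_ns():

    #include <linux/gpio.h>
    #include <linux/ktime.h>
    #include <linux/printk.h>

    #define DEMO_GPIO 42   /* hypothetical GPIO number; use your real line */

    /* Assumes gpio_request(DEMO_GPIO, "latency-test") and
     * gpio_direction_output(DEMO_GPIO, 0) succeeded earlier. */
    static void demo_measure_gpio_latency(void)
    {
        u64 t0, t1;

        t0 = ktime_get_ns();
        gpio_set_value(DEMO_GPIO, 1);
        t1 = ktime_get_ns();

        pr_info("gpio_set_value took %llu ns\n",
                (unsigned long long)(t1 - t0));
    }

Note that on a memory-mapped GPIO controller gpio_set_value() is typically just a register write, whereas for a GPIO expander behind I2C or SPI you would have to use the *_cansleep variants, and a 5 µs bound is then out of reach.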

