I am trying to write some code to ensure that all GPU activity (in particular, all running threads) has stopped. I need to do this before unloading a module with dlclose, so I need to ensure that all threads have stopped on both the host and the device.
According to the CUDA documentation, cudaDeviceSynchronize:
Blocks until the device has completed all preceding requested tasks... If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
However, when I set the blocking sync flag and call cudaDeviceSynchronize, a new host thread is spawned, which is still running after cudaDeviceSynchronize has returned. This is the opposite of what I am trying to achieve.
This behaviour is demonstrated in an example program:
#include <cuda_runtime.h>
#include <unistd.h> // for sleep()
#include <iostream>

void initialiseDevice()
{
    cudaError result = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    if (cudaSuccess == result)
        std::cout << "Set device flags." << std::endl;
    else
        std::cout << "Could not set device flags. (" << result << ")"
                  << std::endl;
}

void synchroniseDevice()
{
    cudaError result = cudaDeviceSynchronize();

    if (cudaSuccess == result)
        std::cout << "Device synchronise returned success." << std::endl;
    else
        std::cout << "Device synchronise returned error. (" << result << ")"
                  << std::endl;
}

int main()
{
    initialiseDevice();
    sleep(1);
    synchroniseDevice(); // new thread is spawned here
    sleep(1);            // new thread is still running here!
    return 0;
}
If I compile this program with nvcc -g main.cu, and run it in gdb, a call to info threads shows that there are two threads running after cudaDeviceSynchronize has returned.
Output of info threads on the line after the synchroniseDevice() call when running in gdb:
(gdb) info threads
Id Target Id Frame
2 Thread 0x7ffff5b8b700 (LWP 28458) "a.out" 0x00007ffff75aa023 in select
() at ../sysdeps/unix/syscall-template.S:82
* 1 Thread 0x7ffff7fd4740 (LWP 28255) "a.out" main () at cuda_test.cu:30
Could anyone help me understand why cudaDeviceSynchronize is spawning a new thread, and why the thread is still running after the call returns?
Could anyone point me in the right direction to help me find a method to block until all device and host activity/threads are finished?
CUDA 4.2 and later have intermediary worker threads that mediate blocking calls between application threads and the operating system. My testing suggests that one thread is created for each GPU the application uses (one for each CUDA context?). I suspect these worker threads were introduced to make the implementation of stream event callbacks easier (these threads may be what executes the callbacks), although I could be entirely wrong about the technical reason.
I really wish NVIDIA had provided an environment variable to disable these intermediary threads. They cause problems if you want to run your program as SCHED_FIFO: you must be sure to transition to SCHED_FIFO before any CUDA routines are invoked. Otherwise, any worker threads spawned prior to the SCHED_FIFO transition will be scheduled as regular threads while your main thread is SCHED_FIFO. This leads to priority inversions, where your main thread is blocked waiting for a worker thread to be scheduled at a lower priority. Transitioning to SCHED_FIFO before any threads are spawned lets future threads inherit the parent's SCHED_FIFO policy and priority.
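For reference, here is a minimal sketch of forcing that transition before the runtime initializes; the priority value is an illustrative choice, not a recommendation:

#include <sched.h>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    struct sched_param sp = {};
    sp.sched_priority = 10; // illustrative value; choose for your system

    // Switch to SCHED_FIFO *before* the first CUDA call, so any worker
    // threads the runtime spawns inherit the policy and priority.
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler"); // usually requires root or CAP_SYS_NICE

    cudaFree(0); // first CUDA call; the runtime initializes here
    // ... rest of the application ...
    return 0;
}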
As for a solution to your problem: can you call cudaDeviceReset() in your application? This should reinitialize any CUDA runtime state in your process and kill off any worker threads. Otherwise, there's always pthread_cancel() (or the Windows equivalent), but this may leave CUDA in an inconsistent state.
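For instance, a minimal sketch of the unload sequence under that assumption; the function and handle names here are illustrative, not from the question:

#include <cuda_runtime.h>
#include <dlfcn.h>

// Hypothetical helper: tear down CUDA before unloading a plugin.
void unload_cuda_module(void *dl_handle)
{
    cudaDeviceSynchronize(); // drain all outstanding device work
    cudaDeviceReset();       // destroy the context and, with luck, the
                             // runtime's worker threads along with it
    dlclose(dl_handle);      // no device code should be executing now
}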
Related
I have a program where one thread creates a file and other threads are waiting for the file to be created. Once the file is created the other threads will read this file and continue their processing. I want these other threads to run in parallel. The thread that creates the file does not know how many threads are waiting for it. What synchronization model should I use?
If you're on Linux, you might look at the inotify interface, which allows a thread (or threads) to watch a filesystem for something happening, e.g. a file being created; see inotify(7).
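A minimal single-watcher sketch; the directory path is an arbitrary choice for illustration:

#include <sys/inotify.h>
#include <unistd.h>
#include <limits.h>
#include <cstdio>

int main()
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    // Watch the directory the file will appear in ("/tmp" is illustrative).
    int wd = inotify_add_watch(fd, "/tmp", IN_CREATE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    alignas(struct inotify_event) char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
    ssize_t n = read(fd, buf, sizeof buf); // blocks until an event arrives
    if (n > 0) {
        struct inotify_event *ev = (struct inotify_event *)buf;
        printf("created: %s\n", ev->len ? ev->name : "?");
    }
    close(fd);
    return 0;
}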
Or you could use a pub/sub pattern in ZeroMQ; the creating thread makes the file, then sends some sort of message into a PUB socket. Whichever SUBscribers are listening would all get that message (they'd be blocked waiting for the message in the meantime), and then read the file. This would probably be more efficient.
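A rough sketch of both sides using the libzmq C API; the endpoint name is an illustrative choice, and for threads in one process both sides would share the same context:

#include <zmq.h>

// Creating thread: make the file, then publish a notification.
void notify_ready(void *ctx)
{
    void *pub = zmq_socket(ctx, ZMQ_PUB);
    zmq_bind(pub, "inproc://file-ready");
    // ... create the file here ...
    zmq_send(pub, "ready", 5, 0); // every connected subscriber gets this
    zmq_close(pub);
}

// Waiting thread: block until the notification arrives, then read the file.
void wait_ready(void *ctx)
{
    void *sub = zmq_socket(ctx, ZMQ_SUB);
    zmq_connect(sub, "inproc://file-ready");
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0); // subscribe to everything
    char buf[8];
    zmq_recv(sub, buf, sizeof buf, 0); // blocks until the message arrives
    // ... read the file here ...
    zmq_close(sub);
}

The usual PUB/SUB caveat applies: subscribers must be connected before the message is sent, or they miss it (the slow-joiner problem).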
If it is not known upfront how many threads are going to be waiting, you can use the 'completion' pattern: a mutex-protected flag, with a condition variable to signal changes:
int done = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
To wait for the completion (i.e. in the threads waiting for the file to be created), you use the condition variable to wait for the flag to be set:
pthread_mutex_lock(&lock);
while (!done)
    pthread_cond_wait(&cond, &lock);
pthread_mutex_unlock(&lock);
To signal the completion (i.e. in the thread that has created the file), you set the flag and broadcast on the condition variable, so that every waiter wakes up (pthread_cond_signal would wake only one of the unknown number of waiters):
pthread_mutex_lock(&lock);
done = 1;
pthread_cond_broadcast(&cond);
pthread_mutex_unlock(&lock);
You could wrap this all up in a 'completion' data type if you wanted.
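For instance, a minimal sketch of such a type; the names are my own:

#include <pthread.h>

// A one-shot 'completion': any number of threads can wait, one signals.
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
};
#define COMPLETION_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

void completion_wait(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

void completion_signal(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_broadcast(&c->cond); // wake every waiter
    pthread_mutex_unlock(&c->lock);
}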
On a Linux system, does the child process view the existing threads the same way as the parent process?
int main() {
    // create thread 1
    int child_pid = fork();
    if (0 == child_pid)
    {
        .. // child
    }
    else
    {
        .. // parent
    }
}
Since the whole address space is copied for the child process, what happens to the state of the threads? What if thread 1 in the above segment is waiting on a condition variable? Is it in the waiting state in the child process as well?
Threads on Linux nowadays stay POSIX-compliant: only the calling thread is replicated, not the other threads (note that on Solaris, for example, you can choose what fork() does depending on which library you link against).
From http://www.opengroup.org/onlinepubs/000095399/functions/fork.html (POSIX 2004):
A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called. Fork handlers may be established by means of the pthread_atfork() function in order to maintain application invariants across fork() calls.
The POSIX 2018 specification of fork() is similar.
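A small demonstration sketch of this behaviour; the helper names are my own:

#include <pthread.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cstdio>

void *worker(void *)
{
    for (;;)
        pause(); // sleeps forever; after fork() it exists only in the parent
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1); // let the worker start

    pid_t pid = fork();
    if (pid == 0) {
        // The child contains exactly one thread: the replica of the caller.
        // Compare `ls /proc/self/task` here and in the parent to confirm.
        printf("child: single-threaded\n");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}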
Threads are not inherited by the child process on a Linux system using fork(). An in-depth source is here: http://linas.org/linux/threads-faq.html
I am currently learning about Semaphores.
I know they are there to restrict access to resources in a concurrent system.
E.g. in Java, the Semaphore class has the methods acquire() and release(), which calling threads use.
Now in general, when a semaphore has four permits open and four threads try to acquire access, access will be granted.
But if a fifth thread tries to access the resource, that thread has to be blocked or put to sleep somehow. Do I as a programmer have to implement something like
if (semaphore.hasSpaceLeft()) {
    semaphore.acquire();
    resource.access();
} else {
    sleep until semaphore.hasSpaceLeft();
}
Or will the semaphore handle sleeping/blocking the calling thread?
(not only in Java but in general)
No; according to the docs:
public void acquire()
throws InterruptedException
Acquires a permit from this semaphore, *blocking until one is available*, or the thread is interrupted.
Emphasis mine.
Actually, your proposed program has a bug - a race condition - in it:
if (semaphore.hasSpaceLeft()) { // 1
    semaphore.acquire();        // 2
    resource.access();
} else {
    sleep until semaphore.hasSpaceLeft();
}
In the time window between 1 and 2, another thread might acquire the semaphore. That's why, in general, any semaphore implementation has to be designed around a blocking acquire function. Alternatively, you can use the tryAcquire() method:
if (semaphore.tryAcquire()) {
    resource.access();
} else {
    doSomethingElseInTheMeantime();
}
What does "blocking" mean?
In general, the operating system (OS) is responsible for scheduling threads. If a thread cannot acquire a semaphore, it is suspended by the OS (i.e. it is marked as not eligible to run, so the scheduler ignores it when selecting threads to run) until the semaphore can be acquired again.
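The same contract holds outside Java. For instance, a short sketch with C++20's std::counting_semaphore, mirroring the four-permit example above:

#include <semaphore>
#include <thread>
#include <vector>
#include <cstdio>

std::counting_semaphore<4> sem(4); // four permits

void access_resource(int id)
{
    sem.acquire(); // blocks (the thread is suspended) until a permit is free;
                   // no manual "sleep until" loop is needed
    printf("thread %d in critical section\n", id);
    sem.release();
}

int main()
{
    std::vector<std::thread> ts;
    for (int i = 0; i < 5; ++i) // the fifth thread simply blocks in acquire()
        ts.emplace_back(access_resource, i);
    for (auto &t : ts)
        t.join();
    return 0;
}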
If I have a program that has N running threads, and N-1 of them block delivery of the SIGUSR1 signal using pthread_sigmask:
int rc;
sigset_t signal_mask;
sigemptyset(&signal_mask);
sigaddset(&signal_mask, SIGUSR1);
rc = pthread_sigmask(SIG_BLOCK, &signal_mask, NULL);
if (rc != 0) {
    // handle error
}
When the OS (Linux, recent kernel) delivers SIGUSR1 to the process, is it guaranteed to be delivered to the unblocked thread? Or could it, for example, try some subset of the blocked threads and then give up?
Yes, it is guaranteed that a process-directed signal will be delivered to one of the threads that has it unblocked (if there are any). The relevant quote from POSIX Signal Generation and Delivery:
Signals generated for the process shall be delivered to exactly one of those threads within the process which is in a call to a sigwait() function selecting that signal or has not blocked delivery of the signal.
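A small demonstration sketch of that guarantee; the timings and names are illustrative:

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static void handler(int)
{
    // Only async-signal-safe calls in a handler.
    const char msg[] = "handler ran in the unblocked thread\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
}

static void *blocked_worker(void *)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &mask, NULL); // this thread opts out
    sleep(2);
    return NULL;
}

int main()
{
    signal(SIGUSR1, handler); // dispositions are process-wide
    pthread_t t;
    pthread_create(&t, NULL, blocked_worker, NULL);
    sleep(1); // let the worker install its mask

    kill(getpid(), SIGUSR1); // process-directed: must be delivered to main,
                             // the only thread with SIGUSR1 unblocked
    pthread_join(t, NULL);
    return 0;
}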
Is there any mechanism through which I can wake up a thread in another process without going through the kernel? The waiting thread might spin in a loop, no problem (each thread is pegged to a separate core), but in my case the sending thread has to be quick, and can't afford to go through the kernel to wake up the waiting thread.
No, not if the other thread is sleeping (not on a CPU). To wake such a thread you need to change its state to "RUNNING" by calling the scheduler, which is part of the kernel.
Yes, you can synchronize two threads or processes if both are running on different CPUs and there is shared memory between them. You should bind all threads to different CPUs. Then you can use a spinlock: the pthread_spin_lock and pthread_spin_unlock functions from an optional part of POSIX threads ('ADVANCED REALTIME THREADS', [THR SPI]), or any custom spinlock. A custom spinlock will most likely use some atomic operations and/or memory barriers.
The sending thread changes a value in memory, which the receiving thread checks in a loop.
E.g.
init:
pthread_spinlock_t lock;
pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE); // use PTHREAD_PROCESS_SHARED
                                                   // (with the lock in shared
                                                   // memory) for two processes
pthread_spin_lock(&lock); // close the "mutex"
then start the threads.
waiting thread:
{
    pthread_spin_lock(&lock); // wait for event: spins until unlocked
    work();
}
main thread:
{
    do_smth();
    pthread_spin_unlock(&lock); // open the mutex; the other thread will see
                                // this change in ~150 CPU ticks (checked on
                                // Pentium 4 and Intel Core 2 single-socket
                                // systems); the operation itself is of the
                                // same order; I didn't measure it exactly.
    continue_work();
}
To signal to another process that it should continue, without forcing the sender to spend time in a kernel call, one mechanism comes to mind right away. Without kernel calls, all a process can do is modify memory; so the solution is inter-process shared memory. Once the sender writes to shared memory, the receiver should see the change without any explicit kernel calls, and naive polling by the receiver should work fine.
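A minimal sketch of that idea, assuming POSIX shared memory; the object name is an arbitrary choice for illustration:

#include <atomic>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Both processes map the same object. shm_open + ftruncate zero-fills the
// memory, which in practice serves as the initial 0 for this int-sized atomic.
std::atomic<int> *map_flag(bool create)
{
    int fd = shm_open("/wakeup_flag", create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) return nullptr;
    if (create) ftruncate(fd, sizeof(std::atomic<int>));
    void *p = mmap(nullptr, sizeof(std::atomic<int>),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return static_cast<std::atomic<int> *>(p);
}

void sender(std::atomic<int> *flag)
{
    flag->store(1, std::memory_order_release); // one store, no kernel call
}

void receiver(std::atomic<int> *flag)
{
    while (flag->load(std::memory_order_acquire) == 0)
        ; // spin on its own dedicated core
    // ... continue ...
}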
One cheap (but maybe not cheap enough) alternative is delegating the sending to a helper thread in the same process, and have the helper thread make a proper inter-process "semaphore release" or pipe write call.
I understand that you want to avoid using the kernel in order to avoid kernel-related overhead, most of which is context-switch related. Here is a demonstration of one way to accomplish what you need using signals, without spinning and without context switches:
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <iostream>
#include <thread>

using namespace std;

void sigRtHandler(int sig) {
    // Note: iostream is not async-signal-safe; acceptable for a demo only.
    cout << "Received signal" << endl;
}

int main() {
    constexpr static int kIter = 100000;

    thread t([]() {
        signal(SIGRTMIN, sigRtHandler);
        for (int i = 0; i < kIter; ++i) {
            usleep(1000);
        }
        cout << "Done" << endl;
    });

    usleep(1000); // Give the child time to set up the signal handler.

    auto handle = t.native_handle();
    for (int i = 0; i < kIter; ++i)
        pthread_kill(handle, SIGRTMIN);

    t.join();
    return 0;
}
If you run this code, you'll see that the child thread keeps receiving SIGRTMIN. While the process is running, if you look at the files /proc/(PID)/task/*/status for this process, you'll see that the parent thread does not incur context switches from calling pthread_kill().
The advantage of this approach is that the waiting thread doesn't need to spin. If the waiting thread's job is not time-sensitive, this approach allows you to save CPU.