Is there any mechanism through which I can wake up a thread in another process without going through the kernel? The waiting thread might spin in a loop, no problem (each thread is pegged to a separate core), but in my case the sending thread has to be quick, and can't afford to go through the kernel to wake up the waiting thread.
No, not if the other thread is sleeping (not on a CPU). To wake such a thread you need to change its state to "RUNNING" by calling the scheduler, which is part of the kernel.
Yes, you can synchronize two threads or processes if both are running on different CPUs and there is shared memory between them. You should bind all threads to different CPUs. Then you may use a spinlock: the pthread_spin_lock and pthread_spin_unlock functions from the optional part of POSIX's Pthreads ('(ADVANCED REALTIME THREADS)'; [THR SPI]), or any custom spinlock. A custom spinlock will most likely use some atomic operations and/or memory barriers.
The sending thread changes a value in memory, which the receiving thread checks in a loop.
E.g.
init:
pthread_spinlock_t lock;
pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE); // must initialise before first use
                                                   // (PTHREAD_PROCESS_SHARED if the lock lives in inter-process shared memory)
pthread_spin_lock(&lock); // close the "mutex"
then start threads.
waiting thread:
{
    pthread_spin_lock(&lock); // wait for the event
    work();
}
main thread:
{
    do_smth();
    pthread_spin_unlock(&lock); // open the "mutex"; the other thread will see this change
                                // in ~150 CPU ticks (checked on Pentium 4 and Intel Core2 single-socket systems);
                                // the time of the operation itself is of the same order; didn't measure it.
    continue_work();
}
To signal to another process that it should continue, without forcing the sender to spend time in a kernel call, one mechanism comes to mind right away. Without kernel calls, all a process can do is modify memory; so the solution is inter-process shared memory. Once the sender writes to shared memory, the receiver should see the change without any explicit kernel calls, and naive polling by the receiver should work fine.
One cheap (but maybe not cheap enough) alternative is delegating the sending to a helper thread in the same process, and have the helper thread make a proper inter-process "semaphore release" or pipe write call.
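For illustration, here is a minimal sketch of the shared-memory approach, assuming Linux and POSIX shared memory (the segment name "/doorbell_shm" and the doorbell struct are invented for this example; error handling is omitted):

#include <atomic>
#include <fcntl.h>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

struct doorbell {                       // invented name for this example
    std::atomic<int> ring;
};

doorbell* map_doorbell(bool create) {
    int fd = shm_open("/doorbell_shm", create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (create)
        ftruncate(fd, sizeof(doorbell));
    void* p = mmap(nullptr, sizeof(doorbell), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    doorbell* d = static_cast<doorbell*>(p);
    if (create)
        new (d) doorbell{};             // construct the atomic exactly once
    return d;
}

// Sender: a single store, no kernel call on the fast path.
void ring_doorbell(doorbell* d) {
    d->ring.store(1, std::memory_order_release);
}

// Receiver: spin on a dedicated core until the flag flips, then re-arm it.
void await_doorbell(doorbell* d) {
    while (d->ring.load(std::memory_order_acquire) == 0)
        ;                               // busy-wait; the core is pegged anyway
    d->ring.store(0, std::memory_order_relaxed);
}

The store and the polling load never enter the kernel; only setting up the mapping does.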
I understand that you want to avoid using the kernel in order to avoid kernel-related overheads, most of which are context-switch related. Here is a demonstration of one way to accomplish what you need using signals, without spinning and without context switches:
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <iostream>
#include <thread>
using namespace std;
void sigRtHandler(int sig) {
  // Note: iostream output is not async-signal-safe; fine for a demo.
  cout << "Received signal" << endl;
}

int main() {
  constexpr static int kIter = 100000;
  thread t([]() {
    signal(SIGRTMIN, sigRtHandler);
    for (int i = 0; i < kIter; ++i) {
      usleep(1000);
    }
    cout << "Done" << endl;
  });
  usleep(1000); // Give the child time to set up the signal handler.
  auto handle = t.native_handle();
  for (int i = 0; i < kIter; ++i)
    pthread_kill(handle, SIGRTMIN);
  t.join();
  return 0;
}
If you run this code, you'll see that the child thread keeps receiving SIGRTMIN. While the process is running, if you look at the files /proc/(PID)/task/*/status for this process, you'll see that the parent thread does not incur context switches from calling pthread_kill().
The advantage of this approach is that the waiting thread doesn't need to spin. If the waiting thread's job is not time-sensitive, this approach allows you to save CPU.
Related
I have a program where one thread creates a file and other threads are waiting for the file to be created. Once the file is created, the other threads will read it and continue their processing. I want these other threads to run in parallel. The thread that creates the file does not know how many threads are waiting for it. What synchronization model should I use?
If you're on Linux, you might look at the inotify interface. This allows a thread (or threads) to watch a filesystem for something happening, e.g. a file being created.
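By way of illustration, a rough sketch of the inotify approach (simplified: directory path handling and the pre-existing-file race are glossed over):

// Sketch: block until a file with a given name appears in a directory.
// Linux-only; error handling omitted.
#include <sys/inotify.h>
#include <unistd.h>
#include <string>

void wait_for_file(const std::string& dir, const std::string& name) {
    int fd = inotify_init();                              // blocking descriptor
    int wd = inotify_add_watch(fd, dir.c_str(), IN_CREATE | IN_MOVED_TO);
    alignas(inotify_event) char buf[4096];
    bool found = false;
    while (!found) {
        ssize_t len = read(fd, buf, sizeof(buf));         // blocks for events
        for (ssize_t i = 0; i < len; ) {
            auto* ev = reinterpret_cast<inotify_event*>(buf + i);
            if (ev->len > 0 && name == ev->name)          // our file was created
                found = true;
            i += sizeof(inotify_event) + ev->len;
        }
    }
    inotify_rm_watch(fd, wd);
    close(fd);
}

Note the classic race: if the file may already exist before the watch is installed, check for it (e.g. with stat()) right after inotify_add_watch().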
Or you could use a pub/sub pattern in ZeroMQ; the creating thread makes the file, then sends some sort of message into a PUB socket. Whichever SUBscribers are listening would all get that message (they'd be blocked waiting for the message in the meantime), and then read the file. This would probably be more efficient.
If it is not known upfront how many threads are going to be waiting, you can use the 'completion' pattern - this is a mutex-protected flag, with a condition variable to signal changes:
int done = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
To wait for the completion (ie. in the threads waiting for the file to be created), you use the condition variable to wait for the flag to be set:
pthread_mutex_lock(&lock);
while (!done)
pthread_cond_wait(&cond, &lock);
pthread_mutex_unlock(&lock);
To signal the completion (ie. in the thread that has created the file), you set the flag and wake the waiters. Note that pthread_cond_broadcast is used rather than pthread_cond_signal, because an unknown number of threads may be waiting and signal would wake only one of them:
pthread_mutex_lock(&lock);
done = 1;
pthread_cond_broadcast(&cond); // wake every waiting thread, not just one
pthread_mutex_unlock(&lock);
You could wrap this all up in a 'completion' data type if you wanted.
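For instance, a small wrapper along these lines (the name completion_t is invented for this sketch) would do:

// A 'completion' type wrapping the flag/mutex/condvar trio.
#include <pthread.h>

typedef struct {
    int done;
    pthread_mutex_t lock;
    pthread_cond_t cond;
} completion_t;

#define COMPLETION_INITIALIZER { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }

void completion_wait(completion_t *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

void completion_complete_all(completion_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_broadcast(&c->cond); // any number of threads may be waiting
    pthread_mutex_unlock(&c->lock);
}

COMPLETION_INITIALIZER mirrors the static initializers used above.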
I have the following situation:
Two C++11 threads are working on a calculation and they are synchronized through a std::mutex.
Thread A locks the mutex until the data is ready for the operation Thread B executes. When the mutex is unlocked, Thread B starts to work.
Thread B tries to lock the mutex and is blocked until it is unlocked by Thread A.
void ThreadA(std::mutex* mtx, char* data)
{
    mtx->lock();
    // do something useful with data
    mtx->unlock();
}

void ThreadB(std::mutex* mtx, char* data)
{
    mtx->lock(); // wait until Thread A is ready
    // do something useful with data
    // ...
}
Assume it is guaranteed that Thread A locks the mutex first.
Now I am wondering whether the mtx->lock() in Thread B waits actively or passively. That is, is Thread B polling the mutex state and wasting processor time, or is it released passively by the scheduler when the mutex is unlocked?
In the different C++ references it is only mentioned that the thread is blocked, but not in which way.
Could it be, however, that the std::mutex implementation is highly dependent on the platform and OS used?
It's highly implementation-defined, even for the same compiler and OS.
For example, on VC++ in Visual Studio 2010, std::mutex was implemented with a Win32 CRITICAL_SECTION. EnterCriticalSection(CRITICAL_SECTION*) has a nice feature: first it tries to acquire the CRITICAL_SECTION by iterating on the lock again and again. After a specified number of iterations, it makes a kernel call which puts the thread to sleep, to be woken up again only when the lock is released, and the whole deal starts again.
In this case the mechanism polls the lock again and again before going to sleep; only then does control switch to the kernel.
Visual Studio 2012 came with a different implementation: std::mutex was implemented with a Win32 mutex. A Win32 mutex shifts control to the kernel immediately; there is no active polling done by the lock.
You can read about the implementation switch in this answer: std::mutex performance compared to win32 CRITICAL_SECTION
So, it is unspecified how the mutex acquires the lock, and it is best not to rely on such behaviour.
P.S. Do not lock the mutex manually; use std::lock_guard instead. Also, you might want to use std::condition_variable for a more refined way of controlling your synchronization.
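As a sketch of that advice (assuming the goal is simply for Thread B to wait until Thread A has prepared the data; the ready flag and function names are illustrative):

#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;

void threadA(char* data) {
    // ... prepare data ...
    {
        std::lock_guard<std::mutex> guard(mtx); // unlocks automatically
        ready = true;
    }
    cv.notify_one(); // wakes Thread B via the OS, no polling
}

void threadB(char* data) {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return ready; }); // sleeps in the kernel until notified
    // ... do something useful with data ...
}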
A friend of mine and I disagree on how synchronization is handled at the userspace level (in the pthread library).
a. I think that during a pthread_mutex_lock, the thread actively waits. Meaning the Linux scheduler runs this thread and lets it execute its code, which should look like:
while (mutex_resource->locked);
Then another thread is scheduled, which potentially frees the locked field, etc.
So this means that the scheduler waits for the thread to use up its time slice before switching to the next one, no matter what the thread is doing.
b. My friend thinks that the waiting thread somehow tells the kernel "Hey, I'm asleep, don't wait for me at all".
In this case, the kernel would schedule the next thread right away, without waiting for the current thread to complete its schedule time, being aware this thread is sleeping.
From what I see in the pthread code, it seems there is a loop handling the lock. But maybe I missed something.
In embedded systems, it could make sense to prevent the kernel from waiting. So he may be right (but I hope he does not :D).
Thanks!
a. I think that during a pthread_mutex_lock, the thread actively waits.
Yes, glibc's NPTL pthread_mutex_lock has an active wait (spinning),
BUT the spinning is used only for a very short amount of time and only for some types of mutexes. After that, pthread_mutex_lock will go to sleep by calling the Linux syscall futex with the WAIT operation.
Only mutexes of type PTHREAD_MUTEX_ADAPTIVE_NP will spin; the default is PTHREAD_MUTEX_TIMED_NP (a normal mutex), which does not spin (check MAX_ADAPTIVE_COUNT in the __pthread_mutex_lock sources).
If you want infinite spinning (active waiting), use the pthread_spin_lock function with pthread_spinlock_t-typed locks.
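For reference, here is a small sketch of how you could request the adaptive (spin-then-sleep) mutex type explicitly. PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable glibc extension, so this assumes glibc and _GNU_SOURCE:

// Creating a glibc adaptive mutex: spins briefly on contention before sleeping.
#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t m;

void init_adaptive_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}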
I'll consider the rest of your question as if you are using pthread_spin_lock:
Then, another thread is scheduled which potentially free the locked field, etc. So this means that the scheduler waits for the thread to complete its schedule time before switching to the next one, no matter what the thread is doing.
Yes, if there is contention for CPU cores, then your actively spinning thread may block other threads from executing, even if the other thread is the one that will unlock the mutex (spinlock) your thread needs.
But if there is no contention (no thread oversubscription), and the threads are scheduled on different cores (by coincidence, or by setting CPU affinity manually with sched_setaffinity or pthread_setaffinity_np), spinning will let you proceed faster than using the OS-based futex.
b. My friend thinks that the waiting thread somehow tells the kernel "Hey, I'm asleep, don't wait for me at all". In this case, the kernel would schedule the next thread right away, without waiting for the current thread to complete...
Yes, he is right.
futex is the modern way to tell the OS that this thread is waiting for some value in memory (for some mutex to open); in the current implementation futex also puts the thread to sleep. There is no need to wake it up to spin if the kernel knows when to wake this thread. How does it know? The lock owner, in pthread_mutex_unlock, checks whether any other threads are sleeping on this mutex. If there are, it calls futex with FUTEX_WAKE, telling the OS to wake some thread registered as a sleeper on this mutex.
There is no need to spin if the thread registers itself as a waiter with the OS.
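To make the wait/wake protocol concrete, here is a bare-bones sketch of the underlying syscall pair (Linux-only; real mutexes add spinning, waiter counting, and error handling on top of this):

// Minimal futex wait/wake pair. Linux-only sketch; no error handling.
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

// Waiter: sleep in the kernel as long as *addr still holds 'expected'.
// FUTEX_WAIT returns immediately if the value has already changed.
void futex_wait(std::atomic<int> *addr, int expected)
{
    while (addr->load(std::memory_order_acquire) == expected)
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

// Waker: publish the new value, then wake at most one sleeping waiter.
void futex_wake_one(std::atomic<int> *addr, int new_value)
{
    addr->store(new_value, std::memory_order_release);
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, nullptr, nullptr, 0);
}

The loop in futex_wait re-checks the flag because FUTEX_WAIT can return spuriously or with EAGAIN when the value has already changed.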
Some debugging with gdb on this test program:
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;

void* thr_func(void *arg)
{
    pthread_mutex_lock(&x); // blocks forever: main() already holds x
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t thr;
    pthread_mutex_lock(&x);
    pthread_create(&thr, NULL, thr_func, NULL);
    pthread_join(thr, NULL);
    return 0;
}
shows that a call to pthread_mutex_lock on a contended mutex results in a call to the futex system call with the op parameter set to FUTEX_WAIT (http://man7.org/linux/man-pages/man2/futex.2.html).
And this is the description of FUTEX_WAIT:
FUTEX_WAIT
This operation atomically verifies that the futex address uaddr still contains the value val, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents describe the maximum duration of the wait, which is infinite otherwise. The arguments uaddr2 and val3 are ignored.
So from this description I can say that if a mutex is locked, the thread will sleep rather than actively wait, and it will sleep until futex with op equal to FUTEX_WAKE is called.
I am trying to write some code to ensure all GPU activity (in particular all running threads) are stopped. I need to do this to unload a module with dlclose, so I need to ensure all threads have stopped on both the host and the device.
According to the CUDA documentation, cudaDeviceSynchronize:
Blocks until the device has completed all preceding requested tasks... If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
However, when I set the blocking sync flag and call cudaDeviceSynchronize, a new host thread is spawned, which is still running after cudaDeviceSynchronize has returned. This is the opposite of what I am trying to achieve.
This behaviour is demonstrated in an example program:
#include <iostream>
#include <unistd.h> // for sleep()

void initialiseDevice()
{
    cudaError result = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    if (cudaSuccess == result)
        std::cout << "Set device flags." << std::endl;
    else
        std::cout << "Could not set device flags. (" << result << ")"
                  << std::endl;
}

void synchroniseDevice()
{
    cudaError result = cudaDeviceSynchronize();

    if (cudaSuccess == result)
        std::cout << "Device synchronise returned success." << std::endl;
    else
        std::cout << "Device synchronise returned error. (" << result << ")"
                  << std::endl;
}

int main()
{
    initialiseDevice();
    sleep(1);
    synchroniseDevice(); // new thread is spawned here
    sleep(1);            // new thread is still running here!
    return 0;
}
If I compile this program with nvcc -g main.cu, and run it in gdb, a call to info threads shows that there are two threads running after cudaDeviceSynchronize has returned.
Output of info threads on the line after cudaDeviceSynchronize when running in gdb:
(gdb) info threads
Id Target Id Frame
2 Thread 0x7ffff5b8b700 (LWP 28458) "a.out" 0x00007ffff75aa023 in select
() at ../sysdeps/unix/syscall-template.S:82
* 1 Thread 0x7ffff7fd4740 (LWP 28255) "a.out" main () at cuda_test.cu:30
Could anyone help me understand why cudaDeviceSynchronize is spawning a new thread, and why the thread is still running after the call returns?
Could anyone point me in the right direction to help me find a method to block until all device and host activity/threads are finished?
CUDA 4.2 and later have intermediary worker threads that mediate blocking calls between application threads and the operating system. My testing suggests that one thread gets created for each GPU your application uses (one for each CUDA context?). I suspect these worker threads were introduced to make the implementation of stream event callbacks easier (I think these threads may execute the callbacks), although I could be entirely wrong about the technical reason.
I really wish NVIDIA had provided an environment variable to disable these intermediary threads. They introduce problems if you want to run your program as SCHED_FIFO. You must be sure to transition to SCHED_FIFO before any CUDA routines are invoked. Otherwise, any worker threads spawned prior to the SCHED_FIFO transition will be scheduled as regular threads while your main thread is SCHED_FIFO. This leads to priority inversions where your main thread is blocked waiting for a worker thread scheduled with a lower priority. Transitioning to SCHED_FIFO before any thread spawning allows future threads to inherit the parent's SCHED_FIFO policy and priority.
As for a solution to your problem: Can you call cudaDeviceReset() in the context of your application? This should hopefully reinitialize any CUDA runtime state in your system and kill off any worker threads. Otherwise, there's always pthread_cancel() (or Windows equivalent), but this may leave CUDA in an inconsistent state.
I'm trying to implement a simulation of a microcontroller. This simulation is not meant to do a clock cycle precise representation of one specific microcontroller but check the general correctness of the code.
I thought of having a "main thread" executing normal code and a second thread executing ISR code. Whenever an ISR needs to be run, the ISR thread suspends the "main thread".
Of course, I want to have a feature to block interrupts.
I thought of solving this with a mutex that the ISR thread holds whenever it executes ISR code while the main thread holds it as long as "interrupts are blocked".
A POR (power on reset) can then be implemented by not only suspending but killing the main thread (and starting a new one executing the POR function).
The Windows API provides the necessary functions.
But it seems to be impossible to do the above with POSIX threads (on Linux).
I don't want to change the actual hardware independent microcontroller code. So inserting anything to check for pending interrupts is not an option.
Receiving interrupts at points that are not well behaved is desirable, as this also happens on microcontrollers (unless you block interrupts).
Is there a way to suspend another thread on linux? (Debuggers must use that option somehow, I think.)
Please, don't tell me this is a bad idea. I know that is true in most circumstances. But the main code does not use standard libs or lock/mutexes/semaphores.
SIGSTOP does not work - it always stops the entire process.
Instead you can use some other signals, say SIGUSR1 for suspending and SIGUSR2 for resuming:
// at process start call init_pthread_suspending to install the handlers
// to suspend a thread use pthread_kill(thread_id, SUSPEND_SIG)
// to resume a thread use pthread_kill(thread_id, RESUME_SIG)
#include <signal.h>
#define RESUME_SIG SIGUSR2
#define SUSPEND_SIG SIGUSR1
static sigset_t wait_mask;
static __thread int suspended; // per-thread flag

void resume_handler(int sig)
{
    suspended = 0;
}

void suspend_handler(int sig)
{
    if (suspended) return;
    suspended = 1;
    do sigsuspend(&wait_mask); while (suspended);
}

void init_pthread_suspending()
{
    struct sigaction sa;

    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SUSPEND_SIG);
    sigdelset(&wait_mask, RESUME_SIG);

    sigfillset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = resume_handler;
    sigaction(RESUME_SIG, &sa, NULL);
    sa.sa_handler = suspend_handler;
    sigaction(SUSPEND_SIG, &sa, NULL);
}
I am very annoyed by replies like "you should not suspend another thread, that is bad".
Guys, why do you assume others are idiots who don't know what they are doing? Imagine that others, too, have heard about deadlocking and still, in full consciousness, want to suspend other threads.
If you don't have a real answer to their question, why do you waste your time and the readers'?
And yes, IMO pthreads are a very short-sighted API, a disgrace for POSIX.
The HotSpot Java VM uses SIGUSR2 to implement suspend/resume for Java threads on Linux.
A procedure based on a signal handler for SIGUSR2 might be:
Providing a signal handler for SIGUSR2 allows a thread to request a lock (which has already been acquired by the signal-sending thread).
This suspends the thread.
As soon as the suspending thread releases the lock, the signal handler can (and will?) acquire it. The signal handler then releases the lock immediately and returns.
This resumes the thread.
It will probably be necessary to introduce a control variable to make sure that the main thread is in the signal handler before starting the actual processing of the ISR.
(The details depend on whether the signal handler is called synchronously or asynchronously.)
I don't know if this is exactly how it is done in the Java VM, but I think the above procedure does what I need.
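A sketch of that procedure might look like the following. All names here (suspend_lock, in_handler, suspend_thread) are invented for this illustration, and note that pthread_mutex_lock is not async-signal-safe, so this demonstrates the idea rather than a robust implementation:

// Hypothetical HotSpot-style suspend/resume via a lock held by the suspender.
#include <pthread.h>
#include <signal.h>

static pthread_mutex_t suspend_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile sig_atomic_t in_handler = 0; // the control variable from the text

void sigusr2_handler(int sig)
{
    in_handler = 1;
    pthread_mutex_lock(&suspend_lock);   // blocks: the suspender already holds it
    pthread_mutex_unlock(&suspend_lock); // lock released -> resume immediately
    in_handler = 0;
}

// Called by the controlling ("ISR") thread.
void suspend_thread(pthread_t target)
{
    pthread_mutex_lock(&suspend_lock); // take the lock first...
    pthread_kill(target, SIGUSR2);     // ...then park the target in the handler
    while (!in_handler) {}             // wait until the target has really stopped
}

void resume_thread(void)
{
    pthread_mutex_unlock(&suspend_lock); // handler acquires, releases, returns
}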
Somehow I think sending the other thread SIGSTOP works.
However, you are far better off writing some thread communication involving semaphores, mutexes and global variables.
You see, if you suspend the other thread inside malloc() and then call malloc() yourself -> deadlock.
Did I mention that lots of C standard library functions, let alone other libraries you use, will call malloc() behind your back?
EDIT:
Hmmm, no standard library code. Maybe use setjmp/longjump() from signal handler to simulate the POR and a signal handier to simulate interrupt.
TO THOSE WHO KEEP DOWNVOTING THIS: The answer was accepted for the contents after EDIT, which address one specific scenario and cannot be used in any other.
Solaris has the thr_suspend(3C) call that would do what you want. Is switching to Solaris a possibility?
Other than that, you're probably going to have to do some gymnastics with mutexes and/or semaphores. The problem is that you'll only suspend when you check the mutex, which will probably be at a well-behaved point. Depending on what you're actually trying to accomplish, this might not be desirable.
It makes more sense to have the main thread execute the ISRs - because that's how the real controller works (presumably). Just have it check after each emulated instruction whether there is an interrupt pending and interrupts are currently enabled - if so, emulate a call to the ISR.
The second thread is still used - but it just listens for the conditions which cause an interrupt, and marks the relevant interrupt as pending (for the other thread to pick up later).
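A minimal sketch of that structure (every name here - pending_irqs, emulate_one_instruction, emulate_isr_call - is invented for illustration, and __builtin_ctz assumes GCC/Clang):

// Hypothetical emulator core: the main thread runs instructions and dispatches ISRs.
#include <atomic>

std::atomic<unsigned> pending_irqs{0}; // bitmask, set by the listener thread
bool interrupts_enabled = true;        // the emulated global interrupt enable

void emulate_one_instruction();        // assumed to exist in the emulator
void emulate_isr_call(int irq);        // assumed: pushes PC, jumps to the vector

void emulator_main_loop()
{
    for (;;) {
        emulate_one_instruction();
        unsigned pending = pending_irqs.load(std::memory_order_acquire);
        if (interrupts_enabled && pending) {
            int irq = __builtin_ctz(pending);     // lowest pending IRQ wins
            pending_irqs.fetch_and(~(1u << irq)); // clear the pending bit
            emulate_isr_call(irq);
        }
    }
}

// Listener thread: never touches emulator state, only flags the interrupt.
void raise_interrupt(int irq)
{
    pending_irqs.fetch_or(1u << irq, std::memory_order_release);
}

Because the check happens between emulated instructions, interrupts arrive at instruction boundaries, which is also how the real hardware behaves.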