I'm observing behaviour that doesn't seem to be in line with how pthread_cond_signal and pthread_cond_wait should behave (according to the man pages). man 3 pthread_cond_signal stipulates that:
The pthread_cond_signal() function shall unblock at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
This isn't precise enough and doesn't clarify if, at the same time, the thread calling pthread_cond_signal will yield its time back to the scheduler.
Here's an example program:
#include <pthread.h>
#include <iostream>
#include <time.h>
#include <unistd.h>

int mMsg = 0;
pthread_mutex_t mMsgMutex;
pthread_cond_t mMsgCond;
pthread_t consumerThread;
pthread_t producerThread;

void* producer(void* data) {
    (void) data;
    while(true) {
        pthread_mutex_lock(&mMsgMutex);
        std::cout << "1> locked" << std::endl;
        mMsg += 1;
        std::cout << "1> sending signal, mMsg = " << mMsg << "" << std::endl;
        pthread_cond_signal(&mMsgCond);
        pthread_mutex_unlock(&mMsgMutex);
    }
    return nullptr;
}

void* consumer(void* data) {
    (void) data;
    pthread_mutex_lock(&mMsgMutex);
    while(true) {
        while (mMsg == 0) {
            pthread_cond_wait(&mMsgCond, &mMsgMutex);
        }
        std::cout << "2> wake up, msg: " << mMsg << std::endl;
        mMsg = 0;
    }
    return nullptr;
}

int main()
{
    pthread_mutex_init(&mMsgMutex, nullptr);
    pthread_cond_init(&mMsgCond, nullptr);
    pthread_create(&consumerThread, nullptr, consumer, nullptr);
    std::cout << "starting producer..." << std::endl;
    sleep(1);
    pthread_create(&producerThread, nullptr, producer, nullptr);
    pthread_join(consumerThread, nullptr);
    pthread_join(producerThread, nullptr);
    return 0;
}
Here's the output:
starting producer...
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
1> locked
1> sending signal, mMsg = 4
1> locked
1> sending signal, mMsg = 5
1> locked
1> sending signal, mMsg = 6
1> locked
1> sending signal, mMsg = 7
1> locked
1> sending signal, mMsg = 8
1> locked
1> sending signal, mMsg = 9
1> locked
1> sending signal, mMsg = 10
2> wake up, msg: 10
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
It seems like there's no guarantee that a pthread_cond_signal call will immediately unblock a thread waiting in pthread_cond_wait. At the same time, it seems that any number of pthread_cond_signal calls can be lost after the first one has been issued.
Is this really the intended behaviour or am I doing something wrong here?
This is the intended behavior. pthread_cond_signal does not yield its remaining runtime; the calling thread simply continues to run.
And yes, pthread_cond_signal will immediately unblock one (or more) of the threads waiting on the corresponding condition variable. However, that doesn't guarantee that the unblocked thread will run immediately. The call just tells the OS that this thread is no longer blocked, and it's up to the OS thread scheduler to decide when to start running it. The woken thread also has to reacquire the mutex before pthread_cond_wait can return. Since the signalling thread is already running, is hot in the cache etc., it will likely have plenty of time to do something (here: lock the mutex again and increment mMsg several more times) before the now-unblocked thread starts doing anything.
In your example above, if you don't want to skip messages, maybe what you're looking for is something like a producer-consumer queue, maybe backed by a ring buffer.
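A minimal sketch of such a queue, assuming a fixed-capacity ring buffer of ints guarded by one mutex and two condition variables. The names BoundedQueue, produceHundred and consumeHundred are made up for illustration and are not from the question. Because push blocks while the buffer is full, the producer cannot run ahead of the consumer by more than the capacity, so no message is skipped:

```cpp
#include <pthread.h>
#include <cstddef>

// Hypothetical fixed-capacity ring buffer; not part of the original program.
class BoundedQueue {
public:
    BoundedQueue() : head_(0), count_(0) {
        pthread_mutex_init(&mutex_, nullptr);
        pthread_cond_init(&notEmpty_, nullptr);
        pthread_cond_init(&notFull_, nullptr);
    }
    ~BoundedQueue() {
        pthread_cond_destroy(&notFull_);
        pthread_cond_destroy(&notEmpty_);
        pthread_mutex_destroy(&mutex_);
    }

    // Blocks while the buffer is full, so the producer cannot overrun the consumer.
    void push(int value) {
        pthread_mutex_lock(&mutex_);
        while (count_ == kCapacity)
            pthread_cond_wait(&notFull_, &mutex_);
        buffer_[(head_ + count_) % kCapacity] = value;
        ++count_;
        pthread_cond_signal(&notEmpty_);
        pthread_mutex_unlock(&mutex_);
    }

    // Blocks while the buffer is empty; returns the oldest value.
    int pop() {
        pthread_mutex_lock(&mutex_);
        while (count_ == 0)
            pthread_cond_wait(&notEmpty_, &mutex_);
        int value = buffer_[head_];
        head_ = (head_ + 1) % kCapacity;
        --count_;
        pthread_cond_signal(&notFull_);
        pthread_mutex_unlock(&mutex_);
        return value;
    }

private:
    static constexpr std::size_t kCapacity = 4;  // arbitrary size for the sketch
    int buffer_[kCapacity];
    std::size_t head_, count_;
    pthread_mutex_t mutex_;
    pthread_cond_t notEmpty_, notFull_;
};

// Producer thread body: pushes 1..100 into the queue passed as `data`.
void* produceHundred(void* data) {
    BoundedQueue* queue = static_cast<BoundedQueue*>(data);
    for (int i = 1; i <= 100; ++i) queue->push(i);
    return nullptr;
}

// Starts one producer, consumes on the calling thread, returns the sum received.
long consumeHundred() {
    BoundedQueue queue;
    pthread_t producerThread;
    pthread_create(&producerThread, nullptr, produceHundred, &queue);
    long sum = 0;
    for (int i = 0; i < 100; ++i) sum += queue.pop();
    pthread_join(producerThread, nullptr);
    return sum;
}
```

Calling consumeHundred() from main should return 5050, the sum of 1 through 100, regardless of how the scheduler interleaves the two threads. Build with -pthread.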
I'm new to threads. I have the code below:
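#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid) {
    long tid = (long) threadid;
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);  /* assumed: the worker ends by exiting its thread */
}

int main(int argc, char *argv[]) {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *) t);
        if (rc) {  /* assumed error check; these lines are missing from the post */
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);  /* the "last line" discussed below */
}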
I compile and run it, and all 5 sub-threads print their message.
But when I delete the last line, "pthread_exit(NULL)", the output is sometimes the same as above, with all 5 sub-threads printing, and sometimes only 4 sub-threads print, for instance threads 0-3.
Help me with this, please!
Omitting that pthread_exit call in main will cause main to implicitly return and terminate the process, including all running threads.
Now you have a race condition: will your five worker threads print their messages before main terminates the whole process? Sometimes yes, sometimes no, as you have observed.
If you keep the pthread_exit call in main, only the main thread exits at that point; the process keeps running and terminates when the last remaining thread calls pthread_exit.
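Another common way to remove the race is to have main wait for each worker with pthread_join before it returns. The following is a sketch of that alternative, not the code from the question: the ran array and the runAndJoin function are hypothetical names added so the result can be checked.

```cpp
#include <pthread.h>
#include <cstdio>

#define NUM_THREADS 5

// Hypothetical shared array in which each worker records that it ran.
static int ran[NUM_THREADS];

void* PrintHello(void* threadid) {
    long tid = (long)threadid;
    std::printf("Hello World! It's me, thread #%ld!\n", tid);
    ran[tid] = 1;
    return nullptr;
}

// Starts the workers, waits for each one, and returns how many reported in.
int runAndJoin() {
    pthread_t threads[NUM_THREADS];
    int started = 0;
    for (long t = 0; t < NUM_THREADS; ++t) {
        ran[t] = 0;
        if (pthread_create(&threads[t], nullptr, PrintHello, (void*)t) != 0) break;
        ++started;
    }
    int finished = 0;
    for (int t = 0; t < started; ++t) {
        pthread_join(threads[t], nullptr);  // blocks until thread t has returned
        finished += ran[t];                 // safe to read: join synchronizes with the worker
    }
    return finished;
}
```

With this structure main can simply return runAndJoin() == NUM_THREADS ? 0 : 1; since every worker has finished by the time main returns, all five messages are printed on every run.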
What is the behavior of cudaStreamSynchronize in the following case?
ThreadA pseudo code :
submit new cuda Kernel to cudaStreamX
ThreadB pseudo code:
call cudaStreamSynchronize(cudaStreamX)
My question is: when will ThreadB return? ThreadA keeps pushing new CUDA kernels, so cudaStreamX never becomes empty.
The API documentation isn't explicit about this; however, the CUDA C Programming Guide essentially is:
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed
Furthermore, I think the following is sensible:
cudaStreamSynchronize() cannot reasonably take into account work issued to a stream after that cudaStreamSynchronize() call. This would more or less require it to know the future.
cudaStreamSynchronize() should reasonably be expected to return after all work previously issued to that stream is complete.
Putting together an experimental test app, the above description is what I observe:
$ cat t396.cu
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <unistd.h>
const int PTHREADS=2;
const int TRIGGER1=5;
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define DELAY_T 1000000000ULL
template <int type>
__global__ void delay_kern(int i){
    // Spin on the device clock for roughly i*DELAY_T cycles.
    unsigned long long time = clock64();
#ifdef DEBUG
    printf("hello %d\n", type);
#endif
    while (clock64() < time+(i*DELAY_T));
}
volatile static int flag, flag0, loop_cnt;
// The thread configuration structure.
typedef struct
{
    int my_thread_ordinal;
    pthread_t thread;
    cudaError_t status;
    cudaStream_t stream;
    int delay_usec;
} config_t;
// The function executed by each thread assigned with CUDA device.
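void *thread_func(void *arg)
{
    // Unpack the config structure.
    config_t *config = (config_t *)arg;
    int my_thread=config->my_thread_ordinal;
    cudaError_t cuda_status = cudaSuccess;
    cuda_status = cudaSetDevice(0);
    if (cuda_status != cudaSuccess) {
        fprintf(stderr, "Cannot set focus to device %d, status = %d\n",
                0, cuda_status);
        config->status = cuda_status;
        return NULL;  // assumed early exit; this line is missing from the post
    }
    printf("thread %d initialized\n", my_thread);
    switch (config->my_thread_ordinal) {
    case 0:
        //master thread
        while (flag0) {
            // assumed: one delay kernel is launched into the shared stream per iteration
            delay_kern<0><<<1, 1, 0, config->stream>>>(1);
            if (loop_cnt++ > TRIGGER1) flag = 1;
            printf("master thread loop: %d\n", loop_cnt);
            usleep(config->delay_usec);  // assumed: the host delay between launches
        }
        break;
    default:
        //slave thread
        while (!flag);
        printf("slave thread issuing stream sync at loop count: %d\n", loop_cnt);
        cudaStreamSynchronize(config->stream);  // assumed position of the sync call
        flag0 = 0;
        printf("slave thread set trigger and exit\n");
        break;
    }
    cudaCheckErrors("thread CUDA error");
    printf("thread %d complete\n", my_thread);
    config->status = cudaSuccess;
    return NULL;
}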
int main(int argc, char* argv[])
{
    int mydelay_usec = 1;
    if (argc > 1) mydelay_usec = atoi(argv[1]);
    if ((mydelay_usec < 1) || (mydelay_usec > 10000000)) {printf("invalid delay time specified\n"); return -1;}
    flag = 0; flag0 = 1; loop_cnt = 0;
    const int nthreads = PTHREADS;
    // Create workers configs. Its data will be passed as
    // argument to thread_func.
    config_t* configs = (config_t*)malloc(sizeof(config_t) * nthreads);
    cudaStream_t str;
    cudaStreamCreate(&str);  // assumed: both threads share this one stream
    // create a separate thread
    // and execute the thread_func.
    for (int i = 0; i < nthreads; i++) {
        config_t *config = configs + i;
        config->my_thread_ordinal = i;
        config->stream = str;
        config->delay_usec = mydelay_usec;
        int status = pthread_create(&config->thread, NULL, thread_func, config);
        if (status) {
            fprintf(stderr, "Cannot create thread for device %d, status = %d\n",
                    i, status);
        }
    }
    // Wait for device threads completion.
    // Check error status.
    int status = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(configs[i].thread, NULL);
        status += configs[i].status;
    }
    free(configs);
    if (status)
        return status;
    return 0;
}
$ nvcc -arch=sm_61 -o t396 t396.cu -lpthread
$ time ./t396 100000
thread 0 initialized
thread 1 initialized
master thread loop: 1
master thread loop: 2
master thread loop: 3
master thread loop: 4
master thread loop: 5
master thread loop: 6
slave thread issuing stream sync at loop count: 7
master thread loop: 7
master thread loop: 8
master thread loop: 9
master thread loop: 10
master thread loop: 11
master thread loop: 12
master thread loop: 13
master thread loop: 14
master thread loop: 15
master thread loop: 16
master thread loop: 17
master thread loop: 18
master thread loop: 19
master thread loop: 20
master thread loop: 21
master thread loop: 22
master thread loop: 23
master thread loop: 24
master thread loop: 25
master thread loop: 26
master thread loop: 27
master thread loop: 28
master thread loop: 29
master thread loop: 30
master thread loop: 31
master thread loop: 32
master thread loop: 33
master thread loop: 34
master thread loop: 35
master thread loop: 36
master thread loop: 37
master thread loop: 38
master thread loop: 39
slave thread set trigger and exit
thread 1 complete
thread 0 complete
real 0m5.416s
user 0m2.990s
sys 0m1.623s
This will require some careful thought to understand. In a nutshell, one thread repeatedly issues kernels that each simply spin for about 0.7s before returning, while the other thread waits for a small number of kernels to be issued and then issues a cudaStreamSynchronize() call. The overall time measurement for the application shows when that call returned. As long as you keep the command line parameter (the host delay between kernel launches) below about 0.5s, the app will reliably exit in about 5.4s (this will vary depending on which GPU you are running on, but the overall app execution time should be constant up to a fairly large value of the host delay parameter).
If you specify a command line parameter that is larger than the kernel duration on your machine, then the overall app execution time will be approximately 5 times your command line parameter (microseconds), since the trigger point for the cudaStreamSynchronize() call is 5.
In my case, I compiled and ran this on CUDA 8.0.61, Ubuntu 14.04, Pascal Titan X.
I have this code:
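#include <iostream>
#include <thread>
#include <mutex>
#include <unistd.h>
using namespace std;

bool isRunning;
mutex locker;

void threadFunc(int num) {
    while(isRunning) {
        locker.lock();    // assumed: the mutex guards only the print
        cout << num << endl;
        locker.unlock();
        sleep(1);         // the sleep(1) mentioned in the question
    }
}

int main(int argc, char *argv[])
{
    isRunning = true;
    thread thr1(threadFunc,1);
    thread thr2(threadFunc,2);
    cout << "Hello World!" << endl;
    thr1.join();          // assumed: wait for the workers before returning
    thr2.join();
    return 0;
}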
When I run this code, I expect the two threads to alternate strictly, printing 1, 2, 1, 2 and so on.
But I don't get that. Instead, one number sometimes appears twice in a row, like this:
2 <--- why so?
If I run this code on Windows, replacing #include <unistd.h> with #include <windows.h> and sleep(1) with Sleep(1000), the output is exactly what I want, i.e. 1212121212.
So why is that, and how can I achieve the same result on Linux?
This comes down to thread scheduling. Sometimes one thread runs slightly ahead of the other. Here thread 2 apparently got ahead once, so you see ... 1 2 2 ... There is nothing wrong with that, because the mutex only ensures that one thread prints at a time and nothing more. It says nothing about whose turn it is. There are also uncertainties such as exactly when a thread goes to sleep and when it is woken up, and these do not take exactly the same time in both threads every time.
For having the threads execute alternately, a different semaphore arrangement is needed. For example, let there be two semaphores, s1 and s2. Let the initial values of s1 and s2 be 1 and zero respectively. Consider the following pseudo code:
// Thread 1:
P (s1)
print number
V (s2)
// Thread 2:
P (s2)
print number
V (s1)
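The pseudocode above could be sketched in C++ with POSIX unnamed semaphores, where sem_wait plays the role of P and sem_post the role of V. This assumes Linux, as in the question. The names takeTurns, alternate and trace are made up for the sketch, and appending to a string stands in for "print number" so that the order can be inspected afterwards:

```cpp
#include <semaphore.h>
#include <string>
#include <thread>

static sem_t s1, s2;       // s1 starts at 1, s2 at 0, as in the pseudocode
static std::string trace;  // only touched by the thread that currently holds the turn

// Waits for its own turn (P), records its digit, then hands the turn over (V).
void takeTurns(sem_t* mine, sem_t* other, char digit, int rounds) {
    for (int i = 0; i < rounds; ++i) {
        sem_wait(mine);    // P
        trace += digit;    // stands in for "print number"
        sem_post(other);   // V
    }
}

// Runs both threads for `rounds` turns each and returns the recorded order.
std::string alternate(int rounds) {
    trace.clear();
    sem_init(&s1, 0, 1);
    sem_init(&s2, 0, 0);
    std::thread thr1(takeTurns, &s1, &s2, '1', rounds);
    std::thread thr2(takeTurns, &s2, &s1, '2', rounds);
    thr1.join();
    thr2.join();
    sem_destroy(&s1);
    sem_destroy(&s2);
    return trace;
}
```

alternate(5) yields "1212121212" no matter how the scheduler orders the threads, because each thread can only proceed after the other has posted its semaphore. Build with -pthread.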
I have a program that needs to execute with 100% performance but I see that it is sometimes paused for more than 20 uSec. I've struggled with this for a while and can't find the reason/explanation.
So my question is:
Why is my program "paused"/"stalled" for 20 uSec every now and then?
To investigate this I wrote the following small program:
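#include <string.h>
#include <iostream>
#include <signal.h>
#include <time.h>
using namespace std;

unsigned long long get_time_in_ns(void)
{
    struct timespec tmp;
    if (clock_gettime(CLOCK_MONOTONIC, &tmp) == 0)
    {
        return tmp.tv_sec * 1000000000ULL + tmp.tv_nsec;
    }
    return 0;  // assumed fallback when clock_gettime fails
}

// Written from the signal handler, so use the type meant for that.
volatile sig_atomic_t go_on = 1;

static void Sig(int sig)
{
    (void) sig;
    go_on = 0;
}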
int main()
{
    unsigned long long t1=0;
    unsigned long long t2=0;
    unsigned long long t3=0;
    unsigned long long t4=0;
    unsigned long long t5=0;
    unsigned long long t2saved=0;
    unsigned long long t3saved=0;
    unsigned long long t4saved=0;
    unsigned long long t5saved=0;
    struct sigaction sig;
    memset(&sig, 0, sizeof(sig));
    sig.sa_handler = Sig;
    if (sigaction(SIGINT, &sig, 0) < 0)
    {
        cout << "sigaction failed" << endl;
        return 0;
    }
    while (go_on)
    {
        // Five back-to-back timestamps; each difference is the cost of one call.
        t1 = get_time_in_ns();
        t2 = get_time_in_ns();
        t3 = get_time_in_ns();
        t4 = get_time_in_ns();
        t5 = get_time_in_ns();
        // Track the largest difference seen so far for each pair.
        if ((t2-t1)>t2saved) t2saved = t2-t1;
        if ((t3-t2)>t3saved) t3saved = t3-t2;
        if ((t4-t3)>t4saved) t4saved = t4-t3;
        if ((t5-t4)>t5saved) t5saved = t5-t4;
        cout <<
            t1 << " " <<
            t2-t1 << " " <<
            t3-t2 << " " <<
            t4-t3 << " " <<
            t5-t4 << " " <<
            t2saved << " " <<
            t3saved << " " <<
            t4saved << " " <<
            t5saved << endl;
    }
    cout << endl << "Closing..." << endl;
    return 0;
}
The program simply tests how long it takes to call the function "get_time_in_ns". It does this 5 times in a row and also tracks the longest time measured.
Normally a call takes about 30 ns, but sometimes it takes as long as 20000 ns, which I don't understand.
A little part of the program output is:
8909078678739 37 29 28 28 17334 17164 17458 18083
8909078680355 36 30 29 28 17334 17164 17458 18083
8909078681947 38 28 28 27 17334 17164 17458 18083
8909078683521 37 29 28 27 17334 17164 17458 18083
8909078685096 39 27 28 29 17334 17164 17458 18083
8909078686665 37 29 28 28 17334 17164 17458 18083
8909078688256 37 29 28 28 17334 17164 17458 18083
8909078689827 37 27 28 28 17334 17164 17458 18083
The output shows that normal call time is approx. 30ns (column 2 to 5) but the largest time is nearly 20000ns (column 6 to 9).
I start the program like this:
chrt -f 99 nice -n -20 myprogram
Any ideas why the call sometimes takes 20000ns when it normally takes 30ns?
The program is executed on a dual Xeon (8 cores each) machine.
I connect using SSH.
top shows:
8107 root rt -20 16788 1448 1292 S 3.0 0.0 0:00.88 myprogram
2327 root 20 0 69848 7552 5056 S 1.3 0.0 0:37.07 sshd
Even the lowest value of niceness is not a real time priority — it is still in policy SCHED_OTHER, which is a round-robin time-sharing policy. You need to switch to a real time scheduling policy with sched_setscheduler(), either SCHED_FIFO or SCHED_RR as required.
Note that that will still not give you absolute 100% CPU if it isn't the only task running. If you run the task without interruption, Linux will still grant a few percent of the CPU time to non-real time tasks so that a runaway RT task will not effectively hang the machine. Of course, a real time task needing 100% CPU time is unlikely to perform correctly.
Edit: As pointed out, the process already runs with an RT scheduler (nice values are only relevant to SCHED_OTHER, so it's pointless to set those in addition). The rest of my answer still applies as to how and why other tasks are still being run (remember that there are also a number of kernel tasks).
The only way better than this is probably dedicating one CPU core to the task to get the most out of it. Obviously this only works on multi-core CPUs. There is a question related to that here: Whole one core dedicated to single process
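As a rough illustration of the first step towards that, a thread can be pinned to one core with sched_setaffinity. This is a Linux/glibc-specific sketch, and the helper names pinToCurrentCpu and allowedCpuCount are made up. Note that pinning only restricts where this thread may run; it does not by itself keep other tasks off that core, which needs additional system configuration.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros and sched_getcpu on glibc
#endif
#include <sched.h>

// Restricts the calling thread to the CPU it is currently running on.
// Returns that CPU number on success, or -1 if any call failed.
int pinToCurrentCpu() {
    int cpu = sched_getcpu();
    if (cpu < 0) return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) return -1;  // pid 0 = calling thread
    return cpu;
}

// Number of CPUs the calling thread is allowed to run on, or -1 on error.
int allowedCpuCount() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    return CPU_COUNT(&set);
}
```

Calling pinToCurrentCpu() at the start of main, before the measurement loop, keeps the loop from migrating between cores; afterwards allowedCpuCount() should report 1. In practice you would pick a specific core that has been set aside for the task rather than whichever core the thread happens to start on.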
Does a call to MPI_Barrier affect every thread in an MPI process or only the thread that makes the call?
For your information, my MPI application will run with MPI_THREAD_MULTIPLE.
The way to think of this is that MPI_Barrier (and other collectives) are blocking function calls, which block until all processes in the communicator have completed the function. That, I think, makes it a little easier to figure out what should happen; the function blocks, but other threads continue on their way unimpeded.
So consider the following chunk of code (The shared done flag being flushed to communicate between threads is not how you should be doing thread communication, so please don't use this as a template for anything. Furthermore, using a reference to done will solve this bug/optimization, see the end of comment 2):
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char**argv) {
    int ierr, size, rank;
    int provided;
    volatile int done=0;
    MPI_Comm comm;

    ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr,"Could not initialize with thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);  /* assumed error exit; missing from the post */
    }
    comm = MPI_COMM_WORLD;             /* assumed: comm must be set before use */
    ierr = MPI_Comm_size(comm, &size);
    ierr = MPI_Comm_rank(comm, &rank);

    if (rank == 1) sleep(10);

    #pragma omp parallel num_threads(2) default(none) shared(rank,comm,done)
    {
        #pragma omp single
        {
            /* spawn off one thread to do the barrier,... */
            #pragma omp task
            {
                MPI_Barrier(comm);     /* assumed: the barrier call itself */
                printf("%d -- thread done Barrier\n", rank);
                done = 1;
                #pragma omp flush
            }
            /* and another to do some printing while we're waiting */
            #pragma omp task
            {
                volatile int *p = &done;
                while(!(*p)) {
                    printf("%d -- thread waiting\n", rank);
                    sleep(1);          /* assumed: one message per second, as in the output */
                }
            }
        }
    }
    MPI_Finalize();
    return 0;
}
Rank 1 sleeps for 10 seconds, and all the ranks start a barrier in one thread. If you run this with mpirun -np 2, you'd expect the first of rank 0's threads to hit the barrier and block there, and the other to cycle around printing and waiting. Sure enough, that's what happens:
$ mpirun -np 2 ./threadbarrier
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
1 -- thread waiting
0 -- thread done Barrier
1 -- thread done Barrier