Why does sleep() after acquiring a pthread_mutex_lock block the whole program? - linux

In my test program, I start two threads, each of which just does the following:
1) pthread_mutex_lock()
2) sleep(1)
3) pthread_mutex_unlock()
However, I find that after some time, one of the two threads blocks on pthread_mutex_lock() forever, while the other thread works normally. This is very strange behavior and I think it may be a potentially serious issue. According to the Linux manual, calling sleep() while holding a pthread_mutex_t is not prohibited. So my question is: is this a real problem, or is there a bug in my code?
The following is the test program. In the code, the 1st thread's output is directed to stdout, while the 2nd's is directed to stderr, so we can compare the two outputs to see whether either thread is blocked.
I have tested it on Linux kernels 2.6.31 and 2.6.9, with the same result on both.
//======================= Test Program ===========================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // for strcmp()
#include <unistd.h>   // for sleep()
#include <errno.h>
#include <pthread.h>
#define THREAD_NUM 2
static int data[THREAD_NUM];
static int sleepFlag = 1;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static void * threadFunc(void *arg)
{
int* idx = (int*) arg;
FILE* fd = NULL;
if (*idx == 0)
fd = stdout;
else
fd = stderr;
while(1) {
fprintf(fd, "\n[%d]Before pthread_mutex_lock is called\n", *idx);
if (pthread_mutex_lock(&mutex) != 0) {
exit(1);
}
fprintf(fd, "[%d]pthread_mutex_lock is finisheded. Sleep some time\n", *idx);
if (sleepFlag == 1)
sleep(1);
fprintf(fd, "[%d]sleep done\n\n", *idx);
fprintf(fd, "[%d]Before pthread_mutex_unlock is called\n", *idx);
if (pthread_mutex_unlock(&mutex) != 0) {
exit(1);
}
fprintf(fd, "[%d]pthread_mutex_unlock is finisheded.\n", *idx);
}
}
// 1. compile
// gcc -o pthread pthread.c -lpthread
// 2. run
// 1) ./pthread sleep 2> /tmp/error.log # Each thread will sleep 1 second after it acquires pthread_mutex_lock
// ==> We can see that /tmp/error.log stops growing.
// or
// 2) ./pthread nosleep 2> /tmp/error.log # No sleep is done when each thread acquires pthread_mutex_lock
// ==> We can see that both stdout and /tmp/error.log keep growing.
int main(int argc, char *argv[]) {
if ((argc == 2) && (strcmp(argv[1], "nosleep") == 0))
{
sleepFlag = 0;
}
pthread_t t[THREAD_NUM];
int i;
for (i = 0; i < THREAD_NUM; i++) {
data[i] = i;
int ret = pthread_create(&t[i], NULL, threadFunc, &data[i]);
if (ret != 0) {
perror("pthread_create error\n");
exit(-1);
}
}
for (i = 0; i < THREAD_NUM; i++) {
int ret = pthread_join(t[i], (void*)0);
if (ret != 0) {
perror("pthread_join error\n");
exit(-1);
}
}
exit(0);
}
This is the output:
On the terminal where the program is started:
root#skyscribe:~# ./pthread sleep 2> /tmp/error.log
[0]Before pthread_mutex_lock is called
[0]pthread_mutex_lock is finished. Sleep some time
[0]sleep done
[0]Before pthread_mutex_unlock is called
[0]pthread_mutex_unlock is finished.
...
On another terminal to see the file /tmp/error.log
root#skyscribe:~# tail -f /tmp/error.log
[1]Before pthread_mutex_lock is called
And no new lines are output to /tmp/error.log.

This is a wrong way to use mutexes. A thread should not hold a mutex for more time than it spends not holding it, and in particular it should not sleep while holding the mutex. There is no FIFO guarantee for locking a mutex (for efficiency reasons).
More specifically, if thread 1 unlocks the mutex while thread 2 is waiting for it, thread 2 becomes runnable, but this does not force the scheduler to preempt thread 1 or run thread 2 immediately. Most likely it will not, because thread 1 has recently slept. When thread 1 subsequently reaches the pthread_mutex_lock() call, it will generally be allowed to lock the mutex immediately, even though there is a thread waiting (and the implementation can know it). When thread 2 wakes up after that, it will find the mutex already locked and go back to sleep.
The best solution is not to hold a mutex for that long. If that is not possible, consider moving the lock-needing operations to a single thread (removing the need for the lock) or waking up the correct thread using condition variables.
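As a concrete illustration (a minimal sketch, not taken from the original answer, and assuming the sleep only simulates work that does not need the shared data), the loop from the test program can be restructured so the long sleep happens while the mutex is not held:
while (1) {
    if (pthread_mutex_lock(&mutex) != 0)
        exit(1);
    /* ... briefly touch the shared data here ... */
    if (pthread_mutex_unlock(&mutex) != 0)
        exit(1);
    /* the long wait now happens while NOT holding the mutex,
       so the other thread gets a fair chance to acquire it */
    if (sleepFlag == 1)
        sleep(1);
}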

There's neither a problem nor a bug in your code, just a combination of buffering and scheduling effects. Add an fflush here:
fprintf (fd, "[%d]pthread_mutex_unlock is finished.\n", *idx);
fflush (fd);
and run
./a.out 1> 1.log 2> 2.log &
and you'll see rather equal progress made by the two threads.
EDIT: as @jilles said above, a mutex is supposed to be a short-wait lock, as opposed to long waits such as a condition variable wait, waiting for I/O, or sleeping. That is also why locking a mutex is not a cancellation point.
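An alternative to adding fflush() calls (a sketch, not part of the original answer) is to make both streams line-buffered before the threads are created, so every fprintf() that ends in a newline is flushed automatically; note that setvbuf() must be called before any other operation on the stream:
/* at the top of main(), before pthread_create() */
setvbuf(stdout, NULL, _IOLBF, 0);   /* line-buffered even when redirected to a file */
setvbuf(stderr, NULL, _IOLBF, 0);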

Related

How does the main thread_exit work with a loop sub-thread?

I'm a newbie with threads. I have the code below:
#include<pthread.h>
#include<stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5
void *PrintHello (void *threadid ){
long tid ;tid = (long) threadid ;
printf ("Hello World! It’s me, thread#%ld !\n" , tid );
pthread_exit(NULL);
}
int main (int argc ,char *argv[] ){
pthread_t threads [NUM_THREADS] ;
int rc ;
long t ;
for( t=0; t<NUM_THREADS; t++){
printf ("In main: creating thread %ld\n" , t );
rc = pthread_create(&threads[t],NULL,PrintHello,(void *)t );
}
pthread_exit(NULL);
}
I compile and run it, and the output is here: 1
But when I delete the last line "pthread_exit(NULL)", the output is sometimes the same as above, printing all 5 sub-threads, and sometimes it prints only 4 sub-threads (threads 0-3), for instance: 2
Help me with this, please!
Omitting that pthread_exit call in main will cause main to implicitly return and terminate the process, including all running threads.
Now you have a race condition: will your five worker threads print their messages before main terminates the whole process? Sometimes yes, sometimes no, as you have observed.
If you keep the pthread_exit call in main, your program will instead terminate when the last running thread calls pthread_exit.
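As a sketch of the other common way to get the same guarantee (not part of the original answer, reusing the threads and t variables from the question's code): instead of calling pthread_exit(NULL) in main, join every worker, so main only returns, and the process only terminates, after all of them have finished:
for (t = 0; t < NUM_THREADS; t++)
    pthread_join(threads[t], NULL);   /* wait for each worker to finish */
return 0;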

Pthread join or pthread exit for terminating a multithreaded C program?

I wanted to print 1 to 10 for 3 threads. My code is able to do that, but afterwards the program gets stuck. I tried using pthread_exit at the end of the thread function. I also tried removing the while (1) in main and using pthread_join there, but I still got the same result. How should I terminate the threads?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
/* the mutex and condition variables used below were not declared in the post; presumably something like: */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond1 = PTHREAD_COND_INITIALIZER;
pthread_cond_t cond2 = PTHREAD_COND_INITIALIZER;
pthread_cond_t cond3 = PTHREAD_COND_INITIALIZER;
int done = 1;
//Thread function
void *foo()
{
for (int i = 0; i < 10; i++)
{
printf(" \n #############");
pthread_mutex_lock(&lock);
if(done == 1)
{
done = 2;
printf (" \n %d", i);
pthread_cond_signal(&cond2);
pthread_cond_wait(&cond1, &lock);
printf (" \n Thread 1 woke up");
}
else if(done == 2)
{
printf (" \n %d", i);
done = 3;
pthread_cond_signal(&cond3);
pthread_cond_wait(&cond2, &lock);
printf (" \n Thread 2 woke up");
}
else
{
printf (" \n %d", i);
done = 1;
pthread_cond_signal(&cond1);
pthread_cond_wait(&cond3, &lock);
printf (" \n Thread 3 woke up");
}
pthread_mutex_unlock(&lock);
}
pthread_exit(NULL);
return NULL;
}
int main(void)
{
pthread_t tid1, tid2, tid3;
pthread_create(&tid1, NULL, foo, NULL);
pthread_create(&tid2, NULL, foo, NULL);
pthread_create(&tid3, NULL, foo, NULL);
while(1);
printf ("\n $$$$$$$$$$$$$$$$$$$$$$$$$$$");
return 0;
}
How should I terminate the threads?
Either returning from the outermost call to the thread function or calling pthread_exit() terminates a thread. Returning a value p from the outermost call is equivalent to calling pthread_exit(p).
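For example (a trivial sketch with hypothetical function names, not from the original answer), these two thread functions terminate their thread in equivalent ways:
void *worker_a(void *arg)
{
    /* ... */
    return arg;          /* terminates the calling thread; the value is delivered to pthread_join() */
}

void *worker_b(void *arg)
{
    /* ... */
    pthread_exit(arg);   /* equivalent to the return above */
}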
the program gets stuck
Well, of course it does when the program performs
while(1);
Also, I tried to remove while (1) in main and using pthread join there. But still, I got the same result.
You do need to join the threads to ensure that they terminate before the overall program does. This is the only appropriate way to achieve that. But if your threads in fact do not terminate in the first place, then that's moot.
In your case, observe that each thread unconditionally performs a pthread_cond_wait() on every iteration of the loop, requiring it to be signaled before it resumes. Normally, the preceding thread will signal it, but that does not happen after the last iteration of the loop. You could address that by having each thread perform an appropriate additional call to pthread_cond_signal() after it exits the loop, or by ensuring that the threads do not wait in the last loop iteration.
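A rough sketch of the second suggestion, applied to the question's code (the same guard goes in front of each of the three pthread_cond_wait() calls; it illustrates the idea rather than being a fully verified fix):
if (i < 9)                              /* skip the wait on the last iteration (the loop runs i = 0..9); */
    pthread_cond_wait(&cond1, &lock);   /* after it, no other thread would be left to signal us */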

cudaStreamSynchronize behavior under multiple threads

What is the behavior of cudaStreamSynchronize in the following case?
ThreadA pseudo code:
while(true):
    submit new cuda kernel to cudaStreamX
ThreadB pseudo code:
call cudaStreamSynchronize(cudaStreamX)
My question is: when will ThreadB return? ThreadA keeps submitting new CUDA kernels, so cudaStreamX never runs out of work.
The API documentation isn't directly explicit about this; however, the CUDA C programming guide is basically explicit:
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed
Furthermore, I think it should be sensible that:
cudaStreamSynchronize() cannot reasonably take into account work issued to a stream after that cudaStreamSynchronize() call. This would more or less require it to know the future.
cudaStreamSynchronize() should reasonably be expected to return after all previously issued work to that stream is complete.
Putting together an experimental test app, I observe the behavior described above:
$ cat t396.cu
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <unistd.h>
const int PTHREADS=2;
const int TRIGGER1=5;
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define DELAY_T 1000000000ULL
template <int type>
__global__ void delay_kern(int i){
unsigned long long time = clock64();
#ifdef DEBUG
printf("hello %d\n", type);
#endif
while (clock64() < time+(i*DELAY_T));
}
volatile static int flag, flag0, loop_cnt;
// The thread configuration structure.
typedef struct
{
int my_thread_ordinal;
pthread_t thread;
cudaError_t status;
cudaStream_t stream;
int delay_usec;
}
config_t;
// The function executed by each thread assigned with CUDA device.
void *thread_func(void *arg)
{
// Unpack the config structure.
config_t *config = (config_t *)arg;
int my_thread=config->my_thread_ordinal;
cudaError_t cuda_status = cudaSuccess;
cuda_status = cudaSetDevice(0);
if (cuda_status != cudaSuccess) {
fprintf(stderr, "Cannot set focus to device %d, status = %d\n",
0, cuda_status);
config->status = cuda_status;
pthread_exit(NULL);
}
printf("thread %d initialized\n", my_thread);
switch(config->my_thread_ordinal){
case 0:
//master thread
while (flag0) {
delay_kern<0><<<1,1,0,config->stream>>>(1);
if (loop_cnt++ > TRIGGER1) flag = 1;
printf("master thread loop: %d\n", loop_cnt);
usleep(config->delay_usec);
}
break;
default:
//slave thread
while (!flag);
printf("slave thread issuing stream sync at loop count: %d\n", loop_cnt);
cudaStreamSynchronize(config->stream);
flag0 = 0;
printf("slave thread set trigger and exit\n");
break;
}
cudaCheckErrors("thread CUDA error");
printf("thread %d complete\n", my_thread);
config->status = cudaSuccess;
return NULL;
}
int main(int argc, char* argv[])
{
int mydelay_usec = 1;
if (argc > 1) mydelay_usec = atoi(argv[1]);
if ((mydelay_usec < 1) || (mydelay_usec > 10000000)) {printf("invalid delay time specified\n"); return -1;}
flag = 0; flag0 = 1; loop_cnt = 0;
const int nthreads = PTHREADS;
// Create workers configs. Its data will be passed as
// argument to thread_func.
config_t* configs = (config_t*)malloc(sizeof(config_t) * nthreads);
cudaSetDevice(0);
cudaStream_t str;
cudaStreamCreate(&str);
// create a separate thread
// and execute the thread_func.
for (int i = 0; i < nthreads; i++) {
config_t *config = configs + i;
config->my_thread_ordinal = i;
config->stream = str;
config->delay_usec = mydelay_usec;
int status = pthread_create(&config->thread, NULL, thread_func, config);
if (status) {
fprintf(stderr, "Cannot create thread for device %d, status = %d\n",
i, status);
}
}
// Wait for device threads completion.
// Check error status.
int status = 0;
for (int i = 0; i < nthreads; i++) {
pthread_join(configs[i].thread, NULL);
status += configs[i].status;
}
if (status)
return status;
free(configs);
return 0;
}
$ nvcc -arch=sm_61 -o t396 t396.cu -lpthread
$ time ./t396 100000
thread 0 initialized
thread 1 initialized
master thread loop: 1
master thread loop: 2
master thread loop: 3
master thread loop: 4
master thread loop: 5
master thread loop: 6
slave thread issuing stream sync at loop count: 7
master thread loop: 7
master thread loop: 8
master thread loop: 9
master thread loop: 10
master thread loop: 11
master thread loop: 12
master thread loop: 13
master thread loop: 14
master thread loop: 15
master thread loop: 16
master thread loop: 17
master thread loop: 18
master thread loop: 19
master thread loop: 20
master thread loop: 21
master thread loop: 22
master thread loop: 23
master thread loop: 24
master thread loop: 25
master thread loop: 26
master thread loop: 27
master thread loop: 28
master thread loop: 29
master thread loop: 30
master thread loop: 31
master thread loop: 32
master thread loop: 33
master thread loop: 34
master thread loop: 35
master thread loop: 36
master thread loop: 37
master thread loop: 38
master thread loop: 39
slave thread set trigger and exit
thread 1 complete
thread 0 complete
real 0m5.416s
user 0m2.990s
sys 0m1.623s
$
This will require some careful thought to understand. In a nutshell, one thread issues kernels that each simply execute a delay of about 0.7s before returning, while the other thread waits for a small number of kernels to be issued and then issues a cudaStreamSynchronize() call. The overall time measurement for the application defines when that call returned. As long as you keep the command line parameter (the host delay between kernel launches) below about 0.5s, the app will reliably exit in about 5.4s (this will vary depending on which GPU you are running on, but the overall app execution time should be constant up to a fairly large value of the host delay parameter).
If you specify a command line parameter that is larger than the kernel duration on your machine, then the overall app execution time will be approximately 5 times your command line parameter (microseconds), since the trigger point for the cudaStreamSynchronize() call is 5.
In my case, I compiled and ran this on CUDA 8.0.61, Ubuntu 14.04, Pascal Titan X.

What's the correct way of waiting for detached threads to finish?

Look at this sample code:
#include <chrono>
#include <iostream>
#include <thread>
using namespace std;          // the snippet uses unqualified this_thread, chrono and cout
typedef unsigned int uint;    // non-standard alias used by SleepSort below

void OutputElement(int e, int delay)
{
this_thread::sleep_for(chrono::milliseconds(100 * delay));
cout << e << '\n';
}
void SleepSort(int v[], uint n)
{
for (uint i = 0 ; i < n ; ++i)
{
thread t(OutputElement, v[i], v[i]);
t.detach();
}
}
It starts n new threads and each one sleeps for some time before outputting a value and finishing. What's the correct/best/recommended way of waiting for all the threads to finish in this case? I know how to work around this, but I want to know the recommended multithreading tool/design to use in this situation (e.g. condition_variable, mutex, etc.).
And now for the slightly dissenting answer. And I do mean slightly because I mostly agree with the other answer and the comments that say "don't detach, instead join."
First imagine that there is no join(). And that you have to communicate among your threads with a mutex and condition_variable. This really isn't that hard nor complicated. And it allows an arbitrarily rich communication, which can be anything you want, as long as it is only communicated while the mutex is locked.
Now a very common idiom for such communication would simply be a state that says "I'm done". Child threads would set it, and the parent thread would wait on the condition_variable until the child said "I'm done." This idiom would in fact be so common as to deserve a convenience function that encapsulated the mutex, condition_variable and state.
join() is precisely this convenience function.
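To make that concrete, here is a minimal hand-rolled version of the idiom (a sketch, not from the original answer): the parent waits on a flag guarded by a mutex and condition_variable instead of calling join():
#include <condition_variable>
#include <mutex>
#include <thread>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread child([&] {
        /* ... do the work ... */
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_one();                  // tell the parent "I'm done"
    });
    child.detach();                       // no join(); the parent waits on the flag instead

    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&]{ return done; });     // in effect, a hand-rolled join()
}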
But imho one has to be careful. When one says: "Never detach, always join," that could be interpreted as: Never make your thread communication more complicated than "I'm done."
For a more complex interaction between parent thread and child thread, consider the case where a parent thread launches several child threads to go out and independently search for the solution to a problem. When the problem is first found by any thread, that gets communicated to the parent, and the parent can then take that solution, and tell all the other threads that they don't need to search any more.
For example:
#include <chrono>
#include <iostream>
#include <iterator>
#include <random>
#include <thread>
#include <vector>
void OneSearch(int id, std::shared_ptr<std::mutex> mut,
std::shared_ptr<std::condition_variable> cv,
int& state, int& solution)
{
std::random_device seed;
// std::mt19937_64 eng{seed()};
std::mt19937_64 eng{static_cast<unsigned>(id)};
std::uniform_int_distribution<> dist(0, 100000000);
int test = 0;
while (true)
{
for (int i = 0; i < 100000000; ++i)
{
++test;
if (dist(eng) == 999)
{
std::unique_lock<std::mutex> lk(*mut);
if (state == -1)
{
state = id;
solution = test;
cv->notify_one();
}
return;
}
}
std::unique_lock<std::mutex> lk(*mut);
if (state != -1)
return;
}
}
auto findSolution(int n)
{
std::vector<std::thread> threads;
auto mut = std::make_shared<std::mutex>();
auto cv = std::make_shared<std::condition_variable>();
int state = -1;
int solution = -1;
std::unique_lock<std::mutex> lk(*mut);
for (uint i = 0 ; i < n ; ++i)
threads.push_back(std::thread(OneSearch, i, mut, cv,
std::ref(state), std::ref(solution)));
while (state == -1)
cv->wait(lk);
lk.unlock();
for (auto& t : threads)
t.join();
return std::make_pair(state, solution);
}
int
main()
{
auto p = findSolution(5);
std::cout << '{' << p.first << ", " << p.second << "}\n";
}
Above I've created a "dummy problem" where a thread searches for how many times it needs to query a URNG until it comes up with the number 999. The parent thread puts 5 child threads to work on it. The child threads work for awhile, and then every once in a while, look up and see if any other thread has found the solution yet. If so, they quit, else they keep working. The main thread waits until solution is found, and then joins with all the child threads.
For me, using the bash time facility, this outputs:
$ time a.out
{3, 30235588}
real 0m4.884s
user 0m16.792s
sys 0m0.017s
But what if instead of joining with all the threads, it detached those threads that had not yet found a solution. This might look like:
for (unsigned i = 0; i < n; ++i)
{
if (i == state)
threads[i].join();
else
threads[i].detach();
}
(in place of the t.join() loop from above). For me this now runs in 1.8 seconds, instead of the 4.9 seconds above. I.e. the child threads are not checking with each other that often, and so main just detaches the working threads and lets the OS bring them down. This is safe for this example because the child threads own everything they are touching. Nothing gets destructed out from under them.
One final iteration can be realized by noticing that even the thread that finds the solution doesn't need to be joined with. All of the threads could be detached. The code is actually much simpler:
auto findSolution(int n)
{
auto mut = std::make_shared<std::mutex>();
auto cv = std::make_shared<std::condition_variable>();
int state = -1;
int solution = -1;
std::unique_lock<std::mutex> lk(*mut);
for (uint i = 0 ; i < n ; ++i)
std::thread(OneSearch, i, mut, cv,
std::ref(state), std::ref(solution)).detach();
while (state == -1)
cv->wait(lk);
return std::make_pair(state, solution);
}
And the performance remains at about 1.8 seconds.
There is still (sort of) an effective join with the solution-finding thread here. But it is accomplished with the condition_variable::wait instead of with join.
thread::join() is a convenience function for the very common idiom that your parent/child thread communication protocol is simply "I'm done." Prefer thread::join() in this common case as it is easier to read, and easier to write.
However don't unnecessarily constrain yourself to such a simple parent/child communication protocol. And don't be afraid to build your own richer protocol when the task at hand needs it. And in this case, thread::detach() will often make more sense. thread::detach() doesn't necessarily imply a fire-and-forget thread. It can simply mean that your communication protocol is more complex than "I'm done."
Don't detach, but instead join:
std::vector<std::thread> ts;
for (unsigned int i = 0; i != n; ++i)
ts.emplace_back(OutputElement, v[i], v[i]);
for (auto & t : ts)
t.join();

Does a call to MPI_Barrier affect every thread in an MPI process?

Does a call to MPI_Barrier affect every thread in an MPI process, or only the thread that makes the call?
For your information, my MPI application will run with MPI_THREAD_MULTIPLE.
Thanks.
The way to think of this is that MPI_Barrier (and other collectives) are blocking function calls, which block until all processes in the communicator have completed the function. That, I think, makes it a little easier to figure out what should happen; the function blocks, but other threads continue on their way unimpeded.
So consider the following chunk of code (The shared done flag being flushed to communicate between threads is not how you should be doing thread communication, so please don't use this as a template for anything. Furthermore, using a reference to done will solve this bug/optimization, see the end of comment 2):
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char**argv) {
int ierr, size, rank;
int provided;
volatile int done=0;
MPI_Comm comm;
ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided == MPI_THREAD_SINGLE) {
fprintf(stderr,"Could not initialize with thread support\n");
MPI_Abort(MPI_COMM_WORLD,1);
}
comm = MPI_COMM_WORLD;
ierr = MPI_Comm_size(comm, &size);
ierr = MPI_Comm_rank(comm, &rank);
if (rank == 1) sleep(10);
#pragma omp parallel num_threads(2) default(none) shared(rank,comm,done)
{
#pragma omp single
{
/* spawn off one thread to do the barrier,... */
#pragma omp task
{
MPI_Barrier(comm);
printf("%d -- thread done Barrier\n", rank);
done = 1;
#pragma omp flush
}
/* and another to do some printing while we're waiting */
#pragma omp task
{
int *p = &done;
while (!(*p)) {
printf("%d -- thread waiting\n", rank);
sleep(1);
}
}
}
}
MPI_Finalize();
return 0;
}
Rank 1 sleeps for 10 seconds, and all the ranks start a barrier in one thread. If you run this with mpirun -np 2, you'd expect the first of rank 0's threads to hit the barrier, and the other to cycle around printing and waiting -- and sure enough, that's what happens:
$ mpirun -np 2 ./threadbarrier
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
1 -- thread waiting
0 -- thread done Barrier
1 -- thread done Barrier
