Does a call to MPI_Barrier affect every thread in an MPI process? - multithreading

Does a call to MPI_Barrier affect every thread in an MPI process or only the thread
that makes the call?
For your information, my MPI application will run with MPI_THREAD_MULTIPLE.
Thanks.

The way to think of this is that MPI_Barrier (like the other blocking collectives) is a blocking function call: it does not return until every process in the communicator has entered the barrier. That, I think, makes it a little easier to figure out what should happen: the calling thread blocks, but the other threads in the process continue on their way unimpeded.
So consider the following chunk of code (communicating between threads through a shared, flushed done flag is not how you should be doing thread communication, so please don't use this as a template for anything):
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int ierr, size, rank;
    int provided;
    volatile int done = 0;
    MPI_Comm comm;

    ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided == MPI_THREAD_SINGLE) {
        fprintf(stderr, "Could not initialize with thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    comm = MPI_COMM_WORLD;
    ierr = MPI_Comm_size(comm, &size);
    ierr = MPI_Comm_rank(comm, &rank);

    if (rank == 1) sleep(10);

    #pragma omp parallel num_threads(2) default(none) shared(rank,comm,done)
    {
        #pragma omp single
        {
            /* spawn off one thread to do the barrier,... */
            #pragma omp task
            {
                MPI_Barrier(comm);
                printf("%d -- thread done Barrier\n", rank);
                done = 1;
                #pragma omp flush
            }

            /* and another to do some printing while we're waiting */
            #pragma omp task
            {
                volatile int *p = &done;
                while (!(*p)) {
                    printf("%d -- thread waiting\n", rank);
                    sleep(1);
                }
            }
        }
    }

    MPI_Finalize();
    return 0;
}
Rank 1 sleeps for 10 seconds, and each rank starts a barrier in one thread. If you run this with mpirun -np 2, you'd expect the first of rank 0's threads to hit the barrier, and the other to cycle around printing and waiting -- and sure enough, that's what happens:
$ mpirun -np 2 ./threadbarrier
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
1 -- thread waiting
0 -- thread done Barrier
1 -- thread done Barrier

Related

How does the main thread_exit work with a loop sub-thread?

I'm a newbie with threads, and I have the code below:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 5

void *PrintHello(void *threadid) {
    long tid = (long) threadid;
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
    }
    pthread_exit(NULL);
}
I compile and run it, and the output shows all five sub-threads printing their messages.
But when I delete the last line, pthread_exit(NULL), the output is sometimes the same as above (all 5 sub-threads print), and sometimes only 4 sub-threads print (threads 0-3, for instance).
Help me with this, please!
Omitting that pthread_exit call in main will cause main to implicitly return and terminate the process, including all running threads.
Now you have a race condition: will your five worker threads print their messages before main terminates the whole process? Sometimes yes, sometimes no, as you have observed.
If you keep the pthread_exit call in main, your program will instead terminate when the last running thread calls pthread_exit.
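A common alternative (not from the original answer, just a sketch) is to have main wait for the workers explicitly with pthread_join instead of calling pthread_exit; the end of the asker's main() could look like this:
    /* Hypothetical variant of the asker's main(): join each worker so the
       process only exits after every thread has printed. */
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
    }
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);  /* blocks until thread t terminates */
    return 0;  /* safe now: no worker is still running */
With the joins in place, returning from main no longer races against the workers.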

cudaStreamSynchronize behavior under multiple threads

What is the behavior of cudaStreamSynchronize in the following case
ThreadA pseudo code :
while(true):
submit new cuda Kernel to cudaStreamX
ThreadB pseudo code:
call cudaStreamSynchronize(cudaStreamX)
My question is when will ThreadB return? Since ThreadA will always push new cuda kernels, and the cudaStreamX will never finish.
The API documentation isn't directly explicit about this; however, the CUDA C programming guide is basically explicit:
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed
Furthermore, I think it should be sensible that:
cudaStreamSynchronize() cannot reasonably take into account work issued to a stream after that cudaStreamSynchronize() call; this would more or less require it to know the future.
cudaStreamSynchronize() should reasonably be expected to return after all previously issued work to that stream is complete.
Putting together an experimental test app, I observe exactly the behavior described above:
$ cat t396.cu
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <unistd.h>

const int PTHREADS = 2;
const int TRIGGER1 = 5;

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), \
                    __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#define DELAY_T 1000000000ULL
template <int type>
__global__ void delay_kern(int i){
    unsigned long long time = clock64();
#ifdef DEBUG
    printf("hello %d\n", type);
#endif
    while (clock64() < time+(i*DELAY_T));
}

volatile static int flag, flag0, loop_cnt;

// The thread configuration structure.
typedef struct
{
    int my_thread_ordinal;
    pthread_t thread;
    cudaError_t status;
    cudaStream_t stream;
    int delay_usec;
}
config_t;

// The function executed by each thread assigned with CUDA device.
void *thread_func(void *arg)
{
    // Unpack the config structure.
    config_t *config = (config_t *)arg;
    int my_thread = config->my_thread_ordinal;
    cudaError_t cuda_status = cudaSuccess;
    cuda_status = cudaSetDevice(0);
    if (cuda_status != cudaSuccess) {
        fprintf(stderr, "Cannot set focus to device %d, status = %d\n",
                0, cuda_status);
        config->status = cuda_status;
        pthread_exit(NULL);
    }
    printf("thread %d initialized\n", my_thread);
    switch(config->my_thread_ordinal){
        case 0:
            // master thread
            while (flag0) {
                delay_kern<0><<<1,1,0,config->stream>>>(1);
                if (loop_cnt++ > TRIGGER1) flag = 1;
                printf("master thread loop: %d\n", loop_cnt);
                usleep(config->delay_usec);
            }
            break;
        default:
            // slave thread
            while (!flag);
            printf("slave thread issuing stream sync at loop count: %d\n", loop_cnt);
            cudaStreamSynchronize(config->stream);
            flag0 = 0;
            printf("slave thread set trigger and exit\n");
            break;
    }
    cudaCheckErrors("thread CUDA error");
    printf("thread %d complete\n", my_thread);
    config->status = cudaSuccess;
    return NULL;
}

int main(int argc, char* argv[])
{
    int mydelay_usec = 1;
    if (argc > 1) mydelay_usec = atoi(argv[1]);
    if ((mydelay_usec < 1) || (mydelay_usec > 10000000)) {
        printf("invalid delay time specified\n");
        return -1;
    }
    flag = 0; flag0 = 1; loop_cnt = 0;
    const int nthreads = PTHREADS;
    // Create worker configs. Their data will be passed as
    // argument to thread_func.
    config_t* configs = (config_t*)malloc(sizeof(config_t) * nthreads);
    cudaSetDevice(0);
    cudaStream_t str;
    cudaStreamCreate(&str);
    // Create a separate thread for each config
    // and execute the thread_func.
    for (int i = 0; i < nthreads; i++) {
        config_t *config = configs + i;
        config->my_thread_ordinal = i;
        config->stream = str;
        config->delay_usec = mydelay_usec;
        int status = pthread_create(&config->thread, NULL, thread_func, config);
        if (status) {
            fprintf(stderr, "Cannot create thread for device %d, status = %d\n",
                    i, status);
        }
    }
    // Wait for device threads completion.
    // Check error status.
    int status = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(configs[i].thread, NULL);
        status += configs[i].status;
    }
    if (status)
        return status;
    free(configs);
    return 0;
}
$ nvcc -arch=sm_61 -o t396 t396.cu -lpthread
$ time ./t396 100000
thread 0 initialized
thread 1 initialized
master thread loop: 1
master thread loop: 2
master thread loop: 3
master thread loop: 4
master thread loop: 5
master thread loop: 6
slave thread issuing stream sync at loop count: 7
master thread loop: 7
master thread loop: 8
master thread loop: 9
master thread loop: 10
master thread loop: 11
master thread loop: 12
master thread loop: 13
master thread loop: 14
master thread loop: 15
master thread loop: 16
master thread loop: 17
master thread loop: 18
master thread loop: 19
master thread loop: 20
master thread loop: 21
master thread loop: 22
master thread loop: 23
master thread loop: 24
master thread loop: 25
master thread loop: 26
master thread loop: 27
master thread loop: 28
master thread loop: 29
master thread loop: 30
master thread loop: 31
master thread loop: 32
master thread loop: 33
master thread loop: 34
master thread loop: 35
master thread loop: 36
master thread loop: 37
master thread loop: 38
master thread loop: 39
slave thread set trigger and exit
thread 1 complete
thread 0 complete
real 0m5.416s
user 0m2.990s
sys 0m1.623s
$
This will require some careful thought to understand. In a nutshell, one thread issues kernels that simply spin for about 0.7 s before returning, and the other thread waits for a small number of kernels to be issued and then issues a cudaStreamSynchronize() call. The overall time measurement for the application tells us when that call returned. As long as you keep the command-line parameter (the host delay between kernel launches) below about 0.5 s, the app reliably exits in about 5.4 s (this will vary depending on which GPU you are running on, but the overall app execution time should stay roughly constant up to a fairly large value of the host delay parameter).
If you specify a command line parameter that is larger than the kernel duration on your machine, then the overall app execution time will be approximately 5 times your command line parameter (microseconds), since the trigger point for the cudaStreamSynchronize() call is 5.
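For example (hypothetical numbers, not a run from the original answer): ./t396 2000000 makes the master thread wait about 2 s between kernel launches, which is longer than the ~0.7 s kernel; the sync is triggered after roughly 5 launches, so the cudaStreamSynchronize() call, and therefore the whole app, would return after roughly 5 × 2 s ≈ 10 s of wall-clock time.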
In my case, I compiled and ran this on CUDA 8.0.61, Ubuntu 14.04, Pascal Titan X.

setitimer and signal count on Linux. Is signal count directly proportional to run time?

There is a test program to exercise setitimer on Linux (kernel 2.6; HZ=100). It sets various itimers to send a signal every 10 ms (actually it is set to 9 ms, but the timeslice is 10 ms). The program then runs for some fixed time (e.g. 30 sec) and counts signals.
Is it guaranteed that signal count will be proportional to running time? Will count be the same in every run and with every timer type (-r -p -v)?
Note: there should be no other CPU-active processes on the system, and the question is about a fixed-HZ kernel.
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

/* Use 9 ms timer */
#define usecs 9000

int events = 0;

void count(int a) {
    events++;
}

int main(int argc, char **argv)
{
    int timer, j, i, k = 0;
    struct itimerval timerval = {
        .it_interval = {.tv_sec = 0, .tv_usec = usecs},
        .it_value    = {.tv_sec = 0, .tv_usec = usecs}
    };

    if ((argc != 2) || (argv[1][0] != '-')) {
        printf("Usage: %s -[rpv]\n -r - ITIMER_REAL\n -p - ITIMER_PROF\n -v - ITIMER_VIRTUAL\n", argv[0]);
        exit(0);
    }
    switch (argv[1][1]) {
    case 'r':
        timer = ITIMER_REAL;
        break;
    case 'p':
        timer = ITIMER_PROF;
        break;
    case 'v':
        timer = ITIMER_VIRTUAL;
        break;
    default:
        printf("Unknown timer type: %c\n", argv[1][1]);
        exit(1);
    };

    signal(SIGALRM, count);
    signal(SIGPROF, count);
    signal(SIGVTALRM, count);
    setitimer(timer, &timerval, NULL);

    /* constants should be tuned to some huge value */
    for (j = 0; j < 4; j++)
        for (i = 0; i < 2000000000; i++)
            k += k*argc + 5*k + argc*3;

    printf("%d events\n", events);
    return 0;
}
Is it guaranteed that signal count will be proportional to running time?
Yes. In general, for all three timers, the longer the code runs, the more signals are received.
Will count be the same in every run and with every timer type (-r -p -v)?
No.
When the timer is set using ITIMER_REAL, the timer decrements in real time.
When it is set using ITIMER_VIRTUAL, the timer decrements only when the process is executing in the user address space. So, it doesn't decrement when the process makes a system call or during interrupt service routines.
So we can expect that #real_signals > #virtual_signals
ITIMER_PROF timers decrement both during user-space execution of the process and when the OS is executing on behalf of the process, i.e. during system calls.
So #prof_signals > #virtual_signals
ITIMER_PROF doesn't decrement while the process isn't executing at all (neither in user space nor in the kernel on its behalf), whereas ITIMER_REAL always decrements. So #real_signals > #prof_signals
To summarise, #real_signals > #prof_signals > #virtual_signals.
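To see the ITIMER_REAL vs ITIMER_VIRTUAL distinction directly, here is a minimal sketch (my own example, not part of the original question): the process spends its time mostly blocked, so the real timer keeps firing at wall-clock rate while the virtual timer barely fires, because almost no user-space CPU time is consumed.
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t real_hits = 0, virt_hits = 0;

static void on_real(int sig) { (void)sig; real_hits++; }
static void on_virt(int sig) { (void)sig; virt_hits++; }

int main(void)
{
    struct itimerval iv = {
        .it_interval = {.tv_sec = 0, .tv_usec = 9000},
        .it_value    = {.tv_sec = 0, .tv_usec = 9000}
    };
    signal(SIGALRM, on_real);      /* delivered when ITIMER_REAL expires    */
    signal(SIGVTALRM, on_virt);    /* delivered when ITIMER_VIRTUAL expires */
    setitimer(ITIMER_REAL, &iv, NULL);
    setitimer(ITIMER_VIRTUAL, &iv, NULL);

    /* Spend ~3 s of wall-clock time mostly blocked in nanosleep(): the real
       timer should accumulate on the order of 300 signals (with HZ=100),
       while the virtual timer should accumulate almost none, since virtual
       time only advances while the process runs user-space code. */
    struct timespec req = {0, 1000000};   /* 1 ms */
    time_t start = time(NULL);
    while (time(NULL) - start < 3)
        nanosleep(&req, NULL);   /* returns early on a signal; loop retries */

    printf("real: %d  virtual: %d\n", (int)real_hits, (int)virt_hits);
    return 0;
}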

OpenMP behaviour detecting CPU and thread

I'm at the very beginning with OpenMP. I just compiled the following piece of code with gcc -fopenmp openmp_c_helloworld.c:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if (th_id == 0) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
I just ran the executable on a quad-core Intel CPU with HyperThreading and I obtained the following output:
Hello World from thread 2
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
There are 4 threads
Technically speaking, I have 8 hardware threads available on my CPU and 4 CPU cores; why does OpenMP show me only 4 threads?
To put it simply, I think it's because OpenMP looks at the number of CPUs (cores) rather than the number of processor threads.
See this page:
Implementation default - usually the number of CPUs on a node, though
it could be dynamic (see next bullet).
Something you could try out is setting the number of threads in your program to be equal to the number of processor threads and see if there's a performance improvement (you'll have to create your own benchmarking program).
In parallel programming, good performance is generally obtained when the number of worker threads is equal to the number of processor threads. You can keep a thread or two extra for I/O as well. A minimal way to set this is shown below.
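Here is a minimal sketch of that experiment (my own, using a hypothetical team size of 8 for a 4-core/8-thread CPU): you can request a specific number of threads with omp_set_num_threads(), a num_threads clause, or the OMP_NUM_THREADS environment variable, and then time your real workload with each setting.
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(8);      /* hypothetical: match 8 hardware threads */

    #pragma omp parallel         /* or: #pragma omp parallel num_threads(8) */
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}
Alternatively, leave the code unchanged and run it as OMP_NUM_THREADS=8 ./a.out.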

Why sleep() after acquiring a pthread_mutex_lock will block the whole program?

In my test program, I start two threads, each of them just do the following logic:
1) pthread_mutex_lock()
2) sleep(1)
3) pthread_mutex_unlock()
However, I find that after some time, one of the two threads blocks on pthread_mutex_lock() forever, while the other thread works normally. This is very strange behavior, and I think it may be a potentially serious issue. According to the Linux manual, sleep() is not prohibited while a pthread_mutex_t is held. So my question is: is this a real problem, or is there a bug in my code?
The following is the test program. In the code, the 1st thread's output is directed to stdout, while the 2nd's is directed to stderr. So we can check these two outputs to see whether a thread is blocked.
I have tested it on linux kernel (2.6.31) and (2.6.9). Both results are the same.
//======================= Test Program ===========================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_NUM 2

static int data[THREAD_NUM];
static int sleepFlag = 1;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void * threadFunc(void *arg)
{
    int* idx = (int*) arg;
    FILE* fd = NULL;
    if (*idx == 0)
        fd = stdout;
    else
        fd = stderr;

    while (1) {
        fprintf(fd, "\n[%d]Before pthread_mutex_lock is called\n", *idx);
        if (pthread_mutex_lock(&mutex) != 0) {
            exit(1);
        }
        fprintf(fd, "[%d]pthread_mutex_lock is finisheded. Sleep some time\n", *idx);
        if (sleepFlag == 1)
            sleep(1);
        fprintf(fd, "[%d]sleep done\n\n", *idx);
        fprintf(fd, "[%d]Before pthread_mutex_unlock is called\n", *idx);
        if (pthread_mutex_unlock(&mutex) != 0) {
            exit(1);
        }
        fprintf(fd, "[%d]pthread_mutex_unlock is finisheded.\n", *idx);
    }
}

// 1. compile
//    gcc -o pthread pthread.c -lpthread
// 2. run
//    1) ./pthread sleep 2> /tmp/error.log   # Each thread sleeps 1 second after it acquires pthread_mutex_lock
//       ==> We can find that /tmp/error.log will not increase.
//    or
//    2) ./pthread nosleep 2> /tmp/error.log # No sleep is done when each thread acquires pthread_mutex_lock
//       ==> We can find that both stdout and /tmp/error.log increase.
int main(int argc, char *argv[]) {
    if ((argc == 2) && (strcmp(argv[1], "nosleep") == 0))
    {
        sleepFlag = 0;
    }
    pthread_t t[THREAD_NUM];
    int i;
    for (i = 0; i < THREAD_NUM; i++) {
        data[i] = i;
        int ret = pthread_create(&t[i], NULL, threadFunc, &data[i]);
        if (ret != 0) {
            perror("pthread_create error\n");
            exit(-1);
        }
    }
    for (i = 0; i < THREAD_NUM; i++) {
        int ret = pthread_join(t[i], (void*)0);
        if (ret != 0) {
            perror("pthread_join error\n");
            exit(-1);
        }
    }
    exit(0);
}
This is the output:
On the terminal where the program is started:
root@skyscribe:~# ./pthread sleep 2> /tmp/error.log
[0]Before pthread_mutex_lock is called
[0]pthread_mutex_lock is finisheded. Sleep some time
[0]sleep done
[0]Before pthread_mutex_unlock is called
[0]pthread_mutex_unlock is finisheded.
...
On another terminal to see the file /tmp/error.log
root@skyscribe:~# tail -f /tmp/error.log
[1]Before pthread_mutex_lock is called
And no new lines are written to /tmp/error.log.
This is the wrong way to use mutexes. A thread should not hold a mutex for longer than it goes without holding it, and in particular it should not sleep while holding one. There is no FIFO guarantee for locking a mutex (for efficiency reasons).
More specifically, if thread 1 unlocks the mutex while thread 2 is waiting for it, it makes thread 2 runnable but this does not force the scheduler to preempt thread 1 or make thread 2 run immediately. Most likely, it will not because thread 1 has recently slept. When thread 1 subsequently reaches the pthread_mutex_lock() call, it will generally be allowed to lock the mutex immediately, even though there is a thread waiting (and the implementation can know it). When thread 2 wakes up after that, it will find the mutex already locked and go back to sleep.
The best solution is not to hold a mutex for that long. If that is not possible, consider moving the lock-needing operations to a single thread (removing the need for the lock) or waking up the correct thread using condition variables.
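As a concrete illustration of "don't hold the lock that long" (my own sketch, not code from the answer), the loop body in threadFunc() could be rearranged so each thread only holds the mutex while it actually touches shared state, and sleeps with the lock released:
    /* Hypothetical rewrite of the loop in the asker's threadFunc(): the mutex
       now protects only a short shared-state section, and the sleep happens
       while the lock is free, so the other thread can make progress. */
    while (1) {
        pthread_mutex_lock(&mutex);
        /* ... briefly read/update shared data here ... */
        pthread_mutex_unlock(&mutex);

        if (sleepFlag == 1)
            sleep(1);          /* sleeping without the lock held */
    }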
There's neither a problem, nor a bug in your code, but a combination of buffering and scheduling effects. Add an fflush here:
fprintf (fd, "[%d]pthread_mutex_unlock is finisheded.\n", *idx);
fflush (fd);
and run
./a.out 1> 1.log 2> 2.log &
and you'll see rather equal progress made by the two threads.
EDIT: and as @jilles said above, a mutex is supposed to be a short-wait lock, as opposed to long waits like a condition variable wait, waiting for I/O, or sleeping. That's also why a mutex is not a cancellation point.
