Synchronizing two threads in C++ on Linux - linux

I have the following code:
#include <iostream>
#include <thread>
#include <mutex>
#include <unistd.h>
using namespace std;
bool isRunning;
mutex locker;
void threadFunc(int num) {
while(isRunning) {
locker.lock();
cout << num << endl;
locker.unlock();
sleep(1);
}
}
int main(int argc, char *argv[])
{
isRunning = true;
thread thr1(threadFunc,1);
thread thr2(threadFunc,2);
cout << "Hello World!" << endl;
thr1.join();
thr2.join();
return 0;
}
When I run this code, I expect output like:
1
2
1
2
1
2
1
2
...
But I don't get that. I get something like this instead:
1
2
1
2
2 <--- why so?
1
2
1
If I run this code on Windows, replacing #include <unistd.h> with #include <windows.h> and sleep(1) with Sleep(1000), the output is exactly what I want, i.e. 1212121212.
Why is that, and how can I get the same result on Linux?

This comes down to thread scheduling. Sometimes one thread runs slightly ahead of the other. Here thread 2 got ahead once, so you see ... 1 2 2 ... There is nothing wrong with that: the mutex only guarantees that one thread prints at a time, and nothing more. In particular, it says nothing about the order in which the threads acquire it. Exactly when a thread goes to sleep and when it is woken up is not deterministic, so the two threads do not take exactly the same time on every iteration.
To make the threads execute strictly alternately, a different arrangement is needed, for example two semaphores, s1 and s2, with initial values 1 and 0 respectively. Consider the following pseudocode, where P waits on (decrements) a semaphore and V signals (increments) it:
// Thread 1:
P (s1)
print number
V (s2)
// Thread 2:
P (s2)
print number
V (s1)

Related

What is the effect of thread contention?

As you can see, when I remove mt.lock() and mt.unlock(), the result is smaller than 50000.
Why? What actually happens? I would be very grateful if you could explain it to me.
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
using namespace std;
class counter{
public:
mutex mt;
int value;
public:
counter():value(0){}
void increase()
{
//mt.lock();
value++;
//mt.unlock();
}
};
int main()
{
counter c;
vector<thread> threads;
for(int i=0;i<5;++i){
threads.push_back(thread([&]()
{
for(int i=0;i<10000;++i){
c.increase();
}
}));
}
for(auto& t:threads){
t.join();
}
cout << c.value <<endl;
return 0;
}
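value++ is not a single operation. It is a read-modify-write sequence: read the value, increment it, and write the result back. Since the sequence isn't atomic, multiple threads executing it at the same time can interleave and lose updates. (Formally, this is a data race, which is undefined behavior in C++.)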
As an example, consider three threads operating in the same region without any locking:
Threads 1 and 2 read value as 999
Thread 1 computes the incremented value as 1000 and updates the variable
Thread 3 reads 1000, increments to 1001 and updates the variable
Thread 2 computes the incremented value as 999 + 1 = 1000 and overwrites thread 3's work with 1000
If you use an atomic operation such as "fetch-and-add" instead, you don't need any locks. See std::atomic's fetch_add.

Why is this C++11 program not going to deadlock?

#include <iostream>
#include <mutex>
using namespace std;
int main()
{
mutex m;
m.lock();
cout << "locked once\n";
m.lock();
cout << "locked twice\n";
return 0;
}
Output:
./a.out
locked once
locked twice
Shouldn't the program deadlock at the second lock, i.e. when a mutex is locked twice by the same thread?
Not necessarily. From the documentation of std::mutex::lock: if lock is called by a thread that already owns the mutex, the behavior is undefined: the program may deadlock, or, if the implementation can detect the deadlock, a resource_deadlock_would_occur error condition may be thrown.
http://en.cppreference.com/w/cpp/thread/mutex/lock

Why are the thread IDs not printed in order?

I tried to create 10 threads and print each thread's index. My code is shown below. Why are some values repeated instead of being printed in order?
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "util.h"
#include <errno.h>
#include <unistd.h>
#include <signal.h>
#include <time.h>
pthread_mutex_t request_buf_lock = PTHREAD_MUTEX_INITIALIZER;
void * worker(void *arg)
{
int thread_id = *(int*)arg;
// int requests_handled = 0;
//requests_handled = requests_handled + 1;
printf("%d\n",thread_id);
return NULL;
}
int main(int argc, char** argv)
{
pthread_t dispatchers[100];
pthread_t workers[100];
int i;
int * thread_id = malloc(sizeof(int));
for (i = 0; i < 10; i++) {
*thread_id = i;
pthread_create(&workers[i], NULL, worker, (void*)thread_id);
}
for (i = 0; i < 10; i++) {
pthread_join(workers[i], NULL);
}
return 0;
}
And the output result is:
4
5
5
6
6
6
7
8
9
9
But I expected it as:
0
1
2
3
4
5
6
7
8
9
Does anyone have any ideas or advice?
All 10 threads execute in parallel, and they all share a single int object, the one created by the call to malloc.
By the time your first thread executes its printf call, the value of *thread_id has been set to 4. Your second and third threads execute their printf calls when *thread_id has been set to 5. And so on.
If you allocate a separate int object for each thread (either by moving the malloc call inside the loop or just by declaring an array of ints), you'll get a unique thread id in each thread. But they're still likely to be printed in arbitrary order, since there's no synchronization among the threads.

Using thrust with openmp: no substantial speed up obtained

I am interested in porting code I wrote mostly with the Thrust GPU library to multicore CPUs. Thankfully, the website says that Thrust code can be used with threading environments such as OpenMP / Intel TBB.
I wrote the simple code below, which sorts a large array, to see the speedup on a machine that supports up to 16 OpenMP threads.
The timings obtained on this machine for sorting a random array of size 16 million are
STL : 1.47 s
Thrust (16 threads) : 1.21 s
There seems to be barely any speed-up. I would like to know how to get a substantial speed-up for sorting arrays using OpenMP like I do with GPUs.
The code is below (the file sort.cu). Compilation was performed as follows:
nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
The NVCC version is 5.5
The Thrust library version being used is v1.7.0
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <time.h>
#include "thrust/sort.h"
int main(int argc, char *argv[])
{
int N = 16000000;
double* myarr = new double[N];
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
}
std::cout << "-------------\n";
clock_t start,stop;
start=clock();
std::sort(myarr,myarr+N);
stop=clock();
std::cout << "Time taken for sorting the array with STL is " << (stop-start)/(double)CLOCKS_PER_SEC;
//--------------------------------------------
srand(1);
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
//std::cout << myarr[i] << std::endl;
}
start=clock();
thrust::sort(myarr,myarr+N);
stop=clock();
std::cout << "------------------\n";
std::cout << "Time taken for sorting the array with Thrust is " << (stop-start)/(double)CLOCKS_PER_SEC;
return 0;
}
The device backend refers to the behavior of operations performed on a thrust::device_vector or similar reference. Thrust interprets the array/pointer you are passing it as a host pointer, and performs host-based operations on it, which are not affected by the device backend setting.
There are a variety of ways to fix this issue. If you read the device backend documentation you will find general examples and omp-specific examples. You could even specify a different host backend which should have the desired behavior (OMP usage) with your code, I think.
Once you fix this, you may get another surprise: thrust appears to sort the array quickly, but reports a very long execution time. I believe this is because (on Linux, anyway) clock() measures CPU time summed over all threads of the process, so its result grows with the number of OMP threads in use.
The following code/sample run has those issues addressed, and seems to give me a ~3x speedup for 4 threads.
$ cat t592.cu
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <sys/time.h>
#include <time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
int main(int argc, char *argv[])
{
int N = 16000000;
double* myarr = new double[N];
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
}
std::cout << "-------------\n";
timeval t1, t2;
gettimeofday(&t1, NULL);
std::sort(myarr,myarr+N);
gettimeofday(&t2, NULL);
float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
std::cout << "Time taken for sorting the array with STL is " << et << std::endl;
//--------------------------------------------
srand(1);
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
//std::cout << myarr[i] << std::endl;
}
thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);
gettimeofday(&t1, NULL);
thrust::sort(darr,darr+N);
gettimeofday(&t2, NULL);
et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
std::cout << "------------------\n";
std::cout << "Time taken for sorting the array with Thrust is " << et << std::endl ;
return 0;
}
$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
-------------
Time taken for sorting the array with STL is 1.31956
------------------
Time taken for sorting the array with Thrust is 0.468176
$
Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads. There may be a number of factors which prevent an OMP code from scaling beyond a certain number of threads. Sorting generally tends to be a memory-bound algorithm, so you will probably observe an increase until you have saturated the memory subsystem, and then no further increase from additional cores. Depending on your system, it's possible you could be in this situation already, in which case you may not see any improvement from OMP style multithreading.

Does a call to MPI_Barrier affect every thread in an MPI process?

Does a call to MPI_Barrier affect every thread in an MPI process, or only the thread that makes the call?
For your information, my MPI application will run with MPI_THREAD_MULTIPLE.
Thanks.
The way to think of this is that MPI_Barrier (and other collectives) are blocking function calls, which block until all processes in the communicator have completed the function. That, I think, makes it a little easier to figure out what should happen; the function blocks, but other threads continue on their way unimpeded.
So consider the following chunk of code (The shared done flag being flushed to communicate between threads is not how you should be doing thread communication, so please don't use this as a template for anything. Furthermore, using a reference to done will solve this bug/optimization, see the end of comment 2):
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char**argv) {
int ierr, size, rank;
int provided;
volatile int done=0;
MPI_Comm comm;
ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided == MPI_THREAD_SINGLE) {
fprintf(stderr,"Could not initialize with thread support\n");
MPI_Abort(MPI_COMM_WORLD,1);
}
comm = MPI_COMM_WORLD;
ierr = MPI_Comm_size(comm, &size);
ierr = MPI_Comm_rank(comm, &rank);
if (rank == 1) sleep(10);
#pragma omp parallel num_threads(2) default(none) shared(rank,comm,done)
{
#pragma omp single
{
/* spawn off one thread to do the barrier,... */
#pragma omp task
{
MPI_Barrier(comm);
printf("%d -- thread done Barrier\n", rank);
done = 1;
#pragma omp flush
}
/* and another to do some printing while we're waiting */
#pragma omp task
{
int *p = &done;
while(!(*p)) {
printf("%d -- thread waiting\n", rank);
sleep(1);
}
}
}
}
MPI_Finalize();
return 0;
}
Rank 1 sleeps for 10 seconds, and all the ranks start a barrier in one thread. If you run this with mpirun -np 2, you'd expect the first of rank 0's threads to hit the barrier and the other to cycle around printing and waiting, and sure enough, that's what happens:
$ mpirun -np 2 ./threadbarrier
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
1 -- thread waiting
0 -- thread done Barrier
1 -- thread done Barrier

Resources