I have an OpenMP parallelized program that looks like that:
[...]
#pragma omp parallel
{
//initialize threads
#pragma omp for
for(...)
{
//Work is done here
}
}
Now I'm adding MPI support. What I will need is a thread that handles the communication, in my case, calls GatherAll all the time and fills/empties a linked list for receiving/sending data from the other processes. That thread should send/receive until a flag is set. So right now there is no MPI stuff in the example, my question is about the implementation of that routine in OpenMP.
How do I implement such a thread? For example, I tried to introduce a single directive here:
[...]
int kill=0
#pragma omp parallel shared(kill)
{
//initialize threads
#pragma omp single nowait
{
while(!kill)
send_receive();
}
#pragma omp for
for(...)
{
//Work is done here
}
kill=1
}
but in this case the program gets stuck because the implicit barrier after the for-loop waits for the thread in the while-loop above.
Thank you, rugermini.
You could try adding a nowait clause to your single construct:
EDIT: responding to the first comment
If you enable nested parallelism for OpenMP, you might be able to achieve what you want by making two levels of parallelism. In the top level, you have two concurrent parallel sections, one for the MPI communications, the other for local computation. This last section can itself be parallelized, which gives you a second level of parallelisation. Only threads executing this level will be affected by barriers in it.
#include <iostream>
#include <omp.h>
int main()
{
int kill = 0;
#pragma omp parallel sections
{
#pragma omp section
{
while (kill == 0){
/* manage MPI communications */
}
}
#pragma omp section
{
#pragma omp parallel
#pragma omp for
for (int i = 0; i < 10000 ; ++i) {
/* your workload */
}
kill = 1;
}
}
}
However, you must be aware that your code is going to break if you don't have at least two threads, which means you're breaking the assumption that the sequential and parallelized versions of the code should do the same thing.
It would be much cleaner to wrap your OpenMP kernel inside a more global MPI communication scheme (potentially using asynchronous communications to overlap communications with computations).
You have to be careful, because you can't just have your MPI calling thread "skip" the omp for loop; all threads in the thread team have to go through the for loop.
There's a couple ways you could do this: with nested parallism and tasks, you could launch one task to do the message passing and anther to call a work routine which has an omp parallel for in it:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
void work(int rank) {
const int n=14;
#pragma omp parallel for
for (int i=0; i<n; i++) {
int tid = omp_get_thread_num();
printf("%d:%d working on item %d\n", rank, tid, i);
}
}
void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
const int tag=1;
MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
data, 1, MPI_INT, rneighbour, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
int main(int argc, char **argv) {
int rank, size;
int sneighbour;
int rneighbour;
int data;
int got;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
omp_set_nested(1);
sneighbour = rank + 1;
if (sneighbour >= size) sneighbour = 0;
rneighbour = rank - 1;
if (rneighbour <0 ) rneighbour = size-1;
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{
sendrecv(rank, sneighbour, rneighbour, &data);
printf("Got data from %d\n", data);
}
#pragma omp task
work(rank);
}
}
MPI_Finalize();
return 0;
}
Alternately, you could make your omp for loop schedule(dynamic) so that the other threads can pick up some of the slack from while the master thread is sending, and the master thread can pick up some work when it's done:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
const int tag=1;
MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
data, 1, MPI_INT, rneighbour, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
int main(int argc, char **argv) {
int rank, size;
int sneighbour;
int rneighbour;
int data;
int got;
const int n=14;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
omp_set_nested(1);
sneighbour = rank + 1;
if (sneighbour >= size) sneighbour = 0;
rneighbour = rank - 1;
if (rneighbour <0 ) rneighbour = size-1;
#pragma omp parallel
{
#pragma omp master
{
sendrecv(rank, sneighbour, rneighbour, &data);
printf("Got data from %d\n", data);
}
#pragma omp for schedule(dynamic)
for (int i=0; i<n; i++) {
int tid = omp_get_thread_num();
printf("%d:%d working on item %d\n", rank, tid, i);
}
}
MPI_Finalize();
return 0;
}
Hmmm. If you are indeed adding MPI 'support' to your program, then you ought to be using mpi_allgather as mpi_gatherall does not exist. Note that mpi_allgather is a collective operation, that is all processes in the communicator call it. You can't have a process gathering data while the other processes do whatever it is they do. What you could do is use MPI single-sided communications to implement your idea; this will be a little tricky but no more than that if one process only reads the memory of other processes.
I'm puzzled by your use of the term 'thread' wrt MPI. I fear that you are confusing OpenMP and MPI, one of whose variants is called OpenMPI. Despite this name it is as different from OpenMP as chalk from cheese. MPI programs are written in terms of processes, not threads. The typical OpenMP implementation does indeed use threads, though the details are generally well-hidden from the programmer.
I'm seriously impressed that you are trying, or seem to be trying, to use MPI 'inside' your OpenMP code. This is exactly the opposite of work I do, and see others do on some seriously large computers. The standard mode for such 'hybrid' parallelisation is to write MPI programs which call OpenMP code. Many of today's very large computers comprise collections of what are, in effect, multicore boxes. A typical approach to programming one of these is to have one MPI process running on each box, and for each of those processes to use one OpenMP thread for each core in the box.
Related
In my experience, when I update a varible in 1 task the variable is not updated in other tasks even if the first task that updated the variable is done executing. For example given the code,
int nThreads = atoi(argv[1]);
omp_set_num_threads(nThreads);
int currentInt = 0;
int numEdges = 1000000;
#pragma omp parallel shared(currentInt)
{
#pragma omp single
{
#pragma omp task shared(currentInt)
{
printf("I am doing kruskals: Thread %d\n", omp_get_thread_num());
while(currentInt < numEdges)
{
currentInt++;
}
printf("Kruskals Done! %d\n", currentInt);
#pragma omp shared(currentInt)
{
for(int i = 0; i < 10000000; i++){
}
printf("Helper: Current Int %d Thread %d \n", currentInt, omp_get_thread_num());
}
}
#pragma omp taskwait
}
}
It will always print currentInt 0. Even if the first task finishes before the second. I need this because I am trying to parallize an algorithm where a have a sequential task going through a large array and many parallel tasks excuting simultanously on parts of that array and once the sequential task reaches the portion of the array that a parallel task is working on the parallel task can stop itself because it is no longer needed. The parallel and sequential tasks share no dependancies so that is not a problem.
Any help will be appreciated.
I have a top level controller, which schedules n sub threads,
and waits for all of them to complete before scheduling them all over again. These threads go on forever, so the threads do not need to be joined.
So the pseudo-code is something like this (assuming n=2):
Top:
loop:
1. initiate T1 and T2
2. wait for completion of both T1 and T2
T1: (similarly for T2)
loop:
1. wait for lock-1
2. do something
3. send completion signal
I am thinking of the following code for this, where Top,T1,T2 are
separate threads:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define NUM_PROCS 2
pthread_mutex_t m_1, m_2; // for scheduling T1,T2
int count;
pthread_mutex_t m_count; // for completion-signal
pthread_cond_t c_count;
pthread_attr_t attr; // for threads
pthread_t thread[NUM_PROCS+1];
void *Top(void *t) {
count=0;
while(1) {
pthread_mutex_unlock(&m_1);
pthread_mutex_unlock(&m_2);
// not sure if this the correct way to wait for T1&T2
pthread_mutex_lock(&m_count);
while(count < 2) {
pthread_cond_wait(&c_count, &m_count);
}
count=0;
pthread_mutex_unlock(&m_count);
}
}
void *T1(void *t) { // similarly for T2
while(1) {
pthread_mutex_lock(&m_1); // use m_2 for T2
sleep(1);
pthread_mutex_lock(&m_count);
count++;
pthread_mutex_unlock(&m_count);
pthread_cond_signal(&c_count);
}
}
void *T2(void *t) {
while(1) {
pthread_mutex_lock(&m_2);
sleep(1);
pthread_mutex_lock(&m_count);
count++;
pthread_mutex_unlock(&m_count);
pthread_cond_signal(&c_count);
}
}
int main() {
int rc;
int t[NUM_PROCS+1] = {0,1,2}; // thread numbers
pthread_mutex_init(&m_1, NULL); // initializations
pthread_mutex_init(&m_2, NULL);
pthread_mutex_init(&m_count, NULL);
pthread_cond_init(&c_count, NULL);
pthread_mutex_lock(&m_1); // to allow Top to start first
pthread_mutex_lock(&m_2);
pthread_attr_init(&attr); // initiate the threads
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
rc = pthread_create(&thread[0], &attr, Top, (void *)&t[0]);
rc = pthread_create(&thread[1], &attr, T1, (void *)&t[1]);
rc = pthread_create(&thread[2], &attr, T2, (void *)&t[2]);
}
My questions on the above code:
Is the above code correct?
Usually, lock and unlock are both done by the same thread.
So my solution, of T1 locking m_1 and Top unlocking it,
seems a bit weird. Is there a better way of doing this?
Is semaphore a more efficient way to do this synchronization?
Will the code change (except main() of course) if I implement
this as separate processes with shared memory, instead of as
threads? And will that be less efficient than the threads version?
A thread that has not locked a pthread mutex may not unlock it. If you need to create a lock that one thread can acquire and another thread can release, you have to do so with your own code. A standard mutex is not such a lock.
I have a code that runs many iterations and only if a condition is met, the result of the iteration is saved. This is naturally expressed as a while loop. I am attempting to make the code run in parallel, since each realisation is independent. So I have this:
while(nit<avit){
#pragma omp parallel shared(nit,avit)
{
//do some stuff
if(condition){
#pragma omp critical
{
nit++;
\\save results
}
}
}//implicit barrier here
}
and this works fine... but there is a barrier after each realization, which means that if the stuff I am doing inside the parallel block takes longer in one iteration than the others, all my threads are waiting for it to finish, instead of continuing with the next iteration.
Is there a way to avoid this barrier so that the threads keep working? I am averaging thousands of iterations, so a few more don't hurt (in case the nit variable has not been incremented in already running threads)...
I have tried to turn this into a parallel for, but the automatic increment in the for loop makes the nit variable go wild. This is my attempt:
#pragma omp parallel shared(nit,avit)
{
#pragma omp for
for(nit=0;nit<avit;nit++){
//do some stuff
if(condition){
\\save results
} else {
#pragma omp critical
{
nit--;
}
}
}
}
and it keeps working and going around the for loop, as expected, but my nit variable takes unpredictable values... as one could expect from the increase and decrease of it by different threads at different times.
I have also tried leaving the increment in the for loop blank, but it doesn't compile, or trying to trick my code to have no increment in the for loop, like
...
incr=0;
for(nit=0;nit<avit;nit+=incr)
...
but then my code crashes...
Any ideas?
Thanks
Edit: Here's a working minimal example of the code on a while loop:
#include <random>
#include <vector>
#include <iostream>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <unistd.h>
using namespace std;
int main(){
int nit,dit,avit=100,t,j,tmax=100,jmax=10;
vector<double> Res(10),avRes(10);
nit=0; dit=0;
while(nit<avit){
#pragma omp parallel shared(tmax,nit,jmax,avRes,avit,dit) private(t,j) firstprivate(Res)
{
srand(int(time(NULL)) ^ omp_get_thread_num());
t=0; j=0;
while(t<tmax&&j<jmax){
Res[j]=rand() % 10;
t+=Res[j];
if(omp_get_thread_num()==5){
usleep(100000);
}
j++;
}
if(t<tmax){
#pragma omp critical
{
nit++;
for(j=0;j<jmax;j++){
avRes[j]+=Res[j];
}
for(j=0;j<jmax;j++){
cout<<avRes[j]/nit<<"\t";
}
cout<<" \t nit="<<nit<<"\t thread: "<<omp_get_thread_num();
cout<<endl;
}
} else{
#pragma omp critical
{
dit++;
cout<<"Discarded: "<<dit<<"\r"<<flush;
}
}
}
}
return 0;
}
I added the usleep part to simulate one thread taking longer than the others. If you run the program, all threads have to wait for thread 5 to finish, and then they start the next run. what I am trying to do is precisely to avoid such wait, i.e. I'd like the other threads to pick the next iteration without waiting for 5 to finish.
You can basically follow the same concept as for this question, with a slight variation to ensure that avRes is not written to in parallel:
int nit = 0;
#pragma omp parallel
while(1) {
int local_nit;
#pragma omp atomic read
local_nit = nit;
if (local_nit >= avit) {
break;
}
[...]
if (...) {
#pragma omp critical
{
#pragma omp atomic capture
local_nit = ++nit;
for(j=0;j<jmax;j++){
avRes[j] += Res[j];
}
for(j=0;j<jmax;j++){
// technically you could also use `nit` directly since
// now `nit` is only modified within this critical section
cout<<avRes[j]/local_nit<<"\t";
}
}
} else {
#pragma omp atomic update
dit++;
}
}
It also works with critical regions, but atomics are more efficient.
There's another thing you need to consider, rand() should not be used in parallel contexts. See this question. For C++, use a private (i.e. defined within the parallel region) random number generator from <random>.
I tried to find a solution in order to keep the number of working threads constant under linux in C using pthreads, but I seem to be unable to fully understand what's wrong with the following code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAX_JOBS 50
#define MAX_THREADS 5
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int jobs = MAX_JOBS;
int worker = 0;
int counter = 0;
void *functionC() {
pthread_mutex_lock(&mutex1);
worker++;
counter++;
printf("Counter value: %d\n",counter);
pthread_mutex_unlock(&mutex1);
// Do something...
sleep(4);
pthread_mutex_lock(&mutex1);
jobs--;
worker--;
printf(" >>> Job done: %d\n",jobs);
pthread_mutex_unlock(&mutex1);
}
int main(int argc, char *argv[]) {
int i=0, j=0;
pthread_t thread[MAX_JOBS];
// Create threads if the number of working threads doesn't exceed MAX_THREADS
while (1) {
if (worker > MAX_THREADS) {
printf(" +++ In queue: %d\n", worker);
sleep(1);
} else {
//printf(" +++ Creating new thread: %d\n", worker);
pthread_create(&thread[i], NULL, &functionC, NULL);
//printf("%d",worker);
i++;
}
if (i == MAX_JOBS) break;
}
// Wait all threads to finish
for (j=0;j<MAX_JOBS;j++) {
pthread_join(thread[j], NULL);
}
return(0);
}
A while (1) loop keeps creating threads if the number of working threads is under a certain threshold. A mutex is supposed to lock the critical sections every time the global counter of the working threads is incremented (thread creation) and decremented (job is done). I thought it could work fine and for the most part it does, but weird things happen...
For instance, if I comment (as it is in this snippet) the printf //printf(" +++ Creating new thread: %d\n", worker); the while (1) seems to generate a random number (18-25 in my experience) threads (functionC prints out "Counter value: from 1 to 18-25"...) at a time instead of respecting the IF condition inside the loop. If I include the printf the loop seems to behave "almost" in the right way... This seems to hint that there's a missing "mutex" condition that I should add to the loop in main() to effectively lock the thread when MAX_THREADS is reached but after changing a LOT of times this code for the past few days I'm a bit lost, now. What am I missing?
Please, let me know what I should change in order to keep the number of threads constant it doesn't seem that I'm too far from the solution... Hopefully... :-)
Thanks in advance!
Your problem is that worker is not incremented until the new thread actually starts and gets to run - in the meantime, the main thread loops around, checks workers, finds that it hasn't changed, and starts another thread. It can repeat this many times, creating far too many threads.
So, you need to increment worker in the main thread, when you've decided to create a new thread.
You have another problem - you should be using condition variables to let the main thread sleep until it should start another thread, not using a busy-wait loop with a sleep(1); in it. The complete fixed code would look like:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#define MAX_JOBS 50
#define MAX_THREADS 5
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond1 = PTHREAD_COND_INITIALIZER;
int jobs = MAX_JOBS;
int workers = 0;
int counter = 0;
void *functionC() {
pthread_mutex_lock(&mutex1);
counter++;
printf("Counter value: %d\n",counter);
pthread_mutex_unlock(&mutex1);
// Do something...
sleep(4);
pthread_mutex_lock(&mutex1);
jobs--;
printf(" >>> Job done: %d\n",jobs);
/* Worker is about to exit, so decrement count and wakeup main thread */
workers--;
pthread_cond_signal(&cond1);
pthread_mutex_unlock(&mutex1);
return NULL;
}
int main(int argc, char *argv[]) {
int i=0, j=0;
pthread_t thread[MAX_JOBS];
// Create threads if the number of working threads doesn't exceed MAX_THREADS
while (i < MAX_JOBS) {
/* Block on condition variable until there are insufficient workers running */
pthread_mutex_lock(&mutex1);
while (workers >= MAX_THREADS)
pthread_cond_wait(&cond1, &mutex1);
/* Another worker will be running shortly */
workers++;
pthread_mutex_unlock(&mutex1);
pthread_create(&thread[i], NULL, &functionC, NULL);
i++;
}
// Wait all threads to finish
for (j=0;j<MAX_JOBS;j++) {
pthread_join(thread[j], NULL);
}
return(0);
}
Note that even though this works, it isn't ideal - it's best to create the number of threads you want up-front, and have them loop around, waiting for work. This is because creating and destroying threads has significant overhead, and because it often simplifies resource management. A version of your code rewritten to work like this would look like:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#define MAX_JOBS 50
#define MAX_THREADS 5
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int jobs = MAX_JOBS;
int counter = 0;
void *functionC()
{
int running_job;
pthread_mutex_lock(&mutex1);
counter++;
printf("Counter value: %d\n",counter);
while (jobs > 0) {
running_job = jobs--;
pthread_mutex_unlock(&mutex1);
printf(" >>> Job starting: %d\n", running_job);
// Do something...
sleep(4);
printf(" >>> Job done: %d\n", running_job);
pthread_mutex_lock(&mutex1);
}
pthread_mutex_unlock(&mutex1);
return NULL;
}
int main(int argc, char *argv[]) {
int i;
pthread_t thread[MAX_THREADS];
for (i = 0; i < MAX_THREADS; i++)
pthread_create(&thread[i], NULL, &functionC, NULL);
// Wait all threads to finish
for (i = 0; i < MAX_THREADS; i++)
pthread_join(thread[i], NULL);
return 0;
}
Anyone think about it. OpenMP features to adjust cpu muscles to handle dumbbel. In my research for openmp we cannot set thread priority to execute block code with powerfull muscle. Only one way(_beginthreadex or CreateThread function with 5. parameters) to create threads with highest priority.
Here some code for this issue:
This is manual setting.
int numberOfCore = ( execute __cpuid to obtain number of cores on your cpu ).
HANDLES* hThreads = new HANDLES[ numberOfCore ];
hThreads[0] = _beginthreadex( NULL, 0, someThreadFunc, NULL, 0, NULL );
SetThreadPriority( hThreads[0], HIGH_PRIORITY_CLASS );
WaitForMultipleObjects(...);
Here is i want to see this part:
#pragma omp parallel
{
#pragma omp for ( threadpriority:HIGH_PRIORITY_CLASS )
for( ;; ) { ... }
}
Or
#pragma omp parallel
{
// Generally this function greatly appreciativable.
_omp_set_priority( HIGH_PRIORITY_CLASS );
#pragma omp for
for( ;; ) { ... }
}
I dont know if there was a way to setup priority with openmp pls inform us.
You can do SetThreadPriority in the body of the loop without requiring special support from OpenMP:
for (...)
{
DWORD priority=GetThreadPriority(...);
SetThreadPriority(...);
// stuff
SetThreadPriority(priority);
}
Simple test reveals unexpected results:
I have run a simple test in Visual Studio 2010 (Windows 7):
#include <stdio.h>
#include <omp.h>
#include <windows.h>
int main()
{
int tid, nthreads;
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
#pragma omp parallel private(tid) num_threads(4)
{
tid = omp_get_thread_num();
printf("Thread %d: Priority = %d\n", tid, GetThreadPriority(GetCurrentThread()));
}
printf("\n");
#pragma omp parallel private(tid) shared(nthreads) num_threads(4)
{
tid = omp_get_thread_num();
#pragma omp master
{
printf("Master Thread %d: Priority = %d\n", tid, GetThreadPriority(GetCurrentThread()));
}
}
#pragma omp parallel num_threads(4)
{
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
}
printf("\n");
#pragma omp parallel private(tid) num_threads(4)
{
tid = omp_get_thread_num();
printf("Thread %d: Priority = %d\n", tid, GetThreadPriority(GetCurrentThread()));
}
return 0;
}
The output is:
Thread 1: Priority = 0
Thread 0: Priority = 1
Thread 2: Priority = 0
Thread 3: Priority = 0
Master Thread 0: Priority = 1
Thread 0: Priority = 1
Thread 1: Priority = 1
Thread 3: Priority = 1
Thread 2: Priority = 1
Explanation:
The OpenMP master threads is executed with the thread priority of the main.
The other OpenMP threads are left in Normal priority.
When manually setting the thread priority of OpenMP threads, the threads remains with that priority.