In the consumer branch of the code snippet below, a flush is used to observe any changes to flag that might have happened since the previous read, but the data variable is not flushed prior to invoking f().
Q1: Should a flush be added to the consumer before invoking f()?
Q2: Does the answer change if you assume that data is not in the L1 cache of the consumer thread before invoking f()?
#pragma omp parallel shared(data, flag)
{
if (omp_get_thread_num() == 0) { // Producer.
// Write to data and make visible to other thread.
data = computeData();
#pragma omp flush (data)
// Write to flag and make visible to other thread.
flag = 1;
#pragma omp flush (flag)
}
if (omp_get_thread_num() == 1) { // Consumer.
while (flag == 0) {
#pragma omp flush (flag)
; // No-op, flush reloads.
}
f(data); // Do something with data.
}
}
It seems that the racy code snippet used in my class was copied from other sources. I am concurrently reading The OpenMP Common Core [1] and found a race-free equivalent using atomic, as recommended by @MichaelKlemm. I modified my original snippet based on Figure 11.5 in [1].
#pragma omp parallel shared(data, flag)
{
int temp = 0;
if (omp_get_thread_num() == 0) { // Producer.
// Write to data and make visible to other thread.
data = computeData();
#pragma omp flush
// Write to flag with atomic results in implicit flush.
#pragma omp atomic write
flag = 1;
}
if (omp_get_thread_num() == 1) { // Consumer.
while (!temp) {
#pragma omp atomic read
temp = flag; // Read into temp in case flag changes.
}
#pragma omp flush
f(data); // Do something with data.
}
}
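For anyone who wants to try this pattern in isolation, here is a minimal compilable sketch of the same flush/atomic handshake. The bodies of computeData() and f() are hypothetical stand-ins of mine (the originals are not shown), and I use sections instead of thread-number tests so that the code still runs when compiled without OpenMP. It assumes the usual behaviour of handing out sections in order; the specification does not promise which thread gets which section.

```cpp
// Hypothetical stand-ins for computeData() and f() from the snippet above.
static int computeData() { return 42; }
static int f(int value) { return value + 1; }

// Hands one value from a producer section to a consumer section.
int handoff() {
    int data = 0;    // payload, written by the producer only
    int flag = 0;    // 0 = not ready, 1 = data is ready
    int result = 0;  // written by the consumer only

    #pragma omp parallel sections num_threads(2) shared(data, flag, result)
    {
        #pragma omp section
        {   // Producer: publish data first, then raise the flag.
            data = computeData();
            #pragma omp flush
            #pragma omp atomic write
            flag = 1;
        }
        #pragma omp section
        {   // Consumer: spin on an atomic read of the flag.
            int ready = 0;
            while (!ready) {
                #pragma omp atomic read
                ready = flag;
            }
            // Flush before touching data so the producer's write is visible.
            #pragma omp flush
            result = f(data);
        }
    }
    return result;
}
```

Calling handoff() should give f(computeData()). The flush after the spin loop is what orders the read of data after the read of flag.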
[1] https://mitpress.mit.edu/books/openmp-common-core
I have a code that runs many iterations and only if a condition is met, the result of the iteration is saved. This is naturally expressed as a while loop. I am attempting to make the code run in parallel, since each realisation is independent. So I have this:
while(nit<avit){
#pragma omp parallel shared(nit,avit)
{
//do some stuff
if(condition){
#pragma omp critical
{
nit++;
//save results
}
}
}//implicit barrier here
}
and this works fine... but there is a barrier after each realization, which means that if the stuff I am doing inside the parallel block takes longer in one iteration than the others, all my threads are waiting for it to finish, instead of continuing with the next iteration.
Is there a way to avoid this barrier so that the threads keep working? I am averaging thousands of iterations, so a few more don't hurt (in case the nit variable has not been incremented in already running threads)...
I have tried to turn this into a parallel for, but the automatic increment in the for loop makes the nit variable go wild. This is my attempt:
#pragma omp parallel shared(nit,avit)
{
#pragma omp for
for(nit=0;nit<avit;nit++){
//do some stuff
if(condition){
//save results
} else {
#pragma omp critical
{
nit--;
}
}
}
}
and it keeps working and going around the for loop, as expected, but my nit variable takes unpredictable values... as one could expect from the increase and decrease of it by different threads at different times.
I have also tried leaving the increment in the for loop blank, but it doesn't compile, or trying to trick my code to have no increment in the for loop, like
...
incr=0;
for(nit=0;nit<avit;nit+=incr)
...
but then my code crashes...
Any ideas?
Thanks
Edit: Here's a working minimal example of the code on a while loop:
#include <random>
#include <vector>
#include <iostream>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <unistd.h>
using namespace std;
int main(){
int nit,dit,avit=100,t,j,tmax=100,jmax=10;
vector<double> Res(10),avRes(10);
nit=0; dit=0;
while(nit<avit){
#pragma omp parallel shared(tmax,nit,jmax,avRes,avit,dit) private(t,j) firstprivate(Res)
{
srand(int(time(NULL)) ^ omp_get_thread_num());
t=0; j=0;
while(t<tmax&&j<jmax){
Res[j]=rand() % 10;
t+=Res[j];
if(omp_get_thread_num()==5){
usleep(100000);
}
j++;
}
if(t<tmax){
#pragma omp critical
{
nit++;
for(j=0;j<jmax;j++){
avRes[j]+=Res[j];
}
for(j=0;j<jmax;j++){
cout<<avRes[j]/nit<<"\t";
}
cout<<" \t nit="<<nit<<"\t thread: "<<omp_get_thread_num();
cout<<endl;
}
} else{
#pragma omp critical
{
dit++;
cout<<"Discarded: "<<dit<<"\r"<<flush;
}
}
}
}
return 0;
}
I added the usleep part to simulate one thread taking longer than the others. If you run the program, all threads have to wait for thread 5 to finish before they start the next run. What I am trying to do is avoid precisely that wait, i.e. I'd like the other threads to pick up the next iteration without waiting for thread 5 to finish.
You can basically follow the same concept as for this question, with a slight variation to ensure that avRes is not written to in parallel:
int nit = 0;
#pragma omp parallel
while(1) {
int local_nit;
#pragma omp atomic read
local_nit = nit;
if (local_nit >= avit) {
break;
}
[...]
if (...) {
#pragma omp critical
{
#pragma omp atomic capture
local_nit = ++nit;
for(j=0;j<jmax;j++){
avRes[j] += Res[j];
}
for(j=0;j<jmax;j++){
// technically you could also use `nit` directly since
// now `nit` is only modified within this critical section
cout<<avRes[j]/local_nit<<"\t";
}
}
} else {
#pragma omp atomic update
dit++;
}
}
It also works with critical regions, but atomics are more efficient.
There's another thing you need to consider: rand() should not be used in parallel contexts. See this question. For C++, use a private (i.e. defined within the parallel region) random number generator from <random>.
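Putting both points together, a rough sketch could look like the following. The function name run_realisations, its parameters and the pair it returns are my own choices for illustration, not part of the question's code. Each thread builds its own std::mt19937 inside the parallel region, seeded from std::random_device.

```cpp
#include <random>
#include <utility>
#include <vector>

// Runs independent realisations until at least `avit` have been accepted.
// Returns {accepted, discarded}; accepted results are summed into avRes.
std::pair<int, int> run_realisations(int avit, int tmax, std::vector<double>& avRes) {
    const int jmax = static_cast<int>(avRes.size());
    int nit = 0;  // accepted realisations
    int dit = 0;  // discarded realisations

    #pragma omp parallel shared(nit, dit, avRes)
    {
        // Private generator: one per thread, defined inside the region.
        std::random_device seed_source;
        std::mt19937 gen(seed_source());
        std::uniform_int_distribution<int> digit(0, 9);
        std::vector<double> res(jmax);

        while (true) {
            int local_nit;
            #pragma omp atomic read
            local_nit = nit;
            if (local_nit >= avit) break;  // enough results, stop this thread

            int t = 0;
            for (int j = 0; j < jmax; ++j) {
                res[j] = digit(gen);
                t += static_cast<int>(res[j]);
            }

            if (t < tmax) {
                // Only one thread at a time may update avRes.
                #pragma omp critical
                {
                    #pragma omp atomic update
                    ++nit;
                    for (int j = 0; j < jmax; ++j) avRes[j] += res[j];
                }
            } else {
                #pragma omp atomic update
                ++dit;
            }
        }
    }
    return std::make_pair(nit, dit);
}
```

Because threads that are already in flight finish their realisation, the accepted count can end up slightly above avit, which matches the "a few more don't hurt" requirement in the question.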
I have a theoretical OpenMP question for you all.
Imagine I do the following:
#pragma omp parallel
{
#pragma omp single
{
while (!empty(linkedList)) {
#pragma omp task
doWork();
}
}
}
What happens if doWork() adds elements back into the list?
My worry is that the single thread that is spinning off the tasks will terminate before the threads doing the tasks can finish. This might mean that any elements that get added back onto the list by the doWork function are missed. Does anybody know how this works?
Thanks!
Just embed the generator loop into another loop and use taskwait in between to ensure that all tasks have finished executing. You must also ensure proper locking of the linked list in the concurrent parts of the code, e.g. by the use of critical sections (as shown below) or finer-grained locks.
void doWork(element e)
{
// ...
#pragma omp critical(listOps)
insertElement(linkedList, newElement);
// ...
}
#pragma omp parallel
{
#pragma omp single
{
do
{
#pragma omp critical(listOps)
while (!empty(linkedList)) {
element e = removeElement(linkedList);
#pragma omp task
doWork(e);
}
#pragma omp taskwait
} while (!empty(linkedList));
}
}
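If it helps, here is a small self-contained variant of the same idea that can be compiled and run. It is only a sketch under my own assumptions: the list is a std::deque<int>, and "work" on an item n > 0 means pushing n - 1 back onto the list. Unlike the outline above, the lock is held only while popping, so a task that the generating thread happens to run immediately never tries to re-enter the same named critical section.

```cpp
#include <deque>

// Drains a work list whose items may add new items while being processed.
// Returns the number of items processed in total.
int process_all(std::deque<int> work) {
    int processed = 0;

    #pragma omp parallel
    #pragma omp single
    {
        bool more = true;
        while (more) {
            for (;;) {
                int item = 0;
                bool have = false;
                // Pop under the lock, but create the task outside of it.
                #pragma omp critical(listOps)
                {
                    if (!work.empty()) {
                        item = work.front();
                        work.pop_front();
                        have = true;
                    }
                }
                if (!have) break;

                #pragma omp task firstprivate(item) shared(work, processed)
                {
                    // Hypothetical "work": count the item, maybe add a new one.
                    #pragma omp critical(listOps)
                    {
                        ++processed;
                        if (item > 0) work.push_back(item - 1);
                    }
                }
            }
            // Wait for all tasks generated so far, then re-check the list.
            #pragma omp taskwait
            more = !work.empty();
        }
    }
    return processed;
}
```

Starting from a list holding only 3, the items 3, 2, 1 and 0 are processed in turn, so the function returns 4. The outer loop is what catches elements that were pushed after the inner loop had already seen an empty list.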
I'm new to OpenMP and I am trying to use it in my source code. I have four functions and I would like each thread to run one of them. Here is my code:
int a,b,c,d;
omp_set_num_threads(4);
#pragma omp parallel
{
a=SetHist1(int (Convert_Mask0(mask)),1);
b=SetHist2(int (Convert_Mask45(mask)),1);
c=SetHist3(int (Convert_Mask90(mask)),1);
d=SetHist4(int (Convert_Mask135(mask)),1);
}
but this does not work for me.
You can use SECTIONS directives to run each SetHistX call on a different thread. You could also use TASK directives, depending on your needs.
Differences of use between sections and tasks are available here.
Using sections directives, your code would look something like this:
#pragma omp parallel sections
{
#pragma omp section
{
a=SetHist1(int (Convert_Mask0(mask)),1);
}
#pragma omp section
{
b=SetHist2(int (Convert_Mask45(mask)),1);
}
#pragma omp section
{
c=SetHist3(int (Convert_Mask90(mask)),1);
}
#pragma omp section
{
d=SetHist4(int (Convert_Mask135(mask)),1);
}
}
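Since SetHistX and Convert_MaskX are not shown, here is a compilable sketch of the same structure with a hypothetical stand-in function, histogramFor, that I made up for illustration. Each section writes to its own variable, so no extra synchronisation is needed.

```cpp
#include <array>

// Hypothetical stand-in for the SetHistX(Convert_MaskX(mask), 1) calls.
static int histogramFor(int angleDegrees) { return angleDegrees * 2; }

// Runs the four independent computations, one per section.
std::array<int, 4> computeAll() {
    int a = 0, b = 0, c = 0, d = 0;

    #pragma omp parallel sections
    {
        #pragma omp section
        a = histogramFor(0);

        #pragma omp section
        b = histogramFor(45);

        #pragma omp section
        c = histogramFor(90);

        #pragma omp section
        d = histogramFor(135);
    }
    // The implicit barrier at the end of sections guarantees all four are done.
    std::array<int, 4> result = {{a, b, c, d}};
    return result;
}
```

If fewer than four threads are available, some thread simply executes more than one section, so the results are the same either way.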
I have an OpenMP parallelized program that looks like that:
[...]
#pragma omp parallel
{
//initialize threads
#pragma omp for
for(...)
{
//Work is done here
}
}
Now I'm adding MPI support. What I will need is a thread that handles the communication, in my case, calls GatherAll all the time and fills/empties a linked list for receiving/sending data from the other processes. That thread should send/receive until a flag is set. So right now there is no MPI stuff in the example, my question is about the implementation of that routine in OpenMP.
How do I implement such a thread? For example, I tried to introduce a single directive here:
[...]
int kill=0;
#pragma omp parallel shared(kill)
{
//initialize threads
#pragma omp single nowait
{
while(!kill)
send_receive();
}
#pragma omp for
for(...)
{
//Work is done here
}
kill=1;
}
but in this case the program gets stuck because the implicit barrier after the for-loop waits for the thread in the while-loop above.
Thank you, rugermini.
You could try adding a nowait clause to your single construct:
EDIT: responding to the first comment
If you enable nested parallelism for OpenMP, you might be able to achieve what you want by making two levels of parallelism. In the top level, you have two concurrent parallel sections, one for the MPI communications, the other for local computation. This last section can itself be parallelized, which gives you a second level of parallelisation. Only threads executing this level will be affected by barriers in it.
#include <iostream>
#include <omp.h>
int main()
{
int kill = 0;
#pragma omp parallel sections
{
#pragma omp section
{
while (kill == 0){
/* manage MPI communications */
}
}
#pragma omp section
{
#pragma omp parallel
#pragma omp for
for (int i = 0; i < 10000 ; ++i) {
/* your workload */
}
kill = 1;
}
}
}
However, you must be aware that your code is going to break if you don't have at least two threads, which means you're breaking the assumption that the sequential and parallelized versions of the code should do the same thing.
It would be much cleaner to wrap your OpenMP kernel inside a more global MPI communication scheme (potentially using asynchronous communications to overlap communications with computations).
You have to be careful, because you can't just have your MPI calling thread "skip" the omp for loop; all threads in the thread team have to go through the for loop.
There are a couple of ways you could do this. With nested parallelism and tasks, you could launch one task to do the message passing and another to call a work routine which has an omp parallel for in it:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
void work(int rank) {
const int n=14;
#pragma omp parallel for
for (int i=0; i<n; i++) {
int tid = omp_get_thread_num();
printf("%d:%d working on item %d\n", rank, tid, i);
}
}
void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
const int tag=1;
MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
data, 1, MPI_INT, rneighbour, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
int main(int argc, char **argv) {
int rank, size;
int sneighbour;
int rneighbour;
int data;
int got;
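// The MPI call below runs inside a task, which any thread in the team may execute,
// so request MPI_THREAD_SERIALIZED rather than MPI_THREAD_FUNNELED.
MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &got);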
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
omp_set_nested(1);
sneighbour = rank + 1;
if (sneighbour >= size) sneighbour = 0;
rneighbour = rank - 1;
if (rneighbour <0 ) rneighbour = size-1;
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{
sendrecv(rank, sneighbour, rneighbour, &data);
printf("Got data from %d\n", data);
}
#pragma omp task
work(rank);
}
}
MPI_Finalize();
return 0;
}
Alternatively, you could make your omp for loop schedule(dynamic) so that the other threads can pick up the slack while the master thread is sending, and the master thread can pick up some work when it's done:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
const int tag=1;
MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
data, 1, MPI_INT, rneighbour, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
int main(int argc, char **argv) {
int rank, size;
int sneighbour;
int rneighbour;
int data;
int got;
const int n=14;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
omp_set_nested(1);
sneighbour = rank + 1;
if (sneighbour >= size) sneighbour = 0;
rneighbour = rank - 1;
if (rneighbour <0 ) rneighbour = size-1;
#pragma omp parallel
{
#pragma omp master
{
sendrecv(rank, sneighbour, rneighbour, &data);
printf("Got data from %d\n", data);
}
#pragma omp for schedule(dynamic)
for (int i=0; i<n; i++) {
int tid = omp_get_thread_num();
printf("%d:%d working on item %d\n", rank, tid, i);
}
}
MPI_Finalize();
return 0;
}
Hmmm. If you are indeed adding MPI 'support' to your program, then you ought to be using mpi_allgather, as mpi_gatherall does not exist. Note that mpi_allgather is a collective operation, that is, all processes in the communicator must call it. You can't have one process gathering data while the other processes do whatever it is they do. What you could do is use MPI one-sided communications to implement your idea; this will be a little tricky, but no more than that if one process only reads the memory of other processes.
I'm puzzled by your use of the term 'thread' wrt MPI. I fear that you are confusing OpenMP and MPI, one of whose implementations is called Open MPI. Despite this name it is as different from OpenMP as chalk from cheese. MPI programs are written in terms of processes, not threads. The typical OpenMP implementation does indeed use threads, though the details are generally well-hidden from the programmer.
I'm seriously impressed that you are trying, or seem to be trying, to use MPI 'inside' your OpenMP code. This is exactly the opposite of work I do, and see others do on some seriously large computers. The standard mode for such 'hybrid' parallelisation is to write MPI programs which call OpenMP code. Many of today's very large computers comprise collections of what are, in effect, multicore boxes. A typical approach to programming one of these is to have one MPI process running on each box, and for each of those processes to use one OpenMP thread for each core in the box.