In the consumer branch of the code snippet below, a flush is used to observe any changes to flag that might have happened since the previous read, but the data variable is not flushed prior to invoking f().
Q1: Should a flush be added to the consumer before invoking f()?
Q2: Does the answer change if you assume that data is not in the L1 cache of the consumer thread before invoking f()?
#pragma omp parallel shared(data, flag)
{
    if (omp_get_thread_num() == 0) { // Producer.
        // Write to data and make visible to other thread.
        data = computeData();
        #pragma omp flush (data)
        // Write to flag and make visible to other thread.
        flag = 1;
        #pragma omp flush (flag)
    }
    if (omp_get_thread_num() == 1) { // Consumer.
        while (flag == 0) {
            #pragma omp flush (flag)
            ; // No-op, flush reloads.
        }
        f(data); // Do something with data.
    }
}
It seems that the code snippet with the race condition used in my class was copied from other sources. I am concurrently reading The OpenMP Common Core [1] and found a race-free equivalent using atomic, as recommended by @MichaelKlemm. I modified my original snippet based on Figure 11.5 in [1].
#pragma omp parallel shared(data, flag)
{
    int temp = 0;
    if (omp_get_thread_num() == 0) { // Producer.
        // Write to data and make visible to other thread.
        data = computeData();
        #pragma omp flush
        // Write to flag with atomic results in implicit flush.
        #pragma omp atomic write
        flag = 1;
    }
    if (omp_get_thread_num() == 1) { // Consumer.
        while (!temp) {
            #pragma omp atomic read
            temp = flag; // Read into temp in case flag changes.
        }
        #pragma omp flush
        f(data); // Do something with data.
    }
}
[1] https://mitpress.mit.edu/books/openmp-common-core
In my experience, when I update a variable in one task, the variable is not updated in other tasks, even if the task that updated it has finished executing. For example, given the code:
int nThreads = atoi(argv[1]);
omp_set_num_threads(nThreads);
int currentInt = 0;
int numEdges = 1000000;
#pragma omp parallel shared(currentInt)
{
    #pragma omp single
    {
        #pragma omp task shared(currentInt)
        {
            printf("I am doing kruskals: Thread %d\n", omp_get_thread_num());
            while(currentInt < numEdges)
            {
                currentInt++;
            }
            printf("Kruskals Done! %d\n", currentInt);
            #pragma omp shared(currentInt)
            {
                for(int i = 0; i < 10000000; i++){
                }
                printf("Helper: Current Int %d Thread %d \n", currentInt, omp_get_thread_num());
            }
        }
        #pragma omp taskwait
    }
}
It always prints currentInt as 0, even if the first task finishes before the second. I need this because I am trying to parallelize an algorithm where I have a sequential task going through a large array and many parallel tasks executing simultaneously on parts of that array. Once the sequential task reaches the portion of the array that a parallel task is working on, the parallel task can stop itself because it is no longer needed. The parallel and sequential tasks share no dependencies, so that is not a problem.
Any help will be appreciated.
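One way to sketch the pattern described above (a sequential task advancing through the array while helper tasks stop once it reaches their chunk) is to publish the sequential task's position through a shared counter that is written and read atomically. The names run_with_helpers, NUM_HELPERS and stopped_early, as well as the equal-sized chunk layout, are hypothetical. The sketch only illustrates the idea and does not claim to explain the output of the snippet above.

```c
#define NUM_HELPERS 4  /* hypothetical number of helper tasks */

/* Runs one sequential task plus NUM_HELPERS helper tasks.
 * Returns the final position of the sequential task.
 * stopped_early[h] is set to 1 if helper h saw the sequential task
 * move past the start of its chunk before the helper finished. */
int run_with_helpers(int num_edges, int stopped_early[NUM_HELPERS])
{
    int progress = 0;  /* shared: number of elements the sequential task has passed */

    #pragma omp parallel shared(progress)
    #pragma omp single
    {
        /* Sequential task: publishes its position with an atomic write. */
        #pragma omp task shared(progress)
        {
            for (int i = 1; i <= num_edges; i++) {
                #pragma omp atomic write
                progress = i;
            }
        }

        /* Helper tasks: each owns one chunk and polls the shared position. */
        for (int h = 0; h < NUM_HELPERS; h++) {
            #pragma omp task shared(progress, stopped_early) firstprivate(h)
            {
                int chunk_len   = num_edges / NUM_HELPERS;
                int chunk_start = h * chunk_len;
                stopped_early[h] = 0;
                for (int j = chunk_start; j < chunk_start + chunk_len; j++) {
                    int seen;
                    #pragma omp atomic read
                    seen = progress;
                    if (seen > chunk_start) {  /* sequential task has reached this chunk */
                        stopped_early[h] = 1;
                        break;
                    }
                    /* ... helper work on element j would go here ... */
                }
            }
        }
        #pragma omp taskwait
    }
    return progress;
}
```

No helper ever waits for the sequential task, so the sketch also terminates when only one thread is available or when the pragmas are ignored. Which helpers stop early depends on scheduling.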
I have a code that runs many iterations and only if a condition is met, the result of the iteration is saved. This is naturally expressed as a while loop. I am attempting to make the code run in parallel, since each realisation is independent. So I have this:
while(nit<avit){
    #pragma omp parallel shared(nit,avit)
    {
        //do some stuff
        if(condition){
            #pragma omp critical
            {
                nit++;
                //save results
            }
        }
    }//implicit barrier here
}
and this works fine... but there is a barrier after each realization, which means that if the stuff I am doing inside the parallel block takes longer in one iteration than the others, all my threads are waiting for it to finish, instead of continuing with the next iteration.
Is there a way to avoid this barrier so that the threads keep working? I am averaging thousands of iterations, so a few more don't hurt (in case the nit variable has not been incremented in already running threads)...
I have tried to turn this into a parallel for, but the automatic increment in the for loop makes the nit variable go wild. This is my attempt:
#pragma omp parallel shared(nit,avit)
{
    #pragma omp for
    for(nit=0;nit<avit;nit++){
        //do some stuff
        if(condition){
            //save results
        } else {
            #pragma omp critical
            {
                nit--;
            }
        }
    }
}
and it keeps working and going around the for loop, as expected, but my nit variable takes unpredictable values... as one could expect from the increase and decrease of it by different threads at different times.
I have also tried leaving the increment in the for loop blank, but it doesn't compile, or trying to trick my code to have no increment in the for loop, like
...
incr=0;
for(nit=0;nit<avit;nit+=incr)
...
but then my code crashes...
Any ideas?
Thanks
Edit: Here's a working minimal example of the code on a while loop:
#include <random>
#include <vector>
#include <iostream>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <unistd.h>
using namespace std;
int main(){
int nit,dit,avit=100,t,j,tmax=100,jmax=10;
vector<double> Res(10),avRes(10);
nit=0; dit=0;
while(nit<avit){
#pragma omp parallel shared(tmax,nit,jmax,avRes,avit,dit) private(t,j) firstprivate(Res)
{
srand(int(time(NULL)) ^ omp_get_thread_num());
t=0; j=0;
while(t<tmax&&j<jmax){
Res[j]=rand() % 10;
t+=Res[j];
if(omp_get_thread_num()==5){
usleep(100000);
}
j++;
}
if(t<tmax){
#pragma omp critical
{
nit++;
for(j=0;j<jmax;j++){
avRes[j]+=Res[j];
}
for(j=0;j<jmax;j++){
cout<<avRes[j]/nit<<"\t";
}
cout<<" \t nit="<<nit<<"\t thread: "<<omp_get_thread_num();
cout<<endl;
}
} else{
#pragma omp critical
{
dit++;
cout<<"Discarded: "<<dit<<"\r"<<flush;
}
}
}
}
return 0;
}
I added the usleep part to simulate one thread taking longer than the others. If you run the program, all threads have to wait for thread 5 to finish, and only then do they start the next run. What I am trying to do is precisely to avoid such a wait, i.e. I'd like the other threads to pick up the next iteration without waiting for thread 5 to finish.
You can basically follow the same concept as for this question, with a slight variation to ensure that avRes is not written to in parallel:
int nit = 0;
#pragma omp parallel
while(1) {
    int local_nit;
    #pragma omp atomic read
    local_nit = nit;
    if (local_nit >= avit) {
        break;
    }
    [...]
    if (...) {
        #pragma omp critical
        {
            #pragma omp atomic capture
            local_nit = ++nit;
            for(j=0;j<jmax;j++){
                avRes[j] += Res[j];
            }
            for(j=0;j<jmax;j++){
                // technically you could also use `nit` directly since
                // now `nit` is only modified within this critical section
                cout<<avRes[j]/local_nit<<"\t";
            }
        }
    } else {
        #pragma omp atomic update
        dit++;
    }
}
It also works with critical regions, but atomics are more efficient.
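For reference, here is a small self-contained sketch of the same loop structure. The acceptance test attempt_is_accepted and the shared attempts counter are hypothetical stand-ins for the real per-iteration work, and because nothing is accumulated into a shared array here, the accept branch uses a plain atomic update. With a shared array such as avRes you would keep the critical section as shown above.

```c
/* Hypothetical stand-in for one realisation: every third attempt is discarded. */
static int attempt_is_accepted(int attempt)
{
    return attempt % 3 != 0;
}

/* Keeps all threads busy until at least avit results have been accepted.
 * Returns the number of accepted results and stores the number of
 * discarded ones in *discarded_out. */
int collect_accepted(int avit, int *discarded_out)
{
    int nit = 0;       /* accepted results */
    int dit = 0;       /* discarded results */
    int attempts = 0;  /* hands out a unique attempt number to each iteration */

    #pragma omp parallel
    while (1) {
        int local_nit;
        #pragma omp atomic read
        local_nit = nit;
        if (local_nit >= avit) {
            break;  /* enough results: this thread leaves the loop */
        }

        int my_attempt;
        #pragma omp atomic capture
        my_attempt = attempts++;

        if (attempt_is_accepted(my_attempt)) {
            #pragma omp atomic update
            nit++;
        } else {
            #pragma omp atomic update
            dit++;
        }
    }

    *discarded_out = dit;
    return nit;
}
```

Threads that are already inside an iteration when the target is reached still finish it, so the returned count can exceed avit by a few results, which matches the tolerance mentioned in the question.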
There's another thing you need to consider, rand() should not be used in parallel contexts. See this question. For C++, use a private (i.e. defined within the parallel region) random number generator from <random>.
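If you stay in plain C rather than using <random>, one possible substitute is a small generator whose state lives inside the parallel region, so no state is shared between threads. The sketch below uses a hand-written xorshift32 step. The helpers xorshift32 and fill_random_digits are hypothetical, not library functions, and the seeding scheme is a simple assumption that has not been checked for statistical quality.

```c
#include <stdint.h>

/* One step of a 32-bit xorshift generator; *state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills out[0..n-1] with values in 0..9, using one generator state per thread. */
void fill_random_digits(int *out, int n, uint32_t base_seed)
{
    unsigned next_stream = 0;  /* shared counter used to give each thread its own seed */

    #pragma omp parallel
    {
        unsigned my_stream;
        #pragma omp atomic capture
        my_stream = next_stream++;

        /* Thread-private state, declared inside the parallel region. */
        uint32_t state = base_seed ^ (0x9E3779B9u * (my_stream + 1u));
        if (state == 0) {
            state = 1u;  /* xorshift must not start from zero */
        }

        #pragma omp for
        for (int i = 0; i < n; i++) {
            out[i] = (int)(xorshift32(&state) % 10u);
        }
    }
}
```

Because the seed is derived from a counter rather than from the thread number, the sketch needs no OpenMP runtime calls and also compiles when the pragmas are ignored.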
I have a theoretical OpenMP question for you all.
Imagine I do the following:
#pragma omp parallel
{
    #pragma omp single
    {
        while (!empty(linkedList)) {
            #pragma omp task
            doWork();
        }
    }
}
What happens if doWork() adds elements back into the list?
My worry is that the single thread that is spinning off the tasks will terminate before the threads doing the tasks can finish. This might mean that any elements that get added back onto the list by the doWork function are missed. Does anybody know how this works?
Thanks!
Just embed the generator loop into another loop and use taskwait in between to ensure that all tasks have finished executing. You must also ensure proper locking of the linked list in the concurrent parts of the code, e.g. by the use of critical sections (as shown below) or finer-grained locks.
void doWork(element e)
{
    // ...
    #pragma omp critical(listOps)
    insertElement(linkedList, newElement);
    // ...
}
#pragma omp parallel
{
    #pragma omp single
    {
        do
        {
            #pragma omp critical(listOps)
            while (!empty(linkedList)) {
                element e = removeElement(linkedList);
                #pragma omp task
                doWork(e);
            }
            #pragma omp taskwait
        } while (!empty(linkedList));
    }
}
I have a multi-threaded process where a file is shared (read and written) by multiple threads. Is there any way a thread can lock one file segment so that other threads cannot access it?
I have tried fcntl(fd, F_SETLKW, &flock), but this lock only works between processes, not threads (a lock is shared between all threads in a process).
Yes - but not with the same mechanism. You'll have to use something like pthread mutexes, and keep track of the bookkeeping yourself.
Possible outline for how to make this work:
1. Wait on and claim a process-level mutex over a bookkeeping structure.
   - Make sure no other threads within your process are trying to use that segment.
   - Mark yourself as using the file segment.
2. Release the process-level mutex.
3. Grab the fcntl lock for the process (if necessary).
4. Do your writing.
5. Release the fcntl lock to allow other processes to use the segment (if necessary).
6. Wait again on the process-level bookkeeping structure mutex (may not be necessary if you can mark it unused atomically).
   - Mark the segment as unused within your process.
7. Release the process-level mutex.
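The outline above could be sketched roughly as follows. The fixed segment layout (SEG_COUNT, SEG_SIZE), the seg_in_use table and the function names are hypothetical choices made for illustration. Only pthread_mutex_*, pthread_cond_*, fcntl with F_SETLKW, and pwrite are real APIs here. Treat it as a starting point under those assumptions rather than a finished implementation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for pwrite() under strict -std=c11 */
#endif
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_COUNT 16    /* hypothetical: the file is split into fixed-size segments */
#define SEG_SIZE  4096

/* Process-level bookkeeping: which segments are in use by a thread of this process. */
static pthread_mutex_t book_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  seg_freed  = PTHREAD_COND_INITIALIZER;
static bool seg_in_use[SEG_COUNT];

/* Sets or clears the inter-process fcntl lock on one segment (blocks until granted). */
static int set_process_lock(int fd, short type, int seg)
{
    struct flock fl = {0};
    fl.l_type   = type;  /* F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start  = (off_t)seg * SEG_SIZE;
    fl.l_len    = SEG_SIZE;
    return fcntl(fd, F_SETLKW, &fl);
}

/* Writes len bytes at the start of segment seg; returns bytes written or -1. */
ssize_t write_segment(int fd, int seg, const void *buf, size_t len)
{
    if (seg < 0 || seg >= SEG_COUNT || len > SEG_SIZE)
        return -1;

    /* Steps 1-2: claim the segment among the threads of this process. */
    pthread_mutex_lock(&book_mutex);
    while (seg_in_use[seg])
        pthread_cond_wait(&seg_freed, &book_mutex);
    seg_in_use[seg] = true;
    pthread_mutex_unlock(&book_mutex);

    /* Steps 3-5: exclude other processes, write, release the fcntl lock. */
    ssize_t written = -1;
    if (set_process_lock(fd, F_WRLCK, seg) == 0) {
        written = pwrite(fd, buf, len, (off_t)seg * SEG_SIZE);
        set_process_lock(fd, F_UNLCK, seg);
    }

    /* Steps 6-7: mark the segment unused and wake any waiting threads. */
    pthread_mutex_lock(&book_mutex);
    seg_in_use[seg] = false;
    pthread_cond_broadcast(&seg_freed);
    pthread_mutex_unlock(&book_mutex);
    return written;
}
```

The fcntl lock is taken only after the thread has claimed the segment in the bookkeeping table, so two threads of the same process never race on the process-wide lock for the same segment.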
No. The region-locking feature you're asking about has surprising semantics and it is not being further developed because it is controlled by POSIX. (In fact, it is Kirk McKusick's preferred example of what's wrong with POSIX.) If there is a non-POSIX byte-range lock facility in Linux, I can't find it.
There is discussion of the problems of POSIX byte-range locking in a multithreaded world here: http://www.samba.org/samba/news/articles/low_point/tale_two_stds_os2.html.
However, if you're concerned only with threads within one process, you can build your own region-locking using a mutex and a condition variable. For example:
#include <stdbool.h>
#include <pthread.h>
#include <sys/types.h>
// A record indicating an active lock.
struct threadlock {
int fd; // or -1 for unused entries.
off_t start;
off_t length;
};
// A table of all active locks (and the unused entries).
static struct threadlock all_locks[100];
// Mutex housekeeping.
static pthread_mutex_t mutex;
static pthread_cond_t some_lock_released;
static pthread_once_t once_control = PTHREAD_ONCE_INIT;
static void threadlock_init(void) {
for (int i = 0; i < sizeof(all_locks)/sizeof(all_locks[0]); ++i)
all_locks[i].fd = -1;
pthread_mutex_init(&mutex, (pthread_mutexattr_t *)0);
pthread_cond_init(&some_lock_released, (pthread_condattr_t *)0);
}
// True iff the given region overlaps one that is already locked.
static bool region_overlaps_lock(int fd, off_t start, off_t length) {
for (int i = 0; i < sizeof(all_locks)/sizeof(all_locks[0]); ++i) {
const struct threadlock *t = &all_locks[i];
if (t->fd == fd &&
t->start < start + length &&
start < t->start + t->length)
return true;
}
return false;
}
// Returns a pointer to an unused entry, or NULL if there isn't one.
static struct threadlock *find_unused_entry(void) {
for (int i = 0; i < sizeof(all_locks)/sizeof(all_locks[0]); ++i) {
if (-1 == all_locks[i].fd)
return &all_locks[i];
}
return 0;
}
// True iff the lock table is full.
static inline bool too_many_locks(void) {
return 0 == find_unused_entry();
}
// Wait until no thread has a lock for the given region
// [start, start+length) of the given file descriptor, and then lock
// the region. Keep the return value for threadunlock.
// Warning: if you open two file descriptors on the same file
// (including hard links to the same file), this function will fail
// to notice that they're the same file, and it will happily hand out
// two locks for the same region.
struct threadlock *threadlock(int fd, off_t start, off_t length) {
pthread_once(&once_control, &threadlock_init);
pthread_mutex_lock(&mutex);
while (region_overlaps_lock(fd, start, length) || too_many_locks())
pthread_cond_wait(&some_lock_released, &mutex);
struct threadlock *newlock = find_unused_entry();
newlock->fd = fd;
newlock->start = start;
newlock->length = length;
pthread_mutex_unlock(&mutex);
return newlock;
}
// Unlocks a region locked by threadlock.
void threadunlock(struct threadlock *what_threadlock_returned) {
pthread_mutex_lock(&mutex);
what_threadlock_returned->fd = -1;
pthread_cond_broadcast(&some_lock_released);
pthread_mutex_unlock(&mutex);
}
Caution: the code compiles but I haven't tested it even a little.
If you don't need file locks between different processes, avoid the file locks (which are one of the worst designed parts of the POSIX API) and just use mutexes or other shared memory concurrency primitives.
There are two ways you can do it:
1. Use a mutex to lock a record from a thread within the same process. Once the lock is acquired, any other thread in the process that maps the file and tries to acquire the lock blocks until the lock is released. (This is the preferable and most straightforward solution available on Linux.)
2. Use semaphores and mutexes in shared memory or in a memory-mapped file.
I have an OpenMP-parallelized program that looks like this:
[...]
#pragma omp parallel
{
    //initialize threads
    #pragma omp for
    for(...)
    {
        //Work is done here
    }
}
Now I'm adding MPI support. What I will need is a thread that handles the communication, in my case, calls GatherAll all the time and fills/empties a linked list for receiving/sending data from the other processes. That thread should send/receive until a flag is set. So right now there is no MPI stuff in the example, my question is about the implementation of that routine in OpenMP.
How do I implement such a thread? For example, I tried to introduce a single directive here:
[...]
int kill=0;
#pragma omp parallel shared(kill)
{
    //initialize threads
    #pragma omp single nowait
    {
        while(!kill)
            send_receive();
    }
    #pragma omp for
    for(...)
    {
        //Work is done here
    }
    kill=1;
}
but in this case the program gets stuck because the implicit barrier after the for-loop waits for the thread in the while-loop above.
Thank you, rugermini.
You could try adding a nowait clause to your single construct:
EDIT: responding to the first comment
If you enable nested parallelism for OpenMP, you might be able to achieve what you want by making two levels of parallelism. In the top level, you have two concurrent parallel sections, one for the MPI communications, the other for local computation. This last section can itself be parallelized, which gives you a second level of parallelisation. Only threads executing this level will be affected by barriers in it.
#include <iostream>
#include <omp.h>
int main()
{
    int kill = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            while (kill == 0){
                /* manage MPI communications */
            }
        }
        #pragma omp section
        {
            #pragma omp parallel
            #pragma omp for
            for (int i = 0; i < 10000 ; ++i) {
                /* your workload */
            }
            kill = 1;
        }
    }
}
However, you must be aware that your code is going to break if you don't have at least two threads, which means you're breaking the assumption that the sequential and parallelized versions of the code should do the same thing.
It would be much cleaner to wrap your OpenMP kernel inside a more global MPI communication scheme (potentially using asynchronous communications to overlap communications with computations).
You have to be careful, because you can't just have your MPI calling thread "skip" the omp for loop; all threads in the thread team have to go through the for loop.
There are a couple of ways you could do this. With nested parallelism and tasks, you could launch one task to do the message passing and another to call a work routine which has an omp parallel for in it:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
void work(int rank) {
const int n=14;
#pragma omp parallel for
for (int i=0; i<n; i++) {
int tid = omp_get_thread_num();
printf("%d:%d working on item %d\n", rank, tid, i);
}
}
void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
const int tag=1;
MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
data, 1, MPI_INT, rneighbour, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
int main(int argc, char **argv) {
int rank, size;
int sneighbour;
int rneighbour;
int data;
int got;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
omp_set_nested(1);
sneighbour = rank + 1;
if (sneighbour >= size) sneighbour = 0;
rneighbour = rank - 1;
if (rneighbour <0 ) rneighbour = size-1;
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{
sendrecv(rank, sneighbour, rneighbour, &data);
printf("Got data from %d\n", data);
}
#pragma omp task
work(rank);
}
}
MPI_Finalize();
return 0;
}
Alternatively, you could make your omp for loop schedule(dynamic) so that the other threads can pick up some of the slack while the master thread is sending, and the master thread can pick up some work when it's done:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
const int tag=1;
MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
data, 1, MPI_INT, rneighbour, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
int main(int argc, char **argv) {
int rank, size;
int sneighbour;
int rneighbour;
int data;
int got;
const int n=14;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
omp_set_nested(1);
sneighbour = rank + 1;
if (sneighbour >= size) sneighbour = 0;
rneighbour = rank - 1;
if (rneighbour <0 ) rneighbour = size-1;
#pragma omp parallel
{
#pragma omp master
{
sendrecv(rank, sneighbour, rneighbour, &data);
printf("Got data from %d\n", data);
}
#pragma omp for schedule(dynamic)
for (int i=0; i<n; i++) {
int tid = omp_get_thread_num();
printf("%d:%d working on item %d\n", rank, tid, i);
}
}
MPI_Finalize();
return 0;
}
Hmmm. If you are indeed adding MPI 'support' to your program, then you ought to be using mpi_allgather, as mpi_gatherall does not exist. Note that mpi_allgather is a collective operation, that is, all processes in the communicator call it. You can't have one process gathering data while the other processes do whatever it is they do. What you could do is use MPI one-sided communications to implement your idea; this will be a little tricky, but no more than that if one process only reads the memory of other processes.
I'm puzzled by your use of the term 'thread' wrt MPI. I fear that you are confusing OpenMP and MPI, one of whose implementations is called Open MPI. Despite this name it is as different from OpenMP as chalk from cheese. MPI programs are written in terms of processes, not threads. The typical OpenMP implementation does indeed use threads, though the details are generally well hidden from the programmer.
I'm seriously impressed that you are trying, or seem to be trying, to use MPI 'inside' your OpenMP code. This is exactly the opposite of work I do, and see others do on some seriously large computers. The standard mode for such 'hybrid' parallelisation is to write MPI programs which call OpenMP code. Many of today's very large computers comprise collections of what are, in effect, multicore boxes. A typical approach to programming one of these is to have one MPI process running on each box, and for each of those processes to use one OpenMP thread for each core in the box.