Why does OpenMP hang on barrier within a while cycle? - multithreading

I've got the following code:
int working_threads=1;
#pragma omp parallel
{
int my_num=omp_get_thread_num()+1;
int idle=false;
while(working_threads>0) {
if(my_num==1)
working_threads=0;
#pragma omp barrier
}
}
If I run it, it every now and then hangs on the barrier. The more threads, the more likely this is to happen. I've tried to debug it with printf and it seems that sometimes not all threads are executed and thus the barrier waits for them forever. This happens in the first iteration, the second one is obviously never run.
Is it an invalid piece of code? If so, how can I change it? I need to run a while loop in parallel. It is not known how many loops will be executed before, but it is guaranteed that all threads will have the same number of iterations.

Despite your attempt to synchronize with the barrier, you still have a race condition on working_threads that can easily lead to threads executing a different number of iterations:
thread 0 | thread 1
... | ...
while (working_threads > 0) [==true] | ...
if (my_num == 1) [==true] | ...
working_threads = 0 | ...
| while (working_threads > 0) [==false]
[hangs waiting for barrier] | [hangs trying to exit from parallel]
To fix your specific code, you would have to also add a barrier between the while-condition-check and working_threads = 0.
#pragma omp parallel
{
int my_num=omp_get_thread_num()+1;
int idle=false;
while(working_threads>0) {
#pragma omp barrier
if(my_num==1)
working_threads=0;
#pragma omp barrier
}
}
Note that this code is neither the most idiomatic nor the most elegant solution. Depending on your specific work-sharing problem, there may be a better approach. You must also ensure that working_threads is written by a single thread only, or use atomics when writing.

Related

OpenMP task firstprivate

I have a question regarding the OpenMP task pragma, if we suppose the following code:
#pragma omp parallel
{
x = omp_get_thread_num();
#pragma omp task firstprivate(x)
//do something with x
}
As far as I understand tasking, it is not guaranteed which thread executes the task.
So my question is, is "x" in the task now the id of thread generated the task or the one who executes it?
e.g. if thread 0 comes across the task, and thread 3 executes it: x should be 0 then, right?
So my question is, is "x" in the task now the id of thread generated
the task or the one who executes it?
It depends. If the default data-sharing attribute of the parallel region is shared (which it typically is), then:
'x' can be equal to any thread ID ranging from 0 to the total number of threads in the team - 1. This is because there is a race condition during the update of the variable 'x'.
This can be show-cased with the following code:
#include <omp.h>
#include <stdio.h>
#include <unistd.h> /* sleep() */
int main(){
    int x; /* shared by default: every thread writes to the same variable */
    #pragma omp parallel
    {
        x = omp_get_thread_num();
        if(omp_get_thread_num() == 1){
            sleep(5);
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
So the thread with ID=1 creates the task, however, 'x' can have different values than '1' and also different values than the thread currently executing the task. This is because while the thread with ID=1, is waiting during sleep(5);, the remaining threads in the team can update the value of 'x'.
Typically, the canonical form in such use-cases would be to use a single pragma wrapping around the task creation as follows:
#include <omp.h>
#include <stdio.h>
int main(){
int x;
#pragma omp parallel
{
#pragma omp single
{
printf("I am the task creator '%d'\n", omp_get_thread_num());
x = omp_get_thread_num();
#pragma omp task firstprivate(x)
{
printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
}
}
}
return 0;
}
And in this case, as @Michael Klemm mentioned in the comments:
..., x will contain the ID of the thread that created the task. So, yes, if thread 0 created the task, x will be zero even though thread 3 is picked to execute the task.
This also applies when the variable 'x' is private at the point where the statement x = omp_get_thread_num(); is executed.
Therefore, if you run the code above you should always get I am the task creator with the same value as Value of x =, but you can get a different value in ID Thread executing. For example:
I am the task creator '4'
Value of x = 4 | ID Thread executing = 7
This is in accordance with the behaviour specified in the OpenMP standard, namely:
The task construct is a task generating construct. When a thread encounters a task construct, an explicit task is generated from the code for the associated structured block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct, per-data environment ICVs, and any defaults that apply.
The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task.

Thread scheduling/Data race conditions

In class, we are studying threads and race conditions. By my estimates, it should be possible for the code below to output the value 8 or 9, as thread 1 could be interrupted by thread 2 after the value has been decremented in the eax register but before the counter in memory is updated.
#include <pthread.h>
#include <stdio.h>
int counter = 10;
void *worker(void *arg) {
    counter--;
    return NULL;
}
int main(int argc, char *argv[]) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, worker, NULL);
    pthread_create(&p2, NULL, worker, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("%d\n", counter);
    return 0;
}
However, when I run the code, I always receive the output 8. Is it a mechanism of the compiler that normalizes the output, or is it only possible for the code to output 8 (no race condition is created)?
There's no way for us to tell without knowing lots of complicated details about your platform, compiler, maybe even CPU. The code has a race condition in theory but it may be exceptionally difficult, maybe even impossible, to trigger.
Of course, if you upgrade your compiler or CPU, change compilation options, upgrade your OS, or do any number of other things, it may start behaving differently.
This is one of the reasons race conditions can be so insidious. They can be impossible to trigger under some conditions and then suddenly start happening all the time when some change is made elsewhere.
The code definitely has a race condition.
I don't find it surprising that you're seeing consistent results--starting a thread takes a little while, so there's a good chance that in your case, the first thread finishes before the second gets started.
Nonetheless, the code clearly has undefined behavior, because there's no question it has a race condition.
There definitely is a race condition. The reason you're not seeing it is that the decrement happens so fast compared to the time it takes to start a thread that the first thread is likely to be done before the second thread even starts. You'll see the race condition if you make the amount of work large enough that the first thread is still running when the second one starts.
example: modify the worker function to decrement in a loop
int counter = 1000000000;
void* worker(void *arg)
{
for (int i = 0; i < 500000000; ++i)
--counter;
return NULL;
}
Since counter starts at 1 billion, and you're running 2 threads that each decrement counter by 500 million, you would expect counter to be 0 when you are done if race conditions didn't exist.

Proper / Efficient parallelization of a for loop with OpenMP

I have developed a distributed-memory MPI application which involves processing a grid. Now I want to apply shared-memory techniques with OpenMP (essentially making it a hybrid parallel program) to see if it can become any faster or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application prints the grid to the screen every half a second, but when I parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see, I have tried many approaches and each one is worse than the last. Best case, I get semi-proper output every 2 seconds (because it refreshes very slowly). Worst case, I get garbage output.
I would appreciate any help. Is there a place where I can find more information on how to properly parallelize loops with OpenMP? Thanks in advance.
The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else you get a garbage result, as you experienced (since the elements of your grid are printed in an order that means nothing). Your attempts at parallelizing were actually pretty good; the problem is that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize work that is intrinsically sequential, you lose more time synchronizing your workers (your OpenMP threads) than you gain by parallelizing the work itself (which is why your timings get worse as your output gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example:
Here is a very basic example of parallelizing a program with OpenMP that achieves speedup. A dummy (yet heavy) computation is performed for each value of the i variable. The computations in each iteration are completely independent, so the different threads can carry them out independently. The calls to printf can happen in any order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
    int i,j;
    double x=0;
    /* j and x must be private: otherwise all threads of the outer loop
       would write to the same two variables at once (a data race) */
    #pragma omp parallel for private(j, x)
    for(i=0; i < 100; i++)
    {
        /* Dummy heavy computation */
        x = 100000 * fabs(cos(i*i));
        #pragma omp parallel for reduction(+: x)
        for(j=0;j<100+i*20000;j++)
            x += sqrt(sqrt(543*j)*fabs(sin(j)));
        printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
    }
    return 0;
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/

OpenMP: for loop with changing number of iterations

I would like to use OpenMP to make my program run faster. Unfortunately, the opposite is the case. My code looks something like this:
const int max_iterations = 10000;
int num_interation = std::numeric_limits<int>::max();
#pragma omp parallel for
for(int i = 0; i < std::min(num_interation, max_iterations); i++)
{
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
num_interation = update_iterations(...);
}
For some reason, many more iterations are processed than required. Without OpenMP, it takes 500 iterations on average. However, even when setting the number of threads to one (set_num_threads(1)), it computes more than one thousand iterations. The same happens if I use multiple threads, and also when using a write lock when updating num_interation.
I would assume that it has something to do with memory bandwidth or a race condition. But those problems should not appear in the case of set_num_threads(1).
Therefore, I assume that it could have something to do with the scheduling and the chunk size. However, I am really not sure about this.
Can somebody give me a hint?
A quick answer for the behaviour you experience is given by the OpenMP standard page 56:
The iteration count for each associated loop is computed before entry
to the outermost loop. If execution of any associated loop changes any
of the values used to compute any of the iteration counts, then the
behavior is unspecified.
In essence, this means that you cannot modify the bounds of your loop once you have entered it. Although according to the standard the behaviour is "unspecified", in your case what happens is quite clear: as soon as you switch OpenMP on in your code, you compute the number of iterations you had specified initially.
So you have to take another approach to this problem.
This is a possible solution (amongst many others) which I hope scales OK. It has the drawback of potentially allowing more iterations to happen than the number you intended (up to OMP_NUM_THREADS-1 more iterations than expected, assuming that //do sth. is balanced, and many more if not). Also, it assumes that update_iterations(...) is thread safe and can be called in parallel without unwanted side effects. This is a very strong assumption which you'd better enforce!
num_interation = std::min(num_interation, max_iterations);
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( i < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(...);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
}
}
A more synchronised solution, if the //do sth. isn't so balanced and not doing too many extra iterations is important, could be:
num_interation = std::min(num_interation, max_iterations);
int nb_it_done = 0;
#pragma omp parallel
{
    int i = omp_get_thread_num();
    const int nbth = omp_get_num_threads();
    while ( nb_it_done < num_interation ) {
        // make sure every thread has evaluated the loop condition
        // before any thread updates num_interation
        #pragma omp barrier
        // do sth.
        // update the number of required iterations
        // num_interation can only become smaller over time
        int new_num_interation = update_iterations(i);
        #pragma omp critical
        num_interation = std::min(num_interation, new_num_interation);
        i += nbth;
        // single ends with an implicit barrier, so all threads
        // see the updated nb_it_done before re-checking the condition
        #pragma omp single
        nb_it_done += nbth;
    }
}
Another weird thing here is that, since you didn't show what i is used for, it isn't clear if iterating somewhat randomly into the domain is a problem. If it isn't, the first solution should work well, even for unbalanced //do sth.. But if it is a problem, then you'd better stick with the second solution (and even potentially reinforce the synchronism).
But at the end of the day, there is no way (that I can think of, and with decent parallelism) to avoid potentially doing extra work, since the number of iterations can change along the way.

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
First, you need a loop in which the distribution makes a difference. Your loop has only 100 iterations, so the OpenMP scheduler decides just 100 times which iteration a thread runs next, which takes no measurable time. The printf output, by contrast, takes very long, so in your code it makes no difference which schedule is used. It is better to write a loop without console output and with a very high iteration count, like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
Finally, the loop has to compute a "result" that is used after the loop, for example by a printf. Otherwise the compiler may optimize the loop body away (its result is never used, so it is not needed). That way you can restrict the time measurement to the parallel region, without the output of the results.
If your iterations all take nearly the same time, a static distribution should be fastest. If they differ a lot, the dynamic and guided schedules should come out ahead in your measurements.
