Proper / Efficient parallelization of a for loop with OpenMP - multithreading

I have developed a distributed memory MPI application which involves processing of a grid. Now i want to apply shared memory techniques (essentially making it a hybrid - parallel program), with OpenMP, to see if it can become any faster, or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when i parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see i have tried many approaches and one is worse than the other. Best case, i get semi proper output every 2 seconds (because it refreshes very slowly). Worst case i get garbage output.
I would appreciate any help. Is there a place where i can find more information to proper parallelize loops with OpenMP? Thanks in advance.

The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else, you're going to obtain a garbage result as your experienced (since the elements of your grid are going to be printed in an order that means nothing). Actually, your attempts at parallelizing were pretty good, the problem comes from the fact that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize some work that is intrinsically sequential, you lose more time synchronizing your workers (your openmp threads), than you gain by parallizing the work itself (which is why you obtain crap time when your result gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example :
Here is a very basic example of parallizing a program with openmp that achieves speedup. Here a dummy (yet heavy) computation is realized for each value of the i variable. The computations in each loop are completely independent, and the different threads can achieve their computations independently. The calls to printf can be achieved in whatever order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
int i,j;
double x=0;
#pragma omp parallel for
for(i=0; i < 100; i++)
{
/* Dummy heavy computation */
x = 100000 * fabs(cos(i*i));
#pragma omp parallel for reduction(+: x)
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
}
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/

Related

Why does openMP hang on barrier within a while cycle?

I've got the following code:
int working_threads=1;
#pragma omp parallel
{
int my_num=omp_get_thread_num()+1;
int idle=false;
while(working_threads>0) {
if(my_num==1)
working_threads=0;
#pragma omp barrier
}
}
If I run it, it every now and then hangs on the barrier. The more threads, the more likely this is to happen. I've tried to debug it with printf and it seems that sometimes not all threads are executed and thus the barrier waits for them forever. This happens in the first iteration, the second one is obviously never run.
Is it an invalid piece of code? If so, how can I change it? I need to run a while loop in parallel. It is not known how many loops will be executed before, but it is guaranteed that all threads will have the same number of iterations.
Despite your attempt to synchronize with the barrier, you do have a race condition on working_threads that can easily lead to unequal amount of iterations:
thread 0 | thread 1
... | ...
while (working_threads > 0) [==true] | ...
if (my_num == 1) [==true] | ...
working_threads = 0 | ...
| while (working_threads > 0) [==false]
[hangs waiting for barrier] | [hangs trying to exit from parallel]
To fix your specific code, you would have to also add a barrier between the while-condition-check and working_threads = 0.
#pragma omp parallel
{
int my_num=omp_get_thread_num()+1;
int idle=false;
while(working_threads>0) {
#pragma omp barrier
if(my_num==1)
working_threads=0;
#pragma omp barrier
}
}
Note that the code is not exactly the most idiomatic or elegant solution. Depending on your specific work-sharing problem, there may be a better approach. Also you must ensure that worker_threads is written only by a single thread - or use atomics when writing.

OpenMP: for loop with changing number of iterations

I would like to use OpenMP to make my program run faster. Unfortunately, the opposite is the case. My code looks something like this:
const int max_iterations = 10000;
int num_interation = std::numeric_limits<int>::max();
#pragma omp parallel for
for(int i = 0; i < std::min(num_interation, max_iterations); i++)
{
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
num_interation = update_iterations(...);
}
For some reason, many more iterations are processed than required. Without OpenMP, it takes 500 iterations on avarage. However, even when setting the numbers of threads to one (set_num_threads(1)), it computes more than one thousand iterations. The same happens if I use mutliple threads, and also when using a writelock when updating num_iterations.
I would assume that it has something todo with memory bandwidth or a race condition. But those problems should not appear in case of set_num_threads(1).
Therefore, I assume that it could have something todo with the scheduling and the chunk size. However, I am really not sure about this.
Can somebody give me a hint?
A quick answer for the behaviour you experience is given by the OpenMP standard page 56:
The iteration count for each associated loop is computed before entry
to the outermost loop. If execution of any associated loop changes any
of the values used to compute any of the iteration counts, then the
behavior is unspecified.
In essence, this means that you cannot modify the boundaries of your loop once you entered it. Although according to the standard the behaviour is "unspecified", in your case, what happen is quite clear since as soon as you switch OpenMP on on your code, you compute the number of iterations you had specified initially.
So you have to take another approach to this problem.
This is a possible solution (amongst many other) which I hope scales OK. It has the drawback of potentially allowing more iterations to happen than the number you intended (up to OMP_NUM_THREADS-1 more iterations than expected, assuming that //do sth. is balanced, and many more if not). Also, it assumes that update_iterations(...) is thread safe and can be called in parallel without unwanted side effects... This is a very strong assumption which you'd better enforce!
num_interation = std::min(num_interation, max_iterations);
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( i < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(...);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
}
}
A more synchronised solution, if the //do sth. isn't so balanced and not doing too many extra iterations is important, could be:
num_interation = std::min(num_interation, max_iterations);
int nb_it_done = 0;
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( nb_it_done < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(i);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
#pragma omp single
nb_it_done += nbth;
}
}
Another weird thing here is that, since you didn't show what i is used for, it isn't clear if iterating somewhat randomly into the domain is a problem. If it isn't, the first solution should work well, even for unbalanced //do sth.. But if it is a problem, then you'd better stick with the second solution (and even potentially reinforce the synchronism).
But at the end of the day, there is now way (that I can think of and with decent parallelism) to avoid potential extra work to be done, since the number of iterations can change along the way.

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
At first you need a loop, where the distribution makes a difference. The loop has 100 iterations, so the OpenMP schedule will only 100 times decide what is the next iteration for a thread what takes no mensurable time. The output with printf takes very long so in your code it makes no difference which schedule is used. Its better to make a loop without console output and a very high loop count like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
At last you have to write code in the loop which "result" is used after the loop with a printf for example. If not the body could be deleted by the compiler because of optimize the code (it is not used later so its not needed). You can concentrate the time measurings on the parallel pool without the output of the results.
If your iterations nearly takes the same time, then a static distribution should be faster. If they differ very much the dynamic and guided schedules should dominate your measurings.

Parallel processing a prime finder with openMP

I am trying to construct a prime finder for a bit of C practice. I've got the algorithm down and I've done a bunch of optimisations to make it faster, I then decided to try to parallelize it because, hey why not! Turns out to be harder than I thought. I can either get all threads running the same process (with same args) or a single thread will run if I try and supply different args to each process. I really have no idea what I'm doing here but you can see some experimental values I'm using in this code:
// gcc -std=c99 -o multithread multithread.c -fopenmp -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
int pf(unsigned int start, unsigned int limit, unsigned int q);
int main(int argc, char *argv[])
{
printf("prime finder\n");
int j, slimits[4] = {1,10000000,20000000,30000000}, elimits[4] = {10000000,20000000,30000000,40000000};
double startTime = omp_get_wtime();
#pragma omp parallel shared(slimits, elimits primes)
{
#pragma omp for
for (j = 0; j < 4; j++)
{
primes += pf(slimits[j], elimits[j], atoi(argv[2]));
}
}
printf("%d prime numbers found in %.2f seconds.\n\n", primes, omp_get_wtime() - startTime);
return 0;
}
I havn't included the pf function as it is quite large but it works on its own, it returns the number of primes found. Im sure the issue is here somewhere.
Any help would be greatly appreciated!
You have made at least one obvious (to me) and serious mistake. You've declared primes shared and allowed all the threads in the program to update it. You have, thereby, programmed a data race. Nothing in OpenMP (nor in C if I recall correctly) guarantees that += will be implemented atomically. You haven't actually specified what the problem with your program is, or what the problems are, but this must surely be one of them.
I'll tell you how to fix this later but I think there is a more serious underlying design problem you should address first. You seem to have decided that you would have 4 threads running and that you should divide the range of integers to test for primality into 4 and pass one chunk to each thread. Sure, you can make that work but it's not a smart approach to using OpenMP. Nor is it a smart approach to dividing the work of primality testing.
A smarter approach to OpenMP program design is to start off by making no assumptions about the number of threads that will be available to the executing program. Design for any number of threads, do not design a program whose behaviour depends on the number of threads it gets at run-time. Use OpenMP's facilities, specifically the schedule clause, to distribute the workload at run time.
Turning to primality testing. Draw, or at least think about, a scatter plot of points (i,t(i)), where i is an integer and t(i) is the time it takes to determine whether or not i is prime. The pattern in this plot is about as difficult to discern as the pattern in the plot of the occurrence of primes in the integers. In other words, the time to determine the primality of an integer is very unpredictable. It does tend to rise as the integers increase (well, excluding large even integers which I'm sure your test doesn't consider anyway).
One implication of this unpredictability is that if you divide a range of integers into N sub-ranges and give one sub-range to each of N threads you are not giving the threads the same amount of work to do. Indeed, in the range of integers 1..m (any m) there is one integer which takes much longer to test than any other integer in the range, and this time is the irreducible minimum that your program will take. A naive distribution of the range will produce a seriously unbalanced workload.
Here's what I think you should do to fix your program.
First, write a function which tests the primality of a single integer. This will be the basic task for your computation. Call this is_prime. Next, study the schedule clause for the parallel for construct. OpenMP provides a number of task scheduling options, I won't explain them here, you will find plenty of good documentation online. Finally, study also the reduction clause; this provides the solution to the data race you have programmed.
Applying all this I suggest you change
#pragma omp parallel shared(slimits, elimits primes)
{
#pragma omp for
for (j = 0; j < 4; j++)
{
primes += pf(slimits[j], elimits[j], atoi(argv[2]));
}
}
to
#pragma omp parallel shared(slimits, elimits, max_int_to_test)
{
#pragma omp for reduction(+:primes) schedule (dynamic, 10)
for (j = 3; j < max_int_to_test; j += 2)
{
primes += is_prime(j);
}
}
With any luck my rudimentary C hasn't screwed up the syntax too much.

Thread-safe random number generation for Monte-Carlo integration

Im trying to write something which very quickly calculates random numbers and can be applied on multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#define N 1000000000 /* As lareg as possible for increased accuracy */
double random_function(void);
int main(void)
{
int i = 0;
double X, Y;
double count_inside_temp = 0.0, count_inside = 0.0;
unsigned int th_id = omp_get_thread_num();
#pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
{
srand(th_id);
#pragma omp for schedule(static)
for (i = 0; i <= N; i++) {
X = 2.0 * random_function() - 1.0;
Y = 2.0 * random_function() - 1.0;
if ((X * X) + (Y * Y) < 1.0) {
count_inside_temp += 1.0;
}
}
#pragma omp atomic
count_inside += count_inside_temp;
}
printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0)/ N);
return 0;
}
double random_function(void)
{
return ((double) rand() / (double) RAND_MAX);
}
This works but from observing a resource manager I know its not using all the threads. Does rand() work for multithreaded code? And if not is there a good alternative? Many Thanks. Jack
Is rand() thread safe? Maybe, maybe not:
The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test and good learning exercise would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which take an integer as input and return an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a staring number (the seed) use the output value in your application and as the input for the next call to the random number generator. There are two common approaches to designing an API:
Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in sort of global variable. Generating a new random number either will not be thread safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreded code (you end up with some serialization around the state value).
A interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state, get_prng(state), which returns a random number and changes the state variable, and destroy_peng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread safe and run in parallel with no locking (because you are in charge of managing the (now thread local) state variable.
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNG's in C which expose the state to your control. PRNG looks good and using this (with initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this libraries often have something like get_prng_array(state) functions which give you back an array full of random numbers as if you put get_prng in a loop filling the array elements - they just do it more quickly. This would be a second optimization (and would need an added for loop inside the parallel for loop. Obviously, you don't want to run out of per-thread stack space doing this!

Resources