OpenMP: for loop with changing number of iterations - multithreading

I would like to use OpenMP to make my program run faster. Unfortunately, the opposite is the case. My code looks something like this:
const int max_iterations = 10000;
int num_interation = std::numeric_limits<int>::max();
#pragma omp parallel for
for(int i = 0; i < std::min(num_interation, max_iterations); i++)
{
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
num_interation = update_iterations(...);
}
For some reason, many more iterations are processed than required. Without OpenMP, it takes 500 iterations on average. However, even when setting the number of threads to one (set_num_threads(1)), it computes more than one thousand iterations. The same happens if I use multiple threads, and also when using a write lock when updating num_interation.
I would assume that it has something to do with memory bandwidth or a race condition. But those problems should not appear in the case of set_num_threads(1).
Therefore, I assume that it could have something to do with the scheduling and the chunk size. However, I am really not sure about this.
Can somebody give me a hint?

A quick answer for the behaviour you experience is given by the OpenMP standard page 56:
The iteration count for each associated loop is computed before entry
to the outermost loop. If execution of any associated loop changes any
of the values used to compute any of the iteration counts, then the
behavior is unspecified.
In essence, this means that you cannot modify the bounds of your loop once you have entered it. Although the standard calls the behaviour "unspecified", what happens in your case is quite clear: as soon as you enable OpenMP in your code, you compute the number of iterations you specified initially.
So you have to take another approach to this problem.
This is a possible solution (among many others) which I hope scales OK. It has the drawback of potentially allowing more iterations to happen than the number you intended (up to OMP_NUM_THREADS-1 more iterations than expected, assuming that //do sth. is balanced, and many more if not). Also, it assumes that update_iterations(...) is thread safe and can be called in parallel without unwanted side effects. This is a very strong assumption which you'd better enforce!
num_interation = std::min(num_interation, max_iterations);
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( i < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(...);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
}
}
A more synchronised solution, if the //do sth. isn't so balanced and not doing too many extra iterations is important, could be:
num_interation = std::min(num_interation, max_iterations);
int nb_it_done = 0;
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( nb_it_done < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(i);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
#pragma omp single
nb_it_done += nbth;
}
}
Another weird thing here is that, since you didn't show what i is used for, it isn't clear if iterating somewhat randomly into the domain is a problem. If it isn't, the first solution should work well, even for unbalanced //do sth.. But if it is a problem, then you'd better stick with the second solution (and even potentially reinforce the synchronism).
But at the end of the day, there is no way (that I can think of and with decent parallelism) to avoid potential extra work being done, since the number of iterations can change along the way.
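Another way to stay within the rule quoted from the standard is to keep the trip count fixed at max_iterations and let each iteration check a shared bound before doing any work. The following is only a minimal sketch under assumptions: run_with_shrinking_bound and the update callback are hypothetical names standing in for the loop and for update_iterations(...), and the callback is assumed to be safe to call from several threads. Without -fopenmp the pragmas are ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <functional>

// Hypothetical helper (not from the original post): the loop always has
// max_iterations trips, but an iteration does no work once its index is at
// or beyond a shared bound that can only shrink.
int run_with_shrinking_bound(int max_iterations,
                             const std::function<int(int)>& update) {
    int limit = max_iterations;  // shared bound, only ever decreases
    int executed = 0;            // iterations that actually did work

    #pragma omp parallel for schedule(dynamic, 1) reduction(+ : executed)
    for (int i = 0; i < max_iterations; ++i) {
        int current;
        // Read the shared bound under the same lock that protects its update.
        #pragma omp critical(bound)
        current = limit;
        if (i >= current) continue;  // bound already reached: skip the work

        ++executed;
        const int proposed = update(i);  // stands in for update_iterations(...)

        // Only ever lower the bound, never raise it.
        #pragma omp critical(bound)
        limit = std::min(limit, proposed);
    }
    return executed;
}
```

A call such as run_with_shrinking_bound(10000, [](int) { return 500; }) performs at least 500 working iterations. As with the solutions above, a few iterations beyond the final bound may still run if they were started before the bound shrank, and every skipped iteration still costs one pass through the critical section.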

Related

How to prevent switching threads in FOR loop, while using omp directive

Assume such simple FOR loop
#pragma omp parallel for
for (int i = 0; i<10; i++)
{
//do_something_1
//do_something_2
//do_something_3
}
As I understand, loop iterations can run in any order. This is fine for me.
The problem is that threads can be switched in the middle of an iteration.
Assume 10 threads were created (one per iteration).
Let's say thread_1 is currently running, and after completing the do_something_1 and do_something_2 lines, execution switches to thread_4 (for example).
Is there a way to force a thread to complete a whole iteration without being switched out, I mean thread_1 completing lines do_something_1, do_something_2 and do_something_3 without being interrupted?
I understand that this is part of the OS scheduling algorithm for a multithreading environment, but I hope there is a way to bypass it.
Edit:
Using the ORDERED clause in the pragma is needed only in 2 cases:
1) You need the result of the previous iteration in the current iteration. And then it will be a single-threaded program.
2) You need your index to be correct in each iteration (though you can still run all iterations in parallel).
Let's see example for my problem:
int new_index = 0;
#pragma omp parallel for
for (int i = 0; i<10; i++)
{
<mutex lock>
new_index++
<mutex unlock>
//do_something_1
//do_something_2
//do_something_3
my_array[new_index] = 5; //correct
my_array[i] = 5; //not correct
}
So, there will still be 10 iterations, but now there should be a correct index each time for my_array.
The problem is: thread_1 increments new_index (new_index = 1), completes do_something_1, and then execution switches to thread_2.
Thread_2 completes its loop body completely (new_index = 2), but now, when the OS switches back to thread_1, it no longer has the correct new_index (new_index = 1) and my_array stays unchanged.
So, I wondered whether it is possible to tell the OS not to switch threads in the middle of an iteration.
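One common way to handle the situation described above, without asking the OS for anything, is to copy the counter value into a variable that is local to the iteration at the moment it is incremented. The sketch below is an assumption-laden illustration, not an answer from the original thread: fill_slots and my_index are hypothetical names, and the counter is post-incremented so that the slots run from 0 to n-1.

```cpp
#include <vector>

// Hypothetical sketch: each iteration reserves its own slot number once and
// keeps it in a local variable, so later increments of the shared counter by
// other threads cannot change the slot this iteration writes to.
std::vector<int> fill_slots(int n) {
    std::vector<int> my_array(n, 0);
    int* slots = my_array.data();
    int new_index = 0;  // shared counter

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int my_index;  // declared inside the loop, hence private to the iteration
        // Atomically read the old counter value and increment the counter.
        #pragma omp atomic capture
        my_index = new_index++;

        // do_something_1, do_something_2, do_something_3 would go here

        slots[my_index] = 5;  // uses the reserved slot, not the shared counter
    }
    return my_array;
}
```

With this pattern a thread can be preempted at any point in the iteration and still write to the slot it reserved, because my_index never changes after the capture.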

How to make a thread wait another to finish using OpenMP threads?

I want to parallelize the following loop, which fills the A matrix. For every A[i][j] element that is calculated, I want the values in A[i-1][j], A[i-1][j-1] and A[i][j-1] to have been calculated first. So my thread has to wait for the threads working on these positions to have calculated their results. I've tried to achieve this like this:
#pragma omp parallel for num_threads(threadcnt) \
default(none) shared(A, lenx, leny) private(j)
for (i=1; i<lenx; i++)
{
for (j=1; j<leny; j++)
{
do
{
} while (A[i-1][j-1] == -1 || A[i-1][j] == -1 || A[i][j-1] == -1);
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
My A matrix is initialized to -1, so if A[][] equals -1 the operation in this cell is not completed. It takes more time than the serial program, though. Any idea how to avoid the while loop?
The waiting loop seems sub-optimal. Apart from burning cores that are spin-waiting, you will also need a plethora of well-placed flush directives to make this code work.
One alternative, especially in the context of a more general parallelization scheme would be to use tasks and task dependences to model the dependences between the different array elements:
#pragma omp parallel
#pragma omp single
for (i=1; i<lenx; i++) {
for (j=1; j<leny; j++) {
#pragma omp task depend(in:A[i-1][j-1],A[i-1][j],A[i][j-1]) depend(out:A[i][j])
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
You may want to think about blocking the matrix updates, so that each task receives a block of the matrix instead of a single element, but the general idea will remain the same.
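For experimenting, the task version above can be wrapped into a self-contained function. This is a sketch under assumptions that are not in the original post: the matrix is stored row-major in a flat std::vector<int>, the first row and first column are already filled in, and fill_with_tasks is a hypothetical name. It keeps one task per element, so it illustrates the dependences rather than good performance.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical wrapper: `matrix` holds lenx * leny ints in row-major order,
// with row 0 and column 0 already initialised by the caller. Without -fopenmp
// the pragmas are ignored and the loops run serially.
void fill_with_tasks(std::vector<int>& matrix, int lenx, int leny) {
    int* a = matrix.data();
    #pragma omp parallel
    #pragma omp single
    for (int i = 1; i < lenx; ++i) {
        for (int j = 1; j < leny; ++j) {
            // One task per element: it may start only after the tasks that
            // write its upper-left, upper and left neighbours have finished.
            #pragma omp task depend(in : a[(i - 1) * leny + j - 1], \
                                         a[(i - 1) * leny + j],     \
                                         a[i * leny + j - 1])       \
                             depend(out : a[i * leny + j])
            a[i * leny + j] = std::max({a[(i - 1) * leny + j - 1] + 1,
                                        a[(i - 1) * leny + j] - 1,
                                        a[i * leny + j - 1] - 1});
        }
    }
}
```

For a 3x3 matrix of zeros, the interior ends up as 1, 1, 1, 2 (row by row), which matches what the serial double loop produces.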
Another useful OpenMP feature could be the ordered construct and its ability to adhere to exactly this kind of data dependency:
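#pragma omp parallel for ordered(2)
for (int i=1; i<lenx; i++) {
for (int j=1; j<leny; j++) {
// wait until the three neighbouring iterations have signalled completion
#pragma omp ordered depend(sink:i-1,j-1) depend(sink:i-1,j) depend(sink:i,j-1)
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
// signal that iteration (i,j) is done
#pragma omp ordered depend(source)
}
}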
PS: The code above is untested, but it should get the rough idea across.
Your solution cannot work. As A is initialized to -1, and A[0][j] is never modified, if i==1, it will test A[1-1][j] and will always fail. BTW, if A is initialized to -1, how can you get anything but -1 from the max?
When you have a dependency problem, there are two solutions.
The first one is to sequentialize your code. OpenMP has the ordered directive to do that, but the drawback is that you lose parallelism (while still paying the thread creation cost). OpenMP 4.5 has a way to describe dependencies with the depend and sink/source statements, but I do not know how efficiently the compiler can deal with that. And my compilers (gcc 7.3 or clang 6.0) do not support this feature.
The second solution is to modify your algorithm to remove the dependencies. Right now, you are computing the maximum of all values that are to the left of or above a given element. Let's turn it into a simpler problem: compute the maximum of all values to the left of a given element. We can easily parallelize it by computing the different rows independently, as there is no inter-row dependency.
// compute b[i][j] = max over k<=j of a[i][k]
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=0;j<n;j++){
// max per row
if((j!=0)) b[i][j]=max(a[i][j],b[i][j-1]);
else b[i][j]=a[i][j]; // left column initialised to value of a
}
}
Consider another simple problem: computing the prefix maximum along the different columns. It is again easy to parallelize, but this time on the inner loop, as there is no inter-column dependency.
// compute c[i][j] = max over k<=i of b[k][j]
for(int i=0;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
if((i!=0)) c[i][j]=max(b[i][j],c[i-1][j]);
else c[i][j]=b[i][j]; // upper row initialised to value of b
}
}
Now you just have to chain these computations to get the expected result. Here is the final code (using a single array and with some cleanup).
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=1;j<n;j++){
// max per row
a[i][j]=max(a[i][j],a[i][j-1]);
}
}
for(int i=1;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
a[i][j]=max(a[i][j],a[i-1][j]);
}
}
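The chained two-pass code above can be packaged as a small function for checking it on a known input. This is a minimal sketch under assumptions: Matrix and prefix_max_2d are hypothetical names, the matrix is a std::vector of equally long rows, and the result computed is the running maximum over everything above and to the left of each element (including the element itself), exactly as in the final code.

```cpp
#include <algorithm>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// In place: afterwards a[i][j] holds the maximum of all original values
// a[k][l] with k <= i and l <= j.
void prefix_max_2d(Matrix& a) {
    const int rows = static_cast<int>(a.size());
    const int cols = rows > 0 ? static_cast<int>(a[0].size()) : 0;

    // Pass 1: running maximum along each row; rows are independent.
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i)
        for (int j = 1; j < cols; ++j)
            a[i][j] = std::max(a[i][j], a[i][j - 1]);

    // Pass 2: running maximum down each column; within one row step,
    // the columns are independent, so the inner loop is the parallel one.
    for (int i = 1; i < rows; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < cols; ++j)
            a[i][j] = std::max(a[i][j], a[i - 1][j]);
    }
}
```

For the input rows {1, 5, 2}, {7, 0, 3}, {4, 6, 9} the result is {1, 5, 5}, {7, 7, 7}, {7, 7, 9}. Opening a parallel region once per row in the second pass has a cost, so for small matrices the serial version may well be faster.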

How to join all threads before deleting the ThreadPool

I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; //number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors;
m_pool = new std::thread[m_processors + 1];
void func()
{
//code
}
for (int i = 0; i < m_processors; i++)
{
m_pool[i] = std::thread(func);
}
void reset(void)
{
{
std::lock_guard<std::mutex> lock(m_locker);
m_exit = true;
}
m_condition.notify_all();
for(int i = 0; i <= m_processors; i++)
m_pool[i].join();
delete[] m_pool;
}
After running through all tasks, the for-loop is supposed to join all running threads before delete[] is being executed.
But there seems to be one last thread still running, while the m_pool does not exist anymore.
This leads to the problem, that I can't close my program anymore.
Is there any way to check if all threads are joined or wait for all threads to be joined before deleting the threadpool?
Simple typo bug I think.
Your loop that has the condition i <= m_processors is a bug and will process one entry more than the number of threads you started. This is an off-by-one bug. Suppose m_processors is 2. The array is allocated with m_processors + 1 = 3 slots, but only indices [0] and [1] are assigned a running thread. The loop then calls join() on m_pool[2], a default-constructed std::thread that is not joinable, which throws std::system_error. Had the array been allocated with only m_processors elements, the same loop would read past its end, which is undefined behaviour and likely to crash or block forever.
You likely intended i < m_processors.
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11 for std::thread, then you shouldn't create your thread handles using operator new[]. There are better ways of doing that with other C++ constructs, which will make everything simpler and exception safe (you don't leak memory if an unexpected exception is thrown).
Store your thread objects in a std::vector. It will manage the memory allocation and deallocation for you (no more new and delete). You can use other more flexible containers such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate_n or similar:
std::vector<std::thread> m_pool;
m_pool.reserve(m_processors);
// Fill the vector
std::generate_n( std::back_inserter(m_pool), m_processors,
[](){ return std::thread(func); } );
Join all the elements using range-for loop and delete handles using container's functions.
for( std::thread& t: m_pool ) {
t.join();
}
m_pool.clear();
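Putting these tips together, a complete shutdown path could look like the following. It is only a sketch under assumptions: MiniPool and worker are hypothetical names, the workers do nothing except wait for the exit flag, and the flag is a plain bool protected by the mutex rather than the std::atomic<bool> from the question.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal pool: workers block until reset() asks them to exit.
class MiniPool {
public:
    explicit MiniPool(int workers) {
        m_pool.reserve(workers);
        for (int i = 0; i < workers; ++i)
            m_pool.emplace_back([this] { worker(); });
    }

    ~MiniPool() { reset(); }  // never destroy a joinable std::thread

    // Signals all workers to exit, joins them, and returns how many were joined.
    int reset() {
        {
            std::lock_guard<std::mutex> lock(m_locker);
            m_exit = true;
        }
        m_condition.notify_all();

        int joined = 0;
        for (std::thread& t : m_pool) {  // iterates over exactly the threads created
            if (t.joinable()) {
                t.join();
                ++joined;
            }
        }
        m_pool.clear();
        return joined;
    }

private:
    void worker() {
        std::unique_lock<std::mutex> lock(m_locker);
        // The predicate guards against spurious and missed wake-ups.
        m_condition.wait(lock, [this] { return m_exit; });
    }

    std::vector<std::thread> m_pool;
    std::mutex m_locker;
    std::condition_variable m_condition;
    bool m_exit = false;  // protected by m_locker
};
```

Because the range-for loop visits only the elements that are actually in the vector, there is no index to get wrong, and calling reset() a second time is harmless since the vector is empty by then.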

Proper / Efficient parallelization of a for loop with OpenMP

I have developed a distributed memory MPI application which involves processing of a grid. Now I want to apply shared memory techniques with OpenMP (essentially making it a hybrid parallel program), to see if it can become any faster or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when I parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see, I have tried many approaches and each one is worse than the last. Best case, I get semi-proper output every 2 seconds (because it refreshes very slowly). Worst case, I get garbage output.
I would appreciate any help. Is there a place where I can find more information on how to properly parallelize loops with OpenMP? Thanks in advance.
The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order; otherwise you get garbage output, as you experienced (since the elements of your grid are printed in a meaningless order). Your attempts at parallelizing were actually pretty good; the problem is that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize work that is intrinsically sequential, you lose more time synchronizing your workers (your OpenMP threads) than you gain by parallelizing the work itself (which is why your run time gets so much worse as your output gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example :
Here is a very basic example of parallelizing a program with OpenMP that achieves speedup. A dummy (yet heavy) computation is performed for each value of the i variable. The computations in each iteration are completely independent, so the different threads can carry them out independently. The calls to printf can happen in any order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
int i;
#pragma omp parallel for
for(i=0; i < 100; i++)
{
/* j and x are declared inside the loop so that each thread has its own copy */
int j;
/* Dummy heavy computation */
double x = 100000 * fabs(cos(i*i));
/* no nested parallel region here: the outer loop already supplies the parallelism */
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
}
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
This is what I've done so far; I have no idea if I'm on the right track.
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
First, you need a loop where the distribution makes a difference. Your loop has only 100 iterations, so the OpenMP scheduler decides only 100 times which iteration a thread runs next, which takes no measurable time. The printf output, by contrast, takes very long, so in your code it makes no difference which schedule is used. It is better to write a loop without console output and with a very high iteration count, like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
Finally, the loop body has to produce a "result" that is used after the loop, for example in a printf. Otherwise the compiler may delete the body while optimizing the code (the result is never used, so it is not needed). You can then restrict the time measurement to the parallel region, without the output of the results.
If your iterations all take nearly the same time, a static distribution should be fastest. If their durations differ a lot, the dynamic and guided schedules should come out ahead in your measurements.
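The advice above could be turned into a measurement harness along the following lines. This is a sketch under assumptions, not the assignment's reference solution: delay_work, scheduled_sum and time_run_ms are hypothetical names, the "delay" is a busy loop whose cost varies with i so that the iterations are unbalanced, and the schedule is selected through schedule(runtime) so the same binary can be run with different settings of the OMP_SCHEDULE environment variable.

```cpp
#include <chrono>

// Hypothetical stand-in for "a function containing a delay": the cost grows
// with i % 1000, so iterations are deliberately unbalanced.
long long delay_work(int i) {
    long long acc = 0;
    for (int k = 0; k <= i % 1000; ++k) acc += k % 7;
    return acc;
}

// schedule(runtime) defers the choice of schedule to OMP_SCHEDULE.
// The sum is returned so the compiler cannot discard the loop body.
long long scheduled_sum(int n) {
    long long sum = 0;
    #pragma omp parallel for schedule(runtime) reduction(+ : sum)
    for (int i = 0; i < n; ++i) sum += delay_work(i);
    return sum;
}

// Times one call of scheduled_sum in milliseconds; only the loop is timed,
// any printing of the result happens outside the measured interval.
double time_run_ms(int n, long long* result) {
    const auto start = std::chrono::steady_clock::now();
    *result = scheduled_sum(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Compiled with OpenMP enabled, the program can then be run with OMP_NUM_THREADS=4 and OMP_SCHEDULE set to static, dynamic or guided in turn, varying n and the amount of work in delay_work to cover both dimensions the assignment asks about. An optimizing compiler may simplify the busy loop, so the absolute timings should be treated with some caution.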

Resources