OpenMP 4.0 nested task parallelism and GCC 4.9.1

I have code like this:
#pragma omp parallel
{
    #pragma omp single
    {
        int x;
        #pragma omp task depend(inout:x)
        {
            for (int i = 1; i < 16; i++)
            {
                #pragma omp task
                DoComputationOnPartition(i);
            }
            #pragma omp taskwait
        }
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task depend(in:x)
            {
                OperateOnPartition(i);
            }
        }
        #pragma omp task depend(inout:x)
        {
            for (int i = 1; i < 16; i++) x++;
        }
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task depend(in:x)
            {
                OperateOnPartition(i);
            }
        }
        #pragma omp taskwait
    }
}
What I find is that the master thread never gets to execute any of the DoComputationOnPartition tasks nested inside the first task. Can someone explain that? It should work, right? The #pragma omp taskwait is a task scheduling point, so any thread of the team should be able to pick up a task. The master thread reaches the final taskwait and should be able to grab a nested task; the tasks run long enough to allow that.
Thanks.

taskwait waits only for its immediate children. Scheduling tasks other than those children at the taskwait is allowed, but it risks (greatly) increasing the latency of the taskwait if there is a significant amount of work after it. If you want to wait for all children plus grandchildren etc., you can use the #pragma omp taskgroup construct, or, in cases like this, simply leave out the taskwaits and rely on the (implicit) barrier at the end of the single construct.
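For the first batch of tasks in the question, that barrier-only variant could look roughly like this (a sketch, reusing DoComputationOnPartition from the question):
#pragma omp parallel
#pragma omp single
{
    for (int i = 1; i < 16; i++)
    {
        #pragma omp task
        DoComputationOnPartition(i);
    }
    // no taskwait: the implicit barrier at the end of the single construct
    // guarantees that every task created in the parallel region, including
    // descendants, has completed before any thread continues past it
}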

As per the OpenMP 4.0 specification:
Binding: A taskgroup region binds to the current task region. The binding thread set of the taskgroup region is the current team.
Description: When a thread encounters a taskgroup construct, it starts executing the region. There is an implicit task scheduling point at the end of the taskgroup region. The current task is suspended at the task scheduling point until all child tasks that it generated in the taskgroup region and all of their descendent tasks complete execution.
So, you mean to put the tasks inside a taskgroup. Well, ok. But a taskgroup also implies a scheduling point, just like a taskwait.
If libgomp has a problem with that, it is a problem specific to that runtime, not to OpenMP as an API. ICC and other runtimes such as OmpSs do not show this behaviour :-S
Taskgroup only makes sense (as I see it) if you want to wait for the whole hierarchy of tasks, but that is not the case here.
I think you mean this, right?:
#pragma omp task depend(inout:x)
{
    #pragma omp taskgroup
    {
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task
            DoComputationOnPartition(i);
        }
    }
}
The original code has the first taskwait nested inside another task context, so only a small group of tasks is waited for at that taskwait.

Related

How to make a thread wait for another to finish using OpenMP threads?

I want to parallelize the following loop that fills the A matrix. For every A[i][j] element that is calculated, I want the values in A[i-1][j], A[i-1][j-1] and A[i][j-1] to have been calculated first. So my thread has to wait for the threads working on these positions to have produced their results. I've tried to achieve this like this:
#pragma omp parallel for num_threads(threadcnt) \
        default(none) shared(A, lenx, leny) private(j)
for (i=1; i<lenx; i++)
{
    for (j=1; j<leny; j++)
    {
        // empty busy-wait until the three neighbouring cells have been filled in
        do
        {
        } while (A[i-1][j-1] == -1 || A[i-1][j] == -1 || A[i][j-1] == -1);
        A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
    }
}
My A matrix is initialized to -1, so if A[][] equals -1 the operation on this cell has not completed yet. It takes more time than the serial program though... Any idea how to avoid the while loop?
The waiting loop seems sub-optimal. Apart from burning cores that are spin-waiting, you will also need a plethora of well-placed flush directives to make this code work.
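Just to make the flush point concrete: a correct (but still wasteful) busy-wait would need something along these lines on every read and write of the cells used as flags. This is only an illustrative sketch, using seq_cst atomics (which imply the required flushes); result stands for the value the writer has just computed:
// consumer side: re-read one neighbour cell until it has been filled in
int v;
do {
    #pragma omp atomic read seq_cst
    v = A[i-1][j];
} while (v == -1);   // the other two neighbour cells need the same treatment

// producer side: publish the freshly computed value
#pragma omp atomic write seq_cst
A[i][j] = result;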
One alternative, especially in the context of a more general parallelization scheme, would be to use tasks and task dependences to model the dependences between the different array elements:
#pragma omp parallel
#pragma omp single
// declaring i and j inside the loops makes them firstprivate in each task,
// so every task captures the indices it was created with
for (int i=1; i<lenx; i++) {
    for (int j=1; j<leny; j++) {
        #pragma omp task depend(in:A[i-1][j-1],A[i-1][j],A[i][j-1]) depend(out:A[i][j])
        A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
    }
}
You may want to think about blocking the matrix updates, so that each task receives a block of the matrix instead of a single element, but the general idea remains the same; a sketch of that follows.
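A blocked variant could look roughly like this (a sketch; BS and the dep array are hypothetical additions: BS is the block size, and dep exists purely to carry the inter-block dependences, its contents are never read):
enum { BS = 64 };                      // hypothetical block size
int NBI = (lenx - 1 + BS - 1) / BS;    // number of block rows, ceil((lenx-1)/BS)
int NBJ = (leny - 1 + BS - 1) / BS;    // number of block columns
char dep[NBI + 1][NBJ + 1];            // dependence handles only

#pragma omp parallel
#pragma omp single
for (int bi = 1; bi <= NBI; bi++) {
    for (int bj = 1; bj <= NBJ; bj++) {
        // a block may start once its left, upper and upper-left blocks are done
        #pragma omp task depend(in: dep[bi-1][bj-1], dep[bi-1][bj], dep[bi][bj-1]) \
                         depend(out: dep[bi][bj])
        for (int i = (bi-1)*BS + 1; i <= bi*BS && i < lenx; i++)
            for (int j = (bj-1)*BS + 1; j <= bj*BS && j < leny; j++)
                A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
    }
}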
Another useful OpenMP feature could be the ordered construct and its ability to express exactly this kind of data dependency:
// ordered(2) turns the two-deep loop nest into a doacross loop nest
#pragma omp parallel for ordered(2)
for (int i=1; i<lenx; i++) {
    for (int j=1; j<leny; j++) {
        // wait for the three iterations this element depends on...
        #pragma omp ordered depend(sink:i-1,j-1) depend(sink:i-1,j) depend(sink:i,j-1)
        A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
        // ...and signal that this iteration is done
        #pragma omp ordered depend(source)
    }
}
PS: The code above is untested, but it should get the rough idea across.
Your solution cannot work. As A is initialized to -1 and A[0][j] is never modified, when i==1 the loop tests A[1-1][j] and the condition never becomes false, so it spins forever. By the way, if A is initialized to -1, how can you ever get anything other than -1 out of the max?
When you have a dependency problem, there are two solutions.
The first one is to sequentialize your code. OpenMP has the ordered directive to do that, but the drawback is that you lose parallelism (while still paying the thread creation cost). OpenMP 4.5 has a way to describe dependencies with the depend and sink/source clauses, but I do not know how efficiently compilers deal with that, and my compilers (gcc 7.3 and clang 6.0) do not support this feature.
The second solution is to modify your algorithm to remove the dependencies. You are essentially computing the maximum of all values to the left of or above a given element. Let's turn it into a simpler problem first: compute the maximum of all values to the left of a given element. We can easily parallelize that by processing the different rows concurrently, as there is no inter-row dependency.
// compute b[i][j] = max(a[i][0..j]), i.e. the prefix maximum along each row
#pragma omp parallel for
for(int i=0; i<n; i++){
    for(int j=0; j<n; j++){
        // max per row
        if(j != 0) b[i][j] = max(a[i][j], b[i][j-1]);
        else b[i][j] = a[i][j]; // left column initialised to the value of a
    }
}
Now consider another simple problem: computing the prefix maximum down each column. It is again easy to parallelize, but this time over the inner loop, as there is no inter-column dependency.
// compute c[i][j] = max(b[0..i][j]), i.e. the prefix maximum down each column
for(int i=0; i<n; i++){
    #pragma omp parallel for
    for(int j=0; j<n; j++){
        // max per column
        if(i != 0) c[i][j] = max(b[i][j], c[i-1][j]);
        else c[i][j] = b[i][j]; // upper row initialised to the value of b
    }
}
Now you just have to chain these two passes to get the expected result. Here is the final code (updating a single array in place, with some cleanup):
#pragma omp parallel for
for(int i=0; i<n; i++){
    for(int j=1; j<n; j++){
        // max per row
        a[i][j] = max(a[i][j], a[i][j-1]);
    }
}
for(int i=1; i<n; i++){
    #pragma omp parallel for
    for(int j=0; j<n; j++){
        // max per column
        a[i][j] = max(a[i][j], a[i-1][j]);
    }
}

OpenMP Multithreads becoming one thread

I am programming with OpenMP to learn about multithreading. Is it possible for any thread, i.e. any of the 11 threads in this case, to reach the return statement at the end while some threads are still working on something in the for loop? Or do they become one master thread again after line 13?
int np, iam;
#pragma omp parallel private(np, iam) num_threads(11)
{
np = omp_get_num_threads();
iam = omp_get_thread_num();
#pragma omp for
for (int i = 2; i < 100; i++) {
std::cout << i;
doStuff(i);
}
}
} // line 13
// synchronize necessary?
return 0;
There is an implicit barrier at the end of the parallel construct, so no synchronization is necessary. Any further code is executed only by the master thread (the one that had thread_num == 0 within the parallel region), and only after all threads have reached the end of the parallel region.
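A quick way to convince yourself (a sketch, not the code from the question; doStuff is left out):
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(11)
    {
        #pragma omp for
        for (int i = 2; i < 100; i++) {
            printf("i = %d done by thread %d\n", i, omp_get_thread_num());
        }
        // implicit barrier at the end of the parallel region:
        // no thread gets past this point until all loop chunks are finished
    }
    // only the initial (master) thread executes the code below,
    // and only after the barrier above
    printf("after the region: %d thread(s) active\n", omp_get_num_threads());
    return 0;
}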

OpenMP task parallelism - performance issue

I have a problem with OpenMP tasks. I am trying to create a parallel version of a "for" loop using omp tasks. However, the execution time of this version is close to 2 times longer than that of the base version, where I use omp for, and I do not know the reason for this. Look at the codes below:
omp for version:
t.start();
#pragma omp parallel num_threads(threadsNumber)
{
for(int ts=0; ts<1000; ++ts)
{
#pragma omp for
for(int i=0; i<size; ++i)
{
array_31[i] = array_11[i] * array_21[i];
}
}
}
t.stop();
cout << "Time of omp for: " << t.time() << endl;
omp task version:
t.start();
#pragma omp parallel num_threads(threadsNumber)
{
#pragma omp master
{
for(int ts=0; ts<1000; ++ts)
{
for(int th=0; th<threadsNumber; ++th)
{
#pragma omp task
{
for(int i=th*blockSize; i<th*blockSize+blockSize; ++i)
{
array_32[i] = array_12[i] * array_22[i];
}
}
}
#pragma omp taskwait
}
}
}
t.stop();
cout << "Time of omp task: " << t.time() << endl;
In the task version I divide the loop in the same way as in omp for. Each task has to execute the same number of iterations, and the total number of tasks is equal to the total number of threads.
Performance results:
Time of omp for: 4.54871
Time of omp task: 8.43251
What can the problem be? Is it possible to achieve similar performance with both versions? The attached codes are simple because I only wanted to illustrate the problem I am trying to resolve. I do not expect both versions to give me the same performance, but I would like to reduce the difference.
Thanks for the reply.
Best regards.
I think the issue here is the overhead. When you declare a loop as parallel, all the threads are assigned their part of the for loop at once. When you use tasks, each task must go through the whole setup process every time you launch it. Why not just do the following?
#pragma omp parallel num_threads(threadsNumber)
{
    for(int ts=0; ts<1000; ++ts)
    {
        // note: the worksharing "omp for" must be reached by all threads of the
        // team, so it cannot be nested inside "omp master" as in the task version
        #pragma omp for
        for(int th=0; th<threadsNumber; ++th)
        {
            for(int i=th*blockSize; i<th*blockSize+blockSize; ++i)
            {
                array_32[i] = array_12[i] * array_22[i];
            }
        }
    }
}
I'd say that the issue you're experiencing here is related to data affinity: when you use #pragma omp for, the distribution of iterations across threads is the same for all values of ts, whereas with tasks you don't have a way to specify a binding of tasks to threads.
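If you want to pin that mapping down explicitly in the omp for version, a sketch could look like this (schedule(static) is usually the default anyway, so this mostly documents the behaviour):
#pragma omp parallel num_threads(threadsNumber)
{
    for(int ts=0; ts<1000; ++ts)
    {
        // with a static schedule, thread t gets the same chunk of i for every ts,
        // so the parts of the arrays it touches stay warm in its cache
        #pragma omp for schedule(static)
        for(int i=0; i<size; ++i)
        {
            array_31[i] = array_11[i] * array_21[i];
        }
    }
}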
That said, I've executed your program on my machine with three arrays of 1M elements and the results of the two versions are closer:
t1_for: 2.041443s
t1_tasking: 2.159012s
(I used GCC 5.3.0 20151204)

Why does this openmp code gives segmentation fault?

I am trying to execute an OpenMP code and have been successful in doing so. However, I have a doubt regarding the placement of the #pragma omp parallel directive.
Consider these two code snippets:
#pragma omp parallel firstprivate(my_amax)
{
for (i=0; i<MATRIX_DIM; i++) {
#pragma omp for
for (j=0; j<MATRIX_DIM; j++) {
my_amax = abs_max(my_amax, A[i][j], B[i][j]);
#pragma omp critical
{
if(fabs(amax)<fabs(my_amax))
amax=my_amax;
}
}
}
}
And
for (i=0; i<MATRIX_DIM; i++) {
#pragma omp parallel firstprivate(my_amax)
{
#pragma omp for
for (j=0; j<MATRIX_DIM; j++) {
my_amax = abs_max(my_amax, A[i][j], B[i][j]);
#pragma omp critical
{
if (fabs(amax)<fabs(my_amax))
amax=my_amax;
}
}
}
}
The only difference between the two is the position of the parallel region. The first code always gives me a segmentation fault, while the second code executes perfectly. Why is that so?
I know that #pragma omp parallel spawns the required threads, but since the following i for loop is not declared as parallel, it should not be a problem, i.e. the i part should be executed sequentially while the j iterations, which are actually parallelized, execute in parallel. What exactly happens to the i iterations in the first case?
From what I can see, you forgot to declare i private in the first case. Therefore, i is updated quite randomly by the various threads executing the corresponding loop, leading to out-of-bounds accesses to the arrays A and B.
Just try adding private(i) to your #pragma omp parallel firstprivate(my_amax) directive and see what happens; see the sketch below.
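That is, something along these lines (a sketch of the suggested change, reusing the identifiers from the question):
#pragma omp parallel firstprivate(my_amax) private(i)
{
    for (i=0; i<MATRIX_DIM; i++) {
        // j is the loop variable of the worksharing loop, so it is private automatically
        #pragma omp for
        for (j=0; j<MATRIX_DIM; j++) {
            my_amax = abs_max(my_amax, A[i][j], B[i][j]);
            #pragma omp critical
            {
                if (fabs(amax)<fabs(my_amax))
                    amax=my_amax;
            }
        }
    }
}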

Avoid thread creation overhead in open MP

I am using OpenMP to parallelize a part of the code in HEVC. The basic structure of the code is as given below:
Void funct()
{
for(...)
{
#pragma omp parallel for private(....)
for (...)
{
/// do some parallel work
}
//end of inner for loop
//other tasks
}
/// end of outer for loop
}
//end of function
Now I have modified the inner for loop so that the code is parallelized and every thread performs its task independently. I am not getting any errors, but the overall processing time with multiple threads is higher than it would have been with a single thread. I guess the main reason is that for every iteration of the outer loop there is thread creation overhead for the inner loop. Is there any way to avoid this issue, or any way by which we can create the threads only once? I cannot parallelize the outer for loop since I have made modifications in the inner for loop to enable each thread to work independently. Please suggest any possible solutions.
You can use the separate directives #pragma omp parallel and #pragma omp for.
#pragma omp parallel creates the team of threads, whereas #pragma omp for distributes the work between the threads. For the sequential part of the outer loop you can use #pragma omp single.
Here is an example:
int n = 3, m = 10;
#pragma omp parallel
{
for (int i = 0; i < n; i++){
#pragma omp single
{
printf("Outer loop part 1, thread num = %d\n",
omp_get_thread_num());
}
#pragma omp for
for(int j = 0; j < m; j++) {
int thread_num = omp_get_thread_num();
printf("j = %d, Thread num = %d\n", j, thread_num);
}
#pragma omp single
{
printf("Outer loop part 2, thread num = %d\n",
omp_get_thread_num());
}
}
}
But I am not sure whether it will help you or not. To diagnose OpenMP performance issues, it would be better to use a profiler, such as Scalasca or VTune.
