OpenMP task parallelism - performance issue - multithreading

I have a problem with OpenMP tasks. I am trying to create a parallel version of a "for" loop using omp tasks. However, the execution time of this version is close to 2 times longer than the base version, where I use omp for, and I do not know why. Look at the code below:
omp for version:
t.start();
#pragma omp parallel num_threads(threadsNumber)
{
    for(int ts=0; ts<1000; ++ts)
    {
        #pragma omp for
        for(int i=0; i<size; ++i)
        {
            array_31[i] = array_11[i] * array_21[i];
        }
    }
}
t.stop();
cout << "Time of omp for: " << t.time() << endl;
omp task version:
t.start();
#pragma omp parallel num_threads(threadsNumber)
{
    #pragma omp master
    {
        for(int ts=0; ts<1000; ++ts)
        {
            for(int th=0; th<threadsNumber; ++th)
            {
                #pragma omp task
                {
                    for(int i=th*blockSize; i<th*blockSize+blockSize; ++i)
                    {
                        array_32[i] = array_12[i] * array_22[i];
                    }
                }
            }
            #pragma omp taskwait
        }
    }
}
t.stop();
cout << "Time of omp task: " << t.time() << endl;
In the task version I divide the loop in the same way as in omp for. Each task has to execute the same number of iterations, and the total number of tasks is equal to the total number of threads.
Performance results:
Time of omp for: 4.54871
Time of omp task: 8.43251
What could be the problem? Is it possible to achieve similar performance with both versions? The attached code is simple, because I only wanted to illustrate the problem I am trying to resolve. I do not expect both versions to give me the same performance, but I would like to reduce the difference.
Thanks for reply.
Best regards.

I think the issue here is the overhead. When you declare a loop as a worksharing loop, all the threads execute their part of the for loop at once. When you create a task, it must go through the whole setup process every time you launch one. Why not just do the following?
#pragma omp parallel num_threads(threadsNumber)
{
    // note: the ts loop is executed by every thread; the worksharing
    // #pragma omp for below splits the th loop between them (a
    // worksharing loop must not be nested inside a master region)
    for(int ts=0; ts<1000; ++ts)
    {
        #pragma omp for
        for(int th=0; th<threadsNumber; ++th)
        {
            for(int i=th*blockSize; i<th*blockSize+blockSize; ++i)
            {
                array_32[i] = array_12[i] * array_22[i];
            }
        }
    }
}

I'd say that the issue you're experiencing here is related to data affinity: when you use #pragma omp for, the distribution of iterations across threads is always the same for all values of ts, whereas with tasks you have no way to specify a binding of tasks to threads.
That said, I've executed your program on my machine with three arrays of 1M elements, and the results of the two versions are closer:
t1_for: 2.041443s
t1_tasking: 2.159012s
(I used GCC 5.3.0 20151204)
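To make the affinity point concrete, here is a minimal sketch (reusing the variables from the question) in which an explicit schedule(static) clause guarantees that the same chunk of iterations goes to the same thread in every ts iteration, so each thread keeps working on the part of the arrays that is already in its cache:
#pragma omp parallel num_threads(threadsNumber)
{
    for(int ts=0; ts<1000; ++ts)
    {
        // schedule(static) with identical loop bounds assigns the same
        // iterations to the same thread on every pass over ts
        #pragma omp for schedule(static)
        for(int i=0; i<size; ++i)
        {
            array_31[i] = array_11[i] * array_21[i];
        }
    }
}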

Related

How to use multiple threads for multiple file reading/writing operations in C?

I'm trying to analyse the optimum number of threads required to perform a large file reading/writing operation on multiple files.
So how do I proceed with creating multiple threads and assigning each thread some number of files to speed up the execution time?
Language used: C
Look into https://en.wikipedia.org/wiki/OpenMP for creating threads and managing them efficiently for your analysis.
In your case, you would either be using several individual threads and assigning the workload to each of them, or a team of threads that receives an equal share of the work. Examples below:
#include <stdio.h>
#include <omp.h>

void function_1(void); // where you perform R/W operations
void function_2(void); // where you perform R/W operations or other

int main(int argc, char **argv)
{
    // split work in sections: each section is executed by one thread
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            function_1(); // where you perform R/W operations
        }
        #pragma omp section
        {
            function_2(); // where you perform R/W operations or other
        }
    }

    // split work on N threads equally
    #pragma omp parallel
    {
        // where you perform R/W operations
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    /*
    example for loop: if you have 2 threads, the first one gets the
    iterations i = 0 to 49999 while the second one gets i = 50000 to 99999.
    */
    int arr[100000];
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        arr[i] = 2 * i;
    }

    printf("Maximum number of threads available: %d\n", omp_get_max_threads());
    return 0;
}
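Coming back to the actual question (splitting a set of files across threads), a minimal sketch could look like the following; process_file(), filenames, and num_files are hypothetical placeholders, not part of the original post:
#include <omp.h>

// hypothetical helper that reads/writes a single file
void process_file(const char *path);

void process_all_files(const char **filenames, int num_files)
{
    // each thread is assigned a block of the file indices;
    // schedule(dynamic) may balance better if file sizes vary a lot
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < num_files; i++) {
        process_file(filenames[i]);
    }
}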

Dividing a loop task among threads using OpenMP

Suppose I want to split the following loop between threads (say 4 threads) such that each thread is in charge of calculating N/4 rows, where N is the number of rows of the matrix.
#pragma omp parallel num_threads(4) private(i,j,M) shared(Matrix)
{
    #pragma omp for schedule(static)
    for(i=0; i<N; i++)
    {
        for(j=0; j<N; j++)
        {
            M[i][j]= Matrix[i][j] + (Matrix[i][j] * Matrix[j][i]);
        }
    }
}
My question is: should I explicitly specify the beginning and the end of each chunk of the matrix that each thread will calculate, or will OpenMP automatically distribute the job between the threads? The reason behind this question is that I've read somewhere that OpenMP distributes the work automatically, but when I implemented it, I got a segmentation fault.
Thank you.

Why does this OpenMP code give a segmentation fault?

I am trying to execute an OpenMP code and have been successful in doing so. However, I have a question regarding the #pragma omp parallel directive.
Consider these two code snippets:
#pragma omp parallel firstprivate(my_amax)
{
    for (i=0; i<MATRIX_DIM; i++) {
        #pragma omp for
        for (j=0; j<MATRIX_DIM; j++) {
            my_amax = abs_max(my_amax, A[i][j], B[i][j]);
            #pragma omp critical
            {
                if(fabs(amax)<fabs(my_amax))
                    amax=my_amax;
            }
        }
    }
}
And
for (i=0; i<MATRIX_DIM; i++) {
    #pragma omp parallel firstprivate(my_amax)
    {
        #pragma omp for
        for (j=0; j<MATRIX_DIM; j++) {
            my_amax = abs_max(my_amax, A[i][j], B[i][j]);
            #pragma omp critical
            {
                if (fabs(amax)<fabs(my_amax))
                    amax=my_amax;
            }
        }
    }
}
The only difference between the two versions is the position of the parallel region. The first version always gives me a segmentation fault, while the second executes perfectly. Why is that?
I know that #pragma omp parallel spawns the required threads, but since the i for loop is not declared as a worksharing loop, I thought it should not be a problem, i.e. the i part should be executed sequentially while the j iterations, which are actually parallelized, execute in parallel. What exactly happens to the i iterations in the first case?
From what I can see, you forgot to declare i private in the first case. Therefore, i is updated concurrently by the various threads executing the corresponding loop, which leads to out-of-bounds accesses to the arrays A and B.
Just try to add private(i) to your #pragma omp parallel firstprivate(my_amax) directive and see what happens.
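As a minimal sketch, the first version with that fix applied would look like this (only the directive changes, the loop body is the code from the question):
// i is now private, so every thread loops over its own copy of i
#pragma omp parallel firstprivate(my_amax) private(i)
{
    for (i=0; i<MATRIX_DIM; i++) {
        #pragma omp for
        for (j=0; j<MATRIX_DIM; j++) {
            my_amax = abs_max(my_amax, A[i][j], B[i][j]);
            #pragma omp critical
            {
                if (fabs(amax)<fabs(my_amax))
                    amax=my_amax;
            }
        }
    }
}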

OpenMP 4.0 nested task parallelism and GCC 4.9.1

I have code like this:
#pragma omp parallel
{
    #pragma omp single
    {
        int x;
        #pragma omp task depend(inout:x)
        {
            for (int i = 1; i < 16; i++)
            {
                #pragma omp task
                DoComputationOnPartition(i);
            }
            #pragma omp taskwait
        }
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task depend(in:x)
            {
                OperateOnPartition(i);
            }
        }
        #pragma omp task depend(inout:x)
        {
            for (int i = 1; i < 16; i++) x++;
        }
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task depend(in:x)
            {
                OperateOnPartition(i);
            }
        }
        #pragma omp taskwait
    }
}
What I find is that the master thread never gets to execute a task of DoComputationOnPartition nested inside the first task. Can someone explain that? It should work, right? The #pragma omp taskwait is a scheduling point, so any thread of the team should be able to pick up a task. The master thread reaches the final taskwait, and it should be able to pick up a nested task. The tasks run long enough to allow that.
Thanks.
taskwait waits just for the immediate children. Executing tasks other than those children inside the taskwait is possible, but it risks (greatly) increasing the latency of the taskwait if you have a significant amount of work after it. If you want to wait for all children plus grandchildren etc., you can use the #pragma omp taskgroup construct, or, in cases like these, leave out the taskwaits and rely on the (implicit) barrier at the end of the single construct.
As per OpenMP 4.0 specification:
Binding: A taskgroup region binds to the current task region. The binding thread set of the taskgroup region is the current team.
Description: When a thread encounters a taskgroup construct, it starts executing the region. There is an implicit task scheduling point at the end of the taskgroup region. The current task is suspended at the task scheduling point until all child tasks that it generated in the taskgroup region and all of their descendent tasks complete execution.
So, you mean to put the tasks inside a taskgroup. Well, ok. But it also means a scheduling point, the same as a taskwait.
If libgomp has a problem with that, it is a problem specific to that runtime, not to OpenMP as an API. ICC and other runtimes like OmpSs do not show this behaviour problem :-S
taskgroup only makes sense (as I see it) if you want to wait for the whole hierarchy of tasks, but that is not the case here.
I think you mean this, right?:
#pragma omp task depend(inout:x)
{
    #pragma omp taskgroup
    {
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task
            DoComputationOnPartition(i);
        }
    }
}
The original code has the first taskwait nested inside another task, so only a small group of tasks is waited for at that taskwait.
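For completeness, here is a minimal sketch of the other option mentioned above, dropping the taskwait and relying on the implicit barrier at the end of the single construct (assuming the same DoComputationOnPartition as in the question):
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task
            DoComputationOnPartition(i);
        }
        // no taskwait: the implicit barrier at the end of the single
        // construct waits for all outstanding tasks and their
        // descendants before any thread leaves the parallel region
    }
}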

Avoid thread creation overhead in open MP

I am using OpenMP to parallelize a part of the code in HEVC. The basic structure of the code is given below:
void funct()
{
    for(...)
    {
        #pragma omp parallel for private(....)
        for (...)
        {
            // do some parallel work
        }
        // end of inner for loop
        // other tasks
    }
    // end of outer for loop
}
// end of function
Now I have modified the inner for loop so that the code is parallelized and every thread performs its task independently. I am not getting any errors, but the overall processing time with multiple threads is higher than it would have been with a single thread. I guess the main reason is that for every iteration of the outer loop there is thread creation overhead for the inner loop. Is there any way to avoid this issue, or any way to create the threads only once? I cannot parallelize the outer for loop since I have made modifications to the inner for loop to enable each thread to work independently. Please suggest any possible solutions.
You can use separate directives #pragma omp parallel and #pragma omp for.
#pragma omp parallel creates the team of threads, whereas #pragma omp for distributes the work between those threads. For the sequential parts of the outer loop you can use #pragma omp single.
Here is an example:
int n = 3, m = 10;
#pragma omp parallel
{
    for (int i = 0; i < n; i++){
        #pragma omp single
        {
            printf("Outer loop part 1, thread num = %d\n",
                   omp_get_thread_num());
        }
        #pragma omp for
        for(int j = 0; j < m; j++) {
            int thread_num = omp_get_thread_num();
            printf("j = %d, Thread num = %d\n", j, thread_num);
        }
        #pragma omp single
        {
            printf("Outer loop part 2, thread num = %d\n",
                   omp_get_thread_num());
        }
    }
}
But I am not sure whether it will help you or not. To diagnose OpenMP performance issues, it would be better to use a profiler, such as Scalasca or VTune.
