Avoid thread creation overhead in OpenMP - multithreading

I am using OpenMP to parallelize part of an HEVC code base. The basic structure of the code is given below:
void funct()
{
    for (...)
    {
        #pragma omp parallel for private(...)
        for (...)
        {
            // do some parallel work
        }
        // end of inner for loop
        // other tasks
    }
    // end of outer for loop
}
// end of function
Now I have modified the inner for loop so that the code is parallelized and every thread performs its task independently. I am not getting any errors, but the overall processing time with multiple threads is higher than it would be with a single thread. I suspect the main reason is that threads are created for the inner loop on every iteration of the outer loop. Is there any way to avoid this overhead, or to create the threads only once? I cannot parallelize the outer for loop, since I modified the inner for loop specifically so that each thread can work independently. Please suggest any possible solutions.

You can use the separate directives #pragma omp parallel and #pragma omp for.
#pragma omp parallel creates the team of threads, whereas #pragma omp for distributes the work between those threads. For the sequential parts of the outer loop you can use #pragma omp single.
Here is an example:
int n = 3, m = 10;
#pragma omp parallel
{
    for (int i = 0; i < n; i++) {
        #pragma omp single
        {
            printf("Outer loop part 1, thread num = %d\n",
                   omp_get_thread_num());
        }
        #pragma omp for
        for (int j = 0; j < m; j++) {
            int thread_num = omp_get_thread_num();
            printf("j = %d, Thread num = %d\n", j, thread_num);
        }
        #pragma omp single
        {
            printf("Outer loop part 2, thread num = %d\n",
                   omp_get_thread_num());
        }
    }
}
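Note that both #pragma omp single and #pragma omp for end with an implicit barrier (unless nowait is specified), so the team stays in step across the iterations of the outer loop, while the threads themselves are created only once, on entry to the #pragma omp parallel region.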
I am not sure whether this will help in your case, though. To diagnose OpenMP performance issues, it is better to use a profiler such as Scalasca or VTune.

Related

How to use multiple threads for multiple file reading/writing operations in C?

I'm trying to analyse the optimum number of threads required to perform a large file reading/writing operation on multiple files.
So how do I proceed with creating multiple threads and assigning each thread some number of files to speed up the execution time?
Language used: C
Look into OpenMP (https://en.wikipedia.org/wiki/OpenMP) for creating threads and managing them efficiently for your analysis.
In your case, you would either use individual sections and assign a workload to each, or use a team of threads that each receive an equal share of the work. Examples below:
#include <stdio.h>
#include <omp.h>

void function_1(void); // where you perform R/W operations
void function_2(void); // where you perform R/W operations or other

int main(int argc, char **argv)
{
    // outside a parallel region, omp_get_max_threads() reports how
    // many threads a parallel region would use
    printf("Number of threads available: %d\n", omp_get_max_threads());

    // split work in sections: each section runs in its own thread
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            function_1();
        }
        #pragma omp section
        {
            function_2();
        }
    }

    // split work between N threads equally
    #pragma omp parallel
    {
        // where you perform R/W operations
    }

    /*
    example for loop: if you have 2 threads, the work is split so that
    one thread runs i from 0 to 49999 while the second runs i from
    50000 to 99999.
    */
    int arr[100000];
    // the loop must immediately follow the combined parallel-for directive
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        arr[i] = 2 * i;
    }
    return 0;
}
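Note that OpenMP has to be enabled when compiling, e.g. gcc -fopenmp file.c with GCC; without the flag the pragmas are ignored and the program runs single-threaded.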

OpenMP C++: Overlapping computations with communications

I need to read a file in chunks. Since I am reading it in chunks, I thought I could overlap the communication with the computation.
I need three active threads, such that while T1 is reading data, T2 can be processing, and T3 can be writing, all at the same time.
So far, I have managed to create three local buffers into which I read the data from the file. I take N elements from the file at a time. Every time I reach N elements, I change thread_id, which means that the next thread should store the next elements in its corresponding local buffer.
The #pragma omp critical sections make sure that while one thread is processing or writing data, no other thread is doing the same. However, while T1 is processing its chunk of data, T2 can still write its output. This ensures that the input is read and written correctly.
What I have so far is this:
int i = 0;
int thread_id = 0;
while (input_file >> stream_element) {
    if (thread_id == 0) {
        input_buffer_1[i] = stream_element;
    }
    else if (thread_id == 1) {
        input_buffer_2[i] = stream_element;
    }
    else {
        input_buffer_3[i] = stream_element;
    }
    i++;
    if (i == N) {
        i = 0;
        if (thread_id == 0) {
            #pragma omp critical (processing_data)
            process(input_buffer_1, output_buffer_1, N);
            #pragma omp critical (writing_data)
            write(output_file, output_buffer_1, N);
        }
        else if (thread_id == 1) {
            #pragma omp critical (processing_data)
            process(input_buffer_2, output_buffer_2, N);
            #pragma omp critical (writing_data)
            write(output_file, output_buffer_2, N);
        }
        else {
            #pragma omp critical (processing_data)
            process(input_buffer_3, output_buffer_3, N);
            #pragma omp critical (writing_data)
            write(output_file, output_buffer_3, N);
        }
        thread_id++;
        if (thread_id == STREAM_THREADS)
            thread_id = 0;
    }
} // end of while loop
What would be the best way to initiate three threads and do these tasks in parallel and synchronized manner?
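One possible shape for such a pipeline (a minimal sketch, not code from the original thread: the read_chunk/write_chunk helpers and the done/file_token dependence tokens are assumptions) is a task graph built with depend clauses, so that processing chunk c overlaps with reading chunk c+1 and writing chunk c-1:
#include <fstream>
#include <vector>

// Hypothetical helpers standing in for the question's three stages.
void read_chunk(std::ifstream& f, std::vector<int>& buf, int N);
void process(std::vector<int>& in, std::vector<int>& out, int N);
void write_chunk(std::ofstream& f, std::vector<int>& buf, int N);

void pipeline(std::ifstream& input_file, std::ofstream& output_file,
              int num_chunks, int N)
{
    // One buffer pair per chunk, so in-flight chunks never share storage.
    std::vector<std::vector<int>> in(num_chunks, std::vector<int>(N));
    std::vector<std::vector<int>> out(num_chunks, std::vector<int>(N));
    char* done = new char[num_chunks](); // dummy dependence tokens
    char file_token = 0;                 // serializes the write stage

    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < num_chunks; c++) {
        // Stage 1: reading the stream is inherently sequential, so it
        // stays on the single thread that creates the tasks.
        read_chunk(input_file, in[c], N);

        // Stage 2: processing chunk c may run while chunk c+1 is being
        // read and chunk c-1 is being written.
        #pragma omp task depend(out: done[c])
        process(in[c], out[c], N);

        // Stage 3: waits for chunk c to be processed; chaining every
        // write through file_token keeps the output in chunk order.
        #pragma omp task depend(in: done[c]) depend(inout: file_token)
        write_chunk(output_file, out[c], N);
    }
    // implicit barrier of the parallel region: all tasks finish here
    delete[] done;
}
Compared to the critical sections above, the depend clauses express exactly the ordering that is needed (process before write, writes in chunk order) and leave everything else free to overlap.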

OpenMP Multithreads becoming one thread

I am programming with OpenMP to learn about multithreading. Is it possible for any thread (any of the 11 in this case) to reach the return statement at the end while some threads are still working in the for loop? Or do they become one master thread again after line 13?
int np, iam;
#pragma omp parallel private(np, iam) num_threads(11)
{
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    #pragma omp for
    for (int i = 2; i < 100; i++) {
        std::cout << i;
        doStuff(i);
    }
} // line 13
// synchronize necessary?
return 0;
There is an implicit barrier at the end of the parallel construct, so no synchronization is necessary. Any further code is executed only by the master thread (the one that had thread_num == 0 within the parallel region), and only after all threads have reached the end of the parallel region.
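A minimal sketch of that behavior (hypothetical code, not from the original question):
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(11)
    {
        #pragma omp for
        for (int i = 2; i < 100; i++) {
            std::printf("%d ", i); // executed by the 11 threads cooperatively
        }
    } // implicit barrier: no thread continues past this point early

    // Runs exactly once, on the master thread, only after every thread
    // has reached the end of the parallel region above.
    std::printf("\ndone, thread num = %d\n", omp_get_thread_num()); // prints 0
    return 0;
}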

Why does this OpenMP code give a segmentation fault?

I am trying to execute an OpenMP code and have been successful in doing so. However, I have a doubt regarding the #pragma omp parallel directive.
Consider these two code snippets:
#pragma omp parallel firstprivate(my_amax)
{
    for (i = 0; i < MATRIX_DIM; i++) {
        #pragma omp for
        for (j = 0; j < MATRIX_DIM; j++) {
            my_amax = abs_max(my_amax, A[i][j], B[i][j]);
            #pragma omp critical
            {
                if (fabs(amax) < fabs(my_amax))
                    amax = my_amax;
            }
        }
    }
}
And
for (i = 0; i < MATRIX_DIM; i++) {
    #pragma omp parallel firstprivate(my_amax)
    {
        #pragma omp for
        for (j = 0; j < MATRIX_DIM; j++) {
            my_amax = abs_max(my_amax, A[i][j], B[i][j]);
            #pragma omp critical
            {
                if (fabs(amax) < fabs(my_amax))
                    amax = my_amax;
            }
        }
    }
}
The only difference between the two is the position of the parallel directive. The first version always gives me a segmentation fault, while the second executes perfectly. Why is that?
I know that #pragma omp parallel spawns the required threads, but since the enclosing i loop is not declared parallel, I thought it should not be a problem: the i part should execute sequentially while the j iterations, which are actually parallelized, run in parallel. What exactly happens to the i iterations in the first case?
From what I can see, you forgot to declare i private in the first case. Therefore, i is updated quite randomly by the various threads executing the corresponding loop, leading to out-of-bounds accesses to arrays A and B.
Just try adding private(i) to your #pragma omp parallel firstprivate(my_amax) directive and see what happens.
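Concretely, the directive in the first snippet would become:
#pragma omp parallel firstprivate(my_amax) private(i)
(j does not need the same treatment: #pragma omp for automatically privatizes the loop variable of the loop it is attached to, which is why only i causes the problem.)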

OpenMP 4.0 nested task parallelism and GCC 4.9.1

I have code like this:
#pragma omp parallel
{
    #pragma omp single
    {
        int x;
        #pragma omp task depend(inout: x)
        {
            for (int i = 1; i < 16; i++)
            {
                #pragma omp task
                DoComputationOnPartition(i);
            }
            #pragma omp taskwait
        }
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task depend(in: x)
            {
                OperateOnPartition(i);
            }
        }
        #pragma omp task depend(inout: x)
        {
            for (int i = 1; i < 16; i++) x++;
        }
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task depend(in: x)
            {
                OperateOnPartition(i);
            }
        }
        #pragma omp taskwait
    }
}
What I find is that the master thread never gets to execute a task of DoComputationOnPartition nested inside the first task. Can someone explain that? It should work, right? #pragma omp taskwait is a task scheduling point, so any thread of the team should be able to pick up a task. The master thread reaches the final taskwait, and it should be able to pick up a nested task; the tasks run for long enough to allow that.
Thanks.
taskwait waits only for its immediate children. Scheduling tasks other than those children at the taskwait is possible, but risks (greatly) increasing the latency of the taskwait if there is a significant amount of work after it. If you want to wait for all children, grandchildren, and so on, you can use the #pragma omp taskgroup construct, or in cases like this simply leave out the taskwaits and rely on the (implicit) barrier at the end of the single construct.
As per the OpenMP 4.0 specification:

Binding: A taskgroup region binds to the current task region. The binding thread set of the taskgroup region is the current team.

Description: When a thread encounters a taskgroup construct, it starts executing the region. There is an implicit task scheduling point at the end of the taskgroup region. The current task is suspended at the task scheduling point until all child tasks that it generated in the taskgroup region and all of their descendent tasks complete execution.
So you mean to put the tasks inside a taskgroup. Well, OK. But that also introduces a scheduling point, the same as a taskwait.
If libgomp has a problem with that, it is a problem specific to that runtime, not to OpenMP as an API. ICC and other runtimes such as OmpSs do not show this behaviour problem :-S
taskgroup only makes sense (as I see it) if you want to wait for the whole hierarchy of tasks, but that is not the case here.
I think you mean this, right?:
#pragma omp task depend(inout: x)
{
    #pragma omp taskgroup
    {
        for (int i = 1; i < 16; i++)
        {
            #pragma omp task
            DoComputationOnPartition(i);
        }
    }
}
The original code has the first taskwait nested inside another context, so only a small group of tasks waits at that taskwait.
