Fortran MPI and OpenMP additional threads - multithreading

I have a code that is already parallelized with MPI. Every MPI process has to do a lot of calculations on a grid. The code looks something like this:
do i=starti,stopi
call function1(input1(i),output1)
call function1(input2(i+1),output2)
call function1(input3(i+2),output3)
call solve(output1,output2,output3,OUTPUT(i))
end do
where each MPI process has a different range of starti and stopi. For example, for 2 processes I have:
process 1 starti=1 , stopi=100
process 2 starti=101 , stopi=200
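For reference, such ranges are typically derived from the rank roughly as in the following sketch (rank, nprocs, chunk and ntotal are illustrative names not in the original post, assuming ntotal divides evenly among the processes):
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
chunk  = ntotal / nprocs      ! e.g. 200 items over 2 processes gives 100 each
starti = rank * chunk + 1     ! rank 0 -> 1,   rank 1 -> 101
stopi  = (rank + 1) * chunk   ! rank 0 -> 100, rank 1 -> 200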
This works very nicely, but I want to use OpenMP to speed things up a little. The modified code looks like this:
do i=starti,stopi
!$OMP PARALLEL SECTIONS
!$OMP SECTION
call function1(input1(i),output1)
!$OMP SECTION
call function1(input2(i+1),output2)
!$OMP SECTION
call function1(input3(i+2),output3)
!$OMP END PARALLEL SECTIONS
call solve(output1,output2,output3,OUTPUT(i))
end do
But when I run this code with 2 MPI processes and 3 OpenMP threads, it is 2x slower than pure MPI. The CPU is not the limitation, as I have a 4-core / 8-thread CPU.
Why is that?
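No answer is attached to this question in this collection, but the later answers point at the likely culprit: the PARALLEL SECTIONS construct creates and tears down a thread team in every loop iteration, and with only three small calls per iteration that overhead can dominate. (Oversubscription may also contribute: 2 processes x 3 threads = 6 threads compete for 4 physical cores, and hyper-threading rarely makes up the difference.) A minimal sketch of the usual fix, hoisting the parallel region outside the loop (the PRIVATE(i) clause and the SINGLE construct are additions needed for correctness, not from the original post):
!$OMP PARALLEL PRIVATE(i)
do i=starti,stopi
!$OMP SECTIONS
!$OMP SECTION
call function1(input1(i),output1)
!$OMP SECTION
call function1(input2(i+1),output2)
!$OMP SECTION
call function1(input3(i+2),output3)
!$OMP END SECTIONS
!$OMP SINGLE
call solve(output1,output2,output3,OUTPUT(i))
!$OMP END SINGLE
end do
!$OMP END PARALLEL
The implicit barriers at END SECTIONS and END SINGLE keep the iterations ordered, while the thread team is created only once.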

Related

About tasks in OpenMP

In my current code I am using two OpenMP sections, with two threads only, as follows:
!$omp parallel NUM_THREADS(2)
!$omp sections
!$omp section
do work1
!$omp section
do work2
!$omp end sections
!$omp end parallel
work1 is more time-consuming than work2, and because the sections construct requires both threads to synchronize at its end, work1 becomes the limiting step and one thread sits idle most of the time.
I want to use a more flexible construct such as task, but I don't know whether the following is possible, and if so, how to do it. First, start with two threads (one task per thread, as in the sections construct), one solving work1 and the other work2. Second, as soon as the easier task work2 is finished, the free thread could be used to speed up work1.
Thanks.
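One common approach, sketched below under the assumption that work1 can be decomposed into independent chunks (work1_chunk, work2 and nchunks are hypothetical names), is to express everything as explicit tasks; the thread that finishes work2 then picks up remaining chunks of work1:
!$omp parallel num_threads(2)
!$omp single
! one task for work2 ...
!$omp task
call work2()
!$omp end task
! ... and several smaller tasks for work1, so the freed thread can help
do k = 1, nchunks
!$omp task firstprivate(k)
call work1_chunk(k)
!$omp end task
end do
!$omp end single
!$omp end parallel
If work1 cannot be split into independent pieces, tasks cannot help: one thread must still execute it from start to finish.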

Idle threads while new threads can be assigned to a nested loop

I have two nested loops:
!$omp parallel
!$omp do
do i=1,4
...
!$omp parallel
!$omp do
do j=1,4
call job(i,j)
end do
!$omp end do
!$omp end parallel
end do
!$omp end do
!$omp end parallel
My computer can run four threads in parallel, and for the outer loop four such threads are created. The first three finish quickly, since for i=4 the job is four times more expensive.
Now I expect that in the inner parallel region new threads share the work. But this doesn't happen: the CPU load stays at 1/4, just as if the 4th thread worked serially on the inner loop.
How can I allocate parallel CPU time to the inner parallel loop?
Did you try the following approach?
!$omp parallel do collapse(2)
do i = 1,4
do j = 1,4
call job(i,j)
end do
end do
It should behave better, since collapse(2) merges the two loops into a single iteration space of 16 calls to job that all four threads can share, instead of work-sharing only the four outer iterations.
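Since the cost is heavily skewed toward i=4, adding a dynamic schedule (an addition for illustration, not part of the original answer) may balance the collapsed iterations even better:
!$omp parallel do collapse(2) schedule(dynamic)
do i = 1,4
do j = 1,4
call job(i,j)
end do
end do
!$omp end parallel do
With schedule(dynamic), each thread grabs the next pending (i,j) pair as soon as it finishes its current one, so the expensive i=4 iterations do not pile up on a single thread.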

Declaring a different nested number of threads for two separate tasks (OpenMP)

I am writing a parallel code that is exploiting some parallelism at an outer level. Essentially there are two separate subroutines (very expensive) that may be executed concurrently. This is a large code, and as such, within each subroutine there are other calls as well as many omp parallel/do regions. So to execute my two subroutines I want to make use of nested parallelism, so that they can both be called in the outer region as such:
!$omp parallel
!$omp single
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
If both of these expensive tasks took an equal amount of time I would not have a problem. But during the simulation the amount of work each one has to do changes at every time step, so setting the nested number of threads through an environment variable such as export OMP_NUM_THREADS=16,8 (16 threads at the outer level of parallelism and 8 in the nested regions inside the two expensive subroutines) does not work well. I already have a scheme to distribute the correct number of threads to each task; I just don't know how to set a different number of threads for the nested level in each respective subroutine.

Of course I could go into each expensive subroutine, and all the subroutines within those, and hardcode the number of threads I would like, but as I mentioned this is a very large code and that is the ugly solution. I would much rather do this in an environment-variable sort of way. There is no information on this subject online. Does anyone out there have a clue how one could do this?
Thanks in advance.
I'm not sure whether I understand correctly what you are trying to achieve, but you can set the default team size for nested parallel regions by simply calling omp_set_num_threads(). If you call it from the serial part of the application, it will set the default team size for top-level parallel regions. If you call it from within a parallel region, it will affect nested parallel regions spawned by the calling thread. And different threads can set different team sizes for their nested regions. So, in a nutshell, you can do something like:
!$omp parallel
!$omp single
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
Parallel regions spawned from the thread executing the first single construct will execute with nn threads. Parallel regions spawned from the thread executing the second single construct will execute with mm threads.
Also, have you considered using explicit OpenMP tasks (!$omp task) instead of single + nowait?
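For reference, the task-based variant would look roughly like this (an illustrative sketch, not from the original answer; nn and mm as above):
!$omp parallel
!$omp single
!$omp task
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end task
!$omp task
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end task
!$omp end single
!$omp end parallel
Since omp_set_num_threads() affects the data environment of the task that calls it, each task's nested parallel regions get their own team size, just as with the two single constructs.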

Use OpenMP section parallel in a non-parallel time-dependent do loop

I have a quick question regarding OpenMP. Usually one can create parallel sections like this (written in Fortran, with two sections):
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A>
!$OMP SECTION
< Fortran code block B>
!$OMP END PARALLEL SECTIONS
Now what I really want is to run Fortran code blocks A and B within a do loop which itself should not be parallelized, because it is a time-dependent loop in which every new step depends on the previous step's results. Before the parallel sections I also need to run a serial code block (let's call it block C). Blocks A, B and C are all functions of the loop variable t. Naively one might propose the following, simply embedding the parallel sections within the do loop:
do t=1,tmax
< Fortran serial code block C>
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A>
!$OMP SECTION
< Fortran code block B>
!$OMP END PARALLEL SECTIONS
end do
However, it seems obvious that the thread-creation overhead in every iteration will greatly slow this down, possibly even making it slower than standard serial code. Therefore, one might look for a smarter way to do this.
I was wondering whether you could give me some hints on how to do this. What is the best (fastest) approach?
I concur with both comments that it is not at all obvious how much the OpenMP overhead would be compared to the computation. If you find it (after performing the corresponding measurements) to be really high, then the typical way to handle this case is to put the loop inside a parallel region:
!$OMP PARALLEL PRIVATE(t)
do t=1,tmax
!$OMP SINGLE
< Fortran code block C >
!$OMP END SINGLE
!$OMP SECTIONS
!$OMP SECTION
< Fortran code block A >
!$OMP SECTION
< Fortran code block B >
!$OMP END SECTIONS
end do
!$OMP END PARALLEL
Each thread will loop independently. The SECTIONS construct has an implicit barrier at its end, so the threads are synchronised before the next loop iteration. If there is some additional code inside the loop after the sections that does not synchronise, an explicit barrier has to be inserted just before end do.
The SINGLE construct is used to isolate block C such that it gets executed by one thread only.
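For example, if an extra block D ran after the sections and was executed by every thread, the explicit barrier would be needed so that no thread starts the next iteration while block D of the current one is still running (block D and the barrier are illustrative additions, not from the original answer):
!$OMP PARALLEL PRIVATE(t)
do t=1,tmax
!$OMP SINGLE
< Fortran code block C >
!$OMP END SINGLE
!$OMP SECTIONS
!$OMP SECTION
< Fortran code block A >
!$OMP SECTION
< Fortran code block B >
!$OMP END SECTIONS
< Fortran code block D, executed by every thread, no synchronisation >
!$OMP BARRIER
end do
!$OMP END PARALLEL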

Multi-threaded linear system solution in OpenBLAS

I have a code using Fortran 95 and the gfortran compiler. I am also using OpenMP and have to handle very big arrays. In my code I also have to solve a system of linear equations using the solver DGTSV from OpenBLAS. I want to parallelize this solver as well using OpenBLAS, which should be capable of that, but I have trouble with the syntax. Using the attached pseudo code, all 4 CPUs are used at almost 100%, but I am not sure whether each core solves the linear equations separately or whether the cores split the work and compute it in parallel.
The whole stuff is compiled using gfortran -fopenmp -lblas a.f95 -o a.out
So my pseudo code looks like
program a
implicit none
integer, parameter :: N = 200
real*8, dimension(N) :: D = 0.0
real*8, dimension(N-1):: DL = 0.0
real*8, dimension(N-1):: DU = 0.0
real*8, dimension(N) :: b = 0.0
integer :: info = 0
integer :: numthread=4
...
!$OMP PARALLEL NUM_THREADS(numthread)
...
!$OMP DO
...
!$OMP END DO
CALL DGTSV(N,1,DL,D,DU,b,N,info)
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
end program a
What do I have to do to parallelize the solver, so that each core computes part of the solution?
Inside an OpenMP parallel region, all the threads execute the same code (as in MPI), and the work is only split when the threads reach a loop/section/task.
In your example, the work inside the loops (OMP DO) is distributed among the available threads. After a loop is done, an implicit barrier synchronizes all the threads, and then every thread executes the call to DGTSV itself, i.e. the same system is solved once per thread rather than being split among them. After the subroutine has returned, the next loop is split again.
@HristoIliev proposed using an OMP SINGLE construct. This restricts the enclosed piece of code to be executed by only one thread and forces all the other threads to wait for it (unless you specify nowait).
On the other hand, nested parallelism refers to the case where you declare a parallel region inside another parallel region. This also applies when you call an OpenMP-parallelized library from inside a parallel region.
By default, OpenMP does not add parallelism in nested parallel regions; instead, only the thread that enters the nested region executes it. This behavior can be changed by setting the environment variable OMP_NESTED to true.
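The same can be done from code (a small sketch; omp_set_nested is the library counterpart of the environment variable):
use omp_lib
call omp_set_nested(.true.)   ! allow nested parallel regions to spawn their own teams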
The OMP SINGLE solution is far better than splitting the parallel region in two, as the thread team is reused for the next loop:
!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP SINGLE
CALL DGTSV(...)
!$OMP END SINGLE
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP END PARALLEL
To illustrate the usage of OMP_NESTED I'll show some results from an application which used FFTW (a Fast Fourier Transform implementation) configured to use OpenMP. The execution was performed on a 16-core, two-socket Intel Xeon E5 @ 2.46 GHz node.
The profiles cover the time spent in the whole application, where parallel regions appear when CPUs > 1, serialized regions when CPUs = 1, and synchronization regions when CPUs = 0.
The application is embarrassingly parallel, so in this particular case using nesting is not worthwhile (FFTW does not scale that well).
In the OMP_NESTED=false execution, the amount of parallelism is limited by the number of threads spent in the external parallel region (ftdock).
In the OMP_NESTED=true execution, it is possible to increase parallelism beyond the number of threads spent in the external parallel region. The maximum possible parallelism in this case is 16: either the 8 external threads each create a single peer to execute the internal parallel region, or 4 external threads each create 3 additional threads (8x2 = 4x4 = 16).
