Using OpenMP parallel sections inside a non-parallel, time-dependent do loop

I have a quick question regarding OpenMP. Usually one can use parallel sections like this (written in Fortran, with two sections):
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A >
!$OMP SECTION
< Fortran code block B >
!$OMP END PARALLEL SECTIONS
What I really want is to run Fortran code blocks A and B within a do loop that itself must not be parallelized, because it is a time-dependent loop in which every step depends on the previous step's results. Before the parallel sections, I also need to run some serial code (call it block C). Blocks A, B and C are all functions of the loop variable t. Naively, one might simply embed the parallel construct within the do loop:
do t = 1, tmax
   < Fortran serial code block C >
   !$OMP PARALLEL SECTIONS
   !$OMP SECTION
   < Fortran code block A >
   !$OMP SECTION
   < Fortran code block B >
   !$OMP END PARALLEL SECTIONS
end do
However, the overhead of creating and destroying the thread team at every iteration could slow this down considerably, possibly even making it slower than a standard serial code, so one might look for a smarter structure.
I was wondering whether you could give me some hints on how to do this. What is the best approach (fastest computation) here?

I concur with both comments that it is not at all obvious how large the OpenMP overhead is compared to the computation. If you find it (after performing the corresponding measurements) to be really high, then the typical way to handle this case is to put the loop inside the parallel region:
!$OMP PARALLEL PRIVATE(t)
do t = 1, tmax
   !$OMP SINGLE
   < Fortran code block C >
   !$OMP END SINGLE
   !$OMP SECTIONS
   !$OMP SECTION
   < Fortran code block A >
   !$OMP SECTION
   < Fortran code block B >
   !$OMP END SECTIONS
end do
!$OMP END PARALLEL
Each thread will execute the loop independently. The SECTIONS construct has an implicit barrier at its end, so the threads are synchronised before the next loop iteration. If there is additional code between END SECTIONS and end do that does not synchronise, an explicit barrier has to be inserted just before end do.
The SINGLE construct is used to isolate block C such that it gets executed by one thread only.
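For concreteness, here is a minimal, self-contained sketch of this pattern; the assignments standing in for blocks A, B and C are placeholder assumptions, not code from the question:
program persistent_team
   implicit none
   integer :: t, tmax
   real :: a, b, c
   tmax = 1000
   !$OMP PARALLEL PRIVATE(t)
   do t = 1, tmax
      !$OMP SINGLE
      c = real(t)          ! block C: serial setup for this time step
      !$OMP END SINGLE     ! implicit barrier: C finishes before A and B start
      !$OMP SECTIONS
      !$OMP SECTION
      a = c + 1.0          ! block A
      !$OMP SECTION
      b = c + 2.0          ! block B
      !$OMP END SECTIONS   ! implicit barrier: A and B finish before the next step
   end do
   !$OMP END PARALLEL
   print *, a, b
end program persistent_team
The thread team is created only once, so the per-iteration cost is reduced to the two barriers.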

Related

About tasks in OpenMP

In my current code I am using two OpenMP sections, with two threads only, as follows:
!$omp parallel num_threads(2)
!$omp sections
!$omp section
call work1()   ! the longer job
!$omp section
call work2()   ! the shorter job
!$omp end sections
!$omp end parallel
work1 is more time-consuming than work2, and because the SECTIONS construct makes both threads synchronise at its end, work1 becomes the limiting step and one thread sits idle most of the time.
I want to use a more flexible construct such as task, but I don't know whether the following is possible and, if so, how to do it: first, start with two threads (one task per thread, as in the sections construct), one solving work1 and the other work2; then, as soon as the shorter task work2 is finished, use the freed thread to speed up work1.
Thanks.
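No answer is recorded for this one here, but the usual task-based approach is to split work1 into independent chunks, so that the thread that finishes work2 can pick up the remaining chunks. A sketch, where nchunks and work1_chunk are hypothetical names assuming work1 can be decomposed:
!$omp parallel num_threads(2)
!$omp single
! One thread creates all the tasks; both threads execute them.
do ichunk = 1, nchunks
   !$omp task firstprivate(ichunk)
   call work1_chunk(ichunk)   ! one piece of the long job (hypothetical routine)
   !$omp end task
end do
!$omp task
call work2()                  ! the short job
!$omp end task
!$omp end single              ! the implicit barrier waits for all tasks
!$omp end parallel
As soon as work2 completes, its thread starts executing the queued work1 chunks, so neither thread idles.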

Fortran MPI and OpenMP additional threads

I have a code that is already parallelized with MPI. Every MPI process has to do a lot of calculations on the grid. The code looks something like this:
do i = starti, stopi
   call function1(input1(i),   output1)
   call function1(input2(i+1), output2)
   call function1(input3(i+2), output3)
   call solve(output1, output2, output3, OUTPUT(i))
end do
where each MPI process gets a different range of starti and stopi. For example, with 2 processes I have:
process 1 starti=1 , stopi=100
process 2 starti=101 , stopi=200
This works very nicely, but I want to use OpenMP to speed things up a little. The modified code looks like this:
do i = starti, stopi
   !$OMP PARALLEL SECTIONS
   !$OMP SECTION
   call function1(input1(i),   output1)
   !$OMP SECTION
   call function1(input2(i+1), output2)
   !$OMP SECTION
   call function1(input3(i+2), output3)
   !$OMP END PARALLEL SECTIONS
   call solve(output1, output2, output3, OUTPUT(i))
end do
But when I run this code with 2 MPI processes and 3 OpenMP threads, it is 2x slower than the pure MPI version. The CPU is not the bottleneck, as I have a 4-core/8-thread CPU.
Why is that?
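No answer is shown here, but the symptom matches the overhead discussed in the first answer above: a thread team is created and destroyed on every loop iteration, and each section contains only a single, presumably short, call. A sketch of the same fix, hoisting the parallel region out of the loop:
!$OMP PARALLEL PRIVATE(i)
do i = starti, stopi
   !$OMP SECTIONS
   !$OMP SECTION
   call function1(input1(i),   output1)
   !$OMP SECTION
   call function1(input2(i+1), output2)
   !$OMP SECTION
   call function1(input3(i+2), output3)
   !$OMP END SECTIONS   ! implicit barrier: all three outputs are ready
   !$OMP SINGLE
   call solve(output1, output2, output3, OUTPUT(i))
   !$OMP END SINGLE     ! implicit barrier before the next iteration
end do
!$OMP END PARALLEL
If function1 itself is cheap, even this may not beat pure MPI; measure before committing to it.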

Declaring a different nested number of threads for two separate tasks (OpenMP)

I am writing a parallel code that exploits some parallelism at an outer level. Essentially there are two separate, very expensive subroutines that may be executed concurrently. This is a large code, and within each subroutine there are other calls as well as many omp parallel/do regions. So to execute my two subroutines I want to make use of nested parallelism, so that they can both be called in the outer region like this:
!$omp parallel
!$omp single
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
If both of these expensive tasks took an equal amount of time I would not have a problem. But during the simulation, at each time step, the amount of work each has to do changes. So setting the nested number of threads with an environment variable like export OMP_NUM_THREADS=16,8 (16 at the first level of parallelism and 8 in the nested regions inside the two expensive subroutines) does not work well.
I already have a scheme to distribute the correct number of threads to each task; I just don't know how to set a different number of threads for the nested level in the respective subroutines. Of course I could go into each expensive subroutine, and all subroutines within those, and hard-code the number of threads I would like, but as I mentioned this is a very large code and that is an ugly solution. I would much rather do this in an environment-variable sort of way. There is no information on this subject online. Does anyone have a clue how one could do this?
Thanks in advance.
I'm not sure whether I understand correctly what you are trying to achieve, but you can set the default team size for nested parallel regions by simply calling omp_set_num_threads(). If you call it from the serial part of the application, it will set the default team size for top-level parallel regions. If you call it from within a parallel region, it will affect nested parallel regions spawned by the calling thread. And different threads can set different team sizes for their nested regions. So, in a nutshell, you can do something like:
!$omp parallel
!$omp single
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
Parallel regions spawned from the thread executing the first single construct will execute with nn threads. Parallel regions spawned from the thread executing the second single construct will execute with mm threads.
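One caveat worth adding: nested parallelism is disabled by default in most implementations, so it has to be enabled before the outer region is entered, for example:
use omp_lib
! ...
call omp_set_nested(.true.)         ! classic API; or set OMP_NESTED=true in the environment
call omp_set_max_active_levels(2)   ! allow two levels of active parallelism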
Also, have you considered using explicit OpenMP tasks (!$omp task) instead of single + nowait?

Parallel write to / read from file

From briefly looking online, I couldn't find a method for reading from or writing to a file in parallel with OpenMP in Fortran 90. I want to know if it is possible to do something like
!$OMP PARALLEL
do i = 1, Nx
   do j = 1, Ny
      do k = 1, Nz
         read(1,*) B(i,j,k)
      end do
   end do
end do
!$OMP END PARALLEL

!$OMP PARALLEL
do i = 1, Nx
   do j = 1, Ny
      do k = 1, Nz
         write(1,*) B(i,j,k)
      end do
   end do
end do
!$OMP END PARALLEL
Where there are additional clauses to ensure that this is done correctly. I imagine there may be an easy way to do this for the read, but the write seems less obvious. Thanks for any and all help/suggestions!
Since you're doing formatted I/O, think about how thread N would figure out the correct offset in the file to read or write.
Additionally, since the question is tagged gfortran: gfortran has a per-unit mutex that is held for the duration of an I/O statement, so the statements on one unit are serialised anyway.
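Given those constraints, the pragmatic pattern is usually to keep the formatted I/O serial and parallelize only the computation around it. A minimal sketch; the file name, unit handling, and placeholder computation are assumptions:
integer :: u, i, j, k

! Serial read: a formatted sequential file has no precomputable record offsets.
open(newunit=u, file='field.dat', status='old', action='read')
do i = 1, Nx
   do j = 1, Ny
      do k = 1, Nz
         read(u,*) B(i,j,k)
      end do
   end do
end do
close(u)

! Parallelize the compute, not the I/O.
!$OMP PARALLEL DO COLLAPSE(3)
do i = 1, Nx
   do j = 1, Ny
      do k = 1, Nz
         B(i,j,k) = 2.0*B(i,j,k)   ! placeholder computation
      end do
   end do
end do
!$OMP END PARALLEL DO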

OpenMP: understanding deadlock in a critical construct

I am trying to understand exactly why a deadlock occurs when, in a parallel region, a critical construct is nested inside another critical construct.
I have consulted the following resources. In this source the author writes:
In OpenMP this can happen if inside a critical region a function is called which
contains another critical region. In this case the critical region of the called
function will wait for the first critical region to terminate - which will never
happen.
Alright, but why not? Furthermore from: Hager, Georg, and Gerhard Wellein. Introduction to high performance computing for scientists and engineers. CRC Press, 2010, p. 149:
When a thread encounters a CRITICAL directive inside a critical region, it will block forever.
Same question, why?
Finally, Chapman, Barbara, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming. Vol. 10. MIT press, 2008 also provide an example using locks, however not with the critical construct.
From my current understanding there are two possible explanations for why a deadlock occurs in a nested critical region:
Begin first take:
If two threads arrive at a nested critical construct (one critical region inside another), thread one enters the "outer" critical region and thread two waits. Quoting Chapman et al.
When a thread encounters a critical construct, it waits until no other thread is
executing a critical region with the same name.
Alright, so far so good. Now, thread one DOES NOT enter the nested critical region, because it is a synchronization point where threads wait for all other threads to arrive before proceeding. And since the second thread is waiting for the first thread to exit the "outer" critical region, they are in a deadlock.
End first take.
Begin second take:
Both threads arrive at the "outer" critical construct. Thread one enters the "outer" critical construct, thread two waits. Now, thread one ENTERS the "inner" critical construct and stops at its implied barrier, because it waits for thread two. Thread two, on the other hand, waits for thread one to exit the "outer" critical construct, and so both wait forever.
End second take.
Here is a small Fortran code that produces the deadlock:
subroutine foo

!$OMP PARALLEL
!$OMP CRITICAL
   print *, 'Hello, I am just a single thread and I like it that way'
!$OMP END CRITICAL
!$OMP END PARALLEL

end subroutine foo

program deadlock
   implicit none
   integer :: i, sum = 0

!$OMP PARALLEL
!$OMP DO
   do i = 1, 100
      !$OMP CRITICAL
      sum = sum + i
      call foo()   ! foo contains another unnamed CRITICAL: deadlock
      !$OMP END CRITICAL
   end do
!$OMP END DO
!$OMP END PARALLEL

   print *, sum
end program deadlock
So my question is: is one of the two takes right, or is there another explanation for why a deadlock occurs in this situation?
There is no implied barrier, i.e. no "synchronization point where threads wait for other threads to arrive" associated with CRITICAL constructs. Instead, at the start of a critical construct, threads wait for any thread already inside a critical construct of the same name to leave the construct.
Critical constructs with the same name cannot be nested, because the current OpenMP rules say they cannot (see the restrictions on nesting in OpenMP 4.0, section 2.16). That's really the answer to your question and the end of the discussion: if you break that prohibition, anything can happen.
Practically, this prohibition allows implementations to assume that critical constructs with the same name will not be nested. One common implementation choice is that a thread encountering a critical construct waits for all threads, including itself, to leave the construct. A thread that is waiting cannot also be leaving. That results in a deadlock.
Critical constructs with different names can be nested. Deadlock is possible in that case if the nesting is not consistent. Consider:
!$OMP PARALLEL
! Code path one: nests A, then B.
!$OMP CRITICAL (A)
!$OMP CRITICAL (B)   ! Thread one waiting here.
!...
!$OMP END CRITICAL (B)
!$OMP END CRITICAL (A)
! Code path two: nests B, then A.
!$OMP CRITICAL (B)
!$OMP CRITICAL (A)   ! Thread two waiting here.
!...
!$OMP END CRITICAL (A)
!$OMP END CRITICAL (B)
!$OMP END PARALLEL
If this situation occurs the threads will be waiting quite a while.
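The standard remedy is to acquire named critical constructs in the same order on every code path:
!$OMP CRITICAL (A)
!$OMP CRITICAL (B)   ! everyone nests A before B, so no thread can hold B while waiting for A
!...
!$OMP END CRITICAL (B)
!$OMP END CRITICAL (A)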
