OpenMP understanding deadlock in critical construct - multithreading

I am trying to understand exactly why a deadlock occurs when, inside a parallel region, one critical construct is nested inside another critical construct.
I have consulted the following resources. In this source, the author writes:
In OpenMP this can happen if inside a critical region a function is called which
contains another critical region. In this case the critical region of the called
function will wait for the first critical region to terminate - which will never
happen.
Alright, but why not? Furthermore, from Hager, Georg, and Gerhard Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press, 2010, p. 149:
When a thread encounters a CRITICAL directive inside a critical region, it will block forever.
Same question, why?
Finally, Chapman, Barbara, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming. Vol. 10. MIT press, 2008 also provide an example using locks, however not with the critical construct.
From my current understanding there are two different possible ways why a deadlock occurs in a nested critical region:
Begin first take:
If two threads arrive at a nested critical construct (one critical region inside another), thread one enters the "outer" critical region and thread two waits. Quoting Chapman et al.
When a thread encounters a critical construct, it waits until no other thread is
executing a critical region with the same name.
Alright, so far so good. Now, thread one DOES NOT enter the nested critical region, because it is a synchronization point where threads wait for all other threads to arrive before proceeding. And since the second thread is waiting for the first thread to exit the "outer" critical region, they are in a deadlock.
End first take.
Begin second take:
Both threads arrive at the "outer" critical construct. Thread one enters the "outer" critical construct, thread two waits. Now, thread one ENTERS the "inner" critical construct and stops at its implied barrier, because it waits for thread two. Thread two, on the other hand, waits for thread one to exit the "outer" critical region, and so both are waiting forever.
End second take.
Here is a small Fortran code that produces the deadlock:
subroutine foo

!$OMP PARALLEL
!$OMP CRITICAL
  print*, 'Hallo i am just a single thread and I like it that way'
!$OMP END CRITICAL
!$OMP END PARALLEL

end subroutine foo

program deadlock
  implicit none
  integer :: i, sum = 0

!$OMP PARALLEL
!$OMP DO
  do i = 1, 100
!$OMP CRITICAL
    sum = sum + i
    call foo()
!$OMP END CRITICAL
  enddo
!$OMP END DO
!$OMP END PARALLEL

  print*, sum
end program deadlock
So my question is: is one of the two suggestions right, or is there another explanation for why a deadlock occurs in this situation?

There is no implied barrier, i.e. no "synchronization point where threads wait for other threads to arrive" associated with CRITICAL constructs. Instead, at the start of a critical construct, threads wait for any thread already inside a critical construct of the same name to leave the construct.
Critical constructs with the same name cannot be nested, because the current OpenMP rules say they cannot (see the restrictions on nesting in OpenMP 4.0, section 2.16). That's really the answer to your question and the end of the discussion - if you break that prohibition then anything can happen.
Practically, this prohibition allows implementations to assume that critical constructs with the same name will not be nested. One common implementation choice is then that a thread encountering a critical construct waits for all threads, including itself, to leave the construct. While it is waiting, a thread cannot be leaving. That results in a deadlock.
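There is a simple way to see this self-deadlock in miniature. The sketch below is a pthreads analogy, not the actual implementation of any particular OpenMP runtime: it models a named critical construct as a single process-wide, non-recursive mutex, and uses trylock to show that a second acquisition by the same thread would block forever. The function name is mine.

```c
#include <errno.h>
#include <pthread.h>

/* Models one common implementation of a named critical construct:
   a single process-wide, non-recursive lock per name. */
static pthread_mutex_t crit = PTHREAD_MUTEX_INITIALIZER;

/* Returns the result of trying to enter the "inner" critical region
   while already holding the "outer" one.  EBUSY means a real (blocking)
   acquisition would wait forever -- the self-deadlock described above. */
int nested_entry_would_block(void) {
    pthread_mutex_lock(&crit);               /* enter the outer region   */
    int rc = pthread_mutex_trylock(&crit);   /* attempt the inner region */
    if (rc == 0)                             /* (a recursive lock would  */
        pthread_mutex_unlock(&crit);         /*  succeed -- undo it)     */
    pthread_mutex_unlock(&crit);             /* leave the outer region   */
    return rc;
}
```

A default POSIX mutex, like a critical construct, does not track ownership for re-entry, which is why the second acquisition reports EBUSY instead of succeeding.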
Critical constructs with different names can be nested. Deadlock is possible in that case if the nesting is not consistent. Consider:
!$OMP PARALLEL
! Code executed by thread one:
!$OMP CRITICAL (A)
!$OMP CRITICAL (B) ! Thread one waiting here.
!...
!$OMP END CRITICAL (B)
!$OMP END CRITICAL (A)
! Code executed by thread two:
!$OMP CRITICAL (B)
!$OMP CRITICAL (A) ! Thread two waiting here.
!...
!$OMP END CRITICAL (A)
!$OMP END CRITICAL (B)
!$OMP END PARALLEL
If this situation occurs the threads will be waiting quite a while.
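The standard cure for this kind of deadlock is a consistent global acquisition order. The C/pthreads sketch below (all names are illustrative) models CRITICAL (A) and CRITICAL (B) as two mutexes and has both threads take them in the same order, A before B, so both threads complete:

```c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* CRITICAL (A) */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* CRITICAL (B) */
static int done = 0;

/* Both threads take A before B -- a consistent global order, the
   standard fix for the inconsistent nesting shown above. */
static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_a);    /* always A first...       */
    pthread_mutex_lock(&lock_b);    /* ...then B               */
    done++;                         /* protected by both locks */
    pthread_mutex_unlock(&lock_b);  /* release inner first     */
    pthread_mutex_unlock(&lock_a);
    return 0;
}

/* Runs two contending threads; with consistent ordering both finish. */
int run_consistent_order(void) {
    pthread_t t1, t2;
    done = 0;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return done;  /* 2: no deadlock */
}
```

If one worker instead took B before A, the two threads could each grab one lock and wait forever for the other, exactly as in the Fortran example above.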

Related

about tasks in OpenMP

In my current code I am using two OpenMP sections, with only two threads, as follows:
!$omp parallel NUM_THREADS(2)
!$omp sections
!$omp section
do work1
!$omp section
do work2
!$omp end sections
!$omp end parallel
work1 is more time consuming than work2, and because the sections construct requires both threads to finish their tasks before synchronizing at its end, work1 becomes the limiting step, and one thread is idle most of the time.
I want to use a more flexible construct such as task, but I don't know if it is possible to do the following, and if so, how to do it. First, start with two threads (one task per thread, as in the sections construct), one solving work1 and the other work2. Second, as soon as the easier task work2 is finished, the free thread could be used to speed up work1.
Thanks.
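One way to picture the task-based scheme being asked about: split work1 into independent chunks and make work2 a single task, so whichever thread finishes first picks up the remaining chunks. This is a hedged C sketch under assumed names (work1_chunk, work2, and the chunk count are all illustrative, not from the original code); compiled without OpenMP the pragmas are ignored and it simply runs serially:

```c
#include <stddef.h>

/* Hypothetical work items: work1 is split into independent chunks so a
   thread that finishes work2 can take over some of them. */
enum { WORK1_CHUNKS = 8 };
static int results[WORK1_CHUNKS + 1];

static void work1_chunk(int c) { results[c] = c; }   /* stand-in work */
static void work2(void)        { results[WORK1_CHUNKS] = 100; }

int run_with_tasks(void) {
    #pragma omp parallel num_threads(2)
    #pragma omp single            /* one thread creates the tasks...     */
    {
        int c;
        #pragma omp task          /* work2 is one task                   */
        work2();
        for (c = 0; c < WORK1_CHUNKS; c++) {
            #pragma omp task firstprivate(c)
            work1_chunk(c);       /* ...work1 becomes many small tasks   */
        }
    }                             /* implicit barrier: all tasks done    */
    int sum = 0, i;
    for (i = 0; i <= WORK1_CHUNKS; i++)
        sum += results[i];
    return sum;   /* 0+1+...+7 + 100 = 128 */
}
```

The key difference from sections is that the scheduler, not the source layout, decides which thread runs which chunk, so neither thread sits idle while tasks remain.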

Idle threads while new threads can be assigned to a nested loop

I have two nested loops:
!$omp parallel
!$omp do
do i=1,4
  ...
  !$omp parallel
  !$omp do
  do j=1,4
    call job(i,j)
  end do
  !$omp end do
  !$omp end parallel
end do
!$omp end do
!$omp end parallel
My computer can run four threads in parallel. For the outer loop such four threads are created. The first three finish quickly since for i=4, the job is four times more expensive.
Now I expect that in the inner parallel region, new threads share the work. But this doesn't happen: The CPU load stays at 1/4, just as if the 4th thread works serially on the inner loop.
How can I allocate parallel CPU time to the inner parallel loop?
Did you try the following approach?
!$omp parallel do collapse(2)
do i = 1,4
do j = 1,4
call job(i,j)
end do
end do
It should behave better.
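Another possibility, if the nested structure must be kept: most runtimes disable nested parallelism by default, which is exactly the behaviour described in the question (the inner parallel region runs serially). The C sketch below, with an illustrative job counter standing in for the real work, enables one extra level of nesting via omp_set_max_active_levels; the #ifdef guard lets it also compile and run serially without OpenMP:

```c
/* Nested parallelism is off by default in most runtimes; enabling a
   second active level lets the inner region get its own team. */
#ifdef _OPENMP
#include <omp.h>
#endif

static int calls;

static void job(int i, int j) {   /* stand-in for the real job(i,j) */
    (void)i; (void)j;
    #pragma omp atomic
    calls++;
}

int run_nested(void) {
    calls = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(2); /* allow one level of nesting */
#endif
    int i;
    #pragma omp parallel for
    for (i = 0; i < 4; i++) {
        int j;
        #pragma omp parallel for  /* inner team, only if nesting enabled */
        for (j = 0; j < 4; j++)
            job(i, j);
    }
    return calls;  /* 16 calls either way */
}
```

Note that even with nesting enabled, oversubscription is possible (4 outer threads each spawning an inner team), which is why collapse(2) is usually the cleaner fix when the loop bounds permit it.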

Declaring a different nested number of threads for two separate tasks (OpenMP)

I am writing a parallel code that is exploiting some parallelism at an outer level. Essentially there are two separate subroutines (very expensive) that may be executed concurrently. This is a large code, and as such, within each subroutine there are other calls as well as many omp parallel/do regions. So to execute my two subroutines I want to make use of nested parallelism, so that they can both be called in the outer region as such:
!$omp parallel
!$omp single
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
If both of these expensive tasks took an equal amount of time I would not have a problem. But during the simulation, at each time step, the amount of work each has to do changes. So using an environment variable to set the nested number of threads, like export OMP_NUM_THREADS=16,8 (16 threads at the first level of parallelism and 8 in the nested regions inside the two expensive subroutines), does not work well. I have a scheme already to distribute the correct number of threads to their respective tasks, I just don't know how to set different numbers of threads for the nested level in the respective subroutines. Of course I could go into each expensive subroutine, and all subroutines within those, and hardcode the number of threads that I would like, but as I mentioned this is a very large code and that would be an ugly solution. I would much rather do this in an environment-variable kind of way. There is no information on this subject online. Does anyone out there have a clue how one could do this?
Thanks in advance.
I'm not sure whether I understand correctly what you are trying to achieve, but you can set the default team size for nested parallel regions by simply calling omp_set_num_threads(). If you call it from the serial part of the application, it will set the default team size for top-level parallel regions. If you call it from within a parallel region, it will affect nested parallel regions spawned by the calling thread. And different threads can set different team sizes for their nested regions. So, in a nutshell, you can do something like:
!$omp parallel
!$omp single
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
Parallel regions spawned from the thread executing the first single construct will execute with nn threads. Parallel regions spawned from the thread executing the second single construct will execute with mm threads.
Also, have you considered using explicit OpenMP tasks (!$omp task) instead of single + nowait?

Use OpenMP section parallel in a non-parallel time-dependent do loop

I have a quick question regarding OpenMP. Usually one can do parallel sections like this (written in Fortran, with two sections):
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A>
!$OMP SECTION
< Fortran code block B>
!$OMP END PARALLEL SECTIONS
Now what I really want is to run Fortran code blocks A and B within a do loop, which itself should not be parallelized, because this do loop is time-dependent and every new step depends on the previous step's results. And before the parallel sections, I need to run a serial code block (let's call it block C). All three blocks A, B, and C are functions of the loop variable t. Then naively one might propose such code by simply embedding the parallel sections within a do loop:
do t=1,tmax
< Fortran serial code block C>
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A>
!$OMP SECTION
< Fortran code block B>
!$OMP END PARALLEL SECTIONS
end do
However, it seems obvious that the thread-creation overhead will slow this down considerably, possibly even making it slower than a standard serial code. Therefore, one might come up with a smarter idea to solve this.
I was wondering whether you can help me on giving some hints on how to do this. What's the best approach (fastest computation) on this?
I concur with both comments that it is not at all obvious how much the OpenMP overhead would be compared to the computation. If you find it (after performing the corresponding measurements) to be really high, then the typical way to handle this case is to put the loop inside a parallel region:
!$OMP PARALLEL PRIVATE(t)
do t=1,tmax
!$OMP SINGLE
< Fortran code block C >
!$OMP END SINGLE
!$OMP SECTIONS
!$OMP SECTION
< Fortran code block A >
!$OMP SECTION
< Fortran code block B >
!$OMP END SECTIONS
end do
!$OMP END PARALLEL
Each thread will loop independently. The SECTIONS construct has an implicit barrier at its end so the threads are synchronised before the next loop iteration. If there is some additional code before the end of the parallel region that does not synchronise, an explicit barrier has to be inserted just before end do.
The SINGLE construct is used to isolate block C such that it gets executed by one thread only.

What happens to a thread when an up is done on its mutex?

Mutexes are used to protect critical sections. Let's say a down has been already done on a mutex, and while the thread that did that is in the CS, 10 other threads are right behind it and do a down on the mutex, putting themselves to sleep. When the first thread exits the critical section and does an up on the mutex, do all 10 threads wake up and just resume what they were about to do, namely, entering the critical section? Wouldn't that mean then that all 10 might end up in the critical section at the same time?
No, only one thread will wake up and take ownership of the mutex. The rest of them will remain asleep. Which thread is the one that wakes up is usually nondeterministic.
The above is a generalisation and the details of implementation will be different in each system. For example, in Java compare Object#notify() and Object#notifyAll().
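This mutual-exclusion guarantee is easy to check empirically. The pthreads sketch below (function and variable names are mine) launches ten contending threads, mirroring the question, and records the maximum number of threads ever observed inside the critical section at once; it is always 1:

```c
#include <pthread.h>

enum { NTHREADS = 10 };
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int inside = 0;      /* threads currently in the critical section */
static int max_inside = 0;  /* worst concurrency ever observed           */

static void *contend(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);       /* sleepers wake up one at a time here */
    inside++;
    if (inside > max_inside)
        max_inside = inside;      /* safe: we hold the mutex */
    inside--;
    pthread_mutex_unlock(&m);
    return 0;
}

/* Launches 10 contending threads, as in the question.  Returns the
   maximum number of threads ever inside the section at once. */
int run_contention(void) {
    pthread_t t[NTHREADS];
    int i;
    for (i = 0; i < NTHREADS; i++) pthread_create(&t[i], 0, contend, 0);
    for (i = 0; i < NTHREADS; i++) pthread_join(t[i], 0);
    return max_inside;   /* 1: mutual exclusion held */
}
```

If unlocking really woke all sleepers into the section simultaneously, max_inside would exceed 1; instead each waiter must re-acquire the mutex before proceeding.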
