How can I do this without using OMP TASK? - multithreading

I have an array of 4 tasks called array. It is declared as follows:
type(tcb),dimension(4)::array
with:
type:: tcb !> new type task control block
procedure(my_interface),NOPASS,pointer:: f_ptr => null() !< the function pointer
type(variables)::variables !< the variables
integer :: state !< the task state
end type tcb
I have only 2 threads to execute these 4 tasks. I want to avoid using !$OMP TASK.
When I used the tasking construct, I got something like this:
type(tcb),dimension(4),intent(inout)::array !< the array of tasks
integer,intent(in)::ff !< the counter
type(tcb)::self !< self
type(variables),intent(inout)::var !< the variables
!OpenMP variables
integer::num_thread !< the rank of the thread
integer::nthreads !< the number of threads
integer:: OMP_GET_THREAD_NUM !< function to get the rank of the thread
integer::OMP_GET_NUM_THREADS !< function to get the number of threads
!=======================================================================================================================================================
!$OMP PARALLEL PRIVATE(num_thread,nthreads,ff) &
!$OMP SHARED(array)
num_thread=OMP_GET_THREAD_NUM() !< the rank of the thread
nthreads=OMP_GET_NUM_THREADS() !< the number of threads
!$OMP MASTER
do ff=1,3
!$OMP TASK SHARED(array) IF ((num_thread .ne. 0) .and. (num_thread .ne. 1))
call array(ff)%f_ptr(self,var)
!$OMP END TASK
end if
end do
!$OMP TASKWAIT
!$OMP END MASTER
Do you have any idea? I want a thread that finishes running a task to move directly to the next available task, without waiting for the other thread to finish.
How can I do it without using OpenMP tasking?
I want to schedule the tasks myself, not with the help of OpenMP. Is that possible?

Using OpenMP directives you have created a serial program; here I try to explain why. The task creation in your code happens inside the MASTER region, in which num_thread is always zero. From the specification:
Thread number: A number that the OpenMP implementation assigns to an
OpenMP thread. For threads within the same team, zero identifies
the master thread and consecutive numbers identify the other threads
of this team.
Therefore the ((num_thread .ne. 0) .and. (num_thread .ne. 1)) expression is always false. Again, from the specification:
When an if clause is present on a task construct, and the if
clause expression evaluates to false, an undeferred task is generated,
and the encountering thread must suspend the current task region, for
which execution cannot be resumed until the generated task is
completed.
So it means that you have a serial program. The master thread's execution is suspended until the task is finished. Although it is not required (or specified) by the standard, in practice this means that your program will run on the master thread only, while the other threads just wait. So you have to remove the if clause and your program will be concurrent.
If you wish to run those tasks on 2 threads only, you have to use the num_threads(2) clause on the parallel region to explicitly specify it.
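For illustration, a minimal sketch of what the corrected version could look like (assuming all 4 tasks should be generated; self and var are taken from your snippet):
!$OMP PARALLEL NUM_THREADS(2) SHARED(array) PRIVATE(ff)
!$OMP MASTER
do ff=1,4
!$OMP TASK SHARED(array) FIRSTPRIVATE(ff)
call array(ff)%f_ptr(self,var) !< each task executes one entry of the task array
!$OMP END TASK
end do
!$OMP TASKWAIT
!$OMP END MASTER
!$OMP END PARALLEL
Both threads can pick up the generated tasks, and a thread that finishes one task immediately takes the next one from the pool.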
EDIT: To answer your question, using tasks is a good choice, but if the number of tasks is known at compile time you can use sections as well:
!$omp sections
!$omp section
! first job is here
!$omp section
! second job is here
...
!$omp end sections
ps: you mentioned 4 tasks, but your code only generates 3 of them.
ps2: the end if is not needed.
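If you really want to schedule the tasks yourself without OMP TASK, a minimal sketch is a shared counter from which each thread grabs the next unprocessed index (next and my_task are made-up names for illustration; self and var are from your snippet):
integer::next !< index of the next unassigned task (shared)
integer::my_task !< index grabbed by the current thread (private)
next=1
!$OMP PARALLEL NUM_THREADS(2) SHARED(array,next) PRIVATE(my_task)
do
!$OMP ATOMIC CAPTURE
my_task=next
next=next+1
!$OMP END ATOMIC
if (my_task > 4) exit !< no tasks left for this thread
call array(my_task)%f_ptr(self,var)
end do
!$OMP END PARALLEL
A thread that finishes its current task immediately grabs the next available index, so neither thread waits for the other.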

Related

about tasks in OpenMP

in my current code I am using two OpenMP sections, with two threads only
as follows:
!$omp parallel NUM_THREADS(2)
!$omp sections
!$omp section
do work1
!$omp section
do work2
!$omp end sections
!$omp end parallel
work1 is more time consuming than work2, and because the sections construct requires both threads to finish their tasks before synchronizing, work1 becomes the limiting step and one thread is wasted most of the time.
I want to use a more flexible construct such as task, but I don't know if it is possible to do the following and, if so, how to do it. First, start with two threads (1 task per thread, as in the sections construct), one solving work1 and the other work2. Second, as soon as the easier task work2 is finished, the free thread could be used to speed up work1.
Thanks.
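One way this could be expressed with tasks, assuming work1 can itself be split into independent chunks (nchunks, work1_chunk and work2 are made-up names for illustration):
integer::k
!$omp parallel num_threads(2)
!$omp single
!$omp task
call work2() ! the short job is a single task
!$omp end task
do k=1,nchunks ! the long job is split into several tasks
!$omp task firstprivate(k)
call work1_chunk(k)
!$omp end task
end do
!$omp end single
!$omp end parallel
The thread that finishes work2 returns to the task pool and starts picking up the remaining chunks of work1 instead of idling.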

Idle threads while new threads can be assigned to a nested loop

I have two nested loops:
!$omp parallel
!$omp do
do i=1,4
...
!$omp parallel
!$omp do
do j=1,4
call job(i,j)
end do
!$omp end do
!$omp end parallel
end do
!$omp end do
!$omp end parallel
My computer can run four threads in parallel. For the outer loop, four such threads are created. The first three finish quickly, since for i=4 the job is four times more expensive.
Now I expect that in the inner parallel region, new threads share the work. But this doesn't happen: the CPU load stays at 1/4, just as if the 4th thread worked serially on the inner loop.
How can I allocate parallel CPU time to the inner parallel loop?
Did you try the following approach?
!$omp parallel do collapse(2)
do i = 1,4
do j = 1,4
call job(i,j)
end do
end do
It should behave better: all 16 (i,j) iterations then form a single work-sharing loop instead of relying on nested parallel regions, and with uneven iteration costs a schedule(dynamic) clause on the same directive also helps spread the expensive i=4 iterations across threads.
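Note also that nested parallelism is disabled by default, so the inner parallel region in the question runs with a team of one thread unless nesting is enabled explicitly, for example:
use omp_lib
call omp_set_max_active_levels(2) ! allow two levels of active parallelism (OpenMP 3.0+)
call omp_set_nested(.true.)       ! older interface, deprecated since OpenMP 5.0 but still widely honoured
The same can be done with the OMP_MAX_ACTIVE_LEVELS or OMP_NESTED environment variables.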

Declaring a different nested number of threads for two separate tasks (OpenMP)

I am writing a parallel code that is exploiting some parallelism at an outer level. Essentially there are two separate subroutines (very expensive) that may be executed concurrently. This is a large code, and as such, within each subroutine there are other calls as well as many omp parallel/do regions. So to execute my two subroutines I want to make use of nested parallelism, so that they can both be called in the outer region as such:
!$omp parallel
!$omp single
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
If both of these expensive tasks took an equal amount of time I would not have a problem. But during the simulation, at each time step, the amount of work each has to do changes. So setting the nested number of threads through an environment variable such as export OMP_NUM_THREADS=16,8, where I have 16 threads at the first level of parallelism and 8 in the nested regions (inside the two expensive subroutines), does not work well.
I already have a scheme to distribute the correct number of threads to each task; I just don't know how to set a different number of threads for the nested level in each of the two subroutines. Of course I could go into each expensive subroutine, and all subroutines within those, and hardcode the number of threads that I would like, but as I mentioned this is a very large code and that is the ugly solution. I would much rather do this in an environment-variable kind of way. There is no information on this subject online. Does anyone out there have a clue how one could do this?
Thanks in advance.
I'm not sure whether I understand correctly what you are trying to achieve, but you can set the default team size for nested parallel regions by simply calling omp_set_num_threads(). If you call it from the serial part of the application, it will set the default team size for top-level parallel regions. If you call it from within a parallel region, it will affect nested parallel regions spawned by the calling thread. And different threads can set different team sizes for their nested regions. So, in a nutshell, you can do something like:
!$omp parallel
!$omp single
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
Parallel regions spawned from the thread executing the first single construct will execute with nn threads. Parallel regions spawned from the thread executing the second single construct will execute with mm threads.
Also, have you considered using explicit OpenMP tasks (!$omp task) instead of single + nowait?
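For example, a sketch of that task-based variant (with the same placeholder team sizes nn and mm) might look like:
!$omp parallel
!$omp single
!$omp task
call omp_set_num_threads(nn) ! nthreads-var is a per-task ICV, so this only affects regions spawned from this task
! Do the first expensive task (contains more omp parallel regions)
!$omp end task
!$omp task
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end task
!$omp end single
!$omp end parallel
The single block only generates the two tasks; the threads of the team can then pick them up and execute them concurrently, each task carrying its own nested team size.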

OpenMP: understanding deadlock in the critical construct

I am trying to understand exactly why a deadlock occurs when in a parallel region a critical construct is nested in a critical construct.
I have consulted the following resources. In this source the author writes:
In OpenMP this can happen if inside a critical region a function is called which
contains another critical region. In this case the critical region of the called
function will wait for the first critical region to terminate - which will never
happen.
Alright, but why not? Furthermore from: Hager, Georg, and Gerhard Wellein. Introduction to high performance computing for scientists and engineers. CRC Press, 2010, p. 149:
When a thread encounters a CRITICAL directive inside a critical region, it will block forever.
Same question, why?
Finally, Chapman, Barbara, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming. Vol. 10. MIT press, 2008 also provide an example using locks, however not with the critical construct.
From my current understanding, there are two possible explanations for why a deadlock occurs in a nested critical region:
Begin first take:
If two threads arrive at a nested critical construct (one critical region inside another), thread one enters the "outer" critical region and thread two waits. Quoting Chapman et al.
When a thread encounters a critical construct, it waits until no other thread is
executing a critical region with the same name.
Alright, so far so good. Now, thread one DOES NOT enter the nested critical region, because it is a synchronization point where threads wait for all other threads to arrive before proceeding. And since the second thread is waiting for the first thread to exit the "outer" critical region, they are in a deadlock.
End first take.
Begin second take:
Both threads arrive at the "outer" critical construct. Thread one enters the "outer" critical construct, thread two waits. Now, thread one ENTERS the "inner" critical construct and stops at its implied barrier, because it waits for thread two. Thread two, on the other hand, waits for thread one to exit the "outer" critical region, and so both wait forever.
End second take.
Here is a small Fortran code that produces the deadlock:
subroutine foo

!$OMP PARALLEL
!$OMP CRITICAL
print*, 'Hallo i am just a single thread and I like it that way'
!$OMP END CRITICAL
!$OMP END PARALLEL

end subroutine foo

program deadlock
implicit none
integer :: i,sum = 0

!$OMP PARALLEL
!$OMP DO
do i = 1, 100
!$OMP CRITICAL
sum = sum + i
call foo()
!$OMP END CRITICAL
enddo
!$OMP END DO
!$OMP END PARALLEL

print*, sum
end program deadlock
So my question is: is one of the two suggestions right, or is there another possible reason why a deadlock occurs in this situation?
There is no implied barrier, i.e. no "synchronization point where threads wait for other threads to arrive" associated with CRITICAL constructs. Instead, at the start of a critical construct, threads wait for any thread already inside a critical construct of the same name to leave the construct.
Critical constructs with the same name cannot be nested, because the current OpenMP rules say they cannot (see the restrictions on nesting in OpenMP 4.0, section 2.16). That's really the answer to your question and the end of the discussion - if you break that prohibition then anything can happen.
Practically, this prohibition allows implementations to assume that critical constructs with the same name will not be nested. One common implementation choice is then that a thread encountering a critical construct will wait for all threads including itself to leave the construct. If it is waiting a thread cannot be leaving. That results in a deadlock.
Critical constructs with different names can be nested. Deadlock is possible in that case if the nesting is not consistent. Consider:
!$OMP PARALLEL
!$OMP CRITICAL (A)
!$OMP CRITICAL (B) ! Thread one waiting here.
!...
!$OMP END CRITICAL (B)
!$OMP END CRITICAL (A)
!$OMP CRITICAL (B)
!$OMP CRITICAL (A) ! Thread two waiting here.
!...
!$OMP END CRITICAL (A)
!$OMP END CRITICAL (B)
!$OMP END PARALLEL
If this situation occurs the threads will be waiting quite a while.

Use OpenMP section parallel in a non-parallel time-dependent do loop

I have a quick question regarding OpenMP. Usually one writes a parallel sections construct like this (in Fortran, with two sections):
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A>
!$OMP SECTION
< Fortran code block B>
!$OMP END PARALLEL SECTIONS
Now what I really want is to run Fortran code blocks A and B within a do loop which itself should not be parallelized, because this do loop is time-dependent: every new step depends on the previous step's results. And before the parallel sections, I need to run a serial code (let's call it block C). Blocks A, B and C are all functions of the do-loop variable t. Naively, one might propose such code by simply embedding the parallel sections within the do loop:
do t=1,tmax
< Fortran serial code block C>
!$OMP PARALLEL SECTIONS
!$OMP SECTION
< Fortran code block A>
!$OMP SECTION
< Fortran code block B>
!$OMP END PARALLEL SECTIONS
end do
However, the overhead of repeatedly creating the threads may slow this down considerably, possibly even making it slower than a standard serial code. Therefore, one might come up with a smarter idea to solve this.
I was wondering whether you could give me some hints on how to do this. What is the best (fastest) approach?
I concur with both comments that it is not at all obvious how much the OpenMP overhead would be compared to the computation. If you find it (after performing the corresponding measurements) to be really high, then the typical way to handle this case is to put the loop inside a parallel region:
!$OMP PARALLEL PRIVATE(t)
do t=1,tmax
!$OMP SINGLE
< Fortran code block C >
!$OMP END SINGLE
!$OMP SECTIONS
!$OMP SECTION
< Fortran code block A >
!$OMP SECTION
< Fortran code block B >
!$OMP END SECTIONS
end do
!$OMP END PARALLEL
Each thread will loop independently. The SECTIONS construct has an implicit barrier at its end, so the threads are synchronised before the next loop iteration. If some additional code that does not synchronise is added at the end of the loop body, an explicit barrier has to be inserted just before the end do.
The SINGLE construct is used to isolate block C such that it gets executed by one thread only.
