In my current code I am using two OpenMP sections, with only two threads, as follows:
!$omp parallel num_threads(2)
!$omp sections
!$omp section
! ... work1 ...
!$omp section
! ... work2 ...
!$omp end sections
!$omp end parallel
work1 is more time consuming than work2, and because the sections construct has an implicit barrier at its end, both threads must finish before execution continues. work1 therefore becomes the limiting step, and one thread sits idle most of the time.
I want to use a more flexible construct such as task, but I don't know whether the following is possible and, if so, how to do it: first, start with two threads (one task per thread, as with the sections construct), one working on work1 and the other on work2; second, as soon as the cheaper task work2 finishes, use the freed thread to speed up work1.
Thanks.
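For reference, a minimal sketch of the task-based shape described above, assuming work1 can be split into nchunks independent pieces (work1_chunk, work2 and nchunks are placeholder names, not from the original code). Both threads pull tasks from the pool, so once the cheap work2 task is done its thread keeps taking work1 chunks instead of idling:
integer :: k

!$omp parallel num_threads(2)
!$omp single
!$omp task
call work2()                      ! the cheap job (placeholder routine name)
!$omp end task
do k = 1, nchunks                 ! work1 split into independent chunks (assumption)
   !$omp task firstprivate(k)
   call work1_chunk(k)            ! placeholder routine: one chunk of work1
   !$omp end task
end do
!$omp end single                  ! implicit barrier: all generated tasks complete here
!$omp end parallel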
Related
I have an array of 4 tasks called array. It is declared as follows:
type(tcb),dimension(4)::array
with:
type:: tcb !> new type task control block
procedure(my_interface),NOPASS,pointer:: f_ptr => null() !< the function pointer
type(variables)::variables !< the variables
integer :: state !< the task state
end type tcb
I have only 2 threads to execute these 4 tasks. I want to avoid using !$OMP TASK.
When I used the tasking construct, I got something like this:
type(tcb),dimension(4),intent(inout)::array !< the array of tasks
integer,intent(in)::ff !< the counter
type(tcb)::self !< self
type(variables),intent(inout)::var !< the variables
!OpenMP variables
integer::num_thread !< the rank of the thread
integer::nthreads !< the number of threads
integer:: OMP_GET_THREAD_NUM !< function to get the rank of the thread
integer::OMP_GET_NUM_THREADS !< function to get the number of threads
!=======================================================================================================================================================
!$OMP PARALLEL PRIVATE(num_thread,nthreads,ff) &
!$OMP SHARED(array)
num_thread=OMP_GET_THREAD_NUM() !< the rank of the thread
nthreads=OMP_GET_NUM_THREADS() !< the number of threads
!$OMP MASTER
do ff=1,3
!$OMP TASK SHARED(array) IF ((num_thread .ne. 0) .and. (num_thread .ne. 1))
call array(ff)%f_ptr(self,var)
!$OMP END TASK
end if
end do
!$OMP TASKWAIT
!$OMP END MASTER
Do you have any idea? I want a thread that finishes running a task to move directly to the next available task; it shouldn't wait for the other thread to finish.
How can I do it without using OpenMP tasking?
I want to schedule the tasks myself, not with the help of OpenMP. Is it possible?
Using OpenMP directives, you have created a serial program. Here I try to explain why. The task creation in your code is inside the MASTER region, in which num_thread is always zero. From the specification:
Thread number: A number that the OpenMP implementation assigns to an
OpenMP thread. For threads within the same team, zero identifies
the master thread and consecutive numbers identify the other threads
of this team.
Therefore the ((num_thread .ne. 0) .and. (num_thread .ne. 1)) expression is always false. Again from the specification:
When an if clause is present on a task construct, and the if
clause expression evaluates to false, an undeferred task is generated,
and the encountering thread must suspend the current task region, for
which execution cannot be resumed until the generated task is
completed.
So it means that you have a serial program: the master thread's execution is suspended until a task is finished. Although it is not required (or specified) by the standard, in practice it means that your program will run on the master thread only; the other threads are just waiting. So you have to remove the if clause and your program will be concurrent.
If you wish to run those tasks on 2 threads only, you have to use the num_threads(2) clause on the parallel region to specify that explicitly.
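Putting those two changes together, a minimal sketch of what the corrected region could look like (ff treated here as an ordinary local loop counter, and self and var used exactly as in the question):
!$OMP PARALLEL NUM_THREADS(2) SHARED(array) PRIVATE(ff)
!$OMP MASTER
do ff = 1, 4                          ! one task per entry of array
   !$OMP TASK SHARED(array) FIRSTPRIVATE(ff)
   call array(ff)%f_ptr(self, var)
   !$OMP END TASK
end do
!$OMP TASKWAIT
!$OMP END MASTER
!$OMP END PARALLEL
The master thread generates the four tasks and both threads take them from the pool, so whichever thread finishes one task immediately picks up the next available one.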
EDIT: To answer your question, using tasks is a good choice, but if the number of tasks is known at compile time you can use sections as well:
!$omp sections
!$omp section
! first job is here
!$omp section
! second job is here
...
!$omp end sections
ps: you mentioned 4 tasks, but your code only generates 3 of them.
ps2: the end if is not needed.
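As for scheduling the tasks yourself without !$OMP TASK: one possible sketch (assuming the four tasks are independent of each other) hands out indices from a shared counter with an atomic capture, so whichever thread becomes free simply takes the next unprocessed entry of array:
integer :: next_task, my_task

next_task = 1
!$OMP PARALLEL NUM_THREADS(2) PRIVATE(my_task) SHARED(array, next_task, var)
do
   ! atomically grab the next task index and advance the counter
   !$OMP ATOMIC CAPTURE
   my_task = next_task
   next_task = next_task + 1
   !$OMP END ATOMIC
   if (my_task > 4) exit             ! no tasks left for this thread
   call array(my_task)%f_ptr(self, var)
end do
!$OMP END PARALLEL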
I have two nested loops:
!$omp parallel
!$omp do
do i=1,4
   ...
   !$omp parallel
   !$omp do
   do j=1,4
      call job(i,j)
   end do
   !$omp end do
   !$omp end parallel
end do
!$omp end do
!$omp end parallel
My computer can run four threads in parallel. For the outer loop, four such threads are created. The first three finish quickly, since for i=4 the job is four times more expensive.
Now I expect that in the inner parallel region new threads are created to share the work. But this doesn't happen: the CPU load stays at 1/4, just as if the fourth thread were working through the inner loop serially.
How can I allocate parallel CPU time to the inner parallel loop?
Did you try the following approach?
!$omp parallel do collapse(2)
do i = 1,4
   do j = 1,4
      call job(i,j)
   end do
end do
!$omp end parallel do
It should behave better: collapse(2) merges the two loops into a single 16-iteration loop that is shared by one flat team of threads, so the work no longer relies on nested parallel regions.
I have a code that is already parallelized with MPI. Every MPI process has to do a lot of calculations on the grid. The code looks something like this:
do i=starti,stopi
   call function1(input1(i),output1)
   call function1(input2(i+1),output2)
   call function1(input3(i+2),output3)
   call solve(output1,output2,output3,OUTPUT(i))
end do
where each MPI process gets a different range for starti and stopi. For example, for 2 processes I have:
process 1 starti=1 , stopi=100
process 2 starti=101 , stopi=200
This works very nicely, but I want to use OpenMP to speed things up a little bit. The modified code looks like this:
do i=starti,stopi
   !$OMP PARALLEL SECTIONS
   !$OMP SECTION
   call function1(input1(i),output1)
   !$OMP SECTION
   call function1(input2(i+1),output2)
   !$OMP SECTION
   call function1(input3(i+2),output3)
   !$OMP END PARALLEL SECTIONS
   call solve(output1,output2,output3,OUTPUT(i))
end do
But when I run this code with 2 MPI processes and 3 OpenMP threads, it is 2x slower than pure MPI. The CPU is not the limiting factor, as I have a 4-core / 8-thread CPU.
Why is that?
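One variant that may be worth comparing against, purely as a sketch and under the assumption that part of the cost comes from creating and destroying the thread team on every loop iteration, is to open the parallel region once outside the loop and keep only the worksharing inside it:
!$OMP PARALLEL PRIVATE(i)              ! the team is created once; i must be private here
do i = starti, stopi
   !$OMP SECTIONS
   !$OMP SECTION
   call function1(input1(i), output1)
   !$OMP SECTION
   call function1(input2(i+1), output2)
   !$OMP SECTION
   call function1(input3(i+2), output3)
   !$OMP END SECTIONS                  ! implicit barrier: output1..3 are ready
   !$OMP SINGLE
   call solve(output1, output2, output3, OUTPUT(i))
   !$OMP END SINGLE                    ! implicit barrier before the next iteration
end do
!$OMP END PARALLEL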
I am writing a parallel code that is exploiting some parallelism at an outer level. Essentially there are two separate subroutines (very expensive) that may be executed concurrently. This is a large code, and as such, within each subroutine there are other calls as well as many omp parallel/do regions. So to execute my two subroutines I want to make use of nested parallelism, so that they can both be called in the outer region as such:
!$omp parallel
!$omp single
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
If both of these expensive tasks took an equal amount of time I would not have a problem. But during the simulation, at each time step, the amount of work each one has to do changes. So setting the nested thread counts through an environment variable such as export OMP_NUM_THREADS=16,8, where I have 16 threads in the first level of parallelism and 8 in the nested regions (inside the two expensive subroutines), does not work well. I already have a scheme to distribute the correct number of threads to each task; I just don't know how to set different numbers of threads for the nested level in the respective subroutines. Of course I could go into each expensive subroutine, and all the subroutines called within those, and hardcode the number of threads that I would like, but as I mentioned this is a very large code and that is an ugly solution. I would much rather do this in an environment-variable sort of way. There is no information on this subject online. Does anyone out there have a clue how one could do this?
Thanks in advance.
I'm not sure whether I understand correctly what you are trying to achieve, but you can set the default team size for nested parallel regions by simply calling omp_set_num_threads(). If you call it from the serial part of the application, it will set the default team size for top-level parallel regions. If you call it from within a parallel region, it will affect nested parallel regions spawned by the calling thread. And different threads can set different team sizes for their nested regions. So, in a nutshell, you can do something like:
!$omp parallel
!$omp single
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
Parallel regions spawned from the thread executing the first single construct will execute with nn threads. Parallel regions spawned from the thread executing the second single construct will execute with mm threads.
Also, have you considered using explicit OpenMP tasks (!$omp task) instead of single + nowait?
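For illustration, a rough sketch of that task-based variant; expensive_task_one and expensive_task_two are placeholder names for the two jobs, and nn and mm are the desired nested team sizes as above:
!$omp parallel
!$omp single
!$omp task
call omp_set_num_threads(nn)       ! nested regions spawned while running this task use nn threads
call expensive_task_one()          ! placeholder for the first expensive job
!$omp end task
!$omp task
call omp_set_num_threads(mm)       ! nested regions spawned while running this task use mm threads
call expensive_task_two()          ! placeholder for the second expensive job
!$omp end task
!$omp end single
!$omp end parallel
Each task can then be picked up by a different thread, and the thread-count setting travels with the task rather than with a particular single construct.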
I am trying to understand exactly why a deadlock occurs when, in a parallel region, a critical construct is nested inside another critical construct.
I have consulted the following resources. In this source the author writes:
In OpenMP this can happen if inside a critical region a function is called which
contains another critical region. In this case the critical region of the called
function will wait for the first critical region to terminate - which will never
happen.
Alright, but why not? Furthermore from: Hager, Georg, and Gerhard Wellein. Introduction to high performance computing for scientists and engineers. CRC Press, 2010, p. 149:
When a thread encounters a CRITICAL directive inside a critical region, it will block forever.
Same question, why?
Finally, Chapman, Barbara, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming. Vol. 10. MIT press, 2008 also provide an example using locks, however not with the critical construct.
From my current understanding there are two possible explanations for why a deadlock occurs in a nested critical region:
Begin first take:
If two threads arrive at a nested critical construct (one critical region inside another), thread one enters the "outer" critical region and thread two waits. Quoting Chapman et al.
When a thread encounters a critical construct, it waits until no other thread is
executing a critical region with the same name.
Alright, so far so good. Now, thread one DOES NOT enter the nested critical region, because it is a synchronization point where threads wait for all other threads to arrive before proceeding. And since the second thread is waiting for the first thread to exit the "outer" critical region, they are in a deadlock.
End first take.
Begin second take:
Both threads arrive at the "outer" critical construct. Thread one enters the "outer" critical construct, thread two waits. Now, thread one ENTERS the "inner" critical construct and stops at its implied barrier, because it waits for thread two. Thread two, on the other hand, waits for thread one to exit the "outer" critical region, and so both are waiting forever.
End second take.
Here is a small Fortran code that produces the deadlock:
subroutine foo

!$OMP PARALLEL
!$OMP CRITICAL
print*, 'Hallo i am just a single thread and I like it that way'
!$OMP END CRITICAL
!$OMP END PARALLEL

end subroutine foo

program deadlock
implicit none
integer :: i, sum = 0

!$OMP PARALLEL
!$OMP DO
do i = 1, 100
   !$OMP CRITICAL
   sum = sum + i
   call foo()
   !$OMP END CRITICAL
enddo
!$OMP END DO
!$OMP END PARALLEL

print*, sum
end program deadlock
So my question is: is one of the two takes right, or is there another reason why a deadlock occurs in this situation?
There is no implied barrier, i.e. no "synchronization point where threads wait for other threads to arrive" associated with CRITICAL constructs. Instead, at the start of a critical construct, threads wait for any thread already inside a critical construct of the same name to leave the construct.
Critical constructs with the same name cannot be nested, because the current OpenMP rules say they cannot (see the restrictions on nesting in OpenMP 4.0, section 2.16). That's really the answer to your question and the end of the discussion: if you break that prohibition then anything can happen.
Practically, this prohibition allows implementations to assume that critical constructs with the same name will not be nested. One common implementation choice is that a thread encountering a critical construct waits for all threads, including itself, to leave the construct. A thread that is waiting cannot be leaving, and that results in a deadlock.
Critical constructs with different names can be nested. Deadlock is possible in that case if the nesting is not consistent. Consider:
!$OMP PARALLEL
!$OMP CRITICAL (A)
!$OMP CRITICAL (B) ! Thread one waiting here.
!...
!$OMP END CRITICAL (B)
!$OMP END CRITICAL (A)
!$OMP CRITICAL (B)
!$OMP CRITICAL (A) ! Thread two waiting here.
!...
!$OMP END CRITICAL (A)
!$OMP END CRITICAL (B)
!$OMP END PARALLEL
If this situation occurs the threads will be waiting quite a while.