I have found an issue in my hybrid MPI/OpenMP code that is reproduced in its simplest form in the code cited below. I am using 2 threads per MPI rank. These two threads are then used in an OpenMP "sections" construct to do several computations; one of these consists of making an "mpi_allreduce" call on two different vectors A and B, whose results are stored in W and WW. The problem is that every time I run the program I end up with a different output. My suspicion is that the MPI calls are overlapping and the reduced arrays W and WW get mixed up even though they have different names, but I am not sure. Any comment on how to overcome this issue is welcome.
Details:
The MPI thread level is initialized to MPI_THREAD_MULTIPLE in the code, but I have also tried serial and funneled (with the same issue).
I compile the code with mpiifort -openmp allreduce_omp_mpi.f90 and run it with:
export OMP_NUM_THREADS=2
mpirun -np 3 ./a.out
PROGRAM HELLO
  use mpi
  use omp_lib
  IMPLICIT NONE
  INTEGER :: nthreads, tid
  INTEGER :: provided, mpi_err, myid, nproc
  CHARACTER(MPI_MAX_PROCESSOR_NAME) :: hostname
  INTEGER :: nhostchars
  INTEGER :: i
  real*8 :: A(1000), B(1000), W(1000), WW(1000)

  provided = 0

  ! Initialize MPI context
  call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, mpi_err)
  CALL mpi_comm_rank(mpi_comm_world, myid, mpi_err)
  CALL mpi_comm_size(mpi_comm_world, nproc, mpi_err)
  CALL mpi_get_processor_name(hostname, nhostchars, mpi_err)

  ! Initialize arrays
  A = 1.0
  B = 2.0

  ! Check if MPI_THREAD_MULTIPLE is available
  if (provided >= MPI_THREAD_MULTIPLE) then
     write(6,*) ' mpi_thread_multiple provided', myid
  else
     write(6,*) ' not mpi_thread_multiple provided', myid
  endif

  !$OMP PARALLEL PRIVATE(nthreads, tid) NUM_THREADS(2)
  !$omp sections
  !$omp section
  call mpi_allreduce(A, W, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp section
  call mpi_allreduce(B, WW, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp end sections
  !$OMP END PARALLEL

  write(6,*) 'W', (W(i), i=1,10)
  write(6,*) 'WW', (WW(i), i=1,10)

  CALL mpi_finalize(mpi_err)
END
The MPI standard forbids concurrent execution of (blocking) collective operations over the same communicator (Section 5.13 "Correctness [of collective communication]"):
...
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
The key point here is: same communicator. Nothing prevents you from starting concurrent collective communications over different communicators:
integer, dimension(2) :: comms
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)
!$omp parallel sections num_threads(2)
!$omp section
call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
!$omp section
call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
!$omp end parallel sections
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)
This program simply duplicates MPI_COMM_WORLD twice. The first copy is used in the first parallel section, the second copy is used in the second one. Although the two new communicators are copies of MPI_COMM_WORLD, they are separate contexts and thus concurrent operations over them are possible.
MPI_COMM_DUP is an expensive operation, therefore the newly created communicators should be used for as long as possible before being freed.
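For instance, if these reductions sit inside a time-stepping loop, one way to amortise that cost is to duplicate the communicators once before the loop and free them only after it. A minimal sketch of that pattern (assuming the same A, B, W, WW as above; n_steps is a made-up placeholder):

integer, dimension(2) :: comms
integer :: ierr, step
integer, parameter :: n_steps = 100   ! placeholder value

! Duplicate once, up front.
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

do step = 1, n_steps
   !$omp parallel sections num_threads(2)
   !$omp section
   call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
   !$omp section
   call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
   !$omp end parallel sections
   ! ... rest of the time step ...
end do

! Free only once the communicators are no longer needed.
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)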
I'm trying to add multi-threading using OpenMP for a work project. I'm using Fortran 77 in Visual Studio 2017, and while trying to debug I have found a behavior that I don't totally understand.
I'm new to OpenMP, so this may be a misunderstanding about the mechanics of multi-threading, but my understanding is that if I have a write statement in a do loop, every thread should print out that write statement. I've recreated my issue in a small project for simplicity.
The subroutine that has this issue is supposed to print which thread is operating and then the value of a variable (which is 1) before calling another simple subroutine:
SUBROUTINE TAKEVAR(VAR)
  USE OMP_LIB
  INTEGER :: VAR
  VAR = VAR + 1
  !$OMP PARALLEL NUM_THREADS(4)
  !$OMP DO
  DO i = 1, 5
     WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
     WRITE(*, *) VAR
     CALL ROUTINE2(VAR)
  ENDDO
  !$OMP END DO
  !$OMP END PARALLEL
END SUBROUTINE
However, the output of this subroutine is as follows:
Output (can't embed images yet)
I see that the loop is executed 5 times using 4 threads, which is correct, but each statement showing the thread number is not followed by the value of the variable. Is there a way to fix this?
Since each thread executes at the same time as the others, nothing can guarantee that no WRITE from another thread will occur between the first and second WRITE of a given thread.
In a comment @Laci gave a possible solution:
!$OMP CRITICAL
WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
WRITE(*, *) VAR
!$OMP END CRITICAL
The code inside a critical section can be executed by only one thread at a time, so nothing can be executed between the two WRITEs. However, critical sections can hurt performance...
Another solution is to use a single WRITE statement:
WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM(), VAR
That said, I have had bad experiences (crashes at run time) with WRITEs in parallel regions, so I tend to always put them in critical sections.
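Putting the critical-section variant back into the original subroutine, it could look roughly like this (a sketch; ROUTINE2 and the callers are assumed unchanged):

SUBROUTINE TAKEVAR(VAR)
  USE OMP_LIB
  INTEGER :: VAR
  INTEGER :: i
  VAR = VAR + 1
  !$OMP PARALLEL NUM_THREADS(4)
  !$OMP DO
  DO i = 1, 5
     ! Only one thread at a time executes this block, so the two WRITEs of a
     ! given thread can no longer be separated by output from another thread.
     !$OMP CRITICAL
     WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
     WRITE(*, *) VAR
     !$OMP END CRITICAL
     CALL ROUTINE2(VAR)
  ENDDO
  !$OMP END DO
  !$OMP END PARALLEL
END SUBROUTINE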
I'm very new to hybrid parallel coding, so I'm wondering whether this kind of concept is possible, and whether it will hurt the efficiency of the parallelization.
Let's say that I need an A routine and a B routine. A is quite difficult to parallelize with MPI, while B is relatively straightforward to parallelize with MPI. Since I want this code to be scalable to some extent, I'm going to exploit as much MPI parallelization as I can.
I understand the concepts of thread and process only roughly; I suppose the total number of threads to be n_threads x n_process.
program Hybrid
use MPI
use OMP_LIB
call MPI_INIT ( ierr )
* call MPI_COMM_SIZE ( MPI_COMM_WORLD, n_process, ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
...
* call omp_set_num_threads ( n_threads )
...
call MPI_FINALIZE ( ierr )
end program
So in the above example, the total number of threads becomes n_threads x n_process, in my understanding (I'm not sure whether I'm using the term "total threads" properly, though). The asterisks (*) are just to make it easy to find n_threads and n_process.
My serial version of the code looks like this:
program Serial
do i = 1, time_steps
call A
call B
enddo
end program
Both A and B need a global view of the array.
The MPI-parallelized B routine, B_MPI, starts with some MPI_ScatterV calls to distribute the global data to the sub-processes, and it ends with MPI_GatherV to recover the global view, which only the 'my_id == 0' process holds.
I want A to be parallelized with OpenMP, but I don't want to activate too many threads, so I want only the 'my_id == 0' process to make the OpenMP calls and fork the OpenMP threads, like below.
program
use MPI
use OMP_LIB
call MPI_INIT ( ierr )
* call MPI_COMM_SIZE ( MPI_COMM_WORLD, n_process, ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
...
do i = 1, time_steps
if (my_id == 0) then
* call A_OMP ! 'call omp_set_num_threads ( n_threads ) ' inside the 'A_OMP'
endif
call B_MPI ! Starts with 'MPI_ScatterV', ends with 'MPI_GatherV'
enddo
end program
So in this way, by invoking OpenMP calls only inside 'my_id == 0', I want the total number of threads to be n_process + n_threads rather than n_process x n_threads.
To be honest, I'm not very sure how threads are forked when they're mixed with MPI. I want to make sure whether the above kind of picture is possible and whether it will be efficient.
Thank you for reading this question.
There shouldn't be any issue with this in principle, although note that using your terminology total_threads = n_threads + n_process - 1. If you had 10 physical CPU-cores and wanted to use 4 OpenMP threads, then you would have to launch 7 MPI processes. MPI rank 0 would only create 3 additional threads in the parallel region - the other thread, OpenMP thread 0, is just the original MPI process.
The issue in practice is that to get this to work efficiently you'll need to make sure that the 7 MPI processes and 3 additional OpenMP threads are scheduled appropriately at runtime, i.e. bound to the appropriate CPU cores. This might require a combination of both MPI and OpenMP affinity settings that will need to work together appropriately. Of course, you can never efficiently use more OpenMP threads than you have cores in a node as all the threads must run on the same physical computer as MPI rank 0.
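To make the structure concrete, here is a minimal, self-contained sketch of that layout (not the poster's code: A_OMP is replaced by a print and B_MPI by a barrier, and n_threads/time_steps are made-up values). Note that MPI_THREAD_FUNNELED is sufficient here, since only the master thread of each rank ever calls MPI:

program hybrid_sketch
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, provided, my_id, n_process, step
  integer, parameter :: n_threads = 4, time_steps = 3   ! placeholder values

  call MPI_INIT_THREAD(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, n_process, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)

  do step = 1, time_steps
     if (my_id == 0) then
        ! Stand-in for A_OMP: rank 0 forks n_threads-1 additional threads.
        !$omp parallel num_threads(n_threads)
        write(*,*) 'step', step, ': thread', omp_get_thread_num(), 'of', omp_get_num_threads(), 'on rank 0'
        !$omp end parallel
     end if
     ! Stand-in for B_MPI (ScatterV ... GatherV): every rank participates.
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  end do

  call MPI_FINALIZE(ierr)
end program hybrid_sketch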
So I have some code with a double-checked-locking solution for reading data files in a multi-threaded (OpenMP) application, which looks something like:
logical, dimension(10,10) :: is_data_loaded
is_data_loaded = .false.

! Other code

subroutine load(i,j)
   integer, intent(in) :: i, j   ! Indexes into array is_data_loaded
   if (is_data_loaded(i,j)) return
   !$OMP CRITICAL(load_data)
   if (.not. is_data_loaded(i,j)) then
      call load_single_file(i,j)
      is_data_loaded(i,j) = .true.
   endif
   !$OMP END CRITICAL(load_data)
end subroutine
I'm worried that if two threads reach the critical region at the same time (with the same i,j index), the second gets blocked by the first one entering the region, but once the first finishes, the second thread may start executing the critical block before seeing the updated is_data_loaded flag, and thus we end up with two threads updating the same data.
So firstly, is this an issue with OpenMP critical blocks? I'm unsure of the semantics and whether the standard says something like "everything must be consistent across threads before the next thread runs in a critical block" or not. And if it is a problem, would just wrapping the reads/writes of is_data_loaded in an omp atomic statement be sufficient?
I think the code is wrong, as the threads might indeed not see the updates of is_data_loaded after another thread has set it from the critical region. While the critical region will ensure that the corresponding memory flushes occur, the thread executing the if(is_data_loaded(i,j)) return might not see the update, as this statement might still see outdated data.
I think adding !$omp flush before the if(is_data_loaded(i,j)) return is needed to ensure that all data has been flushed and is_data_loaded(i,j) is loaded with the most recent data.
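Applied to the snippet from the question, a sketch of that fix could look like this (is_data_loaded and load_single_file as in the original code):

subroutine load(i, j)
   integer, intent(in) :: i, j    ! Indexes into array is_data_loaded
   ! Make sure this thread sees the most recent value written by other
   ! threads before taking the early-return fast path.
   !$omp flush
   if (is_data_loaded(i, j)) return
   !$omp critical(load_data)
   ! Re-check inside the critical section: another thread may have loaded
   ! the file while we were waiting to enter.
   if (.not. is_data_loaded(i, j)) then
      call load_single_file(i, j)
      is_data_loaded(i, j) = .true.
   endif
   !$omp end critical(load_data)
end subroutine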
I wonder if it is possible to set the number of threads for each section in an OpenMP parallel sections construct, i.e.:
real*8 :: x
real*4 :: y
integer*8 :: ii
integer*4 :: jj

x = 0.0d0
y = 0.0

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION NUM_THREADS(3)
do ii = 1, 100000000000_8
   x = x + cos(sin(tan(ii*1.0d0)))**(x/ii)
end do
!$OMP SECTION NUM_THREADS(1)
do jj = 1, 10000
   x = x + exp(jj*0.001)
end do
!$OMP END SECTIONS
!$OMP END PARALLEL
This code does not work with ifort 16.0, but I just wonder whether there is something else one could do?
EDIT: I get an error (during compilation) when I try to set the number of threads per section... I would like to specify a different number of threads for each section.
EDIT 2: The error message, which appears twice (once at each !$OMP SECTION NUM_THREADS(i) statement), is:
error #5082: Syntax error, found 'NUM_THREADS' when expecting one of: <END-OF-STATEMENT> ;
Pardon the hasty write-up.
Now that you've told us what we needed to know, it's blindingly obvious what the problem is ...
... the num_threads clause is applicable only to the parallel directive.
It is not possible to, in a straightforward fashion, allocate m out of n threads to one section and the remaining n-m threads to another. You can probably hack something together to achieve that effect but it would be going against the grain of OpenMP programming.
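For completeness, one such hack is shown below only as a sketch, with made-up loop bodies and variable names (nothing here is taken from the question): open a single parallel region with all the threads you want and split the two tasks by thread id yourself.

program split_threads_sketch
  use omp_lib
  implicit none
  integer :: tid, ii, jj
  real*8  :: xa, xb

  xa = 0.0d0
  xb = 0.0d0

  ! 4 threads total: threads 0-2 share the first loop, thread 3 does the second.
  !$omp parallel num_threads(4) private(tid, ii, jj) reduction(+:xa, xb)
  tid = omp_get_thread_num()
  if (tid <= 2) then
     ! Threads 0, 1 and 2 split this loop among themselves by hand (stride 3).
     do ii = 1 + tid, 100000, 3
        xa = xa + sin(ii * 1.0d0)
     end do
  else
     ! Thread 3 handles the second task on its own.
     do jj = 1, 10000
        xb = xb + exp(jj * 0.001d0)
     end do
  end if
  !$omp end parallel

  print *, 'xa =', xa, ' xb =', xb
end program split_threads_sketch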
What you are trying to do is against the philosophy of OpenMP, where you are not supposed to have full control over threads. You can, however, use a hack: a combination of OpenMP and pthreads. That is, OMP PARALLEL blocks which contain pthread statements (pthreads will give you full control over which threads are used in the OMP block). In the past I experimented with that, and although I didn't try exactly what you want to do, I managed to get some interesting results verifying that an OpenMP+pthread combination is possible. Besides, some compilers (like gfortran) implement OpenMP via pthreads behind the scenes.
Of course you need to write Fortran bindings for the pthread statements you will use, but that's not much of a problem. The real problem is that such an approach is problematic by definition. It mixes two radically different parallelization models and it is a hack, so you are on your own. I wouldn't go that way in a serious application but, with enough trial-and-error, it is a way to do what you are trying to do.
I have a big code in Fortran; it has a commercial license, so I cannot post it. It contains several modules, each with several subroutines. I compiled the code with the -fopenmp flag (I used the flag for all files in the program).
In one subroutine I placed the code
!$OMP PARALLEL
nthreads=OMP_GET_NUM_THREADS()
write(6,*) 'threads', nthreads
!$OMP END PARALLEL
Initially, the program complained about OMP_GET_NUM_THREADS having no recognized data type. I saw a post on this forum where it was suggested to add use omp_lib to pull in the run-time library interfaces. After adding this line to the subroutine, the program ran but it printed
threads 1
as if only one thread were used even when I set
export OMP_NUM_THREADS=10
My questions are: should I use use omp_lib in each subroutine, or maybe only in the "main" program?
As I said before, this subroutine (where I wrote the omp directives) is
inside a module.
You need to have OMP_Get_num_threads() declared in the scope of the subroutine where it is used. This might happen with a use statement within that subroutine, or, since you are using modules, in the header of the module. In the snippet pasted below, I did both.
Note that the function call to OMP_Get_num_threads() needs to be embedded in a parallel region (as you did in your code snippet), otherwise the result is always 1.
Since a module has no access to the scope of the program that uses it, use'ing omp_lib only in the main program is not sufficient (but the compiler will tell you that when you link the binary).
module testMod
  use omp_lib, only: OMP_Get_num_threads
  implicit none

contains

  subroutine printThreads()
    use omp_lib, only: OMP_Get_num_threads
    implicit none

    ! This will always print "1"
    print *, OMP_Get_num_threads()

    ! This will print the actual number of threads
    !$omp parallel
    print *, OMP_Get_num_threads()
    !$omp end parallel
  end subroutine

end module

program test
  use testMod, only: printThreads
  implicit none
  call printThreads()
end program
with the following results:
OMP_NUM_THREADS=2 ./a.out
1
2
2
OMP_NUM_THREADS=4 ./a.out
1
4
4
4
4