Fortran program compiled with -fopenmp shows only one thread

I have a large Fortran code with a commercial license, so I cannot
post it. It contains several modules, each with several subroutines. I compiled the code with the -fopenmp flag (applied to all
files in the program).
In one subroutine I placed the following code:
!$OMP PARALLEL
nthreads=OMP_GET_NUM_THREADS()
write(6,*) 'threads', nthreads
!$OMP END PARALLEL
Initially, the compiler complained that OMP_GET_NUM_THREADS had no declared type. I saw a post on this forum
suggesting use omp_lib to make the OpenMP run-time library routines available. After
adding this line to the subroutine, the program ran, but it printed
threads 1
as if only one thread were used even when I set
export OMP_NUM_THREADS=10
My question is: should I put use omp_lib in each subroutine, or
only in the "main" program?
As I said before, this subroutine (where I wrote the omp directives) is
inside a module.

You need to have OMP_Get_num_threads() declared in the scope of the subroutine where it is used. This can be done with a use statement within that subroutine or, since you are using modules, in the header of the module. In the snippet pasted below, I did both.
Note that the call to OMP_Get_num_threads() needs to be embedded in a parallel region (as in your code snippet); otherwise the result is always 1.
Since the module has no access to the scope of the main program, a use omp_lib in the main program alone is not sufficient (the compiler will tell you as much when you link the binary).
module testMod
  use omp_lib, only: OMP_Get_num_threads
  implicit none
contains
  subroutine printThreads()
    use omp_lib, only: OMP_Get_num_threads
    implicit none
    ! This will always print "1"
    print *, OMP_Get_num_threads()
    ! This will print the actual number of threads
    !$omp parallel
    print *, OMP_Get_num_threads()
    !$omp end parallel
  end subroutine
end module

program test
  use testMod, only: printThreads
  implicit none
  call printThreads()
end program
with the following results:
OMP_NUM_THREADS=2 ./a.out
1
2
2
OMP_NUM_THREADS=4 ./a.out
1
4
4
4
4

Related

Not all threads are printing in OpenMP

I'm trying to add multi-threading with OpenMP for a work project. I'm using Fortran 77 in Visual Studio 2017 and, while trying to debug, have found a behavior that I don't totally understand.
I'm new to OpenMP, so this may be a misunderstanding of the mechanics of multi-threading, but my understanding is that if I have a write statement in a do loop, every thread should print that write statement. I've recreated my issue in a small project for simplicity.
The subroutine that has this issue is supposed to print which thread is operating and then the value of a variable (which is 1) before calling another simple subroutine:
SUBROUTINE TAKEVAR(VAR)
USE OMP_LIB
INTEGER :: VAR, I
VAR = VAR + 1
!$OMP PARALLEL NUM_THREADS(4)
!$OMP DO
DO I = 1, 5
   WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
   WRITE(*, *) VAR
   CALL ROUTINE2(VAR)
ENDDO
!$OMP END DO
!$OMP END PARALLEL
END SUBROUTINE
However, the output of this subroutine is as follows:
(Screenshot of the output omitted: the 'Hello from thread' lines and the VAR lines from different threads appear interleaved rather than in pairs.)
I see that the loop is executed 5 times using 4 threads, which is correct, but each statement showing the thread number is not followed by the value of the variable. Is there a way to fix this?
Since each thread executes at the same time as the others, nothing guarantees that a WRITE from another thread will not occur between the first and second WRITE of a given thread.
In a comment, @Laci gave a possible solution:
!$OMP CRITICAL
WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
WRITE(*, *) VAR
!$OMP END CRITICAL
The code inside a critical section can be executed by only one thread at a time, so nothing can be executed between the two WRITEs. However, critical sections can hurt performance...
Another solution can be to use a single WRITE instruction:
WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM(), VAR
That said, I have had bad experiences (crashes at execution) with WRITEs in parallel regions, so I tend to always put them in critical sections.
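For completeness, here is a self-contained sketch combining the two fixes above (the single WRITE, wrapped in a critical section for good measure). The PROGRAM wrapper and the ROUTINE2 stub are hypothetical additions so the example compiles on its own:
      SUBROUTINE TAKEVAR(VAR)
      USE OMP_LIB
      INTEGER :: VAR, I
      VAR = VAR + 1
!$OMP PARALLEL DO NUM_THREADS(4)
      DO I = 1, 5
!$OMP CRITICAL
         ! Only one thread at a time prints, so the pair stays together
         WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM(), VAR
!$OMP END CRITICAL
         CALL ROUTINE2(VAR)
      ENDDO
!$OMP END PARALLEL DO
      END SUBROUTINE

      SUBROUTINE ROUTINE2(VAR)
      ! Hypothetical stub standing in for the real routine
      INTEGER :: VAR
      END SUBROUTINE

      PROGRAM MAIN
      INTEGER :: V
      V = 0
      CALL TAKEVAR(V)
      END PROGRAM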

Is reading/writing to different elements of a module array thread-safe?

As long as a program does not allow simultaneous writes to the same elements of a shared data structure that is stored in a module, is it thread-safe? I know this is a noob question, but couldn't find it explicitly addressed anywhere. Here's the situation:
At the beginning of a program, data is initialized and stored in a module-level allocatable array (FIELDVARS) which then becomes accessible to any subroutine where the module is referenced by a USE statement.
Suppose now that the program enters a multi-threaded and/or multi-core computational phase, and FIELDVARS is accessed for "read/write" operations during repeated, multiple simultaneous calls to a subroutine (COMPUTE).
Once the computational phase is complete, the program returns to a single-threaded phase and FIELDVARS must be used in a subsequent subroutine (POST). However, FIELDVARS cannot be added to the input args of COMPUTE or POST because these are called from a closed-source main program. Therefore the module-level array is used to pass the additional data between subroutines.
Assume that FIELDVARS and COMPUTE have been designed so that each call to COMPUTE will always give access to a set of unique elements of FIELDVARS, which are guaranteed to be different than for any other call, so that simultaneous "write" operations on the same elements will never occur. For example:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... ] <-- FIELDVARS
^---call 1---^ ^---call 2---^ ... <-- Each call to COMPUTE is guaranteed to access a specific set of elements of FIELDVARS.
Question: Is this scenario considered "thread-safe", "conditionally safe", or "not thread-safe"? If it is not safe, what is the specific danger and what would you suggest to handle it?
Other relevant details:
The main program controls threading.
The source code of the main program is not available and cannot be changed.
The main program controls how/when COMPUTE, POST, and other subroutines are called, as well as what args can be passed in. This is why the module-level array is used to pass data between different subroutines rather than as an arg.
! DEMO MODULE W/ ALLOCATABLE INTEGER ARRAY
module DATA_MODULE
  integer, dimension(:), allocatable :: FIELDVARS !<-- allocated/populated elsewhere, prior to calling COMPUTE
end module DATA_MODULE

! DEMO COMPUTE SUBROUTINE (THREADED PHASE W/ MULTIPLE SIMULTANEOUS CALLS)
subroutine COMPUTE(x, y, x_idx, y_idx, flag)
  use DATA_MODULE
  logical :: flag
  integer :: x, y, x_idx, y_idx !<-- different for every call to COMPUTE
  if (.not. flag) then !<-- read data only (logicals compare with .not./.eqv., not ==)
    ...
    x = FIELDVARS(x_idx)
    y = FIELDVARS(y_idx)
    ...
  else !<-- write data
    ...
    FIELDVARS(x_idx) = 0
    FIELDVARS(y_idx) = 0
    ...
  endif
end subroutine COMPUTE
It is fine, and many programs depend on that fact. In OpenMP you often loop over arrays, and different threads may easily work with elements that are close to each other in memory, especially at the boundaries of the blocks assigned to each thread.
On modern CPUs this is a non-issue. See also https://en.wikipedia.org/wiki/Cache_coherence
What is a real problem is false sharing. Two or more threads working with elements of memory belonging to the same cache line will be competing for a shared resource, and it may be very slow.
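To make that concrete, here is a minimal sketch (my own illustration, assuming 64-byte cache lines and at most 8 threads; compile without aggressive optimization, or the loops may be collapsed to a single add). Both loops are race-free because each thread writes only its own element, but in the first layout all counters share one or two cache lines, so it typically runs much slower than the padded layout:
program false_sharing_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 50000000
  integer :: tight(8)        ! all 8 counters packed into ~one cache line
  integer :: padded(16, 8)   ! 64 bytes per counter: one cache line each
  integer :: tid, i
  real(8) :: t0

  tight = 0
  padded = 0

  t0 = omp_get_wtime()
  !$omp parallel private(tid, i)
  tid = omp_get_thread_num() + 1   ! run with OMP_NUM_THREADS <= 8
  do i = 1, n
     tight(tid) = tight(tid) + 1      ! threads ping-pong the shared line
  end do
  !$omp end parallel
  print *, 'tight :', omp_get_wtime() - t0

  t0 = omp_get_wtime()
  !$omp parallel private(tid, i)
  tid = omp_get_thread_num() + 1
  do i = 1, n
     padded(1, tid) = padded(1, tid) + 1   ! no cache line is shared
  end do
  !$omp end parallel
  print *, 'padded:', omp_get_wtime() - t0
end program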

Set number of threads for each section in openmp?

I wonder if it is possible to set the number of threads for each section in an OpenMP sections construct, i.e.:
real*8    :: x
real*4    :: y
integer*8 :: ii
integer*4 :: jj
x = 0.0d0
y = 0.0
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION NUM_THREADS(3)
do ii = 1, 100000000000_8
   x = x + cos(sin(tan(ii*1.0d0)))**(x/ii)
end do
!$OMP SECTION NUM_THREADS(1)
do jj = 1, 10000
   x = x + exp(jj*0.001)
end do
!$OMP END SECTIONS
!$OMP END PARALLEL
This code does not compile with ifort 16.0, but I wonder if there is something else one could do..?
EDIT: I get an error during compilation when I try to set the number of threads per section... I would like to specify a different number of threads per section.
EDIT 2: The error message (appearing twice) is:
error #5082: Syntax error, found 'NUM_THREADS' when expecting one of: <END-OF-STATEMENT> ;
at the two !$OMP SECTION NUM_THREADS(i) statements.
Pardon the hasty write-up.
Now that you've told us what we needed to know, it's blindingly obvious what the problem is...
...the num_threads clause is applicable only to the parallel directive.
It is not possible, in a straightforward fashion, to allocate m out of n threads to one section and the remaining n-m threads to another. You can probably hack something together to achieve that effect, but it would be going against the grain of OpenMP programming.
What you are trying to do is against the philosophy of OpenMP, where you are not supposed to have full control over threads. You can, however, use a hack: a combination of OpenMP and pthreads, i.e. OMP PARALLEL blocks that contain pthread calls (pthreads give you full control over which threads are used in the OMP block). In the past I experimented with that, and although I didn't try exactly what you want to do, I got some interesting results verifying that an OpenMP+pthreads combination is possible. Besides, some compilers (like gfortran) implement OpenMP via pthreads behind the scenes.
Of course, you need to write Fortran bindings for the pthread functions you will use, but that's not much of a problem. The real problem is that such an approach is problematic by definition: it mixes two radically different parallelization models, and it is a hack, so you are on your own. I wouldn't go that way in a serious application but, with enough trial and error, it is a way to do what you are trying to do.
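If the goal is just valid syntax, a minimal sketch of the conforming placement (my own example, not from the answers above) moves the clause to the parallel directive; the reduction clause also removes the race on x that the original snippet had:
program sections_demo
  implicit none
  real*8    :: x
  integer*8 :: ii
  integer*4 :: jj
  x = 0.0d0
  ! num_threads is only valid on the parallel directive
  !$omp parallel sections num_threads(4) private(ii, jj) reduction(+:x)
  !$omp section
  do ii = 1, 1000000
     x = x + cos(sin(tan(ii*1.0d0)))
  end do
  !$omp section
  do jj = 1, 10000
     x = x + exp(jj*0.001)
  end do
  !$omp end parallel sections
  print *, x
end program
Note that with two sections only two of the four threads do any work; there is no conforming way to give three threads to one section and one to the other.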

Threads making MPI calls in a Hybrid MPI/OpenMP

I have found an issue in my hybrid MPI/OpenMP code that is reproduced
in its simplest form in the code below. I am using 2 threads
per MPI rank. These two threads are then used in an OpenMP "sections"
construct to do several computations; one of these consists of
making an mpi_allreduce call on two different vectors A and B, whose results
are stored in W and WW. The problem is that every time I run the program
I end up with different output. My suspicion is that the MPI calls
overlap and the reduced arrays W and WW get mixed up even though they
have different names, but I am not sure. Any comment on how to overcome
this issue is welcome.
Details:
The MPI thread level is initialized to MPI_THREAD_MULTIPLE in the code,
but I have also tried serial and funneled (with the same issue).
I compile the code with mpiifort -openmp allreduce_omp_mpi.f90 and run it with:
export OMP_NUM_THREADS=2
mpirun -np 3 ./a.out
PROGRAM HELLO
  use mpi
  use omp_lib
  IMPLICIT NONE
  INTEGER nthreads, tid
  Integer Provided, mpi_err, myid, nproc
  CHARACTER(MPI_MAX_PROCESSOR_NAME) :: hostname
  INTEGER :: nhostchars
  integer :: i
  real*8 :: A(1000), B(1000), W(1000), WW(1000)

  provided = 0
  ! Initialize MPI context
  call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, mpi_err)
  CALL mpi_comm_rank(mpi_comm_world, myid, mpi_err)
  CALL mpi_comm_size(mpi_comm_world, nproc, mpi_err)
  CALL mpi_get_processor_name(hostname, nhostchars, mpi_err)
  ! Initialize arrays
  A = 1.0
  B = 2.0
  ! Check if MPI_THREAD_MULTIPLE is available
  if (provided >= MPI_THREAD_MULTIPLE) then
     write(6,*) ' mpi_thread_multiple provided', myid
  else
     write(6,*) ' not mpi_thread_multiple provided', myid
  endif
  !$OMP PARALLEL PRIVATE(nthreads, tid) NUM_THREADS(2)
  !$omp sections
  !$omp section
  call mpi_allreduce(A, W, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp section
  call mpi_allreduce(B, WW, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp end sections
  !$OMP END PARALLEL
  write(6,*) 'W', (w(i), i=1,10)
  write(6,*) 'WW', (ww(i), i=1,10)
  CALL mpi_finalize(mpi_err)
END
The MPI standard forbids concurrent execution of (blocking) collective operations over the same communicator (Section 5.13 "Correctness [of collective communication]"):
...
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
The key point here is: same communicator. Nothing prevents you from starting concurrent collective communications over different communicators:
integer, dimension(2) :: comms
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)
!$omp parallel sections num_threads(2)
!$omp section
call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
!$omp section
call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
!$omp end parallel sections
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)
This program simply duplicates MPI_COMM_WORLD twice. The first copy is used in the first parallel section, the second copy is used in the second one. Although the two new communicators are copies of MPI_COMM_WORLD, they are separate contexts and thus concurrent operations over them are possible.
MPI_COMM_DUP is an expensive operation; therefore, the newly created communicators should be used for as long as possible before being freed.
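For reference, a self-contained assembly of the above (my own sketch; it assumes the MPI library actually grants MPI_THREAD_MULTIPLE at run time):
program dup_allreduce
  use mpi
  implicit none
  integer :: comms(2), ierr, provided
  real(8) :: A(1000), B(1000), W(1000), WW(1000)

  call MPI_INIT_THREAD(MPI_THREAD_MULTIPLE, provided, ierr)
  A = 1.0d0
  B = 2.0d0

  ! Separate communicators give the two concurrent collectives
  ! separate contexts, as the MPI standard requires
  call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
  call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

  !$omp parallel sections num_threads(2)
  !$omp section
  call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
  !$omp section
  call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
  !$omp end parallel sections

  call MPI_COMM_FREE(comms(1), ierr)
  call MPI_COMM_FREE(comms(2), ierr)
  print *, W(1), WW(1)
  call MPI_FINALIZE(ierr)
end program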

Scope of fortran modules

I already asked this question; however, this time I'm going to try to be clearer.
I'm really new to Fortran, so forgive any syntax errors; this is more pseudocode.
module variables
  implicit none
  SAVE
  integer x
  integer y
end module

subroutine init()
  use variables
  x = x + 2
  y = y + 1
end subroutine
Then my main program would be:
program main
  use variables
  implicit none
  call init()
  call some_other_function()
end program
If I include my modules, will they retain their values in some_other_function()?
Assume that some_other_function() is an abstraction of a huge simulation program.
Can I rely on my initialized variables keeping their values?
Is that what the SAVE statement in the module does?
Background info: I have program1, which is being called by program2
for a significant amount of time. Program1 has a huge initialization phase that only needs to happen once. If I ran that initialization phase before program2 calls program1, could I rely on all the module-declared variables being saved?
With a SAVE statement in the module, the values of the module variables are retained for the duration of the run of the program. If you initialize them in one procedure, the main program and other procedures will see those values. Module variables are preserved as long as they are in scope, so since you use your example module from the main program, their values would be retained for the duration of the run even without the SAVE statement. In principle, if a module were only used in some procedures and a SAVE statement were not used, the compiler could "forget" the values when none of those procedures were in the call chain. Probably many or perhaps all compilers don't actually reset the values ... it would be extra work to figure out whether a module had gone out of scope.
P.S. Your example has an error since you never initialize x and y. You only change their values.
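A minimal sketch of that fix (my own variation on the question's code): giving the module variables initializers both removes the error and makes the SAVE statement redundant, since explicitly initialized module variables implicitly have the SAVE attribute.
module variables
  implicit none
  integer :: x = 0   ! initialized at program start; implicitly SAVEd
  integer :: y = 0
end module

subroutine init()
  use variables
  implicit none
  x = x + 2
  y = y + 1
end subroutine

program main
  use variables
  implicit none
  call init()
  print *, x, y   ! prints 2 1; the values persist for the whole run
end program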
