Not all threads are printing in OpenMP - multithreading

I'm trying to add in multi-threading using OpenMP for a work project. I'm using Fortran77 in Visual Studio 2017, and while trying to debug, have found a behavior that I don't totally understand.
I'm new to OpenMP, so this may be a misunderstanding about the mechanics of multi-threading, but my understanding is that if I have a write statement in a do loop, every thread should print out that write statement. I've recreated my issue in a small project for simplicity.
The subroutine that has this issue is supposed to print which thread is operating and then the value of a variable (which is 1) before calling another simple subroutine:
SUBROUTINE TAKEVAR(VAR)
  USE OMP_LIB
  INTEGER:: VAR
  VAR = VAR + 1
  !$OMP PARALLEL NUM_THREADS(4)
  !$OMP DO
  DO i = 1, 5
    WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
    WRITE(*, *) VAR
    CALL ROUTINE2(VAR)
  ENDDO
  !$OMP END DO
  !$OMP END PARALLEL
END SUBROUTINE
However, the output of this subroutine is as follows:
[Image: screenshot of the program output]
I see that the loop is executed 5 times using 4 threads, which is correct, but each statement showing the thread number is not followed by the value of the variable. Is there a way to fix this?

Since all threads execute at the same time, nothing can guarantee that no WRITE from another thread will occur between the first and second WRITE of a given thread.
In a comment, @Laci gave a possible solution:
!$OMP CRITICAL
WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM()
WRITE(*, *) VAR
!$OMP END CRITICAL
The code inside a critical section can be executed by only one thread at a time, so nothing can be executed between the two WRITE statements. However, critical sections can hurt performance.
Another solution is to use a single WRITE statement:
WRITE(*, *) 'Hello from thread ', OMP_GET_THREAD_NUM(), VAR
That said, I have had bad experiences (crashes at run time) with WRITEs in parallel regions, so I tend to always put them in critical sections.
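For reference, here is a minimal self-contained sketch of the critical-section approach. The program name demo_critical and the empty body of ROUTINE2 are assumptions made for illustration (the original ROUTINE2 is not shown in the question). Only the CRITICAL block itself comes from the answer above.

```fortran
! Sketch only: DEMO_CRITICAL and the body of ROUTINE2 are hypothetical.
program demo_critical
  use omp_lib
  implicit none
  integer :: i, var

  var = 1
  !$omp parallel do num_threads(4) shared(var)
  do i = 1, 5
    ! Only one thread at a time may execute this block, so the two
    ! lines written by a given iteration always appear together.
    !$omp critical
    write(*, *) 'Hello from thread ', omp_get_thread_num()
    write(*, *) var
    !$omp end critical
    call routine2(var)
  end do
  !$omp end parallel do

contains

  ! Placeholder: the original ROUTINE2 is not shown in the question.
  subroutine routine2(v)
    integer, intent(in) :: v
  end subroutine routine2

end program demo_critical
```

With gfortran this can be built with the -fopenmp flag. The order in which the iterations print still varies from run to run. The critical section only keeps each pair of lines together.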

Related

Julia @threads single

Is there something in Julia Threads similar to a single command in OpenMP that will ensure all threads wait before a particular block of code and then execute that block in only one thread? I have a loop that distributes calculations of forces between threads before performing an update to all locations at once, and I cannot find any feature to achieve this without terminating the @threads loop.
You can use locks:
function f()
    l = Threads.SpinLock()
    x = 0
    Threads.@threads for i in 1:10^7
        Threads.lock(l)
        x += 1 # this block is executed only in one thread
        Threads.unlock(l)
    end
    return x
end
Note that SpinLock is intended for non-blocking code (that is, computation only, with no I/O in the loop). If I/O is involved, ReentrantLock should be used instead.

Double check locking in fortran with openmp

I have some code with a double-checked locking solution for reading data files in a multi-threaded (OpenMP) application, which looks something like:
logical, dimension(10,10) :: is_data_loaded
is_data_loaded=.false.
! Other code
subroutine load(i,j)
  integer,intent(in) :: i,j ! Indexes into array is_data_loaded
  if(is_data_loaded(i,j)) return
  !$OMP CRITICAL(load_data)
  if(.not.is_data_loaded(i,j)) then
    call load_single_file(i,j)
    is_data_loaded(i,j) = .true.
  endif
  !$OMP END CRITICAL(load_data)
end subroutine
I'm worried about what happens if two threads reach the critical region at the same time (with the same i,j index). The second is blocked while the first is inside the region, but once the first finishes, the second may start executing the critical block before seeing the updated is_data_loaded flag, and we would end up with two threads updating the same data.
So firstly, is this an issue with OpenMP critical blocks? I'm unsure of the semantics and whether the standard says something like "everything must be consistent across threads before the next thread runs in a critical block" or not. And if it is a problem, would just wrapping the reads/writes to is_data_loaded in an omp atomic statement be sufficient?
I think the code is wrong, as the threads might indeed not see the updates of is_data_loaded after another thread has set it from the critical region. While the critical region will ensure that the corresponding memory flushes occur, the thread executing the if(is_data_loaded(i,j)) return might not see the update, as this statement might still see outdated data.
I think adding !$omp flush before the if(is_data_loaded(i,j)) return is needed to ensure that all data has been flushed and is_data_loaded(i,j) is loaded with the most recent data.
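The suggestion above can be sketched as follows. The module name data_cache, the counter load_count, and the stub body of load_single_file are hypothetical and exist only so that the example compiles and the number of loads can be checked. Whether a bare flush before the first check is sufficient depends on the OpenMP memory model in use, so treat this as an illustration of the answer rather than a verified guarantee.

```fortran
! Sketch only: DATA_CACHE, LOAD_COUNT and the stub LOAD_SINGLE_FILE
! are hypothetical names added for illustration.
module data_cache
  implicit none
  logical, dimension(10,10) :: is_data_loaded = .false.
  integer, dimension(10,10) :: load_count = 0   ! counts real loads
contains

  ! Stand-in for the real file reader; it only records that it ran.
  subroutine load_single_file(i, j)
    integer, intent(in) :: i, j
    load_count(i,j) = load_count(i,j) + 1
  end subroutine load_single_file

  subroutine load(i, j)
    integer, intent(in) :: i, j
    ! Refresh this thread's view of memory before the unlocked check.
    !$omp flush
    if (is_data_loaded(i,j)) return
    !$omp critical(load_data)
    ! Second check: another thread may have loaded the data while
    ! this one was waiting to enter the critical region.
    if (.not. is_data_loaded(i,j)) then
      call load_single_file(i,j)
      is_data_loaded(i,j) = .true.
    end if
    !$omp end critical(load_data)
  end subroutine load

end module data_cache

program demo_load
  use data_cache
  implicit none
  integer :: k
  !$omp parallel do
  do k = 1, 100
    call load(3, 4)
  end do
  !$omp end parallel do
  write(*, *) 'loads performed for (3,4):', load_count(3,4)
end program demo_load
```

Because load_count is only modified inside the named critical region, and the second check runs inside that same region, the file for a given (i,j) should be loaded once no matter how many threads call load concurrently.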

Threads making MPI calls in a Hybrid MPI/OpenMP

I have found an issue in my hybrid MPI/OpenMP code that is reproduced in its simplest form in the code below. I am using 2 threads per MPI rank. These two threads are then used in an OpenMP "sections" construct to do several computations; one of these consists of making an mpi_allreduce call on two different vectors A and B, whose results are stored in W and WW. The problem is that every time I run the program I end up with a different output. My guess is that the MPI calls are overlapping and the reduced arrays W and WW get mixed up even though they have different names, but I am not sure. Any comment on how to overcome this issue is welcome.
Details:
The MPI thread level is initialized to MPI_THREAD_MULTIPLE in the code, but I have also tried serial and funneled (with the same issue).
I compile the code with mpiifort -openmp allreduce_omp_mpi.f90 and run it with:
export OMP_NUM_THREADS=2
mpirun -np 3 ./a.out
PROGRAM HELLO
use mpi
use omp_lib
IMPLICIT NONE
INTEGER nthreads, tid
Integer Provided,mpi_err,myid,nproc
CHARACTER(MPI_MAX_PROCESSOR_NAME):: hostname
INTEGER :: nhostchars
integer :: i
real*8 :: A(1000), B(1000), W(1000),WW(1000)
provided=0
!Initialize MPI context
call mpi_init_thread(MPI_THREAD_MULTIPLE,provided,mpi_err)
CALL mpi_comm_rank(mpi_comm_world,myid,mpi_err)
CALL mpi_comm_size(mpi_comm_world,nproc,mpi_err)
CALL mpi_get_processor_name(hostname,nhostchars,mpi_err)
!Initialize arrays
A=1.0
B=2.0
!Check if MPI_THREAD_MULTIPLE is available
if (provided >= MPI_THREAD_MULTIPLE) then
write(6,*) ' mpi_thread_multiple provided',myid
else
write(6,*) ' not mpi_thread_multiple provided',myid
endif
!$OMP PARALLEL PRIVATE(nthreads, tid) NUM_THREADS(2)
!$omp sections
!$omp section
call mpi_allreduce(A,W,1000,mpi_double_precision,mpi_sum,mpi_comm_world,mpi_err)
!$omp section
call mpi_allreduce(B,WW,1000,mpi_double_precision,mpi_sum,mpi_comm_world,mpi_err)
!$omp end sections
!$OMP END PARALLEL
write(6,*) 'W',(w(i),i=1,10)
write(6,*) 'WW',(ww(i),i=1,10)
CALL mpi_finalize(mpi_err)
END
The MPI standard forbids concurrent execution of (blocking) collective operations over the same communicator (Section 5.13 "Correctness [of collective communication]"):
...
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
The key point here is: same communicator. Nothing prevents you from starting concurrent collective communications over different communicators:
integer, dimension(2) :: comms
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)
!$omp parallel sections num_threads(2)
!$omp section
call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
!$omp section
call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
!$omp end parallel sections
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)
This program simply duplicates MPI_COMM_WORLD twice. The first copy is used in the first parallel section, the second copy is used in the second one. Although the two new communicators are copies of MPI_COMM_WORLD, they are separate contexts and thus concurrent operations over them are possible.
MPI_COMM_DUP is an expensive operation, therefore the newly created communicators should be used for as long as possible before being freed.
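A possible way to follow that advice is to duplicate the communicator once at start-up and reuse the copies for every reduction. The sketch below assumes a hypothetical program name, a hypothetical step count nsteps, and an ierr made private to each thread so the two sections do not write the same status variable. None of these appear in the original code.

```fortran
! Sketch only: program name, NSTEPS and the abort-on-failure policy
! are assumptions made for illustration.
program allreduce_two_comms
  use mpi
  implicit none
  integer, parameter :: n = 1000, nsteps = 10
  integer :: provided, ierr, step
  integer :: comms(2)
  double precision :: a(n), b(n), w(n), ww(n)

  call MPI_INIT_THREAD(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) then
    write(*, *) 'MPI_THREAD_MULTIPLE is not available'
    call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Duplicate once, up front, and reuse the copies for every step.
  call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
  call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

  a = 1.0d0
  b = 2.0d0
  do step = 1, nsteps
    ! Each section reduces over its own communicator, so the two
    ! collectives never share a communicator concurrently.
    !$omp parallel sections num_threads(2) private(ierr)
    !$omp section
    call MPI_ALLREDUCE(a, w, n, MPI_DOUBLE_PRECISION, MPI_SUM, comms(1), ierr)
    !$omp section
    call MPI_ALLREDUCE(b, ww, n, MPI_DOUBLE_PRECISION, MPI_SUM, comms(2), ierr)
    !$omp end parallel sections
  end do

  ! Free the duplicates only when no more reductions are needed.
  call MPI_COMM_FREE(comms(1), ierr)
  call MPI_COMM_FREE(comms(2), ierr)
  call MPI_FINALIZE(ierr)
end program allreduce_two_comms
```

The cost of the two MPI_COMM_DUP calls is paid once, outside the loop, rather than on every iteration.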

Fortran program compiled with fopenmp shows only one thread

I have a large Fortran code under a commercial license, so I cannot post it. It contains several modules, each with several subroutines. I compiled the code with the -fopenmp flag (I used the flag for all files in the program).
In one subroutine I placed the code
!$OMP PARALLEL
nthreads=OMP_GET_NUM_THREADS()
write(6,*) 'threads', nthreads
!$OMP END PARALLEL
Initially, the program complained about the unrecognized data type of OMP_GET_NUM_THREADS. I saw a post on this forum suggesting use omp_lib to load the run-time libraries. After adding this line to the subroutine, the program ran but it printed
threads 1
as if only one thread were used even when I set
export OMP_NUM_THREADS=10
My questions are, should I use use omp_lib in each subroutine? or maybe
only on the "main" program?
As I said before, this subroutine (where I wrote the omp directives) is
inside a module.
You need to have OMP_Get_num_threads() declared in the scope of the subroutine where it is used. This might happen with a use statement within that subroutine, or, since you are using modules, in the header of the module. In the snippet pasted below, I did both.
Note that the call to OMP_Get_num_threads() needs to be inside a parallel region (as in your code snippet); otherwise the result is always 1.
Since the modules have no access to the scope of the main program, a use statement in the main program alone is not sufficient (but the compiler will tell you that when you are linking the binary).
module testMod
use omp_lib, only: OMP_Get_num_threads
implicit none
contains
subroutine printThreads()
use omp_lib, only: OMP_Get_num_threads
implicit none
! This will always print "1"
print *,OMP_Get_num_threads()
! This will print the actual number of threads
!$omp parallel
print *,OMP_Get_num_threads()
!$omp end parallel
end subroutine
end module
program test
use testMod, only: printThreads
implicit none
call printThreads()
end program
with the following results:
OMP_NUM_THREADS=2 ./a.out
1
2
2
OMP_NUM_THREADS=4 ./a.out
1
4
4
4
4
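As a variation on the snippet from the question, the sketch below prints the thread count once instead of once per thread. The module and subroutine names are hypothetical. It also makes nthreads private, since in the question's version every thread assigns the same shared variable.

```fortran
! Sketch only: THREAD_REPORT and REPORT_THREADS are hypothetical names.
module thread_report
  use omp_lib          ! visible to every subroutine in this module
  implicit none
contains

  subroutine report_threads()
    integer :: nthreads
    !$omp parallel private(nthreads)
    nthreads = omp_get_num_threads()
    ! SINGLE lets exactly one thread print; the others wait at its end.
    !$omp single
    write(6, *) 'threads', nthreads
    !$omp end single
    !$omp end parallel
  end subroutine report_threads

end module thread_report

program main
  use thread_report, only: report_threads
  implicit none
  call report_threads()
end program main
```

Placing use omp_lib in the module header, as described above, means the individual subroutines do not need their own use statement.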

Reusable Barrier Algorithm

I'm looking into the Reusable Barrier algorithm from the book "The Little Book Of Semaphores" (archived here).
The puzzle is on page 31 (Basic Synchronization Patterns/Reusable Barrier), and I have come up with a 'solution' (or not) which differs from the solution from the book (a two-phase barrier).
This is my 'code' for each thread:
# n = 4; threads running
# semaphore = n max., initialized to 0
# mutex, unowned.
start:
    mutex.wait()
    counter = counter + 1
    if counter = n:
        semaphore.signal(4) # add 4 at once
        counter = 0
    mutex.release()
    semaphore.wait()
    # critical section
    semaphore.release()
    goto start
This does seem to work, I've even inserted different sleep timers into different sections of the threads, and they still wait for all the threads to come before continuing each and every loop. Am I missing something? Is there a condition that this will fail?
I've implemented this using the Windows library Semaphore and Mutex functions.
Update:
Thank you to starblue for the answer. It turns out that if, for whatever reason, a thread is slow between mutex.release() and semaphore.wait(), any of the threads that arrive at semaphore.wait() after a full loop will be able to go through again, since one of the N signals will still be unused.
And having put a Sleep command for thread number 3, I got this result where one can see that thread 3 missed a turn the first time, with thread 1 having done 2 turns, and then catching up on the second turn (which was in fact its 1st turn).
Thanks again to everyone for the input.
One thread could run several times through the barrier while some other thread doesn't run at all.
