OPENMP running the same job on threads - multithreading

In my OPENMP code, I want all threads do the same job and at the end take the average ( basically calculate error). ( How I calculate error? Each thread generates different random numbers, so the result from each threads is different.)
Here is simple code
program ...
..
!$OMP PARALLEL
do i=1,Nstep
!.... some code goes here
result=...
end do
!$END PARALLEL
sum = result(from thread 0)+result(from thread 1)+...
sum = sum/(number of threads)
Simply I have to send do loop inside OPENMP to all threads, not blocking this loop.
I can do what I want using MPI and MPI_reduce, but I want to write a hybrid code OPENMP + MPI. I haven't figured out the OPENMP part, so suggestions please?

It is as simple as applying sum reduction over result:
USE omp_lib ! for omp_get_num_threads()
INTEGER :: num_threads
result = 0.0
num_threads = 1
!$OMP PARALLEL REDUCTION(+:result)
!$OMP SINGLE
num_threads = omp_get_num_threads()
!$OMP END SINGLE
do i = 1, Nstep
...
result = ...
...
end do
!$END PARALLEL
result = result / num_threads
Here num_threads is a shared INTEGER variable that is assigned the actual number of threads used to execute the parallel region. The assignment is put in a SINGLE construct since it suffices one thread - and no matter which one - to execute the assignment.

Related

GFortran unformatted I/O throughput on NVMe SSDs

Please help me understand how I can improve sequential, unformatted I/O throughput with (G)Fortran, especially when working on NVMe SSDs.
I wrote a little test program, see bottom of this post. What this does is open one or more files in parallel (OpenMP) and write an array of random numbers into it. Then it flushes system caches (root required, otherwise the read test will most likely read from memory) opens the files, and reads from them. Time is measured in wall time (trying to include only I/O-related times), and performance numbers are given in MiB/s. The program loops until aborted.
The hardware I am using for testing is a Samsung 970 Evo Plus 1TB SSD, connected via 2 PCIe 3.0 lanes. So in theory, it should be capable of ~1500MiB/s sequential reads and writes.
Testing beforehand with "dd if=/dev/zero of=./testfile bs=1G count=1 oflag=direct" results in ~750MB/s. Not too great, but still better than what I get with Gfortran. And depending on who you ask, dd should not be used for benchmarking anyway. This is just to make sure that the hardware is in theory capable of more.
Results with my code tend to get better with larger file size, but even with 1GiB it caps out at around 200MiB/s write, 420MiB/s read. Using more threads (e.g. 4) increases write speeds a bit, but only to around 270MiB/s.
I made sure to keep the benchmark runs short, and give the SSD time to relax between tests.
I was under the impression that it should be possible to saturate 2 PCIe 3.0 lanes worth of bandwidth, even with only a single thread. At least when using unformatted I/O.
The code does not seem to be CPU limited, top shows less than 50% usage on a single core if I move the allocation and initialization of the "values" field out of the loop. Which still does not bode well for overall performance, considering that I would like to see numbers that are at least 5 times higher.
I also tried to use access=stream for the open statements, but to no avail.
So what seems to be the problem?
Is my code wrong/unoptimized? Are my expectations too high?
Platform used:
Opensuse Leap 15.1, Kernel 4.12.14-lp151.28.36-default
2x AMD Epyc 7551, Supermicro H11DSI, Samsung 970 Evo Plus 1TB (2xPCIe 3.0)
gcc version 8.2.1, compiler options: -ffree-line-length-none -O3 -ffast-math -funroll-loops -flto
MODULE types
implicit none
save
INTEGER, PARAMETER :: I8B = SELECTED_INT_KIND(18)
INTEGER, PARAMETER :: I4B = SELECTED_INT_KIND(9)
INTEGER, PARAMETER :: SP = KIND(1.0)
INTEGER, PARAMETER :: DP = KIND(1.0d0)
END MODULE types
MODULE parameters
use types
implicit none
save
INTEGER(I4B) :: filesize ! file size in MiB
INTEGER(I4B) :: nthreads ! number of threads for parallel ececution
INTEGER(I4B) :: alloc_size ! size of the allocated data field
END MODULE parameters
PROGRAM iometer
use types
use parameters
use omp_lib
implicit none
CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
CHARACTER(LEN=40) :: dummy_char1
CHARACTER(LEN=110) :: filename
CHARACTER(LEN=10) :: filenumber
INTEGER(I4B) :: thread, tunit, n
INTEGER(I8B) :: counti, countf, count_rate
REAL(DP) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
REAL(SP), DIMENSION(:), ALLOCATABLE :: values
call system_clock(counti,count_rate)
call getarg(1,directory_char)
dummy_char1 = ' directory to test:'
write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
call getarg(2,filesize_char)
dummy_char1 = ' file size (MiB):'
read(filesize_char,*) filesize
write(*,'(A40,I12)') dummy_char1, filesize
call getarg(3,nthreads_char)
dummy_char1 = ' number of parallel threads:'
read(nthreads_char,*) nthreads
write(*,'(A40,I12)') dummy_char1, nthreads
alloc_size = filesize * 262144
dummy_char1 = ' allocation size:'
write(*,'(A40,I12)') dummy_char1, alloc_size
mib_written = real(alloc_size,kind=dp) * real(nthreads,kind=dp) / 1048576.0_dp
mib_read = mib_written
CALL OMP_SET_NUM_THREADS(nthreads)
do while(.true.)
!$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
thread = omp_get_thread_num()
write(filenumber,'(I0.10)') thread
filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
allocate(values(alloc_size))
call random_seed()
call RANDOM_NUMBER(values)
tunit = thread + 100
!$OMP BARRIER
!$OMP MASTER
call system_clock(counti)
!$OMP END MASTER
!$OMP BARRIER
open(unit=tunit, file=trim(adjustl(filename)), status='replace', action='write', form='unformatted')
write(tunit) values
close(unit=tunit)
!$OMP BARRIER
!$OMP MASTER
call system_clock(countf)
telapsed_write = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
write_speed = mib_written/telapsed_write
!write(*,*) 'write speed (MiB/s): ', write_speed
call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
call system_clock(counti)
!$OMP END MASTER
!$OMP BARRIER
open(unit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted')
read(tunit) values
close(unit=tunit)
!$OMP BARRIER
!$OMP MASTER
call system_clock(countf)
telapsed_read = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
read_speed = mib_read/telapsed_read
write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
!$OMP END MASTER
!$OMP BARRIER
deallocate(values)
!$OMP END PARALLEL
call sleep(1)
end do
END PROGRAM iometer
The mistake in your code is that in your calculation of mib_written you have forgotten to take into account the size of a real(sp) variable (4 bytes). Thus your results are a factor of 4 too low. E.g. calculate it as
mib_written = filesize * nthreads
Some minor nits, some specific to GFortran:
Don't repeatedly call random_seed, particularly not from each thread. If you want to call it, call it once in the beginning of the program.
You can use open(newunit=tunit, ...) to let the compiler runtime allocate a unique unit number for each file.
If you want the 'standard' 64-bit integer/floating point kinds, you can use the variables int64 and real64 from the iso_fortran_env intrinsic module.
For testing with larger files, you need to make alloc_size of kind int64.
Use the standard get_command_argument intrinsic instead of the nonstandard getarg.
access='stream' is slightly faster than the default (sequential) as there's no need to handle the record length markers.
Your test program with these fixes (and the parameters module folded into the main program) below:
PROGRAM iometer
use iso_fortran_env
use omp_lib
implicit none
CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
CHARACTER(LEN=40) :: dummy_char1
CHARACTER(LEN=110) :: filename
CHARACTER(LEN=10) :: filenumber
INTEGER :: thread, tunit
INTEGER(int64) :: counti, countf, count_rate
REAL(real64) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
REAL, DIMENSION(:), ALLOCATABLE :: values
INTEGER :: filesize ! file size in MiB
INTEGER :: nthreads ! number of threads for parallel ececution
INTEGER(int64) :: alloc_size ! size of the allocated data field
call system_clock(counti,count_rate)
call get_command_argument(1, directory_char)
dummy_char1 = ' directory to test:'
write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
call get_command_argument(2, filesize_char)
dummy_char1 = ' file size (MiB):'
read(filesize_char,*) filesize
write(*,'(A40,I12)') dummy_char1, filesize
call get_command_argument(3, nthreads_char)
dummy_char1 = ' number of parallel threads:'
read(nthreads_char,*) nthreads
write(*,'(A40,I12)') dummy_char1, nthreads
alloc_size = filesize * 262144_int64
dummy_char1 = ' allocation size:'
write(*,'(A40,I12)') dummy_char1, alloc_size
mib_written = filesize * nthreads
dummy_char1 = ' MiB written:'
write(*, '(A40,g0)') dummy_char1, mib_written
mib_read = mib_written
CALL OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
do while (.true.)
thread = omp_get_thread_num()
write(filenumber,'(I0.10)') thread
filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
if (.not. allocated(values)) then
allocate(values(alloc_size))
call RANDOM_NUMBER(values)
end if
open(newunit=tunit, file=filename, status='replace', action='write', form='unformatted', access='stream')
!$omp barrier
!$omp master
call system_clock(counti)
!$omp end master
!$omp barrier
write(tunit) values
close(unit=tunit)
!$omp barrier
!$omp master
call system_clock(countf)
telapsed_write = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
write_speed = mib_written/telapsed_write
call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
!$OMP END MASTER
open(newunit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted', access='stream')
!$omp barrier
!$omp master
call system_clock(counti)
!$omp end master
!$omp barrier
read(tunit) values
close(unit=tunit)
!$omp barrier
!$omp master
call system_clock(countf)
telapsed_read = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
read_speed = mib_read/telapsed_read
write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
!$OMP END MASTER
call sleep(1)
end do
!$OMP END PARALLEL
END PROGRAM iometer

OMP variable reduction inside subroutine called in the OMP block

I have a variable pointed to by this pointer in one of my routines:
complex(dp), pointer :: Pkc(:,:)=>NULL()
I also allocate it in the same routine:
call MIO_Allocate(Pkc,[1,1,1],[ptsTot,nAt,2],'Pkc','diag')
Inside this routine, I use it inside an OMP region:
!$OMP PARALLEL DO PRIVATE(iee, ie, unfoldedK), REDUCTION(+:Ake1, Ake2, Ake), &
!$OMP& SHARED(Pkc, KptsG, E, Kpts, AkeGaussian1, AkeGaussian2, AkeGaussian, nAt, nspin, is, ucell, gcell, H0, maxNeigh, hopp, NList, Nneigh, neighCell, gaussian, Epts)
do ik=1,ptsTot ! K loop
...
call DiagSpectralWeightWeiKuInequivalentInequivalent(nAt,nspin,is,Pkc(ik,:,:),E(:,is),Kpts(:,ik),unfoldedK(:),ucell,gcell,H0,maxNeigh,hopp,NList,Nneigh,neighCell)
...
end do
!$OMP END PARALLEL DO
where the new routine that is called is given by
subroutineDiagSpectralWeightWeiKu(N,ns,is,PkcLoc,E,K,KG,cell,H0,maxN,hopp,N List,Nneigh,neighCell)
...
complex(dp), intent(out) :: PkcLoc(N,2)
...
do j=1,N ! These are the eigenvectors with band index J
do in=1,N
PkcLoc(j,1) = PkcLoc(j,1) + exp(-cmplx_i*dot_product(KG, RtsVec(in,:))) * Hts(in,j)
end do
end do
...
end subroutine DiagSpectralWeightWeiKuInequivalent
How can I make sure that PkcLoc gets the proper behavior when doing OMP? I am getting segmentation faults which I assume are related to a missing REDUCTION on PkcLoc.
Any advice on how to solve this?
I found this thread, but it's different in the sense that in my case, the do loop is outside of the subroutine that is called.

Computing c𝑖 = √(a𝑖 × b𝑖) in parallel using nested parallelism

Let's say we have two vectors A=(ai) and B=(bi), each of size n and we have to compute a new vector C=(ci) as 𝑐𝑖 = √(𝑎𝑖 × 𝑏𝑖) for(i=1,...,n)
Main question: What would be the best way to compute the ci in parallel (using nested parallelism, i.e. using sync and spawn).
I think the below understanding is correct about the computation
for (i = 1 to n) {
C[i] = Math.sqrt(A[i] * B[i]);
}
And is there any way to use parallel for loops to compute C in parallel ?
If so, I think the approach will be the following:
parallel for (i = 1 to n) {
C[i] = Math.sqrt(A[i] * B[i]);
}
Is it correct ?
Assuming that by best you mean fastest, the usual approach would be to divide A and B into chunks, spawn a separate thread for handling each of these chunks in parallel, and wait for all the threads to finish their tasks.
The optimal number of chunks for such computation, most likely, will be the number of CPU cores you have on your computer. So, the pseudocode would look like:
chunkSize = ceiling(n / numberOfCPUs)
for (t = 1 to numberOfCPUs) {
startIndex = (t - 1) * chunkSize + 1
size = min(chunkSize, C.size - startIndex + 1)
threads.add(Thread.spawn(startIndex, size))
}
threads.join()
Where each thread, provided with the startIndex and size, computes:
for (i = startIndex to startIndex + size) {
C[i] = Math.sqrt(A[i] * B[i])
}
Another approach would be to have a pool of threads and give those threads a single shared queue of indices 1, 2, ... n. Each thread on each iteration polls the top index (let it be i) and calculates C[i]. As soon as the queue is empty, the work is done. The problem here is that you need additional synchronization mechanism that would guarantee that every index is processed by exactly one thread. For some simple tasks (like yours) such mechanism might consume more resources than actual calculation, but for relatively long-running tasks it works pretty well.
There's a mutual approach when you break the initial set of tasks into chunks, provide each thread in the pool with its own chunk, but when a thread is done with its chunk, it starts 'stealing' tasks from other threads in order not to sit idle. On many real tasks it gives better results than either of previous approaches.

Code takes much more time to finish with more than 1 thread

I want to benchmark some Fortran code with OpenMP-threads with a critical-section. To simulate a realistic environment I tried to generate some load before this critical-section.
!Kompileraufruf: gfortran -fopenmp -o minExample.x minExample.f90
PROGRAM minExample
USE omp_lib
IMPLICIT NONE
INTEGER :: n_chars, real_alloced
INTEGER :: nx,ny,nz,ix,iy,iz, idx
INTEGER :: nthreads, lasteinstellung,i
INTEGER, PARAMETER :: dp = kind(1.0d0)
REAL (KIND = dp) :: j
CHARACTER(LEN=32) :: arg
nx = 2
ny = 2
nz = 2
lasteinstellung= 10000
CALL getarg(1, arg)
READ(arg,*) nthreads
CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"
n_chars = 0
idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp shared(nx,ny,nz,n_chars) &
!$omp private(ix,iy,iz, idx) &
!$omp private(lasteinstellung,j) !&
DO iz=-nz,nz
DO iy=-ny,ny
DO ix=-nx,nx
! WRITE(*,*) ix,iy,iz
j = 0.0d0
DO i=1,lasteinstellung
j = j + real(i)
END DO
!$omp critical
n_chars = n_chars + 1
idx = n_chars
!$omp end critical
END DO
END DO
END DO
END PROGRAM
I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD
Executing this code gives some strange behaviour depending on the thread-count (set with OMP_SET_NUM_THREADS): compared with one thread (6ms) the execution with more threads costs a lot more time (2 threads: 16000ms, 4 threads: 9000ms) on my multicore machine.
What could cause this behaviour? Is there a better (but still easy) way to generate load without running in some cache-effects or related things?
edit: strange behaviour: if I have the write in the nested loops, the execution speeds dramatically up with 2 threads. If its commented out, the execution with 2 or 3 threads takes forever (write shows very slow incrementation of loop variables)...but not with 1 or 4 threads. I tried this code also on another multicore machine. There it takes for 1 and 3 threads forever but not for 2 or 4 threads.
If the code you are showing is really complete you are missing definition of loadSet in the parallel section in which it is private. It is undefined and loop
DO i=1,loadSet
j = j + real(i)
END DO
can take a completely arbitrary number of iterations.
If the value is defined somewhere before in the code you do not show you probably want firstprivate instead of private.

Seeking help with a MT design pattern

I have a queue of 1000 work items and a n-proc machine (assume n =
4).The main thread spawns n (=4) worker threads at a time ( 25 outer
iterations) and waits for all threads to complete before processing
the next n (=4) items until the entire queue is processed
for(i= 0 to queue.Length / numprocs)
for(j= 0 to numprocs)
{
CreateThread(WorkerThread,WorkItem)
}
WaitForMultipleObjects(threadHandle[])
The work done by each (worker) thread is not homogeneous.Therefore in
1 batch (of n) if thread 1 spends 1000 s doing work and rest of the 3
threads only 1 s , above design is inefficient,becaue after 1 sec
other 3 processors are idling. Besides there is no pooling - 1000
distinct threads are being created
How do I use the NT thread pool (I am not familiar enough- hence the
long winded question) and QueueUserWorkitem to achieve the above. The
following constraints should hold
The main thread requires that all worker items are processed before
it can proceed.So I would think that a waitall like construct above
is required
I want to create as many threads as processors (ie not 1000 threads
at a time)
Also I dont want to create 1000 distinct events, pass to the worker
thread, and wait on all events using the QueueUserWorkitem API or
otherwise
Exisitng code is in C++.Prefer C++ because I dont know c#
I suspect that the above is a very common pattern and was looking for
input from you folks.
I'm not a C++ programmer, so I'll give you some half-way pseudo code for it
tcount = 0
maxproc = 4
while queue_item = queue.get_next() # depends on implementation of queue
# may well be:
# for i=0; i<queue.length; i++
while tcount == maxproc
wait 0.1 seconds # or some other interval that isn't as cpu intensive
# as continously running the loop
tcount += 1 # must be atomic (reading the value and writing the new
# one must happen consecutively without interruption from
# other threads). I think ++tcount would handle that in cpp.
new thread(worker, queue_item)
function worker(item)
# ...do stuff with item here...
tcount -= 1 # must be atomic

Resources