I implemented a code that takes an array of function pointers called tasks_ready_master. It contains 20 elements. Each element points to a subroutine (task) and to an integer representing the state of the task.
In order to execute the tasks, I only need to:
call tasks_ready_master(index_element)%f_ptr(self,var)
var represents all the variables needed in the computation.
Since I have 20 tasks, I want to use OpenMP to parallelize this code.
For example:
If we have 4 threads then we will consider 1 master, 1 application master(controls the states of the tasks) and 2 workers executing the tasks. The master should assign the tasks to a free worker.
So I don't want to have the executions in this order (it can be but I shouldn't impose it) : worker 1 (first task) / worker 2 (second task) / worker 1 (third task) / worker 2 (fourth task) / .. /
I want to use the lock routines in order to let the first worker not wait for the second one if he finishes his execution.
We can have the executions in this order (for exemple): worker 1 (first task) / worker 2 (second task) / worker 2 (third task) / worker 2 (fourth task) / worker 1 (fifth task) / .. /
The idea is not to wait for the other worker to finish his execution.
To do that, at each execution, the worker needs to erase (remove) the element(task) from the array. I should imperatively use the lock routines and avoid the OMP_BARRIER.
Here is my implementation:
!Thread Application Master (numero 1)
if (num_thread==1) then
do ff=5,20 ! 16 tasks
if (associated(tasklist_GRAD(ff)%f_ptr) .eqv. .true.) then
tasks_ready_master(ff) = tasklist_GRAD(ff)
tasks_ready_master(ff)%state=STATE_READY
end if
end do
end if
!$OMP BARRIER !!I used this OMP_BARRIER because I had a segmentation fault
if ((num_thread .ne. 0) .and. (num_thread .ne. 1)) then
do ff=1,nthreads-2
if (num_thread==ff+1) then
if (tasks_ready_master(ff+4)%state==STATE_READY) then
tasks_ready_master(ff+4)%state=STATE_RUNNING
free_workers_list(ff)=1
call tasks_ready_master(ff+4)%f_ptr(self,var)
tasks_ready_master(ff+4)%state=STATE_INACTIVE
free_workers_list(ff)=0
call erase(5,s,tasks_ready_master)
end if
end if
end do
do l=nthreads+3,20
call omp_set_lock(lock)
do ff=1,nthreads-2
if (num_thread==ff+1) then
if (free_workers_list(ff)==1) then
call omp_unset_lock(lock)
else
s=size(tasks_ready_master)
if (tasks_ready_master(ff+4)%state==STATE_READY) then
free_workers_list(ff)=1
call tasks_ready_master(5)%f_ptr(self,var)
tasks_ready_master(5)%state=STATE_INACTIVE
free_workers_list(ff)=0
call erase(5,s,tasks_ready_master)
end if
call omp_unset_lock(lock)
end if
end if
end do
end do
end if
For the declaration and initialization part, I did:
INTEGER ::ff,l
type(tcb)::self
type(variables)::var
integer::s
!OpenMP variables
integer::num_thread,nthreads
integer:: OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
type(tcb),dimension(20)::tasklist_GRAD,tasks_ready_master
integer::STATUS=0
integer,allocatable,dimension(:)::free_workers_list !liste contenant les nums des threads workers
!Variables pour gestion des threads
INTEGER(KIND=omp_lock_kind), SAVE :: lock
CALL omp_init_lock (lock)
!=======================================================================================================================================================
!$OMP PARALLEL PRIVATE(num_thread,nthreads,ff) &
!$OMP SHARED(tasklist_GRAD,tasks_ready_master,free_workers_list,s)
num_thread=OMP_GET_THREAD_NUM() ! le rang du thread
nthreads=OMP_GET_NUM_THREADS() ! le nombre de threads
For the assignment of the tasks to the array:
!attribution des taches aux listes (20 tasks)
tasklist_GRAD(1)%f_ptr => u_prime_1 !1
tasklist_GRAD(2)%f_ptr => u_prime_droite_1 !2
tasklist_GRAD(3)%f_ptr => u_prime_gauche_1 !3
tasklist_GRAD(4)%f_ptr => u_grad_x_1 !4
tasklist_GRAD(5)%f_ptr => u_prime_2 !5
tasklist_GRAD(6)%f_ptr => taux_ !6
tasklist_GRAD(7)%f_ptr => u_prime_gauche_2 !7
tasklist_GRAD(8)%f_ptr => u_prime_droite_2 !8
tasklist_GRAD(9)%f_ptr => ax_droite_1 !9
tasklist_GRAD(10)%f_ptr => ax_gauche_1 !10
tasklist_GRAD(11)%f_ptr => ux_droite_1 !11
tasklist_GRAD(12)%f_ptr => ux_gauche_1 !12
tasklist_GRAD(13)%f_ptr => u_grad_x_2 !13
tasklist_GRAD(14)%f_ptr => u_prime_gauche_3 !14
tasklist_GRAD(15)%f_ptr => u_prime_droite_3 !15
tasklist_GRAD(16)%f_ptr => ax_droite_2 !16
tasklist_GRAD(17)%f_ptr => ax_gauche_2 !17
tasklist_GRAD(18)%f_ptr => ux_droite_2 !1
tasklist_GRAD(19)%f_ptr => ux_gauche_2 !19
tasklist_GRAD(20)%f_ptr => u_grad_x_3 !20
do ff=1,20
tasklist_GRAD(1)%state=STATE_WAITING
end do
tasklist_GRAD(1)%variables%ww_t => ww
tasklist_GRAD(1)%variables%pas_t => pas
tasklist_GRAD(1)%variables%cpt_t => cpt
tasklist_GRAD(1)%variables%nb_element_t => nb_element
tasklist_GRAD(1)%variables%cpt1_t => cpt1
tasklist_GRAD(1)%variables%dt_t => dt
tasklist_GRAD(1)%variables%dx_t => dx
tasklist_GRAD(1)%variables%p_element_t => p_element
tasklist_GRAD(1)%variables%u_prime_t => u_prime
tasklist_GRAD(1)%variables%u_prime_moins_t => u_prime_moins
tasklist_GRAD(1)%variables%u_prime_plus_t => u_prime_plus
tasklist_GRAD(1)%variables%taux_t => taux
tasklist_GRAD(1)%variables%grad_x_u_t => grad_x_u
tasklist_GRAD(1)%variables%grad_t_u_t => grad_t_u
tasklist_GRAD(1)%variables%grad_x_f_t => grad_x_f
tasklist_GRAD(1)%variables%grad_t_f_t => grad_t_f
tasklist_GRAD(1)%variables%ax_plus_t => ax_plus
tasklist_GRAD(1)%variables%ax_moins_t => ax_moins
tasklist_GRAD(1)%variables%ux_plus_t => ux_plus
tasklist_GRAD(1)%variables%ux_moins_t => ux_moins
tasklist_GRAD(1)%variables%sm_t => sm
tasklist_GRAD(1)%variables%flux_t => flux
tasklist_GRAD(1)%variables%tab0_t => tab0
tasklist_GRAD(1)%variables%tab_t => tab
You can find the whole code on: https://gitlab-sds.insa-cvl.fr/houeslat/stage_hecese/-/blob/master/OpenMP/hecese_OMP/app_management_test.f90.
I am facing a problem since the execution time increased and I feel that I'm executing the tasks not in parallel.
Related
I have an array of 20 tasks that I want to execute in a parallel way.
Instead of using OMP Tasking at the Master thread level, I decided to create 4 child processes using c_fork and to use !$OMP TASK at the child process level so that each CP uses a number of threads in order to execute a block of 4 tasks. (4 child processes*4 tasks =16 tasks).
Here is the declaration of the variables:
INTEGER ::ff !< the counter
type(tcb)::self !< self
type(variables),intent(inout)::var !< the variables
integer::STATUS !< STATUS
integer::any_child !< child process
integer::rc !< return code
!OpenMP variables
integer::num_thread !< the rank of the thread
integer::num_thread_cp !< the rank of a thread in the child process
integer::nthreads !< the number of threads
integer:: OMP_GET_THREAD_NUM !< function to get the rank of the thread
integer::OMP_GET_NUM_THREADS !< function to get the number of threads
!!$ type(tcb),dimension(20)::tasklist_GRAD !< the array of tasks
type(tcb),dimension(20),intent(inout)::tasks_ready_master !< the master array of tasks
!Variables pour gestion des threads
!!$ INTEGER(KIND=omp_lock_kind), SAVE :: lock
integer::number_children !< the number of child processes
integer::pid !< the pid
Here is the code I implemented:
!=======================================================================================================================================================
!$OMP PARALLEL PRIVATE(num_thread,nthreads,ff) &
!$OMP SHARED(tasks_ready_master)
num_thread=OMP_GET_THREAD_NUM() ! le rang du thread
nthreads=OMP_GET_NUM_THREADS() ! le nombre de threads
if (num_thread==0) then !< Thread Master (number 0)
number_children=4
do ff =5,20,number_children
pid=c_fork() !< spawn a child process
if (pid < 0) then
call perror('fork()' // c_null_char) !< Error
else if (pid == 0) then !< Child process
!$OMP TASK SHARED(tasks_ready_master) IF ((num_thread .ne. 0) .and. (num_thread .ne. 1))
call tasks_ready_master(ff)%f_ptr(self,var) !< execute the task
!$OMP END TASK
call exit(STATUS)
else
any_child=c_wait(STATUS) !< wait for the child process
if (any_child<0) then
stop 'error' !< Error
end if
end if
end do
end if
My problem is at this line : call tasks_ready_master(ff)%f_ptr(self,var) !< execute the task.
Suppose that we have 4 threads. Only 2 workers are available.
How can I be sure that each worker will execute a task in a block of 4 tasks then go to the next task (if not executed yet) ?
I hope that you understood my question. All I want is to parallelize the fork/wait method.
PS: Don't worry about the tcb and variables types.
I'm trying to improve the performances of my OpenMP code written in Fortran. In order to do that, I changed the variable threads_list from private to shared as you can see in the code below. My idea was to compute the elements of the thread_list array not only by the thread master (number 0) but also by multiple threads (the exact number is the nthreads-2). I was waiting for a smaller time of execution but no, the time of execution is getting 10 times bigger). I know that shared variables decrease the performances but I know also that increasing the number of threads computing a certain task increase the performances.
Here is the code of the subroutine where I use OpenMP:
subroutine management_16_tasks(tasklist_GRAD,ww,pas,cpt ,nb_element,cpt1,dt,dx,p_element,u_prime,u_prime_moins,u_prime_plus,&
&taux,grad_x_u,grad_t_u,grad_x_f,grad_t_f,ax_plus,ax_moins,ux_plus,ux_moins,sm,flux,tab0,tab)
INTEGER ::ff,pas
INTEGER,intent(inout)::cpt,cpt1,nb_element,ww
real(kind=REAL64) :: dt,dx
integer ,allocatable, dimension(:),intent(inout) ::p_element
REAL(KIND=REAL64) ,allocatable, dimension(:),intent(inout) :: u_prime,u_prime_moins, u_prime_plus,taux,grad_x_u,&
&grad_t_u,grad_t_f,grad_x_f,flux,sm
real(kind=REAL64),allocatable,dimension(:),intent(inout) :: ax_plus,ax_moins,ux_moins,ux_plus
REAL(KIND=REAL64) ,allocatable, dimension(:,:),intent(inout) ::tab0,tab
type(tcb)::self
!OpenMP variables
integer::num_thread,nthreads
integer, external :: OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
type(tcb),dimension(20)::tasklist_GRAD,tasks_ready_master
!Variables pour gestion des threads
integer,allocatable,dimension(:)::threads_list !liste contenant les nums des threads workers
integer,dimension(100)::threads_list_all !liste contenant les nums des threads workers dans l'ordre selon les tâches
integer,dimension(16)::threads_list_part3 ! le reste des tâches
!=======================================================================================================================================================
!$OMP PARALLEL PRIVATE(num_thread,threads_list_all,nthreads,ff) &
!$OMP SHARED(tasklist_GRAD,tasks_ready_master) &
!$OMP SHARED(threads_list_part3,threads_list)
num_thread=OMP_GET_THREAD_NUM() ! le rang du thread
nthreads=OMP_GET_NUM_THREADS() ! le nombre de threads
!Thread Application Master (numero 1)
if (num_thread==1) then
do ff=5,20 ! 16 tâches
if (associated(tasklist_GRAD(ff)%f_ptr) .eqv. .true.) then
tasks_ready_master(ff) = tasklist_GRAD(ff)
tasks_ready_master(ff)%state=STATE_READY
end if
end do
end if
!!$ !$OMP BARRIER
if (num_thread==0) then
allocate(threads_list(nthreads-2)) ! liste des threads workers
end if
if ((num_thread .ne. 0) .and. (num_thread .ne. 1)) then
!$OMP DO SCHEDULE(STATIC,1)
do ff=1,nthreads-2
threads_list(ff)=ff+1 ! 2,3,..,nombre de threads-2
end do
!$OMP END DO
end if
!Thread Master (numero 0)
if (num_thread==0) then
do ff=5,20,nthreads-2
if (tasks_ready_master(ff)%state==STATE_READY) then
threads_list_all(ff-4:ff+nthreads-7)=threads_list(:)
end if
end do
threads_list_part3=threads_list_all(1:16) ! 16 tâches
deallocate(threads_list)
end if
!$OMP BARRIER
!Threads workers
do ff=5,20
if (num_thread==threads_list_part3(ff-4)) then
call tasks_ready_master(ff)%f_ptr(self,ww,pas,cpt ,nb_element,cpt1,dt,dx,p_element,u_prime,u_prime_moins,u_prime_plus,&
&taux,grad_x_u,grad_t_u,grad_x_f,grad_t_f,ax_plus,ax_moins,ux_plus,ux_moins,sm,flux,tab0,tab)
tasks_ready_master(ff)%state=STATE_RUNNING
end if
!$OMP BARRIER
end do
!Thread Master (numero 0)
if (num_thread==0) then
do ff=5,20
if (tasks_ready_master(ff)%state==STATE_RUNNING) then
tasklist_GRAD(ff)%state=STATE_RUNNING
end if
end do
end if
!$OMP END PARALLEL
end subroutine management_16_tasks
I moved the part of the allocation of threads_list from inside the if (num_threads==0) then to outside. I changed threads_list from private to shared. The allocation is still done by only one thread. I added the filter if ((num_thread .ne. 0) .and. (num_thread .ne. 1)) then and added also !$OMP DO SCHEDULE (STATIC,1) so that each thread computes an element of the array.
Please help me understand how I can improve sequential, unformatted I/O throughput with (G)Fortran, especially when working on NVMe SSDs.
I wrote a little test program, see bottom of this post. What this does is open one or more files in parallel (OpenMP) and write an array of random numbers into it. Then it flushes system caches (root required, otherwise the read test will most likely read from memory) opens the files, and reads from them. Time is measured in wall time (trying to include only I/O-related times), and performance numbers are given in MiB/s. The program loops until aborted.
The hardware I am using for testing is a Samsung 970 Evo Plus 1TB SSD, connected via 2 PCIe 3.0 lanes. So in theory, it should be capable of ~1500MiB/s sequential reads and writes.
Testing beforehand with "dd if=/dev/zero of=./testfile bs=1G count=1 oflag=direct" results in ~750MB/s. Not too great, but still better than what I get with Gfortran. And depending on who you ask, dd should not be used for benchmarking anyway. This is just to make sure that the hardware is in theory capable of more.
Results with my code tend to get better with larger file size, but even with 1GiB it caps out at around 200MiB/s write, 420MiB/s read. Using more threads (e.g. 4) increases write speeds a bit, but only to around 270MiB/s.
I made sure to keep the benchmark runs short, and give the SSD time to relax between tests.
I was under the impression that it should be possible to saturate 2 PCIe 3.0 lanes worth of bandwidth, even with only a single thread. At least when using unformatted I/O.
The code does not seem to be CPU limited, top shows less than 50% usage on a single core if I move the allocation and initialization of the "values" field out of the loop. Which still does not bode well for overall performance, considering that I would like to see numbers that are at least 5 times higher.
I also tried to use access=stream for the open statements, but to no avail.
So what seems to be the problem?
Is my code wrong/unoptimized? Are my expectations too high?
Platform used:
Opensuse Leap 15.1, Kernel 4.12.14-lp151.28.36-default
2x AMD Epyc 7551, Supermicro H11DSI, Samsung 970 Evo Plus 1TB (2xPCIe 3.0)
gcc version 8.2.1, compiler options: -ffree-line-length-none -O3 -ffast-math -funroll-loops -flto
MODULE types
implicit none
save
INTEGER, PARAMETER :: I8B = SELECTED_INT_KIND(18)
INTEGER, PARAMETER :: I4B = SELECTED_INT_KIND(9)
INTEGER, PARAMETER :: SP = KIND(1.0)
INTEGER, PARAMETER :: DP = KIND(1.0d0)
END MODULE types
MODULE parameters
use types
implicit none
save
INTEGER(I4B) :: filesize ! file size in MiB
INTEGER(I4B) :: nthreads ! number of threads for parallel ececution
INTEGER(I4B) :: alloc_size ! size of the allocated data field
END MODULE parameters
PROGRAM iometer
use types
use parameters
use omp_lib
implicit none
CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
CHARACTER(LEN=40) :: dummy_char1
CHARACTER(LEN=110) :: filename
CHARACTER(LEN=10) :: filenumber
INTEGER(I4B) :: thread, tunit, n
INTEGER(I8B) :: counti, countf, count_rate
REAL(DP) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
REAL(SP), DIMENSION(:), ALLOCATABLE :: values
call system_clock(counti,count_rate)
call getarg(1,directory_char)
dummy_char1 = ' directory to test:'
write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
call getarg(2,filesize_char)
dummy_char1 = ' file size (MiB):'
read(filesize_char,*) filesize
write(*,'(A40,I12)') dummy_char1, filesize
call getarg(3,nthreads_char)
dummy_char1 = ' number of parallel threads:'
read(nthreads_char,*) nthreads
write(*,'(A40,I12)') dummy_char1, nthreads
alloc_size = filesize * 262144
dummy_char1 = ' allocation size:'
write(*,'(A40,I12)') dummy_char1, alloc_size
mib_written = real(alloc_size,kind=dp) * real(nthreads,kind=dp) / 1048576.0_dp
mib_read = mib_written
CALL OMP_SET_NUM_THREADS(nthreads)
do while(.true.)
!$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
thread = omp_get_thread_num()
write(filenumber,'(I0.10)') thread
filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
allocate(values(alloc_size))
call random_seed()
call RANDOM_NUMBER(values)
tunit = thread + 100
!$OMP BARRIER
!$OMP MASTER
call system_clock(counti)
!$OMP END MASTER
!$OMP BARRIER
open(unit=tunit, file=trim(adjustl(filename)), status='replace', action='write', form='unformatted')
write(tunit) values
close(unit=tunit)
!$OMP BARRIER
!$OMP MASTER
call system_clock(countf)
telapsed_write = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
write_speed = mib_written/telapsed_write
!write(*,*) 'write speed (MiB/s): ', write_speed
call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
call system_clock(counti)
!$OMP END MASTER
!$OMP BARRIER
open(unit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted')
read(tunit) values
close(unit=tunit)
!$OMP BARRIER
!$OMP MASTER
call system_clock(countf)
telapsed_read = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
read_speed = mib_read/telapsed_read
write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
!$OMP END MASTER
!$OMP BARRIER
deallocate(values)
!$OMP END PARALLEL
call sleep(1)
end do
END PROGRAM iometer
The mistake in your code is that in your calculation of mib_written you have forgotten to take into account the size of a real(sp) variable (4 bytes). Thus your results are a factor of 4 too low. E.g. calculate it as
mib_written = filesize * nthreads
Some minor nits, some specific to GFortran:
Don't repeatedly call random_seed, particularly not from each thread. If you want to call it, call it once in the beginning of the program.
You can use open(newunit=tunit, ...) to let the compiler runtime allocate a unique unit number for each file.
If you want the 'standard' 64-bit integer/floating point kinds, you can use the variables int64 and real64 from the iso_fortran_env intrinsic module.
For testing with larger files, you need to make alloc_size of kind int64.
Use the standard get_command_argument intrinsic instead of the nonstandard getarg.
access='stream' is slightly faster than the default (sequential) as there's no need to handle the record length markers.
Your test program with these fixes (and the parameters module folded into the main program) below:
PROGRAM iometer
use iso_fortran_env
use omp_lib
implicit none
CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
CHARACTER(LEN=40) :: dummy_char1
CHARACTER(LEN=110) :: filename
CHARACTER(LEN=10) :: filenumber
INTEGER :: thread, tunit
INTEGER(int64) :: counti, countf, count_rate
REAL(real64) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
REAL, DIMENSION(:), ALLOCATABLE :: values
INTEGER :: filesize ! file size in MiB
INTEGER :: nthreads ! number of threads for parallel ececution
INTEGER(int64) :: alloc_size ! size of the allocated data field
call system_clock(counti,count_rate)
call get_command_argument(1, directory_char)
dummy_char1 = ' directory to test:'
write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
call get_command_argument(2, filesize_char)
dummy_char1 = ' file size (MiB):'
read(filesize_char,*) filesize
write(*,'(A40,I12)') dummy_char1, filesize
call get_command_argument(3, nthreads_char)
dummy_char1 = ' number of parallel threads:'
read(nthreads_char,*) nthreads
write(*,'(A40,I12)') dummy_char1, nthreads
alloc_size = filesize * 262144_int64
dummy_char1 = ' allocation size:'
write(*,'(A40,I12)') dummy_char1, alloc_size
mib_written = filesize * nthreads
dummy_char1 = ' MiB written:'
write(*, '(A40,g0)') dummy_char1, mib_written
mib_read = mib_written
CALL OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
do while (.true.)
thread = omp_get_thread_num()
write(filenumber,'(I0.10)') thread
filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
if (.not. allocated(values)) then
allocate(values(alloc_size))
call RANDOM_NUMBER(values)
end if
open(newunit=tunit, file=filename, status='replace', action='write', form='unformatted', access='stream')
!$omp barrier
!$omp master
call system_clock(counti)
!$omp end master
!$omp barrier
write(tunit) values
close(unit=tunit)
!$omp barrier
!$omp master
call system_clock(countf)
telapsed_write = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
write_speed = mib_written/telapsed_write
call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
!$OMP END MASTER
open(newunit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted', access='stream')
!$omp barrier
!$omp master
call system_clock(counti)
!$omp end master
!$omp barrier
read(tunit) values
close(unit=tunit)
!$omp barrier
!$omp master
call system_clock(countf)
telapsed_read = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
read_speed = mib_read/telapsed_read
write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
!$OMP END MASTER
call sleep(1)
end do
!$OMP END PARALLEL
END PROGRAM iometer
I want to benchmark some Fortran code with OpenMP-threads with a critical-section. To simulate a realistic environment I tried to generate some load before this critical-section.
!Kompileraufruf: gfortran -fopenmp -o minExample.x minExample.f90
PROGRAM minExample
USE omp_lib
IMPLICIT NONE
INTEGER :: n_chars, real_alloced
INTEGER :: nx,ny,nz,ix,iy,iz, idx
INTEGER :: nthreads, lasteinstellung,i
INTEGER, PARAMETER :: dp = kind(1.0d0)
REAL (KIND = dp) :: j
CHARACTER(LEN=32) :: arg
nx = 2
ny = 2
nz = 2
lasteinstellung= 10000
CALL getarg(1, arg)
READ(arg,*) nthreads
CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"
n_chars = 0
idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp shared(nx,ny,nz,n_chars) &
!$omp private(ix,iy,iz, idx) &
!$omp private(lasteinstellung,j) !&
DO iz=-nz,nz
DO iy=-ny,ny
DO ix=-nx,nx
! WRITE(*,*) ix,iy,iz
j = 0.0d0
DO i=1,lasteinstellung
j = j + real(i)
END DO
!$omp critical
n_chars = n_chars + 1
idx = n_chars
!$omp end critical
END DO
END DO
END DO
END PROGRAM
I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD
Executing this code gives some strange behaviour depending on the thread-count (set with OMP_SET_NUM_THREADS): compared with one thread (6ms) the execution with more threads costs a lot more time (2 threads: 16000ms, 4 threads: 9000ms) on my multicore machine.
What could cause this behaviour? Is there a better (but still easy) way to generate load without running in some cache-effects or related things?
edit: strange behaviour: if I have the write in the nested loops, the execution speeds dramatically up with 2 threads. If its commented out, the execution with 2 or 3 threads takes forever (write shows very slow incrementation of loop variables)...but not with 1 or 4 threads. I tried this code also on another multicore machine. There it takes for 1 and 3 threads forever but not for 2 or 4 threads.
If the code you are showing is really complete you are missing definition of loadSet in the parallel section in which it is private. It is undefined and loop
DO i=1,loadSet
j = j + real(i)
END DO
can take a completely arbitrary number of iterations.
If the value is defined somewhere before in the code you do not show you probably want firstprivate instead of private.
I am quite new to Perl, especially Perl Threads.
I want to accomplish:
Have 5 threads that will en-queue data(Random numbers) into a
Thread::queue
Have 3 threads that will de-queue data from the
Thread::queue.
The complete code that I wrote in order to achieve above mission:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;
my $queue = new Thread::Queue();
our #Enquing_threads;
our #Dequeuing_threads;
sub buildQueue
{
my $TotalEntry=1000;
while($TotalEntry-- >0)
{
my $query = rand(10000);
$queue->enqueue($query);
print "Enque thread with TID " .threads->tid . " got $query,";
print "Queue Size: " . $queue->pending . "\n";
}
}
sub process_Queue
{
my $query;
while ($query = $queue->dequeue)
{
print "Dequeu thread with TID " .threads->tid . " got $query\n";
}
}
push #Enquing_threads,threads->create(\&buildQueue) for 1..5;
push #Dequeuing_threads,threads->create(\&process_Queue) for 1..3;
Issues that I am Facing:
The threads are not running as concurrently as expected.
The entire program abnormally exit with following console output:
Perl exited with active threads:
8 running and unjoined
0 finished and unjoined
0 running and detached
Enque thread with TID 5 got 6646.13585023883,Queue Size: 595
Enque thread with TID 1 got 3573.84104215917,Queue Size: 595
Any help on code-optimization is appreciated.
This behaviour is to be expected: When the main thread exits, all other threads exit as well. If you don't care, you can $thread->detach them. Otherwise, you have to manually $thread->join them, which we'll do.
The $thread->join waits for the thread to complete, and fetches the return value (threads can return values just like subroutines, although the context (list/void/scalar) has to be fixed at spawn time).
We will detach the threads that enqueue data:
threads->create(\&buildQueue)->detach for 1..5;
Now for the dequeueing threads, we put them into a lexical variable (why are you using globals?), so that we can dequeue them later:
my #dequeue_threads = map threads->create(\&process_queue), 1 .. 3;
Then wait for them to complete:
$_->join for #dequeue_threads;
We know that the detached threads will finish execution before the programm exits, because the only way for the dequeueing threads to exit is to exhaust the queue.
Except for one and a half bugs. You see, there is a difference between an empty queue and a finished queue. If the queue is just empty, the dequeueing threads will block on $queue->dequeue until they get some input. The traditional solution is to dequeue while the value they get is defined. We can break the loop by supplying as many undef values in the queue as there are threads reading from the queue. More modern version of Thread::Queue have an end method, that makes dequeue return undef for all subsequent calls.
The problem is when to end the queue. We should to this after all enqueueing threads have exited. Which means, we should wait for them manually. Sigh.
my #enqueueing = map threads->create(\&enqueue), 1..5;
my #dequeueing = map threads->create(\&dequeue), 1..3;
$_->join for #enqueueing;
$queue->enqueue(undef) for 1..3;
$_->join for #dequeueing;
And in sub dequeuing: while(defined( my $item = $queue->dequeue )) { ... }.
Using the defined test fixes another bug: rand can return zero, although this is quite unlikely and will slip through most tests. The contract of rand is that it returns a pseudo-random floating point number between including zero and excluding some upper bound: A number from the interval [0, x). The bound defaults to 1.
If you don't want to join the enqueueing threads manually, you could use a semaphore to signal completition. A semaphore is a multithreading primitive that can be incremented and decremented, but not below zero. If a decrement operation would let the drop count below zero, the call blocks until another thread raises the count. If the start count is 1, this can be used as a flag to block resources.
We can also start with a negative value 1 - $NUM_THREADS, and have each thread increment the value, so that only when all threads have exited, it can be decremented again.
use threads; # make a habit of importing `threads` as the first thing
use strict; use warnings;
use feature 'say';
use Thread::Queue;
use Thread::Semaphore;
use constant {
NUM_ENQUEUE_THREADS => 5, # it's good to fix the thread counts early
NUM_DEQUEUE_THREADS => 3,
};
sub enqueue {
my ($out_queue, $finished_semaphore) = #_;
my $tid = threads->tid;
# iterate over ranges instead of using the while($maxval --> 0) idiom
for (1 .. 1000) {
$out_queue->enqueue(my $val = rand 10_000);
say "Thread $tid enqueued $val";
}
$finished_semaphore->up;
# try a non-blocking decrement. Returns true only for the last thread exiting.
if ($finished_semaphore->down_nb) {
$out_queue->end; # for sufficiently modern versions of Thread::Queue
# $out_queue->enqueue(undef) for 1 .. NUM_DEQUEUE_THREADS;
}
}
sub dequeue {
my ($in_queue) = #_;
my $tid = threads->tid;
while(defined( my $item = $in_queue->dequeue )) {
say "thread $tid dequeued $item";
}
}
# create the queue and the semaphore
my $queue = Thread::Queue->new;
my $enqueuers_ended_semaphore = Thread::Semaphore->new(1 - NUM_ENQUEUE_THREADS);
# kick off the enqueueing threads -- they handle themself
threads->create(\&enqueue, $queue, $enqueuers_ended_semaphore)->detach for 1..NUM_ENQUEUE_THREADS;
# start and join the dequeuing threads
my #dequeuers = map threads->create(\&dequeue, $queue), 1 .. NUM_DEQUEUE_THREADS;
$_->join for #dequeuers;
Don't be suprised if the threads do not seem to run in parallel, but sequentially: This task (enqueuing a random number) is very fast, and is not well suited for multithreading (enqueueing is more expensive than creating a random number).
Here is a sample run where each enqueuer only creates two values:
Thread 1 enqueued 6.39390993005694
Thread 1 enqueued 0.337993319585337
Thread 2 enqueued 4.34504733960242
Thread 2 enqueued 2.89158054485114
Thread 3 enqueued 9.4947585773571
Thread 3 enqueued 3.17079715055542
Thread 4 enqueued 8.86408863197179
Thread 5 enqueued 5.13654995317669
Thread 5 enqueued 4.2210886147538
Thread 4 enqueued 6.94064174636395
thread 6 dequeued 6.39390993005694
thread 6 dequeued 0.337993319585337
thread 6 dequeued 4.34504733960242
thread 6 dequeued 2.89158054485114
thread 6 dequeued 9.4947585773571
thread 6 dequeued 3.17079715055542
thread 6 dequeued 8.86408863197179
thread 6 dequeued 5.13654995317669
thread 6 dequeued 4.2210886147538
thread 6 dequeued 6.94064174636395
You can see that 5 managed to enqueue a few things before 4. The threads 7 and 8 don't get to dequeue anything, 6 is too fast. Also, all enqueuers are finished before the dequeuers are spawned (for such a small number of inputs).