Fortran and OpenMP thread groups for independent tasks - multithreading

I need to run two independent tasks using OpenMP. One of them is way more involved than the other, so it would be ideal to split the available threads such that the more complicated task uses more of them. After these two tasks are finished, I need to use both of their outputs. I am not entirely sure if this can be done with OpenMP, so any suggestion would be very useful.
This is an attempt to illustrate what I need. There are two independent subroutines with separate inputs and outputs. Subroutine mysub2 is more complex than mysub1. It has multiple nested loops, so it would benefit more from having more threads running it. Out of 6 threads, I would like to assign 2 of them to execute mysub1 and 4 of them to execute mysub2, simultaneously. Once each subroutine has produced its output, z1 and z2, both outputs are used to compute z3.
In this attempt I was trying to assign threads 0 and 1 to task 1, and the other 4 to task 2. Obviously, this doesn't work as intended because it runs mysub1 twice and mysub2 four times, but I have no idea how to achieve what I need.
module mymod
   implicit none
contains
   subroutine mysub1(x1,y1,z1)
      ! Element-wise product of vectors
      real,intent(in) :: x1(:),y1(:)
      real,intent(out) :: z1(size(x1))
      integer :: i
      !$omp parallel do private(i)
      do i = 1,size(x1)
         z1(i) = x1(i) * y1(i)
      end do
      !$omp end parallel do
      print *, 'Done with mysub1'
   end subroutine mysub1
   subroutine mysub2(x2,y2,z2)
      ! Matrix multiplication
      real,intent(in) :: x2(:,:),y2(:,:)
      real,intent(out) :: z2(size(x2,1),size(y2,2))
      integer :: i,j
      !$omp parallel do private(i,j)
      do i = 1,size(x2,1)
         do j = 1,size(y2,2)
            z2(i,j) = dot_product(x2(i,:), y2(:,j))
         end do
      end do
      !$omp end parallel do
      print *, 'Done with mysub2'
   end subroutine mysub2
end module mymod
program main
   use omp_lib
   use mymod
   implicit none
   integer :: tid
   integer,parameter :: m = 2
   integer,parameter :: n = 3
   integer,parameter :: p = 4
   real :: x1(m),y1(m),z1(m)
   real :: x2(m,n),y2(n,p),z2(m,p),z3
   ! Setting total number of threads to 6
   call omp_set_num_threads(6)
   ! Assigning arbitrary values for illustration purposes
   x1 = 1.0
   y1 = 2.0
   x2 = 3.0
   y2 = 4.0
   !$omp parallel private(tid)
   ! Getting thread number
   tid = omp_get_thread_num()
   if ((tid == 0) .or. (tid == 1)) then
      ! Task 1 to be executed in two threads, tid = 0,1
      call mysub1(x1,y1,z1)
   else
      ! Task 2 to be executed in four threads, tid = 2,3,4,5
      call mysub2(x2,y2,z2)
   end if
   !$omp end parallel
   ! Using z1 and z2 (serially, no need to parallelize)
   z3 = sum(z1) + sum(z2)
   print *, 'Final output', z3
end program main
Of course, this is just an example. I know I don't need to use mysub2 to do matrix multiplication. I'm just trying to illustrate that mysub2 is more complex and hence, it would be ideal to use more threads for it, without having to paste several hundred lines of the actual code I have.
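One way the 2 + 4 split described above could be sketched is with nested parallel regions: an outer sections construct runs each task once, and each task opens its own inner team of the requested size. This is only a sketch under assumptions. The extra nthreads argument on mysub1 and mysub2 is not in the original code, and whether the inner teams really get 2 and 4 threads depends on the OpenMP runtime and on nested parallelism being enabled (done here with omp_set_max_active_levels).

```fortran
program split_threads
   use omp_lib
   implicit none
   integer, parameter :: m = 2, n = 3, p = 4
   real :: x1(m), y1(m), z1(m)
   real :: x2(m,n), y2(n,p), z2(m,p), z3

   x1 = 1.0
   y1 = 2.0
   x2 = 3.0
   y2 = 4.0

   ! Allow two levels of active parallel regions (outer sections + inner loops)
   call omp_set_max_active_levels(2)

   ! Outer level: two threads, one per task, so each task runs exactly once
   !$omp parallel sections num_threads(2)
   !$omp section
   call mysub1(x1, y1, z1, 2)   ! inner team of 2 threads (assumed argument)
   !$omp section
   call mysub2(x2, y2, z2, 4)   ! inner team of 4 threads (assumed argument)
   !$omp end parallel sections

   ! Both outputs are complete here: the sections construct ends with a barrier
   z3 = sum(z1) + sum(z2)
   print *, 'Final output', z3
   if (abs(z3 - 292.0) > 1.0e-3) error stop 'unexpected result'

contains

   subroutine mysub1(x, y, z, nthreads)
      ! Element-wise product of vectors, run by an inner team
      real, intent(in) :: x(:), y(:)
      real, intent(out) :: z(:)
      integer, intent(in) :: nthreads
      integer :: i
      !$omp parallel do num_threads(nthreads)
      do i = 1, size(x)
         z(i) = x(i) * y(i)
      end do
      !$omp end parallel do
   end subroutine mysub1

   subroutine mysub2(x, y, z, nthreads)
      ! Matrix multiplication, run by an inner team
      real, intent(in) :: x(:,:), y(:,:)
      real, intent(out) :: z(:,:)
      integer, intent(in) :: nthreads
      integer :: i, j
      !$omp parallel do private(j) num_threads(nthreads)
      do i = 1, size(x,1)
         do j = 1, size(y,2)
            z(i,j) = dot_product(x(i,:), y(:,j))
         end do
      end do
      !$omp end parallel do
   end subroutine mysub2
end program split_threads
```

With the values from the question, z1 sums to 4 and z2 to 288, so the program should print 292. The num_threads clause is a request rather than a guarantee, so it may be worth printing omp_get_num_threads() inside each inner region to confirm the team sizes on a given system.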

Related

Manually assigning iteration number to OpenMP schedule static

I am trying to parallelize the following snippet of code using OpenMP. The portion shown is basically a version of the Thomas matrix algorithm (matrix inversion) for implicit solver in fluid flow/turbulence simulations. I have stripped it down from its exact form to make the question easily understandable.
REAL(KIND=DP), INTENT(INOUT), DIMENSION(1:10) :: A, C_old, C_new, D
INTEGER :: K
!$OMP PARALLEL DO SCHEDULE(STATIC) NUM_THREADS(3)&
!$OMP SHARED(A, C_old, C_new, D)
DO K = 2,9
C_new(K) = C_old(K)/(A(K)*C_new(K-1))
END DO
!$OMP END PARALLEL DO
C(1) = C(10) = 0 because they are at the fixed boundaries (top and the bottom wall of a channel). Hence there is no need to update them.
As is evident, the calculation of C(2) needs the value of C(1), the calculation of C(3) needs the MODIFIED C(2), and so on. In other words, it is a forward sweep (the algorithm is here https://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm). Hence, for three threads, I want the distribution of iterations across the threads to be [1,2,3,4], [4,5,6,7], [7,8,9,10]. But if I just use SCHEDULE(STATIC), I cannot control this, and the distribution across 3 threads might be [1,2,3], [4,5,6], [7,8,9,10]. This would give erroneous results because the values at grid points 4 and 7 are not passed on to the other threads.
Please let me know if there is a way where I can force the indices to be solved by a certain thread without any race? Say,
thread #1 solve indices [1,2,3,4]
thread #2 solve indices [4,5,6,7] etc.

Odd behavior of matrix multiplication with threading in Julia

I am trying to do some linear algebra in a Threads.@threads for loop, and it returns some weird results. It seems that matrices are not multiplied properly in the loop. Is it safe to do in a threaded for loop?
Below is a minimal working example to generate a TxR table of NxN matrices. For each of R iterations the (t+1)-th matrix is the product of the t-th matrix with another random matrix. The multiplications are performed by different threads, and then checked for correctness by one thread. The function should return a matrix of zeros. It does so for N<=3; however, there are a few ones in the result for N>=4.
function testMM(T, R, N)
m1 = zeros(Int64, (T,R,N,N))
m2 = rand(0:2, (T-1,R,N,N))
m1[1,:,:,:] = rand(0:1,(R,N,N))
Threads.@threads for i=1:R
for t=2:T
m1[t,i,:,:] = m2[t-1,i,:,:] * m1[t-1,i,:,:]
end
end
odds = zeros(Int64,(T-1,R))
for i=1:R
for t=2:T
if m1[t,i,:,:] != m2[t-1,i,:,:] * m1[t-1,i,:,:]
odds[t-1,i] = 1
end
end
end
return odds
end
Threads.nthreads() is 4 for me. Tested on stable 64bit Julia 0.5.2, 0.5.3, 0.6.0 on Windows.
Edit: This example is even simpler: several copies of a matrix are squared several times independently. The result should be the same for all copies, but the function usually returns false for N>=4. Looks as if the data of different threads is mixed somewhere inside BLAS.
function testMM2(T, R, N)
m0 = rand(0:2, (N,N))
m = [deepcopy(m0) for i=1:R]
Threads.@threads for i=1:R
for t=1:T
m[i] = m[i]^2
end
end
return all(x->(x==m[1]),m)
end

Parallel processing in Dot Product

I am having a heck of a time trying to figure out how to get a simple Dot Product calculation to parallel process on a Fortran code compiled by the Intel ifort compiler v 16. I have the section of code below, it is part of a program used for a more complex process, but this is where most of the time is spent by the program:
double precision function ddot(n,dx,incx,dy,incy)
c
c forms the dot product of two vectors.
c uses unrolled loops for increments equal to one.
c jack dongarra, linpack, 3/11/78.
c modified 12/3/93, array(1) declarations changed to array(*)
c
double precision dx(*),dy(*),dtemp
integer i,incx,incy,ix,iy,m,mp1,n
c
CALL OMP_SET_NUM_THREADS(12)
ddot = 0.0d0
dtemp = 0.0d0
if(n.le.0)return
if(incx.eq.1.and.incy.eq.1)go to 20
c
c code for unequal increments or equal increments
c not equal to 1
c
ix = 1
iy = 1
if(incx.lt.0)ix = (-n+1)*incx + 1
if(incy.lt.0)iy = (-n+1)*incy + 1
do 10 i = 1,n
dtemp = dtemp + dx(ix)*dy(iy)
ix = ix + incx
iy = iy + incy
10 continue
ddot = dtemp
return
c
c code for both increments equal to 1
c
c
c clean-up loop
c
20 m = mod(n,5)
if( m .eq. 0 ) go to 40
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,m) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 30 i = 1,m
dtemp = dtemp + dx(i)*dy(i)
30 continue
!$OMP END PARALLEL DO
if( n .lt. 5 ) go to 60
40 mp1 = m + 1
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,n,mp1) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 50 i = mp1,n,5
dtemp = dtemp + dx(i)*dy(i) + dx(i + 1)*dy(i + 1) +
* dx(i + 2)*dy(i + 2) + dx(i + 3)*dy(i + 3) + dx(i + 4)*dy(i + 4)
50 continue
!$OMP END PARALLEL DO
60 ddot = dtemp
return
end
I am new to the OpenMP commands and am pretty sure I have something funny in there that slows the whole thing down more than on a single core. Currently I have tried to run it on 4 threads on a slower 4(4) core machine where it actually went a bit faster than the large 20(40) core machine where we designated 12 threads for the processing. At this point I'm thinking the code is funny and doing something I don't want.
The Do loop higher up could be parallelized too, but I didn't know how to define the ix and iy and so just left it alone since it doesn't spend much time there.
Precision is very important, so the compiler is set to fp-mode precise. I don't know if that matters at all, but when the code does manage to generate answers they do appear correct. Basically, I'm just trying to figure out how to speed up this code, but parallel processing seems to slow it down instead.
There are a bunch of Intel webinars you can look up to help you.
I have aperture optimum histogram code that does OpenMP SIMD REDUCTION. So I decided to bring in the vector into a (big,nthreads) array and do each thread in a parallel region. Generally it runs slower than just using a single core. (I have not tried vtune on that yet)
Other similar array approaches with FFTs run faster, and the cores are all at 100% with good scaling.
Basically one either needs to work out the issues or test which works better. Any OpenMP parallel takes a long time to start, so you want it way outside, and not down at the tightest level.
Generally you can be better off with PURE functions or subroutines, and using
!DEC$ ATTRIBUTES VECTOR ...
In ifort16 there is also VECTOR(REF(variable)) and the reference is new.
Once all that is singing, then the parallel can be attempted.
Your DO 50 would need some big numbers for 'i' in order to make parallelised code go faster, or the OpenMP parallel 'startup' will gobble up too much time.
There is not a lot other than vtune to help find cache misses (etc.) and give you insight into getting faster code (which is really code without slowdowns). After all that, it may be worthwhile to also compile using gfortran. I generally find that running through two compilers gives more insight into making better overall code. But if you get gains from !DEC$ extensions, then gfortran may not help. CONTIGUOUS can also be worth trying in functions.
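The point about parallel startup cost can be illustrated with a small sketch: a single reduction loop whose parallel region is only activated above a size threshold, via the OpenMP if clause. The function name pdot and the threshold min_parallel_n are hypothetical, and the threshold value is a placeholder that would need to be tuned by measurement on the target machine.

```fortran
program dot_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 1000000
   real(dp), allocatable :: x(:), y(:)
   real(dp) :: s

   allocate(x(n), y(n))
   x = 1.0_dp
   y = 2.0_dp

   s = pdot(x, y)
   print *, 'dot product =', s
   if (abs(s - 2.0_dp*n) > 1.0e-6_dp) error stop 'unexpected result'

contains

   function pdot(dx, dy) result(dtemp)
      ! Dot product of two unit-stride vectors (hypothetical helper)
      real(dp), intent(in) :: dx(:), dy(:)
      real(dp) :: dtemp
      integer :: i
      ! Placeholder threshold: below this size the loop runs on one thread
      integer, parameter :: min_parallel_n = 100000

      dtemp = 0.0_dp
      !$omp parallel do reduction(+:dtemp) schedule(static) &
      !$omp&   if(size(dx) >= min_parallel_n)
      do i = 1, size(dx)
         dtemp = dtemp + dx(i)*dy(i)
      end do
      !$omp end parallel do
   end function pdot
end program dot_sketch
```

Because a reduction combines per-thread partial sums, the order of the floating-point additions differs from the serial loop, so the last bits of the result may change with the thread count. That may matter when precision is a hard requirement.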

executing a command on a variable used by multiple threads

Given the threading scenario where two threads execute the computation:
x = x + 1 where x is a shared variable
What are the possible results and describe why your answer could happen.
This is a textbook problem from my OS book, and I was curious whether I needed more information to answer it, such as what x is initialized to and whether the threads execute this command once or repeatedly. My original answer was that there could be two possible results, depending on the order in which the OS schedules the threads.
This is a rather simple task, so there probably isn't too much that could go wrong.
The only issue I can immediately think of is if one thread uses an old value of x in its calculation.
eg:
start with x = 2
1) thread A reads x = 2
2) thread B reads x = 2
3) thread A writes x = 2 + 1
x = 3
4) thread B writes x = 2(old value of x) + 1
x = 3 when it should be 4
this would be even more apparent if more than 1 thread reads the value before the first thread writes.
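The lost update traced above can be avoided by making the read-modify-write indivisible. The following is a minimal sketch in Fortran with OpenMP, assuming two threads that each increment a shared x many times; the iteration count n_iter is an arbitrary choice for illustration. Removing the atomic directive would reintroduce the race, and the final value could then fall short of the expected total.

```fortran
program shared_increment
   use omp_lib
   implicit none
   integer, parameter :: n_iter = 100000   ! arbitrary count for illustration
   integer :: x, i, nthreads

   x = 0
   nthreads = 1

   !$omp parallel num_threads(2) private(i)
   ! Record how many threads the runtime actually provided
   !$omp single
   nthreads = omp_get_num_threads()
   !$omp end single

   do i = 1, n_iter
      ! The atomic directive makes the read, add and write of x indivisible
      !$omp atomic
      x = x + 1
   end do
   !$omp end parallel

   print *, 'threads =', nthreads, ' x =', x
   if (x /= nthreads * n_iter) error stop 'lost update detected'
end program shared_increment
```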

OpenMP parallel numerical integration (summation) performance

I recently started studying parallel coding, I'm still at the beginning so I wanted to try some very simple coding. Since it is in my interest to perform parallel numerical integration I started with a simple summation Fortran code:
program par_hello_world
use omp_lib
implicit none
integer, parameter:: bign = 1000000000
integer:: i
double precision:: start, finish, start1, finish1, a
a = 0
call cpu_time(start)
!$OMP PARALLEL num_threads(8)
!$OMP DO REDUCTION(+:a)
do i = 1,bign
a = a + sqrt(1.0**5)
end do
!$OMP END DO
!$OMP END PARALLEL
call cpu_time(finish)
print*, 'parallel result:'
print*, a
print*, (finish-start)
a=0
call cpu_time(start1)
do i = 1,bign
a = a + sqrt(1.0**5)
end do
call cpu_time(finish1)
print*, 'sequential result:'
print*, a
print*, (finish1-start1)
end program
The code basically simulates a summation. I used the weird expression sqrt(1.0**5) to get a measurable computational time; if I used just 1, the computational time was so small that I could not compare the sequential code with the parallel one.
I tried to avoid the race condition by using the REDUCTION clause.
However I'm getting very strange time results:
If I raise the number of threads from 2 to 16 I don't get a reduction of computational time but somehow I even get an increase.
Incredibly it seems that also the sequential code is influenced by the choice of the threads number (I really don't understand why!) in particular it is raised if I raise the number of threads.
I get the correct result for the variable a
I think I'm doing something very wrong somewhere, but I'm clueless about it...
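One thing that may be worth checking for timings like these is the timer itself. With many compilers, cpu_time reports processor time accumulated over all threads, so it tends to grow with the thread count even when the elapsed time shrinks. The sketch below, under that assumption, measures the same parallel sum with both cpu_time and the OpenMP wall-clock timer omp_get_wtime so the two can be compared. The loop count is reduced from the original to keep the run short, and the thread count is left to the runtime.

```fortran
program wall_vs_cpu
   use omp_lib
   implicit none
   integer, parameter :: bign = 100000000   ! smaller than the original, for a quick run
   integer :: i
   double precision :: a, wall0, wall1
   real :: cpu0, cpu1

   a = 0.0d0
   call cpu_time(cpu0)        ! processor time, often summed over all threads
   wall0 = omp_get_wtime()    ! elapsed wall-clock time

   !$omp parallel do reduction(+:a)
   do i = 1, bign
      a = a + sqrt(1.0d0**5)
   end do
   !$omp end parallel do

   wall1 = omp_get_wtime()
   call cpu_time(cpu1)

   print *, 'parallel result:', a
   print *, 'wall-clock time (s):', wall1 - wall0
   print *, 'cpu_time (s):       ', cpu1 - cpu0
   if (a /= dble(bign)) error stop 'unexpected sum'
end program wall_vs_cpu
```

Note that an optimizing compiler may evaluate the constant sqrt(1.0d0**5) once at compile time, so the loop may do very little work per iteration and the measured speedup may be limited.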
