I'm adapting a serial code for parallel execution with OpenMP, but I get a bad approximation of the desired result (the value of pi). Both codes are shown below.
Is there something wrong?
program trap
use omp_lib
implicit none
double precision::suma=0.d0 ! sum is a scalar
double precision:: h,x,lima,limb
integer::n,i, istart, iend, thread_num, total_threads=4, ppt
integer(kind=8):: tic, toc, rate
double precision:: time
double precision, dimension(4):: pi= 0.d0
call system_clock(count_rate = rate)
call system_clock(tic)
lima=0.0d0; limb=1.0d0; suma=0.0d0; n=10000000
h=(limb-lima)/n
suma=h*(f(lima)+f(limb))*0.5d0 !first and last points
ppt= n/total_threads
!$ call omp_set_num_threads(total_threads)
!$omp parallel private (istart, iend, thread_num, i)
thread_num = omp_get_thread_num()
!$ istart = thread_num*ppt +1
!$ iend = min(thread_num*ppt + ppt, n-1)
do i=istart,iend ! this will control the loop in different images
x=lima+i*h
suma=suma+f(x)
pi(thread_num+1)=suma
enddo
!$omp end parallel
suma=sum(pi)
suma=suma*h
print *,"The value of pi is= ",suma ! print once from the first image
!print*, 'pi=' , pi
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Time ', time, 's'
contains
double precision function f(y)
double precision:: y
f=4.0d0/(1.0d0+y*y)
end function f
end program trap
!----------------------------------------------------------------------------------
program trap
implicit none
double precision::sum ! sum is a scalar
double precision:: h,x,lima,limb
integer::n,i
integer(kind=8):: tic, toc, rate
double precision:: time
call system_clock(count_rate = rate)
call system_clock(tic)
lima=0.0d0; limb=1.0d0; sum=0.0d0; n=10000000
h=(limb-lima)/n
sum=h*(f(lima)+f(limb))*0.5d0 !first and last points
do i=1,n-1 ! this will control the loop in different images
x=lima+i*h
sum=sum+f(x)
enddo
sum=sum*h
print *,"The value of pi is (serial exe)= ",sum ! print once from the first image
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Time serial execution', time, 's'
contains
double precision function f(y)
double precision:: y
f=4.0d0/(1.0d0+y*y)
end function f
end program trap
Compiled using:
$ gfortran -fopenmp -Wall -Wextra -O2 -Wall -o prog.exe test.f90
$ ./prog.exe
and
$ gfortran -Wall -Wextra -O2 -Wall -o prog.exe testserial.f90
$ ./prog.exe
In serial execution I get good approximations of pi (3.1415...), but with the parallel version I get (output from several parallel runs shown):
The value of pi is= 3.6731101425922810
Time 3.3386986702680588E-002 s
-------------------------------------------------------
The value of pi is= 3.1556004791445953
Time 8.3681479096412659E-002 s
------------------------------------------------------
The value of pi is= 3.2505952856717966
Time 5.1473543047904968E-002 s
There is a problem in your OpenMP parallel region.
You keep adding onto the shared variable suma from all threads at once.
Therefore, you need to specify a reduction clause.
Also, you did not specify the variable x to be private, so the threads overwrite each other's value of it.
I also changed a few more parts of your code:
You explicitly told each thread which index range it should use. Most often the compiler can figure that out more efficiently by itself, so I changed parallel to parallel do.
It is good practice to use default(none) on the OpenMP parallel construct. You then need to specify the data-sharing attribute of each variable explicitly.
program trap
use omp_lib
implicit none
double precision :: suma,h,x,lima,limb, time
integer :: n, i
integer, parameter :: total_threads=5
integer(kind=8) :: tic, toc, rate
call system_clock(count_rate = rate)
call system_clock(tic)
lima=0.0d0; limb=1.0d0; suma=0.0d0; n=10000000
h=(limb-lima)/n
suma=h*(f(lima)+f(limb))*0.5d0 !first and last points
call omp_set_num_threads(total_threads)
!$omp parallel do default(none) private(i, x) shared(lima, h, n) reduction(+: suma)
do i = 1, n
x=lima+i*h
suma=suma+f(x)
end do
!$omp end parallel do
suma=suma*h
print *,"The value of pi is= ", suma ! print once from the first image
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Time ', time, 's'
contains
double precision function f(y)
double precision:: y
f=4.0d0/(1.0d0+y*y)
end function
end program
Related
I need to run two independent tasks using OpenMP. One of them is way more involved than the other, so it would be ideal to split the available threads such that the more complicated task uses more of them. After these two tasks are finished, I need to use both of their outputs. I am not entirely sure if this can be done with OpenMP, so any suggestion would be very useful.
This is an attempt to illustrate what I need. There are two independent subroutines with separate inputs and outputs. Subroutine mysub2 is more complex than mysub1: it has multiple nested loops, so it would benefit more from having more threads running it. Out of 6 threads, I would like to assign 2 of them to execute mysub1 and 4 of them to mysub2, simultaneously. Once each subroutine has produced its output, z1 and z2 are both used to compute z3.
In this attempt I was trying to assign threads 0 and 1 to task 1, and the other 4 to task 2. Obviously, this doesn't work as intended because it runs mysub1 twice and mysub2 four times, but I have no idea how to achieve what I need.
module mymod
implicit none
contains
subroutine mysub1(x1,y1,z1)
! Element-wise product of vectors
real,intent(in) :: x1(:),y1(:)
real,intent(out) :: z1(size(x1))
integer :: i
!$omp parallel do private(i)
do i = 1,size(x1)
z1(i) = x1(i) * y1(i)
end do
!$omp end parallel do
print *, 'Done with mysub1'
end subroutine mysub1
subroutine mysub2(x2,y2,z2)
! Matrix multiplication
real,intent(in) :: x2(:,:),y2(:,:)
real,intent(out) :: z2(size(x2,1),size(y2,2))
integer :: i,j
!$omp parallel do private(i,j)
do i = 1,size(x2,1)
do j = 1,size(y2,2)
z2(i,j) = dot_product(x2(i,:), y2(:,j))
end do
end do
!$omp end parallel do
print *, 'Done with mysub2'
end subroutine mysub2
end module mymod
program main
use omp_lib
use mymod
implicit none
integer :: tid
integer,parameter :: m = 2
integer,parameter :: n = 3
integer,parameter :: p = 4
real :: x1(m),y1(m),z1(m)
real :: x2(m,n),y2(n,p),z2(m,p),z3
! Setting total number of threads to 6
call omp_set_num_threads(6)
! Assigning arbitrary values for illustration purposes
x1 = 1.0
y1 = 2.0
x2 = 3.0
y2 = 4.0
!$omp parallel private(tid)
! Getting thread number
tid = omp_get_thread_num()
if ((tid == 0) .or. (tid == 1)) then
! Task 1 to be executed in two threads, tid = 0,1
call mysub1(x1,y1,z1)
else
! Task 2 to be executed in four threads, tid = 2,3,4,5
call mysub2(x2,y2,z2)
end if
!$omp end parallel
! Using z1 and z2 (serially, no need to parallelize)
z3 = sum(z1) + sum(z2)
print *, 'Final output', z3
end program main
Of course, this is just an example. I know I don't need to use mysub2 to do matrix multiplication. I'm just trying to illustrate that mysub2 is more complex and hence, it would be ideal to use more threads for it, without having to paste several hundred lines of the actual code I have.
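One possible way to get the 2/4 split described above is nested parallelism: an outer region with exactly two threads, where each thread sets the team size for the inner parallel do it is about to encounter and then calls its subroutine. The following is only a sketch, not a tested solution; it assumes the mymod module shown above and an OpenMP runtime that allows nested parallel regions (on older runtimes you may additionally need OMP_NESTED=true).
! Hypothetical replacement for the main program above; assumes the mymod
! module from the question and a runtime that permits nested parallelism.
program main
  use omp_lib
  use mymod
  implicit none
  integer, parameter :: m = 2, n = 3, p = 4
  real :: x1(m), y1(m), z1(m)
  real :: x2(m,n), y2(n,p), z2(m,p), z3

  x1 = 1.0; y1 = 2.0
  x2 = 3.0; y2 = 4.0

  call omp_set_max_active_levels(2)   ! allow one level of nested parallelism

  ! Outer region: exactly two threads, one per task.
  !$omp parallel num_threads(2)
  if (omp_get_thread_num() == 0) then
     call omp_set_num_threads(2)      ! team size for mysub1's inner parallel do
     call mysub1(x1, y1, z1)
  else
     call omp_set_num_threads(4)      ! team size for mysub2's inner parallel do
     call mysub2(x2, y2, z2)
  end if
  !$omp end parallel

  ! Using z1 and z2 (serially)
  z3 = sum(z1) + sum(z2)
  print *, 'Final output', z3
end program main
With this split, mysub1's inner team has 2 threads and mysub2's has 4, i.e. 6 threads in total.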
I am trying to do some linear algebra in a Threads.@threads for loop, and it returns some weird results. It seems that matrices are not multiplied properly in the loop. Is it safe to do this in a threaded for loop?
Below is a minimal working example that generates a TxR table of NxN matrices. For each of the R iterations, the (t+1)-th matrix is the product of the (t)-th with another random matrix. The multiplications are performed by different threads and then checked for correctness by a single thread. The function should return a matrix of zeros. It does so for N<=3; however, for N>=4 there are a few ones in the result.
function testMM(T, R, N)
    m1 = zeros(Int64, (T,R,N,N))
    m2 = rand(0:2, (T-1,R,N,N))
    m1[1,:,:,:] = rand(0:1, (R,N,N))
    Threads.@threads for i = 1:R
        for t = 2:T
            m1[t,i,:,:] = m2[t-1,i,:,:] * m1[t-1,i,:,:]
        end
    end
    odds = zeros(Int64, (T-1,R))
    for i = 1:R
        for t = 2:T
            if m1[t,i,:,:] != m2[t-1,i,:,:] * m1[t-1,i,:,:]
                odds[t-1,i] = 1
            end
        end
    end
    return odds
end
Threads.nthreads() is 4 for me. Tested on stable 64bit Julia 0.5.2, 0.5.3, 0.6.0 on Windows.
Edit: This example is even simpler: several copies of a matrix are squared several times independently. The result should be the same for all copies, but the function usually returns false for N>=4. Looks as if the data of different threads is mixed somewhere inside BLAS.
function testMM2(T, R, N)
    m0 = rand(0:2, (N,N))
    m = [deepcopy(m0) for i = 1:R]
    Threads.@threads for i = 1:R
        for t = 1:T
            m[i] = m[i]^2
        end
    end
    return all(x->(x == m[1]), m)
end
I am having a heck of a time trying to figure out how to get a simple dot-product calculation to run in parallel in a Fortran code compiled with the Intel ifort compiler v16. The section of code below is part of a program used for a more complex process, but it is where the program spends most of its time:
double precision function ddot(n,dx,incx,dy,incy)
c
c forms the dot product of two vectors.
c uses unrolled loops for increments equal to one.
c jack dongarra, linpack, 3/11/78.
c modified 12/3/93, array(1) declarations changed to array(*)
c
double precision dx(*),dy(*),dtemp
integer i,incx,incy,ix,iy,m,mp1,n
c
CALL OMP_SET_NUM_THREADS(12)
ddot = 0.0d0
dtemp = 0.0d0
if(n.le.0)return
if(incx.eq.1.and.incy.eq.1)go to 20
c
c code for unequal increments or equal increments
c not equal to 1
c
ix = 1
iy = 1
if(incx.lt.0)ix = (-n+1)*incx + 1
if(incy.lt.0)iy = (-n+1)*incy + 1
do 10 i = 1,n
dtemp = dtemp + dx(ix)*dy(iy)
ix = ix + incx
iy = iy + incy
10 continue
ddot = dtemp
return
c
c code for both increments equal to 1
c
c
c clean-up loop
c
20 m = mod(n,5)
if( m .eq. 0 ) go to 40
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,m) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 30 i = 1,m
dtemp = dtemp + dx(i)*dy(i)
30 continue
!$OMP END PARALLEL DO
if( n .lt. 5 ) go to 60
40 mp1 = m + 1
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,n,mp1) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 50 i = mp1,n,5
dtemp = dtemp + dx(i)*dy(i) + dx(i + 1)*dy(i + 1) +
* dx(i + 2)*dy(i + 2) + dx(i + 3)*dy(i + 3) + dx(i + 4)*dy(i + 4)
50 continue
!$OMP END PARALLEL DO
60 ddot = dtemp
return
end
I am new to the OpenMP directives and am pretty sure I have something funny in there that slows the whole thing down to below single-core speed. Currently I have tried running it with 4 threads on a slower 4(4)-core machine, where it actually went a bit faster than on the large 20(40)-core machine where we designated 12 threads for the processing. At this point I'm thinking the code is funny and doing something I don't want.
The do loop higher up could be parallelized too, but I didn't know how to define ix and iy there, so I just left it alone since the program doesn't spend much time in it.
Precision is very important, so the compiler is set to -fp-model precise. I don't know if that matters at all, but when the code does manage to generate answers they do appear correct. Basically, I'm just trying to figure out how to speed this code up, but parallel processing seems to slow it down instead.
There are a bunch of Intel webinars you can look up to help you.
I have aperture optimum histogram code that does an OpenMP SIMD REDUCTION. So I decided to bring the vector into a (big, nthreads) array and handle each thread in a parallel region. Generally it runs slower than just using a single core. (I have not tried VTune on that yet.)
Other similar array approaches with FFTs run faster, and the cores are all at 100% with good scaling.
Basically, one either needs to work out the issues or test which approach works better. Any OpenMP parallel region takes a long time to start up, so you want it far out in the call tree, not down at the tightest loop level.
Generally you can be better off with PURE functions or subroutines, and using
!DEC$ ATTRIBUTES VECTOR ...
In ifort16 there is also VECTOR(REF(variable)) and the reference is new.
Once all that is singing, then the parallel can be attempted.
Your DO 50 would need some big numbers for 'i' in order to make parallelised code go faster, or the OpenMP parallel 'startup' will gobble up too much time.
There is not a lot other than VTune to help you find cache misses (etc.) and gain insight into making the code faster (which is really code without slowdowns). After all that, it may be worthwhile to also compile using gfortran. I generally find that running the code through two compilers gives more insight into making better overall code. But if you get gains from the !DEC$ extensions, then gfortran may not help. CONTIGUOUS can also be worth trying in functions.
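For reference, here is a free-form sketch of the kind of OpenMP SIMD reduction mentioned above, together with one way to handle the strided branch by deriving the indices from the loop counter so the iterations become independent. It is only an illustration under those assumptions, not a drop-in replacement for the LINPACK ddot:
! Sketch only: same arguments as ddot above, written in free form.
double precision function ddot_sketch(n, dx, incx, dy, incy)
  implicit none
  integer, intent(in) :: n, incx, incy
  double precision, intent(in) :: dx(*), dy(*)
  double precision :: dtemp
  integer :: i, ix0, iy0
  dtemp = 0.0d0
  if (n <= 0) then
     ddot_sketch = 0.0d0
     return
  end if
  if (incx == 1 .and. incy == 1) then
     ! Unit stride: a single SIMD-reduced loop, no thread start-up cost.
     !$omp simd reduction(+:dtemp)
     do i = 1, n
        dtemp = dtemp + dx(i)*dy(i)
     end do
  else
     ! General strides: compute each index from i so iterations are independent,
     ! which also allows a thread-parallel reduction for large n.
     ix0 = 1; iy0 = 1
     if (incx < 0) ix0 = (-n+1)*incx + 1
     if (incy < 0) iy0 = (-n+1)*incy + 1
     !$omp parallel do default(none) shared(n,dx,dy,incx,incy,ix0,iy0) &
     !$omp& private(i) reduction(+:dtemp)
     do i = 1, n
        dtemp = dtemp + dx(ix0+(i-1)*incx)*dy(iy0+(i-1)*incy)
     end do
     !$omp end parallel do
  end if
  ddot_sketch = dtemp
end function ddot_sketch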
I run into a 'killed' problem when I run my f90 program for calculating an autocorrelation function, in which I need to average over a number (np) of time series, and the parameter np is very large.
When np is small, my code works very well, but when np is quite large (10^6) the program is killed after running for some time. Could someone give any suggestions on this problem? The code is as follows.
program acf
implicit none
!========================
!parameters and variables
!========================
character(LEN=20) :: filename
integer,parameter :: rk=4
real,parameter :: rooc=29.16, rohc=12.25, cosphic=0.866
real(kind=rk),parameter :: delta_t=0.005
real :: ave_qj
integer :: i,j,k,nmo,nat,iat,&
imo,np,m1,m2,m3,mt
real(kind=rk),allocatable,dimension (:,:) :: r12, r13, r23,h
real(kind=rk),allocatable,dimension (:,:) :: cosphi, pm
real,allocatable,dimension (:,:) :: x,y,z
character(LEN=3),allocatable,dimension (:) :: atom_type
integer,allocatable,dimension (:) :: ndx_1, ndx_2, ndx_3
real,allocatable,dimension (:) :: qj
real(kind=rk),allocatable,dimension (:) :: corr_h
real(kind=rk),allocatable,dimension (:) :: scalar_h
!==================
!read data in input
!==================
open(10,file='acqjond_k_pair_input')
read(10,*)filename
read(10,*)nmo !number of movie steps
read(10,*)nat !number of atoms per mole.
read(10,*)np !number of pairs
allocate(ndx_1(np))
allocate(ndx_2(np))
allocate(ndx_3(np))
do k=1,np
read(10,*)ndx_1(k),ndx_2(k),ndx_3(k)
enddo
close(10)
!
allocate(atom_type(nat))
allocate(x(nat,nmo))
allocate(y(nat,nmo))
allocate(z(nat,nmo))
allocate(cosphi(nmo,np))
allocate(pm(nmo,np))
allocate(r12(nmo,np))
allocate(r13(nmo,np))
allocate(r23(nmo,np))
allocate(h(nmo,np))
allocate(qj(np))
!=======================
!read in trajectory file
!=======================
open(10,file='traj_pos.xyz')
do imo=1,nmo
read(10,*) !Neglect data of this line
read(10,*) !Neglect data of this line
do iat= 1,nat
read (10,*)atom_type(iat),x(iat,imo),&
y(iat,imo),z(iat,imo)
enddo
enddo
close(10)
!
do k=1,np
qj(k)=0
m1=ndx_1(k)
m2=ndx_2(k)
m3=ndx_3(k)
do j =1, nmo
h(j,k)=0
r13(j,k)= (x(m1,j)-x(m3,j))**2+ &
(y(m1,j)-y(m3,j))**2+ &
(z(m1,j)-z(m3,j))**2 !r:squra of distances
r12(j,k)= (x(m1,j)-x(m2,j))**2+ &
(y(m1,j)-y(m2,j))**2+ &
(z(m1,j)-z(m2,j))**2
r23(j,k)= (x(m2,j)-x(m3,j))**2+ &
(y(m2,j)-y(m3,j))**2+ &
(z(m2,j)-z(m3,j))**2
pm(j,k)= (x(m3,j)-x(m2,j))* &
(x(m1,j)-x(m2,j))+ &
(y(m3,j)-y(m2,j))* &
(y(m1,j)-y(m2,j))+ &
(z(m3,j)-z(m2,j))* &
(z(m1,j)-z(m2,j))
cosphi(j,k)= pm(j,k)/(sqrt(r23(j,k)*r12(j,k)))
if (r13(j,k) .lt. rohc .and. r12(j,k).lt.rooc &
.and. cosphi(j,k).gt. cosphic) then
h(j,k)=1.0
qj(k)=qj(k)+h(j,k)
endif
enddo
qj(k)=qj(k)/nmo
enddo
deallocate (x,y,z,atom_type)
!================
!Write the result
!================
open(10,file=trim(filename)//'_nqj.dat')
do k=1,np
write(10,*) k, qj(k)
do j =1,nmo
write(10,*)j,j*delta_t,h(j,k)
enddo
enddo
close(10)
deallocate (cosphi,pm, &
r12,r13,r23,ndx_1,ndx_2,ndx_3)
!==================================
!Calculate autocorrelation function
!==================================
allocate(corr_h(nmo))
allocate(scalar_h(np))
do i=1, nmo
corr_h(i)=0
enddo
ave_qj=0.d0
! calculate ave_qj
do k=1, np
ave_qj=ave_qj+qj(k)
enddo
ave_qj=ave_qj/np
write(6,*) ave_qj
! calculate <f(0)f(t)>/<f>
do mt=0,nmo-1 ! time interval
do k=1, np
scalar_h(k)=0.d0
do j=1, nmo-mt-1
scalar_h(k)=scalar_h(k)+h(j,k)*h(j+mt,k)
enddo
scalar_h(k)=scalar_h(k)/(nmo-mt)
corr_h(mt+1)=corr_h(mt+1)+scalar_h(k)
enddo
corr_h(mt+1)=corr_h(mt+1)/(np*ave_qj)
enddo
!=====================
!Write the correlation
!C(t)
!=====================
open(10,file=trim(filename)//'_acf_h.dat')
do i=1,int(nmo*rate)
write(10,*)i-1,corr_h(i)
enddo
write(6,*)'written in '//trim(filename)//'_acf_h.dat'
close(10)
!==============================================================
end
You are running out of memory with the code as written. If np=10^6, nmo=10000 and nat=700, each of the nmo-by-np arrays of 4-byte reals takes 4*10000*10^6 = 4*10^10 bytes = 40 GB, and you allocate six of them (r12, r13, r23, cosphi, pm and h), i.e. roughly 240 GB in total.
As noted in the comments, you don't need most of the very large nmo-by-np arrays: r12, r23, r13, cosphi and pm can be got rid of trivially. However, as currently structured with the two loops, you still need the h array. I strongly suggest you restructure the code to get rid of this one as well, as otherwise you will run into the same problem. It is fairly easy to calculate h for just one value of the outer k loop, use that to calculate this pair's contribution to the autocorrelation function, and then move on to the next k. That way h only needs to be of size nmo.
That still leaves a number of nat-by-nmo arrays. These require 4*700*10000 = 28 MB each, which is much more manageable.
Oh, and I would swap the order in which you store x, y and z. This will get you much better data locality in the main loops, as the inner loop will then run down the first index of these arrays, which is much more efficient than what you have currently.
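Here is a sketch of the restructuring suggested above, written as a subroutine so the declarations are explicit. Names follow the question; h_k is a hypothetical 1-D work array of length nmo. Only the coordinate arrays are kept in memory, the nmo-by-np arrays disappear, and the loop bounds mirror the original code:
subroutine acf_low_memory(nmo, np, x, y, z, ndx_1, ndx_2, ndx_3, qj, corr_h)
  implicit none
  integer, intent(in) :: nmo, np
  real, intent(in) :: x(:,:), y(:,:), z(:,:)            ! (nat,nmo), as in the question
  integer, intent(in) :: ndx_1(np), ndx_2(np), ndx_3(np)
  real, intent(out) :: qj(np)
  real(kind=8), intent(out) :: corr_h(nmo)
  real, parameter :: rooc=29.16, rohc=12.25, cosphic=0.866
  real(kind=8) :: h_k(nmo), r12, r13, r23, pm, cosphi, ave_qj
  integer :: k, j, mt, m1, m2, m3

  corr_h = 0.d0
  do k = 1, np
     m1 = ndx_1(k); m2 = ndx_2(k); m3 = ndx_3(k)
     ! h for this pair only -- an nmo-sized work array instead of nmo-by-np
     do j = 1, nmo
        r13 = (x(m1,j)-x(m3,j))**2 + (y(m1,j)-y(m3,j))**2 + (z(m1,j)-z(m3,j))**2
        r12 = (x(m1,j)-x(m2,j))**2 + (y(m1,j)-y(m2,j))**2 + (z(m1,j)-z(m2,j))**2
        r23 = (x(m2,j)-x(m3,j))**2 + (y(m2,j)-y(m3,j))**2 + (z(m2,j)-z(m3,j))**2
        pm  = (x(m3,j)-x(m2,j))*(x(m1,j)-x(m2,j)) + (y(m3,j)-y(m2,j))*(y(m1,j)-y(m2,j)) &
            + (z(m3,j)-z(m2,j))*(z(m1,j)-z(m2,j))
        cosphi = pm/sqrt(r23*r12)
        h_k(j) = 0.d0
        if (r13 < rohc .and. r12 < rooc .and. cosphi > cosphic) h_k(j) = 1.d0
     end do
     qj(k) = real(sum(h_k))/nmo
     ! this pair's contribution to the autocorrelation, same bounds as the original
     do mt = 0, nmo-1
        corr_h(mt+1) = corr_h(mt+1) + sum(h_k(1:nmo-mt-1)*h_k(1+mt:nmo-1))/(nmo-mt)
     end do
  end do
  ave_qj = sum(real(qj, kind=8))/np
  corr_h = corr_h/(np*ave_qj)
end subroutine acf_low_memory
The per-pair h values and qj can still be written to the output file inside the k loop if they are needed.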
I recently started studying parallel programming; I'm still at the beginning, so I wanted to try something very simple. Since I'm interested in performing parallel numerical integration, I started with a simple summation Fortran code:
program par_hello_world
use omp_lib
implicit none
integer, parameter:: bign = 1000000000
integer:: i
double precision:: start, finish, start1, finish1, a
a = 0
call cpu_time(start)
!$OMP PARALLEL num_threads(8)
!$OMP DO REDUCTION(+:a)
do i = 1,bign
a = a + sqrt(1.0**5)
end do
!$OMP END DO
!$OMP END PARALLEL
call cpu_time(finish)
print*, 'parallel result:'
print*, a
print*, (finish-start)
a=0
call cpu_time(start1)
do i = 1,bign
a = a + sqrt(1.0**5)
end do
call cpu_time(finish1)
print*, 'sequential result:'
print*, a
print*, (finish1-start1)
end program
The code basically simulates a summation. I used the weird expression sqrt(1.0**5) to get a measurable computational time; if I used just 1, the time was so small that I could not compare the sequential code with the parallel one.
I tried to avoid the race condition by using the REDUCTION clause.
However, I'm getting very strange timing results:
If I raise the number of threads from 2 to 16, I don't get a reduction in computational time; somehow I even get an increase.
Incredibly, it seems that the sequential code is also influenced by the choice of the number of threads (I really don't understand why!); in particular, its time goes up if I raise the number of threads.
I do get the correct result for the variable a.
I think I'm doing something very wrong somewhere, but I'm clueless about it...
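One thing worth checking, not stated in the question: cpu_time measures processor time, which is typically summed over all threads of the process, so it can grow with the thread count even when the elapsed time shrinks, and idle OpenMP threads spin-waiting after the parallel region can inflate the "sequential" measurement as well. Below is a minimal sketch of timing the same loop with omp_get_wtime, which measures wall-clock time:
program timing_sketch
  use omp_lib
  implicit none
  integer, parameter :: bign = 1000000000
  integer :: i
  double precision :: a, t0, t1
  a = 0.d0
  t0 = omp_get_wtime()                ! wall-clock time, not summed CPU time
  !$omp parallel do reduction(+:a)
  do i = 1, bign
     a = a + sqrt(1.0**5)
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  print *, 'parallel result:', a
  print *, 'wall-clock time:', t1 - t0, 's'
end program timing_sketch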