OpenMP parallel numerical integration (summation) performance - multithreading

I recently started studying parallel programming and am still at the beginning, so I wanted to try something very simple. Since I am interested in parallel numerical integration, I started with a simple summation code in Fortran:
program par_hello_world
use omp_lib
implicit none
integer, parameter:: bign = 1000000000
integer:: i
double precision:: start, finish, start1, finish1, a
a = 0
call cpu_time(start)
!$OMP PARALLEL num_threads(8)
!$OMP DO REDUCTION(+:a)
do i = 1,bign
a = a + sqrt(1.0**5)
end do
!$OMP END DO
!$OMP END PARALLEL
call cpu_time(finish)
print*, 'parallel result:'
print*, a
print*, (finish-start)
a=0
call cpu_time(start1)
do i = 1,bign
a = a + sqrt(1.0**5)
end do
call cpu_time(finish1)
print*, 'sequential result:'
print*, a
print*, (finish1-start1)
end program
The code basically simulates a summation. I used the odd expression sqrt(1.0**5) to get a measurable computation time; with just 1, the computation time was so small that I could not compare the sequential code with the parallel one.
I tried to avoid the race condition by using the REDUCTION clause.
However I'm getting very strange time results:
If I raise the number of threads from 2 to 16, the computation time does not decrease; if anything, it increases.
Surprisingly, the sequential code also seems to be affected by the number of threads (I really don't understand why!): its time goes up as I raise the number of threads.
I get the correct result for the variable a
I think I'm doing something very wrong somewhere, but I'm clueless about it...
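One thing worth checking in timings like these is which clock is being read. The sketch below is in Python purely for illustration (timed_sleep is a hypothetical helper, not part of the question's code); it reads a wall-clock timer and a process CPU timer around the same piece of work and shows that they are different quantities. In many Fortran runtimes cpu_time is understood to report CPU time accumulated over all threads of the process; that is an assumption about the compiler in use, not something stated above. If it holds, a wall-clock timer such as omp_get_wtime would be the more suitable measure of parallel speed-up.

```python
import time

def timed_sleep(seconds):
    """Return (wall, cpu) seconds elapsed while the process sleeps."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    time.sleep(seconds)                # consumes wall-clock time, almost no CPU time
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

wall, cpu = timed_sleep(0.2)
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

Here the two numbers differ because sleeping uses no CPU; in a threaded loop the difference tends to go the other way, since CPU time can add up across threads while wall-clock time shrinks.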

Related

Fortran and OpenMP thread groups for independent tasks

I need to run two independent tasks using OpenMP. One of them is way more involved than the other, so it would be ideal to split the available threads such that the more complicated task uses more of them. After these two tasks are finished, I need to use both of their outputs. I am not entirely sure if this can be done with OpenMP, so any suggestion would be very useful.
This is an attempt to illustrate what I need. There are two independent subroutines with separate inputs and outputs. Subroutine mysub2 is more complex than mysub1. It has multiple nested loops, so it would benefit more from having more threads running it. Out of 6 threads, I would like to assign 2 of them to execute mysub1 and 4 of them to execute mysub2, simultaneously. After getting each subroutine's output, z1 and z2, both are used to compute z3.
In this attempt I was trying to assign threads 0 and 1 to task 1, and the other 4 to task 2. Obviously, this doesn't work as intended because it runs mysub1 twice and mysub2 four times, but I have no idea how to achieve what I need.
module mymod
implicit none
contains
subroutine mysub1(x1,y1,z1)
! Element-wise product of vectors
real,intent(in) :: x1(:),y1(:)
real,intent(out) :: z1(size(x1))
integer :: i
!$omp parallel do private(i)
do i = 1,size(x1)
z1(i) = x1(i) * y1(i)
end do
!$omp end parallel do
print *, 'Done with mysub1'
end subroutine mysub1
subroutine mysub2(x2,y2,z2)
! Matrix multiplication
real,intent(in) :: x2(:,:),y2(:,:)
real,intent(out) :: z2(size(x2,1),size(y2,2))
integer :: i,j
!$omp parallel do private(i,j)
do i = 1,size(x2,1)
do j = 1,size(y2,2)
z2(i,j) = dot_product(x2(i,:), y2(:,j))
end do
end do
!$omp end parallel do
print *, 'Done with mysub2'
end subroutine mysub2
end module mymod
program main
use omp_lib
use mymod
implicit none
integer :: tid
integer,parameter :: m = 2
integer,parameter :: n = 3
integer,parameter :: p = 4
real :: x1(m),y1(m),z1(m)
real :: x2(m,n),y2(n,p),z2(m,p),z3
! Setting total number of threads to 6
call omp_set_num_threads(6)
! Assigning arbitrary values for illustration purposes
x1 = 1.0
y1 = 2.0
x2 = 3.0
y2 = 4.0
!$omp parallel private(tid)
! Getting thread number
tid = omp_get_thread_num()
if ((tid == 0) .or. (tid == 1)) then
! Task 1 to be executed in two threads, tid = 0,1
call mysub1(x1,y1,z1)
else
! Task 2 to be executed in four threads, tid = 2,3,4,5
call mysub2(x2,y2,z2)
end if
!$omp end parallel
! Using z1 and z2 (serially, no need to parallelize)
z3 = sum(z1) + sum(z2)
print *, 'Final output', z3
end program main
Of course, this is just an example. I know I don't need to use mysub2 to do matrix multiplication. I'm just trying to illustrate that mysub2 is more complex and hence, it would be ideal to use more threads for it, without having to paste several hundred lines of the actual code I have.
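The structure being asked for (each task called exactly once, each with its own group of workers, then a join before z3 is computed) can be sketched in Python as follows. This is only an illustration of the control flow, not an OpenMP solution: the `workers` argument and the nested thread pools are assumptions introduced here, and Python threads will not give a real speed-up for this arithmetic. In OpenMP, a comparable layout might be built from nested parallel regions with a different num_threads for each inner region, but that is a suggestion to verify rather than something taken from the question.

```python
from concurrent.futures import ThreadPoolExecutor

def mysub1(x1, y1, workers):
    """Element-wise product of two vectors, using its own pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: pair[0] * pair[1], zip(x1, y1)))

def mysub2(x2, y2, workers):
    """Matrix product; each row of the result is computed as one task."""
    cols = list(zip(*y2))  # columns of y2

    def row_times_matrix(row):
        return [sum(a * b for a, b in zip(row, col)) for col in cols]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row_times_matrix, x2))

m, n, p = 2, 3, 4
x1, y1 = [1.0] * m, [2.0] * m
x2 = [[3.0] * n for _ in range(m)]
y2 = [[4.0] * p for _ in range(n)]

# Outer pool: one thread per task, so each subroutine is called exactly once.
with ThreadPoolExecutor(max_workers=2) as outer:
    future1 = outer.submit(mysub1, x1, y1, 2)    # 2 inner workers for the light task
    future2 = outer.submit(mysub2, x2, y2, 4)    # 4 inner workers for the heavy task
    z1, z2 = future1.result(), future2.result()  # wait for both tasks

# Serial combination of both outputs
z3 = sum(z1) + sum(sum(row) for row in z2)
print("Final output", z3)
```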

Manually assigning iteration number to OpenMP schedule static

I am trying to parallelize the following snippet of code using OpenMP. The portion shown is basically a version of the Thomas matrix algorithm (matrix inversion) for implicit solver in fluid flow/turbulence simulations. I have stripped it down from its exact form to make the question easily understandable.
REAL(KIND=DP), INTENT(INOUT), DIMENSION(1:10) :: A, C_old, C_new, D
INTEGER :: K
!$OMP PARALLEL DO SCHEDULE(STATIC) NUM_THREADS(3)&
!$OMP SHARED(A, C_old, C_new, D)
DO K = 2,9
C_new(K) = C_old(K)/(A(K)*C_new(K-1))
END DO
!$OMP END PARALLEL DO
C(1) = C(10) = 0 because they are at the fixed boundaries (top and the bottom wall of a channel). Hence there is no need to update them.
As is evident, calculating C(2) requires C(1), calculating C(3) requires the MODIFIED C(2), and so on. In other words, it is a forward sweep (the algorithm is described at https://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm). Hence, with three threads, I want the iterations distributed as [1,2,3,4], [4,5,6,7], [7,8,9,10]. With a plain SCHEDULE(STATIC) I cannot control this, and the distribution might be [1,2,3], [4,5,6], [7,8,9,10] across the 3 threads. That would give erroneous results, because the values at grid points 4 and 7 are not passed on to the other threads.
Please let me know if there is a way where I can force the indices to be solved by a certain thread without any race? Say,
thread #1 solve indices [1,2,3,4]
thread #2 solve indices [4,5,6,7] etc.
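The dependency described above can be made concrete with a small Python sketch. All numerical values are made up for illustration (and chosen nonzero so the division is defined), indices are 0-based, and sweep_chunked_stale is a hypothetical stand-in for threads that start their chunk without waiting for the updated value from the chunk to their left. It suggests that the mismatch comes from the recurrence itself, so the way iterations are assigned to threads may not be the deciding factor.

```python
def sweep_sequential(a, c_old, c_start):
    """Forward sweep: each entry uses the freshly updated previous entry."""
    c_new = list(c_start)
    for k in range(1, len(c_new) - 1):
        c_new[k] = c_old[k] / (a[k] * c_new[k - 1])
    return c_new

def sweep_chunked_stale(a, c_old, c_start, chunks):
    """Each chunk starts from the un-updated value to the left of its first
    index, mimicking threads that do not wait for their left neighbour."""
    c_new = list(c_start)
    for first, last in chunks:
        prev = c_start[first - 1]      # stale boundary value
        for k in range(first, last + 1):
            c_new[k] = c_old[k] / (a[k] * prev)
            prev = c_new[k]
    return c_new

# Hypothetical data: 10 grid points, interior points are indices 1..8
a = [2.0] * 10
c_old = [1.0] * 10
c_start = [1.0] * 10

seq = sweep_sequential(a, c_old, c_start)
par = sweep_chunked_stale(a, c_old, c_start, [(1, 3), (4, 6), (7, 8)])
print("sequential:", seq)
print("chunked   :", par)
```

With these values the two results already disagree at the first point of the second chunk, because that point needs the updated value of its left neighbour.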

Parallelization of Piecewise Polynomial Evaluation

I am trying to evaluate points in a large piecewise polynomial, which is obtained from a cubic-spline. This takes a long time to do and I would like to speed it up.
As such, I would like to evaluate points on a piecewise polynomial with parallel processes, rather than sequentially.
Code:
z = zeros(1e6, 1) ; % preallocate some memory for speed
Y = rand(11220,161) ; %some data, rand for generating a working example
X = 0 : 0.0125 : 2 ; % vector of data sites
pp = spline(X, Y) ; % get the piecewise polynomial form of the cubic spline.
% The resulting structure is large.
for t = 1 : 1e6 % big number
hcurrent = ppval(pp,t); %evaluate the piecewise polynomial at t
z(t) = sum(x(t:t+M-1).*hcurrent,1) ; % do some operation of the interpolated value. Most likely not relevant to this question.
end
Unfortunately, with matrix form and using:
hcurrent = flipud(ppval(pp, 1: 1e6 ))
requires too much memory to process, so cannot be done. Is there a way that I can batch process this code to speed it up?
For scalar second arguments, as in your example, you're dealing with two issues. First, there's a good amount of function call overhead and redundant computation (e.g., unmkpp(pp) is called every loop iteration). Second, ppval is written to be general so it's not fully vectorized and does a lot of things that aren't necessary in your case.
Below is vectorized code that takes advantage of some of the structure of your problem (e.g., t is an integer greater than 0), avoids function call overhead, moves some calculations outside of your main for loop (at the cost of a bit of extra memory), and gets rid of a for loop inside of ppval:
n = 1e6;
z = zeros(n,1);
X = 0:0.0125:2;
Y = rand(11220,numel(X));
pp = spline(X,Y);
[b,c,l,k,dd] = unmkpp(pp);
T = 1:n;
idx = discretize(T,[-Inf b(2:l) Inf]); % Or: [~,idx] = histc(T,[-Inf b(2:l) Inf]);
x = bsxfun(@power,T-b(idx),(k-1:-1:0).').';
idx = dd*idx;
d = 1-dd:0;
for t = T
hcurrent = sum(bsxfun(@times,c(idx(t)+d,:),x(t,:)),2);
z(t) = ...;
end
The resultant code takes ~34% of the time of your example for n=1e6. Note that because of the vectorization, calculations are performed in a different order. This will result in slight differences between outputs from ppval and my optimized version due to the nature of floating point math. Any differences should be on the order of a few times eps(hcurrent). You can still try using parfor to further speed up the calculation (with four already running workers, my system took just 12% of your code's original time).
I consider the above a proof of concept. I may have over-optimized the code above if your example doesn't correspond well to your actual code and data. In that case, I suggest creating your own optimized version. You can start by looking at the code for ppval by typing edit ppval in your Command Window. You may be able to implement further optimizations by looking at the structure of your problem and what you ultimately want in your z vector.
Internally, ppval still uses histc, which has been deprecated. My code above uses discretize to perform the same task, as suggested by the documentation.
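The idea of locating the interval for each point and then evaluating the local polynomial, processed in batches so that only one batch of results is held at a time, can be sketched in plain Python. This is not the MATLAB code above: eval_pp and eval_in_batches are hypothetical helpers, the tiny two-piece polynomial is invented for the example, and the coefficient layout (highest power first, in the local variable t - breaks[i]) is assumed to mirror the pp-form.

```python
from bisect import bisect_right

def eval_pp(breaks, coefs, t):
    """Evaluate a piecewise polynomial at t. coefs[i] holds the coefficients of
    piece i, highest power first, in the local variable (t - breaks[i])."""
    i = min(max(bisect_right(breaks, t) - 1, 0), len(coefs) - 1)
    dt = t - breaks[i]
    value = 0.0
    for c in coefs[i]:                 # Horner's rule
        value = value * dt + c
    return value

def eval_in_batches(breaks, coefs, points, batch_size):
    """Yield results one batch at a time to bound the memory in use."""
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        yield [eval_pp(breaks, coefs, t) for t in batch]

breaks = [0.0, 1.0, 2.0]
coefs = [[1.0, 0.0, 0.0],              # t**2 on [0, 1)
         [0.0, 2.0, 1.0]]              # 2*(t - 1) + 1 on [1, 2]
points = [0.0, 0.5, 1.0, 1.5, 2.0]

batched = [v for batch in eval_in_batches(breaks, coefs, points, 2) for v in batch]
print(batched)
```

Because each batch is independent of the others, the batches could also be handed to separate workers, which is the role parfor would play in MATLAB.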
Use the parfor command for parallel loops. Also, to speed things up, precompute the z vector as z(j) = x(j:j+M-1) and compute hcurrent inside the parfor.
The spline parameter estimation can be written in matrix form.
Once you write it in matrix form and solve it, you can use the model matrix to evaluate the spline on all data points using matrix multiplication, which is probably the most tuned operation in MATLAB.

Parallel processing in Dot Product

I am having a heck of a time trying to figure out how to get a simple Dot Product calculation to parallel process on a Fortran code compiled by the Intel ifort compiler v 16. I have the section of code below, it is part of a program used for a more complex process, but this is where most of the time is spent by the program:
      double precision function ddot(n,dx,incx,dy,incy)
c
c     forms the dot product of two vectors.
c     uses unrolled loops for increments equal to one.
c     jack dongarra, linpack, 3/11/78.
c     modified 12/3/93, array(1) declarations changed to array(*)
c
      double precision dx(*),dy(*),dtemp
      integer i,incx,incy,ix,iy,m,mp1,n
c
      CALL OMP_SET_NUM_THREADS(12)
      ddot = 0.0d0
      dtemp = 0.0d0
      if(n.le.0)return
      if(incx.eq.1.and.incy.eq.1)go to 20
c
c        code for unequal increments or equal increments
c          not equal to 1
c
      ix = 1
      iy = 1
      if(incx.lt.0)ix = (-n+1)*incx + 1
      if(incy.lt.0)iy = (-n+1)*incy + 1
      do 10 i = 1,n
        dtemp = dtemp + dx(ix)*dy(iy)
        ix = ix + incx
        iy = iy + incy
   10 continue
      ddot = dtemp
      return
c
c        code for both increments equal to 1
c
c
c        clean-up loop
c
   20 m = mod(n,5)
      if( m .eq. 0 ) go to 40
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,m) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
      do 30 i = 1,m
        dtemp = dtemp + dx(i)*dy(i)
   30 continue
!$OMP END PARALLEL DO
      if( n .lt. 5 ) go to 60
   40 mp1 = m + 1
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,n,mp1) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
      do 50 i = mp1,n,5
        dtemp = dtemp + dx(i)*dy(i) + dx(i + 1)*dy(i + 1) +
     *   dx(i + 2)*dy(i + 2) + dx(i + 3)*dy(i + 3) + dx(i + 4)*dy(i + 4)
   50 continue
!$OMP END PARALLEL DO
   60 ddot = dtemp
      return
      end
I am new to the OpenMP commands and am pretty sure I have something funny in there that slows the whole thing down more than on a single core. Currently I have tried to run it on 4 threads on a slower 4(4) core machine where it actually went a bit faster than the large 20(40) core machine where we designated 12 threads for the processing. At this point I'm thinking the code is funny and doing something I don't want.
The Do loop higher up could be parallelized too, but I didn't know how to define the ix and iy and so just left it alone since it doesn't spend much time there.
Precision is very important, so the compiler is set to fp-model precise. I don't know if that matters at all, but when the code does manage to generate answers they do appear correct. Basically, I'm just trying to figure out how to speed up this code, but parallel processing seems to slow it down instead.
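On the question of how to define ix and iy for the strided loop: one possible approach, sketched here in Python with 0-based indices, is to compute both indices directly from the loop counter so that no iteration depends on the previous one. The loop can then be split into chunks whose partial sums are added at the end, which is the same pattern a REDUCTION(+:dtemp) clause expresses. The names ddot_sequential, ddot_chunked and nchunks are hypothetical, and the threads serve only to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def start_index(n, inc):
    """0-based start index, matching the (-n+1)*inc + 1 rule of the Fortran code."""
    return (-n + 1) * inc if inc < 0 else 0

def ddot_sequential(n, dx, incx, dy, incy):
    """Reference loop: ix and iy are carried from one iteration to the next."""
    ix, iy = start_index(n, incx), start_index(n, incy)
    total = 0.0
    for _ in range(n):
        total += dx[ix] * dy[iy]
        ix += incx
        iy += incy
    return total

def ddot_chunked(n, dx, incx, dy, incy, nchunks):
    """Indices are computed from i alone, so chunks are independent."""
    ix0, iy0 = start_index(n, incx), start_index(n, incy)

    def partial(lo, hi):
        return sum(dx[ix0 + i * incx] * dy[iy0 + i * incy] for i in range(lo, hi))

    bounds = [n * c // nchunks for c in range(nchunks + 1)]
    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        return sum(pool.map(partial, bounds[:-1], bounds[1:]))  # the reduction step

dx = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
dy = [1.0, 2.0, 3.0, 4.0]
ref = ddot_sequential(4, dx, 2, dy, -1)
par = ddot_chunked(4, dx, 2, dy, -1, 2)
print(ref, par)
```

Since precision matters here, note that adding partial sums changes the order of the floating-point additions, so a reduced result can differ from the serial one in the last bits; the small integer-valued data above happens to be exact.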
There are a bunch of Intel webinars you can look up to help you.
I have aperture optimum histogram code that does OpenMP SIMD REDUCTION. So I decided to bring in the vector into a (big,nthreads) array and do each thread in a parallel region. Generally it runs slower than just using a single core. (I have not tried vtune on that yet)
Other similar array approaches with FFTs run faster, and the cores are all at 100% with good scaling.
Basically one either needs to work out the issues or test which works better. Any OpenMP parallel takes a long time to start, so you want it way outside, and not down at the tightest level.
Generally you can be better off with PURE functions or subroutines, and using
!DEC$ ATTRIBUTES VECTOR ...
In ifort16 there is also VECTOR(REF(variable)) and the reference is new.
Once all that is singing, then the parallel can be attempted.
Your DO 50 would need some big numbers for 'i' in order to make parallelised code go faster, or the OpenMP parallel 'startup' will gobble up too much time.
There is not a lot other than vtune to help find cache misses (etc.) and give you insight into getting faster code (which is really code without slowdowns). After all that, it may be worthwhile to also compile using gfortran. I generally find that running through two compilers gives more insight into making better overall code. But if you get gains from !DEC$ extensions, then gfortran may not help. CONTIGUOUS can also be worth trying in functions.

Matlab - for loop: create two threads and join them each n iterations

Each n steps of a for-loop I need to perform a time consuming operation that I will only need n iterations later (for the next time I call this time consuming operation) - still I need results of iteration i-1 to start the computation.
I believe I could benefit from multithreading - with only 2 threads. At i:
(1st thread): keep running the main loop until it reaches i + n and wait for the 2nd thread to be finished.
(2nd) do the time consuming operation.
Is there any way to implement that in MATLAB?
for i=1:1:N
y(i) = g(y(i-1), x(i-1));
if(mod(i, n) == 1)
x(i) = f(x(i-n), y(i-1)); %Time consuming
else
x(i) = x(i-1);
end
end
Thanks!
You can separate the script into two parts:
1. First loop to compute y array
2. Second loop to compute the x array
You can use the parallel computing toolbox to speed up the second loop. e.g. parfor

Resources