Parallel processing in Dot Product - multithreading

I am having a heck of a time trying to figure out how to get a simple dot-product calculation to run in parallel in Fortran code compiled with the Intel ifort compiler v16. The section of code below is part of a program used for a more complex process, but this is where the program spends most of its time:
double precision function ddot(n,dx,incx,dy,incy)
c
c forms the dot product of two vectors.
c uses unrolled loops for increments equal to one.
c jack dongarra, linpack, 3/11/78.
c modified 12/3/93, array(1) declarations changed to array(*)
c
double precision dx(*),dy(*),dtemp
integer i,incx,incy,ix,iy,m,mp1,n
c
CALL OMP_SET_NUM_THREADS(12)
ddot = 0.0d0
dtemp = 0.0d0
if(n.le.0)return
if(incx.eq.1.and.incy.eq.1)go to 20
c
c code for unequal increments or equal increments
c not equal to 1
c
ix = 1
iy = 1
if(incx.lt.0)ix = (-n+1)*incx + 1
if(incy.lt.0)iy = (-n+1)*incy + 1
do 10 i = 1,n
dtemp = dtemp + dx(ix)*dy(iy)
ix = ix + incx
iy = iy + incy
10 continue
ddot = dtemp
return
c
c code for both increments equal to 1
c
c
c clean-up loop
c
20 m = mod(n,5)
if( m .eq. 0 ) go to 40
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,m) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 30 i = 1,m
dtemp = dtemp + dx(i)*dy(i)
30 continue
!$OMP END PARALLEL DO
if( n .lt. 5 ) go to 60
40 mp1 = m + 1
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,n,mp1) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 50 i = mp1,n,5
dtemp = dtemp + dx(i)*dy(i) + dx(i + 1)*dy(i + 1) +
* dx(i + 2)*dy(i + 2) + dx(i + 3)*dy(i + 3) + dx(i + 4)*dy(i + 4)
50 continue
!$OMP END PARALLEL DO
60 ddot = dtemp
return
end
I am new to the OpenMP directives and am pretty sure I have something funny in there that slows the whole thing down to less than single-core speed. Currently I have tried to run it with 4 threads on a slower 4(4)-core machine, where it actually went a bit faster than on the large 20(40)-core machine where we designated 12 threads for the processing. At this point I'm thinking the code is funny and doing something I don't want.
The DO loop higher up could be parallelized too, but I didn't know how to handle ix and iy, so I just left it alone since little time is spent there.
Precision is very important, so the compiler is set to fp-model precise. I don't know if that matters at all, but when the code does manage to generate answers they do appear correct. Basically, I'm just trying to figure out how to speed up this code, but parallel processing seems to slow it down instead.

There are a bunch of Intel webinars you can look up that will help you.
I have aperture optimum histogram code that does an OpenMP SIMD REDUCTION. So I decided to bring the vector into a (big,nthreads) array and do each thread in a parallel region. Generally it runs slower than just using a single core. (I have not tried VTune on that yet.)
Other similar array approaches with FFTs run faster, and the cores are all at 100% with good scaling.
Basically one either needs to work out the issues or test which approach works better. Any OpenMP parallel region takes a long time to start up, so you want it far outside, and not down at the tightest loop level.
Generally you can be better off with PURE functions or subroutines, and using
!DEC$ ATTRIBUTES VECTOR ...
In ifort16 there is also VECTOR(REF(variable)), and the REF modifier is new.
Once all that is singing, then the parallelisation can be attempted.
Your DO 50 loop would need a very large trip count (big n) in order for the parallelised code to go faster, or the OpenMP parallel 'startup' will gobble up too much time.
There is not a lot other than VTune to aid in finding cache misses (etc.) to give you insight into getting faster code (which is really code without slowdowns). After all that, it may also be worthwhile to compile using gfortran. I generally find that running through two compilers gives more insight into making better overall code. But if you get gains from the !DEC$ extensions, then gfortran may not help. CONTIGUOUS can also be worth trying in functions.
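To make the "parallel region far outside, SIMD inside" advice concrete, here is a minimal sketch (not the poster's routine, and assuming a compiler with OpenMP 4.0 support such as ifort 16) of the unit-stride branch of the dot product written as a single reduction loop, leaving the unrolling to the compiler:
double precision function ddot_omp(n, dx, dy)
  ! Sketch only: unit-stride dot product as one OpenMP reduction loop.
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: dx(*), dy(*)
  double precision :: dtemp
  integer :: i
  dtemp = 0.0d0
  !$omp parallel do simd default(none) shared(dx, dy, n) private(i) &
  !$omp reduction(+:dtemp) schedule(static)
  do i = 1, n
     dtemp = dtemp + dx(i)*dy(i)
  end do
  !$omp end parallel do simd
  ddot_omp = dtemp
end function ddot_omp
For small n the threading overhead will still dominate, so in practice such a routine is usually only worth parallelising above some problem-size threshold.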

Related

Julia Multithreading a nested for loop for combinations of i & j

I have a question regarding multithreading in Julia and how to parallelize a for loop effectively.
Suppose you have a nested for loop and a computer with 4 cores. A straightforward way is to add Threads.@threads in front of the for loop, assuming that the cores can run what they need to do without interference.
As I have understood it, only the outermost part of the nested for loop is parallelized. Assuming that N = 15 and M = 14, a computer with 4 cores would then be a bottleneck.
However, if you have a PC with 32 cores, then 32 - 15 = 17 cores would be doing nothing, even though there are 210 combinations in total to compute.
Is this correct? Is this how Threads.@threads works? Is there a way to parallelize the combination of both i and j, perhaps using FLoops? I have tried to read the documentation, but I need to know whether I am going in a completely wrong direction.
Threads.@threads for i in 1:N
    for j in 1:M
        # Do stuff
    end
end
vs.
using FLoops
@floop for i in 1:N
    for j in 1:M
        # Do stuff
    end
end
Thanks in advance
You could probably use a single third loop variable and split it into the two indices:
Threads.@threads for k in 1:(N*M)
    i = (k - 1) ÷ M + 1
    j = (k - 1) % M + 1
    # Do stuff
end
Alternatively, iterating over Iterators.product will assign both i and j without the two extra lines:
@floop for (i, j) in Iterators.product(1:N, 1:M)

Fortran and OpenMP thread groups for independent tasks

I need to run two independent tasks using OpenMP. One of them is much more involved than the other, so it would be ideal to split the available threads such that the more complicated task uses more of them. After these two tasks are finished, I need to use both of their outputs. I am not entirely sure whether this can be done with OpenMP, so any suggestion would be very useful.
This is an attempt to illustrate what I need. There are two independent subroutines with separate inputs and outputs. Subroutine mysub2 is more complex than mysub1; it has multiple nested loops, so it would benefit more from having more threads running it. Out of 6 threads, I would like to assign 2 of them to execute mysub1 and 4 of them to mysub2, simultaneously. After getting each subroutine's outputs, z1 and z2, both of them are used to compute z3.
In this attempt I was trying to assign threads 0 and 1 to task 1, and the other 4 to task 2. Obviously, this doesn't work as intended because it runs mysub1 twice and mysub2 four times, but I have no idea how to achieve what I need.
module mymod
implicit none
contains
subroutine mysub1(x1,y1,z1)
! Element-wise product of vectors
real,intent(in) :: x1(:),y1(:)
real,intent(out) :: z1(size(x1))
integer :: i
!$omp parallel do private(i)
do i = 1,size(x1)
z1(i) = x1(i) * y1(i)
end do
!$omp end parallel do
print *, 'Done with mysub1'
end subroutine mysub1
subroutine mysub2(x2,y2,z2)
! Matrix multiplication
real,intent(in) :: x2(:,:),y2(:,:)
real,intent(out) :: z2(size(x2,1),size(y2,2))
integer :: i,j
!$omp parallel do private(i,j)
do i = 1,size(x2,1)
do j = 1,size(y2,2)
z2(i,j) = dot_product(x2(i,:), y2(:,j))
end do
end do
!$omp end parallel do
print *, 'Done with mysub2'
end subroutine mysub2
end module mymod
program main
use omp_lib
use mymod
implicit none
integer :: tid
integer,parameter :: m = 2
integer,parameter :: n = 3
integer,parameter :: p = 4
real :: x1(m),y1(m),z1(m)
real :: x2(m,n),y2(n,p),z2(m,p),z3
! Setting total number of threads to 6
call omp_set_num_threads(6)
! Assigning arbitrary values for illustration purposes
x1 = 1.0
y1 = 2.0
x2 = 3.0
y2 = 4.0
!$omp parallel private(tid)
! Getting thread number
tid = omp_get_thread_num()
if ((tid == 0) .or. (tid == 1)) then
! Task 1 to be executed in two threads, tid = 0,1
call mysub1(x1,y1,z1)
else
! Task 2 to be executed in four threads, tid = 2,3,4,5
call mysub2(x2,y2,z2)
end if
!$omp end parallel
! Using z1 and z2 (serially, no need to parallelize)
z3 = sum(z1) + sum(z2)
print *, 'Final output', z3
end program main
Of course, this is just an example. I know I don't need mysub2 to do matrix multiplication; I'm just trying to illustrate that mysub2 is more complex and hence it would be ideal to use more threads for it, without having to paste the several hundred lines of actual code I have.
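For illustration only (this is not from the question), here is a minimal sketch of one common pattern for this kind of split: an outer PARALLEL SECTIONS region with two threads, where each section opens its own nested parallel region with its own thread count. It assumes nested parallelism is enabled and uses dummy loops in place of mysub1 and mysub2.
program split_threads
  ! Sketch: 2 outer threads, one per task; each task opens a nested
  ! parallel region with the thread count it should get (2 and 4).
  use omp_lib
  implicit none
  integer :: i, j
  real :: a(1000), b(1000)
  a = 0.0
  b = 0.0
  call omp_set_nested(.true.)           ! allow nested parallel regions
  !$omp parallel sections num_threads(2)
  !$omp section
  !$omp parallel do num_threads(2)      ! "task 1" gets 2 threads
  do i = 1, size(a)
     a(i) = 2.0 * real(i)
  end do
  !$omp end parallel do
  !$omp section
  !$omp parallel do num_threads(4)      ! "task 2" gets 4 threads
  do j = 1, size(b)
     b(j) = 3.0 * real(j)
  end do
  !$omp end parallel do
  !$omp end parallel sections
  print *, sum(a) + sum(b)              ! combine both results serially
end program split_threads
Whether a 2+4 split actually beats giving all 6 threads to the bigger task alone is workload-dependent and worth measuring.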

How can we calculate the time complexity for the following piece of code in BIG-OH Notation?

def exercise2(N):
    count = 0
    i = N
    while i > 0:
        for j in range(0, i):
            count = count + 1
        i = i // 2
How do we know the time complexities of both the while and the for loop?
Edit: Many users are sending me links explaining time complexity using Big-O analysis. I appreciate it, but the only language in CS I understand is Python, and all those explanations use Java and C++, which makes them hard for me to follow. If anyone could explain time complexity using Python it would be great!
The inner (for) loop runs i times, and i goes down from N to 1 in at most log(N) halving steps. Hence the total work is N + N/2 + N/4 + ... + 1 = N(1 + 1/2 + 1/4 + ... + 1/2^k) <= 2N, which is Θ(N). For the last equality you can assume N = 2^k, since we only care about the asymptotic complexity. For example, with N = 16 the inner loop body runs 16 + 8 + 4 + 2 + 1 = 31 < 2·16 times.

Manually assigning iteration number to OpenMP schedule static

I am trying to parallelize the following snippet of code using OpenMP. The portion shown is basically a version of the Thomas algorithm (a tridiagonal matrix solver) for an implicit solver in fluid flow/turbulence simulations. I have stripped it down from its exact form to make the question easier to understand.
REAL(KIND=DP), INTENT(INOUT), DIMENSION(1:10) :: A, C_old, C_new, D
INTEGER :: K
!$OMP PARALLEL DO SCHEDULE(STATIC) NUM_THREADS(3)&
!$OMP SHARED(A, C_old, C_new, D)
DO K = 2,9
C_new(K) = C_old(K)/(A(K)*C_new(K-1))
END DO
!$OMP END PARALLEL DO
C(1) = C(10) = 0 because they are at the fixed boundaries (the top and bottom walls of a channel), hence there is no need to update them.
As is evident, the calculation of C(2) needs the information of C(1), the calculation of C(3) needs the information of the MODIFIED C(2), and so on. In other words, it is a forward sweep (the algorithm is here: https://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm). Hence, for three threads, I want the distribution of iterations across the threads to be [1,2,3,4], [4,5,6,7], [7,8,9,10]. But if I just use SCHEDULE(STATIC), I cannot control this and the distribution might be [1,2,3], [4,5,6], [7,8,9,10] across the 3 threads. This would give erroneous results, as the information at grid points 4 and 7 is not passed on to the other threads.
Please let me know if there is a way I can force certain indices to be solved by a certain thread without any race. Say,
thread #1 solves indices [1,2,3,4]
thread #2 solves indices [4,5,6,7] etc.
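For reference, a minimal sketch of how explicit index ranges can be assigned per thread inside a plain PARALLEL region (no worksharing DO), using the arrays from the snippet above; tid, k_lo and k_hi are hypothetical helper variables and USE OMP_LIB is assumed. This only controls which thread touches which indices; it does not by itself remove the forward-sweep dependence between neighbouring chunks.
INTEGER :: tid, k_lo, k_hi    ! hypothetical helpers, not in the snippet above
!$OMP PARALLEL PRIVATE(tid, k_lo, k_hi, K) NUM_THREADS(3)
tid = OMP_GET_THREAD_NUM()
SELECT CASE (tid)
CASE (0)
   k_lo = 2; k_hi = 4         ! thread 0 updates grid points 2..4
CASE (1)
   k_lo = 5; k_hi = 7         ! thread 1 updates grid points 5..7
CASE DEFAULT
   k_lo = 8; k_hi = 9         ! thread 2 updates grid points 8..9
END SELECT
DO K = k_lo, k_hi
   C_new(K) = C_old(K)/(A(K)*C_new(K-1))
END DO
!$OMP END PARALLEL
Getting correct results for the forward sweep would still require handling the boundary values between chunks, which is what the overlapping ranges in the question are trying to achieve.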

Find euclidean distance between rows of two huge CSR matrices

I have two sparse matrices, A and B. A is 120000*5000 and B is 30000*5000. I need to find the Euclidean distance between each row in B and all rows of A, and then find the 5 rows in A with the lowest distance to the selected row in B. As the data is very big I am using CSR, otherwise I get a memory error. It is clear that for each row in A it calculates (x_b - x_a)^2 5000 times, sums them, and then takes a sqrt. This process is taking a very, very long time, like 11 days! Is there any way I can do this more efficiently? I just need the 5 rows with the lowest distance to each row in B.
I am implementing K-Nearest Neighbours and A is my training set and B is my test set.
Well - I don't know whether you could 'vectorize' that code so that it would run in native code instead of Python. The trick to getting speed out of numpy and scipy is always to get the work into native code.
If you could run that code as native code on a 1 GHz CPU, with 1 FP instruction per clock cycle, you'd get it done in a little under 10 hours:
(5000 * 2 * 30000 * 120000) / 1024 ** 3
Raise that to 1.5 GHz x 2 physical CPU cores x 4-way SIMD instructions with multiply + accumulate (Intel AVX extensions, available in most CPUs) and you could get that number crunching down to one hour, at 2 x 100% on a modest Core i5 machine. But that would require full SIMD optimization in native code - far from a trivial task (although, if you decide to go down this path, further questions on S.O. could get help from people willing to get their hands wet in SIMD coding :-) ) - interfacing this code in C with scipy is not hard using Cython, for example (you only need that part to get to the above 10-hour figure).
Now... as for algorithmic optimization, while keeping things in Python :-)
Fact is, you don't need to fully calculate all the distances from the rows in A - you just need to keep a sorted list of the 5 closest rows - and any time the running sum of squares gets larger than the distance of the 5th-nearest row (so far), you just abort the calculation for that row.
You could use Python's heapq operations for that:
import heapq
import math

def get_closer_rows(b_row, a):
    # Keep the 5 closest rows seen so far as (squared_distance, row_index)
    # pairs, sorted ascending, so result[4][0] is the current 5th-best.
    result = [(float("+inf"), None)] * 5
    for i, a_row in enumerate(a):
        distance_sq = 0
        count = 0
        for element_a, element_b in zip(a_row, b_row):
            distance_sq += (element_a - element_b) ** 2
            # Every 64 elements, give up on this row if it is already
            # farther away than the 5th closest row found so far.
            if not count % 64 and distance_sq > result[4][0]:
                break
            count += 1
        else:
            heapq.heappush(result, (distance_sq, i))
            result[:] = heapq.nsmallest(5, result)
    return [(math.sqrt(d), idx) for d, idx in result]

closer_rows_to_b = []
for row in b:
    closer_rows_to_b.append(get_closer_rows(row, a))
Note the auxiliary count variable, used to avoid the expensive lookup and comparison on every single multiplication.
Now, if you can run this code using PyPy instead of regular CPython, I believe it could get the full benefit of JIT compilation, and you could see a noticeable improvement over your current times if you are running the code in pure Python (i.e. non-numpy/scipy vectorized code).
