Multi-threaded linear system solution in OpenBLAS - multithreading

I have a code written in Fortran 95 using the gfortran compiler. I am also using OpenMP, and I have to handle very big arrays. In my code I also have to solve a system of linear equations using the solver DGTSV from OpenBLAS. I want to parallelize this solver as well using OpenBLAS, which should be capable of that, but I have trouble with the syntax. With the pseudo code below, all 4 CPUs are used at almost 100%, but I am not sure whether each core solves the linear system separately or whether the cores split it into parts and compute it in parallel.
The whole thing is compiled using gfortran -fopenmp -lblas a.f95 -o a.out
So my pseudo code looks like this:
program a
implicit none
integer, parameter :: N = 200
real*8, dimension(N)   :: D = 0.0
real*8, dimension(N-1) :: DL = 0.0
real*8, dimension(N-1) :: DU = 0.0
real*8, dimension(N)   :: b = 0.0
integer :: info = 0
integer :: numthread=4
...
!$OMP PARALLEL NUM_THREADS(numthread)
...
!$OMP DO
...
!$OMP END DO
CALL DGTSV(N,1,DL,D,DU,b,N,info)
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
end program a
What do I have to do to make the solver run in parallel, so that each core calculates part of the solution?

Inside an OpenMP parallel region, all the threads execute the same code (as in MPI), and the work is only split when the threads reach a work-sharing construct (loop, sections, single, task).
In your example, the work inside the loops (OMP DO) is distributed among the available threads. After each loop an implicit barrier synchronizes all the threads, and then every thread executes the call to DGTSV, i.e. the solver is run redundantly by each thread rather than split between them. After the subroutine has returned, the next loop is split again.
@HristoIliev proposed using an OMP SINGLE construct. This restricts the enclosed piece of code to be executed by only one thread and forces all the other threads to wait for it (unless you specify nowait).
Nested parallelism, on the other hand, refers to the case where you declare a parallel region inside another parallel region. This also applies when you call an OpenMP-parallelized library from inside a parallel region.
By default, OpenMP does not add parallelism in nested parallel regions; instead, only the thread that enters the inner parallel region executes it. This behavior can be changed by setting the environment variable OMP_NESTED to true.
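As a minimal sketch (not from the original answer), nested parallelism can also be enabled from code with omp_set_nested, which has the same effect as setting OMP_NESTED=true:

program nested_demo
  use omp_lib
  implicit none
  call omp_set_nested(.true.)     ! same effect as exporting OMP_NESTED=true
  !$omp parallel num_threads(2)
  !$omp parallel num_threads(2)
  ! the inner region only gets its own team because nesting was enabled above
  print *, 'outer', omp_get_ancestor_thread_num(1), 'inner', omp_get_thread_num()
  !$omp end parallel
  !$omp end parallel
end program nested_demo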
The OMP SINGLE solution is far better than splitting the parallel region in two, as the resources (the threads) are reused for the next loop:
!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP SINGLE
CALL DGTSV(...)
!$OMP END SINGLE
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP END PARALLEL
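For reference, here is a minimal, compilable sketch of this structure with placeholder loop bodies of my own; it assumes you link against a library that provides DGTSV (LAPACK or a full OpenBLAS build), e.g. gfortran -fopenmp single_demo.f90 -llapack:

program single_demo
  implicit none
  integer, parameter :: n = 200
  real*8  :: d(n), dl(n-1), du(n-1), b(n)
  integer :: i, info
  !$omp parallel
  !$omp do
  do i = 1, n                ! filling the diagonal and the RHS is split among threads
     d(i) = 2.0d0
     b(i) = 1.0d0
  end do
  !$omp end do
  !$omp do
  do i = 1, n-1
     dl(i) = -1.0d0
     du(i) = -1.0d0
  end do
  !$omp end do
  !$omp single               ! only one thread calls the solver
  call dgtsv(n, 1, dl, d, du, b, n, info)
  !$omp end single
  !$omp do
  do i = 1, n                ! the same thread team is reused for the next loop
     b(i) = 2.0d0*b(i)
  end do
  !$omp end do
  !$omp end parallel
  print *, 'info =', info
end program single_demo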
To illustrate the usage of OMP_NESTED, I'll show some results I obtained from an application that used FFTW (a Fast Fourier Transform implementation) configured to use OpenMP. The execution was performed on a 16-core, two-socket Intel Xeon E5 @2.46GHz node.
The following graphs show the amount of time spent in the whole application, where parallel regions appear when CPUs > 1, serialized regions when CPUs = 1, and synchronization regions when CPUs = 0.
The application is embarrassingly parallel, so in this particular case using nesting is not worthwhile (FFTW does not scale that well).
In the OMP_NESTED=false execution, observe how the amount of parallelism is limited by the number of threads spent in the external parallel region (ftdock).
In the OMP_NESTED=true execution, it is possible to increase the parallelism beyond the number of threads spent in the external parallel region. The maximum parallelism possible in this case is 16, reached when either each of the 8 external threads creates a single peer to execute the internal parallel region, or 4 of them create 3 additional threads each (8x2 = 4x4 = 16).

Related

How to create OpenMP threads only once (and do not kill them between different parallel regions)

I have an array A_p which is declared threadprivate, so each thread has its own copy. The code is complicated and does some calculations. Finally, I want to reduce all these per-thread arrays into one shared array A.
DO J = 1, Y
   DO I = 1, X
      a = 0
      !$omp parallel reduction(+:a)
      a = A_p(I,J)
      !$omp end parallel
      A(I,J) = A(I,J) + a
   END DO
END DO
This solution works, but the problem is that the thread team is probably created (or at least woken up and synchronized) at every iteration, which incurs a huge overhead. I would like to find a way to keep the threads alive between iterations, so that they are created only once.
I have also tried the following solution:
!$omp parallel reduction(+:A)
A = A_p
!$omp end parallel
but it seems to create a certain overhead for initializing a private copy of A for each thread (which, by the way, is redundant, because we already have the threadprivate arrays and do not really need more private arrays). The overhead here is of course smaller than the overhead observed in the previous solution, but it is still not good enough for me.
I would also like to ask how OpenMP implements the reduction. For example, in the first solution I presented, is the reduction of the variable a serial, or is it implemented in a tree-combining fashion (achieving logarithmic running time for the reduction phase)?
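One common way to achieve this (a hedged sketch, not from the original post, with the per-thread work reduced to a placeholder assignment) is to hoist the parallel region outside the loop nest so that the team is created only once, and let each thread fold its whole threadprivate array into A inside a critical section:

program reduce_once
  use omp_lib
  implicit none
  integer, parameter :: X = 100, Y = 100
  real*8 :: A(X, Y)
  real*8, save :: A_p(X, Y)
  !$omp threadprivate(A_p)

  A = 0.0d0
  !$omp parallel
  A_p = dble(omp_get_thread_num() + 1)   ! stand-in for the real per-thread work
  !$omp critical
  A = A + A_p                            ! each thread folds its private copy in once
  !$omp end critical
  !$omp end parallel
  print *, sum(A)
end program reduce_once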

When can't computation speed be improved by parallelization?

Similar questions have been asked before but I couldn't find an answer that was more about the low level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
for(int i=0; i < N(160,000,000,000); i++){
physicalModal(input[i]); //Linear function, just additions and subtractions
}
function physicalModal(x){
A*x +B*x +C*x + D*x......... //An over simplification but you get the idea. A linear function
}
Given the nature of this problem, am I correct in thinking that a single thread on a single core, or 1 thread per core, would be the fastest way to solve this? And that using extra threads beyond the number of cores would not help me here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently; they just share processor time, which can be beneficial when one thread is waiting on, say, a socket response while other threads are processing requests. In the example I posted above, I figure the CPU could go to 100% on one thread, so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my above assumption is correct, what's the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc. But that's based on my initial premise, which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables, no disk/network access, no serving of many remote users) and just does some arithmetic on its input (on a single input point, or on a few neighbouring points, as in a stencil kernel):
// i must be a 64-bit type: 160 billion does not fit in a 32-bit int
for(long long i = 0; i < 160000000000LL; i++){
    // Linear function, just additions and subtractions
    output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1] .. */);
}
Then you have to check how efficiently this function runs on a single CPU. Can you (or your compiler) unroll the loop and convert it to SIMD parallelism?
for(long long i = 0+1; i < 160000000000LL-1; i++){
    output[i] = A*input[i-1] + B*input[i] + C*input[i+1];
}

// unrolled 4 times; if input is float, the compiler may load 4 floats
// into a single SSE2 register and do 4 operations with one asm instruction
for(long long i = 0+4; i < 160000000000LL-4; i += 4){
    output[i+0] = A*input[i-1] + B*input[i+0] + C*input[i+1];
    output[i+1] = A*input[i+0] + B*input[i+1] + C*input[i+2];
    output[i+2] = A*input[i+1] + B*input[i+2] + C*input[i+3];
    output[i+3] = A*input[i+2] + B*input[i+3] + C*input[i+4];
}
When your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP, MPI, or another method). With our assumptions, no thread blocks on some external resource such as a HDD or the network, so every thread you start can run at any time. Then we should start no more than 1 thread per CPU core: if we start more, each thread will run for some time and then be displaced by another, giving lower performance than with 1 thread per core.
In C/C++, adding OpenMP thread-level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (plus the -fopenmp/-openmp option to your compilation); the compiler and runtime library will split your for loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..N*3/4], [N*3/4..N] for 4 threads, or some other split scheme; you can give hints with the schedule clause):
#pragma omp parallel for
for(long long i = 0+1; i < 160000000000LL-1; i++){
    output[i] = physicalModel(input[i]);
}
The thread count is determined at runtime by the OpenMP library (gomp in gcc - https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default, one thread per logical CPU core is used. You can change the number of threads with the OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
On CPUs with hardware multithreading of single cores (Intel HT and other variants of SMT: you have 4 physical cores but 8 "logical" ones), in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), since some resources (e.g. the FPU units) are shared between logical cores. Do some experiments if your code will be run many times.
If your threads (your model) are limited by memory speed (memory bound: they load a lot of data from memory and do a very simple operation on every float), you may want to run fewer threads than there are CPU cores, since additional threads will not get additional memory bandwidth.
If your threads do a lot of computation for every element loaded from memory, use better SIMD and more threads (compute bound). When you have very good and wide SIMD (full-width AVX), you will get no speedup from HT, since the full-width AVX unit is shared between logical cores (but every physical core has one, so use it); in this case you will also run at a lower CPU frequency, since the full-width AVX unit gets very hot under full load.
Illustration of memory-limited and compute-limited applications (the Roofline model): https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/ (image: https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png)

Declaring a different nested number of threads for two separate tasks (OpenMP)

I am writing a parallel code that exploits some parallelism at an outer level. Essentially there are two separate, very expensive subroutines that may be executed concurrently. This is a large code, and within each subroutine there are further calls as well as many omp parallel/do regions. So, to execute my two subroutines I want to make use of nested parallelism, so that they can both be called in the outer region like this:
!$omp parallel
!$omp single
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
If both of these expensive tasks took an equal amount of time I would not have a problem. But during the simulation the amount of work each of them has to do changes at every time step. So setting the nested number of threads through an environment variable, e.g. export OMP_NUM_THREADS=16,8 where I have 16 threads at the first level of parallelism and 8 in the nested regions (inside the two expensive subroutines), does not work well. I already have a scheme to distribute the correct number of threads to the respective tasks; I just don't know how to set a different number of threads for the nested level in the respective subroutines. Of course I could go into each expensive subroutine, and all the subroutines within those, and actually hard-code the number of threads that I would like, but as I mentioned this is a very large code and that is an ugly solution. I would much rather do this in an environment-variable kind of way. There is no information on this subject online. Does anyone out there have a clue how one could do this?
Thanks in advance.
I'm not sure whether I understand correctly what you are trying to achieve, but you can set the default team size for nested parallel regions by simply calling omp_set_num_threads(). If you call it from the serial part of the application, it will set the default team size for top-level parallel regions. If you call it from within a parallel region, it will affect nested parallel regions spawned by the calling thread. And different threads can set different team sizes for their nested regions. So, in a nutshell, you can do something like:
!$omp parallel
!$omp single
call omp_set_num_threads(nn)
! Do the first expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp single
call omp_set_num_threads(mm)
! Do the second expensive task (contains more omp parallel regions)
!$omp end single nowait
!$omp end parallel
Parallel regions spawned from the thread executing the first single construct will execute with nn threads. Parallel regions spawned from the thread executing the second single construct will execute with mm threads.
Also, have you considered using explicit OpenMP tasks (!$omp task) instead of single + nowait?
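For illustration, here is a minimal sketch of that task-based variant (expensive_task_1/expensive_task_2 are placeholder subroutines standing in for the two expensive tasks, and nn/mm stand in for the thread counts from your existing scheme; nested parallelism must be enabled):

program task_demo
  use omp_lib
  implicit none
  integer :: nn, mm
  nn = 4                               ! placeholder thread counts; in the real code
  mm = 2                               ! these come from your load-balancing scheme
  call omp_set_nested(.true.)
  !$omp parallel
  !$omp single
  !$omp task
  call omp_set_num_threads(nn)         ! affects parallel regions spawned from this task
  call expensive_task_1()
  !$omp end task
  !$omp task
  call omp_set_num_threads(mm)
  call expensive_task_2()
  !$omp end task
  !$omp end single
  !$omp end parallel
contains
  subroutine expensive_task_1()
    integer :: i
    real*8  :: s
    s = 0.0d0
    !$omp parallel do reduction(+:s)   ! nested region, runs with nn threads
    do i = 1, 1000000
       s = s + dble(i)
    end do
    print *, 'task 1 done, sum =', s
  end subroutine expensive_task_1
  subroutine expensive_task_2()
    integer :: i
    real*8  :: s
    s = 0.0d0
    !$omp parallel do reduction(+:s)   ! nested region, runs with mm threads
    do i = 1, 1000000
       s = s + dble(i)
    end do
    print *, 'task 2 done, sum =', s
  end subroutine expensive_task_2
end program task_demo

With tasks, whichever thread finishes its task first can help elsewhere, which fits the case where the two workloads change size from time step to time step.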

All my codes are running much slower when I use open mp

I have already seen several posts on this site about this issue. However, even my serious codes, where the overhead due to thread creation and the like should not be a big issue, have become much slower with OpenMP! I am using a quad-core machine with gfortran 4.6.3 as my compiler. Below is an example of a test code.
Program test
use omp_lib
integer*8 i, j, k, l
!$omp parallel
!$omp do
do i = 1, 20000
   do j = 1, 1000
      do k = 1, 1000
         l = i
      enddo
   enddo
enddo
!$omp end do nowait
!$omp end parallel
End program test
This code takes around 80 seconds if I run it without OpenMP; with OpenMP, however, it takes around 150 seconds. I have seen the same issue with my other serious codes, whose runtime is around 5 minutes or so in serial mode. In those codes I take care that there are no dependencies between threads. So why do these codes become slower instead of faster?
Thanks in advance.
You have a race condition: multiple threads are writing to the same shared variable l. The program is therefore invalid; l should be private. The race also causes a slowdown, because the threads keep invalidating the cache contents of the other cores, so they have to reload the data from memory all the time. A similar thing happens when several threads use the same cache line; that is known as false sharing.
You also probably don't use any compiler optimizations. Enable them with -O2, -O3 or -Ofast. You will then see that the program takes 0 seconds, because the compiler optimizes everything away.
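For reference, a minimal sketch of the corrected version, where only the private clause is added (as noted above, with optimization enabled the empty loop nest may still be removed entirely):

program test_fixed
  use omp_lib
  integer*8 i, j, k, l
  !$omp parallel private(l)    ! l is now private, so there is no data race
  !$omp do
  do i = 1, 20000
     do j = 1, 1000
        do k = 1, 1000
           l = i
        enddo
     enddo
  enddo
  !$omp end do nowait
  !$omp end parallel
end program test_fixed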

Using OpenMP (libgomp) in an already multithreaded application

We are using OpenMP (libgomp) in order to speed up some calculations in a multithreaded Qt application. The parallel OpenMP sections are located within two different threads, though in fact they never execute in parallel. What we observe in this case is that 2N (where N = OMP_THREAD_LIMIT) omp threads are launched, apparently interfering with each other. The calculation time is very high, while the processor load is low. Setting OMP_WAIT_POLICY hardly has any effect.
We also tried moving all the omp sections to a single thread (this is not a good solution for us, though, from an architectural point of view). In this case, the overall calculation time does drop and the processor is fully loaded, but only if OMP_WAIT_POLICY is set to ACTIVE. With OMP_WAIT_POLICY == PASSIVE, the calculation time stays high and the processor is idle 50% of the time.
Oddly enough, when we use omp within a single thread, the first loop parallelized using omp (in a series of omp calculations) executes 10 times slower compared to the multithreaded case.
Upd: Our questions are:
a) Is there any way to reuse the OpenMP threads when using omp in the context of different threads?
b) Why does executing with OMP_WAIT_POLICY == PASSIVE slow everything down? Does it take so long to wake the threads?
c) Is there any logical explanation for the phenomenon of the first parallel block being so slow (even when waiting in active mode)?
Upd2: Please note that the issue is probably related to the GNU OpenMP implementation; icc doesn't have it.
Try starting/stopping the OpenMP threads at runtime using omp_set_num_threads(1) and omp_set_num_threads(cpucount).
The call with (1) should stop all OpenMP worker threads, and the call with (cpucount) will restart them again.
So, at the start of the program, call omp_set_num_threads(1).
Just before an omp-parallelized region you can start the omp threads, even with WAIT_POLICY=active, and they will not consume CPU before this point.
After the omp parallel region you can stop the threads again.
The omp_set_num_threads(cpucount) call is very slow, slower than waking threads with wait_policy=passive. This can be the reason for (c), if your libgomp starts its threads only at the first parallel region.
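A minimal sketch of that start/stop pattern, shown here in Fortran (the same omp_set_num_threads calls apply in C/C++; cpucount is simply taken from omp_get_num_procs in this sketch):

program start_stop_demo
  use omp_lib
  implicit none
  integer :: cpucount, i
  real*8  :: s

  call omp_set_num_threads(1)          ! keep the worker pool quiet at program start
  cpucount = omp_get_num_procs()

  call omp_set_num_threads(cpucount)   ! "start" the workers just before the OpenMP region
  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, 1000000
     s = s + dble(i)
  end do
  !$omp end parallel do

  call omp_set_num_threads(1)          ! idle the workers again afterwards
  print *, s
end program start_stop_demo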
