I'm currently facing a performance issue when calling Intel MKL inside an OpenMP loop. Let me explain my problem in more detail after posting a simplified example.
program Test
   use omp_lib
   implicit none
   double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
   integer :: ik, il, N, M, Y
   M = 20
   Y = 2000000
   N = 500
   allocate(RhoM(M,N), Rho1M(M,N))
   RhoM = (1.0d0, 0.0d0)
   Rho1M = (0.0d0, 1.0d0)
   call omp_set_num_threads(4)
   do il = 1, Y
      Rho1M = (0.0d0, 1.0d0)
      !$omp parallel do private(ik)
      do ik = 1, N
         call zaxpy(M, (1.0d0, 0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
      end do
      !$omp end parallel do
   end do
end program Test
Basically, this program does an in-place matrix summation. The computation itself is meaningless; it is just a simplified example. I'm running Windows 10 Pro and using the Intel Fortran compiler (version 19.1.0.166). I compile with: ifort -o Test.exe Test.f90 /fast /O3 /Qmkl:sequential /debug:all libiomp5md.lib /Qopenmp. Since the "vectors" handled by zaxpy aren't that large, I tried to use OpenMP to speed up the program. I checked the running time with Intel's VTune tool (that's the reason for the /debug:all flag). My CPU is an i5-4430, i.e. 4 physical cores and 4 threads.
Time with OpenMP: 107 s
Time without OpenMP: 44 s
The funny thing is that the program gets slower as the number of threads increases. VTune confirms that more threads are being used, yet the wall-clock time goes up. This seems very counterintuitive.
Of course, I am not the first one to face problems like this. I will attach some links and discuss why the suggested fixes did not work for me.
Intel provides recommendations on how to choose the threading parameters (https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications). However, I'm linking against sequential Intel MKL, and if I try the suggested parameters with parallel Intel MKL instead, the program is still slow.
Several answers say it is important to enable nesting with omp_set_nested(1) (Number of threads of Intel MKL functions inside OMP parallel regions). However, that routine is deprecated, and when I use omp_set_max_active_levels() instead, I cannot see any difference.
This is probably the most relevant question (Calling multithreaded MKL in from openmp parallel region). However, I use sequential Intel MKL and therefore do not have to care about MKL's own threads.
This one here (OpenMP parallelize multiple sequential loops) says I should try the schedule clause. I tried dynamic and static with different chunk sizes, but it did not help at all, since the amount of work per iteration is exactly the same.
It would be very nice if you have an idea why the program slows down as the number of threads increases.
If you need any further information, please tell me.
Update: It seems that OpenMP creates and destroys the team of threads 2,000,000 times, once per iteration of the outer loop, and this thread-management overhead causes the additional computational time. See the post from Andrew (https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/733673) and the post from Jim Dempsey.
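A sketch of how the loops can be restructured along those lines, so that the thread team is created only once (untested, shown only to illustrate the idea; the zaxpy call is unchanged from the example above):

!$omp parallel private(il, ik)
do il = 1, Y
   !$omp single
   Rho1M = (0.0d0, 1.0d0)   ! reset by one thread per outer iteration
   !$omp end single
   !$omp do
   do ik = 1, N
      call zaxpy(M, (1.0d0, 0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
   end do
   !$omp end do
end do
!$omp end parallel

The implicit barriers at end single and end do keep the outer iterations ordered, while the parallel region itself is entered exactly once.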
I am new to Intel OneAPI. I installed the OneAPI package, and when I run
mpirun -n ...
I receive an output like the following if I set N = 3 (for example):
Iteration #1...
Iteration #1...
Iteration #1...
Iteration #2...
Iteration #2...
Iteration #2...
Rather than dividing the cores I specify among a single run of the program, it runs the program N times with one core assigned to each process. I was wondering how to set this up so that N cores are assigned to one process.
Other useful information: I am running a program called Quantum ESPRESSO on a NUMA machine with two 18-core processors and 2 hardware threads per core. I initially installed Intel OneAPI because I noticed that if I specify 72 cores with mpirun, the computational demand increases 50-60 fold compared to running with 1 core, and I was hoping OneAPI might be able to resolve this.
So mpirun with -np specifies how many instances (MPI ranks) of a given program to run, as you saw.
Have you read this part of their documentation?
https://www.quantum-espresso.org/Doc/user_guide/
I’m not sure how you’ve built it or which functions you are using, but if you have built against their multithreaded libraries with OpenMP then you should get N threads in a process for those library calls.
Otherwise you will be limited to the MPI parallelism in their MPI parallel code.
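As a rough illustration only (pw.x and the input file name are placeholders here, and the useful MPI/OpenMP split depends entirely on how the code was built and on the problem), a hybrid run on a 2x18-core machine might look something like:

export OMP_NUM_THREADS=18                 # OpenMP threads per MPI rank
mpirun -np 4 pw.x -in scf.in > scf.out    # 4 MPI ranks x 18 threads = 72 hardware threads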
I'm not sure what you expect when you say that you use all 72 cores and the computational demand increases. Isn't that what you want, with the goal that the final result is completed sooner? In either the OpenMP or the MPI case you should see computational resource usage go up.
Good luck!
How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop can be used to monitor general CPU usage, but not specifically SIMD instruction usage.
I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).
See "How to characterize a workload by obtaining the instruction type breakdown?" and "How do I determine the number of x86 machine instructions executed in a C program?", specifically sde64 -mix -- ./my_program, which prints the instruction mix for your program for that run; example output is shown in "libsvm compiled with AVX vs no AVX".
I don't think there's a good way to make this work like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.
It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. Not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.
You'd need hardware instruction-counting support for this to be really efficient the way top is.
For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.
e.g. from perf list output:
fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired. Each count represents 4
computations. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT
DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as
they perform multiple calculations per element]
So it's not even counting uops, it's counting FLOPS. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)
You can use perf to count their execution globally across processes and on all cores, or for a single process:
## count math instructions only, not SIMD integer, load/store, or anything else
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.{128,256}b_packed_{double,single}:u ./my_program
# fixme: that brace-expansion doesn't expand properly; it separates with spaces not commas.
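The same command with the event names written out in full, so the brace expansion isn't needed:

perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u ./my_program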
(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD and scalar instructions on XMM registers don't count, IMO.)
(You can attach perf to a running process by using -p PID instead of a command, or use perf top as suggested in "Ubuntu - how to tell if AVX or SSE is currently being used by CPU app?".)
You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.
Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.
I have code that carries out heavy computations: it executes pred(In, Out), say, 64K times, where each execution takes 1-10 seconds.
I want to use a multi-threaded (64-thread) machine to speed up the process.
I use concurrent_maplist for this:
concurrent_maplist(pred, List_of_64K_In, List_of_64K_Out).
I get approximately an 8x speedup, but not more than that.
I thought the reason was the following note on concurrent_maplist:
Note that the overhead of this predicate is considerable and therefore Goal must be fairly expensive before one reaches a speedup.
To make a goal fairly expensive, I modified the code as:
% putting 1K pred/2 in heavy_pred/2
concurrent_maplist(heavy_pred, List_of_64_List_of_1k_In, List_of_64_List_of_1k_Out).

heavy_pred(List_of_In, List_of_Out) :-
    maplist(pred, List_of_In, List_of_Out).
Surprisingly (for me), I do not get any further speedup with this change.
How can I get a further speedup with multi-threading?
Some additional details:
Architecture: x86_64, AMD, 14.04.1-Ubuntu.
swipl -v: SWI-Prolog version 6.6.4 for amd64.
pred/2 is a theorem prover which takes formulas and tries to prove them.
It uses standard predicates with a few non-standard ones: cyclic_term/1, write/1, copy_term/2, etc.
To make your cores work, you can use threads fired up by your application. Using the very latest swipl version may also help you get the latest improvements it offers.
You can also fire up several application instances, each running e.g. 8 threads. Result: more cores get work, although the total will use more memory than running a single application with more threads. You will then need to manage the different application instances so that the overall job progresses (and work done in one instance is not repeated in another); I imagine something similar is needed even in the case of a single application with more threads. My answer is not very technically steeped, but it should get you to use more cores for a CPU-intensive job.
Multi-core CPUs don't scale linearly when the results are merged. I checked whether there is a difference between concurrent_maplist/3 and the new concurrent_and/2, using an example with fairly large work items from here.
I added testing for concurrent_maplist/2:
/* parallel, concurrent_maplist */
count3(N) :-
    findall(Y, between(1, 1000, Y), L),
    concurrent_maplist(slice, L, R),
    aggregate_all(sum(M), member(M, R), N).
Interestingly, there is not much difference between concurrent_and/2 and concurrent_maplist/3, possibly because the number of work items is only 1000, each work item being fairly large. Nevertheless, 1,000,000 numbers were checked for primality:
?- current_prolog_flag(cpu_count, X).
X = 8.
/* sequential */
?- time(count(N)).
% 138,647,812 inferences, 9.781 CPU in 9.792 seconds (100% CPU, 14174856 Lips)
N = 78499.
/* parallel, concurrent_and */
?- time(count2(N)).
% 4,450 inferences, 0.000 CPU in 2.458 seconds (0% CPU, Infinite Lips)
N = 78499.
/* parallel, concurrent_maplist */
?- time(count3(N)).
% 23,183 inferences, 0.000 CPU in 2.423 seconds (0% CPU, Infinite Lips)
N = 78499.
Although the machine has 8 logical cores thanks to hyper-threading, the speed-up is close to the number of physical cores, which is only 4. So in the case of your machine with 64 logical cores (which is what the cpu_count Prolog flag returns), you should check how many physical cores there actually are.
I am using Ubuntu 14.04 x64 on a machine with an Intel Xeon CPU, and I am experiencing strange behaviour. I have a Fortran code in which a lengthy part of the calculation is parallelized with OpenMP. With a smaller data set (say, fewer than 4000 elements) everything works fine. However, when I test a data set with 90K elements, in the middle of the calculation the number of threads used suddenly drops to 1, which obviously slows down the computation.
I did these checks already:
Using OMP_GET_NUM_THREADS() I monitor the number of threads during the run, and it remains the same even after the system drops to using 1 thread.
I use a LAPACK routine for the eigenvalue calculation inside the loop. I recompiled LAPACK on my system to make sure the system's libraries are not doing anything unexpected.
Can it be that the system changes the number of threads used from the outside? If so, why?
Thank you.
It looks like a load-balancing problem. Try dynamic scheduling on the loop:
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
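A minimal standalone sketch of the idea (not the poster's actual code; the work routine here is invented purely to create uneven per-iteration cost):

program dyn_sched_demo
   implicit none
   integer :: i
   integer, parameter :: n = 1000
   double precision :: partial(n)
   partial = 0.0d0
   ! Iteration cost grows with i, so static scheduling leaves threads idle;
   ! schedule(dynamic) hands out iterations as threads become free.
   !$omp parallel do schedule(dynamic)
   do i = 1, n
      partial(i) = expensive_work(i)
   end do
   !$omp end parallel do
   print *, 'sum = ', sum(partial)
contains
   double precision function expensive_work(i) result(r)
      integer, intent(in) :: i
      integer :: k
      r = 0.0d0
      do k = 1, i * 1000
         r = r + sin(dble(k))
      end do
   end function expensive_work
end program dyn_sched_demo

Compile with -qopenmp (ifort) or -fopenmp (gfortran) and compare schedule(dynamic) against schedule(static) to see the difference.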
I use FFTW 3.1.2 with Fortran to perform real to complex and complex to real FFTs. It works perfectly on one thread.
Unfortunately, I have some problems when I use the multi-threaded FFTW on a 32-CPU shared-memory computer. I have two plans, one for 9 real-to-complex FFTs and one for 9 complex-to-real FFTs (size of each real field: 512*512). I use Fortran and I compile my code (using ifort) linking against the following libraries:
-lfftw3f_threads -lfftw3f -lm -lguide -lpthread -mp
The program seems to compile correctly and the function sfftw_init_threads returns a non-zero integer value, usually 65527.
However, even though the program runs perfectly, it is slower with 2 or more threads than with one. A top command shows a weird CPU load larger than 100% (and much larger than n_threads*100). An htop command shows one processor (say, number 1) working at 100% load on the program, while ALL the other processors, including number 1, are listed as working on this very same program at 0% load, 0% memory and 0 TIME.
If anybody has any idea of what's going on here... thanks a lot!
This looks like it could be a synchronisation problem. You can get this type of behaviour if all threads except one are locked out of a library call, e.g. by a semaphore.
How are you calling the planner? Are all your function calls correctly synchronised? Are you creating the plans in a single thread or on all threads? I assume you've read the notes on thread safety in the FFTW docs... ;)
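For reference, a minimal sketch of the usual single-precision pattern with the legacy Fortran interface: initialize threading and create the plan once, from a single thread, before any execution. (Sizes, the number of threads and FFTW_ESTIMATE are placeholders, not the poster's actual setup.)

program fftw_threads_sketch
   implicit none
   include 'fftw3.f'
   integer, parameter :: nx = 512, ny = 512, nthreads = 4
   integer :: iret
   integer*8 :: plan_r2c
   real(4) :: in(nx, ny)
   complex(4) :: out(nx/2 + 1, ny)

   call sfftw_init_threads(iret)             ! non-zero iret means success
   call sfftw_plan_with_nthreads(nthreads)   ! must be set BEFORE creating the plan

   ! Plan creation is not thread-safe: do it once, from a single thread.
   call sfftw_plan_dft_r2c_2d(plan_r2c, nx, ny, in, out, FFTW_ESTIMATE)

   in = 1.0
   call sfftw_execute(plan_r2c)

   call sfftw_destroy_plan(plan_r2c)
   call sfftw_cleanup_threads()
end program fftw_threads_sketch

If plans are instead created or modified from several threads at once, the threads can end up serialised inside the library, which could match the symptoms described.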
Unless your FFTs are pretty large, the automatic multithreading in FFTW is unlikely to be a win speed wise. The synchronization overhead inside the library can dominate the computation being done. You should profile different sizes and see where the break even point is.