I am using Ubuntu 14.04 x64 on a machine with an Intel Xeon CPU, and I am experiencing some strange behaviour. I have a Fortran code, and a lengthy part of the calculation is parallelised with OpenMP. With a smaller data set (say fewer than 4000 elements) everything works fine. However, when I test a data set with 90K elements, in the middle of the calculation the number of threads used suddenly drops to 1, which obviously slows down the computation.
I did these checks already:
Using OMP_GET_NUM_THREADS() I monitor the number of threads during the run, and it remains the same even after the system drops to using 1 thread.
I use a LAPACK routine for eigenvalue calculation inside the loop. I recompiled LAPACK on my system to make sure the system libraries are not doing anything unexpected.
Can it be that the system changes the number of threads used from outside? If so, why?
Thank you.
It looks like a load-balancing problem. Try dynamic scheduling on the loop:
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
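A minimal, self-contained sketch of what that looks like (the loop body, element count, and chunk size of 16 are placeholders, since the original loop was not posted). With DYNAMIC scheduling, iterations are handed out to threads as they become free, so a few expensive iterations (e.g. slow LAPACK eigenvalue calls) do not leave the other threads idle waiting on a fixed, pre-assigned block:

program dynamic_schedule_sketch
   use omp_lib
   implicit none
   integer, parameter :: n_elements = 90000
   integer :: i
   real(8) :: total
   total = 0.0d0
   ! Hand out iterations in chunks of 16 as threads finish their previous chunk.
   !$omp parallel do schedule(dynamic, 16) reduction(+:total)
   do i = 1, n_elements
      total = total + dble(mod(i, 97))   ! dummy stand-in for the per-element work (e.g. the LAPACK call)
   end do
   !$omp end parallel do
   print *, 'total =', total
end program dynamic_schedule_sketch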
I am new to Intel OneAPI, but I installed the OneAPI package and when I run
mpirun -n ...
I receive an output like the following if I set N = 3 (for example):
Iteration #1...
Iteration #1...
Iteration #1...
Iteration #2...
Iteration #2...
Iteration #2...
Rather than dividing the cores I specify among a single run of the program, it runs the program N times, with 1 core given to each process. I was wondering how to set this up so that the N cores are devoted to one run of the program.
Other useful information: I am running a program called Quantum Espresso on a NUMA machine with two 18-core processors and 2 threads per core. I initially installed Intel OneAPI because I noticed that if I specify 72 cores with mpirun, the computational demand increases 50-60 fold compared to running with 1 core, and I was hoping OneAPI might be able to resolve this.
So mpirun with -np specifies how many instances (MPI ranks) of a given program to run, as you saw.
Have you read this part of their documentation?
https://www.quantum-espresso.org/Doc/user_guide/
I’m not sure how you’ve built it or which functions you are using, but if you have built against their multithreaded libraries with OpenMP then you should get N threads in a process for those library calls.
Otherwise you will be limited to the MPI parallelism in their MPI parallel code.
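For example, something along these lines (hedged: the 18/4 rank-and-thread split, the pw.x executable, and the file names are placeholders for whatever your build and input actually use) runs 18 MPI ranks and lets each rank use 4 OpenMP threads in the threaded library calls:

# 18 MPI ranks x 4 OpenMP threads = 72 hardware threads (assumed split for a 2 x 18-core, 2-way SMT box)
export OMP_NUM_THREADS=4
mpirun -np 18 pw.x < input.in > output.out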
I’m not sure what you expect when you say you use all 72 cores and the computational demand increases? Isn’t that what you want, with the goal that the final result is completed sooner? In either the OpenMP or the MPI case you should see computational resource usage go up.
Good luck!
New member here but long time Perl programmer.
I have a process that I run on a Windows machine that iterates through combinations of records from arrays/lists to identify a maximum combination, following a set of criteria.
On an old Intel i3 machine, an example would take about 45 mins to run. I purchased a new AMD Ryzen 7 machine that on benchmarks is about 7 or 8 times faster than the old machine. But the execution time was only reduced from 45 to 22 minutes.
This new machine has crazy processor capabilities, but it does not appear that Perl takes advantage of these.
Are there Perl settings or ways of coding to take advantage of all of the processor speed that I have on my new machine? Threads, etc?
thanks
Perl by default will only use a single thread and thus only a single CPU core. This means it will only use a small part of what current multi-core systems offer. It does have the ability to make use of multiple threads, and thus multiple CPU cores, but this has to be done explicitly, i.e. the implementation needs to be adapted to make use of parallel execution. That can involve major changes to the algorithm used to solve your problem, and not all problems can be easily parallelized.
Apart from that, Perl is not the preferred language if performance is the goal. There is a lot of overhead due to it being a dynamically typed language with no explicit control over memory allocation. Languages like C, C++ or Rust, which are closer to the hardware, start with significantly less overhead and then allow even more low-level control to reduce it further. But they don't magically parallelize either.
I'm currently facing a performance issue when calling Intel MKL inside an OpenMP loop. Let me explain my problem in more detail after posting a simplified code.
program Test
   use omp_lib
   implicit none
   double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
   integer :: ik, il, ij, N, M, Y
   M = 20
   Y = 2000000
   N = 500
   allocate(RhoM(M,N), Rho1M(M,N))
   RhoM = (1.0d0,0.0d0)
   Rho1M = (0.0d0,1.0d0)
   call omp_set_num_threads(4)
   do il = 1, Y
      Rho1M = (0.0d0,1.0d0)
      !$omp parallel do private(ik)
      do ik = 1, N
         call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
      end do
      !$omp end parallel do
   end do
end program Test
Basically, this program does an in-place matrix summation; it does not compute anything meaningful, it is just a simplified example. I'm running Windows 10 Pro and using the Intel Fortran compiler (version 19.1.0.166). I compile with: ifort -o Test.exe Test.f90 /fast /O3 /Qmkl:sequential /debug:all libiomp5md.lib /Qopenmp. Since the "vectors" used by zaxpy aren't that large, I tried to use OpenMP to speed up the program. I checked the running time with Intel's VTune tool (that's the reason for the /debug:all flag). I have an i5 4430, meaning 4 threads and 4 physical cores.
Time with OpenMP: 107 s;
Time without OpenMP: 44 s
The funny thing is that the program gets slower as the number of threads increases. VTune tells me that more threads are used, yet the computational time goes up. This seems very counterintuitive.
Of course, I am not the first one facing problems like this. I will attach some links and discuss why the suggestions there did not work for me.
Intel provides information about how to choose parameters (https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications). However, I'm linking with the sequential Intel MKL, and if I try the suggested parameters with the parallel Intel MKL, it is still slow.
It seems to be important to switch on omp_set_nested(1) (Number of threads of Intel MKL functions inside OMP parallel regions). However, that routine is deprecated, and when I use omp_set_max_active_levels() instead I cannot see any difference.
This is probably the most closely related question (Calling multithreaded MKL in from openmp parallel region). However, I use the sequential Intel MKL and do not have to care about the MKL threads.
This one (OpenMP parallelize multiple sequential loops) says I should try using the schedule clause. I tried dynamic and static with different chunk sizes, but it did not help at all, since the amount of work per iteration is exactly the same.
It would be very nice if you have an idea why the program slows down as the number of threads increases.
If you need any further information, please tell me.
It seems that OpenMP creates and destroys the thread team 2,000,000 times, once per outer iteration, and that causes the additional computational time. See the post from Andrew (https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/733673) and the post from Jim Dempsey.
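For reference, a minimal sketch of that restructuring (the single/do arrangement below is my own phrasing of the idea, not code from that thread): the parallel region is entered once, and only the work-sharing loop sits inside the outer iteration, so the thread team is not created and torn down 2,000,000 times.

program TestHoisted
   use omp_lib
   implicit none
   double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
   integer :: ik, il, N, M, Y
   M = 20
   Y = 2000000
   N = 500
   allocate(RhoM(M,N), Rho1M(M,N))
   RhoM = (1.0d0,0.0d0)
   call omp_set_num_threads(4)
   !$omp parallel private(ik, il)
   do il = 1, Y
      !$omp single
      Rho1M = (0.0d0,1.0d0)               ! reset by one thread; implicit barrier at end single
      !$omp end single
      !$omp do
      do ik = 1, N
         call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik), 1, Rho1M(:,ik), 1)
      end do
      !$omp end do                        ! implicit barrier keeps the outer iterations in step
   end do
   !$omp end parallel
end program TestHoisted

The two barriers per outer iteration remain, but they are much cheaper than repeatedly spinning up and shutting down the thread team.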
I have two physical, "identical" Linux Red Hat servers. I ran a small program on both of them. My problem: the CPU usage of my program differs between the two servers. I am not a Linux expert, and I am wondering what could lead to that performance difference?
I wrote the program in C++ and in Java to see if the inconsistency comes from the programming language chosen. The program itself does a little bit of integer calculation over time to consume a constant amount of CPU time. Both program versions show the same percentage difference in CPU usage.
Environmental factors I have already considered and can exclude:
identical server type
identical processor (both have two sockets, single core)
both Intel Hyper-Threading-Technology enabled
identical clock speed
identical OS version (Red Hat Enterprise Linux Server release 5.9)
identical Java version, Java RE, JVM
Intel Demand based Switching can be ignored since the measurement tool uses the default value of clock speed for CPU capacity
processor affinity can be excluded as well I think. I ran multiple measurement series and I always retrieve exactly the same CPU usage values.
Is there maybe a C library or something like that, that has an impact on the CPU usage of C++ and Java programs which needs to be updated separately from the actual OS version? Or could there be a different thread scheduler?
There are a variety of things that can differ even for "identical" systems. Different compilers being used to build various libraries, as well as different versions of compilers. For example, there are continuous improvements from generation to generation of the ability of Intel compilers to optimize. Other differences can occur due to airflow differences causing one machine to run hotter than the other resulting in a drop in frequency occasionally. There are a whole host of other issues that can cause identical systems to run differently.
Here's my recommendation: Create an OS image and use that same image for both systems. Disconnect both from any network. Run compute bound (which you are). Bind your app to a certain core. Verify the exit air temperatures are well within specification. Disable any turbo capability. If there are still differences, do a memory speed check.
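For the core-binding step, a hedged example (taskset is the standard util-linux tool on Red Hat; the program name is a placeholder):

taskset -c 0 ./my_benchmark    # pin the process to logical CPU 0 for the whole run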
Also, use a more sophisticated profiling and analysis tool such as Intel Vtune. You can dig into actual cycles, measure cache misses, branch mispredicts, etc. They should also be identical. If they aren't, the analysis should give you an idea of where the problem lies.
I use FFTW 3.1.2 with Fortran to perform real to complex and complex to real FFTs. It works perfectly on one thread.
Unfortunately I have some problems when I use the multi-threaded FFTW on a 32-CPU shared-memory computer. I have two plans, one for 9 real-to-complex FFTs and one for 9 complex-to-real FFTs (size of each real field: 512*512). I use Fortran and I compile my code (using ifort), linking against the following libraries:
-lfftw3f_threads -lfftw3f -lm -lguide -lpthread -mp
The program seems to compile correctly and the function sfftw_init_threads returns a non-zero integer value, usually 65527.
However, even though the program runs perfectly, it is slower with 2 or more threads than with one. A top command shows a weird CPU load larger than 100% (and much larger than n_threads*100). An htop command shows that one processor (let's say number 1) is working at 100% load on the program, while ALL the other processors, including number 1, are also listed as working on this very same program at 0% load, 0% memory and 0 TIME.
If anybody has any idea of what's going on here... thanks a lot!
This looks like it could be a synchronisation problem. You can get this type of behaviour if all threads except one are locked out, e.g. by a semaphore around a library call.
How are you calling the planner? Are all your function calls correctly synchronised? Are you creating the plans in a single thread or on all threads? I assume you've read the notes on thread safety in the FFTW docs... ;)
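For comparison, here is a minimal sketch of the usual initialisation order with the single-precision legacy Fortran interface (the 512x512 size matches the post, but the single 2D transform, the 4 threads and FFTW_ESTIMATE are assumptions; the real code would use plans for batches of 9 transforms): init_threads is called exactly once, plan_with_nthreads is set before any plan is created, and both plans are created from a single thread.

program fftw_threads_sketch
   implicit none
   include 'fftw3.f'
   integer, parameter :: n = 512
   integer :: iret
   integer*8 :: plan_r2c, plan_c2r
   real :: in(n, n)
   complex :: out(n/2 + 1, n)

   call sfftw_init_threads(iret)
   if (iret == 0) stop 'sfftw_init_threads failed'
   call sfftw_plan_with_nthreads(4)          ! all plans created after this call use 4 threads

   in = 1.0
   call sfftw_plan_dft_r2c_2d(plan_r2c, n, n, in, out, FFTW_ESTIMATE)
   call sfftw_plan_dft_c2r_2d(plan_c2r, n, n, out, in, FFTW_ESTIMATE)

   call sfftw_execute(plan_r2c)              ! executing existing plans is thread-safe
   call sfftw_execute(plan_c2r)

   call sfftw_destroy_plan(plan_r2c)
   call sfftw_destroy_plan(plan_c2r)
   call sfftw_cleanup_threads()
end program fftw_threads_sketch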
Unless your FFTs are pretty large, the automatic multithreading in FFTW is unlikely to be a win speed wise. The synchronization overhead inside the library can dominate the computation being done. You should profile different sizes and see where the break even point is.