I am running two versions of the code: a hybrid version under MPI in which each process does multithreading with OpenMP, and a pure OpenMP parallel version. For a matrix size of 500 the performance of the MPI version is worse, but I am able to get results for very large matrices (e.g. size 1000), which is not possible with OpenMP alone. I am wondering: should OpenMP combined with MPI always perform better than plain OpenMP, or can it behave the way it does in my use case?
The theory of multithreaded programming is usually explained in terms of the number of cores, but nowadays processors have more logical cores than physical ones. The question is: if a well-implemented parallel algorithm is run on a processor with 4 physical and 8 logical cores, will the best-case speedup be 4x or 8x (ignoring the cost of parallelism and other overhead)?
For example, below you can see the results of image filtering on a CPU with 4 cores and 8 threads. The upper bound looks like a 4x speedup, yet using 8 threads still gives the best speedup of all the configurations tested.
Logical cores are only useful if your code is latency bound. This is the case when there are stalls (e.g. cache misses) or when instructions are heavily serialized (e.g. a loop of dependent divisions). In such a case the processor can truly execute 2 threads in parallel on the same physical core (using two logical cores). A relatively good example is a naive matrix transposition. Logical cores do not help much on well-optimized codes, because such codes do not stall often and generally expose a lot of instruction-level parallelism (e.g. due to loop unrolling).
When you are measuring a speedup it is generally not relevant to use logical cores, unless you know the workload can benefit from them (i.e. it is inherently latency bound) or the processor is designed so that they should be used (on Xeon Phi or POWER processors, for example). I expect logical cores not to be useful for an optimized image-filtering workload.
Note that logical cores tend to make benchmark results harder to understand.
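As a rough illustration of the latency-bound case mentioned above, here is a minimal Fortran sketch of a naive matrix transposition (the size and names are placeholders, not taken from the question): one of the two array accesses is strided, so the loop stalls on cache misses, which is exactly the situation where logical cores can pay off.

program naive_transpose
   use omp_lib
   implicit none
   integer, parameter :: n = 4096
   double precision, allocatable :: A(:,:), B(:,:)
   integer :: i, j

   allocate(A(n,n), B(n,n))
   call random_number(A)

   ! Naive transposition: A(i,j) is read contiguously (column-major),
   ! but B(j,i) is written with a stride of n, so the loop is dominated
   ! by cache misses rather than arithmetic.
   !$omp parallel do private(i)
   do j = 1, n
      do i = 1, n
         B(j,i) = A(i,j)
      end do
   end do
   !$omp end parallel do

   print *, B(1,1)   ! keep the result from being optimized away
end program naive_transpose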
I'm currently facing a performance issue when calling Intel MKL inside an OpenMP loop. Let me explain my problem in more detail after posting a simplified version of the code.
program Test
   use omp_lib
   implicit none
   double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
   integer :: ik, il, ij, N, M, Y

   M = 20
   Y = 2000000
   N = 500
   allocate(RhoM(M,N), Rho1M(M,N))
   RhoM  = (1.0d0,0.0d0)
   Rho1M = (0.0d0,1.0d0)
   call omp_set_num_threads(4)

   do il = 1, Y
      Rho1M = (0.0d0,1.0d0)
      ! The parallel region is opened and closed on every outer iteration
      !$omp parallel do private(ik)
      do ik = 1, N
         call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
      end do
      !$omp end parallel do
   end do
end program Test
Basically, this program does an in-place matrix summation. It does not compute anything meaningful; it is just a simplified code. I'm running Windows 10 Pro and using the Intel Fortran compiler (version 19.1.0.166). I compile with: ifort -o Test.exe Test.f90 /fast /O3 /Qmkl:sequential /debug:all libiomp5md.lib /Qopenmp. Since the "vectors" used by zaxpy aren't that large, I tried to use OpenMP to speed up the program. I checked the running time with Intel's VTune tool (that is the reason for the /debug:all flag). I have an i5-4430, i.e. 4 physical cores and 4 threads.
Time with OpenMP: 107 s;
Time without OpenMP: 44 s
The funny thing is that the program gets slower as the number of threads increases. VTune tells me that more threads are used, yet the computation time goes up. This seems very counterintuitive.
Of course, I am not the first to face a problem like this. I will attach some links and discuss why those suggestions did not work for me.
Intel provides information about how to choose parameters (https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications). However, I'm linking with the sequential Intel MKL, and if I try the suggested parameters with the parallel Intel MKL, it is still slow.
It seems to be important to switch on omp_set_nested(1) (Number of threads of Intel MKL functions inside OMP parallel regions). However, this routine is deprecated, and when I use omp_set_max_active_levels() instead I cannot see any difference.
This is probably the most suitable question (Calling multithreaded MKL in from openmp parallel region). However, I use the sequential Intel MKL and therefore do not have to care about the MKL threads.
This one (OpenMP parallelize multiple sequential loops) says I should try the schedule clause. I tried dynamic and static with different chunk sizes, but it did not help at all, since the amount of work per thread is exactly the same.
It would be very nice if you have an idea why the program slows down as the number of threads increases.
If you need any further information, please tell me.
It seems that OpenMP creates and destroys the thread team 2,000,000 times, once per iteration of the outer loop, and this fork/join overhead causes the additional computation time. See the post from Andrew (https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/733673) and the post from Jim Dempsey.
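A minimal sketch of the corresponding fix, adapted from the simplified test program above: open the parallel region once, outside the outer loop, so the thread team is created a single time and only the worksharing is repeated.

call omp_set_num_threads(4)
!$omp parallel private(il, ik)
do il = 1, Y
   !$omp single
   Rho1M = (0.0d0,1.0d0)   ! re-initialize once; implicit barrier follows
   !$omp end single
   !$omp do
   do ik = 1, N
      call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
   end do
   !$omp end do            ! implicit barrier before the next iteration
end do
!$omp end parallel

All threads run the outer loop redundantly, but the re-initialization and the zaxpy calls are only executed once per iteration thanks to the single and do constructs, and the implicit barriers keep the iterations ordered.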
I am using Ubuntu 14.04 x64 on a machine with an Intel Xeon CPU, and I am experiencing strange behaviour. I have a Fortran code, and a lengthy part of the calculation is parallelized with OpenMP. With a smaller data set (say, fewer than 4000 elements) everything works fine. However, when I test a data set with 90K elements, in the middle of the calculation the number of threads used suddenly drops to 1, which obviously slows down the computation.
I did these checks already:
Using OMP_GET_NUM_THREADS() I monitor the number of threads during the run, and it remains the same even after the system drops to using 1 thread.
I use a LAPACK routine for eigenvalue calculation inside the loop. I recompiled LAPACK on my system to make sure the installed libraries are not doing anything unexpected.
Can it be that the system changes the number of threads used from outside? If so, why?
Thank you.
It looks like a load-balancing problem. Try dynamic scheduling:
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
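A minimal sketch of what that could look like around the eigenvalue loop described in the question; the loop bound and the process_element wrapper are placeholders, since the original code is not shown:

! With SCHEDULE(DYNAMIC), iterations are handed out to threads as they
! finish, so a few slow eigenvalue problems no longer leave the other
! threads idle towards the end of the loop.
!$omp parallel do schedule(dynamic) private(i)
do i = 1, n_elements
   call process_element(i)   ! hypothetical wrapper around the LAPACK call
end do
!$omp end parallel do

A chunk size can be added, e.g. schedule(dynamic,4), to reduce scheduling overhead if individual iterations are cheap.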
As far as I know, in a multiprocessor environment any thread/process can be allocated to any core/processor, so what is meant by the following line:
the number of MPI ranks used on an Intel Xeon Phi coprocessor should be substantially fewer than the number of cores in no small part because of limited memory on the coprocessor.
I mean, what are the issues if #cores <= #MPI ranks?
That quote is correct only when it is applied to a memory-size-constrained problem; in general it would be an incorrect statement. In general you should use more tasks than you have physical cores on the Xeon Phi in order to hide memory latency [1].
To answer your question "What are the issues if the number of cores is fewer than the number of MPI ranks?": you run the risk of having too much context switching. On many problems it is advantageous to use more tasks than you have cores to hide memory latency [2].
[1] I don't even feel like I need to cite a reference for this, given how loudly it is advertised; however, it is mentioned in the OpenCL design and programming guide for the Xeon Phi coprocessor: http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor
[2] This advice applies to the Xeon Phi specifically, not necessarily to other pieces of hardware.
Well, if you make the number of MPI tasks higher than the number of cores, it makes no sense, because you start forcing two tasks onto one processing unit and therefore exhaust its computing resources.
As for why a substantially lower number of tasks than cores is preferred on the Xeon Phi: maybe they prefer threads over processes. The architecture of the Xeon Phi is quite peculiar, and the overhead introduced by maintaining an MPI task can seriously cripple computing performance. I will not hide that I do not know the technical reason behind it; maybe someone else can fill it in.
If I recall correctly, the communication bus there is a ring (or two rings), so maybe all-to-all communication and barriers pollute the bus and turn out to be ineffective.
Using threads or the native execution mode they provide has less overhead.
Also, I think you should look at it more like a multicore CPU, not a multi-CPU machine. For greater performance you don't want to run 4 MPI tasks on a 4-core CPU either; you want to run one 4-threaded MPI task.
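A minimal sketch of that layout in Fortran, assuming one MPI rank per CPU with the work inside each rank shared among OpenMP threads (the loop body is a placeholder):

program hybrid_sketch
   use mpi
   use omp_lib
   implicit none
   integer :: ierr, provided, rank, nranks, i

   ! Ask for FUNNELED support: only the master thread makes MPI calls.
   call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   ! The work owned by this rank is shared among the OpenMP threads.
   !$omp parallel do private(i)
   do i = 1, 1000
      ! ... per-rank work goes here ...
   end do
   !$omp end parallel do

   call MPI_Finalize(ierr)
end program hybrid_sketch

OMP_NUM_THREADS (or omp_set_num_threads) would then be set to the number of cores available to each rank.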
I have implemented a few ordinary loop-based applications in OpenMP, TBB and OpenCL. In all of them, OpenCL gives far better performance than the others, even when I run it only on the CPU with no specific optimizations done in the kernels. OpenMP and TBB give good performance too, but far less than OpenCL. What could be the reason? Both are CPU-specialized frameworks and should deliver performance at least equal to OpenCL.
My second concern is that, comparing OpenMP and TBB, OpenMP always performs better than TBB in my implementations, which I have not tuned heavily since I am not an expert. Is there a reason why OpenMP is normally faster than TBB? I would have thought that both of them, and even OpenCL, use the same kind of thread pooling at a low level. Any expert opinions? Thanks
One advantage that OpenCL has over TBB and OpenMP is that it can make better use of the SIMD parallelism in your hardware. Some OpenCL implementations will run your code such that each work item runs in a SIMD vector lane of the machine, as well as running on separate cores. Depending on the algorithm, this can provide a lot of performance benefit.
C compilers can also exploit some SIMD parallelism through auto-vectorization, but the memory aliasing rules in C make it hard for this to work in some cases. Since OpenCL requires programmers to call out the work items and fence memory accesses explicitly, an OpenCL compiler can be more aggressive.
In the end, it depends on your code. One could find an algorithm for which any of OpenCL, OpenMP, or TBB are best.
The OpenCL runtime for CPU and MIC provided by Intel uses TBB under the hood. It is far from just 'thread pooling at a low level', since it takes advantage of the sophisticated scheduling and partitioning algorithms provided by TBB for better load balance and thus better utilization of the CPUs.
As for TBB vs. OpenMP: usually it comes down to incorrect measurements. For example, TBB has no implicit barrier like OpenMP does, so a warm-up loop is not enough; you have to make sure all the threads have been created and that this overhead is not included in your measurements. Another example: sometimes compilers are unable to vectorize the same code with TBB that they do vectorize with OpenMP.
OpenCL kernels are compiled for the given hardware. The potential for vendor/hardware specific optimisations is huge.