All my codes run much slower when I use OpenMP multithreading

I have already seen several posts on this site about this issue. However, my serious codes, where the overhead of thread creation and such should not be a big issue, have become much slower with OpenMP! I am using a quad-core machine with gfortran 4.6.3 as my compiler. Below is an example of a test code.
Program test
  use omp_lib
  integer*8 i, j, k, l
  !$omp parallel
  !$omp do
  do i = 1, 20000
    do j = 1, 1000
      do k = 1, 1000
        l = i
      enddo
    enddo
  enddo
  !$omp end do nowait
  !$omp end parallel
End program test
This code takes around 80 seconds if I run it without OpenMP; with OpenMP, however, it takes around 150 seconds. I have seen the same issue with my other serious codes, whose runtime is around 5 minutes or so in serial mode. In those codes I am taking care that there are no dependencies from thread to thread. So why do these codes become slower instead of faster?
Thanks in advance.

You have a race condition: multiple threads are writing to the same shared variable l. The program is therefore invalid; l should be private. The race also causes a slowdown, because every write invalidates the copy of the cache line that the other cores hold, so the threads have to reload the memory content all the time. A similar thing happens when several threads use distinct variables that sit in the same cache line; that is known as false sharing.
You also probably don't use any compiler optimizations. Enable them with -O2, -O3, or -Ofast. You will then see that the program takes close to 0 seconds, because the compiler optimizes the whole loop nest away.
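For reference, here is a minimal corrected sketch of the test program (my addition, not from the original answer): making the inner loop indices and l private removes the race. Note that with optimizations enabled the compiler will still eliminate this dead loop nest entirely, so it remains a poor benchmark.

Program test
  use omp_lib
  integer*8 i, j, k, l
  ! j, k and l must be private; the worksharing loop index i is private by default
  !$omp parallel private(j, k, l)
  !$omp do
  do i = 1, 20000
    do j = 1, 1000
      do k = 1, 1000
        l = i   ! each thread now writes to its own copy of l
      enddo
    enddo
  enddo
  !$omp end do nowait
  !$omp end parallel
End program test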

Related

How to create OpenMP threads only once (and do not kill them between different parallel regions)

I have an array A_p which is defined as threadprivate for each thread. The code is complicated and does some calculations. Finally, I want to reduce all the arrays to one shared array A.
DO J = 1, Y
  DO I = 1, X
    a = 0
    !$omp parallel reduction(+:a)
    a = A_p(I,J)
    !$omp end parallel
    A(I,J) = A(I,J) + a
  END DO
END DO
This solution works, but the problem is that the parallel region, and possibly the threads themselves, are probably created anew on every iteration, which incurs a huge overhead. I would like to find a way to keep the threads alive between the iterations, so they are created just once.
I have also tried the following solution:
!$omp parallel reduction(+:A)
A = A_p
!$omp end parallel
but it seems to create a certain overhead for initializing a private copy of A for each thread (which, by the way, is redundant, because there are already threadprivate arrays and we do not really need more private copies). Of course the overhead here is smaller than the one observed with the previous solution, but still not good enough for me.
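For illustration, here is a minimal sketch (my own addition, not part of the original question) of one way to create the team only once: a single parallel region around the whole loop nest, in which every thread walks the full index space and accumulates its threadprivate copy into the shared A with an atomic update.

!$omp parallel private(I, J)
DO J = 1, Y
  DO I = 1, X
    ! each thread adds its own threadprivate A_p(I,J) exactly once
    !$omp atomic
    A(I,J) = A(I,J) + A_p(I,J)
  END DO
END DO
!$omp end parallel

The team is created a single time, at the price of one atomic update per element; whether this beats the reduction clause depends on X, Y, and the amount of contention.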
Also, I would like to ask how OpenMP implements the reduction. For example, in the first solution I presented, is the reduction of the variable a serial, or is it implemented in a tree-combining fashion (achieving logarithmic running time for the reduction phase)?

Matlab: -maxNumCompThreads, hyperthreading, and parpool

I'm running Matlab R2014a on a node in a Linux cluster that has 20 cores and hyperthreading enabled. I know this has been discussed before, but I'm looking for some clarification. Here's what my understanding is of the threads vs. cores issue in Matlab:
Matlab has inherent multithreading capabilities, and will utilize extra cores on a multicore machine.
Matlab runs its threads in such a way that putting multiple Matlab threads on the same core (i.e. hyperthreading) isn't useful. So by default, the maximum number of threads that Matlab will create is the number of cores on your system.
When using parpool(), regardless of the number of workers you create, each worker will use only one physical core, as mentioned in this thread.
However, I've also read that using the (deprecated) function maxNumCompThreads(), you can either decrease or increase the number of threads that Matlab or one of the workers will generate. This can be useful in several scenarios:
You want to utilize Matlab's implicit multithreading capabilities to run some code on a cluster node without allocating the entire node. It would be nice if there was some other way to do this if maxNumCompThreads ever gets removed.
You want to do a parameter sweep but have fewer parameters than the number of cores on your machine. In this case you might want to increase the number of threads per worker so that all of your cores are utilized. This was suggested recently in this thread.

However, in my experience, while the individual workers seem quite happy to use maxNumCompThreads() to increase their thread count, inspecting the actual CPU usage with the "top" command suggests that it doesn't have any effect, i.e. each worker still only gets to use one core. It's possible that the individual Matlab processes spawned by the parpool are run with the argument -singleCompThread. I've confirmed that if the parent Matlab process is run with -singleCompThread, the command maxNumCompThreads(n), where n > 1, throws an error because Matlab is running in single-threaded mode. So the result seems to be that (at least in 2014a) you can't increase the number of computational threads on the parallel pool workers.

Related to this, I can't seem to get the parent Matlab process to start more threads than there are cores, even though the computer itself has hyperthreading enabled. Again, it will happily run maxNumCompThreads(n), where n > # physical cores, but the fact that top shows CPU utilization of 50% suggests otherwise. So what is happening, or what am I misunderstanding?
Edit: to lay out my questions more explicitly:
Within a parfor loop, why doesn't setting maxNumCompThreads(n) with n > 1 seem to work? If it's because the worker process is started with -singleCompThread, why doesn't maxNumCompThreads() return an error like it does in a parent process started with -singleCompThread?
In the parent process, why doesn't using maxNumCompThreads(n), where n > # physical cores, do anything?
Note: I posted this previously on Matlab answers, but haven't received any feedback.
Edit2: It looks like the problem in (1) was an issue with the test code I was using.
That's quite a long question, but I think the straightforward answer is that yes, as I understand it, MATLAB workers are started with -singleCompThread.
First, a few quick tests to confirm our understanding:
> matlab.exe -singleCompThread
>> warning('off', 'MATLAB:maxNumCompThreads:Deprecated')
>> maxNumCompThreads
ans =
1
>> maxNumCompThreads(2)
Error using feature
MATLAB has computational multithreading disabled.
To enable multithreading please restart MATLAB without singleCompThread option.
Error in maxNumCompThreadsHelper (line 37)
Error in maxNumCompThreads (line 27)
lastn = maxNumCompThreadsHelper(varargin{:});
As indicated, when MATLAB is started with the -singleCompThread option, we cannot override it using maxNumCompThreads.
> matlab.exe
>> parpool(2); % local pool
>> spmd, n = maxNumCompThreads, end
Lab 1:
n =
1
Lab 2:
n =
1
We can see that each worker is by default limited to a single computation thread. This is a good thing, because we want to avoid over-subscription and unnecessary context switches, which occur when the number of threads trying to run exceeds the number of available physical/logical cores. So in theory, the best way to maximize CPU utilization is to start as many single-threaded workers as we have cores.
Now, by looking at the local worker processes running in the background, we see that each is launched as:
matlab.exe -dmlworker -noFigureWindows [...]
I believe the undocumented -dmlworker option does something similar to -singleCompThread, but it is probably a bit different. For one, I was able to override it using maxNumCompThreads(2) without it throwing an error like before.
Remember that even if a MATLAB session is running in single-threaded computation mode, that doesn't mean the computational thread is restricted to one CPU core (the thread can jump between cores as the OS scheduler sees fit). You'll have to set the affinity of the worker processes if you want to control that.
So I did some profiling using Intel VTune Amplifier. Basically I ran some linear algebra code, and performed hotspots analysis by attaching to the MATLAB process and filtering on the mkl.dll module (this is the Intel MKL library that MATLAB uses as an optimized BLAS/LAPACK implementation). Here are my results:
- Serial mode
I used the following code: eig(rand(500));
Starting MATLAB normally, the computation spawns 4 threads (the default automatic value, given that I have a quad-core Intel i7 CPU).
Starting MATLAB normally but calling maxNumCompThreads(1) before the computation, only 1 thread is used, as expected.
Starting MATLAB with the -singleCompThread option, again only 1 thread is used.
- Parallel mode (parpool)
I used the following code: parpool(2); spmd, eig(rand(500)); end. In both cases below, MATLAB is started normally.
When running code on the workers with the default settings, each worker is limited to one computation thread.
When I override the settings on the workers using maxNumCompThreads(2), each worker uses 2 threads.
[Screenshot of the VTune results omitted.]
Hope that answers your questions :)
I was wrong about maxNumCompThreads not working on parpool workers. It looks like the problem was that the code I was using:
parfor j = 1:2
  tic
  maxNumCompThreads(2);
  workersCompThreads(j) = maxNumCompThreads;
  i = 1;
  while toc < 200
    a = randn(10^i)*randn(10^i);
    i = i + 1;
  end
end
used so much memory by the time I checked CPU utilization that the bottleneck was I/O and the extra threads were already shut down. When I did the following:
parfor j = 1:2
  tic
  maxNumCompThreads(2);
  workersCompThreads(j) = maxNumCompThreads;
  i = 4;
  while toc < 200
    a = randn(10^i)*randn(10^i);
  end
end
The extra threads started and stayed running.
As for the second issue, I got confirmation from MathWorks that the parent Matlab process won't start more threads than the number of physical cores, even if you explicitly raise the limit beyond that. So in the documentation, the sentence:
"Currently, the maximum number of computational threads is equal to the number of computational cores on your machine."
should say:
"Currently, the maximum number of computational threads is equal to the number of physical cores on your machine."

Multi-threaded linear system solution in OpenBLAS

I have a code written in Fortran 95 that I build with the gfortran compiler. I am also using OpenMP, and I have to handle very big arrays. In my code I also have to solve a system of linear equations using the solver DGTSV from OpenBLAS. I want to parallelize this solver as well using OpenBLAS, which should be capable of that. But I have trouble with the syntax. Using the attached pseudo code, all 4 CPUs are used at almost 100%, but I am not sure whether each core solves the linear equations separately or whether they split the work and calculate in parallel.
The whole thing is compiled with gfortran -fopenmp a.f95 -lblas -o a.out
So my pseudo code looks like
program a
  implicit none
  integer, parameter :: N = 200
  real*8, dimension(N)   :: D  = 0.0
  real*8, dimension(N-1) :: DL = 0.0
  real*8, dimension(N-1) :: DU = 0.0
  real*8, dimension(N)   :: b  = 0.0
  integer :: info = 0
  integer :: numthread = 4
  ...
  !$OMP PARALLEL NUM_THREADS(numthread)
  ...
  !$OMP DO
  ...
  !$OMP END DO
  CALL DGTSV(N,1,DL,D,DU,b,N,info)
  !$OMP DO
  ...
  !$OMP END DO
  ...
  !$OMP END PARALLEL
end program a
What do I have to do to make the solver run in parallel, so that each core calculates a part of the solution?
Inside an OpenMP parallel region, all of the threads execute the same code (as in MPI), and the work is only split up when the threads reach a worksharing construct such as a loop, sections, or task.
In your example, the work inside the loops (OMP DO) is distributed among the available threads. After each loop there is an implicit barrier that synchronizes all the threads; then every thread executes the call to DGTSV, i.e. each of them redundantly performs the full solve. After the subroutine returns, the work of the next loop is split again.
@HristoIliev proposed using an OMP SINGLE construct. It restricts the enclosed piece of code to be executed by only one thread and forces all the other threads to wait for it (unless you specify nowait).
Nested parallelism, on the other hand, refers to the case where you declare a parallel region inside another parallel region. This also applies when you call an OpenMP-parallelized library from within a parallel region.
By default, OpenMP does not add parallelism for nested parallel regions; instead, only the thread that enters the inner parallel region executes it. This behavior can be changed by setting the environment variable OMP_NESTED to true.
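As a quick illustrative sketch (my addition, not from the original answer; it uses omp_set_nested, the API of this answer's era, later superseded by the max-active-levels controls), nesting can also be enabled from code:

program nested_demo
  use omp_lib
  call omp_set_nested(.true.)     ! same effect as OMP_NESTED=true
  !$omp parallel num_threads(2)
  !$omp parallel num_threads(4)   ! 2 outer x 4 inner = 8 threads when nesting is on
  print *, 'outer', omp_get_ancestor_thread_num(1), 'inner', omp_get_thread_num()
  !$omp end parallel
  !$omp end parallel
end program nested_demo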
The OMP SINGLE solution is far better than splitting the parallel region in two, as the resources are reused for the next loop:
!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP SINGLE
CALL DGTSV(...)
!$OMP END SINGLE
!$OMP DO
DO ...
END DO
!$OMP END PARALLEL
To illustrate the usage of OMP_NESTED, I'll show some results I obtained from an application that used FFTW (a Fast Fourier Transform implementation) configured to use OpenMP. The runs were performed on a 16-core, two-socket Intel Xeon E5 @ 2.46 GHz node.
The following graphs show the amount of time spent in the whole application; parallel regions appear where CPUs > 1, serialized regions where CPUs = 1, and synchronization regions where CPUs = 0.
The application is embarrassingly parallel, so in this particular case nesting is not worthwhile (FFTW does not scale that well).
This is the OMP_NESTED=false execution. Observe how the amount of parallelism is limited by the number of threads assigned to the external parallel region (ftdock).
This is the OMP_NESTED=true execution. In this case, parallelism can be increased beyond the number of threads assigned to the external parallel region. The maximum parallelism possible here is 16: either the 8 external threads each create a single peer to execute the internal parallel region, or 4 of them create 3 additional threads each (8x2 = 4x4 = 16).

Using OpenMP (libgomp) in an already multithreaded application

We are using OpenMP (libgomp) to speed up some calculations in a multithreaded Qt application. The parallel OpenMP sections are located within two different threads, though in fact they never execute in parallel. What we observe in this case is that 2N (where N = OMP_THREAD_LIMIT) omp threads are launched, apparently interfering with each other. The calculation time is very high, while the processor load is low. Setting OMP_WAIT_POLICY hardly has any effect.
We also tried moving all the omp sections to a single thread (this is not a good solution for us, though, from an architectural point of view). In this case, the overall calculation time does drop and the processor is fully loaded, but only if OMP_WAIT_POLICY is set to ACTIVE. When OMP_WAIT_POLICY == PASSIVE, the calculation remains slow and the processor is idle 50% of the time.
Oddly enough, when we use omp within a single thread, the first loop parallelized using omp (in a series of omp calculations) executes 10 times slower compared to the multithreaded case.
Upd: Our questions are:
a) Is there any way to reuse the OpenMP threads when omp is used in the context of different threads?
b) Why does executing with OMP_WAIT_POLICY == PASSIVE slow everything down? Does it take that long to wake the threads?
c) Is there any logical explanation for the phenomenon of the first parallel block being so slow (even when waiting in active mode)?
Upd2: Please note that the issue is probably specific to the GNU OpenMP implementation; icc doesn't exhibit it.
Try starting/stopping the OpenMP threads at runtime using omp_set_num_threads(1) and omp_set_num_threads(cpucount).
A call with (1) should stop all OpenMP worker threads, and a call with (cpucount) will restart them again.
So, at the start of the program, run omp_set_num_threads(1).
Just before an omp-parallelized region, you can start the omp threads, even with WAIT_POLICY=active, and they will not consume CPU before that point.
After the omp parallel region you can stop the threads again.
The omp_set_num_threads(cpucount) call itself is very slow, slower than waking threads with wait_policy=passive. This can be the reason for (c), if your libgomp starts its threads only at the first parallel region.
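A minimal sketch of that start/stop pattern (my addition, written in Fortran to match the rest of this page; the C API is analogous, and whether libgomp actually parks its workers at omp_set_num_threads(1) is implementation-specific, as noted above):

program stop_start
  use omp_lib
  integer :: ncpu

  ncpu = omp_get_num_procs()
  call omp_set_num_threads(1)      ! keep the worker pool minimal during serial phases

  ! ... serial setup work ...

  call omp_set_num_threads(ncpu)   ! grow the pool just before the hot region
  !$omp parallel
  ! ... parallel work ...
  !$omp end parallel

  call omp_set_num_threads(1)      ! shrink the pool again afterwards
end program stop_start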

Linux 2.6.31 Scheduler and Multithreaded Jobs

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs can scale to all 24 cores when nothing else is running on the machine. However, it seems that when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set to high nice values) only manage to get ~1800% CPU (in Linux notation). Meanwhile, about 500% of the CPU cycles (again, in Linux notation) sit idle. Can anyone explain this behavior and what I can do about it to get all 23 of the cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Could it be relevant that my 24-thread jobs are 32-bit while the other jobs I'm competing with are 64-bit?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.
It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here's a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two fewer threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of the available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try increasing the number of threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24-core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).
Do your threads communicate with each other?
Try manually binding every thread to a CPU, with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with a lot of related threads.
It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>

#define NUM_CORES 24

/* Spin forever so that each thread keeps one core fully busy. */
void* loop_forever(void* argument) {
    volatile int a = 0;   /* volatile so the busy loop is not optimized away */
    while (1) a++;
    return NULL;          /* never reached */
}

int main(void) {
    int i;
    pthread_t threads[NUM_CORES];
    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], NULL, loop_forever, NULL);
    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
