Using OpenMP (libgomp) in an already multithreaded application

Using OpenMP (libgomp) in an already multithreaded application - multithreading

We are using OpenMP (libgomp) in order to speed up some calculations in a multithreaded Qt application. The parallel OpenMP sections are located within two different threads, though in fact they never execute in parallel. What we observe in this case is that 2N (where N = OMP_THREAD_LIMIT) omp threads are launched, apparently interfering each with the other. The calculation time is very high, while the processor load is low. Setting OMP_WAIT_POLICY hardly has any effect.
We also tried moving all the omp sections to a single thread (this is not a good solution for us, though, from an architectural point of view). In this case, the overall calculation time does drop and the processor is fully loaded, but only if OMP_WAIT_POLICY is set to ACTIVE. When OMP_WAIT_POLICY == PASSIVE, the calculation time remains low and the processor is idle 50% of time.
Odd enough, when we use omp within a single thread, the first loop parallelized using omp (in a series of omp calulations) executes 10 times slower compared to the multithread case.
Upd: Our questions are:
a) is there any way to reuse the openmp threads when using omp in the context of different threads.
b) Why executing with OMP_WAIT_POLICY == PASSIVE slows everything. Does it take so long to wake the threads?
c) Is there any logical explanation for the phenomenon of the first parallel block being so slow (even when waiting in active mode)
Upd2: Please mind that the issue is probably related to GNU OMP implementation. icc doesn't have it.

Try to start/stop openmp threads in runtime using omp_set_num_threads(1) and omp_set_num_threads(cpucount)
This call with (1) should stop all openmp worker threads, and call with (cpu_num) will restart them again.
So, at start of programm, run omp_set_num_threads(1).
Before omp-parallelized region, you can start omp threads even with WAIT_POLICY=active, and they will not consume cpu before this point.
After omp parallel region you can stop threads again.
The omp_set_num_threads(cpucount) call is very slow, slower than waking threads with wait_policy=passive. This can be the reason for (c) - if your libgomp starts threads only at first parallel region.

Related

Performance of multi-threading exceeding cores

If I have a process that starts X amount of threads, will there ever be a performance gain having X higher than the number of CPU cores (assuming all the threads are working synchronously without async calls to storage/network)?
E.G. If I have a two cores CPU, will I just slow down the application starting 3+ constantly working threads?

It really depends on what your code does. it is too broad.
Having more threads than cores might speed up the program for example if some of the threads sleep or try to block on a lock. in this case, the OS scheduler can wake different thread and that thread will work while the other thread is sleeping.
Having more threads than the number of cores may also decrease the program execution time because the OS scheduler has to do more work to switch between the threads execution and that scheduling might be a heavy operation.
As always, benchmarking your application with different amount of threads is the best way to achieve maximum performance. there are also algorithms (like Hill-Climbing) which may help the application fine tune the best number of threads on runtime.

It is possible that such a thing happens.
Both Intel and AMD currently implement forms of SMT in their CPUs. This means that, in general, one single thread of execution may not be able to exploit 100% of the computing resources.
This happens because modern CPUs execute instructions in multiple pipelined steps, so that the clock frequency can be increased (less stuff gets done in every cycle, so you can do more cycles). The downside of this approach is that, if you have two consecutive instructions A and B, with the latter depending on the result of the former, you may have to wait some clock cycles without doing anything, just waiting for instruction A to complete. So, they came up with SMT, which allows the CPU to interleave instructions from two different threads/processes on the same pipeline, in order to fill such gaps.
Note: it is not exactly like this, CPUs don't just wait. They try to guess the result of the first operation and execute the second assuming that result. If their guess is wrong, they cancel the pending instructions and start over. Also, they have some feedback circuits that allow tighter execution of interdependent instructions. And nowadays branch predictors are surprisingly good. Things get better for the pipeline if you can just fill gaps with instructions from some other process, rather than going with a guess, but this potentially halves the amount of cache each executing thread can use.

It makes sense to run more threads if your threads make read/write/send/recv syscalls or similar, or sleep on locks, etc.
If your threads are pure computation threads, adding more of them will slow down system because of context switches.
If you still need more threads by design, you might want to look into the cooperative multitasking. Both Windows and Linux have API for that and that will work faster than the context switches. In Windows it called fibers:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx
In Linux it is a set of functions make/get/swapcontext():
http://man7.org/linux/man-pages/man3/makecontext.3.html

This question: Optimal number of threads per core might help you.
In the thread I wrote an answer describing a scenario when having higher number of threads than the available number of cores boosts performance.

Who schedules threads?

I have a question about scheduling threads. on the one hand, I learned that threads are scheduled and treated as processes in Linux, meaning they get scheduled like any other process using the conventional methods. (for example, the Completely Fair Scheduler in linux)
On the other hand, I also know that the CPU might also switch between threads using methods like Switch on Event or Fine-grain. For example, on cache miss event the CPU switches a thread. but what if the scheduler doesn't want to switch the thread? how do they agree on one action?
I'm really confused between the two: who schedules a thread? the OS or the CPU?
thanks alot :)

The answer is both.
What happens is really fairly simple: on a CPU that supports multiple threads per core (e.g., an Intel with Hyperthreading) the CPU appears to the OS as having some number of virtual cores. For example, an Intel i7 has 4 actual cores, but looks to the OS like 8 cores.
The OS schedules 8 threads onto those 8 (virtual) cores. When it's time to do a task switch, the OS's scheduler looks through the threads and finds the 8 that are...the most eligible to run (taking into account things like thread priority, time since they last ran, etc.)
The CPU has only 4 real cores, but those cores support executing multiple instructions simultaneously (and out of order, in the absence of dependencies). Incoming instructions get decoded and thrown into a "pool". Each clock cycle, the CPU tries to find some instructions in that pool that don't depend on the results of some previous instruction.
With multiple threads per core, what happens is basically that each actual core has two input streams to put into the "pool" of instructions it might be able to execute in a given clock cycle. Each clock cycle it still looks for instructions from that pool that don't depend on the results of previous instructions that haven't finished executing yet. If it finds some, it puts them into the execution units and executes them. The only major change is that each instruction now needs some sort of tag attached to designate which "virtual core" will be used to store results into--that is, each of the two threads has (for example) its own set of registers, and instructions from each thread have to write to the registers for that virtual core.
It is possible, however, for a CPU to support some degree of thread priority so that (for example) if the pool of available instructions includes some instructions from both input threads (or all N input threads, if there are more than two) it will prefer to choose instructions from one thread over instructions from another thread in any given clock cycle. This can be absolute, so it runs thread A as fast as possible, and thread B only with cycles A can't use, or it can be a "milder" preference, such as attempting to maintain a 2:1 ratio of instructions executed (or, of course, essentially any other ratio preferred).
Of course, there are other ways of setting up priorities as well (such as partitioning execution resources), but the general idea remains the same.
An OS that's aware of shared cores like this can also modify its scheduling to suit, such as scheduling only one thread on a pair of cores if that thread has higher priority.

The OS handles scheduling and dispatching of ready threads, (those that require CPU), onto cores, managing CPU execution in a similar fashion as it manages other resources. A cache-miss is no reason to swap out a thread. A page-fault, where the desired page is not loaded into RAM at all, may cause a thread to be blocked until the page gets loaded from disk. The memory-management hardware does that by generating a hardware interrupt to an OS driver that handles the page-fault.

Thread pools with OpenMP: overhead and changing the number of threads

I recently discovered the concept of thread pools. As far as I understand GCC, ICC, and MSVC all use thread pools with OpenMP. I'm curious to know what happens when I change the number of threads? For example let's assume the default number of threads is eight. I create a team of eight threads and then in a later section I do four threads and then I go back to eight.
#pragma omp parallel for
for(int i=0; i<n; i++)
#pragma omp parallel for num_threads(4)
for(int i=0; i<n; i++)
#pragma omp parallel for
for(int i=0; i<n; i++)
This is something I actually do now because part of my code gets worse results with hyper-threading so I lower the number of thread to the number of physical cores (for that part of the code only). What if I did the opposite (4 thread, then eight, then 4)?
Does the thread pool have to be recreated each time I change the number of threads? If not, does adding or removing threads cause any significant overhead?
What's the overhead for the thread pool, i.e. what fraction of the work per thread goes to the pool?

It might be late by now to answer this question. However, I am going to do so.
When you start with 8 threads from the beginning, a total of 7 threads will be created, then, including your main process, you have a team of 8. So, first loop in your sample code would be executed using this team. Therefore, the thread pool has 8 threads. After they are done with this region, they go into sleep until woken up.
Now, when you reach second parallel region with 4 threads, only 3 threads from your thread pool is woken up (3 threads + your current main thread) and the rest of the threads are still in sleep mode. So, four of the threads are sleeping.
And then, similar to first parallel region, all threads will incorporate with each other to do the third parallel region.
On the other hand, if you start with 4 threads and the second parallel region asks for 8 threads, then the OpenMP library will react to this change and create 4 extra threads to meet what you asked for (8 threads). Usually created threads are not thrown out of the pool until the end of program life. Hopefully, you might need it in future. It is a general approach that most OpenMP libraries follow. The reason behind this idea is the fact that creating new threads is an expensive job and that's why they try to avoid it and postpone it as much as they can.
Hope this helps you and future commuters in here.

Openmp thread divergence?

The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big performance hit.
I was wondering, is there a similar penalty for doing this in openmp? For example, say I have a 6 core processor and a program with 6 threads. If I have a conditional that makes 3 threads perform a certain task, and then have the other three threads perform a completely different task, will there be a big performance hit? I guess in essence it's sort of using openmp to do MIMD.
Basically, I'm writing a program with openmp and CUDA. I want two threads to run a CUDA kernel while the other left over threads run C code. Thanks.

No, there is no performance hit for diverging threads using OpenMP. It is a problem in CUDA because of the way instructions are broadcast simultaneously to a set of cores. When an OpenMP thread targets a CPU core, each CPU core has its own independent set of instructions to follow, and it runs just like any other single-threaded program would.
You may see some of your cores being underutilized if you have synchronization barriers following thread divergence, because that would force faster threads to wait for the slower threads to catch up.

When speaking about CPU parallelism, there's no intrinsic performance hit from using a certain threading design pattern. Not at the theoretical level at least.
The only problem I see is that since the threads are doing different things which may have varying completion times, some of the threads may sit idle after finishing their work, waiting for the others to finish a longer task.

The term thread divergence in CUDA refers to the situation when not all threads of a bock evaluate a conditional with the same outcome. Such threads are said to diverge. If diverging threads are in the same warp then such threads may perform work serially which leads to performance loss.
I am not sure that OpenMP has the same issue, though. When different threads perform different work then load balancing may be used by the runtime perhaps, but it doesn't lead to work serialization necessarily.

there is no this kind of problem in openmp because every openmp thread has its own PC.

Linux 2.6.31 Scheduler and Multithreaded Jobs

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems like when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set for high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, using Linux notation) are idle. Can anyone explain this behavior and what I can do about it to get all of the 23 cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Is it at all possible that the fact that my 24-core jobs are 32-bit and the other jobs I'm competing w/ are 64-bit is relevant?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.

It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here's a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two less threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.

Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).

Do your threads communicates with each other?
Try to manually bind every thread to cpu, with sched_setaffinity or pthread_setaffinity_np. Scheduler can be rather dumb when working with a lot of relating threads.

It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.

Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>
#define NUM_CORES 24
void* loop_forever(void* argument) {
int a;
while(1) a++;
}
void main() {
int i;
pthread_t threads[NUM_CORES];
for (i = 0; i < NUM_CORES; i++)
pthread_create(&threads[i], 0, loop_forever, 0);
for (i = 0; i < NUM_CORES; i++)
pthread_join(threads[i], 0);
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string