I recently discovered the concept of thread pools. As far as I understand, GCC, ICC, and MSVC all use thread pools for OpenMP. I'm curious to know what happens when I change the number of threads. For example, let's assume the default number of threads is eight. I create a team of eight threads, then in a later section use four threads, and then go back to eight.
#pragma omp parallel for
for(int i=0; i<n; i++) { /* ... */ }

#pragma omp parallel for num_threads(4)
for(int i=0; i<n; i++) { /* ... */ }

#pragma omp parallel for
for(int i=0; i<n; i++) { /* ... */ }
This is something I actually do now, because part of my code gets worse results with hyper-threading, so for that part of the code only I lower the number of threads to the number of physical cores. What if I did the opposite (four threads, then eight, then four)?
Does the thread pool have to be recreated each time I change the number of threads? If not, does adding or removing threads cause any significant overhead?
What's the overhead for the thread pool, i.e. what fraction of the work per thread goes to the pool?
It might be a bit late to answer this question by now, but here goes.
When you start with 8 threads from the beginning, 7 worker threads are created; together with your main thread they form a team of 8. The first loop in your sample code is executed by this team, so the thread pool holds 8 threads. After they are done with this region, they go to sleep until they are woken up again.
Now, when you reach the second parallel region, which asks for 4 threads, only 3 threads from your thread pool are woken up (3 workers plus your current main thread), and the rest stay in sleep mode. So four of the threads are sleeping.
Then, just as in the first parallel region, all eight threads cooperate to execute the third one.
On the other hand, if you start with 4 threads and the second parallel region asks for 8 threads, the OpenMP library reacts to this change by creating 4 extra threads to meet your request. Once created, threads are usually not removed from the pool until the end of the program's life, on the assumption that you might need them again. This is the general approach most OpenMP libraries follow: creating new threads is expensive, so they avoid and postpone it as much as they can.
Hope this helps you and future visitors here.
Related
int i;
int l_thread = 4;
int max = 1; // 4
#pragma omp parallel for num_threads(l_thread) private(i)
for (i = 0; i < max; i++)
    ; // some operation
In this case, 4 threads will be created by OpenMP. I want to know: since the for loop runs only once (in this case), will it be executed by only one of the 4 threads, with the other threads idle? Yet I am seeing nearly the same CPU usage on all 4 threads. What might be the reason? Shouldn't only one thread show high usage and the others low?
Your take on this is correct. If max == 1 and you have more than one thread, thread 0 will execute the single loop iteration and the other threads will wait at the end of the parallel region for thread 0 to catch up. The reason you're seeing the n-1 other threads causing load on the system is that they spin-wait at the end of the region: spinning is much faster than sleeping when the threads have to wake up and notice that the parallel work (or in your case: not so parallel work :-)) is completed.
You can change this behavior via the OMP_WAIT_POLICY environment variable. See the OpenMP specification for a full description.
I am new to multithreading. I read code like the following:
#include <iostream>
#include <string>
#include <thread>
#include <vector>
using namespace std;

void hello_world(string s)
{
    cout << s << endl;
}

int main()
{
    const int n = 1000;
    vector<thread> threads;
    for (int i = 0; i < n; i++) {
        threads.push_back(thread(hello_world, "test"));
    }
    for (size_t i = 0; i < threads.size(); i++) {
        threads[i].join();
    }
    return 0;
}
I believe the program above uses 1000 threads to speed things up. However, this confuses me, because typing the command lscpu returns:
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
where I believe the number of threads is 12. Based on the description above, I have two questions:
(1) How many threads can I call in one program?
(2) Following question 1, I believe the number of threads we can create is limited. What information should I base the decision on, so that other programs still run well?
How many threads can I call in one program?
You didn't specify a programming language or operating system, but in most cases there's no hard limit. The most important limit is how much memory is available for the thread stacks; for example, with a typical default stack size of 8 MB per thread on Linux, 1000 threads reserve about 8 GB of address space. It's harder to quantify how many threads it takes before they all spend more time competing with each other for resources than doing any actual work.
the command lscpu returns: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 1
That information tells you how many threads the system can execute simultaneously. If your program has more than twelve threads ready to run, then at most twelve of them will actually be running at any given point in time, while the rest await their turn.
But note: in order for twelve threads to be "ready to run," they must all be doing tasks that do not interfere with each other. That mostly means doing computations on values that are already in memory. In your example program, all of the threads want to write to the same output stream. Assuming that the output stream is "thread safe," that is something only one thread can do at a time, no matter how many cores your computer has.
how many thread can I call to make sure other programs run well?
That's hard to answer unless you know what all of the other programs need to do. And what does "run well" mean, anyway? If you want to be able to use office tools or programming tools to get work done while a large, compute-intensive program runs "in the background," then you'll pretty much have to figure out for yourself just how much work the background program can do while still allowing the "foreground" tools to respond to your typing.
I want to use OpenMP to achieve this effect: fix the number of threads, and whenever a thread is idle, dispatch a task to it; otherwise wait for one to become idle. The following is my test code:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

void func(void) {
    #pragma omp parallel for
    for (int i = 0; i < 3; i++)
    {
        sleep(30);
        printf("%d\n", omp_get_thread_num());
    }
}

int main(void) {
    omp_set_nested(1);
    omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
    {
        printf("%d\n", omp_get_thread_num());
        func();
    }
    return 0;
}
Actually, my machine contains 24 cores, so
omp_set_num_threads(omp_get_num_procs())
will launch 24 threads at the beginning. Then main's for-loop will occupy 8 threads, and within each of them func is called, so 2 additional threads per outer thread ought to be used. By my calculation, 24 threads should be enough. But in the actual run, a total of 208 threads are generated.
So my questions are as follows:
(1) Why are so many threads created when 24 seems enough?
(2) Is it possible to fix the number of threads (e.g., to the number of cores) and dispatch a task whenever a thread is idle?
1) That's just how parallel for is defined: a parallel directive immediately followed by a loop directive. Thread creation is therefore not limited by worksharing granularity.
Edit: To clarify OpenMP will:
Create an implementation-defined number of threads, unless you specify otherwise
Schedule the loop iterations among this team of threads; with few iterations and a large team, you end up with threads in the team that have no work.
If you have nested parallelism, this will repeat: A single thread encounters the new nested parallel construct and will create a whole new team.
So in your case, 8 of the outer threads encounter the inner parallel construct and spawn a team of 24 threads each, while the other 16 threads of the outer loop don't. That gives 8 * 24 + 16 = 208 threads in total.
2) Yes, incidentally, this concept is called task in OpenMP. Here is a good introduction.
In OpenMP, once you ask for a particular number of threads, the runtime system gives them to your parallel region if it can, and those threads cannot be used for other work while the parallel region is active. The runtime system cannot guess that you are not going to use the threads you requested.
So you can either ask for fewer threads where fewer are needed, or use some other parallelization technique that can dynamically manage the number of active threads. For example, with OpenMP, if you ask for 8 threads in the outer parallel region and 3 threads in the inner regions, you may end up with 24 threads (or fewer, if threads can be re-used, e.g. when the parallel regions are not running simultaneously).
-- Andrey
You should try
#pragma omp task
Besides, in my opinion, you should avoid nested OpenMP threads.
Similar questions have been asked before, but I couldn't find an answer that dealt with the low-level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
for (int i = 0; i < N; i++) { // N = 160,000,000,000
    physicalModel(input[i]); // linear function, just additions and subtractions
}

function physicalModel(x) {
    return A*x + B*x + C*x + D*x + ...; // an oversimplification, but you get the idea: a linear function
}
Given the nature of this problem, am I correct in thinking that a single thread on a single core, or 1 thread per core, would be the fastest way to solve this? And that using extra threads beyond the number of cores would not help here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently; they just share processor time, which can be beneficial when one thread is waiting on, say, a socket response while other threads are processing requests. In the example I posted above, I figure the CPU could go to 100% on one thread, so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my above assumption is correct, what's the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc. But that's based on my initial premise, which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables, no disk/network access, no serving of many remote users) and just does some arithmetic on its input (on a single input point, or on several neighbouring points as in a stencil kernel):
for(int i=0; i < 160_000_000_000; i++){
//Linear function, just additions and subtractions
output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1] .. */);
}
Then you have to check how efficiently this function works on a single CPU core. Can you (or your compiler) unroll the loop and convert it to SIMD parallelism?
for(int i=0+1; i < 160_000_000_000-1; i++){
output[i] = A*input[i-1]+ B*input[i] + C*input[i+1];
}
// unrolled 4 times; if input is float, compiler may load 4 floats
// into single SSE2 reg and do 4 operations from one asm command
for(int i=0+4; i < 160_000_000_000-4; i+=4){
output[i+0] = A*input[i-1]+ B*input[i+0] + C*input[i+1];
output[i+1] = A*input[i+0]+ B*input[i+1] + C*input[i+2];
output[i+2] = A*input[i+1]+ B*input[i+2] + C*input[i+3];
output[i+3] = A*input[i+2]+ B*input[i+3] + C*input[i+4];
}
When your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP, MPI, or another method). Under our assumptions, no thread blocks on an external resource such as the disk or network, so every thread you start can run at any time. Then you should start no more than 1 thread per CPU core: if you start more, each thread runs for some amount of time before being displaced by another, giving worse performance than 1 thread per core.
In C/C++, adding OpenMP thread-level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (plus the -fopenmp/-openmp compiler option). The compiler and library will split your for loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..N*3/4], [N*3/4..N] for 4 threads, or another split scheme; you can give hints with the schedule clause):
#pragma omp parallel for
for(int i=0+1; i < 160_000_000_000-1; i++){
    output[i] = physicalModel(input[i]);
}
The thread count is determined at runtime by the OpenMP library (gomp in gcc - https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default, one thread per logical CPU core is used. You can change the number of threads with the OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
On CPUs with hardware multithreading (Intel HT and other SMT variants: e.g. 4 physical cores presented as 8 "logical" ones), in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), since some resources (such as FPU units) are shared between logical cores. Do some experiments if your code will be run many times.
If your threads (your model) are limited by memory speed (memory-bound: they load lots of data from memory and do only a very simple operation on every float), you may want to run fewer threads than the CPU core count, as additional threads will not get additional memory bandwidth.
If your threads do a lot of computation for every element loaded from memory (compute-bound), use better SIMD and more threads. When you have very good, full-width SIMD (full-width AVX), you will see no speedup from HT, as the full-width AVX unit is shared between logical cores (but every physical core has one, so use it); in this case the CPU frequency will also drop, as the full-width AVX unit runs very hot under full load.
Illustration of memory and compute limited applications: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png
We are using OpenMP (libgomp) to speed up some calculations in a multithreaded Qt application. The parallel OpenMP sections are located in two different threads, though in fact they never execute in parallel. What we observe is that 2N (where N = OMP_THREAD_LIMIT) OMP threads are launched, apparently interfering with each other. The calculation time is very high while the processor load is low. Setting OMP_WAIT_POLICY hardly has any effect.
We also tried moving all the OMP sections to a single thread (though this is not a good solution for us from an architectural point of view). In this case, the overall calculation time does drop and the processor is fully loaded, but only if OMP_WAIT_POLICY is set to ACTIVE. When OMP_WAIT_POLICY == PASSIVE, the calculation time remains high and the processor is idle 50% of the time.
Oddly enough, when we use OMP within a single thread, the first loop parallelized with OMP (in a series of OMP calculations) executes 10 times slower than in the multithreaded case.
Upd: Our questions are:
a) Is there any way to reuse the OpenMP threads when using OMP in the context of different threads?
b) Why does executing with OMP_WAIT_POLICY == PASSIVE slow everything down? Does it take that long to wake the threads?
c) Is there any logical explanation for the first parallel block being so slow (even when waiting in active mode)?
Upd2: Please note that the issue is probably related to the GNU OMP implementation; icc doesn't exhibit it.
Try starting/stopping the OpenMP threads at runtime using omp_set_num_threads(1) and omp_set_num_threads(cpucount).
The call with (1) should stop all OpenMP worker threads, and the call with (cpucount) will restart them again.
So, at program start, run omp_set_num_threads(1).
Just before an OMP-parallelized region, start the OMP threads; even with WAIT_POLICY=active they will not consume CPU before this point.
After the OMP parallel region you can stop the threads again.
The omp_set_num_threads(cpucount) call is very slow, slower than waking threads with wait_policy=passive. This can be the reason for (c), if your libgomp starts its threads only at the first parallel region.