I'm working on an application that uses both coarse- and fine-grained multithreading. That is, we manage the scheduling of large work units on a pool of threads manually, and then within those work units certain functions use OpenMP for finer-grained multithreading.
We have realized gains by selectively using OpenMP in our costliest loops, but are concerned about creating contention for the OpenMP worker pool as we add OpenMP blocks to cheaper loops. Is there a way to signal to OpenMP that a block of code should use the pool if it is available, and if not it should process the loop serially?
You can use omp_set_num_threads(int) to set the number of threads in a pool. The runtime will then try to create a pool of that many threads and schedule work onto them; if it cannot create the full pool, it will create as many threads as possible and run the rest of the work serially.
You may be able to do what you want by clever use of omp_get_num_threads, omp_set_num_threads, and the if and num_threads clauses on parallel directives. OpenMP 3.0 also provides tasks, which might be useful.
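For instance, the if clause makes a parallel region conditional at runtime: when the condition is false, the loop runs serially on the calling thread. A minimal sketch, where the hypothetical pool_is_idle flag stands in for whatever application-level check decides the pool is worth contending for:

#include <omp.h>
#include <cstdio>

int main() {
    enum { N = 1000000 };
    static double a[N];
    int pool_is_idle = 1; // hypothetical application-level check

    // When the if() condition is false, the region executes serially
    // on the calling thread and no OpenMP workers are requested.
    #pragma omp parallel for if(pool_is_idle)
    for (int i = 0; i < N; ++i)
        a[i] = 2.0 * i;

    std::printf("%f\n", a[N - 1]);
    return 0;
}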
At first, I create four threads and each of them will call a GPU function. However, within each of the four, I also want to create two threads: one to read data from the disk and the other to do the computation. I am not sure if I can create a nested thread in C++. I suspect this is not clean code. Is there another way to solve the problem?
In general it should be no problem to create a new thread from a running thread.
As you assume, it's not the best solution, because creating/destroying threads often isn't cheap, and the more threads you have, the more context switches you get, which is (or can be) a performance penalty as well.
So you could create a thread pool with a given number of threads and let the pool's threads work on reading data from disk and doing the computations. You would avoid massive creation and destruction of threads.
If you also frequently create/destroy the threads that call the GPU functions, you could create two thread pools: one for the threads calling the GPU functions and one for reading from disk and doing computations.
You could use std::async and do away with thread management entirely. Or use a hybrid approach where you have the 4 core threads, which I assume will never die, and then in those functions where you wish to perform more asynchronous work you can use std::async.
https://solarianprogrammer.com/2012/10/17/cpp-11-async-tutorial/
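As a rough sketch of that hybrid approach (read_chunk and compute are made-up stand-ins for the disk read and the computation, not names from the question):

#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for the asker's disk read and computation.
std::vector<double> read_chunk(int id) {
    return std::vector<double>(1000, static_cast<double>(id));
}

double compute(const std::vector<double>& data) {
    return std::accumulate(data.begin(), data.end(), 0.0);
}

double process(int id) {
    // Launch the read asynchronously; std::launch::async requests a
    // separate thread instead of deferred execution on .get().
    auto pending = std::async(std::launch::async, read_chunk, id);
    // The calling (core) thread could do other work here and blocks
    // only when the data is actually needed.
    return compute(pending.get());
}

int main() {
    std::printf("%f\n", process(3));
}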
It's not clear whether async tasks are using thread pools. If you want to ensure high performance, which you probably care about since you're using GPUs, you should use a thread pool.
http://roar11.com/2016/01/a-platform-independent-thread-pool-using-c14/
I'm working on a Delphi application and I have created two threads to sync with different databases, one to read and the other to write. I would like to know if Delphi actually uses the full potential of each core (running on an i5 with 4 cores, for example) or if I need to write specific code to distribute the threads across the cores.
I have no idea how to find this out.
There's nothing you need to do. The operating system schedules ready-to-run threads on available cores.
There is nothing to do. The OS will choose the best place to run each of your threads taking into account a large number of factors completely beyond your control. The OS manages your threads in conjunction with all other threads in all other processes on the system.
Don't forget that if your threads aren't particularly busy, there will be absolutely no need to run them on different cores.
Sometimes moving code to a separate core can introduce unexpected inefficiencies. Remember that CPUs have high-speed memory caches; if certain data is not available in the cache of one core, moving the code to that core can incur relatively slow RAM accesses.
The point I'm trying to make here is that trying to second-guess all these scenarios and permutations is premature optimisation. Rather let the OS do the work for you. You have other things you should focus on instead, as indicated below.
However, that said, any interaction between your threads can significantly affect the OS's ability to run them on separate cores. E.g.
At one extreme: if each of your threads does a lot of work through a shared lock (perhaps the reader thread places data in a shared location that the writer consumes, so a lock is used to avoid race conditions), then it's likely that both threads will run on the same core.
The best case scenario would be when there is zero interaction between the threads. In this case the OS can easily run the threads on separate cores.
One thing to be aware of is that the threads can interact even if you didn't explicitly code anything to do so. The default memory manager is shared between all threads, so if you do a lot of dynamic memory allocation in each thread, you can experience contention that limits scalability across large numbers of cores.
So the important thing for you to focus on is getting your design "correct":
Ensure a "clean" separation of concerns.
Eliminate unnecessary interaction between threads.
Ensure whatever interaction is needed uses the most appropriate technique for your requirements.
Get the above right, and the OS will schedule your threads as efficiently as it can.
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big performance hit.
I was wondering, is there a similar penalty for doing this in openmp? For example, say I have a 6 core processor and a program with 6 threads. If I have a conditional that makes 3 threads perform a certain task, and then have the other three threads perform a completely different task, will there be a big performance hit? I guess in essence it's sort of using openmp to do MIMD.
Basically, I'm writing a program with openmp and CUDA. I want two threads to run a CUDA kernel while the other left over threads run C code. Thanks.
No, there is no performance hit for diverging threads using OpenMP. It is a problem in CUDA because of the way instructions are broadcast simultaneously to a set of cores. When an OpenMP thread targets a CPU core, each CPU core has its own independent set of instructions to follow, and it runs just like any other single-threaded program would.
You may see some of your cores being underutilized if you have synchronization barriers following thread divergence, because that would force faster threads to wait for the slower threads to catch up.
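To make the contrast concrete, here is a small OpenMP sketch of the MIMD-style split from the question; gpu_work and cpu_work are placeholders (no real CUDA calls), and each thread simply follows its own branch at full speed:

#include <omp.h>
#include <cstdio>

void gpu_work(int tid) { std::printf("thread %d: GPU path\n", tid); }
void cpu_work(int tid) { std::printf("thread %d: CPU path\n", tid); }

int main() {
    #pragma omp parallel num_threads(6)
    {
        int tid = omp_get_thread_num();
        if (tid < 2)
            gpu_work(tid);   // two threads drive the GPU
        else
            cpu_work(tid);   // the rest run plain CPU code
        // The implicit barrier at the end of the parallel region is
        // where faster threads wait, as noted above.
    }
}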
When speaking about CPU parallelism, there's no intrinsic performance hit from using a certain threading design pattern. Not at the theoretical level at least.
The only problem I see is that since the threads are doing different things which may have varying completion times, some of the threads may sit idle after finishing their work, waiting for the others to finish a longer task.
The term thread divergence in CUDA refers to the situation when not all threads of a block evaluate a conditional with the same outcome. Such threads are said to diverge. If diverging threads are in the same warp, they may have to perform their work serially, which leads to a performance loss.
I am not sure that OpenMP has the same issue, though. When different threads perform different work, the runtime may have to do some load balancing, but that does not necessarily lead to serialization of the work.
There is no such problem in OpenMP, because every OpenMP thread has its own program counter.
I'm getting started with Boost for multi-threading in order to port my program to Windows (from pthreads on Linux). Is anyone familiar with it? Any suggestion on which pattern I should use?
Here are the requirements:
I have many threads most of the time running the same thing with different parameters,
All threads share a memory location called "critical memory" (an array)
Synchronization has to be done with a "barrier" at certain iterations
Requires the highest parallelization possible, i.e. good scheduling with the same priority for all threads (currently I let the CPU do the job, but I found out that Boost has a threadpool with thread.schedule(); not sure if I should use it)
For pthreads, everything is a function, so I'm not sure if I should convert it to objects; what's the advantage then? I'm a little bit confused after reading this tutorial: http://antonym.org/2009/05/threading-with-boost---part-i-creating-threads.html ... so many options to use...
Thanks in advance
Porting should be quite straightforward:
I have many threads most of the time running the same thing with different parameters,
Create the required number of threads with a functor that binds your different parameters, like:
boost::thread thr1(boost::bind(your_thread_func, arg1, arg2));
All threads share a memory location called "critical memory" (an array)
Nothing special here; just use boost::mutex to synchronize access (or another mutex type if you have special requirements).
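A minimal sketch, assuming the shared array is guarded by a single mutex (all names are illustrative):

#include <boost/thread/mutex.hpp>

double critical_memory[1024];  // the shared "critical memory"
boost::mutex critical_mutex;   // guards every access to it

void write_slot(int i, double v) {
    boost::mutex::scoped_lock lock(critical_mutex); // locks here
    critical_memory[i] = v;
}                                                   // unlocks as lock leaves scope

int main() { write_slot(0, 3.14); }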
Synchronization has to be done with a "barrier" at certain iterations
Use boost::barrier: http://www.boost.org/doc/libs/1_45_0/doc/html/thread/synchronization.html#thread.synchronization.barriers
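Roughly like this (the thread count and iteration count are made up for the sketch):

#include <boost/thread.hpp>
#include <boost/thread/barrier.hpp>
#include <boost/bind.hpp>
#include <cstdio>

const int N = 4;                      // illustrative thread count
boost::barrier iteration_barrier(N);  // all N threads must arrive

void worker(int id) {
    for (int iter = 0; iter < 3; ++iter) {
        std::printf("thread %d finished iteration %d\n", id, iter);
        iteration_barrier.wait();     // blocks until every thread arrives
    }
}

int main() {
    boost::thread_group group;
    for (int i = 0; i < N; ++i)
        group.create_thread(boost::bind(worker, i));
    group.join_all();
}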
Requires the highest parallelization possible, i.e. good scheduling with the same priority for all threads (currently I let the CPU do the job, but I found out that Boost has a threadpool with thread.schedule(); not sure if I should use it)
CPU? You probably meant the OS scheduler. It's the simplest possible solution and in most cases a satisfactory one. threadpool is not part of Boost, and to be honest I'm not familiar with it. Boost.Thread doesn't have a scheduler.
I don't know anything about your task and its parallelization potential, so I'll suppose it can be parallelized onto more threads than you have cores. In theory, to get the highest performance you need to distribute your work smoothly among a number of threads equal to the number of your cores (including virtual ones). That's not the easiest task, and you can use ready-made solutions, e.g. Intel Threading Building Blocks (GPL license) or even Boost.Asio. Although its main purpose is network communication, Asio has its own dispatcher, and you can use it as a thread pool: just create an optimal number of threads (the number of cores?) and a boost::asio::io_service object, and run it from all the threads. Post work to the thread pool with io_service::post().
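A sketch of that idea using the classic io_service API (newer Asio versions rename io_service to io_context; work_item is a placeholder task):

#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <cstdio>

void work_item(int n) { std::printf("task %d\n", n); } // placeholder work

// Wrapper avoids ambiguity from io_service::run's overloads when binding.
void run_service(boost::asio::io_service* io) { io->run(); }

int main() {
    boost::asio::io_service io;

    // Queue the work up front; run() returns once the queue drains.
    for (int n = 0; n < 8; ++n)
        io.post(boost::bind(work_item, n));

    // One thread per core, all pulling tasks from the same io_service.
    unsigned cores = boost::thread::hardware_concurrency();
    if (cores == 0) cores = 2; // hardware_concurrency may report 0
    boost::thread_group pool;
    for (unsigned i = 0; i < cores; ++i)
        pool.create_thread(boost::bind(run_service, &io));

    pool.join_all();
}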
In my opinion, using the Win32 port of pthreads is a more straightforward way to accomplish such tasks.
EDITED:
Last month I converted a year-old project's source code from Boost.Thread to pthreads-w32. Boost.Thread provides lots of good things, but if you are working with old-fashioned thread routines, Boost.Thread can be too much hassle.
Can someone list some comparison points between Thread Spawning vs Thread Pooling, which one is better? Please consider the .NET framework as a reference implementation that supports both.
Thread pool threads are much cheaper than a regular Thread; they pool the system resources required for threads. But they have a number of limitations that may make them unfit:
You cannot abort a threadpool thread
There is no easy way to detect that a threadpool thread completed; there is no Thread.Join()
There is no easy way to marshal exceptions from a threadpool thread
You cannot display any kind of UI on a threadpool thread beyond a message box
A threadpool thread should not run longer than a few seconds
A threadpool thread should not block for a long time
The latter two constraints are a side effect of the threadpool scheduler; it tries to limit the number of active threads to the number of cores your CPU has available. This can cause long delays if you schedule many long-running threads that block often.
Many other threadpool implementations have similar constraints, give or take.
A "pool" contains a list of available "threads" ready to be used whereas "spawning" refers to actually creating a new thread.
The usefulness of "Thread Pooling" lies in "lower time-to-use": creation time overhead is avoided.
In terms of "which one is better": it depends. If the creation-time overhead is a problem use Thread-pooling. This is a common problem in environments where lots of "short-lived tasks" need to be performed.
As pointed out by other folks, there is a "management overhead" for Thread-Pooling: this is minimal if properly implemented. E.g. limiting the number of threads in the pool is trivial.
For some definition of "better", you generally want to go with a thread pool. Without knowing what your use case is, consider that with a thread pool, you have a fixed number of threads which can all be created at startup or can be created on demand (but the number of threads cannot exceed the size of the pool). If a task is submitted and no thread is available, it is put into a queue until there is a thread free to handle it.
If you are spawning threads in response to requests or some other kind of trigger, you run the risk of depleting all your resources as there is nothing to cap the amount of threads created.
Another benefit to thread pooling is reuse - the same threads are used over and over to handle different tasks, rather than having to create a new thread each time.
As pointed out by others, if you have a small number of tasks that will run for a long time, this would negate the benefits gained by avoiding frequent thread creation (since you would not need to create a ton of threads anyway).
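The mechanics described above are the same in any runtime. As a language-neutral illustration, here is a minimal C++ sketch of the fixed-workers-plus-queue pattern (not how the .NET pool is implemented):

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed set of worker threads consuming tasks from a queue.
// Not production-ready: no exception handling or result plumbing.
class ThreadPool {
public:
    explicit ThreadPool(size_t n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join(); // drains remaining tasks first
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can dequeue
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    ThreadPool pool(4);                  // threads created once, up front
    for (int i = 0; i < 8; ++i)          // tasks queue when no thread is free
        pool.submit([i] { std::printf("task %d\n", i); });
}                                        // destructor joins the workers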
My feeling is that you should start just by creating a thread as needed... If the performance of this is OK, then you're done. If at some point, you detect that you need lower latency around thread creation you can generally drop in a thread pool without breaking anything...
All depends on your scenario. Creating new threads is resource intensive and an expensive operation. Most very short asynchronous operations (less than a few seconds max) could make use of the thread pool.
For longer running operations that you want to run in the background, you'd typically create (spawn) your own thread. (Ab)using a platform/runtime built-in threadpool for long running operations could lead to nasty forms of deadlocks etc.
Thread pooling is usually considered better, because the threads are created up front, and used as required. Therefore, if you are using a lot of threads for relatively short tasks, it can be a lot faster. This is because they are saved for future use and are not destroyed and later re-created.
In contrast, if you only need 2-3 threads and they will only be created once, then spawning them yourself will be better. This is because you do not gain from caching existing threads for future use, and you are not creating extra threads which might never be used.
It depends on what you want to execute on the other thread.
For short tasks it is better to use a thread pool; for long tasks it may be better to spawn a new thread, since a long task could starve the thread pool for other tasks.
The main difference is that a ThreadPool maintains a set of threads that are already spun-up and available for use, because starting a new thread can be expensive processor-wise.
Note however that even a ThreadPool needs to "spawn" threads... it usually depends on workload - if there is a lot of work to be done, a good threadpool will spin up new threads to handle the load based on configuration and system resources.
There is a little extra time required for creating/spawning a thread, whereas a thread pool already contains created threads that are ready to be used.
This answer is a good summary but just in case, here is the link to Wikipedia:
http://en.wikipedia.org/wiki/Thread_pool_pattern
For multi-threaded execution combined with getting return values from the execution, or an easy way to detect whether work has completed, Java Callables could be used.
See https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6 for more info.
Assuming C# and Windows 7 and up...
When you create a thread using new Thread(), you create a managed thread that becomes backed by a native OS thread when you call Start: a one-to-one relationship. It is important to know that only one thread runs on a CPU core at any given time.
An easier way is to call ThreadPool.QueueUserWorkItem (i.e. a background thread), which in essence does the same thing, except those background threads aren't forever tied to a single native thread. The .NET scheduler simulates multitasking between managed threads on a single native thread. With, say, 4 cores, you'll have 4 native threads, each running multiple managed threads, as determined by .NET. This offers lighter-weight multitasking, since switching between managed threads happens within the .NET VM rather than in the kernel. There is some overhead associated with crossing from user mode to kernel mode, and the .NET scheduler minimizes such crossings.
It may be important to note that heavy multitasking might benefit from pure native OS threads in a well-designed multithreading framework. However, the performance benefits aren't that large.
When using the ThreadPool, just make sure the minimum worker thread count is high enough, or ThreadPool.QueueUserWorkItem will be slower than new Thread(). In a benchmark test looping 512 times, calling new Thread() left ThreadPool.QueueUserWorkItem in the dust with the default minimums. However, first setting the minimum worker thread count to 512 made new Thread() and ThreadPool.QueueUserWorkItem perform similarly in this test.
A side effect of setting a high worker thread count is that new Task() (or Task.Factory.StartNew) also performed similarly to new Thread() and ThreadPool.QueueUserWorkItem.