OpenMP: why is #pragma omp parallel useful without for?

I'm an OpenMP beginner, and this is what I've read about #pragma omp parallel:
It creates a team of N threads ..., all of which execute the next
statement ... After the statement, the threads join back into one.
I cannot imagine an example where this is useful without the for keyword after the directive above. As I understand it, the for keyword splits the iterations among the threads of the team, whereas with the bare directive the following block/statement is executed by every thread, so there is no performance improvement. Can you please help me clarify this?

You can provide your own mechanism that splits the job into parallel pieces while relying on OpenMP for the parallelism.
Here’s a hypothetical example that uses OpenMP to dequeue operations and run them in parallel:
#pragma omp parallel
{
    operation op;
    while( queue.tryDequeue( &op ) )
        op.run();
}
The implementation of queue.tryDequeue must be thread-safe, i.e. guarded by a critical section/mutex or implemented lock-free.
To be efficient, op.run() must be CPU-heavy, taking much longer than queue.tryDequeue(). Otherwise, you’ll spend most of the time contending for the queue rather than doing the parallelizable work.
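A minimal sketch of such a queue, for illustration only. The Operation type, the WorkQueue class and the runAll function are hypothetical names that do not come from the answer above, and tryDequeue is guarded by a named critical section rather than a lock-free scheme. If the code is compiled without OpenMP support the pragmas are ignored and it simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical work item: squares its input when run.
struct Operation {
    int input = 0;
    int result = 0;
    void run() { result = input * input; }
};

// Hypothetical queue that hands out pending operations one at a time.
class WorkQueue {
public:
    explicit WorkQueue(std::vector<Operation>& ops) : ops_(ops), next_(0) {}

    // Thread-safe: the named critical section serializes access to next_.
    bool tryDequeue(Operation*& out) {
        bool ok = false;
        #pragma omp critical(work_queue)
        {
            if (next_ < ops_.size()) {
                out = &ops_[next_++];
                ok = true;
            }
        }
        return ok;
    }

private:
    std::vector<Operation>& ops_;
    std::size_t next_;
};

// Every thread of the team pulls operations until the queue is empty.
void runAll(std::vector<Operation>& ops) {
    WorkQueue queue(ops);
    #pragma omp parallel
    {
        Operation* op = nullptr;  // private to each thread
        while (queue.tryDequeue(op))
            op->run();
    }
}
```

Calling runAll on a vector of operations leaves each element with its result filled in. Which thread handles which operation is decided at run time by whoever reaches the queue first.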

The for keyword on its own does not divide the work; the #pragma omp for worksharing directive does.
Dividing the work means that each thread executes a portion of your loop's iterations. If you want to start from #pragma omp parallel, it looks like this:
#pragma omp parallel
{
    #pragma omp for
    for (int i = 1; i <= 100; i++)
    {
    }
}
The code above divides the iterations of the for loop among the n threads of the team. Any variable declared inside the loop is private to the thread executing that iteration. This keeps the loop body thread-safe, but it also means you are responsible for gathering the per-thread results, e.g. with a reduction clause.
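As a small illustration of gathering per-thread results, here is a sketch that sums the numbers 1 to n with a reduction clause. The function name sumOneTo is made up for this example.

```cpp
// Sums 1..n; the loop iterations are divided among the threads of the team.
int sumOneTo(int n) {
    int total = 0;
    #pragma omp parallel
    {
        // reduction(+:total) gives each thread a private partial sum and
        // adds the partial sums into the shared total when the loop ends.
        #pragma omp for reduction(+:total)
        for (int i = 1; i <= n; i++) {
            total += i;
        }
    }
    return total;
}
```

Without the reduction clause, every thread would update the shared total concurrently, which is a data race.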

Related

OpenMP strange hang at barrier

I have the following code fragment that runs on multicore using OpenMP. I found that (especially at higher core counts), the application hangs in a strange way.
......
while (true) {
    compute1(...);
    #pragma omp barrier
    if (_terminated) {
        ...... // <- at least one thread reaches here (L1)
        break;
    }
    ......
    compute2(...);
    ......
    if (_terminated) {
        cout << thread_id << endl; // <- when hangs, it always prints the last thread
    }
    #pragma omp barrier // <- one thread is stuck here (L2)
    ......
}
......
I observe that at least one thread is able to reach L1 (assume this is the case, and assume the application quits successfully if all threads reach L1). But sometimes not all threads reach L1. In fact, breaking into the debugger when the application hangs shows that at least one thread is stuck at the barrier at L2. I put a print statement right above L2, and it always prints the last thread number (15 when using 16 threads, 7 when using 8 threads, etc.).
This is very strange: the fact that at least one thread reaches L1 means it has moved past the first barrier above L1, which implies that all threads have reached that same barrier. Therefore, all threads should reach L1 (_terminated is a global shared variable), but in reality this is not the case. I encounter this issue frequently at higher core counts. It almost never happens when the number of cores is lower than the inherent parallelism in compute1 and compute2.
I am very confused by this issue. I am quite certain that either 1) I have fundamentally misunderstood some aspect of OpenMP semantics, or 2) this is a bug in OpenMP. Any suggestions are much appreciated!
You have race condition(s) in your code. When one thread writes a shared variable (e.g. _terminated) and a different thread reads it without synchronization, a data race may occur. Note that a data race is undefined behaviour in C/C++. One possible (and efficient) way to avoid the data race is to use atomic operations.
To write it, use
#pragma omp atomic write seq_cst
_terminated=...
To read it, use
bool local_terminated;
#pragma omp atomic read seq_cst
local_terminated=_terminated;
if (local_terminated){
...
If this does not solve your problem, please provide a minimal reproducible example.
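The following sketch puts the atomic write and read into a complete loop of the same shape as the one in the question. It is only an illustration under stated assumptions: the function runUntilTerminated, its stopAfter parameter and the rule that every thread sets the flag in the same round are invented here, and the flag is an int rather than the question's _terminated variable.

```cpp
// Returns the round in which all threads observed the stop flag.
int runUntilTerminated(int stopAfter) {
    int terminated = 0;  // shared stop flag (stands in for _terminated)
    int rounds = 0;      // shared result, written by one thread at the end
    #pragma omp parallel shared(terminated, rounds)
    {
        int iter = 0;    // private to each thread
        while (true) {
            ++iter;
            if (iter >= stopAfter) {
                #pragma omp atomic write seq_cst
                terminated = 1;
            }
            // Every write for this round happens before this barrier.
            #pragma omp barrier
            int localTerminated;
            #pragma omp atomic read seq_cst
            localTerminated = terminated;
            if (localTerminated)
                break;  // all threads see the same value and leave together
            // Keeps a fast thread from setting the flag for the next round
            // before the slower threads have read it for this round.
            #pragma omp barrier
        }
        #pragma omp single
        rounds = iter;
    }
    return rounds;
}
```

The second barrier matters as much as the atomics: without it, one thread could read the flag as unset while another reads it as set in the same round, and the threads would then wait at different barriers.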

Is it possible to fix the number of threads and dispatch the task when there is an idle one?

I want to use OpenMP to attain this effect: fix the number of threads, if there is an idle thread, dispatch the task into it, else wait for an idle one. The following is my test code:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
void func(void) {
    #pragma omp parallel for
    for (int i = 0; i < 3; i++)
    {
        sleep(30);
        printf("%d\n", omp_get_thread_num());
    }
}
int main(void) {
    omp_set_nested(1);
    omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
    {
        printf("%d\n", omp_get_thread_num());
        func();
    }
    return 0;
}
Actually, my machine has 24 cores, so
omp_set_num_threads(omp_get_num_procs())
will launch 24 threads at the beginning. Then main's for loop will occupy 8 threads, and each of those threads calls func, so 2 additional threads ought to be used per call. By my calculation, 24 threads should be enough. But when actually running, 208 threads are created in total.
So my questions are as follows:
(1) Why are so many threads created when 24 seem to be enough?
(2) Is it possible to fix the number of threads (E.g., same as the number of cores) and dispatch the task when there is idle one?
1) That is simply how parallel for is defined: a parallel directive immediately followed by a loop directive. Thread creation is therefore not limited by the worksharing granularity.
Edit: To clarify, OpenMP will:
Create an implementation-defined number of threads, unless you specify otherwise.
Schedule the loop iterations among this team of threads. If there are fewer iterations than threads, some threads in the team end up with no work.
If you have nested parallelism, this repeats: each thread that encounters a nested parallel construct creates a whole new team.
So in your case, 8 threads encounter the inner parallel construct and each ends up with a team of 24 threads, while 16 threads of the outer team don't. That gives 8 * 24 + 16 = 208 threads in total.
2) Yes. Incidentally, this concept is called a task in OpenMP. Here is a good introduction.
In OpenMP, once you have asked for a particular number of threads, the runtime system will give them to your parallel region if it is able to do so, and those threads cannot be used for other work while the parallel region is active. The runtime system cannot guess that you are not going to use the threads you have requested.
So you can either ask for fewer threads if you need fewer, or use some other parallelization technique that can dynamically manage the number of active threads. For example, if you ask OpenMP for 8 threads for the outer parallel region and 3 threads for the inner regions, you may end up with 24 threads (or fewer, if threads can be re-used, e.g. when the parallel regions are not running simultaneously).
-- Andrey
You should try
#pragma omp task
Besides, in my opinion, you should avoid using nested OpenMP threads.
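A rough sketch of the task-based approach suggested above, assuming the 8 x 3 units of work from the question can each be expressed as one independent task. The function runTasks and its counter are hypothetical and stand in for the real work. A single team is created once; one thread generates the tasks, and any idle thread of the team picks up the next pending task.

```cpp
// Creates outer * inner tasks on one fixed team and counts how many ran.
int runTasks(int outer, int inner) {
    int completed = 0;
    #pragma omp parallel
    {
        // Only one thread generates the tasks; the rest of the team
        // waits at the end of the single construct and executes them.
        #pragma omp single
        {
            for (int i = 0; i < outer; i++) {
                for (int j = 0; j < inner; j++) {
                    #pragma omp task shared(completed)
                    {
                        // The real work (sleep/printf in the question) would go here.
                        #pragma omp atomic
                        completed++;
                    }
                }
            }
        }  // implicit barrier: all generated tasks have finished here
    }
    return completed;
}
```

The team size is whatever the runtime would use for a normal parallel region (for example as set through OMP_NUM_THREADS), so no additional threads are created no matter how many tasks are generated.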

Reduce OpenMP fork/join overhead by separating #omp parallel and #omp for

I'm reading the book An Introduction to Parallel Programming by Peter S. Pacheco. Section 5.6.2 gives an interesting discussion about reducing the fork/join overhead.
Consider the odd-even transposition sort algorithm:
for(phase=0; phase < n; phase++){
    if(phase % 2 == 0){
        # pragma omp parallel for default(none) shared(n) private(i)
        for(i=1; i<n; i+=2){ /* meat */ }
    }
    else{
        # pragma omp parallel for default(none) shared(n) private(i)
        for(i=1; i<n-1; i+=2){ /* meat */ }
    }
}
The author argues that the above code has somewhat high fork/join overhead, because the threads are forked and joined in each iteration of the outer loop. Hence, he proposes the following version:
# pragma omp parallel default(none) shared(n) private(i, phase)
for(phase=0; phase < n; phase++){
    if(phase % 2 == 0){
        # pragma omp for
        for(i=1; i<n; i+=2){ /* meat */ }
    }
    else{
        # pragma omp for
        for(i=1; i<n-1; i+=2){ /* meat */ }
    }
}
According to the author, the second version forks the threads before the outer loop starts and reuses them for each iteration, yielding better performance.
However, I'm suspicious of the correctness of the second version. In my understanding, a #pragma omp parallel directive creates a group of threads and lets them execute the following structured block in parallel. In this case, the structured block should be the whole outer for loop, for(phase=0 ...). Then shouldn't the whole outer loop be executed four times if 4 threads are used? That is, if n=10, then 40 iterations would be executed on 4 threads. What is wrong with my understanding? And how does omp parallel (without for) interact with a following for loop like the one above?
The second version is correct.
Per the OpenMP specification, a #pragma omp parallel for directive is just a shortcut for #pragma omp parallel immediately followed by #pragma omp for, as in
#pragma omp parallel
{
    #pragma omp for
    for(int i=0; i<N; ++i) { /*loop body*/ }
}
If there is some code in the parallel region before or after the loop construct, it will be executed independently by each thread in the region (unless limited by other OpenMP directives). But #pragma omp for is a worksharing construct; the loop following that directive is shared by all threads in the region. That is, it is executed as a single loop whose iterations are split across the threads in some way. Thus, if the parallel region above is executed by 4 threads, the loop will still be executed just once, not 4 times.
Back to the example in your question: the phase loop is executed by each thread individually, but #pragma omp for at each phase iteration indicates start of a shared loop. For n=10, each thread will enter a shared loop 10 times, and execute a part of it; so there won't be 40 executions of the inner loops, but just 10.
Note there is an implicit barrier at the end of #pragma omp for; it means a thread that has completed its portion of a shared loop will not proceed until all other threads have completed their portions too. So the execution is synchronized across the threads. This is necessary to ensure correctness in most cases; e.g. in your example it guarantees that the threads always work on the same phase. However, if consecutive shared loops within a region are safe to execute simultaneously, a nowait clause can be used to eliminate the implicit barrier and allow threads to proceed immediately to the rest of the parallel region.
Note also that such processing of work sharing directives is quite specific to OpenMP. With other parallel programming frameworks, the logic you used in the question might be correct.
And last, smart OpenMP implementations do not join threads after a parallel region is completed; instead, threads might busy-wait for some time, and then sleep until another parallel region is started. This is done exactly to prevent high overheads at the start and the end of parallel regions. So, while the optimization suggested in the book still removes some overhead (perhaps), for some algorithms its impact on execution time might be negligible. The algorithm in the question is quite likely one of these; in the first implementation, parallel regions quickly follow one after another within the serial loop, so OpenMP worker threads most likely will be active at the beginning of a region and will start it quickly, avoiding the fork/join overhead. So don't be surprised if in practice you see no performance difference from the described optimization.
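For reference, here is a runnable sketch of the second version with the elided loop bodies filled in as the usual compare-and-swap steps of odd-even transposition sort. The function name oddEvenSort and the use of std::vector are choices made for this example and are not taken from the book.

```cpp
#include <utility>
#include <vector>

// Odd-even transposition sort with one parallel region around all phases.
void oddEvenSort(std::vector<int>& a) {
    int n = static_cast<int>(a.size());
    // Each thread runs the phase loop itself; phase is private because it
    // is declared inside the parallel region.
    #pragma omp parallel
    for (int phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            // Worksharing loop: iterations are split across the team,
            // with an implicit barrier at the end.
            #pragma omp for
            for (int i = 1; i < n; i += 2) {
                if (a[i - 1] > a[i]) std::swap(a[i - 1], a[i]);
            }
        } else {
            #pragma omp for
            for (int i = 1; i < n - 1; i += 2) {
                if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
            }
        }
    }
}
```

Within one phase the compared pairs do not overlap, so the iterations can run concurrently, and the implicit barrier after each worksharing loop keeps all threads on the same phase.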

Calling multithreaded MKL from an OpenMP parallel region

I have code with the following structure:
#pragma omp parallel
{
    #pragma omp for nowait
    {
        // first for loop
    }
    #pragma omp for nowait
    {
        // second for loop
    }
    #pragma omp barrier
    <-- #pragma omp single/critical/atomic --> not sure
    dgemm_(....)
    #pragma omp for
    {
        // yet another for loop
    }
}
For dgemm_, I link with multithreaded mkl. I want mkl to use all available 8 threads. What is the best way to do so?
This is a case of nested parallelism. It is supported by MKL, but it only works if your executable is built using the Intel C/C++ compiler. The reason for that restriction is that MKL uses Intel's OpenMP runtime and that different OMP runtimes do not play well with each other.
Once that is sorted out, you should enable nested parallelism by setting OMP_NESTED to TRUE and disable MKL's detection of nested parallelism by setting MKL_DYNAMIC to FALSE. If the data to be processed with dgemm_ is shared, then you have to invoke the latter from within a single construct. If each thread processes its own private data, then you don't need any synchronisation constructs, but using multithreaded MKL won't give you any benefit either. Therefore I would assume that your case is the former.
To summarise:
#pragma omp single
dgemm_(...);
and run with:
$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE ./exe
You could also set the parameters with the appropriate calls:
mkl_set_dynamic(0);
mkl_set_num_threads(8);
omp_set_nested(1);
#pragma omp parallel num_threads(8) ...
{
...
}
though I would prefer to use environment variables instead.
While this post is a bit dated, I would still like to give some useful insights for it.
The above answer is correct from a functional perspective, but will not give the best results from a performance perspective. The reason is that most OpenMP implementations do not shut down the threads when they reach a barrier or have no work to do. Instead, the threads enter a spin-wait loop and continue to consume processor cycles while they are waiting.
In the example:
#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {} // first loop
    #pragma omp for
    for(...) {} // second loop
    #pragma omp single
    dgemm_(....)
    #pragma omp for
    for(...) {} // third loop
}
What will happen is that even if the dgemm call creates additional threads inside MKL, the outer-level threads will still be actively waiting for the end of the single construct and thus dgemm will run with reduced performance.
There are essentially two solutions to this problem:
1) Use the code as above and, in addition to the suggested environment variables, also disable active waiting:
$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE OMP_WAIT_POLICY=passive ./exe
2) Modify the code to split the parallel regions:
#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {} // first loop
    #pragma omp for nowait
    for(...) {} // second loop
}
dgemm_(...);
#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {} // third loop
}
For solution 1, the threads go to sleep immediately and do not consume cycles. The downside is that a thread has to wake up from this deeper sleep state, which increases the latency compared to the spin-wait.
For solution 2, the threads are kept in their spin-wait loop and are very likely actively waiting when the dgemm call enters its parallel region. The additional joins and forks will also introduce some overhead, but it may be better than the over-subscription of the initial solution with the single construct or solution 1.
Which solution is best will clearly depend on the amount of work being done in the dgemm operation compared to the synchronization overhead for fork/join, which is mostly dominated by the thread count and the internal implementation.

Stl container vector push_back with OpenMP multithreading

I want to push_back an object into a vector from different threads. The number of threads depends on the machine.
#pragma omp parallel shared(Spaces, LookUpTable) private(LutDistribution, tid)
{
    tid = omp_get_thread_num();
    BestCoreSpaces.push_back( computeBestCoreSpace(tid, &Spaces, &LookUpTable, LutDistribution));
}
The problem is that I'm not sure if it's working. I don't get crashes. I'm using OpenMP. Is OpenMP queuing something?
Maybe it's enough to reserve memory for the container with BestCoreSpaces.reserve(tid) or to assign the number of elements with BestCoreSpaces.assign(tid, Space). Can somebody help me?
You are just getting away with it: you have a race condition that might or might not manifest itself depending on the optimization level at compile time, the thread execution and/or the alignment of the stars.
You have to make the push_back() a critical section (i.e. use a mutex). For example:
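#pragma omp parallel shared(Spaces, LookUpTable, BestCoreSpaces) private(LutDistribution, tid)
{
    tid = omp_get_thread_num();
    // Compute outside the critical section so the threads actually run in parallel
    auto best = computeBestCoreSpace(tid, &Spaces, &LookUpTable, LutDistribution);
    // Only the update of the shared container needs to be serialized
    #pragma omp critical
    BestCoreSpaces.push_back(best);
}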
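A self-contained sketch of the same pattern, using a made-up function collectSquares in place of the question's computeBestCoreSpace. Each value is computed before entering the critical section so that only the push_back itself is serialized.

```cpp
#include <vector>

// Collects i * i for i in [0, n) from several threads into one vector.
std::vector<int> collectSquares(int n) {
    std::vector<int> results;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int value = i * i;  // computed in parallel, private to the iteration
        // Only one thread at a time may modify the shared vector.
        #pragma omp critical
        results.push_back(value);
    }
    return results;
}
```

The order of the elements in the returned vector depends on thread scheduling. If a deterministic order matters, one option is to size the vector up front and let iteration i write to element i instead of calling push_back.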
