OpenMP OpenBLAS nested parallelism - multithreading

We are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run with 8 threads. Currently, we use a structure like this:
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        cblas_dgemm(...);
    } else {
        cblas_dgemm(...);
    }
}
Here is the issue:
At the top level, there are two OpenMP threads, each of which is active inside one of the if/else blocks. We expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm functions, we expect new threads to be spawned.
To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8
However, it doesn't seem to be working. If we measure the runtime of each of the parallel calls, the runtimes are equal, but they equal the runtime of a single cblas_dgemm call when nested parallelism is not used and the environment variable OPENBLAS_NUM_THREADS is set to 1.
What is going wrong, and how can we get the desired behavior?
Is there any way we could know the number of threads inside the cblas_dgemm function?
Thank you very much for your time and help

The mechanism you are trying to use is called "nesting": creating a new parallel region inside an outer parallel region that is already active. While most implementations support nesting, it is disabled by default. Try setting OMP_NESTED=true in your environment or calling omp_set_nested(1) before the first OpenMP directive in your code.
I would also change the above code to read like this:
#pragma omp parallel num_threads(2)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            cblas_dgemm(...);
        }
        #pragma omp section
        {
            cblas_dgemm(...);
        }
    }
}
That way, the code also computes the correct result with only one thread, serializing the two dgemm calls. In your version, a single thread would execute only the first dgemm call and silently skip the second.

Related

OpenMP parallel for -- Multiple parallel for's Vs. one parallel that includes within it multiple for's

I am going through Using OpenMP. The authors compare and contrast the following two constructs:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
}
as against
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
}
}
They state that Construct 2
has fewer implied barriers, and there might be potential for cache
data reuse between loops. The downside of this approach is that one
can no longer adjust the number of threads on a per loop basis, but
that is often not a real limitation.
I am having a difficult time understanding how Construct 2 has fewer implied barriers. Is there not an implied barrier in Construct 2 after each for loop due to #pragma omp for? So, in each case, isn't the number of implied barriers the same, N? That is, is it not the case in Construct 2 that the first loop occurs first, and so on, and then the Nth for loop is executed last?
Also, how is Construct 2 more favorable for cache reuse between loops?
I am having a difficult time understanding how Construct 2 has fewer
implied barriers. Is there not an implied barrier in Construct 2 after
each for loop due to #pragma omp for? So, in each case, isn't the
number of implied barriers the same, N? That is, is it not the case in
Construct 2 that the first loop occurs first, and so on, and then the
Nth for loop is executed last?
I did not read the book, but based on what you have shown, it is actually the other way around:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier.
has N implicit barriers (at the end of each parallel region), whereas the second code:
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier
} // <-- implicit barrier
has N+1 barriers (one at the end of each for plus one at the end of the parallel region).
Actually, in this case, since there is no computation between the last two implicit barriers, one can add a nowait clause to the last #pragma omp for to eliminate the redundant barrier. More generally, the only way for the second code to have fewer implicit barriers than the first is to add nowait clauses to the #pragma omp for constructs.
From the link about the book that you have shown:
Finally, Using OpenMP considers trends likely to influence OpenMP
development, offering a glimpse of the possibilities of a future
OpenMP 3.0 from the vantage point of the current OpenMP 2.5. With
multicore computer use increasing, the need for a comprehensive
introduction and overview of the standard interface is clear.
So the book uses the old OpenMP 2.5 standard, which says the following about the loop construct:
There is an implicit barrier at the end of a loop construct
unless a nowait clause is specified.
A nowait clause cannot be added to the parallel construct, but it can be added to the for construct. Therefore, the second code has the potential for fewer implicit barriers if one can add nowait clauses to the #pragma omp for constructs. As written, however, the second code actually has more implicit barriers than the first.
Also, how is Construct 2 more favorable for cache reuse between loops?
If you are using a static distribution of the loop iterations among threads (e.g., #pragma omp for schedule(static, ...)) in the second code, the same threads will work with the same loop iterations across loops. For instance, with two threads, call them Thread A and Thread B, and a static distribution with chunk=1, Thread A and Thread B will work with the even and odd iterations of each loop, respectively. Consequently, depending on the actual application code, those threads may work with the same memory positions of a given data structure (e.g., the same array positions) in both loops.
In the first code, in theory (though this depends on the specific OpenMP implementation), since there are two different parallel regions, different threads can pick up the same loop iterations across the two loops. In other words, in our two-thread example, there is no guarantee that the thread that computed the even (or odd) iterations in one loop would compute those same iterations in the other loops.

What is the point of running same code under different threads - openMP?

From: https://bisqwit.iki.fi/story/howto/openmp/
The parallel construct
The parallel construct starts a parallel block. It creates a team
of N threads (where N is determined at runtime, usually from the
number of CPU cores, but may be affected by a few things), all of
which execute the next statement (or the next block, if the statement
is a {…} -enclosure). After the statement, the threads join back into
one.
#pragma omp parallel
{
// Code inside this region runs in parallel.
printf("Hello!\n");
}
I want to understand what the point is of running the same code under different threads. In what kinds of cases can it be helpful?
By using omp_get_thread_num() you can retrieve the thread ID, which enables you to parametrize the so-called "same code" with respect to that thread ID.
Take this example:
A is a 1000-element integer array and you need to sum its values using 2 OpenMP threads.
You would design your code something like this:
int A_dim = 1000;
long sum[2] = {0, 0};

#pragma omp parallel num_threads(2)
{
    int threadID = omp_get_thread_num();
    int start = threadID * (A_dim / 2);
    int end = (threadID + 1) * (A_dim / 2);
    for (int i = start; i < end; i++)
        sum[threadID] += A[i];
}
start is the lower bound which your thread will start summing from (example: thread #0 will start summing from 0, while thread #1 will start summing from 500).
end is analogous to start, but it is the upper bound of the array index the thread will sum up to (example: thread #0 will sum until 500, i.e., values from A[0] to A[499], while thread #1 sums values from A[500] to A[999]).
I want to understand what the point is of running the same code under different threads. In what kinds of cases can it be helpful?
When you are running the same code on different data.
For example, if I want to invert 10 matrices, I might run the matrix inversion code on 10 threads ... to get (ideally) a 10-fold speedup compared to 1 thread and a for loop.
The basic idea of OpenMP is to distribute work. For this you need to create some threads.
The parallel construct creates this number of threads. Afterwards you can distribute/share work with other constructs like omp for or omp task.
A possible benefit of this distinction arises, e.g., when you have to allocate memory for each thread (i.e., thread-local data).
I want to understand what the point is of running the same code under different threads. In what kinds of cases can it be helpful?
One example: in physics, you often have a random process in your code (collisions, an initial Maxwellian distribution, etc.) and need to run the code many times to get averaged results; in this case you run the same code several times.

Opencl: global thread synchronization between two loops

I have an opencl kernel that computes two global buffers in two loops.
The first loop does some computations with a global thread and writes the result to the output buffer "OutBuff". The second loop then updates the values of the global buffer "UpdateBuff" according to the results computed into "OutBuff" by the first loop (on the previous level). The problem is that the global thread ordering between the two loops changes, since the threads execute in parallel. But in my case, I need to keep the order of thread execution between these two loops: I need to compute the two loops with the same global id.
for example
__kernel void globalSynch(__global double4* input, __global uint* points,
                          __global double4* OutBuff, __global double4* UpdateBuff)
{
    int gid = get_global_id(0);
    uint pt;
    for (int level = 0; level < N; level++)
    {
        for (int i = 0; i < blocksize; i++)
        {
            pt = points[gid * i * level];
            OutBuff[pt] = do_some_computations(UpdateBuff, ...);
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
        for (int j = 0; j < blocksize1; j++)
        {
            pt = points[gid * j * (level + 1)];
            UpdateBuff[pt] = do_some_computations(OutBuff, ...);
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
Is this related to use Semaphores?
This is a common OpenCL misunderstanding. The barrier statement synchronizes only within a work-group, not across the global work size. There is no statement for global synchronization (because of how work-groups are scheduled: some run to completion before others even start). The solution for global synchronization is to use separate kernels: the first runs to completion, and then the second one runs.

OpenMP: recursively subdividing work and threads

I wonder how below logic can be written using OpenMP:
do_something(Job job, Threads x ... x+b) {
    if (b < 1) { // if only one thread is assigned
        do_real_work(job); return;
    }
    // otherwise, divide the work into two subtasks
    // Only one thread executes this:
    Divide job into job1 and job2
    // Divide the assigned threads into two groups
    // and assign the subtasks to them.
    // The two lines below should execute in parallel.
    do_something(job1, x ... x+b/2)
    do_something(job2, x+b/2 ... x+b)
}
The above workflow by itself is simply divide-and-conquer. I want to divide the work among n threads in a "binary-tree" style.
In particular, I want the program to be able to obtain the number of threads from, say, an environment variable, and take care of the division recursively.
If 4 threads are used, then two levels are executed;
If 8 threads are used, then three levels are executed, etc.
I have no idea how one can designate a subset of threads to execute a parallel task in OpenMP.
And is it even possible to specify the thread IDs to carry out the task?
Although it is possible to obtain the thread ID using omp_get_thread_num() and branch according to that ID, your problem is better solved using explicit tasks. The maximum number of OpenMP threads, as set in the environment variable OMP_NUM_THREADS, can be obtained by calling omp_get_max_threads() (even outside a parallel region), and the actual number by calling omp_get_num_threads() inside an active parallel region. The parallel code should look something like this:
do_something(Job job, Threads x ... x+b) {
    if (b < 1) { // if only one thread is assigned
        do_real_work(job); return;
    }
    // otherwise, divide the work into two subtasks
    // Only one thread executes this:
    Divide job into job1 and job2
    // Divide the assigned threads into two groups
    // and assign the subtasks to them.
    // The two tasks below may execute in parallel.
    #pragma omp task
    do_something(job1, x ... x+b/2)
    #pragma omp task
    do_something(job2, x+b/2 ... x+b)
}

// Call like follows:
#pragma omp parallel
{
    #pragma omp single
    {
        b = omp_get_num_threads();
        do_something(job, x, b);
    }
}

Correct usage of openMP target construct

I'm trying to figure out whether I am using the OpenMP 4 target construct correctly, so it would be nice if someone could give me some tips:
class XY {
    #pragma omp declare target
    static void function_XY() {
        #pragma omp for
        loop{}
    }
    #pragma omp end declare target
};

main() {
    // var declaration
    // some sequential stuff
    #pragma omp target map(some variables)
    {
        #pragma omp parallel
        {
            #pragma omp for
            loop1{}
            function_XY();
            #pragma omp for
            loop2{}
        }
    }
    // some more sequential stuff
}
My overall code works and gets faster with more threads, but I'm wondering whether it is correctly executed on the target device (Xeon Phi).
Also, if I remove all the OpenMP constructs and execute my program sequentially, it runs faster than with multiple threads (any number), perhaps due to OpenMP initialization overhead?
What I want is the parallel execution of loop1, function_XY, and loop2 on the target device.
"I'm wondering whether it is correctly executed on the target device (Xeon Phi)"
Well, if you compile the code with the -mmic flag, it generates a binary that only runs on the MIC.
To run the code in native mode on the MIC, copy the executable to the MIC (via scp), copy the needed libraries, SSH into the MIC, and execute it.
Don't forget to export LD_LIBRARY_PATH so it points at the libraries on the MIC.
Now, assuming that you do run the code on the coprocessor, increased performance when disabling threading indicates a bottleneck somewhere in the code, but analyzing it would require more information.
