I'm trying to figure out whether I am using the OpenMP 4 constructs correctly.
It would be nice if someone could give me some tips.
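class XY {
public:
    #pragma omp declare target
    static void function_XY() {
        #pragma omp for  // orphaned work-sharing loop, binds to the caller's parallel region
        for (...) { /* loop */ }
    }
    #pragma omp end declare target
};

int main() {
    // variable declarations
    // some sequential stuff
    #pragma omp target map(/* some variables */)
    {
        #pragma omp parallel
        {
            #pragma omp for
            for (...) { /* loop1 */ }
            XY::function_XY();
            #pragma omp for
            for (...) { /* loop2 */ }
        }
    }
    // some more sequential stuff
}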
My overall code works and gets faster with more threads, but I'm wondering whether it is actually executed on the target device (Xeon Phi).
Also, if I remove all the OpenMP directives and run my program sequentially, it runs faster than with multiple threads (any number). Maybe this is due to OpenMP initialisation?
What I want is parallel execution of loop1, function_XY, and loop2 on the target device.
" I'm wondering if the code is correctly executed on the target device(xeon phi)"
Well, if you are compiling the code with the -mmic flag, it will generate a binary that runs only on the MIC.
To run the code (in native mode) on the MIC, copy the executable to the MIC (via scp), copy the needed libraries, SSH to the MIC, and execute it.
Don't forget to export LD_LIBRARY_PATH to indicate the path of the libraries on the MIC.
Now, assuming that you do run the code on the coprocessor, better performance with threading disabled indicates that there is a bottleneck somewhere in the code. More information is needed to analyze this, though.
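If you use the offload model (#pragma omp target) rather than a native build, one way to check where a target region really ran is omp_is_initial_device(), which is part of the OpenMP 4.0 API. The following is only a minimal sketch: offload_sum and OffloadReport are made-up names for illustration, and the loop is a placeholder for your own work. When no device is available (or OpenMP is disabled), the region falls back to the host and the flag reports that.

```cpp
#if defined(_OPENMP) && _OPENMP >= 201307
#include <omp.h>
#endif

// Hypothetical helper type: result of the computation plus where it ran.
struct OffloadReport {
    long sum;
    bool ran_on_host;
};

// Sums 0..n-1 inside a target region and records whether the region
// executed on the host (initial device) or on an attached device.
OffloadReport offload_sum(int n) {
    long sum = 0;
    int on_host = 1;
    #pragma omp target map(tofrom: sum) map(from: on_host)
    {
#if defined(_OPENMP) && _OPENMP >= 201307
        on_host = omp_is_initial_device();  // nonzero when running on the host
#else
        on_host = 1;                        // no OpenMP 4.0 support: host only
#endif
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += i;
        }
    }
    return {sum, on_host != 0};
}
```

Calling offload_sum(100) and printing ran_on_host tells you whether the region was offloaded; the sum is the same either way.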
I am going through Using OpenMP. The authors compare and contrast the following two constructs:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
}
as against
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
}
}
They state that Construct 2
has fewer implied barriers, and there might be potential for cache
data reuse between loops. The downside of this approach is that one
can no longer adjust the number of threads on a per loop basis, but
that is often not a real limitation.
I am having a difficult time understanding how Construct 2 has fewer implied barriers. Is there not an implied barrier in Construct 2 after each for loop due to #pragma omp for? So, in each case, isn't the number of implied barriers the same, N? That is, is it not the case in Construct 2 that the first loop occurs first, and so on, and then the Nth for loop is executed last?
Also, how is Construct 2 more favorable for cache reuse between loops?
I am having a difficult time understanding how Construct 2 has fewer
implied barriers. Is there not an implied barrier in Construct 2 after
each for loop due to #pragma omp for? So, in each case, isn't the
number of implied barriers the same, N? That is, is it not the case in
Construct 2 that the first loop occurs first, and so on, and then the
Nth for loop is executed last?
I did not read the book but based on what you have shown it is actually the other way around, namely:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier.
has N implicit barriers (at the end of each parallel region), whereas the second code:
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier
} // <-- implicit barrier
has N+1 barriers (at the end of each for + the parallel region).
Actually, in this case, since there is no computation between the last two implicit barriers, one can add the nowait to the last #pragma omp for to eliminate one of the redundant barriers.
One way for the second code to have fewer implicit barriers than the first would be to add a nowait clause to the #pragma omp for directives.
From the link about the book that you have shown:
Finally, Using OpenMP considers trends likely to influence OpenMP
development, offering a glimpse of the possibilities of a future
OpenMP 3.0 from the vantage point of the current OpenMP 2.5. With
multicore computer use increasing, the need for a comprehensive
introduction and overview of the standard interface is clear.
So the book is using the old OpenMP 2.5 standard, and in that standard one can read the following about the loop construct:
There is an implicit barrier at the end of a loop construct
unless a nowait clause is specified.
A nowait cannot be added to the parallel construct, but it can be added to the for construct. Therefore, the second code has the potential to have fewer implicit barriers if one can add the nowait clause to the #pragma omp for directives. However, as it is, the second code actually has more implicit barriers than the first code.
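To make the nowait point concrete, here is a minimal sketch of Construct 2 with the per-loop barriers removed. It assumes the two loops are independent (they write to different arrays), which is what makes dropping the barriers safe; fill_two and the array contents are made up for illustration.

```cpp
#include <vector>

// Two independent work-sharing loops inside one parallel region.
// With nowait on both loops, the only barrier left is the one at the
// end of the parallel region.
void fill_two(std::vector<int>& a, std::vector<int>& b) {
    const int na = static_cast<int>(a.size());
    const int nb = static_cast<int>(b.size());
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < na; ++i) {
            a[i] = 2 * i;
        }
        #pragma omp for nowait
        for (int i = 0; i < nb; ++i) {
            b[i] = 3 * i;
        }
    }  // <-- single implicit barrier
}
```

If the second loop read values written by the first, the nowait on the first loop would in general have to go.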
Also, how is Construct 2 more favorable for cache reuse between loops?
If you are using a static distribution of the loop iterations among threads (e.g., #pragma omp for schedule(static, ...)) in the second code, the same threads will be working on the same loop iterations in every loop. For instance, take two threads, Thread A and Thread B. If we assume a static distribution with chunk=1, one thread will work on the even iterations and the other on the odd iterations of each loop. Consequently, depending on the actual application code, those threads may work with the same memory positions of a given data structure (e.g., the same array positions) in each loop.
In the first code, in theory (this will depend on the specific OpenMP implementation), since there are separate parallel regions, different threads can pick up the same loop iterations across the loops. In other words, in our example with two threads, there is no guarantee that the thread that computed the even (or the odd) iterations in one loop will compute those same iterations in the other loops.
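A small experiment can illustrate this. The sketch below records which thread ran each iteration in two consecutive loops of the same parallel region and compares the two records. It assumes both loops have the same iteration count and the same schedule(static, 1), which are the conditions under which the OpenMP specification guarantees the same assignment; same_owner_in_both_loops and thread_id are made-up helper names.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the calling thread's id, or 0 when compiled without OpenMP.
static int thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Runs two work-sharing loops in one parallel region and checks that
// every iteration index was handled by the same thread in both loops.
bool same_owner_in_both_loops(int n) {
    std::vector<int> owner1(n), owner2(n);
    #pragma omp parallel
    {
        #pragma omp for schedule(static, 1)
        for (int i = 0; i < n; ++i) {
            owner1[i] = thread_id();
        }
        #pragma omp for schedule(static, 1)
        for (int i = 0; i < n; ++i) {
            owner2[i] = thread_id();
        }
    }
    return owner1 == owner2;
}
```

Splitting the two loops into two separate #pragma omp parallel for regions is where this guarantee no longer applies.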
We are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:
#pragma omp parallel num_threads(2)
{
if (omp_get_thread_num() == 0){
cblas_dgemm(...);
}else {
cblas_dgemm(...);
}
}
Here is the issue:
At the top level, there are two OpenMP threads, each of which is active inside one of the if/else blocks. Now, we expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm functions, we expect new threads to be spawned.
To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8
However, it doesn't seem to be working. If we measure the runtime for each of the parallel calls, the runtime values are equal, but they are equal to the runtime of a single cblas_dgemm call when nested parallelism is not used and the environment variable OPENBLAS_NUM_THREADS is set to 1.
What is going wrong, and how can we get the desired behavior?
Is there any way we could know the number of threads inside the cblas_dgemm function?
Thank you very much for your time and help
The mechanism you are trying to use is called "nesting", that is, creating a new parallel region while an outer parallel region is already active. While most implementations support nesting, it is disabled by default. Try setting the environment variable OMP_NESTED=true, or call omp_set_nested(true) before the first OpenMP directive in your code.
I would also change the above code to read like this:
#pragma omp parallel num_threads(2)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            cblas_dgemm(...);
        }
        #pragma omp section
        {
            cblas_dgemm(...);
        }
    }
}
That way, the code will also compute the correct thing with only one thread, serializing the two calls to dgemm. In your example with only one thread, the code would run but miss the second dgemm call.
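As for finding out how many threads are active at the inner level, here is a minimal sketch of nested regions that reports the inner team size. inner_team_size is a hypothetical stand-in for a library routine that opens its own OpenMP parallel region; whether cblas_dgemm behaves like this depends on how your BLAS was built, so treat that as an assumption. The sketch uses omp_set_max_active_levels (available since OpenMP 3.0); on older runtimes you may still need OMP_NESTED=true or omp_set_nested(true) as well.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for a library call that spawns its own team.
// Returns the number of threads the inner parallel region actually got.
int inner_team_size(int requested) {
    int size = 1;
    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single
        {
#ifdef _OPENMP
            size = omp_get_num_threads();
#endif
        }
    }
    return size;
}

// Opens an outer region with `outer` threads; each outer thread calls the
// stand-in routine and stores the inner team size it observed.
std::vector<int> nested_team_sizes(int outer, int inner) {
#ifdef _OPENMP
    omp_set_max_active_levels(2);  // allow two active levels of parallelism
#endif
    std::vector<int> sizes(outer, 0);
    #pragma omp parallel num_threads(outer)
    {
        int id = 0;
#ifdef _OPENMP
        id = omp_get_thread_num();
#endif
        sizes[id] = inner_team_size(inner);
    }
    return sizes;
}
```

If nested_team_sizes(2, 8) reports 1 for each entry, nesting is not active and the inner regions are being serialized, which would match the timings described in the question.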
I have applied loop unrolling as mentioned in this post
Code:
for(i = 0; i< ROUND_DOWN(contours.size(),3);i+=3)
{
cv::convexHull(contours[i], convexHulls[i]);
cv::convexHull(contours[i+1], convexHulls[i+1]);
cv::convexHull(contours[i+2], convexHulls[i+2]);
}
Now I want to use multiple threads (3) in the for loop so that each thread executes only one statement in the loop, somewhat like sections in OpenMP.
How to do so?
I tried this:
for(i = 0; i< ROUND_DOWN(contours.size(),3);i+=3)
{
#pragma omp parallel sections
{
#pragma omp section
cv::convexHull(contours[i], convexHulls[i]);
#pragma omp section
cv::convexHull(contours[i+1], convexHulls[i+1]);
#pragma omp section
cv::convexHull(contours[i+2], convexHulls[i+2]);
}
}
But it didn't work and I got an error. Can someone tell me how to do this right?
I did find another post. In it, SSE instructions are used, but I am unable to make sense of it.
Simply use a parallel for:
#pragma omp parallel for
for(i = 0; i < contours.size(); i++)
{
cv::convexHull(contours[i], convexHulls[i]);
}
This expresses what you want to do and allows the compiler and runtime to run the loop in parallel. For instance this will work with any thread configuration or size, while your suggestion will only work properly for three threads.
Don't help the compiler unless you have evidence or strong knowledge that it is beneficial. If you ever do, verify that it is actually beneficial. If the simple version does not perform well in your case, you should first give the compiler hints (e.g. scheduling strategies) rather than implementing your own manually.
Note that this will only work properly if there are no data dependencies between loop iterations (the same holds for your sections code). Your code looks like this is the case, but a definite assessment would require a proper complete code example.
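For a version you can try without OpenCV, here is a minimal sketch of the same pattern with a scheduling hint. The per-item work (summing a vector) is only a hypothetical stand-in for cv::convexHull, and process_all is a made-up name; schedule(dynamic) is one possible hint for items of uneven cost, not a recommendation for every case.

```cpp
#include <numeric>
#include <vector>

// Processes every input independently; results[i] depends only on inputs[i],
// so the iterations can be distributed over any number of threads.
std::vector<long> process_all(const std::vector<std::vector<int>>& inputs) {
    std::vector<long> results(inputs.size());
    const long n = static_cast<long>(inputs.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        // Stand-in for the real per-item work.
        results[i] = std::accumulate(inputs[i].begin(), inputs[i].end(), 0L);
    }
    return results;
}
```

Note that nothing here depends on the number of threads or on the input size being a multiple of three.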
Not sure why you want to parallelize using sections. It is not clear what kind of dependencies you have in the cv::convexHull function but if there are no side-effects (as I think) you should be able to parallelize simply using work-sharing:
#pragma omp parallel for
for(i = 0; i< ROUND_DOWN(contours.size(),3); i+=3)
{
cv::convexHull(contours[i], convexHulls[i]);
cv::convexHull(contours[i+1], convexHulls[i+1]);
cv::convexHull(contours[i+2], convexHulls[i+2]);
}
I want to replace:
omp_set_lock(&bestTimeSeenSoFar_lock);
temp_bestTimeSeenSoFar = bestTimeSeenSoFar; // this is a read
omp_unset_lock(&bestTimeSeenSoFar_lock);
...
omp_set_lock(&bestTimeSeenSoFar_lock);
// update/write bestTimeSeenSoFar
omp_unset_lock(&bestTimeSeenSoFar_lock);
with code that will allow multiple threads to be reading the variable at once UNLESS a thread is trying to write, in which case they wait until the write is done. Help?
What about using something like this?
#pragma omp flush(bestTimeSeenSoFar)
#pragma omp atomic read
temp_bestTimeSeenSoFar = bestTimeSeenSoFar;
...
#pragma omp atomic write
bestTimeSeenSoFar = whatever;
#pragma omp flush(bestTimeSeenSoFar)
My reading of the OpenMP standard, chapter 2.12.6 (dealing with atomic), doesn't let me decide whether this will do exactly what you want, but it is the closest I can come up with. Moreover, even if this works in theory, it will be highly dependent on the quality of the implementation of this feature in your compiler. So if it does not work for you, that won't necessarily mean the idea is wrong.
Anyway, I would encourage you to give it a try and, please please, to report if it works for you.
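One thing to keep in mind is that an atomic read followed by a separate atomic write does not make the whole "compare, then update" step atomic. A minimal sketch of one way around that is below: cheap atomic reads on the fast path, and a named critical section with a re-check for the rare update. This is my own illustration under the assumption that "best" means "smallest"; best_time and the critical name best_update are made up.

```cpp
#include <limits>
#include <vector>

// Finds the smallest value in `times` using concurrent atomic reads and a
// critical section that serializes only the (rare) writes.
double best_time(const std::vector<double>& times) {
    double best = std::numeric_limits<double>::max();
    const long n = static_cast<long>(times.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        double seen;
        #pragma omp atomic read
        seen = best;                      // many threads may read at once
        if (times[i] < seen) {
            #pragma omp critical(best_update)
            {
                // Re-check: another thread may have updated best meanwhile.
                if (times[i] < best) {
                    #pragma omp atomic write
                    best = times[i];
                }
            }
        }
    }
    return best;
}
```

Readers never block each other here; only threads that believe they have a better value queue up on the critical section.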
I have a simple code prepared for testing. This is the most important piece of the code:
#pragma omp parallel sections
{
#pragma omp section
{
for (int j=0;j<100000;j++)
for (int i=0;i<1000;i++) a1[i]=1;
}
#pragma omp section
{
for (int j=0;j<100000;j++)
for (int i=0;i<1000;i++) a2[i]=1;
}
}
I compiled the program with the MinGW compiler and the results are as I expected. As I am going to use a computer with Linux only, I compiled the code on Linux (using the same machine). I used the gcc 4.7.2 and Intel 12.1.0 compilers. The efficiency of the program decreased significantly: it is slower than the sequential program (omp_set_num_threads(1)).
I have also tried with private arrays in threads, but the effect is similar.
Can someone suggest any explanation?
I don't exactly understand what you mean to achieve with your code, but the difference in efficiency could be due to your compiler not knowing how to handle code which has sections-within-sections.
First off, try a different compiler. From my experience gcc-4.8.0 works better with OpenMP so maybe you could try that to start off.
Secondly, use optimisation flags! If you are measuring performance, then it would only be fair to use one of -O1, -O2 or -O3. The last usually gives the best performance. Note that the options which take short-cuts with mathematical functions and make floating-point operations slightly less accurate are -ffast-math and -Ofast, not -O3 itself.
g++ -fopenmp name.cpp -O3
You can read up more on compiler flags on this page if it interests you.
As an end note, I don't know how experienced you are with OpenMP, but when dealing with loops in OpenMP you would usually use the following:
#pragma omp parallel for
for(int i=0; i<N; ++i)
doSomething();
Additionally, if you are using nested loops, you can use the collapse clause to tell your compiler to turn your nested loops into a single one (which can lead to better performance):
#pragma omp parallel for collapse(2)
for(int i=0; i<N; ++i)
for(int j=0; j<N; ++j)
doSomething();
There are some things you should be aware of when using collapse, which you can read about here. I personally prefer manually converting them into a single loop, as in my experience this proves even more efficient.
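To show what the two options look like side by side, here is a minimal sketch that fills the same grid once with collapse(2) and once with a manually flattened loop. The function names and the value written (i + j) are made up for illustration; which variant is faster is something you would have to measure for your own code.

```cpp
#include <cstddef>
#include <vector>

// Nested loops merged by the compiler via collapse(2).
std::vector<int> fill_collapsed(int rows, int cols) {
    std::vector<int> grid(static_cast<std::size_t>(rows) * cols);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            grid[static_cast<std::size_t>(i) * cols + j] = i + j;
    return grid;
}

// The same iteration space flattened by hand into a single loop.
std::vector<int> fill_flattened(int rows, int cols) {
    std::vector<int> grid(static_cast<std::size_t>(rows) * cols);
    const int total = rows * cols;
    #pragma omp parallel for
    for (int k = 0; k < total; ++k) {
        const int i = k / cols;  // recover the row index
        const int j = k % cols;  // recover the column index
        grid[k] = i + j;
    }
    return grid;
}
```

Both functions produce the same grid; the loop variables are declared in the for statements, so they are private automatically and no private clause is needed.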