OpenMP sections directive: Linux slower than Windows

I have a simple code prepared for testing. This is the most important piece of the code:
#pragma omp parallel sections
{
    #pragma omp section
    {
        for (int j = 0; j < 100000; j++)
            for (int i = 0; i < 1000; i++) a1[i] = 1;
    }
    #pragma omp section
    {
        for (int j = 0; j < 100000; j++)
            for (int i = 0; i < 1000; i++) a2[i] = 1;
    }
}
I compiled the program with the MinGW compiler and the results were as I expected. As I am going to use a computer with Linux only, I compiled the code on Linux (using the same machine), with the gcc 4.7.2 and Intel 12.1.0 compilers. The efficiency of the program decreased significantly: it is slower than the sequential program (omp_set_num_threads(1)).
I have also tried with private arrays in threads, but the effect is similar.
Can someone suggest any explanation?

I don't know exactly what you are trying to achieve with your code, but the difference in efficiency could be due to the compiler you are using not knowing how to handle code that has sections within sections.
First off, try a different compiler. In my experience gcc 4.8.0 works better with OpenMP, so maybe you could try that to start off.
Secondly, use optimisation flags! If you are measuring performance, then it would only be fair to use -O1, -O2, or -O3. The latter will give you the best performance; note that it is flags like -ffast-math (enabled by -Ofast, not by -O3) that take short-cuts with mathematical functions and make floating-point operations slightly less accurate.
g++ -fopenmp name.cpp -O3
You can read up more on compiler flags on this page if it interests you.
As an end note, I don't know how experienced you are with OpenMP, but when dealing with loops in OpenMP you would usually use the following:
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    doSomething();
Additionally, if you are using nested loops, you can use the collapse clause to tell the compiler to turn your nested loops into a single one (which can lead to better performance):
#pragma omp parallel for collapse(2)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        doSomething();
There are some things you should be aware of when using collapse, which you can read about here. I personally prefer manually converting nested loops into a single loop, as in my experience this proves even more efficient.
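As a hedged illustration of that manual conversion (reusing the N and doSomething() placeholders from the example above), the two nested loops can be fused into a single loop and the indices recovered by division and modulo:
#pragma omp parallel for
for (int k = 0; k < N * N; ++k)
{
    int i = k / N;    // recover the outer index
    int j = k % N;    // recover the inner index
    doSomething();    // a real body would use i and j here
}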

Related

Issue in parallelising inner loop of a nested for in OpenMP

I need to parallelize the inner loop of a nested loop with OpenMP. The way I did it is not working fine. Each thread should iterate over all of the M points, but only iterate (in the second loop) over its own chunk of coordinates. So I want the first loop to go from 0 to M, and the second one from my_first_coord to my_last_coord. In the code I posted, the program is faster when launched with 4 threads than with 8, so there's some issue. I know one way to do this is by "manually" dividing the coordinates, meaning that each thread gets its own num_of_coords / thread_count (handling the remainder as well); I did that with Pthreads. I would prefer to make use of pragmas in OpenMP. I'm sure I'm missing something. Let me show you the code:
#pragma omp parallel
...
for (int i = 0; i < M; i++) { // every thread iterates over the full 0..M range
    #pragma omp for nowait
    for (int coord = 0; coord < N; coord++) { // each thread works on its portion of coords
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
    }
}
I am putting the Pthreads version here too, so that what I want to achieve is clear; I would like to do the same, but using OpenMP pragmas:
/* M is global;
   first_nn and last_nn are local */
for (long i = 0; i < M; i++)
    for (long coord = first_nn; coord <= last_nn; coord++)
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
I hope that it is clear enough. Thank you
Edit:
I'm using gcc 12.2.0. Adding the -O3 flag has improved the times.
With larger inputs, the difference in speedup between 4 and 8 threads is more significant.
Your comment indicates that you are worried about speedup.
1. How many physical cores does your processor have? Try every thread count from 1 to that number.
2. Do not use hyperthreads.
3. You may find a good speedup for low thread counts, but then a levelling-off effect: that is because you have a "streaming" operation, which is limited by memory bandwidth. Unless you have a very expensive processor, there is not enough bandwidth to keep all cores running fast.
4. You could try setting OMP_PROC_BIND=true, which prevents the OS from migrating your threads. That can improve cache usage.
5. You have some sort of indirect addressing going on with the i variable, so further memory effects related to the TLB may make your parallel code not scale optimally.
But start with point 3 and report; a thread-count scan like the sketch below is one way to gather those numbers.
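To make points 1 and 3 concrete, here is a minimal, hypothetical scaling test; the streaming kernel, the array sizes and the physical_cores value are placeholders, not the asker's code. Compile with something like g++ -O3 -fopenmp and watch where the times stop improving:
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 10000000;                    // placeholder problem size
    std::vector<double> a(N, 1.0), b(N, 2.0);
    const int physical_cores = 4;              // assumption: set to your core count

    for (int t = 1; t <= physical_cores; ++t)
    {
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] += b[i];                      // streaming, bandwidth-bound work
        double elapsed = omp_get_wtime() - start;
        std::printf("%d thread(s): %.3f s\n", t, elapsed);
    }
    return 0;
}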

OpenMP parallel for -- Multiple parallel for's Vs. one parallel that includes within it multiple for's

I am going through Using OpenMP. The authors compare and contrast the following two constructs:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
}
as against
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
}
}
They state that Construct 2
has fewer implied barriers, and there might be potential for cache data reuse between loops. The downside of this approach is that one can no longer adjust the number of threads on a per loop basis, but that is often not a real limitation.
I am having a difficult time understanding how Construct 2 has fewer implied barriers. Is there not an implied barrier in Construct 2 after each for loop due to #pragma omp for? So, in each case, isn't the number of implied barriers the same, N? That is, is it not the case in Construct 2 that the first loop occurs first, and so on, and then the Nth for loop is executed last?
Also, how is Construct 2 more favorable for cache reuse between loops?
I am having a difficult time understanding how Construct 2 has fewer implied barriers. Is there not an implied barrier in Construct 2 after each for loop due to #pragma omp for? So, in each case, isn't the number of implied barriers the same, N? That is, is it not the case in Construct 2 that the first loop occurs first, and so on, and then the Nth for loop is executed last?
I did not read the book but based on what you have shown it is actually the other way around, namely:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier.
has N implicit barriers (at the end of each parallel region), whereas the second code:
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier
} // <-- implicit barrier
has N+1 barriers (at the end of each for + the parallel region).
Actually, in this case, since there is no computation between the last two implicit barriers, one can add nowait to the last #pragma omp for to eliminate one of the redundant barriers.
One way for the second code to have fewer implicit barriers than the first would be to add a nowait clause to the #pragma omp for directives.
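With the placeholder loops from the book, that would look roughly like this (a sketch; removing a barrier between two loops is only legal when the later loop does not need the earlier loop's results to be complete):
//Construct 2 with the redundant final barrier removed
#pragma omp parallel
{
    #pragma omp for
    for( ... )
    {
        /* Work sharing loop 1 */
    } // <-- implicit barrier (keep it if later loops read what loop 1 wrote)
    ...
    #pragma omp for nowait
    for( ... )
    {
        /* Work sharing loop N */
    } // no barrier here
} // <-- the implicit barrier of the parallel region still synchronises everyone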
From the link about the book that you have shown:
Finally, Using OpenMP considers trends likely to influence OpenMP development, offering a glimpse of the possibilities of a future OpenMP 3.0 from the vantage point of the current OpenMP 2.5. With multicore computer use increasing, the need for a comprehensive introduction and overview of the standard interface is clear.
So the book is using the old OpenMP 2.5 standard, and in that standard one can read about the loop construct:
There is an implicit barrier at the end of a loop construct unless a nowait clause is specified.
A nowait clause cannot be added to the parallel construct, but it can be added to the for construct. Therefore, the second code has the potential to have fewer implicit barriers if one can add the nowait clause to the #pragma omp for directives. However, as it is, the second code actually has more implicit barriers than the first code.
Also, how is Construct 2 more favorable for cache reuse between loops?
If you are using a static distribution of the loop iterations among threads (e.g., #pragma omp for schedule(static, ...)) in the second code, the same threads will be working on the same loop iterations in both loops. For instance, with two threads, call them Thread A and Thread B, and a static distribution with chunk=1, Thread A and Thread B will work on the odd and even iterations of each loop, respectively. Consequently, depending on the actual application code, this might mean that those threads work on the same memory positions of a given data structure (e.g., the same array positions).
In the first code, in theory (though this depends on the specific OpenMP implementation), since there are two different parallel regions, different threads can pick up the same loop iterations across the two loops. In other words, in our example with two threads, there is no guarantee that the thread that computed the even (or odd) iterations in one loop would compute those same iterations in the other loop.
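As a small sketch of that cache-reuse argument (the array a and the simple updates are placeholders, not the book's code): with schedule(static) and identical loop bounds, iteration i is assigned to the same thread in both loops, so the block of a that a thread touched in the first loop is likely still in its cache when it starts the second.
void two_passes(double *a, int n)
{
    #pragma omp parallel
    {
        // Pass 1: schedule(static) gives each thread a fixed, contiguous block.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * a[i];

        // Pass 2: same schedule and same bounds, so each thread gets the same
        // block again and can reuse the cache lines it loaded in pass 1.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + 1.0;
    }
}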

Loop unrolling with OMP

I have applied loop unrolling as mentioned in this post.
Code:
for (i = 0; i < ROUND_DOWN(contours.size(), 3); i += 3)
{
    cv::convexHull(contours[i], convexHulls[i]);
    cv::convexHull(contours[i+1], convexHulls[i+1]);
    cv::convexHull(contours[i+2], convexHulls[i+2]);
}
Now I want to use multiple threads (3) in the for loop, so that each thread executes only one statement of the loop body, somewhat like the sections construct in OpenMP. How can I do that?
I tried this:
for (i = 0; i < ROUND_DOWN(contours.size(), 3); i += 3)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        cv::convexHull(contours[i], convexHulls[i]);
        #pragma omp section
        cv::convexHull(contours[i+1], convexHulls[i+1]);
        #pragma omp section
        cv::convexHull(contours[i+2], convexHulls[i+2]);
    }
}
But it didn't work and I got an error. Can someone tell me how to do this right?
I also found another post; it uses SSE instructions, but I am unable to make sense of it.
Simply use a parallel for:
#pragma omp parallel for
for (i = 0; i < contours.size(); i++)
{
    cv::convexHull(contours[i], convexHulls[i]);
}
This expresses what you want to do and allows the compiler and runtime to run the loop in parallel. For instance, it will work with any thread count, while your suggestion will only work properly with exactly three threads.
Don't try to help the compiler unless you have evidence or strong knowledge that it is beneficial, and if you ever do, verify that it actually is. If the simple version does not perform well in your case, you should first give the compiler and runtime hints (e.g. scheduling strategies) rather than implementing your own scheme manually.
Note that this will only work properly if there are no data dependencies between loop iterations (the same applies to your sections code). Your code looks like there are none, but a definitive evaluation would require a proper, complete code example.
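If profiling shows the plain version does need such a hint, one plausible tweak (a sketch only, reusing the contours and convexHulls from the question) is a dynamic schedule, since hulls of large contours take longer than hulls of small ones:
// Uneven hull sizes: let threads grab iterations on demand instead of
// using the default static split.
#pragma omp parallel for schedule(dynamic)
for (std::size_t i = 0; i < contours.size(); i++)
{
    cv::convexHull(contours[i], convexHulls[i]);
}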
I am not sure why you want to parallelize using sections. It is not clear what kind of dependencies you have in the cv::convexHull function, but if there are no side effects (as I suspect), you should be able to parallelize simply using work-sharing:
#pragma omp parallel for
for (i = 0; i < ROUND_DOWN(contours.size(), 3); i += 3)
{
    cv::convexHull(contours[i], convexHulls[i]);
    cv::convexHull(contours[i+1], convexHulls[i+1]);
    cv::convexHull(contours[i+2], convexHulls[i+2]);
}

Correct usage of the OpenMP target construct

I'm trying to figure out whether I am using the OpenMP 4 target construct correctly.
It would be nice if someone could give me some tips.
class XY {
#pragma omp declare target
    static void function_XY() {
        #pragma omp for
        loop{}
    }
#pragma omp end declare target
};

main() {
    // var declaration
    // some sequential stuff
    #pragma omp target map(some variables)
    {
        #pragma omp parallel
        {
            #pragma omp for
            loop1{}
            function_XY();
            #pragma omp for
            loop2{}
        }
    }
    // some more sequential stuff
}
My overall code is working, and it gets faster with more threads, but I'm wondering whether the code is actually executed on the target device (Xeon Phi).
Also, if I remove all the OpenMP directives and execute my program sequentially, it runs faster than the threaded execution (with any number of threads). Maybe due to the initialisation overhead of OpenMP?
What I want is the parallel execution of loop1, function_XY, and loop2 on the target device.
" I'm wondering if the code is correctly executed on the target device(xeon phi)"
Well, if you are correctly compiling the code with -mmic flag, then it will generate a binary that only runs on the mic.
To run the code (in native mode) on the mic, copy the executable to the mic (via scp), copy the needed libraries, SSH to the mic, and execute it.
Don't forget to export LD_LIBRARY_PATH to indicate the path of the libraries on the mic.
Now, assuming that you do run the code on the co-processor, increased performance when disabling threading, indicates that there is a bottleneck somewhere in the code. But this needs more info to analyze.

No speed-up with useless printf's using OpenMP

I just wrote my first OpenMP program that parallelizes a simple for loop. I ran the code on my dual core machine and saw some speed up when going from 1 thread to 2 threads. However, I ran the same code on a school linux server and saw no speed-up. After trying different things, I finally realized that removing some useless printf statements caused the code to have significant speed-up. Below is the main part of the code that I parallelized:
#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
{
printf("useless statement");
prime[i-2] = is_prime(i);
}
I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?
Speculating, but maybe the stdout is guarded by a lock?
In general, printf is an expensive operation because it interacts with other resources (such as files, the console and such).
My empirical experience is that printf is very slow on a Windows console, comparably much faster on Linux console but fastest still if redirected to a file or /dev/null.
I've found that printf-debugging can seriously impact the performance of my apps, and I use it sparingly.
Try running your application with output redirected to a file or to /dev/null to see if this has any appreciable impact; this will help narrow down where the problem lies.
Of course, if the printfs are useless, why are they in the loop at all?
To expand a bit on @Will's answer ...
I don't know whether stdout is guarded by a lock, but I'm pretty sure that writing to it is serialised at some point in the software stack. With the printf statements included OP is probably timing the execution of a lot of serial writes to stdout, not the parallelised execution of the loop.
I suggest OP modifies the printf statement to include i, see what happens.
As for the apparent speed-up on the dual-core machine -- was it statistically significant ?
You have here a parallel for loop, but the scheduling is unspecified.
#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
There are several scheduling kinds defined in the OpenMP 3.0 standard. They can be changed by setting the OMP_SCHEDULE environment variable to type[,chunk], where
type is one of static, dynamic, guided, or auto
chunk is an optional positive integer that specifies the chunk size
Another way of changing the schedule kind is to call the OpenMP function omp_set_schedule.
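Note that both OMP_SCHEDULE and omp_set_schedule only affect loops whose schedule kind is runtime, so the pragma needs a schedule(runtime) clause to pick the setting up. A small sketch, reusing is_prime(), prime[] and n from the question (the dynamic,16 choice is just an example):
#include <omp.h>

int is_prime(int k);                         // the question's function

void find_primes(int *prime, int n)
{
    // Same effect as OMP_SCHEDULE=dynamic,16, but chosen from the code.
    omp_set_schedule(omp_sched_dynamic, 16);

    // schedule(runtime) is what makes the loop obey OMP_SCHEDULE /
    // omp_set_schedule(); a literal schedule kind here would override them.
    #pragma omp parallel for schedule(runtime)
    for (int i = 2; i <= n; i++)
        prime[i - 2] = is_prime(i);
}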
The is_prime function can be rather fast, I suspect:
prime[i-2] = is_prime(i);
So the problem could also come from an unsuitable scheduling mode, where a thread ends up executing only a small number of iterations before reaching the barrier.
And printf has two parts inside it (taking glibc as the popular Linux libc implementation):
Parse the format string and put all parameters into a buffer
Write the buffer to the file descriptor (actually to the FILE buffer, since stdout is buffered by glibc by default)
The first part of printf can be done in parallel, but the second part is a critical section and is locked with _IO_flockfile.
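Because of that lock, a common workaround (a sketch only, reusing is_prime(), prime[] and n from the question) is to build the output in a per-thread buffer inside the loop and do one locked write per thread instead of one per iteration:
#include <cstdio>
#include <string>

int is_prime(int k);                          // the question's function

void run(int *prime, int n)
{
    #pragma omp parallel
    {
        std::string local;                    // one buffer per thread: no contention
        #pragma omp for
        for (int i = 2; i <= n; i++)
        {
            prime[i - 2] = is_prime(i);
            local += "useless statement";     // formatting stays outside the stdio lock
        }
        #pragma omp critical                  // one serialised write per thread
        std::fputs(local.c_str(), stdout);
    }
}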
What were your timings? Was it much slower with the printfs? In some tight loops the printfs might take a large fraction of the total computing time; for example, if is_prime() is pretty fast, then the performance is determined more by the number of calls to printf than by the number of (parallelized) calls to is_prime().
