openmp parallel for working mechanism

openmp parallel for working mechanism - multithreading

I am new to openmp and write some code to find out how this parallel for works
#pragma omp parallel for
for (int i=1;i<=x;++i){
for (int j=1;j<=y;++j){
if (diameter < array[i][j]){
diameter = array[i][j];
}
}
}
here are my quesiotns
1: in this struct there are total xy iterations, how many threads will the omp create to do the work? Is it xy threads OR the thread number is limited by machine resource?
2: if question 1 answer is limited by machine resource, say I can run this program on a 32-core machine, does it mean the maximum thread that can run in parallel is 32?
3: for the private and shared variable
-- 3.1 the x and y are only for read, it is necessary to set them as private?
--3.2 the diameter is concurrently read and written. if it is shared I guess it may cause some delay, but if it is private how can I get a single truth of the value

Answers :
There are total x iterations not x*y iterations, you should add collapse(2) to make it x*y iterations : #pragma omp parallel for collapse(2)
the omp will use $OMP_NUM_THREADS which is generally equal to the number of machine threads , you can change it by : export OMP_NUM_THREADS= n_threads in Linux or set OMP_NUM_THREADS=n_threads in Windows. but it is not recommended to set it > number of machine threads.
see answer 1.
you have not to set x and y as private if they are declared outside the loops.
you can not set diameter as private nor as shared. I think you should change your code from :
if(diameter<array[i][j]) {diameter=array[i][j]} to diameter = max(diameter,array[i][j]) and then use reduction reduction(max:diameter) to calculate diameter
the code must look like this:
#pragma omp parallel for collapse(2) reduction(max:diameter)
for (int i=1;i<=x;++i){
for (int j=1;j<=y;++j){
diameter = max(diameter,array[i][j]);
}
}

Related

How to make a thread wait another to finish using OpenMP threads?

I want to make the following loop that fills the A matrix parallel. For every A[i][j] element that is calculated I want the price in A[i-1][j], A[i-1][j -1] and A[i0][j-1] to have been calculated first. So my thread has to wait for the threads in these positions to have calculated their results. I've tried to achieve this like this:
#pragma omp parallel for num_threads(threadcnt) \
default(none) shared(A, lenx, leny) private(j)
for (i=1; i<lenx; i++)
{
for (j=1; j<leny; j++)
{
do
{
} while (A[i-1][j-1] == -1 || A[i-1][j] == -1 || A[i][j-1] == -1);
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
My A matrix is initialized in -1 so if A[][] equals to -1 the operation in this cell is not completed. It takes more time than the serial program though.. Any idea to avoid the while loop?

The waiting loop seems sub-optimal. Apart from burning cores that are spin-waiting, you will also need a plethora of well-placed flush directives to make this code work.
One alternative, especially in the context of a more general parallelization scheme would be to use tasks and task dependences to model the dependences between the different array elements:
#pragma omp parallel
#pragma omp single
for (i=1; i<lenx; i++) {
for (j=1; j<leny; j++) {
#pragma omp task depend(in:A[i-1][j-1],A[i-1][j],A[i][j-1]) depend(out:A[i][j])
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
You may want to think about block the matrix updates, so that each task receives a block of the matrix instead of a single element, but the general idea will remain the same.
Another useful OpenMP feature could be the ordered construct and it's ability to adhere to exactly this kind of data dependency:
#pragma omp parallel for
for (int i=1; i<lenx; i++) {
for (int j=1; j<leny; j++) {
#pragma omp ordered depend(source)
#pragma omp ordered depend(sink:i-1,j-1)
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
PS: The code above is untested, but it should get the rough idea across.

Your solution cannot work. As A is initialized to -1, and A[0][j] is never modified, if i==1, it will test A[1-1][j] and will always fail. BTW, if A is initiliazed to -1, how cannot you have anything but -1 with the max?
When you have dependencies problem, there are two solutions.
First one is to sequentialize your code. Openmp has the ordered directive to do that, but the drawback is that you loose parallelism (while still paying thread creation cost). Openmp 4.5 has a way to describe dependencies with the depend and sink/source statements, but I do not know how efficient can the compiler be to deal with that. And my compilers (gcc 7.3 or clang 6.0) do not support this feature.
Second solution is to modify your algorithm to suppress dependencies. Now, you are computing the maximum of all values that are at the left or above a given element. Lets turn it to a simpler problem. Compute the maximum of all values at the left of a given element. We can easily parallelize it by computing on the different rows, as there no interrow dependency.
// compute b[i][j]=max_k<j(a[i][k]
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=0;j<n;j++){
// max per row
if((j!=0)) b[i][j]=max(a[i][j],b[i][j-1]);
else b[i][j]=a[i][j]; // left column initialised to value of a
}
}
Consider another simple problem, to compute the prefix maximum on the different columns. It is again easy to parallelize, but this time on the inner loop, as there is not inter-column dependency.
// compute c[i][j]=max_i<k(b[k,j])
for(int i=0;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
if((i!=0)) c[i][j]=max(b[i][j],c[i-1][j]);
else c[i][j]=b[i][j]; // upper row initialised to value of b
}
}
Now you just have to chain these computations to get the expected result. Here is the final code (with a unique array used and some cleanup in the code).
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=1;j<n;j++){
// max per row
a[i][j]=max(a[i][j],a[i][j-1]);
}
}
for(int i=1;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
a[i][j]=max(a[i][j],a[i-1][j]);
}
}

Proper / Efficient parallelization of a for loop with OpenMP

I have developed a distributed memory MPI application which involves processing of a grid. Now i want to apply shared memory techniques (essentially making it a hybrid - parallel program), with OpenMP, to see if it can become any faster, or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when i parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see i have tried many approaches and one is worse than the other. Best case, i get semi proper output every 2 seconds (because it refreshes very slowly). Worst case i get garbage output.
I would appreciate any help. Is there a place where i can find more information to proper parallelize loops with OpenMP? Thanks in advance.

The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else, you're going to obtain a garbage result as your experienced (since the elements of your grid are going to be printed in an order that means nothing). Actually, your attempts at parallelizing were pretty good, the problem comes from the fact that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize some work that is intrinsically sequential, you lose more time synchronizing your workers (your openmp threads), than you gain by parallizing the work itself (which is why you obtain crap time when your result gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example :
Here is a very basic example of parallizing a program with openmp that achieves speedup. Here a dummy (yet heavy) computation is realized for each value of the i variable. The computations in each loop are completely independent, and the different threads can achieve their computations independently. The calls to printf can be achieved in whatever order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
int i,j;
double x=0;
#pragma omp parallel for
for(i=0; i < 100; i++)
{
/* Dummy heavy computation */
x = 100000 * fabs(cos(i*i));
#pragma omp parallel for reduction(+: x)
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
}
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/

Dividing a loop task among threads using OpenMP

Suppose I want to run the following loop between threats (say 4 threads) such that each thread is in charge of calculating (N/4) where N is the number of rows of the matrix.
#pragma omp parallel num_threads(4) private(i,j,M) shared(Matrix)
{
#pragma omp for schedule(static)
for(i=0; i<N; i++)
{
for(j=0; j<N; j++)
{
M[i][j]= Matrix[i][j] + (Matrix[i][j] * Matrix[j][i]);
}
}
}
My question is: Should I explicitly specify the beginning and the end of each chunk of the matrix that each thread will calculate or OpenMP will automatically distribute the job between the threads? The reason behind this question is that I've read somewhere that OpenMP will automatically distribute the job between the threads but when I implemented it, it gave me fault segmentation error.
Thank you.

The differences in the accuracy of the calculations in single / multi-threaded (OpenMP) modes

Can anybody explain/understand the different of the calculation result in single / multi-threaded mode?
Here is an example of approx. calculation of pi:
#include <iomanip>
#include <cmath>
#include <ppl.h>
const int itera(1000000000);
int main()
{
printf("PI calculation \nconst int itera = 1000000000\n\n");
clock_t start, stop;
//Single thread
start = clock();
double summ_single(0);
for (int n = 1; n < itera; n++)
{
summ_single += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
};
stop = clock();
printf("Time single thread %f\n", (double)(stop - start) / 1000.0);
//Multithread with OMP
//Activate OMP in Project settings, C++, Language
start = clock();
double summ_omp(0);
#pragma omp parallel for reduction(+:summ_omp)
for (int n = 1; n < itera; n++)
{
summ_omp += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
};
stop = clock();
printf("Time OMP parallel %f\n", (double)(stop - start) / 1000.0);
//Multithread with Concurrency::parallel_for
start = clock();
Concurrency::combinable<double> piParts;
Concurrency::parallel_for(1, itera, [&piParts](int n)
{
piParts.local() += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
});
double summ_Conparall(0);
piParts.combine_each([&summ_Conparall](double locali)
{
summ_Conparall += locali;
});
stop = clock();
printf("Time Concurrency::parallel_for %f\n", (double)(stop - start) / 1000.0);
printf("\n");
printf("pi single = %15.12f\n", std::sqrt(summ_single));
printf("pi omp = %15.12f\n", std::sqrt(summ_omp));
printf("pi comb = %15.12f\n", std::sqrt(summ_Conparall));
printf("\n");
system("PAUSE");
}
And the results:
PI calculation VS2010 Win32
Time single thread 5.330000
Time OMP parallel 1.029000
Time Concurrency:arallel_for 11.103000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592651497
PI calculation VS2013 Win32
Time single thread 5.200000
Time OMP parallel 1.291000
Time Concurrency:arallel_for 7.413000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592647841
PI calculation VS2010 x64
Time single thread 5.190000
Time OMP parallel 1.036000
Time Concurrency::parallel_for 7.120000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592649319
PI calculation VS2013 x64
Time single thread 5.230000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 5.326000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592648489
The tests were made on AMD and Intel CPUs, Win 7 x64.
What is the reason for difference between PI calculation in single and multicore?
Why the result of calculation with Concurrency::parallel_for is not constant on different builds (compiler, 32/64 bit platform)?
P.S.
Visual studio express doesn’t support OpenMP.

Floating-point addition is a non-associative operation due to round-off errors, therefore the order of operations matters. Having your parallel program give different result(s) than the serial version is something normal. Understanding and dealing with it is part of the art of writing (portable) parallel codes. This is exacerbated in the 32- against 64-bit builds since in 32-bit mode the VS compiler uses x87 instructions and the x87 FPU does all operations with an internal precision of 80 bits. In 64-bit mode SSE math is used.
In the serial case, one thread computes s1+s2+...+sN, where N is the number of terms in the expansion.
In the OpenMP case there are n partial sums, where n is the number of OpenMP threads. Which terms get into each partial sum depends on the way iterations are distributed among the threads. The default for many OpenMP implementations is static scheduling, which means that thread 0 (the main thread) computes ps0 = s1 + s2 + ... + sN/n; thread 1 computes ps1 = sN/n+1 + sN/n+2 + ... + s2N/n; and so on. In the end the reduction combines somehow those partial sums.
The parallel_for case is very similar to the OpenMP one. The difference is that by default the iterations are distributed in a dynamic fashion - see the documentation for auto_partitioner, therefore each partial sum contains a more or less random selection of terms. This not only gives a slightly different result, but it also gives a slightly different result with each execution, i.e. the result from two consecutive parallel_for's with the same number of threads might differ a bit. If you replace the partitioner with an instance of simple_partitioner and set the chunk size equal to itera / number-of-threads, you should get the same result as in the OpenMP case if the reduction is performed the same way.
You could use Kahan summation and implement your own reduction also using Kahan summation. Then the parallel codes should produce the same (over much more similar) result as the serial one.

I would guess that the parallel reduction that openmp does is in general more accurate as the
floating point addition round-off error gets more distributed. In general floating point
reductions are problematic because of rounding errors etc. http://floating-point-gui.de/
performing those operations in parallel is a way to improve on the accuracy by distributing the rounding error. Imagine you are doing a big reduction, at some point the accumulator is going to grow in size compared to the other values and this will increase the rounding error for each addition as the accumulators range is much larger and it may not be possible to represent the value of the smaller value in that range accurately, however if there are multiple accumulators for the same reduction operating in parallel their magnitudes would remain smaller and this kind of error would be smaller.

So...
In win32 mode the FPU with 80bit registers will be used.
In x64 mode the SSE2 with double precision float (64 bit) will be used. The use of sse2 seems like be by default in x64 mode.
Theoretically... is it possible that the calculation in win32 mode will be more precise? :)
http://en.wikipedia.org/wiki/SSE2
So the better way is to buy new CPUs with AVX or compile to 32bit code?...

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}

At first you need a loop, where the distribution makes a difference. The loop has 100 iterations, so the OpenMP schedule will only 100 times decide what is the next iteration for a thread what takes no mensurable time. The output with printf takes very long so in your code it makes no difference which schedule is used. Its better to make a loop without console output and a very high loop count like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
At last you have to write code in the loop which "result" is used after the loop with a printf for example. If not the body could be deleted by the compiler because of optimize the code (it is not used later so its not needed). You can concentrate the time measurings on the parallel pool without the output of the results.
If your iterations nearly takes the same time, then a static distribution should be faster. If they differ very much the dynamic and guided schedules should dominate your measurings.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string