OpenMP and MPI hybrid dynamic scheduling

OpenMP and MPI hybrid dynamic scheduling - multithreading

As the number of threads increase, the count which is "temp" decreases..
When I sent the number of threads as "1" it gives an correct answer but as the number of threads increases, running time shorter but gives wrong answer
#include <stdio.h>
#include <mpi.h>
#include <complex.h>
#include <time.h>
#include <omp.h>
#define MAXITERS 1000
// globals
int count = 0;
int nptsside;
float side2;
float side4;
int temp = 0;
int inset(double complex c) {
int iters;
float rl,im;
double complex z = c;
for (iters = 0; iters < MAXITERS; iters++) {
z = z*z + c;
rl = creal(z);
im = cimag(z);
if (rl*rl + im*im > 4) return 0;
}
return 1;
}
int main(int argc, char **argv)
{
nptsside = atoi(argv[1]);
side2 = nptsside / 2.0;
side4 = nptsside / 4.0;
//struct timespec bgn,nd;
//clock_gettime(CLOCK_REALTIME, &bgn);
int x,y; float xv,yv;
double complex z;
int i;
int mystart, myend;
int nrows;
int nprocs, mype;
int data;
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &mype);
nrows = nptsside/nprocs;
printf("%d\n", nprocs);
mystart = mype*nrows;
myend = mystart + nrows - 1;
#pragma omp parallel shared(mystart, myend, temp)
{
int nth = omp_get_num_threads();
printf("%d\n", nth);
#ifdef STATIC
#pragma omp for reduction(+:temp) schedule(static)
#elif defined DYNAMIC
#pragma omp for reduction(+:temp) schedule(dynamic)
#elif defined GUIDED
#pragma omp for reduction(+:temp) schedule(guided)
#endif
for (x=mystart; x<=myend; x++) {
for ( y=0; y<nptsside; y++) {
xv = (x - side2) / side4;
yv = (y - side2) / side4;
z = xv + yv*I;
if (inset(z)) {
temp++;
}
}
}
}
if(mype==0) {
count += temp;
printf("%d\n", temp);
for (i = 1; i < nprocs; i++) {
MPI_Recv(&temp, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
count += temp;
printf("%d\n", temp);
}
}
else{
MPI_Send(&temp, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
if(mype==0) {
printf("%d\n", count);
}
//clock_gettime(CLOCK_REALTIME, &nd);
//printf("%f\n",timediff(bgn,nd));
}

You are not defining any private variables for when you enter the OpenMP loop.
First off, you must always declare your loop counter for your OpenMP loop (as well as any loop counters for nested loops inside your OpenMP loop) private.
Secondly, you have three variables (xv, yv, and z) that each depend on your iterations in these loops. Thus, each thread needs to have its own private copy of these variables as well. Changing your parallel statement to
#pragma omp parallel shared(mystart, myend, temp) private(x, y, xv, yv, z)
should fix your OpenMP problems.
Seeing as you say that setting your number of threads to 1 yields the correct answer, I have not looked at your MPI code.
EDIT: OK I lied, I briefly looked into your MPI code now. Instead of all of your sends and receives, you should be writing a single reduce. This collective will be much faster than the blocking communication you set up currently.
MPI_Reduce(&temp, &count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

Related

MPI4PY: ring communication with neighbor_alltoallw

Please Help!
I am using MPI (= Message Passing Interface) in python for a ring communication, which means that every rank are sending and receiving from each other. I know one way to realize this is by using for instance MPI.COMM_WORLD.issend()and MPI.COMM_WORLD.recv(), this is working and done.
Now I want to realize the same Output on a different way by using MPI.Topocomm.Neighbor_alltoallw but this is not working. I wrote a C Code and is working there, so the same output can be reached with this function, but when I implement this in python it is not working. Please find below the C Code and the Python Code
The definition of the Function says (mpi4py Package for Python):
Neighbor_alltoallw(...)
Topocomm.Neighbor_alltoallw(self, sendbuf, recvbuf)
Neighbor All-to-All Generalized
I do not understand following things:
why is recbuf not a return value? it seems to be an argument here
how can this be implmented for a ring communication in Python?
Thank you for your time and support!
my working C Code:
#include <stdio.h>
#include <mpi.h>
#define to_right 201
#define max_dims 1
int main (int argc, char *argv[])
{
int my_rank, size;
int snd_buf, rcv_buf;
int right, left;
int sum, i;
MPI_Comm new_comm;
int dims[max_dims],
periods[max_dims],
reorder;
MPI_Aint snd_displs[2], rcv_displs[2];
int snd_counts[2], rcv_counts[2];
MPI_Datatype snd_types[2], rcv_types[2];
MPI_Status status;
MPI_Request request;
MPI_Init(&argc, &argv);
/* Get process info. */
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* Set cartesian topology. */
dims[0] = size;
periods[0] = 1;
reorder = 1;
MPI_Cart_create(MPI_COMM_WORLD, max_dims, dims, periods,
reorder,&new_comm);
/* Get coords */
MPI_Comm_rank(new_comm, &my_rank);
/* MPI_Cart_coords(new_comm, my_rank, max_dims, my_coords); */
/* Get nearest neighbour rank. */
MPI_Cart_shift(new_comm, 0, 1, &left, &right);
/* Compute global sum. */
sum = 0;
snd_buf = my_rank;
rcv_buf = -1000; /* unused value, should be overwritten by first MPI_Recv; only for test purpose */
rcv_counts[0] = 1; MPI_Get_address(&rcv_buf, &rcv_displs[0]); snd_types[0] = MPI_INT;
rcv_counts[1] = 0; rcv_displs[1] = 0 /*unused*/; snd_types[1] = MPI_INT;
snd_counts[0] = 0; snd_displs[0] = 0 /*unused*/; rcv_types[0] = MPI_INT;
snd_counts[1] = 1; MPI_Get_address(&snd_buf, &snd_displs[1]); rcv_types[1] = MPI_INT;
for( i = 0; i < size; i++)
{
/* Substituted by MPI_Neighbor_alltoallw() :
MPI_Issend(&snd_buf, 1, MPI_INT, right, to_right,
new_comm, &request);
MPI_Recv(&rcv_buf, 1, MPI_INT, left, to_right,
new_comm, &status);
MPI_Wait(&request, &status);
*/
MPI_Neighbor_alltoallw(MPI_BOTTOM, snd_counts, snd_displs, snd_types,
MPI_BOTTOM, rcv_counts, rcv_displs, rcv_types, new_comm);
snd_buf = rcv_buf;
sum += rcv_buf;
}
printf ("PE%i:\tSum = %i\n", my_rank, sum);
MPI_Finalize();
}
My not working Python Code:
from mpi4py import MPI
size = MPI.COMM_WORLD.Get_size()
my_rank = MPI.COMM_WORLD.Get_rank()
to_right =201
max_dims=1
dims = [max_dims]
periods=[max_dims]
dims[0]=size
periods[0]=1
reorder = True
new_comm=MPI.Intracomm.Create_cart(MPI.COMM_WORLD,dims,periods,True)
my_rank= new_comm.Get_rank()
left_right= MPI.Cartcomm.Shift(new_comm,0,1)
left=left_right[0]
right=left_right[1]
sum=0
snd_buf=my_rank
rcv_buf=-1000 #unused value, should be overwritten, only for test purpose
for counter in range(0,size):
MPI.Topocomm.Neighbor_alltoallw(new_comm,snd_buf,rcv_buf)
snd_buf=rcv_buf
sum=sum+rcv_buf
print('PE ', my_rank,'sum=',sum)

OpenACC How can I keep a data between differetn calls of a function?

I'm trying to optimize an application with OpenACC. In the main, I have an iteration loop of this type:
while(t<tstop){
add(&data, nx);
}
Where data is a variable of type Data, defined by this Structure
typedef struct Data_{
double *x;
}Data;
The function I'm calling in the while loop is parallelizable, but what I don't manage to do is to maintain the array x[] in the device memory between the different calls of the function.
void add(Data *data, int n){
#pragma acc data pcopy(data[0:1])
#pragma acc data pcopy(data->x[0:n])
#pragma acc parallel loop
for(int i=0; i < n ; i++){
data->x[i] += 1.;
}
#pragma acc exit data copyout(data->x[0:n])
#pragma acc exit data copyout(data[0:1])
}
I know the program seems to be no sense but I just wrote something to reproduce the problem I have in the real code.
I tryied to use unstructured data region:
#pragma acc enter data copyin(data[0:1])
#pragma acc enter data copyin(data->x[0:n])
#pragma acc data present(data[:1], data->x[:n])
#pragma acc parallel loop
for(int i=0; i < n ; i++){
data->x[i] += 1.;
}
#pragma acc exit data copyout(data->x[0:n])
#pragma acc exit data copyout(data[0:1])
but for some reason I get an error of this type:
FATAL ERROR: variable in data clause is partially present on the device: name=data

I'm not able to reproduce the partially present error from the code snip-it provided so it's unclear why this error is occurring. In general, the error occurs when the size of the variable in the present table differs from the size being used in the data clause. If you can provide a reproducing example, I can take a look and determine why it's happening here.
To answer the topic question, device variables can be accessed anywhere within the scope of the data region they are in, even across subroutines. For unstructured data regions (i.e. enter data/exit data), the scope is defined at runtime between the enter and exit calls. For structured data regions, the scope is defined by the structured block.
Here's an example using the structure you define above (though I've included the size of x as part of the struct).
% cat test.c
#include <stdio.h>
#include <stdlib.h>
typedef struct Data_{
double *x;
int n;
}Data;
void add(Data *data){
#pragma acc parallel loop present(data)
for(int i=0; i < data->n ; i++){
data->x[i] += 1.;
}
}
int main () {
Data *data;
data = (Data*) malloc(sizeof(Data));
data->n = 64;
data->x = (double *) malloc(sizeof(double)*data->n);
for(int i=0; i < data->n ; i++){
data->x[i] = (double) i;
}
#pragma acc enter data copyin(data[0:1])
#pragma acc enter data copyin(data->x[0:data->n])
add(data);
#pragma acc exit data copyout(data->x[0:data->n])
#pragma acc exit data delete(data)
for(int i=0; i < data->n ; i++){
printf("%d:%f\n",i,data->x[i]);
}
free(data->x);
free(data);
}
% pgcc test.c -ta=tesla -Minfo=accel; a.out
add:
12, Generating present(data[:])
Generating Tesla code
13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
main:
28, Generating enter data copyin(data[:1])
29, Generating enter data copyin(data->x[:data->n])
31, Generating exit data copyout(data->x[:data->n])
32, Generating exit data delete(data[:1])
0:1.000000
1:2.000000
2:3.000000
3:4.000000
4:5.000000
5:6.000000
6:7.000000
7:8.000000
8:9.000000
9:10.000000
10:11.000000
11:12.000000
12:13.000000
13:14.000000
14:15.000000
15:16.000000
16:17.000000
17:18.000000
18:19.000000
19:20.000000
20:21.000000
21:22.000000
22:23.000000
23:24.000000
24:25.000000
25:26.000000
26:27.000000
27:28.000000
28:29.000000
29:30.000000
30:31.000000
31:32.000000
32:33.000000
33:34.000000
34:35.000000
35:36.000000
36:37.000000
37:38.000000
38:39.000000
39:40.000000
40:41.000000
41:42.000000
42:43.000000
43:44.000000
44:45.000000
45:46.000000
46:47.000000
47:48.000000
48:49.000000
49:50.000000
50:51.000000
51:52.000000
52:53.000000
53:54.000000
54:55.000000
55:56.000000
56:57.000000
57:58.000000
58:59.000000
59:60.000000
60:61.000000
61:62.000000
62:63.000000
63:64.000000
Also, here's a second example, but now with "data" being an array where the size of each "x" can be different.
% cat test2.c
#include <stdio.h>
#include <stdlib.h>
#define M 16
typedef struct Data_{
double *x;
int n;
}Data;
void add(Data *data){
#pragma acc parallel loop present(data)
for(int i=0; i < data->n ; i++){
data->x[i] += 1.;
}
}
int main () {
Data *data;
data = (Data*) malloc(sizeof(Data)*M);
#pragma acc enter data create(data[0:M])
for (int i =0; i < M; ++i) {
data[i].n = i+1;
data[i].x = (double *) malloc(sizeof(double)*data[i].n);
for(int j=0; j < data[i].n ; j++){
data[i].x[j] = (double)((i*data[i].n) + j);
}
#pragma acc update device(data[i].n)
#pragma acc enter data copyin(data[i].x[0:data[i].n])
}
for (int i =0; i < M; ++i) {
add(&data[i]);
}
for (int i =0; i < M; ++i) {
#pragma acc update self(data[i].x[:data[i].n])
for(int j=0; j < data[i].n ; j++){
printf("%d:%d:%f\n",i,j,data[i].x[j]);
}}
for (int i =0; i < M; ++i) {
#pragma acc exit data delete(data[i].x)
free(data[i].x);
}
#pragma acc exit data delete(data)
free(data);
}
% pgcc test2.c -ta=tesla -Minfo=accel; a.out
add:
11, Generating present(data[:1])
Generating Tesla code
14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
main:
22, Generating enter data create(data[:16])
32, Generating update device(data->n)
Generating enter data copyin(data->x[:data->n])
38, Generating update self(data->x[:data->n])
46, Generating exit data delete(data->x[:1])
49, Generating exit data delete(data[:1])
0:0:1.000000
1:0:3.000000
1:1:4.000000
2:0:7.000000
2:1:8.000000
2:2:9.000000
3:0:13.000000
3:1:14.000000
3:2:15.000000
3:3:16.000000
4:0:21.000000
4:1:22.000000
4:2:23.000000
4:3:24.000000
4:4:25.000000
5:0:31.000000
5:1:32.000000
5:2:33.000000
5:3:34.000000
5:4:35.000000
5:5:36.000000
6:0:43.000000
6:1:44.000000
6:2:45.000000
6:3:46.000000
6:4:47.000000
6:5:48.000000
6:6:49.000000
7:0:57.000000
7:1:58.000000
7:2:59.000000
7:3:60.000000
7:4:61.000000
7:5:62.000000
7:6:63.000000
7:7:64.000000
8:0:73.000000
8:1:74.000000
8:2:75.000000
8:3:76.000000
8:4:77.000000
8:5:78.000000
8:6:79.000000
8:7:80.000000
8:8:81.000000
9:0:91.000000
9:1:92.000000
9:2:93.000000
9:3:94.000000
9:4:95.000000
9:5:96.000000
9:6:97.000000
9:7:98.000000
9:8:99.000000
9:9:100.000000
10:0:111.000000
10:1:112.000000
10:2:113.000000
10:3:114.000000
10:4:115.000000
10:5:116.000000
10:6:117.000000
10:7:118.000000
10:8:119.000000
10:9:120.000000
10:10:121.000000
11:0:133.000000
11:1:134.000000
11:2:135.000000
11:3:136.000000
11:4:137.000000
11:5:138.000000
11:6:139.000000
11:7:140.000000
11:8:141.000000
11:9:142.000000
11:10:143.000000
11:11:144.000000
12:0:157.000000
12:1:158.000000
12:2:159.000000
12:3:160.000000
12:4:161.000000
12:5:162.000000
12:6:163.000000
12:7:164.000000
12:8:165.000000
12:9:166.000000
12:10:167.000000
12:11:168.000000
12:12:169.000000
13:0:183.000000
13:1:184.000000
13:2:185.000000
13:3:186.000000
13:4:187.000000
13:5:188.000000
13:6:189.000000
13:7:190.000000
13:8:191.000000
13:9:192.000000
13:10:193.000000
13:11:194.000000
13:12:195.000000
13:13:196.000000
14:0:211.000000
14:1:212.000000
14:2:213.000000
14:3:214.000000
14:4:215.000000
14:5:216.000000
14:6:217.000000
14:7:218.000000
14:8:219.000000
14:9:220.000000
14:10:221.000000
14:11:222.000000
14:12:223.000000
14:13:224.000000
14:14:225.000000
15:0:241.000000
15:1:242.000000
15:2:243.000000
15:3:244.000000
15:4:245.000000
15:5:246.000000
15:6:247.000000
15:7:248.000000
15:8:249.000000
15:9:250.000000
15:10:251.000000
15:11:252.000000
15:12:253.000000
15:13:254.000000
15:14:255.000000
15:15:256.000000
Note, be careful about copying structs with dynamic data members. Copying the struct itself, i.e. like you have above "#pragma acc exit data copyout(data[0:1])", will overwrite the host address of "x" with the device address. Instead, copy only "data->x" and delete "data".

Rstudio crashes with Rcpp and OpenMP function

This is a follow up question to dqrng with Rcpp for drawing from a normal and a binomial distribution. I tried to implement the answer but instead of drawing from a single distribution I'm drawing from 3. This is the code that I wrote:
// [[Rcpp::depends(dqrng, BH, RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <boost/random/binomial_distribution.hpp>
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
arma::mat parallel_random_matrix(int n, int m, int ncores, double p=0.5) {
dqrng::xoshiro256plus rng(42);
arma::mat out(n*m,3);
// ok to use rng here
#pragma omp parallel num_threads(ncores)
{
dqrng::xoshiro256plus lrng(rng); // make thread local copy of rng
lrng.jump(omp_get_thread_num() + 1); // advance rng by 1 ... ncores jumps
int iter = 0;
#pragma omp for
for (int i = 0; i < m; ++i) {
for (int j = 0; j < n; ++j) {
iter = i * n + j;
// p can be a function of i and j
boost::random::binomial_distribution<int> dist_binomial(1,p);
auto gen_bernoulli = std::bind(dist_binomial, std::ref(lrng));
boost::random::normal_distribution<int> dist_normal1(2.0,1.0);
auto gen_normal1 = std::bind(dist_normal1, std::ref(lrng));
boost::random::normal_distribution<int> dist_normal2(4.0,3.0);
auto gen_normal2 = std::bind(dist_normal2, std::ref(lrng));
out(iter,0) = gen_bernoulli();
out(iter,1) = gen_normal1();
out(iter,2) = gen_normal2();
}
}
}
// ok to use rng here
return out;
}
/*** R
parallel_random_matrix(5, 5, 4, 0.75)
*/
When I try to run it Rstudio crashes. However, when I change the code like follows it does work:
// [[Rcpp::depends(dqrng, BH, RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <boost/random/binomial_distribution.hpp>
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
arma::mat parallel_random_matrix(int n, int m, int ncores, double p=0.5) {
dqrng::xoshiro256plus rng(42);
arma::mat out(n*m,3);
// ok to use rng here
#pragma omp parallel num_threads(ncores)
{
dqrng::xoshiro256plus lrng(rng); // make thread local copy of rng
lrng.jump(omp_get_thread_num() + 1); // advance rng by 1 ... ncores jumps
int iter = 0;
#pragma omp for
for (int i = 0; i < m; ++i) {
for (int j = 0; j < n; ++j) {
iter = i * n + j;
// p can be a function of i and j
boost::random::binomial_distribution<int> dist_binomial(1,p);
auto gen_bernoulli = std::bind(dist_binomial, std::ref(lrng));
boost::random::normal_distribution<int> dist_normal1(2.0,1.0);
auto gen_normal1 = std::bind(dist_normal1, std::ref(lrng));
boost::random::normal_distribution<int> dist_normal2(4.0,3.0);
auto gen_normal2 = std::bind(dist_normal2, std::ref(lrng));
out(iter,0) = gen_bernoulli();
out(iter,1) = 2.0;//gen_normal1();
out(iter,2) = 3.0;//gen_normal2();
}
}
}
// ok to use rng here
return out;
}
/*** R
parallel_random_matrix(5, 5, 4, 0.75)
*/
What am I doing wrong?

Here lies the problem:
boost::random::normal_distribution<int> dist_normal1(2.0,1.0);
^^^
This distribution is meant for real types, not integral types, c.f. https://www.boost.org/doc/libs/1_69_0/doc/html/boost/random/normal_distribution.html. Correct would be
boost::random::normal_distribution<double> dist_normal1(2.0,1.0);

Hybrid MPI+OpenMP Vs MPI Performance

I am converting a 3-D Jacobi solver from pure MPI to Hybrid MPI+OpenMP. I have a 192x192x192 array which is divided among 24 processes in Pure MPI in 1-D decomposition i.e. each process has 192/24 x 192 x 192 = 8 x 192 x 192 slab of data. Now I do :
for(i=0 ; i <= 7; i++)
for(j=0; j<= 191; j++)
for(k=0; k<= 191; k++)
{
unew[i][j][k] = 1/6.0 * (u[i+1][j][k]+u[i-1][j][k]+
u[i][j+1][k]+u[i][j-1][k]+
u[i][j][k+1]+u[i][j][k-1]);
}
This update takes around 60 seconds for each process.
Now with Hybrid MPI, I run two processes (1 process per socket --bind-to socket --map-by socket and OMP_PROC_PLACES=coreswith OMP_PROC_BIND=close). I create 12 threads per MPI Process (i.e. 12 threads per socket or processor). Now each MPI process has an array of size : 192/2 x 192 x 192 = 96x192x192 elements. Each thread works on 96/12 x 192 x 192 = 8 x 192 x 192 portion of the array owned by each process. I do the same triple loop update using threads but the time is approximately 76 seconds for each thread. The load balance is perfect in both the problems. What could be the possible causes of performance degradation ? Is is False Sharing because threads could be invalidating the cache lines close to each other's chunk of data ? If yes, then how do I reduce this performance degradation ? (I have purposefully not mentioned ghost data but initially I am NOT overlapping communication with computation.)
In response to the comments below, am posting the code. Apologies for the long MWE but you can very safely ignore (1) Header files declaration (2) Variable Declaration (3) Memory allocation routine (4) Formation of Cartesian Topology (5) Setting boundary conditions in parallel using OpenMP parallel region (6) Declaration of MPI_Type_subarray datatype (7) MPI_Isend() and MPI_Irecv() calls and just concentrate on (a) INDEPENDENT UPDATE OpenMP parallel region (b) independent_update(...) routine being called from here.
/* IGNORE THIS PORTION */
#include<mpi.h>
#include<omp.h>
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define MIN(a,b) (a < b ? a : b)
#define Tol 0.00001
/* IGNORE THIS ROUTINE */
void input(int *X, int *Y, int *Z)
{
int a=193, b=193, c=193;
*X = a;
*Y = b;
*Z = c;
}
/* IGNORE THIS ROUTINE */
float*** allocate_mem(int X, int Y, int Z)
{
int i,j;
float ***matrix;
float *arr;
arr = (float*)calloc(X*Y*Z, sizeof(float));
matrix = (float***)calloc(X, sizeof(float**));
for(i = 0 ; i<= X-1; i++)
matrix[i] = (float**)calloc(Y, sizeof(float*));
for(i = 0 ; i <= X-1; i++)
for(j=0; j<= Y-1; j++)
matrix[i][j] = &(arr[i*Y*Z + j*Z]);
return matrix ;
}
/* THIS ROUTINE IS IMPORTANT */
float independent_update(float ***old, float ***new, int NX, int NY, int NZ, int tID, int chunk)
{
int i,j,k, start, end;
float error = 0.0;
float diff;
start = tID * chunk + 1;
end = MIN( (tID+1)*chunk, NX-2 );
for(i = start; i <= end ; i++)
{
for(j = 1; j<= NY-2; j++)
{
#pragma omp simd
for(k = 1; k<= NZ-2; k++)
{
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > error)
error = diff;
}
}
}
return error;
}
int main(int argc, char *argv[])
{
/* IGNORE VARIABLE DECLARATION */
int size, rank; //Size of old_comm and rank of process
int i, j, k,l; //General loop variables
MPI_Comm old_comm, new_comm; //MPI_COMM_WORLD handle and for MPI_Cart_create()
int N[3]; //For taking input of size of matrix from user
int P; //Represent number of processes i.e. same as size
int dims[3]; //For dimensions of Cartesian topology
int PX, PY, PZ; //X dim, Y dim, Z dim of each process
float ***old, ***new, ***temp; //Matrices for results dimensions is (Px+2)*(PY+2)*(PZ+2)
int period[3]; //Periodicity for each dimension
int reorder; //Whether processes should be reordered in new cartesian topology
int ndims; //Number of dimensions (which is 3)
int Z_TOWARDS_U, Z_AWAY_U; //Z neighbour towards you and away from you (Z const)
int X_DOWN, X_UP; //Below plane and above plane (X const)
int Y_LEFT, Y_RIGHT; //Left plane and right plane (Y const)
int coords[3]; //Finding coordinates of processes
int dimension; //Used in MPI_Cart_shift() , values = 0, 1,2
int displacement; //Used in MPI_Cart_shift(), values will be +1 to find immediate neighbours
float l_max_err; //Local maximum error on process
float l_max_err_new; //For dependent faces.
float G_max_err = 1.0; //Maximum error for stopping criterion
int iterations = 0 ; //Counting number of iterations
MPI_Request send[6], recv[6]; //For MPI_Isend and MPI_Irecv
int start[3]; //Start will be defined in MPI_Isend() and MPI_Irecv()
int gsize[3]; //Defining global size of subarray
MPI_Datatype x_subarray; //For sending X_UP and X_DOWN
int local_x[3]; //Defining local plane size for X_UP/X_DOWN
MPI_Datatype y_subarray; //For sending Y_LEFT and Y_RIGHT
int local_y[3]; //Defining local plane for Y_LEFT/Y_RIGHT
MPI_Datatype z_subarray; //For sending Z_TOWARDS_U and Z_AWAY_U
int local_z[3]; //Defining local plan size for XY plane i.e. where Z=0
double strt, end; //For measuring time
double strt1, end1, delta1; //For measuring trivial time 1
double strt2, end2, delta2; //For measuring trivial time 2
double t_i_strt, t_i_end, t_i_sum=0; //Time for independent computational kernel
double t_up_strt, t_up_end, t_up_sum=0; //Time for X_UP
double t_down_strt, t_down_end, t_down_sum=0; //Time for X_DOWN
double t_left_strt, t_left_end, t_left_sum=0; //Time for Y_LEFT
double t_right_strt, t_right_end, t_right_sum=0; //Time for Y_RIGHT
double t_towards_strt, t_towards_end, t_towards_sum=0; //For Z_TOWARDS_U
double t_away_strt, t_away_end, t_away_sum=0; //For Z_AWAY_U
double t_comm_strt, t_comm_end, t_comm_sum=0; //Time comm + independent update (need to subtract to get comm time)
double t_setup_strt,t_setup_end; //Set-up start and end time
double t_allred_strt,t_allred_end,t_allred_total=0.0; //Measuring Allreduce time separately.
int threadID; //ID of a thread
int nthreads; //Total threads in OpenMP region
int chunk; //chunk - used to calculate iterations of a thread
/* IGNORE MPI STARTUP ETC */
MPI_Init(&argc, &argv);
t_setup_strt = MPI_Wtime();
old_comm = MPI_COMM_WORLD;
MPI_Comm_size(old_comm, &size);
MPI_Comm_rank(old_comm, &rank);
P = size;
if(rank == 0)
{
input(&N[0], &N[1], &N[2]);
}
MPI_Bcast(N, 3, MPI_INT, 0, old_comm);
dims[0] = 0;
dims[1] = 0;
dims[2] = 0;
period[0] = period[1] = period[2] = 0; //All dimensions aperiodic
reorder = 0 ; //No reordering of ranks in new_comm
ndims = 3;
MPI_Dims_create(P,ndims,dims);
MPI_Cart_create(old_comm, ndims, dims, period, reorder, &new_comm);
if( (N[0]-1) % dims[0] == 0 && (N[1]-1) % dims[1] == 0 && (N[2]-1) % dims[2] == 0 )
{
PX = (N[0]-1)/dims[0]; //Rows of unknowns each process gets
PY = (N[1]-1)/dims[1]; //Columns of unknowns each process gets
PZ = (N[2]-1)/dims[2]; //Depth of unknowns each process gets
}
old = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
new = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
dimension = 0;
displacement = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &X_UP, &X_DOWN); //Find UP and DOWN neighbours
dimension = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &Y_LEFT, &Y_RIGHT); //Find UP and DOWN neighbours
dimension = 2;
MPI_Cart_shift(new_comm, dimension, displacement, &Z_TOWARDS_U, &Z_AWAY_U); //Find UP and DOWN neighbours
/* IGNORE BOUNDARY SETUPS FOR PDE */
#pragma omp parallel for default(none) shared(old,new,PX,PY,PZ) private(i,j,k) schedule(static)
for(i = 0; i <= PX+1; i++)
{
for(j = 0; j <= PY+1; j++)
{
for(k = 0; k <= PZ+1; k++)
{
old[i][j][k] = 0.0;
new[i][j][k] = 0.0;
}
}
}
#pragma omp parallel default(none) shared(X_DOWN,X_UP,Y_LEFT,Y_RIGHT,Z_TOWARDS_U,Z_AWAY_U,old,new,PX,PY,PZ) private(i,j,k,threadID,nthreads)
{
threadID = omp_get_thread_num();
nthreads = omp_get_num_threads();
if(threadID == 0)
{
if(X_DOWN == MPI_PROC_NULL) //X is constant here, this is YZ upper plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1 ; k<= PZ ; k++)
{
old[0][j][k] = 1;
new[0][j][k] = 1; //Set boundaries in new also
}
}
}
if(threadID == (nthreads-1))
{
if(X_UP == MPI_PROC_NULL) //YZ lower plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1; k<= PZ ; k++)
{
old[PX+1][j][k] = 1;
new[PX+1][j][k] = 1;
}
}
}
if(Y_LEFT == MPI_PROC_NULL) //Y is constant, this is left XZ plane, possibly can use collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(k = 1; k<= PZ; k++)
{
old[i][0][k] = 1;
new[i][0][k] = 1;
}
}
if(Y_RIGHT == MPI_PROC_NULL) //XZ right plane, again collapse(2) potential
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(k = 1; k<= PZ ; k++)
{
old[i][PY+1][k] = 1;
new[i][PY+1][k] = 1;
}
}
if(Z_TOWARDS_U == MPI_PROC_NULL) //Z is constant here, towards you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][0] = 1;
new[i][j][0] = 1;
}
}
if(Z_AWAY_U == MPI_PROC_NULL) //Away from you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][PZ+1] = 1;
new[i][j][PZ+1] = 1;
}
}
}
/* IGNORE SUBARRAY DECLARATION */
gsize[0] = PX+2; //Global sizes of 3-D cubes for each process
gsize[1] = PY+2;
gsize[2] = PZ+2;
start[0] = 0; //Will specify starting location while sending/receiving
start[1] = 0;
start[2] = 0;
local_x[0] = 1;
local_x[1] = PY;
local_x[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_x, start, MPI_ORDER_C, MPI_FLOAT, &x_subarray);
MPI_Type_commit(&x_subarray);
local_y[0] = PX;
local_y[1] = 1;
local_y[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_y, start, MPI_ORDER_C, MPI_FLOAT, &y_subarray);
MPI_Type_commit(&y_subarray);
local_z[0] = PX;
local_z[1] = PY;
local_z[2] = 1;
MPI_Type_create_subarray(ndims, gsize, local_z, start, MPI_ORDER_C, MPI_FLOAT, &z_subarray);
MPI_Type_commit(&z_subarray);
t_setup_end = MPI_Wtime();
strt = MPI_Wtime();
while(G_max_err > Tol) //iterations < ITERATIONS)
{
iterations++ ;
t_comm_strt = MPI_Wtime();
/* IGNORE MPI COMMUNICATION */
MPI_Irecv(&old[0][1][1], 1, x_subarray, X_DOWN, 10, new_comm, &recv[0]);
MPI_Irecv(&old[PX+1][1][1], 1, x_subarray, X_UP, 20, new_comm, &recv[1]);
MPI_Irecv(&old[1][PY+1][1], 1, y_subarray, Y_RIGHT, 30, new_comm, &recv[2]);
MPI_Irecv(&old[1][0][1], 1, y_subarray, Y_LEFT, 40, new_comm, &recv[3]);
MPI_Irecv(&old[1][1][PZ+1], 1, z_subarray, Z_AWAY_U, 50, new_comm, &recv[4]);
MPI_Irecv(&old[1][1][0], 1, z_subarray, Z_TOWARDS_U, 60, new_comm, &recv[5]);
MPI_Isend(&old[PX][1][1], 1, x_subarray, X_UP, 10, new_comm, &send[0]);
MPI_Isend(&old[1][1][1], 1, x_subarray, X_DOWN, 20, new_comm, &send[1]);
MPI_Isend(&old[1][1][1], 1, y_subarray, Y_LEFT, 30, new_comm, &send[2]);
MPI_Isend(&old[1][PY][1], 1, y_subarray, Y_RIGHT, 40, new_comm, &send[3]);
MPI_Isend(&old[1][1][1], 1, z_subarray, Z_TOWARDS_U, 50, new_comm, &send[4]);
MPI_Isend(&old[1][1][PZ], 1, z_subarray, Z_AWAY_U, 60, new_comm, &send[5]);
MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
t_comm_end = MPI_Wtime();
t_comm_sum = t_comm_sum + (t_comm_end - t_comm_strt);
/* Use threads in Independent update */
t_i_strt = MPI_Wtime();
l_max_err = 0.0; //Very important, Reduction result is combined with this !
/* THIS IS THE IMPORTANT REGION */
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,chunk) private(threadID,nthreads) reduction(max:l_max_err)
{
nthreads = omp_get_num_threads();
threadID = omp_get_thread_num();
chunk = (PX-1+1) / nthreads ;
l_max_err = independent_update(old, new, PX+2, PY+2, PZ+2, threadID, chunk);
}
t_i_end = MPI_Wtime();
t_i_sum = t_i_sum + (t_i_end - t_i_strt) ;
/* IGNORE THE REMAINING CODE */
t_allred_strt = MPI_Wtime();
MPI_Allreduce(&l_max_err, &G_max_err, 1, MPI_FLOAT, MPI_MAX, new_comm);
t_allred_end = MPI_Wtime();
t_allred_total = t_allred_total + (t_allred_end - t_allred_strt);
temp = new ;
new = old;
old = temp;
}
MPI_Barrier(new_comm);
end = MPI_Wtime();
if( rank == 0)
{
printf("\nIterations = %d, G_max_err = %f", iterations, G_max_err);
printf("\nThe total SET-UP time for MPI and boundary conditions is %lf", (t_setup_end-t_setup_strt));
printf("\nThe total time for SOLVING is %lf", (end-strt));
printf("\nThe total time for INDEPENDENT COMPUTE %lf", t_i_sum);
printf("\nThe total time for COMMUNICATION OVERHEAD is %lf", t_comm_sum);
printf("\nThe total time for MPI_ALLREDUCE() is %lf", t_allred_total);
}
MPI_Type_free(&x_subarray);
MPI_Type_free(&y_subarray);
MPI_Type_free(&z_subarray);
free(&old[0][0][0]);
free(&new[0][0][0]);
MPI_Finalize();
return 0;
}
P.S. : I am almost sure that the cost of spawning/waking the threads is not the reason for such a huge difference in the timing.
Please find attached Scalasca snapshot for INDEPENDENT COMPUTE of the Hybrid Program.
Using loop simd construct
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,l_max_err) private(i,j,k,diff)
{
#pragma omp for simd schedule(static) reduction(max:l_max_err)
for(i = 1; i <= PX ; i++)
{
for(j = 1; j<= PY; j++)
{
for(k = 1; k<= PZ; k++)
{
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > l_max_err)
l_max_err = diff;
}
}
}
}

You frequently get memory access and cache issues when you just do one MPI process per socket on a CPU with multiple memory controllers. It can be on either the read or the write side, so you can't really say which. This is especially an issue when doing thread-parallel execution with lightweight compute tasks (e.g. math on arrays). One MPI process per socket in this case tends to fare significantly worse than pure MPI.
In your BIOS, set up whatever the maximal NUMA per socket option is
Use one MPI process per NUMA node.
Try some different parameter values in schedule(static). I've rarely found the default to be best.
Essentially what this will do is ensure each bundle of threads only works on a single pool of memory.

C++ Threaded Template Vector Quicksort

Threaded quick sort method:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "MD5.h"
#include <thread>
using namespace std;
template<typename T>
void quickSort(vector<T> &arr, int left, int right) {
int i = left, j = right; //Make local copys to modify
T tmp; //Termorary variable to use for swaping.
T pivot = arr[(left + right) / 2]; //Find the centerpoint. if 0.5 truncate.
while (i <= j) {
while (arr[i] < pivot) //is i < pivot?
i++;
while (arr[j] > pivot) //Is j > pivot?
j--;
if (i <= j) { //Swap
tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
i++;
j--;
}
};
thread left_t; //Left thread
thread right_t; //Right thread
if (left < j)
left_t = thread(quickSort<T>, ref(arr), left, j);
if (i < right)
right_t = thread(quickSort<T>, ref(arr), i, right);
if (left < j)
left_t.join();
if (left < j)
right_t.join();
}
int main()
{
vector<int> table;
for (int i = 0; i < 100; i++)
{
table.push_back(rand() % 100);
}
cout << "Before" << endl;
for each(int val in table)
{
cout << val << endl;
}
quickSort(table, 0, 99);
cout << "After" << endl;
for each(int val in table)
{
cout << val << endl;
}
char temp = cin.get();
return 0;
}
Above program lags like mad hell and Spams "abort()" has been called.
Im thinking it has something to do with vectors and it Having threading issues
Iv seen the Question asked by Daniel Makardich, His Utilizes a Vector int While mine uses Vector T

You don't have any problem with quick sort, but with passing a templated function to a thread. There is no function quickSort. You need to explicitly give type, to instantiate the function template:
#include <thread>
#include <iostream>
template<typename T>
void f(T a) { std::cout << a << '\n'; }
int main () {
std::thread t;
int a;
std::string b("b");
t = std::thread(f, a); // Won't work
t = std::thread(f<int>, a);
t.join();
t = std::thread(f<decltype(b)>, b); // a bit fancier, more dynamic way
t.join();
return 0;
}
I suspect in your case this should do:
left_t = thread(quickSort<T>, ref(arr), left, j);
And similar for right_t. Also, you have mistake there trying to use operator()() instead of constructing an object. That is why the error is different.
Can't verify though, cause there's no minimal verifiable example =/
I don't know if it's possible to make compiler to use automatic type deduction for f passed as a param, if anyone knows that would probably make it a better answer.

Problem was with thread joins and what #luk32 said
Needed to convert the threads to pointers to threads.
thread* left_t = nullptr; //Left thread
thread* right_t = nullptr; //Right thread
if (left < j)
left_t = new thread(quickSort<T>, ref(arr), left, j);
if (i < right)
right_t = new thread(quickSort<T>, ref(arr), i, right);
if (left_t)
{
left_t->join();
delete left_t;
}
if (right_t)
{
right_t->join();
delete right_t;
}
Seems like if you create a default constructed thread object. But don't use it, it still wants to be joined. and if you do join it, it will complain.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

OpenMP and MPI hybrid dynamic scheduling - multithreading

Related

MPI4PY: ring communication with neighbor_alltoallw

OpenACC How can I keep a data between differetn calls of a function?

Rstudio crashes with Rcpp and OpenMP function

Hybrid MPI+OpenMP Vs MPI Performance

C++ Threaded Template Vector Quicksort

Categories

Resources