OpenMP in ray-tracing program with two nested for loops crashing - multithreading

I am using OpenMP in Visual Studio 2015.
In my ray-tracing algorithm I have the following code:
int x, y;
#pragma omp parallel for private(y)
for (x = 0; x < sWidth; x++) {
for (y = 0; y < sHeight; y++) {
thisone = y*sWidth + x;
for (int aax = 0; aax < aadepth; aax++) {
for (int aay = 0; aay < aadepth; aay++) {
aa_index = aay*aadepth + aax;
if (sWidth > sHeight) {
//the image is wider than it is tall
xamnt = ((x + (double)aax / (aadepth - 1)) / sWidth)*aspectratio - (((sWidth - sHeight) / sHeight) / 2);
yamnt = ((sHeight - y) + (double)aax / (aadepth - 1)) / sHeight;
else if (sHeight > sWidth) {
//the image is taller than it is wide
xamnt = (x + (double)aax / (aadepth - 1)) / sWidth;
yamnt = (((sHeight - y) + (double)aax / (aadepth - 1)) / sHeight) / aspectratio - (((sHeight - sWidth) / sWidth) / 2);
else {
//the image is square
xamnt = (x + (double)aax / (aadepth - 1)) / sWidth;
yamnt = ((sHeight - y) + (double)aax / (aadepth - 1)) / sHeight;
//camera ray
Vect cam_ray_origin = scene_cam.getCameraPosition();
Vect cam_ray_direction = camdir.vectAdd(camright.vectMult(xamnt - 0.5).vectAdd(camdown.vectMult(yamnt - 0.5))).normalize();
Ray cam_ray(cam_ray_origin, cam_ray_direction);
vector<double> intersections;
for (int index = 0; index < scene_objects.size(); index++) {
int index_of_winning_object = winningObjectIndex(intersections);
if (index_of_winning_object == -1) {
//set the background black
tempRed[aa_index] = 0;
//cout << tempRed[aa_index];
tempGreen[aa_index] = 0;
tempBlue[aa_index] = 0;
//return color
else {
//index corresponds to an object in our scene
if ( > accuracy) {
//determine the position and direction vectors at the point of intersection
Vect intersection_position = cam_ray_origin.vectAdd(cam_ray_direction.vectMult(;
Vect intersecting_ray_direction = cam_ray_direction;
//color in the points
Color intersection_color = getColorAt(intersection_position,
scene_objects, index_of_winning_object,
light_sources, accuracy, ambientlight);
tempRed[aa_index] = intersection_color.getColorRed();
tempGreen[aa_index] = intersection_color.getColorGreen();
tempBlue[aa_index] = intersection_color.getColorBlue();
//adding the colors to each pixel
//average the pixel color
double totalRed = 0;
double totalGreen = 0;
double totalBlue = 0;
for (int iRed = 0; iRed < aadepth*aadepth; iRed++) {
totalRed = totalRed + tempRed[iRed];
for (int iGreen = 0; iGreen < aadepth*aadepth; iGreen++) {
totalGreen = totalGreen + tempGreen[iGreen];
for (int iBlue = 0; iBlue < aadepth*aadepth; iBlue++) {
totalBlue = totalBlue + tempBlue[iBlue];
double avgRed = totalRed / (aadepth*aadepth);
double avgGreen = totalGreen / (aadepth*aadepth);
double avgBlue = totalBlue / (aadepth*aadepth);
pixels[thisone].r = avgRed;
pixels[thisone].g = avgGreen;
pixels[thisone].b = avgBlue;
saveimg("scene_anti-aliased_thread_test1.bmp", pixels);
I tried several OpenMP directives: #pragma omp parallel, #pragma omp parallel for, and #pragma omp parallel for combined with the reduction() and private() clauses.
But nothing seems to work.
When I put #pragma omp parallel just before the x and y for loops, I get a "Microsoft C/C++ optimizing compiler has stopped working" message and the following errors:
Error C3017: termination test in OpenMP 'for' statement has improper form (project Ray-Tracing, file e:\utility programs\vs - c++\ray-tracing\ray-tracing\main.cpp, line 258)
Error C1903: unable to recover from previous error(s); stopping compilation (project Ray-Tracing, file e:\utility programs\vs - c++\ray-tracing\ray-tracing\main.cpp, line 258)
Why am I getting these errors?
When I use the same #pragma omp parallel before the aax and aay for loops it renders the image even slower.
My guess is that when I use #pragma omp parallel for on the two inner nested for loops, it just runs the same code several times and gains nothing.
But I honestly have no idea; I have only just started playing around with OpenMP.
Which OpenMP directives should I use, and where should I put them?
P.S. - the code seen is mainly from the online tutorial for making a ray-tracer from scratch. A basic way to make a ray-tracer by Caleb Piercy.
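A minimal sketch of one way the outer pixel loop could be parallelised. It assumes the per-pixel temporaries in the code above (thisone, aa_index, xamnt, yamnt and the tempRed/tempGreen/tempBlue buffers) are declared outside the loop and are therefore shared between threads. `Pixel`, `shade` and `render` are hypothetical stand-ins for the tracer's own types and per-sample work. C3017 is reported when the loop header does not match the form OpenMP expects; since the declarations of sWidth and sHeight are not shown, it is only an assumption that a signed int index compared directly against an int bound would resolve it here.

```cpp
#include <cstddef>
#include <vector>

struct Pixel { double r, g, b; };  // hypothetical stand-in for the tracer's pixel type

// Hypothetical per-sample shading: any pure function of the sample position.
static double shade(int x, int y, int aax, int aay) {
    return ((x + y + aax + aay) % 2 == 0) ? 1.0 : 0.0;
}

// Renders sWidth x sHeight pixels with aadepth x aadepth samples each.
std::vector<Pixel> render(int sWidth, int sHeight, int aadepth) {
    std::vector<Pixel> pixels(static_cast<std::size_t>(sWidth) * sHeight);
    const int samples = aadepth * aadepth;

    // Signed int index and a simple "x < bound" test. If OpenMP is not
    // enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int x = 0; x < sWidth; x++) {
        // Everything written below is declared inside the loop, so each
        // thread has its own copy and no private() clause is needed.
        std::vector<double> tempRed(samples);
        for (int y = 0; y < sHeight; y++) {
            const int thisone = y * sWidth + x;
            for (int aax = 0; aax < aadepth; aax++) {
                for (int aay = 0; aay < aadepth; aay++) {
                    tempRed[aay * aadepth + aax] = shade(x, y, aax, aay);
                }
            }
            double totalRed = 0.0;
            for (int i = 0; i < samples; i++) totalRed += tempRed[i];
            const double avg = totalRed / samples;
            pixels[thisone] = Pixel{avg, avg, avg};  // distinct index per (x, y), so no race
        }
    }
    return pixels;
}
```

Parallelising only the outermost loop keeps the amount of work per thread large; putting the pragma on the aax/aay loops creates a thread team for every pixel, which is consistent with the slowdown described above.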


Nan building and looping over an array

I'm able to execute a hello world example, but beyond that I'm new to nan and node add-ons.
I'm concerned about memory leaks, so if I'm causing any please let me know.
Also, how do I push an array onto that out array, similar to
[].push([0, 1])? I'm not sure how to do it in the cleanest way possible, ideally without creating a new variable to store it.
Also if there's anything else I'm doing that's not best practice please let me know! I've been researching this for a while now.
Here's the code I have so far
#include <nan.h>
void Method(const Nan::FunctionCallbackInfo <v8::Value> &info) {
v8::Local <v8::Context> context = info.GetIsolate()->GetCurrentContext();
v8::Local <v8::Array> coordinate = v8::Local<v8::Array>::Cast(info[0]);
unsigned int radius = info[2]->Uint32Value(context).FromJust();
// Also if creating the array is wasteful this way by giving it the max possible size
v8::Local <v8::Array> out = Nan::New<v8::Array>(x * y);
for (int x = -radius; x <= radius; ++x) {
for (int y = -radius; y <= radius; ++y) {
if (x * x + y * y <= radius * radius) {
// I need to push something like [x + coordinate->Get(context, 0), y + coordinate->Get(context, 0)];
I was later able to write this. Could anyone point out whether I approached it correctly and whether there are any memory issues I need to watch out for?
#include <nan.h>
void Method(const Nan::FunctionCallbackInfo <v8::Value> &info) {
v8::Local <v8::Context> context = info.GetIsolate()->GetCurrentContext();
v8::Local <v8::Array> coordinates = v8::Local<v8::Array>::Cast(info[0]);
int radius = info[1]->Int32Value(context).FromJust();
v8::Local <v8::Array> out = Nan::New<v8::Array>();
int index = 0;
for (unsigned int i = 0; i < coordinates->Length(); i++) {
v8::Local <v8::Array> coordinate = v8::Local<v8::Array>::Cast(coordinates->Get(context, i).ToLocalChecked());
int xArg = coordinate->Get(context, 0).ToLocalChecked()->Int32Value(context).FromJust();
int yArg = coordinate->Get(context, 1).ToLocalChecked()->Int32Value(context).FromJust();
for (int xPos = -radius; xPos <= radius; ++xPos) {
for (int yPos = -radius; yPos <= radius; ++yPos) {
if (xPos * xPos + yPos * yPos <= radius * radius) {
v8::Local <v8::Array> xy = Nan::New<v8::Array>();
(void) xy->Set(context, 0, Nan::New(xPos + xArg));
(void) xy->Set(context, 1, Nan::New(yPos + yArg));
(void) out->Set(context, index++, xy);
I don't think you have any leaks - in fact there is no implicit memory allocation at all in your code - but in case you need it, I suggest you check the gyp files of my addons for more information on how to build them with asan with g++ or clang. As far as I am concerned, it is a mandatory step when creating Node addons.
The option is called --enable_asan
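For reference, the disc enumeration itself can be sketched in plain C++ without any V8 types, which makes one difference between the two versions easier to see. In the first version radius is an unsigned int, so the comparison x <= radius is evaluated in unsigned arithmetic and the loops never run for a negative x; the second version reads it as int. `pointsInDisc` is a hypothetical helper name and is not part of nan or V8.

```cpp
#include <utility>
#include <vector>

// Collects every integer point within `radius` of (cx, cy).
// radius is a signed int on purpose: with an unsigned radius the test
// "x <= radius" would convert a negative x to a huge unsigned value.
std::vector<std::pair<int, int>> pointsInDisc(int cx, int cy, int radius) {
    std::vector<std::pair<int, int>> out;
    for (int x = -radius; x <= radius; ++x) {
        for (int y = -radius; y <= radius; ++y) {
            if (x * x + y * y <= radius * radius) {
                out.emplace_back(cx + x, cy + y);  // analogue of out.push([x, y])
            }
        }
    }
    return out;
}
```

The vector grows as needed, which mirrors creating the V8 array without a preset length and appending at a running index.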

Multiply matrices using dgemv and multithreads in c

I have a problem with my code. I want to multiply two matrices using dgemv from CBLAS, but I want to split the work across the threads I have. I already used dgemv for this multiplication in a previous exercise where no parallelism was needed. Does anyone have an idea of what I should do?
The code:
for (it = 0; it < itime; it++) {
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1, sigma, n, u, 1, 0.0, d, 1);
    #pragma omp parallel for private(i,j,sum) schedule(static)
    for (i = 0; i < n; i++) {
        sum = 0.0;
        uplus[i] = u[i] + dtmu - dt * u[i];
        #pragma omp simd reduction(+:sum)
        for (j = 0; j < n; j++) {
            sum += sigma[i*n+j]*u[j];
        }
        sum = sum - u[i]*m[i];
        uplus[i] += dtdiv * sum;
        if (uplus[i] > uth) {
            uplus[i] = 0.0;
            if (it >= ttransient) {
                omega1[i] += 1.0;
            }
        }
    }
    t = u;
    u = uplus;
    uplus = t;
}
I want to move the dgemv call into the parallel region and somehow split the multiplications across the threads I have.
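One way to split a matrix-vector product across threads is by contiguous row blocks, since each block of output rows is independent of the others. The sketch below uses a plain inner loop so that it compiles without a BLAS library; `blockedMatVec` is a hypothetical name, and the row-major layout is an assumption taken from the sigma[i*n+j] indexing in the hand-written loop. The dgemv call in the question passes CblasColMajor with CblasNoTrans, which for a non-symmetric sigma computes a different product from that loop.

```cpp
#include <algorithm>
#include <vector>

// y = A * x for a row-major n x n matrix, split into `blocks` contiguous
// row blocks. Blocks write disjoint parts of y, so they can run in parallel.
void blockedMatVec(const std::vector<double>& A, const std::vector<double>& x,
                   std::vector<double>& y, int n, int blocks) {
    const int rowsPerBlock = (n + blocks - 1) / blocks;  // ceiling division
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < blocks; b++) {
        const int r0 = b * rowsPerBlock;
        const int r1 = std::min(n, r0 + rowsPerBlock);
        // With CBLAS available, a non-empty block could instead be one call:
        // cblas_dgemv(CblasRowMajor, CblasNoTrans, r1 - r0, n, 1.0,
        //             &A[r0 * n], n, x.data(), 1, 0.0, &y[r0], 1);
        // (assumes the row-major layout described above).
        for (int i = r0; i < r1; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }
}
```

If the BLAS library in use is itself multithreaded, calling it from inside an OpenMP region can oversubscribe the cores, so it is worth checking how that library is configured before combining the two.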

Is implementing operators in one statement really faster than implementing them in separate statements?

I tried to measure the speed of the simple code below. The purpose is to find out which method is faster: one statement (x[j] = j * 2.0 + 1.0;) or two separate statements (x[j] = j * 2.0; x[j] += 1.0;).
#include <ctime>
#include <iostream>

int main(int argc, char* argv[])
{
    int x[100000];
    std::clock_t start;
    start = std::clock();
    for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 100000; j++) {
            x[j] = j * 2.0 + 1.0;
            //x[j] = j * 2.0;
            //x[j] += 1.0;
        }
    }
    // elapsed time in milliseconds
    std::cout << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << std::endl;
    return 0;
}
The results showed that the single statement (x[j] = j * 2.0 + 1.0;) took around 3.5 s, whereas the two separate statements (x[j] = j * 2.0; x[j] += 1.0;) took 8 s.
Could anyone explain why the difference in time is so large? Thanks in advance.
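One detail that may matter here: x is an int array while the right-hand sides are double expressions. The single statement performs one double-to-int conversion per element. The two-statement form stores a converted int, then reads it back, converts it to double, adds 1.0 and converts to int again. The sketch below makes both variants explicit and lets you check that they produce the same values. `fillOneStatement`, `fillTwoStatements` and `timeMs` are hypothetical names, and actual timings depend heavily on the compiler and optimisation level.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// One statement: a single double -> int conversion per element.
void fillOneStatement(std::vector<int>& x) {
    for (std::size_t j = 0; j < x.size(); j++) {
        x[j] = static_cast<int>(j * 2.0 + 1.0);
    }
}

// Two statements: store (double -> int), then load, int -> double,
// add, and convert double -> int a second time.
void fillTwoStatements(std::vector<int>& x) {
    for (std::size_t j = 0; j < x.size(); j++) {
        x[j] = static_cast<int>(j * 2.0);
        x[j] += 1.0;  // compound assignment is computed in double
    }
}

// Runs f(x) `repeats` times and returns the elapsed wall time in milliseconds.
template <typename F>
double timeMs(F f, std::vector<int>& x, int repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; i++) f(x);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing the two with and without optimisation enabled would show how much of the gap comes from the extra conversions and memory traffic rather than from the arithmetic itself.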

Hybrid MPI+OpenMP Vs MPI Performance

I am converting a 3-D Jacobi solver from pure MPI to Hybrid MPI+OpenMP. I have a 192x192x192 array which is divided among 24 processes in Pure MPI in 1-D decomposition i.e. each process has 192/24 x 192 x 192 = 8 x 192 x 192 slab of data. Now I do :
for(i=0 ; i <= 7; i++)
for(j=0; j<= 191; j++)
for(k=0; k<= 191; k++)
unew[i][j][k] = 1/6.0 * (u[i+1][j][k]+u[i-1][j][k]+
This update takes around 60 seconds for each process.
Now with hybrid MPI+OpenMP, I run two processes (one process per socket, using --bind-to socket --map-by socket, and OMP_PROC_PLACES=cores with OMP_PROC_BIND=close). I create 12 threads per MPI process (i.e. 12 threads per socket or processor). Each MPI process now has an array of 192/2 x 192 x 192 = 96 x 192 x 192 elements, and each thread works on a 96/12 x 192 x 192 = 8 x 192 x 192 portion of the array owned by its process. I do the same triple-loop update using threads, but the time is approximately 76 seconds for each thread. The load balance is perfect in both cases. What could be the possible causes of the performance degradation? Is it false sharing, because threads could be invalidating cache lines close to each other's chunk of data? If so, how do I reduce this performance degradation? (I have purposely not mentioned ghost data, but initially I am NOT overlapping communication with computation.)
In response to the comments below, I am posting the code. Apologies for the long MWE, but you can safely ignore (1) the header file declarations, (2) the variable declarations, (3) the memory allocation routine, (4) the formation of the Cartesian topology, (5) the setting of boundary conditions in an OpenMP parallel region, (6) the declaration of the MPI_Type_subarray datatypes and (7) the MPI_Isend() and MPI_Irecv() calls, and just concentrate on (a) the INDEPENDENT UPDATE OpenMP parallel region and (b) the independent_update(...) routine called from it.
#define MIN(a,b) ((a) < (b) ? (a) : (b))
#define Tol 0.00001
void input(int *X, int *Y, int *Z)
int a=193, b=193, c=193;
*X = a;
*Y = b;
*Z = c;
float*** allocate_mem(int X, int Y, int Z)
int i,j;
float ***matrix;
float *arr;
arr = (float*)calloc(X*Y*Z, sizeof(float));
matrix = (float***)calloc(X, sizeof(float**));
for(i = 0 ; i<= X-1; i++)
matrix[i] = (float**)calloc(Y, sizeof(float*));
for(i = 0 ; i <= X-1; i++)
for(j=0; j<= Y-1; j++)
matrix[i][j] = &(arr[i*Y*Z + j*Z]);
return matrix ;
float independent_update(float ***old, float ***new, int NX, int NY, int NZ, int tID, int chunk)
int i,j,k, start, end;
float error = 0.0;
float diff;
start = tID * chunk + 1;
end = MIN( (tID+1)*chunk, NX-2 );
for(i = start; i <= end ; i++)
for(j = 1; j<= NY-2; j++)
#pragma omp simd
for(k = 1; k<= NZ-2; k++)
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > error)
error = diff;
return error;
int main(int argc, char *argv[])
int size, rank; //Size of old_comm and rank of process
int i, j, k,l; //General loop variables
MPI_Comm old_comm, new_comm; //MPI_COMM_WORLD handle and for MPI_Cart_create()
int N[3]; //For taking input of size of matrix from user
int P; //Represent number of processes i.e. same as size
int dims[3]; //For dimensions of Cartesian topology
int PX, PY, PZ; //X dim, Y dim, Z dim of each process
float ***old, ***new, ***temp; //Matrices for results dimensions is (Px+2)*(PY+2)*(PZ+2)
int period[3]; //Periodicity for each dimension
int reorder; //Whether processes should be reordered in new cartesian topology
int ndims; //Number of dimensions (which is 3)
int Z_TOWARDS_U, Z_AWAY_U; //Z neighbour towards you and away from you (Z const)
int X_DOWN, X_UP; //Below plane and above plane (X const)
int Y_LEFT, Y_RIGHT; //Left plane and right plane (Y const)
int coords[3]; //Finding coordinates of processes
int dimension; //Used in MPI_Cart_shift() , values = 0, 1,2
int displacement; //Used in MPI_Cart_shift(), values will be +1 to find immediate neighbours
float l_max_err; //Local maximum error on process
float l_max_err_new; //For dependent faces.
float G_max_err = 1.0; //Maximum error for stopping criterion
int iterations = 0 ; //Counting number of iterations
MPI_Request send[6], recv[6]; //For MPI_Isend and MPI_Irecv
int start[3]; //Start will be defined in MPI_Isend() and MPI_Irecv()
int gsize[3]; //Defining global size of subarray
MPI_Datatype x_subarray; //For sending X_UP and X_DOWN
int local_x[3]; //Defining local plane size for X_UP/X_DOWN
MPI_Datatype y_subarray; //For sending Y_LEFT and Y_RIGHT
int local_y[3]; //Defining local plane for Y_LEFT/Y_RIGHT
MPI_Datatype z_subarray; //For sending Z_TOWARDS_U and Z_AWAY_U
int local_z[3]; //Defining local plan size for XY plane i.e. where Z=0
double strt, end; //For measuring time
double strt1, end1, delta1; //For measuring trivial time 1
double strt2, end2, delta2; //For measuring trivial time 2
double t_i_strt, t_i_end, t_i_sum=0; //Time for independent computational kernel
double t_up_strt, t_up_end, t_up_sum=0; //Time for X_UP
double t_down_strt, t_down_end, t_down_sum=0; //Time for X_DOWN
double t_left_strt, t_left_end, t_left_sum=0; //Time for Y_LEFT
double t_right_strt, t_right_end, t_right_sum=0; //Time for Y_RIGHT
double t_towards_strt, t_towards_end, t_towards_sum=0; //For Z_TOWARDS_U
double t_away_strt, t_away_end, t_away_sum=0; //For Z_AWAY_U
double t_comm_strt, t_comm_end, t_comm_sum=0; //Time comm + independent update (need to subtract to get comm time)
double t_setup_strt,t_setup_end; //Set-up start and end time
double t_allred_strt,t_allred_end,t_allred_total=0.0; //Measuring Allreduce time separately.
int threadID; //ID of a thread
int nthreads; //Total threads in OpenMP region
int chunk; //chunk - used to calculate iterations of a thread
MPI_Init(&argc, &argv);
t_setup_strt = MPI_Wtime();
old_comm = MPI_COMM_WORLD;
MPI_Comm_size(old_comm, &size);
MPI_Comm_rank(old_comm, &rank);
P = size;
if(rank == 0)
input(&N[0], &N[1], &N[2]);
MPI_Bcast(N, 3, MPI_INT, 0, old_comm);
dims[0] = 0;
dims[1] = 0;
dims[2] = 0;
period[0] = period[1] = period[2] = 0; //All dimensions aperiodic
reorder = 0 ; //No reordering of ranks in new_comm
ndims = 3;
MPI_Cart_create(old_comm, ndims, dims, period, reorder, &new_comm);
if( (N[0]-1) % dims[0] == 0 && (N[1]-1) % dims[1] == 0 && (N[2]-1) % dims[2] == 0 )
PX = (N[0]-1)/dims[0]; //Rows of unknowns each process gets
PY = (N[1]-1)/dims[1]; //Columns of unknowns each process gets
PZ = (N[2]-1)/dims[2]; //Depth of unknowns each process gets
old = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
new = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
dimension = 0;
displacement = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &X_UP, &X_DOWN); //Find UP and DOWN neighbours
dimension = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &Y_LEFT, &Y_RIGHT); //Find LEFT and RIGHT neighbours
dimension = 2;
MPI_Cart_shift(new_comm, dimension, displacement, &Z_TOWARDS_U, &Z_AWAY_U); //Find TOWARDS and AWAY neighbours
#pragma omp parallel for default(none) shared(old,new,PX,PY,PZ) private(i,j,k) schedule(static)
for(i = 0; i <= PX+1; i++)
for(j = 0; j <= PY+1; j++)
for(k = 0; k <= PZ+1; k++)
old[i][j][k] = 0.0;
new[i][j][k] = 0.0;
#pragma omp parallel default(none) shared(X_DOWN,X_UP,Y_LEFT,Y_RIGHT,Z_TOWARDS_U,Z_AWAY_U,old,new,PX,PY,PZ) private(i,j,k,threadID,nthreads)
threadID = omp_get_thread_num();
nthreads = omp_get_num_threads();
if(threadID == 0)
if(X_DOWN == MPI_PROC_NULL) //X is constant here, this is YZ upper plane
for(j = 1 ; j<= PY ; j++)
for(k = 1 ; k<= PZ ; k++)
old[0][j][k] = 1;
new[0][j][k] = 1; //Set boundaries in new also
if(threadID == (nthreads-1))
if(X_UP == MPI_PROC_NULL) //YZ lower plane
for(j = 1 ; j<= PY ; j++)
for(k = 1; k<= PZ ; k++)
old[PX+1][j][k] = 1;
new[PX+1][j][k] = 1;
if(Y_LEFT == MPI_PROC_NULL) //Y is constant, this is left XZ plane, possibly can use collapse(2)
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(k = 1; k<= PZ; k++)
old[i][0][k] = 1;
new[i][0][k] = 1;
if(Y_RIGHT == MPI_PROC_NULL) //XZ right plane, again collapse(2) potential
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(k = 1; k<= PZ ; k++)
old[i][PY+1][k] = 1;
new[i][PY+1][k] = 1;
if(Z_TOWARDS_U == MPI_PROC_NULL) //Z is constant here, towards you XY plane, collapse(2)
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(j = 1; j<= PY ; j++)
old[i][j][0] = 1;
new[i][j][0] = 1;
if(Z_AWAY_U == MPI_PROC_NULL) //Away from you XY plane, collapse(2)
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(j = 1; j<= PY ; j++)
old[i][j][PZ+1] = 1;
new[i][j][PZ+1] = 1;
gsize[0] = PX+2; //Global sizes of 3-D cubes for each process
gsize[1] = PY+2;
gsize[2] = PZ+2;
start[0] = 0; //Will specify starting location while sending/receiving
start[1] = 0;
start[2] = 0;
local_x[0] = 1;
local_x[1] = PY;
local_x[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_x, start, MPI_ORDER_C, MPI_FLOAT, &x_subarray);
local_y[0] = PX;
local_y[1] = 1;
local_y[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_y, start, MPI_ORDER_C, MPI_FLOAT, &y_subarray);
local_z[0] = PX;
local_z[1] = PY;
local_z[2] = 1;
MPI_Type_create_subarray(ndims, gsize, local_z, start, MPI_ORDER_C, MPI_FLOAT, &z_subarray);
t_setup_end = MPI_Wtime();
strt = MPI_Wtime();
while(G_max_err > Tol) //iterations < ITERATIONS)
iterations++ ;
t_comm_strt = MPI_Wtime();
MPI_Irecv(&old[0][1][1], 1, x_subarray, X_DOWN, 10, new_comm, &recv[0]);
MPI_Irecv(&old[PX+1][1][1], 1, x_subarray, X_UP, 20, new_comm, &recv[1]);
MPI_Irecv(&old[1][PY+1][1], 1, y_subarray, Y_RIGHT, 30, new_comm, &recv[2]);
MPI_Irecv(&old[1][0][1], 1, y_subarray, Y_LEFT, 40, new_comm, &recv[3]);
MPI_Irecv(&old[1][1][PZ+1], 1, z_subarray, Z_AWAY_U, 50, new_comm, &recv[4]);
MPI_Irecv(&old[1][1][0], 1, z_subarray, Z_TOWARDS_U, 60, new_comm, &recv[5]);
MPI_Isend(&old[PX][1][1], 1, x_subarray, X_UP, 10, new_comm, &send[0]);
MPI_Isend(&old[1][1][1], 1, x_subarray, X_DOWN, 20, new_comm, &send[1]);
MPI_Isend(&old[1][1][1], 1, y_subarray, Y_LEFT, 30, new_comm, &send[2]);
MPI_Isend(&old[1][PY][1], 1, y_subarray, Y_RIGHT, 40, new_comm, &send[3]);
MPI_Isend(&old[1][1][1], 1, z_subarray, Z_TOWARDS_U, 50, new_comm, &send[4]);
MPI_Isend(&old[1][1][PZ], 1, z_subarray, Z_AWAY_U, 60, new_comm, &send[5]);
MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
t_comm_end = MPI_Wtime();
t_comm_sum = t_comm_sum + (t_comm_end - t_comm_strt);
/* Use threads in Independent update */
t_i_strt = MPI_Wtime();
l_max_err = 0.0; //Very important, Reduction result is combined with this !
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,chunk) private(threadID,nthreads) reduction(max:l_max_err)
nthreads = omp_get_num_threads();
threadID = omp_get_thread_num();
chunk = (PX-1+1) / nthreads ;
l_max_err = independent_update(old, new, PX+2, PY+2, PZ+2, threadID, chunk);
t_i_end = MPI_Wtime();
t_i_sum = t_i_sum + (t_i_end - t_i_strt) ;
t_allred_strt = MPI_Wtime();
MPI_Allreduce(&l_max_err, &G_max_err, 1, MPI_FLOAT, MPI_MAX, new_comm);
t_allred_end = MPI_Wtime();
t_allred_total = t_allred_total + (t_allred_end - t_allred_strt);
temp = new ;
new = old;
old = temp;
end = MPI_Wtime();
if( rank == 0)
printf("\nIterations = %d, G_max_err = %f", iterations, G_max_err);
printf("\nThe total SET-UP time for MPI and boundary conditions is %lf", (t_setup_end-t_setup_strt));
printf("\nThe total time for SOLVING is %lf", (end-strt));
printf("\nThe total time for INDEPENDENT COMPUTE %lf", t_i_sum);
printf("\nThe total time for COMMUNICATION OVERHEAD is %lf", t_comm_sum);
printf("\nThe total time for MPI_ALLREDUCE() is %lf", t_allred_total);
return 0;
P.S. : I am almost sure that the cost of spawning/waking the threads is not the reason for such a huge difference in the timing.
Please find attached Scalasca snapshot for INDEPENDENT COMPUTE of the Hybrid Program.
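A side note on the manual chunking in independent_update: chunk is computed with integer division, so when PX is not a multiple of the thread count the last few rows are never assigned to any thread. It works for 96 rows and 12 threads, but with 10 rows and 4 threads, for example, rows 9 and 10 would be skipped. A hedged sketch of a balanced split follows; `rowRange` is a hypothetical helper and is not part of the code above.

```cpp
#include <utility>

// Returns the inclusive, 1-based row range [first, last] for thread tID when
// `rows` interior rows are split over nthreads threads. The first
// (rows % nthreads) threads get one extra row; an empty range has last < first.
std::pair<int, int> rowRange(int rows, int nthreads, int tID) {
    const int base = rows / nthreads;
    const int extra = rows % nthreads;
    const int first = tID * base + (tID < extra ? tID : extra) + 1;
    const int count = base + (tID < extra ? 1 : 0);
    return {first, first + count - 1};
}
```

A worksharing `#pragma omp for schedule(static)` loop over i gives a comparable distribution without any manual index arithmetic.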
Using the loop simd construct:
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,l_max_err) private(i,j,k,diff)
#pragma omp for simd schedule(static) reduction(max:l_max_err)
for(i = 1; i <= PX ; i++)
for(j = 1; j<= PY; j++)
for(k = 1; k<= PZ; k++)
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > l_max_err)
l_max_err = diff;
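The same sweep can also be sketched against a flat array. Here the i loop is shared across threads with a max reduction, and only the innermost, contiguous k loop is left to the compiler's vectoriser. `jacobiSweep` and the `idx` lambda are hypothetical names. The sketch uses float arithmetic throughout, which is an assumption that differs slightly from the 1/6.0 double constant in the post. The reduction(max: ...) clause needs OpenMP 3.1 or later; without OpenMP the pragma is ignored.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One Jacobi sweep over the interior of an nx x ny x nz grid stored flat in
// row-major order. Returns max |1 - unew| over the interior, as in the post.
float jacobiSweep(const std::vector<float>& u, std::vector<float>& unew,
                  int nx, int ny, int nz) {
    auto idx = [ny, nz](int i, int j, int k) {
        return (static_cast<std::size_t>(i) * ny + j) * nz + k;
    };
    float maxErr = 0.0f;
    #pragma omp parallel for schedule(static) reduction(max : maxErr)
    for (int i = 1; i <= nx - 2; i++) {
        for (int j = 1; j <= ny - 2; j++) {
            for (int k = 1; k <= nz - 2; k++) {
                const float v = (1.0f / 6.0f) *
                    (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                     u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                     u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]);
                unew[idx(i, j, k)] = v;
                const float diff = std::fabs(1.0f - v);
                if (diff > maxErr) maxErr = diff;  // per-thread copy, merged at the end
            }
        }
    }
    return maxErr;
}
```

With schedule(static), each thread writes a contiguous slab of planes, so any false sharing would be confined to the cache lines at slab boundaries.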
You frequently get memory access and cache issues when you just do one MPI process per socket on a CPU with multiple memory controllers. It can be on either the read or the write side, so you can't really say which. This is especially an issue when doing thread-parallel execution with lightweight compute tasks (e.g. math on arrays). One MPI process per socket in this case tends to fare significantly worse than pure MPI.
In your BIOS, enable whatever the maximum NUMA-nodes-per-socket option is.
Use one MPI process per NUMA node.
Try some different parameter values in schedule(static). I've rarely found the default to be best.
Essentially what this will do is ensure each bundle of threads only works on a single pool of memory.

Errors with repeated FFTW calls

I'm having a strange issue that I can't resolve. I made this simple example to demonstrate the problem. I have a sine wave defined on [0, 2*pi]. I take the Fourier transform using FFTW. Then I have a for loop where I repeatedly take the inverse Fourier transform. In each iteration, I take the average of my solution and print the result. I expect the average to stay the same in each iteration because nothing changes the solution y. However, when I pick N = 256 or other even values of N, the average grows as if there were numerical errors. If I choose, say, N = 255 or N = 257, this is not the case and I get what I expect (avg = 0.0 for each iteration).
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>
#include <math.h>
int main(void)
{
    int N = 256;
    double dx = 2.0 * M_PI / (double)N, dt = 1.0e-3;
    double *x, *y;
    x = (double *) malloc (sizeof (double) * N);
    y = (double *) malloc (sizeof (double) * N);
    // initial conditions
    for (int i = 0; i < N; i++) {
        x[i] = (double)i * dx;
        y[i] = sin(x[i]);
    }
    fftw_complex yhat[N/2 + 1];
    fftw_plan fftwplan, fftwplan2;
    // forward plan
    fftwplan = fftw_plan_dft_r2c_1d(N, y, yhat, FFTW_ESTIMATE);
    // y to yhat
    fftw_execute(fftwplan);
    // set N/2th mode to zero if N is even
    if (N % 2 == 0) {
        yhat[N/2][0] = 0.0;
        yhat[N/2][1] = 0.0;
    }
    // backward plan
    fftwplan2 = fftw_plan_dft_c2r_1d(N, yhat, y, FFTW_ESTIMATE);
    for (int i = 0; i < 50; i++) {
        // yhat to y
        fftw_execute(fftwplan2);
        // rescale
        for (int j = 0; j < N; j++) {
            y[j] = y[j] / (double)N;
        }
        double avg = 0.0;
        for (int j = 0; j < N; j++) {
            avg += y[j];
        }
        printf("%.15f\n", avg/N);
    }
    fftw_cleanup();
    return 0;
}
Output for N = 256:
Any ideas?
libfftw has the odious habit of modifying its inputs. Back up yhat if you want to do repeated inverse transforms.
OTOH, it's perverse, but why are you repeating the same operation if you don't expect it to give different results? (Even though it does here.)
As indicated in the comments: if you want to keep the input data unchanged, use the FFTW_PRESERVE_INPUT flag.
For example:
// backward plan
fftwplan2 = fftw_plan_dft_c2r_1d(N, yhat, y, FFTW_ESTIMATE | FFTW_PRESERVE_INPUT);
