Errors with repeated FFTW calls

I'm having a strange issue that I can't resolve. I made this simple example to demonstrate the problem. I have a sine wave defined on [0, 2*pi]. I take its Fourier transform using FFTW. Then, in a for loop, I repeatedly take the inverse Fourier transform. In each iteration, I compute the average of my solution and print the result. I expect the average to stay the same in every iteration because nothing changes the solution, y. However, when I pick N = 256 or other even values of N, the average grows as if numerical errors were accumulating. If I instead choose, say, N = 255 or N = 257, this does not happen and I get what I expect (avg = 0.0 for each iteration).
Code:
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>
#include <math.h>

int main(void)
{
    int N = 256;
    double dx = 2.0 * M_PI / (double)N, dt = 1.0e-3;
    double *x, *y;
    x = (double *) malloc (sizeof (double) * N);
    y = (double *) malloc (sizeof (double) * N);
    // initial conditions
    for (int i = 0; i < N; i++) {
        x[i] = (double)i * dx;
        y[i] = sin(x[i]);
    }
    fftw_complex yhat[N/2 + 1];
    fftw_plan fftwplan, fftwplan2;
    // forward plan
    fftwplan = fftw_plan_dft_r2c_1d(N, y, yhat, FFTW_ESTIMATE);
    fftw_execute(fftwplan);
    // set N/2th mode to zero if N is even (N is an int, so compare exactly)
    if (N % 2 == 0) {
        yhat[N/2][0] = 0.0;
        yhat[N/2][1] = 0.0;
    }
    // backward plan
    fftwplan2 = fftw_plan_dft_c2r_1d(N, yhat, y, FFTW_ESTIMATE);
    for (int i = 0; i < 50; i++) {
        // yhat to y
        fftw_execute(fftwplan2);
        // rescale
        for (int j = 0; j < N; j++) {
            y[j] = y[j] / (double)N;
        }
        double avg = 0.0;
        for (int j = 0; j < N; j++) {
            avg += y[j];
        }
        printf("%.15f\n", avg/N);
    }
    fftw_destroy_plan(fftwplan);
    fftw_destroy_plan(fftwplan2);
    fftw_cleanup(); // call the function; the original line only re-declared it
    free(x);
    free(y);
    return 0;
}
Output for N = 256:
0.000000000000000
0.000000000000000
0.000000000000000
-0.000000000000000
0.000000000000000
0.000000000000022
-0.000000000000007
-0.000000000000039
0.000000000000161
-0.000000000000314
0.000000000000369
0.000000000004775
-0.000000000007390
-0.000000000079126
-0.000000000009457
-0.000000000462023
0.000000000900855
-0.000000000196451
0.000000000931323
-0.000000009895302
0.000000039348379
0.000000133179128
0.000000260770321
-0.000003233551979
0.000008285045624
-0.000016331672668
0.000067450106144
-0.000166893005371
0.001059055328369
-0.002521514892578
0.005493164062500
-0.029907226562500
0.093383789062500
-0.339111328125000
1.208251953125000
-3.937500000000000
13.654296875000000
-43.812500000000000
161.109375000000000
-479.250000000000000
1785.500000000000000
-5369.000000000000000
19376.000000000000000
-66372.000000000000000
221104.000000000000000
-753792.000000000000000
2387712.000000000000000
-8603776.000000000000000
29706240.000000000000000
-96833536.000000000000000
Any ideas?

libfftw has the odious habit of modifying its inputs: by default, a c2r (complex-to-real) transform destroys its input array. Back up yhat and restore it before each call if you want to do repeated inverse transforms.
OTOH, it's perverse, but why are you repeating the same operation if you don't expect it to give different results? (Despite this being the case)

As indicated in comments: "if you want to keep the input data unchanged, use the FFTW_PRESERVE_INPUT flag. Per http://www.fftw.org/doc/Planner-Flags.html"
For example:
// backward plan
fftwplan2 = fftw_plan_dft_c2r_1d(N, yhat, y, FFTW_ESTIMATE | FFTW_PRESERVE_INPUT);
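The other option mentioned above, backing up yhat, can be sketched without linking FFTW. In this minimal sketch, destructive_inverse is a hypothetical stand-in (a naive O(N^2) inverse real DFT that deliberately scribbles over its input) meant only to mimic the default c2r behaviour; forward_half_spectrum and worst_mean_with_backup are likewise made-up helper names, not FFTW calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Unnormalized forward real DFT, keeping only bins 0..n/2 (the r2c layout).
std::vector<cd> forward_half_spectrum(const std::vector<double>& y) {
    const int n = static_cast<int>(y.size());
    const double pi = std::acos(-1.0);
    std::vector<cd> spec(n / 2 + 1);
    for (int k = 0; k <= n / 2; ++k)
        for (int j = 0; j < n; ++j)
            spec[k] += y[j] * std::polar(1.0, -2.0 * pi * k * j / n);
    return spec;
}

// Hypothetical stand-in for a destructive c2r transform: returns the
// normalized inverse and then overwrites `spec` with garbage on purpose.
std::vector<double> destructive_inverse(std::vector<cd>& spec, int n) {
    assert(static_cast<int>(spec.size()) == n / 2 + 1);
    const double pi = std::acos(-1.0);
    std::vector<double> out(n);
    for (int j = 0; j < n; ++j) {
        double s = spec[0].real();
        for (int k = 1; k <= n / 2; ++k) {
            // Every bin except DC and (for even n) Nyquist stands for a conjugate pair.
            const double weight = (n % 2 == 0 && k == n / 2) ? 1.0 : 2.0;
            s += weight * (spec[k] * std::polar(1.0, 2.0 * pi * k * j / n)).real();
        }
        out[j] = s / n;
    }
    for (cd& c : spec) c = cd(1e6, -1e6);  // clobber the input
    return out;
}

// Repeats the inverse transform, restoring the spectrum from a backup before
// every call, and returns the largest |mean| observed over all repeats.
double worst_mean_with_backup(int n, int repeats) {
    const double pi = std::acos(-1.0);
    std::vector<double> y(n);
    for (int j = 0; j < n; ++j) y[j] = std::sin(2.0 * pi * j / n);
    const std::vector<cd> backup = forward_half_spectrum(y);
    double worst = 0.0;
    for (int r = 0; r < repeats; ++r) {
        std::vector<cd> work = backup;  // fresh copy: the transform may destroy it
        const std::vector<double> back = destructive_inverse(work, n);
        double mean = 0.0;
        for (double v : back) mean += v;
        worst = std::max(worst, std::fabs(mean / n));
    }
    return worst;
}
```

With the real library the same idea would presumably amount to a memcpy from a saved copy of yhat into the plan's input array before each fftw_execute(fftwplan2).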

Related

Nan building and looping over an array

I'm able to execute a hello world example, but beyond that I'm new to nan and node add-ons.
I'm concerned about memory leaks, so if I'm causing any please let me know.
Also, how do I push an array onto the out array, similar to [].push([0, 1])? I'm not sure how to do it in the cleanest way possible, ideally without creating a new variable to store it.
If there's anything else I'm doing that's not best practice, please let me know! I've been researching this for a while now.
Here's the code I have so far:
#include <nan.h>
void Method(const Nan::FunctionCallbackInfo <v8::Value> &info) {
v8::Local <v8::Context> context = info.GetIsolate()->GetCurrentContext();
v8::Local <v8::Array> coordinate = v8::Local<v8::Array>::Cast(info[0]);
unsigned int radius = info[2]->Uint32Value(context).FromJust();
// Also if creating the array is wasteful this way by giving it the max possible size
v8::Local <v8::Array> out = Nan::New<v8::Array>(x * y);
for (int x = -radius; x <= radius; ++x) {
for (int y = -radius; y <= radius; ++y) {
if (x * x + y * y <= radius * radius) {
// I need to push something like [x + coordinate->Get(context, 0), y + coordinate->Get(context, 0)];
out->push_back();
}
}
}
}
I was later able to write this. Could anyone point out whether I approached it correctly and whether there are any memory issues I need to watch out for?
#include <nan.h>
void Method(const Nan::FunctionCallbackInfo <v8::Value> &info) {
v8::Local <v8::Context> context = info.GetIsolate()->GetCurrentContext();
v8::Local <v8::Array> coordinates = v8::Local<v8::Array>::Cast(info[0]);
int radius = info[1]->Int32Value(context).FromJust();
v8::Local <v8::Array> out = Nan::New<v8::Array>();
int index = 0;
for (unsigned int i = 0; i < coordinates->Length(); i++) {
v8::Local <v8::Array> coordinate = v8::Local<v8::Array>::Cast(coordinates->Get(context, i).ToLocalChecked());
int xArg = coordinate->Get(context, 0).ToLocalChecked()->Int32Value(context).FromJust();
int yArg = coordinate->Get(context, 1).ToLocalChecked()->Int32Value(context).FromJust();
for (int xPos = -radius; xPos <= radius; ++xPos) {
for (int yPos = -radius; yPos <= radius; ++yPos) {
if (xPos * xPos + yPos * yPos <= radius * radius) {
v8::Local <v8::Array> xy = Nan::New<v8::Array>();
(void) xy->Set(context, 0, Nan::New(xPos + xArg));
(void) xy->Set(context, 1, Nan::New(yPos + yArg));
(void) out->Set(context, index++, xy);
}
}
}
}
info.GetReturnValue().Set(out);
}

I don't think you have any leaks - in fact there is no implicit memory allocation at all in your code - but in case you need it, I suggest you check the gyp files of my addons for more information on how to build them with ASan (AddressSanitizer) using g++ or clang. As far as I am concerned, it is a mandatory step when creating Node addons.
https://github.com/mmomtchev/node-gdal-async/blob/master/binding.gyp
https://github.com/mmomtchev/exprtk.js/blob/main/binding.gyp
The option is called --enable_asan
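Separately from leak checking, the loop logic in the question can be exercised without any V8 types. The sketch below is a plain C++ stand-in (points_in_disc is a hypothetical name, not part of nan or V8) that collects the same points into a std::vector. It keeps radius signed, as the second version of the question's code does; as far as I can tell, with the unsigned radius of the first version the comparison x <= radius is done in unsigned arithmetic, so the loop body never runs for negative x.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Collects every integer point within `radius` of (cx, cy).
// Plain C++ stand-in for the add-on's nested loops; no V8 handles involved.
std::vector<std::pair<int, int>> points_in_disc(int cx, int cy, int radius) {
    assert(radius >= 0);
    std::vector<std::pair<int, int>> out;
    // radius is a signed int on purpose, so -radius and x <= radius behave as expected.
    for (int x = -radius; x <= radius; ++x)
        for (int y = -radius; y <= radius; ++y)
            if (x * x + y * y <= radius * radius)
                out.emplace_back(cx + x, cy + y);  // the "push" step
    return out;
}
```

In the add-on itself, each pair would still have to be turned into a two-element v8::Array, as the second version of the question's code already does.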

Multiply matrices using dgemv and multithreads in c

I have a problem in my code. I want to multiply 2 matrices using dgemv from cblas, but I want to distribute the operations across the threads I have. I have also used dgemv to multiply the matrices in a previous exercise where no parallelism was needed. Any idea what I should do?
The code:
for (it = 0; it < itime; it++) {
cblas_dgemv(CblasColMajor,CblasNoTrans,n,n, 1 , sigma, n, u , 1, 0.0 , d, 1);
#pragma omp parallel for private(i,j,sum) schedule(static)
for (i = 0; i < n; i++) {
sum = 0.0;
uplus[i] = u[i] + dtmu - dt * u[i];
#pragma omp simd reduction(+:sum)
for (j = 0; j < n; j++) {
sum += sigma[i*n+j]*u[j];
}
sum = sum - u[i]*m[i];
uplus[i] += dtdiv * sum;
if (uplus[i] > uth) {
uplus[i] = 0.0;
if (it >= ttransient) {
omega1[i] += 1.0;
}
}
}
t = u;
u = uplus;
uplus = t;
}
I want to move the dgemv call into the parallel region and somehow split the multiplication across the threads I have.
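One common way to split a matrix-vector product is by row blocks: each thread computes the rows of d it owns. The sketch below only illustrates that partitioning in plain C++, with a serial loop over worker ids standing in for the threads; RowRange, rows_for, matvec_block and blocked_matvec are hypothetical names. It assumes sigma is stored row-major, as the inner loop's sigma[i*n+j] suggests (note that the dgemv call in the question passes CblasColMajor, which describes a different layout unless sigma is symmetric).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct RowRange { std::size_t begin, end; };  // half-open [begin, end)

// Rows owned by worker `tid` out of `nworkers`; the remainder rows are
// spread over the first workers so block sizes differ by at most one.
RowRange rows_for(std::size_t n, std::size_t nworkers, std::size_t tid) {
    assert(nworkers > 0 && tid < nworkers);
    const std::size_t base = n / nworkers, extra = n % nworkers;
    const std::size_t begin = tid * base + std::min(tid, extra);
    return {begin, begin + base + (tid < extra ? 1 : 0)};
}

// d[begin..end) = sigma[begin..end, :] * u for a row-major n x n sigma.
// With CBLAS, this is where a per-block dgemv on the sub-matrix starting at
// sigma + r.begin * n (writing to d + r.begin) would presumably go.
void matvec_block(const std::vector<double>& sigma, const std::vector<double>& u,
                  std::vector<double>& d, std::size_t n, RowRange r) {
    for (std::size_t i = r.begin; i < r.end; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j) sum += sigma[i * n + j] * u[j];
        d[i] = sum;
    }
}

// Assembles the full product block by block. Inside an OpenMP parallel
// region, each thread would run one iteration of this loop with its own tid.
std::vector<double> blocked_matvec(const std::vector<double>& sigma,
                                   const std::vector<double>& u,
                                   std::size_t nworkers) {
    const std::size_t n = u.size();
    assert(sigma.size() == n * n);
    std::vector<double> d(n, 0.0);
    for (std::size_t tid = 0; tid < nworkers; ++tid)
        matvec_block(sigma, u, d, n, rows_for(n, nworkers, tid));
    return d;
}
```

Because the blocks write disjoint parts of d, no synchronisation is needed between them beyond a barrier before d is read.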

RcppArmadillo Cube::operator() : index out of bounds

I have been fiddling with the following C++ code for integration with R code that I have written (too much to include here), but keep getting an error that the Cube::operator() index is out of bounds and I am unsure as to why this is occurring. My suspicion is that the 3D array is not being filled correctly as described in
making 3d array with arma::cube in Rcpp shows cube error
but I am uncertain how to properly solve the issue.
Below is my full C++ code:
// [[Rcpp::depends(RcppArmadillo)]]
#define ARMA_DONT_PRINT_OPENMP_WARNING
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
#include <set>
using namespace Rcpp;
int sample_one(int n) {
return n * unif_rand();
}
int sample_n_distinct(const IntegerVector& x,
int k,
const int * pop_ptr) {
IntegerVector ind_index = RcppArmadillo::sample(x, k, false);
std::set<int> distinct_container;
for (int i = 0; i < k; i++) {
distinct_container.insert(pop_ptr[ind_index[i]]);
}
return distinct_container.size();
}
// [[Rcpp::export]]
arma::Cube<int> fillCube(const arma::Cube<int>& pop,
const IntegerVector& specs,
int perms,
int K) {
int num_specs = specs.size();
arma::Cube<int> res(perms, num_specs, K);
IntegerVector specs_C = specs - 1;
const int * pop_ptr;
int i, j, k;
for (i = 0; i < K; i++) {
for (k = 0; k < num_specs; k++) {
for (j = 0; j < perms; j++) {
pop_ptr = &(pop(0, sample_one(perms), sample_one(K)));
res(j, k, i) = sample_n_distinct(specs_C, k + 1, pop_ptr);
}
}
}
return res;
}
Does someone have an idea as to what may be producing the said error?
Below is the R code with a call to the C++ function (including a commented-out triply-nested 'for' loop that the C++ code reproduces).
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
sample(haps, size = num.specs, replace = TRUE, prob = probs)
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
pop[,, i] <- replicate(perms, gen.perms())
}
## Make a matrix to hold individuals from each permutation ##
# HAC.mat <- array(dim = c(perms, num.specs, K))
## Perform haplotype accumulation ##
# for (k in specs) {
# for (j in 1:perms) {
# for (i in 1:K) {
# select.perm <- sample(1:nrow(pop), size = 1, replace = TRUE) # randomly sample a permutation
# ind.index <- sample(specs, size = k, replace = FALSE) # randomly sample individuals
# select.subpop <- sample(i, size = 1, replace = TRUE) # randomly sample a subpopulation
# hap.plot <- pop[select.perm, ind.index, select.subpop] # extract data
# HAC.mat[j, k, i] <- length(unique(hap.plot)) # how many haplotypes are recovered
# }
# }
# }
HAC.mat <- fillCube(pop, specs, perms, K)

This is an out-of-bounds error. The gist of the problem is the call
pop_ptr = &(pop(0, sample_one(perms), sample_one(K)));
since
sample_one(perms)
is used as an index into the dimension whose maximum length is num_specs. This can be seen from how res is defined:
arma::Cube<int> res(perms, num_specs, K);
Thus, moving perms out of the num_specs position should resolve the issue.
// [[Rcpp::export]]
arma::Cube<int> fillCube(const arma::Cube<int>& pop,
const IntegerVector& specs,
int perms,
int K) {
int num_specs = specs.size();
arma::Cube<int> res(perms, num_specs, K);
IntegerVector specs_C = specs - 1;
const int * pop_ptr;
int i, j, k;
for (i = 0; i < K; i++) {
for (k = 0; k < num_specs; k++) {
for (j = 0; j < perms; j++) {
// swapped location
pop_ptr = &(pop(sample_one(perms), 0, sample_one(K)));
// should the middle index be 0?
res(j, k, i) = sample_n_distinct(specs_C, k + 1, pop_ptr);
}
}
}
return res;
}
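On the open question in the comment about the middle index: Armadillo stores a cube column-major, slice after slice, so consecutive memory positions walk down a column (the row index changes fastest). The sketch below spells out that offset arithmetic; cube_offset and column_stride are hypothetical helpers written for illustration, not Armadillo functions.

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of element (row, col, slice) in a column-major cube with
// n_rows rows and n_cols columns per slice.
std::size_t cube_offset(std::size_t row, std::size_t col, std::size_t slice,
                        std::size_t n_rows, std::size_t n_cols) {
    return row + col * n_rows + slice * n_rows * n_cols;
}

// Distance in memory between (row, col, slice) and (row, col + 1, slice).
std::size_t column_stride(std::size_t n_rows, std::size_t n_cols) {
    return cube_offset(0, 1, 0, n_rows, n_cols) - cube_offset(0, 0, 0, n_rows, n_cols);
}
```

Under this layout, pop_ptr[ind_index[i]] starting from &pop(row, 0, slice) steps through rows rather than across the columns of one row. If the intent is to mirror the R expression pop[select.perm, ind.index, select.subpop], it may be safer to index pop(row, ind_index[i], slice) directly, or to step the pointer by n_rows per column.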

Hybrid MPI+OpenMP Vs MPI Performance

I am converting a 3-D Jacobi solver from pure MPI to Hybrid MPI+OpenMP. I have a 192x192x192 array which is divided among 24 processes in Pure MPI in 1-D decomposition i.e. each process has 192/24 x 192 x 192 = 8 x 192 x 192 slab of data. Now I do :
for(i=0 ; i <= 7; i++)
for(j=0; j<= 191; j++)
for(k=0; k<= 191; k++)
{
unew[i][j][k] = 1/6.0 * (u[i+1][j][k]+u[i-1][j][k]+
u[i][j+1][k]+u[i][j-1][k]+
u[i][j][k+1]+u[i][j][k-1]);
}
This update takes around 60 seconds for each process.
Now with hybrid MPI+OpenMP, I run two processes (1 process per socket, using --bind-to socket --map-by socket, and OMP_PROC_PLACES=cores with OMP_PROC_BIND=close). I create 12 threads per MPI process (i.e. 12 threads per socket or processor). Now each MPI process has an array of size 192/2 x 192 x 192 = 96 x 192 x 192 elements. Each thread works on a 96/12 x 192 x 192 = 8 x 192 x 192 portion of the array owned by its process. I do the same triple-loop update using threads, but the time is approximately 76 seconds for each thread. The load balance is perfect in both cases. What could be the possible causes of the performance degradation? Is it false sharing, because threads could be invalidating cache lines close to each other's chunk of data? If yes, how do I reduce this performance degradation? (I have purposely not mentioned ghost data, but initially I am NOT overlapping communication with computation.)
In response to the comments below, I am posting the code. Apologies for the long MWE, but you can very safely ignore (1) header file declarations, (2) variable declarations, (3) the memory allocation routine, (4) formation of the Cartesian topology, (5) setting boundary conditions in parallel using an OpenMP parallel region, (6) declaration of the MPI_Type_subarray datatypes, and (7) the MPI_Isend() and MPI_Irecv() calls, and just concentrate on (a) the INDEPENDENT UPDATE OpenMP parallel region and (b) the independent_update(...) routine being called from there.
/* IGNORE THIS PORTION */
#include<mpi.h>
#include<omp.h>
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define MIN(a,b) ((a) < (b) ? (a) : (b))
#define Tol 0.00001
/* IGNORE THIS ROUTINE */
void input(int *X, int *Y, int *Z)
{
int a=193, b=193, c=193;
*X = a;
*Y = b;
*Z = c;
}
/* IGNORE THIS ROUTINE */
float*** allocate_mem(int X, int Y, int Z)
{
int i,j;
float ***matrix;
float *arr;
arr = (float*)calloc(X*Y*Z, sizeof(float));
matrix = (float***)calloc(X, sizeof(float**));
for(i = 0 ; i<= X-1; i++)
matrix[i] = (float**)calloc(Y, sizeof(float*));
for(i = 0 ; i <= X-1; i++)
for(j=0; j<= Y-1; j++)
matrix[i][j] = &(arr[i*Y*Z + j*Z]);
return matrix ;
}
/* THIS ROUTINE IS IMPORTANT */
float independent_update(float ***old, float ***new, int NX, int NY, int NZ, int tID, int chunk)
{
int i,j,k, start, end;
float error = 0.0;
float diff;
start = tID * chunk + 1;
end = MIN( (tID+1)*chunk, NX-2 );
for(i = start; i <= end ; i++)
{
for(j = 1; j<= NY-2; j++)
{
#pragma omp simd private(diff) reduction(max:error)
for(k = 1; k<= NZ-2; k++)
{
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > error)
error = diff;
}
}
}
return error;
}
int main(int argc, char *argv[])
{
/* IGNORE VARIABLE DECLARATION */
int size, rank; //Size of old_comm and rank of process
int i, j, k,l; //General loop variables
MPI_Comm old_comm, new_comm; //MPI_COMM_WORLD handle and for MPI_Cart_create()
int N[3]; //For taking input of size of matrix from user
int P; //Represent number of processes i.e. same as size
int dims[3]; //For dimensions of Cartesian topology
int PX, PY, PZ; //X dim, Y dim, Z dim of each process
float ***old, ***new, ***temp; //Matrices for results dimensions is (Px+2)*(PY+2)*(PZ+2)
int period[3]; //Periodicity for each dimension
int reorder; //Whether processes should be reordered in new cartesian topology
int ndims; //Number of dimensions (which is 3)
int Z_TOWARDS_U, Z_AWAY_U; //Z neighbour towards you and away from you (Z const)
int X_DOWN, X_UP; //Below plane and above plane (X const)
int Y_LEFT, Y_RIGHT; //Left plane and right plane (Y const)
int coords[3]; //Finding coordinates of processes
int dimension; //Used in MPI_Cart_shift() , values = 0, 1,2
int displacement; //Used in MPI_Cart_shift(), values will be +1 to find immediate neighbours
float l_max_err; //Local maximum error on process
float l_max_err_new; //For dependent faces.
float G_max_err = 1.0; //Maximum error for stopping criterion
int iterations = 0 ; //Counting number of iterations
MPI_Request send[6], recv[6]; //For MPI_Isend and MPI_Irecv
int start[3]; //Start will be defined in MPI_Isend() and MPI_Irecv()
int gsize[3]; //Defining global size of subarray
MPI_Datatype x_subarray; //For sending X_UP and X_DOWN
int local_x[3]; //Defining local plane size for X_UP/X_DOWN
MPI_Datatype y_subarray; //For sending Y_LEFT and Y_RIGHT
int local_y[3]; //Defining local plane for Y_LEFT/Y_RIGHT
MPI_Datatype z_subarray; //For sending Z_TOWARDS_U and Z_AWAY_U
int local_z[3]; //Defining local plan size for XY plane i.e. where Z=0
double strt, end; //For measuring time
double strt1, end1, delta1; //For measuring trivial time 1
double strt2, end2, delta2; //For measuring trivial time 2
double t_i_strt, t_i_end, t_i_sum=0; //Time for independent computational kernel
double t_up_strt, t_up_end, t_up_sum=0; //Time for X_UP
double t_down_strt, t_down_end, t_down_sum=0; //Time for X_DOWN
double t_left_strt, t_left_end, t_left_sum=0; //Time for Y_LEFT
double t_right_strt, t_right_end, t_right_sum=0; //Time for Y_RIGHT
double t_towards_strt, t_towards_end, t_towards_sum=0; //For Z_TOWARDS_U
double t_away_strt, t_away_end, t_away_sum=0; //For Z_AWAY_U
double t_comm_strt, t_comm_end, t_comm_sum=0; //Time comm + independent update (need to subtract to get comm time)
double t_setup_strt,t_setup_end; //Set-up start and end time
double t_allred_strt,t_allred_end,t_allred_total=0.0; //Measuring Allreduce time separately.
int threadID; //ID of a thread
int nthreads; //Total threads in OpenMP region
int chunk; //chunk - used to calculate iterations of a thread
/* IGNORE MPI STARTUP ETC */
MPI_Init(&argc, &argv);
t_setup_strt = MPI_Wtime();
old_comm = MPI_COMM_WORLD;
MPI_Comm_size(old_comm, &size);
MPI_Comm_rank(old_comm, &rank);
P = size;
if(rank == 0)
{
input(&N[0], &N[1], &N[2]);
}
MPI_Bcast(N, 3, MPI_INT, 0, old_comm);
dims[0] = 0;
dims[1] = 0;
dims[2] = 0;
period[0] = period[1] = period[2] = 0; //All dimensions aperiodic
reorder = 0 ; //No reordering of ranks in new_comm
ndims = 3;
MPI_Dims_create(P,ndims,dims);
MPI_Cart_create(old_comm, ndims, dims, period, reorder, &new_comm);
if( (N[0]-1) % dims[0] == 0 && (N[1]-1) % dims[1] == 0 && (N[2]-1) % dims[2] == 0 )
{
PX = (N[0]-1)/dims[0]; //Rows of unknowns each process gets
PY = (N[1]-1)/dims[1]; //Columns of unknowns each process gets
PZ = (N[2]-1)/dims[2]; //Depth of unknowns each process gets
}
old = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
new = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
dimension = 0;
displacement = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &X_UP, &X_DOWN); //Find UP and DOWN neighbours
dimension = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &Y_LEFT, &Y_RIGHT); //Find LEFT and RIGHT neighbours
dimension = 2;
MPI_Cart_shift(new_comm, dimension, displacement, &Z_TOWARDS_U, &Z_AWAY_U); //Find TOWARDS and AWAY neighbours
/* IGNORE BOUNDARY SETUPS FOR PDE */
#pragma omp parallel for default(none) shared(old,new,PX,PY,PZ) private(i,j,k) schedule(static)
for(i = 0; i <= PX+1; i++)
{
for(j = 0; j <= PY+1; j++)
{
for(k = 0; k <= PZ+1; k++)
{
old[i][j][k] = 0.0;
new[i][j][k] = 0.0;
}
}
}
#pragma omp parallel default(none) shared(X_DOWN,X_UP,Y_LEFT,Y_RIGHT,Z_TOWARDS_U,Z_AWAY_U,old,new,PX,PY,PZ) private(i,j,k,threadID,nthreads)
{
threadID = omp_get_thread_num();
nthreads = omp_get_num_threads();
if(threadID == 0)
{
if(X_DOWN == MPI_PROC_NULL) //X is constant here, this is YZ upper plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1 ; k<= PZ ; k++)
{
old[0][j][k] = 1;
new[0][j][k] = 1; //Set boundaries in new also
}
}
}
if(threadID == (nthreads-1))
{
if(X_UP == MPI_PROC_NULL) //YZ lower plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1; k<= PZ ; k++)
{
old[PX+1][j][k] = 1;
new[PX+1][j][k] = 1;
}
}
}
if(Y_LEFT == MPI_PROC_NULL) //Y is constant, this is left XZ plane, possibly can use collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(k = 1; k<= PZ; k++)
{
old[i][0][k] = 1;
new[i][0][k] = 1;
}
}
if(Y_RIGHT == MPI_PROC_NULL) //XZ right plane, again collapse(2) potential
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(k = 1; k<= PZ ; k++)
{
old[i][PY+1][k] = 1;
new[i][PY+1][k] = 1;
}
}
if(Z_TOWARDS_U == MPI_PROC_NULL) //Z is constant here, towards you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][0] = 1;
new[i][j][0] = 1;
}
}
if(Z_AWAY_U == MPI_PROC_NULL) //Away from you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][PZ+1] = 1;
new[i][j][PZ+1] = 1;
}
}
}
/* IGNORE SUBARRAY DECLARATION */
gsize[0] = PX+2; //Global sizes of 3-D cubes for each process
gsize[1] = PY+2;
gsize[2] = PZ+2;
start[0] = 0; //Will specify starting location while sending/receiving
start[1] = 0;
start[2] = 0;
local_x[0] = 1;
local_x[1] = PY;
local_x[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_x, start, MPI_ORDER_C, MPI_FLOAT, &x_subarray);
MPI_Type_commit(&x_subarray);
local_y[0] = PX;
local_y[1] = 1;
local_y[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_y, start, MPI_ORDER_C, MPI_FLOAT, &y_subarray);
MPI_Type_commit(&y_subarray);
local_z[0] = PX;
local_z[1] = PY;
local_z[2] = 1;
MPI_Type_create_subarray(ndims, gsize, local_z, start, MPI_ORDER_C, MPI_FLOAT, &z_subarray);
MPI_Type_commit(&z_subarray);
t_setup_end = MPI_Wtime();
strt = MPI_Wtime();
while(G_max_err > Tol) //iterations < ITERATIONS)
{
iterations++ ;
t_comm_strt = MPI_Wtime();
/* IGNORE MPI COMMUNICATION */
MPI_Irecv(&old[0][1][1], 1, x_subarray, X_DOWN, 10, new_comm, &recv[0]);
MPI_Irecv(&old[PX+1][1][1], 1, x_subarray, X_UP, 20, new_comm, &recv[1]);
MPI_Irecv(&old[1][PY+1][1], 1, y_subarray, Y_RIGHT, 30, new_comm, &recv[2]);
MPI_Irecv(&old[1][0][1], 1, y_subarray, Y_LEFT, 40, new_comm, &recv[3]);
MPI_Irecv(&old[1][1][PZ+1], 1, z_subarray, Z_AWAY_U, 50, new_comm, &recv[4]);
MPI_Irecv(&old[1][1][0], 1, z_subarray, Z_TOWARDS_U, 60, new_comm, &recv[5]);
MPI_Isend(&old[PX][1][1], 1, x_subarray, X_UP, 10, new_comm, &send[0]);
MPI_Isend(&old[1][1][1], 1, x_subarray, X_DOWN, 20, new_comm, &send[1]);
MPI_Isend(&old[1][1][1], 1, y_subarray, Y_LEFT, 30, new_comm, &send[2]);
MPI_Isend(&old[1][PY][1], 1, y_subarray, Y_RIGHT, 40, new_comm, &send[3]);
MPI_Isend(&old[1][1][1], 1, z_subarray, Z_TOWARDS_U, 50, new_comm, &send[4]);
MPI_Isend(&old[1][1][PZ], 1, z_subarray, Z_AWAY_U, 60, new_comm, &send[5]);
MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
t_comm_end = MPI_Wtime();
t_comm_sum = t_comm_sum + (t_comm_end - t_comm_strt);
/* Use threads in Independent update */
t_i_strt = MPI_Wtime();
l_max_err = 0.0; //Very important, Reduction result is combined with this !
/* THIS IS THE IMPORTANT REGION */
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,chunk) private(threadID,nthreads) reduction(max:l_max_err)
{
nthreads = omp_get_num_threads();
threadID = omp_get_thread_num();
chunk = (PX-1+1) / nthreads ;
l_max_err = independent_update(old, new, PX+2, PY+2, PZ+2, threadID, chunk);
}
t_i_end = MPI_Wtime();
t_i_sum = t_i_sum + (t_i_end - t_i_strt) ;
/* IGNORE THE REMAINING CODE */
t_allred_strt = MPI_Wtime();
MPI_Allreduce(&l_max_err, &G_max_err, 1, MPI_FLOAT, MPI_MAX, new_comm);
t_allred_end = MPI_Wtime();
t_allred_total = t_allred_total + (t_allred_end - t_allred_strt);
temp = new ;
new = old;
old = temp;
}
MPI_Barrier(new_comm);
end = MPI_Wtime();
if( rank == 0)
{
printf("\nIterations = %d, G_max_err = %f", iterations, G_max_err);
printf("\nThe total SET-UP time for MPI and boundary conditions is %lf", (t_setup_end-t_setup_strt));
printf("\nThe total time for SOLVING is %lf", (end-strt));
printf("\nThe total time for INDEPENDENT COMPUTE %lf", t_i_sum);
printf("\nThe total time for COMMUNICATION OVERHEAD is %lf", t_comm_sum);
printf("\nThe total time for MPI_ALLREDUCE() is %lf", t_allred_total);
}
MPI_Type_free(&x_subarray);
MPI_Type_free(&y_subarray);
MPI_Type_free(&z_subarray);
free(&old[0][0][0]);
free(&new[0][0][0]);
MPI_Finalize();
return 0;
}
P.S. : I am almost sure that the cost of spawning/waking the threads is not the reason for such a huge difference in the timing.
Please find attached Scalasca snapshot for INDEPENDENT COMPUTE of the Hybrid Program.
Using loop simd construct
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,l_max_err) private(i,j,k,diff)
{
#pragma omp for simd schedule(static) reduction(max:l_max_err)
for(i = 1; i <= PX ; i++)
{
for(j = 1; j<= PY; j++)
{
for(k = 1; k<= PZ; k++)
{
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > l_max_err)
l_max_err = diff;
}
}
}
}

You frequently get memory access and cache issues when you run just one MPI process per socket on a CPU with multiple memory controllers. They can be on either the read or the write side, so you can't really say which. This is especially an issue when doing thread-parallel execution with lightweight compute tasks (e.g. math on arrays). One MPI process per socket in this case tends to fare significantly worse than pure MPI.
In your BIOS, set up whatever the maximal NUMA per socket option is.
Use one MPI process per NUMA node.
Try some different parameter values in schedule(static). I've rarely found the default to be best.
Essentially, what this will do is ensure each bundle of threads only works on a single pool of memory.
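When experimenting with the chunk parameter, it can help to see which thread ends up with which iterations. With schedule(static, chunk), OpenMP deals chunks of that size out to the threads round-robin in thread-number order. The sketch below just reproduces that mapping in plain C++ (owner_static_chunked and iterations_per_thread are hypothetical helper names); it does not cover schedule(static) without a chunk size, where the exact split is left more to the implementation.

```cpp
#include <cassert>
#include <vector>

// Thread that executes iteration `iter` under schedule(static, chunk) with
// `nthreads` threads: chunk number modulo thread count (round-robin).
int owner_static_chunked(int iter, int chunk, int nthreads) {
    assert(iter >= 0 && chunk > 0 && nthreads > 0);
    return (iter / chunk) % nthreads;
}

// Number of iterations in [0, total) that each thread ends up with.
std::vector<int> iterations_per_thread(int total, int chunk, int nthreads) {
    std::vector<int> counts(nthreads, 0);
    for (int it = 0; it < total; ++it)
        ++counts[owner_static_chunked(it, chunk, nthreads)];
    return counts;
}
```

For the 96 rows and 12 threads in the question, a chunk of 8 gives each thread one contiguous slab, while a chunk of 1 interleaves rows across threads; both are balanced, but they touch memory in very different patterns.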

CodeJam 2014: How to solve task "New Lottery Game"?

I want to know an efficient approach to the New Lottery Game problem.
The Lottery is changing! The Lottery used to have a machine to generate a random winning number. But due to cheating problems, the Lottery has decided to add another machine. The new winning number will be the result of the bitwise-AND operation between the two random numbers generated by the two machines.
To find the bitwise-AND of X and Y, write them both in binary; then a bit in the result in binary has a 1 if the corresponding bits of X and Y were both 1, and a 0 otherwise. In most programming languages, the bitwise-AND of X and Y is written X&Y.
For example:
The old machine generates the number 7 = 0111.
The new machine generates the number 11 = 1011.
The winning number will be (7 AND 11) = (0111 AND 1011) = 0011 = 3.
With this measure, the Lottery expects to reduce the cases of fraudulent claims, but unfortunately an employee from the Lottery company has leaked the following information: the old machine will always generate a non-negative integer less than A and the new one will always generate a non-negative integer less than B.
Catalina wants to win this lottery and to give it a try she decided to buy all non-negative integers less than K.
Given A, B and K, Catalina would like to know in how many different ways the machines can generate a pair of numbers that will make her a winner.
For the small input we can check all possible pairs, but how can this be done for the large input? My guess is to first represent the numbers as binary strings and then count the combinations whose bitwise-AND is less than K. But I can't figure out how to count the possible combinations of two binary strings.

I used a general DP technique that I described in a lot of detail in another answer.
We want to count the pairs (a, b) such that a < A, b < B and a & b < K.
The first step is to convert the numbers to binary and to pad them to the same size by adding leading zeroes. I just padded them to a fixed size of 40. The idea is to build up the valid a and b bit by bit.
Let f(i, loA, loB, loK) be the number of valid suffix pairs of a and b of size 40 - i. If loA is true, the prefix up to i is already strictly smaller than the corresponding prefix of A, so there is no restriction on the next bit of a. If loA is false, A[i] is an upper bound on the next bit we can place at the end of the current prefix. loB and loK have analogous meanings.
Now we have the following transition:
long long f(int i, bool loA, bool loB, bool loK) {
    // TODO add memoization
    if (i == 40)
        return loA && loB && loK;
    int hiA = loA ? 1: A[i]-'0'; // upper bound on the next bit in a
    int hiB = loB ? 1: B[i]-'0'; // upper bound on the next bit in b
    int hiK = loK ? 1: K[i]-'0'; // upper bound on the next bit in a & b
    long long res = 0;
    for (int a = 0; a <= hiA; ++a)
        for (int b = 0; b <= hiB; ++b) {
            int k = a & b;
            if (k > hiK) continue;
            res += f(i+1, loA || a < A[i]-'0',
                          loB || b < B[i]-'0',
                          loK || k < K[i]-'0');
        }
    return res;
}
The result is f(0, false, false, false).
The runtime is O(max(log A, log B)) if memoization is added to ensure that every subproblem is only solved once.
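For completeness, here is a minimal sketch of the same recurrence with the memoization filled in. It departs from the snippet above in a few labeled ways: it reads bits straight from unsigned integers instead of padded strings, uses 31 bits (assuming all inputs are below 2^31), and the names BITS, bit_at, solve, count_pairs and count_pairs_brute are mine. The brute-force counter is only there to cross-check small cases.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace {
constexpr int BITS = 31;  // assumption: A, B, K < 2^31
std::int64_t memo[BITS + 1][2][2][2];
std::uint32_t gA, gB, gK;

// Bit of v at position i, where i = 0 is the most significant of BITS bits.
int bit_at(std::uint32_t v, int i) { return (v >> (BITS - 1 - i)) & 1; }

std::int64_t solve(int i, bool loA, bool loB, bool loK) {
    // All three values must end up strictly below their bounds.
    if (i == BITS) return (loA && loB && loK) ? 1 : 0;
    std::int64_t& slot = memo[i][loA][loB][loK];
    if (slot >= 0) return slot;  // already computed
    const int hiA = loA ? 1 : bit_at(gA, i);
    const int hiB = loB ? 1 : bit_at(gB, i);
    const int hiK = loK ? 1 : bit_at(gK, i);
    std::int64_t res = 0;
    for (int a = 0; a <= hiA; ++a)
        for (int b = 0; b <= hiB; ++b) {
            const int k = a & b;
            if (k > hiK) continue;
            res += solve(i + 1, loA || a < bit_at(gA, i),
                                loB || b < bit_at(gB, i),
                                loK || k < bit_at(gK, i));
        }
    slot = res;
    return res;
}
}  // namespace

// Number of pairs (a, b) with a < A, b < B and (a & b) < K.
std::int64_t count_pairs(std::uint32_t A, std::uint32_t B, std::uint32_t K) {
    gA = A; gB = B; gK = K;
    std::memset(memo, -1, sizeof memo);  // every entry becomes -1 ("unknown")
    return solve(0, false, false, false);
}

// O(A * B) reference implementation for checking small inputs.
std::int64_t count_pairs_brute(std::uint32_t A, std::uint32_t B, std::uint32_t K) {
    std::int64_t count = 0;
    for (std::uint32_t a = 0; a < A; ++a)
        for (std::uint32_t b = 0; b < B; ++b)
            if ((a & b) < K) ++count;
    return count;
}
```

Each of the 31 * 8 states is computed once and does at most four recursive lookups, which is where the logarithmic running time comes from.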

What I did was just identify when the answer is A * B.
Otherwise, I brute-force the rest. This code passed the large input.
// for each test cases
long count = 0;
if ((K > A) || (K > B)) {
count = A * B;
continue; // print count and go to the next test case
}
count = A * B - (A-K) * (B-K);
for (int i = K; i < A; i++) {
for (int j = K; j < B; j++) {
if ((i&j) < K) count++;
}
}
I hope this helps!

Just as Niklas B. said.
The whole answer is:
#include <algorithm>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
#define MAX_SIZE 32
int A, B, K;
int arr_a[MAX_SIZE];
int arr_b[MAX_SIZE];
int arr_k[MAX_SIZE];
bool flag [MAX_SIZE][2][2][2];
long long matrix[MAX_SIZE][2][2][2];
long long
get_result();
int main(int argc, char *argv[])
{
int case_amount = 0;
cin >> case_amount;
for (int i = 0; i < case_amount; ++i)
{
const long long result = get_result();
cout << "Case #" << 1 + i << ": " << result << endl;
}
return 0;
}
long long
dp(const int h,
const bool can_A_choose_1,
const bool can_B_choose_1,
const bool can_K_choose_1)
{
if (MAX_SIZE == h)
return can_A_choose_1 && can_B_choose_1 && can_K_choose_1;
if (flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1])
return matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1];
int cnt_A_max = arr_a[h];
int cnt_B_max = arr_b[h];
int cnt_K_max = arr_k[h];
if (can_A_choose_1)
cnt_A_max = 1;
if (can_B_choose_1)
cnt_B_max = 1;
if (can_K_choose_1)
cnt_K_max = 1;
long long res = 0;
for (int i = 0; i <= cnt_A_max; ++i)
{
for (int j = 0; j <= cnt_B_max; ++j)
{
int k = i & j;
if (k > cnt_K_max)
continue;
res += dp(h + 1,
can_A_choose_1 || (i < cnt_A_max),
can_B_choose_1 || (j < cnt_B_max),
can_K_choose_1 || (k < cnt_K_max));
}
}
flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = true;
matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = res;
return res;
}
long long
get_result()
{
cin >> A >> B >> K;
memset(arr_a, 0, sizeof(arr_a));
memset(arr_b, 0, sizeof(arr_b));
memset(arr_k, 0, sizeof(arr_k));
memset(flag, 0, sizeof(flag));
memset(matrix, 0, sizeof(matrix));
int i = 31;
while (i >= 1)
{
arr_a[i] = A % 2;
A /= 2;
arr_b[i] = B % 2;
B /= 2;
arr_k[i] = K % 2;
K /= 2;
i--;
}
return dp(1, 0, 0, 0);
}
