Not able to set processor affinity - multithreading

I'm trying to run this code on an 8-core cluster node. It has 2 sockets, each with 4 cores. I create 8 threads and set their affinity using the pthread_attr_setaffinity_np function. But when I look at my run in VTune, it shows that around 3969 threads are being created. I don't understand why or how! On top of that, my performance is exactly the same as when no affinity was set (OS thread scheduling). Can someone please help me debug this problem? My code runs perfectly fine, but I have no control over the threads! Thanks in advance.
--------------------------------------CODE-------------------------------------------
const int num_thrd=8;
bool RCTAlgorithmBackprojection(RabbitCtGlobalData* r)
{
float O_L = r->O_L;
float R_L = r->R_L;
double* A_n = r->A_n;
float* I_n = r->I_n;
float* f_L = r->f_L;
copyStruct threadCopy[num_thrd]; // per-thread argument copies (copyStruct is defined elsewhere in the original code)
cpu_set_t cpu[num_thrd];
pthread_t thread[num_thrd];
pthread_attr_t attr[num_thrd];
for(int i =0; i< num_thrd; i++)
{
threadCopy[i].L = r->L;
threadCopy[i].O_L = r->O_L;
threadCopy[i].R_L = r->R_L;
threadCopy[i].A_n = r->A_n;
threadCopy[i].I_n = r->I_n;
threadCopy[i].f_L = r->f_L;
threadCopy[i].slice= i;
threadCopy[i].S_x = r->S_x;
threadCopy[i].S_y = r->S_y;
pthread_attr_init(&attr[i]);
CPU_ZERO(&cpu[i]);
CPU_SET(i, &cpu[i]);
pthread_attr_setaffinity_np(&attr[i], CPU_SETSIZE, &cpu[i]);
int rc=pthread_create(&thread[i], &attr[i], backProject, (void*)&threadCopy[i]);
if (rc!=0)
{
cout<<"Can't create thread\n"<<endl;
return -1;
}
// sleep(1);
}
for (int i = 0; i < num_thrd; i++) {
pthread_join(thread[i], NULL);
}
//s_rcgd = r;
return true;
}
void* backProject (void* parm)
{
copyStruct* s = (copyStruct*)parm; // retrieve the slice info
unsigned int L = s->L;
float O_L = s->O_L;
float R_L = s->R_L;
double* A_n = s->A_n;
float* I_n = s->I_n;
float* f_L = s->f_L;
int slice1 = s->slice;
//cout<<"The size of volume is L= "<<L<<endl;
int from = (slice1 * L) / num_thrd; // note that this 'slicing' works fine
int to = ((slice1+1) * L) / num_thrd; // even if L is not divisible by num_thrd
//cout<<"computing slice " << slice1<< " from row " << from<< " to " << to-1<<endl;
for (unsigned int k=from; k<to; k++)
{
double z = O_L + (double)k * R_L;
for (unsigned int j=0; j<L; j++)
{
double y = O_L + (double)j * R_L;
for (unsigned int i=0; i<L; i++)
{
double x = O_L + (double)i * R_L;
double w_n = A_n[2] * x + A_n[5] * y + A_n[8] * z + A_n[11];
double u_n = (A_n[0] * x + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n;
double v_n = (A_n[1] * x + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n;
f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * p_hat_n(u_n, v_n));
}
}
}
//cout<<" finished slice "<<slice1<<endl;
return NULL;
}

Alright, so I found out the reason was the CPU_SETSIZE I was passing as the second argument to pthread_attr_setaffinity_np. I replaced it with num_thrd. Apparently CPU_SETSIZE, which is only declared when __USE_GNU is defined, was not available in my file! Sorry if I bothered anyone who was trying to debug this. Thanks again!
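For reference, here is a minimal sketch (not from the original post) of the conventional call on GNU/Linux: the second argument of pthread_attr_setaffinity_np is the size in bytes of the CPU set, usually sizeof(cpu_set_t), whereas CPU_SETSIZE counts bits (CPUs, 1024 in glibc), not bytes. The worker function and core_id array here are just placeholders; the sketch pins thread i to core i.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

const int num_thrd = 8;

void* worker(void* arg)
{
    int core = *static_cast<int*>(arg);
    std::printf("thread intended for core %d\n", core);
    return NULL;
}

int main()
{
    pthread_t thread[num_thrd];
    pthread_attr_t attr[num_thrd];
    cpu_set_t cpu[num_thrd];
    int core_id[num_thrd];
    for (int i = 0; i < num_thrd; i++) {
        core_id[i] = i;
        pthread_attr_init(&attr[i]);
        CPU_ZERO(&cpu[i]);
        CPU_SET(i, &cpu[i]);
        // Pass the size in bytes of the affinity mask, per the glibc man page.
        pthread_attr_setaffinity_np(&attr[i], sizeof(cpu_set_t), &cpu[i]);
        pthread_create(&thread[i], &attr[i], worker, &core_id[i]);
    }
    for (int i = 0; i < num_thrd; i++)
        pthread_join(thread[i], NULL);
    return 0;
}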

Related

Is implementing operators in one instruction really faster than implementing them in separate instructions?

I tried to measure the speed of the simple code below. The purpose is to find out which method is faster: one statement, x[j] = j * 2.0 + 1.0;, or two separate statements, x[j] = j * 2.0; x[j] += 1.0;?
#include <cstdio>
#include <ctime>
#include <iostream>

int main(int argc, char* argv[])
{
    int x[100000];
    std::clock_t start;
    start = std::clock();
    for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 100000; j++) {
            x[j] = j * 2.0 + 1.0;
            //x[j] = j * 2.0;
            //x[j] += 1.0;
        }
    }
    std::cout << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << std::endl;
    getchar();
    return 0;
}
The results showed that with only one statement (x[j] = j * 2.0 + 1.0;) it took around 3.5 s, while with two separate statements (x[j] = j * 2.0; x[j] += 1.0;) it took about 8 s.
Could anyone explain why the time difference is so big? Thanks in advance, all.
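For anyone re-running this measurement, here is a minimal timing sketch (not from the original post) using std::chrono; it prints one element of the result so the compiler cannot remove the loop entirely, which matters once optimizations are enabled.

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    std::vector<double> x(100000);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 100000; j++) {
            x[j] = j * 2.0 + 1.0;   // variant A: one statement
            // x[j] = j * 2.0;      // variant B: two statements
            // x[j] += 1.0;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    // Consume the result so the whole computation cannot be optimized away.
    std::cout << x[12345] << " "
              << std::chrono::duration<double>(stop - start).count() << " s\n";
    return 0;
}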

Hybrid MPI+OpenMP Vs MPI Performance

I am converting a 3-D Jacobi solver from pure MPI to hybrid MPI+OpenMP. I have a 192x192x192 array which is divided among 24 processes in pure MPI with a 1-D decomposition, i.e. each process has a 192/24 x 192 x 192 = 8 x 192 x 192 slab of data. Now I do:
for(i=0 ; i <= 7; i++)
for(j=0; j<= 191; j++)
for(k=0; k<= 191; k++)
{
unew[i][j][k] = 1/6.0 * (u[i+1][j][k]+u[i-1][j][k]+
u[i][j+1][k]+u[i][j-1][k]+
u[i][j][k+1]+u[i][j][k-1]);
}
This update takes around 60 seconds for each process.
Now with hybrid MPI, I run two processes (1 process per socket, --bind-to socket --map-by socket, and OMP_PROC_PLACES=cores with OMP_PROC_BIND=close). I create 12 threads per MPI process (i.e. 12 threads per socket or processor). Now each MPI process has an array of size 192/2 x 192 x 192 = 96 x 192 x 192 elements. Each thread works on a 96/12 x 192 x 192 = 8 x 192 x 192 portion of the array owned by each process. I do the same triple-loop update using threads, but the time is approximately 76 seconds for each thread. The load balance is perfect in both problems. What could be the possible causes of the performance degradation? Is it false sharing, because threads could be invalidating the cache lines close to each other's chunk of data? If yes, then how do I reduce this performance degradation? (I have purposefully not mentioned ghost data, but initially I am NOT overlapping communication with computation.)
In response to the comments below, I am posting the code. Apologies for the long MWE, but you can very safely ignore (1) header file declarations, (2) variable declarations, (3) the memory allocation routine, (4) formation of the Cartesian topology, (5) setting boundary conditions in parallel using an OpenMP parallel region, (6) declaration of the MPI_Type_subarray datatypes and (7) the MPI_Isend() and MPI_Irecv() calls, and just concentrate on (a) the INDEPENDENT UPDATE OpenMP parallel region and (b) the independent_update(...) routine being called from there.
/* IGNORE THIS PORTION */
#include<mpi.h>
#include<omp.h>
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define MIN(a,b) (a < b ? a : b)
#define Tol 0.00001
/* IGNORE THIS ROUTINE */
void input(int *X, int *Y, int *Z)
{
int a=193, b=193, c=193;
*X = a;
*Y = b;
*Z = c;
}
/* IGNORE THIS ROUTINE */
float*** allocate_mem(int X, int Y, int Z)
{
int i,j;
float ***matrix;
float *arr;
arr = (float*)calloc(X*Y*Z, sizeof(float));
matrix = (float***)calloc(X, sizeof(float**));
for(i = 0 ; i<= X-1; i++)
matrix[i] = (float**)calloc(Y, sizeof(float*));
for(i = 0 ; i <= X-1; i++)
for(j=0; j<= Y-1; j++)
matrix[i][j] = &(arr[i*Y*Z + j*Z]);
return matrix ;
}
/* THIS ROUTINE IS IMPORTANT */
float independent_update(float ***old, float ***new, int NX, int NY, int NZ, int tID, int chunk)
{
int i,j,k, start, end;
float error = 0.0;
float diff;
start = tID * chunk + 1;
end = MIN( (tID+1)*chunk, NX-2 );
for(i = start; i <= end ; i++)
{
for(j = 1; j<= NY-2; j++)
{
#pragma omp simd
for(k = 1; k<= NZ-2; k++)
{
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > error)
error = diff;
}
}
}
return error;
}
int main(int argc, char *argv[])
{
/* IGNORE VARIABLE DECLARATION */
int size, rank; //Size of old_comm and rank of process
int i, j, k,l; //General loop variables
MPI_Comm old_comm, new_comm; //MPI_COMM_WORLD handle and for MPI_Cart_create()
int N[3]; //For taking input of size of matrix from user
int P; //Represent number of processes i.e. same as size
int dims[3]; //For dimensions of Cartesian topology
int PX, PY, PZ; //X dim, Y dim, Z dim of each process
float ***old, ***new, ***temp; //Matrices for results dimensions is (Px+2)*(PY+2)*(PZ+2)
int period[3]; //Periodicity for each dimension
int reorder; //Whether processes should be reordered in new cartesian topology
int ndims; //Number of dimensions (which is 3)
int Z_TOWARDS_U, Z_AWAY_U; //Z neighbour towards you and away from you (Z const)
int X_DOWN, X_UP; //Below plane and above plane (X const)
int Y_LEFT, Y_RIGHT; //Left plane and right plane (Y const)
int coords[3]; //Finding coordinates of processes
int dimension; //Used in MPI_Cart_shift() , values = 0, 1,2
int displacement; //Used in MPI_Cart_shift(), values will be +1 to find immediate neighbours
float l_max_err; //Local maximum error on process
float l_max_err_new; //For dependent faces.
float G_max_err = 1.0; //Maximum error for stopping criterion
int iterations = 0 ; //Counting number of iterations
MPI_Request send[6], recv[6]; //For MPI_Isend and MPI_Irecv
int start[3]; //Start will be defined in MPI_Isend() and MPI_Irecv()
int gsize[3]; //Defining global size of subarray
MPI_Datatype x_subarray; //For sending X_UP and X_DOWN
int local_x[3]; //Defining local plane size for X_UP/X_DOWN
MPI_Datatype y_subarray; //For sending Y_LEFT and Y_RIGHT
int local_y[3]; //Defining local plane for Y_LEFT/Y_RIGHT
MPI_Datatype z_subarray; //For sending Z_TOWARDS_U and Z_AWAY_U
int local_z[3]; //Defining local plan size for XY plane i.e. where Z=0
double strt, end; //For measuring time
double strt1, end1, delta1; //For measuring trivial time 1
double strt2, end2, delta2; //For measuring trivial time 2
double t_i_strt, t_i_end, t_i_sum=0; //Time for independent computational kernel
double t_up_strt, t_up_end, t_up_sum=0; //Time for X_UP
double t_down_strt, t_down_end, t_down_sum=0; //Time for X_DOWN
double t_left_strt, t_left_end, t_left_sum=0; //Time for Y_LEFT
double t_right_strt, t_right_end, t_right_sum=0; //Time for Y_RIGHT
double t_towards_strt, t_towards_end, t_towards_sum=0; //For Z_TOWARDS_U
double t_away_strt, t_away_end, t_away_sum=0; //For Z_AWAY_U
double t_comm_strt, t_comm_end, t_comm_sum=0; //Time comm + independent update (need to subtract to get comm time)
double t_setup_strt,t_setup_end; //Set-up start and end time
double t_allred_strt,t_allred_end,t_allred_total=0.0; //Measuring Allreduce time separately.
int threadID; //ID of a thread
int nthreads; //Total threads in OpenMP region
int chunk; //chunk - used to calculate iterations of a thread
/* IGNORE MPI STARTUP ETC */
MPI_Init(&argc, &argv);
t_setup_strt = MPI_Wtime();
old_comm = MPI_COMM_WORLD;
MPI_Comm_size(old_comm, &size);
MPI_Comm_rank(old_comm, &rank);
P = size;
if(rank == 0)
{
input(&N[0], &N[1], &N[2]);
}
MPI_Bcast(N, 3, MPI_INT, 0, old_comm);
dims[0] = 0;
dims[1] = 0;
dims[2] = 0;
period[0] = period[1] = period[2] = 0; //All dimensions aperiodic
reorder = 0 ; //No reordering of ranks in new_comm
ndims = 3;
MPI_Dims_create(P,ndims,dims);
MPI_Cart_create(old_comm, ndims, dims, period, reorder, &new_comm);
if( (N[0]-1) % dims[0] == 0 && (N[1]-1) % dims[1] == 0 && (N[2]-1) % dims[2] == 0 )
{
PX = (N[0]-1)/dims[0]; //Rows of unknowns each process gets
PY = (N[1]-1)/dims[1]; //Columns of unknowns each process gets
PZ = (N[2]-1)/dims[2]; //Depth of unknowns each process gets
}
old = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
new = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
dimension = 0;
displacement = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &X_UP, &X_DOWN); //Find UP and DOWN neighbours
dimension = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &Y_LEFT, &Y_RIGHT); //Find UP and DOWN neighbours
dimension = 2;
MPI_Cart_shift(new_comm, dimension, displacement, &Z_TOWARDS_U, &Z_AWAY_U); //Find UP and DOWN neighbours
/* IGNORE BOUNDARY SETUPS FOR PDE */
#pragma omp parallel for default(none) shared(old,new,PX,PY,PZ) private(i,j,k) schedule(static)
for(i = 0; i <= PX+1; i++)
{
for(j = 0; j <= PY+1; j++)
{
for(k = 0; k <= PZ+1; k++)
{
old[i][j][k] = 0.0;
new[i][j][k] = 0.0;
}
}
}
#pragma omp parallel default(none) shared(X_DOWN,X_UP,Y_LEFT,Y_RIGHT,Z_TOWARDS_U,Z_AWAY_U,old,new,PX,PY,PZ) private(i,j,k,threadID,nthreads)
{
threadID = omp_get_thread_num();
nthreads = omp_get_num_threads();
if(threadID == 0)
{
if(X_DOWN == MPI_PROC_NULL) //X is constant here, this is YZ upper plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1 ; k<= PZ ; k++)
{
old[0][j][k] = 1;
new[0][j][k] = 1; //Set boundaries in new also
}
}
}
if(threadID == (nthreads-1))
{
if(X_UP == MPI_PROC_NULL) //YZ lower plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1; k<= PZ ; k++)
{
old[PX+1][j][k] = 1;
new[PX+1][j][k] = 1;
}
}
}
if(Y_LEFT == MPI_PROC_NULL) //Y is constant, this is left XZ plane, possibly can use collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(k = 1; k<= PZ; k++)
{
old[i][0][k] = 1;
new[i][0][k] = 1;
}
}
if(Y_RIGHT == MPI_PROC_NULL) //XZ right plane, again collapse(2) potential
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(k = 1; k<= PZ ; k++)
{
old[i][PY+1][k] = 1;
new[i][PY+1][k] = 1;
}
}
if(Z_TOWARDS_U == MPI_PROC_NULL) //Z is constant here, towards you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][0] = 1;
new[i][j][0] = 1;
}
}
if(Z_AWAY_U == MPI_PROC_NULL) //Away from you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][PZ+1] = 1;
new[i][j][PZ+1] = 1;
}
}
}
/* IGNORE SUBARRAY DECLARATION */
gsize[0] = PX+2; //Global sizes of 3-D cubes for each process
gsize[1] = PY+2;
gsize[2] = PZ+2;
start[0] = 0; //Will specify starting location while sending/receiving
start[1] = 0;
start[2] = 0;
local_x[0] = 1;
local_x[1] = PY;
local_x[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_x, start, MPI_ORDER_C, MPI_FLOAT, &x_subarray);
MPI_Type_commit(&x_subarray);
local_y[0] = PX;
local_y[1] = 1;
local_y[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_y, start, MPI_ORDER_C, MPI_FLOAT, &y_subarray);
MPI_Type_commit(&y_subarray);
local_z[0] = PX;
local_z[1] = PY;
local_z[2] = 1;
MPI_Type_create_subarray(ndims, gsize, local_z, start, MPI_ORDER_C, MPI_FLOAT, &z_subarray);
MPI_Type_commit(&z_subarray);
t_setup_end = MPI_Wtime();
strt = MPI_Wtime();
while(G_max_err > Tol) //iterations < ITERATIONS)
{
iterations++ ;
t_comm_strt = MPI_Wtime();
/* IGNORE MPI COMMUNICATION */
MPI_Irecv(&old[0][1][1], 1, x_subarray, X_DOWN, 10, new_comm, &recv[0]);
MPI_Irecv(&old[PX+1][1][1], 1, x_subarray, X_UP, 20, new_comm, &recv[1]);
MPI_Irecv(&old[1][PY+1][1], 1, y_subarray, Y_RIGHT, 30, new_comm, &recv[2]);
MPI_Irecv(&old[1][0][1], 1, y_subarray, Y_LEFT, 40, new_comm, &recv[3]);
MPI_Irecv(&old[1][1][PZ+1], 1, z_subarray, Z_AWAY_U, 50, new_comm, &recv[4]);
MPI_Irecv(&old[1][1][0], 1, z_subarray, Z_TOWARDS_U, 60, new_comm, &recv[5]);
MPI_Isend(&old[PX][1][1], 1, x_subarray, X_UP, 10, new_comm, &send[0]);
MPI_Isend(&old[1][1][1], 1, x_subarray, X_DOWN, 20, new_comm, &send[1]);
MPI_Isend(&old[1][1][1], 1, y_subarray, Y_LEFT, 30, new_comm, &send[2]);
MPI_Isend(&old[1][PY][1], 1, y_subarray, Y_RIGHT, 40, new_comm, &send[3]);
MPI_Isend(&old[1][1][1], 1, z_subarray, Z_TOWARDS_U, 50, new_comm, &send[4]);
MPI_Isend(&old[1][1][PZ], 1, z_subarray, Z_AWAY_U, 60, new_comm, &send[5]);
MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
t_comm_end = MPI_Wtime();
t_comm_sum = t_comm_sum + (t_comm_end - t_comm_strt);
/* Use threads in Independent update */
t_i_strt = MPI_Wtime();
l_max_err = 0.0; //Very important, Reduction result is combined with this !
/* THIS IS THE IMPORTANT REGION */
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,chunk) private(threadID,nthreads) reduction(max:l_max_err)
{
nthreads = omp_get_num_threads();
threadID = omp_get_thread_num();
chunk = (PX-1+1) / nthreads ;
l_max_err = independent_update(old, new, PX+2, PY+2, PZ+2, threadID, chunk);
}
t_i_end = MPI_Wtime();
t_i_sum = t_i_sum + (t_i_end - t_i_strt) ;
/* IGNORE THE REMAINING CODE */
t_allred_strt = MPI_Wtime();
MPI_Allreduce(&l_max_err, &G_max_err, 1, MPI_FLOAT, MPI_MAX, new_comm);
t_allred_end = MPI_Wtime();
t_allred_total = t_allred_total + (t_allred_end - t_allred_strt);
temp = new ;
new = old;
old = temp;
}
MPI_Barrier(new_comm);
end = MPI_Wtime();
if( rank == 0)
{
printf("\nIterations = %d, G_max_err = %f", iterations, G_max_err);
printf("\nThe total SET-UP time for MPI and boundary conditions is %lf", (t_setup_end-t_setup_strt));
printf("\nThe total time for SOLVING is %lf", (end-strt));
printf("\nThe total time for INDEPENDENT COMPUTE %lf", t_i_sum);
printf("\nThe total time for COMMUNICATION OVERHEAD is %lf", t_comm_sum);
printf("\nThe total time for MPI_ALLREDUCE() is %lf", t_allred_total);
}
MPI_Type_free(&x_subarray);
MPI_Type_free(&y_subarray);
MPI_Type_free(&z_subarray);
free(&old[0][0][0]);
free(&new[0][0][0]);
MPI_Finalize();
return 0;
}
P.S. : I am almost sure that the cost of spawning/waking the threads is not the reason for such a huge difference in the timing.
Please find attached Scalasca snapshot for INDEPENDENT COMPUTE of the Hybrid Program.
Using loop simd construct
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,l_max_err) private(i,j,k,diff)
{
#pragma omp for simd schedule(static) reduction(max:l_max_err)
for(i = 1; i <= PX ; i++)
{
for(j = 1; j<= PY; j++)
{
for(k = 1; k<= PZ; k++)
{
new[i][j][k] = (1/6.0) *(old[i-1][j][k] + old[i+1][j][k] + old[i][j-1][k] + old[i][j+1][k] + old[i][j][k-1] + old[i][j][k+1] );
diff = 1.0 - new[i][j][k];
diff = (diff > 0 ? diff : -1.0 * diff );
if(diff > l_max_err)
l_max_err = diff;
}
}
}
}
You frequently get memory access and cache issues when you run just one MPI process per socket on a CPU with multiple memory controllers. It can be on either the read or the write side, so you can't really say which. This is especially an issue when doing thread-parallel execution with lightweight compute tasks (e.g. math on arrays). One MPI process per socket in this case tends to fare significantly worse than pure MPI.
In your BIOS, set up whatever the maximal NUMA-per-socket option is.
Use one MPI process per NUMA node.
Try some different parameter values in schedule(static). I've rarely found the default to be best.
Essentially what this will do is ensure each bundle of threads only works on a single pool of memory.
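To make the last point concrete: on a NUMA system, a memory page lands on the node of the first thread that touches it. The question's code already zeroes old/new inside an OpenMP parallel region, which is the right idea; the sketch below (not from the original post, names are illustrative) isolates that "first touch" pattern, assuming the same static schedule is later used for the compute loop.

#include <omp.h>
#include <stdlib.h>

/* Allocate and initialize in parallel so each page is faulted in (and hence
   placed) by the thread that will later update that part of the array. */
float *allocate_first_touch(int nx, int ny, int nz)
{
    float *u = (float *)malloc((size_t)nx * ny * nz * sizeof(float));
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            for (int k = 0; k < nz; k++)
                u[((size_t)i * ny + j) * nz + k] = 0.0f;
    return u;
}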

Reaction-Diffusion algorithm on Processing + Multithreading

I have made an implementation of the Reaction-Diffusion algorithm in Processing 3.1.1, following a video tutorial. I have made some adaptations to my code, like implementing it on a torus space instead of the bounded box used in the video.
However, I ran into an annoying issue: the code runs really slowly, in proportion to the canvas size (the larger, the slower). So I tried optimizing the code according to my (limited) knowledge. The main thing I did was to reduce the number of loops being run.
Even then, my code still ran quite slowly.
Since I noticed that the algorithm ran at a good speed on a 50 x 50 canvas, I tried making it multithreaded, in such a way that the canvas would be divided between the threads and each thread would run the algorithm for a small region of the canvas.
All threads read from the current state of the canvas, and all write to the future state of the canvas. The canvas is then updated using Processing's pixel array.
However, even with multithreading, I didn't see any performance improvement. On the contrary, it got worse. Now the canvas sometimes flickers between a rendered state and completely white, and in some cases it doesn't render at all.
I'm quite sure that I'm doing something wrong, or that I'm taking the wrong approach to optimizing this algorithm. So I'm asking for help to understand what I'm doing wrong and how I could fix or improve my code.
Edit: Implementing ahead-of-time calculation and rendering using a buffer of PImage objects has removed the flickering, but the calculation step in the background doesn't run fast enough to fill the buffer.
My Processing Sketch is below, and thanks in advance.
ArrayList<PImage> buffer = new ArrayList<PImage>();
Thread t;
Buffer b;
PImage currentImage;
Point[][] grid; //current state
Point[][] next; //future state
//Reaction-Diffusion algorithm parameters
final float dA = 1.0;
final float dB = 0.5;
//default: f = 0.055; k = 0.062
//mitosis: f = 0.0367; k = 0.0649
float feed = 0.055;
float kill = 0.062;
float dt = 1.0;
//multi-threading parameters to divide canvas
int threadSizeX = 50;
int threadSizeY = 50;
//red shading colors
color red = color(255, 0, 0);
color white = color(255, 255, 255);
color black = color(0, 0, 0);
//if redShader is false, rendering will use a simple grayscale mode
boolean redShader = true;
//simple class to hold chemicals A and B amounts
class Point
{
float a;
float b;
Point(float a, float b)
{
this.a = a;
this.b = b;
}
}
void setup()
{
size(300, 300);
//initialize matrices with A = 1 and B = 0
grid = new Point[width][];
next = new Point[width][];
for (int x = 0; x < width; x++)
{
grid[x] = new Point[height];
next[x] = new Point[height];
for (int y = 0; y < height; y++)
{
grid[x][y] = new Point(1.0, 0.0);
next[x][y] = new Point(1.0, 0.0);
}
}
int a = (int) random(1, 20); //seed some areas with B = 1.0
for (int amount = 0; amount < a; amount++)
{
int siz = 2;
int x = (int)random(width);
int y = (int)random(height);
for (int i = x - siz/2; i < x + siz/2; i++)
{
for (int j = y - siz/2; j < y + siz/2; j++)
{
int i2 = i;
int j2 = j;
if (i < 0)
{
i2 = width + i;
} else if (i >= width)
{
i2 = i - width;
}
if (j < 0)
{
j2 = height + j;
} else if (j >= height)
{
j2 = j - height;
}
grid[i2][j2].b = 1.0;
}
}
}
initializeThreads();
}
/**
* Divide canvas between threads
*/
void initializeThreads()
{
ArrayList<Reaction> reactions = new ArrayList<Reaction>();
for (int x1 = 0; x1 < width; x1 += threadSizeX)
{
for (int y1 = 0; y1 < height; y1 += threadSizeY)
{
int x2 = x1 + threadSizeX;
int y2 = y1 + threadSizeY;
if (x2 > width - 1)
{
x2 = width - 1;
}
if (y2 > height - 1)
{
y2 = height - 1;
}
Reaction r = new Reaction(x1, y1, x2, y2);
reactions.add(r);
}
}
b = new Buffer(reactions);
t = new Thread(b);
t.start();
}
void draw()
{
if (buffer.size() == 0)
{
return;
}
PImage i = buffer.get(0);
image(i, 0, 0);
buffer.remove(i);
//println(frameRate);
println(buffer.size());
//saveFrame("output/######.png");
}
/**
* Faster than calling built in pow() function
*/
float pow5(float x)
{
return x * x * x * x * x;
}
class Buffer implements Runnable
{
ArrayList<Reaction> reactions;
boolean calculating = false;
public Buffer(ArrayList<Reaction> reactions)
{
this.reactions = reactions;
}
public void run()
{
while (true)
{
if (buffer.size() < 1000)
{
calculate();
if (isDone())
{
buffer.add(currentImage);
Point[][] temp;
temp = grid;
grid = next;
next = temp;
calculating = false;
}
}
}
}
boolean isDone()
{
for (Reaction r : reactions)
{
if (!r.isDone())
{
return false;
}
}
return true;
}
void calculate()
{
if (calculating)
{
return;
}
currentImage = new PImage(width, height);
for (Reaction r : reactions)
{
r.calculate();
}
calculating = true;
}
}
class Reaction
{
int x1;
int x2;
int y1;
int y2;
Thread t;
public Reaction(int x1, int y1, int x2, int y2)
{
this.x1 = x1;
this.x2 = x2;
this.y1 = y1;
this.y2 = y2;
}
public void calculate()
{
Calculator c = new Calculator(x1, y1, x2, y2);
t = new Thread(c);
t.start();
}
public boolean isDone()
{
if (t.getState() == Thread.State.TERMINATED)
{
return true;
} else
{
return false;
}
}
}
class Calculator implements Runnable
{
int x1;
int x2;
int y1;
int y2;
//weights for calculating the Laplacian for A and B
final float[][] laplacianWeights = {{0.05, 0.2, 0.05},
{0.2, -1, 0.2},
{0.05, 0.2, 0.05}};
/**
* x1, x2, y1, y2 delimit a rectangle. The object will only work within it
*/
public Calculator(int x1, int y1, int x2, int y2)
{
this.x1 = x1;
this.x2 = x2;
this.y1 = y1;
this.y2 = y2;
//println("x1: " + x1 + ", y1: " + y1 + ", x2: " + x2 + ", y2: " + y2);
}
@Override
public void run()
{
reaction();
show();
}
public void reaction()
{
for (int x = x1; x <= x2; x++)
{
for (int y = y1; y <= y2; y++)
{
float a = grid[x][y].a;
float b = grid[x][y].b;
float[] l = laplaceAB(x, y);
float a2 = reactionDiffusionA(a, b, l[0]);
float b2 = reactionDiffusionB(a, b, l[1]);
next[x][y].a = a2;
next[x][y].b = b2;
}
}
}
float reactionDiffusionA(float a, float b, float lA)
{
return a + ((dA * lA) - (a * b * b) + (feed * (1 - a))) * dt;
}
float reactionDiffusionB(float a, float b, float lB)
{
return b + ((dB * lB) + (a * b * b) - ((kill + feed) * b)) * dt;
}
/**
* Calculates Laplacian for both A and B at same time, to reduce amount of loops executed
*/
float[] laplaceAB(int x, int y)
{
float[] l = {0.0, 0.0};
for (int i = x - 1; i < x + 2; i++)
{
for (int j = y - 1; j < y + 2; j++)
{
int i2 = i;
int j2 = j;
if (i < 0)
{
i2 = width + i;
} else if (i >= width)
{
i2 = i - width;
}
if (j < 0)
{
j2 = height + j;
} else if (j >= height)
{
j2 = j - height;
}
int weightX = (i - x) + 1;
int weightY = (j - y) + 1;
l[0] += laplacianWeights[weightX][weightY] * grid[i2][j2].a;
l[1] += laplacianWeights[weightX][weightY] * grid[i2][j2].b;
}
}
return l;
}
public void show()
{
currentImage.loadPixels();
//renders the canvas using the pixel array
for (int x = 0; x < width; x++)
{
for (int y = 0; y < height; y++)
{
float a = next[x][y].a;
float b = next[x][y].b;
int pix = x + y * width;
float diff = (a - b);
color c;
if (redShader) //apply red shading
{
float thresh = 0.5;
if (diff < thresh)
{
float diff2 = map(pow5(diff), 0, pow5(thresh), 0, 1);
c = lerpColor(black, red, diff2);
} else
{
float diff2 = map(1 - pow5(-diff + 1), 1 - pow5(-thresh + 1), 1, 0, 1);
c = lerpColor(red, white, diff2);
}
} else //apply gray scale shading
{
c = color(diff * 255, diff * 255, diff * 255);
}
currentImage.pixels[pix] = c;
}
}
currentImage.updatePixels();
}
}
A programmer had a problem. He thought “I know, I’ll solve it with threads!”. has Now problems. two he
Processing uses a single rendering thread.
It does this for good reason, and most other renderers do the same thing. In fact, I don't know of any multi-threaded renderers.
You should only change what's on the screen from Processing's main rendering thread. In other words, you should only change stuff from Processing's functions, not your own thread. This is what's causing the flickering you're seeing. You're changing stuff as it's being drawn to the screen, which is a horrible idea. (And it's why Processing uses a single rendering thread in the first place.)
You could try to use your multiple threads to do the processing, not the rendering. But I highly doubt that's going to be worth it, and like you saw, it might even make things worse.
If you want to speed up your sketch, you might also consider doing the processing ahead of time instead of in real time. Do all your calculations at the beginning of the sketch, and then just reference the results of the calculations when it's time to draw the frame. Or you could draw to a PImage ahead of time, and then just draw those.

Finding the ranking of a word (permutations) with duplicate letters

I'm posting this although much has already been posted about this question. I didn't want to post as an answer since it's not working. The answer to this post (Finding the rank of the Given string in list of all possible permutations with Duplicates) did not work for me.
So I tried this (which is a compilation of code I've plagiarized and my attempt to deal with repetitions). The non-repeating cases work fine, but BOOKKEEPER generates 83863, not the desired 10743.
(The factorial function and the letter-counter array 'repeats' are working correctly. I didn't post them to save space.)
while (pointer != length)
{
if (sortedWordChars[pointer] != wordArray[pointer])
{
// Swap the current character with the one after that
char temp = sortedWordChars[pointer];
sortedWordChars[pointer] = sortedWordChars[next];
sortedWordChars[next] = temp;
next++;
//For each position check how many characters left have duplicates,
//and use the logic that if you need to permute n things and if 'a' things
//are similar the number of permutations is n!/a!
int ct = repeats[(sortedWordChars[pointer]-64)];
// Increment the rank
if (ct>1) { //repeats?
System.out.println("repeating " + (sortedWordChars[pointer]-64));
//In case of repetition of any character use: (n-1)!/(times)!
//e.g. if there is 1 character which is repeating twice,
//x* (n-1)!/2!
int dividend = getFactorialIter(length - pointer - 1);
int divisor = getFactorialIter(ct);
int quo = dividend/divisor;
rank += quo;
} else {
rank += getFactorialIter(length - pointer - 1);
}
} else
{
pointer++;
next = pointer + 1;
}
}
Note: this answer is for 1-based rankings, as specified implicitly by example. Here's some Python that works at least for the two examples provided. The key fact is that suffixperms * ctr[y] // ctr[x] is the number of permutations whose first letter is y of the length-(i + 1) suffix of perm.
from collections import Counter

def rankperm(perm):
    rank = 1
    suffixperms = 1
    ctr = Counter()
    for i in range(len(perm)):
        x = perm[((len(perm) - 1) - i)]
        ctr[x] += 1
        for y in ctr:
            if (y < x):
                rank += ((suffixperms * ctr[y]) // ctr[x])
        suffixperms = ((suffixperms * (i + 1)) // ctr[x])
    return rank

print(rankperm('QUESTION'))
print(rankperm('BOOKKEEPER'))
Java version:
public static long rankPerm(String perm) {
long rank = 1;
long suffixPermCount = 1;
java.util.Map<Character, Integer> charCounts =
new java.util.HashMap<Character, Integer>();
for (int i = perm.length() - 1; i > -1; i--) {
char x = perm.charAt(i);
int xCount = charCounts.containsKey(x) ? charCounts.get(x) + 1 : 1;
charCounts.put(x, xCount);
for (java.util.Map.Entry<Character, Integer> e : charCounts.entrySet()) {
if (e.getKey() < x) {
rank += suffixPermCount * e.getValue() / xCount;
}
}
suffixPermCount *= perm.length() - i;
suffixPermCount /= xCount;
}
return rank;
}
Unranking permutations:
from collections import Counter

def unrankperm(letters, rank):
    ctr = Counter()
    permcount = 1
    for i in range(len(letters)):
        x = letters[i]
        ctr[x] += 1
        permcount = (permcount * (i + 1)) // ctr[x]
    # ctr is the histogram of letters
    # permcount is the number of distinct perms of letters
    perm = []
    for i in range(len(letters)):
        for x in sorted(ctr.keys()):
            # suffixcount is the number of distinct perms that begin with x
            suffixcount = permcount * ctr[x] // (len(letters) - i)
            if rank <= suffixcount:
                perm.append(x)
                permcount = suffixcount
                ctr[x] -= 1
                if ctr[x] == 0:
                    del ctr[x]
                break
            rank -= suffixcount
    return ''.join(perm)
If we use mathematics, the complexity comes down and we can find the rank more quickly. This is particularly helpful for large strings.
(More details can be found here.)
I suggest programmatically implementing the approach shown in the screenshot attached below.
I would say David's post (the accepted answer) is super cool. However, I would like to improve it further for speed. The inner loop tries to find inverse-order pairs, and each such inverse order contributes to the increment of the rank. If we use an ordered map structure (a binary search tree, or BST) in that place, we can simply do an in-order traversal from the first (bottom-left) node until we reach the current character in the BST, rather than traversing the whole map (BST). In C++, std::map is a perfect BST implementation for this. The following code reduces the necessary iterations in the loop and removes the if check.
long long rankofword(string s)
{
long long rank = 1;
long long suffixPermCount = 1;
map<char, int> m;
int size = s.size();
for (int i = size - 1; i > -1; i--)
{
char x = s[i];
m[x]++;
for (auto it = m.begin(); it != m.find(x); it++)
rank += suffixPermCount * it->second / m[x];
suffixPermCount *= (size - i);
suffixPermCount /= m[x];
}
return rank;
}
@David Eisenstat, this was really helpful. It took me a WHILE to figure out what you were doing, as I am still learning my first language (C#). I translated it into C# and figured I'd give that solution as well, since this listing helped me so much!
Thanks!
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
namespace CsharpVersion
{
class Program
{
//Takes in the word and checks to make sure that the word
//is between 1 and 25 characters inclusive and only
//letters are used
static string readWord(string prompt, int high)
{
Regex rgx = new Regex("^[a-zA-Z]+$");
string word;
string result;
do
{
Console.WriteLine(prompt);
word = Console.ReadLine();
} while (word == "" | word.Length > high | rgx.IsMatch(word) == false);
result = word.ToUpper();
return result;
}
//Creates a sorted dictionary containing distinct letters
//initialized with 0 frequency
static SortedDictionary<char,int> Counter(string word)
{
char[] wordArray = word.ToCharArray();
int len = word.Length;
SortedDictionary<char,int> count = new SortedDictionary<char,int>();
foreach(char c in word)
{
if(count.ContainsKey(c))
{
}
else
{
count.Add(c, 0);
}
}
return count;
}
//Creates a factorial function
static int Factorial(int n)
{
if (n <= 1)
{
return 1;
}
else
{
return n * Factorial(n - 1);
}
}
//Ranks the word input if there are no repeated characters
//in the word
static Int64 rankWord(char[] wordArray)
{
int n = wordArray.Length;
Int64 rank = 1;
//loops through the array of letters
for (int i = 0; i < n-1; i++)
{
int x=0;
//loops all letters after i and compares them for factorial calculation
for (int j = i+1; j<n ; j++)
{
if (wordArray[i] > wordArray[j])
{
x++;
}
}
rank = rank + x * (Factorial(n - i - 1));
}
return rank;
}
//Ranks the word input if there are repeated characters
//in the word
static Int64 rankPerm(String word)
{
Int64 rank = 1;
Int64 suffixPermCount = 1;
SortedDictionary<char, int> counter = Counter(word);
for (int i = word.Length - 1; i > -1; i--)
{
char x = Convert.ToChar(word.Substring(i,1));
int xCount;
if(counter[x] != 0)
{
xCount = counter[x] + 1;
}
else
{
xCount = 1;
}
counter[x] = xCount;
foreach (KeyValuePair<char,int> e in counter)
{
if (e.Key < x)
{
rank += suffixPermCount * e.Value / xCount;
}
}
suffixPermCount *= word.Length - i;
suffixPermCount /= xCount;
}
return rank;
}
static void Main(string[] args)
{
Console.WriteLine("Type Exit to end the program.");
string prompt = "Please enter a word using only letters:";
const int MAX_VALUE = 25;
Int64 rank = new Int64();
string theWord;
do
{
theWord = readWord(prompt, MAX_VALUE);
char[] wordLetters = theWord.ToCharArray();
Array.Sort(wordLetters);
bool duplicate = false;
for(int i = 0; i< theWord.Length - 1; i++)
{
if(wordLetters[i] == wordLetters[i+1]) //adjacent equal letters in the sorted array mean a repeat
{
duplicate = true;
}
}
if(duplicate)
{
SortedDictionary<char, int> counter = Counter(theWord);
rank = rankPerm(theWord);
Console.WriteLine("\n" + theWord + " = " + rank);
}
else
{
char[] letters = theWord.ToCharArray();
rank = rankWord(letters);
Console.WriteLine("\n" + theWord + " = " + rank);
}
} while (theWord != "EXIT");
Console.WriteLine("\nPress enter to escape..");
Console.Read();
}
}
}
If there are k distinct characters, with the i-th character repeated n_i times, then the total number of permutations is given by
(n_1 + n_2 + ..+ n_k)!
------------------------------------------------
n_1! n_2! ... n_k!
which is the multinomial coefficient.
Now we can use this to compute the rank of a given permutation as follows:
Consider the first (leftmost) character. Say it is the r-th one in the sorted order of characters.
Now if you replace the first character by any of the 1st, 2nd, ..., (r-1)-th characters and consider all possible permutations of the rest, each of these permutations will precede the given permutation. The total number can be computed using the above formula.
Once you compute the number for the first character, fix the first character, and repeat the same with the second character and so on.
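For example (not from the original posts), take the word BAB. Its distinct permutations in sorted order are ABB, BAB, BBA, so its rank is 2. The procedure gives the same result: the first character B has one smaller character available (A), and fixing A in front leaves the multiset {B, B} with 2!/2! = 1 permutation, so 1 permutation precedes it on the first position; the remaining suffix AB contributes nothing further, giving rank 1 + 1 = 2.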
Here's the C++ implementation to your question
#include<iostream>
#include<string>
#include<cctype>
using namespace std;
int fact(int f) {
if (f == 0) return 1;
if (f <= 2) return f;
return (f * fact(f - 1));
}
int solve(string s,int n) {
int ans = 1;
int arr[26] = {0};
int len = n - 1;
for (int i = 0; i < n; i++) {
s[i] = toupper(s[i]);
arr[s[i] - 'A']++;
}
for(int i = 0; i < n; i++) {
int temp = 0;
int x = 1;
char c = s[i];
for(int j = 0; j < c - 'A'; j++) temp += arr[j];
for (int j = 0; j < 26; j++) x = (x * fact(arr[j]));
arr[c - 'A']--;
ans = ans + (temp * ((fact(len)) / x));
len--;
}
return ans;
}
int main() {
int i,n;
string s;
cin>>s;
n=s.size();
cout << solve(s,n);
return 0;
}
Java version of unrank for a String:
public static String unrankperm(String letters, int rank) {
Map<Character, Integer> charCounts = new java.util.HashMap<>();
int permcount = 1;
for(int i = 0; i < letters.length(); i++) {
char x = letters.charAt(i);
int xCount = charCounts.containsKey(x) ? charCounts.get(x) + 1 : 1;
charCounts.put(x, xCount);
permcount = (permcount * (i + 1)) / xCount;
}
// charCounts is the histogram of letters
// permcount is the number of distinct perms of letters
StringBuilder perm = new StringBuilder();
for(int i = 0; i < letters.length(); i++) {
List<Character> sorted = new ArrayList<>(charCounts.keySet());
Collections.sort(sorted);
for(Character x : sorted) {
// suffixcount is the number of distinct perms that begin with x
Integer frequency = charCounts.get(x);
int suffixcount = permcount * frequency / (letters.length() - i);
if (rank <= suffixcount) {
perm.append(x);
permcount = suffixcount;
if(frequency == 1) {
charCounts.remove(x);
} else {
charCounts.put(x, frequency - 1);
}
break;
}
rank -= suffixcount;
}
}
return perm.toString();
}
See also n-th-permutation-algorithm-for-use-in-brute-force-bin-packaging-parallelization.

Lanczos Resampling error

I have written an image resizer using Lanczos resampling. I've taken the implementation straight from the directions on Wikipedia. The results look good visually, but for some reason they do not match the result of Matlab's resize with Lanczos very well (in pixel error).
Does anybody see any errors? This is not my area of expertise at all...
Here is my filter (I'm using Lanczos3 by default):
double lanczos_size_ = 3.0;
inline double sinc(double x) {
double pi = 3.1415926;
x = (x * pi);
if (x < 0.01 && x > -0.01)
return 1.0 + x*x*(-1.0/6.0 + x*x*1.0/120.0);
return sin(x)/x;
}
inline double LanczosFilter(double x) {
if (std::abs(x) < lanczos_size_) {
double pi = 3.1415926;
return sinc(x)*sinc(x/lanczos_size_);
} else {
return 0.0;
}
}
And my code to resize the image:
Image Resize(Image& image, int new_rows, int new_cols) {
int old_cols = image.size().cols;
int old_rows = image.size().rows;
double col_ratio =
static_cast<double>(old_cols)/static_cast<double>(new_cols);
double row_ratio =
static_cast<double>(old_rows)/static_cast<double>(new_rows);
// Apply filter first in width, then in height.
Image horiz_image(new_cols, old_rows);
for (int r = 0; r < old_rows; r++) {
for (int c = 0; c < new_cols; c++) {
// x is the new col in terms of the old col coordinates.
double x = static_cast<double>(c)*col_ratio;
// The old col corresponding to the closest new col.
int floor_x = static_cast<int>(x);
horiz_image[r][c] = 0.0;
double weight = 0.0;
// Add up terms across the filter.
for (int i = floor_x - lanczos_size_ + 1; i < floor_x + lanczos_size_; i++) {
if (i >= 0 && i < old_cols) {
double lanc_term = LanczosFilter(x - i);
horiz_image[r][c] += image[r][i]*lanc_term;
weight += lanc_term;
}
}
// Normalize the filter.
horiz_image[r][c] /= weight;
// Clamp the pixel values to valid values.
horiz_image[r][c] = (horiz_image[r][c] > 1.0) ? 1.0 : horiz_image[r][c];
horiz_image[r][c] = (horiz_image[r][c] < 0.0) ? 0.0 : horiz_image[r][c];
}
}
// Now apply a vertical filter to the horiz image.
Image new_image(new_cols, new_rows);
for (int r = 0; r < new_rows; r++) {
double x = static_cast<double>(r)*row_ratio;
int floor_x = static_cast<int>(x);
for (int c = 0; c < new_cols; c++) {
new_image[r][c] = 0.0;
double weight = 0.0;
for (int i = floor_x - lanczos_size_ + 1; i < floor_x + lanczos_size_; i++) {
if (i >= 0 && i < old_rows) {
double lanc_term = LanczosFilter(x - i);
new_image[r][c] += horiz_image[i][c]*lanc_term;
weight += lanc_term;
}
}
new_image[r][c] /= weight;
new_image[r][c] = (new_image[r][c] > 1.0) ? 1.0 : new_image[r][c];
new_image[r][c] = (new_image[r][c] < 0.0) ? 0.0 : new_image[r][c];
}
}
return new_image;
}
Here is Lanczos in one single loop, with no errors.
It uses the procedures mentioned at the top.
void ResizeDD(
double* const pixelsSrc,
const int old_cols,
const int old_rows,
double* const pixelsTarget,
int const new_rows, int const new_cols)
{
double col_ratio =
static_cast<double>(old_cols) / static_cast<double>(new_cols);
double row_ratio =
static_cast<double>(old_rows) / static_cast<double>(new_rows);
// Now apply a filter to the image.
for (int r = 0; r < new_rows; ++r)
{
const double row_within = static_cast<double>(r)* row_ratio;
int floor_row = static_cast<int>(row_within);
for (int c = 0; c < new_cols; ++c)
{
// x is the new col in terms of the old col coordinates.
double col_within = static_cast<double>(c)* col_ratio;
// The old col corresponding to the closest new col.
int floor_col = static_cast<int>(col_within);
double& v_toSet = pixelsTarget[r * new_cols + c];
v_toSet = 0.0;
double weight = 0.0;
for (int i = floor_row - lanczos_size_ + 1; i <= floor_row + lanczos_size_; ++i)
{
for (int j = floor_col - lanczos_size_ + 1; j <= floor_col + lanczos_size_; ++j)
{
if (i >= 0 && i < old_rows && j >= 0 && j < old_cols)
{
const double lanc_term = LanczosFilter(row_within - i) * LanczosFilter(col_within - j); // separable 2-D weight
v_toSet += pixelsSrc[i * old_cols + j] * lanc_term; // row-major index uses the column count as stride
weight += lanc_term;
}
}
}
v_toSet /= weight;
v_toSet = (v_toSet > 1.0) ? 1.0 : v_toSet;
v_toSet = (v_toSet < 0.0) ? 0.0 : v_toSet;
}
}
}
The line
for (int i = floor_x - lanczos_size_ + 1; i < floor_x + lanczos_size_; i++)
should be
for (int i = floor_x - lanczos_size_ + 1; i <= floor_x + lanczos_size_; i++)
I do not know, but perhaps other mistakes linger too.
I think there is a mistake in your sinc function. Below the fraction bar you have to square pi and x. Additionally, you have to multiply the function by the Lanczos size a:
L(x) = a * sin(pi*x) * sin(pi*x/a) / (pi^2 * x^2)
Edit: My mistake, it is all right.
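For what it's worth, a quick check (not from the original post) shows why the retraction is right: with the normalized sinc used in the question, sinc(x) * sinc(x/a) expands to exactly a * sin(pi*x) * sin(pi*x/a) / (pi^2 * x^2), so the two forms of the kernel agree. A small sketch that evaluates the closed form directly for spot-checking against the question's LanczosFilter:

#include <cmath>
#include <cstdio>

// Lanczos kernel written directly from L(x) = a*sin(pi*x)*sin(pi*x/a)/(pi^2*x^2),
// which equals sinc(x)*sinc(x/a) with the normalized sinc of the question.
double lanczos_direct(double x, double a)
{
    if (x == 0.0) return 1.0;
    if (std::abs(x) >= a) return 0.0;
    const double pi = 3.14159265358979323846;
    return a * std::sin(pi * x) * std::sin(pi * x / a) / (pi * pi * x * x);
}

int main()
{
    // Print a few sample values for a = 3 (Lanczos3).
    for (double x = -2.5; x <= 2.5; x += 0.5)
        std::printf("L(%+.1f) = %f\n", x, lanczos_direct(x, 3.0));
    return 0;
}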
