OpenMP in Biham-Middleton-Levine BML model - multithreading

I have a serial version of the BML model and I'm trying to write a parallel one with OpenMP. Basically, my code has a main loop that calls two functions for horizontal and vertical moves, like this:
for (s = 0; s < nmovss; s++) {
    horizontal_movs(grid, N);
    copy_sides(grid, N);
    cur = 1-cur;
    vertical_movs(grid, N);
    copy_sides(grid, N);
    cur = 1-cur;
}
Here cur selects the current grid. The horizontal and vertical functions are similar; each has a nested loop:
for(i = 1; i <= n; i++) {
    for(j = 1; j <= n+1; j++) {
        if(grid[cur][i][j-1] == LR && grid[cur][i][j] == EMPTY) {
            grid[1-cur][i][j-1] = EMPTY;
            grid[1-cur][i][j] = LR;
        }
        else {
            grid[1-cur][i][j] = grid[cur][i][j];
        }
    }
}
The code produces a PPM image at every step, and with a given input the serial version produces output that we can assume is correct. But when I use #pragma omp parallel for inside the two functions (horizontal and vertical), the PPM file ends up split into as many zones as there are threads (e.g. 4):
I suppose the problem is that every thread should execute both functions in sequence before terminating, because the movements are strictly connected, but I don't know how to do that. If I place the pragma at a higher level, e.g. before the main loop, there is no speed-up. Obviously the PPM file must not be sliced like in the image.
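For illustration, such a placement inside the horizontal function could look roughly like this (a reconstruction, since the parallel version is not shown here; the enum values and NMAX are placeholders, not names from the original code):

/* Hypothetical sketch of "#pragma omp parallel for" inside the horizontal-move
   function. EMPTY/LR mirror the constants used in the question; NMAX is an
   assumed maximum lattice size. */
enum { EMPTY = 0, LR = 1 };
#define NMAX 1024

void horizontal_movs_omp(int grid[2][NMAX+2][NMAX+2], int n, int cur)
{
    int i, j;
    /* i is the loop variable of the parallel for, so it is private by default */
    #pragma omp parallel for default(none) shared(grid, n, cur) private(j)
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= n+1; j++) {
            if (grid[cur][i][j-1] == LR && grid[cur][i][j] == EMPTY) {
                grid[1-cur][i][j-1] = EMPTY;
                grid[1-cur][i][j] = LR;
            } else {
                grid[1-cur][i][j] = grid[cur][i][j];
            }
        }
    }
}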

Going on, I tried this solution, which gives me a result identical to the serial code, but I don't exactly understand why:
#pragma omp parallel num_threads(thread_count) default(none) \
    shared(grid, n, cur) private(i, j)
for(i = 1; i <= n+1; i++) {
    #pragma omp for
    for(j = 1; j <= n; j++) {
        if(grid[cur][i-1][j] == TB && grid[cur][i][j] == EMPTY) {
            grid[1-cur][i-1][j] = EMPTY;
            grid[1-cur][i][j] = TB;
        }
        else {
            grid[1-cur][i][j] = grid[cur][i][j];
        }
    }
}
}
Moreover, if I use just one thread more than the available cores (4), the execution time "explodes" instead of staying roughly the same.

Related

Is it possible to parallelize or unroll this loop?

I am trying to see if I can improve the performance of the following loop in C++, which uses two-dimensional vectors (_external and _Table) and has a loop-carried dependency on the previous iteration. Additionally, it has a calculated index in the innermost loop that makes the access of _Table non-sequential on the right-hand side.
int N = 8000;
int M = 400;
int P = 100;

for (int i = 1; i <= N; i++) {
    for (int j = 0; j < M; j++) {
        for (int k = 0; k < P; k++) {
            int index = _external.at(j).at(k);
            _Table.at(j).at(i) += _Table.at(index).at(i-1);
        }
    }
}
What can I do to improve the performance of a loop like this?
Well it looks to me like the order in which these statements:
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
are executed is critical to correctness. (That is, if the iteration order for i, j, k changes, then the results will be different ... and incorrect.)
So I think you are only left with micro-optimizations, like hoisting the expressions _Table.at(j).at(i) and _external.at(j) out of the innermost loop.
Consider this:
for (int k = 0; k < P; k++) {
    int index = _external.at(j).at(k);
    _Table.at(j).at(i) += _Table.at(index).at(i-1);
}
This loop is repeatedly adding numbers to _Table.at(j).at(i). Since (by inspection) _Table.at(index).at(i-1) must be reading from a different cell of the table (because of i-1 versus i), you could do this:
int temp = 0;
for (int k = 0; k < P; k++) {
    int index = _external.at(j).at(k);
    temp += _Table.at(index).at(i-1);
}
_Table.at(j).at(i) += temp;
This will reduce the number of calls to at, and may also improve cache performance a bit.
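Taking the hoisting idea a step further, a possible variant (a sketch only, assuming both containers hold int, as the integer accumulator above suggests) also caches the row lookup outside the inner loop, so the repeated .at() calls on the same row happen once per j:

#include <vector>

// Sketch: hoists the _external row lookup as well as the accumulation.
void update(std::vector<std::vector<int>>& _Table,
            const std::vector<std::vector<int>>& _external,
            int N, int M, int P)
{
    for (int i = 1; i <= N; i++) {
        for (int j = 0; j < M; j++) {
            const std::vector<int>& ext_row = _external.at(j);  // hoisted row lookup
            int temp = 0;
            for (int k = 0; k < P; k++) {
                int index = ext_row.at(k);
                temp += _Table.at(index).at(i - 1);   // still non-sequential on the RHS
            }
            _Table.at(j).at(i) += temp;
        }
    }
}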

Multiply matrices using dgemv and multiple threads in C

I have a problem in my code. I want to multiply 2 matrices using dgemv from cblas, but I want to share the work among the threads I have. I have also used dgemv to multiply the matrices in a previous exercise where no parallelism was needed. Any idea what I should do?
The code:
for (it = 0; it < itime; it++) {
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1, sigma, n, u, 1, 0.0, d, 1);
    #pragma omp parallel for private(i,j,sum) schedule(static)
    for (i = 0; i < n; i++) {
        sum = 0.0;
        uplus[i] = u[i] + dtmu - dt * u[i];
        #pragma omp simd reduction(+:sum)
        for (j = 0; j < n; j++) {
            sum += sigma[i*n+j]*u[j];
        }
        sum = sum - u[i]*m[i];
        uplus[i] += dtdiv * sum;
        if (uplus[i] > uth) {
            uplus[i] = 0.0;
            if (it >= ttransient) {
                omega1[i] += 1.0;
            }
        }
    }
    t = u;
    u = uplus;
    uplus = t;
}
I want to get the dgemv call into the parallel region and somehow split the multiplications among the threads I have.
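One possible approach (a sketch, not a verified solution) is to give each thread its own block of rows and call cblas_dgemv on that block; for a column-major n x n matrix stored as in the call above, the block starting at row r0 begins at &sigma[r0] with leading dimension n:

#include <omp.h>
#include <cblas.h>   /* header name may differ between BLAS distributions */

/* Sketch only: each thread applies dgemv to its own block of rows.
   Assumes sigma is column-major (as in the CblasColMajor call above),
   u is the length-n input vector and d the length-n output vector. */
void blocked_dgemv(const double *sigma, const double *u, double *d, int n)
{
    #pragma omp parallel
    {
        int nth   = omp_get_num_threads();
        int tid   = omp_get_thread_num();
        int chunk = (n + nth - 1) / nth;        /* rows per thread, rounded up */
        int r0    = tid * chunk;                /* first row of this block     */
        int rows  = (r0 + chunk <= n) ? chunk : n - r0;
        if (rows > 0)
            cblas_dgemv(CblasColMajor, CblasNoTrans, rows, n,
                        1.0, &sigma[r0], n,     /* row block, leading dim = n  */
                        u, 1,
                        0.0, &d[r0], 1);        /* each thread writes its rows */
    }
}

If the BLAS library is itself multithreaded, it may be simpler to keep the single call outside the OpenMP region and let the library handle the threading; nesting a threaded BLAS inside an OpenMP region may require limiting the BLAS to one thread per call.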

Hybrid MPI+OpenMP Vs MPI Performance

I am converting a 3-D Jacobi solver from pure MPI to Hybrid MPI+OpenMP. I have a 192x192x192 array which is divided among 24 processes in Pure MPI in 1-D decomposition i.e. each process has 192/24 x 192 x 192 = 8 x 192 x 192 slab of data. Now I do :
for (i = 0; i <= 7; i++)
    for (j = 0; j <= 191; j++)
        for (k = 0; k <= 191; k++)
        {
            unew[i][j][k] = 1/6.0 * (u[i+1][j][k] + u[i-1][j][k] +
                                     u[i][j+1][k] + u[i][j-1][k] +
                                     u[i][j][k+1] + u[i][j][k-1]);
        }
This update takes around 60 seconds for each process.
Now with hybrid MPI, I run two processes (one process per socket, --bind-to socket --map-by socket, with OMP_PROC_PLACES=cores and OMP_PROC_BIND=close). I create 12 threads per MPI process (i.e. 12 threads per socket/processor). Now each MPI process has an array of size 192/2 x 192 x 192 = 96 x 192 x 192 elements. Each thread works on a 96/12 x 192 x 192 = 8 x 192 x 192 portion of the array owned by its process. I do the same triple-loop update using threads, but the time is approximately 76 seconds for each thread. The load balance is perfect in both cases. What could be the possible causes of this performance degradation? Is it false sharing, because threads could be invalidating cache lines close to each other's chunk of data? If yes, then how do I reduce this performance degradation? (I have purposefully not mentioned ghost data, but initially I am NOT overlapping communication with computation.)
In response to the comments below, I am posting the code. Apologies for the long MWE, but you can very safely ignore (1) the header file declarations, (2) the variable declarations, (3) the memory allocation routine, (4) the formation of the Cartesian topology, (5) setting boundary conditions in parallel using an OpenMP parallel region, (6) the declaration of the MPI_Type_subarray datatypes, and (7) the MPI_Isend() and MPI_Irecv() calls, and just concentrate on (a) the INDEPENDENT UPDATE OpenMP parallel region and (b) the independent_update(...) routine called from there.
/* IGNORE THIS PORTION */
#include<mpi.h>
#include<omp.h>
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define MIN(a,b) (a < b ? a : b)
#define Tol 0.00001
/* IGNORE THIS ROUTINE */
void input(int *X, int *Y, int *Z)
{
int a=193, b=193, c=193;
*X = a;
*Y = b;
*Z = c;
}
/* IGNORE THIS ROUTINE */
float*** allocate_mem(int X, int Y, int Z)
{
int i,j;
float ***matrix;
float *arr;
arr = (float*)calloc(X*Y*Z, sizeof(float));
matrix = (float***)calloc(X, sizeof(float**));
for(i = 0 ; i<= X-1; i++)
matrix[i] = (float**)calloc(Y, sizeof(float*));
for(i = 0 ; i <= X-1; i++)
for(j=0; j<= Y-1; j++)
matrix[i][j] = &(arr[i*Y*Z + j*Z]);
return matrix ;
}
/* THIS ROUTINE IS IMPORTANT */
float independent_update(float ***old, float ***new, int NX, int NY, int NZ, int tID, int chunk)
{
    int i, j, k, start, end;
    float error = 0.0;
    float diff;
    start = tID * chunk + 1;
    end = MIN( (tID+1)*chunk, NX-2 );
    for(i = start; i <= end; i++)
    {
        for(j = 1; j <= NY-2; j++)
        {
            #pragma omp simd
            for(k = 1; k <= NZ-2; k++)
            {
                new[i][j][k] = (1/6.0) * (old[i-1][j][k] + old[i+1][j][k] +
                                          old[i][j-1][k] + old[i][j+1][k] +
                                          old[i][j][k-1] + old[i][j][k+1]);
                diff = 1.0 - new[i][j][k];
                diff = (diff > 0 ? diff : -1.0 * diff);
                if(diff > error)
                    error = diff;
            }
        }
    }
    return error;
}
int main(int argc, char *argv[])
{
/* IGNORE VARIABLE DECLARATION */
int size, rank; //Size of old_comm and rank of process
int i, j, k,l; //General loop variables
MPI_Comm old_comm, new_comm; //MPI_COMM_WORLD handle and for MPI_Cart_create()
int N[3]; //For taking input of size of matrix from user
int P; //Represent number of processes i.e. same as size
int dims[3]; //For dimensions of Cartesian topology
int PX, PY, PZ; //X dim, Y dim, Z dim of each process
float ***old, ***new, ***temp; //Matrices for results dimensions is (Px+2)*(PY+2)*(PZ+2)
int period[3]; //Periodicity for each dimension
int reorder; //Whether processes should be reordered in new cartesian topology
int ndims; //Number of dimensions (which is 3)
int Z_TOWARDS_U, Z_AWAY_U; //Z neighbour towards you and away from you (Z const)
int X_DOWN, X_UP; //Below plane and above plane (X const)
int Y_LEFT, Y_RIGHT; //Left plane and right plane (Y const)
int coords[3]; //Finding coordinates of processes
int dimension; //Used in MPI_Cart_shift() , values = 0, 1,2
int displacement; //Used in MPI_Cart_shift(), values will be +1 to find immediate neighbours
float l_max_err; //Local maximum error on process
float l_max_err_new; //For dependent faces.
float G_max_err = 1.0; //Maximum error for stopping criterion
int iterations = 0 ; //Counting number of iterations
MPI_Request send[6], recv[6]; //For MPI_Isend and MPI_Irecv
int start[3]; //Start will be defined in MPI_Isend() and MPI_Irecv()
int gsize[3]; //Defining global size of subarray
MPI_Datatype x_subarray; //For sending X_UP and X_DOWN
int local_x[3]; //Defining local plane size for X_UP/X_DOWN
MPI_Datatype y_subarray; //For sending Y_LEFT and Y_RIGHT
int local_y[3]; //Defining local plane for Y_LEFT/Y_RIGHT
MPI_Datatype z_subarray; //For sending Z_TOWARDS_U and Z_AWAY_U
int local_z[3]; //Defining local plan size for XY plane i.e. where Z=0
double strt, end; //For measuring time
double strt1, end1, delta1; //For measuring trivial time 1
double strt2, end2, delta2; //For measuring trivial time 2
double t_i_strt, t_i_end, t_i_sum=0; //Time for independent computational kernel
double t_up_strt, t_up_end, t_up_sum=0; //Time for X_UP
double t_down_strt, t_down_end, t_down_sum=0; //Time for X_DOWN
double t_left_strt, t_left_end, t_left_sum=0; //Time for Y_LEFT
double t_right_strt, t_right_end, t_right_sum=0; //Time for Y_RIGHT
double t_towards_strt, t_towards_end, t_towards_sum=0; //For Z_TOWARDS_U
double t_away_strt, t_away_end, t_away_sum=0; //For Z_AWAY_U
double t_comm_strt, t_comm_end, t_comm_sum=0; //Time comm + independent update (need to subtract to get comm time)
double t_setup_strt,t_setup_end; //Set-up start and end time
double t_allred_strt,t_allred_end,t_allred_total=0.0; //Measuring Allreduce time separately.
int threadID; //ID of a thread
int nthreads; //Total threads in OpenMP region
int chunk; //chunk - used to calculate iterations of a thread
/* IGNORE MPI STARTUP ETC */
MPI_Init(&argc, &argv);
t_setup_strt = MPI_Wtime();
old_comm = MPI_COMM_WORLD;
MPI_Comm_size(old_comm, &size);
MPI_Comm_rank(old_comm, &rank);
P = size;
if(rank == 0)
{
input(&N[0], &N[1], &N[2]);
}
MPI_Bcast(N, 3, MPI_INT, 0, old_comm);
dims[0] = 0;
dims[1] = 0;
dims[2] = 0;
period[0] = period[1] = period[2] = 0; //All dimensions aperiodic
reorder = 0 ; //No reordering of ranks in new_comm
ndims = 3;
MPI_Dims_create(P,ndims,dims);
MPI_Cart_create(old_comm, ndims, dims, period, reorder, &new_comm);
if( (N[0]-1) % dims[0] == 0 && (N[1]-1) % dims[1] == 0 && (N[2]-1) % dims[2] == 0 )
{
PX = (N[0]-1)/dims[0]; //Rows of unknowns each process gets
PY = (N[1]-1)/dims[1]; //Columns of unknowns each process gets
PZ = (N[2]-1)/dims[2]; //Depth of unknowns each process gets
}
old = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
new = allocate_mem(PX+2, PY+2, PZ+2); //3D arrays with ghost points
dimension = 0;
displacement = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &X_UP, &X_DOWN); //Find UP and DOWN neighbours
dimension = 1;
MPI_Cart_shift(new_comm, dimension, displacement, &Y_LEFT, &Y_RIGHT); //Find UP and DOWN neighbours
dimension = 2;
MPI_Cart_shift(new_comm, dimension, displacement, &Z_TOWARDS_U, &Z_AWAY_U); //Find UP and DOWN neighbours
/* IGNORE BOUNDARY SETUPS FOR PDE */
#pragma omp parallel for default(none) shared(old,new,PX,PY,PZ) private(i,j,k) schedule(static)
for(i = 0; i <= PX+1; i++)
{
for(j = 0; j <= PY+1; j++)
{
for(k = 0; k <= PZ+1; k++)
{
old[i][j][k] = 0.0;
new[i][j][k] = 0.0;
}
}
}
#pragma omp parallel default(none) shared(X_DOWN,X_UP,Y_LEFT,Y_RIGHT,Z_TOWARDS_U,Z_AWAY_U,old,new,PX,PY,PZ) private(i,j,k,threadID,nthreads)
{
threadID = omp_get_thread_num();
nthreads = omp_get_num_threads();
if(threadID == 0)
{
if(X_DOWN == MPI_PROC_NULL) //X is constant here, this is YZ upper plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1 ; k<= PZ ; k++)
{
old[0][j][k] = 1;
new[0][j][k] = 1; //Set boundaries in new also
}
}
}
if(threadID == (nthreads-1))
{
if(X_UP == MPI_PROC_NULL) //YZ lower plane
{
for(j = 1 ; j<= PY ; j++)
for(k = 1; k<= PZ ; k++)
{
old[PX+1][j][k] = 1;
new[PX+1][j][k] = 1;
}
}
}
if(Y_LEFT == MPI_PROC_NULL) //Y is constant, this is left XZ plane, possibly can use collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(k = 1; k<= PZ; k++)
{
old[i][0][k] = 1;
new[i][0][k] = 1;
}
}
if(Y_RIGHT == MPI_PROC_NULL) //XZ right plane, again collapse(2) potential
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(k = 1; k<= PZ ; k++)
{
old[i][PY+1][k] = 1;
new[i][PY+1][k] = 1;
}
}
if(Z_TOWARDS_U == MPI_PROC_NULL) //Z is constant here, towards you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX ; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][0] = 1;
new[i][j][0] = 1;
}
}
if(Z_AWAY_U == MPI_PROC_NULL) //Away from you XY plane, collapse(2)
{
#pragma omp for schedule(static)
for(i = 1 ; i<= PX; i++)
for(j = 1; j<= PY ; j++)
{
old[i][j][PZ+1] = 1;
new[i][j][PZ+1] = 1;
}
}
}
/* IGNORE SUBARRAY DECLARATION */
gsize[0] = PX+2; //Global sizes of 3-D cubes for each process
gsize[1] = PY+2;
gsize[2] = PZ+2;
start[0] = 0; //Will specify starting location while sending/receiving
start[1] = 0;
start[2] = 0;
local_x[0] = 1;
local_x[1] = PY;
local_x[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_x, start, MPI_ORDER_C, MPI_FLOAT, &x_subarray);
MPI_Type_commit(&x_subarray);
local_y[0] = PX;
local_y[1] = 1;
local_y[2] = PZ;
MPI_Type_create_subarray(ndims, gsize, local_y, start, MPI_ORDER_C, MPI_FLOAT, &y_subarray);
MPI_Type_commit(&y_subarray);
local_z[0] = PX;
local_z[1] = PY;
local_z[2] = 1;
MPI_Type_create_subarray(ndims, gsize, local_z, start, MPI_ORDER_C, MPI_FLOAT, &z_subarray);
MPI_Type_commit(&z_subarray);
t_setup_end = MPI_Wtime();
strt = MPI_Wtime();
while(G_max_err > Tol) //iterations < ITERATIONS)
{
iterations++ ;
t_comm_strt = MPI_Wtime();
/* IGNORE MPI COMMUNICATION */
MPI_Irecv(&old[0][1][1], 1, x_subarray, X_DOWN, 10, new_comm, &recv[0]);
MPI_Irecv(&old[PX+1][1][1], 1, x_subarray, X_UP, 20, new_comm, &recv[1]);
MPI_Irecv(&old[1][PY+1][1], 1, y_subarray, Y_RIGHT, 30, new_comm, &recv[2]);
MPI_Irecv(&old[1][0][1], 1, y_subarray, Y_LEFT, 40, new_comm, &recv[3]);
MPI_Irecv(&old[1][1][PZ+1], 1, z_subarray, Z_AWAY_U, 50, new_comm, &recv[4]);
MPI_Irecv(&old[1][1][0], 1, z_subarray, Z_TOWARDS_U, 60, new_comm, &recv[5]);
MPI_Isend(&old[PX][1][1], 1, x_subarray, X_UP, 10, new_comm, &send[0]);
MPI_Isend(&old[1][1][1], 1, x_subarray, X_DOWN, 20, new_comm, &send[1]);
MPI_Isend(&old[1][1][1], 1, y_subarray, Y_LEFT, 30, new_comm, &send[2]);
MPI_Isend(&old[1][PY][1], 1, y_subarray, Y_RIGHT, 40, new_comm, &send[3]);
MPI_Isend(&old[1][1][1], 1, z_subarray, Z_TOWARDS_U, 50, new_comm, &send[4]);
MPI_Isend(&old[1][1][PZ], 1, z_subarray, Z_AWAY_U, 60, new_comm, &send[5]);
MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
t_comm_end = MPI_Wtime();
t_comm_sum = t_comm_sum + (t_comm_end - t_comm_strt);
/* Use threads in Independent update */
t_i_strt = MPI_Wtime();
l_max_err = 0.0; //Very important, Reduction result is combined with this !
/* THIS IS THE IMPORTANT REGION */
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,chunk) private(threadID,nthreads) reduction(max:l_max_err)
{
nthreads = omp_get_num_threads();
threadID = omp_get_thread_num();
chunk = (PX-1+1) / nthreads ;
l_max_err = independent_update(old, new, PX+2, PY+2, PZ+2, threadID, chunk);
}
t_i_end = MPI_Wtime();
t_i_sum = t_i_sum + (t_i_end - t_i_strt) ;
/* IGNORE THE REMAINING CODE */
t_allred_strt = MPI_Wtime();
MPI_Allreduce(&l_max_err, &G_max_err, 1, MPI_FLOAT, MPI_MAX, new_comm);
t_allred_end = MPI_Wtime();
t_allred_total = t_allred_total + (t_allred_end - t_allred_strt);
temp = new ;
new = old;
old = temp;
}
MPI_Barrier(new_comm);
end = MPI_Wtime();
if( rank == 0)
{
printf("\nIterations = %d, G_max_err = %f", iterations, G_max_err);
printf("\nThe total SET-UP time for MPI and boundary conditions is %lf", (t_setup_end-t_setup_strt));
printf("\nThe total time for SOLVING is %lf", (end-strt));
printf("\nThe total time for INDEPENDENT COMPUTE %lf", t_i_sum);
printf("\nThe total time for COMMUNICATION OVERHEAD is %lf", t_comm_sum);
printf("\nThe total time for MPI_ALLREDUCE() is %lf", t_allred_total);
}
MPI_Type_free(&x_subarray);
MPI_Type_free(&y_subarray);
MPI_Type_free(&z_subarray);
free(&old[0][0][0]);
free(&new[0][0][0]);
MPI_Finalize();
return 0;
}
P.S. : I am almost sure that the cost of spawning/waking the threads is not the reason for such a huge difference in the timing.
Please find attached Scalasca snapshot for INDEPENDENT COMPUTE of the Hybrid Program.
Using loop simd construct
#pragma omp parallel default(none) shared(old,new,PX,PY,PZ,l_max_err) private(i,j,k,diff)
{
    #pragma omp for simd schedule(static) reduction(max:l_max_err)
    for(i = 1; i <= PX; i++)
    {
        for(j = 1; j <= PY; j++)
        {
            for(k = 1; k <= PZ; k++)
            {
                new[i][j][k] = (1/6.0) * (old[i-1][j][k] + old[i+1][j][k] +
                                          old[i][j-1][k] + old[i][j+1][k] +
                                          old[i][j][k-1] + old[i][j][k+1]);
                diff = 1.0 - new[i][j][k];
                diff = (diff > 0 ? diff : -1.0 * diff);
                if(diff > l_max_err)
                    l_max_err = diff;
            }
        }
    }
}
You frequently get memory access and cache issues when you just do one MPI process per socket on a CPU with multiple memory controllers. It can be on either the read or the write side, so you can't really say which. This is especially an issue when doing thread-parallel execution with lightweight compute tasks (e.g. math on arrays). One MPI process per socket in this case tends to fare significantly worse than pure MPI.
In your BIOS, set up whatever the maximal NUMA per socket option is
Use one MPI process per NUMA node.
Try some different parameter values in schedule(static). I've rarely found the default to be best.
Essentially what this will do is ensure each bundle of threads only works on a single pool of memory.
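To illustrate that last point, one common companion technique (my addition, not part of the answer, and it assumes the OS allocates pages on first write, which is the usual Linux default) is first-touch placement: initialise the arrays in a parallel loop with the same schedule(static) as the compute loop, so each thread's pages land on its local NUMA node:

#include <stdlib.h>

/* Sketch of first-touch placement: pages of u/unew are physically allocated on
   the NUMA node of the thread that first writes them, so using the same
   schedule(static) for initialisation and compute keeps each thread on local
   memory. The array size below just mirrors the 192^3 problem. */
int main(void)
{
    const long n = 192L * 192L * 192L;
    float *u    = (float *)malloc(n * sizeof *u);
    float *unew = (float *)malloc(n * sizeof *unew);

    #pragma omp parallel for schedule(static)       /* first touch */
    for (long i = 0; i < n; i++) {
        u[i]    = 0.0f;
        unew[i] = 0.0f;
    }

    #pragma omp parallel for schedule(static)       /* same distribution as above */
    for (long i = 0; i < n; i++)
        unew[i] = 0.5f * u[i];

    free(u);
    free(unew);
    return 0;
}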

calculate determinant of matrix with thread

I want to calculate the determinant of a matrix with a thread, but I get the error "term does not evaluate to a function taking 0 arguments". I want to handle a big matrix with threads by splitting the matrix. What can I do?
int determinant(int f[1000][1000], int x)
{
    int pr, c[1000], d = 0, b[1000][1000], j, p, q, t;
    if (x == 2)
    {
        d = 0;
        d = (f[1][1] * f[2][2]) - (f[1][2] * f[2][1]);
        return(d);
    }
    else
    {
        for (j = 1; j <= x; j++)
        {
            int r = 1, s = 1;
            for (p = 1; p <= x; p++)
            {
                for (q = 1; q <= x; q++)
                {
                    if (p != 1 && q != j)
                    {
                        b[r][s] = f[p][q];
                        s++;
                        if (s > x - 1)
                        {
                            r++;
                            s = 1;
                        }
                    }
                }
            }
            for (t = 1, pr = 1; t <= (1 + j); t++)
                pr = (-1)*pr;
            c[j] = pr*determinant(b, x - 1);
        }
        for (j = 1, d = 0; j <= x; j++)
        {
            d = d + (f[1][j] * c[j]);
        }
        return(d);
    }
}
int main()
{
    srand(time_t(NULL));
    int i, j;
    printf("\n\nEnter order of matrix : ");
    scanf_s("%d", &m);
    printf("\nEnter the elements of matrix\n");
    for (i = 1; i <= m; i++)
    {
        for (j = 1; j <= m; j++)
        {
            a[i][j] = rand() % 10;
        }
    }
    thread t(determinant(a, m));
    t.join();
    printf("\n Determinant of Matrix A is %d .", determinant(a, m));
}
The immediate problem is that here: thread t(determinant(a, m)); you pass the result of calling determinant(a, m) as the function to execute, and zero arguments to call that function with - but an int is not a function or other callable object, which is what the error you got complains about.
std::thread's constructor takes the function to run and the arguments to supply separately, so you would need to call std::thread(determinant, a, m).
Now we have another problem, std::thread doesn't provide a way to retrieve the return value, and so you calculate it again here: printf("\n Determinant of Matrix A is %d .", determinant(a, m));.
To fix this, we can use std::async from the <future> header, which will manage the thread handling for us, and lets us retrieve the result later:
auto result = std::async(std::launch::async, determinant, a, m);
int det = result.get();
This will run determinant(a,m) on a new thread, and return a std::future<int> into which the return value may eventually be placed.
We can then try to retrieve that value with std::future::get(), which will block until the value can be retrieved (or until an exception occurs in the thread).
In this example, we still execute determinant in a pretty serial fashion, since we delegate the work to a thread, then wait for that thread to finish its work before continuing.
However we are now free to store the future, and defer calling std::future::get() until we actually need the value, potentially much later in your program.
There are a few other problems in the rest of your code:
all your array indexing is off by one (array indices run from 0 to N-1 in C and C++)
a few of the variables you're using don't exist (like a and m)
C-arrays are passed by pointer, so if you ever change the code not to block on the thread right there, the array will go out of scope and your thread may read garbage from the dangling pointer. If you use a proper container like std::array or std::vector, you can pass it by value so your thread will own the data to operate on for its entire lifetime.
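Putting those points together, here is a minimal sketch (my own illustration, not code from the question or the answer above) of a 0-based determinant over std::vector, with the top-level cofactors farmed out through std::async:

#include <cstdio>
#include <cstdlib>
#include <future>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Build the minor of f obtained by deleting row 0 and column j.
Matrix minor_of(const Matrix& f, int j)
{
    const int n = static_cast<int>(f.size());
    Matrix b(n - 1, std::vector<int>(n - 1));
    for (int p = 1; p < n; p++)
        for (int q = 0, s = 0; q < n; q++)
            if (q != j) b[p - 1][s++] = f[p][q];
    return b;
}

// 0-based cofactor expansion along the first row; the matrix is taken by
// value so each task owns its data.
int determinant(Matrix f)
{
    const int n = static_cast<int>(f.size());
    if (n == 1) return f[0][0];
    if (n == 2) return f[0][0] * f[1][1] - f[0][1] * f[1][0];
    int d = 0;
    for (int j = 0; j < n; j++) {
        const int sign = (j % 2 == 0) ? 1 : -1;
        d += sign * f[0][j] * determinant(minor_of(f, j));
    }
    return d;
}

int main()
{
    const int m = 9;                                  // hypothetical order
    Matrix a(m, std::vector<int>(m));
    for (auto& row : a)
        for (auto& v : row) v = std::rand() % 10;

    // One async task per top-level cofactor; each task gets its own copy
    // of its minor, so nothing dangles even while main keeps running.
    std::vector<std::future<int>> parts;
    for (int j = 0; j < m; j++)
        parts.push_back(std::async(std::launch::async, determinant, minor_of(a, j)));

    int det = 0;
    for (int j = 0; j < m; j++) {
        const int sign = (j % 2 == 0) ? 1 : -1;
        det += sign * a[0][j] * parts[j].get();       // blocks only when collecting
    }
    std::printf("Determinant of Matrix A is %d.\n", det);
    return 0;
}

Each task receives its minor by value, so it owns its data for its whole lifetime, which addresses the dangling-pointer concern above.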

Conway's Game of Life array problems

I'm writing a Conway's Game of Life program for school. I am having trouble with the arrays holding the values I assign to them. At one point the program prints out the value assigned to them (1), yet at the end, when I need to print the array to show the iterations of the game, it shows an absurdly low number. The other trouble is that I ran into difficulties adding a loop that asks whether you want to run another iteration, so I removed it until the previous errors were fixed.
I'm writing this in C++.
#include <stdio.h>

int main (void)
{
    int currentarray [12][12];
    int futurearray [12][12];
    char c;
    char check = 'y';
    int neighbors = 0;
    int x = 0; // row
    int y = 0; // column

    printf("Birth an organism will be born in each empty location that has exactly three neighbors.\n");
    printf("Death an organism with four or more organisms as neighbors will die from overcrowding.\n");
    printf("An organism with fewer than two neighbors will die from loneliness.\n");
    printf("Survival an organism with two or three neighbors will survive to the next generation.\n");
    printf("To create life input x, y coordinates.\n");

    while (check == 'y')
    {
        printf("Enter x coordinate.\n");
        scanf("%d", &x); while((c = getchar()) != '\n' && c != EOF);
        printf("Enter y coordinate.\n");
        scanf("%d", &y); while((c = getchar()) != '\n' && c != EOF);
        currentarray [x][y] = 1;
        printf ("%d\n", currentarray[x][y]);
        printf("Do you wish to enter more input? y/n.\n");
        scanf("%c", &check); while((c = getchar()) != '\n' && c != EOF);
    }

    // Note - Need to add a printf statement showing the array before changes are made after input added.
    // check for neighbors
    while(check == 'y')
    {
        for(y = 0; y <= 12; y++)
        {
            for(x = 0; x <= 12; x++)
            {
                //Begin counting number of neighbors:
                if(currentarray[x-1][y-1] == 1) neighbors += 1;
                if(currentarray[x-1][y] == 1) neighbors += 1;
                if(currentarray[x-1][y+1] == 1) neighbors += 1;
                if(currentarray[x][y-1] == 1) neighbors += 1;
                if(currentarray[x][y+1] == 1) neighbors += 1;
                if(currentarray[x+1][y-1] == 1) neighbors += 1;
                if(currentarray[x+1][y] == 1) neighbors += 1;
                if(currentarray[x+1][y+1] == 1) neighbors += 1;
                //Apply rules to the cell:
                if(currentarray[x][y] == 1 && neighbors < 2)
                    futurearray[x][y] = 0;
                else if(currentarray[x][y] == 1 && neighbors > 3)
                    futurearray[x][y] = 0;
                else if(currentarray[x][y] == 1 && (neighbors == 2 || neighbors == 3))
                    futurearray[x][y] = 1;
                else if(currentarray[x][y] == 0 && neighbors == 3)
                    futurearray[x][y] = 1;
            }
        }
    }

    // Set the current array to the future and change the future to 0
    {
        for(y = 0; y < 12; y++)
        {
            for(x = 0; x < 12; x++)
            {
                //Begin the process
                currentarray [x][y] = futurearray [x][y];
                futurearray [x][y] = 0;
            }
        }
    }

    {
        for(y = 0; y < 12; y++)
        {
            for(x = 0; x < 12; x++)
            {
                //print the current life board
                printf("%d ", currentarray[x][y]);
            }
        }
    }

    // Have gone through one iteration of Life
    // Ask to do another iteration
    printf("Do you wish to continue y/n?\n");
    scanf("%c", &check); while((c = getchar()) != '\n' && c != EOF);
    return 0;
}
You are defining your arrays as [12][12].
In your generation loop you walk from 0 up to and including 12, which is 13 steps instead of the array's 12. Additionally, you access x-1 and y-1, which can be as low as -1; again not inside your array.
Sometimes you get semi-useful values from within your array, but on some borders you are just accessing random data.
Try to correct your border.
You forgot to set neighbors to 0 before counting them.
Since this is C++ (not C), you might as well declare neighbors inside the loop body. Makes these kinds of issues easier to spot, too.
Also, is it me, or is that while loop never going to finish? Your braces are a mess, in general, as is your indentation. You could do yourself and us a favour by cleaning those up.
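To make the first points concrete, a corrected counting loop might look roughly like this (a sketch under the assumption of a 12x12 board with 0-based indices, with the rule application written compactly; it is not the original poster's code):

#include <stdio.h>

#define SIZE 12

/* Sketch: count live neighbours with the bounds fixed and the counter reset
   for every cell. currentarray/futurearray are assumed zero-initialised. */
void step(int currentarray[SIZE][SIZE], int futurearray[SIZE][SIZE])
{
    for (int y = 0; y < SIZE; y++) {
        for (int x = 0; x < SIZE; x++) {
            int neighbors = 0;                      /* reset per cell */
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    if (dx == 0 && dy == 0)
                        continue;                   /* skip the cell itself */
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < SIZE && ny >= 0 && ny < SIZE)
                        neighbors += currentarray[nx][ny];
                }
            }
            if (currentarray[x][y] == 1)
                futurearray[x][y] = (neighbors == 2 || neighbors == 3);
            else
                futurearray[x][y] = (neighbors == 3);
        }
    }
}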
Obviously agree with all the above suggestions. One nice trick you might want to implement with Life is to create an extra border around your area. So if the user wants a 12x12 grid (and you should allow width/height to be specified and allocate memory dynamically) internally you hold a 14x14 grid corresponding to a border around the actual grid. Before running the calculation copy the top row to the bottom border, bottom row to the top border etc. Now you can run the main algorithm on the inner 12x12 grid without worrying about edge cases. This will enable your patterns to re-appear on the other side if they fall off the edge.
You're also forgetting to set the values of both arrays to zero. This will take care of the ridiculous number issue you're having. You can do that by copying this for loop:
for(y = 0; y < 12; y++)
{
    for(x = 0; x < 12; x++)
    {
        //Begin the process
        currentarray [x][y] = futurearray [x][y];
        futurearray [x][y] = 0;
    }
}
and pasting it before the while loop but instead of setting currentarray[x][y] = futurearray[x][y], set it to 0. Also, if the coordinates are viewable locations instead of array co-ordinates, you'll want to change this:
printf ("%d\n", currentarray[x][y]);
to this:
printf ("%d\n", currentarray[x-1][y-1]);
I would also recommend putting a printf with a newline (\n) after each row has been printed and a tab (\t) after each item so that the formatting looks cleaner.
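For illustration, such a print helper could look like this (a sketch only; SIZE is an assumption mirroring the 12x12 board):

#include <stdio.h>

#define SIZE 12

/* Print the board one row per line, with a tab between cells. */
void print_board(int board[SIZE][SIZE])
{
    for (int y = 0; y < SIZE; y++) {
        for (int x = 0; x < SIZE; x++)
            printf("%d\t", board[x][y]);
        printf("\n");        /* newline after each row */
    }
}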
