How to make the VC++ compiler optimize my code with SIMD better?

I'm learning SIMD and thinking about how to get the compiler to vectorize my code better. I'm currently playing with Visual C++ 2013 on x86.
I have two input arrays, and I want to compute their element-wise sum like this:
void computeSum(float* __restrict arr, float* __restrict inp1, float* __restrict inp2, int count)
{
    __declspec(align(16)) float* p1 = inp1;
    __declspec(align(16)) float* p2 = inp2;
    __declspec(align(16)) float* ret = arr;
    while (count > 0)
    {
        ret[0] = p1[0] + p2[0];
        ret[1] = p1[1] + p2[1];
        ret[2] = p1[2] + p2[2];
        ret[3] = p1[3] + p2[3];
        p1 += 4;
        p2 += 4;
        ret += 4;
        count -= 4;
    }
}
I want to tell the compiler that the arrays are aligned to a 16-byte boundary, that none of them overlaps another, and that each loop iteration computes the sum of 4 consecutive floats.
But in the generated code, VC prefers MOVSS/ADDSS and does not use ADDPS as I had hoped.
If I configure the project to use the LLVM-vs2013 toolchain, it does use ADDPS to compute the sums.
I know how to write SIMD code with compiler intrinsics, but that's not what I want here.
Are there any other hints VC needs before it will emit the ADDPS instruction?
This is the full piece of code.
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h> // for _aligned_malloc

void computeSum(float* __restrict arr, float* __restrict inp1, float* __restrict inp2, int count)
{
    __declspec(align(16)) float* p1 = inp1;
    __declspec(align(16)) float* p2 = inp2;
    __declspec(align(16)) float* ret = arr;
    while (count > 0)
    {
        ret[0] = p1[0] + p2[0];
        ret[1] = p1[1] + p2[1];
        ret[2] = p1[2] + p2[2];
        ret[3] = p1[3] + p2[3];
        p1 += 4;
        p2 += 4;
        ret += 4;
        count -= 4;
    }
}

int main()
{
    float* inp1 = (float*)_aligned_malloc(sizeof(float) * 128, 16);
    float* inp2 = (float*)_aligned_malloc(sizeof(float) * 128, 16);
    float* result = (float*)_aligned_malloc(sizeof(float) * 128, 16);
    for (int i = 0; i < 128; ++i)
    {
        inp1[i] = inp2[i] = i;
    }
    computeSum(result, inp1, inp2, 128);
    for (int i = 0; i < 128; ++i)
    {
        printf("%f\t", result[i]);
    }
    return 0;
}

Visual C++ 2013 and later default to /arch:SSE2 for x86, but you should still check the settings in your Visual Studio project to make sure it hasn't been explicitly set to something else. For x64, /arch:SSE2 is implicit.
The only time Visual C++ automatically generates multi-lane instructions (like ADDPS) rather than single-lane ones (ADDSS) is through the auto-vectorizer. See MSDN for details, and pay particular attention to the /Qvec-report:2 switch. Note that auto-vectorization will not happen with optimizations disabled, as is common in Debug configurations.
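For example, the auto-vectorizer is much more likely to fire on a plain counted loop than on the manually unrolled while loop above. A minimal sketch (compile with /O2 and verify with /Qvec-report:2):

void computeSum(float* __restrict arr, float* __restrict inp1,
                float* __restrict inp2, int count)
{
    // A simple counted loop with unit stride is the pattern the
    // auto-vectorizer documentation describes as vectorizable.
    for (int i = 0; i < count; ++i)
        arr[i] = inp1[i] + inp2[i];
}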
Most SIMD (multi-lane) codegen is better accomplished with explicit intrinsics. For many examples of this style of coding, see DirectXMath.
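For illustration, a hand-written SSE version of the question's function might look like the sketch below. It assumes, as the question does, that all three pointers are 16-byte aligned and that count is a multiple of 4:

#include <xmmintrin.h>

void computeSumSSE(float* __restrict arr, float* __restrict inp1,
                   float* __restrict inp2, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 a = _mm_load_ps(inp1 + i);        // aligned 4-float load
        __m128 b = _mm_load_ps(inp2 + i);
        _mm_store_ps(arr + i, _mm_add_ps(a, b)); // ADDPS, then aligned store
    }
}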

Related

MPI4PY: ring communication with neighbor_alltoallw

Please help!
I am using MPI (Message Passing Interface) in Python for a ring communication, meaning that every rank sends to and receives from its neighbors. I know one way to realize this is by using, for instance, MPI.COMM_WORLD.issend() and MPI.COMM_WORLD.recv(); this is working and done.
Now I want to produce the same output a different way, using MPI.Topocomm.Neighbor_alltoallw, but it is not working. I wrote a C code that works, so the same output can be reached with this function, but when I implement it in Python it does not work. Please find below the C code and the Python code.
The definition of the function (from the mpi4py package) says:
Neighbor_alltoallw(...)
Topocomm.Neighbor_alltoallw(self, sendbuf, recvbuf)
Neighbor All-to-All Generalized
I do not understand the following:
Why is recvbuf not a return value? It seems to be an argument here.
How can this be implemented for a ring communication in Python?
Thank you for your time and support!
My working C code:
#include <stdio.h>
#include <mpi.h>

#define to_right 201
#define max_dims 1

int main (int argc, char *argv[])
{
    int my_rank, size;
    int snd_buf, rcv_buf;
    int right, left;
    int sum, i;

    MPI_Comm new_comm;
    int dims[max_dims],
        periods[max_dims],
        reorder;

    MPI_Aint snd_displs[2], rcv_displs[2];
    int snd_counts[2], rcv_counts[2];
    MPI_Datatype snd_types[2], rcv_types[2];

    MPI_Status status;
    MPI_Request request;

    MPI_Init(&argc, &argv);

    /* Get process info. */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Set cartesian topology. */
    dims[0] = size;
    periods[0] = 1;
    reorder = 1;
    MPI_Cart_create(MPI_COMM_WORLD, max_dims, dims, periods,
                    reorder, &new_comm);

    /* Get coords */
    MPI_Comm_rank(new_comm, &my_rank);
    /* MPI_Cart_coords(new_comm, my_rank, max_dims, my_coords); */

    /* Get nearest neighbour rank. */
    MPI_Cart_shift(new_comm, 0, 1, &left, &right);

    /* Compute global sum. */
    sum = 0;
    snd_buf = my_rank;
    rcv_buf = -1000; /* unused value, should be overwritten by first MPI_Recv; only for test purposes */

    rcv_counts[0] = 1; MPI_Get_address(&rcv_buf, &rcv_displs[0]); snd_types[0] = MPI_INT;
    rcv_counts[1] = 0; rcv_displs[1] = 0 /*unused*/;              snd_types[1] = MPI_INT;
    snd_counts[0] = 0; snd_displs[0] = 0 /*unused*/;              rcv_types[0] = MPI_INT;
    snd_counts[1] = 1; MPI_Get_address(&snd_buf, &snd_displs[1]); rcv_types[1] = MPI_INT;

    for (i = 0; i < size; i++)
    {
        /* Substituted by MPI_Neighbor_alltoallw() :
        MPI_Issend(&snd_buf, 1, MPI_INT, right, to_right,
                   new_comm, &request);
        MPI_Recv(&rcv_buf, 1, MPI_INT, left, to_right,
                 new_comm, &status);
        MPI_Wait(&request, &status);
        */
        MPI_Neighbor_alltoallw(MPI_BOTTOM, snd_counts, snd_displs, snd_types,
                               MPI_BOTTOM, rcv_counts, rcv_displs, rcv_types, new_comm);
        snd_buf = rcv_buf;
        sum += rcv_buf;
    }

    printf("PE%i:\tSum = %i\n", my_rank, sum);

    MPI_Finalize();
}
My non-working Python code:
from mpi4py import MPI

size = MPI.COMM_WORLD.Get_size()
my_rank = MPI.COMM_WORLD.Get_rank()

to_right = 201
max_dims = 1
dims = [max_dims]
periods = [max_dims]
dims[0] = size
periods[0] = 1
reorder = True

new_comm = MPI.Intracomm.Create_cart(MPI.COMM_WORLD, dims, periods, True)
my_rank = new_comm.Get_rank()
left_right = MPI.Cartcomm.Shift(new_comm, 0, 1)
left = left_right[0]
right = left_right[1]

sum = 0
snd_buf = my_rank
rcv_buf = -1000  # unused value, should be overwritten, only for test purposes

for counter in range(0, size):
    MPI.Topocomm.Neighbor_alltoallw(new_comm, snd_buf, rcv_buf)
    snd_buf = rcv_buf
    sum = sum + rcv_buf

print('PE ', my_rank, 'sum=', sum)
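For reference, here is a minimal sketch of the ring exchange the question says already works, using mpi4py's pickle-based issend/recv (this is just the known-working baseline, not the Neighbor_alltoallw variant being asked about):

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
left = (rank - 1) % size    # neighbour ranks on a periodic ring
right = (rank + 1) % size

snd_buf = rank
total = 0
for _ in range(size):
    req = comm.issend(snd_buf, dest=right, tag=201)  # non-blocking synchronous send
    rcv_buf = comm.recv(source=left, tag=201)
    req.Wait()
    snd_buf = rcv_buf
    total += rcv_buf

print('PE', rank, 'sum =', total)   # every rank should print sum(0..size-1)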

Multithreaded & SIMD vectorized Mandelbrot in R using Rcpp & OpenMP

As an OpenMP & Rcpp performance test, I wanted to check how fast I could calculate the Mandelbrot set in R using the most straightforward and simple Rcpp+OpenMP implementation. Currently, what I have is:
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericMatrix mandelRcpp(const double x_min, const double x_max, const double y_min, const double y_max,
                               const int res_x, const int res_y, const int nb_iter) {
    Rcpp::NumericMatrix ret(res_x, res_y);
    double x_step = (x_max - x_min) / res_x;
    double y_step = (y_max - y_min) / res_y;
    int r, c;
    #pragma omp parallel for default(shared) private(c) schedule(dynamic,1)
    for (r = 0; r < res_y; r++) {
        for (c = 0; c < res_x; c++) {
            double zx = 0.0, zy = 0.0, new_zx;
            double cx = x_min + c*x_step, cy = y_min + r*y_step;
            int n = 0;
            for (n = 0; (zx*zx + zy*zy < 4.0) && (n < nb_iter); n++) {
                new_zx = zx*zx - zy*zy + cx;
                zy = 2.0*zx*zy + cy;
                zx = new_zx;
            }
            ret(c, r) = n;
        }
    }
    return ret;
}
And then in R:
library(Rcpp)
sourceCpp("mandelRcpp.cpp")
xlims=c(-0.74877,-0.74872);
ylims=c(0.065053,0.065103);
x_res=y_res=1080L; nb_iter=10000L;
system.time(m <- mandelRcpp(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter))
# 0.92s
rainbow=c(rgb(0.47,0.11,0.53),rgb(0.27,0.18,0.73),rgb(0.25,0.39,0.81),rgb(0.30,0.57,0.75),rgb(0.39,0.67,0.60),rgb(0.51,0.73,0.44),rgb(0.67,0.74,0.32),rgb(0.81,0.71,0.26),rgb(0.89,0.60,0.22),rgb(0.89,0.39,0.18),rgb(0.86,0.13,0.13))
cols=c(colorRampPalette(rainbow)(100),rev(colorRampPalette(rainbow)(100)),"black") # palette
par(mar=c(0, 0, 0, 0))
system.time(image(m^(1/7), col=cols, asp=diff(ylims)/diff(xlims), axes=F, useRaster=T))
# 0.5s
I was unsure, though, whether there are any other obvious speed improvements I could take advantage of aside from OpenMP multithreading, e.g. via SIMD vectorization? (Using simd options in the OpenMP #pragma didn't seem to do anything.)
PS: at first my code was crashing, but I later found this was solved by replacing ret[r,c] = n; with ret(r,c) = n;.
Using Armadillo classes, as suggested in the answer below, makes things very slightly faster, though the timings are almost the same. I also flipped around x and y so the set comes out in the right orientation when plotted with image(). Using 8 threads, this is ca. 350 times faster than the vectorized plain R Mandelbrot version here, and about 7.3 times faster than the (non-multithreaded) Python/Numba version here (similar to PyCUDA or PyOpenCL speeds), so I'm quite happy with that... Rasterizing/display now seems to be the bottleneck in R.
Do not use OpenMP with Rcpp's *Vector or *Matrix objects, as they wrap SEXP functions and memory allocations that are single-threaded, while OpenMP is a multi-threaded approach.
This is why the code is crashing.
One way to get around this limitation is to use a non-R data structure to store the results. Any of the following will be sufficient: arma::mat, Eigen::MatrixXd, or std::vector<T>. As I favor Armadillo, I will change the result matrix from Rcpp::NumericMatrix to arma::mat. Thus, the following will execute your code in parallel:
#include <RcppArmadillo.h> // Note the changed include and new attribute
// [[Rcpp::depends(RcppArmadillo)]]

// Avoid including the OpenMP header if it is not on the system
#ifdef _OPENMP
#include <omp.h>
#endif
// [[Rcpp::plugins(openmp)]]

// Note the changed return type
// [[Rcpp::export]]
arma::mat mandelRcpp(const double x_min, const double x_max,
                     const double y_min, const double y_max,
                     const int res_x, const int res_y, const int nb_iter) {
    arma::mat ret(res_x, res_y); // note change
    double x_step = (x_max - x_min) / res_x;
    double y_step = (y_max - y_min) / res_y;
    unsigned r, c;
    #pragma omp parallel for shared(ret) private(c) // the inner loop counter c must be private
    for (r = 0; r < res_y; r++) {
        for (c = 0; c < res_x; c++) {
            double zx = 0.0, zy = 0.0, new_zx;
            double cx = x_min + c*x_step, cy = y_min + r*y_step;
            unsigned n = 0;
            for (; (zx*zx + zy*zy < 4.0) && (n < nb_iter); n++) {
                new_zx = zx*zx - zy*zy + cx;
                zy = 2.0*zx*zy + cy;
                zx = new_zx;
            }
            if (n == nb_iter) {
                n = 0;
            }
            ret(r, c) = n;
        }
    }
    return ret;
}
With the test code (note y and x were not defined, thus I assumed y = ylims and x = xlims) we have:
xlims = ylims = c(-2.0, 2.0)
x_res = y_res = 400L
nb_iter = 256L
system.time(m <-
mandelRcpp(xlims[[1]], xlims[[2]],
ylims[[1]], ylims[[2]],
x_res, y_res, nb_iter))
rainbow = c(
rgb(0.47, 0.11, 0.53),
rgb(0.27, 0.18, 0.73),
rgb(0.25, 0.39, 0.81),
rgb(0.30, 0.57, 0.75),
rgb(0.39, 0.67, 0.60),
rgb(0.51, 0.73, 0.44),
rgb(0.67, 0.74, 0.32),
rgb(0.81, 0.71, 0.26),
rgb(0.89, 0.60, 0.22),
rgb(0.89, 0.39, 0.18),
rgb(0.86, 0.13, 0.13)
)
cols = c(colorRampPalette(rainbow)(100),
rev(colorRampPalette(rainbow)(100)),
"black") # palette
par(mar = c(0, 0, 0, 0))
image(m,
col = cols,
asp = diff(range(ylims)) / diff(range(xlims)),
axes = F)
I went ahead and vectorized the OP's code using GCC's and Clang's vector extensions. Before I show how I did this, let me show the performance on the following hardware:
Skylake (SKL) at 3.1 GHz, 4 cores
Knights Landing (KNL) at 1.5 GHz, 68 cores
ARMv8 Cortex-A57 AArch64 (Nvidia Jetson TX1), 4 cores at ? GHz
nb_iter = 1000000

                   GCC          Clang
SKL_scalar         6m5,422s
SKL_SSE41          3m18,058s
SKL_AVX2           1m37,843s    1m39,943s
SKL_scalar_omp     0m52,237s
SKL_SSE41_omp      0m29,624s    0m31,356s
SKL_AVX2_omp       0m14,156s    0m16,783s

ARM_scalar         15m28.285s
ARM_vector         9m26.384s
ARM_scalar_omp     3m54.242s
ARM_vector_omp     2m21.780s

KNL_scalar         19m34.121s
KNL_SSE41          11m30.280s
KNL_AVX2           5m0.005s     6m39.568s
KNL_AVX512         2m40.934s    6m20.061s
KNL_scalar_omp     0m9.108s
KNL_SSE41_omp      0m6.666s     0m6.992s
KNL_AVX2_omp       0m2.973s     0m3.988s
KNL_AVX512_omp     0m1.761s     0m3.335s
The theoretical speedup of KNL vs. SKL is
(68 cores / 4 cores) * (1.5 GHz / 3.1 GHz) * (8 doubles per lane / 4 doubles per lane) = 16.45
I went into detail about GCC's and Clang's vector extension capabilities here. To vectorize the OP's code, there are three additional vector operations that we need to define.
1. Broadcasting
For a vector v and a scalar s, GCC cannot do v = s but Clang can. However, I found a nice solution here which works for both GCC and Clang. For example:
vsi v = s - (vsi){};
2. An any() function like in OpenCL or in R
The best I came up with is this generic function:
static bool any(vli const & x) {
    for (int i = 0; i < VLI_SIZE; i++) if (x[i]) return true;
    return false;
}
Clang actually generates relatively efficient code for this using the ptest instruction (though not for AVX512), but GCC does not.
3. Compression
The calculations are done as 64-bit doubles, but the result is written out as 32-bit integers. So two calculations are done with 64-bit integers, and the two results are then compressed into one vector of 32-bit integers. I came up with a generic solution which Clang handles well:
static vsi compress(vli const & lo, vli const & hi) {
    vsi lo2 = (vsi)lo, hi2 = (vsi)hi, z;
    for (int i = 0; i < VLI_SIZE; i++) z[i + 0*VLI_SIZE] = lo2[2*i];
    for (int i = 0; i < VLI_SIZE; i++) z[i + 1*VLI_SIZE] = hi2[2*i];
    return z;
}
The following solution works better for GCC but is no better for Clang. Since this function is not performance-critical, I just use the generic version.
static vsi compress(vli const & low, vli const & high) {
#if defined(__clang__)
    return __builtin_shufflevector((vsi)low, (vsi)high, MASK);
#else
    return __builtin_shuffle((vsi)low, (vsi)high, (vsi){MASK});
#endif
}
These definitions don't rely on anything x86-specific, and the code (defined below) compiles for ARM processors as well, with both GCC and Clang.
Now that these are defined, here is the code:
#include <string.h>
#include <inttypes.h>
#include <Rcpp.h>
using namespace Rcpp;

#ifdef _OPENMP
#include <omp.h>
#endif
// [[Rcpp::plugins(openmp)]]
// [[Rcpp::plugins(cpp14)]]

#if defined ( __AVX512F__ ) || defined ( __AVX512__ )
static const int SIMD_SIZE = 64;
#elif defined ( __AVX2__ )
static const int SIMD_SIZE = 32;
#else
static const int SIMD_SIZE = 16;
#endif

static const int VSI_SIZE = SIMD_SIZE/sizeof(int32_t);
static const int VLI_SIZE = SIMD_SIZE/sizeof(int64_t);
static const int VDF_SIZE = SIMD_SIZE/sizeof(double);

#if defined(__clang__)
typedef int32_t vsi __attribute__ ((ext_vector_type(VSI_SIZE)));
typedef int64_t vli __attribute__ ((ext_vector_type(VLI_SIZE)));
typedef double  vdf __attribute__ ((ext_vector_type(VDF_SIZE)));
#else
typedef int32_t vsi __attribute__ ((vector_size (SIMD_SIZE)));
typedef int64_t vli __attribute__ ((vector_size (SIMD_SIZE)));
typedef double  vdf __attribute__ ((vector_size (SIMD_SIZE)));
#endif

static bool any(vli const & x) {
    for (int i = 0; i < VLI_SIZE; i++) if (x[i]) return true;
    return false;
}

static vsi compress(vli const & lo, vli const & hi) {
    vsi lo2 = (vsi)lo, hi2 = (vsi)hi, z;
    for (int i = 0; i < VLI_SIZE; i++) z[i + 0*VLI_SIZE] = lo2[2*i];
    for (int i = 0; i < VLI_SIZE; i++) z[i + 1*VLI_SIZE] = hi2[2*i];
    return z;
}

// [[Rcpp::export]]
IntegerVector frac(double x_min, double x_max, double y_min, double y_max, int res_x, int res_y, int nb_iter) {
    IntegerVector out(res_x*res_y);
    vdf x_minv = x_min - (vdf){}, y_minv = y_min - (vdf){};
    vdf x_stepv = (x_max - x_min)/res_x - (vdf){}, y_stepv = (y_max - y_min)/res_y - (vdf){};
    double a[VDF_SIZE] __attribute__ ((aligned(SIMD_SIZE)));
    for (int i = 0; i < VDF_SIZE; i++) a[i] = 1.0*i;
    vdf vi0 = *(vdf*)a;
    #pragma omp parallel for schedule(dynamic) collapse(2)
    for (int r = 0; r < res_y; r++) {
        for (int c = 0; c < res_x/(VSI_SIZE); c++) {
            vli nv[2] = {0 - (vli){}, 0 - (vli){}};
            for (int j = 0; j < 2; j++) {
                vdf c2 = 1.0*VDF_SIZE*(2*c+j) + vi0;
                vdf zx = 0.0 - (vdf){}, zy = 0.0 - (vdf){}, new_zx;
                vdf cx = x_minv + c2*x_stepv, cy = y_minv + r*y_stepv;
                vli t = -1 - (vli){};
                for (int n = 0; any(t = zx*zx + zy*zy < 4.0) && n < nb_iter; n++, nv[j] -= t) {
                    new_zx = zx*zx - zy*zy + cx;
                    zy = 2.0*zx*zy + cy;
                    zx = new_zx;
                }
            }
            vsi sp = compress(nv[0], nv[1]);
            memcpy(&out[r*res_x + VSI_SIZE*c], (int*)&sp, SIMD_SIZE);
        }
    }
    return out;
}
The R code is almost the same as the OP's code:
library(Rcpp)
sourceCpp("frac.cpp", verbose=TRUE, rebuild=TRUE)
xlims=c(-0.74877,-0.74872);
ylims=c(0.065053,0.065103);
x_res=y_res=1080L; nb_iter=100000L;
t = system.time(m <- frac(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter))
print(t)
m2 = matrix(m, ncol = x_res)
rainbow = c(
rgb(0.47, 0.11, 0.53),
rgb(0.27, 0.18, 0.73),
rgb(0.25, 0.39, 0.81),
rgb(0.30, 0.57, 0.75),
rgb(0.39, 0.67, 0.60),
rgb(0.51, 0.73, 0.44),
rgb(0.67, 0.74, 0.32),
rgb(0.81, 0.71, 0.26),
rgb(0.89, 0.60, 0.22),
rgb(0.89, 0.39, 0.18),
rgb(0.86, 0.13, 0.13)
)
cols = c(colorRampPalette(rainbow)(100),
rev(colorRampPalette(rainbow)(100)),"black") # palette
par(mar = c(0, 0, 0, 0))
image(m2^(1/7), col=cols, asp=diff(ylims)/diff(xlims), axes=F, useRaster=T)
To compile with GCC or Clang, change the file ~/.R/Makevars to:
CXXFLAGS= -Wall -std=c++14 -O3 -march=native -ffp-contract=fast -fopenmp
#uncomment the following two lines for clang
#CXX=clang-5.0
#LDFLAGS= -lomp
If you are having trouble getting OpenMP to work for Clang see this.
The code produces more or less the same image.

OpenMP and MPI hybrid dynamic scheduling

As the number of threads increases, the count (which is "temp") decreases.
When I set the number of threads to 1, it gives the correct answer, but as the number of threads increases the running time gets shorter while the answer becomes wrong.
#include <stdio.h>
#include <stdlib.h> /* for atoi */
#include <mpi.h>
#include <complex.h>
#include <time.h>
#include <omp.h>

#define MAXITERS 1000

// globals
int count = 0;
int nptsside;
float side2;
float side4;
int temp = 0;

int inset(double complex c) {
    int iters;
    float rl, im;
    double complex z = c;
    for (iters = 0; iters < MAXITERS; iters++) {
        z = z*z + c;
        rl = creal(z);
        im = cimag(z);
        if (rl*rl + im*im > 4) return 0;
    }
    return 1;
}

int main(int argc, char **argv)
{
    nptsside = atoi(argv[1]);
    side2 = nptsside / 2.0;
    side4 = nptsside / 4.0;

    //struct timespec bgn, nd;
    //clock_gettime(CLOCK_REALTIME, &bgn);

    int x, y; float xv, yv;
    double complex z;
    int i;
    int mystart, myend;
    int nrows;
    int nprocs, mype;
    int data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mype);

    nrows = nptsside/nprocs;
    printf("%d\n", nprocs);
    mystart = mype*nrows;
    myend = mystart + nrows - 1;

    #pragma omp parallel shared(mystart, myend, temp)
    {
        int nth = omp_get_num_threads();
        printf("%d\n", nth);

    #ifdef STATIC
    #pragma omp for reduction(+:temp) schedule(static)
    #elif defined DYNAMIC
    #pragma omp for reduction(+:temp) schedule(dynamic)
    #elif defined GUIDED
    #pragma omp for reduction(+:temp) schedule(guided)
    #endif
        for (x = mystart; x <= myend; x++) {
            for (y = 0; y < nptsside; y++) {
                xv = (x - side2) / side4;
                yv = (y - side2) / side4;
                z = xv + yv*I;
                if (inset(z)) {
                    temp++;
                }
            }
        }
    }

    if (mype == 0) {
        count += temp;
        printf("%d\n", temp);
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(&temp, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
            count += temp;
            printf("%d\n", temp);
        }
    }
    else {
        MPI_Send(&temp, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    if (mype == 0) {
        printf("%d\n", count);
    }
    //clock_gettime(CLOCK_REALTIME, &nd);
    //printf("%f\n", timediff(bgn, nd));
}
You are not declaring any private variables for the OpenMP loop.
First, you must always make the loop counter of your OpenMP loop (as well as the counters of any loops nested inside it) private.
Second, you have three variables (xv, yv, and z) that are written on every iteration, so each thread needs its own private copy of these as well. Changing your parallel statement to
#pragma omp parallel shared(mystart, myend, temp) private(x, y, xv, yv, z)
should fix your OpenMP problems.
Seeing as you say that setting the number of threads to 1 yields the correct answer, I have not looked at your MPI code.
EDIT: OK, I lied; I did briefly look at your MPI code. Instead of all of your sends and receives, you should use a single reduction. This collective will be much faster than the blocking communication you currently have set up:
MPI_Reduce(&temp, &count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
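In context, the end of main() then collapses to something like this sketch:

/* every rank calls the collective; the summed result lands in count on rank 0 */
MPI_Reduce(&temp, &count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (mype == 0) {
    printf("%d\n", count);
}
MPI_Finalize();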

Improving space complexity in matrix chain multiplication

I want to know whether there is any way to improve space complexity of Dynamic Programming solution of matrix multiplication problem from O(N^2) to something better?
Here is a solution with space complexity O(n):
#include <cstdio>
#include <algorithm>
using namespace std;

int min_cost(int a[], int n) {
    int* b = new int[n]; // new int[n] allocates an array; new int(n) would allocate a single int
    b[1] = 0;
    b[2] = a[0]*a[1]*a[2];
    for (int j = 3; j < n; j++) {
        b[j] = min(b[j-1] + a[0]*a[j-1]*a[j],
                   b[j-2] + a[j-1]*a[j-2]*a[j] + a[0]*a[j-2]*a[j]);
    }
    int result = b[n-1];
    delete[] b;
    return result;
}

int main() {
    int arr[] = {10, 20, 30};
    int size = sizeof(arr)/sizeof(arr[0]);
    printf("Minimum number of multiplications is %d ", min_cost(arr, size));
    return 0;
}

ArrayFire frame search algorithm crash

I am new to ArrayFire and CUDA development in general; I just started using ArrayFire a couple of days ago, after failing miserably with Thrust.
I am building an ArrayFire-based algorithm that is supposed to search for a single 32x32 pixel frame in a database of a couple hundred thousand 32x32 frames stored in device memory.
First I initialize a matrix that has 1024 + 1 pixels as rows (I need an extra one to keep a frame group id) and a predefined number (in this case 1000) of frames, indexed by column.
Here's the function that performs the search. If I uncomment "pixels_uint32 = device_frame_ptr[pixel_group_idx];", the program crashes. The pointer seems to be valid, so I do not understand why this happens. Maybe there is something I do not know about accessing device memory this way?
#include <iostream>
#include <stdio.h>
#include <sys/types.h>
#include <arrayfire.h>
#include "utils.h"

using namespace af;
using namespace std;

/////////////////////////// CUDA settings ////////////////////////////////
#define TEST_DEBUG false
#define MAX_NUMBER_OF_FRAMES 1000 // maximum (2499999 frames) X (1024 + 1 pixels per frame) x (2 bytes per pixel) = 5.124.997.950 bytes (~ 5GB)
#define BLOB_FINGERPRINT_SIZE 1024 //32x32
//percentage of macroblocks that should match: 0.9 means 90%
#define MACROBLOCK_COMPARISON_OVERALL_THRESHOLD 768 //1024 * 0.75
//////////////////////// End of CUDA settings ////////////////////////////

array search_frame(array d_db_vec)
{
    try {
        uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;

        // create a one-element-per-frame array to hold the result of the computation
        array frame_found(1, MAX_NUMBER_OF_FRAMES, u32);
        frame_found = 0;

        gfor (array frame_idx, MAX_NUMBER_OF_FRAMES) {
            // get the blob id; it's the last row of the matrix
            array blob_id = d_db_vec(number_of_uint32_for_frame, frame_idx); // addressing with (pixel_idx, frame_idx)

            // define some hardcoded pixel to search for
            uint8_t searched_r = 0x0;
            uint8_t searched_g = 0x3F;
            uint8_t searched_b = 0x0;

            uint8_t b1 = 0;
            uint8_t g1 = 0;
            uint8_t r1 = 0;
            uint8_t b2 = 0;
            uint8_t g2 = 0;
            uint8_t r2 = 0;
            uint32_t sum1 = 0;
            uint32_t sum2 = 0;

            uint32_t *device_frame_ptr = NULL;
            uint32_t pixels_uint32 = 0;
            uint pixel_match_counter = 0;

            array frame = d_db_vec(span, frame_idx);
            device_frame_ptr = frame.device<uint32_t>();

            for (uint pixel_group_idx = 0; pixel_group_idx < number_of_uint32_for_frame; pixel_group_idx++) {
                // test to see if the whole matrix is traversed
                // d_db_vec(pixel_group_idx, frame_idx) = 0;

                /////////////////////////////// PROBLEMATIC CODE ///////////////////////////////////
                pixels_uint32 = 0x7E007E0;
                //pixels_uint32 = device_frame_ptr[pixel_group_idx]; //why does this crash the program?
                // if I uncomment the above line the program tries to copy the u32 frame into the pixels_uint32 variable
                // something goes wrong, since the pointer device_frame_ptr is not NULL and the elements should be there judging by the lines above
                ////////////////////////////////////////////////////////////////////////////////////

                // splitting the first pixel into its components
                b1 = (pixels_uint32 & 0xF8000000) >> 27; //(input & 11111000000000000000000000000000)
                g1 = (pixels_uint32 & 0x07E00000) >> 21; //(input & 00000111111000000000000000000000)
                r1 = (pixels_uint32 & 0x001F0000) >> 16; //(input & 00000000000111110000000000000000)
                // splitting the second pixel into its components
                b2 = (pixels_uint32 & 0xF800) >> 11;     //(input & 00000000000000001111100000000000)
                g2 = (pixels_uint32 & 0x07E0) >> 5;      //(input & 00000000000000000000011111100000)
                r2 = (pixels_uint32 & 0x001F);           //(input & 00000000000000000000000000011111)

                // checking if they are a match
                sum1 = abs(searched_r - r1) + abs(searched_g - g1) + abs(searched_b - b1);
                sum2 = abs(searched_r - r2) + abs(searched_g - g2) + abs(searched_b - b2);

                // if they match, increment the local counter
                pixel_match_counter = (sum1 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
                pixel_match_counter = (sum2 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
            }

            bool is_found = pixel_match_counter > MACROBLOCK_COMPARISON_OVERALL_THRESHOLD;
            // write down whether the frame is a match or not
            frame_found(0, frame_idx) = is_found ? frame_found(0, frame_idx) : blob_id;
        }

        // test to see if the whole matrix is traversed - this has to print zeroes
        if (TEST_DEBUG)
            print(d_db_vec);

        // return the matches array
        return frame_found;
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }
}

// make 2 green pixels
uint32_t make_test_pixel_group() {
    uint32_t b1 = 0x0;       //11111000000000000000000000000000
    uint32_t g1 = 0x7E00000; //00000111111000000000000000000000
    uint32_t r1 = 0x0;       //00000000000111110000000000000000
    uint32_t b2 = 0x0;       //00000000000000001111100000000000
    uint32_t g2 = 0x7E0;     //00000000000000000000011111100000
    uint32_t r2 = 0x0;       //00000000000000000000000000011111
    uint32_t green_pix = b1 | g1 | r1 | b2 | g2 | r2;
    return green_pix;
}

int main(int argc, char ** argv)
{
    info();

    /////////////////////////////////////// CREATE THE DATABASE ///////////////////////////////////////
    uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;
    array d_db_vec(number_of_uint32_for_frame + 1, // fingerprint size + 1 extra u32 for blob id
                   MAX_NUMBER_OF_FRAMES,           // number of frames
                   u32); // type of elements is 32-bit unsigned integer with the configuration RGBRGB (565565)

    if (TEST_DEBUG == true) {
        for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
            for (uint pix_idx = 0; pix_idx < number_of_uint32_for_frame; pix_idx++) {
                d_db_vec(pix_idx, frame_idx) = make_test_pixel_group(); // fill everything with green :D
            }
        }
    } else {
        d_db_vec = rand(number_of_uint32_for_frame + 1, MAX_NUMBER_OF_FRAMES);
    }

    cout << "Setting blob ids. \n\n";
    for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
        // set the blob id to 123456
        d_db_vec(number_of_uint32_for_frame, frame_idx) = 123456; // blob_id = 123456
    }
    if (TEST_DEBUG)
        print(d_db_vec);
    cout << "Done setting blob ids. \n\n";

    //////////////////////////////////// CREATE THE SEARCHED FRAME ///////////////////////////////////
    // to be done; for now we use the hardcoded values above to simulate the searched pixel:
    //     uint8_t searched_r = 0x0;
    //     uint8_t searched_g = 0x3F;
    //     uint8_t searched_b = 0x0;

    ///////////////////////////////////////////// SEARCH /////////////////////////////////////////////
    clock_t timer = startTimer();
    for (int i = 0; i < 1000; i++) {
        array frame_found = search_frame(d_db_vec);
        if (TEST_DEBUG)
            print(frame_found);
    }
    stopTimer(timer);
    return 0;
}
Here is the console output with the line commented:
arrayfire/examples/helloworld$ ./helloworld
ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)
License: Server (27000#server.accelereyes.com)
CUDA toolkit 5.0, driver 304.54
GPU0 Tesla C2075, 5376 MB, Compute 2.0
Memory Usage: 5312 MB free (5376 MB total)
Setting blob ids.
Done setting blob ids.
Time: 0.03 seconds.
Here is the console output with the line uncommented:
arrayfire/examples/helloworld$ ./helloworld
ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)
License: Server (27000#server.accelereyes.com)
CUDA toolkit 5.0, driver 304.54
GPU0 Tesla C2075, 5376 MB, Compute 2.0
Memory Usage: 5312 MB free (5376 MB total)
Setting blob ids.
Done setting blob ids.
Segmentation fault
Thanks in advance for any help on this issue. I have really tried everything, but without success.
Disclaimer: I am the lead developer of ArrayFire. I see that you have posted on the AccelerEyes forums as well, but I am posting here to clear up some common issues with your code.
1. Do not use .device(), .host(), or .scalar() inside a gfor loop. This will cause divergences inside the GFOR loop, and GFOR was not designed for this.
2. You cannot index into a device pointer. The pointer refers to a location on the GPU; when you do device_frame_ptr[pixel_group_idx], the system looks up the equivalent position on the CPU. This is the reason for your segmentation fault.
3. Use vectorized code. For example, you don't need the inner for loop of the gfor. Instead of doing b1 = (pixels_uint32 & 0xF8000000) >> 27; inside a for loop, you can do array B1 = (frame & 0xF8000000) >> 27;. That is, instead of bringing the data back to the CPU and using a for loop, you do the entire operation on the GPU.
4. Don't use if-else or ternary operators inside GFOR. These cause divergences again. For example, use pixel_match_counter = sum(sum1 <= 16) + sum(sum2 <= 16); and frame_found(0, frame_idx) = is_found * frame_found(0, frame_idx) + (1 - is_found) * blob_id;.
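To make point 3 concrete, here is a rough, untested sketch of the search body written as whole-array operations in the spirit of points 3 and 4 (names follow the question's code; the exact operators and casts may vary between ArrayFire versions):

// operate on all pixel groups of all frames at once; drop the blob-id row
array pixels = d_db_vec(seq(number_of_uint32_for_frame), span);

// split both packed pixels of every uint32 into components, entirely on the GPU
array b1 = (pixels & 0xF8000000) >> 27;
array g1 = (pixels & 0x07E00000) >> 21;
array r1 = (pixels & 0x001F0000) >> 16;
array b2 = (pixels & 0x0000F800) >> 11;
array g2 = (pixels & 0x000007E0) >> 5;
array r2 = (pixels & 0x0000001F);

// per-element distances to the searched color
array sum1 = abs(searched_r - r1) + abs(searched_g - g1) + abs(searched_b - b1);
array sum2 = abs(searched_r - r2) + abs(searched_g - g2) + abs(searched_b - b2);

// count matching pixels per frame (sum reduces along the pixel dimension)
array matches = sum(sum1 <= 16) + sum(sum2 <= 16);
array is_found = matches > MACROBLOCK_COMPARISON_OVERALL_THRESHOLD;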
I have answered the particular problem you are facing. If you have any follow-up questions, please follow up on our forums and/or our support email. Stack Overflow is good for asking a specific question, but not for debugging your entire program.
