Element-Wise Matrix Multiplication in Rcpp

I am working on code that requires an element-wise matrix multiplication. I am trying to implement this in Rcpp, since the code involves some expensive loops. I am fairly new to Rcpp and may be missing something, but I cannot get the element-wise matrix multiplication to work.
// [[Rcpp::export]]
NumericMatrix multMat(NumericMatrix m1, NumericMatrix m2) {
  NumericMatrix multMatrix = m1 * m2; // How can this be implemented?
}
I may be missing something very trivial, and wanted to ask if there was any method to do this (other than using loops to iterate over each element and multiply).
Thanks in advance.

You probably want to use RcppArmadillo (or RcppEigen) for actual math on matrices.
R> library(RcppArmadillo)
R> cppFunction("arma::mat schur(arma::mat& a, arma::mat& b) {
+      return(a % b); }", depends="RcppArmadillo")
R> schur(matrix(1:4,2,2), matrix(4:1,2,2))
     [,1] [,2]
[1,]    4    6
[2,]    6    4
R>
Element-wise multiplication is also called the Schur (or Hadamard) product. In Armadillo, the % operator implements it; see the Armadillo docs for more.
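If you prefer a self-contained file for Rcpp::sourceCpp() over the inline cppFunction() call, the same idea can be written as a minimal sketch following the standard RcppArmadillo attribute pattern (file layout and const-ness are my choices, not from the answer above):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat schur(const arma::mat& a, const arma::mat& b) {
    return a % b;   // % is Armadillo's element-wise (Schur/Hadamard) product
}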

If you want to fake it, you can follow what's done here and use Rcpp's sugar on regular vectors, converting them to matrices as needed:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector multMat(NumericMatrix m1, NumericMatrix m2) {
  NumericVector multMatrix = m1 * m2;
  multMatrix.attr("dim") = Dimension(m1.nrow(), m1.ncol());
  return multMatrix;
}
/*** R
multMat( matrix(1:9, nrow=3), matrix(1:9, nrow=3) )
*/
But, as Dirk said, you're better off using RcppArmadillo for matrix operations.

Related

HLSL mul and D3DXMATRIX order mismatch

I'm trying to multiply the transformation matrix in the shader with vectors directly, without doing an unnecessary transposition. According to HLSL's mul documentation:
mul(x, y): Multiplies x and y using matrix math. The inner dimension x-columns and y-rows must be equal.
x [in] The x input value. If x is a vector, it is treated as a row vector.
y [in] The y input value. If y is a vector, it is treated as a column vector.
I have in the C++ code:
const D3DXMATRIX viewProjection = view * projection;
...
const D3DXMATRIX modelViewProjection = model * viewProjection;
where modelViewProjection is a row-major matrix that is copied to a constant buffer without being transposed. However, for this to work in HLSL I need to multiply the transformation matrix with the position vector as:
output.position = mul(transformation, position);
which is the opposite of what the mul documentation says.
Can someone explain where the mismatch is?
The deprecated D3DXMath library and the more modern DirectXMath both use row-major matrix order. The HLSL language defaults to column-major matrix order, as it's slightly more efficient for multiplies. Therefore, most code that sets constant buffer constants will transpose the matrix data. In almost all cases, any 'cost' of transposing the matrix here is completely hidden by all the other latencies in the system.
You can of course tell HLSL to use row-major matrix order instead, which means the HLSL mul needs to do an extra instruction on every vertex which is why it's usually worth doing the transpose on the CPU once per update instead.
See MSDN
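As an illustration of that advice, here is a rough C++ sketch of the once-per-update CPU transpose; the function name and the CopyToConstantBuffer placeholder are made up for this example:

#include <d3dx9math.h>

// Sketch: build the MVP on the CPU (row-major, D3DX convention) and
// transpose it once before upload, so the shader's default column-major
// layout reads it correctly.
void UploadTransform(const D3DXMATRIX& model,
                     const D3DXMATRIX& view,
                     const D3DXMATRIX& projection)
{
    const D3DXMATRIX modelViewProjection = model * view * projection;

    D3DXMATRIX transposed;
    D3DXMatrixTranspose(&transposed, &modelViewProjection);

    // Copy `transposed` into the constant buffer here; the shader can then
    // use mul(position, transformation) with position as a row vector,
    // matching the documented convention.
    // CopyToConstantBuffer(transposed);
}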

Robust linear interpolation

Given two segment endpoints A and B (in two dimensions), I would like to perform linear interpolation based on a value t, i.e.:
C = A + t(B-A)
In the ideal world, A, B and C should be collinear. However, we are operating with limited floating-point here, so there will be small deviations. To work around numerical issues with other operations I am using robust adaptive routines originally created by Jonathan Shewchuk. In particular, Shewchuk implements an orientation function orient2d that uses adaptive precision to exactly test the orientation of three points.
Here is my question: is there a known procedure by which the interpolation can be computed using floating-point math, so that it lies exactly on the line between A and B? Here I care less about the accuracy of the interpolation itself and more about the resulting collinearity. In other words, it's OK if C is shifted around a bit, as long as collinearity is satisfied.
The bad news
The request can't be satisfied. There are values of A and B for which NO value of t other than 0 and 1 makes lerp(A, B, t) exactly representable as a float.
A trivial example in single precision is x1 = 12345678.f and x2 = 12345679.f. Regardless of the values of y1 and y2, the required result must have an x component between 12345678.f and 12345679.f, and there's no single-precision float between these two.
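This claim is easy to check programmatically; a tiny C++ sketch that prints the next representable float after 12345678.f:

#include <cmath>
#include <cstdio>

int main() {
    // In [2^23, 2^24) the spacing between consecutive floats is exactly 1,
    // so the next float after 12345678.f is 12345679.f: nothing in between.
    float next = std::nextafterf(12345678.f, INFINITY);
    std::printf("%.1f\n", next);   // prints 12345679.0
    return 0;
}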
The (sorta) good news
The exact interpolated value, however, can be represented as the sum of 5 floating-point values (vectors in the case of 2D): one for the formula's result, one for the error in each operation [1] and one for multiplying the error by t. I'm not sure if that will be useful to you. Here's a 1D C version of the algorithm in single precision that uses fused multiply-add to calculate the product error, for simplicity:
#include <math.h>

/* 2Sum: returns a+b and the exact rounding error in *err */
float exact_sum(float a, float b, float *err)
{
    float sum = a + b;
    float z = sum - a;
    *err = a - (sum - z) + (b - z);
    return sum;
}

/* Exact product via FMA: returns a*b and the rounding error in *err */
float exact_mul(float a, float b, float *err)
{
    float prod = a * b;
    *err = fmaf(a, b, -prod);
    return prod;
}

float exact_lerp(float A, float B, float t,
                 float *err1, float *err2, float *err3, float *err4)
{
    float diff = exact_sum(B, -A, err1);   /* B - A, error in err1 */
    float prod = exact_mul(diff, t, err2); /* t*(B - A), error in err2 */
    *err1 = exact_mul(*err1, t, err4);     /* scale err1 by t, error in err4 */
    return exact_sum(A, prod, err3);       /* A + t*(B - A), error in err3 */
}
In order for this algorithm to work, operations need to conform to IEEE-754 semantics in round-to-nearest mode. That's not guaranteed by the C standard, but the GNU gcc compiler can be instructed to do so, at least in processors supporting SSE2 [2][3].
It is guaranteed that the arithmetic addition of (result + err1 + err2 + err3 + err4) will be equal to the desired result; however, there is no guarantee that the floating-point addition of these quantities will be exact.
To use the above example, exact_lerp(12345678.f, 12345679.f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns a result of 12345678.f and err1, err2, err3 and err4 are 0.0f, 0.0f, 0.300000011920928955078125f and 0.0f respectively. Indeed, the correct result is 12345678.300000011920928955078125 which can't be represented as a single-precision float.
A more convoluted example: exact_lerp(0.23456789553165435791015625f, 7.345678806304931640625f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns 2.3679010868072509765625f and the errors are 6.7055225372314453125e-08f, 8.4771045294473879039287567138671875e-08f, 1.490116119384765625e-08f and 2.66453525910037569701671600341796875e-15f. These numbers add up to the exact result, which is 2.36790125353468550173374751466326415538787841796875 and can't be exactly stored in a single-precision float.
All numbers in the examples above are written using their exact values, rather than a number that approximates to them. For example, 0.3 can't be represented exactly as a single-precision float; the closest one has an exact value of 0.300000011920928955078125 which is the one I've used.
It might be possible that if you calculate err1 + err2 + err3 + err4 + result (in that order), you get an approximation that is considered collinear in your use case. Perhaps worth a try.
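To make that suggestion concrete, a minimal usage sketch (it assumes the exact_lerp defined above is in scope, and sums the terms in the proposed order; compiled as C++ for the printf include):

#include <cstdio>

int main() {
    float err1, err2, err3, err4;
    // The second example from above; the exact result is not a float.
    float result = exact_lerp(0.23456789553165435791015625f,
                              7.345678806304931640625f,
                              0.300000011920928955078125f,
                              &err1, &err2, &err3, &err4);
    // err1 + err2 + err3 + err4 + result, in that order
    float approx = (((err1 + err2) + err3) + err4) + result;
    std::printf("result = %.10g  approx = %.10g\n", result, approx);
    return 0;
}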
References
[1] Graillat, Stef (2007). Accurate Floating Point Product and Exponentiation.
[2] Enabling strict floating point mode in GCC
[3] Semantics of Floating Point Math in GCC

Define swizzling programmatically (as in GLSL)

How would one write swizzling as defined behaviour in a programming language (swizzling members of vectors and matrices, as in GLSL)? If I wanted to make a programming language that allows defining swizzling on some members, what would be a good way to do it? For example, I could do this:
struct
{
    swizzable
    {
        float x, float y, float z, float w
    }
}
But this is missing a lot. For example, it does not define what should be returned when I swizzle more or fewer elements, assign to a subset, or take the elements backwards. In GLSL I can do v.xyz to create a Vec3 from a Vec4 called v, or I can assign to a subset of the members, in any order: v.zyx = ....
So this swizzable substruct is not a solution (or at least it is too limited). Another way would be to return an array of the swizzled members, and an implicit cast (with a constructor) would generate the wanted element:
struct Vec2
{
    swizzable { float x, float y }
    Vec2(float[2] elements)
    { x = elements[0]; y = elements[1]; }
}

struct Vec3
{
    swizzable { float x, float y, float z }
}
So if I accessed a Vec3's x and y via swizzling, I would get a float[2], and because I have a constructor for Vec2, I can assign this array to it (implicitly instantiating a Vec2).
This looks like a better solution but still: How could one do better?
Edit: Sorry, I didn't specify the question: I want to implement a programming language that supports this kind of thing.
I'm not sure how to give a good, detailed answer, so here is just one idea.
If I understand right, swizzling is mainly a syntactic convenience. The page https://www.opengl.org/wiki/GLSL_Optimizations gives the following example of swizzling:
gl_FragColor = mycolor.xyzw * constantList.xxxy + constantList.yyyx;
This could simply be syntactic shorthand for something like:
gl_FragColor = Vector(mycolor.x, mycolor.y, mycolor.z, mycolor.w)
             * Vector(constantList.x, constantList.x, constantList.x, constantList.y)
             + Vector(constantList.y, constantList.y, constantList.y, constantList.x);
So, one step may be to figure out how to parse the shorter syntax and interpret it as meaning something similar to the longer syntax.
I don't see why it would be necessary to declare the struct as anything more complicated than struct myStruct { float x, float y, float z, float w }. The language itself should be able to handle all the details of how to implement this swizzling.
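One way to see what the language would have to do: treat v.zyx as pure syntax and desugar it into a constructor call over the named components. A hand-written C++ emulation of that desugaring (the Vec3/Vec4 types and the zyx name are illustrative, not from the question):

#include <cstdio>

struct Vec3 { float x, y, z; };

struct Vec4 {
    float x, y, z, w;
    // What the compiler could generate when it sees `v.zyx`:
    Vec3 zyx() const { return Vec3{z, y, x}; }
};

int main() {
    Vec4 v{1.f, 2.f, 3.f, 4.f};
    Vec3 s = v.zyx();   // stand-in for the swizzle syntax v.zyx
    std::printf("%g %g %g\n", s.x, s.y, s.z);
    return 0;
}

A real implementation would generate such accessors (and the corresponding assignable views for v.zyx = ...) for every valid component permutation, rather than declaring them one by one.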

Multithreaded sparse matrix multiplication in Matlab

I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). N is large, as is M; on the order of several thousands. I am running Matlab 2013a.
Now, usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically.
This appears NOT to be the case if either of the matrices is sparse (see e.g. this StackOverflow discussion - with no answer for the intended question - and this largely unanswered MathWorks thread).
This is a rather unhappy surprise for me.
We can verify that multithreading has no effect for sparse matrix operations with the following code:
clc; clear all;

N = 5000;                    % set matrix sizes
M = 3000;

A = randn(N,M);              % create dense random matrices
B = sprand(N,N,0.015);       % create sparse random matrix
Bf = full(B);                % create a dense form of the otherwise sparse matrix B

for i = 1:3                  % test for 1, 2, and 4 threads
    m(i) = 2^(i-1);
    maxNumCompThreads(m(i)); % set the thread count available to Matlab
    tic                      % starts timer
    y = B*A;
    walltime(i) = toc;       % wall clock time
    speedup(i) = walltime(1)/walltime(i);
end

% display number of threads vs. speed up relative to just a single thread
[m', speedup']
This produces the following output, which illustrates that there is no difference between using 1, 2, and 4 threads for sparse operations:
threads    speedup
1.0000     1.0000
2.0000     0.9950
4.0000     1.0155
If, on the other hand, I replace B by its dense form, referred to as Bf above, I get a significant speedup:
threads    speedup
1.0000     1.0000
2.0000     1.8894
4.0000     3.4841
(illustrating that matrix operations for dense matrices in Matlab are indeed implicitly parallelized)
So, my question: is there any way at all to access a parallelized/threaded version of matrix operations for sparse matrices (in Matlab) without converting them to dense form?
I found one old suggestion involving .mex files at MathWorks, but the links seem to be dead and the approach is not well documented, with no feedback. Any alternatives?
It seems to be a rather severe restriction of the implicit-parallelism functionality, since sparse matrices abound in computationally heavy problems, and multithreaded functionality is highly desirable in these cases.
MATLAB already uses SuiteSparse by Tim Davis for many of its operations on sparse matrices (for example, see here), but none of those routines are multithreaded, I believe.
Usually computations on sparse matrices are memory-bound rather than CPU-bound. So even if you use a multithreaded library, I doubt you will see huge benefits in terms of performance; at least not comparable to those of libraries specialized in dense matrices...
After all, sparse matrices are designed with different goals in mind than regular dense matrices: efficient memory storage is often more important.
I did a quick search online, and found a few implementations out there:
sparse BLAS, spBLAS, PSBLAS; for instance, Intel MKL and AMD ACML have some support for sparse matrices
cuSPARSE, CUSP, VexCL, ViennaCL, etc., which run on the GPU
I ended up writing my own mex file with OpenMP for multithreading. Code as follows. Don't forget to use -largeArrayDims and /openmp (or -fopenmp) flags when compiling.
#include <omp.h>
#include "mex.h"
#include "matrix.h"

#define ll long long

void omp_smm(double* A, double* B, double* C, ll m, ll p, ll n,
             ll* irs, ll* jcs)
{
    for (ll j = 0; j < p; ++j)
    {
        ll istart = jcs[j];
        ll iend   = jcs[j+1];
        #pragma omp parallel for
        for (ll ii = istart; ii < iend; ++ii)
        {
            ll i = irs[ii];
            double aa = A[ii];
            for (ll k = 0; k < n; ++k)
            {
                C[i+k*m] += B[j+k*p] * aa;
            }
        }
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *A, *B, *C;    /* pointers to input & output matrices */
    size_t m, n, p;       /* matrix dimensions */

    A = mxGetPr(prhs[0]); /* first (sparse) matrix */
    B = mxGetPr(prhs[1]); /* second (full) matrix */
    mwIndex *irs = mxGetIr(prhs[0]);
    mwIndex *jcs = mxGetJc(prhs[0]);
    m = mxGetM(prhs[0]);
    p = mxGetN(prhs[0]);
    n = mxGetN(prhs[1]);

    /* create output matrix C */
    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
    C = mxGetPr(plhs[0]);

    omp_smm(A, B, C, m, p, n, (ll*)irs, (ll*)jcs);
}
On MATLAB Central the same question was asked, and this answer was given:
I believe the sparse matrix code is implemented by a few specialized TMW engineers rather than an external library like BLAS/LAPACK/LINPACK/etc...
Which basically means that you are out of luck.
However I can think of some tricks to achieve faster computations:
If you need to do several multiplications: do multiple multiplications at once and process them in parallel.
If you just want to do one multiplication: cut the matrix into pieces (for example the top half and the bottom half), do the calculations of the parts in parallel, and combine the results afterwards (see the sketch below).
Probably these solutions will not be as fast as properly implemented multithreading, but hopefully you can still get a speedup.
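For the second trick, here is a rough C++/OpenMP sketch of cutting the work into independent pieces. It reuses the CSC layout (irs/jcs) from the mex code above but parallelizes over columns of the output instead, so each thread writes only its own columns; the function name is illustrative:

#include <omp.h>

// C = A_sparse * B_dense, with A in CSC form (m x p) and B dense (p x n).
void smm_split_cols(const double* A, const double* B, double* C,
                    long long m, long long p, long long n,
                    const long long* irs, const long long* jcs)
{
    #pragma omp parallel for
    for (long long k = 0; k < n; ++k) {        // each thread owns column k of C
        for (long long j = 0; j < p; ++j) {    // walk columns of sparse A
            double b = B[j + k*p];
            for (long long ii = jcs[j]; ii < jcs[j+1]; ++ii)
                C[irs[ii] + k*m] += A[ii] * b; // no write races across threads
        }
    }
}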

Initializing a Matrix to NA in Rcpp

There is a way to initialize a NumericVector with NA values, like:
NumericVector x(10, NumericVector::get_na());
Is there any similar way to initialize a matrix to NA values?
Here is a version that does not waste memory.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix na_matrix(int n){
  NumericMatrix m(n, n);
  std::fill(m.begin(), m.end(), NumericVector::get_na());
  return m;
}
FWIW, in Rcpp11, you can use some more expressive syntax:
NumericMatrix m(n,n, NA) ;
Thanks to this constructor
Sort of.
Matrices in R really are vectors with a dimension attribute. So here is one way for a square matrix, which generalizes easily to rectangular matrices:
R> cppFunction('NumericMatrix foo2(int a) {
+     NumericVector v = NumericVector(a*a, NumericVector::get_na());
+     return NumericMatrix(a, a, v.begin());
+ }')
R> foo2(2)
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
R>
Edit: But for almost all real work I use the Armadillo matrix classes, as they are easy to use, mature, performant, and pretty feature-complete. They also have a number of constructors that fill, but (currently?) none that fills with NA. It is trivial, though, to call the fill() method on the matrix once constructed:
R> cppFunction('arma::mat foo3(int a) {
+     return arma::mat(a,a).fill(NA_REAL);
+ }', depends="RcppArmadillo")
R> foo3(2)
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
R>
