In Rcpp, how to create a NumericMatrix from a NumericVector?

In Rcpp, how do I create a NumericMatrix from a NumericVector?
Something like
// vector_1 has 16 elements
NumericMatrix mat = NumericMatrix(vector_1, nrow = 4);
Thanks.

Edit: I knew we had something better. See below for update.
Looks like we do not have a matching convenience constructor for this. But you can just drop in a helper function -- the following is minimally viable (one should check that n * k == length(vector)) and taken from one of the unit tests:
// [[Rcpp::export]]
Rcpp::NumericMatrix vec2mat(Rcpp::NumericVector vec, int n, int k) {
    Rcpp::NumericMatrix mat = Rcpp::no_init(n, k);
    for (auto i = 0; i < n * k; i++) mat[i] = vec[i];
    return mat;
}
Another constructor takes the explicit dimensions and then copies the payload for you (via memcpy()), removing the need for the loop:
// [[Rcpp::export]]
Rcpp::NumericMatrix vec2mat2(Rcpp::NumericVector s, int n, int k) {
    Rcpp::NumericMatrix mat(n, k, s.begin());
    return mat;
}
Full example below:
> Rcpp::sourceCpp("~/git/stackoverflow/66720922/answer.cpp")
> v <- (1:9) * 1.0 # numeric
> vec2mat(v, 3, 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> vec2mat2(v, 3, 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
>
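The output above is filled column by column because R stores matrices in column-major order. As a minimal sketch of that layout and of the size check mentioned earlier, here is a standalone C++ version that does not use Rcpp; ColMajorMatrix and vec_to_matrix are hypothetical names made up for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a numeric matrix: a flat buffer in
// column-major order, which is how R lays out matrix data.
struct ColMajorMatrix {
    std::size_t nrow, ncol;
    std::vector<double> data;

    // Element (i, j) lives at offset i + j * nrow.
    double at(std::size_t i, std::size_t j) const { return data[i + j * nrow]; }
};

// Reshape a vector into an nrow x ncol matrix, rejecting mismatched sizes.
ColMajorMatrix vec_to_matrix(const std::vector<double>& v,
                             std::size_t nrow, std::size_t ncol) {
    if (nrow * ncol != v.size())
        throw std::invalid_argument("nrow * ncol must equal the vector length");
    return ColMajorMatrix{nrow, ncol, v};
}
```

With the values 1 to 9 and a 3 x 3 shape, at(0, 1) reads offset 3 and so holds 4, matching the [1,2] entry printed above.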
Full source code below.
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix vec2mat(Rcpp::NumericVector vec, int n, int k) {
    Rcpp::NumericMatrix mat = Rcpp::no_init(n, k);
    for (auto i = 0; i < n * k; i++) mat[i] = vec[i];
    return mat;
}

// [[Rcpp::export]]
Rcpp::NumericMatrix vec2mat2(Rcpp::NumericVector s, int n, int k) {
    Rcpp::NumericMatrix mat(n, k, s.begin());
    return mat;
}

/*** R
v <- (1:9) * 1.0 # numeric
vec2mat(v, 3, 3)
vec2mat2(v, 3, 3)
*/
Depending on what you want to do with the matrix object (linear algebra?) you may want to consider RcppArmadillo (or RcppEigen) as those packages also have plenty of vector/matrix converters.

Related

Create NumericMatrix from NumericVector

Is there a way to create NumericMatrix from NumericVectors? Something like this:
Rcpp::cppFunction("NumericMatrix f(){
    NumericVector A(10, 2.0);
    NumericVector B(10, 1.0);
    return NumericMatrix(10,2,C(A,B)); //ERROR
}")
> f()
Sure. There is for example cbind.
Code
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix makeMatrix(Rcpp::NumericVector a, Rcpp::NumericVector b) {
    return Rcpp::cbind(a, b);
}

/*** R
a <- c(1,2,3)
b <- c(3,2,1)
makeMatrix(a,b)
*/
Output
> Rcpp::sourceCpp("~/git/stackoverflow/65538515/answer.cpp")
> a <- c(1,2,3)
> b <- c(3,2,1)
> makeMatrix(a,b)
     [,1] [,2]
[1,]    1    3
[2,]    2    2
[3,]    3    1
>
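Because matrix data is stored column by column, binding two equal-length columns amounts to placing one buffer after the other, which is presumably what the C(A,B) in the question was reaching for. The following is a standalone sketch of that idea only, not a description of how Rcpp::cbind is implemented; bind_columns is a hypothetical helper.

```cpp
#include <stdexcept>
#include <vector>

// Bind two equal-length columns into one column-major buffer
// representing a matrix with a.size() rows and 2 columns.
std::vector<double> bind_columns(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    if (a.size() != b.size())
        throw std::invalid_argument("columns must have the same length");
    std::vector<double> out;
    out.reserve(a.size() + b.size());
    out.insert(out.end(), a.begin(), a.end());  // column 0
    out.insert(out.end(), b.begin(), b.end());  // column 1
    return out;
}
```

A buffer built this way could then be handed to a dimensions-plus-iterator matrix constructor such as the one used in vec2mat2 above.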

Quickly determine the approximate maximum of an integer vector

I'd like to use the fact that pmax(x, 0) = (x + abs(x)) / 2 on an integer vector using Rcpp for performance.
I've written a naive implementation:
IntegerVector do_pmax0_abs_int(IntegerVector x) {
    R_xlen_t n = x.length();
    IntegerVector out(clone(x));
    for (R_xlen_t i = 0; i < n; ++i) {
        int oi = out[i];
        out[i] += abs(oi);
        out[i] /= 2;
    }
    return out;
}
which is indeed performant; however, it invokes undefined behaviour should x contain any element larger than .Machine$integer.max / 2.
Is there a way to quickly determine whether or not every element of the vector is less than .Machine$integer.max / 2? I considered bit-shifting, but this would not be valid for negative numbers.
As mentioned in the comments, you can make use of int64_t for intermediate results. In addition, it makes sense not to copy x to out, and not to initialize out to zero everywhere:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector do_pmax0_abs_int(IntegerVector x) {
    R_xlen_t n = x.length();
    IntegerVector out(clone(x));
    for (R_xlen_t i = 0; i < n; ++i) {
        int oi = out[i];
        out[i] += abs(oi);
        out[i] /= 2;
    }
    return out;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector do_pmax0_abs_int64(IntegerVector x) {
    R_xlen_t n = x.length();
    IntegerVector out = no_init(n);
    for (R_xlen_t i = 0; i < n; ++i) {
        int64_t oi = x[i];
        oi += std::abs(oi);
        out[i] = static_cast<int>(oi / 2);
    }
    return out;
}

/***R
ints <- as.integer(sample.int(.Machine$integer.max, 1e6) - 2^30)
bench::mark(do_pmax0_abs_int(ints),
            do_pmax0_abs_int64(ints),
            pmax(ints, 0))[, 1:5]
ints <- 2L * ints
bench::mark(#do_pmax0_abs_int(ints),
            do_pmax0_abs_int64(ints),
            pmax(ints, 0))[, 1:5]
*/
Result:
> Rcpp::sourceCpp('57310889/code.cpp')
> ints <- as.integer(sample.int(.Machine$integer.max, 1e6) - 2^30)
> bench::mark(do_pmax0_abs_int(ints),
+             do_pmax0_abs_int64(ints),
+             pmax(ints, 0))[, 1:5]
# A tibble: 3 x 5
  expression                    min   median `itr/sec` mem_alloc
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 do_pmax0_abs_int(ints)     1.91ms   3.31ms     317.     3.82MB
2 do_pmax0_abs_int64(ints)   1.28ms   2.67ms     432.     3.82MB
3 pmax(ints, 0)              9.85ms  10.68ms      86.9   15.26MB
> ints <- 2L * ints
> bench::mark(#do_pmax0_abs_int(ints),
+             do_pmax0_abs_int64(ints),
+             pmax(ints, 0))[, 1:5]
# A tibble: 2 x 5
  expression                    min   median `itr/sec` mem_alloc
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 do_pmax0_abs_int64(ints)   1.28ms   2.52ms     439.     3.82MB
2 pmax(ints, 0)              9.88ms  10.83ms      89.5   15.26MB
Notes:
Without no_init the two C++ methods are equally fast.
I have removed the original method from the second benchmark, since bench::mark compares the results by default, and the original method produces wrong results for that particular input.
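The widening trick can also be tried outside R. The sketch below is standalone C++ on std::vector, assuming a 32-bit int; pmax0_wide and fits_32bit_formula are hypothetical names. The second function is one possible answer to the "can I check first?" part of the question: a single linear scan. It also rejects INT_MIN, which R uses as NA_integer_ and whose absolute value does not fit in an int.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// max(x, 0) via (x + |x|) / 2, computed in 64 bits so that
// x + |x| cannot overflow for any 32-bit input.
std::vector<int> pmax0_wide(const std::vector<int>& x) {
    std::vector<int> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        std::int64_t xi = x[i];                  // widen before any arithmetic
        std::int64_t sum = xi + std::llabs(xi);  // at most 2 * INT_MAX
        out[i] = static_cast<int>(sum / 2);      // back in [0, INT_MAX]
    }
    return out;
}

// True when the all-int formula would be safe: no element exceeds
// INT_MAX / 2 and none equals INT_MIN.
bool fits_32bit_formula(const std::vector<int>& x) {
    return std::all_of(x.begin(), x.end(), [](int v) {
        return v <= std::numeric_limits<int>::max() / 2 &&
               v != std::numeric_limits<int>::min();
    });
}
```

Since the pre-check is itself a full pass over the data, simply using the 64-bit intermediate is likely the cheaper option, which is in line with the benchmark above.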

Rcpp and "optional" arguments on functions

I created a cumsum function in an R package with Rcpp which will cumulatively sum a vector until it hits the user-defined ceiling or floor. However, if one wants the cumsum to be bounded above, the user must still specify a floor.
Example:
a = c(1, 1, 1, 1, 1, 1, 1)
If I wanted to cumsum a and have an upper bound of 3, I could call cumsum_bounded(a, lower = 1, upper = 3). I would rather not have to specify the lower bound.
My code:
#include <Rcpp.h>
#include <float.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector cumsum_bounded(NumericVector x, int upper, int lower) {
    NumericVector res(x.size());
    double acc = 0;
    for (int i=0; i < x.size(); ++i) {
        acc += x[i];
        if (acc < lower) acc = lower;
        else if (acc > upper) acc = upper;
        res[i] = acc;
    }
    return res;
}
What I would like:
#include <Rcpp.h>
#include <float.h>
#include <cmath>
#include <climits> //for LLONG_MIN and LLONG_MAX
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector cumsum_bounded(NumericVector x, long long int upper = LLONG_MAX, long long int lower = LLONG_MIN) {
    NumericVector res(x.size());
    double acc = 0;
    for (int i=0; i < x.size(); ++i) {
        acc += x[i];
        if (acc < lower) acc = lower;
        else if (acc > upper) acc = upper;
        res[i] = acc;
    }
    return res;
}
In short, yes, it's possible, but it requires some finesse: either create an intermediary function or embed the logic that sorts out which arguments were supplied within the main function.
In long, Rcpp attributes support only a limited set of default argument values. These values are listed in the Rcpp FAQ 3.12 entry:
String literals delimited by quotes (e.g. "foo")
Integer and decimal numeric values (e.g. 10 or 4.5)
Pre-defined constants including:
    Booleans: true and false
    Null values: R_NilValue, NA_STRING, NA_INTEGER, NA_REAL, and NA_LOGICAL
Selected vector types (CharacterVector, IntegerVector, and NumericVector) instantiated using the empty form of the ::create static member function
Matrix types (CharacterMatrix, IntegerMatrix, and NumericMatrix) instantiated using the rows, cols constructor Rcpp::Matrix n(rows, cols)
If you were to specify numerical values for LLONG_MAX and LLONG_MIN, this would meet the criteria to use Rcpp attributes directly on the function. However, these values are implementation-specific, so it would not be ideal to hardcode them. Instead, we have to seek an outside solution: the Rcpp::Nullable<T> class, which enables a default value of NULL. The parameter type has to be wrapped in Rcpp::Nullable<T> because NULL is very special and can cause heartache if handled carelessly.
Unlike any value on the real number line, NULL can never be a legitimate bound, which makes it the perfect candidate to signal "no bound supplied" in the function call. There are two choices you then have to make: use Rcpp::Nullable<T> as the parameters on the main function, or create a "logic" helper function that has the correct parameters and can be used elsewhere within your application without worry. I've opted for the latter below.
#include <Rcpp.h>
#include <float.h>
#include <cmath>
#include <climits> //for LLONG_MIN and LLONG_MAX
using namespace Rcpp;

NumericVector cumsum_bounded_logic(NumericVector x,
                                   long long int upper = LLONG_MAX,
                                   long long int lower = LLONG_MIN) {
    NumericVector res(x.size());
    double acc = 0;
    for (int i=0; i < x.size(); ++i) {
        acc += x[i];
        if (acc < lower) acc = lower;
        else if (acc > upper) acc = upper;
        res[i] = acc;
    }
    return res;
}

// [[Rcpp::export]]
NumericVector cumsum_bounded(NumericVector x,
                             Rcpp::Nullable<long long int> upper = R_NilValue,
                             Rcpp::Nullable<long long int> lower = R_NilValue) {
    if(upper.isNotNull() && lower.isNotNull()){
        return cumsum_bounded_logic(x, Rcpp::as< long long int >(upper), Rcpp::as< long long int >(lower));
    } else if(upper.isNull() && lower.isNotNull()){
        return cumsum_bounded_logic(x, LLONG_MAX, Rcpp::as< long long int >(lower));
    } else if(upper.isNotNull() && lower.isNull()) {
        return cumsum_bounded_logic(x, Rcpp::as< long long int >(upper), LLONG_MIN);
    } else {
        return cumsum_bounded_logic(x, LLONG_MAX, LLONG_MIN);
    }
    // Required to quiet compiler
    return x;
}
Test Output
cumsum_bounded(a, 5)
## [1] 1 2 3 4 5 5 5
cumsum_bounded(a, 5, 2)
## [1] 2 3 4 5 5 5 5
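The four-way branch can be avoided by resolving each bound on its own. As a rough standalone analogy (plain C++, no Rcpp), the sketch below uses a null pointer in the role that NULL plays above, and swaps in double bounds with infinity as the "unbounded" value instead of LLONG_MAX and LLONG_MIN; cumsum_bounded_sketch is a hypothetical name.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Cumulative sum clamped to [lower, upper]. A null pointer means
// "no bound on that side", mirroring the NULL default above.
std::vector<double> cumsum_bounded_sketch(const std::vector<double>& x,
                                          const double* upper = nullptr,
                                          const double* lower = nullptr) {
    const double inf = std::numeric_limits<double>::infinity();
    // Resolve each bound independently instead of enumerating
    // all four supplied/missing combinations.
    const double hi = upper ? *upper : inf;
    const double lo = lower ? *lower : -inf;

    std::vector<double> res(x.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += x[i];
        if (acc < lo) acc = lo;
        else if (acc > hi) acc = hi;
        res[i] = acc;
    }
    return res;
}
```

The same two-line resolution step should carry over to the Rcpp version by testing upper.isNotNull() and lower.isNotNull() separately before a single call to cumsum_bounded_logic.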

subset NumericMatrix by row and column names in Rcpp

I am trying to create a function in Rcpp that will take as input a pairwise numeric matrix, as well as a list of vectors, each element being a subset of row/column names. I would like this function to identify the subset of the matrix that matches those names, and return the mean of the values.
Below I generated some dummy data that resembles the sort of data I have, and follow with an attempt of a Rcpp function.
library(Rcpp)
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
Here I have a list made up of character vectors of various subsets of the row/column names in pdist
subsetList <- replicate(10, sample(names(dat), 4), simplify=FALSE)
For each of these sets of names, I would like to identify the subset of the pairwise matrix and take the mean of the values
Here is what I have so far, which does not work, but I think it illustrates where I am trying to get.
cppFunction('
List meanDistByCell(List input, NumericMatrix pairmat) {
    int n = input.size();
    List out(n);
    List dimnames = pairmat.attr( "dimnames" );
    CharacterVector colnames = dimnames[1];
    for (int i = 0; i < n; i++) {
        CharacterVector sp = as< CharacterVector >(input[i]);
        if (sp.size() > 0) {
            out[i] = double(mean(pairmat(sp, sp)));
        } else {
            out[i] = NA_REAL;
        }
    }
    return out;
}
')
Any help would be greatly appreciated! Thanks!
Although (contiguous) range-based subsetting is available (e.g. x(Range(first_row, last_row), Range(first_col, last_col))), as coatless pointed out, subsetting by CharacterVector is not currently supported, so you will have to roll your own for the time being. A general-ish approach might look something like this:
template <int RTYPE> inline Matrix<RTYPE>
Subset2D(const Matrix<RTYPE>& x, CharacterVector crows, CharacterVector ccols) {
    R_xlen_t i = 0, j = 0, rr = crows.length(), rc = ccols.length(), pos;
    Matrix<RTYPE> res(rr, rc);
    CharacterVector xrows = rownames(x), xcols = colnames(x);
    IntegerVector rows = match(crows, xrows), cols = match(ccols, xcols);
    for (; j < rc; j++) {
        // NB: match returns 1-based indices
        pos = cols[j] - 1;
        for (i = 0; i < rr; i++) {
            res(i, j) = x(rows[i] - 1, pos);
        }
    }
    rownames(res) = crows;
    colnames(res) = ccols;
    return res;
}

// [[Rcpp::export]]
NumericMatrix subset2d(NumericMatrix x, CharacterVector rows, CharacterVector cols) {
    return Subset2D(x, rows, cols);
}
This assumes that the input matrix has both row and column names, and that the row and column lookup vectors are valid subsets of those dimnames; additional defensive code could be added to make this more robust. To demonstrate,
subset2d(pdist, subsetList[[1]], subsetList[[1]])
#     spB spD spE spC
# spB  NA  NA  NA  NA
# spD   9  NA  NA   7
# spE   5   4  NA   3
# spC   2  NA  NA  NA
pdist[subsetList[[1]], subsetList[[1]]]
#     spB spD spE spC
# spB  NA  NA  NA  NA
# spD   9  NA  NA   7
# spE   5   4  NA   3
# spC   2  NA  NA  NA
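As for the defensive code mentioned above, the main hazard is a lookup name that is not among the dimnames, since an unmatched name would otherwise turn into an invalid index. A minimal standalone sketch of a checked name lookup, written in plain C++ rather than with Rcpp's match, might look like this; match_names is a hypothetical helper that returns 0-based positions directly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Return the 0-based position of each key within `names`;
// throw if any key is absent instead of yielding a bad index.
std::vector<std::size_t> match_names(const std::vector<std::string>& keys,
                                     const std::vector<std::string>& names) {
    std::unordered_map<std::string, std::size_t> pos;
    for (std::size_t i = 0; i < names.size(); ++i)
        pos.emplace(names[i], i);  // emplace keeps the first of any duplicates

    std::vector<std::size_t> idx;
    idx.reserve(keys.size());
    for (const std::string& k : keys) {
        auto it = pos.find(k);
        if (it == pos.end())
            throw std::out_of_range("name not found: " + k);
        idx.push_back(it->second);
    }
    return idx;
}
```

In the Rcpp code the equivalent guard would be a check on the result of match before subtracting 1, for example by stopping if any entry is NA.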
Subset2D takes care of most of the boilerplate involved in implementing meanDistByCell; all that remains is to loop over the input list, apply this to each list element, and store the mean of the result in the output list:
// [[Rcpp::export]]
List meanDistByCell(List keys, NumericMatrix x, bool na_rm = false) {
    R_xlen_t i = 0, sz = keys.size();
    List res(sz);
    if (!na_rm) {
        for (; i < sz; i++) {
            res[i] = NumericVector::create(
                mean(Subset2D(x, keys[i], keys[i]))
            );
        }
    } else {
        for (; i < sz; i++) {
            res[i] = NumericVector::create(
                mean(na_omit(Subset2D(x, keys[i], keys[i])))
            );
        }
    }
    return res;
}
all.equal(
    lapply(subsetList, function(x) mean(pdist[x, x], na.rm = TRUE)),
    meanDistByCell(subsetList, pdist, TRUE)
)
# [1] TRUE
Although the use of Subset2D allows for a much cleaner implementation of meanDistByCell, in this situation it is inefficient for at least a couple of reasons:
It sets the dimnames of the return object (rownames(res) = crows;, colnames(res) = ccols;), which you have no need for here.
It makes a call to match to obtain indices for each of rownames and colnames, which is unnecessary since you know in advance that rownames(x) == colnames(x).
You will incur the cost of both of these points k times, for an input list with length k.
A more efficient -- but consequently less concise -- approach would be to essentially implement only the aspects of Subset2D that are needed, inline inside of meanDistByCell:
// [[Rcpp::export]]
List meanDistByCell2(List keys, NumericMatrix x, bool na_rm = false) {
    R_xlen_t k = 0, sz = keys.size(), i = 0, j = 0, nidx, pos;
    List res(sz);
    CharacterVector cx = colnames(x);
    if (!na_rm) {
        for (; k < sz; k++) {
            // NB: match returns 1-based indices
            IntegerVector idx = match(as<CharacterVector>(keys[k]), cx) - 1;
            nidx = idx.size();
            NumericVector tmp(nidx * nidx);
            for (j = 0; j < nidx; j++) {
                pos = idx[j];
                for (i = 0; i < nidx; i++) {
                    tmp[nidx * j + i] = x(idx[i], pos);
                }
            }
            res[k] = NumericVector::create(mean(tmp));
        }
    } else {
        for (; k < sz; k++) {
            IntegerVector idx = match(as<CharacterVector>(keys[k]), cx) - 1;
            nidx = idx.size();
            NumericVector tmp(nidx * nidx);
            for (j = 0; j < nidx; j++) {
                pos = idx[j];
                for (i = 0; i < nidx; i++) {
                    tmp[nidx * j + i] = x(idx[i], pos);
                }
            }
            res[k] = NumericVector::create(mean(na_omit(tmp)));
        }
    }
    return res;
}
all.equal(
    meanDistByCell(subsetList, pdist, TRUE),
    meanDistByCell2(subsetList, pdist, TRUE)
)
# [1] TRUE
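If the temporary vector is also a concern, the mean can be accumulated directly while walking the selected cells. The following standalone sketch assumes a square matrix stored column-major in a std::vector and 0-based indices that are already validated; mean_submatrix is a hypothetical name, and because it uses a single-pass sum its rounding may differ slightly from R's mean.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Mean of x[idx, idx] for a column-major nrow x nrow matrix,
// optionally skipping NaN entries, without building a temporary.
double mean_submatrix(const std::vector<double>& x, std::size_t nrow,
                      const std::vector<std::size_t>& idx, bool na_rm) {
    double sum = 0.0;
    std::size_t count = 0;
    for (std::size_t j : idx) {        // columns of the subset
        for (std::size_t i : idx) {    // rows of the subset
            double v = x[i + j * nrow];
            if (na_rm && std::isnan(v)) continue;
            sum += v;
            ++count;
        }
    }
    // No usable cells: report NaN rather than dividing by zero.
    return count ? sum / count : std::numeric_limits<double>::quiet_NaN();
}
```

With na_rm set to false, any NaN among the selected cells propagates into the result, which matches the behaviour of the two branches above.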

Sort elements of a NumericMatrix by dim names

I have a NumericMatrix m. Say m is (the elements in the square brackets are the dim names)
7 9 8
4 6 5
1 3 2
with column names = {"x", "z", "y"}, row names = {"z", "y", "x"}
I want the following output
1 2 3
4 5 6
7 8 9
with column names = {"x", "y", "z"}, row names = {"x", "y", "z"}
So what I want to do is the following -
Sort elements of each row according to the column names
Permute the rows such that their corresponding row names are sorted
Is there an easy way to do this in Rcpp for a general NumericMatrix?
This isn't necessarily the simplest approach, but it appears to work:
#include <Rcpp.h>
#include <map>
#include <string>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
Rcpp::NumericMatrix dim_sort(const Rcpp::NumericMatrix& m) {
    Rcpp::Function rownames("rownames");
    Rcpp::Function colnames("colnames");
    Rcpp::CharacterVector rn = rownames(m);
    Rcpp::CharacterVector cn = colnames(m);
    Rcpp::NumericMatrix result(Rcpp::clone(m));
    Rcpp::CharacterVector srn(Rcpp::clone(rn));
    Rcpp::CharacterVector scn(Rcpp::clone(cn));
    // Map each dim name to its original position; std::map iterates in sorted key order.
    std::map<std::string, int> row_map;
    std::map<std::string, int> col_map;
    for (int i = 0; i < rn.size(); i++) {
        row_map.insert(std::pair<std::string, int>(Rcpp::as<std::string>(rn[i]), i));
    }
    for (int j = 0; j < cn.size(); j++) {
        col_map.insert(std::pair<std::string, int>(Rcpp::as<std::string>(cn[j]), j));
    }
    typedef std::map<std::string, int>::const_iterator cit;
    cit cm_it = col_map.begin();
    int J = 0;  // destination column
    for (; cm_it != col_map.end(); ++cm_it) {
        int I = 0;  // destination row
        int j = cm_it->second;  // source column
        scn[J] = cm_it->first;
        cit rm_it = row_map.begin();
        for (; rm_it != row_map.end(); ++rm_it) {
            int i = rm_it->second;  // source row
            result(I, J) = m(i, j);
            srn[I] = rm_it->first;
            I++;
        }
        J++;
    }
    result.attr("dimnames") = Rcpp::List::create(srn, scn);
    return result;
}
/*** R
x <- matrix(
    c(7,9,8,4,6,5,1,3,2),
    nrow = 3,
    dimnames = list(
        c("z", "y", "x"),
        c("x", "z", "y")
    ),
    byrow = TRUE
)
x
dim_sort(x)
*/
R> x
  x z y
z 7 9 8
y 4 6 5
x 1 3 2
R> dim_sort(x)
  x y z
x 1 2 3
y 4 5 6
z 7 8 9
I used a std::map<std::string, int> for two reasons:
maps automatically maintain a sorted order based on their keys, so by using the dim names as keys, the container does the sorting for us.
Letting a key's corresponding value be an integer representing the order in which it was added, we have an index for retrieving the appropriate value along a given dimension.
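Those two properties can be seen in isolation with a small standalone sketch: inserting (name, original index) pairs into a std::map and reading the values back yields the permutation that sorts the names. The helper name sorted_order is hypothetical, and the sketch uses only the standard library.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Return the original indices of `names`, visited in sorted-name order.
std::vector<int> sorted_order(const std::vector<std::string>& names) {
    std::map<std::string, int> index_of;
    for (int i = 0; i < static_cast<int>(names.size()); ++i)
        index_of.insert(std::make_pair(names[i], i));  // duplicates keep their first index

    std::vector<int> order;
    order.reserve(index_of.size());
    for (std::map<std::string, int>::const_iterator it = index_of.begin();
         it != index_of.end(); ++it)
        order.push_back(it->second);  // std::map iterates in ascending key order
    return order;
}
```

For the row names {"z", "y", "x"} this gives the order 2, 1, 0, so the row labelled "x" (original position 2) comes first in the sorted result. Note that duplicate names collapse to a single entry, so this approach assumes the dim names are unique.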
