I have a data frame whose columns do not have names and I want to name them in Rcpp. How can I do that? - rcpp

I am very new to Rcpp. I have a data frame whose columns do not have names, and I want to name them in Rcpp. How can I do that? That is, the data frame is an input, and naming its columns is the first step I want to perform.
Please let me know how I can do that.

Welcome to StackOverflow. We can modify the existing example in the RcppExamples package (which you may find helpful, just as other parts of the Rcpp documentation) to show this.
In essence, we just reassign a names attribute.
Code
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List DataFrameExample(const DataFrame & DF) {
    // access each column by name
    IntegerVector a = DF["a"];
    CharacterVector b = DF["b"];
    DateVector c = DF["c"];
    // do something
    a[2] = 42;
    b[1] = "foo";
    c[0] = c[0] + 7; // move up a week
    // create a new data frame
    DataFrame NDF = DataFrame::create(Named("a")=a,
                                      Named("b")=b,
                                      Named("c")=c);
    // and reassign names
    NDF.attr("names") = CharacterVector::create("tic", "tac", "toe");
    // and return old and new in list
    return List::create(Named("origDataFrame") = DF,
                        Named("newDataFrame") = NDF);
}

/*** R
D <- data.frame(a=1:3,
                b=LETTERS[1:3],
                c=as.Date("2011-01-01")+0:2)
rl <- DataFrameExample(D)
print(rl)
*/
Demo
R> Rcpp::sourceCpp("~/git/stackoverflow/61616170/answer.cpp")
R> D <- data.frame(a=1:3,
+                  b=LETTERS[1:3],
+                  c=as.Date("2011-01-01")+0:2)
R> rl <- DataFrameExample(D)
R> print(rl)
$origDataFrame
   a   b          c
1  1   A 2011-01-08
2  2 foo 2011-01-02
3 42   C 2011-01-03

$newDataFrame
  tic tac        toe
1   1   A 2011-01-08
2   2 foo 2011-01-02
3  42   C 2011-01-03

R>
If you comment out the line that reassigns the names attribute, the new data frame keeps the old names.

Related

Extract a data.frame from a list within Rcpp

This is probably a really simple question, but I can't figure out what's wrong.
I have a list that I pass to an Rcpp function, and the first element of that list is a data.frame.
How do I get that data.frame?
bar = list(df = data.frame(A = 1:3,B=letters[1:3]),some_other_variable = 2)
foo(bar)
And the following C++ code:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector bar(Rcpp::List test){
    Rcpp::DataFrame df_test = test["df"];
    Rcpp::NumericVector result = df_test["A"];
    return result;
}
I get the following error on the line DataFrame df_test = test["df"]:
error: conversion from 'Rcpp::Vector<19>::NameProxy' {aka 'Rcpp::internal::generic_name_proxy<19, Rcpp::PreserveStorage>'} to 'Rcpp::DataFrame' {aka 'Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>'} is ambiguous
Anyone know what I'm missing? Thanks.
There may be a combination of issues going on with the instantiation and construction of List and DataFrame objects. See the (old !!) RcppExamples package for working examples.
Here is a repaired version of your code that works and does something with the vector inside the data.frame:
Code
#include <Rcpp.h>
// [[Rcpp::export]]
int bar(Rcpp::List test){
    Rcpp::DataFrame df(test["df"]);
    Rcpp::IntegerVector ivec = df["A"];
    return Rcpp::sum(ivec);
}
/*** R
zz <- list(df = data.frame(A = 1:3,B=letters[1:3]),some_other_variable = 2)
bar(zz)
*/
Demo
> Rcpp::sourceCpp("~/git/stackoverflow/70035630/answer.cpp")
> zz <- list(df = data.frame(A = 1:3,B=letters[1:3]),some_other_variable = 2)
> bar(zz)
[1] 6
>
Edit: For completeness, the assignment operator can be used with a SEXP, as in SEXP df2 = test["df"];, which can then be used to instantiate a data.frame. Template programming is difficult and not all corners are completely smoothed.

RCPP and the %*% operator, revisited

I'm trying to decide if it makes sense to implement R's %*% operator in Rcpp if my dataset is huge. But I am really having trouble getting an Rcpp implementation to work.
Here is my example R code
# remove everything in the global environment
rm(list = ls())
n_states = 4 # number of states
v_n <- c("H", "S1", "S2", "D") # the 4 states of the model:
n_t = 100 # number of transitions
# create transition matrix with random numbers. This transition matrix is constant.
m_P = matrix(runif(n_states*n_t), # insert n_states * n_t random numbers (can change this later)
             nrow = n_states,
             ncol = n_states,
             dimnames = list(v_n, v_n))
# create markov trace, what proportion of population in each state at each period (won't make sense due to random numbers but that is fine)
m_TR <- matrix(NA,
               nrow = n_t + 1 ,
               ncol = n_states,
               dimnames = list(0:n_t, v_n)) # create Markov trace (n_t + 1 because R doesn't understand Cycle 0)
# initialize Markov trace
m_TR[1, ] <- c(1, 0, 0, 0)
# run the loop
microbenchmark::microbenchmark( # function from microbenchmark library used to calculate how long this takes to run
  for (t in 1:n_t){ # throughout the number of cycles
    m_TR[t + 1, ] <- m_TR[t, ] %*% m_P # estimate the Markov trace for the next cycle (t + 1)
  }
) # end of micro-benchmark function
print(m_TR) # print the result.
And here is the replacement for the %*% operator (which doesn't seem to work correctly at all, although I can't figure out why):
library(Rcpp)
cppFunction(
'void estimate_markov(int n_t, NumericMatrix m_P, NumericMatrix m_TR )
{
    // We want to reproduce this
    // matrix_A[X+1,] <- matrix_A[X,] %*% matrix_B
    // The %*% operation behaves as follows for a vector_V %*% matrix_M
    // Here the Matrix M is populated with letters so that you can
    // more easily see how the operation is performed
    // So, a multiplication like this:
    //
    //     V          M
    //    {1}  %*%  {A D}
    //    {2}       {B E}
    //    {3}       {C F}
    //
    // Results in a vector:
    //    V_result
    //    {1*A + 1*D}
    //    {2*B + 2*E}
    //    {3*C + 3*F}
    //
    // Now use values instead of letters for M (ex: A=1, B=2, .., F=6)
    //    V_result
    //    {1*1 + 1*4}      {1 + 4}       {5}
    //    {2*2 + 2*5}  =>  {4 + 10}  =>  {14}
    //    {3*3 + 3*6}      {9 + 18}      {27}
    //
    // Note that the RHS matrix may contain any number of columns,
    // but *MUST* contain the same number of rows as the LHS vector

    // Get dimensions of matrices, and sanity check:
    // the number of elements in a vector from the LHS matrix must equal the number of rows on the RHS
    if( m_TR.cols() != m_P.rows())
        throw std::range_error("Matrix mismatch, m_P.rows != m_TR.rows");

    // we want to know these dimensions, and there is no reason to call these functions in a loop
    // store the values once
    int cnt_P_cols = m_P.cols();
    int cnt_TR_cols = m_TR.cols();
    //
    for(int Index = 1; Index <= n_t; ++Index)
    {
        // iterate over the columns in m_TR
        for(int col_iter = 0; col_iter < cnt_TR_cols; ++col_iter)
        {
            // an accumulator for the vector multiplication
            double sum = 0;
            // The new value comes from the previous row (Index-1)
            double orig_TR = m_TR(col_iter, Index-1);
            // iterate over the columns in m_P corresponding to this Index
            for(int p_iter = 0; p_iter < cnt_P_cols; ++p_iter)
            {
                // accumulate the value of this TR scalar * the m_P vector
                sum += orig_TR * m_P(p_iter, Index);
            }
            m_TR(col_iter, Index) = sum;
        }
    }
}'
)
Can someone point me to where my logic is going wrong.

How to calculate the number of neighbors of a string with exact and at most d mismatches?

Given a string and an alphabet of four letters (A, B, C, D) for generating strings of length n, I need a general mathematical formula for the number of neighbors of any string of length n with at most d mismatches, and for the number of neighbors with exactly d mismatches.
For example: given string = "AAA" and d = 3
We have 9 Strings with exactly d=1
BAA
CAA
DAA
ABA
ACA
ADA
AAB
AAC
AAD
We have 27 Strings with exactly d=2
BBA BCA BDA
BAB BAC BAD
CBA CCA CDA
CAB CAC CAD
DBA DCA DDA
DAB DAC DAD
ABB ABC ABD
ACB ACC ACD
ADB ADC ADD
We have 27 Strings with exactly d=3
BBB CBB DBB
BCB CCB DCB
BDB CDB DDB
BBC CBC DBC
BCC CCC DCC
BDC CDC DDC
BBD CBD DBD
BCD CCD DCD
BDD CDD DDD
The number of strings with at most d=3 mismatches is 9+27+27=63.
Let's consider a string of size n.
We want to know how many 'neighbors' this string has at distance d. The first thing to note is that, with your definition of 'distance', we must choose d characters among the n of the string and modify them. So there are n choose d possible sets of characters to modify.
Each of these characters can be modified in 3 different ways (since the size of the alphabet is 4).
So ultimately, we have:
n choose d possible combinations of characters that will be modified
d characters will be modified, and each of them can be modified in 3 different ways.
So the formula is ultimately (s - 1) ^ d * (n choose d), where s is the size of the alphabet (here 4). I'll let you verify that it fits the examples you provided. For at most d mismatches, sum this expression over k = 1, ..., d in place of d, which gives 9 + 27 + 27 = 63 in your example.
If you want to try it out:
#include <iostream>
#include <string>
using namespace std;

int n = 3; int d = 2;
string s = "AAA";

int counter(string curr, int index, int currd){
    if(currd == 0 || index == n){
        cout<<curr<<s.substr(index, n - index)<<endl;
        return 1;
    }
    int ans = 0;
    for(char c = 'A'; c < 'E'; c++){
        if(c != s[index]){
            ans += counter(curr + c, index + 1, currd - 1);
        }
        else{
            ans += counter(curr + c, index + 1, currd);
        }
    }
    return ans;
}

int main(){
    cout<<"answer = "<<counter("", 0, d) - 1;
}

Sort elements of a NumericMatrix by dim names

I have a NumericMatrix m. Say m is (the elements in the square brackets are the dim names)
7 9 8
4 6 5
1 3 2
with column names = {"x", "z", "y"}, row names = {"z", "y", "x"}
I want the following output
1 2 3
4 5 6
7 8 9
with column names = {"x", "y", "z"}, row names = {"x", "y", "z"}
So what I want to do is the following -
Sort elements of each row according to the column names
Permute the rows such that their corresponding row names are sorted
Is there an easy way to do this in Rcpp for a general NumericMatrix?
This isn't necessarily the simplest approach, but it appears to work:
#include <Rcpp.h>
#include <map>
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
Rcpp::NumericMatrix dim_sort(const Rcpp::NumericMatrix& m) {
    Rcpp::Function rownames("rownames");
    Rcpp::Function colnames("colnames");
    Rcpp::CharacterVector rn = rownames(m);
    Rcpp::CharacterVector cn = colnames(m);
    Rcpp::NumericMatrix result(Rcpp::clone(m));
    Rcpp::CharacterVector srn(Rcpp::clone(rn));
    Rcpp::CharacterVector scn(Rcpp::clone(cn));
    // map each dim name to its original position; the map keeps keys sorted
    std::map<std::string, int> row_map;
    std::map<std::string, int> col_map;
    for (int i = 0; i < rn.size(); i++) {
        row_map.insert(std::pair<std::string, int>(Rcpp::as<std::string>(rn[i]), i));
    }
    // separate loop so that non-square matrices work too
    for (int j = 0; j < cn.size(); j++) {
        col_map.insert(std::pair<std::string, int>(Rcpp::as<std::string>(cn[j]), j));
    }
    typedef std::map<std::string, int>::const_iterator cit;
    int I = 0;  // row position in the sorted result
    for (cit rm_it = row_map.begin(); rm_it != row_map.end(); ++rm_it) {
        int i = rm_it->second;  // original row
        srn[I] = rm_it->first;
        int J = 0;  // column position in the sorted result
        for (cit cm_it = col_map.begin(); cm_it != col_map.end(); ++cm_it) {
            int j = cm_it->second;  // original column
            result(I, J) = m(i, j);
            scn[J] = cm_it->first;
            J++;
        }
        I++;
    }
    result.attr("dimnames") = Rcpp::List::create(srn, scn);
    return result;
}

/*** R
x <- matrix(
  c(7,9,8,4,6,5,1,3,2),
  nrow = 3,
  dimnames = list(
    c("z", "y", "x"),
    c("x", "z", "y")
  ),
  byrow = TRUE
)
R> x
  x z y
z 7 9 8
y 4 6 5
x 1 3 2
R> dim_sort(x)
  x y z
x 1 2 3
y 4 5 6
z 7 8 9
*/
I used a std::map<std::string, int> for two reasons:
maps automatically maintain a sorted order based on their keys, so by using the dim names as keys, the container does the sorting for us.
Letting a key's corresponding value be an integer representing the order in which it was added, we have an index for retrieving the appropriate value along a given dimension.

Indexing using input matrix RcppArmadillo

I have two vectors: one is an output by group, and the second is an index giving the group each observation belongs to. In practice, it is something like this:
mean_group = 1, 2, 3
group_id = 1,1,3,2,3,2
I would like to assign to each id the value of its group. In base R, I would just do mean_group[group_id].
I have to avoid using a loop; otherwise there would be no point in using Armadillo. Is there a way to do that?
Thanks in advance
I am not sure how hard you tried to find this in the Armadillo documentation, but this works out of the box in Armadillo. Try the following as file armaind.cpp:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::vec subsetter(arma::vec big, arma::uvec ind) {
    arma::vec small = big.elem( ind );
    return small;
}
/*** R
big <- 2*(1:10)
ind <- c(3,5,7)
subsetter(big, ind)
*/
which gets you
R> Rcpp::sourceCpp("/tmp/armaind.cpp")
R> big <- 2*(1:10)
R> ind <- c(3,5,7)
R> subsetter(big, ind)
     [,1]
[1,]    8
[2,]   12
[3,]   16
R>
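Note the off-by-one indexing difference between R and C++: the indices 3, 5, 7 are treated as zero-based here, so they select the 4th, 6th and 8th elements (8, 12, 16) rather than the 3rd, 5th and 7th that big[ind] would give in R.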
