Access dimension names from sp_mat in Rcpp Armadillo - rcpp

I'm brand new to Rcpp and trying to determine how to access the dimension names of an input so that I can use them later in the script. Specifically, I'm trying to grab the column names off of a sparse matrix in Armadillo and use them to name the rows in a separate object.
An example to clarify:
Let's start by generating a trivial sparse matrix.
input_mat <- Matrix::Matrix(sample(c(0,1), 35, replace =T)
,nrow = 5
,ncol = 7
,dimnames = list(LETTERS[1:5], letters[1:7]))
Next, let's use that to do something in Rcpp. We will output a numeric matrix filled with some random numbers. nrow of the output = ncol of the input.
cppFunction('NumericMatrix map_columns(arma::sp_mat x, int k) {
int n = x.n_cols;
NumericMatrix new_mat = NumericMatrix(n, k);
for(int i = 0; i < n; i++) {
for(int j = 0; j < k; j++) {
new_mat(i,j) = rand() % 100 + 1;
}
}
rownames(new_mat) = CharacterVector::create("a", "b", "c", "d", "e", "f", "g");
return(new_mat);
}', depends = "RcppArmadillo"
)
map_columns(input_mat, 4)
Instead of manually specifying the rownames of new_mat, I want to grab the colnames of x and assign the names on the fly. I've tried accessing slot names of the sparse matrix and have tried to assign them the same way I would in R, but no luck.
I'm guessing that I'm making a simple newbie mistake. Can someone help me solve this? Any assistance will be greatly appreciated.

I do not know of a way to access S4 slots after the conversion to an Armadillo object. However, you can pass the sparse matrix as an S4 object to the function and handle the conversion explicitly:
input_mat <- Matrix::rsparsematrix(5, 7, 0.2)
input_mat@Dimnames <- list(LETTERS[1:5], letters[1:7])
Rcpp::cppFunction('NumericMatrix map_columns(Rcpp::S4 y, int k) {
arma::sp_mat x = Rcpp::as<arma::sp_mat>(y);
int n = x.n_cols;
NumericMatrix new_mat = NumericMatrix(n, k);
for(int i = 0; i < n; i++) {
for(int j = 0; j < k; j++) {
new_mat(i,j) = rand() % 100 + 1;
}
}
Rcpp::List dimnames = y.slot("Dimnames");
Rcpp::CharacterVector colnames = dimnames[1];
rownames(new_mat) = colnames;
return(new_mat);
}', depends = "RcppArmadillo"
)
map_columns(input_mat, 4)
Note that I am creating a sparse matrix instead of a dense matrix found in your example code.
Side note: Don't use rand(). Use R's RNG, <random> from C++11 or ...
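As a rough illustration of the <random> option mentioned in the side note, here is a minimal plain C++ sketch that fills an n-by-k table with integers from 1 to 100. The function name random_matrix, the nested-vector layout and the explicit seed parameter are assumptions made for this example; they are not part of the answer above.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Fill an n-by-k table with integers drawn uniformly from [1, 100],
// using the C++11 <random> facilities instead of rand().
// (random_matrix and its seed parameter are hypothetical, for illustration.)
std::vector<std::vector<int>> random_matrix(int n, int k, unsigned seed) {
    assert(n >= 0 && k >= 0);
    std::mt19937 gen(seed);                           // Mersenne Twister engine
    std::uniform_int_distribution<int> dist(1, 100);  // both bounds inclusive
    std::vector<std::vector<int>> m(n, std::vector<int>(k));
    for (auto& row : m)
        for (auto& cell : row)
            cell = dist(gen);
    return m;
}
```

Unlike rand() % 100 + 1, the distribution object avoids modulo bias, and the same seed reproduces the same table.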

Related

Is it possible to parallelize or unroll this loop?

I am trying to see if I can improve the performance of the following loop in C++, which uses two dimensional vectors (_external and _Table) and has a carried loop dependency on the previous iteration. Additionally, it has a calculated index accessor in the innermost loop that will make the access of _Table non sequential on the right hand side.
int N = 8000;
int M = 400;
int P = 100;
for(int i = 1; i <= N; i++){
for(int j = 0; j < M; j++){
for(int k =0; k < P; k++){
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
}
}
}
What can I do to improve the performance of a loop like this?
Well it looks to me like the order in which these statements:
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
are executed is critical to correctness. (That is, if the iteration order for i, j, k changes, then the results will be different ... and incorrect.)
So I think you are only left with micro-optimizations, like hoisting the expressions _Table.at(j).at(i) and _external.at(j) out of the innermost loop.
Consider this:
for(int k =0; k < P; k++){
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
}
This loop is repeatedly adding numbers to _Table.at(j).at(i). Since (by inspection) _Table.at(index).at(i-1) must be reading from a different cell of the table (because of i-1 versus i), you could do this:
int temp = 0;
for(int k =0; k < P; k++){
int index = _external.at(j).at(k);
temp += _Table.at(index).at(i-1);
}
_Table.at(j).at(i) += temp;
This will reduce the number of calls to at, and may also improve cache performance a bit.
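To make the rewrite concrete, here is a small self-contained sketch of both variants, so they can be checked against each other. The element type (long long), the type aliases and the function names are assumptions for this example, and operator[] is used instead of at() on the assumption that every index stored in external is a valid row of the table.

```cpp
#include <cassert>
#include <vector>

using Table = std::vector<std::vector<long long>>;  // assumed element type
using Index = std::vector<std::vector<int>>;

// Original form: updates table[j][i] inside the innermost loop.
void update_direct(Table& table, const Index& external, int N) {
    const int M = static_cast<int>(external.size());
    for (int i = 1; i <= N; i++)
        for (int j = 0; j < M; j++)
            for (int idx : external[j])
                table[j][i] += table[idx][i - 1];
}

// Hoisted form: sums the values read from column i-1 into a local,
// then writes table[j][i] once per (i, j) pair.
void update_hoisted(Table& table, const Index& external, int N) {
    assert(table.size() == external.size());
    const int M = static_cast<int>(external.size());
    for (int i = 1; i <= N; i++) {
        for (int j = 0; j < M; j++) {
            const std::vector<int>& ext_row = external[j];  // hoisted lookup
            long long temp = 0;
            for (int idx : ext_row)
                temp += table[idx][i - 1];
            table[j][i] += temp;
        }
    }
}
```

Both functions keep the i, j, k iteration order, so the dependency on column i-1 is respected; only the number of writes to table[j][i] changes.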

subset NumericMatrix by row and column names in Rcpp

I am trying to create a function in Rcpp that will take as input a pairwise numeric matrix, as well as a list of vectors, each element being a subset of row/column names. I would like this function to identify the subset of the matrix that matches those names, and return the mean of the values.
Below I generated some dummy data that resembles the sort of data I have, and follow with an attempt of a Rcpp function.
library(Rcpp)
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
Here I have a list made up of character vectors of various subsets of the row/column names in pdist
subsetList <- replicate(10, sample(names(dat), 4), simplify=FALSE)
For each of these sets of names, I would like to identify the subset of the pairwise matrix and take the mean of the values
Here is what I have so far, which does not work, but I think it illustrates where I am trying to get.
cppFunction('
List meanDistByCell(List input, NumericMatrix pairmat) {
int n = input.size();
List out(n);
List dimnames = pairmat.attr( "dimnames" );
CharacterVector colnames = dimnames[1];
for (int i = 0; i < n; i++) {
CharacterVector sp = as< CharacterVector >(input[i]);
if (sp.size() > 0) {
out[i] = double(mean(pairmat(sp, sp)));
} else {
out[i] = NA_REAL;
}
}
return out;
}
')
Any help would be greatly appreciated! Thanks!
Although (contiguous) range-based subsetting is available (e.g. x(Range(first_row, last_row), Range(first_col, last_col))), as coatless pointed out, subsetting by CharacterVector is not currently supported, so you will have to roll your own for the time being. A general-ish approach might look something like this:
template <int RTYPE> inline Matrix<RTYPE>
Subset2D(const Matrix<RTYPE>& x, CharacterVector crows, CharacterVector ccols) {
R_xlen_t i = 0, j = 0, rr = crows.length(), rc = ccols.length(), pos;
Matrix<RTYPE> res(rr, rc);
CharacterVector xrows = rownames(x), xcols = colnames(x);
IntegerVector rows = match(crows, xrows), cols = match(ccols, xcols);
for (; j < rc; j++) {
// NB: match returns 1-based indices
pos = cols[j] - 1;
for (i = 0; i < rr; i++) {
res(i, j) = x(rows[i] - 1, pos);
}
}
rownames(res) = crows;
colnames(res) = ccols;
return res;
}
// [[Rcpp::export]]
NumericMatrix subset2d(NumericMatrix x, CharacterVector rows, CharacterVector cols) {
return Subset2D(x, rows, cols);
}
This assumes that the input matrix has both row and column names, and that the row and column lookup vectors are valid subsets of those dimnames; additional defensive code could be added to make this more robust. To demonstrate,
subset2d(pdist, subsetList[[1]], subsetList[[1]])
# spB spD spE spC
# spB NA NA NA NA
# spD 9 NA NA 7
# spE 5 4 NA 3
# spC 2 NA NA NA
pdist[subsetList[[1]], subsetList[[1]]]
# spB spD spE spC
# spB NA NA NA NA
# spD 9 NA NA 7
# spE 5 4 NA 3
# spC 2 NA NA NA
Subset2D takes care of most of the boilerplate involved in implementing meanDistByCell; all that remains is to loop over the input list, apply this to each list element, and store the mean of the result in the output list:
// [[Rcpp::export]]
List meanDistByCell(List keys, NumericMatrix x, bool na_rm = false) {
R_xlen_t i = 0, sz = keys.size();
List res(sz);
if (!na_rm) {
for (; i < sz; i++) {
res[i] = NumericVector::create(
mean(Subset2D(x, keys[i], keys[i]))
);
}
} else {
for (; i < sz; i++) {
res[i] = NumericVector::create(
mean(na_omit(Subset2D(x, keys[i], keys[i])))
);
}
}
return res;
}
all.equal(
lapply(subsetList, function(x) mean(pdist[x, x], na.rm = TRUE)),
meanDistByCell(subsetList, pdist, TRUE)
)
# [1] TRUE
Although the use of Subset2D allows for a much cleaner implementation of meanDistByCell, in this situation it is inefficient for at least a couple of reasons:
It sets the dimnames of the return object (rownames(res) = crows;, colnames(res) = ccols;), which you have no need for here.
It makes a call to match to obtain indices for each of rownames and colnames, which is unnecessary since you know in advance that rownames(x) == colnames(x).
You will incur the cost of both of these points k times, for an input list with length k.
A more efficient -- but consequently less concise -- approach would be to essentially implement only the aspects of Subset2D that are needed, inline inside of meanDistByCell:
// [[Rcpp::export]]
List meanDistByCell2(List keys, NumericMatrix x, bool na_rm = false) {
R_xlen_t k = 0, sz = keys.size(), i = 0, j = 0, nidx, pos;
List res(sz);
CharacterVector cx = colnames(x);
if (!na_rm) {
for (; k < sz; k++) {
// NB: match returns 1-based indices
IntegerVector idx = match(as<CharacterVector>(keys[k]), cx) - 1;
nidx = idx.size();
NumericVector tmp(nidx * nidx);
for (j = 0; j < nidx; j++) {
pos = idx[j];
for (i = 0; i < nidx; i++) {
tmp[nidx * j + i] = x(idx[i], pos);
}
}
res[k] = NumericVector::create(mean(tmp));
}
} else {
for (; k < sz; k++) {
IntegerVector idx = match(as<CharacterVector>(keys[k]), cx) - 1;
nidx = idx.size();
NumericVector tmp(nidx * nidx);
for (j = 0; j < nidx; j++) {
pos = idx[j];
for (i = 0; i < nidx; i++) {
tmp[nidx * j + i] = x(idx[i], pos);
}
}
res[k] = NumericVector::create(mean(na_omit(tmp)));
}
}
return res;
}
all.equal(
meanDistByCell(subsetList, pdist, TRUE),
meanDistByCell2(subsetList, pdist, TRUE)
)
# [1] TRUE
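The idea behind meanDistByCell2 (resolve names to positions once, then average the selected cells without building a named submatrix) does not depend on Rcpp. The following is a plain C++ sketch of the same idea using only the standard library; make_index and mean_by_names are hypothetical names, NaN stands in for NA, and missing values are always skipped, which corresponds to na_rm = true.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Build the name -> position lookup once, so it can be reused for every
// key set. emplace keeps the first position of a duplicated name.
std::unordered_map<std::string, int>
make_index(const std::vector<std::string>& names) {
    std::unordered_map<std::string, int> index;
    for (int i = 0; i < static_cast<int>(names.size()); ++i)
        index.emplace(names[i], i);
    return index;
}

// Mean of the square submatrix of x selected by `keys`, skipping NaN
// entries. Returns NaN if no non-missing value is selected.
double mean_by_names(const std::vector<std::vector<double>>& x,
                     const std::unordered_map<std::string, int>& index,
                     const std::vector<std::string>& keys) {
    assert(x.size() >= index.size());
    std::vector<int> pos;
    for (const auto& key : keys)
        pos.push_back(index.at(key));  // at() throws on unknown names
    double sum = 0.0;
    long count = 0;
    for (int c : pos) {
        for (int r : pos) {
            const double v = x[r][c];
            if (!std::isnan(v)) { sum += v; ++count; }
        }
    }
    return count > 0 ? sum / count : std::nan("");
}
```

With the first three entries of dat from the question (spA = 4, spB = 10, spC = 8), the lower triangle holds 6, 4 and 2, so selecting all three names gives a mean of 4.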

CodeJam 2014: How to solve task "New Lottery Game"?

I want to know efficient approach for the New Lottery Game problem.
The Lottery is changing! The Lottery used to have a machine to generate a random winning number. But due to cheating problems, the Lottery has decided to add another machine. The new winning number will be the result of the bitwise-AND operation between the two random numbers generated by the two machines.
To find the bitwise-AND of X and Y, write them both in binary; then a bit in the result in binary has a 1 if the corresponding bits of X and Y were both 1, and a 0 otherwise. In most programming languages, the bitwise-AND of X and Y is written X&Y.
For example:
The old machine generates the number 7 = 0111.
The new machine generates the number 11 = 1011.
The winning number will be (7 AND 11) = (0111 AND 1011) = 0011 = 3.
With this measure, the Lottery expects to reduce the cases of fraudulent claims, but unfortunately an employee from the Lottery company has leaked the following information: the old machine will always generate a non-negative integer less than A and the new one will always generate a non-negative integer less than B.
Catalina wants to win this lottery and to give it a try she decided to buy all non-negative integers less than K.
Given A, B and K, Catalina would like to know in how many different ways the machines can generate a pair of numbers that will make her a winner.
For small inputs we can check all possible pairs, but how can this be done for large inputs? My guess is to convert the numbers to binary strings first and then count the permutations that give a result less than K, but I can't figure out how to count the possible permutations of two binary strings.
I used a general DP technique that I described in a lot of detail in another answer.
We want to count the pairs (a, b) such that a < A, b < B and a & b < K.
The first step is to convert the numbers to binary and to pad them to the same size by adding leading zeroes. I just padded them to a fixed size of 40. The idea is to build up the valid a and b bit by bit.
Let f(i, loA, loB, loK) be the number of valid suffix pairs of a and b of size 40 - i. If loA is true, it means that the prefix up to i is already strictly smaller than the corresponding prefix of A. In that case there is no restriction on the next possible bit for a. If loA is false, A[i] is an upper bound on the next bit we can place at the end of the current prefix. loB and loK have an analogous meaning.
Now we have the following transition:
long long f(int i, bool loA, bool loB, bool loK) {
// TODO add memoization
if (i == 40)
return loA && loB && loK;
int hiA = loA ? 1: A[i]-'0'; // upper bound on the next bit in a
int hiB = loB ? 1: B[i]-'0'; // upper bound on the next bit in b
int hiK = loK ? 1: K[i]-'0'; // upper bound on the next bit in a & b
long long res = 0;
for (int a = 0; a <= hiA; ++a)
for (int b = 0; b <= hiB; ++b) {
int k = a & b;
if (k > hiK) continue;
res += f(i+1, loA || a < A[i]-'0',
loB || b < B[i]-'0',
loK || k < K[i]-'0');
}
return res;
}
The result is f(0, false, false, false).
The runtime is O(max(log A, log B)) if memoization is added to ensure that every subproblem is only solved once.
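One way the memoization left as a TODO could be filled in is sketched below. It is a variation on the function above rather than the answerer's own code: it reads bits with shifts instead of from padded strings, walks from bit 30 down to bit 0 (so it assumes inputs below 2^31), and the namespace lottery, the memo table and count_pairs are names introduced for this example.

```cpp
#include <cassert>
#include <cstring>

namespace lottery {

const int BITS = 31;            // assumption: A, B, K are below 2^31
long long A, B, K;              // exclusive upper bounds
long long memo[BITS + 1][2][2][2];

// Counts the ways to choose the bits at positions bit..0, processing from
// the most significant bit down. loX == true means the bits chosen so far
// are already strictly below the same prefix of X.
long long f(int bit, bool loA, bool loB, bool loK) {
    if (bit < 0) return loA && loB && loK;
    long long& cached = memo[bit][loA][loB][loK];
    if (cached != -1) return cached;
    const int bitA = (A >> bit) & 1;
    const int bitB = (B >> bit) & 1;
    const int bitK = (K >> bit) & 1;
    const int hiA = loA ? 1 : bitA;  // upper bound on the next bit of a
    const int hiB = loB ? 1 : bitB;  // upper bound on the next bit of b
    const int hiK = loK ? 1 : bitK;  // upper bound on the next bit of a & b
    long long res = 0;
    for (int a = 0; a <= hiA; ++a)
        for (int b = 0; b <= hiB; ++b) {
            const int k = a & b;
            if (k > hiK) continue;
            res += f(bit - 1, loA || a < bitA, loB || b < bitB, loK || k < bitK);
        }
    return cached = res;
}

long long count_pairs(long long a, long long b, long long k) {
    assert(a >= 0 && b >= 0 && k >= 0);
    A = a; B = b; K = k;
    std::memset(memo, -1, sizeof(memo));  // -1 marks "not computed yet"
    return f(BITS - 1, false, false, false);
}

} // namespace lottery
```

For A = 3, B = 4, K = 2 there are 12 pairs in total and only (2, 2) and (2, 3) give a & b >= 2, so lottery::count_pairs(3, 4, 2) returns 10.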
What I did was just to identify when the answer is A * B.
Otherwise, just brute force the rest, this code passed the large input.
// for each test cases
long count = 0;
if ((K > A) || (K > B)) {
count = A * B;
continue; // print count and go to the next test case
}
count = A * B - (A-K) * (B-K);
for (int i = K; i < A; i++) {
for (int j = K; j < B; j++) {
if ((i&j) < K) count++;
}
}
I hope this helps!
This follows the approach Niklas B. described. The complete solution is:
#include <algorithm>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
#define MAX_SIZE 32
int A, B, K;
int arr_a[MAX_SIZE];
int arr_b[MAX_SIZE];
int arr_k[MAX_SIZE];
bool flag [MAX_SIZE][2][2][2];
long long matrix[MAX_SIZE][2][2][2];
long long
get_result();
int main(int argc, char *argv[])
{
int case_amount = 0;
cin >> case_amount;
for (int i = 0; i < case_amount; ++i)
{
const long long result = get_result();
cout << "Case #" << 1 + i << ": " << result << endl;
}
return 0;
}
long long
dp(const int h,
const bool can_A_choose_1,
const bool can_B_choose_1,
const bool can_K_choose_1)
{
if (MAX_SIZE == h)
return can_A_choose_1 && can_B_choose_1 && can_K_choose_1;
if (flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1])
return matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1];
int cnt_A_max = arr_a[h];
int cnt_B_max = arr_b[h];
int cnt_K_max = arr_k[h];
if (can_A_choose_1)
cnt_A_max = 1;
if (can_B_choose_1)
cnt_B_max = 1;
if (can_K_choose_1)
cnt_K_max = 1;
long long res = 0;
for (int i = 0; i <= cnt_A_max; ++i)
{
for (int j = 0; j <= cnt_B_max; ++j)
{
int k = i & j;
if (k > cnt_K_max)
continue;
res += dp(h + 1,
can_A_choose_1 || (i < cnt_A_max),
can_B_choose_1 || (j < cnt_B_max),
can_K_choose_1 || (k < cnt_K_max));
}
}
flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = true;
matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = res;
return res;
}
long long
get_result()
{
cin >> A >> B >> K;
memset(arr_a, 0, sizeof(arr_a));
memset(arr_b, 0, sizeof(arr_b));
memset(arr_k, 0, sizeof(arr_k));
memset(flag, 0, sizeof(flag));
memset(matrix, 0, sizeof(matrix));
int i = 31;
while (i >= 1)
{
arr_a[i] = A % 2;
A /= 2;
arr_b[i] = B % 2;
B /= 2;
arr_k[i] = K % 2;
K /= 2;
i--;
}
return dp(1, 0, 0, 0);
}

Brute-force transposition decryption - word segmentation

I'm a 2nd year B. Comp. Sci. student and have a cryptography assignment that's really giving me grief. We've been given a text file of transposition-encrypted English phrases and an English dictionary file, then asked to write a program that deciphers the phrases automatically without any user input.
My first idea was to simply brute-force all possible permutations of the ciphertext, which should be trivial. However, I then have to decide which one is the most-likely to be the actual plaintext, and this is what I'm struggling with.
There's heaps of information on word segmentation here on SO, including this and this amongst other posts. Using this information and what I've already learned at uni, here's what I have so far:
string DecryptTransposition(const string& cipher, const string& dict)
{
vector<string> plain;
int sz = cipher.size();
int maxCols = ceil(sz / 2.0f);
int maxVotes = 0, key = 0;
// Iterate through all possible no.'s of cols.
for (int c = 2; c <= maxCols; c++)
{
int r = sz / c; // No. of complete rows if c is no. of cols.
int e = sz % c; // No. of extra letters if c is no. of cols.
string cipherCpy(cipher);
vector<string> table;
table.assign(r, string(c, ' '));
if (e > 0) table.push_back(string(e, ' '));
for (int y = 0; y < c; y++)
{
for (int x = 0; x <= r; x++)
{
if (x == r && e-- < 1) break;
table[x][y] = cipherCpy[0];
cipherCpy.erase(0, 1);
}
}
plain.push_back(accumulate(table.begin(),
table.end(), string("")));
// plain.back() now points to the plaintext
// generated from cipher with key = c
int votes = 0;
for (int i = 0, j = 2; (i + j) <= sz; )
{
string word = plain.back().substr(i, j);
if (dict.find('\n' + word + '\n') == string::npos) j++;
else
{
votes++;
i += j;
j = 2;
}
}
if (votes > maxVotes)
{
maxVotes = votes;
key = c;
}
}
return plain[key - 2]; // Minus 2 since we started from 2
}
There are two main problems with this algorithm:
It is incredibly slow, taking ~30 sec. to decrypt a 80-char. message.
It isn't completely accurate (I'd elaborate on this if I hadn't already taken up a whole page, but you can try it for yourself with the full VC++ 2012 project).
Any suggestions on how I could improve this algorithm would be greatly appreciated. MTIA :-)
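One likely contributor to the slowness is that dict.find('\n' + word + '\n') scans the whole dictionary string for every candidate word. The sketch below is only a possible direction, not an answer from this thread: it assumes the dictionary has been loaded into a std::unordered_set, and it scores a candidate plaintext by the number of characters that dictionary words can cover, computed with a simple dynamic program. The function name covered_chars and the max_word_len parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Returns the largest number of characters of `text` that can be covered
// by non-overlapping dictionary words; uncovered characters are skipped.
// best[i] holds that maximum for the prefix text[0, i).
int covered_chars(const std::string& text,
                  const std::unordered_set<std::string>& dict,
                  std::size_t max_word_len) {
    assert(max_word_len > 0);
    const std::size_t n = text.size();
    std::vector<int> best(n + 1, 0);
    for (std::size_t i = 1; i <= n; ++i) {
        best[i] = best[i - 1];  // option: leave character i-1 uncovered
        const std::size_t longest = std::min(i, max_word_len);
        for (std::size_t len = 1; len <= longest; ++len) {
            // Try a dictionary word that ends exactly at position i.
            if (dict.count(text.substr(i - len, len)))
                best[i] = std::max(best[i],
                                   best[i - len] + static_cast<int>(len));
        }
    }
    return best[n];
}
```

Under these assumptions, each candidate key c could be scored with covered_chars and the highest-scoring plaintext kept; whether this is more accurate than counting matched words would need to be checked on the assignment data.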

Search an integer in a row-sorted two dim array, is there any better approach?

I recently came across this problem: you have to find an integer in a two-dimensional array that is sorted within each row but not across columns. I have solved the problem, but I still think there may be a better approach, so I have come here to discuss it. Your suggestions and improvements will help me grow as a coder. Here is the code:
int searchInteger = Int32.Parse(Console.ReadLine());
int cnt = 0;
for (int i = 0; i < x; i++)
{
if (intarry[i, 0] <= searchInteger && intarry[i,y-1] >= searchInteger)
{
if (intarry[i, 0] == searchInteger || intarry[i, y - 1] == searchInteger)
Console.WriteLine("string present {0} times" , ++cnt);
else
{
int[] array = new int[y];
int y1 = 0;
for (int k = 0; k < y; k++)
array[k] = intarry[i, y1++];
bool result;
if (result = binarySearch(array, searchInteger) == true)
{
Console.WriteLine("string present inside {0} times", ++ cnt);
Console.ReadLine();
}
}
}
}
Here searchInteger is the integer we have to find in the array, and binarySearch is the method that returns a boolean indicating whether the value is present in a one-dimensional array (that single row).
Please help: is this optimal, or are there better solutions?
Thanks
Provided you have declared the array intarry, x and y as follows:
int[,] intarry =
{
{0,7,2},
{3,4,5},
{6,7,8}
};
var y = intarry.GetUpperBound(0)+1;
var x = intarry.GetUpperBound(1)+1;
// intarry.Dump();
You can keep it as simple as:
int searchInteger = Int32.Parse(Console.ReadLine());
var cnt=0;
for(var r=0; r<y; r++)
{
for(var c=0; c<x; c++)
{
if (intarry[r, c].Equals(searchInteger))
{
cnt++;
Console.WriteLine(
"string present at position [{0},{1}]" , r, c);
} // if
} // for
} // for
Console.WriteLine("string present {0} times" , cnt);
This example assumes that you have no information about whether the array is sorted (if you don't know that it is sorted, you have to go through every element and can't use binary search). Based on this example you can improve the performance if you know more about how the data in the array is structured:
if the rows are sorted ascending, you can replace the inner for loop by a binary search
if the entire array is sorted ascending and the data does not repeat, e.g.
int[,] intarry = {{0,1,2}, {3,4,5}, {6,7,8}};
then you can exit the loop as soon as the item is found. The easiest way to do this is to create a function and add a return statement to the inner for loop.
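The first refinement (replacing the inner loop by a binary search when each row is sorted ascending) can be sketched as follows. The example is written in C++ rather than C# and uses a vector of rows instead of a rectangular array; count_in_row_sorted is a name chosen for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Counts occurrences of `target` in a matrix whose rows are each sorted in
// ascending order (columns need not be sorted).
int count_in_row_sorted(const std::vector<std::vector<int>>& rows, int target) {
    int count = 0;
    for (const auto& row : rows) {
        assert(std::is_sorted(row.begin(), row.end()));
        // Skip rows whose value range cannot contain the target.
        if (row.empty() || target < row.front() || target > row.back())
            continue;
        // equal_range performs a binary search and returns the run of
        // elements equal to target, so duplicates are counted as well.
        auto range = std::equal_range(row.begin(), row.end(), target);
        count += static_cast<int>(range.second - range.first);
    }
    return count;
}
```

This costs O(log n) per row instead of O(n), and it avoids copying each row into a temporary array before searching.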
