I have the following data:
library(data.table)
d = data.table(a = c(1:3), b = c(2:4))
and would like to get this result (in a way that would work with an arbitrary number of columns):
d[, c := paste0('a_', a, '_b_', b)]
d
# a b c
#1: 1 2 a_1_b_2
#2: 2 3 a_2_b_3
#3: 3 4 a_3_b_4
The following works, but I'm hoping to find something shorter and more legible.
d = data.table(a = c(1:3), b = c(2:4))
d[, c := apply(mapply(paste, names(.SD), .SD, MoreArgs = list(sep = "_")),
1, paste, collapse = "_")]
One way, only slightly cleaner:
d[, c := apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_")) ]
a b c
1: 1 2 a_1_b_2
2: 2 3 a_2_b_3
3: 3 4 a_3_b_4
Here is an approach using do.call('paste'), but requiring only a single call to paste.
I will benchmark on a situation where the columns are integers (as this seems a more sensible test case); f1 and f2 are as defined in the answer below.
N <- 1e4
d <- setnames(as.data.table(replicate(5, sample(N), simplify = FALSE)), letters[seq_len(5)])
f5 <- function(d){
  l <- length(d)
  # interleave each column name (positions 1..l) with its column (positions l+1..2l)
  o <- c(1L, l + 1L) + rep(seq_len(l) - 1L, each = 2L)
  do.call('paste', c((c(as.list(names(d)), d))[o], sep = '_'))
}
microbenchmark(f1(d), f2(d), f5(d))
Unit: milliseconds
expr min lq median uq max neval
f1(d) 41.51040 43.88348 44.60718 45.29426 52.83682 100
f2(d) 193.94656 207.20362 210.88062 216.31977 252.11668 100
f5(d) 30.73359 31.80593 32.09787 32.64103 45.68245 100
To avoid looping through rows, you can use this:
do.call(paste, c(lapply(names(d), function(n)paste0(n,"_",d[[n]])), sep="_"))
Benchmarking:
N <- 1e4
d <- data.table(a=runif(N),b=runif(N),c=runif(N),d=runif(N),e=runif(N))
f1 <- function(d)
{
do.call(paste, c(lapply(names(d), function(n)paste0(n,"_",d[[n]])), sep="_"))
}
f2 <- function(d)
{
apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_"))
}
require(microbenchmark)
microbenchmark(f1(d), f2(d))
Note: f2 was inspired by @Ricardo's answer.
Results:
Unit: milliseconds
expr min lq median uq max neval
f1(d) 195.8832 213.5017 216.3817 225.4292 254.3549 100
f2(d) 418.3302 442.0676 451.0714 467.5824 567.7051 100
Edit note: previous benchmarking with N <- 1e3 didn't show much difference in times. Thanks again, @eddi.
I am working on a problem regarding pseudocode for matrix multiplication using worker processes, where w is the worker index, P is the number of processors (one worker per processor), and n is the number of rows of the matrices.
The pseudocode computes the result matrix by dividing the n rows into P strips of n/P rows each.
process worker[w = 1 to P] {
int first = (w-1) * n/P;
int last = first + n/P - 1;
for [i = first to last] {
for [j = 0 to n-1] {
c[i,j] = 0.0;
for[k = 0 to n-1]
c[i,j] = c[i,j] + a[i,k]*b[k,j];
}
}
}
My question is: how would I handle the case where n is not a multiple of P, which can often happen?
The simplest solution is to give the last worker all the remaining rows (they won't be more than P-1):
if w == P {
last += n mod P
}
n mod P is the remainder of the division of n by P.
Or change the calculation of first and last like this:
int first = ((w-1) * n)/P
int last = (w * n)/P - 1
This automatically takes care for the case when n is not divisible by P. The brackets are not really necessary in most languages where * and / have the same precedence and are left-associative. The point is that the multiplication by n should happen before the division by P.
Example: n = 11, P = 3:
w = 1: first = 0, last = 2 (3 rows)
w = 2: first = 3, last = 6 (4 rows)
w = 3: first = 7, last = 10 (4 rows)
This is a better solution as it spreads the remainder of the division evenly among the workers.
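If it helps to see the index arithmetic on its own, here is a minimal sketch (plain C++ rather than the pseudocode above; the names and the printing are mine, just for illustration) that prints the row range each worker would get:
#include <cstdio>

int main() {
    const int n = 11, P = 3;  // the example values from above
    for (int w = 1; w <= P; ++w) {
        // Multiply by n before dividing by P so the remainder is spread evenly.
        int first = ((w - 1) * n) / P;
        int last  = (w * n) / P - 1;
        std::printf("worker %d: rows %d..%d (%d rows)\n",
                    w, first, last, last - first + 1);
    }
    return 0;
}
For n = 11 and P = 3 this prints the same ranges as the example: 0..2, 3..6, and 7..10.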
Let x be a vector and M a matrix.
In R, I can do
D <- diag(exp(x))
crossprod(M, D%M)
and in RcppArmadillo, I have the following which is much slower.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat multiple_mnv(const arma::vec& x, const arma::mat& M) {
arma::colvec diagonal(x.size())
for (int i = 0; i < x.size(); i++)
{
diagonal(i) = exp(x[i]);
}
arma::mat D = diagmat(diagonal);
return M.t()*D*M;
}
Why is this so slow? How can I speed this up?
Welcome to Stack Overflow, manju. For future questions, please be advised that a minimal reproducible example is expected, and is in fact in your best interest to provide; it helps others help you. Here's an example of how you could provide example data for others to work with:
## Set seed for reproducibility
set.seed(123)
## Generate data
x <- rnorm(10)
M <- matrix(rnorm(100), nrow = 10, ncol = 10)
## Output code for others to copy your objects
dput(x)
dput(M)
This is the data I will work with to show that your C++ code is in fact not slower than R. I used your C++ code (adding in a missing semicolon):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat foo(const arma::vec& x, const arma::mat& M) {
arma::colvec diagonal(x.size());
for ( int i = 0; i < x.size(); i++ )
{
diagonal(i) = exp(x[i]);
}
arma::mat D = diagmat(diagonal);
return M.t() * D * M;
}
Note also that I had to make some of my own choices about the type of the return object and the types of the function arguments (this is one of the places where a minimal reproducible example could help you: what if these choices affect my results?). I then create an R function to do what foo() does:
bar <- function(v, M) {
D <- diag(exp(v))
return(crossprod(M, D %*% M))
}
Note also that I had to fix a typo you had, changing D%M to D %*% M. Let's double check they give the same results:
all.equal(foo(x, M), bar(x, M))
# [1] TRUE
Now let's explore how fast they are:
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M), times = 1e5)
bench
# Unit: microseconds
# expr min lq mean median uq max
# cpp 22.185 23.015 27.00436 23.204 23.461 31143.30
# R 22.126 23.028 25.48256 23.216 23.475 29628.86
Those look pretty much the same to me! We can also look at a density plot of the times (throwing out the extreme value outliers to make things a little clearer):
cpp_times <- with(bench, time[expr == "cpp"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
plot(cpp_time_dens, col = "blue", xlab = "Time (in nanoseconds)", ylab = "",
main = "Comparing C++ and R execution time")
lines(R_time_dens, col = "red")
legend("topright", col = c("blue", "red"), bty = "n", lty = 1,
legend = c("C++ function (foo)", "R function (bar)"))
Why?
As helpfully pointed out by Dirk Eddelbuettel in the comments, in the end both R and Armadillo are going to be calling a LAPACK or BLAS routine anyway -- you shouldn't expect much difference unless you can give Armadillo a hint on how to be more efficient.
Can we make the Armadillo code faster?
Yes! As pointed out by mtall in the comments, we can give Armadillo the hint that we're dealing with a diagonal matrix. Let's try; we'll use the following code:
// [[Rcpp::export]]
arma::mat baz(const arma::vec& x, const arma::mat& M) {
return M.t() * diagmat(arma::exp(x)) * M;
}
And benchmark it:
all.equal(foo(x, M), baz(x, M))
# [1] TRUE
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M),
cpp2 = baz(x, M), times = 1e5)
bench
# Unit: microseconds
# expr min lq mean median uq max
# cpp 22.822 23.757 27.57015 24.118 24.632 26600.48
# R 22.855 23.771 26.44725 24.124 24.638 30619.09
# cpp2 20.035 21.218 25.49863 21.587 22.123 36745.72
We see a small but sure improvement; let's take a look graphically as we did before:
cpp_times <- with(bench, time[expr == "cpp"])
cpp2_times <- with(bench, time[expr == "cpp2"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
cpp2_time_dens <- density(cpp2_times[cpp2_times < quantile(cpp2_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
xlims <- range(c(cpp_time_dens$x, cpp2_time_dens$x, R_time_dens$x))
ylims <- range(c(cpp_time_dens$y, cpp2_time_dens$y, R_time_dens$y))
ylims <- ylims * c(1, 1.15)
cols <- c("#0072b2", "#f0e442", "#d55e00")
cols <- c("#e69f00", "#56b4e9", "#009e73")
labs <- c("C++ original", "C++ improved", "R")
plot(cpp_time_dens, col = cols[1], xlim = xlims, ylim = ylims,
xlab = "Time (in nanoseconds)", ylab = "",
main = "Comparing C++ and R execution time")
lines(cpp2_time_dens, col = cols[2])
lines(R_time_dens, col = cols[3])
legend("topleft", col = cols, bty = "n", lty = 1, legend = labs, horiz = TRUE)
I have a file on a Linux server that has data like:
a 22
a 10
a 17
a 51
a 33
b 51
b 47
c 33
I want a shell script or Linux commands to find the min, avg, 90%, max, and count for each value in column 1.
Example:
for a min = 10, avg = 26, 90% = 33, max = 51, and count = 5.
Here is a version that also computes the 90th percentile, using gawk.
The percentile definition used is the one given by Wikipedia, called "nearest rank": the 90th percentile is the value whose ordinal rank in the sorted list is ceil(0.9 * N), where N is the number of values in the group.
The function round can be found here.
#!/bin/bash
gawk '
function round(x, ival, aval, fraction)
{
ival = int(x) # integer part, int() truncates
# see if fractional part
if (ival == x) # no fraction
return ival # ensure no decimals
if (x < 0) {
aval = -x # absolute value
ival = int(aval)
fraction = aval - ival
if (fraction >= .5)
return int(x) - 1 # -2.5 --> -3
else
return int(x) # -2.3 --> -2
} else {
fraction = x - ival
if (fraction >= .5)
return ival + 1
else
return ival
}
}
# the following block processes all the lines
# and populates counters and values
{
if($1 in counters) {
counters[$1]++;
} else {
counters[$1] = 1;
}
i = counters[$1];
values[$1, i] = $2;
} END {
for (c in counters) {
delete tmp;
min = values[c, 1];
max = values[c, 1];
sum = values[c, 1];
tmp[1] = values[c, 1];
for (i = 2; i <= counters[c]; i++) {
if (values[c, i] < min) min = values[c, i];
if (values[c, i] > max) max = values[c, i];
sum += values[c, i];
tmp[i] = values[c, i];
}
# The following 3 lines compute the percentile.
n = asort(tmp, tmp_sorted);
idx = round(0.9 * n + 0.5); # Nearest rank definition
percentile = tmp_sorted[idx];
# Output of the statistics for this group.
printf "for %s min = %d, avg = %f, 90 = %d,max = %d, count = %d\n", c, min, (sum / counters[c]), percentile, max, counters[c];
}
}'
To run it, execute:
./stats.sh < input.txt
I am assuming that the above script is named stats.sh and your input is saved in input.txt.
The output is:
for a min = 10, avg = 26.600000, 90 = 51,max = 51, count = 5
for b min = 47, avg = 49.000000, 90 = 51,max = 51, count = 2
for c min = 33, avg = 33.000000, 90 = 33,max = 33, count = 1
Here is the explanation:
counters is an associative array, the key is the value in column 1
and the value is the number of values found in the input for each
value in column 1.
values is a two dimensional (value_in_column_one, counter_per_value)
array that keeps all the values grouped by value in column one.
At the end of the script the outermost loop goes through all the values
found in column 1. The innermost for loop analyses all the values belonging
to a particular value in column 1 and computes all the statistics.
For lines starting with a, here's an awk script.
$ echo 'a 22
a 10
a 17
a 51
a 33
b 51
b 47
c 33' | awk 'BEGIN{n=0;s=0;};/^a/{n=n+1;s=s+$2;};END{print n;print s;print s/n;}'
5
133
26.6
Using awk:
awk 'NR==1{min=$2} {sum+=$2; if(min>=$2) min=$2; if(max<$2) max=$2}
END{printf("max=%d,min=%d,count=%d,avg=%.2f\n", max, min, NR, (sum/NR))}' file
max=51,min=10,count=8,avg=33.00
EDIT:
awk '$1 != v {
if (NR>1)
printf("For %s max=%d,min=%d,count=%d,avg=%.2f\n", v, max, min, k, (sum/k));
v=$1;
min=$2;
k=sum=max=0
}
{
k++;
sum+=$2;
if (min > $2)
min=$2;
if (max < $2)
max=$2
}
END {
printf("For %s max=%d,min=%d,count=%d,avg=%.2f\n", v, max, min, k, (sum/k))
}' < <(sort -n -k1,2 f)
OUTPUT:
For a max=51,min=10,count=5,avg=26.60
For b max=51,min=47,count=2,avg=49.00
For c max=33,min=33,count=1,avg=33.00
I was implementing quicksort and wanted to set the pivot to the median of three numbers: the first element, the middle element, and the last element.
Could I possibly find the median in fewer comparisons?
int median(int a[], int p, int r)
{
int m = (p+r)/2;
if(a[p] < a[m])
{
if(a[p] >= a[r])
return a[p];
else if(a[m] < a[r])
return a[m];
}
else
{
if(a[p] < a[r])
return a[p];
else if(a[m] >= a[r])
return a[m];
}
return a[r];
}
If the concern is only comparisons, then this should be used.
int getMedian(int a, int b, int c) {
    // Note: the subtractions and products below can overflow for inputs of large magnitude.
    int x = a - b;
    int y = b - c;
    int z = a - c;
    if (x*y > 0) return b;  // (a-b) and (b-c) have the same sign: b lies between a and c
    if (x*z > 0) return c;  // a is an extreme value, so c is the median
    return a;
}
int32_t FindMedian(const int n1, const int n2, const int n3) {
auto _min = min(n1, min(n2, n3));
auto _max = max(n1, max(n2, n3));
return (n1 + n2 + n3) - _min - _max;
}
You can't do it in one, and you're only using two or three, so I'd say you've got the minimum number of comparisons already.
Rather than just computing the median, you might as well put them in place. Then you can get away with just 3 comparisons all the time, and you've got your pivot closer to being in place.
T median(T a[], int low, int high)
{
int middle = ( low + high ) / 2;
if( a[ middle ].compareTo( a[ low ] ) < 0 )
swap( a, low, middle );
if( a[ high ].compareTo( a[ low ] ) < 0 )
swap( a, low, high );
if( a[ high ].compareTo( a[ middle ] ) < 0 )
swap( a, middle, high );
return a[middle];
}
I know that this is an old thread, but I had to solve exactly this problem on a microcontroller that has very little RAM and does not have a h/w multiplication unit (:)). In the end I found the following works well:
static char medianIndex[] = { 1, 1, 2, 0, 0, 2, 1, 1 };
signed short getMedian(const signed short num[])
{
return num[medianIndex[(num[0] > num[1]) << 2 | (num[1] > num[2]) << 1 | (num[0] > num[2])]];
}
If you're not afraid to get your hands a little dirty with compiler intrinsics you can do it with exactly 0 branches.
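For example (a sketch only of what that could look like, using ordinary bit-twiddling min/max rather than actual intrinsics, and with made-up function names), a branch-free median of three in C++:
#include <cstdint>

// Classic branch-free min/max: the mask is all ones when a < b, all zeros otherwise.
static inline int32_t bmin(int32_t a, int32_t b) {
    return b ^ ((a ^ b) & -static_cast<int32_t>(a < b));  // a if a < b, else b
}
static inline int32_t bmax(int32_t a, int32_t b) {
    return a ^ ((a ^ b) & -static_cast<int32_t>(a < b));  // b if a < b, else a
}

int32_t median3_branchless(int32_t a, int32_t b, int32_t c) {
    // median(a, b, c) == max(min(a, b), min(max(a, b), c))
    return bmax(bmin(a, b), bmin(bmax(a, b), c));
}
On most compilers the comparisons above typically become setcc-style instructions, so no conditional jumps are generated.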
The same question was discussed before on:
Fastest way of finding the middle value of a triple?
Though, I have to add that in the context of a naive implementation of quicksort, with a lot of elements, reducing the number of branches when finding the median is not so important, because the branch predictor will choke either way once you start tossing elements around the pivot. More sophisticated implementations (which don't branch on the partition operation, and avoid WAW hazards) will benefit from this greatly.
Remove the max and min values from the total sum:
int med3(int a, int b, int c)
{
int tot_v = a + b + c ;
int max_v = max(a, max(b, c));
int min_v = min(a, min(b, c));
return tot_v - max_v - min_v;
}
There is actually a clever way to isolate the median element from the three, using a careful analysis of the 6 possible permutations (of low, median, high). In Python:
def med(a, start, mid, last):
# put the median of a[start], a[mid], a[last] in the a[start] position
SM = a[start] < a[mid]
SL = a[start] < a[last]
if SM != SL:
return
ML = a[mid] < a[last]
m = mid if SM == ML else last
a[start], a[m] = a[m], a[start]
A third of the time you need only two comparisons; otherwise you need three (about 2.67 on average). And you only swap the median element into place when needed (2/3 of the time).
Full python quicksort using this at:
https://github.com/mckoss/labs/blob/master/qs.py
You can write up all the permutations:
1 0 2
1 2 0
0 1 2
2 1 0
0 2 1
2 0 1
Then we want to find the position of the 1 (the median). We could do this with only two comparisons if some first comparison could split off a group of permutations that put the median in the same position, such as the first two lines.
The issue is that the first two lines give different results on every comparison we have available: a<b, a<c, b<c. Hence we have to fully identify the permutation, which requires 3 comparisons in the worst case.
Using the bitwise XOR operator, the median of three numbers can be found: since the max and the min each appear twice in m ^ n ^ a ^ b ^ c, they cancel out and only the median remains.
def median(a,b,c):
m = max(a,b,c)
n = min(a,b,c)
ans = m^n^a^b^c
return ans
Here is a question for the Excel / math-wizards.
I'm having trouble doing a calculation which is based on a formula with a circular reference. The calculation has been done in an Excel worksheet.
I've deduced the following equations from an Excel file:
a = 240000
b = 1400 + c + 850 + 2995
c = CEIL( ( a + b ) * 0.015, 100 )
After the iterations the total of A+B is supposed to be 249045 (where b = 9045).
In the Excel file this gives a circular reference, which is set to be allowed to iterate 4 times.
My problem: Recreate the calculation in AS2, going through 4 iterations.
I am not good enough at math to break this problem down.
Can anyone out there help me?
Edit: I've changed the formatting of the number in variable a. Sorry, I'm from DK and we use period as a thousand separator. I've removed it to avoid confusion :-)
2nd edit: The third equation, c, uses Excel's CEIL() function to round the number up to the nearest hundred.
I don't know ActionScript, but I think you want:
a = 240000
c = 0
for (i = 0; i < 4; i++){
b = 1400 + c + 850 + 2995
c = (a + b) * 0.015
}
But you need to determine what to use for the initial value of c. I assume that Excel uses 0, since I get the same value when running the above as I get in Excel with iterations = 4, c = 3734.69...
Where do you get the "A + B is supposed to be 249045" value? In Excel and in the above AS, b only reaches 8979 with those values.
function calcRegistrationTax( amount, iterations ) {
    function roundToWhole( n, to ) {
        if( n > 0 )
            return Math.ceil( n / to ) * to;
        else if( n < 0 )
            return Math.floor( n / to ) * to;
        else
            return to;
    }
    // Constants taken from the question's formulas:
    // b = 1400 + c + 850 + 2995 and c = CEIL( ( a + b ) * 0.015, 100 )
    // (which amount belongs to which name is a guess; only the sum matters)
    var basicCost = 1400;
    var financeDeclaration = 850;
    var handlingFee = 2995;
    var basicFeeRatio = 0.015;
    var a = amount;
    var b = 0;
    var c = 0;
    for (var i = 0; i < iterations; i++){
        b = basicCost + c + financeDeclaration + handlingFee;
        c = ( a + b ) * basicFeeRatio;
        c = roundToWhole( c, 100 );
    }
    return b;
}
totalAmount = 240000 + calcRegistrationTax( 240000, 4 ); // This gives 249045
This did it, thanks to Benjamin for the help.