Extract characters from string in R and save in different variables

I have the following dataset elections:
var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- cbind(var1, var2, var3)
Which looks like this:
var1   var2 var3
--------------------------------------------
125677 PRB  PCB/PTdoB/PCO/PRB
255422 PAN  PAN
475544 PR   DEM/PR/PT/PSDB/PMDB/PV
333344 PV   DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL
233452 PJ   DEM/PJ
I want to disaggregate var3 into eight additional variables, var4 to var11, filled with the components of var3 that are separated by /. The result I want is therefore this:
var1   var2 var3                             var4 var5  var6 var7 var8 var9 var10 var11
----------------------------------------------------------------------------------------
125677 PRB  PCB/PTdoB/PCO/PRB                PCB  PTdoB PCO  PRB
255422 PAN  PAN                              PAN
475544 PR   DEM/PR/PT/PSDB/PMDB/PV           DEM  PR    PT   PSDB PMDB PV
333344 PV   DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL DEM  PR    PT   PSDB PMDB PV   PSTU  PSOL
233452 PJ   DEM/PJ                           DEM  PJ
I was able to get a result close to the one I want with strsplit(elections$var3, '/'), but the problem is that this produces a list of character vectors of varying length. It therefore works when there is only one element in var3, but not when there is more than one.
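For example, a minimal illustration of the ragged output (using the var3 vector defined above):
strsplit(var3, "/")
# [[1]]
# [1] "PCB"   "PTdoB" "PCO"   "PRB"
#
# [[2]]
# [1] "PAN"
#
# [[3]]
# [1] "DEM"  "PR"   "PT"   "PSDB" "PMDB" "PV"
# ...
The list elements have 4, 1, 6, 8 and 2 entries respectively, so they cannot be bound into columns as-is.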
Any ideas?

A direct way would be to use read.csv (or read.table) on that variable (either before or after you add it to your existing dataset). Here, I've used read.csv, which defaults to fill = TRUE; that lets you split the data the way you are expecting to.
Here's an example:
read.csv(text = elections[, "var3"], sep = "/", header = FALSE)
#    V1    V2  V3   V4   V5 V6   V7   V8
# 1 PCB PTdoB PCO  PRB
# 2 PAN
# 3 DEM    PR  PT PSDB PMDB PV
# 4 DEM    PR  PT PSDB PMDB PV PSTU PSOL
# 5 DEM    PJ
Or, possibly (if your dataset is a data.frame):
read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)
This approach is essentially what concat.split from my "splitstackshape" package does, though it performs a little more checking and will conveniently combine the output back into the original dataset.
Assuming now that "elections" is a data.frame, usage would be:
library(splitstackshape)
concat.split(elections, "var3", "/", drop = TRUE)
#     var1 var2 var3_1 var3_2 var3_3 var3_4 var3_5 var3_6 var3_7 var3_8
# 1 125677  PRB    PCB  PTdoB    PCO    PRB
# 2 255422  PAN    PAN
# 3 475544   PR    DEM     PR     PT   PSDB   PMDB     PV
# 4 333344   PV    DEM     PR     PT   PSDB   PMDB     PV   PSTU   PSOL
# 5 233452   PJ    DEM     PJ
Update
Ultimately, however, read.csv is somewhat slow (so, by extension, the concat.split approach would be slow too). The approach I'm working on for a revision of the function is along the following lines, until I come up with something better:
myMat <- function(inVec, sep) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  ## number of output columns: most separators found in any element, plus one
  nCols <- max(vapply(gregexpr(sep, inVec, fixed = TRUE), length, 1L)) + 1
  M <- matrix("", ncol = nCols, nrow = length(inVec))
  Spl <- strsplit(inVec, sep, fixed = TRUE)
  Len <- vapply(Spl, length, 1L)
  ## (row, column) positions of every split piece, for matrix indexing
  Ind <- cbind(rep(seq_along(Len), Len), sequence(Len))
  M[Ind] <- unlist(Spl)
  M
}
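For reference, a sketch of how the output of myMat could be attached back to the data (assuming "elections" is a data.frame, as in the benchmark setup below; the column names are chosen only to match the var4 to var11 layout asked for above):
out <- myMat(elections$var3, "/")
colnames(out) <- paste0("var", 3 + seq_len(ncol(out)))  # var4, var5, ...
elections <- data.frame(elections, out)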
Some benchmarks
Sample data:
var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- data.frame(var1, var2, var3)
Functions to evaluate:
fun1 <- function() myMat(elections$var3, "/")
fun2 <- function() read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)
The results:
microbenchmark(fun1(), fun2())
# Unit: microseconds
#   expr     min        lq   median        uq      max neval
# fun1() 159.936  175.5445  193.291  244.6075  566.188   100
# fun2() 974.151 1017.1280 1070.796 1690.0100 2146.724   100
BIGGER data (but still not very big):
elections <- do.call(rbind, replicate(5000, elections, simplify = FALSE))
dim(elections)
# [1] 25000 3
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#   expr       min        lq    median       uq       max neval
# fun1()  195.1358  211.8841  232.1093  287.560  324.6918    10
# fun2() 2764.8115 3524.7989 3626.1480 3639.303 3728.2099    10
I ran out of patience waiting for one million rows with fun2(), but fun1() takes about 19 seconds, which is OK, but not something I'm totally happy with.

Related

Separate integers from 1 variable into different variables

Not sure how to explain this the best, but I have a variable with 4 ints in it. Is there a simple way to extract the 4 ints into 4 separate vars?
Example:
The variable contains: 4567
And then the output is:
var1 = 4
var2 = 5
var3 = 6
var4 = 7
In Kotlin, one way is to map the digits and destructure:
val n = 4567
val (var1, var2, var3, var4) = "$n".map { it.digitToInt() }
Note that this will fail if the number contains fewer than 4 digits.
Alternatively, you can extract each digit arithmetically, with integer division and modulo:
val input = 4567
val var1 = input / 1000
val var2 = (input % 1000) / 100
val var3 = (input % 100) / 10
val var4 = (input % 10)

Plot output differences between Python and Julia

I am trying to use Julia as the main language for my work, but I find that this plot is different from the Python one (Python produces the correct plot).
Here is the Python code and output:
import numpy as np
import math
import matplotlib.pyplot as plt

u = 9.27*10**(-21)
k = 1.38*10**(-16)
j2 = 7/2
nrr = 780
h = 1000
na = 6*10**(23)
rho = 7.842
mgd = 157.25
a = mgd
d = na*rho*u/a
m_f = []
igd = 7.0
for t in range(1,401):
    while True:
        h1 = h+d*nrr*igd
        x2 = (7*u*h1)/(k*t)
        x4 = 2*j2
        q2 = (x4+1)/x4
        m = abs(7*(q2*math.tanh(q2*x2)**-1 - (1/x4)*math.tanh(x2/x4)**-1))
        if abs(m - igd) < 10**(-12):
            break
        else:
            igd = m
    m_f.append(abs(m))
plt.plot(range(1,401), m_f)
plt.savefig("Py_plot.pdf")
and it gives the expected output:
[figure: the correct plot, as produced by the Python code]
But when I do the same calculation in Julia, it gives a different output than Python. Here is my Julia code:
using Plots
u = 9.27*10^(-21)
k = 1.38*10^(-16)
j2 = 7/2
nrr = 780
h = 1000
na = 6*10^(23)
rho = 7.842
mgd = 157.25
a = mgd
d = na*rho*u/a
igd = 7.0
m = 0.0
m_f = Float64[]
for t in 1:400
    while true
        h1 = h+d*nrr*igd
        x2 = (7*u*h1)/(k*t)
        x4 = 2*j2
        q2 = (x4+1)/x4
        m = 7*(q2*coth(rad2deg(q2*x2))-(1/x4)*coth(rad2deg(x2/x4)))
        if abs(abs(m)-igd) < 10^(-10)
            break
        else
            igd = m
        end
    end
    push!(m_f, abs(m))
end
plot(1:400, m_f)
and this is the unexpected Julia output:
[figure: the incorrect plot produced by the Julia code]
Any help is appreciated.
Code:
using Plots

const u = 9.27e-21
const k = 1.38e-16
const j2 = 7/2
const nrr = 780
const h = 1000
const na = 6.0e23
const rho = 7.842
const mgd = 157.25
const a = mgd
const d = na*rho*u/a

function plot_graph()
    igd = 7.0
    m = 0.0
    trange = 1:400
    m_f = Vector{Float64}(undef, length(trange))
    for t in trange
        while true
            h1 = h+d*nrr*igd
            x2 = (7*u*h1)/(k*t)
            x4 = 2*j2
            q2 = (x4+1)/x4
            m = abs(7*(q2*coth(q2*x2)-(1/x4)*coth(x2/x4)))
            if isapprox(m, igd, atol = 10^(-10))
                break
            else
                igd = m
            end
        end
        m_f[t] = m
    end
    plot(trange, m_f)
end
Plot:
[figure: output of the corrected Julia code, matching the Python plot]
Changes for correctness:
Changed na = 6*10^(23) to na = 6.0e23.
Since ^ has a higher precedence than *, 10^23 is evaluated first, and since the operands are Int values, the result is also an Int. However, Int (i.e. Int64) can only hold numbers up to approximately 9 * 10^18, so 10^23 overflows and gives a wrong result.
julia> 10^18
1000000000000000000
julia> 10^19 #overflow starts here
-8446744073709551616
julia> 10^23 #and gives a wrong value here too
200376420520689664
6.0e23 avoids this problem by directly using the scientific e-notation to create a literal Float64 value (Float64 can hold this value without overflowing).
Removed the rad2deg calls around the coth arguments. Julia's trigonometric and hyperbolic functions work in radians by default, so there is no need for this conversion.
Other changes
Marked all the constants as const, and moved the rest of the code into a function. See Performance tip: Avoid non-constant global variables
Changed the abs(m - igd) < 10^-10 to isapprox(m, igd, atol = 10^-10) which performs basically the same check, but is clearer and more flexible (for eg. if you wanted to change to a relative tolerance rtol later).
Stored the 1:400 as a named variable trange. This is just because it's used multiple times, so it's easier to manage as a variable.
Changed m_f = Float64[] to m_f = Vector{Float64}(undef, length(trange)) (and the push! at the end to an assignment). If the size of the array is known beforehand (as it is in this case), it's better for performance to pre-allocate it with undef values and then assign to it.
Changed u and k to use the scientific e-notation as well, for consistency and clarity (thanks to @DNF for suggesting the use of this notation in the comments).

RcppArmadillo: diagonal matrix multiplication is very slow

Let x be a vector and M a matrix.
In R, I can do
D <- diag(exp(x))
crossprod(M, D%M)
and in RcppArmadillo, I have the following which is much slower.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat multiple_mnv(const arma::vec& x, const arma::mat& M) {
    arma::colvec diagonal(x.size())
    for (int i = 0; i < x.size(); i++)
    {
        diagonal(i) = exp(x[i]);
    }
    arma::mat D = diagmat(diagonal);
    return M.t()*D*M;
}
Why is this so slow? How can I speed this up?
Welcome to Stack Overflow, manju. For future questions, please be advised that a minimal reproducible example is expected, and is in fact in your best interest to provide; it helps others help you. Here's an example of how you could provide example data for others to work with:
## Set seed for reproducibility
set.seed(123)
## Generate data
x <- rnorm(10)
M <- matrix(rnorm(100), nrow = 10, ncol = 10)
## Output code for others to copy your objects
dput(x)
dput(M)
This is the data I will work with to show that your C++ code is in fact not slower than R. I used your C++ code (adding in a missing semicolon):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat foo(const arma::vec& x, const arma::mat& M) {
    arma::colvec diagonal(x.size());
    for ( int i = 0; i < x.size(); i++ )
    {
        diagonal(i) = exp(x[i]);
    }
    arma::mat D = diagmat(diagonal);
    return M.t() * D * M;
}
Note also that I had to make some of my own choices about the type of the return object and the types of the function arguments (this is one of the places where a minimal reproducible example could help you: what if these choices affect my results?). I then create an R function to do what foo() does:
bar <- function(v, M) {
D <- diag(exp(v))
return(crossprod(M, D %*% M))
}
Note also that I had to fix a typo you had, changing D%M to D %*% M. Let's double check they give the same results:
all.equal(foo(x, M), bar(x, M))
# [1] TRUE
Now let's explore how fast they are:
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M), times = 1e5)
bench
# Unit: microseconds
# expr    min     lq     mean median     uq      max
#  cpp 22.185 23.015 27.00436 23.204 23.461 31143.30
#    R 22.126 23.028 25.48256 23.216 23.475 29628.86
Those look pretty much the same to me! We can also look at a density plot of the times (throwing out the extreme value outliers to make things a little clearer):
cpp_times <- with(bench, time[expr == "cpp"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
plot(cpp_time_dens, col = "blue", xlab = "Time (in nanoseconds)", ylab = "",
     main = "Comparing C++ and R execution time")
lines(R_time_dens, col = "red")
legend("topright", col = c("blue", "red"), bty = "n", lty = 1,
       legend = c("C++ function (foo)", "R function (bar)"))
Why?
As helpfully pointed out by Dirk Eddelbuettel in the comments, in the end both R and Armadillo are going to be calling a LAPACK or BLAS routine anyway -- you shouldn't expect much difference unless you can give Armadillo a hint on how to be more efficient.
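As a side note, base R can tell you which linear algebra libraries your own build links against, so you can verify what is actually doing the work here:
sessionInfo()  # recent R versions (>= 3.4.0) list the BLAS and LAPACK library paths
La_version()   # version of the LAPACK in use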
Can we make the Armadillo code faster?
Yes! As pointed out by mtall in the comments, we can give Armadillo the hint that we're dealing with a diagonal matrix. Let's try; we'll use the following code:
// [[Rcpp::export]]
arma::mat baz(const arma::vec& x, const arma::mat& M) {
    return M.t() * diagmat(arma::exp(x)) * M;
}
And benchmark it:
all.equal(foo(x, M), baz(x, M))
# [1] TRUE
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M),
                        cpp2 = baz(x, M), times = 1e5)
bench
# Unit: microseconds
# expr    min     lq     mean median     uq      max
#  cpp 22.822 23.757 27.57015 24.118 24.632 26600.48
#    R 22.855 23.771 26.44725 24.124 24.638 30619.09
# cpp2 20.035 21.218 25.49863 21.587 22.123 36745.72
We see a small but consistent improvement; let's take a look graphically, as we did before:
cpp_times <- with(bench, time[expr == "cpp"])
cpp2_times <- with(bench, time[expr == "cpp2"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
cpp2_time_dens <- density(cpp2_times[cpp2_times < quantile(cpp2_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
xlims <- range(c(cpp_time_dens$x, cpp2_time_dens$x, R_time_dens$x))
ylims <- range(c(cpp_time_dens$y, cpp2_time_dens$y, R_time_dens$y))
ylims <- ylims * c(1, 1.15)
cols <- c("#0072b2", "#f0e442", "#d55e00")
cols <- c("#e69f00", "#56b4e9", "#009e73")
labs <- c("C++ original", "C++ improved", "R")
plot(cpp_time_dens, col = cols[1], xlim = xlims, ylim = ylims,
     xlab = "Time (in nanoseconds)", ylab = "",
     main = "Comparing C++ and R execution time")
lines(cpp2_time_dens, col = cols[2])
lines(R_time_dens, col = cols[3])
legend("topleft", col = cols, bty = "n", lty = 1, legend = labs, horiz = TRUE)

Swap two characters in a cell array of strings

I have a cell array of strings, and I want to swap A and B in a percentage of the cells, like 20% or 30% of the total number of strings in the cell array.
For example :
A_in={ 'ABCDE'
'ACD'
'ABCDE'
'ABCD'
'CDE' };
Now, we need to swap A and B in 40% of the sequences in A_in (2/5 sequences). Some sequences do not contain A and B, so we just skip them and swap only in the sequences which contain AB. The sequences to modify are chosen randomly. I'd appreciate it if someone could tell me how to do this. The expected output is:
A_out={ 'ABCDE'
'ACD'
'BACDE'
'BACD'
'CDE' }
Get the indices of a random percentage of the matching strings with randsample, and swap the pair with strrep:
% Input
swapStr = 'AB';
swapPerc = 0.4; % 40%
% Get index to swap
hasPair = find(~cellfun('isempty', regexp(A_in, swapStr)));
swapIdx = randsample(hasPair, ceil(numel(hasPair) * swapPerc));
% Swap char pair
A_out = A_in;
A_out(swapIdx) = strrep(A_out(swapIdx), swapStr, fliplr(swapStr));
You can use strfind, like:
A_in={ 'ABCDE';
'ACD';
'ABCDE';
'ABCD';
'CDE' };
ABcells = strfind(A_in,'AB');
idxs = find(~cellfun(@isempty,ABcells));
n = numel(idxs);
perc = 0.6;
k = round(n*perc);
idxs = randsample(idxs,k);
A_out = A_in;
A_out(idxs) = cellfun(@(a,idx) [a(1:idx-1) 'BA' a(idx+2:end)],A_in(idxs),ABcells(idxs),'UniformOutput',false);

constructing an identifier string for each row in data

I have the following data:
library(data.table)
d = data.table(a = c(1:3), b = c(2:4))
and would like to get this result (in a way that would work with an arbitrary number of columns):
d[, c := paste0('a_', a, '_b_', b)]
d
#    a b       c
# 1: 1 2 a_1_b_2
# 2: 2 3 a_2_b_3
# 3: 3 4 a_3_b_4
The following works, but I'm hoping to find something shorter and more legible.
d = data.table(a = c(1:3), b = c(2:4))
d[, c := apply(mapply(paste, names(.SD), .SD, MoreArgs = list(sep = "_")),
               1, paste, collapse = "_")]
one way, only slightly cleaner:
d[, c := apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_")) ]
   a b       c
1: 1 2 a_1_b_2
2: 2 3 a_2_b_3
3: 3 4 a_3_b_4
Here is an approach using do.call('paste') that requires only a single call to paste. I will benchmark it in a situation where the columns are integers, as this seems a more sensible test case:
N <- 1e4
d <- setnames(as.data.table(replicate(5, sample(N), simplify = FALSE)), letters[seq_len(5)])
f5 <- function(d) {
  l <- length(d)
  ## interleave names and columns: positions 1, l+1, 2, l+2, ... into c(names, columns)
  o <- as.vector(rbind(seq_len(l), seq_len(l) + l))
  do.call('paste', c(c(as.list(names(d)), d)[o], sep = '_'))
}
microbenchmark(f1(d), f2(d), f5(d))  # f1 and f2 as defined in the answer below
Unit: milliseconds
  expr       min        lq    median        uq       max neval
 f1(d)  41.51040  43.88348  44.60718  45.29426  52.83682   100
 f2(d) 193.94656 207.20362 210.88062 216.31977 252.11668   100
 f5(d)  30.73359  31.80593  32.09787  32.64103  45.68245   100
To avoid looping through rows, you can use this:
do.call(paste, c(lapply(names(d), function(n)paste0(n,"_",d[[n]])), sep="_"))
Benchmarking:
N <- 1e4
d <- data.table(a=runif(N),b=runif(N),c=runif(N),d=runif(N),e=runif(N))
f1 <- function(d)
{
  do.call(paste, c(lapply(names(d), function(n) paste0(n, "_", d[[n]])), sep = "_"))
}
f2 <- function(d)
{
  apply(d, 1, function(x) paste(names(d), x, sep = "_", collapse = "_"))
}
require(microbenchmark)
microbenchmark(f1(d), f2(d))
Note: f2 is inspired by @Ricardo's answer.
Results:
Unit: milliseconds
 expr      min       lq   median       uq      max neval
f1(d) 195.8832 213.5017 216.3817 225.4292 254.3549   100
f2(d) 418.3302 442.0676 451.0714 467.5824 567.7051   100
Edit note: a previous benchmark with N <- 1e3 didn't show much difference in times. Thanks again, @eddi.
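For completeness, both f1 and f2 return a plain character vector, so the identifier column can be created in place. A minimal usage sketch with f1 from above (note that d must not already contain c, since every existing column gets pasted):
d = data.table(a = c(1:3), b = c(2:4))
d[, c := f1(d)]
d
#    a b       c
# 1: 1 2 a_1_b_2
# 2: 2 3 a_2_b_3
# 3: 3 4 a_3_b_4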
