Is it possible to define a random limit for a loop in JAGS? - survival-analysis

I am trying to implement a Weibull proportional hazards model with a cure fraction, following the approach outlined by Chen, Ibrahim and Sinha (1999) - A New Bayesian Model for Survival Data with a Surviving Fraction. However, I am not sure whether it is possible to define a random limit for a loop in JAGS.
library(R2OpenBUGS)
library(rjags)
set.seed(1234)
censored <- c(1, 1)
time_mod <- c(NA, NA)
time_cens <- c(5, 7)
tau <- 4
design_matrix <- rbind(c(1, 0, 0, 0), c(1, 0.2, 0.2, 0.04))
jfun <- function() {
  for (i in 1:nobs) {
    censored[i] ~ dinterval(time_mod[i], time_cens[i])
    time_mod[i] <- ifelse(N[i] == 0, tau, min(Z))
    for (k in 1:N[i]) {
      Z[k] ~ dweib(1, 1)
    }
    N[i] ~ dpois(fc[i])
    fc[i] <- exp(inprod(design_matrix[i, ], beta))
  }
  beta[1] ~ dnorm(0, 10)
  beta[2] ~ dnorm(0, 10)
  beta[3] ~ dnorm(0, 10)
  beta[4] ~ dnorm(0, 10)
}
inits <- function() {
  time_init <- rep(NA, length(time_mod))
  time_init[which(!status)] <- time_cens[which(!status)] + 1
  out <- list(beta = rnorm(4, 0, 10),
              time_mod = time_init,
              N = rpois(length(time_mod), 5))
  return(out)
}
data_base <- list('time_mod' = time_mod, 'time_cens' = time_cens,
                  'censored' = censored, 'design_matrix' = design_matrix,
                  'tau' = tau,
                  'nobs' = length(time_cens[!is.na(time_cens)]))
tc1 <- textConnection("jmod", "w")
write.model(jfun, tc1)
close(tc1)
# Calling JAGS
tc2 <- textConnection(jmod)
j <- jags.model(tc2,
                data = data_base,
                inits = inits(),
                n.chains = 1,
                n.adapt = 1000)
I get the following error:
Error in jags.model(tc2, data = data_base, inits = inits(), n.chains = 1, :
RUNTIME ERROR:
Compilation error on line 6.
Unknown variable N
Either supply values for this variable with the data
or define it on the left hand side of a relation.

I am not entirely certain, but I am fairly sure that you cannot declare a random number of nodes in BUGS in general, so this is not a JAGS-specific quirk.
Nevertheless, you can work around it.
Since BUGS is a declarative language rather than a procedural one, it is enough to declare an arbitrary but fixed number of nodes (say, "large enough"), associate only a random number of them with a distribution and with observed data, and leave the remaining nodes deterministic.
Once you know the maximum value that N[i] can take (say N.max), you can pass it to JAGS as data and then change this code of yours:
for (k in 1:N[i]) {
  Z[k] ~ dweib(1, 1)
}
into this:
for (k in 1:N.max) {
  if (k <= N[i]) {
    Z[k] ~ dweib(1, 1)
  } else {
    Z[k] <- 0
  }
}
I hope this does the trick in your case, so please report back later on whether it works.
Needless to say, if you have non-zero observed data associated with a deterministic Z[k], then all hell breaks loose inside JAGS...
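For what it's worth, here is a rough sketch of how the same padding idea might be written without an if/else, since the only control structure the BUGS/JAGS model language offers is the for loop. It assumes N.max is passed in as data and that this block sits inside your for(i in 1:nobs) loop; the names in_use and Z.eff are made up for the sketch, Z is made a nobs x N.max matrix so each observation gets its own latent times, and the padding constant 1.0E6 is an arbitrary large value so padded draws can never win the min():
for (k in 1:N.max) {
  Z[i, k] ~ dweib(1, 1)
  in_use[i, k] <- step(N[i] - k)   # 1 if k <= N[i], 0 otherwise
  # padded entries get a large constant so min() effectively ignores them
  Z.eff[i, k] <- in_use[i, k] * Z[i, k] + (1 - in_use[i, k]) * 1.0E6
}
time_mod[i] <- ifelse(N[i] == 0, tau, min(Z.eff[i, ]))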

Related

RCPP and the %*% operator, revisited

I'm trying to decide whether it makes sense to implement R's %*% operator in Rcpp
if my dataset is huge. But I am really having trouble getting an Rcpp implementation to work.
Here is my example R code:
# remove everything in the global environment
rm(list = ls())
n_states = 4 # number of states
v_n <- c("H", "S1", "S2", "D") # the 4 states of the model:
n_t = 100 # number of transitions
# create transition matrix with random numbers. This transition matrix is constant.
m_P = matrix(runif(n_states*n_t), # insert n_states * n_t random numbers (can change this later)
             nrow = n_states,
             ncol = n_states,
             dimnames = list(v_n, v_n))
# create Markov trace: what proportion of the population is in each state at each period
# (won't make sense due to the random numbers, but that is fine)
m_TR <- matrix(NA,
               nrow = n_t + 1,
               ncol = n_states,
               dimnames = list(0:n_t, v_n)) # n_t + 1 because R doesn't understand Cycle 0
# initialize Markov trace
m_TR[1, ] <- c(1, 0, 0, 0)
# run the loop
microbenchmark::microbenchmark( # function from the microbenchmark library, used to time how long this takes to run
  for (t in 1:n_t) {                    # throughout the number of cycles
    m_TR[t + 1, ] <- m_TR[t, ] %*% m_P  # estimate the Markov trace for the next cycle (t + 1)
  }
) # end of microbenchmark call
print(m_TR) # print the result
And here is my replacement for the %*% operator, which doesn't seem to work correctly at all, although I can't figure out why:
library(Rcpp)
cppFunction(
  'void estimate_markov(int n_t, NumericMatrix m_P, NumericMatrix m_TR)
  {
    // We want to reproduce this
    //   matrix_A[X+1,] <- matrix_A[X,] %*% matrix_B
    // The %*% operation behaves as follows for a vector_V %*% matrix_M
    // Here the matrix M is populated with letters so that you can
    // more easily see how the operation is performed
    // So, a multiplication like this:
    //
    //   V        M
    //  {1} %*% {A D}
    //  {2}     {B E}
    //  {3}     {C F}
    //
    // Results in a vector:
    //   V_result
    //  {1*A + 1*D}
    //  {2*B + 2*E}
    //  {3*C + 3*F}
    //
    // Now use values instead of letters for M (ex: A=1, B=2, .., F=6)
    //   V_result
    //  {1*1 + 1*4}    {1 + 4}     {5}
    //  {2*2 + 2*5} => {4 + 10} => {14}
    //  {3*3 + 3*6}    {9 + 18}    {27}
    //
    // Note that the RHS matrix may contain any number of columns,
    // but *MUST* contain the same number of rows as the LHS vector

    // Get dimensions of matrices, and sanity check:
    // the number of elements in a row vector of the LHS matrix must equal the number of rows on the RHS
    if (m_TR.cols() != m_P.rows())
      throw std::range_error("Matrix mismatch, m_TR.cols != m_P.rows");

    // we want to know these dimensions, and there is no reason to call these functions in a loop,
    // so store the values once
    int cnt_P_cols  = m_P.cols();
    int cnt_TR_cols = m_TR.cols();

    for (int Index = 1; Index <= n_t; ++Index)
    {
      // iterate over the columns in m_TR
      for (int col_iter = 0; col_iter < cnt_TR_cols; ++col_iter)
      {
        // an accumulator for the vector multiplication
        double sum = 0;
        // The new value comes from the previous row (Index-1)
        double orig_TR = m_TR(col_iter, Index - 1);
        // iterate over the columns in m_P corresponding to this Index
        for (int p_iter = 0; p_iter < cnt_P_cols; ++p_iter)
        {
          // accumulate the value of this TR scalar * the m_P vector
          sum += orig_TR * m_P(p_iter, Index);
        }
        m_TR(col_iter, Index) = sum;
      }
    }
  }'
)
Can someone point me to where my logic is going wrong?
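As a reference point for the comment block above, here is a small R check (not part of the original post) of what %*% actually produces when a vector multiplies a matrix; v and M here are throwaway examples. The vector is treated as a 1 x n row vector, so each entry of the result is a weighted sum down a column of the matrix:
v <- c(1, 2, 3)
M <- matrix(1:6, nrow = 3)   # columns are (1, 2, 3) and (4, 5, 6)
v %*% M
#      [,1] [,2]
# [1,]   14   32
# 14 = 1*1 + 2*2 + 3*3 ; 32 = 1*4 + 2*5 + 3*6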

Error in update.jags(model, n.iter, ...) : Error in node sd[1] Invalid parent values

I am getting an error in node sd[1]; the compiler says it has invalid parent values. I am working with a Gaussian mixture model for the "galaxies" data from the "MASS" package in R.
library(rjags)
library(MASS)
library(mcsm)
data("galaxies")
summary(galaxies)
y = galaxies
ngroups = 2
jags_data = list(y=y, n=length(y), ngroups=ngroups)
gaussmodel = "
model {
for (i in 1:n) {
y[i] ~ dnorm(mu[z[i]], tau[z[i]])
z[i] ~ dcat(group_probs)
}
group_probs ~ ddirich(d)
for (j in 1:ngroups) {
mu_raw[j] ~ dnorm(0, 1E-6)
tau[j] ~ dgamma(0.001, 0.001)
sd[j] = pow(tau[j], -0.5)
d[j] = 2
}
mu = sort(mu_raw)
}
"
model = jags.model(textConnection(gaussmodel), data=jags_data,
n.chains=4)
update(model,n.iter=1E4)
samples = coda.samples(model=model, variable.names=c("mu", "sd", "group_probs"), n.iter=1E4, thin=5)
I don't know much about rjags and Bayesian analysis in detail, but I think your problem is in the sd line of the code, where sd[j] = pow(tau[j], -0.5).
I believe the -0.5 is the problem. I am not sure whether you intended the exponent to be negative, but it seemed to cause problems to surface in your Dirichlet model.
Taking away the negative value seemed to do the trick.
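For concreteness, this is the one-line modification being described, shown against the model string from the question; only the sign of the exponent changes, everything else stays as posted:
# original line in the question's model:
# sd[j] = pow(tau[j], -0.5)
# with the negative value taken away:
sd[j] = pow(tau[j], 0.5)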

RcppArmadillo: diagonal matrix multiplication is very slow

Let x be a vector and M a matrix.
In R, I can do
D <- diag(exp(x))
crossprod(M, D%M)
and in RcppArmadillo, I have the following which is much slower.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat multiple_mnv(const arma::vec& x, const arma::mat& M) {
arma::colvec diagonal(x.size())
for (int i = 0; i < x.size(); i++)
{
diagonal(i) = exp(x[i]);
}
arma::mat D = diagmat(diagonal);
return M.t()*D*M;
}
Why is this so slow? How can I speed this up?
Welcome to Stack Overflow, manju. For future questions, please be advised that a minimal reproducible example is expected, and in fact is in your best interest to provide; it helps others help you. Here's an example of how you could provide example data for others to work with:
## Set seed for reproducibility
set.seed(123)
## Generate data
x <- rnorm(10)
M <- matrix(rnorm(100), nrow = 10, ncol = 10)
## Output code for others to copy your objects
dput(x)
dput(M)
This is the data I will work with to show that your C++ code is in fact not slower than R. I used your C++ code (adding in a missing semicolon):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat foo(const arma::vec& x, const arma::mat& M) {
  arma::colvec diagonal(x.size());
  for ( int i = 0; i < x.size(); i++ )
  {
    diagonal(i) = exp(x[i]);
  }
  arma::mat D = diagmat(diagonal);
  return M.t() * D * M;
}
Note also that I had to make some of my own choices about the type of the return object and the types of the function arguments (this is one of the places where a minimal reproducible example could help you: what if these choices affect my results?). I then create an R function to do what foo() does:
bar <- function(v, M) {
  D <- diag(exp(v))
  return(crossprod(M, D %*% M))
}
Note also that I had to fix a typo you had, changing D%M to D %*% M. Let's double check they give the same results:
all.equal(foo(x, M), bar(x, M))
# [1] TRUE
Now let's explore how fast they are:
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M), times = 1e5)
bench
# Unit: microseconds
#  expr    min     lq     mean median     uq      max
#   cpp 22.185 23.015 27.00436 23.204 23.461 31143.30
#     R 22.126 23.028 25.48256 23.216 23.475 29628.86
Those look pretty much the same to me! We can also look at a density plot of the times (throwing out the extreme value outliers to make things a little clearer):
cpp_times <- with(bench, time[expr == "cpp"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
plot(cpp_time_dens, col = "blue", xlab = "Time (in nanoseconds)", ylab = "",
     main = "Comparing C++ and R execution time")
lines(R_time_dens, col = "red")
legend("topright", col = c("blue", "red"), bty = "n", lty = 1,
       legend = c("C++ function (foo)", "R function (bar)"))
Why?
As helpfully pointed out by Dirk Eddelbuettel in the comments, in the end both R and Armadillo are going to be calling a LAPACK or BLAS routine anyways -- you shouldn't expect much difference unless you can give Armadillo a hint on how to be more efficient.
Can we make the Armadillo code faster?
Yes! As pointed out by mtall in the comments, we can give Armadillo the hint that we're dealing with a diagonal matrix. Let's try; we'll use the following code:
// [[Rcpp::export]]
arma::mat baz(const arma::vec& x, const arma::mat& M) {
return M.t() * diagmat(arma::exp(x)) * M;
}
And benchmark it:
all.equal(foo(x, M), baz(x, M))
# [1] TRUE
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M),
                        cpp2 = baz(x, M), times = 1e5)
bench
# Unit: microseconds
#  expr    min     lq     mean median     uq      max
#   cpp 22.822 23.757 27.57015 24.118 24.632 26600.48
#     R 22.855 23.771 26.44725 24.124 24.638 30619.09
#  cpp2 20.035 21.218 25.49863 21.587 22.123 36745.72
We see a small but sure improvement; let's take a look graphically as we did before:
cpp_times <- with(bench, time[expr == "cpp"])
cpp2_times <- with(bench, time[expr == "cpp2"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
cpp2_time_dens <- density(cpp2_times[cpp2_times < quantile(cpp2_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
xlims <- range(c(cpp_time_dens$x, cpp2_time_dens$x, R_time_dens$x))
ylims <- range(c(cpp_time_dens$y, cpp2_time_dens$y, R_time_dens$y))
ylims <- ylims * c(1, 1.15)
cols <- c("#0072b2", "#f0e442", "#d55e00")
cols <- c("#e69f00", "#56b4e9", "#009e73")
labs <- c("C++ original", "C++ improved", "R")
plot(cpp_time_dens, col = cols[1], xlim = xlims, ylim = ylims,
     xlab = "Time (in nanoseconds)", ylab = "",
     main = "Comparing C++ and R execution time")
lines(cpp2_time_dens, col = cols[2])
lines(R_time_dens, col = cols[3])
legend("topleft", col = cols, bty = "n", lty = 1, legend = labs, horiz = TRUE)

Math.Net Exponential Moving Average

I'm using the simple moving average in Math.Net, but now I also need to calculate an EMA (exponential moving average) or some other kind of weighted moving average, and I don't find it in the library.
I looked over all the methods under MathNet.Numerics.Statistics and beyond, but didn't find anything similar.
Is it missing from the library, or do I need to reference some additional package?
I don't see any EMA in MathNet.Numerics; however, it's trivial to program. The routine below is based on the definition at Investopedia.
public double[] EMA(double[] x, int N)
{
    // x is the input series
    // N is the notional age of the data used
    // k is the smoothing constant
    double k = 2.0 / (N + 1);
    double[] y = new double[x.Length];
    y[0] = x[0];
    for (int i = 1; i < x.Length; i++) y[i] = k * x[i] + (1 - k) * y[i - 1];
    return y;
}
By chance I found this package: https://daveskender.github.io/Stock.Indicators/docs/INDICATORS.html. It targets the latest .NET framework and has very detailed documentation.
Try this:
public IEnumerable<double> EMA(IEnumerable<double> items, int notationalAge)
{
    double k = 2.0d / (notationalAge + 1), prev = 0.0d;
    var e = items.GetEnumerator();
    if (!e.MoveNext()) yield break;
    yield return prev = e.Current;
    while (e.MoveNext())
    {
        yield return prev = (k * e.Current) + (1 - k) * prev;
    }
}
It will still work with arrays, but also with List, Queue, Stack, IReadOnlyCollection, etc.
Although it's not explicitly stated, I also get the sense this is working with money, in which case it really ought to use decimal instead of double.
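For what it's worth, a sketch of the same streaming EMA using decimal; the only changes from the version above are the element type and the m suffix on the literals, which makes them decimal constants:
public IEnumerable<decimal> EMA(IEnumerable<decimal> items, int notationalAge)
{
    // decimal avoids binary floating-point rounding when the series is monetary
    decimal k = 2.0m / (notationalAge + 1), prev = 0.0m;
    var e = items.GetEnumerator();
    if (!e.MoveNext()) yield break;
    yield return prev = e.Current;
    while (e.MoveNext())
    {
        yield return prev = (k * e.Current) + (1 - k) * prev;
    }
}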

using data.table with multiple threads in R

Is there a way to utilize multiple threads for computation using data.table in R? For example, let's say I have the following data.table:
dtb <- data.table(id=rep(1:10000, 1000), x=1:1e7)
setkey(dtb, id)
f <- function(m) { #some really complicated function }
res <- dtb[,f(x), by=id]
Is there a way to get R to multithread this if f takes a while to compute? And in the case that f is quick, will multithreading help, or will most of the time be taken up by data.table splitting things into groups?
I am not sure that this is "multi-threading", but perhaps you meant a multi-core solution? If so, then look at this earlier answer: Performing calculations by subsets of data in R, found with a search for "[r] [data.table] parallel".
Edit: this roughly doubles the speed on a 4-core machine, although my system monitor suggests only 2 cores were used during the mclapply call. The code is copied from this thread: http://r.789695.n4.nabble.com/Access-to-local-variables-in-quot-j-quot-expressions-tt2315330.html#a2315337
require(parallel) # provides mclapply()
calc.fake.dt.mclapply <- function (dt) {
  mclapply(6*c(1000,1:4,6,8,10),
           function(critical.age) {
             dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
             dt[, cumsum.lag(tmp), by = grp]$V1})
}
mk.fake.df <- function (n.groups=10000, n.per.group=70) {
  data.frame(grp=rep(1:n.groups, each=n.per.group),
             age=rep(0:(n.per.group-1), n.groups),
             x=rnorm(n.groups * n.per.group),
             ## These don't do anything, but only exist to give
             ## the table a similar size to the real data.
             y1=rnorm(n.groups * n.per.group),
             y2=rnorm(n.groups * n.per.group),
             y3=rnorm(n.groups * n.per.group),
             y4=rnorm(n.groups * n.per.group)) }
df <- mk.fake.df()
calc.fake.dt.lapply <- function (dt) { # use base lapply for testing
  lapply(6*c(1000,1:4,6,8,10),
         function(critical.age) {
           dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
           dt[, cumsum.lag(tmp), by = grp]$V1})
}
mk.fake.dt <- function (fake.df) {
  fake.dt <- as.data.table(fake.df)
  setkey(fake.dt, grp, age)
  fake.dt
}
require(data.table)
dt <- mk.fake.dt(df)
cumsum.lag <- function (x) {
  x.prev <- c(0, x[-length(x)])
  cumsum(x.prev)
}
system.time(res.dt.mclapply <- calc.fake.dt.mclapply(dt))
#    user  system elapsed
#   1.896   4.413   1.210
system.time(res.dt.lapply <- calc.fake.dt.lapply(dt))
#    user  system elapsed
#   1.391   0.793   2.175
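For completeness, here is a rough sketch (not from the linked thread) of how the same multi-core idea could be applied to the original dtb/f example: split the ids into a few blocks, run the grouped computation on each block with parallel::mclapply(), and bind the pieces back together. The f() below is just a stand-in for the "really complicated function", mc.cores = 4 is an arbitrary choice, and mclapply relies on forking, which is not available on Windows.
library(data.table)
library(parallel)

dtb <- data.table(id = rep(1:10000, 1000), x = 1:1e7)
setkey(dtb, id)
f <- function(m) sum(sqrt(abs(m)))  # stand-in for the real function

id.blocks <- split(1:10000, cut(1:10000, 4))  # one block of ids per core
res.list <- mclapply(id.blocks,
                     function(block) dtb[id %in% block, f(x), by = id],
                     mc.cores = 4)
res <- rbindlist(res.list)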
