Tidymodels is an amazing package! Censored data is very common in medical research. When I was trying to tune the 'boost_tree' model using the 'censored' package, the error comes:' Unknown mode for parsnip model.' Does it seem that 'censored regression' mode was not supported in parsnip? Very desire for solving!
library(censored)
library(survival)
library(tidymodels)
set.seed(123)
lung$surv <- Surv(lung$time, lung$status)
lung_split <- lung %>%
initial_split(strata = status )
lung_train <- training(lung_split)
lung_test <- testing(lung_split)
set.seed(123)
lung_folds <- vfold_cv(lung_train, strata = status)
lung_rec <- recipe(surv ~ age + ph.ecog, data = lung_train)
boost_spec <-
boost_tree(
mtry = tune(),
trees = tune()
) %>%
set_engine("mboost") %>%
set_mode('censored regression')
lung_grid <- crossing(mtry = c(2,3,4), trees = c(100,200,15))
lung_wf <- workflow(lung_rec, boost_spec)
set.seed(345)
tree_rs <-
tune_grid(
lung_wf,
boost_spec,
resamples = lung_folds,
grid = lung_grid
#metrics = metric_set(accuracy, roc_auc, sensitivity, specificity)
)
tree_rs
Related
I have estimated nested logit in R using the mlogit package. However, I encountered some problems when trying to estimate the marginal effect. Below is the code I implemented.
library(mlogit)
# data
data2 = read.csv(file = "neat_num_energy.csv")
new_ener2 <- mlogit.data(
data2,
choice="alter4", shape="long",
alt.var="energy_altern",chid.var="id")
# estimate model
nest2 <- mlogit(
alter4 ~ expendmaint + expendnegy |
educ + sex + ppa_power_sp + hu_price_powersupply +
hu_2price +hu_3price + hu_9price + hu_10price +
hu_11price + hu_12price,
data = data2,
nests = list(
Trad = c('Biomas_Trad', 'Solar_Trad'),
modern = c('Biomas_Modern', 'Solar_Modern')
), unscaled=FALSE)
# create Z variable
z3 <- with(data2, data.frame(
expendnegy = tapply(expendnegy, idx(nest2,2), mean),
expendmaint= tapply(expendmaint, idx(nest2,2), mean),
educ= mean(educ),
sex = mean(sex),
hu_price_powersupply = mean(hu_price_powersupply),
ppa_power_sp = mean(ppa_power_sp),
hu_2price = mean(hu_2price),
hu_3price = mean(hu_3price),
hu_9price = mean(hu_9price),
hu_10price = mean(hu_10price),
hu_11price = mean(hu_11price),
ppa_power_sp = mean(ppa_power_sp),
hu_12price = mean(hu_12price)
))
effects(nest2, covariate = "sex", data = z3, type = "ar")
#> ** Error in Solve.default (H, g[!fixed]): Lapack routine dgesv: #> system is exactly singular:U[6,6] =0.**
My data is in long format with expendmaint and expendnegy being the only alternative specific while every other variable is case specific.
altern4 is a nominal variable representing each alternative
Let x be a vector and M a matrix.
In R, I can do
D <- diag(exp(x))
crossprod(M, D%M)
and in RcppArmadillo, I have the following which is much slower.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat multiple_mnv(const arma::vec& x, const arma::mat& M) {
arma::colvec diagonal(x.size())
for (int i = 0; i < x.size(); i++)
{
diagonal(i) = exp(x[i]);
}
arma::mat D = diagmat(diagonal);
return M.t()*D*M;
}
Why is this so slow? How can I speed this up?
Welcome to Stack Overflow manju. For future questions, please be advised that a minimal reproducible example is expected, and in fact is in your best interest to provide; it helps others help you. Here's an example of how you could provide example data for others to work with:
## Set seed for reproducibility
set.seed(123)
## Generate data
x <- rnorm(10)
M <- matrix(rnorm(100), nrow = 10, ncol = 10)
## Output code for others to copy your objects
dput(x)
dput(M)
This is the data I will work with to show that your C++ code is in fact not slower than R. I used your C++ code (adding in a missing semicolon):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat foo(const arma::vec& x, const arma::mat& M) {
arma::colvec diagonal(x.size());
for ( int i = 0; i < x.size(); i++ )
{
diagonal(i) = exp(x[i]);
}
arma::mat D = diagmat(diagonal);
return M.t() * D * M;
}
Note also that I had to make some of my own choices about the type of the return object and types of the function arguments (this is one of the places where a minimal reproducible example could help you: What if these choices affect my results?) I then create an R function to do what foo() does:
bar <- function(v, M) {
D <- diag(exp(v))
return(crossprod(M, D %*% M))
}
Note also that I had to fix a typo you had, changing D%M to D %*% M. Let's double check they give the same results:
all.equal(foo(x, M), bar(x, M))
# [1] TRUE
Now let's explore how fast they are:
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = foo(x, M), times = 1e5)
bench
# Unit: microseconds
# expr min lq mean median uq max
# cpp 22.185 23.015 27.00436 23.204 23.461 31143.30
# R 22.126 23.028 25.48256 23.216 23.475 29628.86
Those look pretty much the same to me! We can also look at a density plot of the times (throwing out the extreme value outliers to make things a little clearer):
cpp_times <- with(bench, time[expr == "cpp"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
plot(cpp_time_dens, col = "blue", xlab = "Time (in nanoseconds)", ylab = "",
main = "Comparing C++ and R execution time")
lines(R_time_dens, col = "red")
legend("topright", col = c("blue", "red"), bty = "n", lty = 1,
legend = c("C++ function (foo)", "R function (bar)"))
Why?
As helpfully pointed out by Dirk Eddelbuettel in the comments, in the end both R and Armadillo are going to be calling a LAPACK or BLAS routine anyways -- you shouldn't expect much difference unless you can give Armadillo a hint on how to be more efficient.
Can we make the Armadillo code faster?
Yes! As pointed out by mtall in the comments, we can give Armadillo the hint that we're dealing with a diagonal matrix. Let's try; we'll use the following code:
// [[Rcpp::export]]
arma::mat baz(const arma::vec& x, const arma::mat& M) {
return M.t() * diagmat(arma::exp(x)) * M;
}
And benchmark it:
all.equal(foo(x, M), baz(x, M))
# [1] TRUE
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = foo(x, M),
cpp2 = baz(x, M), times = 1e5)
bench
# Unit: microseconds
# expr min lq mean median uq max
# cpp 22.822 23.757 27.57015 24.118 24.632 26600.48
# R 22.855 23.771 26.44725 24.124 24.638 30619.09
# cpp2 20.035 21.218 25.49863 21.587 22.123 36745.72
We see a small but sure improvement; let's take a look graphically as we did before:
cpp_times <- with(bench, time[expr == "cpp"])
cpp2_times <- with(bench, time[expr == "cpp2"])
R_times <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
cpp2_time_dens <- density(cpp2_times[cpp2_times < quantile(cpp2_times, 0.95)])
R_time_dens <- density(R_times[R_times < quantile(R_times, 0.95)])
xlims <- range(c(cpp_time_dens$x, cpp2_time_dens$x, R_time_dens$x))
ylims <- range(c(cpp_time_dens$y, cpp2_time_dens$y, R_time_dens$y))
ylims <- ylims * c(1, 1.15)
cols <- c("#0072b2", "#f0e442", "#d55e00")
cols <- c("#e69f00", "#56b4e9", "#009e73")
labs <- c("C++ original", "C++ improved", "R")
plot(cpp_time_dens, col = cols[1], xlim = xlims, ylim = ylims,
xlab = "Time (in nanoseconds)", ylab = "",
main = "Comparing C++ and R execution time")
lines(cpp2_time_dens, col = cols[2])
lines(R_time_dens, col = cols[3])
legend("topleft", col = cols, bty = "n", lty = 1, legend = labs, horiz = TRUE)
The following example is for anyone who is building a Cox Proportional Hazards models and trying to produce prediction error curves, but get an error stating:
Error in coxModelFrame.coxph(object) : invalid object
set x=TRUE in the call to coxph.
Here is the code to reproduce the error:
LIBRARIES
library(survival)
library(survminer)
library(pec)
library(Hmisc)
library(rms)
library(riskRegression)
#install.packages("doMC", repos="http://R-Forge.R-project.org")
library(doMC)
The Data
#Load and store the data
lcOrig <- read.csv("cancer.csv")
#Replace all the 1's with 0's (censored)
lcOrig$status <- gsub(pattern = "1", replacement = "0", x = lcOrig$status, fixed = TRUE)
#Replace all the 2's with 1's (death)
lcOrig$status <- gsub (pattern = "2", replacement = "1", x = lcOrig$status, fixed = TRUE)
#Do the same thing for sex (0 = Males, 1 = Females)
lcOrig$sex <- gsub(pattern = "1", replacement = "0", x = lcOrig$sex, fixed = TRUE)
lcOrig$sex <- gsub(pattern = "2", replacement = "1", x = lcOrig$sex, fixed = TRUE)
#Change the class of these variables to integer.
lcOrig$status <- as.integer(lcOrig$status)
lcOrig$sex <- as.integer(lcOrig$sex)
lcOrig$ph.ecog <- as.integer(lcOrig$ph.ecog)
#Remove missing values and column with over 20% missing data.
apply(lcOrig, 2, function(x) sum(is.na(x))/length(x))
lcOrig <- lcOrig[, c(1:9, 11)]
lc <- lcOrig[complete.cases(lcOrig), ]
Cox Proportional Hazards
fitform1 <- Surv(time, status) ~ inst + age + sex + ph.ecog + ph.karno + pat.karno + wt.loss
cox1 <- coxph(fitform1, data = lc)
PREDICTION ERROR CURVES
extends <- function(...) TRUE
library("doMC")
registerDoMC()
set.seed(0692)
fitpec1 <- pec(list("CPH" = cox1), data = lc, formula = fitform1, splitMethod = "cv10", B = 5, keep.index = TRUE, keep.matrix = TRUE)
The last line of code results in the following error:
Error in coxModelFrame.coxph(object) : invalid object
set x=TRUE in the call to coxph
SOLUTION
Change:
cox1 <- coxph(fitform1, data = lc)
To:
cox1 <- coxph(fitform1, data = lc, x = TRUE)
This did not use to be a requirement 2 years ago, but is now. I hope this helps save you some time!
I'm using GNU Screen to send code from Vim to R with the following .vimrc configuration:
map <C-L> "kyy:echo system("screen -S $STY -p R -X stuff ".shellescape(#k))<CR>j
vmap <C-L> "xy:echo system("screen -S $STY -p R -X stuff ".shellescape(#x."\n"))<CR>j
(I use cntr-L to send code from Vim to my R window titled "R".)
I can run functions within the current screen,
exp_val <- function(roll){
vals <- numeric(length(roll))
vals[roll == "G"] <- 1/6
vals[roll == "Y"] <- 1/3
vals[roll == "R"] <- 1/2
sum(vals)
}
prob_choice <- function(roll, ng = 6, ny = 4, nr = 3){
tot <- ng + ny + nr
i <- c(ng,ny,nr)
col <- c("G","Y","R")
p1 <- i[col==roll[1]]/tot
i[col==roll[1]] <- i[col==roll[1]] - 1
p2 <- i[col==roll[2]]/(tot-1)
i[col==roll[2]] <- i[col==roll[2]] - 1
p3 <- i[col==roll[3]]/(tot-2)
p1*p2*p3
}
but when I try to scroll down and highlight another function that extends below the original screen view
single_round <- function(sampsp,probs=rep(1/nrow(sampsp),nrow(sampsp))){
r <- sample(1:nrow(sampsp),1,prob = probs)
dice <- sampsp[r,]
roll <- numeric(3)
i <- which(dice=="G"); roll[i] <- sample(c("B","S","F"),1,prob = c(1/2,1/6,1/3))
j <- which(dice=="Y"); roll[j] <- sample(c("B","S","F"),1,prob = c(1/3,1/3,1/3))
k <- which(dice=="R"); roll[k] <- sample(c("B","S","F"),1,prob = c(1/6,1/2,1/3))
num_dice <- c(length(i),length(j),length(k))
outcome <- c(sum(roll=="B"),sum(roll=="S"),sum(roll=="F"))
if (is.null(roll[i])){roll[i] <- 0}
if (is.null(roll[j])){roll[j] <- 0}
if (is.null(roll[k])){roll[k] <- 0}
feet <- c(sum(roll[i]=="F"),sum(roll[j]=="F"),sum(roll[k]=="F")) # to keep track of color of footprints
list(num_dice,outcome,feet)
}
I get the following error
X: stuff: one or two arguments required
Is there anything I can change in my .vimrc so I can highlight and execute a large part of the script that extends beyond the original screen view?
I am trying to implement a Weibull proportional hazards model with a cure fraction following the approach outlined by Hui, Ibrahim and Sinha (1999) - A New Bayesian Model for Survival Data with a Surviving Fraction. However, I am not sure if it is possible to define a random limit for a looping in JAGS.
library(R2OpenBUGS)
library(rjags)
set.seed(1234)
censored <- c(1, 1)
time_mod <- c(NA, NA)
time_cens <- c(5, 7)
tau <- 4
design_matrix <- rbind(c(1, 0, 0, 0), c(1, 0.2, 0.2, 0.04))
jfun <- function() {
for(i in 1:nobs) {
censored[i] ~ dinterval(time_mod[i], time_cens[i])
time_mod[i] <- ifelse(N[i] == 0, tau, min(Z))
for (k in 1:N[i]){
Z[k] ~ dweib(1, 1)
}
N[i] ~ dpois(fc[i])
fc[i] <- exp(inprod(design_matrix[i, ], beta))
}
beta[1] ~ dnorm(0, 10)
beta[2] ~ dnorm(0, 10)
beta[3] ~ dnorm(0, 10)
beta[4] ~ dnorm(0, 10)
}
inits <- function() {
time_init <- rep(NA, length(time_mod))
time_init[which(!status)] <- time_cens[which(!status)] + 1
out <- list(beta = rnorm(4, 0, 10),
time_mod = time_init,
N = rpois(length(time_mod), 5))
return(out)
}
data_base <- list('time_mod' = time_mod, 'time_cens' = time_cens,
'censored' = censored, 'design_matrix' = design_matrix,
'tau' = tau,
'nobs' = length(time_cens[!is.na(time_cens)]))
tc1 <- textConnection("jmod", "w")
write.model(jfun, tc1)
close(tc1)
# Calling JAGS
tc2 <- textConnection(jmod)
j <- jags.model(tc2,
data = data_base,
inits = inits(),
n.chains = 1,
n.adapt = 1000)
I observed the below error:
Error in jags.model(tc2, data = data_base, inits = inits(), n.chains = 1, :
RUNTIME ERROR:
Compilation error on line 6.
Unknown variable N
Either supply values for this variable with the data
or define it on the left hand side of a relation.
I am not entirely certain, but I am pretty sure that you cannot declare a random number of nodes in BUGS in general, so it would not be a specific JAGS' quirk.
Nevertheless, you can get a way around that.
Since BUGS is a declarative language instead of a procedural one, it is enough to declare an arbitrary but deterministic number of nodes (let's say "large enough") and then associate only a random number of them with a distribution and with observed data, leaving the remaining nodes deterministic.
Once you have observed the maximum value of N[i] (let's say N.max), you can pass it as a parameter to JAGS and then change this code of yours:
for (k in 1:N[i]){
Z[k] ~ dweib(1, 1)
}
into this:
for (k in 1:N.max){
if (k <= N[i]){
Z[k] ~ dweib(1, 1)
} else {
Z[k] <- 0
}
}
I hope this will do the trick in your case. So please give feedback latter about it.
Needless to say, if you have some non-zero, observed data associated to a deterministic Z[k], then all hell breaks loose inside Jags...