How can I plot multiple ROC curves together? - modeling

I want to find some good predictors (genes). This is my data, log transformed RNA-seq:
          TRG    CDK6   EGFR   KIF2C  CDC20
Sample 1  TRG12  11.39  10.62   9.75  10.34
Sample 2  TRG12  10.16   8.63   8.68   9.08
Sample 3  TRG12   9.29  10.24   9.89  10.11
Sample 4  TRG45  11.53   9.22   9.35   9.13
Sample 5  TRG45   8.35  10.62  10.25  10.01
Sample 6  TRG45  11.71  10.43   8.87   9.44
I have calculated confusion matrices for different models, as follows.
1- I tested each of 23 genes individually with the code below, and every gene that gave a p-value < 0.05 was kept as a good predictor. For example, for CDK6 I have done
glm=glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
Finally I obtained five genes and I put them in this model:
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
I want a plot like this, showing the ROC curve of each model, but I don't know how to do that.
Any help, please?

I will give you an answer using the pROC package. Disclaimer: I am the author and maintainer of the package. There are alternative ways to do it.
The plot you are seeing was probably generated by the ggroc function of pROC. In order to generate such a plot from glm models, you need to 1) use the predict function to generate the predictions, 2) generate the ROC curves and store them in a list, preferably named to get a legend automatically, and 3) call ggroc.
glm.cdk6 <- glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
rocs <- list()
library(pROC)
rocs[["CDK6"]] <- roc(df$TRG, predict(glm.cdk6))
rocs[["final"]] <- roc(df$TRG, predict(final))
ggroc(rocs)
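To make concrete what each curve in such a plot summarises, here is a minimal Python sketch (not part of pROC) that computes the area under the ROC curve for two single-gene predictors from the six samples in the question. The helper name `auc` and the choice to treat higher expression as evidence for TRG45 are assumptions made for illustration; pROC picks the direction of comparison itself.

```python
def auc(scores, labels, positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The six samples from the question (TRG label and two of the genes).
labels = ["TRG12", "TRG12", "TRG12", "TRG45", "TRG45", "TRG45"]
genes = {
    "CDK6": [11.39, 10.16, 9.29, 11.53, 8.35, 11.71],
    "EGFR": [10.62, 8.63, 10.24, 9.22, 10.62, 10.43],
}

# One AUC per predictor, stored by name like the named list of ROC curves.
aucs = {gene: auc(values, labels, "TRG45") for gene, values in genes.items()}
for gene, value in aucs.items():
    print(gene, round(value, 3))
```

With only six samples these numbers are very coarse; the point is only that every entry of the named list corresponds to one predictor (or one fitted model) scored against the same labels.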

Related

Python - export the final random forests tree for Graphviz

I have Python code with a decision tree and random forests. The decision tree finds the biggest contributor using:
contr = decisiontree.feature_importances_.max() * 100
contr_full = decisiontree.feature_importances_ * 100
#Showing name
location = pd.to_numeric(np.where(contr_full == contr)[0][0])
result = list(df_dmy)[location + 1]
This returns the biggest contributor in my dataset and is then exported to a Graphviz format using:
tree.export_graphviz(rpart, out_file=path_file + '\\Decision Tree Code for Graphviz.dot', filled=True,
feature_names=list(df_dmy.drop(['Reason of Removal'], axis=1).columns),
impurity=False, label=None, proportion=True,
class_names=['Unscheduled', 'Scheduled'], rounded=True)
In the case of random forests, I have managed to export every tree that is used there (100 trees):
i = 0
for tree_data in rf.estimators_:
    with open('tree_' + str(i) + '.dot', 'w') as my_file:
        my_file = tree.export_graphviz(tree_data , out_file = my_file)
    i = i + 1
This, of course, generates 100 .dot files with the different trees. However, not every tree contains the information that is needed, since some trees show a different result. I know the biggest contributor of the classifier, but I also want to see a decision tree that gives that result.
What I tried was:
i= 0
for tree_data in rf.estimators_:
    #Feature importance
    df_trees = tree_data.tree_.threshold
    contr = df_trees.max() * 100
    contr_full = df_trees * 100
    #Showing name
    location = pd.to_numeric(np.where(contr_full == contr)[0][0])
    result = print(list(df_dmy)[location + 1])
Using this, I get the error:
IndexError: list index out of range
and I have no idea what is wrong here.
I wanted a dataframe of biggest contributors together with their contributing factors in order to filter this to the actual biggest contributor and biggest contribution. See example:
Result (in a dataframe) =
Result Contribution
0 Car 0.74
1 Bike 0.71
2 Car 0.79
Python knows already that the result from random forests gave 'car' as the biggest contributor, the first filter is to remove everything except 'car':
Result Contribution
0 Car 0.74
2 Car 0.79
Then it has to search for the highest contribution and retrieve the index.
Result Contribution
2 Car 0.79
Then it has to export the tree information corresponding to that index.
I know it is quite a long story, but I hope someone knows how to finish this code.
Regards, Ganesh
names = []
contributors = []
df = pd.DataFrame(columns=['Parameter', 'Value'])
for tree_data in rf.estimators_:
    #Feature importance
    df_trees = tree_data.tree_.threshold
    contr = tree_data.feature_importances_.max() * 100
    contr_full = tree_data.feature_importances_ * 100
    contr_location = pd.to_numeric(np.where(contr_full == contr)[0][0])
    names.append(list(titanic_dmy.columns)[contr_location + 1])
    contributors.append(contr)
df['Parameter']=np.array(names)
df['Value']=np.array(contributors)
idx = df.index[df['Value'] == df['Value'].loc[df['Value'].idxmax()]].tolist()[0]
#Export to Graphviz
tree.export_graphviz(rf.estimators_[idx], out_file=path_file + '\\RF Decision Tree for Graphviz.dot',
                     filled=True, max_depth=graphviz_leafs, feature_names=list(titanic_dmy.drop(['survived'],
                     axis=1).columns), impurity=False, label=None, proportion=True,
                     class_names=['Unscheduled', 'Scheduled'], rounded=True, precision=2)
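The filter-then-pick-the-maximum logic described in the question can be sketched without scikit-learn. In the hedged example below, `tree_importances` is hypothetical data standing in for the `feature_importances_` of each estimator in `rf.estimators_`, and `forest_top` is an assumed name for the forest's overall biggest contributor. Note that `tree_.threshold` holds one value per tree node rather than one per feature, which would explain the IndexError in the earlier attempt.

```python
# Hypothetical per-tree importances: one list per tree, one value per feature.
feature_names = ["Car", "Bike"]
tree_importances = [
    [0.74, 0.26],
    [0.29, 0.71],
    [0.79, 0.21],
]
forest_top = "Car"  # assumed: biggest contributor reported by the whole forest

# Build (tree index, top feature, contribution) for every tree.
rows = []
for idx, importances in enumerate(tree_importances):
    top = max(range(len(importances)), key=importances.__getitem__)
    rows.append((idx, feature_names[top], importances[top]))

# Keep only trees that agree with the forest, then take the largest contribution.
matching = [row for row in rows if row[1] == forest_top]
best_idx, best_name, best_value = max(matching, key=lambda row: row[2])
print(best_idx, best_name, best_value)
```

The resulting `best_idx` is the position of the tree to export, playing the same role as `idx` in the code above.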

p-values for estimates in flexsurvreg

I fitted a survival model using an inverse Weibull distribution in flexsurvreg:
if (require("actuar")){
  invweibull <- list(name="invweibull",
                     pars=c("shape","scale"),
                     location="scale",
                     transforms=c(log, log),
                     inv.transforms=c(exp, exp),
                     inits=function(t){ c(1, median(t)) })
  invweibull <- flexsurvreg(formula = kpnsurv~iaas, data = kpnrs2,
                            dist=invweibull)
  invweibull
}
And I got the following output:
Call:
flexsurvreg(formula = kpnsurv ~ iaas, data = kpnrs2, dist = invweibull)
Estimates:
      data mean       est      L95%      U95%        se  exp(est)      L95%      U95%
shape        NA    0.4870    0.4002    0.5927    0.0488        NA        NA        NA
scale        NA   62.6297   36.6327  107.0758   17.1371        NA        NA        NA
iaas     0.4470   -0.6764   -1.2138   -0.1391    0.2742    0.5084    0.2971    0.8701
N = 302, Events: 54, Censored: 248
Total time at risk: 4279
Log-likelihood = -286.7507, df = 3
AIC = 579.5015
How can I get the p-value of the covariate estimate (in this case iaas)? Thank you for your help.
Just in case this is still useful to anyone, this worked for me. First extract the matrix of coefficient information from the model:
invweibull.res <- invweibull$res
Then divide the estimated coefficients (column est) by their standard errors (column se) to calculate the Wald statistics, which are asymptotically standard normal under the null hypothesis that the coefficient is zero:
invweibull.wald <- invweibull.res[,1]/invweibull.res[,4]
Finally, get the p-values:
invweibull.p <- 2*pnorm(-abs(invweibull.wald))
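As a quick cross-check of the same arithmetic outside R, the sketch below recomputes the Wald statistic and two-sided p-value for iaas from the estimate and standard error printed in the question. It uses only Python's standard library; `math.erfc(abs(z) / sqrt(2))` is the standard-normal equivalent of `2*pnorm(-abs(z))`.

```python
import math

# Estimate and standard error of iaas, copied from the output in the question.
est, se = -0.6764, 0.2742

z = est / se                           # Wald statistic
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under N(0, 1)

print(round(z, 3), round(p, 4))
```

The test is only meaningful for covariate effects such as iaas; for shape and scale, a null value of zero is not a sensible hypothesis.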

Nested Random effect in JAGS/ WinBUGS

I am interested in fitting the following nested random effect model in JAGS.
SAS code
proc nlmixed data=data1 qpoints=20;
parms beta0=2 beta1=1 ;
bounds vara >=0, varb_a >=0;
eta = beta0+ beta1*t+ b2+b3;
p = exp(eta)/(1+exp(eta));
model TestResult ~ binary(p);
random b2 ~ normal(0,vara) subject = HHcode;
random b3 ~ normal(0,varb_a) subject = IDNo_N(HHcode);
run;
My question: How do I specify the random effect part?
I have repeated measurements on individuals. These individuals are further nested in households. Note: The number of individuals per household varies!
Looking forward to hearing from you
Let's assume that we have two vectors which indicate which house and which individual a data point belongs to (these are things you will need to create, in R you can make these by changing a factor to numeric via as.numeric). So, if we have 10 data points from 2 houses and 5 individuals they would look like this.
house_vec = c(1,1,1,1,1,1,2,2,2,2) # 6 points for house 1, 4 for house 2
ind_vec = c(1,1,2,2,3,3,4,4,5,5) # everyone has two observations
N = 10 # number of data points
So, the above vectors tell us that there are 3 individuals in the first house (because the first 6 elements of house_vec are 1 and the first 6 elements of ind_vec range from 1 to 3) and the second house has 2 individuals (last 4 elements of house_vec are 2 and the last 4 elements of ind_vec are 4 and 5). With these vectors, we can do nested indexing in JAGS to create your random effect structure. Something like this would suffice. These vectors would be supplied in the data.list that you have to include with TestResult
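# JAGS rejects a node that is defined more than once, so loop over units, not data points.
# Assumed extra data: n_house = 2, n_ind = 5, house_of_ind = c(1,1,1,2,2) (house of each individual)
for(h in 1:n_house){
  mu_house[h] ~ dnorm(0, taua)
}
for(j in 1:n_ind){
  mu_ind[j] ~ dnorm(mu_house[house_of_ind[j]], taub_a)
}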
# priors
taua ~ dgamma(0.01, 0.01) # precision
sda <- 1 / sqrt(taua) # derived standard deviation
taub_a ~ dgamma(0.01, 0.01) # precision
sdb_a <- 1 / sqrt(taub_a) # derived standard deviation
You would only need to include mu_ind within the linear predictor, as it is informed by mu_house. So the rest of the model would look like.
for(i in 1:N){
  logit(p[i]) <- beta0 + beta1 * t[i] + mu_ind[ind_vec[i]]
  TestResult[i] ~ dbern(p[i])
}
You would then need to set priors for beta0 and beta1
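The index vectors described above can be built from raw identifiers in any language. The following Python sketch is one way to do it, assuming hypothetical household codes and individual IDs (`hh_codes`, `id_nos`) with one entry per observation. Individuals are keyed on the (household, individual) pair so that IDs reused across households stay distinct. It also derives a per-individual household lookup, which is handy if each random effect is defined once per unit rather than once per data point.

```python
# Hypothetical raw identifiers, one entry per observation.
hh_codes = ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"]
id_nos = ["a1", "a1", "a2", "a2", "a3", "a3", "b1", "b1", "b2", "b2"]

def to_index(keys):
    """Map keys to 1-based integers in order of first appearance."""
    lookup = {}
    indices = [lookup.setdefault(key, len(lookup) + 1) for key in keys]
    return indices, lookup

house_vec, house_lookup = to_index(hh_codes)
ind_vec, ind_lookup = to_index(list(zip(hh_codes, id_nos)))

# Household index of each individual (length = number of individuals).
house_of_ind = [0] * len(ind_lookup)
for house, ind in zip(house_vec, ind_vec):
    house_of_ind[ind - 1] = house

print(house_vec)
print(ind_vec)
print(house_of_ind)
```

The three lists, together with their lengths, would then be passed to JAGS as data alongside TestResult and t.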

Python Index and Bounds error using data set

Our class is using Python as a solution tool for models. However, this is my first time with Python or any programming language since VB in 1997, so I'm struggling. We have the following code provided to us.
from numpy import loadtxt, array, ones, column_stack
from numpy import dot, sqrt
from scipy.linalg import inv
from scipy.stats import norm, t
f = loadtxt('text data.raw')
y = f[:,4]
n = y.size
x = array([f[:,2],f[:,8],f[:,4]])
one = ones(n)
#xa = column_stack([one,f[:,3],f[:,4]])
xa = column_stack([one,x.T])
k = xa.shape[1]
xx = dot(xa.T,xa)
invx = inv(xx)
xy = dot(xa.T,y)
b = dot(invx,xy)
# Compute cov(b)
e = y - dot(xa,b)
s2 = dot(e.T,e)/(n-k)
covb = invx*s2
# Compute t-stat
tstat = b[1]/sqrt(covb[1][1])
#compute p-value
p = 1 - norm.cdf(tstat,0,1)
pt = 1 - t.cdf(tstat,88)
Our data set is a 10x88 matrix. Our goal is to create a linear program and find a few answers. In our data, column 1 is already set to price, which is the desired output of our linear program, and I need to use columns 3, 4, and 5 as my x1, x2, and x3. I'm not sure what the values on lines 9 and 11 need to be changed to in order to accomplish that, nor do I currently understand what those two lines are calling for or doing in the program. Again, I'm not familiar with programming.
Everything I try generally yields an error similar to
IndexError: index 5 is out of bounds for axis 1 with size 5
Any suggestions?
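A likely source of this kind of error is that Python counts columns from 0, so a table with 5 columns has valid indices 0 to 4 and "column 5" is index 4. The sketch below illustrates the mapping with plain lists and made-up numbers so that it has no dependencies; the helper `column` is hypothetical. With the NumPy array from the posted code, the equivalent selections would be `f[:, 0]` for the price and `f[:, [2, 3, 4]]` for columns 3, 4 and 5.

```python
# Hypothetical data: 3 rows, 5 columns, with price in column 1.
rows = [
    [300.0, 1.0, 4.0, 6126.0, 2438.0],
    [370.0, 2.0, 3.0, 9903.0, 2076.0],
    [191.0, 3.0, 3.0, 5200.0, 1374.0],
]

def column(data, number):
    """Return the 1-based column `number`, i.e. zero-based index number - 1."""
    return [row[number - 1] for row in data]

y = column(rows, 1)                       # price, zero-based index 0
x = [column(rows, c) for c in (3, 4, 5)]  # x1, x2, x3, zero-based indices 2, 3, 4

n_cols = len(rows[0])
valid_indices = list(range(n_cols))       # index 5 would be out of bounds here
print(y, x, valid_indices)
```

In the posted code, `y = f[:,4]` also appears among the columns stacked into `x`, so the response would be regressed on itself unless one of the two is changed.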

How to find variability of a set of Cartesian Points (xyz) or fitting/distance to 3D line and/or plane?

So I was looking at this question:
Matlab - Standard Deviation of Cartesian Points
Which basically answers my question, except the problem is I have xyz, not xy. So I don't think Ax=b would work in this case.
I have, say, 10 Cartesian points, and I want to be able to find the standard deviation of these points. Now, I don't want standard deviation of each X, Y and Z (as a result of 3 sets) but I just want to get one number.
This can be done using MATLAB or excel.
To better understand what I'm doing, I have this desired point (1,2,3) and I recorded (1.1,2.1,2.9), (1.2,1.9,3.1) and so on. I wanted to be able to find the variability of all the recorded points.
I'm open for any other suggestions.
If you do the same thing as in the other answer you linked, it should work.
x_vals = xyz(:,1);
y_vals = xyz(:,2);
z_vals = xyz(:,3);
then make A with 3 columns,
A = [x_vals y_vals ones(size(x_vals))];
and
b = z_vals;
Then
sol=A\b;
m = sol(1);
n = sol(2);
c = sol(3);
and then
errs = (m*x_vals + n*y_vals + c) - z_vals;
After that you can use errs just as in the linked question.
Randomly clustered data
If your data is not expected to be near a line or a plane, just compute the distance of each point to the centroid:
xyz_bar = mean(xyz);
M = bsxfun(@minus,xyz,xyz_bar);
d = sqrt(sum(M.^2,2)); % distances to centroid
Then you can compute variability any way you like. For example, standard deviation and RMS error:
std(d)
sqrt(mean(d.^2))
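For readers without MATLAB, the same single-number summary can be sketched in Python with the standard library. The example uses the desired point and the two recorded points quoted in the question, and shows both reference choices: distance to the desired point (an alternative to the centroid when the target is known) and distance to the centroid as in the code above.

```python
import math

target = (1.0, 2.0, 3.0)                        # desired point from the question
recorded = [(1.1, 2.1, 2.9), (1.2, 1.9, 3.1)]   # recorded points quoted in the question

# Distances to the known target, summarised as an RMS error.
dists = [math.dist(p, target) for p in recorded]
rms = math.sqrt(sum(d * d for d in dists) / len(dists))

# Distances to the centroid of the recorded points instead.
centroid = tuple(sum(axis) / len(recorded) for axis in zip(*recorded))
centroid_dists = [math.dist(p, centroid) for p in recorded]

print(rms, centroid_dists)
```

Measuring against the target captures both bias and scatter, while measuring against the centroid captures scatter only.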
Data about a 3D line
If the data points are expected to be roughly along the path of a line, with some deviation from it, you might look at the distance to a best fit line. First, fit a 3D line to your points. One way is using the following parametric form of a 3D line:
x = a*t + x0
y = b*t + y0
z = c*t + z0
Generate some test data, with noise:
abc = [2 3 1]; xyz0 = [6 12 3];
t = 0:0.1:10;
xyz = bsxfun(@plus,bsxfun(@times,abc,t.'),xyz0) + 0.5*randn(numel(t),3)
plot3(xyz(:,1),xyz(:,2),xyz(:,3),'*') % to visualize
Estimate the 3D line parameters:
xyz_bar = mean(xyz) % centroid is on the line
M = bsxfun(@minus,xyz,xyz_bar); % remove mean
[~,S,V] = svd(M,0)
abc_est = V(:,1).'
abc/norm(abc) % compare actual slope coefficients
Distance from points to a 3D line:
pointCentroidSeg = bsxfun(@minus,xyz_bar,xyz);
pointCross = cross(pointCentroidSeg, repmat(abc_est,size(xyz,1),1));
errs = sqrt(sum(pointCross.^2,2))
Now you have the distance from each point to the fit line ("error" of each point). You can compute the mean, RMS, standard deviation, etc.:
>> std(errs)
ans =
0.3232
>> sqrt(mean(errs.^2))
ans =
0.7017
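The cross-product step can be checked on a case small enough to verify by hand. This Python sketch uses hypothetical helper names, `cross` and `distance_to_line`, and computes the distance from a point to a line given by a point on the line and a direction. It normalises the direction explicitly, whereas the MATLAB code relies on the column of V returned by svd already having unit length.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def distance_to_line(point, line_point, direction):
    """Norm of (point - line_point) x unit(direction)."""
    norm = math.sqrt(sum(c * c for c in direction))
    unit = tuple(c / norm for c in direction)
    diff = tuple(p - q for p, q in zip(point, line_point))
    return math.sqrt(sum(c * c for c in cross(diff, unit)))

# Line along the x-axis; the point (0, 3, 4) lies 5 units away from it.
d = distance_to_line((0.0, 3.0, 4.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
print(d)
```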
Data about a 3D plane
See David's answer.
