Latent variable with Gaussian mixture model to impute the missing data - gaussian

I'm currently trying to impute the missing data through Gaussian mixture model.
My reference paper is from here:
http://mlg.eng.cam.ac.uk/zoubin/papers/nips93.pdf
I currently focus on bivariate dataset with 2 Gaussian components.
This is the code to define the weight for each Gaussian component:
myData = faithful[,1:2]; # the data matrix
for (i in (1:N)) {
prob1 = pi1*dmvnorm(na.exclude(myData[,1:2]),m1,Sigma1); # probabilities of sample points under model 1
prob2 = pi2*dmvnorm(na.exclude(myData[,1:2]),m2,Sigma2); # same for model 2
Z<-rbinom(no,1,prob1/(prob1 + prob2 )) # Z is latent variable as to assign each data point to the particular component
pi1<-rbeta(1,sum(Z)+1/2,no-sum(Z)+1/2)
if (pi1>1/2) {
pi1<-1-pi1
Z<-1-Z
}
}
This is my code to define the missing values:
> whichMissXY<-myData[ which(is.na(myData$waiting)),1:2]
> whichMissXY
eruptions waiting
11 1.833 NA
12 3.917 NA
13 4.200 NA
14 1.750 NA
15 4.700 NA
16 2.167 NA
17 1.750 NA
18 4.800 NA
19 1.600 NA
20 4.250 NA
My constraint is, how to impute the missing data in "waiting" variable based on particular component.
This code is my first attempt to impute the missing data using conditional mean imputation. I know, it is definitely in the wrong way. The outcome would not lie to the particular component and produce outlier.
miss.B2 <- which(is.na(myData$waiting))
for (i in miss.B2) {
myData[i, "waiting"] <- m1[2] + ((rho * sqrt(Sigma1[2,2]/Sigma1[1,1])) * (myData[i, "eruptions"] - m1[1] ) + rnorm(1,0,Sigma1[2,2]))
#print(miss.B[i,])
}
I would appreciate if someone could give any advice on how to improve the imputation technique that could work with latent/hidden variable through Gaussian mixture model.
Thank you in advance

This is a solution for one type of covariance structure.
devtools::install_github("alexwhitworth/emclustr")
library(emclustr)
data(faithful)
set.seed(23414L)
ff <- apply(faithful, 2, function(j) {
na_idx <- sample.int(length(j), 50, replace=F)
j[na_idx] <- NA
return(j)
})
ff2 <- em_clust_mvn_miss(ff, nclust=2)
# hmm... seems I don't return the imputed values.
# note to self to update the code
plot(faithful, col= ff2$mix_est)
And the parameter outputs
$it
[1] 27
$clust_prop
[1] 0.3955708 0.6044292
$clust_params
$clust_params[[1]]
$clust_params[[1]]$mu
[1] 2.146797 54.833431
$clust_params[[1]]$sigma
[1] 13.41944
$clust_params[[2]]
$clust_params[[2]]$mu
[1] 4.317408 80.398192
$clust_params[[2]]$sigma
[1] 13.71741

Related

Using APCluster on Pre-Classified Data - Coloring Dendograms and Nicely Formatted Output

Question 1:
I am trying to work with the plot() function on an AggExResult object and the clusters in the documentation (https://cran.r-project.org/web/packages/apcluster/apcluster.pdf) work as expected.
In my own data, I have an additional column in the input which provides a pre-defined “target” for classification purposes, and I am wondering if there is a way to have the dendogram labels highlighted by color (e.g. red=class 0, blue=class 1) with the class of the targets being factors (or characters). I am ultimately trying to visually display how many clusters contain "pure" vs. "mixed" classes. Here is some slightly modified code from the online documentation to show roughly what my input data looks like:
cl1Targ <- matrix(nrow=50,ncol=1)
for(c1t in 1:nrow(cl1Targ)){ cl1Targ[c1t] <- as.factor(0) }
cl2Targ <- matrix(nrow=50,ncol=1)
for(c2t in 1:nrow(cl2Targ)){ cl2Targ[c2t] <- as.factor(1) }
## create two Gaussian clouds
#cl1 <- cbind(rnorm(50,0.2,0.05),rnorm(50,0.8,0.06))
#cl2 <- cbind(rnorm(50,0.7,0.08),rnorm(50,0.3,0.05))
cl1 <- cbind(rnorm(50,0.2,0.05),rnorm(50,0.8,0.06),cl1Targ)
cl2 <- cbind(rnorm(50,0.7,0.08),rnorm(50,0.3,0.05),cl2Targ)
x <- rbind(cl1,cl2)
colnames(x) <- c('Column 1','Column 2','Class_ID')
## compute similarity matrix (negative squared Euclidean)
sim <- negDistMat(x, r=2)
## run affinity propagation
apres <- apcluster(sim, q=0.7)
## compute agglomerative clustering from scratch
aggres1 <- aggExCluster(sim)
## plot dendrogram
plot(aggres1, main='aggres1 w/ target') #
How would I color the dendogram by the target defined in the input?
Question 2:
When I show() the example data’s APResult, I see the following:
show(apres)
APResult object
Number of samples = 100
Number of iterations = 165
Input preference = -0.01281384
Sum of similarities = -0.1222309
Sum of preferences = -0.1409522
Net similarity = -0.2631832
Number of clusters = 11
Exemplars:
8 17 24 37 43 52 58 68 92 95 99
Clusters:
Cluster 1, exemplar 8:
7 8 9 25 31 36 39 42 47 48
Cluster 2, exemplar 17:
6 11 13 15 17 18 19 23 32 35
Cluster 3, exemplar 24:
2 5 10 24 45
When I use my own data, I see the following (the row.names, which are the drugs being clustered by gene expression mean fold change values)
show(apclr2q05_mean)
APResult object
Number of samples = 1045
Number of iterations = 429
Input preference = -390.0822
Sum of similarities = -89326.99
Sum of preferences = -83477.58
Net similarity = -172804.6
Number of clusters = 214
Exemplars:
amantadine_58mg6h_fc amiodarone_147mg3d_fc clarithromycin_56mg1d_fc fluconazole_394mg5d_fc ketoconazole_114mg5d_fc ketoconazole_2274mg1d_fc
pantoprazole_1100mg1d_fc pantoprazole_1100mg3d_fc quetiapine_500mg5d_fc roxithromycin_312mg5d_fc torsemide_3mg3d_fc acetazolamide_250mg3d_fc
Clusters:
Cluster 1, exemplar amantadine_58mg6h_fc:
amantadine_58mg6h_fc promazine_100mg1d_fc cyproteroneAcetate_2500mg6h_fc danazol_2g5d_fc ivermectin_7500ug1d_fc letrozole_250mg6h_fc
mefenamicAcid_93mg3d_fc olanzapine_23mg1d_fc secobarbital_20mg6h_fc zaleplon_100mg3d_fc
Cluster 2, exemplar amiodarone_147mg3d_fc:
amiodarone_147mg3d_fc amiodarone_147mg5d_fc aspirin_375mg5d_fc betaNapthoflavone_80mg5d_fc clofibrate_130mg3d_fc finasteride_800mg5d_fc
Cluster 3, exemplar clarithromycin_56mg1d_fc:
ciprofloxacin_72mg5d_fc ciprofloxacin_450mg6h_fc clarithromycin_56mg1d_fc clarithromycin_56mg3d_fc clarithromycin_56mg5d_fc
Cluster 4, exemplar fluconazole_394mg5d_fc:
fluconazole_394mg5d_fc
Also what I would expect in terms of content but I would like to format this for reporting purposes. I have tried to export this using dput() but I get a lot of extra unnecessary information in the output file. I am wondering how I might be able to export the same type of information from above along with the object name and target classifier mentioned above into a table that would look like the following (and add the name of the object to the output):
Name of object = apclr2q05_mean
Number of samples = 1045
Number of iterations = 429
Input preference = -390.0822
Sum of similarities = -89326.99
Sum of preferences = -83477.58
Net similarity = -172804.6
Number of clusters = 214
Exemplars: Target
amantadine_58mg6h_fc 1
amiodarone_147mg3d_fc 1
clarithromycin_56mg1d_fc 1
fluconazole_394mg5d_fc 0
ketoconazole_114mg5d_fc 0
ketoconazole_2274mg1d_fc 0
Clusters:
Cluster 1, exemplar amantadine_58mg6h_fc:
Drug Target
amantadine_58mg6h_fc 1
promazine_100mg1d_fc 1
cyproteroneAcetate_2500mg6h_fc 1
danazol_2g5d_fc 0
ivermectin_7500ug1d_fc 0
Cluster 2, exemplar amiodarone_147mg3d_fc:
Drug Target
Etc…
A big THANK YOU to Ulrich for his quick response to these questions by email and we wanted to share our discussion with the community so I will let him respond with his solution so that he gets the credit he deserves :-)
As an update, I tried to implement the answer to Question 1 and the sample code works as expected, but I am having trouble getting this to work on my data. The input data has two parts. The first is a matrix with the numeric measurement data including column and row labels:
> fci[1:3,1:3]
M30596_PROBE1 AI231309_PROBE1 NM_012489_PROBE1
amantadine_58mg1d_fc 0.05630744 -0.10441722 0.41873201
amantadine_58mg6h_fc -0.42780274 -0.26222322 0.02703001
amantadine_220mg1d_fc 0.35260779 -0.09902214 0.04067055
The second is the "target" values in Factor format, each of which corresponds to same row in fci above:
> targs[1:3]
amantadine_58mg1d_fc amantadine_58mg6h_fc amantadine_220mg1d_fc
0 0 0
Levels: 0 1
From here, the tree was built as below:
# build the AggExResult:
aglomr1 <- aggExCluster(negDistMat(r=2), fci)
# convert the data
tree <- as.dendrogram(aglomr1)
# assign the color codes
colorCodes <- c("0"="red", "1"="green")
names(targs) <- rownames(fci)
xColor <- colorCodes[as.character(targs)]
names(xColor) <- rownames(fci)
# plot the colored tree
labels_colors(tree) <- xColor[order.dendrogram(tree)]
plot(tree, main="Colored Tree")
The tree was generated but the leaves were not colored. Doing some digging:
> head(xColor)
0 0 0 0 0 0
"red" "red" "red" "red" "red" "red"
That part seems to work as expected in terms of the targets having the correct colors assigned, but the rownames are not in xColor, and the line labels_colors(tree) <- xColor[order.dendrogram(tree)] does not return similar labels, but rather what appear to be row numbers, or NAs:
> head(order.dendrogram(tree))
[1] "295" "929" "488" "493" "233" "235"
> head(labels_colors(tree))
295 929 488 493 233 235
> head(xColor[order.dendrogram(tree)])
<NA> <NA> <NA> <NA> <NA> <NA>
NA NA NA NA NA NA
How would I get the line labels_colors(tree) <- xColor[order.dendrogram(tree)] to behave in the same way as the example provided? Specifically, what I am trying to show is the leaf lables such as amantadine_58mg1d_fc being highlighted in the color that corresponds to the target (0/1).
Here is my answer to your Question 1: the plot() method for 'AggExResult' objects internally uses the plot.dendrogram() method. Since this method does not allow for coloring leaves of dendrograms, this will not work. However, there is the 'dendextend' package which offers such a functionality. (BTW, I found that solution in another thread: Label and color leaf dendrogram in r) Since 'apcluster' offers some casts to 'hclust' and 'dendrogram' objects, this package's functionality can be used more or less directly.
So, here is some sample code:
library(apcluster)
## create two Gaussian clouds along with class labels 0/1
cl1 <- cbind(rnorm(50, 0.2, 0.05), rnorm(50, 0.8, 0.06))
cl2 <- cbind(rnorm(50, 0.7, 0.08), rnorm(50, 0.3, 0.05))
x <- cbind(Columns=data.frame(rbind(cl1, cl2)),
"Class_ID"=factor(as.character(c(rep(0, 50), rep(1, 50)))))
## compute similarity matrix (negative squared Euclidean)
sim <- negDistMat(x[, 1:2], r=2)
## compute agglomerative clustering from scratch
aggres1 <- aggExCluster(sim)
## load 'dendextend' package
## install.packages("dendextend") ## if not yet installed
library(dendextend)
## convert object
tree <- as.dendrogram(aggres1)
## assign color codes
colorCodes <- c("0"="red", "1"="green")
xColor <- colorCodes[x$Class_ID]
names(xColor) <- rownames(x)
## plot color-labeled tree
labels_colors(tree) <- xColor[order.dendrogram(tree)]
plot(tree)
Here is my answer to your Question 2: Sorry, no such functionality is implemented in the 'apcluster' package. And since this is quite a special request, I am reluctant to include it the package (let alone the fact that show() methods cannot have additional arguments). So, alternatively, I want to provide you with a custom function that allows for labeling/grouping exemplars and samples:
library(apcluster)
## create two Gaussian clouds along with class labels 0/1
cl1 <- cbind(rnorm(50, 0.2, 0.05), rnorm(50, 0.8, 0.06))
cl2 <- cbind(rnorm(50, 0.7, 0.08), rnorm(50, 0.3, 0.05))
x <- cbind(Columns=data.frame(rbind(cl1, cl2)),
"Class_ID"=factor(as.character(c(rep(0, 50), rep(1, 50)))))
## compute similarity matrix (negative squared Euclidean)
sim <- negDistMat(x[, 1:2], r=2)
## special show() function with labeled data
show.ExClust.labeled <- function(object, labels=NULL)
{
if (!is(object, "ExClust"))
stop("'object' is not of class 'ExClust'")
if (is.null(labels))
{
show(object)
return(invisible(NULL))
}
cat("\n", class(object), " object\n", sep="")
if (!is.finite(object#l) || !is.finite(object#it))
stop("object is not result of an affinity propagation run; ",
"it is pointless to create 'APResult' objects yourself.")
cat("\nNumber of samples = ", object#l, "\n")
if (length(object#sel) > 0)
{
cat("Number of sel samples = ", length(object#sel),
paste(" (", round(100*length(object#sel)/object#l,1),
"%)\n", sep=""))
cat("Number of sweeps = ", object#sweeps, "\n")
}
cat("Number of iterations = ", object#it, "\n")
cat("Input preference = ", object#p, "\n")
cat("Sum of similarities = ", object#dpsim, "\n")
cat("Sum of preferences = ", object#expref, "\n")
cat("Net similarity = ", object#netsim, "\n")
cat("Number of clusters = ", length(object#exemplars), "\n\n")
if (length(object#exemplars) > 0)
{
if (length(names(object#exemplars)) == 0)
{
cat("Exemplars:\n")
df <- data.frame("Sample"=object#exemplars,
Label=labels[object#exemplars])
print(df, row.names=FALSE)
for (i in 1:length(object#exemplars))
{
cat("\nCluster ", i, ", exemplar ",
object#exemplars[i], ":\n", sep="")
df <- data.frame(Sample=object#clusters[[i]],
Label=labels[object#clusters[[i]]])
print(df, row.names=FALSE)
}
}
else
{
df <- data.frame("Exemplars"=names(object#exemplars),
Label=labels[names(object#exemplars)])
print(df, row.names=FALSE)
for (i in 1:length(object#exemplars))
{
cat("\nCluster ", i, ", exemplar ",
names(object#exemplars)[i], ":\n", sep="")
df <- data.frame(Sample=names(object#clusters[[i]]),
Label=labels[names(object#clusters[[i]])])
print(df, row.names=FALSE)
}
}
}
else
{
cat("No clusters identified.\n")
}
}
## create label vector (with proper names)
label <- x$Class_ID
names(label) <- rownames(x)
## run apcluster()
apres <- apcluster(sim, q=0.3)
## show with labels
show.ExClust.labeled(apres, label)

p-values for estimates in flexsurvreg

I fitted a survival model using an inverse weibull distribution in flexsurvreg:
if (require("actuar")){
invweibull <- list(name="invweibull",
pars=c("shape","scale"),
location="scale",
transforms=c(log, log),
inv.transforms=c(exp, exp),
inits=function(t){ c(1, median(t)) })
invweibull <- flexsurvreg(formula = kpnsurv~iaas, data = kpnrs2,
dist=invweibull)
invweibull
}
And I got the following output:
Call:
flexsurvreg(formula = kpnsurv ~ iaas, data = kpnrs2, dist = invweibull)
Estimates:
data. mean. est L95% U95% se exp(est) L95% U95%
shape NA 0.4870 0.4002 0.5927 0.0488 NA NA NA
scale NA 62.6297 36.6327 107.0758 17.1371 NA NA NA
iaas 0.4470 -0.6764 -1.2138 -0.1391 0.2742 0.5084 0.2971 0.8701
N = 302, Events: 54, Censored: 248
Total time at risk: 4279
Log-likelihood = -286.7507, df = 3
AIC = 579.5015
How can I get the p-value of the covariate estimate (in this case iaas)? Thank you for your help.
Just in case this is still useful to anyone, this worked for me. First extract the matrix of coefficient information from the model:
invweibull.res <- invweibull$res
Then divide the estimated coefficients by their standard errors to calculate the Wald statistics, which have asymptotic standard normal distributions:
invweibull.wald <- invweibull.res[,1]/invweibull.res[,4]
Finally, get the p-values:
invweibull.p <- 2*pnorm(-abs(invweibull.wald))

Nested Random effect in JAGS/ WinBUGS

I am interested in fitting the following nested random effect model in JAGS.
SAS code
proc nlmixed data=data1 qpoints=20;
parms beta0=2 beta1=1 ;
bounds vara >=0, varb_a >=0;
eta = beta0+ beta1*t+ b2+b3;
p = exp(eta)/(1+exp(eta));
model TestResult ~ binary(p);
random b2 ~ normal(0,vara) subject = HHcode;
random b3 ~ normal(0,varb_a) subject = IDNo_N(HHcode);
run;
My question: How to specify the random effect part?
I have repeated measurements on individuals. These individuals are further nested in the household. Note: The number of individuals per household vary!
Looking forward to hearing from you
Let's assume that we have two vectors which indicate which house and which individual a data point belongs to (these are things you will need to create, in R you can make these by changing a factor to numeric via as.numeric). So, if we have 10 data points from 2 houses and 5 individuals they would look like this.
house_vec = c(1,1,1,1,1,1,2,2,2,2) # 6 points for house 1, 4 for house 2
ind_vec = c(1,1,2,2,3,3,4,4,5,5) # everyone has two observations
N = 10 # number of data points
So, the above vectors tell us that there are 3 individuals in the first house (because the first 6 elements of house_vec are 1 and the first 6 elements of ind_vec range from 1 to 3) and the second house has 2 individuals (last 4 elements of house_vec are 2 and the last 4 elements of ind_vec are 4 and 5). With these vectors, we can do nested indexing in JAGS to create your random effect structure. Something like this would suffice. These vectors would be supplied in the data.list that you have to include with TestResult
for(i in 1:N){
mu_house[house_vec[i]] ~ dnorm(0, taua)
mu_ind[ind_vec[i]] ~ dnorm(mu_house[house_vec[i]], taub_a)
}
# priors
taua ~ dgamma(0.01, 0.01) # precision
sda <- 1 / sqrt(taua) # derived standard deviation
taub_a ~ dgamma(0.01, 0.01) # precision
sdb_a <- 1 / sqrt(taub_a) # derived standard deviation
You would only need to include mu_ind within the linear predictor, as it is informed by mu_house. So the rest of the model would look like.
for(i in 1:N){
logit(p[i]) <- beta0 + beta1 * t + mu_ind[ind_vec[i]]
TestResult[i] ~ dbern(p[i])
}
You would then need to set priors for beta0 and beta1

Get quadratic equation term of a graph in R

I need to find the quadratic equation term of a graph I have plotted in R.
When I do this in excel, the term appears in a text box on the chart but I'm unsure how to move this to a cell for subsequent use (to apply to values requiring calibrating) or indeed how to ask for it in R. If it is summonable in R, is it saveable as an object to do future calculations with?
This seems like it should be a straightforward request in R, but I can't find any similar questions. Many thanks in advance for any help anyone can provide on this.
All the answers provide aspects of what you appear at want to do, but non thus far brings it all together. Lets consider Tom Liptrot's answer example:
fit <- lm(speed ~ dist + I(dist^2), cars)
This gives us a fitted linear model with a quadratic in the variable dist. We extract the model coefficients using the coef() extractor function:
> coef(fit)
(Intercept) dist I(dist^2)
5.143960960 0.327454437 -0.001528367
So your fitted equation (subject to rounding because of printing is):
\hat{speed} = 5.143960960 + (0.327454437 * dist) + (-0.001528367 * dist^2)
(where \hat{speed} is the fitted values of the response, speed).
If you want to apply this fitted equation to some data, then we can write our own function to do it:
myfun <- function(newdist, model) {
coefs <- coef(model)
res <- coefs[1] + (coefs[2] * newdist) + (coefs[3] * newdist^2)
return(res)
}
We can apply this function like this:
> myfun(c(21,3,4,5,78,34,23,54), fit)
[1] 11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907
[8] 18.369782
for some new values of distance (dist), Which is what you appear to want to do from the Q. However, in R we don't do things like this normally, because, why should the user have to know how to form fitted or predicted values from all the different types of model that can be fitted in R?
In R, we use standard methods and extractor functions. In this case, if you want to apply the "equation", that Excel displays, to all your data to get the fitted values of this regression, in R we would use the fitted() function:
> fitted(fit)
1 2 3 4 5 6 7 8
5.792756 8.265669 6.429325 11.608229 9.991970 8.265669 10.542950 12.624600
9 10 11 12 13 14 15 16
14.510619 10.268988 13.114445 9.428763 11.081703 12.122528 13.114445 12.624600
17 18 19 20 21 22 23 24
14.510619 14.510619 16.972840 12.624600 14.951557 19.289106 21.558767 11.081703
25 26 27 28 29 30 31 32
12.624600 18.369782 14.057455 15.796751 14.057455 15.796751 17.695765 16.201008
33 34 35 36 37 38 39 40
18.688450 21.202650 21.865976 14.951557 16.972840 20.343693 14.057455 17.340416
41 42 43 44 45 46 47 48
18.038887 18.688450 19.840853 20.098387 18.369782 20.576773 22.333670 22.378377
49 50
22.430008 21.93513
If you want to apply your model equation to some new data values not used to fit the model, then we need to get predictions from the model. This is done using the predict() function. Using the distances I plugged into myfun above, this is how we'd do it in a more R-centric fashion:
> newDists <- data.frame(dist = c(21,3,4,5,78,34,23,54))
> newDists
dist
1 21
2 3
3 4
4 5
5 78
6 34
7 23
8 54
> predict(fit, newdata = newDists)
1 2 3 4 5 6 7 8
11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907 18.369782
First up we create a new data frame with a component named "dist", containing the new distances we want to get predictions for from our model. It is important to note that we include in this data frame a variable that has the same name as the variable used when we created our fitted model. This new data frame must contain all the variables used to fit the model, but in this case we only have one variable, dist. Note also that we don't need to include anything about dist^2. R will handle that for us.
Then we use the predict() function, giving it our fitted model and providing the new data frame just created as argument 'newdata', giving us our new predicted values, which match the ones we did by hand earlier.
Something I glossed over is that predict() and fitted() are really a whole group of functions. There are versions for lm() models, for glm() models etc. They are known as generic functions, with methods (versions if you like) for several different types of object. You the user generally only need to remember to use fitted() or predict() etc whilst R takes care of using the correct method for the type of fitted model you provide it. Here are some of the methods available in base R for the fitted() generic function:
> methods(fitted)
[1] fitted.default* fitted.isoreg* fitted.nls*
[4] fitted.smooth.spline*
Non-visible functions are asterisked
You will possibly get more than this depending on what other packages you have loaded. The * just means you can't refer to those functions directly, you have to use fitted() and R works out which of those to use. Note there isn't a method for lm() objects. This type of object doesn't need a special method and thus the default method will get used and is suitable.
You can add a quadratic term in the forumla in lm to get the fit you are after. You need to use an I()around the term you want to square as in the example below:
plot(speed ~ dist, cars)
fit1 = lm(speed ~ dist, cars) #fits a linear model
abline(fit1) #puts line on plot
fit2 = lm(speed ~ I(dist^2) + dist, cars) #fits a model with a quadratic term
fit2line = predict(fit2, data.frame(dist = -10:130))
lines(-10:130 ,fit2line, col=2) #puts line on plot
To get the coefficients from this use:
coef(fit2)
I dont think it is possible in Excel, as they only provide functions to get coefficients for a linear regression (SLOPE, INTERCEPT, LINEST) or for a exponential one (GROWTH, LOGEST), though you may have more luck by using Visual Basic.
As for R you can extract model coefficients using the coef function:
mdl <- lm(y ~ poly(x,2,raw=T))
coef(mdl) # all coefficients
coef(mdl)[3] # only the 2nd order coefficient
I guess you mean that you plot X vs Y values in Excel or R, and in Excel use the "Add trendline" functionality. In R, you can use the lm function to fit a linear function to your data, and this also gives you the "r squared" term (see examples in the linked page).

R: Call matrixes from a vector of string names?

Imagine I've got 100 numeric matrixes with 5 columns each.
I keep the names of that matrixes in a vector or list:
Mat <- c("GON1EU", "GON2EU", "GON3EU", "NEW4", ....)
I also have a vector of coefficients "coef",
coef <- c(1, 2, 2, 1, ...)
And I want to calculate a resulting vector in this way:
coef[1]*GON1EU[,1]+coef[2]*GON2EU[,1]+coef[3]*GON3EU[,1]+coef[4]*NEW4[,1]+.....
How can I do it in a compact way, using the the vector of names?
Something like:
coef*(Object(Mat)[,1])
I think the key is how to call an object from a string with his name and use and vectorial notation. But I don't know how.
get() allows you to refer to an object by a string. It will only get you so far though; you'll still need to construct the repeated call to get() on the list matrices etc. However, I wonder if an alternative approach might be feasible? Instead of storing the matrices separately in the workspace, why not store the matrices in a list?
Then you can use sapply() on the list to extract the first column of each matrix in the list. The sapply() step returns a matrix, which we multiply by the coefficient vector. The column sums of that matrix are the values you appear to want from your above description. At least I'm assuming that coef[1]*GON1EU[,1] is a vector of length(GON1EU[,1]), etc.
Here's some code implementing this idea.
vec <- 1:4 ## don't use coef - there is a function with that name
mat <- matrix(1:12, ncol = 3)
myList <- list(mat1 = mat, mat2 = mat, mat3 = mat, mat4 = mat)
colSums(sapply(myList, function(x) x[, 1]) * vec)
Here is some output:
> sapply(myList, function(x) x[, 1]) * vec
mat1 mat2 mat3 mat4
[1,] 1 1 1 1
[2,] 4 4 4 4
[3,] 9 9 9 9
[4,] 16 16 16 16
> colSums(sapply(myList, function(x) x[, 1]) * vec)
mat1 mat2 mat3 mat4
30 30 30 30
The above example suggest you create, or read in, your 100 matrices as components of a list from the very beginning of your analysis. This will require you to alter the code you used to generate the 100 matrices. Seeing as you already have your 100 matrices in your workspace, to get myList from these matrices we can use the vector of names you already have and use a loop:
Mat <- c("mat","mat","mat","mat")
## loop
for(i in seq_along(myList2)) {
myList[[i]] <- get(Mat[i])
}
## or as lapply call - Kudos to Ritchie Cotton for pointing that one out!
## myList <- lapply(Mat, get)
myList <- setNames(myList, paste(Mat, 1:4, sep = ""))
## You only need:
myList <- setNames(myList, Mat)
## as you have the proper names of the matrices
I used "mat" repeatedly in Mat as that is the name of my matrix above. You would use your own Mat. If vec contains what you have in coef, and you create myList using the for loop above, then all you should need to do is:
colSums(sapply(myList, function(x) x[, 1]) * vec)
To get the answer you wanted.
See help(get) and that's that.
If you'd given us a reproducible example I'd have said a bit more. For example:
> a=1;b=2;c=3;d=4
> M=letters[1:4]
> M
[1] "a" "b" "c" "d"
> sum = 0 ; for(i in 1:4){sum = sum + i * get(M[i])}
> sum
[1] 30
Put whatever you need in the loop, or use apply over the vector M and get the object:
> sum(unlist(lapply(M,function(n){get(n)^2})))
[1] 30

Resources