Random effects modeling using mgcv and using lmer. Basically identical fits but VERY different likelihoods and DF. Which to use for testing? - statistics

I am aware that there is a duality between random effects and smooth curve estimation. At this link, Simon Wood describes how to specify random effects using mgcv. Of particular note is the following passage:
For example if g is a factor then s(g,bs="re") produces a random coefficient for each level of g, with the random coefficients all modelled as i.i.d. normal.
After a quick simulation, I can see this is correct, and that the model fits are almost identical. However, the likelihoods and degrees of freedom are VERY different. Can anyone explain the difference? Which one should be used for testing?
library(mgcv)
library(lme4)
set.seed(1)
x <- rnorm(1000)
ID <- rep(1:200,each=5)
y <- x
for(i in 1:200) y[which(ID==i)] <- y[which(ID==i)] + rnorm(1)
y <- y + rnorm(1000)
ID <- as.factor(ID)
# gam (mgcv)
m <- gam(y ~ x + s(ID,bs="re"))
gam.vcomp(m)
coef(m)[1:2]
logLik(m)
# lmer
m2 <- lmer(y ~ x + (1|ID))
sqrt(VarCorr(m2)$ID[1])
summary(m2)$coef[,1]
logLik(m2)
mean( abs( fitted(m)-fitted(m2) ) )
Full disclosure: I encountered this problem because I want to fit a GAM that also includes random effects (repeated measures), but need to know if I can trust likelihood-based tests under those models.

Related

Anova for multiple point patterns not working for Strauss model

I just started getting into spatial analysis and am fitting some models to my data. My main goal is to test for spatial regularity (whether there is inhibition between points).
I created my hyperframe for the data below. There are 6 point patterns (Areas), 4 in subhabitat 1, and 2 in subhabitat 2.
ALL_ppp <- list(a1ppp, a2ppp, a3ppp, a4ppp, a5ppp, a6ppp)
H <- hyperframe(Area = c("A1","A2","A3","A4","A5","A6"), Subhabitat = c("sbh1","sbh1","sbh1","sbh1","sbh2","sbh2"), Points = ALL_ppp )
I then created some models. This model fits a Strauss process with a different interaction radius for each area, with intensity depending on subhabitat type. It is very similar to the example in the book on page 700.
radii <- c(mean(area1$diameter), mean(area2$diameter),mean(area3$diameter),mean(area4$diameter),mean(area5$diameter),mean(area6$diameter))
Rad <- hyperframe(R=radii)
Str <- with(Rad, Strauss(R))
Int <- hyperframe(str=Str)
fittest8 <- mppm(Points ~ Subhabitat, H, interaction=Int, iformula = ~str:Area)
I would like to conduct a formal test for significance for the Strauss interaction parameters using anova.mppm to test for regularity. However, I am not sure if I am doing this properly, as I cannot seem to get this to work. I have tried:
fittest8 <- mppm(Points ~ Subhabitat, H, interaction=Int, iformula = ~str:Area)
fitex <- mppm(Points ~ Subhabitat, H)
anova.mppm(fittest8, fitex, test = "Chi")
I get the error "Error: Coefficient ‘str’ is missing from new.coef" and cannot find a way to resolve this. Any advice would be greatly appreciated.
Thanks!
First, please learn how to make a minimal reproducible example. This will make it easier for people to help you solve the problem, without having to guess what was in your data.
In your example, the columns named Area and Subhabitat in the hyperframe H are character vectors, but in your code, the call to mppm would require that they are factors. I assume you converted them to factors in order to be able to fit the model fittest8. (Another reason to make a working example)
You said that your example was similar to one on page 700 of the spatstat book which does work. In that case, a good strategy is to modify your example to make it as similar as possible to the example that works, because this will narrow down the possible cause.
A working example of the problem, similar to the one in the book, is:
Str <- hyperframe(str=with(simba, Strauss(mean(nndist(Points)))))
fit1 <- mppm(Points ~ group, simba, interaction=Str, iformula=~str:group)
fit0 <- mppm(Points ~ group, simba)
anova(fit0, fit1, test="Chi")
which yields the same error: Error: Coefficient ‘str’ is missing from new.coef
The simplest way to avoid this is to replace the interaction formula ~str:group by ~str + str:group:
fit1x <- mppm(Points ~ group, simba, interaction=Str,
iformula = ~str + str:group)
anova(fit0, fit1x, test="Chi")
or in your example
fittest8X <- mppm(Points ~ Subhabitat, H, interaction=Int,
iformula=~str + str:Area)
anova(fittest8X, fitex, test="Chi")
Note that fittest8X and fittest8 are equivalent models but are expressed in a slightly different way.
The interaction formula and the trend formula are connected in a complicated way and the software is not always successful in disentangling them. If you get this kind of problem again, try different versions of the interaction formula.

Generating clustered spatstat marks for a ppp object

This question is very close to what has been asked here. The answer is great if we want to generate random marks to an already existing point pattern - we draw from a multivariate normal distribution and associate with each point.
However, I need to generate marks that follows the marks given in the lansing dataset that comes with spatstat for my own point pattern. In other words, I have a point pattern without marks and I want to simulate marks with a definite pattern (for example, to illustrate the concept of segregation for my own data). How do I make such marks? I understand the number of points could be different between lansing and my data set but I am allowed to reduce the window or create more points. Thanks!
Here is another version of segregation in four different rectangular
regions.
library(spatstat)
p <- c(.6,.2,.1,.1)
prob <- rbind(p,
p[c(4,1:3)],
p[c(3:4,1:2)],
p[c(2:4,1)])
X <- unmark(spruces)
labels <- factor(LETTERS[1:4])
subwins <- quadrats(X, 2, 2)
Xsplit <- split(X, subwins)
rslt <- NULL
for(i in seq_along(Xsplit)){
Y <- Xsplit[[i]]
marks(Y) <- sample(labels, size = npoints(Y),
replace = TRUE, prob = prob[i,])
rslt <- superimpose(rslt, Y)
}
plot(rslt, main = "", cols = 1:4)
plot(subwins, add = TRUE)
Segregation refers to the fact that one species predominates in a
specific part of the observation window. An extreme example would be to
segregate completely based on e.g. the x-coordinate. This would generate strips
of points of different types:
library(spatstat)
X <- lansing
Y <- cut(X, X$x, breaks = 6, labels = LETTERS[1:6])
plot(Y, cols = 1:6)
Without knowing more details about the desired type of segregation it is
hard to suggest something more useful.

estimated posteriors in JAGS by levels of a factor

I am running an N-mixture model in JAGS, trying to see if posterior predicted values of N are higher in one habitat than another. I am wondering how to obtain posterior probabilities of estimated population size for each habitat individually after running the model. So, e.g., if I wanted to sum across all sites, I'd put
totalN<-sum(N[]) in the JAGS model and identify "totalN" as one of my parameters. If I have 2 habitat levels over which to sum N, do I need a for loop or is there another way to define it?
Below is my model so far...
model{
# priors
#abundance
beta0 ~ dnorm(0, 0.001) # log(lambda) intercept
beta1 ~ dnorm(0, 0.001) #this is my regression parameter for habitat
tau.T ~ dgamma(0.001, 0.001) #this is for random effect of transect
# detection
alpha.p ~ dgamma(0.01, 0.01)
beta.p ~ dgamma (0.01, 0.01)
# Poisson model for abundance
for (i in 1:nsite){
loglam[i] <- beta1*habitat[i] + ranef[transect[i]]
loglam.lim[i] <- min(250, max(-250, loglam[i])) # 'Stabilize' log
lam[i] <- exp(loglam.lim[i])
N[i] ~ dpois(lam[i])
}
for (i in 1:14){
ranef[i]~dnorm(beta0,tau.T)
}
# Measurement error model
for (i in 1:nsite){
for (j in 1:nrep){
y[i,j] ~ dbin(p[i,j], N[i])
p[i,j] ~ dbeta(alpha.p,beta.p) #detection probability follows a beta distribution
}
}
# posterior predictions
Nperhabitat<-sum(N[habitat]) #this doesn't work, only estimates a single set of posterior densities for N
#and get a derived detection probability
}
I am going to assume here that habitat is a binary vector. I would add two additional vectors to your data that define which elements in habitat are 1 and which are 0. From there you can index N with those two vectors.
# done in R and added to the data list supplied to JAGS
hab_1 <- which(habitat == 1)
hab_0 <- which(habitat == 0)
# add to data list
data_list <- list(..., hab_1 = hab_1, hab_0 = hab_0)
Then, inside the JAGS model you would just add:
N_habitat_1 <- sum(N[hab_1])
N_habitat_0 <- sum(N[hab_0])
This is effectively telling JAGS to provide the total abundance per habitat type. If you have way more sites of one habitat vs another this abundance may hide that the density of individuals could actually be less. Thus, you may want to divide this abundance by the total number of sites of each habitat type:
dens_habitat_1 <- sum(N[hab_1]) / sum(habitat)
dens_habitat_0 <- sum(N[hab_0]) / sum(1 - habitat)
This is, of course, assuming that habitat is binary.
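The indexing logic can be checked in plain R before it goes into the JAGS model. The following is only a minimal sketch with made-up numbers: habitat and N here are hypothetical toy vectors standing in for the site covariate and for a single posterior draw of the latent abundances.

```r
# Hypothetical toy data: 6 sites with a binary habitat indicator
habitat <- c(1, 0, 1, 1, 0, 1)
# Stand-in for one posterior draw of N (made-up values, not model output)
N <- c(12, 5, 9, 14, 7, 10)

# Index vectors, built exactly as in the answer above
hab_1 <- which(habitat == 1)
hab_0 <- which(habitat == 0)

# Total abundance per habitat type
N_habitat_1 <- sum(N[hab_1])
N_habitat_0 <- sum(N[hab_0])

# Mean abundance per site within each habitat type
dens_habitat_1 <- N_habitat_1 / sum(habitat)
dens_habitat_0 <- N_habitat_0 / sum(1 - habitat)

print(c(N_habitat_1, N_habitat_0))
print(c(dens_habitat_1, dens_habitat_0))
```

Inside JAGS the same sums are evaluated at every MCMC iteration, so monitoring N_habitat_1 and N_habitat_0 gives one posterior distribution per habitat.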

KNN Accuracy of 100%?

I have used the following code for KNN
jd <- jobdata
head (jd)
jd$`permanency rate`=as.integer(as.factor(jd$`permanency rate`))
jd$`job skills`=as.integer(as.factor(jd$`job skills`))
jd$Default <- factor(jd$Default)
num.vars <- sapply(jd, is.numeric)
jd[num.vars] <- lapply(jd[num.vars], scale)
jd$`permanency rate` <- factor(jd$`permanency rate`)
num.vars <- sapply(jd, is.numeric)
jd[num.vars] <- lapply(jd[num.vars], scale)
myvars <- c("permanency rate", "job skills")
jd.subset <- jd[myvars]
summary(jd.subset)
set.seed(123)
test <- 1:100
train.jd <- jd.subset[-test,]
test.jd <- jd.subset[test,]
train.def <- jd$`permanency rate`[-test]
test.def <- jd$`permanency rate`[test]
library(class)
knn.1 <- knn(train.jd, test.jd, train.def, k=1)
knn.3 <- knn(train.jd, test.jd, train.def, k=3)
knn.5 <- knn(train.jd, test.jd, train.def, k=5)
But whenever I calculate the proportion of correct classifications for k = 1, 3 and 5, I always get 100% accuracy. Is this normal, or have I gone wrong somewhere?
Thanks
We can't say that a KNN classifier always produces wrong results; it depends on the dataset. In the best case the training data are equal to the test data, and the classifier then always scores 100%.
Train data == Test data: 100% accuracy (with k = 1 every point is its own nearest neighbour)
Only if the model is overfit. That means the model does not account for the randomness in the data and hence predicts the training data with 100 percent accuracy.
This is unlikely to be the case in most projects, as most y_labels (targets) are likely to fall close together when you have a complex dataset with a large number of independent variables (predictors).
It would be good for you to try implementing some clustering techniques or a simple pair plot of your variables with the color set to your target variable to see if they are nicely grouped together.
An example would be:
# This is an implementation in python
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(data = jd, hue = "permanency rate")
The default number of neighbours depends on the library: scikit-learn's KNeighborsClassifier uses n_neighbors = 5, while knn in R's class package uses k = 1. You can try values above 5 to see whether the result changes.
You should also construct your confusion matrix and review your metrics.
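As a rough sketch of what building a confusion matrix could look like in R, the example below uses the built-in iris data purely as a stand-in (jobdata is not available here) together with knn from the class package, as in the question. The names test_idx, train_x, test_x and conf_mat are made up for illustration. The predictors are columns 1 to 4 only, so the class label itself is not among the features.

```r
library(class)

set.seed(123)
# iris is only a stand-in dataset for illustration
test_idx <- sample(nrow(iris), 50)

# Scale the training predictors, then apply the same centering and
# scaling to the test predictors
train_x <- scale(iris[-test_idx, 1:4])
test_x <- scale(iris[test_idx, 1:4],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
train_y <- iris$Species[-test_idx]
test_y <- iris$Species[test_idx]

pred <- knn(train_x, test_x, cl = train_y, k = 5)

# Rows are predicted classes, columns are the true classes
conf_mat <- table(predicted = pred, actual = test_y)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

One thing that may be worth checking in the question's code: myvars lists "permanency rate", which is also the variable passed as the class labels (train.def). If the target is among the predictors, very high accuracy would not be surprising.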

can a random number be added to a set..such that its mean and variance will not change

I have a set of 4 values. I want to generate a random number to add to each value in the set, but after adding it the mean and variance should not change.
That is, the mean and variance of the set before adding should be the same as after adding the number. I was trying to approach this with a genetic algorithm. Can anyone please give me more insight on this?
Let us suppose your set is called x. Let us also suppose that you will add values to x to make it y. In R, this could be achieved by
x <- rnorm(4, mean = 5, sd = 2)
x
[1] 5.124843 3.070105 4.444706 6.657949
rand <- rnorm(length(x), mean = 0, sd = sd(x)) / 1000 # Divide by 1000 so rand has minimal
# impact on the mean and variance of x when added
y <- x + rand
y
[1] 5.124799 3.066977 4.444524 6.656452
mean(x); mean(y)
[1] 4.824401
[1] 4.823188
This will still change the mean and variance slightly. To make the change smaller, scale rand by dividing it by a large number (as I did) or multiplying it by a small number. Another way to go about this is the jitter function in R, which adds a small amount of noise drawn from a uniform distribution centered on 0.
x <- c(1, -.5, 2, -1.2)
jitter(x)
[1] 1.1117953 -0.5391391 2.0695948 -1.1145638
By default jitter derives the noise amplitude from the spread of x itself; to control it, use its factor or amount arguments.
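If the mean and variance have to match exactly rather than approximately, one option (a different technique from the one in the answer above, sketched here under the assumption that each value may receive its own noise term) is to perturb the values and then shift and rescale the result back to the original mean and standard deviation. The vector x and the noise sd of 0.5 are arbitrary illustrative choices.

```r
set.seed(42)
x <- c(5.1, 3.0, 4.4, 6.7)   # hypothetical set of 4 values

# Perturb each value with its own random noise
z <- x + rnorm(length(x), mean = 0, sd = 0.5)

# Shift and rescale so the result has the same mean and sd as x
y <- mean(x) + (z - mean(z)) * sd(x) / sd(z)

print(c(mean(x), mean(y)))
print(c(var(x), var(y)))
```

Note that adding one and the same nonzero number to every value always shifts the mean by that number, so an exact match is only possible when the added amounts differ between values.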
