Anova for multiple point patterns not working for Strauss model - statistics

I just started getting into spatial analysis and am fitting some models to my data. My main goal is to test for spatial regularity (whether there is inhibition between points).
I created my hyperframe for the data below. There are 6 point patterns (Areas), 4 in subhabitat 1, and 2 in subhabitat 2.
ALL_ppp <- list(a1ppp, a2ppp, a3ppp, a4ppp, a5ppp, a6ppp)
H <- hyperframe(Area = c("A1","A2","A3","A4","A5","A6"), Subhabitat = c("sbh1","sbh1","sbh1","sbh1","sbh2","sbh2"), Points = ALL_ppp )
I then created some models. This model fits a Strauss process with a different interaction radius for each area, with intensity depending on subhabitat type. It is very similar to the example in the book on page 700.
radii <- c(mean(area1$diameter), mean(area2$diameter),mean(area3$diameter),mean(area4$diameter),mean(area5$diameter),mean(area6$diameter))
Rad <- hyperframe(R=radii)
Str <- with(Rad, Strauss(R))
Int <- hyperframe(str=Str)
fittest8 <- mppm(Points ~ Subhabitat, H, interaction=Int, iformula = ~str:Area)
I would like to conduct a formal test for significance for the Strauss interaction parameters using anova.mppm to test for regularity. However, I am not sure if I am doing this properly, as I cannot seem to get this to work. I have tried:
fittest8 <- mppm(Points ~ Subhabitat, H, interaction=Int, iformula = ~str:Area)
fitex <- mppm(Points ~ Subhabitat, H)
anova.mppm(fittest8, fitex, test = "Chi")
I get the error "Error: Coefficient ‘str’ is missing from new.coef" and cannot find a way to resolve this. Any advice would be greatly appreciated.
Thanks!

First, please learn how to make a minimal reproducible example. This will make it easier for people to help you solve the problem, without having to guess what was in your data.
In your example, the columns named Area and Subhabitat in the hyperframe H are character vectors, but in your code, the call to mppm would require that they are factors. I assume you converted them to factors in order to be able to fit the model fittest8. (Another reason to make a working example)
You said that your example was similar to one on page 700 of the spatstat book which does work. In that case, a good strategy is to modify your example to make it as similar as possible to the example that works, because this will narrow down the possible cause.
A working example of the problem, similar to the one in the book, is:
Str <- hyperframe(str=with(simba, Strauss(mean(nndist(Points)))))
fit1 <- mppm(Points ~ group, simba, interaction=Str, iformula=~str:group)
fit0 <- mppm(Points ~ group, simba)
anova(fit0, fit1, test="Chi")
which yields the same error Error: Coefficient ‘str’ is missing from new.coef
The simplest way to avoid this is to replace the interaction formula ~str:group by str+str:group:
fit1x <- mppm(Points ~ group, simba, interaction=Str,
iformula = ~str + str:group)
anova(fit0, fit1x, test="Chi")
or in your example
fittest8X <- mppm(Points ~ Subhabitat, H, interaction=Int,
iformula=~str + str:Area)
anova(fittest8X, fitex, test="Chi")
Note that fittest8X and fittest8 are equivalent models but are expressed in a slightly different way.
The interaction formula and the trend formula are connected in a complicated way and the software is not always successful in disentangling them. If you get this kind of problem again, try different versions of the interaction formula.

Related

Generating clustered spatstat marks for a ppp object

This question is very close to what has been asked here. The answer is great if we want to generate random marks to an already existing point pattern - we draw from a multivariate normal distribution and associate with each point.
However, I need to generate marks that follows the marks given in the lansing dataset that comes with spatstat for my own point pattern. In other words, I have a point pattern without marks and I want to simulate marks with a definite pattern (for example, to illustrate the concept of segregation for my own data). How do I make such marks? I understand the number of points could be different between lansing and my data set but I am allowed to reduce the window or create more points. Thanks!
Here is another version of segregation in four different rectangular
regions.
library(spatstat)
p <- c(.6,.2,.1,.1)
prob <- rbind(p,
p[c(4,1:3)],
p[c(3:4,1:2)],
p[c(2:4,1)])
X <- unmark(spruces)
labels <- factor(LETTERS[1:4])
subwins <- quadrats(X, 2, 2)
Xsplit <- split(X, subwins)
rslt <- NULL
for(i in seq_along(Xsplit)){
Y <- Xsplit[[i]]
marks(Y) <- sample(labels, size = npoints(Y),
replace = TRUE, prob = prob[i,])
rslt <- superimpose(rslt, Y)
}
plot(rslt, main = "", cols = 1:4)
plot(subwins, add = TRUE)
Segregation refers to the fact that one species predominates in a
specific part of the observation window. An extreme example would be to
segregate completely based on e.g. the x-coordinate. This would generate strips
of points of different types:
library(spatstat)
X <- lansing
Y <- cut(X, X$x, breaks = 6, labels = LETTERS[1:6])
plot(Y, cols = 1:6)
Without knowing more details about the desired type of segregation it is
hard to suggest something more useful.

Coefficient of Variations?

I have a list of values incrementing exponentially. I was asked to have multiple Coefficent of variations from them. You might agree with me that CV is only for the whole set of numbers and dividing the set of numbers into subgroups and calculating a CV for each subgroup seems unreasonable. Would there be any statistical idea behind multiple CVs and if there is, how histogram can be made by the CVs, I mean what would the bins of the historgram. I appreciate the answers in advance
I agree with you - it does not make sense to me to calculate multiple CVs for one dataset unless there's some inferential reason for doing so.
That being said, there might actually be a reason for considering sub-groups of a dataset. In the field of Statistics, context is everything. My first thought is to ask your colleague why they want you do proceed that way. Maybe there's a good reason, maybe they don't have as full a grasp of stats as you do, regardless, it should be an enlightening conversation to have.
If you do decide to go this route, here's some R code that might help (R is great - flexible, powerful, and free)
# first, simulating some fake data (100 values of measurement & group for 10 groups)
x <- rnorm(100, mean=10, sd=1)
group <- sample(LETTERS[1:10], 100, replace=T)
# first few values of each
head(data.frame(x, group))
x group
1 10.778480 F
2 9.274193 B
3 9.639143 G
4 9.080369 I
5 10.727895 D
6 10.850306 G
# this is the part you'd actually need...
# calculating the sd & avgs for each group
sds <- tapply(x, group, sd)
avgs <- tapply(x, group, mean)
# then the cv
cvs <- sds/avgs
cvs
A B C D E F G H I J
0.07859528 0.07570556 0.09370247 0.12552468 0.08897856 0.11044543 0.10947615 0.10323379 0.08908262 0.09729945
# and if you want a histogram, R makes it pretty easy
hist(cvs)

(Incremental)PCA's Eigenvectors are not transposed but should be?

When we posted a homework assignment about PCA we told the course participants to pick any way of calculating the eigenvectors they found. They found multiple ways: eig, eigh (our favorite was svd). In a later task we told them to use the PCAs from scikit-learn - and were surprised that the results differed a lot more than we expected.
I toyed around a bit and we posted an explanation to the participants that either solution was correct and probably just suffered from numerical instabilities in the algorithms. However, recently I picked that file up again during a discussion with a co-worker and we quickly figured out that there's an interesting subtle change to make to get all results to be almost equivalent: Transpose the eigenvectors obtained from the SVD (and thus from the PCAs).
A bit of code to show this:
def pca_eig(data):
"""Uses numpy.linalg.eig to calculate the PCA."""
data = data.T # data
val, vec = np.linalg.eig(data)
return val, vec
versus
def pca_svd(data):
"""Uses numpy.linalg.svd to calculate the PCA."""
u, s, v = np.linalg.svd(data)
return s ** 2, v
Does not yield the same result. Changing the return of pca_svd to s ** 2, v.T, however, works! It makes perfect sense following the definition by wikipedia: The SVD of X follows X=UΣWT where
the right singular vectors W of X are equivalent to the eigenvectors of XTX
So to get the eigenvectors we need to transposed the output v of np.linalg.eig(...).
Unless there is something else going on? Anyway, the PCA and IncrementalPCA both show wrong results (or eig is wrong? I mean, transposing that yields the same equality), and looking at the code for PCA reveals that they are doing it as I did it initially:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
I created a little gist demonstrating the differences (nbviewer), the first with PCA and IncPCA as they are (also no transposition of the SVD), the second with transposed eigenvectors:
Comparison without transposition of SVD/PCAs (normalized data)
Comparison with transposition of SVD/PCAs (normalized data)
As one can clearly see, in the upper image the results are not really great, while the lower image only differs in some signs, thus mirroring the results here and there.
Is this really wrong and a bug in scikit-learn? More likely I am using the math wrong – but what is right? Can you please help me?
If you look at the documentation, it's pretty clear from the shape that the eigenvectors are in the rows, not the columns.
The point of the sklearn PCA is that you can use the transform method to do the correct transformation.

Random effects modeling using mgcv and using lmer. Basically identical fits but VERY different likelihoods and DF. Which to use for testing?

I am aware that there is a duality between random effects and smooth curve estimation. At this link, Simon Wood describes how to specify random effects using mgcv. Of particular note is the following passage:
For example if g is a factor then s(g,bs="re") produces a random coefficient for each level of g, with the radndom coefficients all modelled as i.i.d. normal.
After a quick simulation, I can see this is correct, and that the model fits are almost identical. However, the likelihoods and degrees of freedom are VERY different. Can anyone explain the difference? Which one should be used for testing?
library(mgcv)
library(lme4)
set.seed(1)
x <- rnorm(1000)
ID <- rep(1:200,each=5)
y <- x
for(i in 1:200) y[which(ID==i)] <- y[which(ID==i)] + rnorm(1)
y <- y + rnorm(1000)
ID <- as.factor(ID)
# gam (mgcv)
m <- gam(y ~ x + s(ID,bs="re"))
gam.vcomp(m)
coef(m)[1:2]
logLik(m)
# lmer
m2 <- lmer(y ~ x + (1|ID))
sqrt(VarCorr(m2)$ID[1])
summary(m2)$coef[,1]
logLik(m2)
mean( abs( fitted(m)-fitted(m2) ) )
Full disclosure: I encountered this problem because I want to fit a GAM that also includes random effects (repeated measures), but need to know if I can trust likelihood-based tests under those models.

JAGS: multivariate normal with both, unobservable and observable variables?

I have a "simple" problem with JAGS that drives me crazy. In essence, consider the following example that works:
x2[i] ~ dnorm(mu[i,1], tau1);
u[i] ~ dnorm(mu[i,2], tau2);
Here, x2 is an observable variable (that is, data), while u is a latent variable. In the example, both are drawn independently from two distinct normal distributions.
However, I want them to be (possibly) dependent, that is, to be drawn from one multivariate normal distribution. So I would like to do:
c(x2[i], u[i]) ~ dmnorm(mu[i,1:2], Omega[1:2,1:2]);
Unfortunately, this doesn't work because this syntax is not correct. However, having tried many different syntaxes, neither of them does work. E.g.,
y[i,1] <- x2[i];
y[i,2] <- u[i];
y[i,1:2] ~ dmnorm(mu[i,1:2], Omega[1:2,1:2]);
leads to the error Node y[1,1:2] overlaps previously defined nodes, what is obvious.
So what can I do? Please, help me, I'm getting mad...
UPDATE: I figured out that I can at least do the following:
(in R:)
p <- 1/(1+exp(-x2));
t <- rep(10000, length(x2));
s <- rbinom(length(x2), t, p2);
(in JAGS:)
nul[i,1] <- 0;
nul[i,2] <- 0;
e[i,1:2] ~ dmnorm(nul[i,1:2], Omega[1:2,1:2]);
u[i] <- mu[i,2] + e[i,2];
x2g[i] <- mu[i,1] + e[i,1];
pg[i] <- 1/(1+exp(-x2g[i]));
s[i] ~ dbin(pg[i], t[i]);
This works (a bit), but looses of course efficiency since an observable variable (x2) is treated as if it was only indirectly observable (through s).
You are defining y twice Once in:
y[i,1] <- x2[i];
y[i,2] <- u[i];
And once in
y[i,1:2] ~ dmnorm(mu[i,1:2], Omega[1:2,1:2]);
You might be able to get away with:
x2[i] <- y[i,1];
Or you might be able to simply write out the regression (after all it is bivariate, so not that difficulty).
You also might get a faster response on the JAGS mailing list (which Martyn Plummer regularly monitors).
You can use the data block as follows:
data{
for(i in 1:length(x2)) {
y[i,1] <- x2[i]
y[i,2] <- u[i]
}
}
model{
for(i in 1:length(x2)) {
y[i,1:2] ~ dmnorm(mu[i,], Tau)
}
# ... definition of mu, Tau, and their prior distribution
}
However, be sure there are no missing values in x2 or u, since a multivariate node cannot be partly observed.
Regards! :)

Resources