How to make an optimal combinatorial selection in R

The problem I'm trying to solve is basically the same as the one in this post:
https://stats.stackexchange.com/questions/339935/python-library-for-combinatorial-optimization
My current implementation indeed uses a genetic-algorithm-based optimizer.
However, I would like to solve it as a binary linear programming problem (or at least try, even though it's apparently NP-hard).
My question is how to formulate the LP in the best way, because I am not sure I am doing it right.
The following is a simplified version of what I'm dealing with, which however shows exactly where the problem lies.
We make m*n objects (here 3*2 = 6) by a combinatorial process, combining each of m objects of type 'R1' (say {A,B,C}, so m = 3) with each of n objects of type 'R2' (say {X,Y}, so n = 2).
The 6 objects {AX,AY,BX,BY,CX,CY} are evaluated and each gets a score D, in this case {0.8,0.7,0.5,0.9,0.4,0.0}, in this order.
CL <- cbind(expand.grid(R2=LETTERS[24:25],R1=LETTERS[1:3],stringsAsFactors = FALSE),D=c(0.8,0.7,0.5,0.9,0.4,0.0))
Now we want to select 2 distinct R1's and 1 R2 such that the sum of D is maximal.
In this example, the answer is R1 = {A,B}, R2 = {Y}.
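At this toy size one can verify that by brute force (a quick sketch, enumerating every choice of 2 distinct R1's and 1 R2):
# Brute-force check (feasible only at toy size): enumerate every choice of
# 2 distinct R1's and 1 R2 and sum the D of the corresponding products
pairs <- combn(unique(CL$R1), 2, simplify = FALSE)
res <- do.call(rbind, lapply(pairs, function(r1)
  do.call(rbind, lapply(unique(CL$R2), function(r2)
    data.frame(R1s = paste(r1, collapse = ","), R2 = r2,
               total = sum(CL$D[CL$R1 %in% r1 & CL$R2 == r2]))))))
res[which.max(res$total), ]
#   R1s R2 total
# 2 A,B  Y   1.6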
However, one would not get to that conclusion by taking, for instance, the 2 R1's and the R2 with the highest average D.
That would work for R1, but not for R2 (X has the higher average, yet the optimum uses Y):
aggregate(D~R1,CL,mean)
# R1 D
#1 A 0.75
#2 B 0.70
#3 C 0.20
aggregate(D~R2,CL,mean)
# R2 D
#1 X 0.5666667
#2 Y 0.5333333
I know how to formulate this as a binary linear programming problem; I am just not sure my formulation is efficient, because it results in a problem with m*n + m + n variables and 2*(m+n) + 2 constraints.
The main difficulty is that I need somehow to count the number of distinct R1's and R2's chosen, and I don't know any way of doing that apart from what I will show below (and is also described in my other post here).
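In symbols, with binary variables x_e (product e selected) and y_g (R1 or R2 level g used), the formulation I build below is:

maximize    sum_e D_e * x_e
subject to  sum_{e in level g} x_e  <=  |g| * y_g                 for every R1 and R2 level g
            sum_{e in level g} x_e  >=  |g| * y_g - (|g| - 1)     for every R1 and R2 level g
            sum_{g in R1} y_g = 2,   sum_{g in R2} y_g = 1
            x_e, y_g in {0,1}

where |g| is the number of products containing level g. The first family of constraints forces y_g = 1 whenever some product of level g is selected; the second prevents y_g = 1 when none is.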
This is what I would do:
CL["Entry"] <- seq_len(dim(CL)[[1]])
R1.mat <- table(CL$R1,CL$Entry)
R2.mat <- table(CL$R2,CL$Entry)
N_R1 <- dim(R1.mat)[[1]]
N_R2 <- dim(R2.mat)[[1]]
N_Entry <- dim(CL)[[1]]
constr.mat <- NULL
dir <- NULL
rhs <- NULL
constr.mat <- rbind(constr.mat,cbind(R1.mat,-diag(table(CL$R1)),matrix(0,N_R1,N_R2)))
dir <- c(dir,rep("<=",N_R1))
rhs <- c(rhs,rep(0,N_R1))
constr.mat <- rbind(constr.mat,cbind(R2.mat,matrix(0,N_R2,N_R1),-diag(table(CL$R2))))
dir <- c(dir,rep("<=",N_R2))
rhs <- c(rhs,rep(0,N_R2))
constr.mat <- rbind(constr.mat,constr.mat)
dir <- c(dir,rep(">=",N_R1+N_R2))
rhs <- c(rhs,1-table(CL$R1),1-table(CL$R2))
constr.mat <- rbind(constr.mat,c(rep(0,N_Entry),rep(1,N_R1),rep(0,N_R2)))
dir <- c(dir,"==")
rhs <- c(rhs,2)
constr.mat <- rbind(constr.mat,c(rep(0,N_Entry),rep(0,N_R1),rep(1,N_R2)))
dir <- c(dir,"==")
rhs <- c(rhs,1)
obj <- c(aggregate(D~Entry,CL,c)[["D"]],rep(0,N_R1+N_R2))
This can be solved, for instance, with lpSolve:
library(lpSolve)
sol <- lp("max", obj, constr.mat, dir, rhs, all.bin = TRUE,
          num.bin.solns = 1, use.rw = FALSE, transpose.constraints = TRUE)
sol$solution
#[1] 0 1 0 1 0 0 1 1 0 0 1
showing that products {AY,BY} were selected, corresponding to R1 = {A,B} and R2 = {Y}:
CL[as.logical(sol$solution[1:N_Entry]),]
# R2 R1 D Entry
#2 Y A 0.7 2
#4 Y B 0.9 4
I found that on large problems lpSolve gets stuck for ages; Rsymphony seemed to perform better.
But again, my main question is: is this way of formulating the LP efficient? Should I do it differently?
Thanks!
EDIT
In the meantime, working on a somewhat related problem, I found that one set of constraints may be sufficient if one adds 'costs' (negative, in this example) to the objective vector for the 'distinct R1 and R2' indicator variables.
So here, instead of:
obj <- c(aggregate(D~Entry,CL,c)[["D"]],rep(0,N_R1+N_R2))
I would do:
obj <- c(aggregate(D~Entry,CL,c)[["D"]],rep(-1,N_R1+N_R2))
This makes the m+n '>=' linking constraints unnecessary.
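For the toy example above, that reduced model would look like this (a sketch, reusing the matrices built earlier and keeping only the '<=' linking block plus the two cardinality rows):
# Reduced formulation: drop the '>=' linking block; the negative costs in
# the objective keep indicator variables at 0 unless their level is used
constr.red <- rbind(
  cbind(R1.mat, -diag(table(CL$R1)), matrix(0, N_R1, N_R2)),
  cbind(R2.mat, matrix(0, N_R2, N_R1), -diag(table(CL$R2))),
  c(rep(0, N_Entry), rep(1, N_R1), rep(0, N_R2)),
  c(rep(0, N_Entry), rep(0, N_R1), rep(1, N_R2)))
dir.red <- c(rep("<=", N_R1 + N_R2), "==", "==")
rhs.red <- c(rep(0, N_R1 + N_R2), 2, 1)
obj.red <- c(aggregate(D~Entry,CL,c)[["D"]], rep(-1, N_R1 + N_R2))
sol.red <- lp("max", obj.red, constr.red, dir.red, rhs.red, all.bin = TRUE)
sol.red$solution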
It still remains a huge problem to solve, even for relatively small m, n, so if anyone can advise how to do it better...
I had a look at lp.transport, but that would be limited to 2 dimensions (i.e. only R1 and R2, not R1, R2, R3 for instance), and I don't think you can constrain the number of distinct objects per category in that kind of solver.

Related

Anova for multiple point patterns not working for Strauss model

I just started getting into spatial analysis and am fitting some models to my data. My main goal is to test for spatial regularity (whether there is inhibition between points).
I created my hyperframe for the data below. There are 6 point patterns (Areas), 4 in subhabitat 1, and 2 in subhabitat 2.
ALL_ppp <- list(a1ppp, a2ppp, a3ppp, a4ppp, a5ppp, a6ppp)
H <- hyperframe(Area = c("A1","A2","A3","A4","A5","A6"), Subhabitat = c("sbh1","sbh1","sbh1","sbh1","sbh2","sbh2"), Points = ALL_ppp )
I then created some models. This model fits a Strauss process with a different interaction radius for each area, with intensity depending on subhabitat type. It is very similar to the example on page 700 of the spatstat book.
radii <- c(mean(area1$diameter), mean(area2$diameter), mean(area3$diameter),
           mean(area4$diameter), mean(area5$diameter), mean(area6$diameter))
Rad <- hyperframe(R=radii)
Str <- with(Rad, Strauss(R))
Int <- hyperframe(str=Str)
fittest8 <- mppm(Points ~ Subhabitat, H, interaction=Int, iformula = ~str:Area)
I would like to conduct a formal test of significance for the Strauss interaction parameters, using anova.mppm, to test for regularity. However, I am not sure I am doing this properly, as I cannot get it to work. I have tried:
fittest8 <- mppm(Points ~ Subhabitat, H, interaction=Int, iformula = ~str:Area)
fitex <- mppm(Points ~ Subhabitat, H)
anova.mppm(fittest8, fitex, test = "Chi")
I get the error "Error: Coefficient ‘str’ is missing from new.coef" and cannot find a way to resolve this. Any advice would be greatly appreciated.
Thanks!
First, please learn how to make a minimal reproducible example. This will make it easier for people to help you solve the problem, without having to guess what was in your data.
In your example, the columns named Area and Subhabitat in the hyperframe H are character vectors, but the call to mppm requires them to be factors. I assume you converted them to factors in order to be able to fit the model fittest8. (Another reason to post a working example.)
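For the record, the conversion is just:
H$Area <- factor(H$Area)
H$Subhabitat <- factor(H$Subhabitat)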
You said that your example was similar to one on page 700 of the spatstat book, which does work. In that case, a good strategy is to modify your example to make it as similar as possible to the example that works, because this will narrow down the possible cause.
A working example of the problem, similar to the one in the book, is:
Str <- hyperframe(str=with(simba, Strauss(mean(nndist(Points)))))
fit1 <- mppm(Points ~ group, simba, interaction=Str, iformula=~str:group)
fit0 <- mppm(Points ~ group, simba)
anova(fit0, fit1, test="Chi")
which yields the same error Error: Coefficient ‘str’ is missing from new.coef
The simplest way to avoid this is to replace the interaction formula ~str:group by ~str + str:group:
fit1x <- mppm(Points ~ group, simba, interaction=Str,
              iformula = ~str + str:group)
anova(fit0, fit1x, test="Chi")
or in your example
fittest8X <- mppm(Points ~ Subhabitat, H, interaction=Int,
                  iformula = ~str + str:Area)
anova(fittest8X, fitex, test="Chi")
Note that fittest8X and fittest8 are equivalent models but are expressed in a slightly different way.
The interaction formula and the trend formula are connected in a complicated way and the software is not always successful in disentangling them. If you get this kind of problem again, try different versions of the interaction formula.

Generating clustered spatstat marks for a ppp object

This question is very close to what has been asked here. The answer is great if we want to attach random marks to an already existing point pattern: we draw from a multivariate normal distribution and associate a value with each point.
However, I need to generate marks for my own point pattern that follow the marks given in the lansing dataset that comes with spatstat. In other words, I have a point pattern without marks, and I want to simulate marks with a definite pattern (for example, to illustrate the concept of segregation for my own data). How do I make such marks? I understand the number of points could differ between lansing and my data set, but I am allowed to reduce the window or create more points. Thanks!
Here is another version of segregation in four different rectangular regions.
library(spatstat)
p <- c(.6, .2, .1, .1)
# Each quadrat gets a cyclically shifted version of the probability vector,
# so a different type dominates in each region
prob <- rbind(p,
              p[c(4, 1:3)],
              p[c(3:4, 1:2)],
              p[c(2:4, 1)])
X <- unmark(spruces)
labels <- factor(LETTERS[1:4])
subwins <- quadrats(X, 2, 2)
Xsplit <- split(X, subwins)
rslt <- NULL
for(i in seq_along(Xsplit)){
  Y <- Xsplit[[i]]
  marks(Y) <- sample(labels, size = npoints(Y),
                     replace = TRUE, prob = prob[i,])
  rslt <- superimpose(rslt, Y)
}
plot(rslt, main = "", cols = 1:4)
plot(subwins, add = TRUE)
Segregation refers to the fact that one species predominates in a specific part of the observation window. An extreme example would be to segregate completely based on e.g. the x-coordinate. This would generate strips of points of different types:
library(spatstat)
X <- lansing
Y <- cut(X, X$x, breaks = 6, labels = LETTERS[1:6])
plot(Y, cols = 1:6)
Without knowing more details about the desired type of segregation it is hard to suggest something more useful.

Random effects modeling using mgcv and using lmer. Basically identical fits but VERY different likelihoods and DF. Which to use for testing?

I am aware that there is a duality between random effects and smooth curve estimation. At this link, Simon Wood describes how to specify random effects using mgcv. Of particular note is the following passage:
For example if g is a factor then s(g,bs="re") produces a random coefficient for each level of g, with the random coefficients all modelled as i.i.d. normal.
After a quick simulation, I can see this is correct, and that the model fits are almost identical. However, the likelihoods and degrees of freedom are VERY different. Can anyone explain the difference? Which one should be used for testing?
library(mgcv)
library(lme4)
set.seed(1)
x <- rnorm(1000)
ID <- rep(1:200,each=5)
y <- x
for(i in 1:200) y[which(ID==i)] <- y[which(ID==i)] + rnorm(1)
y <- y + rnorm(1000)
ID <- as.factor(ID)
# gam (mgcv)
m <- gam(y ~ x + s(ID,bs="re"))
gam.vcomp(m)
coef(m)[1:2]
logLik(m)
# lmer
m2 <- lmer(y ~ x + (1|ID))
sqrt(VarCorr(m2)$ID[1])
summary(m2)$coef[,1]
logLik(m2)
mean( abs( fitted(m)-fitted(m2) ) )
Full disclosure: I encountered this problem because I want to fit a GAM that also includes random effects (repeated measures), but I need to know whether I can trust likelihood-based tests under those models.

unused variable(s) warning in runjags model

I am running JAGS models through the R package runjags. I just updated to JAGS 4.0.0 from JAGS 3.4, and have noticed some unexpected behavior that seems to be related to the update.
First, when I run a model, I now get a warning message WARNING: Unused variable(s) in data table: followed by a list of data objects that are referenced in the model and provided as data. It doesn't seem to affect the results, but it is very puzzling. I have, however, noticed a few times while playing around with this that for some variables the posteriors were virtually identical to the priors (indicating that no updating occurred). I can't recreate that update failure right now, but below is a reproducible code example illustrating the odd warning message. The code example on the run.jags help page also produces the same warning.
Second, I thought I'd check whether the same message pops up if I use the R package R2jags instead of runjags, but R2jags won't load, because apparently rjags (one of its dependencies) is not compatible with JAGS 4.0 (it's looking for JAGS 3.x). Also, in the runjags function run.jags, the argument method="rjags" doesn't seem to work anymore, but method="parallel" does work.
I'm using runjags_2.0.1-4 and R 3.2.2.
So my questions are:
1) Is rjags really incompatible with JAGS 4.0? The motivation to go to 4.0 was to use vectors as indices (see https://martynplummer.wordpress.com/2015/08/16/whats-new-in-jags-4-0-0-part-34-r-style-features/).
2) What is up with the unused variable(s) warning, and should I be concerned about it?
Thanks,
Glenn
Code:
#--- GENERATE DATA ------------------------
rm(list=ls())
# Number of sites and observations per site
N <- 200
nobs <- 3
# generate covariates and standardize (where appropriate)
set.seed(123)
forest <- rnorm(N)
# relationship between occupancy and covariates
b0 <- 0.5
b.for <- 0.5
psi <- plogis(b0 + b.for*forest)
# draw occupancy for each site
z <- rbinom(n=N, size=1,prob=psi)
# specify detection probability
p <- 0.5
pz <- p*z
# generate the observations
Y <- rbinom(n=N, size=nobs,prob=pz)
#---- BUGS model ------------------------
model1 <- "model {
for (i in 1:N){
logit(eta[i]) <- b0 + b.for*forest[i]
z[i] ~ dbern(eta[i])
pz[i] <- z[i]*p
y[i] ~ dbin(pz[i],nobs)
} #i
b0.0 ~ dunif(0,1)
b0 <- log(b0.0/(1-b0.0))
b.for ~ dnorm(0,0.01)
p ~ dunif(0,1)
}"
occ.data1 <- list(y=Y, N=N, nobs=nobs, forest=forest)
inits1 <- function(){list(b0.0=runif(1),b.for=rnorm(1),p=runif(1),z=as.numeric(Y>0))}
parameters1 <- c("b0","b.for","p")
#---- RUN MODEL ------------------------
library(runjags)
ni <- 2000
nt <- 1
nb <- 1000
nc <- 3
ad <- 100
out <- run.jags(model=model1, data=occ.data1, monitor=parameters1, n.chains=nc,
                inits=inits1, burnin=nb, sample=ni, adapt=ad, thin=nt,
                modules=c("glm","dic"), method="parallel")
To answer your questions:
1) rjags and JAGS use linked (non-interchangeable) versions, and CRAN systems are still using JAGS 3.4.0, so the version of rjags on CRAN matches that. This will be updated soon; in the meantime you can grab the correct version of rjags from the SourceForge page, as @jbaums notes.
2) This is a helpful message from JAGS/rjags telling you that you have specified something as data that the model isn't using. Remember that variable names are case sensitive, e.g.:
library('runjags')
model <- "model {
m ~ dunif(-1000,1000)
#data# M
#inits# m
#monitor# m
}"
M <- 0
m <- list(-10, 10)
results <- run.jags(model, method="interruptible", n.chains=2)
results <- run.jags(model, method="rjags", n.chains=2)
... gives you a warning because M does not match m. Also note that the warning looks a bit different between the two function calls: in the first it appears half-way down the JAGS output, and in the second it comes as a warning in R after the function has completed.
As for 'should I be concerned': yes, if you think these variables should be in your model. If you can't find the problem, try posting the code you are using - it got cut off from your original post.
Matt

Problem detecting cyclic numbers in Haskell

I am doing problem 61 at Project Euler and came up with the following code (to test the case they give):
p3 n = n*(n+1) `div` 2
p4 n = n*n
p5 n = n*(3*n -1) `div` 2
p6 n = n*(2*n -1)
p7 n = n*(5*n -3) `div` 2
p8 n = n*(3*n -2)
x n = take 2 $ show n
x2 n = reverse $ take 2 $ reverse $ show n
pX p = dropWhile (< 999) $ takeWhile (< 10000) [p n|n<-[1..]]
isCyclic2 (a,b,c) = x2 b == x c && x2 c == x a && x2 a == x b
ns2 = [(a,b,c)|a <- pX p3 , b <- pX p4 , c <- pX p5 , isCyclic2 (a,b,c)]
All ns2 does is return an empty list, even though isCyclic2 returns True for the example triple given in the question; the series just doesn't come up in the solution. The problem must lie in the list comprehension ns2, but I can't see where. What have I done wrong?
Also, how can I make it so that the pX only gets the pX (n) up to the pX used in the previous pX?
PS: in case you thought I completely missed the problem, I will get my final solution with this:
isCyclic (a,b,c,d,e,f) = x2 a == x b && x2 b == x c && x2 c == x d && x2 d == x e && x2 e == x f && x2 f == x a
ns = [[a,b,c,d,e,f]|a <- pX p3 , b <- pX p4 , c <- pX p5 , d <- pX p6 , e <- pX p7 , f <- pX p8 ,isCyclic (a,b,c,d,e,f)]
answer = sum $ head ns
The order is important. The cyclic numbers in the question are 8128, 2882, 8281, and these are not P3/127, P4/91, P5/44 but P3/127, P5/44, P4/91.
Your code is only checking in the order 8128, 8281, 2882, which is not cyclic.
You would get the result if you check for
isCyclic2 (a,c,b)
in your list comprehension.
EDIT: Wrong problem! I assumed you were talking about the circular-number problem, sorry!
There is a more efficient way to do this, with something like:
take (2 * l x - 1) . cycle $ show x
  where l = length . show
Try that and see where it gets you.
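For instance, a sketch of how that fragment yields every rotation of a number (ignoring leading-zero issues):
-- the doubled string contains every rotation of x as a substring,
-- so all rotations can be read off directly
rotations :: Int -> [Int]
rotations x = [ read (take l (drop i s)) | i <- [0 .. l - 1] ]
  where
    s = take (2 * l - 1) (cycle (show x))
    l = length (show x)
so that rotations 197 gives [197, 971, 719].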
If I understand you right, you're no longer asking why your code doesn't work but how to make it faster. Finding an efficient way to solve the problems is actually the whole fun of Project Euler, so proceed with care and first try to reduce your search space yourself. I suggest you let Haskell print out the three lists pX p3, pX p4, pX p5 and see how you'd go about looking for a cycle.
If you proceeded like your list comprehension, you'd start with the first element of each list: 1035, 1024, 1001. I'm pretty sure you would stop right after picking 1035 and 1024 and not test for cycles with any value from P5, let alone try all the permutations of the combinations involving these two numbers.
(I haven't actually worked on this problem yet, so this is how I would go about speeding it up. There may be some math wizardry out there that's even faster)
First, look at the numbers you get from pX - you can drop more of them than pX already does. For example, P3 contains 6105: there's no way you're going to find a number in the other sets starting with '05', since a 4-digit number can't start with 0. So you can also drop the numbers whose value modulo 100 is less than 10.
Then (for the case of 3 sets), we can sometimes see after drawing two numbers that there can't be any number in the last set that will give you a cycle, no matter how you permute (e.g. 1035 from P3 and 3136 from P4 - there can't be a cycle here).
I'd probably try to build a chain by starting with the elements from one list, one by one, and for each element, find the elements from the remaining lists that are valid successors. For those that you've found, continue trying to find the next chain element from the remaining lists. When you've built a chain with one number from every list, you just have to check if the last two digits of the last number match the first two digits of the first number.
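As a sketch of that chain-building idea for the three-list case (reusing x and x2 from the question; 4-digit numbers assumed, and without the early-stopping refinement below):
-- all ways to pick one list and keep the others
picks :: [a] -> [(a, [a])]
picks xs = [ (xs !! i, take i xs ++ drop (i + 1) xs) | i <- [0 .. length xs - 1] ]

chains :: [[Int]] -> [[Int]]
chains []           = []
chains (first:rest) = concat [ go [a] rest | a <- first ]
  where
    -- acc holds the chain so far, newest element first
    go acc []    = [ reverse acc | x2 (head acc) == x (last acc) ]  -- close the cycle
    go acc lists = [ res
                   | (lst, others) <- picks lists  -- pick which list supplies the next element
                   , b <- lst
                   , x2 (head acc) == x b          -- b must continue the chain
                   , res <- go (b : acc) others ]
For the example in the question, chains [pX p3, pX p4, pX p5] finds [[8128, 2882, 8281]].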
Note when looking for successors, you again don't have to traverse the entire lists. If you're looking for a successor to 3015 from P5, for example, you can stop when you hit a number that's 1600 or larger.
If that's too slow still, you could transform the lists other than the first one to maps where the map key is the first two digits and the associated values are lists of numbers that start with those digits. Saves you from going through the lists from the start again and again.
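A sketch of that (keys are the leading two digits; 4-digit numbers assumed):
import qualified Data.Map as M

byPrefix :: [Int] -> M.Map String [Int]
byPrefix xs = M.fromListWith (flip (++)) [ (take 2 (show n), [n]) | n <- xs ]

-- all candidates that can follow n, without scanning the whole list
successors :: M.Map String [Int] -> Int -> [Int]
successors mp n = M.findWithDefault [] (drop 2 (show n)) mp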
I hope this helps a bit.
By the way, I sense some repetition in your code. You can unite your p3, p4, p5, p6, p7, p8 functions into a single function that takes the 3 from p3 (and so on) as a parameter. To find the pattern, note that all of them can be written in the form
pX n = ... `div` 2
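For instance (a sketch based on the standard formula for s-gonal numbers, which reproduces p3 .. p8 above):
-- s-gonal numbers: p 3 == p3, p 4 == p4, ..., p 8 == p8
p :: Int -> Int -> Int
p s n = n * ((s - 2) * n - (s - 4)) `div` 2
so that e.g. pX p5 becomes pX (p 5).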
