JAGS and WinBUGS giving differing DIC - jags

I'm doing a network meta-analysis including several clinical trials. The response is binomial. Each trial contains several treatments.
When I do a random effects model, the output from JAGS and WinBUGS is similar. When I do a fixed effects model, the DIC and pD components are way out, though the posteriors of the parameters I'm interested in are similar.
I've got similar models that have Gaussian response, not binomial, and JAGS and WinBUGS are in agreement.
The BUGS/JAGS code for the fixed effects model is lifted from page 61 of this and appears below. However, the same code runs and produces similar posteriors using WinBUGS and JAGS, it's just DIC and pD that differ markedly. So I don't think this code is the problem.
for(i in 1:ns){ # Loop over studies
mu[i] ~ dnorm(0, .0001)
# Vague priors for all trial baselines
for (k in 1:na[i]) { # Loop over arms
r[i, k] ~ dbin(p[i, k], n[i, k])
# binomial likelihood
logit(p[i, k]) <- mu[i] + d[t[i, k]] - d[t[i, 1]]
# model for linear predictor
rhat[i, k] <- p[i, k] * n[i, k]
# expected value of the numerators
dev[i, k] <-
2 * (r[i, k] * (log(r[i, k]) - log(rhat[i, k])) +
(n[i, k] - r[i, k]) * (log(n[i, k] - r[i, k]) +
- log(n[i, k] - rhat[i, k]) ))
# Deviance contribution
}
resdev[i] <- sum(dev[i, 1:na[i]])
# summed residual deviance contribution for this trial
}
totresdev <- sum(resdev[])
# Total Residual Deviance
d[1] <- 0
# treatment effect is zero for reference treatment
for (k in 2:nt){
d[k] ~ dnorm(0, .0001)
} # vague priors for treatment effects
I found an old post describing a known issue, but that's too old for me to think it's the same problem.
Are there any known issues with JAGS reporting the wrong DIC and pD? (Searching for "JAGS bugs" isn't all that helpful.)
I'd be grateful of any pointers.

There are a number of different ways to calculate pD, and the method used by JAGS differs to that used by WinBUGS. See the Details section of the following help file:
?rjags::dic
Specifically:
DIC (Spiegelhalter et al 2002) is calculated by adding the “effective number of parameters” (pD) to the expected deviance. The definition of pD used by dic.samples is the one proposed by Plummer (2002) and requires two or more parallel chains in the model.
The details are in the (lengthy, but well worth reading) discussion of the following paper:
Spiegelhalter, D., N. Best, B. Carlin, and A. van der Linde (2002), Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society Series B 64, 583-639.
In general: all methods to calculate pD (of which I am aware) are approximations, and if two such methods disagree then it is probably because the assumptions behind for one (or both) approximation(s) are not being met.
Matt

Related

What is the CP/MILP problem name for the assignment of agents to tasks with fixed start time and end time?

I'm trying to solve a Constraint Satisfaction Optimisation Problem that assigns agents to tasks. However, different then the basic Assignment Problem, a agent can be assigned to many tasks if the tasks do not overlap.
Each task has a fixed start_time and end_time.
The agents are assigned to the tasks according to some unary&binary constraints.
Variables = set of tasks
Domain = set of compatible agents (for each variable)
Constraints = unary&binary
Optimisation fct = some liniar function
An example of the problem: the allocation of parking space (or teams) for trucks for which we know the arrival and departure time.
I'm interested if there is in the literature a precise name for these type of problems. I presume it is some kind of assignment problem. Also, if you ever approach the problem, how do you solve it?
Thank you.
I would interpret this as: rectangular assignment-problem with conflicts
which is arguably much more hard (NP-hard in general) than the polynomially-solvable assignment-problem.
The demo shown in the other answer might work and ortools' cp-sat is great, but i don't see a good reason to use discrete-time based reasoning here like it's done: interval-variables, edge-finding and co. based scheduling constraints (+ conflict-analysis / explanations). This stuff is total overkill and the overhead will be noticable. I don't see any need to reason about time, but just about time-induced conflicts.
Edit: One could label those two approaches (linked + proposed) as compact formulation and extended formulation. Extended formulations usually show stronger relaxations and better (solving) results as long as scalability is not an issue. Compact approaches might become more viable again with bigger data (bit it's hard to guess here as scheduling-propagators are not that cheap).
What i would propose:
(1) Formulate an integer-programming model following the basic assignment-problem formulation + adaptions to make it rectangular -> a worker is allowed to tackle multiple tasks while all tasks are tackled (one sum-equality dropped)
see wiki
(2) Add integrality = mark variables as binary -> because the problem is not satisfying total unimodularity anymore
(3) Add constraints to forbid conflicts
(4) Add constraints: remaining stuff (e.g. compatibility)
Now this is all straightforward, but i would propose one non-naive improvement in regards to (3):
The conflicts can be interpreted as stable-set polytope
Your conflicts are induced by a-priori defined time-windows and their overlappings (as i interpret it; this is the core assumption behind this whole answer)
This is an interval graph (because of time-windows)
All interval graphs are chordal
Chordal graphs allow enumeration of all max-cliques in poly-time (implying there are only polynomial many)
python: networkx.algorithms.chordal.chordal_graph_cliques
The set (enumeration) of all maximal cliques define the facets of the stable-set polytope
Those (a constraint for each element in the set) we add as constraints!
(The stable-set polytope on the graph in use here would also allow very very powerful semidefinite-relaxations but it's hard to foresee in which cases this would actually help due to SDPs being much more hard to work with: warmstart within tree-search; scalability; ...)
This will lead to a poly-size integer-programming problem which should be very very good when using a good IP-solver (commercials or if open-source needed: Cbc > GLPK).
Small demo about (3)
import itertools
import networkx as nx
# data: inclusive, exclusive
# --------------------------
time_windows = [
(2, 7),
(0, 10),
(6, 12),
(12, 20),
(8, 12),
(16, 20)
]
# helper
# ------
def is_overlapping(a, b):
return (b[1] > a[0] and b[0] < a[1])
# raw conflicts
# -------------
binary_conflicts = []
for a, b in itertools.combinations(range(len(time_windows)), 2):
if is_overlapping(time_windows[a], time_windows[b]):
binary_conflicts.append( (a, b) )
# conflict graph
# --------------
G = nx.Graph()
G.add_edges_from(binary_conflicts)
# maximal cliques
# ---------------
max_cliques = nx.chordal_graph_cliques(G)
print('naive constraints: raw binary conflicts')
for i in binary_conflicts:
print('sum({}) <= 1'.format(i))
print('improved constraints: clique-constraints')
for i in max_cliques:
print('sum({}) <= 1'.format(list(i)))
Output:
naive constraints: raw binary conflicts
sum((0, 1)) <= 1
sum((0, 2)) <= 1
sum((1, 2)) <= 1
sum((1, 4)) <= 1
sum((2, 4)) <= 1
sum((3, 5)) <= 1
improved constraints: clique-constraints
sum([1, 2, 4]) <= 1
sum([0, 1, 2]) <= 1
sum([3, 5]) <= 1
Fun facts:
Commercial integer-programming solvers and maybe even Cbc might even try to do the same reasoning about clique-constraints to some degree although without the assumption of chordality where it's an NP-hard problem
ortools' cp-sat solver has also a code-path for this (again: general NP-hard case)
Should trigger when expressing the conflict-based model (much harder to decide on this exploitation on general discrete-time based scheduling models)
Caveats
Implementation / Scalability
There are still open questions like:
duplicating max-clique constraints over each worker vs. merging them somehow
be more efficient/clever in finding conflicts (sorting)
will it scale to the data: how big will the graph be / how many conflicts and constraints from those do we need
But those things usually follow instance-statistics (aka "don't decide blindly").
I don't know a name for the specific variant you're describing - maybe others would. However, this indeed seems a good fit for a CP/MIP solver; I would go with the OR-Tools CP-SAT solver, which is free, flexible and usually works well.
Here's a reference implementation with Python, assuming each vehicle requires a team assigned to it with no overlaps, and that the goal is to minimize the number of teams in use.
The framework allows to directly model allowed / forbidden assignments (check out the docs)
from ortools.sat.python import cp_model
model = cp_model.CpModel()
## Data
num_vehicles = 20
max_teams = 10
# Generate some (semi-)interesting data
interval_starts = [i % 9 for i in range(num_vehicles)]
interval_len = [ (num_vehicles - i) % 6 for i in range(num_vehicles)]
interval_ends = [ interval_starts[i] + interval_len[i] for i in range(num_vehicles)]
### variables
# t, v is true iff vehicle v is served by team t
team_assignments = {(t, v): model.NewBoolVar("team_assignments_%i_%i" % (t, v)) for t in range(max_teams) for v in range(num_vehicles)}
#intervals for vehicles. Each interval can be active or non active, according to team_assignments
vehicle_intervals = {(t, v): model.NewOptionalIntervalVar(interval_starts[v], interval_len[v], interval_ends[v], team_assignments[t, v], 'vehicle_intervals_%i_%i' % (t, v))
for t in range(max_teams) for v in range(num_vehicles)}
team_in_use = [model.NewBoolVar('team_in_use_%i' % (t)) for t in range(max_teams)]
## constraints
# non overlap for each team
for t in range(max_teams):
model.AddNoOverlap([vehicle_intervals[t, v] for v in range(num_vehicles)])
# each vehicle must be served by exactly one team
for v in range(num_vehicles):
model.Add(sum(team_assignments[t, v] for t in range(max_teams)) == 1)
# what teams are in use?
for t in range(max_teams):
model.AddMaxEquality(team_in_use[t], [team_assignments[t, v] for v in range(num_vehicles)])
#symmetry breaking - use teams in-order
for t in range(max_teams-1):
model.AddImplication(team_in_use[t].Not(), team_in_use[t+1].Not())
# let's say that the goal is to minimize the number of teams required
model.Minimize(sum(team_in_use))
solver = cp_model.CpSolver()
# optional
# solver.parameters.log_search_progress = True
# solver.parameters.num_search_workers = 8
# solver.parameters.max_time_in_seconds = 5
result_status = solver.Solve(model)
if (result_status == cp_model.INFEASIBLE):
print('No feasible solution under constraints')
elif (result_status == cp_model.OPTIMAL):
print('Optimal result found, required teams=%i' % (solver.ObjectiveValue()))
elif (result_status == cp_model.FEASIBLE):
print('Feasible (non optimal) result found')
else:
print('No feasible solution found under constraints within time')
# Output:
#
# Optimal result found, required teams=7
EDIT:
#sascha suggested a beautiful approach for analyzing the (known in advance) time window overlaps, which would make this solvable as an assignment problem.
So while the formulation above might not be the optimal one for this (although it could be, depending on how the solver works), I've tried to replace the no-overlap conditions with the max-clique approach suggested - full code below.
I did some experiments with moderately large problems (100 and 300 vehicles), and it seems empirically that on smaller problems (~100) this does improve by some - about 15% on average on the time to optimal solution; but I could not find a significant improvement on the larger (~300) problems. This might be either because my formulation is not optimal; because the CP-SAT solver (which is also a good IP solver) is smart enough; or because there's something I've missed :)
Code:
(this is basically the same code from above, with the logic to support using the network approach instead of the no-overlap one copied from #sascha's answer):
from timeit import default_timer as timer
from ortools.sat.python import cp_model
model = cp_model.CpModel()
run_start_time = timer()
## Data
num_vehicles = 300
max_teams = 300
USE_MAX_CLIQUES = True
# Generate some (semi-)interesting data
interval_starts = [i % 9 for i in range(num_vehicles)]
interval_len = [ (num_vehicles - i) % 6 for i in range(num_vehicles)]
interval_ends = [ interval_starts[i] + interval_len[i] for i in range(num_vehicles)]
if (USE_MAX_CLIQUES):
## Max-cliques analysis
# for the max-clique approach
time_windows = [(interval_starts[i], interval_ends[i]) for i in range(num_vehicles)]
def is_overlapping(a, b):
return (b[1] > a[0] and b[0] < a[1])
# raw conflicts
# -------------
binary_conflicts = []
for a, b in itertools.combinations(range(len(time_windows)), 2):
if is_overlapping(time_windows[a], time_windows[b]):
binary_conflicts.append( (a, b) )
# conflict graph
# --------------
G = nx.Graph()
G.add_edges_from(binary_conflicts)
# maximal cliques
# ---------------
max_cliques = nx.chordal_graph_cliques(G)
##
### variables
# t, v is true iff point vehicle v is served by team t
team_assignments = {(t, v): model.NewBoolVar("team_assignments_%i_%i" % (t, v)) for t in range(max_teams) for v in range(num_vehicles)}
#intervals for vehicles. Each interval can be active or non active, according to team_assignments
vehicle_intervals = {(t, v): model.NewOptionalIntervalVar(interval_starts[v], interval_len[v], interval_ends[v], team_assignments[t, v], 'vehicle_intervals_%i_%i' % (t, v))
for t in range(max_teams) for v in range(num_vehicles)}
team_in_use = [model.NewBoolVar('team_in_use_%i' % (t)) for t in range(max_teams)]
## constraints
# non overlap for each team
if (USE_MAX_CLIQUES):
overlap_constraints = [list(l) for l in max_cliques]
for t in range(max_teams):
for l in overlap_constraints:
model.Add(sum(team_assignments[t, v] for v in l) <= 1)
else:
for t in range(max_teams):
model.AddNoOverlap([vehicle_intervals[t, v] for v in range(num_vehicles)])
# each vehicle must be served by exactly one team
for v in range(num_vehicles):
model.Add(sum(team_assignments[t, v] for t in range(max_teams)) == 1)
# what teams are in use?
for t in range(max_teams):
model.AddMaxEquality(team_in_use[t], [team_assignments[t, v] for v in range(num_vehicles)])
#symmetry breaking - use teams in-order
for t in range(max_teams-1):
model.AddImplication(team_in_use[t].Not(), team_in_use[t+1].Not())
# let's say that the goal is to minimize the number of teams required
model.Minimize(sum(team_in_use))
solver = cp_model.CpSolver()
# optional
solver.parameters.log_search_progress = True
solver.parameters.num_search_workers = 8
solver.parameters.max_time_in_seconds = 120
result_status = solver.Solve(model)
if (result_status == cp_model.INFEASIBLE):
print('No feasible solution under constraints')
elif (result_status == cp_model.OPTIMAL):
print('Optimal result found, required teams=%i' % (solver.ObjectiveValue()))
elif (result_status == cp_model.FEASIBLE):
print('Feasible (non optimal) result found, required teams=%i' % (solver.ObjectiveValue()))
else:
print('No feasible solution found under constraints within time')
print('run time: %.2f sec ' % (timer() - run_start_time))

estimated posteriors in JAGS by levels of a factor

I am running an N-mixture model in JAGS, trying to see if posterior predicted values of N are higher in one habitat than another. I am wondering how to obtain posterior probabilities of estimated population size for each habitat individually after running the model. So, e.g., if I wanted to sum across all sites, I'd put
totalN<-sum(N[]) in the JAGS model and identify "totalN" as one of my parameters. If I have 2 habitat levels over which to sum N, do I need a for loop or is there another way to define it?
Below is my model so far...
model{
priors
#abundance
beta0 ~ dnorm(0, 0.001) # log(lambda) intercept
beta1 ~ dnorm(0, 0.001) #this is my regression parameter for habitat
tau.T ~ dgamma(0.001, 0.001) #this is for random effect of transect
# detection
alpha.p ~ dgamma(0.01, 0.01)
beta.p ~ dgamma (0.01, 0.01)
Poisson model for abundance
for (i in 1:nsite){
loglam[i] <- beta1*habitat[i] + ranef[transect[i]]
loglam.lim[i] <- min(250, max(-250, loglam[i])) # 'Stabilize' log
lam[i] <- exp(loglam.lim[i])
N[i] ~ dpois(lam[i])
}
for (i in 1:14){
ranef[i]~dnorm(beta0,tau.T)
}
Measurement error model
for (i in 1:nsite){
for (j in 1:nrep){
y[i,j] ~ dbin(p[i,j], N[i])
p[i,j] ~ dbeta(alpha.p,beta.p) #detection probability follows a beta distribution
}
}
posterior predictions
Nperhabitat<-sum(N[habitat]) #this doesn't work, only estimates a single set of posterior densities for N
#and get a derived detection probability
}
I am going to assume here that habitat is a binary vector. I would add two additional vectors to your data that define which elements in habitat are 1 and which are 0. From there you can index N with those two vectors.
# done in R and added to the data list supplied to JAGS
hab_1 <- which(habitat == 1)
hab_0 <- which(habitat == 0)
# add to data list
data_list <- list(..., hab_1 = hab_1, hab_0 = hab_0)
Then, inside the JAGS model you would just add:
N_habitat_1 <- sum(N[hab_1])
N_habitat_0 <- sum(N[hab_0])
This is effectively telling JAGS to provide the total abundance per habitat type. If you have way more sites of one habitat vs another this abundance may hide that the density of individuals could actually be less. Thus, you may want to divide this abundance by the total number of sites of each habitat type:
dens_habitat_1 <- sum(N[hab_1]) / sum(habitat)
dens_habitat_0 <- sum(N[hab_0]) / sum(1 - habitat)
This is, of course, assuming that habitat is binary.

How to obtain multinomial probabilities in WinBUGS with multiple regression

In WinBUGS, I am specifying a model with a multinomial likelihood function, and I need to make sure that the multinomial probabilities are all between 0 and 1 and sum to 1.
Here is the part of the code specifying the likelihood:
e[k,i,1:9] ~ dmulti(P[k,i,1:9],n[i,k])
Here, the array P[] specifies the probabilities for the multinomial distribution.
These probabilities are to be estimated from my data (the matrix e[]) using multiple linear regressions on a series of fixed and random effects. For instance, here is the multiple linear regression used to predict one of the elements of P[]:
P[k,1,2] <- intercept[1,2] + Slope1[1,2]*Covariate1[k] +
Slope2[1,2]*Covariate2[k] + Slope3[1,2]*Covariate3[k]
+ Slope4[1,2]*Covariate4[k] + RandomEffect1[group[k]] +
RandomEffect2[k]
At compiling, the model produces an error:
elements of proportion vector of multinomial e[1,1,1] must be between zero and one
If I understand this correctly, this means that the elements of the vector P[k,i,1:9] (the probability vector in the multinomial likelihood function above) may be very large (or small) numbers. In reality, they all need to be between 0 and 1, and sum to 1.
I am new to WinBUGS, but from reading around it seems that somehow using a beta regression rather than multiple linear regressions might be the way forward. However, although this would allow each element to be between 0 and 1, it doesn't seem to get to the heart of the problem, which is that all the elements of P[k,i,1:9] must be positive and sum to 1.
It may be that the response variable can very simply be transformed to be a proportion. I have tried this by trying to divide each element by the sum of P[k,i,1:9], but so far no success.
Any tips would be very gratefully appreciated!
(I have supplied the problematic sections of the model; the whole thing is fairly long.)
The usual way to do this would be to use the multinomial equivalent of a logit link to constrain the transformed probabilities to the interval (0,1). For example (for a single predictor but it is the same principle for as many predictors as you need):
Response[i, 1:Categories] ~ dmulti(prob[i, 1:Categories], Trials[i])
phi[i,1] <- 1
prob[i,1] <- 1 / sum(phi[i, 1:Categories])
for(c in 2:Categories){
log(phi[i,c]) <- intercept[c] + slope1[c] * Covariate1[i]
prob[i,c] <- phi[i,c] / sum(phi[i, 1:Categories])
}
For identifibility the value of phi[1] is set to 1, but the other values of intercept and slope1 are estimated independently. When the number of Categories is equal to 2, this collapses to the usual logistic regression but coded for a multinomial response:
log(phi[i,2]) <- intercept[2] + slope1[2] * Covariate1[i]
prob[i,2] <- phi[i, 2] / (1 + phi[i, 2])
prob[i,1] <- 1 / (1 + phi[i, 2])
ie:
logit(prob[i,2]) <- intercept[2] + slope1[2] * Covariate1[i]
prob[i,1] <- 1 - prob[i,2]
In this model I have indexed slope1 by the category, meaning that each level of the outcome has an independent relationship with the predictor. If you have an ordinal response and want to assume that the odds ratio associated with the covariate is consistent between successive levels of the response, then you can drop the index on slope1 (and reformulate the code slightly so that phi is cumulative) to get a proportional odds logistic regression (POLR).
Edit
Here is a link to some example code covering logistic regression, multinomial regression and POLR from a course I teach:
http://runjags.sourceforge.net/examples/squirrels.R
Note that it uses JAGS (rather than WinBUGS) but as far as I know there are no differences in model syntax for these types of models. If you want to quickly get started with runjags & JAGS from a WinBUGS background then you could follow this vignette:
http://runjags.sourceforge.net/quickjags.html

How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm with different preprocessed data: NMF and PCA.
In order to get somehow a comparable result, instead of choosing just the same number of components for each PCA and NMF, I would like to pick the amount that explains e.g 95% of retained variance.
I was wondering if its possible to identify the variance retained in each component of NMF.
For instance using PCA this would be given by:
retainedVariance(i) = eigenvalue(i) / sum(eigenvalue)
Any ideas?
TL;DR
You should loop over different n_components and estimate explained_variance_score of the decoded X at each iteration. This will show you how many components do you need to explain 95% of variance.
Now I will explain why.
Relationship between PCA and NMF
NMF and PCA, as many other unsupervised learning algorithms, are aimed to do two things:
encode input X into a compressed representation H;
decode H back to X', which should be as close to X as possible.
They do it in a somehow similar way:
Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and result may be nonlinear in X.
Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, each next kth component minimizes residual MSE subject to being orthogonal with the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
How to measure performance of dimensionality reduction
If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:
Train your encoder+decoder on X_train
To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.
Let's try it with PCA and NMF!
from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train the both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))
It will print you explained variance ratio of 0.9536930834362043 - the default metric of PCA, estimated using its eigenvalues. We can measure it in a more direct way - by applying a metric to actual and "predicted" values:
def get_score(model, data, scorer=metrics.explained_variance_score):
""" Estimate performance of the model on the data """
prediction = model.inverse_transform(model.transform(data))
return scorer(data, prediction)
print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))
print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))
which gives
train set performance
0.9536930834362043 # same as before!
0.937291711378812
test set performance
0.9597828443047842
0.9590555069007827
You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens, because NMF applies lots of regularization:
H and W (the learned parameter) must be non-negative
H should be as small as possible (L1 and L2 penalties)
W should be as small as possible (L1 and L2 penalties)
These regularizations make NMF fit worse than possible to the training data, but they might improve its generalizing ability, which happened in our case.
How to choose the number of components
In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add the new component h_(k+1), the first k will not change. Thus, you can estimate performance of each component, and these estimates will not depent on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to data.
NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the k+1th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
But what you can to is to fit a new instance of NMF for each number of components, and compare the total explained variance:
ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
perfs_train.append(get_score(nmf, X_train_sc))
perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)
which would give
[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]
Thus, three components (judging by the train set performance) or two components (by the test set) are required to explain at least 95% of variance. Please notice that this case is unusual and caused by a small size of training and test data: usually performance degrades a little bit on the test set, but in my case it actually improved a little.

Random effects modeling using mgcv and using lmer. Basically identical fits but VERY different likelihoods and DF. Which to use for testing?

I am aware that there is a duality between random effects and smooth curve estimation. At this link, Simon Wood describes how to specify random effects using mgcv. Of particular note is the following passage:
For example if g is a factor then s(g,bs="re") produces a random coefficient for each level of g, with the radndom coefficients all modelled as i.i.d. normal.
After a quick simulation, I can see this is correct, and that the model fits are almost identical. However, the likelihoods and degrees of freedom are VERY different. Can anyone explain the difference? Which one should be used for testing?
library(mgcv)
library(lme4)
set.seed(1)
x <- rnorm(1000)
ID <- rep(1:200,each=5)
y <- x
for(i in 1:200) y[which(ID==i)] <- y[which(ID==i)] + rnorm(1)
y <- y + rnorm(1000)
ID <- as.factor(ID)
# gam (mgcv)
m <- gam(y ~ x + s(ID,bs="re"))
gam.vcomp(m)
coef(m)[1:2]
logLik(m)
# lmer
m2 <- lmer(y ~ x + (1|ID))
sqrt(VarCorr(m2)$ID[1])
summary(m2)$coef[,1]
logLik(m2)
mean( abs( fitted(m)-fitted(m2) ) )
Full disclosure: I encountered this problem because I want to fit a GAM that also includes random effects (repeated measures), but need to know if I can trust likelihood-based tests under those models.

Resources