GAM error: Fitting terminated with step failure - check results carefully - gam

I’m fitting a GAM with the mgcv package to predict the burrow abundance and distribution of two species on an island, using data collected during a field trip and imagery from the Sentinel-2 satellite. 101 plots were surveyed; 922 burrows of species 1 were recorded in 66 plots, and 29 burrows of species 2 were recorded in 8 plots.
I used a negative binomial distribution for species 1, as a Poisson model was overdispersed. The maximal model was:
gam(Species_1 ~ s(x, y, bs="ts") +
Sentinel2_band_1 + Sentinel2_band_2 + Sentinel2_band_3 + Sentinel2_band_4 + Sentinel2_band_5 +
Sentinel2_band_6 + Sentinel2_band_7 + Sentinel2_band_8 + Sentinel2_band_9 + Sentinel2_band_10 +
I(Sentinel2_band_1^2) + I(Sentinel2_band_2^2) + I(Sentinel2_band_3^2) + I(Sentinel2_band_4^2) + I(Sentinel2_band_5^2) +
I(Sentinel2_band_6^2) + I(Sentinel2_band_7^2) + I(Sentinel2_band_8^2) + I(Sentinel2_band_9^2) + I(Sentinel2_band_10^2) +
aspect + elevation + slope +
I(aspect^2) + I(elevation^2) + I(slope^2) +
aspect:elevation + aspect:slope + elevation:slope,
data = dat,
family = nb(1))
The model selection process has resulted in a model that gives acceptable results.
When I run the same model using species 2 as the response variable I get the following error message:
Warning message:
In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L = G$L, :
Fitting terminated with step failure - check results carefully
The diagnostic plots also look pretty dodgy.
My assumption is that the issue I’m encountering is due to the much smaller sample size for species 2.
Any ideas what I can do to resolve this problem?
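For reference, this is the sort of stripped-down model I was thinking of trying for species 2 (just a sketch: the covariate subset is arbitrary, the column name Species_2 simply mirrors the naming above, and letting nb() estimate theta is an assumption on my part rather than anything I have settled on):

# A much smaller candidate model for the sparse species 2 counts:
# one spatial smooth plus a few terrain covariates, with theta estimated from the data.
m_sp2 <- gam(Species_2 ~ s(x, y, bs = "ts", k = 20) +
               elevation + slope + aspect,
             data = dat,
             family = nb(),
             method = "REML")
summary(m_sp2)

The idea is just to cut the number of estimated terms down to something the 8 occupied plots might plausibly support.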

Related

I'm looking for ways to make this lexicographical code faster

I've been working on code to calculate the distances between 33 3D points and find the shortest route between them. The initial code took in all 33 points, paired them consecutively, calculated the distances between the pairs using math.sqrt, and summed them all up to get a final distance.
My problem is that with the sheer number of permutations of a list of 33 points (33 factorial!) the code is going to need to be at its absolute best to find the answer within a human lifetime (assuming I can use as many CPUs as I can get my hands on to increase the sheer computational power).
I've designed a simple web server to hand out an integer, which gets converted to a list; the code then performs a set number of lexicographical permutations from that point and sends back the resulting shortest distance of that block. This part is fine, but I have concerns over the code that does the distance calculations.
I've put together a test version of my code so I could change things and see if it made the execution time faster or slower. This code starts at the beginning of the permutation list (0 to 32) in order and performs 50 million lexicographical iterations on it, checking the distance of the points at every iteration. The code is detailed below.
import json
import datetime
import math
def next_lexicographic_permutation(x):
    # Advance x to the next permutation in lexicographic order (in place),
    # or return False if x is already the last permutation.
    i = len(x) - 2
    while i >= 0:
        if x[i] < x[i+1]:
            break
        else:
            i -= 1
    if i < 0:
        return False
    j = len(x) - 1
    while j > i:
        if x[j] > x[i]:
            break
        else:
            j -= 1
    x[i], x[j] = x[j], x[i]
    reverse(x, i + 1)
    return x

def reverse(arr, i):
    # Reverse arr[i:] in place.
    if i > len(arr) - 1:
        return
    j = len(arr) - 1
    while i < j:
        arr[i], arr[j] = arr[j], arr[i]
        i += 1
        j -= 1
# ip for initial permutation
ip = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]
lookup = '{"0":{"name":"van Maanen\'s Star","x":-6.3125,"y":-11.6875,"z":-4.125},\
"1":{"name":"Wolf 124","x":-7.25,"y":-27.1562,"z":-19.0938},\
"2":{"name":"Midgcut","x":-14.625,"y":10.3438,"z":13.1562},\
"3":{"name":"PSPF-LF 2","x":-4.40625,"y":-17.1562,"z":-15.3438},\
"4":{"name":"Wolf 629","x":-4.0625,"y":7.6875,"z":20.0938},\
"5":{"name":"LHS 3531","x":1.4375,"y":-11.1875,"z":16.7812},\
"6":{"name":"Stein 2051","x":-9.46875,"y":2.4375,"z":-15.375},\
"7":{"name":"Wolf 25","x":-11.0625,"y":-20.4688,"z":-7.125},\
"8":{"name":"Wolf 1481","x":5.1875,"y":13.375,"z":13.5625},\
"9":{"name":"Wolf 562","x":1.46875,"y":12.8438,"z":15.5625},\
"10":{"name":"LP 532-81","x":-1.5625,"y":-27.375,"z":-32.3125},\
"11":{"name":"LP 525-39","x":-19.7188,"y":-31.125,"z":-9.09375},\
"12":{"name":"LP 804-27","x":3.3125,"y":17.8438,"z":43.2812},\
"13":{"name":"Ross 671","x":-17.5312,"y":-13.8438,"z":0.625},\
"14":{"name":"LHS 340","x":20.4688,"y":8.25,"z":12.5},\
"15":{"name":"Haghole","x":-5.875,"y":0.90625,"z":23.8438},\
"16":{"name":"Trepin","x":26.375,"y":10.5625,"z":9.78125},\
"17":{"name":"Kokary","x":3.5,"y":-10.3125,"z":-11.4375},\
"18":{"name":"Akkadia","x":-1.75,"y":-33.9062,"z":-32.9688},\
"19":{"name":"Hill Pa Hsi","x":29.4688,"y":-1.6875,"z":25.375},\
"20":{"name":"Luyten 145-141","x":13.4375,"y":-0.8125,"z":6.65625},\
"21":{"name":"WISE 0855-0714","x":6.53125,"y":-2.15625,"z":2.03125},\
"22":{"name":"Alpha Centauri","x":3.03125,"y":-0.09375,"z":3.15625},\
"23":{"name":"LHS 450","x":-12.4062,"y":7.8125,"z":-1.875},\
"24":{"name":"LP 245-10","x":-18.9688,"y":-13.875,"z":-24.2812},\
"25":{"name":"Epsilon Indi","x":3.125,"y":-8.875,"z":7.125},\
"26":{"name":"Barnard\'s Star","x":-3.03125,"y":1.375,"z":4.9375},\
"27":{"name":"Epsilon Eridani","x":1.9375,"y":-7.75,"z":-6.84375},\
"28":{"name":"Narenses","x":-1.15625,"y":-11.0312,"z":21.875},\
"29":{"name":"Wolf 359","x":3.875,"y":6.46875,"z":-1.90625},\
"30":{"name":"LAWD 26","x":20.9062,"y":-7.5,"z":3.75},\
"31":{"name":"Avik","x":13.9688,"y":-4.59375,"z":-6.0},\
"32":{"name":"George Pantazis","x":-12.0938,"y":-16.0,"z":-14.2188}}'
lookup = json.loads(lookup)
lowest_total = 9999
# create 2D array for the distances and called it b to keep code looking clean.
b = [[0 for i in range(33)] for j in range(33)]
for x in range(33):
    for y in range(33):
        if x == y:
            continue
        else:
            b[x][y] = math.sqrt(((lookup[str(x)]["x"] - lookup[str(y)]['x']) ** 2) + ((lookup[str(x)]['y'] - lookup[str(y)]['y']) ** 2) + ((lookup[str(x)]['z'] - lookup[str(y)]['z']) ** 2))
# begin timer
start_date = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
start = datetime.datetime.now()
print("[{}] Start".format(start_date))
# main iteration loop
for x in range(50_000_000):
    distance = b[ip[0]][ip[1]] + b[ip[1]][ip[2]] + b[ip[2]][ip[3]] + \
               b[ip[3]][ip[4]] + b[ip[4]][ip[5]] + b[ip[5]][ip[6]] + \
               b[ip[6]][ip[7]] + b[ip[7]][ip[8]] + b[ip[8]][ip[9]] + \
               b[ip[9]][ip[10]] + b[ip[10]][ip[11]] + b[ip[11]][ip[12]] + \
               b[ip[12]][ip[13]] + b[ip[13]][ip[14]] + b[ip[14]][ip[15]] + \
               b[ip[15]][ip[16]] + b[ip[16]][ip[17]] + b[ip[17]][ip[18]] + \
               b[ip[18]][ip[19]] + b[ip[19]][ip[20]] + b[ip[20]][ip[21]] + \
               b[ip[21]][ip[22]] + b[ip[22]][ip[23]] + b[ip[23]][ip[24]] + \
               b[ip[24]][ip[25]] + b[ip[25]][ip[26]] + b[ip[26]][ip[27]] + \
               b[ip[27]][ip[28]] + b[ip[28]][ip[29]] + b[ip[29]][ip[30]] + \
               b[ip[30]][ip[31]] + b[ip[31]][ip[32]]
    if distance < lowest_total:
        lowest_total = distance
    ip = next_lexicographic_permutation(ip)
# end timer
finish_date = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
finish = datetime.datetime.now()
print("[{}] Finish".format(finish_date))
diff = finish - start
print("Time taken => {}".format(diff))
print("Lowest distance => {}".format(lowest_total))
This is the result of a lot of work to make things faster. I was initially using string look-ups to find the distance, with a dict having keys like "1-2", but I very quickly found that this was slow. I then moved on to hashed versions of the "1-2" key and the speed increased, but the fastest approach I have found so far is using a 2D array and looking the values up from there.
I have also found that writing out the distance calculation manually saved time over a for x in range(32): loop adding the distances up and incrementing a variable to get the total.
Another great speed up was using pypy3 instead of python3 to execute it.
This usually takes 11 seconds to complete using pypy3. Running 50 million of the distance calculations on their own takes 5.2 seconds, and running 50 million calls to the next_lexicographic_permutation function on its own takes 6 seconds.
I can't think of any further way to make this faster myself, but I believe there may be optimizations to be made in the next_lexicographic_permutation function. From what I've read, the main bottleneck seems to be the swapping of positions in the array:
x[i], x[j] = x[j], x[i]
Edit: added a clarification that "lifetime" means a human lifetime.
The brute-force approach of calculating all the distances is going to be slower than a partitioning approach. Here is a similar question for the 3D case.

Measurement invariance in Lavaan (subgroup analysis); Warning: covariance matrix of latent variables is not positive definite

I have a good-fitting CFA model and am now conducting some additional analyses by subgroup (e.g., gender, age, education, ...). Specifically, I'm calculating fit indices for measurement invariance (configural, metric, scalar). Everything looks fine except for one of my subgroups: I get the following warning message in lavaan for a group that has a relatively small sample size:
Warning message:
In lav_object_post_check(object) :
lavaan WARNING: covariance matrix of latent variables
is not positive definite;
use lavInspect(fit, "cov.lv") to investigate.
Model specification and fit:
model <- '
F1 =~ a + b + c + d + e
F2 =~ f + g + h + i + j
F3 =~ k + l + m + n + o
F4 =~ p + q + r + s + t
F5 =~ u + v + w + x + y
Crossload =~ g + j + n + o + q
'
fit.model.group3 <- sem(model, data=MyData.group3, estimator='WLSMV', missing='pairwise', ordered = T)
fit.model.configural.edu <- sem(model, data=MyData.edu, estimator='WLSMV', missing='pairwise', ordered = T, group = "edu")
This grouping variable (edu) has 3 levels (Ns: group1 = 1150, group2 = 215, group3 = 120), and I only get the warning for group3 (the smallest group). I can fix it by combining some factors (which were strongly correlated for this group) and then fitting the more parsimonious (4-factor) model. However, as the point of this subgroup analysis is to investigate the same CFA model across the different subgroups, I don't think it's appropriate to change the model specification(?). So, is it okay to ignore this warning message for the sake of the measurement invariance analysis (the model still converges normally, and I get the necessary metrics)? Note that some of the eigenvalues are negative, and I get the following output from this call:
> det(lavInspect(fit.model.group3, "cov.lv"))
[1] 2.544325e-06
As it's close to zero, could it be a machine precision issue?
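(For reference, this is roughly how I looked at the eigenvalues; a small sketch using the same fitted object as above:)

# Model-implied covariance matrix of the latent variables
lv_cov <- lavInspect(fit.model.group3, "cov.lv")

# Eigenvalues: any negative values mean the matrix is not positive definite
eigen(lv_cov, symmetric = TRUE, only.values = TRUE)$values

# Determinant, as reported above
det(lv_cov)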
Any advice would be very much appreciated. Unfortunately, my technical knowledge is lacking and so I've struggled to follow some of the existing advice elsewhere.
Best wishes,
Mary

How to evaluate the significance of a three-way interaction with a quadratic term?

I have the following two mixed models
Full model (with the three-way interaction)
m_Kunkle_pacc3_n <- lmer(pacc3_old ~ PRS_Kunkle*AgeAtVisit +
PRS_Kunkle*I(AgeAtVisit^2) +
APOE_score*AgeAtVisit + APOE_score*I(AgeAtVisit^2) + PRS_Kunkle*APOE_score + famhist +
+ gender + EdYears_Coded_Max20 + VisNo + X1 + X2 + X3 + X4 + X5 +
(1 |family/DBID),
data = WRAP_all, REML = F)
Nested model (excluding the three-way interaction; two terms are removed: the three-way interaction with linear age and the three-way interaction with quadratic age)
m_Kunkle_pacc3 <- lmer(pacc3_old ~ PRS_Kunkle*AgeAtVisit*APOE_score +
PRS_Kunkle*I(AgeAtVisit^2)*APOE_score +
+ gender + EdYears_Coded_Max20 + VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
(1 |family/DBID),
data = WRAP_all, REML = F)
I used a likelihood ratio test to compare the full model and the nested model. Is this the correct way to test the significance of the three-way interaction?
pacc3_LRT_Kunkle <- anova(m_Kunkle_pacc3, m_Kunkle_pacc3_n, test = "chisq")
Many thanks
If you are interested in testing the significance of the three-way interaction, I think in general you should do that within the context of a single model. You first select a model based on theoretical considerations and sometimes certain indices, and then look at the parameters of the model you pick. For example, the BIC is related to a model's negative log-likelihood penalized by its complexity (it also depends on your sample size), and you can use the BIC to select a model among competing choices. Once you pick a model that has a certain interaction term within it, you evaluate its coefficient. I should warn you that interpreting three-way interactions can be very challenging, so you should consider that in the context of your problem as well.
TL;DR: comparing a model that has a term with one that does not have it (whether you look at their R^2, compare likelihoods or penalized likelihoods, etc.) will tell you something about the whole model, not about the parameter itself.
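For illustration only, here is roughly what that looks like in code. This is a sketch, not your analysis: m_full is just a placeholder name for whichever single model you settle on (here the specification with the three-way terms from your post), and lmerTest is one common way of attaching p-values to lmer coefficients.

library(lmerTest)  # its lmer() adds Satterthwaite df and p-values to summary()

m_full <- lmer(pacc3_old ~ PRS_Kunkle*AgeAtVisit*APOE_score +
                 PRS_Kunkle*I(AgeAtVisit^2)*APOE_score +
                 gender + EdYears_Coded_Max20 + VisNo + famhist +
                 X1 + X2 + X3 + X4 + X5 +
                 (1 | family/DBID),
               data = WRAP_all, REML = FALSE)

# Look directly at the rows for the three-way terms
# (PRS_Kunkle:AgeAtVisit:APOE_score and the quadratic-age counterpart)
summary(m_full)$coefficients

You would then interpret (and, ideally, plot) those specific coefficients rather than relying on a whole-model comparison.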

Python implementation of bilinear quadrilateral interpolation

I'm trying to perform bilinear quadrilateral interpolation: I have four nodes with known values, and I want to find a value that lies in between those four nodes by interpolation, but the four nodes do not form a rectangle. (Figure: 4-node sketch.)
I found several ways to solve this, but none of them is already implemented in Python. Is there an existing Python implementation somewhere? If not, which of the two solutions below would you recommend? Or would you recommend another approach?
**************Different solutions*******************
Solution 1:
I found here, https://www.colorado.edu/engineering/CAS/courses.d/IFEM.d/IFEM.Ch16.d/IFEM.Ch16.pdf, that I should solve the isoparametric mapping equations for the local coordinates, with the Ni being the bilinear shape functions (written out just below). Finally this results in solving a set of equations of the form
a*x + b*y + c*x*y = z1
d*x + e*y + f*x*y = z2
with x and y being the unknowns. This could be solved numerically using fsolve.
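For reference, my reading of those notes, written with the same local coordinates $\eta$ and $\mu$ that appear in the code below, is that the shape functions are

$$N_1 = \tfrac{1}{4}(1-\eta)(1-\mu), \quad N_2 = \tfrac{1}{4}(1+\eta)(1-\mu), \quad N_3 = \tfrac{1}{4}(1+\eta)(1+\mu), \quad N_4 = \tfrac{1}{4}(1-\eta)(1+\mu),$$

and the two equations to solve for $(\eta, \mu)$ are

$$\sum_{i=1}^{4} N_i(\eta,\mu)\,x_i = x, \qquad \sum_{i=1}^{4} N_i(\eta,\mu)\,y_i = y,$$

where $(x_i, y_i)$ are the corner coordinates and $(x, y)$ is the point to be interpolated; the interpolated value is then $\sum_i N_i(\eta,\mu)\,z_i$.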
Solution 2:
This one is completely explained here: https://math.stackexchange.com/questions/828392/spatial-interpolation-for-irregular-grid
but it's quite complex and I think it will take me longer to code it.
Due to a lack of answers I went for the first option. You can find the code below. Recommendations to improve this code are always welcome.
import numpy as np
from scipy.optimize import fsolve
def interpolate_quatrilateral(pt1, pt2, pt3, pt4, pt):
    '''Interpolates a value in a quatrilateral figure defined by 4 points.
    Each point is a tuple with 3 elements, x-coo, y-coo and value.
    point1 is the lower left corner, point 2 the lower right corner,
    point 3 the upper right corner and point 4 the upper left corner.
    args is a list of coordinates in the following order:
    x1,x2,x3,x4 and x (x-coo of point to be interpolated) and y1,y2...
    code based on the theory found here:
    https://www.colorado.edu/engineering/CAS/courses.d/IFEM.d/IFEM.Ch16.d/IFEM.Ch16.pdf'''
    coos = (pt1[0], pt2[0], pt3[0], pt4[0], pt[0],
            pt1[1], pt2[1], pt3[1], pt4[1], pt[1])  # coordinates of the points merged in a tuple
    guess = np.array([0, 0])  # the center of the quadrilateral seems like a good place to start
    [eta, mu] = fsolve(func=find_local_coo_equations, x0=guess, args=coos)
    densities = (pt1[2], pt2[2], pt3[2], pt4[2])
    density = find_density(eta, mu, densities)
    return density

def find_local_coo_equations(guess, *args):
    '''This function creates the transformed coordinate equations of the quatrilateral.'''
    eta = guess[0]
    mu = guess[1]
    eq = [0, 0]  # initialize eq
    eq[0] = 1 / 4 * (args[0] + args[1] + args[2] + args[3]) - args[4] + \
            1 / 4 * (-args[0] - args[1] + args[2] + args[3]) * mu + \
            1 / 4 * (-args[0] + args[1] + args[2] - args[3]) * eta + \
            1 / 4 * (args[0] - args[1] + args[2] - args[3]) * mu * eta
    eq[1] = 1 / 4 * (args[5] + args[6] + args[7] + args[8]) - args[9] + \
            1 / 4 * (-args[5] - args[6] + args[7] + args[8]) * mu + \
            1 / 4 * (-args[5] + args[6] + args[7] - args[8]) * eta + \
            1 / 4 * (args[5] - args[6] + args[7] - args[8]) * mu * eta
    return eq

def find_density(eta, mu, densities):
    '''Finds the final density based on the eta and mu local coordinates calculated
    earlier and the densities of the 4 points'''
    N1 = 1 / 4 * (1 - eta) * (1 - mu)
    N2 = 1 / 4 * (1 + eta) * (1 - mu)
    N3 = 1 / 4 * (1 + eta) * (1 + mu)
    N4 = 1 / 4 * (1 - eta) * (1 + mu)
    density = densities[0] * N1 + densities[1] * N2 + densities[2] * N3 + densities[3] * N4
    return density
pt1= (0,0,1)
pt2= (1,0,1)
pt3= (1,1,2)
pt4= (0,1,2)
pt= (0.5,0.5)
print(interpolate_quatrilateral(pt1,pt2,pt3,pt4,pt))

Excluded variables from regression

SPSS keeps excluding a variable from my regression, and I am not exactly sure why. Here is where I started:
Perf = ILTProt + LProt + AbsoluteFitProt + Male + EDUC + Age + C
I then decided to switch out the AbsoluteFitProt variable for a different measure of a similar thing to give:
Perf = ILTProt + LProt + FitProt + Male + EDUC + Age + C
But SPSS keeps omitting ILTProt so I end up with
Perf = LProt + FitProt + Male + EDUC + Age + C
Does anyone know why this may be, or how to fix it?
So this was due to collinearity between ILTProt and FitProt. I'm not sure how to fix it, but I have instead just decided to omit the excluded variable.
This (a variable being automatically omitted from a regression model) typically occurs when the variable is a constant or has perfect collinearity/correlation with another variable. So check whether ILTProt and FitProt have a perfect correlation.
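If it helps, one quick way to check this outside SPSS is sketched below in R; this assumes, purely for illustration, that the same variables are available in a data frame called dat and that the car package is installed.

# Correlation between the two suspect predictors; a value at (or numerically
# indistinguishable from) +/-1 means one is redundant given the other.
cor(dat$ILTProt, dat$FitProt, use = "complete.obs")

# Variance inflation factors for the full set of predictors;
# they blow up under (near-)perfect collinearity.
fit <- lm(Perf ~ ILTProt + LProt + FitProt + Male + EDUC + Age, data = dat)
car::vif(fit)

The same check can be run inside SPSS with a bivariate correlation and the collinearity diagnostics option in the linear regression dialog.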
