Excluded variables from regression - statistics

SPSS keeps excluding a variable from my regression, and I am not exactly sure why. Here is where I started:
Perf = ILTProt + LProt + AbsoluteFitProt + Male + EDUC + Age + C
I then decided to switch out the AbsoluteFitProt variable for a different measure of a similar thing to give:
Perf = ILTProt + LProt + FitProt + Male + EDUC + Age + C
But SPSS keeps omitting ILTProt so I end up with
Perf = LProt + FitProt + Male + EDUC + Age + C
Does anyone know why this may be, or how to fix it?

So this was due to collinearity between ILTProt and FitProt. I am not sure how to fix it, so I have instead just decided to omit the excluded variable.

This (a variable being automatically omitted from a regression model) typically occurs when the variable is a constant or is perfectly collinear (perfectly correlated) with another variable. So check whether ILTProt and FitProt are perfectly correlated.
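The check suggested above can be sketched in a few lines of Python. This is only an illustration: the data are made up, the names `ilt_prot`, `fit_prot` and `l_prot` are hypothetical stand-ins for the real columns, and a pairwise correlation only catches collinearity between two variables, not a linear dependence spread across several predictors.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant column is the other common reason a variable is dropped
        raise ValueError("a constant column has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: fit_prot is an exact linear function of ilt_prot
ilt_prot = [1.0, 2.0, 3.0, 4.0, 5.0]
fit_prot = [2 * v + 1 for v in ilt_prot]
l_prot = [2.0, 1.0, 4.0, 3.0, 6.0]

r_perfect = pearson(ilt_prot, fit_prot)  # perfectly collinear pair
r_other = pearson(ilt_prot, l_prot)      # correlated, but not perfectly

print("ILTProt vs FitProt:", r_perfect)
print("ILTProt vs LProt:  ", r_other)
```

A correlation of exactly 1 or -1 (up to rounding) between two predictors means one of them carries no extra information, so the software has to drop one.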

Related

PuLP objective function not incorporating constant in lpSum

Any idea why the following objective function doesn't include the -5000 in its formulation?
prob += lp.lpSum([-5000 + price_today_d[i] * ticker_vars[i] for i in ticker_list]), 'Total Cost'
Result:
Total_Cost: 0.82 ticker_A + 27.55 ticker_B
+ 32.73 ticker_C + 30.14 ticker_D + 26.55 ticker_E
By default PuLP ignores constant values since they are not relevant for obtaining the optimal solution. You can always add it after solving.
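To illustrate why a constant does not affect the optimum and can be added back afterwards, here is a minimal pure-Python sketch. It does not call PuLP: the prices, the feasible set and the brute-force search are all hypothetical and only stand in for a real model. Note that, as written in the question, the -5000 sits inside the summed terms, so it would be added once per ticker rather than once overall.

```python
from itertools import product

# Hypothetical prices; the real model would use price_today_d
prices = {"ticker_A": 0.82, "ticker_B": 27.55, "ticker_C": 32.73}
CONSTANT_PER_TERM = -5000

def cost_without_constant(qty):
    """Objective with only the variable-dependent part."""
    return sum(prices[t] * q for t, q in zip(prices, qty))

def cost_with_constant(qty):
    """Mirrors the question: the constant is added once per ticker."""
    return sum(CONSTANT_PER_TERM + prices[t] * q for t, q in zip(prices, qty))

# Hypothetical feasible set: each quantity is 0..3 and at least 4 units in total
feasible = [q for q in product(range(4), repeat=len(prices)) if sum(q) >= 4]

best_without = min(feasible, key=cost_without_constant)
best_with = min(feasible, key=cost_with_constant)

# The constant shifts every candidate by the same amount, so add it after solving
total_constant = CONSTANT_PER_TERM * len(prices)
recovered = cost_without_constant(best_without) + total_constant

print(best_without, best_with)
print(recovered, cost_with_constant(best_with))
```

Both searches pick the same quantities, and adding the total constant to the constant-free optimum reproduces the full objective value.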

GAM error: Fitting terminated with step failure - check results carefully

I’m writing a GAM using the mgcv package that predicts burrow abundance and distribution of two different species on an island using data obtained during a field trip and images taken from the Sentinel satellite. 101 plots were surveyed. 922 burrows belonging to species 1 were recorded in 66 plots and 29 burrows belonging to species 2 were recorded in 8 plots.
I used a negative binomial distribution for species 1, as using a Poisson distribution resulted in the model being overdispersed. The maximal model was:
gam(Species_1 ~ s(x, y, bs="ts") +
Sentinel2_band_1 + Sentinel2_band_2 + Sentinel2_band_3 + Sentinel2_band_4 + Sentinel2_band_5 +
Sentinel2_band_6 + Sentinel2_band_7 + Sentinel2_band_8 + Sentinel2_band_9 + Sentinel2_band_10 +
I(Sentinel2_band_1^2) + I(Sentinel2_band_2^2) + I(Sentinel2_band_3^2) + I(Sentinel2_band_4^2) + I(Sentinel2_band_5^2) +
I(Sentinel2_band_6^2) + I(Sentinel2_band_7^2) + I(Sentinel2_band_8^2) + I(Sentinel2_band_9^2) + I(Sentinel2_band_10^2) +
aspect + elevation + slope +
I(aspect^2) + I(elevation^2) + I(slope^2) +
aspect:elevation + aspect:slope + elevation:slope,
data = dat,
family = nb(1))
The model selection process has resulted in a model that gives acceptable results.
When I run the same model using species 2 as the response variable I get the following error message:
Warning message:
In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L = G$L, :
Fitting terminated with step failure - check results carefully
The diagnostic plots also look pretty dodgy:
My assumption is that the issue I’m encountering is due to the much smaller sample size for species 2.
Any ideas what I can do to resolve this problem?

What is this formula trying to prove?

I have a large spreadsheet with a number of formulas and they all make complete sense apart from one, which is listed below. Does anyone have any idea what this NORMDIST calculation is trying to achieve or tell me? It has relevance to HE
=MAX(1,NORMDIST(3,N18,N18/4,TRUE)-NORMDIST(0,N18,N18/4,TRUE) + 2*(NORMDIST(6,N18,N18/4,TRUE)-NORMDIST(3,N18,N18/4,TRUE)) + 3*(NORMDIST(9,N18,N18/4,TRUE)-NORMDIST(6,N18,N18/4,TRUE)) + 4*(NORMDIST(12,N18,N18/4,TRUE)-NORMDIST(9,N18,N18/4,TRUE)) + 5*(NORMDIST(15,N18,N18/4,TRUE)-NORMDIST(12,N18,N18/4,TRUE)) + 6*(NORMDIST(18,N18,N18/4,TRUE)-NORMDIST(15,N18,N18/4,TRUE)) + 7*(NORMDIST(21,N18,N18/4,TRUE)-NORMDIST(18,N18,N18/4,TRUE)) + 8*(NORMDIST(24,N18,N18/4,TRUE)-NORMDIST(21,N18,N18/4,TRUE)) + 9*(NORMDIST(27,N18,N18/4,TRUE)-NORMDIST(24,N18,N18/4,TRUE)) + 10*(NORMDIST(30,N18,N18/4,TRUE)-NORMDIST(27,N18,N18/4,TRUE)) + 11*(NORMDIST(33,N18,N18/4,TRUE)-NORMDIST(30,N18,N18/4,TRUE)) + 12*(NORMDIST(36,N18,N18/4,TRUE)-NORMDIST(33,N18,N18/4,TRUE)) + 13*(NORMDIST(39,N18,N18/4,TRUE)-NORMDIST(36,N18,N18/4,TRUE)) + 14*(NORMDIST(42,N18,N18/4,TRUE)-NORMDIST(39,N18,N18/4,TRUE)) + 15*(NORMDIST(45,N18,N18/4,TRUE)-NORMDIST(42,N18,N18/4,TRUE)) + 16*(NORMDIST(48,N18,N18/4,TRUE)-NORMDIST(45,N18,N18/4,TRUE)) + 17*(NORMDIST(51,N18,N18/4,TRUE)-NORMDIST(48,N18,N18/4,TRUE)) + 18*(NORMDIST(54,N18,N18/4,TRUE)-NORMDIST(51,N18,N18/4,TRUE)) + 19*(NORMDIST(57,N18,N18/4,TRUE)-NORMDIST(54,N18,N18/4,TRUE)) + 20*(NORMDIST(60,N18,N18/4,TRUE)-NORMDIST(57,N18,N18/4,TRUE)) + 21*(NORMDIST(63,N18,N18/4,TRUE)-NORMDIST(60,N18,N18/4,TRUE)) + 22*(NORMDIST(66,N18,N18/4,TRUE)-NORMDIST(63,N18,N18/4,TRUE)) + 23*(NORMDIST(69,N18,N18/4,TRUE)-NORMDIST(66,N18,N18/4,TRUE)))
Strange question I know, but could not think of where else to ask!!!
Cheers
The equation has a series of terms of the form N*[NORMDIST(3N,mu,sigma)-NORMDIST(3N-3,mu,sigma)] where mu is the mean (N18 in the equation), sigma is the standard deviation (N18/4), and with N going from 1 to 23. This appears to be an estimate involving the average of the normal distribution. It would be more rigorous for N to go from minus infinity to plus infinity and it's not clear why this formula truncated the interval to 1..23. Nevertheless, if the person who wrote the equation was calculating the average, then from the properties of the normal distribution you can derive a closed form solution as:
Total of all NORMDIST terms = mu/3 + 1/2
This is a close approximation as long as mu (N18) is between 0 and 30. If you plug this into the equation you get
=MAX(1,N18/3+0.5)
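A quick numerical check of this simplification can be sketched in Python. It assumes that NORMDIST(x, mean, sd, TRUE) is the normal cumulative distribution, which `statistics.NormalDist.cdf` also computes; the function names and the sample values of mu are illustrative choices, not part of the spreadsheet.

```python
from statistics import NormalDist

def spreadsheet_sum(mu):
    """Sum of N * [CDF(3N) - CDF(3N - 3)] for N = 1..23, with sigma = mu / 4."""
    dist = NormalDist(mu, mu / 4)
    return sum(n * (dist.cdf(3 * n) - dist.cdf(3 * n - 3)) for n in range(1, 24))

def closed_form(mu):
    """The simplified expression mu/3 + 1/2."""
    return mu / 3 + 0.5

# Compare the long sum with the closed form for a few sample means
for mu in (6, 12, 20, 30):
    print(mu, round(spreadsheet_sum(mu), 4), round(closed_form(mu), 4))
```

Each term weights the probability of landing in the interval (3N-3, 3N] by N, so the sum behaves like the expected value of the mean divided by 3 and rounded up, which is where the extra 1/2 comes from.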
Hope that helps.
From the docs...
NORMDIST function
Applies to: Excel for Office 365, Excel for Office 365 for Mac, Excel 2019, Excel 2016
Returns the normal distribution for the specified mean and standard deviation. This function has a very wide range of applications in statistics, including hypothesis testing.
Important: This function has been replaced with one or more new functions that may provide improved accuracy and whose names better reflect their usage. Although this function is still available for backward compatibility, you should consider using the new functions from now on, because this function may not be available in future versions of Excel.
For more information about the new function, see NORM.DIST function.

Python3: What's the difference between these 2 calculations about integer divide

Here is my code
a = [10,10,20]
b = [2,5,4]
print(sum(a) / sum(b))
print(sum([i/j for i,j in zip(a,b)])/3)
The output is
3.6363636363636362
4.0
My question is: how do I make the first calculation right, and why is there such a difference?
Thanks.
The first one is (10+10+20)/(2+5+4) = 40/11 = 3.6363.
The second one is (10/2 + 10/5 + 20/4)/3 = (5 + 2 + 5)/3=4
Those are two different calculations. There is no reason to assume there should not be any difference.
Nothing is wrong with the calculation.
In the first case, i.e. in
print(sum(a) / sum(b))
you add up the numerators and the denominators separately and then divide the two sums.
Let [a,b,c] and [d,e,f] be your list elements. In the first case, you are computing
(a+b+c)/(d+e+f)
while in the second case, you are computing
(a/d + b/e + c/f)/3
which is why you get two different answers.
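The relationship between the two results can be made concrete with a short sketch using the lists from the question. The variable names are illustrative. The first result is a ratio of sums and the second is a plain mean of ratios; weighting each ratio by its denominator turns the second into the first.

```python
a = [10, 10, 20]
b = [2, 5, 4]

# Ratio of sums: (10 + 10 + 20) / (2 + 5 + 4)
ratio_of_sums = sum(a) / sum(b)

# Mean of ratios: (10/2 + 10/5 + 20/4) / 3
ratios = [i / j for i, j in zip(a, b)]
mean_of_ratios = sum(ratios) / len(ratios)

# Weighting each ratio by its denominator recovers the ratio of sums
weighted_mean = sum(r * w for r, w in zip(ratios, b)) / sum(b)

print(ratio_of_sums)
print(mean_of_ratios)
print(weighted_mean)
```

Which of the two is "right" depends on whether each pair should count equally (mean of ratios) or in proportion to its denominator (ratio of sums).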

Rounding Error: Harmonic mean with exponent of small numbers

Let us say I have log_a1=-1000, log_a2=-1001, and log_a3=-1002.
n=3
I want to get the harmonic mean (HM) of a1, a2 and a3 (not log_a1, log_a2 and log_a3) such that HM = n/[1/exp(log_a1) + 1/exp(log_a2) + 1/exp(log_a3)].
However, due to floating-point underflow, exp(log_a1) = exp(-1000) evaluates to 0, so 1/exp(log_a1) = inf and HM = 0.
Is there any mathematical trick to do? It is okay to get either HM or log(HM).
The best approach is probably to keep things in log scale. Many scientific languages have a log-add-exp function (e.g. numpy.logaddexp in python) that does what you want to high precision, with both the input and the result in log form.
The idea is that you want to compute e^-1000 + e^-1001 + e^-1002, so you factor it to e^-1000 (1 + e^-1 + e^-2) and take the log. The result is -1000 + log(1 + e^-1 + e^-2), which can be computed without loss of precision.
log(HM)=log(n)-log(1)+log_a_max - log(sum(1./exp(log_ai - log_a_max)))
For log_a = [-1000, -1001, -1002], this gives
log(HM) = -1001.309
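The log-scale approach can be sketched with the standard library alone. The function name `log_harmonic_mean` is made up for illustration; it applies the same factoring trick to the reciprocals 1/a_i, whose logs are -log_a_i, so no exp() call ever overflows or underflows to an unusable value.

```python
import math

def log_harmonic_mean(log_values):
    """Return log(HM) of the a_i given log(a_i), working entirely in log space."""
    n = len(log_values)
    # log(1/a_i) = -log(a_i)
    neg = [-v for v in log_values]
    # Factor out the largest term so every exp() argument is <= 0
    m = max(neg)
    log_sum_recip = m + math.log(sum(math.exp(v - m) for v in neg))
    # HM = n / sum(1/a_i), so log(HM) = log(n) - log(sum(1/a_i))
    return math.log(n) - log_sum_recip

result = log_harmonic_mean([-1000, -1001, -1002])
print(result)
```

For the values in the question this agrees with the -1001.309 quoted above. Exponentiating the result to get HM itself would underflow again, so it is safest to keep working with log(HM).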
