How to overcome infinite value to perform multicollinearity test in environmental variables? - geospatial

Here's the dataset which consists of X = 45 columns collected the data from bioclimate database.
The multicollinearity test model -
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
x columns
#Calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
for i in range (0, len(X.columns))]
-------------------------------------------------------------
/usr/local/lib/python3.7/dist-packages/statsmodels/stats/outliers_influence.py:193:
RuntimeWarning: divide by zero encountered in double_scalars
vif = 1. / (1. - r_squared_i)
vif_data
Trial :
I've converted all variables into float and int vice-versa but still getting infinite values for all variables after performing multicollinearity test.
I didn't find any reference material to tackle this problem specially in python.
Please help me out, I am using it for species distribution modelling.

Related

Mixed integer nonlinear programming with gekko python

I want to solve the following optimization problem using Gekko in python 3.7 window version.
Original Problem
Here, x_s are continuous variables, D and Epsilon are deterministic and they are also parameters.
However, since minimization function exists in the objective function, I remove it using binary variables(z1, z2) and then the problem becomes MINLP as follows.
Modified problem
With Gekko,
(1) Can both original problem & modified problem be solved?
(2) How can I code summation in the objective function and also D & epsilon which are parameters in Gekko?
Thanks in advance.
Both problems should be feasible with Gekko but the original appears easier to solve. Here are a few suggestions for the original problem:
Use m.Maximize() for the objective
Use sum() for the inner summation and m.sum() for outer summation for the objective function. I switch to m.sum() when the summation would create an expression that is over 15,000 characters. Using sum() creates one long expression and m.sum() breaks the summation into pieces but takes longer to compile.
Use m.min3() for the min(Dt,xs) terms or slack variables s with x[i]+s[i]=D[i]. It appears that Dt (size 30) is an upper bound, but it has different dimensions that xs (size 100). Slack variables are much more efficient than using binary variables.
D = np.array(100)
x = m.Array(m.Var,100,lb=0,ub=2000000)
The modified problem has 6000 binary variables and 100 continuous variables. There are 2^6000 potential combinations of those variables so it may take a while to solve, even with the efficient branch and bound method of APOPT. Here are a few suggestions for the modified problem:
Use matrix multiplications when possible. Below is an example of matrix operations with Gekko.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
ni = 3; nj = 2; nk = 4
# solve AX=B
A = m.Array(m.Var,(ni,nj),lb=0)
X = m.Array(m.Var,(nj,nk),lb=0)
AX = np.dot(A,X)
B = m.Array(m.Var,(ni,nk),lb=0)
# equality constraints
m.Equations([AX[i,j]==B[i,j] for i in range(ni) \
for j in range(nk)])
m.Equation(5==m.sum([m.sum([A[i][j] for i in range(ni)]) \
for j in range(nj)]))
m.Equation(2==m.sum([m.sum([X[i][j] for i in range(nj)]) \
for j in range(nk)]))
# objective function
m.Minimize(m.sum([m.sum([B[i][j] for i in range(ni)]) \
for j in range(nk)]))
m.solve()
print(A)
print(X)
print(B)
Declare z1 and z2 variables as integer type with integer=True. Here is more information on using the integer type.
Solve locally with m=GEKKO(remote=False). The processing time will be large and the public server resets connections and deletes jobs every day. Switch to local mode to avoid a potential disruption.

Hot to get the set difference of two 2d numpy arrays, or equivalent of np.setdiff1d in a 2d array?

Here Get intersecting rows across two 2D numpy arrays they got intersecting rows by using the function np.intersect1d. So i changed the function to use np.setdiff1d to get the set difference but it doesn't work properly. The following is the code.
def set_diff2d(A, B):
nrows, ncols = A.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)],
'formats':ncols * [A.dtype]}
C = np.setdiff1d(A.view(dtype), B.view(dtype))
return C.view(A.dtype).reshape(-1, ncols)
The following data is used for checking the issue:
min_dis=400
Xt = np.arange(50, 3950, min_dis)
Yt = np.arange(50, 3950, min_dis)
Xt, Yt = np.meshgrid(Xt, Yt)
Xt[::2] += min_dis/2
# This is the super set
turbs_possible_locs = np.vstack([Xt.flatten(), Yt.flatten()]).T
# This is the subset
subset = turbs_possible_locs[np.random.choice(turbs_possible_locs.shape[0],50, replace=False)]
diffs = set_diff2d(turbs_possible_locs, subset)
diffs is supposed to have a shape of 50x2, but it is not.
Ok, so to fix your issue try the following tweak:
def set_diff2d(A, B):
nrows, ncols = A.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [A.dtype]}
C = np.setdiff1d(A.copy().view(dtype), B.copy().view(dtype))
return C
The problem was - A after .view(...) was applied was broken in half - so it had 2 tuple columns, instead of 1, like B. I.e. as a consequence of applying dtype you essentially collapsed 2 columns into tuple - which is why you could do the intersection in 1d in the first place.
Quoting after documentation:
"
a.view(some_dtype) or a.view(dtype=some_dtype) constructs a view of the array’s memory with a different data-type. This can cause a reinterpretation of the bytes of memory.
"
Src https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html
I think the "reinterpretation" is exactly what happened - hence for the sake of simplicity I would just .copy() the array.
NB however I wouldn't square it - it's always A which gets 'broken' - whether it's an assignment, or inline B is always fine...

Why is statsmodels.api producing R^2 of 1.000?

I'm using statsmodel to do simple and multiple linear regression and I'm getting bad R^2 values from the summary. The coefficients look to be calculated correctly, but I get an R^2 of 1.000 which is impossible for my data. I graphed it in excel and I should be getting around 0.93, not 1.
I'm using a mask to filter data to send into the model and I'm wondering if that could be the issue, but to me the data looks fine. I am fairly new to python and statsmodel so maybe I'm missing something here.
import statsmodels.api as sm
for i, df in enumerate(fallwy_xy): # Iterate through list of dataframes
if len(df.index) > 0: # Check if frame is empty or not
mask3 = (df['fnu'] >= low) # Mask data below 'low' variable
valid3 = df[mask3]
if len(valid3) > 0: # Check if there is data in range of mask3
X = valid3[['logfnu', 'logdischarge']]
y = valid3[['logssc']]
estm = sm.OLS(y, X).fit()
X = valid3[['logfnu']]
y = valid3[['logssc']]
ests = sm.OLS(y, X).fit()
I finally found out what was going on. Statsmodels by default does not incorporate a constant into its OLS regression equation, you have to call it out specifically with
X = sm.add_constant(X)
The reason the constant is so important is because without it, Statsmodels calculates R-squared differently, uncentered to be exact. If you do add a constant then the R-squared gets calculated the way most people calculate R-squared which is the centered version. Excel does not change the way it calculates R-squared when given a constant or not which is why when Statsmodels reported it's R-squared with no constant it as so different from Excel. The OLS Regression summary from Statsmodels actually points out the calculation method if it uses the uncentered no-constant, calculation by showing R-squared (uncentered): where the R-squared shows up in the summary table. The below links helped me figure this out.
add hasconstant indicator for R-squared and df calculations
Same model coeffs, different R^2 with statsmodels OLS and sci-kit learn linearregression
Warning : Rod Made a Mistake!

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
# Priors for beta coefficients - these are the coefficients of the players
dict_betas = {}
for col in X.columns:
dict_betas[col] = pm.Normal(col, mu=0, sd=10)
# Priors for unknown model parameters
alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations
# Expected value of outcome
mu = alpha
for col in X.columns:
mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
# Likelihood (sampling distribution) of observations
Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
# draw 500 posterior samples
trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kind of problems doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.

Error in caret-svm - "NAs are not allowed in subscripted assignments"

experts. I am a beginner to R. I am trying to use caret-SVM to make classification. The kernel is svmPoly.
First, I used the default parameters to train the model with leave-one-out cross-validation
The code is :
ctrl <- trainControl(method = "LOOCV",
classProbs = T,
savePredictions = T,
repeats = 1)
modelFit <- train(group~.,data=table_svm,method="svmPoly",
preProc = c("center","scale"),
trControl = ctrl)
The best accuracy is 80%. And the final values used for the model were degree = 1, scale = 0.1 and C = 1 .
Second, I tried to tune the parameters.
The code is:
grid_svmpoly=expand.grid(degree=c(1:11),scale=seq(0,5,length.out=25),C=10^c(0:4))
modelFit_tune <- train(group~.,data=table_svm,method="svmPoly",
preProc = c("center","scale"),
tuneGrid=grid_svmpoly,
trControl = ctrl)
I got an error message: Error in { :
task 264 failed - "NAs are not allowed in subscripted assignments"
I checked the data and found no NA.
There must be some NA inside the data-set. I am not new to this but not much expert. To ensure there is no NA inside first convert data-set into matrix format using:
x <- data.matrix(dataframe)
then use which() function which very handy in this case:
which(is.na(x)==T)
I hope this will help you finding the answer. The values will be in row wise order.
Let me know if this resolve your query.

Resources