Lasso regression, no variable was dropped - statistics

I am performing lasso regression in R for binary response variable.
I am using cv.glmnet to find the best lambda and using glmnet to check the coefficients for the best lambda case. When calling both functions, I specify standardize =TRUE and alpha = 1.
I have about 40 variables in my case and I am sure some of them are strongly correlated with each other from scatterplots and vif(when I was performing logistic regression on the same data).
The best lambda that I got from my lasso regression is <0.001 and no variable is dropped in the best model (with lambda = best lambda).
Wondering why no variable was dropped.

Basically, it's because your lambda value is too small. lambda < 0.001 means the penalty is so small that it has almost no effect. Look at this "stupid" example:
Let's generate some random sample data. Note that the variables z and z1 are strongly correlated.
library(glmnet)
z<-rnorm(100)
data<-data.frame(y=3+rnorm(100),x1=rnorm(100),x2=rnorm(100),x3=rnorm(100),x4=rnorm(100),x5=rnorm(100),
x6=rnorm(100),x7=rnorm(100),x8=rnorm(100),x9=rnorm(100),x10=rnorm(100),z=z,z1=z+rnorm(100,0,0.3))
Now run some models:
gl<-glmnet(y=data$y,x=as.matrix(data[,-1]),alpha = 1)
plot(gl,xvar="lambda")
A lambda of 0.001 corresponds to log(lambda) = -6.907755, and even in this "stupid" example, where we would not expect the coefficients to be significant (so their values should be 0), we get small but nonzero values (as in the plot).
The coefficients from glmnet with lambda = 0.001 are very similar to those from glm (as I said, a small lambda means almost no penalty on the log-likelihood):
gl1<-glmnet(y=data$y,x=as.matrix(data[,-1]),alpha = 1,lambda=0.001)
gl2<-glm(data=data,formula=y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z+z1)
gl1$beta
# x1 -0.10985215
# x2 -0.12337595
# x3 0.06088970
# x4 -0.12714515
# x5 -0.12237959
# x6 -0.01439966
# x7 0.02037826
# x8 0.22288055
# x9 -0.10131195
# x10 -0.04268274
# z -0.04526606
# z1 0.04628616
gl2$coefficients
# (Intercept) x1 x2 x3 x4 x5 x6
# 2.98542594 -0.11104062 -0.12478162 0.06293879 -0.12833484 -0.12385855 -0.01556657
# x7 x8 x9 x10 z z1
# 0.02071605 0.22408006 -0.10195640 -0.04419441 -0.04602251 0.04513612
Now look at the difference between the coefficients from the two methods:
as.vector(gl1$beta)-as.vector(gl2$coefficients)[-1]
# [1] 0.0011884697 0.0014056731 -0.0020490872 0.0011896872 0.0014789566 0.0011669064
# [7] -0.0003377824 -0.0011995019 0.0006444471 0.0015116774 0.0007564556 0.00115004
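The same effect can be sketched in Python. This is only an illustration under stated assumptions, not a port of the glmnet code above: it uses scikit-learn's Lasso, whose alpha parameter plays the role of lambda, it does not standardize the predictors the way glmnet does by default, and the seed and the alpha values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
n = 100
z = rng.normal(size=n)
# 10 pure-noise predictors plus two strongly correlated ones (z and z1)
X = np.column_stack([rng.normal(size=(n, 10)), z, z + rng.normal(0, 0.3, size=n)])
y = 3 + rng.normal(size=n)  # the response does not depend on any predictor

ols = LinearRegression().fit(X, y)
tiny = Lasso(alpha=0.001, max_iter=100_000).fit(X, y)  # almost no penalty
large = Lasso(alpha=1.0).fit(X, y)                     # heavy penalty

max_gap = np.max(np.abs(tiny.coef_ - ols.coef_))
print("largest |lasso - OLS| gap at alpha=0.001:", max_gap)
print("nonzero coefficients at alpha=0.001:", np.count_nonzero(tiny.coef_))
print("nonzero coefficients at alpha=1.0:", np.count_nonzero(large.coef_))
```

With the tiny penalty the lasso coefficients stay close to the unpenalized fit, while the large penalty shrinks every coefficient to exactly zero.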

Related

Implications of regressing Y = f(x1, x2) where Y = x1 + x2 + x3

In various papers I have seen regressions of the form Y = f(x1, x2), where f() is usually a simple OLS and, importantly, Y = x1 + x2 + x3. In other words, the regressors are exactly a part of Y.
These papers used the regressors as a way to describe the data rather than to isolate a causal relationship between X and Y. I was wondering what the implications of this strategy are. To begin with, do the numbers and significance tests make any sense at all? Thanks.
I understand that the approach mechanically fails if the regressors included in the analysis completely describe Y, for obvious reasons (perfect collinearity). However, I would like to better understand the implications of including only some of the x's.
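A small simulation can make the two situations concrete. This is a sketch under a strong assumption that the question does not state: x1, x2 and x3 are drawn independently. The helper ols and all the data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 5000
x1, x2, x3 = rng.normal(size=(3, n))  # assumption: independent components
Y = x1 + x2 + x3

def ols(columns, target):
    """Least-squares fit with an intercept; returns coefficients and residuals."""
    A = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, target - A @ coef

partial_coef, partial_resid = ols([x1, x2], Y)   # x3 left out
full_coef, full_resid = ols([x1, x2, x3], Y)     # all components included

print("partial fit:", partial_coef)  # slopes near 1; x3 ends up in the residual
print("full fit:   ", full_coef)     # slopes equal 1; residuals vanish
```

Under this independence assumption the omitted component simply acts as the error term. If x3 were correlated with x1 or x2, the slopes of the partial fit would move away from 1.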

multiple regression correlation effect

I would like to investigate the effects of two independent variables on a dependent variable. Suppose we have X1, X2 independent variables, and Y dependent variable.
I use two different approaches. In the first approach, to eliminate the effect of X1 on Y, I generate the conditional distribution of Y|X1 and perform regression using the second variable X2. When I check the correlations between X2 and Y|X1, I obtain relatively high correlations (R2>0.50). However, when I perform multiple regression over a wide range of data (X1 and X2), the effect of X2 on Y is decreased and becomes insignificant. How do these approaches give conflicting results? What is the most appropriate approach to determine the effect of X2 on Y for a given X1 value? Thanks.
It could be good to see the code or the above in mathematical notation.
For instance: did you include the constant terms?
What do you see when:
Y = B0 + B1X1 + B2X2
That will be the easiest to check, and B2 will probably give you what you want.
That model is still simple, you could explore something like:
Y = B0 + B1X1 + B2X2 + B3X1X2
or
Y = B0 + B1X1 + B2X2 + B3X1X2 + B4X1^2 + B5X2^2
And see if there are changes in the coefficients and if there are new significant coefficients.
You could go further and explore Structural Equation Models.
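As a rough sketch of that comparison, the additive and interaction models can be fitted side by side with plain least squares. The data and the "true" coefficients below are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 2000
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Hypothetical truth: B0 = 1, B1 = 2, B2 = 0.5, B3 = 1.5
Y = 1.0 + 2.0 * X1 + 0.5 * X2 + 1.5 * X1 * X2 + rng.normal(scale=0.5, size=n)

def fit(design):
    """Least-squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef

ones = np.ones(n)  # constant term B0
additive = fit(np.column_stack([ones, X1, X2]))              # Y = B0 + B1X1 + B2X2
interaction = fit(np.column_stack([ones, X1, X2, X1 * X2]))  # adds B3X1X2

print("additive:   ", additive)
print("interaction:", interaction)
```

Comparing the B2 estimate across the two fits shows whether the effect of X2 depends on which terms involving X1 are in the model.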

sklearn customized standardization of data

Suppose I have a 2D numpy array:
X = np.array([
[..., ...],
[..., ...]])
And I want to standardize the data either with:
X = StandardScaler().fit_transform(X)
or:
X = (X - X.mean())/X.std()
The results are different. Why are they different?
Assume X is a feature matrix of shape (n x m), with n instances and m features. We want to scale each feature so that its values have a mean of zero and unit variance.
To do this, you need to calculate the mean and standard deviation of each feature (each column of X) over the provided instances and then compute the scaled feature vectors. Currently you are calculating the mean and standard deviation of the whole dataset and scaling the data with those values. This gives meaningless results in all but a few special cases (e.g., when every feature already has the same mean and standard deviation).
Practically, to calculate these statistics for each feature, set the axis parameter of the .mean() and .std() methods to 0. This performs the calculation along the columns and returns a (1 x m) shaped array (actually an (m,) array, but that's another story), where each value is the mean or standard deviation of the given column. You can then use numpy broadcasting to scale the feature vectors correctly.
The below example shows how you can correctly implement it manually. x1 and x2 are 2 features with 100 training instances. We store them in a feature matrix X.
import numpy as np
from sklearn.preprocessing import StandardScaler
x1 = np.linspace(0, 100, 100)
x2 = 10 * np.random.normal(size=100)
X = np.c_[x1, x2]
# scale the data using the sklearn implementation
X_scaled = StandardScaler().fit_transform(X)
# scale the data taking mean and std along columns
X_scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)
If you print the two you will see that they match. Explicitly:
print(np.sum(X_scaled-X_scaled_manual))
returns 0.0 (up to floating-point rounding).

What features does each coefficient in dual_coef_ correspond to for the polynomial kernel?

I trained a simple quadratic SVM using sklearn.svm.SVC on 3 features. In other words, X is nx3, Y is length n and I simply ran the following code with no problem:
svc = SVC(kernel='poly', degree = 2)
svc.fit(X,Y)
As my goal is to plot this boundary in 3D, I am trying to figure out which features each of the resulting coefficients correspond to. Naturally, a quadratic function with 3 features will result in an intercept term and 10 coefficients where each coefficient corresponds to:
x1^2, x2^2, x3^2, x1x2, x1x3, x2x3, x1x2x3, x1, x2, x3
However, svc.dual_coef_ returns an array of the 10 coefficients, but I do not know which of them corresponds to which of the 10 features. Is there a way to figure this out?
Thanks!
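One way to check what dual_coef_ refers to is to inspect its shape and rebuild the decision function from it. The sketch below assumes a binary problem and fixes gamma and coef0 explicitly so that the kernel can be written out by hand; the data are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)  # arbitrary seed
X = rng.normal(size=(200, 3))
# Made-up labels with a quadratic boundary, for illustration only
Y = (X[:, 0] ** 2 + X[:, 1] * X[:, 2] > 1.0).astype(int)

gamma, coef0 = 1.0, 1.0  # set explicitly so the kernel below is fully known
svc = SVC(kernel="poly", degree=2, gamma=gamma, coef0=coef0).fit(X, Y)

# One column of dual_coef_ per support vector, not per polynomial term
print(svc.dual_coef_.shape, svc.support_vectors_.shape)

# Polynomial kernel between every support vector and every sample
K = (gamma * svc.support_vectors_ @ X.T + coef0) ** 2
manual = (svc.dual_coef_ @ K + svc.intercept_).ravel()
print(np.allclose(manual, svc.decision_function(X)))
```

In this sketch the entries of dual_coef_ line up with the rows of support_vectors_, so the number of entries depends on how many support vectors the fit selects rather than on the number of polynomial terms.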

How fit_intercept parameter impacts linear regression with scikit learn

I am trying to fit a linear model, and my dataset is normalized so that each feature is divided by its maximum possible value, so the values range from 0 to 1. I learned from my previous post, "Linear Regression vs Closed form Ordinary least squares in Python", that linear regression in scikit-learn produces the same result as closed-form OLS when the fit_intercept parameter is set to False. I do not quite understand how fit_intercept works.
For any linear problem, if y is the predicted value:
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p
Across the module, the vector w = (w_1, ..., w_p) is denoted as coef_ and w_0 as intercept_
In closed-form OLS we also have a bias value w_0: we add a vector X_0 = [1...1] to the data before computing the dot product, and solve using matrix multiplication and the inverse.
w = np.dot(X.T, X)
w1 = np.dot(np.linalg.pinv(w), np.dot(X.T, Y))
When fit_intercept is True, scikit-learn's linear regression solves the following problem, where y is the predicted value:
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p + b, where b is the intercept term.
How does using fit_intercept change the model, and when should one set it to True or False? I looked at the source code, and it seems the coefficients are normalized by some scale.
if self.fit_intercept:
    self.coef_ = self.coef_ / X_scale
    self.intercept_ = y_offset - np.dot(X_offset, self.coef_.T)
else:
    self.intercept_ = 0
What does this scaling do exactly? I want to interpret the coefficients in both approaches (LinearRegression and closed-form OLS), but since just setting fit_intercept to True or False gives different results for linear regression, I can't quite work out the intuition behind them. Which one is better, and why?
Let's take a step back and consider the following sentence you said:
since just setting fit_intercept True/False gives different result for Linear Regression
That is not entirely true. It may or may not be different, and it depends entirely on your data. It would help to understand what goes into the calculation of regression weights. I mean this somewhat literally: what does your input (x) data look like?
Understanding your input data, and understanding why it matters, will help you realize why you sometimes get different results, and why at other times the results are the same.
Data setup
Let's set up some test data:
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(1243)
x = np.random.randint(0,100,size=10)
y = np.random.randint(0,100,size=10)
Our x and y variables look like this:
X Y
51 29
3 73
7 77
98 29
29 80
90 37
49 9
42 53
8 17
65 35
No-intercept model
Recall that the calculation of regression weights has a closed-form solution, which we can obtain using the normal equations:
w = (X^T X)^(-1) X^T y
Using this method, we get a single regression coefficient because we only have 1 predictor variable:
x = x.reshape(-1,1)
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 0.53297593]
Now, let's look at scikit-learn when we set fit_intercept = False:
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 0.53297593]
What happens when we set fit_intercept = True instead?
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[-0.35535884]
It would seem that setting fit_intercept to True and False gives different answers, and that the "correct" answer occurs only when we set it to False, but this is not entirely correct...
Intercept model
At this point we have to consider what our input data actually is. In the models above, our data matrix (also called a feature matrix, or design matrix in statistics) is just a single vector containing our x values. The y variable is not included in the design matrix. If we want to add an intercept to our model, one common approach is to add a column of 1's to the design matrix, so x becomes:
x_vals = x.flatten()
x = np.zeros((10, 2))
x[:,0] = 1
x[:,1] = x_vals
intercept x
0 1.0 51.0
1 1.0 3.0
2 1.0 7.0
3 1.0 98.0
4 1.0 29.0
5 1.0 90.0
6 1.0 49.0
7 1.0 42.0
8 1.0 8.0
9 1.0 65.0
Now, when we use this as our design matrix, we can try the closed form solution again:
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 59.60686058 -0.35535884]
Notice 2 things:
We now have 2 coefficients. The first is our intercept and the second is the regression coefficient for the x predictor variable
The coefficient for x matches the coefficient from the scikit-learn output above when we set fit_intercept = True
So in the scikit-learn models above, why was there a difference between True and False? Because in one case no intercept was modeled. In the other case the underlying model included an intercept, which is confirmed when you manually add an intercept term/column when solving the normal equations.
If you were to use this new design matrix in scikit-learn, it wouldn't matter whether you set fit_intercept to True or False: the coefficient for the predictor variable will not change (the intercept value will be different due to centering, but that's irrelevant for this discussion):
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 59.60686058 -0.35535884]
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[ 0. -0.35535884]
Summing up
The output (i.e. coefficient values) you get will be entirely dependent on the matrix that you input into these calculations (whether it's normal equations, scikit-learn, or any other method).
How does it differ to use fit_intercept in a model and when should one set it to True/False
If your design matrix does not contain a 1's column, then normal equations and scikit-learn (fit_intercept = False) will give you the same answer (as you noted). However, if you set the parameter to True, the answer you get will actually be the same as normal equations if you calculated that with a 1's column.
When should you set True/False? As the name suggests, you set False when you don't want to include an intercept in your model. You set True when you do want an intercept, with the understanding that the coefficient values will change, but will match the normal equations approach when your data includes a 1's column.
So True/False doesn't actually give you different results (compared to normal equations) when considering the same underlying model. The difference you observe is because you're looking at two different statistical models (one with an intercept term, and one without). The reason the fit_intercept parameter exists is so you can create an intercept model without the hassle of manually adding that 1's column. It effectively allows you to toggle between the two underlying statistical models.
Without going into the details of the mathematical formulation: when fit_intercept is set to False, the estimator deliberately fixes the intercept at zero, and this in turn affects the other coefficients, because the 'responsibility' for reducing the error falls entirely on them. As a result, the two settings can give very different results if the fit is sensitive to the presence of an intercept term. The offset (centering) step shifts the origin, which allows the same closed-form solution to be used for both the intercept and the intercept-free model.
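The origin shift can be checked directly on the x and y values listed in the data setup above. This is a minimal sketch of the idea, not scikit-learn's actual implementation: the data are centered, a no-intercept model is fitted, and the intercept is then recovered with the same formula as in the quoted source snippet. The variable names are chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x and y values copied from the data table in the answer above
x = np.array([51, 3, 7, 98, 29, 90, 49, 42, 8, 65], dtype=float).reshape(-1, 1)
y = np.array([29, 73, 77, 29, 80, 37, 9, 53, 17, 35], dtype=float)

# Shift the origin to the means, then fit without an intercept
x_offset, y_offset = x.mean(axis=0), y.mean()
centered = LinearRegression(fit_intercept=False).fit(x - x_offset, y - y_offset)
slope = centered.coef_
intercept = y_offset - x_offset @ slope  # intercept_ = y_offset - X_offset . coef_

# Reference: let scikit-learn model the intercept itself
reference = LinearRegression(fit_intercept=True).fit(x, y)
print(slope, reference.coef_)
print(intercept, reference.intercept_)
```

Both routes give the same slope and intercept, which is the sense in which one no-intercept solver can serve both models.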
