How is L2 (ridge) penalty calculated in sklearn LogisticRegression function? - python-3.x

For example when executing the following logistic regression model on my data in Python . . .
### Logistic regression with ridge penalty (L2) ###
from sklearn.linear_model import LogisticRegression
log_reg_l2_sag = LogisticRegression(penalty='l2', solver='sag', n_jobs=-1)
log_reg_l2_sag.fit(xtrain, ytrain)
I have not specified a range of ridge penalty values. Is the optimum ridge penalty explicitly calculated with a formula (as is done with the ordinary least squares ridge regression), or is the optimum penalty chosen from a default range of penalty values? The documentation isn't clear on this.

As far as I understood your question. You want to know how the 'L2' regularization works in case of logistic regression. Like how the optimum value is found out.
We don't give a grid here like [0.0001, 0.01 ] because the optimum values are found out using the 'solver' paramter of the LogisticRegression.
The solver in your case is Stochastic Average Gradient Descent which finds out the optimum values for the L2 regularization.
The L2 regularization will keep all the columns, keeping the coefficients of the least important paramters close to 0.

Related

Is there any place in scikit-learn Lasso/Quantile Regression source code that L1 regularization is applied?

I could not find where the Manhattan distance of weights is calculated and multiplied with alpha (L1 reg. coefficient) in the Lasso Regression and the Quantile Regression source code of scikit-learn.
I was trying to implement Lasso Regression and Quantile Regression w/ NumPy and compare results w/ scikit-learn models.
I don't believe the loss function (including the regularization penalty) is ever explicitly calculated, no.
Instead, the loss function is optimized by coordinate descent, and so we only ever need to actually calculate derivatives of the loss function. That happens in the enet_coordinate_descent function (or relatives), and I think the relevant bit is here.

Gradient Descent with Linear regression in Sklearn

The Linear regression model from sklearn uses a closed or normal equation to find the parameters. However with large datasets Gradient Descent is said to be more efficient. Is there any way to use the LinearRegression from sklearn using gradient descent.
The function you are looking for is: sklearn.linear_model.SGDRegressor
You can modify the loss hyperparameter which will define the loss function to be used.
Be aware that the SGD of SGDRegressor stands for Stochastic Gradient Descent. Which means that the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

Recursive feature elemination with CV doesn't reduce feature count

I have this protein dataset that I need to perform a RFE on. There are 100 examples with binary class labels (sick - 1, healthy - 0) and 9847 features for each example. To reduce the dimensionality I am performing a RFECV with a LogisticRegression estimator and 5 fold CV. This is the code:
model = LogisticRegression()
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(5), n_jobs=-1)
rfecv.fit(X_train, y_train)
print("Number of features selected: %d" % rfecv.n_features_)
Number of features selected: 9874
I then plot the number of features vs the CV scores:
plt.figure()
plt.xlabel("feature count")
plt.ylabel("CV accuracy")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
What I think is happening (and this is what I need an expert for) is that the first peak shows the optimal number of features. After that the curve drops and only starts to climb again because of overfitting, not really seperating classes but examples. Could this be the case? And if so how can I obtain these features (i.e. the ones at that first peak), because rfecv.support_ only gives me the ones where the highest accuracy was reached (meaning: all of them).
And while I am at it: How would I choose the best estimator for the RFE? Is it just by trial and error, going through all possible classifiers or is there any logic why I would use a Logit over a linear SVC for example?
One way that i use for feature relevance is the RandomForest or ExtremeRandomizedTrees.
i can use:
rfecv.n_features
to see how much features the find and:
rfec.ranking
to see the features index in descending order. another algorithm that you can use is the PCA to reduce the dimension of you Dataset.

What is the difference between LinearSVC and SVC(kernel="linear")?

I found sklearn.svm.LinearSVC and sklearn.svm.SVC(kernel='linear') and they seem very similar to me, but I get very different results on Reuters.
sklearn.svm.LinearSVC: 81.05% in 28.87s train / 9.71s test
sklearn.svm.SVC : 33.55% in 6536.53s train / 2418.62s test
Both have a linear kernel. The tolerance of the LinearSVC is higher than the one of SVC:
LinearSVC(C=1.0, tol=0.0001, max_iter=1000, penalty='l2', loss='squared_hinge', dual=True, multi_class='ovr', fit_intercept=True, intercept_scaling=1)
SVC (C=1.0, tol=0.001, max_iter=-1, shrinking=True, probability=False, cache_size=200, decision_function_shape=None)
How do both functions differ otherwise? Even if I set kernel='linear, tol=0.0001, max_iter=1000 anddecision_function_shape='ovr'theSVCtakes much longer thanLinearSVC`. Why?
I use sklearn 0.18 and both are wrapped in the OneVsRestClassifier. I'm not sure if this makes the same as multi_class='ovr' / decision_function_shape='ovr'.
Truly, LinearSVC and SVC(kernel='linear') yield different results, i. e. metrics score and decision boundaries, because they use different approaches. The toy example below proves it:
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC, SVC
X, y = load_iris(return_X_y=True)
clf_1 = LinearSVC().fit(X, y) # possible to state loss='hinge'
clf_2 = SVC(kernel='linear').fit(X, y)
score_1 = clf_1.score(X, y)
score_2 = clf_2.score(X, y)
print('LinearSVC score %s' % score_1)
print('SVC score %s' % score_2)
--------------------------
>>> 0.96666666666666667
>>> 0.98666666666666669
The key principles of that difference are the following:
By default scaling, LinearSVC minimizes the squared hinge loss while SVC minimizes the regular hinge loss. It is possible to manually define a 'hinge' string for loss parameter in LinearSVC.
LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass reduction while SVC uses the One-vs-One multiclass reduction. It is also noted here. Also, for multi-class classification problem SVC fits N * (N - 1) / 2 models where N is the amount of classes. LinearSVC, by contrast, simply fits N models. If the classification problem is binary, then only one model is fit in both scenarios. multi_class and decision_function_shape parameters have nothing in common. The second one is an aggregator that transforms the results of the decision function in a convenient shape of (n_features, n_samples). multi_class is an algorithmic approach to establish a solution.
The underlying estimators for LinearSVC are liblinear, that do in fact penalize the intercept. SVC uses libsvm estimators that do not. liblinear estimators are optimized for a linear (special) case and thus converge faster on big amounts of data than libsvm. That is why LinearSVC takes less time to solve the problem.
In fact, LinearSVC is not actually linear after the intercept scaling as it was stated in the comments section.
The main difference between them is linearsvc lets your choose only linear classifier whereas svc let yo choose from a variety of non-linear classifiers. however it is not recommended to use svc for non-linear problems as they are super slow. try importing other libraries for doing non-linear classifications.
now the point that even after defining kernel='linear' we don't get same output is because both linearsvc and svc try different approaches while doing the background mathematics. also linearsvc works on principle of one-vs-rest, and svc works on one-vs-one.
I hope this answers your question.

Different Linear Regression Coefficients with statsmodels and sklearn

I was planning to use sklearn linear_model to plot a graph of linear regression result, and statsmodels.api to get a detail summary of the learning result. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's in x for constant term when using both methods) My code for both methods are succint:
# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
results = sm.OLS(y, x).fit()
return results
# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
lr = linear_model.LinearRegression()
lr.fit(x, y)
return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
By making a 3-d plot using PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it will be very helpful to fix the issues in the sklearn linear regression implementation.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
it fits an intercept by default. This could explain why you have the differences in the constant term.
Now for the other coefficients, differences can occur when two of the variables are highly correlated. Let's consider the most extreme case where two of your columns are identical. Then reducing the coefficient in front of any of the two can be compensated by increasing the other. This is the first thing I'd check.

Resources