What class do the logistic regression coefficients belong to? (scikit-learn)

I'm running a simple binary logistic regression for a dependent variable coded [0, 1] (negative result, positive result). I was wondering which class the outputs of lr_model.coef_ belong to. Does it depend on the order of lr_model.classes_?
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
lr_model.fit(X, y)
lr_model.coef_
--Output--
[[ 4.26733720e-01 6.32369274e-01 ]]
lr_model.classes_
--Output--
array([0, 1])
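For the binary case, scikit-learn's convention is that coef_ parameterizes the decision function for classes_[1] (the positive class, here 1), so the interpretation does follow the order of classes_. A minimal check, reusing the fitted lr_model and X from above:

import numpy as np

# coef_ and intercept_ give the log-odds of classes_[1]
scores = X @ lr_model.coef_.ravel() + lr_model.intercept_
prob_positive = 1.0 / (1.0 + np.exp(-scores))

# predict_proba columns follow lr_model.classes_, so column 1 is class 1
print(np.allclose(prob_positive, lr_model.predict_proba(X)[:, 1]))  # True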

Related

scikit-learn linear regression K fold cross validation

I want to run linear regression with K-fold cross-validation using the sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error on my test set.
For example, the code below gives me an array of 20 results with different negative mean absolute errors. I am interested in finding the predictor that gives the lowest error and then using that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as a "predictor which gives the least error" in cross_val_score: all 20 estimators fitted by

sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)

are configured identically; they differ only in which fold they were trained on, and cross_val_score discards them after scoring.
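That said, if you really do want the per-fold fitted models, cross_validate can return them via return_estimator=True. A sketch reusing trainx and trainy from the question (note that cherry-picking the best fold's model usually just overfits that particular split):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

cv_results = cross_validate(LinearRegression(), trainx, trainy,
                            scoring='neg_mean_absolute_error',
                            cv=20, return_estimator=True)
# Scores are negative MAE, so the largest score is the smallest error
best_idx = np.argmax(cv_results['test_score'])
best_fold_model = cv_results['estimator'][best_idx]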
You may wish to check GridSearchCV, which will indeed search through different sets of hyperparameters and return the best estimator:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

X, y = datasets.make_regression()
lr_model = LinearRegression()
# 'normalize' was removed from LinearRegression in recent scikit-learn,
# so search over fit_intercept here instead
parameters = {'fit_intercept': [True, False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X, y)
Note the refit=True parameter, which ensures the best model is refit on the whole dataset.
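After fitting, you can inspect the winning configuration and use the fitted search object directly for prediction; X_new below is illustrative, not from the original answer:

print(best_model.best_params_)           # e.g. {'fit_intercept': True}
print(best_model.best_estimator_.coef_)  # coefficients of the refit model
# GridSearchCV delegates predict() to the refit best estimator:
# y_pred = best_model.predict(X_new)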

Cross_val_predict: Getting predicted values and predicted probabilities in one step

The following example script outputs the predicted values and predicted probabilities:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

lg = linear_model.LogisticRegression(random_state=0, solver='lbfgs')
y_prob = cross_val_predict(lg, X, y, cv=4, method='predict_proba')
y_pred = cross_val_predict(lg, X, y, cv=4)
print(y_prob[0:5])
print(y_pred[0:5])
I tried the following, without success:
test = cross_val_predict(lg, X, y, cv=4, method=['predict','predict_proba'])
Question: Is there a way to get both predicted values and predicted probabilities in one step, without running cross-validation twice? Also, I have to make sure that the values and probabilities correspond to the same input data.
The values of y_pred can be derived from y_prob:
import numpy as np

# The probabilities as in the original code sample
y_prob = cross_val_predict(lg, X, y, cv=4, method='predict_proba')
# Get the list of classes that matches the columns of y_prob
y_sorted = np.unique(y)
# Use the highest probability for predicting the label
indices = np.argmax(y_prob, axis=1)
# Get the label for each sample
y_pred = y_sorted[indices]
Now, it may happen that y_pred from cross_val_predict does not match the y_pred derived here in all cases. This happens when several classes tie for the highest probability, as in your sample code: for the first sample, the predicted probabilities are (to the displayed precision) zero for all classes. Anyway, it seems to me that logistic regression (which is, in fact, classification) is not suitable for the diabetes dataset, whose target is continuous.
For the rationale of y_sorted, see the cross_val_predict docs:

method : string, optional, default: 'predict'
    Invokes the passed method name of the passed estimator. For method='predict_proba', the columns correspond to the classes in sorted order.
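As a quick sanity check, reusing lg, X, y and y_prob from above, you can compare the derived labels against cross_val_predict's default output; any disagreement should come from ties:

y_pred_cv = cross_val_predict(lg, X, y, cv=4)
# Fraction of samples where the two approaches agree
print(np.mean(y_pred == y_pred_cv))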

scikit-learn LogisticRegressionCV: best coefficients

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True.
If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
To test my understanding, I determined the best coefficients in two different ways:
directly from the coef_ attribute of the fitted model, and
from the coefs_paths attribute, which contains the path of the coefficients obtained during cross-validating across each fold and then across each C.
The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here.
Thanks!
An example to demonstrate the issue:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
                           refit=True, scoring='roc_auc',
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
# Average the coefficients over the tying folds (also works if only one fold ties)
coefs2 = paths[best_folds_idx, best_C_idx, :].mean(axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.
The key point is the refit parameter of LogisticRegressionCV.
According to sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
refit : bool, default=True
    If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
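In other words: with refit=True, clf.coef_ comes from one final fit on the full training set at the best C, not from any single fold in coefs_paths_, which is why averaging fold coefficients does not reproduce coef_ exactly. A rough sketch of what that refit corresponds to, reusing the objects from the question (agreement is only up to solver tolerance and initialization, so treat this as illustrative):

from sklearn.linear_model import LogisticRegression

best_C = clf.C_[0]
refit_clf = LogisticRegression(C=best_C, penalty='l1',
                               solver='liblinear', random_state=0,
                               fit_intercept=False)
refit_clf.fit(X_train_scaled, y_train)
# Should be close to zero, though not necessarily exactly zero
print(np.abs(refit_clf.coef_ - clf.coef_).max())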
Best.

Keras and Sklearn logreg returning different results

I'm comparing the results of a logistic regressor written in Keras to the default sklearn LogisticRegression. My input is one-dimensional. My output has two classes, and I'm interested in the probability that the output belongs to class 1.
I'm expecting the results to be almost identical, but they are not even close.
Here is how I generate my random data. Note that X_train and X_test are still vectors; I'm just using capital letters because I'm used to it. Also, there is no need for scaling in this case.
import numpy as np

X = np.linspace(0, 1, 10000)
y = np.random.sample(X.shape)
y = np.where(y < X, 1, 0)
Here's cumsum of y plotted over X. Doing a regression here is not rocket science.
I do a standard train-test-split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
Next, I train a default logistic regressor:
from sklearn.linear_model import LogisticRegression
sk_lr = LogisticRegression()
sk_lr.fit(X_train, y_train)
sklearn_logreg_result = sk_lr.predict_proba(X_test)[:,1]
And a logistic regressor that I write in Keras:
from keras.models import Sequential
from keras.layers import Dense
keras_lr = Sequential()
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1))
keras_lr.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
_ = keras_lr.fit(X_train, y_train, verbose=0)
keras_lr_result = keras_lr.predict(X_test)[:,0]
And a hand-made solution:
# Ordinary least-squares line fitted by hand (a linear fit, not logistic)
pearson_corr = np.corrcoef(X_train.ravel(), y_train)[0, 1]
b = pearson_corr * np.std(y_train) / np.std(X_train)
a = np.mean(y_train) - b * np.mean(X_train)
handmade_result = (a + b * X_test)[:, 0]
I expect all three to deliver similar results, but here is what happens. This is a reliability diagram using 100 bins.
I have played around with loss functions and other parameters, but the Keras logreg stays roughly like this. What might be causing the problem here?
edit: Using binary crossentropy is not the solution here, as shown by this plot (note that the input data has changed between the two plots).
While both implementations are a form of logistic regression, there are quite a few differences. Both solutions converge to a comparable minimum (0.75/0.76 accuracy), but they are not identical:

Optimizer - Keras uses vanilla SGD, whereas sklearn's LogisticRegression is based on liblinear, which implements a trust-region Newton method.
Regularization - sklearn has L2 regularization built in and enabled by default; the Keras model has none.
Weights - the weights are randomly initialized and probably sampled from a different distribution.
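A hedged sketch of bringing the Keras model closer to sklearn's setup; the regularization factor and epoch count are back-of-the-envelope assumptions, not sklearn's exact objective:

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2

n = X_train.shape[0]
keras_lr = Sequential()
# Roughly mimic sklearn's default L2 penalty (C=1.0); the factor
# 1/(2*C*n) assumes the loss is averaged over the n training samples
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1,
                   kernel_regularizer=l2(1.0 / (2.0 * n))))
# Logistic regression minimizes log loss, not MSE
keras_lr.compile(loss='binary_crossentropy', optimizer='sgd')
keras_lr.fit(X_train, y_train, epochs=100, verbose=0)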

sklearn: Evaluating LinearSVC's AUC

I know that one would evaluate the AUC of sklearn.svm.SVC by passing probability=True to the constructor and having the SVM predict probabilities, but I'm not sure how to evaluate sklearn.svm.LinearSVC's AUC. Does anyone have any idea how?
I'd like to use LinearSVC over SVC because LinearSVC seems to train faster on data with many attributes.
You can use the CalibratedClassifierCV class to extract the probabilities. Here is an example with code.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets

# Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Using only two features
y = iris.target       # 3 classes: 0, 1, 2

linear_svc = LinearSVC()  # The base estimator
# The calibrated classifier, which can give probabilistic predictions
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # Platt scaling; see the docs for other methods
                                        cv=3)
calibrated_svc.fit(X, y)

# Predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data)  # important: use predict_proba
print(predicted_probs)
Looks like getting predict_proba natively out of LinearSVC is not possible:
https://github.com/scikit-learn/scikit-learn/issues/4820
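Note that for ROC AUC specifically, calibrated probabilities are not required: the metric is rank-based, so LinearSVC's decision_function scores can be passed to roc_auc_score directly. A minimal binary sketch on synthetic data (the dataset here is illustrative):

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)  # binary toy problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC().fit(X_train, y_train)
# roc_auc_score only needs a ranking score, not probabilities
print(roc_auc_score(y_test, clf.decision_function(X_test)))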
