sklearn auc ValueError: Only one class present in y_true - scikit-learn

I searched Google and saw a couple of StackOverflow posts about this error, but none of them match my case.
I use Keras to train a simple neural network and make predictions on the split test dataset. But when I use roc_auc_score to calculate AUC, I get the following error:
"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case."
I inspected the target label distribution and it is highly imbalanced: some labels (out of 29 in total) have only one instance, so it is likely they have no positive instance in the test split. That is why sklearn's roc_auc_score function reports the "only one class" problem, which is reasonable.
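For reference, the error is easy to reproduce with a y_true that contains a single class (toy values, not my data):
from sklearn.metrics import roc_auc_score
# y_true contains only the negative class, so the ROC curve is undefined
roc_auc_score([0, 0, 0], [0.1, 0.4, 0.8])
# ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.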
But I'm curious: when I use sklearn's cross_val_score function, it handles the AUC calculation without error:
my_metric = 'roc_auc'
scores = cross_validation.cross_val_score(myestimator, data, labels,
                                          cv=5, scoring=my_metric)
I wonder what happens inside cross_val_score. Is it because cross_val_score uses a stratified cross-validation data split?
UPDATE
I continued to dig but still can't find the reason for the difference. I see that cross_val_score calls check_scoring(estimator, scoring=None, allow_none=False) to return a scorer, and check_scoring calls get_scorer(scoring), which returns scorer = SCORERS[scoring].
And SCORERS['roc_auc'] is roc_auc_scorer, which is defined as
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
So it's still using the roc_auc_score function. I don't get why cross_val_score behaves differently from calling roc_auc_score directly.

I think your hunch is correct. The AUC (area under the ROC curve) needs a sufficient number of samples of both classes in order to make sense.
By default, cross_val_score calculates the performance metric on each fold separately, and for classification targets it uses a stratified split by default, which is why each fold is likely to contain both classes. Another option is to use cross_val_predict and compute the AUC over all folds combined.
You could do something like:
from sklearn.metrics import roc_auc_score
# note: in newer scikit-learn versions cross_val_predict lives in sklearn.model_selection
from sklearn.cross_validation import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

class ProbaEstimator(LogisticRegression):
    """
    This little hack is needed because `cross_val_predict`
    uses `estimator.predict(X)` internally.
    Replace `LogisticRegression` with whatever classifier you like.
    """
    def predict(self, X):
        return super(self.__class__, self).predict_proba(X)[:, 1]

# some example data
X, y = make_classification()

# define your estimator
estimator = ProbaEstimator()

# get out-of-fold predicted probabilities
pred = cross_val_predict(estimator, X, y, cv=5)

# compute AUC score over all folds combined
roc_auc_score(y, pred)
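As an aside, newer scikit-learn versions make the subclass hack unnecessary, because cross_val_predict accepts a method argument. A minimal sketch on the same kind of toy data (not the asker's dataset):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification()
# out-of-fold probabilities for the positive class
proba = cross_val_predict(LogisticRegression(), X, y, cv=5, method='predict_proba')[:, 1]
roc_auc_score(y, proba)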

Related

Scikit-learn output in a pleasant way [duplicate]

As an R user, I also wanted to get up to speed on scikit-learn.
Creating linear regression models is fine, but I can't seem to find a reasonable way to get a standard summary of the regression output.
Code example:
# Linear Regression
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# Load the diabetes datasets
dataset = datasets.load_diabetes()
# Fit a linear regression model to the data
model = LinearRegression()
model.fit(dataset.data, dataset.target)
print(model)
# Make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# Summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print(model.intercept_, model.coef_, mse)
print(model.score(dataset.data, dataset.target))
Issues:
It seems the intercept and coefficients are built into the model, and I just type print (second-to-last line) to see them.
What about all the other standard regression outputs like R^2, adjusted R^2, p-values, etc.? If I read the examples correctly, it seems you have to write a function/equation for each of these and then print it.
So, is there no standard summary output for linear regression models?
Also, the printed array of coefficients has no variable names associated with the values; I just get the numeric array. Is there a way to print the coefficients together with the variables they belong to?
My printed output:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
152.133484163 [ -10.01219782 -239.81908937 519.83978679 324.39042769 -792.18416163
476.74583782 101.04457032 177.06417623 751.27932109 67.62538639] 2859.69039877
0.517749425413
Notes: I started off with Linear, Ridge and Lasso and have gone through the examples. The code above is for basic OLS.
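As a partial answer to the variable-name issue: the diabetes Bunch carries the feature names, so you can pair them with the coefficients yourself; a sketch using the code above:
for name, coef in zip(dataset.feature_names, model.coef_):
    print(name, coef)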
There is no R-style regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling/machine learning, where the evaluation criteria are based on performance on previously unseen data (such as predictive R^2 for regression).
There is, however, a summary function for classification called sklearn.metrics.classification_report, which calculates several types of (predictive) scores for a classification model.
For a more classic statistical approach, take a look at statsmodels.
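For example, here is roughly how classification_report is used on a toy classifier (illustrative data and model, not from the question):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
# per-class precision, recall, f1-score and support in one table
print(classification_report(y_test, clf.predict(X_test)))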
I use:
import numpy as np
import sklearn.metrics as metrics

def regression_results(y_true, y_pred):
    # Regression metrics
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    mean_squared_log_error = metrics.mean_squared_log_error(y_true, y_pred)
    median_absolute_error = metrics.median_absolute_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    print('explained_variance: ', round(explained_variance, 4))
    print('mean_squared_log_error: ', round(mean_squared_log_error, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))
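Usage might look like this (toy data of my own; the target is shifted to stay positive because mean_squared_log_error is only defined for non-negative values):
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
y = y - y.min() + 100
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
regression_results(y_test, reg.predict(X_test))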
The statsmodels package gives quite a decent summary:
from statsmodels.api import OLS
OLS(dataset.target, dataset.data).fit().summary()
(Note that statsmodels' OLS does not add an intercept unless you add a constant column yourself, as the next answer does.)
You can do this using statsmodels:
import statsmodels.api as sm

X = sm.add_constant(X.ravel())
results = sm.OLS(y, X).fit()
results.summary()
results.summary() will organize the results into three tables.
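If you need the individual statistics rather than the printed tables, the fitted results object also exposes them directly; for example:
print(results.params)        # coefficients (including the constant)
print(results.pvalues)       # p-values for each coefficient
print(results.rsquared)      # R^2
print(results.rsquared_adj)  # adjusted R^2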
You can use the following to get a summary table:
import statsmodels.api as sm

# log_clf = LogisticRegression()
log_clf = sm.Logit(y_train, X_train)
classifier = log_clf.fit()
y_pred = classifier.predict(X_test)
print(classifier.summary2())
Use model.summary() after predict
# Linear Regression
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# load the diabetes datasets
dataset = datasets.load_diabetes()

# fit a linear regression model to the data
model = LinearRegression()
model.fit(dataset.data, dataset.target)
print(model)

# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)

# >>>>>>> Print out the statistics <<<<<<<
# (note: sklearn's LinearRegression has no summary() method; this call only
# works if `model` is a fitted statsmodels model, as in the answers above)
model.summary()

# summarize the fit of the model
mse = np.mean((predicted - expected)**2)
print(model.intercept_, model.coef_, mse)
print(model.score(dataset.data, dataset.target))

scikit-learn linear regression K fold cross validation

I want to run linear regression with K-fold cross-validation using the sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error on my test set.
For example, the piece of code below gives me an array of 20 results with different negative mean absolute errors. I am interested in finding the predictor that gives me the lowest error and then using that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as a "predictor which gives me this (least) error" in cross_val_score; all the estimators fitted in
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
have the same specification, and cross_val_score only returns the scores, not the fitted estimators.
You may wish to check GridSearchCV that will indeed search through different sets of hyperparams and return the best estimator:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
X,y = datasets.make_regression()
lr_model = LinearRegression()
parameters = {'normalize':[True,False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X,y)
Note the refit=True param that ensures the best model is refit on the whole dataset and returned.
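After fitting, you can inspect the winning hyperparameters and reuse the refitted model (the X_test name below is illustrative, not from the question):
# best hyperparameter combination found by the grid search
print(best_model.best_params_)
# the estimator refit on the whole dataset with those hyperparameters
print(best_model.best_estimator_)
# use it on held-out data, e.g.
# y_pred = best_model.predict(X_test)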

Score obtained from cross_val_score is RMSE or MSE?

I am using the following code:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=regressor, X=X, y=y, cv=10)
accuracies.mean()
Is this mean value RMSE or MSE?
EDIT: I am using random forest regression. In the scikit-learn documentation they describe it as accuracy. How can I relate it to RMSE or MSE?
It is actually neither RMSE nor MSE. If you look into the documentation of cross_val_score, you can see that it has a parameter scoring for which it says:
If None, the estimator’s default scorer (if available) is used.
In your case, this means it will use the default scorer of the RandomForestRegressor. When you look up the documentation for its .score() method, it tells you:
Return the coefficient of determination R^2 of the prediction.
This means you computed the mean R^2. If you want to change this behavior, you have to specify the scoring parameter of cross_val_score. Options can be found here.
Scores obtained from a cross_val_score regressor are by default 'r2' (R-squared); if you want RMSE, use 'neg_root_mean_squared_error' and then flip the sign. sklearn makes the score negative because its optimization machinery always maximizes a score, and maximizing an error would push in the wrong direction, so the error is negated instead. If you would like to use other metrics, you can find them on the scikit-learn page.
Here is how you could get the (negated) MSE scores; take the square root of the absolute value if you want RMSE:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=regressor, X=X, y=y, scoring='neg_mean_squared_error', cv=10)
accuracies.mean()
If you want to get the cross-validation score as RMSE or MSE, try this:
from sklearn.model_selection import cross_val_score

nrmse = cross_val_score(estimator=regressor, X=X, y=y, scoring='neg_root_mean_squared_error', cv=10)
print(nrmse.mean() * -1)

nmse = cross_val_score(estimator=regressor, X=X, y=y, scoring='neg_mean_squared_error', cv=10)
print(nmse.mean() * -1)
The following code gives you all the scoring options:
sorted(sklearn.metrics.SCORERS.keys())
Which gives you:
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score',
'average_precision', 'balanced_accuracy', 'completeness_score',
'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples',
'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score',
'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples',
'jaccard_weighted', 'max_error', 'mutual_info_score',
'neg_brier_score', 'neg_log_loss', 'neg_mean_absolute_error',
'neg_mean_absolute_percentage_error', 'neg_mean_gamma_deviance',
'neg_mean_poisson_deviance', 'neg_mean_squared_error',
'neg_mean_squared_log_error', 'neg_median_absolute_error',
'neg_root_mean_squared_error', 'normalized_mutual_info_score',
'precision', 'precision_macro', 'precision_micro',
'precision_samples', 'precision_weighted', 'r2', 'rand_score',
'recall', 'recall_macro', 'recall_micro', 'recall_samples',
'recall_weighted', 'roc_auc', 'roc_auc_ovo',
'roc_auc_ovo_weighted', 'roc_auc_ovr', 'roc_auc_ovr_weighted',
'top_k_accuracy', 'v_measure_score']

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high-level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict_proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters, including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss='log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but I'm not sure. Also, this issue only occurs if I standardize the training set data. So far I've tried both StandardScaler and MinMaxScaler, with the same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
                   max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
                   random_state=58, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(), which is odd because the features are not on a uniform scale: some are percentages, others are dollar values, and others are dummy-coded indicators.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with unit variance (X_scaled), the variance of the data it is being tested on (X_test) is much higher than it expects. It's no surprise, then, that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.
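If you prefer not to use a pipeline, the equivalent fix is to keep the fitted scaler and apply the same transformation to the test data; a short sketch reusing the names from Example 2:
# fit the scaler on the training data only, then reuse it
scaler = StandardScaler().fit(X_train)
model = LogisticRegression()
model.fit(scaler.transform(X_train), y_train)

# transform the test set with the same fitted scaler before predicting
y_pred = model.predict_proba(scaler.transform(X_test))
print(y_pred.mean(axis=0))  # sensible, non-degenerate probabilities again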

How to get best_estimator parameters from GridSearch using cross_val_score?

I want to know the result of the GridSearch when I'm using nested cross validation with cross_val_score for convenience.
When using cross_val_score, you get an array of scores. It would be useful to receive the fitted estimator back or a summary of the chosen parameters for that estimator.
I know you can do this yourself by implementing the cross-validation manually, but it is much more convenient if it can be done in conjunction with cross_val_score.
Any way to do it or is this a feature to suggest?
The GridSearchCV class in scikit-learn already does cross validation internally. You can pass any CV iterator as the cv argument of the constructor of GridSearchCV.
The answer to your question is that it is a feature to suggest. Unfortunately, you can't get the best parameters of the models fitted with nested cross-validation using cross_val_score (as of now, scikit 0.14).
See this example:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
digits = datasets.load_digits()
X = digits.data
y = digits.target
hyperparams = [{'fit_intercept':[True, False]}]
algo = LinearRegression()
grid = GridSearchCV(algo, hyperparams, cv=5, scoring='mean_squared_error')
# Nested cross validation
cross_val_score(grid, X, y)
grid.best_score_
[Out]:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-4c4ac83c58fb> in <module>()
15 # Nested cross validation
16 cross_val_score(grid, X, y)
---> 17 grid.best_score_
AttributeError: 'GridSearchCV' object has no attribute 'best_score_'
(Note also that the scores you get from cross_val_score are not the ones defined in scoring, here the mean squared error. What you see is the score function of the best estimator. The bug of v0.14 is described here.)
In sklearn v0.20.0 (which will be released in late 2018), the trained estimators are exposed by the function cross_validate if requested.
See the corresponding pull request for the new feature here. Something like this will work:
from sklearn.metrics.scorer import check_scoring
from sklearn.model_selection import cross_validate
scorer = check_scoring(estimator=gridSearch, scoring=scoring)
cvRet = cross_validate(estimator=gridSearch, X=X, y=y,
                       scoring={'score': scorer}, cv=cvOuter,
                       return_train_score=False,
                       return_estimator=True,
                       n_jobs=nJobs)
scores = cvRet['test_score']     # equivalent to the output of cross_val_score()
estimators = cvRet['estimator']
If return_estimator=True, the estimators can be retrieved from the returned dictionary as cvRet['estimator']. The list stored in cvRet['test_score'] is equivalent to the output of cross_val_score. See here how cross_val_score() is implemented by means of cross_validate().
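Since estimator=gridSearch, each element of cvRet['estimator'] is a GridSearchCV instance fitted on one outer-fold training set, so the hyperparameters chosen in each outer fold can be inspected, for example:
# inspect the inner-CV choice made in each outer fold
for fold_idx, est in enumerate(estimators):
    print(fold_idx, est.best_params_, est.best_score_)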
