Why is the ROC_AUC from cross_val_score so much higher than manually using a StratfiedKFold with metrics.roc_auc_score for an XGB classifier? - scikit-learn

Method 1 - StratifiedKFold cross validation
skf = StratifiedKFold(n_splits=5, shuffle=False)
roc_aucs_temp = []
for i, (train_index, test_index) in enumerate(skf.split(X_train_xgb, y_train_xgb)):
X_train_fold, X_test_fold = X_train_xgb.iloc[train_index], X_train_xgb.iloc[test_index]
y_train_fold, y_test_fold = y_train_xgb[train_index], y_train_xgb[test_index]
xgb_temp.fit(X_train_fold, y_train_fold)
y_pred=model.predict(X_test_fold)
roc_aucs_temp.append(metrics.roc_auc_score(y_test_fold, y_pred))
print(roc_aucs_temp)
[0.8622474747474748, 0.8497474747474747, 0.9045918367346939, 0.8670918367346939, 0.879591836734694]
Method 2 CrossValScore
# this uses the same CV object as method 1
print(cross_val_score(xgb, X_train_xgb, y_train_xgb, cv=skf, scoring='roc_auc'))
[0.9614899 0.94861111 0.96045918 0.97270408 0.96977041]
I might be misunderstanding the functionality of cross_val_score, but from my understanding it creates K folds of training and test data. It then trains the model on K-1 folds, and tests on 1 fold, repeatedly. It should be around the same accuracy as manually creating K Folds with StratifiedKFold. Why isn't it?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

The documentation for roc_auc_score indicates its second argument is the label scores rather than the predicted labels. Like they show in their example, you probably want something like model.predict_proba(X_test_fold)[:, 1] instead of model.predict(X_test_fold).
cross_val_score with roc_auc is evaluating it that way, and that is why you are seeing the difference.

Related

k-NN GridSearchCV taking extremely long time to execute

I am attempting to use sklearn to train a KNN model on the MNIST classification task. When I try to tune my parameters using either sklearn's GridSearchCV or RandomisedSearchCV classes, my code is taking an extremely long time to execute.
As an experiment, I created a KNN model using KNeighborsClassifier() with the default parameters and passed these same parameters to GridSearchCV. Afaik, this should mean GridSearchCV only has single set of parameters and so should effectively not perform a "search". I then called the .fit() methods of both on the training data and timed their execution (see code below). The KNN model's .fit() method took about 11 seconds to run, whereas the GridSearchCV model took over 20 minutes.
I understand that GridSearchCV should take slightly longer as it is performing 5-fold cross validation, but the difference in execution time seems too large for it to be explained by that.
Am I doing something with my GridSearchCV call that it causing it to take such a long time to execute? And is there anything that I can do to accelerate it?
import sklearn
import time
# importing models
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Importing data
from sklearn.datasets import fetch_openml
mnist = fetch_openml(name='mnist_784')
print("data loaded")
# splitting the data into stratified train & test sets
X, y = mnist.data, mnist.target # mnist mj.data.shape is (n_samples, n_features)
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 0)
for train_index, test_index in sss.split(X,y):
X_train, y_train = X[train_index], y[train_index]
X_test, y_test = X[test_index], y[test_index]
print("data split")
# Data has no missing values and is preprocessed, so no cleaing needed.
# using a KNN model, as recommended
knn = KNeighborsClassifier()
print("model created")
print("training model")
start = time.time()
knn.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
# Parameter tuning.
# starting by performing a broad-range search on n_neighbours to work out the
# rough scale the parameter should be on
print("beginning param tuning")
params = {'n_neighbors':[5],
'weights':['uniform'],
'leaf_size':[30]
}
paramSearch = GridSearchCV(
estimator = knn,
param_grid = params,
cv=5,
n_jobs = -1)
start = time.time()
paramSearch.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
With vanilla KNN, the costly procedure is predicting, not fitting: fitting just saves a copy of the data, and then predicting has to do the work of finding nearest neighbors. So since your search involves scoring on each test fold, that's going to take a lot more time than just fitting. A better comparison would have you predict on the training set in the no-search section.
However, sklearn does have different options for the algorithm parameter, which aim to trade away some of the prediction complexity for added training time, by building a search structure so that fewer comparisons are needed at prediction time. With the default algorithm='auto', you're probably building a ball tree, and so the effect of the first paragraph won't be so profound. I suspect this is still the issue though: now the training time will be non-neglibible, but the scoring portion in the search is what takes most of the time.

scikit-learn linear regression K fold cross validation

I want to run Linear Regression along with K fold cross validation using sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error returned on my test set.
For example the below piece of code gives me an array of 20 results with different neg mean absolute errors, I am interested in finding the predictor which gives me this (least) error and then use that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as "predictor which gives me this (least) error" in cross_val_score, all estimators in :
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
are the same.
You may wish to check GridSearchCV that will indeed search through different sets of hyperparams and return the best estimator:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
X,y = datasets.make_regression()
lr_model = LinearRegression()
parameters = {'normalize':[True,False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X,y)
Note the refit=True param that ensures the best model is refit on the whole dataset and returned.

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestCalssifier with the same dataset. For these, predict_proba yields the expected probability results (i.e, float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but not sure. Also, this issue only occurs if I standardize training set data. So far I've tried both StandardScaler and MinMaxScaler, same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
random_state=58, solver='saga', tol=0.0001, verbose=0,
warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd because the data is not a uniform scale across all features. For instance, some features are represented as percents, others are represented as dollar values, and others are dummy coded representations.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with a unit variance of 1.0 (X_scaled), the variance of the data it's being tested on (X_test) is much higher than expected. It's no surprise then that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.

Multi-label classification with class weights in Keras

I have a 1000 classes in the network and they have multi-label outputs. For each training example, the number of positive output is same(i.e 10) but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and rest 990 have output 0.
For the multi-label classification, I am using 'binary-cross entropy' as cost function and 'sigmoid' as the activation function. When I tried this rule of 0.5 as the cut-off for 1 or 0. All of them were 0. I understand this is a class imbalance problem. From this link, I understand that, I might have to create extra output labels.Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in keras.
nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam=keras.optimizers.adam(lr=0.00001)
model.compile('adam', 'binary_crossentropy')
history = model.fit(
X_train, Y_train, batch_size=32, epochs=50,verbose=0,shuffle=False)
Could anyone help me with the code here and I would also highly appreciate if you could suggest a good 'accuracy' metric for this problem?
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions. Especially the class imbalance problem.
In terms of metric there are several possibilities: In my case I use the top 1/2/3/4/5 results and check if one of them is right. Because in your case you always have the same amount of labels=1 you could take your top 10 results and see how many percent of them are right and average this result over your batch size. I didn't find a possibility to include this algorithm as a keras metric. Instead, I wrote a callback, which calculates the metric on epoch end on my validation data set.
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter Class is really convenient for this purpose.
Edit: If found a method to include class weights without splitting the output.
You need a numpy 2d array containing weights with shape [number classes to predict, 2 (background and signal)].
Such an array could be calculated with this function:
def calculating_class_weights(y_true):
from sklearn.utils.class_weight import compute_class_weight
number_dim = np.shape(y_true)[1]
weights = np.empty([number_dim, 2])
for i in range(number_dim):
weights[i] = compute_class_weight('balanced', [0.,1.], y_true[:, i])
return weights
The solution is now to build your own binary crossentropy loss function in which you multiply your weights yourself:
def get_weighted_loss(weights):
def weighted_loss(y_true, y_pred):
return K.mean((weights[:,0]**(1-y_true))*(weights[:,1]**(y_true))*K.binary_crossentropy(y_true, y_pred), axis=-1)
return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))

sklearn auc ValueError: Only one class present in y_true

I searched Google, and saw a couple of StackOverflow posts about this error. They are not my cases.
I use keras to train a simple neural network and make some predictions on the splitted test dataset. But when use roc_auc_score to calculate AUC, I got the following error:
"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.".
I inspect the target label distribution, and they are highly imbalanced. Some labels(in the total 29 labels) have only 1 instance. So it's likely they will have no positive label instance in the test label. So the sklearn's roc_auc_score function reported the only one class problem. That's reasonable.
But I'm curious, as when I use sklearn's cross_val_score function, it can handle the AUC calculation without error.
my_metric = 'roc_auc'
scores = cross_validation.cross_val_score(myestimator, data,
labels, cv=5,scoring=my_metric)
I wonder what happened in the cross_val_score, is it because the cross_val_score use a stratified cross-validation data split?
UPDATE
I continued to make some digging, but still can't find the difference behind.I see that cross_val_score call check_scoring(estimator, scoring=None, allow_none=False) to return a scorer, and the check_scoring will call get_scorer(scoring) which will return scorer=SCORERS[scoring]
And the SCORERS['roc_auc'] is roc_auc_scorer;
the roc_auc_scorer is made by
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
needs_threshold=True)
So, it's still using the roc_auc_score function. I don't get why cross_val_score behave differently with directly calling roc_auc_score.
I think your hunch is correct. The AUC (area under ROC curve) needs a sufficient number of either classes in order to make sense.
By default, cross_val_score calculates the performance metric one each fold separately. Another option could be to do cross_val_predict and compute the AUC over all folds combined.
You could do something like:
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
class ProbaEstimator(LogisticRegression):
"""
This little hack needed, because `cross_val_predict`
uses `estimator.predict(X)` internally.
Replace `LogisticRegression` with whatever classifier you like.
"""
def predict(self, X):
return super(self.__class__, self).predict_proba(X)[:, 1]
# some example data
X, y = make_classification()
# define your estimator
estimator = ProbaEstimator()
# get predictions
pred = cross_val_predict(estimator, X, y, cv=5)
# compute AUC score
roc_auc_score(y, pred)

Resources