Fitting in nested cross-validation with cross_val_score, pipeline, and GridSearch - scikit-learn

I am working in scikit-learn and I am trying to tune my XGBoost model.
I am attempting nested cross-validation: a pipeline rescales the training folds (to avoid data leakage and overfitting), GridSearchCV handles the parameter tuning in the inner loop, and cross_val_score produces the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

std_scaling = StandardScaler()
algo = XGBClassifier()

steps = [('std_scaling', std_scaling), ('algo', algo)]
pipeline = Pipeline(steps)
parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}
inner_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=inner_cv, param_grid=parameters,
                       scoring='roc_auc', n_jobs=-1, return_train_score=False)

outer_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=outer_cv, scoring='roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline, does it make sense to include X_train in cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)

You chose the right way to avoid data leakage, as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross-validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model for the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a higher-level score: "under my model selection process, how well will the selected model generalize?"
Now, if you want to finish the process with a real model in hand, you have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your internal model (the grid search) over the entire data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalize on new data?", the answer is the score you got during your nested CV - which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate X_train externally.
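Putting this together, a minimal end-to-end sketch (reusing the names above; X_test is assumed to come from the same train/test split):

# 1. The nested CV score estimates how well the whole selection process generalizes
print("Nested CV roc_auc: %.3f +/- %.3f" % (outer_clf_auc.mean(), outer_clf_auc.std()))

# 2. Final model: re-run the inner grid search on all the training data
clf_auc.fit(X_train, y_train)
print("Selected params:", clf_auc.best_params_)

# 3. The fitted search behaves like the pipeline it wraps: X_test is passed
#    through the StandardScaler step automatically before prediction
test_proba = clf_auc.predict_proba(X_test)[:, 1]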

Related

k-NN GridSearchCV taking extremely long time to execute

I am attempting to use sklearn to train a KNN model on the MNIST classification task. When I try to tune my parameters using either sklearn's GridSearchCV or RandomisedSearchCV classes, my code is taking an extremely long time to execute.
As an experiment, I created a KNN model using KNeighborsClassifier() with the default parameters and passed these same parameters to GridSearchCV. As far as I know, this should mean GridSearchCV only has a single set of parameters and so should effectively not perform a "search". I then called the .fit() methods of both on the training data and timed their execution (see code below). The KNN model's .fit() method took about 11 seconds to run, whereas the GridSearchCV model took over 20 minutes.
I understand that GridSearchCV should take slightly longer as it is performing 5-fold cross validation, but the difference in execution time seems too large for it to be explained by that.
Am I doing something with my GridSearchCV call that is causing it to take such a long time to execute? And is there anything I can do to accelerate it?
import sklearn
import time
# importing models
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Importing data
from sklearn.datasets import fetch_openml
mnist = fetch_openml(name='mnist_784', as_frame=False)  # keep numpy arrays for the positional indexing below
print("data loaded")
# splitting the data into stratified train & test sets
X, y = mnist.data, mnist.target  # mnist.data.shape is (n_samples, n_features)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
print("data split")
# Data has no missing values and is preprocessed, so no cleaning needed.
# using a KNN model, as recommended
knn = KNeighborsClassifier()
print("model created")
print("training model")
start = time.time()
knn.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
# Parameter tuning.
# starting by performing a broad-range search on n_neighbours to work out the
# rough scale the parameter should be on
print("beginning param tuning")
params = {'n_neighbors': [5],
          'weights': ['uniform'],
          'leaf_size': [30]
          }
paramSearch = GridSearchCV(
    estimator=knn,
    param_grid=params,
    cv=5,
    n_jobs=-1)
start = time.time()
paramSearch.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
With vanilla KNN, the costly procedure is predicting, not fitting: fitting just saves a copy of the data, and then predicting has to do the work of finding nearest neighbors. So since your search involves scoring on each test fold, that's going to take a lot more time than just fitting. A better comparison would have you predict on the training set in the no-search section.
However, sklearn does have different options for the algorithm parameter, which aim to trade away some of the prediction complexity for added training time, by building a search structure so that fewer comparisons are needed at prediction time. With the default algorithm='auto', you're probably building a ball tree, and so the effect of the first paragraph won't be so profound. I suspect this is still the issue though: now the training time will be non-negligible, but the scoring portion of the search is what takes most of the time.
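To see the first paragraph's point directly, a small sketch along these lines (the fold size is an assumption based on MNIST's 70,000 samples and the 80/20 split) times prediction rather than fitting:

# Predicting is where vanilla KNN pays its cost, so time that instead:
# a 5-fold search scores 5 test folds of ~11,200 points each, plus 5 fits
start = time.time()
knn.predict(X_train[:11200])  # roughly the size of one CV test fold
end = time.time()
print(f"Execution time for one fold-sized predict was: {end-start}")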

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share to replicate the problem, but perhaps someone can provide a high-level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict_proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters, including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but I'm not sure. Also, this issue only occurs if I standardize the training set data. So far I've tried both StandardScaler and MinMaxScaler, with the same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=0.5,
                   max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
                   random_state=58, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd, because the data is not on a uniform scale across all features. For instance, some features are represented as percents, others as dollar values, and others as dummy-coded representations.
This can happen when you do the following two things in sequence:
1. Fit an estimator with standardized training data, and then
2. Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on standardized data with unit variance (X_scaled), the variance of the data it's being tested on (X_test) is much higher than it expects. It's no surprise, then, that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.
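If you prefer not to use a pipeline, the equivalent fix is to keep the fitted scaler around and apply the same transformation to the test set; a minimal sketch continuing Example 2's names:

# Fit the scaler once, on the training data only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression()
model.fit(scaler.transform(X_train), y_train)
# Transform the test set with the *same* fitted scaler before predicting
y_pred = model.predict_proba(scaler.transform(X_test))
print(y_pred.mean(axis=0))  # back to sensible, non-extreme probabilities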

Error message on attempting to fit training data using GridSearch function

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV

polyreg = PolynomialFeatures(degree=4)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(polyreg, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)
grid_search_polyreg.score(x_test, y_test)
print("Best Parameters for polynomial regression: "
      "{}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression: "
      "{:.2f}".format(grid_search_polyreg.best_score_))
TypeError: If no scoring is specified, the estimator passed should
have a 'score' method. The estimator PolynomialFeatures(degree=4,
include_bias=True, interaction_only=False) does not.
1) I understand that alpha is not a parameter of PolynomialFeatures. But when I removed alpha and tried to fit the data, it still did not work.
2) Does that mean that I am not supposed to use grid search to get scores for KNN regression, linear regression, and kernel SVM?
I am new to Python and any suggestion is much appreciated. Thanks in advance.
sklearn.preprocessing.PolynomialFeatures() doesn't have a scoring function. It's not actually an estimator or machine learning model; it just transforms a matrix. You can have it as part of your pipeline and search over its parameters, but you have to pass an actual estimator with a scoring function to GridSearchCV.
Fitting to data has a different meaning when you're dealing with transformers vs estimators; only in the latter case does it mean "train".
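For instance, since the alpha values in the question suggest a regularized linear model, one way to wire this up is a pipeline that ends in an estimator. This is only a sketch (it assumes Ridge regression and the question's x_train/y_train):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# The pipeline ends in Ridge, which has a score method,
# so GridSearchCV can evaluate each parameter combination
pipe = Pipeline([('poly', PolynomialFeatures()), ('ridge', Ridge())])
param_grid = {'poly__degree': [2, 3, 4],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(pipe, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)
print("Best Parameters for polynomial regression: "
      "{}".format(grid_search_polyreg.best_params_))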

Computing Pipeline logistic regression predict_proba in sklearn

I have a dataframe with 3 features and 3 classes that I split into X_train, Y_train, X_test, and Y_test, and then run sklearn's Pipeline with PCA, StandardScaler, and finally LogisticRegression. I want to be able to calculate the probabilities directly from the LR weights and the raw data, without using predict_proba, but I don't know how, because I'm not sure exactly how the pipeline pipes X_test through PCA and StandardScaler into the logistic regression. Is this realistic without being able to use PCA's and StandardScaler's fit methods? Any help would be greatly appreciated!
So far, I have:
pca = PCA(whiten=True)
scaler = StandardScaler()
logistic = LogisticRegression(fit_intercept=True, class_weight='balanced',
                              solver='sag', n_jobs=-1, C=1.0, max_iter=200)
pipe = Pipeline(steps=[('pca', pca), ('scaler', scaler), ('logistic', logistic)])
pipe.fit(X_train, Y_train)
predict_probs = pipe.predict_proba(X_test)
coefficients = pipe.steps[2][1].coef_     # (3 by 30)
intercepts = pipe.steps[2][1].intercept_  # (1 by 3)
This is also the question I couldn't figure out; thanks for Kumar's answer.
I assumed the pipeline would fit a new transform on x_test, but when I compared a Pipeline composed of StandardScaler and LogisticRegression against my own function using StandardScaler and LogisticRegression, I found that the Pipeline actually applies the transform fitted on x_train. So don't worry about using a pipeline; it's really a convenient and useful tool for machine learning!
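To answer the original question concretely, here is a sketch of computing the probabilities by hand from the fitted steps above. It assumes a multinomial model (softmax); with a one-vs-rest fit, predict_proba normalizes per-class sigmoid outputs instead:

import numpy as np

# Apply each fitted transform in pipeline order: PCA, then scaler
Xt = pipe.named_steps['pca'].transform(X_test)
Xt = pipe.named_steps['scaler'].transform(Xt)

# Linear scores from the LR weights, then a softmax across classes
logits = Xt @ coefficients.T + intercepts
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# probs should match pipe.predict_proba(X_test) up to rounding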

Specific Cross Validation with Random Forest

I am using Random Forest with scikit-learn.
RF overfits the data and the prediction results are bad.
The overfitting does NOT depend on the parameters of the RF (NBtree, Depth_Tree); it happens with many different parameter values (I tested it across grid_search).
To remedy this, I:
- tweak the initial data / downsample some results in order to affect the fitting (manually pre-process noisy samples),
- loop over randomly generated RF fits,
- get the RF prediction on the data to be predicted,
- select the model which best fits the "predicted data" (not the calibration data).
This Monte Carlo procedure is very time-consuming.
I am just wondering: is there another way to do cross-validation on a Random Forest (i.e. NOT hyper-parameter optimization)?
EDITED
Cross-Validation with any classifier in scikit-learn is really trivial:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
clf = RandomForestClassifier()  # initialize with whatever parameters you want
# 10-fold cross-validation
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
If you wish to run Grid Search, you can easily do it via the GridSearchCV class. In order to do so you will have to provide a param_grid, which according to the documentation is
Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
So maybe, you could define your param_grid as follows:
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
Then you can use the GridSearchCV class as follows
from sklearn.model_selection import GridSearchCV
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)
You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the grid scores using grid_clf.cv_results_.
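For instance (a short sketch; X_test here is an assumed held-out set, not defined in the question):

best_rf = grid_clf.best_estimator_  # already refit on all of X_train by default
print(grid_clf.best_params_)        # the winning parameter combination
y_pred = best_rf.predict(X_test)    # apply the tuned model to held-out data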
Hope this helps!
