Using a pipeline with preprocessing steps with GridSearchCV - scikit-learn

I am using an imblearn pipeline as the estimator and GridSearchCV for hyperparameter tuning, as seen below:
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipeline = imbpipeline(steps=[('scaler', MinMaxScaler()),
                              ('smote', SMOTE(random_state=11)),
                              ('classifier', LogisticRegression())])
search = GridSearchCV(pipeline, classifier_params, scoring='accuracy', cv=cv_inner, refit=True)
search.fit(X_train, y_train)
Here the training set is used for hyperparameter tuning, split into sub-train and validation sets in each fold.
My problem is this:
For the MinMaxScaler and the logistic regression part, I would like fit_transform applied to each sub-train set and just transform applied to each corresponding validation set, which I think is what is done here.
However, for SMOTE I would like the resampling applied to each sub-train set while leaving each corresponding validation set untouched, but I am not sure if this is the case here.
Does anyone know more about this?
If this is not the case, is there an example where the grid search CV is coded at a lower level (maybe with the CV part implemented 'by hand')?

From the imblearn Pipeline documentation:
The samplers are only applied during fit.
In other words, SMOTE resamples only the data each pipeline is fitted on (the sub-train split in every fold); when the pipeline scores the corresponding validation split, the sampler is skipped and that data passes through untouched, while the scaler only applies transform. So your setup already behaves the way you want. If you'd still like to see the CV loop written out by hand, a sketch follows.
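A minimal hand-written version of one inner CV loop, assuming X_train and y_train are NumPy arrays and pipeline is the imblearn pipeline from the question. This is only a sketch of what GridSearchCV does internally for a single hyperparameter setting, not its actual implementation:
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scores = []
for train_idx, val_idx in cv_inner.split(X_train, y_train):
    fold_pipe = clone(pipeline)  # fresh, unfitted copy for this fold
    # fit: the scaler is fitted and applied, SMOTE resamples, the classifier is trained
    fold_pipe.fit(X_train[train_idx], y_train[train_idx])
    # score: the scaler only transforms, SMOTE is skipped, the classifier predicts
    scores.append(fold_pipe.score(X_train[val_idx], y_train[val_idx]))
print(np.mean(scores))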

Related

Does it make sense to use scikit-learn cross_val_predict() to (i) make predictions with unseen data in k-fold cross-validation and (ii) compare models?

I'm training and evaluating a logistic regression and an XGBoost classifier.
With the XGBoost classifier, a training/validation/test split of the data and the subsequent training and validation show that the model is overfitting the training data. So, I'm working with k-fold cross-validation to reduce overfitting.
To work with k-fold cross-validation, I'm splitting my data into training and test sets and performing the k-fold cross-validation on the training set. The code looks something like the following:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = XGBClassifier()
kfold = StratifiedKFold(n_splits=10)
results = cross_val_score(model, x_train, y_train, cv=kfold)
The code works. Now, I've read several forums and blogs on how to make predictions after a k-fold cross-validation, but after these readings, I'm still not sure about the proper way of doing the predictions.
It would seem that using the cross_val_predict() method from sklearn.model_selection and using the test set is OK. The code would look something like the following:
y_pred = cross_val_predict(model, x_test, y_test, cv = kfold)
The code works, but the issue is whether this makes sense: I've seen more complicated ways of doing it, and in those discussions it doesn't seem clear whether the training or the test set should be used for the predictions.
And if this makes sense, computing the accuracy score and the confusion matrix would be as simple as running something like the following:
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
These two would help compare the logistic regression and the XGBoost classifier. Does this way of making predictions and evaluating models make sense?
Any help is appreciated! Thanks!
I want to answer this question I posted myself by summarizing things I have read and tried.
First, I want to clarify that the idea behind splitting my data into training/test sets and performing the k-fold cross-validation on the training set is to reserve the test set for providing a generalization error in much the same way we split data into training/validation/test sets and use the test set for providing a generalization error. For the sake of clarity, let me split the discussion into 2 sections.
Section 1
Now, having read more, it's clearer to me that cross_val_predict() returns the predictions that were obtained during cross-validation when the elements were in a test set (see section 3.1.1.2 in this scikit-learn cross-validation doc). This test set refers to one of the test sets the cross-validation procedure internally creates (cross-validation creates a test set in each fold). Thus:
y_pred = cross_val_predict(model, x_train, y_train, cv = kfold)
returns the predictions from the cross-validation internal test sets. It then seems safe to obtain the accuracy and confusion matrix with:
accuracy = metrics.accuracy_score(y_train, y_pred)
cm = metrics.confusion_matrix(y_train, y_pred)
While cross_val_predict(model, x_test, y_test, cv = kfold) runs without error, it seems that doing this doesn't make much sense.
Section 2
From some blogs that talk about creating a confusion matrix after a cross-validation procedure (see here and here), I borrowed code that, for each fold of the cross-validation, extracts the labels and predictions from the internal test set. These labels and predictions are later used to compute the confusion matrix. Assuming I store the labels and predictions in variables called actual_classes and predicted_classes, respectively, I then run:
accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
cm = metrics.confusion_matrix(actual_classes, predicted_classes)
The results are exactly the same as the ones from Section 1's equivalent code. This reinforces that cross_val_predict(model, x_train, y_train, cv = kfold) works fine.
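For reference, a minimal sketch of the per-fold extraction described above (assuming x_train and y_train are NumPy arrays and reusing the model and kfold objects from earlier) could look like this:
import numpy as np
from sklearn.base import clone

actual_classes, predicted_classes = [], []
for train_idx, test_idx in kfold.split(x_train, y_train):
    fold_model = clone(model)  # fresh copy for this fold
    fold_model.fit(x_train[train_idx], y_train[train_idx])
    predicted_classes.extend(fold_model.predict(x_train[test_idx]))
    actual_classes.extend(y_train[test_idx])
actual_classes = np.array(actual_classes)
predicted_classes = np.array(predicted_classes)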
Thus:
Does it make sense to use scikit-learn cross_val_predict() to make predictions with unseen data in k-fold cross-validation? I would say no, it doesn't, since cross_val_predict() makes predictions with the internal test sets from the cross-validation procedure. It seems that to make predictions with unseen data and compute a generalization error we would need a way to extract one of the models from the cross-validation procedure (e.g., see this question, or the sketch below).
Does it make sense to use scikit-learn cross_val_predict() to compare models? I would say yes, it does, as long as the method is executed as shown in Section 1. The accuracy and confusion matrix could be used to make comparisons against other models.
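One way to get hold of the fitted per-fold models is scikit-learn's cross_validate with return_estimator=True; a minimal sketch, assuming the same model, kfold and data splits as above:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, x_train, y_train, cv=kfold,
                            return_estimator=True)
# Each entry is a model fitted on one training fold; any of them can
# then be evaluated on the reserved test set for a generalization estimate.
fold_models = cv_results['estimator']
print(fold_models[0].score(x_test, y_test))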
Any comment is appreciated! Thanks!

OneVsRestClassifier Hyperparameter Tuning for every base estimator

I am working on a multiclass problem with six different classes and I am using OneVsRestClassifier.
I have then performed hyperparameter tuning with GridSearchCV and obtained the optimized classifier with clf.best_estimator_.
As far as I understand, this returns one set of hyperparameters for the aggregated model, i.e. the same hyperparameters for every base estimator.
Is there a way to perform hyperparameter tuning separately for each base estimator?
Sure, just reverse the order of the search and the multiclass wrapper:
one_class_clf = GridSearchCV(base_classifier, params, ...)
clf = OneVsRestClassifier(one_class_clf)
Fitting clf generates the one-vs-rest problems, and for each of those fits a copy of the grid-searched base_classifier.
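A minimal runnable sketch of this pattern; the logistic regression base estimator, the parameter grid, and the iris data here are only placeholders:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
base_classifier = LogisticRegression(max_iter=1000)
params = {'C': [0.1, 1, 10]}

# Each binary one-vs-rest problem gets its own grid search
one_class_clf = GridSearchCV(base_classifier, params, cv=3)
clf = OneVsRestClassifier(one_class_clf)
clf.fit(X, y)

# One set of best hyperparameters per class
print([est.best_params_ for est in clf.estimators_])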

Is it possible to set the splitting strategy for GridSearchCV?

I'm optimizing a model's hyperparameters with GridSearchCV. Because the data I'm working with is very imbalanced, I need to "choose" the way the algorithm splits the train/test sets in order to ensure that the underrepresented points are present in both sets.
From reading scikit-learn's documentation, I have the impression that it's possible to set the splitting strategy for GridSearchCV, but I'm not sure how, or whether this is actually the case.
I would be very grateful if someone could help me with this.
Yes: pass a StratifiedKFold object to GridSearchCV as its cv argument.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
By default, if you are training a classification model with GridSearchCV, the dataset is split with StratifiedKFold, which preserves the class proportions of the target variable in each fold.
If your dataset is imbalanced for some other reason (not the target variable), you can choose another criteria to perform the split. Carefully read the documentation of GridSearchCV, and select an appropriate CV splitter.
In the scikit-learn documentation of model selection, there are many Splitter Classes that you could use. Or you can define your own splitter class according to your criteria, but it would be more difficult.
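For illustration, a rough sketch of a custom splitter: any object that exposes split() and get_n_splits() with these signatures can be passed as cv. The alternating-sample logic below is just a placeholder for your own criteria:
import numpy as np

class EveryOtherSplitter:
    """Toy CV splitter that alternates samples between train and test."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        for k in range(self.n_splits):
            test_mask = (indices % self.n_splits) == k
            yield indices[~test_mask], indices[test_mask]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

clf = GridSearchCV(svc, parameters, cv=EveryOtherSplitter(n_splits=2))
clf.fit(iris.data, iris.target)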

How do I put both PCA and a keras classifier into an sklearn pipeline and do a grid search CV?

I am trying to do a grid search CV operation on a Keras NN with PCA beforehand. To this end, I have constructed a pipeline consisting of the PCA step and then the Keras estimator using the sklearn wrapper. However, one of the things I want to search through is n_components of the PCA, which means that the input size of the neural net needs to be variable and dependent on the number of features selected in a previous pipeline step.
Below is my code to create an NN, which I have put into a Keras wrapper:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

def create_model(learning_rate=0.01, activation='relu', Input_Vector_Size=34,
                 neuron_number=10, dropout_prob=0.1):
    # Create an Adam optimizer with the given learning rate
    opt = Adam(lr=learning_rate)
    # Create your binary classification model
    model = Sequential()
    model.add(Dense(neuron_number, input_shape=(Input_Vector_Size,),
                    activation=activation))
    model.add(Dropout(dropout_prob))
    model.add(Dense(1, activation='sigmoid'))  # output layer
    # Compile model with your optimizer, loss, and metrics
    model.compile(optimizer=opt, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
This is then put into a pipeline using:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from keras.wrappers.scikit_learn import KerasClassifier

#%% Create a Pipeline with the KerasClassifier and PCA
pca = PCA(n_components=0.9)
NN = KerasClassifier(build_fn=create_model, verbose=0)
pipeline = Pipeline([('pca', pca), ('NN', NN)])

# Define the parameters to try out
params = {'pca__n_components': [0.8, 0.85, 0.9, 0.95],
          'NN__activation': ['relu', 'tanh'],
          'NN__neuron_number': [10, 15, 20],
          'NN__dropout_prob': [0.05, 0.1, 0.2, 0.3],
          'NN__learning_rate': [0.1, 0.01, 0.001]}
However, I'm a bit stuck on what to set as the parameter for Input_Vector_Size, as this will depend on how many features PCA has selected.
So, is it possible to make a pipeline parameter (here, the Input_Vector_Size) that is dependent on a parameter in a previous step of the pipeline (here, the number of features selected by PCA)?
(Note: I realize that one option around this is to just have an autoencoder in my NN and vary the compression, however I was hoping to do PCA specifically)

Using a transformer (estimator) to transform the target labels in sklearn.pipeline

I understand that one can chain several estimators that implement the transform method to transform X (the feature set) in sklearn.pipeline. However, I have a use case where I would also like to transform the target labels (e.g., transform the labels to [1, K] instead of [0, K-1]), and I would love to do that as a component in my pipeline. Is it possible to do that at all using sklearn.pipeline?
There is now a nicer way to do this built into scikit-learn: compose.TransformedTargetRegressor.
When constructing these objects you give them a regressor and a transformer. When you .fit() them they transform the targets before regressing, and when you .predict() them they transform their predicted targets back to the original space.
It's important to note that you can pass them a pipeline object, so they should interface nicely with your existing setup. For example, take the following setup where I train a ridge regression to predict 1 target given 2 features:
# Imports
import numpy as np
from sklearn import compose, linear_model, metrics, pipeline, preprocessing
# Generate some training and test features and targets
X_train = np.random.rand(200).reshape(100,2)
y_train = 1.2*X_train[:, 0]+3.4*X_train[:, 1]+5.6
X_test = np.random.rand(20).reshape(10,2)
y_test = 1.2*X_test[:, 0]+3.4*X_test[:, 1]+5.6
# Define my model and scalers
ridge = linear_model.Ridge(alpha=1e-2)
scaler = preprocessing.StandardScaler()
minmax = preprocessing.MinMaxScaler(feature_range=(-1,1))
# Construct a pipeline using these methods
pipe = pipeline.make_pipeline(scaler, ridge)
# Construct a TransformedTargetRegressor using this pipeline
# ** So far the set-up has been standard **
regr = compose.TransformedTargetRegressor(regressor=pipe, transformer=minmax)
# Fit and train the regr like you would a pipeline
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("MAE: {}".format(metrics.mean_absolute_error(y_test, y_pred)))
This still isn't quite as smooth as I'd like it to be. For example, you can access the regressor contained by a TransformedTargetRegressor using .regressor_, but the coefficients stored there correspond to the transformed target (and scaled features), not the original space. This means there are some extra hoops to jump through if you want to work your way back to the equation that generated the data.
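For instance, continuing the example above (and relying on make_pipeline's default step name 'ridge'), the fitted inner model can be inspected like this, keeping in mind that its coefficients live in the scaled/transformed space:
# After regr.fit(), the fitted pipeline is available on .regressor_
inner_ridge = regr.regressor_.named_steps['ridge']
print(inner_ridge.coef_, inner_ridge.intercept_)  # not the original-space equation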
No, pipelines will always pass y through unchanged. Do the transformation outside the pipeline.
(This is a known design flaw in scikit-learn, but it's never been pressing enough to change or extend the API.)
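For the specific example in the question (labels in [1, K] instead of [0, K-1]), a minimal sketch of doing the transformation outside the pipeline; LabelEncoder and the variable names here are only illustrative:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_shifted = le.fit_transform(y) + 1  # encode to 0..K-1, then shift to 1..K
pipeline.fit(X, y_shifted)           # the pipeline itself never touches y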
You could append the label column to the end of the training data, apply your transformation, and then drop that column before training your model. It's not very elegant, but it works.
