OneVsRestClassifier Hyperparameter Tuning for every base estimator - scikit-learn

I am working on a multiclass problem with six different classes and I am using OneVsRestClassifier.
I have then performed hyperparameter tuning with GridSearchCV and obtained the optimized classifier with clf.best_estimator_.
As far as I understand, this returns a single set of hyperparameters that is shared by every base estimator in the aggregated model.
Is there a way to perform hyperparameter tuning separately for each base estimator?

Sure, just reverse the order of the search and the multiclass wrapper:
one_class_clf = GridSearchCV(base_classifier, params, ...)
clf = OneVsRestClassifier(one_class_clf)
Fitting clf generates the one-vs-rest problems, and for each of those fits a copy of the grid-searched base_classifier.
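For concreteness, here is a minimal sketch of that nesting; the LogisticRegression base classifier, the parameter grid, and X_train/y_train are hypothetical placeholders, so substitute your own:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

base_classifier = LogisticRegression(max_iter=1000)   # hypothetical base estimator
params = {'C': [0.1, 1, 10]}                          # hypothetical grid

one_class_clf = GridSearchCV(base_classifier, params, cv=5)
clf = OneVsRestClassifier(one_class_clf)
clf.fit(X_train, y_train)     # runs a separate grid search for each of the six binary problems

for est in clf.estimators_:   # each fitted estimator is a GridSearchCV with its own best params
    print(est.best_params_)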

Related

Does it make sense to use scikit-learn cross_val_predict() to (i)make predictions with unseen data in k-fold cross-validation and (ii)compare models?

I'm training and evaluating a logistic regression and an XGBoost classifier.
With the XGBoost classifier, a training/validation/test split of the data and the subsequent training and validation show that the model is overfitting the training data. So, I'm working with k-fold cross-validation to reduce overfitting.
To work with k-fold cross-validation, I'm splitting my data into training and test sets and performing the k-fold cross-validation on the training set. The code looks something like the following:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = XGBClassifier()
kfold = StratifiedKFold(n_splits = 10)
results = cross_val_score(model, x_train, y_train, cv = kfold)
The code works. Now, I've read several forums and blogs on how to make predictions after a k-fold cross-validation, but after these readings, I'm still not sure about the proper way of doing the predictions.
It would seem that using the cross_val_predict() method from sklearn.model_selection and using the test set is OK. The code would look something like the following:
y_pred = cross_val_predict(model, x_test, y_test, cv = kfold)
The code works, but the issue is whether this makes sense: I've seen more complicated approaches, and it isn't clear to me whether the training set or the test set should be used for the predictions.
And if this makes sense, computing the accuracy score and the confusion matrix would be as simple as running something like the following:
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
These two would help compare the logistic regression and the XGBoost classifier. Does this way of making predictions and evaluating models make sense?
Any help is appreciated! Thanks!
I want to answer this question I posted myself by summarizing things I have read and tried.
First, I want to clarify that the idea behind splitting my data into training/test sets and performing the k-fold cross-validation on the training set is to reserve the test set for providing a generalization error in much the same way we split data into training/validation/test sets and use the test set for providing a generalization error. For the sake of clarity, let me split the discussion into 2 sections.
Section 1
Now, having read more, it's clearer to me that cross_val_predict() returns the predictions that were obtained during cross-validation when the elements were in a test set (see section 3.1.1.2 in this scikit-learn cross-validation doc). This test set refers to one of the test sets the cross-validation procedure creates internally (cross-validation creates a test set in each fold). Thus:
y_pred = cross_val_predict(model, x_train, y_train, cv = kfold)
returns the predictions from the cross-validation internal test sets. It then seems safe to obtain the accuracy and confusion matrix with:
accuracy = metrics.accuracy_score(y_train, y_pred)
cm = metrics.confusion_matrix(y_train, y_pred)
While cross_val_predict(model, x_test, y_test, cv = kfold) runs without error, doing this doesn't make much sense: it just cross-validates the model again, this time within the test set, instead of evaluating a model trained on the training set.
Section 2
From some blogs that talk about creating a confusion matrix after a cross-validation procedure (see here and here), I borrowed code that, for each fold of the cross-validation, extracts the labels and predictions from the internal test set. These labels and predictions are later used to compute the confusion matrix. Assuming I store the labels and predictions in variables called actual_classes and predicted_classes, respectively, I then run:
accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
cm = metrics.confusion_matrix(actual_classes, predicted_classes)
The results are exactly the same as the ones from Section 1's equivalent code. This reinforces that cross_val_predict(model, x_train, y_train, cv = kfold) works fine.
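A minimal sketch of such a per-fold loop, assuming x_train and y_train are NumPy arrays and reusing the model and kfold objects defined above (the blogs' exact code may differ):
import numpy as np
from sklearn.base import clone

actual_classes, predicted_classes = [], []
for train_idx, test_idx in kfold.split(x_train, y_train):
    fold_model = clone(model)                       # fresh, unfitted copy for each fold
    fold_model.fit(x_train[train_idx], y_train[train_idx])
    predicted_classes.extend(fold_model.predict(x_train[test_idx]))
    actual_classes.extend(y_train[test_idx])
actual_classes = np.array(actual_classes)
predicted_classes = np.array(predicted_classes)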
Thus:
Does it make sense to use scikit-learn cross_val_predict() to make predictions with unseen data in k-fold cross-validation? I would say no, it doesn't, since cross_val_predict() makes predictions with the internal test sets from the cross-validation procedure. It seems that to make predictions with unseen data and compute a generalization error we would need a way to extract one of the models from the cross-validation procedure (e.g., see this question).
Does it make sense to use scikit-learn cross_val_predict() to compare models? I would say yes, it does, as long as the method is executed as shown in Section 1. The accuracy and confusion matrix could be used to make comparisons against other models.
Any comment is appreciated! Thanks!

Using pipeline with preprocessing steps with GridsearchCV

I am using an imblearn pipeline as the estimator and GridSearchCV for hyperparameter tuning, as seen below:
pipeline = imbpipeline(steps=[('scaler', MinMaxScaler()),
                              ('smote', SMOTE(random_state=11)),
                              ('classifier', LogisticRegression())])
search = GridSearchCV(pipeline, classifier_params, scoring='accuracy', cv=cv_inner, refit=True)
search.fit(X_train, y_train)
Here the training set is used for hyperparameter tuning; in each fold it is split into a sub-train set and a validation set.
My problem is this:
For the MinMaxScaler and the logistic regression part, I would like fit_transform applied on each sub-train set and just transform applied on each corresponding validation set, which I think is what is done here.
However, for SMOTE I would like the resampling applied to each sub-train set but each corresponding validation set left untouched, and I am not sure whether this is the case here.
Does someone know more about this?
If this is not the case, is there an example where GridSearchCV is coded at a lower level (maybe with the CV part implemented 'by hand')?
From the imblearn pipeline documentation:
The samplers are only applied during fit.
In other words, SMOTE resamples only the data the pipeline is fitted on (each sub-train set); when the pipeline predicts or scores on a validation fold, the sampler step is skipped and the validation data is left untouched, which is the behaviour you want.
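A minimal sketch (with hypothetical toy data) illustrating that behaviour:
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=11)
pipe = imbpipeline(steps=[('scaler', MinMaxScaler()),
                          ('smote', SMOTE(random_state=11)),
                          ('classifier', LogisticRegression())])
pipe.fit(X, y)          # SMOTE resamples the training data before the classifier is fitted
pipe.predict(X)         # only MinMaxScaler.transform and LogisticRegression.predict run; no resampling
So inside GridSearchCV, each validation fold is scored on its original, un-resampled samples.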

How to train a linear SVM with H2O

The H2OSupportVectorMachineEstimator in H2O seems to only support "gaussian" as the value of the kernel_type parameter. Is there a way to train a linear SVM with H2O?
As you mention, based on the documentation (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/svm.html), there is currently no way to train a linear SVM in H2O. Among linear models, I think it only has GLM (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html).
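If a linear model is an acceptable substitute, a minimal GLM sketch might look like the following; the file name "train.csv", the "label" column, and the binomial family are all hypothetical placeholders:
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")                  # hypothetical training file
train["label"] = train["label"].asfactor()            # binomial GLM needs a categorical target
predictors = [c for c in train.columns if c != "label"]

glm = H2OGeneralizedLinearEstimator(family="binomial")  # linear model as a stand-in for a linear SVM
glm.train(x=predictors, y="label", training_frame=train)
print(glm.model_performance())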

sklearn.linear_model.SGDClassifier manual inference for multiclass classification

I have trained a model for 3-class classification using sklearn.linear_model.SGDClassifier. Now I'm looking for a way to run manual inference with the model. The problem is that the model contains three pairs of [coef_, intercept_], so I don't understand how I can do a prediction in C++.
The training code follows the sklearn example:
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(train_features, train_labels)
I tried to calculate coef_ * sample + intercept_ for each of the classes but didn't understand how to determine the class from those numbers.
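For what it's worth, SGDClassifier handles multiclass problems one-vs-all, so the predicted class is the one whose linear score is largest after the scaler's standardization has been applied. A minimal Python sketch of the manual inference, assuming the clf pipeline above (the same argmax logic ports directly to C++):
import numpy as np

scaler = clf.named_steps['standardscaler']
sgd = clf.named_steps['sgdclassifier']

def manual_predict(sample):
    # standardize exactly as StandardScaler does
    x = (np.asarray(sample) - scaler.mean_) / scaler.scale_
    # one score per class: coef_ is (n_classes, n_features), intercept_ is (n_classes,)
    scores = sgd.coef_ @ x + sgd.intercept_
    # one-vs-all: the class with the highest score wins
    return sgd.classes_[np.argmax(scores)]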

Which model: Best estimator from gridsearchCV or all training data?

I am a little confused when it comes to grid search and fitting the final model. I split the data in 2: training and testing. The testing set is only used for final evaluation. I perform the grid search only using the training data.
Say one has done a grid search over several hyperparameters using cross-validation. The grid search gives the best combination of the hyperparameters. Next step is to train the model, and this is where I am confused. I see 2 possibilities:
1) Don't retrain the model. Use the already-fitted best model returned by the grid search.
or
2) Don't reuse the fitted model from the grid search. Instead, train a new model on the full training set with the best hyperparameter combination from the grid search.
What is the correct approach, 1 or 2?
This is probably late, but might be useful for someone else who comes along.
GridSearchCV has a parameter called refit, which is set to True by default. This means that after performing k-fold cross-validation (i.e., training on subsets of the data you passed in), it refits the model on the complete training set using the best hyperparameters from the grid search.
Presumably your question, from what I can glean, can be summarized as:
Suppose you use 5-fold cross-validation. Your model is then fitted only on 4 folds, as the fifth fold is used for validation. So would you need to retrain the model on the whole of train (i.e., the data from all 5 folds)?
The answer is no, provided you set refit to True, in which case GridSearchCV will perform the training over the whole of the training set using the best hyperparameters it has found after cross-validation. It will then return the trained estimator object, on which you can directly call the predict method, as you would normally do otherwise.
Refer: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
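For concreteness, a minimal sketch of relying on refit=True; the RandomForestClassifier, the grid, and X_train/X_test are hypothetical placeholders:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}   # hypothetical grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, refit=True)
search.fit(X_train, y_train)        # CV picks the best hyperparameters, then refits on all of X_train

print(search.best_params_)
y_pred = search.predict(X_test)     # delegates to search.best_estimator_, trained on the full training set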
You train the model on the training set using the hyperparameters obtained from the grid search.
Then you can evaluate the model on the test set.
