Compute test metric on all folds of GridSearchCV - scikit-learn

I am tuning a pipeline that includes imputation, standardization and prediction. It is implemented as an sklearn Pipeline and I am running GridSearchCV with k folds.
Is it possible to compute the test metric on the concatenated predictions of all folds, rather than computing it within each fold and averaging? How can I implement this?
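As far as I know, GridSearchCV itself only averages per-fold scores, so this has to be done around it rather than inside it. A minimal sketch of one way to do it, looping over ParameterGrid and using cross_val_predict (which returns the out-of-fold prediction for every sample); the dataset, pipeline steps and parameter grid below are only placeholders:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid, KFold, cross_val_predict
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)             # stand-in data

pipe = Pipeline([("impute", SimpleImputer()),     # placeholder pipeline
                 ("scale", StandardScaler()),
                 ("model", Ridge())])
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}   # placeholder grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for params in ParameterGrid(param_grid):
    pipe.set_params(**params)
    # cross_val_predict returns one out-of-fold prediction per sample,
    # so the metric below is computed on all folds pooled together.
    pooled_pred = cross_val_predict(pipe, X, y, cv=cv)
    scores.append((params, r2_score(y, pooled_pred)))

best_params, best_score = max(scores, key=lambda s: s[1])
print(best_params, best_score)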

Related

OneVsRestClassifier Hyperparameter Tuning for every base estimator

I am working on a multiclass problem with six different classes and I am using OneVsRestClassifier.
I have then performed hyperparameter tuning with GridSearchCV and obtained the optimized classifier with clf.best_estimator_.
As far as I understand, this returns a single set of hyperparameters for the aggregated model, i.e. the same set for every base estimator.
Is there a way to perform hyperparameter tuning separately for each base estimator?
Sure, just reverse the order of the search and the multiclass wrapper:
one_class_clf = GridSearchCV(base_classifier, params, ...)
clf = OneVsRestClassifier(one_class_clf)
Fitting clf generates the one-vs-rest problems, and for each of those fits a copy of the grid-searched base_classifier.
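A minimal runnable sketch of this pattern (the dataset, base estimator and parameter grid are just placeholders):

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

params = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}   # placeholder grid
one_class_clf = GridSearchCV(SVC(), params, cv=3)

clf = OneVsRestClassifier(one_class_clf)
clf.fit(X, y)

# Each binary one-vs-rest problem gets its own tuned copy of the grid search:
for i, est in enumerate(clf.estimators_):
    print(f"class {i}: {est.best_params_}")

Each entry of clf.estimators_ is the fitted GridSearchCV for one binary problem, so each class can end up with its own best_params_.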

OneVsRest predict_proba() and predict() don't match?

I'm using OneVsRestClassifier on a multiclass problem with svm.SVC as the base estimator.
The argmax of predict_proba() does not match the predicted class. Is there some normalization going on in the background? How do I get predict_proba() and predict() to match?
According to scikit-learn's SVC documentation on multi-class classification, there can be discrepancies between the output of predict and the argmax of predict_proba (emphasis mine):
The decision_function method of SVC and NuSVC gives per-class scores for each sample (or a single score per sample in the binary case). When the constructor option probability is set to True, class membership probability estimates (from the methods predict_proba and predict_log_proba) are enabled. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM’s scores, fit by an additional cross-validation on the training data. In the multiclass case, this is extended as per Wu et al. (2004).
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
You cannot get them to match using an SVC. You can try another model if you need the probabilities. If you do not need probabilities, then, as stated in the documentation, you can use decision_function instead (see here for more details).
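A small illustration of both points (iris is only a stand-in dataset, and whether a mismatch actually shows up depends on the data):

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# probability=True triggers the internal Platt-scaling cross-validation
clf_proba = OneVsRestClassifier(SVC(probability=True, random_state=0)).fit(X, y)
pred = clf_proba.predict(X)
argmax_proba = clf_proba.predict_proba(X).argmax(axis=1)
print("samples where predict != argmax(predict_proba):",
      (pred != clf_proba.classes_[argmax_proba]).sum())

# If confidence scores are enough, decision_function is consistent with
# predict and avoids the extra cross-validation of Platt scaling:
clf_scores = OneVsRestClassifier(SVC(probability=False)).fit(X, y)
scores = clf_scores.decision_function(X)
print((clf_scores.predict(X) == clf_scores.classes_[scores.argmax(axis=1)]).all())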

Feature selection on a keras model

I am trying to find the features that most influence the output of my regression model. Following is my code.
import numpy as np
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE

# baseline_model, X_set and Y_set are defined elsewhere
seed = 7
np.random.seed(seed)
estimators = []
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=3, batch_size=20)))
pipeline = Pipeline(estimators)
rfe = RFE(estimator=pipeline, n_features_to_select=5)
fit = rfe.fit(X_set, Y_set)
But I get the following runtime error when running.
RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
How can I overcome this issue and select the best features for my model? If that is not possible, can I use algorithms like LogisticRegression(), which are supported by RFE in scikit-learn, to find the best features for my dataset?
I assume your Keras model is some kind of neural network, and with neural networks in general it is hard to see which input features are relevant and which are not. The reason is that each input feature is linked to several coefficients, one for each node of the first hidden layer. Adding more hidden layers makes it even harder to determine how big an impact an input feature has on the final prediction.
For linear models, on the other hand, it is very straightforward: each feature x_i has a corresponding weight/coefficient w_i whose magnitude directly determines how big an impact it has on the prediction (assuming, of course, that the features are scaled).
The RFE estimator (recursive feature elimination) assumes that your prediction model has an attribute coef_ (linear models) or feature_importances_ (tree models) with one entry per input feature, representing its relevance (in absolute terms).
My suggestion:
Feature selection: (a) Run RFE on any linear / tree model to reduce the number of features to the desired n_features_to_select (see the sketch after this list). (b) Use regularized linear models that enforce sparsity, like lasso / elastic net; the drawback is that you cannot directly set the number of selected features. (c) Use any other feature selection technique from here.
Neural network: Use only the features selected in step 1 for your neural network.
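A minimal sketch of option (a), assuming X_set and Y_set are the arrays from the question (replaced here by generated placeholder data):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

X_set, Y_set = make_regression(n_samples=200, n_features=20, random_state=7)  # placeholder data

# RFE works here because LinearRegression exposes coef_; scale first so the
# coefficient magnitudes are comparable across features.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(StandardScaler().fit_transform(X_set), Y_set)

X_selected = X_set[:, selector.support_]   # the columns RFE kept
print("selected feature indices:", np.where(selector.support_)[0])

# X_selected (n_samples x 5) would now be the input to the KerasRegressor /
# neural network instead of the full X_set.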
Suggestion:
Perform the RFE algorithm with an sklearn-based estimator to obtain the feature importances, then train your Keras model on the most important features only.
To your question: standardization is not required for logistic regression.

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross-validation. The number of folds can be passed as the cv argument to perform k-fold cross-validation (a stratified k-fold split is used by default).
I would recommend reading the documentation for LogisticRegression and LogisticRegressionCV.
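A minimal usage sketch (the dataset is just a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)   # stand-in data
X = StandardScaler().fit_transform(X)        # scale so the solver converges quickly

clf = LogisticRegressionCV(
    Cs=10,   # 10 candidate values of C on a log scale between 1e-4 and 1e4
    cv=5,    # 5-fold stratified cross-validation
).fit(X, y)

print(clf.C_)       # the value of C selected by cross-validation
print(clf.scores_)  # dict mapping each class to an array of shape (n_folds, n_Cs)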
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV functionality for all those classifiers/regressors,
but: when the classifier/regressor (and sometimes even the CV scheme) is known and fixed a priori, to some extent, one can exploit these facts with specialized code bound to that one classifier/regressor, resulting in improved performance.
Typically:
CV already embedded in optimization-algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter like alpha)
It seems that at least the latter idea is used in sklearn's LogisticRegressionCV, as this excerpt shows:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in the scikit-learn documentation, which I believe explains it well:
Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for a single value of the parameter. This feature can be leveraged to perform a more efficient cross-validation used for model selection of this parameter. The most common parameter amenable to this strategy is the parameter encoding the strength of the regularizer. In this case we say that we compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
There are some things left out of the other answers, e.g. how this relates to the grid-search functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf
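A rough side-by-side sketch of the two constructions the glossary compares; the dataset is a stand-in, and the two objects are only roughly equivalent, not guaranteed to select exactly the same C:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)   # stand-in data
X = StandardScaler().fit_transform(X)
Cs = np.logspace(-4, 4, 10)

# Canonical estimator wrapped in grid search: refits from scratch for every C and fold.
grid = GridSearchCV(LogisticRegression(), {"C": Cs}, cv=5).fit(X, y)

# Cross-validation estimator: walks the same C path, reusing coefficients (warm start).
cv_est = LogisticRegressionCV(Cs=Cs, cv=5).fit(X, y)

print(grid.best_params_["C"], cv_est.C_)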

Meaning of GridSearchCV with RFECV in sklearn

Based on Recursive feature elimination and grid search using scikit-learn, I know that RFECV can be combined with GridSearchCV to obtain a better parameter setting for a model such as a linear SVM.
As said in the answer, there are two ways:
"Run GridSearchCV on RFECV, which will result in splitting the data into folds two times (ones inside GridSearchCV and once inside RFECV), but the search over the number of components will be efficient."
"Do GridSearchCV just on RFE, which would result in a single splitting of the data, but in very inefficient scanning of the parameters of the RFE estimator."
To make my question clear, I first have to lay out my understanding of RFECV:
1. Split the whole data into n folds.
2. In every fold, obtain the feature ranking by fitting only the training data to rfe.
3. Sort the ranking, fit the training data to the SVM and test it on the test data for scoring. This is done m times, each time with a decreasing number of features, where m is the total number of features (assuming step=1).
4. The sequence of scores obtained in the previous step is averaged across the n folds once steps 1-3 have been done n times, resulting in an averaged scoring sequence that suggests the best number of features for rfe.
5. Take that best number of features as the n_features_to_select argument of rfe and fit it on the original whole data.
Use .support_ to get the "winners" among the features and .grid_scores_ to get the averaged scoring sequence.
Please correct me if I am wrong, thank you.
So my question is: where does GridSearchCV go? My guess for the second way, "do GridSearchCV just on RFE", is that GridSearchCV is applied at step 5: set the SVM parameter to one of the values in the grid, fit it on the training data split off by GridSearchCV to obtain the number of features suggested in step 4, and test it on the remaining data for the score. This process is repeated k times, and the averaged score indicates how good that grid value is, where k is the cv argument of GridSearchCV. However, the selected features might differ across training splits and grid values, which would make this second way unreasonable if it really works the way I guess.
How is GridSearchCV actually combined with RFECV?
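For reference, a sketch of how the two quoted ways can be written down; the estimator, parameter grid and dataset are placeholders:

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)   # stand-in data
svc = SVC(kernel="linear")                   # linear kernel so coef_ is available

# Way 1: GridSearchCV around RFECV -- the data is split twice (folds inside
# folds), but RFECV scans the number of features efficiently for each C.
way1 = GridSearchCV(
    RFECV(estimator=svc, step=1, cv=3),
    param_grid={"estimator__C": [0.1, 1, 10]},
    cv=3,
).fit(X, y)

# Way 2: GridSearchCV around plain RFE -- a single level of splitting, but
# n_features_to_select becomes just another grid dimension to scan.
way2 = GridSearchCV(
    RFE(estimator=svc, step=1),
    param_grid={"estimator__C": [0.1, 1, 10],
                "n_features_to_select": [5, 10, 20]},
    cv=3,
).fit(X, y)

print(way1.best_params_, way2.best_params_)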
