What will GridSearchCV choose if there are multiple estimators having the same score? - scikit-learn

I'm using RandomForestClassifier in sklearn and GridSearchCV to get the best estimator.
I'm wondering: when there are many estimators (from simple ones to complex ones) that get the same score in GridSearchCV, which estimator will GridSearchCV return? The simplest one, or a random one?

GridSearchCV does not assess the model complexity (though that would be a neat feature). Neither does it choose among the best models randomly.
Instead, GridSearchCV simply performs an np.argmin() on the stored results (the ranked test scores in current versions). See the corresponding line in the source code.
Now, according to the NumPy docs,
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
That is, GridSearchCV will always select the first among the best models.
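A quick way to see this tie-breaking is to put the same candidate in the grid twice, which forces an exact tie; the dataset and parameter values below are only illustrative:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# np.argmin returns the FIRST index when there are ties
print(np.argmin([1, 0, 0, 2]))  # -> 1, not 2
# two identical candidates (same params, same random_state) must tie exactly
X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 50]}, cv=5)
grid.fit(X, y)
print(grid.cv_results_["mean_test_score"])  # the same score twice
print(grid.best_index_)                     # -> 0, the first of the tied candidates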

Related

sklearn.ensemble: can you use fewer estimators than the number trained in the final model?

Most sklearn.ensemble models (GradientBoostingClassifier, RandomForestClassifier, etc.) take an n_estimators parameter for the number of estimators in the ensemble. If you've trained a model with X estimators, can you use fewer than X estimators in your prediction? This can be useful for model selection.
Example: having trained 800 trees, you might want to see how a 400-tree model performs. Given that you already have an 800-tree model, you should be able to predict with just the first 400 trees rather than training again.
This can be done in boosting models, but a bagging model like random forest may not offer this option. The decision trees in a boosting model are built sequentially, so using the first 400 of the 800 trees makes sense. Trees in a random forest have no ordering, however, so you would have to randomly sample 400 trees, which I don't think the module offers.
The boosting models (GradientBoostingClassifier, AdaBoostClassifier, and HistGradientBoostingClassifier) all support this through their staged_* methods (staged_predict, staged_predict_proba, and so on). You don't directly set the number of estimators; instead, you get all the partial predictions and can extract whichever one(s) you want.
For others like RandomForestClassifier there isn't builtin support, but you can access its estimators_ and do the aggregation of the predictions yourself. You can also overwrite the attribute estimators_ with a subset (in a deep copy of the estimator, say) and then use the predict functionality directly; I wouldn't count on that working in future versions, but it does work as of 0.22.
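A rough sketch of both approaches on synthetic data (the 800/400 numbers just mirror the example above; the dataset is a placeholder):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
X, y = make_classification(n_samples=300, random_state=0)
# boosting: staged_predict yields predictions after 1, 2, ..., n_estimators stages,
# so the "400-tree model" is simply the 400th element
gb = GradientBoostingClassifier(n_estimators=800, random_state=0).fit(X, y)
pred_400 = list(gb.staged_predict(X))[399]
# random forest: no built-in support, but you can aggregate a subset of estimators_
# yourself (average the per-tree probabilities of the first 400 trees, then argmax)
rf = RandomForestClassifier(n_estimators=800, random_state=0).fit(X, y)
proba_400 = np.mean([tree.predict_proba(X) for tree in rf.estimators_[:400]], axis=0)
pred_rf_400 = rf.classes_[np.argmax(proba_400, axis=1)]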

Can you manually define multiple base_estimator models for BaggingClassifier in sklearn?

I'm trying to use BaggingClassifier from sklearn with multiple base estimators. From my understanding, it would be something similar to this:
clf = BaggingClassifier(base_estimator=[SVC(), DecisionTreeClassifier()], n_estimators=3, random_state=0)
But BaggingClassifier here doesn't take a list as its base_estimator.
I assume I could switch to StackingRegressor(estimators=...) to define multiple estimators manually, but it would be a pain to list out, say, 100 estimators, not to mention the many permutations and combinations of base estimators.
Can you help me understand how to define multiple base_estimator in sklearn.BaggingClassifier?
You can only pass one estimator to base_estimator. The whole idea behind BaggingClassifier is to train many copies of a single base estimator on random samples of the training data in an attempt to reduce its variance.
If you need two or more estimators, each one of them trained on random subsets of data, I suggest two different options:
Create your own voting process from two separate bagging classifiers
Train two different BaggingClassifiers and pass them to sklearn.ensemble.StackingClassifier.
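A minimal sketch of the second option; the data, estimator choices and n_estimators values are just placeholders (the parameter is called estimator in scikit-learn >= 1.2 and base_estimator before that):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(random_state=0)
# one BaggingClassifier per base estimator type
# (use base_estimator= instead of estimator= on scikit-learn < 1.2)
bag_svc = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=0)
bag_tree = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=0)
# combine them, here with a StackingClassifier (a VotingClassifier works similarly)
stack = StackingClassifier(estimators=[("bag_svc", bag_svc), ("bag_tree", bag_tree)],
                           final_estimator=LogisticRegression())
stack.fit(X, y)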

XGBoost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is the backward elimination method a good idea for this? I have used it in regression, but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation importance and it has yielded good results! I'm looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern? Removing "noisy" features that drive down your results, or obtaining a sparse model? Backward selection is one way to do it, of course. That being said, in case you are not aware of it, XGBoost computes its own "variable importance" values.
# plot feature importance using XGBoost's built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# X, y: your training features and labels
model = XGBClassifier()
model.fit(X, y)
# plot feature importance (by default, how often each feature is used in a split)
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define, for instance, a threshold below which you do not keep a variable. However, do not forget that:
this variable importance has been obtained on the training data only;
the removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.
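One way to implement the threshold idea above is sklearn's SelectFromModel, which reads the fitted model's feature_importances_; this is only a sketch, with synthetic data standing in for your own X, y and an arbitrary median threshold:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
# stand-in for your own training data
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
# keep only the features whose importance exceeds the chosen threshold
# (here the median importance), then refit on the reduced feature set
selector = SelectFromModel(XGBClassifier(), threshold="median").fit(X, y)
X_reduced = selector.transform(X)
reduced_model = XGBClassifier().fit(X_reduced, y)
print(X.shape, "->", X_reduced.shape)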

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross-validation. The user can pass the number of folds as the cv argument of the class to perform k-fold cross-validation (a stratified k-fold split is used by default; 5 folds in recent scikit-learn versions).
I would recommend reading the documentation for LogisticRegression and LogisticRegressionCV.
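A minimal usage sketch (synthetic data; the Cs and cv values are just examples):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# cv=5 -> 5-fold stratified cross-validation over a grid of 10 C values
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X, y)
print(clf.C_)                # C value(s) selected by cross-validation (one per class)
print(clf.scores_[1].shape)  # (n_folds, n_Cs) grid of scores for class 1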
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV functionality for all those classifiers/regressors
but: when the classifier/regressor is known and fixed a priori (to some extent), and sometimes even the CV scheme, one can gain advantages by using these facts in specialized code bound to that one classifier/regressor, resulting in improved performance!
Typically:
CV already embedded in the optimization algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter such as the regularization strength)
It seems that at least the latter idea is used in sklearn's LogisticRegressionCV, as seen in this excerpt:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in the scikit-learn documentation, which I believe explains it well:
Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for a single value of the parameter. This feature can be leveraged to perform a more efficient cross-validation used for model selection of this parameter. The most common parameter amenable to this strategy is the parameter encoding the strength of the regularizer. In this case we say that we compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
There are some things left out of the other answers, e.g. the relation to grid-search functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf
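To make the "roughly equivalent to GridSearchCV(Estimator(), ...)" part concrete, here is a sketch that runs both searches over the same C grid on synthetic data; the CV estimator can warm-start along the regularization path, while the generic grid search refits from scratch for every C (no claim that the two pick identical values):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
Cs = np.logspace(-4, 4, 10)
# built-in CV estimator: searches over Cs internally
lr_cv = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000).fit(X, y)
# generic grid search over the same values
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": Cs}, cv=5).fit(X, y)
print(lr_cv.C_, grid.best_params_)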

Meaning of GridSearchCV with RFECV in sklearn

Based on Recursive feature elimination and grid search using scikit-learn, I know that RFECV can be combined with GridSearchCV to obtain a better parameter setting for a model like a linear SVM.
As said in the answer, there are two ways:
"Run GridSearchCV on RFECV, which will result in splitting the data into folds two times (ones inside GridSearchCV and once inside RFECV), but the search over the number of components will be efficient."
"Do GridSearchCV just on RFE, which would result in a single splitting of the data, but in very inefficient scanning of the parameters of the RFE estimator."
To make my question clear, I first have to lay out my understanding of RFECV:
Split the whole data into n folds.
In every fold, obtain the feature rank by fitting only the training data to rfe.
Sort the ranking, fit an SVM on the training data and score it on the test data. This is done m times, each time with a decreasing number of features, where m is the total number of features (assuming step=1).
The sequence of scores obtained in the previous step is then averaged across the n folds once steps 2-3 have been done for every fold, resulting in an averaged score sequence that suggests the best number of features for RFE.
Take that best number of features as the n_features_to_select argument of an RFE fitted on the whole original data.
.support_ gives the "winners" among the features; .grid_scores_ gives the averaged score sequence.
Please correct me if I am wrong, thank you.
So my question is: where does GridSearchCV go? I guess the second way, "do GridSearchCV just on RFE", means doing GridSearchCV at step 5: set the SVM parameter to one of the values in the grid, fit the RFE on the training portion of the split made by GridSearchCV with the number of features suggested in step 4, and score it on the held-out portion. This process is done k times, where k is the cv argument of GridSearchCV, and the averaged score indicates how good that grid value is. However, the selected features might differ across training splits and grid values, which would make this second way unreasonable if it works the way I guess.
How is GridSearchCV actually combined with RFECV?
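For reference, a hedged sketch of the first option quoted above ("run GridSearchCV on RFECV"): the outer GridSearchCV tunes the SVM's C through the nested estimator__C parameter while the inner RFECV picks the number of features for each candidate, so the data is indeed split twice (synthetic data and grid values are only illustrative):
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
# inner loop: RFECV chooses the number of features for a given C
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=5)
# outer loop: GridSearchCV cross-validates over C (note the nested estimator__C name)
search = GridSearchCV(rfecv, param_grid={"estimator__C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_estimator_.n_features_)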
