Meaning of GridSearchCV with RFECV in sklearn - scikit-learn

Based on Recursive feature elimination and grid search using scikit-learn, I know that RFECV can be combined with GridSearchCV to obtain better parameter setting for the model like linear SVM.
As said in the answer, there are two ways:
"Run GridSearchCV on RFECV, which will result in splitting the data into folds two times (ones inside GridSearchCV and once inside RFECV), but the search over the number of components will be efficient."
"Do GridSearchCV just on RFE, which would result in a single splitting of the data, but in very inefficient scanning of the parameters of the RFE estimator."
To make my question clear, I have to firstly clarify RFECV:
Split the whole data into n folds.
In every fold, obtain the feature rank by fitting only the training data to rfe.
Sort the ranking and fit the training data to SVM and test it on testing data for scoring. This should be done m times, each with decreasing number of features, where m is the number of features assuming step=1.
A sequence of scores is obtained in the previous step and such sequence would be lastly averaged across n folds after step 1~3 have been done in n times, resulting in an averaged scoring sequence suggesting the best number of features to do in rfe.
Take that best number of features as the argument of n_features_to_select in rfe fitted with the original whole data.
.support_ to get the "winners" among features; .grid_scores_ to get the averaged scoring sequence.
Please correct me if I am wrong, thank you.
So my question is where to put GridSearchCV? I guess the second way "do GridSearchCV just on RFE" is do GridSearchCV on step 5 which sets the parameter of SVM to one of the value in the grid, fit it on training data split by GridSearchCV to obtain the number of features suggested in step 4, and test it with the rest of the data for the score. Such process is done in k times and an averaged score indicates the goodness of that value in the grid, where k is the argument cv in GridSearchCV. However, selected features might be different due to alternative training data and grid value, which makes this second way not reasonable if it is done as my guess.
How actually does GridSearchCV be combined with RFECV?

Related

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is the same order of magnitude the predicted values and, more important, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction you should not use the training data. As the name says theese type of data are used only in training, for evaluating accuracy scores you ahve to use data that the model has never seen.
About cross-validation I can tell that it's an approach to find the best training/testing set. The process is as follows: you divide your data into n groups and you do various iterating changing the testing group you pick. If you have n groups you will do n iteration and each time the training and testing set will be different. It's more understamdable in the image below.
Basically what you should do it's kile this:
Train the model using months from 0 to 30 (for example)
See the predictions made with months from 31 to 35 as input.
If the input has to be the same lenght divide feature in half (should be 17 months).
I hope I understood correctly, othewise comment.

What will GridsearchCV choose if there are multiple estimators having the same score?

I'm using RandomForestClassifier in sklearn, and using GridsearchCV for getting best estimator.
I'm wondering when there are many estimators (from simple one to complex one) having the same scores in GridsearchCV, what will be the resulted estimator out of GridsearchCV? The simplest one? or random one?
GridSearchCV does not assess the model complexity (though that would be a neat feature). Neither does it choose among the best models randomly.
Instead, GridSearchCV simply performs an np.argmin() on the stored errors. See the corresponding line in the source code.
Now, according to the NumPy docs,
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
That is, GridSearchCV will always select the first among the best models.

GridSearchCV: based on mean_test_score results, predict should perform much worse, but it does not

I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score'],
the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_ on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand that, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results as above, but the best regressor applied to the totally unseen data predicts with much higher errors, like 10%, 30% or 50%. Which is what I expected, at least for some points, fitting GridSearchCV on the whole set, based on the results of k-fold test sets.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of them never saw 1/k fraction of data. Being the mean_test_score the average between these k scores, I expect to see a bunch of predictions (depending on cv value) which show errors distributed around a mean error that justifies a 0.7 score.
The refit=True parameter of GridSearchCV makes the estimator with the found best set of hyperparameters be refit on the full data. So if your training error is almost zero in the CV folds, you would expect it to be near zero in the best_estimator_ as well.

why is sklearn.feature_selection.RFECV giving different results for each run

I tried to do feature selection with RFECV but it is giving out different results each time, does cross-validation divide the sample X into random chunks or into sequential deterministic chunks?
Also, why is the score different for grid_scores_ and score(X,y)? why are the scores sometimes negative?
Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks?
CV divides the data into deterministic chunks by default. You can change this behaviour by setting the shuffle parameter to True.
However, RFECV uses sklearn.model_selection.StratifiedKFold if the y is binary or multiclass.
This means that it will split the data such that each fold has the same (or nearly the same ratio of classes). In order to do this, the exact data in each fold can change slightly in different iterations of CV. However, this should not cause major changes in the data.
If you are passing a CV iterator using the cv parameter, then you can fix the splits by specifying a random state. The random state is linked to random decisions made by the algorithm. Using the same random state each time will ensure the same behaviour.
Also, why is the score different for grid_scores_ and score(X,y)?
grid_scores_ is an array of cross-validation scores. grid_scores_[i] is the cross-validation score for the i-th iteration. This means that the first score is the score for all features, the second is the score when one set of features is removed and so on. The number of features removed in each is equal to the value of the step parameter. This is = 1 by default.
score(X, y) selects the optimal number of features and returns the score for those features.
why are the scores sometimes negative?
This depends on the estimator and scorer you are using. If you have set no scorer RFECV will use the default score function for the estimator. Generally, this is accuracy, but in your particular case, might be something that returns a negative value.

Confusing example of nested cross validation in scikit-learn

I'm looking at this example from scikit-learn documentation: http://scikit-learn.org/0.18/auto_examples/model_selection/plot_nested_cross_validation_iris.html
It seems to me that crossvalidation is not performed in an unbiased way here. Both GridSearchCV (supposedly the inner CV loop) and cross_val_score (supposedly the outer CV loop) are using the same data and the same folds. Therefore there is an overlap between the data the classifier was trained on and evaluated with. What am I getting wrong?
#Gael - As I cannot add a comment, I am posting this in the answer section. I am not sure what Gael means by "the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that the whole point of the GridSearchCV object)". Are you trying to imply that the cross_val_score function passes the (k-1)-fold data (used for training in outer loop) to the clf object ? That does not appear to be the case, as I can comment out the cross_val_score function and just set nested_score[i] to a dummy variable, and still obtain the exact same clf.best_score_. This implies that the GridSearchCV is evaluated separately and does use all available data, and not a subset of training data.
In nested CV, to the best of my understanding, the idea is that the inner loop will do the hyper-parameter search on a smaller subset of training data, and then the outer loop will use these parameters to do a cross-validation. One of the reasons for using smaller training data in the inner loop is to avoid information leakage. It doesn't appear that's what is happening here. The inner loop is first using all the data to search for hyper-parameters, which are then used for cross-validation in the outer loop. Thus, the inner loop has already seen all data and any testing done in the outer loop will suffer from information leakage. If I am mistaken, could you please point me to the section of code which you are referring to in your answer ?
Totally agree, that nested-cv procedure is wrong, cross_val_score is taken the best hyperparameters computed by GridSearchCV and computing a cv score using such hyperparameters. In nested-cv, you need the outer loop for assessing model performance and the inner loop for model selection, such that, the portion of data used in the inner loop for model selection must not be the same used for assessing model performance. An example will be a LOOCV outer loop for assessing performance (or, it will be a 5cv, 10cv, or whatever you like) and a 10cv-fold for model selection with grid search in the inner loop. That means that, if you have N observations then you will perform model selection in the inner loop (using grid search and 10-CV, for example) on the N-1 observations, and you will asses the model performance on the LOO observation (or in the hold-out data sample if you choose another approach).
(Note that you are estimating N best models in the sense of hyperparameters internally) .
it will be helpful to have access to the link of the code of cross_val_score and GridSearchCV.
Some references for nested CV are:
Christophe Ambroise and Georey J McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the national academy of sciences 99, 10 (2002), 6562 - 6566.
Gavin C Cawley and Nicola LC Talbot. On overfitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, Jul (2010), 2079{2107.
Note:
I did not find anything in the documentation of cross_val_score indicating that internally the hyperparameters are optimized using parameter search, grid search + cross-validation for example, on the k-1 folds of data, and using those optimized parameters on the hold-out data sample (what I am saying is different to the code in http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
They are not using the same data. Granted, the code of the example does not make it apparent, because the splits are not visible: the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that the whole point of the GridSearchCV object). Using functions and objects rather than hand-written for loops may make things less transparent, but it:
Enables reuses
Adds many "little things" that would render the for loop tedious, such as parallel computing, support for different scoring function, etc.
Is actually safer in terms of avoid data leakage because our splitting code has been audited many many times.
If you are not convinced, take a look at the code of cross_val_score and GridSearchCV.
The example was improved recently to specify this in the comments:
http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html
(pull request on https://github.com/scikit-learn/scikit-learn/pull/7949 )

Resources