Does GridSearchCV shuffle the data before creating the folds? - scikit-learn

I'm using sklearn's GridSearchCV to tune hyperparameters, but I want to know whether the dataset I give it will be shuffled before the folds are created. I'd like it NOT to be shuffled, but I can't find in the documentation whether it is or isn't. Something like train_test_split has a boolean to shuffle or not.

By default (cv=None or an integer), GridSearchCV uses a StratifiedKFold cross-validator for classifiers and KFold otherwise, and both of these default to shuffle=False. The cv parameter documentation of GridSearchCV provides some additional information, too.

From the documentation
3.1.3. A note on shuffling
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:
This consumes less memory than shuffling the data directly.
By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
To get identical results for each split, set random_state to an integer.
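If you want the behaviour to be explicit rather than implied by the default, you can pass a cross-validator object to cv yourself. A minimal sketch, using the Iris data and LogisticRegression purely as placeholders for your own dataset and estimator:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# cv=5 (an integer) resolves to StratifiedKFold(5) for a classifier, with shuffle=False.
# Passing the cross-validator explicitly makes that choice visible and easy to change.
cv = StratifiedKFold(n_splits=5, shuffle=False)
# cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # opt in to shuffling

search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
print(search.best_params_)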

Related

gridsearchCV - shuffle data for every single parameter combination

I am using gridsearchCV to determine model hyper-parameters:
pipe = Pipeline(steps=[(self.FE, FE_algorithm), (self.CA, Class_algorithm)])
param_grid = {**FE_grid, **CA_grid}
scorer = make_scorer(f1_score, average='macro')
search = GridSearchCV(pipe, param_grid,
                      cv=ShuffleSplit(test_size=0.20, n_splits=5, random_state=0),
                      n_jobs=-1, verbose=3, scoring=scorer)
search.fit(self.data_input, self.data_output)
However, I believe I am running into some problems with overfitting.
I would like to shuffle the data for every single parameter combination; is there any way to do this? Currently, with the k-fold cross-validation, the same sets of validation data are evaluated for each parameter combination, so overfitting to those folds is becoming an issue.
No, there isn't. The search splits the data once and creates a task for each pair of fold and parameter combination (source).
Shuffling per parameter combination is probably not desirable anyway: the selection might then just pick the "easiest" split instead of the "best" parameter. If you think you are overfitting to the validation folds, then consider using
fewer parameter options
more folds, or repeated splits*
a scoring callable that customizes evaluation
models that are more conservative
*my favorite among these, although the computational cost may be too high; a sketch follows below
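If repeated splits are the route you take, one option is to hand GridSearchCV a RepeatedStratifiedKFold, so every parameter combination is still scored on the same folds, just many more of them. A minimal sketch, with a toy classifier and grid standing in for the pipeline above:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)

# 5 folds repeated 4 times = 20 validation scores per parameter combination.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.01, 0.1, 1, 10]},
                      cv=cv, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)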

sklearn.ensemble: can you use fewer estimators than the number trained in the final model?

Most sklearn.ensemble models (GradientBoostingClassifier, RandomForestClassifier etc.) take an n_estimators param for the number of estimators in the ensemble. If you've trained a model with X estimators, can you use fewer than X estimators in your prediction? This can be useful for model selection.
Example: train 800 trees, and you might want to see how a 400-tree model performs. Given that you have an 800-tree model, you should just be able to predict with the first 400 trees rather than training it again.
This can be done in boosting models, but a bagging model like random forest may not have this option. Decision trees in boosting models are built sequentially, so using the first 400 of the 800 trees makes sense. But the trees in a random forest have no sequence, so you would have to randomly sample 400 trees, which I don't think the module offers.
The boosting models (GradientBoostingClassifier, AdaBoostClassifier, and HistGradientBoostingClassifier) all support this through their staged_* methods (staged_predict, staged_predict_proba, etc.). You don't directly set the number of estimators; instead, you get all the partial predictions and can extract whichever one(s) you want.
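For instance, a minimal sketch with GradientBoostingClassifier on toy data (scoring the training set purely for illustration; in practice you would evaluate held-out data):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)
gbc = GradientBoostingClassifier(n_estimators=800, random_state=0).fit(X, y)

# staged_predict is a generator: the i-th item is the prediction using the first i trees.
for i, y_pred in enumerate(gbc.staged_predict(X), start=1):
    if i in (400, 800):
        print(i, 'trees:', accuracy_score(y, y_pred))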
For others like RandomForestClassifier there isn't built-in support, but you can access its estimators_ and do the aggregation of the predictions yourself. You can also overwrite the attribute estimators_ with a subset (in a deep copy of the estimator, say) and then use the predict functionality directly; I wouldn't count on that working in future versions, but it does work as of 0.22.
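A rough sketch of that second idea, with the same caveat that it relies on internal behaviour that may change between versions:
from copy import deepcopy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=800, random_state=0).fit(X, y)

# Work on a deep copy so the original 800-tree model stays intact.
rf_small = deepcopy(rf)
rf_small.estimators_ = rf.estimators_[:400]   # any 400 trees would do; order is arbitrary
rf_small.n_estimators = len(rf_small.estimators_)

print('800 trees:', rf.score(X, y))
print('400 trees:', rf_small.score(X, y))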

What will GridsearchCV choose if there are multiple estimators having the same score?

I'm using RandomForestClassifier in sklearn, and GridSearchCV to get the best estimator.
I'm wondering, when there are many estimators (from a simple one to a complex one) that have the same score in GridSearchCV, which estimator will GridSearchCV return? The simplest one? Or a random one?
GridSearchCV does not assess the model complexity (though that would be a neat feature). Neither does it choose among the best models randomly.
Instead, GridSearchCV simply performs an np.argmin() on the stored errors. See the corresponding line in the source code.
Now, according to the NumPy docs,
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
That is, GridSearchCV will always select the first among the best models.
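A small way to observe this tie-breaking for yourself (the duplicate parameter values below are used purely to force identical scores; any toy data and estimator will do):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Two candidates that are guaranteed to score identically.
search = GridSearchCV(SVC(), {'C': [1.0, 1.0]}, cv=5)
search.fit(X, y)

print(search.cv_results_['mean_test_score'])  # identical scores
print(search.best_index_)                     # 0: the first of the tied candidates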

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training it using the sklearn support vector regressor. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initialized with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data with sklearn's K-Fold cross-validation on the training split. This gives you a fairer evaluation of the data and a better model, though at some computational cost, which should hardly matter for a small dataset where accuracy is the priority.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned in shahins' answer above.
In particular, try different values for the C parameter; see the sketch after these hints. As mentioned in the docs, if you have a lot of noisy observations you should decrease it: a smaller C corresponds to stronger regularization of the estimate.
If it's taking too long, you can also try RandomizedSearchCV
As a side note on shahins' answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is better because it performs cross-validation within the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
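Putting several of these hints together, here is a small sketch (with synthetic data standing in for the questioner's 1000-point, 2-feature set) that scales the inputs inside a Pipeline and tunes C and the kernel on the training split only:
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the 1000-point, 2-feature data set from the question.
X, y = make_regression(n_samples=1000, n_features=2, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
param_grid = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [0.1, 1, 10],            # smaller C = stronger regularization
}
search = GridSearchCV(pipe, param_grid, cv=5)   # tuned on the training split only
search.fit(X_train, y_train)

print(search.best_params_)
print('test R^2:', search.score(X_test, y_test))  # the test set is used once, at the end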

Meaning of GridSearchCV with RFECV in sklearn

Based on Recursive feature elimination and grid search using scikit-learn, I know that RFECV can be combined with GridSearchCV to obtain better parameter setting for the model like linear SVM.
As said in the answer, there are two ways:
"Run GridSearchCV on RFECV, which will result in splitting the data into folds two times (ones inside GridSearchCV and once inside RFECV), but the search over the number of components will be efficient."
"Do GridSearchCV just on RFE, which would result in a single splitting of the data, but in very inefficient scanning of the parameters of the RFE estimator."
To make my question clear, I have to firstly clarify RFECV:
Split the whole data into n folds.
In every fold, obtain the feature rank by fitting only the training data to rfe.
Sort the ranking and fit the training data to SVM and test it on testing data for scoring. This should be done m times, each with decreasing number of features, where m is the number of features assuming step=1.
The sequence of scores obtained in the previous step is then averaged across the n folds, after steps 1-3 have been done n times, resulting in an averaged scoring sequence that suggests the best number of features for rfe.
Take that best number of features as the argument of n_features_to_select in rfe fitted with the original whole data.
.support_ to get the "winners" among features; .grid_scores_ to get the averaged scoring sequence.
Please correct me if I am wrong, thank you.
So my question is where to put GridSearchCV. I guess the second way, "do GridSearchCV just on RFE", means doing GridSearchCV at step 5: set the parameter of the SVM to one of the values in the grid, fit it on the training data split off by GridSearchCV using the number of features suggested in step 4, and test it on the rest of the data for the score. This process is done k times, where k is the cv argument of GridSearchCV, and the averaged score indicates how good that grid value is. However, the selected features might differ across training splits and grid values, which makes this second way unreasonable if it is done as I guess.
How is GridSearchCV actually combined with RFECV?
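For what it's worth, a minimal sketch of the first option (GridSearchCV wrapped around RFECV, so the data is split inside both), using toy data and a linear SVM whose C is reached through the estimator__ parameter prefix:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

rfecv = RFECV(estimator=SVC(kernel='linear'), step=1, cv=3)
param_grid = {'estimator__C': [0.1, 1, 10]}   # parameters of the wrapped SVM

search = GridSearchCV(rfecv, param_grid, cv=3)  # outer folds around RFECV's inner folds
search.fit(X, y)

print(search.best_params_)
print(search.best_estimator_.n_features_)       # number of features kept by the refitted RFECV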
