GridSearchCV - shuffle data for every single parameter combination - scikit-learn

I am using GridSearchCV to determine model hyper-parameters:
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[(self.FE, FE_algorithm), (self.CA, Class_algorithm)])
param_grid = {**FE_grid, **CA_grid}
scorer = make_scorer(f1_score, average='macro')
search = GridSearchCV(pipe, param_grid,
                      cv=ShuffleSplit(test_size=0.20, n_splits=5, random_state=0),
                      n_jobs=-1, verbose=3, scoring=scorer)
search.fit(self.data_input, self.data_output)
However, I believe I am running into some problems with overfitting:
[image: results]
I would like to shuffle the data for every single parameter combination. Is there any way to do this? With the current cross-validation, the same validation sets are evaluated for every parameter combination, so overfitting to those sets is becoming an issue.

No, there isn't. The search splits the data once and creates a task for each combination of fold and parameter combination (source).
Shuffling per parameter combination is probably not desirable anyway: the selection might then just pick the "easiest" split instead of the "best" parameter. If you think you are overfitting to the validation folds, then consider using
fewer parameter options
more folds, or repeated splits*
a scoring callable that customizes evaluation
models that are more conservative
*my favorite among these, although the computation cost may be too high
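As a rough illustration of the "repeated splits" suggestion, here is a minimal sketch that passes RepeatedStratifiedKFold to GridSearchCV. The synthetic data, the LogisticRegression estimator, and the C grid are assumptions made only so the example runs on its own; they are not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic stand-in data (assumption; replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5 folds repeated 3 times = 15 validation splits per parameter combination.
# Every parameter combination is still scored on the same 15 splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),      # placeholder estimator
    param_grid={"C": [0.1, 1.0, 10.0]},     # placeholder grid
    cv=cv,
    scoring=make_scorer(f1_score, average="macro"),
)
search.fit(X, y)

print(search.best_params_)
print(search.n_splits_)  # number of splits evaluated per candidate
```

Averaging over more splits makes the mean validation score less sensitive to any single lucky or unlucky split, at the price of proportionally more fits.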

Related

Does Gridsearch CV shuffle the data before creating the folds?

I'm using sklearn's GridSearchCV to tune hyperparameters, but I want to know whether the dataset I give it will be shuffled before the folds are created. I'd like it to NOT be shuffled, but I can't find in the documentation whether it is or isn't. Something like train_test_split has a boolean to shuffle or not.
By default, GridSearchCV will use a clean StratifiedKFold or KFold cross-validator. The default for these cross-validators is shuffle=False. The cv parameter documentation of GridSearchCV provides some additional information, too.
From the documentation
3.1.3. A note on shuffling
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:
This consumes less memory than shuffling the data directly.
By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
To get identical results for each split, set random_state to an integer.
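The shuffling behaviour described in the quoted note can be checked directly. This is a small sketch on a toy array of ten samples; the helper name fold_indices is made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten samples, in order


def fold_indices(cv):
    """Return the test indices of each fold as plain lists."""
    return [test.tolist() for _, test in cv.split(X)]


# Default: shuffle=False, so folds are contiguous blocks in the original order.
unshuffled = fold_indices(KFold(n_splits=5))
print(unshuffled)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True with an integer random_state gives the same folds on every call.
shuffled_a = fold_indices(KFold(n_splits=5, shuffle=True, random_state=0))
shuffled_b = fold_indices(KFold(n_splits=5, shuffle=True, random_state=0))
print(shuffled_a == shuffled_b)  # True
```

Passing cv=KFold(n_splits=5) (or just cv=5) to GridSearchCV therefore keeps the original sample order when building folds.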

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross-validation. The user can pass the number of folds as the cv argument to perform k-fold cross-validation. The default cross-validator is StratifiedKFold; note that the default of 10 refers to the number of Cs values tried, while the default number of folds is 5 in scikit-learn 0.22 and later (3 in earlier versions).
I would recommend reading the documentation for LogisticRegression and LogisticRegressionCV.
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV-functionality for all those classifiers/regressors
but: when the classifier/regressor is known and fixed in advance (at least to some extent), and sometimes even the CV scheme, specialized code bound to that one classifier/regressor can exploit these facts and achieve better performance!
Typically:
CV already embedded in optimization-algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter like alpha)
It seems, at least the latter idea is used in sklearn's LogisticRegressionCV, as seen in this excerpt:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in the scikit-learn documentation, which I believe explains it well:
Some models can fit data for a range of values of some parameter
almost as efficiently as fitting the estimator for a single value of
the parameter. This feature can be leveraged to perform a more
efficient cross-validation used for model selection of this parameter.
The most common parameter amenable to this strategy is the parameter
encoding the strength of the regularizer. In this case we say that we
compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
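To make the comparison concrete, here is a minimal sketch that tunes C both ways on synthetic data. The data, the Cs grid, and max_iter are arbitrary choices for the illustration, and the two approaches are not guaranteed to select the same C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

Cs = np.logspace(-2, 2, 5)          # candidate regularization strengths
cv = StratifiedKFold(n_splits=5)    # same folds for both approaches

# Built-in cross-validation over the Cs grid.
lr_cv = LogisticRegressionCV(Cs=Cs, cv=cv, max_iter=1000).fit(X, y)

# Generic wrapper: one full fit per (C, fold) pair.
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": Cs}, cv=cv).fit(X, y)

best_c_builtin = float(np.ravel(lr_cv.C_)[0])
best_c_grid = float(grid.best_params_["C"])
print(best_c_builtin, best_c_grid)
```

Both objects end up refit on the full data with their selected C; the dedicated class can simply get there with less computation for solvers that support warm starting.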
Some things are left out of the other answers, e.g. about the grid search functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf

How to provide weighted eval set to XGBClassifier.fit()?

From the sklearn-style API of XGBClassifier, we can provide eval examples for early-stopping.
eval_set (list, optional) – A list of (X, y) pairs to use as a
validation set for early-stopping
However, the format only mentions a pair of features and labels. So if the doc is accurate, there is no place to provide weights for these eval examples.
Am I missing anything?
If it's not achievable in the sklearn-style, is it supported in the original (i.e. non-sklearn) XGBClassifier API? A short example will be nice, since I never used that version of the API.
As of a few weeks ago, there is a new parameter for the fit method, sample_weight_eval_set, that allows you to do exactly this. It takes a list of weight variables, i.e. one per evaluation set. I don't think this feature has made it into a stable release yet, but it is available right now if you compile xgboost from source.
https://github.com/dmlc/xgboost/blob/b018ef104f0c24efaedfbc896986ad3ed1b66774/python-package/xgboost/sklearn.py#L235
EDIT - UPDATED per conversation in comments
Given that you have a target-variable representing real-valued gain/loss values which you would like to classify as "gain" or "loss", and you would like to make sure the validation-set of the classifier weighs the large-absolute-value gains/losses heaviest, here are two possible approaches:
Create a custom classifier which is just XGBRegressor followed by a threshold, where the real-valued regression predictions are converted to 1/0 or "gain"/"loss" classifications. The .fit() method of this classifier would just call .fit() of the regressor, while the .predict() method of this classifier would call .predict() of the regressor and then return the thresholded category predictions.
You mentioned you would like to try weighting the treatment of the records in your validation set, but there is no option for this in xgboost. The way to implement this would be to implement a custom eval-metric. However, you pointed out that eval_metric must be able to return a score for a single label/pred record at a time, so it couldn't accept all your row-values and perform the weighting in the eval metric. The solution to this you mentioned in your comment was "create a callable which has a ref to all validation examples, pass the indices (instead of labels and scores) into eval_set, use the indices to fetch labels and scores from within the callable and return metric for each validation examples." This should also work.
I would tend to prefer option 1 as more straightforward, but trying two different approaches and comparing results is generally a good idea if you have the time, so I'd be interested in how these turn out for you.
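A minimal sketch of option 1 could look like the following. To keep it runnable without xgboost, it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor; the class name ThresholdedRegressorClassifier, the default threshold of 0.0, and the synthetic gain/loss data are all hypothetical choices for the illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.ensemble import GradientBoostingRegressor


class ThresholdedRegressorClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: regress on real-valued gain/loss, then threshold."""

    def __init__(self, regressor=None, threshold=0.0):
        self.regressor = regressor
        self.threshold = threshold

    def fit(self, X, y_real, sample_weight=None):
        # Stand-in regressor (assumption); an XGBRegressor could be passed instead.
        base = self.regressor
        if base is None:
            base = GradientBoostingRegressor(random_state=0)
        self.regressor_ = clone(base)
        self.regressor_.fit(X, y_real, sample_weight=sample_weight)
        self.classes_ = np.array([0, 1])  # 0 = "loss", 1 = "gain"
        return self

    def predict(self, X):
        # Convert real-valued predictions to 1/0 labels.
        return (self.regressor_.predict(X) > self.threshold).astype(int)


# Synthetic gain/loss values (assumption): driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y_real = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

clf = ThresholdedRegressorClassifier().fit(X, y_real)
pred = clf.predict(X)
train_agreement = float((pred == (y_real > 0).astype(int)).mean())
print(train_agreement)
```

Because the regressor is trained on squared error over the real-valued target, large-magnitude gains and losses naturally influence the fit more than small ones, which is the effect the weighting was meant to achieve.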

Confusing example of nested cross validation in scikit-learn

I'm looking at this example from scikit-learn documentation: http://scikit-learn.org/0.18/auto_examples/model_selection/plot_nested_cross_validation_iris.html
It seems to me that crossvalidation is not performed in an unbiased way here. Both GridSearchCV (supposedly the inner CV loop) and cross_val_score (supposedly the outer CV loop) are using the same data and the same folds. Therefore there is an overlap between the data the classifier was trained on and evaluated with. What am I getting wrong?
@Gael - As I cannot add a comment, I am posting this in the answer section. I am not sure what Gael means by "the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that the whole point of the GridSearchCV object)". Are you trying to imply that the cross_val_score function passes the (k-1)-fold data (used for training in the outer loop) to the clf object? That does not appear to be the case, as I can comment out the cross_val_score function and just set nested_score[i] to a dummy variable, and still obtain the exact same clf.best_score_. This implies that the GridSearchCV is evaluated separately and does use all available data, and not a subset of the training data.
In nested CV, to the best of my understanding, the idea is that the inner loop will do the hyper-parameter search on a smaller subset of the training data, and then the outer loop will use these parameters to do a cross-validation. One of the reasons for using smaller training data in the inner loop is to avoid information leakage. It doesn't appear that's what is happening here. The inner loop is first using all the data to search for hyper-parameters, which are then used for cross-validation in the outer loop. Thus, the inner loop has already seen all data and any testing done in the outer loop will suffer from information leakage. If I am mistaken, could you please point me to the section of code which you are referring to in your answer?
Totally agree, that nested-CV procedure is wrong: cross_val_score is taking the best hyperparameters computed by GridSearchCV and computing a CV score using those hyperparameters. In nested CV, you need the outer loop for assessing model performance and the inner loop for model selection, so that the portion of data used in the inner loop for model selection is not the same as the portion used for assessing model performance. An example would be a LOOCV outer loop for assessing performance (or 5-fold, 10-fold, or whatever you like) and a 10-fold CV for model selection with grid search in the inner loop. That means that, if you have N observations, you perform model selection in the inner loop (using grid search and 10-fold CV, for example) on N-1 observations, and you assess the model performance on the left-out observation (or on the hold-out data sample if you choose another approach).
(Note that you are internally estimating N best models, in the sense of hyperparameters.)
It would be helpful to have access to a link to the code of cross_val_score and GridSearchCV.
Some references for nested CV are:
Christophe Ambroise and Geoffrey J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences 99, 10 (2002), 6562-6566.
Gavin C. Cawley and Nicola L. C. Talbot. On overfitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, Jul (2010), 2079-2107.
Note:
I did not find anything in the documentation of cross_val_score indicating that internally the hyperparameters are optimized using parameter search, grid search + cross-validation for example, on the k-1 folds of data, and using those optimized parameters on the hold-out data sample (what I am saying is different to the code in http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
They are not using the same data. Granted, the code of the example does not make it apparent, because the splits are not visible: the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that's the whole point of the GridSearchCV object). Using functions and objects rather than hand-written for loops may make things less transparent, but it:
Enables reuse
Adds many "little things" that would render the for loop tedious, such as parallel computing, support for different scoring functions, etc.
Is actually safer in terms of avoiding data leakage, because our splitting code has been audited many, many times.
If you are not convinced, take a look at the code of cross_val_score and GridSearchCV.
The example was improved recently to specify this in the comments:
http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html
(pull request on https://github.com/scikit-learn/scikit-learn/pull/7949 )
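For readers who want to see the two splits explicitly, the sketch below writes the outer loop by hand and compares it with cross_val_score. In each outer iteration a fresh copy of the GridSearchCV object is fit on the outer training fold only, so the inner search never sees the outer test fold. The parameter grid and fold counts are arbitrary choices for the illustration, not those of the linked example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [1, 10, 100], "gamma": [0.01, 0.1]},
    cv=inner_cv,
)

# Nested CV the compact way: cross_val_score clones and refits `search`
# on the training part of every outer split.
auto_scores = cross_val_score(search, X, y, cv=outer_cv)

# The same procedure with the outer loop written out by hand.
manual_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    inner_search = clone(search)
    inner_search.fit(X[train_idx], y[train_idx])   # inner split: training fold only
    manual_scores.append(inner_search.score(X[test_idx], y[test_idx]))

print(np.allclose(auto_scores, manual_scores))
```

Calling search.fit(X, y) on its own, outside cross_val_score, is a separate non-nested search on all the data, which is why its best_score_ does not change when the cross_val_score line is commented out.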

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training on it using sklearn's support vector regressor. I get 100% accuracy on the training set, but the results on the test set are not good. I think it may be because of overfitting. Could you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be instantiated with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)  # R^2 on the test set
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
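from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# SVR (not SVC) because the target is continuous
parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model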
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your model with scikit-learn's K-Fold cross-validation on the training data. This gives you a fairer split of the data and usually a better model, though at a cost in computation time, which should not matter much for a small dataset where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. Decreasing C corresponds to regularizing the estimate more strongly.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins's answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is better since it performs cross-validation within the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data.
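Putting the hints together, a minimal sketch might look like this: scaling and SVR are combined in one pipeline, hyperparameters are tuned by cross-validation on the training set only, and the test set is used once at the end. The synthetic data (1000 points, 2 inputs, 1 output) and the parameter grid are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the 1000-point, 2-input data set (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)          # tuning uses the training set only

test_r2 = search.score(X_test, y_test)  # test set touched once, at the end
print(search.best_params_)
print("CV R^2:", search.best_score_)
print("test R^2:", test_r2)
```

A large gap between the cross-validated score and the test score would still point to overfitting; similar values suggest the chosen C and kernel generalize reasonably.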

Resources