Confusing example of nested cross-validation in scikit-learn

I'm looking at this example from scikit-learn documentation: http://scikit-learn.org/0.18/auto_examples/model_selection/plot_nested_cross_validation_iris.html
It seems to me that cross-validation is not performed in an unbiased way here. Both GridSearchCV (supposedly the inner CV loop) and cross_val_score (supposedly the outer CV loop) are using the same data and the same folds. Therefore there is an overlap between the data the classifier was trained on and evaluated with. What am I getting wrong?

@Gael - As I cannot add a comment, I am posting this in the answer section. I am not sure what Gael means by "the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that's the whole point of the GridSearchCV object)". Are you trying to imply that the cross_val_score function passes the (k-1)-fold data (used for training in the outer loop) to the clf object? That does not appear to be the case, as I can comment out the cross_val_score function and just set nested_score[i] to a dummy variable, and still obtain the exact same clf.best_score_. This implies that GridSearchCV is evaluated separately and uses all available data, not just a subset of the training data.
In nested CV, to the best of my understanding, the idea is that the inner loop does the hyper-parameter search on a smaller subset of the training data, and the outer loop then uses these parameters to do a cross-validation. One of the reasons for using a smaller training set in the inner loop is to avoid information leakage. That doesn't appear to be what is happening here. The inner loop first uses all the data to search for hyper-parameters, which are then used for cross-validation in the outer loop. Thus, the inner loop has already seen all the data, and any testing done in the outer loop will suffer from information leakage. If I am mistaken, could you please point me to the section of code you are referring to in your answer?

Totally agree, that nested-CV procedure is wrong: cross_val_score takes the best hyperparameters computed by GridSearchCV and computes a CV score using those hyperparameters. In nested CV, you need the outer loop for assessing model performance and the inner loop for model selection, such that the portion of data used for model selection in the inner loop is never the same data used for assessing model performance. An example would be a LOOCV outer loop for assessing performance (or a 5-fold, 10-fold, or whatever you like) and a 10-fold CV with grid search in the inner loop for model selection. That means that if you have N observations, you perform model selection in the inner loop (using grid search and 10-fold CV, for example) on the N-1 observations, and you assess the model performance on the left-out observation (or on the hold-out sample if you choose another approach).
(Note that you are internally estimating N best models in the sense of hyperparameters.)
It would be helpful to have links to the code of cross_val_score and GridSearchCV.
Some references for nested CV are:
Christophe Ambroise and Geoffrey J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences 99(10) (2002), 6562-6566.
Gavin C. Cawley and Nicola L. C. Talbot. On overfitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11 (2010), 2079-2107.
Note:
I did not find anything in the documentation of cross_val_score indicating that, internally, the hyperparameters are optimized with a parameter search (grid search + cross-validation, for example) on the k-1 training folds and that those optimized parameters are then used on the hold-out sample (what I am describing is different from the code in http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
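For illustration, a hedged sketch of the nested procedure described above, with the two loops written out explicitly (the estimator, the parameter grid, and the fold counts are placeholders, not anything prescribed by the example):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

outer_scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Inner loop: model selection on the N-1 training observations only.
    search = GridSearchCV(SVC(), param_grid, cv=10)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: assess the selected model on the single held-out observation.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))  # averaged outer-loop score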

They are not using the same data. Granted, the code of the example does not make it apparent, because the splits are not visible: the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that's the whole point of the GridSearchCV object). Using functions and objects rather than hand-written for loops may make things less transparent, but it:
Enables reuse
Adds many "little things" that would render the for loop tedious, such as parallel computing, support for different scoring functions, etc.
Is actually safer in terms of avoiding data leakage, because our splitting code has been audited many, many times.
If you are not convinced, take a look at the code of cross_val_score and GridSearchCV.
The example was improved recently to specify this in the comments:
http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html
(pull request on https://github.com/scikit-learn/scikit-learn/pull/7949 )
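For reference, a condensed version of what the linked example does (the SVC and the grid below are only illustrative; the point is that cross_val_score clones the GridSearchCV object and refits it on each outer training fold, so the inner search never sees the corresponding outer test fold):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Inner split: hyper-parameter search, done inside GridSearchCV.
clf = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": [0.01, 0.1]}, cv=inner_cv)

# Outer split: done inside cross_val_score; clf is cloned and refit on each
# outer training fold, then scored on the corresponding outer test fold.
nested_score = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_score.mean())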

Related

Aggregate training results to predict

When training the model, the results depend on the sampling. In order to obtain something better, you could repeat the training (on other randomly created training samples, using KFold, StratifiedKFold, ...), somehow aggregate the results, and this way get a result that is more robust than one created from a single split alone. Question: is this already implemented in sklearn or something similar? Apologies if this is a straightforward question; I haven't seen a simple solution.
I see that there is a function called cross_val_predict; however, my first impression from a quick look at the source code is that it predicts as many times as it trains, whereas I would like to predict only once, so I can pickle the models, somehow aggregate the results, and predict later instead of repeating the whole training again.
So far I think the best option is the ensemble methods in sklearn.
I leave here the solution I was using before. I am pretty sure it could be improved (as mentioned before, the ensemble methods in sklearn are better). I have placed it at https://github.com/rafaelvalero/aggreating_predictions_sklearn, where I have left a notebook with an example (using the iris dataset), in case anyone wants to play around and see in detail how it could be done.
That solution trains the models (in parallel, using joblib), pickles each trained model (a model from sklearn), stores the results (using joblib dump), and later recovers them to create predictions (in parallel, using joblib) that are then aggregated.
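A minimal sketch of that workflow, assuming a classification task (the estimator, the file names, and the probability-averaging rule are arbitrary choices for illustration, not the exact code in the repository):

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Train one model per stratified training fold and persist each one.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, _) in enumerate(skf.split(X, y)):
    model = RandomForestClassifier(random_state=i).fit(X[train_idx], y[train_idx])
    dump(model, f"model_{i}.joblib")

# Later: load the stored models and aggregate their predicted probabilities.
new_X = X[:10]
probas = [load(f"model_{i}.joblib").predict_proba(new_X) for i in range(5)]
y_pred = np.mean(probas, axis=0).argmax(axis=1)
print(y_pred)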

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross-validation. The user can pass the number of folds as the cv argument to perform k-fold cross-validation (the default is 10 folds with StratifiedKFold).
I would recommend reading the documentation for LogisticRegression and LogisticRegressionCV.
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV-functionality for all those classifiers/regressors
but: when the classifier/regressor is known/fixed a priori (to some extent), or sometimes even the CV model, one can gain advantages by exploiting these facts in specialized code bound to that one classifier/regressor, resulting in improved performance!
Typically:
CV already embedded in optimization-algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter like alpha)
It seems that at least the latter idea is used in sklearn's LogisticRegressionCV, as seen in this excerpt:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in the scikit-learn documentation, which I believe explains it well:
Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for a single value of the parameter. This feature can be leveraged to perform a more efficient cross-validation used for model selection of this parameter. The most common parameter amenable to this strategy is the parameter encoding the strength of the regularizer. In this case we say that we compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
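A minimal usage sketch (the number of Cs, the fold count, and max_iter below are arbitrary):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV

X, y = load_iris(return_X_y=True)

# Search 10 values of C on a log scale with 5-fold (stratified) cross-validation,
# warm-starting along the regularization path.
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X, y)
print(clf.C_)          # chosen C value(s), one entry per class
print(clf.score(X, y))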
There are some things left out of the other answers, e.g. about the grid-search functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf
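Roughly, the glossary's equivalence looks like this in code (the grid and fold count are arbitrary; the dedicated CV estimator is usually faster because of the warm-starting described above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
Cs = np.logspace(-4, 4, 10)

# Canonical estimator wrapped in a generic grid search ...
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": Cs}, cv=5).fit(X, y)
print(grid.best_params_)

# ... versus the dedicated cross-validation estimator.
cv_est = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000).fit(X, y)
print(cv_est.C_)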

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would be to reshape the dataset so that it is 2-dimensional; however, I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data, throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work them into a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
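A minimal sketch of using the raw, flattened data (all shapes and the random data below are invented for illustration):

import numpy as np
from sklearn.svm import SVC

n_samples, n_timepoints, d = 100, 50, 2
X = np.random.randn(n_samples, n_timepoints, d)   # multi-dimensional time series
y = np.random.randint(0, 2, n_samples)

# Flatten each series into a single vector of n_timepoints * d features.
X_flat = X.reshape(n_samples, -1)
clf = SVC().fit(X_flat, y)
print(clf.score(X_flat, y))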
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
If you have lots and lots and lots and lots of data (say, an internet full of good examples), deep learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split 80% for training and 20% for testing. I am training it using sklearn's support vector regressor. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Could you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism that guarantees your split is fair and random. So you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
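A hedged sketch of that idea, reusing the train/test variables from the snippets above and putting the feature generation in a Pipeline so it is fit only on the training data (the degree and the choice of SVR settings are arbitrary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

# Generate degree-2 polynomial/interaction features, scale them, then fit SVR.
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), SVR())
model.fit(train_features, train_target)
print(model.score(test_features, test_target))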
Try fitting your data using an sklearn K-Fold cross-validation split of the training data. This gives you a fairer split of the data and a better model, though at a cost in compute time, which shouldn't really matter for a small dataset where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it; a lower C corresponds to regularizing the estimation more.
If it's taking too long, you can also try RandomizedSearchCV
As a side note on @shahins' answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is better since it performs cross-validation within the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
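Those last two hints can be combined: put the scaler inside a Pipeline and tune the hyperparameters with GridSearchCV on the training data only (a sketch reusing the variables from the snippets above; parameter names follow the step__parameter convention):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(train_features, train_target)          # tuning uses the training set only
print(search.best_params_)
print(search.score(test_features, test_target))   # test set touched once, at the end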

Meaning of GridSearchCV with RFECV in sklearn

Based on Recursive feature elimination and grid search using scikit-learn, I know that RFECV can be combined with GridSearchCV to obtain a better parameter setting for a model like a linear SVM.
As said in the answer, there are two ways:
"Run GridSearchCV on RFECV, which will result in splitting the data into folds two times (ones inside GridSearchCV and once inside RFECV), but the search over the number of components will be efficient."
"Do GridSearchCV just on RFE, which would result in a single splitting of the data, but in very inefficient scanning of the parameters of the RFE estimator."
To make my question clear, I first have to clarify RFECV:
Split the whole data into n folds.
In every fold, obtain the feature ranking by fitting RFE on the training data only.
Sort the ranking, fit an SVM on the training data, and test it on the testing data for scoring. This is done m times, each time with a decreasing number of features, where m is the number of features (assuming step=1).
A sequence of scores is obtained in the previous step; this sequence is then averaged across the n folds once steps 1-3 have been done n times, resulting in an averaged scoring sequence suggesting the best number of features for RFE.
Take that best number of features as the n_features_to_select argument of an RFE fitted on the whole original data.
.support_ to get the "winners" among features; .grid_scores_ to get the averaged scoring sequence.
Please correct me if I am wrong, thank you.
So my question is: where does GridSearchCV go? I guess the second way, "do GridSearchCV just on RFE", means doing GridSearchCV at step 5: set the SVM parameter to one of the values in the grid, fit it on the training data split by GridSearchCV to obtain the number of features suggested in step 4, and test it on the rest of the data for the score. Such a process is done k times, and the averaged score indicates the goodness of that grid value, where k is the cv argument of GridSearchCV. However, the selected features might differ across training splits and grid values, which makes this second way unreasonable if it is done the way I guess.
How actually does GridSearchCV be combined with RFECV?
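For the first option ("run GridSearchCV on RFECV"), a minimal sketch could look like this (the estimator and grid are arbitrary; the SVM's C is reached through the nested estimator__ prefix, and the data is split once inside GridSearchCV and again inside RFECV):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner object: RFECV selects the number of features by cross-validation.
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=StratifiedKFold(5))

# Outer object: GridSearchCV tunes C, splitting the data a second time.
search = GridSearchCV(rfecv, {"estimator__C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.support_)   # features kept by the refit RFECV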
