I am working on image classification with a CNN. Can I use the validation data as test data, or should I split the data into three sets (train, validation, test)?
Usually, you use validation data for model selection, i.e. to find the best model and/or hyperparameters. Test data is used to estimate the real-world performance of the best model from the model selection step. You must not let any test data leak into the validation data (or vice versa), as you will risk overfitting.
Basically:
Training phase: train data.
Model selection phase: validation data.
Testing phase: evaluate your best model from the previous step on the test data to get a real-world performance estimate.
All three datasets should be non-overlapping, and ideally you should not know anything about the properties of the test data.
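For example, one common way to carve out the three sets with scikit-learn (a minimal sketch; the synthetic data and the 70/15/15 ratios are just an illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First split off the test set, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=0)

The test split is then only touched once, after model selection on the validation split is finished.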
I'm trying to train an instance segmentation model using Detectron2, but I'm not sure how I could validate on 20% of the available data every certain number of iterations.
I have multiple registered COCO datasets using register_coco_instances() but I would like to split them into training and validation datasets.
For example, I have (dataset_1, dataset_2, dataset_3,) that I want to split 80/20 to become training and validation as such:
cfg.DATASETS.TRAIN = (dataset_1_train, dataset_2_train, dataset_3_train,)
cfg.DATASETS.TEST = (dataset_1_val, dataset_2_val, dataset_3_val,)
And I would like it to validate on all of the validation datasets every certain number of iterations. I would probably need a custom trainer for this, but I couldn't find a way to split the datasets, nor to validate on all of the available validation datasets.
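One possible sketch (not tested against your setup; it assumes Detectron2's DatasetCatalog, MetadataCatalog, DefaultTrainer and COCOEvaluator APIs, and names like split_dataset and TrainerWithEval are made up for illustration) is to re-register each dataset's 80/20 split under new names, point cfg.DATASETS.TRAIN and cfg.DATASETS.TEST at them, and give the trainer a build_evaluator so that cfg.TEST.EVAL_PERIOD triggers evaluation on every dataset in cfg.DATASETS.TEST:

import random

from detectron2.config import get_cfg
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator

def split_dataset(name, val_ratio=0.2, seed=42):
    """Re-register an already registered dataset as <name>_train and <name>_val."""
    dicts = list(DatasetCatalog.get(name))
    random.Random(seed).shuffle(dicts)
    n_val = int(len(dicts) * val_ratio)
    splits = {f"{name}_train": dicts[n_val:], f"{name}_val": dicts[:n_val]}
    for split_name, split_dicts in splits.items():
        DatasetCatalog.register(split_name, lambda d=split_dicts: d)
        MetadataCatalog.get(split_name).set(
            thing_classes=MetadataCatalog.get(name).thing_classes)
    return f"{name}_train", f"{name}_val"

cfg = get_cfg()  # plus your usual merge_from_file / MODEL / SOLVER settings

train_names, val_names = [], []
for name in ("dataset_1", "dataset_2", "dataset_3"):  # your registered names
    train_name, val_name = split_dataset(name)
    train_names.append(train_name)
    val_names.append(val_name)

cfg.DATASETS.TRAIN = tuple(train_names)
cfg.DATASETS.TEST = tuple(val_names)
cfg.TEST.EVAL_PERIOD = 1000  # run evaluation every 1000 iterations

class TrainerWithEval(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        # DefaultTrainer's EvalHook evaluates every dataset in cfg.DATASETS.TEST
        return COCOEvaluator(dataset_name, output_dir=output_folder or cfg.OUTPUT_DIR)

trainer = TrainerWithEval(cfg)
trainer.resume_or_load(resume=False)
trainer.train()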
I used the 'RandomizedSearchCV' function to estimate my optimal parameters for a random forest model. Can I use the 'best_estimator_' attribute to predict on my test data?
The question I have is: while doing the randomized search, part of the data would have been used for validation. So the best-estimator RF model wouldn't have been trained on the entire dataset, right? Or is it all taken care of under the hood?
As written in the documentation, cv.best_estimator_ returns the estimator that was chosen by the search, i.e. the estimator which gave the highest score.
If the parameter refit is set to True (the default), the model is refit using the best parameters on the whole dataset, including the parts that were used for validation during the search. Therefore you can simply use cv.best_estimator_ to predict on your test data.
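A minimal sketch of how refit=True lets you go straight from the search to the test data (synthetic data; the parameter distributions are only illustrative):

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 20),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    refit=True,  # default: refit the best parameters on all of X_train
    random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ has already been refit on the full training set,
# so it can be used directly on the held-out test data.
print(search.best_params_)
print(search.best_estimator_.score(X_test, y_test))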
Suppose a set of features is extracted from a training dataset and used to train an SVM.
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross-validation, e.g. the training dataset is divided into 5 folds, with one fold held out as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
So should I then have another set (a test set) and report the results on it (as in a paper publication)? My understanding is that since the validation set was used to choose the parameters, a test set is required.
In machine learning, the Test set is something not seen until we have decided on the classifier (e.g. in competitions, the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross-validation phase, you still need to tune your parameters further, hence the need for a validation set to control the quality of each candidate model.
Once you have a model that you believe cannot be improved significantly on the validation set without risking over-fitting, you use that model on the test set to report results.
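For instance, with scikit-learn the whole workflow might look like this (a sketch on a built-in dataset; the parameter grid is only illustrative). The grid search runs the 5-fold cross-validation on the training data to pick C and gamma, and the held-out test set is touched exactly once at the end to report performance:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that plays no role in choosing C and gamma.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]}

# 5-fold cross-validation on the training data selects the parameters.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# Only now is the test set used, once, to report the final result.
print(search.best_params_)
print(search.score(X_test, y_test))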
EDIT:
Since you are specifically asking about k-fold cross-validation: the technique implicitly sets aside one subsample for testing the resulting model, hence there is no need for an extra test step.
From the wikipedia article:
"Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data"
I want to train a regression model, and to do so I use random forest models. However, I also need to do feature selection, because I have so many features in my dataset and I'm afraid that if I used all the features I would overfit. To assess the performance of my model I also perform 5-fold cross-validation, and my question is which of the following two approaches is right, and why.
1- Should I split the data into two halves, do feature selection on the first half, and use these selected features to do 5-fold cross-validation (CV) on the remaining half (in this case the 5 CV folds will use exactly the same selected features)?
2- Or do the following procedure:
1- Split the data into 4/5 for training and 1/5 for testing.
2- Split this training data (the 4/5 of the full data) into two halves:
a-) On the first half, train the model and use the trained model to do feature selection.
b-) Use the selected features from the first part to train the model on the second half of the training dataset (this will be our final trained model).
3- Test the performance of the model on the remaining 1/5 of the data (which is never used in the training phase).
4- Repeat the previous steps 5 times, each time randomly (without replacement) splitting the data into 4/5 for training and 1/5 for testing.
My only concern is that in the second procedure we will have 5 models, and the feature set of the final model will be the union of the top features of these five models. So I'm not sure whether the 5-fold CV performance is reflective of the final performance of the final model, especially since the final model has different features than each model in the 5-fold CV (because it uses the union of the features selected in each fold).
Cross-validation should always be the outermost loop in any machine learning pipeline.
So, split the data into 5 sets. For every set you choose as your test set (1/5), fit the model after doing feature selection on the training set (4/5). Repeat this for all the CV folds - here you have 5 folds.
Once the CV procedure is complete, you have an estimate of your model's accuracy, which is simply the average of the individual CV folds' accuracies.
As far as the final set of features for training the model on the complete dataset is concerned, do the following to select the final set of features.
-- Each time you do CV on a fold as outlined above, vote for the features that were selected in that particular fold. At the end of the 5-fold CV, keep the particular number of features that have the most votes.
Use the above selected set of features for one final round of feature selection, then train the model on the complete data (all 5 folds combined) and move the model to production.
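A sketch of what this looks like with scikit-learn (the synthetic data and the 20-feature cut-off are only illustrative): because the selector is a step inside the pipeline, it is re-fit on each training fold, so the held-out fold never influences which features are kept.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=500, n_features=100, n_informative=10, random_state=0)

model = Pipeline([
    # Keep the 20 most important features according to a small random forest.
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=0),
        threshold=-np.inf, max_features=20)),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
])

# Feature selection happens inside each fold, so the CV estimate is honest.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())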
Do the CV on the full data (split it into 5 parts, and use a different combination of parts for every split), do your feature selection on the CV splits, and then fit your RF on the output of the selection.
Why: because CV checks your model under different data splits, so your model doesn't overfit. Since the feature selection can be viewed as part of your model, you have to check it for overfitting too.
After you have validated your model with CV, fit the whole data into it and use the transform of this single model.
Also, if you're worried about overfitting, you should limit the RF in both depth and number of trees. CV is mostly used as a tool in the development process of a model; for the final model, all of the data is used.
Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data to obtain 10000 rows, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross-validation as well.
The scoring function I have uses the original data to compute the score, so I would like the cross-validation scoring to work on the original data as well, rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition going on right now on Kaggle.
Maybe you can use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) on the expanded samples, use the original sample index as the stratification info, and combine that with a custom score function that ignores the non-original samples in the model evaluation.
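A rough sketch of the custom-scorer part (the toy augmented data, the is_original flag column, and the column names are all assumptions; for simplicity the folds here are stratified on the label rather than on the original sample index). The model is trained on every row of each training fold, but each fold is scored only on the rows that come from the original data:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Toy stand-in for the augmented data: 100 original rows, 10 variants each.
n_orig, n_copies, n_feat = 100, 10, 4
X_orig = rng.normal(size=(n_orig, n_feat))
y_orig = rng.integers(0, 2, size=n_orig)

rows, labels, flags = [], [], []
for i in range(n_orig):
    for j in range(n_copies):
        noise = 0.0 if j == 0 else rng.normal(scale=0.1, size=n_feat)
        rows.append(X_orig[i] + noise)
        labels.append(y_orig[i])
        flags.append(j == 0)  # only the first copy is the original row

feature_cols = [f"f{k}" for k in range(n_feat)]
X = pd.DataFrame(rows, columns=feature_cols)
X["is_original"] = flags
y = np.array(labels)

# The classifier only ever sees the feature columns; the flag is dropped here.
model = Pipeline([
    ("features", ColumnTransformer([("keep", "passthrough", feature_cols)])),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

def original_only_score(estimator, X_test, y_test):
    """Score a CV fold using only the rows flagged as original samples."""
    mask = X_test["is_original"].to_numpy()
    return accuracy_score(y_test[mask], estimator.predict(X_test[mask]))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring=original_only_score)
print(scores.mean())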