Suppose a set of features is extracted from a training dataset and used to train an SVM.
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross-validation: the training dataset is divided into 5 folds, one of which is held out as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
Should I then have another set (a test set) and report results on it (as in a paper publication)? My understanding is that since the validation set was used to choose the parameters, a test set is required.
In machine learning, the test set is something not seen until we have decided on the classifier (e.g. in competitions the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross-validation phase you may want to tune your parameters further, hence the need for a validation set to control the quality of each model.
Once you have a model that you believe cannot be improved significantly on the validation set without risk of overfitting, you use that model on the test set to report results.
EDIT:
Since you are specifically asking about k-fold cross-validation: the technique already sets aside a subsample for testing the resulting model in each round, hence there is no need for an extra test step.
From the Wikipedia article:
"Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data"
I am confused about walk-forward validation. I have found several pages that use different or unclear wording for "validation" and "test".
Basically, when I apply it to time series forecasting, I divide the data into training, validation and test sets. The validation set is then a part of the training set; I understood that. Obviously, for time series, time has to be taken into account, so we have to resort to the walk-forward method.
My understanding is that the walk-forward method is applied only within the training set, so that, for example, hyperparameters can be optimized there. The test set plays no role in this and is only used at the end to evaluate the model. Is this correct?
Or does the walk-forward method also split the test set?
I see many examples that do not use the walk-forward method at all.
If the walk-forward method is used, the held-out part in each split is the validation set, right?
Another question: if I don't split the training and validation sets myself before training, and instead choose the following settings in model.fit:
model.fit(X_train, y_train,
          # validation_data=(X_train, y_train) would override validation_split, so it is left out
          validation_split=0.1,   # Keras holds out the last 10% of the samples for validation
          shuffle=False)          # no shuffling, so the held-out block is the most recent data
is this similar to the walk-forward method?
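For reference, here is a minimal sketch of walk-forward splits applied to the training portion only, using scikit-learn's TimeSeriesSplit (the arrays are stand-ins for chronological training data); hyperparameters would be tuned on these splits while the test set stays untouched.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_train = np.arange(100).reshape(-1, 1)   # stand-in for the chronological training features
y_train = np.arange(100)                  # stand-in for the targets

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_train):
    # each validation block lies strictly after its training block in time
    print(f"train up to {train_idx[-1]}, validate on {val_idx[0]}..{val_idx[-1]}")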
I used the RandomizedSearchCV function to estimate the optimal parameters for a random forest model. Can I use the best_estimator_ attribute to predict on my test data?
The question I have is: while doing the randomized search, part of the data would have been used for validation, so the best random forest estimator wouldn't have been trained on the entire dataset, right? Or is that all taken care of under the hood?
As written in the documentation, cv.best_estimator_ returns the estimator that was chosen by the search, i.e. the estimator which gave the highest score.
If the parameter refit is set to True (the default), the model is refit using the best parameters on the whole training dataset, including the portions that served as validation folds during the search. Therefore you can simply use cv.best_estimator_ to predict on your test data.
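A minimal sketch of that behaviour, with synthetic data and an illustrative parameter distribution:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_dist = {"n_estimators": randint(100, 400), "max_depth": randint(2, 20)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                            n_iter=20, cv=5, refit=True, random_state=0)
search.fit(X_train, y_train)        # refit=True: the best parameters are refit on all of X_train

y_pred = search.best_estimator_.predict(X_test)   # equivalent to search.predict(X_test)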
The dataset I'm working with consists of a train and a test set. To fine-tune the deep learning model, 10% of the training set is used as a validation set. After finding the optimal hyperparameter values, two possible options are:
a) evaluate the model (i.e. the model trained on 90% of the training set) with the test set
b) evaluate the model (i.e. the model retrained on the complete training set) with the test set
Which of the above options is valid, and why?
Both options are possible.
In the first case the hyperparameters are optimal for the model you evaluate; in the second case they are usually only close to optimal (since they were tuned on 90% of the data), but the model is trained on a more representative dataset.
What is suggested in general is to do cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html): use several train/validation splits to get a more representative picture, and choose the best hyperparameters based on the average score over the folds.
Otherwise, the risk is that your model is very good for this specific split, as can happen with Kaggle datasets, but not representative of real use cases once in production.
To sum up:
1. If you just want the best model for this particular split, option (a) is probably safest (option (b) can be done too, but you may get worse results).
2. If you are working on a real case study, you should rather do cross-validation to get a more robust set of hyperparameters.
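For illustration, here is a minimal sketch of option (b) with synthetic data (an SVC stands in for the deep learning model for brevity): the hyperparameter is chosen on a 10% validation split, and the final model is then retrained on the complete training set before the single evaluation on the test set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

# pick the hyperparameter on the 10% validation split (only C here, for brevity)
best_C = max([0.1, 1, 10, 100], key=lambda c: SVC(C=c).fit(X_tr, y_tr).score(X_val, y_val))

# option (b): retrain with the chosen value on the complete training set
final_model = SVC(C=best_C).fit(X_train, y_train)
print(final_model.score(X_test, y_test))   # single evaluation on the untouched test set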
What is the difference between including oob_score=True and not including oob_score in RandomForestClassifier in sklearn in Python? The out-of-bag (OOB) error for a sample is the average error calculated using predictions from the trees that do not contain that sample in their respective bootstrap sample, right? So how does setting the parameter oob_score=True affect the calculation of this average error?
For each tree, only a bootstrap sample of the data is selected for building the tree, i.e. for training. The remaining samples are the out-of-bag samples. These out-of-bag samples can be used directly during training to compute a test accuracy. If you activate the option, the oob_score_ attribute (and oob_decision_function_ for a classifier, oob_prediction_ for a regressor) will be computed.
The trained model will not change whether or not you activate the option. Obviously, due to the random nature of random forests, the model will not be exactly the same if you fit it twice, but that has nothing to do with the oob_score option.
Unfortunately, scikit-learn does not let you set the OOB ratio, i.e. the percentage of samples used to build each tree. This is possible in other libraries (e.g. the C++ Shark library: http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/rf.html).
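A minimal sketch with synthetic data, showing that the option only adds the OOB bookkeeping:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)    # accuracy estimated only from each tree's out-of-bag samples

# with the same random_state, oob_score=False would produce exactly the same trees;
# the option does not change how the forest is trained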
I want to train a regression model, and to do so I use random forest models. However, I also need to do feature selection because I have many features in my dataset and I'm afraid that if I use all of them I'll overfit. To assess the performance of my model I also perform a 5-fold cross-validation. Which of the following two approaches is right, and why?
1) Should I split the data into two halves, do feature selection on the first half and use the selected features to do 5-fold cross-validation (CV) on the remaining half (in this case all 5 CV folds use exactly the same selected features)?
2) Or should I do the following procedure:
1. split the data into 4/5 for training and 1/5 for testing;
2. split this training data (the 4/5 of the full data) into two halves:
a) on the first half, train the model and use the trained model to do feature selection;
b) use the selected features from the first part to train the model on the second half of the training data (this will be our final trained model);
3. test the performance of the model on the remaining 1/5 of the data (which is never used in the training phase);
4. repeat the previous steps 5 times, each time randomly (without replacement) splitting the data into 4/5 for training and 1/5 for testing.
My only concern is that in the second procedure we will have 5 models, and the features of the final model will be the union of the top features of these five models. So I'm not sure whether the 5-fold CV performance reflects the performance of the final model, especially since the final model has different features from each model in the 5-fold CV (because it uses the union of the features selected in each fold).
Cross-validation should always be the outermost loop in any machine learning pipeline.
So, split the data into 5 sets. For every set you choose as your test set (1/5), fit the model after doing feature selection on the training set (4/5). Repeat this for all the CV folds; here you have 5 folds.
Once the CV procedure is complete, you have an estimate of your model's accuracy, which is a simple average of the individual CV folds' accuracies.
As far as the final set of features for training the model on the complete data is concerned, do the following to select them:
-- Each time you do feature selection in a fold as outlined above, vote for the features that you selected in that particular fold. At the end of the 5-fold CV, select the number of features that have the top votes.
Use the selected set of features to do one final feature selection procedure, then train the model on the complete data (all 5 folds combined) and move the model to production.
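A minimal sketch (synthetic data) of keeping feature selection inside the cross-validation loop, using a scikit-learn pipeline; because the selector is refit inside every training fold, the corresponding test fold never leaks into the selection.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=300, n_features=50, n_informative=10, random_state=0)

model = make_pipeline(
    SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0)),  # selection per fold
    RandomForestRegressor(n_estimators=200, random_state=0),                   # regressor per fold
)
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())   # CV estimate of the whole pipeline, feature selection included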
Do the CV on the full data (split it into 5 parts, and use a different combination of parts for every split), do your feature selection on the CV training splits, and then run your RF on the output of the selection.
Why: because CV checks your model under different data splits, so your model doesn't overfit. Since the feature selection can be viewed as part of your model, you have to check it for overfitting too.
After you have validated your model with CV, fit it on your whole data and use the transform of this single fitted model.
Also, if you are worried about overfitting, you should limit the RF in both depth and number of trees. CV is mostly used as a tool in the development process of a model; for the final model, all of the data is used.
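A minimal sketch (synthetic data, illustrative limits on depth and tree count) of that final step: fit the selector once on all of the data, apply its transform, and train the depth-limited forest on everything.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=300, n_features=50, n_informative=10, random_state=0)

selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0)).fit(X, y)
X_selected = selector.transform(X)                  # the transform of this single fitted selector

final_model = RandomForestRegressor(n_estimators=200, max_depth=8, random_state=0)
final_model.fit(X_selected, y)                      # final model trained on all available data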