feature selection and cross validation - statistics

I want to train a regression model, and to do so I use random forest models. However, I also need to do feature selection because my dataset has many features and I'm afraid that if I used all of them I would overfit. To assess the performance of my model I also perform 5-fold cross validation. My question: which of the following two approaches is right, and why?
1- Should I split the data into two halves, do feature selection on the first half, and use the selected features to do 5-fold cross validation (CV) on the remaining half (in this case every CV fold uses exactly the same selected features)?
2- do the following procedure:
1- split the data into 4/5 for training and 1/5 for testing
2- split this training data (the 4/5 of the full data) into two halves:
a) on the first half, train the model and use the trained model to do feature selection.
b) use the selected features from the first part to train the model on the second half of the training dataset (this will be our final trained model).
3- test the performance of the model on the remaining 1/5 of the data (which is never used in the training phase)
4- repeat the previous steps 5 times, each time randomly (without replacement) splitting the data into 4/5 for training and 1/5 for testing
My only concern is that in the second procedure we end up with 5 models, and the features of the final model will be the union of the top features of these five models. So I'm not sure whether the performance of the 5-fold CV can be reflective of the final model's performance, especially since the final model has different features than each model in the 5-fold CV (because it uses the union of the features selected in each fold).

Cross validation should always be the outermost loop in any machine learning pipeline.
So, split the data into 5 sets. For every set you choose as your test set (1/5), fit the model after doing feature selection on the training set (the remaining 4/5). Repeat this for all the CV folds; here you have 5 folds.
Once the CV procedure is complete, you have an estimate of your model's accuracy, which is simply the average of the individual folds' accuracies.
As far as the final set of features for training the model on the complete data is concerned, do the following:
-- Each time you do the CV on a fold as outlined above, vote for the features that you selected in that particular fold. At the end of the 5-fold CV, select the features that received the most votes.
Use that selected set of features for one final round of feature selection, then train the model on the complete data (all 5 folds combined) and move the model to production.
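To make this concrete, here is a minimal sketch, assuming scikit-learn, a synthetic regression dataset, and importance-based selection with a random forest (none of these specifics come from the question), where the feature selection is re-fit inside every CV fold:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in data: 200 samples, 100 features, only 10 of which are informative.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10, random_state=0)

pipe = Pipeline([
    # The selector is a pipeline step, so it is re-fit on each training fold only.
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=200, random_state=0),
        threshold="median")),                       # keep features above median importance
    ("model", RandomForestRegressor(n_estimators=500, random_state=0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(scores.mean())  # CV estimate of the whole procedure, selection included
```

Because the selection step sits inside the pipeline, the CV score reflects the whole procedure (selection plus model) rather than a model that already saw the test fold during selection.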

Do the CV on the full data (split it into 5 parts and use a different combination of parts for every split), then do your feature selection on the CV splits and your RF on the output of the selection.
Why: because CV checks your model under different data splits so that your model doesn't overfit. Since the feature selection can be viewed as part of your model, you have to check it for overfitting too.
After you have validated your model with CV, fit it on the whole data and use that single fitted model.
Also, if you're worried about overfitting, you should limit the RF in both depth and number of trees. CV is mostly used as a tool in the development process of a model; for the final model, all of the data is used.
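A minimal sketch of that last point, again assuming scikit-learn; the depth and tree counts are illustrative, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=200, n_features=100, n_informative=10, random_state=0)

# Constrain the forest (max_depth, n_estimators) to reduce the risk of overfitting.
final_model = Pipeline([
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0),
        threshold="median")),
    ("rf", RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)),
])

# Once the CV estimate of this pipeline looks acceptable, the final model is
# fit on all of the data and that single fitted model goes to production.
final_model.fit(X, y)
```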

Related

Evaluate Model Node in Azure ML Studio does not take all the rows of the dataset in confusion matrix

I have this dataset in which the positive class consists of component failures for a specific component of the APS system.
I am doing Predictive Maintenance using Microsoft Azure Machine Learning Studio.
As you can see from the pictures below, I am using 4 algorithms: Logistic Regression, Random Forest, Decision Tree and SVM. You can see that the output dataset in the Score Model node consists of 16k rows. However, when I look at the output of the Evaluate Model node, the confusion matrix has only 160 observations for Logistic Regression, but the correct number, 16k, for Random Forest. I have the same problem of only 160 observations with the Decision Tree and SVM models. The same issue appears in other experiments, for example after feature selection, normalization etc.: some Evaluate Model nodes do not use all the rows of the test dataset, while others do.
How can I fix this problem? Because I am interested in the real number of false positive and false negatives.
The metrics shown are computed on the validation set (e.g. "validation metric", "val-accuracy"), not on the original training set. They are calculated only over the validation set, because including the training set would inflate the model's performance by scoring on data that was already used to train it.
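The same principle can be illustrated outside Azure ML; here is a minimal scikit-learn sketch (the imbalanced synthetic dataset and the random forest are purely illustrative), where the confusion matrix is computed only on held-out rows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced stand-in dataset: roughly 5% positives, like a failure class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# False positives/negatives are counted on the held-out validation rows only,
# never on the rows used for training.
print(confusion_matrix(y_val, model.predict(X_val)))
```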

Cross validation of Time Series data: VAR model

I have weekly data with 15 predictor variables for a period of 1 year (52 observations).
I plan to compare Prophet forecasting with VAR model.
Is there a way to run cross-validation for these two models, especially the VAR?
Thanks
HP
A good explainer on time series cross-validation from the forecasting principles and practice book here.
Time series cross-validation is done by splitting the training data up to some point in time (typically between 2/3 and 4/5 of the series) and using the remainder as validation. Then at each step you fit a model to the training data, make an out-of-sample prediction, store that prediction, and add the next data point to your training data.
So at the last step, you're fitting your model on all of your training data except for a single data point, since you'll be comparing that single data point to what your model forecasts at that step. You can then compute root mean squared error (or whatever metric you prefer) on the list of predictions versus their actual values. In this way, you are testing how well your model fits this dataset over time.
For Prophet, the docs list an easy way to do it in Python.
For VAR, I don't know of an easy built-in way, other than looping over the training data to make forecasts, appending the next timestamp at each step, and then comparing to the validation data (see the sketch below).
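For the VAR case, here is a minimal expanding-window sketch along those lines, assuming statsmodels and a DataFrame of the 15 weekly series; the initial window size and lag order are illustrative, and with only 52 observations both need to be chosen carefully:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

def rolling_origin_cv(data: pd.DataFrame, initial: int = 40, lags: int = 1) -> pd.Series:
    """Fit on data[:t], forecast one step ahead, compare with data[t], then grow the window."""
    preds, actuals = [], []
    for t in range(initial, len(data)):
        train = data.iloc[:t]
        res = VAR(train).fit(lags)
        # One-step-ahead forecast from the last k_ar observations of the window.
        fc = res.forecast(train.values[-res.k_ar:], steps=1)[0]
        preds.append(fc)
        actuals.append(data.iloc[t].values)
    preds, actuals = np.array(preds), np.array(actuals)
    # RMSE per variable over all one-step-ahead forecasts.
    return pd.Series(np.sqrt(((preds - actuals) ** 2).mean(axis=0)), index=data.columns)
```

The same expanding-window loop produces one-step-ahead errors that can be compared directly against Prophet's own cross-validation output.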

How can I use Chi-square value for text classification using SVM?

I have both positive and negative training documents for a text classification problem. I am planning on calculating chi-square value for every feature in each document. Having that value, how may I proceed to classification using SVM? What would be the threshold value for the classification?
Chi-square values can be used to perform feature selection as a pre-processing step. With them, you can greatly reduce your feature vocabulary (for example, select the most useful 100K terms from a 1M-term vocabulary). This step has two potential benefits: 1. it reduces your model size in the next step; 2. it makes prediction faster. The downside: it may or may not affect classification performance.
To proceed with classification, you still need to use those 100K features to train your model (for example, with an SVM). Once the model is learnt, you can use it for classification.
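As a minimal sketch of that workflow, assuming scikit-learn (the tiny corpus and the value of k are placeholders, not from the question); note that chi-square only ranks features, while the SVM decision function does the classifying, so no chi-square threshold is needed for classification itself:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny placeholder corpus; replace with your positive/negative training documents.
train_texts = ["great product, works well", "terrible, broke after a day",
               "really happy with it", "worst purchase I have made"]
train_labels = [1, 0, 1, 0]

pipe = Pipeline([
    ("counts", CountVectorizer()),       # chi2 requires non-negative term counts
    ("select", SelectKBest(chi2, k=5)),  # keep the k highest-scoring terms
    ("svm", LinearSVC()),                # the SVM does the actual classification
])
pipe.fit(train_texts, train_labels)
print(pipe.predict(["works really well", "it broke after one day"]))
```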

Train multiple models with various measures and accumulate predictions

So I have been playing around with Azure ML lately, and I have one dataset with multiple values I want to predict. Each value uses a different algorithm, and when I try to train multiple models within one experiment it says "train model can only predict one value", and there are not enough input ports on the Train Model module to take in multiple values even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error as mentioned. How do I predict multiple values and later put the predicted columns together for the web service output, so I don't have to have multiple APIs?
What you want to do is train each model separately and save them as already-trained models.
So create a new experiment, train your models, and save them by right-clicking on each model; they will then show up in the left nav bar in the Studio. Now you can drag your saved models into the canvas and have them score predictions, and eventually combine them into the same output, as I have done in my example through the "Add Columns" module. I made this example for Ronaldo (the Real Madrid CF player) on how he will perform in a match after a training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train multiple values, check out Raymond Langaeian's (MSFT) answer in the comment section at this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable that you are going to predict, then add all of those predicted columns together and return them as a single output for the web service.
The algorithms available in Azure ML can only predict a single variable at a time based on the inputs they receive.

Is separate validation and test set needed when training SVM?

Given a set of features extracted from a training dataset which are used to train an SVM:
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross validation, e.g. the training dataset is divided into 5 folds, with one chosen as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
So should I then have another set (a test set) and report (as in a paper publication) the results on it? My understanding is that since the validation set was used to choose the parameters, a test set is required.
In machine learning, the test set is something not seen until we have decided on the classifier (e.g. in competitions, the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross validation phase you may need to tune your parameters further, hence the need for a validation set to control the quality of each model.
Once you have a model that you believe can't be improved significantly on the validation set without risk of over-fitting, you use your model on the test set to report results.
EDIT:
Since you are specifically asking about k-fold cross-validation, the technique already holds out a fold for testing the resulting model in each iteration, so there is no need for an extra test step.
From the Wikipedia article on cross-validation: "Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data."
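If you want a single procedure that both tunes C and gamma and yields a performance estimate that was never used for tuning, nested cross-validation is a common option. A minimal sketch, assuming scikit-learn and an illustrative synthetic dataset and parameter grid (not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: choose C and gamma by 5-fold CV on the training portion of each outer fold.
inner = GridSearchCV(SVC(kernel="rbf"),
                     {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                     cv=5)

# Outer loop: each held-out fold is never used for parameter selection,
# so these scores can be reported as the performance estimate.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```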
