Using the best predictor from Randomized search for test data - scikit-learn

I used the 'RandomizedSearchCV' function to estimate my optimal parameters for a random forest model. Can I use the 'best_estimator_' attribute to predict on my test data?
The question I have is: while doing the randomized search, part of the data would have been used for validation. So the best RF estimator wouldn't have been trained on the entire data set, right? Or is that all taken care of under the hood?

As written in the documentation, cv.best_estimator_ returns the estimator that was chosen by the search, i.e. the estimator which gave the highest score.
If the parameter refit is set to True (the default), the search refits the model with the best parameters on the whole dataset, including the part used for validation. You can therefore simply use cv.best_estimator_ to predict on your test data.
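A minimal sketch of that workflow, with an illustrative dataset, parameter distributions, and forest settings (none of these come from the question):

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 10)},
        n_iter=10,
        cv=5,
        refit=True,   # default: retrain the best configuration on all of X_train
        random_state=0,
    )
    search.fit(X_train, y_train)

    # best_estimator_ is already fitted on the full training set, so predict directly
    y_pred = search.best_estimator_.predict(X_test)
    print(search.best_params_, search.score(X_test, y_test))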

Related

Is it possible to customise tf.estimator for Unsupervised Learning with self-defined evaluation graph not having loss?

I am implementing an item2vec model for product recommendation, based on the idea of word2vec, using the tf.estimator API.
There is no problem implementing the training part with tf.estimator. The process is the same as for word2vec, and I treat each transaction as a sentence. The only difference is how the training input is generated: (target_item, context_item) pairs. After training the pseudo-classification problem, I can use the trained embedding vector of each item to measure the relationships between items.
The problem is that the evaluation part is not a typical supervised-learning evaluation, i.e. one where eval data is fed through the same graph to obtain predictions and an accuracy.
The evaluation input data I would like to use is in a totally different format from the training input data.
Format of the eval input data: (target_item, {context_item1, context_item2, ...}). With this, I could obtain the top_k nearest items for each context_item and then check whether the target_item is in the collection of these nearest items, which would give me a hit ratio (a rough sketch of this metric follows the question).
However, tf.estimator.EstimatorSpec() for mode = ModeKeys.EVAL requires a loss as input. So does this mean evaluation can only reuse part of the training graph? What can I do if I don't have a loss function for evaluation in my case, since the evaluation no longer goes through the classification?
Many thanks.
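For reference, the hit-ratio metric described above can be computed outside the estimator graph; here is a rough sketch with a made-up embedding matrix and made-up (target, context set) pairs:

    import numpy as np

    def hit_ratio(embeddings, eval_pairs, top_k=10):
        # embeddings: (n_items, dim) array; eval_pairs: list of (target_idx, [context_idx, ...])
        # cosine similarity between every pair of items
        norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = norms @ norms.T
        hits = 0
        for target, contexts in eval_pairs:
            neighbours = set()
            for c in contexts:
                # top_k most similar items to this context item, excluding the item itself
                nearest = [i for i in np.argsort(-sims[c]) if i != c][:top_k]
                neighbours.update(nearest)
            hits += int(target in neighbours)
        return hits / len(eval_pairs)

    # toy usage with random embeddings
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(50, 8))
    print(hit_ratio(emb, [(3, [1, 2]), (7, [4, 5, 6])], top_k=5))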

Which model: Best estimator from gridsearchCV or all training data?

I am a little confused when it comes to grid search and fitting the final model. I split the data in two: training and testing. The test set is only used for final evaluation. I perform the grid search using only the training data.
Say one has done a grid search over several hyperparameters using cross-validation. The grid search gives the best combination of the hyperparameters. The next step is to train the model, and this is where I am confused. I see 2 possibilities:
1) Don't train a new model. Use the fitted best model returned by the grid search as-is.
or
2) Don't reuse the fitted model from the grid search. Train a new model on the full training set with the best hyperparameter combination found by the grid search.
What is the correct approach, 1 or 2?
This is probably late, but might be useful for someone else who comes along.
GridSearchCV has an attribute called refit, which is set to True by default. This means that after performing k-fold cross-validation (i.e., training on a subset of the data you passed in), it refits the model using the best hyperparameters from the grid search, on the complete training set.
Presumably your question, from what I can glean, can be summarized as:
Suppose you use 5-fold cross-validation. Your model is then fitted only on 4 folds, as the fifth fold is used for validation. So would you need to retrain the model on the whole of train (i.e., the data from all 5 folds)?
The answer is no, provided you set refit to True, in which case GridSearchCV will perform the training over the whole of the training set using the best hyperparameters it has found after cross-validation. It will then return the trained estimator object, on which you can directly call the predict method, as you would normally do otherwise.
Refer: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
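A short sketch of how refit=True covers this automatically (the estimator and grid below are illustrative, not from the question):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10]},
        cv=5,
        refit=True,   # default: refit the best combination on the whole training set
    )
    grid.fit(X_train, y_train)

    # best_estimator_ was refitted on all of X_train; no manual retraining is needed
    print(grid.best_params_)
    print(grid.best_estimator_.score(X_test, y_test))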
You train the model using the training set and the parameters obtained by the grid search.
Then you can test the model on the test set.

How can I use Chi-square value for text classification using SVM?

I have both positive and negative training documents for a text classification problem. I am planning on calculating the chi-square value for every feature in each document. Given those values, how do I proceed with classification using an SVM? What would be the threshold value for the classification?
The chi-square value can be used to perform feature selection as a pre-processing step. With it, you can greatly reduce your feature vocabulary (for example, select the most useful 100K terms out of a 1M-term vocabulary). This step has two benefits: 1. it reduces the model size in the next step; 2. prediction is faster. The downside is that it may or may not affect classification performance.
To proceed with classification, you still need to use those 100K features to train your model (for example, with an SVM). Once the model is trained, you can use it for classification.
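A rough sketch of that two-step recipe in scikit-learn (the toy corpus, the value of k, and the choice of LinearSVC are all assumptions for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # toy labelled corpus: 1 = positive, 0 = negative
    docs = ["great product, loved it", "terrible, would not buy again",
            "excellent quality", "awful experience, very bad"]
    labels = [1, 0, 1, 0]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("chi2", SelectKBest(chi2, k=5)),  # keep only the k highest-scoring terms
        ("svm", LinearSVC()),              # the SVM is then trained on the reduced features
    ])
    clf.fit(docs, labels)
    print(clf.predict(["really great quality", "very bad product"]))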

Is separate validation and test set needed when training SVM?

Given a set of features extracted from a training dataset, which are used to train an SVM.
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross-validation: the training dataset is divided into 5 folds, with one fold used as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
So should I then have another set (a test set) and report the results on it (as in a paper publication)? My understanding is that, since the validation set was used to choose the parameters, a test set is required.
In machine learning, the test set is something not seen until we have decided on the classifier (e.g. in competitions the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross-validation phase you may need to tune your parameters further, hence the need for a validation set to control the quality of each model.
Once you have a model that you believe cannot be improved significantly over the validation set without risking over-fitting, you use that model on the test set to report results.
EDIT:
Since you are specifically asking about k-fold cross-validation: the technique implicitly holds out a subsample for testing the resulting model, hence there is no need for an extra test step.
From the Wikipedia article on cross-validation:
"Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data"
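A concrete sketch of that protocol, with placeholder data, grid, and fold count: hold out a test set first, choose C and gamma by k-fold cross-validation on the training portion, then report accuracy on the untouched test set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=600, random_state=42)

    # held-out test set, never used during parameter selection
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 5-fold cross-validation on the training set chooses C and gamma
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}, cv=5)
    search.fit(X_train, y_train)

    # these are the figures you would report
    print("selected parameters:", search.best_params_)
    print("test accuracy:", search.score(X_test, y_test))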

advanced feature extraction for cross-validation using sklearn

Given a dataset with 1000 samples, suppose I would like to preprocess the data in order to obtain 10000 rows, so that each original row leads to 10 new samples. In addition, when training my model I would like to be able to perform cross-validation as well.
The scoring function I have uses the original data to compute the score, so I would like the cross-validation scoring to work on the original data as well, rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model for a given set of parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition currently running on Kaggle.
Maybe you can use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) on the expanded samples, using the original sample index as the stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
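A rough sketch of that suggestion, with entirely made-up data and an assumed flag marking which expanded rows are the originals:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(0)
    n_original, n_copies = 1000, 10            # 1000 original rows, each expanded into 10
    X = rng.normal(size=(n_original * n_copies, 5))
    y = rng.integers(0, 2, size=n_original * n_copies)
    original_idx = np.repeat(np.arange(n_original), n_copies)    # original row each sample came from
    is_original = np.tile(np.arange(n_copies) == 0, n_original)  # assume the first copy is the untouched original

    scores = []
    # stratify the folds on the original sample index, as suggested above
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train, test in cv.split(X, original_idx):
        model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[train], y[train])
        keep = is_original[test]               # custom scoring: ignore the generated rows
        scores.append(model.score(X[test][keep], y[test][keep]))
    print(np.mean(scores))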
