Which model: Best estimator from gridsearchCV or all training data? - scikit-learn

I am a little confused when it comes to gridsearch and fitting the final model. I split the in 2: training and testing. The testing set is only used for final evaluation. I perform grid search only using the training data.
Say one has done a grid search over several hyperparameters using cross-validation. The grid search gives the best combination of the hyperparameters. Next step is to train the model, and this is where I am confused. I see 2 possibilities:
1) Don't train the model. Use the parameters from the best model from the grid search.
or
2) Don't use the parameters from the best model from the grid search. Train the model on the full training set with the best hyperparameter combination from the grid search.
What is the correct approach, 1 or 2?

This is probably late, but might be useful for someone else who comes along.
GridSearchCV has an attribute called refit, which is set to True by default. This means that after performing k-fold cross-validation (i.e., training on a subset of the data you passed in), it refits the model using the best hyperparameters from the grid search, on the complete training set.
Presumably your question, from what I can glean, can be summarized as:
Suppose you use 5-fold cross-validation. Your model is then fitted only on 4 folds, as the fifth fold is used for validation. So would you need to retrain the model on the whole of train (i.e., the data from all 5 folds)?
The answer is no, provided you set refit to True, in which case GridSearchCV will perform the training over the whole of the training set using the best hyperparameters it has found after cross-validation. It will then return the trained estimator object, on which you can directly call the predict method, as you would normally do otherwise.
Refer: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You train the model using the training set and the parameters obtained by the GridSearch.
And then you can test the model with the test set.

Related

Using the best predictor from Randomized search for test data

I used the 'RandomizedSearchCV' function to estimate my optimal parameters for a random forest model. Can I use the 'best_estimator_' attribute to predict on my test data?
The question I have is, while doing the randomized search, part of the data would have been used for validation. So the best estimate RF model wouldn't have been trained on the entire data set rt? Or is it all taken care of under the hood?
As written in the documentation cv.best_estimator_ returns the estimator that was chosen by the search, i.e. estimator which gave highest score.
If the parameter refit is set to True (default value), the model will be refit the model using the best parameter on the whole dataset including validation. Therefore you can simply use the cv.best_estimator_ to predict on your test data.

How to adopt multiple different loss functions in each steps of LSTM in Keras

I have a set of sentences and their scores, I would like to train a marking system that could predict the score for a given sentence, such one example is like this:
(X =Tomorrow is a good day, Y = 0.9)
I would like to use LSTM to build such a marking system, and also consider the sequential relationship between each word in the sentence, so the training example shown above is transformed as following:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=day, y4=0.9)
When training this LSTM, I would like the first three time steps using a softmax classifier, and the final step using a MSE. It is obvious that the loss function used in this LSTM is composed of two different loss functions. In this case, it seems the Keras does not provide the way to address my problem directly. In addition, I am not sure whether my method to build the marking system is correct or not.
Keras support multiple loss functions as well:
model = Model(inputs=inputs,
outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
loss=['categorical_crossentropy', 'mse'],
metrics=['accuracy'], loss_weights=[1., 1.])
Based on your explanation, I think you need a model that first, predict a token based on previous tokens, in NLP domain it usually called Language model, and then compute a score which I assume it is a sentiment (it is applicable to other domain).
To do so, you can train your language model with LSTM and pick the last output of LSTM for your ranking task. To this end, you need to define two loss function: categorical_crossentropy for the language model and MSE for the ranking task.
This tutorial would be helpful: https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/

does training on total dataset improves confidence scores

I'm using SVC(kernel="linear", probability=True) in multiclass classification. when I'm using 2/3rd of my data for training purpose, I'm getting ~72%. And when I tried to predict in production, Confidence scores I'm getting are very less. Does training on the total dataset helps to improve confidence scores?
Does training on the total dataset helps to improve confidence scores?
It might. In general, the more data the better. However evaluating performance should be done on data that the model has not seen before. One way to do this is to set aside a part of the data, a test set, as you have done. Another approach is to use cross-validation, see below.
And when I tried to predict in production, Confidence scores I'm getting are very less.
This means that your model does not generalize well. In other words when presented with data it has not seen before the model starts to make more or less random predictions.
To get a better sense of how well your model generalizes you may want to use cross-validation:
from sklearn.model_selection import cross_val_score
clf = SVC()
scores = cross_val_score(clf, X, Y)
This will train and evaluate your classifier on the full dataset using folds of the full data. A fold For each split the classifier is trained and validation on an exclusive subset of the data. For each split the scores result contains the validation score (for SVC, the accuracy). If you need more control over which metrics to evaluate, use the cross_validation function.
to predict in production
In order to improve your model's performance, there are several methods to consider:
Use more training data
Use an ensemble model to reduce prediction variance
Use a different model (algorithm)

How to use cross-validation splitted data with RandomizedSearchCV

I'm trying to transfer my model from single run to hyper-parameter tuning using RandomizedSearchCV.
In my single run case, my data is splitted into train/validation/test data.
When I run RandomizedSearchCV on my train_data with default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense in a 3-fold CV...).
So I'm guessing that I should merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann
I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: the RandomizedSearchCV doesn't return the splited validation data in an easy way and I should definitely merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is splitted for CV using a cross-validator into a train/validation set (in my case, the Stratified K-Folds http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the the mean accuracy and requires a predict function to be implemented in the Estimator class).
In my case, I implemented a custom score function within my estimator class.
The hyperparameter search and CV fit is done calling the fit function of RandomizedSearchCV.
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train set and then the score function on the validation set.
This is done using the _fit_and_score function from the ._validation library.
So I can access the automatically splitted validation set (33% of my train_data input) at the end of my estimator's fit function.
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stop (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indexes from my initial train_data (the train_data used in the estimator's fit function has 66% of the indexes of the initial train_data).
If that is something that someone has already done I'd love to hear about it!

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross validation. The user can pass the number of folds as an argument cv of the function to perform k-fold cross-validation (default is 10 folds with StratifiedKFold).
I would recommend reading the documentation for the functions LogisticRegression and LogisticRegressionCV
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV-functionality for all those classifiers/regressors
but: when the classifier/regressor is known/fixed a-priori (to some extent) or sometimes even some CV-model, one can gain advantages using these facts with specialized code bound to one classifier/regressor resulting in improved performance!
Typically:
CV already embedded in optimization-algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter like alpha)
It seems, at least the latter idea is used in sklearn's LogisticRegressionCV, as seen in this excerpt:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in scikit-learn documentation which I beleive explains it well:
Some models can fit data for a range of values of some parameter
almost as efficiently as fitting the estimator for a single value of
the parameter. This feature can be leveraged to perform a more
efficient cross-validation used for model selection of this parameter.
The most common parameter amenable to this strategy is the parameter
encoding the strength of the regularizer. In this case we say that we
compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
There are some things left out on other answers, e.g. about gridsearch functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf

Resources