How to use cross-validation split data with RandomizedSearchCV - scikit-learn

I'm trying to move my model from a single run to hyper-parameter tuning using RandomizedSearchCV.
In my single-run case, my data is split into train/validation/test sets.
When I run RandomizedSearchCV on my train_data with the default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense in a 3-fold CV...).
So I'm guessing that I should merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann

I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: RandomizedSearchCV doesn't return the split validation data in an easy way, and I should definitely merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is split for CV into train/validation sets by a cross-validator (in my case, Stratified K-Folds: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (it returns the mean accuracy and requires a predict function to be implemented in the estimator class).
In my case, I implemented a custom score function within my estimator class.
The hyperparameter search and CV fit are done by calling the fit function of RandomizedSearchCV:
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train set and then the score function on the validation set.
This is done by the _fit_and_score function in sklearn's model_selection/_validation module.
So I can access the automatically split validation set (33% of my train_data input) at the end of my estimator's fit function.
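Here is a minimal sketch of that fit-then-score sequence. MajorityClassifier is a hypothetical stand-in for DNNClassifier, the 12-sample data set and the dummy_param name are made up, and the call log exists only for illustration. The mixin is listed before BaseEstimator, which is the order recent scikit-learn versions expect.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: always predicts the majority class."""

    # Class-level log, shared by the clones that the search creates.
    call_log = []

    def __init__(self, dummy_param=0):
        self.dummy_param = dummy_param

    def fit(self, X, y):
        # Called by the search with the training part of each split.
        self.call_log.append(("fit", len(X)))
        values, counts = np.unique(y, return_counts=True)
        self.classes_ = values
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        # Custom score; called with the held-out part of each split.
        self.call_log.append(("score", len(X)))
        return float(np.mean(self.predict(X) == y))


# Toy data: 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

search = RandomizedSearchCV(
    MajorityClassifier(),
    {"dummy_param": [0, 1, 2]},
    n_iter=2,
    cv=StratifiedKFold(n_splits=3),
    refit=False,
    random_state=0,
)
search.fit(X, y)

for method, n_samples in MajorityClassifier.call_log:
    print(method, n_samples)
```

With 3 folds, every fit call sees two thirds of the samples and every score call sees the remaining third.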
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stopping (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indexes from my initial train_data (the train_data used in the estimator's fit function has 66% of the indexes of the initial train_data).
If that is something that someone has already done I'd love to hear about it!
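A minimal sketch of the index reconstruction idea, assuming toy data in place of train_data and a StratifiedKFold with 3 splits (all variable names are hypothetical). The held-out indices are recovered as the set difference between all indices and the training indices of a fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data standing in for train_data: 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

all_idx = np.arange(len(X))
cv = StratifiedKFold(n_splits=3)

for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Rebuild the validation indices from the training indices alone.
    recovered_val_idx = np.setdiff1d(all_idx, train_idx)
    assert np.array_equal(recovered_val_idx, val_idx)
    print(fold, len(train_idx), len(recovered_val_idx))
```

Inside an estimator's fit function this only works if the samples carry their original index (for example as a pandas index), since fit receives the rows of the fold and not their positions.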

Related

Monitor F1 Score (or a custom metric in general) in a keras callback

Keras 2.0 removed F1 score, but I would like to monitor its value. I am using a sequential model to train a Neural Net.
I defined a function, as suggested here How to calculate F1 Macro in Keras?.
This function works fine only if I use it inside model.compile. That way I see its value at each step. The problem is that I don't just want to see its value; I would like my training to behave differently according to its value, using the Keras callbacks.
If I try to insert my custom metric in the callbacks then I get this error:
'function object is not iterable'
Do you know how to define a function such that it can be used as an argument in the callbacks?
Keras callbacks let us retrieve the model at different points during training, based on the metric we keep track of. This does not affect the training procedure of the model.
You can train your model only with respect to some loss function, for example cross entropy for a classification problem. The readily available loss functions are listed in the Keras documentation.
Precision, recall and F1 score are not differentiable functions. Hence, we cannot use them as a loss function for model training.
If you want to tune your hyperparameters (such as learning rate or class weights) to improve the F1 score, you can do that.
For tuning hyperparameters you can use hyperopt (see its tutorials).
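To illustrate why F1 gives no useful gradient, here is a small pure-Python sketch (the function name, threshold and data are made up for the example). F1 is computed from hard predictions, so nudging the predicted probabilities leaves it unchanged until one of them crosses the threshold:

```python
def f1_from_probabilities(y_true, y_prob, threshold=0.5):
    """Binary F1 computed from thresholded (hard) predictions."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0]
probs_a = [0.90, 0.20, 0.60, 0.40, 0.10]
probs_b = [0.91, 0.19, 0.61, 0.41, 0.11]  # every probability nudged slightly
probs_c = [0.90, 0.20, 0.60, 0.51, 0.10]  # one probability crosses 0.5

f1_a = f1_from_probabilities(y_true, probs_a)
f1_b = f1_from_probabilities(y_true, probs_b)
f1_c = f1_from_probabilities(y_true, probs_c)
print(f1_a, f1_b, f1_c)
```

The score is flat under small changes and then jumps, which is why it is monitored as a metric or used for hyperparameter tuning and not minimized directly.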

Which model: Best estimator from gridsearchCV or all training data?

I am a little confused when it comes to grid search and fitting the final model. I split the data in 2: training and testing. The testing set is only used for final evaluation. I perform grid search only using the training data.
Say one has done a grid search over several hyperparameters using cross-validation. The grid search gives the best combination of the hyperparameters. Next step is to train the model, and this is where I am confused. I see 2 possibilities:
1) Don't train the model. Use the parameters from the best model from the grid search.
or
2) Don't use the parameters from the best model from the grid search. Train the model on the full training set with the best hyperparameter combination from the grid search.
What is the correct approach, 1 or 2?
This is probably late, but might be useful for someone else who comes along.
GridSearchCV has a parameter called refit, which is set to True by default. This means that after performing k-fold cross-validation (i.e., training on a subset of the data you passed in), it refits the model using the best hyperparameters from the grid search, on the complete training set.
Presumably your question, from what I can glean, can be summarized as:
Suppose you use 5-fold cross-validation. Your model is then fitted only on 4 folds, as the fifth fold is used for validation. So would you need to retrain the model on the whole of train (i.e., the data from all 5 folds)?
The answer is no, provided you set refit to True, in which case GridSearchCV will perform the training over the whole of the training set using the best hyperparameters it has found after cross-validation. It will then return the trained estimator object, on which you can directly call the predict method, as you would normally do otherwise.
Refer: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
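A short sketch of the refit behaviour described above, assuming the iris data set and a small, arbitrary grid over C as placeholders for your own data and parameters:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# refit=True is the default, so it does not need to be passed explicitly.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# best_estimator_ has already been refitted on all of X_train
# with the best hyperparameters found during cross-validation.
final_model = search.best_estimator_

# The search object delegates predict/score to that refitted estimator.
test_predictions = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

So there is no need for a separate training step after the search; the test set is only touched in the last two calls.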
You train the model using the training set and the parameters obtained by the GridSearch.
And then you can test the model with the test set.

Does training on the total dataset improve confidence scores?

I'm using SVC(kernel="linear", probability=True) for multiclass classification. When I use 2/3 of my data for training, I get ~72%. And when I try to predict in production, the confidence scores I get are very low. Does training on the total dataset help to improve confidence scores?
Does training on the total dataset help to improve confidence scores?
It might. In general, the more data the better. However, performance should be evaluated on data that the model has not seen before. One way to do this is to set aside a part of the data, a test set, as you have done. Another approach is to use cross-validation, see below.
And when I try to predict in production, the confidence scores I get are very low.
This means that your model does not generalize well. In other words when presented with data it has not seen before the model starts to make more or less random predictions.
To get a better sense of how well your model generalizes you may want to use cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# X is the feature matrix and Y the labels (not shown here)
clf = SVC()
scores = cross_val_score(clf, X, Y)
This will train and evaluate your classifier on the full dataset using folds of the full data. For each split, the classifier is trained on all folds but one and validated on the remaining fold, so every sample is used for validation exactly once. The scores result contains the validation score for each split (for SVC, the accuracy). If you need more control over which metrics to evaluate, use the cross_validate function.
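As a rough sketch of the multi-metric variant, assuming the iris data set as a placeholder for your own X and Y and two arbitrarily chosen metrics:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, Y = load_iris(return_X_y=True)
clf = SVC(kernel="linear")

# One entry per metric, each holding one score per fold.
results = cross_validate(clf, X, Y, cv=5, scoring=["accuracy", "f1_macro"])

mean_accuracy = results["test_accuracy"].mean()
mean_f1_macro = results["test_f1_macro"].mean()
print(mean_accuracy, mean_f1_macro)
```

The result is a dictionary that also contains fit and score times for each fold.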
to predict in production
In order to improve your model's performance, there are several methods to consider:
Use more training data
Use an ensemble model to reduce prediction variance
Use a different model (algorithm)

Pytorch: Intermediate testing during training

How can I test my pytorch model on validation data during training?
I know that there is the function myNet.eval(), which apparently switches off any dropout layers, but does it also prevent the gradients from being accumulated?
Also how would I undo the myNet.eval() command in order to continue with the training?
If anyone has some code snippet / toy example I would be grateful!
How can I test my pytorch model on validation data during training?
There are plenty of examples with train and test steps for every epoch during training. An easy one would be the official MNIST example. Since pytorch does not offer any high-level training, validation or scoring framework, you have to write it yourself. Commonly this consists of
a data loader (commonly based on torch.utils.data.DataLoader)
a main loop over the total number of epochs
a train() function that uses training data to optimize the model
a test() or valid() function to measure the effectiveness of the model given validation data and a metric
This is also what you will find in the linked example.
Alternatively you can use a framework that provides basic looping and validation facilities so you don't have to implement everything by yourself all the time.
tnt is torchnet for pytorch, supplying you with different metrics (such as accuracy) and abstraction of the train loop. See this MNIST example.
inferno and torchsample attempt to model things very similar to Keras and provide some tools for validation
skorch is a scikit-learn wrapper for pytorch that lets you use all the tools and metrics from sklearn
Also how would I undo the myNet.eval() command in order to continue with the training?
myNet.train() or, alternatively, supply a boolean to switch between eval and training: myNet.train(True) for train mode.
I know that there is the function myNet.eval(), which apparently switches off any dropout layers, but does it also prevent the gradients from being accumulated?
It doesn't prevent gradients from accumulating.
But during testing you usually do want to skip gradient tracking. In old PyTorch versions (before 0.4) you would mark the input Variable to the network as volatile=True; in current versions, wrap the forward pass in torch.no_grad() instead. Either way it saves some time and memory in the forward computation.
Also how would I undo the myNet.eval() command in order to continue with the training?
myNet.train()
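A toy example of validating during training, assuming random data, a small made-up network and arbitrary hyperparameters. It combines the pieces from the answers above: train() before the training step, eval() plus torch.no_grad() for validation, then back to train().

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random toy data standing in for real training and validation sets.
X_train, y_train = torch.randn(64, 10), torch.randint(0, 2, (64,))
X_val, y_val = torch.randn(32, 10), torch.randint(0, 2, (32,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=16)

my_net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(my_net.parameters(), lr=0.1)

val_accuracies = []
for epoch in range(3):
    my_net.train()  # training mode: dropout is active
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(my_net(inputs), targets)
        loss.backward()
        optimizer.step()

    my_net.eval()  # evaluation mode: dropout is off, autograd is still on
    correct = 0
    with torch.no_grad():  # skip gradient tracking during validation
        for inputs, targets in val_loader:
            predictions = my_net(inputs).argmax(dim=1)
            correct += (predictions == targets).sum().item()
    val_accuracies.append(correct / len(X_val))
    print(epoch, val_accuracies[-1])

my_net.train()  # back to training mode if training continues afterwards
```

Because train() is called at the top of every epoch, the switch to eval() for validation never leaks into the next training pass.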

keras metric different during training

I have implemented a custom metric based on SIM, and when I try the code it works. I have implemented it using tensors and np arrays and both give the same results. However, when I start fitting the model, the values given back are a lot higher than the values I get when I load the weights generated by the training and apply the same function.
My function is:
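from keras import backend as K

def SIM(y_true, y_pred):
    # Normalize both inputs so that each sums to 1 over the whole batch
    n_y_true = y_true / (K.sum(y_true) + K.epsilon())
    n_y_pred = y_pred / (K.sum(y_pred) + K.epsilon())
    return K.mean(K.sum(K.minimum(n_y_true, n_y_pred)))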
When I compile the Keras model I add this to the metrics and during training it gives for example SIM: 0.7092.
When I load the weights and try it, the SIM score is around 0.3. The correct weights are loaded (when restarting training with these weights the same values pop up). Does anybody know if I am doing anything wrong?
Why are the metrics given back during training so much higher compared to running the function over a batch?
