sklearn pipeline + keras sequential model - how to get history? - python-3.x

Keras models, when .fit is called, return a history object. Is it possible to retrieve it if I use this model as one step of a sklearn pipeline?
btw, i'm using python 3.6
Thanks in advance!

The History callback records training metrics for each epoch. This includes the loss and the accuracy (for classification problems) as well as the loss and accuracy for the validation dataset, if one is set.
The history object is returned from calls to the fit() function used to train the model. Metrics are stored in a dictionary in the history member of the object returned.
This also means that the history only exists in the scope of the fit() call on the sequential model itself. When the model is one step of a sklearn pipeline, the pipeline's own fit() returns the pipeline rather than the history object, so the values are not handed back to you.
As of right now I am not aware of a history callback in sklearn, so the only option I see for you is to manually record the metrics you want to track. One way to do so would be to have the pipeline return the transformed data and then simply fit your model on it. If you are not able to figure that out, leave a comment.
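A minimal sketch of that suggestion, assuming the pipeline holds only preprocessing steps and the Keras model is fitted separately so that the object returned by fit() can be kept. The function name fit_and_keep_history and the StandardScaler step are illustrative assumptions, not part of the question; model is assumed to be an already compiled Keras model.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_keep_history(preprocess, model, X, y, **fit_kwargs):
    """Fit the preprocessing steps, then fit the model on their output.

    `preprocess` is a Pipeline made of transformers only. `model` is assumed
    to be a compiled Keras model, whose fit() returns a History object.
    """
    X_t = preprocess.fit_transform(X, y)
    history = model.fit(X_t, y, **fit_kwargs)
    return history


# Illustrative preprocessing pipeline (the scaling step is an assumption).
preprocess = Pipeline([("scale", StandardScaler())])

# With a compiled Keras model named `model`, usage would look like:
# history = fit_and_keep_history(preprocess, model, X, y,
#                                epochs=10, validation_split=0.2)
# history.history["loss"] then holds one value per epoch.
```

The fitted preprocess object can be reused with preprocess.transform(X_new) before calling model.predict, so nothing from the pipeline is lost by keeping the model outside it.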

Related

SpaCy: Do I need to perform early stopping while training the model for custom entities?

I have divided my data into training and testing.
https://spacy.io/usage/training#ner
As per the code snippet given by spaCy for training custom entities, it seems there is no early stopping. So I have a question:
Should I write custom code that performs the following steps after every iteration:
1. Iteration completed.
2. Check the model accuracy on the testing data.
3. If the accuracy is more than the previous model then save it else continue.
4. Perform the next iteration.
Or is the final model after completing all the iterations (for example, 30 iterations) the best model?
Sample output of my custom code:
As per the above output is it right to say the best model is at iteration no 13?
You should switch to the train CLI, which includes better evaluation metrics and early stopping: https://spacy.io/api/cli#train
spacy convert can convert a lot of common NER formats to spacy's internal training format and spacy train has a lot more options than the simple example training script. (spacy uses spacy train internally for the models it distributes.)
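If you do keep a custom loop, the check-and-save steps listed in the question can be sketched in a framework-agnostic way. The three callables (train_one_iteration, evaluate, save) and the patience parameter are hypothetical placeholders for your own spaCy update, scoring and model-saving code; they are not spaCy APIs. Ideally the score comes from a held-out development set rather than the final test data.

```python
def train_with_early_stopping(train_one_iteration, evaluate, save,
                              max_iterations=30, patience=5):
    """Keep the best-scoring model and stop once it stops improving.

    All three callables are supplied by the caller (hypothetical hooks).
    """
    best_score = float("-inf")
    best_iteration = None
    stale = 0  # iterations since the last improvement

    for iteration in range(1, max_iterations + 1):
        train_one_iteration()          # 1. run one training iteration
        score = evaluate()             # 2. score on held-out data
        if score > best_score:         # 3. better than before: save it
            best_score, best_iteration, stale = score, iteration, 0
            save()
        else:                          #    otherwise count a stale round
            stale += 1
            if stale >= patience:
                break                  # early stop
    return best_iteration, best_score
```

With this structure the model saved last is the best one seen, which is not necessarily the model from the final iteration.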

Monitor F1 Score (or a custom metric in general) in a keras callback

Keras 2.0 removed F1 score, but I would like to monitor its value. I am using a sequential model to train a Neural Net.
I defined a function, as suggested here How to calculate F1 Macro in Keras?.
This function works fine only if I use it inside model.compile. That way I see its value at each step. The problem is that I don't just want to see its value; I would like my training to behave differently according to its value, using the callbacks of Keras.
If I try to insert my custom metric in the callbacks then I get this error:
'function object is not iterable'
Do you know how to define a function such that it can be used as an argument in the callbacks?
Keras callbacks let us retrieve the model at different points during training, based on the metric we keep track of. They do not change how the model's weights are optimized.
You can train your model only with respect to some loss function, for example cross entropy for a classification problem. The loss functions readily available in Keras are listed in the Keras documentation.
Precision, recall and F1 score are not differentiable functions. Hence, we cannot use them as loss functions for model training.
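To make the non-differentiability point concrete, here is a small NumPy sketch (the function name f1_from_probabilities and the toy data are illustrative assumptions). F1 is computed from thresholded predictions, so a small change in the predicted probabilities leaves it unchanged, and there is no useful gradient to follow.

```python
import numpy as np


def f1_from_probabilities(y_true, y_prob, threshold=0.5):
    """Binary F1 computed from hard predictions at a fixed threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = [1, 0, 1, 1, 0]
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.7])

base = f1_from_probabilities(y_true, y_prob)
nudged = f1_from_probabilities(y_true, y_prob + 0.01)

# No prediction crosses the threshold, so the score does not move.
print(base == nudged)
```

The same kind of function can still be evaluated on validation predictions at the end of each epoch (for example from a custom callback's on_epoch_end) to monitor F1 without using it as the loss.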
However, if you want to tune your hyperparameters (such as the learning rate or class weights) to improve the F1 score, you can do that.
For tuning hyperparameters you can use hyperopt; see its tutorials.

Which model: Best estimator from gridsearchCV or all training data?

I am a little confused when it comes to grid search and fitting the final model. I split the data in 2: training and testing. The testing set is only used for final evaluation. I perform the grid search using only the training data.
Say one has done a grid search over several hyperparameters using cross-validation. The grid search gives the best combination of the hyperparameters. Next step is to train the model, and this is where I am confused. I see 2 possibilities:
1) Don't train the model. Use the parameters from the best model from the grid search.
or
2) Don't use the parameters from the best model from the grid search. Train the model on the full training set with the best hyperparameter combination from the grid search.
What is the correct approach, 1 or 2?
This is probably late, but might be useful for someone else who comes along.
GridSearchCV has a parameter called refit, which is set to True by default. This means that after performing k-fold cross-validation (i.e., training on subsets of the data you passed in), it refits the model using the best hyperparameters from the grid search, on the complete training set.
Presumably your question, from what I can glean, can be summarized as:
Suppose you use 5-fold cross-validation. Your model is then fitted only on 4 folds, as the fifth fold is used for validation. So would you need to retrain the model on the whole of train (i.e., the data from all 5 folds)?
The answer is no, provided you set refit to True, in which case GridSearchCV will perform the training over the whole of the training set using the best hyperparameters it has found after cross-validation. It will then return the trained estimator object, on which you can directly call the predict method, as you would normally do otherwise.
Refer: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
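A short sketch of this behaviour, using the iris dataset and a LogisticRegression grid purely as stand-ins (the estimator, grid values and split sizes are assumptions for illustration). After fit, the search object already holds an estimator refitted on the whole training set.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# refit=True is the default, so it is not passed explicitly here.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)  # cross-validates, then refits on all of X_train

print(search.best_params_)

# predict/score delegate to best_estimator_, the refitted model.
predictions = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)
```

The test set is touched only in the last two lines, which matches the split described in the question.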
You train the model using the training set and the parameters obtained by the GridSearch.
And then you can test the model with the test set.

How to use cross-validation splitted data with RandomizedSearchCV

I'm trying to transfer my model from single run to hyper-parameter tuning using RandomizedSearchCV.
In my single-run case, my data is split into train/validation/test data.
When I run RandomizedSearchCV on my train_data with default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense in a 3-fold CV...).
So I'm guessing that I should merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann
I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: RandomizedSearchCV doesn't return the split validation data in an easy way, and I should definitely merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is split for CV using a cross-validator into a train/validation set (in my case, Stratified K-Folds: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the mean accuracy and requires a predict function to be implemented in the estimator class).
In my case, I implemented a custom score function within my estimator class.
The hyperparameter search and CV fit is done calling the fit function of RandomizedSearchCV.
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train set and then the score function on the validation set.
This is done using the _fit_and_score function from sklearn's model_selection._validation module.
So I can access the automatically split validation set (33% of my train_data input) at the end of my estimator's fit function.
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stop (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indexes from my initial train_data (the train_data used in the estimator's fit function has 66% of the indexes of the initial train_data).
If that is something that someone has already done I'd love to hear about it!
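One way to avoid reconstructing the validation indexes after the fact is to generate the folds yourself and pass them in through the cv argument, which accepts an iterable of (train, validation) index arrays. The sketch below uses synthetic data and LogisticRegression as a stand-in for the custom DNNClassifier; the parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=90, n_features=5, random_state=0)

# Build the folds explicitly so the validation rows of each split are known.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),          # stand-in estimator
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=folds,                                   # reuse the exact same splits
    random_state=0,
)
search.fit(X, y)

for i, (train_idx, val_idx) in enumerate(folds):
    # X[val_idx], y[val_idx] is the validation data used for fold i.
    print(f"fold {i}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

Because folds is an ordinary list, the same indexes can be used outside the search, for example to plot validation accuracy for a given fold.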

Pytorch: Intermediate testing during training

How can I test my pytorch model on validation data during training?
I know that there is the function myNet.eval(), which apparently switches off any dropout layers, but does it also prevent the gradients from being accumulated?
Also how would I undo the myNet.eval() command in order to continue with the training?
If anyone has some code snippet / toy example I would be grateful!
How can I test my pytorch model on validation data during training?
There are plenty of examples with train and test steps for every epoch during training. An easy one would be the official MNIST example. Since pytorch does not offer any high-level training, validation or scoring framework, you have to write it yourself. Commonly this consists of
a data loader (commonly based on torch.utils.data.DataLoader)
a main loop over the total number of epochs
a train() function that uses training data to optimize the model
a test() or valid() function to measure the effectiveness of the model given validation data and a metric
This is also what you will find in the linked example.
Alternatively you can use a framework that provides basic looping and validation facilities so you don't have to implement everything by yourself all the time.
tnt is torchnet for pytorch, supplying you with different metrics (such as accuracy) and abstraction of the train loop. See this MNIST example.
inferno and torchsample attempt to model things very similar to Keras and provide some tools for validation
skorch is a scikit-learn wrapper for pytorch that lets you use all the tools and metrics from sklearn
Also how would I undo the myNet.eval() command in order to continue with the training?
myNet.train() or, alternatively, supply a boolean to switch between eval and training: myNet.train(True) for train mode.
I know that there is the function myNet.eval(), which apparently switches off any dropout layers, but does it also prevent the gradients from being accumulated?
It doesn't prevent gradients from accumulating.
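But during testing you usually do want to skip gradient tracking. In older PyTorch versions you would mark the input variable to the network as volatile=True; since PyTorch 0.4 the volatile flag has been removed, and the equivalent is to run the forward pass inside a with torch.no_grad(): block. Either way, it saves some time and memory in the forward calculation.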
Also how would I undo the myNet.eval() command in order to continue with the training?
myNet.train()
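A toy example that puts the answers together, assuming PyTorch 0.4 or later (where torch.no_grad() is available). The network, the random data and the hyperparameters are made up for illustration; the point is the switch between train() and eval() and the no_grad block around validation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy data: 10 features, 2 classes (random, for illustration only).
x_train, y_train = torch.randn(64, 10), torch.randint(0, 2, (64,))
x_val, y_val = torch.randn(32, 10), torch.randint(0, 2, (32,))

model = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 2)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

val_losses = []
for epoch in range(3):
    model.train()                  # training mode: dropout is active
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()                   # evaluation mode: dropout is disabled
    with torch.no_grad():          # no gradient tracking inside this block
        val_loss = loss_fn(model(x_val), y_val).item()
    val_losses.append(val_loss)
    print(f"epoch {epoch}: train loss {loss.item():.4f}, val loss {val_loss:.4f}")
```

Calling model.train() at the top of each epoch is what undoes the model.eval() from the previous validation pass.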

Resources