What does model.eval() do for a batch normalization layer? - PyTorch

Why does the testing data use the mean and variance of all the training data? To keep the distribution consistent? And what is the difference in the BN layer between model.train() and model.eval()?

It fixes the mean and variance computed in the training phase by keeping estimates of them in running_mean and running_var. See the PyTorch documentation.
As noted there, the implementation is based on the description in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. By using the whole training data, one gets (given similar train/test data) a better estimate of the mean/variance for the (unseen) test set.
Similar questions have also been asked here: What does model.eval() do?
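A minimal sketch of this behavior (the layer size and batch shape are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)
x = torch.randn(8, 4)  # a batch of 8 samples with 4 features

# Training mode: normalizes with the statistics of the current batch
# and updates running_mean / running_var as running averages.
bn.train()
y_train = bn(x)
print(bn.running_mean, bn.running_var)

# Eval mode: normalizes with the stored running_mean / running_var;
# the estimates are frozen, and a sample's output no longer depends
# on the other samples in the batch.
bn.eval()
y_eval = bn(x)
```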

Related

What happens if optimal training loss is too high

I am training a Transformer. In many of my setups I obtain validation and training loss that look like this:
Then, I understand that I should stop training at around epoch 1. But then the training loss is very high. Is this a problem? Does the value of training loss actually mean anything?
Thanks
Regarding your first question - it is not necessarily a problem that your training loss is high, since there is no universal threshold for what counts as a high training loss. It depends on your dataset, your actual test metrics and your business goals.
More specifically, the problems with the value of training loss:
The number isn't intuitive, since the loss is an objective tailored to gradient descent (i.e. a differentiable function, usually a logarithmic surrogate of the metric you actually care about).
You probably have intuitive business metrics (e.g., precision, recall) oriented towards your end goal, which you should use to decide if your model is good or not.
Your train loss is calculated on the training dataset, which is not always representative of a good model, as can be seen in the overfitted model you posted. You shouldn't use this number to make decisions for the goodness of the model.
It depends on what you are trying to achieve. Is 80% accuracy high or low?
Regarding your second question - technically, the higher the number, the worse the model converged, so you should always try to lower it (while taking overfitting into consideration).
Comparatively, you can say that one model has a higher loss than another, and then try multiple hyperparameters (e.g., dropout rate, different optimizers) to push back the point where the validation loss starts to diverge.
You are describing overfitting: Your model's expressive power is too strong and it is memorizing the training data, rather than learning useful representations that can generalize to the validation data.
To mitigate this issue, you should apply stronger regularization to your model to prevent it from memorizing and steer it towards useful representations.
Regularization methods include (but are not limited to) the following; a sketch combining two of them follows the list:
Input augmentations
Dropout
Early stopping
Weight decay
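For instance, a minimal runnable sketch combining weight decay (an L2 penalty) with early stopping; the toy model, data and numbers are purely illustrative:

```python
import torch
import torch.nn as nn

# Toy model and data so the loop below actually runs; replace with your own.
model = nn.Linear(10, 2)
opt = torch.optim.Adam(model.parameters(), weight_decay=1e-4)  # weight decay = L2
loss_fn = nn.CrossEntropyLoss()
x_tr, y_tr = torch.randn(256, 10), torch.randint(0, 2, (256,))
x_va, y_va = torch.randn(64, 10), torch.randint(0, 2, (64,))

best_val, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(100):
    model.train()
    opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(x_va), y_va).item()
    if val < best_val:
        best_val, bad_epochs = val, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping: quit once validation stops improving
```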

Improve the neural network by analyzing the loss curve

I built a network based on LSTM and tuned the parameters. The results are shown in the figure and are not impressive.
How can I tell what is wrong? Is the dataset bad, or is the network not well built?
Since the validation loss decreased initially and later increased, what you're experiencing is model overfitting.
Since the training loss kept decreasing, your model has learnt the training set excessively and is no longer generalizing well; that is why the validation loss increased.
To avoid overfitting, you need to regularize your model. You can use L1 or L2 regularization techniques, and you can also try dropout, as sketched below.
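A minimal sketch of those three techniques on a small LSTM; the layer sizes and penalty coefficients are hypothetical:

```python
import torch
import torch.nn as nn

# `dropout=` applies dropout between the stacked LSTM layers.
model = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
                dropout=0.3, batch_first=True)

x = torch.randn(4, 10, 16)   # (batch, sequence length, features)
out, _ = model(x)
loss = out.pow(2).mean()     # stand-in for your actual task loss

# L2 penalty (equivalently: the weight_decay argument of the optimizer).
l2 = sum(p.pow(2).sum() for p in model.parameters())
# L1 penalty, which encourages sparse weights.
l1 = sum(p.abs().sum() for p in model.parameters())

loss = loss + 1e-4 * l2 + 1e-5 * l1  # hypothetical coefficients
loss.backward()
```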
Now coming to your question:
If the dataset is of good quality, i.e. it is annotated well and it definitely contains features that could produce a result, then the dataset and the model hand-in-hand decide the quality of the predictions.
Since you're using RNNs, which have a good number of parameters, make sure the dataset is also large enough to avoid the RNN overfitting on a small dataset. If the available dataset is small, start with a small model with fewer parameters (you can build a small neural network) and gradually scale up the model until you're satisfied with the prediction scores; see the sketch below.
You can also refer to this: https://towardsdatascience.com/rnn-training-tips-and-tricks-2bf687e67527
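To gauge model size against dataset size, you can count trainable parameters; a small sketch (the layer sizes are illustrative):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

small = nn.LSTM(input_size=16, hidden_size=32)
large = nn.LSTM(input_size=16, hidden_size=256, num_layers=3)
print(count_params(small), count_params(large))  # scale up only if the data supports it
```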

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to do its job quite well.
When I observed the training and validation loss though, the training loss is somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model, and as far as I have understood, dropout is only applied during training, which might be the reason.
Now I am wondering whether I have trained a valid model.
It doesn't seem like the model is heavily underfitted, does it?
Thanks in advance for any advice,
cheers,
M
First, check whether you have a good dataset, i.e., if it is a classification task, get an equal number of images for all classes, and get them from the same source, not from different sources. Regularization and dropout are used against overfitting/high variance, so don't worry about those here.
Then, I think your model is doing well: when you start training, the gap between the two losses is larger, but as the epochs increase they both settle onto a steady path, so it is good. The reason may be what I mentioned above, or you could try shuffling the data and then using train_test_split to get a better distribution across the training and validation sets, as sketched below.
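A minimal sketch of a shuffled, class-balanced split with scikit-learn; the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 20), np.random.randint(0, 2, 1000)  # toy data

# shuffle=True mixes the rows before splitting; stratify=y keeps the class
# proportions identical in the training and validation sets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)
```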
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
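If you want to check these conditions on your own runs, a minimal plotting sketch (the loss values here are made up for illustration):

```python
import matplotlib.pyplot as plt

# Per-epoch losses you would normally collect during training.
train_losses = [1.0, 0.6, 0.4, 0.33, 0.30, 0.29]  # illustrative numbers
val_losses = [1.1, 0.7, 0.5, 0.42, 0.40, 0.39]

plt.plot(train_losses, label="training loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()  # look for both curves flattening out with a small, stable gap
```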
Still, if you want to deal with high bias/underfitting, here are a few methods:
Train bigger models
Train longer; use better optimization techniques
Try different neural network architectures and also hyperparameters
You can also use cross-validation or GridSearchCV to find a better optimizer or hyperparameters, but it may take really long, because you have to train on different parameters each time; considering your training time of 15 hours, it might take very long, but you will find better parameters and can then train with them.
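A minimal GridSearchCV sketch with scikit-learn; MLPClassifier and the parameter grid here are stand-ins for whatever model and hyperparameters you actually search over:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data

# Every parameter combination is cross-validated, which multiplies training time.
grid = GridSearchCV(
    MLPClassifier(max_iter=500),
    param_grid={
        "hidden_layer_sizes": [(32,), (64, 32)],
        "alpha": [1e-4, 1e-3],        # L2 penalty strength
        "solver": ["adam", "sgd"],    # compare optimizers
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```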
All in all, I think your model is doing okay.
If your model underfits, its performance will be lower, similarly to the case of overfitting, because it cannot learn effectively enough to reach the optimal result, i.e. the proper function to fit the given distribution. So you have to use less regularization, e.g. less dropout, to get the optimal result.
Furthermore, the sampling can also be crucial, because there can be training/validation subsets where your model performs well on the validation set and less well on the training set, and vice versa. This is one of the reasons why we use cross-validation and different sampling methods, e.g. stratified k-fold.
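A minimal stratified k-fold sketch; the data is random and only illustrates the API:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)  # toy data

# Each fold preserves the class proportions of y, so no split is
# accidentally much easier or harder than the others.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_tr, X_va = X[train_idx], X[val_idx]
    y_tr, y_va = y[train_idx], y[val_idx]
    # ... train and evaluate the model on this fold ...
```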

Best Way to Overcome Early Convergence for Machine Learning Model

I have a machine learning model built that tries to predict weather data, and in this case I am doing a prediction on whether or not it will rain tomorrow (a binary prediction of Yes/No).
In the dataset there are about 50 input variables, and I have 65,000 entries in the dataset.
I am currently running an RNN with a single hidden layer, with 35 nodes in the hidden layer. I am using PyTorch's NLLLoss as my loss function, and Adaboost for the optimization function. I've tried many different learning rates, and 0.01 seems to be working fairly well.
After running for 150 epochs, I notice that I start to converge around 0.80 accuracy for my test data. I would wish for this to be even higher, but it seems like the model is stuck oscillating around some sort of saddle point or local minimum. (A graph of this is below.)
What are the most effective ways to get out of this "valley" that the model seems to be stuck in?
Not sure why exactly you are using only one hidden layer, or what the shape of your history data is, but here are things you can try:
Try more than one hidden layer
Experiment with LSTM and GRU layers, and with combinations of these layers together with the RNN (a sketch follows this list).
Reconsider the shape of your data, i.e. the history you look at to predict the weather.
Make sure your features are scaled properly since you have about 50 input variables.
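A sketch of the first two suggestions in PyTorch; the class name, layer sizes and the 14-day history window are hypothetical:

```python
import torch
import torch.nn as nn

class RainClassifier(nn.Module):
    """Hypothetical stacked-GRU binary classifier; sizes are illustrative."""
    def __init__(self, n_features=50, hidden=35, layers=2):
        super().__init__()
        # num_layers > 1 stacks recurrent layers; swap nn.GRU for nn.LSTM to compare.
        self.rnn = nn.GRU(n_features, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):  # x: (batch, history length, n_features)
        out, _ = self.rnn(x)
        # Log-probabilities from the last time step, suitable for NLLLoss.
        return torch.log_softmax(self.head(out[:, -1]), dim=-1)

model = RainClassifier()
log_probs = model(torch.randn(8, 14, 50))  # e.g. a 14-day history window
```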
Your question is a little ambiguous, as you mentioned an RNN with a single hidden layer. Also, without knowing the entire neural network architecture, it is tough to say how you can bring in improvements. So, I would like to add a few points.
You mentioned that you are using "Adaboost" as the optimization function, but PyTorch doesn't have any such optimizer; AdaBoost is an ensemble method, not a gradient-based optimizer. Did you try the SGD or Adam optimizers, which are very widely used? (See the sketch after these points.)
Do you have any regularization term in the loss function? Are you familiar with dropout? Did you check the training performance? Does your model overfit?
Do you have a baseline model/algorithm so that you can compare whether 80% accuracy is good or not?
150 epochs just for a binary classification task looks like too much. Why don't you start from an off-the-shelf classifier model? You can find several examples of regression and classification in this tutorial.
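Picking up the optimizer point above, a minimal sketch of swapping in torch.optim.Adam (the toy model and data are placeholders):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(50, 2)  # stand-in for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Alternatively: torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x, y = torch.randn(8, 50), torch.randint(0, 2, (8,))  # toy batch
loss = F.nll_loss(torch.log_softmax(model(x), dim=-1), y)  # NLLLoss as in the question

optimizer.zero_grad()
loss.backward()
optimizer.step()
```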

Validation loss in Keras while training LSTM, and stability of LSTM

I am using Keras to train my LSTM model for a time series problem. My activation function is linear and the optimizer is RMSprop.
However, I observe that while the training loss decreases slowly over time and fluctuates around a small value, the validation loss jumps up and down with a large variance.
Therefore, I have two questions:
1. Does the validation loss affect the training process? Will the algorithm look at the validation loss and slow down the learning rate in case it fluctuates a lot?
2. How can I make the model more stable so that it returns more stable values of validation loss?
Thanks
Does the validation loss affect the training process?
No. The validation set is just a small sample of data that is excluded from the training process. It is run through the network at the end of an epoch to test how well training is going, so that you can check whether the model is overfitting (i.e. training loss much lower than validation loss).
Fluctuation in validation loss
This is a bit tougher to answer without seeing the network or the data. It could just mean that your model isn't converging well to unseen data, meaning that it's not seeing enough similar trends from the training data to the validation data, and each time the weights are adjusted to better suit the training data, the model becomes less accurate for the validation set. You could possibly turn down the learning rate, but if your training loss is decreasing slowly, the learning rate is probably fine. I think in this situation you have to ask yourself a few questions: Do I have enough data? Does a true time series trend exist in my data? Have I normalized my data correctly? Is my network too large for the data I have?
I had this issue - while the training loss was decreasing, the validation loss was not. I checked my setup and made the following changes while using the LSTM:
I simplified the model - instead of 20 layers, I opted for 8 layers.
Instead of scaling within the range (-1, 1), I chose (0, 1); this alone reduced my validation loss by an order of magnitude (see the sketch after this list).
I reduced the batch size from 500 to 50 (just trial and error)
I added more features, which I thought intuitively would add some new intelligent information to the X->y pair
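A minimal sketch of the (0, 1) scaling point using scikit-learn's MinMaxScaler; the series is random and the 80/20 split is illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

series = np.random.randn(1000, 1)  # toy time series

# Fit the scaler on the training portion only, then apply it everywhere,
# so no information from the validation set leaks into training.
train, val = series[:800], series[800:]
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)
val_scaled = scaler.transform(val)
```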
Possible reasons:
Your validation set is very small compared to your training set, which usually happens. A small change in the weights makes the validation loss fluctuate much more than the training loss. This does not necessarily mean that your model is overfitting, as long as the overall tendency of the validation loss keeps decreasing.
Maybe your training and validation data come from different sources and have different distributions. This can happen when your data is a time series and you split your train/validation data at a specific timestamp.
Does the validation loss affect the training process?
No; validation (a single forward pass) and training (forward and backward passes) are different processes. Hence, a forward pass on its own does not change how you train next.
Will the algorithm look at the validation loss and slow down the learning rate in case it fluctuates alot?
No, but I guess you can implement your own method to do so. However, one thing should be noted: the model is trying to learn the best solution to your cost function, which is fed by training data only, so changing the learning rate by observing the validation loss doesn't make too much sense.
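For what it's worth, Keras does ship an opt-in callback that does exactly this; it only takes effect if you attach it yourself. A minimal sketch (the fit call is commented out because model and data are not defined here):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate after 5 epochs without validation-loss improvement.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[reduce_lr])
```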
How can i make the model more stable so that it will return a more stable values of validation loss?
The reasons are explained above. If it is the first case, enlarging the validation set will make your loss look more stable, but it does NOT mean the model fits better. My suggestion is: as long as you are sure your model does not overfit (the gap between training loss and validation loss is not too large), you can just save the model which gives the lowest validation loss, as sketched below.
If it is the second case, it can be complicated, depending on your situation. You could try to exclude samples in the training set which are not "similar" to your validation set, or enlarge your model's capacity if you have enough data. Or perhaps add more metrics to monitor how well the training is going.
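In Keras, saving the model with the lowest validation loss is one callback; a minimal sketch (the filename is illustrative, and the fit call is commented out because model and data are not defined here):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Overwrites the file only when val_loss improves, so "best.keras" always
# holds the checkpoint with the lowest validation loss seen so far.
checkpoint = ModelCheckpoint("best.keras", monitor="val_loss",
                             save_best_only=True)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[checkpoint])
```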
