PyTorch training with dropout and/or batch-normalization - pytorch

A model should be set to evaluation mode for inference by calling model.eval().
Do we also need to do this during training, before getting the model outputs? For example, within a training epoch, if the network contains one or more dropout and/or batch-normalization layers.
If this is not done, might the output of the forward pass in the training epoch be affected by the randomness of the dropout?
Many example codes do not do this; something along these lines is the common approach:
for t in range(num_epochs):
    # forward pass
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
For example, here is one piece of code to look at: convolutional_neural_network/main.py
Should this instead be?
for t in range(num_epochs):
    # forward pass
    model.eval()  # disable dropout etc.
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    model.train()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

TL;DR:
Should it instead be the second version (with eval() before the forward pass)? No!
Why? More explanation:
Different modules behave differently depending on whether they are in training or evaluation/test mode.
BatchNorm and Dropout are only two examples of such modules; basically, any module that has a distinct training phase follows this rule.
When you call .eval(), you signal all modules in the model to shift their behavior accordingly.
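As a quick illustration (a minimal sketch, not part of the original answer): .eval() and .train() simply toggle the training flag on the model and every submodule, which is what the mode-dependent layers check internally:
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 10),
    nn.BatchNorm1d(10),
    nn.ReLU(),
    nn.Dropout(p=0.5),
)

model.eval()                                       # sets module.training = False recursively
print(any(m.training for m in model.modules()))    # False: every submodule is in eval mode

model.train()                                      # sets module.training = True recursively
print(all(m.training for m in model.modules()))    # True: every submodule is back in train mode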
Update
The answer is that during training you should not use eval mode, and yes, as long as you have not set eval mode, dropout will be active and act randomly on each forward pass. Similarly, all other modules that have two phases will behave accordingly. That is, BN will update its running mean/variance on every pass; also, if you use a batch size of 1, it will error out, as BN cannot be computed on a batch of 1.
As pointed out in the comments, note that during training you should not call eval() before the forward pass: it effectively disables all modules that behave differently in train and test mode, such as BN and Dropout (basically any module that has updateable/learnable statistics, or that impacts network topology like dropout), and you will not see them contributing to your network's learning. So don't code like that!
Let me explain a bit what happens during training:
The modules that make up your model may have two modes, training and test. These modules either have statistics/parameters that need to be updated during training, like BN, or affect the network topology in some sense, like Dropout (by disabling some features during the forward pass). Some modules, such as ReLU(), operate the same way in both modes and thus are unaffected by the mode change.
When you are in training mode, you feed in an image; it passes through the layers until it reaches a dropout layer, where some features are disabled and their responses to the next layer are omitted. The output then goes through the remaining layers until it reaches the end of the network and you get a prediction.
The network's predictions may be right or wrong, and the weights are updated accordingly: if the answer was right, the features/combinations of features that resulted in the correct answer are reinforced, and vice versa.
So during training you do not need to, and should not, disable dropout; it affects the output, and it should, so that the model learns a better set of features.
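To see this concretely, here is a minimal sketch (not from the original answer) showing that a dropout layer gives a different output on every forward pass in train mode, but is deterministic in eval mode:
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # random zeros, remaining values scaled by 1/(1-p)
print(drop(x))   # a different random pattern on the next pass

drop.eval()
print(drop(x))   # identity: dropout is a no-op in eval mode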
I hope this makes it a bit clearer for you. If you still feel you need more, say so in the comments.

Related

How to fine tune InceptionV3 in Keras

I am trying to train a classifier based on the InceptionV3 architecture in Keras.
For this I loaded the pre-trained InceptionV3 model, without the top, and added a final fully connected layer for the classes of my classification problem. In the first training run I froze the InceptionV3 base model and only trained the final fully connected layer.
In the second step I want to "fine tune" the network by unfreezing a part of the InceptionV3 model.
Now I know that the InceptionV3 model makes extensive use of BatchNorm layers. When BatchNorm layers are "unfrozen" for fine-tuning during transfer learning, it is recommended (link to documentation) to keep the means and variances computed by the BatchNorm layers fixed. This should be done by setting the BatchNorm layers to inference mode instead of training mode.
Please also see: What's the difference between the training argument in call() and the trainable attribute?
Now my main question is: how to set ONLY the BatchNorm layers of the InceptionV3 model to inference mode?
Currently I set the whole InceptionV3 base model to inference mode by setting the "training" argument when assembling the network:
inputs = keras.Input(shape=input_shape)
# Scale the 0-255 RGB values to 0.0-1.0 RGB values
x = layers.experimental.preprocessing.Rescaling(1./255)(inputs)
# Set include_top to False so that the final fully connected (with pre-loaded weights) layer is not included.
# We will add our own fully connected layer for our own set of classes to the network.
base_model = keras.applications.InceptionV3(input_shape=input_shape, weights='imagenet', include_top=False)
x = base_model(x, training=False)
# Classification block
x = layers.GlobalAveragePooling2D(name='avg_pool')(x)
x = layers.Dense(num_classes, activation='softmax', name='predictions')(x)
model = keras.Model(inputs=inputs, outputs=x)
What I don't like about this is that this way I set the whole base model to inference mode, which may put some layers into inference mode that should not be.
Here is the part of the code that loads the weights from the initial training that I did and the code that freezes the first 150 layers and unfreezes the remaining layers of the InceptionV3 part:
model.load_weights(load_model_weight_file_name)

for layer in base_model.layers[:150]:
    layer.trainable = False
for layer in base_model.layers[150:]:
    layer.trainable = True
The rest of my code (not shown here) consists of the usual compile and fit calls.
Running this code seems to result in a network that doesn't really learn (loss and accuracy remain approximately the same). I tried different orders of magnitude for the optimization step size, but that doesn't seem to help.
Another thing I observed is that when I make the whole InceptionV3 part trainable
base_model.trainable = True
the training starts with an accuracy several orders of magnitude smaller than where my first training round finished (and of course a much higher loss). Can someone explain this to me? I would at least expect the training to continue where it left off in terms of accuracy and loss.
You could do something like:
for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
This will iterate over each layer and check its type, setting the layer to inference mode if it is a BatchNorm layer.
As for the low starting accuracy during transfer learning: you're only loading the weights and not the optimiser state (as would happen with a full model load, which restores architecture, weights, optimiser state, etc.).
This doesn't mean there's an error. If you must load weights only, just let it train; the optimiser state will rebuild itself eventually and you should see progress. Also, since you're potentially overwriting the pre-trained weights in your second run, make sure you use a lower learning rate so the updates are small in comparison, i.e. fine-tune the weights rather than blast them to pieces.
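Putting the two suggestions together, a hedged sketch of the fine-tuning setup might look like this (the learning rate and the loss are illustrative assumptions; the loss choice matches the softmax output in the question, and sparse_categorical_crossentropy would be needed for integer labels):
import tensorflow as tf

# Freeze only the BatchNorm layers so their running statistics stay fixed,
# while the surrounding conv layers remain trainable for fine-tuning.
for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False

# Recompile after changing trainable flags, with a small learning rate
# so fine-tuning makes gentle updates to the pre-trained weights.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)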

Can someone help explain the use of keras.backend.learning_phase_scope(1)?

I need help, as I am new to Keras and was reading about dropout and how it can affect loss calculation during the training and validation phases. This is because dropout is only present at training time and not at validation time, so comparing the two losses can be misleading.
The questions are:
What is the use of learning_phase_scope(1)?
How does it impact validation?
What steps should be taken to correct the test loss when dropout is used?
It's not only Dropout but BatchNormalization as well that needs to switch modes, or it will affect validation performance.
If you use Keras and just want to get the validation loss (and/or accuracy or other metrics), you are better off using model.evaluate() or passing validation_data to model.fit(), and not doing anything with learning_phase_scope.
learning_phase_scope(1) means training mode; 0 is for predict/validate.
Personally, I use learning_phase_scope only when I want to train something that doesn't end with a simple model.fit() (e.g. visualizing CNN filters), and I have done so only once in the past 3 years.
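For example, a minimal sketch (the names x_train, y_train, x_val, y_val are placeholders and model is assumed to be an already compiled Keras model; none of this is from the original answer):
# Keras handles the learning phase for you: dropout/BN run in training mode
# inside fit(), and in inference mode for the validation pass and evaluate().
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=10,
    batch_size=32,
)

val_metrics = model.evaluate(x_val, y_val)  # loss (plus compiled metrics), computed in inference mode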

Update targets in classification images

Why are we updating targets in this implementation of a Bayesian CNN with MC dropout?
https://github.com/sungyubkim/MCDO/blob/master/Bayesian_CNN_with_MCDO.ipynb?fbclid=IwAR18IMLcdUUp90TRoYodsJS7GW1smk-KGYovNpojn8LtRhDQckFI_gnpOYc
def update_target(target, original, update_rate):
    for target_param, param in zip(target.parameters(), original.parameters()):
        target_param.data.copy_((1.0 - update_rate) * target_param.data + update_rate * param.data)
The implementation you have referred to is a data-parallel one.
This means the author intends to train multiple networks with the same architecture but different hyper-parameters.
Although in an unconventional way, this is what update_target does:
update_target(net_test, net, 0.001)
It updates net_test at a lower rate than net, but with the exact same parameter changes that are applied to the original net, which is the one actually being trained; only the scale of the change is different.
I am assuming this is found to be useful in terms of computational efficiency, since only one of the networks is actually being "trained" during the main training phase:
outputs = net(inputs)
loss = CE(outputs, labels)
loss.backward()
optimizer.step()
One less forward pass and one less backprop per step.
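For context, here is a hedged sketch of how such an update is typically wired into the training loop; the variable names follow the snippets above, but the loop itself (including train_loader) is an assumption, not copied from the notebook:
for inputs, labels in train_loader:   # train_loader is an assumed DataLoader
    optimizer.zero_grad()
    outputs = net(inputs)             # only net does a forward pass
    loss = CE(outputs, labels)
    loss.backward()
    optimizer.step()

    # Slowly pull net_test towards net: an exponential-moving-average style copy
    # of the parameters, without a second forward/backward pass.
    update_target(net_test, net, 0.001)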

Adding dropout between time-steps in pytorch RNN

I am training built-in PyTorch RNN modules (e.g. torch.nn.LSTM) and would like to add fixed-per-minibatch dropout between each time step (Gal dropout, if I understand correctly).
Most simply, I could unroll the network and compute my forward pass on a single batch, something like this:
dropout = get_fixed_dropout()
for sequence in batch:
    state = initial_state
    for token in sequence:
        state, output = rnn(token, state)
        state, output = dropout(state, output)
        outputs.append(output)
    loss += loss(outputs, sequence)
loss.backward()
optimiser.step()
However, since looping in Python is pretty slow, I would much rather take advantage of PyTorch's ability to process a whole batch of sequences in one call (as in rnn(batch, state) for regular forward computations).
i.e., I would prefer something that looks like this:
rnn_with_dropout = drop_neurons(rnn)
outputs, states = rnn_with_dropout(batch,initial_state)
loss = loss(outputs,batch)
loss.backward()
optimiser.step()
rnn = return_dropped(rnn_with_dropout)
(Note: PyTorch's RNNs do have a dropout parameter, but it applies dropout between layers, not between time steps, so it is not what I want.)
My questions are:
Am I even understanding Gal dropout correctly?
Is an interface like the one I want available?
If not, would it be equivalent to simply implement drop_neurons(rnn) and return_dropped(rnn) functions that zero random rows in the RNN's weight matrices and bias vectors, and then restore their previous values after the update step? (This imposes the same dropout between the layers as between the time steps, i.e. it completely removes some neurons for the whole minibatch, and I'm not sure that doing this is 'correct'.)
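No answer was given here, but as a rough illustration of the "same mask at every time step" idea, here is a minimal sketch of a locked/variational-style dropout module applied to the outputs of a batched LSTM call. This is only one common approximation and an assumption on my part; it does not drop the recurrent connections, which Gal's full method also does:
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Samples one dropout mask per sequence and reuses it at every time step."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x: (batch, seq_len, hidden), as produced by an LSTM with batch_first=True
        if not self.training or self.p == 0.0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p) / (1 - self.p)
        return x * mask  # the same mask is broadcast across the time dimension

rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
locked_drop = LockedDropout(p=0.3)

batch = torch.randn(8, 20, 32)          # (batch, seq_len, features)
outputs, (h, c) = rnn(batch)            # one call processes all time steps
outputs = locked_drop(outputs)          # fixed-per-minibatch mask shared across time steps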

Weird Training Issue with keras - sudden huge drop in loss with zeros in FC layer

I'm getting this odd issue with training a siamese-style CNN with Keras (TensorFlow backend, Ubuntu 14.04, CUDA 8, with cuDNN). In short, the CNN has a shared set of weights that takes in two images, merges their respective FC layers, and then estimates a regression. I'm using MSE loss with the Adam optimizer (with default parameters). I've done this several times with different types of problems and have never seen the following.
Essentially what happens is on the first epoch, everything seems to be training fine, and the loss is decreasing slowly, as expected (ends at around an MSE of ~3.3 using a batch size of 32). The regression is estimating a 9-dimensional continuous-valued vector.
Then, as soon as the second epoch starts, the loss drops DRAMATICALLY (to ~4e-07). You'd think "oh yay, the loss is really small--I win", but when I inspect the trained weights by predicting on novel inputs (I'm using the checkpointer to dump out the best set of weights according to the loss), I get odd behavior. No matter what the inputs are (different images, random noise as inputs, even zeros), I always get the exact same output. Further inspection shows that the weights of the last FC layer in the shared model are all zeros.
If I look at the weights after the first epoch, when everything seems "normal", this doesn't happen--I just don't get optimal results (makes sense--only one epoch has occurred). This only happens with the second epoch and on.
Has anybody ever seen this? Any ideas? You think it's a dumb error on my part, or some weird bug?
More details on my network topology here. Here are the shared weights:
shared_model = Sequential()
shared_model.add(Convolution2D(nb_filter=96, nb_row=9, nb_col=9, activation='relu', subsample=(2,2), input_shape=(3,height,width)))
shared_model.add(MaxPooling2D(pool_size=(2,2)))
shared_model.add(Convolution2D(nb_filter=256, nb_row=3, nb_col=3, activation='relu', subsample=(2,2)))
shared_model.add(MaxPooling2D(pool_size=(2,2)))
shared_model.add(Convolution2D(nb_filter=256, nb_row=3, nb_col=3, activation='relu'))
shared_model.add(MaxPooling2D(pool_size=(2,2)))
shared_model.add(Convolution2D(nb_filter=512, nb_row=3, nb_col=3, activation='relu', subsample=(1,1)))
shared_model.add(Flatten())
shared_model.add(Dense(2048, activation='relu'))
shared_model.add(Dropout(0.5))
Then I merge them for regression as follows:
input_1 = Input(shape=(3,height,width))
input_2 = Input(shape=(3,height,width))
encoded_1 = shared_model(input_1)
encoded_2 = shared_model(input_2)
encoded_merged = merge([encoded_1, encoded_2], mode='concat', concat_axis=-1)
fc_H = Dense(9, activation='linear')
h_loss = fc_H(encoded_merged)
model = Model(input=[input_1, input_2], output=h_loss)
Finally, each epoch trains on about 1,000,000 samples, so there should be plenty of data for training. I've just never seen an FC layer get set to all zeros. And even then, I don't understand how that makes for a very low loss when the training data are not all zeros.
The zeros that are seemingly being predicted by the last layer may be due to the dying ReLU problem. Try LeakyReLU and tweak alpha. This worked for me in eradicating the zeros that I would otherwise get in the very first epoch.
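As a hedged sketch of that suggestion (written against tf.keras rather than the older Keras API used in the question, and with an illustrative alpha value), the ReLU-activated Dense layer in shared_model could be replaced like this:
from tensorflow.keras import layers

# Instead of Dense(2048, activation='relu'), split the activation out so a
# LeakyReLU can be used: negative pre-activations keep a small gradient,
# which prevents units from getting stuck at zero ("dying").
shared_model.add(layers.Dense(2048))
shared_model.add(layers.LeakyReLU(alpha=0.1))
shared_model.add(layers.Dropout(0.5))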

Resources