I am training built-in PyTorch RNN modules (e.g. torch.nn.LSTM) and would like to add fixed-per-minibatch dropout between each time step (Gal dropout, if I understand correctly).
Most simply, I could unroll the network and compute my forward computation on a single batch something like this:
dropout = get_fixed_dropout()
total_loss = 0
for sequence in batch:
    state = initial_state
    outputs = []
    for token in sequence:
        state, output = rnn(token, state)
        state, output = dropout(state, output)
        outputs.append(output)
    total_loss += loss(outputs, sequence)
total_loss.backward()
optimiser.step()
However, assuming that looping in python is pretty slow, I would much rather take advantage of pytorch's ability to fully process a batch of several sequences, all in one call (as in rnn(batch,state) for regular forward computations).
i.e., I would prefer something that looks like this:
rnn_with_dropout = drop_neurons(rnn)
outputs, states = rnn_with_dropout(batch, initial_state)
batch_loss = loss(outputs, batch)
batch_loss.backward()
optimiser.step()
rnn = return_dropped(rnn_with_dropout)
(Note: PyTorch's RNNs do have a dropout parameter, but it applies dropout between layers and not between time steps, so it is not what I want.)
My questions are:
Am I even understanding Gal dropout correctly?
Is an interface like the one I want available?
If not, would simply implementing some drop_neurons(rnn), return_dropped(rnn) functions that zero random rows in the rnn's weight matrices and bias vectors and then return their previous values after the update step be equivalent? (This imposes the same dropout between the layers as in between the steps, i.e. completely removes some neurons for the whole minibatch, and I'm not sure that doing this is 'correct').
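For reference, here is a minimal sketch of what a fixed mask across time steps could look like when the network is unrolled by hand with torch.nn.LSTMCell. The function name run_with_fixed_dropout and its arguments are made up for illustration, and the mask is applied only to the recurrent hidden state, which is a simplification of the scheme Gal describes (it also masks the inputs).

```python
import torch
import torch.nn as nn


def run_with_fixed_dropout(cell, batch, p=0.5, training=True):
    """Unroll an LSTMCell, reusing one dropout mask for every time step.

    batch: tensor of shape (seq_len, batch_size, input_size).
    Returns (stacked hidden states, mask); the mask is returned only for inspection.
    """
    seq_len, batch_size, _ = batch.shape
    h = batch.new_zeros(batch_size, cell.hidden_size)
    c = batch.new_zeros(batch_size, cell.hidden_size)

    if training and p > 0:
        # Sample the mask ONCE per minibatch and rescale so the expected value is unchanged.
        keep = torch.bernoulli(torch.full_like(h, 1 - p))
        mask = keep / (1 - p)
    else:
        mask = torch.ones_like(h)

    outputs = []
    for x_t in batch:                    # iterate over time steps
        h, c = cell(x_t, (h * mask, c))  # same mask on the recurrent input at each step
        outputs.append(h)
    return torch.stack(outputs), mask
```

Note that the mask here has one row per sequence, so each sequence in the minibatch gets its own set of dropped units. Zeroing rows of the weight matrices would instead share a single mask across the whole minibatch.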
Related
A model should be set in the evaluation mode for inference by calling model.eval().
Do we also need to do this during training, before getting the model outputs, for example within a training epoch when the network contains one or more dropout and/or batch-normalization layers?
If this is not done, then the output of the forward pass in the training epoch might be affected by the randomness in the dropout?
Many code examples do not do this, and something along these lines is the common approach:
for t in range(num_epochs):
    # forward pass
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
For example, here is some sample code to look at: convolutional_neural_network/main.py
Should this instead be?
for t in range(num_epochs):
    # forward pass
    model.eval()  # disable dropout etc
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    model.train()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
TLDR:
Should this instead be?
No!
Why?
More explanation:
Different Modules behave differently depending on whether they are in training or evaluation/test mode.
BatchNorm and Dropout are only two examples of such modules; basically, any module that behaves differently during training follows this rule.
When you do .eval(), you are signaling all modules in the model to shift operations accordingly.
Update
The answer is that during training you should not use eval mode. And yes, as long as you have not set eval mode, dropout will be active and will act randomly in each forward pass. Similarly, all other modules that have two phases will behave accordingly. That is, BN will update its running mean/var on every pass, and in training mode it can raise an error when a channel receives only a single value (for example, a batch of size 1 fed to BatchNorm1d), because batch statistics cannot be computed from one value.
As was pointed out in the comments, during training you should not call eval() before the forward pass. Doing so switches every module that behaves differently in train and test mode to its test-time behaviour: Dropout stops dropping features, and BN stops using batch statistics and stops updating its running estimates, so you will not see them contributing to your network's learning. So don't code like that!
Let me explain a bit what happens during training:
The modules that make up your model may have two modes, training and test. These modules either have parameters or statistics that need to be updated during training, like BN, or affect the network topology in a sense, like Dropout (by disabling some features during the forward pass). Some modules, such as ReLU(), operate in only one mode and so do not change when the mode changes.
When you are in training mode and feed in an image, it passes through the layers until it reaches a dropout layer. Here some features are disabled, so their responses are withheld from the next layer. The output then continues through the remaining layers until it reaches the end of the network and you get a prediction.
The network may make correct or wrong predictions, and the weights are updated accordingly. If the answer was right, the features or combinations of features that produced the correct answer are reinforced, and vice versa.
So during training you do not need and should not disable dropout, as it affects the output and should be affecting it so that the model learns a better set of features.
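To see the difference between the two modes directly, here is a small sketch. The toy model and its layer sizes are arbitrary choices for illustration, not taken from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Arbitrary toy model containing a dropout layer.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(16, 4)

model.train()  # training mode: dropout samples a new random mask on every call
train_a, train_b = model(x), model(x)

model.eval()   # evaluation mode: dropout acts as the identity
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

same_in_train = torch.equal(train_a, train_b)  # two passes on the same input differ
same_in_eval = torch.equal(eval_a, eval_b)     # two passes on the same input agree
```

The two training-mode passes differ because of the random masks, and that randomness is exactly what the network is supposed to learn under. Calling eval() is only for validation and inference.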
I hope this makes it a bit clearer for you. If you still feel you need more, say so in the comments.
I am new to modeling with Keras. I am trying to work out appropriate parameters for setting up the model. How do I know when to use a bias and when to turn it off?
The short answer is: always use bias variables when your model is small. Even for larger models, it is still recommended to keep the bias in all neural network architectures.
This is because each neuron behaves like a simple logistic regression. In each neuron, the input values are multiplied by the weights, and the bias shifts the starting level of the sigmoid function, which gives the desired non-linearity.
For example, if your training data contains all-zero inputs like X = [[0,0,...], [0,0,...],... ] with Y = 1, then without a bias the sigmoid output will always be exactly 0.5, since X*W is zero. However, in large networks, each node can make a bias node out of the average activation of all of its inputs.
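The zero-input case is easy to check numerically. This is just a sketch: the weights are random and the bias value of 3.0 is picked purely for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


X = np.zeros((3, 4))                         # three all-zero input rows
W = np.random.default_rng(0).normal(size=4)  # arbitrary weights

no_bias = sigmoid(X @ W)          # X @ W is 0, so the output is stuck at 0.5 whatever W is
with_bias = sigmoid(X @ W + 3.0)  # a bias (3.0, chosen for illustration) can move the output
```

Without the bias term, no amount of training on W can move the output away from 0.5 for these inputs.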
A common task in DL is normalizing input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
import numpy as np

mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = (X - mean) / std  # broadcasts over the rows of X
However, then one must keep the mean and std values around, to normalize the testing data, in addition to the Keras model being trained. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))
m.add(Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid'))
... rest of network
m.add(Dense(1, activation = 'sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
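The "moving average versus simple mean" point can be made concrete with a few lines of NumPy. The per-batch means below are made up, and the momentum of 0.9 is chosen only to make the effect visible (Keras' BatchNormalization defaults to 0.99).

```python
import numpy as np

momentum = 0.9  # assumed for illustration; Keras' BatchNormalization defaults to 0.99
batch_means = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up per-batch means, oldest first

# Moving average as it is updated batch by batch (started at 0 for simplicity).
moving_mean = 0.0
for m in batch_means:
    moving_mean = momentum * moving_mean + (1 - momentum) * m

# The same value written as an explicit weighted sum: newer batches get larger weights.
n = len(batch_means)
ema_weights = (1 - momentum) * momentum ** np.arange(n - 1, -1, -1)
simple_weights = np.full(n, 1 / n)  # a plain mean weights every batch equally
```

Comparing ema_weights with simple_weights shows the difference: the moving average's weights grow towards the most recent batch, while the plain mean gives every batch 1/n.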
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
import keras
from keras.layers.experimental.preprocessing import Normalization

norm_layer = Normalization()
norm_layer.adapt(X)  # computes the mean and variance of X once
model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
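If it helps to see what "adapt once, then apply everywhere" means, here is a plain NumPy stand-in. FixedNormalizer is a hypothetical class written for this illustration, not a Keras API, and the small epsilon is my own assumption to avoid division by zero.

```python
import numpy as np


class FixedNormalizer:
    """Hypothetical stand-in for a normalization layer with frozen statistics."""

    def adapt(self, data):
        # Compute the statistics once, from the data you pass in.
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def __call__(self, data):
        # Apply the stored statistics; nothing is re-estimated here.
        return (data - self.mean) / (self.std + 1e-7)  # epsilon is an assumption


rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(200, 3))  # made-up training data
X_test = rng.normal(5.0, 2.0, size=(50, 3))    # made-up test data

norm = FixedNormalizer().adapt(X_train)
Z_train = norm(X_train)  # zero mean, unit variance by construction
Z_test = norm(X_test)    # scaled with the TRAINING statistics, so only approximately so
```

Because the statistics travel with the object (or, in Keras, with the model), there is nothing extra to keep around for test time.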
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object lets you save the scaling parameters,
and you can then feed mixed types of inputs into your model, let's say:
Your_model
[param1_scaler, param2_scaler]
Here is a link https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
There's BatchNormalization, which learns mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.
I have a set of sentences and their scores, and I would like to train a marking system that can predict the score for a given sentence. One training example looks like this:
(X = Tomorrow is a good day, Y = 0.9)
I would like to use an LSTM to build this marking system and to take the sequential relationship between the words in the sentence into account, so the training example above is transformed as follows:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=day, y4=0.9)
When training this LSTM, I would like the first three time steps to use a softmax classifier and the final step to use MSE. The loss function for this LSTM is therefore composed of two different loss functions, and Keras does not seem to provide a direct way to handle this. In addition, I am not sure whether my approach to building the marking system is correct.
Keras supports multiple loss functions as well:
model = Model(inputs=inputs,
              outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'mse'],
              metrics=['accuracy'], loss_weights=[1., 1.])
Based on your explanation, I think you need a model that first predicts a token based on the previous tokens (in the NLP domain this is usually called a language model) and then computes a score, which I assume is a sentiment (though the same idea applies to other domains).
To do so, you can train your language model with an LSTM and pick the last output of the LSTM for your ranking task. To this end, you need to define two loss functions: categorical_crossentropy for the language model and MSE for the ranking task.
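To make the combined objective concrete, here is a rough NumPy sketch of how two losses with loss weights add up to a single number. The probabilities, targets and the 4-word vocabulary are all invented for the example.

```python
import numpy as np


def categorical_crossentropy(probs, target_idx):
    """Mean negative log-probability assigned to the correct next token."""
    return -np.log(probs[np.arange(len(target_idx)), target_idx]).mean()


def mse(pred, target):
    return np.mean((pred - target) ** 2)


# Three language-model steps over a toy 4-word vocabulary (made-up predictions).
token_probs = np.array([[0.70, 0.10, 0.10, 0.10],
                        [0.10, 0.60, 0.20, 0.10],
                        [0.25, 0.25, 0.25, 0.25]])
next_tokens = np.array([0, 1, 2])       # index of the true next word at each step
score_pred = np.array([0.8])            # score predicted from the last LSTM output
score_true = np.array([0.9])

loss_weights = [1.0, 1.0]
lm_loss = categorical_crossentropy(token_probs, next_tokens)
score_loss = mse(score_pred, score_true)
total_loss = loss_weights[0] * lm_loss + loss_weights[1] * score_loss
```

Changing loss_weights lets you trade off how much the language-model part matters relative to the score.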
This tutorial would be helpful: https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
In Keras you can specify a dropout layer like this:
model.add(Dropout(0.5))
But with a GRU cell you can specify the dropout as a parameter in the constructor:
model.add(GRU(units=512,
              return_sequences=True,
              dropout=0.5,
              input_shape=(None, features_size,)))
What's the difference? Is one preferable to the other?
In Keras' documentation it adds it as a separate dropout layer (see "Sequence classification with LSTM")
A recurrent layer performs the same operation over and over.
In each timestep, it takes two inputs:
Your inputs (a step of your sequence)
Internal inputs (can be states and the output of the previous step, for instance)
Note that the dimensions of the input and output may not match, which means that "your input" dimensions will not match "the recurrent input (previous step/states)" dimensions.
Then in every recurrent timestep there are two operations with two different kernels:
One kernel is applied to "your inputs" to process and transform it in a compatible dimension
Another (called recurrent kernel by keras) is applied to the inputs of the previous step.
Because of this, keras also uses two dropout operations in the recurrent layers. (Dropouts that will be applied to every step)
A dropout for the first conversion of your inputs
A dropout for the application of the recurrent kernel
So, in fact there are two dropout parameters in RNN layers:
dropout, applied to the first operation on the inputs
recurrent_dropout, applied to the other operation on the recurrent inputs (previous output and/or states)
You can see this description coded in both GRUCell and LSTMCell, for instance, in the source code.
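As a rough picture of the two kernels and the two dropouts, here is a simplified SimpleRNN-style step in NumPy. This is a sketch and not the Keras source: the sizes are arbitrary, and both masks are sampled once and reused at every step of the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units, rate = 3, 5, 0.5  # arbitrary sizes and dropout rate

kernel = rng.normal(size=(input_dim, units))        # applied to "your inputs"
recurrent_kernel = rng.normal(size=(units, units))  # applied to the previous output

# One mask per dropout parameter, rescaled so the expected value is unchanged.
input_mask = (rng.random(input_dim) >= rate) / (1 - rate)  # role of `dropout`
recurrent_mask = (rng.random(units) >= rate) / (1 - rate)  # role of `recurrent_dropout`


def rnn_step(x_t, h_prev):
    """One simplified recurrent step with both dropouts applied."""
    return np.tanh((x_t * input_mask) @ kernel
                   + (h_prev * recurrent_mask) @ recurrent_kernel)


sequence = rng.normal(size=(4, input_dim))  # 4 time steps of made-up input
h = np.zeros(units)
for x_t in sequence:
    h = rnn_step(x_t, h)
```

Each mask acts on a different input of the step: input_mask on the features of the current timestep, recurrent_mask on what is carried over from the previous one.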
What is correct?
This is open to creativity.
You can use a Dropout(...) layer, it's not "wrong", but it will possibly drop "timesteps" too! (Unless you set noise_shape properly or use SpatialDropout1D, which was not yet documented at the time of writing.)
Maybe you want that, maybe you don't. If you use the parameters in the recurrent layer, you will be applying dropout only to the other dimensions, without dropping a single step. This seems healthy for recurrent layers, unless you want your network to learn how to deal with sequences containing gaps (this last sentence is a supposition).
Also, with the dropout parameters, you will be really dropping parts of the kernel as the operations are dropped "in every step", while using a separate layer will let your RNN perform non-dropped operations internally, since your dropout will affect only the final output.
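The difference between an element-wise mask and a mask shared over time can be sketched with NumPy. The shapes and rate are arbitrary; the shared mask mirrors what a noise_shape of (batch_size, 1, features) requests from a Dropout layer.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, rate = 6, 4, 0.5  # arbitrary sizes for one sequence
x = np.ones((timesteps, features))

# Independent mask per element: different features vanish at different steps.
elementwise_mask = rng.random((timesteps, features)) >= rate

# One mask per feature, broadcast over time: the same features vanish at every step.
shared_mask = rng.random((1, features)) >= rate

dropped_elementwise = x * elementwise_mask / (1 - rate)
dropped_shared = x * shared_mask / (1 - rate)
```

Printing the two results side by side shows the pattern: dropped_shared has identical rows, while dropped_elementwise zeroes a different set of entries in each row.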