Using Dropout with Keras and LSTM/GRU cell - keras

In Keras you can specify a dropout layer like this:
model.add(Dropout(0.5))
But with a GRU cell you can specify the dropout as a parameter in the constructor:
model.add(GRU(units=512,
              return_sequences=True,
              dropout=0.5,
              input_shape=(None, features_size,)))
What's the difference? Is one preferable to the other?
The Keras documentation adds it as a separate Dropout layer (see "Sequence classification with LSTM").

The recurrent layers perform the same repeated operation over and over.
At each timestep, they take two inputs:
Your inputs (a step of your sequence)
Internal inputs (can be states and the output of the previous step, for instance)
Note that the dimensions of the input and output may not match, which means that "your input" dimensions will not match "the recurrent input (previous step/states)" dimensions.
Then in every recurrent timestep there are two operations with two different kernels:
One kernel is applied to "your inputs" to process and transform them into a compatible dimension
Another (called the recurrent kernel by Keras) is applied to the inputs of the previous step.
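As a rough illustration of those two kernels, here is a minimal NumPy sketch of one recurrent step (the shapes and the tanh nonlinearity are illustrative only, not Keras' actual implementation):

import numpy as np

features_size, units = 8, 16
kernel = np.random.randn(features_size, units)    # applied to "your inputs"
recurrent_kernel = np.random.randn(units, units)  # applied to the previous step's output/state

def step(x_t, h_prev):
    # x_t: (features_size,) one step of your sequence
    # h_prev: (units,) the recurrent input from the previous step
    return np.tanh(x_t @ kernel + h_prev @ recurrent_kernel)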
Because of this, Keras also uses two dropout operations in the recurrent layers (dropouts that will be applied at every step):
A dropout for the first transformation of your inputs
A dropout for the application of the recurrent kernel
So, in fact there are two dropout parameters in RNN layers:
dropout, applied to the first operation on the inputs
recurrent_dropout, applied to the other operation on the recurrent inputs (previous output and/or states)
You can see this description coded in GRUCell and in LSTMCell, for instance, in the source code.
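For example, both parameters can be set directly on the layer (mirroring the snippet from the question; the rates are arbitrary):

model.add(GRU(units=512,
              return_sequences=True,
              dropout=0.5,            # applied to the transformation of your inputs
              recurrent_dropout=0.3,  # applied to the recurrent transformation
              input_shape=(None, features_size,)))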
What is correct?
This is open to creativity.
You can use a Dropout(...) layer; it's not "wrong", but it will possibly drop "timesteps" too! (Unless you set noise_shape properly or use SpatialDropout1D, which is not yet documented.)
Maybe you want that, maybe you don't. If you use the parameters in the recurrent layer, you will be applying dropout only to the other dimensions, without dropping a single step. This seems healthy for recurrent layers, unless you want your network to learn how to deal with sequences containing gaps (this last sentence is a guess).
Also, with the dropout parameters, you will really be dropping parts of the kernel, as the operations are dropped "in every step", while using a separate layer will let your RNN perform non-dropped operations internally, since the dropout will affect only the final output.
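To make the alternatives concrete, here is a minimal tf.keras sketch; features_size and all rates are arbitrary placeholders, and the three layers are stacked only for brevity (normally you would pick one approach). For a (batch, timesteps, features) input, a noise_shape with 1 on the time axis reuses the same mask at every timestep, so no step is dropped entirely:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dropout, SpatialDropout1D, GRU

features_size = 42  # placeholder
model = Sequential()
# Dropout layer with noise_shape: one mask per feature, shared across timesteps
model.add(Dropout(0.5, noise_shape=(None, 1, features_size),
                  input_shape=(None, features_size)))
# SpatialDropout1D: drops entire feature channels, never timesteps
model.add(SpatialDropout1D(0.5))
# In-layer parameters: dropout applied inside every recurrent step
model.add(GRU(512, dropout=0.5, recurrent_dropout=0.3, return_sequences=True))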

Related

TensorFlow 2.4.0 - Parameters associated with BatchNorm and Activation

I am printing the summary of a tensorflow.keras.Model instance. Its type is tensorflow.python.keras.engine.functional.Functional.
This model has layers with activations and batch normalization associated. When I print the list of parameters, I see
weights
bias
4 items co-dimensional with the bias
These four items are (I guess) the batch normalization and activations.
My question is: why do we have parameters associated with batch normalization and activations? And what could be the other two items?
My aim is to transpose this Keras model to a PyTorch counterpart, so I need to know the order of the parameters and what they represent.
There are no parameters associated with activations; those are simply element-wise nonlinear functions, so no matter how many activations you have, they don't add to the parameter count. Your guess about batch normalization is right, though: each BatchNorm layer stores four per-channel arrays, two learnable parameters (gamma and beta) and two non-learnable running statistics (the moving mean and moving variance). Those are exactly the four bias-sized items you see, so the BatchNorm layers do add additional parameters to your network.
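A quick way to see this is to list a model's weights by name (a minimal tf.keras sketch; the layer sizes are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, input_shape=(8,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
])
for w in model.weights:
    print(w.name, w.shape)
# prints the Dense kernel and bias, then gamma, beta, moving_mean and
# moving_variance, all four co-dimensional with the Dense bias

For the PyTorch port: gamma and beta map to a BatchNorm module's weight and bias, and the moving statistics map to its running_mean and running_var buffers.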

Dropout, Regularization and batch normalization

I have a couple of questions about LSTM layers in the Keras library.
In the LSTM layer we have two kinds of dropout: dropout and recurrent_dropout. According to my understanding, the first one randomly drops some features from the input (sets them to zero) and the second one does the same to the hidden units (the features of h_t). Since we have different time steps in an LSTM network, is dropping applied separately to each time step, or only once with the same mask for every step?
My second question is about regularizers in the LSTM layer in Keras. I know that, for example, the kernel regularizer regularizes the weights corresponding to the inputs, but we have different weights for the inputs.
For example, the input gate, update gate and output gate use different weights for the input (and also different weights for h_(t-1)). So will they all be regularized at the same time? What if I want to regularize only one of them, for example only the weights used in the formula for the input gate?
The last question is about activation functions in Keras. In the LSTM layer I have activation and recurrent_activation. activation is tanh by default, and I know that in the LSTM architecture tanh is used twice (for h_t and the memory-cell candidate) while sigmoid is used three times (for the gates). So does that mean that if I change tanh (in the LSTM layer in Keras) to another function, say ReLU, it will change for both h_t and the memory-cell candidate?
It would be perfect if any of those questions could be answered. Thank you very much for your time.
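For reference, here is where the parameters those questions refer to plug in (a hedged sketch; the sizes, rates and l2 factors are arbitrary, and note that Keras stores the input weights of all four gates in one concatenated kernel, so kernel_regularizer covers them all at once):

from tensorflow.keras.layers import LSTM
from tensorflow.keras.regularizers import l2

layer = LSTM(64,
             dropout=0.2,                     # on the input features
             recurrent_dropout=0.2,           # on the hidden units
             kernel_regularizer=l2(1e-4),     # input weights of all gates together
             recurrent_regularizer=l2(1e-4),  # recurrent weights of all gates together
             activation="tanh",               # used for h_t and the cell candidate
             recurrent_activation="sigmoid")  # used for the gates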

Few questions about Keras documentation

In the Keras documentation file activations.md, it says: "Activations can either be used through an Activation layer, or through the activation argument supported by all forward layers." What is the meaning of forward layers here? I think some layers don't have an activation parameter (e.g. the Dropout layer).
It also says: "Activations that are more complex than a simple TensorFlow/Theano/CNTK function (eg. learnable activations, which maintain a state) are available as Advanced Activation layers, and can be found in the module keras.layers.advanced_activations. These include PReLU and LeakyReLU." What is the meaning of state in this case?
I am not sure there is a strict definition of "forward layers" in this context, but basically what it means is that the "classic", Keras-built-in types of layers, comprising one or more sets of weights used to transform an input matrix into an output one, have an activation argument. Typically, Dense layers have one, as do the various kinds of RNN and CNN layers.
It would not make sense for Dropout layers to have an activation function: they simply add a mechanism, triggered at training time, to (hopefully) improve the convergence rate and decrease the chance of overfitting.
As for the idea of "maintaining a state", it refers to activation functions that do not behave independently on each and every fed-in sample, but instead retain some learnable information (the so-called state). Typically, for a PReLU activation, the leak coefficient is adjusted through training (and it would, in the documentation's terminology, be referred to as a state of this activation function); a plain LeakyReLU, by contrast, keeps its leak coefficient fixed.
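Both usage styles, plus a learnable activation, look like this (a minimal tf.keras sketch; the shapes are arbitrary):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, PReLU

model = Sequential([
    Dense(32, activation="relu", input_shape=(16,)),  # via the activation argument
    Dense(32),
    Activation("relu"),  # via a separate Activation layer
    Dense(32),
    PReLU(),             # learnable activation: its alpha is trained (the "state")
])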

I need to understand this LSTM and Masking layers result

I'm new to Keras LSTMs. Could you please explain this model.summary() from Rasa Core training?
[figure: model.summary() output after training]
Also, what is the Masking layer doing and what does the value -1 in it mean?
A Masking layer is meant to "ignore steps" in sequences.
Your LSTM is working with sequences of 5 steps and 42 features per step.
If all features in a step have the same value defined in Masking (-1 in the example), that step will be ignored during training.
The idea is to simulate variable-length sequences.
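A minimal sketch of such a setup (tf.keras; the LSTM and Dense sizes are arbitrary, while the mask value, the 5 steps and the 42 features come from the summary above):

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential([
    Masking(mask_value=-1.0, input_shape=(5, 42)),  # skip steps whose features are all -1
    LSTM(32),
    Dense(1),
])
model.summary()

x = np.random.randn(1, 5, 42).astype("float32")
x[0, 3:] = -1.0  # the last two steps act as padding and are ignored
model.predict(x)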
Not sure exactly what you don't understand, but model.summary() prints a summary representation of your model (keras.io).
It lists all the layers used in the given model with their respective sizes.
This particular model evidently starts with a Masking layer for the input sequences (I guess because of padding) and is followed by the simplest LSTM model possible.

TimeDistributed Layers vs. ConvLSTM-2D

Could anyone explain to me the differences between TimeDistributed layers (the Keras wrapper) and ConvLSTM2D (convolutional LSTM): their purposes, usage, etc.?
Both apply to sequences of data.
TimeDistributed is a very straightforward layer wrapper that simply applies a layer (usually a Dense layer) at each time point. You need it when you want to change the shape of the output tensor, especially the feature dimension, rather than the sample size or the number of time steps.
ConvLSTM2D is much more complex. You need to understand CNN and RNN layers first, LSTM being one of the most popular RNNs. An LSTM by itself is applied to a sequence of tensors, as used for NLP and time series, where the input at each time step is 1-dimensional. A CNN, the convolutional part, is usually used to learn from images, which are 2-dimensional but have no sequence (time step) dimension. Combined together, a ConvLSTM is used to learn from images in a sequence, such as video.
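Side by side, on toy shapes (a hedged sketch; all sizes are arbitrary):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import TimeDistributed, Dense, ConvLSTM2D

# TimeDistributed: the same Dense is applied independently at every time step,
# (batch, 10, 16) -> (batch, 10, 8); only the feature dimension changes
td = Sequential([TimeDistributed(Dense(8), input_shape=(10, 16))])

# ConvLSTM2D: convolution inside a recurrent cell, for sequences of images,
# (batch, 10, 64, 64, 3) -> (batch, 62, 62, 4) with the default return_sequences=False
cl = Sequential([ConvLSTM2D(filters=4, kernel_size=3, input_shape=(10, 64, 64, 3))])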
