Keras: the difference between LSTM dropout and LSTM recurrent dropout - keras

From the Keras documentation:
dropout: Float between 0 and 1. Fraction of the units to drop for the
linear transformation of the inputs.
recurrent_dropout: Float between 0 and 1. Fraction of the units to
drop for the linear transformation of the recurrent state.
Can anyone point to where on the image below each dropout happens?

I suggest taking a look at (the first part of) this paper. Regular dropout is applied on the inputs and/or the outputs, meaning the vertical arrows from x_t and to h_t. In your case, if you add it as an argument to your layer, it will mask the inputs; you can add a Dropout layer after your recurrent layer to mask the outputs as well. Recurrent dropout masks (or "drops") the connections between the recurrent units; that would be the horizontal arrows in your picture.
This picture is taken from the paper above. On the left, regular dropout on inputs and outputs. On the right, regular dropout PLUS recurrent dropout:
(Ignore the colour of the arrows in this case; in the paper they are making a further point of keeping the same dropout masks at each timestep)
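For concreteness, here is a minimal Keras sketch (the layer sizes, dropout rates and input shape are arbitrary placeholders, not values from the question) showing where each option acts:

from tensorflow.keras import layers, models

model = models.Sequential([
    # dropout masks the input connections (the vertical arrows from x_t),
    # recurrent_dropout masks the recurrent connections (the horizontal
    # arrows between time steps).
    layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2,
                input_shape=(None, 10)),
    # A separate Dropout layer masks the layer's outputs (the arrows to h_t).
    layers.Dropout(0.2),
    layers.Dense(1),
])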

The answer above highlights one of the recurrent dropout methods, but that one is NOT the one used by TensorFlow and Keras (see the TensorFlow docs).
Keras/TF follows the recurrent dropout method proposed by Semeniuta et al. Also, see the image below comparing different recurrent dropout methods: the Gal and Ghahramani method mentioned in the answer above is in the second position, and the Semeniuta method is the rightmost one.

Related

PyTorch: Difference between using GAT dropout and torch.nn.functional.dropout layer?

I was looking at the PyTorch Geometric documentation for the Graph Attention Network layer: here (GATConv)
Question: What is the difference between using the dropout parameter in the GATConv layer compared with including a dropout via torch.nn.functional.dropout? Are these different hyperparameters?
My attempt:
From the definitions below, they seem to be referring to different things:
The dropout from torch.nn.functional is defined as: "During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution."
The dropout from the GATconv is defined as: "Dropout probability of the normalized attention coefficients which exposes each node to a stochastically sampled neighborhood during training."
So would the GAT dropout need to be a different hyperparameter in a Grid Search cross validation?
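Not an authoritative answer, but a minimal PyTorch Geometric sketch (layer sizes, dropout rates and the class name GAT are placeholders) may make the distinction concrete: the GATConv argument drops attention coefficients, while torch.nn.functional.dropout drops node features, which suggests they would indeed be separate hyperparameters to tune:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        # dropout here is applied to the normalized attention coefficients:
        # each node attends to a stochastically sampled subset of its neighbours.
        self.conv1 = GATConv(in_dim, hidden_dim, heads=4, dropout=0.5)
        self.conv2 = GATConv(hidden_dim * 4, out_dim, heads=1, dropout=0.5)

    def forward(self, x, edge_index):
        # F.dropout here zeroes elements of the node-feature tensor itself,
        # which is a different regularizer with its own rate.
        x = F.dropout(x, p=0.5, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)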

Dropout, Regularization and batch normalization

I have a couple of questions about LSTM layers in the Keras library.
In the LSTM layer we have two kinds of dropout: dropout and recurrent_dropout. According to my understanding, the first one randomly drops some features from the input (sets them to zero) and the second one does the same on the hidden units (the features of h_t). Since we have different time steps in an LSTM network, is dropout applied separately at each time step, or is it applied only once and kept the same for every step?
My second question is about regularizers in the LSTM layer in Keras. I know that, for example, the kernel regularizer will regularize the weights corresponding to the inputs, but we have several sets of weights for the inputs.
For example, the input gate, the update gate and the output gate each use their own weights for the input (and also their own weights for h_(t-1)). So will they all be regularized at the same time? What if I want to regularize only one of them, for example only the weights used in the formula for the input gate?
The last question is about activation functions in Keras. In the LSTM layer there are activation and recurrent_activation arguments. activation is tanh by default, and I know that in the LSTM architecture tanh is used twice (for h_t and for the memory cell candidate) while sigmoid is used three times (for the gates). So does that mean that if I change tanh (in the Keras LSTM layer) to another function, say ReLU, it will change for both h_t and the memory cell candidate?
It would be perfect if any of these questions could be answered. Thank you very much for your time.
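To make the arguments referred to above concrete, here is a minimal Keras sketch (the size, rates and regularizer strengths are placeholders). As far as I understand, kernel_regularizer acts on the single kernel matrix that stacks the input weights of all four gates, so it covers them jointly rather than per gate:

from tensorflow.keras import layers, regularizers

lstm = layers.LSTM(
    32,
    dropout=0.2,                                   # mask on the inputs x_t
    recurrent_dropout=0.2,                         # mask on the recurrent state h_(t-1)
    kernel_regularizer=regularizers.l2(1e-4),      # penalizes the input weights (all gates stacked in one kernel)
    recurrent_regularizer=regularizers.l2(1e-4),   # penalizes the recurrent weights (also stacked per gate)
    activation='tanh',                             # used for the cell candidate and for h_t
    recurrent_activation='sigmoid',                # used for the input/forget/output gates
)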

Using Dropout with Keras and LSTM/GRU cell

In Keras you can specify a dropout layer like this:
model.add(Dropout(0.5))
But with a GRU cell you can specify the dropout as a parameter in the constructor:
model.add(GRU(units=512,
              return_sequences=True,
              dropout=0.5,
              input_shape=(None, features_size,)))
What's the difference? Is one preferable to the other?
In the Keras documentation, it is added as a separate Dropout layer (see "Sequence classification with LSTM").
The recurrent layers perform the same repeated operation over and over.
In each timestep, it takes two inputs:
Your inputs (a step of your sequence)
Internal inputs (can be states and the output of the previous step, for instance)
Note that the dimensions of the input and the output may not match, which means that "your input" dimensions will not match "the recurrent input (previous step/states)" dimensions.
Then in every recurrent timestep there are two operations with two different kernels:
One kernel is applied to "your inputs" to process and transform them into a compatible dimension
Another (called the recurrent kernel by Keras) is applied to the inputs of the previous step.
Because of this, Keras also uses two dropout operations in the recurrent layers (dropouts that are applied at every step):
A dropout for the first conversion of your inputs
A dropout for the application of the recurrent kernel
So, in fact there are two dropout parameters in RNN layers:
dropout, applied to the first operation on the inputs
recurrent_dropout, applied to the other operation on the recurrent inputs (previous output and/or states)
You can see this coded in both GRUCell and LSTMCell, for instance, in the source code.
What is correct?
This is open to creativity.
You can use a Dropout(...) layer; it's not "wrong", but it will possibly drop "timesteps" too! (Unless you set noise_shape properly or use SpatialDropout1D, which is not yet documented.)
Maybe you want that, maybe you don't. If you use the parameters in the recurrent layer, you will be applying dropout only to the other dimensions, without dropping a single step. This seems healthy for recurrent layers, unless you want your network to learn how to deal with sequences containing gaps (this last sentence is a supposition).
Also, with the dropout parameters you will really be dropping parts of the kernel operations, since they are dropped "at every step", while using a separate layer will let your RNN perform non-dropped operations internally, because your dropout will affect only the final output.
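A minimal Keras sketch of the two approaches described above (the sizes and rates are placeholders; SpatialDropout1D is one way to avoid dropping along the time axis):

from tensorflow.keras import layers, models

model = models.Sequential([
    # Option 1: dropout inside the recurrent layer, applied at every step.
    layers.GRU(512, return_sequences=True,
               dropout=0.5,            # on the input kernel operation
               recurrent_dropout=0.5,  # on the recurrent kernel operation
               input_shape=(None, 100)),
    # Option 2: dropout on the output sequence. A plain Dropout would zero
    # individual elements anywhere, including along the time axis;
    # SpatialDropout1D drops whole feature channels for all time steps instead.
    layers.SpatialDropout1D(0.5),
    layers.GRU(256),
    layers.Dense(1),
])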

Why return sequences in stacked RNNs?

When stacking RNNs in Keras, it is mandatory to set the return_sequences parameter to True.
For instance in Keras,
lstm1 = LSTM(1, return_sequences=True)(inputs1)
lstm2 = LSTM(1)(lstm1)
It is somewhat intuitive to preserve the dimensionality of the input space for each stacked RNN layer; however, I am not thoroughly convinced.
Can someone (mathematically) explain the reason?
Thanks.
The input shape for recurrent layers is:
(number_of_sequences, time_steps, input_features).
This is absolutely required for recurrent layers, because there can only be recurrence if there are time steps.
Now, compare the "outputs" of the recurrent layers in each case:
with return_sequences=True - (number_of_sequences, time_steps, output_features)
with return_sequences=False - (number_of_sequences, output_features)
Without return_sequences=True, you eliminate the time steps, so the output cannot be fed into another recurrent layer: there aren't enough dimensions, and the most important one, time_steps, is not present.
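A small sketch makes the shapes visible (the sizes are placeholders, following the snippet in the question):

from tensorflow.keras import Input, layers

inputs1 = Input(shape=(20, 8))                  # (time_steps=20, input_features=8)
lstm1 = layers.LSTM(1, return_sequences=True)(inputs1)
print(lstm1.shape)                              # (None, 20, 1) -> time steps preserved
lstm2 = layers.LSTM(1)(lstm1)
print(lstm2.shape)                              # (None, 1)     -> time steps gone, cannot feed another RNN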

Does the GaussianDropout Layer in Keras retain probability like the Dropout Layer?

I was wondering, if the GaussianDropout Layer in Keras retains probability like the Dropout Layer.
The Dropout Layer is implemented as an Inverted Dropout which retains probability.
If you aren't aware of the problem, you may have a look at the discussion and specifically at linxihui's answer.
The crucial point that makes the Dropout layer retain the probability is the call to K.dropout, which is not made by a GaussianDropout layer.
Is there any reason why the GaussianDropout layer does not retain probability?
Or does it retain it in another, less visible way?
Similar - referring to the Dropout Layer: Is the Keras implementation of dropout correct?
So Gaussian dropout doesn't need to retain probability, since it originates from applying the Central Limit Theorem to inverted dropout. The details can be found in the 2nd and 3rd paragraphs of the "Multiplicative Gaussian Noise" chapter (chapter 10, p. 1951).
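A small numerical sketch of the point (pure NumPy, using the standard deviation sqrt(rate / (1 - rate)) that, to my knowledge, Keras uses for GaussianDropout): inverted dropout has to rescale the survivors to keep the expected activation unchanged, while the Gaussian noise already has mean 1, so no rescaling is needed:

import numpy as np

rate, n = 0.5, 1_000_000
x = np.ones(n)

# Inverted (standard) dropout: zero with probability `rate`,
# then scale the survivors by 1 / (1 - rate) so the expectation is unchanged.
mask = (np.random.rand(n) >= rate) / (1.0 - rate)
print((x * mask).mean())   # ~1.0

# Gaussian dropout: multiply by noise with mean 1 and std sqrt(rate / (1 - rate)).
# The mean is already 1, so there is nothing to "retain" by rescaling.
noise = np.random.normal(1.0, np.sqrt(rate / (1.0 - rate)), n)
print((x * noise).mean())  # ~1.0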
