Where to implement Layer normalization? - keras

I am trying to implement layer normalization in my LSTM model, but I am unsure of how many LayerNormalization layers I need and where exactly to place them:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, LayerNormalization, Dense

def build_model():
    model = Sequential()
    layers = [100, 200, 2]
    # first bidirectional LSTM, returning the full sequence
    model.add(Bidirectional(LSTM(
        layers[0],
        input_shape=(timestep, feature),
        dropout=0.4,
        recurrent_dropout=0.4,
        return_sequences=True)))
    model.add(LayerNormalization())
    # second bidirectional LSTM, returning only the last output
    model.add(Bidirectional(LSTM(
        layers[1],
        dropout=0.4,
        recurrent_dropout=0.4,
        return_sequences=False)))
    model.add(LayerNormalization())
    model.add(Dense(layers[2]))
    return model

Normalization layers apply their normalization to the output of the previous layer, so a normalization layer should be placed right after the layer whose output you want normalized.
Usually all layers are normalized except the output layer. The configuration you are showing in your question already does this, so it can be considered good practice.
In general you do not have to normalize every layer; which layers to normalize is a matter of experimentation (trial and error).

Related

What do units and layers do in neural network?

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    # the linear output layer
    layers.Dense(units=1),
])
The above is a Keras Sequential model example from Kaggle. I'm having a problem understanding two things.
Are the units the number of nodes in a hidden layer? I see some people use 250 or whatever. What happens when that number is set higher or lower?
Why would another hidden layer need to be added? What does adding more and more layers actually do to the data?
Answers in brief
units is the number of neurons in a particular layer. The higher the number, the more parameters the model has to update during learning. The same goes for layers: more layers and more neurons take more time to train the model. How many neurons to choose depends on the use case, the dataset, and the model architecture.
When you have more hidden layers, you have more parameters to update. More parameters and layers mean the model can capture more complex relationships hidden in the data. For example, in (multi-class) image classification you need deeper layers with more neurons to extract the features of the image that the final layer uses for classification.
Play with the TensorFlow Playground; it gives a great intuition for what happens when you change the number of layers and neurons.
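As a rough illustration of the point above (the layer sizes here are arbitrary choices, not a recommendation), comparing a narrow model with a wider, deeper one for the same two-feature input shows what changing units and adding layers costs in parameters and training time:
from tensorflow import keras
from tensorflow.keras import layers

# two hypothetical configurations for the same two-feature input
narrow = keras.Sequential([
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=1),
])
wider_deeper = keras.Sequential([
    layers.Dense(units=250, activation='relu', input_shape=[2]),  # more units per layer
    layers.Dense(units=250, activation='relu'),                   # an extra hidden layer
    layers.Dense(units=1),
])

narrow.summary()        # only a handful of trainable parameters
wider_deeper.summary()  # tens of thousands of parameters: more capacity, more training time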

Dropout & batch normalization - does the ordering of layers matter?

I was building a neural network model, and my question is whether the ordering of the dropout and batch normalization layers actually affects the model.
Will putting the dropout layer before the batch normalization layer (or vice versa) make any difference to the output of the model if I am using ROC-AUC as my metric?
I expect the output to have a large ROC-AUC score and want to know whether it will be affected in any way by the ordering of the layers.
The order of the layers affects the convergence of your model and hence your results. Based on the Batch Normalization paper, the author suggests that Batch Normalization should be applied before the activation function, while Dropout is applied after computing the activations. The right order of layers is therefore:
Dense or Conv
Batch Normalization
Activation
Dropout
In Keras code, here is how you write it sequentially:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation, Dropout

model = Sequential()
model.add(Dense(n_neurons, input_shape=your_input_shape, use_bias=False))  # it is important to disable the bias when using Batch Normalization
model.add(BatchNormalization())
model.add(Activation('relu'))  # for example
model.add(Dropout(rate=0.25))
Batch Normalization helps to avoid vanishing/exploding gradients when training your model, so it is especially important if you have many layers. You can read the Batch Normalization paper for more details.

Stacked LSTM with Multiple Dense Layers After

Andrew Ng talks about deep RNN architectures built by stacking recurrent layers on top of each other. However, he notes that these are usually limited to 2 or 3 recurrent layers due to the already complex time-dependent calculations in the structure. He does add that people commonly add "a bunch of deep layers that are not connected horizontally" after these recurrent layers (shown as blue boxes that extend from a[3]<1>). I am wondering whether he is simply talking about stacking Dense layers on top of the recurrent layers, or whether it is something more complicated. Something like this in Keras:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(keras.layers.LSTM(100, return_sequences=True, batch_input_shape=(32, 1, input_shape), stateful=True))
model.add(keras.layers.LSTM(100, return_sequences=False, stateful=True))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
In most cases, yes, the common structure of an RNN after the hidden state includes only dense layers.
However, this can take many forms, such as a dense layer followed by a softmax layer when predicting the next word of a vocabulary in natural language processing (NLP) or language-modelling applications.
Alternatively, for multi-objective prediction, it may be the case that multiple separate dense layers are required to generate distinct outputs, such as the value and policy heads in reinforcement learning.
Finally, deep LSTMs can be used as encoders that are part of a larger model which does not necessarily have to include only sequence data. For instance, you could diagnose patients with a model that encodes textual notes with an LSTM and encodes images with a CNN, before passing the combined embeddings through final dense layers.
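As a rough sketch of the multi-head idea mentioned above (all layer sizes, shapes, and names here are invented for illustration), a shared stacked-LSTM encoder can feed two separate dense heads via the Keras functional API:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense

# hypothetical input: sequences of 20 time steps with 8 features each
obs = Input(shape=(20, 8), name='observation_sequence')
x = LSTM(64, return_sequences=True)(obs)
x = LSTM(64)(x)  # last hidden state of the stacked LSTM encoder

# two separate dense 'heads' branching off the shared encoder
value = Dense(1, name='value_head')(Dense(32, activation='relu')(x))
policy = Dense(4, activation='softmax', name='policy_head')(Dense(32, activation='relu')(x))

model = Model(inputs=obs, outputs=[value, policy])
model.summary()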

Why return sequences in stacked RNNs?

When stacking RNNs in Keras, it is mandatory to set the return_sequences parameter to True.
For instance in Keras,
lstm1 = LSTM(1, return_sequences=True)(inputs1)
lstm2 = LSTM(1)(lstm1)
It is somewhat intuitive that the dimensionality of the input space should be preserved for each stacked RNN layer; however, I am not thoroughly convinced.
Can someone (mathematically) explain the reason?
Thanks.
The input shape for recurrent layers is:
(number_of_sequences, time_steps, input_features)
This is absolutely required for recurrent layers, because there can only be recurrence if there are time steps.
Now, compare the outputs of the recurrent layers in each case:
with return_sequences=True: (number_of_sequences, time_steps, output_features)
with return_sequences=False: (number_of_sequences, output_features)
Without return_sequences=True you eliminate the time steps, so the output cannot be fed into another recurrent layer: it does not have enough dimensions, and the most important one, time_steps, is missing.
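A small sketch of the shapes described above (the sizes 10 time steps, 3 input features, and 4 units are arbitrary choices for illustration):
from tensorflow.keras.layers import Input, LSTM

inputs1 = Input(shape=(10, 3))                       # (time_steps, input_features)
seq_out = LSTM(4, return_sequences=True)(inputs1)    # keeps the time dimension
last_out = LSTM(4)(inputs1)                          # drops the time dimension

print(seq_out.shape)   # (None, 10, 4) -> can be fed to another recurrent layer
print(last_out.shape)  # (None, 4)     -> time_steps dimension is gone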

Does the GaussianDropout Layer in Keras retain probability like the Dropout Layer?

I was wondering whether the GaussianDropout layer in Keras retains probability like the Dropout layer does.
The Dropout layer is implemented as inverted dropout, which retains probability.
If you aren't aware of the problem, you may have a look at the discussion, and specifically at linxihui's answer.
The crucial point which makes the Dropout layer retain the probability is the call to K.dropout, which is not made by the GaussianDropout layer.
Is there any reason why the GaussianDropout layer does not retain probability?
Or does it retain it in another way that I am not seeing?
Similar (referring to the Dropout layer): Is the Keras implementation of dropout correct?
Gaussian dropout doesn't need to retain the probability, since it originates from applying the Central Limit Theorem to inverted dropout. The details can be found in the 2nd and 3rd paragraphs of the Multiplicative Gaussian Noise chapter of the dropout paper (chapter 10, p. 1951).
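As a quick numerical check of this point (the rate of 0.3 and the tensor of ones are arbitrary choices), both layers keep the expected activation unchanged during training: Dropout by rescaling the surviving units by 1/(1 - rate), GaussianDropout by multiplying with noise whose mean is 1:
import tensorflow as tf

x = tf.ones((1, 100000))  # many units so the sample mean is stable

drop = tf.keras.layers.Dropout(0.3)
gauss = tf.keras.layers.GaussianDropout(0.3)

# training=True activates the noise in both layers
print(float(tf.reduce_mean(drop(x, training=True))))   # ~1.0: inverted dropout rescales by 1/(1 - rate)
print(float(tf.reduce_mean(gauss(x, training=True))))  # ~1.0: Gaussian noise has mean 1, no rescaling needed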
