Dropout & batch normalization - does the ordering of layers matter?

I was building a neural network model and my question is whether the ordering of the dropout and batch normalization layers actually affects the model.
Will putting the dropout layer before the batch normalization layer (or vice versa) make any difference to the output of the model if I am using ROC-AUC score as my metric of measurement?
I expect the output to have a high ROC-AUC score and want to know whether it will be affected in any way by the ordering of the layers.

The order of the layers affects the convergence of your model and hence your results. The Batch Normalization paper suggests applying Batch Normalization before the activation function, and Dropout is applied after computing the activations. The right order of layers is therefore:
Dense or Conv
Batch Normalization
Activation
Dropout
In Keras, here is how you would write it sequentially:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation, Dropout

model = Sequential()
model.add(Dense(n_neurons, input_shape=your_input_shape, use_bias=False))  # disable the bias: Batch Normalization's beta offset makes it redundant
model.add(BatchNormalization())
model.add(Activation('relu'))  # for example
model.add(Dropout(rate=0.25))
Batch Normalization helps to avoid vanishing/exploding gradients when training your model, so it is especially important if you have many layers. You can read the paper cited above for more details.

Related

What do units and layers do in a neural network?

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    # the linear output layer
    layers.Dense(units=1),
])
The above is a Keras sequential model example from Kaggle. I'm having a problem understanding these two things.
Are the units the number of nodes in a hidden layer? I see some people use 250 or whatever. What happens when that number is made higher or lower?
Why would another hidden layer need to be added? What does adding more and more layers actually do to the data?
Answers in brief
units is the number of neurons in a particular layer. A higher number means the model has more parameters to update during learning. The same goes for layers: more layers and more neurons take more time to train. How many neurons to choose depends on the use case, the dataset, and the model architecture.
When you have more hidden layers, you have more parameters to update. More parameters and layers mean the model can capture more complex relationships hidden in the data. For example, for (multi-class) image classification you need deeper layers with more neurons to learn the features of the image, which the final layer then uses to classify.
Play with the TensorFlow Playground; it gives a great intuition for what happens when you change the layers and neurons.
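To make this concrete, here is a minimal sketch (assuming TensorFlow/Keras) that compares trainable-parameter counts as the widths and depth change; the specific sizes are illustrative only:
from tensorflow import keras
from tensorflow.keras import layers

def build(units_per_layer):
    # Build a small dense network with the given hidden-layer widths.
    model = keras.Sequential()
    model.add(layers.Dense(units_per_layer[0], activation='relu', input_shape=[2]))
    for units in units_per_layer[1:]:
        model.add(layers.Dense(units, activation='relu'))
    model.add(layers.Dense(1))  # linear output layer
    return model

small = build([4, 3])       # the Kaggle example: 4 and 3 hidden units
wider = build([250, 250])   # wider layers -> many more weights to learn
deeper = build([4, 3, 3])   # an extra hidden layer also adds parameters

for name, m in [('small', small), ('wider', wider), ('deeper', deeper)]:
    print(name, m.count_params())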

Increasing the smoothness of the accuracy curve in image classification

I have developed a convolutional neural network on the TILDA image dataset which achieves over 90% accuracy with the following model. I trained it with 4 batches and 100 epochs.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(8, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(16, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 2, padding='same', activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])
Using the above model I could plot the following graphs for the training and validation accuracy.
Do you have any suggestions to increase the smoothness of these curves? What can be the possible reasons for getting such curves? I appreciate your recommendations to improve this model.
The following may help in getting a smoother curve:
NEVER use dropout before the final layer. MaxPool + Dropout in your model discards 87.5% of the information flowing into the final layer (2x2 max pooling keeps only a quarter of the activations, and Dropout(0.5) then drops half of what remains). Avoid pooling as well, unless you need global or adaptive pooling to get a fixed-shape output. If you must pool, you need a much larger number of kernels to compensate for the loss of information.
Use a lower learning rate. From what the training curve shows, the model is heading toward a minimum, but with several bumps.
Are you using SGD without momentum? If so, introduce momentum. Also consider adaptive optimizers with built-in momentum, like Adam (see the sketch after this list).
Why the sigmoid in between? Sigmoid reduces the gradient magnitude and makes learning slower.
If you only care about the curve and are not restricted by number of parameters, consider adding a few more layers and/or channels.
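Below is a sketch of how these suggestions could be applied to the posted model, assuming TensorFlow/Keras; the learning rate and the loss choice are assumptions, not tuned or prescribed values:
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(8, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(16, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),  # relu instead of sigmoid
    layers.BatchNormalization(),
    layers.Flatten(),                      # no pooling or dropout right before the output
    layers.Dense(5, activation='softmax'),
])

# Adaptive optimizer with built-in momentum and a lower learning rate (illustrative value)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='categorical_crossentropy',  # assumes one-hot labels
              metrics=['accuracy'])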

Stacked LSTM with Multiple Dense Layers After

Andrew Ng talks about a deep RNN architecture built by stacking recurrent layers on top of each other. However, he notes that these are usually limited to 2 or 3 recurrent layers because of the already complex time-dependent calculations in the structure. But he does add that people commonly add "a bunch of deep layers that are not connected horizontally" after these recurrent layers (shown as blue boxes that extend from a[3]<1>). I am wondering whether he is simply talking about stacking Dense layers on top of the recurrent layers, or whether it is something more complicated. Something like this in Keras:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(keras.layers.LSTM(100, return_sequences=True, batch_input_shape=(32, 1, input_shape), stateful=True))
model.add(keras.layers.LSTM(100, return_sequences=False, stateful=True))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
In most cases, yes: the common structure of an RNN after the hidden state includes only dense layers.
However, this can take many forms, such as a dense layer and a softmax layer when predicting the next word of a vocabulary in natural language processing (NLP) (or language modelling) applications (examples here).
Alternatively, for multi-objective prediction, it may be the case that multiple separate dense layers are required to generate distinct outputs, such as the value and policy heads in reinforcement learning.
Finally, deep LSTMs can be used as encoders that are part of a larger model which does not necessarily have to include only sequence data. For instance, diagnosing patients with a model that encodes textual notes with an LSTM and encodes images with a CNN, before passing the combined embeddings through final dense layers.
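As an illustration of the multi-output case, here is a minimal functional-API sketch (TensorFlow/Keras assumed) of an LSTM encoder feeding two separate dense heads; all layer names, sizes, and shapes are made up for the example:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(10, 8))                # (timesteps, features), illustrative
x = layers.LSTM(100, return_sequences=True)(inputs)
x = layers.LSTM(100)(x)                            # final hidden state

value = layers.Dense(100, activation='relu')(x)
value = layers.Dense(1, name='value')(value)       # e.g. a regression head

policy = layers.Dense(100, activation='relu')(x)
policy = layers.Dense(4, activation='softmax', name='policy')(policy)  # e.g. a classification head

model = keras.Model(inputs, [value, policy])
model.compile(optimizer='adam',
              loss={'value': 'mse', 'policy': 'categorical_crossentropy'})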

Keras Dropout Layer Model Predict

The dropout layer is only supposed to be used during the training of the model, not during testing.
If I have a dropout layer in my Keras sequential model, do I need to do something to remove or silence it before I do model.predict()?
No, you don't need to silence it or remove it. Keras automatically takes care of it.
It is clearly mentioned in the documentation. A Keras model has two modes:
training
testing
Regularization mechanisms, such as Dropout and L1/L2 weight regularization, are turned off at testing time.
Note: in my opinion, Batch Normalization is also a much-preferred regularization technique compared to Dropout. Consider using it.
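A small check makes this concrete, assuming TensorFlow 2.x Keras: predict() runs in inference mode, so dropout is inactive, while calling the model with training=True re-enables it (this is how Monte Carlo dropout is usually done). The shapes and values below are illustrative only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(4,)),
    layers.Dropout(0.5),
    layers.Dense(1),
])

x = np.ones((1, 4), dtype='float32')
print(model.predict(x))         # dropout inactive: repeated calls give the same output
print(model(x, training=True))  # dropout active: output varies from call to call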

Accuracy on middle layer of autoencoder implemented using Keras

I have implemented an autoencoder using Keras. I understand that I can add an accuracy metric as follows:
autoencoder.compile(optimizer='adam',
                    loss='mean_squared_error',
                    metrics=['accuracy'])
My question is:
Is the accuracy metric applied on the last layer of the decoder by default? If so, how can I set it so that it would get the representations from middle (hidden) layer to compute accuracy performance? Do I need to define a custom metric? How would that work?
It seems that what you really want is a multiple output network.
So on top of your middle layer that defines your embedding, add a layer (or more) to do your classification.
Then have a look at Multiple outputs in Keras to create your global cost.
You may also want to start by training the autoencoder only, and then only the additional classifier layers, to see the performance. You can also balance the reconstruction quality of the encoder against the accuracy of the classifier in the loss, training both networks at the same time.
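For illustration, here is a minimal functional-API sketch of that multiple-output idea, assuming TensorFlow/Keras, a flat 784-dimensional input, and 10 classes; the layer sizes, names, and loss weights are assumptions, not taken from the question:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
encoded = layers.Dense(128, activation='relu')(inputs)
embedding = layers.Dense(32, activation='relu', name='embedding')(encoded)  # middle layer

# Decoder head: reconstructs the input from the embedding
decoded = layers.Dense(128, activation='relu')(embedding)
reconstruction = layers.Dense(784, activation='sigmoid', name='reconstruction')(decoded)

# Classifier head on top of the middle layer
classification = layers.Dense(10, activation='softmax', name='classifier')(embedding)

model = keras.Model(inputs, [reconstruction, classification])
model.compile(optimizer='adam',
              loss={'reconstruction': 'mean_squared_error',
                    'classifier': 'sparse_categorical_crossentropy'},
              loss_weights={'reconstruction': 1.0, 'classifier': 0.5},  # illustrative balance
              metrics={'classifier': ['accuracy']})  # accuracy measured on the classifier head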
