TensorFlow 2.4.0 - Parameters associated with BatchNorm and Activation - keras

I am printing a tensorflow.keras.Model instance summary. The type is tensorflow.python.keras.engine.functional.Functional object.
This model has layers with activations and batch normalization associated. When I print the list of parameters, I see
weights
bias
4 items co-dimensional with the bias
These four items are (I guess) the batch normalization and activations.
My question is: why do we have parameters associated with batch normalization and activations? And what could be the other two items?
My aim is to transpose this Keras model to a PyTorch counterpart, so I need to know the order of the parameters and what these parameters represent

there are no parameters associated with activations, those are simply some element-wise nonlinear function. So no matter how many activations you have they don't account for any parameter counts. However, your guess is right, there are in fact parameters associated with BatchNorm layer, 2 parameters in each BatchNorm layer to be precise (lambda and beta). So those BatchNorm layer does add additional parameters in your network.

Related

What do units and layers do in neural network?

model = keras.Sequential([
# the hidden ReLU layers
layers.Dense(units=4, activation='relu', input_shape=[2]),
layers.Dense(units=3, activation='relu'),
# the linear output layer
layers.Dense(units=1),
])
The above is a Keras sequential model example from Kaggle. I'm having a problem understanding these two things.
Are the units the number of nodes in a hidden layer? I see some people put 250 or what ever. What does the number do when it gets changed higher or lower?
Why would another hidden layer need to be added? What does it actually do the data to add more and more layers?
Answers in brief
units is representing how many neurons in a particular layer.When you have higher number,model has higher parameters to update during learning.Same thing goes to layers as well.(more layers and more neurons take more time to train the model).selecting how many neurons is depend on the use case and dataset and model architecture.
When you have more hidden layers, you have more parameters to update.More parameters and layers meaning model is able to understand complex relationships hidden in the data. For example when you have a image classification(multiple), you need more deep layers with neurons to understand the features in the image, which use to classify in final layer.
play with tensorflow playground,it will give great idea when you change the layers and neurons.

What is the difference between density of input layer and input_dim in Keras library?

I have a question about the terms of MLP in Keras.
what does the density of a layer mean?
is it the same as the number of neurons? if it is, so what's the role of input_dim?
I have never head of the "density" of a layer in the context of vanilla feed forward networks. I would assume it refers to the number of neurons, but really it depends on context.
Input layer with a certain dimension and the first hidden layer with input_dim argument are both equivalent ways to handle input in Keras.

Few questions about Keras documentation

In Keras documentation named activations.md, it says "Activations can either be used through an Activation layer, or through the activation argument supported by all forward layers.". Then what is the meaning of forward layers? I think some layers don't have an activation parameter.(ex. Dropout layer)
And "Activations that are more complex than a simple TensorFlow/Theano/CNTK function (eg. learnable activations, which maintain a state) are available as Advanced Activation layers, and can be found in the module keras.layers.advanced_activations. These include PReLU and LeakyReLU.". Then what is the meaning of state in this case?
I am not sure there is a strict definition of "forward layers" in this context, but basically what it means is that the "classic", keras-built-in types of layers comprising one or more sets of weights used to transform an input matrix into an output one have a activation argument. Typically, Dense layers have one, as well as the various kinds of RNN and CNN layers.
It would not make sense for Dropout layers to have an activation function : they simply add a mechanism triggered at training to (hopefully) improve convergence rate and decrease overfitting chances.
As for the idea of "maintaining a state", it refers to activation functions that would not behave independently on each and every fed-in sample, but would instead retain some learnable information (the so-called state). Typically, for a LeakyReLU activation, you could adjust the leak parameter through training (and it would, in the documentation's terminology, be referred to as a state of this activation function).

Difference between DepthwiseConv2D and SeparableConv2D

From the document, I know SeparableConv2D is a combination of depthwise and pointwise operation. However, when I call
SeparableConv2D(100, 5, input_shape=(416,416,10)
# total parameters is 1350
model.add(DepthwiseConv2D(5, input_shape=(416,416,10)))
model.add(Conv2D(100, 1))
# total parameters is 1360
Does it mean SeparableConv2D does not use bias in depthwise phase by default?
Thanks.
Correct, checking the source code (I did this for tf.keras but I suppose it is the same for standalone keras) shows that in SeparableConv2D, the separable convolution works using only filters, no biases, and a single bias vector is added at the end. The second version, on the other hand, has biases for both DepthwiseConv2D and Conv2D.
Given that convolution is a linear operation and you are using no non-linearity inbetween depthwise and 1x1 convolution, I would suppose that having two biases is unnecessary in this case, similar to how you don't use biases in a layer that is followed by batch normalization, for example. As such, the extra 10 parameters wouldn't actually improve the model (nor should they really hurt either).

Using Dropout with Keras and LSTM/GRU cell

In Keras you can specify a dropout layer like this:
model.add(Dropout(0.5))
But with a GRU cell you can specify the dropout as a parameter in the constructor:
model.add(GRU(units=512,
return_sequences=True,
dropout=0.5,
input_shape=(None, features_size,)))
What's the difference? Is one preferable to the other?
In Keras' documentation it adds it as a separate dropout layer (see "Sequence classification with LSTM")
The recurrent layers perform the same repeated operation over and over.
In each timestep, it takes two inputs:
Your inputs (a step of your sequence)
Internal inputs (can be states and the output of the previous step, for instance)
Note that the dimensions of the input and output may not match, which means that "your input" dimensions will not match "the recurrent input (previous step/states)" dimesions.
Then in every recurrent timestep there are two operations with two different kernels:
One kernel is applied to "your inputs" to process and transform it in a compatible dimension
Another (called recurrent kernel by keras) is applied to the inputs of the previous step.
Because of this, keras also uses two dropout operations in the recurrent layers. (Dropouts that will be applied to every step)
A dropout for the first conversion of your inputs
A dropout for the application of the recurrent kernel
So, in fact there are two dropout parameters in RNN layers:
dropout, applied to the first operation on the inputs
recurrent_dropout, applied to the other operation on the recurrent inputs (previous output and/or states)
You can see this description coded either in GRUCell and in LSTMCell for instance in the source code.
What is correct?
This is open to creativity.
You can use a Dropout(...) layer, it's not "wrong", but it will possibly drop "timesteps" too! (Unless you set noise_shape properly or use SpatialDropout1D, which is currently not documented yet)
Maybe you want it, maybe you dont. If you use the parameters in the recurrent layer, you will be applying dropouts only to the other dimensions, without dropping a single step. This seems healthy for recurrent layers, unless you want your network to learn how to deal with sequences containing gaps (this last sentence is a supposal).
Also, with the dropout parameters, you will be really dropping parts of the kernel as the operations are dropped "in every step", while using a separate layer will let your RNN perform non-dropped operations internally, since your dropout will affect only the final output.

Resources