Basic question in trainable parameters in CNN convolution - keras

I am learning how the number of trainable parameters in a CNN is calculated in Keras. Why do we count the filter values as trainable parameters? The convolution itself is a fixed calculation (i.e. matrix multiplication), so it seems there is nothing to update (train). I know there is a formula, but why are these considered trainable parameters? For example: in the first Conv2D layer, with an image of size 10x10x1 and a single 3x3 filter, Keras reports 10 parameters (3x3+1).

Your convolution layer has a 3x3(x1) kernel (the x1 because your image has only a single channel). The values in the convolution layer's kernel are learned parameters: the convolution operation is fixed, but the numbers it multiplies the input by are updated during training. In addition to the kernel itself, the convolution layer may (it's usually optional) have a learnable bias parameter (that's the +1 in your formula). It's a bit hard to tell from your question, but with the single filter you describe, the layer learns 3x3x1 kernel weights plus 1 bias, hence the 10 (3x3+1) learned parameters. If you instead asked the layer to learn 10 different convolutional kernels (each with a bias), the count would be 10(3x3+1) = 100.
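The arithmetic can be checked with a small helper. This is a minimal sketch in plain Python, not Keras code: the function name `conv2d_param_count` and its argument names are made up for illustration, and it assumes a standard convolution with one kernel of size kernel_h x kernel_w x in_channels per filter plus one optional bias per filter.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """Number of learnable values in a standard 2D convolution layer."""
    weights_per_filter = kernel_h * kernel_w * in_channels
    bias_per_filter = 1 if use_bias else 0
    return filters * (weights_per_filter + bias_per_filter)


# The setup from the question: 10x10x1 image, one 3x3 filter.
# The image height and width do not enter the count at all.
single_filter = conv2d_param_count(3, 3, in_channels=1, filters=1)
ten_filters = conv2d_param_count(3, 3, in_channels=1, filters=10)

print(single_filter)  # 10
print(ten_filters)    # 100
```

Note that the spatial size of the image (10x10) never appears: the same kernel slides over every position, so only the kernel size, the channel count, and the number of filters matter.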


TensorFlow 2.4.0 - Parameters associated with BatchNorm and Activation

I am printing a tensorflow.keras.Model instance summary. The type is tensorflow.python.keras.engine.functional.Functional object.
This model has layers with activations and batch normalization attached. When I print the list of parameters, I see 4 items co-dimensional with the bias. These four items are (I guess) the batch normalization and activations.
My question is: why do we have parameters associated with batch normalization and activations? And what could the other two items be?
My aim is to port this Keras model to a PyTorch counterpart, so I need to know the order of the parameters and what these parameters represent.
There are no parameters associated with activations; they are simply element-wise nonlinear functions. So no matter how many activations you have, they don't contribute to the parameter count. However, your guess is right that there are parameters associated with the BatchNorm layer: two learnable parameters per BatchNorm layer, to be precise (gamma and beta), each with one value per channel, which is why they are co-dimensional with the bias. So the BatchNorm layers do add parameters to your network. The other two items are most likely the layer's moving mean and moving variance, which Keras stores as non-trainable weights.
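Since the goal is a PyTorch port, it may help to lay out what a Keras BatchNormalization layer stores. The sketch below is plain Python and rests on assumptions not stated in the original post: it assumes the default layer settings (center=True, scale=True), for which `get_weights()` returns gamma, beta, moving_mean and moving_variance in that order, and it pairs them with the corresponding `torch.nn.BatchNorm2d` attribute names. The names `KERAS_BN_WEIGHTS` and `batchnorm_param_count` are made up for illustration.

```python
# (Keras weight name, trainable?, matching torch.nn.BatchNorm2d attribute),
# listed in the order a default Keras BatchNormalization layer returns them.
KERAS_BN_WEIGHTS = [
    ("gamma", True, "weight"),
    ("beta", True, "bias"),
    ("moving_mean", False, "running_mean"),
    ("moving_variance", False, "running_var"),
]


def batchnorm_param_count(channels):
    """Return (trainable, non_trainable) counts; every item has one value per channel."""
    trainable = sum(channels for _, is_trainable, _ in KERAS_BN_WEIGHTS if is_trainable)
    non_trainable = sum(channels for _, is_trainable, _ in KERAS_BN_WEIGHTS if not is_trainable)
    return trainable, non_trainable


keras_to_torch = {keras_name: torch_name for keras_name, _, torch_name in KERAS_BN_WEIGHTS}

print(batchnorm_param_count(64))  # (128, 128)
print(keras_to_torch["gamma"])    # weight
```

When copying values across, it is worth checking the epsilon and momentum settings of both layers as well, since the two libraries use different defaults and define momentum differently.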

Dropout, Regularization and batch normalization

I have a couple of questions about LSTM layers in the Keras library.
In the LSTM layer we have two kinds of dropout: dropout and recurrent_dropout. As I understand it, the first one randomly drops some features from the input (sets them to zero) and the second one does the same on the hidden units (features of h_t). Since an LSTM network has several time steps, is the dropping applied separately to each time step, or is it done only once and kept the same for every step?
My second question is about regularizers in the LSTM layer in Keras. I know that, for example, the kernel regularizer will regularize the weights corresponding to the inputs, but we have different weights for the inputs.
For example, the input gate, update gate and output gate use different weights for the input (and also different weights for h_(t-1)). So will they all be regularized at the same time? What if I want to regularize only one of them, for example only the weights used in the formula for the input gate?
The last question is about activation functions in Keras. In the LSTM layer I have activation and recurrent_activation. activation is tanh by default, and I know that in the LSTM architecture tanh is used two times (for h_t and for the memory cell candidate) and sigmoid is used three times (for the gates). So does that mean that if I change tanh (in the LSTM layer in Keras) to another function, say ReLU, it will change for both h_t and the memory cell candidate?
It would be perfect if any of these questions could be answered. Thank you very much for your time.
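On the regularizer question, here is a sketch under one stated assumption: that Keras stores the LSTM input weights as a single kernel of shape (input_dim, 4 * units), with the gate blocks concatenated along the last axis in the order input, forget, cell candidate, output. If that holds, a regularizer attached to the kernel sees all four blocks at once, and penalizing a single gate means slicing out its columns. The code is plain Python on a toy list rather than Keras code, and the helper names `gate_columns` and `l2_penalty_for_gate` are hypothetical.

```python
# Assumed gate layout inside the concatenated kernel: i, f, c, o.
GATE_ORDER = ("input", "forget", "cell_candidate", "output")


def gate_columns(units):
    """Column range (start, stop) of each gate in a kernel with 4 * units columns."""
    return {name: (k * units, (k + 1) * units) for k, name in enumerate(GATE_ORDER)}


def l2_penalty_for_gate(kernel, units, gate, strength):
    """L2 penalty computed only on the columns that belong to one gate."""
    start, stop = gate_columns(units)[gate]
    return strength * sum(w * w for row in kernel for w in row[start:stop])


units = 2
# Toy kernel with input_dim = 1 and 4 * units = 8 columns.
kernel = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]
input_gate_penalty = l2_penalty_for_gate(kernel, units, "input", strength=0.5)

print(gate_columns(units)["input"])  # (0, 2)
print(input_gate_penalty)            # 2.5
```

In Keras the same slicing could be placed inside a custom regularizer callable passed as kernel_regularizer, since the layer accepts any callable that maps the weight tensor to a scalar penalty.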

Difference between DepthwiseConv2D and SeparableConv2D

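From the documentation, I know SeparableConv2D is a combination of a depthwise and a pointwise operation. However, when I build the two versions below, the parameter counts differ:
# imports assume tf.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D, SeparableConv2D
model = Sequential()
model.add(SeparableConv2D(100, 5, input_shape=(416, 416, 10)))
# total parameters is 1350
model = Sequential()
model.add(DepthwiseConv2D(5, input_shape=(416, 416, 10)))
model.add(Conv2D(100, 1))
# total parameters is 1360
Does this mean SeparableConv2D does not use a bias in the depthwise phase by default?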
Correct, checking the source code (I did this for tf.keras but I suppose it is the same for standalone keras) shows that in SeparableConv2D, the separable convolution works using only filters, no biases, and a single bias vector is added at the end. The second version, on the other hand, has biases for both DepthwiseConv2D and Conv2D.
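The two totals from the question can be reproduced by hand. The following is a minimal sketch in plain Python, not Keras code; the helper names are made up for illustration, and a depth multiplier of 1 is assumed since the question does not set one.

```python
def depthwise_weights(kernel_size, in_channels, depth_multiplier=1):
    """One kernel_size x kernel_size filter per input channel (times the multiplier)."""
    return kernel_size * kernel_size * in_channels * depth_multiplier


def pointwise_weights(in_channels, filters):
    """A 1x1 convolution mixing in_channels into `filters` output channels."""
    return 1 * 1 * in_channels * filters


in_channels, filters, kernel_size = 10, 100, 5

dw = depthwise_weights(kernel_size, in_channels)  # 5 * 5 * 10 = 250
pw = pointwise_weights(in_channels, filters)      # 10 * 100 = 1000

# SeparableConv2D: depthwise and pointwise weights, one bias vector at the end.
separable_total = dw + pw + filters

# DepthwiseConv2D followed by Conv2D(100, 1): each layer carries its own bias.
stacked_total = (dw + in_channels) + (pw + filters)

print(separable_total)  # 1350
print(stacked_total)    # 1360
```

The difference of 10 is exactly the depthwise bias, one value per input channel.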
Given that convolution is a linear operation and you are using no non-linearity in between the depthwise and the 1x1 convolution, I would suppose that having two biases is unnecessary in this case, similar to how you don't use biases in a layer that is followed by batch normalization, for example. As such, the extra 10 parameters wouldn't actually improve the model (nor should they really hurt either).
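The redundancy of the second bias can be seen at a single pixel. The sketch below is plain Python with made-up toy numbers (3 channels in, 2 filters out), not Keras code: it applies a 1x1 convolution to a depthwise result with its own bias, then shows that the same output is obtained by dropping the depthwise bias and folding it into the pointwise bias.

```python
def pointwise(pixel, weights, bias):
    """1x1 convolution at one pixel: out[k] = sum_c pixel[c] * weights[c][k] + bias[k]."""
    return [
        sum(pixel[c] * weights[c][k] for c in range(len(pixel))) + bias[k]
        for k in range(len(bias))
    ]


depthwise_out = [2, -1, 3]     # depthwise result at one pixel, before its bias
b1 = [1, 0, -2]                # depthwise bias, one value per channel
W = [[1, 2], [0, -1], [3, 1]]  # pointwise weights, shape (3 channels, 2 filters)
b2 = [5, -4]                   # pointwise bias, one value per filter

# Version with two biases: add b1, then apply the 1x1 convolution with b2.
two_biases = pointwise([z + b for z, b in zip(depthwise_out, b1)], W, b2)

# Version with one bias: fold b1 through W into a single bias vector.
folded_bias = pointwise(b1, W, b2)
one_bias = pointwise(depthwise_out, W, folded_bias)

print(two_biases)  # [11, 4]
print(one_bias)    # [11, 4]
```

Because the depthwise bias is the same at every spatial position, the folded bias is the same everywhere too, so a single bias vector can represent anything the pair of biases can.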

How to Calculate Output from Conv2D and Conv2DTranspose

Regarding a question about calculating the number of parameters, I have a follow-up question.
Keras CNN model parameters calculation
Regarding the formula for total parameters, does it mean that, for each input channel, there are 64 sets of learnable weights (filters) but only 1 common bias applied to it, then one single output channel is generated in such a way that 64 filtered results are simply added up followed by adding the common bias?
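One way to picture the summation is to compute the output at a single spatial position by hand. The sketch below is plain Python with toy numbers, not Keras code, and the function name `conv_output_at` is hypothetical. It assumes the usual Conv2D weight layout (kernel_h, kernel_w, in_channels, filters) with one bias per filter: for each filter, the products over every kernel position and every input channel are added together, that filter's own bias is added once, and the result is one value in that filter's output channel.

```python
def conv_output_at(patch, kernels, biases):
    """Output of every filter at one spatial position.

    patch[i][j][c]      : input window, shape (kh, kw, in_channels)
    kernels[i][j][c][k] : weights, shape (kh, kw, in_channels, filters)
    biases[k]           : one bias per filter
    """
    kh, kw, cin = len(patch), len(patch[0]), len(patch[0][0])
    outputs = []
    for k in range(len(biases)):
        total = biases[k]  # the bias is added once per filter, not once per input channel
        for i in range(kh):
            for j in range(kw):
                for c in range(cin):
                    total += patch[i][j][c] * kernels[i][j][c][k]
        outputs.append(total)
    return outputs


# Toy case: 1x1 window, 2 input channels, 3 filters.
patch = [[[1, 2]]]
kernels = [[[[1, 0, 2], [3, 1, 0]]]]
biases = [10, 20, 30]

outputs = conv_output_at(patch, kernels, biases)
param_count = 1 * 1 * 2 * 3 + len(biases)

print(outputs)      # [17, 22, 32]
print(param_count)  # 9
```

With 64 filters there would be 64 such output channels and 64 biases, each filter holding one kernel slice per input channel.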

Keras LSTM: first argument

In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The answers seem to refer to multi-layer perceptrons (MLP), in which the hidden layer can be of a different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
The h is the output for a given timestep, and the cell state c is bound to the hidden size by the element-wise multiplication. The addition of terms to compute the gates requires that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for the Keras LSTM as well, and it is why you only provide a single units argument.
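The shared dimension also shows up in the parameter count. The helper below is a minimal sketch in plain Python, not Keras code: the name `lstm_param_count` is made up, the input dimension of 8 is an arbitrary example value, and it assumes the four gate blocks are stacked so that W has shape (input_dim, 4 * units), U has shape (units, 4 * units), and the bias has 4 * units entries.

```python
def lstm_param_count(input_dim, units):
    """Parameters of one LSTM layer with four gate blocks of equal size."""
    kernel = input_dim * 4 * units        # W: input -> gates
    recurrent_kernel = units * 4 * units  # U: previous hidden state -> gates
    bias = 4 * units
    return kernel + recurrent_kernel + bias


# LSTM(10) on inputs with 8 features per timestep (8 is an example value).
count = lstm_param_count(input_dim=8, units=10)

print(count)  # 760
```

Both W and U end in the same 4 * units columns, which is what lets their products be added inside each gate.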
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units; the output is:
seq vector            RNN weights
(1 x input_dim)   *   (input_dim x hidden_units),
which has shape 1 x hidden_units (a row vector representing the encoding of your input sequence). Thus, the names are used synonymously in this case.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications instead of the vector-matrix multiplication shown above.
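The shape argument can be tried out on toy numbers. This is a minimal sketch in plain Python with made-up values (input_dim = 3, hidden_units = 2) and a hypothetical helper name `vec_mat`; it illustrates only the single multiplication described above, not a full LSTM step.

```python
def vec_mat(vector, matrix):
    """(1 x input_dim) times (input_dim x hidden_units) gives (1 x hidden_units)."""
    hidden_units = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(hidden_units)
    ]


x = [1, 2, 3]                 # one input vector, input_dim = 3
W = [[1, 0], [0, 1], [1, 1]]  # weights, shape (3, 2), so hidden_units = 2

encoding = vec_mat(x, W)

# A batch of inputs turns the vector-matrix product into a matrix-matrix product.
batch = [[1, 2, 3], [0, 1, 0]]
batch_encodings = [vec_mat(row, W) for row in batch]

print(encoding)         # [4, 5]
print(batch_encodings)  # [[4, 5], [0, 1]]
```

Whatever the input dimension, the result always has hidden_units entries per input row.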
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors h_(t-1), h_t, and h_(t+1) (see the RNN image).
If you want to control the number of LSTM blocks in your network, you need to specify this through the input to the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (see the Keras documentation). timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again, in the RNN image, timesteps would control how many green blocks the network contains.
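How units and timesteps show up in the output can be summarized with a shape rule. The helper below is a sketch in plain Python, not Keras code: `lstm_output_shape` is a hypothetical name, and the batch size of 32 and 7 timesteps are arbitrary example values. It assumes the usual Keras behaviour, where the layer returns only the last hidden state unless return_sequences=True is set.

```python
def lstm_output_shape(nb_samples, timesteps, units, return_sequences=False):
    """Output shape of an LSTM layer for input (nb_samples, timesteps, input_dim)."""
    if return_sequences:
        # One hidden state per unrolled block.
        return (nb_samples, timesteps, units)
    # Only the hidden state of the last block.
    return (nb_samples, units)


last_only = lstm_output_shape(32, timesteps=7, units=10)
full_sequence = lstm_output_shape(32, timesteps=7, units=10, return_sequences=True)

print(last_only)      # (32, 10)
print(full_sequence)  # (32, 7, 10)
```

The input feature dimension does not appear in either output shape, and timesteps appears only when the whole sequence of hidden states is returned.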
