Batch normalization vs layer normalization - pytorch

Please illustrate batch normalisation and layer normalisation with a clear notation involving tensors. Also comment on when each one is required/recommended.

I think what you're looking for is in Group Normalization by Yuxin Wu and Kaiming He (IJCV 2020), especially Fig. 2, which shows the axes each method averages over: for an input of shape (N, C, H, W), batch norm computes mean and variance per channel over the (N, H, W) axes, while layer norm computes them per sample over the (C, H, W) axes.
As a rule of thumb, batch norm is the usual choice for CNNs trained with reasonably large batches, whereas layer norm is preferred for RNNs and Transformers, and whenever batch sizes are small or variable, since its statistics do not depend on the batch.
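A minimal PyTorch sketch of the difference (the shapes here are illustrative), checking the manual computation against nn.BatchNorm2d and nn.LayerNorm:
import torch
import torch.nn as nn

x = torch.randn(8, 3, 4, 4)  # (N, C, H, W)

# Batch norm: one mean/var per channel, averaged over (N, H, W)
bn = nn.BatchNorm2d(num_features=3, affine=False)
bn_manual = (x - x.mean(dim=(0, 2, 3), keepdim=True)) / torch.sqrt(
    x.var(dim=(0, 2, 3), unbiased=False, keepdim=True) + bn.eps)
print(torch.allclose(bn(x), bn_manual, atol=1e-5))  # True

# Layer norm: one mean/var per sample, averaged over (C, H, W)
ln = nn.LayerNorm(normalized_shape=[3, 4, 4], elementwise_affine=False)
ln_manual = (x - x.mean(dim=(1, 2, 3), keepdim=True)) / torch.sqrt(
    x.var(dim=(1, 2, 3), unbiased=False, keepdim=True) + ln.eps)
print(torch.allclose(ln(x), ln_manual, atol=1e-5))  # True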

Related

How to perform de-normalization of last layer into labels in Keras, analogous to the preprocessing normalization layer (but inversed)?

It is my understanding that artificial neural networks work best on normalized data, i.e., inputs and outputs should ideally have a mean of 0 and a variance of 1 (and, if possible, a near-Gaussian, or at least "well-behaved", distribution).
Therefore, I have seen / written quite a few Keras scripts in which I first do some feature-wise normalization of the predictors and labels. This is a pain, as it means keeping track of a number of mean and std values and applying them correctly at inference time, etc.
I found out recently that there is now out-of-the-box functionality for doing the predictors normalization in Keras in an "adaptable, not trainable" way, which is very convenient, as all the normalization information gets stored and used out-of-the-box in the network object: see: https://keras.io/guides/preprocessing_layers/ , https://keras.io/api/layers/preprocessing_layers/numerical/normalization/#normalization-class . This makes use / bookkeeping much simpler.
My question is: would it make sense / is there a simple way to similarly do an "output de-normalization" in Keras, i.e., assuming that the outputs of the network have mean 0 and variance 1, add an adaptable (adaptable, not trainable; similar to the preprocessing normalization layer) layer that de-normalizes these outputs into the correct mean and variance for each label?
I guess this is quite similar to the preprocessing normalization layer, except that what we want is the "inverse transformation" of what would be obtained by applying that layer to the labels. I.e., when adapted to the labels, the layer should "de-normalize" a 0-mean, 1-std distribution into a distribution with the feature-wise mean and std of the labels.
I do not see a way to get this "inverse layer" or "de-normalization layer"; am I missing something / is there a simple way to do it?
The normalization layer has an invert parameter:
If True, this layer will apply the inverse transformation to its
inputs: it would turn a normalized input back into its original form.
So, in theory you could use:
layer = tf.keras.layers.Normalization(invert=True)
to de-normalize. At the time of writing this is implemented incorrectly and will not work (but it seems the bug is already fixed for the next Keras version).
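Once a fixed version is available, a minimal sketch of the intended usage (variable names illustrative) would be to adapt two Normalization layers on the labels, one forward and one inverted:
import numpy as np
import tensorflow as tf

labels = np.array([[10.0], [20.0], [30.0], [40.0]], dtype="float32")

norm = tf.keras.layers.Normalization()                # y -> (y - mean) / std
denorm = tf.keras.layers.Normalization(invert=True)   # z -> z * std + mean
norm.adapt(labels)
denorm.adapt(labels)

z = norm(labels)       # roughly zero mean, unit variance
y_back = denorm(z)     # should recover the original labels
print(np.allclose(y_back.numpy(), labels, atol=1e-4))  # True (once the bug is fixed)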

Number of parameters: Disadvantages of inverted residual blocks for image classification tasks

I have a more general question regarding the inverted residual blocks of MobileNet and EfficientNet. I have a classification task on an image dataset of lower complexity, so I chose an architecture with few parameters (EfficientNet B0). But in terms of validation loss I run into overfitting, while a shallow ResNet, ResNeXt, etc. worked much better, even though these architectures use regular residual blocks and therefore have more parameters.
So it seems that there is no relation between the number of parameters and model complexity here? Can someone please explain what I am missing?
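For reference, the raw parameter counts are easy to compare directly; a quick sketch, assuming torchvision's stock model definitions:
import torchvision.models as models

# Count parameters of the untrained stock models
effnet = models.efficientnet_b0()
resnet = models.resnet18()
print(sum(p.numel() for p in effnet.parameters()))  # ~5.3M
print(sum(p.numel() for p in resnet.parameters()))  # ~11.7M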
This is a very interesting question. I would also be very interested in a response to that topic.

Doesn't keras.layers.Flatten lose information?

Brand new to Keras and ML in general. I'm looking at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/, and it uses Flatten between Embedding and Dense because Embedding produces a 2-D output per sample but Dense requires a single dimension.
I'm sure I'm missing something obvious here, but why doesn't this lose track of which words are in which input vectors? How are we able to still know that input #3 was "nice work" and is associated with label #3, 1, for "positive"?
I guess the original dimensions are retained from the original input and then somehow restored for Dense's output? Or am I just totally missing a major conceptual point?
# Model from the tutorial; vocab_size and max_length are defined earlier there
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Thanks for any guidance!
The Embedding layer gives you a vector for each word token, so its output per sample is 2-D. We need to flatten it before any classifier block.
Some information can be lost this way; for example, when we use convolutional layers and then flatten the feature maps, the spatial information is lost. But by then the Conv layers have already extracted the most important features, and we feed those features to the fully connected layers.
In your example, the temporal dimension is no longer maintained; usually it's desirable to pass the output of the embedding matrix to an RNN/Conv layer for further feature extraction.
Flatten is applied only to the non-batch dimensions, meaning the examples are still kept separate (if that is what you were asking).
For each sample, say "nice work", we get 2 vectors (one for "nice", one for "work"). Since we only want the overall sentiment of the sentence, once we have extracted the features we can apply Flatten.
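A small sketch of the shapes involved (numbers illustrative), showing that Flatten leaves the batch axis alone and just concatenates each sample's word vectors in order:
import numpy as np
import tensorflow as tf

vocab_size, max_length, dim = 50, 4, 8
x = np.array([[6, 2, 0, 0],    # e.g. "nice work", padded to length 4
              [3, 1, 0, 0]])   # a second padded sentence
emb = tf.keras.layers.Embedding(vocab_size, dim)
flat = tf.keras.layers.Flatten()

e = emb(x)    # shape (2, 4, 8): one 8-d vector per word, per sample
f = flat(e)   # shape (2, 32): batch axis untouched, word vectors concatenated
print(e.shape, f.shape)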

Word2vec CBOW model implementations, deviations from the original algorithm

I am trying to implement the CBOW model in PyTorch.
What I understood from the explanation of word2vec is that it has 2 layers (and therefore 2 matrices): the first matrix contains low-dimensional word vectors and is effectively a lookup table, and the word's vector representation is projected onto the projection layer (no non-linearity, therefore not a hidden layer). The word vectors are then multiplied by the 2nd matrix, and the result goes to the output through a softmax function. After training, the first matrix can be used as a word embedding.
I see many implementations use 3 layers (1 embedding layer plus 2 more layers), which contradicts my understanding above. Some example implementations here, here and here.
The following three lines of code are commonly used as the model to implement the 3 layers:
self.embeddings = nn.Embedding(vocab_size, embedding_dim)    # lookup table (first matrix)
self.linear1 = nn.Linear(context_size * embedding_dim, 128)  # extra hidden layer over the concatenated context
self.linear2 = nn.Linear(128, vocab_size)                    # output projection
My questions are: if my understanding is correct, why are they using 3 layers? Are there any advantages?
One obvious disadvantage, I think, is that it will be computationally more expensive.
Word2vec resembles the idea of an autoencoder (which also has two layers); deviating from this proven idea might harm the embedding quality. Am I right?
Another important thing: according to the paper I mentioned above, for multiple context words the average of their vectors is projected onto the projection layer. But instead of averaging, these implementations concatenate the vectors. Why is that? Are there any advantages?
Also, they are using a non-linearity at the hidden layer, which I think will create a serious performance issue when training with a huge amount of data. Right?
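For contrast, a minimal sketch of the two-matrix CBOW from the original paper (averaged context vectors, no hidden non-linearity; names illustrative):
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)       # first matrix (lookup table)
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)  # second matrix

    def forward(self, context):                    # context: (batch, 2 * window)
        v = self.embeddings(context).mean(dim=1)   # average the context vectors, don't concatenate
        return self.output(v)                      # logits; CrossEntropyLoss applies the softmax

model = CBOW(vocab_size=5000, embedding_dim=100)
logits = model(torch.randint(0, 5000, (4, 4)))     # 4 samples, window of 2 on each side
print(logits.shape)                                # torch.Size([4, 5000])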

I need to understand this LSTM and Masking layers result

I'm new to Keras LSTMs. Could you please explain this model.summary() from Rasa Core training?
[screenshot: model.summary() of the model after training]
Also, what is the Masking layer doing and what does the value -1 in it mean?
A Masking layer is meant to "ignore steps" in sequences.
Your LSTM is working with sequences of 5 steps and 42 features per step.
If all features in a step have the same value defined in Masking (-1 in the example), that step will be ignored during training.
The idea is to simulate variable-length sequences.
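A small sketch of the mechanism (assuming plain tf.keras; the 5 steps × 42 features mirror the summary above):
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=-1.0, input_shape=(5, 42)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.summary()

x = np.random.rand(2, 5, 42).astype("float32")
x[:, 3:, :] = -1.0  # mark the last two steps of each sequence as padding
# Steps where every feature equals mask_value are skipped by the LSTM
print(model(x).shape)  # (2, 1)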
Not sure what exactly you don't understand, but model.summary() prints a summary representation of your model (keras.io).
It lists all layers used in the given model with their respective output sizes and parameter counts.
This particular model obviously starts with a masking layer for input sequences (I guess because of padding) and is followed by the simplest LSTM model possible.
