DepthwiseConv2D vs SeparableConv2D vs Groups - keras

How do Conv2D groups differ from depthwise and separable convolutions?
DepthwiseConv2D: https://keras.io/api/layers/convolution_layers/depthwise_convolution2d/
SeparableConv2D: https://keras.io/api/layers/convolution_layers/separable_convolution2d/
Groups in Conv2D: https://keras.io/api/layers/convolution_layers/convolution2d/
I couldn't find any diagram or article explaining how they mainly differ.
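Since there doesn't seem to be a single diagram for this, here is a minimal comparison sketch (assuming TensorFlow/Keras 2.x; Conv2D only accepts a groups argument from TF 2.3 onward) that builds the three layer types on the same input, so the printed output shapes and parameter counts make the differences concrete:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 8))  # 8 input channels

# DepthwiseConv2D: one 3x3 filter per input channel, no mixing across channels.
depthwise = layers.DepthwiseConv2D(kernel_size=3, padding='same')

# SeparableConv2D: a depthwise step followed by a 1x1 pointwise convolution
# that mixes channels and sets the number of output channels.
separable = layers.SeparableConv2D(filters=16, kernel_size=3, padding='same')

# Grouped Conv2D: channels are split into `groups` blocks, each convolved
# independently with its own filters (groups=1 is an ordinary convolution).
grouped = layers.Conv2D(filters=16, kernel_size=3, groups=4, padding='same')

for name, layer in [('depthwise', depthwise),
                    ('separable', separable),
                    ('grouped', grouped)]:
    out = layer(inputs)
    print(name, out.shape, layer.count_params())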

Related

Conv2D filters and CNN architecture

I am currently an undergraduate, working on a CNN model to recognize Telugu characters.
This question has two parts:
I have Telugu character images of shape (32, 32, 1) and I want to train a CNN model to recognize the characters. What should my model architecture be, and how do I decide on the architecture, the number of parameters, and the hidden layers? I know my case is essentially the same as handwritten digit recognition, but I want to know how to decide those parameters. Is there any common practice for building such an architecture?
The operation Conv2D(32, (5, 5)) means 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or different? If they are different, what kind of filters are initialized, and who decides them?
I tried searching the internet, but everywhere I go the answer I get is that the Conv2D operation applies filters to the input and performs the convolution.
To decide which model architecture is best, you need to experiment. That's the only way. Since you want to classify, a VGG-style architecture would be a good starting point, I believe. You need to experiment with the number of parameters, as it depends on your problem. You can use KerasTuner for that: https://keras.io/keras_tuner/
For kernel initialization, as far as I know convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, convolutional layers are initialized from a distribution, and as training progresses the values inside the filters change; that is the learning process. https://keras.io/api/layers/initializers
Edit: I forgot to mention that although I suggest a VGG architecture, you should downsize the model a lot. Your input shape is small, so if your model is too deep, you will overfit really quickly.
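To make that concrete, here is a minimal sketch (assuming tensorflow.keras; the layer sizes and the num_classes placeholder are illustrative choices, not prescriptions) of a downsized VGG-style network for (32, 32, 1) inputs, with kernel_initializer overridden explicitly:

import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 50  # placeholder: set to the number of distinct Telugu characters

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    # Two small conv blocks instead of the full VGG stack, to limit overfitting.
    layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                  kernel_initializer='he_uniform'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same',
                  kernel_initializer='he_uniform'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()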

Batch normalization vs layer normalization

Please illustrate batch normalisation and layer normalisation with a clear notation involving tensors. Also comment on when each one is required/recommended.
I think what you're looking for is in the Group Normalization paper by Yuxin Wu and Kaiming He (IJCV 2020).
Especially Fig. 2, which shows, for each normalization method, which axes of the (N, C, H, W) activation tensor the mean and variance are computed over.
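As a concrete illustration (a minimal sketch assuming tensorflow.keras and channels-last tensors, not taken from the paper), the two layers differ only in which axes the statistics are taken over:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((4, 8, 8, 16))  # (N, H, W, C)

# BatchNormalization: mean/variance over (N, H, W), separately per channel,
# so it depends on the batch and needs running statistics at inference time.
y_bn = layers.BatchNormalization(axis=-1)(x, training=True)

# LayerNormalization: mean/variance over (H, W, C), separately per example,
# so it is independent of the batch size (common in Transformers/RNNs).
y_ln = layers.LayerNormalization(axis=[1, 2, 3])(x)

print(y_bn.shape, y_ln.shape)  # both (4, 8, 8, 16)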

Doesn't keras.layers.Flatten lose information?

Brand new to Keras and ML in general. I'm looking at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/, and it uses Flatten between Embedding and Dense because Embedding produces a 2-D output per example but Dense expects a single dimension.
I'm sure I'm missing something obvious here, but why doesn't this lose which words are in which input vectors? How are we able to still know that input #3 was "nice work" and is associated with label #3, 1, for "positive"?
I guess the original dimensions are retained from the original input and then somehow restored for Dense's output? Or am I just totally missing a major conceptual aspect?
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Thanks for any guidance!
The Embedding layer gives you a vector for each word token, so its output is 2-D per example (sequence length x embedding dimension). We need to flatten it before any classifier block.
Some information is lost; for example, when we use convolutional layers and then flatten the feature maps, the spatial information is lost. But we have already extracted the most important features with the conv layers, and we feed those features to the fully connected layers.
In your example, the temporal dimension is no longer maintained; usually it is desirable to pass the output of the embedding matrix to an RNN/Conv layer for further feature extraction.
Flatten is applied only to the non-batch dimensions, meaning the examples are still kept separate (if that is what you mean).
For each sample, say "nice work", we get 2 vectors (one for "nice", one for "work"). Since we only want the overall sentiment of the sentence, once we have extracted the features we can apply Flatten.
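To see that the batch dimension survives, here is a minimal sketch (assuming tensorflow.keras; the sizes are made up) that prints the shapes before and after Flatten:

import numpy as np
from tensorflow.keras import layers

vocab_size, max_length = 50, 4
emb = layers.Embedding(vocab_size, 8)
flat = layers.Flatten()

batch = np.random.randint(0, vocab_size, size=(3, max_length))  # 3 padded "sentences"
vectors = emb(batch)        # shape (3, 4, 8): one 8-d vector per token
flattened = flat(vectors)   # shape (3, 32): token vectors concatenated per example

print(vectors.shape, flattened.shape)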

Word2vec CBOW model implementations, deviations from the original algorithm

I am trying to implement the CBOW model in PyTorch.
What I understood from the explanation of word2vec is that it has 2 layers (and therefore 2 matrices). The first matrix contains the low-dimensional word vectors and is effectively a lookup table; the vector representations of the context words are projected onto the projection layer (no non-linearity, therefore not a hidden layer). The projected vector is then multiplied by the 2nd matrix, and the result goes to the output through a softmax function. After training, the first matrix can be used as the word embedding.
I see many implementations use 3 layers (1 embedding layer plus 2 more layers), which contradicts my understanding above. Some example implementations here, here and here.
The following three lines of code are commonly used to implement the 3-layer model:
self.embeddings = nn.Embedding(vocab_size, embedding_dim)     # lookup table (1st matrix)
self.linear1 = nn.Linear(context_size * embedding_dim, 128)   # extra hidden layer over the concatenated context
self.linear2 = nn.Linear(128, vocab_size)                     # output layer (scores over the vocabulary)
My question is: if my understanding is correct, why are they using 3 layers? Are there any advantages?
One obvious disadvantage, I think, is that it is computationally more expensive.
Word2vec resembles the idea of an autoencoder (which also has two layers); deviating from this proven idea might harm the embedding quality. Am I right?
Another important point: according to the paper that I mentioned above, for multiple context words the average of the vectors is projected onto the projection layer. But instead of averaging, they are concatenating the vectors. Why is that? Is there any advantage?
Also, they are using a non-linearity at the hidden layer, which I think will create a serious performance issue when training on a huge amount of data. Right?
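For reference, here is a minimal sketch of the two-matrix CBOW described in the question (assuming PyTorch; the class and variable names are illustrative, not taken from the linked implementations):

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)       # 1st matrix (lookup table)
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)  # 2nd matrix

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer token ids
        vectors = self.embeddings(context_ids)   # (batch, context_size, embedding_dim)
        projected = vectors.mean(dim=1)          # average of the context vectors, no non-linearity
        return self.output(projected)            # logits over the vocabulary (softmax happens in the loss)

model = CBOW(vocab_size=5000, embedding_dim=100)
logits = model(torch.randint(0, 5000, (8, 4)))   # 8 examples, 4 context words each
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5000, (8,)))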

Merging a word embedding trained on a specialized topic to pre trained word embeddings

I have two word embeddings: a pretrained GloVe and one that I've trained on medical documents. The pretrained vectors contain more words, but my word vectors have better representations for medical terms. I want to fuse the two sets of embeddings.
GloVe (200d) has 4 million terms, and about 10% of these are also found in my own embedding (also 200d). Instead of something simple like concatenating the two (which would result in a lot of zeros), I was wondering whether a neural network that maps a vector from the GloVe space to my own embedding space would help. Specifically:
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(units=100, input_dim=200))
model.add(Activation('sigmoid'))
model.add(Dense(units=50))
model.add(Activation('sigmoid'))
model.add(Dense(units=100))
model.add(Activation('sigmoid'))
model.add(Dense(units=200))
model.add(Activation('linear'))
model.compile(optimizer='rmsprop', loss='mse')
model.fit(x_train, y_train, epochs=10, batch_size=32)
The results were quite poor, and I wonder if it is because the methodology is incorrect or if the model is not tuned properly.
Sets of word-vectors that weren't trained together have no essential relationship to each other – the distances and orientations are only interpretable within a set of vectors developed under correlated constraints. (As a simple example: if you took a set of word-vectors and negated all their coordinates, the original and negated sets would each be equally good at finding related words, solving analogies, etc., but distances/directions between words taken from different sets would be nearly meaningless.)
Devising a mapping transformation between the two is a reasonable idea; it's mentioned in one of the original Word2Vec papers ("Exploiting Similarities among Languages for Machine Translation") and in the "Skip-Thought Vectors" paper, section 2.2 ("vocabulary expansion").
In those cases a linear transformation matrix is learned - not a multi-layer mapping as your code excerpt seems to suggest.
The gensim library for (among other things) working with word vectors recently added a facility for learning and applying such transformations, in its TranslationMatrix class. With your 10% vocabulary overlap, it might be suitable for your purposes.
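For illustration, here is a minimal sketch of learning a single linear map from the overlapping vocabulary with ordinary least squares (this is not gensim's implementation; the toy glove_vecs and medical_vecs dicts stand in for the real word-to-200d-vector lookups):

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the real embeddings: word -> 200-d vector.
glove_vecs = {w: rng.normal(size=200) for w in ['heart', 'attack', 'car', 'road']}
medical_vecs = {w: rng.normal(size=200) for w in ['heart', 'attack', 'infarction']}

shared = [w for w in medical_vecs if w in glove_vecs]   # overlapping vocabulary
X = np.stack([glove_vecs[w] for w in shared])           # source (GloVe) vectors
Y = np.stack([medical_vecs[w] for w in shared])         # target (specialized) vectors

# Least-squares solution of X @ W ≈ Y: a single linear map, no hidden layers.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Project GloVe-only words into the specialized space and merge the vocabularies.
merged = dict(medical_vecs)
for word, vec in glove_vecs.items():
    if word not in merged:
        merged[word] = vec @ W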
