Doesn't keras.layers.Flatten lose information? - keras

Brand new to keras and ML in general. I'm looking at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/, and it uses Flatten between Embedding and Dense because Embedding produces a 2D vector but Dense requires a single dimension.
I'm sure I'm missing something obvious here, but why doesn't this lose which words are in which input vectors? How are we able to still know that input #3 was "nice work" and is associated with label #3, 1, for "positive"?
I guess the original dimensions are retained from the original input and then somehow restored for Dense's output? Or am I just totally missing a major conceptual aspect?
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Thanks for any guidance!

The Embedding layer gives you a vector for each word token, so the output for each sample is 2-D (sequence length × embedding size). We need to flatten it before any classifier block.
Some information is lost. For example, when we use convolutional layers and then flatten the feature maps, the spatial information is lost. But by then the conv layers have already extracted the most important features, and we feed those features to the fully connected layers.
In your example, the temporal dimension is no longer maintained; usually it is preferable to pass the output of the embedding layer to an RNN/Conv layer for further feature extraction.
Flatten is applied only to the non-batch dimensions, meaning the examples are still kept separate (if that is what you mean).
For each sample, say "nice work", we get two vectors (one for "nice", one for "work"). Since we only want the overall sentiment of the sentence, once the features are extracted we can apply Flatten.
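To make the shapes concrete, here is a small sketch of the same model with the dimensions annotated (the vocab_size and max_length values below are made up for illustration, not from the tutorial):
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, max_length = 50, 4  # illustrative values
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))  # per sample: (max_length, 8)
model.add(Flatten())                                          # per sample: (max_length * 8,) = (32,)
model.add(Dense(1, activation='sigmoid'))
model.summary()
# The batch dimension is untouched: sample #3 is still row #3 after Flatten,
# so it stays aligned with label #3.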

Related

What is models.common.C3 in yolov5 model?

[Yolo model summary][1]
Also, can someone explain the values in the arguments column?
[1]: https://i.stack.imgur.com/weBPt.png
I am studying the yolov5 architecture right now, so do not take my answer as absolute truth, but to my understanding the C3 layer is a CSP bottleneck that includes 3 convolutional layers. Essentially, it applies a convolution to the input tensor and concatenates the result with the same input tensor passed through a convolution AND a series of bottleneck layers with e=1. Then the whole thing is passed through another convolution layer. CSP stands for Cross Stage Partial layer.
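As a rough sketch of that idea (simplified, not the exact ultralytics code; their Conv blocks with BatchNorm and activation are reduced to plain convolutions here):
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # residual bottleneck with e=1 (hidden channels == output channels)
    def __init__(self, channels):
        super().__init__()
        self.cv1 = nn.Conv2d(channels, channels, 1)
        self.cv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    # CSP bottleneck with 3 convolutions: two parallel 1x1 convs, one branch goes
    # through n bottlenecks, then the branches are concatenated and passed through a final 1x1 conv
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.cv2 = nn.Conv2d(c_in, c_hidden, 1)
        self.cv3 = nn.Conv2d(2 * c_hidden, c_out, 1)
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))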
As for the first column, it is used in the forward function of the model to determine which tensor to use as the input of each layer. The majority of the layers have '-1', meaning they take the previous layer's output as their input, but there are Concat layers that take different levels as input to recreate the PANet architecture in the neck.
For further questions, I suggest you ask in the Yolov5 github issues section, as they are often quick to answer.

Word2vec CBOW model implementations, deviations from the original algorithm

I am trying to implement the CBOW model in pytorch.
What I understood from the explanation of word2vec is that word2vec has 2 layers (and therefore 2 matrices). The first matrix contains the low-dimensional word vectors; it is actually a lookup table, and the vector representation of a word is projected onto the projection layer (no non-linearity, therefore not a hidden layer). The word vectors are then multiplied by the 2nd matrix, and the result goes to the output through a softmax function. After training, the first matrix can be used as the word embedding.
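If I'm not mistaken, the two-matrix model I described would look roughly like this in pytorch (just my own sketch, the names are made up):
import torch
import torch.nn as nn

class TwoMatrixCBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # first matrix: the lookup table holding the word vectors
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # second matrix: maps the projected vector back to vocabulary scores
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, context_ids):               # context_ids: (batch, context_size)
        vectors = self.embeddings(context_ids)    # (batch, context_size, embedding_dim)
        projected = vectors.mean(dim=1)           # average the context vectors, no non-linearity
        return self.output(projected)             # (batch, vocab_size) logits for the softmax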
I see many implementations use 3 layers (1 embedding layer plus 2 more layers), which contradicts my understanding above. Some example implementations here, here and here.
The following three lines of code are commonly used to implement the 3-layer model:
self.embeddings = nn.Embedding(vocab_size, embedding_dim)
self.linear1 = nn.Linear(context_size * embedding_dim, 128)
self.linear2 = nn.Linear(128, vocab_size)
My questions are: if my understanding is correct, then why are they using 3 layers? Are there any advantages?
One obvious disadvantage, I think, is that it will be computationally more expensive.
Word2vec resembles the idea of an autoencoder (which also has two layers); deviating from this proven idea might harm the embedding quality. Am I right?
Another important thing is that, according to the paper I mentioned above, for multiple context words the average of the vectors is projected onto the projection layer. But instead of averaging, they are concatenating the vectors. Why is that? Are there any advantages?
Also, they are using a non-linearity at the hidden layer, which I think will create a serious performance issue when training with a huge amount of data. Right?

I need to understand this LSTM and Masking layers result

I'm new to keras and LSTMs. Could you please explain this model.summary() to me?
It is from rasa core training:
![model after training][1]
Also, what is the Masking layer doing and what does the value -1 in it mean?
A Masking layer is meant to "ignore steps" in sequences.
Your LSTM is working with sequences of 5 steps and 42 features per step.
If all features in a step have the same value defined in Masking (-1 in the example), that step will be ignored during training.
The idea is to simulate variable length sequences.
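A minimal sketch of that setup (the layer sizes are assumed, not taken from the actual rasa model):
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

model = Sequential()
model.add(Masking(mask_value=-1.0, input_shape=(5, 42)))  # steps whose 42 features are all -1 are skipped
model.add(LSTM(32))
model.add(Dense(10, activation='softmax'))
model.summary()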
Not sure exactly what you don't understand, but model.summary()
prints a summary representation of your model (keras.io).
It lists all layers used in the given model with their respective output sizes.
This particular model obviously starts with a masking layer for input sequences (I guess because of padding) and is followed by the simplest LSTM model possible.

TimeDistributed Layers vs. ConvLSTM-2D

Could anyone explain the differences between TimeDistributed layers (the Keras wrapper) and ConvLSTM2D (convolutional LSTM): their purposes, usage, etc.?
Both apply to a sequence of data.
TimeDistributed is a very straightforward layer wrapper which simply applies a layer (usually a Dense layer) to each time step independently. You need it when you want to change the shape of the output tensor, especially the feature dimension, while leaving the sample size and number of time steps untouched.
ConvLSTM2D is much more complex. You need to understand CNN and RNN layers first, where LSTM is one of the most popular RNNs. An LSTM itself is applied to a sequence of tensors, which is used for NLP and time series, and at each time step the input is 1-dimensional. A CNN, the conv part, is usually used to learn from images, which are 2-dimensional but have no sequence (time step). Combined together, ConvLSTM is used to learn from images in a sequence, such as video.
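A small sketch to contrast the two (the shapes are chosen purely for illustration):
from keras.models import Sequential
from keras.layers import InputLayer, TimeDistributed, Dense, ConvLSTM2D

# TimeDistributed: apply the same Dense layer to every time step
td_model = Sequential([
    InputLayer(input_shape=(10, 16)),        # 10 time steps, 16 features each
    TimeDistributed(Dense(8)),               # output: (batch, 10, 8)
])

# ConvLSTM2D: a recurrent layer whose input at each step is image-like, e.g. video frames
convlstm_model = Sequential([
    InputLayer(input_shape=(10, 64, 64, 3)), # 10 frames of 64x64 RGB
    ConvLSTM2D(16, (3, 3), padding='same'),  # output: (batch, 64, 64, 16)
])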

How to get output of specific time step in lstm layer using keras?

I was trying to use keras to do an entity relation extraction task.
My model looks like the example code in keras imdb_bidirectional_lstm.py:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
However, unlike the imdb classification task, a relation is tied to specific entities in the sentence, and there may be several relations in one sentence. So I want to get the outputs of specific entity words from the BiLSTM layer and then concatenate them.
For example, take the sentence "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel." There are several relations in this sentence. If I want to get the relation between "cameraman" and "tank", I need to take the outputs for "cameraman" and "tank" from the BiLSTM layer and feed them into an MLP. So what should I do to get the outputs of "cameraman" and "tank" from the BiLSTM layer? I have tried the output attribute of the layer, but it seems infeasible.
It may sound confusing. To be brief: how do I get the output of a specific time step in an LSTM layer?
Any suggestion will be appreciated. Thank you very much!
The return_sequences=True parameter gives you the outputs of all time steps. Then you need to write a custom layer to extract the outputs of the specific steps you need; AFAIK there is no direct way in keras to achieve that. It should not be too hard to write such a custom layer, though.
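A rough sketch with the functional API, assuming for simplicity that the two entity positions are fixed and known in advance (in practice you would probably feed the positions in as an extra input and gather them):
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Lambda, Dense, concatenate

maxlen, max_features = 100, 20000
entity_pos_1, entity_pos_2 = 3, 9   # hypothetical token positions of "cameraman" and "tank"

words = Input(shape=(maxlen,))
x = Embedding(max_features, 128, input_length=maxlen)(words)
x = Bidirectional(LSTM(64, return_sequences=True))(x)     # (batch, maxlen, 128): one vector per step

h1 = Lambda(lambda t: t[:, entity_pos_1, :])(x)           # output at the first entity's time step
h2 = Lambda(lambda t: t[:, entity_pos_2, :])(x)           # output at the second entity's time step
merged = concatenate([h1, h2])

out = Dense(1, activation='sigmoid')(merged)
model = Model(inputs=words, outputs=out)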

Resources