Combining teacher forcing and attention
I have searched a lot but could not find a tutorial that implements an encoder-decoder with teacher forcing and an attention mechanism such as Luong attention. I am confused about how attention would be computed in the decoder without a loop over time steps, since teacher forcing on its own needs no loop.
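A minimal sketch of how the two combine in an RNN decoder (PyTorch; all names and sizes here are illustrative, not from any particular tutorial). Teacher forcing only changes what is fed in at each step (the ground-truth previous token instead of the model's own prediction); a Luong-style attention is still computed inside a per-time-step loop:
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, vocab = 64, 100                          # illustrative sizes
emb = nn.Embedding(vocab, hid_dim)
rnn_cell = nn.GRUCell(hid_dim, hid_dim)
out_proj = nn.Linear(2 * hid_dim, vocab)

encoder_outputs = torch.randn(10, hid_dim)        # one vector per source position
h = torch.zeros(1, hid_dim)                       # decoder hidden state
targets = torch.randint(0, vocab, (6,))           # ground-truth target token ids

logits = []
for t in range(len(targets) - 1):
    x = emb(targets[t].unsqueeze(0))              # teacher forcing: feed the true previous token
    h = rnn_cell(x, h)
    scores = h @ encoder_outputs.T                # Luong 'dot' score for each source position
    attn = F.softmax(scores, dim=1)
    context = attn @ encoder_outputs              # weighted sum of encoder outputs
    logits.append(out_proj(torch.cat((h, context), dim=1)))
# the loss would be cross-entropy of logits[t] against targets[t + 1]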
Related
I am looking to perform a binary classification on each frame of a video. I have ground truth labels and the feature vectors for each frame of several videos, but am not sure how I could approach this as a time series problem to train an RNN.
How could I use the predictions on previous frames in a video to help inform future predictions? I think this could be accomplished with a one-to-one RNN, though I can't find any relevant examples.
Your task is quite similar to the named-entity recognition (NER) task in NLP. In NER you have a sequence of words, and for each word you want to determine whether it is a named entity (say a person or company name) or not.
A many-to-many RNN architecture should be suitable for this task.
Please check for example the following article:
https://towardsdatascience.com/named-entity-recognition-ner-using-keras-bidirectional-lstm-28cd3f301f54
Instead of words you have video frame features, so you would need to replace the initial Embedding layer with a Dense layer. Also, the article above uses N classes as labels rather than just a binary label. So, as a starting point, you can try something like:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

n_features = 128  # dimensionality of your per-frame feature vectors (adjust to your data)
model = Sequential()
model.add(TimeDistributed(Dense(16, activation="relu"), input_shape=(None, n_features)))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(1, activation="sigmoid")))  # sigmoid for a per-frame binary label
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
The model takes a sequence of video-frame feature vectors as input for each video and outputs one prediction per frame, trained against the corresponding sequence of ground-truth labels.
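For example, using the model defined above and hypothetical toy data shaped (n_videos, n_frames, n_features) for the features and (n_videos, n_frames, 1) for the per-frame labels, training could look like:
import numpy as np

# toy placeholder data: 8 videos, 50 frames each, 128 features per frame
X = np.random.rand(8, 50, 128).astype("float32")
y = np.random.randint(0, 2, size=(8, 50, 1)).astype("float32")

model.fit(X, y, batch_size=2, epochs=5)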
I'm fairly new to NLP and I was reading a blog explaining the transformer model. I was quite confused about the input/output for the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is: if we already know y_true, why run this step to get the output probability? I just don't quite get the relationship between the bottom-right "Output Embedding" and the top-right "Output Probabilities". When we use the model, we wouldn't really have y_true; do we just use y_pred and feed it into the decoder instead? This might be a noob question. Thanks in advance.
I get that y_true is fed into the decoder during the training step to
combine with the output of the encoder block.
Well, yes and no.
The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.
Let's take a translation example, English to Spanish:
We have 5 dogs -> Nosotras tenemos 5 perros
The encoder will encode the English sentence and produce an attention vector as output. At the first step, the decoder is fed the attention vector and a <START> token. The decoder will (should) produce the first Spanish word, Nosotras; this is Y_t. In the next step, the decoder is again fed the attention vector, as well as the <START> token and the previous output Y_{t-1}, Nosotras. The output will be tenemos, and so on and so forth, until the decoder emits an <END> token.
The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token.
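A minimal sketch of that autoregressive loop (encode and decode_step are hypothetical placeholders for the encoder and for one decoder step, not any specific library API):
def greedy_decode(encode, decode_step, src_ids, start_id, end_id, max_len=50):
    """Greedy autoregressive decoding: the decoder is fed its own previous outputs."""
    memory = encode(src_ids)                     # encoder output (the "attention vector" above)
    out_ids = [start_id]
    for _ in range(max_len):
        next_id = decode_step(memory, out_ids)   # predict the next token id
        out_ids.append(next_id)
        if next_id == end_id:
            break
    return out_ids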
In addition to Bhupen's answer, it is worth highlighting the difference from RNN-based seq2seq models, for which this sequential processing is always necessary.
Transformers have the fundamental advantage that they can be trained with parallel processing: during training, the decoder processes the whole (shifted and masked) target sequence in a single forward pass, just like the encoder processes the source sequence. This allows for substantial speedups and therefore much larger training sets, to which transformer models owe much of their success.
So, to answer the original question, the input to the decoder is:
during training: all of the correct (ground-truth) target tokens at once, shifted and masked, so they can be processed in parallel (teacher forcing);
during inference: the decoder's own previous outputs, starting with the special token <start>, until the special token <end> is predicted.
You can look at the excellent TensorFlow tutorial implementation; in particular, check out:
Inference, which is indeed the sequential processing described in the previous answer: https://www.tensorflow.org/text/tutorials/transformer#run_inference
But also the decoder, where you can see that there is no inherent sequential processing (only iteration over the layers): https://www.tensorflow.org/text/tutorials/transformer#the_decoder_layer
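A minimal sketch of the 'shifted and masked' training input (pure NumPy; the token ids and the <start>=1, <end>=2 convention are made up for illustration):
import numpy as np

target = np.array([1, 7, 12, 5, 2])  # <start> w1 w2 w3 <end> (made-up ids)
decoder_input = target[:-1]          # [1, 7, 12, 5] -> fed to the decoder in parallel
decoder_labels = target[1:]          # [7, 12, 5, 2] -> what the decoder must predict

# causal (look-ahead) mask: position i may only attend to positions <= i
T = len(decoder_input)
causal_mask = np.tril(np.ones((T, T), dtype=bool))
print(causal_mask.astype(int))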
I have been turning this thought over in my head for a long time now. In NMT, we pass the source-language text to the encoder and the target-language text to the decoder, and the system learns the conditional probability of each target word given the previous words, e.g. P(word_x | previous n words). We train this with teacher forcing.
But what if I pass the input sentence again as input to the decoder stage, instead of the target sentence? What would it learn in this case? I'm guessing it would learn to predict the most probable next word in the sentence given the previous text, right? What are your thoughts?
Thanks in advance
In that case, you would be learning a model that copies the input symbols to the output. It is trivial for the attention mechanism to learn the identity correspondence between the encoder and decoder states, and RNNs can easily implement a counter. It therefore won't provide any realistic estimate of next-word probability; it will simply assign most of the probability mass to the corresponding word in the source sentence.
I am writing a sequence-to-sequence neural network in PyTorch. In the official PyTorch seq2seq tutorial, there is code for an attention decoder that I cannot understand and that I think might contain a mistake.
It computes the attention weights at each time step by concatenating the output and the hidden state at that time, and then multiplying by a matrix to get a vector of size equal to the output sequence length. Note that these attention weights don't depend on the encoder sequence (named encoder_outputs in the code), which I think they should.
Also, the paper cited in the tutorial lists three different score functions that can be used to compute attention weights (Section 3.1 in the paper). None of these functions is just concatenating and multiplying by a matrix.
So it seems to me that the code in the tutorial is mistaken both in the function it applies and in the arguments that are passed to this function. Am I missing something?
The tutorial implements a simplified version of the attention mechanisms from the Luong paper that you mention.
It just uses a linear layer to combine the input embedding and the decoder RNN hidden state. This is sometimes called 'location-based' attention, because it does not depend on the encoder outputs. It then applies a softmax to compute the attention weights, and the rest of the process proceeds as usual.
This is not always a bad thing: with content-based attention over the encoder outputs, the mechanism might attend to a previous token, so the attention would not be monotonic, which can hurt your model.
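A simplified PyTorch sketch of what the question describes (shapes and variable names are illustrative, not the tutorial's exact code):
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, hid_dim, max_len = 32, 64, 10           # illustrative sizes
attn = nn.Linear(emb_dim + hid_dim, max_len)     # maps [embedded; hidden] -> one score per source position

embedded = torch.randn(1, emb_dim)               # current decoder input embedding
hidden = torch.randn(1, hid_dim)                 # current decoder hidden state
encoder_outputs = torch.randn(max_len, hid_dim)  # all encoder states

# note: the scores depend only on the decoder side, not on encoder_outputs
attn_weights = F.softmax(attn(torch.cat((embedded, hidden), dim=1)), dim=1)  # (1, max_len)
context = attn_weights @ encoder_outputs                                     # (1, hid_dim)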
To implement the attentions from the Luong paper, I suggest using the 'concat' attention: concatenate the decoder hidden state with each encoder output, apply the matrix W_a to transform the concatenation to an arbitrary dimension of your choice, and finally apply the vector v_a to reduce it to a scalar score per encoder position; the softmax over these scores gives the attention weights used to form the context vector.
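A minimal PyTorch sketch of Luong's 'concat' score, score(h_t, h_s) = v_a^T tanh(W_a [h_t; h_s]) (sizes are illustrative):
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, attn_dim, src_len = 64, 32, 10            # illustrative sizes
W_a = nn.Linear(2 * hid_dim, attn_dim, bias=False)
v_a = nn.Linear(attn_dim, 1, bias=False)

h_t = torch.randn(1, hid_dim)                      # current decoder hidden state
encoder_outputs = torch.randn(src_len, hid_dim)    # one state per source position

concat = torch.cat((h_t.expand(src_len, -1), encoder_outputs), dim=1)  # (src_len, 2*hid_dim)
scores = v_a(torch.tanh(W_a(concat))).squeeze(1)                       # (src_len,)
attn_weights = F.softmax(scores, dim=0)                                # (src_len,)
context = attn_weights @ encoder_outputs                               # (hid_dim,)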
In this algorithm, attn_weights depends only on the decoder-side parameters.
The linear layer produces an output of length equal to the maximum sequence length (here 10); this is the attention vector.
We then multiply it with encoder_outputs. At every training step, the weights of this layer are updated by backpropagation; in other words, the layer gradually learns, from the decoder side, how to weight the encoder outputs.
Let me give an example:
Our task is to translate from English to German.
I want to sing a song. -> Ich möchte ein Lied singen.
On the decoder side, the verb singen comes at the end. The attention layer sees the decoder state and learns which parts of the input encoding to use. When you multiply its output with encoder_outputs, you get a weighted combination with high values at the relevant positions.
So, in fact, this way the model learns which parts of the input it must pay attention to when the decoder sees a given sentence pattern in German. So the direction of learning is correct, I think.
I am stuck on a problem where I want to ensure that specific tokens/words appear in the abstractive-style sentences produced during decoding.
I am working with deep learning models such as LSTMs and the transformer for generating short sentences (100-200 characters). I want certain words, such as places or nouns (e.g. brand names), to be present in the generated text.
I am not sure whether there has been any research on this; I couldn't find a paper after an extensive search.
TIA, any leads or suggestions are appreciated. :)
I am not sure, but you could try to condition your output on those specific words. Your model could be like a seq2seq decoder, but instead of attending to the encoder outputs it would attend to those specific words.
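A very rough sketch of that idea (all names and sizes are hypothetical; it only shows the shape of "attend over the keyword embeddings instead of the encoder outputs"):
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, hid_dim, n_keywords = 32, 64, 3
keyword_emb = torch.randn(n_keywords, emb_dim)   # embeddings of the required words (e.g. brand names)
h_t = torch.randn(1, hid_dim)                    # current decoder hidden state

# dot-product attention of the decoder state over the keyword embeddings
proj = nn.Linear(hid_dim, emb_dim, bias=False)   # project decoder state into embedding space
scores = proj(h_t) @ keyword_emb.T               # (1, n_keywords)
weights = F.softmax(scores, dim=1)
keyword_context = weights @ keyword_emb          # (1, emb_dim), fed to the decoder alongside its usual input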