I'm fairly new to NLP and I was reading a blog explaining the transformer model. I was quite confused about the input/output for the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is, if we already know y_true, why run this step to get the output probability? I just don't quite get the relationship between the bottom right "Output Embedding" and the top right "Output Probabilities". When we use the model, we wouldn't really have y_true, do we just use y_pred and feed them into the decoder instead? This might be a noob question. Thanks in advance.
I get that y_true is fed into the decoder during the training step to
combine with the output of the encoder block.
Well, yes and no.
The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.
Let's take a translation example: English to Spanish.
We have 5 dogs -> Nosotras tenemos 5 perros
The encoder will encode the English sentence and produce an attention vector as output. At the first step, the decoder is fed the attention vector and a <START> token. The decoder will (should) produce the first Spanish word, Nosotras. This is Yt. In the next step, the decoder is again fed the attention vector, together with the <START> token and the previous output Yt-1, Nosotras. The output will be tenemos, and so on, until the decoder emits an <END> token.
The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token.
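The loop described above can be sketched in a few lines. This is only an illustration: `toy_decoder_step` and the `NEXT_TOKEN` lookup table are hypothetical stand-ins for a trained decoder, and the encoder output is just a placeholder string.

```python
# Toy stand-in for a trained decoder: maps the tokens generated so far
# to the next token. A real decoder would also attend to the encoder output.
NEXT_TOKEN = {
    ("<START>",): "Nosotras",
    ("<START>", "Nosotras"): "tenemos",
    ("<START>", "Nosotras", "tenemos"): "5",
    ("<START>", "Nosotras", "tenemos", "5"): "perros",
    ("<START>", "Nosotras", "tenemos", "5", "perros"): "<END>",
}


def toy_decoder_step(encoder_output, generated):
    """Hypothetical decoder step: return the next token given the prefix."""
    return NEXT_TOKEN[tuple(generated)]


def greedy_decode(encoder_output, max_len=10):
    """Generate tokens one at a time, feeding each prediction back in."""
    generated = ["<START>"]
    for _ in range(max_len):
        next_token = toy_decoder_step(encoder_output, generated)
        if next_token == "<END>":
            break
        generated.append(next_token)  # the output becomes part of the next input
    return generated[1:]  # drop the <START> token


translation = greedy_decode(encoder_output="encoded: We have 5 dogs")
print(" ".join(translation))
```

Each call to the decoder sees everything generated so far, which is why inference has to run step by step.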
In addition to @Bhupen's answer, it is worth highlighting the difference from seq-to-seq models based on RNNs, for which this sequential processing is always necessary.
Transformers have the fundamental advantage that they can be trained with parallel processing. During training, the decoder can process all positions of the target sequence in a single parallel forward pass, just like the encoder. This allows for substantial speedups and therefore larger training sets, to which transformer models owe much of their success.
So to answer the original question: The input to the decoder is
during training: all correct predictions (the ground-truth target tokens), shifted and masked, processed at once in parallel
during inference: the previous outputs of the decoder, starting with the special token <start>, until another special token <end> is predicted.
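To make "shifted and masked" concrete, here is a minimal sketch of how the training-time decoder input could be prepared. The helper names `shift_right` and `causal_mask` are made up for this example and are not taken from the TensorFlow tutorial.

```python
def shift_right(target_tokens, start_token="<start>"):
    """Decoder input: the target sequence shifted right by one position."""
    return [start_token] + target_tokens[:-1]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


target = ["Nosotras", "tenemos", "5", "perros", "<end>"]
decoder_input = shift_right(target)
mask = causal_mask(len(decoder_input))

# Every position is trained at once, but each one only sees earlier tokens.
for i, label in enumerate(target):
    visible = [tok for tok, allowed in zip(decoder_input, mask[i]) if allowed]
    print(f"position {i}: sees {visible} -> should predict {label!r}")
```

Because the mask hides future tokens, knowing y_true does not let the decoder cheat: at each position it still has to predict the next token from the prefix alone.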
You can look at the excellent TensorFlow implementation. In particular, check out:
Inference, which is indeed the sequential processing as described in the previous answer https://www.tensorflow.org/text/tutorials/transformer#run_inference
But also check the decoder, where you can see that there is no inherent sequential nature (only iterating through layers) https://www.tensorflow.org/text/tutorials/transformer#the_decoder_layer
I am looking to perform a binary classification on each frame of a video. I have ground truth labels and the feature vectors for each frame of several videos, but am not sure how I could approach this as a time series problem to train an RNN.
How could I use the predictions on previous frames in a video to help inform future predictions? I think this could be accomplished with a one-to-one RNN, though I can't find any relevant examples.
Your task is quite similar to the named-entity recognition task in NLP. In named-entity recognition you have a sequence of words, and for each word you want to determine whether it is a named entity (say, a person or company name) or not.
A many-to-many RNN architecture should be suitable for this task.
Please check for example the following article:
https://towardsdatascience.com/named-entity-recognition-ner-using-keras-bidirectional-lstm-28cd3f301f54
Instead of words you have video frame features, so you would need to replace the first Embedding layer with a Dense layer. Also, the above article uses N classes as labels rather than a binary label. So, as a starting point, you can try something like:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

model = Sequential()
# Per-frame projection of the feature vector (replaces the Embedding layer)
model.add(TimeDistributed(Dense(16, activation="relu")))
model.add(LSTM(32, return_sequences=True))
# One sigmoid unit per frame, so the output is a probability for the binary label
model.add(TimeDistributed(Dense(1, activation="sigmoid")))
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
The model would take a sequence of video frame features as input and produce a sequence of labels as output, trained against the ground truth labels for each video.
I am currently working on an NLP project involving recurrent neural networks. I implemented an LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought that the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in the training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM as long as it is a single-layer LSTM. If you check the documentation (here), you can see that an LSTM outputs a tensor and a tuple of tensors. The tuple contains the hidden and cell states for the last sequence step. What each dimension of the output tensor means depends on how you initialized your network: depending on batch_first, either the first or the second dimension is the batch dimension, and the rest is the sequence of per-token hidden representations you want.
If you use a packed sequence as input, it is a bit of a different story.
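As a rough illustration of the shapes involved, here is a minimal sketch with a single-layer nn.LSTM and batch_first=True. The sizes and the random input are arbitrary assumptions standing in for real embedded tokens, and packed sequences are not covered.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len, embed_dim, hidden_dim = 1, 5, 8, 16

lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
embedded_tokens = torch.randn(batch_size, seq_len, embed_dim)  # stand-in for embeddings

with torch.no_grad():
    output, (h_n, c_n) = lstm(embedded_tokens)

# output holds the hidden state for every token: (batch, seq_len, hidden_dim)
print(output.shape)
# h_n holds only the last step: (num_layers, batch, hidden_dim)
print(h_n.shape)

# One hidden vector per token of the first (and only) sequence in the batch.
per_token_states = output[0]
```

With this layout there is no need to feed tokens one at a time: a single forward pass over the whole sequence already returns a hidden vector for each position.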
I have been turning this thought over in my head for a long time now. In NMT, we pass the text in the source language to the encoder of the seq2seq model and the text in the target language to the decoder, and the system learns the conditional probability of each target-language word given the preceding words, e.g. P(word x | previous n words). We train this with teacher forcing.
But what if I pass the input sentence again as input to the decoder instead of the target sentence? What would it learn in this case? I'm guessing it would learn to predict the most probable next word in the sentence given the previous text, right? What are your thoughts?
Thanks in advance
In that case, you would be learning a model that copies the input symbol to the output. It is trivial for the attention mechanism to learn the identity correspondence between the encoder and decoder states. Moreover, RNNs can easily implement a counter. It thus won't provide any realistic estimate of the probability; it will assign most of the probability mass to the corresponding word in the source sentence.
I am writing a sequence-to-sequence neural network in PyTorch. In the official PyTorch seq2seq tutorial, there is code for an Attention Decoder that I cannot understand and that I think might contain a mistake.
It computes the attention weights at each time step by concatenating the output and the hidden state at this time, and then multiplying by a matrix to get a vector of size equal to the output sequence length. Note that these attention weights don't depend on the encoder sequence (named encoder_outputs in the code), which I think they should.
Also, the paper cited in the tutorial lists three different score functions that can be used to compute attention weights (section 3.1 in the paper). None of these functions is just concatenating and multiplying by a matrix.
So it seems to me that the code in the tutorial is mistaken both in the function it applies and the arguments that are passed to this function. Am I missing something?
This tutorial has a simplified version of the attention mechanisms in the Luong paper that you mentioned.
It just uses a linear layer to combine the input embedding and the decoder RNN hidden state. This is sometimes called 'location-based' attention, because it does not depend on the encoder outputs. Then it applies the softmax to compute the attention weights, and the process continues as it normally would.
This is not always a bad thing: when the scores depend on the encoder outputs, the attention mechanism might attend back to an earlier token, in which case the attention would no longer be monotonic and the model could fail.
To implement the attention from the Luong paper, I suggest using the 'concat' attention, after applying linear layers to both the decoder hidden state and the encoder outputs. The matrix W_a then transforms the concatenated result to a dimension of your choice, and finally v_a is a vector that reduces it to a scalar score for each encoder position.
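The difference between the two variants can be sketched in plain Python. This is a toy illustration under stated assumptions: all weights and vectors are made-up values, the function names are hypothetical, and the extra linear layers mentioned above are left out for brevity.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def location_attention(W, embedded, hidden):
    """Tutorial-style weights: computed from decoder-side inputs only."""
    return softmax(matvec(W, embedded + hidden))


def concat_attention(W_a, v_a, hidden, encoder_outputs):
    """Luong 'concat' score: v_a . tanh(W_a [h_t; h_s]) for each source state h_s."""
    scores = []
    for h_s in encoder_outputs:
        projected = [math.tanh(z) for z in matvec(W_a, hidden + h_s)]
        scores.append(sum(v * p for v, p in zip(v_a, projected)))
    return softmax(scores)


def context_vector(weights, encoder_outputs):
    """Weighted sum of the encoder outputs."""
    dim = len(encoder_outputs[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_outputs)) for d in range(dim)]


# Made-up toy values: 3 source positions, 2-dimensional states.
embedded, hidden = [0.5, -0.2], [0.1, 0.3]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.2, 0.1, 0.0, 0.4], [0.0, 0.3, 0.5, 0.1], [0.1, 0.1, 0.1, 0.1]]  # 3 x 4
W_a = [[0.3, -0.1, 0.2, 0.5], [0.1, 0.4, -0.3, 0.2]]  # 2 x 4
v_a = [1.0, -1.0]

loc = location_attention(W, embedded, hidden)
con = concat_attention(W_a, v_a, hidden, encoder_outputs)
print("location-based:", loc, context_vector(loc, encoder_outputs))
print("concat:        ", con, context_vector(con, encoder_outputs))
```

In both cases the encoder outputs enter the context vector through the weighted sum; only in the 'concat' variant do they also influence the weights themselves.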
In the algorithm, attn_weights depends on the decoder parameters.
Then we get the output of a linear layer (here of size 10). This is the attention vector.
Then we multiply this with encoder_outputs. So at every iteration, the layer that produces attn_weights is updated by backpropagation. In words, it is learning in the reverse direction.
Let me give an example:
Our task is to translate from English to German.
I want to sing a song. -> Ich möchte ein Lied singen.
In the decoder, the verb singen is at the end. So the decoder's attn_weights see the decoder output and learn which parts of the input encoding to apply. When you multiply this value with encoder_outputs, you get a matrix that has high values at the necessary positions.
So in fact, in this way it learns which parts of the input it must pay attention to when the decoder sees a given sentence pattern in German. So I think the direction of learning is correct.
I am training an encoder-decoder LSTM in Keras for text summarization on the CNN dataset, with the following architecture:
Picture of bidirectional encoder-decoder LSTM
I am pretraining the word embeddings (of size 256) using skip-gram
I then pad the input sequences with zeros so all articles are of equal length
I put a vector of 1's in each summary to act as the "start" token
I use MSE, RMSProp, and tanh activation in the decoder output layer
Training: 20 epochs, batch_size=100, clip_norm=1,dropout=0.3, hidden_units=256, LR=0.001, training examples=10000, validation_split=0.2
The network trains and training and validation MSE go down to 0.005, however during inference, the decoder keeps producing a repetition of a few words that make no sense and are nowhere near the real summary.
My question is, is there anything fundamentally wrong in my training approach, the padding, loss function, data size, training time so that the network fails to generalize?
Your model looks ok, except for the loss function. I can't figure out how MSE is applicable to word prediction. Cross-entropy loss looks like a natural choice here.
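To illustrate why cross-entropy fits word prediction, here is a small sketch that treats the decoder output as a distribution over a vocabulary. The vocabulary and logits are invented for this example.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability of the correct word under softmax(logits)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# Made-up vocabulary and decoder scores for one output position.
vocab = ["the", "cat", "sat", "<end>"]
logits = [2.0, 0.5, 0.1, -1.0]

loss_if_the = cross_entropy(logits, vocab.index("the"))
loss_if_end = cross_entropy(logits, vocab.index("<end>"))
print(f"loss when the true word is 'the':   {loss_if_the:.3f}")
print(f"loss when the true word is '<end>': {loss_if_end:.3f}")
```

The loss is small when the model puts high probability on the true word and large otherwise, which is exactly the signal a word predictor needs. In Keras this would typically mean a softmax output layer over the vocabulary instead of a tanh layer that regresses embedding vectors.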
Generated word repetition can be caused by the way the decoder works at inference time: you should not simply select the most probable word from the distribution, but rather sample from it. This will give more variance to the generated text. Start looking at beam search.
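The difference between always taking the most probable word and sampling can be sketched as follows. The vocabulary, logits, and the `temperature` parameter are assumptions for illustration; beam search is not shown here.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(vocab, logits):
    """Always return the single most probable word."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


def sample_pick(vocab, logits, temperature=1.0, rng=random):
    """Draw a word at random, in proportion to its softmax probability."""
    probs = softmax([x / temperature for x in logits])
    return rng.choices(vocab, weights=probs, k=1)[0]


# Made-up vocabulary and decoder scores for one output position.
vocab = ["the", "cat", "sat", "<end>"]
logits = [2.0, 1.5, 1.0, 0.2]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = [greedy_pick(vocab, logits) for _ in range(5)]
sampled = [sample_pick(vocab, logits, rng=rng) for _ in range(5)]
print("greedy: ", greedy)
print("sampled:", sampled)
```

Greedy selection returns the same word every time for the same scores, which is one way a decoder can get stuck repeating itself, while sampling spreads the choices according to the predicted distribution.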
If I were to pick a single technique to boost sequence-to-sequence model performance, it would certainly be the attention mechanism. There are lots of posts about it; you can start with this one, for example.