I have been processing this thought in my head for a long time now. In NMT, we pass the text in the source language into the encoder stage of the seq2seq model and the text in the target language into the decoder stage, and the system learns the conditional probability of each target word given the previous words, e.g. P(word x | previous n words). We train this by teacher forcing.
But what if I pass the input sentence again as input to the decoder stage instead of the target sentence? What would it learn in this case? I'm guessing it would learn to predict the most probable next word in the sentence given the previous text, right? What are your thoughts?
Thanks in advance
In that case, you would be learning a model that copies the input symbols to the output. It is trivial for the attention mechanism to learn the identity correspondence between the encoder and decoder states. Moreover, RNNs can easily implement a counter. It thus won't provide any realistic estimate of the probability; it will assign most of the probability mass to the corresponding word in the source sentence.
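To make the difference concrete, here is a minimal sketch (the function, token names, and sentences are illustrative, not from any library) of how decoder inputs and targets are built with teacher forcing, and what changes if the source sentence is fed to the decoder instead of the target:

```python
# Minimal sketch of how decoder inputs/targets are built for teacher forcing.
# All names and sentences here are made up for illustration.

def make_decoder_batch(target_tokens):
    """Standard teacher forcing: the decoder reads the target shifted right."""
    decoder_input = ["<start>"] + target_tokens   # what the decoder reads
    decoder_target = target_tokens + ["<end>"]    # what it must predict
    return decoder_input, decoder_target

source = ["we", "have", "5", "dogs"]
target = ["nosotras", "tenemos", "5", "perros"]

# Normal NMT training pair: models P(target_t | target_<t, source).
print(make_decoder_batch(target))

# The variant from the question: feed the source again instead of the target.
# The model is then trained on P(source_t | source_<t, source), i.e. it can
# simply copy the encoder input, which is why it degenerates to a copy model.
print(make_decoder_batch(source))
```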
I'm fairly new to NLP and I was reading a blog explaining the transformer model. I was quite confused about the input/output for the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is, if we already know y_true, why run this step to get the output probability? I just don't quite get the relationship between the bottom right "Output Embedding" and the top right "Output Probabilities". When we use the model, we wouldn't really have y_true, do we just use y_pred and feed them into the decoder instead? This might be a noob question. Thanks in advance.
I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block.
Well, yes and no.
The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.
Let's take a translation example: English to Spanish.
We have 5 dogs -> Nosotras tenemos 5 perros
The encoder will encode the English sentence and produce an attention vector as output. At the first step, the decoder is fed the attention vector and a <START> token, and it will (should) produce the first Spanish word, Nosotras. This is Y_t. In the next step, the decoder is fed the attention vector again, along with the <START> token and the previous output Y_{t-1}, Nosotras. The output will be tenemos, and so on and so forth, until the decoder emits an <END> token.
The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token.
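A minimal sketch of that autoregressive loop (with a dummy stand-in for the real decoder, since no specific model is given here):

```python
# Greedy autoregressive decoding loop, sketched with a dummy decoder step.
# A real decoder would condition on encoder_output and on the tokens
# generated so far; this dummy just walks through a fixed reference.

def decoder_step(encoder_output, generated_tokens):
    reference = ["nosotras", "tenemos", "5", "perros", "<end>"]
    return reference[len(generated_tokens) - 1]

def greedy_decode(encoder_output, max_len=10):
    tokens = ["<start>"]
    while len(tokens) < max_len:
        next_token = decoder_step(encoder_output, tokens)  # feeds back its own outputs
        tokens.append(next_token)
        if next_token == "<end>":
            break
    return tokens[1:]

print(greedy_decode(encoder_output=None))
# ['nosotras', 'tenemos', '5', 'perros', '<end>']
```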
In addition to @Bhupen's answer, it is worth highlighting the differences from seq2seq models based on RNNs, for which this sequential processing is always necessary.
Transformers have the fundamental advantage that you can train them with parallel processing: during training, you can use the same parallel forward pass in the decoder as in the encoder. This allows for substantial speedups and larger training set sizes, to which transformer models owe much of their success.
So to answer the original question, the input to the decoder is:
- during training: all the correct (ground-truth) predictions at once, shifted and masked so they can be processed in parallel (see the sketch after the links below);
- during inference: the previous outputs of the decoder itself, starting with the special token <start>, until another special token <end> is predicted.
You can look at the excellent TensorFlow implementation; in particular, check out:
- Inference, which is indeed the sequential processing described in the previous answer: https://www.tensorflow.org/text/tutorials/transformer#run_inference
- But also check the decoder, where you can see that there is no inherent sequential nature (only iteration through layers): https://www.tensorflow.org/text/tutorials/transformer#the_decoder_layer
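To illustrate the training-time input described above, here is a hedged generic PyTorch sketch (not the TensorFlow tutorial's code): the target is shifted, and a look-ahead (causal) mask keeps each position from attending to future tokens, so all positions are trained in one parallel pass.

```python
import torch

# Toy target sentence as token ids: <start> w1 w2 w3 <end>
target = torch.tensor([[1, 5, 6, 7, 2]])

# Teacher forcing: the decoder reads the target shifted right and must
# predict the same sequence shifted left.
decoder_input = target[:, :-1]   # [<start>, w1, w2, w3]
decoder_labels = target[:, 1:]   # [w1, w2, w3, <end>]

# Look-ahead (causal) mask: position i may only attend to positions <= i.
seq_len = decoder_input.size(1)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```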
I have recently read about BERT and want to use BertForMaskedLM for a fill-mask task. I know about the BERT architecture. Also, as far as I know, BertForMaskedLM is built from BERT with a language modeling head on top, but I have no idea what a language modeling head means here. Can anyone give me a brief explanation?
BertForMaskedLM, as you have understood correctly, uses a language modeling (LM) head.
Generally, as in this case, the LM head is a linear layer whose input dimension is the hidden state size (768 for BERT-base) and whose output dimension is the vocabulary size. Thus, it maps the hidden state output of the BERT model to a score for every token in the vocabulary. The loss is calculated from the score obtained for a given token with respect to the target token.
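A minimal sketch of that projection (shapes assumed for BERT-base uncased; this is only the essential part, not Huggingface's actual head class, which also applies a small transform before the final linear layer):

```python
import torch
import torch.nn as nn

hidden_size = 768      # BERT-base hidden state size
vocab_size = 30522     # BERT-base uncased vocabulary size

# The essence of an LM head: project each hidden state to vocabulary scores.
lm_head = nn.Linear(hidden_size, vocab_size)

# Pretend BERT output for a batch of 1 sentence with 8 tokens.
hidden_states = torch.randn(1, 8, hidden_size)
logits = lm_head(hidden_states)                  # shape: (1, 8, vocab_size)

# Cross-entropy against the target token ids gives the masked-LM loss.
targets = torch.randint(0, vocab_size, (1, 8))
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), targets.view(-1))
print(logits.shape, loss.item())
```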
In addition to @Ashwin Geet D'Sa's answer:
Here is Huggingface's definition of the LM head:
The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension.
You can find Huggingface's definitions of other terms on this page: https://huggingface.co/docs/transformers/glossary
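For the fill-mask task mentioned in the question, the simplest route is Huggingface's pipeline; a short sketch, assuming the transformers package and the bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

# Loads BertForMaskedLM (BERT + LM head) under the hood.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The LM head scores every vocabulary token for the [MASK] position.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```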
I have been trying to learn NLTK and NLP, but using n-grams to build a next-word predictor seems to be relatively simple. What are some other ways I might approach this problem?
This is called language modeling. It is one of the primary tasks in NLP. This article is old now, but it explains in detail how to build a character-level language model (given characters c_0 through c_(n-1), predict character c_n).
LSTMs offer the best balance of resource usage and accuracy; ULMFiT is the best example of LSTM language modeling. Most state-of-the-art results use enormous Transformers, like the famous BERT and GPT-2.
BERT isn't a traditional language model: rather than predicting the next word, it gets a sentence with a blank and fills in the blank; this is now called MLM, masked language modeling.
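For comparison with the neural approaches above, here is what the simple n-gram baseline looks like as a plain-Python bigram counter (no NLTK-specific API assumed; the corpus is made up for illustration):

```python
from collections import Counter, defaultdict

corpus = [
    "i want to sing a song",
    "i want to learn nlp",
    "we want to build a language model",
]

# Count bigrams: how often each word follows a given previous word.
bigrams = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    if word not in bigrams:
        return None
    return bigrams[word].most_common(1)[0][0]

print(predict_next("want"))  # 'to'
print(predict_next("to"))    # one of 'sing', 'learn', 'build'
```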
I am trying to solve the problem of sequence completion. Let's suppose we have a ground-truth sequence (1, 2, 4, 7, 6, 8, 10, 12, 18, 20).
The input to our model is an incomplete sequence, i.e. (1, 2, 4, _, _, _, 10, 12, 18, 20). From this incomplete sequence, we want to predict the original (ground-truth) sequence. Which deep learning models can be used to solve this problem?
Is this the problem of encoder-decoder LSTM architecture?
Note: we have thousands of complete sequences to train and test the model.
Any help is appreciated.
This is not exactly a sequence-to-sequence problem; it is a sequence labeling problem. I would suggest either stacked bidirectional LSTM layers followed by a classifier, or Transformer layers followed by a classifier.
An encoder-decoder architecture requires plenty of data to train properly and is particularly useful if the target sequence can be of arbitrary length, only vaguely depending on the source sequence length. It would eventually learn to do the job with enough data, but sequence labeling is a more straightforward problem.
With sequence labeling, you can set a custom mask over the output, so the model will only predict the missing numbers. An encoder-decoder model would need to learn to copy most of the input first.
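A hedged PyTorch sketch of this sequence-labeling setup: the (value, is-missing-flag) input encoding, the layer sizes, and the way the loss is masked are all illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn as nn

class SequenceLabeler(nn.Module):
    """Bidirectional LSTM that predicts a value for every position."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_size, 1)   # one regressed value per step

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)

# Each step is (value, is_missing); missing positions carry value 0 and flag 1.
seq = torch.tensor([[1, 2, 4, 0, 0, 0, 10, 12, 18, 20.]]).unsqueeze(-1)
missing = torch.tensor([[0, 0, 0, 1, 1, 1, 0, 0, 0, 0.]]).unsqueeze(-1)
inputs = torch.cat([seq, missing], dim=-1)            # shape (1, 10, 2)
target = torch.tensor([[1, 2, 4, 7, 6, 8, 10, 12, 18, 20.]])

model = SequenceLabeler()
pred = model(inputs)                                   # shape (1, 10)

# Custom mask over the output: compute the loss only on the missing positions.
mask = missing.squeeze(-1).bool()
loss = nn.MSELoss()(pred[mask], target[mask])
print(loss.item())
```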
In your sequence completion task, are you trying to predict next items in a sequence or learn only the missing values?
Training a neural network with missing data is an issue in its own right.
If you're using Keras and an LSTM-type NN to solve your problem, you should consider masking; you can refer to this Stack Overflow thread for more details: Multivariate LSTM with missing values
Regarding predicting the missing values, why not try autoencoders?
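For the masking suggestion above, a minimal Keras sketch (the sentinel value 0.0 and the layer sizes are arbitrary illustrative choices):

```python
import numpy as np
import tensorflow as tf

# Missing timesteps are encoded with a sentinel value that the Masking layer skips.
MASK_VALUE = 0.0
x = np.array([[[1.], [2.], [4.], [0.], [0.], [0.], [10.], [12.], [18.], [20.]]])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 1)),
    tf.keras.layers.Masking(mask_value=MASK_VALUE),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
print(model.predict(x).shape)   # (1, 10, 1)
```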
I am writing a sequence to sequence neural network in Pytorch. In the official Pytorch seq2seq tutorial, there is code for an Attention Decoder that I cannot understand/think might contain a mistake.
It computes the attention weights at each time step by concatenating the output and the hidden state at this time, and then multiplying by a matrix to get a vector of size equal to the output sequence length. Note that these attention weights don't depend on the encoder sequence (named encoder_outputs in the code), which I think they should.
Also, the paper cited in the tutorial, lists three different score functions that can be used to compute attention weights (section 3.1 in the paper). None of these functions is just concatenating and multiplying by a matrix.
So it seems to me that the code in the tutorial is mistaken both in the function it applies and the arguments that are passed to this function. Am I missing something?
This tutorial has a simplified version of the attention mechanisms from the Luong paper that you mentioned.
It just uses a linear layer to combine the input embedding and the decoder RNN hidden state. This is sometimes called 'location-based' attention, because it does not depend on the encoder outputs. Then it applies a softmax to compute the attention weights, and the process goes on as it normally would.
This is not always a bad thing to have: if the attention were computed from the encoder outputs, the mechanism might attend to a previous token, the attention would no longer be monotonic, and your model could fail.
To implement the attentions from the Luong paper, I suggest using the 'concat' attention, after applying linear layers to both the decoder hidden state and the encoder outputs. The matrix W_a transforms the concatenated result to an arbitrary dimension of your choice, and finally the vector v_a projects it down to a scalar score for each encoder position; the softmax over these scores gives the attention weights used to form the context vector.
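A hedged PyTorch sketch of that 'concat' score, score(h_t, h_s) = v_a^T tanh(W_a [h_t; h_s]); the dimensions are arbitrary, and this is not the tutorial's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatAttention(nn.Module):
    """Luong 'concat' attention: score(h_t, h_s) = v_a^T tanh(W_a [h_t; h_s])."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim + enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, dec_dim); encoder_outputs: (batch, src_len, enc_dim)
        src_len = encoder_outputs.size(1)
        h_t = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)
        scores = self.v_a(torch.tanh(self.W_a(torch.cat([h_t, encoder_outputs], dim=-1))))
        attn_weights = F.softmax(scores.squeeze(-1), dim=-1)          # (batch, src_len)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, attn_weights

attn = ConcatAttention(dec_dim=256, enc_dim=256, attn_dim=128)
context, weights = attn(torch.randn(2, 256), torch.randn(2, 7, 256))
print(context.shape, weights.shape)   # torch.Size([2, 256]) torch.Size([2, 7])
```

Note that, unlike the tutorial's version, these attention weights do depend on encoder_outputs, which is what the question was asking about.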
In the algorithm, attn_weights depends on the decoder's parameters. We take the output of a linear layer (of size 10 here); this is the attention vector. Then we multiply it with encoder_outputs, and at every epoch we update attn_weights by backpropagation. Put verbally, at every iteration it is learning in the reverse direction.
Let me give an example:
Our task is to translate from English to German.
I want to sing a song. -> Ich möchte ein Lied singen.
In the decoder, the verb singen comes at the end. So the decoder's attn_weights see the decoder output and learn which parts of the input encoding to attend to. When you multiply this value with encoder_outputs, you get a matrix that has high values at the necessary points.
So, in fact, this way it learns which parts of the input it must pay attention to when the decoder sees a given sentence pattern in German. So the direction of learning is correct, I think.