Mistake in pytorch attention seq2seq tutorial? - pytorch

I am writing a sequence-to-sequence neural network in PyTorch. In the official PyTorch seq2seq tutorial, there is code for an Attention Decoder that I cannot understand and that I think might contain a mistake.
It computes the attention weights at each time step by concatenating the output and the hidden state at that step, and then multiplying by a matrix to get a vector whose size equals the output sequence length. Note that these attention weights don't depend on the encoder sequence (named encoder_outputs in the code), which I think they should.
Also, the paper cited in the tutorial lists three different score functions that can be used to compute attention weights (section 3.1 in the paper). None of these functions simply concatenates and multiplies by a matrix.
So it seems to me that the code in the tutorial is mistaken both in the function it applies and the arguments that are passed to this function. Am I missing something?

The tutorial uses a simplified version of the attention mechanisms in the Luong paper that you mentioned.
It just uses a linear layer to combine the input embedding and the decoder RNN hidden state. This is sometimes called 'location-based' attention, because it does not depend on the encoder outputs. It then applies a softmax to obtain the attention weights, and from there the process continues as it normally would.
This is not always a bad thing: when attention is computed from the encoder outputs, the mechanism might attend to an earlier token, in which case the attention is no longer monotonic and your model could fail.
To implement the attention from the Luong paper, I suggest using the 'concat' score, after applying linear layers to both the decoder hidden state and the encoder outputs. The matrix W_a then transforms the concatenated result to a dimension of your choice, and finally the vector v_a reduces that result to a single score per encoder position. A softmax over these scores gives the attention weights, which are used to form the context vector.
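To make the difference concrete, here is a minimal NumPy sketch of both variants for a single decoder step. The sizes, the random weights and the names W_attn, W_a and v_a are illustrative assumptions, not the tutorial's actual code; the concat score follows the form v_a^T tanh(W_a [h_t; h_s]) from the Luong paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, max_length, src_len = 4, 6, 6  # illustrative sizes (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

embedded = rng.normal(size=hidden_size)  # embedded decoder input token
hidden = rng.normal(size=hidden_size)    # current decoder hidden state
encoder_outputs = rng.normal(size=(src_len, hidden_size))

# Tutorial-style ("location-based") weights: a linear layer over
# [embedded; hidden]. encoder_outputs is not an input to this step.
W_attn = rng.normal(size=(max_length, 2 * hidden_size))
location_weights = softmax(W_attn @ np.concatenate([embedded, hidden]))

# Luong "concat" score: one scalar per encoder position,
# computed from the decoder state AND that encoder output.
attn_dim = 5
W_a = rng.normal(size=(attn_dim, 2 * hidden_size))
v_a = rng.normal(size=attn_dim)
scores = np.array([
    v_a @ np.tanh(W_a @ np.concatenate([hidden, h_s]))
    for h_s in encoder_outputs
])
concat_weights = softmax(scores)

# Either set of weights is then used the same way: a weighted sum
# of the encoder outputs gives the context vector.
context = concat_weights @ encoder_outputs
```

In the first variant the encoder outputs only enter when the context vector is formed; in the second they also shape the weights themselves.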

In this algorithm, attn_weights depends on the decoder's parameters.
We get the output of a linear layer (here of size 10). This is the attention vector.
We then multiply it with encoder_outputs. At every training iteration, the parameters that produce attn_weights are updated by backpropagation. Put in words, the model is learning in the reverse direction.
Let me give an example.
Our task is to translate from English to German:
I want to sing a song. -> Ich möchte ein Lied singen.
In the decoder output, the verb singen is at the end. The attention layer sees the decoder state and learns which parts of the input encoding to apply. When you multiply these weights with encoder_outputs, you get a result that has high values at the necessary positions.
So in fact, this way the model learns which parts of the input it must pay attention to
when the decoder sees a given sentence pattern in German. So the direction of learning is correct, I think.

Related

The decoder part in a transformer model

I'm fairly new to NLP and I was reading a blog explaining the transformer model. I was quite confused about the input/output for the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is, if we already know y_true, why run this step to get the output probability? I just don't quite get the relationship between the bottom right "Output Embedding" and the top right "Output Probabilities". When we use the model, we wouldn't really have y_true, do we just use y_pred and feed them into the decoder instead? This might be a noob question. Thanks in advance.
"I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block."
Well, yes and no.
The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.
Let's take a translation example, English to Spanish:
We have 5 dogs -> Nosotras tenemos 5 perros
The encoder will encode the English sentence and produce an attention vector as output. At the first step, the decoder is fed the attention vector and a <START> token. The decoder will (should) produce the first Spanish word, Nosotras. This is Yt. In the next step, the decoder is again fed the attention vector, along with the <START> token and the previous output Yt-1, Nosotras. The output will be tenemos, and so on, until the decoder emits an <END> token.
The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token.
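The loop described above can be sketched in a few lines. The lookup table below is a hypothetical stand-in for a trained decoder (a real model would compute the next token from the encoder output and the tokens generated so far); only the control flow is the point here.

```python
# Hypothetical stand-in for a trained decoder: maps the tokens
# generated so far to the next token.
TOY_NEXT = {
    ("<START>",): "Nosotras",
    ("<START>", "Nosotras"): "tenemos",
    ("<START>", "Nosotras", "tenemos"): "5",
    ("<START>", "Nosotras", "tenemos", "5"): "perros",
    ("<START>", "Nosotras", "tenemos", "5", "perros"): "<END>",
}

def next_token(encoder_output, generated):
    # A real decoder would use encoder_output; the toy table ignores it.
    return TOY_NEXT[tuple(generated)]

def greedy_decode(encoder_output, max_len=10):
    generated = ["<START>"]
    while len(generated) < max_len:
        token = next_token(encoder_output, generated)
        generated.append(token)  # the output is fed back in at the next step
        if token == "<END>":
            break
    return generated

result = greedy_decode(encoder_output="encoded: We have 5 dogs")
print(result)
```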
In addition to @Bhupen's answer, it is worth highlighting the difference from seq-to-seq models based on RNNs, for which this sequential processing is always necessary.
Transformers have the fundamental advantage that they can be trained with parallel processing. During training, the decoder can process all target positions in a single parallel forward pass, just as the encoder does. This allows for substantial speedups and therefore larger training sets, to which transformer models owe much of their success.
So to answer the original question, the input to the decoder is:
during training: all correct target tokens, shifted and masked, processed at once in parallel
during inference: the previous outputs of the decoder, starting with the special token <start>, until another special token <end> is predicted.
You can look at the excellent tensorflow implementation, i.e. check out:
Inference, which is indeed the sequential processing as described in the previous answer https://www.tensorflow.org/text/tutorials/transformer#run_inference
But also check the decoder, where you can see that there is no inherent sequential nature (only iterating through layers) https://www.tensorflow.org/text/tutorials/transformer#the_decoder_layer
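As a rough illustration of "shifted and masked", here is a plain-Python sketch of how the training inputs could be laid out for the translation example from the previous answer. The variable names are assumptions for illustration and are not taken from the TensorFlow tutorial.

```python
# Full target sequence, known during training (y_true).
target = ["<start>", "Nosotras", "tenemos", "5", "perros", "<end>"]

# Shift: the decoder sees everything except the last token,
# and is trained to predict everything except the first.
decoder_input = target[:-1]
labels = target[1:]

# Causal (look-ahead) mask: position i may attend only to positions <= i,
# so all positions can be processed in parallel without seeing the future.
n = len(decoder_input)
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for token, label, row in zip(decoder_input, labels, causal_mask):
    print(f"{token:>9} -> {label:<9} mask={row}")
```

Each row of the mask corresponds to one step of the sequential inference loop, which is why training can reproduce all of those steps in one pass.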

Extracting hidden representations for each token - PyTorch LSTM

I am currently working on a NLP project involving recurrent neural networks. I implemented a LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought that the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM, most simply with a single-layer LSTM. If you check the documentation (here), you can see that an LSTM returns a tensor and a tuple of tensors. The tensor holds the hidden state for every step of the sequence (for a multi-layer LSTM, only those of the last layer). The tuple contains the hidden and cell states for the last sequence step. What each dimension of the output means depends on how you initialized your network: either the first or the second dimension is the batch dimension, and the other of the two indexes the sequence steps, giving the per-token representations you want.
If you use a packed sequence as input, it is a bit of a different story.
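A minimal sketch of this, assuming a single-layer, unidirectional nn.LSTM created with batch_first=True and random tensors standing in for real word embeddings (all sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embed_dim, hidden_dim = 2, 5, 8, 16  # assumed sizes

lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim,
               num_layers=1, batch_first=True)

# Random stand-in for a batch of embedded token sequences.
embeddings = torch.randn(batch_size, seq_len, embed_dim)

with torch.no_grad():
    output, (h_n, c_n) = lstm(embeddings)

# output holds one hidden vector per token: (batch, seq_len, hidden_dim).
print(output.shape)

# h_n holds only the final step: (num_layers, batch, hidden_dim).
# For this unidirectional single-layer case it equals the last slice of output.
print(torch.allclose(output[:, -1, :], h_n[-1]))
```

With this, there is no need to feed tokens one at a time: a whole sequence can go through in one call and the per-token states are read from output.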

TimeDistributed Layers vs. ConvLSTM-2D

Could anyone explain the differences between Time-Distributed layers (from the Keras wrapper) and ConvLSTM-2D (Convolutional LSTM), in terms of purpose, usage, etc.?
Both apply to a sequence of data.
TimeDistributed is a very straightforward layer wrapper which simply applies a layer (usually a dense layer) at each time step. You need it when you want to change the shape of the output tensor, specifically the feature dimension, rather than the sample size or the number of time steps.
ConvLSTM2D is much more complex. You need to understand CNN and RNN layers first, where LSTM is one of the most popular RNNs. An LSTM is applied to a sequence of tensors, as used for NLP and time series, and at each time step its input is one-dimensional. The CNN, the conv part, is usually used to learn from images, which are two-dimensional but have no sequence (time step). Combined, ConvLSTM is used to learn from a sequence of images, such as video.
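The TimeDistributed idea can be sketched in NumPy without Keras: the same dense weights are applied independently at every time step, so only the feature dimension changes. The sizes and weights below are illustrative assumptions; the last array only shows the kind of 5-D input (samples, time, rows, cols, channels) that a ConvLSTM2D layer expects with channels-last data.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time_steps, features, units = 2, 3, 4, 5  # assumed sizes

x = rng.normal(size=(batch, time_steps, features))
W = rng.normal(size=(features, units))  # one shared dense layer
b = np.zeros(units)

# "Time distributed": apply the same dense layer at each time step.
per_step = np.stack([x[:, t, :] @ W + b for t in range(time_steps)], axis=1)

# Equivalent single broadcasted matrix product.
all_at_once = x @ W + b

print(per_step.shape)  # batch and time are unchanged, features -> units

# A ConvLSTM-style input is a sequence of images, e.g. short video clips.
video = np.zeros((batch, time_steps, 32, 32, 3))
print(video.ndim)
```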

How to calculate a One-Hot Encoding value into a real-valued vector?

In Word2Vec, I've learned that both CBOW and Skip-gram take a one-hot encoded value and use it to create a vector (correct me if I'm wrong). I wonder how a one-hot encoded value is calculated or represented as a real-valued vector, for example (source: DistrictDataLab's Blog about Distributed Representations)
from this:
into:
Please help; I have been struggling to find this information.
The word2vec algorithm itself is what incrementally learns the real-valued vector, with varied dimension values.
In contrast to the one-hot encoding, these vectors are often called "dense embeddings". They're "dense" because, unlike the one-hot encoding, which is "sparse" with many dimensions and mostly zero values, they have fewer dimensions and (usually) no zero values. They're an "embedding" because they "embed" a discrete set of words into another, continuous coordinate system.
You'd want to read the original word2vec paper for a full formal description of how the dense embeddings are made.
But the gist is that the dense vectors start totally random, and so at first the algorithm's internal neural network is useless for predicting neighboring words. But each (context)->(target) word training example from a text corpus is tried against the network, and each time the difference from the desired prediction is used to apply a tiny nudge, towards a better prediction, to both word-vector and internal-network-weight values.
Repeated many times, initially with larger nudges (higher learning-rate) then with ever-smaller nudges, the dense vectors rearrange their coordinates from their initial randomness to a useful relative-arrangement – one that's about-as-good as possible for predicting the training text, given the limits of the model itself. (That is, any further nudge that improves predictions on some examples, worsens it on others – so you might as well consider training done.)
You then read the resulting dense embedding real-valued vectors out of the model, and use them for purposes other than just nearby-word prediction.
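Mechanically, the link between a one-hot vector and its dense vector is just a row lookup in the learned weight matrix. A minimal NumPy sketch, using a hypothetical five-word vocabulary and random numbers in place of trained weights:

```python
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple"]  # hypothetical vocabulary
embedding_dim = 3

rng = np.random.default_rng(0)
# Stand-in for the trained weights: one row of real values per word.
W = rng.normal(size=(len(vocab), embedding_dim))

index = vocab.index("queen")
one_hot = np.zeros(len(vocab))
one_hot[index] = 1.0

# Multiplying the one-hot vector by W selects exactly one row of W.
dense = one_hot @ W

print(np.allclose(dense, W[index]))
```

Training only changes the numbers in W; the one-hot vector itself is never "converted", it just picks out which row to read.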

How does word2vec or skip-gram model convert words to vector?

I have been reading a lot of papers on NLP, and came across many models. I got the SVD Model and representing it in 2-D, but I still did not get how do we make a word vector by giving a corpus to the word2vec/skip-gram model? Is it also co-occurrence matrix representation for each word? Can you explain it by taking an example corpus:
Hello, my name is John.
John works in Google.
Google has the best search engine.
Basically, how does skip gram convert John to a vector?
I think you will need to read a paper about the training process. Basically the values of the vectors are the node values of the trained neural network.
I tried to read the original paper but I think the paper "word2vec Parameter Learning Explained" by Xin Rong has a more detailed explanation.
The main concept can be easily understood with an example of Autoencoding with neural networks. You train the neural network to pass information from the input layer to the output layer through the middle layer which is smaller.
In a traditional autoencoder, you have an input vector of size N, a middle layer of length M<N, and an output layer, again of size N. You turn on only one unit at a time in your input layer, and you train the network to replicate in the output layer the same unit that is turned on in the input layer.
After training has completed successfully, you will see that the neural network, in order to transport the information from the input layer to the output layer, has adapted itself so that each input unit has a corresponding vector representation in the middle layer.
Simplifying a bit, in the context of word2vec your input and output vectors work more or less in the same way, except for the fact that in the sample you submit to the network the unit turned on in the input layer is different from the unit turned on in the output layer.
In fact you train the network picking pairs of nearby (not necessarily adjacent) words from your corpus and submitting them to the network.
The size of the input and output vector is equal to the size of the vocabulary you are feeding to the network.
Your input vector has only one unit turned on (the one corresponding to the first word of the chosen pair) the output vector has one unit turned on (the one corresponding to the second word of chosen pair).
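Using the example corpus from the question, the pairs submitted to the network could be generated roughly like this. The tokenizer and the window size of 2 are assumptions for illustration; real word2vec implementations differ in details such as subsampling and dynamic windows.

```python
import re

corpus = [
    "Hello, my name is John.",
    "John works in Google.",
    "Google has the best search engine.",
]

def tokenize(sentence):
    # Simplistic tokenizer (assumed): lowercase alphabetic words only.
    return re.findall(r"[a-z]+", sentence.lower())

def skipgram_pairs(tokens, window=2):
    # (input word, output word) for every nearby word within the window.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = [p for s in corpus for p in skipgram_pairs(tokenize(s))]
vocab = sorted({w for s in corpus for w in tokenize(s)})

def one_hot(word):
    # Vocabulary-sized vector with a single unit turned on.
    return [1 if w == word else 0 for w in vocab]

john_pairs = [p for p in pairs if p[0] == "john"]
print(john_pairs)
print(one_hot("john"))
```

Each pair becomes one training sample: the one-hot vector of the first word is the input, and the one-hot vector of the second word is the desired output.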
For current readers who might also be wondering "what does a word vector exactly mean" as the OP was at that time: As described at http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf, a word vector is of dimension n, and n "is an arbitrary size which defines the size of our embedding space." That is to say, this word vector doesn't mean anything concretely. It's just an abstract representation of certain qualities that this word might have, that we can use to distinguish words.
In fact, to directly answer the original question of "how is a word converted to a vector representation": the values of a word's vector embedding are usually just randomized at initialization and then improved iteration by iteration.
This is common in deep learning and neural networks, where the people who created the network usually don't have much idea what the values exactly stand for. The network itself is supposed to figure the values out gradually, through learning. They just abstractly represent something and distinguish things from one another. One example would be AlphaGo, where it would be impossible for the DeepMind team to explain what each value in a vector stands for. It just works.
First of all, you normally don't use SVD with Skip-Gram model, because Skip-Gram is based on neural network. You use SVD because you want to reduce the dimension of your word vector (ex: for visualization on 2D or 3D space), but in neural net you construct your embedding matrices with the dimension of your choice. You use SVD if you constructed your embedding matrix with co-occurrence matrix.
Vector representation with co-occurrence matrix
I wrote an article about this here.
Consider the following two sentences: "all that glitters is not gold" + "all is well that ends well"
Co-occurrence matrix is then:
With a co-occurrence matrix, each row is the word vector for a word. However, as you can see in the matrix constructed above, each row has 10 columns. This means that the word vectors are 10-dimensional and can't be visualized in 2D or 3D space. So we run SVD to reduce them to 2 dimensions:
Now that the word vectors are 2-dimensional, they can be visualized in a 2D space:
However, reducing the word vectors into 2D matrix results in significant loss of meaningful data, which is why you shouldn't reduce it down too much.
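For readers without the figures, the construction can be reproduced approximately in NumPy. This sketch assumes a context window of 1 and START/END boundary tokens, which is one way to arrive at the 10 columns mentioned above; the original article may have used different settings.

```python
import numpy as np

sentences = ["all that glitters is not gold", "all is well that ends well"]
# Assumption: sentence boundaries are marked with START and END tokens.
tokenized = [["START"] + s.split() + ["END"] for s in sentences]

vocab = sorted({w for sent in tokenized for w in sent})
index = {w: i for i, w in enumerate(vocab)}

window = 1  # assumed context window size
M = np.zeros((len(vocab), len(vocab)))
for sent in tokenized:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[index[word], index[sent[j]]] += 1

# Each row of M is a 10-dimensional word vector.
print(M.shape)

# SVD, keeping the two largest singular values, gives 2-D word vectors.
U, S, Vt = np.linalg.svd(M)
reduced = U[:, :2] * S[:2]
print(reduced.shape)
```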
Let's take another example: achieve and success. Let's say they have 10-dimensional word vectors:
Since achieve and success convey similar meanings, their vector representations are similar. Notice their similar values and color band patterns. However, since these are 10-dimensional vectors, they can't be visualized. So we run SVD to reduce the dimension to 3D, and visualize them:
Each value in the word vector represents the word's position within the vector space. Similar words will have similar vectors and, as a result, will be placed close to each other in the vector space.
Vector representation with Skip-Gram
I wrote an article about it here.
Skip-Gram uses neural net, and therefore does not use SVD because you can specify the word vector's dimension as a hyper-parameter when you first construct the network (if you really need to visualize, then we use a special technique called t-SNE, but not SVD).
Skip-Gram has the following structure:
With Skip-Gram, N-dimensional word vectors are randomly initialized. There are two embedding matrices: input weight matrix W_input and output weight matrix W_output
Lets take W_input as an example. Assume that the words of your interest are passes and should. Since the randomly initialized weight matrix is 3-dimensional, they can be visualized:
These weight matrices (W_input and W_output) are optimized by predicting a center word's neighboring words and updating the weights in a way that minimizes the prediction error. The predictions are computed for each context word of a center word, and their prediction errors are summed to calculate the weight gradients.
The weight matrices update equations are:
These updates are applied for each training sample within the corpus (since Word2Vec uses stochastic gradient descent).
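A minimal NumPy sketch of one such update for vanilla Skip-Gram with a full softmax, under assumed sizes, learning rate and word indices. It follows the description above (errors summed over the context words, then a gradient step on both matrices) and is not the article's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 6, 3, 0.1  # assumed sizes and learning rate
W_input = rng.normal(scale=0.1, size=(vocab_size, dim))   # row = center word vector
W_output = rng.normal(scale=0.1, size=(dim, vocab_size))  # column = context word vector

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sgd_step(center, context_ids):
    h = W_input[center].copy()        # hidden layer = center word's vector
    probs = softmax(h @ W_output)     # predicted distribution over the vocabulary
    loss = -sum(np.log(probs[c]) for c in context_ids)

    # Prediction error (probs - one_hot), summed over all context words.
    error = len(context_ids) * probs
    for c in context_ids:
        error[c] -= 1.0

    grad_output = np.outer(h, error)  # gradient w.r.t. W_output
    grad_h = W_output @ error         # gradient w.r.t. the center word's row

    W_output[:] -= lr * grad_output
    W_input[center] -= lr * grad_h
    return loss

# Repeating the same training sample should drive its loss down.
losses = [sgd_step(center=2, context_ids=[1, 3]) for _ in range(50)]
print(losses[0], losses[-1])
```

Note that only one row of W_input changes per sample, while every column of W_output is touched, which is the cost that motivates alternatives to the full softmax.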
Vanilla Skip-Gram vs Negative Sampling
The above Skip-Gram illustration assumes that we use vanilla Skip-Gram. In real-life, we don't use vanilla Skip-Gram because of its high computational cost. Instead, we use an adapted form of Skip-Gram, called negative sampling.
