Extracting hidden representations for each token - PyTorch LSTM - nlp

I am currently working on an NLP project involving recurrent neural networks. I implemented an LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought that the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in the training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.

Yes, that is possible with nn.LSTM. If you check the documentation (here), you can see that an LSTM returns a tensor and a tuple of tensors. The tensor contains the hidden state of the last layer at every sequence step, which is why this works most directly for a single-layer LSTM. The tuple contains the hidden and cell states for the last sequence step only. What each dimension of the output means depends on how you initialized your network: either the first or the second dimension is the batch dimension (depending on batch_first), and the other of those two runs over the sequence of per-token hidden representations you want.
If you use a packed sequence as input, it is a bit of a different story.
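As a minimal sketch of the above (the sizes and batch_first=True are assumptions for illustration, not the asker's actual model), the first return value already holds one hidden vector per token, so there is no need to drop to a batch size and sequence length of 1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the question's actual model).
batch_size, seq_len, embed_dim, hidden_dim = 2, 5, 8, 16

# batch_first=True makes the first dimension the batch dimension.
lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

# Stand-in for a batch of embedded tokens.
embedded = torch.randn(batch_size, seq_len, embed_dim)

with torch.no_grad():
    output, (h_n, c_n) = lstm(embedded)

# output[b, t] is the hidden state for token t of sequence b.
print(output.shape)  # torch.Size([2, 5, 16])
# h_n holds only the final step: shape (num_layers, batch, hidden_dim).
print(h_n.shape)     # torch.Size([1, 2, 16])
```

For a unidirectional, unpacked input like this one, output[:, -1, :] matches h_n[-1], which is a quick sanity check that you are reading the right tensor.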

Related

Tensorflow Keras: Problems to handle variable length input, using generator?

We want to train our model on inputs of varying dimensions. Every input, within a batch and across batches, has different dimensions.
We cannot resize our inputs (we would lose microscopic features), so converting them into batches of NumPy arrays is impossible as they are. To handle this, I store the inputs in a list, where each element has shape (height, width, 1). The height is variable and the width is constant.
Sometimes my input is excessively large, so I plan to use model.fit_generator(). In the generator, we find the max height and width of the inputs in a batch and pad every other input with zeros so that every input in the batch has the same dimensions. Now we can easily convert the batch to a NumPy array or a tensor and pass it to fit_generator(). The expectation is that the model learns to ignore the zeros and learns features from the intended portion of the padded input. This way, every batch has equal input dimensions internally, but each batch has a different shape (due to the difference in max height and width across batches).
So far I have described what I have learned and what I plan to do with variable-size input data. But I am stuck on the following points:
1- I plan to use a CNN first and then an LSTM on top of it. I am using tensorflow keras, which offers padding and masking. As far as I know, an LSTM can use masking to ignore 0-padded values. However, I am concerned about the CNN (does a CNN ignore 0-padded values?), because my padded input is fed to the CNN first. I have seen some discussion in the following links:
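The per-batch padding described above could look roughly like the following. This is only a sketch: pad_batch, batch_generator, and the toy data are hypothetical names and values, and it covers just the padding step, not masking.

```python
import numpy as np

def pad_batch(samples):
    """Zero-pad (height, width, 1) arrays to the largest height and width in the batch."""
    max_h = max(s.shape[0] for s in samples)
    max_w = max(s.shape[1] for s in samples)
    batch = np.zeros((len(samples), max_h, max_w, 1), dtype=np.float32)
    for i, s in enumerate(samples):
        batch[i, :s.shape[0], :s.shape[1], :] = s
    return batch

def batch_generator(samples, labels, batch_size):
    """Yield (inputs, targets) forever, padding each batch independently."""
    while True:
        for start in range(0, len(samples), batch_size):
            x = pad_batch(samples[start:start + batch_size])
            y = np.asarray(labels[start:start + batch_size])
            yield x, y

# Toy data: variable height, constant width of 4 (hypothetical sizes).
samples = [np.ones((h, 4, 1), dtype=np.float32) for h in (3, 7, 5, 2)]
labels = [0, 1, 0, 1]
gen = batch_generator(samples, labels, batch_size=2)
first_x, first_y = next(gen)
second_x, second_y = next(gen)
print(first_x.shape, second_x.shape)  # (2, 7, 4, 1) (2, 5, 4, 1)
```

Each batch is rectangular on its own, but the two batches have different heights, which is exactly the situation described in the question.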
How to apply masking layer to sequential CNN model in Keras?
https://github.com/keras-team/keras/issues/411
These links mention that, unfortunately, masking is not yet supported by the Keras Conv layers. However, there has been a lot of development since then, specifically in tensorflow keras. So I am wondering whether tensorflow keras now supports masked input for Conv layers.
2- To feed the data, we can write a custom Keras generator. I went through a very good tutorial on this and decided to use that approach. But I am wondering whether tensorflow keras has a built-in facility that would save me from writing a custom generator.

How to change input length in Embedding layer for each batch?

I am designing an embedding layer where the vocabulary is ~4000 and most training examples have a short length of less than 10. However some examples have a length of 100 or possibly even several hundred, and I would like to avoid zero padding every single example to length 100+ in order to maintain constant input length across all examples.
To remedy this I would like to only pad based on the max length within the batch, so that almost all batches would only have input length ~10 with only a few batches having a lot of padding. How do I load in each batch with a different input length into the Embedding layer?
One possible way is to set the input_length argument to None. But if you are going to use Flatten and Dense layers after this layer, they might not work. For more, see the Keras doc page:
"... This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed)"
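    from tensorflow import keras

    voc_size, embedding_dim = 4000, 64  # illustrative values (vocabulary ~4000 in the question)
    model = keras.models.Sequential(
        [
            # input_length=None: each batch may have its own sequence length
            keras.layers.Embedding(voc_size, embedding_dim, input_length=None)
        ]
    )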
Now the model can accept variable length sequences.
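On the data side, padding to the max length within each batch can be sketched in plain Python. The helper names and toy data are hypothetical, and using 0 as the padding id assumes index 0 is reserved for padding in your vocabulary:

```python
def pad_to_batch_max(sequences, pad_id=0):
    """Right-pad every sequence to the longest length in this batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

def iter_padded_batches(sequences, batch_size, pad_id=0):
    """Yield batches whose length is set by their own longest sequence."""
    for start in range(0, len(sequences), batch_size):
        yield pad_to_batch_max(sequences[start:start + batch_size], pad_id)

# Toy token-id sequences (hypothetical data): mostly short, one long.
data = [[5, 2], [7, 1, 3], [4], [9] * 100]
batches = list(iter_padded_batches(data, batch_size=2))
print([len(b[0]) for b in batches])  # [3, 100]
```

Only the batch that contains the long example pays for the long padding. Sorting or bucketing examples by length before batching would reduce padding further, at the cost of less random batches.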

when do you use Input shape vs batch_shape in keras?

I can't find API documentation that explains the Keras Input layer.
When should you use the shape argument vs. the batch_shape argument?
From the Keras source code:
Arguments
shape: A shape tuple (integer), not including the batch size.
For instance, `shape=(32,)` indicates that the expected input
will be batches of 32-dimensional vectors.
batch_shape: A shape tuple (integer), including the batch size.
For instance, `batch_shape=(10, 32)` indicates that
the expected input will be batches of 10 32-dimensional vectors.
`batch_shape=(None, 32)` indicates batches of an arbitrary number
of 32-dimensional vectors.
The batch size is how many examples are processed together in one batch.
You can use either argument. Personally, I have never used "batch_shape". When you use "shape", your batch can be any size; you don't have to care about it.
shape=(32,) means exactly the same as batch_shape=(None, 32).
To expand on Daniel's answer, one case where I've found it necessary to specify batch_shape instead of shape for an Input layer is when using stateful LSTMs in the functional API. It's described well in Phillipe Remy's blog. In short, stateful mode lets you keep the hidden state values of an LSTM across batches (they get reset after every batch under the default stateful=False). That means the layer needs to know the batch size in order to shape everything properly. If you don't provide it, it yells at you:
ValueError: If a RNN is stateful, it needs to know its batch size. Specify the batch size of your input tensors:
- If using a Sequential model, specify the batch size by passing a `batch_input_shape` argument to your first layer.
- If using the functional API, specify the batch size by passing a `batch_shape` argument to your Input layer.
The second point is the relevant one here. If using LSTM with stateful=True in the functional API, you need to set batch_shape for your Input layers.
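The relationship between the two arguments is just a matter of prepending the batch dimension. The helper below is a hypothetical illustration in plain Python, not part of Keras:

```python
def to_batch_shape(shape, batch_size=None):
    """Prepend the batch dimension to a per-sample shape tuple."""
    return (batch_size,) + tuple(shape)

print(to_batch_shape((32,)))      # (None, 32)
print(to_batch_shape((32,), 10))  # (10, 32)
```

The first call corresponds to the ordinary case where any batch size is accepted; the second fixes the batch size at 10, which is the kind of information a stateful RNN asks for.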

Keras LSTM: first argument

In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The other answers seem to refer to multi-layer perceptrons (MLPs), in which the hidden layer can be of a different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
The h is the output for a given timestep, and the cell state c is bound to the hidden size because of the element-wise multiplication. The addition of terms to compute the gates requires that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for the Keras LSTM as well, and it is why you only provide a single units argument.
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units; the output is:
   seq vector            RNN weights
(1 x input_dim) * (input_dim x hidden_units),
which has shape 1 x hidden_units (a row vector representing the encoding of your input sequence). Thus, the names in this case are used synonymously.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications instead of the vector-matrix multiplication shown above.
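To make the shape argument concrete, here is a NumPy sketch of a single LSTM step. The sizes, random weights, and gate ordering are assumptions for illustration; this is not the Keras implementation. Both kernels map to 4 * units columns (one slice per gate), and the resulting h has exactly units entries:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; gates are stacked as [input, forget, cell, output]."""
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b          # shape (1, 4 * units)
    i = sigmoid(z[:, 0 * units:1 * units])
    f = sigmoid(z[:, 1 * units:2 * units])
    g = np.tanh(z[:, 2 * units:3 * units])
    o = sigmoid(z[:, 3 * units:4 * units])
    c_t = f * c_prev + i * g              # element-wise, so c_t has `units` entries
    h_t = o * np.tanh(c_t)                # the output for this time step
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units, timesteps = 3, 10, 4    # illustrative sizes
W = rng.normal(size=(input_dim, 4 * units))   # input kernel
U = rng.normal(size=(units, 4 * units))       # recurrent kernel
b = np.zeros(4 * units)

h = np.zeros((1, units))
c = np.zeros((1, units))
for x_t in rng.normal(size=(timesteps, 1, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # (1, 10)
```

Changing timesteps changes how many times lstm_step runs, but not the size of h, which is set by units alone.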
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors h_{t-1}, h_t, and h_{t+1}: RNN image.
If you want to control the number of LSTM blocks in your network, you need to specify this through the input to the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (see the Keras documentation). timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again, in the RNN image, timesteps would control how many green blocks the network contains.

How does word2vec or skip-gram model convert words to vector?

I have been reading a lot of papers on NLP and have come across many models. I understood the SVD model and how to represent it in 2-D, but I still do not understand how a word vector is made by giving a corpus to the word2vec/skip-gram model. Is it also a co-occurrence matrix representation for each word? Can you explain it using an example corpus:
Hello, my name is John.
John works in Google.
Google has the best search engine.
Basically, how does skip-gram convert John to a vector?
I think you will need to read a paper about the training process. Basically, the values of the vectors are the node values of the trained neural network.
I tried to read the original paper, but I think the paper "word2vec Parameter Learning Explained" by Xin Rong has a more detailed explanation.
The main concept can be easily understood through the example of autoencoding with neural networks. You train the neural network to pass information from the input layer to the output layer through a middle layer that is smaller.
In a traditional autoencoder, you have an input vector of size N, a middle layer of length M < N, and an output layer, again of size N. You want only one unit at a time turned on in your input layer, and you train the network to replicate in the output layer the same unit that is turned on in the input layer.
After the training has completed successfully, you will see that the neural network, in order to transport the information from the input layer to the output layer, has adapted itself so that each input unit has a corresponding vector representation in the middle layer.
Simplifying a bit, in the context of word2vec your input and output vectors work more or less in the same way, except that in each sample you submit to the network, the unit turned on in the input layer is different from the unit turned on in the output layer.
In fact, you train the network by picking pairs of nearby (not necessarily adjacent) words from your corpus and submitting them to the network.
The size of the input and output vectors is equal to the size of the vocabulary you are feeding to the network.
Your input vector has only one unit turned on (the one corresponding to the first word of the chosen pair); the output vector has one unit turned on (the one corresponding to the second word of the chosen pair).
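Using the corpus from the question, the pair-picking step can be sketched in plain Python. The tokenizer, the window size of 2, and the helper names are assumptions made for illustration:

```python
def tokenize(sentence):
    """Lowercase and strip basic punctuation (a deliberately simple tokenizer)."""
    return [w.strip(".,").lower() for w in sentence.split()]

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def one_hot(index, size):
    """Vector with a single unit turned on."""
    return [1 if k == index else 0 for k in range(size)]

corpus = [
    "Hello, my name is John.",
    "John works in Google.",
    "Google has the best search engine.",
]
sentences = [tokenize(s) for s in corpus]
vocab = sorted({w for s in sentences for w in s})
word_to_id = {w: k for k, w in enumerate(vocab)}

pairs = [p for s in sentences for p in skipgram_pairs(s)]
center, context = pairs[0]
x = one_hot(word_to_id[center], len(vocab))   # input: first word of the pair
y = one_hot(word_to_id[context], len(vocab))  # target: second word of the pair
print(pairs[0], len(vocab))  # ('hello', 'my') 13
```

Every pair in which "john" is the first word, such as ("john", "is") or ("john", "works"), is a training sample that nudges John's vector.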
For current readers who might also be wondering "what does a word vector exactly mean" as the OP was at that time: As described at http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf, a word vector is of dimension n, and n "is an arbitrary size which defines the size of our embedding space." That is to say, this word vector doesn't mean anything concretely. It's just an abstract representation of certain qualities that this word might have, that we can use to distinguish words.
In fact, to directly answer the original question of "how is a word converted to a vector representation": the values of a word's embedding vector are usually just randomized at initialization and improved iteration by iteration.
This is common in deep learning/neural networks, where the human beings who created the network usually don't have much idea about what the values exactly stand for. The network itself is supposed to figure the values out gradually, through learning. They just abstractly represent something and distinguish things from each other. One example would be AlphaGo, where it would be impossible for the DeepMind team to explain what each value in a vector stands for. It just works.
First of all, you normally don't use SVD with the Skip-Gram model, because Skip-Gram is based on a neural network. You use SVD because you want to reduce the dimension of your word vectors (e.g., for visualization in 2D or 3D space), but in a neural net you construct your embedding matrices with the dimension of your choice. You use SVD if you constructed your embedding matrix from a co-occurrence matrix.
Vector representation with co-occurrence matrix
I wrote an article about this here.
Consider the following two sentences: "all that glitters is not gold" + "all is well that ends well"
The co-occurrence matrix is then:
With a co-occurrence matrix, each row is the word vector for one word. However, as you can see in the matrix constructed above, each row has 10 columns. This means that the word vectors are 10-dimensional and can't be visualized in 2D or 3D space. So we run SVD to reduce them to 2 dimensions:
Now that the word vectors are 2-dimensional, they can be visualized in a 2D space:
However, reducing the word vectors to 2D results in a significant loss of meaningful data, which is why you shouldn't reduce the dimension too much.
Let's take another example: achieve and success. Let's say they have 10-dimensional word vectors:
Since achieve and success convey similar meanings, their vector representations are similar. Notice their similar values and color band pattern. However, since these are 10-dimensional vectors, they can't be visualized. So we run SVD to reduce the dimension to 3D, and visualize them:
Each value in the word vector represents the word's position within the vector space. Similar words will have similar vectors and, as a result, will be placed close to each other in the vector space.
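A rough NumPy sketch of this pipeline for the two sentences above follows. The START/END markers and the window size of 1 are assumptions (they are one way to arrive at 10 distinct tokens), so the counts may not match the original figure exactly:

```python
import numpy as np

sentences = [
    "all that glitters is not gold".split(),
    "all is well that ends well".split(),
]
# Assumption: START/END markers are added, which gives 10 distinct tokens.
sentences = [["START"] + s + ["END"] for s in sentences]

vocab = sorted({w for s in sentences for w in s})
index = {w: k for k, w in enumerate(vocab)}

# Count neighbors within a window of 1 (an assumed window size).
window = 1
M = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                M[index[w], index[s[j]]] += 1

# Truncated SVD: keep the top 2 singular directions.
U, S, Vt = np.linalg.svd(M)
k = 2
reduced = U[:, :k] * S[:k]   # one 2-D vector per word

print(M.shape, reduced.shape)  # (10, 10) (10, 2)
```

Each row of reduced is a 2D word vector that can be plotted directly. The singular values dropped by the truncation are a measure of how much information the 2D picture loses.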
Vector representation with Skip-Gram
I wrote an article about it here.
Skip-Gram uses a neural net and therefore does not use SVD, because you can specify the word vector's dimension as a hyper-parameter when you first construct the network (if you really need to visualize, a special technique called t-SNE is used instead of SVD).
Skip-Gram has the following structure:
With Skip-Gram, N-dimensional word vectors are randomly initialized. There are two embedding matrices: the input weight matrix W_input and the output weight matrix W_output.
Let's take W_input as an example. Assume that the words you are interested in are passes and should. Since the randomly initialized weight matrix is 3-dimensional, they can be visualized:
These weight matrices (W_input and W_output) are optimized by predicting a center word's neighboring words and updating the weights in a way that minimizes the prediction error. The predictions are computed for each context word of a center word, and their prediction errors are summed up to calculate the weight gradients.
The weight matrices update equations are:
These updates are applied for each training sample within the corpus (since Word2Vec uses stochastic gradient descent).
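One such update can be sketched in NumPy for vanilla Skip-Gram with a full softmax. The vocabulary size, dimension, learning rate, and word ids below are illustrative assumptions, and the function is a hypothetical helper rather than the word2vec reference code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_step(W_input, W_output, center, contexts, lr=0.1):
    """One SGD update for a center word id and its context word ids.

    Returns the loss (summed cross-entropy) measured before the update.
    """
    h = W_input[center].copy()          # hidden layer = center word's vector
    y = softmax(h @ W_output)           # predicted distribution over the vocabulary
    loss = -sum(np.log(y[c]) for c in contexts)

    # Prediction errors summed over all context words.
    err = len(contexts) * y
    for c in contexts:
        err[c] -= 1.0

    grad_out = np.outer(h, err)         # gradient w.r.t. W_output
    grad_in = W_output @ err            # gradient w.r.t. the center word's row
    W_output -= lr * grad_out
    W_input[center] -= lr * grad_in
    return loss

rng = np.random.default_rng(0)
vocab_size, dim = 8, 3                  # illustrative sizes
W_input = rng.normal(scale=0.1, size=(vocab_size, dim))    # random initialization
W_output = rng.normal(scale=0.1, size=(dim, vocab_size))

losses = [skipgram_step(W_input, W_output, center=2, contexts=[1, 3]) for _ in range(50)]
```

Repeating the step on the same sample drives the loss down; in real training the samples come from sliding a window over the whole corpus. Note that only one row of W_input changes per step, while every column of W_output is touched, which is where the computational cost of the vanilla version comes from.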
Vanilla Skip-Gram vs Negative Sampling
The Skip-Gram illustration above assumes vanilla Skip-Gram. In practice, we don't use vanilla Skip-Gram because of its high computational cost. Instead, we use an adapted form of Skip-Gram called negative sampling.

Resources