How to change input length in Embedding layer for each batch? - keras

I am designing an embedding layer where the vocabulary is ~4000 and most training examples have a short length of less than 10. However, some examples have a length of 100 or possibly even several hundred, and I would like to avoid zero-padding every single example to length 100+ just to keep the input length constant across all examples.
To remedy this, I would like to pad only up to the max length within each batch, so that almost all batches would have an input length of ~10, with only a few batches containing a lot of padding. How do I feed batches with different input lengths into the Embedding layer?

One possible way is to set the input_length argument to None. But if you are going to use Flatten and Dense layers after this layer, they will not work, because the Dense output shape cannot be computed. From the Keras documentation for Embedding:
... This argument is
required if you are going to connect Flatten then Dense layers
upstream (without it, the shape of the dense outputs cannot be
computed)
from tensorflow import keras

model = keras.models.Sequential([
    # voc_size and embedding_dim are defined elsewhere; input_length=None
    # (the default) leaves the sequence length unspecified
    keras.layers.Embedding(voc_size, embedding_dim, input_length=None)
])
Now the model can accept variable length sequences.
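To pad only up to the longest example in each batch, one option is tf.data's padded_batch, which pads every batch independently. A minimal sketch, assuming toy data (sequences, labels and the layer sizes below are illustrative, not from the question):
import tensorflow as tf
from tensorflow import keras

voc_size, embedding_dim = 4000, 32
sequences = [[5, 9, 2], [7, 1], [3, 8, 6, 4, 2, 9]]   # toy variable-length examples
labels = [0, 1, 0]

dataset = tf.data.Dataset.from_generator(
    lambda: zip(sequences, labels),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).padded_batch(2, padded_shapes=([None], []))          # pad per batch, not globally

model = keras.models.Sequential([
    keras.layers.Embedding(voc_size, embedding_dim, mask_zero=True),  # no fixed length
    keras.layers.GlobalAveragePooling1D(),              # mask-aware, length-agnostic
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=1)
Each batch is padded only to its own longest sequence, so batches of short examples stay short; mask_zero=True additionally makes downstream mask-aware layers ignore the padded positions.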

Related

Extracting hidden representations for each token - PyTorch LSTM

I am currently working on an NLP project involving recurrent neural networks. I implemented an LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought that the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in the training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM as long as it is a single-layer LSTM. If you check the documentation (here), you can see that an LSTM outputs a tensor and a tuple of tensors. The tuple contains the hidden and cell states for the last sequence step. What each dimension of the output tensor means depends on how you initialized your network: either the first or the second dimension is the batch dimension, and the rest is the sequence of per-token hidden representations you want.
If you use a packed sequence as input, it is a bit of a different story.
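A minimal sketch of reading those per-token representations straight from the first output of nn.LSTM (the sizes below are arbitrary, not the question's actual model):
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, batch_first=True)
x = torch.randn(1, 35, 100)              # (batch, seq_len, input_size)

with torch.no_grad():
    output, (h_n, c_n) = lstm(x)

print(output.shape)      # (1, 35, 256): one hidden vector per token
print(h_n.shape)         # (1, 1, 256): hidden state of the last step only
token_reprs = output[0]  # (35, 256) word-level hidden representations
Because the per-step hidden states are already in output, there is no need to switch to a batch size and sequence length of 1 at test time.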

LSTM input longer than output

I am not sure if I understand exactly how the Keras version of LSTM works.
Let's say I have a vector of len=20 as input and I specify keras.layers.LSTM(units=10).
So in this example, does the network finish after processing 50% of the input, or does it process the rest from the start (I mean from the first cell)?
Units are never related to the input size.
Units are related only to the output size (units = output features or channels).
An LSTM layer will always process the entire data and optionally return either the "same length (all steps)" or "no length (only last step)".
In terms of shapes
You must have an input tensor with shape (batch, len=20, input_features).
And it will output:
For return_sequences=False: (batch, output_features=10) - no length
For return_sequences=True: (batch, len=20, output_features=10) - same length
Output features is always equal to units.
See a full explanation of the LSTM layers here: Understanding Keras LSTMs
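A small sketch of those shapes (the batch size and feature count below are arbitrary):
import numpy as np
from tensorflow import keras

x = np.random.rand(4, 20, 3).astype("float32")         # (batch, len=20, input_features=3)

last_only = keras.layers.LSTM(10)(x)                    # return_sequences=False (the default)
all_steps = keras.layers.LSTM(10, return_sequences=True)(x)

print(last_only.shape)   # (4, 10) - no length dimension
print(all_steps.shape)   # (4, 20, 10) - same length, 10 output features per step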

Arbitrary length inputs for CNNs in sequential learning

In An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, the authors state that TCN networks, a specific type of 1D CNNs applied to sequential data, "can also take in inputs of arbitrary lengths by sliding the 1D convolutional kernels", just like Recurrent Nets. I am wondering how this can be done.
For an RNN, it is straightforward that the same function would be applied as many times as the input length. However, for CNNs (or any feed-forward NN in general), one must pre-specify the number of input neurons. So the only way I can see TCNs dealing with arbitrary-length inputs is by specifying a fixed-length input neuron space and then adding zero padding to the arbitrary-length inputs.
Am I correct in my understanding?
If you have a fully convolutional neural network, there is no reason to fully specify the input shape. You do need a fixed rank, and the last dimension should probably stay the same, but otherwise you can leave the temporal dimension unspecified, which in TensorFlow would look like Input((None, 10)) in the case of 1D CNNs.
Indeed, the shape of the convolution kernel doesn't depend on the length of the input in the temporal dimension (though it typically depends on the last dimension in convolutional neural networks), and you can apply it to any input with the same rank (and same last dimension).
For example, let's say that you are applying only a single 1D convolution with a kernel of size 2 and weights (1, 1), i.e. one that sums two neighbouring elements. This operation can be applied to an input of any length, given that it's always 1D.
However, when being confronted with a sequence-to-label task and requiring further operations in the stack such as a fully-connected layer, the inputs must be of fixed length (or must be made so through zero padding).
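A sketch of that in Keras, with arbitrary layer sizes: the model is built with Input((None, 10)), so it accepts any sequence length, and for a sequence-to-label task a global pooling layer (rather than a fixed-size Flatten) collapses the variable time dimension before the dense head.
import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(None, 10))                  # arbitrary length, 10 channels
x = keras.layers.Conv1D(32, kernel_size=3, padding="causal", activation="relu")(inputs)
x = keras.layers.Conv1D(32, kernel_size=3, padding="causal", dilation_rate=2, activation="relu")(x)
x = keras.layers.GlobalAveragePooling1D()(x)            # removes the dependence on length
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

print(model(np.random.rand(2, 15, 10).astype("float32")).shape)    # (2, 1)
print(model(np.random.rand(2, 200, 10).astype("float32")).shape)   # (2, 1), same weights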

Padding sequences of 2D elements in keras

I have a set of samples, each being a sequence of a set of attributes (for example, a sample can consist of 10 sequences, each having 5 attributes). The number of attributes is always fixed, but the number of sequences (which are timestamps) can vary from sample to sample. I want to use this sample set for training an LSTM network in Keras for a classification problem, and therefore I should pad the input size for all batch samples to be the same. But the pad_sequences processor in Keras takes a fixed number of sequences with variable attributes and pads the missing attributes in each sequence, while I need to add more sequences of a fixed attribute length to each sample. So I think I cannot use it, and therefore I padded my samples separately, made a unified dataset, and then fed my network with it. But is there a shortcut with Keras functions to do this?
Also, I heard about masking the padded input data during learning, but I am not sure if I really need it, as my classifier assigns one class label after processing the whole sample sequence. Do I need it? And if yes, could you please help me with a simple example on how to do that?
Unfortunately, the documentation is quite misleading, but pad_sequences does exactly what you want. For example, this code
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

length3 = np.random.uniform(0, 1, size=(3, 2))
length4 = np.random.uniform(0, 1, size=(4, 2))
pad_sequences([length3, length4], dtype='float32', padding='post')
results in
[[[0.0385175  0.4333343 ]
  [0.332416   0.16542904]
  [0.69798684 0.45242336]
  [0.         0.        ]]

 [[0.6518417  0.87938637]
  [0.1491589  0.44784057]
  [0.27607143 0.02688376]
  [0.34607577 0.3605469 ]]]
So, here we have two sequences of different lengths, each timestep having two features, and the result is one numpy array where the shorter of the two sequences got padded with zeros.
Regarding your other question: masking is a tricky topic, in my experience. But LSTMs should be fine with it. Just use a Masking() layer as your very first one. By default, it will make the LSTM ignore all timesteps that are entirely zeros, so in your case exactly the ones you added via padding. But you can use any value for masking, just as you can use any value for padding. If possible, choose a value that does not occur in your dataset.
If you don't use masking, you run the risk that your LSTM learns that the padded values have some meaning, while in reality they don't.
For example, if during training you feed in the sequence
[[1,2],
[2,1],
[0,0],
[0,0],
[0,0]]
and later feed the trained network only
[[1,2],
[2,1]]
you could get unexpected results (not necessarily, though). Masking avoids that by excluding the masked values from training.
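A minimal sketch of that setup, with arbitrary layer sizes (the toy batch below reuses the padded example above):
import numpy as np
from tensorflow import keras

padded = np.array([[[1, 2], [2, 1], [0, 0], [0, 0], [0, 0]]], dtype="float32")

model = keras.models.Sequential([
    keras.Input(shape=(None, 2)),
    keras.layers.Masking(mask_value=0.0),   # timesteps that are all 0.0 are skipped
    keras.layers.LSTM(16),
    keras.layers.Dense(1, activation="sigmoid"),
])

print(model.predict(padded).shape)   # (1, 1); the three padded timesteps are ignored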

Keras LSTM: first argument

In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that the number 10 is referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The answers seem to refer to multi-layer perceptrons (MLPs), in which the hidden layer can be of a different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
The h is the output for a given timestep, and the cell state c is bound to the hidden size by the element-wise multiplications. The addition of the terms that compute the gates requires that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for the Keras LSTM as well and is why you only provide a single units argument.
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units, and the output is:
  seq vector         RNN weights
(1 × input_dim)  *  (input_dim × hidden_units)
which has shape 1 × hidden_units (a row vector representing the encoding of your input sequence). And thus, the names in this case are used synonymously.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications instead of the vector-matrix one shown above.
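A tiny numpy check of that shape argument (the dimensions below are made up for illustration):
import numpy as np

input_dim, hidden_units = 8, 10
x = np.random.rand(1, input_dim)               # one input step: (1, input_dim)
W = np.random.rand(input_dim, hidden_units)    # input kernel: (input_dim, hidden_units)

print((x @ W).shape)                           # (1, 10) == (1, hidden_units)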
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors h_{t-1}, h_t, and h_{t+1}: RNN image.
If you want to control the number of LSTM blocks in your network, you need to specify this via the input to the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (Keras documentation); timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again (RNN image), timesteps would control how many green blocks the network contains.
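A short sketch contrasting the two knobs (the numbers here are arbitrary): units sets the size of each hidden/output vector, while the timesteps dimension of the input controls how many times the cell is unrolled.
import numpy as np
from tensorflow import keras

nb_samples, timesteps, input_dim = 2, 7, 5
x = np.random.rand(nb_samples, timesteps, input_dim).astype("float32")

out = keras.layers.LSTM(10, return_sequences=True)(x)
print(out.shape)    # (2, 7, 10): 7 unrolled steps, each with a 10-dimensional hidden state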
