I am learning to use the Keras LSTM model. I have looked at this tutorial, this tutorial, and this tutorial, and am feeling unsure about my understanding of the LSTM model's input shape. My question is: if one shapes one's data like the first tutorial, (8760, 1, 8), and the data is fed to the network one timestep at a time, i.e. input_shape=(1, 8), does the network learn the temporal dependencies between samples?
It only makes sense to have batches of 1 timestep when you're using stateful=True. Otherwise there is no temporal dependency, as you presumed.
The difference is:
stateful=False, input_shape=(1,any):
first batch of shape (N, 1, any): contains N different sequences of length 1
second batch: contains another N different sequences of length 1
total of the two batches: 2N sequences of length 1
more batches: more independent sequences
yes, there is no point in using steps=1 when stateful=False
stateful=True, input_shape=(1,any):
first batch of shape (N, 1, any): contains the first step of N different sequences
second batch: contains the second step of the same N sequences
total of the two batches: N sequences of length 2
more batches = more steps of the same sequences, until you call model.reset_states()
Usually, it's more complicated to handle stateful=True layers, and if you can put entire sequences in a batch, like input_shape=(allSteps, any), there is no reason to turn stateful on.
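A minimal sketch of the stateful=True pattern described above (the sizes here are made up for illustration): each batch carries one timestep of the same N sequences, and the layer threads its state from call to call.

```python
import numpy as np
from tensorflow import keras

# Hypothetical data: 32 sequences, 4 timesteps each, 8 features per step.
N, steps, features = 32, 4, 8
data = np.random.rand(N, steps, features).astype("float32")

# stateful=True requires a fixed batch size, declared via batch_shape.
inputs = keras.Input(batch_shape=(N, 1, features))
outputs = keras.layers.LSTM(16, stateful=True)(inputs)
model = keras.Model(inputs, outputs)

# Feed one timestep per batch; the layer carries its state across calls,
# so batch t is treated as step t of the same N sequences.
for t in range(steps):
    out = model(data[:, t : t + 1, :])

print(out.shape)  # (32, 16): one hidden state per sequence
# When a new set of sequences begins, call model.reset_states(),
# otherwise the layer keeps mixing in the old state.
```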
If you want a detailed explanation of RNNs on Keras, see this answer
I am currently working on an NLP project involving recurrent neural networks. I implemented an LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in the training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM as long as it is a single-layer LSTM. If you check the documentation (here), you can see that an LSTM outputs a tensor and a tuple of tensors. The tensor holds the hidden state for every sequence step, while the tuple contains only the hidden and cell states for the last step. What each dimension of the output means depends on how you initialized your network: either the first or the second dimension is the batch dimension, and the rest is the sequence of word-level hidden states you want.
If you use a packed sequence as input, it is a bit of a different story.
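As a sketch with made-up sizes: the first return value of nn.LSTM already contains one hidden vector per token, so a batch size and sequence length of 1 are not needed.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; batch_first=True puts the batch in dimension 0.
lstm = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)

tokens = torch.randn(1, 35, 100)   # one text of 35 token embeddings
output, (h_n, c_n) = lstm(tokens)

# `output` holds the hidden state for every token:
print(output.shape)                # torch.Size([1, 35, 64])
# h_n is just the hidden state of the last step:
assert torch.allclose(output[:, -1, :], h_n[0])
```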
I am switching from TensorFlow to PyTorch and I am having some trouble with my net.
I have made a Collator (for the DataLoader) that pads each tensor (originally a sentence) in each batch to the max length of that batch.
So I have a different input size per batch.
My network consists of LSTM -> LSTM -> DENSE.
My question is, how can I specify this variable input size to the LSTM?
I assume that in TensorFlow I would do Input((None, x)) before the LSTM.
Thank you in advance
The input size of the LSTM is not how long a sample is. Say you have a batch with three samples: the first has length 10, the second 12, and the third 15. What you already did is pad them all with zeros so that all three have length 15. The next batch may be padded to 16 instead, and that's fine.
But this 15 is not the input size of the LSTM. The input size is the size of one element of a sample in the batch, and that should always be the same.
For example when you want to classify names:
the inputs are names, for example "joe", "mark", "lucas".
But what the LSTM takes as input are the characters: "j", then "o", and so on. So as the input size you have to give the number of dimensions one character has.
If you use embeddings, that is the embedding size; if you use one-hot encoding, it is the vector size (probably 26). An LSTM consumes the characters of a word iteratively, not the entire word at once.
self.lstm = nn.LSTM(input_size=embedding_size, ...)
I hope this answered your question; if not, please clarify it! Good luck!
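A small sketch of the point above, with made-up sizes: two batches padded to different lengths pass through the same LSTM without any issue, because only input_size (the last dimension) has to stay constant.

```python
import torch
import torch.nn as nn

# Hypothetical name classifier: 26 one-hot character features per step.
input_size = 26                 # size of ONE element of a sequence
lstm = nn.LSTM(input_size=input_size, hidden_size=32, batch_first=True)

# Two batches padded to different lengths -- both are fine, because
# only the last dimension (input_size) must stay the same.
batch_a = torch.zeros(3, 15, input_size)   # 3 names, padded to length 15
batch_b = torch.zeros(3, 16, input_size)   # next batch, padded to length 16

out_a, _ = lstm(batch_a)
out_b, _ = lstm(batch_b)
print(out_a.shape, out_b.shape)   # (3, 15, 32) and (3, 16, 32)
```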
I am designing an embedding layer where the vocabulary is ~4000 and most training examples have a short length of less than 10. However some examples have a length of 100 or possibly even several hundred, and I would like to avoid zero padding every single example to length 100+ in order to maintain constant input length across all examples.
To remedy this I would like to only pad based on the max length within the batch, so that almost all batches would only have input length ~10 with only a few batches having a lot of padding. How do I load in each batch with a different input length into the Embedding layer?
One possible way is to set the input_length argument to None. But if you are going to use Dense and Flatten layers after this layer, they might not work. For more, see the Keras documentation page:
... This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed)
model = keras.models.Sequential(
    [
        keras.layers.Embedding(voc_size, embedding_dim, input_length=None)
    ]
)
Now the model can accept variable length sequences.
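A quick check of this behavior (the batch shapes below are made up; input_length is simply left at its default of None here, and newer Keras releases drop the argument entirely):

```python
import numpy as np
from tensorflow import keras

voc_size, embedding_dim = 4000, 16      # sizes from the question

model = keras.Sequential([
    keras.Input(shape=(None,)),          # None = variable sequence length
    keras.layers.Embedding(voc_size, embedding_dim),
])

short_batch = np.random.randint(0, voc_size, size=(64, 10))   # padded to 10
long_batch = np.random.randint(0, voc_size, size=(4, 120))    # padded to 120

print(model(short_batch).shape)  # (64, 10, 16)
print(model(long_batch).shape)   # (4, 120, 16)
```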
In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The other answers seem to refer to multi-layer perceptrons (MLPs), in which the hidden layer can be a different size from the output, and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
The h is the output for a given timestep and the cell state c is bound by the hidden size due to element wise multiplication. The addition of terms to compute the gates would require that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for Keras LSTM as well and is why you only provide single units argument.
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units; the output is:
   seq vector     *     RNN weights
(1 X input_dim)   *   (input_dim X hidden_units),

which has shape 1 X hidden_units (a row vector representing the encoding of your input sequence). And thus, the names are in this case used synonymously.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications instead of the vector-matrix one shown above.
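A quick sanity check of this (toy sizes, chosen for illustration): the output of LSTM(10) is a 10-dimensional vector per sequence, matching the units argument.

```python
import numpy as np
from tensorflow import keras

# Hypothetical toy input: 2 sequences, 5 timesteps, 3 features each.
x = np.random.rand(2, 5, 3).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(5, 3)),
    keras.layers.LSTM(10),     # units=10: hidden size == output size
])
print(model(x).shape)  # (2, 10)
```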
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors h_{t-1}, h_t, and h_{t+1}: RNN image.
If you want to control the number of unrolled LSTM blocks in your network, you need to specify this via the input to the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (see the Keras documentation); timesteps controls how many LSTM blocks your network unrolls over. Referring to the tutorial on colah's blog again, in the RNN image, timesteps would control how many green blocks the network contains.
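To see the timesteps axis directly (toy sizes; return_sequences=True just exposes the hidden state of every unrolled step rather than only the last one):

```python
import numpy as np
from tensorflow import keras

# 7 timesteps of 3 features each; return_sequences=True returns the
# hidden state at every timestep (every "green block").
model = keras.Sequential([
    keras.Input(shape=(7, 3)),
    keras.layers.LSTM(10, return_sequences=True),
])
x = np.random.rand(1, 7, 3).astype("float32")
print(model(x).shape)  # (1, 7, 10)
```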
Newbie to Keras alert!!!
I've got some questions related to Recurrent Layers in Keras (over theano)
How is the input supposed to be formatted regarding timesteps? Say, for instance, I want a layer that will have 3 timesteps: 1 in the future, 1 in the past, and 1 current. I see some answers and the API proposing padding and using the embedding layer, or shaping the input using a time window (3 in this case). In any case, I can't make heads or tails of the API, and SimpleRNN examples are scarce and don't seem to agree.
How would the input time window formatting work with a masking layer?
Some related answers propose performing masking with an embedding layer. What does masking have to do with embedding layers anyway, aren't embedding layers basically 1-hot word embeddings? (my application would use phonemes or characters as input)
I can start an answer, but this question is very broad so I would appreciate suggestions on improvement to my answer.
Keras SimpleRNN expects an input of size (num_training_examples, num_timesteps, num_features).
For example, suppose I have sequences of counts of numbers of cars driving by an intersection per hour (small example just to illustrate):
X = np.array([[10, 14, 2, 5], [12, 15, 1, 4], [13, 10, 0, 0]])
Aside: Notice that I was taking observations over four hours, and the last two hours had no cars driving by. That's an example of zero-padding the input, which means making all of the sequences the same length by adding 0s to the end of shorter sequences to match the length of the longest sequence.
Keras would expect the following input shape: (X.shape[0], X.shape[1], 1), which means I could do this:
X_train = np.reshape(X, (X.shape[0], X.shape[1], 1))
And then I could feed that into the RNN:
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape = (X.shape[1], X.shape[2])))
You'd add more layers, or add regularization, etc., depending on the nature of your task.
For your specific application, I would think you would need to reshape your input to have 3 elements per row (last time step, current, next).
I don't know much about the masking layers, but here is a good place to start.
As far as I know, embeddings are independent of masking, but you can mask an embedding.
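For instance, a Masking layer in front of the SimpleRNN would make it skip the zero-padded steps of the car-count example above (a sketch, not a full recipe: masking on 0.0 would also hide any genuine all-zero step, so a dedicated pad value is often safer).

```python
import numpy as np
from tensorflow import keras

# Same car-count data as above; the trailing 0s are padding.
X = np.array([[10, 14, 2, 5], [12, 15, 1, 4], [13, 10, 0, 0]], dtype="float32")
X_train = X.reshape(X.shape[0], X.shape[1], 1)

model = keras.Sequential([
    keras.Input(shape=(4, 1)),
    keras.layers.Masking(mask_value=0.0),   # padded steps are skipped
    keras.layers.SimpleRNN(units=10, activation='relu'),
])
print(model(X_train).shape)  # (3, 10)
```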
Hope that provides a good starting point!