LSTM state within a batch - keras

I'm currently working on a Keras tutorial for recurrent network training and I'm having trouble understanding the Stateful LSTM concept. To keep things as simple as possible, the sequences have the same length seq_length. As far as I get it, the input data is of shape (n_samples, seq_length, n_features) and we then train our LSTM on n_samples/M batches of size M as follows:
For each batch:
Feed in the 2D-tensors (seq_length, n_features) and for each input 2D-tensor compute the gradient
Sum these gradients to get the total gradient on the batch
Backpropagate the gradient and update weights
In the tutorial's example, feeding in the 2D-tensors is feeding in a sequence of size seq_length of letters encoded as vectors of length n_features. However, the tutorial says that in the Keras implementation of LSTMs, the hidden state is not reset after a whole sequence (2D-tensor) is fed in, but after a batch of sequences is fed in to use more context.
Why does keeping the hidden state of the previous sequence and using it as initial hidden state for our current sequence improve the learning and the predictions on our test set, since that "previously learned" initial hidden state won't be available when making predictions ? Moreover, Keras' default behaviour is to shuffle input samples at the beginning of each epoch so the batch context is changed at each epoch. This behaviour seems contradictory to keeping the hidden state through a batch since batch context is random.

LSTMs in Keras aren't stateful by default - each sequence starts with newly-reset states. By setting stateful=True in your recurrent layer, successive inputs in a batch don't reset the network state. This assumes that the sequences are actually successive, and it means that in a (very informal) sense, you're training on sequences of length batch_size * seq_length.
Why does keeping the hidden state of the previous sequence and using
it as initial hidden state for our current sequence improve the
learning and the predictions on our test set, since that "previously
learned" initial hidden state won't be available when making
predictions ?
In theory, it improves learning because a longer context can teach the network things about the distribution that are still relevant when testing on the individually shorter sequences. If the network is learning some probability distribution, that distribution should hold over different sequence lengths.
Moreover, Keras's default behaviour is to shuffle input samples at the
beginning of each epoch so the batch context is changed at each epoch.
This behaviour seems contradictory to keeping the hidden state through
a batch since batch context is random.
I haven't checked, but I assume that when stateful=True, only batches are shuffled - not the sequences within them.
In general, when we give the network some initial state, we don't mean for that to be a universally better starting point. It just means that the network can take the information from previous sequences into account when training.

Related

Extracting hidden representations for each token - PyTorch LSTM

I am currently working on a NLP project involving recurrent neural networks. I implemented a LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought that the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM as long as it is a single layer LSTM. If u check the documentation (here), for the output of an LSTM, you can see it outputs a tensor and a tuple of tensors. The tuple contains the hidden and cell for the last sequence step. What each dimension means of the output depends on how u initialized your network. Either the first or second dimension is the batch dimension and the rest is the sequence of word embeddings you want.
If u use a packed sequence as input, it is a bit of a different story.

Ensuring that optimization does not find the trivial solution by setting weights to 0

I am trying to train a neural network which takes as input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) is passed to the same model. This process keeps repeating for a couple of steps.
The goal of optimization is to ensure the distance between s_t0 and s_t1 is small through self-supervision, as s_t1 is supposed to be an transformed version of s_t0. In other words, I want s_t1 to only carry new information in the new input. My intuition tells me taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid won't be the best thing to do necessarily, as it might not encourage the model to update the state vector with new information.
Currently the way I train the model is by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1). Then I call loss.backward() and optimizer.step() which changes the weights. Note that the reason that I use abs() is that the hidden states are produced after applying ReLU, so the only hold positive values. So what is the best way to achieve this and ensure the weights don't go to 0? Would I be able to somehow use mutual information for this?
However, I noticed that optimization quickly finds the trivial solution by setting weights to 0. This causes both s_t0 and s_t1 get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure weights do not go to zero during optimization?

What are the best normalization technique and LSTM structure for forecasting an output with jumps (outliers)?

I have a time series forecasting case with ten features (inputs), and only one output. I'm using 22 timesteps (history of features) for one step ahead prediction using LSTM. Also, I apply MinMaxScaler for input normalization, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), but the other values are between 0 and ~5 (all values are positive). In this case, it's important to forecast both normal and outlier outputs correctly so I dont want to miss the jumps in my forecasting model. I think if I use MinMaxScaler for output, most of the values will be something near the zero but the others (outliers) will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (currently, I'm using LSTM with relu and Dense layer with relu as the last layer so I the output will be a positive value). I think I should select activation functions correctly for this case.
I think first of all, you should decide on a metric to measure performance. For example, do you want to use MAE or MSE? Or some other metric you decide based on the task at hand. For example, you may tolerate greater error for the "rare jumps", but not for the normal cases, or vice versa. Once you are decided on the error metric, ideally, you should set that as the cost function that the LSTM network would be minimizing.
Now the goal would be to minimize the desired error metric you set. If this was a convex problem, the scaling of the output will not matter. But we now that this is not the case with the complex deep learning architectures. What this means is that while minimizing the cost function with gradient decent, it might get stuck in a local minimum with a very delayed convergence. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With last layers parameters initialized around zero and a bias value of zero (i.e. the linear transformation of relu), the network needs to learn that the bias should be around 5. Depending on the complexity of the network this could take some epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
Now back to your questions:
I would at least make the output zero mean and use Dense layer with linear output.
The architecture you have seems fine, you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the the code and the performance you get and we can discuss what else can be improved.

when do you use Input shape vs batch_shape in keras?

I don't find API that explains keras Input.
When should you use shape attribute vs batch_shape attribute?
From the Keras source code:
Arguments
shape: A shape tuple (integer), not including the batch size.
For instance, `shape=(32,)` indicates that the expected input
will be batches of 32-dimensional vectors.
batch_shape: A shape tuple (integer), including the batch size.
For instance, `batch_shape=(10, 32)` indicates that
the expected input will be batches of 10 32-dimensional vectors.
`batch_shape=(None, 32)` indicates batches of an arbitrary number
of 32-dimensional vectors.
The batch size is how many examples you have in your training data.
You can use any. Personally I never used "batch_shape". When you use "shape", your batch can be any size, you don't have to care about it.
shape=(32,) means exactly the same as batch_shape=(None,32)
To expand on Daniel's answer, one case I've found where it's necessary to specify batch_shape instead of shape to an Input layer is when you are using stateful LSTMs in the functional API. It's described well in Phillipe Remy's blog. In short, the stateful mode allows you to keep the hidden state values in an LSTM across batches (they usually get reset every batch if the default stateful=False is set). That means it needs knowledge about the batch size in order to shape everything properly. If you don't do this, it yells at you:
ValueError: If a RNN is stateful, it needs to know its batch size. Specify the batch size of your input tensors:
- If using a Sequential model, specify the batch size by passing a `batch_input_shape` argument to your first layer.
- If using the functional API, specify the batch size by passing a `batch_shape` argument to your Input layer.
The second point is the relevant one here. If using LSTM with stateful=True in the functional API, you need to set batch_shape for your Input layers.

Keras LSTM: first argument

In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The answers seems to refer to multi-layer perceptrons (MLP) in which the hidden layer can be of different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
The h is the output for a given timestep and the cell state c is bound by the hidden size due to element wise multiplication. The addition of terms to compute the gates would require that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for Keras LSTM as well and is why you only provide single units argument.
To get a good intuition for why this makes sense. Remember that the LSTM job is to encode a sequence into a vector (maybe a Gross oversimplification but its all we need). The size of that vector is specified by hidden_units, the output is:
seq vector RNN weights
(1 X input_dim) * (input_dim X hidden_units),
which has 1 X hidden_units (a row vector representing the encoding of your input sequence). And thus, the names in this case are used synonymously.
Of course RNNs require more than one multiplication and keras implements RNNs as a sequence of matrix-matrix multiplications instead vector-matrix shown above.
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (source code for the LSTM constructor method can be found here. 10 specifies the units argument). In one of the tutorial's you have linked to (colah's blog), the units argument would control the dimension of the vectors ht-1 , ht, and ht+1: RNN image.
If you want to control the number of LSTM blocks in your network, you need to specify this as an input into the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) Keras documentation. timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again, in RNN image, timesteps would control how many green blocks the network contains.

Resources