Flow of states in LSTM - keras

I read that the internal state of LSTMs flows as follows:
It is always passed within a batch, so from the last timestep of the i-th sample to the first timestep of the (i+1)-th sample.
If the LSTM is stateful, the state is also passed between batches, so the memory at the last timestep of batch_k[i] is passed to the first timestep of batch_{k+1}[i], for all indices i.
For me, this raises several questions. (Please correct me if my understanding is wrong)
(1) Does this mean that the first timestep of the (i+1)-th sample needs to be the successor of the last timestep of sample i (for all i)?
(2) Along the same lines, does the first timestep of the i-th sample in batch k+1 have to be the successor of the last timestep of the i-th sample in batch k?
(3) If the first two conclusions are correct, then for stateful LSTMs we can NEVER shuffle anything, and for non-stateful ones we can at most shuffle the batches, but not the samples within batches, correct?
(4) Why do we split the batch into samples of more than one timestep anyway? If the above is correct, then the procedure 'within a sample' is the same as 'within a batch', so we might as well use samples of one timestep each.

Question 1
Not true. Sample s is not related to sample s+1 in the same batch; they are independent.
What must hold instead, with stateful=True, is that sample s of batch b+1 is the successor of sample s of batch b.
The samples in a batch are processed in parallel, and the batches must keep their order. (That's why the documentation says you need shuffle=False when training stateful=True layers.)
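To make the layout concrete, here is a minimal NumPy sketch of how one long series could be cut so that row s of batch b+1 continues row s of batch b. The helper name stateful_batches and the toy series are assumptions for illustration, not part of Keras.

```python
import numpy as np

def stateful_batches(series, batch_size, timesteps):
    """Split one long series into batches so that row s of batch b+1
    continues row s of batch b (the layout stateful=True relies on)."""
    series = np.asarray(series)
    # Each of the batch_size rows walks through its own contiguous stream.
    stream_len = len(series) // batch_size
    n_batches = stream_len // timesteps
    usable = n_batches * timesteps
    streams = series[: batch_size * stream_len].reshape(batch_size, stream_len)
    streams = streams[:, :usable]
    # Result shape: (n_batches, batch_size, timesteps)
    return streams.reshape(batch_size, n_batches, timesteps).swapaxes(0, 1)

# Hypothetical toy data: the integers 0..23 stand in for a time series.
batches = stateful_batches(np.arange(24), batch_size=2, timesteps=3)
print(batches[0])  # rows start two independent streams
print(batches[1])  # each row picks up where the same row of batches[0] ended
```

Feeding these batches in order, without shuffling, keeps every row aligned with its own stream.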
Question 2
This is true :)
Question 3
Partially correct. With stateful=True we cannot shuffle the batches (if there is going to be more than one batch).
But with stateful=False this really doesn't matter, because none of the samples are related to each other. (Each sample in the batch is completely independent.)
Question 4
Since the "samples" in a batch are independent of each other, there is one main reason to have many samples in a batch:
You have many independent sequences instead of just one sequence.
But you may also want to divide each sequence into several batches along the length/timesteps axis. You would do this if:
Your sequences are far too long to fit in memory, so you load and process them in parts.
Your model is predicting the future indefinitely, and it needs to predict step t(n+1) and receive it as input before it can produce step t(n+2).
So yes, you can indeed use samples of one timestep each in stateful=True layers.
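The point about one-timestep samples can be checked with a toy recurrent cell. The sketch below uses a plain tanh RNN written in NumPy with random weights (an assumption for illustration, not Keras's LSTM). It only shows the forward pass: carrying the state across chunks gives the same final state as one pass over the whole sequence, while resetting it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(1, 4))   # input -> hidden weights (toy values)
W_h = rng.normal(size=(4, 4))   # hidden -> hidden weights (toy values)

def run_rnn(x, h):
    """Run a plain tanh RNN over x of shape (timesteps, 1), starting from state h."""
    for x_t in x:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

sequence = rng.normal(size=(12, 1))
h0 = np.zeros(4)

# Whole sequence in one pass.
h_full = run_rnn(sequence, h0)

# Same sequence fed as 12 chunks of one timestep each, carrying the state
# forward, which mirrors what a stateful layer does between batches.
h_chunked = h0
for chunk in np.split(sequence, 12):
    h_chunked = run_rnn(chunk, h_chunked)

# Starting the last chunk from a fresh state (as stateful=False would) differs.
h_reset = run_rnn(sequence[-1:], h0)

print(np.allclose(h_full, h_chunked), np.allclose(h_full, h_reset))
```

During training the picture is less symmetric: the state values are carried over, but gradients do not flow back across the batch boundary.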
These answers may also help you:
https://stackoverflow.com/a/46331227/2097240
https://stackoverflow.com/a/47719094/2097240

Related

How to handle shared data between samples and batches in Keras

I'm using Keras for timeseries prediction and I want to create a model that is based on the self-attention mechanism that will not use any RNNs. For each sample we look at the last x timesteps of samples to predict the next sample.
In other words I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is one problem with this.
There is a lot of unnecessary duplication of data: sample n has basically the same timesteps and features as sample n+1, only shifted one step to the left.
How would you handle this, assuming your dataset is very large?
I am not very familiar with this, but if your issue is "I have too much replicated data", I think you can solve it by writing a generator for your data and passing that generator as input to the Keras/TensorFlow fit function (the TensorFlow API documentation states that fit supports generators as input).
If your question is about the logic behind the model, I do not see the issue. You have a sliding window: for each window you predict one value, and then you move the window by a certain amount (in your case, one step). Could you elaborate a little on your concern?
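A minimal sketch of the generator idea, assuming the series is stored once as a NumPy array and each window is cut out only when its batch is requested. The function name window_batches and the toy arrays are hypothetical.

```python
import numpy as np

def window_batches(series, targets, timesteps, batch_size):
    """Yield (x, y) batches of sliding windows cut lazily from one stored series.

    series:  array of shape (n_steps, n_features)
    targets: array of shape (n_steps,); y for a window is the step right after it.
    """
    n_windows = len(series) - timesteps
    for start in range(0, n_windows, batch_size):
        idx = range(start, min(start + batch_size, n_windows))
        # Only this batch's windows are materialised in memory.
        x = np.stack([series[i : i + timesteps] for i in idx])
        y = np.array([targets[i + timesteps] for i in idx])
        yield x, y

# Hypothetical toy data: 10 steps with 2 features each.
series = np.arange(20, dtype=np.float32).reshape(10, 2)
targets = series[:, 0]
first_x, first_y = next(window_batches(series, targets, timesteps=3, batch_size=4))
print(first_x.shape, first_y)
```

The stored series holds each timestep once; the overlap between neighbouring windows exists only inside the batch currently being yielded. This generator makes a single pass, so training for several epochs would need it wrapped in a loop.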

Relationship between memory cell and time step in LSTM

I'm studying the LSTM model.
Does one memory cell of the hidden layer in an LSTM correspond to one timestep?
Example code: model.add(LSTM(128, input_shape=(4, 1)))
When implementing LSTMs in Keras, you can set the number of memory cells regardless of the number of timesteps, as in the example code. In the example it is 128.
But a typical LSTM diagram shows the number of memory cells corresponding 1:1 to the number of timesteps. Which is correct?
In an LSTM, we supply input in the following shape:
[samples, timesteps, features]
samples is the number of training examples you want to feed at a time.
timesteps is how many past values you want to use.
Say you set timesteps=3.
Then the values at t, t-1 and t-2 are used to predict the value at t+1.
features is how many dimensions you supply at each timestep.
An LSTM does have memory cells, but I am explaining only the code part so as not to confuse you.
I hope this helps
As I understand it, the timestep count is the length of the sequence handled in one pass (the window size). Depending on return_sequences=True/False, the layer returns either one output per timestep or a single output for the whole window.
Concerning the memory cell, the definition "A part of a NN that preserves some state across time steps is called a memory cell." makes me think of a memory cell as a kind of container for the state that is carried from one timestep to the next within the window (and on to the next batch when stateful=True).
It is worth looking once at a diagram of a memory cell and the logic of how it works.
Note also how the whole input shape is used: the time_steps dimension is the length over which backpropagation through time runs.
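The relationship between units and timesteps can be seen in a from-scratch forward pass. The sketch below is a textbook LSTM cell in NumPy with random weights, an illustration under stated assumptions rather than Keras's actual implementation. It applies the same cell at every timestep, so the state size is set by units and not by the sequence length.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, units, seed=0):
    """Run one LSTM layer over x of shape (timesteps, features).

    Returns the final hidden state h and cell state c, each of shape (units,).
    Weights are random; only the shapes matter for this illustration.
    """
    rng = np.random.default_rng(seed)
    features = x.shape[1]
    # One weight set for the four gates (input, forget, candidate, output),
    # shared by every timestep.
    W = rng.normal(size=(features, 4 * units))
    U = rng.normal(size=(units, 4 * units))
    b = np.zeros(4 * units)
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in x:  # the same cell is applied at each step
        i, f, g, o = np.split(x_t @ W + h @ U + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h, c

h4, c4 = lstm_forward(np.ones((4, 1)), units=128)     # 4 timesteps, as in the question
h50, c50 = lstm_forward(np.ones((50, 1)), units=128)  # 50 timesteps, same 128 units
print(h4.shape, h50.shape)
```

The diagrams that show one box per timestep are drawing this loop unrolled: each box is the same cell with the same weights, and 128 is the width of the state vectors h and c.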

How to split the training data and test data for LSTM for time series prediction in Tensorflow

I recently learned about LSTMs for time series prediction from
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data.
def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """

    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        yield (x_batch, y_batch)
He tries to create several batches of samples for training.
My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
My motivation for this question is that, for time series prediction, we want to train on the past and predict the future. Therefore, is it legitimate to shuffle the training samples?
In the tutorial, the author chose a contiguous stretch of samples, such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch so that they are not contiguous? For example, x_batch[0] is picked at 10:00am and x_batch[1] is picked at 9:00am on the same day?
In summary, the two questions are:
(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
(2) When we train an LSTM, do we need to consider the influence of time order? What parameters do we learn for an LSTM?
Thanks
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values at 1-hour intervals. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the ten hourly samples from 2018-03-01 09:00:00 through 2018-03-01 18:00:00 as input. If you shuffle the dataset before generating batches that consist of these sequences, you will train your LSTM to predict from a sequence of random samples drawn from your whole dataset.
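The difference between picking a random start index (which the tutorial's generator does) and shuffling the rows themselves can be sketched with a toy array. The array hours is a hypothetical stand-in for chronologically ordered records.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(100)            # stand-in for chronologically ordered rows
sequence_length = 10

def is_contiguous(window):
    """True if every element follows the previous one by exactly one step."""
    return bool(np.all(np.diff(window) == 1))

# Random start index, contiguous slice: order inside the window is preserved.
idx = rng.integers(len(hours) - sequence_length)
window_ok = hours[idx : idx + sequence_length]

# Shuffling the rows first: the same slice now holds unrelated time points.
shuffled = rng.permutation(hours)
window_bad = shuffled[idx : idx + sequence_length]

print(is_contiguous(window_ok), is_contiguous(window_bad))
```

Randomising where each window starts is fine; randomising the order of the rows inside the windows is what destroys the sequence.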
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in a way that respects the temporal ordering: the model is never trained on data from the future, and it is tested only on data that comes after its training data.
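A minimal sketch of such a split, assuming the rows are already sorted by time. The function name chronological_split and the 80/20 ratio are illustrative choices, not taken from the linked article.

```python
import numpy as np

def chronological_split(data, train_fraction=0.8):
    """Split time-ordered rows so every test row comes after every training row."""
    split = int(len(data) * train_fraction)
    return data[:split], data[split:]

# Hypothetical toy data: 100 time-ordered observations.
timestamps = np.arange(100)
train, test = chronological_split(timestamps)
print(len(train), len(test), train.max() < test.min())
```

No shuffling happens before the cut, so nothing from the test period can leak into training.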
It depends a lot on the dataset. For example, the weather on a random day in the dataset is highly related to the weather on the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that carries its state from the previous records over to the next ones) and train in order.
However, if your records (or a transformation of them) are independent of each other but depend on some notion of time, such as the inter-arrival time of the items in a record or in a subset of these records, there should be noticeable differences when using shuffling. In some cases it will improve the robustness of the model; in other cases the model will not generalize. Noticing these differences is part of the evaluation of the model.
In the end, the question is: is the "time series" really a time series as it stands (i.e., records really depend on their neighbors), or is there some transformation that can break this dependency but preserve the structure of the problem? And for this question there is only one way to get to the answer: explore the dataset.
As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field; however, according to him, he learned it through a lot of experimentation and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget; improve on the best ones; try again.

adding and accessing auxiliary tf.Dataset attributes with Keras

I use a tf.py_func call to parse data (features, labels and sample_weights) from file to a tf.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
    # Pass the per-element sample_weight (the lambda argument) to the parser.
    lambda filename, label, sample_weight: tuple(tf.py_func(
        self._my_parse_function, [filename, label, sample_weight], [tf.float32, label.dtype, tf.float32])))
The data is variable-length 1-D sequences, so I also pad the sequences to a fixed length in my_parse_function.
I use tensorflow.python.keras.models.Sequential.fit(...) to train the data (which now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to predict outputs.
Once I have predictions I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. Also, I'd like to know for sure which file the data came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-workers are involved (I hope to try the multi- scenarios). Even if order was 'guaranteed' this is a decent sanity check.
This information, filename (i.e, a string) and sequence length (i.e, an integer), is not currently conveniently accessible, so I'd like to add these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.
What is the best approach to do this?
Thanks
As a workaround, I store this auxiliary information in a 'global' dictionary in my_parse_fn, so it stores (and re-stores) on every iteration through the tf.Dataset. This is ok for now since there are only about 1000 examples in the training set, so storing 1000 strings and integers is not a problem. But if this auxiliary information were larger or the training set were larger, this approach would not be very scalable. In my case, the input data for each training example is significantly large, about 50MB in size, which is why reading a tf.Dataset from file (i.e., on every epoch) is important.
I still think that it would be helpful to be able to extend a tf.Dataset with this information more conveniently. I also noticed that when I add a field such as dataset.tag to a tf.Dataset to identify, say, the dataset.tag = 'training', dataset.tag = 'validation' or dataset.tag = 'test' sets, the field does not survive the iterations of training.
So again in this case I'm wondering how a tf.Dataset can be extended.
On the other question, it looks like the order of tf.Dataset elements is respected through iterations, so predictions from, say, tensorflow.python.keras.models.Sequential.predict(...) are ordered as the file ids were presented to my_parse_fn (at least batching respects this ordering, but I still don't know whether a multi-GPU scenario would as well).
Thanks for any insights.
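For the post-processing step itself, here is a minimal NumPy sketch. It assumes the predictions come back in the same order as the inputs (as observed above for batching) and that the filenames and true sequence lengths are kept in parallel lists. The function attach_metadata and the file names are hypothetical, and this does not extend tf.Dataset itself.

```python
import numpy as np

def attach_metadata(predictions, filenames, lengths):
    """Pair each padded prediction with its source file and cut off the padding.

    Assumes predictions[i] corresponds to filenames[i] and lengths[i].
    """
    if not (len(predictions) == len(filenames) == len(lengths)):
        raise ValueError("predictions and metadata must have the same length")
    return {
        name: pred[:length]
        for name, pred, length in zip(filenames, predictions, lengths)
    }

# Hypothetical padded outputs for two files with true lengths 3 and 2.
padded = np.array([[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]])
result = attach_metadata(padded, ["a.dat", "b.dat"], [3, 2])
print(result)
```

The length check is a cheap sanity test; it does not prove the ordering is right, only that nothing was dropped or duplicated.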

Batch size for panel data for LSTM in Keras

I have repeated measurements on subjects, which I have structured as input to an LSTM model in Keras as follows:
batch_size = 1
model = Sequential()
model.add(LSTM(50, batch_input_shape=(batch_size, time_steps, features), return_sequences=True))
Where time_steps are the number of measurements on each subject, and features the number of available features on each measurement. Each row of the data is one subject.
My question is regarding the batch size with this type of data.
Should I only use a batch size of 1, or can a batch contain more than one subject?
Related to that, would I benefit from setting stateful to True? Meaning that learning from one batch would inform the other batches too. Correct me if my understanding about this is not right too.
Great question! Using a batch size greater than 1 is possible with this sort of data and setup, provided that your rows are individual experiments on subjects and that your observations for each subject are ordered sequentially through time (e.g. Monday comes before Tuesday). Make sure that your observations are not split randomly between train and test and that they are ordered sequentially by subject in each, and you can apply batch processing. Because of this, set shuffle=False if using Keras, since Keras shuffles the training samples by default.
In regards to setting stateful to True: with a stateful model, the final states of one batch are used as the initial states of the next batch. This means that the state of the sample located at index i, X_i, will be used in the computation of the sample X_{i+bs} in the next batch, where bs is the batch size. In the case of time series, this generally makes sense. If you believe that a subject measurement S_i influences the state of the next subject measurement S_{i+1}, then try setting stateful to True; note that this only lines up when row i+bs really is the continuation of row i. It may be worth trying stateful=False as well, to better understand whether a previous observation in time influences the following observation for a particular subject.
Hope this helps!
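The index relationship described above can be spelled out with a few lines of plain Python. The helper state_flow is a hypothetical name, and it assumes rows are fed in order, without shuffling, in batches of equal size.

```python
def state_flow(n_samples, batch_size):
    """List (source, destination) row indices for the state handed between
    consecutive batches when rows are fed in order without shuffling."""
    return [(i, i + batch_size) for i in range(n_samples - batch_size)]

# Six rows in batches of two: row 0 feeds row 2, row 1 feeds row 3, and so on.
pairs = state_flow(n_samples=6, batch_size=2)
print(pairs)
```

If each row is a different subject, every pair here links two different subjects, which is worth checking against the data layout before enabling stateful=True.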
