Understanding Keras batch_size vs. batch dimension for LSTM - keras

I'm having some difficulty grasping the input_shape for an LSTM layer in Keras. Assume that is the first layer in the network; it takes input of the form (batch, time, features). Also assume there is only one feature, so the input is of the form (batch, time, 1).
Is the number "batch" the batch size or the number of batches? I assume it's the batch size from the examples I've seen online. Then I'm struggling to see how the number of batches isn't always one.
As a concrete example, I have a time series of 1000 steps, which I split to 10 series of 100 steps. One epoch is when the network goes through all 1000 steps, the 10 series. I should be free to split the 10 series into different batches with different batch sizes, but then the input would be of the form (number of batches, batch size, time steps, 1). What am I misunderstanding?

Related

In the original T5 paper, what does 'step' mean?

I have been reading the original T5 paper 'Exploring the limits of transfer learning with a unified text-to-text transformer.' On page 11, it says "We pre-train each model for 2^19=524,288 steps on C4 before fine-tuning."
I am not sure what the 'steps' mean. Is it the same as epochs? Or the number of iterations per epoch?
I guess 'steps'='iterations' in a single epoch.
A step is a single training iteration. In a step, the model is given a single batch of training instances. So if the batch size is 128, then the model is exposed to 128 instances in a single step.
Epochs aren't the same as steps. An epoch is a single pass over an entire training set. So if the training data contains for example 128,000 instances & the batch size is 128, an epoch amounts to 1,000 steps (128 × 1,000 = 128,000).
The relationship between epochs & steps is related to the size of the training data (see this question for a more detailed comparison). If the data size is changed, then the effective number of steps in an epoch changes as well, (keeping the batch size fixed). So a dataset of 1,280,000 instances would take more steps in an epoch, & vice-versa for a dataset of 12,800 instances.
For this reason, steps are typically reported, especially when it comes to pre-training models on large corpora, because there can be a direct comparison in terms of steps & batch size, which isn't possible (or relatively harder to do) with epochs. So, if someone else wants to compare using an entirely different dataset with a different size, the model would "see" the same number of training instances, if the number of steps & batch size are the same, ensuring that a model isn't unfairly favoured due to training on more instances.

How to split the training data and test data for LSTM for time series prediction in Tensorflow

I recently learn the LSTM for time series prediction from
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data.
def batch_generator(batch_size, sequence_length):
"""
Generator function for creating random batches of training-data.
"""
# Infinite loop.
while True:
# Allocate a new array for the batch of input-signals.
x_shape = (batch_size, sequence_length, num_x_signals)
x_batch = np.zeros(shape=x_shape, dtype=np.float16)
# Allocate a new array for the batch of output-signals.
y_shape = (batch_size, sequence_length, num_y_signals)
y_batch = np.zeros(shape=y_shape, dtype=np.float16)
# Fill the batch with random sequences of data.
for i in range(batch_size):
# Get a random start-index.
# This points somewhere into the training-data.
idx = np.random.randint(num_train - sequence_length)
# Copy the sequences of data starting at this index.
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
yield (x_batch, y_batch)
He try to create several bacth samples for training.
My question is that, can we first randomly shuttle the x_train_scaled and y_train_scaled, and then begin sampling several batch size using the follow batch_generator?
my motivation for this question is that, for time series prediction, we want to training the past and predict for the furture. Therefore, is it legal to shuttle the training samples?
In the tutorial, the author chose a piece of continuous samples such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch not continous. For example, the x_batch[0] is picked at 10:00am and x_batch[1] is picked at 9:00am at the same day?
In summary: The follow two question are
(1) can we first randomly shuttle the x_train_scaled and y_train_scaled, and then begin sampling several batch size using the follow batch_generator?
(2) when we train LSTM, Do we need to consider the influence of time order? what parameters we learn for LSTM.
Thanks
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values in a 1-hour interval. Your LSTM takes in a sequence of samples that is chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating batches that consist of these sequences, you will train your LSTM on predicting based on a sequence of random samples from your whole dataset.
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in such a way as to respect the temporal ordering and the model is never trained on data from the future and only tested on data from the future.
It depends a lot on the dataset. For example, the weather from a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a statefull LSTM (ie, a LSTM that uses the previous records as input to the next one) and train in order.
However, if your records (or a transformation of them) are independent from each other, but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there should be noticeable differences when using shuffling. In some cases, it will improve the robustness of the model; in other cases, it will not generalize. Noticing these differences is part of the evaluation of the model.
In the end, the question is: the "time series" as it is is really a time series (ie, records really depend on their neighbor) or there is some transformation that can break this dependency, but preserv the structure of the problem? And, for this question, there is only one way to get to the answer: explore the dataset.
About authoritative references, I will have to let you down. I learn this from a seasoned researcher in the field, however, according to him, he learn it through a lot of experimentation and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fits your budget; improve on the best ones; try again.

Pit in LSTM programming by python

As we all Know, if we want to train a LSTM network, we must reshape the train dataset by the function numpy.reshape(), and reshaping result is like [samples,time_steps,features]. However, the new shape is influenced by the original one. I have seen some blogs teaching LSTM programming taking 1 as time_steps, and if time_steps is another number, samples will change relevently. My question is that does the samplesequal to batch_size?
X = X.reshape(X.shape[0], 1, X.shape[1])
No, samples is different from batch_size. samples is the total number of samples you would have. batch_size would be the size of each batch or the number of samples per each batch used in training, like by .fit.
For example, if samples=128 and batch_size=16, then your data would be divided into 8 batches with each having 16 samples inside during .fit call.
As another note, time_steps is the total time steps or observations within each sample. It does not make much sense to have it as 1 with LSTM as the main advantage of RNN's in general is to learn the temporal patterns. With time_step=1, there won't be any history to leverage. Here as an example that might help:
Assume that your job is to determine if someone is active or not every hour by looking at they breathing rate and heart rate provided every minute, i.e. 2 features measured at 60 samples per hour. (This is just an example, use accelerometers if you really wanted to do this :)) Let's say you have 128 hours of labeled data. Then your input data would be of shape (128, 60, 2) and your output would be of shape (128, 1).
Here, you have 128 samples, 60 time steps or observations per sample, and two features.
Next you split the data into train, validation, and testing according to the samples. For example, your train, validation, and test data would be of shapes (96, 60, 2), (16, 60, 2), and (16, 60, 2), respectively.
If you use batch_size=16, your training, validation, and test data would have 6, 1, and 1 batches, respectively.
No. Samples are not equal to batch size. Samples means the number of rows in your data-set. Your training data-set is divided into number of batches and pass it to the network to train.
In simple words,
Imagine your data-set has 30 samples, and you define your batch_size as 3.
That means the 30 samples divided into 10 batches(30 divided by you defined batch_size = 10). When you train you model, at a time only 3 rows of data will be be pushed to the neural network and then next 3 rows will be push to the neural network. Like wise whole data-set will push to the neural network.
Samples/Batch_size = Number of batches
Remember that batch_size and number of batches are two different things.

Does the input size to CNN change depending on the batch size?

Usually, the input to a convolutional neural network (CNN) is described by an image with given width*height*channels. Will the number of input nodes be different for different numbers of a batch size, as well? Meaning, will the number of input nodes become batch_size*width*height*channels?
From this excellent answer batch size defines number of samples that going to be propagated through the network. Batch size does not have an effect on the network's architecture, including quantity of inputs.
Let's say you have 1,000 RGB images with size 32x32, and you set the batch size to be 100. Your convolutional network should have an input shape of 32x32x3. To train the network the training algorithm picks a sample of 100 images out of the total 1,000, and trains the network on each individual image in that subset. That is your 'batch'. The network architecture doesn't care that your subset (batch) has 100, 200 or 1,000 images, it only cares about the shape of a single image, because that's all it sees at a time. When the network has been trained on all 100 images then it has completed one epoch, and the network parameters are updated.
The training algorithm will pick a different batch of images for each epoch, but the above always holds true: the network only sees a single image at a time, so the image shape must match the input layer shape, with no regard for how many images in that particular batch.
As to why we have batches instead of just training on the entire set (ie setting batch size to 100% and training for one epoch), with GPUs it makes training much faster, and also means the parameters are updated less often while training, giving a smoother and more reliable convergence.

Batch size for panel data for LSTM in Keras

I have repeated measurements on subjects, which I have structured as input to an LSTM model in Keras as follows:
batch_size = 1
model = Sequential()
model.add(LSTM(50, batch_input_shape=(batch_size, time_steps, features), return_sequences=True))
Where time_steps are the number of measurements on each subject, and features the number of available features on each measurement. Each row of the data is one subject.
My question is regarding the batch size with this type of data.
Should I only use a batch size of 1, or can the batch size be more than 1 subjects?
Related to that, would I benefit from setting stateful to True? Meaning that learning from one batch would inform the other batches too. Correct me if my understanding about this is not right too.
Great question! Using a batch size greater than 1 is possible with this sort of data and setup, provided that your rows are individual experiments on subjects and that your observations for each subject are ordered sequentially through time (e.g. Monday comes before Tuesday). Make sure that your observations between train and test are not split randomly and that your observations are ordered sequentially by subject in each, and you can apply batch processing. Because of this, set shuffle to false if using Keras as Keras shuffles observations in batches by default.
In regards to setting stateful to true: with a stateful model, all the states are propagated to the next batch. This means that the state of the sample located at index i, Xi will be used in the computation of the sample Xi+bs in the next batch. In the case of time series, this generally makes sense. If you believe that a subject measurement Si infleunces the state of the next subject measurement Si+1, then try setting stateful to true. It may be worth exploring setting stateful to false as well to explore and better understand if a previous observation in time infleunces the following observation for a particular subject.
Hope this helps!

Resources