In PyTorch, the DataLoader's cursor is used to iterate over the data during training. The cursor keeps track of the current position within the dataset and is used to retrieve the next batch. When training across multiple epochs, the cursor should reset to the beginning of the dataset after each epoch, so that the model sees the entire dataset multiple times during training, which can help improve its performance.
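As a minimal sketch of this behaviour (using a toy TensorDataset; the names and sizes are only illustrative): each new for loop over a DataLoader builds a fresh iterator, so the cursor starts again from the beginning of the dataset every epoch, and with shuffle=True a new random order is drawn each time.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 8 samples with a single feature each.
data = torch.arange(8, dtype=torch.float32).unsqueeze(1)
dataset = TensorDataset(data)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(3):
    # Each `for` over the loader creates a new iterator, so the
    # "cursor" restarts at the beginning of the dataset; with
    # shuffle=True a fresh random permutation is drawn per epoch.
    for (batch,) in loader:
        print(epoch, batch.flatten().tolist())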
How does the PyTorch DataLoader reset the data cursor across epochs? Does it guarantee that iteration restarts from the beginning of the dataset?
I am trying to get consistent results within training, so I am setting random seeds for all the possible sources of randomness in my scripts:
import random
import numpy as np
import tensorflow

tensorflow.random.set_seed(0)
random.seed(0)
np.random.seed(0)
My doubt is about how the model training is going to be affected, as I am using a generator that is supposed to randomly shuffle the samples within the batches.
Are the batches going to be the same at epoch N across different training runs? In other words, the batches would still be randomized across epochs, but in a different run of the training the batches would be composed in exactly the same way at epoch N.
If that is the case, the gradient values should be the same, and so should the model weights at epoch N.
Is there anything I should pay attention to that could actually harm the robustness of the trained model, due to having fixed the seed and thus perhaps losing some of the benefit of complete randomization?
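As a concrete illustration of that reasoning, here is a minimal sketch (a hypothetical toy shuffler driven by NumPy's global RNG, not my real generator) of how one can check whether two seeded runs produce identical batch orders at every epoch:

import numpy as np

def shuffled_epoch_order(seed, n_samples=10, n_epochs=3):
    # Re-seed at the start of the run, as in the script above.
    np.random.seed(seed)
    orders = []
    for _ in range(n_epochs):
        idx = np.random.permutation(n_samples)
        orders.append(idx.tolist())
    return orders

run_a = shuffled_epoch_order(seed=0)
run_b = shuffled_epoch_order(seed=0)
# With the same seed, every epoch's shuffle order matches across runs,
# so batches (and hence gradients) would be built from the same samples.
assert run_a == run_b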
Suppose I initially have 4000 data points and I trained a model using data points from index 0 to 3999, then saved the model. Now, after 5 days, I loaded the saved model and used data from index 5 to 4004 to retrain it. Does the training start from the last learned weights, or will the weights be reinitialized as happens in a fresh model? The data is time-series data.
If the layers aren't frozen, the weights will update every time you start training, irrespective of the data points you choose to retrain on. Hence, the information in the weights of the previously saved model will be lost.
But if you choose to freeze some layers (and leave the others as they are), the frozen layers will retain the weights that were acquired from the previous training (data points 0 to 3999), and all other layers will update their weights according to the new data (data points 5 to 4004).
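As a minimal sketch of the difference (assuming a model saved at "model.h5" and new arrays X_new, y_new; all names and values here are illustrative), loading a saved Keras model and calling fit continues from the last learned weights, while setting layer.trainable = False freezes selected layers so their previously learned weights are kept:

from tensorflow.keras.models import load_model

# Continuing training starts from the last learned weights;
# it does NOT reinitialize them as in a fresh model.
model = load_model("model.h5")          # illustrative path

# Optionally freeze the earlier layers so they keep the weights
# learned on data points 0..3999.
for layer in model.layers[:-1]:
    layer.trainable = False

# Recompile after changing `trainable`, then train on the new window.
model.compile(optimizer="adam", loss="mse")
model.fit(X_new, y_new, epochs=10, batch_size=64)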
I am a bit confused about how Keras fits models. In general, Keras models are fitted simply by calling model.fit(...), something like the following:
model.fit(X_train, y_train, epochs=300, batch_size=64, validation_data=(X_test, y_test))
My question is: because I supplied the test data via the argument validation_data=(X_test, y_test), does that mean that each epoch is independent? In other words, I understand that at each epoch Keras trains the model using the training data (after it gets shuffled) and then evaluates the trained model on the provided validation_data. If that's the case, then no matter how many epochs I choose, I only take the results of the last epoch!
If this scenario is correct, why do we need multiple epochs? Unless these epochs are somehow dependent, with each epoch starting from the NN weights of the previous epoch, correct?
Thank you
When Keras fits your model, it passes through the whole dataset at each epoch, in steps corresponding to your batch_size.
For example, if you have a dataset of 1000 items and a batch_size of 8, the weights of your model will be updated using 8 items at a time, and this continues until it has seen your whole dataset.
At the end of that epoch, the model makes predictions on your validation set.
If we ran only one epoch, it would mean that the weights were updated only once per element (because the model "saw" the complete dataset only one time).
But in order to minimize the loss function via backpropagation, we need to update those weights multiple times to reach the optimal loss, so we pass through the whole dataset multiple times; in other words, multiple epochs.
I hope I'm clear; ask if you need more information.
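To make the point that epochs are not independent, here is a minimal sketch (a toy model on random data; all names and sizes are only illustrative): the weights produced by one epoch are the starting point of the next, which is also why several calls to fit with epochs=1 behave like one longer training run.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(256, 4)
y = np.random.rand(256, 1)

model = Sequential([Dense(8, activation="relu", input_shape=(4,)),
                    Dense(1)])
model.compile(optimizer="adam", loss="mse")

w_before = model.get_weights()
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
w_after_first = model.get_weights()

# The second call starts from w_after_first, not from w_before:
model.fit(X, y, epochs=1, batch_size=32, verbose=0)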
I recently learned about LSTMs for time series prediction from
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data.
def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """
    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        yield (x_batch, y_batch)
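For context, an infinite generator like this is typically consumed with a fixed number of steps per epoch, roughly as in the sketch below (assuming a compiled Keras model; the batch_size, sequence_length, and steps_per_epoch values are only illustrative, not necessarily the tutorial's):

generator = batch_generator(batch_size=256, sequence_length=1344)

# Keras draws steps_per_epoch batches from the generator per epoch;
# the generator itself never terminates.
model.fit(generator, epochs=20, steps_per_epoch=100)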
He tries to create several batch samples for training.
My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then begin sampling batches using the batch_generator above?
My motivation for this question is that, for time series prediction, we want to train on the past and predict the future. Therefore, is it legitimate to shuffle the training samples?
In the tutorial, the author picks a contiguous slice of samples, such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch samples that are not contiguous? For example, x_batch[0] is picked at 10:00 am and x_batch[1] is picked at 9:00 am on the same day?
In summary, the two questions are:
(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then begin sampling batches using the batch_generator above?
(2) When we train an LSTM, do we need to consider the influence of time order? What parameters do we learn in an LSTM?
Thanks
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values in a 1-hour interval. Your LSTM takes in a sequence of samples that is chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating batches that consist of these sequences, you will train your LSTM on predicting based on a sequence of random samples from your whole dataset.
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in a way that respects the temporal ordering: the model is never trained on data from the future and is only tested on data from the future.
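A minimal sketch of such a split (using a hypothetical time-ordered array values; the 80/20 ratio is only illustrative): the data is cut at a point in time rather than shuffled, so the test portion always lies in the future relative to the training portion.

import numpy as np

values = np.arange(1000)        # stand-in for a series ordered in time
split = int(len(values) * 0.8)  # cut point in time, no shuffling

train = values[:split]          # the past: used for training
test = values[split:]           # the future: used only for testing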
It depends a lot on the dataset. For example, the weather on a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that carries the state from the previous records over to the next ones) and train in order.
However, if your records (or a transformation of them) are independent from each other, but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there should be noticeable differences when using shuffling. In some cases, it will improve the robustness of the model; in other cases, it will not generalize. Noticing these differences is part of the evaluation of the model.
In the end, the question is: is the "time series" as it stands really a time series (i.e., do the records really depend on their neighbours), or is there some transformation that can break this dependency but preserve the structure of the problem? And for this question, there is only one way to get to the answer: explore the dataset.
As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field; however, according to him, he learned it through a lot of experimentation and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget, improve on the best ones, and try again.
I'm currently working on a Keras tutorial for recurrent network training and I'm having trouble understanding the Stateful LSTM concept. To keep things as simple as possible, the sequences have the same length seq_length. As far as I get it, the input data is of shape (n_samples, seq_length, n_features) and we then train our LSTM on n_samples/M batches of size M as follows:
For each batch:
Feed in the 2D-tensors (seq_length, n_features) and for each input 2D-tensor compute the gradient
Sum these gradients to get the total gradient on the batch
Backpropagate the gradient and update weights
In the tutorial's example, feeding in the 2D-tensors is feeding in a sequence of size seq_length of letters encoded as vectors of length n_features. However, the tutorial says that in the Keras implementation of LSTMs, the hidden state is not reset after a whole sequence (2D-tensor) is fed in, but after a batch of sequences is fed in to use more context.
Why does keeping the hidden state of the previous sequence and using it as the initial hidden state for the current sequence improve the learning and the predictions on our test set, since that "previously learned" initial hidden state won't be available when making predictions? Moreover, Keras' default behaviour is to shuffle the input samples at the beginning of each epoch, so the batch context changes at each epoch. This behaviour seems contradictory to keeping the hidden state through a batch, since the batch context is random.
LSTMs in Keras aren't stateful by default - each sequence starts with newly-reset states. By setting stateful=True in your recurrent layer, successive inputs in a batch don't reset the network state. This assumes that the sequences are actually successive, and it means that in a (very informal) sense, you're training on sequences of length batch_size * seq_length.
Why does keeping the hidden state of the previous sequence and using it as initial hidden state for our current sequence improve the learning and the predictions on our test set, since that "previously learned" initial hidden state won't be available when making predictions?
In theory, it improves learning because a longer context can teach the network things about the distribution that are still relevant when testing on the individually shorter sequences. If the network is learning some probability distribution, that distribution should hold over different sequence lengths.
Moreover, Keras's default behaviour is to shuffle input samples at the beginning of each epoch so the batch context is changed at each epoch. This behaviour seems contradictory to keeping the hidden state through a batch since batch context is random.
I haven't checked, but I assume that when stateful=True, only batches are shuffled - not the sequences within them.
In general, when we give the network some initial state, we don't mean for that to be a universally better starting point. It just means that the network can take the information from previous sequences into account when training.
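As a minimal sketch of that setup (assuming arrays X_train and y_train whose number of samples is a multiple of batch_size; all sizes here are illustrative): stateful=True requires a fixed batch size via batch_input_shape, shuffling is disabled so that successive batches really are successive in time, and the carried state is cleared explicitly at epoch boundaries.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

batch_size, seq_length, n_features = 32, 20, 4

model = Sequential([
    LSTM(64, stateful=True,
         batch_input_shape=(batch_size, seq_length, n_features)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(10):
    # shuffle=False keeps batches in chronological order, so the carried
    # state refers to the sequences that actually came just before.
    model.fit(X_train, y_train, epochs=1, batch_size=batch_size,
              shuffle=False, verbose=0)
    # Clear the carried state at each epoch boundary.
    model.reset_states()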