As we all know, to train an LSTM network we must reshape the training dataset with numpy.reshape() so that the result has the form [samples, time_steps, features]. However, the new shape depends on the original one. I have seen some blogs teaching LSTM programming that take 1 as time_steps, and if time_steps is another number, samples changes accordingly. My question is: is samples equal to batch_size?
X = X.reshape(X.shape[0], 1, X.shape[1])
No, samples is different from batch_size. samples is the total number of samples you have. batch_size is the size of each batch, i.e. the number of samples per batch used in training, for example by .fit.
For example, if samples=128 and batch_size=16, then your data would be divided into 8 batches of 16 samples each during the .fit call.
As another note, time_steps is the total number of time steps or observations within each sample. It does not make much sense to set it to 1 with an LSTM, as the main advantage of RNNs in general is to learn temporal patterns. With time_steps=1, there won't be any history to leverage. Here is an example that might help:
Assume that your job is to determine whether someone is active or not every hour by looking at their breathing rate and heart rate provided every minute, i.e. 2 features, each measured 60 times per hour. (This is just an example, use accelerometers if you really wanted to do this :)) Let's say you have 128 hours of labeled data. Then your input data would be of shape (128, 60, 2) and your output would be of shape (128, 1).
Here, you have 128 samples, 60 time steps or observations per sample, and two features.
Next you split the data into train, validation, and testing according to the samples. For example, your train, validation, and test data would be of shapes (96, 60, 2), (16, 60, 2), and (16, 60, 2), respectively.
If you use batch_size=16, your training, validation, and test data would have 6, 1, and 1 batches, respectively.
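The shapes in this example can be checked with a short NumPy sketch. The random arrays are stand-ins for real data, and the helper name num_batches is made up for illustration; the split sizes are the ones given above.

```python
import math
import numpy as np

# Stand-in data: 128 hours, 60 one-minute observations per hour, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 60, 2))
y = rng.integers(0, 2, size=(128, 1))  # one active/inactive label per hour

# Split along the samples axis (axis 0) only; time steps and features are untouched.
X_train, X_val, X_test = X[:96], X[96:112], X[112:]
y_train, y_val, y_test = y[:96], y[96:112], y[112:]

def num_batches(num_samples, batch_size):
    """Batches needed to cover num_samples (the last one may be smaller)."""
    return math.ceil(num_samples / batch_size)

batch_size = 16
for name, part in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(name, part.shape, num_batches(part.shape[0], batch_size))
```

Note that batch_size never appears in the array shapes; it only decides how many slices of the samples axis are fed to the network per pass.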
No. Samples are not equal to batch size. Samples means the number of rows in your data-set. Your training data-set is divided into a number of batches, which are passed to the network one at a time for training.
In simple words:
Imagine your data-set has 30 samples, and you define your batch_size as 3.
That means the 30 samples are divided into 10 batches (30 divided by your batch_size of 3 = 10). When you train your model, only 3 rows of data are pushed to the neural network at a time, then the next 3 rows, and so on until the whole data-set has been pushed through the network.
Samples/Batch_size = Number of batches
Remember that batch_size and number of batches are two different things.
I have a couple of questions.
I have data of the following shape:
(32, 64, 11)
where 32 is the batch size, 64 is the sequence length and 11 is the number of features. Each sample is 64x11 and has a label of 0 or 1.
I’d like to predict when a sequence has a label of “1”.
I’m trying to use a simple architecture with
conv1D → ReLU → flatten → linear → sigmoid.
For the Conv1D, since this is multivariate time series prediction and each row in my data is one second, I think the number of input channels should be the number of features, so that all of the features are processed concurrently. (My data has no spatial structure: it doesn't matter whether a column is at index 0 or 9, unlike pixels in an image.)
I can't decide how to set the Conv1D parameters. Currently I think the number of channels should be the number of features and not 1, for the reason I just explained, but I am unsure.
Secondly, should the loss function be BCELoss or something else? My labels are 0 or 1, and I want the model to output the probability of belonging to the class with label 1.
Many thanks.
I have been reading the original T5 paper 'Exploring the limits of transfer learning with a unified text-to-text transformer.' On page 11, it says "We pre-train each model for 2^19=524,288 steps on C4 before fine-tuning."
I am not sure what 'steps' means. Is it the same as epochs? Or the number of iterations per epoch?
I guess 'steps'='iterations' in a single epoch.
A step is a single training iteration. In a step, the model is given a single batch of training instances. So if the batch size is 128, then the model is exposed to 128 instances in a single step.
Epochs aren't the same as steps. An epoch is a single pass over an entire training set. So if the training data contains for example 128,000 instances & the batch size is 128, an epoch amounts to 1,000 steps (128 × 1,000 = 128,000).
The relationship between epochs & steps depends on the size of the training data (see this question for a more detailed comparison). If the data size changes, then the number of steps in an epoch changes as well (keeping the batch size fixed). So a dataset of 1,280,000 instances would take more steps per epoch, & vice versa for a dataset of 12,800 instances.
For this reason, steps are typically reported, especially when it comes to pre-training models on large corpora, because there can be a direct comparison in terms of steps & batch size, which isn't possible (or relatively harder to do) with epochs. So, if someone else wants to compare using an entirely different dataset with a different size, the model would "see" the same number of training instances, if the number of steps & batch size are the same, ensuring that a model isn't unfairly favoured due to training on more instances.
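The arithmetic relating steps, epochs and batch size can be sketched in a few lines of Python. The function names are made up for illustration, and the batch size of 128 is the example value used above, not a figure taken from the T5 paper.

```python
import math

def steps_per_epoch(num_instances, batch_size):
    """Steps needed for one full pass over the data (last batch may be partial)."""
    return math.ceil(num_instances / batch_size)

def instances_seen(num_steps, batch_size):
    """Training instances the model is exposed to after num_steps steps."""
    return num_steps * batch_size

batch_size = 128
for n in (12_800, 128_000, 1_280_000):
    print(n, "instances ->", steps_per_epoch(n, batch_size), "steps per epoch")

# With steps and batch size fixed, exposure is the same whatever the dataset size.
print(instances_seen(1_000, batch_size))
```

The first function depends on the dataset size while the second does not, which is why step counts are easier to compare across corpora than epoch counts.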
I'm having some difficulty grasping the input_shape for an LSTM layer in Keras. Assume that it is the first layer in the network; it takes input of the form (batch, time, features). Also assume there is only one feature, so the input is of the form (batch, time, 1).
Is the number "batch" the batch size or the number of batches? I assume it's the batch size from the examples I've seen online. Then I'm struggling to see how the number of batches isn't always one.
As a concrete example, I have a time series of 1000 steps, which I split to 10 series of 100 steps. One epoch is when the network goes through all 1000 steps, the 10 series. I should be free to split the 10 series into different batches with different batch sizes, but then the input would be of the form (number of batches, batch size, time steps, 1). What am I misunderstanding?
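One way to look at the concrete example, sketched with NumPy only (no Keras involved): the full array has shape (samples, time, features), and each training step receives a slice of it along the first axis. The batch size of 5 is an arbitrary choice for illustration. Under that reading, the number of batches is how many slices are fed per epoch rather than an extra axis of the input.

```python
import numpy as np

series = np.arange(1000, dtype=np.float32)  # stand-in for the 1000-step series

# 10 sequences of 100 steps with 1 feature: (samples, time, features).
X = series.reshape(10, 100, 1)

batch_size = 5  # illustrative choice
batches = [X[i:i + batch_size] for i in range(0, len(X), batch_size)]

print(len(batches))      # number of batches per epoch
print(batches[0].shape)  # shape of what the first layer receives in one step
```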
I am new to probabilistic programming and ML. I am following the code for the deep Markov model given on Pyro's website. The link to the GitHub page for that code is:
I understand most of the code. The part I don't understand is the mini-batch idea they use from line 175 onwards.
Question 1:
Could someone explain what they are doing there when they use mini-batches?
In the Pyro documentation they say
"mini_batch is a three dimensional tensor, with the first dimension being the batch dimension, the second dimension being the temporal dimension, and the final dimension being the features (88-dimensional in our case)"
Question 2:
What does temporal dimension mean here?
I ask because I want to use this code on my own dataset, which is sequential data. I have one-hot encoded my data so that its shape is (10000, 500, 20), where 10000 is the number of examples/sequences, 500 is the length of each of these sequences and 20 is the number of features.
Question 3:
How can I use my one hot encoded data as mini batch here?
I'm sorry if this is a really basic question, but any insights will be appreciated.
Link to that documentation is:
Question 1: Could someone explain what are they doing there when they are using mini-batch?
To optimize most deep learning models, we use mini-batch gradient descent. Here, a mini-batch refers to a small number of examples. Let's say we have 10,000 training examples and we want to create mini-batches of 50 examples. Then there will be 200 mini-batches in total, and we will perform 200 parameter updates during one pass over the entire dataset.
Question 2: What does the temporal dimension mean here?
In your data: (10000, 500, 20), the second dimension refers to the temporal dimension. You can consider you have examples with 500 timesteps (t1, t2, ..., t500).
Question 3: How can I use my one-hot encoded data as mini-batch here?
In your scenario, you can split your data (10000, 500, 20) into 200 small batches of size (50, 500, 20) where 50 is the number of examples/Sequences in the mini-batch, 500 is the length of each of these sequences and 20 is the number of features.
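As a rough sketch of that split, here is a generic NumPy mini-batch iterator. It is not the batching code used by the Pyro example; the function name iter_minibatches is made up, and the array is a scaled-down stand-in for the (10000, 500, 20) data to keep memory use small.

```python
import numpy as np

def iter_minibatches(data, batch_size, rng=None):
    """Yield mini-batches along axis 0; shuffle example order if rng is given."""
    order = np.arange(data.shape[0])
    if rng is not None:
        rng.shuffle(order)  # reorders whole sequences, not the time steps inside them
    for start in range(0, len(order), batch_size):
        yield data[order[start:start + batch_size]]

# Scaled-down stand-in for the one-hot encoded data: (examples, time, features).
data = np.zeros((1000, 50, 20), dtype=np.uint8)
batches = list(iter_minibatches(data, batch_size=50, rng=np.random.default_rng(0)))
print(len(batches), batches[0].shape)

# For the full-size data, the number of mini-batches per pass would be:
print(10000 // 50)
```

Only the first axis is cut up; the temporal and feature axes pass through unchanged, so every mini-batch keeps complete sequences.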
How do we decide the mini-batch size? Basically, we can tune the batch size just like any other hyperparameter of our model.
I recently learned about LSTMs for time series prediction from
In his tutorial, he says: "Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data."
import numpy as np

def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """
    # Relies on names defined earlier in the tutorial: num_x_signals,
    # num_y_signals, num_train, x_train_scaled and y_train_scaled.

    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        yield (x_batch, y_batch)
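To see what such a generator yields, here is a self-contained sketch with synthetic data. The function make_batch_generator, the array sizes and the use of float32 are assumptions for illustration and are not part of the tutorial; only the sampling idea (a random start index followed by a contiguous slice) is taken from the function above.

```python
import numpy as np

def make_batch_generator(x, y, batch_size, sequence_length, seed=0):
    """Yield random batches of contiguous sub-sequences taken from x and y."""
    rng = np.random.default_rng(seed)
    num_train = len(x)
    while True:
        # One random start index per sequence in the batch.
        starts = rng.integers(0, num_train - sequence_length, size=batch_size)
        x_batch = np.stack([x[s:s + sequence_length] for s in starts])
        y_batch = np.stack([y[s:s + sequence_length] for s in starts])
        yield x_batch, y_batch

# Synthetic stand-in: 1000 hourly observations, 3 input signals, 1 output signal.
t = np.arange(1000, dtype=np.float32)
x_train = np.stack([t, np.sin(t), np.cos(t)], axis=1)  # column 0 is the time index
y_train = t[:, None] + 1.0

gen = make_batch_generator(x_train, y_train, batch_size=4, sequence_length=24)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)

# Each sub-sequence is contiguous in time: the time index rises by 1 per step.
print(np.all(np.diff(x_batch[:, :, 0], axis=1) == 1))
```

The start indices are random, so the sequences in a batch come from unrelated parts of the series, but the order inside each sequence is preserved.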
He tries to create several batches of samples for training.
My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
My motivation for this question is that, for time series prediction, we want to train on the past and predict the future. Therefore, is it legitimate to shuffle the training samples?
In the tutorial, the author chose a contiguous run of samples, such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch so that they are not contiguous? For example, x_batch[0] is picked at 10:00am and x_batch[1] is picked at 9:00am on the same day?
In summary, the two questions are:
(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
(2) When we train an LSTM, do we need to consider the influence of time order? What parameters do we learn for an LSTM?
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values at 1-hour intervals. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the 10 hourly values from 2018-03-01 09:00:00 to 2018-03-01 18:00:00 as input. If you shuffle the dataset before generating batches that consist of these sequences, you will train your LSTM to predict from a sequence of random samples drawn from your whole dataset.
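The difference between shuffling the raw rows and shuffling whole sequences can be sketched as follows. The toy hourly index and the helper make_windows are illustrative assumptions. Shuffling the order of complete windows keeps each window chronologically intact, whereas shuffling the rows first does not.

```python
import numpy as np

def make_windows(values, sequence_length):
    """Stack all contiguous windows of length sequence_length (stride 1)."""
    n = len(values) - sequence_length + 1
    return np.stack([values[i:i + sequence_length] for i in range(n)])

rng = np.random.default_rng(0)
hours = np.arange(100)  # toy series: each value is its own time index

# Option A: build windows first, then shuffle the order of the windows.
windows = make_windows(hours, sequence_length=10)
rng.shuffle(windows)  # permutes axis 0 only, rows inside a window stay in order
ordered_a = bool(np.all(np.diff(windows, axis=1) == 1))

# Option B: shuffle the rows first, then build windows.
shuffled = rng.permutation(hours)
broken = make_windows(shuffled, sequence_length=10)
ordered_b = bool(np.all(np.diff(broken, axis=1) == 1))

print(ordered_a, ordered_b)
```

Option A is effectively what picking random start indices does; option B is the case described above, where each input sequence becomes a set of unrelated points in time.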
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in Python here:
The train/test data must be split in such a way as to respect the temporal ordering and the model is never trained on data from the future and only tested on data from the future.
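A chronological split of that kind can be sketched as below. The function name, the 80/20 ratio and the toy integer timestamps are assumptions for illustration.

```python
import numpy as np

def chronological_split(x, y, train_fraction=0.8):
    """Split time-ordered arrays so that all training rows precede all test rows."""
    split = int(len(x) * train_fraction)
    return (x[:split], y[:split]), (x[split:], y[split:])

# Toy hourly series: integer hour indices with one value per hour.
timestamps = np.arange(100)
values = np.sin(timestamps / 10.0)
(train_t, train_v), (test_t, test_v) = chronological_split(timestamps, values)

print(len(train_t), len(test_t))
print(train_t.max() < test_t.min())  # no training timestamp lies in the test period
```

A random train/test split would instead mix future rows into the training set, which is exactly what the rule above is meant to prevent.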
It depends a lot on the dataset. For example, the weather on a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that uses the previous records as input to the next one) and train in order.
However, if your records (or a transformation of them) are independent from each other, but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there should be noticeable differences when using shuffling. In some cases, it will improve the robustness of the model; in other cases, it will not generalize. Noticing these differences is part of the evaluation of the model.
In the end, the question is: is the "time series" really a time series as it stands (i.e., do records really depend on their neighbors), or is there some transformation that can break this dependency but preserve the structure of the problem? And, for this question, there is only one way to get to the answer: explore the dataset.
As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field; however, according to him, he learned it through a lot of experimentation and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget; improve on the best ones; try again.