I am new to probabilistic programming and ML. I am following a code on deep Markov model given on pyro's website. The link to the github page to that code is:
https://github.com/pyro-ppl/pyro/blob/dev/examples/dmm/dmm.py
I understand most part of the code. The part I don't understand is mini batch idea they are using from line 175.
Question 1:
Could someone explain what are they doing there when they are using mini-batch?
In pyro documentation they say
mini_batch is a three dimensional tensor, with the first dimension being the batch dimension, the second dimension being the temporal dimension, and the final dimension being the features (88-dimensional in our case)'
Question 2:
What does temporal dimension means here?
Because I want to use this code on my dataset which is a sequential data. I have done one hot encoding of my data such that it's dimension is (10000,500,20) where 10000 is the number of examples/Sequences, 500 is the length of each of these sequences and 20 is the number of features.
Question 3:
How can I use my one hot encoded data as mini batch here?
I'm sorry if it is a really basic question but, insights will be appreciated.
Link to that documentation is:
https://pyro.ai/examples/dmm.html
Question 1: Could someone explain what are they doing there when they are using mini-batch?
To optimize most of the deep learning models, we use mini-batch gradient descent. Here, A mini_batch refers to a small number of examples. Let's say, we have 10,000 training examples and we want to create mini-batches of 50 examples. So, in total there will be 200 mini-batches and we will perform 200 parameter updates during one iteration over the entire dataset.
Question 2: What does the temporal dimension mean here?
In your data: (10000, 500, 20), the second dimension refers to the temporal dimension. You can consider you have examples with 500 timesteps (t1, t2, ..., t500).
Question 3: How can I use my one-hot encoded data as mini-batch here?
In your scenario, you can split your data (10000, 500, 20) into 200 small batches of size (50, 500, 20) where 50 is the number of examples/Sequences in the mini-batch, 500 is the length of each of these sequences and 20 is the number of features.
How do we decide the mini-batch size? Basically, we can tune the batch size just like any other hyperparameters of our model.
Related
So I've got a simple pytorch example of how to train a ResNet CNN to learn MNIST labeling from this link:
https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/index.html
It's working great, but I want to hack it a bit so that it does 2 things. First, instead of predicting digits, it predicts animal shapes/colors for a project I'm working on. That's already working quite well already and am happy with it.
Second, I'd like to hack the training (and possibly layers) so that predictions is done in parallel on multiple images at a time. In the MNIST example, basically prediction (or output) would be done for an image that has 10 digits at a time concatenated by me. For clarity, each 10-image input will have the digits 0-9 appearing only once each. The key here is that each of the 10 digit gets a unique class/label from the CNN/ResNet and each class gets assigned exactly once. And that digits that have high confidence will prevent other digits with lower confidence from using that label (a Hungarian algorithm type of approach).
So in my use case I want to train on concatenated images (not single images) as in Fig A below and force the classifier to learn to predict the best unique label for each of the concatenated images and do this all at once. Such an approach should outperform single image classification - and it's particularly useful for my animal classification because otherwise the CNN can sometimes return the same ID for multiple animals which is impossible in my application.
I can already predict in series as in Fig B below. And indeed looking at the confidence of each prediction I am able to implement a Hungarian-algorithm like approach post-prediction to assign the best (most confident) unique IDs in each batch of 4 animals. But this doesn't always work and I'm wondering if ResNet can try and learn the greedy Hungarian assignment as well.
In particular, it's not clear that implementing A simply requires augmenting the data input and labels in the training set will do it automatically - because I don't know how to penalize or dissalow returning the same label twice for each group of images. So for now I can generate these training datasets like this:
print (train_loader.dataset.data.shape)
print (train_loader.dataset.targets.shape)
torch.Size([60000, 28, 28])
torch.Size([60000])
And I guess I would want the targets to be [60000, 10]. And each input image would be [1, 28, 28, 10]? But I'm not sure what the correct approach would be.
Any advice or available links?
I think this is a specific type of training, but I forgot the name.
I'm using Keras for timeseries prediction and I want to create a model that is based on the self-attention mechanism that will not use any RNNs. For each sample we look at the last x timesteps of samples to predict the next sample.
In other words I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is 1 problems with this.
There is a lot of unnecessary duplication of data where sample n has basically the same timesteps and features as sample n+1, only shifted 1 to the left.
How would you handle this assuming you dataset is very large?
I am not very familiar with this, but if your issue is "I have too many replicated data" I think you can solve your problem devising a generator for your data, and then pass the generator as input for the Keras/TensorFlow fit function (according to TensorFlow APIs specification, it is stated that it supports generators as input).
If your question is related to the logic behind the model, I do not see the issue. It is like that you have a sliding window, for each window you predict one value, and then you move the window by a certain amount (in your case, one). Could you argue a little more about your concern?
I recently learn the LSTM for time series prediction from
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data.
def batch_generator(batch_size, sequence_length):
"""
Generator function for creating random batches of training-data.
"""
# Infinite loop.
while True:
# Allocate a new array for the batch of input-signals.
x_shape = (batch_size, sequence_length, num_x_signals)
x_batch = np.zeros(shape=x_shape, dtype=np.float16)
# Allocate a new array for the batch of output-signals.
y_shape = (batch_size, sequence_length, num_y_signals)
y_batch = np.zeros(shape=y_shape, dtype=np.float16)
# Fill the batch with random sequences of data.
for i in range(batch_size):
# Get a random start-index.
# This points somewhere into the training-data.
idx = np.random.randint(num_train - sequence_length)
# Copy the sequences of data starting at this index.
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
yield (x_batch, y_batch)
He try to create several bacth samples for training.
My question is that, can we first randomly shuttle the x_train_scaled and y_train_scaled, and then begin sampling several batch size using the follow batch_generator?
my motivation for this question is that, for time series prediction, we want to training the past and predict for the furture. Therefore, is it legal to shuttle the training samples?
In the tutorial, the author chose a piece of continuous samples such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch not continous. For example, the x_batch[0] is picked at 10:00am and x_batch[1] is picked at 9:00am at the same day?
In summary: The follow two question are
(1) can we first randomly shuttle the x_train_scaled and y_train_scaled, and then begin sampling several batch size using the follow batch_generator?
(2) when we train LSTM, Do we need to consider the influence of time order? what parameters we learn for LSTM.
Thanks
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values in a 1-hour interval. Your LSTM takes in a sequence of samples that is chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating batches that consist of these sequences, you will train your LSTM on predicting based on a sequence of random samples from your whole dataset.
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in such a way as to respect the temporal ordering and the model is never trained on data from the future and only tested on data from the future.
It depends a lot on the dataset. For example, the weather from a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a statefull LSTM (ie, a LSTM that uses the previous records as input to the next one) and train in order.
However, if your records (or a transformation of them) are independent from each other, but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there should be noticeable differences when using shuffling. In some cases, it will improve the robustness of the model; in other cases, it will not generalize. Noticing these differences is part of the evaluation of the model.
In the end, the question is: the "time series" as it is is really a time series (ie, records really depend on their neighbor) or there is some transformation that can break this dependency, but preserv the structure of the problem? And, for this question, there is only one way to get to the answer: explore the dataset.
About authoritative references, I will have to let you down. I learn this from a seasoned researcher in the field, however, according to him, he learn it through a lot of experimentation and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fits your budget; improve on the best ones; try again.
I have repeated measurements on subjects, which I have structured as input to an LSTM model in Keras as follows:
batch_size = 1
model = Sequential()
model.add(LSTM(50, batch_input_shape=(batch_size, time_steps, features), return_sequences=True))
Where time_steps are the number of measurements on each subject, and features the number of available features on each measurement. Each row of the data is one subject.
My question is regarding the batch size with this type of data.
Should I only use a batch size of 1, or can the batch size be more than 1 subjects?
Related to that, would I benefit from setting stateful to True? Meaning that learning from one batch would inform the other batches too. Correct me if my understanding about this is not right too.
Great question! Using a batch size greater than 1 is possible with this sort of data and setup, provided that your rows are individual experiments on subjects and that your observations for each subject are ordered sequentially through time (e.g. Monday comes before Tuesday). Make sure that your observations between train and test are not split randomly and that your observations are ordered sequentially by subject in each, and you can apply batch processing. Because of this, set shuffle to false if using Keras as Keras shuffles observations in batches by default.
In regards to setting stateful to true: with a stateful model, all the states are propagated to the next batch. This means that the state of the sample located at index i, Xi will be used in the computation of the sample Xi+bs in the next batch. In the case of time series, this generally makes sense. If you believe that a subject measurement Si infleunces the state of the next subject measurement Si+1, then try setting stateful to true. It may be worth exploring setting stateful to false as well to explore and better understand if a previous observation in time infleunces the following observation for a particular subject.
Hope this helps!
I'm sorry for asking that stupid thing. I can't apply answers from other questions to my task.
Currently I got well-known error:
expected lstm_input_1 to have 3 dimensions, but got array with shape (7491, 1025)
My data:
matrix - 1025 float numbers in row. 7491 rows
So how to make it 3d? Or am I trying to use wrong layer model?
You need to have an explicit time dimension and a batch dimension. You always have a batch dimension (1 if you are using only one batch) and for recurrent models you need a time dimension as well, as these are sequential models and they operate over time.
Reshape your data to (1,7491,1025) for 1 batch and a sequence of length 7491 with 1025 features per time-step.