How to sample along with another dataloader in PyTorch - pytorch

Assume I have train/valid/test datasets with the usual batch_size and shuffling.
During train/valid/test, I want to sample a certain number (call it memory_size) of additional samples from the entire dataset for each sample.
For example, I set batch_size to 256, let the dataset be shuffled, and set memory_size to 80.
In every forward step, I want to use not only each sample from the batch, but also memory_size samples drawn from the entire original dataset, and I want to use them inside forward. Call these new samples the Memory (yes, I want to adopt the idea from Memory Networks). The Memory may overlap between samples in the training set.
I'm using PyTorch and PyTorch Lightning. Can I create a new memory dataloader for each of train_dataloader, val_dataloader, and test_dataloader and load it alongside the original dataloader, or is there a better way to achieve what I want?
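A rough sketch of one possible approach (the MemoryWrapper class and its details are my own invention, not something from PyTorch or Lightning; it assumes the base dataset returns (x, y) pairs of tensors) is to wrap the original dataset so every item also carries a freshly sampled memory set:

import torch
from torch.utils.data import Dataset, DataLoader

class MemoryWrapper(Dataset):
    """Hypothetical wrapper: each item carries its own freshly sampled memory set."""
    def __init__(self, base_dataset, memory_size=80):
        self.base = base_dataset
        self.memory_size = memory_size

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]                          # assumes the base dataset returns (x, y)
        # draw memory_size indices uniformly from the whole dataset; overlap is allowed
        mem_idx = torch.randint(len(self.base), (self.memory_size,))
        memory = torch.stack([self.base[int(i)][0] for i in mem_idx])
        return x, y, memory

# e.g. in a LightningModule / LightningDataModule:
# def train_dataloader(self):
#     return DataLoader(MemoryWrapper(self.train_set, memory_size=80),
#                       batch_size=256, shuffle=True)

Note that each batch then carries batch_size * memory_size extra tensors, so if that is too heavy it may be cheaper to sample one shared memory set per batch instead, for example by keeping a second DataLoader with a random sampler and pulling a single batch of size memory_size from it inside training_step.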

Related

Does data preprocessing with keras.ImageDataGenerator create more data or just change the existing one

When using keras.ImageDataGenerator and choosing some augmentations (flip, zoom, etc.), does that just change my data stream, or does it add the augmented data to the stream in addition, increasing my dataset size?
Thanks for your help
It does not increase your dataset as far as I know. It just applies random transforms or perturbations to your training data. So, yes, it just changes your data. Having said that, if you need more data:
You can store the augmented images using the save_to_dir parameter
You can maybe use steps_per_epoch = N * (n_samples / batch_size) and train on N times the amount of data per epoch (both options are sketched below)
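As a rough sketch of both options (X_train, y_train, model, the augmentation choices, and the 'augmented/' directory are placeholders, not from the question; the directory must already exist):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(horizontal_flip=True, zoom_range=0.2)

# Option 1: persist the augmented images so you can inspect or reuse them
flow = datagen.flow(X_train, y_train, batch_size=32,
                    save_to_dir='augmented/', save_format='png')

# Option 2: run N "virtual" copies of the dataset per epoch
N = 3
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
                    steps_per_epoch=N * (len(X_train) // 32),
                    epochs=10)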

What should be the maximum size of the input layer?

I am trying to create a multilayer perceptron for training on an image dataset. The images are 300*300, so the input layer has 90000 units. Is this the right way to create it?
90000 is a huge layer for what I assume you are running on a consumer-grade device. The error is likely TensorFlow running out of RAM.
If you post the whole traceback, I can be much more specific.
In general, for a basic image classification task:
Try feeding the 300*300 image into a conv net first.
Then pool to reduce the spatial dimensions (see the sketch below).
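A minimal sketch of that suggestion in Keras (the layer sizes, the assumption of RGB input, and the 10-class softmax are illustrative, not from the question):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    MaxPooling2D((2, 2)),             # pool to shrink the spatial dimensions
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),                        # far fewer units than a 90000-wide dense input
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),  # assuming 10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy')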

How to change the batch size during training?

During training, at each epoch, I'd like to change the batch size (for experimental purpose).
Creating a custom Callback seems appropriate but batch_size isn't a member of the Model class.
The only way I see would be to override fit_loop and expose batch_size to the callback at each loop. Is there a cleaner or faster way to do it without using a callback?
For others who land here, I found the easiest way to do batch size adjustment in Keras is just to call fit more than once (with different batch sizes):
model.fit(X_train, y_train, batch_size=32, epochs=20)
# ...continue training with a larger batch size
model.fit(X_train, y_train, batch_size=512, epochs=10)
I think it will be better to use a custom data generator to have control over the data you pass to the training loop, so you can generate batches of different sizes, process data on the fly, etc. Here is an outline:
def data_gen(data):
    while True:  # the generator yields forever
        # process `data` into a batch here; it can be any size,
        # and constructing the batch is your responsibility
        yield x, y  # x and y together form a single batch
Now you can train with model.fit_generator(data_gen(data), steps_per_epoch=100), which will yield 100 batches per epoch. You can also use a keras.utils.Sequence if you want to encapsulate this inside a class (see the sketch below).
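If you go the Sequence route, a rough sketch (the batch-size doubling in on_epoch_end is made up for illustration, and whether your Keras version re-queries the length each epoch is worth checking) could look like:

import numpy as np
from keras.utils import Sequence

class VariableBatchSequence(Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size           # can be changed between epochs

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

    def on_epoch_end(self):
        self.batch_size = min(self.batch_size * 2, 512)  # e.g. double the batch size each epoch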
For most purposes the accepted answer is the best: don't change the batch size. There's probably a better way 99% of the time that this question comes up.
For those 1%-ers who do have an exceptional case where changing the batch size mid-network is appropriate, there's a GitHub discussion that addresses this well:
https://github.com/keras-team/keras/issues/4807
To summarize it: Keras doesn't want you to change the batch size, so you need to cheat and add a dimension and tell Keras it's working with a batch_size of 1. For example, your batch of 10 CIFAR-10 images was sized [10, 32, 32, 3]; now it becomes [1, 10, 32, 32, 3]. You'll need to reshape this throughout the network appropriately. Use tf.expand_dims and tf.squeeze to add and remove a dimension trivially.
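For illustration, the reshaping trick looks roughly like this (the zero tensor just stands in for a real CIFAR-10 batch):

import tensorflow as tf

batch = tf.zeros([10, 32, 32, 3])           # a "real" batch of 10 CIFAR-10 images
pseudo_batch = tf.expand_dims(batch, 0)     # shape [1, 10, 32, 32, 3]: batch_size 1 as far as Keras knows
# ... inside the network, undo it wherever you need the true batch dimension back
restored = tf.squeeze(pseudo_batch, 0)      # shape [10, 32, 32, 3]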

value of steps per epoch passed to keras fit generator function

What is the need for setting the steps_per_epoch value when calling fit_generator(), when ideally it should just be total number of samples / batch size?
Keras' generators are infinite.
Because of this, Keras cannot know by itself how many batches the generators should yield to complete one epoch.
When you have a static number of samples, it makes perfect sense to use samples//batch_size for one epoch. But you may want to use a generator that performs random data augmentation, for instance. Because of the random process, you will never have two identical training epochs, so there is no clear limit.
So, these parameters in fit_generator allow you to control the yields per epoch as you wish, although in standard cases you'll probably keep to the most obvious option: samples//batch_size.
Without data augmentation, the number of samples is static, as Daniel mentioned.
In that case, the number of samples used for training is steps_per_epoch * batch_size.
By using ImageDataGenerator in Keras, we generate additional training data through augmentation, so the number of samples per epoch is something you can set yourself.
If you want twice the training data, just set steps_per_epoch to (original sample size * 2) / batch_size (see the sketch below).
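A quick sketch of that arithmetic (the sample count and batch size are placeholders):

batch_size = 32
original_samples = 5000                                  # placeholder for your dataset size

steps_per_epoch = (original_samples * 2) // batch_size   # roughly two augmented passes per epoch
# pass this as fit_generator(..., steps_per_epoch=steps_per_epoch)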

How to set batch size and epoch value in Keras for infinite data set?

I want to feed images to a Keras CNN. The program randomly feeds either an image downloaded from the net, or an image of random pixel values. How do I set batch size and epoch number? My training data is essentially infinite.
Even if your dataset is infinite, you have to set both batch size and number of epochs.
For the batch size, you can use the largest batch size that fits into your GPU/CPU RAM, found by trial and error. For example, you can try powers of two, like 32, 64, 128, 256.
The number of epochs is a parameter that always has to be tuned for the specific problem. You can use a validation set and train until the validation loss stops improving, or until the training loss is almost constant (it converges). Make sure to use a different part of the dataset to decide when to stop training. Then you can report final metrics on yet another set (the test set).
Batching exists because implementations are vectorised for faster, more efficient execution. When the dataset is large, all the data cannot fit in memory, so we use a batch size to still get some of the benefit of vectorisation.
In my opinion, you should use a batch size as large as your machine can handle.
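Putting those pieces together, a hedged sketch for the infinite-data case (infinite_image_generator, model, X_val, and y_val are hypothetical placeholders, and the numbers are only examples) could look like:

from keras.callbacks import EarlyStopping

batch_size = 128          # the largest power of two that fits in memory, found by trial and error
steps_per_epoch = 1000    # with infinite data, an "epoch" is whatever you define it to be

model.fit_generator(
    infinite_image_generator(batch_size),     # hypothetical generator yielding (x, y) batches forever
    steps_per_epoch=steps_per_epoch,
    epochs=100,                               # an upper bound; early stopping ends training sooner
    validation_data=(X_val, y_val),           # held-out data, never used for training
    callbacks=[EarlyStopping(monitor='val_loss', patience=5)],
)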
