Value of steps_per_epoch passed to Keras fit_generator function - keras

What is the need for setting the steps_per_epoch value when calling fit_generator(), when ideally it should just be total samples / batch size?

Keras' generators are infinite.
Because of this, Keras cannot know by itself how many batches the generators should yield to complete one epoch.
When you have a static number of samples, it makes perfect sense to use samples // batch_size for one epoch. But you may want to use a generator that performs random data augmentation, for instance, and because of that random process you will never have two identical training epochs, so there is no clear limit.
So, these parameters in fit_generator allow you to control the number of yields per epoch as you wish, although in standard cases you'll probably keep the most obvious option: samples // batch_size.
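For illustration, a minimal sketch of the standard case (assuming a compiled model and a generator train_gen that yields (x, y) batches forever; the names are placeholders):

num_samples = 50000   # total training samples (illustrative)
batch_size = 32

model.fit_generator(train_gen,
                    steps_per_epoch=num_samples // batch_size,  # defines what one "epoch" means
                    epochs=10)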

Without data augmentation, the number of samples is static, as Daniel mentioned.
Then, the number of samples used for training is steps_per_epoch * batch_size.
By using ImageDataGenerator in Keras, we create additional training data through augmentation, so you can effectively choose the number of training samples yourself.
If you want twice the training data, just set steps_per_epoch to (original sample size * 2) / batch_size.
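A hedged sketch of that augmentation case (assuming x_train and y_train are NumPy arrays already in memory; all names are illustrative):

from keras.preprocessing.image import ImageDataGenerator

batch_size = 32
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
train_gen = datagen.flow(x_train, y_train, batch_size=batch_size)

# Ask for twice as many augmented batches per epoch as the raw data would give
model.fit_generator(train_gen,
                    steps_per_epoch=(len(x_train) * 2) // batch_size,
                    epochs=10)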

Related

How does setting random seed affect tensorflow model training?

I am trying to get consistent results within training, so I am testing random seeds on all the possible sources of randomness in my scripts:
tensorflow.random.set_seed(0)
random.seed(0)
np.random.seed(0)
My doubt is about how model training is going to be affected, as I am using a generator that is supposed to randomly shuffle the samples within the batches.
Are the batches going to be the same at epoch N across different instances of model training? In other words, batches would still be randomized from epoch to epoch, but in a different training run the batches would be composed in exactly the same way at a given epoch N.
If that is the case, gradient values should be the same, and so should the model weights at epoch N.
Is there anything I should pay attention to that could actually harm the robustness of the trained model, due to having fixed the seed and thus losing some of the benefit of complete randomization?
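(Not part of the original question, but as a side note: recent TensorFlow versions provide a single helper that covers the three seed calls above; this assumes TF >= 2.7.)

import tensorflow as tf
tf.keras.utils.set_random_seed(0)  # seeds Python's random, NumPy and TensorFlow in one call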

PyTorch training with batches of different lengths?

Is it possible to train a model with batches that have unequal lengths during an epoch? I am new to PyTorch.
If you take a look at the DataLoader documentation, you'll see a drop_last parameter, which explains that when the dataset size is not divisible by the batch size, you get a last batch of a different size. So basically the answer is yes, it is possible, it happens often, and it does not affect (too much) the training of a neural network.
However, you must be a bit careful: some PyTorch layers deal poorly with very small batch sizes. For example, if you happen to have BatchNorm layers and you get a batch of size 1, you'll get errors due to the fact that batchnorm at some point divides by len(batch) - 1. More generally, training a network that has batchnorms generally requires batches of significant size, say at least 16 (the literature generally aims for 32 or 64). So if you happen to have variable-size batches, take the time to check whether your layers have requirements in terms of batch size for optimal training and convergence. But except in particular cases, your network will train anyway, no worries.
As for how to make your batches with custom sizes, I suggest you look at and take inspiration from the PyTorch implementations of DataLoader and Sampler. You may want to implement something similar to BatchSampler and use the batch_sampler argument of DataLoader, as in the sketch below.
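A minimal sketch of both points, using a small toy dataset (all names are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset, BatchSampler, SequentialSampler

# Toy dataset: 10 samples of 3 features each, with binary labels
dataset = TensorDataset(torch.randn(10, 3), torch.randint(0, 2, (10,)))

# drop_last=False keeps the smaller final batch (here of size 2); drop_last=True would discard it
loader = DataLoader(dataset, batch_size=4, drop_last=False)
for x, y in loader:
    print(x.shape)  # torch.Size([4, 3]), torch.Size([4, 3]), torch.Size([2, 3])

# For fully custom batch compositions, build the batches yourself and pass batch_sampler
batches = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
loader = DataLoader(dataset, batch_sampler=batches)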

Large dataset - ANN

I am trying to classify around 400K samples with 13 attributes. I have used Python sklearn's SVM package, but it didn't work, and then I learned that SVMs are not suitable for large dataset classification. Then I used sklearn's ANN via the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is overtrained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure there is no overfitting? (I have learned that the default implementation has 100 hidden neurons. Is it OK to use the default implementation as is?)
To know if you are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated these scores, compare them. If the training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions.
If you are overfitting with neural networks, you probably have to reduce the number of layers and the number of neurons per layer. There isn't any strict rule that tells you how many layers or neurons you need based on your dataset size; datasets of the same size can behave completely differently.
So, to conclude, if you are overfitting, evaluate your model's accuracy using different numbers of layers and neurons per layer, and then observe which values give the best results. There are methods you can use to search for the best parameters, such as GridSearchCV; see the sketch below.
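A minimal sketch of both steps, assuming X and y hold the ~400K samples with 13 attributes (the split sizes and parameter grid are illustrative):

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

clf = MLPClassifier(solver='adam', alpha=1e-5, random_state=1, activation='relu', max_iter=500)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(train_acc, test_acc)  # a large gap between the two suggests overfitting

# Search over a few architectures to find one that generalizes better
grid = GridSearchCV(MLPClassifier(max_iter=500, random_state=1),
                    param_grid={'hidden_layer_sizes': [(50,), (100,), (50, 50)],
                                'alpha': [1e-5, 1e-3]},
                    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)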

Keras curriculum learning - change shuffle method

I want to change the way training data is shuffled in Keras. Conventionally we shuffle the samples in each epoch, then batch them and fit the model. Now I want to first batch the samples and then shuffle these "BATCHES" (the samples within each batch should not be shuffled). The reason is that I ordered all samples based on a criterion (curriculum learning) and I want to preserve that order.
Do you know how I can do this?
I found the answer according to this link:
How does shuffle = 'batch' argument of the .fit() layer work in the background?
Short answer: we need to set shuffle='batch' in fit.
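A minimal sketch, assuming a compiled model and x_train / y_train already sorted by your curriculum criterion (note that shuffle='batch' is documented mainly for HDF5 data and shuffles batch-sized chunks rather than individual samples):

model.fit(x_train, y_train,
          batch_size=32,
          epochs=10,
          shuffle='batch')  # shuffles whole batch-sized chunks, preserving order inside each chunk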

How to set batch size and epoch value in Keras for infinite data set?

I want to feed images to a Keras CNN. The program randomly feeds either an image downloaded from the net, or an image of random pixel values. How do I set batch size and epoch number? My training data is essentially infinite.
Even if your dataset is infinite, you have to set both batch size and number of epochs.
For batch size, you can use the largest batch size that fits into your GPU/CPU RAM, found by trial and error. For example, you can try power-of-two batch sizes like 32, 64, 128, 256.
The number of epochs is a parameter that always has to be tuned for the specific problem. You can use a validation set and train until the validation loss stops improving, or until the training loss is almost constant (i.e., it converges). Make sure to use a different part of the dataset to decide when to stop training. Then you can report final metrics on yet another separate set (the test set).
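A minimal sketch of this setup, assuming infinite_gen() yields (images, labels) batches forever and val_ds is a separate, finite validation set (all names are placeholders):

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                           restore_best_weights=True)

model.fit(infinite_gen(),
          steps_per_epoch=1000,     # defines what one "epoch" means for an infinite stream
          epochs=100,               # upper bound; EarlyStopping usually stops sooner
          validation_data=val_ds,
          callbacks=[early_stop])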
As for why batches are needed at all: implementations are vectorised for faster and more efficient execution. When the data is large, it cannot all fit in memory, so we use a batch size to still get some vectorisation.
In my opinion, one should use a batch size as large as your machine can handle.