How does setting random seed affect tensorflow model training? - python-3.x

I am trying to get consistent results across training runs, so I am setting random seeds for all the possible sources of randomness in my scripts:
import random
import numpy as np
import tensorflow
tensorflow.random.set_seed(0)
random.seed(0)
np.random.seed(0)
What I am unsure about is how the model training will be affected, since I am using a generator that is supposed to randomly shuffle the samples when building the batches.
Will the batches be the same at epoch N across different training runs? In other words, the batches would still be randomized from epoch to epoch, but in a different run of the same training the batches at epoch N would be composed in exactly the same way.
If that is the case, the gradient values should be the same, and so should the model weights at epoch N.
Is there anything I should watch out for that could actually harm the robustness of the trained model due to having fixed the seeds, i.e. could I be losing some of the benefit of fully random shuffling?
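A minimal sketch of what fixed seeds imply for shuffling, assuming TensorFlow 2.x and a generator that shuffles with NumPy (the reset_seeds helper and the toy permutations are only illustrative):
import random
import numpy as np
import tensorflow as tf

def reset_seeds(seed=0):
    # Reset every source of randomness mentioned in the question.
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

reset_seeds(0)
run_a = [np.random.permutation(8) for _ in range(3)]  # shuffle order for epochs 1-3, run A

reset_seeds(0)
run_b = [np.random.permutation(8) for _ in range(3)]  # shuffle order for epochs 1-3, run B

# Shuffles differ between epochs but match across runs at the same epoch index.
print(all((a == b).all() for a, b in zip(run_a, run_b)))  # True
Note that this only covers the Python, NumPy and TensorFlow seeds; multi-worker data loading or non-deterministic GPU ops can still make runs diverge slightly.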

Related

What is the proper way to save a fitted CNN model for the MNIST dataset?

I developed a simple CNN model for the MNIST dataset and got 98% validation accuracy. But after saving the model through Keras as model.h5 and evaluating the saved model in another Jupyter session, the performance of the model is poor and the predictions are random.
What needs to be done to get the same accuracy after saving and reloading the model in a different Jupyter notebook session?
(Consider sharing your code/results so the community can help you better).
I'm assuming you're using TensorFlow/Keras, so model.save('my_model.h5') after your model.fit(...) should save the model, including the architecture and the trained weights (and, by default, the optimizer state, none of which should affect the prediction capabilities of the model).
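A minimal save-and-reload sketch, assuming a compiled Keras model from the training session and that x_test/y_test in the new session are preprocessed exactly as during training (all variable names here are illustrative):
from tensorflow import keras

# In the training session, after model.fit(...):
model.save('my_model.h5')

# In the new Jupyter session:
restored = keras.models.load_model('my_model.h5')
restored.evaluate(x_test, y_test)  # should be close to the original validation accuracy
If the reloaded model predicts at chance level, the usual culprit is that the test images in the new session were not normalized/reshaped the same way as during training, not the saving step itself.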
A number of things could cause a generalization gap like that, but...
Case 1: having a high training/validation accuracy and a low test (prediction) accuracy typically means your model overfit on the given training data.
I suggest adding some regularization to your training phase (dropout layers, cutout augmentation, L1/L2 penalties, etc.), training for fewer epochs or using early stopping, or using cross-validation/data reshuffling to rule out overfitting (see the sketch after this answer).
Case 2: low intrinsic dataset variance, but unless you're using a subset of MNIST, this is unlikely. Make sure you are properly splitting your training/validation/test sets.
Again, it could be a number of issues, but these are the most common causes of poor generalization. Post your code (specifying the architecture, optimizer, hyperparameters, data preprocessing, and test data used) so the answers can be more relevant to your problem.
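A minimal sketch of the regularization options mentioned above, assuming a small Keras CNN on MNIST-shaped inputs (layer sizes, the L2 factor, and the patience value are illustrative, not taken from the question):
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),  # dropout regularization
    layers.Dense(64, activation='relu',
                 kernel_regularizer=keras.regularizers.l2(1e-4)),  # L2 penalty
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.1, epochs=50, callbacks=[early_stop])
The commented fit call assumes your own x_train/y_train; early stopping halts training once validation loss stops improving, which covers the "fewer epochs" suggestion automatically.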

Data augmentation affects convergence speed

Data augmentation is surely a great regularization method, and it improves my accuracy on the unseen test set. However, I do not understand why it reduces the convergence speed of the network. I know each epoch takes longer to train since image transformations are applied on the fly. But why does it affect convergence? With my current setup, the network reaches 100% training accuracy after 5 epochs without data augmentation (and clearly overfits); with data augmentation, it takes 23 epochs to reach 95% training accuracy and never seems to hit 100%.
Any links to research papers or comments on the reasoning behind this?
I guess you are evaluating accuracy on the training set, right? And that is a mistake...
Without augmentation your network simply overfits. You have a fixed number of images, for instance 1000, and during training the network can easily memorize their labels. And you are evaluating the model on that fixed (not augmented) dataset.
When you train your network with data augmentation, you are effectively training on a dataset of infinite size. Because the augmentation is done on the fly, the model "sees" new images every time, so it cannot memorize them perfectly and reach 100% accuracy. And you are evaluating the model on the augmented (effectively infinite) dataset.
So when you train your model with and without augmentation, you are evaluating it on different datasets, and it is not fair to compare their training accuracies.
Piece of advice:
Do not look at training set accuracy; it is simply misleading when you use augmentation. Instead, evaluate your model on the test set (or validation set), which is not augmented. By doing this you'll see the real accuracy increase for your model.
P.S. If you want to find out more about image augmentations, I really recommend checking this guide - https://notrocketscience.blog/complete-guide-to-data-augmentation-for-computer-vision/
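A minimal sketch of training on augmented batches while validating on un-augmented data, assuming Keras and image arrays x_train/y_train/x_val/y_val already loaded (the transform parameters are illustrative):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
val_gen = ImageDataGenerator()  # no transforms: evaluation data stays fixed

# model.fit(train_gen.flow(x_train, y_train, batch_size=32),
#           validation_data=val_gen.flow(x_val, y_val, batch_size=32),
#           epochs=30)
Because train_gen applies new random transforms every epoch, the training accuracy is measured on effectively new images each time, which is why it climbs more slowly, while the validation accuracy reflects performance on fixed, un-augmented data.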

Large dataset - ANN

I am trying to classify around 400K samples with 13 attributes. I first used scikit-learn's SVM package, but it didn't work out, and then I learned that SVMs are not well suited to classifying large datasets. So I switched to scikit-learn's ANN, using the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples, then tested the model on the remaining ones. The classification worked well. However, my concern is that the system is overtrained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure there is no overfitting? (I have learned that the default implementation has 100 hidden neurons. Is it ok to use the default implementation as is?)
To know whether you are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated these scores, compare them. If the training set score is much better than the test set score, you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions.
If you are overfitting with neural networks, you probably have to reduce the number of layers and the number of neurons per layer. There is no strict rule for how many layers or neurons you need for a given dataset size; datasets of the same size can behave completely differently.
So, to conclude, if you are overfitting, evaluate your model's accuracy with different numbers of layers and neurons per layer, and observe which values give the best results. There are methods you can use to find the best parameters, such as GridSearchCV (see the sketch below).
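A minimal sketch of both checks, using a small synthetic dataset as a stand-in for the 400K x 13 data (the candidate architectures in the grid are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 13 features, like the question's data, but fewer samples so it runs quickly.
X, y = make_classification(n_samples=5000, n_features=13, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

clf = MLPClassifier(solver='adam', alpha=1e-5, random_state=1, activation='relu', max_iter=500)
clf.fit(X_train, y_train)
print('train accuracy:', clf.score(X_train, y_train))  # much higher than test => overfitting
print('test accuracy: ', clf.score(X_test, y_test))

# Search over architectures/regularization to balance fit and generalization.
grid = GridSearchCV(MLPClassifier(max_iter=500, random_state=1),
                    {'hidden_layer_sizes': [(50,), (100,), (50, 50)], 'alpha': [1e-5, 1e-3]},
                    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)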

Does training on the total dataset improve confidence scores?

I'm using SVC(kernel="linear", probability=True) for multiclass classification. When I use two-thirds of my data for training, I get ~72% accuracy. But when I try to predict in production, the confidence scores I get are very low. Does training on the total dataset help to improve confidence scores?
Does training on the total dataset help to improve confidence scores?
It might. In general, the more data the better. However, evaluating performance should be done on data that the model has not seen before. One way to do this is to set aside part of the data as a test set, as you have done. Another approach is to use cross-validation, see below.
And when I try to predict in production, the confidence scores I get are very low.
This means that your model does not generalize well. In other words, when presented with data it has not seen before, the model starts to make more or less random predictions.
To get a better sense of how well your model generalizes you may want to use cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
clf = SVC()
scores = cross_val_score(clf, X, Y)
This will train and evaluate your classifier on the full dataset using several folds. For each split, the classifier is trained on part of the data and validated on a held-out subset it has not seen, and the scores array contains that split's validation score (for SVC, the accuracy). If you need more control over which metrics to evaluate, use the cross_validate function.
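A minimal sketch of cross_validate with several metrics, using synthetic data as a stand-in for the real features X and labels Y (the metric names are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic multiclass data standing in for the real X, Y.
X, Y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
clf = SVC(kernel="linear", probability=True)
results = cross_validate(clf, X, Y, cv=5, scoring=["accuracy", "f1_macro"])
print(results["test_accuracy"])  # one accuracy value per fold
print(results["test_f1_macro"])  # one macro-averaged F1 value per fold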
to predict in production
In order to improve your model's performance, there are several methods to consider:
Use more training data
Use an ensemble model to reduce prediction variance (see the sketch after this list)
Use a different model (algorithm)
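A minimal sketch of the ensemble option, combining the question's linear SVC with a random forest in a soft-voting ensemble (the estimator choices and the synthetic data are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, Y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

ensemble = VotingClassifier(
    estimators=[('svc', SVC(kernel="linear", probability=True)),
                ('rf', RandomForestClassifier(n_estimators=200, random_state=0))],
    voting='soft')  # averages predicted probabilities, which can smooth out low confidence scores
ensemble.fit(X_train, Y_train)
print(ensemble.score(X_test, Y_test))
print(ensemble.predict_proba(X_test[:3]))  # per-class confidence scores for a few samples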

Value of steps_per_epoch passed to the Keras fit_generator function

What is the need for setting the steps_per_epoch value when calling fit_generator(), when ideally it should be total number of samples / batch size?
Keras' generators are infinite.
Because of this, Keras cannot know by itself how many batches the generators should yield to complete one epoch.
When you have a static number of samples, it makes perfect sense to use samples // batch_size for one epoch. But you may want to use a generator that performs random data augmentation, for instance. Because of the random process, you will never have two identical training epochs, so there is no clear-cut epoch boundary.
So, these parameters in fit_generator allow you to control the yields per epoch as you wish, although in standard cases you'll probably keep to the most obvious option: samples//batch_size.
Without data augmentation, the number of samples is static as Daniel mentioned.
Then, the number of samples seen per epoch is steps_per_epoch * batch_size.
By using ImageDataGenerator in Keras, we generate additional training data on the fly, so the effective number of samples per epoch is something you can set yourself.
If you want twice the training data per epoch, just set steps_per_epoch to (original sample size * 2) / batch_size.
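A minimal sketch of choosing steps_per_epoch with an augmenting generator, using toy arrays in place of real images (the array shapes and the 2x factor are illustrative):
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.random.rand(1000, 28, 28, 1)  # toy stand-in for real images
y_train = np.random.randint(0, 10, size=1000)
batch_size = 32

datagen = ImageDataGenerator(rotation_range=10, horizontal_flip=True)
train_gen = datagen.flow(x_train, y_train, batch_size=batch_size)

steps = len(x_train) // batch_size           # the obvious choice: roughly one pass over the data
steps_2x = (len(x_train) * 2) // batch_size  # show each (randomly augmented) sample about twice per epoch

# model.fit(train_gen, steps_per_epoch=steps_2x, epochs=10)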
