Using flow_from_directory for training and validation, without augmentation (Keras)

I am training a simple CNN with Nt=148 training images and Nv=37 validation images. I use the ImageDataGenerator.flow_from_directory() method because I plan to use data augmentation in the future, but for the time being I don't want any augmentation. I just want to read the images from disk one by one (and each exactly once, which matters most for validation), to avoid loading all of them into memory.
But the following makes me think that something different from what I expect is happening:
- The training and validation accuracies take values that do not look like fractions with 148 or 37 as the denominator. Trying to estimate a plausible denominator from the differences between successive accuracy values leads to numbers much larger than 148 (about 534 or 551; see (*) below for why I expected multiples of 19) and much larger than 37.
- Verifying all predictions on both the training and validation datasets (with a separate program that reads the validation directory only once and does not use the generators above) shows a number of misclassifications that is not exactly (1 - val_acc) * Nv, as I would expect.
(*) Lastly, I found that the batch size I used for both generators is 19, so I expect I am providing either 19*7=133 or 19*8=152 training images per epoch, and either 19 or 38 validation images at each epoch end.
By the way: is it possible to use model.fit_generator() with generators built from ImageDataGenerator.flow_from_directory() to achieve:
- no data augmentation
- each generator supplying all of its images, exactly once per epoch, to the training process and the validation process respectively
- shuffling is fine, and actually desired, so that each epoch runs differently
In the meantime I am leaning towards setting the batch size equal to the validation set size (i.e. 37). Since 37 divides the training set size (148 = 4 * 37), I think the numbers should work out.
But I am still unsure whether the following code achieves the requirement of "no data augmentation at all":
from keras.preprocessing.image import ImageDataGenerator

# Rescaling only: with no other arguments, no augmentation transforms are applied.
train_augmenter = ImageDataGenerator(rescale=1./255)
valid_augmenter = ImageDataGenerator(rescale=1./255)

val_batch_size = 37

train_generator = train_augmenter.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=val_batch_size,
    class_mode='binary',
    color_mode='grayscale',
    follow_links=True)

validation_generator = valid_augmenter.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=val_batch_size,
    class_mode='binary',
    color_mode='grayscale',
    follow_links=True)

There are some issues with your situation.
First of all, that number of images is quite low. Collect many more images and use augmentation.
Second, a typical split of the total data is 80% for training and 20% for validation. Put the images you select in folders with those proportions.
Third, you can check what your code actually generates by adding the following argument to your flow_from_directory call (after the last argument, with a comma after it):
save_to_dir='folder_to_see_augmented_images'
Then run the model (compile, and then fit) and check the contents of the save_to_dir folder.
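As a minimal sketch of how the whole setup could look with no augmentation and every image seen exactly once per epoch (the directory names, image size, and the already-built model are placeholders, not taken from the question): an ImageDataGenerator configured only with rescale applies no augmentation, shuffle=True reshuffles the order each epoch, and steps_per_epoch / validation_steps set to ceil(samples / batch_size) make each generator cover its directory exactly once per epoch.

import math
from keras.preprocessing.image import ImageDataGenerator

# Rescaling only: no augmentation transforms are applied.
datagen = ImageDataGenerator(rescale=1./255)

batch_size = 37  # divides both 148 and 37

train_generator = datagen.flow_from_directory(
    'data/train',                 # placeholder path
    target_size=(128, 128),       # placeholder size
    batch_size=batch_size,
    class_mode='binary',
    color_mode='grayscale',
    shuffle=True,                 # reshuffle the order every epoch
    save_to_dir='debug_batches')  # inspect what the generator actually yields

validation_generator = datagen.flow_from_directory(
    'data/validation',
    target_size=(128, 128),
    batch_size=batch_size,
    class_mode='binary',
    color_mode='grayscale',
    shuffle=False)

# Each generator covers its directory exactly once per epoch.
model.fit_generator(
    train_generator,
    steps_per_epoch=math.ceil(train_generator.samples / batch_size),
    validation_data=validation_generator,
    validation_steps=math.ceil(validation_generator.samples / batch_size),
    epochs=20)

With rescale as the only transform, the files written to save_to_dir should look identical to the originals (apart from resizing and grayscale conversion), which is an easy way to confirm that no augmentation is happening.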

Related

In the original T5 paper, what does 'step' mean?

I have been reading the original T5 paper, 'Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer'. On page 11, it says "We pre-train each model for 2^19=524,288 steps on C4 before fine-tuning."
I am not sure what the 'steps' mean. Is it the same as epochs? Or the number of iterations per epoch?
I guess 'steps'='iterations' in a single epoch.
A step is a single training iteration. In a step, the model is given a single batch of training instances. So if the batch size is 128, then the model is exposed to 128 instances in a single step.
Epochs aren't the same as steps. An epoch is a single pass over an entire training set. So if the training data contains for example 128,000 instances & the batch size is 128, an epoch amounts to 1,000 steps (128 × 1,000 = 128,000).
The relationship between epochs & steps depends on the size of the training data (see this question for a more detailed comparison). If the data size is changed, then the effective number of steps in an epoch changes as well (keeping the batch size fixed). So a dataset of 1,280,000 instances would take more steps per epoch, & vice-versa for a dataset of 12,800 instances.
For this reason, steps are typically reported, especially when pre-training models on large corpora, because they allow a direct comparison in terms of steps & batch size, which isn't possible (or is much harder) with epochs. If someone else trains on an entirely different dataset of a different size, their model will have "seen" the same number of training instances as long as the number of steps & the batch size are the same, ensuring that no model is unfairly favoured by training on more instances.
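As a quick illustration of this arithmetic (the batch size and dataset size below are example numbers, not claims about T5, except for the 2^19 step count quoted above):

# Relationship between steps, epochs, batch size and dataset size
batch_size = 128
dataset_size = 128_000

steps_per_epoch = dataset_size // batch_size        # 1,000 steps = one full pass (one epoch)
total_steps = 2 ** 19                               # 524,288 steps, as quoted from the paper

instances_seen = total_steps * batch_size           # 67,108,864 training instances in total
epochs_equivalent = total_steps / steps_per_epoch   # ~524 passes over this particular dataset

print(steps_per_epoch, instances_seen, epochs_equivalent)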

Create a percentage in CNN

How can I print a percentage for a CNN if what I want is something like "how well the training data match the test data", with the percentage appearing in the output?
Do you mean the percentage of predicted labels matching the actual labels, in training or testing a convolutional neural network? "How well the training data match the test data" does not quite make sense: training data and test data are not supposed to be the same, or even overlap. The given dataset is divided into two parts: a training set (e.g. 70%) and a test set (e.g. 30%).
Depending on the deep learning platform you are using (e.g. TensorFlow or PyTorch), you may be able to display the accuracy (percentage of predicted labels that match the actual labels) in every epoch of training, and also for testing. Try checking sample code on https://www.tensorflow.org/ or https://pytorch.org/.
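For example, in Keras (a sketch only; model, x_train, y_train, x_test and y_test are placeholders), requesting the 'accuracy' metric makes the percentage of correctly predicted labels appear for every training epoch and for the evaluation on the test set:

# Report accuracy (fraction of predicted labels that match the actual labels)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Training: loss/accuracy and val_loss/val_accuracy are printed after each epoch
history = model.fit(x_train, y_train,
                    validation_split=0.3,   # e.g. 70% train / 30% validation
                    epochs=10,
                    batch_size=32)

# Testing: evaluate returns [loss, accuracy] on the held-out test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy: {:.1%}'.format(test_acc))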

Which is the most suitable method for training: model.fit(), model.train_on_batch(), or model.fit_generator()?

I have a training dataset of 600 images with resolution (512*512*1), categorized into 2 classes (300 images per class). Using some augmentation techniques I have increased the dataset to 10,000 images. After the following preprocessing steps:
all_images = np.array(all_images) / 255.0         # scale pixel values to [0, 1]
all_images = all_images.astype('float16')         # reduce memory footprint
all_images = all_images.reshape(-1, 512, 512, 1)  # add the channel dimension
I saved these images to an H5 file.
I am using an AlexNet architecture for classification, with 3 convolutional and 3 overlapping max-pooling layers.
I want to know which of the following options will be best for training on Google Colab, where memory is limited to 12 GB.
1. model.fit(x, y, validation_split=0.2)
# For this I have to load all the data into memory, and then applying AlexNet to the data will simply cause a resource-exhausted error.
2. model.train_on_batch(x, y)
# For this I have written a script which randomly loads the data batch-wise from the H5 file into memory and trains on that data. I am confused by the property of train_on_batch(), i.e. a single gradient update. Will this affect my training procedure, or will it be the same as model.fit()?
3. model.fit_generator()
# Give the original directory of images to a data generator which automatically augments the data, and then train using model.fit_generator(). I haven't tried this yet.
Please guide me on which of these methods will be best in my case. I have read many answers (here, here, and here) about model.fit(), model.train_on_batch(), and model.fit_generator(), but I am still confused.
model.fit - suitable if you load the data as a NumPy array and train without augmentation.
model.fit_generator - if your dataset is too big to fit in memory and/or you want to apply augmentation on the fly.
model.train_on_batch - less common; usually used when training more than one model at a time (a GAN, for example).
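Since the data is already stored in an H5 file, one option (a sketch under assumptions: the file name, the 'images'/'labels' dataset keys, and the batch size are invented for illustration, and model is assumed to exist) is a keras.utils.Sequence that reads one batch at a time, which can be passed straight to model.fit_generator():

import math
import h5py
import numpy as np
from keras.utils import Sequence

class H5Sequence(Sequence):
    """Yields batches read from an HDF5 file, one batch at a time."""
    def __init__(self, h5_path, batch_size=16):
        self.h5_path = h5_path
        self.batch_size = batch_size
        with h5py.File(h5_path, 'r') as f:
            self.n = f['images'].shape[0]   # assumed dataset key

    def __len__(self):
        return math.ceil(self.n / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.n)
        with h5py.File(self.h5_path, 'r') as f:
            x = f['images'][start:stop]     # already scaled to [0, 1]
            y = f['labels'][start:stop]     # assumed dataset key
        return np.asarray(x), np.asarray(y)

train_seq = H5Sequence('train_data.h5', batch_size=16)
model.fit_generator(train_seq, epochs=10)   # Keras infers steps per epoch from __len__

Unlike a plain Python generator, a Sequence has a known length, so Keras can work out the number of steps per epoch automatically and only one batch needs to be in memory at a time.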

Value of steps_per_epoch passed to the Keras fit_generator function

What is the need for setting the steps_per_epoch value when calling fit_generator(), when ideally it should just be total samples / batch size?
Keras' generators are infinite.
Because of this, Keras cannot know by itself how many batches the generators should yield to complete one epoch.
When you have a static number of samples, it makes perfect sense to use samples // batch_size for one epoch. But you may want to use a generator that performs random data augmentation, for instance. Because of the random process, you will never have two identical training epochs, so there is no clear limit.
So these parameters in fit_generator allow you to control the yields per epoch as you wish, although in standard cases you'll probably stick to the most obvious option: samples // batch_size.
Without data augmentation, the number of samples is static, as Daniel mentioned.
Then, the number of samples seen during training is steps_per_epoch * batch_size.
By using ImageDataGenerator in Keras, we create additional training data through augmentation. Therefore, the number of samples used for training can be set by you.
If you want twice the training data, just set steps_per_epoch to (original sample size * 2) / batch_size.
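As a small worked example (the sample count, batch size, generator and model are placeholders for illustration):

# Make one epoch pass over twice the original number of samples
num_samples = 1000
batch_size = 25

steps_per_epoch = (num_samples * 2) // batch_size   # 80 steps per epoch
model.fit_generator(train_generator,
                    steps_per_epoch=steps_per_epoch,
                    epochs=10)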

How to set the batch size and number of epochs in Keras for an infinite data set?

I want to feed images to a Keras CNN. The program randomly feeds either an image downloaded from the net, or an image of random pixel values. How do I set batch size and epoch number? My training data is essentially infinite.
Even if your dataset is infinite, you have to set both batch size and number of epochs.
For the batch size, you can use the largest batch size that fits into your GPU/CPU RAM, found by trial and error. For example, you can try power-of-two batch sizes like 32, 64, 128, 256.
The number of epochs is a parameter that always has to be tuned for the specific problem. You can use a validation set and train until the validation loss stops decreasing, or until the training loss is almost constant (it converges). Make sure to use a different part of the dataset to decide when to stop training. Then you can report final metrics on yet another separate set (the test set).
Batching exists because implementations are vectorised for faster, more efficient execution. When the data is large, it cannot all fit in memory, so we use a batch size to still get some of the benefit of vectorisation.
In my opinion, you should use as large a batch size as your machine can handle.
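A sketch of how this might look with an endless Python generator (the batch size, image shape, step count, and the model and val_generator objects are assumptions for illustration): steps_per_epoch defines what counts as one "epoch" of the infinite stream, and EarlyStopping on the validation loss decides when to stop.

import numpy as np
from keras.callbacks import EarlyStopping

def infinite_batches(batch_size=64, img_shape=(64, 64, 3)):
    """Endlessly yields batches; here random pixels stand in for real images."""
    while True:
        x = np.random.rand(batch_size, *img_shape).astype('float32')
        y = np.random.randint(0, 2, size=(batch_size, 1))
        yield x, y

stop_early = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)

model.fit_generator(infinite_batches(batch_size=64),
                    steps_per_epoch=500,        # defines one "epoch" of the stream
                    epochs=100,                 # upper bound; early stopping ends sooner
                    validation_data=val_generator,
                    validation_steps=50,
                    callbacks=[stop_early])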
