Weighted random sampler - oversample or undersample? - PyTorch

Problem
I am training a deep learning model in PyTorch for binary classification, and I have a dataset containing unbalanced class proportions. My minority class makes up about 10% of the given observations. To avoid the model learning to just predict the majority class, I want to use the WeightedRandomSampler from torch.utils.data in my DataLoader.
Let's say I have 1000 observations (900 in class 0, 100 in class 1), and a batch size of 100 for my dataloader.
Without weighted random sampling, I would expect each training epoch to consist of 10 batches.
Questions
Will only 10 batches be sampled per epoch when using this sampler - and consequently, would the model 'miss' a large portion of the majority class during each epoch, since the minority class is now overrepresented in the training batches?
Will using the sampler result in more than 10 batches being sampled per epoch (meaning the same minority class observations may appear many times, and also that training would slow down)?

A small snippet of code to use WeightedRandomSampler
First, define the function:
def make_weights_for_balanced_classes(images, nclasses):
    n_images = len(images)
    count_per_class = [0] * nclasses
    for _, image_class in images:
        count_per_class[image_class] += 1
    weight_per_class = [0.] * nclasses
    for i in range(nclasses):
        weight_per_class[i] = float(n_images) / float(count_per_class[i])
    weights = [0] * n_images
    for idx, (image, image_class) in enumerate(images):
        weights[idx] = weight_per_class[image_class]
    return weights
And after this, use it in the following way:
import torch
from torchvision import datasets

dataset_train = datasets.ImageFolder(traindir)

# For an unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(dataset_train.imgs, len(dataset_train.classes))
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

# Note: shuffle must not be passed together with a sampler, so it is omitted here
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=args.batch_size,
                                           sampler=sampler, num_workers=args.workers,
                                           pin_memory=True)

It depends on what you're after; check the torch.utils.data.WeightedRandomSampler documentation for details.
There is a num_samples argument which lets you specify how many samples will actually be drawn when the Dataset is combined with torch.utils.data.DataLoader (assuming you weighted the samples correctly):
If you set it to len(dataset) you will get the first case
If you set it to 1800 (in your case) you will get the second case
Will only 10 batches be sampled per epoch when using this sampler - and consequently, would the model 'miss' a large portion of the majority class during each epoch [...]
Yes, but new samples will be drawn once this epoch passes, so across multiple epochs the model still sees different parts of the majority class.
Will using the sampler result in more than 10 batches being sampled per epoch (meaning the same minority class observations may appear many times, and also that training would slow down)?
Training would not slow down overall: each epoch would take longer, but convergence should take roughly the same time, as fewer epochs will be necessary due to there being more data in each.
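To make the two cases concrete, here is a minimal sketch (the toy tensors, class counts, and batch size are assumptions matching the example in the question, not real data):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# 900 majority-class and 100 minority-class observations, as in the question
labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(torch.randn(1000, 8), labels)

# weight each sample by the inverse frequency of its class
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

# Case 1: num_samples = len(dataset) -> 10 batches of 100 per epoch
sampler_1 = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
# Case 2: num_samples = 1800 -> 18 batches of 100 per epoch; minority samples repeat more often
sampler_2 = WeightedRandomSampler(sample_weights, num_samples=1800, replacement=True)

loader_1 = DataLoader(dataset, batch_size=100, sampler=sampler_1)
loader_2 = DataLoader(dataset, batch_size=100, sampler=sampler_2)
print(len(loader_1), len(loader_2))  # 10 18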

Related

How to handle objective convergence, early stopping and learning rate adjustments when using partial_fit method of Scikit SGDClassifier?

I am using the partial_fit function of SGDClassifier with log loss to do online learning, because I have a large dataset that cannot fit in memory, as follows:
cls = SGDClassifier(loss='log', learning_rate='adaptive', eta0=0.1, penalty='l2', alpha=0.0001)

for batch in training_generator:
    cls.partial_fit(batch)

predictions = []
for batch in test_data:
    probs = cls.predict_proba(batch)
    predictions += list(probs)
In the documentation of partial_fit function it is stated
Internally, this method uses max_iter = 1. Therefore, it is not guaranteed that a minimum of the cost function is reached after calling it once. Matters such as objective convergence, early stopping, and learning rate adjustments should be handled by the user.
Questions:
Does max_iter = 1 mean I would need to loop over partial_fit myself, as many times as needed, for each batch of data, as follows?
for batch in training_generator:
    for _ in range(num_of_iteration):
        cls.partial_fit(batch)
Does that statement in the documentation mean I would need to compute the log_loss (learning curve) myself on the validation data at each training iteration and decide when to stop training? For example, as in the code below.
for batch in training_generator:
    cls.partial_fit(batch)

predictions = []
for batch in training_generator:
    probs = cls.predict_proba(batch)
    predictions += list(probs)
training_loss = log_loss(y_true, predictions)

predictions = []
for batch in validation_generator:
    probs = cls.predict_proba(batch)
    predictions += list(probs)
val_loss = log_loss(y_true, predictions)

# Pseudocode
# If val_loss does not decrease after n iterations by some value, then stop training
If I have large validation and training datasets, can I use a representative subset of the validation and training data, i.e. one with the same class distribution as the full dataset, to compute the loss?
Assuming the validation loss keeps decreasing and the training_generator has been exhausted, should I shuffle the training_generator data and run the training loop again?
# Pseudocode
while True:
    Run training loop
    If val_loss does not decrease after n iterations by some value, then stop training (break while loop)
Finish training loop
The documentation says that the learning_rate adjustment should also be done by the user. Does that mean the learning_rate='adaptive' argument to the SGDClassifier has no effect when using partial_fit? If yes, how can the learning_rate be adjusted?
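For illustration, here is a minimal sketch of the kind of manual early-stopping loop the documentation describes (the generator names, patience, and tolerance are assumptions; both generators are assumed to yield (X, y) batches and to be re-iterable every epoch):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

cls = SGDClassifier(loss='log', learning_rate='adaptive', eta0=0.1, penalty='l2', alpha=0.0001)
classes = np.array([0, 1])          # all class labels must be supplied on the first partial_fit call

best_val_loss = np.inf
patience, bad_epochs, tol = 3, 0, 1e-4

for epoch in range(100):
    for X_batch, y_batch in training_generator:        # hypothetical re-iterable generator
        cls.partial_fit(X_batch, y_batch, classes=classes)

    # validation loss on a (possibly subsampled) hold-out set
    y_true, y_prob = [], []
    for X_batch, y_batch in validation_generator:      # hypothetical re-iterable generator
        y_prob.append(cls.predict_proba(X_batch)[:, 1])
        y_true.append(y_batch)
    val_loss = log_loss(np.concatenate(y_true), np.concatenate(y_prob))

    if val_loss < best_val_loss - tol:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                       # early stopping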

Can we send the same data points in the same epoch?

If we set steps_per_epoch (when fitting with an ImageDataGenerator) higher than the total number of possible batches (total_samples / batch_size), will the model revisit the same data points from the start, or will it ignore them?
Ex:
Flattened image shape which will go to Dense layer: (2000*1)
batch size: 20
Total no of batches possible: 100 (2000/20)
steps per epoch: 1000 (set explicitly)
As far as I know, steps_per_epoch is independent of the 'real' epoch (which is number_of_inputs/batch_size). Let's use an example similar to what you want to know, with 2000 data points and batch_size of 20 (which means 2000/20 = 100 steps for one 'real' epoch):
If you set steps_per_epoch = 1000: Keras asks for a loop of 1000 batches, which basically means 10 'real' epochs (10 full traversals of the data).
If you set steps_per_epoch = 50: Keras asks for a loop of 50 batches, and the remaining 50 batches of one 'real' epoch are visited in the next loop.
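A minimal sketch of this behaviour (the model and generator names are assumptions, not from the question; it assumes a generator yielding batches of 20 out of 2000 samples):

# One full pass over 2000 samples with batch_size=20 is 100 steps.
# steps_per_epoch=1000 makes each reported "epoch" cover about 10 full passes;
# steps_per_epoch=50 covers half a pass, and the generator simply continues
# from where it left off in the next reported epoch.
model.fit(
    train_generator,
    steps_per_epoch=1000,
    epochs=5)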

NaN loss in Keras with triplet loss

I'm trying to learn an embedding for Paris6k images, combining VGG and Adrian Ung's triplet loss. The problem is that after a small number of iterations in the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy jump to 1.
I've already tried lowering the learning rate, increasing the batch size (only up to 16 because of memory), changing the optimizer (Adam and RMSprop), checking for None values in my dataset, changing the data format from 'float32' to 'float64', adding a small bias to the data, and simplifying the model.
Here is my code:
import numpy as np
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Flatten, concatenate
from keras.models import Model
from keras.optimizers import Adam

# image_list, label_list, tl (the triplet loss module) and callbacks_list are defined elsewhere

base_model = VGG16(include_top=False, input_shape=(512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')
embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])
model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)

batch_size = 16
epochs = 2
embedding_size = 64

opt = Adam(lr=0.0001)
model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])

label_list = np.vstack(label_list)
x_train = image_list[:2500]
x_val = image_list[2500:]
y_train = label_list[:2500]
y_val = label_list[2500:]

dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))

H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)
There are 3366 images, with values scaled to the range [0, 1].
The network is given dummy target values because it tries to learn embeddings such that images of the same class have small distances and images of different classes have large distances; the real class labels are passed in as part of the training input instead.
I've noticed that when I was previously making an incorrect class division (and keeping images that should have been discarded), I didn't have the NaN loss problem.
What should I try to do?
Thanks in advance and sorry for my english.
In some cases, the seemingly random NaN loss can be caused by your data: if there are no positive pairs in your batch, you will get a NaN loss.
As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss; it's the same code):
semi_hard_triplet_loss_distance = math_ops.truediv(
    math_ops.reduce_sum(
        math_ops.maximum(
            math_ops.multiply(loss_mat, mask_positives), 0.0)),
    num_positives,
    name='triplet_semihard_loss')
There is a division by the number of positive pairs (num_positives), which can lead to NaN when that number is zero.
I suggest you inspect your data pipeline to ensure there is at least one positive pair in each of your batches. (You can, for example, adapt some of the code in triplet_loss_adapted_from_tf to get the num_positives of your batch and check that it is greater than 0, as in the sketch below.)
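A rough sketch of such a check (batch_labels is an assumed 1-D array of the class labels in one batch; the count mirrors what num_positives measures, i.e. ordered pairs (i, j) with i != j and the same label):

import numpy as np

def count_positive_pairs(batch_labels):
    labels = np.asarray(batch_labels).reshape(-1)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)      # exclude the trivial (i, i) pairs
    return int(same.sum())             # 0 means the triplet loss divides by zero -> NaN

print(count_positive_pairs([0, 1, 2, 3]))  # 0 -> NaN risk
print(count_positive_pairs([0, 1, 1, 3]))  # 2 -> at least one positive pair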
Try increasing your batch size. It happened to me too. As mentioned in the previous answer, the network is unable to find any positive pairs (num_positives ends up zero). I had 250 classes and was getting NaN loss initially; I increased the batch size to 128/256 and then there was no issue.
I saw that Paris6k has 12 or 15 classes. Increase your batch size to 32, and if you run out of GPU memory you can try a model with fewer parameters. You can start with the EfficientNet-B0 model, which has 5.3M parameters compared to VGG16's 138M; a sketch of the swap follows.
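A minimal sketch of swapping the backbone (this assumes a TF 2.x environment where EfficientNetB0 is available in tensorflow.keras.applications; the input shape is copied from the question):

from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import Flatten

# drop-in replacement for the VGG16 backbone in the question
base_model = EfficientNetB0(include_top=False, weights='imagenet',
                            input_shape=(512, 384, 3))
embeddings = Flatten()(base_model.output)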
I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.
https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)

Keras fit_generator takes more time per epoch

I am doing image classification using Keras. I have 8k images (inputs) in the training sample and 2k images in the test sample, and have set the number of epochs to 25. I noticed that each epoch is very slow (the first iteration takes approximately an hour).
Can anyone suggest how I can overcome this, and what is the reason it takes so long?
Code below:
# PART 1: initialise the neural network
from keras.models import Sequential
# package to perform the first layer, which is convolution; 2D because it is for images (3D would be for video)
from keras.layers import Convolution2D
# to perform max pooling on the convolved layer
from keras.layers import MaxPool2D
# to convert the pooled feature maps into one large feature vector, which will be the input for the ANN
from keras.layers import Flatten
# to add layers to the ANN
from keras.layers import Dense

# STEP 1
# initialise the CNN
classifier = Sequential()

# add a convolution layer
classifier.add(Convolution2D(filters=32, kernel_size=(3, 3), strides=(1, 1),
                             input_shape=(64, 64, 3), activation='relu'))
# filters - number of feature detectors that we are going to apply to the image
# kernel_size - dimensions of the feature detector
# strides - moving through one unit at a time
# input_shape - shape of the input image on which we are going to apply filters through the convolution operation;
#   we will have to convert the image to that shape in image preprocessing before feeding it to the convolution
#   (3 channels for RGB, 1 for black and white, plus the pixel dimensions)
# activation - function we use to introduce non-linearity

# STEP 2
# add pooling
# this step significantly reduces the size of the feature maps and makes computation easier
classifier.add(MaxPool2D(pool_size=(2, 2)))
# pool_size - factor by which to downscale

# STEP 3
# flatten the feature maps
classifier.add(Flatten())

# STEP 4
# hidden layer
classifier.add(Dense(units=128, activation='relu', kernel_initializer='uniform'))
# output layer
classifier.add(Dense(units=1, activation='sigmoid'))

# compile the CNN using stochastic gradient descent
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
# the loss function should be categorical_crossentropy if there are more than 2 output classes
# PART 2: fitting the CNN to the images
# adapted from the Keras documentation
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory(
    '/Users/arunramji/Downloads/Sourcefiles/CNN_Imageclassification/Convolutional_Neural_Networks/dataset/training_set',
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary')

test_set = test_datagen.flow_from_directory(
    '/Users/arunramji/Downloads/Sourcefiles/CNN_Imageclassification/Convolutional_Neural_Networks/dataset/test_set',
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary')

classifier.fit_generator(
    training_set,
    steps_per_epoch=8000,    # number of input images
    epochs=25,
    validation_data=test_set,
    validation_steps=2000)   # number of test images

classifier.fit(
    training_set,
    steps_per_epoch=8000,    # number of input images
    epochs=25,
    validation_data=test_set,
    validation_steps=2000)
You are setting steps_per_epoch to the wrong value (this is why it takes longer than necessary): it should not be set to the number of data points. steps_per_epoch should be set to the size of the dataset divided by the batch size, which is 8000/32 = 250 for your training set and ceil(2000/32) = 63 for your validation set.
Update:
As Matias pointed out in his answer, the steps_per_epoch parameter setting in your fit method led to the huge slowdown per epoch.
From the fit_generator documentation:
steps_per_epoch:
Integer. Total number of steps (batches of samples)
to yield from generator before declaring one epoch finished and
starting the next epoch. It should typically be equal to
ceil(num_samples / batch_size) Optional for Sequence: if unspecified,
will use the len(generator) as a number of steps.
validation_steps: Only relevant if validation_data is a generator.
Total number of steps (batches of samples) to yield from
validation_data generator before stopping at the end of every epoch.
It should typically be equal to the number of samples of your
validation dataset divided by the batch size. Optional for Sequence:
if unspecified, will use the len(validation_data) as a number of
steps.
Actually, Keras handles the two parameters inconsistently: the fit method raises a ValueError if you use a plain in-memory dataset instead of a data generator and set the parameters like batch_size=batch_size, steps_per_epoch=num_samples:
ValueError: Number of samples 60000 is less than samples required for specified batch_size 200 and steps 60000
But when the data comes from a data generator, Keras doesn't catch the same problem, letting you run into an issue like the current one.
I made a small example to check this.
The fit method with steps_per_epoch=num_samples:
Number of samples: 60000
Number of samples per batch: 200
Train for 60000 steps, validate for 50 steps
Epoch 1/5
263/60000 [..............................] - ETA: 4:07:09 - loss: 0.2882 - accuracy: 0.9116
with ETA (estimated time): 4:07:09,
as this is for 60000 steps, each of 200 samples per batch.
The same fit with steps_per_epoch=num_samples // batch_size:
Number of samples: 60000
Number of samples per batch: 200
Train for 300 steps, validate for 50 steps
Epoch 1/5
28/300 [=>............................] - ETA: 1:15 - loss: 1.0946 - accuracy: 0.6446
with ETA: 1:15
Solution:
steps_per_epoch = training_set.shape[0] // batch_size
validation_steps = validation_set.shape[0] // batch_size   # for a flow_from_directory iterator use .samples instead, as sketched below
Further possible issues regarding performance:
As SajanGohil wrote in his comment, train_datagen.flow_from_directory performs tasks such as file operations and preprocessing before the actual training process, which sometimes takes more time than the training itself.
To avoid this extra time, you can do the preprocessing separately, once, before the whole training process, and then use the preprocessed data at training time.
In any case, CNNs on large amounts of image data are rather time- and resource-consuming, which is why GPU usage is generally assumed.
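As a concrete illustration, here is a sketch of the corrected call for the question's setup (it assumes the flow_from_directory iterators defined above, which expose .samples and .batch_size):

import math

# compute the number of steps from the iterator's sample count and batch size
steps_train = math.ceil(training_set.samples / training_set.batch_size)   # ceil(8000 / 32) = 250
steps_val = math.ceil(test_set.samples / test_set.batch_size)             # ceil(2000 / 32) = 63

classifier.fit(
    training_set,
    steps_per_epoch=steps_train,
    epochs=25,
    validation_data=test_set,
    validation_steps=steps_val)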

I don't understand the code for training a classifier in pytorch

I don't understand the line labels.size(0). I'm new to PyTorch and have been quite confused about the data structures.
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
labels is a tensor whose first dimension is the batch dimension, i.e. its size there is N, the number of samples in the batch (in this classification example its shape is [N]). .size(...) returns a torch.Size (a subclass of tuple) with the dimensions of the tensor, and .size(0) returns an integer with the value of the first (0-indexed) dimension, i.e. N.
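A tiny sketch (the tensor here is made up for illustration) of what .size() and .size(0) return:

import torch

labels = torch.randint(0, 10, (32,))   # a batch of 32 class labels
print(labels.size())    # torch.Size([32])
print(labels.size(0))   # 32 -> the number of samples in the batch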
To answer your question
In PyTorch, tensor.size() allows you to check out the shape of a tensor.
In your code,
images, labels = data
images and labels will each contain N training examples, where N depends on your batch size. If you check the shape of labels, its first dimension will be N, the size of the mini-batch.
A bit of background for those who are new to training a neural network.
When training a neural network, practitioners forward-pass the data through the network and optimize the weights using the gradients.
Say your training dataset contains 1 million images, and your training script is designed to pass all 1 million images through the network in one go. The problem with this approach is that it will take a really long time for you to receive feedback from your neural network. This is where mini-batch training comes in.
In PyTorch, the DataLoader class allows us to split the dataset into multiple mini-batches. If your training loader contains 1 million examples and the batch size is 1000, each epoch will iterate through all the mini-batches in 1000 steps. This way, you can observe and optimize the training performance better; a small sketch follows.
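A minimal sketch (the dataset here is random toy data, not from the tutorial) of how DataLoader splits a dataset into mini-batches:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 10,000 toy examples with batch_size=1000 -> 10 mini-batches per epoch
dataset = TensorDataset(torch.randn(10_000, 3), torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=1000, shuffle=True)
print(len(loader))   # 10
for images, labels in loader:
    print(labels.size(0))   # 1000 samples in each mini-batch
    break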

Resources