Why is testing accuracy so low, could there be a bug in my code? - pytorch

I've been training an image classification model using object detection and then applying image classification to the images. I have 87 custom classes in my data(not ImageNet classes), and just over 7000 images altogether(around 60 images per class). I am happy with my object detection code and I think it works quite well, however, for classification I have been using ResNet and AlexNet. I have tried AlexNet, ResNet18, ResNet50 and ResNet101 for training however, I am getting very low testing accuracies(around 10%), and my training accuracies are high for all models. I've also attempted regularisation and changing the learning rates, but I am not getting the higher accuracies(>80%) that I require. I wonder if there is a bug in my code, although I haven't been able to figure it out.
Here is my training code, I have also processed images in the way that Pytorch pretrained models expect:
import torch.nn as nn
import torch.optim as optim
from typing import Callable
import numpy as np
EPOCHS=100
resnet = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50')
resnet.eval()
resnet.fc = nn.Linear(2048, 87)
res_loss = nn.CrossEntropyLoss()
res_optimiser = optim.SGD(resnet.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)
def train_model(model, loss_fn, optimiser, modelsavepath):
train_acc = 0
for j in range(EPOCHS):
running_loss = 0.0
correct = 0
total = 0
for i, data in enumerate(training_generator, 0):
model.train()
inputs, labels, paths = data
total += 1
optimizer.zero_grad()
outputs = model(inputs)
_, predicted = torch.max(outputs, 1)
if(predicted.int() == labels.int()):
correct += 1
loss = loss_fn(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
train_acc = train_correct / len(training_generator)
print("Epoch:{}/{} AVG Training Loss:{:.3f} AVG Training Acc {:.2f}% ".format(j + 1, EPOCHS, train_loss, train_acc))
torch.save(model, modelsavepath)
train_model(resnet, res_loss, res_optimiser, 'resnet.pth')
Here is the testing code used for a single image, it is part of a class:
self.model.eval()
outputs = self.model(img[None, ...]) #models expect batches, so give it a singleton batch
scores, predictions = torch.max(outputs, 1)
predictions = predictions.numpy()[0]
possible_scores= np.argmax(scores.detach().numpy())
Is there a bug in my code, either testing or training, or is my model just overfitting? Additionally, is there a better image classification model that I could try?

Your dataset is very small, so you're most likely overfitting. Try:
decrease learning rate (try 0.001, 0.0001, 0.00001)
increase weight_decay (try 1e-4, 1e-3, 1e-2)
if you don't already, use image augmentations (at least the default ones, like random crop and flip).
Watch train/test loss curves when finetuning your model and stop training as soon as you see test accuracy going down while train accuracy goes up.

Related

SDG with batch size >1?

I'm taking the "Deep NNs with PyTorch" course by IBM and I encountered lab examples where SDG is used for optimizer while batch size is >1 in DataLoader.
If I understand correctly, SGD would perform gradient descent with only 1 training example in each step, so it this case how would the SGD interact with each batch of training example?
For example, if batch size = 20, would the SGD optimizer perform 20 GD steps in each batch? If this is the case, then does that mean no matter what batch size I set for DataLoader, the SGD optimizer would just perform (# of training example) GD steps in one epoch?
Layers = [2, 50, 3]
model = Net(Layers)
learning_rate = 0.10
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train_loader = DataLoader(dataset=data_set, batch_size=20)
criterion = nn.CrossEntropyLoss()
LOSS = train(data_set, model, criterion, train_loader, optimizer, epochs=100)
def train(data_set, model, criterion, train_loader, optimizer, epochs=100):
LOSS = []
ACC = []
for epoch in range(epochs):
for x, y in train_loader:
print(x, y)
optimizer.zero_grad()
yhat = model(x)
loss = criterion(yhat, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
LOSS.append(loss.item())
ACC.append(accuracy(model, data_set))
...
if batch size = 20, would the SGD optimizer perform 20 GD steps in
each batch?
No. Batch size = 20 means, it would process all the 20 samples and then get the scalar loss. Based on that it would backpropagate the error. And that is one step of GD.
This is known as minibatch SGD, instead of taking 1 input like in SGD, it considers 20 and then everything else stays the same.

Memory bottleneck with autoregressive transformer decoding

I am trying to train a transformer model for sequence modeling. Below is a standalone example:
import torch
import torch.nn as nn
criterion = nn.MSELoss()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=12)
memory = torch.rand(10, 32, 512)
y = torch.rand(20, 32, 512)
start_token = torch.ones((1,32,512))
tgt_input = torch.cat((start_token,y[:-1,:]),axis=0)
optimizer = torch.optim.Adam(transformer_decoder.parameters())
###################Teacher forced
while(True):
optimizer.zero_grad()
out = transformer_decoder(tgt_input, memory, nn.Transformer.generate_square_subsequent_mask(20,20))
loss = criterion(out,y)
print("loss: ", loss.item())
loss.backward()
optimizer.step()
For a 12 layer decoder, the model works fine on a personal machine with 8GB memory. The model is autoregressive and works with shifted targets. Given we provide targets above, I refer to this setting as "teacher forced".
However, at inference stage, we will not have targets fed as above, and one would need to condition on targets generated on the go. This setting is as follows:
###################Non Teacher forced
while(True):
optimizer.zero_grad()
predictions = torch.ones((1,32,512))
for i in range(1,21):
predictions = torch.cat((predictions, transformer_decoder(tgt_input[:i], memory, nn.Transformer.generate_square_subsequent_mask(i,i))[-1].unsqueeze(0)),axis=0)
print("i: ", i, "predictions.shape: ", predictions.shape)
loss = criterion(predictions[1:],y)
print("loss: ", loss.item())
loss.backward()
optimizer.step()
I wish to train the model with a hybrid training strategy with, without teacher forcing. However, the non-teacher forced strategy causes out-of-memory exception and doesn't work. For final inference (testing), usually, with torch.no_grad() it can work, but not in training. Can anyone explain as to why this causes memory bottlenecks exactly?
This is because of the rolling of the computational graph. For the teacher forced model, gradients are not propagated after the true values. However, for non-teacher forced model they backpropagate making the accumulation of gradients (similar to RNN).

Pytorch BCE loss not decreasing for word sense disambiguation task

I am performing word sense disambiguation and have created my own vocabulary of the top 300k most common English words. My model is very simple where each word in the sentences (their respective index value) is passed through an embedding layer which embeds the word and average the resulting embedding. The averaged embedding is then sent through a linear layer, as shown in the model below.
class TestingClassifier(nn.Module):
def __init__(self, vocabSize, features, embeddingDim):
super(TestingClassifier, self).__init__()
self.embeddings = nn.Embedding(vocabSize, embeddingDim)
self.linear = nn.Linear(features, 2)
self.sigmoid = nn.Sigmoid()
def forward(self, inputs):
embeds = self.embeddings(inputs)
avged = torch.mean(embeds, dim=-1)
output = self.linear(avged)
output = self.sigmoid(output)
return output
I am running BCELoss as loss function and SGD as optimizer. My problem is that my loss barely decreases as training goes on, almost as if it converges with a very high loss. I have tried different learning rates (0.0001, 0.001, 0.01 and 0.1) but I get the same issue.
My training function is as follows:
def train_model(model,
optimizer,
lossFunction,
batchSize,
epochs,
isRnnModel,
trainDataLoader,
validDataLoader,
earlyStop = False,
maxPatience = 1
):
validationAcc = []
patienceCounter = 0
stopTraining = False
model.train()
# Train network
for epoch in range(epochs):
losses = []
if(stopTraining):
break
for inputs, labels in tqdm(trainDataLoader, position=0, leave=True):
optimizer.zero_grad()
# Predict and calculate loss
prediction = model(inputs)
loss = lossFunction(prediction, labels)
losses.append(loss)
# Backward propagation
loss.backward()
# Readjust weights
optimizer.step()
print(sum(losses) / len(losses))
curValidAcc = check_accuracy(validDataLoader, model, isRnnModel) # Check accuracy on validation set
curTrainAcc = check_accuracy(trainDataLoader, model, isRnnModel)
print("Epoch", epoch + 1, "Training accuracy", curTrainAcc, "Validation accuracy:", curValidAcc)
# Control early stopping
if(earlyStop):
if(patienceCounter == 0):
if(len(validationAcc) > 0 and curValidAcc < validationAcc[-1]):
benchmark = validationAcc[-1]
patienceCounter += 1
print("Patience counter", patienceCounter)
elif(patienceCounter == maxPatience):
print("EARLY STOP. Patience level:", patienceCounter)
stopTraining = True
else:
if(curValidAcc < benchmark):
patienceCounter += 1
print("Patience counter", patienceCounter)
else:
benchmark = curValidAcc
patienceCounter = 0
validationAcc.append(curValidAcc)
Batch size is 32 (training set contains 8000 rows), vocabulary size is 300k, embedding dimension is 24. I have tried adding more linear layers to the network, but it makes no difference. The prediction accuracy on the training and validation sets stays at around 50% (which is horrible) even after many epochs of training. Any help is much appreciated!

Val loss behaves strange while using custom training loop in tensorflow 2.0

I'm using a VGG16 model written in tf2.0 to train on my own datasets. Some BatchNormalization layers were included in the model and the "training" argument were set to True during training time and False during validation time as described in many tutorials.
The train_loss decreased to a certain level during training as expected. However, the val_loss behaves really strange. I checked out the output of the model after training and found out that, if I set the training argument to True, the output is quite correct, but if I set it to False, the result is incorrect at all.
According to the tutorials in tensorflow website, when training is set to False , the model will normalize its inputs using the mean and variance of its moving statistics learned during training but it doesn't seem so. Am I missing something?
I've provided the trainning and validation code in the below.
def train():
logging.basicConfig(level=logging.INFO)
tdataset = tf.data.Dataset.from_tensor_slices((train_img_list[:200], train_label_list[:200]))
tdataset = tdataset.map(parse_function, 3).shuffle(buffer_size=200).batch(batch_size).repeat(repeat_times)
vdataset = tf.data.Dataset.from_tensor_slices((val_img_list[:100], val_label_list[:100]))
vdataset = vdataset.map(parse_function, 3).batch(batch_size)
### Vgg model
model = VGG_PR(num_classes=num_label)
logging.info('Model loaded')
start_epoch = 0
latest_ckpt = tf.train.latest_checkpoint(os.path.dirname(ckpt_path))
if latest_ckpt:
start_epoch = int(latest_ckpt.split('-')[1].split('.')[0])
model.load_weights(latest_ckpt)
logging.info('model resumed from: {}, start at epoch: {}'.format(latest_ckpt, start_epoch))
else:
logging.info('training from scratch since weights no there')
######## training loop ########
loss_object = tf.keras.losses.MeanSquaredError()
val_loss_object = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=initial_lr)
train_loss = tf.metrics.Mean(name='train_loss')
val_loss = tf.metrics.Mean(name='val_loss')
writer = tf.summary.create_file_writer(log_path.format(case_num))
with writer.as_default():
for epoch in range(start_epoch, total_epoch):
print('start training')
try:
for batch, data in enumerate(tdataset):
images, labels = data
with tf.GradientTape() as tape:
pred = model(images, training=True)
if len(pred.shape) == 2:
pred = tf.reshape(pred,[-1, 1, 1, num_label])
loss = loss_object(pred, labels)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
if batch % 20 ==0:
logging.info('Epoch: {}, iter: {}, loss:{}'.format(epoch, batch, loss.numpy()))
tf.summary.scalar('train_loss', loss.numpy(), step=epoch*1250*repeat_times+batch) # the tdataset has been repeated 5 times..
tf.summary.text('Zernike_coe_pred', tf.as_string(tf.squeeze(pred)), step=epoch*1250*repeat_times+batch)
tf.summary.text('Zernike_coe_gt', tf.as_string(tf.squeeze(labels)), step=epoch*1250*repeat_times+batch)
writer.flush()
train_loss(loss)
model.save_weights(ckpt_path.format(epoch=epoch))
except KeyboardInterrupt:
logging.info('interrupted.')
model.save_weights(ckpt_path.format(epoch=epoch))
logging.info('model saved into {}'.format(ckpt_path.format(epoch=epoch)))
exit(0)
# validation step
for batch, data in enumerate(vdataset):
images, labels = data
val_pred = model(images, training=False)
if len(val_pred.shape) == 2:
val_pred = tf.reshape(val_pred,[-1, 1, 1, num_label])
v_loss = val_loss_object(val_pred, labels)
val_loss(v_loss)
logging.info('Epoch: {}, average train_loss:{}, val_loss: {}'.format(epoch, train_loss.result(), val_loss.result()))
tf.summary.scalar('val_loss', val_loss.result(), step = epoch)
writer.flush()
train_loss.reset_states()
val_loss.reset_states()
model.save_weights(ckpt_path.format(epoch=epoch))
The train losss reduced to a very small value like the groundtruth label are in the range of [0, 1] and the average train loss can be 0.007, but the val loss is much higher than this. The output of the model tends to be close to 0 if I set training to False.
updated on Nov. 6th:
I have found an interesting thing that if I use tf.function to decorate my model in its call method, the val loss will turn to be correct, but I'm not sure what has happened?
Mentioning the Answer for the benefit of the community.
Issue is resolved, i.e., val loss will turn to be correct if tf.function is used to decorate the model in its call method.

Udacity Deep Learning, Assignment 3, Part 3: Tensorflow dropout function

I am now on assignment 3 of the Udacity Deep Learning class. I have most of it completed and it's working but I noticed that problem 3, which is about using 'dropout' with tensorflow, seems to degrade my performance rather than improve it.
So I think I'm doing something wrong. I'll put my full code here. If someone can explain to me how to properly use dropout, I'd appreciate it. (Or confirm I'm using it correctly and it's just not helping in this case.). It drops accuracy from over 94% (without dropout) down to 91.5%. If you aren't using L2 regularization, the degradation is even larger.
def create_nn(dataset, weights_hidden, biases_hidden, weights_out, biases_out):
# Original layer
logits = tf.add(tf.matmul(tf_train_dataset, weights_hidden), biases_hidden)
# Drop Out layer 1
logits = tf.nn.dropout(logits, 0.5)
# Hidden Relu layer
logits = tf.nn.relu(logits)
# Drop Out layer 2
logits = tf.nn.dropout(logits, 0.5)
# Output: Connect hidden layer to a node for each class
logits = tf.add(tf.matmul(logits, weights_out), biases_out)
return logits
# Create model
batch_size = 128
hidden_layer_size = 1024
beta = 1e-3
graph = tf.Graph()
with graph.as_default():
# Input data. For the training data, we use a placeholder that will be fed
# at run time with a training minibatch.
tf_train_dataset = tf.placeholder(tf.float32,
shape=(batch_size, image_size * image_size))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Variables.
weights_hidden = tf.Variable(
#tf.truncated_normal([image_size * image_size, num_labels]))
tf.truncated_normal([image_size * image_size, hidden_layer_size]))
#biases = tf.Variable(tf.zeros([num_labels]))
biases_hidden = tf.Variable(tf.zeros([hidden_layer_size]))
weights_out = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels]))
biases_out = tf.Variable(tf.zeros([num_labels]))
# Training computation.
#logits = tf.matmul(tf_train_dataset, weights_out) + biases_out
logits = create_nn(tf_train_dataset, weights_hidden, biases_hidden, weights_out, biases_out)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
loss += beta * (tf.nn.l2_loss(weights_hidden) + tf.nn.l2_loss(weights_out))
# Optimizer.
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(logits)
#valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights_out) + biases_out)
#test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights_out) + biases_out)
valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
num_steps = 10000
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
# Note: we could use better randomization across epochs.
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
#offset = (step * batch_size) % (3*128 - batch_size)
#print(offset)
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
# The key of the dictionary is the placeholder node of the graph to be fed,
# and the value is the numpy array to feed to it.
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
_, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 500 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
You would need to turn off dropout during inference. It may not be obvious at first, but the fact that dropout is hardcoded in the NN architecture means it will affect the test data during inference. You can avoid this by creating a placeholder keep_prob, rather than providing the value 0.5 directly. For example:
keep_prob = tf.placeholder(tf.float32)
logits = tf.nn.dropout(logits, keep_prob)
To turn on dropout during training, set the keep_prob value to 0.5:
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
During inference/evaluation, you should be able to do something like this to set keep_prob to 1.0 in eval:
accuracy.eval(feed_dict={x: test_prediction, y_: test_labels, keep_prob: 1.0}
EDIT:
Since the issue does not seem to be that dropout is used at inference, the next culprit would be that the dropout is too high for this network size. You can potentially try decreasing the dropout to 20% (i.e. keep_prob=0.8), or increasing the size of the network to give the model an opportunity to learn the representations.
I actually gave it a try with your code, and I'm getting around ~93.5% with 20% dropout with this network size. I have added some additional resources below, including the original Dropout paper to help clarify the intuition behind it, and expands on more tips when using dropout such as increasing the learning rate.
References:
Deep MNIST for Experts: has an example on the above (dropout on/off) using MNIST
Dropout Regularization in Deep Learning Models With Keras
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
2 things I think can cause the problem.
First of all I would not recommend using dropout in first layer (that too 50%, use lower, in range 10-25% if you have to)) as when you use such a high dropout even higher level features are not learnt and propagated to deeper layers. Also try a range of dropouts from 10% to 50% and see how accuracy changes. There is no way to know beforehand what value will work
Secondly, you do not usually use dropout at inference. To fix that pass in keep_prob parameter of dropout as a placeholder and set it to 1 when inferencing.
Also, if the accuracy values you state are training accuracy then there may not even be much of a problem in first place as dropout will usually decrease training accuracy by small amounts as you are not overfitting, its the test/validation accuracy that needs to be closely monitored

Resources