Memory bottleneck with autoregressive transformer decoding - pytorch

I am trying to train a transformer model for sequence modeling. Below is a standalone example:
import torch
import torch.nn as nn

criterion = nn.MSELoss()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=12)
memory = torch.rand(10, 32, 512)        # (source_len, batch, d_model)
y = torch.rand(20, 32, 512)             # (target_len, batch, d_model)
start_token = torch.ones((1, 32, 512))
tgt_input = torch.cat((start_token, y[:-1, :]), dim=0)   # targets shifted right by one
optimizer = torch.optim.Adam(transformer_decoder.parameters())

################### Teacher forced
while True:
    optimizer.zero_grad()
    out = transformer_decoder(tgt_input, memory,
                              nn.Transformer.generate_square_subsequent_mask(20))
    loss = criterion(out, y)
    print("loss: ", loss.item())
    loss.backward()
    optimizer.step()
For a 12-layer decoder, this works fine on a personal machine with 8 GB of memory. The model is autoregressive and works with shifted targets; since we feed the ground-truth targets as input above, I refer to this setting as "teacher forced".
However, at inference time we will not have targets to feed in this way, and we would need to condition on the outputs generated on the fly. That setting looks like this:
################### Non teacher forced
while True:
    optimizer.zero_grad()
    predictions = torch.ones((1, 32, 512))   # start token
    for i in range(1, 21):
        # decode the next position, conditioning on the outputs generated so far
        step_out = transformer_decoder(predictions[:i], memory,
                                       nn.Transformer.generate_square_subsequent_mask(i))[-1].unsqueeze(0)
        predictions = torch.cat((predictions, step_out), dim=0)
        print("i: ", i, "predictions.shape: ", predictions.shape)
    loss = criterion(predictions[1:], y)
    print("loss: ", loss.item())
    loss.backward()
    optimizer.step()
I wish to train the model with a hybrid strategy, partly with and partly without teacher forcing. However, the non-teacher-forced version throws an out-of-memory exception and doesn't work. For final inference (testing) it can usually be made to work with torch.no_grad(), but that is not an option during training. Can anyone explain exactly why this causes a memory bottleneck?

This is because of the unrolling of the computational graph. In the teacher-forced setting the decoder is fed ground-truth values at every position, so no gradient has to be propagated back through the model's own previous outputs and a single forward pass over the whole sequence is enough. In the non-teacher-forced setting each step conditions on previously generated outputs, so gradients have to flow back through every earlier decoding step: the computational graph, and all the activations it stores, accumulates across the 20 steps until loss.backward() is called, much like backpropagation through time in an RNN. That accumulation is what exhausts memory during training (and why torch.no_grad(), which builds no graph at all, works at inference).
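If you do need to train without teacher forcing, one possible workaround is to backpropagate each decoding step separately and detach the token that is fed back, so that only the current step's graph is alive at any time. This is only a sketch, and it assumes you are willing to truncate gradients through the autoregressive feedback path (similar to truncated backpropagation through time):

################### Non teacher forced, stepwise backward (sketch)
while True:
    optimizer.zero_grad()
    generated = torch.ones((1, 32, 512))   # start token; feedback buffer holds detached tokens only
    total_loss = 0.0
    for i in range(1, 21):
        mask = nn.Transformer.generate_square_subsequent_mask(i)
        out = transformer_decoder(generated, memory, mask)
        step_pred = out[-1]                              # prediction for position i
        step_loss = criterion(step_pred, y[i - 1])
        step_loss.backward()                             # frees this step's graph immediately
        total_loss += step_loss.item()
        generated = torch.cat((generated, step_pred.detach().unsqueeze(0)), dim=0)
    optimizer.step()                                     # gradients were accumulated over the 20 steps
    print("loss: ", total_loss / 20)

The peak activation memory is then roughly that of a single forward pass over the longest prefix, at the cost of no gradient flowing through the fed-back predictions.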

Related

Why is testing accuracy so low, could there be a bug in my code?

I've been training an image classification model: I first run object detection and then apply image classification to the detected images. I have 87 custom classes in my data (not ImageNet classes) and just over 7000 images altogether (around 60 images per class). I am happy with my object detection code and I think it works quite well; for classification, however, I have been using ResNet and AlexNet. I have tried AlexNet, ResNet18, ResNet50 and ResNet101 for training, but I am getting very low testing accuracies (around 10%) while my training accuracies are high for all models. I've also attempted regularisation and changing the learning rates, but I am not getting the higher accuracies (>80%) that I require. I wonder if there is a bug in my code, although I haven't been able to figure it out.
Here is my training code; I have also preprocessed the images in the way that PyTorch pretrained models expect:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

EPOCHS = 100

resnet = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50')
resnet.eval()
resnet.fc = nn.Linear(2048, 87)   # replace the ImageNet head with 87 classes
res_loss = nn.CrossEntropyLoss()
res_optimiser = optim.SGD(resnet.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

def train_model(model, loss_fn, optimiser, modelsavepath):
    for j in range(EPOCHS):
        running_loss = 0.0
        correct = 0
        total = 0
        for i, data in enumerate(training_generator, 0):
            model.train()
            inputs, labels, paths = data
            total += labels.size(0)
            optimiser.zero_grad()
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimiser.step()
            running_loss += loss.item()
        train_loss = running_loss / len(training_generator)
        train_acc = 100.0 * correct / total
        print("Epoch:{}/{} AVG Training Loss:{:.3f} AVG Training Acc {:.2f}% ".format(j + 1, EPOCHS, train_loss, train_acc))
    torch.save(model, modelsavepath)

train_model(resnet, res_loss, res_optimiser, 'resnet.pth')
Here is the testing code, used for a single image; it is part of a class:
self.model.eval()
outputs = self.model(img[None, ...])   # models expect batches, so give it a singleton batch
scores, predictions = torch.max(outputs, 1)
predictions = predictions.numpy()[0]
possible_scores = np.argmax(scores.detach().numpy())
Is there a bug in my code, either testing or training, or is my model just overfitting? Additionally, is there a better image classification model that I could try?
Your dataset is very small, so you're most likely overfitting. Try:
decrease learning rate (try 0.001, 0.0001, 0.00001)
increase weight_decay (try 1e-4, 1e-3, 1e-2)
if you don't already, use image augmentations (at least the default ones, like random crop and flip); see the sketch below.
Watch train/test loss curves when finetuning your model and stop training as soon as you see test accuracy going down while train accuracy goes up.
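For example, a minimal sketch of a training transform with the standard augmentations, assuming torchvision and the usual ImageNet normalisation that pretrained ResNets expect:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random crop, resized to the input size
    transforms.RandomHorizontalFlip(),       # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

Keep the test-time transform deterministic (resize/centre-crop plus the same normalisation) so the evaluation is comparable across epochs.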

Why does the accuracy of the CNN drop when I remove a filter and its associated weights?

The model architecture is Conv2D with 32 filters -> Flatten -> Dense -> Compile -> Fit
I deleted the last filter from the first layer, and the corresponding weights from the fully-connected layer, using:
w, b = model.layers[0].get_weights()
w = np.delete(w, [31], -1)    # drop the last (32nd) filter's kernels
b = np.delete(b, [31], 0)     # and its bias
w_2, b_2 = model.layers[2].get_weights()
w_2 = w_2[:20956, :]          # keep the first 26*26*31 = 20956 input weights of the dense layer
I use 20956 because the output of the first layer is 26 × 26 × 31, i.e. the 2-D spatial output multiplied by the number of remaining channels (26 × 26 × 31 = 20956).
I create a new model called model_1 using:
# Input stays the same
model_1 = Sequential()
# New modified conv layer
model_1.add(Conv2D(31, kernel_size=(3, 3),
activation='relu',
input_shape=input_shape,
kernel_initializer='he_normal'))
model_1.add(Flatten())
model_1.add(Dense(10, activation='softmax'))
model_1.layers[0].set_weights([w,b])
model_1.layers[2].set_weights([w_2,b_2])
model_1.compile(loss="categorical_crossentropy",
optimizer="Adam",
metrics=['accuracy'])
I can confirm that the weights are the same by checking model_1.layers[0].get_weights()[0] == model.layers[0].get_weights()[0][:,:,:,:31] and model_1.layers[2].get_weights()[0] == model.layers[2].get_weights()[0][:20956,:], both of which return True for every element.
When I do
score = model_1.evaluate(x_test_reshape, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
score = model.evaluate(x_test_reshape, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
The accuracy drops from 98% to 10%, any ideas why?
What you are essentially doing is removing a channel from the last convolutional layer. Intuitively it may sound like this is not a big deal and that the remaining 31 channels will still make the network perform well. In reality, all convolution channels interact with each other in the dense layer that follows; since that interaction is now missing one of the channels of information it was optimized on, its accuracy will drop.
Another way to think of this is to view your network as a function built from sequential steps that takes an image as input and produces a label with 98% accuracy as output. Removing a fraction (1/32) of the calculations in this function will change the outcome, and will likely give worse results, since the function was optimized with those calculations present. You are removing a part of the function that is apparently crucial to reaching the high accuracy.
You can test this by training your new model with 31 channels for a short time. Since the new model only needs to re-learn the function of the deleted channel, it should quickly reach the high performance again.
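For example, a minimal sketch, assuming the original training arrays are available as x_train_reshape and y_train (these names are placeholders, not from the code above):

# Briefly re-train the pruned model so the dense layer can adapt to the missing channel
model_1.fit(x_train_reshape, y_train,
            batch_size=128,
            epochs=2,                     # a short re-training is usually enough
            validation_data=(x_test_reshape, y_test))
score = model_1.evaluate(x_test_reshape, y_test)
print('Test accuracy after re-training:', score[1])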

Tensorflow - The prediction from the neural network usually goes to all 0s in classification

I'm new to deep learning and am currently working on a classification problem. I've implemented the final fully-connected layer and activation in TensorFlow as follows:
predictions = tf.layers.dense(attention_layer_output, nb_classes, name="Output_Layer")
predictions = tf.reduce_sum(predictions, axis = 0)
targets_raw_ = tf.nn.sigmoid(predictions)
targets_ = tf.round(targets_raw_)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels = self._targets, logits = predictions)
is_correct = tf.equal(targets_, self._targets)
self.accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
tf.summary.scalar('Accuracy', self.accuracy)
adam_opt = tf.train.AdamOptimizer(self._learning_rate)
self.optimizer = adam_opt.minimize(cross_entropy)
I've trained for the first 30 epochs, and the model almost always settles on one of the classes. When I tried to debug what happens in the above code with tf.Print, I found that the prediction is usually [0 0 0 0 0 0] in the case nb_classes = 6.
So the training accuracy usually sits around 83.33%, which means 5 out of 6 class outputs are counted as correct.
Do I have to change anything in the above code, or should I just keep training for more epochs?
What I understand from your description is that you have an imbalanced training set, i.e. not all classes are represented equally. In that case accuracy is not a good metric: if one class appears much more often than the others and the model always predicts that class, accuracy still looks good even though the model is useless. Have a look at AUC instead.
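For example, a minimal sketch of a per-class AUC check with scikit-learn, assuming you can fetch the sigmoid outputs (targets_raw_) and the 0/1 targets out of the session as NumPy arrays:

import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: shape (num_examples, nb_classes), 0/1 labels
# y_prob: shape (num_examples, nb_classes), sigmoid outputs
def per_class_auc(y_true, y_prob):
    return [roc_auc_score(y_true[:, c], y_prob[:, c]) for c in range(y_true.shape[1])]

A degenerate model whose scores carry no information about the classes will hover around 0.5 AUC, which makes the problem visible even when plain accuracy looks high.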

Pytorch: Custom Loss involving Norm of End-to-End Jacobian

Cross posting from Pytorch discussion boards
I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).
I’ve implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss function (by doing two backwards passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian, but don’t include it in the loss function, the model also never learns. At a high level, my code does the following:
Forward pass to get predicted classes, yhat, from inputs x
Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
Jacobian norm = x.grad.data.norm(2)
Set loss equal to classification loss + scalar coefficient * jacobian norm
Run loss.backward()
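In rough code, that procedure looks like this (a simplified sketch, not my actual code; model, x, target and coeff stand in for the real objects):

import torch
import torch.nn.functional as F

def training_step(model, x, target, coeff):
    x.requires_grad_(True)
    yhat = model(x)                                               # 1. forward pass
    yhat.backward(torch.ones_like(yhat), retain_graph=True)       # 2. backward through the outputs
    jacobian_norm = x.grad.norm(2)                                # 3. norm of the input gradient
    loss = F.cross_entropy(yhat, target) + coeff * jacobian_norm  # 4. combined loss
    loss.backward()                                               # 5. backward on the total loss
    return loss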
I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.
Too much is required to produce a working example, so I’ve tried to extract the relevant code:
def train_model(model, train_dataloader, optimizer, loss_fn, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    train_loss = 0
    correct = 0
    for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
        batch_input, batch_target = batch_input.to(device), batch_target.to(device)
        optimizer.zero_grad()
        batch_input.requires_grad_(True)
        model_batch_output = model(batch_input)
        loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
        train_loss += loss.item()  # sum up batch loss
        loss.backward()
        optimizer.step()
and
def end_to_end_jacobian_loss(model_output, model_input):
    model_output.backward(
        torch.ones(*model_output.shape),
        retain_graph=True)
    jacobian = model_input.grad.data
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
Edit 1: I swapped my previous .backward() implementation for autograd.grad and it apparently works! What's the difference?
def end_to_end_jacobian_loss(model_output, model_input):
    jacobian = autograd.grad(
        outputs=model_output['penultimate_layer'],
        inputs=model_input,
        grad_outputs=torch.ones(*model_output['penultimate_layer'].shape),
        retain_graph=True,
        only_inputs=True)[0]
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm

Udacity Deep Learning, Assignment 3, Part 3: Tensorflow dropout function

I am now on assignment 3 of the Udacity Deep Learning class. I have most of it completed and working, but I noticed that problem 3, which is about using dropout with TensorFlow, seems to degrade my performance rather than improve it.
So I think I'm doing something wrong. I'll put my full code here. If someone can explain how to use dropout properly, I'd appreciate it. (Or confirm that I'm using it correctly and it just isn't helping in this case.) It drops accuracy from over 94% (without dropout) down to 91.5%. If you aren't using L2 regularization, the degradation is even larger.
def create_nn(dataset, weights_hidden, biases_hidden, weights_out, biases_out):
    # Original layer
    logits = tf.add(tf.matmul(dataset, weights_hidden), biases_hidden)
    # Drop out layer 1
    logits = tf.nn.dropout(logits, 0.5)
    # Hidden ReLU layer
    logits = tf.nn.relu(logits)
    # Drop out layer 2
    logits = tf.nn.dropout(logits, 0.5)
    # Output: connect hidden layer to a node for each class
    logits = tf.add(tf.matmul(logits, weights_out), biases_out)
    return logits

# Create model
batch_size = 128
hidden_layer_size = 1024
beta = 1e-3

graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.
    weights_hidden = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_layer_size]))
    biases_hidden = tf.Variable(tf.zeros([hidden_layer_size]))
    weights_out = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels]))
    biases_out = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    logits = create_nn(tf_train_dataset, weights_hidden, biases_hidden, weights_out, biases_out)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    loss += beta * (tf.nn.l2_loss(weights_hidden) + tf.nn.l2_loss(weights_out))

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)

num_steps = 10000

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
You would need to turn off dropout during inference. It may not be obvious at first, but the fact that dropout is hardcoded in the NN architecture means it will affect the test data during inference. You can avoid this by creating a placeholder keep_prob, rather than providing the value 0.5 directly. For example:
keep_prob = tf.placeholder(tf.float32)
logits = tf.nn.dropout(logits, keep_prob)
To turn on dropout during training, set the keep_prob value to 0.5:
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
During inference/evaluation, you should be able to do something like this to set keep_prob to 1.0 in eval:
accuracy.eval(feed_dict={x: test_prediction, y_: test_labels, keep_prob: 1.0})
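For instance, here is a sketch of the question's create_nn with keep_prob threaded through as an extra argument (illustrative only; the training feed_dict would then pass 0.5 and the evaluation feed_dict 1.0):

def create_nn(dataset, weights_hidden, biases_hidden, weights_out, biases_out, keep_prob):
    # Hidden layer
    logits = tf.add(tf.matmul(dataset, weights_hidden), biases_hidden)
    logits = tf.nn.dropout(logits, keep_prob)    # keep_prob is now a placeholder, not a constant
    logits = tf.nn.relu(logits)
    logits = tf.nn.dropout(logits, keep_prob)
    # Output: connect hidden layer to a node for each class
    logits = tf.add(tf.matmul(logits, weights_out), biases_out)
    return logits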
EDIT:
Since the issue does not seem to be that dropout is used at inference, the next likely culprit is that the dropout rate is too high for this network size. You could try decreasing the dropout to 20% (i.e. keep_prob=0.8), or increasing the size of the network to give the model a chance to learn the representations.
I actually gave it a try with your code, and I'm getting around ~93.5% with 20% dropout at this network size. I have added some additional resources below, including the original dropout paper, to help clarify the intuition behind it and to give more tips for using dropout, such as increasing the learning rate.
References:
Deep MNIST for Experts: has an example on the above (dropout on/off) using MNIST
Dropout Regularization in Deep Learning Models With Keras
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Two things, I think, can cause the problem.
First of all, I would not recommend using dropout in the first layer (and certainly not 50%; use a lower rate, in the 10-25% range, if you have to), because with such a high dropout even the higher-level features are not learnt and propagated to the deeper layers. Also try a range of dropout rates from 10% to 50% and see how the accuracy changes; there is no way to know beforehand which value will work.
Secondly, you do not usually use dropout at inference. To fix that, pass the keep_prob parameter of dropout in as a placeholder and set it to 1 when inferencing.
Also, if the accuracy values you state are training accuracies, there may not even be much of a problem in the first place: dropout will usually decrease training accuracy by a small amount because you are no longer overfitting; it's the test/validation accuracy that needs to be closely monitored.
