Pytorch: Custom Loss involving Norm of End-to-End Jacobian

Cross posting from Pytorch discussion boards
I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).
I’ve implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss function (by doing two backwards passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian, but don’t include it in the loss function, the model also never learns. At a high level, my code does the following:
1. Forward pass to get predicted classes, yhat, from inputs x
2. Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
3. Jacobian norm = x.grad.data.norm(2)
4. Set loss equal to classification loss + scalar coefficient * jacobian norm
5. Run loss.backward()
I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.
Too much is required to produce a working example, so I’ve tried to extract the relevant code:
def train_model(model, train_dataloader, optimizer, loss_fn, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    train_loss = 0
    correct = 0
    for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
        batch_input, batch_target = batch_input.to(device), batch_target.to(device)
        optimizer.zero_grad()
        batch_input.requires_grad_(True)
        model_batch_output = model(batch_input)
        loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
        train_loss += loss.item()  # sum up batch loss
        loss.backward()
        optimizer.step()
and
def end_to_end_jacobian_loss(model_output, model_input):
    model_output.backward(
        torch.ones(*model_output.shape),
        retain_graph=True)
    jacobian = model_input.grad.data
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
Edit 1: I swapped my previous implementation with .backward() to autograd.grad and it apparently works! What's the difference?
def end_to_end_jacobian_loss(model_output, model_input):
    jacobian = autograd.grad(
        outputs=model_output['penultimate_layer'],
        inputs=model_input,
        grad_outputs=torch.ones(*model_output['penultimate_layer'].shape),
        retain_graph=True,
        only_inputs=True)[0]
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
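A likely explanation, offered as a reading of the code above rather than a confirmed diagnosis: .backward() accumulates into the .grad of every leaf tensor, including the model parameters, so the first version adds the gradient of the summed outputs to the parameter gradients before loss.backward() runs. In addition, .grad.data carries no graph, so the penalty term cannot contribute any gradient. autograd.grad returns the gradient without touching the parameters' .grad, which would explain why training recovers. Note that without create_graph=True the returned tensor is still a constant with respect to the weights. The sketch below uses a toy model and a hypothetical helper name (input_gradient_penalty), and passes create_graph=True so the penalty is differentiable. Also note that passing a vector of ones yields the gradient of the summed outputs with respect to the input, not the full Jacobian.

```python
import torch
from torch import autograd, nn


def input_gradient_penalty(model_output, model_input):
    # Vector-Jacobian product with a vector of ones: this is the gradient of
    # model_output.sum() w.r.t. the input, not the full Jacobian matrix.
    (grad_input,) = autograd.grad(
        outputs=model_output,
        inputs=model_input,
        grad_outputs=torch.ones_like(model_output),
        create_graph=True,  # keep the graph so the penalty itself is differentiable
    )
    return grad_input.norm(2)


torch.manual_seed(0)
# Toy model and data, chosen only for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
x = torch.randn(5, 4, requires_grad=True)
y = torch.randint(0, 3, (5,))

out = model(x)
penalty = input_gradient_penalty(out, x)
loss = nn.functional.cross_entropy(out, y) + 0.1 * penalty  # 0.1 is an arbitrary coefficient
loss.backward()  # gradients now include the contribution of the penalty term
```

Because the penalty has a grad_fn, a single loss.backward() call covers both terms, and no separate backward pass is needed.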

Related

torch.nn.CrossEntropyLoss over Multiple Batches

I am currently working with torch.nn.CrossEntropyLoss. As far as I know, it is common to compute the loss batch-wise. However, is there a possibility to compute the loss over multiple batches?
More concretely, assume we are given the data
import torch
features = torch.randn(no_of_batches, batch_size, feature_dim)
targets = torch.randint(low=0, high=10, size=(no_of_batches, batch_size))
loss_function = torch.nn.CrossEntropyLoss()
Is there a way to compute in one line
loss = loss_function(features, targets) # raises RuntimeError: Expected target size [no_of_batches, feature_dim], got [no_of_batches, batch_size]
?
Thank you in advance!
You can compute multiple cross-entropy losses but you'll need to do your own reduction. Since cross-entropy loss assumes the feature dim is always the second dimension of the features tensor you will also need to permute it first.
loss_function = torch.nn.CrossEntropyLoss(reduction='none')
loss = loss_function(features.permute(0,2,1), targets).mean(dim=1)
which will result in a loss tensor with no_of_batches entries.
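A small check of the shapes involved, with made-up sizes (3 batches of 4 samples, 10 classes). It compares the permuted one-call version against a plain loop over batches using the default mean reduction.

```python
import torch

# Hypothetical sizes for illustration only.
no_of_batches, batch_size, feature_dim = 3, 4, 10
torch.manual_seed(0)
features = torch.randn(no_of_batches, batch_size, feature_dim)
targets = torch.randint(low=0, high=feature_dim, size=(no_of_batches, batch_size))

# One call: move the class dimension to position 1, keep per-element losses,
# then average over the samples of each batch.
per_element = torch.nn.CrossEntropyLoss(reduction='none')
loss = per_element(features.permute(0, 2, 1), targets).mean(dim=1)

# Reference: one default (mean-reduced) loss per batch, computed in a loop.
default_loss = torch.nn.CrossEntropyLoss()
reference = torch.stack([default_loss(f, t) for f, t in zip(features, targets)])

print(loss.shape)
```

The two results should agree up to floating-point error, with one loss value per batch.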

Pytorch - Repeating Loss

I am new to PyTorch and I found a problem when displaying the loss of my model.
Pytorch Adam Optimizer - Model Loss Figure
Pytorch SGD Optimizer - Model Loss Figure
As you can see, the loss seems to go up and down multiple times, with a recurrent pattern (the pattern starts to repeat at the beginning of every epoch).
The full code can be found at: https://github.com/19valentin99/Kaggle/tree/main/Iris%20Flowers
in main_test.py (the # lines are the ones that I used to debug the code and the answer should be below).
When we just take the loss of the last element (or the loss over the whole epoch) we will see a smooth decrease in loss
Your loss looks smooth because you are looking at the loss of the exact same batch at the end of every epoch. Indeed, your train data loader isn't shuffling the instances:
train2 = DataLoader(flowers_data_train, batch_size=BATCH_SIZE)
This means the same batch will appear last in every epoch. That's all there is to it: the learning is not different, you are just looking at one part of the complete dataset loss.
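To see this on a toy dataset (ten numbers, not the Iris data from the question), the sketch below collects the last batch of each epoch from an unshuffled DataLoader. The helper name last_batches is made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: the values 0..9 as a column vector.
data = TensorDataset(torch.arange(10).float().unsqueeze(1))


def last_batches(loader, epochs=3):
    # Collect the final batch seen in each epoch.
    result = []
    for _ in range(epochs):
        for (batch,) in loader:
            last = batch
        result.append(last)
    return result


# shuffle defaults to False, as in the question's DataLoader call.
fixed = last_batches(DataLoader(data, batch_size=4))
same_every_epoch = all(torch.equal(fixed[0], b) for b in fixed)
print(same_every_epoch)
```

Passing shuffle=True to the DataLoader reorders the samples each epoch, so the last batch would generally differ between epochs.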
The difference between "not working" and "working" comes down to when the loss is recorded.
Overall the loss converges, but until it does, it jumps up and down.
While it jumps up and down, we may see a pattern if we sample too often. The pattern comes from the training data, which is the same every epoch and arrives in the same batches.
As a result:
For the not-working version: I was recording the loss every epoch, after every batch.
For the working version: I was recording only the latest loss in the epoch.
Pytorch Adam Optimizer - Model Loss (working)
Pytorch SGD Optimizer - Model Loss (working)
Furthermore, I will attach the code which generates the non working version:
loss_list = []
for epoch in range(EPOCHS):
    for idx, (x, y) in enumerate(train_load):
        x, y = x.to(device), y.to(device)
        # Compute error
        prediction = model(x)
        # print(prediction, y)
        loss = loss_fn(prediction, y)
        # debugging: record the loss after every batch
        loss_list.append(loss.item())
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
plt.plot(loss_list)
plt.show()
The working code:
loss_list2 = np.zeros((EPOCHS,))
for epoch in range(EPOCHS):
    for batch, (x, y) in enumerate(train_load):
        x = x.to(device=device)
        y = y.to(device=device)
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss_list2[epoch] = loss.item()
        # Zero gradients
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
plt.plot(loss_list2)
plt.show()
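The "working" version keeps only the last batch's loss per epoch. A possible alternative, sketched here on synthetic data with a toy linear model (none of it taken from the linked repository), is to record the sample-weighted mean loss over the whole epoch, which does not depend on which batch happens to come last.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Synthetic stand-in data: 30 samples, 4 features, 3 classes.
X, y = torch.randn(30, 4), torch.randint(0, 3, (30,))
loader = DataLoader(TensorDataset(X, y), batch_size=8)
model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

EPOCHS = 5
epoch_losses = []
for epoch in range(EPOCHS):
    running, seen = 0.0, 0
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size so the smaller last batch
        # does not count as much as a full one.
        running += loss.item() * xb.size(0)
        seen += xb.size(0)
    epoch_losses.append(running / seen)  # one value per epoch
```

Plotting epoch_losses then gives one point per epoch, like loss_list2 above, but averaged over all batches.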
In the end, I would like to mention that I know that there are a couple of other threads out there that say how to solve this problem (like: clip the gradients, remove the last batch, model is too simple to capture the data) but in the end, what I discovered is that it wasn't actually a problem but more "when the recording of the data is done".
I hope that this will help other people as well.

Pytorch weights not updating.. sometimes

Not sure what causes this, but sometimes I start training my neural net, and none of my weights update. This happens maybe 4 out of 5 times when I initialize my script. The other 1 time, it updates everything as expected and trains and predicts as expected. Does anyone have any idea why this happens? Started when I changed my loss function if that's relevant.
Here's the gross part of my training loop, let me know any other relevant code I should include.
def train(model, train_loader, test_loader, test_data, full_test, args, epochs, early_stop=5):
    t0 = time()
    optimizer = Adam(model.parameters(), lr=args.lr)
    lr_decay = lr_scheduler.ExponentialLR(optimizer, gamma=args.lr_decay)
    best_val_acc, best_mae = 0, 500
    for epoch in range(epochs):
        model.train()
        ti = time()
        training_loss = 0.0
        for i, (x, y) in enumerate(train_loader):
            x, y = Variable(x.cuda()), Variable(y.cuda())
            y_pred = model(x, y)
            loss = mae_loss(y, y_pred) + rmse_loss(y, y_pred)
            loss.backward()
            training_loss += loss.detach() * x.size(0)
            optimizer.step()
            optimizer.zero_grad()
        lr_decay.step()
I believe by far the most likely issue is that your loss function is returning something incorrect. Try printing the first few losses to see what they are and ensure they are reasonable and the correct datatype and shape. One possible reason for the weights not updating if your losses seem ok is that the learning rate is too low for your losses and the weights are being changed by such a small amount that it is either rounded off or not apparent.
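One way to carry out these checks, sketched with a toy linear model and a mean-squared-error placeholder instead of the question's mae_loss and rmse_loss (which are not shown): print the loss value, dtype and shape, inspect the gradient norms, and compare the parameters before and after optimizer.step().

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)  # toy model, stands in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 3), torch.randn(8, 1)

# Snapshot the parameters so they can be compared after the update.
before = [p.detach().clone() for p in model.parameters()]

loss = ((model(x) - y) ** 2).mean()  # placeholder loss for illustration
print(loss.item(), loss.dtype, loss.shape)  # expect a finite float scalar

loss.backward()
# Zero or NaN gradient norms would point at the loss function or the graph.
grad_norms = [p.grad.norm().item() for p in model.parameters()]
print(grad_norms)

optimizer.step()
optimizer.zero_grad()

# True for every parameter tensor that actually moved.
changed = [not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())]
print(changed)
```

If the gradient norms are non-zero but changed stays False, the learning rate is the next thing to look at.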

How to train a CNN model?

When trying to train the CNN model, I came across a code shown below:
def train(n_epochs, loaders, model, optimizer, criterion):
    for epoch in range(1, n_epochs):
        train_loss = 0
        valid_loss = 0
        model.train()
        for i, (data, target) in enumerate(loaders['train']):
            # zero the parameter (weight) gradients
            optimizer.zero_grad()
            # forward pass to get outputs
            output = model(data)
            # calculate the loss
            loss = criterion(output, target)
            # backward pass to calculate the parameter gradients
            loss.backward()
            # update the parameters
            optimizer.step()
Can someone please tell me why the second for loop is used?
i.e., for i, (data,target) in enumerate(loaders['train']):
And why are optimizer.zero_grad() and optimizer.step() used?
torch.utils.data.DataLoader comes in handy when you need to prepare data batches (and perhaps shuffle them before every run).
data_train_loader = DataLoader(data_train, batch_size=64, shuffle=True)
In the above code, first for-loop iterates through the number of epochs while second loop iterates through the training dataset converted into batches via above code. For example:
for batch_idx, samples in enumerate(data_train_loader):
    # samples will be a 64 x D dimensional tensor
    # batch_idx is each batch index
    pass
Learn more about torch.utils.data.DataLoader in the PyTorch documentation.
optimizer.zero_grad(): Before the backward pass, use the optimizer object to zero all of the gradients for the tensors it will update (which are the learnable weights of the model).
optimizer.step(): We generally use optimizer.step() to make the gradient descent step. Calling the step function on an Optimizer makes an update to its parameters.
Learn more about these in the torch.optim documentation.
The optimizer is first constructed with the model's parameters, like this (missing in your code):
optimizer = optim.Adam(model.parameters(), lr=0.001)
This code
loss = criterion(output, target)
calculates the loss of a single batch, where target comes from the tuple (data, target) and data is the input to the model that produced output.
This step:
optimizer.zero_grad()
zeroes the gradients of all parameters held by the optimizer. This matters on every iteration, not only at initialization, because PyTorch accumulates gradients across backward() calls by default.
The part
loss.backward()
calculates the gradients, and optimizer.step() updates the model's weights and biases (parameters).
In PyTorch you typically use the DataLoader class to load the training and validation sets.
loaders['train']
is probably the DataLoader over the full training set; one pass over it is a single epoch.
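A short illustration of why the gradients are reset on each iteration, using a single hand-made tensor instead of a model: calling backward() twice without clearing adds the second gradient to the first.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # gradient of sum(w^2) is 2*w

(w ** 2).sum().backward()     # no reset in between
accumulated = w.grad.clone()  # the second gradient was added to the first

w.grad.zero_()                # manual reset, the role optimizer.zero_grad() plays for model parameters
(w ** 2).sum().backward()
fresh = w.grad.clone()        # matches the first gradient again

print(first, accumulated, fresh)
```

Without the reset, each optimizer.step() would act on the sum of all gradients computed so far instead of the gradient of the current batch.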

How to deal with mini-batch loss in Pytorch?

I feed mini-batch data to model, and I just want to know how to deal with the loss. Could I accumulate the loss, then call the backward like:
...
def neg_log_likelihood(self, sentences, tags, length):
    self.batch_size = sentences.size(0)
    logits = self.__get_lstm_features(sentences, length)
    real_path_score = torch.zeros(1)
    total_score = torch.zeros(1)
    if USE_GPU:
        real_path_score = real_path_score.cuda()
        total_score = total_score.cuda()
    for logit, tag, leng in zip(logits, tags, length):
        logit = logit[:leng]
        tag = tag[:leng]
        real_path_score += self.real_path_score(logit, tag)
        total_score += self.total_score(logit, tag)
    return total_score - real_path_score
...
loss = model.neg_log_likelihood(sentences, tags, length)
loss.backward()
optimizer.step()
I wonder whether the accumulation could lead to gradient explosion.
So, should I call backward in a loop:
for sentence, tag, leng in zip(sentences, tags, length):
    loss = model.neg_log_likelihood(sentence, tag, leng)
    loss.backward()
    optimizer.step()
Or use the mean loss, like reduce_mean in TensorFlow:
loss = reduce_mean(losses)
loss.backward()
The loss should be averaged over the mini-batch. If you look at the native PyTorch loss functions such as CrossEntropyLoss, there is a separate parameter, reduction, just for this, and the default, reduction='mean', averages over the mini-batch.
We usually:
1. get the loss from the loss function
2. (if necessary) manipulate the loss, for example apply class weighting
3. calculate the mean loss of the mini-batch
4. calculate the gradients with loss.backward()
5. (if necessary) manipulate the gradients, for example apply gradient clipping for some RNN models to avoid gradient explosion
6. update the weights using the optimizer.step() function
So in your case, you can first get the mean loss of the mini-batch and then calculate the gradient using the loss.backward() function and then utilize the optimizer.step() function for the weight updating.
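The steps above can be sketched as follows. This uses a toy linear classifier with cross-entropy instead of the CRF negative log-likelihood from the question, and a deliberately small, arbitrary max_norm so that the clipping is visible.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(5, 2)  # toy model, stands in for the LSTM-CRF
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(16, 5), torch.randint(0, 2, (16,))

# Per-sample losses, then the mean over the mini-batch.
per_sample = nn.functional.cross_entropy(model(inputs), targets, reduction='none')
loss = per_sample.mean()

optimizer.zero_grad()
loss.backward()

# Optional: rescale the gradients so their total norm is at most max_norm.
max_norm = 0.1  # arbitrary value chosen for this example
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(float(norm_before), float(norm_after))
```

Averaging keeps the gradient scale independent of the batch size, whereas summing per-sentence scores, as in the question's neg_log_likelihood, makes it grow with the number of sentences in the batch.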
