Pytorch Repeating loss and AUC- When using cumulative loss - pytorch

I am using PyTorch to accumulate and add losses, and then implement backpropagation(loss.backward()) at the end.
At this time, the loss is not updated and remains almost the same, and the AUC repeats exactly the same. Are there any points I haven't considered when using cumulative losses?
Thank you so much for any reply. :)
Below is the loss calculation that occurs in one batch.
opt.zero_grad()
for s in range(len(qshft)):
for a in range(len(qshft[0])):
if(m[s][a]):
y_pred = (y[s][a] * one_hot(qshft[s].long(), self.num_q)).sum(-1)
y_pred = torch.masked_select(y_pred, m[s])
t = torch.masked_select(rshft[s], m[s])
loss += binary_cross_entropy(y_pred, t).clone().detach().requires_grad_(True)
count += 1
loss = torch.tensor(loss/count,requires_grad=True)
loss.backward()
opt.step()
loss_mean.append(loss.detach().cpu().numpy())

Your following operation of detach removes the computation graph, so the loss.backward() and opt.step() won't update your weights which results in repeating loss and AUC.
loss += binary_cross_entropy(y_pred, t).clone().detach().requires_grad_(True)
You can do
loss += binary_cross_entropy(y_pred, t)
and change
loss = torch.tensor(loss/count,requires_grad=True)
to
loss = loss/count
But make sure you reset count and loss to 0 every time you go into this part.

Related

Pytorch - Repeating Loss

I am new to PyTorch and I found a problem when displaying the loss of my model.
Pytorch Adam Optimizer - Model Loss Figure
Pytorch SGD Optimizer - Model Loss Figure
As you can see, the model seem to go up and down multiple times, with a recurrent pattern (the pattern starting to repeat at the begging of every epoch).
The full code can be found at: https://github.com/19valentin99/Kaggle/tree/main/Iris%20Flowers
in main_test.py (the # lines are the ones that I used to debug the code and the answer should be below).
When we just take the loss of the last element (or the loss over the
whole epoch) we will see a smooth decrease in loss
The reason your loss is smooth is because you are looking at the loss of the exact same batch on every iteration. Indeed your train data loader isn't shuffling your instance:
train2 = DataLoader(flowers_data_train, batch_size=BATCH_SIZE)
This means the same batch will appear last on every epoch. That's all there is to it, this doesn't mean the learning is different, it means you are looking at a part of the complete dataset loss.
The difference between "not working" and "working" is based of when the loss is recorded.
The idea is that: overall, the loss converges, but in this time until it converges it jumps up and down.
While it jumps up and down, we might see a pattern if we are sampling too often. The pattern is given by the data we use for training (as the data we use to train is the same every epoch - in batches).
As a result:
For the not-working version: I was recording the loss every epoch, after every batch.
For the working version: I was recording only the latest loss in the epoch.
Pytorch Adam Optimizer - Model Loss (working)
Pytorch SGD Optimizer - Model Loss (working)
Furthermore, I will attach the code which generates the non working version:
loss_list = []
for epoch in range(EPOCHS):
for idx, (x, y) in enumerate(train_load):
x, y = x.to(device), y.to(device)
#Compute Error
prediction = model(x)
#print(prediction, y)
loss = loss_fn(prediction, y)
#debuging
loss_list.append(loss.item())
##Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
plt.plot(loss_list)
plt.show()
The working code:
loss_list2 = np.zeros((EPOCHS,))
for epoch in range(EPOCHS):
for batch, (x, y) in enumerate(train_load):
x = x.to(device=device)
y = y.to(device=device)
y_pred = model(x)
loss = loss_fn(y_pred, y)
loss_list2[epoch] = loss.item()
# Zero gradients
optimizer.zero_grad()
loss.backward()
optimizer.step()
plt.plot(loss_list2)
plt.show()
In the end, I would like to mention that I know that there are a couple of other threads out there that say how to solve this problem (like: clip the gradients, remove the last batch, model is too simple to capture the data) but in the end, what I discovered is that it wasn't actually a problem but more "when the recording of the data is done".
I hope that this will help other people as well.

Pytorch weights not updating.. sometimes

Not sure what causes this, but sometimes I start training my neural net, and none of my weights update. This happens maybe 4 out of 5 times when I initialize my script. The other 1 time, it updates everything as expected and trains and predicts as expected. Does anyone have any idea why this happens? Started when I changed my loss function if that's relevant.
Here's the gross part of my training loop, let me know any other relevant code I should include.
def train(model, train_loader, test_loader, test_data, full_test, args, epochs, early_stop=5):
t0 = time()
optimizer = Adam(model.parameters(), lr=args.lr)
lr_decay = lr_scheduler.ExponentialLR(optimizer, gamma=args.lr_decay)
best_val_acc, best_mae = 0, 500
for epoch in range(epochs):
model.train()
ti = time()
training_loss = 0.0
for i, (x, y) in enumerate(train_loader):
x, y = Variable(x.cuda()), Variable(y.cuda())
y_pred = model(x, y)
loss = mae_loss(y, y_pred) + rmse_loss(y, y_pred)
loss.backward()
training_loss += loss.detach() * x.size(0)
optimizer.step()
optimizer.zero_grad()
lr_decay.step()
I believe by far the most likely issue is that your loss function is returning something incorrect. Try printing the first few losses to see what they are and ensure they are reasonable and the correct datatype and shape. One possible reason for the weights not updating if your losses seem ok is that the learning rate is too low for your losses and the weights are being changed by such a small amount that it is either rounded off or not apparent.

Trying to accumulate gradients in Pytorch, but getting RuntimeError when calling loss.backward

I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why since my code looks like all these other examples I've seen (unless I'm just missing something major):
https://stackoverflow.com/a/62076913/1227353
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different size, so I can't send the output batch and the label batch into the loss function; I have to iterate over them together. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
outputs_batch = model(inputs_batch)
# have to do this because labels can't be stacked into a tensor
for output, label in zip(outputs_batch, labels_batch):
output_scaled = interpolate(...) # make output match label size
loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
loss.backward()
if batch_idx % 2 == 1:
optimizer.step()
optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
outputs_batch = model(inputs_batch)
# CHANGE: we're gonna accumulate losses manually
batch_loss = 0
# have to do this because labels can't be stacked into a tensor
for output, label in zip(outputs_batch, labels_batch):
output_scaled = interpolate(...) # make output match label size
loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
batch_loss += loss # CHANGE: accumulate!
# CHANGE: do backprop outside for loop
batch_loss.backward()
if batch_idx % 2 == 1:
optimizer.step()
optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
EDIT:
The second code segment was getting an error due to the fact that I was doing torch.set_grad_enabled(phase == 'train') but I had forgotten to wrap the call to batch_loss.backward() with an if phase == 'train'... my bad
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feel equivalent to setting BATCH_SIZE as 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on different graphs entirely?
It seems you have two issues here, you said you couldn't have batch_size=8 because of memory limitations but later state that your labels are not of the same size. The latter seems much more important than the former. Anyway, I will try to answer your questions best I can.
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
You want to call .backward() on every loop cycle otherwise the batch will have no effect on the training. You can then call step() and zero_grad() only when batch_idx % 2 is True (i.e. for every other batch).
Here's an example which accumulates the gradient, not the loss:
model = nn.Linear(10, 3)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
ds = TensorDataset(torch.rand(100, 10), torch.rand(100, 3))
dl = DataLoader(ds, batch_size=4)
for i, (x, y) in enumerate(dl):
y_hat = model(x)
loss = F.l1_loss(y_hat, y) / 2
loss.backward()
if i % 2:
optim.step()
optim.zero_grad()
Note this approach is different to accumulating the loss, and back-propagating only all batches (or part of the batches) have gone through the network. In the example above we backpropagate every 4 datapoints and updating the model every 8 datapoints.
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
Usually torch's built-in losses have reduction='mean' set as default. This means the loss gets averaged over all batch elements that contributed to calculating the loss. So this will depend on your loss implementation.
However if you are using gradient accumalation, then yes you will need to average your loss by the number of accumulation steps (here loss = F.l1_loss(y_hat, y) / 2). Since your gradients will be accumulated twice.
To read more about this, I recommend taking a look at this other SO post.

Pytorch: Custom Loss involving Norm of End-to-End Jacobian

Cross posting from Pytorch discussion boards
I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).
I’ve implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss function (by doing two backwards passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian, but don’t include it in the loss function, the model also never learns. At a high level, my code does the following:
Forward pass to get predicted classes, yhat, from inputs x
Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
Jacobian norm = x.grad.data.norm(2)
Set loss equal to classification loss + scalar coefficient * jacobian norm
Run loss.backward()
I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.
Too much is required to produce a working example, so I’ve tried to extract the relevant code:
def train_model(model, train_dataloader, optimizer, loss_fn, device=None):
if device is None:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.train()
train_loss = 0
correct = 0
for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
batch_input, batch_target = batch_input.to(device), batch_target.to(device)
optimizer.zero_grad()
batch_input.requires_grad_(True)
model_batch_output = model(batch_input)
loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
train_loss += loss.item() # sum up batch loss
loss.backward()
optimizer.step()
and
def end_to_end_jacobian_loss(model_output, model_input):
model_output.backward(
torch.ones(*model_output.shape),
retain_graph=True)
jacobian = model_input.grad.data
jacobian_norm = jacobian.norm(2)
return jacobian_norm
Edit 1: I swapped my previous implementation with .backward() to autograd.grad and it apparently works! What's the difference?
def end_to_end_jacobian_loss(model_output, model_input):
jacobian = autograd.grad(
outputs=model_output['penultimate_layer'],
inputs=model_input,
grad_outputs=torch.ones(*model_output['penultimate_layer'].shape),
retain_graph=True,
only_inputs=True)[0]
jacobian_norm = jacobian.norm(2)
return jacobian_norm

How to deal with mini-batch loss in Pytorch?

I feed mini-batch data to model, and I just want to know how to deal with the loss. Could I accumulate the loss, then call the backward like:
...
def neg_log_likelihood(self, sentences, tags, length):
self.batch_size = sentences.size(0)
logits = self.__get_lstm_features(sentences, length)
real_path_score = torch.zeros(1)
total_score = torch.zeros(1)
if USE_GPU:
real_path_score = real_path_score.cuda()
total_score = total_score.cuda()
for logit, tag, leng in zip(logits, tags, length):
logit = logit[:leng]
tag = tag[:leng]
real_path_score += self.real_path_score(logit, tag)
total_score += self.total_score(logit, tag)
return total_score - real_path_score
...
loss = model.neg_log_likelihood(sentences, tags, length)
loss.backward()
optimizer.step()
I wonder that if the accumulation could lead to gradient explosion?
So, should I call the backward in loop:
for sentence, tag , leng in zip(sentences, tags, length):
loss = model.neg_log_likelihood(sentence, tag, leng)
loss.backward()
optimizer.step()
Or, use the mean loss just like the reduce_mean in tensorflow
loss = reduce_mean(losses)
loss.backward()
The loss has to be reduced by mean using the mini-batch size. If you look at the native PyTorch loss functions such as CrossEntropyLoss, there is a separate parameter reduction just for this and the default behaviour is to do mean on the mini-batch size.
We usually
get the loss by the loss function
(if necessary) manipulate the loss, for example do the class weighting and etc
calculate the mean loss of the mini-batch
calculate the gradients by the loss.backward()
(if necessary) manipulate the gradients, for example, do the gradient clipping for some RNN models to avoid gradient explosion
update the weights using the optimizer.step() function
So in your case, you can first get the mean loss of the mini-batch and then calculate the gradient using the loss.backward() function and then utilize the optimizer.step() function for the weight updating.

Resources