I am new to PyTorch. May I ask what the difference is between adding loss.item() or not? Here are the two pieces of code:
for epoch in range(epochs):
    trainingloss = 0
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X, n)
        criterion.backward()
        optimizer.step()
        trainingloss += criterion.item()
and this:
for epoch in range(epochs):
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X, n)
        criterion.backward()
        optimizer.step()
If anyone has any idea, please help. Thank you very much.
Calling loss.item() gives you the loss value as a plain Python number, detached from the computation graph that PyTorch creates (this is what .item() does for PyTorch tensors).
If you add the line trainingloss += criterion.item() at the end of each "batch loop", this will keep track of the batch loss throughout the iteration by incrementally adding the loss for each minibatch in your training set. This is necessary since you are using minibatches - the loss for each minibatch will not be equal to the loss over all the batches.
Note: If you use PyTorch tensors outside the optimization loop, e.g. in a different scope, which could happen if you call something like return loss, it is crucial that you call .item() on any tensors that are part of the computation graph (as a general rule of thumb, any outputs/losses/models that interact with PyTorch methods will likely be part of your computation graph). If you don't, the computation graph cannot be de-allocated from Python memory, which can lead to CPU/GPU memory leaks. What you have above looks correct, though!
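For example, a minimal sketch of the return-value case mentioned above (train_step and its arguments are hypothetical names, not from your code):

def train_step(model, inputs, targets, criterion, optimizer):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()  # a plain Python float, so the graph can be freed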
Also, in the future, PyTorch's DataLoader class can help you handle minibatches with less boilerplate code: it loops over your dataset so that each item you loop over is a training batch, i.e. you don't need two for loops in your optimization.
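For instance, a minimal sketch, assuming your samples lie along the first dimension of a tensor X (the surrounding names are placeholders for your own setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(X)  # wrap your data tensor
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    trainingloss = 0.0
    for (batch,) in loader:  # each item is already a shuffled minibatch
        optimizer.zero_grad()
        criterion = loss(batch)  # compute your loss here, as in your code
        criterion.backward()
        optimizer.step()
        trainingloss += criterion.item()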
I hope you enjoy learning/using PyTorch!
In your training loop, the criterion.backward() part computes the gradient of each trainable parameter along the feed-forward path, and the optimizer.step() part then updates the parameters based on the computed gradients and the optimization technique. At the end of this step, the training of the model for that particular batch is finished, and the trainingloss += criterion.item() part is only there to track and monitor the loss values for each step of training.
Related
I'm calculating two losses: one per batch, and one per epoch at the end of the batch loop. When I try to sum these two losses I get the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 49]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I have my reasons for summing these two losses.
The general idea of the code is something like this:
loss_epoch = 0  # it's zero in the first epoch
for epoch in epochs:
    for batch in batches:
        optimizer.zero_grad()
        loss_batch = criterion_batch(output_batch, target_batch)
        loss = loss_batch + loss_epoch  # adds zero in the first epoch
        loss.backward()
        optimizer.step()
    loss_epoch = criterion_epoch(output_epoch, target_epoch)
I get that the problem is that I'm modifying the gradient when I calculate another loss at the end of the first loop (the loop that goes through the batches), but I couldn't solve this problem.
It also might have something to do with the order of the operations (loss calculation, backward, zero_grad, step).
I need to calculate the loss_epoch at the end of the batch loop because I'm using the entire dataset to calculate this loss.
Assuming that you do not want to backpropagate the epoch_loss through every forward pass for the entire dataset (which of course would be computationally infeasible for a dataset of any non-trivial size), you could detach the epoch_loss and essentially add it as a scalar which is updated once per epoch. Not entirely sure if this is the behavior you want though.
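A rough sketch of that suggestion, mirroring the pseudocode from the question (assuming loss_epoch should act as a constant within each epoch):

loss_epoch = torch.tensor(0.0)  # zero in the first epoch, carries no graph
for epoch in epochs:
    for batch in batches:
        optimizer.zero_grad()
        loss_batch = criterion_batch(output_batch, target_batch)
        loss = loss_batch + loss_epoch  # the epoch term is just a scalar here
        loss.backward()
        optimizer.step()
    # detach so the next epoch's backward() never touches this epoch's graph
    loss_epoch = criterion_epoch(output_epoch, target_epoch).detach()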
In order to mimic a larger batch size, I want to be able to accumulate gradients every N batches for a model in PyTorch, like:
def train(model, optimizer, dataloader, num_epochs, N):
    for epoch_num in range(1, num_epochs+1):
        for batch_num, data in enumerate(dataloader):
            ims = data.to('cuda:0')
            loss = model(ims)
            loss.backward()
            if batch_num % N == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
For this approach, do I need to add the flag retain_graph=True, i.e.
loss.backward(retain_graph=True)
In this manner, are the gradients from each backward() call simply summed per parameter?
You need to set retain_graph=True if you want to make multiple backward passes over the same computational graph, making use of the intermediate results from a single forward pass. This would have been the case, for instance, if you called loss.backward() multiple times after computing loss once, or if you had multiple losses from different parts of the graph to backpropagate from (a good explanation can be found here).
In your case, for each forward pass, you backpropagate exactly once. So you don't need to store the intermediate results from the computational graph once the gradients are computed.
In short:
Intermediate outputs in the graph are cleared after a backward pass, unless explicitly preserved using retain_graph=True.
Gradients accumulate by default, unless explicitly cleared using zero_grad.
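To make the distinction concrete, here is a small self-contained example where retain_graph=True really is needed, because two backward passes share a single forward pass:

import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()  # one forward pass
loss1 = 2 * y
loss2 = 3 * y

loss1.backward(retain_graph=True)  # keep intermediates: loss2 still needs them
loss2.backward()                   # gradients from both calls sum into x.grad

In your loop, by contrast, each backward() follows its own forward pass, so a plain loss.backward() per batch is enough, and param.grad accumulates until zero_grad clears it.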
I'm trying to update the weights of a model during training only for those batches in which the loss is smaller than that obtained in the previous batch.
So, in the batch loop, I store the loss obtained at each iteration, and then I have tried evaluating a condition: if the loss at time t-1 is smaller than that at time t, then I proceed as follows:
if loss[t-1] <= loss[t]:
    loss.backward()
    optimizer.step()
else:
    # do nothing, or what?
Then, nothing should be done in the else part. Nonetheless, I get an error saying CUDA is running out of memory.
Of course, before computing the loss, I execute an optimizer.zero_grad() statement.
The for loop that runs over batches seems to run fine, but memory usage blows up. I read that setting the gradients to None might prevent the weight-update process, and I have tried several statements (output.clone().detach(), and also optimizer.zero_grad(set_to_none=True)), but I'm not sure they work; I think they did not. Either way, the memory usage explosion still occurs.
Is there a way to get this done?
This is a common problem when storing losses from consecutive steps.
The out-of-memory error is caused by storing the losses in a list: the computational graphs remain in memory as long as you keep a reference to your losses. An easy fix is to detach the tensor when you append it to the list:
# loss = loss_fn(...)
losses.append(loss.detach())
Then you can work with
if losses[t] <= losses[t-1]:  # current loss is smaller
    losses[t].backward()
    optimizer.step()
else:
    pass
Storing the loss in a list keeps the whole graph for that batch alive for each element of losses. Instead, you can do the following:
losses.append(loss.cpu().tolist())
optimizer.zero_grad()
if losses[-1] <= losses[-2]:  # current loss is smaller
    loss.backward()
    optimizer.step()
As you only update the model if the current loss is smaller than the previous one, you don't actually need to store all the losses: the last value and the previous value are enough. Otherwise, if you want to store a finite number of graphs, you need to be careful about your available memory, which is quite limited in many applications.
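For instance, a sketch of that minimal bookkeeping, with compute_loss as a hypothetical stand-in for your forward pass and loss computation:

prev_loss = float('inf')  # so the very first batch always triggers an update
for batch in loader:
    optimizer.zero_grad()
    loss = compute_loss(batch)  # hypothetical: forward pass + criterion
    current = loss.item()       # plain float, keeps no graph alive
    if current <= prev_loss:    # current loss is smaller
        loss.backward()
        optimizer.step()
    prev_loss = current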
I am a beginner with PyTorch and have the following problem.
I want to optimize a complex problem that uses torch.min() multiple times, but even with a simple toy example I can't get it to work. My code has these lines:
output = net(input)
loss = g(output)
loss = torch.min(loss, torch.ones(1))
loss.backward()
To minimize this loss, the net should ensure that its output minimizes g: R^2 -> R. Here g is a very simple function that has a zero at {-1, -2}, and if I delete the third line the neural network finds the solution just fine. However, with the posted code and bad initial weights, the minimum is attained by the constant 1. This leads to backward() not updating the weights, and no learning happens at all over arbitrarily many epochs.
Is there a way to detect/fix this behaviour in more complex cases? My task uses the minimum function multiple times and in more complex ways, so I think it would be quite hard to track every single one and make sure that learning actually takes place.
Edit: If I restart the optimization multiple times, it rarely happens that it works just fine (e.g. converges to {-1, -2}). My interpretation is that in those cases the initial weights randomly lead to the minimum being attained in the first argument.
Rewrite your code:
output = net(input)
loss_fn = torch.min
y_hat = g(output)
y = torch.ones(1)
loss = loss_fn(y_hat, y)
loss.backward()
You wrote:
This leads to the backward() function not updating the weights and no learning happening at all over arbitrary many epochs.
Once we calculate the loss, we call loss.backward(), which computes the gradients automatically. The gradients are needed in the next phase, when we use optimizer.step() to improve the model parameters (weights).
You need to define the optimizer, something like this:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
Since a PyTorch training loop can be 100 lines of code, I will just reference some material here that is perfect for beginners.
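As a side note on the behaviour described in the question: torch.min routes the gradient only to the smaller of its two inputs, so once the constant 1 "wins", zero gradient reaches the network. A tiny check of my own that can detect this:

import torch

x = torch.tensor([2.0], requires_grad=True)
out = torch.min(x, torch.ones(1))  # the constant 1 is selected here
out.backward()
print(x.grad)  # tensor([0.]) -> no learning signal flows back to x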
In PyTorch, we can call zero_grad() to clear the gradients. Do we have a similar function in Keras, so that we can achieve the same thing? For example, I want to accumulate gradients across some batches.
In a custom training loop, it is easy to achieve:
...
# this is a glance of your custom training loop
# assume a `flag` has been defined to control the behavior
# assume a `buf = []` has been defined to hold accumulated grads
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
if flag:  # do not accumulate grads this step
    _grads = some_func(buf)  # deal with the accumulated grads in buf
    buf = []  # clear buf
    optimizer.apply_gradients(zip(_grads, model.trainable_variables))
else:  # accumulate grads
    buf.append(grads)
...
With the high-level Keras API (model.compile(), model.fit()), I have no idea, because I use both TF2 and PyTorch and prefer a custom training loop, which is an easier way to narrow the distance between the two.
In PyTorch, gradients are accumulated for every variable, and the loss value is distributed among them all. The optimizer is then in charge of updating the model parameters (specified at initialization), and since the gradient values are kept in memory, you have to zero them at the start of each step.
optimizer = torch.optim.Adam(itertools.chain(*param_list), lr=opt.lr, ...)
...
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
In Keras with gradient tapes, you wrap the bunch of operations for whose variables you want to compute gradients. You call the gradient method on the tape to compute the updates, passing the loss value and the variables for which the gradients must be computed. The optimizer then applies a single update to each parameter (over the entire list of updates and params you specified).
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
You can use the .fit() method instead, which does all of that under the hood.
If your aim is to accumulate the update multiple times, there is no standard method in Keras, but you can do it more easily with tapes by accumulating the update values before applying them (see https://www.tensorflow.org/api_docs/python/tf/GradientTape#:~:text=To%20compute%20multiple%20gradients%20over%20the%20same%20computation).
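A rough sketch of that accumulate-then-apply pattern with a tape (N, loss_fn, dataset, model, and optimizer are assumptions for illustration, and every variable is assumed to receive a gradient):

import tensorflow as tf

N = 4  # apply an update every N batches (assumption)
accum = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset, start=1):
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x), y)  # hypothetical loss_fn
    grads = tape.gradient(loss, model.trainable_variables)
    accum = [a + g for a, g in zip(accum, grads)]  # accumulate
    if step % N == 0:
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        accum = [tf.zeros_like(v) for v in model.trainable_variables]  # reset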
A good solution to do it with .fit() is explained here: How to accumulate gradients for large batch sizes in Keras
If you want to know more about how parameter gradients are tracked efficiently to distribute the loss value, and to understand the whole process better, have a look at Automatic differentiation (Wikipedia).