Truncated Backpropagation Through Time (BPTT) in PyTorch

In PyTorch, I train an RNN/GRU/LSTM network by starting the backpropagation (through time) with:
loss.backward()
When the sequence is long, I'd like to do Truncated Backpropagation Through Time instead of normal Backpropagation Through Time, where the whole sequence is used.
But I can't find any parameters or functions in the PyTorch API to set up truncated BPTT. Did I miss it? Am I supposed to code it myself in PyTorch?

Here is an example:
for t in range(T):
    out = lstm(out)
    if T - t == k:
        out = out.detach()  # cut the graph here so gradients flow back only k steps
out.backward()
So in this example, k is the parameter you use to control how many timesteps you want to unroll.
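For a fuller picture, here is a minimal sketch of truncated BPTT that detaches the hidden state every k steps. This is an illustration, not code from the question: model, criterion, optimizer, inputs (shape [T, batch, features]) and targets are assumed to exist, and chunking the sequence like this is just one common choice.
import torch

hidden = None
for start in range(0, inputs.size(0), k):
    chunk = inputs[start:start + k]           # process k timesteps at a time
    target = targets[start:start + k]

    if hidden is not None:
        # detach so gradients stop at the chunk boundary
        hidden = tuple(h.detach() for h in hidden)

    output, hidden = model(chunk, hidden)     # e.g. an nn.LSTM
    loss = criterion(output, target)

    optimizer.zero_grad()
    loss.backward()                           # backpropagates through at most k steps
    optimizer.step()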

Related

PyTorch loss function that depends on gradient of network with respect to input

I'm trying to implement a loss function that depends on the gradient of the network with respect to its inputs. That is, the loss function has a term like
sum(u - grad_x(network(x)))
where u is computed by forward propagating x through the network.
I'm able to compute the gradient by calling
from torch.autograd import grad

funcApprox = funcNetwork.forward(X)
funcGrad = grad(funcApprox, X, grad_outputs=torch.ones_like(funcApprox))
Here, funcNetwork is my NN and X is the input. These computations are done in the loss function.
However, now if I attempt to do the following
opt.zero_grad()
loss = self.loss(X) # My custom loss function that calculates funcGrad, etc., from above
opt.zero_grad()
loss.backward()
opt.step()
I see the following error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
on the loss.backward() line from above.
I've tried playing around with create_graph, retain_graph, etc. but to no avail.
Any help is appreciated!
As per the comment by @aretor, setting retain_graph=True, create_graph=False in the grad call inside the loss function, and retain_graph=True in backward, solves the issue.
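For illustration, a minimal sketch of that fix, using the names from the question (funcNetwork, X, u, opt are assumed to be defined as above):
from torch.autograd import grad
import torch

def loss_fn(X):
    X.requires_grad_(True)                  # needed to differentiate w.r.t. the input
    funcApprox = funcNetwork(X)
    funcGrad = grad(funcApprox, X,
                    grad_outputs=torch.ones_like(funcApprox),
                    retain_graph=True, create_graph=False)[0]
    # the term from the question: sum(u - grad_x(network(x)))
    return (u - funcGrad).sum()

opt.zero_grad()
loss = loss_fn(X)
loss.backward(retain_graph=True)
opt.step()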

pytorch loss accumulated when using mini-batch

I am new to PyTorch. May I ask what the difference is between adding loss.item() and not adding it? Here are the two pieces of code:
for epoch in range(epochs):
    trainingloss = 0
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X,n)
        criterion.backward()
        optimizer.step()
        trainingloss += criterion.item()
and this
for epoch in range(epochs):
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X,n)
        criterion.backward()
        optimizer.step()
If anyone has any idea please help. Thank you very much.
Calling loss.item() gives you the loss value as a plain Python number, detached from the computation graph that PyTorch builds (this is what .item() does for scalar PyTorch tensors).
If you add the line trainingloss += criterion.item() at the end of each "batch loop", this keeps track of the loss over the whole epoch by incrementally adding the loss of each minibatch in your training set. This is necessary because you are using minibatches - the loss of a single minibatch is not the loss over all the batches.
Note: If you use PyTorch tensors outside the optimization loop, e.g. in a different scope (which could happen if you call something like return loss), it is crucial that you call .item() on any tensors that are part of the computation graph (as a general rule of thumb, any outputs/losses/models that interact with PyTorch methods are likely part of your computation graph). If you don't, the computation graph may never be de-allocated from Python memory, which can lead to CPU/GPU memory leaks. What you have above looks correct, though!
Also, in the future, PyTorch's DataLoader class can help you with minibatches with less boilerplate code - it can loop over your dataset such that each item you loop over is a training batch - i.e. you don't require two for loops in your optimization.
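As an illustration only (not code from the question), a minimal sketch of such a loop, assuming dataset, model, loss_fn, optimizer, epochs and batch_size are defined elsewhere:
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    training_loss = 0.0
    for batch_x, batch_y in loader:      # each item is already a minibatch
        optimizer.zero_grad()
        output = model(batch_x)
        loss = loss_fn(output, batch_y)
        loss.backward()
        optimizer.step()
        training_loss += loss.item()     # plain float, safe to accumulate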
I hope you enjoy learning/using PyTorch!
In your training loop, the criterion.backward() call computes the gradient of each trainable parameter along the feed-forward path, and the optimizer.step() call then updates the parameters based on the computed gradients and the chosen optimization technique. At the end of this step, training of the model for that particular batch is finished, and the trainingloss += criterion.item() part is only for tracking and monitoring the training process and the loss value at each step.

Beginner: loss.backwards() doesn't work if input had no impact on output

I am a beginner with pytorch and have the following problem.
I want to optimize a complex problem that uses torch.min() multiple times, but even with a simple toy example I can't get it to work. My code has these lines:
output = net(input)
loss = g(output)
loss = torch.min(loss, torch.ones(1))
loss.backward()
To minimize this loss, the net should learn an output that minimizes g: R^2 -> R. Here g is a very simple function with a zero at {-1,-2}, and if I delete the third line the neural network finds the solution just fine. However, with the posted code and bad initial weights, the minimum is attained by the constant 1 (the torch.ones(1) branch), so backward() does not update the weights and no learning happens at all over arbitrarily many epochs.
Is there a way to detect/fix this behaviour in more complex cases? My task uses the minimum function multiple times and in more complex ways, so I think it would be quite hard to track every single one and make sure that learning actually takes place.
Edit: If I restart the optimizer multiple times, it rarely happens that the optimization works just fine (e.g. converges to {-1,-2}). My interpretation is that in those cases the initial weights randomly lead to the minimum being attained in the first component.
Rewrite your code:
output = net(input)
loss_fn = torch.min
y_hat = g(output)
y = torch.ones(1)
loss = loss_fn(y_hat, y)
loss.backward()
You wrote:
This leads to the backward() function not updating the weights and no learning happening at all over arbitrary many epochs.
Once we calculate the loss, we call loss.backward(), which computes the gradients automatically. The gradients are needed in the next phase, when we use optimizer.step() to improve the model parameters (weights).
You need to have an optimizer, something like this:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
Since a PyTorch training loop can easily be 100 lines of code, I will just reference some material here that is perfect for beginners.
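For reference, a bare-bones sketch of what such a loop could look like for the toy problem above (net, g, and input as in the question; the learning rate and epoch count are arbitrary):
import torch

optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(100):
    optimizer.zero_grad()
    output = net(input)
    loss = torch.min(g(output), torch.ones(1))
    loss.backward()    # note: gradients w.r.t. net are zero whenever the constant 1 wins the min
    optimizer.step()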

In Keras, is there any function similar to the zero_grad() in Pytorch?

In Pytorch, we can call zero_grad() to clear the gradients. In Keras, do we have a similar function so that we can achieve the same thing? For example, I want to accumulate gradients among some batches.
In a custom training loop it is easy to do:
...
# this is a glance of your custom training loop
# assume a `flag` has been defined to control this behaviour
# assume a `buf = []` has been defined to collect gradients
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
if flag:  # do not accumulate grads
    _grads = some_func(buf)  # deal with the accumulated grads in buf
    buf = []  # clear buf
    optimizer.apply_gradients(zip(_grads, model.trainable_variables))
else:  # accumulate grads
    buf.append(grads)
...
With the high-level Keras API (model.compile(), model.fit()) I have no idea, because I use both TF2 and PyTorch and prefer custom training loops, which is an easier way to narrow the distance between the two.
In PyTorch the gradients are accumulated for every variable and the loss value is distributed among them all. The optimizer is then in charge of updating the model parameters (specified at initialization), and since the gradient values are kept in memory, you have to zero them at the start of each step.
optimizer = torch.optim.Adam(itertools.chain(*param_list), lr=opt.lr, ...)
...
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
In Keras, with gradient tapes you wrap a bunch of operations for whose variables you want to compute gradients. You call the gradient method on the tape to compute the update, passing the loss value and the variables for which you need the gradient update. The optimizer then applies a single update to each parameter (for the entire list of update-parameter pairs you specified).
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
You can use the .fit() method instead, which does all of that under the hood.
If your aim is to accumulate the update multiple times, there is no standard method in Keras, but you can do it more easily with tapes by accumulating the update values before applying them (see https://www.tensorflow.org/api_docs/python/tf/GradientTape#:~:text=To%20compute%20multiple%20gradients%20over%20the%20same%20computation).
A good solution to do it with .fit() is explained here: How to accumulate gradients for large batch sizes in Keras
If you want to know more about how parameter gradients are tracked efficiently to distribute the loss value, and to understand the whole process better, have a look at Automatic differentiation (Wikipedia).
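As a rough illustration of that tape-based accumulation (model, optimizer, loss_fn and dataset are assumed to exist; accum_steps is an arbitrary choice):
import tensorflow as tf

accum_steps = 4
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]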

GradientTape losing track of variable

I have a script that performs a Gatys-like neural style transfer. It uses a style loss and a total variation loss. I'm using GradientTape() to compute my gradients. The losses that I have implemented seem to work fine, but a new loss that I added isn't being properly accounted for by the GradientTape(). I'm using TensorFlow with eager execution enabled.
I suspect it has something to do with how I compute the loss based on the input variable. The input is a 4D tensor (batch, h, w, channels). At the most basic level, the input is a floating-point image, and in order to compute this new loss I need to convert it to a binary image to compute the ratio of one pixel color to another. I don't want to actually change the image like that during every iteration, so I just make a copy of the tensor (in numpy form) and operate on that to compute the loss. I do not understand the limitations of the GradientTape, but I believe it is "losing the thread" of how the input variable is used to get to the loss when the tensor is converted to a numpy array.
Could I make a copy of the image tensor and perform binarizing operations & loss computation using that? Or am I asking tensorflow to do something that it just can not do?
My new loss function:
def compute_loss(self, **kwargs):
    loss = 0
    image = self.model.deprocess_image(kwargs['image'].numpy())
    binarized_image = self.image_decoder.binarize_image(image)
    volume_fraction = self.compute_volume_fraction(binarized_image)
    loss = np.abs(self.volume_fraction_target - volume_fraction)
    return loss
My implementation using the GradientTape:
def compute_grads_and_losses(self, style_transfer_state):
    """
    Computes gradients with respect to input image
    """
    with tf.GradientTape() as tape:
        loss = self.loss_evaluator.compute_total_loss(style_transfer_state)
        total_loss = loss['total_loss']
    return tape.gradient(total_loss, style_transfer_state['image']), loss
An example that I believe might illustrate my confusion. The strangest thing is that my code doesn't have any problem running; it just doesn't seem to minimize the new loss term whatsoever. But this example won't even run due to an attribute error: AttributeError: 'numpy.float64' object has no attribute '_id'.
Example:
import tensorflow.contrib.eager as tfe
import tensorflow as tf

def compute_square_of_value(x):
    a = turn_to_numpy(x['x'])
    return a**2

def turn_to_numpy(arg):
    return arg.numpy()  # just return arg to eliminate the error

tf.enable_eager_execution()

x = tfe.Variable(3.0, dtype=tf.float32)
data_dict = {'x': x}
with tf.GradientTape() as tape:
    tape.watch(x)
    y = compute_square_of_value(data_dict)

dy_dx = tape.gradient(y, x)  # Will compute to 6.0
print(dy_dx)
Edit:
From my current understanding, the issue is that my use of the .numpy() operation makes the GradientTape lose track of the variable it should compute the gradient from. My original reason for doing this is that my loss computation requires me to physically change values of the tensor, and I don't want to actually change the values of the tensor that is being optimized, hence the .numpy() copy to work on when computing the loss. Is there any way around this? Or shall I consider my loss calculation impossible to implement because of this constraint of having to perform essentially non-reversible operations on the input tensor?
The first issue here is that GradientTape only traces operations on tf.Tensor objects. When you call tensor.numpy(), the operations executed there fall outside the tape.
The second issue is that your first example never calls tape.watch on the image you want to differentiate with respect to.
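By way of illustration (not from the question), here is a rework of the toy example that stays on tf.Tensor objects so the tape can trace the square, written against TF2-style eager execution:
import tensorflow as tf

def compute_square_of_value(x):
    return x['x'] ** 2            # no .numpy(): the operation stays on the tape

x = tf.Variable(3.0, dtype=tf.float32)
data_dict = {'x': x}

with tf.GradientTape() as tape:
    y = compute_square_of_value(data_dict)

dy_dx = tape.gradient(y, x)       # tf.Tensor(6.0, ...)
print(dy_dx)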
