How to "manually" apply your gradients in Pytorch? - pytorch

What would be the equivalent in PyTorch of the following TensorFlow code, where loss is the loss computed in the current iteration and net is the neural network?
with tf.GradientTape() as tape:
    grads = tape.gradient(loss, net.trainable_variables)
optimizer.apply_gradients(zip(grads, net.trainable_variables))
So, we compute the gradients of the loss with respect to all the trainable variables in our network, and in the next line we apply those gradients via the optimizer. In my use case this is the way to do it, and it works fine.
Now, how would I do the same in PyTorch? I am aware of the "standard" way:
optimizer.zero_grad()
loss.backward()
optimizer.step()
That is, however, not applicable in my case. So how can I apply the gradients "manually"? Google doesn't help, unfortunately, although I think it is probably a rather simple question.
Hope one of you can enlighten me!
Thanks!

Let's break down the standard PyTorch way of doing updates; hopefully, that will clarify what you want.
In PyTorch, each NN parameter has a .data and a .grad attribute. .data is the actual weight tensor, and .grad is the attribute that will hold the gradient; it is None if the gradient has not been computed yet. With this knowledge, let's walk through the update steps.
First, we do optimizer.zero_grad(). This zeros out or empties the .grad attribute. .grad may be None already if you never computed the gradients.
Next, we do loss.backward(). This is the backprop step that will compute and update each parameter's .grad attribute.
Once we have gradients, we want to update the weights with some rule (SGD, Adam, etc.), and we do optimizer.step(). This will iterate over all the parameters and update the weights correctly using the computed .grad attributes.
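To make this concrete, here is a tiny sketch (a hypothetical two-element parameter, not from the question) showing .data and .grad before and after backward():
import torch
w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
print(w.data)    # the actual weight tensor: tensor([1., 2.])
print(w.grad)    # None -- no gradient has been computed yet
loss = (w ** 2).sum()
loss.backward()
print(w.grad)    # tensor([2., 4.]) -- d(loss)/dw, now stored on the parameter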
So, now, to apply gradients manually, you can replace optimizer.step() with a for loop like the one below:
for param in model.parameters():
    param.data = custom_rule(param.data, param.grad, learning_rate, **any_other_arguments)
and that should do the trick.
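For instance, a plain SGD-style rule written out by hand could look like this (a minimal sketch; learning_rate and model are assumed to exist, and torch.no_grad() keeps autograd from tracking the update itself):
with torch.no_grad():
    for param in model.parameters():
        if param.grad is not None:
            param -= learning_rate * param.grad  # substitute any custom update rule here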

Related

PyTorch loss function that depends on gradient of network with respect to input

I'm trying to implement a loss function that depends on the gradient of the network with respect to its inputs. That is, the loss function has a term like
sum(u - grad_x(network(x)))
where u is computed by forward propagating x through the network.
I'm able to compute the gradient by calling
funcApprox = funcNetwork.forward(X)
funcGrad = grad(funcApprox, X, grad_outputs=torch.ones_like(funcApprox))
Here, funcNetwork is my NN and X is the input. These computations are done in the loss function.
However, now if I attempt to do the following
opt.zero_grad()
loss = self.loss(X) # My custom loss function that calculates funcGrad, etc., from above
opt.zero_grad()
loss.backward()
opt.step()
I see the following error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
on the loss.backward() line from above.
I've tried playing around with create_graph, retain_graph, etc. but to no avail.
Any help is appreciated!
As per the comment by @aretor, setting retain_graph=True, create_graph=False in the grad call inside the loss function, and retain_graph=True in backward(), solves the issue.
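For illustration, a minimal sketch of the loss computation with that fix applied (reusing funcNetwork, X, u and opt from the question; the exact form of the loss term is assumed):
funcApprox = funcNetwork(X)
funcGrad = torch.autograd.grad(
    funcApprox, X,
    grad_outputs=torch.ones_like(funcApprox),
    retain_graph=True,   # keep the graph alive so the later backward() can reuse it
    create_graph=False,
)[0]
loss = (u - funcGrad).sum()  # assumed shape of the loss term from the question
opt.zero_grad()
loss.backward(retain_graph=True)
opt.step()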

Disable "in-place" updates in troch.nn

In short, I want to enable create_graph when doing loss.backward() while using torch.nn layers, so that I can do param.backward() to get the gradient of the final weights w.r.t. a hyperparameter.
In detail, I am implementing an algorithm to solve a bilevel problem (two nested problems). The parameters optimised in the inner problem can be viewed as the weights of a torch.nn-based model, while the parameters of the outer problem are the hyperparameters. I want to optimise both with gradient descent. Thus, to do one update on the hyperparameters, I need the gradient of the model's weights (after being trained) with respect to these hyperparameters, because the loss function for the hyperparameter optimisation is a function of the trained model weights.
The problem is that even when I set create_graph=True when backpropagating the inner loss, optimizer.step() performs in-place updates, so the graph cannot be created. The same happens when I replace optimizer.step() with manual updates of the model weights, since those are still in-place updates:
for name, param in model.named_parameters():
    param.data = param.data - param.grad
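To see why the in-place route loses the connection to the hyperparameter while an out-of-place update keeps it, here is a toy example (hypothetical scalars, not from the question):
import torch
theta = torch.tensor(0.5, requires_grad=True)  # "hyperparameter"
w = torch.tensor(1.0, requires_grad=True)      # "model weight"
inner_loss = (w * theta) ** 2
g = torch.autograd.grad(inner_loss, w, create_graph=True)[0]  # g depends on theta
w_new = w - 0.1 * g    # out-of-place update: w_new stays connected to theta
outer_loss = w_new ** 2
outer_loss.backward()
print(theta.grad)      # populated, because the update was not done in-place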
Simplified code of what I want to do:
for t in range(OUT_MAX_ITR):
    model.train()
    for i in range(IN_MAX_ITR):
        optimizer.zero_grad()
        outputs = model(xtr)
        loss = compute_loss(outputs)
        loss.backward(create_graph=True)
        optimizer.step()
    theta.grad = None
    out_loss = function_of_model_weights()
    out_loss.backward()
    update_theta(theta, theta.grad)
Here theta is the hyperparameter to be optimised.
Is there a way, or a workaround, in torch to do this second-order differentiation (or bilevel optimisation) when working with torch.nn?

PyTorch training with dropout and/or batch-normalization

A model should be set to evaluation mode for inference by calling model.eval().
Do we also need to do this during training, before getting the model outputs? For example, within a training epoch, if the network contains one or more dropout and/or batch-normalization layers.
If this is not done, won't the output of the forward pass in the training epoch be affected by the randomness of the dropout?
Many example codes do not do this and something along these lines is the common approach:
for t in range(num_epochs):
    # forward pass
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
For example, here is some example code to look at: convolutional_neural_network/main.py
Should it instead be this?
for t in range(num_epochs):
    # forward pass
    model.eval()  # disable dropout etc.
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    model.train()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
TLDR:
Should it instead be this?
No!
Why?
More explanation:
Different Modules behave differently depending on whether they are in training or evaluation/test mode.
BatchNorm and Dropout are only two examples of such modules; basically, any module that has a training phase follows this rule.
When you do .eval(), you are signaling all modules in the model to shift operations accordingly.
Update
The answer is that during training you should not use eval mode, and yes, as long as you have not set eval mode, dropout will be active and act randomly on each forward pass. Similarly, all other modules that have two phases will behave accordingly. That is, BN will always update the mean/var on each pass; also, if you use a batch_size of 1 it will error out, since it cannot do BN with a batch of 1.
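A small sketch of that behaviour, using a standalone Dropout layer just for illustration:
import torch
drop = torch.nn.Dropout(p=0.5)
x = torch.ones(8)
drop.train()
print(drop(x))  # random entries zeroed, survivors scaled by 1/(1-p) = 2
print(drop(x))  # a different random mask on every forward pass
drop.eval()
print(drop(x))  # identity: dropout is inactive in eval mode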
As was pointed out in the comments, it should be noted that during training you should not call eval() before the forward pass: it effectively disables all modules that have different train/test phases, such as BN and Dropout (basically any module that has updatable/learnable parameters, or that impacts network topology, like dropout), and you will not see them contributing to your network's learning. So don't code like that!
Let me explain a bit what happens during training:
When you are in training mode, the modules that make up your model may have two modes, training and test. These modules either have learnable parameters that need to be updated during training, like BN, or affect the network topology in a sense, like Dropout (by disabling some features during the forward pass). Some modules, such as ReLU(), only operate in one mode and thus do not change when the mode changes.
When you are in training mode, you feed an image and it passes through the layers until it reaches a dropout layer; there, some features are disabled, so their responses to the next layer are omitted. The output then goes through the remaining layers until it reaches the end of the network and you get a prediction.
The network may make correct or wrong predictions, and the weights will be updated accordingly: if the answer was right, the features/combinations of features that produced the correct answer will be reinforced, and vice versa.
So during training you do not need to, and should not, disable dropout: it affects the output, and it should, so that the model learns a better set of features.
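Put differently, the conventional pattern looks roughly like this (a minimal sketch assuming separate train_loader/val_loader, criterion and optimizer):
model.train()  # enable dropout, let BN update its running statistics
for x, y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
model.eval()   # disable dropout, freeze BN statistics
with torch.no_grad():
    for x, y in val_loader:
        val_loss = criterion(model(x), y)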
I hope this makes it a bit clearer for you. If you still feel you need more, say so in the comments.

In Keras, is there any function similar to zero_grad() in PyTorch?

In PyTorch, we can call zero_grad() to clear the gradients. Is there a similar function in Keras so that we can achieve the same thing? For example, I want to accumulate gradients across some batches.
In a custom training loop, it's easy to achieve:
...
# this is a glimpse of your custom training loop
# assume a `flag` has been defined to control the behavior
# assume a `buf = []` has been defined to hold accumulated grads
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
if flag:  # do not accumulate grads
    _grads = some_func(buf)  # deal with the accumulated grads in buf
    buf = []  # clear buf
    optimizer.apply_gradients(zip(_grads, model.trainable_variables))
else:  # accumulate grads
    buf.append(grads)
...
As for the high-level Keras API (model.compile(), model.fit()), I have no idea, because I use both TF2 and PyTorch and prefer custom training loops, which are an easier way to narrow the distance between the two.
In PyTorch, the gradients are accumulated for every variable and the loss value is distributed among them all. The optimizer is then in charge of updating the model parameters (specified at initialization), and since the gradient values are kept in memory, you have to zero them at the start.
optimizer = torch.optim.Adam(itertools.chain(*param_list), lr=opt.lr, ...)
...
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
In Keras, with gradient tapes you wrap the operations for whose variables you want to compute gradients. You call the gradient method on the tape to compute the updates, passing the loss value and the variables for which you need gradients. The optimizer then applies one update per parameter (for the entire list of update-parameter pairs you pass in).
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
You can use the .fit() method instead, which does all of that under the hood.
If your aim is to accumulate the updates multiple times, there is no standard method in Keras, but you can do it more easily with tapes, accumulating the gradient values before applying them (see https://www.tensorflow.org/api_docs/python/tf/GradientTape#:~:text=To%20compute%20multiple%20gradients%20over%20the%20same%20computation).
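A minimal sketch of that accumulation with a tape (accum_steps, dataset, loss_fn, model and optimizer are assumed to be defined):
import tensorflow as tf
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]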
A good solution to do it with .fit() is explained here: How to accumulate gradients for large batch sizes in Keras
If you want to know more about how the parameter gradients are tracked efficiently to distribute the loss value, and to understand the whole process better, have a look at (Wikipedia) Automatic differentiation.

Will tensor.backward compute all the gradients in the graph?

I read this code on GitHub:
# loss1, loss2 belong to the same net
net.zero_grad()
loss1 = ...
loss2 = ...
loss1.backward()
loss2.backward()
optim.step()
This is not a backpropagation pattern mentioned on the official PyTorch website, and the official documentation says "Computes the gradient of current tensor w.r.t. graph leaves." for tensor.backward.
So, are the gradients of everything except the two loss tensors not computed? And are no tensors updated?
loss.backward() computes the gradients for all variables in the graph w.r.t. the loss function. The parameters are then updated according to the accumulated gradients in optim.step(). In your code, you backpropagate twice (once for each loss); the gradients are accumulated, and only after both gradients have been accumulated are the parameters updated by the optimizer.
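A toy example of that accumulation (a hypothetical single parameter):
import torch
w = torch.nn.Parameter(torch.tensor(2.0))
optim = torch.optim.SGD([w], lr=0.1)
optim.zero_grad()
loss1 = w * 3.0  # d(loss1)/dw = 3
loss2 = w ** 2   # d(loss2)/dw = 2w = 4
loss1.backward()
loss2.backward()
print(w.grad)    # tensor(7.) -- the two gradients were accumulated
optim.step()     # w <- 2.0 - 0.1 * 7 = 1.3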
