Will tensor.backward compute all the gradients in the graph? - pytorch

I read this code on GitHub:
# loss1, loss2 belong to the same net
net.zero_grad()
loss1 = ...
loss2 = ...
loss1.backward()
loss2.backward()
optim.step()
This is not a backpropagation pattern mentioned on the official PyTorch website, and the official documentation for tensor.backward says: "Computes the gradient of current tensor w.r.t. graph leaves."
So, are gradients other than those of the two loss tensors not computed? And are no tensors updated?

loss.backward() computes the gradient of the loss with respect to every leaf tensor in the graph that requires gradients (here, the network's parameters) and accumulates the result into each tensor's .grad attribute. The parameters are then updated from the accumulated gradients in optim.step(). In your code, you backpropagate twice (once for each loss), the gradients are accumulated, and only after both have been accumulated are the parameters updated by the optimizer.
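To make the accumulation concrete, here is a minimal sketch. The network, input and the two losses are hypothetical stand-ins, not the code from the question. It checks that two backward calls leave the same values in .grad as a single backward on loss1 + loss2. Note that when both losses share part of the graph, as they do here, the first call needs retain_graph=True.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Linear(3, 1)   # hypothetical tiny network
x = torch.randn(4, 3)         # hypothetical input batch


def two_losses():
    # Both losses are built from the same forward pass, so they share a graph.
    out = net(x)
    return out.pow(2).mean(), out.abs().mean()


# Two separate backward calls: gradients accumulate in .grad
net.zero_grad()
loss1, loss2 = two_losses()
loss1.backward(retain_graph=True)  # keep the shared graph alive for loss2
loss2.backward()
accumulated = [p.grad.clone() for p in net.parameters()]

# A single backward call on the summed loss
net.zero_grad()
loss1, loss2 = two_losses()
(loss1 + loss2).backward()
summed = [p.grad.clone() for p in net.parameters()]

same = all(torch.allclose(a, b) for a, b in zip(accumulated, summed))
print(same)
```

Calling backward once on the sum is usually the simpler option, since it does not need retain_graph.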

Related

How to "manually" apply your gradients in Pytorch?

What would be the equivalent in PyTorch of the following in TensorFlow, where loss is the loss calculated in the current iteration and net is the neural network?
with tf.GradientTape() as tape:
    loss = ...  # forward pass and loss, computed under the tape
grads = tape.gradient(loss, net.trainable_variables)
optimizer.apply_gradients(zip(grads, net.trainable_variables))
So, we compute the gradients of the loss with respect to all the trainable variables in our network. In the next line we apply the gradients via the optimizer. In the use case I have, this is the way to do it and it works fine.
Now, how would I do the same in Pytorch? I am aware of the "standard" way:
optimizer.zero_grad()
loss.backward()
optimizer.step()
That is, however, not applicable for me. So how can I apply the gradients "manually"? Google doesn't help, unfortunately, although I think it is probably a rather simple question.
Hope one of you can enlighten me!
Thanks!
Let's break the standard PyTorch way of doing updates; hopefully, that will clarify what you want.
In Pytorch, each NN parameter has a .data and .grad attribute. .data is ... the actual weight tensor, and .grad is the attribute that will hold the gradient. It is None if the gradient is not computed yet. With this knowledge, let's understand the update steps.
First, we do optimizer.zero_grad(). This zeros out or empties the .grad attribute. .grad may be None already if you never computed the gradients.
Next, we do loss.backward(). This is the backprop step that will compute and update each parameter's .grad attribute.
Once we have gradients, we want to update the weights with some rule (SGD, Adam, etc.), and we do optimizer.step(). This will iterate over all the parameters and update the weights using the computed .grad attributes.
So, now to apply gradients manually, you can replace the optimizer.step() with a for loop like the one below (custom_rule stands for whatever update rule you want to apply):
for param in model.parameters():
    param.data = custom_rule(param.data, param.grad, learning_rate, **any_other_arguments)
and that should do the trick.
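As a concrete illustration, here is a minimal sketch where the custom rule is plain gradient descent. The model, data, learning rate and the sgd_rule helper are all hypothetical. It updates the parameters in place under torch.no_grad(), a common alternative to assigning param.data.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)                    # hypothetical model
x, target = torch.randn(8, 2), torch.randn(8, 1)  # hypothetical data
learning_rate = 0.05                             # hypothetical step size


def sgd_rule(weight, grad, lr):
    # Hypothetical update rule: one vanilla gradient-descent step.
    return weight - lr * grad


loss_before = torch.nn.functional.mse_loss(model(x), target)
model.zero_grad()
loss_before.backward()          # fills param.grad for every parameter

with torch.no_grad():           # the update itself must not be tracked
    for param in model.parameters():
        param.copy_(sgd_rule(param, param.grad, learning_rate))

loss_after = torch.nn.functional.mse_loss(model(x), target)
print(loss_before.item(), loss_after.item())
```

Any other rule can be swapped in for sgd_rule, as long as it returns a tensor of the same shape as the parameter.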

PyTorch loss function that depends on gradient of network with respect to input

I'm trying to implement a loss function that depends on the gradient of the network with respect to its inputs. That is, the loss function has a term like
sum(u - grad_x(network(x)))
where u is computed by forward propagating x through the network.
I'm able to compute the gradient by calling
funcApprox = funcNetwork.forward(X)
funcGrad = grad(funcApprox, X, grad_outputs=torch.ones_like(funcApprox))
Here, funcNetwork is my NN and X is the input. These computations are done in the loss function.
However, now if I attempt to do the following
opt.zero_grad()
loss = self.loss(X) # My custom loss function that calculates funcGrad, etc., from above
opt.zero_grad()
loss.backward()
opt.step()
I see the following error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
on the loss.backward() line from above.
I've tried playing around with create_graph, retain_graph, etc. but to no avail.
Any help is appreciated!
As per the comment by @aretor, setting retain_graph=True, create_graph=False in the grad call in the loss function, and retain_graph=True in backward, solves the issue.
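One caveat worth knowing: with create_graph=False, the tensor returned by grad does not require gradients, so the derivative term is treated as a constant when loss.backward() runs. If the derivative term should itself influence the parameter updates, create_graph=True is the usual choice (it also keeps the graph alive by default). A minimal sketch, assuming a small hypothetical network and a squared residual in place of the loss from the question:

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-in for funcNetwork from the question
func_network = torch.nn.Sequential(
    torch.nn.Linear(1, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1)
)
opt = torch.optim.SGD(func_network.parameters(), lr=0.01)
X = torch.linspace(0.0, 1.0, 16).unsqueeze(1).requires_grad_(True)


def loss_fn(X):
    u = func_network(X)
    # create_graph=True records the gradient computation itself, so the
    # loss below can be differentiated through du_dx as well as through u.
    (du_dx,) = torch.autograd.grad(
        u, X, grad_outputs=torch.ones_like(u), create_graph=True
    )
    return (u - du_dx).pow(2).mean()  # hypothetical residual u - du/dx


opt.zero_grad()
loss = loss_fn(X)
loss.backward()
grads_present = all(p.grad is not None for p in func_network.parameters())
opt.step()
print(loss.item(), grads_present)
```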

Difference between autograd.grad and autograd.backward?

Suppose I have my custom loss function and I want to fit the solution of some differential equation with help of my neural network. So in each forward pass, I am calculating the output of my neural net and then calculating the loss by taking the MSE with the expected equation to which I want to fit my perceptron.
Now my doubt is: should I use grad(loss) or should I do loss.backward() for backpropagation to calculate and update my gradients?
I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set the requires_grad = True for the variables w.r.t which I want to take the gradient of my loss.
So my questions are :
Does grad(loss) also require any such explicit parameter to identify the variables for gradient computation?
How does it actually compute the gradients?
Which approach is better?
What is the main difference between the two in a practical scenario?
It would be better if you could explain the practical implications of both approaches because whenever I try to find it online I am just bombarded with a lot of stuff that isn't much relevant to my project.
TLDR; Both are two different interfaces to perform gradient computation: torch.autograd.grad is non-mutable while torch.autograd.backward is.
Descriptions
The torch.autograd module is the automatic differentiation package for PyTorch. As described in the documentation it only requires minimal change to code base in order to be used:
you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:
torch.autograd.backward (source)
Description: Computes the sum of gradients of given tensors with respect to graph leaves.
Header: torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)
Parameters:
- tensors – Tensors of which the derivative will be computed.
- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.
- retain_graph – If False, the graph used to compute the grad will be freed. [...]
- inputs – Inputs w.r.t. which the gradient will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].
torch.autograd.grad (source)
Description: Computes and returns the sum of gradients of outputs with respect to the inputs.
Header: torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
Parameters:
- outputs – outputs of the differentiated function.
- inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).
- grad_outputs – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.
- retain_graph – If False, the graph used to compute the grad will be freed. [...].
Usage examples
In terms of high-level usage, you can look at torch.autograd.grad as a non-mutable function. As mentioned in the documentation excerpts above, it will not accumulate the gradients on the grad attribute but instead return the computed partial derivatives. In contrast, torch.autograd.backward mutates the tensors by updating the grad attribute of leaf nodes, and the function doesn't return any value. In other words, the latter is more suitable when computing gradients for a large number of parameters.
In the following, we will take two inputs (x1 and x2), calculate a tensor y with them, and then compute the partial derivatives of the result w.r.t. both inputs, i.e. dL/dx1 and dL/dx2:
>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor([0.3939], requires_grad=True),
 tensor([0.7965], requires_grad=True))
Inference:
>>> y = x1**2 + 5*x2
>>> y
tensor([4.1377], grad_fn=<AddBackward0>)
Since y was computed using tensor(s) requiring gradients (i.e. with requires_grad=True), and outside of a torch.no_grad context, it has a grad_fn function attached. This callback is used to backpropagate through the computation graph to compute the gradients of the preceding tensor nodes.
torch.autograd.grad:
Here we provide torch.ones_like(y) as the grad_outputs.
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor([0.7879]), tensor([5.]))
The above output is a tuple containing the two partial derivatives w.r.t. the provided inputs, in order of appearance, i.e. dL/dx1 and dL/dx2.
This corresponds to the following computation:
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs @ 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs @ 5
torch.autograd.backward: in contrast, it mutates the provided tensors by updating the grad of the tensors which have been used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here, we go through the same example by defining x1, x2, and y again. We call backward, which returns nothing:
>>> # y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))
Then you can retrieve the gradients on x1.grad and x2.grad:
>>> x1.grad, x2.grad
(tensor([0.7879]), tensor([5.]))
In conclusion: both perform the same operation. They are two different interfaces to interact with the autograd library and perform gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural networks training loops to compute the partial derivative of the loss w.r.t each one of the model's parameters.
You can read more about how torch.autograd.grad works by reading through this other answer I made on: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
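To see the difference in side effects in one place, here is a short sketch with fixed input values (chosen arbitrarily for illustration). torch.autograd.grad leaves .grad untouched, while repeated backward calls keep adding to it:

```python
import torch

x1 = torch.tensor([0.5], requires_grad=True)
x2 = torch.tensor([2.0], requires_grad=True)

# Functional interface: gradients are returned, .grad stays empty
y = x1**2 + 5 * x2
g1, g2 = torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
grad_left_untouched = x1.grad is None and x2.grad is None

# Mutating interface: each call accumulates into .grad
y = x1**2 + 5 * x2
y.backward(torch.ones_like(y))
y = x1**2 + 5 * x2
y.backward(torch.ones_like(y))

print(g1, g2)            # dy/dx1 = 2*x1, dy/dx2 = 5
print(x1.grad, x2.grad)  # twice those values after two backward calls
```

This accumulation is why training loops reset the gradients before each backward pass.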
In addition to Ivan's answer, the fact that torch.autograd.grad does not accumulate gradients into .grad can avoid race conditions in multi-threaded scenarios.
Quoting PyTorch doc https://pytorch.org/docs/stable/notes/autograd.html#non-determinism
If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
implementation details https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h

In Keras, is there any function similar to the zero_grad() in Pytorch?

In Pytorch, we can call zero_grad() to clear the gradients. In Keras, do we have a similar function so that we can achieve the same thing? For example, I want to accumulate gradients among some batches.
In a custom training loop, it's easy to achieve:
...
# this is a glance at your custom training loop
# assume a `flag` has been defined to control your behavior
# assume a `buf = []` has been defined to hold the accumulated grads
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
if flag:  # do not accumulate grads
    _grads = some_func(buf)  # deal with accumulated grads in buf
    buf = []  # clear buf
    optimizer.apply_gradients(zip(_grads, model.trainable_variables))
else:  # accumulate grads
    buf.append(grads)
...
With the high-level Keras API (model.compile(), model.fit()), I have no idea, because I use both TF2 and PyTorch and prefer a custom training loop, which is an easier way to narrow the distance between the two.
In PyTorch, gradients are accumulated in each parameter's .grad attribute as the loss is backpropagated through the graph. The optimizer is then in charge of updating the model parameters (those specified at its initialization), and since the accumulated gradients are kept in memory between iterations, you have to zero them at the start of each iteration.
optimizer = torch.optim.Adam(itertools.chain(*param_list), lr=opt.lr, ...)
...
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
In Keras with gradient tapes, you wrap the operations whose variables you want gradients for. You call the gradient method on the tape, passing the loss value and the variables, to compute the gradients. The optimizer then applies one update per parameter for the list of (gradient, variable) pairs you pass in.
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
You can use the .fit() method instead, which does all of that under the hood.
If your aim is to accumulate the update over multiple steps, Keras has no standard method for it, but you can do it fairly easily with tapes by accumulating the gradient values before applying them (see https://www.tensorflow.org/api_docs/python/tf/GradientTape#:~:text=To%20compute%20multiple%20gradients%20over%20the%20same%20computation).
A good solution to do it with .fit() is explained here: How to accumulate gradients for large batch sizes in Keras
If you want to know more about how parameter gradients are tracked efficiently to propagate the loss, and to understand the whole process better, have a look at (Wikipedia) Automatic differentiation

How to create a custom keras loss function with opencv?

I'm developing a machine learning model using Keras and I notice that the available loss functions are not giving the best results on my test set.
I am using a U-Net architecture, where I input a (16,16,3) image and the net also outputs a (16,16,3) picture (auto-encoder). I think one way to improve the model would be to use a loss function that compares, pixel by pixel, the gradients (Laplacian) of the net output and of the ground truth. However, I did not find any tutorial that handles this kind of application, because it would need to use the OpenCV Laplacian function on each output image from the net.
The loss function would be something like this:
def laplacian_loss(y_true, y_pred):
    # y_true already is the calculated gradients, only needs to compute on the y_pred
    # calculates the gradients for each predicted image
    y_pred_lap = []
    for img in y_pred:
        laplacian = cv2.Laplacian( np.float64(img), cv2.CV_64F )
        y_pred_lap.append( laplacian )
    y_pred_lap = np.array(y_pred_lap)
    # mean squared error, according to keras losses documentation
    return K.mean(K.square(y_pred_lap - y_true), axis=-1)
Has anyone done something like that for loss calculation?
Given the code above, it seems that it would be equivalent to using a Lambda() layer as the output layer that applies that transformation to the image before computing the mean squared error.
Regardless of whether it is implemented as a Lambda() layer or in the loss function, the transformation needs to be such that TensorFlow understands how to calculate the gradients. The simplest way to do this would probably be to reimplement the cv2.Laplacian computation using TensorFlow math operations.
In order to use the cv2 library directly, you need to create a function that calculates the gradients for what happens inside the cv2 lib; that seems significantly more error prone.
Gradient descent optimisation relies on being able to compute gradients from the inputs to the loss and back. Any operation in the middle must be differentiable, and TensorFlow must understand the math operations for automatic differentiation to work; otherwise you need to supply the gradients manually.
I managed to reach an easy solution. The key point is that the gradient calculation is actually a 2D filter. For more information about it, please follow the link about the Laplacian kernel. So the output of my network needs to be filtered by the Laplacian kernel. For that, I created an extra convolutional layer with fixed weights equal to the Laplacian kernel. After that, the network has two outputs (one being the desired image, the other being the gradient image), so it is also necessary to define both losses.
To make it clearer, I'll exemplify. In the end of the network you'll have something like:
channels = 3 # number of channels of network output
lap = Conv2D(channels, (3,3), padding='same', name='laplacian')(net_output)
model = Model(inputs=[net_input], outputs=[net_output, lap])
Define how you want to calculate the losses for each output (the keys are the names of the output layers):
# losses for output and laplacian
losses = {
"enhanced": "mse",
"laplacian": "mse"
}
lossWeights = {"enhanced": 1.0, "laplacian": 0.6}
Compile the model:
model.compile(optimizer=Adam(), loss=losses, loss_weights=lossWeights)
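Define the Laplacian kernel, copy its values into the weights of the convolutional layer above and set trainable to False (so it won't be updated). Keras generally takes the trainable flag into account at compile time, so it is safest to compile (or recompile) the model after setting it.
# laplacian kernel
l = np.asarray([
    [[[1,1,1],
      [1,-8,1],
      [1,1,1]
    ]]*channels
]*channels).astype(np.float32)
# reorder to the Conv2D kernel layout (rows, cols, in_channels, out_channels)
l = np.transpose(l, (2, 3, 0, 1))
bias = np.asarray([0]*channels).astype(np.float32)
wl = [l, bias]
model.get_layer('laplacian').set_weights(wl)
model.get_layer('laplacian').trainable = False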
When training, remember that you need two values for the ground truth, keyed by the same output names used for the losses:
model.fit(x=X, y={"enhanced": y_out, "laplacian": y_lap})
Observation: Do not utilize the BatchNormalization layer! In case you use it, the weights in the laplacian layer will be updated!
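The same fixed-kernel idea can also be written directly as a differentiable loss. Below is a minimal sketch in PyTorch rather than Keras (an assumption made for illustration only), with hypothetical function names. Unlike the layer above, it filters each channel independently via a grouped convolution instead of summing over input channels.

```python
import torch
import torch.nn.functional as F


def laplacian(images):
    # images: (batch, channels, height, width)
    channels = images.shape[1]
    kernel = torch.tensor([[1.0, 1.0, 1.0],
                           [1.0, -8.0, 1.0],
                           [1.0, 1.0, 1.0]])
    # One copy of the kernel per channel: shape (channels, 1, 3, 3)
    weight = kernel.repeat(channels, 1, 1, 1)
    # groups=channels applies the filter to every channel separately
    return F.conv2d(images, weight, padding=1, groups=channels)


def laplacian_loss(y_true_lap, y_pred):
    # y_true_lap is assumed to be precomputed, as in the question
    return F.mse_loss(laplacian(y_pred), y_true_lap)


y_pred = torch.rand(2, 3, 16, 16, requires_grad=True)  # stand-in network output
y_true_lap = laplacian(torch.rand(2, 3, 16, 16))        # stand-in ground truth
loss = laplacian_loss(y_true_lap, y_pred)
loss.backward()  # gradients flow through the fixed convolution
print(loss.item(), y_pred.grad.shape)
```

Because the kernel is a constant tensor rather than a layer parameter, there is nothing for the optimizer to update.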
