Custom loss function in PyTorch

I have three simple questions.
What will happen if my custom loss function is not differentiable? Will PyTorch throw an error or do something else?
If I declare a loss variable in my custom function which will represent the final loss of the model, should I set requires_grad = True for that variable? Or does it not matter? If it doesn't matter, then why?
I have seen people sometimes write a separate layer and compute the loss in the forward function. Which approach is preferable, writing a function or a layer? Why?
I would appreciate a clear explanation of these questions to resolve my confusion. Please help.

Let me have a go.
This depends on what you mean by "non-differentiable". The first definition that makes sense here is that PyTorch doesn't know how to compute gradients. If you try to compute gradients nevertheless, this will raise an error. The two possible scenarios are:
a) You're using a PyTorch operation for which gradients have not been implemented, e.g. torch.svd() (this was the case at the time this answer was written). In that case you will get a TypeError:
import torch
from torch.autograd import Function
from torch.autograd import Variable
A = Variable(torch.randn(10,10), requires_grad=True)
u, s, v = torch.svd(A) # raises TypeError
b) You have implemented your own operation, but did not define backward(). In this case, you will get a NotImplementedError:
class my_function(Function): # forgot to define backward()
    def forward(self, x):
        return 2 * x

A = Variable(torch.randn(10, 10))
B = my_function()(A)
C = torch.sum(B)
C.backward() # will raise NotImplementedError
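For comparison, here is a minimal sketch of the same operation with backward() defined (written against the modern autograd.Function static-method API rather than the older style used above), so the call no longer raises NotImplementedError:

import torch
from torch.autograd import Function

class MyDouble(Function):
    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_output):
        # d(2x)/dx = 2, so scale the incoming gradient by 2
        return 2 * grad_output

A = torch.randn(10, 10, requires_grad=True)
B = MyDouble.apply(A)
C = torch.sum(B)
C.backward()  # works; A.grad is now a 10x10 tensor filled with 2s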
The second definition that makes sense is "mathematically non-differentiable". Clearly, an operation which is mathematically not differentiable should either not have a backward() method implemented, or should return a sensible sub-gradient. Consider for example torch.abs(), whose backward() method returns the subgradient 0 at 0:
A = Variable(torch.Tensor([-1, 0, 1]), requires_grad=True)
B = torch.abs(A)
B.backward(torch.Tensor([1, 1, 1]))
A.grad.data  # -1, 0, 1 (note the subgradient 0 at x = 0)
For such cases, you should refer to the PyTorch source and dig out the backward() method of the respective operation.
It doesn't matter. requires_grad is used to avoid unnecessary gradient computations for subgraphs. If a single input to an operation requires gradient, its output will also require gradient. Conversely, only if all inputs don't require gradient will the output not require it either. Backward computation is never performed in subgraphs where no Variable requires gradients.
Since there are most likely some Variables that require gradients (for example, parameters of a subclass of nn.Module()), your loss Variable will automatically require gradients too. However, you should note that, precisely because of how requires_grad works (see above), you can only change requires_grad for leaf variables of your graph anyway.
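For illustration, a small sketch of that propagation rule (tensor names are arbitrary):

import torch

a = torch.randn(3, requires_grad=True)   # leaf that requires grad
b = torch.randn(3)                       # leaf that doesn't (default)

c = a * b
print(c.requires_grad)  # True: at least one input requires grad

d = b * 2
print(d.requires_grad)  # False: no input requires grad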
All the built-in PyTorch loss functions are subclasses of _Loss, which is in turn a subclass of nn.Module. See here. If you'd like to stick to this convention, you should subclass _Loss when defining your custom loss function. Apart from consistency, one advantage is that your subclass will raise an AssertionError if you haven't marked your target variables as volatile or requires_grad = False. Another advantage is that you can nest your loss function in nn.Sequential(), because it is an nn.Module. I would recommend this approach for these reasons.
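For illustration, a minimal sketch of a custom loss written as an nn.Module subclass (the class name and the exact formula are placeholders, not from the original answer):

import torch
import torch.nn as nn

class WeightedMSELoss(nn.Module):  # hypothetical example loss
    def __init__(self, weight=1.0):
        super().__init__()
        self.weight = weight

    def forward(self, prediction, target):
        # plain elementwise squared error, scaled and averaged
        return self.weight * ((prediction - target) ** 2).mean()

criterion = WeightedMSELoss(weight=2.0)
loss = criterion(torch.randn(4, 3, requires_grad=True), torch.randn(4, 3))
loss.backward()  # gradients flow because forward is built from autograd ops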

Related

Difference between autograd.grad and autograd.backward?

Suppose I have my custom loss function and I want to fit the solution of some differential equation with the help of my neural network. So in each forward pass, I calculate the output of my neural net and then calculate the loss by taking the MSE against the expected equation to which I want to fit my perceptron.
Now my doubt is: should I use grad(loss) or should I do loss.backward() for backpropagation to calculate and update my gradients?
I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set the requires_grad = True for the variables w.r.t which I want to take the gradient of my loss.
So my questions are:
Does grad(loss) also require any such explicit parameters to identify the variables for gradient computation?
How does it actually compute the gradients?
Which approach is better?
What is the main difference between the two in a practical scenario?
It would be better if you could explain the practical implications of both approaches, because whenever I try to find it online I am just bombarded with a lot of stuff that isn't very relevant to my project.
TLDR; Both are two different interfaces to perform gradient computation: torch.autograd.grad is non-mutating while torch.autograd.backward is mutating.
Descriptions
The torch.autograd module is the automatic differentiation package for PyTorch. As described in the documentation, it only requires minimal changes to the code base in order to be used:
you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:
torch.autograd.backward (source)
- Description: Computes the sum of gradients of given tensors with respect to graph leaves.
- Header: torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)
- Parameters:
  - tensors – Tensors of which the derivative will be computed.
  - grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of the corresponding tensors.
  - retain_graph – If False, the graph used to compute the grad will be freed. [...]
  - inputs – Inputs w.r.t. which the gradient will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].

torch.autograd.grad (source)
- Description: Computes and returns the sum of gradients of outputs with respect to the inputs.
- Header: torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
- Parameters:
  - outputs – Outputs of the differentiated function.
  - inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).
  - grad_outputs – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of the corresponding outputs.
  - retain_graph – If False, the graph used to compute the grad will be freed. [...]
Usage examples
In terms of high-level usage, you can look at torch.autograd.grad as a non-mutating function. As mentioned in the documentation table above, it will not accumulate the gradients on the grad attribute but instead return the computed partial derivatives. In contrast, torch.autograd.backward mutates the tensors by updating the grad attribute of leaf nodes; the function doesn't return any value. In other words, the latter is more suitable when computing gradients for a large number of parameters.
In the following, we will take two inputs (x1 and x2), calculate a tensor y with them, and then compute the partial derivatives of the result w.r.t. both inputs, i.e. dL/dx1 and dL/dx2:
>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor(0.3939, grad_fn=<UnbindBackward>),
tensor(0.7965, grad_fn=<UnbindBackward>))
Inference:
>>> y = x1**2 + 5*x2
>>> y
tensor(4.1377, grad_fn=<AddBackward0>)
Since y was computed using tensors that require gradients (i.e. with requires_grad=True) outside of a torch.no_grad context, it will have a grad_fn attached. This callback is used to backpropagate through the computation graph and compute the gradients of the preceding tensor nodes.
torch.autograd.grad:
Here we provide torch.ones_like(y) as the grad_outputs.
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor(0.7879), tensor(5.))
The above output is a tuple containing the two partial derivatives w.r.t. to the provided inputs respectively in order of appearance, i.e. dL/dx1 and dL/dx2.
This corresponds to the following computation:
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs * 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs * 5
torch.autograd.backward: in contrast, it mutates the graph's leaf tensors by updating the grad attribute of those that were used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here, we go through the same example, defining x1, x2, and y again, and call backward:
>>> # y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))
None
Then you can retrieve the gradients on x1.grad and x2.grad:
>>> x1.grad, x2.grad
(tensor(0.7879), tensor(5.))
In conclusion: both perform the same operation. They are two different interfaces to interact with the autograd library and perform gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural networks training loops to compute the partial derivative of the loss w.r.t each one of the model's parameters.
You can read more about how torch.autograd.grad works by reading through this other answer I made on: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
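As a rough sketch of how the backward-based interface typically shows up in a training loop (the model, data and learning rate below are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()

x, target = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()                 # clear previously accumulated .grad
    loss = criterion(model(x), target)
    loss.backward()                       # fills p.grad for every parameter
    optimizer.step()                      # reads p.grad to update the parameters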
In addition to Ivan's answer, the fact that torch.autograd.grad does not accumulate gradients into .grad can also avoid race conditions in multi-threaded scenarios.
Quoting the PyTorch docs (https://pytorch.org/docs/stable/notes/autograd.html#non-determinism):
If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
Implementation details: https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h
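A rough sketch of that functional pattern (the model, data and update rule are placeholders): the gradients are returned to the caller instead of being accumulated into shared .grad attributes, which is what makes it attractive for multi-threaded setups:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)  # placeholder model, possibly shared across threads
x, target = torch.randn(8, 10), torch.randn(8, 1)

loss = F.mse_loss(model(x), target)
params = list(model.parameters())
grads = torch.autograd.grad(loss, params)  # returned, nothing written to p.grad

with torch.no_grad():
    for p, g in zip(params, grads):
        p.add_(g, alpha=-1e-2)  # the caller decides how/when to apply the gradients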

Using Autograd .backward() function to calculate an intermediate value in the forward pass of Pytorch model

Hello, I am new to PyTorch. I have a simple PyTorch module whose output is a scalar loss function that depends on the derivative of some polynomial functions. Let's say the output of the forward pass is: input*derivative(x^2+y^2).
One way to implement this is to explicitly write down the derivatives of the polynomials used and make that part of the forward model, so output = inputs*(2x+2y). However, this is not robust, because if I include more polynomials I have to manually add more derivative functions, which is time consuming and prone to errors.
I want to initialize the polynomials, use Autograd to get their derivatives, and plug those derivatives into the output formula. Let's say the polynomial function is called n. I do n.backward(retain_graph=True) inside the forward pass. However, it does not seem to work properly, as I get very different answers (in terms of the magnitude of the derivatives of the loss function w.r.t. the model parameters) than when I use the analytic expression in the forward pass.
Note that both the output of the f.backward call and the analytic expression of the derivative match. So it is computing the derivative of the polynomials correctly, but it is having a hard time associating this with the final loss function. It seems the backward() call is also messing up the model parameters while it is trying to get the derivatives for the polynomial coefficients. I am sure this is because of my poor understanding of PyTorch, and adding the f.backward() call inside the forward pass is somehow messing up the loss.backward() call.
Here is a simplified example. The problem is that the value of model.learn.grad is not the same when using the analytic method and the autograd .backward() method:
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, grin_type='radial', loss_type='analytic', device='cpu', dtype=torch.float32):
        super(Model, self).__init__()
        self.dtype = dtype
        self.device = device
        self.loss_type = loss_type
        self.grin_type = grin_type
        self.x = torch.tensor(2., dtype=dtype, device=self.device)  # mm
        self.learn = nn.Parameter(torch.tensor(5., dtype=dtype, device=self.device))

    def forward(self, inputs, plotting=0):
        if self.loss_type == 'analytic':
            outputs = inputs * self.learn * (2. * self.x)
        elif self.loss_type == 'autograd':
            self.der = self.calc_derivative(self.x)
            outputs = inputs * self.der
        return outputs

    def poly_fun(self, x):
        return self.learn * torch.square(x)

    def calc_derivative(self, x):
        xn = x.clone().detach().requires_grad_(True)
        n = self.poly_fun(xn)
        dloss_dx = torch.autograd.grad(outputs=n, inputs=xn, create_graph=True)[0] * n / n
        return dloss_dx

GradientTape losing track of variable

I have a script that performs a Gatys-like neural style transfer. It uses style loss, and a total variation loss. I'm using the GradientTape() to compute my gradients. The losses that I have implemented seem to work fine, but a new loss that I added isn't being properly accounted for by the GradientTape(). I'm using TensorFlow with eager execution enabled.
I suspect it has something to do with how I compute the loss based on the input variable. The input is a 4D tensor (batch, h, w, channels). At the most basic level, the input is a floating point image, and in order to compute this new loss I need to convert it to a binary image to compute the ratio of one pixel color to another. I don't want to actually change the image like that during every iteration, so I just make a copy of the tensor (in numpy form) and operate on that to compute the loss. I do not fully understand the limitations of the GradientTape, but I believe it is "losing the thread" of how the input variable is used to get to the loss when the tensor is converted to a numpy array.
Could I make a copy of the image tensor and perform binarizing operations & loss computation using that? Or am I asking tensorflow to do something that it just can not do?
My new loss function:
def compute_loss(self, **kwargs):
    loss = 0
    image = self.model.deprocess_image(kwargs['image'].numpy())
    binarized_image = self.image_decoder.binarize_image(image)
    volume_fraction = self.compute_volume_fraction(binarized_image)
    loss = np.abs(self.volume_fraction_target - volume_fraction)
    return loss
My implementation using the GradientTape:
def compute_grads_and_losses(self, style_transfer_state):
    """
    Computes gradients with respect to input image
    """
    with tf.GradientTape() as tape:
        loss = self.loss_evaluator.compute_total_loss(style_transfer_state)
        total_loss = loss['total_loss']
    return tape.gradient(total_loss, style_transfer_state['image']), loss
An example that I believe might illustrate my confusion. The strangest thing is that my code doesn't have any problem running; it just doesn't seem to minimize the new loss term whatsoever. But this example won't even run due to an attribute error: AttributeError: 'numpy.float64' object has no attribute '_id'.
Example:
import tensorflow.contrib.eager as tfe
import tensorflow as tf

def compute_square_of_value(x):
    a = turn_to_numpy(x['x'])
    return a**2

def turn_to_numpy(arg):
    return arg.numpy()  # just return arg to eliminate the error

tf.enable_eager_execution()
x = tfe.Variable(3.0, dtype=tf.float32)
data_dict = {'x': x}
with tf.GradientTape() as tape:
    tape.watch(x)
    y = compute_square_of_value(data_dict)
dy_dx = tape.gradient(y, x)  # will compute to 6.0
print(dy_dx)
Edit:
From my current understanding, the issue is that my use of the .numpy() operation makes the GradientTape lose track of the variable to compute the gradient from. My original reason for doing this is that my loss operation requires me to physically change values of the tensor, and I don't want to actually change the values of the tensor that is being optimized. Hence the use of the numpy() copy to work on in order to compute the loss properly. Is there any way around this? Or shall I consider my loss calculation impossible to implement because of this constraint of having to perform essentially non-reversible operations on the input tensor?
The first issue here is that GradientTape only traces operations on tf.Tensor objects. When you call tensor.numpy() the operations executed there fall outside the tape.
The second issue is that your first example never calls tape.watch on the image you want to differentiate with respect to.
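A minimal sketch of both points, written in TF 2 style where eager execution is the default (the computation here is just a stand-in): keep every operation as a tf op so the tape can trace it, and watch the tensor you differentiate with respect to:

import tensorflow as tf

x = tf.Variable(3.0, dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(x)        # trainable Variables are watched automatically, but explicit is fine
    y = tf.square(x)     # tf op instead of x.numpy()**2, so it stays on the tape
dy_dx = tape.gradient(y, x)
print(dy_dx)             # tf.Tensor(6.0, ...)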

Minimization and maximization at the same time in PyTorch

I would like to know how to take gradient steps for the following mathematical operation in PyTorch (A, B and C are PyTorch modules whose parameters do not overlap)
This is somewhat different than the cost function of a Generative Adversarial Network (GAN), so I cannot use examples for GANs off the shelf, and I got stuck while trying to adapt them for the above cost.
One approach I thought of is to construct two optimizers. Optimizer opt1 has the parameters for the modules A and B, and optimizer opt2 has the parameters of module C. One can then:
take a step for minimizing the cost function for C
run the network again with the same input to get the costs (and intermediate outputs) again
take a step with respect to A and B.
I am sure there must be a better way to do this with PyTorch (maybe using some detach operations), possibly without running the network again. Any help is appreciated.
Yes, it is possible without going through the network twice, which would both waste resources and be mathematically wrong: since the weights would have changed between the two passes, so would the loss, and you would be introducing a delay, which may be interesting but is not what you are trying to achieve.
First, create two optimizers just as you said. Compute the loss, then call backward. At this point, the gradients for the parameters of A, B and C have been filled, so now you just have to call the step method of the optimizer minimizing the loss, but not of the one maximizing it. For the latter, you need to reverse the sign of the gradients of the leaf parameter tensors of C.
import torch

def d(y, x):
    return torch.pow(y.abs(), x + 1)

A = torch.nn.Linear(1, 2)
B = torch.nn.Linear(2, 3)
C = torch.nn.Linear(2, 3)

optimizer1 = torch.optim.Adam((*A.parameters(), *B.parameters()))
optimizer2 = torch.optim.Adam(C.parameters())

x = torch.rand((10, 1))
loss = (d(B(A(x)), x) - d(C(A(x)), x)).sum()

optimizer1.zero_grad()
optimizer2.zero_grad()

loss.backward()
for p in C.parameters():
    if p.grad is not None:  # in general, C is a NN, with requires_grad=False for some layers
        p.grad.data.mul_(-1)  # update of grad.data not tracked in computation graph

optimizer1.step()
optimizer2.step()
NB: I have not checked mathematically if the result is correct but I assume it is.

How to use model.reset_states() in Keras?

I have sequential data and I declared an LSTM model which predicts y from x in Keras. So if I call model.predict(x1) and model.predict(x2), is it correct to explicitly call model.reset_states between the two predict() calls? And model.reset_states clears the history of inputs, not the weights, right?
# data1
x1 = [2,4,2,1,4]
y1 = [1,2,3,2,1]
# data2
x2 = [5,3,2,4,5]
y2 = [5,3,2,3,2]
And in my actual code, I use model.evaluate(). In evaluate(), is reset_states called implicitly for each data sample?
model.evaluate(dataX, dataY)
reset_states clears only the hidden states of your network. It's worth mentioning that, depending on whether the option stateful=True was set in your network, the behaviour of this function might differ. If it's not set, all states are automatically reset after every batch computation in your network (so e.g. after calling fit, predict and evaluate). If it is set, you should call reset_states every time you want to make consecutive model calls independent.
If you use explicitly either of:
model.reset_states()
to reset the states of all layers in the model, or
layer.reset_states()
to reset the states of a specific stateful RNN layer (also LSTM layer), implemented here:
def reset_states(self, states=None):
    if not self.stateful:
        raise AttributeError('Layer must be stateful.')
this means your layer(s) must be stateful.
In LSTM you need to:
explicitly specify the batch size you are using, by passing a batch_size argument to the first layer in your model or a batch_input_shape argument,
set stateful=True.
specify shuffle=False when calling fit().
The benefits of using stateful models are probably best explained here.
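A minimal sketch of such a stateful setup (units, shapes and data are placeholders), where the hidden state carries over between calls and is cleared manually between independent sequences:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(8, stateful=True, batch_input_shape=(1, 5, 1)),
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x1 = np.random.rand(1, 5, 1).astype('float32')
x2 = np.random.rand(1, 5, 1).astype('float32')

model.predict(x1)      # hidden state now reflects x1
model.reset_states()   # clear it so x2 is treated as an independent sequence
model.predict(x2)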
