I was under the (evidently wrong) impression from the documentation that torch.no_grad(), as a context manager, was supposed to make everything requires_grad=False. Indeed, that's what I intended to use torch.no_grad() for: as a convenient context manager for instantiating a bunch of things that I want to stay constant through training. But that only seems to be the case for torch.Tensors; it doesn't appear to affect torch.nn.Modules, as the following example code shows:
import torch

with torch.no_grad():
    linear = torch.nn.Linear(2, 3)
    for p in linear.parameters():
        print(p.requires_grad)
This will output:
True
True
That's a bit counterintuitive in my opinion. Is this the intended behaviour? If so, why? And is there a similarly convenient context manager in which I can be assured that anything I instantiate under it will not require gradient?
This is expected behavior, but I agree it is somewhat unclear from the documentation. Note that the documentation says:
In this mode, the result of every computation will have
requires_grad=False, even when the inputs have requires_grad=True.
This context manager disables gradient tracking on the output of any computation done inside it. Technically, declaring/creating a layer is not a computation, so the parameters' requires_grad stays True. However, for any calculation you perform inside this context, you won't be able to compute gradients: requires_grad on the output of the calculation will be False. This is probably best explained by extending your code snippet as below:
with torch.no_grad():
    linear = torch.nn.Linear(2, 3)
    for p in linear.parameters():
        print(p.requires_grad)
    out = linear(torch.rand(10, 2))
    print(out.requires_grad)

out = linear(torch.rand(10, 2))  # same call, but outside the no_grad context
print(out.requires_grad)
True
True
False
True
Even though requires_grad for the layer parameters is True, you won't be able to compute gradients, because the output has requires_grad=False.
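If the goal is to keep the layer's parameters constant during training, one option (independent of any context manager) is to switch off requires_grad on the parameters right after instantiation. A minimal sketch:

import torch

linear = torch.nn.Linear(2, 3)
for p in linear.parameters():
    p.requires_grad_(False)  # freeze this parameter explicitly

print([p.requires_grad for p in linear.parameters()])  # [False, False]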
Related
I am trying to train a neural network which takes an input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) are passed to the same model. This process keeps repeating for a couple of steps.
The goal of optimization is to ensure the distance between s_t0 and s_t1 is small through self-supervision, as s_t1 is supposed to be a transformed version of s_t0. In other words, I want s_t1 to only carry the new information in the new input. My intuition tells me that taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid that won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.
Currently, the way I train the model is by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1). Then I call loss.backward() and optimizer.step(), which changes the weights. Note that the reason I use abs() is that the hidden states are produced after applying ReLU, so they only hold positive values.
However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both s_t0 and s_t1 to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization? What is the best way to achieve this? Would I be able to somehow use mutual information for this?
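For reference, a minimal sketch of the training step described above; the model, dimensions, number of steps, and optimizer are placeholder assumptions, and the per-sample loss is reduced to a scalar before calling backward():

import torch

# Placeholder model: consumes the current input concatenated with the previous state.
f = torch.nn.Sequential(torch.nn.Linear(16 + 8, 8), torch.nn.ReLU())
optimizer = torch.optim.Adam(f.parameters())

s_prev = torch.zeros(4, 8)                           # s_t0: initial hidden state, batch of 4
for t in range(5):                                   # a few time steps
    input_t = torch.rand(4, 16)                      # placeholder input at step t
    s_next = f(torch.cat([input_t, s_prev], dim=1))  # s_t1
    loss = torch.abs(s_next - s_prev).mean()         # reduced to a scalar
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    s_prev = s_next.detach()                         # carry the state to the next step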
Is there some way to put a constraint on the outputs generated by TensorFlow? For example, if my model produces two outputs a and b, could you enforce something like (a+b)/2 < 10 in advance, so the model wouldn't break this rule?
Thanks in advance
If by "generated by TensorFlow" you mean generated by a neural network, I don't think it is possible in general. You can't really guarantee that the output of a neural network never violates such hard constraints, especially at test time.
Here's what you could do:
Add a loss term, something like max(0, (a+b)/2 - 10); see the sketch after this list. This will not guarantee that your constraint is never violated (the optimization of the NN is "best-effort"). This loss term is, by the way, very similar to the hinge loss used in support vector machines.
Use an appropriate activation function. E.g., if you know your data must lie in [0, 1], use the sigmoid activation on the output.
"Project" the output back to the allowed range if it is outside of it.
While the last two options guarantee feasibility, it is not always possible to do that, or it is not clear how to do it, and (even worse) how it will affect the learning. For example, if you see that (a+b)/2 >= 10, what will you do? Will you decrease b until the constraint is fulfilled, or trade off a and b somehow? Sometimes it is possible to define the "closest feasible point" w.r.t. some metric, but not in general.
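A minimal sketch of the first option as a soft penalty in PyTorch; the outputs a and b and the penalty weight are placeholders:

import torch

# Placeholder outputs; in practice a and b would come from the network.
a = torch.rand(32, requires_grad=True)
b = torch.rand(32, requires_grad=True)

# Hinge-style penalty: zero while (a + b) / 2 <= 10, grows linearly once violated.
penalty = torch.clamp((a + b) / 2 - 10, min=0).mean()
task_loss = ((a - b) ** 2).mean()   # placeholder task loss
loss = task_loss + 1.0 * penalty    # 1.0 is a placeholder penalty weight
loss.backward()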
The official comment says "This has any effect only on modules such as Dropout or BatchNorm." But I don't understand how it is implemented.
Dropout and BatchNorm (and maybe some custom modules) behave differently during training and evaluation. You must let the model know when to switch to eval mode by calling .eval() on the model.
This sets self.training to False for every module in the model. If you are implementing your own module that must behave differently during training and evaluation, you can check the value of self.training while doing so.
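For instance, here is a minimal (hypothetical) custom module that branches on self.training:

import torch

class NoiseDuringTraining(torch.nn.Module):  # hypothetical example module
    def forward(self, x):
        if self.training:  # True after model.train(), False after model.eval()
            return x + 0.1 * torch.randn_like(x)
        return x

m = NoiseDuringTraining()
x = torch.ones(3)
m.train()
print(m(x))  # noisy during training
m.eval()
print(m(x))  # unchanged during evaluation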
Here is the customLayer.py.
I am quite confused about the following things:
The input of the inner layer is not a Variable. Then in backward it becomes a Variable and requires gradient. Why?
grad_output is a Variable, yet requires_grad is False. Why is it not True?
In my custom layer, I need to customize the forward and backward operations. It is quite complicated. See the same link; I have posted my questions in it.
The gradients are updated through your loss computation and are required for the backpropagation. If you don't have gradients, you can't train your network.
Probably because you don't want the gradients to persist on the variable; they are only needed temporarily, for one backward pass.
Why do you need a custom backward function? Do you need extra operations on your backpropagation?
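If you do need your own backward, here is a minimal sketch using the static-method Function API of recent PyTorch versions (the scaling operation is purely illustrative):

import torch

class ScaleByTwo(torch.autograd.Function):  # illustrative custom op
    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_output):
        # d(2x)/dx = 2, so scale the incoming gradient accordingly.
        return 2 * grad_output

x = torch.randn(5, requires_grad=True)
ScaleByTwo.apply(x).sum().backward()
print(x.grad)  # a tensor of twos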
I have three simple questions.
What will happen if my custom loss function is not differentiable? Will PyTorch throw an error or do something else?
If I declare a loss variable in my custom function which will represent the final loss of the model, should I set requires_grad = True for that variable, or does it not matter? If it doesn't matter, why?
I have seen people sometimes write a separate layer and compute the loss in the forward function. Which approach is preferable, writing a function or a layer? Why?
I need a clear and nice explanation of these questions to resolve my confusion. Please help.
Let me have a go.
This depends on what you mean by "non-differentiable". The first definition that makes sense here is that PyTorch doesn't know how to compute gradients. If you try to compute gradients nevertheless, this will raise an error. The two possible scenarios are:
a) You're using a PyTorch operation for which gradients have not been implemented, e.g. torch.svd(). In that case you will get a TypeError:
import torch
from torch.autograd import Function
from torch.autograd import Variable
A = Variable(torch.randn(10,10), requires_grad=True)
u, s, v = torch.svd(A) # raises TypeError
b) You have implemented your own operation, but did not define backward(). In this case, you will get a NotImplementedError:
class my_function(Function):  # forgot to define backward()

    def forward(self, x):
        return 2 * x

A = Variable(torch.randn(10, 10))
B = my_function()(A)
C = torch.sum(B)
C.backward()  # will raise NotImplementedError
The second definition that makes sense is "mathematically non-differentiable". Clearly, an operation which is mathematically not differentiable should either not have a backward() method implemented, or have it return a sensible sub-gradient. Consider for example torch.abs(), whose backward() method returns the subgradient 0 at 0:
A = Variable(torch.Tensor([-1, 0, 1]), requires_grad=True)
B = torch.abs(A)
B.backward(torch.Tensor([1, 1, 1]))
A.grad.data  # [-1, 0, 1]: the derivative is sign(x), with the subgradient at 0 taken to be 0
For such cases, you should refer to the PyTorch documentation directly and dig out the backward() method of the respective operation.
It doesn't matter. The purpose of requires_grad is to avoid unnecessary computation of gradients for subgraphs. If a single input to an operation requires gradient, its output will also require gradient. Conversely, only if all inputs don't require gradient will the output not require it either. Backward computation is never performed in subgraphs where no Variable requires gradients.
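A quick illustration of that propagation rule (written with the current Tensor API; with older versions the tensors would be wrapped in Variable):

import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3)  # requires_grad defaults to False

print((a + b).requires_grad)  # True: at least one input requires gradients
print((b * 2).requires_grad)  # False: no input requires gradients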
Since there are most likely some Variables that require gradients (for example, the parameters of a subclass of nn.Module), your loss Variable will also require gradients automatically. However, note that because of how requires_grad works (see above), you can only change requires_grad for leaf variables of your graph anyway.
The PyTorch loss functions are subclasses of _Loss, which is a subclass of nn.Module. See here. If you'd like to stick to this convention, you should subclass _Loss when defining your custom loss function. Apart from consistency, one advantage is that your subclass will raise an AssertionError if you haven't marked your target variables as volatile or requires_grad = False. Another advantage is that you can nest your loss function in nn.Sequential(), because it is an nn.Module. I would recommend this approach for these reasons.
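For illustration, a minimal custom loss written as a module; it subclasses nn.Module directly rather than the private _Loss, so the sketch stays self-contained:

import torch

class MeanAbsoluteLoss(torch.nn.Module):  # illustrative custom loss
    def forward(self, prediction, target):
        return torch.abs(prediction - target).mean()

model = torch.nn.Linear(4, 1)
criterion = MeanAbsoluteLoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = criterion(model(x), y)
loss.backward()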