Why is grad_output's requires_grad False in PyTorch?

Here is the customLayer.py.
I am quite confused about the following things:
In forward, the input to the inner layer is not a Variable, but in backward it becomes a Variable and requires a gradient. Why?
grad_output is a Variable, yet its requires_grad is False. Why is it not True?
In my custom layer I need to customize both the forward and backward operations, which is quite complicated. See the same link; I have posted my questions there.

The gradients are computed from your loss and are required for backpropagation. If you don't have gradients, you can't train your network.
Probably because you don't want the gradient to persist on the variable; it is only needed temporarily, for a single backward pass.
Why do you need a custom backward function? Do you need extra operations during backpropagation?
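To make the grad_output point concrete, here is a minimal sketch using the modern torch.autograd.Function API (the Variable-era API from the question is deprecated; the example operation, scaling by two, is made up):
import torch

class ScaleByTwo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output arrives as a plain tensor with requires_grad=False;
        # autograd only records history here if double backward is requested
        # via backward(create_graph=True)
        return grad_output * 2

x = torch.randn(3, requires_grad=True)
y = ScaleByTwo.apply(x)
y.sum().backward()
print(x.grad)  # every entry is 2.0, since dy/dx = 2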

Related

How to "manually" apply your gradients in PyTorch?

What would be the equivalent in PyTorch of the following TensorFlow code, where loss is the loss calculated in the current iteration and net is the neural network?
with tf.GradientTape() as tape:
    loss = ...  # the loss must be computed while the tape is recording
grads = tape.gradient(loss, net.trainable_variables)
optimizer.apply_gradients(zip(grads, net.trainable_variables))
So, we compute the gradients of all the trainable variables in our network according to the loss function, and in the next line we apply those gradients via the optimizer. In the use case I have, this is the way to do it and it works fine.
Now, how would I do the same in PyTorch? I am aware of the "standard" way:
optimizer.zero_grad()
loss.backward()
optimizer.step()
That is, however, not applicable in my case. So how can I apply the gradients "manually"? Google doesn't help, unfortunately, although I think it is probably a rather simple question.
Hope one of you can enlighten me!
Thanks!
Let's break down the standard PyTorch way of doing updates; hopefully that will clarify what you want.
In PyTorch, each NN parameter has a .data and a .grad attribute. .data is the actual weight tensor, and .grad is the attribute that holds the gradient; it is None if the gradient has not been computed yet. With this knowledge, let's walk through the update steps.
First, we do optimizer.zero_grad(). This zeros out or empties the .grad attribute. .grad may be None already if you never computed the gradients.
Next, we do loss.backward(). This is the backprop step that will compute and update each parameter's .grad attribute.
Once we have gradients, we want to update the weights with some rule (SGD, Adam, etc.), and we do optimizer.step(). This will iterate over all the parameters and update them using the computed .grad attributes.
So, to apply gradients manually, you can replace optimizer.step() with a for loop like the one below:
for param in model.parameters():
    param.data = custom_rule(param.data, param.grad, learning_rate, **any_other_arguments)
and that should do the trick.
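For instance, here is a minimal, self-contained sketch with plain SGD standing in for custom_rule (the toy model is an assumption; wrapping the update in torch.no_grad() keeps it out of the autograd graph, which is the currently recommended style over mutating .data):
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # toy model (assumption)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()                                # fills each param.grad

learning_rate = 0.01
with torch.no_grad():
    # plain SGD as the "custom rule": w <- w - lr * grad
    for param in model.parameters():
        param -= learning_rate * param.grad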

Can I change a tensor's value without affecting backpropagation in TensorFlow 2.x?

I have customized a tensor operation (a quantization method that modifies the value), but I don't want the quantization to affect the backpropagation process. I expect backpropagation to work the same way as before, just with the value quantized by me.
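One common way to get the behaviour described here is the straight-through pattern built on tf.stop_gradient: the forward pass uses the quantized value, while the gradient flows as if the quantization were the identity. A minimal sketch, where quantize is a hypothetical rounding quantizer:
import tensorflow as tf

def quantize(x):
    # hypothetical quantizer: round to one decimal place
    return tf.round(x * 10.0) / 10.0

x = tf.Variable([0.123, 0.456])
with tf.GradientTape() as tape:
    # forward pass uses the quantized value, but tf.stop_gradient hides the
    # quantization step from backprop, so gradients flow as if y == x
    y = x + tf.stop_gradient(quantize(x) - x)
    loss = tf.reduce_sum(y ** 2)

grads = tape.gradient(loss, [x])  # 2 * quantize(x), not None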

Ensuring that optimization does not find the trivial solution by setting weights to 0

I am trying to train a neural network which takes an input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) are passed to the same model. This process keeps repeating for a number of steps.
The goal of optimization is to keep the distance between s_t0 and s_t1 small through self-supervision, since s_t1 is supposed to be a transformed version of s_t0. In other words, I want s_t1 to carry only the new information in the new input. My intuition tells me that taking the norm of the weights and ensuring it does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid this won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.
Currently I train the model by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1), then calling loss.backward() and optimizer.step(), which updates the weights. (The reason I use abs() is that the hidden states are produced after applying ReLU, so they hold only positive values.)
However, I noticed that optimization quickly finds the trivial solution of setting the weights to 0. This makes both s_t0 and s_t1 smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization? What is the best way to achieve this? Would I be able to somehow use mutual information for this?
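For reference, a minimal toy reconstruction of the training step described above (the model, sizes, and optimizer are assumptions), with the loss reduced all the way to a scalar so loss.backward() can be called directly:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())   # toy stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

input_t0 = torch.randn(8, 16)
input_t1 = torch.randn(8, 16)

s_t0 = model(input_t0)            # initial hidden state
s_t1 = model(input_t1 + s_t0)     # next state from the new input and s_t0

# the question's .mean(dim=1) leaves a per-sample vector; reduce it to a
# scalar so loss.backward() can be called without extra arguments
loss = torch.abs(s_t1 - s_t0).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()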

PyTorch training with dropout and/or batch-normalization

A model should be set to evaluation mode for inference by calling model.eval().
Do we also need to do this during training, before getting the model outputs? For example, within a training epoch, if the network contains one or more dropout and/or batch-normalization layers.
If this is not done, won't the output of the forward pass in a training epoch be affected by the randomness of the dropout?
Many example codes do not do this and something along these lines is the common approach:
for t in range(num_epochs):
    # forward pass
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
For example, here is an example to look at: convolutional_neural_network/main.py
Should this instead be?
for t in range(num_epochs):
    # forward pass
    model.eval()  # disable dropout etc.
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    model.train()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
TLDR:
Should this instead be?
No!
Why?
More explanation:
Different Modules behave differently depending on whether they are in training or evaluation/test mode.
BatchNorm and Dropout are only two examples of such modules, basically any module that has a training phase follows this rule.
When you do .eval(), you are signaling all modules in the model to shift operations accordingly.
Update
The answer is: during training you should not use eval mode. As long as you have not set eval mode, dropout will be active and act randomly in each forward pass. Similarly, all other modules that have two phases will behave accordingly; for instance, BN will always update its running mean/var on each pass. Also, if you use a batch size of 1 in training mode, BN will raise an error, as it cannot compute batch statistics from a single sample.
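A tiny sketch of that batch-size-1 behaviour (the layer size is arbitrary):
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(1, 4)   # a batch containing a single sample

bn.train()
# bn(x) here raises "Expected more than 1 value per channel when training":
# batch statistics cannot be computed from one sample

bn.eval()
y = bn(x)               # fine: eval mode uses the stored running mean/var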
As was pointed out in the comments, during training you should not call eval() before the forward pass: it effectively disables every module that has distinct train/test phases, such as BN and Dropout (basically any module with statistics that are updated during training, or one that affects network topology, like dropout), and you will lose their contribution to your network's learning. So don't code like that!
Let me explain a bit what happens during training:
When you are in training mode, each of the modules that make up your model may have two modes, training and test. These modules either have running statistics that need to be updated during training, like BN, or affect the network topology in some sense, like Dropout (by disabling some activations during the forward pass). Some modules, such as ReLU(), operate identically in both modes and are unaffected by a mode change.
In training mode, you feed in an image; it passes through the layers until it hits a dropout layer, where some activations are disabled and their responses omitted from the next layer. The output continues through the remaining layers until it reaches the end of the network and you get a prediction.
The prediction may be correct or wrong, and the weights are updated accordingly: if the answer was right, the features and combinations of features that produced the correct answer are reinforced, and vice versa.
So during training you do not need to, and should not, disable dropout: it affects the output, and it should affect it, so that the model learns a better set of features.
I hope this makes it a bit clearer for you. If you still feel you need more, say so in the comments.
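To make the recommended pattern concrete, here is a minimal sketch (the model, data, and hyperparameters are made up): train mode for the whole training loop, eval mode only for validation or inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5), nn.Linear(4, 2))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 4), torch.randn(32, 2)
num_epochs = 10

model.train()                  # dropout active, BN updates running stats
for t in range(num_epochs):
    yhat = model(x)
    loss = criterion(yhat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()                   # switch modes only for validation/inference
with torch.no_grad():
    val_pred = model(x)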

net.zero_grad() vs optim.zero_grad() in PyTorch

Here they mention the need to include optim.zero_grad() when training, to zero the parameter gradients. My question is: could I do net.zero_grad() as well, and would that have the same effect? Or is it necessary to do optim.zero_grad()? Moreover, what happens if I do both? If I do neither, the gradients get accumulated, but what does that mean exactly, do they get added? In other words, what's the difference between doing optim.zero_grad() and net.zero_grad()? I am asking because here, at line 115, they use net.zero_grad(), and it is the first time I see that. That is an implementation of a reinforcement learning algorithm, where one has to be especially careful with gradients because there are multiple networks and gradients, so I suppose there is a reason for them to do net.zero_grad() as opposed to optim.zero_grad().
net.zero_grad() sets the gradients of all its parameters (including parameters of submodules) to zero. If you call optim.zero_grad() that will do the same, but for all parameters that have been specified to be optimised. If you are using only net.parameters() in your optimiser, e.g. optim = Adam(net.parameters(), lr=1e-3), then both are equivalent, since they contain the exact same parameters.
You could have other parameters that are being optimised by the same optimiser, which are not part of net, in which case you would either have to manually set their gradients to zero and therefore keep track of all the parameters, or you can simply call optim.zero_grad() to ensure that all parameters that are being optimised, had their gradients set to zero.
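A small illustration of that case (the extra parameter and toy model are made up):
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
extra = nn.Parameter(torch.zeros(2))     # optimised, but not part of net
optim = torch.optim.Adam(list(net.parameters()) + [extra], lr=1e-3)

loss = (net(torch.randn(8, 4)) + extra).pow(2).mean()
loss.backward()

net.zero_grad()     # clears net's gradients only; extra.grad is untouched
optim.zero_grad()   # clears gradients of everything the optimiser manages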
Moreover, what happens if I do both?
Nothing, the gradients would just be set to zero again, but since they were already zero, it makes absolutely no difference.
If I do neither, the gradients get accumulated, but what does that mean exactly? Do they get added?
Yes, they are added to the existing gradients. In the backward pass, the gradients with respect to every parameter are calculated and then added to that parameter's gradient (param.grad). This allows you to have multiple backward passes affecting the same parameters, which would not be possible if the gradients were overwritten instead of accumulated.
For example, you could accumulate the gradients over multiple batches if you need bigger batches for training stability but don't have enough memory to increase the batch size. This is trivial to achieve in PyTorch: essentially, you leave out optim.zero_grad() and delay optim.step() until enough batches have been gathered, as shown in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
That flexibility comes at the cost of having to manually set the gradients to zero. Frankly, one line is a very small cost to pay, even though many users won't make use of it and especially beginners might find it confusing.
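For reference, a minimal sketch of that accumulation pattern (the toy model and fake loader are assumptions):
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# fake data loader: 8 small batches of (input, target)
loader = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(8)]
accumulation_steps = 4     # effective batch = 4 x the actual batch size

for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accumulation_steps  # keep gradient scale
    loss.backward()        # gradients are *added* into each param.grad
    if (i + 1) % accumulation_steps == 0:
        optim.step()
        optim.zero_grad()  # reset only after the accumulated update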
