In short, I want to enable "create_graph" when doing loss.backward() while using torch.nn layers, so that I can later backpropagate through the final weights to get their gradient w.r.t. a hyperparameter.
In detail, I am implementing an algorithm to solve a bilevel problem (two nested problems). One can look at the parameters optimised in the inner problem as the weights of a torch.nn based model, and at the parameters of the outer problem as the hyperparameters. I want to optimise both with gradient descent. Thus, to do one update on the hyperparameters, I need the gradient of the model's weights (after being trained) with respect to these hyperparameters. This is because the loss function for the hyperparameter optimisation is a function of the trained model weights.
The problem is that even when I set create_graph=True when backpropagating the inner loss, optimizer.step() performs in-place updates, so the graph cannot be created. The same happens when I replace optimizer.step() with manual updates of the model weights, since these are still in-place:
for name, param in model.named_parameters():
    param.data = param.data - param.grad
A simplified code of what I want to do:
for t in range(OUT_MAX_ITR):
    model.train()
    for i in range(IN_MAX_ITR):
        optimizer.zero_grad()
        outputs = model(xtr)
        loss = compute_loss(outputs)
        loss.backward(create_graph=True)
        optimizer.step()
    theta.grad = None
    out_loss = function_of_model_weights()
    out_loss.backward()
    update_theta(theta, theta.grad)
Here theta is the hyperparameter to be optimised.
Is there a way, or a workaround, in torch to do that second-order differentiation (or bilevel optimisation) when working with torch.nn?
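For reference, a common workaround is to keep the inner-loop weights as plain tensors that stay in the graph, update them out of place, and run the model functionally. Below is a minimal sketch, assuming a recent PyTorch where torch.func.functional_call is available, and reusing the (hypothetical) names from the snippet above:

import torch
from torch.func import functional_call

inner_lr = 0.1
# start from the current weights as leaf tensors
params = {name: p.detach().clone().requires_grad_(True)
          for name, p in model.named_parameters()}

for i in range(IN_MAX_ITR):
    outputs = functional_call(model, params, (xtr,))
    loss = compute_loss(outputs)   # assumed to depend on theta (e.g. a regularisation weight)
    grads = torch.autograd.grad(loss, tuple(params.values()), create_graph=True)
    # out-of-place update: the new weights keep the graph back to theta
    params = {name: p - inner_lr * g
              for (name, p), g in zip(params.items(), grads)}

out_loss = function_of_model_weights(params)   # outer loss on the trained weights
theta_grad, = torch.autograd.grad(out_loss, theta)
update_theta(theta, theta_grad)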
Related
What would be the equivalent in PyTorch of the following TensorFlow code, where loss is the loss calculated in the current iteration and net is the neural network?
with tf.GradientTape() as tape:
    grads = tape.gradient(loss, net.trainable_variables)
optimizer.apply_gradients(zip(grads, net.trainable_variables))
So, we compute the gradients of the loss with respect to all the trainable variables in our network. In the next line we apply the gradients via the optimizer. In my use case, this is the way to do it and it works fine.
Now, how would I do the same in Pytorch? I am aware of the "standard" way:
optimizer.zero_grad()
loss.backward()
optimizer.step()
That is, however, not applicable for me. So how can I apply the gradients "manually"? Google doesn't help unfortunately, although I think it is probably a rather simple question.
Hope one of you can enlighten me!
Thanks!
Let's break down the standard PyTorch way of doing updates; hopefully, that will clarify what you want.
In PyTorch, each NN parameter has a .data and a .grad attribute. .data is the actual weight tensor, and .grad is the attribute that will hold the gradient. It is None if the gradient has not been computed yet. With this knowledge, let's walk through the update steps.
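A quick, self-contained illustration of those two attributes (just a sketch):

import torch

w = torch.nn.Linear(2, 1).weight
print(w.data.shape)   # the raw weight tensor, torch.Size([1, 2])
print(w.grad)         # None, since no backward pass has run yet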
First, we do optimizer.zero_grad(). This zeros out or empties the .grad attribute. .grad may be None already if you never computed the gradients.
Next, we do loss.backward(). This is the backprop step that will compute and update each parameter's .grad attribute.
Once we have gradients, we want to update the weights with some rule (SGD, ADAM, etc.), and we do optimizer.step(). This will iterate over all the parameters and update the weights correctly using the computed .grad attributes.
So, now, to apply gradients manually, you can replace the optimizer.step() with a for loop like the one below:
for param in model.parameters():
    param.data = custom_rule(param.data, param.grad, learning_rate, **any_other_arguments)
and that should do the trick.
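For instance, plain SGD written with such a rule might look like the sketch below; wrapping the update in torch.no_grad() and updating in place is the usual alternative to assigning through .data:

learning_rate = 0.01

with torch.no_grad():                     # keep the manual update out of the autograd graph
    for param in model.parameters():
        if param.grad is not None:
            param -= learning_rate * param.grad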
I am new to PyTorch. May I ask what the difference is between adding loss.item() or not? Compare the following two pieces of code:
for epoch in range(epochs):
    trainingloss = 0
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X, n)
        criterion.backward()
        optimizer.step()
        trainingloss += criterion.item()
and this
for epoch in range(epochs):
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X, n)
        criterion.backward()
        optimizer.step()
If anyone has any idea please help. Thank you very much.
Calling loss.item() gives you the loss value as a plain Python number, detached from the computation graph that PyTorch creates (that is what .item() does for PyTorch tensors).
If you add the line trainingloss += criterion.item() at the end of each "batch loop", this will keep track of the batch loss throughout the iteration by incrementally adding the loss for each minibatch in your training set. This is necessary since you are using minibatches - the loss for each minibatch will not be equal to the loss over all the batches.
Note: If you use PyTorch tensors outside the optimization loop, e.g. in a different scope (which could happen if you do something like return loss), it is crucial that you call .item() on any tensor that is part of the computation graph (as a general rule of thumb, any outputs/losses/models that interact with PyTorch methods will likely be part of your computation graph). If you don't, the computation graph may not be de-allocated/deleted from Python memory, which can lead to CPU/GPU memory leaks. What you have above looks correct, though!
Also, in the future, PyTorch's DataLoader class can help you with minibatches with less boilerplate code - it can loop over your dataset such that each item you loop over is a training batch - i.e. you don't require two for loops in your optimization.
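For example, here is a minimal sketch of that DataLoader pattern (X, Y, batch_size and loss_fn are assumed names for your inputs, targets, batch size and loss function):

from torch.utils.data import TensorDataset, DataLoader

loader = DataLoader(TensorDataset(X, Y), batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    training_loss = 0.0
    for xb, yb in loader:                   # each item is already a minibatch
        optimizer.zero_grad()
        batch_loss = loss_fn(model(xb), yb)
        batch_loss.backward()
        optimizer.step()
        training_loss += batch_loss.item()  # .item() gives a detached Python float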
I hope you enjoy learning/using PyTorch!
In your training loop, the criterion.backward() part computes the gradient of each trainable parameter of the feed-forward path, then the optimizer.step() part updates the parameters based on the computed gradients and the chosen optimization technique. At the end of this step, the training of the model on that particular batch is finished, and the trainingloss += criterion.item() part is only for tracking and monitoring the training process and the loss value at each step of training.
A model should be set in the evaluation mode for inference by calling model.eval().
Do we need to also do this during training before getting the model outputs? Like within a training epoch if the network contains one or more dropout and/or batch-normalization layers.
If this is not done then the output of the forward pass in the training epoch might be affected by the randomness in the dropout?
Many example codes do not do this and something along these lines is the common approach:
for t in range(num_epochs):
    # forward pass
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
For example, here is some example code to look at: convolutional_neural_network/main.py
Should this instead be?
for t in range(num_epochs):
    # forward pass
    model.eval()  # disable dropout etc.
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    model.train()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
TLDR:
Should this instead be?
No!
Why?
More explanation:
Different Modules behave differently depending on whether they are in training or evaluation/test mode.
BatchNorm and Dropout are only two examples of such modules; basically, any module that has a distinct training phase follows this rule.
When you do .eval(), you are signaling all modules in the model to shift operations accordingly.
Update
The answer is: during training you should not use eval mode, and yes, as long as you have not set eval mode, dropout will be active and act randomly on each forward pass. Similarly, all other modules that have two phases will perform accordingly. That is, BN will always update its mean/var on each pass; also, if you use a batch_size of 1, it will error out, since BN cannot work on a batch of 1.
As was pointed out in the comments, it should be noted that during training you should not call eval() before the forward pass, as that effectively disables all modules that behave differently between train and test mode, such as BN and Dropout (basically any module that has updateable statistics, or that impacts network topology like dropout); they will be disabled and you will not see them contributing to your network's learning. So don't code like that!
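For reference, a minimal sketch of the usual pattern (train_loader, val_loader and criterion are assumed names):

for epoch in range(num_epochs):
    model.train()                     # dropout active, BN updates its running stats
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()                      # dropout off, BN uses its running stats
    with torch.no_grad():             # no graph needed for validation
        for x, y in val_loader:
            val_loss = criterion(model(x), y)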
Let me explain a bit what happens during training:
The modules that make up your model may have two modes, training and test. These modules either have internal parameters/statistics that need to be updated during training, like BN, or affect the network topology in a sense, like Dropout (by disabling some features during the forward pass). Some modules, such as ReLU(), operate the same way in both modes and thus do not change when the mode changes.
In training mode, you feed an image; it passes through the layers until it reaches a dropout layer, where some features are disabled, so their responses to the next layer are omitted. The output then goes through the remaining layers until it reaches the end of the network, and you get a prediction.
The prediction may be correct or wrong, and the weights are updated accordingly: if the answer was right, the features/combinations of features that resulted in the correct answer will be reinforced, and vice versa.
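As a tiny, runnable illustration of the dropout behaviour just described (a sketch):

import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1, 4)

drop.train()
print(drop(x))   # random zeros, surviving values scaled by 1/(1-p) = 2
drop.eval()
print(drop(x))   # identity: tensor([[1., 1., 1., 1.]])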
So during training you do not need to, and should not, disable dropout: it affects the output, and it is supposed to, so that the model learns a better set of features.
I hope this makes it a bit clearer for you. If you still feel you need more, say so in the comments.
In PyTorch, we can call zero_grad() to clear the gradients. In Keras, do we have a similar function so that we can achieve the same thing? For example, I want to accumulate gradients across several batches.
In a custom training loop, it's easy to do:
...
# this is a glimpse of your custom training loop
# assume a `flag` has been defined to control the behavior
# assume a `buf = []` has been defined to hold the accumulated grads
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
if flag:  # do not accumulate grads
    _grads = some_func(buf)  # deal with accumulated grads in buf
    buf = []  # clear buf
    optimizer.apply_gradients(zip(_grads, model.trainable_variables))
else:  # accumulate grads
    buf.append(grads)
...
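One possible some_func for the sketch above is to simply average the buffered per-step gradients (here buf is assumed to hold one list of gradients per accumulated step):

def some_func(buf):
    # average the gradients variable-by-variable across the accumulated steps
    n = float(len(buf))
    return [tf.add_n(step_grads) / n for step_grads in zip(*buf)]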
With the high-level Keras API (model.compile(), model.fit()), I have no idea, because I use both TF2 and PyTorch, and I prefer custom training loops, which are an easier way to narrow the distance between the two.
In PyTorch the gradients are accumulated for every variable, and the loss value is distributed among them all. The optimizer is then in charge of making the update to the model parameters (specified at initialization), and since the gradient values are kept in memory, you have to zero them at the start of each iteration.
optimizer = torch.optim.Adam(itertools.chain(*param_list), lr=opt.lr, ...)
...
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
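That accumulating behaviour is also what makes gradient accumulation straightforward in PyTorch; a minimal sketch (loader, loss_fn and accum_steps are illustrative names):

accum_steps = 4
optimizer.zero_grad()
for i, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb) / accum_steps   # scale so the sum matches one big batch
    loss.backward()                               # adds into each parameter's .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()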
In Keras, with gradient tapes you wrap the operations whose variables you want gradients for. You call the gradient method on the tape to compute the updates, passing the loss value and the variables for which you need the gradients. The optimizer then applies one update per parameter (for the entire list of update/parameter pairs you specified).
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
You can use the .fit() method instead, which does all of that under the hood.
If your aim is to accumulate the update multiple times, there is no standard method in Keras, but you can do it more easily with tapes, accumulating the update values before applying them (see https://www.tensorflow.org/api_docs/python/tf/GradientTape#:~:text=To%20compute%20multiple%20gradients%20over%20the%20same%20computation).
A good solution to do it with .fit() is explained here: How to accumulate gradients for large batch sizes in Keras
If you want to know more about how the parameter gradients are tracked efficiently to distribute the loss value, and to understand the whole process better, have a look at (Wikipedia) Automatic differentiation.
I'm developing a machine learning model using Keras and I notice that the available loss functions are not giving the best results on my test set.
I am using a U-Net architecture, where I input a (16,16,3) image and the net also outputs a (16,16,3) picture (auto-encoder). I think one way to improve the model would be to use a loss function that compares, pixel by pixel, the gradients (Laplacian) of the net output and of the ground truth. However, I did not find any tutorial that handles this kind of application, because it would need to use the OpenCV Laplacian function on each output image from the net.
The loss function would be something like this:
def laplacian_loss(y_true, y_pred):
    # y_true already is the calculated gradients; only needs to be computed on y_pred
    # calculates the gradients for each predicted image
    y_pred_lap = []
    for img in y_pred:
        laplacian = cv2.Laplacian(np.float64(img), cv2.CV_64F)
        y_pred_lap.append(laplacian)
    y_pred_lap = np.array(y_pred_lap)
    # mean squared error, according to keras losses documentation
    return K.mean(K.square(y_pred_lap - y_true), axis=-1)
Has anyone done something like that for loss calculation?
Given the code above, it seems that it would be equivalent to using a Lambda() layer as the output layer that applies that transformation to the image, before computing the mean squared error.
Regardless of whether it is implemented as a Lambda() layer or in the loss function, the transformation needs to be such that TensorFlow understands how to calculate the gradients. The simplest way to do this would probably be to reimplement the cv2.Laplacian computation using TensorFlow math operations.
In order to use the cv2 library directly, you would need to create a function that calculates the gradients for what happens inside the cv2 call yourself; that seems significantly more error prone.
Gradient descent optimisation relies on being able to compute gradients from the inputs to the loss and back. Any operation in the middle must be differentiable, and TensorFlow must understand the math operations for automatic differentiation to work; otherwise you need to add the gradients manually.
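For example, here is a minimal sketch of the Laplacian term written entirely in TensorFlow ops, so that automatic differentiation works; the depthwise per-channel filtering is an assumption about the intended behaviour:

import tensorflow as tf

def laplacian_loss(y_true, y_pred):
    base = tf.constant([[1., 1., 1.],
                        [1., -8., 1.],
                        [1., 1., 1.]])
    channels = y_pred.shape[-1]
    # one 3x3 Laplacian per channel, shape (3, 3, channels, 1) for depthwise_conv2d
    kernel = tf.reshape(base, (3, 3, 1, 1)) * tf.ones((1, 1, channels, 1))
    y_pred_lap = tf.nn.depthwise_conv2d(y_pred, kernel,
                                        strides=[1, 1, 1, 1], padding='SAME')
    # y_true is assumed to already hold the precomputed Laplacian, as in the question
    return tf.reduce_mean(tf.square(y_pred_lap - y_true), axis=-1)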
I managed to reach an easy solution. The key point is that the gradient calculation is actually a 2D filter; for more information, follow the link about the Laplacian kernel. This means the output of my network needs to be filtered by the Laplacian kernel. For that, I created an extra convolutional layer with fixed weights, exactly equal to the Laplacian kernel. After that, the network has two outputs (one being the desired image, the other being the gradient image), so it is also necessary to define both losses.
To make it clearer, I'll give an example. At the end of the network you'll have something like:
channels = 3  # number of channels of the network output
lap = Conv2D(channels, (3, 3), padding='same', name='laplacian')(net_output)
model = Model(inputs=[net_input], outputs=[net_output, lap])
Define how you want to calculate the losses for each output:
# losses for the enhanced image output and the laplacian output
losses = {
    "enhanced": "mse",
    "laplacian": "mse"
}
lossWeights = {"enhanced": 1.0, "laplacian": 0.6}
Compile the model:
model.compile(optimizer=Adam(), loss=losses, loss_weights=lossWeights)
Define the Laplacian kernel, assign its values to the weights of the above convolutional layer, and set trainable to False (so it won't be updated).
# laplacian kernel, applied to each channel independently
# (Keras Conv2D weights have shape (kernel_h, kernel_w, in_channels, out_channels))
base = np.asarray([
    [1,  1, 1],
    [1, -8, 1],
    [1,  1, 1]
], dtype=np.float32)
l = np.zeros((3, 3, channels, channels), dtype=np.float32)
for c in range(channels):
    l[:, :, c, c] = base
bias = np.zeros(channels, dtype=np.float32)
wl = [l, bias]
model.get_layer('laplacian').set_weights(wl)
# freeze the layer; do this before compiling (or re-compile) so the freeze takes effect
model.get_layer('laplacian').trainable = False
When training, remember that you need two values for the ground truth:
model.fit(x=X, y={"enhanced": y_out, "laplacian": y_lap})
Observation: Do not utilize the BatchNormalization layer! In case you use it, the weights in the laplacian layer will be updated!