In pytorch, is there an efficient way to back propagate the gradients but not update their corresponding variables? It seems making a copy of the weights every time during update is too expensive. But no_grad & set_grad_enabled does not allow back propagation.
Ex. The following seems to take too much time, as it needs to do a copy of the model every time weights are updated:
def __init__():
…
self.model = MyModel()
self.func1 = FuncModel1()
self.func2 = FuncModel2()
…
def trainstep(input):
f1 = self.func1(input)
f2 = self.func2(input)
…
# want to update weights in model & f1 with respect to loss1
loss1 = my_loss(model(f1), y1)
# don’t want to update weights in self.model with respect to loss2
# but want to update weights in f2 for loss2
copy_model = MyModel()
copy_model.load_state_dict(self.model.state_dict())
loss2 = my_loss(copy_model(f2), y2)
total_loss = loss1 + loss2
…
total_loss.backward()
optimizer.step()
when loss.backward() pytorch propagates gradients through the entire compute graph.
However, backward() function itself does not update any of the weights, it only computes gradients.
The update is done via the optimizer in optimizer.step(). If you want to exclude the weights of f1 and f2 from the updates, you can simply either
- init optimizer without the parameters of f1 and f2.
- set the learning rate for f1 and f2 to zero.
Related
I want to develop a lifelong learning system,so i need to prevent important parameter from changing.I read related paper 'Memory Aware Synapses: Learning what (not) to forget',a method was mentioned,I need to calculate the gradient of each parameter conresponding to each input image,so how should i write my code in pytorch?
'Memory Aware Synapses: Learning what (not) to forget'
You can do it using standard optimization procedure and .backward() method on your loss function.
First, scaling as defined in your link:
class Scaler:
def __init__(self, parameters, delta):
self.parameters = parameters
self.delta = delta
def step(self):
"""Multiplies gradients in place."""
for param in self.parameters:
if param.grad is None:
raise ValueError("backward() has to be called before running scaler")
param.grad *= self.delta
One can use it just like optimizer.step(), see below (see comments):
model = torch.nn.Sequential(
torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
)
scaler = Scaler(model.parameters(), delta=0.001)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
X, y = torch.randn(64, 10), torch.randn(64)
# Optimization loop
EPOCHS = 10
for _ in range(EPOCHS):
output = model(X)
loss = criterion(output, y)
loss.backward() # Now model has the gradients
optimizer.step() # Optimize model's parameters
print(next(model.parameters()).grad)
scaler.step() # Scaler gradients
optimizer.zero_grad() # Zero gradient before next step
After scaler.step() you will have gradient scaled available inside param.grad for each parameter (just like those are accessed within Scaler's step method) so you can do whatever you want with them.
I have a trained Tensorflow 2.0 model (from tf.keras.Sequential()) that takes an input layer with 26 columns (X) and produces an output layer with 1 column (Y).
In TF 1.x I was able to calculate the gradient of the output with respect to the input with the following:
model = load_model('mymodel.h5')
sess = K.get_session()
grad_func = tf.gradients(model.output, model.input)
gradients = sess.run(grad_func, feed_dict={model.input: X})[0]
In TF2 when I try to run tf.gradients(), I get the error:
RuntimeError: tf.gradients is not supported when eager execution is enabled. Use tf.GradientTape instead.
In the question In TensorFlow 2.0 with eager-execution, how to compute the gradients of a network output wrt a specific layer?, we see an answer on how to calculate gradients with respect to intermediate layers, but I don't see how to apply this to gradients with respect to the inputs. On the Tensorflow help for tf.GradientTape, there are examples with calculating gradients for simple functions, but not neural networks.
How can tf.GradientTape be used to calculate the gradient of the output with respect to the input?
This should work in TF2:
inp = tf.Variable(np.random.normal(size=(25, 120)), dtype=tf.float32)
with tf.GradientTape() as tape:
preds = model(inp)
grads = tape.gradient(preds, inp)
Basically you do it the same way as TF1, but using GradientTape.
I hope this is what you're looking for. This will give the gradients of the output w.r.t. the inputs.
# Whatever the input you like goes in as the initial_value
x = tf.Variable(np.random.normal(size=(25, 120)), dtype=tf.float32)
y_true = np.random.choice([0,1], size=(25,10))
print(model.output)
print(model.predict(x))
with tf.GradientTape() as tape:
pred = model.predict(x)
grads = tape.gradients(pred, x)
In the above case, we should use tape.watch()
for (x, y) in test_dataset:
with tf.GradientTape() as tape:
tape.watch(x)
pred = model(x)
grads = tape.gradient(pred, x)
but the grads will just have the grads of the inputs
The following method is better, you can use model to predict the prediction results and compute the loss, then use the loss to calculate the grads of all trainable variables
with tf.GradientTape() as tape:
predictions = model(x, training=True)
loss = loss_function(y, predictions)
grads = tape.gradient(loss, model.trainable_variables)
Given a simple 2 layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t the input. Are there existing Pytorch methods that can allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F
class NeuralNet(nn.Module):
def __init__(self, n_features, n_hidden, n_classes, dropout):
super(NeuralNet, self).__init__()
self.fc1 = nn.Linear(n_features, n_hidden)
self.sigmoid = nn.Sigmoid()
self.fc2 = nn.Linear(n_hidden, n_classes)
self.dropout = dropout
def forward(self, x):
x = self.sigmoid(self.fc1(x))
x = F.dropout(x, self.dropout, training=self.training)
x = self.fc2(x)
return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim
model = NeuralNet(n_features=args.n_features,
n_hidden=args.n_hidden,
n_classes=args.n_classes,
dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
def train(epoch):
t = time.time()
model.train()
optimizer.zero_grad()
output = model(features)
loss_train = F.nll_loss(output[idx_train], labels[idx_train])
acc_train = accuracy(output[idx_train], labels[idx_train])
loss_train.backward()
optimizer_w.step()
# grad_features = loss_train.backward() w.r.t to features
# features -= 0.001 * grad_features
for epoch in range(args.epochs):
train(epoch)
It is possible, just set input.requires_grad = True for each input batch you're feeding in, and then after loss.backward() you should see that input.grad holds the expected gradient. In other words, if your input to the model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. In my comments below, I use i as a generalized index - if your parameters has for instance 3 dimensions, replace it with features.grad[i, j, k], etc.
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they are describing, which is then used for differentiation. For instance c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters" and of all other variables as of functions of those. This message tells you that you can only set requires_grad of leaf variables.
Your problem is that at the first iteration, features is random (or however else you initialize) and is therefore a valid leaf. After your first iteration, features is no longer a leaf, since it becomes an expression calculated based on the previous ones. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
to deal with that you need to use detach, which breaks the links in the tree, and makes the autograd treat a tensor as if it was constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.
I'm trying to implement the gradient descent with PyTorch according to this schema but can't figure out how to properly update the weights. It is just a toy example with 2 linear layers with 2 nodes in hidden layer and one output.
Learning rate = 0.05;
target output = 1
https://hmkcode.github.io/ai/backpropagation-step-by-step/
Forward
Backward
My code is as following:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class MyNet(nn.Module):
def __init__(self):
super(MyNet, self).__init__()
self.linear1 = nn.Linear(2, 2, bias=None)
self.linear1.weight = torch.nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
self.linear2 = nn.Linear(2, 1, bias=None)
self.linear2.weight = torch.nn.Parameter(torch.tensor([[0.14, 0.15]]))
def forward(self, inputs):
out = self.linear1(inputs)
out = self.linear2(out)
return out
losses = []
loss_function = nn.L1Loss()
model = MyNet()
optimizer = optim.SGD(model.parameters(), lr=0.05)
input = torch.tensor([2.0,3.0])
print('weights before backpropagation = ', list(model.parameters()))
for epoch in range(1):
result = model(input )
loss = loss_function(result , torch.tensor([1.00],dtype=torch.float))
print('result = ', result)
print("loss = ", loss)
model.zero_grad()
loss.backward()
print('gradients =', [x.grad.data for x in model.parameters()] )
optimizer.step()
print('weights after backpropagation = ', list(model.parameters()))
The result is following :
weights before backpropagation = [Parameter containing:
tensor([[0.1100, 0.2100],
[0.1200, 0.0800]], requires_grad=True), Parameter containing:
tensor([[0.1400, 0.1500]], requires_grad=True)]
result = tensor([0.1910], grad_fn=<SqueezeBackward3>)
loss = tensor(0.8090, grad_fn=<L1LossBackward>)
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
weights after backpropagation = [Parameter containing:
tensor([[0.1240, 0.2310],
[0.1350, 0.1025]], requires_grad=True), Parameter containing:
tensor([[0.1825, 0.1740]], requires_grad=True)]
Forward pass values:
2x0.11 + 3*0.21=0.85 ->
2x0.12 + 3*0.08=0.48 -> 0.85x0.14 + 0.48*0.15=0.191 -> loss =0.191-1 = -0.809
Backward pass: let's calculate w5 and w6 (output node weights)
w = w - (prediction-target)x(gradient)x(output of previous node)x(learning rate)
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
w6= 0.15 -(0.191-1)*1*0.48*0.05= 0.15 + 0.019= 0.169
In my example Torch doesn't multiply the loss by derivative so we get wrong weights after updating. For the output node we got new weights w5,w6 [0.1825, 0.1740] , when it should be [0.174, 0.169]
Moving backward to update the first weight of the output node (w5) we need to calculate: (prediction-target)x(gradient)x(output of previous node)x(learning rate)=-0.809*1*0.85*0.05=-0.034. Updated weight w5 = 0.14-(-0.034)=0.174. But instead pytorch calculated new weight = 0.1825. It forgot to multiply by (prediction-target)=-0.809. For the output node we got gradients -0.8500 and -0.4800. But we still need to multiply them by loss 0.809 and learning rate 0.05 before we can update the weights.
What is the proper way of doing this?
Should we pass 'loss' as an argument to backward() as following: loss.backward(loss) .
That seems to fix it. But I couldn't find any example on this in documentation.
You should use .zero_grad() with optimizer, so optimizer.zero_grad(), not loss or model as suggested in the comments (though model is fine, but it is not clear or readable IMO).
Except that your parameters are updated fine, so the error is not on PyTorch's side.
Based on gradient values you provided:
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
Let's multiply all of them by your learning rate (0.05):
gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]),
tensor([[-0.0425, -0.024]])]
Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:
parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
tensor([[0.1825, 0.1740]])]
What you have done is taken the gradients calculated by PyTorch and multiplied them with the output of previous node and that's not how it works!.
What you've done:
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
What should of been done (using PyTorch's results):
w5 = 0.14 - (-0.85*0.05) = 0.1825
No multiplication of previous node, it's done behind the scenes (that's what .backprop() does - calculates correct gradients for all of the nodes), no need to multiply them by previous ones.
If you want to calculate them manually, you have to start at the loss (with delta being one) and backprop all the way down (do not use learning rate here, it's a different story!).
After all of them are calculated, you can multiply each weight by optimizers learning rate (or any other formula for that matter, e.g. Momentum) and after this you have your correct update.
How to calculate backprop
Learning rate is not part of backpropagation, leave it alone until you calculate all of the gradients (it confuses separate algorithms together, optimization procedures and backpropagation).
1. Derivative of total error w.r.t. output
Well, I don't know why you are using Mean Absolute Error (while in the tutorial it is Mean Squared Error), and that's why both those results vary. But let's go with your choice.
Derivative of | y_true - y_pred | w.r.t. to y_pred is 1, so IT IS NOT the same as loss. Change to MSE to get equal results (here, the derivative will be (1/2 * y_pred - y_true), but we usually multiply MSE by two in order to remove the first multiplication).
In MSE case you would multiply by the loss value, but it depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).
2. Derivative of total error w.r.t. w5
You could probably go from here, but... Derivative of total error w.r.t to w5 is the output of h1 (0.85 in this case). We multiply it by derivative of total error w.r.t. output (it is 1!) and obtain 0.85, as done in PyTorch. Same idea goes for w6.
I seriously advise you not to confuse learning rate with backprop, you are making your life harder (and it's not easy with backprop IMO, quite counterintuitive), and those are two separate things (can't stress that one enough).
This source is nice, more step-by-step, with a little more complicated network idea (activations included), so you can get a better grasp if you go through all of it.
Furthermore, if you are really keen (and you seem to be), to know more ins and outs of this, calculate the weight corrections for other optimizers (say, nesterov), so you know why we should keep those ideas separated.
in keras, I want to customize my loss function which not only takes (y_true, y_pred) as input but also need to use the output from the internal layer of the network as the label for an output layer.This picture shows the Network Layout
Here, the internal output is xn, which is a 1D feature vector. in the upper right corner, the output is xn', which is the prediction of xn. In other words, xn is the label for xn'.
While [Ax, Ay] is traditionally known as y_true, and [Ax',Ay'] is y_pred.
I want to combine these two loss components into one and train the network jointly.
Any ideas or thoughts are much appreciated!
I have figured out a way out, in case anyone is searching for the same, I posted here (based on the network given in this post):
The idea is to define the customized loss function and use it as the output of the network. (Notation: A is the true label of variable A, and A' is the predicted value of variable A)
def customized_loss(args):
#A is from the training data
#S is the internal state
A, A', S, S' = args
#customize your own loss components
loss1 = K.mean(K.square(A - A'), axis=-1)
loss2 = K.mean(K.square(S - S'), axis=-1)
#adjust the weight between loss components
return 0.5 * loss1 + 0.5 * loss2
def model():
#define other inputs
A = Input(...) # define input A
#construct your model
cnn_model = Sequential()
...
# get true internal state
S = cnn_model(prev_layer_output0)
# get predicted internal state output
S' = Dense(...)(prev_layer_output1)
# get predicted A output
A' = Dense(...)(prev_layer_output2)
# customized loss function
loss_out = Lambda(customized_loss, output_shape=(1,), name='joint_loss')([A, A', S, S'])
model = Model(input=[...], output=[loss_out])
return model
def train():
m = model()
opt = 'adam'
model.compile(loss={'joint_loss': lambda y_true, y_pred:y_pred}, optimizer = opt)
# train the model
....
First of all you should be using the Functional API. Then you should define the network output as the output plus the result from the internal layer, merge them into a single output (by concatenating), and then make a custom loss function that then splits the merged output into two parts and does the loss computations on its own.
Something like:
def customLoss(y_true, y_pred):
#loss here
internalLayer = Convolution2D()(inputs) #or other layers
internalModel = Model(input=inputs, output=internalLayer)
tmpOut = Dense(...)(internalModel)
mergedOut = merge([tmpOut, mergedOut], mode = "concat", axis = -1)
fullModel = Model(input=inputs, output=mergedOut)
fullModel.compile(loss = customLoss, optimizer = "whatever")
I have my reservations regarding this implementation. The loss computed at the merged layer is propagated back to both the merged branches. Generally you would like to propagate it through just one layer.