Consider the following two contexts:
with torch.no_grad():
    params = params - learning_rate * params.grad
and
with torch.no_grad():
    params -= learning_rate * params.grad
In the second case, .backward() runs smoothly, while in the first case it raises:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
What is the reason for this, given that x -= a and x = x - a are normally used interchangeably?
Note that x -= a and x = x - a cannot be used interchangeably: the latter creates a new tensor that is assigned to the variable x, while the former performs an in-place operation.
Therefore, with
with torch.no_grad():
    params -= learning_rate * params.grad
everything works fine in your optimization loop, while in
with torch.no_grad():
    params = params - learning_rate * params.grad
the variable params gets overwritten with a new tensor. Since this new tensor was created within a torch.no_grad() context, it has params.requires_grad=False and therefore no populated .grad attribute. In the next iteration, torch will then complain that params does not require grad.
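A minimal sketch that reproduces the difference (the quadratic objective and the learning rate are illustrative assumptions, not from the question):
import torch

params = torch.ones(3, requires_grad=True)
learning_rate = 0.1

for step in range(2):
    loss = (params ** 2).sum()   # any differentiable objective
    loss.backward()
    with torch.no_grad():
        # In place: params stays a leaf with requires_grad=True.
        params -= learning_rate * params.grad
        # By contrast, `params = params - learning_rate * params.grad` would
        # rebind params to a tensor with requires_grad=False, so the next
        # loss.backward() raises the RuntimeError quoted above.
        params.grad.zero_()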
I am trying to implement truncated backpropagation through time in PyTorch, for the simple case where K1=K2. I have an implementation below that produces reasonable output, but I just want to make sure it is correct. When I look online for PyTorch examples of TBTT, they do inconsistent things around detaching the hidden state and zeroing out the gradient, and the ordering of these operations. Please let me know if I have made a mistake.
In the code below, H maintains the current hidden state, and model(weights, H, x) outputs the prediction and the new hidden state.
while i < NUM_STEPS:
    # Grab x, y for ith datapoint
    x = data[i]
    target = true_output[i]

    # Run model
    output, new_hidden = model(weights, H, x)
    H = new_hidden

    # Update running error
    error += (output - target)**2

    if (i+1) % K == 0:
        # Backpropagate
        error.backward()
        opt.step()
        opt.zero_grad()
        error = 0
        H = H.detach()
    i += 1
So the idea of your code is to cut off the graph at the last variables after each Kth step. Yes, your implementation is correct, and this answer confirms that.
# truncated to the last K timesteps
while i < NUM_STEPS:
    out = model(out)
    if (i + 1) % K == 0:
        out.backward()       # backpropagate through at most the last K steps
        out = out.detach()   # cut the graph; detach() returns a new tensor
    i += 1
You can also follow this example from PyTorch Ignite for reference.
import torch
from ignite.engine import Engine, EventEnum, _prepare_batch
from ignite.utils import apply_to_tensor


class Tbptt_Events(EventEnum):
    """Additional tbptt events.

    Additional events for truncated backpropagation through time dedicated
    trainer.
    """

    TIME_ITERATION_STARTED = "time_iteration_started"
    TIME_ITERATION_COMPLETED = "time_iteration_completed"


def _detach_hidden(hidden):
    """Cut backpropagation graph.

    Auxiliary function to cut the backpropagation graph by detaching the hidden
    vector.
    """
    return apply_to_tensor(hidden, torch.Tensor.detach)


def create_supervised_tbptt_trainer(
    model, optimizer, loss_fn, tbtt_step, dim=0, device=None, non_blocking=False, prepare_batch=_prepare_batch
):
    """Create a trainer for truncated backprop through time supervised models.

    Training a recurrent model on long sequences is computationally intensive, as
    it requires processing the whole sequence before getting a gradient.
    However, when the training loss is computed over many outputs
    (`X to many <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`_),
    there is an opportunity to compute a gradient over a subsequence. This is
    known as
    `truncated backpropagation through time <https://machinelearningmastery.com/
    gentle-introduction-backpropagation-time/>`_.
    This supervised trainer applies a gradient optimization step every `tbtt_step`
    time steps of the sequence, while backpropagating through the same
    `tbtt_step` time steps.

    Args:
        model (`torch.nn.Module`): the model to train.
        optimizer (`torch.optim.Optimizer`): the optimizer to use.
        loss_fn (torch.nn loss function): the loss function to use.
        tbtt_step (int): the length of time chunks (last one may be smaller).
        dim (int): axis representing the time dimension.
        device (str, optional): device type specification (default: None).
            Applies to batches.
        non_blocking (bool, optional): if True and this copy is between CPU and GPU,
            the copy may occur asynchronously with respect to the host. For other cases,
            this argument has no effect.
        prepare_batch (callable, optional): function that receives `batch`, `device`,
            `non_blocking` and outputs tuple of tensors `(batch_x, batch_y)`.

    .. warning::
        The internal use of `device` has changed.
        `device` will now *only* be used to move the input data to the correct device.
        The `model` should be moved by the user before creating an optimizer.
        For more information see:

        * `PyTorch Documentation <https://pytorch.org/docs/stable/optim.html#constructing-it>`_
        * `PyTorch's Explanation <https://github.com/pytorch/pytorch/issues/7844#issuecomment-503713840>`_

    Returns:
        Engine: a trainer engine with supervised update function.
    """

    def _update(engine, batch):
        loss_list = []
        hidden = None
        x, y = batch
        for batch_t in zip(x.split(tbtt_step, dim=dim), y.split(tbtt_step, dim=dim)):
            x_t, y_t = prepare_batch(batch_t, device=device, non_blocking=non_blocking)
            # Fire event for start of iteration
            engine.fire_event(Tbptt_Events.TIME_ITERATION_STARTED)
            # Forward, backward and optimize
            model.train()
            optimizer.zero_grad()
            if hidden is None:
                y_pred_t, hidden = model(x_t)
            else:
                hidden = _detach_hidden(hidden)
                y_pred_t, hidden = model(x_t, hidden)
            loss_t = loss_fn(y_pred_t, y_t)
            loss_t.backward()
            optimizer.step()

            # Setting state of engine for consistent behaviour
            engine.state.output = loss_t.item()
            loss_list.append(loss_t.item())

            # Fire event for end of iteration
            engine.fire_event(Tbptt_Events.TIME_ITERATION_COMPLETED)

        # return average loss over the time splits
        return sum(loss_list) / len(loss_list)

    engine = Engine(_update)
    engine.register_events(*Tbptt_Events)
    return engine
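As a hedged usage sketch (the RNN, data shapes, and hyperparameters below are illustrative assumptions, not part of the original example):
import torch
import torch.nn as nn

# nn.RNN's forward returns (output, new_hidden), which matches what the
# trainer's update function expects from `model`.
rnn = nn.RNN(input_size=8, hidden_size=8)                # time is along dim=0
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)

x = torch.randn(100, 4, 8)   # sequence length 100, batch size 4, 8 features
y = torch.randn(100, 4, 8)   # per-timestep regression targets

trainer = create_supervised_tbptt_trainer(rnn, optimizer, nn.MSELoss(), tbtt_step=10)
trainer.run([(x, y)], max_epochs=5)   # any iterable of (x, y) batches works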
Currently, I'm trying to optimize the values of an input tensor, x, to a model.
I want to restrict the input to only contain values in the range [0.0, 1.0].
There is not too much information about how to do this when not working with a layer as such.
I've created a minimal working example below, which gives the error message quoted at the bottom of this post.
The magic happens in the optimize_x() function.
If I comment out the line model.x = model.x.clamp(min=0.0, max=1.0), the issue is fixed, but the tensor is obviously not clamped.
I'm aware that I could just set retain_graph=True, but it's not clear whether this is the right way to go, or if there is a better way of achieving this functionality.
import torch
from torch.distributions import Uniform


class OptimizeInputModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(123, 1000),
            torch.nn.Dropout(0.4),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 100),
            torch.nn.Dropout(0.4),
            torch.nn.ReLU(),
            torch.nn.Linear(100, 1),
            torch.nn.Sigmoid(),
        )
        in_shape = (1, 123)
        self.x = torch.ones(in_shape) * 0.1
        self.x.requires_grad = True

    def forward(self) -> torch.Tensor:
        return self.model(self.x)


class MyLossFunc(torch.nn.Module):
    def forward(self, y: torch.Tensor) -> torch.Tensor:
        loss = torch.sum(-y)
        return loss


def optimize_x():
    model = OptimizeInputModel()
    optimizer = torch.optim.Adam([model.x], lr=1e-4)
    loss_fn = MyLossFunc()
    for epoch in range(50000):
        # Constrain x to values in [0, 1]
        model.x = model.x.clamp(min=0.0, max=1.0)
        y = model()
        loss = loss_fn(y)
        if epoch % 9 == 0:
            print(f'Epoch: {epoch}\t Loss: {loss}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


optimize_x()
Full error message:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
For anyone in the future who might have the same question: my solution was to do (note the underscore!):
model.x.data.clamp_(min=0.0, max=1.0)
instead of:
model.x = model.x.clamp(min=0.0, max=1.0)
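The original line fails because model.x = model.x.clamp(...) rebinds x to a non-leaf tensor that is still connected to the previous iteration's graph, so the next backward() tries to traverse that already-freed graph. Mutating .data clamps in place without recording anything in the graph. A minimal sketch of an equivalent, more explicit alternative (a common idiom, not from the original answer):
with torch.no_grad():
    model.x.clamp_(min=0.0, max=1.0)   # in-place clamp, invisible to autograd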
Suppose a model as in:
model = Model(inputs=[A, B], outputs=C)
With custom loss:
def actor_loss(y_true, y_pred):
    log_lik = y_true * K.log(y_pred)
    loss = -K.sum(log_lik * K.stop_gradient(B))
    return loss
Now I'm trying to define a function that returns the gradients of the loss w.r.t. the weights for a given pair of inputs and target output, and to expose it as such.
Here is an idea of what I mean, in pseudocode:
def _get_grads(inputs, targets):
    loss = model.loss(targets, model.output)
    weights = model.trainable_weights
    grads = K.gradients(loss, weights)
    model.input[0] (aka 'A') <---- inputs[0]
    model.input[1] (aka 'B') <---- inputs[1]
    return K.function(model.input, grads)

self.get_grads = _get_grads
My question is: how do I feed the inputs argument to the graph inside said function?
(So far I've only worked with .fit and not with .gradients, and I can't find any decent documentation on custom losses or multiple inputs.)
If you call K.function, you get an actual callable function, so you should just call it with some parameter values. The format is exactly the same as for model.fit; in your case it should be two arrays of values, including the batch dimension:
self.get_grads = _get_grads(inputs, targets)
grad_value = self.get_grads([input1, input2])
Where input1 and input2 are numpy arrays that include the batch dimension.
My understanding of K.function, K.gradients, and custom losses was fundamentally wrong. You use K.function to construct a mini-graph that computes gradients of the loss w.r.t. the weights. The function itself does not need any arguments.
def _get_grads():
    targets = Input(shape=...)
    loss = model.loss(targets, model.output)
    weights = model.trainable_weights
    grads = K.gradients(loss, weights)
    return K.function(model.input + [targets], grads)
I was under the impression that _get_grads was itself the K.function, but that was wrong: _get_grads() returns the K.function. You then use it as
f = _get_grads() # constructs the mini-graph that gives gradients
grads = f([inputs, labels])
inputs is fed to model.input, labels to targets, and it returns grads.
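For the two-input model in the question, note that model.input is a list of two tensors, so the built function takes three arrays (the array names below are hypothetical):
f = _get_grads()                             # builds the gradient mini-graph once
grads = f([a_batch, b_batch, label_batch])   # one array per model input, then targets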
Given a simple 2-layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t. the input. Are there existing PyTorch methods that allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F


class NeuralNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim

model = NeuralNet(n_features=args.n_features,
                  n_hidden=args.n_hidden,
                  n_classes=args.n_classes,
                  dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
def train(epoch):
    t = time.time()
    model.train()
    optimizer_w.zero_grad()
    output = model(features)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer_w.step()
    # grad_features = loss_train.backward() w.r.t. features
    # features -= 0.001 * grad_features

for epoch in range(args.epochs):
    train(epoch)
It is possible; just set input.requires_grad = True for each input batch you're feeding in, and then after loss.backward() you should see that input.grad holds the expected gradient. In other words, if your input to the model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. In my comments below, I use i as a generalized index; if your features tensor has, for instance, 3 dimensions, replace features.grad[i] with features.grad[i, j, k], etc.
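A minimal sketch of this idea (the model, shapes, and labels below are illustrative assumptions):
import torch
import torch.nn.functional as F

features = torch.randn(4, 10, requires_grad=True)   # leaf input batch
model = torch.nn.Linear(10, 3)
labels = torch.tensor([0, 1, 2, 0])

loss = F.nll_loss(F.log_softmax(model(features), dim=1), labels)
loss.backward()
print(features.grad.shape)   # torch.Size([4, 10]), same shape as features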
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they describe, which is then used for differentiation. For instance, c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters" and of all other variables as functions of those. This message tells you that you can only set requires_grad of leaf variables.
Your problem is that at the first iteration, features is random (or however else you initialize it) and is therefore a valid leaf. After your first iteration, features is no longer a leaf, since it becomes an expression calculated from the previous values. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
To deal with that, you need to use detach, which breaks the links in the tree and makes autograd treat the tensor as if it were constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.
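Putting it together, a hedged sketch of the input-update loop (the model and step size are illustrative assumptions):
import torch

model = torch.nn.Linear(10, 1)
features = torch.randn(4, 10)

for step in range(100):
    features.requires_grad_(True)     # features is a (re-)created leaf here
    loss = model(features).pow(2).sum()
    loss.backward()
    # detach() returns a new leaf cut off from the graph, so the next
    # iteration's requires_grad_ call is valid again
    features = (features - 0.01 * features.grad).detach()
    model.zero_grad()                 # drop parameter gradients we don't use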
This is in TensorFlow 2.0 Keras.
x_input = keras.layers.Input(shape=(1,))
V = K.variable(0.5, name='V', dtype=tf.float32)
V = tf.reduce_mean(x_input, axis=-1) * 0 + V # this is a stupid way to get things to work
V = tf.expand_dims(V, axis=-1)
model = keras.models.Model(inputs=x_input, outputs=V)
I then use this model (uncompiled) as an input to another model construction procedure.
The variable "V" does not show up in the summary of either the first or the second model. How do I get hold of the value of V in callbacks, or even in eager mode?