PyTorch: optimizing the input to a model raises "Trying to backward through the graph a second time, but the buffers have already been freed"

Currently, I'm trying to optimize the values of an input tensor, x, to a model.
I want to restrict the input to only contain values in the range [0.0, 1.0].
There isn't much information about how to do this when the thing being optimized is an input tensor rather than a layer's weights.
I've created a minimal working example below, which gives the error message in the title of this post.
The magic happens in the optimize_x() function.
If I comment out the line model.x = model.x.clamp(min=0.0, max=1.0), the issue is fixed, but then the tensor is obviously not clamped.
I'm aware that I could just set retain_graph=True, but it's not clear whether that is the right way to go, or if there is a better way of achieving this.
import torch

class OptimizeInputModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(123, 1000),
            torch.nn.Dropout(0.4),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 100),
            torch.nn.Dropout(0.4),
            torch.nn.ReLU(),
            torch.nn.Linear(100, 1),
            torch.nn.Sigmoid(),
        )
        in_shape = (1, 123)
        self.x = torch.ones(in_shape) * 0.1
        self.x.requires_grad = True

    def forward(self) -> torch.Tensor:
        return self.model(self.x)

class MyLossFunc(torch.nn.Module):
    def forward(self, y: torch.Tensor) -> torch.Tensor:
        loss = torch.sum(-y)
        return loss

def optimize_x():
    model = OptimizeInputModel()
    optimizer = torch.optim.Adam([model.x], lr=1e-4)
    loss_fn = MyLossFunc()
    for epoch in range(50000):
        # Constrain x to the range [0, 1]
        model.x = model.x.clamp(min=0.0, max=1.0)
        y = model()
        loss = loss_fn(y)
        if epoch % 9 == 0:
            print(f'Epoch: {epoch}\t Loss: {loss}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

optimize_x()
Full error message:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

For anyone in the future who might have the same question: my solution was to use (note the underscore for the in-place version!):
model.x.data.clamp_(min=0.0, max=1.0)
instead of:
model.x = model.x.clamp(min=0.0, max=1.0)
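
Mutating .data works but sidesteps autograd's safety checks; a minimal sketch of the same idea using a torch.no_grad() block instead, which is the more commonly recommended way to mutate a leaf tensor in place between steps (the tiny model here is my own stand-in, not the one from the question):

import torch

x = torch.full((1, 123), 0.1, requires_grad=True)
model = torch.nn.Sequential(torch.nn.Linear(123, 1), torch.nn.Sigmoid())
optimizer = torch.optim.Adam([x], lr=1e-4)

for epoch in range(100):
    # In-place clamp outside the graph: no clamp node is recorded, so each
    # iteration's backward() sees a fresh, single-use graph.
    with torch.no_grad():
        x.clamp_(min=0.0, max=1.0)
    loss = -model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()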

Related

What should I think about when writing a custom loss function?

I'm trying to get my toy network to learn a sine wave.
I output (via tanh) a number between -1 and 1, and I want the network to minimise the following loss, where self(x) are the predictions.
loss = -torch.mean(self(x)*y)
This should be equivalent to trading a stock with a sinusoidal price, where self(x) is our desired position, and y are the returns of the next time step.
The issue I'm having is that the network doesn't learn anything. It does work if I change the loss function to be torch.mean((self(x)-y)**2) (MSE), but this isn't what I want. I'm trying to focus the network on 'making a profit', not making a prediction.
I think the issue may be related to the convexity of the loss function, but I'm not sure, and I'm not certain how to proceed. I've experimented with differing learning rates, but alas nothing works.
What should I be thinking about?
Actual code:
%load_ext tensorboard
import matplotlib.pyplot as plt; plt.rcParams["figure.figsize"] = (30, 8)
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import pytorch_lightning as pl
from torch import nn, tensor

def piecewise(x): return 2*(x > 0) - 1

class TsDs(torch.utils.data.Dataset):
    def __init__(self, s, l=5):
        super().__init__()
        self.l, self.s = l, s
    def __len__(self):
        return self.s.shape[0] - 1 - self.l
    def __getitem__(self, i):
        return self.s[i:i+self.l], torch.log(self.s[i+self.l+1] / self.s[i+self.l])
    def plt(self):
        plt.plot(self.s)

class TsDm(pl.LightningDataModule):
    def __init__(self, length=5000, batch_size=1000):
        super().__init__()
        self.batch_size = batch_size
        self.s = torch.sin(torch.arange(length) * 0.2) + 5 + 0*torch.rand(length)
    def train_dataloader(self):
        return DataLoader(TsDs(self.s[:3999]), batch_size=self.batch_size, shuffle=True)
    def val_dataloader(self):
        return DataLoader(TsDs(self.s[4000:]), batch_size=self.batch_size)

dm = TsDm()

class MyModel(pl.LightningModule):
    def __init__(self, learning_rate=0.01):
        super().__init__()
        self.learning_rate = learning_rate
        self.conv1 = nn.Conv1d(1, 5, 2)
        self.lin1 = nn.Linear(20, 3)
        self.lin2 = nn.Linear(3, 1)
        # self.network = nn.Sequential(nn.Conv1d(1,5,2),nn.ReLU(),nn.Linear(20,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
        # self.network = nn.Sequential(nn.Linear(5,5),nn.ReLU(),nn.Linear(5,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
    def forward(self, x):
        out = x.unsqueeze(1)
        out = self.conv1(out)
        out = out.reshape(-1, 20)
        out = nn.ReLU()(out)
        out = self.lin1(out)
        out = nn.ReLU()(out)
        out = self.lin2(out)
        return nn.Tanh()(out)
    def step(self, batch, batch_idx, stage):
        x, y = batch
        loss = -torch.mean(self(x) * y)
        # loss = torch.mean((self(x)-y)**2)
        print(loss)
        self.log("loss", loss, prog_bar=True)
        return loss
    def training_step(self, batch, batch_idx): return self.step(batch, batch_idx, "train")
    def validation_step(self, batch, batch_idx): return self.step(batch, batch_idx, "val")
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.learning_rate)

# logger = pl.loggers.TensorBoardLogger(save_dir="/content/")
mm = MyModel(0.1)
trainer = pl.Trainer(max_epochs=10)
# trainer.tune(mm, dm)
trainer.fit(mm, datamodule=dm)
If I understand you correctly, I think that you were trying to maximize the unnormalized correlation between the network's prediction, self(x), and the target value y.
As you mention, the problem is the convexity of the loss with respect to the model weights. One way to see the problem is to consider a simple linear predictor w'*x, where w is the model's weight vector, w' its transpose, and x the input feature vector (assume a scalar prediction for now). If you look at the derivative of the loss with respect to the weight vector (i.e., the gradient), you'll find that it no longer depends on w!
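A quick way to check that claim: for the linear predictor the loss is -mean((w'x) * y), whose gradient with respect to w is -mean(x*y), which contains no w at all. A small autograd demonstration (toy data; the names here are mine, not from the question):

import torch

x = torch.randn(100, 5)
y = torch.randn(100)

for scale in (0.1, 10.0):
    w = (scale * torch.ones(5)).requires_grad_()
    loss = -torch.mean((x @ w) * y)
    loss.backward()
    print(w.grad)  # identical for both scales: -mean(x*y) does not depend on w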
One way to fix this is to change the loss to
loss = -torch.mean(torch.square(self(x)*y))
or
loss = -torch.mean(torch.abs(self(x)*y))
You will have another big problem, however: these loss functions encourage unbounded growth of the model weights. In the linear case, one solves this by a Lagrangian relaxation of a hard constraint on, for example, the norm of the model weight vector. I'm not sure how this would be done with neural networks, as each layer would need its own Lagrangian parameter...
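One plain way to get such a soft norm constraint in the neural-network case is an ordinary weight-decay (L2) penalty with a single shared coefficient; this is my own suggestion, not something from the original answer:

import torch

model = torch.nn.Sequential(torch.nn.Linear(5, 1), torch.nn.Tanh())
x, y = torch.randn(100, 5), torch.randn(100, 1)

wd = 1e-4  # arbitrary coefficient, acting as one shared Lagrangian multiplier for all layers
loss = -torch.mean(torch.abs(model(x) * y)) + wd * sum(p.pow(2).sum() for p in model.parameters())
loss.backward()
# Equivalently, keep the plain loss and pass weight_decay=1e-4 to the optimizer.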

Nested optimization in pytorch

I wrote a short snippet to train a classification model, and learn the learning rate of its optimization algorithm. In my example I tried to update weights of a network in an inner optimization loop and to learn the learning rate of the weight updates using an outer optimization loop (meta-optimization). I'm getting the error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 10]], which is output 0 of AsStridedBackward0, is at version 12; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
My code snippet is as follows (NOTE: I'm using _stateless, an experimental functional API for nn; you need to run with a nightly build of PyTorch):
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils import _stateless

class MyDataset(Dataset):
    def __init__(self, N):
        self.N = N
        self.x = torch.rand(self.N, 10)
        self.y = torch.randint(0, 3, (self.N,))
    def __len__(self):
        return self.N
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 3)
        self.relu = nn.ReLU()
        self.alpha = nn.Parameter(torch.randn(1))
        self.beta = nn.Parameter(torch.randn(1))
    def forward(self, x):
        y = self.relu(self.fc1(x))
        return self.fc2(y)

epochs = 20
N = 100
dataset = DataLoader(dataset=MyDataset(N), batch_size=10)
model = MyModel()
loss_func = nn.CrossEntropyLoss()
meta_opt = optim.Adam([model.alpha], lr=1e-3)  # renamed from `optim` to avoid shadowing the module
params = dict(model.named_parameters())
for i in range(epochs):
    model.train()
    train_loss = 0
    for batch_idx, (x, y) in enumerate(dataset):
        logits = _stateless.functional_call(model, params, x)  # predict
        loss_inner = loss_func(logits, y)                       # loss
        meta_opt.zero_grad()                                    # reset grad
        loss_inner.backward(create_graph=True, inputs=tuple(params.values()))  # compute grad
        train_loss += loss_inner.item()                         # store loss
        for k, p in params.items():
            if k != 'alpha' and k != 'beta':
                p.update = -model.alpha * p.grad
                params[k] = p + p.update                        # update weight
    print('Train Epoch: {}\tLoss: {:.6f}'.format(i, train_loss / N))
    logits = _stateless.functional_call(model, params, x)       # predict
    loss_meta = loss_func(logits, y)
    loss_meta.backward()
    meta_opt.step()  # was `loss_meta.step()` in my original, which doesn't exist
From the error message, I understand that the issue comes from weight update for the weights of the second layer of the network, which points to an error in my inner loop optimization. Any suggestions would be appreciated.
Check this link; save the params once per epoch and clone them for each inner batch:
https://discuss.pytorch.org/t/issue-using-parameters-internal-method/134549/11
for i in range(epochs):
    model.train()
    train_loss = 0
    params = dict(model.named_parameters())  # add this
    for batch_idx, (x, y) in enumerate(dataset):
        params = {k: v.clone() for k, v in params.items()}  # add this
        logits = _stateless.functional_call(model, params, x)  # predict
        loss_inner = loss_func(logits, y)
        ..................
You should be updating params[k].data instead of params[k].
(Deleted the example to avoid distraction.)
Let me open a kind of fundamental discussion (not an answer to your question).
If I understand correctly, you want to compute loss(f(w[i], x)), then update the weights as w[i+1, j] = w[i, j] + g(v[j], grad of w[i, j] w.r.t. the loss), and in the end update v[j+1] = v[j] + grad of v[j] w.r.t. the loss.
The gradient of v[j] is computed by backpropagation as a function of the gradient of w[i, j]. So what you are trying to do is choose a v[j] that results in a good w[i, j]. I would ask: why bother with v[j] if you can control w[i, j] directly? And that is exactly the standard approach.
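Putting the two suggestions together (clone the params for each inner batch, update the clones rather than the live tensors, and step the outer optimizer at the end), a minimal sketch of the corrected inner/outer loop; it reuses MyModel, dataset, epochs, and loss_func as defined in the question and still relies on the experimental _stateless API, so the exact arrangement is my reading of the answers rather than verified code from them:

meta_opt = optim.Adam([model.alpha], lr=1e-3)  # outer optimizer over alpha, as in the question

for i in range(epochs):
    model.train()
    train_loss = 0
    params = dict(model.named_parameters())  # fresh view of the live weights each epoch
    for batch_idx, (x, y) in enumerate(dataset):
        params = {k: v.clone() for k, v in params.items()}  # avoid in-place version bumps
        logits = _stateless.functional_call(model, params, x)
        loss_inner = loss_func(logits, y)
        meta_opt.zero_grad()
        loss_inner.backward(create_graph=True, inputs=tuple(params.values()))
        train_loss += loss_inner.item()
        for k, p in params.items():
            if k not in ('alpha', 'beta'):
                params[k] = p - model.alpha * p.grad  # differentiable inner update
    logits = _stateless.functional_call(model, params, x)
    loss_meta = loss_func(logits, y)
    loss_meta.backward()  # reaches alpha through the chain of inner updates
    meta_opt.step()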

Computing the Hessian of a Simple NN in PyTorch wrt to Parameters

I am relatively new to PyTorch and am trying to compute the Hessian of a very simple feedforward network with respect to its weights. I am trying to get torch.autograd.functional.hessian to work. I have been digging through the forums, and since this is a relatively new function added to PyTorch, I am unable to find much information on it. Here is my simple network architecture, which is from some sample code on Kaggle for MNIST.
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.l1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.l3 = nn.Linear(hidden_size, output_size)
    def forward(self, x):
        x = self.l1(x)
        x = self.relu(x)
        x = self.l3(x)
        return F.log_softmax(x, dim=1)

net = Network()
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9)
loss_func = nn.CrossEntropyLoss()
and I am running the NN for a bunch of epochs like:
for e in range(epochs):
    for i in range(0, x.shape[0], batch_size):
        x_mini = x[i:i + batch_size]
        y_mini = y[i:i + batch_size]
        x_var = Variable(x_mini)
        y_var = Variable(y_mini)
        optimizer.zero_grad()
        net_out = net(x_var)
        loss = loss_func(net_out, y_var)
        loss.backward()
        optimizer.step()
        if i % 100 == 0:
            loss_log.append(loss.data)
Then, I add all the parameters to a list and make a tensor out of it as below:
param_list = []
for param in net.parameters():
    param_list.append(param.view(-1))
param_list = torch.cat(param_list)
Finally, I am trying to compute the Hessian of the converged network by running:
hessian = torch.autograd.functional.hessian(loss_func, param_list,create_graph=True)
but it gives me this error:
TypeError: forward() missing 1 required positional argument: 'target'
Any help would be appreciated.
Computing the hessian with regard to the parameters of a model (as opposed to the inputs to the model) isn't really well-supported right now. There's some work being done on this at https://github.com/pytorch/pytorch/issues/49171 , but for the moment it's very inconvenient.
Your code has a few other problems -- where you're passing loss_func, you should be passing a function that constructs the computation graph. Also, you never specify the input to the network or the target for the loss function.
Here's some code that cheats a little bit to use the existing functional interface to compute the hessian of the model weights, and concatenates everything together to give the same form as what you were trying to do:
# Pick a random input to the network
src = torch.rand(1, 2)
# Say our target for our loss is all ones
dst = torch.ones(1, dtype=torch.long)

keys = list(net.state_dict().keys())
parameters = list(net.parameters())
sizes = [x.view(-1).shape[0] for x in parameters]
ndims = sum(sizes)

def hessian_hack(*params):
    # Temporarily swap each parameter tensor on the live module for the
    # corresponding argument of this function, then run the forward pass
    # so the loss is a function of those arguments.
    for i in range(len(keys)):
        path = keys[i].split('.')
        cur = net
        for f in range(0, len(path) - 1):
            cur = cur.__getattr__(path[f])
        cur.__delattr__(path[-1])
        cur.__setattr__(path[-1], params[i])
    return loss_func(net(src), dst)

# sub_hessians[i][f] is the hessian of parameter i vs parameter f
sub_hessians = torch.autograd.functional.hessian(
    hessian_hack,
    tuple(parameters),
    create_graph=True)

# We can combine them all into a nice big hessian.
hessian = torch.cat([
    torch.cat([
        sub_hessians[i][f].reshape(sizes[i], sizes[f])
        for f in range(len(sub_hessians[i]))
    ], axis=1)
    for i in range(len(sub_hessians))
], axis=0)
print(hessian)
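As a quick sanity check (my addition, not part of the original answer): the combined matrix should be square, with one row and column per scalar parameter, and symmetric up to numerical error:

print(hessian.shape)  # torch.Size([ndims, ndims])
print(torch.allclose(hessian, hessian.T, atol=1e-6))  # the Hessian of a scalar loss is symmetric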

How to compute the parameter importance in pytorch?

I want to develop a lifelong learning system, so I need to prevent important parameters from changing. I read the related paper 'Memory Aware Synapses: Learning what (not) to forget', where such a method is described: I need to calculate the gradient of each parameter with respect to each input image. How should I write this in PyTorch?
You can do it using standard optimization procedure and .backward() method on your loss function.
First, scaling as defined in your link:
class Scaler:
    def __init__(self, parameters, delta):
        # Materialize the generator: model.parameters() would otherwise be
        # exhausted after the first call to step().
        self.parameters = list(parameters)
        self.delta = delta

    def step(self):
        """Multiplies gradients in place."""
        for param in self.parameters:
            if param.grad is None:
                raise ValueError("backward() has to be called before running scaler")
            param.grad *= self.delta
One can use it just like optimizer.step(); see the comments in the example below:
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
)
scaler = Scaler(model.parameters(), delta=0.001)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
X, y = torch.randn(64, 10), torch.randn(64, 1)  # target shaped to match the model output

# Optimization loop
EPOCHS = 10
for _ in range(EPOCHS):
    output = model(X)
    loss = criterion(output, y)
    loss.backward()        # Now model has the gradients
    optimizer.step()       # Optimize model's parameters
    print(next(model.parameters()).grad)
    scaler.step()          # Scale gradients
    optimizer.zero_grad()  # Zero gradient before next step
After scaler.step() you will have the scaled gradients available in param.grad for each parameter (just as they are accessed within Scaler's step method), so you can do whatever you want with them.
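For the MAS importance weights themselves, a hedged sketch of the accumulation step, following the paper's recipe of averaging, over input samples, the absolute gradient of the squared L2 norm of the output (the names importance and the toy model are mine):

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1))
X = torch.randn(64, 10)

importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
for x in X:  # one input image at a time, as in the question
    model.zero_grad()
    out = model(x.unsqueeze(0))
    out.pow(2).sum().backward()  # gradient of the squared L2 norm of the output
    for n, p in model.named_parameters():
        importance[n] += p.grad.abs()
importance = {n: omega / len(X) for n, omega in importance.items()}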

retain_graph problem with GRU in Pytorch 1.6

I am aware that, when calling loss.backward(), we need to specify retain_graph=True if there are multiple networks and multiple loss functions optimizing each network separately. But I am getting errors whether or not I specify this parameter. The following is an MWE to reproduce the issue (on PyTorch 1.6).
import torch
from torch import nn
from torch import optim

torch.autograd.set_detect_anomaly(True)

class GRU1(nn.Module):
    def __init__(self):
        super(GRU1, self).__init__()
        self.brnn = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)
    def forward(self, x):
        return self.brnn(x)

class GRU2(nn.Module):
    def __init__(self):
        super(GRU2, self).__init__()
        self.brnn = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)
    def forward(self, x):
        return self.brnn(x)

gru1 = GRU1()
gru2 = GRU2()
gru1_opt = optim.Adam(gru1.parameters())
gru2_opt = optim.Adam(gru2.parameters())
criterion = nn.MSELoss()

for i in range(100):
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    vector = torch.randn((15, 100, 2))
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    loss_gru1 = criterion(gru1_output, torch.randn((15, 100, 200)))
    loss_gru1.backward(retain_graph=True)
    gru1_opt.step()
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward(retain_graph=True)
    gru2_opt.step()
    print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")
With retain_graph set to True I get the error
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 300]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The error without the parameter is
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
which is expected.
Please point at what needs to be changed in the above code for it to begin training. Any help is appreciated.
In such a case, one can detach the computation graph to exclude the parameters that don't need to be optimized. In this case, the computation graph should be detached after the second forward pass with gru1 i.e.
....
gru1_opt.step()
gru1_output, _ = gru1(vector)
gru1_output = gru1_output.detach()
....
This way, you won't "try to backward through the graph a second time" as the error mentioned.
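Applied to the loop from the question, a minimal sketch of the whole iteration (only the detach line is new; with it, retain_graph is no longer needed on either backward call):

for i in range(100):
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    vector = torch.randn((15, 100, 2))

    gru1_output, _ = gru1(vector)
    loss_gru1 = criterion(gru1_output, torch.randn((15, 100, 200)))
    loss_gru1.backward()
    gru1_opt.step()

    gru1_output, _ = gru1(vector)
    gru1_output = gru1_output.detach()  # cut the graph so gru2's backward never reaches gru1
    gru2_output, _ = gru2(gru1_output)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward()
    gru2_opt.step()

    print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")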
