Clarification in PyTorch's autograd with respect to tracking weights - pytorch

I was reading this blog post from PyTorch. Just before the Autograd in Training section, it is mentioned:
Be aware that only leaf nodes of the computation have their gradients computed. If you tried, for example, print(c.grad) you’d get back None. In this simple example, only the input is a leaf node, so only it has gradients computed.
So weights should also be considered leaf nodes. In the subsequent Autograd in Training section, the following code block is executed:
import torch

BATCH_SIZE = 16
DIM_IN = 1000
HIDDEN_SIZE = 100
DIM_OUT = 10

class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.layer1 = torch.nn.Linear(1000, 100)
        self.relu = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(100, 10)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

some_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)
ideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)
model = TinyModel()
When
print(model.layer2.weight.grad)
is executed, it prints None.
But after running the training snippet below, the weights do have gradients:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
prediction = model(some_input)
loss = (ideal_output - prediction).pow(2).sum()
loss.backward()
print(model.layer2.weight.grad[0][0:10])
So is my understanding correct? I.e., when the weights are initialised by calling TinyModel(), requires_autograd is set to False, and only when training starts with loss.backward() is requires_autograd set to True and the gradient tracked?
But in other examples, when we create PyTorch models from scratch and initialise the weights randomly with requires_grad=True, the gradient is tracked from the beginning.
Or is gradient tracking generally enabled only once training starts? If so, why was it initially returning None in the above example?
Thank you in advance.

I assume by requires_autograd you mean requires_grad.
when the weights are initialised by calling TinyModel(), requires_autograd is set to False
No, this isn't true. The attribute model.layer2.weight is an instance of nn.Parameter which has requires_grad == True by default. You can verify this yourself:
import torch.nn as nn

model = TinyModel()
assert isinstance(model.layer2.weight, nn.Parameter)
assert model.layer2.weight.requires_grad
assert all(p.requires_grad for p in model.parameters())
why was it initially returning None in the above example
The value of model.layer2.weight.grad is None because at that point no gradient has been computed yet; in fact, no forward pass has even been run. When loss.backward() is executed, the autograd engine computes the gradient for every tensor p with p.requires_grad == True and stores it in p.grad. That's why model.layer2.weight.grad is no longer None after loss.backward().
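To make this concrete, here is a minimal sketch (reusing TinyModel, some_input and ideal_output from the question) of .grad going from None to an actual tensor:

model = TinyModel()
print(model.layer2.weight.requires_grad)   # True: the parameter is tracked from the start
print(model.layer2.weight.grad)            # None: no gradient has been computed yet

prediction = model(some_input)
loss = (ideal_output - prediction).pow(2).sum()
loss.backward()                            # autograd fills p.grad for every p with requires_grad == True
print(model.layer2.weight.grad.shape)      # torch.Size([10, 100])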

Related

PyTorch Geometric custom layer parameters not updating

I am developing a graph neural network using PyTorch Geometric. The idea is to start with multivariate time series, build a graph based on the correlation between those time series and then classify the graph.
I have built a CorrelationLayer that computes the adjacency matrix of the graph using the Pearson coefficient and multiplies it by a matrix of trainable weights.
This matrix is then passed, along with the time series as node features, to a graph convolution layer (I will add other layers for classification after the graph convolution, but I made a super-simplified version for this question).
The problem is that when I try to train the model, the weights of the correlation layer do not update, while the parameters of the graph convolution layer update without any problem.
Here is the code for the correlation layer:
import torch
import torch.nn as nn
from scipy.stats import pearsonr

class CorrelationLayer(nn.Module):
    def __init__(self, num_time_series):
        super().__init__()
        self.num_time_series = num_time_series
        self.weights = nn.Parameter(torch.rand((num_time_series, num_time_series)))

    def forward(self, x):
        correlations = torch.zeros((x.shape[0], x.shape[0]))
        for i in range(x.shape[0]):
            for j in range(i + 1, x.shape[0]):
                c, _ = pearsonr(x[i], x[j])
                correlations[i, j] = c
                correlations[j, i] = c
        correlations = correlations * self.weights
        return correlations
And here is the code for the GCN model:
import torch_geometric
from torch_geometric.nn import GCNConv

class GCN(nn.Module):
    def __init__(self, num_time_series, ts_length, hidden_channels):
        super(GCN, self).__init__()
        self.corr_layer = CorrelationLayer(num_time_series)
        self.graph_conv = GCNConv(ts_length, hidden_channels)

    def forward(self, x):
        adj = self.corr_layer(x)
        out = self.graph_conv(x, torch_geometric.utils.dense_to_sparse(adj)[0])
        return out
This is the code I wrote to train and test the model, with some sample data:
def train(model, X_train, Y_train):
    model.train()
    for x, y in zip(X_train, Y_train):
        out = model(x)
        print(model.corr_layer.weights)
        print(model.graph_conv.state_dict().values())
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

X = torch.tensor([
    [
        [0., 1., 2., 3.],
        [1., 2., 3., 4.],
        [0., 6., 3., 1.],
        [3., 2., 1., 0.]
    ],
    [
        [2., 4., 6., 8.],
        [1., 2., 3., 4.],
        [1., 8., 3., 7.],
        [3., 2., 1., 0.]
    ],
    [
        [0., 1., 2., 3.],
        [1., 2., 3., 4.],
        [0., 6., 3., 1.],
        [3., 2., 1., 0.]
    ]
])

Y = torch.tensor([
    [[1.], [1.], [1.], [1.]],
    [[0.], [0.], [0.], [0.]],
    [[1.], [1.], [1.], [1.]]
])

model = GCN(4, 4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.5)
criterion = torch.nn.MSELoss()

for epoch in range(1, 100):
    train(model, X, Y)
With the prints in the train function we can see that the parameters of the graph_conv layer are updating, while the weights of the correlation layer are not.
At the moment my guess is that the problem is in the transition from the adjacency matrix to the sparse version with dense_to_sparse, but I am not sure.
Has anyone experienced something similar and have any ideas or suggestions?
Well, even though it's a very pointed and specific question, for anyone passing through here in the future, here's the solution:
As pointed out by the user thecho7 on the PyTorch forum (https://discuss.pytorch.org/t/pytorch-geometric-custom-layer-parameters-not-updating/170632/2)
dense_to_sparse returns two tensors: the first is a set of indices of the elements and the second is the value tensor. The index tensor does not carry the gradient, whereas the value tensor does.
So in the forward method I changed
out = self.graph_conv(x, torch_geometric.utils.dense_to_sparse(adj)[0])
to
out = self.graph_conv(x, torch_geometric.utils.dense_to_sparse(adj)[0], torch_geometric.utils.dense_to_sparse(adj)[1])
and now the weights of the correlation layer update.
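A slightly tidier variant of the fix (just a sketch of the same idea) calls dense_to_sparse once and unpacks both return values:

def forward(self, x):
    adj = self.corr_layer(x)
    edge_index, edge_weight = torch_geometric.utils.dense_to_sparse(adj)
    # edge_weight is differentiable, so gradients flow back to the correlation layer
    out = self.graph_conv(x, edge_index, edge_weight)
    return out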

How to compute the parameter importance in pytorch?

I want to develop a lifelong learning system, so I need to prevent important parameters from changing. I read the related paper 'Memory Aware Synapses: Learning what (not) to forget', where a method was mentioned: I need to calculate the gradient of each parameter corresponding to each input image. How should I write my code in PyTorch?
You can do it using the standard optimization procedure and the .backward() method on your loss function.
First, scaling as defined in your link:
class Scaler:
    def __init__(self, parameters, delta):
        self.parameters = list(parameters)  # materialise the iterator so step() can be called repeatedly
        self.delta = delta

    def step(self):
        """Multiplies gradients in place."""
        for param in self.parameters:
            if param.grad is None:
                raise ValueError("backward() has to be called before running scaler")
            param.grad *= self.delta
One can use it just like optimizer.step(), as shown below (see the comments):
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
)

scaler = Scaler(model.parameters(), delta=0.001)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()

X, y = torch.randn(64, 10), torch.randn(64, 1)  # target shaped like the model output

# Optimization loop
EPOCHS = 10
for _ in range(EPOCHS):
    output = model(X)
    loss = criterion(output, y)
    loss.backward()  # Now model has the gradients

    optimizer.step()  # Optimize model's parameters
    print(next(model.parameters()).grad)

    scaler.step()  # Scale gradients
    optimizer.zero_grad()  # Zero gradient before next step
After scaler.step() the scaled gradients are available in param.grad for each parameter (just as they are accessed inside Scaler's step method), so you can do whatever you want with them.
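For instance, here is a minimal sketch (the importance accumulator is purely illustrative, not the paper's exact recipe) of collecting the magnitude of the scaled gradients right after each scaler.step():

import torch

# One accumulator per parameter, same shapes as the parameters themselves
importance = [torch.zeros_like(p) for p in model.parameters()]

# ... inside the optimization loop, right after scaler.step():
with torch.no_grad():
    for imp, param in zip(importance, model.parameters()):
        imp += param.grad.abs()  # accumulate the magnitude of the scaled gradient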

Pytorch Gradient w.r.t. Inputs using BatchNorm

I'm trying to calculate the gradient of the output of a simple neural network with respect to the inputs. The result looks fine when I don't use a BatchNorm layer. Once I do use it, the result doesn't seem to make much sense. Below is a short example to reproduce the effect.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, batch_norm):
        super().__init__()
        self.batch_norm = batch_norm
        self.act_fn = nn.Tanh()
        self.aff1 = nn.Linear(1, 10)
        self.aff2 = nn.Linear(10, 1)
        if batch_norm:
            self.bn = nn.BatchNorm1d(10, affine=False)  # False for simplicity

    def forward(self, x):
        x = self.aff1(x)
        x = self.act_fn(x)
        if self.batch_norm:
            x = self.bn(x)
        x = self.aff2(x)
        return x
import matplotlib.pyplot as plt

x_vals = torch.linspace(0, 1, 100)
x_vals.requires_grad = True

fig, axs = plt.subplots(ncols=2, figsize=(16, 5))
for seed, bn, ax1 in zip([11, 4], [False, True], axs):  # different seeds for better illustration of effect
    torch.manual_seed(seed)
    net = Net(batch_norm=bn)
    net.train()
    pred = net(x_vals[:, None])
    pred_dx = torch.autograd.grad(pred.sum(), x_vals, create_graph=True)[0]

    # visualization
    ax2 = ax1.twinx()
    ax1.plot(x_vals.detach(), pred.detach())
    ax2.plot(x_vals.detach(), pred_dx.detach(), linestyle='--', color='orange')
    min_idx = torch.argmin((pred[1:] - pred[:-1])**2)
    ax2.axvline(x_vals[min_idx].detach(), color='gray', linestyle='dotted')
    ax2.axhline(0, color='gray', linestyle='dotted')
    ax1.set_title(('With' if bn else 'Without') + ' Batch Norm')
plt.show()
The result also seems to be fine when I use evaluation mode. Unfortunately I can't just switch to eval() mode because the nature of my problem (PINNs) requires calculating gradient(s) during training.
I understand that during training the running mean and variance are updated. Maybe that has an impact? Can I still get the correct gradient somehow?
Thanks for your help!
It is a bit confusing because of the word .train(), but:
net.train() puts layers like batch normalization and dropout into their active state. So yes, the running mean and variance will be updated, but the gradient will still be computed.
Only torch.no_grad() or setting your variables' requires_grad to False will prevent gradient computation.
So if you need to "train" your batch normalization, you won't really be able to get a gradient without it being affected by the batch normalization. In case you don't need to train it, just put your model in evaluation mode; gradients are still computed there.
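For illustration, a minimal sketch (reusing the Net class from the question) of computing the input gradient with the model in evaluation mode, where BatchNorm uses its running statistics but gradients are still computed:

net = Net(batch_norm=True)
net.eval()  # BatchNorm now uses running statistics; gradients are still computed

x_vals = torch.linspace(0, 1, 100, requires_grad=True)
pred = net(x_vals[:, None])
pred_dx = torch.autograd.grad(pred.sum(), x_vals, create_graph=True)[0]
print(pred_dx.shape)  # torch.Size([100]), one derivative per input point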

What does PyTorch classifier output?

So I am new to deep learning and started learning PyTorch. I created a classifier model with the following structure.
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class model(nn.Module):
    def __init__(self):
        super(model, self).__init__()
        resnet = models.resnet34(pretrained=True)
        layers = list(resnet.children())[:8]
        self.features1 = nn.Sequential(*layers[:6])
        self.features2 = nn.Sequential(*layers[6:])
        self.classifier = nn.Sequential(nn.BatchNorm1d(512), nn.Linear(512, 3))

    def forward(self, x):
        x = self.features1(x)
        x = self.features2(x)
        x = F.relu(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)
        x = x.view(x.shape[0], -1)
        return self.classifier(x)
So basically I want to classify among three classes {0, 1, 2}. While evaluating, I passed in an image and it returned a tensor with three values like below:
tensor([[-0.1526, 1.3511, -1.0384]], device='cuda:0', grad_fn=<AddmmBackward>)
So my question is: what are these three numbers? Are they probabilities?
P.S. Please pardon me If I asked something too silly.
The final nn.Linear (fully connected) layer of your model's self.classifier produces values that we can call scores, for example [10.3, -3.5, -12.0], the same kind you see in your example: [-0.1526, 1.3511, -1.0384]. These are not normalized and cannot be interpreted as probabilities.
As you can see, it's just "raw, unscaled" network output; these values are hard to use or interpret directly, which is why the common practice is to convert them to a normalized probability distribution by applying softmax after the final layer, as @skinny_func has already described. After that you get probabilities in the range 0 to 1, which is a more intuitive representation.
So after training, what you want to do is apply softmax to the output tensor to extract the probability of each class, and then choose the class with the maximal value (highest probability).
In your case:
prob = torch.nn.functional.softmax(model(x), dim=1)
_, pred_class = torch.max(prob, dim=1)
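As a quick check (a sketch using the scores from your example output), the softmax values lie between 0 and 1 and sum to 1:

import torch
import torch.nn.functional as F

scores = torch.tensor([[-0.1526, 1.3511, -1.0384]])
prob = F.softmax(scores, dim=1)
print(prob)        # approximately tensor([[0.169, 0.761, 0.070]])
print(prob.sum())  # tensor(1.)
_, pred_class = torch.max(prob, dim=1)
print(pred_class)  # tensor([1]) -- class 1 has the highest probability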

How to compute gradient of the error with respect to the model input?

Given a simple 2 layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t the input. Are there existing Pytorch methods that can allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F

class NeuralNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim
model = NeuralNet(n_features=args.n_features,
n_hidden=args.n_hidden,
n_classes=args.n_classes,
dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
def train(epoch):
    t = time.time()
    model.train()
    optimizer_w.zero_grad()
    output = model(features)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer_w.step()
    # grad_features = loss_train.backward() w.r.t to features
    # features -= 0.001 * grad_features

for epoch in range(args.epochs):
    train(epoch)
It is possible: just set input.requires_grad = True for each input batch you're feeding in, and then after loss.backward() you should see that input.grad holds the expected gradient. In other words, if your input to the model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. If your features tensor has, for instance, 3 dimensions, the gradient of a single element is features.grad[i, j, k], etc.
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they are describing, which is then used for differentiation. For instance c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters" and of all other variables as of functions of those. This message tells you that you can only set requires_grad of leaf variables.
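A quick sketch of that leaf/non-leaf distinction:

import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = a + b

print(a.is_leaf, b.is_leaf)  # True True -- created directly, not from other tensors
print(c.is_leaf)             # False    -- c is the result of an operation on a and b
# c.requires_grad = False    # would raise: you can only change requires_grad flags of leaf variables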
Your problem is that at the first iteration, features is random (or however else you initialize it) and is therefore a valid leaf. After your first iteration, features is no longer a leaf, since it becomes an expression calculated based on the previous ones. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
To deal with that, you need to use detach, which breaks the links in the tree and makes autograd treat a tensor as if it were constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.
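Putting this together, a minimal sketch (assuming model, optimizer_w, features, labels, idx_train and args as in your code; the 0.001 step size is just an example) could look like this:

features.requires_grad = True              # must be set while features is still a leaf

for epoch in range(args.epochs):
    output = model(features)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    loss_train.backward()

    optimizer_w.step()                     # update the weights as usual
    optimizer_w.zero_grad()

    grad_features = features.grad          # gradient of the loss w.r.t. the input, same shape as features
    # detach so that next iteration's features is a leaf again, then re-enable tracking
    features = (features - 0.001 * grad_features).detach()
    features.requires_grad = True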
