Pytorch Gradient w.r.t. Inputs using BatchNorm - pytorch

I'm trying to calculate the gradient of the output of a simple neural network with respect to the inputs. The result looks fine when I don't use a BatchNorm layer. Once I do use it, the result doesn't seem to make much sense. Below is a short example to reproduce the effect.
class Net(nn.Module):
def __init__(self, batch_norm):
super().__init__()
self.batch_norm = batch_norm
self.act_fn = nn.Tanh()
self.aff1 = nn.Linear(1, 10)
self.aff2 = nn.Linear(10, 1)
if batch_norm:
self.bn = nn.BatchNorm1d(10, affine=False) # False for simplicity
def forward(self, x):
x = self.aff1(x)
x = self.act_fn(x)
if self.batch_norm:
x = self.bn(x)
x = self.aff2(x)
return x
x_vals = torch.linspace(0, 1, 100)
x_vals.requires_grad = True
fig, axs = plt.subplots(ncols=2, figsize=(16, 5))
for seed, bn, ax1 in zip([11, 4], [False, True], axs): # different seeds for better illustration of effect
torch.manual_seed(seed)
net = Net(batch_norm=bn)
net.train()
pred = net(x_vals[:, None])
pred_dx = torch.autograd.grad(pred.sum(), x_vals, create_graph=True)[0]
# visualization
ax2 = ax1.twinx()
ax1.plot(x_vals.detach(), pred.detach())
ax2.plot(x_vals.detach(), pred_dx.detach(), linestyle='--', color='orange')
min_idx = torch.argmin((pred[1:]-pred[:-1])**2)
ax2.axvline(x_vals[min_idx].detach(), color='gray', linestyle='dotted')
ax2.axhline(0, color='gray', linestyle='dotted')
ax1.set_title(('With' if bn else 'Without') + ' Batch Norm')
plt.show()
The result also seems to be fine when I use evaluation mode. Unfortunately I can't just switch to eval() mode because the nature of my problem (PINNs) requires calculating gradient(s) during training.
I understand that during training the running mean and variance are updated. Maybe that has an impact? Can I still get the correct gradient somehow?
Thanks for your help!

It is a bit confusing due to the use of the word '.train()' but :
net.train() put layers like batch normalization and dropout to an active state. So yes the running mean and variance will be updated but the gradient will still be computed.
with torch.no_grad() or setting your variables' require_grad to False will prevent any gradient computation
So if you need to "train" your batch normalization you won't really be able to get a gradient without being affected by the batch normalization. In case you don't need to train them just put your model in evaluation mode as the gradients are still computed

Related

Clarification in PyTorch's autograd with respect to tracking weights

I was reading this blog from PyTorch. Just before the AutoGrad in training Section , it is mentioned
Be aware that only leaf nodes of the computation have their gradients computed. If you tried, for example, print(c.grad) you’d get back None. In this simple example, only the input is a leaf node, so only it has gradients computed.
Then weights are also considered to be leaf nodes. In the subsequent AutoGrad in training Section this below code block is executed.
BATCH_SIZE = 16
DIM_IN = 1000
HIDDEN_SIZE = 100
DIM_OUT = 10
class TinyModel(torch.nn.Module):
def __init__(self):
super(TinyModel, self).__init__()
self.layer1 = torch.nn.Linear(1000, 100)
self.relu = torch.nn.ReLU()
self.layer2 = torch.nn.Linear(100, 10)
def forward(self, x):
x = self.layer1(x)
x = self.relu(x)
x = self.layer2(x)
return x
some_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)
ideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)
model = TinyModel()
When
print(model.layer2.weight.grad)
is executed it is shown as None.
But after training with the below code snippet, the weights have gradients,
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
prediction = model(some_input)
loss = (ideal_output - prediction).pow(2).sum()
loss.backward()
print(model.layer2.weight.grad[0][0:10])
So is my understanding correct? i.e when weights are initialised by calling TinyModel(), the requires_autograd is set to False. Only when the training starts happening with loss.backward() then the requires_autograd is set to True and the gradient is kept track?
But in other examples, when we create PyTorch models from scratch, where we initialise the weights randomly along with requires_grad=True, the gradient is tracked from beginning.
Or is the gradient tracking generally enabled only when it is started to be trained? If so why initially it was returning None in the above example.
Thank You in advance
I assume by requires_autograd you mean requires_grad.
when weights are initialised by calling TinyModel(), the requires_autograd is set to False.
No, this isn't true. The attribute model.layer2.weight is an instance of nn.Parameter which has requires_grad == True by default. You can verify this yourself:
model = TinyModel()
assert isinstance(model.layer2.weight, nn.Parameter)
assert model.layer2.weight.requires_grad
assert all(p.requires_grad for p in model.parameters())
why initially it was returning None in the above example
The value of model.layer2.weight.grad is None because at that point no gradient is computed yet. In fact, no forward computation is even computed yet. When loss.backward() is executed, the autograd engine computes the gradient of all tensor p with p.requires_grad == True and stores this gradient in p.grad. That's why model.layer2.weight.grad is no longer None after loss.backward().

Pytorch - vanilla Gradient Visualization

I trained a neural network on MNIST using PyTorch:
class MnistCNN(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d( 1, 16, 3, stride = 1, padding = 2)
self.pool1 = nn.MaxPool2d(kernel_size = 2)
self.conv2 = nn.Conv2d(16, 32, 3, stride = 1, padding = 2)
self.pool2 = nn.MaxPool2d(kernel_size = 2)
self.dropout = nn.Dropout(0.5)
self.lin = nn.Linear(32 * 8 * 8, 10)
def forward(self, x):
# conv block
x = F.relu(self.conv1(x))
x = self.pool1(x)
# conv block
x = F.relu(self.conv2(x))
x = self.pool2(x)
# dense block
x = x.view(x.size(0), -1)
x = self.dropout(x)
return self.lin(x)
I would like to implement vanilla Gradient Visualization (see reference below) on my model.
Simonyan, K., Vedaldi, A., Zisserman, A.
Deep inside convolutional networks: Visualising image classification models and saliency maps.
arXiv preprint arXiv:1312.6034 (2013)
Question: How can I implement this method in PyTorch?
If I understand correctly, vanilla gradient visualization consists in computing the partial derivatives of the loss of my model w.r.t all the pixels in my input image. So to make it short, I need to tweek my self.conv1 layer so that it computes the gradient over its input pixels instead of the gradient over its weights.
Please correct me if I'm wrong.
You do not need to change anything about your conv layer. Each layer computes gradients both w.r.t. parameters (for updates) and w.r.t. inputs (for "downstream" gradients by the chain rule). Therefore, all you need is to set your input image's x gradient property to true:
x, y = ... # get one image from MNIST
x.requires_grad_(True) # indicate to pytorch that you would like to look at these gradients
pred = model(x)
loss = criterion(pred, y)
loss.backward() # propagate gradients
x.grad # <- here you should have the gradients of the loss w.r.t pixels

Using autograd to compute Jacobian matrix of outputs with respect to inputs

I apologize if this question is obvious or trivial. I am very new to pytorch and I am trying to understand the autograd.grad function in pytorch. I have a neural network G that takes in inputs (x,t) and outputs (u,v). Here is the code for G:
class GeneratorNet(torch.nn.Module):
"""
A three hidden-layer generative neural network
"""
def __init__(self):
super(GeneratorNet, self).__init__()
self.hidden0 = nn.Sequential(
nn.Linear(2, 100),
nn.LeakyReLU(0.2)
)
self.hidden1 = nn.Sequential(
nn.Linear(100, 100),
nn.LeakyReLU(0.2)
)
self.hidden2 = nn.Sequential(
nn.Linear(100, 100),
nn.LeakyReLU(0.2)
)
self.out = nn.Sequential(
nn.Linear(100, 2),
nn.Tanh()
)
def forward(self, x):
x = self.hidden0(x)
x = self.hidden1(x)
x = self.hidden2(x)
x = self.out(x)
return x
Or simply G(x,t) = (u(x,t), v(x,t)) where u(x,t) and v(x,t) are scalar valued. Goal: Compute $\frac{\partial u(x,t)}{\partial x}$ and $\frac{\partial u(x,t)}{\partial t}$. At every training step, I have a minibatch of size $100$ so u(x,t) is a [100,1] tensor. Here is my attempt to compute the partial derivatives, where coords is the input (x,t) and just like below I added the requires_grad_(True) flag to the coords as well:
tensor = GeneratorNet(coords)
tensor.requires_grad_(True)
u, v = torch.split(tensor, 1, dim=1)
du = autograd.grad(u, coords, grad_outputs=torch.ones_like(u), create_graph=True,
retain_graph=True, only_inputs=True, allow_unused=True)[0]
du is now a [100,2] tensor.
Question: Is this the tensor of the partials for the 100 input points of the minibatch?
There are similar questions like computing derivatives of the output with respect to inputs but I could not really figure out what's going on. I apologize once again if this is already answered or trivial. Thank you very much.
The code you posted should give you the partial derivative of your first output w.r.t. the input. However, you also have to set requires_grad_(True) on the inputs, as otherwise PyTorch does not build up the computation graph starting at the input and thus it cannot compute the gradient for them.
This version of your code example computes du and dv:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u))[0]
dv = torch.autograd.grad(v, coords, grad_outputs=torch.ones_like(v))[0]
You can also compute the partial derivative for a single output:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du_0 = torch.autograd.grad(u[0], coords)[0]
where du_0 == du[0].

How to compute gradient of the error with respect to the model input?

Given a simple 2 layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t the input. Are there existing Pytorch methods that can allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F
class NeuralNet(nn.Module):
def __init__(self, n_features, n_hidden, n_classes, dropout):
super(NeuralNet, self).__init__()
self.fc1 = nn.Linear(n_features, n_hidden)
self.sigmoid = nn.Sigmoid()
self.fc2 = nn.Linear(n_hidden, n_classes)
self.dropout = dropout
def forward(self, x):
x = self.sigmoid(self.fc1(x))
x = F.dropout(x, self.dropout, training=self.training)
x = self.fc2(x)
return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim
model = NeuralNet(n_features=args.n_features,
n_hidden=args.n_hidden,
n_classes=args.n_classes,
dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
def train(epoch):
t = time.time()
model.train()
optimizer.zero_grad()
output = model(features)
loss_train = F.nll_loss(output[idx_train], labels[idx_train])
acc_train = accuracy(output[idx_train], labels[idx_train])
loss_train.backward()
optimizer_w.step()
# grad_features = loss_train.backward() w.r.t to features
# features -= 0.001 * grad_features
for epoch in range(args.epochs):
train(epoch)
It is possible, just set input.requires_grad = True for each input batch you're feeding in, and then after loss.backward() you should see that input.grad holds the expected gradient. In other words, if your input to the model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. In my comments below, I use i as a generalized index - if your parameters has for instance 3 dimensions, replace it with features.grad[i, j, k], etc.
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they are describing, which is then used for differentiation. For instance c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters" and of all other variables as of functions of those. This message tells you that you can only set requires_grad of leaf variables.
Your problem is that at the first iteration, features is random (or however else you initialize) and is therefore a valid leaf. After your first iteration, features is no longer a leaf, since it becomes an expression calculated based on the previous ones. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
to deal with that you need to use detach, which breaks the links in the tree, and makes the autograd treat a tensor as if it was constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.

Recurrent neural network architecture

I'm working on a RNN architecture which does speech enhancement. The dimensions of the input is [XX, X, 1024] where XX is the batch size and X is the variable sequence length.
The input to the network is positive valued data and the output is masked binary data(IBM) which is later used to construct enhanced signal.
For instance, if the input to network is [10, 65, 1024] the output will be [10,65,1024] tensor with binary values. I'm using Tensorflow with mean squared error as loss function. But I'm not sure which activation function to use here(which keeps the outputs either zero or one), Following is the code I've come up with so far
tf.reset_default_graph()
num_units = 10 #
num_layers = 3 #
dropout = tf.placeholder(tf.float32)
cells = []
for _ in range(num_layers):
cell = tf.contrib.rnn.LSTMCell(num_units)
cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob = dropout)
cells.append(cell)
cell = tf.contrib.rnn.MultiRNNCell(cells)
X = tf.placeholder(tf.float32, [None, None, 1024])
Y = tf.placeholder(tf.float32, [None, None, 1024])
output, state = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
out_size = Y.get_shape()[2].value
logit = tf.contrib.layers.fully_connected(output, out_size)
prediction = (logit)
flat_Y = tf.reshape(Y, [-1] + Y.shape.as_list()[2:])
flat_logit = tf.reshape(logit, [-1] + logit.shape.as_list()[2:])
loss_op = tf.losses.mean_squared_error(labels=flat_Y, predictions=flat_logit)
#adam optimizier as the optimization function
optimizer = tf.train.AdamOptimizer(learning_rate=0.001) #
train_op = optimizer.minimize(loss_op)
#extract the correct predictions and compute the accuracy
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
Also my reconstruction isn't good. Can someone suggest on improving the model?
If you want your outputs to be either 0 or 1, to me it seems a good idea to turn this into a classification problem. To this end, I would use a sigmoidal activation and cross entropy:
...
prediction = tf.nn.sigmoid(logit)
loss_op = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=logit))
...
In addition, from my point of view the hidden dimensionality (10) of your stacked RNNs seems quite small for such a big input dimensionality (1024). However this is just a guess, and it is something that needs to be tuned.

Resources