Don't include an operation for gradient computation in PyTorch

I have a custom layer. Let the layer be called 'Gaussian':
import numpy as np
import torch.nn as nn

class Gaussian(nn.Module):
    def __init__(self, k, n):
        super(Gaussian, self).__init__()
        self.k = k
        self.n = n

    # @torch.no_grad()
    def forward(self, x):
        _r = np.random.randint(0, x.shape[0], x.shape[0])
        _sample = x[_r]
        _d = _sample - x
        _number = int(self.k * x.shape[0])
        x[1:_number] = x[1:_number] + (self.n * _d[1:_number]).detach()
        return x
The above class will be used as below:
cnn_model = nn.Sequential(nn.Conv2d(1, 32, 5), Gaussian(), nn.ReLU(), nn.Conv2d(32, 32, 5))
If x is the input, I want the gradient with respect to x to exclude the operations inside the Gaussian module, but to include the computations in the other layers of the network (nn.Conv2d, etc.).
In short, my aim is to use the Gaussian module to perform calculations, but those calculations should not be included in the gradient computation.
I tried to do the following:
1. Used the @torch.no_grad() decorator above the forward method of Gaussian.
2. Used detach after every operation in the Gaussian module:
   x[1:_number] = x[1:_number] + (self.n * _d[1:_number]).detach(), and similarly for the other operations.
3. Used y = x.detach() in the forward method, performed the operations on y, and then set x.data = y.
Are the above methods correct?

Gradient calculation makes sense when there are parameters to optimise.
If your module does not have any parameters, then no gradient will be stored for it, because there are no parameters to associate it with.
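If the goal is for the Gaussian computations to stay out of the graph while gradients still flow through x unchanged, here is a minimal sketch of the no_grad approach (k and n are assumed to be fixed hyperparameters, and the in-place update of x is replaced by an additive term):
import torch
import torch.nn as nn

class Gaussian(nn.Module):
    def __init__(self, k, n):
        super(Gaussian, self).__init__()
        self.k = k
        self.n = n

    def forward(self, x):
        with torch.no_grad():
            # none of these operations is recorded by autograd
            r = torch.randint(0, x.shape[0], (x.shape[0],))
            d = x[r] - x
            delta = torch.zeros_like(x)
            number = int(self.k * x.shape[0])
            delta[1:number] = self.n * d[1:number]
        # delta is a constant to autograd: the gradient of the output with
        # respect to x is the identity, so the surrounding Conv2d layers
        # still receive gradients as usual
        return x + delta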

Related

How to add a custom loss function to Keras that solves an ODE?

I'm new to Keras, sorry if this is a silly question!
I am trying to get a single-layer neural network to find the solution to a first-order ODE. The neural network N(x) should be the approximate solution to the ODE. I defined the right-hand side function f, and a transformed function g that includes the boundary conditions. I then wrote a custom loss function that only minimises the residual of the approximate solution. I created some empty data for the optimizer to iterate over, and set it going. The optimizer does not seem to be able to adjust the weights to minimize this loss function. Am I thinking about this wrong?
import sys
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

# Define initial condition
A = 1.0

# Define empty training data
x_train = np.empty((10000, 1))
y_train = np.empty((10000, 1))

# Define transformed equation (forced to satisfy boundary conditions)
g = lambda x: N(x.reshape((1000,))) * x + A

# Define rhs function
f = lambda x: np.cos(2 * np.pi * x)

# Define loss function
def OdeLoss(g, f):
    epsilon = sys.float_info.epsilon
    def loss(y_true, y_pred):
        x = np.linspace(0, 1, 1000)
        R = K.sum(((g(x + epsilon) - g(x)) / epsilon - f(x))**2)
        return R
    return loss
# Define input tensor
input_tensor = tf.keras.Input(shape=(1,))
# Define hidden layer
hidden = tf.keras.layers.Dense(32)(input_tensor)
# Define non-linear activation layer
activate = tf.keras.activations.relu(hidden)
# Define output tensor
output_tensor = tf.keras.layers.Dense(1)(activate)
# Define neural network
N = tf.keras.Model(input_tensor, output_tensor)
# Compile model
N.compile(loss=OdeLoss(g, f), optimizer='adam')
N.summary()
# Train model
history = N.fit(x_train, y_train, batch_size=1, epochs=1, verbose=1)
The method is based on Lecture 3.2 of MIT course 18.337J, by Chris Rackauckas, who does this in Julia. Cheers!
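One numerical point worth noting: the finite-difference step sys.float_info.epsilon (about 2.2e-16) is far smaller than float32 precision, so the difference quotient is dominated by rounding noise. Below is a sketch of the residual loss written with TensorFlow ops and a float32-friendly step; the grid x_grid, the step eps, and the make_ode_loss wrapper are assumptions for illustration, not a verified fix:
import numpy as np
import tensorflow as tf

A = 1.0
# fixed collocation grid; shape matches the model's (1,) input
x_grid = tf.constant(np.linspace(0, 1, 1000).reshape(-1, 1), dtype=tf.float32)
eps = 1e-4  # a float32-friendly step, unlike sys.float_info.epsilon

def make_ode_loss(model):
    def loss(y_true, y_pred):
        g = lambda z: model(z) * z + A     # trial solution with g(0) = A
        f = tf.cos(2.0 * np.pi * x_grid)   # right-hand side
        residual = (g(x_grid + eps) - g(x_grid)) / eps - f
        return tf.reduce_sum(residual ** 2)
    return loss

# usage: N.compile(loss=make_ode_loss(N), optimizer='adam')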

How does Pytorch build the computation graph

Here is example pytorch code from the website:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        # all dimensions except the batch dimension
        size = x.size()[1:]
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
In the forward function, we simply apply a series of transformations to x, but never explicitly define which objects are part of that transformation. Yet when computing the gradient and updating the weights, Pytorch 'magically' knows which weights to update and how the gradient should be calculated.
How does this process work? Is there code analysis going on, or something else that I am missing?
Yes, there is implicit graph construction during the forward pass. Examine the result tensor: it carries an attribute like grad_fn=<CatBackward>, which is a link that allows you to unroll the whole computation graph. The graph is built during the actual forward computation, no matter how you defined your network module, object-oriented with nn or the 'functional' way.
You can exploit this graph for net analysis, as torchviz do here: https://github.com/szagoruyko/pytorchviz/blob/master/torchviz/dot.py
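A quick way to see those links (a small sketch; the exact Backward class names depend on the ops used):
import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = (a * b).sum()

print(c.grad_fn)                 # e.g. <SumBackward0 object at 0x...>
print(c.grad_fn.next_functions)  # links back to the MulBackward0 node, and so on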

Using autograd to compute Jacobian matrix of outputs with respect to inputs

I apologize if this question is obvious or trivial. I am very new to pytorch and I am trying to understand the autograd.grad function in pytorch. I have a neural network G that takes in inputs (x,t) and outputs (u,v). Here is the code for G:
import torch
import torch.nn as nn

class GeneratorNet(torch.nn.Module):
    """
    A three hidden-layer generative neural network
    """
    def __init__(self):
        super(GeneratorNet, self).__init__()
        self.hidden0 = nn.Sequential(
            nn.Linear(2, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden1 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.out = nn.Sequential(
            nn.Linear(100, 2),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x
Or simply G(x,t) = (u(x,t), v(x,t)) where u(x,t) and v(x,t) are scalar valued. Goal: Compute $\frac{\partial u(x,t)}{\partial x}$ and $\frac{\partial u(x,t)}{\partial t}$. At every training step, I have a minibatch of size $100$ so u(x,t) is a [100,1] tensor. Here is my attempt to compute the partial derivatives, where coords is the input (x,t) and just like below I added the requires_grad_(True) flag to the coords as well:
tensor = GeneratorNet(coords)
tensor.requires_grad_(True)
u, v = torch.split(tensor, 1, dim=1)
du = autograd.grad(u, coords, grad_outputs=torch.ones_like(u), create_graph=True,
                   retain_graph=True, only_inputs=True, allow_unused=True)[0]
du is now a [100,2] tensor.
Question: Is this the tensor of the partials for the 100 input points of the minibatch?
There are similar questions like computing derivatives of the output with respect to inputs but I could not really figure out what's going on. I apologize once again if this is already answered or trivial. Thank you very much.
The code you posted should give you the partial derivative of your first output w.r.t. the input. However, you also have to set requires_grad_(True) on the inputs, as otherwise PyTorch does not build up the computation graph starting at the input and thus it cannot compute the gradient for them.
This version of your code example computes du and dv:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u))[0]
dv = torch.autograd.grad(v, coords, grad_outputs=torch.ones_like(v))[0]
You can also compute the partial derivative for a single output:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du_0 = torch.autograd.grad(u[0], coords)[0]
where du_0 == du[0].
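If you want to double-check that du really holds per-sample partials, torch.autograd.functional.jacobian can compute the full Jacobian; a sketch (much more expensive, so only suitable for small batches):
import torch
from torch.autograd.functional import jacobian

net = GeneratorNet()
coords = torch.randn(10, 2)

# Full Jacobian of the first output channel w.r.t. all inputs: shape [10, 10, 2]
J = jacobian(lambda c: net(c)[:, 0], coords)

# Each sample depends only on its own input row, so the off-diagonal blocks
# are zero and the diagonal reproduces du from the grad_outputs trick above
idx = torch.arange(10)
du_check = J[idx, idx]  # shape [10, 2]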

How to compute gradient of the error with respect to the model input?

Given a simple 2 layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t the input. Are there existing Pytorch methods that can allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F

class NeuralNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim

model = NeuralNet(n_features=args.n_features,
                  n_hidden=args.n_hidden,
                  n_classes=args.n_classes,
                  dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
def train(epoch):
    t = time.time()
    model.train()
    optimizer_w.zero_grad()
    output = model(features)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer_w.step()
    # grad_features = loss_train.backward() w.r.t to features
    # features -= 0.001 * grad_features

for epoch in range(args.epochs):
    train(epoch)
It is possible: just set input.requires_grad = True for each input batch you're feeding in, and then after loss.backward() you should see that input.grad holds the expected gradient. In other words, if your input to the model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. In my comments below, I use i as a generalized index; if your input has, for instance, 3 dimensions, replace it with features.grad[i, j, k], etc.
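For instance, a minimal self-contained sketch (the linear model and shapes here are stand-ins, not your NeuralNet):
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 3)                     # stand-in model
features = torch.randn(4, 16, requires_grad=True)  # M x N input
labels = torch.randint(0, 3, (4,))

loss = F.nll_loss(F.log_softmax(model(features), dim=1), labels)
loss.backward()

print(features.grad.shape)  # torch.Size([4, 16]), same shape as features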
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they describe, which is then used for differentiation. For instance, c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as its parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters", and of all other variables as functions of those. This message tells you that you can only set requires_grad on leaf variables.
Your problem is that at the first iteration, features is random (or however else you initialize) and is therefore a valid leaf. After your first iteration, features is no longer a leaf, since it becomes an expression calculated based on the previous ones. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
To deal with that, you need to use detach, which breaks the links in the tree and makes autograd treat the tensor as if it were a constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.
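Putting the pieces together, a sketch of the input-update loop with the detach fix (same stand-in setup as the sketch above; the step size 0.001 mirrors your commented-out code):
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 3)
features = torch.randn(4, 16, requires_grad=True)
labels = torch.randint(0, 3, (4,))

for step in range(10):
    loss = F.nll_loss(F.log_softmax(model(features), dim=1), labels)
    model.zero_grad()
    loss.backward()
    # detach so the updated features is again a leaf for the next iteration
    features = (features - 0.001 * features.grad).detach().requires_grad_(True)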

Pytorch: Learnable threshold for clipping activations

What is the proper way to clip ReLU activations with a learnable threshold? Here's how I implemented it, however I'm not sure if this is correct:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.act_max = nn.Parameter(torch.Tensor([0]), requires_grad=True)
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.pool = nn.MaxPool2d(2, 2)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(64 * 5 * 5, 10)

    def forward(self, input):
        conv1 = self.conv1(input)
        pool1 = self.pool(conv1)
        relu1 = self.relu(pool1)
        relu1[relu1 > self.act_max] = self.act_max
        conv2 = self.conv2(relu1)
        pool2 = self.pool(conv2)
        relu2 = self.relu(pool2)
        relu2 = relu2.view(relu2.size(0), -1)
        linear = self.linear(relu2)
        return linear
model = Net()
torch.nn.init.kaiming_normal_(model.parameters)
nn.init.constant(model.act_max, 1.0)
model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
for epoch in range(100):
    for i in range(1000):
        output = model(input)
        loss = nn.CrossEntropyLoss()(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        model.act_max.data = model.act_max.data - 0.001 * model.act_max.grad.data
I had to add the last line because without it the value would not update for some reason.
UPDATE: I am now trying a method to compute the upper bound (act_max) based on the gradients of the activations:
1. For all activations above the threshold (relu1[relu1 > self.act_max]), look at their gradients and compute the average direction those gradients point in.
2. For all positive activations below the threshold, compute the average gradient of the direction they want to change in.
3. The sum of these two average gradients determines the direction and magnitude of the change for act_max.
There are two problems with that code.
The implementation-level one is that you're using an in-place operation which generally doesn't work well with autograd. Instead of
relu1[relu1 > self.act_max] = self.act_max
you should use an out-of-place operation like
relu1 = torch.where(relu1 > self.act_max, self.act_max, relu1)
The other is more general: neural networks are usually trained with gradient descent methods, and threshold values can have no gradient, because the loss function is not differentiable with respect to the thresholds.
In your model you're using a dirty workaround (whether you write it as it is or use torch.where): model.act_max.grad.data is only defined because for some elements their value is set to model.act_max. But this gradient knows nothing about why they were set to that value. To make things more concrete, let's define a cutoff operation C(x, t) which tells whether x is above or below the threshold t
C(x, t) = 1 if x < t else 0
and write your clipping operation as a product
clip(x, t) = C(x, t) * x + (1 - C(x, t)) * t
You can then see that the threshold t has a twofold meaning: it controls when to cut off (inside C), and it controls the value above the cutoff (the trailing t). We can therefore generalize the operation as
clip(x, t1, t2) = C(x, t1) * x + (1 - C(x, t1)) * t2
The problem with your operation is that it is only differentiable with respect to t2, not t1. Your solution ties the two together so that t1 == t2, but it is still the case that gradient descent acts as if the cutoff point itself were fixed: it only changes the above-the-threshold value.
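To make this concrete, a small sketch (the names t1 and t2 come from the formulas above, not from your code): the hard comparison contributes nothing to the graph, so after backward only t2 has a gradient:
import torch

x = torch.randn(5, requires_grad=True)
t1 = torch.tensor([0.5], requires_grad=True)  # where to cut off
t2 = torch.tensor([0.5], requires_grad=True)  # value above the cutoff

mask = (x < t1).float()            # C(x, t1): a hard step, constant to autograd
out = mask * x + (1 - mask) * t2   # clip(x, t1, t2)
out.sum().backward()

print(t1.grad)  # None: no differentiable path through the comparison
print(t2.grad)  # (1 - mask).sum() worth of gradient flows to t2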
For this reason, in general your thresholding operation may not learn the value you would hope it learns. This is something to keep in mind when developing your operations, but it is not a guarantee of failure: in fact, if you consider the standard ReLU on the biased output of some linear unit, we get a similar picture. We define the cutoff operation H
H(x, t) = 1 if x > t else 0
and ReLU as
ReLU(x + b, t) = (x + b) * H(x + b, t) = (x + b) * H(x, t - b)
where we could again generalize to
ReLU(x, b, t) = (x + b) * H(x, t)
and again we can only learn b, while t implicitly follows it. Yet it seems to work :)
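For completeness, a minimal sketch of the out-of-place clipping as its own module, using torch.min as an equivalent alternative to the torch.where form above (it carries the same t1/t2 caveat):
import torch
import torch.nn as nn

class ClippedReLU(nn.Module):
    """ReLU followed by a learnable upper clip: min(relu(x), act_max)."""
    def __init__(self, init_max=1.0):
        super(ClippedReLU, self).__init__()
        self.act_max = nn.Parameter(torch.tensor([init_max]))

    def forward(self, x):
        x = torch.relu(x)
        # out-of-place; autograd routes gradient to act_max exactly where
        # the activation was clipped (x > act_max)
        return torch.min(x, self.act_max)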
