PyTorch: Learnable threshold for clipping activations

What is the proper way to clip ReLU activations with a learnable threshold? Here's how I implemented it; however, I'm not sure this is correct:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.act_max = nn.Parameter(torch.Tensor([0]), requires_grad=True)
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.pool = nn.MaxPool2d(2, 2)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(64 * 5 * 5, 10)

    def forward(self, input):
        conv1 = self.conv1(input)
        pool1 = self.pool(conv1)
        relu1 = self.relu(pool1)
        relu1[relu1 > self.act_max] = self.act_max
        conv2 = self.conv2(relu1)
        pool2 = self.pool(conv2)
        relu2 = self.relu(pool2)
        relu2 = relu2.view(relu2.size(0), -1)
        linear = self.linear(relu2)
        return linear
model = Net()
torch.nn.init.kaiming_normal_(model.parameters)
nn.init.constant(model.act_max, 1.0)
model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
for epoch in range(100):
    for i in range(1000):
        output = model(input)
        loss = nn.CrossEntropyLoss()(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        model.act_max.data = model.act_max.data - 0.001 * model.act_max.grad.data
I had to add the last line because without it the value would not update for some reason.
UPDATE: I am now trying a method to compute the upper bound (act_max) based on the gradients of the activations:
For all activations above the threshold (relu1[relu1 > self.act_max]), look at their gradients and compute the average direction those gradients point in.
For all positive activations below the threshold, compute the average gradient indicating the direction they want to change in.
The sum of these two average gradients determines the direction and magnitude of the change for act_max.

There are two problems with that code.
The implementation-level one is that you're using an in-place operation, which generally doesn't work well with autograd. Instead of
relu1[relu1 > self.act_max] = self.act_max
you should use an out-of-place operation like
relu1 = torch.where(relu1 > self.act_max, self.act_max, relu1)
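Wrapped up as a module, a minimal sketch of the out-of-place version could look like this (the class name and initial value are my own, not part of the question):

import torch
import torch.nn as nn

class ClippedReLU(nn.Module):
    """ReLU followed by a learnable upper clip."""
    def __init__(self, init_max=1.0):
        super(ClippedReLU, self).__init__()
        self.act_max = nn.Parameter(torch.tensor([init_max]))

    def forward(self, x):
        x = torch.relu(x)
        # out-of-place, so autograd can route gradients to act_max
        # for the clipped entries and to x for the rest
        return torch.where(x > self.act_max, self.act_max, x)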
The other problem is more general: neural networks are trained with gradient descent methods, and threshold values can have no gradient; the loss function is not differentiable with respect to thresholds.
In your model you're using a dirty workaround (whether you write it as is or use torch.where): model.act_max.grad.data is only defined because, for some elements, their value is set to model.act_max. But this gradient knows nothing about why they were set to that value. To make things more concrete, let's define a cutoff operation C(x, t) which indicates whether x is above or below the threshold t
C(x, t) = 1 if x < t else 0
and write your clipping operation as a product
clip(x, t) = C(x, t) * x + (1 - C(x, t)) * t
You can then see that the threshold t plays a twofold role: it controls when to cut off (inside C) and it supplies the value above the cutoff (the trailing t). We can therefore generalize the operation as
clip(x, t1, t2) = C(x, t1) * x + (1 - C(x, t1)) * t2
The problem with your operation is that it is only differentiable with respect to t2, not t1. Your solution ties the two together so that t1 == t2, but it is still the case that gradient descent acts as if the threshold itself never moved; it only adjusts the above-the-threshold value.
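To see this concretely, here is a tiny check (the numbers are my own): the gradient reaching t counts one unit per clipped element, i.e. it only ever sees the t2 role:

import torch

t = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([0.5, 2.0, 3.0])
y = torch.where(x > t, t, x)  # clip(x, t1=t, t2=t)
y.sum().backward()
print(t.grad)  # tensor([2.]): two elements were clipped; the cutoff
               # decision itself (the t1 role) contributed nothing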
For this reason, in general, your thresholding operation may not be learning the value you hope it learns. This is something to keep in mind when developing your operations, but it is not a guarantee of failure; in fact, if you consider the standard ReLU on the biased output of some linear unit, we get a similar picture. We define the cutoff operation H
H(x, t) = 1 if x > t else 0
and ReLU as
ReLU(x + b, t) = (x + b) * H(x + b, t) = (x + b) * H(x, t - b)
where we could again generalize to
ReLU(x, b, t) = (x + b) * H(x, t)
and again we can only learn b, while t implicitly follows it. Yet it seems to work :)

Related

Tuning multiple losses in a multi-headed neural network

I have a network to simultaneously predict a max and a min using the same logits (an impossible task, but hear me out). Basically, I want to turn a knob to say "now predict the max of a given set of values" or to predict the min. If the knob is in between, it'll predict a min or max with 50% probability. My code is based on the You Only Train Once paper: https://openreview.net/pdf?id=HyxY6JHKwr. The paper claims that you can train one network and then tune how you combine the losses to produce the network you want. So in my case, I want to tune it in such a way that my network predicts either the max of a given set of numbers or the min. But I am failing at this task. My network model is as follows:
class MyModel(Module):
    def __init__(self, vocab_size, embedding_dim, input_dim):
        super(MyModel, self).__init__()
        self.input_dim = input_dim
        self.embedding_dim = embedding_dim
        self.emb = Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.l1 = Linear(input_dim * embedding_dim, 64)
        self.l2 = Linear(64, 32)
        self.l3 = Linear(32, 10)
        self.loss_parameter_mlp = Sequential(
            Linear(2, 2),
            Sigmoid(),
        )

    def forward(self, x, lambd):
        lambd = self.loss_parameter_mlp(lambd)
        x = self.emb(x).reshape(-1, self.input_dim * self.embedding_dim)
        x = ReLU()(self.l1(x))
        x = x * lambd[:, 0].reshape(-1, 1) + lambd[:, 1].reshape(-1, 1)
        x = ReLU()(self.l2(x))
        x = x * lambd[:, 0].reshape(-1, 1) + lambd[:, 1].reshape(-1, 1)
        logits = ReLU()(self.l3(x))
        return logits
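For concreteness, here is a hypothetical way to instantiate and call this model; the sizes follow the description below, but the embedding dimension and batch size are my own choices:

model = MyModel(vocab_size=100, embedding_dim=8, input_dim=10)
x = torch.randint(1, 100, (4, 10))                # batch of 4 sequences of 10 integers
lambd = torch.tensor([[1.0, 0.0]]).expand(4, -1)  # the "predict the max" knob
logits = model(x, lambd)                          # shape (4, 10)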
My inputs are 10 integers from 1 to 99, and my model outputs are the logits - the argmax of which should contain either the min, or the max, based on my hyperparameters lambd. I specifically chose this problem, since I want the network to predict two polar opposites (max and min) at the same time, which it cannot. It's (in my mind) a simpler version of the problem the paper is trying to solve. My training code is as shown below
# Training
epochs = 200
alpha = np.linspace(0, 1, epochs)
np.random.shuffle(alpha)
for epoch in range(epochs):
    lambd = torch.tensor([[alpha[epoch], (1 - alpha[epoch])]], dtype=torch.float32)
    for batch, x in enumerate(train_loader):
        y_max = torch.argmax(x, axis=1)
        y_min = torch.argmin(x, axis=1)
        lambd_b = lambd.expand(len(y_max), -1)
        y_pred = model(x, lambd_b)
        loss_max = CE_loss(y_pred, y_max)
        loss_min = CE_loss(y_pred, y_min)
        optimizer.zero_grad()
        loss = alpha[epoch] * loss_max + (1 - alpha[epoch]) * loss_min
        loss.backward()
        optimizer.step()
However, the network learns to ignore the parameters lambd (in other words, the knobs to tune max or min just don't work). The network does learn to predict max and min (they share the same accuracy), which is expected. What should I do to ensure that the knobs work?

Computing Jacobian and Derivative in Tensorflow is extremely slow

Is there a more efficient way to compute the Jacobian? (There must be; it doesn't even run for a single batch.) I want to compute the loss as given in the self-explaining neural network paper. The input has a shape of (32, 365, 3), where 32 is the batch size. The loss I want to minimize is Equation 3 of the paper.
I believe that I am not using the GradientTape optimally.
def compute_loss_theta(tape, parameter, concept, output, x):
    b = x.shape[0]
    in_dim = (x.shape[1], x.shape[2])
    feature_dim = in_dim[0] * in_dim[1]
    J = tape.batch_jacobian(concept, x)
    grad_fx = tape.gradient(output, x)
    grad_fx = tf.reshape(grad_fx, shape=(b, feature_dim))
    J = tf.reshape(J, shape=(b, feature_dim, feature_dim))
    parameter = tf.expand_dims(parameter, axis=1)
    loss_theta_matrix = grad_fx - tf.matmul(parameter, J)
    loss_theta = tf.norm(loss_theta_matrix)
    return loss_theta
for i in range(10):
    for x, y in train_dataset:
        with tf.GradientTape(persistent=True) as tape:
            tape.watch(x)
            parameter, concept, output = model(x)
            loss_theta = compute_loss_theta(tape, parameter, concept, output, x)
            loss_y = loss_object(y_true=y, y_pred=output)
            loss_value = loss_y + eps * loss_theta
        gradients = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
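One generic thing to try (an assumption on my part, not verified against this model) is compiling the training step with tf.function, so the loop body, including the batch_jacobian, runs as a traced graph instead of op-by-op Python:

import tensorflow as tf

@tf.function  # traced once, then executed as a graph
def train_step(x, y):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        parameter, concept, output = model(x)
        loss_theta = compute_loss_theta(tape, parameter, concept, output, x)
        loss_y = loss_object(y_true=y, y_pred=output)
        loss_value = loss_y + eps * loss_theta
    gradients = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    return loss_value

for i in range(10):
    for x, y in train_dataset:
        loss_value = train_step(x, y)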

Implementing dropout with pytorch

I wonder, if I want to implement dropout by myself, whether something like the following is sufficient (taken from Implementing dropout from scratch):
class MyDropout(nn.Module):
    def __init__(self, p: float = 0.5):
        super(MyDropout, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p))
        self.p = p

    def forward(self, X):
        if self.training:
            binomial = torch.distributions.binomial.Binomial(probs=1 - self.p)
            return X * binomial.sample(X.size()) * (1.0 / (1 - self.p))
        return X
My concern is that even if the unwanted weights are masked out (either this way or by using a mask tensor), there can still be gradient flow through the zeroed weights (https://discuss.pytorch.org/t/custom-connections-in-neural-network-layers/3027/9). Is my concern valid?
Dropout does not mask the weights; it masks the features.
For linear layers implementing y = <w, x>, the gradient w.r.t. the parameters w is x. Therefore, if you set entries of x to zero, there will be no update for the corresponding weights in the adjacent linear layer.
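A two-line check of that claim (my own toy example):

import torch

w = torch.randn(3, requires_grad=True)
x = torch.tensor([1.0, 0.0, 2.0])  # the middle feature is "dropped"
(w * x).sum().backward()
print(w.grad)  # zero exactly where x was zeroed, so that weight gets no update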

Don't include an operation for gradient computation in PyTorch

I have a custom layer. Let the layer be called 'Gaussian'
class Gaussian(nn.Module):
    def __init__(self, k=0.5, n=0.1):  # k, n: assumed hyperparameters
        super(Gaussian, self).__init__()
        self.k = k
        self.n = n

    ##torch.no_grad
    def forward(self, x):
        _r = np.random.randint(0, x.shape[0], x.shape[0])
        _sample = x[_r]
        _d = (_sample - x)
        _number = int(self.k * x.shape[0])
        x[1:_number] = x[1:_number] + (self.n * _d[1:_number]).detach()
        return x
The above class will be used as below:
cnn_model = nn.Sequential(nn.Conv2d(1, 32, 5), Gaussian(), nn.ReLU(), nn.Conv2d(32, 32, 5))
If x is the input, I want the gradient of x to exclude the operations present in the Gaussian module, but to include the calculations in the other layers of the network (nn.Conv2d, etc.).
In short, my aim is to use the Gaussian module to perform calculations without those calculations being included in the gradient computation.
I tried the following:
1. Using @torch.no_grad above the forward method of Gaussian (commented out in the code above).
2. Using detach after every operation in the Gaussian module: x[1:_number] = x[1:_number] + (self.n * _d[1:_number]).detach(), and similarly for the other operations.
3. Using y = x.detach() in the forward method, performing the operations on y, and then setting x.data = y.
Are the above methods correct?
Gradient calculation makes sense only when there are parameters to optimize.
If your module does not have any parameters, then no gradient will be stored for it, because there are no parameters to associate it with.
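That said, if the goal is simply to keep the perturbation out of the graph, a minimal sketch of the no_grad/detach pattern could look like this (the k and n defaults are assumed, as above):

import torch
import torch.nn as nn

class Gaussian(nn.Module):
    def __init__(self, k=0.5, n=0.1):  # assumed hyperparameters
        super(Gaussian, self).__init__()
        self.k = k
        self.n = n

    def forward(self, x):
        with torch.no_grad():  # nothing in this block enters the graph
            r = torch.randint(0, x.shape[0], (x.shape[0],))
            d = x[r] - x
            delta = torch.zeros_like(x)
            number = int(self.k * x.shape[0])
            delta[1:number] = self.n * d[1:number]
        # x plus a constant: backprop treats the layer as the identity
        return x + delta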

Using autograd to compute Jacobian matrix of outputs with respect to inputs

I apologize if this question is obvious or trivial. I am very new to PyTorch and I am trying to understand the autograd.grad function. I have a neural network G that takes in inputs (x, t) and outputs (u, v). Here is the code for G:
class GeneratorNet(torch.nn.Module):
    """
    A three hidden-layer generative neural network
    """
    def __init__(self):
        super(GeneratorNet, self).__init__()
        self.hidden0 = nn.Sequential(
            nn.Linear(2, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden1 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.out = nn.Sequential(
            nn.Linear(100, 2),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x
Or simply G(x, t) = (u(x, t), v(x, t)), where u(x, t) and v(x, t) are scalar-valued. Goal: compute $\frac{\partial u(x,t)}{\partial x}$ and $\frac{\partial u(x,t)}{\partial t}$. At every training step I have a minibatch of size $100$, so u(x, t) is a [100, 1] tensor. Here is my attempt to compute the partial derivatives, where coords is the input (x, t); just as below, I added the requires_grad_(True) flag to coords as well:
tensor = GeneratorNet(coords)
tensor.requires_grad_(True)
u, v = torch.split(tensor, 1, dim=1)
du = autograd.grad(u, coords, grad_outputs=torch.ones_like(u), create_graph=True,
                   retain_graph=True, only_inputs=True, allow_unused=True)[0]
du is now a [100,2] tensor.
Question: Is this the tensor of the partials for the 100 input points of the minibatch?
There are similar questions like computing derivatives of the output with respect to inputs but I could not really figure out what's going on. I apologize once again if this is already answered or trivial. Thank you very much.
The code you posted should give you the partial derivative of your first output w.r.t. the input. However, you also have to set requires_grad_(True) on the inputs, as otherwise PyTorch does not build up the computation graph starting at the input and thus it cannot compute the gradient for them.
This version of your code example computes du and dv:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
# retain_graph=True so the graph survives for the second grad call
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u), retain_graph=True)[0]
dv = torch.autograd.grad(v, coords, grad_outputs=torch.ones_like(v))[0]
You can also compute the partial derivative for a single output:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du_0 = torch.autograd.grad(u[0], coords)[0]
where du_0 == du[0] (assuming, of course, that the same net and coords are used in both snippets).
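If you want the full Jacobian of all outputs with respect to all inputs in one call, rather than one grad call per output column, torch.autograd.functional.jacobian is an option (at the cost of memory):

from torch.autograd.functional import jacobian

net = GeneratorNet()
coords = torch.randn(10, 2)
J = jacobian(net, coords)  # shape (10, 2, 10, 2): output indices, then input indices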
