Computing Jacobian and Derivative in Tensorflow is extremely slow - python-3.x

Is there a more efficient way to compute the Jacobian? There must be, because my current approach does not even finish for a single batch. I want to compute the loss as given in the self-explanatory neural network. The input has shape (32, 365, 3), where 32 is the batch size. The loss I want to minimize is Equation 3 of the paper.
I suspect that I am not using GradientTape optimally.
def compute_loss_theta(tape, parameter, concept, output, x):
    b = x.shape[0]
    in_dim = (x.shape[1], x.shape[2])
    feature_dim = in_dim[0]*in_dim[1]
    J = tape.batch_jacobian(concept, x)
    grad_fx = tape.gradient(output, x)
    grad_fx = tf.reshape(grad_fx,shape=(b, feature_dim))
    J = tf.reshape(J, shape=(b, feature_dim, feature_dim))
    parameter = tf.expand_dims(parameter, axis =1)
    loss_theta_matrix = grad_fx - tf.matmul(parameter, J)
    loss_theta = tf.norm(loss_theta_matrix)
    return loss_theta
for i in range(10):
    for x, y in train_dataset:
        with tf.GradientTape(persistent=True) as tape:
            tape.watch(x)
            parameter, concept, output = model(x)
            loss_theta = compute_loss_theta(tape, parameter, concept, output , x)
            loss_y = loss_object(y_true=y, y_pred=output)
            loss_value = loss_y + eps*loss_theta
        gradients = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
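One possible direction, offered as a sketch and not as a confirmed fix: the product tf.matmul(parameter, J) is a vector-Jacobian product, so it may be possible to get it from a single backward pass with the output_gradients argument of tape.gradient, without ever building the full Jacobian. The helper theta_times_jacobian and the tiny linear concept_fn below are hypothetical names used only for illustration; the check compares the result against the explicit batch_jacobian on a toy input.

```python
import tensorflow as tf

def theta_times_jacobian(concept_fn, x, theta):
    """Per-sample theta^T J, with J = d concept / d x, without materializing J."""
    with tf.GradientTape() as tape:
        tape.watch(x)
        concept = concept_fn(x)
    # output_gradients weights every concept entry by theta before
    # backpropagating, which is a vector-Jacobian product.
    return tape.gradient(concept, x, output_gradients=theta)

# Toy check against the explicit Jacobian (hypothetical linear "concept" map).
tf.random.set_seed(0)
W = tf.random.normal((3, 4))
x = tf.random.normal((2, 3))
theta = tf.random.normal((2, 4))

def concept_fn(inp):
    return tf.matmul(inp, W)

vjp = theta_times_jacobian(concept_fn, x, theta)

with tf.GradientTape() as tape:
    tape.watch(x)
    concept = concept_fn(x)
J = tape.batch_jacobian(concept, x)            # shape (2, 4, 3)
explicit = tf.einsum("bi,bij->bj", theta, J)   # shape (2, 3)
```

In a training loop this call would presumably have to sit inside an outer GradientTape so that the resulting loss term can still be differentiated with respect to the model weights; that part is not shown or tested here.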

Related

Efficiently training a network simultaneously on labels and partial derivaties

I'm trying to train a network in pytorch along the lines of this idea.
The author creates a simple MLP (4 hidden layers) and then explicitly works out what the partial derivatives of the output are with respect to the inputs. He then trains the network on the training labels as well as on the gradients of the output with respect to the input data (these gradients are also part of the training data).
To replicate the idea in pytorch, my training loop looks like this:
import torch
import torch.nn.functional as F
class vanilla_net(torch.nn.Module):
    def __init__(self,
                 input_dim,       # dimension of inputs, e.g. 10
                 hidden_units,    # units in hidden layers, assumed constant, e.g. 20
                 hidden_layers):  # number of hidden layers, e.g. 4
        super(vanilla_net, self).__init__()
        self.input = torch.nn.Linear(input_dim, hidden_units)
        self.hidden = torch.nn.ModuleList()
        for hl in range(hidden_layers):
            layer = torch.nn.Linear(hidden_units, hidden_units)
            self.hidden.append(layer)
        self.output = torch.nn.Linear(hidden_units, 1)
    def forward(self, x):
        x = self.input(x)
        x = F.softplus(x)
        for h in self.hidden:
            x = h(x)
            x = F.softplus(x)
        x = self.output(x)
        return x
....
def lossfn(x, y, dx, dy):
    # some loss function involving both sets of training data (y and dy)
    # the network outputs x and what's needed is an efficient way of calculating dx - the partial
    # derivatives of x wrt the batch inputs.
    pass
def train(net, x_train, y_train, dydx_train, batch_size=256):
    m, n = x_train.shape
    first = 0
    last = min(batch_size, m)
    while first < m:
        xi = x_train[first:last]
        yi = y_train[first:last]
        zi = dydx_train[first:last]
        xi.requires_grad_()
        # Perform forward pass
        outputs = net(xi)
        minimizer.zero_grad()
        outputs.backward(torch.ones_like(outputs), create_graph=True)
        xi_grad = xi.grad
        # Compute loss
        loss = lossfn(outputs, yi, xi_grad, zi)
        minimizer.zero_grad()
        # Perform backward pass
        loss.backward()
        # Perform optimization
        minimizer.step()
        first = last
        last = min(first + batch_size, m)
net = vanilla_net(4, 10, 4)
minimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
...
This seems to work, but is there a more elegant or more efficient way to achieve the same thing? Also, I am not sure where the best place to call minimizer.zero_grad() is.
Thanks
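A minimal sketch of one alternative, under stated assumptions: torch.autograd.grad with create_graph=True returns the input gradients directly, so nothing is accumulated into the parameters' .grad fields during the first pass and a single zero_grad() per step is enough. The small Sequential model, the random tensors, and the MSE-based loss below are placeholders, not the original vanilla_net or lossfn.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical small stand-in for vanilla_net.
net = torch.nn.Sequential(
    torch.nn.Linear(4, 10), torch.nn.Softplus(), torch.nn.Linear(10, 1)
)
minimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

xi = torch.randn(8, 4, requires_grad=True)  # batch inputs
yi = torch.randn(8, 1)                      # placeholder labels
zi = torch.randn(8, 4)                      # placeholder target derivatives

minimizer.zero_grad()
outputs = net(xi)
# Each output row depends only on its own input row, so differentiating the
# sum yields the per-sample gradients d outputs[k] / d xi[k] in one call.
(xi_grad,) = torch.autograd.grad(outputs.sum(), xi, create_graph=True)
# Placeholder loss combining values and derivatives.
loss = F.mse_loss(outputs, yi) + F.mse_loss(xi_grad, zi)
loss.backward()
minimizer.step()
```

Because create_graph=True keeps xi_grad attached to the graph, loss.backward() also propagates through the derivative term.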

Deep learning in partially defined parameter space

I have a deep learning problem that I intend to solve in Keras with a CNN. The task is 1D regression: I generate grayscale images from 3 parameters, namely 2 auxiliary parameters and the parameter the network should deduce (a temperature difference). Everything else in the image generation is random. Naturally, each parameter is sampled only from a limited range. The images visually represent the temperature difference.
The network has 2 inputs: a vector holding the 2 auxiliary image-generation parameters, and the image. The aim of training is to deduce the temperature difference from the supplied image.
My problem is that image generation is not always possible because of geometric constraints. There is a subregion of the 3 generation parameters where it fails. The red circles represent this in the figure below. Two axes of the figure are logarithmic, but as seen from the sample distribution, the parameters are picked from an exponential-like distribution.
Learning proves to be quite good in the lower left region of the parameters, but totally unusable at the other end.
My question is whether the poor model performance can be a result of the shape of the sampled parameter region, especially the failed region.
I forgot to mention that the images used are 256*256 8-bit grayscale. Training code:
def createMlp(aRepeatParameter:int):
    vectorSize = aRepeatParameter * 2
    inputs = Input(shape=(vectorSize,))
    x = inputs
    return Model(inputs, x)
def createCnn():
    filters=(64, 16, 4)
    inputShape = (256, 256, 1)
    chanDim = -1
    inputs = Input(shape=inputShape)
    x = inputs
    for (i, f) in enumerate(filters):
        x = Conv2D(f, (3, 3), padding="same")(x)
        x = LeakyReLU(alpha=0.3)(x)
        x = BatchNormalization(axis=chanDim)(x)
        x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Flatten()(x)
    x = Dense(128, activation=LeakyReLU(alpha=0.3))(x)
    x = Dense(16, activation=LeakyReLU(alpha=0.3))(x)
    x = BatchNormalization(axis=chanDim)(x)
    x = Dropout(0.5)(x)
    x = Dense(4)(x)
    x = LeakyReLU(alpha=0.3)(x)
    return Model(inputs, x)
repeatParameter:int = 2
mlp = createMlp(repeatParameter)
cnn = createCnn()
combinedInput = Concatenate(axis=1)([mlp.output, cnn.output])
x = Dense(4, activation=LeakyReLU(alpha=0.3))(combinedInput)
x = Dense(1, activation="linear")(x)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)
batchSize = 32
sampleSize = 96000
validationSize = 16000
port = 12345
trainingSteps = math.ceil(sampleSize / batchSize)
learningRate = ExponentialDecay(initial_learning_rate=0.001, decay_steps=trainingSteps, decay_rate=0.05)
opt = Adam(learning_rate=learningRate)
model.compile(loss="mean_squared_error", optimizer=opt, metrics=["mean_absolute_percentage_error"])
model.fit(landscapeGenerator.generate(batchSize, repeatParameter, port),
          validation_data=landscapeGenerator.generate(batchSize, repeatParameter, port),
          epochs=50, steps_per_epoch=trainingSteps, validation_steps=validationSize/batchSize)

How to create a custom loss that does not directly use the output of the network with pytorch

I would like to create a custom loss that does not directly use the output of my network. Indeed, I need to create a loss that returns the difference between the result of a function f(x) (where x is the output of my network) and max(f(x)). Unfortunately my code doesn't work and I don't know how to proceed... Here is my code:
def forward(self, x, y, hidden):
    c_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))
    y = torch.reshape(y, (y.shape[0], 1, 1))
    tmp = torch.cat((x, y), 2)
    output, (hn, cn) = self.lstm(tmp, (hidden, c_0))
    out = self.fc(output)
    return out, hn
def _train(self):
    num_epochs = 10
    num_iteration = 10
    save_loss_global = []
    save_loss_epoch = []
    for epoch in range(num_epochs):
        print("NOUVELLE EPOCH")
        X_train, Y_train = donneesAleatoires()
        self.maxRes = 0
        self.hidden = Variable(torch.zeros(self.num_layers, 1, self.hidden_size))
        tabY = torch.Tensor()
        tabY = torch.cat((tabY, Y_train), 1)
        for iteration in range(num_iteration):
            x_i = X_train[0]
            x_i = torch.reshape(x_i, (x_i.shape[0], 1, x_i.shape[1]))
            y_i = Y_train[0]
            outputs, self.hidden = self(x_i, y_i, self.hidden)
            YiPlus1 = self.function(outputs.detach().numpy().reshape(1, -1))
            self.optimizer.zero_grad()
            Yadd = Variable(torch.Tensor(YiPlus1))
            tabY = torch.cat((tabY, Yadd), 1)
            loss = self.my_loss(tabY, iteration)
            if YiPlus1 > self.maxRes:
                self.maxRes = YiPlus1
            if y_i.detach().numpy() > self.maxRes:
                self.maxRes = y_i.detach().numpy()
            #loss = Variable(loss, requires_grad=True)
            loss.backward(retain_graph=True)
            X_train = outputs
            Y_train = YiPlus1
            Y_train = Variable(torch.Tensor(Y_train))
            self.optimizer.step()
            save_loss_global.append(loss.item())
            if iteration == num_iteration -1:
                save_loss_epoch.append(loss.item())
        print(X_train)
def my_loss(self, target, epoch):
    if isinstance(target, np.ndarray):
        target = Variable(torch.Tensor(target))
    tmp = self.maxRes
    loss = target[0][0] - tmp
    if epoch > 0:
        for i in range(1, epoch + 1):
            loss = loss + (target[0][i] - tmp)
    loss = -loss
    return loss / (epoch+1)
To calculate gradients from the loss, the framework needs a computation graph. That graph is built implicitly during the forward pass, but for this to work every computation must use the framework's own tensors (no .numpy() calls) with gradients preserved (no .detach() calls). Try to rewrite your code accordingly. Don't worry about doing computations outside forward; that is normal.
You can check that your tensors are computed the right way by printing them. A tensor that is still attached to the graph shows a grad_fn, like this:
print( myTensor )
tensor([[-2.9016, -2.8739, ... ,-2.8929, -2.9033]], grad_fn=<AliasBackward0>)
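As an illustration of the advice above, here is a minimal sketch in which the function applied to the network output is written purely with torch operations, so the loss max(f(x)) - f(x) stays attached to the graph. The Linear layer and the quadratic f are hypothetical stand-ins; the original self.function is not shown in the question, so nothing here reproduces it.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 3)  # hypothetical stand-in for the LSTM + fc model

def f(x):
    # Hypothetical objective, written only with torch operations so that
    # autograd can trace it (no .numpy(), no .detach()).
    return -(x ** 2).sum(dim=-1)

x_in = torch.randn(4, 3)
outputs = layer(x_in)
values = f(outputs)                      # carries a grad_fn
loss = (values.max() - values).mean()    # gap between max(f(x)) and f(x)
loss.backward()                          # gradients reach layer.weight
```

If f cannot be expressed with torch operations (for example, an external simulator), autograd has nothing to differentiate through, and a different approach would be needed.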

Computing the Hessian of a Simple NN in PyTorch wrt to Parameters

I am relatively new to PyTorch and am trying to compute the Hessian of a very simple feedforward network with respect to its weights. I am trying to get torch.autograd.functional.hessian to work. I have been digging through the forums, but since this function was added to PyTorch relatively recently, I have not been able to find much information on it. Here is my simple network architecture, taken from some sample code on Kaggle for MNIST.
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.l1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.l3 = nn.Linear(hidden_size, output_size)
    def forward(self, x):
        x = self.l1(x)
        x = self.relu(x)
        x = self.l3(x)
        return F.log_softmax(x, dim = 1)
net = Network()
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9)
loss_func = nn.CrossEntropyLoss()
and I am running the NN for a bunch of epochs like:
for e in range(epochs):
    for i in range(0, x.shape[0], batch_size):
        x_mini = x[i:i + batch_size]
        y_mini = y[i:i + batch_size]
        x_var = Variable(x_mini)
        y_var = Variable(y_mini)
        optimizer.zero_grad()
        net_out = net(x_var)
        loss = loss_func(net_out, y_var)
        loss.backward()
        optimizer.step()
        if i % 100 == 0:
            loss_log.append(loss.data)
Then, I add all the parameters to a list and make a tensor out of it as below:
param_list = []
for param in net.parameters():
    param_list.append(param.view(-1))
param_list = torch.cat(param_list)
Finally, I am trying to compute the Hessian of the converged network by running:
hessian = torch.autograd.functional.hessian(loss_func, param_list,create_graph=True)
but it gives me this error:
TypeError: forward() missing 1 required positional argument: 'target'
Any help would be appreciated.
Computing the hessian with regard to the parameters of a model (as opposed to the inputs to the model) isn't really well-supported right now. There's some work being done on this at https://github.com/pytorch/pytorch/issues/49171 , but for the moment it's very inconvenient.
Your code has a few other problems -- where you're passing loss_func, you should be passing a function that constructs the computation graph. Also, you never specify the input to the network or the target for the loss function.
Here's some code that cheats a little bit to use the existing functional interface to compute the hessian of the model weights, and concatenates everything together to give the same form as what you were trying to do:
# Pick a random input to the network (assumes input_size == 2)
src = torch.rand(1, 2)
# Say our target for our loss is all ones
dst = torch.ones(1, dtype=torch.long)
keys = list(net.state_dict().keys())
parameters = list(net.parameters())
sizes = [x.view(-1).shape[0] for x in parameters]
ndims = sum(sizes)
def hessian_hack(*params):
    # Swap each registered parameter for the matching input tensor,
    # so that the loss becomes a function of params.
    for i in range(len(keys)):
        path = keys[i].split('.')
        cur = net
        for f in range(0, len(path)-1):
            # Walk down from the current module (not from net) so nested modules work.
            cur = cur.__getattr__(path[f])
        cur.__delattr__(path[-1])
        cur.__setattr__(path[-1], params[i])
    return loss_func(net(src), dst)
# sub_hessians[i][f] is the hessian of parameter i vs parameter f
sub_hessians = torch.autograd.functional.hessian(
    hessian_hack,
    tuple(parameters),
    create_graph=True)
# We can combine them all into a nice big hessian.
hessian = torch.cat([
    torch.cat([
        sub_hessians[i][f].reshape(sizes[i], sizes[f])
        for f in range(len(sub_hessians[i]))
    ], dim=1)
    for i in range(len(sub_hessians))
], dim=0)
print(hessian)
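A possible alternative to mutating the module's attributes, sketched under the assumption that a PyTorch version providing torch.func.functional_call (2.0 or later) is available: the loss is written as a function of one flat parameter vector, and torch.autograd.functional.hessian is applied to that function. The tiny Sequential network, src, and dst are placeholders, not the network from the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # assumption: PyTorch >= 2.0

torch.manual_seed(0)
# Hypothetical small network; Tanh keeps second derivatives non-trivial.
net = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 2))
src = torch.rand(1, 2)                 # placeholder input
dst = torch.ones(1, dtype=torch.long)  # placeholder target class

names = [n for n, _ in net.named_parameters()]
shapes = [p.shape for p in net.parameters()]
sizes = [p.numel() for p in net.parameters()]
flat0 = torch.cat([p.detach().reshape(-1) for p in net.parameters()])

def loss_from_flat(flat):
    # Rebuild the named parameter tensors from the flat vector and run the
    # module with them, leaving net's registered parameters untouched.
    chunks = torch.split(flat, sizes)
    params = {n: c.reshape(s) for n, c, s in zip(names, chunks, shapes)}
    logits = functional_call(net, params, (src,))
    return F.cross_entropy(logits, dst)

# Full Hessian with respect to all weights, shape (sum(sizes), sum(sizes)).
hessian = torch.autograd.functional.hessian(loss_from_flat, flat0)
```

The result is already a single square matrix, so no block concatenation is needed. The cost still grows quadratically with the number of parameters, so this is only practical for small networks.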

Using autograd to compute Jacobian matrix of outputs with respect to inputs

I apologize if this question is obvious or trivial. I am very new to pytorch and I am trying to understand the autograd.grad function in pytorch. I have a neural network G that takes in inputs (x,t) and outputs (u,v). Here is the code for G:
class GeneratorNet(torch.nn.Module):
    """
    A three hidden-layer generative neural network
    """
    def __init__(self):
        super(GeneratorNet, self).__init__()
        self.hidden0 = nn.Sequential(
            nn.Linear(2, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden1 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.out = nn.Sequential(
            nn.Linear(100, 2),
            nn.Tanh()
        )
    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x
Or simply G(x,t) = (u(x,t), v(x,t)) where u(x,t) and v(x,t) are scalar valued. Goal: Compute $\frac{\partial u(x,t)}{\partial x}$ and $\frac{\partial u(x,t)}{\partial t}$. At every training step, I have a minibatch of size $100$ so u(x,t) is a [100,1] tensor. Here is my attempt to compute the partial derivatives, where coords is the input (x,t) and just like below I added the requires_grad_(True) flag to the coords as well:
tensor = GeneratorNet(coords)
tensor.requires_grad_(True)
u, v = torch.split(tensor, 1, dim=1)
du = autograd.grad(u, coords, grad_outputs=torch.ones_like(u), create_graph=True,
retain_graph=True, only_inputs=True, allow_unused=True)[0]
du is now a [100,2] tensor.
Question: Is this the tensor of the partials for the 100 input points of the minibatch?
There are similar questions like computing derivatives of the output with respect to inputs but I could not really figure out what's going on. I apologize once again if this is already answered or trivial. Thank you very much.
The code you posted should give you the partial derivative of your first output w.r.t. the input. However, you also have to set requires_grad_(True) on the inputs, as otherwise PyTorch does not build up the computation graph starting at the input and thus it cannot compute the gradient for them.
This version of your code example computes du and dv:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
# retain_graph=True keeps the graph alive for the second grad call below
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u), retain_graph=True)[0]
dv = torch.autograd.grad(v, coords, grad_outputs=torch.ones_like(v))[0]
You can also compute the partial derivative for a single output:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du_0 = torch.autograd.grad(u[0], coords)[0]
where du_0[0] == du[0]. The remaining rows of du_0 are zero, because u[0] depends only on coords[0].
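To see that the batched call returns the per-sample partials for every row, and not only for the first one, the following sketch compares it against a loop of single-output gradients. The small Sequential network is a hypothetical stand-in for GeneratorNet, and the batch size of 5 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for GeneratorNet: maps (x, t) to (u, v).
net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))
coords = torch.randn(5, 2, requires_grad=True)

out = net(coords)
u = out[:, 0:1]

# Batched version: one backward pass with a vector of ones.
du = torch.autograd.grad(
    u, coords, grad_outputs=torch.ones_like(u), retain_graph=True
)[0]

# Reference: differentiate each u[k] separately and keep only row k,
# since u[k] depends only on coords[k].
rows = []
for k in range(coords.shape[0]):
    g = torch.autograd.grad(u[k, 0], coords, retain_graph=True)[0]
    rows.append(g[k])
reference = torch.stack(rows)
```

Row k of du holds (du/dx, du/dt) evaluated at sample k. This holds because the network treats the samples independently; layers that mix samples across the batch (batch normalization in training mode, for example) would break that assumption.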
