PyTorch gradient computation of a multidimensional tensor

Is there an easier way of computing gradients of a matrix in PyTorch?
import torch
from torch.autograd import grad

x = torch.linspace(0, 1, 50, requires_grad=True).view(-1, 1)
inputs = torch.hstack([torch.cos(w * x) for w in range(5)])          # shape [50, 5]
d_inputs = torch.hstack([grad(f.sum(), x, create_graph=True, retain_graph=True)[0]
                         for f in inputs.T])                         # one grad call per column
Thank you.
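One possibly simpler route, assuming a recent PyTorch with torch.func available, is to take a per-sample Jacobian with jacrev and vmap it over the rows of x, since each row of inputs depends only on the matching row of x. A minimal sketch (the helper name features is illustrative):
from torch.func import vmap, jacrev

def features(xi):                                  # xi: one sample of x, shape [1]
    return torch.cat([torch.cos(w * xi) for w in range(5)])

d_inputs = vmap(jacrev(features))(x).squeeze(-1)   # shape [50, 5], same values as above
This removes the explicit Python loop over the columns.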

Related

Calculating the Hessian Vector Product of a Flax NN output w.r.t. the inputs

I am trying to get the second derivative of the output w.r.t. the input of a neural network built using Flax. The network is structured as follows:
from typing import Sequence

import numpy as np
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

class MLP(nn.Module):
    features: Sequence[int]

    @nn.compact
    def __call__(self, x):
        for feat in self.features[:-1]:
            x = nn.tanh(nn.Dense(feat)(x))
        x = nn.Dense(self.features[-1])(x)
        return x

model = MLP([20, 20, 20, 20, 20, 1])
batch = jnp.ones((32, 3))  # dummy input to initialize the network
params = model.init(jax.random.PRNGKey(0), batch)
X = jnp.ones((32, 3))
output = model.apply(params, X)
I can get the first derivative by using vmap over grad:
from jax import grad, jit, vmap

@jit
def u_function(params, X):
    u = model.apply(params, X)
    return jnp.squeeze(u)

grad_fn = vmap(grad(u_function, argnums=1), in_axes=(None, 0), out_axes=0)
u_X = grad_fn(params, X)
However, when I try to do this again to obtain the second derivative:
u_X_func = vmap(grad(u_function, argnums=1), in_axes=(None, 0), out_axes=(0))
u_XX_func = vmap(grad(u_X_func, argnums=1), in_axes=(None, 0), out_axes=(0))(params, X)
I get the following error:
/usr/local/lib/python3.7/dist-packages/flax/linen/linear.py in __call__(self, inputs)
    186     kernel = self.param('kernel',
    187                         self.kernel_init,
--> 188                         (jnp.shape(inputs)[-1], self.features),
    189                         self.param_dtype)
    190     if self.use_bias:
IndexError: tuple index out of range
I tried using the hvp definition from the autodiff cookbook, but with params being an input to the function I wasn't sure how to proceed.
Any help on this would be really appreciated.
The issue is that your u_function maps a length-3 vector to a scalar. The first derivative of this is a length-3 vector, but the second derivative is a 3×3 Hessian matrix, which cannot be computed via jax.grad because jax.grad is designed only for scalar-output functions. Fortunately, JAX provides the jax.hessian transform to compute these general second derivatives:
from jax import hessian

u_XX = vmap(hessian(u_function, argnums=1), in_axes=(None, 0), out_axes=0)(params, X)
print(u_XX.shape)
# (32, 3, 3)
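If only a Hessian-vector product is needed, as in the title, the forward-over-reverse recipe from the autodiff cookbook can also be adapted by closing over params. A rough sketch, where V is a hypothetical batch of tangent vectors with the same shape as X:
def hvp(f, x, v):
    # forward-over-reverse Hessian-vector product, in the style of the autodiff cookbook
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

V = jnp.ones_like(X)  # hypothetical tangent vectors, shape (32, 3)
u_hvp = jax.vmap(lambda xi, vi: hvp(lambda z: u_function(params, z), xi, vi),
                 in_axes=(0, 0))(X, V)   # shape (32, 3)
This avoids materializing the full per-sample Hessian when only its action on a vector is required.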

Output of the model depends on the shape of the weights tensor

I want to train the model to sum its three inputs, so it is as simple as possible.
First, the weights are initialized randomly, which produces a bad error (approx. 0.5).
Then I initialize the weights with zeros. There are two options:
the shape of the weights tensor is [1, 3]
the shape of the weights tensor is [3]
When I choose the 1st option, the model still performs badly and can't learn this simple formula.
When I choose the 2nd option, it works perfectly, with an error of 10e-12.
Why does the result depend on the shape of the weights? Why do I need to initialize the model with zeros to solve this simple problem?
import torch
from torch.nn import Sequential as Seq, Linear as Lin
from torch.optim.lr_scheduler import ReduceLROnPlateau
X = torch.rand((1024, 3))
y = (X[:,0] + X[:,1] + X[:,2])
m = Seq(Lin(3, 1, bias=False))
# 1 option
m[0].weight = torch.nn.parameter.Parameter(torch.tensor([[0, 0, 0]], dtype=torch.float))
# 2 option
#m[0].weight = torch.nn.parameter.Parameter(torch.tensor([0, 0, 0], dtype=torch.float))
optim = torch.optim.SGD(m.parameters(), lr=10e-2)
scheduler = ReduceLROnPlateau(optim, 'min', factor=0.5, patience=20, verbose=True)
mse = torch.nn.MSELoss()
for epoch in range(500):
    optim.zero_grad()
    out = m(X)
    loss = mse(out, y)
    loss.backward()
    optim.step()
    if epoch % 20 == 0:
        print(loss.item())
    scheduler.step(loss)
The first option doesn't learn because it runs into broadcasting: out.shape is (1024, 1) while the corresponding target y has shape (1024,). MSELoss then computes the mean of the tensor (out - y)^2, which due to broadcasting has shape (1024, 1024), clearly the wrong objective for this task. With the second option, the tensor (out - y)^2 has shape (1024,) and its mean corresponds to the actual MSE. The default approach, without explicitly changing the weight shape (as in options 1 and 2), would work if the target were given shape (1024, 1), for example via y = y.unsqueeze(-1) right after y is defined.
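A minimal sketch of that last fix, keeping the default (1, 3) weight shape and only reshaping the target:
X = torch.rand((1024, 3))
y = (X[:, 0] + X[:, 1] + X[:, 2]).unsqueeze(-1)   # shape (1024, 1), matching the model output
m = Seq(Lin(3, 1, bias=False))                    # default weight shape (1, 3) now works as-is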

I want to use Conv1D and MaxPool1D in PyTorch on the third dimension of a 3-d tensor

For example, given a 3-d tensor, I want to run the conv1d calculation on its third dimension:
import torch
import torch.nn as nn
x = torch.rand(4,5,6)
conv1d =nn.Conv1d(in_channels=1,out_channels=2,kernel_size=5,stride=3,padding=0)
y = conv1d(x)
I expect the shape of y to be (4, 5, 2, -1), but I get an error:
Given groups=1, weight of size [2, 1, 5], expected input[4, 5, 6] to have 1 channels, but got 5 channels instead
Then I modified the code,
import torch
import torch.nn as nn
x = torch.rand(4,5,6)
conv1d =nn.Conv1d(in_channels=1,out_channels=2,kernel_size=5,stride=3,padding=0)
x = x.unsqueeze(2)
y = conv1d(x)
There is another error:
Expected 3-dimensional input for 3-dimensional weight [2, 1, 5], but got 4-dimensional input of size [4, 5, 1, 6] instead
And if I then want to run a maxpool1d calculation on a tensor of shape (4, 5, 2, -1), over its last two dimensions, what should I do?
I have been searching the net for a long time, but with no luck. Please help or give some ideas on how to achieve this. Thank you all for your help.
I made an attempt below, but I feel it doesn't meet the actual need; I would like to know whether this is good practice and what the best way to do it would be.
import torch
import torch.nn as nn
x = torch.rand(4,5,6)
conv1d =nn.Conv1d(in_channels=1,out_channels=2,kernel_size=2,stride=3,padding=0)
x = x.unsqueeze(2)
for i in range(4):
    y = conv1d(x[i, :, :, :])   # x[i] has shape (5, 1, 6): batch=5, channels=1, length=6
    y = y.unsqueeze(0)
    if i == 0:
        z = y
    else:
        z = torch.cat((z, y), 0)
    print(y)
print(z.size())
To use Conv1d you need your input to have 3 dimensions:
[batch_size, in_channels, data_dimension]
So, this would work:
x = torch.rand(4, 1, 50) # [batch_size=4, in_channels=1, data_dimension=50]
conv1d = nn.Conv1d(in_channels=1,out_channels=2,kernel_size=2,stride=3,padding=0)
x = conv1d(x)
print(x.shape) # Will output [4, 2, 16] 4=batch_size, 2=channels, 16=data_dimension
You can use MaxPool1d in the same way:
maxpool1d = nn.MaxPool1d(5)
x = maxpool1d(x)
print(x.shape) # Will output [4, 2, 3] 4=batch_size, 2=channels, 3=data_dimension
I solved this problem with torch.reshape(). I put the code here, hoping it helps somebody.
import torch
import torch.nn as nn
x = torch.rand(4, 5, 6)
conv1d = nn.Conv1d(in_channels=1, out_channels=2, kernel_size=2, stride=3, padding=0)
y = x.reshape(x.shape[0] * x.shape[1], -1)    # (20, 6): merge the first two dimensions
y = y.unsqueeze(1)                            # (20, 1, 6): add the channel dimension
y = conv1d(y)                                 # (20, 2, 2)
z = y.reshape(x.shape[0], x.shape[1], 2, -1)  # (4, 5, 2, 2)
print(z.size())
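As for the follow-up about max-pooling the resulting (4, 5, 2, -1) tensor over its last two dimensions, one possibility (a sketch; the right pooling window depends on the actual need) is nn.MaxPool2d, which pools a 4-D tensor over its last two dimensions:
pool2d = nn.MaxPool2d(kernel_size=(2, 2))
p = pool2d(z)     # dim 0 is treated as batch, dim 1 as channels: (4, 5, 2, 2) -> (4, 5, 1, 1)
print(p.size())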

Using autograd to compute Jacobian matrix of outputs with respect to inputs

I apologize if this question is obvious or trivial. I am very new to PyTorch and I am trying to understand the autograd.grad function. I have a neural network G that takes inputs (x, t) and outputs (u, v). Here is the code for G:
class GeneratorNet(torch.nn.Module):
    """
    A three hidden-layer generative neural network
    """
    def __init__(self):
        super(GeneratorNet, self).__init__()
        self.hidden0 = nn.Sequential(
            nn.Linear(2, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden1 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.out = nn.Sequential(
            nn.Linear(100, 2),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x
Or simply G(x, t) = (u(x, t), v(x, t)), where u(x, t) and v(x, t) are scalar-valued. Goal: compute $\frac{\partial u(x,t)}{\partial x}$ and $\frac{\partial u(x,t)}{\partial t}$. At every training step I have a minibatch of size $100$, so u(x, t) is a [100, 1] tensor. Here is my attempt to compute the partial derivatives, where coords is the input (x, t) and, just as in the answer below, I added the requires_grad_(True) flag to the coords as well:
G = GeneratorNet()
tensor = G(coords)
tensor.requires_grad_(True)
u, v = torch.split(tensor, 1, dim=1)
du = autograd.grad(u, coords, grad_outputs=torch.ones_like(u), create_graph=True,
                   retain_graph=True, only_inputs=True, allow_unused=True)[0]
du is now a [100,2] tensor.
Question: Is this the tensor of the partials for the 100 input points of the minibatch?
There are similar questions, like computing derivatives of the output with respect to inputs, but I could not really figure out what's going on. I apologize once again if this is already answered or trivial. Thank you very much.
The code you posted should give you the partial derivative of your first output w.r.t. the input. However, you also have to set requires_grad_(True) on the inputs, as otherwise PyTorch does not build up the computation graph starting at the input and thus it cannot compute the gradient for them.
This version of your code example computes du and dv:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u))[0]
dv = torch.autograd.grad(v, coords, grad_outputs=torch.ones_like(v))[0]
You can also compute the partial derivative for a single output:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du_0 = torch.autograd.grad(u[0], coords)[0]
where du_0 == du[0].
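If the goal is the full Jacobian of both outputs (u, v) with respect to both inputs (x, t), a possible alternative, assuming a recent PyTorch with torch.func available, is to vmap a per-sample reverse-mode Jacobian; a sketch:
from torch.func import vmap, jacrev

net = GeneratorNet()
coords = torch.randn(10, 2)
J = vmap(jacrev(net))(coords)   # shape [10, 2, 2]; J[i, j, k] = d output_j / d input_k for sample i
du = J[:, 0, :]                 # partials of u w.r.t. (x, t), matching du above
dv = J[:, 1, :]                 # partials of v w.r.t. (x, t), matching dv above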

Translating LSTM model from Keras to Pytorch

I am having a hard time translating a quite simple LSTM model from Keras to PyTorch. X (get it here) corresponds to 1152 samples of 90 timesteps, each timestep having only 1 dimension. y (here) is a single prediction at t = 91 for all 1152 samples.
In Keras:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
import numpy as np
import pandas as pd
X = pd.read_csv('X.csv', header = None).values
X.shape
y = pd.read_csv('y.csv', header = None).values
y.shape
# From Keras documentation [https://keras.io/layers/recurrent/]:
# Input shape 3D tensor with shape (batch_size, timesteps, input_dim).
X = np.reshape(X, (1152, 90, 1))
regressor = Sequential()
regressor.add(LSTM(units = 100, return_sequences = True, input_shape = (90, 1)))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.3))
regressor.add(Dense(units = 1, activation = 'linear'))
regressor.compile(optimizer = 'rmsprop', loss = 'mean_squared_error', metrics = ['mean_absolute_error'])
regressor.fit(X, y, epochs = 10, batch_size = 32)
... leads me to:
# Epoch 10/10
# 1152/1152 [==============================] - 33s 29ms/sample - loss: 0.0068 - mean_absolute_error: 0.0628
Then in Pytorch:
import torch
from torch import nn, optim
import pandas as pd
from sklearn.metrics import mean_absolute_error

X = pd.read_csv('X.csv', header = None).values
y = pd.read_csv('y.csv', header = None).values
X = torch.tensor(X, dtype = torch.float32)
y = torch.tensor(y, dtype = torch.float32)
dataset = torch.utils.data.TensorDataset(X, y)
loader = torch.utils.data.DataLoader(dataset, batch_size = 32, shuffle = True)
class regressor_LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size = 1, hidden_size = 100)
        self.lstm2 = nn.LSTM(100, 50)
        self.lstm3 = nn.LSTM(50, 50, dropout = 0.3, num_layers = 2)
        self.dropout = nn.Dropout(p = 0.3)
        self.linear = nn.Linear(in_features = 50, out_features = 1)

    def forward(self, X):
        # From the PyTorch documentation [https://pytorch.org/docs/stable/_modules/torch/nn/modules/rnn.html]:
        # **input** of shape `(seq_len, batch, input_size)`
        X = X.view(90, 32, 1)
        # I am discarding hidden/cell states since in Keras I am using a stateless approach
        # [https://keras.io/examples/lstm_stateful/]
        X, _ = self.lstm1(X)
        X = self.dropout(X)
        X, _ = self.lstm2(X)
        X = self.dropout(X)
        X, _ = self.lstm3(X)
        X = self.dropout(X)
        X = self.linear(X)
        return X
regressor = regressor_LSTM()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(regressor.parameters())
for epoch in range(10):
    running_loss = 0.
    running_mae = 0.
    for i, data in enumerate(loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = regressor(inputs)
        outputs = outputs[-1].view(*labels.shape)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        mae = mean_absolute_error(labels.detach().cpu().numpy().flatten(), outputs.detach().cpu().numpy().flatten())
        running_mae += mae
    print('EPOCH %3d: loss %.5f - MAE %.5f' % (epoch+1, running_loss/len(loader), running_mae/len(loader)))
... leads me to:
# EPOCH 10: loss 0.04220 - MAE 0.16762
You can see that both the loss and the MAE are quite different (PyTorch's are much higher). And if I use the PyTorch model to predict values, they all come out as a constant.
What am I doing wrong?
Oh, I believe I made considerable progress. It seems that the way to represent y differs between Keras and PyTorch. In Keras, we should pass it as a single value representing one timestep in the future (at least for the problem I am trying to solve). But in PyTorch, y must be X shifted one timestep into the future. It is like this:
time_series = [0, 1, 2, 3, 4, 5]
X = [0, 1, 2, 3, 4]
# Keras:
y = [5]
# Pytorch:
y = [1, 2, 3, 4, 5]
This way, PyTorch compares all values in the time slice when calculating the loss. I believe Keras rearranges the data under the hood to conform to this approach, as the code works when fed the variables just like that. But in PyTorch I was estimating the loss based on only one value (the one I was trying to predict), not the whole series, so I believe it could not correctly capture the time dependency.
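A rough sketch of that reshaping, under the assumption that X is the raw (1152, 90) array read from X.csv and y the (1152, 1) array read from y.csv, before converting to tensors (the names are illustrative):
import numpy as np

seq = np.concatenate([X, y], axis = 1)   # (1152, 91): the 90 inputs plus the t = 91 value
X_shifted = seq[:, :-1]                  # timesteps 1..90, as before
y_shifted = seq[:, 1:]                   # timesteps 2..91, i.e. X shifted one step into the future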
Taking this into consideration, I got to:
EPOCH 100: loss 0.00551 - MAE 0.058435
And, most importantly, comparing true and predicted values on a separate dataset showed that the patterns were clearly captured by the model.
Hooray!
