How does PyTorch build the computation graph?

Here is example PyTorch code from the website:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
In the forward function, we simply apply a series of transformations to x, but never explicitly define which objects are part of that transformation. Yet when computing the gradient and updating the weights, PyTorch 'magically' knows which weights to update and how the gradient should be calculated.
How does this process work? Is there code analysis going on, or something else that I am missing?

Yes, the graph is recorded implicitly during the forward pass. If you examine the result tensor, you will see an attribute such as grad_fn=<CatBackward>. That is a link to the operation that produced the tensor, and following these links lets you unroll the whole computation graph. The graph is built while the forward computation actually runs, no matter how you defined your network module: the object-oriented 'nn' way or the 'functional' way.
You can use this graph to analyze a network, as torchviz does here: https://github.com/szagoruyko/pytorchviz/blob/master/torchviz/dot.py
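As a rough illustration of the grad_fn link mentioned above (a minimal sketch, assuming a recent PyTorch install; the tensors and the walk helper are made up for this example), every tracked result carries a grad_fn node, and its next_functions attribute points at the nodes that produced its inputs:

```python
import torch

# Leaf tensor that autograd should track.
w = torch.tensor([2.0, 3.0], requires_grad=True)
# Plain data tensor, not tracked.
x = torch.tensor([1.0, 4.0])

# Every operation on a tracked tensor records a backward node.
y = (w * x).sum()


def walk(node, depth=0):
    """Return (depth, node class name) pairs for the graph reachable from node."""
    if node is None:  # inputs that do not require grad have no node
        return []
    entries = [(depth, type(node).__name__)]
    for parent, _ in node.next_functions:
        entries += walk(parent, depth + 1)
    return entries


graph = walk(y.grad_fn)
for depth, name in graph:
    print("  " * depth + name)

# backward() traverses the same nodes to fill in w.grad.
y.backward()
print(w.grad)
```

The printed names depend on the PyTorch version, but they typically show one backward node per operation, ending in a node that accumulates the gradient into w.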

Related

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x32 and 400x120)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # (input channels, output channels, kernel size)
        # a channel is a dimension of a tensor, which is a container that can house data in N dimensions (matrices)
        self.conv1 = nn.Conv2d(3, 6, 5)
        # shrink the image stack by pooling (kernel size, stride (shift)) and take the max value per window
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # TODO: add conv3
        self.conv3 = nn.Conv2d(16, 32, 5)
        # drop layer deletes 20% of the features to help prevent overfitting
        self.drop = nn.Dropout2d(p=0.2)
        # linear predicts the output as a linear function of inputs
        # (output channels, height, width, batch size)
        # TODO:
        self.fc1 = nn.Linear(16 * 16 * 5, 120)
        # TODO:
        self.fc1_5 = nn.Linear()
        # layer(size of input, size of output)
        # Linear layer = fully connected layer
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # F.relu changes negative values to 0. Apply to the whole stack of images.
        # These are activation functions. We apply one after each linear layer.
        # Only used in hidden layers.
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # Select some features to drop after the 3rd conv to prevent overfitting
        x = self.drop(F.relu(self.conv3(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch into 1-D
        x = F.relu(self.fc1(x))
        # TODO: add fc1_5
        x = F.relu(self.fc1_5(x))
        x = F.relu(self.fc2(x))
        # Feed to the fully connected layer to predict the class
        x = self.fc3(x)  # no relu because it's the last layer
        return x
I am using images from CIFAR10, which are of size 3x32x32.
When I ran the code before, it stopped because the size of the self.fc1 linear layer did not match the output of the self.conv3 layer I added.
I'm also not sure what to write for self.fc1_5.
Can someone explain how this actually works, and what the solution is?
Thank you!
I have added an extra convolutional layer, and you can see it is
    self.conv3 = nn.Conv2d(16, 32, 5).
The lines under the TODO comments are where I'm stuck.
I updated the line to:
    self.fc1 = nn.Linear(16 * 16 * 5, 120)
Before, it was:
    self.fc1 = nn.Linear(16 * 5 * 5, 120).
When you create a CNN for classification with a fixed input size, it's easy to figure out the size of your image by the time it has progressed through your CNN layers. Since we start with images of size [32,32] (channels are unimportant for now):
    def __init__(self):
        super().__init__()
        # (input channels, output channels, kernel size)
        # a channel is a dimension of a tensor, which is a container that can house data in N dimensions (matrices)
        self.conv1 = nn.Conv2d(3, 6, 5)  # size 28x28 - lose 2 px from each side with a kernel of size 5
        # shrink the image stack by pooling (kernel size, stride (shift)) and take the max value per window
        self.pool = nn.MaxPool2d(2, 2)  # size 14x14 - max pooling with K=2 halves the image size
        self.conv2 = nn.Conv2d(6, 16, 5)  # size 10x10 -> 5x5 after pooling
        # TODO: add conv3
        self.conv3 = nn.Conv2d(16, 32, 5)  # size 1x1
        # drop layer deletes 20% of the features to help prevent overfitting
        self.drop = nn.Dropout2d(p=0.2)
        # linear predicts the output as a linear function of inputs
        # (output channels, height, width, batch size)
        self.fc1 = nn.Linear(1 * 1 * 32, 120)
        self.fc1_5 = nn.Linear(120, 120)  # matches the output size of fc1 and input size of fc2
These size losses can be avoided by using padding of (K-1)//2, where K=kernel_size (this holds for an odd K with stride 1).
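The size bookkeeping above can be reproduced with a few lines of plain Python. This is only a sketch: conv_out is a helper name made up here, and it uses the usual output-size formula for convolution and pooling layers with dilation 1.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                  # CIFAR10 images are 32x32
size = conv_out(size, kernel=5)            # conv1 -> 28
size = conv_out(size, kernel=2, stride=2)  # pool  -> 14
size = conv_out(size, kernel=5)            # conv2 -> 10
size = conv_out(size, kernel=2, stride=2)  # pool  -> 5
size = conv_out(size, kernel=5)            # conv3 -> 1 (no pooling after conv3)
flat_features = 32 * size * size           # conv3 has 32 output channels
print(size, flat_features)                 # prints: 1 32

# With padding (K-1)//2, a stride-1 convolution keeps the spatial size.
same = conv_out(32, kernel=5, padding=(5 - 1) // 2)
print(same)                                # prints: 32
```

The value of flat_features is what the first argument of fc1 has to be, which matches the 32 in the error message "4x32 and 400x120".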

Pytorch - vanilla Gradient Visualization

I trained a neural network on MNIST using PyTorch:
class MnistCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, stride=1, padding=2)
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        self.conv2 = nn.Conv2d(16, 32, 3, stride=1, padding=2)
        self.pool2 = nn.MaxPool2d(kernel_size=2)
        self.dropout = nn.Dropout(0.5)
        self.lin = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        # conv block
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        # conv block
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        # dense block
        x = x.view(x.size(0), -1)
        x = self.dropout(x)
        return self.lin(x)
I would like to implement vanilla Gradient Visualization (see reference below) on my model.
Simonyan, K., Vedaldi, A., Zisserman, A.
Deep inside convolutional networks: Visualising image classification models and saliency maps.
arXiv preprint arXiv:1312.6034 (2013)
Question: How can I implement this method in PyTorch?
If I understand correctly, vanilla gradient visualization consists of computing the partial derivatives of my model's loss w.r.t. all the pixels in my input image. In short, I need to tweak my self.conv1 layer so that it computes the gradient over its input pixels instead of the gradient over its weights.
Please correct me if I'm wrong.
You do not need to change anything about your conv layer. Each layer computes gradients both w.r.t. its parameters (for updates) and w.r.t. its inputs (for "downstream" gradients by the chain rule). Therefore, all you need to do is set requires_grad to True on your input image x:
x, y = ... # get one image from MNIST
x.requires_grad_(True) # indicate to pytorch that you would like to look at these gradients
pred = model(x)
loss = criterion(pred, y)
loss.backward() # propagate gradients
x.grad # <- here you should have the gradients of the loss w.r.t pixels
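For a version that runs end to end, here is a minimal sketch. The tiny model, the random image and the label are stand-ins invented for this example (not the MnistCNN or real MNIST data), and taking the absolute value with a max over channels is one common way to turn the gradient into a map, not the only one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in classifier (hypothetical, not the MnistCNN from the question).
model = nn.Sequential(
    nn.Conv2d(1, 4, 3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 28 * 28, 10),
)
model.eval()  # matters for models with dropout or batch norm

x = torch.rand(1, 1, 28, 28)  # random stand-in for one MNIST image
y = torch.tensor([3])         # arbitrary stand-in label

x.requires_grad_(True)        # ask autograd to track gradients w.r.t. pixels
loss = F.cross_entropy(model(x), y)
loss.backward()

# One value per pixel: gradient magnitude, max over the channel dimension.
saliency = x.grad.abs().max(dim=1)[0]
print(saliency.shape)         # torch.Size([1, 28, 28])
```

The resulting tensor can be passed to an image plotting function to view which pixels the loss is most sensitive to.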

Questions about programming a cnn with PyTorch

I'm pretty new to programming CNNs, so I'm a little lost. I'm trying to do this part of the code, where I am asked to implement a fully-connected network to classify the digits. It should contain 1 hidden layer with 20 units. I should use the ReLU activation function on the hidden layer.
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = ...
        self.fc2 = nn.Sequential(
            nn.Linear(500, 10),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        x = self.fc2(x)
        return x
The dots are the part to fill in. I was thinking of this line:
    self.fc1 = nn.Linear(20, 500)
But I don't know if it's correct. Could someone help me, please? Also, I don't understand at all what the Softmax function does, so if someone knows, please explain.
Thank you so much!!
P.S. This is the code to load the data:
batch_size = 64
trainset = datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=1)
testset = datasets.MNIST('./data', train=False, download=True, transform=transforms.ToTensor())
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=1)
From the code given for the model, it can be seen that the hidden layer has 500 units, so I am assuming you meant 20 units for the input. With this assumption, the code should be:
    self.fc1 = nn.Sequential(
        nn.Linear(20, 500),
        nn.ReLU()
    )
Coming to the next part of your question: given that you are working with the MNIST dataset and you have the softmax function, I am assuming you are trying to predict the digit present in each image.
Your neural network performs various multiplication and addition operations in each layer, and you finally end up with 10 numbers in the output layer. Now you have to make sense of these 10 numbers to decide which of the 10 digits is shown in the image.
One way to do this would be to select the unit which has the maximum value. For example, if the 10th unit has the maximum value among all units, then we conclude that the digit is '9'. If the 2nd unit has the maximum value, then we conclude that the digit is '1'.
This is fine, but a better way would be to convert the value of each unit to the probability that the corresponding digit is contained in the image, and then choose the digit with the highest probability. This has certain mathematical advantages which help us define a better loss function.
Softmax is what helps us to convert the values to probabilities. On applying softmax, all the values lie in the range (0, 1) and they sum up to 1.
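To make that concrete, here is a small plain-Python sketch of softmax. The ten scores are made up for illustration, and the subtraction of the maximum is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Map raw scores to values in (0, 1) that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Ten made-up output values, one per digit 0-9 (illustrative only).
scores = [1.2, -0.3, 0.5, 2.0, 0.1, -1.0, 0.0, 0.7, -0.5, 3.1]
probs = softmax(scores)

predicted_digit = probs.index(max(probs))
print(predicted_digit)  # prints: 9
```

Softmax keeps the ordering of the scores, so the predicted digit is the same one you would get by picking the largest raw value; the benefit is that the outputs can be read as probabilities when defining the loss.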
If you are interested in deep learning and the math behind it, I would suggest you check out Andrew Ng's course on deep learning.
You did not mention the shape of your data so I'll be assuming the expected shape returned by datasets.MNIST.
Data shape: torch.Size([64, 1, 28, 28])
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(1 * 28 * 28, 20),
            nn.ReLU())
        self.fc2 = nn.Sequential(
            nn.Linear(20, 10),  # input size must match the 20 units of fc1
            nn.Softmax(dim=1))

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        x = self.fc2(x)
        return x
The first argument of nn.Linear is the size of the input features, while the second is the number of units.
For self.fc1, the size of the input features is the product of your data's dimensions excluding the batch size, which is 1 * 28 * 28. And as per your post, the second argument should be 20 (20 units).
The shape of the output from self.fc1 (which is also the input to self.fc2) will then be (batch size, 20).
For self.fc2, the size of the input feature will be 20 while the number of units (which is also the number of digits) will be 10.
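A quick way to confirm these sizes is to push a dummy batch through the two layers and print the shapes. This is a sketch with a random tensor standing in for one MNIST batch; fc1 and fc2 are built standalone here rather than inside the Network class.

```python
import torch
import torch.nn as nn

# Same layer sizes as described above: 784 -> 20 -> 10.
fc1 = nn.Sequential(nn.Linear(1 * 28 * 28, 20), nn.ReLU())
fc2 = nn.Sequential(nn.Linear(20, 10), nn.Softmax(dim=1))

batch = torch.rand(64, 1, 28, 28)      # random stand-in for one MNIST batch
flat = batch.view(batch.size(0), -1)   # (64, 784)
hidden = fc1(flat)                     # (64, 20)
out = fc2(hidden)                      # (64, 10)

print(flat.shape, hidden.shape, out.shape)
```

If the second layer were declared with a different input size than the 20 units produced by the first, the call to fc2 would raise a shape mismatch error.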

Don't include an operation for gradient computation in PyTorch

I have a custom layer. Let the layer be called 'Gaussian'
class Gaussian(nn.Module):
    def __init__(self):
        super(Gaussian, self).__init__()

    ##torch.no_grad
    def forward(self, x):
        _r = np.random.randint(0, x.shape[0], x.shape[0])
        _sample = x[_r]
        _d = (_sample - x)
        _number = int(self.k * x.shape[0])
        x[1: _number] = x[1: _number] + (self.n * _d[1: _number]).detach()
        return x
The above class will be used as below:
cnn_model = nn.Sequential(nn.Conv2d(1, 32, 5), Gaussian(), nn.ReLU(), nn.Conv2d(32, 32, 5))
If x is the input, I want the gradient of x to exclude operations that are present in the Gaussian module, but include the calculations in other layers of the neural network(nn.Conv2d etc).
In the end, my aim is to use the Gaussian module to perform calculations but that calculations should not be included in gradient computation.
I tried to do the following:
1. Used the #torch.no_grad above the forward method of the Gaussian
2. Used detach after every operation in the Gaussian module:
   x[1: _number] = x[1: _number] + (self.n * _d[1: _number]).detach() and similarly for other operations
3. Used y = x.detach() in the forward method, performed the operations on y, and then set x.data = y
Are the above methods correct?
P.S: Question edited
Gradient computation only makes sense when there are parameters to optimise.
If your module does not have any parameters, then no gradient will be stored for it, because there are no parameters to associate a gradient with.
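For the question's goal of applying a perturbation without differentiating through it, one possible sketch is to detach only the perturbation term. This is an assumption about what is wanted, not the Gaussian module itself; noise_scale is a made-up stand-in for self.n, and the example compares the input gradient with and without detach.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
noise_scale = 2.0  # hypothetical stand-in for self.n

# Detached: autograd treats the perturbation as a constant.
perturbation = (noise_scale * x).detach()
out = x + perturbation
out.sum().backward()
grad_detached = x.grad.clone()
print(grad_detached)  # tensor([1., 1., 1.])

# Not detached: the perturbation also contributes to the gradient.
x.grad = None
out = x + noise_scale * x
out.sum().backward()
grad_tracked = x.grad.clone()
print(grad_tracked)   # tensor([3., 3., 3.])
```

In both cases the forward value is the same; only the gradient that flows back to x, and from there to earlier layers, differs.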

Using autograd to compute Jacobian matrix of outputs with respect to inputs

I apologize if this question is obvious or trivial. I am very new to pytorch and I am trying to understand the autograd.grad function in pytorch. I have a neural network G that takes in inputs (x,t) and outputs (u,v). Here is the code for G:
class GeneratorNet(torch.nn.Module):
    """
    A three hidden-layer generative neural network
    """
    def __init__(self):
        super(GeneratorNet, self).__init__()
        self.hidden0 = nn.Sequential(
            nn.Linear(2, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden1 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(100, 100),
            nn.LeakyReLU(0.2)
        )
        self.out = nn.Sequential(
            nn.Linear(100, 2),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x
Or simply G(x,t) = (u(x,t), v(x,t)), where u(x,t) and v(x,t) are scalar valued. Goal: Compute $\frac{\partial u(x,t)}{\partial x}$ and $\frac{\partial u(x,t)}{\partial t}$. At every training step, I have a minibatch of size 100, so u(x,t) is a [100,1] tensor. Here is my attempt to compute the partial derivatives, where coords is the input (x,t); I also set the requires_grad_(True) flag on coords:
tensor = GeneratorNet(coords)
tensor.requires_grad_(True)
u, v = torch.split(tensor, 1, dim=1)
du = autograd.grad(u, coords, grad_outputs=torch.ones_like(u), create_graph=True,
retain_graph=True, only_inputs=True, allow_unused=True)[0]
du is now a [100,2] tensor.
Question: Is this the tensor of the partials for the 100 input points of the minibatch?
There are similar questions like computing derivatives of the output with respect to inputs but I could not really figure out what's going on. I apologize once again if this is already answered or trivial. Thank you very much.
The code you posted should give you the partial derivative of your first output w.r.t. the input. However, you also have to set requires_grad_(True) on the inputs, as otherwise PyTorch does not build up the computation graph starting at the input and thus it cannot compute the gradient for them.
This version of your code example computes du and dv:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u), retain_graph=True)[0]  # keep the graph for the dv call
dv = torch.autograd.grad(v, coords, grad_outputs=torch.ones_like(v))[0]
You can also compute the partial derivative for a single output:
net = GeneratorNet()
coords = torch.randn(10, 2)
coords.requires_grad = True
tensor = net(coords)
u, v = torch.split(tensor, 1, dim=1)
du_0 = torch.autograd.grad(u[0], coords)[0]
where du_0[0] == du[0]. The other rows of du_0 are zero, because u[0] depends only on coords[0].
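On the original question of whether the batched du holds the per-point partials: since the network processes each row independently, it should. The sketch below checks this numerically with a small stand-in network (hypothetical layer sizes, same 2-in/2-out interface as GeneratorNet) by comparing the batched call against differentiating each u[i] on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in for GeneratorNet (hypothetical sizes, no batch-coupled layers).
net = nn.Sequential(nn.Linear(2, 16), nn.LeakyReLU(0.2), nn.Linear(16, 2), nn.Tanh())

coords = torch.randn(5, 2, requires_grad=True)
u, v = torch.split(net(coords), 1, dim=1)

# Batched call: one backward pass with a vector of ones.
du = torch.autograd.grad(u, coords, grad_outputs=torch.ones_like(u),
                         retain_graph=True)[0]

# Reference: differentiate each u[i] separately and keep row i.
rows = []
for i in range(coords.shape[0]):
    g = torch.autograd.grad(u[i, 0], coords, retain_graph=True)[0]
    rows.append(g[i])
per_sample = torch.stack(rows)

print(torch.allclose(du, per_sample, atol=1e-6))
```

Row i of du then holds the partial derivatives of u w.r.t. x and t at the i-th input point. This shortcut would not hold for a network with layers that mix samples across the batch, such as batch normalization in training mode.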

Resources