Pytorch gradient calulation of one Tensor - pytorch

I'm a beginner in pytorch and I'm probably stuck on a relatively trivial problem, but it's not clearing up for me at the moment.
When calculating the gradient of a tensor I get a constant gradient of 1.
Shouldn't the gradient of a constant however result in 0?
Here is a minimal example:
import torch
x = torch.tensor(50., requires_grad=True)
y = x
y.backward()
print(x.grad)
#Ouput: tensor(1.)
So why is the ouput 1 and not 0?

You are not computing the gradient of a constant, but that of the variable x which has a constant value 50. The derivative of x with respect to x is 1.

Related

Is there a version of sparse categorical cross entropy in pytorch?

I saw a sudoku solver CNN uses a sparse categorical cross-entropy as a loss function using the TensorFlow framework, I am wondering if there is a similar function for Pytorch? if not could how could I potentially calculate the loss of a 2d array using Pytorch?
Here is an example of usage of nn.CrossEntropyLoss for image segmentation with a batch of size 1, width 2, height 2 and 3 classes.
Image segmentation is a classification problem at pixel level. Of course you can also use nn.CrossEntropyLoss for basic image classification as well.
The sudoku problem in the question can be seen as an image segmentation problem where you have 10 classes (the 10 digits) (though Neural Networks are not appropriate to solve combinatorial problems like Sudoku which already have efficient exact resolution algorithms).
nn.CrossEntropyLoss accepts ground truth labels directly as integers in [0, N_CLASSES[ (no need to onehot encode the labels):
import torch
from torch import nn
import numpy as np
# logits predicted
x = np.array([[
[[1,0,0],[1,0,0]], # predict class 0 for pixel (0,0) and class 0 for pixel (0,1)
[[0,1,0],[0,0,1]], # predict class 1 for pixel (1,0) and class 2 for pixel (1,1)
]])*5 # multiply by 5 to give bigger losses
print("logits map :")
print(x)
# ground truth labels
y = np.array([[
[0,1], # must predict class 0 for pixel (0,0) and class 1 for pixel (0,1)
[1,2], # must predict class 1 for pixel (1,0) and class 2 for pixel (1,1)
]])
print("\nlabels map :")
print(y)
x=torch.Tensor(x).permute((0,3,1,2)) # shape of preds must be (N, C, H, W) instead of (N, H, W, C)
y=torch.Tensor(y).long() # shape of labels must be (N, H, W) and type must be long integer
losses = nn.CrossEntropyLoss(reduction="none")(x, y) # reduction="none" to get the loss by pixel
print("\nLosses map :")
print(losses)
# notice that the loss is big only for pixel (0,1) where we predicted 0 instead of 1

Pytorch tensor multiplication with Float tensor giving wrong answer

I am seeing some strange behavior when i multiply two pytorch tensors.
x = torch.tensor([99397544.0])
y = torch.tensor([0.1])
x * y
This outputs
tensor([9939755.])
However, the answer should be 9939754.4
In default, the tensor dtype is torch.float32 in pytorch. Change it to torch.float64 will give the right result.
x = torch.tensor([99397544.0], dtype=torch.float64)
y = torch.tensor([0.1], dtype=torch.float64)
x * y
# tensor([9939754.4000])
The mismatched result for torch.float32 caused by rounding error if you do not have enough precision to calculate (represent) it.
What Every Computer Scientist Should Know About Floating-Point Arithmetic

How to calculate gradients on a tensor in PyTorch?

I want to calculate the gradient of a tensor and however, it gives error as
RunTimeerror: grad can be implicitly created only for scalar outputs
and here is what I am trying to code:
x = torch.full((2,3), 4,requires_grad=True)
y = (2*x**2+3)
y.backward()
And at this point, it throws an error.
Since there is no summing up/reducing the loss-value , like .sum()
Hence the issue could be fixed by:
y.backward(torch.ones_like(x))
which performs a Jacobian-vector product with a tensor of all ones and get the gradient.

How to properly update the weights in PyTorch?

I'm trying to implement the gradient descent with PyTorch according to this schema but can't figure out how to properly update the weights. It is just a toy example with 2 linear layers with 2 nodes in hidden layer and one output.
Learning rate = 0.05;
target output = 1
https://hmkcode.github.io/ai/backpropagation-step-by-step/
Forward
Backward
My code is as following:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class MyNet(nn.Module):
def __init__(self):
super(MyNet, self).__init__()
self.linear1 = nn.Linear(2, 2, bias=None)
self.linear1.weight = torch.nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
self.linear2 = nn.Linear(2, 1, bias=None)
self.linear2.weight = torch.nn.Parameter(torch.tensor([[0.14, 0.15]]))
def forward(self, inputs):
out = self.linear1(inputs)
out = self.linear2(out)
return out
losses = []
loss_function = nn.L1Loss()
model = MyNet()
optimizer = optim.SGD(model.parameters(), lr=0.05)
input = torch.tensor([2.0,3.0])
print('weights before backpropagation = ', list(model.parameters()))
for epoch in range(1):
result = model(input )
loss = loss_function(result , torch.tensor([1.00],dtype=torch.float))
print('result = ', result)
print("loss = ", loss)
model.zero_grad()
loss.backward()
print('gradients =', [x.grad.data for x in model.parameters()] )
optimizer.step()
print('weights after backpropagation = ', list(model.parameters()))
The result is following :
weights before backpropagation = [Parameter containing:
tensor([[0.1100, 0.2100],
[0.1200, 0.0800]], requires_grad=True), Parameter containing:
tensor([[0.1400, 0.1500]], requires_grad=True)]
result = tensor([0.1910], grad_fn=<SqueezeBackward3>)
loss = tensor(0.8090, grad_fn=<L1LossBackward>)
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
weights after backpropagation = [Parameter containing:
tensor([[0.1240, 0.2310],
[0.1350, 0.1025]], requires_grad=True), Parameter containing:
tensor([[0.1825, 0.1740]], requires_grad=True)]
Forward pass values:
2x0.11 + 3*0.21=0.85 ->
2x0.12 + 3*0.08=0.48 -> 0.85x0.14 + 0.48*0.15=0.191 -> loss =0.191-1 = -0.809
Backward pass: let's calculate w5 and w6 (output node weights)
w = w - (prediction-target)x(gradient)x(output of previous node)x(learning rate)
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
w6= 0.15 -(0.191-1)*1*0.48*0.05= 0.15 + 0.019= 0.169
In my example Torch doesn't multiply the loss by derivative so we get wrong weights after updating. For the output node we got new weights w5,w6 [0.1825, 0.1740] , when it should be [0.174, 0.169]
Moving backward to update the first weight of the output node (w5) we need to calculate: (prediction-target)x(gradient)x(output of previous node)x(learning rate)=-0.809*1*0.85*0.05=-0.034. Updated weight w5 = 0.14-(-0.034)=0.174. But instead pytorch calculated new weight = 0.1825. It forgot to multiply by (prediction-target)=-0.809. For the output node we got gradients -0.8500 and -0.4800. But we still need to multiply them by loss 0.809 and learning rate 0.05 before we can update the weights.
What is the proper way of doing this?
Should we pass 'loss' as an argument to backward() as following: loss.backward(loss) .
That seems to fix it. But I couldn't find any example on this in documentation.
You should use .zero_grad() with optimizer, so optimizer.zero_grad(), not loss or model as suggested in the comments (though model is fine, but it is not clear or readable IMO).
Except that your parameters are updated fine, so the error is not on PyTorch's side.
Based on gradient values you provided:
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
Let's multiply all of them by your learning rate (0.05):
gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]),
tensor([[-0.0425, -0.024]])]
Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:
parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
tensor([[0.1825, 0.1740]])]
What you have done is taken the gradients calculated by PyTorch and multiplied them with the output of previous node and that's not how it works!.
What you've done:
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
What should of been done (using PyTorch's results):
w5 = 0.14 - (-0.85*0.05) = 0.1825
No multiplication of previous node, it's done behind the scenes (that's what .backprop() does - calculates correct gradients for all of the nodes), no need to multiply them by previous ones.
If you want to calculate them manually, you have to start at the loss (with delta being one) and backprop all the way down (do not use learning rate here, it's a different story!).
After all of them are calculated, you can multiply each weight by optimizers learning rate (or any other formula for that matter, e.g. Momentum) and after this you have your correct update.
How to calculate backprop
Learning rate is not part of backpropagation, leave it alone until you calculate all of the gradients (it confuses separate algorithms together, optimization procedures and backpropagation).
1. Derivative of total error w.r.t. output
Well, I don't know why you are using Mean Absolute Error (while in the tutorial it is Mean Squared Error), and that's why both those results vary. But let's go with your choice.
Derivative of | y_true - y_pred | w.r.t. to y_pred is 1, so IT IS NOT the same as loss. Change to MSE to get equal results (here, the derivative will be (1/2 * y_pred - y_true), but we usually multiply MSE by two in order to remove the first multiplication).
In MSE case you would multiply by the loss value, but it depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).
2. Derivative of total error w.r.t. w5
You could probably go from here, but... Derivative of total error w.r.t to w5 is the output of h1 (0.85 in this case). We multiply it by derivative of total error w.r.t. output (it is 1!) and obtain 0.85, as done in PyTorch. Same idea goes for w6.
I seriously advise you not to confuse learning rate with backprop, you are making your life harder (and it's not easy with backprop IMO, quite counterintuitive), and those are two separate things (can't stress that one enough).
This source is nice, more step-by-step, with a little more complicated network idea (activations included), so you can get a better grasp if you go through all of it.
Furthermore, if you are really keen (and you seem to be), to know more ins and outs of this, calculate the weight corrections for other optimizers (say, nesterov), so you know why we should keep those ideas separated.

PyTorch: Calculating the Hessian vector product with nn.parameters()

Using PyTorch, I would like to calculate the Hessian vector product, where the Hessian is the second-derivative matrix of the loss function of some neural net, and the vector will be the vector of gradients of that loss function.
I know how to calculate the Hessian vector product for a regular function thanks to this post. However, I am running into trouble when the function is the loss function of a neural network. This is because the parameters are packaged into a module, accessible via nn.parameters(), and not a torch tensor.
I want to do something like this (doesn't work):
### a simple neural network
linear = nn.Linear(10, 20)
x = torch.randn(1, 10)
y = linear(x).sum()
### compute the gradient and make a copy that is detached from the graph
grad = torch.autograd.grad(y, linear.parameters(),create_graph=True)
v = grad.clone().detach()
### compute the Hessian vector product
z = grad # v
z.backward()
In analogy this this (does work):
x = Variable(torch.Tensor([1, 1]), requires_grad=True)
f = 3*x[0]**2 + 4*x[0]*x[1] + x[1]**2
grad, = torch.autograd.grad(f, x, create_graph=True)
v = grad.clone().detach()
z = grad # v
z.backward()
This post addresses a similar (possibly the same?) issue, but I don't understand the solution.
You are saying it doesn't work but do not show what error you get, this is why you haven't got any answers
torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
outputs and inputs are expected to be sequences of tensors. But you
use just a tensor as outputs.
What this is saying is that you should pass a sequence, so pass [y] instead of y

Resources