Meaning of grad_outputs in PyTorch's torch.autograd.grad

I am having trouble understanding the conceptual meaning of the grad_outputs option in torch.autograd.grad.
The documentation says:
grad_outputs should be a sequence of length matching output containing the “vector” in Jacobian-vector product, usually the pre-computed gradients w.r.t. each of the outputs. If an output doesn’t require_grad, then the gradient can be None).
I find this description quite cryptic. What exactly do they mean by Jacobian-vector product? I know what the Jacobian is, but not sure about what product they mean here: element-wise, matrix product, something else? I can't tell from my example below.
And why is "vector" in quotes? Indeed, in the example below I get an error when grad_outputs is a vector, but not when it is a matrix.
>>> x = torch.tensor([1.,2.,3.,4.], requires_grad=True)
>>> y = torch.outer(x, x)
Why do we observe the following output; how was it computed?
>>> y
tensor([[ 1., 2., 3., 4.],
[ 2., 4., 6., 8.],
[ 3., 6., 9., 12.],
[ 4., 8., 12., 16.]], grad_fn=<MulBackward0>)
>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))
(tensor([20., 20., 20., 20.]),)
However, why this error?
>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(x))
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4]) and output[0] has a shape of torch.Size([4, 4]).

In your example, we have a function f that takes an input x of shape (n,) and outputs y = f(x) of shape (n, n). The input is described as the column vector [x_i]_i for i ∈ [1, n], and f(x) is defined as the matrix [y_jk]_jk = [x_j*x_k]_jk for j, k ∈ [1, n]².
It is often useful to compute the gradient of the output with respect to the input (or sometimes w.r.t. the parameters of f, of which there are none here). In the more general case, though, we are looking to compute dL/dx and not just dy/dx, where dL/dx is the partial derivative of some quantity L, computed from y, w.r.t. x.
The computation graph looks like:
x.grad = dL/dx <------- dL/dy = y.grad
                dy/dx
           x   -------> y = x*xT
Via the chain rule, dL/dx is equal to dL/dy*dy/dx. Looking at the interface of torch.autograd.grad, we have the following correspondences:
outputs <-> y,
inputs <-> x, and
grad_outputs <-> dL/dy.
Looking at the shapes: dL/dx should have the same shape as x (dL/dx can be referred to as the 'gradient' of x), while dy/dx, the Jacobian matrix, would be 3-dimensional. On the other hand dL/dy, which is the incoming gradient, should have the same shape as the output, i.e., y's shape.
We want to compute dL/dx = dL/dy*dy/dx. If we look more closely, we have
dy/dx = [dy_jk/dx_i]_ijk for i, j, k ∈ [1, n]³
Therefore,
dL/dx = [dL/dx_i]_i, i ∈ [1, n]
      = [sum(dL/dy_jk * dy_jk/dx_i) over j, k ∈ [1, n]²]_i, i ∈ [1, n]
Back to your example: grad_outputs is all ones, i.e. dL/dy_jk = 1 for every j, k, so for a given i ∈ [1, n] we have dL/dx_i = sum(dy_jk/dx_i) over j, k ∈ [1, n]². And dy_jk/dx_i = d(x_j*x_k)/dx_i equals x_j if i = k ≠ j, x_k if i = j ≠ k, 2*x_i if i = j = k (because of the squared x_i), and 0 otherwise. Since the matrix y is symmetric, the sum comes down to 2*sum(x_j) over j ∈ [1, n].
This means dL/dx is the column vector [2*sum(x)]_i for i ∈ [1, n].
>>> 2*x.sum()*torch.ones_like(x)
tensor([20., 20., 20., 20.])
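To make the "Jacobian-vector product" concrete, here is a minimal sketch (not part of the original answer) that builds the full 3-dimensional Jacobian with torch.autograd.functional.jacobian and contracts it with grad_outputs by hand. The names v, J, g_autograd and g_manual are chosen here for illustration only.

```python
import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)

# grad_outputs plays the role of dL/dy, so it must have y's shape.
v = torch.ones_like(y)
(g_autograd,) = torch.autograd.grad(y, x, grad_outputs=v)

# Full Jacobian of shape (4, 4, 4), indexed as J[j, k, i] = dy_jk / dx_i.
J = torch.autograd.functional.jacobian(lambda t: torch.outer(t, t), x.detach())

# Contract dL/dy with dy/dx over the two output indices j and k.
g_manual = torch.einsum('jk,jki->i', v, J)

print(g_autograd)  # tensor([20., 20., 20., 20.])
print(torch.allclose(g_autograd, g_manual))
```

So the "product" is neither element-wise nor a plain matrix product: it is a sum over all output indices, which is why grad_outputs has to match the shape of y rather than the shape of x.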
Stepping back, consider another graph that adds one more operation after y:
x -------> y = x*xT --------> z = y²
If you look at the backward pass on this graph, you have:
dL/dx <-------  dL/dy   <-------- dL/dz
        dy/dx             dz/dy
  x   -------> y = x*xT --------> z = y²
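Here dL/dx = dL/dy*dy/dx = dL/dz*dz/dy*dy/dx, which in practice is computed in two sequential steps: first dL/dy = dL/dz*dz/dy, then dL/dx = dL/dy*dy/dx. At each step, the gradient arriving from the right is what torch.autograd.grad calls grad_outputs.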
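The two sequential steps can be reproduced by hand with two calls to torch.autograd.grad. This is only a sketch: it assumes dL/dz is all ones (i.e. L = z.sum()), and the variable names are illustrative.

```python
import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)
z = y ** 2

# Assumed incoming gradient: L = z.sum(), so dL/dz is all ones.
dL_dz = torch.ones_like(z)

# Step 1: dL/dy = dL/dz * dz/dy (retain_graph so the graph can be reused below).
(dL_dy,) = torch.autograd.grad(z, y, grad_outputs=dL_dz, retain_graph=True)

# Step 2: dL/dx = dL/dy * dy/dx, feeding step 1's result in as grad_outputs.
(dL_dx_two_step,) = torch.autograd.grad(y, x, grad_outputs=dL_dy, retain_graph=True)

# Reference: let autograd walk the whole graph in one call.
(dL_dx_direct,) = torch.autograd.grad(z, x, grad_outputs=dL_dz)

print(dL_dx_two_step)  # tensor([120., 240., 360., 480.])
print(torch.allclose(dL_dx_two_step, dL_dx_direct))
```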

Related

Multiply a [3, 2, 3] by a [3, 2] tensor in pytorch (dot product along dimension)

Given the following tensors x and y with shapes [3,2,3] and [3,2], I want to multiply the tensors along the 2nd dimension. This is expected to be a kind of dot product and scaling along the axis, returning a [3,2,3] tensor.
import torch
a = [[[0.2,0.3,0.5],[-0.5,0.02,1.0]],[[0.01,0.13,0.06],[0.35,0.12,0.0]], [[1.0,-0.3,1.0],[1.0,0.02, 0.03]] ]
b = [[1,2],[1,3],[0,2]]
x = torch.FloatTensor(a) # shape [3,2,3]
y = torch.FloatTensor(b) # shape [3,2]
The expected output :
Expected output shape should be [3,2,3]
#output = [[[0.2,0.3,0.5],[-1.0,0.04,2.0]],[[0.01,0.13,0.06],[1.05,0.36,0.0]], [[0.0,0.0,0.0],[2.0,0.04, 0.06]] ]
I have tried the two below but none of them is giving the desired output and output shape.
torch.matmul(x,y)
torch.matmul(x,y.unsqueeze(1).shape)
What is the best way to fix this?
This is just a broadcasted multiply. You can insert a singleton (size-1) dimension at the end of y to make it a [3,2,1] tensor and then multiply by x. There are multiple ways to insert singleton dimensions.
# all equivalent
x * y.unsqueeze(2)
x * y[..., None]
x * y[:, :, None]
x * y.reshape(3, 2, 1)
You could also use torch.einsum.
torch.einsum('abc,ab->abc', x, y)
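As a quick check, the sketch below runs the broadcasted multiply and the einsum variant on the data from the question and compares both against the expected output stated there. The variable names are illustrative.

```python
import torch

x = torch.tensor([[[0.2, 0.3, 0.5], [-0.5, 0.02, 1.0]],
                  [[0.01, 0.13, 0.06], [0.35, 0.12, 0.0]],
                  [[1.0, -0.3, 1.0], [1.0, 0.02, 0.03]]])  # shape [3, 2, 3]
y = torch.tensor([[1., 2.], [1., 3.], [0., 2.]])           # shape [3, 2]

# Expected output copied from the question.
expected = torch.tensor([[[0.2, 0.3, 0.5], [-1.0, 0.04, 2.0]],
                         [[0.01, 0.13, 0.06], [1.05, 0.36, 0.0]],
                         [[0.0, 0.0, 0.0], [2.0, 0.04, 0.06]]])

# y.unsqueeze(2) has shape [3, 2, 1] and broadcasts over x's last dimension.
out_broadcast = x * y.unsqueeze(2)
out_einsum = torch.einsum('abc,ab->abc', x, y)

print(out_broadcast.shape)  # torch.Size([3, 2, 3])
print(torch.allclose(out_broadcast, expected))
print(torch.allclose(out_einsum, expected))
```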

Pytorch change format of tensors from BxWxH to B, N, 3

I have a tensor A with the shape BxWxH (B=Batch size, W=Width, H=Height) and want to change it to a tensor B of shape BxNx3 (B=Batch size, N=Number of points=W*H).
Tensor A represents a depth map, e.g. tensor[0,1,2] => gives the depth value for the pixel (1,2) in batch 0.
Tensor B also represents a depth map but in a different format. Each point in tensor B has the following three dimensions: (x coord, y coord, depth value).
How can I transform tensor A into tensor B?
You are looking for meshgrid to give you the x and y coordinates of each pixel:
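b, w, h = A.shape
# indexing='ij' keeps x along W and y along H (the keyword needs a reasonably recent PyTorch)
x, y = torch.meshgrid(torch.arange(w), torch.arange(h), indexing='ij')
# broadcast the (w, h) coordinate grids over the batch and match A's dtype and device
x = x.expand(b, w, h).to(A)
y = y.expand(b, w, h).to(A)
# stack on a new last dimension so each point is (x, y, depth), then flatten W and H
B = torch.stack((x, y, A), dim=-1).reshape(b, w*h, 3)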
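One way to check the layout is to run the conversion on a tiny depth map and look up a single pixel. The helper depth_to_points below is hypothetical (it is not a PyTorch function). It stacks the coordinates and the depth on a new last dimension, so the point for pixel (i, j) lands at index i*h + j.

```python
import torch

def depth_to_points(A):
    """Hypothetical helper: (b, w, h) depth map -> (b, w*h, 3) points (x, y, depth)."""
    b, w, h = A.shape
    x, y = torch.meshgrid(torch.arange(w), torch.arange(h), indexing='ij')
    # Broadcast the coordinate grids over the batch and match A's dtype/device.
    x = x.expand(b, w, h).to(A)
    y = y.expand(b, w, h).to(A)
    return torch.stack((x, y, A), dim=-1).reshape(b, w * h, 3)

A = torch.arange(12, dtype=torch.float32).reshape(2, 2, 3)  # b=2, w=2, h=3
B = depth_to_points(A)

print(B.shape)   # torch.Size([2, 6, 3])
print(B[0, 5])   # pixel (1, 2) of batch 0: tensor([1., 2., 5.])
```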

Computation of gradients

I want to compute the gradient in the following scenario:
y = w_0x+w_1 and z = w_2x + (dy/dx)^2
w = torch.tensor([2.,1.,3.], requires_grad=True)
x = torch.tensor([0.5], requires_grad=True)
y = w[0]*x + w[1]
y.backward()
l = x.grad
l.requires_grad=True
w.grad.zero_()
z = w[2]*x + l**2
z.backward()
I expect [4, 0, 0.5], but instead I get [0, 0, 0.5]. I know that in this case I can replace l by w_0, but l can be a complex function of x, in which case it is important that I compute the gradients numerically instead of changing the expression for z. Please let me know what changes I need to make to get the correct gradient w.r.t. w.
You should print your gradients along the way; it makes this much easier to follow.
I will comment on what's going on in the code:
import torch
w = torch.tensor([2.0, 1.0, 3.0], requires_grad=True)
x = torch.tensor([0.5], requires_grad=True)
y = w[0] * x + w[1]
y.backward()
l = x.grad
l.requires_grad = True
print(w.grad) # [0.5000, 1.0000, 0.0000] as expected
w.grad.zero_()
print(w.grad) # [0., 0., 0.] as you cleared the gradient
z = w[2] * x + l ** 2
z.backward()
print(w.grad) # [0., 0., 0.5] - see below
The last print(w.grad) gives that result because only the last element of w takes part directly in the equation for z: it is multiplied by x, which is 0.5, hence its gradient is 0.5. The term l ** 2 contributes nothing to w.grad because l was read from x.grad, which is a plain tensor with no graph connecting it back to w. You cleared the gradient beforehand by calling w.grad.zero_(). I can't see how you could get [4., 0., 0.5] with this code. If you didn't clear the gradient, you would get tensor([0.5000, 1.0000, 0.5000]), the first two entries coming from the y equation and the last one from the z equation.
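If the goal is for z to depend on w through dy/dx, one option is to compute dy/dx with torch.autograd.grad and create_graph=True, so that the derivative itself stays attached to the graph. This is a sketch of that idea rather than a change suggested in the original answer.

```python
import torch

w = torch.tensor([2.0, 1.0, 3.0], requires_grad=True)
x = torch.tensor([0.5], requires_grad=True)

y = w[0] * x + w[1]

# create_graph=True keeps dy/dx differentiable; here l equals w[0].
(l,) = torch.autograd.grad(y, x, create_graph=True)

z = w[2] * x + l ** 2
z.backward()

# dz/dw = [2 * w[0], 0, x]
print(w.grad)  # tensor([4.0000, 0.0000, 0.5000])
```

Because the first derivative is obtained with torch.autograd.grad instead of y.backward(), nothing is accumulated into w.grad before z.backward(), so no zero_() call is needed.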

How to do numpy matmul broadcasting between two numpy tensors?

I have the Pauli matrices which are (2x2) and complex
II = np.identity(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
and a depolarizing_error function which takes in a normally distributed random number param, generated by np.random.normal(noise_mean, noise_sd)
def depolarizing_error(param):
    XYZ = np.sqrt(param/3)*np.array([X, Y, Z])
    return np.array([np.sqrt(1-param)*II, XYZ[0], XYZ[1], XYZ[2]])
Now if I feed in a single number for param of let's say a, my function should return an output of np.array([np.sqrt(1-a)*II, a*X, a*Y, a*Z]) where a is a float and * denotes the element-wise multiplication between a and the entries of the (2x2) matrices II, X, Y, Z.
Now for vectorization purposes, I wish to feed in an array of param i.e.
param = np.array([a, b, c, ..., n]) Eqn(1)
again with all a, b, c, ..., n generated independently by np.random.normal(noise_mean, noise_sd) (I think it's doable with np.random.normal(noise_mean, noise_sd, n) or something)
such that my function now returns:
np.array([[np.sqrt(1-a)*II, a*X, a*Y, a*Z],
[np.sqrt(1-b)*II, b*X, b*Y, b*Z],
................................,
[np.sqrt(1-n)*II, n*X, n*Y, n*Z]])
I thought feeding in something like np.random.normal(noise_mean, noise_sd, n) as param, giving output as np.array([a, b, c,...,n]) would sort itself out and return what I want above. but my XYZ = np.sqrt(param/3)*np.array([X, Y, Z]) ended up doing element-wise dot product instead of element-wise multiplication. I tried using param as np.array([a, b])
and ended up with
np.array([np.dot(np.sqrt(1-[a, b]), II),
np.dot(np.sqrt([a, b]/3), X),
np.dot(np.sqrt([a, b]/3), Y),
np.dot(np.sqrt([a, b]/3), Z)])
instead. So far I've tried something like
def depolarizing_error(param):
    XYZ = np.sqrt(param/3)@np.array([X, Y, Z])
    return np.array([np.sqrt(1-param)*II, XYZ[0], XYZ[1], XYZ[2]])
thinking that the matmul @ will just broadcast it conveniently for me, but then I got really bogged down by the dimensions.
Now my motivation for wanting to do all this is because I have another matrix that's given by:
def random_angles(sd, seq_length):
    return np.random.normal(0, sd, (seq_length,3))
def unitary_error(params):
    e_1 = np.exp(-1j*(params[:,0]+params[:,2])/2)*np.cos(params[:,1]/2)
    e_2 = np.exp(-1j*(params[:,0]-params[:,2])/2)*np.sin(params[:,1]/2)
    return np.array([[e_1, e_2], [-e_2.conj(), e_1.conj()]],
                    dtype=complex).transpose(2,0,1)
where here the size of seq_length is equivalent to the number of entries in Eqn(1) param, denoting N = seq_length = |param| say. Here my unitary_error function should give me an output of
np.array([V_1, V_2, ..., V_N])
such that I'll be able to use np.matmul as an attempt to implement vectorization like this
np.array([V_1, V_2, ..., V_N])@np.array([[np.sqrt(1-a)*II, a*X, a*Y, a*Z],
[np.sqrt(1-b)*II, b*X, b*Y, b*Z],
................................,
[np.sqrt(1-n)*II, n*X, n*Y, n*Z]])@np.array([V_1, V_2, ..., V_N])
to finally give
np.array([[V_1@np.sqrt(1-a)*II@V_1, V_1@a*X@V_1, V_1@a*Y@V_1, V_1@a*Z@V_1],
[V_2@np.sqrt(1-b)*II@V_2, V_2@b*X@V_2, V_2@b*Y@V_2, V_2@b*Z@V_2],
................................,
[V_N@np.sqrt(1-n)*II@V_N, V_N@n*X@V_N, V_N@n*Y@V_N, V_N@n*Z@V_N]])
where here @ denotes the element-wise dot-product
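One possible way to get this behaviour (a sketch, not an answer from the original thread) is to give param trailing singleton axes so that each scalar scales a whole 2x2 matrix, and to give the unitaries an extra axis so that NumPy's matmul operator @ broadcasts over the four operators. The constant PAULIS and the shape conventions (N, 4, 2, 2) are assumptions made here. The sketch also assumes every entry of param lies in [0, 1], since the square roots are otherwise undefined for real input.

```python
import numpy as np

II = np.identity(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = np.array([X, Y, Z])  # shape (3, 2, 2)

def depolarizing_error(param):
    """Return an array of shape (N, 4, 2, 2) for param of shape (N,)."""
    param = np.atleast_1d(np.asarray(param, dtype=float))
    # Trailing singleton axes let each scalar scale whole 2x2 matrices.
    id_term = np.sqrt(1 - param)[:, None, None, None] * II        # (N, 1, 2, 2)
    xyz_terms = np.sqrt(param / 3)[:, None, None, None] * PAULIS  # (N, 3, 2, 2)
    return np.concatenate([id_term, xyz_terms], axis=1)

def unitary_error(params):
    """Same as in the question: params of shape (N, 3) -> unitaries of shape (N, 2, 2)."""
    e_1 = np.exp(-1j * (params[:, 0] + params[:, 2]) / 2) * np.cos(params[:, 1] / 2)
    e_2 = np.exp(-1j * (params[:, 0] - params[:, 2]) / 2) * np.sin(params[:, 1] / 2)
    return np.array([[e_1, e_2], [-e_2.conj(), e_1.conj()]],
                    dtype=complex).transpose(2, 0, 1)

param = np.array([0.01, 0.02, 0.03])           # illustrative values in [0, 1]
K = depolarizing_error(param)                  # (3, 4, 2, 2)
rng = np.random.default_rng(0)
V = unitary_error(rng.normal(0, 0.1, (3, 3)))  # (3, 2, 2)

# V[:, None] has shape (N, 1, 2, 2); matmul broadcasts it against the 4 operators.
out = V[:, None] @ K @ V[:, None]
print(out.shape)  # (3, 4, 2, 2)
```

With this layout, out[n, m] equals V[n] @ K[n, m] @ V[n], which is the per-row product written out in the question.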

Non-linear optimization for rotation

I had a chat with an engineer the other day and we both were stumped on a question related to bundle adjustment. For a refresher, here is a good link explaining the problem:
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/ZISSERMAN/bundle/bundle.html
The problem requires optimization over 3n+11m parameters. The camera optimization consists of 5 intrinsic camera parameters, 3 DOF for position (x,y,z), and 3 DOF for rotation (pitch, yaw and roll).
Now, when you actually go about implementing this algorithm, a rotation matrix means optimizing over 9 numbers. Euler's Axis Theorem says these 9 numbers are related and there are only 3 degrees of freedom overall.
Suppose you represent the rotation using a normalized quaternion. Then you have optimization over 3 numbers. Same DOF.
Is one representation more computationally efficient or otherwise better than the other? Will you have fewer variables to optimize using a rotation quaternion rather than a rotation matrix?
You never optimize over 9 numbers! Of course this would be inefficient. One efficient representation in which you only need 3 parameters is to parametrize your rotation matrix R using the Lie algebra of the group SO(3). If you are not familiar with Lie algebras, here's a tutorial that explains everything in an intuitive (but sometimes oversimplified) manner. To explain it in a few short sentences: in this representation, each rotation matrix R is written as expmat(a*G_1+b*G_2+c*G_3), where expmat is the matrix exponential and the G_i are the "generators" of the Lie algebra of SO(3), i.e. the tangent space to SO(3) at the identity. Therefore, to estimate a rotation matrix, you only need to learn the three parameters a, b, c. This is roughly equivalent to decomposing your rotation matrix into three rotations around x, y, z and estimating the three angles of these rotations.
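As a minimal sketch of this parametrization, the snippet below builds R from three parameters with scipy.linalg.expm and checks that the result is a valid rotation. The generator matrices follow the usual skew-symmetric convention, and the function name rotation_from_params is hypothetical.

```python
import numpy as np
from scipy.linalg import expm

# Generators of so(3): skew-symmetric matrices for rotations about x, y, z.
G1 = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
G2 = np.array([[0., 0., 1.], [0., 0., 0.], [-1., 0., 0.]])
G3 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 0.]])

def rotation_from_params(a, b, c):
    """Hypothetical helper: map 3 free parameters to a 3x3 rotation matrix."""
    return expm(a * G1 + b * G2 + c * G3)

R = rotation_from_params(0.3, -0.2, 0.1)

# The exponential of a skew-symmetric matrix is orthogonal with determinant 1.
print(np.allclose(R.T @ R, np.identity(3)))  # True
print(np.isclose(np.linalg.det(R), 1.0))     # True
```

An optimizer can then work on (a, b, c) without any constraint, since every triple maps to a valid rotation.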
A solution not mentioned yet is to use axis-angle parameterization.
Basically, you represent the rotation as a single 3D vector. The direction v/|v| of the vector is the axis of rotation, and the norm |v| is the angle of rotation around that axis.
This method has 3 DOF directly, unlike quaternions' 4 DOF. So with quaternions, you need to use either constrained optimization or additional parameterization to get down to 3 DOF.
I'm not familiar with @Ash's suggestion, but he does mention in the comment that it only works for small angles. Axis-angle representation doesn't have this limitation.
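To illustrate the difference in parameter count, here is a small sketch using scipy.spatial.transform.Rotation (a library choice made here, not something the answer refers to). The axis-angle vector has three free numbers, while the equivalent quaternion has four numbers tied together by a unit-norm constraint.

```python
import numpy as np
from scipy.spatial.transform import Rotation

v = np.array([0.3, -0.2, 0.1])  # axis-angle (rotation vector), 3 free numbers
angle = np.linalg.norm(v)       # rotation angle |v|
axis = v / angle                # rotation axis v/|v|

rot = Rotation.from_rotvec(v)
R = rot.as_matrix()
q = rot.as_quat()               # 4 numbers, constrained to unit norm

# The axis is left unchanged by the rotation it defines.
print(np.allclose(R @ axis, axis))                        # True
# The trace of a rotation matrix encodes the angle: tr(R) = 1 + 2*cos(angle).
print(np.isclose(np.trace(R), 1 + 2 * np.cos(angle)))     # True
print(q.shape, np.isclose(np.linalg.norm(q), 1.0))        # (4,) True
```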
One option is, as relatively_random suggests, to optimize over the axis-angle parameterization. The derivative can then be computed relatively simply, as described in this paper. The only problem is that numerical issues might arise for rotations close to the identity.
import numpy as np
def hat(v):
    """
    Vectorized version of the hat function, creating for a vector its skew-symmetric matrix.
    Args:
        v (np.array<float>(..., 3, 1)): The input vector.
    Returns:
        (np.array<float>(..., 3, 3)): The output skew-symmetric matrix.
    """
    E1 = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
    E2 = np.array([[0., 0., 1.], [0., 0., 0.], [-1., 0., 0.]])
    E3 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 0.]])
    return v[..., 0:1, :] * E1 + v[..., 1:2, :] * E2 + v[..., 2:3, :] * E3
def exp(v, der=False):
    """
    Vectorized version of the exponential map.
    Args:
        v (np.array<float>(..., 3, 1)): The input axis-angle vector.
        der (bool, optional): Whether to output the derivative as well. Defaults to False.
    Returns:
        R (np.array<float>(..., 3, 3)): The corresponding rotation matrix.
        [dR (np.array<float>(3, ..., 3, 3)): The derivative of each rotation matrix.
            The matrix dR[i, ..., :, :] corresponds to
            the derivative d R[..., :, :] / d v[..., i, :],
            so the derivative of the rotation R gained
            through the axis-angle vector v with respect
            to v_i. Note that this is not a Jacobian of
            any form but a vectorized version of derivatives.]
    """
    n = np.linalg.norm(v, axis=-2, keepdims=True)
    H = hat(v)
    with np.errstate(all='ignore'):
        R = np.identity(3) + (np.sin(n) / n) * H + ((1 - np.cos(n)) / n**2) * (H @ H)
    R = np.where(n == 0, np.identity(3), R)
    if der:
        sh = (3,) + tuple(1 for _ in range(v.ndim - 2)) + (3, 1)
        dR = np.swapaxes(np.expand_dims(v, axis=0), 0, -2) * H
        dR = dR + hat(np.cross(v, ((np.identity(3) - R) @ np.identity(3).reshape(sh)), axis=-2))
        dR = dR @ R
        n = n**2  # redefinition
        with np.errstate(all='ignore'):
            dR = dR / n
        dR = np.where(n == 0, hat(np.identity(3).reshape(sh)), dR)
        return R, dR
    else:
        return R
# generate two sets of points which differ by a rotation
np.random.seed(1001)
n = 100  # number of points
p_1 = np.random.randn(n, 3, 1)
v = np.array([0.3, -0.2, 0.1]).reshape(3, 1)  # the axis-angle vector
p_2 = exp(v) @ p_1 + np.random.randn(n, 3, 1) * 1e-2
# estimate v with least squares, so the objective function becomes:
# minimize over v: f(v) = sum_[1<=i<=n] (||p_2_i - exp(v) p_1_i||^2)
# Due to the way least_squares is implemented we have to pass the
# individual residuals ||p_2_i - exp(v) p_1_i|| rather than their squares.
from scipy.optimize import least_squares
def loss(x):
    R = exp(x.reshape(1, 3, 1))
    y = p_2 - R @ p_1
    y = np.linalg.norm(y, axis=-2).squeeze(-1)
    return y
def d_loss(x):
    R, d_R = exp(x.reshape(1, 3, 1), der=True)
    y = p_2 - R @ p_1
    d_y = -d_R @ p_1
    d_y = np.sum(y * d_y, axis=-2) / np.linalg.norm(y, axis=-2)
    d_y = d_y.squeeze(-1).T
    return d_y
x0 = np.zeros((3))
res = least_squares(loss, x0, d_loss)
print('True axis-angle vector: {}'.format(v.reshape(-1)))
print('Estimated axis-angle vector: {}'.format(res.x))
