Autograd function in Pytorch documentation [duplicate]

This question already has answers here: Pytorch, what are the gradient arguments
In the PyTorch autograd tutorial https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py, I am unable to understand what y.backward(v) means in the example shown there. Why do we need to define another tensor v to do the backward operation, and how did we get the results of x.grad?
Thanks in advance

y.backward() computes dy/dz for every leaf node z in the computation graph and stores the result in z.grad.
In the above case, the only leaf node is x.
Calling y.backward() with no arguments works when y is a scalar, which is the case for most deep-learning losses. When y is a vector you have to pass another vector (v in the above case), and you can see this as computing d(v^T y)/dx.
To answer how we got x.grad: note that the tutorial keeps multiplying by 2 until the norm exceeds 1000, so y = (2**k) * x, where k is the total number of doublings, and x.grad will therefore be v * 2**k.
To have a less complicated example, consider this:
x = torch.randn(3,requires_grad=True)
print(x)
Out: tensor([-0.0952, -0.4544, -0.7430], requires_grad=True)
y = x**2
v = torch.tensor([1.0,0.1,0.01])
y.backward(v)
print(x.grad)
Out: tensor([-0.1903, -0.0909, -0.0149])
print(2*v*x)
Out: tensor([-0.1903, -0.0909, -0.0149], grad_fn=<MulBackward0>)
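To see the connection to the Jacobian explicitly, here is a minimal sketch (assuming a PyTorch version that provides torch.autograd.functional.jacobian, i.e. 1.5+): y.backward(v) accumulates the vector-Jacobian product v^T (dy/dx) into x.grad.
import torch
x = torch.randn(3, requires_grad=True)
v = torch.tensor([1.0, 0.1, 0.01])
y = x ** 2
y.backward(v)
# Build the full Jacobian of y = x**2 explicitly; for this function it is diag(2*x).
J = torch.autograd.functional.jacobian(lambda t: t ** 2, x)
print(torch.allclose(x.grad, v @ J))  # True: x.grad equals v^T J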


taking the norm of 3 vectors in python

This is probably a stupid question, but for some reason I can't get the norm of three matrices of vectors.
Each vector in the x matrix represents the x coordinate of a sensor (8 sensors total) for three different experiments. Same for y and z.
ex:
x = [array([ 2.239, 3.981, -8.415, 33.895, 48.237, 52.13 , 60.531, 56.74 ]), array([ 2.372, 6.06 , -3.672, 3.704, -5.926, -2.341, 35.667, 62.097])]
y = [array([ 18.308, -17.83 , -22.278, -99.67 , -121.575, -116.794,-123.132, -127.802]), array([ -3.808, 0.974, -3.14 , 6.645, 2.531, 7.312, -129.236, -112. ])]
z = [array([-1054.728, -1054.928, -1054.928, -1058.128, -1058.928, -1058.928, -1058.928, -1058.928]), array([-1054.559, -1054.559, -1054.559, -1054.559, -1054.559, -1054.559, -1057.959, -1058.059])]
I tried doing:
norm= np.sqrt(np.square(x)+np.square(y)+np.square(z))
x = x/norm
y = y/norm
z = z/norm
However, I'm pretty sure it's wrong. When I then sum the components with, say, np.sum(x[0]), I don't get anywhere close to 1.
Normalization does not make the sum of the components equal to one. Normalization makes the norm of the vector equal to one. You can check if your code worked by taking the norm (square root of the sum of the squared elements) of the normalized vector. That should equal 1.
From what I can tell, your code is working as intended, but not for your application. You could define a function to normalize any vector that you pass to it, much as you did in your program, as follows:
import numpy as np

def normalize(vector):
    norm = np.sqrt(np.sum(np.square(vector)))
    return vector / norm
However, because x, y, and z each have 8 elements, you can't normalize x with the components from x, y, and z.
What I think you want to do is normalize the vector (x,y,z) for each of your 8 sensors. So, you should pass 8 vectors, (one for each sensor) into the normalize function I defined above. This might look something like this:
normalized_vectors = []
for i in range(8):
    vector = np.asarray([x[i], y[i], z[i]])
    normalized_vectors.append(normalize(vector))
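Note that x, y, and z in the question are lists of per-experiment arrays, so the loop above would have to pick one experiment first (e.g. x[0][i]). For reference, here is a vectorized sketch of the same idea using np.linalg.norm, assuming you want one unit (x, y, z) vector per sensor; the names x0/y0/z0 are just the first experiment's arrays from the question:
import numpy as np
x0 = np.array([2.239, 3.981, -8.415, 33.895, 48.237, 52.13, 60.531, 56.74])
y0 = np.array([18.308, -17.83, -22.278, -99.67, -121.575, -116.794, -123.132, -127.802])
z0 = np.array([-1054.728, -1054.928, -1054.928, -1058.128, -1058.928, -1058.928, -1058.928, -1058.928])
vectors = np.stack([x0, y0, z0], axis=1)            # shape (8, 3): one row per sensor
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms                      # each row now has norm 1
print(np.linalg.norm(unit_vectors, axis=1))         # prints eight 1.0 values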

Why must embed dimension be divisible by the number of heads in MultiheadAttention?

I am learning the Transformer. Here is the pytorch document for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
Why require the constraint that embed_dim must be divisible by num_heads? If we go back to the multi-head attention equation, MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), and assume:
Q, K, V are n x embed_dim matrices; all the weight matrices W are embed_dim x head_dim.
Then the concatenation [head_1, ..., head_h] will be an n x (num_heads*head_dim) matrix;
W^O has size (num_heads*head_dim) x embed_dim;
so [head_1, ..., head_h] * W^O will produce an n x embed_dim output.
I don't know why we require embed_dim must be divisible by num_heads.
Let's say we have num_heads=10000; the results are the same, since the matrix-matrix product will absorb this information.
From what I understood, it is a simplification added to keep the implementation straightforward. Theoretically, we can implement the model as you proposed (similar to the original paper).
In the PyTorch documentation, they have briefly mentioned it:
Note that `embed_dim` will be split across `num_heads` (i.e. each head will have dimension `embed_dim` // `num_heads`)
Also, if you look at the PyTorch implementation, you can see it is a bit different (optimised, in my view) compared to the originally proposed model. For example, they use MatMul instead of Linear, and the Concat layer is skipped. I exported the graph of the first encoder (with batch size 32, 10 words, 512 features) to see this.
P.S.: if you need to inspect the model params the same way, this is the code I used:
import torch
transformer_model = torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=1, num_decoder_layers=1, dim_feedforward=11)  # change params as necessary
tgt = torch.rand((20, 32, 512))
src = torch.rand((11, 32, 512))
torch.onnx.export(transformer_model, (src, tgt), "transformer_model.onnx")
When you have a sequence of seq_len x emb_dim (i.e. 20 x 8) and you want to use num_heads=2, the sequence will be split along the emb_dim dimension. Therefore you get two 20 x 4 sequences. You want every head to have the same shape, and if emb_dim isn't divisible by num_heads this won't work. Take for example a sequence of 20 x 9 and again num_heads=2: you would get 20 x 4 and 20 x 5, which are not the same dimension.
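To make the constraint concrete, here is a small sketch against torch.nn.MultiheadAttention; in the versions I have checked, the failing case surfaces as the AssertionError quoted in the question, and conceptually the per-head split is just a reshape of the feature dimension:
import torch
ok = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2)
print(ok.head_dim)                                   # 4, i.e. 8 // 2
try:
    bad = torch.nn.MultiheadAttention(embed_dim=9, num_heads=2)
except AssertionError as e:
    print(e)                                         # embed_dim must be divisible by num_heads
# Conceptually, splitting into heads is a reshape along the feature dimension:
x = torch.randn(20, 8)                               # seq_len x emb_dim
heads = x.reshape(20, 2, 4)                          # two heads of dimension 4 each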

How to avoid recalculating a function when we need to backpropagate through it twice?

In PyTorch, I want to do the following calculation:
l1 = f(x.detach(), y)
l1.backward(retain_graph=True)
l2 = -1*f(x, y.detach())
l2.backward()
where f is some function, and x and y are tensors that require gradient. Notice that x and y may both be the results of previous calculations which utilize shared parameters (for example, maybe x=g(z) and y=g(w) where g is an nn.Module).
The issue is that l1 and l2 are both numerically identical, up to the minus sign, and it seems wasteful to repeat the calculation f(x,y) twice. It would be nicer to be able to calculate it once, and apply backward twice on the result. Is there any way of doing this?
One possibility is to manually call autograd.grad and update the w.grad field of each nn.Parameter w. But I'm wondering if there is a more direct and clean way to do this, using the backward function.
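For concreteness, a rough sketch of that manual route on a toy setup (the names lin, a, b, and loss_func are just placeholders for illustration) could be:
import torch
lin = torch.nn.Linear(1, 1, bias=False)          # module with shared parameters
a, b = torch.tensor([1.0]), torch.tensor([2.0])
loss_func = lambda u, w: (u - w).abs()
x, y = lin(a), lin(b)
l = loss_func(x, y)                              # f(x, y) computed only once
# gradients of l with respect to the intermediate tensors y and x
g_y, g_x = torch.autograd.grad(l, (y, x), retain_graph=True)
# accumulate +dl/dy through y's graph and -dl/dx through x's graph
torch.autograd.backward((y, x), (g_y, -g_x))
print(lin.weight.grad)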
I took this answer from here.
We can calculate f(x,y) once, without detaching either x or y, if we ensure that we multiply by -1 the gradient flowing through x. This can be done using register_hook:
x.register_hook(lambda t: -t)
l = f(x,y)
l.backward()
Here is code demonstrating that this works:
import torch
lin = torch.nn.Linear(1, 1, bias=False)
lin.weight.data[:] = 1.0
a = torch.tensor([1.0])
b = torch.tensor([2.0])
loss_func = lambda x, y: (x - y).abs()
# option 1: this is the inefficient option, presented in the original question
lin.zero_grad()
x = lin(a)
y = lin(b)
loss1 = loss_func(x.detach(), y)
loss1.backward(retain_graph=True)
loss2 = -1 * loss_func(x, y.detach()) # second invocation of `loss_func` - not efficient!
loss2.backward()
print(lin.weight.grad)
# option 2: this is the efficient method, suggested in this answer.
lin.zero_grad()
x = lin(a)
y = lin(b)
x.register_hook(lambda t: -t)
loss = loss_func(x, y) # only one invocation of `loss_func` - more efficient!
loss.backward()
print(lin.weight.grad) # the output of this is identical to the previous print, which confirms the method
# option 3 - this should not be equivalent to the previous options, used just for comparison
lin.zero_grad()
x = lin(a)
y = lin(b)
loss = loss_func(x, y)
loss.backward()
print(lin.weight.grad)

Pytorch - Porting @ Operator

I have the following line of code I want to port to Torch Matmul
rotMat = xmat @ ymat @ zmat
Can I know if this is the correct ordering:
rotMat = torch.matmul(xmat, torch.matmul(ymat, zmat))
According to the python docs on operator precedence the @ operator has left-to-right associativity
https://docs.python.org/3/reference/expressions.html#operator-precedence
Operators in the same box group left to right (except for exponentiation, which groups from right to left).
Therefore the equivalent operation is
rotMat = torch.matmul(torch.matmul(xmat, ymat), zmat)
Though keep in mind that matrix multiplication is associative (mathematically) so you shouldn't see much of a difference in the result if you do it the other way. Generally you want to associate in the way that results in the fewest computational steps. For example using the naive matrix multiplication algorithm, if X is 1x10, Y is 10x100 and Z is 100x1000 then the difference between
(X @ Y) @ Z
and
X @ (Y @ Z)
is about 1*10*100 + 1*100*1000 = 101,000 multiply/add operations for the first versus 10*100*1000 + 1*10*1000 = 1,010,000 operations for the second. Though these have the same result (ignoring rounding errors), the second version will be about 10x slower!
As pointed out by @Szymon Maszke, pytorch tensors also support the @ operator, so you can still use
xmat @ ymat @ zmat
in pytorch.
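If you want to convince yourself numerically, here is a quick sanity-check sketch (double precision is used so the two orders agree to default tolerances, and the shapes are borrowed from the example above):
import torch
X = torch.randn(1, 10, dtype=torch.double)
Y = torch.randn(10, 100, dtype=torch.double)
Z = torch.randn(100, 1000, dtype=torch.double)
left = (X @ Y) @ Z                                                 # fewer operations for these shapes
right = X @ (Y @ Z)
print(torch.allclose(left, right))                                 # True: associativity holds up to rounding
print(torch.allclose(left, torch.matmul(torch.matmul(X, Y), Z)))   # True: @ and torch.matmul agree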

Should I use tf.add or + to add two tensors in Tensorflow?

I am using Tensorflow 2.0 for Python 3.
Suppose I have two tensor variables, x and y, and I want to compute their element-wise sum x + y. Should I just write x + y, or tf.add(x, y)? If they are not equivalent, when should I use one or the other?
In my understanding they are equivalent: the + operator just invokes the __add__ magic method, which performs the same element-wise addition as tf.add.
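A quick eager-mode check (TensorFlow 2.x) is consistent with that understanding; both spellings give the same element-wise result:
import tensorflow as tf
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([10.0, 20.0, 30.0])
print(tf.reduce_all(x + y == tf.add(x, y)).numpy())   # True
print((x.__add__(y)).numpy())                          # same values as x + y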
