Can we get the Jacobian of functions with vector inputs in PyTorch?

My goal is to understand how to use torch.autograd.grad.
Documentation gives a description as
torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False, is_grads_batched=False)
Computes and returns the sum of gradients of outputs with respect to the inputs.
so I made this derivative function:
import torch

def derivative(y, t):
    return torch.autograd.grad(y, t)
It can return the gradient of y if y is a scalar, or the sum of the gradients of each component of y w.r.t. the vector t:
def f(x):
    return (x**2).sum()

x = torch.tensor([1, 2, 3], dtype=torch.float, requires_grad=True)
derivative(f(x), x)
# returns 2*x

def f(x):
    return (x[0], x[1], x[2], x[0])

x = torch.tensor([1, 2, 3], dtype=torch.float, requires_grad=True)
derivative(f(x), x)
# returns (tensor([2., 1., 1.]),)
But it doesn't work when y is a (non-scalar) tensor function of t, and I can't get the full Jacobian this way.
Could you show how to get gradients of tensor outputs w.r.t. tensor inputs, or how to get the Jacobian?
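For reference, one way (a minimal sketch, assuming a reasonably recent PyTorch) is torch.autograd.functional.jacobian, which computes the full Jacobian in one call; the same matrix can be built row by row with one-hot grad_outputs vectors:

import torch

def f(x):
    return x ** 2  # vector-valued output

x = torch.tensor([1., 2., 3.], requires_grad=True)

# Full Jacobian in one call.
J = torch.autograd.functional.jacobian(f, x)
# tensor([[2., 0., 0.],
#         [0., 4., 0.],
#         [0., 0., 6.]])

# Equivalent row-by-row construction: each one-hot grad_outputs vector
# selects one output component, giving one row of the Jacobian.
y = f(x)
rows = [torch.autograd.grad(y, x, grad_outputs=torch.eye(3)[i], retain_graph=True)[0]
        for i in range(3)]
J_rows = torch.stack(rows)  # same matrix as J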

Related

Error: "One of the differentiated Tensors appears to not have been used in the graph"

I am trying to compute the gradient of y_hat with respect to x (y_hat is the sum of the gradients of the model output with respect to x), but it gives me the error "One of the differentiated Tensors appears to not have been used in the graph". This is the code:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.weight1 = torch.nn.Parameter(torch.tensor([[.2, .5, .9], [1.0, .3, .5], [.3, .2, .7]]))
        self.weight2 = torch.nn.Parameter(torch.tensor([2.0, 1.0, .4]))

    def forward(self, x):
        out = F.linear(x, self.weight1.T)
        out = F.linear(out, self.weight2.T)
        return out

model = Model()
x = torch.tensor([[0.1, 0.7, 0.2]])
x = x.requires_grad_()
output = model(x)
y_hat = torch.sum(torch.autograd.grad(output, x, create_graph=True)[0])
torch.autograd.grad(y_hat, x)
I think x should be in the computational graph, so I don't know why it gives me this error. Any thoughts would be appreciated!
Because this function is effectively y = x @ b, there is no x left in the result after the first derivative, and that's why we can't take the second derivative.
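For illustration, here is a hypothetical variant of the model above with a nonlinearity between the two linear layers (a sketch, not the original poster's code): the first gradient then still depends on x, so the second autograd.grad call succeeds.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonlinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight1 = nn.Parameter(torch.tensor([[.2, .5, .9], [1.0, .3, .5], [.3, .2, .7]]))
        self.weight2 = nn.Parameter(torch.tensor([2.0, 1.0, .4]))

    def forward(self, x):
        out = torch.tanh(F.linear(x, self.weight1.T))  # nonlinearity keeps x in the graph
        return F.linear(out, self.weight2.T)

model = NonlinearModel()
x = torch.tensor([[0.1, 0.7, 0.2]], requires_grad=True)
output = model(x)
y_hat = torch.sum(torch.autograd.grad(output, x, create_graph=True)[0])
print(torch.autograd.grad(y_hat, x))  # no longer raises the "not used in the graph" error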

How to implement pair-wise calculation of attention within a batch?

Suppose I now have the following code to calculate source-target attention for two variables, x and y:
import math
from typing import Optional

import numpy as np
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    """Multi-Head Attention layer

    :param int n_head: the number of heads
    :param int n_feat: the number of features
    :param float dropout_rate: dropout rate
    """

    def __init__(self, n_head: int, n_feat: int, dropout_rate: float):
        super(MultiHeadedAttention, self).__init__()
        assert n_feat % n_head == 0
        self.d_k = n_feat // n_head
        self.h = n_head
        self.linear_q = nn.Linear(n_feat, n_feat)
        self.linear_k = nn.Linear(n_feat, n_feat)
        self.linear_v = nn.Linear(n_feat, n_feat)
        self.linear_out = nn.Linear(n_feat, n_feat)
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Compute 'Scaled Dot Product Attention'

        :param torch.Tensor query: (batch, x_len, size)
        :param torch.Tensor key: (batch, y_len, size)
        :param torch.Tensor value: (batch, y_len, size)
        :param torch.Tensor mask: (batch, x_len, y_len)
        :param torch.nn.Dropout dropout:
        :return torch.Tensor: attended and transformed `value` (batch, x_len, depth)
            weighted by the query-dot-key attention (batch, head, x_len, y_len)
        """
        n_batch = query.size(0)
        q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
        k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
        v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
        q = q.transpose(1, 2)  # (batch, head, x_len, d_k)
        k = k.transpose(1, 2)  # (batch, head, y_len, d_k)
        v = v.transpose(1, 2)  # (batch, head, y_len, d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(
            self.d_k
        )  # (batch, head, x_len, y_len)
        if mask is not None:
            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, x_len, y_len)
            mask = mask.to(device=scores.device)
            scores = scores.masked_fill_(mask, -np.inf)
            attn = torch.softmax(scores, dim=-1).masked_fill(
                mask, 0.0
            )  # (batch, head, x_len, y_len)
        else:
            attn = torch.softmax(scores, dim=-1)  # (batch, head, x_len, y_len)
        p_attn = self.dropout(attn)
        x = torch.matmul(p_attn, v)  # (batch, head, x_len, d_k)
        x = (
            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
        )  # (batch, x_len, depth)
        return self.linear_out(x)  # (batch, x_len, depth)
So this class calculates the attention for batch size = B pairs (x, y)_i and gives an output of dim (batch, x_len, depth). So far so good.
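For reference, a quick usage sketch with hypothetical dimensions (assuming the MultiHeadedAttention class above is in scope), just to make the shape contract concrete:

import torch

B, x_len, y_len, n_feat, n_head = 4, 7, 9, 64, 8
mha = MultiHeadedAttention(n_head=n_head, n_feat=n_feat, dropout_rate=0.1)

x = torch.randn(B, x_len, n_feat)                     # queries
y = torch.randn(B, y_len, n_feat)                     # keys and values
mask = torch.ones(B, x_len, y_len, dtype=torch.bool)  # all positions visible

out = mha(x, y, y, mask)
print(out.shape)  # torch.Size([4, 7, 64])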
The question is: what if I wanted to extend this class to calculate NOT ONLY (x1, y1), (x2, y2), ..., but also all combinations of x and y, i.e. (x1, y2), (x1, y3), ..., within the batch, so that I get an output of dim (batch, batch, x_len, depth) WITHOUT LOOPING? How would you implement this? Any recommendation, suggestion, or example is appreciated.
EDITED
I just came up with an idea which does the desired job at the expense of extra memory use: simply copy x and y along the batch dimension so that they represent all the pairs of x_i and y_j. Specifically:
b = torch.tensor(list(range(batch_size)))
comb = torch.cartesian_prod(b, b)
x = x[comb[:, 0], :, :]
y = y[comb[:, 1], :, :]
and then, after the calculation, view or reshape the first dimension so that the output has dim = (batch_size, batch_size, x_len, depth).
I have tested this on a toy example and am quite sure it does the job.
Unfortunately, for my case it runs CUDA out of memory.
What would you do in this situation? Should I give up on parallelism and just use a loop to make it work?
If I understand you correctly, you might want to check out torch.cdist, which is a torch implementation of pairwise distances, similar to scipy.spatial.distance.cdist. You might have to do some tweaking of your tensor dimensions, as described in the torch.cdist documentation.
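For what it's worth, a minimal sketch of torch.cdist itself (pairwise distances rather than attention, but the same all-pairs batching pattern; the shapes are hypothetical):

import torch

x1 = torch.randn(3, 5, 8)         # (B, P, M)
x2 = torch.randn(3, 7, 8)         # (B, R, M)
dists = torch.cdist(x1, x2, p=2)  # (B, P, R): distance between every x1 row and every x2 row
print(dists.shape)                # torch.Size([3, 5, 7])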

What is causing tensorflow retracing?

Here is my code for factorization.
I am trying to run part of my iterate function on TPUs and I am getting a retracing warning. If I am making an obvious mistake, I am sorry. update_xu and update_yi are mostly matrix multiplication and matrix inversion, hence I try to run them inside strategy.scope(). Eventually there seems to be some kind of memory leak or something similar.
import tensorflow as tf
import matplotlib.pyplot as plt

# strategy, eps, train_data and test_data are defined elsewhere in the notebook.

def loss(R, X, Y, C, lmda_x, lmda_y):
    '''
    returns the MSE of the weighted least squares plus L2 regularisation error
    '''
    Error = R - tf.matmul(X, Y, transpose_a=True)
    Error = Error * Error
    Error = C * Error
    Reg = 0 + tf.math.reduce_sum(X * X * lmda_x) + tf.math.reduce_sum(Y * Y * lmda_y)
    return Reg + tf.math.reduce_sum(Error) * 1 / (Error.shape[0] * Error.shape[1])

def update_xu(Ru, Y, Cuser, lmda):
    '''
    input
        column vector Ru,
        k x m matrix Y,
        m x m matrix Cuser and
        lmda_x
    output
        the updated user row vector (xu), obtained by holding the Y matrix constant
    '''
    Ru = tf.reshape(Ru, shape=[Y.shape[1], 1])
    C = Cuser @ Ru
    inverse = tf.linalg.inv(Y @ Cuser @ tf.transpose(Y) + lmda * tf.eye(Y.shape[0]))
    ans = inverse @ Y @ C
    return tf.reshape(ans, [Y.shape[0]])

def update_yi(Ri, X, Citem, lmda):
    '''
    input
        column vector Ri,
        k x n matrix X,
        n x n matrix Citem and
        lmda_y
    output
        the updated item row vector (yi), obtained by holding the X matrix constant
    '''
    Ri = tf.reshape(Ri, shape=[X.shape[1], 1])
    C = Citem @ Ri
    inverse = tf.linalg.inv(X @ Citem @ tf.transpose(X) + lmda * tf.eye(X.shape[0]))
    ans = inverse @ X @ C
    return tf.reshape(ans, [X.shape[0]])

def iterate(R, X, Y, C, lmda_x, lmda_y, epochs):
    '''
    returns approximately updated X and Y such that R = X^T Y, using the WALS algorithm
    '''
    with strategy.scope():
        for _ in range(epochs):
            Xtt = tf.vectorized_map(lambda x: update_xu(x[0], Y, tf.linalg.diag(x[1]), lmda_x), (R, C))
            #Xtt = tf.map_fn(lambda x: update_xu(x[0], Y, tf.linalg.diag(x[1]), lmda_x), (R, C), dtype=tf.TensorSpec([Y.shape[0]], dtype=tf.float32), parallel_iterations=6)
            print(Xtt.shape)
            X = tf.transpose(Xtt)
            R, C = tf.transpose(R), tf.transpose(C)
            Ytt = tf.vectorized_map(lambda x: update_yi(x[0], X, tf.linalg.diag(x[1]), lmda_y), (R, C))
            #Ytt = tf.map_fn(lambda x: update_yi(x[0], X, tf.linalg.diag(x[1]), lmda_y), (R, C), dtype=tf.TensorSpec([X.shape[0]], dtype=tf.float32), parallel_iterations=6)
            R, C = tf.transpose(R), tf.transpose(C)
            print(Ytt.shape)
            Y = tf.transpose(Ytt)
    return X, Y

def init_weights(R):
    chk = tf.where(
        tf.math.abs(R) > eps, 1., 0.
    )
    cnt = tf.constant(
        tf.math.reduce_sum(chk, axis=0), shape=[1, R.shape[1]]
    )
    cnt = tf.repeat(
        cnt, repeats=R.shape[0], axis=0
    )
    cnt = tf.reshape(cnt, [R.shape[0], R.shape[1]])
    cnt = cnt * chk
    cnt = tf.random.normal(R.shape, mean=0.0, stddev=1.0) * cnt
    cnt = cnt + tf.random.normal([1], mean=0.0, stddev=1.0)
    return cnt

def train(R, lmda_x, lmda_y, epochs, embd):
    flag = False
    loss_train, loss_test, total = 0., 0., 0
    loss_train_list, loss_test_list, total_epochs = [], [], []
    X, Y = tf.zeros([embd, R.shape[0]]), tf.zeros([embd, R.shape[1]])
    C = init_weights(train_data)
    for epoch in epochs:
        if flag == False:
            X, Y = tf.random.normal([embd, R.shape[0]], mean=0.0, stddev=1.0), tf.random.normal([embd, R.shape[1]], mean=0.0, stddev=1.0)
            X, Y = iterate(R, X, Y, C, lmda_x, lmda_y, epoch)
            flag = True
        else:
            X, Y = iterate(train_data, X, Y, C, lmda_x, lmda_y, epoch)
        total += epoch
        loss_train = loss(train_data, X, Y, C, lmda_x, lmda_y)
        loss_test = loss(test_data, X, Y, C, lmda_x, lmda_y)
        loss_train_list.append(loss_train)
        loss_test_list.append(loss_test)
        total_epochs.append(total)
        print(loss_train, loss_test)
    plt.plot(total_epochs, loss_train_list, label="Training", linewidth=5)
    plt.plot(total_epochs, loss_test_list, label="Test", linewidth=1)
    plt.xticks(fontsize=10)
    plt.title(str(embd) + ', ' + str(lmda_x) + ',' + str(lmda_y))
    plt.yticks(fontsize=10)
    plt.xlim(0, 200)
    plt.ylim(0, 1000000000)
    plt.xlabel('iterations', fontsize=30)
    plt.ylabel('MSE', fontsize=30)
    plt.legend(loc='best', fontsize=20)
    return X, Y

def grid_search(epochs, embds, lmdas_x, lmdas_y, train_data, test_data):
    for lmda_x in lmdas_x:
        for lmda_y in lmdas_y:
            for embd in embds:
                lmda_x, lmda_y, epochs, embd = tf.convert_to_tensor(lmda_x, dtype=tf.float32), tf.convert_to_tensor(lmda_y, dtype=tf.float32), tf.convert_to_tensor(epochs, dtype=tf.int64), tf.convert_to_tensor(embd, dtype=tf.int64)
                X, Y = train(train_data, lmda_x, lmda_y, epochs, embd)
Some errors when calling grid_search:
WARNING:tensorflow:11 out of the last 11 calls to <function pfor.<locals>.f at 0x7f4dd38f7b90> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
WARNING:tensorflow:11 out of the last 11 calls to <function pfor.<locals>.f at 0x7f4dd38f7b90> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
(943, 1)
WARNING:tensorflow:11 out of the last 11 calls to <function pfor.<locals>.f at 0x7f4dd38f7560> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
WARNING:tensorflow:11 out of the last 11 calls to <function pfor.<locals>.f at 0x7f4dd38f7560> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
(1682, 1)
(943, 1)
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f4e018c73d0>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2778, in while_loop
return result File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2726, in <lambda>
body = lambda i, lv: (i + 1, orig_body(*lv)) File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/map_fn.py", line 507, in compute
return (i + 1, tas) File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/map_fn.py", line 505, in <listcomp>
ta.write(i, value) for (ta, value) in zip(tas, result_value_batchable) File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 249, in wrapped
error_in_function=error_in_function)
==================================
InvalidArgumentError Traceback (most recent call last)
<ipython-input-58-a2922620d8dd> in <module>()
7 epochs = [5, 100] #2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]#, 588, 299, 300, 200]
8 #X, Y, C = train(train_data.shape[0], train_data.shape[1], train_data, 0, 0, 2, 2)
----> 9 grid_search(epochs, embds, lmdas, lmdas, train_data, test_data)
3 frames
<ipython-input-57-8cc0dfdb85de> in grid_search(epochs, embds, lmdas_x, lmdas_y, train_data, test_data)
5 plt.figure(figsize = (10,10))
6 lmda_x , lmda_y , epochs, embd = tf.convert_to_tensor(lmda_x, dtype = tf.float32), tf.convert_to_tensor(lmda_y, dtype =tf.float32), tf.convert_to_tensor(epochs, dtype = tf.int64), tf.convert_to_tensor(embd, dtype =tf.int64)
----> 7 X, Y = train(train_data, lmda_x, lmda_y, epochs, embd)
8
9
<ipython-input-56-d70632663530> in train(R, lmda_x, lmda_y, epochs, embd)
88 flag = True
89 else:
---> 90 X, Y = iterate(train_data, X, Y, C, lmda_x, lmda_y, epoch)
91 total += epoch
92 loss_train = loss(train_data, X, Y, C, lmda_x, lmda_y)
<ipython-input-56-d70632663530> in iterate(R, X, Y, C, lmda_x, lmda_y, epochs)
50 Xtt = tf.vectorized_map(lambda x: update_xu(x[0], Y, tf.linalg.diag(x[1]), lmda_x), (R, C))
51 #Xtt = tf.map_fn(lambda x: update_xu(x[0], Y, tf.linalg.diag(x[1]), lmda_x), (R, C), dtype = tf.TensorSpec([Y.shape[0]], dtype = tf.float32), parallel_iterations=6)
---> 52 print(Xtt.shape)
53 X = tf.transpose(Xtt)
54 R, C = tf.transpose(R), tf.transpose(C)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in shape(self)
1173 # `_tensor_shape` is declared and defined in the definition of
1174 # `EagerTensor`, in C.
-> 1175 self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
1176 except core._NotOkStatusException as e:
1177 six.raise_from(core._status_to_exception(e.code, e.message), None)
InvalidArgumentError: {{function_node __inference_f_5094764}} Input is not invertible.
[[{{node loop_body/MatrixInverse/pfor/MatrixInverse}}]]
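Not a drop-in fix for the code above, but a minimal sketch of the pattern the warning text itself recommends: define the tf.function once at module level (not inside a loop), give it a relaxed input_signature so different matrix sizes reuse one trace, and pass tensors rather than Python numbers. The function name and body here are hypothetical, only an analogue of update_xu/update_yi.

import tensorflow as tf

@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, None], dtype=tf.float32),  # R
    tf.TensorSpec(shape=[None, None], dtype=tf.float32),  # Y
    tf.TensorSpec(shape=[], dtype=tf.float32),            # lmda
])
def ridge_update(R, Y, lmda):
    # One regularised least-squares solve, analogous to update_xu / update_yi.
    gram = tf.matmul(Y, Y, transpose_b=True) + lmda * tf.eye(tf.shape(Y)[0])
    return tf.linalg.solve(gram, tf.matmul(Y, R, transpose_b=True))

# Repeated calls with different-shaped float32 inputs now reuse a single trace
# instead of retracing on every call inside the training loop.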

PyTorch autograd.grad: how to write the parameters for multiple outputs?

In the documentation of torch.autograd.grad, it is stated that, for parameters,
parameters:
outputs (sequence of Tensor) – outputs of the differentiated function.
inputs (sequence of Tensor) – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).
I try the following:
a = torch.rand(2, requires_grad=True)
b = torch.rand(2, requires_grad=True)
c = a+b
d = a-b
torch.autograd.grad([c, d], [a, b]) #ValueError: only one element tensors can be converted to Python scalars
torch.autograd.grad(torch.tensor([c, d]), torch.tensor([a, b])) #RuntimeError: grad can be implicitly created only for scalar outputs
I would like to get gradients of a list of tensors w.r.t another list of tensors. What is the correct way to feed the parameters?
As the torch.autograd.grad documentation mentions, torch.autograd.grad computes and returns the sum of gradients of outputs w.r.t. the inputs. Since your c and d are not scalar values, grad_outputs is required.
import torch
a = torch.rand(2,requires_grad=True)
b = torch.rand(2, requires_grad=True)
a
# tensor([0.2308, 0.2388], requires_grad=True)
b
# tensor([0.6314, 0.7867], requires_grad=True)
c = a*a + b*b
d = 2*a+4*b
torch.autograd.grad([c,d], inputs=[a,b], grad_outputs=[torch.Tensor([1.,1.]), torch.Tensor([1.,1.])])
# (tensor([2.4616, 2.4776]), tensor([5.2628, 5.5734]))
Explanation:
dc/da = 2*a = [0.2308*2, 0.2388*2]
dd/da = [2.,2.]
So the first output is dc/da*grad_outputs[0]+dd/da*grad_outputs[1] = [2.4616, 2.4776]. Same calculation for the second output.
If you just want to get the gradient of c and d w.r.t. the inputs, probably you can do this:
a = torch.rand(2,requires_grad=True)
b = torch.rand(2, requires_grad=True)
a
# tensor([0.9566, 0.6066], requires_grad=True)
b
# tensor([0.5248, 0.4833], requires_grad=True)
c = a*a + b*b
d = 2*a+4*b
[torch.autograd.grad(t, inputs=[a,b], grad_outputs=[torch.Tensor([1.,1.])]) for t in [c,d]]
# [(tensor([1.9133, 1.2132]), tensor([1.0496, 0.9666])),
# (tensor([2., 2.]), tensor([4., 4.]))]
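Alternatively, on PyTorch versions where is_grads_batched is available (it already appears in the signature quoted in the first question), a sketch that builds the full Jacobian without a Python loop by batching one-hot grad_outputs:

import torch

a = torch.rand(2, requires_grad=True)
b = torch.rand(2, requires_grad=True)
c = a * a + b * b

# Each row of the identity acts as a one-hot grad_outputs vector, so row i of
# the result is the gradient of c[i]; stacked, that is the Jacobian of c.
jac_a, jac_b = torch.autograd.grad(c, (a, b), grad_outputs=torch.eye(2),
                                   is_grads_batched=True)
# jac_a == torch.diag(2 * a), jac_b == torch.diag(2 * b)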
Here you go
In the example you gave:
a = torch.rand(2, requires_grad=True)
b = torch.rand(2, requires_grad=True)
loss = a + b
As the loss is a vector with 2 elements, you can't perform the autograd operation at once.
Typically:
loss = torch.sum(a + b)
torch.autograd.grad([loss], [a, b])
This would return the correct value of gradient for the loss tensor which contains one element.
You can pass multiple scalar tensors to the outputs argument of the torch.autograd.grad method.
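A small sketch of that last point, reusing the tensors from the question: summing each output gives scalars, and autograd.grad then returns the summed gradients.

import torch

a = torch.rand(2, requires_grad=True)
b = torch.rand(2, requires_grad=True)
c = a + b
d = a - b

grads = torch.autograd.grad([c.sum(), d.sum()], [a, b])
# grads[0] = d(c.sum())/da + d(d.sum())/da = tensor([2., 2.])
# grads[1] = d(c.sum())/db + d(d.sum())/db = tensor([0., 0.])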

How to use GradientTape with AutoGraph in Tensorflow 2?

I cannot figure out how to run GradientTape code with AutoGraph in TensorFlow 2.
I want to run GradientTape code on a TPU, and I wanted to start by testing it on the CPU. The TPU code would run much faster using AutoGraph. I tried watching the input variable, and I tried passing the argument into the function that contains the GradientTape, but both failed.
I made a reproducible example here:
https://colab.research.google.com/drive/1luCk7t5SOcDHQC6YTzJzpSixIqHy_76b#scrollTo=9OQUYQtTTYIt
The code and the corresponding output is as follows:
They all start with import tensorflow as tf
x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x
dy_dx = g.gradient(y, x)
print(dy_dx)
Output: tf.Tensor(6.0, shape=(), dtype=float32)
Explanation: Using Eager Execution, the GradientTape produces the gradient.
@tf.function
def compute_me():
    x = tf.constant(3.0)
    with tf.GradientTape() as g:
        g.watch(x)
        y = x * x
    dy_dx = g.gradient(y, x)  # Will compute to 6.0
    print(dy_dx)

compute_me()
Output: Tensor("AddN:0", shape=(), dtype=float32)
Explanation: Using AutoGraph with GradientTape in TF2 results in an empty gradient
@tf.function
def compute_me_args(x):
    with tf.GradientTape() as g:
        g.watch(x)
        y = x * x
    dy_dx = g.gradient(y, x)  # Will compute to 6.0
    print(dy_dx)

x = tf.constant(3.0)
compute_me_args(x)
Output: Tensor("AddN:0", shape=(), dtype=float32)
Explanation: Passing in arguments also fails
I was expecting all of the cells to output tf.Tensor(6.0, shape=(), dtype=float32) but instead the cells using AutoGraph output Tensor("AddN:0", shape=(), dtype=float32).
It doesn't "fail", it's just that print, if used in the context of a tf.function (i.e. in graph mode) will print the symbolic tensors, and these do not have a value. Try this instead:
@tf.function
def compute_me():
    x = tf.constant(3.0)
    with tf.GradientTape() as g:
        g.watch(x)
        y = x * x
    dy_dx = g.gradient(y, x)  # Will compute to 6.0
    tf.print(dy_dx)

compute_me()
This should print 6. All you need to do is use tf.print instead, which is "smart" enough to print the actual values if available. Or, using return values:
@tf.function
def compute_me():
    x = tf.constant(3.0)
    with tf.GradientTape() as g:
        g.watch(x)
        y = x * x
    dy_dx = g.gradient(y, x)  # Will compute to 6.0
    return dy_dx

result = compute_me()
print(result)
outputs something like <tf.Tensor: id=43, shape=(), dtype=float32, numpy=6.0>. You can see that the value (6.0) is visible here as well. Use result.numpy() to just get 6.0.
