If I run the code:
import torch
x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
PyTorch spits the error "Trying to backward through the graph a second time" at me. My understanding is that calling the loss-calculation line again doesn't actually change the computational graph, which is why I get this error. However, when I run this code:
import torch
x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
it works fine (without error), and I don't understand why. In either case I haven't made any change to the computational graph, have I?
This is a good question: understanding it is key to fully grasping this part of PyTorch, which matters when dealing with complex setups involving multiple or partial backward passes.
In both examples your computational graph is:
y ---------------------------->|
b ----------->|                |
w ------->|   |                |
x --> x @ w + b = z --> BCE(z, y) = loss
However, the "computational graph" as we call it is just a representation of the dependencies that exist in the computation of that result. The way this result is tied to the tensors that lead to the final computation, i.e. the intermediate results of the graph. When you compute loss, a link remains between loss and all other tensors, this is needed in order to compute the backward pass.
First scenario
In your first example you compute loss, which by itself creates a "computational graph". Notice the grad_fn attribute appearing on your loss variable: this is the callback function used to navigate back up the graph. In your case, F.binary_cross_entropy_with_logits outputs a tensor with grad_fn=<BinaryCrossEntropyWithLogitsBackward>. That said, you successfully compute the backward pass by calling backward(); doing so backpropagates up the graph using the grad_fn functions and updates the parameters' grad attributes. You then define loss again using the same z, the one that is still tied to the previous graph. You're essentially going from the computational graph above to the following one:
y ---------------------------->|
b ----------->|                |
w ------->|   |                |
x --> x @ w + b = z --> BCE(z, y) = loss
                    \--> BCE(z, y) = loss   # 2nd definition of loss
The second definition of loss overwrites the previous value of loss, yes. However, it won't affect the first portion of the graph, which still exists: as explained above, z is still tied to the initial tensors x, w, and b.
By default, during a backward pass, the intermediate activations saved for computing gradients are freed. This means you won't be able to perform a second pass over the same graph. To sum up your first example: the second loss.backward() goes through the new loss's grad_fn, then reaches the initial z, whose saved activations have already been freed. This results in the error you've encountered:

Trying to backward through the graph a second time
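If you genuinely need two backward passes through the same graph, you can ask PyTorch to keep the saved activations alive. A minimal sketch (my addition, not part of the original question):

import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# retain_graph=True keeps the saved activations, so a second pass is allowed
loss.backward(retain_graph=True)
loss.backward()  # no error; gradients accumulate into w.grad and b.grad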
Second scenario
In the second example, you redefine the whole network by recomputing z from the leaf tensor x, and consequently loss from the intermediate output z and the leaf tensor y.
Conceptually, the state of the computation graphs is:
y ---------------------------->|
b ----------->|                |
w ------->|   |                |
x --> x @ w + b = z --> BCE(z, y) = loss
 \-> x @ w + b = z --> BCE(z, y) = loss   # 2nd definition of loss
This means that calling loss.backward() the first time does a backward pass on the initial graph. Then, after having redefined both z and loss, you end up creating a new graph altogether: the second branch of the illustration above. The second backward pass works because you're no longer operating on the same graph.
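One thing worth noting (my addition): gradients from the two passes accumulate in .grad. Since nothing updates w and b between the passes, the two graphs compute the same thing, and w.grad ends up holding exactly twice the single-pass gradient:

import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
torch.nn.functional.binary_cross_entropy_with_logits(z, y).backward()
first = w.grad.clone()

z = torch.matmul(x, w) + b  # rebuilds the graph from the leaf tensors
torch.nn.functional.binary_cross_entropy_with_logits(z, y).backward()
print(torch.allclose(w.grad, 2 * first))  # True: the second pass accumulated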
Related
In the simple code below, I perform a linear operation on an input tensor of ones and compute its binary cross-entropy loss, using a vector of zeros as the expected output.
When computing the gradient of the loss with respect to w, the rows are the same and equal to the gradient with respect to b. This is counter-intuitive since w and b have random values. What is the reason?
import torch

n_input, n_output = 5, 3
x = torch.ones(n_input)
y = torch.zeros(n_output) # expected output
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)
z = torch.matmul(x,w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
print(w.grad)
print(b.grad)
Output:
tensor([[0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959]])
tensor([0.2179, 0.4337, 0.1959])
It's because your input is symmetric.

Look at the issue from the point of view of a single perceptron (you have 3 of them in your setup): each input is 1.0, so every weight feeding a given output neuron sees the same input and therefore receives the same gradient; which input a weight is attached to doesn't matter, since there is a 1.0 everywhere.

If you diversify the input, everything works as expected:
n_input, n_output = 5, 3
x = torch.randn(n_input)
y = torch.ones(n_output)/2. # expected output
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
print(w.grad)
print(b.grad)
tensor([[-0.1939,  0.1657, -0.2501],
        [ 0.0561, -0.0480,  0.0724],
        [-0.3162,  0.2703, -0.4079],
        [ 0.0947, -0.0809,  0.1221],
        [-0.0140,  0.0120, -0.0181]])
tensor([-0.1263,  0.1080, -0.1630])
You have a single data point with an input feature size of 5. If you look at the operations performed, you have z = x @ w + b, followed by a binary cross-entropy on logits against a null label. The binary cross-entropy is defined by:
bce = -[y_true*log(σ(y_pred)) + (1 - y_true)*log(1 - σ(y_pred))]
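Differentiating this with respect to the logit gives, for each component,

d(bce)/dz = σ(z) - y_true

and since the loss above uses the default mean reduction over its three components, each element of dL/dz equals (σ(z_i) - y_i)/3 here.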
The gradient of the loss with respect to z, the partial derivative dL/dz, consists of three elements (the same size as z); let's call them [dz1, dz2, dz3].
To compute the gradients of the weight parameter w and the bias parameter b we have the following:
dL/dw = x.T @ dL/dz
dL/db = dL/dz (with a shape change)
Therefore b.grad is simply
[dz1, dz2, dz3]
And, since x is made up of ones, x.T @ dL/dz ends up being a matrix whose five rows are all equal to dL/dz:
[[dz1, dz2, dz3],
 [dz1, dz2, dz3],
 [dz1, dz2, dz3],
 [dz1, dz2, dz3],
 [dz1, dz2, dz3]]
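You can verify this numerically. A small sketch (my addition); it relies on the fact that, with the default mean reduction, dL/dz = (σ(z) - y)/n_output:

import torch

n_input, n_output = 5, 3
x = torch.ones(n_input)
y = torch.zeros(n_output)
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

dz = (torch.sigmoid(z.detach()) - y) / n_output    # dL/dz computed by hand
print(torch.allclose(b.grad, dz))                  # True
print(torch.allclose(w.grad, torch.outer(x, dz)))  # True: every row equals dz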
I started watching a tutorial on PyTorch and I am learning the concept of logistic regression.
I tried it using some stock data that I had. I have inputs, which contains two features, trade_quantity and trade_value, and targets, which contains the corresponding stock prices.
inputs = torch.tensor([[182723838.00, 2375432.00],
                       [185968153.00, 2415558.00],
                       [181970093.00, 2369140.00],
                       [221676832.00, 2811589.00],
                       [339785916.00, 4291782.00],
                       [225855390.00, 2821301.00],
                       [151430199.00, 1889032.00],
                       [122645372.00, 1552998.00],
                       [129015052.00, 1617158.00],
                       [121207837.00, 1532166.00],
                       [139554705.00, 1789392.00]])
targets = torch.tensor([[76.90],
                        [76.90],
                        [76.90],
                        [80.70],
                        [78.95],
                        [79.60],
                        [80.05],
                        [78.90],
                        [79.40],
                        [78.95],
                        [77.80]])
I defined the model function, the loss as the mean square error, and tried to run it a few times to get some predictions. Here's the code:
def model(x):
    return x @ w.t() + b

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

preds = model(inputs)
loss = mse(preds, targets)
loss.backward()

with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()
I am using Jupyter for this and ran the last part of the code a few times, after which the predictions come out as:
tensor([[inf],
        [inf],
        [inf],
        [inf],
        [inf],
        [inf],
        [inf],
        [inf],
        [inf],
        [inf],
        [inf]], grad_fn=<AddBackward0>)
If I run it a few more times, the predictions become nan. Can you please tell me why this is happening?
To me, this looks more like linear regression than logistic regression. You are trying to fit a linear model to your data. That's different from a binary classification task, where you would need a special kind of activation function (a sigmoid, for instance) so that the output lies between 0 and 1.
In this particular instance you want to solve a 2D linear problem, given input x of shape (batch, 2) (the two features being x1 = trade_quantity and x2 = trade_value) and target y of shape (batch, 1) (y being the stock price).
So the objective is to find the best w and b matrices (weight matrix and bias column) so that x @ w + b is as close to y as possible, according to your criterion, the mean square error.
I would recommend normalizing your data so it stays in a [0, 1] range. You can do so by measuring the minimum and maximum values of the inputs and targets:
inputs_min, inputs_max = inputs.min(axis=0).values, inputs.max(axis=0).values
targets_min, targets_max = targets.min(axis=0).values, targets.max(axis=0).values
Then apply the transformation:
x = (inputs - inputs_min)/(inputs_max - inputs_min)
y = (targets - targets_min)/(targets_max - targets_min)
Try changing your learning rate and have it run for multiple epochs.
lr = 1e-2
for epoch in range(100):
    preds = model(x)
    loss = mse(preds, y)
    loss.backward()
    with torch.no_grad():
        w -= lr*w.grad
        b -= lr*b.grad
        w.grad.zero_()
        b.grad.zero_()
I use a (1, 2) randomly initialized matrix for w (and a (1,) matrix for b):
w = torch.rand(1, 2)
w.requires_grad = True
b = torch.rand(1)
b.requires_grad = True
And got the following training loss over 100 epochs:

[plot of the training-loss curve over 100 epochs, not reproduced here]
To find the right hyperparameters, it's better to have a validation set. This set gets normalized with the statistics of the train set (here, the min and max values) and is used to evaluate performance at the end of each epoch on data that is 'unknown' to the model. The same goes for your test set, if you have one.
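For instance (a sketch; val_inputs and val_targets are hypothetical held-out tensors, and inputs_min/inputs_max etc. are the train statistics computed above):

# normalize the held-out set with the *train* statistics, never its own
x_val = (val_inputs - inputs_min) / (inputs_max - inputs_min)
y_val = (val_targets - targets_min) / (targets_max - targets_min)

with torch.no_grad():
    val_loss = mse(model(x_val), y_val)  # evaluate without building a graph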
I am wondering if we are allowed to change the contents of tensors after they are passed to the loss functions in PyTorch. For example:
x = torch.zeros(1000)
y = torch.zeros(1000)
output = net(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()
After we have done this, can we change the contents of y and output without ill-effect? For example:
y[0] = 990
output[0] = 1000
If I do this after one mini-batch but keep feeding the network more mini-batches, will this cause issues?
I am not sure, because the tensors may still be referenced internally by the computational graph.
Using PyTorch, I would like to calculate the Hessian vector product, where the Hessian is the second-derivative matrix of the loss function of some neural net, and the vector will be the vector of gradients of that loss function.
I know how to calculate the Hessian vector product for a regular function thanks to this post. However, I am running into trouble when the function is the loss function of a neural network, because the parameters are packaged into a module, accessible via its parameters() method, and not as a single torch tensor.
I want to do something like this (doesn't work):
### a simple neural network
linear = nn.Linear(10, 20)
x = torch.randn(1, 10)
y = linear(x).sum()
### compute the gradient and make a copy that is detached from the graph
grad = torch.autograd.grad(y, linear.parameters(), create_graph=True)
v = grad.clone().detach()
### compute the Hessian vector product
z = grad @ v
z.backward()
In analogy to this (which does work):
x = torch.tensor([1., 1.], requires_grad=True)
f = 3*x[0]**2 + 4*x[0]*x[1] + x[1]**2
grad, = torch.autograd.grad(f, x, create_graph=True)
v = grad.clone().detach()
z = grad @ v
z.backward()
This post addresses a similar (possibly the same?) issue, but I don't understand the solution.
You say it doesn't work but don't show the error you get; that is probably why you haven't received any answers. The issue comes from the signature of torch.autograd.grad:
torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
outputs and inputs are expected to be sequences of tensors. But you use just a tensor as outputs.
What this is saying is that you should pass a sequence of tensors, so pass [y] instead of y.
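Building on that, here is a sketch of a Hessian-vector product over module parameters (my adaptation, not a canonical recipe): the per-parameter gradients come back as a tuple, so they are flattened into one vector before being dotted with v, and a quadratic output is used so the Hessian is non-zero:

import torch
import torch.nn as nn

linear = nn.Linear(10, 20)
x = torch.randn(1, 10)
y = linear(x).pow(2).sum()   # quadratic in the parameters, so H != 0

# grads is a tuple (one tensor per parameter); flatten it into one vector
grads = torch.autograd.grad([y], list(linear.parameters()), create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])
v = flat_grad.detach()       # the vector in the product, here v = g

z = flat_grad @ v            # scalar g^T v
z.backward()                 # each parameter's .grad now holds its block of H v
print(linear.weight.grad.shape, linear.bias.grad.shape)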
mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
gradients, variables = zip(*optimizer.compute_gradients(mean_sqr))
opt = optimizer.apply_gradients(list(zip(gradients, variables)))
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for j in range(TRAINING_EPOCHS):
    sess.run(opt, feed_dict={x: batch_xs, y_: batch_xs})
I don't clearly understand what compute_gradients returns. Does it return sum(dy/dx) for the x values assigned by batch_xs, with apply_gradients then performing an update such as:

theta <- theta - LEARNING_RATE * 1/m * gradients?

Or does it already return the average of the gradients, summed over each x value in the given batch, i.e. sum(dy/dx) * 1/m, where m is the batch size?
compute_gradients(a, b) returns d[ sum a ] / db. So in your case it returns d mean_sqr / d theta, where theta is the set of all variables. There is no "dx" in this equation; you are not computing gradients with respect to the inputs. So what happens to the batch dimension? You remove it yourself in the definition of mean_sqr:
mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
thus (I am assuming y is 1D for simplicity)
d[ mean_sqr ] / d theta = d[ 1/M SUM_i=1^M (pred(x_i) - y_i)^2 ] / d theta
                        = 1/M SUM_i=1^M d[ (pred(x_i) - y_i)^2 ] / d theta
so you are in control of whether the gradient sums over the batch, takes the mean, or does something different: if you defined mean_sqr using reduce_sum instead of reduce_mean, the gradients would be the sum over the batch, and so on.
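To see this concretely, here is a small sketch using the same TF1 graph API (the variable names are my own): the gradient of the reduce_sum loss is exactly M times the gradient of the reduce_mean loss:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
y_ = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable([[2.0]])
y = tf.matmul(x, w)

grad_mean = tf.gradients(tf.reduce_mean(tf.pow(y_ - y, 2)), w)[0]
grad_sum = tf.gradients(tf.reduce_sum(tf.pow(y_ - y, 2)), w)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: [[1.0], [2.0], [3.0], [4.0]], y_: [[0.0]] * 4}
    gm, gs = sess.run([grad_mean, grad_sum], feed_dict=feed)
    print(gs / gm)  # ~4.0, i.e. the batch size M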
On the other hand, apply_gradients simply "applies the gradients"; the exact update rule is optimizer-dependent. For GradientDescentOptimizer it would be

theta <- theta - learning_rate * gradients(theta)
For the Adam optimizer that you are using, the update equation is more complex, of course.
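For reference (from the Adam paper; shown here only because the answer alludes to it), the Adam update for a parameter theta with gradient g at step t is:

m <- beta1 * m + (1 - beta1) * g
v <- beta2 * v + (1 - beta2) * g^2
m_hat = m / (1 - beta1^t),  v_hat = v / (1 - beta2^t)
theta <- theta - learning_rate * m_hat / (sqrt(v_hat) + epsilon)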
Note, however, that tf.gradients behaves more like "backprop" than a true gradient in the mathematical sense, meaning that it follows the graph dependencies and does not recognise dependencies that run in the "opposite" direction.