To free up memory, I was wondering if it was possible to remove intermediate tensors in the forward method of my model. Here's a minimalized example scenario:
def forward(self, input):
x1, x2 = input
x1 = some_layers(x1)
x2 = some_layers(x2)
x_conc = torch.cat((x1,x2),dim=1)
x_conc = some_layers(x_conc)
return x_conc
Basically, the model passes two tensors through two separate blocks, and then concatenates the results. Further operations are applied on that concatenated tensor. Will it affect the computation graph if I run del x1 and del x2 after creating x_conc?
PyTorch will store the x1, x2 tensors in the computation graph if you want to perform automatic differentiation later on. Also, note that deleting tensors using del operator works but you won't see a decrease in the GPU memory. Why? Because the memory is freed but not returned to the device. It is an optimization technique and from the user's perspective, the memory has been "freed". That is, the memory is available for making new tensors now.
Hence, deleting tensors is not recommended to free up GPU memory.
Related
Suppose I have my custom loss function and I want to fit the solution of some differential equation with help of my neural network. So in each forward pass, I am calculating the output of my neural net and then calculating the loss by taking the MSE with the expected equation to which I want to fit my perceptron.
Now my doubt is: should I use grad(loss) or should I do loss.backward() for backpropagation to calculate and update my gradients?
I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set the requires_grad = True for the variables w.r.t which I want to take the gradient of my loss.
So my questions are :
Does grad(loss) also requires any such explicit parameter to identify the variables for gradient computation?
How does it actually compute the gradients?
Which approach is better?
what is the main difference between the two in a practical scenario.
It would be better if you could explain the practical implications of both approaches because whenever I try to find it online I am just bombarded with a lot of stuff that isn't much relevant to my project.
TLDR; Both are two different interfaces to perform gradient computation: torch.autograd.grad is non-mutable while torch.autograd.backward is.
Descriptions
The torch.autograd module is the automatic differentiation package for PyTorch. As described in the documentation it only requires minimal change to code base in order to be used:
you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:
torch.autograd.backward (source)
torch.autograd.grad (source)
Description
Computes the sum of gradients of given tensors with respect to graph leaves.
Computes and returns the sum of gradients of outputs with respect to the inputs.
Header
torch.autograd.backward( tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)
torch.autograd.grad( outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
Parameters
- tensors – Tensors of which the derivative will be computed.- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.- retain_graph – If False, the graph used to compute the grad will be freed. [...] - inputs – Inputs w.r.t. which the gradient be will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].
- outputs – outputs of the differentiated function.- inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.- retain_graph – If False, the graph used to compute the grad will be freed. [...].
Usage examples
In terms of high-level usage, you can look at torch.autograd.grad as a non-mutable function. As mentioned in the documentation table above, it will not accumulate the gradients on the grad attribute but instead return the computed partial derivatives. In contrast torch.autograd.backward will be able to mutate the tensors by updating the grad attribute of leaf nodes, the function won't return any value. In other words, the latter is more suitable when computing gradients for a large number of parameters.
In the following, we will take two inputs (x1 and, x2), calculate a tensor y with them, and then compute the partial derivatives of the result w.r.t both inputs, i.e. dL/dx1 and dL/dx2:
>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor(0.3939, grad_fn=<UnbindBackward>),
tensor(0.7965, grad_fn=<UnbindBackward>))
Inference:
>>> y = x1**2 + 5*x2
>>> y
tensor(4.1377, grad_fn=<AddBackward0>)
Since y was computed using tensor(s) requiring gradients (i.e. with requires_grad=True) - *outside of a torch.no_grad context. It will have a grad_fn function attached. This callback is used to backpropagate onto the computation graph to compute the gradients of preceding tensor nodes.
torch.autograd.grad:
Here we provide torch.ones_like(y) as the grad_outputs.
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor(0.7879), tensor(5.))
The above output is a tuple containing the two partial derivatives w.r.t. to the provided inputs respectively in order of appearance, i.e. dL/dx1 and dL/dx2.
This corresponds to the following computation:
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs # 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs # 5
torch.autograd.backward: in contrast it will mutate the provided tensors by updating the grad of the tensors which have been used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here, we go through the same example by defining x1, x2, and y again. We call backward:
>>> # y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))
None
Then you can retrieve the gradients on x1.grad and x2.grad:
>>> x1.grad, x2.grad
(tensor(0.7879), tensor(5.))
In conclusion: both perform the same operation. They are two different interfaces to interact with the autograd library and perform gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural networks training loops to compute the partial derivative of the loss w.r.t each one of the model's parameters.
You can read more about how torch.autograd.grad works by reading through this other answer I made on: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
In addition to Ivan's answer, having torch.autograd.grad not accumulating gradients into .grad can avoid racing conditions in multi-thread scenarios.
Quoting PyTorch doc https://pytorch.org/docs/stable/notes/autograd.html#non-determinism
If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
implementation details https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h
I have a script that performs a Gatys-like neural style transfer. It uses style loss, and a total variation loss. I'm using the GradientTape() to compute my gradients. The losses that I have implemented seem to work fine, but a new loss that I added isn't being properly accounted for by the GradientTape(). I'm using TensorFlow with eager execution enabled.
I suspect it has something to do with how I compute the loss based on the input variable. The input is a 4D tensor (batch, h, w, channels). At the most basic level, the input is a floating point image, and in order to compute this new loss I need to convert it to a binary image to compute the ratio of one pixel color to another. I don't want to actually go and change the image like that during every iteration, so I just make a copy of the tensor(in numpy form) and operate on that to compute the loss. I do not understand the limitations of the GradientTape, but I believe it is "losing the thread" of how the input variable is used to get to the loss when it's converted to a numpy array.
Could I make a copy of the image tensor and perform binarizing operations & loss computation using that? Or am I asking tensorflow to do something that it just can not do?
My new loss function:
def compute_loss(self, **kwargs):
loss = 0
image = self.model.deprocess_image(kwargs['image'].numpy())
binarized_image = self.image_decoder.binarize_image(image)
volume_fraction = self.compute_volume_fraction(binarized_image)
loss = np.abs(self.volume_fraction_target - volume_fraction)
return loss
My implementation using the GradientTape:
def compute_grads_and_losses(self, style_transfer_state):
"""
Computes gradients with respect to input image
"""
with tf.GradientTape() as tape:
loss = self.loss_evaluator.compute_total_loss(style_transfer_state)
total_loss = loss['total_loss']
return tape.gradient(total_loss, style_transfer_state['image']), loss
An example that I believe might illustrate my confusion. The strangest thing is that my code doesn't have any problem running; it just doesn't seem to minimize the new loss term whatsoever. But this example won't even run due to an attribute error: AttributeError: 'numpy.float64' object has no attribute '_id'.
Example:
import tensorflow.contrib.eager as tfe
import tensorflow as tf
def compute_square_of_value(x):
a = turn_to_numpy(x['x'])
return a**2
def turn_to_numpy(arg):
return arg.numpy() #just return arg to eliminate the error
tf.enable_eager_execution()
x = tfe.Variable(3.0, dtype=tf.float32)
data_dict = {'x': x}
with tf.GradientTape() as tape:
tape.watch(x)
y = compute_square_of_value(data_dict)
dy_dx = tape.gradient(y, x) # Will compute to 6.0
print(dy_dx)
Edit:
From my current understanding the issue arises that my use of the .numpy() operation is what makes the Gradient Tape lose track of the variable to compute the gradient from. My original reason for doing this is because my loss operation requires me to physically change values of the tensor, and I don't want to actually change the values used for the tensor that is being optimized. Hence the use of the numpy() copy to work on in order to compute the loss properly. Is there any way around this? Or is shall I consider my loss calculation to be impossible to implement because of this constraint of having to perform essentially non-reversible operations on the input tensor?
The first issue here is that GradientTape only traces operations on tf.Tensor objects. When you call tensor.numpy() the operations executed there fall outside the tape.
The second issue is that your first example never calls tape.watche on the image you want to differentiate with respect to.
I would like to know how to take gradient steps for the following mathematical operation in PyTorch (A, B and C are PyTorch modules whose parameters do not overlap)
This is somewhat different than the cost function of a Generative Adversarial Network (GAN), so I cannot use examples for GANs off the shelf, and I got stuck while trying to adapt them for the above cost.
One approach I thought of is to construct two optimizers. Optimizer opt1 has the parameters for the modules A and B, and optimizer opt2 has the parameters of module C. One can then:
take a step for minimizing the cost function for C
run the network again with the same input to get the costs (and intermediate outputs) again
take a step with respect to A and B.
I am sure they must be a better way to do this with PyTorch (maybe using some detach operations), possibly without running the network again. Any help is appreciated.
Yes it is possible without going through the network two times, which is both wasting resources and wrong mathematically, since the weights have changed and so the lost, so you are introducing a delay doing this, which may be interesting but not what you are trying to achieve.
First, create two optimizers just as you said. Compute the loss, and then call backward. At this point, the gradient for the parameters A,B,C have been filled, so now you can just have to call the step method for the optimizer minimizing the loss, but not for the one maximizing it. For the later, you need to reverse the sign of the gradient of the leaf parameter tensor C.
def d(y, x):
return torch.pow(y.abs(), x + 1)
A = torch.nn.Linear(1,2)
B = torch.nn.Linear(2,3)
C = torch.nn.Linear(2,3)
optimizer1 = torch.optim.Adam((*A.parameters(), *B.parameters()))
optimizer2 = torch.optim.Adam(C.parameters())
x = torch.rand((10, 1))
loss = (d(B(A(x)), x) - d(C(A(x)), x)).sum()
optimizer1.zero_grad()
optimizer2.zero_grad()
loss.backward()
for p in C.parameters():
if p.grad is not None: # In general, C is a NN, with requires_grad=False for some layers
p.grad.data.mul_(-1) # Update of grad.data not tracked in computation graph
optimizer1.step()
optimizer2.step()
NB: I have not checked mathematically if the result is correct but I assume it is.
I am attempting to compute a linear combination of n tensors of the same dimension in Tensorflow. The scalar coefficients are Tensorflow Variables.
Since tf.scalar_mul does not generalise to multiplying a vector of tensors by a vector of scalars, I have thus far used tf.gather and performed each multiplication individually in a python for loop, and then converted the list of results to a tensor and summed them across the zeroth axis. Like so:
coefficients = tf.Variable(tf.constant(initial_value, shape=[n]))
components = []
for i in range(n):
components.append(tf.scalar_mul(tf.gather(coefficients, i), tensors[i]))
combination = tf.reduce_sum(tf.convert_to_tensor(components), axis=0)
This works fine, but does not scale well at all. My application requires computing n linear combinations, meaning I have n^2 gather and multiply operations. With large values of n the computation time is poor and the memory usage of the program is unreasonably large.
Is there a more natural way of computing a linear combination like this in Tensorflow that would be faster and less resource intensive?
Use broadcasting. Assuming coefficients has shape (n,) and tensors shape (n,...) you can simply use
coefficients[:, tf.newaxis, ...] * tensors
here, you would need to repeat tf.newaxis as many times as tensors has dimenions besides the one of size n. So e.g. if tensors has shape (n, a, b) you would use coefficients[:, tf.newaxis, tf.newaxis]
This will turn coefficients into a tensor with the same number of dimensions as tensors, but all dimensions except the first one are of size 1, so they can be broadcast to the shape of tensors.
Some alternatives:
Define coefficients as a variable with the correct number of dimensions in the first place (a little ugly in my opinion).
Use tf.reshape to reshape coefficients to (n, 1, ...) instead if you don't like the indexing syntax.
Use tf.transpose to shift the dimension of size n to the end of tensors. Then the dimensions align for broadcasting without needing to add dimensions to coefficients.
Also see the numpy docs on broadcasting -- it works essentially the same way in Tensorflow.
There is a new PyPI module called TWIT, Tensor Weighted Interpolative Transfer, that will do this fast. It is written in C for the core operations.
I am attempting to implement a Lambda layer that will produce a custom loss function. In the layer, I need to be able to compare every element in a batch to every other element in the batch in order to calculate the cost. Ideally, I want code that looks something like this:
for el_1 in zip(y_pred, y_true):
for el_2 in zip(y_pred, y_true):
if el_1[1] == el_2[1]:
# Perform a calculation
else:
# Perform a different calculation
When I true this, I get:
TypeError: TensorType does not support iteration.
I am using Keras version 2.0.2 with a Theano version 0.9.0 backend. I understand that I need to use Keras tensor functions in order to do this, but I can't figure out any tensor functions that do what I want.
Also, I am having difficulty understanding precisely what my Lambda function should return. Is it a tensor of the total cost for each sample, or is it just a total cost for the batch?
I have been beating my head against this for days. Any help is deeply appreciated.
A tensor in Keras commonly has at least 2 dimensions, the batch and the neuron/unit/node/... dimension. A dense layer with 128 units trained with a batch size of 64 would therefore yields a tensor with shape (64,128).
Your LambdaLayer processes tensors as any other layer does, plugging it in after your dense layer from before will give you a tensor with shape (64,128) to process. Processing a tensor works similar to how calculations on numpy arrays works (or any other vector processing library really): you specify one operation to broadcast over all elements in the data structure.
For example, your custom cost is the difference for each value in the batch, you would implement it like so:
cost_layer = LambdaLayer(lambda a,b: a - b)
The - operation is broadcasted over a and b and will return a suitable result provided the dimensions match. The takeaway is that you really only can specify one operation for every value. If you want to do more complex tasks, for example computations based on the value you need single operations that take two operations and apply the correct one accordingly, i.e. the switch operation.
The syntax for K.switch is
K.switch(condition, then_expression, else_expression)
For example, if you want to subtract both values when a != b but add them when they are equal, you would write:
import keras.backend as K
cost_layer = LambdaLayer(lambda a,b: K.switch(a != b, a - b, a + b))