I am attempting to compute a linear combination of n tensors of the same dimension in Tensorflow. The scalar coefficients are Tensorflow Variables.
Since tf.scalar_mul does not generalise to multiplying a vector of tensors by a vector of scalars, I have thus far used tf.gather and performed each multiplication individually in a python for loop, and then converted the list of results to a tensor and summed them across the zeroth axis. Like so:
coefficients = tf.Variable(tf.constant(initial_value, shape=[n]))
components = []
for i in range(n):
components.append(tf.scalar_mul(tf.gather(coefficients, i), tensors[i]))
combination = tf.reduce_sum(tf.convert_to_tensor(components), axis=0)
This works fine, but does not scale well at all. My application requires computing n linear combinations, meaning I have n^2 gather and multiply operations. With large values of n the computation time is poor and the memory usage of the program is unreasonably large.
Is there a more natural way of computing a linear combination like this in Tensorflow that would be faster and less resource intensive?
Use broadcasting. Assuming coefficients has shape (n,) and tensors shape (n,...) you can simply use
coefficients[:, tf.newaxis, ...] * tensors
here, you would need to repeat tf.newaxis as many times as tensors has dimenions besides the one of size n. So e.g. if tensors has shape (n, a, b) you would use coefficients[:, tf.newaxis, tf.newaxis]
This will turn coefficients into a tensor with the same number of dimensions as tensors, but all dimensions except the first one are of size 1, so they can be broadcast to the shape of tensors.
Some alternatives:
Define coefficients as a variable with the correct number of dimensions in the first place (a little ugly in my opinion).
Use tf.reshape to reshape coefficients to (n, 1, ...) instead if you don't like the indexing syntax.
Use tf.transpose to shift the dimension of size n to the end of tensors. Then the dimensions align for broadcasting without needing to add dimensions to coefficients.
Also see the numpy docs on broadcasting -- it works essentially the same way in Tensorflow.
There is a new PyPI module called TWIT, Tensor Weighted Interpolative Transfer, that will do this fast. It is written in C for the core operations.
Related
Taking a tensor with shape [4,8,12] as an example, what's the difference between the two lines:
torch.mean(x, dim=2)
torch.nn.functional.avg_pool1d(x, kernel_size=12)
With the very example you provided the result is the same, but only because you specified dim=2 and kernel_size equal to the dimensionality of the third (index 2) dimension.
But in principle, you are applying two different functions, that sometimes just happen to collide with specific choices of hyperparameters.
torch.mean is effectively a dimensionality reduction function, meaning that when you average all values across one dimension, you effectively get rid of that dimension.
On the other hand, average 1-dimensional pooling is more powerful in this regard, as it gives you a lot more flexibility in choosing kernel size, padding and stride like you would normally do when using a convolutional layer.
You can see the first function as a specific case of 1-d pooling.
Suppose I have my custom loss function and I want to fit the solution of some differential equation with help of my neural network. So in each forward pass, I am calculating the output of my neural net and then calculating the loss by taking the MSE with the expected equation to which I want to fit my perceptron.
Now my doubt is: should I use grad(loss) or should I do loss.backward() for backpropagation to calculate and update my gradients?
I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set the requires_grad = True for the variables w.r.t which I want to take the gradient of my loss.
So my questions are :
Does grad(loss) also requires any such explicit parameter to identify the variables for gradient computation?
How does it actually compute the gradients?
Which approach is better?
what is the main difference between the two in a practical scenario.
It would be better if you could explain the practical implications of both approaches because whenever I try to find it online I am just bombarded with a lot of stuff that isn't much relevant to my project.
TLDR; Both are two different interfaces to perform gradient computation: torch.autograd.grad is non-mutable while torch.autograd.backward is.
Descriptions
The torch.autograd module is the automatic differentiation package for PyTorch. As described in the documentation it only requires minimal change to code base in order to be used:
you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:
torch.autograd.backward (source)
torch.autograd.grad (source)
Description
Computes the sum of gradients of given tensors with respect to graph leaves.
Computes and returns the sum of gradients of outputs with respect to the inputs.
Header
torch.autograd.backward( tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)
torch.autograd.grad( outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
Parameters
- tensors – Tensors of which the derivative will be computed.- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.- retain_graph – If False, the graph used to compute the grad will be freed. [...] - inputs – Inputs w.r.t. which the gradient be will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].
- outputs – outputs of the differentiated function.- inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.- retain_graph – If False, the graph used to compute the grad will be freed. [...].
Usage examples
In terms of high-level usage, you can look at torch.autograd.grad as a non-mutable function. As mentioned in the documentation table above, it will not accumulate the gradients on the grad attribute but instead return the computed partial derivatives. In contrast torch.autograd.backward will be able to mutate the tensors by updating the grad attribute of leaf nodes, the function won't return any value. In other words, the latter is more suitable when computing gradients for a large number of parameters.
In the following, we will take two inputs (x1 and, x2), calculate a tensor y with them, and then compute the partial derivatives of the result w.r.t both inputs, i.e. dL/dx1 and dL/dx2:
>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor(0.3939, grad_fn=<UnbindBackward>),
tensor(0.7965, grad_fn=<UnbindBackward>))
Inference:
>>> y = x1**2 + 5*x2
>>> y
tensor(4.1377, grad_fn=<AddBackward0>)
Since y was computed using tensor(s) requiring gradients (i.e. with requires_grad=True) - *outside of a torch.no_grad context. It will have a grad_fn function attached. This callback is used to backpropagate onto the computation graph to compute the gradients of preceding tensor nodes.
torch.autograd.grad:
Here we provide torch.ones_like(y) as the grad_outputs.
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor(0.7879), tensor(5.))
The above output is a tuple containing the two partial derivatives w.r.t. to the provided inputs respectively in order of appearance, i.e. dL/dx1 and dL/dx2.
This corresponds to the following computation:
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs # 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs # 5
torch.autograd.backward: in contrast it will mutate the provided tensors by updating the grad of the tensors which have been used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here, we go through the same example by defining x1, x2, and y again. We call backward:
>>> # y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))
None
Then you can retrieve the gradients on x1.grad and x2.grad:
>>> x1.grad, x2.grad
(tensor(0.7879), tensor(5.))
In conclusion: both perform the same operation. They are two different interfaces to interact with the autograd library and perform gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural networks training loops to compute the partial derivative of the loss w.r.t each one of the model's parameters.
You can read more about how torch.autograd.grad works by reading through this other answer I made on: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
In addition to Ivan's answer, having torch.autograd.grad not accumulating gradients into .grad can avoid racing conditions in multi-thread scenarios.
Quoting PyTorch doc https://pytorch.org/docs/stable/notes/autograd.html#non-determinism
If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
implementation details https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h
I have a 3D torch tensor with dimension of [Batch_size, n, n] which is the out put of a layer of my network and a constant 2D torch tensor with size of [n, n]. How can I perform element wise multiplication over the batch size which should resulted in a torch tensor with size of [Batch_size, n, n]?
I know it is possible to implement this operation using explicit loop but I am interested in the most efficient way.
One option is that you can expand your weight matrix to have a matching batch dimension (without using any additional memory). E.g. twoDTensor.expand((batch_size, n, n)) returns the same underlying data, but representing a 3D tensor. You can see that the stride for the batch dim is zero.
I need to construct a 4-dimensional PyTorch Tensor where one of the dimensions comes from multiplying a constant sparse matrix with a dense vector. The dense vector, and the resulting 4D Tensor, require gradients to be tracked. Since PyTorch only supports sparse matrices, I can't express the whole thing as a Tensor-Tensor multipllication, and I think I have to do the matrix multiplication part of the construction in a loop. In that case, I'd at least like to preallocate the result 4D Tensor and let the sparse mm fill in one dimension in a loop.
How do I in that case keep track of the resulting 4D Tensor's gradient requirements? Can I manually attach it into the gradient graph once it's been created?
My current approach is extremely inefficient, essentially building up one dimension at a time in a list, can cating.
I am attempting to implement a Lambda layer that will produce a custom loss function. In the layer, I need to be able to compare every element in a batch to every other element in the batch in order to calculate the cost. Ideally, I want code that looks something like this:
for el_1 in zip(y_pred, y_true):
for el_2 in zip(y_pred, y_true):
if el_1[1] == el_2[1]:
# Perform a calculation
else:
# Perform a different calculation
When I true this, I get:
TypeError: TensorType does not support iteration.
I am using Keras version 2.0.2 with a Theano version 0.9.0 backend. I understand that I need to use Keras tensor functions in order to do this, but I can't figure out any tensor functions that do what I want.
Also, I am having difficulty understanding precisely what my Lambda function should return. Is it a tensor of the total cost for each sample, or is it just a total cost for the batch?
I have been beating my head against this for days. Any help is deeply appreciated.
A tensor in Keras commonly has at least 2 dimensions, the batch and the neuron/unit/node/... dimension. A dense layer with 128 units trained with a batch size of 64 would therefore yields a tensor with shape (64,128).
Your LambdaLayer processes tensors as any other layer does, plugging it in after your dense layer from before will give you a tensor with shape (64,128) to process. Processing a tensor works similar to how calculations on numpy arrays works (or any other vector processing library really): you specify one operation to broadcast over all elements in the data structure.
For example, your custom cost is the difference for each value in the batch, you would implement it like so:
cost_layer = LambdaLayer(lambda a,b: a - b)
The - operation is broadcasted over a and b and will return a suitable result provided the dimensions match. The takeaway is that you really only can specify one operation for every value. If you want to do more complex tasks, for example computations based on the value you need single operations that take two operations and apply the correct one accordingly, i.e. the switch operation.
The syntax for K.switch is
K.switch(condition, then_expression, else_expression)
For example, if you want to subtract both values when a != b but add them when they are equal, you would write:
import keras.backend as K
cost_layer = LambdaLayer(lambda a,b: K.switch(a != b, a - b, a + b))