Let's say I have a CNN intermediate layer output tensor, call it X, with shape (B, C, H, W): batch, channels, height, and width. I extract regions of interest (ROIs) from this tensor based on some manually chosen criteria, i.e. I have box coordinates. Assume all the ROIs have the same shape, so the ROI tensor, call it Y, has shape (B, N, C, h, w), where N is the number of ROIs and h and w are the height and width of each ROI. Now I perform a differentiable operation on Y (assume a convolution) that does not alter the shape of the ROIs; call the result Y' (shape: B, N, C, h, w).
Now I want to replace the locations in X that Y was extracted from with Y'. This modified X is then processed by the subsequent layers of the model. So essentially I do the following:
Y = X[location_criteria]
Y_prime = some_operation(Y)
X[location_criteria] = Y_prime
The assignment above modifies X in place, and PyTorch's computational graph cannot keep track of it. How can I modify the values of X without causing an error?
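A minimal sketch of one common workaround (the shapes, box coordinates and the operation below are made up): clone X and write Y' into the clone. The assignment into the clone is tracked by autograd, and gradients flow back into X both through the untouched region and through the ROI path:

import torch

B, C, H, W = 2, 3, 8, 8
X = torch.randn(B, C, H, W, requires_grad=True)   # stands in for the intermediate activation

y1, y2, x1, x2 = 2, 6, 2, 6                       # hypothetical box coordinates
Y = X[:, :, y1:y2, x1:x2]                         # extract the ROI
Y_prime = torch.tanh(Y)                           # stand-in for some_operation

X_new = X.clone()                                 # keep X itself untouched
X_new[:, :, y1:y2, x1:x2] = Y_prime               # writing into the clone is tracked

X_new.sum().backward()                            # gradients reach X through both paths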
I would like to project a tensor into a space with an additional dimension.
I tried
torch.nn.Linear(
in_features=num_inputs,
out_features=(num_inputs, num_additional),
)
But this results in an error
A workaround would be to
torch.nn.Linear(
in_features=num_inputs,
out_features=num_inputs*num_additional,
)
and then change the view of the output
output.view(batch_size, num_inputs, num_additional)
But I imagine this workaround will get tricky to read, especially when a projection into more than one additional dimension is desired.
Is there a more direct way to code this operation?
Perhaps the source code for Linear
https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear
could be changed to accept more dimensions for the weight and bias initialization; F.linear would also seem to need to be replaced with a different function.
IMO the workaround you provided is already clear enough. However, if you want to express this as a single operation, you can always write your own module by subclassing torch.nn.Linear:
import numpy as np
import torch


class MultiDimLinear(torch.nn.Linear):
    def __init__(self, in_features, out_shape, **kwargs):
        self.out_shape = out_shape
        out_features = np.prod(out_shape)
        super().__init__(in_features, out_features, **kwargs)

    def forward(self, x):
        out = super().forward(x)
        return out.reshape((len(x), *self.out_shape))


if __name__ == '__main__':
    tmp = torch.empty((32, 10))
    linear = MultiDimLinear(in_features=10, out_shape=(10, 10))
    out = linear(tmp)
    print(out.shape)  # (32, 10, 10)
Another way would be to use torch.einsum
https://pytorch.org/docs/stable/generated/torch.einsum.html
torch.einsum lets you prevent summation across chosen dimensions in tensor-to-tensor multiplication operations. This can allow separate multiplication operations to happen in parallel. [I do not know whether this necessarily results in GPU efficiency if the operations still occur in the same kernel; in fact, it may be slower: https://github.com/pytorch/pytorch/issues/32591]
The way this would work is to directly initialize the weight and bias tensors yourself (look at the source code of the torch Linear layer for how it does this).
Say that the input (X) has dimensions (a, b), where a is the batch size.
Say that you want to pass this input through a series of classifiers, represented by a single weight tensor (W) with dimensions (c, d, e), where c is the number of classifiers, d is the input feature dimension, and e is the number of classes for each classifier.
import torch
x = torch.arange(2*4).view(2, 4)
w = torch.arange(5*4*6).view(5, 4, 6)
torch.einsum('ab, cbe -> ace', x, w)
In the last line, a and b are the dimensions of the input as mentioned above. What might be the tricky part is that c, b, and e are the dimensions of the classifiers' weight tensor; I didn't use d, I used b instead. That is because the vector multiplication happens along that dimension for both the input tensor and the weight tensor. That's why the input side of the einsum equation is 'ab, cbe'. The output side of the einsum equation simply lists the dimensions to keep; anything not listed there is summed over.
The final dimensions we want are (a, c, e): a is the batch size, c is the number of classifiers, and e is the number of classes for each classifier. We do not want to sum over those dimensions, so to preserve their separation the output side of the equation is 'ace'.
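As a sanity check (a minimal sketch using the shapes from the example above, cast to float), the einsum result matches applying each classifier's weight matrix separately:

import torch

x = torch.arange(2 * 4, dtype=torch.float).view(2, 4)        # (a, b) = (2, 4)
w = torch.arange(5 * 4 * 6, dtype=torch.float).view(5, 4, 6)  # (c, b, e) = (5, 4, 6)

out = torch.einsum('ab, cbe -> ace', x, w)                    # (a, c, e) = (2, 5, 6)

# Equivalent loop: one (b, e) weight matrix per classifier.
stacked = torch.stack([x @ w[c] for c in range(w.shape[0])], dim=1)
print(out.shape, torch.allclose(out, stacked))                # torch.Size([2, 5, 6]) True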
For those unfamiliar with einsum, this will be harder to read than the workaround described above (though I highly recommend learning it, because it becomes easy and intuitive very quickly even though it's a bit tricky at first: https://www.youtube.com/watch?v=pkVwUVEHmfI ).
However, for parallelizing certain operations (especially on GPU), einsum seems to be the only way to do it. For example, say that in my previous example I didn't want to apply a classification head yet; I just wanted to project to multiple dimensions:
import torch
x = torch.arange(2*4).view(2, 4)
w = torch.arange(5*4*4).view(5, 4, 4)
y = torch.einsum('ab, cbe -> ace', x, w)
And say I do a few other operations on y, perhaps some non-linear operations, activations, etc.
z = f(y)
z will still have the dimensions (2, 5, 4): batch size 2, 5 hidden states per batch, and each of those hidden states has dimension 4.
And then I want to apply a classifier to each separate tensor.
w2 = torch.arange(4*2).view(4, 2)
final = torch.einsum('fgh, hj -> fgj', z, w2)
Quick refresher: 2 is the batch size, 5 is the number of classifiers, and 2 is the number of outputs for each classifier.
The output dimensions f, g, j (2, 5, 2) are not summed across, and thus are preserved in the output.
As noted in the GitHub issue linked above, this may be slower than just using regular linear layers; the efficiency gains, if any, would come with a very large number of parallel operations.
I am playing around with GPT2 and I have 2 tensors:
O: An output tensor of shape (B, S-1, V), where B is the batch size, S is the number of timesteps and V is the vocabulary size. This is the output of a generative model and is softmaxed along the last (vocabulary) dimension.
L: A 2D tensor of shape (B, S-1), where each element is the index of the correct token at each timestep for each sample. These are basically the labels.
I want to extract the predicted probability of the corresponding correct token from tensor O based on tensor L, such that I end up with a 2D tensor of shape (B, S-1). Is there an efficient way of doing this apart from using loops?
For reference, I based my answer on this Medium article.
Essentially, your answer lies in torch.gather, assuming that both of your tensors are just regular torch.Tensors (or can be converted to one).
import torch
# Specify some arbitrary dimensions for now
B = 3
V = 6
S = 4
# Make example reproducible
torch.manual_seed(42)
# L necessarily has to be a torch.LongTensor, otherwise indexing will fail.
L = torch.randint(0, V, size=[B, S])
O = torch.rand([B, S, V])
# Now collect the results. The index tensor must have the same number of
# dimensions as O, hence the unsqueeze along the axis we gather from.
X = torch.gather(O, dim=2, index=L.unsqueeze(dim=2))
# Make sure X has no "unnecessary" dimension
X = X.squeeze(dim=2)
It is a bit difficult to see at a glance whether this produces exactly the right results, which is why I included a random seed to make the example deterministic, so you can easily verify that it gets you the desired values. For clarification, one could also use a lower-dimensional tensor, for which it becomes clearer what exactly torch.gather does.
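For instance, a minimal lower-dimensional sketch (values made up) where each row has exactly one correct class:

import torch

scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2]])    # (B=2, V=3)
labels = torch.tensor([1, 0])               # index of the correct class per row

picked = torch.gather(scores, dim=1, index=labels.unsqueeze(1)).squeeze(1)
print(picked)                               # tensor([0.7000, 0.5000])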
Note that torch.gather also theoretically allows you to gather multiple indices from the same row. Meaning that if you instead had a multi-label example for which multiple values are correct, you could similarly use a tensor L of shape [B, S, number_of_correct_samples].
My question is, I think, too simple, but it's giving me headaches. I think I'm either missing something conceptual about neural networks, or TensorFlow is returning the wrong layer.
I have a network whose last layer outputs 4800 units. The penultimate layer has 2000 units. I expect the weight matrix for the last layer to have shape (4800, 2000), but when I print its shape in TensorFlow I see (2000, 4800). Can someone please confirm which shape the last layer's weight matrix should have? Depending on the answer, I can debug the issue further. Thanks.
Conceptually, a neural network layer is often written like y = W*x where * is matrix multiplication, x is an input vector and y an output vector. If x has 2000 units and y 4800, then indeed W should have size (4800, 2000), i.e. 4800 rows and 2000 columns.
However, in implementations we usually work on a batch of inputs X. Say X is (b, 2000) where b is your batch size. We don't want to transform each element of X individually by doing W*x as above since this would be inefficient.
Instead we would like to transform all inputs at the same time. This can be done via Y = X*W.T where W.T is the transpose of W. You can work out that this essentially applies W*x to each row of X (i.e. each input). Y is then a (b, 4800) matrix containing all transformed inputs.
In Tensorflow, the weight matrix is simply saved in this transposed state, since it is usually the form that is needed anyway. Thus, we have a matrix with shape (2000, 4800) (the shape of W.T).
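As a quick check (a minimal sketch, assuming TensorFlow 2.x with tf.keras), you can inspect the stored kernel of a Dense layer directly:

import tensorflow as tf

x = tf.random.normal((8, 2000))        # a batch of 8 inputs with 2000 units each
dense = tf.keras.layers.Dense(4800)
y = dense(x)                           # the layer builds its weights on first call

print(dense.kernel.shape)              # (2000, 4800) -- i.e. W.T, as stored by TensorFlow
print(y.shape)                         # (8, 4800)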
In the following diagram, I have two different tensors: tensor1 and tensor2.
How do I merge (concatenate) these two tensors such that input to LSTM is now:
(tensor1[0], tensor1[1], concatenate(tensor1[2], tensor2[1]))?
It's impossible to concatenate them as they are.
You need to manipulate or transform them somehow.
The most logical thing I can think of is repeating tensor2 six times to fill the timesteps that it doesn't have.
If that is OK (turning tensor2 into a sequence of 6 constant steps), the solution is:
tensor2Repeated = RepeatVector(6)(tensor2)
tensor = Concatenate()([tensor1,tensor2Repeated])
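A minimal runnable sketch of this (since the diagram isn't reproduced here, the shapes are assumptions: tensor1 is (batch, 6, 3) and tensor2 is (batch, 2)):

from tensorflow.keras.layers import Input, RepeatVector, Concatenate, LSTM
from tensorflow.keras.models import Model

tensor1 = Input(shape=(6, 3))                       # sequence: 6 timesteps, 3 features
tensor2 = Input(shape=(2,))                         # single vector of 2 features

tensor2Repeated = RepeatVector(6)(tensor2)          # (batch, 6, 2)
merged = Concatenate()([tensor1, tensor2Repeated])  # (batch, 6, 5)
out = LSTM(16)(merged)                              # 16 units chosen arbitrarily

model = Model([tensor1, tensor2], out)
model.summary()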
Isn't it better to reduce redundancy? You only have to replicate the second tensor 3 times to produce the same amount of information as the first tensor, and then simply reshape. To concatenate an arbitrary number of tensors: compute the size of each tensor excluding its last axis (multiply all axes before the last to get the size), find the largest tensor m, then upsample or repeat each tensor x by ceiling(m.size / x.size). Finally, reshape each tensor to the same axes as m except for the last axis, which you either calculate yourself or let your framework infer with -1.
tensor2Repeated = RepeatVector(3)(tensor2)
tensor2Reshaped = Reshape((6, 1))(tensor2Repeated)  # shape: (32, 6, 1)
tensor = Concatenate()([tensor1,tensor2Reshaped])
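A minimal runnable version of this variant, using the same assumed shapes as the sketch above (tensor1 is (batch, 6, 3), tensor2 is (batch, 2)):

from tensorflow.keras.layers import Input, RepeatVector, Reshape, Concatenate

tensor1 = Input(shape=(6, 3))
tensor2 = Input(shape=(2,))

tensor2Repeated = RepeatVector(3)(tensor2)           # (batch, 3, 2) -- 6 values per sample
tensor2Reshaped = Reshape((6, 1))(tensor2Repeated)   # (batch, 6, 1)
tensor = Concatenate()([tensor1, tensor2Reshaped])   # (batch, 6, 4)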
I want to understand in more details how a softmax layer can look in a CNN for semantic segmentation / pixelwise classification of an image. The CNN outputs an image of class labels, where each pixel of the original image gets a label.
After passing a test image through the network, the next-to-last layer outputs N channels at the resolution of the original image. My question is how the softmax layer transforms these N channels into the final image of labels.
Assume we have C classes (the number of possible labels). My suggestion is that for each pixel, its N neurons from the previous layer are connected to C neurons in the softmax layer, where each of the C neurons represents one class. Using the softmax activation function, the sum of the C outputs (for this pixel) is equal to 1 (which facilitates training of the network). Finally, each pixel is classified as the class with the highest probability (given by the softmax values).
This would mean that the softmax layer consists of C * #pixels neurons. Is my suggestion correct? I couldn't find an explanation of this and hope that you can help me.
Thanks for helping!
The answer is: the softmax layer does not transform these N channels into the final image of labels.
Assuming you have an output with N channels, your question is how to convert it into a 3-channel image for the final output.
The answer is: you don't. Each of those N channels represents a class. The way to go is to create a dummy array with the same height and width and 3 channels.
Now you first have to abstractly encode each class with a color, e.g. streets as green, cars as red, etc.
Assume that for height = 5 and width = 5, channel 7 has the max value. Now,
-> if channel 7 represents car, then you need to put a red pixel in the dummy array at height = 5 and width = 5.
-> if channel 7 represents street, then you need to put a green pixel in the dummy array at height = 5 and width = 5.
So you are trying to find which of the N classes each pixel belongs to, and based on that class you redraw the pixel in a unique color in the dummy array.
This dummy array is called the mask.
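A minimal sketch of that per-pixel decision plus colouring step (the class count, image size and palette here are made up):

import numpy as np

# Hypothetical network output: N = 3 class scores per pixel for a 5x5 image.
N, H, W = 3, 5, 5
logits = np.random.randn(N, H, W)

# For every pixel, pick the channel (class) with the highest score.
labels = logits.argmax(axis=0)              # (H, W) integer class map

# Map each class index to a colour to build the mask (the "dummy array").
palette = np.array([[0, 255, 0],            # class 0 -> green (e.g. street)
                    [255, 0, 0],            # class 1 -> red   (e.g. car)
                    [0, 0, 255]])           # class 2 -> blue
mask = palette[labels]                       # (H, W, 3) colour mask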
For example, assume this is an input:
We are trying to locate the tumor area of a brain scan using pixel-wise classification. Here the number of classes is 2: tumor present and tumor not present. So the softmax layer outputs a 2-channel object, where channel 1 says "tumor present" and channel 2 says otherwise.
So whenever, for height = X and width = Y, channel 1 has the higher value, we make the pixel dummy[X][Y] white. When channel 2 has the higher value, we make it black.
After that we get a mask like this:
On its own that doesn't tell you much, but when we overlay the two images, we get this:
So basically you will try to create the mask image (the second one) from your N-channel output, and overlaying them gives you the final output.