I'd like to do the equivalent of
# X.shape = (batch, M, D)
# Y.shape = (N, D)
Z = torch.max(torch.matmul(X, Y.T), dim=2).values
while only paying memory cost (batch, M) by not materializing an intermediate of size (batch, M, N) (because, as you may have guessed, I run out of GPU RAM). I can do this manually by slicing the matmul and accumulating a running max, but that's a lot of CUDA kernel calls, so it kinda sucks.
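For concreteness, this is roughly the manual chunked version I mean (the helper name and chunk size are just illustrative):

import torch

def chunked_max_matmul(X, Y, chunk=1024):
    # X: (batch, M, D), Y: (N, D) -> result: (batch, M)
    # never materializes more than a (batch, M, chunk) slice at a time
    best = None
    for start in range(0, Y.shape[0], chunk):
        scores = torch.matmul(X, Y[start:start + chunk].T)  # (batch, M, <=chunk)
        part = scores.max(dim=2).values
        best = part if best is None else torch.maximum(best, part)
    return best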
Is there a better solution?
Fitting a single polynomial to a bunch of data is pretty easy in Pytorch using an nn.Linear layer. I've included a trivial example at the end of this post. But suppose I have tons of data split into groups, and I want to fit a different polynomial to each group. As an example, find the particular quadratic coefficients that fit each column in this image:
In other words, I want to simultaneously find the coefficients for N polynomials of order n, given m data per set to be fit:
In the image above, there are m=80 points per dataset, and N=100 sets to fit.
This lends itself perfectly to tensor manipulation, and PyTorch on a GPU should make it blindingly fast by fitting all N at once. The problem is I'm having a terrible brain fart and haven't been able to wrap my head around the right layer configuration. Basically, I need N nn.Linear layers, each operating on its own dataset. If this were convolution, I'd use a depthwise layer...
Here is an example network that fits one polynomial, where X is the m x p matrix of abscissa data, y is the vector of m ordinate data, and we want to find the p coefficients.
import torch

class polyfit(torch.nn.Module):
    def __init__(self, n=2):
        super(polyfit, self).__init__()
        self.poly = torch.nn.Linear(n, 1, bias=False)

    def forward(self, x):
        print(x.shape, self.poly)  # debug print of the input shape and the layer
        return self.poly(x)

# n, X and y are defined elsewhere as described above
model = polyfit(n)
loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for epoch in range(100):  # or however long I want to run the loop
    output = model(X)
    mse = loss(output, y)
    optimizer.zero_grad()
    mse.backward()
    optimizer.step()
Figured it out after thinking about my depthwise convolution comment. A Conv1d with just 3 weights applied to a tensor holding [1, x, x**2] computes a quadratic, the same as a Linear layer with n=3 inputs. So the layer needs to be:
self.poly = torch.nn.Conv1d(N, N, n+1, bias=False, groups=N)
Just have to make sure the X, y tensors have the right shapes: [m, N, n+1] (one polynomial feature vector per point, matching the kernel size n+1) and [m, N, 1] respectively.
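For anyone landing on this later, here is a minimal sketch of the setup described above; the shapes, data, learning rate, and loop length are made up for illustration:

import torch

m, N, n = 80, 100, 2  # points per set, number of sets, polynomial order

# polynomial features per point: [1, x, x**2, ..., x**n]
x = torch.rand(m, N)
X = torch.stack([x**k for k in range(n + 1)], dim=2)  # (m, N, n+1)
y = torch.rand(m, N, 1)                               # (m, N, 1) targets

# one length-(n+1) filter per dataset, thanks to groups=N
poly = torch.nn.Conv1d(N, N, n + 1, bias=False, groups=N)

optimizer = torch.optim.SGD(poly.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
for epoch in range(100):
    pred = poly(X)            # (m, N, 1): each group only sees its own dataset
    mse = loss_fn(pred, y)
    optimizer.zero_grad()
    mse.backward()
    optimizer.step()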
Like what nn.Conv2d or nn.AvgPool2d do with a tensor and a kernel size, I would like to compute the variance of a tensor over sliding windows of a given kernel size. How can I achieve this? Do I need to modify PyTorch's source code?
If it's only the variance you are after, you can use the fact that
var(x) = E[x^2] - E[x]^2
Using avg_pool2d you can estimate the local average of x and of x squared:
import torch.nn.functional as nnf
running_var = nnf.avg_pool2d(x**2, kernel_size=2, stride=1) - nnf.avg_pool2d(x, kernel_size=2,stride=1)**2
However, if you want a more general way of performing "sliding window" operations, you should familiarize yourself with unfold and fold:
u = nnf.unfold(x, kernel_size=2, stride=1)  # all 2x2 patches flattened into vectors: (N, C*4, L)
running_var2 = torch.var(u, unbiased=False, dim=1)
# reshape back to the spatial layout ("folding"); assumes a single input channel
running_var2 = running_var2.reshape(x.shape[0], 1, x.shape[2] - 1, x.shape[3] - 1)
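As a quick sanity check (assuming a single-channel input, which the reshape above also assumes), both approaches should agree up to floating-point error:

import torch
import torch.nn.functional as nnf

x = torch.rand(2, 1, 8, 8)  # illustrative single-channel input
v1 = nnf.avg_pool2d(x**2, kernel_size=2, stride=1) - nnf.avg_pool2d(x, kernel_size=2, stride=1)**2
u = nnf.unfold(x, kernel_size=2, stride=1)
v2 = torch.var(u, unbiased=False, dim=1).reshape(2, 1, 7, 7)
print(torch.allclose(v1, v2, atol=1e-5))  # True: both compute the local (biased) variance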
I want to sample a tensor of probability distributions with shape (N, C, H, W), where dimension 1 (size C) contains normalized probability distributions with C possibilities. Is there a PyTorch function to efficiently sample all the distributions in the tensor in parallel? I just need to sample each distribution once, so the result could either be a one-hot tensor with the same shape or a tensor of indices with shape (N, 1, H, W).
There was no single function to sample that I saw, but I was able to sample the tensor in several steps by computing the cumulative probabilities, sampling each point independently, and then picking the first point that sampled a 1 in the distribution dimension:
# cumulative probability mass from each class to the end of the distribution axis
reverse_cumulative = torch.flip(torch.cumsum(torch.flip(probabilities, [1]), dim=1), [1])
# probability of picking each class given that no earlier class along dim 1 was picked
cumulative = probabilities / reverse_cumulative
# sample every class position independently
sampled = (torch.rand(cumulative.shape, device=device()) <= cumulative)
# device(), one_hot and self.tile_count come from the surrounding code (not shown here)
idxs = sampled * one_hot
idxs[~sampled] = self.tile_count
# keep the first position along the distribution dimension that sampled a 1
sampled_idxs = idxs.min(dim=1).indices
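Not from the original answer, but for comparison: torch.multinomial can also draw one sample per distribution if you flatten the spatial dimensions first, at the cost of an (N*H*W, C) view; the shapes below are just illustrative:

import torch

N, C, H, W = 2, 5, 4, 4
probabilities = torch.softmax(torch.rand(N, C, H, W), dim=1)

flat = probabilities.permute(0, 2, 3, 1).reshape(-1, C)  # (N*H*W, C) rows of probabilities
samples = torch.multinomial(flat, num_samples=1)         # one class index per distribution
sampled_idxs = samples.reshape(N, H, W).unsqueeze(1)     # back to (N, 1, H, W)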
I want to build a model that predicts the next character based on the previous characters.
I have split the text into sequences of integers with length = 100 (using a Dataset and DataLoader).
Dimensions of my input and target variables are:
inputs dimension: (batch_size, sequence_length), in my case (128, 100)
targets dimension: (batch_size, sequence_length), in my case (128, 100)
After the forward pass my predictions have dimension (batch_size, sequence_length, vocabulary_size), which in my case is (128, 100, 44),
but when I calculate the loss with nn.CrossEntropyLoss():
import torch
import torch.nn as nn

batch_size = 128
sequence_length = 100
number_of_classes = 44
# create a random tensor with the shape of my output
output = torch.rand(batch_size, sequence_length, number_of_classes)
# create a tensor of random targets
target = torch.randint(number_of_classes, (batch_size, sequence_length)).long()
# define the loss function and calculate the loss
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)
print(loss)
I get an error:
ValueError: Expected target size (128, 44), got torch.Size([128, 100])
The question is: how should I handle the loss calculation for many-to-many LSTM prediction, and in particular the sequence dimension? According to the nn.CrossEntropyLoss docs, the input must have shape (N, C, d1, d2, ..., dK), where N is the batch size and C is the number of classes. But what is d? Is it related to the sequence length?
As a general comment, let me just say that you have asked many different questions, which makes it difficult for someone to answer. I suggest asking just one question per StackOverflow post, even if that means making several posts. I will answer just the main question that I think you are asking: "why is my code crashing and how to fix it?" and hopefully that will clear up your other questions.
Per your code, the output of your model has dimensions (128, 100, 44) = (N, D, C). Here N is the minibatch size, C is the number of classes, and D is the dimensionality of your input. The cross entropy loss you are using expects the output to have dimension (N, C, D) and the target to have dimension (N, D). To clear up the documentation that says (N, C, D1, D2, ..., Dk), remember that your input can be an arbitrary tensor of any dimensionality. In your case inputs have length 100, but nothing is to stop someone from making a model with, say, a 100x100 image as input. (In that case the loss would expect output to have dimension (N, C, 100, 100).) But in your case, your input is one dimensional, so you have just a single D=100 for the length of your input.
Now we see the error: outputs should be (N, C, D), but yours is (N, D, C). Your targets have the correct dimensions of (N, D). You have two ways to fix the issue. The first is to change the structure of your network so that its output is (N, C, D); this may or may not be easy or what you want in the context of your model. The second option is to transpose your axes at the time of loss computation using torch.transpose: https://pytorch.org/docs/stable/generated/torch.transpose.html
import torch
import torch.nn as nn

batch_size = 128
sequence_length = 100
number_of_classes = 44
# create a random tensor with your output shape (N, D, C)
output = torch.rand(batch_size, sequence_length, number_of_classes)
# transpose the dimensions to (N, C, D)
transposed_output = torch.transpose(output, 1, 2)
# create a tensor of random targets
target = torch.randint(number_of_classes, (batch_size, sequence_length)).long()
# define the loss function and calculate the loss
criterion = nn.CrossEntropyLoss()
loss = criterion(transposed_output, target)
print(loss)
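For what it's worth, the k-dimensional case mentioned above (e.g. a 100x100 image as input) follows exactly the same pattern; this is only an illustration with made-up shapes:

import torch
import torch.nn as nn

N, C, H, W = 4, 44, 100, 100
output = torch.rand(N, C, H, W)              # class scores per pixel: (N, C, d1, d2)
target = torch.randint(C, (N, H, W)).long()  # class index per pixel: (N, d1, d2)
print(nn.CrossEntropyLoss()(output, target))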
Dual Encoder LSTM
I want to implement this model with the TensorFlow Keras API. I am confused about how to implement the sigmoid(CMR) function in Keras. How do I merge the outputs of the two LSTMs and compute the function above?
RNN here means LSTM
C and R are sentences encoded into a fixed-size representation by the two LSTMs. They are then passed through the function sigmoid(CMR). We can assume that C and R are both 256-dimensional vectors and M is a 256 x 256 matrix. The matrix M is learned during training.
Assuming you only consider the final output of the LSTMs and not the whole sequence, the shape of the output of each LSTM model would be (batch_size, 256).
Now, we have the following vectors and their shapes:
C: (batch_size, 256)
R: (batch_size, 256)
M: (256, 256).
The simplest case is for batch_size = 1. Then,
C: (1, 256)
R: (1, 256)
So, mathematically, C^T M R is in practice computed here as C M R^T, and gives you a tensor of shape (1, 1), which can be represented by any number of dimensions.
In code, this is straightforward:
def compute_cmr(c, m, r):
    # c: (1, 256), m: (256, 256), r: (1, 256)
    r = tf.transpose(r, [1, 0])    # (256, 1)
    output = tf.matmul(c, m)       # (1, 256)
    output = tf.matmul(output, r)  # (1, 1)
    return output
However, if your batch_size is greater than 1, things can get tricky. My approach (using eager execution) is to unstack along the batch axis, process each pair individually, and then restack. It may not be the most efficient way, but it works reliably and the time overhead is usually negligible.
Here's how you can do it:
def compute_cmr(c, m, r):
    # c, r: (batch_size, 256); m: (256, 256)
    outputs = []
    c_list = tf.unstack(c, axis=0)
    r_list = tf.unstack(r, axis=0)
    for batch_number in range(len(c_list)):
        r = tf.expand_dims(r_list[batch_number], axis=1)  # (256, 1)
        c = tf.expand_dims(c_list[batch_number], axis=0)   # (1, 256)
        output = tf.matmul(c, m)
        output = tf.matmul(output, r)                      # (1, 1)
        outputs.append(output)
    return tf.stack(outputs, axis=0)
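Not part of the approach above, but if you would rather avoid the Python loop, tf.einsum can compute the same quantity for the whole batch in one call (assuming c and r have shape (batch_size, 256) and m has shape (256, 256)):

import tensorflow as tf

def compute_cmr_batched(c, m, r):
    # result[b] = c[b] @ m @ r[b], giving shape (batch_size,)
    return tf.einsum('bi,ij,bj->b', c, m, r)

Wrapping the result in tf.sigmoid then gives the sigmoid(CMR) score from the question.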