RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces) - pytorch

I got it when I was running:
def case3():
    a = torch.randn(2, 2)
    torch.kron(a, a.T)
but it works for torch.kron(a,a)
And then I tried:
def case4():
    a = torch.randn(1, 4)
    torch.kron(a, a.T)
It works! So I am confused about why torch.kron does not work on a tensor of size 2x2. Thanks!

I have solved it by myself!
The reason is that .T only returns a different view of the tensor (it changes the strides) rather than rearranging the actual data in memory, so the transposed tensor is non-contiguous.
Modifying the code like this solves the problem:
def case4():
    a = torch.randn(2, 2)
    b = a.T.contiguous()
    print(torch.kron(a, b))
    print(torch.kron(a, a))
.contiguous() makes a contiguous copy of the tensor,
referring to What does .contiguous() do in PyTorch?
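As a quick illustration (my own check, not part of the original answer), is_contiguous() shows why the 2x2 case fails while the 1x4 case works:
import torch

a = torch.randn(2, 2)
print(a.T.is_contiguous())        # False: the transposed view has strides (1, 2)

b = torch.randn(1, 4)
print(b.T.is_contiguous())        # True: size-1 dimensions do not break contiguity
print(torch.kron(b, b.T).shape)   # torch.Size([4, 4]), so this call succeeds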

Related

Is there a way to compute a circulant matrix in Pytorch?

I want a function similar to scipy.linalg.circulant (https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.circulant.html) for creating a circulant matrix with PyTorch. I need this as part of my deep learning model (in order to reduce over-parametrization in some of my fully connected layers, as suggested in https://arxiv.org/abs/1907.08448, Fig. 3).
The input of the function shall be a 1D torch tensor, and the output should be the 2D circulant matrix.
You can make use of unfold to extract sliding windows. But to get the correct order you need to flip (later unflip) the tensors, and first concatenate the flipped tensor to itself.
circ = lambda v: torch.cat([f := v.flip(0), f[:-1]]).unfold(0, len(v), 1).flip(0)
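For example (a small check of my own, assuming Python 3.8+ for the walrus operator):
import torch

circ = lambda v: torch.cat([f := v.flip(0), f[:-1]]).unfold(0, len(v), 1).flip(0)

v = torch.tensor([0, 1, 2, 3])
print(circ(v))
# tensor([[0, 3, 2, 1],
#         [1, 0, 3, 2],
#         [2, 1, 0, 3],
#         [3, 2, 1, 0]])
# which matches scipy.linalg.circulant([0, 1, 2, 3])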
Here is a generic function for PyTorch tensors to get the circulant matrix along one dimension. It's based on unfold and works for a 2-D circulant matrix as well as for higher-dimensional tensors.
def circulant(tensor, dim):
    """get a circulant version of the tensor along the {dim} dimension.

    The additional axis is appended as the last dimension.
    E.g. tensor=[0,1,2], dim=0 --> [[0,1,2],[2,0,1],[1,2,0]]"""
    S = tensor.shape[dim]
    tmp = torch.cat([tensor.flip((dim,)),
                     torch.narrow(tensor.flip((dim,)), dim=dim, start=0, length=S - 1)], dim=dim)
    return tmp.unfold(dim, S, 1).flip((-1,))
Essentially, this is a PyTorch version of scipy.linalg.circulant that also works for multi-dimensional tensors.
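A quick usage sketch of the function above (expected outputs based on its docstring):
import torch

v = torch.tensor([0, 1, 2])
print(circulant(v, 0))
# tensor([[0, 1, 2],
#         [2, 0, 1],
#         [1, 2, 0]])

# applied along dim=1 of a 2-D tensor, each row gets its own circulant matrix
m = torch.tensor([[0, 1, 2], [3, 4, 5]])
print(circulant(m, 1).shape)   # torch.Size([2, 3, 3])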
Also a similar question: Create array/tensor of cycle shifted arrays

Torch mv behavior not understandable

The following screenshots show that torch.mv fails in a situation that seems obviously correct... how is this possible? Any idea what the problem can be?
The first image shows the correct situation, where the vector has 10 rows for a matrix of 10 columns, but I showed the other one as well just in case. Also, swapping w.mv(x) for x.mv(w) does not make a difference.
However, the @ operator works... the thing is that for my own reasons I want to use mv, so I would like to know what the problem is.
According to documentation:
torch.mv(input, vec, *, out=None) → Tensor
If input is a (n×m) tensor, vec is a 1-D tensor of size m, out will be 1-D of size n.
The x here should be 1-D, but in your case it is 10x1 (2-D). You can remove the extra dimension (or create x as a 1-D tensor in the first place):
>>> w.mv(x.squeeze())
tensor([ 0.1432, -2.0639, -2.1871, -1.8837, 0.7333, -0.4000, 0.4023, -1.1318,
0.0423, -1.2136])
>>> w @ x
tensor([[ 0.1432],
[-2.0639],
[-2.1871],
[-1.8837],
[ 0.7333],
[-0.4000],
[ 0.4023],
[-1.1318],
[ 0.0423],
[-1.2136]])
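For completeness, a minimal sketch of the two ways to make the shapes line up (the variable names here are my own, not taken from the question):
import torch

w = torch.randn(10, 10)
x2d = torch.randn(10, 1)             # column vector, 2-D

print(w.mv(x2d.squeeze(1)).shape)    # torch.Size([10]) - drop the extra dimension
print(w.mv(x2d.reshape(-1)).shape)   # torch.Size([10]) - or flatten to 1-D

x1d = torch.randn(10)                # created 1-D from the start
print(w.mv(x1d).shape)               # torch.Size([10])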

Why torch.dot(a,b) makes requires_grad=False

I have some losses computed in a loop and stored in a tensor loss. Now I want to multiply a weight tensor with the loss tensor to get the final loss, but after torch.dot() the resulting scalar, ll_new, has requires_grad=False. The following is my code.
loss_vector = torch.FloatTensor(total_loss_q)
w_norm = F.softmax(loss_vector, dim=0)
ll_new = torch.dot(loss_vector,w_norm)
How can I have requires_grad=True for ll_new after doing the above?
I think the issue is in the line loss_vector = torch.FloatTensor(total_loss_q), as requires_grad for loss_vector is False (the default value). So, you should do:
loss_vector = torch.tensor(total_loss_q, requires_grad=True)
The issue most likely lies within this part:
I have some losses in a loop storing them in a tensor loss
You are most likely losing requires_grad somewhere in the process before torch.dot, e.g. if you call something like .item() on individual losses when constructing the total_loss_q tensor.
What type is your total_loss_q? If it is a list of integers then there is no way your gradients will propagate through that. You need to construct total_loss_q in such a way that it is a tensor which knows how each individual loss was constructed (i.e. can propagate gradients to your trainable weights).
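As an illustration of that point, here is a minimal sketch with a dummy model (none of these names come from the question): keeping each loss as a tensor and stacking them preserves the graph.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4, 1)                    # dummy model, just for illustration
criterion = torch.nn.MSELoss()

losses = []
for _ in range(5):
    x, y = torch.randn(8, 4), torch.randn(8, 1)  # dummy batch
    losses.append(criterion(model(x), y))        # keep the tensor; do NOT call .item()

loss_vector = torch.stack(losses)                # 1-D tensor that still tracks gradients
w_norm = F.softmax(loss_vector, dim=0)
ll_new = torch.dot(loss_vector, w_norm)
print(ll_new.requires_grad)                      # True, so ll_new.backward() works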

In PyTorch, what makes a tensor have non-contiguous memory?

According to this SO and this PyTorch discussion, PyTorch's view function works only on contiguous memory, while reshape does not. In the second link, the author even claims:
[view] will raise an error on a non-contiguous tensor.
But when does a tensor have non-contiguous memory?
This is a very good answer, which explains the topic in the context of NumPy. PyTorch works essentially the same way. Its docs don't generally state whether function outputs are (non-)contiguous, but that can usually be guessed from the kind of operation (with some experience and understanding of the implementation). As a rule of thumb, most operations preserve contiguity because they construct new tensors. You may see non-contiguous outputs when an operation returns a view with modified strides (for example transposing, or slicing with a step) instead of allocating new memory. A couple of examples below:
import torch

t = torch.randn(10, 10)

def check(ten):
    print(ten.is_contiguous())

check(t)  # True

# torch.flip copies the data into a new tensor, so the result stays contiguous
check(torch.flip(t, (0,)))  # True

# if we take every 2nd row, adjacent elements in the resulting tensor
# are not adjacent in the input tensor
check(t[::2])  # False

# if we transpose, we lose contiguity, as in the case of NumPy
check(t.transpose(0, 1))  # False

# if we transpose twice, we first lose and then regain contiguity
check(t.transpose(0, 1).transpose(0, 1))  # True
In general, if you have a non-contiguous tensor t, you can make it contiguous by calling t = t.contiguous(). If t is already contiguous, the call to t.contiguous() is essentially a no-op, so you can do it without risking a big performance hit.
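A small illustration of that last point (my own sketch, not part of the original answer):
import torch

t = torch.randn(10, 10)
u = t.transpose(0, 1)
print(u.is_contiguous())                           # False
v = u.contiguous()                                 # copies the data into row-major order
print(v.is_contiguous())                           # True
print(t.contiguous().data_ptr() == t.data_ptr())   # True: no copy when already contiguous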
I think your title, "contiguous memory", is a bit misleading. As I understand it, contiguous in PyTorch refers to whether the neighboring elements in the tensor are actually next to each other in memory. Let's take a simple example:
x = torch.tensor([[1, 2, 3], [4, 5, 6]]) # x is contiguous
y = torch.transpose(x, 0, 1) # y is non-contiguous
According to the documentation of transpose():
Returns a tensor that is a transposed version of input. The given dimensions dim0 and dim1 are swapped.
The resulting out tensor shares its underlying storage with the input tensor, so changing the content of one would change the content of the other.
So x and y in the above example share the same memory space. But if you check their contiguity with is_contiguous(), you will find that x is contiguous and y is not. So contiguity does not refer to the tensor occupying a contiguous block of memory, but to the order in which its elements are laid out in that memory.
Since x is contiguous, x[0][0] and x[0][1] are next to each other in memory, but y[0][0] and y[0][1] are not. That is what contiguous means.
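The strides make this concrete (my own sketch, following the example above):
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
y = torch.transpose(x, 0, 1)

print(x.stride(), x.is_contiguous())   # (3, 1) True  - row-major order
print(y.stride(), y.is_contiguous())   # (1, 3) False - logical neighbours are 3 elements apart
print(x.data_ptr() == y.data_ptr())    # True - same underlying storage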

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer which was fed by pack_padded_sequence, we get a T x B x N tensor outputs, where T is the maximum number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent outputs are all zeros.
Here are my questions.
For a single-output task, where one needs the last output of all the sequences, a simple outputs[-1] will give a wrong result since this tensor contains lots of zeros for the short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output for every sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into T*B x O and computes the cross-entropy loss with the true targets T*B (usually integers in a language model). In this situation, do these zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward:
class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                                unpacked.size(2)).unsqueeze(1)
        return unpacked.gather(1, idx).squeeze()

    def forward(self, x, lengths):
        embs = self.embedding(x)

        # pack the batch
        packed = pack_padded_sequence(embs, list(lengths.data),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)

        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)

        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In the older versions of PyTorch, masking was not supported, so you had to implement your own workaround. The solution that I used was masked_cross_entropy.py by jihunchoi. You may also be interested in this discussion.
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
My dataset is batch-first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
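To make the two approaches concrete, here is a small self-contained sketch (dummy tensors, not the answerer's data) showing that the indexing one-liner and a gather-based version pick the same rows:
import torch

# toy batch-first output: 3 padded sequences, max length 4, hidden size 2
out = torch.arange(24, dtype=torch.float).reshape(3, 4, 2)
lengths = torch.tensor([4, 2, 3])

# indexing one-liner (pure torch, no numpy needed)
last = out[torch.arange(out.size(0)), lengths - 1]             # shape (3, 2)

# gather-based version, in the spirit of last_timestep() above
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))
last_gather = out.gather(1, idx).squeeze(1)

print(torch.equal(last, last_gather))  # True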
