Pack tensor containing padding on both sides - pytorch

I have an LSTM that receives as input a padded tensor with pad values on both sides (assume that pad values are 0):
[0, 0, 0, 3, 2, 1, 4, 0, 0, 0]
[1, 9, 5, 3, 2, 1, 4, 7, 2, 3]
[0, 0, 0, 3, 2, 1, 4, 2, 3, 3]
Typically, one would use torch.nn.utils.rnn.pack_padded_sequence to let the LSTM know which values are padding, but I've only seen it used for cases where tensors are right-padded. There's also no indication in the documentation that it can recognize padded values on the left-hand side. Is there an alternative?
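One possible workaround, sketched here under the assumption that the real tokens in each row are contiguous and that 0 never occurs as a real token: rotate each row so its real tokens start at index 0, which turns the batch into an ordinary right-padded one that pack_padded_sequence accepts. The names mask, lead_pad, positions and shifted are illustrative only.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

batch = torch.tensor([
    [0, 0, 0, 3, 2, 1, 4, 0, 0, 0],
    [1, 9, 5, 3, 2, 1, 4, 7, 2, 3],
    [0, 0, 0, 3, 2, 1, 4, 2, 3, 3],
])

mask = batch != 0                                 # True at real tokens (assumes 0 is only padding)
lengths = mask.sum(dim=1)                         # number of real tokens per row
lead_pad = (mask.cumsum(dim=1) == 0).sum(dim=1)   # number of leading pad values per row

# Rotate every row left by its leading-pad count so the real tokens start at index 0.
seq_len = batch.shape[1]
positions = (torch.arange(seq_len).unsqueeze(0) + lead_pad.unsqueeze(1)) % seq_len
shifted = batch.gather(1, positions)

# Add a feature dimension and pack; enforce_sorted=False keeps the rows in their original order.
packed = pack_padded_sequence(
    shifted.unsqueeze(-1).float(), lengths.cpu(), batch_first=True, enforce_sorted=False
)
print(shifted)
print(lengths)
```

The packed result can then be passed to an nn.LSTM built with batch_first=True. If the outputs have to line up with the original positions again, they would need to be rotated back by lead_pad after unpacking.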

Related

How to pad the left side of a list of tensors in pytorch to the size of the largest list?

In pytorch, if you have a list of tensors, you can pad the right side using torch.nn.utils.rnn.pad_sequence
import torch
# for the collate function, pad the sequences
f = [
[0,1],
[0, 3, 4],
[4, 3, 2, 4, 3]
]
torch.nn.utils.rnn.pad_sequence(
[torch.tensor(part) for part in f],
batch_first=True
)
tensor([[0, 1, 0, 0, 0],
[0, 3, 4, 0, 0],
[4, 3, 2, 4, 3]])
How would I pad the left side? The desired solution is
tensor([[0, 0, 0, 0, 1],
[0, 0, 0, 3, 4],
[4, 3, 2, 4, 3]])
You can reverse each list, pad on the right, and then flip the padded tensor. If that is acceptable to you, you can use the code below.
padded = torch.nn.utils.rnn.pad_sequence(
[torch.tensor(i[::-1]) for i in f], # reverse each list and create tensors
batch_first=True # pad (on the right)
).flip(dims=[1]) # flip along dimension 1 so the padding ends up on the left
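As an alternative to reversing and flipping, each tensor can be left-padded individually and the results stacked. This is only a sketch using torch.nn.functional.pad; the names tensors, max_len and left_padded are illustrative.

```python
import torch
import torch.nn.functional as F

f = [
    [0, 1],
    [0, 3, 4],
    [4, 3, 2, 4, 3],
]
tensors = [torch.tensor(part) for part in f]
max_len = max(t.shape[0] for t in tensors)

# For a 1-D tensor the pad tuple is (left, right): add the missing zeros on the left only.
left_padded = torch.stack(
    [F.pad(t, (max_len - t.shape[0], 0), value=0) for t in tensors]
)
print(left_padded)
```

Whether this or the flip-based version is faster probably depends on the batch size and sequence lengths, so it is worth timing both on real data.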

Pytorch: Most computationally and memory efficient way to make a series of concatenations from extracting tensor rows?

Say that this is my sample tensor
sample = torch.tensor(
[[2, 7, 3, 1, 1],
[9, 5, 8, 2, 5],
[0, 4, 0, 1, 4],
[5, 4, 9, 0, 0]]
)
I want to have a new tensor, which will consist of concatenations of 2 rows from the sample tensor.
So I have a tensor which contains pairs of the row numbers that I want concatenated into a single row for the new tensor
cat_indices = torch.tensor([[0, 1], [1, 2], [0, 2], [2, 3]])
The current method I am using is this
torch.cat((sample[cat_indices[:,0]], sample[cat_indices[:,1]]), dim=1)
Which gives the desired result
tensor([[2, 7, 3, 1, 1, 9, 5, 8, 2, 5],
[9, 5, 8, 2, 5, 0, 4, 0, 1, 4],
[2, 7, 3, 1, 1, 0, 4, 0, 1, 4],
[0, 4, 0, 1, 4, 5, 4, 9, 0, 0]])
Is this the most memory and computationally efficient method of doing this? I am not sure because I am making two calls to cat_indices, and then I am doing a concatenation operation.
I feel that there should be a way to do this via some sort of view. Perhaps advanced indexing. I've tried things like sample[cat_indices[:,0], cat_indices[:,1]] or sample[cat_indices[0], cat_indices[1]] but I can't make the view come out right.
What you have should be pretty fast. An alternative is
sample[cat_indices].reshape(cat_indices.shape[0],-1)
You would have to benchmark the performance on your machine though to see which is better.
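A minimal way to run that comparison is sketched below with the standard-library timeit module. The function names cat_version and reshape_version are made up for this example, and the timings on a 4x5 tensor say little about large inputs, so the tensors should be replaced with realistically sized ones.

```python
import timeit
import torch

sample = torch.tensor(
    [[2, 7, 3, 1, 1],
     [9, 5, 8, 2, 5],
     [0, 4, 0, 1, 4],
     [5, 4, 9, 0, 0]]
)
cat_indices = torch.tensor([[0, 1], [1, 2], [0, 2], [2, 3]])

def cat_version():
    # Two index lookups followed by a concatenation along the column dimension.
    return torch.cat((sample[cat_indices[:, 0]], sample[cat_indices[:, 1]]), dim=1)

def reshape_version():
    # One index lookup giving shape (pairs, 2, cols), flattened to (pairs, 2 * cols).
    return sample[cat_indices].reshape(cat_indices.shape[0], -1)

# Both variants must agree before their timings are worth comparing.
same = torch.equal(cat_version(), reshape_version())
print("identical results:", same)

for fn in (cat_version, reshape_version):
    print(fn.__name__, timeit.timeit(fn, number=1000))
```

Note that both variants copy data: indexing with an index tensor is advanced indexing, which returns a new tensor rather than a view.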

How to initialize columns in hybrid sparse tensor

How do I initialize a hybrid tensor in PyTorch with torch.sparse_coo_tensor (one dimension is sparse and the other is not) that has the following dense representation?
array([[1, 0, 5, 0],
[2, 0, 6, 0],
[3, 0, 7, 0],
[4, 0, 8, 0]])
What should I put into the indices argument?
How to initialize
Something like this:
import torch
indices = torch.tensor([[0, 0, 1, 1, 2, 2, 3, 3], [0, 2, 0, 2, 0, 2, 0, 2]])
tensor = torch.sparse_coo_tensor(
indices, torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]), size=(4, 4)
)
In the code above:
indices - the first row holds the row indices and the second row the column indices of the non-zero values. Read column by column they form coordinate pairs, in this case (0, 0), (0, 2), (1, 0), (1, 2), and so on.
values - the values stored at those coordinates, so 1 goes to (0, 0), 2 to (0, 2), and so on.
size - the total size of the matrix; optional, since it can be inferred from the input in this case.
There are 8 pairs and 8 values. There are other ways to specify this, but the idea holds.
And a quick check:
print(tensor)
print(tensor.to_dense())
Gives us:
tensor(indices=tensor([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 2, 0, 2, 0, 2, 0, 2]]),
values=tensor([1, 2, 3, 4, 5, 6, 7, 8]),
size=(4, 4), nnz=8, layout=torch.sparse_coo)
tensor([[1, 0, 2, 0],
[3, 0, 4, 0],
[5, 0, 6, 0],
[7, 0, 8, 0]])
Why to initialize
If your actual data is only about 50% sparse, a COO tensor is probably not worth using.
Any memory saving is small at best, because every stored value also carries its indices, and operations will be much slower, so keep that in mind.
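Note that the dense form printed above is not the matrix from the question, and the tensor is fully sparse rather than hybrid. The following is a sketch of one way to get a hybrid tensor with one sparse and one dense dimension. It assumes it is acceptable to store the matrix transposed, since torch.sparse_coo_tensor places the sparse dimensions first while the question's zeros run along columns. The name hybrid is illustrative.

```python
import torch

# One sparse dimension (which columns of the target are non-zero) followed by
# one dense dimension (the 4 entries of each such column).
indices = torch.tensor([[0, 2]])          # shape (sparse_dim=1, nnz=2)
values = torch.tensor([[1, 2, 3, 4],      # dense slice stored at index 0
                       [5, 6, 7, 8]])     # dense slice stored at index 2
hybrid = torch.sparse_coo_tensor(indices, values, size=(4, 4))

print(hybrid.sparse_dim(), hybrid.dense_dim())

# The stored layout is the transpose of the target, so transpose after densifying.
dense = hybrid.to_dense().T
print(dense)
```

Here only two indices are stored for eight values, which is the point of the hybrid layout: whole dense slices share a single sparse index.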

How to get padding mask from input ids?

Considering a batch of 3 pre-processed sentences (tokenization, numericalizing and padding) shown below:
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
where 0 stands for the [PAD] token.
What would be an efficient approach to generate a padding mask tensor of the same shape as the batch, assigning zero at [PAD] positions and one at all other positions (sentence tokens)?
In the example above it would be something like:
padding_masking=
tensor([
[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]
])
The following is tested on pytorch 1.3.1.
import torch
pad_token_id = 0
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
pad_mask = ~(batch == pad_token_id)
print(pad_mask)
Output
tensor([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]], dtype=torch.uint8)
You can get your desired result with
padding_masking = batch > 0
If you want ints instead of booleans, use
padding_masking.type(torch.int)
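The two answers can be combined as in the sketch below. Comparing against the pad id explicitly avoids the hidden assumption in batch > 0 that the pad id is 0 and every real token id is positive. The variable names are illustrative.

```python
import torch

pad_token_id = 0
batch = torch.tensor([
    [1, 2, 0, 0],
    [4, 0, 0, 0],
    [3, 5, 6, 7],
])

pad_mask = batch != pad_token_id   # boolean mask: True at real tokens, False at [PAD]
pad_mask_int = pad_mask.long()     # same mask as 0/1 integers
lengths = pad_mask.sum(dim=1)      # number of real tokens per sentence

print(pad_mask_int)
print(lengths)
```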

Numpy Selecting Elements given row and column index arrays

I have row indices as a 1d numpy array and a list of numpy arrays (the list has the same length as the row indices array). I want to extract the values corresponding to these indices. How can I do it?
This is an example of what I want as output given the input
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
Now, I want my output as a list of numpy arrays or list of lists as
[np.array([2, 1, 1]), np.array([2, 1]), np.array([1, 2, 1])]
My biggest concern is the efficiency. My array A is of dimension 20K x 10K.
As @hpaulj commented, you likely won't be able to avoid looping, e.g.
import numpy as np
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
# make sure the following code is safe...
assert row_ind.shape[0] == len(col_ind)
# 1) select row (A[r, :]), then select elements (cols) [col_ind[i]]:
output = [A[r, :][col_ind[i]] for i, r in enumerate(row_ind)]
# output
# [array([2, 1, 1]), array([2, 1]), array([1, 2, 1])]
Another way to do this could be to use np.ix_ (still requires looping). Use with caution though for very large arrays; np.ix_ uses advanced indexing - in contrast to basic slicing, it creates a copy of the data instead of a view - see the docs.
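The np.ix_ variant mentioned above could look like the sketch below. It also shows a second option that is not from the original answer: doing one flat advanced-indexing lookup and splitting the result afterwards, which moves the per-row work out of the Python loop. The names output_ix, output_split, rows, cols and flat are illustrative.

```python
import numpy as np

A = np.array([[2, 1, 1, 0, 0],
              [3, 0, 2, 1, 1],
              [0, 0, 2, 1, 0],
              [0, 3, 3, 3, 0],
              [0, 1, 2, 1, 0],
              [0, 1, 3, 1, 0],
              [2, 1, 3, 0, 1],
              [2, 0, 2, 0, 2],
              [3, 0, 3, 1, 2]])
row_ind = np.array([0, 2, 4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]

# Variant 1: np.ix_ builds an open mesh for one row and its columns; ravel drops the row axis.
output_ix = [A[np.ix_([r], cols)].ravel() for r, cols in zip(row_ind, col_ind)]

# Variant 2: a single advanced-indexing lookup over all (row, col) pairs, then split.
lengths = [len(c) for c in col_ind]
rows = np.repeat(row_ind, lengths)    # each row index repeated once per requested column
cols = np.concatenate(col_ind)        # all requested columns in one flat array
flat = A[rows, cols]
output_split = np.split(flat, np.cumsum(lengths)[:-1])

print(output_ix)
print(output_split)
```

Both variants copy the selected elements. Which one is faster for a 20K x 10K array is best checked by timing them on the real data.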

Resources