How to initialize columns in hybrid sparse tensor - pytorch

How initialize in pytorch hybrid tensor torch.sparse_coo_tensor (one dimension is sparse and other is not), which have the following dense representation?
array([[1, 0, 5, 0],
[2, 0, 6, 0],
[3, 0, 7, 0],
[4, 0, 8, 0]])
What should I put into the indices argument?

How to initialize
Something like this:
import torch
indices = torch.tensor([[0, 0, 1, 1, 2, 2, 3, 3], [0, 2, 0, 2, 0, 2, 0, 2]])
tensor = torch.sparse_coo_tensor(
indices, torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]), size=(4, 4)
)
Given above:
indices - first dimension specifies row, second column, where non-zero value(s) will be located. Those become pairs, in this case: (0, 0), (0, 2), (1, 0), (1, 2)... and so on
values - values located at those pairs, so 1 will be under (0, 0) coordinate, 2 under (0, 2) and so it goes.
size - total size of the matrix, optional, might be inferred in this case from your input
8 pairs, 8 values, there are also other ways to specify it, but the idea holds.
And a quick check:
print(tensor)
print(tensor.to_dense())
Gives us:
tensor(indices=tensor([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 2, 0, 2, 0, 2, 0, 2]]),
values=tensor([1, 2, 3, 4, 5, 6, 7, 8]),
size=(4, 4), nnz=8, layout=torch.sparse_coo)
tensor([[1, 0, 2, 0],
[3, 0, 4, 0],
[5, 0, 6, 0],
[7, 0, 8, 0]])
Why to initialize
If your actual data is 50% sparse, you shouldn't use COO tensor.
It will save some memory, but operations will be way slower, so keep that in mind.

Related

Pack tensor containing padding on both sides

I have an LSTM that receives as input a padded tensor with pad values on both sides (assume that pad values are 0):
[0, 0, 0, 3, 2, 1, 4, 0, 0, 0]
[1, 9, 5, 3, 2, 1, 4, 7, 2, 3]
[0, 0, 0, 3, 2, 1, 4, 2, 3, 3]
Typically, one would use TORCH.NN.UTILS.RNN.PACK_PADDED_SEQUENCE to let the LSTM know which values to pad, but I've only seen it used for cases where tensors are right-padded. There's also no indication from the documentation that it will be able to recognize padded values on the left-hand side. Is there an alternative?

partitioning blocks of elements from the indices of a numpy 2D array

I have the indices of a 2D array. I want to partition the indices such that the corresponding entries form blocks (block size is given as input m and n).
For example, if the indices are as given below
(array([0, 0, 0, 0, 1, 1, 1, 1, 6, 6, 6, 6, 7, 7, 7, 7 ]), array([0, 1, 7, 8, 0,1,7,8, 0,1,7,8, 0, 1, 7, 8]))
for the original matrix (from which the indices are generated)
array([[3, 4, 2, 0, 1, 1, 0, 2, 4],
[1, 3, 2, 0, 0, 1, 0, 4, 0],
[1, 0, 0, 1, 1, 0, 1, 1, 3],
[0, 0, 0, 3, 3, 0, 4, 0, 4],
[4, 3, 4, 2, 1, 1, 0, 0, 4],
[0, 1, 0, 4, 4, 2, 2, 2, 1],
[2, 4, 0, 1, 1, 0, 0, 2, 1],
[0, 4, 1, 3, 3, 2, 3, 2, 4]])
and if the block size is (2,2), then the blocks should be
[[3, 4],
[1, 3]]
[[2, 4]
[4, 0]]
[[2, 4]
[0, 4]]
[[2, 1]
[2, 4]]
Any help to do it efficiently?
Does this help? A is your matrix.
row_size = 2
col_size = 3
for i in range(A.shape[0] // row_size):
for j in range(A.shape[1] // col_size):
print(A[row_size*i:row_size*i + row_size, col_size*j:col_size*j + col_size])

Pytorch: Most computationally and memory efficient way to make a series of concatenations from extracting tensor rows?

Say that this is my sample tensor
sample = torch.tensor(
[[2, 7, 3, 1, 1],
[9, 5, 8, 2, 5],
[0, 4, 0, 1, 4],
[5, 4, 9, 0, 0]]
)
I want to have a new tensor, which will consist of concatenations of 2 rows from the sample tensor.
So I have a tensor which contains pairs of the row numbers that I want concatenated into a single row for the new tensor
cat_indices = torch.tensor([[0, 1], [1, 2], [0, 2], [2, 3]])
The current method I am using is this
torch.cat((sample[cat_indices[:,0]], sample[cat_indices[:,1]]), dim=1)
Which gives the desired result
tensor([[2, 7, 3, 1, 1, 9, 5, 8, 2, 5],
[9, 5, 8, 2, 5, 0, 4, 0, 1, 4],
[2, 7, 3, 1, 1, 0, 4, 0, 1, 4],
[0, 4, 0, 1, 4, 5, 4, 9, 0, 0]])
Is this the most memory and computationally efficient method of doing this? I am not sure because I am making two calls to cat_indices, and then I am doing a concatenation operation.
I feel that there should be a way to do this via some sort of view. Perhaps advanced indexing. I've tried things like sample[cat_indices[:,0], cat_indices[:,1]] or sample[cat_indices[0], cat_indices[1]] but I can't make the view come out right.
What you have should be pretty fast. An alternative is
sample[cat_indices].reshape(cat_indices.shape[0],-1)
You would have to benchmark the performance on your machine though to see which is better.

How to get padding mask from input ids?

Considering a batch of 4 pre-processed sentences (tokenization, numericalizing and padding) shown below:
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
where 0 states for [PAD] token.
Thus, what would be an efficient approach to generate a padding masking tensor of the same shape as the batch assigning zero at [PAD] positions and assigning one to other input data (sentence tokens)?
In the example above it would be something like:
padding_masking=
tensor([
[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]
])
The following is tested on pytorch 1.3.1.
pad_token_id = 0
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
pad_mask = ~(batch == pad_token_id)
print(pad_mask)
Output
tensor([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]], dtype=torch.uint8)
You can get your desired result with
padding_masking = batch > 0
If you want ints instead of booleans, use
padding_masking.type(torch.int)

Numpy Selecting Elements given row and column index arrays

I have row indices as a 1d numpy array and a list of numpy arrays (list as same length as the size of the row indices array. I want to extract values corresponding to these indices. How can I do it ?
This is an example of what I want as output given the input
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
Now, I want my output as a list of numpy arrays or list of lists as
[np.array([2, 1, 1]), np.array([2, 1]), np.array([1, 2, 1])]
My biggest concern is the efficiency. My array A is of dimension 20K x 10K.
As #hpaulj commented, likely, you won't be able to avoid looping - e.g.
import numpy as np
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
# make sure the following code is safe...
assert row_ind.shape[0] == len(col_ind)
# 1) select row (A[r, :]), then select elements (cols) [col_ind[i]]:
output = [A[r, :][col_ind[i]] for i, r in enumerate(row_ind)]
# output
# [array([2, 1, 1]), array([2, 1]), array([1, 2, 1])]
Another way to do this could be to use np.ix_ (still requires looping). Use with caution though for very large arrays; np.ix_ uses advanced indexing - in contrast to basic slicing, it creates a copy of the data instead of a view - see the docs.

Resources