In PyTorch, if you have a list of tensors, you can pad them on the right using torch.nn.utils.rnn.pad_sequence:
import torch
# for the collate function, pad the sequences
f = [
[0,1],
[0, 3, 4],
[4, 3, 2, 4, 3]
]
torch.nn.utils.rnn.pad_sequence(
[torch.tensor(part) for part in f],
batch_first=True
)
tensor([[0, 1, 0, 0, 0],
[0, 3, 4, 0, 0],
[4, 3, 2, 4, 3]])
How would I pad the left side instead? The desired output is
tensor([[0, 0, 0, 0, 1],
[0, 0, 0, 3, 4],
[4, 3, 2, 4, 3]])
You can reverse each list, pad on the right, and then flip the padded tensor back. If that is acceptable, you can use the code below.
torch.nn.utils.rnn.pad_sequence(
    [torch.tensor(i[::-1]) for i in f],  # reverse each list and create tensors
    batch_first=True,  # right-pad the reversed sequences
).flip(dims=[1])  # flip the padded tensor along the sequence dimension (dim 1)
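If the sequences are already tensors, the same reverse, pad, flip idea could be wrapped in a small helper. This is only a sketch: the name left_pad and its padding_value parameter are made up for illustration, and it assumes 1-D input tensors.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def left_pad(sequences, padding_value=0):
    """Left-pad a list of 1-D tensors into a (batch, max_len) tensor.

    Hypothetical helper, not part of PyTorch.
    """
    # Reverse each sequence so that right-padding ends up on the original left side.
    reversed_seqs = [seq.flip(dims=[0]) for seq in sequences]
    padded = pad_sequence(reversed_seqs, batch_first=True, padding_value=padding_value)
    # Flip along the sequence dimension to restore the original token order.
    return padded.flip(dims=[1])


f = [[0, 1], [0, 3, 4], [4, 3, 2, 4, 3]]
result = left_pad([torch.tensor(part) for part in f])
print(result)
```

Flipping twice costs extra copies, so for very long sequences it may be worth comparing against allocating a full tensor of padding values and filling in each row's tail directly.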
Say that this is my sample tensor
sample = torch.tensor(
[[2, 7, 3, 1, 1],
[9, 5, 8, 2, 5],
[0, 4, 0, 1, 4],
[5, 4, 9, 0, 0]]
)
I want to have a new tensor, which will consist of concatenations of 2 rows from the sample tensor.
So I have a tensor which contains pairs of the row numbers that I want concatenated into a single row for the new tensor
cat_indices = torch.tensor([[0, 1], [1, 2], [0, 2], [2, 3]])
The current method I am using is this
torch.cat((sample[cat_indices[:,0]], sample[cat_indices[:,1]]), dim=1)
Which gives the desired result
tensor([[2, 7, 3, 1, 1, 9, 5, 8, 2, 5],
[9, 5, 8, 2, 5, 0, 4, 0, 1, 4],
[2, 7, 3, 1, 1, 0, 4, 0, 1, 4],
[0, 4, 0, 1, 4, 5, 4, 9, 0, 0]])
Is this the most memory and computationally efficient method of doing this? I am not sure because I am making two calls to cat_indices, and then I am doing a concatenation operation.
I feel that there should be a way to do this via some sort of view. Perhaps advanced indexing. I've tried things like sample[cat_indices[:,0], cat_indices[:,1]] or sample[cat_indices[0], cat_indices[1]] but I can't make the view come out right.
What you have should be pretty fast. An alternative is
sample[cat_indices].reshape(cat_indices.shape[0],-1)
You would have to benchmark the performance on your machine though to see which is better.
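A rough way to check that both variants agree and to time them, using only timeit from the standard library. The function names with_cat and with_reshape are illustrative, and timings on a tensor this small will not say much about a large one, so substitute realistic shapes before drawing conclusions.

```python
import timeit

import torch

sample = torch.tensor(
    [[2, 7, 3, 1, 1],
     [9, 5, 8, 2, 5],
     [0, 4, 0, 1, 4],
     [5, 4, 9, 0, 0]]
)
cat_indices = torch.tensor([[0, 1], [1, 2], [0, 2], [2, 3]])


def with_cat():
    # Two separate row gathers followed by a concatenation along dim 1.
    return torch.cat((sample[cat_indices[:, 0]], sample[cat_indices[:, 1]]), dim=1)


def with_reshape():
    # One gather producing shape (4, 2, 5), then the pair of rows is flattened.
    return sample[cat_indices].reshape(cat_indices.shape[0], -1)


same = torch.equal(with_cat(), with_reshape())
print("identical results:", same)
print("cat:    ", timeit.timeit(with_cat, number=1000))
print("reshape:", timeit.timeit(with_reshape, number=1000))
```

Note that both variants use advanced (integer tensor) indexing, which copies data, so neither produces a view of sample.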
How do I initialize a hybrid torch.sparse_coo_tensor in PyTorch (one dimension is sparse and the other is dense) that has the following dense representation?
array([[1, 0, 5, 0],
[2, 0, 6, 0],
[3, 0, 7, 0],
[4, 0, 8, 0]])
What should I put into the indices argument?
How to initialize
Something like this:
import torch
indices = torch.tensor([[0, 0, 1, 1, 2, 2, 3, 3], [0, 2, 0, 2, 0, 2, 0, 2]])
tensor = torch.sparse_coo_tensor(
indices, torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]), size=(4, 4)
)
In the code above:
indices - the first row holds the row coordinates and the second row holds the column coordinates of the non-zero values. Together they form pairs, in this case (0, 0), (0, 2), (1, 0), (1, 2), and so on.
values - the values located at those pairs, so 1 ends up at coordinate (0, 0), 2 at (0, 2), and so on.
size - the total size of the matrix; it is optional and can be inferred from the input in this case.
There are 8 pairs and 8 values. There are other ways to specify this, but the idea holds.
And a quick check:
print(tensor)
print(tensor.to_dense())
Gives us:
tensor(indices=tensor([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 2, 0, 2, 0, 2, 0, 2]]),
values=tensor([1, 2, 3, 4, 5, 6, 7, 8]),
size=(4, 4), nnz=8, layout=torch.sparse_coo)
tensor([[1, 0, 2, 0],
[3, 0, 4, 0],
[5, 0, 6, 0],
[7, 0, 8, 0]])
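Note that the tensor above is fully sparse in both dimensions, and its dense form is not the matrix from the question. A hybrid tensor (one sparse dimension, one dense dimension) could be sketched as follows. In a hybrid COO tensor the sparse dimensions come first, so this sketch stores the transposed layout, with the sparse dimension indexing the non-zero columns 0 and 2, and only transposes after densifying for the check. Treat the layout choice as an assumption about what the question intends.

```python
import torch

# One sparse dimension: indices has shape (sparse_dim=1, nnz=2).
# Each index selects a whole dense slice of length 4.
indices = torch.tensor([[0, 2]])

# values has shape (nnz=2, 4): one dense slice per stored index.
values = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])

hybrid = torch.sparse_coo_tensor(indices, values, size=(4, 4))
print(hybrid.sparse_dim(), hybrid.dense_dim())

# Densify, then transpose so the stored slices become columns 0 and 2.
dense = hybrid.to_dense().t()
print(dense)
```

Here only two indices are stored instead of eight coordinate pairs, which is the main appeal of the hybrid layout when entire slices are either populated or empty.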
Why initialize
If your actual data is only about 50% sparse, you probably shouldn't use a COO tensor.
It may save some memory (each stored entry also carries its int64 indices, which eats into the savings), but operations will be much slower, so keep that in mind.
Consider a batch of 3 pre-processed sentences (tokenized, numericalized and padded) shown below:
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
where 0 stands for the [PAD] token.
What would be an efficient way to generate a padding mask tensor of the same shape as the batch, with zeros at [PAD] positions and ones at all other positions (sentence tokens)?
In the example above it would be something like:
padding_masking=
tensor([
[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]
])
The following is tested on pytorch 1.3.1.
pad_token_id = 0
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
pad_mask = ~(batch == pad_token_id)
print(pad_mask)
Output
tensor([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]], dtype=torch.uint8)
Since the [PAD] id is 0 and all other token ids are positive, you can get your desired result with
padding_masking = batch > 0
If you want ints instead of booleans, use
padding_masking.type(torch.int)
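A minimal sketch that combines the two answers and does not rely on the pad id being the smallest token id. One thing to be aware of: on PyTorch 1.2 and later, comparison operators return torch.bool tensors, so the mask prints as True/False unless it is converted to an integer dtype. The variable names int_mask and pad_mask are just for illustration.

```python
import torch

pad_token_id = 0
batch = torch.tensor([
    [1, 2, 0, 0],
    [4, 0, 0, 0],
    [3, 5, 6, 7],
])

# True at real tokens, False at [PAD] positions.
pad_mask = batch != pad_token_id
print(pad_mask.dtype)

# Convert to integers if a 0/1 mask is needed downstream.
int_mask = pad_mask.to(torch.int)
print(int_mask)
```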
I have row indices as a 1D numpy array and a list of numpy arrays holding column indices (the list has the same length as the row indices array). I want to extract the values corresponding to these indices. How can I do it?
This is an example of what I want as output given the input
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
Now, I want my output as a list of numpy arrays or list of lists as
[np.array([2, 1, 1]), np.array([2, 1]), np.array([1, 2, 1])]
My biggest concern is the efficiency. My array A is of dimension 20K x 10K.
As @hpaulj commented, you likely won't be able to avoid looping, e.g.
import numpy as np
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
# make sure the following code is safe...
assert row_ind.shape[0] == len(col_ind)
# 1) select row (A[r, :]), then select elements (cols) [col_ind[i]]:
output = [A[r, :][col_ind[i]] for i, r in enumerate(row_ind)]
# output
# [array([2, 1, 1]), array([2, 1]), array([1, 2, 1])]
Another way to do this is to use np.ix_ (this still requires looping). Use it with caution for very large arrays, though: np.ix_ relies on advanced indexing, which, in contrast to basic slicing, creates a copy of the data instead of a view (see the docs).
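For illustration, here is what the np.ix_ variant might look like, followed by a loop-free alternative that gathers everything in one advanced-indexing call and splits the flat result afterwards. The second approach is not from the answer above; it is an assumption that it helps at 20K x 10K scale, so benchmark it on your own data.

```python
import numpy as np

A = np.array([[2, 1, 1, 0, 0],
              [3, 0, 2, 1, 1],
              [0, 0, 2, 1, 0],
              [0, 3, 3, 3, 0],
              [0, 1, 2, 1, 0],
              [0, 1, 3, 1, 0],
              [2, 1, 3, 0, 1],
              [2, 0, 2, 0, 2],
              [3, 0, 3, 1, 2]])
row_ind = np.array([0, 2, 4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]

# Variant 1: np.ix_ builds an open mesh for one row and its columns;
# the result has shape (1, len(cols)), so flatten it to 1-D.
out_ix = [A[np.ix_([r], cols)].ravel() for r, cols in zip(row_ind, col_ind)]

# Variant 2: a single gather. Repeat each row index once per requested column,
# index A with the paired (row, col) arrays, then split by group length.
lengths = np.array([len(cols) for cols in col_ind])
flat = A[np.repeat(row_ind, lengths), np.concatenate(col_ind)]
out_split = np.split(flat, np.cumsum(lengths)[:-1])

print(out_ix)
print(out_split)
```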