How to get padding mask from input ids? - pytorch

Considering a batch of 4 pre-processed sentences (tokenization, numericalizing and padding) shown below:
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
where 0 states for [PAD] token.
Thus, what would be an efficient approach to generate a padding masking tensor of the same shape as the batch assigning zero at [PAD] positions and assigning one to other input data (sentence tokens)?
In the example above it would be something like:
padding_masking=
tensor([
[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]
])

The following is tested on pytorch 1.3.1.
pad_token_id = 0
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
pad_mask = ~(batch == pad_token_id)
print(pad_mask)
Output
tensor([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]], dtype=torch.uint8)

You can get your desired result with
padding_masking = batch > 0
If you want ints instead of booleans, use
padding_masking.type(torch.int)

Related

How to pad the left side of a list of tensors in pytorch to the size of the largest list?

In pytorch, if you have a list of tensors, you can pad the right side using torch.nn.utils.rnn.pad_sequence
import torch
'for the collate function, pad the sequences'
f = [
[0,1],
[0, 3, 4],
[4, 3, 2, 4, 3]
]
torch.nn.utils.rnn.pad_sequence(
[torch.tensor(part) for part in f],
batch_first=True
)
tensor([[0, 1, 0, 0, 0],
[0, 3, 4, 0, 0],
[4, 3, 2, 4, 3]])
How would I pad the left side? The desired solution is
tensor([[0, 0, 0, 0, 1],
[0, 0, 0, 3, 4],
[4, 3, 2, 4, 3]])
You can reverse the list, do the padding, and reverse the tensor. Would that be acceptable to you? If yes, you can use the code below.
torch.nn.utils.rnn.pad_sequence([
torch.tensor(i[::-1]) for i in f
], # reverse the list and create tensors
batch_first=True) # pad
.flip(dims=[1]) # reverse/flip the padded tensor in first dimension

How to convert the following processing using numpy

I am trying to improve a part of code that is slowing down the whole script significantly, right to the point of making it unfeasible. In particular the piece of code is:
for vectors1 in EC1:
for vectors2 in EC2:
r = np.add(vectors1, vectors2)
for vectors3 in CDC:
result = np.add(r, vectors3).tolist()
if result not in states: # This is what makes it very slow
states.append(result)
EC1, EC2 and CDC are lists that contains as elements, lists of lists, as an example of one iteration, we get:
vectors1: [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]]
vectors2: [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
vectors3: [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]]
result: [[2, 0, 0], [2, 0, 0], [2, 1, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], [4, 1, 0], [2, 1, 0]]
Notice how vectors1, vectors2 and vectors3 correspond to one element from EC1, EC2 and CDC respectively, also how 'result' is the summation from vectors1, vectors2 and vectors3, hence the previous vectors cannot be altered in any manner or sorted, otherwise it would change the expected result from the 'result' variable.
In the first two loops each item in EC1 and EC2 are summed, for later on sum up the previous result with items in CDC. To sum the list of lists from EC1 and EC2 and later on the previous result ('r') with the list of lists from CDC I use numpy.add(). Finally, I reconvert 'result' back to list. So Basically I am managing lists of lists as elements from EC1, EC2 and CDC.
The problem is that I must deal with hundreds of thousands (close to 1M) of results and having to check if a result exists in states list is slowing things drastically, specially since states list grows as more results are processed.
I've tried to keep inside the numpy world by managing everything as numpy arrays. First declaring states as:
states = np.empty([9, 3], int)
Then, concatenating the result numpy array to states numpy array, prior checking if already exists in states:
for vectors1 in EC1:
for vectors2 in EC2:
r = np.add(vectors1, vectors2)
for vectors3 in CDC:
result = np.add(r, vectors3)
if not np.isin(states, result).any():
np.concatenate(states, result, axis=0)
But definitely I am doing something wrong because result is not being concatenated to states, I've also tried without success:
np.append(states, result, axis=0)
Could this be parallelized in some way?
You can do the sums solely in numpy by using broadcasting
res = ((EC1[:,None,:] + EC2).reshape(-1, 1, 3) + CDC).reshape(-1, 3)
given that EC1, EC2 and CDC are arrays.
Afterwards you can filter out the duplicates with
np.unique(res, axis=0)
But like Lucas, I would strongly advise you to filter the arrays beforehand. For your example arrays that would shrink the number of rows in res from 729 to 8.
I'm not sure how large the data are that you are working with but this may speed things up somewhat:
EC1 = [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]]
EC2 = [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
CDC = [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]]
EC1.sort()
EC2.sort()
CDC.sort()
unique_triples = dict()
for v1 in EC1:
for v2 in EC2:
for v3 in CDC:
if str(v1)+str(v2)+str(v3) not in unique_triples: # list not hashable but strings are
unique_triples[str(v1)+str(v2)+str(v3)] = list(np.add(np.add(v1, v2), v3))
The basic idea is to remove duplicate triples of (EC1,EC2, CDC) entries and only do the additions on unique triples, sort the lists so that they are ordered lexicographically
A dictionary has O(1) lookups so these lookups are (maybe) faster.
Whether this is faster or not might depend on how large-and how many unique values of triples-the data are that are being processed.
The 3-vector sums are the values of the dictionary, e.g.
list(unique_triples.values()) for me gives:
>>> list(unique_triples.values())
[[0, 0, 0], [2, 1, 0], [2, 0, 0], [4, 1, 0], [2, 0, 0], [4, 1, 0], [4, 0, 0], [6, 1, 0]]
I did not remove the duplicates in the original lists of lists here. If the application you are looking at allows, it is also likely beneficial to remove these duplicates in EC1, EC2, and CDC before iterating over the values.

Multiclass vs. multilabel fitting

In scikit-learn tutorials, I found the following paragraphs in the section 'Multiclass vs. multilabel fitting'.
I couldn't understand why the following codes generate the given results.
First
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Next
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
Label binarization in scikit-learn will transform your targets and represent them in a label indicator matrix. This label indicator matrix has the shape (n_samples, n_classes) and is composed as follows:
each row represents a sample
each column represents a class
each element is 1 if the sample is labeled with the class and 0 if not
In your first example, you have a target collection with 5 samples and 3 classes. That's why transforming y with LabelBinarizer results in a 5x3 matrix. In your case, [1, 0, 0] corresponds to class 0, [0, 1, 0] corresponds to class 1 and so forth. Notice that in each row there is only one element set to 1, since each sample can have one label only.
In your next example, you have a target collection with 5 samples and 5 classes. That's why transforming y with MultiLabelBinarizer results in a 5x5 matrix. In your case, [1, 1, 0, 0, 0] corresponds to the multilabel [0, 1], [0, 1, 0, 1, 0] corresponds to the multilabel [1, 3] and so forth. The key difference to the first example is that each row can have multiple elements set to 1, because each sample can have multiple labels/classes.
The predicted values you get follow the very same pattern. They are however not equivalent to the original values in y since your classification model has obviously predicted different values. You can check this with the inverse_transform() of the binarizers:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y = np.array([[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]])
y_bin = mlb.fit_transform(y)
# direct transformation
[[1 1 0 0 0]
[1 0 1 0 0]
[0 1 0 1 0]
[1 0 1 1 0]
[0 0 1 0 1]]
# prediction of your classifier
y_pred = np.array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
# inverting the binarized values to the original classes
y_inv = mlb.inverse_transform(y_pred)
# output
[(0, 1), (0, 2), (1, 3), (0, 2), (0, 2)]

Numpy Selecting Elements given row and column index arrays

I have row indices as a 1d numpy array and a list of numpy arrays (list as same length as the size of the row indices array. I want to extract values corresponding to these indices. How can I do it ?
This is an example of what I want as output given the input
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
Now, I want my output as a list of numpy arrays or list of lists as
[np.array([2, 1, 1]), np.array([2, 1]), np.array([1, 2, 1])]
My biggest concern is the efficiency. My array A is of dimension 20K x 10K.
As #hpaulj commented, likely, you won't be able to avoid looping - e.g.
import numpy as np
A = np.array([[2, 1, 1, 0, 0],
[3, 0, 2, 1, 1],
[0, 0, 2, 1, 0],
[0, 3, 3, 3, 0],
[0, 1, 2, 1, 0],
[0, 1, 3, 1, 0],
[2, 1, 3, 0, 1],
[2, 0, 2, 0, 2],
[3, 0, 3, 1, 2]])
row_ind = np.array([0,2,4])
col_ind = [np.array([0, 1, 2]), np.array([2, 3]), np.array([1, 2, 3])]
# make sure the following code is safe...
assert row_ind.shape[0] == len(col_ind)
# 1) select row (A[r, :]), then select elements (cols) [col_ind[i]]:
output = [A[r, :][col_ind[i]] for i, r in enumerate(row_ind)]
# output
# [array([2, 1, 1]), array([2, 1]), array([1, 2, 1])]
Another way to do this could be to use np.ix_ (still requires looping). Use with caution though for very large arrays; np.ix_ uses advanced indexing - in contrast to basic slicing, it creates a copy of the data instead of a view - see the docs.

Batch multiplication/division with scalar in tensorflow

I'm struggling to find a simple way to multiply a batch of tensors with a batch of scalars.
I have a tensor with dimensions N, 4, 4. What I want is to divide tensor in the batch with the value at position 3, 3.
For example, let's say I have:
A = [[[1, 1, 1, 0],
[1, 1, 1, 0],
[1, 1, 1, 0],
[0, 0, 0, a]],
[[1, 1, 1, 0],
[1, 1, 1, 0],
[1, 1, 1, 0],
[0, 0, 0, b]]
What I want is to obtain the following:
B = [[[1/a, 1/a, 1/a, 0],
[1/a, 1/a, 1/a, 0],
[1/a, 1/a, 1/a, 0],
[0, 0, 0, 1]],
[[1/b, 1/b, 1/b, 0],
[1/b, 1/b, 1/b, 0],
[1/b, 1/b, 1/b, 0],
[0, 0, 0, 1]]
You should just do:
B = A / A[:, 3:, 3:]

Resources