Difference between src_mask and src_key_padding_mask - pytorch

I am having a difficult time understanding transformers. Everything is getting clearer bit by bit, but one thing that makes me scratch my head is:
what is the difference between src_mask and src_key_padding_mask, which are passed as arguments to the forward function in both the encoder layer and the decoder layer?
https://pytorch.org/docs/master/_modules/torch/nn/modules/transformer.html#Transformer

Difference between src_mask and src_key_padding_mask
The general thing to notice is the difference between the use of the *_mask tensors and the *_key_padding_mask tensors.
Inside the transformer, when attention is computed, we usually get a square intermediate tensor with all the comparisons,
of size [Tx, Tx] (for the input to the encoder), [Ty, Ty] (for the shifted output - one of the inputs to the decoder)
and [Ty, Tx] (for the memory mask - the attention between the encoder output/memory and the input to the decoder/shifted output).
So these are the uses of each of the masks in the transformer
(note the notation from the PyTorch docs is as follows: Tx=S is the source sequence length (e.g. the max length over the input batch),
Ty=T is the target sequence length (e.g. the max length over the target batch),
B=N is the batch size,
and D=E is the feature dimension):
src_mask [Tx, Tx] = [S, S] – the additive mask for the src sequence (optional).
This is applied when doing attn_src + src_mask. I'm not sure of a concrete example input - see tgt_mask for one -
but the typical use is to add -inf, so you could mask out parts of the src attention that way if desired.
If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged.
If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged.
If a FloatTensor is provided, it will be added to the attention weight.
tgt_mask [Ty, Ty] = [T, T] – the additive mask for the tgt sequence (optional).
This is applied when doing attn_tgt + tgt_mask. A typical use is the triangular (causal) mask that keeps the decoder from cheating by attending to future positions.
Since the tgt is right shifted (the first token is the start-of-sequence embedding SOS/BOS), the first row has a zero only in the first position while the remaining positions are -inf. See the concrete example in the appendix.
If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged.
If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged.
If a FloatTensor is provided, it will be added to the attention weight.
memory_mask [Ty, Tx] = [T, S] – the additive mask for the encoder output (optional).
This is applied when doing attn_memory + memory_mask.
I'm not sure of an example use, but as before, adding -inf sets some of the attention weights to zero.
If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged.
If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged.
If a FloatTensor is provided, it will be added to the attention weight.
src_key_padding_mask [B, Tx] = [N, S] – the ByteTensor mask for src keys per batch (optional).
Since your src usually contains sequences of different lengths, it's common to mask out the padding vectors you appended at the end.
For this you mark, for each example in your batch, which positions are padding.
See the concrete example in the appendix.
If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged.
If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged.
If a FloatTensor is provided, it will be added to the attention weight.
tgt_key_padding_mask [B, Ty] = [N, T] – the ByteTensor mask for tgt keys per batch (optional).
Same as the previous one.
See the concrete example in the appendix.
If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged.
If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged.
If a FloatTensor is provided, it will be added to the attention weight.
memory_key_padding_mask [B, Tx] = [N, S] – the ByteTensor mask for memory keys per batch (optional).
Same as the previous one.
See the concrete example in the appendix.
If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged.
If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged.
If a FloatTensor is provided, it will be added to the attention weight.
Appendix
Examples from pytorch tutorial (https://pytorch.org/tutorials/beginner/translation_transformer.html):
1 src_mask example
src_mask = torch.zeros((src_seq_len, src_seq_len), device=DEVICE).type(torch.bool)
returns a tensor of booleans of size [Tx, Tx]:
tensor([[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False]])
2 tgt_mask example
mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1)
mask = mask.transpose(0, 1).float()
mask = mask.masked_fill(mask == 0, float('-inf'))
mask = mask.masked_fill(mask == 1, float(0.0))
generates the lower-triangular (causal) mask for the right-shifted output, which is the input to the decoder.
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
-inf, -inf, -inf],
[0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
-inf, -inf, -inf],
[0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
-inf, -inf, -inf],
...,
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., -inf],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0.]])
Usually the right-shifted output has the BOS/SOS token at the beginning, and the tutorial gets the right shift simply by prepending that BOS/SOS token and then trimming the last element with tgt_input = tgt[:-1, :].
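As a side note, PyTorch ships a helper that builds exactly this causal mask, so you don't have to write the triu/masked_fill steps yourself. A minimal sketch (in recent PyTorch versions this is exposed as a static method; in older ones you call it on an nn.Transformer instance):
import torch.nn as nn

sz = 5
# 0.0 on and below the diagonal, -inf above it - the same additive mask as above
tgt_mask = nn.Transformer.generate_square_subsequent_mask(sz)
print(tgt_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0., 0., -inf, -inf, -inf],
#         [0., 0., 0., -inf, -inf],
#         [0., 0., 0., 0., -inf],
#         [0., 0., 0., 0., 0.]])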
3 _padding example
The padding masks just mask out the padding tokens appended at the end of each sequence.
The src padding mask is usually the same as the memory padding mask.
The tgt has its own sequences and thus its own padding mask.
Example:
src_padding_mask = (src == PAD_IDX).transpose(0, 1)
tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
memory_padding_mask = src_padding_mask
Output:
tensor([[False, False, False, ..., True, True, True],
...,
[False, False, False, ..., True, True, True]])
Note that a False means there is no padding token at that position (so yes, use that value in the transformer forward pass) and a True means there is a padding token (so mask it out so the transformer forward pass is not affected by it).
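Putting it all together, here is a minimal sketch of how these masks are passed to nn.Transformer.forward, following the tutorial's (S, N, E) layout; the sizes and the random stand-in embeddings are made up for illustration:
import torch
import torch.nn as nn

PAD_IDX = 0
S, T, N, E = 10, 9, 4, 32                  # src len, tgt len, batch size, embed dim

model = nn.Transformer(d_model=E, nhead=4)

src = torch.randint(1, 100, (S, N))        # source token ids, (S, N); 0 reserved for padding
tgt = torch.randint(1, 100, (T, N))        # target token ids, (T, N)
src[-2:, 0] = PAD_IDX                      # pretend the first sequence in the batch is shorter
tgt[-2:, 0] = PAD_IDX
src_emb = torch.randn(S, N, E)             # stand-in for embedding + positional encoding
tgt_emb = torch.randn(T, N, E)

src_mask = torch.zeros((S, S), dtype=torch.bool)            # no masking within src
tgt_mask = model.generate_square_subsequent_mask(T)         # causal mask, (T, T)
src_padding_mask = (src == PAD_IDX).transpose(0, 1)         # (N, S)
tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)         # (N, T)

out = model(src_emb, tgt_emb,
            src_mask=src_mask, tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=src_padding_mask)       # memory padding = src padding
print(out.shape)   # torch.Size([9, 4, 32])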
The answers are sort of spread around, but I found only these 3 references useful
(the docs for the separate layers weren't very useful, honestly):
long tutorial: https://pytorch.org/tutorials/beginner/translation_transformer.html
MHA docs: https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention
transformer docs: https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html

I must say the PyTorch implementation is a bit confusing, as it contains too many mask parameters. But I can shed light on the two mask parameters that you are referring to. Both src_mask and src_key_padding_mask are used in the MultiheadAttention mechanism. According to the documentation of MultiheadAttention:
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention.
attn_mask – 2D or 3D mask that prevents attention to certain positions.
As you know from the paper Attention Is All You Need, MultiheadAttention is used in both the Encoder and the Decoder. However, in the Decoder there are two types of MultiheadAttention: one is called Masked MultiheadAttention and the other is the regular MultiheadAttention. To accommodate both of these, PyTorch uses the two above-mentioned parameters in its MultiheadAttention implementation.
So, long story short:
attn_mask and key_padding_mask are used in the Encoder's MultiheadAttention and the Decoder's Masked MultiheadAttention.
memory_mask is used in Decoder's MultiheadAttention mechanism as pointed out here.
Looking into the implementation of MultiheadAttention might help you.
As you can see from here and here, first src_mask is used to block specific positions from attending and then key_padding_mask is used to block attending to pad tokens.
Note. Answer updated based on #michael-jungo's comment.
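For concreteness, here is a minimal sketch of a single nn.MultiheadAttention call that uses both parameters (shapes and mask values are made up for illustration):
import torch
import torch.nn as nn

S, N, E = 5, 2, 8                        # sequence length, batch size, embed dim
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2)
x = torch.randn(S, N, E)                 # (seq_len, batch, embed_dim) layout

# attn_mask: (S, S) causal mask, True = "this position may not be attended to"
attn_mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

# key_padding_mask: (N, S), True marks padding positions per batch element
key_padding_mask = torch.tensor([[False, False, False, True, True],
                                 [False, False, False, False, True]])

out, weights = mha(x, x, x,
                   attn_mask=attn_mask,
                   key_padding_mask=key_padding_mask)
print(out.shape)       # torch.Size([5, 2, 8])
print(weights.shape)   # torch.Size([2, 5, 5]) - (batch, tgt_len, src_len), averaged over heads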

To give a small example, consider that I want to build a sequential recommender, i.e. given the items a user has purchased up to time 't', predict the next item at 't+1':
u1 - [i1, i2, i7]
u2 - [i2, i5]
u3 - [i6, i7, i1, i2]
For this task, I could use a transformer, where I would make the sequences equal length by padding them with 0's on the left.
u1 - [0, i1, i2, i7]
u2 - [0, 0, i2, i5]
u3 - [i6, i7, i1, i2]
I will use key_padding_mask to tell PyTorch that the 0's should be ignored.
Now, consider user u3 where given [i6] I want to predict [i7], and later given [i6, i7] I want to predict [i1], i.e. I want causal attention, so that the attention doesn't peek at future elements. For this, I will use attn_mask. Remember that in PyTorch's boolean convention True means "not allowed to attend", so for user u3 the attn_mask will look like
[[False, True , True , True ],
[False, False, True , True ],
[False, False, False, True ],
[False, False, False, False]]
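A small sketch of how both masks could be built for this recommender batch (the item ids, tensor layout, and padding id 0 are just placeholders):
import torch

PAD = 0
# Left-padded item-id sequences for u1, u2, u3 (batch_first layout), 0 = padding
batch = torch.tensor([[0, 1, 2, 7],      # u1: [0, i1, i2, i7]
                      [0, 0, 2, 5],      # u2: [0, 0, i2, i5]
                      [6, 7, 1, 2]])     # u3: [i6, i7, i1, i2]

# key_padding_mask: (B, T), True where the key is a padding token and must be ignored
key_padding_mask = (batch == PAD)

# attn_mask: (T, T), True above the diagonal = "not allowed to attend to future items"
T = batch.size(1)
attn_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

print(key_padding_mask)
print(attn_mask)
# Both can then be passed to nn.MultiheadAttention / nn.TransformerEncoderLayer as
# key_padding_mask / src_key_padding_mask and attn_mask / src_mask respectively.
One thing to watch out for: with left padding plus a causal mask, the earliest query positions of a short sequence (e.g. position 0 of u1) have every key masked out, which can make the attention softmax return NaNs; padding on the right avoids that.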

Related

Scikit-learn preprocessing: cannot understand the output when using the min_frequency argument in the OneHotEncoder class

Consider the below array t. When using min_frequency kwarg in the OneHotEncoder class, I cannot understand why the category snake is still present when transforming a new array. There are 2/40 events of this label. Should the shape of e be (4,3) instead?
sklearn.__version__ == '1.1.1'
import numpy as np
from sklearn.preprocessing import OneHotEncoder

t = np.array([['dog'] * 8 + ['cat'] * 20 + ['rabbit'] * 10 +
              ['snake'] * 2], dtype=object).T
enc = OneHotEncoder(min_frequency=4/40, sparse=False).fit(t)
print(enc.infrequent_categories_)
# [array(['snake'], dtype=object)]
e = enc.transform(np.array([['dog'], ['cat'], ['dog'], ['snake']]))
array([[0., 1., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.]]) # snake is present?
Check out enc.get_feature_names_out():
array(['x0_cat', 'x0_dog', 'x0_rabbit', 'x0_infrequent_sklearn'],
      dtype=object)
"snake" isn't considered its own category anymore, but is lumped into the infrequent category. If you added some other rare categories, they'd be assigned to the same column, and if you additionally set handle_unknown="infrequent_if_exist", unseen categories would also be encoded into that column.

How do I mask a feed forward layer based on tensor in pytorch?

I have a really simple network with 2 inputs (x and m).
x is size 100
m is size 3
My network is simply...
f_1 = linear_layer(x)
f_2 = linear_layer(f_1)
f_3 = linear_layer(f_1)
f_4 = linear_layer(f_1)
f_5 = softmax(linear_layer(sum(f_2, f_3, f_4)))
Based on the vector m, I want to zero out and ignore f_2, f_3, f_4 in the final sum and the resulting gradient calculation. Is there a way to create a mask based on vector m to achieve this?
Ok, here is how you do it (a list of layers can make it more generic - see the sketch after the example):
# example input and output
x = torch.ones(5)
y = torch.zeros(3)
# mask tensor
mask = torch.tensor([0, 1, 0])
# initial layer
z0 = torch.nn.Linear(5, 5)
# layers to potentially mask
z1 = torch.nn.Linear(5, 3)
z2 = torch.nn.Linear(5, 3)
z3 = torch.nn.Linear(5, 3)
# defines how the data passes through the layers, specific mask element is applied to each of the maskable layers
layer1_output = z0(x)
layer2_output = mask[0]*z1(layer1_output) + mask[1]*z2(layer1_output) + mask[2]*z3(layer1_output)
# loss function
loss = torch.nn.functional.binary_cross_entropy_with_logits(layer2_output, y)
# run it and see
loss.backward()
print(z0.weight.grad)
print(z1.weight.grad)
print(z2.weight.grad)
print(z3.weight.grad)
As shown below, the masking tensor effectively selects which sub-networks contribute to the output and receive gradients, based on the corresponding mask element:
tensor([[ 0.0354, 0.0354, 0.0354, 0.0354, 0.0354],
[-0.0986, -0.0986, -0.0986, -0.0986, -0.0986],
[-0.0372, -0.0372, -0.0372, -0.0372, -0.0372],
[-0.0168, -0.0168, -0.0168, -0.0168, -0.0168],
[-0.0133, -0.0133, -0.0133, -0.0133, -0.0133]])
tensor([[-0., 0., 0., -0., 0.],
[-0., 0., 0., -0., 0.],
[-0., 0., 0., -0., 0.]])
tensor([[-0.0422, 0.1314, 0.1108, -0.1644, 0.0906],
[-0.0240, 0.0747, 0.0630, -0.0934, 0.0515],
[-0.0251, 0.0781, 0.0659, -0.0977, 0.0539]])
tensor([[-0., 0., 0., -0., 0.],
[-0., 0., 0., -0., 0.],
[-0., 0., 0., -0., 0.]])
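For the more generic version hinted at above, the same masking can be written over a list of layers; a minimal self-contained sketch:
import torch

x = torch.ones(5)
mask = torch.tensor([0., 1., 0.])                        # which branches to keep

z0 = torch.nn.Linear(5, 5)
branches = [torch.nn.Linear(5, 3) for _ in range(3)]     # z1, z2, z3 as a list

h = z0(x)
# Weight each branch output by its mask entry; masked branches contribute zero
# to the sum and therefore receive zero gradient.
out = sum(m * layer(h) for m, layer in zip(mask, branches))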

Pytorch find unique vectors in tensor

I have a tensor containing binary values, e.g.
T1 = torch.tensor([[1., 0., 1.],
[0., 1., 0.],
[1., 0., 1.]])
I need to convert this to:
tensor([[1., 0., 1.],
[0., 1., 0.]])
I looked into torch.unique, but it seems to only work on individual values.
Is there a way to do the unique operation across entire vectors (rows)?
Although the PyTorch documentation is not very clear about it, the dim parameter can achieve this. I found this post solving the issue: Delete duplicated rows in torch.tensor.
Hence
torch.unique(T1, dim=0)
would solve the problem.
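A quick check of that call (note that torch.unique sorts the rows, so they may not come back in their original order):
import torch

T1 = torch.tensor([[1., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 1.]])

print(torch.unique(T1, dim=0))
# tensor([[0., 1., 0.],
#         [1., 0., 1.]])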

Need to compare two arrays' elements inside a for loop in Python 3, getting error "The truth value of an array ..."

I have a numpy array in the form:
y_sol =
[[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
...
[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]]
and I need to translate it into categorical string values using the correspondence given by a list of tuples:
transf_codes = [('Alert', [1., 0., 0.]),
('Neutral', [0., 1., 0.]),
('Urgent', [0., 0., 1.])]
Note that I haven't used a dictionary here, to avoid the complication of searching for keys by their values.
Anyhow, I've tried the following code to get the job done:
for i in np.arange(len(y_sol)-1):
    for j in np.arange(3):
        if np.equal(transf_codes[j][1], y_sol[i].all()):  # <- error line
            y_categ[i] = transf_codes[j][0]
and I get the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
In the "if" line above, the more natural form >>> transf_codes[j][1] == y_sol[i] <<<, with or without .all() or .any(), raises the same error.
What is the right and best approach to compare arrays, lists, etc. element-wise in an if-statement?
Many thanks in advance.
The error you are seeing happens whenever numpy tries to cast an array as a boolean. It doesn't understand how to do so, so it throws that error.
So when you do something like if (a == b) with a, b being arrays, you wind up with an error.
However, a == b will yield a boolean array, with the element-wise comparison.
Note that for that to happen a, b have to be numpy arrays. List behaviour is different.
One way to use the boolean array is with the .all() method, which I see you used, but not in the right place: you currently call .all() on y_sol[i] alone.
So the following code and output are correct.
import numpy as np

y_sol = np.array([[0., 0., 1.],
                  [0., 0., 1.],
                  [0., 0., 1.],
                  [1., 0., 0.],
                  [0., 1., 0.],
                  [1., 0., 0.]])

transf_codes = [('Alert', [1., 0., 0.]),
                ('Neutral', [0., 1., 0.]),
                ('Urgent', [0., 0., 1.])]

for i in np.arange(len(y_sol)):   # len(y_sol), not len(y_sol)-1, so the last row is included
    for j in np.arange(3):
        if (transf_codes[j][1] == y_sol[i]).all():  # compare element-wise, then reduce with .all()
            print(f'y_sol[i]={y_sol[i]}, transf_codes={transf_codes[j][0]}')
And this outputs:
y_sol[i]=[0. 0. 1.], transf_codes=Urgent
y_sol[i]=[0. 0. 1.], transf_codes=Urgent
y_sol[i]=[0. 0. 1.], transf_codes=Urgent
y_sol[i]=[1. 0. 0.], transf_codes=Alert
y_sol[i]=[0. 1. 0.], transf_codes=Neutral
y_sol[i]=[1. 0. 0.], transf_codes=Alert
Note that I used .all() after having compared the vectors.
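Since each row of y_sol is one-hot, a vectorized alternative (a sketch that assumes the same Alert/Neutral/Urgent encoding as above) is to skip the loops entirely and index an array of labels with the argmax of each row:
import numpy as np

y_sol = np.array([[0., 0., 1.],
                  [1., 0., 0.],
                  [0., 1., 0.]])

# Position of the 1 in each row: 0 -> Alert, 1 -> Neutral, 2 -> Urgent
labels = np.array(['Alert', 'Neutral', 'Urgent'])
y_categ = labels[y_sol.argmax(axis=1)]
print(y_categ)   # ['Urgent' 'Alert' 'Neutral']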

One-vs-Rest algorithm and out-of-the-box multiclass algorithm gives different results

Can someone explain why the OneVsRestClassifier gives a different result than the out-of-the-box algorithm?
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
X = [[1,2],[1,3],[4,2],[2,3],[1,4]]
y = [1,2,3,2,1]
X_pred = [[2,4], [5,4], [3,7]]
dummy_clf = OneVsRestClassifier(SGDClassifier(verbose=0, class_weight="auto", loss='modified_huber', random_state=0)) # first case
#dummy_clf = SGDClassifier(verbose=0, class_weight="auto", loss='modified_huber', random_state=0) # second case
dummy_clf.fit(X, y)
dummy_clf.predict_proba(X_pred)
First case:
array([[ 0.5, 0.5, 0. ],
[ 0. , 1. , 0. ],
[ 0.5, 0.5, 0. ]])
Second case:
array([[ 0., 1., 0.],
[ 0., 1., 0.],
[ 0., 1., 0.]])
OneVsRest gives you the probability of X_pred for all of the classes, thus the first and last test cases have a value for multiple classes (that sum to 1). The classifier is trained on all classes.
OneVsOne trains a classifier on all class pairs. For all class pairs, the class predicted most is the winner, so you only get one prediction per instance.
