Can I apply softmax only on specific output neurons? - pytorch

I am building an Actor-Critic neural network model in pytorch in order to train an agent to play the game of Quoridor (hopefully). For this reason, I have a neural network with two heads, one for the actor output which does a softmax on all the possible moves and one for the critic output which is just one neuron (for regressing the value of the input state).
Now, in Quoridor, most of the time not all moves will be legal, so I am wondering if I can exclude the output neurons on the actor's head that correspond to illegal moves for the input state, e.g. by passing a list of indices of all the neurons that correspond to legal moves. In other words, I do not want these outputs to be summed in the denominator of the softmax.
Is there functionality like this in pytorch (because I cannot find any)? Should I attempt to implement such a softmax myself (kinda scared to, pytorch probably knows best, and I've been advised to use LogSoftmax as well)?
Furthermore, do you think this approach to dealing with illegal moves is good? Or should I just let the agent guess illegal moves and penalize it (negative reward) for doing so, in the hope that eventually it will not pick illegal moves?
Or should I let the softmax be over all the outputs and then just set illegal ones to zero? The rest won't sum to 1 but maybe I can solve that by plain normalization (i.e. dividing by the L2 norm)?

An easy solution would be to mask out illegal moves with a large negative value. This practically forces their (log)softmax values to be very low (example below).
# 3 dummy actions for a batch size of 2
>>> actions = torch.rand(2, 3)
>>> actions
tensor([[0.9357, 0.2386, 0.3264],
[0.0179, 0.8989, 0.9156]])
# dummy mask assigning 0 to valid actions and 1 to invalid ones
>>> mask = torch.randint(low=0, high=2, size=(2, 3))
>>> mask
tensor([[1, 0, 0],
[0, 0, 0]])
# set actions marked as invalid to very large negative value
>>> actions = actions.masked_fill_(mask.eq(1), value=-1e10)
>>> actions
tensor([[-1.0000e+10, 2.3862e-01, 3.2636e-01],
[ 1.7921e-02, 8.9890e-01, 9.1564e-01]])
# softmax assigns no probability mass to illegal actions
>>> actions.softmax(dim=-1)
tensor([[0.0000, 0.4781, 0.5219],
[0.1704, 0.4113, 0.4183]])
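The question asks about passing a list of legal move indices rather than a ready-made mask. A minimal sketch of that variant follows, assuming one list of legal indices per batch row. The helper name `masked_log_softmax` and the choice of `torch.finfo(...).min` as the fill value are illustrative assumptions, not an existing PyTorch API.

```python
import torch

def masked_log_softmax(logits, legal_moves):
    """logits: (batch, n_moves); legal_moves: one list of legal indices per row."""
    legal = torch.zeros_like(logits, dtype=torch.bool)
    for row, idx in enumerate(legal_moves):
        legal[row, idx] = True
    # Fill illegal entries with the most negative finite value of the dtype
    # (an assumption; the answer above uses -1e10 for the same purpose).
    fill = torch.finfo(logits.dtype).min
    return logits.masked_fill(~legal, fill).log_softmax(dim=-1)

logits = torch.tensor([[0.9357, 0.2386, 0.3264],
                       [0.0179, 0.8989, 0.9156]])
log_probs = masked_log_softmax(logits, [[1, 2], [0, 1, 2]])
probs = log_probs.exp()                          # illegal moves get probability 0
move = torch.multinomial(probs, num_samples=1)   # sample one move per row
```

Working with log-probabilities keeps the policy-gradient loss numerically stable, and exponentiating them recovers a distribution that sums to 1 over the legal moves only.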

I'm not qualified to say if this is a good idea, but I had the same one and ended up implementing it.
The code uses the Rust bindings for PyTorch (tch-rs), so it should be directly translatable to Python-based PyTorch.
/// As log_softmax(dim=1) on a 2d tensor, but takes a {0, 1} `filter` of the same shape as `xs`
/// and has the softmax only look at values where filter[idx] = 1.
///
/// The output is 0 where the filter is 0.
pub fn filtered_log_softmax(xs: &Tensor, filter: &Tensor) -> Tensor {
    // We are calculating `log_softmax(xs)` except that we only want to consider
    // the values of xs where the corresponding `filter` bit is set to 1.
    //
    // log_softmax on one element of the batch = for_each_i log(e^xs[i] / sum_j e^xs[j])
    //
    // To filter that we need to remove (zero out) elements that are being filtered both after the log is
    // taken, and before summing into the denominator. We can do this with two multiplications:
    //
    // filtered_log_softmax = for_each_i filter[i] * log(e^xs[i] / sum_j filter[j] * e^xs[j])
    //
    // This is mathematically correct, but it turns out there's a numeric stability trick we need to do;
    // without it we're seeing NaNs. Sourcing the trick from: https://stackoverflow.com/a/52132033
    //
    // We can do the same transformation here, and come out with the following expression:
    //
    // let xs_max = max_i xs[i]
    // for_each_i filter[i] * (xs[i] - xs_max - log(sum_j filter[j] * e^(xs[j] - xs_max)))
    //
    // Keep in mind that the actual implementation below is further vectorized over an initial batch dimension.
    let (xs_max, _) = xs.max_dim(1, true);
    let xs_offset = xs - xs_max;
    // TODO: Replace with Tensor::linalg_vecdot(&filter, &xs_offset.exp(), 1).log();
    // when we update tch-rs (linalg_vecdot is new in pytorch 1.13)
    let constant_sub = (filter * &xs_offset.exp()).sum_to_size(&[xs.size()[0], 1]).log();
    filter * (&xs_offset - constant_sub)
}
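Since the answer says the Rust code should translate directly, here is a hedged sketch of what a Python version might look like. It assumes `xs` and `filt` are float tensors of shape (batch, n) and that `filt` contains only 0s and 1s. It is a line-by-line translation for illustration, not code from the answer's author.

```python
import torch

def filtered_log_softmax(xs, filt):
    """log_softmax over dim=1 restricted to entries where filt == 1; 0 elsewhere."""
    # Subtract the row maximum before exponentiating (numerical stability trick).
    xs_max = xs.max(dim=1, keepdim=True).values
    xs_offset = xs - xs_max
    # log of the filtered denominator: log(sum_j filt[j] * e^(xs[j] - xs_max))
    constant_sub = (filt * xs_offset.exp()).sum(dim=1, keepdim=True).log()
    return filt * (xs_offset - constant_sub)

xs = torch.tensor([[1.0, 2.0, 3.0],
                   [0.5, 0.1, -0.3]])
filt = torch.tensor([[0.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0]])
out = filtered_log_softmax(xs, filt)
```

One caveat worth checking in practice: the row maximum is taken over all entries, including filtered ones. If a filtered entry is far larger than every kept entry, the filtered sum can underflow to zero and produce NaNs, so taking the maximum over kept entries only may be safer.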

Related

Why must embed dimension be divisible by num of heads in MultiheadAttention?

I am learning the Transformer. Here is the pytorch document for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
Why require the constraint: embed_dim must be divisible by num_heads? If we go back to the equation
Assume:
Q, K, V are n x embed_dim matrices; all the weight matrices W are embed_dim x head_dim,
Then, the concat [head_1, ..., head_h] will be an n x (num_heads*head_dim) matrix;
W^O with size (num_heads*head_dim) x embed_dim
[head_1, ..., head_h] * W^O will become an n x embed_dim output
I don't know why we require that embed_dim be divisible by num_heads.
Let's say we have num_heads=10000; the results are the same, since the matrix-matrix product will absorb this information.
From what I understood, it is a simplification they have added to keep things simple. Theoretically, we can implement the model like you proposed (similar to the original paper).
In the pytorch documentation, they have briefly mentioned it.
Note that `embed_dim` will be split across `num_heads` (i.e. each head will have dimension `embed_dim` // `num_heads`)
Also, if you look at the Pytorch implementation, you can see it is a bit different (optimised, in my view) compared to the originally proposed model. For example, they use MatMul instead of Linear, and the Concat layer is omitted. Refer to the figure below, which shows the first encoder (with batch size 32, 10 words, 512 features).
P.s:
If you need to see the model params (like the above image), this is the code I used.
import torch
transformer_model = torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=1,num_decoder_layers=1,dim_feedforward=11) # change params as necessary
tgt = torch.rand((20, 32, 512))
src = torch.rand((11, 32, 512))
torch.onnx.export(transformer_model, (src, tgt), "transformer_model.onnx")
When you have a sequence of seq_len x emb_dim (i.e. 20 x 8) and you want to use num_heads=2, the sequence will be split along the emb_dim dimension. Therefore you get two 20 x 4 sequences. You want every head to have the same shape, and if emb_dim isn't divisible by num_heads this won't work. Take for example a sequence 20 x 9 and again num_heads=2. Then you would get 20 x 4 and 20 x 5, which are not the same dimension.
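The split described above can be sketched with plain tensor operations. This is an illustration of the shape argument only, not the actual code path inside `torch.nn.MultiheadAttention`; the variable names are made up for the example.

```python
import torch

seq_len, num_heads = 20, 2

# emb_dim = 8 splits evenly into two heads of width 4
x = torch.rand(seq_len, 8)
heads = x.view(seq_len, num_heads, 8 // num_heads).transpose(0, 1)
print(heads.shape)  # torch.Size([2, 20, 4])

# emb_dim = 9 cannot be split evenly: the chunks have different widths,
# so they cannot be stacked into a single (num_heads, seq_len, head_dim) tensor
uneven = torch.rand(seq_len, 9).chunk(num_heads, dim=1)
print([tuple(c.shape) for c in uneven])  # [(20, 5), (20, 4)]
```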

Retrieve elements from a 3D tensor with a 2D index tensor

I am playing around with GPT2 and I have 2 tensors:
O: An output tensor of shape (B, S-1, V), where B is the batch size, S is the number of timesteps and V is the vocabulary size. This is the output of a generative model and is softmaxed along the 2nd dimension.
L: A 2D tensor shaped (B, S-1) where each element is the index of the correct token for each timestep for each sample. This is basically the labels.
I want to extract the predicted probability of the corresponding correct token from tensor O based on tensor L, such that I end up with a 2D tensor shaped (B, S-1). Is there an efficient way of doing this apart from using loops?
For reference, I based my answer on this Medium article.
Essentially, your answer lies in torch.gather, assuming that both of your tensors are just regular torch.Tensors (or can be converted to one).
import torch
# Specify some arbitrary dimensions for now
B = 3
V = 6
S = 4
# Make example reproducible
torch.manual_seed(42)
# L necessarily has to be a torch.LongTensor, otherwise indexing will fail.
L = torch.randint(0, V, size=[B, S])
O = torch.rand([B, S, V])
# Now collect the results. The index must have the same number of
# dimensions as O, so add a trailing axis to L before gathering along dim=2.
X = torch.gather(O, dim=2, index=L.unsqueeze(dim=2))
# Make sure X has no "unnecessary" dimension
X = X.squeeze(dim=2)
It is a bit difficult to see whether this produces exactly the correct results, which is why I included a random seed that makes the example deterministic, so you can easily verify that it gets you the desired results. For clarification, one could also use a lower-dimensional tensor, for which it becomes clearer what exactly torch.gather does.
Note that torch.gather also allows you to index multiple indexes in the same row theoretically. Meaning if you instead got a multiclass example for which multiple values are correct, you could similarly use a tensor L of shape [B, S, number_of_correct_samples].
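To make the behaviour of torch.gather easier to check by eye, here is a small sketch with hand-picked values (B=1, S=2, V=3), compared against an explicit loop. The numbers are arbitrary and chosen only for illustration.

```python
import torch

# One sample, two timesteps, vocabulary of three tokens
O = torch.tensor([[[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]]])
L = torch.tensor([[1, 0]])  # correct token index at each timestep

X = torch.gather(O, dim=2, index=L.unsqueeze(2)).squeeze(2)
print(X)  # tensor([[0.7000, 0.5000]])

# Same result with explicit loops, for comparison
B, S, _ = O.shape
looped = torch.tensor([[O[b, s, L[b, s]].item() for s in range(S)]
                       for b in range(B)])
```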

How do I build a probability matrix output layer in Keras

Suppose I need to build a network that takes two inputs:
A patient's information, represented as an array of features
Selected treatment, represented as one-hot encoded array
Now how do I build a network that outputs a 2D probability matrix A, where A[i,j] represents the probability that the patient will end up in state j under treatment i? Let's say there are n possible states, and under any treatment, the total probability of all n states sums to 1.
I wanted to do this because I was motivated by a similar network, where the inputs are the same as above, but the output is a 1d array representing the expected lifetime after treatment i is delivered. And such network is built as follows:
import tensorflow as tf
from tensorflow import keras

def default_dense(feature_shape, n_treatments):
    feature_input = keras.layers.Input(feature_shape)
    # one-hot encoding of the selected treatment
    treatment_input = keras.layers.Input((n_treatments,))
    hidden_1 = keras.layers.Dense(16, activation='relu')(feature_input)
    hidden_2 = keras.layers.Dense(16, activation='relu')(hidden_1)
    output = keras.layers.Dense(n_treatments)(hidden_2)
    # zero out the outputs of all treatments except the selected one
    output_on_action = keras.layers.multiply([output, treatment_input])
    model = keras.models.Model([feature_input, treatment_input], output_on_action)
    model.compile(optimizer=tf.optimizers.Adam(0.001), loss='mse')
    return model
And the training is simply
model.fit(x = [features, encoded_treatments], y = encoded_treatments * lifetime[:, np.newaxis], verbose = 0)
This is super handy because when predicting, I can use np.ones() as the encoded_treatments, and the network gives expected lifetimes under all treatments, thus choosing the best one is one-step. Certainly I can create multiple networks, each for a treatment, but it would be much less efficient.
Now the question is, can I do the same for a probability output?
I have figured it out myself. The trick is to use RepeatVector() and Permute() layers to generate a matrix mask for treatments.
The output is an element-wise Multiply() of the mask and a Softmax() of same size.
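The answer gives no code, so the following is only one possible reading of it. It assumes a hypothetical `n_states` parameter, a single hidden layer of 16 units, and a Dense layer reshaped to (n_treatments, n_states) before the softmax. The function name `probability_matrix_model` is also made up for this sketch.

```python
import numpy as np
from tensorflow import keras

def probability_matrix_model(n_features, n_treatments, n_states):
    feature_input = keras.layers.Input((n_features,))
    treatment_input = keras.layers.Input((n_treatments,))
    hidden = keras.layers.Dense(16, activation="relu")(feature_input)
    # One row of logits per treatment, softmaxed over the states
    logits = keras.layers.Dense(n_treatments * n_states)(hidden)
    logits = keras.layers.Reshape((n_treatments, n_states))(logits)
    probs = keras.layers.Softmax(axis=-1)(logits)
    # (batch, n_treatments) -> (batch, n_states, n_treatments)
    mask = keras.layers.RepeatVector(n_states)(treatment_input)
    # -> (batch, n_treatments, n_states), matching the shape of probs
    mask = keras.layers.Permute((2, 1))(mask)
    output = keras.layers.Multiply()([probs, mask])
    return keras.models.Model([feature_input, treatment_input], output)

model = probability_matrix_model(n_features=4, n_treatments=3, n_states=5)
features = np.random.rand(2, 4)
# Passing all ones as the treatment input returns every treatment's row at once
A = model.predict([features, np.ones((2, 3))], verbose=0)
```

As with the lifetime network in the question, a one-hot treatment input keeps only the selected treatment's row, which is what the training targets would be masked with.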

Keras - passing different parameter for different data point onto Lambda Layer

I am working on a CNN model in Keras/TF background. At the end of final convolutional layer, I need to pool the output maps from the filters. Instead of using GlobalAveragePooling or any other sort of pooling, I had to pool according to time frames which exist along the width of the output map.
Say a sample output from one filter is n x m, with n being the time frames and m the outputs along the features. I only need to pool the output from frames n1 to n2, where n1 and n2 <= n. So my output slice is (n2-n1)*m, on which I will apply pooling. I came across the Lambda layer of Keras to do this, but I am stuck at the point where n1 and n2 will be different for each data point. So my question is: how can I pass a custom argument for each data point to a Lambda layer? Or am I approaching this in the wrong way?
A sample snippet:
# for slicing a tensor
def time_based_slicing(x, crop_at):
dim = x.get_shape()
len_ = crop_at[1] - crop_at[0]
return tf.slice(x, [0, crop_at[0], 0, 0], [1, len_, dim[2], dim[3]])
# for output shape
def return_out_shape(input_shape):
return tuple([input_shape[0], None, input_shape[2], input_shape[3]])
# lambda layer addition
model.add(Lambda(time_based_slicing, output_shape=return_out_shape, arguments={'crop_at': (2, 5)}))
The above argument crop_at needs to be custom for each data point when fitting in a loop. Any pointers/clues to this will be helpful.
Given that you know the indices of the time frames that belong to each datapoint from before, you can store them in a text file and pass them as an additional Input to your model:
slice_input = Input((2,))
And use those in your time_based_slicing function.
Switch from Sequential API - it starts to fall apart when you need to use multiple inputs: use Functional API https://keras.io/models/model/
Assuming that your lambda functions are correct:
def time_based_slicing(inputs_list):
    x, crop_at = inputs_list
    ...  # will probably need some work to subset with crop_at, since it is now a tensor instead of constants

inp = Input(your_shape)
inp_additional = Input((2,))
x = YOUR_CNN_LOGIC(inp)
out = Lambda(time_based_slicing)([x, inp_additional])
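Slicing a different range per sample produces tensors of different lengths, which cannot be batched directly. One way to sidestep this, swapped in here as an assumption rather than taken from the answers, is to build a per-sample mask over the time axis and average only the frames inside [n1, n2). The function name `time_range_average_pool` is hypothetical, and the sketch assumes n1 < n2 for every sample.

```python
import tensorflow as tf

def time_range_average_pool(x, crop_at):
    """x: (batch, time, features, channels); crop_at: (batch, 2) holding [n1, n2)."""
    crop_at = tf.cast(crop_at, tf.int32)
    t = tf.range(tf.shape(x)[1])[tf.newaxis, :]              # (1, time)
    inside = tf.logical_and(t >= crop_at[:, 0:1], t < crop_at[:, 1:2])
    mask = tf.cast(inside, x.dtype)[:, :, tf.newaxis, tf.newaxis]
    total = tf.reduce_sum(x * mask, axis=[1, 2])             # (batch, channels)
    # number of averaged elements = frames in range * feature width
    count = tf.reduce_sum(mask, axis=[1, 2]) * tf.cast(tf.shape(x)[2], x.dtype)
    return total / count

x = tf.reshape(tf.range(8, dtype=tf.float32), (2, 4, 1, 1))
crop_at = tf.constant([[1, 3], [0, 4]])
pooled = time_range_average_pool(x, crop_at)
print(pooled.numpy())  # [[1.5] [5.5]]
```

In the Functional API setup above, this could be wrapped as `Lambda(lambda t: time_range_average_pool(t[0], t[1]))([x, inp_additional])`.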

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

I've read from the relevant documentation that :
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
But, it is still unclear to me how this works. If I set sample_weight with an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.
Some quick preliminaries:
Let's say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the "impurity" of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate:
Pr(Class=k) = #(examples of class k in region) / #(total examples in region)
The impurity measure takes as input, the array of class probabilities:
[Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]
and spits out a number, which tells you how "impure" or how inhomogeneous-by-class the region of feature space is. For example, the gini measure for a two class problem is 2*p*(1-p), where p = Pr(Class=1) and 1-p=Pr(Class=2).
Now, basically the short answer to your question is:
sample_weight changes the probability estimates in the probability array ... which changes the impurity measure ... which changes how nodes are split ... which changes how the tree is built ... which changes how feature space is diced up for classification.
I believe this is best illustrated through example.
First consider the following 2-class problem where the inputs are 1 dimensional:
from sklearn.tree import DecisionTreeClassifier as DTC
X = [[0],[1],[2]] # 3 simple training examples
Y = [ 1, 2, 1 ] # class labels
dtc = DTC(max_depth=1)
So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the gini measure.
Case 1: no sample_weight
dtc.fit(X, Y)
print(dtc.tree_.threshold)
# [0.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0, 0.5]
The first value in the threshold array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in threshold are placeholders and are to be ignored. The impurity array tells us the computed impurity values in the parent, left, and right nodes respectively.
In the parent node, p = Pr(Class=1) = 2. / 3., so that gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444..... You can confirm the child node impurities as well.
Case 2: with sample_weight
Now, let's try:
dtc.fit(X, Y, sample_weight=[1, 2, 3])
print(dtc.tree_.threshold)
# [1.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0.44444444, 0.]
You can see the feature threshold is different. sample_weight also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.
The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:
p = Pr(Class=1) = (1+3) / (1+2+3) = 2.0/3.0
The gini measure of 4/9 follows.
Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that impurity is calculated to be 4/9 also in the left child node because:
p = Pr(Class=1) = 1 / (1+2) = 1/3.
The impurity of zero in the right child is due to only one training example lying in that region.
You can extend this to non-integer sample weights similarly. I recommend trying something like sample_weight = [1,2,2.5], and confirming the computed impurities.
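The hand calculations above can be reproduced with a few lines of plain Python. The helper `weighted_gini` is a hypothetical name for this sketch. It computes the general form 1 - sum_k p_k^2, which equals 2*p*(1-p) in the two-class case, and it is not scikit-learn's internal implementation.

```python
def weighted_gini(labels, weights):
    """Gini impurity where each example counts in proportion to its weight."""
    total = sum(weights)
    class_weight = {}
    for label, w in zip(labels, weights):
        class_weight[label] = class_weight.get(label, 0.0) + w
    return 1.0 - sum((w / total) ** 2 for w in class_weight.values())

Y = [1, 2, 1]
w = [1, 2, 3]

parent = weighted_gini(Y, w)          # p = (1+3)/(1+2+3) = 2/3, so gini = 4/9
left = weighted_gini(Y[:2], w[:2])    # p = 1/(1+2) = 1/3, so gini = 4/9
right = weighted_gini(Y[2:], w[2:])   # single example, so gini = 0
print(parent, left, right)
```

Replacing `w` with `[1, 2, 2.5]` gives the values to compare against `dtc.tree_.impurity` for the non-integer case.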

Resources