I'm training a CNN for image classification. Each object appears twice in the test set (two viewpoints of the same object, so both images share one label), and I'd like to take advantage of this when predicting the class.
Right now the final layer is a Linear layer (PyTorch) and I'm using cross-entropy as the loss function. What is the best way to take the most confident prediction for each object? Should I first compute the LogSoftmax and take the class with the highest probability (across both arrays of predictions), or should I use the logits directly?

Since LogSoftmax preserves order within a single prediction vector, the largest logit always corresponds to the most confident class for that input. There is therefore no need to apply it if all you want is the index of the most confident class.
Probably the easiest way to get that index is torch.argmax.
import torch

batch_size = 5
num_logits = 10
y = torch.randn(batch_size, num_logits)  # random stand-in for the model's logits
preds = torch.argmax(y, dim=1)           # index of the largest logit in each row
which in this case results in
>>> print(preds)
tensor([9, 7, 2, 4, 6])
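
For the two-view case in the question, here is a minimal sketch of one way to pick the more confident view per object. It rests on an assumption not stated in the answer above: the order-preserving argument holds within one prediction vector, but logits from two different images are not on a shared scale, so the views are compared on log-probabilities instead. The names logits_view_a and logits_view_b are hypothetical, and random tensors stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_objects, num_classes = 4, 10

# Hypothetical logits for the two views of the same objects
logits_view_a = torch.randn(num_objects, num_classes)
logits_view_b = torch.randn(num_objects, num_classes)

# Stack to (2, num_objects, num_classes) and normalise over the class dimension
log_probs = F.log_softmax(torch.stack([logits_view_a, logits_view_b]), dim=-1)

# Best class and its log-probability for each view: both have shape (2, num_objects)
best_per_view, class_per_view = log_probs.max(dim=-1)

# For each object, keep the class predicted by the more confident view
best_view = best_per_view.argmax(dim=0)                                # (num_objects,)
preds = class_per_view.gather(0, best_view.unsqueeze(0)).squeeze(0)    # (num_objects,)
```

Averaging the two log-probability arrays before taking the argmax is another common choice; which works better is an empirical question.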


How does a trained SVR model predict values?

I've been trying to understand how a model trained with support vector machines for regression predicts values. I have trained a model with sklearn.svm.SVR, and now I'm wondering how to "manually" predict the outcome for an input.
Some background - the model is trained with kernel SVR, with RBF function and uses the dual formulation. So now I have arrays of the dual coefficients, the indexes of the support vectors, and the support vectors themselves.
I found the function which is used to fit the hyperplane but I've been unsuccessful in applying that to "manually" predict outcomes without the function .predict.
The few things I tried all include the dot products of the input (features) array, and all the support vectors.
If anyone ever needs this, I've managed to understand the equation and code it in python.
The following is the equation used for the dual formulation:
$$ f(x) = \sum_{i=1}^{N} \alpha_i y_i \, x_i^T x + b $$
where N is the number of observations (only the support vectors have non-zero coefficients), and α_i multiplied by y_i are the dual coefficients, available from the model's attribute model.dual_coef_. The x_i^T are the training observations kept as support vectors, accessed through the attribute model.support_vectors_ (transposed to allow multiplication of the two matrices), x is the input vector containing a value for each feature (it's the one observation for which we want a prediction), and b is the intercept, accessed through model.intercept_.
The xiT and x, however, are the observations transformed in a higher-dimensional space, as explained by mery in this post.
The RBF transformation can be computed either manually, step by step, or with sklearn.metrics.pairwise.rbf_kernel.
With the latter, the code would look like this (my case shows I have 589 support vectors, and 40 features).
First we access the coefficients and vectors:
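import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

support_vectors = model.support_vectors_   # shape (589, 40)
dual_coefs = model.dual_coef_[0]           # shape (589,)
# Assumes gamma was set to a number; with gamma='scale' or 'auto', get_params() returns that string
kernel = rbf_kernel(support_vectors.reshape(589, 40), Y=input_array.reshape(1, 40), gamma=model.get_params()['gamma'])
pred = np.matmul(dual_coefs.reshape(1, 589), kernel) + model.intercept_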
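If the RBF function needs to be applied manually, step by step, then:
vrbf = support_vectors.reshape(589, 40) - input_array.reshape(1, 40)   # differences x_i - x, shape (589, 40)
# The diagonal of vrbf @ vrbf.T holds the squared distances ||x_i - x||^2
sq_dists = np.diag(np.matmul(vrbf, vrbf.T))
kernel = np.exp(-model.get_params()['gamma'] * sq_dists).reshape(589, 1)
pred = np.matmul(dual_coefs.reshape(1, 589), kernel) + model.intercept_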
I placed the .reshape() function even where it is not necessary, just to emphasize the shapes for the matrix operations.
Both approaches give the same result as model.predict(input_array).
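
To check the idea end to end, here is a small self-contained sketch that fits an SVR on synthetic data and compares the manual prediction with model.predict. The data, the feature count, and the value of gamma are made up for illustration; gamma is passed as a number so the same value can be reused in rbf_kernel.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # synthetic training features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # synthetic targets

gamma = 0.1  # numeric gamma (an assumption; the default 'scale' is a string)
model = SVR(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = rng.normal(size=(3, 5))                    # three inputs to predict

# K[i, j] = exp(-gamma * ||support_vector_i - x_new_j||^2), shape (n_SV, 3)
K = rbf_kernel(model.support_vectors_, x_new, gamma=gamma)

# dual_coef_ has shape (1, n_SV), so the product has shape (1, 3)
manual = (model.dual_coef_ @ K + model.intercept_).ravel()

print(np.allclose(manual, model.predict(x_new)))
```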

Retrieve only the last hidden state from lstm layer in pytorch sequential

I have a pytorch model:
model = torch.nn.Sequential(
    torch.nn.LSTM(40, 256, 3, batch_first=True),
    torch.nn.Linear(256, 256),
)
And for the LSTM layer, I want to retrieve only the last hidden state from the batch to pass through the rest of the layers. Ex:
_, (hidden, _) = lstm(data)
hidden = hidden[-1]
However, that example only works for a subclassed model. I need to do this in an nn.Sequential() model so that, when I save it, it can be converted properly to a tensorflow.js model. The reason I can't build and train this model in tensorflow.js is that I'm trying to implement this repo, Resemblyzer, in tensorflow.js while still using the weights of the pretrained Resemblyzer model, which was written in PyTorch as a subclassed model. I thought of using torchvision.transforms.Lambda(), but I assume that would make the model incompatible with tensorflow.js. Is there any way to make this possible while still allowing the model to convert properly?
You could keep the nn.Sequential as defined and split it up only at inference time. Once defined:
model = nn.Sequential(nn.LSTM(40, 256, 3, batch_first=True),
                      nn.Linear(256, 256),
)
You can split it:
>>> lstm, fc = model[0], model[1:]
Then infer in two steps:
>>> out, (hidden, _) = lstm(data)
>>> hidden = hidden[-1]
>>> out = fc(out) # <- or fc(hidden) for the last hidden state only (with batch_first=True, out[-1] is the last batch item; the last time step is out[:, -1])
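
As a rough, self-contained sketch of the split described above (random input stands in for real data, and the batch and sequence sizes are arbitrary), the following runs the two steps and checks that, for this unidirectional LSTM, the last layer's final hidden state equals the output at the last time step:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.LSTM(40, 256, 3, batch_first=True),
    nn.Linear(256, 256),
)

# Split the container: the LSTM on one side, everything after it on the other
lstm, fc = model[0], model[1:]

data = torch.randn(8, 20, 40)          # (batch, seq_len, features), made-up sizes
with torch.no_grad():
    out, (hidden, _) = lstm(data)      # out: (8, 20, 256), hidden: (3, 8, 256)
    last_hidden = hidden[-1]           # last layer's final hidden state: (8, 256)
    embedding = fc(last_hidden)        # (8, 256)

# For a unidirectional LSTM these two are the same tensor values
same = torch.allclose(last_hidden, out[:, -1], atol=1e-6)
```

The weights still live in the original nn.Sequential, so saving model.state_dict() is unaffected by the split.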
Although an answer is provided above, I thought I would elaborate, since the PyTorch LSTM documentation can be confusing.
In TF, we directly get the last_state as the output. No further action is needed.
Let us check the Torch output of LSTM:
There are 2 outputs: a sequence and a tuple. We are interested in the last state, so we can ignore the sequence and focus on the tuple. The tuple consists of 2 values: the first is the hidden state at the last time step (for all layers in the LSTM) and the second is the cell state at the last time step (again for all layers in the LSTM). We are interested in the hidden state. So
_, tup = self.bilstm(inp)
We are interested in tup[0]. Let us dig further into this.
The shape of tup[0] is somewhat odd, with the batch size in the middle: (num_layers, batch_size, hidden_size). On the left of the batch size is the number of layers in the LSTM (multiplied by 2 for a biLSTM). On the right is the hidden size you provided when defining the LSTM. You can take the output of the last layer with tup[0][-1], which is the answer provided above.
Alternatively if you want to make use of hidden states across layers, you may try something like:
out = tup[0].swapaxes(0,1)
out = out.reshape(*out.shape[:-2], -1)
The first line produces a tensor of shape (batch_size, num_layers, hidden_size_specified). The second line produces a tensor of shape (batch_size, num_layers x hidden_size_specified).
(For example, if yours is a biLSTM with 3 layers and a hidden size of 100, you could concatenate the output to get one vector of 2 x 3 x 100 = 600 dimensions and then run a simple linear layer on top of it to get the output you want.)
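
A minimal sketch of that 600-dimensional example, with made-up input sizes and an arbitrary 5-unit linear head (both are assumptions for illustration only):

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, input_size = 4, 7, 10
hidden_size, num_layers = 100, 3

bilstm = nn.LSTM(input_size, hidden_size, num_layers,
                 batch_first=True, bidirectional=True)
head = nn.Linear(2 * num_layers * hidden_size, 5)   # 5 outputs chosen arbitrarily

inp = torch.randn(batch_size, seq_len, input_size)
_, tup = bilstm(inp)

h_n = tup[0]                              # (num_layers * 2, batch_size, hidden_size)
out = h_n.swapaxes(0, 1)                  # (batch_size, num_layers * 2, hidden_size)
out = out.reshape(*out.shape[:-2], -1)    # (batch_size, 600)
logits = head(out)                        # (batch_size, 5)
```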
There is another way to get the output of the LSTM. We discussed that the first output of an LSTM is a sequence:
sequence, tup = self.bilstm(inp)
This sequence is the output of the LAST hidden layer of the LSTM. It is a sequence because it contains hidden states of EVERY cell in this layer. So its length will be the input sequence length that you have provided. We could choose to take the hidden state of the last element in the sequence by doing a:
#shape of sequence is: batch_size, seq_size, dim
sequence = sequence.swapaxes(0,1)
#shape of sequence is: seq_size, batch_size, dim
sequence = sequence[-1]
#shape of sequence is: batch_size, dim (ie last seq is taken)
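You might expect this to be the same value we got by taking the last layer from tup[0]. Well, not quite! If the LSTM is a biLSTM, the sequence approach returns a 2 x hidden_size dim output, whereas tup[0][-1] gives only hidden_size dims (the backward direction of the last layer; the forward direction is tup[0][-2]). Note also that for a biLSTM the backward half of the last sequence element is the backward direction's state after it has seen only that last element; its final state sits at the first position of the sequence. OP's LSTM is a non-biLSTM, so both answers hold true.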
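
To make the biLSTM case concrete, here is a small sketch (sizes are made up) that compares the final hidden states with slices of the output sequence. It relies on the documented layout of h_n, in which the last two entries are the forward and backward directions of the last layer:

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, in_dim, hidden, layers = 4, 7, 10, 16, 3

bilstm = nn.LSTM(in_dim, hidden, layers, batch_first=True, bidirectional=True)
inp = torch.randn(batch, seq_len, in_dim)

with torch.no_grad():
    sequence, (h_n, _) = bilstm(inp)   # sequence: (batch, seq_len, 2 * hidden)

fwd_final = h_n[-2]   # last layer, forward direction:  (batch, hidden)
bwd_final = h_n[-1]   # last layer, backward direction: (batch, hidden)

# The forward direction finishes at the last time step ...
fwd_matches = torch.allclose(fwd_final, sequence[:, -1, :hidden], atol=1e-6)
# ... while the backward direction finishes at the first time step
bwd_matches = torch.allclose(bwd_final, sequence[:, 0, hidden:], atol=1e-6)

# One vector per example holding both final states: (batch, 2 * hidden)
last_state = torch.cat([fwd_final, bwd_final], dim=-1)
```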

PyTorch's CrossEntropyLoss - how to deal with the sequence length dimension with transformers?

I'm training a transformer model for text generation.
let's assume:
vocab size = 100
embedding size = 50
max sequence length = 30
batch size = 32
loss = cross entropy loss
the last layer in the model is a fully connected layer,
mapping from shape [30, 32, 50] to [30, 32, 100].
the idea is that for each of the 30 positions along the first dimension, I have a target vector I want to calculate the loss against.
the issue is that, based on the docs, this loss only accepts 2 dims for the prediction and one for the target - so how can I fit my 3D prediction into it?
(and my 2D target?)
Use torch.nn.BCELoss() instead (binary cross-entropy). This expects input and target to be the same size, but they can be any size, and their values should fall within the range [0,1]. It computes the cross-entropy loss element-wise.
EDIT: if you expect only one element from the vocab to be output, then you should use CrossEntropyLoss and encode your labels as a 1D vector of class indices rather than a 2D one-hot array (i.e. do 1-hot decoding). BCE treats each element in the output for a single example as independent of the others, which is not a valid assumption for a multi-class style problem. I originally misread the question and thought the final output was an embedding rather than an element from the vocabulary, hence my original suggestion.
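
For the shapes in the question, a minimal sketch of two common ways to feed a [30, 32, 100] prediction and a [30, 32] target of class indices to CrossEntropyLoss. Random tensors stand in for the model output and the labels; neither option is taken from the answer above.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq_len, batch_size, vocab_size = 30, 32, 100

logits = torch.randn(seq_len, batch_size, vocab_size)            # model output
targets = torch.randint(0, vocab_size, (seq_len, batch_size))    # class indices

criterion = nn.CrossEntropyLoss()

# Option 1: merge the sequence and batch dimensions -> (960, 100) and (960,)
loss_flat = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Option 2: move the class dimension to position 1 -> (30, 100, 32) with target (30, 32)
loss_perm = criterion(logits.permute(0, 2, 1), targets)
```

Both calls average over the same 960 token-level losses, so they agree up to floating-point rounding.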

Multiclass semantic segmentation model evaluation

I am doing a project on multiclass semantic segmentation. I have built a model that outputs pretty decent segmented images by decreasing the loss value. However, I cannot evaluate the model's performance with metrics such as mean IoU or the Dice coefficient.
In case of binary semantic segmentation it was easy just to set the threshold of 0.5, to classify the outputs as an object or background, but it does not work in the case of multiclass semantic segmentation. Could you please tell me how to obtain model performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using PyTorch framework and CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in literature). So, consider that before using the implementation for any formal publication.
Basically, the other method suggested by the issue-poster is to accumulate the intersections and unions separately over the entire dataset and divide them only at the final step. The method in the original answer below computes the intersection and union for a batch of images, divides them to get the IoU for the current batch, and then takes the mean of the IoUs over the entire dataset.
However, the original method is problematic because the final mean IoU varies with the batch size. With the method mentioned in the issue, the mIoU does not vary with the batch size, because the separate accumulation makes the batch size irrelevant (though a larger batch size can still speed up the evaluation).
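
A rough sketch of the dataset-level accumulation described above. The function names are hypothetical, ignore_index handling is left out, and skipping classes whose union is zero over the whole dataset is an assumption of this sketch rather than something the issue specifies.

```python
import torch

def accumulate_iou_counts(pred_classes, label, intersections, unions):
    """Add one batch's per-class intersection and union counts in place."""
    pred = pred_classes.reshape(-1)
    label = label.reshape(-1)
    for sem_class in range(intersections.numel()):
        pred_inds = pred == sem_class
        target_inds = label == sem_class
        inter = (pred_inds & target_inds).sum().item()
        intersections[sem_class] += inter
        unions[sem_class] += pred_inds.sum().item() + target_inds.sum().item() - inter

def dataset_miou(intersections, unions):
    """Divide only once, after all batches have been accumulated."""
    valid = unions > 0   # classes that never appear are skipped (assumption)
    return (intersections[valid] / unions[valid]).mean().item()

num_classes = 2
intersections = torch.zeros(num_classes, dtype=torch.float64)
unions = torch.zeros(num_classes, dtype=torch.float64)

# Two tiny "batches" of already-argmaxed predictions and labels
batches = [
    (torch.tensor([0, 0, 1, 1]), torch.tensor([0, 1, 1, 1])),
    (torch.tensor([1, 1]), torch.tensor([1, 0])),
]
for pred_classes, label in batches:
    accumulate_iou_counts(pred_classes, label, intersections, unions)

miou = dataset_miou(intersections, unions)
```

Because the division happens once, splitting the same images into batches of any size gives the same result.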
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
import numpy as np
import torch
import torch.nn.functional as F

def mIOU(label, pred, num_classes=19):
    # pred: per-class scores of shape (B, num_classes, H, W); label: class indices
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1)
    iou_list = list()
    present_iou_list = list()
    pred = pred.reshape(-1)
    label = label.reshape(-1)
    # Note: Following for loop goes from 0 to (num_classes-1)
    # and ignore_index is num_classes, thus ignore_index is
    # not considered in computation of IoU.
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            # Class absent from the target: mark as NaN and leave it out of the mean
            iou_now = float('nan')
        else:
            intersection_now = (pred_inds[target_inds]).long().sum().item()
            union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
            iou_now = float(intersection_now) / float(union_now)
            present_iou_list.append(iou_now)
        iou_list.append(iou_now)
    return np.mean(present_iou_list)
The prediction of your model has one score per class at each pixel, so first take the softmax (if your model doesn't already; it does not change the argmax) followed by argmax to get the index with the highest probability at each pixel. Then, we calculate the IoU for each class (and take the mean over the classes at the end).
We can reshape both the prediction and the label into 1-D vectors (I read that it makes the computation faster). For each class, we first identify the pixels of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds are boolean masks that are True (1) at pixels labelled as that particular class and False (0) for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This will make that class's IoU calculation invalid as it is not present in the target. So, you assign such classes a NaN IoU (so you can identify them later) and not involve them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we'll get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we can divide the intersection and union to get the IoU of that particular class and add it to a list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.
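
As a hedged sketch of the "similar fashion" mentioned above, the per-class Dice coefficient replaces intersection / union with 2 x intersection / (predicted pixels + target pixels). The function name mean_dice is hypothetical, the input is assumed to be already argmaxed, and classes absent from the target are skipped as in the IoU code.

```python
import torch

def mean_dice(label, pred_classes, num_classes):
    """Mean Dice over the classes that are present in the target."""
    pred = pred_classes.reshape(-1)
    label = label.reshape(-1)
    present = []
    for sem_class in range(num_classes):
        pred_inds = pred == sem_class
        target_inds = label == sem_class
        if target_inds.sum().item() == 0:
            continue  # class absent from the target, skip it
        intersection = (pred_inds & target_inds).sum().item()
        total = pred_inds.sum().item() + target_inds.sum().item()
        present.append(2.0 * intersection / total)
    return sum(present) / len(present) if present else float('nan')

label = torch.tensor([[0, 1, 1, 1]])
pred_classes = torch.tensor([[0, 0, 1, 1]])
dice = mean_dice(label, pred_classes, num_classes=2)
```

The batch-size caveat raised for mIoU applies here as well if this is averaged over batches.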

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor outputs, where T is the max time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent output will be all zeros.
Here are my questions.
For a single-output task where one needs the last output of every sequence, a simple outputs[-1] gives a wrong result, since this tensor contains lots of zeros for short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output of each sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O, and computes the cross-entropy loss with the true targets TB (usually integers in a language model). In this situation, do these zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
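from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...  # elided in the original; defines self.embedding, self.rnn and self.linear

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence, expanded to shape (B, 1, N).
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        # squeeze(1) rather than squeeze() so a batch of size 1 keeps its batch dimension
        return unpacked.gather(1, idx).squeeze(1)

    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch (assumes it is sorted by decreasing length)
        packed = pack_padded_sequence(embs, lengths.tolist(),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits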
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss via the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In older versions of PyTorch, masking was not supported, so you had to implement your own workaround. The solution I used was the one by jihunchoi. You may also be interested in this discussion.
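
A small sketch of what ignore_index does in the T x B x O setting from the question, checked against an explicit mask. The sizes, the padding index 0, and the random tensors are all assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
T, B, O = 5, 3, 11            # max time steps, batch size, output classes
PAD = 0                       # index reserved for padding (assumption)

logits = torch.randn(T, B, O)
targets = torch.randint(1, O, (T, B))       # real targets never use index 0
lengths = torch.tensor([5, 3, 2])

# True where a time step is inside the sequence, False where it is padding
mask = torch.arange(T).unsqueeze(1) < lengths.unsqueeze(0)    # (T, B)
targets = targets.masked_fill(~mask, PAD)

# Padded positions are skipped, and the mean is taken over real tokens only
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
loss = loss_fn(logits.reshape(T * B, O), targets.reshape(T * B))

# Reference: per-token losses averaged over the unmasked positions
per_token = nn.functional.cross_entropy(
    logits.reshape(T * B, O), targets.reshape(T * B), reduction="none")
manual = per_token[mask.reshape(-1)].mean()
```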
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
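
The comparison can be reproduced with a short self-contained sketch (random data, made-up sizes, lengths sorted in decreasing order). It also checks a third option that is not mentioned above: when a packed sequence is fed to the LSTM, the returned h_n already holds the hidden state at each sequence's last valid step.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
batch_size, max_len, feat, hidden = 3, 6, 8, 16
x = torch.randn(batch_size, max_len, feat)
lengths = torch.tensor([6, 4, 2])            # sorted in decreasing order

rnn = nn.LSTM(feat, hidden, batch_first=True)
packed = pack_padded_sequence(x, lengths, batch_first=True)
with torch.no_grad():
    out_packed, (h_n, _) = rnn(packed)
unpacked_out, _ = pad_packed_sequence(out_packed, batch_first=True)

# 1) gather, as in last_timestep(): index of shape (B, 1, N)
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, hidden)
last_gather = unpacked_out.gather(1, idx).squeeze(1)

# 2) advanced indexing, the one-liner
last_index = unpacked_out[torch.arange(batch_size), lengths - 1]

# 3) final hidden state of the (single-layer, unidirectional) LSTM
last_h = h_n[-1]
```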
