Flattening the input to nn.MSELoss() - pytorch

Here's a screenshot of a YouTube video implementing the loss function from the original YOLOv1 research paper.
What I don't understand is the need for torch.flatten() when passing the input to self.mse(), which is in fact nn.MSELoss().
The video only gives the reason that nn.MSELoss() expects input of shape (a, b), and I don't understand how or why.
Video link just in case. [For reference, N is the batch size and S is the grid size (split size).]

It helps to go back to the definitions. What is MSE? What is it computing?
MSE = mean squared error.
Here is some rough Python to illustrate.
def mse(data, labels):
    total = 0
    for x, y in zip(data, labels):
        total += (x - y) ** 2
    return total / len(labels)  # the average squared difference
For each pair of entries it subtracts one number from the other, squares the difference, and returns the average (or mean) of those squared differences.
To rephrase the question: how would you interpret MSE without flattening? MSE as described above is defined on a flat list of (prediction, target) pairs and has no separate meaning for higher dimensions. If you want to treat the outputs as matrices, you can use other loss functions, such as norms of the output matrices.
Anyway, I hope that answers your question as to why the flattening is used.
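As a rough check of the definition above, the sketch below compares a hand-written mean of squared differences with nn.MSELoss on flattened tensors. The tensor shape and the names pred, target, manual, flat and unflat are made up for illustration. The last line also passes the unflattened tensors, which nn.MSELoss accepts as long as both have the same shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical shapes: batch of 2, 7x7 grid, 4 values per cell.
pred = torch.randn(2, 7, 7, 4)
target = torch.randn(2, 7, 7, 4)

# Mean of squared differences over every pair of entries.
manual = ((pred - target) ** 2).sum() / pred.numel()

# nn.MSELoss (default reduction="mean") on flattened and unflattened inputs.
flat = nn.MSELoss()(torch.flatten(pred), torch.flatten(target))
unflat = nn.MSELoss()(pred, target)

print(manual.item(), flat.item(), unflat.item())
```

All three values should agree up to floating-point rounding, which suggests the flattening mainly makes the pairing of entries explicit.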

I had the same question, so I tried different end_dim values, like this:
import torch
data = torch.randn((1, 7, 7, 4))
target = torch.randn((1, 7, 7, 4))
loss = torch.nn.MSELoss(reduction="sum")
# flatten all but the last dimension: shape (49, 4)
object_loss = loss(
    torch.flatten(data, end_dim=-2),
    torch.flatten(target, end_dim=-2),
)
# flatten only the first two dimensions: shape (7, 7, 4)
object_loss1 = loss(
    torch.flatten(data, end_dim=-3),
    torch.flatten(target, end_dim=-3),
)
print(object_loss)
print(object_loss1)
I got the same result, so I think the flattening just helps to interpret MSE.

Related

FFT loss in PyTorch

I want to compute the loss between the ground truth (GT) and the output of my network (called TDN) in the frequency domain by computing a 2D FFT. The tensors have dimensions batch x channel x height x width.
amp_ip, phase_ip = 2DFFT(TDN(ip))
amp_gt, phase_gt = 2DFFT(TDN(gt))
loss = ||amp_ip - amp_gt||
To compute the FFT I can use torch.fft(ip, signal_ndim=2). But the output is in a + jb format, i.e. rectangular coordinates, and is NOT decomposed into phase and amplitude. How can I convert a + jb into amp exp(j phase) format in PyTorch? A side question: should signal_ndim be kept at 2 to compute the 2D FFT, or should it be something else?
The following description, which describes the loss that I plan to implement, may be useful.
The question is answered by the GitHub code file shared by #akshayk07 in the comments. Extracting the relevant information from that code, the concise answer to the question is:
fft_im = torch.rfft(img.clone(), signal_ndim=2, onesided=False)
# fft_im: size should be bx3xhxwx2
fft_amp = fft_im[:,:,:,:,0]**2 + fft_im[:,:,:,:,1]**2
fft_amp = torch.sqrt(fft_amp) # this is the amplitude
fft_pha = torch.atan2( fft_im[:,:,:,:,1], fft_im[:,:,:,:,0] ) # this is the phase
As of PyTorch 1.7.1, choose torch.rfft over torch.fft, as the latter does not work off the shelf with real-valued tensors propagating in CNNs. It is also a good idea to use the normalisation flag of torch.rfft.
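Note that torch.rfft was removed in PyTorch 1.8 in favour of the torch.fft module. A minimal sketch of the same amplitude and phase decomposition with the newer API might look like the following. The function name fft_amp_phase and the random input shape are assumptions for illustration and are not taken from the linked code.

```python
import torch

def fft_amp_phase(img):
    # img: real tensor of shape (batch, channel, height, width)
    # 2D FFT over the last two dimensions; the result is a complex tensor.
    fft_im = torch.fft.fft2(img, dim=(-2, -1))
    amp = torch.abs(fft_im)    # amplitude: sqrt(real**2 + imag**2)
    pha = torch.angle(fft_im)  # phase: atan2(imag, real)
    return amp, pha

torch.manual_seed(0)
img = torch.randn(2, 3, 8, 8)  # hypothetical input batch
amp, pha = fft_amp_phase(img)
print(amp.shape, pha.shape)
```

A loss such as the mean absolute difference between two amplitude tensors can then be computed with the usual tensor operations.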

CNN: taking the most confident prediction among many

I'm training a CNN for image classification. The same object (and therefore the same label) appears in the test set twice, from two viewpoints. I'd like to take advantage of this when predicting the class.
Right now the final layer is a Linear layer (PyTorch) and I'm using cross-entropy as loss function. I was wondering what is the best way to take the most confident prediction for each object. Should I first compute the LogSoftMax and take the class with the highest probability (among both arrays of predictions), or should I take the logits directly?
Since LogSoftMax preserves order, the largest logit will always correspond to the highest confidence. Therefore there's no need to perform the operation if all you're interested in is finding the index of most confident class.
Probably the easiest way to get the index of the most confident class is by using torch.argmax.
e.g.
import torch
batch_size = 5
num_logits = 10
y = torch.randn(batch_size, num_logits)  # random stand-in for the logits
preds = torch.argmax(y, dim=1)
which in this case results in
>>> print(preds)
tensor([9, 7, 2, 4, 6])
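The order-preserving claim can be checked with a small sketch: within each row, the argmax of the raw logits should match the argmax after log-softmax. The random logits and variable names here are placeholders, not output from a trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)               # (batch_size, num_classes)
log_probs = F.log_softmax(logits, dim=1)  # subtracts a per-row constant

preds_from_logits = torch.argmax(logits, dim=1)
preds_from_log_probs = torch.argmax(log_probs, dim=1)
same = torch.equal(preds_from_logits, preds_from_log_probs)
print(same)
```

This only shows that the ranking within a single prediction is unchanged. Comparing confidence across two separate predictions is a different matter, since each row is normalised by its own constant.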

Weights Matrix Final Fully Connected Layer

My question is, I think, too simple, but it's giving me headaches. I think I'm missing either something conceptually in Neural Networks or Tensorflow is returning some wrong layer.
I have a network in which last layer outputs 4800 units. The penultimate layer has 2000 units. I expect my weight matrix for last layer to have the shape (4800, 2000) but when I print out the shape in Tensorflow I see (2000, 4800). Please can someone confirm which shape of weight matrix the last layer should have? Depending on the answer, I can further debug the issue. Thanks.
Conceptually, a neural network layer is often written like y = W*x where * is matrix multiplication, x is an input vector and y an output vector. If x has 2000 units and y 4800, then indeed W should have size (4800, 2000), i.e. 4800 rows and 2000 columns.
However, in implementations we usually work on a batch of inputs X. Say X is (b, 2000) where b is your batch size. We don't want to transform each element of X individually by doing W*x as above since this would be inefficient.
Instead we would like to transform all inputs at the same time. This can be done via Y = X*W.T where W.T is the transpose of W. You can work out that this essentially applies W*x to each row of X (i.e. each input). Y is then a (b, 4800) matrix containing all transformed inputs.
In Tensorflow, the weight matrix is simply saved in this transposed state, since it is usually the form that is needed anyway. Thus, we have a matrix with shape (2000, 4800) (the shape of W.T).
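The equivalence between the batched form Y = X*W.T and applying W*x to each row can be sketched with NumPy. The sizes below are scaled down from 2000 and 4800 to keep the example small, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n_in, n_out = 3, 20, 48  # scaled-down stand-ins for 2000 and 4800

W = rng.standard_normal((n_out, n_in))  # "textbook" shape: (outputs, inputs)
X = rng.standard_normal((b, n_in))      # batch of inputs, one per row

Y_batched = X @ W.T                       # shape (b, n_out)
Y_rowwise = np.stack([W @ x for x in X])  # apply W*x to each input separately

W_stored = W.T  # shape (n_in, n_out), the layout described for TensorFlow
print(Y_batched.shape, W_stored.shape)
```

With the stored layout, the forward pass is simply X @ W_stored, with no transpose needed at run time.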

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor outputs, where T is the max number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the output after the sequence end is all zeros.
Here are my questions.
For a single-output task where one needs the last output of every sequence, a simple outputs[-1] gives a wrong result, since this tensor contains lots of zeros for short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output of each sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O, and computes the cross-entropy loss with the true targets TB (usually integers in a language model). In this situation, do these zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...
    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        # squeeze only dim 1 so a batch of size 1 keeps its batch dimension
        return unpacked.gather(1, idx).squeeze(1)
    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch
        packed = pack_padded_sequence(embs, list(lengths.data),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In older versions of PyTorch, masking was not supported, so you had to implement your own workaround. A solution that I used was masked_cross_entropy.py, by jihunchoi. You may also be interested in this discussion.
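To illustrate what ignore_index does, the sketch below compares the masked loss with the loss computed only on the non-padded positions. The logits, targets and the choice of 0 as the padding index are made-up values for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 5)                  # (T*B, num_classes), random stand-in
targets = torch.tensor([3, 1, 4, 0, 0, 0])  # assumption: 0 marks padded positions

# Loss that skips every position whose target equals the padding index.
masked = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)

# Reference: plain loss over the three real (non-padded) positions only.
manual = nn.CrossEntropyLoss()(logits[:3], targets[:3])

print(masked.item(), manual.item())
```

With the default mean reduction, the average is taken over the non-ignored targets, so the two values should match.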
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
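The comparison mentioned above can be reproduced with a small sketch. It uses a random tensor in place of a real pad_packed_sequence output and torch.arange instead of np.arange; the sizes and lengths are made up.

```python
import torch

torch.manual_seed(0)
batch, max_len, hidden = 3, 5, 4
# Stand-in for the batch-first output of pad_packed_sequence.
unpacked = torch.randn(batch, max_len, hidden)
lengths = torch.tensor([5, 3, 2])

# gather-based method, as in last_timestep()
idx = (lengths - 1).view(-1, 1).expand(batch, hidden).unsqueeze(1)  # (batch, 1, hidden)
last_gather = unpacked.gather(1, idx).squeeze(1)

# indexing one-liner
last_index = unpacked[torch.arange(batch), lengths - 1, :]

print(torch.equal(last_gather, last_index))
```

Both variants pick, for each sequence, the row at position length - 1 and return a (batch, hidden) tensor.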

Why is the logloss negative?

I just applied the log loss in sklearn for logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
My code looks something like this:
def perform_cv(clf, X, Y, scoring):
    kf = KFold(X.shape[0], n_folds=5, shuffle=True)
    kf_scores = []
    for train, _ in kf:
        X_sub = X[train,:]
        Y_sub = Y[train]
        #Apply 'log_loss' as a loss function
        scores = cross_validation.cross_val_score(clf, X_sub, Y_sub, cv=5, scoring='log_loss')
        kf_scores.append(scores.mean())
    return kf_scores
However, I'm wondering why the resulting logarithmic losses are negative. I'd expect them to be positive since in the documentation (see my link above) the log loss is multiplied by a -1 in order to turn it into a positive number.
Am I doing something wrong here?
Yes, this is supposed to happen. It is not a 'bug' as others have suggested. The actual log loss is simply the positive version of the number you're getting.
SK-Learn's unified scoring API always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.
This is also described in "sklearn GridSearchCV with Pipeline" and in "scikit-learn cross validation, negative values with mean squared error".
A similar discussion can be found here.
In this way, a higher score means better performance (less loss).
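The sign convention can be seen in a small sketch with a recent scikit-learn version, where the scorer is named "neg_log_loss" (the question's older version used 'log_loss') and cross_val_score lives in sklearn.model_selection. The synthetic dataset and the classifier settings are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression()

# The scorer returns the negated log loss, so every fold score is <= 0.
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
mean_log_loss = -scores.mean()  # flip the sign to recover the usual log loss

print(scores)
print(mean_log_loss)
```

Flipping the sign of the returned scores gives the positive log loss that the metric documentation describes.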
I cross-checked the sklearn implementation with several other methods. It seems to be an actual bug within the framework. Instead, consider the following code for calculating the log loss:
import numpy as np
def llfun(act, pred):
    epsilon = 1e-15
    # clip predictions away from 0 and 1 so the logs stay finite
    pred = np.maximum(epsilon, pred)
    pred = np.minimum(1 - epsilon, pred)
    ll = np.sum(act * np.log(pred) + np.subtract(1, act) * np.log(np.subtract(1, pred)))
    ll = ll * -1.0 / len(act)
    return ll
Also take into account that act and pred have to be Nx1 column vectors.

Resources