Difference between CrossEntropyLoss and NNLLoss with log_softmax in PyTorch? - pytorch

When I am building a classifier in PyTorch, I have 2 options to do
Using the nn.CrossEntropyLoss without any modification in the model
Using the nn.NNLLoss with F.log_softmax added as the last layer in the model
So there are two approaches.
Now, what approach should anyone use, and why?

They're the same.
If you check the implementation, you will find that it calls nll_loss after applying log_softmax on the incoming arguments.
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
Edit: seems like the links are now broken, here's the C++ implementation which shows the same information.

Both the cross-entropy and log-likelihood are two different interpretations of the same formula. In the log-likelihood case, we maximize the probability (actually likelihood) of the correct class which is the same as minimizing cross-entropy. Though you're correct both of these have created some ambiguity in the literature, however, there are some subtleties and caveats, I would highly suggest you go through this thread, as this topic has been rigorously discussed there. You may find it useful.
Cross-Entropy or Log-Likelihood in Output layer

Related

How to put more weight on one class during training in Pytorch [duplicate]

I have a multilabel classification problem, which I am trying to solve with CNNs in Pytorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, mean number of classes per example is 130.
The problem is that my dataset is very imbalance. For some classes, I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12000 examples (15%). When I train the model I use BCEWithLogitsLoss from pytorch with a positive weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class… Mor minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18. Even though it’s much better than no weighting at all, since in this case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest either one of these strategies
Focal Loss
A very interesting approach for dealing with un-balanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross entropy loss in a way that decrease the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
#Shai has provided two strategies developed in the deep learning era. I would like to provide you some additional traditional machine learning options: over-sampling and under-sampling.
The main idea of them is to produce a more balanced dataset by sampling before starting your training. Note that you probably will face some problems such as losing the data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good start point.
See the wiki link for more information.

How pytorch implement forward for a quantized linear layer?

I have a quantized model in pytorch and now I want to extract the parameter of the quantized linear layer and implement the forward manually.
I search the source code but only find this function.
def forward(self, x: torch.Tensor) -> torch.Tensor:
return torch.ops.quantized.linear(
x, self._packed_params._packed_params, self.scale, self.zero_point)
But no where I can find how torch.ops.quantized.linear is defined.
Can someone give me a hind how the forward of quantized linear are defined?
In answer to the question of where torch.ops.quantized.linear is, I was looking for the same thing but was never able to find it. I believe it's probably somewhere in the aten (C++ namespace). I did, however, find some useful PyTorch-based implementations in the NVIDIA TensorRT repo below. It's quite possible these are the ones actually called by PyTorch via some DLLs. If you're trying to add quantization to a custom layer, these implementations walk you through it.
You can find the docs here and the GitHub page here.
For the linear layer specifically, see the QuantLinear layer here
Under the hood, this calls TensorQuantFunction.apply() for post-training quantization or FakeTensorQuantFunction.apply() for quantization-aware training.

PyTorch LogSoftmax vs Softmax for CrossEntropyLoss

I understand that PyTorch's LogSoftmax function is basically just a more numerically stable way to compute Log(Softmax(x)). Softmax lets you convert the output from a Linear layer into a categorical probability distribution.
The pytorch documentation says that CrossEntropyLoss combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
Looking at NLLLoss, I'm still confused...Are there 2 logs being used? I think of negative log as information content of an event. (As in entropy)
After a bit more looking, I think that NLLLoss assumes that you're actually passing in log probabilities instead of just probabilities. Is this correct? It's kind of weird if so...
Yes, NLLLoss takes log-probabilities (log(softmax(x))) as input. Why?. Because if you add a nn.LogSoftmax (or F.log_softmax) as the final layer of your model's output, you can easily get the probabilities using torch.exp(output), and in order to get cross-entropy loss, you can directly use nn.NLLLoss. Of course, log-softmax is more stable as you said.
And, there is only one log (it's in nn.LogSoftmax). There is no log in nn.NLLLoss.
nn.CrossEntropyLoss() combines nn.LogSoftmax() (that is, log(softmax(x))) and nn.NLLLoss() in one single class. Therefore, the output from the network that is passed into nn.CrossEntropyLoss needs to be the raw output of the network (called logits), not the output of the softmax function.

net.zero_grad() vs optim.zero_grad() pytorch

Here they mention the need to include optim.zero_grad() when training to zero the parameter gradients. My question is: Could I do as well net.zero_grad() and would that have the same effect? Or is it necessary to do optim.zero_grad(). Moreover, what happens if I do both? If I do none, then the gradients get accumulated, but what does that exactly mean? do they get added? In other words, what's the difference between doing optim.zero_grad() and net.zero_grad(). I am asking because here, line 115 they use net.zero_grad() and it is the first time I see that, that is an implementation of a reinforcement learning algorithm, where one has to be especially careful with the gradients because there are multiple networks and gradients, so I suppose there is a reason for them to do net.zero_grad() as opposed to optim.zero_grad().
net.zero_grad() sets the gradients of all its parameters (including parameters of submodules) to zero. If you call optim.zero_grad() that will do the same, but for all parameters that have been specified to be optimised. If you are using only net.parameters() in your optimiser, e.g. optim = Adam(net.parameters(), lr=1e-3), then both are equivalent, since they contain the exact same parameters.
You could have other parameters that are being optimised by the same optimiser, which are not part of net, in which case you would either have to manually set their gradients to zero and therefore keep track of all the parameters, or you can simply call optim.zero_grad() to ensure that all parameters that are being optimised, had their gradients set to zero.
Moreover, what happens if I do both?
Nothing, the gradients would just be set to zero again, but since they were already zero, it makes absolutely no difference.
If I do none, then the gradients get accumulated, but what does that exactly mean? do they get added?
Yes, they are being added to the existing gradients. In the backward pass the gradients in respect to every parameter are calculated, and then the gradient is added to the parameters' gradient (param.grad). That allows you to have multiple backward passes, that affect the same parameters, which would not be possible if the gradients were overwritten instead of being added.
For example, you could accumulate the gradients over multiple batches, if you need bigger batches for training stability but don't have enough memory to increase the batch size. This is trivial to achieve in PyTorch, which is essentially leaving off optim.zero_grad() and delaying optim.step() until you have gathered enough steps, as shown in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
That flexibility comes at the cost of having to manually set the gradients to zero. Frankly, one line is a very small cost to pay, even though many users won't make use of it and especially beginners might find it confusing.

Use pytorch optimizer to fit a user defined function

I have read many tutorials on how to use PyTorch to make a regression over a data set, using, for instance, a model composed of several linear layers and the MSE loss.
Well, imagine that I know the function F depends on a variable x and some unknown parameters (p_j: j=0,..., P-1) with P relatively small, but the function is a composition of special function. So, my problem is the classical minimization knowing the data {x_i,y_i}_i<=N
Min_{ {p_j} } Sum_i (F(x_i;{p_j}) - y_i)^2
I would like to know if I can use the PyTorch optimizers and if yes how I can do it?
Thanks.
In fact, PyTorch experts answer that the function to minimized must be expressed in terms of torch.tensors to let the minimizers computing the gradients. So, it is not possible in my case.

Resources