The computer vision and deep learning literature usually says to use binary_crossentropy for a binary (two-class) problem and categorical_crossentropy for more than two classes. Now I am wondering: is there any reason not to use the latter for a two-class problem as well?
categorical_crossentropy:
accepts only one correct class per sample
will take "only" the true neuron and make the crossentropy calculation with that neuron
binary_crossentropy:
accepts many correct classes per sample
will do the crossentropy calculation for "all neurons", treating each neuron as its own two-class (0 or 1) problem.
A 2-class problem can therefore be modeled in either of two ways (a sketch of both follows):
2-neuron output with only one correct class: softmax + categorical_crossentropy
1-neuron output, one class is 0, the other is 1: sigmoid + binary_crossentropy
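As a minimal Keras sketch of the two equivalent setups (the toy data, layer sizes, and optimizer below are made up purely for illustration):

import numpy as np
import tensorflow as tf

# Toy data: 8 samples with 4 features, labels 0 or 1 (made up for illustration).
x = np.random.rand(8, 4).astype("float32")
y = np.random.randint(0, 2, size=(8,)).astype("float32")

# Option 1: 2-neuron output, softmax + categorical_crossentropy (one-hot targets).
model_a = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model_a.compile(optimizer="adam", loss="categorical_crossentropy")
model_a.fit(x, tf.keras.utils.to_categorical(y, 2), epochs=1, verbose=0)

# Option 2: 1-neuron output, sigmoid + binary_crossentropy (plain 0/1 targets).
model_b = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model_b.compile(optimizer="adam", loss="binary_crossentropy")
model_b.fit(x, y, epochs=1, verbose=0)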
Explanation
Categorical crossentropy computes
loss = -sum_i( y_true[i] * ln(y_pred[i]) )
Notice that the term y_true[i] is 1 only for the true neuron, so every other neuron contributes zero, and the equation reduces to simply -ln(y_pred[correct_label]).
Binary crossentropy computes
loss = -sum_i( y_true[i] * ln(y_pred[i]) + (1 - y_true[i]) * ln(1 - y_pred[i]) )
summed (or averaged) over the neurons. Notice that each neuron contributes two terms: one for the case where 1 is the correct class, and one for the case where 0 is the correct class.
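As a small numeric sketch of both formulas (the prediction vectors below are made up):

import numpy as np

# Categorical crossentropy: only the true neuron contributes.
y_true = np.array([0.0, 1.0, 0.0])        # one-hot, class 1 is correct
y_pred = np.array([0.2, 0.7, 0.1])        # softmax output
cce = -np.sum(y_true * np.log(y_pred))    # = -ln(0.7) ≈ 0.357

# Binary crossentropy: every neuron contributes with both terms.
# (Frameworks often average over the neurons instead of summing.)
y_true_b = np.array([1.0, 0.0, 1.0])      # independent 0/1 labels
y_pred_b = np.array([0.8, 0.3, 0.6])      # sigmoid outputs
bce = -np.sum(y_true_b * np.log(y_pred_b)
              + (1 - y_true_b) * np.log(1 - y_pred_b))

print(cce, bce)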
Related
I'm performing an image classification task. Images are labeled 0, 1, or 2. Should the size of the model's last linear layer be 3 or 1? In general, for a 3-class problem the output is set to 3, and the class with the maximum probability among those three is returned. But I have seen the last layer set to 1 in some code, which actually seems logical to me. What do you think? (Also, I don't use a softmax or sigmoid function in the last layer.)
To perform classification into c classes (c = 3 in your example) you need to predict the probability of each class, so your model needs a c-dimensional output.
Usually you do not explicitly apply softmax to the "raw predictions" (aka "logits") - the loss function usually does that for you in a more numerically-robust way (see, e.g., nn.CrossEntropyLoss).
After you have trained the model, at inference time you can take the argmax over the predicted c logits and output a single scalar: the index of the predicted class. This can only be done at inference time, since argmax is not a differentiable operation.
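A minimal PyTorch sketch of that setup (the 10-dimensional input and the hidden size are made up):

import torch
import torch.nn as nn

c = 3                                    # number of classes
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, c),                    # last layer outputs c logits, no softmax here
)
criterion = nn.CrossEntropyLoss()        # applies log-softmax internally

x = torch.randn(8, 10)                   # dummy batch
targets = torch.randint(0, c, (8,))      # class indices 0, 1, 2
loss = criterion(model(x), targets)
loss.backward()

# Inference: argmax over the c logits gives the predicted class index.
with torch.no_grad():
    pred = model(x).argmax(dim=1)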
I got a little bit lost while studying loss functions for multi-task learning.
For instance, in binary classification with only one task, for example classifying emails as spam or not, the sum of probabilities for each label (spam/not spam) would be 1 using softmax activation + softmax_crossentropy loss function. How does that apply to multi-task learning?
Let's consider the case with 5 tasks and each of them is a binary problem. Is the softmax function applied to each task independently (e.g. for task 1: probability of label 1 = 0.7 and label 2 = 0.3; for task 2: probability of label 1 = 0.2 and label 2 = 0.8 and so on) or it considers the tasks jointly (e.g. if label 1 of task 1 has a probability of 0.80 all other labels of all other tasks will sum to 0.20)?
Some notes:
Nitpicking: you should not use softmax for binary classification, but rather a plain sigmoid (which is essentially the two-class reduction of the softmax), followed by a binary log-loss (likewise the two-class reduction of categorical crossentropy).
For multi-task setups that involve classification, you would typically use multiple binary classifications. Say you have an image and you want outputs that say whether there are pedestrians, cars, and road signs in it. This is not a multi-class classification, since an image can contain all of the above. So instead you define your output as 3 nodes and compute a binary classification loss for each node. This is done in one multi-task NN instead of running 3 different NNs, under the assumption that all 3 classification problems can benefit from the same latent layer or embedding created in that one NN.
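A minimal Keras sketch of that idea, assuming a made-up 128-dimensional input: each of the 3 output nodes gets its own sigmoid, and binary_crossentropy is computed per node.

import numpy as np
import tensorflow as tf

# Shared latent layer feeding 3 independent binary outputs
# (pedestrian, car, road sign); sizes are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu"),    # shared latent layer
    tf.keras.layers.Dense(3, activation="sigmoid"),  # one sigmoid node per task
])
# binary_crossentropy is applied to each of the 3 nodes independently.
model.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.rand(16, 128).astype("float32")
y = np.random.randint(0, 2, size=(16, 3)).astype("float32")   # each column is 0/1
model.fit(x, y, epochs=1, verbose=0)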
Primarily, the loss function that is computed can be different for different tasks in multi-task classification (note: this is not the same thing as MULTI-LABEL classification).
For example, task 1 can be binary classification, task 2 can be next-sentence prediction, and so on. Since different tasks involve learning different loss functions, the first part of your assumption applies: the softmax is applied only to the labels of the first task while the first task is being learned.
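As a hypothetical sketch of that structure (all names, sizes, and the second task are made up): two task-specific heads sit on a shared layer, and each head is trained with its own loss.

import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(128,))
shared = tf.keras.layers.Dense(64, activation="relu")(inputs)   # shared representation

# Task 1: binary classification head (sigmoid + binary crossentropy).
out_task1 = tf.keras.layers.Dense(1, activation="sigmoid", name="task1")(shared)
# Task 2: 4-way classification head (softmax + categorical crossentropy).
out_task2 = tf.keras.layers.Dense(4, activation="softmax", name="task2")(shared)

model = tf.keras.Model(inputs, [out_task1, out_task2])
model.compile(
    optimizer="adam",
    loss={"task1": "binary_crossentropy",
          "task2": "sparse_categorical_crossentropy"},   # a different loss per task
)

x = np.random.rand(16, 128).astype("float32")
y1 = np.random.randint(0, 2, size=(16, 1)).astype("float32")
y2 = np.random.randint(0, 4, size=(16,))
model.fit(x, {"task1": y1, "task2": y2}, epochs=1, verbose=0)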
In the pytorch docs, it says for cross entropy loss:
input has to be a Tensor of size (minibatch, C)
Does this mean that for binary (0,1) prediction, the input must be converted into an (N,2) tensor where the second dimension is equal to (1-p)?
So for instance if I predict 0.75 for a class with target 1 (true), would I have to stack two values (0.75; 0.25) on top of each other as input?
Quick and easy: yes, just give 1.0 for the true class and 0.0 for the other class as target values. Your model should also generate two predictions in that case, though it would be possible to do it with only a single prediction and use the sign information to determine the class. In that case you wouldn't get a probability from a softmax as the last operation, but for example from a sigmoid function (which maps your output from (-inf, inf) to (0, 1)).
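A small PyTorch sketch of both variants (the batch and the scores are made up):

import torch
import torch.nn as nn

logits = torch.randn(4, 2)                  # (N, 2): two raw scores per sample
target_idx = torch.tensor([1, 0, 1, 1])     # class indices

# Standard usage: targets are class indices of shape (N,).
loss_a = nn.CrossEntropyLoss()(logits, target_idx)

# Recent PyTorch versions (1.10+) also accept probability targets of shape (N, 2),
# e.g. (0.0, 1.0) for class 1 and (1.0, 0.0) for class 0.
target_onehot = nn.functional.one_hot(target_idx, num_classes=2).float()
loss_b = nn.CrossEntropyLoss()(logits, target_onehot)

# Alternative: a single raw score per sample, squashed by a sigmoid inside the loss.
single_logit = torch.randn(4)
loss_c = nn.BCEWithLogitsLoss()(single_logit, target_idx.float())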
I have a multi-layer neural net that does some stuff, and I want an SVM at the end. I have googled and searched on Stack Exchange, and it seems that this is easily implemented in Keras using the hinge or categorical_hinge loss function. However, I am confused as to which one to use.
My examples are to be classified into one of two classes, either class 0 or class 1. So I can do it either via:
Method 1 https://github.com/keras-team/keras/issues/2588 (uses hinge) or How do I use categorical_hinge in Keras? (uses categorical_hinge):
Labels will be of shape (, 2), with values of 0 or 1 indicating whether the sample belongs to that class or not.
nb_classes = 2
model.add(Dense(nb_classes, W_regularizer=l2(0.01)))
model.add(Activation('linear'))
model.compile(loss='hinge',  # or 'categorical_hinge'?
              optimizer='adadelta',
              metrics=['accuracy'])
Then the predicted class is whichever of the two output nodes has the higher value?
Method 2 https://github.com/keras-team/keras/issues/2830 (uses hinge):
The first commenter mentioned that hinge is supposed to be binary_hinge and that the labels must be -1 or 1 for no or yes, and that the activation for the last SVM layer should be tanh with 1 node only.
So it should look something like this, but the labels will be of shape (, 1) with values of either -1 or 1.
model.add(Dense(1, W_regularizer=l2(0.01)))
model.add(Activation('tanh'))
model.compile(loss='hinge',
              optimizer='adadelta',
              metrics=['accuracy'])
So which method is correct or more desirable? I am unsure what to use, since there are multiple answers online and the Keras documentation contains nothing at all on the hinge and categorical_hinge loss functions. Thank you!
Might be a bit late but here is my answer.
You can do it in multiple ways:
Since you have 2 classes, it is a binary problem and you can use the normal hinge loss.
The architecture then only has to produce a single output, with -1 and 1 as the two labels, as you said.
You can also use 2 outputs in the last layer; your targets then just have to be one-hot encodings of the label, and you use the categorical hinge.
Regarding the activation: both a linear layer and a tanh would give you an SVM; the tanh version is just a smoothed one.
I would suggest making it binary and using a tanh output, but try both to see what works.
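A minimal sketch of the binary variant in current Keras (kernel_regularizer is the Keras 2 name for W_regularizer; the data and layer sizes are made up), with 0/1 labels mapped to -1/+1 as the answer requires:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

x = np.random.rand(32, 10).astype("float32")
y01 = np.random.randint(0, 2, size=(32,))
y = 2 * y01 - 1                               # map 0/1 labels to -1/+1 for hinge

model = tf.keras.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, kernel_regularizer=regularizers.l2(0.01)),  # SVM-style output
    layers.Activation("tanh"),                # smoothed alternative to 'linear'
])
model.compile(loss="hinge", optimizer="adadelta")
model.fit(x, y, epochs=1, verbose=0)

# Inference: the sign of the single output gives the class (-1 or +1).
pred = np.sign(model.predict(x, verbose=0)).squeeze()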
I did this tutorial: https://www.tensorflow.org/get_started/eager
It was very helpful, but I didn't understand why the outputs always sum to 1. It states "For this example, the sum of the output predictions are 1.0", but does not explain why. I thought it might be a characteristic of the activation function, but I read that ReLU can take any value > 0 (https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0).
I'd like to understand this because I want to learn in which cases one should normalize the output variables and in which cases this is not necessary (I assume that if they always sum to 1, it is not necessary).
In the given example, the statement that the outputs always sum to 1 refers to the softmax function being used; it has nothing to do with normalization or with the activation function you use in the hidden layers. In the tutorial's Iris example we want to distinguish between three classes, and of course the sum of the class probabilities cannot exceed 100% (1.0).
For example, the softmax function, which sits at the end of your network, could return [0.8, 0.1, 0.1]. That means the first class has the highest probability. Notice: the sum of all the individual probabilities is 1.0.
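A quick sketch of that behavior (the logit values are made up):

import tensorflow as tf

logits = tf.constant([2.0, 0.5, 0.3])        # raw scores from the last layer
probs = tf.nn.softmax(logits)                # rescales them to class probabilities
print(probs.numpy())                         # roughly [0.71, 0.16, 0.13]
print(tf.reduce_sum(probs).numpy())          # always 1.0, regardless of the logits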