What should the output size of an image classifier model be? - pytorch

I'm performing an image classification task. Images are labeled as 0, 1, 2. Should the size of the last linear layer in the model be 3 or 1? In general, for a 3-class problem the output size is set to 3, and the class with the maximum probability among those three is returned. But I saw that the last layer is set to 1 in some code, which actually seems logical to me. What do you think? (Also, I don't use a softmax or sigmoid function in the last layer.)

To perform classification into c classes (c = 3 in your example) you need to predict the probability of each class, therefore you need to output a c-dim output.
Usually you do not explicitly apply softmax to the "raw predictions" (aka "logits") - the loss function usually does that for you in a more numerically-robust way (see, e.g., nn.CrossEntropyLoss).
After you have trained the model, at inference time you can take the argmax over the predicted c logits and output a single scalar: the index of the predicted class. This can only be done during inference, since argmax is not a differentiable operation.
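A minimal PyTorch sketch of this setup (the 512-dim feature input, hidden size, and dummy batch are placeholder assumptions, not from the question): the last linear layer outputs 3 logits, nn.CrossEntropyLoss handles the softmax during training, and argmax collapses the logits to a single class index at inference.

import torch
import torch.nn as nn

num_classes = 3
model = nn.Sequential(
    nn.Linear(512, 128),            # placeholder feature size and hidden layer
    nn.ReLU(),
    nn.Linear(128, num_classes),    # raw logits, no softmax here
)

criterion = nn.CrossEntropyLoss()   # applies log-softmax internally
features = torch.randn(8, 512)      # dummy batch of 8 feature vectors
logits = model(features)            # shape (8, 3)
targets = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1])   # class indices 0..2
loss = criterion(logits, targets)

# At inference time, collapse the 3 logits into a single class index:
predicted_class = logits.argmax(dim=1)   # shape (8,)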

Related

Why can't I use softmax in regression task for probabilities?

I have a supervised learning task f(X)=y where X is a 2-dimensional np.array of np.int8 and y is a 1-dimensional array of np.float64 containing probabilities (so numbers between 0 and 1). I want to build a neural network model that performs regression in order to predict said probabilities y given X.
As the output of my Network is one real value (i.e. the output layer has one neuron) and is a probability (so in the range [0, 1]), I believe I should use softmax as the activation function of the output layer (i.e. output neuron) in order to squash the network's output to [0, 1].
As it is a regression task, I opted for using the mean_squared_error loss (instead of cross_entropy_loss that is typically used in classification tasks and often paired with softmax).
However, as I am trying to fit(X, y) the loss does not change at all between epochs and remains constant. Any ideas why? Is the combination of softmax and mean_squared_error loss wrong for some reason and why?
If I remove the softmax it does work, but then my model would also predict non probabilities which I do not want. Yes, I could squash it myself later but it doesn't seem right.
My code basically is (after removing some irrelevant additional callbacks for EarlyStopping and learning rate scheduling):
model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='softmax'))
# compile model
model.compile(optimizer=Adam(), loss='mse') # mse is the standard loss for regression
# fit
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)
Edit: Turns out I needed the sigmoid function to squash one real value to [0, 1] as the accepted answer suggests. The softmax function for a vector of size 1 is always 1.
As you stated, you want to perform a regression task, which means finding a continuous mapping between your input and the desired output.
The softmax function creates a pseudo-probability distribution for multi-dimensional outputs (all values sum up to 1). This is the reason why the softmax function perfectly fits for classification tasks (predicting probabilities for different classes).
As you want to perform a regression task and your output is one-dimensional, softmax would not work properly because it is always 1 for a one-dimensional input.
A function which maps a one-dimensional input continuously to [0, 1] works fine here (e.g. sigmoid).
Note that you can also interpret the outputs of both the sigmoid and the softmax function as probabilities. But be careful: these are only pseudo-probabilities, and they do not represent the certainty or uncertainty of your model in making its predictions.
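A minimal sketch of the fix mentioned in the question's edit, assuming the tf.keras API and placeholder values for input_dims and W1_size: replace the softmax on the single output neuron with a sigmoid and keep the mse loss.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

input_dims, W1_size = 64, 32   # placeholder sizes, not from the question

model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
# sigmoid maps a single real value into (0, 1); softmax on one neuron is always 1
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(), loss='mse')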

When to use bias in Keras model?

I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when you use bias vs when to turn it off?
The short answer is: use bias variables, especially when your model is small; even in larger networks it is still recommended to keep the bias in all neural network architectures.
This is because each neuron behaves like a simple logistic regression: in each neuron, the input values are multiplied by the weights, and the bias shifts the operating point of the sigmoid function, which gives the model the flexibility it needs.
For example, if you have all-zero inputs in your training data, like X = [[0,0,...], [0,0,...], ...] with Y = 1, then without a bias a sigmoid neuron will always output exactly 0.5, since X*W is zero regardless of the weights. However, in large networks, each node can effectively build a bias out of the average activation of all of its inputs.
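A small sketch of that zero-input example, assuming the tf.keras API: with use_bias=False a sigmoid neuron fed all-zero inputs is stuck at 0.5, while the biased version can move its output toward the target.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.zeros((4, 3))   # all-zero inputs, as in the example above
y = np.ones((4, 1))    # target Y = 1

no_bias = Sequential([Dense(1, input_shape=(3,), activation='sigmoid', use_bias=False)])
with_bias = Sequential([Dense(1, input_shape=(3,), activation='sigmoid', use_bias=True)])

print(no_bias.predict(X))    # always exactly 0.5, since X @ W is zero
print(with_bias.predict(X))  # sigmoid(b), which training can push toward y = 1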

How does softmax loss works in multi-task learning

I got a little bit lost while studying loss functions for multi-task learning.
For instance, in binary classification with only one task, for example classifying emails as spam or not, the sum of probabilities for each label (spam/not spam) would be 1 using softmax activation + softmax_crossentropy loss function. How does that apply to multi-task learning?
Let's consider the case with 5 tasks and each of them is a binary problem. Is the softmax function applied to each task independently (e.g. for task 1: probability of label 1 = 0.7 and label 2 = 0.3; for task 2: probability of label 1 = 0.2 and label 2 = 0.8 and so on) or it considers the tasks jointly (e.g. if label 1 of task 1 has a probability of 0.80 all other labels of all other tasks will sum to 0.20)?
Some notes:
Nitpicking: you should not use softmax for binary classification, but rather a regular sigmoid (which is essentially the two-class special case of softmax), followed by a log loss (binary rather than categorical cross-entropy).
For multi-task learning that involves classification, you would probably use multiple binary classifications. Say you have an image and you want the output to say whether there are pedestrians, cars, and road signs in it. This is not a multi-class classification problem, as an image can contain all of the above. So instead you define your output as 3 nodes and compute a binary classification for each node. This is done in one multi-task NN instead of running 3 different NNs, under the assumption that all 3 classification problems can benefit from the same latent layer or embedding created in that one NN.
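A hedged sketch of that multiple-binary-heads setup in Keras (the input size, hidden layer, and optimizer are assumptions): one shared network produces three independent sigmoid outputs, each trained with its own binary cross-entropy term, so the three probabilities do not need to sum to 1.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(128, input_shape=(1024,), activation='relu'),  # shared latent layer
    Dense(3, activation='sigmoid'),  # pedestrian / car / road sign, one probability each
])
# binary_crossentropy is applied to each output independently,
# so the three probabilities are NOT forced to sum to 1.
model.compile(optimizer='adam', loss='binary_crossentropy')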
Primarily, the loss function can be different for different tasks in the case of multi-task classification (note that this is not multi-label classification).
For example, task 1 can be binary classification, task 2 can be next-sentence prediction, and so on. Therefore, since the different tasks involve learning with different loss functions, the first part of your assumption holds: softmax is applied only to the labels of the first task while learning the first task.

How to use Cross Entropy loss in pytorch for binary prediction

In the pytorch docs, it says for cross entropy loss:
input has to be a Tensor of size (minibatch, C)
Does this mean that for binary (0,1) prediction, the input must be converted into an (N,2) tensor where the second dimension is equal to (1-p)?
So for instance if I predict 0.75 for a class with target 1 (true), would I have to stack two values (0.75; 0.25) on top of each other as input?
Quick and easy: yes, your model should generate two predictions (raw logits) of shape (N, 2) for that case, and nn.CrossEntropyLoss applies the softmax over them internally; note that the target it expects is the class index (0 or 1), not a stacked (p, 1-p) pair. It would also be possible to do this with only a single prediction per example and use the sign information to determine the class. In that case you wouldn't get a probability from a softmax as the last operation, but for example from a sigmoid function (which maps your output from (-inf, inf) to (0, 1)), paired with a binary cross-entropy loss such as nn.BCEWithLogitsLoss.
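A minimal sketch of the two options, with dummy tensors standing in for real model outputs: two logits per example with nn.CrossEntropyLoss and class-index targets, or a single logit per example with nn.BCEWithLogitsLoss.

import torch
import torch.nn as nn

# Option 1: two logits per example, target is a class index (0 or 1).
logits_2 = torch.randn(4, 2)            # shape (N, 2), raw model outputs
targets = torch.tensor([1, 0, 1, 1])    # class indices, not (p, 1-p) pairs
loss_ce = nn.CrossEntropyLoss()(logits_2, targets)

# Option 2: one logit per example, sigmoid applied inside the loss.
logit_1 = torch.randn(4)                          # shape (N,)
loss_bce = nn.BCEWithLogitsLoss()(logit_1, targets.float())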

Keras softmax activation, category_crossentropy loss. But output is not 0, 1

I trained a CNN model for just one epoch with very little data. I use Keras 2.05.
Here are the CNN model's (partial) last layers, with number_outputs = 201. The training targets are one-hot encoded vectors of length 201.
model.add(Dense(200, activation='relu', name='full_2'))
model.add(Dense(40, activation='relu', name='full_3'))
model.add(Dense(number_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
The model is saved to an h5 file. Then the saved model is loaded with the same architecture as above. batch_image is an image file.
prediction = loaded_model.predict(batch_image, batch_size=1)
I get prediction like this:
ndarray: [[ 0.00498065 0.00497852 0.00498095 0.00496987 0.00497506 0.00496112
0.00497585 0.00496474 0.00496769 0.0049708 0.00497027 0.00496049
0.00496767 0.00498348 0.00497927 0.00497842 0.00497095 0.00496493
0.00498282 0.00497441 0.00497477 0.00498019 0.00497417 0.00497654
0.00498381 0.00497481 0.00497533 0.00497961 0.00498793 0.00496556
0.0049665 0.00498809 0.00498689 0.00497886 0.00498933 0.00498056
Questions:
Shouldn't the prediction array be 1s and 0s? Why do I get output that looks as if the activation were sigmoid and the loss were binary_crossentropy? What is wrong? I want to emphasize again that the model is not really trained well with data; it is almost just initialized with random weights.
If I don't train the network well (it has not converged yet), for example by just initializing the weights with random numbers, should the prediction still be 1s and 0s?
If I want to get the probability of the prediction, and then decide how to interpret it, how do I get the probability output after the CNN is trained?
Your number of outputs is 201, which is why your output has shape (1, 201) rather than being a hard 0/1 vector. You can easily get which class has the highest value by using np.argmax, and that class is the output of your model for the given input.
And even when you have trained for only 1 epoch, your model has learned something; it may be very crude, but it still learns something and predicts the output based on that.
You have used softmax as the activation in your last layer. It normalizes your output in a non-linear fashion so that the sum of the outputs over all classes equals 1. So the value you get for each class can be interpreted as the probability of that class for the given input. (For more clarity, you can look into how the softmax function works.)
And lastly, each class has values like 0.0049 or similar because the model is not sure which class your input belongs to. So it calculates values for each class and then softmax normalizes it. That is why your output values are in the range 0 to 1.
For example, say I have four classes; then one possible output could be [0.223 0.344 0.122 0.311], which in the end we read as a confidence score for each class. Looking at the confidence scores, we can say the predicted class is the second one, as it has the highest confidence score of 0.344.
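Continuing the question's snippet, a short sketch of reading the predicted class and its confidence from the (1, 201) prediction array (loaded_model and batch_image are the objects from the question):

import numpy as np

prediction = loaded_model.predict(batch_image, batch_size=1)   # shape (1, 201)
predicted_class = np.argmax(prediction, axis=1)   # index of the highest probability
confidence = np.max(prediction, axis=1)           # that class's probability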
The output of a softmax layer is not 0 or 1. It is actually a normalized layer whose outputs add up to 1: if you sum all of the coefficients, they add up to 1. To get the prediction, you should take the one with the highest value. You can interpret them as probabilities even if, technically, they are not. See https://en.wikipedia.org/wiki/Softmax_function for the definition.
This layer is used in the training process in order to compare the prediction of a categorical classification with the true label.
It is required for the optimization because the optimization is done on differentiable functions (having a gradient), and a hard 0/1 output would not be differentiable (not even continuous). The optimization is then done on all these values.
An interesting example is the following: if your true target is [0 0 1 0] and your prediction output is [0.1 0.1 0.6 0.2], then even though the prediction is correct (the highest value falls on the right class), the model can still learn, because it still gives non-zero probability to the other classes, on which you can compute a gradient.
In order to get the prediction output in the form of a class instead of a probability, use:
model.predict_classes(x_train,batch_size)
My understanding is that softmax gives the likelihood of the value landing in each of the 201 buckets. With complete certainty about the first bucket you would get [1, 0, 0, 0, 0, ...]. Since very little training/learning/weight adjustment has occurred, the 201 values are all about 0.00497, which together sum to 1.
A decent description of softmax can be found on Google's developers site.
The output size was specified as number_outputs, so you get 201 outputs, each of which tells you the likelihood (as a value between 0 and 1) of your prediction being THAT output.