Why use ReLU in the final layer of a Neural Network?

It is recommended that we use ReLU in the final layer of the neural network when we are learning a regression.
It makes sense to me, since the output from ReLU is not confined between 0 and 1.
However, how does it behave when x < 0 (i.e. when the ReLU output is zero)? Can y (the result of the regression) still be less than 0?
I believe I am missing a basic mathematical concept here. Any help is appreciated.

You typically use:
A linear layer for regression, in order to get a continuous value
Softmax for classification, where you want a probability distribution over classes
But these aren't set in stone. If you know your output value for a regression should only be positive, why not use a ReLU? If the output of your classification isn't a probability distribution (e.g. which classes exist), you could just as easily use a sigmoid.
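To make this concrete, here is a minimal Keras sketch (layer sizes and input shape are illustrative assumptions, not from the question) of a regression head with a linear output versus one constrained to non-negative predictions with ReLU:

from tensorflow import keras
from tensorflow.keras import layers

# Unconstrained regression: a linear output can produce any real value.
linear_model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='linear'),  # y can be negative or positive
])

# Non-negative regression: a ReLU output clamps predictions to y >= 0,
# so only use it if the target really cannot be negative.
relu_model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='relu'),  # y >= 0 by construction
])

linear_model.compile(optimizer='adam', loss='mse')
relu_model.compile(optimizer='adam', loss='mse')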

Related

Do I need to apply the Softmax Function ANYWHERE in my multi-class classification Model?

I am currently turning my binary classification model into a multi-class classification model. Bear with me, I am very new to PyTorch and machine learning.
Most of what I state here, I know from the following video.
https://www.youtube.com/watch?v=7q7E91pHoW4&t=654s
What I read / know is that CrossEntropyLoss already has the softmax function implemented, thus my output layer is linear.
What I then read / saw is that I can just choose my model prediction by taking the torch.max() of my model output (which comes from my last linear layer). This feels weird because I have some negative outputs and I thought I needed to apply the softmax function first, but it seems to work right without it.
So now the big confusing question I have is: when would I use the softmax function? Would I only use it when my loss doesn't have it implemented? But then I would choose my prediction based on the outputs of the softmax layer, which wouldn't be the same as with the linear output layer.
Thank you guys for every answer this gets.
For calculating the loss with CrossEntropyLoss you do not need softmax, because CrossEntropyLoss already includes it. However, to turn the model outputs into probabilities you still need to apply softmax.
Let's say you didn't apply softmax at the end of your model and trained it with CrossEntropyLoss. Then you want to evaluate your model on new data, get outputs, and use those outputs for classification. At that point you can manually apply softmax to your outputs, and there will be no problem. This is how it is usually done.
Training()
MODEL ----> FC LAYER ---> raw outputs ---> CrossEntropyLoss
Eval()
MODEL ----> FC LAYER ---> raw outputs ---> Softmax ---> Probabilities
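A minimal PyTorch sketch of that flow (the model, sizes, and data here are illustrative placeholders, not taken from the question):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))  # raw logits for 3 classes
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

# Training: feed raw logits straight into the loss.
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
loss = criterion(model(x), y)
loss.backward()

# Eval: apply softmax only when you want probabilities; argmax of the
# raw logits picks the same class either way.
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    preds = logits.argmax(dim=1)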
Yes, you need to apply softmax on the output layer. When you are doing binary classification you are free to use relu, sigmoid, tanh, etc. as the activation function. But when you are doing multi-class classification, softmax is required because it distributes the probability across the output nodes, so you can easily conclude that the output node with the highest probability belongs to a particular class. Thank you. Hope this is useful!

When should I use the "sigmoid" and "relu" functions in a CNN?

To implement a CNN model for image classification we need to use the sigmoid and relu functions, but I am confused about what each of these is for.
If you are working with a conventional CNN for image classification, the output layer has N neurons, where N is the number of image classes you want to identify. You want each output neuron to represent the probability that you have observed each image class. The sigmoid function is good for representing a probability. Its domain is all real numbers, but its range is 0 to 1.
For network layers that are not output layers, you could also use the sigmoid. In theory, any non-linear transfer function will work in the inner layers of a neural network. However, there are practical reasons not to use the sigmoid. Some of those reasons are:
Sigmoid requires a fair amount of computation.
The slope of the sigmoid function is very shallow when the input is far from zero, which slows gradient descent learning down.
Modern neural networks have many layers, and if you have several layers in a neural network with sigmoid functions between them, it's quite possible to end up with vanishing gradients and effectively no learning.
The ReLU function solves many of sigmoid's problems. It is easy and fast to compute. Whenever the input is positive, ReLU has a slope of 1, which provides a strong gradient to descend. ReLU is not limited to the range 0-1, though, so if you used it in your output layer, it would not be guaranteed to represent a probability.
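A minimal Keras sketch along those lines (layer sizes, input shape, and class count are illustrative assumptions): ReLU in the hidden layers, and sigmoid only at the output so each output can be read as a probability. For mutually exclusive classes you would swap the last layer for softmax with categorical_crossentropy.

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10  # hypothetical number of image classes
model = keras.Sequential([
    layers.Conv2D(16, 3, activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'),  # ReLU in inner layers: cheap, strong gradient
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(num_classes, activation='sigmoid'),  # each output squashed into [0, 1]
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])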

Why does almost every Activation Function Saturate at Negative Input Values in a Neural Network

This may be a very basic/trivial question.
For negative inputs,
the output of the ReLU activation function is zero
the output of the sigmoid activation function is zero
the output of the tanh activation function is -1
Below are my questions:
Why do all of the above activation functions saturate for negative input values?
Is there any activation function we can use if we want to predict a negative target value?
Thank you.
True - ReLU is designed to output zero for negative values. (It can be dangerous with big learning rates, bad initialization, or very few units - all neurons can get stuck at zero and the model freezes.)
False - Sigmoid outputs values near zero only for "very negative" inputs, not for all negative inputs. If your inputs are between -3 and +3, you will see a very pleasant result between 0 and 1.
False - The same comment as for sigmoid. If your inputs are between -2 and 2, you will see nice results between -1 and 1.
So, the saturation problem only exists for inputs whose absolute values are too big.
By definition, the outputs are:
ReLU: 0 <= y < inf (centered at 0)
Sigmoid: 0 < y < 1 (centered at 0.5)
TanH: -1 < y < 1 (centered at 0)
You might want to use a BatchNormalization layer before these activations to avoid big values and avoid saturation.
For predicting negative outputs, tanh is the only one of the three that is capable of doing that.
You could invent a negative sigmoid, though; it's pretty easy:
import keras

def neg_sigmoid(x):
    # sigmoid mirrored below zero: outputs are in (-1, 0)
    return -keras.backend.sigmoid(x)

# use it as a layer:
keras.layers.Activation(neg_sigmoid)
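For the BatchNormalization suggestion above, a minimal sketch (layer sizes are illustrative assumptions), putting the normalization before the activation so its inputs stay near zero, away from the saturated tails:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, input_shape=(20,)),
    layers.BatchNormalization(),         # keeps pre-activation values small
    layers.Activation('tanh'),           # so tanh is used in its non-saturated region
    layers.Dense(1, activation='tanh'),  # tanh output can represent negative targets in (-1, 1)
])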
In short, negative/positive doesn't matter for these activation functions.
Sigmoid and tanh are both saturated for very positive and very negative values. As stated in the comments, they are symmetrical around input 0. ReLU only saturates for negative values, but I'll explain why that doesn't matter in the next question.
The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to give an equation to predict your final value, but to give non-linearity to your neural network in the middle layers. You then use some appropriate function at the last layer to get the wanted output values, e.g. softmax for classification, or just a linear layer for regression.
So, because these activation functions are in the middle, it really doesn't matter if an activation function only outputs positive values even when your 'wanted' values are negative, since the model will make the weights for the next layers negative (hence the phrase 'wanted values are negative' doesn't mean anything for the hidden layers).
So, ReLU being saturated on the negative side is no different from it being saturated on the positive side. There are activation functions that don't saturate, such as leaky ReLU, so you may want to check those out. But the point is that positive/negative doesn't matter for activation functions.
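To illustrate that point, here is a minimal Keras sketch (the data and sizes are made up for the example): ReLU hidden layers with a linear output can still fit strictly negative targets, because the last layer's weights and bias can be negative.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Targets are strictly negative even though the hidden ReLU activations are >= 0.
x = np.random.randn(256, 4)
y = -np.abs(x.sum(axis=1)) - 1.0

model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(4,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='linear'),  # negative outputs come from negative weights/bias here
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=10, verbose=0)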
The key idea behind introducing the ReLU activation function was to address the issue of vanishing gradients in deeper networks. However, for some initializations, when the weights go above 1 it can lead to exploding gradient values and cause the network to saturate. Another key idea behind ReLU was to introduce sparsity into the network: put simply, it prunes the connections deemed unimportant (that is, those whose pre-activation comes out negative). We do have to be careful about the distribution of the weights we initialize, or the network can end up too sparse and unable to learn more information.
Sigmoid - The key problem with sigmoid for gradient-based learning rules is that its derivative goes to 0 for inputs of very large magnitude, causing vanishing gradients; so the problem is not specific to negative values, it appears for large positive input values as well.
Tanh - The idea behind tanh is to not enforce the sparsity that ReLU enforces, and to utilize more complex network dynamics for learning, similar to the sigmoid function. Put simply, tanh tries to use the entire network's capacity to learn, while addressing the vanishing gradient problem similarly to ReLU. Having a negative range in the network also acts as a dynamic regularizer (negative weights are strongly pulled towards -1 and weights near 0 go towards 0), which is useful for binary classification or problems with few classes.

When to use bias in Keras model?

I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when to use bias and when to turn it off?
The short answer is: always use bias variables when your model is small; and even otherwise, it is still recommended to keep using bias in all neural network architectures.
That is because each neuron performs like a simple logistic regression: in each neuron, the input values are multiplied by the weights, and the bias shifts the operating point of the sigmoid function, which produces the desired non-linearity.
For example, if you have an all-zero input in your training data, like X = [[0,0,...], [0,0,...], ...], Y = 1, then without a bias the output of a sigmoid neuron will always be exactly 0.5, since X*W is zero. However, in large networks, each node can effectively derive a bias from the average activation of all of its inputs.
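A minimal sketch of toggling this in Keras (layer sizes are illustrative; use_bias defaults to True):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(32, activation='relu', use_bias=True, input_shape=(8,)),  # weights plus a bias term
    layers.Dense(32, activation='relu', use_bias=False),                   # weights only, bias turned off
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')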

Keras: Implementing SVM using hinge or categorical hinge?

I have a neural net with a few layers that does some stuff, and I want an SVM at the end. I have googled and searched on Stack Exchange, and it seems it is easily implemented in Keras using the loss function hinge or categorical_hinge. However, I am confused as to which one to use.
My examples are to be classified into a binary class, either class 0 or class 1. So I can either do it via:
Method 1 https://github.com/keras-team/keras/issues/2588 (uses hinge) or How do I use categorical_hinge in Keras? (uses categorical_hinge):
Labels will be of shape (,2) with values of 0 or 1 indicating whether the example belongs to that class or not.
nb_classes = 2
model.add(Dense(nb_classes, W_regularizer=l2(0.01)))
model.add(Activation('linear'))
model.compile(loss='hinge',  # or 'categorical_hinge'?
              optimizer='adadelta',
              metrics=['accuracy'])
Then the class is whichever of the two output nodes has the higher value?
Method 2 https://github.com/keras-team/keras/issues/2830 (uses hinge):
The first commenter mentioned that hinge is supposed to be a binary hinge, that the labels must be -1 or 1 for no or yes, and that the activation for the last SVM layer should be tanh with only 1 node.
So it should look something like this, but the labels will be of shape (,1) with values of either -1 or 1.
model.add(Dense(1, W_regularizer=l2(0.01)))
model.add(Activation('tanh'))
model.compile(loss='hinge',
              optimizer='adadelta',
              metrics=['accuracy'])
So which method is correct or more desirable? I am unsure of what to use since there are multiple answers online and the Keras documentation says nothing at all about the hinge and categorical_hinge loss functions. Thank you!
Might be a bit late, but here is my answer.
You can do it in multiple ways:
Since you have 2 classes it is a binary problem, and you can use the normal hinge. The architecture then only has to put out one output, -1 or 1, as you said.
You can also use 2 outputs in the last layer; your labels just have to be one-hot encodings, and then you use the categorical hinge.
As for the activation, a linear layer and a tanh would both make an SVM; the tanh output is just a smoothed version.
I would suggest making it binary and using a tanh layer, but try both things to see what works.
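A minimal sketch of the binary variant suggested above, written against current Keras (so kernel_regularizer rather than the older W_regularizer); the data, sizes, and optimizer settings are illustrative assumptions, and the labels must be -1 or 1 for the hinge loss:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical features; hinge loss expects labels in {-1, 1}.
x = np.random.randn(128, 10)
y = np.where(x.sum(axis=1) > 0, 1, -1)

model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='tanh',
                 kernel_regularizer=regularizers.l2(0.01)),  # smoothed SVM-style output in (-1, 1)
])
model.compile(loss='hinge', optimizer='adadelta', metrics=['accuracy'])
model.fit(x, y, epochs=10, verbose=0)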
