To implement a CNN model for image classification we need to use the sigmoid and ReLU functions, but I am confused about what these are actually used for.
If you are working with a conventional CNN for image classification, the output layer has N neurons, where N is the number of image classes you want to identify. You want each output neuron to represent the probability that the input belongs to the corresponding class. The sigmoid function is well suited to representing a probability: its domain is all real numbers, but its range is 0 to 1.
For network layers that are not output layers, you could also use the sigmoid. In theory, any non-linear transfer function will work in the inner layers of a neural network. However, there are practical reasons not to use the sigmoid. Some of those reasons are:
Sigmoid requires a fair amount of computation.
The slope of the sigmoid function is very shallow when the input is far from zero, which slows gradient descent learning down.
Modern neural networks have many layers, and if several layers in a row use sigmoid functions, the small slopes multiply together and it's quite possible to end up with a vanishing gradient, so learning effectively stops.
The ReLU function solves many of sigmoid's problems. It is easy and fast to compute. Whenever the input is positive, ReLU has a slope of 1, which provides a strong gradient to descend. ReLU is not limited to the range 0-1, though, so if you used it in your output layer, it would not be guaranteed to represent a probability.
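For reference, here is a minimal NumPy sketch of the two functions (the names and sample values are just illustrative), showing why sigmoid suits an output probability while ReLU is unbounded above:

import numpy as np

def sigmoid(x):
    # squashes any real input into (0, 1), so it can represent a probability
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # zero for negative inputs, identity for positive inputs; range is [0, inf)
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))  # values squeezed between 0 and 1, exactly 0.5 at x = 0
print(relu(x))     # [ 0.  0.  0.  1. 10.] -- unbounded above, so not a probability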
This may be a very basic/trivial question.
For negative inputs:
The output of the ReLU activation function is zero.
The output of the sigmoid activation function is zero.
The output of the tanh activation function is -1.
Below are my questions:
Why do all of the above activation functions saturate for negative input values?
Is there an activation function we can use if we want to predict a negative target value?
Thank you.
True - ReLU is designed to result in zero for negative values. (It can be dangerous with big learning rates, bad initialization, or very few units: all neurons can get stuck at zero and the model freezes.)
False - Sigmoid results in zero for "very negative" inputs, not for "negative" inputs. If your inputs are between -3 and +3, you will see a very pleasant result between 0 and 1.
False - The same comment as Sigmoid. If your inputs are between -2 and 2, you will see nice results between -1 and 1.
So, the saturation problem only exists for inputs whose absolute values are too big.
By definition, the output ranges are:
ReLU: 0 <= y < inf (centered at 0)
Sigmoid: 0 < y < 1 (centered at 0.5)
TanH: -1 < y < 1 (centered at 0)
You might want to use a BatchNormalization layer before these activations to avoid having big values and avoid saturation.
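For example, a minimal tf.keras sketch (layer sizes are arbitrary) of placing BatchNormalization before the activation, so the non-linearity sees roughly zero-mean, unit-variance inputs:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, input_shape=(32,)),   # linear part of the layer only
    layers.BatchNormalization(),           # normalize before the non-linearity
    layers.Activation("tanh"),             # the activation now sees values near 0
    layers.Dense(1, activation="sigmoid"),
])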
For predicting negative outputs, tanh is the only one of the three that is capable of doing that.
You could invent a negative sigmoid, though, it's pretty easy:
from tensorflow import keras

def neg_sigmoid(x):
    # sigmoid flipped to the range (-1, 0)
    return -keras.backend.sigmoid(x)

# use it as a layer:
keras.layers.Activation(neg_sigmoid)
In short, negative/positive doesn't matter for these activation functions.
Sigmoid and tanh are both saturated for very positive and very negative values. As stated in the comments, they are symmetrical around input 0. ReLU only saturates for negative values, but I'll explain why that doesn't matter in the next question.
The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to give an equation to predict your final value, but to give a non-linearity to your neural network in the middle layers. You then use an appropriate function at the last layer to get the desired output values, e.g. softmax for classification or a linear activation for regression.
So because these activation functions are in the middle, it really doesn't matter if the activation function only outputs positive values even when your target values are negative, since the model can make the weights for the next layer negative (hence 'the wanted values are negative' doesn't change anything).
So ReLU being saturated on the negative side is no different from it being saturated on the positive side. There are activation functions that don't saturate, such as leaky ReLU, so you may want to check those out. But whether the activation function outputs positive or negative values doesn't matter.
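To make that concrete, here is a tiny NumPy sketch (the numbers are made up) showing that a network with a ReLU hidden unit can still produce a negative prediction, because the outgoing weight of the final linear layer can be negative:

import numpy as np

x = np.array([2.0])                 # input
w1, b1 = np.array([1.5]), 0.0       # hidden layer weight / bias
h = np.maximum(0.0, w1 * x + b1)    # ReLU hidden activation: 3.0 (always >= 0)
w2, b2 = np.array([-0.8]), 0.5      # the output layer can have a negative weight
y = w2 * h + b2                     # linear output: -1.9, a negative prediction
print(y)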
The key idea behind introducing the ReLU activation function was to address the issue of vanishing gradients in deeper networks. However, depending on the initialization, weights that grow above 1 can lead to exploding gradients and destabilize training. Another key idea behind ReLU is that it introduces sparsity into the network: put simply, it prunes the connections deemed unimportant at that point (those producing negative pre-activations). We do have to be careful about the distribution of the weights we initialize, or the network can end up too sparse and unable to learn more information.
Sigmoid - The key problem with sigmoid for gradient-based learning rules is that its derivative goes to 0 for inputs of large magnitude, whether very negative or very positive. This causes vanishing gradients, so sigmoid saturates not only for negative values but also for large positive input values.
Tanh - The idea behind tanh is to avoid the sparsity enforced by ReLU and to use the whole network for learning, similar to the sigmoid function. Because tanh is zero-centered and has a steeper slope around 0, it suffers less from vanishing gradients than sigmoid, although it still saturates for inputs of large magnitude. Having a negative output range also helps: strongly negative pre-activations are pushed towards -1 and values near 0 stay near 0, which acts somewhat like a dynamic regularizer and is useful for binary or few-class classification problems.
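To see the saturation numerically, here is a small NumPy sketch comparing the derivatives: sigmoid'(x) = s(x)(1 - s(x)) and tanh'(x) = 1 - tanh(x)^2 both collapse towards 0 for large |x|, while the ReLU derivative stays at exactly 1 for any positive input:

import numpy as np

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

s = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = s * (1.0 - s)          # ~4.5e-05 at |x| = 10, at most 0.25 at x = 0
d_tanh = 1.0 - np.tanh(x) ** 2     # ~8.2e-09 at |x| = 10, at most 1.0 at x = 0
d_relu = (x > 0).astype(float)     # 0 for x <= 0, exactly 1 for x > 0

print(d_sigmoid, d_tanh, d_relu)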
This link has some good information that would be helpful for you.
I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when to use a bias versus when to turn it off?
The short answer is: always use bias variables when your model is small, and even otherwise it is recommended to keep using bias in all neural network architectures.
This is because each neuron performs like a simple logistic regression: the input values are multiplied by the weights, and the bias shifts where the sigmoid operates, which gives you the desired non-linearity at the right operating point.
For example, if you have an all-zero input in your training data, like X = [[0,0,...], [0,0,...], ...] with Y = 1, then without a bias the output of a sigmoid neuron will always be exactly 0.5, since X*W is zero. However, in large networks each node can effectively build its own bias out of the average activation of all of its inputs.
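As an illustration, here is a minimal tf.keras sketch (layer sizes are arbitrary) showing where that choice is made; Dense layers expose a use_bias flag, which defaults to True:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(16, activation="relu", use_bias=True, input_shape=(4,)),  # the default
    layers.Dense(1, activation="sigmoid", use_bias=False),                 # bias turned off here
])

Without the bias, the last neuron must output sigmoid(w . h), so for an all-zero h it is stuck at exactly 0.5, which is the situation described above.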
It is recommended that we use ReLU in the final layer of the neural network when we are learning a regression.
It makes sense to me, since the output from ReLU is not confined between 0 and 1.
However, how does it behave when x < 0 (i.e. when the ReLU output is zero)? Can y (the result of the regression) still be less than 0?
I believe I am missing a basic mathematical concept here. Any help is appreciated.
You typically use:
A linear layer for regression in order to get a continuous value
Softmax for classification where you want a probability distribution of classes
But these aren't set in stone. If you know your output value for a regression should only be positive, why not use a ReLU? If the output of your classification isn't a probability distribution over mutually exclusive classes (e.g. you want to predict which classes exist), you could just as easily use a sigmoid.
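As a concrete illustration, here is a minimal tf.keras sketch (layer sizes and names are arbitrary) of the output heads mentioned above:

from tensorflow.keras import layers, models

def make_model(task, n_classes=10):
    head = {
        "regression": layers.Dense(1, activation="linear"),                 # any real value
        "positive_regression": layers.Dense(1, activation="relu"),          # y >= 0 only
        "classification": layers.Dense(n_classes, activation="softmax"),    # class probabilities
    }[task]
    return models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(32,)),
        head,
    ])

Note that with a ReLU output head the model can never produce a negative prediction, which is exactly the limitation asked about above; a linear head has no such restriction.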
I developed a CNN using MatConvNet and am able to visualize the weights of the 1st layer. It looked very similar to what is shown here (also attached below in case I am not specific enough): http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
My question is, what are the weight gradients? I'm not sure what those are and am unable to generate them.
Weights in a NN
In a neural network, a series of linear functions represented as matrices is applied to the features (usually with a non-linearity between them). These functions are determined by the values in the matrices, referred to as weights.
You can visualize the weights of a normal neural network, but it usually means something slightly different to visualize the convolutional layers of a CNN. These layers are designed to learn feature detectors over the spatial extent of the input.
When you visualize the weights, you're looking for patterns. A nice smooth filter may mean that the weights are well learned and "looking for something in particular". A noisy weight visualization may mean that you've undertrained your network, overfit it, need more regularization, or something else nefarious (a decent source for these claims).
From this decent review of weight visualizations, we can see patterns start to emerge when treating the weights as images.
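As a rough sketch of how such a visualization can be produced (in tf.keras/matplotlib rather than MatConvNet; the tiny untrained model here is only a stand-in for your trained network):

from tensorflow.keras import layers, models
import matplotlib.pyplot as plt

model = models.Sequential([layers.Conv2D(16, (5, 5), input_shape=(32, 32, 3))])

weights = model.layers[0].get_weights()[0]   # shape: (5, 5, 3, 16)
for i in range(weights.shape[-1]):
    f = weights[:, :, :, i]
    f = (f - f.min()) / (f.max() - f.min() + 1e-8)   # rescale to [0, 1] for display
    plt.subplot(4, 4, i + 1)
    plt.imshow(f)                                     # each 3-channel filter drawn as a tiny image
    plt.axis("off")
plt.show()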
Weight Gradients
"Visualizing the gradient" means taking the gradient matrix and treating like an image [1], just like you took the weight matrix and treated it like an image before.
A gradient is just a derivative; for images, it's usually computed as a finite difference - grossly simplified, the X gradient subtracts pixels next to each other in a row, and the Y gradient subtracts pixels next to each other in a column.
For the common example of a filter that extracts edges, we may see a strong gradient in a particular direction. By visualizing the gradients (taking the matrix of finite differences and treating it like an image), you can get a more immediate idea of how your filter is operating on the input. There are a lot of cutting-edge techniques for interpreting these results, but making the image pop up is the easy part!
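Here is a minimal NumPy sketch of the finite differences described above, applied to a toy filter (the values are made up):

import numpy as np

filt = np.array([[0.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0]])   # toy edge-like filter

grad_x = np.diff(filt, axis=1)   # differences between neighbouring pixels in each row (X gradient)
grad_y = np.diff(filt, axis=0)   # differences between neighbouring pixels in each column (Y gradient)

print(grad_x)
print(grad_y)
# visualizing grad_x / grad_y as images (e.g. with matplotlib's imshow)
# shows where the filter changes most strongly and in which direction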
A similar technique involves visualizing the activations after a forward pass over the input. In this case, you're looking at how the input was changed by the weights; by visualizing the weights, you're looking at how you expect them to change the input.
Don't over-think it - the weights are interesting because they let us see how the function behaves, and the gradients of the weights are just another feature to help explain what's going on. There's nothing sacred about that feature: here are some cool clustering visualizations (t-SNE) from the Google paper that look at the separability of the feature space.
[1] It can be more complicated if you introduce weight sharing, but not that much
My answer here covers this question https://stackoverflow.com/a/68988426/10661506
Long story short, weight gradient of layer l is the gradient of the loss with respect to the weights of layer l.
If you have a correct implementation of backpropagation, you should have access to these gradients, as they are needed to compute the weight updates at every layer.
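For example, in Keras/TensorFlow (rather than MatConvNet) one way to get at these per-layer weight gradients is a sketch like the following; the model, loss, and data here are placeholders:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([layers.Dense(8, activation="relu", input_shape=(4,)),
                           layers.Dense(1)])
x = tf.random.normal((16, 4))
y = tf.random.normal((16, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))   # simple MSE loss

# one gradient tensor per trainable weight (kernels and biases of every layer)
grads = tape.gradient(loss, model.trainable_variables)
for var, g in zip(model.trainable_variables, grads):
    print(var.name, g.shape)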
I am training a neural network with 3 convolutional layers and 1 fully connected layer, which is further connected to a regression layer. My training data set is TID, for an image quality estimation task. I am pre-processing images by doing local contrast normalization on a small 6 x 6 window, which results in pixel values ranging approximately from -4 to +4. I am using the ReLU activation function in all the layers.
The problem is that the gradient values for the different layers produced by my network are very small, somewhere around 2e-6 to 2e-7, which I guess is not ideal for good convergence, and indeed the network is not producing the expected results. I have already tried altering the initialization of the network weights and biases, changing learning rates, employing L1 and L2 regularization, etc., but nothing seems to help alleviate this problem.
So my first question is: is this actually a problem, or is it common to have such small gradient values in a convolutional network? If it is a problem, what could be an appropriate solution?