When to use bias in a Keras model?

I am new to modeling with Keras and I am trying to evaluate appropriate parameters for setting up the model. How do I know when to use bias and when to turn it off?

The short answer is: always use bias when your model is small, and even for larger models it is still recommended to keep bias terms in all neural network architectures.
That's because each neuron behaves like a simple logistic regression: the inputs are multiplied by the weights, and the bias shifts the operating point of the sigmoid function, which produces the desired non-linearity.
For example, if your training data contains an all-zero input, say X = [[0,0,...], [0,0,...],... ] with Y = 1, a sigmoid neuron without bias will always output exactly 0.5, since X*W is zero no matter what the weights are. In large networks, however, each node can effectively synthesize a bias from the average activation of all of its inputs.
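To make this concrete, here is a minimal sketch in tf.keras (assuming TensorFlow 2.x); use_bias is the constructor flag that controls this:
import tensorflow as tf

with_bias = tf.keras.layers.Dense(10, activation="sigmoid")  # use_bias=True is the default
without_bias = tf.keras.layers.Dense(10, activation="sigmoid", use_bias=False)

x = tf.zeros((1, 4))    # an all-zero input row
print(without_bias(x))  # every unit outputs exactly sigmoid(0) = 0.5, whatever the weights are
print(with_bias(x))     # the bias term lets each unit shift away from 0.5 once trained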

Related

Is there a mean-variance normalization layer in PyTorch?

I am new to PyTorch and I would like to add a mean-variance normalization layer to my network that will normalize features to zero mean and unit standard deviation. I got a bit confused reading the documentation; could anyone give me some leads?
As @Ivan commented, the normalization can be done on many levels. However, since you say
normalize features to zero mean and unit standard deviation
I suppose you just want to feed unbiased data to the network. If that's the case, you should treat it as a data preprocessing step rather than a layer of your model, and basically do:
X = (X - torch.mean(X, dim=0))/torch.std(X, dim=0)
As an alternative, You can use torchvision.transforms:
preprocess = torchvision.transforms.Normalize(mean=torch.mean(X, dim=0), std=torch.std(X, dim=0))
X = preprocess(X)
as in this native ResNet example. Note the reasonable assumption that future data will always have roughly the same mean and standard deviation as the set used for the initial calculation (typically the training set). For this reason, we should preserve the initially calculated values and use them for preprocessing in any future inference scenario.
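A minimal sketch of that workflow (X_train and X_test are hypothetical stand-ins for your own tensors):
import torch

X_train = torch.randn(1000, 20)  # placeholder training features
X_test = torch.randn(200, 20)    # placeholder future/test features

# Compute the statistics once, on the training set only...
mean = torch.mean(X_train, dim=0)
std = torch.std(X_train, dim=0)

# ...then apply the same values to all future data at inference time.
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std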

When should I use "sigmoid" and "relu" functions in a CNN?

To implement a CNN model for image classification we need to use the sigmoid and ReLU functions, but I am confused about what each of these is used for.
If you are working with a conventional CNN for image classification, the output layer has N neurons, where N is the number of image classes you want to identify. You want each output neuron to represent the probability that you have observed each image class. The sigmoid function is good for representing a probability. Its domain is all real numbers, but its range is 0 to 1.
For network layers that are not output layers, you could also use the sigmoid. In theory, any non-linear transfer function will work in the inner layers of a neural network. However, there are practical reasons not to use the sigmoid. Some of those reasons are:
Sigmoid requires a fair amount of computation.
The slope of the sigmoid function is very shallow when the input is far from zero, which slows gradient descent learning down.
Modern neural networks have many layers, and if you stack several layers with sigmoid functions between them, the gradients can shrink toward zero (the vanishing gradient problem), effectively halting learning.
The ReLU function solves many of sigmoid's problems. It is easy and fast to compute. Whenever the input is positive, ReLU has a slope of 1, which provides a strong gradient to descend. ReLU is not limited to the range 0-1, though, so if you used it in your output layer, it would not be guaranteed to represent a probability.
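Putting the two conventions together, here is a minimal tf.keras sketch (the input shape, layer sizes, and class count N are arbitrary placeholders): ReLU in the hidden layers, sigmoid only at the output.
import tensorflow as tf

N = 5  # hypothetical number of image classes

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),    # cheap to compute, strong gradient for positive inputs
    tf.keras.layers.Dense(N, activation="sigmoid"),  # each output squashed into (0, 1)
])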

What are the best normalization technique and LSTM structure for forecasting an output with jumps (outliers)?

I have a time series forecasting case with ten features (inputs) and only one output. I'm using 22 timesteps (the history of the features) for one-step-ahead prediction with an LSTM. I apply MinMaxScaler to normalize the inputs, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), while the other values are between 0 and ~5 (all values are positive). It's important to forecast both the normal values and the jumps correctly, so I don't want my model to miss them. I think if I use MinMaxScaler on the output, most of the values will end up near zero while the outliers will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (Currently I'm using an LSTM with relu and a Dense layer with relu as the last layer, so the output is always positive.) I think I should choose the activation functions carefully for this case.
I think, first of all, you should decide on a metric to measure performance, for example MAE or MSE, or some other metric suited to the task at hand. For instance, you may tolerate greater error for the rare jumps but not for the normal cases, or vice versa. Once you have decided on the error metric, ideally you should set it as the cost function that the LSTM network minimizes.
Now the goal is to minimize the error metric you chose. If this were a convex problem, the scaling of the output would not matter. But we know that this is not the case with complex deep learning architectures: while minimizing the cost function with gradient descent, the optimizer might get stuck in a local minimum, or converge very slowly. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With the last layer's parameters initialized around zero and a bias value of zero (i.e. operating in the linear region of the ReLU), the network first needs to learn that the bias should be around 5. Depending on the complexity of the network, this could take many epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
Now back to your questions:
I would at least make the output zero-mean and use a Dense layer with a linear output.
The architecture you have seems fine; you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies. A sketch of both suggestions follows.
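A rough sketch of those two suggestions in Keras (the dummy data, layer sizes, and epoch count are placeholders, not tuned values):
import numpy as np
import tensorflow as tf

# Dummy stand-ins for your data: 22 timesteps, 10 features, one positive target.
X_train = np.random.rand(500, 22, 10).astype("float32")
y_train = np.random.rand(500).astype("float32") * 5

# Make the target zero-mean; keep y_mean to undo the shift at prediction time.
y_mean = y_train.mean()
y_train_centered = y_train - y_mean

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(22, 10)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="linear"),  # linear head instead of relu
])
model.compile(optimizer="adam", loss="mse")  # or "mae", matching the metric you chose
model.fit(X_train, y_train_centered, epochs=2, verbose=0)

predictions = model.predict(X_train) + y_mean  # shift back to the original scale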
Feel free to update the OP with the code and the performance you get, and we can discuss what else can be improved.

Normalization of input data in Keras

One common task in DL is to normalize input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
import numpy as np

mean = np.mean(X, axis = 0)
std = np.std(X, axis = 0)
X = (X - mean) / std  # vectorized equivalent of [(x - mean)/std for x in X]
However, then one must keep the mean and std values around, to normalize the testing data, in addition to the Keras model being trained. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))  # hypothetical layer name
m.add(Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid'))
# ... rest of network
m.add(Dense(1, activation = 'sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
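As a minimal sketch of that setup, reusing the OP's layer shapes (assuming TensorFlow 2.x's tf.keras):
import tensorflow as tf

m = tf.keras.Sequential()
# scale=False and center=False drop the learned affine part, so the layer
# only standardizes its input, mimicking the OP's manual normalization.
m.add(tf.keras.layers.BatchNormalization(scale=False, center=False, input_shape=(21, 100, 3)))
m.add(tf.keras.layers.Conv2D(20, (5, 5), padding='valid'))
# ... rest of network
m.add(tf.keras.layers.Flatten())
m.add(tf.keras.layers.Dense(1, activation='sigmoid'))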
There's now a Keras layer for this purpose, Normalization. At the time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, call the layer's adapt method with the data X from which you want to derive the scale (i.e. the mean and standard deviation). Once you do this, the scale is fixed; it does not change during training. The scale is then applied to the inputs whenever the model is used (during training and prediction).
from keras.layers.experimental.preprocessing import Normalization
import keras

norm_layer = Normalization()
norm_layer.adapt(X)
model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object allows you to save the scaling parameters.
Then you can feed mixed-type inputs into your model, let's say:
Your_model
[param1_scaler, param2_scaler]
Here are two links:
https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
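For the StandardScaler part, a small sketch of fitting, saving, and reloading the scaler (the file name and dummy data are placeholders):
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 8)  # stand-in for your training data

scaler = StandardScaler().fit(X_train)  # stores mean_ and scale_ on the object
X_train_scaled = scaler.transform(X_train)

# Persist the fitted scaler next to the Keras model...
joblib.dump(scaler, "scaler.joblib")
# ...and reload it at inference time so new data gets the same scaling.
scaler = joblib.load("scaler.joblib")
X_new_scaled = scaler.transform(np.random.rand(5, 8))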
There's BatchNormalization, which learns the mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.

Why use ReLu in the final layer of Neural Network?

It is recommended that we use ReLU in the final layer of a neural network when we are learning regressions.
It makes sense to me, since the output of ReLU is not confined between 0 and 1.
However, how does it behave when x < 0 (i.e. when the ReLU output is zero)? Can y (the result of the regression) still be less than 0?
I believe I am missing a basic mathematical concept here. Any help is appreciated.
You typically use:
A linear layer for regression in order to get a continuous value
Softmax for classification where you want a probability distribution of classes
But these aren't set in stone. If you know your output value for a regression should only be positive, why not use a ReLU? Note, though, that a ReLU output clips every negative pre-activation to zero, so the prediction y can never be less than 0; use it only when the targets are genuinely non-negative. If the output of your classification isn't a probability distribution (e.g. predicting which classes exist, as in multi-label problems), you could just as easily use a sigmoid.
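To make the ReLU caveat concrete, a tiny tf.keras sketch with untrained weights and dummy data:
import tensorflow as tf

linear_head = tf.keras.layers.Dense(1, activation="linear")  # y can be any real number
relu_head = tf.keras.layers.Dense(1, activation="relu")      # y is clipped at zero, never negative

x = tf.random.normal((4, 8))  # dummy feature batch
print(linear_head(x))  # may contain negative values
print(relu_head(x))    # every entry is >= 0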
