When to use MinMaxScaler to re-scale input data (LSTM, Keras)

The smallest value in my training dataset is 0.1 and the highest is about 500. My dataset has about 1500 rows and 9 columns.
I'm not sure about this, but is it mandatory to rescale the input data into [0,1] (with MinMaxScaler, for example), or does it just speed up training?
Second question: is this scaling tied to the model used (LSTM, Dense, etc.), or does it apply to any model? For example, my model is:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10, input_shape=(12, 12), return_sequences=True, activation='tanh'))
model.add(LSTM(10, return_sequences=False, activation='tanh'))
model.add(Dense(5))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

Scaling your data for ML is done for all types of applications. It's meant to help the model converge faster. You can check out this link for a detailed explanation as to the benefits of feature scaling.
There are different ways you can scale the data, such as min-max or standard scaling; both of which are applicable for your model. If you know you have a fixed min and max in your dataset (e.g. images), you can use min-max scaling to fix your input and/or output data to be between 0 and 1.
For other applications where you do not have fixed bounds, standard scaling is useful. This gives all of your features zero-mean and unit variance. Therefore, the distributions of inputs and/or outputs are the same, and the model can treat them as such. If there is no scaling performed, the model will essentially be forced to think certain features are more important than others, rather than being able to learn those things.
The scaling of your outputs matters when choosing the activation function for the output layer. If your outputs are min-max scaled, you can use sigmoid, because it bounds the outputs to between 0 and 1. If your outputs are standard scaled, use a linear activation, because standard-scaled outputs are not bounded. In short, knowing how your outputs are scaled determines which output activation to use.
Note: even if you had min-max scaling for your outputs, that does not restrict the activations you can use for your hidden layers.
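As a rough sketch of both options (the array names, shapes, and values below are placeholders, not the asker's actual data), fitting the scaler on the training data only and reusing it on the test data could look like this:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder data roughly matching the ranges described in the question.
X_train = np.random.uniform(0.1, 500.0, size=(1500, 9))
X_test = np.random.uniform(0.1, 500.0, size=(300, 9))

# Min-max scaling: fit on the training set only, then reuse the fitted scaler.
x_scaler = MinMaxScaler()
X_train_mm = x_scaler.fit_transform(X_train)
X_test_mm = x_scaler.transform(X_test)      # uses the training set's min/max

# Standard scaling (zero mean, unit variance) when there are no fixed bounds.
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

# If the outputs are min-max scaled, a sigmoid output activation is consistent;
# for standard-scaled outputs, keep the output activation linear.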

Related

Increasing the smoothness of the accuracy curve in image classification

I have developed a convolutional neural network on the TILDA image dataset that gives over 90% accuracy with the following model. I used 4 batches and 100 epochs to train the model.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(8, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(16, 2, padding='same', activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 2, padding='same', activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax'),
])
Using the above model I could plot the following graphs for the training and validation accuracy.
Do you have any suggestions to increase the smoothness of these curves? What can be the possible reasons for getting such curves? I appreciate your recommendations to improve this model.
The following may help in getting a smoother curve (a sketch applying these suggestions follows the list):
NEVER use dropout before the final layer. MaxPool + Dropout in your model discards 87.5% of the data flowing into the final layer. Avoid pooling as well, unless you need global or adaptive pooling to get a fixed shape output. If you must pool, you need a much larger number of kernels to compensate for the loss in information.
Use a lower learning rate. From what the training curve shows, the model is directed towards a minimum, but with several bumps.
Are you using SGD without momentum? If yes, introduce momentum. Also consider adaptive optimizers with built-in momentum, like Adam.
Why the sigmoid in between? Sigmoid reduces the gradient magnitude and makes learning slower.
If you only care about the curve and are not restricted by number of parameters, consider adding a few more layers and/or channels.
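A sketch of what these suggestions could look like when applied to the model from the question (the loss, learning rate, and label format are assumptions, not part of the original post):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(8, 2, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(16, 2, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 2, padding='same', activation='relu',  # ReLU instead of the mid-network sigmoid
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Flatten(),               # no MaxPooling or Dropout right before the classification head
    layers.Dense(5, activation='softmax'),
])

# Lower learning rate, with Adam providing built-in momentum-style updates.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='sparse_categorical_crossentropy',   # assumes integer class labels
              metrics=['accuracy'])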

When to use bias in Keras model?

I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when to use bias and when to turn it off?
The short answer is: always use bias when your model is small, and it is still recommended to keep bias terms in all neural network architectures.
Each neuron behaves like a small logistic regression: the input values are multiplied by the weights, and the bias shifts the input to the activation (e.g. sigmoid) function, which produces the desired non-linearity.
For example, if your training data contains an all-zero input such as X = [[0,0,...], [0,0,...], ...] with Y = 1, a sigmoid unit without a bias will always output exactly 0.5, because X*W is zero. However, in large networks, each node can effectively build a bias out of the average activation of all of its inputs.
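A small toy demonstration of the zero-input case (the layer and input sizes here are arbitrary):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x_zero = np.zeros((1, 4))   # a single all-zero sample

# Without a bias term, a sigmoid unit always outputs 0.5 for zero input,
# whatever the weights are, because X*W is zero.
no_bias = keras.Sequential([layers.Dense(1, use_bias=False,
                                         activation='sigmoid', input_shape=(4,))])
print(no_bias.predict(x_zero))    # always [[0.5]]

# With a bias, the output is sigmoid(bias), so training can move it away from 0.5.
with_bias = keras.Sequential([layers.Dense(1, use_bias=True,
                                           activation='sigmoid', input_shape=(4,))])
print(with_bias.predict(x_zero))  # 0.5 only until the bias is trained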

Normalization of input data in Keras

One common task in DL is that you normalize input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
import numpy as np

# Per-feature mean and standard deviation computed over the training set.
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = [(x - mean) / std for x in X]
However, then one must keep the mean and std values around, to normalize the testing data, in addition to the Keras model being trained. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))  # hypothetical normalizing layer
m.add(Conv2D(20, (5, 5), input_shape=(21, 100, 3), padding='valid'))
... rest of network
m.add(Dense(1, activation='sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
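A minimal sketch of that configuration (the input shape and the layers after the normalization are arbitrary placeholders):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input((10,)),                                   # arbitrary input shape
    layers.BatchNormalization(scale=False, center=False),  # normalize only, no learned scale/offset
    layers.Dense(20, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')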
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
import keras
from keras.layers.experimental.preprocessing import Normalization

norm_layer = Normalization()
norm_layer.adapt(X)            # computes the mean and standard deviation of X

model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object allows you to save the scaling parameters,
and you can then feed mixed inputs into your model, say:
Your_model
[param1_scaler, param2_scaler]
Here are a couple of links:
https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
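A minimal sketch of that idea, using joblib to persist the fitted scaler (the array names and shapes are placeholders):

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(1000, 8)           # placeholder training data

scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, 'scaler.joblib')        # save it alongside the Keras model

# Later, at prediction time, reload and apply the same scaling.
scaler = joblib.load('scaler.joblib')
X_new = np.random.rand(5, 8)
X_new_scaled = scaler.transform(X_new)      # same mean/std as the training set
# model.predict(X_new_scaled)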
There's BatchNormalization, which learns mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.

Accuracy on middle layer of autoencoder implemented using Keras

I have implemented an autoencoder using Keras. I understand that I can add accuracy performance metric as follows:
autoencoder.compile(optimizer='adam',
                    loss='mean_squared_error',
                    metrics=['accuracy'])
My question is:
Is the accuracy metric applied to the last layer of the decoder by default? If so, how can I set it up so that it uses the representations from the middle (hidden) layer to compute the accuracy? Do I need to define a custom metric? How would that work?
It seems that what you really want is a multiple output network.
So on top of your middle layer that defines your embedding, add a layer (or more) to do your classification.
Then have a look at Multiple outputs in Keras to create your global cost.
You may also want to start by training the autoencoder only, then train only the additional classifier layers to see their performance. You can also balance the autoencoder's loss against the classifier's loss and train "both" networks at the same time.
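A minimal sketch of that idea with the functional API (the shapes, layer sizes, losses, and loss weights are placeholders to adapt to your data):

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation='relu', name='embedding')(inputs)         # middle layer
decoded = layers.Dense(784, activation='sigmoid', name='reconstruction')(encoded)
class_out = layers.Dense(10, activation='softmax', name='classifier')(encoded)  # classification head

model = keras.Model(inputs, [decoded, class_out])
model.compile(optimizer='adam',
              loss={'reconstruction': 'mean_squared_error',
                    'classifier': 'sparse_categorical_crossentropy'},
              loss_weights={'reconstruction': 1.0, 'classifier': 0.5},  # balance the two tasks
              metrics={'classifier': ['accuracy']})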

Overfitting after one epoch

I am training a model using Keras.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep, 103), use_bias=True, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=536))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

while True:
    history = model.fit_generator(
        generator=data_generator(x_[train_indices], y_[train_indices],
                                 batch=batch, timestep=timestep),
        steps_per_epoch=int(train_indices.shape[0] / batch),
        epochs=1,
        verbose=1,
        validation_steps=int(validation_indices.shape[0] / batch),
        validation_data=data_generator(x_[validation_indices], y_[validation_indices],
                                       batch=batch, timestep=timestep))
It is a multioutput classification problem according to the scikit-learn.org definition:
Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several properties for each data point, such as wind direction and magnitude at a certain location.
Since it is a recurrent neural network, I tried out different timestep sizes, but the result/problem is mostly the same.
After one epoch, my training loss is around 0.0X and my validation loss is around 0.6X. These values stay stable for the next 10 epochs.
The dataset has around 680000 rows. Training data is 9/10 and validation data is 1/10.
I am asking for the intuition behind this:
Is my model already overfitted after just one epoch?
Is 0.6xx even a good value for a validation loss?
High-level question:
Since it is a multioutput classification task (not multi-class), I see no option other than using sigmoid with binary_crossentropy. Do you suggest another approach?
I've experienced this issue and found that the learning rate and batch size have a huge impact on the learning process. In my case, I did two things (a sketch applying them follows below).
Reduce the learning rate (try 0.00005)
Reduce the batch size (8, 16, 32)
Moreover, you can try the basic steps for preventing overfitting.
Reduce the complexity of your model
Increase the training data and also balance the samples per class.
Add more regularization (Dropout, BatchNorm)
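A sketch of the first two changes applied to the model from the question (the exact learning rate and batch size are just starting points to tune):

from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Activation

timestep = 10                    # placeholder; use your own timestep
model = keras.Sequential([
    LSTM(units=300, input_shape=(timestep, 103), use_bias=True,
         dropout=0.2, recurrent_dropout=0.2),
    Dense(units=536),
    Activation('sigmoid'),
])
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=5e-5),  # reduced learning rate
              metrics=['accuracy'])

batch = 16                       # smaller batch size to pass to the data generators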
