A common task in deep learning is normalizing input samples to zero mean and unit variance. One can perform the normalization "manually" with code like this:
import numpy as np

mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = [(x - mean) / std for x in X]
However, one must then keep the mean and std values around to normalize the test data, in addition to the trained Keras model. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))
m.add(Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid'))
# ... rest of network
m.add(Dense(1, activation = 'sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
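For concreteness, a minimal sketch of that setup (the shapes and the rest of the architecture are illustrative placeholders, not taken from the question):

from tensorflow import keras
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Flatten

# BatchNormalization as the first layer, with scale=False and center=False so
# it only standardizes the input instead of also learning a scale and offset.
m = keras.Sequential([
    BatchNormalization(scale=False, center=False, input_shape=(21, 100, 3)),
    Conv2D(20, (5, 5), padding='valid', activation='relu'),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
m.compile(loss='binary_crossentropy', optimizer='adam')

Keep in mind that, as noted above, this layer still uses per-batch statistics during training and the moving averages only at inference.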
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
import keras
from keras.layers.experimental.preprocessing import Normalization

norm_layer = Normalization()
norm_layer.adapt(X)  # X is the data to derive the mean and std from
model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object lets you save the scaling parameters and reuse them later (see the sketch after the links below).
Then you can feed mixed inputs into your model, say:
Your_model
[param1_scaler, param2_scaler]
Here are some links: https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
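A minimal sketch of the StandardScaler part, assuming the scaler is persisted with joblib so the same parameters can be reloaded at inference time (all variable and file names are placeholders):

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Dummy data standing in for the real features.
X_train = np.random.rand(100, 9) * 500
X_new = np.random.rand(10, 9) * 500

# Fit the scaler on the training data only, then reuse the same parameters.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Persist the scaling parameters alongside the trained Keras model.
joblib.dump(scaler, 'scaler.joblib')

# Later, at inference time, reload and apply the identical transformation.
scaler = joblib.load('scaler.joblib')
X_new_scaled = scaler.transform(X_new)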
There's BatchNormalization, which learns the mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.
Related
It is my understanding that Artificial Neural Networks work best on normalized data, i.e., typically inputs and outputs should ideally have a mean of 0 and a variance of 1 (and even, if possible, a "near-Gaussian", or at least "well-behaved", distribution).
Therefore, I have seen / written quite a few Keras scripts where I first do some feature-wise normalization of the predictors and labels. This is a pain, as it means keeping track of a number of mean and std values and applying them correctly later at inference, etc.
I found out recently that there is now out-of-the-box functionality for doing the predictors normalization in Keras in an "adaptable, not trainable" way, which is very convenient, as all the normalization information gets stored and used out-of-the-box in the network object: see: https://keras.io/guides/preprocessing_layers/ , https://keras.io/api/layers/preprocessing_layers/numerical/normalization/#normalization-class . This makes use / bookkeeping much simpler.
My question is: would it make sense / is there a simple way to similarly do an "outputs de-normalization" in Keras, i.e., assuming that the outputs from the network have mean 0 and variance 1, add an adaptable (adaptable, not trainable; similar to the preprocessing normalization layer) layer that de-normalizes these outputs into the correct mean and variance for each label?
I guess this is quite similar to the preprocessing normalization layer, except that what we would like is the "inverse transformation" of what would be obtained by applying the preprocessing normalization layer on the labels. I.e., when adapting the layer to labels, one gets a layer that "de-normalizes" a 0-mean 1-std distribution into a distribution with feature-wise mean and std corresponding to the labels.
I do not see a way to get this "inverse layer" or "de-normalization layer"; am I missing something / is there a simple way to do it?
The normalization layer has an invert parameter:
If True, this layer will apply the inverse transformation to its
inputs: it would turn a normalized input back into its original form.
So, in theory you could use:
layer = tf.keras.layers.Normalization(invert=True)
to de-normalize. At the time of writing, however, this is incorrectly implemented and will not work (though the bug appears to already be fixed for the next Keras release).
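Assuming the fixed behaviour, a rough sketch of the idea could look like this (Y is a placeholder for the label array):

import numpy as np
import tensorflow as tf

# Placeholder labels; in practice this would be the training label array.
Y = np.random.rand(1000, 5).astype('float32') * 100.0

# Adapt a forward normalization layer and an inverse ("de-normalization")
# layer on the same label statistics.
norm = tf.keras.layers.Normalization()
norm.adapt(Y)

denorm = tf.keras.layers.Normalization(invert=True)
denorm.adapt(Y)

# denorm(norm(Y)) should approximately recover the original labels,
# assuming invert=True works as documented.
recovered = denorm(norm(Y))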
I am new to PyTorch and I would like to add a mean-variance normalization layer to my network that will normalize features to zero mean and unit standard deviation. I got a bit confused reading the documentation, could anyone give me some leads?
As @Ivan commented, the normalization can be done on many levels. However, since you say
normalize features to zero mean and unit standard deviation
I suppose you just want to feed unbiased data to the network. If that's the case, you should treat it as a data preprocessing step rather than a layer of your model, and basically do:
X = (X - torch.mean(X, dim=0))/torch.std(X, dim=0)
As an alternative, you can use torchvision.transforms:
preprocess = torchvision.transforms.Normalize(mean=torch.mean(X, dim=0), std=torch.std(X, dim=0))
X = preprocess(X)
as in this native ResNet example. Note that it is reasonable to assume that future data will have roughly the same mean and std as the set used to compute them initially (presumably the training set). For this reason, we should preserve the initially calculated values and use them for preprocessing in any future inference scenario.
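A minimal sketch of that pattern (the tensors and file name are placeholders):

import torch

# Placeholder data standing in for the real features.
X_train = torch.rand(1000, 20) * 500
X_test = torch.rand(100, 20) * 500

# Compute the statistics on the training set only...
mean = torch.mean(X_train, dim=0)
std = torch.std(X_train, dim=0)

# ...and reuse the same values for every future batch (validation, test,
# production inference), so all data is scaled consistently.
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std

# The statistics can be saved together with the model weights.
torch.save({'mean': mean, 'std': std}, 'norm_stats.pt')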
The smallest value in my training dataset is 0.1 and the highest is about 500. The dataset has about 1500 rows and 9 columns.
I'm not sure about this, but is it mandatory to rescale the input data into [0, 1] (with MinMaxScaler, for example), or is it just to speed up training?
Second question: does this scaling depend on the model used (LSTM, Dense, etc.), or does it apply regardless? For example, my model is:
model = Sequential()
model.add(LSTM(10, input_shape=(12,12),return_sequences=True, activation='tanh'))
model.add(LSTM(10,return_sequences=False,activation='tanh'))
model.add(Dense(5))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
Scaling your data for ML is done for all types of applications. It's meant to help the model converge faster. You can check out this link for a detailed explanation as to the benefits of feature scaling.
There are different ways you can scale the data, such as min-max or standard scaling; both of which are applicable for your model. If you know you have a fixed min and max in your dataset (e.g. images), you can use min-max scaling to fix your input and/or output data to be between 0 and 1.
For other applications where you do not have fixed bounds, standard scaling is useful. This gives all of your features zero mean and unit variance, so the inputs and/or outputs are on a comparable scale and the model can treat them as such. If no scaling is performed, the model is essentially forced to treat features with larger ranges as more important, rather than learning their relative importance from the data.
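As a small illustration of the two options with scikit-learn (the random array is just a stand-in for the dataset described in the question):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder data roughly matching the described range (0.1 to ~500).
X = np.random.uniform(0.1, 500.0, size=(1500, 9))

# Min-max scaling: every feature ends up in [0, 1].
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)

# Standard scaling: every feature ends up with zero mean and unit variance.
standard = StandardScaler()
X_standard = standard.fit_transform(X)

print(X_minmax.min(), X_minmax.max())       # ~0.0, ~1.0
print(X_standard.mean(), X_standard.std())  # ~0.0, ~1.0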
The scaling for your outputs is important in defining the activation function for the output layer. If you have min-max scaled outputs, you can use sigmoid, because it bounds the outputs to between 0 and 1. If you are using standard scaling for the outputs, you would want to be sure you use a linear activation function, because technically standard-scaled outputs are not bounded. The choice of output activation is important, and knowledge of how your outputs are scaled is important in determining which activation to use.
Note: even if you had min-max scaling for your outputs, that does not restrict the activations you can use for your hidden layers.
I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when to use bias and when to turn it off?
The short answer is: use bias. It matters most when your model is small, but it is still recommended to keep using bias in all neural network architectures.
Each neuron behaves like a simple logistic regression: the input values are multiplied by the weights, and the bias shifts the operating point of the sigmoid, which produces the desired non-linearity.
For example, if you have an all-zero input in your training data, like X = [[0,0,...], [0,0,...], ...] with Y = 1, a sigmoid unit without bias will always output exactly 0.5, since X*W is zero regardless of the weights. However, in large networks, each node can effectively form a bias out of the average activation of all of its inputs.
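A tiny sketch of that effect (purely illustrative): with an all-zero input and no bias, a sigmoid unit can only ever output 0.5, while a bias term lets it move toward the target.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.zeros(4)          # an all-zero input sample
w = np.random.randn(4)   # weights (their values do not matter here)

# Without a bias, the pre-activation is always 0, so the output is stuck at 0.5.
print(sigmoid(np.dot(x, w)))        # 0.5, regardless of w

# With a bias, the unit can still produce an output close to the target Y = 1.
b = 3.0
print(sigmoid(np.dot(x, w) + b))    # ~0.95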
Say that we have some input data, ground-truth labels, and a neural network. We use the data and labels to train the model and get some results.
For some reason, we found that instead of using the original data as input, calculating the local standard deviation of the data and using that as input gives a better result. Here is an example of calculating the local standard deviation, taken from here:
h = 1; % for half window size of 3
x = [4 8 1 1 1 7 9 3]; % input signal
N = length(x); % length of the signal
o = ones(1, N); % array for output
for i = 1 : N
% calculate standard deviation using the built-in std command
% for the current window
o(i) = std(x(max(1, i - h) : min(N, i + h)));
end
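For reference, a rough NumPy equivalent of the MATLAB snippet above might look like this (this translation is mine, not from the original post):

import numpy as np

h = 1                                   # half window size (window of 3)
x = np.array([4, 8, 1, 1, 1, 7, 9, 3], dtype=float)  # input signal
N = len(x)                              # length of the signal
o = np.zeros(N)                         # array for the output

for i in range(N):
    # Standard deviation over the current window, clipped at the edges.
    window = x[max(0, i - h):min(N, i + h + 1)]
    o[i] = np.std(window, ddof=1)       # ddof=1 matches MATLAB's std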
So, here is my question: instead of calculating the local standard deviation by ourselves, is it possible to use a convolution layer and let the model learn to perform such an operation by itself?
If we cannot do that by using a single convolution layer, can we do that by using a more complicated model?
If we can do that, then I have another question: why can't a model learn to perform batch normalization by itself? Why do people still need to add a batch normalization layer manually?
I have done some research on Google and here is what I found, though I'm still a little confused:
https://matlabtricks.com/post-20/calculate-standard-deviation-case-of-sliding-window
Thank you in advance!
Batch normalization translates and rescales the tensor after standard normalization. The link in your question just gives a way to efficiently perform standard normalization.
From a statistics point of view, standard normalization decreases the degrees of freedom (dof) in the input tensor, while the two learnable parameters in batch normalization add those two dofs back. It is unclear how you could build a convolution layer with exactly two parameters (dof) that does something similar to batch normalization, since a convolution layer can have many more parameters as the window size changes.
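To make those two degrees of freedom concrete, here is a rough sketch (not a full implementation; the training-time tracking of moving statistics is omitted) of what batch normalization computes, with gamma and beta as the two learnable parameters:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Standard normalization: remove the batch mean and variance...
    x_hat = (x - np.mean(x, axis=0)) / np.sqrt(np.var(x, axis=0) + eps)
    # ...then give the two degrees of freedom back with the learnable
    # scale (gamma) and shift (beta).
    return gamma * x_hat + beta

# Illustrative batch of 8 samples with 3 features.
x = np.random.randn(8, 3) * 10 + 5
gamma = np.ones(3)   # learned scale (initialized to 1)
beta = np.zeros(3)   # learned shift (initialized to 0)
print(batch_norm(x, gamma, beta).mean(axis=0))  # ~0 with these initial values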