If the purpose of Batch Norm is to normalize inputs to the next layers, what is the purpose of introducing learnable/trainable parameters (Gamma and Beta)?
I may have found the answer here - https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Related
Please illustrate batch normalisation and layer normalisation with a clear notation involving tensors. Also comment on when each one is required/recommended.
I think what you're looking for is in Group Normalization, by Yuxin Wu, Kaiming He IJCV'20.
Especially Fig. 2:
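As a rough sketch of the notation asked for, assume an activation tensor x of shape (N, C, H, W), with N the batch size, C the channels, and H, W the spatial size. The index names and the per-channel form of gamma and beta are my assumptions, chosen to follow the convention of the Group Normalization paper. The only difference between the two methods is the set of axes the statistics are computed over:

```latex
% Batch normalization: one mean and variance per channel c,
% computed over the batch and spatial axes (n, h, w).
\[
\mu_c = \frac{1}{NHW} \sum_{n,h,w} x_{n,c,h,w}, \qquad
\sigma_c^2 = \frac{1}{NHW} \sum_{n,h,w} \left( x_{n,c,h,w} - \mu_c \right)^2
\]
\[
y_{n,c,h,w} = \gamma_c \, \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c
\]

% Layer normalization: one mean and variance per sample n,
% computed over the channel and spatial axes (c, h, w).
\[
\mu_n = \frac{1}{CHW} \sum_{c,h,w} x_{n,c,h,w}, \qquad
\sigma_n^2 = \frac{1}{CHW} \sum_{c,h,w} \left( x_{n,c,h,w} - \mu_n \right)^2
\]
\[
y_{n,c,h,w} = \gamma_c \, \frac{x_{n,c,h,w} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}} + \beta_c
\]
```

Because batch normalization depends on batch statistics, it tends to work best with reasonably large batches. Layer normalization does not depend on the batch at all, which is why it is commonly preferred for small batches and for sequence models.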
Does batch normalization replace 'layers.experimental.preprocessing.Rescaling' in CNN models?
Or should we first normalize the data and then use BN in the CNN model?
No. Rescaling and batch normalization are different concepts.
You rescale your input data by a fixed scale and offset so that it falls in a certain range, often 0 to 1 or -1 to 1. It is therefore a preprocessing step, and it has no trainable values.
In normalization, you want to scale your feature maps so that they have zero mean and unit variance. Batch normalization normalizes the batch at a certain stage within the model, e.g. after a convolution. Its parameters are trainable, usually denoted gamma and beta. With these parameters the model can rescale and shift the normalized features, since zero mean and unit variance is probably not the optimal scaling for the training objective.
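To make the difference concrete, here is a minimal sketch in plain Python (standard library only, one feature across a small batch). The function names rescale and batch_norm_1d are hypothetical and do not come from any framework; the sketch only covers training-time behavior and ignores running statistics.

```python
import math
from statistics import fmean, pvariance

def rescale(values, scale=1.0 / 255.0, offset=0.0):
    """Fixed preprocessing: same scale and offset for every input, nothing is learned."""
    return [v * scale + offset for v in values]

def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the batch, then apply the scale and shift."""
    mean = fmean(batch)
    var = pvariance(batch, mu=mean)  # population variance of the current batch
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # gamma and beta would be updated by the optimizer in a real network
    return [gamma * x + beta for x in x_hat]

pixels = rescale([0.0, 127.5, 255.0])  # lands in the range 0 to 1

batch = [2.0, 4.0, 6.0, 8.0]
plain = batch_norm_1d(batch)                          # about zero mean, unit variance
shifted = batch_norm_1d(batch, gamma=3.0, beta=10.0)  # mean about 10, std about 3
```

With gamma=1 and beta=0 the layer is a pure normalization; other values let the network move away from zero mean and unit variance if that lowers the loss.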
I am new to PyTorch and I would like to add a mean-variance normalization layer to my network that will normalize features to zero mean and unit standard deviation. I got a bit confused reading the documentation, could anyone give me some leads?
As @Ivan commented, the normalization can be done at many levels. However, as you say
normalize features to zero mean and unit standard deviation
I suppose you just want to feed standardized data to the network. If that's the case, you should treat it as a data preprocessing step rather than a layer of your model, and basically do:
X = (X - torch.mean(X, dim=0))/torch.std(X, dim=0)
As an alternative, You can use torchvision.transforms:
preprocess = torchvision.transforms.Normalize(mean=torch.mean(X, dim=0), std=torch.std(X, dim=0))
X = preprocess(X)
as in this ResNet native example. Note how it is reasonably assumed that the future data would always have roughly the same mean and std_dev as the set that is used for their initial calculation (supposedly the training set). For this reason, we should preserve the initially calculated values and use them for preprocessing in any future inference scenario.
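The point about preserving the statistics can be sketched as follows. This is a plain-Python illustration with hypothetical helper names (fit_stats, apply_stats) on a single feature column, not PyTorch code. Note that it uses the population standard deviation, whereas torch.std defaults to the unbiased estimate.

```python
from statistics import fmean, pstdev

def fit_stats(column):
    """Compute the statistics once, on the training data only."""
    return fmean(column), pstdev(column)

def apply_stats(column, mean, std):
    """Normalize any data with the stored training statistics."""
    return [(x - mean) / std for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_stats(train)                 # keep these alongside the model
train_norm = apply_stats(train, mean, std)
test_norm = apply_stats(test, mean, std)     # reuse, do not recompute on test data
```

The test set is deliberately not re-standardized with its own mean and standard deviation; it only makes sense to the network in the coordinate system it was trained in.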
Why are we updating targets in the implementation of bayesian cnn with mc dropout here?
https://github.com/sungyubkim/MCDO/blob/master/Bayesian_CNN_with_MCDO.ipynb?fbclid=IwAR18IMLcdUUp90TRoYodsJS7GW1smk-KGYovNpojn8LtRhDQckFI_gnpOYc
def update_target(target, original, update_rate):
    for target_param, param in zip(target.parameters(), original.parameters()):
        target_param.data.copy_((1.0 - update_rate) * target_param.data + update_rate*param.data)
The implementation you have referred to is a data parallel one.
Which means, the author intends to train multiple networks with the same architecture but different hyper-parameters.
Although in an unconventional way, this is what update_target does:
update_target(net_test, net, 0.001)
It updates net_test at a lower rate than net, but with the same parameter changes that are applied to the original net, which is the one actually being trained. Only the scale of the change differs.
I assume this was found to be useful for computational efficiency, since only one of the networks is actually being "trained" during the main training phase:
outputs = net(inputs)
loss = CE(outputs, labels)
loss.backward()
optimizer.step()
One less forward pass and one less backprop per step.
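Read in isolation, the arithmetic inside update_target is a linear interpolation: each target value moves a fraction update_rate of the way toward the corresponding trained value, which amounts to an exponential moving average of the trained parameters. A minimal sketch with plain lists standing in for parameter tensors (soft_update is a hypothetical name, and holding the trained values fixed is a simplification):

```python
def soft_update(target, original, update_rate):
    """Move each target value a fraction update_rate toward the original value."""
    return [(1.0 - update_rate) * t + update_rate * p
            for t, p in zip(target, original)]

target = [0.0, 0.0]      # stands in for the parameters of net_test
original = [1.0, -2.0]   # stands in for the parameters of net, held fixed here

for _ in range(1000):
    target = soft_update(target, original, 0.001)

# After k steps with a fixed original, target = original * (1 - (1 - rate) ** k)
expected_fraction = 1.0 - (1.0 - 0.001) ** 1000
```

After 1000 steps at a rate of 0.001 the target has covered roughly 63% of the distance, which shows how slowly net_test follows net.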
One common task in DL is that you normalize input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
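import numpy as np

mean = np.mean(X, axis=0)   # per-feature mean over all samples
std = np.std(X, axis=0)     # per-feature standard deviation
X = (X - mean) / std        # broadcasts over the sample axis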
However, then one must keep the mean and std values around, to normalize the testing data, in addition to the Keras model being trained. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))
m.add(Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid'))
# ... rest of network
m.add(Dense(1, activation = 'sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
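The difference between the two kinds of averaging can be sketched in plain Python. I am assuming the usual update rule, moving = momentum * moving + (1 - momentum) * batch_statistic, with the moving mean starting at zero; update_moving_mean is a hypothetical helper, and momentum=0.5 is chosen only to make the effect visible (the Keras default is 0.99).

```python
from statistics import fmean

def update_moving_mean(moving_mean, batch, momentum=0.99):
    """One training step of the running mean kept for use at inference time."""
    return momentum * moving_mean + (1.0 - momentum) * fmean(batch)

batches = [[0.0, 2.0], [4.0, 6.0], [8.0, 10.0]]  # batch means are 1, 5 and 9

moving_mean = 0.0
for batch in batches:
    moving_mean = update_moving_mean(moving_mean, batch, momentum=0.5)

# Simple mean over all samples, as in the OP's example
simple_mean = fmean(x for batch in batches for x in batch)
```

Here simple_mean is 5.0 while moving_mean ends at 5.875, because the most recent batch carries the largest weight.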
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
import keras
from keras.layers.experimental.preprocessing import Normalization

norm_layer = Normalization()
norm_layer.adapt(X)  # computes the mean and variance of X once
model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object lets you keep the fitted scaling parameters, so they can be saved and reused.
Then you can feed mixed-type inputs into your model, for example:
Your_model
[param1_scaler, param2_scaler]
Here is a link https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
There's BatchNormalization, which learns mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.