clip_grad_norm_ performs gradient clipping, in order to mitigate the problem of exploding gradients.
BatchNorm2d applies Batch Normalization (for the same reason: to mitigate the problem of exploding gradients).
I know that BatchNorm2d has two learnable parameters per channel (a scale and a shift; the running mean and standard deviation are tracked statistics, not learned).
When should we choose clip_grad_norm_, and when should we prefer BatchNorm2d?
As far as I know, BatchNorm2d is used to mitigate differences in the scale and shift of the inputs to each layer in the network. However, BatchNorm2d does not act on the gradients directly, and I have even heard that BatchNorm2d can itself become a cause of gradient explosion, so I am not sure whether it can mitigate gradient explosion at all.
So it seems that clip_grad_norm_ and BatchNorm2d are two very different things, and there is no reason to choose one over the other for the same problem: if there is a potential gradient-explosion problem, try gradient clipping; if training is unstable because the input distributions of the layers vary a lot, try BatchNorm2d.
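To make the distinction concrete, here is a minimal PyTorch sketch (a toy model, not anyone's actual code) showing where each tool acts in a training step: BatchNorm2d normalizes activations inside the forward pass, while clip_grad_norm_ rescales gradients after backward() and before optimizer.step().

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # learnable per-channel scale (gamma) and shift (beta)
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 32, 32)    # dummy batch
y = torch.randint(0, 10, (8,))   # dummy labels

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Rescale all gradients so their combined norm does not exceed max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```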
It is my understanding that artificial neural networks work best on normalized data, i.e. typically the inputs and outputs should ideally have a mean of 0 and a variance of 1 (and, if possible, a roughly Gaussian, or at least well-behaved, distribution).
Therefore, I have seen / written quite a few Keras scripts where I first do some feature-wise normalization of the predictors and labels. This is a pain, as it means keeping track of a number of mean and std values, applying them correctly later at inference, etc.
I found out recently that there is now out-of-the-box functionality for normalizing the predictors in Keras in an "adaptable, not trainable" way, which is very convenient, as all the normalization information gets stored and used out-of-the-box in the network object: see https://keras.io/guides/preprocessing_layers/ and https://keras.io/api/layers/preprocessing_layers/numerical/normalization/#normalization-class . This makes use and bookkeeping much simpler.
My question is: would it make sense / is there a simple way to similarly do an in-Keras "output de-normalization", i.e., assuming that the outputs from the network have mean 0 and variance 1, add an adaptable (adaptable, not trainable; similar to the preprocessing normalization layer) layer that de-normalizes these outputs into the correct mean and variance for each label?
I guess this is quite similar to the preprocessing normalization layer, except that what we would like is the "inverse transformation" of what would be obtained by applying the preprocessing normalization layer to the labels. I.e., when adapting the layer to the labels, one gets a layer that "de-normalizes" a zero-mean, unit-std distribution into a distribution with feature-wise mean and std corresponding to the labels.
I do not see a way to get this "inverse layer" or "de-normalization layer". Am I missing something / is there a simple way to do it?
The Normalization layer has an invert parameter:
If True, this layer will apply the inverse transformation to its inputs: it would turn a normalized input back into its original form.
So, in theory you could use:
layer = tf.keras.layers.Normalization(invert=True)
to de-normalize. Currently this is implemented incorrectly and will not work (but it seems the bug is already fixed for the next Keras release).
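Assuming a Keras version where the fix has landed, a minimal sketch (with made-up labels) would look like this:

```python
import numpy as np
import tensorflow as tf

# Made-up labels with non-zero mean and non-unit variance
labels = np.random.normal(loc=50.0, scale=10.0, size=(1000, 1)).astype("float32")

# Forward normalization: adapted on the labels, maps them to ~zero mean / unit variance
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(labels)

# Inverse normalization: adapted on the same labels, maps normalized values back
denorm = tf.keras.layers.Normalization(axis=-1, invert=True)
denorm.adapt(labels)

normalized = norm(labels)
restored = denorm(normalized)  # should be close to the original labels
```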
I am trying a multi-task regression model. However, the ground-truth labels of the different tasks are on different scales, so I wonder whether it is necessary to normalize the targets; otherwise the MSE of the large-scale tasks will be much larger than that of the others. The figure below shows part of my targets: you can see that columns like ASA_m2_c have much higher values than some of the others.
First, I have already tried some weighted-loss techniques to balance how much attention the model pays to each task during gradient backpropagation. The results show it didn't perform well.
Secondly, I have seen numerous discussions about normalizing the input data, but hardly any that specifically address normalizing the labels. This is partly because most people's problems are single-task classification. I do know that PyTorch provides a convenient way to normalize vision datasets via transforms.Normalize, but that again operates on the inputs rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture: the input is first fed into a feature extractor, and then several generators use the shared output representation from that extractor to predict the different targets.
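One common approach here (a sketch with made-up scales, not the poster's data) is to standardize each task's targets separately, keep the per-task statistics, and undo the scaling at inference:

```python
import torch

# Made-up target matrix: (n_samples, n_tasks), tasks on very different scales
targets = torch.randn(1000, 5) * torch.tensor([1.0, 10.0, 100.0, 0.1, 1000.0])

# Per-task standardization; keep the statistics so predictions can be un-scaled later
target_mean = targets.mean(dim=0, keepdim=True)
target_std = targets.std(dim=0, keepdim=True)
normalized_targets = (targets - target_mean) / target_std

# Train with MSE against normalized_targets; the per-task losses are now comparable.

def denormalize(predictions):
    # Map normalized predictions back to each task's original scale at inference time
    return predictions * target_std + target_mean
```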
I've been working on a multi-task learning problem where one head has outputs around ~500 and another between 0 and 1.
I've tried uncertainty weighting, but in vain. So I'd be grateful if you could give me a little clue about your studies (if there is any progress).
Thanks.
I have a time-series forecasting case with ten features (inputs) and only one output. I'm using 22 timesteps (the history of the features) for one-step-ahead prediction with an LSTM. I apply MinMaxScaler to normalize the inputs, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), while the other values are between 0 and ~5 (all values are positive). In this case it's important to forecast both the normal values and the outliers correctly, so I don't want to miss the jumps in my forecasting model. I think if I use MinMaxScaler on the output, most of the values will end up near zero while the outliers will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (Currently I'm using an LSTM with ReLU and a Dense layer with ReLU as the last layer, so that the output will be a positive value.) I think I should choose the activation functions carefully for this case.
I think, first of all, you should decide on a metric to measure performance, for example MAE or MSE, or some other metric you choose based on the task at hand. For example, you may tolerate a greater error for the "rare jumps" but not for the normal cases, or vice versa. Once you have decided on the error metric, ideally you should set it as the cost function that the LSTM network minimizes.
Now the goal is to minimize the desired error metric you set. If this were a convex problem, the scaling of the output would not matter. But we know that this is not the case with complex deep-learning architectures. What this means is that while minimizing the cost function with gradient descent, the optimizer might get stuck in a local minimum with very delayed convergence. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With the last layer's parameters initialized around zero and a bias of zero (i.e., in the linear part of the ReLU), the network needs to learn that the bias should be around 5. Depending on the complexity of the network this could take some epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
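As a small illustration of the bias-initialization alternative (a Keras sketch; the value 5.0 is just the assumed target mean from the example above):

```python
import tensorflow as tf

# Start the output layer's bias near the assumed target mean (5.0) so the
# network does not spend early epochs just learning that offset.
output_layer = tf.keras.layers.Dense(
    1,
    activation="linear",
    bias_initializer=tf.keras.initializers.Constant(5.0),
)
```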
Now back to your questions:
I would at least make the output zero-mean and use a Dense layer with a linear output (see the sketch at the end of this answer).
The architecture you have seems fine; you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the code and the performance you get, and we can discuss what else can be improved.
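As a minimal sketch of the first suggestion (toy data with the question's shapes, not the poster's actual pipeline):

```python
import numpy as np
import tensorflow as tf

# Toy data matching the question's shapes: 22 timesteps, 10 features, 1 target
X = np.random.rand(500, 22, 10).astype("float32")
y = (np.random.rand(500, 1) * 5.0).astype("float32")

# Standardize the target instead of MinMax-scaling it
y_mean, y_std = y.mean(), y.std()
y_norm = (y - y_mean) / y_std

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(22, 10)),
    tf.keras.layers.Dense(1, activation="linear"),  # linear output, as suggested above
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y_norm, epochs=10, verbose=0)

# Undo the scaling at prediction time
predictions = model.predict(X) * y_std + y_mean
```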
While digging through the topic of neural networks and how to train them efficiently, I came across the method of using very simple activation functions, such as the rectified linear unit (ReLU), instead of the classic smooth sigmoids. The ReLU function is not differentiable at the origin, so according to my understanding the backpropagation algorithm (BPA) is not suitable for training a neural network with ReLUs, since the chain rule of multivariable calculus refers to smooth functions only.
However, none of the papers about using ReLUs that I read address this issue. ReLUs seem to be very effective and seem to be used virtually everywhere while not causing any unexpected behavior. Can somebody explain to me why ReLUs can be trained at all via the backpropagation algorithm?
To understand how backpropagation is even possible with functions like ReLU, you need to understand the most important property of the derivative that makes the backpropagation algorithm work so well. That property is:
f(x) ≈ f(x0) + f'(x0)(x - x0)
If you treat x0 as the current value of your parameter, then (knowing the value of the cost function and its derivative) you can tell how the cost function will behave when you change your parameter a little bit. This is the most crucial thing in backpropagation.
Because backpropagation relies on this local linear approximation, the cost function needs to satisfy the property stated above. It's easy to check that ReLU satisfies it everywhere except in a small neighbourhood of 0, and this is the only problem with ReLU: we cannot use this property when we are close to 0.
To overcome that, you can define the value of the ReLU derivative at 0 to be either 1 or 0. On the other hand, most researchers don't treat this as a serious problem, simply because landing close to 0 during ReLU computations is relatively rare.
Given the above, from a purely mathematical point of view it is of course not fully rigorous to use ReLU with the backpropagation algorithm. In practice, however, this weird behaviour around 0 usually doesn't make any difference.
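A tiny NumPy sketch of the subgradient convention described above (a single ReLU unit with made-up numbers; the choice of 0 at the kink is just one option):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: define the "derivative" at 0 to be 0
    # (defining it as 1 there would also work in practice)
    return np.where(x > 0, 1.0, 0.0)

# Tiny backprop step through one ReLU unit: y = relu(w * x), loss = 0.5 * (y - t)^2
x, w, t = 2.0, -0.5, 1.0
y = relu(w * x)
loss = 0.5 * (y - t) ** 2

# Chain rule using the chosen subgradient; exact wherever ReLU is differentiable
dloss_dy = y - t
dy_dw = relu_grad(w * x) * x
grad_w = dloss_dy * dy_dw
print(grad_w)  # 0 here, because w * x = -1.0 lies in ReLU's flat region
```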
Main Problem
I cannot understand the Plot of the weights of a specific layer.
I used a method from nolearn: plot_conv_weights(layer, figsize=(6, 6))
I'm using Lasagne as my neural-network library.
The plot comes out fine, but I don't know how I should interpret it.
Neural Network Structure
The structure I'm using:
InputLayer 1x31x31
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
MaxPool2DLayer 2x2
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
MaxPool2DLayer 40x2x2
DropoutLayer
DenseLayer 96
DropoutLayer 96
DenseLayer 32
DropoutLayer 32
DenseLayer 1 as sigmoid
Here are the weights of the first 3 layers:
**About the Images**
So to me they look random, and I cannot interpret them!
However, CS231n says the following:
Conv/FC Filters. The second common strategy is to visualize the weights. These are usually most interpretable on the first CONV layer which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network. The weights are useful to visualize because well-trained networks usually display nice and smooth filters without any noisy patterns. Noisy patterns can be an indicator of a network that hasn't been trained for long enough, or possibly a very low regularization strength that may have led to overfitting.
http://cs231n.github.io/understanding-cnn/
Then why are mine random?
The structure is trained and performs well for its task.
References
http://cs231n.github.io/understanding-cnn/
https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/visualize.py
Normally when you visualize the weights you want to check two things:
That they are smooth and cover a wide range of values, i.e. it's not just a bunch of 1's and 0's. That would mean the non-linearity is being saturated.
That they have some kind of structure. Normally you tend to see oriented edges although this is more difficult to see when you have small filters like 3x3.
That being said, your weights do not appear to be saturated, but they do indeed seem too random.
During training, did the network converge correctly?
I am also surprised at how big your filters are (30x30). Not sure what you are trying to accomplish with that.
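For reference, here is a minimal matplotlib sketch of the kind of per-filter plot that nolearn's plot_conv_weights produces; the weights array and its shape are assumptions based on the structure in the question (in Lasagne, the first conv layer's weights can be read with layer.W.get_value()):

```python
import matplotlib.pyplot as plt
import numpy as np

# `weights` is assumed to have shape (num_filters, num_input_channels, h, w),
# e.g. (20, 1, 3, 3) for the first Conv2DLayer in the question.
def plot_first_layer_filters(weights, cols=5):
    num_filters = weights.shape[0]
    rows = int(np.ceil(num_filters / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(6, 6))
    for i, ax in enumerate(np.atleast_1d(axes).ravel()):
        ax.axis("off")
        if i < num_filters:
            f = weights[i, 0]
            # Normalize each filter to its own min/max so its structure is visible
            ax.imshow(f, cmap="gray", vmin=f.min(), vmax=f.max())
    plt.show()
```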