Meaning of Weight Gradient in CNN

Meaning of Weight Gradient in CNN - conv-neural-network

I developed a CNN using MatConvNet and am able to visualize the weights of the 1st layer. It looked very similar to what is shown here (also attached below incase I am not specific enough) http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
My question is, what are the weight gradients ? I'm not sure what those are and am unable to generate those...

Weights in a NN
In a neural network, a series of linear functions represented as matrices are applied to features (usually with a nonlinear joint between them). These functions are determined by the values in the marices, referred to as weights.
You can visualize the weights of a normal neural network, but it usually means something slightly different to visualize the convolutional layers of a cnn. These layers are designed to learn a feature computation over the space.
When you visualize the weights, you're looking for patterns. A nice smooth filter may mean that the weights are well learned and "looking for something in particular". A noisy weight visualization may mean that you've undertrained your network, overfit it, need more regularization, or something else nefarious (a decent source for these claims).
From this decent review of weight visualizations, we can see patterns start to emerge from treating the weights as images:
Weight Gradients
"Visualizing the gradient" means taking the gradient matrix and treating like an image [1], just like you took the weight matrix and treated it like an image before.
A gradient is just a derivative; for images, it's usually computed as a finite difference - grossly simplified, the X gradient subtracts pixels next to each other in a row, and the Y gradient subtracts pixels next to each other in a column.
For the common example of a filter that extracts edges, we may see a strong gradient in a particular direction. By visualizing the gradients (taking the matrix of finite differences and treating it like an image), you can get a more immediate idea of how your filter is operating on the input. There are a lot of cutting edge techniques (eg, eg) for interpreting these results, but making the image pop up is the easy part!
A similar technique involves visualizing the activations after a forward pass over the input. In this case, you're looking at how the input was changed by the weights; by visualizing the weights, you're looking at how you expect them to change the input.
Don't over-think it - the weights are interesting because they let us see how the function behaves, and the gradients of the weights are just another feature to help explain what's going on. There's nothing sacred about that feature: here are some cool clustering features (t-SNE) from the google paper that look at space separability.
[1] It can be more complicated if you introduce weight sharing, but not that much

My answer here covers this question https://stackoverflow.com/a/68988426/10661506
Long story short, weight gradient of layer l is the gradient of the loss with respect to the weights of layer l.
If you have a correct implementation of backpropagation, you should have access to these gradients as they are needed to compute the weights update at every layer.

Related

How to perform de-normalization of last layer into labels in Keras, analogous to the preprocessing normalization layer (but inversed)?

It is my understanding that Artificial Neural Networks work best on normalized data, ie typically inputs and outputs should have, ideally, a mean of 0 and a variance of 1 (and even, if possible, a "near gaussian", or at least, "well behaved", distribution).
Therefore, I have seen / written quite a few Keras-using scripts when I first do some feature-wise normalization of the predictors and labels. This is a pain, as this means the need to keep track of a number of mean and std values, applying them correctly later at inference, etc.
I found out recently that there is now out-of-the-box functionality for doing the predictors normalization in Keras in an "adaptable, not trainable" way, which is very convenient, as all the normalization information gets stored and used out-of-the-box in the network object: see: https://keras.io/guides/preprocessing_layers/ , https://keras.io/api/layers/preprocessing_layers/numerical/normalization/#normalization-class . This makes use / bookkeeping much simpler.
My question is: would it make sense / is there a simple way to similarly do in-Keras an "outputs de-normalization", i.e., assuming that the outputs from the network have mean 0 and variance 1, add an adaptable (adaptable not trainable; similar to the preprocessing normalization layer) layer that de-normalize these outputs into the correct mean and variance for each label?
I guess this is quite similar to the preprocessing normalization layer, except that what we would like is the "inverse transformation" of what would be obtained by applying the preprocessing normalization layer on the labels. I.e., when adapting the layer to labels, one gets a layer that "de-normalizes" a 0-mean 1-std distribution into a distribution with feature-wise mean and std corresponding to the labels.
I do not see some way to get this "inverse layer" or "de-normalization layer", am I missing something / is there a simple way to do it?

The normalization layer has an invert parameter:
If True, this layer will apply the inverse transformation to its
inputs: it would turn a normalized input back into its original form.
So, in theory you could use:
layer = tf.keras.layers.Normalization(invert=True)
to de-normalize. Currently, this is wrongly implemented and will not work (but seems like the bug is already fixed in the next keras version)

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?

Yes, the additional features you have added might not have good predictive power and as random forest takes random subset of features to build individual trees, the original 50 features might have got missed out. To test this hypothesis, you can plot variable importance using sklearn.

Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting, but for instance, this 2d plot represents the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrongs data patterns that don't generalize properly.

What is the kind of filter does keras uses for conv2D in CNN?

I just didn't understand what kind of filter does keras framework for convolution neural network uses in the following line of code, is it for horizontal edge detection or verticals or any edge or any other feature??
Here it's a 7*7 32 filters with stride of 1 which we convolve with X
x= Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0')(X)

Convolutional filters are not pre-disposed to any particular feature. Rather, they "learn" their duties through training. These features evolve organically through training, depending on what enhances the prediction accuracy on the far end of the model. The model will gradually learn which features work well for the given inputs, depending on the ground truth and back propagation.
The critical trick in this is a combination of back prop and initialization. When we randomly initialize the filters, the important part isn't so much what distribution we choose; rather, it's that there are some differences, so that the filters will differentiate well.
For instance, in typical visual processing applications, the model's first layer (taking the conv0 label as a hint) will learn simple features: lines, curves, colour blobs, etc. Whatever filter happens to be initialized most closely to a vertical line detector, will eventually evolve into that filter. In the early training, it will receive the highest reinforcement from back propagation's "need" for vertical lines. Those filters that are weaker at verticals will get less reinforcement, then see their weights reduced (since our "star pupil" will be sufficient to drive the vertical-line needs), and will eventually evolve to recognize some other feature.
Overall, the filters will evolve into a set of distinct features, as needed by the eventual output. One brute-force method of finding the correct quantity of features is to put in too many -- see how many of them learn something useful, then reduce the quantity until you have clean differentiation on a minimal set of filters. In the line of code you present, someone has already done this, and found that CONV0 needs about 32 filters for this topology and application.
Does that clear up the meaning?

Convolutional Neural Network - Visualizing weights

Main Problem
I cannot understand the Plot of the weights of a specific layer.
I used a method from no-learn : plot_conv_weights(layer, figsize=(6, 6))
Im using lasagne as my neural-network library.
The plot comes out fine, but I dont know how i should interpret it.
Neural Network Structure
The structure im using :
InputLayer 1x31x31
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
MaxPool2DLayer 2x2
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
MaxPool2DLayer 40x2x2
DropoutLayer
DenseLayer 96
DropoutLayer 96
DenseLayer 32
DropoutLayer 32
DenseLayer 1 as sigmoid
Here are the weights of the first 3 Layers :
** About the Images **
So for me, they look random and i cannot interpret them!
However, on Cs231, it says the following :
Conv/FC Filters. The second common strategy is to visualize the
weights. These are usually most interpretable on the first CONV layer
which is looking directly at the raw pixel data, but it is possible to
also show the filter weights deeper in the network. The weights are
useful to visualize because well-trained networks usually display nice
and smooth filters without any noisy patterns. Noisy patterns can be
an indicator of a network that hasn’t been trained for long enough, or
possibly a very low regularization strength that may have led to
overfitting
http://cs231n.github.io/understanding-cnn/
Then why mine are random?
The structure is trained and performs well for its task.
References
http://cs231n.github.io/understanding-cnn/
https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/visualize.py

Normally when you visualize the weights you want to check 2 things:
That they are smooth and cover a wide range of values, i.e. it's not a bunch of 1's and 0's. That would mean the non-linearity is being saturated.
That they have some kind of structure. Normally you tend to see oriented edges although this is more difficult to see when you have small filters like 3x3.
That being said, your weights do not appear to be saturated, but they indeed seem to be too random.
During training, did the network converge correctly?
I am also surprised at how big your filters are (30x30). Not sure what you are trying to accomplish with that.

Can I use Keras or a similar CNN tool on a paired image and coordinate?

I am trying to train a classifier to separate images taken by a particle physics detector into two classes. For each image, I also have a coordinate (x,y,z) describing where the particle interaction took place. That coordinate is very useful is understanding these images by eye, but doesn't have an obvious translation to weighting image pixels.
I've been trying some basic machine learning techniques in scikit-learn, feeding in data points with 103 features: the three axes of the coordinates, and the 10x10 pixels of the image. Those basic techniques aren't cutting it, unfortunately, so I thought I'd try to take advantage of the properties of convolutional neural networks. Since I've never tried that before, Keras seemed like an easy way to get started.
Looking at Keras, I see that I ought to provide an input shape. I could presumably use a input shape of (103), but if I understand CNN correctly, I'd lose all the advantages of CNN for images. Intuitively, what I want the input shape to be is (3)+(10,10). Is that a sensible concept in the world of CNN? Can it be done in Keras?

You might want to look into the Merge layer. In essence this allows you to use two independent inputs, maybe give them a few different processing layers and them combine them for the rest of the model.
With this you could, for example, do several convolutional layers to process the image and then simply merge it with the coordinate inputs.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string