Get exact formula used by PyTorch autograd to compute gradients - pytorch

I am implementing a custom CNN with some custom modules in it. I have implemented only the forward pass for the custom modules and left their backward pass to autograd.
I have manually derived the backpropagation formulae for the parameters of the custom modules, and I would like to check whether they match the formulae autograd uses internally to compute the gradients.
Is there any way to see this?
Thanks
Edit (to add a test case):
I have a complex affine layer where the weights and inputs are complex-valued matrices, and the operation is a matrix multiplication of the weight and input matrices.
The product of two complex numbers is given by
(a+ib)(c+id) = (ac-bd)+i(ad+bc)
I derived the backpropagation formula for this layer, given the incoming gradient from the higher layer.
It comes out to be dL/dI(n) = (hermitian(W(n))).matmul(dL/dI(n+1))
where I(n) and W(n) are the input and weight of the nth layer and I(n+1) is the input of the (n+1)th layer.
So I would like to check whether autograd also computes dL/dI(n) using the formula I derived.
(Since PyTorch does not support backpropagation through complex-valued tensors as of now, I have created my own representation of complex numbers using separate real and imaginary tensors.)

I don't believe there is such a feature in PyTorch, not least because the result would be quite unreadable. What you can do is implement a custom backward method for your layer using the formula you derived, so that you know by construction that the backpropagation is what you want.
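Independently of autograd, a hand-derived formula such as the one in the question can be sanity-checked numerically against finite differences. The following is a minimal, framework-free sketch in plain Python, not PyTorch code. The helper names (matmul, hermitian, loss, analytic_grad, numeric_grad), the matrix shapes and the loss L = 0.5 * sum |O|^2 are all assumptions chosen for illustration, and the complex gradient is taken to mean dL/dRe + i*dL/dIm.

```python
import random


def matmul(A, B):
    """Matrix product of two nested lists (works for complex entries)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def hermitian(A):
    """Conjugate transpose of a nested-list matrix."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]


def loss(W, X):
    """Illustrative real-valued loss: L = 0.5 * sum |O|^2 with O = W X."""
    O = matmul(W, X)
    return 0.5 * sum(abs(o) ** 2 for row in O for o in row)


def analytic_grad(W, X):
    """Hand-derived formula: dL/dX = hermitian(W) @ dL/dO.

    For the loss above, dL/dO (real part + i * imaginary part) equals O.
    """
    G = matmul(W, X)
    return matmul(hermitian(W), G)


def numeric_grad(W, X, eps=1e-6):
    """Central finite differences on the real and imaginary part of each entry."""
    grad = [[0j] * len(X[0]) for _ in X]
    for i in range(len(X)):
        for j in range(len(X[0])):
            parts = []
            for step in (eps, eps * 1j):  # perturb real part, then imaginary part
                Xp = [row[:] for row in X]
                Xm = [row[:] for row in X]
                Xp[i][j] += step
                Xm[i][j] -= step
                parts.append((loss(W, Xp) - loss(W, Xm)) / (2 * eps))
            grad[i][j] = complex(parts[0], parts[1])
    return grad


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
W = [[complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(4)]
     for _ in range(3)]
X = [[complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2)]
     for _ in range(4)]
print("max difference:", max_abs_diff(analytic_grad(W, X), numeric_grad(W, X)))
```

If the printed difference is close to zero, the derived formula agrees with the numerical gradient for this particular loss. The same comparison could be made against the gradients autograd produces for the separate real and imaginary tensors.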

Related

Using GAN to Model Posteriors with PyTorch

I have a 5-dimensional dataset and I'm interested in using a neural network to model the posterior distributions from which the data was drawn. I decided to implement a GAN to do this, and have been familiarizing myself with PyTorch.
I'm wondering how one should go about restricting what values the generator can produce for the parameters. For one of the parameters, the values must be nonnegative real values. For another case, the values must be nonnegative integer values. For the other three cases, the parameters can take on any real value.
My first idea was to control this through the transfer function applied to the nodes in the output layer of my neural network. But all of the PyTorch examples I've seen so far apply the same transfer function to all of the output nodes, which is not what I want to do. Is there a way to apply a different transfer function to each output node? Or is there maybe a better way to approach this problem?
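The idea of a different transfer function per output node can be sketched without any framework. This is only an illustration in plain Python: the function name constrain_outputs, the ordering of the five outputs and the choice of softplus for the nonnegative cases are assumptions, not something taken from a PyTorch example.

```python
import math


def softplus(x):
    """Smooth map from any real number to a positive real, log(1 + e^x).

    Written as max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large |x|.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def constrain_outputs(raw):
    """Apply a separate transfer function to each raw output.

    raw: list of 5 unconstrained reals (hypothetical ordering):
      raw[0]  -> nonnegative real parameter
      raw[1]  -> nonnegative parameter that should eventually be an integer
      raw[2:] -> unrestricted real parameters (identity)
    """
    nonneg_real = softplus(raw[0])
    # Kept real-valued here: rounding is not differentiable, so any rounding
    # to an integer is assumed to happen outside the trained network.
    nonneg_count = softplus(raw[1])
    return [nonneg_real, nonneg_count] + list(raw[2:])


print(constrain_outputs([-3.0, 2.5, 1.0, -2.0, 0.5]))
```

In a tensor library the same pattern would amount to slicing the output vector, applying a function to each slice and concatenating the results.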

Merging several convolutional layers into one

How can I compose several convolutional layers into one layer, assuming there are no non-linear activations in between? How do I write the code for it in PyTorch?
I want the code to account for different padding and strides. I thought about running the conv layers on a template image to obtain a single kernel, but I can't come up with a meaningful way to do it.
Here there are detailed instructions for collapsing 2 convolution layers into 1.
You can use the code to merge the first two and then to merge the outcome with the third.
Conceptually, you can envision the process in a simpler way by using the 'Toeplitz matrix' representation of each convolution operation: multiply the three matrices together, then convert the result back into a single convolution operation (since that is implemented more efficiently, and the Toeplitz representation is very sparse).
The convolution operation can be constructed as a matrix multiplication, where one of the inputs is converted into a Toeplitz matrix.
You can see an example of this approach here.
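As a small illustration of why the merge is possible at all, here is a plain-Python sketch for the simplest case only: 1-D, single channel, stride 1, no padding, no bias. It does not handle the padding and stride cases asked about, and the function names are hypothetical. Under those assumptions, applying two filters in sequence equals applying one filter whose kernel is the full convolution of the two kernels.

```python
def correlate_valid(x, w):
    """Sliding dot product (what deep learning libraries call convolution),
    stride 1, no padding."""
    n = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n)]


def merge_kernels(w1, w2):
    """Kernel equivalent to applying w1 and then w2 (full convolution of both)."""
    merged = [0.0] * (len(w1) + len(w2) - 1)
    for k, a in enumerate(w1):
        for m, b in enumerate(w2):
            merged[k + m] += a * b
    return merged


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
w1 = [1.0, 0.0, -1.0]
w2 = [0.5, 2.0]

two_steps = correlate_valid(correlate_valid(x, w1), w2)
one_step = correlate_valid(x, merge_kernels(w1, w2))
print(two_steps)
print(one_step)
```

The merged kernel has length len(w1) + len(w2) - 1, which matches the growth of the receptive field when layers are stacked.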

How to perform de-normalization of last layer into labels in Keras, analogous to the preprocessing normalization layer (but inversed)?

It is my understanding that Artificial Neural Networks work best on normalized data, i.e., inputs and outputs should typically have a mean of 0 and a variance of 1 (and even, if possible, a "near Gaussian", or at least "well behaved", distribution).
Therefore, I have seen and written quite a few Keras scripts in which I first apply feature-wise normalization to the predictors and labels. This is a pain, as it means keeping track of a number of mean and std values, applying them correctly later at inference, etc.
I found out recently that there is now out-of-the-box functionality for doing the predictors normalization in Keras in an "adaptable, not trainable" way, which is very convenient, as all the normalization information gets stored and used out-of-the-box in the network object: see: https://keras.io/guides/preprocessing_layers/ , https://keras.io/api/layers/preprocessing_layers/numerical/normalization/#normalization-class . This makes use / bookkeeping much simpler.
My question is: would it make sense / is there a simple way to similarly do in-Keras an "outputs de-normalization", i.e., assuming that the outputs from the network have mean 0 and variance 1, add an adaptable (adaptable not trainable; similar to the preprocessing normalization layer) layer that de-normalize these outputs into the correct mean and variance for each label?
I guess this is quite similar to the preprocessing normalization layer, except that what we would like is the "inverse transformation" of what would be obtained by applying the preprocessing normalization layer on the labels. I.e., when adapting the layer to labels, one gets a layer that "de-normalizes" a 0-mean 1-std distribution into a distribution with feature-wise mean and std corresponding to the labels.
I do not see some way to get this "inverse layer" or "de-normalization layer", am I missing something / is there a simple way to do it?
The normalization layer has an invert parameter:
If True, this layer will apply the inverse transformation to its
inputs: it would turn a normalized input back into its original form.
So, in theory you could use:
layer = tf.keras.layers.Normalization(invert=True)
to de-normalize. At the time of writing, this is implemented incorrectly and does not work (though the bug appears to be fixed in the next Keras version).
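Until the built-in option works, the inverse transformation itself is simple arithmetic. The sketch below is plain Python and does not use the Keras API: the function names adapt, normalize and denormalize are hypothetical, it handles a single label column, and it assumes the population standard deviation, which may or may not match what a given Keras version stores.

```python
import statistics


def adapt(labels):
    """Return (mean, std) of a list of label values (population std)."""
    return statistics.fmean(labels), statistics.pstdev(labels)


def normalize(values, mean, std):
    """Map values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Inverse of normalize: map zero-mean, unit-variance values back."""
    return [v * std + mean for v in values]


labels = [10.0, 12.0, 14.0, 20.0]
mean, std = adapt(labels)
scaled = normalize(labels, mean, std)
restored = denormalize(scaled, mean, std)
print(mean, std)
print(restored)
```

A custom layer that stores the adapted mean and std as non-trainable values and applies the denormalize step would give the "inverse layer" behaviour described in the question.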

Loss function forcing orthogonality for the encoding of an autoencoder (NPCA)

I have implemented an autoencoder that should realize a non-linear version of a principal component analysis. The input and output of the model are the same dataset with n features, and I am interested in the encoding, which has dimension d<n. To generalize the principal component analysis I would like to have an encoding that consists of d almost linearly independent vectors, but if I use the loss function "mse" I get, e.g. for d=2, two vectors which look almost the same.
Theoretically I could use a loss function that includes a penalty term for vectors that are similar and far from independent. But that would mean having a loss function that uses the information of the whole batch, not just a single sample, and that is computed from an intermediate layer rather than from the output.
Since I am working with Keras: can anyone give me a hint or a reference on how to approach this problem in Keras?
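To make the penalty idea concrete, here is one possible form of such a term, sketched in plain Python rather than Keras. The function name decorrelation_penalty and the specific choice (sum of squared cosine similarities between the centered encoding components, computed over the batch) are assumptions for illustration, not an established recipe from the question.

```python
import math


def decorrelation_penalty(Z):
    """Penalty on similar encoding components.

    Z: batch of encodings as a list of rows, shape (batch_size, d).
    Returns the sum over pairs a < b of the squared cosine similarity between
    the centered a-th and b-th components, each viewed as a vector over the batch.
    """
    n, d = len(Z), len(Z[0])
    means = [sum(row[a] for row in Z) / n for a in range(d)]
    cols = [[row[a] - means[a] for row in Z] for a in range(d)]
    penalty = 0.0
    for a in range(d):
        for b in range(a + 1, d):
            dot = sum(p * q for p, q in zip(cols[a], cols[b]))
            norm_a = math.sqrt(sum(p * p for p in cols[a]))
            norm_b = math.sqrt(sum(q * q for q in cols[b]))
            # Small constant guards against division by zero for constant components.
            penalty += (dot / (norm_a * norm_b + 1e-12)) ** 2
    return penalty


similar = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
orthogonal = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
print(decorrelation_penalty(similar))
print(decorrelation_penalty(orthogonal))
```

A term of this kind depends on the whole batch and on an intermediate layer, which is exactly the requirement stated above. In a framework it would have to be attached to the encoding layer's output instead of being passed as an ordinary per-sample loss.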

Meaning of Weight Gradient in CNN

I developed a CNN using MatConvNet and am able to visualize the weights of the 1st layer. They look very similar to what is shown here (also attached below in case I am not specific enough): http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
My question is, what are the weight gradients ? I'm not sure what those are and am unable to generate those...
Weights in a NN
In a neural network, a series of linear functions represented as matrices are applied to features (usually with a nonlinear joint between them). These functions are determined by the values in the matrices, referred to as weights.
You can visualize the weights of a normal neural network, but it usually means something slightly different to visualize the convolutional layers of a CNN. These layers are designed to learn a feature computation over the space.
When you visualize the weights, you're looking for patterns. A nice smooth filter may mean that the weights are well learned and "looking for something in particular". A noisy weight visualization may mean that you've undertrained your network, overfit it, need more regularization, or something else nefarious (a decent source for these claims).
From this decent review of weight visualizations, we can see patterns start to emerge from treating the weights as images:
Weight Gradients
"Visualizing the gradient" means taking the gradient matrix and treating it like an image [1], just like you took the weight matrix and treated it like an image before.
A gradient is just a derivative; for images, it's usually computed as a finite difference - grossly simplified, the X gradient subtracts pixels next to each other in a row, and the Y gradient subtracts pixels next to each other in a column.
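The finite-difference description above can be sketched in a few lines of plain Python. The function names and the tiny example matrix are made up for illustration, and this follows the same gross simplification (differences of adjacent pixels, with no smoothing or scaling).

```python
def x_gradient(img):
    """Difference between horizontally adjacent pixels in each row."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]


def y_gradient(img):
    """Difference between vertically adjacent pixels in each column."""
    return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))]
            for i in range(len(img) - 1)]


# A tiny "filter" with a vertical edge: dark on the left, bright on the right.
edge = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(x_gradient(edge))
print(y_gradient(edge))
```

For this example the X gradient is nonzero only at the column where the edge sits, while the Y gradient is zero everywhere, which is the "strong gradient in a particular direction" pattern.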
For the common example of a filter that extracts edges, we may see a strong gradient in a particular direction. By visualizing the gradients (taking the matrix of finite differences and treating it like an image), you can get a more immediate idea of how your filter is operating on the input. There are a lot of cutting edge techniques (eg, eg) for interpreting these results, but making the image pop up is the easy part!
A similar technique involves visualizing the activations after a forward pass over the input. In this case, you're looking at how the input was changed by the weights; by visualizing the weights, you're looking at how you expect them to change the input.
Don't over-think it - the weights are interesting because they let us see how the function behaves, and the gradients of the weights are just another feature to help explain what's going on. There's nothing sacred about that feature: here are some cool clustering features (t-SNE) from the Google paper that look at space separability.
[1] It can be more complicated if you introduce weight sharing, but not that much
My answer here covers this question https://stackoverflow.com/a/68988426/10661506
Long story short, the weight gradient of layer l is the gradient of the loss with respect to the weights of layer l.
If you have a correct implementation of backpropagation, you should have access to these gradients, as they are needed to compute the weight updates at every layer.
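As a minimal illustration of that definition, consider a single linear unit with a squared-error loss. This is a plain-Python sketch with hypothetical names and numbers, not MatConvNet code: the weight gradient is dL/dw, and it is exactly the quantity used in the weight update.

```python
def forward(w, x):
    """Output of a single linear unit: y = sum_j w_j * x_j."""
    return sum(wj * xj for wj, xj in zip(w, x))


def loss(w, x, target):
    """Squared-error loss L = 0.5 * (y - target)^2."""
    return 0.5 * (forward(w, x) - target) ** 2


def weight_gradient(w, x, target):
    """Gradient of the loss with respect to each weight: dL/dw_j = (y - target) * x_j."""
    err = forward(w, x) - target
    return [err * xj for xj in x]


def sgd_step(w, grad, lr):
    """Plain gradient-descent update: w <- w - lr * dL/dw."""
    return [wj - lr * gj for wj, gj in zip(w, grad)]


w = [0.5, -1.0]
x = [2.0, 1.0]
target = 1.0

grad = weight_gradient(w, x, target)
w_new = sgd_step(w, grad, lr=0.1)
print(grad)
print(loss(w, x, target), loss(w_new, x, target))
```

The gradient has the same shape as the weights, which is why it can be displayed as an image in the same way as the weights themselves.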

Resources