I am learning CNNs and need to calculate the receptive field size for a 3-layer conv network. I have looked at the Guide to Receptive field arithmetic and its formulas. The stride enters through j(out), and r(out) is calculated in the next step. However, the guide seems to assume that the stride is always symmetric, such as 2x2. Any idea how the formulas change for a stride of 2x3?
Also, is there any way to do this calculation as part of model building in TensorFlow/Keras or PyTorch, perhaps through the model.summary() call? If so, I would appreciate a code example.
Thank you very much,
Yoshiro
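A minimal sketch of the arithmetic with a separate stride per axis, assuming the guide's recurrences j_out = j_in * s and r_out = r_in + (k - 1) * j_in apply independently to height and width. The function name `receptive_field` and the 3-layer configuration are hypothetical, chosen only for illustration.

```python
def receptive_field(layers):
    """Return (r, j) per axis after a stack of conv/pool layers.

    layers: list of (kernel, stride) pairs, each a (height, width) tuple.
    r is the receptive field size and j the jump (cumulative stride),
    both in input pixels and tracked separately for each axis.
    """
    r = [1, 1]  # an input pixel sees only itself
    j = [1, 1]  # adjacent input pixels are one pixel apart
    for kernel, stride in layers:
        for axis in (0, 1):
            # r_out = r_in + (k - 1) * j_in, using the jump *before* this layer
            r[axis] += (kernel[axis] - 1) * j[axis]
            # j_out = j_in * s
            j[axis] *= stride[axis]
    return tuple(r), tuple(j)


# Hypothetical network: three layers, 3x3 kernels, stride 2 (rows) x 3 (columns)
layers = [((3, 3), (2, 3))] * 3
r, j = receptive_field(layers)
print("receptive field (h, w):", r)
print("jump (h, w):", j)
```

The sketch only needs each layer's kernel size and stride, so it can be run next to the model definition. It does not hook into model.summary().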
I am an undergraduate student working on a CNN model to recognize Telugu characters.
This question has two parts:
I have Telugu character images of shape (32, 32, 1), and I want to train my CNN model to recognize the characters. What should my model architecture be, and how do I decide on the architecture, the number of parameters, and the number of hidden layers? I know that my case is essentially the same as handwritten digit recognition, but I want to know how to decide on those parameters. Is there any common practice for building such an architecture?
The operation Conv2D(32, (5,5)) means 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or different? If they are different, what kind of filters are initialized, and who decides them?
I tried searching the internet, but everywhere I look, the answer I get is that the Conv2D operation applies filters to the input and performs the convolution.
To decide which model architecture is best, you need to experiment. That's the only way. Since you want to classify, I believe a VGG architecture would be a good starting point. You need to experiment with the number of parameters, as it depends on your problem. You can use Keras Tuner for that: https://keras.io/keras_tuner/
For kernel initialization, as far as I know convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, the filters are initialized from a random distribution, so they all start out different, and training then changes the values inside them; that is the learning process. https://keras.io/api/layers/initializers
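To make the initialization concrete, here is a rough standard-library sketch of Glorot uniform sampling for a Conv2D(32, (5,5)) layer on a single-channel input. It assumes the usual definition, limit = sqrt(6 / (fan_in + fan_out)), with fan_in and fan_out counted over the kernel window. The function name `glorot_uniform_filters` is hypothetical and this is not the Keras implementation.

```python
import math
import random


def glorot_uniform_filters(kernel_h, kernel_w, in_channels, n_filters, seed=0):
    """Sample n_filters kernels from U(-limit, limit), Glorot-style."""
    rng = random.Random(seed)
    window = kernel_h * kernel_w
    fan_in = window * in_channels   # inputs feeding one output value
    fan_out = window * n_filters    # outputs fed by one input value
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    filters = [
        [[[rng.uniform(-limit, limit) for _ in range(in_channels)]
          for _ in range(kernel_w)]
         for _ in range(kernel_h)]
        for _ in range(n_filters)
    ]
    return filters, limit


# 32 filters of size 5x5 over a 1-channel (grayscale) input
filters, limit = glorot_uniform_filters(5, 5, 1, 32)
print("number of filters:", len(filters))
print("sampling limit:", round(limit, 4))
print("first two filters identical?", filters[0] == filters[1])
```

Each filter is an independent random draw, so no two start out the same, and nobody hand-picks what each one will detect.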
Edit: I forgot to mention that I suggest the VGG architecture, but scaled down a lot. Your input shape is small, so if your model is too deep, you will overfit very quickly.
I have a time series forecasting case with ten features (inputs) and only one output. I'm using 22 timesteps (history of features) for one-step-ahead prediction using an LSTM. I also apply MinMaxScaler for input normalization, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), but the other values are between 0 and ~5 (all values are positive). In this case, it's important to forecast both normal and outlier outputs correctly, so I don't want to miss the jumps in my forecasting model. I think if I use MinMaxScaler for the output, most of the values will be near zero while the others (outliers) will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (Currently, I'm using an LSTM with relu and a Dense layer with relu as the last layer, so the output will be a positive value.) I think I should select the activation functions carefully for this case.
I think that, first of all, you should decide on a metric to measure performance. For example, do you want to use MAE or MSE? Or some other metric chosen for the task at hand? You may, for example, tolerate greater error for the "rare jumps" but not for the normal cases, or vice versa. Once you have decided on the error metric, ideally you should set it as the cost function that the LSTM network minimizes.
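As one illustration of a task-specific metric, here is a small sketch of a weighted MAE that penalizes errors on the jumps more heavily. The threshold of 10 and the weight of 5 are assumptions picked for the example, not recommendations, and `weighted_mae` is a hypothetical helper.

```python
def weighted_mae(y_true, y_pred, jump_threshold=10.0, jump_weight=5.0):
    """Mean absolute error where targets at or above jump_threshold count more."""
    total = 0.0
    weight_sum = 0.0
    for target, pred in zip(y_true, y_pred):
        weight = jump_weight if target >= jump_threshold else 1.0
        total += weight * abs(target - pred)
        weight_sum += weight
    return total / weight_sum


# Toy data: three normal values and one jump that the model underestimates
y_true = [1.0, 2.0, 50.0, 3.0]
y_pred = [1.5, 2.0, 30.0, 3.5]
print("plain MAE:   ", weighted_mae(y_true, y_pred, jump_weight=1.0))
print("weighted MAE:", weighted_mae(y_true, y_pred))
```

To train against such a metric, the same weighting would have to be written with the framework's tensor operations so that it stays differentiable.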
Now the goal is to minimize the error metric you have chosen. If this were a convex problem, the scaling of the output would not matter. But we know that this is not the case with complex deep learning architectures. This means that while minimizing the cost function with gradient descent, the optimizer might get stuck in a local minimum or converge very slowly. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With the last layer's parameters initialized around zero and a bias value of zero (i.e. in the linear transformation before the relu), the network needs to learn that the bias should be around 5. Depending on the complexity of the network, this could take some epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
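A toy sketch of the bias argument, under strong simplifying assumptions: the "network" is a single bias term trained with plain gradient descent on MSE. The learning rate, the tolerance, and the helper name `steps_to_converge` are all made up for the illustration.

```python
def steps_to_converge(targets, bias_init, lr=0.1, tol=0.01, max_steps=10_000):
    """Count gradient steps until a bias-only model is within tol of the target mean."""
    mean = sum(targets) / len(targets)
    bias = bias_init
    for step in range(max_steps):
        if abs(bias - mean) < tol:
            return step
        # d/db of mean((b - y)^2) equals 2 * (b - mean(y))
        grad = 2.0 * (bias - mean)
        bias -= lr * grad
    return max_steps


targets = [4.0, 5.0, 6.0]  # output with a mean of 5
steps_from_zero = steps_to_converge(targets, bias_init=0.0)
steps_from_mean = steps_to_converge(targets, bias_init=5.0)
print("steps starting from bias 0:", steps_from_zero)
print("steps starting from bias 5:", steps_from_mean)
```

Centering the targets so their mean is zero has the same effect as starting the bias at the mean.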
Now back to your questions:
I would at least make the output zero mean and use Dense layer with linear output.
The architecture you have seems fine. You can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the code and the performance you get, and we can discuss what else can be improved.
I am writing a sequence-to-sequence neural network in PyTorch. In the official PyTorch seq2seq tutorial, there is code for an Attention Decoder that I cannot understand and think might contain a mistake.
It computes the attention weights at each time step by concatenating the output and the hidden state at this time, and then multiplying by a matrix to get a vector of size equal to the output sequence length. Note that these attention weights don’t depend on the encoder sequence (named encoder_outputs in the code), which I think they should.
Also, the paper cited in the tutorial lists three different score functions that can be used to compute attention weights (section 3.1 in the paper). None of these functions just concatenates and multiplies by a matrix.
So it seems to me that the code in the tutorial is mistaken both in the function it applies and the arguments that are passed to this function. Am I missing something?
The tutorial uses a simplified version of the attention mechanisms from the Luong paper that you mentioned.
It just uses a linear layer to combine the input embedding and the decoder RNN hidden state. This is sometimes called 'location-based' attention, because it does not depend on the encoder outputs. It then applies the softmax to obtain the attention weights, and the process continues as it normally would.
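A plain-Python sketch of this location-based scheme, with small made-up sizes and random parameters. The names `location_attention`, `W`, `b`, `enc_a` and `enc_b` are hypothetical, and this is not the tutorial's code. It shows that the weights come only from the embedding and hidden state, while the context still mixes the encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def location_attention(embedded, hidden, W, b, encoder_outputs):
    """Weights = softmax(W [embedded; hidden] + b); context = weighted encoder sum."""
    x = embedded + hidden  # list concatenation
    scores = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(wt * enc[d] for wt, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context


rng = random.Random(0)
hidden_size, max_length = 4, 3


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


W = [rand_vec(2 * hidden_size) for _ in range(max_length)]
b = [0.0] * max_length
embedded, hidden = rand_vec(hidden_size), rand_vec(hidden_size)
enc_a = [rand_vec(hidden_size) for _ in range(max_length)]
enc_b = [rand_vec(hidden_size) for _ in range(max_length)]

weights_a, context_a = location_attention(embedded, hidden, W, b, enc_a)
weights_b, context_b = location_attention(embedded, hidden, W, b, enc_b)
print("weights unchanged by encoder outputs:", weights_a == weights_b)
print("context unchanged by encoder outputs:", context_a == context_b)
```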
This is not always a bad thing to have: when the attention depends on the encoder outputs, the mechanism might attend to an earlier token, the attention would then not be monotonic, and your model could fail.
To implement the attention from the Luong paper, I suggest using the 'concat' attention, after applying linear layers to both the decoder hidden state and the encoder outputs. The matrix W_a then transforms these concatenated results to an arbitrary dimension of your choice, and finally v_a is a vector that reduces the result to a scalar score for each encoder position; a softmax over these scores gives the attention weights used to build the context vector.
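A plain-Python sketch of the 'concat' score, assuming the form score(h_t, h_s) = v_a^T tanh(W_a [h_t; h_s]) from section 3.1 of the Luong paper. The sizes, the random parameters, and the function name `concat_attention` are made up for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_attention(decoder_hidden, encoder_outputs, W_a, v_a):
    """Score every encoder position against the decoder state, then mix."""
    scores = []
    for h_s in encoder_outputs:
        x = decoder_hidden + h_s  # [h_t; h_s] as a list concatenation
        projected = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_a]
        scores.append(sum(v * p for v, p in zip(v_a, projected)))  # scalar score
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(wt * enc[d] for wt, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context


rng = random.Random(1)
hidden_size, attn_size, src_len = 4, 5, 3


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


W_a = [rand_vec(2 * hidden_size) for _ in range(attn_size)]
v_a = rand_vec(attn_size)
decoder_hidden = rand_vec(hidden_size)
enc_a = [rand_vec(hidden_size) for _ in range(src_len)]
enc_b = [rand_vec(hidden_size) for _ in range(src_len)]

weights_a, context_a = concat_attention(decoder_hidden, enc_a, W_a, v_a)
weights_b, _ = concat_attention(decoder_hidden, enc_b, W_a, v_a)
print("weights depend on encoder outputs:", weights_a != weights_b)
```

Unlike the location-based variant, the weights here change whenever the encoder outputs change.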
In the algorithm, attn_weights depends on the decoder parameters.
Then we get the output of a linear layer (here of size 10). This is the attention vector.
Then we multiply this with encoder_outputs. So at every epoch, we update attn_weights by backpropagation. In other words, at every iteration, it is learning in the reverse direction.
Let me give an example:
Our task is to translate from English to German.
I want to sing a song. -> Ich möchte ein Lied singen.
In the decoder, the verb singen is at the end. So the decoder's attn_weights see the decoder output and learn which parts of the input encoding to apply. When you multiply this value with encoder_outputs, you get a matrix that has high values at the necessary points.
So in fact, this way it is learning which parts of the input it must pay attention to when the decoder sees a sentence pattern in German. So the direction of learning is correct, I think.
Let's say the first conv layer has 32 filters of size 5x5 with a stride of 1.
model.add(Conv2D(32, (5, 5), input_shape=input_shape))
Let's say the image is of size 32x32x3 (channels). When a filter convolves with a part of an image, is it already looking for a specific feature? I understand that the filter matrix is initialized with random numbers. But do the filters already have a sort of purpose in what they are looking for? Could you explain how features are detected in a CNN?
The goal of a convolutional layer is filtering. As we move over an image, we effectively check for patterns in that section of the image. This works because of filters: stacks of weights that are multiplied by the pixel values in the current window and summed. During training, these weights change, so when it is time to evaluate an image, a filter returns a high value if it is seeing a pattern it has seen before. The combination of high responses from various filters lets the network predict the content of an image.
So, when a filter convolves with a part of an image, at first it doesn't know whether it is looking at a feature or not. Through training and changing the weights, the filters adapt to the features in the images so that the loss with respect to the ground truth is minimized. The random initialization is just a starting point: we change the weights so that the predicted value gets as close as possible to the given label.
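To illustrate what "returning a high value on a pattern" means, here is a small sketch that slides one hand-written vertical-edge filter over a toy image. In a real CNN the filter values are learned rather than set by hand; the image, the kernel, and the helper `correlate2d` are invented for the example.

```python
def correlate2d(image, kernel):
    """Slide kernel over image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b]
             for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy 4x6 image: dark on the left, bright on the right
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written filter that responds to dark-to-bright vertical edges
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = correlate2d(image, vertical_edge)
for row in response:
    print(row)  # each row is [0, 3, 3, 0]
```

The response is zero over the flat regions and peaks where the window straddles the edge, which is the kind of behaviour training pushes a randomly initialized filter towards.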
I developed a CNN using MatConvNet and am able to visualize the weights of the 1st layer. It looked very similar to what is shown here (also attached below in case I am not specific enough): http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
My question is: what are the weight gradients? I'm not sure what those are and am unable to generate them.
Weights in a NN
In a neural network, a series of linear functions represented as matrices are applied to features (usually with a nonlinear joint between them). These functions are determined by the values in the matrices, referred to as weights.
You can visualize the weights of a normal neural network, but it usually means something slightly different to visualize the convolutional layers of a CNN. These layers are designed to learn a feature computation over the space.
When you visualize the weights, you're looking for patterns. A nice smooth filter may mean that the weights are well learned and "looking for something in particular". A noisy weight visualization may mean that you've undertrained your network, overfit it, need more regularization, or something else nefarious (a decent source for these claims).
From this decent review of weight visualizations, we can see patterns start to emerge from treating the weights as images:
Weight Gradients
"Visualizing the gradient" means taking the gradient matrix and treating it like an image [1], just like you took the weight matrix and treated it like an image before.
A gradient is just a derivative; for images, it's usually computed as a finite difference - grossly simplified, the X gradient subtracts pixels next to each other in a row, and the Y gradient subtracts pixels next to each other in a column.
For the common example of a filter that extracts edges, we may see a strong gradient in a particular direction. By visualizing the gradients (taking the matrix of finite differences and treating it like an image), you can get a more immediate idea of how your filter is operating on the input. There are a lot of cutting edge techniques (eg, eg) for interpreting these results, but making the image pop up is the easy part!
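A minimal sketch of the finite differences described above, applied to a small matrix standing in for a filter. The values and the helper names `x_gradient` and `y_gradient` are made up for the example.

```python
def x_gradient(matrix):
    """Difference between horizontally adjacent entries in each row."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)] for row in matrix]


def y_gradient(matrix):
    """Difference between vertically adjacent entries in each column."""
    return [[matrix[r + 1][c] - matrix[r][c] for c in range(len(matrix[0]))]
            for r in range(len(matrix) - 1)]


# Toy "filter" containing a vertical edge
weights = [[0, 0, 1, 1],
           [0, 0, 1, 1],
           [0, 0, 1, 1]]

gx = x_gradient(weights)
gy = y_gradient(weights)
print("X gradient:", gx)  # non-zero only where the edge is
print("Y gradient:", gy)  # all zeros: nothing changes down the columns
```

Rendering `gx` and `gy` as images is then the same step as rendering the weights themselves.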
A similar technique involves visualizing the activations after a forward pass over the input. In this case, you're looking at how the input was changed by the weights; by visualizing the weights, you're looking at how you expect them to change the input.
Don't overthink it: the weights are interesting because they let us see how the function behaves, and the gradients of the weights are just another feature to help explain what's going on. There's nothing sacred about that feature: here are some cool clustering features (t-SNE) from the Google paper that look at space separability.
[1] It can be more complicated if you introduce weight sharing, but not that much
My answer here covers this question https://stackoverflow.com/a/68988426/10661506
Long story short, the weight gradient of layer l is the gradient of the loss with respect to the weights of layer l.
If you have a correct implementation of backpropagation, you should have access to these gradients, as they are needed to compute the weight update at every layer.
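As a small illustration of this definition, the sketch below computes the gradient of an MSE loss with respect to the single weight of a one-neuron linear model and checks it against a numerical estimate. The model, the data, and the function names are assumptions made for the example and have nothing to do with MatConvNet's API.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def weight_gradient(w, b, xs, ys):
    """Analytic dLoss/dw, which is what backpropagation computes for this layer."""
    return sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x
w, b = 0.5, 0.0

analytic = weight_gradient(w, b, xs, ys)

# Central finite difference as an independent check
eps = 1e-6
numeric = (mse_loss(w + eps, b, xs, ys) - mse_loss(w - eps, b, xs, ys)) / (2 * eps)

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)

# One gradient descent step moves w towards the true value of 2
lr = 0.05
w_updated = w - lr * analytic
print("updated weight:", w_updated)
```

For a convolutional layer the gradient has the same shape as the filter bank, which is why it can be displayed as an image just like the weights.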