Resolution preserving Fully Convolutional Network - conv-neural-network

I am new to ML and PyTorch and I have the following problem:
I am looking for a Fully Convolutional Network architecture in PyTorch, so that the input would be an RGB image (HxWxC or 480x640x3) and the output would be a single-channel image (HxW or 480x640). In other words, I am looking for a network that will preserve the resolution of the input (HxW) and will lose the channel dimension. All of the networks that I've come across (ResNet, DenseNet, ...) end with a fully connected layer (without any upsampling or deconvolution). This is problematic for two reasons:
I am restricted in my choice of input size (HxWxC).
The output has nothing to do with the output I expect to get (a single-channel image HxW).
What am I missing? Why is there even an FC layer? Why are there no upsampling or deconvolution layers after feature extraction? Is there any built-in torchvision.models architecture that might suit my requirements? Where can I find such a PyTorch architecture? As I said, I am new to this field, so I don't really like the idea of building such a network from scratch.
Thanks.

You probably came across networks that are used for classification. That is why they end with a pooling layer and a fully connected layer: they have to produce a fixed number of categorical outputs.
Have a look at U-Net:
https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/
Note: the original U-Net implementation uses a lot of tricks.
You can simply downsample and then upsample symmetrically to do the work.
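To make the "downsample, then upsample symmetrically" idea concrete, here is a minimal PyTorch sketch. It is not U-Net and it is not taken from the linked page: the class name TinyFCN, the channel widths (16 and 32) and the kernel sizes are illustrative assumptions. It only shows that a network without any fully connected layer can map a 3-channel image to a 1-channel map of the same height and width.

```python
import torch
import torch.nn as nn


class TinyFCN(nn.Module):
    """Hypothetical minimal encoder-decoder: RGB in, one channel out, same HxW."""

    def __init__(self):
        super().__init__()
        # Encoder: two stride-2 convolutions, each halving H and W.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: two stride-2 transposed convolutions, each doubling H and W.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = TinyFCN()
x = torch.randn(1, 3, 480, 640)  # PyTorch expects N x C x H x W, not H x W x C
y = model(x)
print(y.shape)  # torch.Size([1, 1, 480, 640])
```

Because each encoder stage halves the size and each decoder stage doubles it, this particular sketch returns exactly the input size only when H and W are divisible by 4. Other sizes would need padding or a final interpolation step.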

Your kind of task belongs to dense prediction tasks, e.g. segmentation. For those tasks, we use fully convolutional networks (see here for the original paper). FCNs don't have any fully connected layers, because applying a fully connected layer discards the spatial information that you need for dense prediction. Also have a look at the U-Net paper. Most state-of-the-art architectures use some kind of encoder-decoder architecture, extended for example with a pyramid pooling module.
There are some implementations in the PyTorch model zoo here. Also search GitHub for PyTorch implementations of other networks.

Related

Conv2D filters and CNN architecture

I am currently an undergraduate student, and I am working on a CNN model to recognize Telugu characters.
This question has two parts:
I have Telugu character images of shape (32,32,1), and I want to train my CNN model to recognize the characters. What should my model architecture be, and how do I decide on the architecture, the number of parameters and the number of hidden layers? I know that my case is essentially the same as handwritten digit recognition, but I want to know how to decide those parameters. Is there any common practice for building such an architecture?
The operation Conv2D (32, (5,5)) means that 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or different? If they are different, what kind of filters are initialized, and who decides them?
I tried searching the internet, but everywhere I look, the only answer I get is that the Conv2D operation applies filters to the input and performs the convolution.
To decide which model architecture would be best, you need to experiment. That's the only way. As you want to classify, I believe a VGG architecture would be a good starting point. You need to experiment with the number of parameters, as it depends on your problem. You can use Keras Tuner for that: https://keras.io/keras_tuner/
For kernel initialization, as far as I know convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, the filters are initialized by sampling from a distribution, and as training goes on their values change, which is the learning process. https://keras.io/api/layers/initializers
Edit: I forgot to mention that when I suggest a VGG architecture, I mean a heavily downsized version of it. Your input shape is small, so if your model is too deep, you will overfit really quickly.
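A small sketch of what Conv2D(32, (5,5)) allocates on a (32,32,1) input, written in plain Python rather than Keras. It assumes the Glorot uniform rule as Keras documents it, limit = sqrt(6 / (fan_in + fan_out)), and the helper names glorot_uniform_limit and sample_filter are made up for illustration. The point is that each of the 32 filters gets its own independently drawn starting values.

```python
import math
import random

kernel_h, kernel_w, in_channels, filters = 5, 5, 1, 32

# Keras stores the kernel as (height, width, input channels, filters).
kernel_shape = (kernel_h, kernel_w, in_channels, filters)
# 5*5*1*32 weights plus one bias per filter.
n_params = kernel_h * kernel_w * in_channels * filters + filters


def glorot_uniform_limit(kernel_h, kernel_w, in_channels, filters):
    """Hypothetical helper: bound of the uniform range used by Glorot uniform."""
    fan_in = kernel_h * kernel_w * in_channels
    fan_out = kernel_h * kernel_w * filters
    return math.sqrt(6.0 / (fan_in + fan_out))


limit = glorot_uniform_limit(kernel_h, kernel_w, in_channels, filters)

random.seed(0)  # fixed seed only so the sketch is repeatable


def sample_filter():
    # Each weight is drawn independently from U(-limit, limit).
    return [[random.uniform(-limit, limit) for _ in range(kernel_w)]
            for _ in range(kernel_h)]


filter_a = sample_filter()
filter_b = sample_filter()

print(kernel_shape)          # (5, 5, 1, 32)
print(n_params)              # 832
print(round(limit, 4))       # 0.0853
print(filter_a != filter_b)  # True: the filters start out different
```

Nobody hand-picks these values. They are only a starting point, and training then adjusts every filter separately.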

Differences in encoder - decoder models between Keras and Pytorch

There seem to be significant, fundamental differences in the construction of encoder-decoder models between Keras and PyTorch. Here is Keras' enc-dec blog and here is PyTorch's enc-dec blog.
Some differences I noticed are the following:
The Keras model feeds the input directly to the LSTM layer, whereas the PyTorch model uses an embedding layer for both the encoder and the decoder.
The PyTorch model uses an embedding layer with no activation in the encoder, but applies a relu activation to the embedding layer in the decoder.
Given these observations, my questions are the following:
Is my understanding correct? The embedding layer is not strictly required, but it helps in finding a better and denser representation of the input. It is optional, and you can still build a good model without it (depending on the problem). This is why Keras chose not to use it in this particular example. Is this a sound reason, or is there more to the story?
Why use an activation for the embedding layer in the decoder but not in the encoder?
Why use 'relu' as the activation instead of 'tanh', etc. for the embedding layer? What's the intuition here? I've only seen 'relu' applied to data that has a spatial relation, not a temporal relation.
You have a wrong understanding of encoder-decoder models. First of all, please note that Keras and PyTorch are two deep learning frameworks, while encoder-decoder is a type of neural network architecture. So you need to understand how an encoder-decoder works in the first place and then adapt the architecture to your needs. Now, let me come back to your questions.
An embedding layer converts one-hot encoded representations into low-dimensional vector representations. For example, say we have the sentence "I love programming" and we want to translate it into German using an encoder-decoder network. The first step is to convert the words in the input sentence into a sequence of vector representations, and this can be done using an embedding layer. Please note that the choice of Keras or PyTorch doesn't matter. Ask yourself: how would you give a natural language sentence as input to an LSTM? Obviously, you first need to convert the words into vectors.
There is no rule that says you should use an activation on the embedding layer in the decoder but not in the encoder. Remember, activation functions are non-linear functions. Applying a non-linearity has its own consequences, but it has nothing to do with the encoder-decoder framework.
Again, the choice of activation function depends on other factors, not on encoder versus decoder or on a specific type of neural network architecture. I suggest you read up on the characteristics of the popular activation functions used in neural networks. Also, do not jump to conclusions after observing a few use cases. Such conclusions are dangerous.
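The statement that an embedding layer turns one-hot representations into low-dimensional vectors can be illustrated in a few lines of plain Python. This is only a sketch: the three-word vocabulary, the embedding size of 4, and the random matrix standing in for learned weights are all assumptions. It shows that looking up row i of the embedding matrix gives the same result as multiplying the one-hot vector for word i by that matrix.

```python
import math
import random

vocab = {"I": 0, "love": 1, "programming": 2}  # toy vocabulary (assumption)
embedding_dim = 4

random.seed(0)
# One trainable row per word; random values stand in for learned ones.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in vocab]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    # Row vector (length = rows of matrix) times matrix.
    return [sum(vec[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]


ids = [vocab[word] for word in "I love programming".split()]

# What an embedding layer does: pick row i for token id i.
looked_up = [embedding_matrix[i] for i in ids]
# The equivalent one-hot formulation.
via_one_hot = [vec_times_matrix(one_hot(i, len(vocab)), embedding_matrix)
               for i in ids]

same = all(math.isclose(a, b, abs_tol=1e-12)
           for row_a, row_b in zip(looked_up, via_one_hot)
           for a, b in zip(row_a, row_b))
print(ids)   # [0, 1, 2]
print(same)  # True
```

The resulting sequence of three 4-dimensional vectors is the kind of input an LSTM encoder would consume, regardless of the framework.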

choose filter in convolution neural network

I have implemented a convolutional neural network, but I am still confused about how to select the filters used to obtain the convolved features. As far as I know, a convolution layer uses filters to detect features (like eyes, nose, mouth) in order to recognize a face in an image. Is it true that a filter contains the eyes, nose and mouth used to recognize a face?
There is no hard rule for this.
In many university courses, and even in the models implemented in papers, researchers use 3x3 or 5x5 filters with a stride of 1 or 2.
The filter size is one of the hyperparameters you should tune for your model. In practice, a good approach is to look at the documentation of models implemented by Google or others and see which sizes they use for comparable conv layers.
The last thing you should know is that the purpose of using convolutional filters is to reduce the number of parameters while still keeping high-quality features.
Here is a link to all models implemented using TensorFlow for different tasks.
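To put numbers on the filter sizes, strides and parameter savings mentioned here, the following sketch uses the standard output-size formula for a convolution, floor((n + 2p - k) / s) + 1. The 32x32 single-channel input and the 32 filters are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv_params(kernel_size, in_channels, filters):
    """Weights of all filters plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Output sizes for the common kernel/stride choices on a 32x32 input.
sizes = {(k, s): conv_output_size(32, k, s) for k in (3, 5) for s in (1, 2)}
print(sizes)  # {(3, 1): 30, (3, 2): 15, (5, 1): 28, (5, 2): 14}

# A 3x3 conv layer with 32 filters on a 1-channel input ...
n_conv = conv_params(3, 1, 32)
# ... versus a dense layer producing the same number of outputs (30*30*32).
n_dense = dense_params(32 * 32, 30 * 30 * 32)
print(n_conv)   # 320
print(n_dense)  # 29520000
```

The gap comes from weight sharing: the same small filter is reused at every image position instead of learning a separate weight for every input-output pair.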
Good luck

keras RNN w/ local support and shared weights

I would like to understand how Keras sets up weights to be shared. Specifically, I would like to use a convolutional 1D layer for processing a time-frequency representation of an audio signal and feed it into an RNN (perhaps a GRU layer) that has:
local support (e.g. like the Conv1D layer with a specified kernel size). Things that are far away in frequency from an output are unlikely to affect the output.
Shared weights, that is I train only a single set of weights across all of the neurons in the RNN layer. Similar inferences should work at lower or higher frequencies.
Essentially, I'm looking for many of the properties that we find in the 2D RNN layers. I've been looking at some of the Keras source code for the convnets to try to understand how weight sharing is implemented, but when I see the weight allocation code in the layer build methods (e.g. in the _Conv class), it's not clear to me how the code is specifying that the weights for each filter are shared. Is this buried in the backend? I see that the backend call is to a specific 1D, 2D, or 3D convolution.
Any pointers in the right direction would be appreciated.
Thank you - Marie

Difference of filters in convolutional neural network

When creating a convolutional neural network (CNN) (e.g. as described in https://cs231n.github.io/convolutional-networks/) the input layer is connected with one or several filters, each representing a feature map. Here, each neuron in a filter layer is connected with just a few neurons of the input layer.
In the simplest case, each of my n filters has the same dimensionality and uses the same stride.
My (closely related) questions are:
How is it ensured that the filters learn different features, although they are trained with the same patches?
Does the feature a filter learns depend on the randomly assigned values (for weights and biases) when the network is initialized?
I'm not an expert, but I can speak a bit to your questions. To be honest, it sounds like you already have the right idea: it's specifically the initial randomization of weights/biases in the filters that fosters their tendencies to learn different features (although I believe randomness in the error backpropagated from higher layers of the network can play a role as well).
As @user2717954 indicated, there is no guarantee that the filters will learn unique features. However, each time the error of a training sample or batch is backpropagated to a given convolutional layer, the weights and biases of each filter are slightly modified to improve the overall accuracy of the network. Since the initial weights and biases are different in each filter, it's possible (and likely, given a suitable model) for most of the filters to eventually stabilize at values representing a robust set of unique features.
In addition to proper randomization of weights, this also demonstrates why it's crucial to use convolutional layers with an adequate number of filters. Without enough filters, the network is fundamentally limited such that there are important, useful patterns at the given layer of abstraction that simply can't be represented by the network.
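The role of random initialization can be seen in a toy experiment. The sketch below is not a real CNN: it trains two "filters" (plain weight vectors applied to the same input) whose tanh responses are summed, using hand-written SGD on made-up random data. All names and settings are assumptions. When both filters start from identical weights they receive identical gradients and never separate; when they start from different random weights they stay different.

```python
import math
import random


def train_two_filters(w_a, w_b, data, lr=0.1, epochs=50):
    """SGD on y = tanh(w_a . x) + tanh(w_b . x) with loss 0.5 * (y - target)^2."""
    w_a, w_b = list(w_a), list(w_b)
    for _ in range(epochs):
        for x, target in data:
            h_a = math.tanh(sum(w * xi for w, xi in zip(w_a, x)))
            h_b = math.tanh(sum(w * xi for w, xi in zip(w_b, x)))
            err = (h_a + h_b) - target
            # Chain rule: d(loss)/d(w) = err * (1 - tanh^2) * x
            g_a = [err * (1 - h_a ** 2) * xi for xi in x]
            g_b = [err * (1 - h_b ** 2) * xi for xi in x]
            w_a = [w - lr * g for w, g in zip(w_a, g_a)]
            w_b = [w - lr * g for w, g in zip(w_b, g_b)]
    return w_a, w_b


random.seed(0)
# Made-up training data: 20 inputs of length 3 with random targets.
data = [([random.uniform(-1, 1) for _ in range(3)], random.uniform(-1, 1))
        for _ in range(20)]

# Case 1: both filters start from the same weights.
same_a, same_b = train_two_filters([0.5, 0.5, 0.5], [0.5, 0.5, 0.5], data)

# Case 2: each filter starts from its own random weights.
init_a = [random.uniform(-1, 1) for _ in range(3)]
init_b = [random.uniform(-1, 1) for _ in range(3)]
rand_a, rand_b = train_two_filters(init_a, init_b, data)

print(same_a == same_b)  # True: identical filters stay identical
print(rand_a != rand_b)  # True: differently initialized filters stay distinct
```

In the first case the second filter is redundant, which is the situation random initialization is meant to avoid.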

Resources