Locally connected layer filters - keras

I have a problem understanding how the filters are used in Locally connected layer.
For example, say the input is a 6x6x3 image and we use one Conv2D layer (same padding) and one LocallyConnected2D layer, each with 4 filters of size 3x3.
filters, biases = model.layers[2].get_weights()
When I use layer.get_weights() on the Conv2D layer, it returns filters with shape (3, 3, 3, 4) and a bias with shape (4,), which is expected, as we have 4 filters of shape 3x3x3.
But layer.get_weights() on the LocallyConnected2D layer returns filters with shape (16, 36, 4) and a bias with shape (4, 4, 4).
Why is the filter shape 16x36?
I know that a locally connected layer uses a different filter at each input patch. So how do we cover the whole image with only 4 filters?

Reading the documentation, there is this note for a layer with a 32x32x3 input and 64 filters:
notice that this layer will consume (30*30)*(3*3*3*64) + (30*30)*64 parameters
Or:
Kernel: (patches x patches) * (size * size * input_channels * output_channels)
Bias: (patches x patches) * output_channels
Translating to your case:
this layer will consume (4*4)*(3*3*4*4) kernel params + (4*4)*4 bias params.
Explanation:
Your kernel size is 3x3 (the first part of the second parentheses).
Your input channels are 4 (the third number in the second parentheses), because the previous Conv2D layer outputs 4 channels, and your output channels are also 4 (the fourth number).
For 6x6 images and this kernel size there are 4x4 patches (just as 32x32 images give 30x30 patches for a 3x3 kernel without padding).
For the biases: 4x4 patches and 4 output channels.
That is exactly the (16, 36, 4) kernel you saw: 16 = 4*4 patches, 36 = 3*3*4 weights per patch, and 4 filters. The layer doesn't slide 4 shared filters across the image; each of the 16 patches has its own set of 4 filters.
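Here is a minimal sketch reproducing those shapes, assuming a Keras version that still ships LocallyConnected2D (it was removed from recent releases):
from keras.models import Model
from keras.layers import Input, Conv2D, LocallyConnected2D

inp = Input((6, 6, 3))
x = Conv2D(4, (3, 3), padding='same')(inp)    # output: (6, 6, 4)
x = LocallyConnected2D(4, (3, 3))(x)          # 'valid' only, so output: (4, 4, 4)
model = Model(inp, x)

conv_w, conv_b = model.layers[1].get_weights()
loc_w, loc_b = model.layers[2].get_weights()
print(conv_w.shape, conv_b.shape)   # (3, 3, 3, 4) (4,)
print(loc_w.shape, loc_b.shape)     # (16, 36, 4) (4, 4, 4)
# 16 = 4*4 patches, 36 = 3*3*4 weights per patch and filter, 4 filters per patch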

Related

Should Kernel size be same as word size in 1D Convolution?

In the CNN literature it is often illustrated that the kernel size is the same as the size of the longest word in the vocabulary as the filter sweeps across a sentence.
So if we use an embedding to represent the text, shouldn't the kernel size be the same as the embedding dimension, so that it gives the same effect as sweeping word by word?
I see different kernel sizes used, regardless of the word length.
Well... these are 1D convolutions, for which the kernels are 3 dimensional.
It's true that one of these 3 dimensions must match the embedding size (otherwise it would be pointless to have this size).
These three dimensions are:
(length_or_size, input_channels, output_channels)
Where:
length_or_size (kernel_size): anything you want. In the picture, there are 6 different filters with sizes 4, 4, 3, 3, 2, 2, represented by the "vertical" dimension.
input_channels (automatically the embedding_size): the size of the embedding - this is somewhat mandatory (in Keras this is automatic and almost invisible), otherwise the multiplications wouldn't use the entire embedding, which would be pointless. In the picture, the "horizontal" dimension of the filters is constantly 5 (the same as the embedding size - this is not a spatial dimension).
output_channels (filters): anything you want, but it seems the picture is talking about only 1 channel per filter, since this dimension is totally ignored; if represented, it would be something like a "depth".
So, you're probably confusing which dimensions are which. When you define a conv layer, you do:
Conv1D(filters = output_channels, kernel_size=length_or_size)
While the input_channels come from the embedding (or the previous layer) automatically.
Creating this model in Keras
To create this model, it would be something like:
from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Dense

sentence_length = 7
embedding_size = 5
inputs = Input((sentence_length,))
out = Embedding(total_words_in_dic, embedding_size)(inputs)  # total_words_in_dic = your vocabulary size
Now, supposing these filters have 1 channel only (since the image doesn't seem to consider their depth...), we can join them in pairs of 2 channels:
size1 = 4
size2 = 3
size3 = 2
output_channels=2
out1 = Conv1D(output_channels, size1, activation=activation_function)(out)
out2 = Conv1D(output_channels, size2, activation=activation_function)(out)
out3 = Conv1D(output_channels, size3, activation=activation_function)(out)
Now, let's collapse the spatial dimensions and remain with the two channels:
out1 = GlobalMaxPooling1D()(out1)
out2 = GlobalMaxPooling1D()(out2)
out3 = GlobalMaxPooling1D()(out3)
And create the 6 channel output:
out = Concatenate()([out1,out2,out3])
Now there is a mistery jump from 6 channels to 2 channels which cannot be explained by the picture. Perhaps they're applying a Dense layer or something.......
#????????????????
out = Dense(2, activation='softmax')(out)
model = Model(inputs, out)
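As a quick sanity check (assuming total_words_in_dic and activation_function have been given real values, e.g. your vocabulary size and 'tanh'), you can verify that each Conv1D kernel really has the shape (length_or_size, input_channels, output_channels):
for layer in model.layers:
    if 'conv1d' in layer.name:          # layer names may vary slightly by Keras version
        kernel, bias = layer.get_weights()
        print(layer.name, kernel.shape, bias.shape)
# Expected something like:
# conv1d_1 (4, 5, 2) (2,)
# conv1d_2 (3, 5, 2) (2,)
# conv1d_3 (2, 5, 2) (2,)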

Weights Matrix Final Fully Connected Layer

My question is, I think, too simple, but it's giving me headaches. I think I'm either missing something conceptual about neural networks or TensorFlow is returning some wrong layer.
I have a network in which the last layer outputs 4800 units. The penultimate layer has 2000 units. I expect my weight matrix for the last layer to have the shape (4800, 2000), but when I print out the shape in TensorFlow I see (2000, 4800). Can someone please confirm which shape the weight matrix of the last layer should have? Depending on the answer, I can debug the issue further. Thanks.
Conceptually, a neural network layer is often written like y = W*x where * is matrix multiplication, x is an input vector and y an output vector. If x has 2000 units and y 4800, then indeed W should have size (4800, 2000), i.e. 4800 rows and 2000 columns.
However, in implementations we usually work on a batch of inputs X. Say X is (b, 2000) where b is your batch size. We don't want to transform each element of X individually by doing W*x as above since this would be inefficient.
Instead we would like to transform all inputs at the same time. This can be done via Y = X*W.T where W.T is the transpose of W. You can work out that this essentially applies W*x to each row of X (i.e. each input). Y is then a (b, 4800) matrix containing all transformed inputs.
In Tensorflow, the weight matrix is simply saved in this transposed state, since it is usually the form that is needed anyway. Thus, we have a matrix with shape (2000, 4800) (the shape of W.T).
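A small NumPy sketch of this (shapes only, random placeholder values):
import numpy as np

b = 32                               # batch size
X = np.random.randn(b, 2000)         # a batch of inputs, one 2000-unit row each
W = np.random.randn(4800, 2000)      # the "textbook" weight matrix in y = W*x

Y = X @ W.T                          # (b, 2000) @ (2000, 4800) -> (b, 4800)
print(W.T.shape, Y.shape)            # (2000, 4800) (32, 4800)

# Row i of Y is exactly W @ X[i], i.e. the per-sample y = W*x from above.
assert np.allclose(Y[0], W @ X[0])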

CNN: Softmax layer for pixel-wise classification

I want to understand in more detail how a softmax layer can look in a CNN for semantic segmentation / pixel-wise classification of an image. The CNN outputs an image of class labels, where each pixel of the original image gets a label.
After passing a test image through the network, the next-to-last layer outputs N channels at the resolution of the original image. My question is how the softmax layer transforms these N channels into the final image of labels.
Assume we have C classes (the number of possible labels). My suggestion is that for each pixel, its N neurons of the previous layer are connected to C neurons in the softmax layer, where each of the C neurons represents one class. Using the softmax activation function, the sum of the C outputs (for this pixel) is equal to 1 (which facilitates training of the network). Finally, each pixel is classified as the class with the highest probability (given by the softmax values).
This would mean that the softmax layer consists of C * #pixels neurons. Is my suggestion correct? I didn't find an explanation for this and hope that you can help me.
Thanks for helping!
The answer is: the softmax layer does not transform these N channels into the final image of labels.
Assuming you have an output with N channels, your question is how to convert it into a 3-channel image for the final output.
The answer is: you don't. Each of those N channels represents a class. The way to go is to create a dummy array with the same height and width and 3 channels.
Now you first have to encode each class with a color, like streets as green, cars as red, etc.
Assume that at height = 5 and width = 5, channel 7 has the max value. Now,
-> if channel 7 represents "car", you put a red pixel on the dummy array at height = 5 and width = 5.
-> if channel 7 represents "street", you put a green pixel on the dummy array at height = 5 and width = 5.
So for each pixel you look up which of the N classes it belongs to, and based on that class you redraw the pixel in a unique color on the dummy array.
This dummy array is called the mask.
For example, assume the input is a brain scan image.
We are trying to locate the tumor area of the brain using pixel-wise classification. Here the number of classes is 2: tumor present and tumor not present. So the softmax layer outputs a 2-channel object where channel 1 says "tumor present" and channel 2 says otherwise.
So whenever channel 1 has the higher value at height = X and width = Y, we make the pixel dummy[X][Y] white. Whenever channel 2 has the higher value, we make it black.
After that we get a mask, which doesn't make much sense on its own. But when we overlay the mask on the input image, the tumor region stands out.
So basically you try to create the mask image from your N-channel output, and overlaying it on the input gives you the final output.
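A minimal sketch of this mask construction, assuming the softmax output probs has shape (height, width, N):
import numpy as np

height, width, N = 128, 128, 2
probs = np.random.rand(height, width, N)        # stand-in for the softmax output
probs /= probs.sum(axis=-1, keepdims=True)      # per-pixel probabilities sum to 1

labels = np.argmax(probs, axis=-1)              # (height, width), one class id per pixel

palette = np.array([[0, 0, 0],                  # class 0 ("no tumor") -> black
                    [255, 255, 255]],           # class 1 ("tumor")    -> white
                   dtype=np.uint8)
mask = palette[labels]                          # (height, width, 3) RGB mask
print(mask.shape)                               # (128, 128, 3)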

Output shape of a convolutional layer

I built a convolutional neural network in Keras.
model.add(Convolution1D(nb_filter=111, filter_length=5, border_mode='valid', activation="relu", subsample_length=1))
According to the CS231 lecture, a convolution operation creates a feature map (i.e. activation map) for each filter, and these maps are then stacked together. In my case the convolutional layer has a 300-dimensional input. Hence, I expect the following computation:
Each filter has a window size of 5. Consequently, each filter produces 300-5+1=296 convolutions.
As there are 111 filters, there should be a 111*296 output of the convolutional layer.
However, the actual output shapes look different:
convolutional_layer = model.layers[1]
conv_weights, conv_biases = convolutional_layer.get_weights()
print(conv_weights.shape) # (5, 1, 300, 111)
print(conv_biases.shape) # (111,)
The shape of the bias values makes sense, because there is one bias value for each filter. However, I do not understand the shape of the weights. Apparently, the first dimension depends on the filter size. The third dimension is the number of input neurons, which should have been reduced by the convolution. The last dimension probably refers to the number of filters. This does not make sense, because how would I easily get the feature map for a specific filter?
Keras uses either Theano or TensorFlow as a backend. According to their documentation, the output of a convolution operation is a 4D tensor (batch_size, output_channels, output_rows, output_columns).
Can somebody explain the output shape to me in accordance with the CS231 lecture?
Your weight dimensions have to be [filter_height, filter_width, in_channels, out_channels].
In your example the input channels, i.e. the depth of the input, are 300, and you want the output channels to be 111.
The total number of filters is 111, not 300*111.
As you said yourself, there is one bias per filter, so 111 biases for 111 filters.
Each of the 111 filters produces one convolution (feature map) of the input.
The weight shape in your case means that you are using a kernel patch of shape 5*1.
The third dimension means that the depth of the input feature map is 300.
The fourth dimension means that the depth of the output feature map is 111.
Actually it makes very good sense. You learn the weights of the filters, and each filter in turn produces an output (aka an activation map) for your input data.
The first two axes of conv_weights.shape are the dimensions of the filter that is being learned (as you already mentioned): your filter_length is 5 x 1. Your input has 300 channels and you want 111 filters, so each filter has 5 * 1 * 300 weights, and there are 111 such filters.
The weights of filter #0 that act on input channel #0 are something like your_weights[:, :, 0, 0]; the feature map of filter #0 is the corresponding channel of the layer's output.
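For comparison, a sketch with the newer Keras Conv1D API (the question uses the older Convolution1D spelling, whose kernel carries an extra singleton dimension) makes the layout easier to see:
from keras.models import Model
from keras.layers import Input, Conv1D

inp = Input((None, 300))                 # (timesteps, 300 input channels)
out = Conv1D(111, 5, padding='valid', activation='relu')(inp)
model = Model(inp, out)

kernel, bias = model.layers[1].get_weights()
print(kernel.shape)   # (5, 300, 111): filter length, input channels, number of filters
print(bias.shape)     # (111,): one bias per filter
# The weights of filter k are kernel[:, :, k]; its feature map is the output channel out[..., k].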

Confused about the size of the output of convolution layer in theano deep learning tutorial

[http://deeplearning.net/tutorial/lenet.html#lenet]
In the above link it says:
Construct the first convolutional pooling layer:
filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24)
Convolution of data of size a with a filter of size b gives an output of size a+b-1. So here the data size is 28*28 and the filter size is 5*5, so the output size should be (28+5-1, 28+5-1). Yet it is given as (28-5+1, 28-5+1).
It depends on the border_mode
conv2d uses border_mode='valid' by default which means (from the scipy documentation)
The output consists only of those elements that do not rely on the
zero-padding.
So with border_mode='valid' and a (5,5) filter, the output is going to be the same size as the input minus a two-pixel border on each side, i.e. image_shape - filter_shape + 1; hence with input size (28,28) the output is going to be (24,24).
The alternative, border_mode='full' will zero pad the input such that the output is of shape image_shape + filter_shape - 1.
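A quick check of the two modes with scipy (whose documentation the answer quotes):
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)
kernel = np.random.rand(5, 5)

print(convolve2d(image, kernel, mode='valid').shape)  # (24, 24) = 28 - 5 + 1
print(convolve2d(image, kernel, mode='full').shape)   # (32, 32) = 28 + 5 - 1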

Resources