Output shape of a convolutional layer - keras

I built a convolutional neural network in Keras.
model.add(Convolution1D(nb_filter=111, filter_length=5, border_mode='valid', activation="relu", subsample_length=1))
According to the CS231 lecture, a convolution operation creates a feature map (i.e. activation map) for each filter, and these maps are then stacked together. In my case the convolutional layer has a 300-dimensional input. Hence, I expect the following computation:
Each filter has a window size of 5. Consequently, each filter produces 300-5+1=296 convolutions.
As there are 111 filters there should be a 111*296 output of the convolutional layer.
However, the actual output shapes look different:
convolutional_layer = model.layers[1]
conv_weights, conv_biases = convolutional_layer.get_weights()
print(conv_weights.shape) # (5, 1, 300, 111)
print(conv_biases.shape) # (111,)
The shape of the bias values makes sense, because there is one bias value for each filter. However, I do not understand the shape of the weights. Apparently, the first dimension depends on the filter size. The third dimension is the number of input neurons, which should have been reduced by the convolution. The last dimension probably refers to the number of filters. This does not make sense, because how should I easily get the feature map for a specific filter?
Keras uses either Theano or TensorFlow as a backend. According to their documentation, the output of a convolution operation is a 4D tensor (batch_size, output_channel, output_rows, output_columns).
Can somebody explain the output shape to me in accordance with the CS231 lecture?

Your weight dimensions have to be [filter_height, filter_width, in_channels, out_channels].
In your example the input channel count, i.e. the depth of the input, is 300, and you want the output channel count to be 111.
The total number of filters is 111, not 300*111.
As you said yourself, there is one bias per filter, so 111 biases for 111 filters.
Each of the 111 filters produces one convolution over the input.
The weight shape in your case means that you are using a kernel patch of shape 5*1.
The third dimension means that the depth of the input feature map is 300.
The fourth dimension means that the depth of the output feature map is 111.

Actually it makes very good sense. You learn the weights of the filters. Each filter in turn produces an output (an activation map with respect to your input data).
The first two axes of conv_weights.shape are the dimensions of the filter being learned (as you already mentioned): your filter_length is 5 x 1. Your input has 300 channels and you want 111 filters, so each filter carries a 5 x 1 weight slice per input channel, which gives 300 * 111 slices of 5 * 1 weights in total.
I assume that the weight slice of filter #0 for input channel #0 is something like your_weights[:, :, 0, 0].
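If the goal is to pull out what belongs to one particular filter, a minimal sketch (assuming the model and layer index from the question; some_input stands in for a real input batch) slices the last axis of the weight tensor for the weights and builds a sub-model for the actual feature maps:
from keras.models import Model

conv_layer = model.layers[1]
conv_weights, conv_biases = conv_layer.get_weights()

# Weights of filter #0: slice the last (filter) axis -> shape (5, 1, 300)
filter_0_weights = conv_weights[..., 0]

# Feature maps (activations) for a given input: build a sub-model that
# ends at the convolutional layer and run a prediction through it.
feature_model = Model(inputs=model.input, outputs=conv_layer.output)
feature_maps = feature_model.predict(some_input)   # some_input: hypothetical batch
filter_0_map = feature_maps[..., 0]                # activation map of filter #0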

Related

PyTorch's CrossEntropyLoss - how to deal with the sequence length dimension with transformers?

I'm training a transformer model for text generation.
let's assume:
vocab size = 100
embedding size = 50
max sequence length = 30
batch size = 32
loss = cross entropy loss
the last layer in the model is a fully connected layer,
mapping from shape [30, 32, 50] to [30, 32, 100].
the idea is that for each of the 30 positions in the first dimension, I have a target vector I want to calculate the loss against.
the issue is that, based on the docs, this loss only accepts 2 dims for the prediction and one for the target - so how can I fit my 3D prediction (and 2D target) into it?
Use torch.nn.BCELoss() instead (binary cross-entropy). This expects input and target to be the same size, but they can be any size, and should fall within the range [0,1]. It performs cross-entropy loss element-wise.
EDIT: if you expect only one element from the vocab to be output, then you should use CrossEntropyLoss and instead encode your labels as a 1D vector of class indices rather than a 2D one-hot vector (i.e. decode the one-hot labels). BCE treats each element in the output for a single example as independent from the others, which is not a valid assumption for a multi-class problem. I originally misread and thought the final output was an embedding, rather than an element from the vocabulary, hence my original suggestion.
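For completeness, a minimal sketch of the class-index approach (assuming the shapes from the question - sequence length 30, batch 32, vocab 100 - with random tensors standing in for real model output and labels) flattens the sequence and batch dimensions before calling the loss:
import torch
import torch.nn as nn

seq_len, batch_size, vocab_size = 30, 32, 100
logits = torch.randn(seq_len, batch_size, vocab_size)          # stand-in for model output
targets = torch.randint(0, vocab_size, (seq_len, batch_size))  # class indices, not one-hot

criterion = nn.CrossEntropyLoss()
# CrossEntropyLoss wants (N, C) predictions and (N,) targets, so collapse seq and batch:
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())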

Locally connected layer filters

I have a problem understanding how the filters are used in a locally connected layer.
For example, say the input is a 6x6x3 image and we use one Conv2D (same padding) and one LocallyConnected2D, each with 4 filters of size 3x3.
filters, biases = model.layers[2].get_weights()
When I use layer.get_weights() on the Conv2D, it returns filters with shape (3,3,3,4) and a bias with shape (4,), which is expected as we have 4 filters of shape 3x3x3.
But layer.get_weights() on the LocallyConnected2D returns filters with shape (16,36,4) and a bias with shape (4,4,4).
Why is the filter shape 16x36?
I know that a locally connected layer uses different filters at each input patch. How do we slide across the whole image with only 4 filters?
Reading the documentation, there is this warning for a layer with a 32x32x3 input and 64 filters:
notice that this layer will consume (30*30)*(3*3*3*64) + (30*30)*64 parameters
Or:
Kernel: (patches x patches) * (size * size * input_channels * output_channels)
Bias: (patches x patches) * output_channels
Translating to your case:
this layer will consume (4*4)*(3*3*4*4) kernel params + (4*4)*4 bias params.
Explanation:
Your kernel size is 3x3 (the first two numbers in the second parentheses)
Your input channels are 4 (the third number in the second parentheses) and your output channels are 4 (the fourth number)
For images 6x6 and this kernel size, there are 4x4 patches. (Just as images 32x32 will have 30x30 patches for kernel size 3x3 without padding)
For the biases, 4x4 patches and 4 output channels.
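To verify these numbers directly, a small sketch (assuming the setup from the question: a 6x6x3 input, a 'same'-padded Conv2D with 4 filters feeding a LocallyConnected2D with 4 filters of size 3x3) could look like:
from keras.models import Sequential
from keras.layers import Conv2D, LocallyConnected2D

model = Sequential()
model.add(Conv2D(4, (3, 3), padding='same', input_shape=(6, 6, 3)))  # output: 6x6x4
model.add(LocallyConnected2D(4, (3, 3)))                             # 'valid' -> 4x4 output patches

filters, biases = model.layers[1].get_weights()
print(filters.shape)  # (16, 36, 4): 4*4 patches, 3*3*4 weights per patch, 4 output channels
print(biases.shape)   # (4, 4, 4): one bias per patch and output channel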

Should Kernel size be same as word size in 1D Convolution?

In the CNN literature it is often illustrated that the kernel size is the same as the size of the longest word in the vocabulary list as the kernel sweeps across a sentence.
So if we use embeddings to represent the text, shouldn't the kernel size be the same as the embedding dimension so that it gives the same effect as sweeping word by word?
I see different sizes of kernel used, regardless of the word length.
Well... these are 1D convolutions, for which the kernels are 3-dimensional.
It's true that one of these 3 dimensions must match the embedding size (otherwise having this size would be pointless).
These three dimensions are:
(length_or_size, input_channels, output_channels)
Where:
length_or_size (kernel_size): anything you want. In the picture, there are 6 different filters with sizes 4, 4, 3, 3, 2, 2, represented by the "vertical" dimension.
input_channels (automatically the embedding_size): the size of the embedding - this is somewhat mandatory (in Keras this is automatic and almost invisible), otherwise the multiplications wouldn't use the entire embedding, which is pointless. In the picture, the "horizontal" dimension of the filters is constantly 5 (the same as the word vector size - this is not a spatial dimension).
output_channels (filters): anything you want, but it seems the picture is talking about 1 channel only per filter, since it's totally ignored, and if represented would be something like "depth".
So, you're probably confusing which dimensions are which. When you define a conv layer, you do:
Conv1D(filters = output_channels, kernel_size=length_or_size)
While the input_channels come from the embedding (or the previous layer) automatically.
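As a quick check of which dimension is which, here is a minimal sketch (assuming sentence_length=7, embedding_size=5, kernel_size=4 and 2 filters) that prints the kernel and output shapes of a Conv1D layer:
from keras.layers import Input, Conv1D

inp = Input((7, 5))                      # (sentence_length, embedding_size)
conv = Conv1D(filters=2, kernel_size=4)  # output_channels=2, length_or_size=4
out = conv(inp)

kernel, bias = conv.get_weights()
print(kernel.shape)       # (4, 5, 2): (kernel_size, embedding_size, filters)
print(conv.output_shape)  # (None, 4, 2): 7 - 4 + 1 = 4 positions, 2 output channels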
Creating this model in Keras
To create this model, it would be something like:
from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Dense

sentence_length = 7
embedding_size = 5
inputs = Input((sentence_length,))
out = Embedding(total_words_in_dic, embedding_size)(inputs)
Now, supposing these filters have 1 channel only (since the image doesn't seem to consider their depth...), we can join them in pairs of 2 channels:
size1 = 4
size2 = 3
size3 = 2
output_channels=2
out1 = Conv1D(output_channels, size1, activation=activation_function)(out)
out2 = Conv1D(output_channels, size2, activation=activation_function)(out)
out3 = Conv1D(output_channels, size3, activation=activation_function)(out)
Now, let's collapse the spatial dimensions and remain with the two channels:
out1 = GlobalMaxPooling1D()(out1)
out2 = GlobalMaxPooling1D()(out2)
out3 = GlobalMaxPooling1D()(out3)
And create the 6 channel output:
out = Concatenate()([out1,out2,out3])
Now there is a mystery jump from 6 channels to 2 channels which cannot be explained by the picture. Perhaps they're applying a Dense layer or something...
#????????????????
out = Dense(2, activation='softmax')(out)
model = Model(inputs, out)

Weights Matrix Final Fully Connected Layer

My question is, I think, too simple, but it's giving me headaches. I am either missing something conceptual about neural networks or TensorFlow is returning the wrong layer.
I have a network whose last layer outputs 4800 units. The penultimate layer has 2000 units. I expect the weight matrix of the last layer to have shape (4800, 2000), but when I print the shape in TensorFlow I see (2000, 4800). Can someone please confirm which shape the last layer's weight matrix should have? Depending on the answer, I can debug the issue further. Thanks.
Conceptually, a neural network layer is often written like y = W*x where * is matrix multiplication, x is an input vector and y an output vector. If x has 2000 units and y 4800, then indeed W should have size (4800, 2000), i.e. 4800 rows and 2000 columns.
However, in implementations we usually work on a batch of inputs X. Say X is (b, 2000) where b is your batch size. We don't want to transform each element of X individually by doing W*x as above since this would be inefficient.
Instead we would like to transform all inputs at the same time. This can be done via Y = X*W.T where W.T is the transpose of W. You can work out that this essentially applies W*x to each row of X (i.e. each input). Y is then a (b, 4800) matrix containing all transformed inputs.
In TensorFlow, the weight matrix is simply saved in this transposed form, since that is usually the form that is needed anyway. Thus, we have a matrix with shape (2000, 4800) (the shape of W.T).
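A quick way to see this is a minimal sketch with tf.keras (assuming a Dense layer with 4800 units built on a 2000-unit input):
import tensorflow as tf

layer = tf.keras.layers.Dense(4800)
layer.build(input_shape=(None, 2000))   # penultimate layer has 2000 units

kernel, bias = layer.get_weights()
print(kernel.shape)  # (2000, 4800): the kernel is stored as W.T, so Y = X @ kernel
print(bias.shape)    # (4800,)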

Confused about the size of the output of convolution layer in theano deep learning tutorial

http://deeplearning.net/tutorial/lenet.html#lenet
in the above link it says
Construct the first convolutional pooling layer:
filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24)
Convolution of data of size a with a filter of size b gives an output of size a+b-1. Here the data size is 28*28 and the filter size is 5*5, so the output size should be (28+5-1, 28+5-1), yet it is given as (28-5+1, 28-5+1).
It depends on the border_mode.
conv2d uses border_mode='valid' by default, which means (from the scipy documentation):
The output consists only of those elements that do not rely on the
zero-padding.
So with border_mode='valid' and a (5,5) filter the output is going to be the same size as the input minus a two pixel border, i.e. image_shape - filter_shape + 1, hence with input size (28,28) the output is going to be (24,24).
The alternative, border_mode='full' will zero pad the input such that the output is of shape image_shape + filter_shape - 1.
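To see both modes in action, a small sketch with SciPy's 2D convolution (assuming a random 28x28 input and a 5x5 filter) gives:
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)
kernel = np.random.rand(5, 5)

print(convolve2d(image, kernel, mode='valid').shape)  # (24, 24): image_shape - filter_shape + 1
print(convolve2d(image, kernel, mode='full').shape)   # (32, 32): image_shape + filter_shape - 1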
