Keras Conv2D parameter order - keras

If I have a layer with 32 convolution 5x5 RGB kernels in it, I would expect the kernel shape to be (32, 5, 5, 3), i.e. (count, h, w, rgb), but instead it is
(5, 5, 3, 32). This messes up iteration, since
for kern in kernels:
does not work correctly: I get a series of (5, 3, 32) ndarrays rather than each of the 5x5 RGB kernels.
Am I just doing this wrong?

The kernel is stored in the shape (h, w, channels, filters), which matches how the Conv2D layer builds its weights:
kernel_shape = self.kernel_size + (input_dim, self.filters)
self.kernel = self.add_weight(shape=kernel_shape, ...)
...
If you need to iterate over each filter, move the last axis to the front with np.moveaxis:
kernel = np.moveaxis(kernel, -1, 0)
to get the desired (kernels, h, w, channels).
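A minimal sketch of the axis move, using a random array as a stand-in for the real kernel weights (the name `kernels` and the random data are assumptions for illustration; in practice the array would come from the layer's weights):

```python
import numpy as np

# Stand-in for a Conv2D kernel with 32 filters of size 5x5 over 3 channels.
# Only the shape matters here; the values are random.
kernels = np.random.rand(5, 5, 3, 32)

# Move the filter axis from last to first:
# (h, w, channels, filters) -> (filters, h, w, channels)
per_filter = np.moveaxis(kernels, -1, 0)

# Iterating now yields one (5, 5, 3) kernel per step.
for kern in per_filter:
    assert kern.shape == (5, 5, 3)

print(per_filter.shape)  # (32, 5, 5, 3)
```

np.moveaxis returns a view, so no weights are copied; per_filter[i] is the same data as kernels[..., i].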

Related

Understanding matrix dimensions in printing CNN heatmaps (for Class Activation Map (CAM))

This question is related to computing the Class Activation Map (CAM) visualization.
Source code: see In [24] and onwards; a code snapshot is pasted below.
Model.summary()
The last convolutional layer is block5_conv3 with dimensions (14, 14, 512) and it predicts 1000 classes.
My questions are related to lines of code in this screenshot.
I included the lines of code separately also
In this line of code:
african_elephant_output = model.output[:, 386]
This model was trained on 1000 classes (the last line in the output of model.summary()). To understand the gradient calculation at a later step, I first want to understand how to print the length of the vector african_elephant_output and also the actual values in this feature vector.
last_conv_layer = model.get_layer('block5_conv3')
grads = K.gradients(african_elephant_output, last_conv_layer.output)[0]
The dimensions of last_conv_layer are (14, 14, 512). But to understand how the dot product is calculated by K.gradients, I need to know the dimensions of african_elephant_output. Should the dimensions of african_elephant_output be (1, 1, 512), so that we can first broadcast african_elephant_output and then calculate the dot product of corresponding channels? How can I print the dimensions of african_elephant_output and the dimensions and values of grads?
What does axis=(0, 1, 2) refer to in this line of code:
pooled_grads = K.mean(grads, axis=(0, 1, 2))
I am assuming the grads tensor above is of shape (14, 14, 512), so the axis values 0, 1, 2 refer to width (0), height (1) and the channel dimension (2). So the mean is calculated along the width and height and we get a vector of shape (512,)?
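For what it's worth, here is a small NumPy sketch of the axis question. It assumes that grads carries a leading batch axis, as Keras layer outputs normally do, so for a single image its shape would be (1, 14, 14, 512) rather than (14, 14, 512). Under that assumption, axes 0, 1, 2 are batch, height and width, and the channel axis is the one that survives the mean. The array is random stand-in data, not real gradients.

```python
import numpy as np

# Stand-in for the gradient tensor: (batch, height, width, channels).
grads = np.random.rand(1, 14, 14, 512)

# Average over batch, height and width; one value per channel remains.
pooled_grads = grads.mean(axis=(0, 1, 2))

print(grads.shape)         # (1, 14, 14, 512)
print(pooled_grads.shape)  # (512,)
```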

Understanding input shape to PyTorch conv1D?

This seems to be one of the common questions on here (1, 2, 3), but I am still struggling to define the right shape for input to PyTorch conv1D.
I have text sequences of length 512 (number of tokens per sequence) with each token being represented by a vector of length 768 (embedding). The batch size I am using is 6.
So my input tensor to conv1D is of shape [6, 512, 768].
input = torch.randn(6, 512, 768)
Now, I want to convolve over the length of my sequence (512) with a kernel size of 2 using the conv1D layer from PyTorch.
Understanding 1:
I assumed that "in_channels" are the embedding dimension of the conv1D layer. If so, then a conv1D layer will be defined in this way where
in_channels = embedding dimension (768)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(768, 100, 2)
feature_map = convolution_layer(input)
But with this assumption, I get the following error:
RuntimeError: Given groups=1, weight of size 100 768 2, expected input `[4, 512, 768]` to have 768 channels, but got 512 channels instead
Understanding 2:
Then I assumed that "in_channels" is the sequence length of the input sequence. If so, then a conv1D layer will be defined in this way where
in_channels = sequence length (512)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(512, 100, 2)
feature_map = convolution_layer(input)
This works fine and I get an output feature map of dimension [batch_size, 100, 767]. However, I am confused. Shouldn't the convolutional layer convolve over the sequence length of 512 and output a feature map of dimension [batch_size, 100, 511]?
I will be really grateful for your help.
In PyTorch your input of shape [6, 512, 768] should actually be [6, 768, 512], where the feature length is represented by the channel dimension and the sequence length is the length dimension. Then you can define your Conv1d with in/out channels of 768 and 100 respectively to get an output of [6, 100, 511].
Given an input of shape [6, 512, 768], you can convert it to the correct shape with Tensor.transpose.
input = input.transpose(1, 2).contiguous()
transpose returns a non-contiguous view of the same memory; .contiguous() copies it into a contiguous layout, which some later operations (for example .view) require.
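A short sketch of the whole path, with random data standing in for the real embeddings (the variable names are only illustrative):

```python
import torch
import torch.nn as nn

# Random stand-in for the embeddings: [batch, seq_len, embedding_dim]
x = torch.randn(6, 512, 768)

# Conv1d expects [batch, channels, length], so swap the last two axes.
x = x.transpose(1, 2).contiguous()   # [6, 768, 512]

# in_channels = embedding_dim, kernel of size 2 slides along the sequence.
conv = nn.Conv1d(in_channels=768, out_channels=100, kernel_size=2)
feature_map = conv(x)

print(feature_map.shape)  # torch.Size([6, 100, 511])
```

The output length is 512 - 2 + 1 = 511 because the kernel covers two neighbouring tokens at each step, with stride 1 and no padding.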
I found an answer to it (source).
So, usually, BERT outputs vectors of shape
[batch_size, sequence_length, embedding_dim].
where,
sequence_length = number of words or tokens in a sequence (the maximum sequence length BERT can handle is 512)
embedding_dim = the vector length of the vector describing each token (768 in case of BERT).
thus, input = torch.randn(batch_size, 512, 768)
Now, we want to convolve over the text sequence of length 512 using a kernel size of 2.
So, we define a PyTorch conv1D layer as follows,
convolution_layer = nn.Conv1d(in_channels, out_channels, kernel_size)
where,
in_channels = embedding_dim
out_channels = arbitrary int
kernel_size = 2 (I want bigrams)
thus, convolution_layer = nn.Conv1d(768, 100, 2)
Now we need a connecting link between the input expected by convolution_layer and the actual input.
For this, we need to go from
current input shape [batch_size, 512, 768]
to
expected input shape [batch_size, 768, 512]
To achieve this expected input shape, we use the transpose function from PyTorch.
input_transposed = input.transpose(1, 2)
I have a suggestion that may not be what you asked for but might help. Because your input is (6, 512, 768), you can use conv2d instead of conv1d.
All you need to do is add a dimension of size 1 at index 1 with input.unsqueeze(1); it acts as your channel (think of it as a grayscale image):
def forward(self, x):
    x = self.embedding(x)      # [batch, seq length, embedding] = [5, 512, 768]
    x = torch.unsqueeze(x, 1)  # [5, 1, 512, 768], like a grayscale image
For your conv2d layer, you can define it like this:
window_size = 3  # for trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10  # or whatever you want
self.conv = nn.Conv2d(in_channels=1,
                      out_channels=NUM_FILTERS,
                      kernel_size=[window_size, EMBEDDING_SIZE],
                      padding=(window_size - 1, 0))
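To see what shapes this produces, here is a standalone sketch with random data in place of the embedding layer's output (the batch size of 5 follows the comments above; the final squeeze is an assumption about how you would use the result):

```python
import torch
import torch.nn as nn

window_size = 3        # trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10

conv = nn.Conv2d(in_channels=1,
                 out_channels=NUM_FILTERS,
                 kernel_size=(window_size, EMBEDDING_SIZE),
                 padding=(window_size - 1, 0))

x = torch.randn(5, 512, EMBEDDING_SIZE)  # stand-in for embedding output
x = x.unsqueeze(1)                       # [5, 1, 512, 768]
out = conv(x)                            # [5, 10, 514, 1]
features = out.squeeze(3)                # [5, 10, 514]

print(out.shape, features.shape)
```

The kernel spans the full embedding width, so the last dimension collapses to 1. The padding of window_size - 1 on the sequence axis gives 512 + 2*2 - 3 + 1 = 514 positions.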

What's the usage for convolutional layer that output is the same as the input applied with MaxPool

What's the idea behind using the following convolutional layers,
especially nn.Conv2d(16, 16, 3, padding = 1)?
self.conv1 = nn.Conv2d(3, 16, 3, padding = 1 )
self.conv2 = nn.Conv2d(16, 16, 3, padding = 1)
self.conv3 = nn.Conv2d(16, 32, 3, padding = 1)
self.pool = nn.MaxPool2d(2, 2)
x = F.relu(self.conv1(x))
x = self.pool(F.relu(self.conv2(x)))
x = F.relu(self.conv3(x))
I thought Conv2d always uses a bigger output size than input size, like
going from (16, 32) to (32, 64) for example.
Is nn.Conv2d(16, 16, 3, padding = 1) merely for reducing the size?
The model architecture depends on what ultimately works best for your application, and it is always going to vary.
You are right that you usually want to make your tensors deeper (in the channel dimension) in order to extract richer features, but there is no hard and fast rule about that. That said, sometimes you don't want to make your tensors too big: the more channels, the more trainable parameters, which makes the model harder to train. This brings me back to the very first line: it all depends.
And as for the line:
nn.Conv2d(16, 16, 3, padding = 1) # stride = 1 by default.
This will keep the size of the tensor the same as the input in all 3 dimensions (height, width, and number of channels). So it does not reduce the size; the nn.MaxPool2d(2, 2) applied after it is what halves the height and width.
I will also add the formula to calculate the size of the output tensor in a convolution for reference.
output_size = floor( (input_size - filter_size + 2*padding) / stride ) + 1
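The formula can be checked with a few lines of plain Python. The helper name conv_output_size and the 32x32 input size are assumptions for illustration, since the question does not state the image size:

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling window."""
    return (input_size - filter_size + 2 * padding) // stride + 1

size = 32  # assumed input height/width, for illustration only

# nn.Conv2d(16, 16, 3, padding=1): 3x3 kernel, padding 1, stride 1
after_conv = conv_output_size(size, filter_size=3, padding=1, stride=1)

# nn.MaxPool2d(2, 2): 2x2 window, stride 2, no padding
after_pool = conv_output_size(after_conv, filter_size=2, padding=0, stride=2)

print(after_conv, after_pool)  # 32 16
```

With a 3x3 kernel, padding of 1 exactly offsets the two border pixels the kernel would otherwise lose, so only the pooling step changes the spatial size.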

Default dilation value in PyTorch

As given in the documentation of PyTorch, the layer Conv2d uses a default dilation of 1. Does this mean that if I want to create a simple conv2d layer I would have to write
nn.Conv2d(in_channels = 3, out_channels = 64, kernel_size = 3, dilation = 0)
instead of simply writing
nn.Conv2d(in_channels = 3, out_channels = 64, kernel_size = 3)
Or is it the case that in PyTorch dilation = 1 means same as dilation = 0 as given here in the Dilated Convolution section?
From the formulas for H_out and W_out in the PyTorch documentation, we can see that dilation=n roughly means each 1x1 element of the kernel is expanded to an nxn cell, where the original kernel element sits at the top left and the remaining positions are empty (or filled with 0).
Thus dilation=1 is equivalent to the standard convolution with no dilation.
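As a quick check, the output-size formula from the Conv2d documentation can be written in plain Python. The helper names and the 32-pixel input size below are assumptions for illustration:

```python
def effective_kernel_size(kernel_size, dilation):
    """Span of input pixels covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1

def conv2d_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """H_out / W_out as given in the PyTorch Conv2d documentation."""
    return (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# dilation=1: a 3x3 kernel covers 3 pixels, same as an ordinary convolution.
print(effective_kernel_size(3, dilation=1))    # 3
print(conv2d_output_size(32, 3, dilation=1))   # 30, i.e. 32 - 3 + 1

# dilation=2: the same 3x3 kernel is spread over 5 pixels.
print(effective_kernel_size(3, dilation=2))    # 5
print(conv2d_output_size(32, 3, dilation=2))   # 28
```

With dilation=1 the term dilation * (kernel_size - 1) reduces to kernel_size - 1, which is exactly the undilated formula.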

MNIST Tensorflow example

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')
This is the code from the Deep MNIST for experts tutorial on Tensorflow website.
I have three questions:
1) The documentation says ksize is a list of ints of length >= 4 that gives the size of the max-pool window. Shouldn't that be just [2, 2], considering that it's a 2x2 window? I mean, why is it [1, 2, 2, 1] instead of [2, 2]?
2) If we are taking a stride of size one, why do we need a vector of 4 values? Wouldn't one value suffice?
strides = [1]
3) If padding = 'SAME', why does the image size decrease by half (from 28x28 to 14x14 in the first convolutional process)?
I'm not sure which documentation you're referring to in this question. The max-pool window is indeed 2x2; ksize has four entries because it gives the window size along each dimension of the input tensor (batch, height, width, channels), so [1, 2, 2, 1] is a 2x2 window over height and width only.
The step size can differ per dimension. The 4-vector is the most general case, where you might want to skip images in the batch, use different strides for height and width, and potentially even skip channels. This is hardly ever used but has been left in.
If you have a stride of 2 along each spatial direction, you skip every other position that you could potentially use for max pooling, which is why max_pool_2x2 takes 28x28 down to 14x14. If you set the strides to [1, 1, 1, 1] with padding 'SAME', you would indeed get a result of the same size. The padding 'SAME' zero-pads the image just enough that the output size is ceil(input_size / stride); it only preserves the input size when the stride is 1.
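The size rules for the two padding modes can be sketched in plain Python. The helper names are made up for illustration; the formulas are the ones TensorFlow documents for 'SAME' and 'VALID':

```python
import math

def same_output_size(input_size, stride):
    """Output size with padding='SAME': depends only on the stride."""
    return math.ceil(input_size / stride)

def valid_output_size(input_size, window_size, stride):
    """Output size with padding='VALID': no padding is added."""
    return math.ceil((input_size - window_size + 1) / stride)

# conv2d above: strides [1, 1, 1, 1], 'SAME' -> size is preserved
print(same_output_size(28, stride=1))  # 28

# max_pool_2x2 above: strides [1, 2, 2, 1], 'SAME' -> size is halved
print(same_output_size(28, stride=2))  # 14
print(same_output_size(14, stride=2))  # 7
```

So the halving in the tutorial comes from the pooling stride, not from the padding mode.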

Resources