I am working on inference for a PyTorch model exported to ONNX, which is why this question is being asked.
Assume I have an image with dimensions 32 x 32 x 3 (CIFAR-10 dataset). I pass it through a Conv2d layer with weight dimensions 3 x 192 x 5 x 5. The command I used is: Conv2d(3, 192, kernel_size=5, stride=1, padding=2)
Using the formula (stated on p. 12 of https://arxiv.org/pdf/1603.07285.pdf for reference), I should be getting an output image with dimensions 28 x 28 x 192 (input - kernel + 1 = 32 - 5 + 1 = 28).
Question is: how has PyTorch used this 4-D weight tensor (3 x 192 x 5 x 5) to give me an output of 28 x 28 x 192? The layer's weight is a 4-D tensor, while the input image is 3-D (32 x 32 x 3).
How is the kernel (5 x 5) spread over the image matrix 32 x 32 x 3? What does the kernel convolve with first: 3 x 192 or 32 x 32?
Note: I have understood the 2-D aspect of things. I am asking the above questions for the 3-D (and higher) case.
The input to Conv2d is a tensor of shape (N, C_in, H_in, W_in) and the output is of shape (N, C_out, H_out, W_out), where N is the batch size (number of images), C is the number of channels, H is the height and W is the width. The output height and width H_out, W_out are computed as follows (ignoring the dilation):
H_out = (H_in + 2*padding[0] - kernel_size[0]) / stride[0] + 1
W_out = (W_in + 2*padding[1] - kernel_size[1]) / stride[1] + 1
See cs231n for an explanation of how these formulas were obtained.
In your example N = 1, H_in = 32, W_in = 32, C_in = 3, kernel_size = (5, 5), strides = (1, 1), padding = (0, 0), giving H_out = 28, W_out = 28. (Note that with padding = 2, as in the Conv2d command you actually wrote, the formula gives H_out = W_out = 32; the 28 x 28 result assumes no padding.)
The C_out=192 means that there are 192 different filters, each of shape (C_in, kernel_size[0], kernel_size[1]) = (3, 5, 5). Each filter independently performs convolution with the input image resulting in a 2D tensor of shape (H_out, W_out) = (28, 28), and since there are C_out = 192 filters and N = 1 images, the final output is of shape (N, C_out, H_out, W_out) = (1, 192, 28, 28).
To understand exactly how the convolution is performed, see the convolution demo.
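For instance, a minimal sketch of the shapes involved (using a random tensor in place of a CIFAR-10 image, and padding=0 so the output is 28 x 28):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=0)
x = torch.randn(1, 3, 32, 32)    # (N, C_in, H_in, W_in)
print(conv.weight.shape)         # torch.Size([192, 3, 5, 5]) -> (C_out, C_in, kH, kW)
print(conv(x).shape)             # torch.Size([1, 192, 28, 28]) -> (N, C_out, H_out, W_out)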
Related
Let's say I feed 3 grayscale images to a CNN, with a combined shape of (3, 28, 28). This process will generate multiple feature maps for each image. How do I identify which feature map corresponds to a particular image?
Here is some code -
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(256, 120)   # 16 * 4 * 4 = 256 after the second pool
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        print("Shape of x = ", x.shape)
        x = self.pool(F.relu(self.conv2(x)))
        print("Shape of x = ", x.shape)
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
foo = torch.randn(1, 3, 28, 28)  # 3 grayscale images stacked along the channel dimension
foo_cnn = net(foo)
For instance, the first convolution generated 6 feature maps from the 3 images. Is there a way for me to identify which feature map belonged to which image, so that I can perform some operation on it?
To distinguish which image generated which convolved feature maps, you must split the different input images along the batch dimension (#images = #batches), so that any convolutional layer is applied to each image separately, rather than to a weighted sum of the different input images, which is what happens when they are stacked along the channel/depth dimension.
Right now you're not feeding 3 images into the model (in PyTorch's eyes); that would require the input to be of shape (3, 1, 28, 28) for grayscale images or (3, 3, 28, 28) for RGB images. What you're doing instead is (in a sense) concatenating the 3 images along the depth dimension, resulting in the shape (1, 3, 28, 28); thus the 6 output feature maps cannot be attributed to a specific image (each is a weighted combination of the 3, since they sit in the depth dimension).
Therefore, reshaping the input to (3, 1, 28, 28) and changing conv1 to nn.Conv2d(1, 6, 5) will result in an output of shape (3, 6, 12, 12). Hence the first 6 feature maps in the 1st batch entry (of the output) correspond to the first image in the batch (of the input), the next 6 feature maps correspond to the 2nd image, and so on.
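A minimal sketch of that fix, with random tensors standing in for the 3 grayscale images:

import torch
import torch.nn as nn

imgs = torch.randn(3, 1, 28, 28)   # 3 grayscale images along the batch dimension
conv1 = nn.Conv2d(1, 6, 5)         # in_channels = 1 to match a single grayscale channel
pool = nn.MaxPool2d(2, 2)
out = pool(torch.relu(conv1(imgs)))
print(out.shape)                   # torch.Size([3, 6, 12, 12])
# out[0] holds the 6 feature maps of the 1st image, out[1] those of the 2nd, and so on.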
I am using TF2.5 & Python3.8 where a conv layer is defined as:
conv1 = Conv2D(
    filters = 64, kernel_size = (3, 3),
    activation = 'relu', kernel_initializer = tf.initializers.GlorotNormal(),
    strides = (1, 1), padding = 'same',
)
Using a batch of 60 CIFAR-10 images as input:
x.shape
# TensorShape([60, 32, 32, 3])
The output volume of this layer preserves the spatial width and height (32, 32) and applies 64 filters/kernel maps to each of the 60 images in the batch:
conv1(x).shape
# TensorShape([60, 32, 32, 64])
I understand this output.
Can you explain the output of:
conv1.trainable_weights[0].shape
# TensorShape([3, 3, 3, 64])
This is the formula used to compute the number of trainable parameters in a conv layer: [{(m x n x d) + 1} x k]
where,
m -> width of filter; n -> height of filter; d -> number of channels in the input volume; k -> number of filters applied in the current layer.
The 1 is added as the bias for each filter. trainable_weights[0] holds only the kernel, whose shape (3, 3, 3, 64) corresponds to (filter height, filter width, input channels d, number of filters k); the bias is stored separately as trainable_weights[1] (shape [64]), or dropped entirely if use_bias=False, which is why it doesn't appear in the kernel's shape.
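As a quick check, here is a sketch (assuming the layer above is built by calling it on a random batch):

import tensorflow as tf

conv1 = tf.keras.layers.Conv2D(
    filters = 64, kernel_size = (3, 3),
    activation = 'relu', kernel_initializer = tf.initializers.GlorotNormal(),
    strides = (1, 1), padding = 'same',
)
x = tf.random.normal((60, 32, 32, 3))
_ = conv1(x)                                 # building the layer creates its weights
print(conv1.trainable_weights[0].shape)      # (3, 3, 3, 64) -> (filter_h, filter_w, input channels d, filters k)
print(conv1.trainable_weights[1].shape)      # (64,) -> one bias per filter (use_bias=True by default)
print(conv1.count_params())                  # (3*3*3 + 1) * 64 = 1792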
This seems to be one of the common questions on here (1, 2, 3), but I am still struggling to define the right shape for input to PyTorch conv1D.
I have text sequences of length 512 (number of tokens per sequence) with each token being represented by a vector of length 768 (embedding). The batch size I am using is 6.
So my input tensor to conv1D is of shape [6, 512, 768].
input = torch.randn(6, 512, 768)
Now, I want to convolve over the length of my sequence (512) with a kernel size of 2 using the conv1D layer from PyTorch.
Understanding 1:
I assumed that "in_channels" is the embedding dimension of the conv1D layer. If so, then a conv1D layer would be defined in this way, where
in_channels = embedding dimension (768)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(768, 100, 2)
feature_map = convolution_layer(input)
But with this assumption, I get the following error:
RuntimeError: Given groups=1, weight of size 100 768 2, expected input `[4, 512, 768]` to have 768 channels, but got 512 channels instead
Understanding 2:
Then I assumed that "in_channels" is the sequence length of the input sequence. If so, then a conv1D layer would be defined in this way, where
in_channels = sequence length (512)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(512, 100, 2)
feature_map = convolution_layer(input)
This works fine and I get an output feature map of dimension [batch_size, 100, 767]. However, I am confused. Shouldn't the convolutional layer convolve over the sequence length of 512 and output a feature map of dimension [batch_size, 100, 511]?
I will be really grateful for your help.
In PyTorch, your input of shape [6, 512, 768] should actually be [6, 768, 512], where the feature (embedding) length is represented by the channel dimension and the sequence length by the length dimension. Then you can define your Conv1d with in/out channels of 768 and 100 respectively to get an output of [6, 100, 511].
Given an input of shape [6, 512, 768] you can convert it to the correct shape with Tensor.transpose.
input = input.transpose(1, 2).contiguous()
The .contiguous() ensures the tensor's memory is stored contiguously, which some downstream operations (for example .view) require and which helps avoid potential issues later in processing.
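A small sketch of those shapes, assuming the dimensions from the question:

import torch
import torch.nn as nn

x = torch.randn(6, 512, 768)            # (batch, seq_len, embedding_dim)
x = x.transpose(1, 2).contiguous()      # -> (6, 768, 512): channels = embedding dim
conv = nn.Conv1d(in_channels=768, out_channels=100, kernel_size=2)
print(conv(x).shape)                    # torch.Size([6, 100, 511]), since 512 - 2 + 1 = 511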
I found an answer to it (source).
So, usually, BERT outputs vectors of shape
[batch_size, sequence_length, embedding_dim].
where,
sequence_length = number of words or tokens in a sequence (the maximum sequence length BERT can handle is 512)
embedding_dim = the vector length of the vector describing each token (768 in case of BERT).
thus, input = torch.randn(batch_size, 512, 768)
Now, we want to convolve over the text sequence of length 512 using a kernel size of 2.
So, we define a PyTorch conv1D layer as follows,
convolution_layer = nn.Conv1d(in_channels, out_channels, kernel_size)
where,
in_channels = embedding_dim
out_channels = arbitrary int
kernel_size = 2 (I want bigrams)
thus, convolution_layer = nn.Conv1d(768, 100, 2)
Now we need a connecting link between the expected input by convolution_layer and the actual input.
For this, we need to reshape the input:
current input shape: [batch_size, 512, 768]
expected input shape: [batch_size, 768, 512]
To achieve this expected input shape, we need to use the transpose function from PyTorch.
input_transposed = input.transpose(1, 2)
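Putting those pieces together, a brief verification sketch (variable names follow the answer above):

import torch
import torch.nn as nn

batch_size = 6
input = torch.randn(batch_size, 512, 768)        # [batch_size, sequence_length, embedding_dim]
convolution_layer = nn.Conv1d(768, 100, 2)
input_transposed = input.transpose(1, 2)         # [batch_size, 768, 512]
feature_map = convolution_layer(input_transposed)
print(feature_map.shape)                         # torch.Size([6, 100, 511]): one output per bigram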
I have a suggestion for you which may not be what you asked for but might help: because your input is (6, 512, 768), you can use Conv2d instead of Conv1d.
All you need to do is add a dimension of size 1 at index 1, input.unsqueeze(1), which acts as your channel (think of it as a grayscale image):
def forward(self, x):
    x = self.embedding(x)      # [Batch, seq length, Embedding] = [5, 512, 768]
    x = torch.unsqueeze(x, 1)  # [5, 1, 512, 768], like a grayscale image
And for your Conv2d layer, you can define it like this:
window_size = 3      # for trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10     # or whatever you want
self.conv = nn.Conv2d(in_channels = 1,
                      out_channels = NUM_FILTERS,
                      kernel_size = [window_size, EMBEDDING_SIZE],
                      padding = (window_size - 1, 0))
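For reference, a hedged shape check of this Conv2d-over-text approach (window_size = 3, padding = (2, 0), and a random tensor in place of the embedded batch):

import torch
import torch.nn as nn

window_size = 3
EMBEDDING_SIZE = 768
NUM_FILTERS = 10
conv = nn.Conv2d(1, NUM_FILTERS,
                 kernel_size = [window_size, EMBEDDING_SIZE],
                 padding = (window_size - 1, 0))
x = torch.randn(5, 1, 512, 768)     # [batch, 1, seq length, embedding], like a grayscale image
out = conv(x)
print(out.shape)                    # torch.Size([5, 10, 514, 1]): (512 + 2*2 - 3) + 1 = 514
out = out.squeeze(3)                # -> [5, 10, 514], comparable to a Conv1d feature map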
What's the idea behind using the following convolutional layers,
especially nn.Conv2d(16, 16, 3, padding = 1)?
self.conv1 = nn.Conv2d(3, 16, 3, padding = 1 )
self.conv2 = nn.Conv2d(16, 16, 3, padding = 1)
self.conv3 = nn.Conv2d(16, 32, 3, padding = 1)
self.pool = nn.MaxPool2d(2, 2)
x = F.relu(self.conv1(x))
x = self.pool(F.relu(self.conv2(x)))
x = F.relu(self.conv3(x))
I thought Conv2d always increases the number of channels,
for example from (16, 32) to (32, 64).
Is nn.Conv2d(16, 16, 3, padding = 1) merely for reducing the size?
The model architecture ultimately depends on what works best for your application, and it's always going to vary.
You are right in saying that usually you want to make your tensors deeper (in the channel dimension) in order to extract richer features, but there is no hard and fast rule about that. Having said that, sometimes you don't want to make your tensors too big, since more channels means more trainable parameters, which can make the model harder to train. This brings me back to the very first line: "it all depends".
And as for the line:
nn.Conv2d(16, 16, 3, padding = 1) # stride = 1 by default.
This will keep the size of the tensor the same as the input in all 3 dimensions (height, width, and number of channels).
I will also add, for reference, the formula to calculate the size of the output tensor of a convolution.
output_size = ( (input_size - filter_size + 2*padding) / stride ) + 1
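For example, plugging nn.Conv2d(16, 16, 3, padding = 1) into that formula gives (32 - 3 + 2*1) / 1 + 1 = 32, so a sketch like this (32 x 32 is just an example input size) leaves every dimension unchanged:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)          # (batch, channels, height, width)
conv = nn.Conv2d(16, 16, 3, padding=1)  # stride = 1 by default
print(conv(x).shape)                    # torch.Size([1, 16, 32, 32]) -> same shape as the input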
If I have a layer with 32 convolutional 5x5 RGB kernels in it, I would expect the shape to be (32, 5, 5, 3), i.e. (count, h, w, rgb), but instead it is (5, 5, 3, 32). This messes up iteration, since
for kern in kernels:
does not work correctly: I get a series of (5, 3, 32) ndarrays rather than each of the 5x5 RGB kernels.
Am I just doing this wrong?
Strange that the kernel is stored in the shape (h, w, channels, filters), as the implementation suggests otherwise:
kernel_shape = self.kernel_size + (self.filters, input_dim)
self.kernel = self.add_weight(shape=kernel_shape, ...)
...
However, if this is what you are seeing, and if you need to iterate over each filter, why not just move the axis with np.moveaxis:
kernel = np.moveaxis(kernel, -1,0)
to get the desired (kernels, h, w, channels).
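For example, a sketch with a freshly built Keras Conv2D layer (32 filters of size 5x5, RGB input assumed; the 64x64 spatial size is arbitrary):

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Conv2D(32, (5, 5))
layer.build((None, 64, 64, 3))          # 3 input channels (RGB)
kernel = layer.get_weights()[0]
print(kernel.shape)                     # (5, 5, 3, 32) -> (h, w, channels, filters)
kernel = np.moveaxis(kernel, -1, 0)     # -> (32, 5, 5, 3)
for kern in kernel:
    print(kern.shape)                   # (5, 5, 3): one 5x5 RGB kernel per iteration
    break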