Understanding PyTorch Conv2d internally [duplicate] - pytorch

I'm trying to understand what nn.Conv2d does internally.
Let's assume we are applying Conv2d to a 32x32 RGB image:
torch.nn.Conv2d(3, 49, 4, bias=True)
So:
When we initialize the conv layer, how many weights does it have and in which shapes? Please describe the biases separately.
Before applying the conv the image has shape 3 x 32 x 32, and after applying it the shape is 49 x 29 x 29, so what happens in between?
I define a "slide" operation (I don't know the real name) as multiplying the first element of the kernel by the first element of a kernel-sized patch of the image, and so on up to the last element of the kernel, so that one of the 29 x 29 output values is calculated.
And "slide all" as doing this horizontally and vertically until all 29 x 29 values are calculated.
So I understand how a single kernel would act, but I don't understand how many kernels would be created by torch.nn.Conv2d(3, 49, 4, bias=True), and which of them would be applied to the R, G, and B channels.

I understand how a kernel would act but I don't understand how many kernels would be created by nn.Conv2d(3, 49, 4, bias=True) and which of them would be applied to the R, G, and B channels.
Calling nn.Conv2d(3, 49, 4, bias=True) will initialize 49 kernels of size 4x4, each spanning three input channels and each having a single bias parameter. That's a total of 49*(4*4*3 + 1) = 2,401 parameters.
You can check that it is indeed correct with:
>>> conv2d = nn.Conv2d(3, 49, 4, bias=True)
The parameter list will contain the weight tensor, shaped (n_filters=49, n_channels=3, kernel_height=4, kernel_width=4), and a bias tensor, shaped (49,):
>>> [p.shape for p in conv2d.parameters()]
[torch.Size([49, 3, 4, 4]), torch.Size([49])]
If we look at the total number of parameters, we indeed find:
>>> nn.utils.parameters_to_vector(conv2d.parameters()).numel()
2401
Concerning how they are applied: each of the 49 kernels is applied independently to the input map. For each filter operation, you are convolving the three-channel input tensor with a three-channel kernel and summing over the channels. Each of those 49 convolutions gets its respective bias added. At the end, you are left with 49 single-channel maps, which are stacked to make up a single 49-channel map. In practice, everything is done in one go using a windowed view of the input.
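As an illustrative check (a minimal sketch added here, not part of the original answer), you can verify the resulting shape on a random 32x32 RGB input:
import torch
import torch.nn as nn

conv2d = nn.Conv2d(3, 49, 4, bias=True)

# a single 32x32 RGB image, with a batch dimension of 1
x = torch.rand(1, 3, 32, 32)

out = conv2d(x)
print(out.shape)  # torch.Size([1, 49, 29, 29]), since 32 - 4 + 1 = 29
Each of the 49 output channels comes from one 3x4x4 kernel plus its bias.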
I am certainly biased towards my own posts: here you will find another explanation of shapes in convolutional neural networks.

Related

Implement "same" padding for convolution operations with dilation > 1, in Pytorch

I am using PyTorch 1.8.1 and, although I know the newer version has a padding="same" option, for some reason I do not want to upgrade. To implement "same" padding for a CNN with stride 1 and dilation > 1, I set the padding as follows:
padding=(dilation*(cnn_kernel_size[0]-1)//2, dilation*(cnn_kernel_size[1]-1)//2)
According to the PyTorch documentation, I expected the input and output sizes to be the same, but that did not happen!
The PyTorch documentation states that:
H_out = floor((H_in + 2*padding[0] - dilation[0]*(kernel_size[0]-1) - 1) / stride[0] + 1)
W_out = floor((W_in + 2*padding[1] - dilation[1]*(kernel_size[1]-1) - 1) / stride[1] + 1)
The input to torch.nn.Conv2d had shape (1, 1, 625, 513),
which, based on the Conv2d documentation, means batch size = 1, C_in = 1, H_in = 625 and W_in = 513,
and after using:
64 filters
kernel size of (15,15)
stride = (1,1)
dilation = 5
padding = (35, 35)
Putting those values in the formulas above gives us:
H_out = floor((625 + 2*35 - 5*(15-1) - 1) / 1 + 1) = floor(625 + 70 - 70 - 1 + 1) = 625
W_out = floor((513 + 2*35 - 5*(15-1) - 1) / 1 + 1) = floor(513 + 70 - 70 - 1 + 1) = 513
However, the output shape given by PyTorch was (1, 64, 681, 569).
I can understand the 1 and C_out = 64, but I don't know why H_out and W_out are not the same as H_in and W_in. Does anyone have an explanation that can help?
I figured it out! The reason I ended up with the wrong dimensions was that I didn't pass a numeric value for padding. I gave it the value of dilation, and based on that it calculated the padding itself as
padding=(dilation*(cnn_kernel_size[0]-1)//2, dilation*(cnn_kernel_size[1]-1)//2)
I think PyTorch needs to be given a numeric value for padding, because when I changed my code to give the network the value of padding and calculated the dilation based on the padding (dilation = (2*padding)/(kernel_size - 1)), I got the right output shape.
I think your calculations are correct. The padding should be 35 pixels.
For some reason, I am unable to reproduce the output shape you report.
Testing this yields the desired output size:
import torch
conv = torch.nn.Conv2d(1, 64, kernel_size=15, dilation=5, padding=35, stride=1)
conv(torch.rand(1, 1, 625, 513)).shape
Yields
torch.Size([1, 64, 625, 513])
As expected.
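As a side note, here is a minimal sketch (a helper added here, not part of the original answers) of computing symmetric "same" padding for stride 1, an odd kernel size, and arbitrary dilation:
import torch

def same_padding(kernel_size, dilation=1):
    # for stride 1 and an odd kernel size, this keeps H_out == H_in
    return dilation * (kernel_size - 1) // 2

conv = torch.nn.Conv2d(1, 64, kernel_size=15, dilation=5,
                       padding=same_padding(15, dilation=5), stride=1)
print(conv(torch.rand(1, 1, 625, 513)).shape)  # torch.Size([1, 64, 625, 513])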

Convolution with RGB images - what values does a RGB filter hold?

Convolution for a grayscale image is straightforward. You have a filter of shape n x n x 1 and convolve it with the input image to extract whatever features you desire.
I also understand how convolution would work for an RGB image. The filter would have a shape of n x n x 3. However, would all 3 'layers' in the filter hold the same kernel? For example, if the 0th layer holds a given map, would layers 1 and 2 also hold the exact same values? I am asking in regards to convolutional neural networks, not conventional image processing. I understand that the weights of each filter are learned and are randomly initialized; am I correct in thinking that each layer would have different randomized values?
Would all 3 'layers' in the filter hold the same kernel?
The short answer is no. The longer answer is that there isn't a separate kernel per layer; instead there is a single kernel tensor that handles all input and output channels at once.
The code below shows step by step how one would calculate each convolution manually, and from this we can see that at a high level the calculation goes like this:
take a patch from the batch of images (BatchSize x 3x3x3 in your case)
flatten it [BatchSize, 27]
matrix multiply it by the reshaped kernel [27, output_filters]
add in the bias of shape [output_filters]
All the colors are processed at once using a matrix multiplication with the kernel matrix. Looking at that matrix, the values used to generate the first filter sit in its first column, and the values used to generate the second filter sit in its second column. So, indeed, the values are different and not reused, but they are not stored or applied separately.
The code walkthrough
import tensorflow as tf
import numpy as np
# Define a 3x3 kernel that after convolution will create an image with 2 filters (channels)
conv_layer = tf.keras.layers.Conv2D(filters=2, kernel_size=3)
# Lets create a random input image
starting_image = np.array( np.random.rand(1,4,4,3), dtype=np.float32)
# and process it
result = conv_layer(starting_image)
weight, bias = conv_layer.get_weights()
print('size of weight', weight.shape)
print('size of bias', bias.shape)
size of weight (3, 3, 3, 2)
size of bias (2,)
# The output of the convolution of the 4x4x3 image input
# is a 2x2x2 output (because we don't have padding)
result.numpy()
array([[[[-0.34940776, -0.6426925 ],
[-0.81834394, -0.16166998]],
[[-0.37515935, -0.28143463],
[-0.60084903, -0.5310158 ]]]], dtype=float32)
# Now let's see how we can recreate this using the weights
# The way convolution is done is to extract a patch
# the size of the kernel (3x3 in this case)
# We will use the first patch, the first three rows and columns and all the colors
patch = starting_image[0,:3,:3,:]
print('patch.shape' , patch.shape)
# Then we flatten the patch
flat_patch = np.reshape( patch, [1,-1] )
print('New shape is', flat_patch.shape)
patch.shape (3, 3, 3)
New shape is (1, 27)
# next we take the weight and reshape it to be [-1,filters]
flat_weight = np.reshape( weight, [-1,2] )
print('flat_weight shape is ',flat_weight.shape)
flat_weight shape is (27, 2)
# we have the patch of shape [1,27] and the weight of [27,2]
# doing a matrix multiplication of the two shapes [1,27]*[27,2] = a shape of [1,2]
# which is the output we want, 2 filter outputs for this patch
output_for_patch = np.matmul(flat_patch,flat_weight)
# but we haven't added the bias yet, so lets do that
output_for_patch = output_for_patch + bias
# Finally, we can see that our manual calculation matches
# what Conv2D does exactly for the first patch
output_for_patch
array([[-0.34940773, -0.64269245]], dtype=float32)
If we compare this to the full convolution above, we can see that this is exactly the first patch
array([[[[-0.34940776, -0.6426925 ],
[-0.81834394, -0.16166998]],
[[-0.37515935, -0.28143463],
[-0.60084903, -0.5310158 ]]]], dtype=float32)
We would repeat this process for each patch. If we want to optimize this code some more, instead of passing only one image patch at a time ([1, 27]), we can pass [batch_number, 27] patches at a time, and the kernel will process them all at once, returning [batch_number, filter_size].
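For a PyTorch cross-check (a minimal sketch added here, using nn.Conv2d rather than Keras), you can inspect the weight tensor directly and see that the three channel slices of a given filter are initialized with different values:
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)

# weight shape is (out_channels, in_channels, kH, kW) = (2, 3, 3, 3)
print(conv.weight.shape)

# the three channel slices of the first filter are independently initialized,
# so they are (almost surely) not equal to each other
w0 = conv.weight[0]               # shape (3, 3, 3)
print(torch.equal(w0[0], w0[1]))  # False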

Wrap two tensors in pytorch to get size of new tensor as 2

I have two tensors say x and y:
x has shape: [21314, 3, 128, 128]
y has shape: [21314]
Can I get new tensor of shape : [ [21314, 3, 128, 128], [21314] ], basically of shape 2
I believe it's not possible if you require saving it as a single tensor object. Of course, you can use a list or a tuple for this case, but I guess that is not what you meant.
First, a tensor is simply a generalization of a matrix to n dimensions instead of two. But let's simplify this to a matrix for now, for example a 4x3 one. The first dimension is of size 4, meaning 4 entries. A second dimension of 3 means that each of the 4 first-dimension entries will have exactly (and no fewer than) 3 entries. That is, you must have a full list of 3 elements in each nested list. In this simple example, note that you cannot have a matrix like this one:
[[1,2,3]
[1,2]
[1] ]
While this is a nested list, it is not a matrix and also not a 2D tensor. What I'm trying to say is that the shape you requested, [ [21314, 3, 128, 128], [21314] ], is actually not a tensor.
But you could think of it as a tensor of size two with a data type of tensor in each entry (which is probably what you meant when asking the question). However, this is not possible, since tensors in PyTorch hold only numbers of these types: float32, float64, float16, uint8, int8, int16, int32, int64, bool.
Nevertheless, in most cases you can achieve what you need with assigning two tensors to a list or tuple.
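As a quick illustration (a sketch added here, not from the original answer), a plain Python tuple or list holds both tensors and has length 2:
import torch

x = torch.rand(21314, 3, 128, 128)
y = torch.rand(21314)

pair = (x, y)          # a tuple of two tensors
print(len(pair))       # 2
print(pair[0].shape)   # torch.Size([21314, 3, 128, 128])
print(pair[1].shape)   # torch.Size([21314])
If the goal is to pair each image with its label, torch.utils.data.TensorDataset(x, y) is another common way to keep them together.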

Pytorch summary only works for one specific input size for U-Net

I am trying to implement the UNet architecture in PyTorch. When I print the model using print(model), I get the correct architecture,
but when I try to print the summary using the following (or any other input size, for that matter):
from torchsummary import summary
summary(model, input_size=(13, 572, 572))
I get an error:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 70 and 71 in dimension 2 at /Users/distiller/project/conda/conda-bld/pytorch_1579022061893/work/aten/src/TH/generic/THTensor.cpp:612
However, it works perfectly if I give the input size as input_size=(3, 224, 224) (like it worked for this person here). I am so baffled.
Can someone help me figure out what's wrong?
Edit: I have used the model architecture from here.
The UNet architecture you provided doesn't support that shape (unless the depth parameter is <= 3). Ultimately, the reason is that a downsampling operation isn't invertible with respect to size, since multiple input shapes map to the same output shape. For example, consider:
>>> torch.nn.functional.max_pool2d(torch.zeros(1, 1, 10, 10), 2).shape
torch.Size([1, 1, 5, 5])
>>> torch.nn.functional.max_pool2d(torch.zeros(1, 1, 11, 11), 2).shape
torch.Size([1, 1, 5, 5])
So the question is, given only the output shape is 5x5, what was the shape of the input? Was it 10x10 or 11x11? This same phenomenon applies to downsampling via strided convolutions.
The problem is that the UNet class tries to combine features from the downsampling half of the network with features in the upsampling half. If it "guesses wrong" about the original shape during upsampling, then you will receive a dimension-mismatch error.
To avoid this issue you'll need to ensure that the height and width of your input data are multiples of 2**(depth-1). So, for the default depth=5, you need the input image height and width to be a multiple of 16 (e.g. 560 or 576). Alternatively, since 572 is divisible by 4, you could also set depth=3 to make it work.
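As an illustration (a hypothetical helper added here, not part of torchsummary or the linked UNet code), you can round an input size up to the nearest value the network supports:
import math

def nearest_valid_size(size, depth=5):
    # height/width must be a multiple of 2**(depth - 1) for this UNet
    factor = 2 ** (depth - 1)
    return math.ceil(size / factor) * factor

print(nearest_valid_size(572, depth=5))  # 576
print(nearest_valid_size(572, depth=3))  # 572, already a multiple of 4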

what is the meaning of border_mode in keras?

I'm confused about which to use: valid, same, or full. I also don't know what it does, and I can't find it in the docs. Also, border_mode for the MaxPooling2D layer does not make sense to me (it does make sense to me for the convolution layers, though).
When you convolve a two-dimensional image with m rows and n columns with an a x b kernel, this is what happens:
If border_mode is 'full', it returns an (m+a-1) x (n+b-1) image;
if border_mode is 'same', it returns an image with the same dimensions as the input;
if border_mode is 'valid', it returns an (m-a+1) x (n-b+1) image.
For example, given the following 4x4 input image
A = [12 13 14 15; 1 2 3 4; 16 17 18 19; 5 6 7 8] and a 3x3 kernel B = [1 2 3; 4 5 6; 7 8 9]:
if border_mode is 'full', then returns a 6x6 matrix;
if border_mode is 'same', then returns a 4x4 matrix;
if border_mode is 'valid', then returns a 2x2 matrix.
You can also use the MATLAB function conv2(A, B, border_mode) to test the output matrix.
Hope this answer helps.
This is for Keras 2+, which replaced border_mode with padding.
It can be used for downsampling and upsampling in a network.
The full, same, and valid border modes are explained very well above.
Yes, you are right: MaxPooling is used to reduce the spatial dimensions, so using border_mode = full or border_mode = same with it doesn't make much sense.
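As a quick check (a sketch added here, using the Keras 2 padding argument, which only accepts 'valid' and 'same'), the output shapes behave as described above:
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 4, 3).astype(np.float32)  # one 4x4, 3-channel image

# 'valid': no padding, so the output is (4-3+1) x (4-3+1) = 2x2
print(tf.keras.layers.Conv2D(1, 3, padding='valid')(x).shape)  # (1, 2, 2, 1)

# 'same': with stride 1, the output keeps the input's spatial size
print(tf.keras.layers.Conv2D(1, 3, padding='same')(x).shape)   # (1, 4, 4, 1)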

Resources