In the paper 'Fully Convolutional Networks for Semantic Segmentation'
the authors distinguish between input stride and output stride in the context of deconvolution.
How do these terms differ from each other?
Input stride is the stride of the filter: how far you shift the filter in the output at each step.
Output stride is more of a nominal value. In a CNN we obtain a feature map after several convolution and max-pooling operations. Say the input image is 224×224 and the final feature map is 7×7.
Then the output stride is 224/7 = 32, a rough summary of how much the image was downsampled.
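A minimal sketch of this idea (a hypothetical stack of five stride-2 PyTorch convolutions, not an actual FCN backbone, chosen only because 2^5 = 32):
import torch
import torch.nn as nn

# Five stride-2 convolutions give a total downsampling factor of 2**5 = 32.
backbone = nn.Sequential(*[
    nn.Conv2d(3 if i == 0 else 8, 8, kernel_size=3, stride=2, padding=1)
    for i in range(5)
])

feat = backbone(torch.randn(1, 3, 224, 224))
print(feat.shape)              # torch.Size([1, 8, 7, 7])
print(224 // feat.shape[-1])   # 32 -> the "output stride"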
The following TensorFlow docstring describes what this output stride is and how it is used in an FCN, i.e. for dense prediction:
For dense prediction tasks, one uses inputs with spatial dimensions that are multiples of 32 plus 1, e.g., [321, 321]. In
this case the feature maps at the ResNet output will have spatial shape
[(height - 1) / output_stride + 1, (width - 1) / output_stride + 1]
and corners exactly aligned with the input image corners, which greatly
facilitates alignment of the features to the image. Using as input [225, 225]
images results in [8, 8] feature maps at the output of the last ResNet block.
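A small sketch of that formula (the helper name is mine, not from the TensorFlow code):
def feature_map_size(height, width, output_stride=32):
    # Spatial shape of the feature maps for "aligned" inputs
    # whose sides are multiples of 32 plus 1, as quoted above.
    return (height - 1) // output_stride + 1, (width - 1) // output_stride + 1

print(feature_map_size(321, 321))  # (11, 11)
print(feature_map_size(225, 225))  # (8, 8), matching the example above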
Related
In the leading deep learning libraries, does the filter (aka kernel or weights) in a convolutional layer also convolve across the "channel" dimension, or does it take all the channels at once?
For example, if the input dimension is (60, 60, 10) (where the last dimension is often referred to as "channels") and the desired number of output channels is 5, should the filter be (5, 5, 5, 5) or (5, 5, 10, 5)?
It should be (5, 5, 10, 5). A Conv2d operation is just like a Linear layer if you ignore the spatial dimensions.
From the TensorFlow documentation [link]:
Given an input tensor of shape batch_shape + [in_height, in_width, in_channels] and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels], this op performs the following:
Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
For each patch, right-multiplies the filter matrix and the image patch vector.
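To make those three steps concrete, here is a toy NumPy re-implementation (stride 1 and 'VALID' padding only; the function name and shapes are just for illustration):
import numpy as np

def conv2d_as_matmul(x, w):
    # x: [H, W, C_in], w: [kh, kw, C_in, C_out]
    kh, kw, cin, cout = w.shape
    H, W, _ = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    w_mat = w.reshape(kh * kw * cin, cout)                 # 1. flatten the filter
    patches = np.stack([x[i:i + kh, j:j + kw, :].ravel()   # 2. extract image patches
                        for i in range(oh) for j in range(ow)])
    return (patches @ w_mat).reshape(oh, ow, cout)         # 3. matmul per patch

out = conv2d_as_matmul(np.random.randn(60, 60, 10), np.random.randn(5, 5, 10, 5))
print(out.shape)  # (56, 56, 5)
Note how the filter's channel axis has to match the input's channel count (10 here): the kernel always spans all input channels at once.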
It takes all channels at once, so 5×5×10×5 should be right.
julia> using Flux
julia> c = Conv((5,5), 10 => 5); # make a layer, 10 channels to 5
julia> c.weight |> summary
"5×5×10×5 Array{Float32, 4}"
julia> c(randn(Float32, 60, 60, 10, 1)) |> summary # check it works
"56×56×5×1 Array{Float32, 4}"
julia> Conv(rand(Float32, (5,5,5,5))) # different weight size
Conv((5, 5), 5 => 5) # 630 parameters
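For comparison, the same check in PyTorch (which stores conv weights channels-first, so the axes are ordered differently, but the idea is identical):
import torch
import torch.nn as nn

# Weight shape is (out_channels, in_channels, kh, kw) = (5, 10, 5, 5):
# every filter takes all 10 input channels at once.
c = nn.Conv2d(in_channels=10, out_channels=5, kernel_size=5)
print(c.weight.shape)                       # torch.Size([5, 10, 5, 5])
print(c(torch.randn(1, 10, 60, 60)).shape)  # torch.Size([1, 5, 56, 56])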
I have a tensor of size (32, 128, 50) in PyTorch. These are 50-dim word embeddings with a batch size of 32. That is, the three indices in my size correspond to number of batches, maximum sequence length (with 'pad' token), and the size of each embedding. Now, I want to pass this through a linear layer to get an output of size (32, 128, 1). That is, for every word embedding in every sequence, I want to make it one dimensional. I tried adding a linear layer to my network going from 50 to 1 dimension, and my output tensor is of the desired shape. So I think this works, but I would like to understand how PyTorch deals with this issue, since I did not explicitly tell it which dimension to apply the linear layer to. I played around with this and found that:
If I input a tensor of shape (32, 50, 50) -- thus creating ambiguity by having two dimensions along which the linear layer could be applied to (two 50s) -- it only applies it to the last dim and gives an output tensor of shape (32, 50, 1).
If I input a tensor of shape (32, 50, 128) it does NOT output a tensor of shape (32, 1, 128), but rather gives me an error.
This suggests that a linear layer in PyTorch applies the transformation to the last dimension of your tensor. Is that the case?
In the nn.Linear docs, it is specified that the input of this module can be any tensor of size (*, H_in) and the output will be a tensor of size (*, H_out), where:
* means any number of dimensions
H_in is the number of in_features
H_out is the number of out_features
To make this concrete: a tensor of size (n, m, 50) can be processed by a Linear module with in_features=50, while a tensor of size (n, 50, m) requires a Linear module with in_features=m (in your case 128).
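A quick sketch confirming this behaviour (shapes taken from the question):
import torch
import torch.nn as nn

linear = nn.Linear(50, 1)
print(linear(torch.randn(32, 128, 50)).shape)  # torch.Size([32, 128, 1])
print(linear(torch.randn(32, 50, 50)).shape)   # torch.Size([32, 50, 1]) -> last dim only

# A (32, 50, 128) input needs in_features=128 instead:
print(nn.Linear(128, 1)(torch.randn(32, 50, 128)).shape)  # torch.Size([32, 50, 1])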
My task is to visualize the plotted weights of a CNN layer. I passed filters=32 and kernel_size=(3, 3), so I expected .get_weights() (which extracts the weights and biases) to return 32 matrices, each of size 3×3, but instead I am getting a strangely nested output.
The output is as follows:
a = model.layers[0].get_weights()
a[0][0][0]
array([[ 2.87332404e-02, -2.80513391e-02,
**... 32 values ...**,
-1.55516148e-01, -1.26494586e-01, -1.36454999e-01,
1.61165968e-02, 7.63138831e-02],
[-5.21791205e-02, 3.13560963e-02, **... 32 values ...**,
-7.63987377e-02, 7.28923678e-02, 8.98564830e-02,
-3.02852653e-02, 4.07049060e-02],
[-7.04478994e-02, 1.33816227e-02,
**... 32 values ...**, -1.99537817e-02,
-1.67200342e-01, 1.15980692e-02]], dtype=float32)
Why am I getting this kind of output, and how can I get the weights in the expected shape? Thanks in advance.
Weights in a neural network are values that represent the connection strength between input nodes and output nodes (or nodes in the next layer).
A Conv2D layer's weights usually have shape (H, W, I, O), where:
H is kernel height
W is kernel width
I is number of input channels
O is number of output channels
Conv2D weights can be interpreted as the connection strength between a patch of the input channels and the nodes in an output filter/feature map. This way you have weights of shape (H, W) between each input channel and each output channel. Note that these weights are shared among the different patches of the same channel.
Consider, for example, convolving an (8, 8, 1) input with a (2, 2) kernel to produce an (8, 8, 1) output. The weights of this layer have shape (2, 2, 1, 1).
The same input can be used to produce 2 feature maps with two (2, 2) filters; the shape of the weights is then (2, 2, 1, 2).
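A quick way to verify that (2, 2, 1, 2) shape in Keras (build() just instantiates the weights, no training involved):
from tensorflow import keras

layer = keras.layers.Conv2D(2, (2, 2), padding="same")
layer.build((None, 8, 8, 1))   # the (8, 8, 1) input from the example above
w, b = layer.get_weights()
print(w.shape, b.shape)        # (2, 2, 1, 2) (2,)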
Hopefully this clarifies how to interpret the shape of convolutional layer weights.
The shape of the kernel weights from a Conv2D layer is (kernel_size[0], kernel_size[1], n_input_channels, filters). So in your case
a = model.layers[0].get_weights()
print(a[0].shape)
# should print (3,3,z,32) if your input has shape (x, y, z)
If you want to print the weights from one of the filters, you can do
a[0][:,:,:,0]
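Putting it together, here is a hypothetical single-layer model (the 28×28 grayscale input is only an example) showing how to pull the 32 kernels apart:
import numpy as np
from tensorflow import keras

conv = keras.layers.Conv2D(32, (3, 3))
model = keras.Sequential([keras.Input(shape=(28, 28, 1)), conv])

kernels, biases = conv.get_weights()   # same values as model.layers[0].get_weights()
print(kernels.shape, biases.shape)     # (3, 3, 1, 32) (32,)

# Move the filter axis to the front to get 32 kernels of shape (3, 3, 1):
filters = np.transpose(kernels, (3, 0, 1, 2))
print(filters.shape)                   # (32, 3, 3, 1)
print(filters[0][:, :, 0])             # the first 3x3 kernel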
I am new to PyTorch, and would be glad if someone could help me understand the following (and correct me if I am wrong) about the meaning of the x.view command in the first PyTorch tutorial, and more generally about the input of convolutional layers versus the input of fully-connected layers:
def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    x = x.view(-1, 16 * 5 * 5)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
As far as I understand, an input 256×256 image is fed to a convolutional layer in its 2D form (i.e. a 256×256 matrix, or 256×256×3 in the case of a color image). However, when we feed an image to a fully-connected linear layer, we first need to reshape the 2D image into a 1D vector (am I right? Is this true in general, or only in PyTorch?). Is this why we use the command "x = x.view(-1, 16 * 5 * 5)" before passing x to the fully-connected layers?
If the input image x were 3D (e.g. 256×256×256), would the syntax of the "forward" function given above remain the same?
Thanks a lot in advance
The figure is from Petteri Nevavuori's lecture notes and shows how a feature map is produced from an image I with a kernel K. With each application of the kernel a dot product is calculated, which is effectively the sum of element-wise multiplications between I and K over a K-sized area within I.
You could say that the kernel looks for diagonal features. It scans the image and finds a perfectly matching feature in the lower-left corner; elsewhere it matches only parts of the feature it is looking for. This is why the result is called a feature map: it tells how well the kernel was able to identify its feature at each location of the image it was applied to.
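The original figure is not reproduced here, but a toy NumPy sketch of the same idea (the small binary image and the anti-diagonal 2×2 kernel below are made up for illustration) shows how the feature map scores each location:
import numpy as np

I = np.array([[0, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])
K = np.array([[0, 1],
              [1, 0]])

# Valid convolution: dot product of K with every 2x2 patch of I.
fmap = np.array([[np.sum(I[i:i + 2, j:j + 2] * K) for j in range(3)]
                 for i in range(3)])
print(fmap)
# [[0 1 0]
#  [2 0 0]
#  [0 1 0]]   <- the 2 marks where the kernel matches the image perfectly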
Answer adapted from: https://discuss.pytorch.org/t/convolution-input-and-output-channels/10205/3
Let's say we have an input image of shape (W x H x 3), i.e. the input volume has 3 channels (an RGB image), and we would like to create a ConvLayer for this image.
Each kernel in the ConvLayer uses all input channels of the input volume. Let's assume we would like to use a 3 by 3 kernel. This kernel will have 27 weights and 1 bias parameter, since kernel_height * kernel_width * in_channels = 3 * 3 * 3 = 27 weights.
The number of output channels is the number of different kernels used in the ConvLayer. If we would like to output 64 channels, we need to define the ConvLayer so that it uses 64 different 3x3 kernels.
If you check out the documentation of Conv2d, you can define a ConvLayer mimicking the above scenario as follows:
nn.Conv2d(3, 64, 3, stride=1)
Here in_channels = 3, out_channels = 64, and kernel_size = 3x3. See the documentation for what stride does.
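To see the 27 weights per kernel and the per-filter biases directly:
import torch.nn as nn

conv = nn.Conv2d(3, 64, 3, stride=1)
print(conv.weight.shape)  # torch.Size([64, 3, 3, 3]) -> 64 kernels, 27 weights each
print(conv.bias.shape)    # torch.Size([64])          -> one bias per output channel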
If you check out the implementation of the Linear layer, you will see that the underlying mathematical operation it performs is y = Ax + b.
According to the PyTorch documentation of the Linear layer, it expects an input of shape (N, ∗, in_features) and produces an output of shape (N, ∗, out_features). So, in your case, if the input image x has shape 256 x 256 x 256 and you want to transform all 256*256*256 features into some specific number of features, you can define a linear layer as:
llayer = nn.Linear(256*256*256, num_features)
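A small usage sketch, using the 16 * 5 * 5 flattening from the tutorial's network instead of the very large 256*256*256 layer (the batch size of 4 is arbitrary):
import torch
import torch.nn as nn

x = torch.randn(4, 16, 5, 5)        # e.g. the output of the second pooling layer
fc1 = nn.Linear(16 * 5 * 5, 120)    # the fully-connected layer expects 1-D feature vectors
y = fc1(x.view(-1, 16 * 5 * 5))     # flatten everything after the batch dimension
print(y.shape)                      # torch.Size([4, 120])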
I am using Keras to make a CNN, and I want to visualize the model with plot_model().
When I look at the shape of the Conv2d layers, there is a thing that I can't figure out.
Let's say my Conv2d layer has kernel size [8 x 8], stride is [4 by 4], padding is 'same' and I want 16 feature maps.
Input shape to this layer is [None, 3, 160, 320] and output is [None,1,40,16].
'None' is the number of samples, but what are the 1 and the 40? I guess 16 is the number of feature maps?
Since I used padding='same', shouldn't the output have the same width and height as the input, or is that not what 'same' means?
Thanks!
Well, since you're using strides (greater than 1), you'll never get the same shape.
Your convolutional filter (which can be seen as a sliding window) jumps four pixels at each step.
As a result, your final shape is the input shape divided by 4 (and rounded up):
3/4 rounded up = 1
160/4 = 40
16 is the number of feature maps, indeed.
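For the record, a sketch reproducing the layer from the question (assuming Keras's default channels_last data format, which is what makes 3 and 160 the spatial dimensions here and 320 the input channels):
from tensorflow import keras

inp = keras.Input(shape=(3, 160, 320))
out = keras.layers.Conv2D(16, (8, 8), strides=(4, 4), padding="same")(inp)
print(out.shape)  # (None, 1, 40, 16) -> ceil(3/4)=1, ceil(160/4)=40, 16 filters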