PyTorch - meaning of a command in a basic "forward" pass

I am new to PyTorch and would be glad if someone could help me understand the following (and correct me if I am wrong), regarding the meaning of the x.view command in the first PyTorch tutorial, and more generally about the input of convolutional layers versus the input of fully-connected layers:
def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    x = x.view(-1, 16 * 5 * 5)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
As far as I understand, a 256x256 input image is passed to a convolutional layer in its 2D form (i.e. as a 256x256 matrix, or 256x256x3 in the case of a colour image). Nevertheless, when we feed an image to a fully-connected linear layer, we first need to reshape the 2D image into a 1D vector (am I right? Is this true in general, or only in PyTorch?). Is this why we use the command "x = x.view(-1, 16 * 5 * 5)" before passing x to the fully-connected layers?
If the input image x were 3D (e.g. 256x256x256), would the syntax of the above forward function remain the same?
Thanks a lot in advance

It's from Petteri Nevavuori's lecture notes and shows how a feature map is produced from an image I with a kernel K. With each application of the kernel a dot product is calculated, which is effectively the sum of element-wise multiplications between K and a K-sized area within I.
You could say that this kernel looks for diagonal features. It then scans the image and finds a perfectly matching feature in the lower-left corner; elsewhere the kernel only matches parts of the feature it is looking for. This is why the result is called a feature map: it tells how well the kernel was able to identify the feature at each location of the image it was applied to.
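For illustration, here is a minimal NumPy sketch of that sliding dot product (the image and kernel values are made up and do not correspond to the lecture notes):

import numpy as np

I = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
K = np.array([[1, 0],
              [0, 1]])   # a small kernel that responds to diagonal patterns

h = I.shape[0] - K.shape[0] + 1
w = I.shape[1] - K.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        # dot product = sum of element-wise products in a K-sized window of I
        feature_map[i, j] = np.sum(I[i:i+K.shape[0], j:j+K.shape[1]] * K)
print(feature_map)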
Answer adapted from: https://discuss.pytorch.org/t/convolution-input-and-output-channels/10205/3
Let's say we consider an input image of shape (W x H x 3) where input volume has 3 channels (RGB image). Now we would like to create a ConvLayer for this image.
Each kernel in the ConvLayer will use all input channels of the input volume. Let's assume we would like to use a 3 by 3 kernel. This kernel will have 27 weights and 1 bias parameter, since kernel_height * kernel_width * in_channels = 3 * 3 * 3 = 27 weights.
The number of output channels is the number of different kernels used in the ConvLayer. If we would like to output 64 channels, we need to define ConvLayer such that it uses 64 different 3x3 kernels.
Following the documentation of Conv2d, we can define a ConvLayer mimicking the above scenario as follows.
nn.Conv2d(3, 64, 3, stride=1)
Here in_channels = 3, out_channels = 64 and kernel_size = 3x3. See the documentation for what stride does.
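As a quick sanity check (a sketch added here, not part of the original answer), the layer's parameter shapes line up with the counting above:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1)
print(conv.weight.shape)   # torch.Size([64, 3, 3, 3]) -> 64 kernels, each with 3*3*3 = 27 weights
print(conv.bias.shape)     # torch.Size([64]) -> one bias per output channel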
If you check the implementation of the Linear layer, you will see that the underlying mathematical operation it performs is y = Ax + b.
According to the PyTorch documentation of the Linear layer, it expects an input of shape (N, *, in_features) and produces an output of shape (N, *, out_features). So, in your case, if the input image x is of shape 256 x 256 x 256 and you want to transform all 256*256*256 features into a specific number of features, you can define a linear layer as:
llayer = nn.Linear(256*256*256, num_features)
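A minimal sketch of the flatten-then-Linear pattern (much smaller shapes are used here purely to keep the example light; the real case would use 256*256*256 input features):

import torch
import torch.nn as nn

num_features = 10                  # illustrative value
llayer = nn.Linear(16 * 16 * 16, num_features)

x = torch.randn(4, 16, 16, 16)     # a batch of 4 small "3D images"
x = x.view(x.size(0), -1)          # flatten each sample into a 1D vector, as x.view does in the tutorial
print(llayer(x).shape)             # torch.Size([4, 10])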

Related

Convolutional layer: does the filter convolve also through the nlayers_in, or does it take all the dimensions?

In the leading deep-learning libraries, does the filter (aka kernel or weights) in a convolutional layer also convolve across the "channel" dimension, or does it take all the channels at once?
To give an example, if the input dimension is (60, 60, 10) (where the last dimension is often referred to as "channels") and the desired number of output channels is 5, should the filter be (5, 5, 5, 5) or (5, 5, 10, 5)?
It should be (5, 5, 10, 5). The Conv2d operation is just like a Linear layer if you ignore the spatial dimensions.
From TensorFlow documentation [link]:
Given an input tensor of shape batch_shape + [in_height, in_width, in_channels] and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels], this op performs the following:
Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
For each patch, right-multiplies the filter matrix and the image patch vector.
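The quoted steps can be reproduced directly with PyTorch's unfold (an im2col-style sketch added here, using shapes from the (60, 60, 10) -> 5 channels example):

import torch
import torch.nn.functional as F

x = torch.randn(1, 10, 60, 60)           # batch, in_channels, H, W
w = torch.randn(5, 10, 5, 5)             # out_channels, in_channels, kH, kW

patches = F.unfold(x, kernel_size=5)     # (1, 10*5*5, 56*56) -- one column per image patch
out = w.view(5, -1) @ patches            # flattened filter matrix times each patch
out = out.view(1, 5, 56, 56)

print(torch.allclose(out, F.conv2d(x, w), atol=1e-4))   # True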
It takes all channels at once, so 5×5×10×5 should be right.
julia> using Flux
julia> c = Conv((5,5), 10 => 5); # make a layer, 10 channels to 5
julia> c.weight |> summary
"5×5×10×5 Array{Float32, 4}"
julia> c(randn(Float32, 60, 60, 10, 1)) |> summary # check it works
"56×56×5×1 Array{Float32, 4}"
julia> Conv(rand(Float32, (5,5,5,5))) # different weight size
Conv((5, 5), 5 => 5) # 630 parameters
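The same check expressed in PyTorch (note that PyTorch stores convolution weights as (out_channels, in_channels, kH, kW), so the same layer shows up as 5×10×5×5 rather than 5×5×10×5):

import torch
import torch.nn as nn

c = nn.Conv2d(10, 5, kernel_size=5)            # 10 channels in, 5 channels out
print(c.weight.shape)                          # torch.Size([5, 10, 5, 5])
print(c(torch.randn(1, 10, 60, 60)).shape)     # torch.Size([1, 5, 56, 56])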

Size Mismatch for Functional Linear Layer

I apologize that this is probably a simple question that has been answered before, but I could not find the answer. I'm attempting to use a CNN to extract features and then feed them into an FC network that outputs 2 variables. I'm attempting to use the functional linear layer as a way to dynamically handle the flattened features. self.cnn is a Sequential container whose last layer is nn.Flatten(). When I print the size of x after the CNN I see it is 15x152064, so I'm unclear why the F.linear layer fails with the error below. Any help would be appreciated.
RuntimeError: size mismatch, get 15, 15x152064,2
x = self.cnn(x)
batch_size, channels = x.size()
x = F.linear(x, torch.Tensor([256,channels]))
y_hat = self.FC(x)
torch.Tensor([256, channels]) does not create a tensor of size (256, channels) but a 1D tensor containing the two values 256 and channels. I don't know how you want to initialize your weights, but there are a couple of options:
# All-ones weights (sums the input features):
x = F.linear(x, torch.ones(256, channels))
# Random weights:
x = F.linear(x, torch.randn(256, channels))
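If the goal is a learnable layer whose input size is only known after flattening, one option (a sketch, assuming a reasonably recent PyTorch version) is nn.LazyLinear, which infers in_features on the first forward pass:

import torch
import torch.nn as nn

proj = nn.LazyLinear(256)          # out_features = 256, in_features inferred at first call
x = torch.randn(15, 152064)        # the shape reported in the question
print(proj(x).shape)               # torch.Size([15, 256])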

Stacking of 2 convolutional layers

In a convolutional layer with n neurons, trained on inputs of dimension h x w x c (height x width x channels), with c usually being 3 (RGB), one trains n x c kernels of size k x k (and n bias values). So for each neuron i in the layer and each channel j of the input, we have a weight matrix of size k x k, which we call weights_ij. The output of each neuron i = 1,...,n (for input X) is:
out_i = sigma(tmp_i + bias_i)
with tmp_i = sum_{j=1,...,c} conv(X_j, weights_ij), where X_j is channel j of X.
The output is then h_new x w_new x n, so the depth of the output coincides with the number of neurons in the layer; h_new and w_new depend on the padding and stride of the convolution.
This makes sense to me, and I also checked it by coding the convolution and the summation myself and comparing the result with the result of a Keras model consisting of only this one layer. Now to my actual question: when we add a second convolutional layer, my understanding was that the output of the first layer is now a "picture" with n channels, and we do exactly the same as before but with c = n (and a new number n2 of neurons in the 2nd layer).
But I also coded that and compared it with the prediction of a Keras model with 2 convolutional layers, and now the results do not match. So does anyone know how the 2nd convolutional layer treats the output of the first?
Ok, I solved my problem.
Actually the problem was already present for just one layer, and by stacking 2 layers the errors accumulated.
I thought that when using stride=2 in a convolutional layer, the convolution is applied to the sections [0:N_k, 0:N_k], [2:2+N_k, 2:2+N_k], [4:4+N_k, 4:4+N_k], ... of the input, but Keras actually applies it to [1:1+N_k, 1:1+N_k], [3:3+N_k, 3:3+N_k], ...
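For reference, a minimal PyTorch sketch of the stacking logic asked about (sizes are illustrative): the second layer simply uses the first layer's n output channels as its input channels.

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)              # c = 3 input channels
layer1 = nn.Conv2d(3, 8, kernel_size=3)    # n = 8 "neurons" -> 8 output channels
layer2 = nn.Conv2d(8, 4, kernel_size=3)    # second layer: c = n = 8, n2 = 4

out1 = layer1(x)
print(out1.shape)                          # torch.Size([1, 8, 30, 30])
print(layer2(out1).shape)                  # torch.Size([1, 4, 28, 28])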

The derivative of Softmax outputs really large shapes

I am creating a basic neural network, also my first, for handwritten digit recognition without any framework (like TensorFlow or PyTorch), using the backpropagation algorithm.
My NN has 784 inputs and 10 outputs. So for the last layer, I have to use Softmax.
Because of some memory errors, my images are currently of shape (300, 784) and my labels of shape (300, 10).
After that I calculate the loss with categorical cross-entropy.
Now we are getting to my problem. In backpropagation, I need to manually compute the first derivative of the activation function. I am doing it like this:
dAl = -(np.divide(Y, Al) - np.divide(1 - Y, 1 - Al))
# Y = test labels
# Al = activation values from my last layer
After that my backpropagation can start, with the last layer being softmax.
def SoftmaxDerivative(dA, Z):
    # Z is the output of np.dot(A_prev, W) + b,
    # where A_prev is the activation value from the previous layer,
    # W is the weight and b is the bias
    # dA is the derivative of the activation function value
    x = activation_functions.softmax(dA)
    s = x.reshape(-1, 1)
    dZ = np.diagflat(s) - np.dot(s, s.T)
    return dZ
1. Is this function working properly?
In the end, I would like to compute the derivatives of the weights and biases, so I am using this:
dW = (1/m)*np.dot(dZ, A_prev.T)
#m is A_prev.shape[1] -> 10
db = (1/m)*np.sum(dZ, axis = 1, keepdims = True)
BUT it fails on dW, because dZ.shape is (3000, 3000) (compared to A_prev.shape, which is (300, 10)).
So from this I assume that there are only 3 possible outcomes:
My Softmax backward is wrong
dW is wrong
I have some other bug completely somewhere else
Any help would be really appreciated!
I faced the same problem recently. I'm not sure but maybe this question will help you: Softmax derivative in NumPy approaches 0 (implementation)
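Not from the linked thread, but a common simplification worth noting: when softmax is combined with categorical cross-entropy, the gradient with respect to the pre-softmax values Z reduces to Al - Y, which keeps the (300, 10) shape and avoids building the full Jacobian. A NumPy sketch with the shapes from the question:

import numpy as np

batch, classes = 300, 10
Z = np.random.randn(batch, classes)                         # pre-softmax values
Y = np.eye(classes)[np.random.randint(0, classes, batch)]   # one-hot labels

Al = np.exp(Z - Z.max(axis=1, keepdims=True))
Al /= Al.sum(axis=1, keepdims=True)                         # softmax, shape (300, 10)

dZ = (Al - Y) / batch                                       # gradient of the mean CE loss w.r.t. Z
print(dZ.shape)                                             # (300, 10)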

Convolutional NN for text input in PyTorch

I am trying to implement a text classification model using a CNN. As far as I know, for text data we should use 1D convolutions. I saw an example in PyTorch using Conv2d, but I want to know how I can apply Conv1d to text. Or is it actually not possible?
Here is my model scenario:
Number of in-channels: 1, Number of out-channels: 128
Kernel size : 3 (only want to consider trigrams)
Batch size : 16
So I will provide tensors of shape <16, 1, 28, 300>, where 28 is the length of a sentence. I want to use Conv1d, which will give me 128 feature maps of length 26 (as I am considering trigrams).
I am not sure how to define nn.Conv1d() for this setting. I can use Conv2d, but I want to know whether it is possible to achieve the same with Conv1d.
This example of feeding Conv1d and Pool1d layers into an RNN resolved my issue.
So, I need to treat the embedding dimension as the number of in-channels when using nn.Conv1d, as follows.
import torch
import torch.nn as nn

m = nn.Conv1d(200, 10, 2)        # in_channels = 200, out_channels = 10, kernel_size = 2
input = torch.randn(10, 200, 5)  # batch = 10, 200 = embedding dim, 5 = sequence length
feature_maps = m(input)
print(feature_maps.size())       # torch.Size([10, 10, 4])
Although I don't work with text data, the input tensor in its current form would only work with Conv2d. One possible way to use Conv1d would be to concatenate the embeddings into a tensor of shape e.g. <16, 1, 28*300>. You can reshape the input with view in PyTorch.
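Alternatively, a sketch matching the question's setup directly (treating the 300-dimensional embedding as the channel dimension rather than flattening it):

import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=300, out_channels=128, kernel_size=3)

x = torch.randn(16, 28, 300)   # (batch, sentence length, embedding dim)
x = x.transpose(1, 2)          # -> (16, 300, 28), as Conv1d expects (batch, channels, length)
print(conv(x).shape)           # torch.Size([16, 128, 26]) -> 128 feature maps of length 26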
