I'm trying to implement an autoencoder CNN. However, I have the following problem:
The last convolutional layer of my encoder is defined as follows:
Conv2d(128, 256, 3, padding=1, stride=2)
The input of this layer has shape (1, 128, 24, 24). Thus, the output has shape (1, 256, 12, 12).
After this layer, I have ReLU activation and BatchNorm. Neither of these changes the shape of the output.
Then I have a first ConvTranspose2d layer defined as:
ConvTranspose2d(256, 128, 3, padding=1, stride=2)
But the output of this layer has shape (1, 128, 23, 23).
As far as I know, if we use the same kernel size, stride, and padding in ConvTranspose2d as in the preceding Conv2d layer, then the output of this two-layer block must have the same shape as its input.
So, my question is: what is wrong with my understanding? And how can I fix this issue?
I would first like to note that the nn.ConvTranspose2d layer is not the inverse of nn.Conv2d as explained in its documentation page:
it is not an actual deconvolution operation as it does not compute a true inverse of convolution
As far as I know, if we use the same kernel size, stride, and padding in ConvTranspose2d as in the preceding Conv2d layer, then the output of this 2 layers block must have the same shape as its input.
This is not always true! It depends on the input spatial dimensions.
In terms of spatial dimensions the 2D convolution will output:
out = [(x + 2p - d(k - 1) - 1)/s + 1]
where [·] denotes the floor (whole part) of the quantity,
while the 2D transpose convolution will output:
out = (x - 1)s - 2p + d(k - 1) + op + 1
where x = input_dimension, out = output_dimension, k = kernel_size, s = stride, d = dilation, p = padding, and op = output_padding.
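To make the formulas concrete, here is a minimal sketch of them as helper functions (the names conv_out and convT_out are just for illustration):

import math

def conv_out(x, k, s=1, p=0, d=1):
    # out = floor((x + 2p - d(k - 1) - 1)/s + 1)
    return math.floor((x + 2*p - d*(k - 1) - 1)/s + 1)

def convT_out(x, k, s=1, p=0, d=1, op=0):
    # out = (x - 1)s - 2p + d(k - 1) + op + 1
    return (x - 1)*s - 2*p + d*(k - 1) + op + 1

conv_out(24, k=3, s=2, p=1)   # 12 -> the encoder output in the question
convT_out(12, k=3, s=2, p=1)  # 23 -> the decoder output in the question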
If you look at the composition convT ∘ conv (i.e. convT(conv(x))), then you have:
out = (out_conv - 1)s - 2p + d(k - 1) + op + 1
= ([(x + 2p - d(k - 1) - 1)/s + 1] - 1)s - 2p + d(k - 1) + op + 1
This equals x only if [(x + 2p - d(k - 1) - 1)/s + 1] = (x + 2p - d(k - 1) - 1)/s + 1, i.e. only if (x + 2p - d(k - 1) - 1) is divisible by s. With the question's parameters (k = 3, s = 2, p = 1, d = 1) this means x must be odd, in which case:
out = ((x + 2p - d(k - 1) - 1)/s + 1 - 1)s - 2p + d(k - 1) + op + 1
= x + op
And out = x when op = 0.
Otherwise if x is even then:
out = x - 1 + op
And setting op = 1 gives out = x.
Here is an example:
>>> import torch
>>> import torch.nn as nn
>>> conv = nn.Conv2d(1, 1, 3, stride=2, padding=1)
>>> convT = nn.ConvTranspose2d(1, 1, 3, stride=2, padding=1)
>>> convT(conv(torch.rand(1, 1, 25, 25))).shape # x odd
torch.Size([1, 1, 25, 25]) #<- out = x
>>> convT = nn.ConvTranspose2d(1, 1, 3, stride=2, padding=1, output_padding=1)
>>> convT(conv(torch.rand(1, 1, 24, 24))).shape # x even
torch.Size([1, 1, 24, 24]) #<- out = x - 1 + op
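Applied to the question's layers (continuing the same sketch): the encoder input is 24 x 24, i.e. even, so adding output_padding=1 to the ConvTranspose2d restores the original shape:
>>> enc = nn.Conv2d(128, 256, 3, padding=1, stride=2)
>>> dec = nn.ConvTranspose2d(256, 128, 3, padding=1, stride=2, output_padding=1)
>>> dec(enc(torch.rand(1, 128, 24, 24))).shape
torch.Size([1, 128, 24, 24])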
Related
I want to project a tensor of shape [197, 1, 768] to [197, 1, 128] in PyTorch using nn.Conv().
You could achieve this using a wide flat kernel combined with a specific stride. If you stick with a dilation of 1, then the input/output spatial dimension relation is given by:
out = [(2p + x - k)/s + 1]
Where p is the padding, k is the kernel size and s is the stride. [·] denotes the floor of the quantity.
Applied here you have:
128 = [(2p + 768 - k)/s + 1]
So you would get:
k = 2p + 768 - (128 - 1)*s
If you impose p = 0 and s = 6, you find k = 6:
>>> project = nn.Conv2d(197, 197, kernel_size=(1, 6), stride=6)
>>> project(torch.rand(1, 197, 1, 768)).shape
torch.Size([1, 197, 1, 128])
Alternatively, a more straightforward - but different - approach is to learn a mapping using a fully connected layer:
>>> project = nn.Linear(768, 128)
>>> project(torch.rand(1, 197, 1, 768)).shape
torch.Size([1, 197, 1, 128])
You could use a kernel size and stride of 6, as that’s the factor between the input and output temporal size:
x = torch.randn(197, 1, 768)
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=6, stride=6)
out = conv(x)
print(out.shape)
> torch.Size([197, 1, 128])
I am using Python 3.8 and PyTorch 1.7.1. I saw some code which defines a Conv2d layer as follows:
Conv2d(3, 6, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
The input 'X' being passed to it is a 4D tensor:
X.shape
# torch.Size([4, 3, 6, 6])
The output volume for this conv layer is:
c1(X).shape
# torch.Size([4, 6, 3, 3])
I am trying to use the formula to compute output spatial dimensions for any conv layer: O = ((W - K + 2P)/S) + 1, where W = spatial dimension of image, K = filter/kernel size, P = zero padding & S = stride.
For the 'c1' conv layer, we have W = 6, K = 3, S = 2 and P = 1. Using the formula, you get O = ((6 - 3 + (2 x 1)) / 2) + 1 = 5/2 + 1 = 3.5.
The output volume: (4, 6, 3, 3) since number of filters used = 6.
How is the spatial output from 'c1' then (3, 3)? What am I not getting?
Thanks!
How would you have half a pixel?
You're missing the floor function:
O = floor(((W - K + 2P)/S) + 1)
So the shape of the outputted maps is (3, 3).
Here's the complete formula (with dilation) for nn.Conv2d:
O = floor((W + 2P - D*(K - 1) - 1)/S + 1)
where D is the dilation.
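As a quick check with the layer and input from the question (assuming import torch and import torch.nn as nn):
c1 = nn.Conv2d(3, 6, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
X = torch.rand(4, 3, 6, 6)
c1(X).shape
# torch.Size([4, 6, 3, 3]) <- floor(3.5) = 3 in each spatial dimension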
I have two tensors of different sizes to pass through the network.
C = nn.Conv1d(1, 1, kernel_size=1, stride=2)
TC = nn.ConvTranspose1d(1, 1, kernel_size=1, stride=2)
a = torch.rand(1, 1, 100)
b = torch.rand(1, 1, 101)
a_out, b_out = TC(C(a)), TC(C(b))
The results are
a_out.shape = torch.Size([1, 1, 99]) # What I want is [1, 1, 100]
b_out.shape = torch.Size([1, 1, 101])
Is there any method to handle this problem?
I need your help.
Thanks
This is expected behaviour per the documentation. One workaround is to pad the output when an even input length is detected, so that it gets the same length as the input.
Something like this:
import torch
import torch.nn as nn

class PadEven(nn.Module):
    def __init__(self, conv, deconv, pad_value=0, padding=(0, 1)):
        super().__init__()
        self.conv = conv
        self.deconv = deconv
        # pads one element on the right to restore even-length inputs
        self.pad = nn.ConstantPad1d(padding=padding, value=pad_value)

    def forward(self, x):
        nd = x.size(-1)
        x = self.deconv(self.conv(x))
        if nd % 2 == 0:
            x = self.pad(x)
        return x

C = nn.Conv1d(1, 1, kernel_size=1, stride=2)
TC = nn.ConvTranspose1d(1, 1, kernel_size=1, stride=2)
P = PadEven(C, TC)

a = torch.rand(1, 1, 100)
b = torch.rand(1, 1, 101)
a_out, b_out = P(a), P(b)
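With this wrapper both tensors come back at their original lengths:
a_out.shape, b_out.shape
# (torch.Size([1, 1, 100]), torch.Size([1, 1, 101]))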
I am working on an inference implementation of a PyTorch ONNX model, which is why I am asking this question.
Assume, I have a image with dimensions 32 x 32 x 3 (CIFAR-10 dataset). I pass it through a Conv2d with dimensions : 3 x 192 x 5 x 5. The command I used is: Conv2d(3, 192, kernel_size=5, stride=1, padding=2)
Using the formula (stated on page 12 of https://arxiv.org/pdf/1603.07285.pdf for reference) I should be getting an output image with dimensions 28 x 28 x 192 (input - kernel + 1 = 32 - 5 + 1 = 28).
The question is: how does PyTorch apply this 4D tensor 3 x 192 x 5 x 5 to the input to produce an output of 28 x 28 x 192? The layer's weight is a 4D tensor, while the input image is a 2D one (with 3 channels).
How is the kernel (5 x 5) spread across the image matrix 32 x 32 x 3? What does the kernel convolve with first: the 3 x 192 part or the 32 x 32 part?
Note: I have understood the 2D case. I am asking the above questions for 3 or more dimensions.
The input to Conv2d is a tensor of shape (N, C_in, H_in, W_in) and the output is of shape (N, C_out, H_out, W_out), where N is the batch size (number of images), C is the number of channels, H is the height and W is the width. The output height and width H_out, W_out are computed as follows (ignoring the dilation):
H_out = (H_in + 2*padding[0] - kernel_size[0]) / stride[0] + 1
W_out = (W_in + 2*padding[1] - kernel_size[1]) / stride[1] + 1
See cs231n for an explanation of how these formulas were obtained.
In your example N=1, H_in = 32, W_in = 32, C_in = 3, kernel_size = (5, 5), strides = (1, 1), padding = (0, 0), giving H_out = 28, W_out = 28.
The C_out=192 means that there are 192 different filters, each of shape (C_in, kernel_size[0], kernel_size[1]) = (3, 5, 5). Each filter independently performs convolution with the input image resulting in a 2D tensor of shape (H_out, W_out) = (28, 28), and since there are C_out = 192 filters and N = 1 images, the final output is of shape (N, C_out, H_out, W_out) = (1, 192, 28, 28).
To understand how exactly the convolution is performed see the convolution demo.
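Here is a small sketch mirroring those numbers (it uses padding=0 to match the 28 x 28 result in the question's formula, rather than the padding=2 in the quoted command; assumes import torch and import torch.nn as nn):
conv = nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=0)
conv.weight.shape
# torch.Size([192, 3, 5, 5]) <- 192 filters, each of shape (3, 5, 5)
conv(torch.rand(1, 3, 32, 32)).shape
# torch.Size([1, 192, 28, 28])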
My network maps an image of size 62*71 to a vector of 124 outputs. At test time I get the same output for every input; I checked 4000 cases.
I cannot pinpoint the problem, because training seems fine: the error improves and reaches a relatively low value.
Does anyone know what the problem might be?
import numpy as np
import tensorflow as tf

# load data
data_in = np.transpose(np.loadtxt("images_in_10000.csv", delimiter=',', dtype=np.float32))
data_out = np.transpose(np.loadtxt("out_to_image_10000.csv", delimiter=',', dtype=np.float32))

x_train = data_in[0:6000, :]
x_test = data_in[6000:10001, :]
y_train = data_out[0:6000, :]
y_test = data_out[6000:10001, :]

# parameters
batch = 100
epochs = 7
learning_rate = 0.01

n = x_test.shape[1]   # 4392
m = x_train.shape[0]  # 6000
d = y_test.shape[1]   # 124
l = y_test.shape[0]   # 4000

trainX = tf.placeholder(tf.float32, [batch, n])
trainY = tf.placeholder(tf.float32, [batch, d])
testX = tf.placeholder(tf.float32, [l, n])
testY = tf.placeholder(tf.float32, [l, d])

W_c1 = tf.Variable(tf.random_normal([5, 5, 1, 32]))
W_c2 = tf.Variable(tf.random_normal([5, 5, 32, 64]))
W_fc = tf.Variable(tf.random_normal([18 * 16 * 64, 128]))
W_out = tf.Variable(tf.random_normal([128, d]))

b_c1 = tf.Variable(tf.random_normal([32]))
b_c2 = tf.Variable(tf.random_normal([64]))
b_fc = tf.Variable(tf.random_normal([128]))
b_out = tf.Variable(tf.random_normal([d]))

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def maxpool2d(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

def convolutional_neural_network(x):
    x = tf.reshape(x, shape=[-1, 61, 72, 1])
    conv1 = tf.nn.relu(conv2d(x, W_c1) + b_c1)
    conv1 = maxpool2d(conv1)
    conv2 = tf.nn.relu(conv2d(conv1, W_c2) + b_c2)
    conv2 = maxpool2d(conv2)
    fc = tf.reshape(conv2, [-1, 18 * 16 * 64])
    fc = tf.nn.relu(tf.matmul(fc, W_fc) + b_fc)
    output = tf.matmul(fc, W_out) + b_out
    return output

prediction = convolutional_neural_network(trainX)
cost = tf.reduce_mean(tf.pow(prediction - trainY, 2))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

prediction_t = convolutional_neural_network(testX)
losstest = tf.reduce_mean(tf.pow(prediction_t - testY, 2))

k = 0
a = np.linspace(0, m - batch, m // batch, dtype=np.int32)
costshow = [0] * (len(a) * epochs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(epochs):
        epoch_loss = 0
        for i in np.linspace(0, m - batch, m // batch, dtype=np.int32):
            x = x_train[i:i + batch, :]
            y = y_train[i:i + batch, :]
            sess.run(optimizer, feed_dict={trainX: x, trainY: y})
            cost_val = sess.run(cost, feed_dict={trainX: x, trainY: y})
            costshow[k] = cost_val
            print("Epoch=", '%04d' % (epoch + 1), "loss=", " {:.9f}".format(cost_val))
            k = k + 1
    print("finish train-small")
    result = sess.run(prediction_t, feed_dict={testX: x_test})
    test_loss = sess.run(losstest, feed_dict={testX: np.asarray(x_test), testY: np.asarray(y_test)})
    print("Testing loss=", test_loss)
The value range of an image is clearly defined: image values typically range from 0-1 or 0-255. For CNNs you should normalize your input values to the 0-1 range.
Thus you have to be careful with your weight initialization. For example, if you have a bias of 0.6 and a value of 0.6, you get 1.2 as an image value, and your plotting program thinks you are in the 0-255 range, so everything looks black.
So try to use the Glorot initializer for the weights and the zeros initializer for the biases:
Weights:
tf.get_variable("weight", shape=[5, 5, 1, 32], initializer=tf.glorot_uniform_initializer())
Bias:
tf.get_variable("bias", shape=[32], initializer=tf.zeros_initializer())
Furthermore, it is better to use tf.get_variable rather than tf.Variable here.
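Put together, the weight and bias definitions from the question could be rewritten along these lines (a sketch using the TF 1.x API; only the conv-layer variables are shown, and the variable names are just for illustration):
W_c1 = tf.get_variable("W_c1", shape=[5, 5, 1, 32], initializer=tf.glorot_uniform_initializer())
b_c1 = tf.get_variable("b_c1", shape=[32], initializer=tf.zeros_initializer())
W_c2 = tf.get_variable("W_c2", shape=[5, 5, 32, 64], initializer=tf.glorot_uniform_initializer())
b_c2 = tf.get_variable("b_c2", shape=[64], initializer=tf.zeros_initializer())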