understanding output shape of keras Conv2DTranspose - keras

I am having a hard time understanding the output shape of keras.layers.Conv2DTranspose
Here is the prototype:
keras.layers.Conv2DTranspose(
filters,
kernel_size,
strides=(1, 1),
padding='valid',
output_padding=None,
data_format=None,
dilation_rate=(1, 1),
activation=None,
use_bias=True,
kernel_initializer='glorot_uniform',
bias_initializer='zeros',
kernel_regularizer=None,
bias_regularizer=None,
activity_regularizer=None,
kernel_constraint=None,
bias_constraint=None
)
In the documentation (https://keras.io/layers/convolutional/), I read:
If output_padding is set to None (default), the output shape is inferred.
In the code (https://github.com/keras-team/keras/blob/master/keras/layers/convolutional.py), I read:
out_height = conv_utils.deconv_length(height,
stride_h, kernel_h,
self.padding,
out_pad_h,
self.dilation_rate[0])
out_width = conv_utils.deconv_length(width,
stride_w, kernel_w,
self.padding,
out_pad_w,
self.dilation_rate[1])
if self.data_format == 'channels_first':
output_shape = (batch_size, self.filters, out_height, out_width)
else:
output_shape = (batch_size, out_height, out_width, self.filters)
and (https://github.com/keras-team/keras/blob/master/keras/utils/conv_utils.py):
def deconv_length(dim_size, stride_size, kernel_size, padding, output_padding, dilation=1):
"""Determines output length of a transposed convolution given input length.
# Arguments
dim_size: Integer, the input length.
stride_size: Integer, the stride along the dimension of `dim_size`.
kernel_size: Integer, the kernel size along the dimension of `dim_size`.
padding: One of `"same"`, `"valid"`, `"full"`.
output_padding: Integer, amount of padding along the output dimension, can be set to `None` in which case the output length is inferred.
dilation: dilation rate, integer.
# Returns
The output length (integer).
"""
assert padding in {'same', 'valid', 'full'}
if dim_size is None:
return None
# Get the dilated kernel size
kernel_size = kernel_size + (kernel_size - 1) * (dilation - 1)
# Infer length if output padding is None, else compute the exact length
if output_padding is None:
if padding == 'valid':
dim_size = dim_size * stride_size + max(kernel_size - stride_size, 0)
elif padding == 'full':
dim_size = dim_size * stride_size - (stride_size + kernel_size - 2)
elif padding == 'same':
dim_size = dim_size * stride_size
else:
if padding == 'same':
pad = kernel_size // 2
elif padding == 'valid':
pad = 0
elif padding == 'full':
pad = kernel_size - 1
dim_size = ((dim_size - 1) * stride_size + kernel_size - 2 * pad + output_padding)
return dim_size
I understand that Conv2DTranspose is kind of a Conv2D, but reversed.
Since applying a Conv2D with kernel_size = (3, 3), strides = (10, 10) and padding = "same" to a 200x200 image will output a 20x20 image,
I assume that applying a Conv2DTranspose with kernel_size = (3, 3), strides = (10, 10) and padding = "same" to a 20x20 image will output a 200x200 image.
Also, applying a Conv2D with kernel_size = (3, 3), strides = (10, 10) and padding = "same" to a 195x195 image will also output a 20x20 image.
So, I understand that there is kind of an ambiguity on the output shape when applying a Conv2DTranspose with kernel_size = (3, 3), strides = (10, 10) and padding = "same" (user might want output to be 195x195, or 200x200, or many other compatible shapes).
I assume that "the output shape is inferred." means that a default output shape is computed according to the parameters of the layer, and I assume that there is a mechanism to specify an output shape differnet from the default one, if necessary.
This said, I do not really understand
the meaning of the "output_padding" parameter
the interactions between parameters "padding" and "output_padding"
the various formulas in the function keras.conv_utils.deconv_length
Could someone explain this?
Many thanks,
Julien

I may have found a (partial) answer.
I found it in the Pytorch documentation, which appears to be much clearer than the Keras documentation on this topic.
When applying Conv2D with a stride greater than 1 to images which dimensions are close, we get output images with the same dimensions.
For instance, when applied a Conv2D with kernel size of 3x3, stride of 7x7 and padding "same", the following image dimensions
22x22, 23x23, ..., 28x28, 22x28, 28x22, 27x24, etc. (7x7 = 49
combinations)
will ALL yield an output dimension of 4x4.
That is because output_dimension = ceiling(input_dimension / stride).
As a consequence, when applying a Conv2DTranspose with kernel size of 3x3, stride of 7x7 and padding "same", there is an ambiguity about the output dimension.
Any of the 49 possible output dimensions would be correct.
The parameter output_padding is a way to resolve the ambiguity by choosing explicitly the output dimension.
In my example, the minimum output size is 22x22, and output_padding provides a number of lines (between 0 and 6) to add at the bottom of the output image and a number of columns (between 0 and 6) to add at the right of the output image.
So I can get output_dimensions = 24x25 if I use outout_padding = (2, 3)
What I still do not understand, however, is the logic that keras uses to choose a certain output image dimension when output_padding is not specified (when it 'infers" the output shape)
A few pointers:
https://pytorch.org/docs/stable/nn.html#torch.nn.ConvTranspose2d
https://discuss.pytorch.org/t/the-output-size-of-convtranspose2d-differs-from-the-expected-output-size/1876/5
https://discuss.pytorch.org/t/question-about-the-output-padding-in-nn-convtrasnpose2d/19740
https://discuss.pytorch.org/t/what-does-output-padding-exactly-do-in-convtranspose2d/2688
So to answer my own questions:
the meaning of the "output_padding" parameter: see above
the interactions between parameters "padding" and "output_padding": these parameters are independant
the various formulas in the function keras.conv_utils.deconv_length
For now, I do not understand the part when output_padding is None;
I ignore the case when padding == 'full' (not supported by Conv2DTranspose);
The formula for padding == 'valid' seems correct (can be computed by reversing the formula of Conv2D)
The formula for padding == 'same' seems incorrect to me, in case kernel_size is even. (As a matter of fact, keras crashes when trying to build a Conv2DTranspose layer with input_dimension = 5x5, kernel_size = 2x2, stride = 7x7 and padding = 'same'. It appears to me that there is a bug in keras, I will start another thread for this topic...)

Outpadding in Conv2DTranspose is also what I am concerned about when designing an autoencoder.
Assume stride is always 1. Along the encoder path, for each convolution layer, I chose padding='valid', which means that if my input image is HXW, and the filter is sized mXn, the output of the layer will be (H-(m-1))X(W-(n-1)).
In the corresponding Con2DTranspose layer along the decoder path, if I use Theano, in order to resume the input size of its corresponding Con2D, I have to chose padding='full', and out_padding = None or 0 (no difference), which implies the input size will be expanded by [m-1, n-1] around it, that is, (m-1)/2 for top and bottom, and (n-1)/2 for left and right.
If I use tensorflow, I will have to choose padding = 'same', and out_padding = 2*((filter_size-1)//2), I think that is Keras' intended behaviour.
If stride is not 1, then you will have to calculate carefully how many output paddings are to be added.
In Conv2D out_size = floor(in_size+2*padding_size-filter_size)/stride+1)
If we choose padding = 'same', Keras will automatically set padding = (filter_size-1)/2; whilst if we choose 'valid', padding_size will be set 0, which is the convention of any N-D convolutions.
Conversely, in Con2DTranspose out_size = (in_size-1)*stride+filter_size-2*padding_size
where padding_size refers to how many pixels will actually be padded caused by 'padding' option and out_padding together. Based upon the discussion above, there is no 'full' option on tensorflow, we will have to use out_padding to resume the input size of its corresponding Con2D.
Could you try and see if it works properly and let me know, please?
So in summary, I think out_padding is used for facilitating different backends.

When output_padding=None, Keras uses the deconv_output_length method to compute the output length, which sets it to:
if padding == 'valid':
length = input_length * stride + max(filter_size - stride, 0)
elif padding == 'same':
length = input_length * stride
Now in the documentation it says that if output_padding is set, the output length will be
((input_length - 1) * stride + filter_size - 2 * padding + output_padding
So using this we can figure out what the default output_padding is.
In the padding='valid' case, padding = 0 in the above, so solving for output_padding:
output_padding = max(stride - filter_size, 0)
padding='valid'
In this case, padding = 0 in the above, so solving for output_padding:
output_padding = max(stride - filter_size, 0)
and one can check that setting this results in the same as setting it to None
padding = 'same'
This case is much more mysterious, and in fact it seems to be impossible to get the same as output_padding=None by setting it to any integer. For example with strides=2 and kernel_size=2, for an output_padding larger than 1, it gives a warning that the stride must be larger than the output padding. For anything smaller than 1 it gives a warning that the size of out_backprop doesn't match computed. So the only value that works is 1, but this results in a different output shape from None.
In fact it is not implemented by setting output_padding to some default value, it is only used to compute the output shape, which then is used in the convolution method.

Related

How does calculation in a GRU layer take place

So I want to understand exactly how the outputs and hidden state of a GRU cell are calculated.
I obtained the pre-trained model from here and the GRU layer has been defined as nn.GRU(96, 96, bias=True).
I looked at the the PyTorch Documentation and confirmed the dimensions of the weights and bias as:
weight_ih_l0: (288, 96)
weight_hh_l0: (288, 96)
bias_ih_l0: (288)
bias_hh_l0: (288)
My input size and output size are (1000, 8, 96). I understand that there are 1000 tensors, each of size (8, 96). The hidden state is (1, 8, 96), which is one tensor of size (8, 96).
I have also printed the variable batch_first and found it to be False. This means that:
Sequence length: L=1000
Batch size: B=8
Input size: Hin=96
Now going by the equations from the documentation, for the reset gate, I need to multiply the weight by the input x. But my weights are 2-dimensions and my input has three dimensions.
Here is what I've tried, I took the first (8, 96) matrix from my input and multiplied it with the transpose of my weight matrix:
Input (8, 96) x Weight (96, 288) = (8, 288)
Then I add the bias by replicating the (288) eight times to give (8, 288). This would give the size of r(t) as (8, 288). Similarly, z(t) would also be (8, 288).
This r(t) is used in n(t), since Hadamard product is used, both the matrices being multiplied have to be the same size that is (8, 288). This implies that n(t) is also (8, 288).
Finally, h(t) is the Hadamard produce and matrix addition, which would give the size of h(t) as (8, 288) which is wrong.
Where am I going wrong in this process?
TLDR; This confusion comes from the fact that the weights of the layer are the concatenation of input_hidden and hidden-hidden respectively.
- nn.GRU layer weight/bias layout
You can take a closer look at what's inside the GRU layer implementation torch.nn.GRU by peaking through the weights and biases.
>>> gru = nn.GRU(input_size=96, hidden_size=96, num_layers=1)
First the parameters of the GRU layer:
>>> gru._all_weights
[['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']]
You can look at gru.state_dict() to get the dictionary of weights of the layer.
We have two weights and two biases, _ih stands for 'input-hidden' and _hh stands for 'hidden-hidden'.
For more efficient computation the parameters have been concatenated together, as the documentation page clearly explains (| means concatenation). In this particular example num_layers=1 and k=0:
~GRU.weight_ih_l[k] – the learnable input-hidden weights of the layer (W_ir | W_iz | W_in), of shape (3*hidden_size, input_size).
~GRU.weight_hh_l[k] – the learnable hidden-hidden weights of the layer (W_hr | W_hz | W_hn), of shape (3*hidden_size, hidden_size).
~GRU.bias_ih_l[k] – the learnable input-hidden bias of the layer (b_ir | b_iz | b_in), of shape (3*hidden_size).
~GRU.bias_hh_l[k] – the learnable hidden-hidden bias of the (b_hr | b_hz | b_hn).
For further inspection we can get those split up with the following code:
>>> W_ih, W_hh, b_ih, b_hh = gru._flat_weights
>>> W_ir, W_iz, W_in = W_ih.split(H_in)
>>> W_hr, W_hz, W_hn = W_hh.split(H_in)
>>> b_ir, b_iz, b_in = b_ih.split(H_in)
>>> b_hr, b_hz, b_hn = b_hh.split(H_in)
Now we have the 12 tensor parameters sorted out.
- Expressions
The four expressions for a GRU layer: r_t, z_t, n_t, and h_t, are computed at each timestep.
The first operation is r_t = σ(W_ir#x_t + b_ir + W_hr#h + b_hr). I used the # sign to designate the matrix multiplication operator (__matmul__). Remember W_ir is shaped (H_in=input_size, hidden_size) while x_t contains the element at step t from the x sequence. Tensor x_t = x[t] is shaped as (N=batch_size, H_in=input_size). At this point, it's simply a matrix multiplication between the input x[t] and the weight matrix. The resulting tensor r is shaped (N, hidden_size=H_in):
>>> (x[t]#W_ir.T).shape
(8, 96)
The same is true for all other weight multiplication operations performed. As a result, you end up with an output tensor shaped (N, H_out=hidden_size).
In the following expressions h is the tensor containing the hidden state of the previous step for each element in the batch, i.e. shaped (N, hidden_size=H_out), since num_layers=1, i.e. there's a single hidden layer.
>>> r_t = torch.sigmoid(x[t]#W_ir.T + b_ir + h#W_hr.T + b_hr)
>>> r_t.shape
(8, 96)
>>> z_t = torch.sigmoid(x[t]#W_iz.T + b_iz + h#W_hz.T + b_hz)
>>> z_t.shape
(8, 96)
The output of the layer is the concatenation of the computed h tensors at
consecutive timesteps t (between 0 and L-1).
- Demonstration
Here is a minimal example of an nn.GRU inference manually computed:
Parameters
Description
Values
H_in
feature size
3
H_out
hidden size
2
L
sequence length
3
N
batch size
1
k
number of layers
1
Setup:
gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=k)
W_ih, W_hh, b_ih, b_hh = gru._flat_weights
W_ir, W_iz, W_in = W_ih.split(H_out)
W_hr, W_hz, W_hn = W_hh.split(H_out)
b_ir, b_iz, b_in = b_ih.split(H_out)
b_hr, b_hz, b_hn = b_hh.split(H_out)
Random input:
x = torch.rand(L, N, H_in)
Inference loop:
output = []
h = torch.zeros(1, N, H_out)
for t in range(L):
r = torch.sigmoid(x[t]#W_ir.T + b_ir + h#W_hr.T + b_hr)
z = torch.sigmoid(x[t]#W_iz.T + b_iz + h#W_hz.T + b_hz)
n = torch.tanh(x[t]#W_in.T + b_in + r*(h#W_hn.T + b_hn))
h = (1-z)*n + z*h
output.append(h)
The final output is given by the stacking the tensors h at consecutive timesteps:
>>> torch.vstack(output)
tensor([[[0.1086, 0.0362]],
[[0.2150, 0.0108]],
[[0.3020, 0.0352]]], grad_fn=<CatBackward>)
In this case the output shape is (L, N, H_out), i.e. (3, 1, 2).
Which you can compare with output, _ = gru(x).

what the Conv1D with kernel size equal to 1 do?

I read an example of using LSTM with CONV1.
(Took it from: CNN LSTM)
Conv1D(filters=64, kernel_size=1, activation='relu')
I understand that the dimension of the convolutional is 1 (one dim with size 1))
what is the value of the convolution ? (what is the value of the matrix 1*1 ?)
I can't figure out what is the filters=64 ? what does it mean ?
Is the relu activation function work on the output of the convolutional ? (from what I read it seems like that, but I'm not sure)
what is the motivation to use convolutional with kernel_size = 1, as we do here ?
filters
filters = 64 means number of separate filters used is 64.
Each filter will output 1 channel. i.e. here 64 filters operate on input to produce 64 different channels(or vectors). Hence filters parameter determines number of output channels.
kernel_size
kernel_size determines the size of the convolution window. Suppose kernel_size = 1 then each kernel will have dimension of in_channels x 1. Hence each kernel weight will be in_channels x 1 dimension tensor.
activation = relu
That means relu activation will be applied on the output of convolution operation.
kernel_size = 1 convolution
Used to reduce depth channels with applying non-linearity. It will do something like weighted average across the channels while keeping receptive field.
In your eg: filters = 64, kernel_size = 1, activation = relu
Suppose input feature map has size of 100 x 10(100 channels). Then the layer weight will of dimension 64 x 100 x 1. The output size will be 64 x 10.

Calculating shape of conv1d layer in Pytorch

How we can calculate the shape of conv1d layer in PyTorch. IS there any command to calculate size and shape of these layers in PyTorch.
nn.Conv1d(depth_1, depth_2, kernel_size=kernel_size_2, stride=stride_size),
nn.ReLU(),
nn.MaxPool1d(kernel_size=2, stride=stride_size),
nn.Dropout(0.25)```
The output size can be calculated as shown in the documentation nn.Conv1d - Shape:
The batch size remains unchanged and you already know the number of channels, since you specified them when creating the convolution (depth_2 in this example).
Only the length needs to be calculated and you can do that with a simple function analogous to the formula above:
def calculate_output_length(length_in, kernel_size, stride=1, padding=0, dilation=1):
return (length_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
The default values specified are also the default values of nn.Conv1d, therefore you only need to specify what you also specify to create the convolution. It uses an integer division //, because the numerator might be not be divisible by stride, in which case it just gets rounded down (indicated by the brackets that are only closed at towards the bottom).
The same formula also applies to nn.MaxPool1d, but keep in mind that it automatically sets stride = kernel_size if stride is not specified.

RuntimeError: Given groups=3, weight of size 12 64 3 768, expected input[32, 12, 30, 768] to have 192 channels, but got 12 channels instead

I started working with Pytorch recently so my understanding of it isn't quite strong. I previously had a 1 layer CNN but wanted to extend it to 2 layers, but the input and output channels have been throwing errors I can seem to decipher. Why does it expect 192 channels? Can someone give me a pointer to help me understand this better? I have seen several related problems on here, but I don't understand those solutions either.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel, BertTokenizer
import math
from transformers import AdamW, get_linear_schedule_with_warmup
def pad_sents(sents, pad_token): # Pad list of sentences according to the longest sentence in the batch.
sents_padded = []
max_len = max(len(s) for s in sents)
for s in sents:
padded = [pad_token] * max_len
padded[:len(s)] = s
sents_padded.append(padded)
return sents_padded
def sents_to_tensor(tokenizer, sents, device):
tokens_list = [tokenizer.tokenize(str(sent)) for sent in sents]
sents_lengths = [len(tokens) for tokens in tokens_list]
tokens_list_padded = pad_sents(tokens_list, '[PAD]')
sents_lengths = torch.tensor(sents_lengths, device=device)
masks = []
for tokens in tokens_list_padded:
mask = [0 if token == '[PAD]' else 1 for token in tokens]
masks.append(mask)
masks_tensor = torch.tensor(masks, dtype=torch.long, device=device)
tokens_id_list = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list_padded]
sents_tensor = torch.tensor(tokens_id_list, dtype=torch.long, device=device)
return sents_tensor, masks_tensor, sents_lengths
class ConvModel(nn.Module):
def __init__(self, device, dropout_rate, n_class, out_channel=16):
super(ConvModel, self).__init__()
self.bert_config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
self.dropout_rate = dropout_rate
self.n_class = n_class
self.out_channel = out_channel
self.bert = BertModel.from_pretrained('bert-base-uncased', config=self.bert_config)
self.out_channels = self.bert.config.num_hidden_layers * self.out_channel
self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', config=self.bert_config)
self.conv = nn.Conv2d(in_channels=self.bert.config.num_hidden_layers,
out_channels=self.out_channels,
kernel_size=(3, self.bert.config.hidden_size),
groups=self.bert.config.num_hidden_layers)
self.conv1 = nn.Conv2d(in_channels=self.out_channels,
out_channels=48,
kernel_size=(3, self.bert.config.hidden_size),
groups=self.bert.config.num_hidden_layers)
self.hidden_to_softmax = nn.Linear(self.out_channels, self.n_class, bias=True)
self.dropout = nn.Dropout(p=self.dropout_rate)
self.device = device
def forward(self, sents):
sents_tensor, masks_tensor, sents_lengths = sents_to_tensor(self.tokenizer, sents, self.device)
encoded_layers = self.bert(input_ids=sents_tensor, attention_mask=masks_tensor)
hidden_encoded_layer = encoded_layers[2]
hidden_encoded_layer = hidden_encoded_layer[0]
hidden_encoded_layer = torch.unsqueeze(hidden_encoded_layer, dim=1)
hidden_encoded_layer = hidden_encoded_layer.repeat(1, 12, 1, 1)
conv_out = self.conv(hidden_encoded_layer) # (batch_size, channel_out, some_length, 1)
conv_out = self.conv1(conv_out)
conv_out = torch.squeeze(conv_out, dim=3) # (batch_size, channel_out, some_length)
conv_out, _ = torch.max(conv_out, dim=2) # (batch_size, channel_out)
pre_softmax = self.hidden_to_softmax(conv_out)
return pre_softmax
def batch_iter(data, batch_size, shuffle=False, bert=None):
batch_num = math.ceil(data.shape[0] / batch_size)
index_array = list(range(data.shape[0]))
if shuffle:
data = data.sample(frac=1)
for i in range(batch_num):
indices = index_array[i * batch_size: (i + 1) * batch_size]
examples = data.iloc[indices]
sents = list(examples.train_BERT_tweet)
targets = list(examples.train_label.values)
yield sents, targets # list[list[str]] if not bert else list[str], list[int]
def train():
label_name = ['Yes', 'Maybe', 'No']
device = torch.device("cpu")
df_train = pd.read_csv('trainn.csv') # , index_col=0)
train_label = dict(df_train.train_label.value_counts())
label_max = float(max(train_label.values()))
train_label_weight = torch.tensor([label_max / train_label[i] for i in range(len(train_label))], device=device)
model = ConvModel(device=device, dropout_rate=0.2, n_class=len(label_name))
optimizer = AdamW(model.parameters(), lr=1e-3, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000) # changed the last 2 arguments to old ones
model = model.to(device)
model.train()
cn_loss = torch.nn.CrossEntropyLoss(weight=train_label_weight, reduction='mean')
train_batch_size = 16
for epoch in range(1):
for sents, targets in batch_iter(df_train, batch_size=train_batch_size, shuffle=True): # for each epoch
optimizer.zero_grad()
pre_softmax = model(sents)
loss = cn_loss(pre_softmax, torch.tensor(targets, dtype=torch.long, device=device))
loss.backward()
optimizer.step()
scheduler.step()
TrainingModel = train()
Here's a snippet of data https://github.com/Kosisochi/DataSnippet
It seems that the original version of the code you had in this question behaved differently. The final version of the code you have here gives me a different error from what you posted, more specifically - this:
RuntimeError: Calculated padded input size per channel: (20 x 1). Kernel size: (3 x 768). Kernel size can't be greater than actual input size
I apologize if I misunderstood the situation, but it seems to me that your understanding of what exactly nn.Conv2d layer does is not 100% clear and that is the main source of your struggle. I interpret the part "detailed explanation on 2 layer CNN in Pytorch" you requested as an ask to explain in detail on how that layer works and I hope that after this is done there will be no problem applying it 1 time, 2 times or more.
You can find all the documentation about the layer here, but let me give you a recap which hopefully will help to understand more the errors you're getting.
First of all nn.Conv2d inputs are 4-d tensors of the shape (BatchSize, ChannelsIn, Height, Width) and outputs are 4-d tensors of the shape (BatchSize, ChannelsOut, HeightOut, WidthOut). The simplest way to think about nn.Conv2d is of something applied to 2d images with pixel grid of size Height x Width and having ChannelsIn different colors or features per pixel. Even if your inputs have nothing to do with actual images the behavior of the layer is still the same. Simplest situation is when the nn.Conv2d is not using padding (as in your code). In that case the kernel_size=(kernel_height, kernel_width) argument specifies the rectangle which you can imagine sweeping through Height x Width rectangle of your inputs and producing one pixel for each valid position. Without padding the coordinate of the rectangle's point can be any pair of indicies (x, y) with x between 0 and Height - kernel_height and y between 0 and Width - kernel_width. Thus the output will look like a 2d image of size (Height - kernel_height + 1) x (Width - kernel_width + 1) and will have as many output channels as specified to nn.Conv2d constructor, so the output tensor will be of shape (BatchSize, ChannelsOut, Height - kernel_height + 1, Width - kernel_width + 1).
The parameter groups is not affecting how shapes are changed by the layer - it is only controlling which input channels are used as inputs for the output channels (groups=1 means that every input channel is used as input for every output channel, otherwise input and output channels are divided into corresponding number of groups and only input channels from group i are used as inputs for the output channels from group i).
Now in your current version of the code you have BatchSize = 16 and the output of pre-trained model is (BatchSize, DynamicSize, 768) with DynamicSize depending on the input, e.g. 22. You then introduce additional dimension as axis 1 with unsqueeze and repeat the values along that dimension transforming the tensor of shape (16, 22, 768) into (16, 12, 22, 768). Effectively you are using the output of the pre-trained model as 12-channel (with each channel having same values as others) 2-d images here of size (22, 768), where 22 is not fixed (depends on the batch). Then you apply a nn.Conv2d with kernel size (3, 768) - which means that there is no "wiggle room" for width and output 2-d images will be of size (20, 1) and since your layer has 192 channels final size of the output of first convolution layer has shape (16, 192, 20, 1). Then you try to apply second layer of convolution on top of that with kernel size (3, 768) again, but since your 2-d "image" is now just (20 x 1) there is no valid position to fit (3, 768) kernel rectangle inside a rectangle (20 x 1) which leads to the error message Kernel size can't be greater than actual input size.
Hope this explanation helps. Now to the choices you have to avoid the issue:
(a) is to add padding in such a way that the size of the output is not changing comparing to input (I won't go into details here,
because I don't think this is what you need)
(b) Use smaller kernel on both first and/or second convolutions (e.g. if you don't change first convolution the only valid width for
the second kernel would be 1).
(c) Looking at what you're trying to do my guess is that you actually don't want to use 2d convolution, you want 1d convolution (on the sequence) with every position described by 768 values. When you're using one convolution layer with 768 width kernel (and same 768 width input) you're effectively doing exactly same thing as 1d convolution with 768 input channels, but then if you try to apply second one you have a problem. You can specify kernel width as 1 for the next layer(s) and that will work for you, but a more correct way would be to transpose pre-trained model's output tensor by switching the last dimensions - getting shape (16, 768, DynamicSize) from (16, DynamicSize, 768) and then apply nn.Conv1d layer with 768 input channels and arbitrary ChannelsOut as output channels and 1d kernel_size=3 (meaning you look at 3 consecutive elements of the sequence for convolution). If you do that than without padding input shape of (16, 768, DynamicSize) will become (16, ChannelsOut, DynamicSize-2), and after you apply second Conv1d with e.g. the same settings as first one you'll get a tensor of shape (16, ChannelsOut, DynamicSize-4), etc. (each time the 1d length will shrink by kernel_size-1). You can always change number of channels/kernel_size for each subsequent convolution layer too.

convLSTM2D dilation_rate seems not to work

I'm trying to do an upsampling using the dilation_rate from the convLSTM2D (Keras with Tenosrflow as backend)
input = Input(shape=(10, 64, 64, 1), name='encoder_input')
layer1 = ConvLSTM2D(filters=33, kernel_size=(5,5), dilation_rate=(2, 2))
model = Model(input, layer1(input))
plot_model(model, show_shapes=True, show_layer_names=True)
I would expect the output shape to be (None,128,128,33) but I got (None,64,64,33).
Wouldn't this dilation_rate=(2, 2) be the opposite to strides=(2, 2)?
Dilation, unlike stride, does not change the shape of the data. It simply increases the "spread" of the kernels. In this gif, you can see how it works:
The only change in the shape of the data comes from cutting off 2 from each side, because no padding is used.

Resources