This seems to be one of the common questions on here (1, 2, 3), but I am still struggling to define the right shape for input to PyTorch conv1D.
I have text sequences of length 512 (number of tokens per sequence) with each token being represented by a vector of length 768 (embedding). The batch size I am using is 6.
So my input tensor to conv1D is of shape [6, 512, 768].
input = torch.randn(6, 512, 768)
Now, I want to convolve over the length of my sequence (512) with a kernel size of 2 using the conv1D layer from PyTorch.
Understanding 1:
I assumed that "in_channels" are the embedding dimension of the conv1D layer. If so, then a conv1D layer will be defined in this way where
in_channels = embedding dimension (768)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(768, 100, 2)
feature_map = convolution_layer(input)
But with this assumption, I get the following error:
RuntimeError: Given groups=1, weight of size 100 768 2, expected input `[4, 512, 768]` to have 768 channels, but got 512 channels instead
Understanding 2:
Then I assumed that "in_channels" is the length of the input sequence. If so, then a conv1D layer would be defined this way, where
in_channels = sequence length (512)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(512, 100, 2)
feature_map = convolution_layer(input)
This works fine and I get an output feature map of dimension [batch_size, 100, 767]. However, I am confused. Shouldn't the convolutional layer convolve over the sequence length of 512 and output a feature map of dimension [batch_size, 100, 511]?
I will be really grateful for your help.
In PyTorch, your input of shape [6, 512, 768] should actually be [6, 768, 512], where the feature (embedding) dimension is the channel dimension and the sequence length is the length dimension. Then you can define your Conv1d with in/out channels of 768 and 100 respectively to get an output of [6, 100, 511].
Given an input of shape [6, 512, 768] you can convert it to the correct shape with Tensor.transpose.
input = input.transpose(1, 2).contiguous()
The .contiguous() ensures the memory of the tensor is stored contiguously which helps avoid potential issues during processing.
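For example, a minimal sketch putting this together with the shapes from the question (the out_channels of 100 is the arbitrary value chosen above):

import torch
import torch.nn as nn

# batch of 6 sequences, 512 tokens each, 768-dim embeddings
batch_size, seq_len, emb_dim = 6, 512, 768
x = torch.randn(batch_size, seq_len, emb_dim)   # [6, 512, 768]

conv = nn.Conv1d(in_channels=emb_dim, out_channels=100, kernel_size=2)

x = x.transpose(1, 2).contiguous()              # [6, 768, 512]
out = conv(x)
print(out.shape)                                # torch.Size([6, 100, 511])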
I found an answer to it (source).
So, usually, BERT outputs vectors of shape
[batch_size, sequence_length, embedding_dim].
where,
sequence_length = the number of words or tokens in a sequence (the maximum sequence length BERT can handle is 512)
embedding_dim = the length of the vector representing each token (768 in the case of BERT).
thus, input = torch.randn(batch_size, 512, 768)
Now, we want to convolve over the text sequence of length 512 using a kernel size of 2.
So, we define a PyTorch conv1D layer as follows,
convolution_layer = nn.Conv1d(in_channels, out_channels, kernel_size)
where,
in_channels = embedding_dim
out_channels = arbitrary int
kernel_size = 2 (I want bigrams)
thus, convolution_layer = nn.Conv1d(768, 100, 2)
Now we need a connecting link between the input expected by convolution_layer and the actual input. For this, we need to reshape the input:
current input shape: [batch_size, 512, 768]
expected input shape: [batch_size, 768, 512]
To achieve this expected input shape, we need to use the transpose function from PyTorch.
input_transposed = input.transpose(1, 2)
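As a quick check, feeding the transposed tensor to the layer defined above now gives the expected bigram feature map:

feature_map = convolution_layer(input_transposed)
print(feature_map.shape)  # [batch_size, 100, 511], i.e. torch.Size([6, 100, 511]) for batch_size = 6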
I have a suggestion for you which may not be what you asked for, but it might help. Because your input is (6, 512, 768), you can use Conv2d instead of Conv1d.
All you need to do is add a dimension of size 1 at index 1 with input.unsqueeze(1), which acts as your channel dimension (consider it a grayscale image):
def forward(self, x):
    x = self.embedding(x)      # [batch, seq_length, embedding] = [5, 512, 768]
    x = torch.unsqueeze(x, 1)  # [5, 1, 512, 768], like a grayscale image
And for your Conv2d layer, you can define it like this:
window_size = 3       # for trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10      # or whatever you want

self.conv = nn.Conv2d(in_channels=1,
                      out_channels=NUM_FILTERS,
                      kernel_size=(window_size, EMBEDDING_SIZE),
                      padding=(window_size - 1, 0))
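If it helps, here is a rough shape check of this Conv2d variant as a standalone module (a sketch; the batch size of 5 is taken from the comments above):

import torch
import torch.nn as nn

window_size = 3        # trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10

conv = nn.Conv2d(in_channels=1, out_channels=NUM_FILTERS,
                 kernel_size=(window_size, EMBEDDING_SIZE),
                 padding=(window_size - 1, 0))

x = torch.randn(5, 512, 768).unsqueeze(1)   # [5, 1, 512, 768], like a grayscale image
out = conv(x)                               # [5, 10, 514, 1]
out = out.squeeze(3)                        # [5, 10, 514], one 1D feature map per filter
print(out.shape)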
This is the question:
Before we define the model, we define the size of our alphabet. Our alphabet consists of lowercase English letters, and additionally a special character used for space between symbols or before and after the word. For the first part of this assignment, we don't need that extra character.
Our end goal is to learn to transcribe words of arbitrary length. However, first, we pre-train our simple convolutional neural net to recognize single characters. In order to be able to use the same model for one character and for entire words, we are going to design the model in a way that makes sure that the output size for one character (or when the input image size is 32x18) is 1x27, and Kx27 whenever the input image is wider. K here will depend on the particular architecture of the network, and is affected by strides, poolings, among other things. A little more formally, our model f_θ, for an input image x, gives output energies l = f_θ(x). If x ∈ ℝ^(32×18), then l ∈ ℝ^(1×27). If x ∈ ℝ^(32×100), for example, our model may output l ∈ ℝ^(10×27), where l_i corresponds to a particular window in x, for example from x_{0, 9i} to x_{32, 9i+18} (again, this will depend on the particular architecture).
The code:
# constants for number of classes in total, and for the special extra character for empty space
ALPHABET_SIZE = 27  # extra character for space in between
BETWEEN = 26
print(alphabet.shape) # RETURNS: torch.Size([32, 340])
My CNN Block:
import torch
from torch import nn
import torch.nn.functional as F
"""
Remember basics:
1. Bigger strides = less overlap
2. More filters = More features
Image shape = 32, 18
Alphabet shape = 32, 340
"""
class SimpleNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn_block = torch.nn.Sequential(
            nn.Conv2d(3, 32, 3),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 32, 3),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 32, 3),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 64, 3),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 64, 3),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2)
        )

    def forward(self, x):
        x = self.cnn_block(x)
        # after applying cnn_block, x.shape should be:
        # batch_size, alphabet_size, 1, width
        return x[:, :, 0, :].permute(0, 2, 1)
model = SimpleNet()
alphabet_energies = model(alphabet.view(1, 1, *alphabet.shape))
import matplotlib.pyplot as plt

def plot_energies(ce):
    fig = plt.figure(dpi=200)
    ax = plt.axes()
    im = ax.imshow(ce.cpu().T)

    ax.set_xlabel('window locations →')
    ax.set_ylabel('← classes')
    ax.xaxis.set_label_position('top')
    ax.set_xticks([])
    ax.set_yticks([])

    cax = fig.add_axes([ax.get_position().x1 + 0.01, ax.get_position().y0, 0.02, ax.get_position().height])
    plt.colorbar(im, cax=cax)

plot_energies(alphabet_energies[0].detach())
I get the error in the title at alphabet_energies = model(alphabet.view(1, 1, *alphabet.shape))
Any help would be appreciated.
You should start by replacing nn.Conv2d(3, 32, 3) with nn.Conv2d(1, 32, 3).
Your model begins with a Conv2d going from 3 channels to 32, but your input image has only 1 channel (a greyscale image).
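A quick way to sanity-check this (a sketch using a random stand-in for the alphabet image, just to verify the channel count):

import torch
import torch.nn as nn

alphabet = torch.randn(32, 340)            # stand-in for the real alphabet image
x = alphabet.view(1, 1, *alphabet.shape)   # [batch=1, channels=1, 32, 340]

first_conv = nn.Conv2d(1, 32, 3)           # in_channels must match the single input channel
print(first_conv(x).shape)                 # torch.Size([1, 32, 30, 338])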
I'm working on an assignment with 1D signals and I have trouble finding the right input size for the linear layer (XXX). My signals have different lengths and are padded in a batch. I read that the linear layer should always have the same input size (XXX), but I'm not sure how to find it when each batch has a different length. Does anybody have any advice?
Thanks
import torch
from torch import nn

class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=128, kernel_size=7, stride=3),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2, 3),
            nn.Conv1d(128, 32, 5, 1),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(2, 2),
            nn.Conv1d(32, 32, 5, 1),
            nn.ReLU(),
            nn.Conv1d(32, 128, 3, 2),
            nn.ReLU(),
            nn.MaxPool1d(2, 2),
            nn.Conv1d(128, 256, 7, 1),
            nn.ReLU(),
            nn.MaxPool1d(2, 2),
            nn.Conv1d(256, 512, 3, 1),
            nn.ReLU(),
            nn.Conv1d(512, 128, 3, 1),
            nn.ReLU()
        )
        self.classifier = nn.Sequential(
            nn.Linear(XXX, 512),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(512, 2)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x)
        x = self.classifier(x)
        return x
First, you need to decide on a fixed-length input. Let's assume each signal has length 2048. It won't work for any length < 1024 because of the downsampling in the previous convolution and pooling layers; if you would like signals of length < 1024, you may need to either remove a couple of convolutional layers, change the kernel_size, or remove a max-pool operation.
Assuming the fixed length is 2048, your nn.Linear layer will take 768 input neurons. To calculate this, fix an arbitrary number of input neurons for your nn.Linear layer (say 1000) and then try to print the shape of the output from the conv layers. You could do something like this in your forward call:
def forward(self, x):
    x = self.features(x)
    print('Output shape of conv layers', x.shape)
    x = x.view(-1, x.size(1) * x.size(2))
    print('Shape required to pass to the linear layer', x.shape)
    x = self.classifier(x)
    return x
This will obviously throw an error because of the shape mismatch, but it will tell you the number of input neurons required by your first nn.Linear layer. With this approach, you can run experiments with a number of different input signal lengths (1536, 2048, 4096, etc.).
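Alternatively, you can read the number off with a dummy forward pass through a copy of the convolutional stack alone (a sketch, assuming the fixed length of 2048 discussed above):

import torch
from torch import nn

# copy of the convolutional stack from the question
features = nn.Sequential(
    nn.Conv1d(1, 128, 7, 3), nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(2, 3),
    nn.Conv1d(128, 32, 5, 1), nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2, 2),
    nn.Conv1d(32, 32, 5, 1), nn.ReLU(),
    nn.Conv1d(32, 128, 3, 2), nn.ReLU(), nn.MaxPool1d(2, 2),
    nn.Conv1d(128, 256, 7, 1), nn.ReLU(), nn.MaxPool1d(2, 2),
    nn.Conv1d(256, 512, 3, 1), nn.ReLU(),
    nn.Conv1d(512, 128, 3, 1), nn.ReLU(),
)

with torch.no_grad():
    dummy = torch.randn(1, 1, 2048)   # [batch, channels, length]
    out = features(dummy)

print(out.shape)    # torch.Size([1, 128, 6])
print(out.numel())  # 768 -> the in_features for the first nn.Linear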
What's the idea behind using the following convolutional layers?
Especially nn.Conv2d(16, 16, 3, padding=1):
self.conv1 = nn.Conv2d(3, 16, 3, padding = 1 )
self.conv2 = nn.Conv2d(16, 16, 3, padding = 1)
self.conv3 = nn.Conv2d(16, 32, 3, padding = 1)
self.pool = nn.MaxPool2d(2, 2)
x = F.relu(self.conv1(x))
x = self.pool(F.relu(self.conv2(x)))
x = F.relu(self.conv3(x))
I thought Conv2d always increases the number of channels, for example from (16, 32) to (32, 64).
Is nn.Conv2d(16, 16, 3, padding=1) merely for reducing the size?
The model architecture all depends on what ultimately works best for your application, and it is always going to vary.
You are right in saying that you usually want to make your tensors deeper (in the channel dimension) in order to extract richer features, but there is no hard and fast rule about that. Having said that, sometimes you don't want to make your tensors too big, since the more channels you have, the more trainable parameters there are, making the model harder to train. This again brings me back to the very first line I said: "It all depends".
And as for the line:
nn.Conv2d(16, 16, 3, padding = 1) # stride = 1 by default.
This will keep the size of the tensor the same as the input in all 3 dimensions (height, width, and number of channels).
I will also add the formula to calculate the size of the output tensor of a convolution, for reference.
output_size = ( (input_size - filter_size + 2*padding) / stride ) + 1
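As a sketch, here is the same formula as a small helper (assuming square kernels and symmetric padding):

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    # integer division mirrors the floor applied by the conv layer
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3, padding=1, stride=1))  # 32 -> nn.Conv2d(16, 16, 3, padding=1) keeps the size
print(conv_output_size(32, 3))                       # 30 -> the same kernel without padding shrinks it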
I'm a bit new to Keras and deep learning. I'm currently trying to replicate this paper but when I'm compiling the first model (without the LSTMs) I get the following error:
"ValueError: Error when checking target: expected dense_3 to have shape (None, 120, 40) but got array with shape (8, 40, 1)"
The description of the model is this:
Input (length T is appliance specific window size)
Parallel 1D convolution with filter size 3, 5, and 7
respectively, stride=1, number of filters=32,
activation type=linear, border mode=same
Merge layer which concatenates the output of
parallel 1D convolutions
Dense layer, output_dim=128, activation type=ReLU
Dense layer, output_dim=128, activation type=ReLU
Dense layer, output_dim=T , activation type=linear
My code is this:
from keras import layers, Input
from keras.models import Model
# the window sizes (seq_length?) are 40, 1075, 465, 72 and 1246 for the kettle, dish washer,
# fridge, microwave, oven and washing machine, respectively.
def ae_net(T):
    input_layer = Input(shape=(T,))

    branch_a = layers.Conv1D(32, 3, activation='linear', padding='same', strides=1)(input_layer)
    branch_b = layers.Conv1D(32, 5, activation='linear', padding='same', strides=1)(input_layer)
    branch_c = layers.Conv1D(32, 7, activation='linear', padding='same', strides=1)(input_layer)

    merge_layer = layers.concatenate([branch_a, branch_b, branch_c], axis=1)

    dense_1 = layers.Dense(128, activation='relu')(merge_layer)
    dense_2 = layers.Dense(128, activation='relu')(dense_1)
    output_dense = layers.Dense(T, activation='linear')(dense_2)

    model = Model(input_layer, output_dense)
    return model

model = ae_net(40)
model.compile(loss='mean_absolute_error', optimizer='rmsprop')
model.fit(X, y, batch_size=8)
where X and y are numpy arrays of 8 sequences of a length of 40 values. So X.shape and y.shape are (8, 40, 1). It's actually one batch of data. The thing is I cannot understand how the output would be of shape (None, 120, 40) and what these sizes would mean.
As you noted, your shapes contain batch_size, length and channels: (8, 40, 1).
Each of your three convolutions creates a tensor like (8, 40, 32).
Your concatenation on axis=1 creates a tensor like (8, 120, 32), where 120 = 3 × 40.
Now, the dense layers only work on the last dimension (the channels in this case), leaving the length (now 120) untouched.
Solution
Now, it seems you want to keep the length at the end, so you won't need any flatten or reshape layers; you do, however, need to keep the length at 40.
You're probably concatenating on the wrong axis. Instead of the length axis (1), you should concatenate on the channels axis (2 or -1).
So, this should be your concatenate layer:
merge_layer = layers.Concatenate()([branch_a, branch_b, branch_c])
#or layers.Concatenate(axis=-1)([branch_a, branch_b, branch_c])
This will output (8, 40, 96), and the dense layers will transform the 96 into something else.
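For a quick sanity check of the two axes, here is a sketch (it assumes an explicit channel dimension in the input, i.e. shape (40, 1)):

from keras import layers, Input

inp = Input(shape=(40, 1))
a = layers.Conv1D(32, 3, padding='same')(inp)   # (None, 40, 32)
b = layers.Conv1D(32, 5, padding='same')(inp)   # (None, 40, 32)
c = layers.Conv1D(32, 7, padding='same')(inp)   # (None, 40, 32)

length_concat = layers.Concatenate(axis=1)([a, b, c])    # (None, 120, 32) -> the shape behind the error
channel_concat = layers.Concatenate(axis=-1)([a, b, c])  # (None, 40, 96)  -> keeps the length at 40

print(length_concat.shape, channel_concat.shape)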
As given in the documentation of PyTorch, the layer Conv2d uses a default dilation of 1. Does this mean that if I want to create a simple conv2d layer I would have to write
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, dilation=0)
instead of simply writing
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
Or is it the case that in PyTorch dilation=1 means the same as dilation=0, as given here in the Dilated Convolution section?
From the calculation of H_out and W_out in the PyTorch documentation, we can see that dilation=n effectively expands each 1x1 kernel element into an nxn block, with the original kernel element at the top left and the remaining positions empty (filled with 0).
Thus dilation=1 is equivalent to a standard convolution with no dilation.
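A quick shape check illustrates this (a sketch with an arbitrary 32x32 RGB input):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

default_conv = nn.Conv2d(3, 64, 3)               # dilation defaults to 1
explicit_conv = nn.Conv2d(3, 64, 3, dilation=1)  # identical behaviour

print(default_conv(x).shape)   # torch.Size([1, 64, 30, 30])
print(explicit_conv(x).shape)  # torch.Size([1, 64, 30, 30])

# dilation=2 spreads the 3x3 kernel over a 5x5 window, so the output shrinks further
dilated_conv = nn.Conv2d(3, 64, 3, dilation=2)
print(dilated_conv(x).shape)   # torch.Size([1, 64, 28, 28])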