Related
This question is related to computing the Class Activation Map (CAM) visualization.
Source code: see cell In [24] onwards in the notebook; the relevant lines are pasted below.
model.summary() shows that the last convolutional layer is block5_conv3, with output dimensions (14, 14, 512), and that the model predicts 1000 classes.
My questions concern the lines of code in this screenshot; I have also reproduced them separately below.
In this line of code:
african_elephant_output = model.output[:, 386]
This model was trained on 1000 classes (the last line in the output of model.summary()). To understand the gradient calculation at a later step, I first want to understand how to print the length of the vector african_elephant_output and also the actual values in this feature vector.
last_conv_layer = model.get_layer('block5_conv3')
grads = K.gradients(african_elephant_output, last_conv_layer.output)[0]
The dimensions of last_conv_layer.output are (14, 14, 512). But to understand the computation performed by K.gradients, I need to know the dimensions of african_elephant_output. Should they be (1, 1, 512), so that african_elephant_output is first broadcast and the dot product is then calculated channel by channel? How can I print the dimensions of african_elephant_output, and the dimensions and values of grads?
What does axis=(0, 1, 2) refer to in this line of code:
pooled_grads = K.mean(grads, axis=(0, 1, 2))
I am assuming the grads tensor in #2 above is of shape (14, 14, 512), so the axis values 0, 1, and 2 refer to the width (0), height (1), and channel (2) dimensions. The mean is then calculated along the width and height, and we get a vector of shape (512,)?
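For reference, the shapes can be inspected directly through the Keras backend. A minimal sketch, assuming the TF1-style backend used in the book's example; x stands for a hypothetical preprocessed input batch of shape (1, 224, 224, 3):

from keras import backend as K

print(K.int_shape(african_elephant_output))  # (None,): one scalar class score per sample
print(K.int_shape(grads))                    # (None, 14, 14, 512): same shape as last_conv_layer.output
print(K.int_shape(pooled_grads))             # (512,): one mean gradient per channel

# concrete values require a concrete input image x
fetch = K.function([model.input], [african_elephant_output, grads])
score_value, grads_value = fetch([x])
print(score_value.shape, grads_value.shape)  # (1,) and (1, 14, 14, 512)

Note that african_elephant_output is not (1, 1, 512): model.output has shape (None, 1000), so slicing [:, 386] leaves a single scalar score per sample.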
This seems to be one of the common questions here (1, 2, 3), but I am still struggling to define the right input shape for PyTorch's Conv1d.
I have text sequences of length 512 (number of tokens per sequence) with each token being represented by a vector of length 768 (embedding). The batch size I am using is 6.
So my input tensor to conv1D is of shape [6, 512, 768].
input = torch.randn(6, 512, 768)
Now, I want to convolve over the length of my sequence (512) with a kernel size of 2, using PyTorch's Conv1d layer.
Understanding 1:
I assumed that "in_channels" are the embedding dimension of the conv1D layer. If so, then a conv1D layer will be defined in this way where
in_channels = embedding dimension (768)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(768, 100, 2)
feature_map = convolution_layer(input)
But with this assumption, I get the following error:
RuntimeError: Given groups=1, weight of size 100 768 2, expected input `[4, 512, 768]` to have 768 channels, but got 512 channels instead
Understanding 2:
Then I assumed that "in_channels" is the sequence length of the input. If so, then a Conv1d layer will be defined in this way, where
in_channels = sequence length (512)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(512, 100, 2)
feature_map = convolution_layer(input)
This works fine, and I get an output feature map of dimensions [batch_size, 100, 767]. However, I am confused: shouldn't the convolutional layer convolve over the sequence length of 512 and output a feature map of dimensions [batch_size, 100, 511]?
I will be really grateful for your help.
In PyTorch, your input of shape [6, 512, 768] should actually be [6, 768, 512], where the feature length is represented by the channel dimension and the sequence length by the length dimension. Then you can define your Conv1d with in/out channels of 768 and 100 respectively to get an output of [6, 100, 511].
Given an input of shape [6, 512, 768] you can convert it to the correct shape with Tensor.transpose.
input = input.transpose(1, 2).contiguous()
The .contiguous() ensures the memory of the tensor is stored contiguously which helps avoid potential issues during processing.
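Putting the two steps together, a minimal sketch:

import torch
import torch.nn as nn

input = torch.randn(6, 512, 768)            # [batch, seq_len, embedding_dim]
input = input.transpose(1, 2).contiguous()  # [6, 768, 512]: channels first
conv = nn.Conv1d(in_channels=768, out_channels=100, kernel_size=2)
out = conv(input)
print(out.shape)                            # torch.Size([6, 100, 511]): 511 bigram positions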
I found an answer to it (source).
So, usually, BERT outputs vectors of shape
[batch_size, sequence_length, embedding_dim].
where,
sequence_length = number of words or tokens in a sequence (the maximum sequence length BERT can handle is 512)
embedding_dim = the length of the vector describing each token (768 in the case of BERT).
thus, input = torch.randn(batch_size, 512, 768)
Now, we want to convolve over the text sequence of length 512 using a kernel size of 2.
So, we define a PyTorch conv1D layer as follows,
convolution_layer = nn.Conv1d(in_channels, out_channels, kernel_size)
where,
in_channels = embedding_dim
out_channels = arbitrary int
kernel_size = 2 (I want bigrams)
thus, convolution_layer = nn.Conv1d(768, 100, 2)
Now we need a connecting link between the input expected by convolution_layer and the actual input:
current input shape: [batch_size, 512, 768]
expected input shape: [batch_size, 768, 512]
To achieve the expected input shape, we use the transpose function from PyTorch.
input_transposed = input.transpose(1, 2)
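Feeding the transposed tensor through the layer then yields the expected bigram feature map:

feature_map = convolution_layer(input_transposed)
print(feature_map.shape)  # torch.Size([batch_size, 100, 511])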
I have a suggestion which may not be what you asked for but might help: because your input is (6, 512, 768), you can use Conv2d instead of Conv1d.
All you need to do is add a dimension of size 1 at index 1 with input.unsqueeze(1), which acts as your channel (think of it as a grayscale image):
def forward(self, x):
    x = self.embedding(x)      # [batch, seq_len, embedding] = [5, 512, 768]
    x = torch.unsqueeze(x, 1)  # [5, 1, 512, 768]: like a grayscale image
And your Conv2d layer can be defined like this:
window_size = 3  # for trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10  # or whatever you want
self.conv = nn.Conv2d(in_channels=1,
                      out_channels=NUM_FILTERS,
                      kernel_size=(window_size, EMBEDDING_SIZE),
                      padding=(window_size - 1, 0))
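With these settings the kernel spans the full embedding dimension, so the width axis collapses to 1 and can be squeezed away. A sketch of the resulting shapes, using the same settings outside the class:

conv = nn.Conv2d(1, NUM_FILTERS, kernel_size=(window_size, EMBEDDING_SIZE),
                 padding=(window_size - 1, 0))
x = torch.randn(5, 1, 512, 768)  # batch of 5 "grayscale images"
out = conv(x)                    # [5, 10, 514, 1]: 512 + 2*2 - 3 + 1 = 514 positions
out = out.squeeze(3)             # [5, 10, 514]: ready for pooling over positions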
I'm trying to model a Keras-based network using a set of 1D CNN and LSTM layers. Most of the available examples on the web use data with shapes such as (1, 30, 50) (1 sample containing 30 time steps with 50 features each).
However, each time step in my dataset is composed of a number of 1D arrays. A 10-time-step sample would be (1, 10, 100, 384) (1 batch with a single sample, 10 time steps, each containing 100 arrays with 384 features). So, how should I define a model for such a shape?
I could flatten each time step's data (100*384), but that seems quite inadequate, as it could void all the CNN processing... Plus, each time step's data is really 1D: it is not spatial data.
I have already defined a simple model like the one below, but I think it's using the batch_size of the input shape incorrectly: it seems to be trying to learn from 482 samples rather than from a single sample with 482 time steps.
data_input_shape = (482, 100, 384)
model = Sequential()
model.add(Conv1D(300, 1, activation="relu", input_shape=(100,384)))
model.add(MaxPooling1D(4))
model.add(Conv1D(256, 1, activation="relu"))
model.add(MaxPooling1D(4))
model.add(Conv1D(128, 1, activation="relu"))
model.add(MaxPooling1D(5))
model.add(LSTM(200, return_sequences=True))
model.add(LSTM(200, return_sequences=True))
model.add(LSTM(200, return_sequences=True))
model.add(Dense(1, activation='sigmoid'))
Any suggestions?
Let's consider the following two cases, given that, as you mentioned, the 100 arrays are not spatially correlated:
The 384 values of each feature are spatially independent.
The 384 values of each feature are spatially dependent. For example, they are values across a frequency range after some FFT or similar operation.
In case 1, you basically have 100x384 independent features, so flattening seems to be the way to go.
In case 2 though, it might make sense to apply a 2D convolution across the features. Here is how:
First, you should prepare your data in the right format. Assuming your data has 482 time steps, you should decide how many time steps you'd like in each sample. For example, you can decide on 10 time steps per sample, which with no overlap between samples gives you about 48 samples. The data would now be of shape (48, 10, 100, 384). In addition, we should add a channel dimension to be able to apply a 2D convolution in Keras, so your data becomes of shape (48, 10, 100, 384, 1).
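A sketch of that preparation in NumPy, assuming the 482 time steps live in a single array (data is a hypothetical name):

import numpy as np

data = np.random.rand(482, 100, 384)           # stand-in for the real recording
steps_per_sample = 10
n_samples = data.shape[0] // steps_per_sample  # 48 non-overlapping samples
samples = data[:n_samples * steps_per_sample].reshape(n_samples, steps_per_sample, 100, 384)
samples = samples[..., np.newaxis]             # add channel dim
print(samples.shape)                           # (48, 10, 100, 384, 1)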
Next, you can decide on the architecture. We will apply a Conv2D to each array at each time step. We use a kernel size of (1, x) or (100, x) since your arrays are not spatially related. Here is an example architecture:
model = Sequential()
model.add(TimeDistributed(Conv2D(16, (1, 5), activation="relu"), input_shape=(10, 100, 384, 1)))
model.add(TimeDistributed(MaxPooling2D((1, 2))))
model.add(TimeDistributed(Conv2D(32, (100, 9), activation="relu")))
model.add(TimeDistributed(MaxPooling2D((1, 4))))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(16, return_sequences=True))
model.add(Dense(1, activation='sigmoid'))
A few additional notes:
You can certainly add more layers of each type.
TimeDistributed is new above. You can read about it here.
If you have images to begin with, consider using a CNN/LSTM hybrid or Conv3D from the start instead of extracting 100 arrays from each image.
Take a look at ConvLSTM2D here for a combined CNN and LSTM layer.
I was trying to follow this tutorial
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
In the baseline model it has
model.add(Conv2D(32, (3, 3), input_shape=(3, 150, 150)))
I don't quite follow the output shape here. If the input shape is 3x150x150 with a kernel size of 3x3, isn't the output shape 3x148x148 (assuming no padding)? However, according to the Keras docs:
Output shape: 4D tensor with shape: (batch, filters, new_rows, new_cols)
It seems to me the output shape will be 32x148x148. Is this understanding correct? If so, where do the additional filters come from?
If the input shape is (3, 150, 150), then after applying a Conv2D layer the output is (?, 32, 148, 148). Check it with the following example:
from keras.layers import Input, Conv2D

inps = Input(shape=(3, 150, 150))
conv = Conv2D(32, (3, 3), data_format='channels_first')(inps)
print(conv)
>> Tensor("conv2d/BiasAdd:0", shape=(?, 32, 148, 148), dtype=float32)
The first dimension, indicated by the ? symbol, is the batch size.
The second dimension is the number of filters (32).
The last two are the image height and width (148).
How do the channels change from 3 to 32? Consider an RGB image (3 channels) and a single output channel first: one filter must span all 3 input channels, so a filter with kernel_size=(3, 3) actually has shape (3, 3, 3) and produces one output channel.
When you use filters=32 with kernel_size=(3, 3), you are creating 32 such filters, each of shape (3, 3, 3), which yields 32 different convolutions and hence 32 output channels. Note that, by default, Keras initializes all kernels with glorot_uniform.
Image from this blog post.
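You can confirm the filter shapes directly; a small sketch extending the snippet above:

layer = Conv2D(32, (3, 3), data_format='channels_first')
out = layer(inps)
print(layer.kernel.shape)  # (3, 3, 3, 32): one (3, 3, 3) kernel per output filter
print(layer.bias.shape)    # (32,)
# total parameters: 3*3*3*32 + 32 = 896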
In the PyTorch tutorial, the constructed network is
Net(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
It is used to process images with dimensions 1x32x32. They mention that the network cannot be used with images of a different size.
The two convolutional layers seem to allow for an arbitrary number of features, so the linear layers seem to be responsible for getting the 32x32 image down to the 10 final features.
I do not really understand how the numbers 120 and 84 are chosen there, and why the result matches the input dimensions.
And when I try to construct a similar network, I actually run into a problem with the dimensions of the data.
When I use a simpler network, for example:
Net(
(conv1): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(8, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=3, bias=True)
)
for an input of the size 3x1200x800, I get the error message:
RuntimeError: size mismatch, m1: [1 x 936144], m2: [400 x 3] at /pytorch/aten/src/TH/generic/THTensorMath.cpp:940
Where does the number 936144 come from and how do I need to design the network, such that the dimensions are matching?
The key step is between the last convolution and the first Linear block. Conv2d outputs a tensor of shape [batch_size, n_features_conv, height, width], whereas Linear expects [batch_size, n_features_lin]. To make the two align, you need to "stack" the 3 dimensions [n_features_conv, height, width] into one [n_features_lin]. It follows that n_features_lin == n_features_conv * height * width. In the original code this "stacking" is achieved by
x = x.view(-1, self.num_flat_features(x))
and if you inspect num_flat_features, it just computes this n_features_conv * height * width product. In other words, your first Linear layer must have num_flat_features(x) input features, where x is the tensor coming out of the preceding convolution. But we need to calculate this value ahead of time, so that we can initialize the network in the first place...
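For reference, num_flat_features in the tutorial simply multiplies those dimensions together:

def num_flat_features(self, x):
    size = x.size()[1:]  # all dimensions except the batch dimension
    num_features = 1
    for s in size:
        num_features *= s
    return num_features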
The calculation follows from inspecting the operations one by one.
input is 32x32
we do a 5x5 convolution without padding, so we lose 2 pixels at each side, we drop down to 28x28
we do maxpooling with receptive field of 2x2, we cut each dimension by half, down to 14x14
we do another 5x5 convolution without padding, we drop down to 10x10
we do another maxpooling, we drop down to 5x5
and this 5x5 is why in the tutorial you see self.fc1 = nn.Linear(16 * 5 * 5, 120). It's n_features_conv * height * width, when starting from a 32x32 image. If you want to have a different input size, you have to redo the above calculation and adjust your first Linear layer accordingly.
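Applying the same arithmetic to your 3x1200x800 input explains the error; a sketch, assuming your forward pass, like the tutorial's, applies 2x2 max pooling after each convolution:

h, w = 1200, 800
h, w = (h - 4) // 2, (w - 4) // 2  # conv1 (5x5, lose 4 per dim) + pool: 598 x 398
h, w = (h - 4) // 2, (w - 4) // 2  # conv2 + pool: 297 x 197
print(16 * h * w)                  # 936144, matching the error message

So for that input, the first linear layer should be nn.Linear(16 * 297 * 197, 3).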
For the further operations, it's just a chain of matrix multiplications (that's what Linear does). So the only rule is that the n_features_out of one Linear matches the n_features_in of the next one. The values 120 and 84 are entirely arbitrary, though they were probably chosen by the author such that the resulting network performs well.
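In code, the rule just means that consecutive Linear layers agree on their shared dimension, as in the tutorial:

self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 400 in, 120 out
self.fc2 = nn.Linear(120, 84)          # 120 in (matches fc1's out), 84 out
self.fc3 = nn.Linear(84, 10)           # 84 in, 10 out: one score per class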