Question about input dimension for conv2D-LSTM implementation - pytorch

I am a PyTorch beginner and would like to get help applying the conv2d-LSTM model.
I have a 2D image (1 channel x Time x Frequency) that contains time and frequency information.
I'd like to extract features automatically using conv2D and then feed them to an LSTM model, because the 2D image contains time information.
According to the PyTorch documentation, the output shape of conv2D is (Batch size, Channel out, Height out, Width out) and, with batch_first=True, the input shape of LSTM is (Batch size, sequence length, input size). From that, I concluded that the conv2D output features need to be reshaped before they are fed to the LSTM network.
I expected the CNN-LSTM model to perform well because it could learn both the characteristics and the time information of the image, but it did not reach the expected performance.
My question is: when I feed data into the LSTM model, is there a way for the LSTM to learn the data row by row without flattening? Should I always flatten the 2D output?
My network code and input/output shapes are as follows. (I maintained the width size in the conv layer to preserve time information.)
Thanks a lot
class CNN_LSTM(nn.Module):
    def __init__(self, paramArr1, paramArr2):
        super(CNN_LSTM, self).__init__()
        self.input_dim = paramArr2[0]
        self.hidden_dim = paramArr2[1]
        self.n_layers = paramArr2[2]
        self.batch_size = paramArr2[3]
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels=paramArr1[0],
                      kernel_size=(paramArr1[1], 1),
                      stride=(paramArr1[2], 1)),
            nn.BatchNorm2d(paramArr1[0]),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(paramArr1[3], 1), stride=(paramArr1[4], 1))
        )
        self.lstm = nn.LSTM(input_size=paramArr2[0],
                            hidden_size=paramArr2[1],
                            num_layers=paramArr2[2],
                            batch_first=True)
        self.linear = nn.Linear(in_features=paramArr2[1], out_features=1)

    def reset_hidden_state(self):
        self.hidden = (
            torch.zeros(self.n_layers, self.batch_size, self.hidden_dim).to(device),
            torch.zeros(self.n_layers, self.batch_size, self.hidden_dim).to(device)
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), x.size(1), -1)
        x = x.permute(0, 2, 1)
        out, (hn, cn) = self.lstm(x, self.hidden)
        out = out.squeeze()[-1, :]
        out = self.linear(out)
        return out
model input/output shape
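One way to give the LSTM one step per time column, without collapsing height and width into a single sequence axis, is sketched below. It assumes the width axis of the conv output is the time axis (as the question states) and uses channels x height as the feature vector of each step. The helper name conv_to_sequence and all sizes are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


def conv_to_sequence(feat: torch.Tensor) -> torch.Tensor:
    """Turn a conv output (B, C, H, W) into an LSTM input (B, W, C * H).

    Assumption: W is the time axis, so each time step sees all channels
    and all remaining height bins as its feature vector.
    """
    b, c, h, w = feat.shape
    # Move the time axis next to the batch axis, then merge C and H.
    return feat.permute(0, 3, 1, 2).reshape(b, w, c * h)


# Toy sizes chosen only for illustration.
conv = nn.Conv2d(1, 4, kernel_size=(3, 1), stride=(2, 1))
x = torch.randn(2, 1, 9, 20)            # (batch, channel, height, width=time)
feat = conv(x)                          # (2, 4, 4, 20)
seq = conv_to_sequence(feat)            # (2, 20, 16)

lstm = nn.LSTM(input_size=seq.size(-1), hidden_size=8, batch_first=True)
out, (h_n, c_n) = lstm(seq)             # out: (2, 20, 8)
last_step = out[:, -1, :]               # last time step of every sample: (2, 8)
```

Indexing with out[:, -1, :] keeps the batch dimension, whereas out.squeeze()[-1, :] selects the last sample of the batch whenever the batch size is larger than one.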

Related

How to scale input features for CNN-LSTM pose estimation network

I am developing a CNN-LSTM network for estimating the 3D pose (x,y,z,rotx,roty,rotz) from 2D image videos. The inputs are sequences of 128x128 images, and every batch contains continuous frames. Basically, each batch considers one video, divided into sequences by shifting a sliding window of length n, as in most LSTM applications. My problem: even though the network seems to train fine (the loss converges), the LSTM predicts very similar values for each sample, repeating the same few values every time.
model
class ResNetLSTM(nn.Module):
    def __init__(self, hidden_size, num_layer):
        super(ResNetLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.model = pretrainedmodels.__dict__['resnet18'](pretrained='imagenet')
        self.lstm = nn.LSTM(512, hidden_size, num_layers = num_layer, batch_first=True)
        self.regression_layer = nn.Linear(hidden_size, out_params)

    def forward(self, x):
        batch_size, seq_len = x.shape[0], x.shape[1]
        #print('image size', x.size()) #(batch, seq, channel, height, width)
        batch_of_seq_of_feature = None
        for seq_of_image in x: #x is batch
            x = self.model.features(seq_of_image) #out (seq_len,512,3,3) == (s, feature)
            x = torch.nn.functional.adaptive_avg_pool2d(x, 1) #out (seq_len,512,1,1) == (s, feature)
            x = torch.squeeze(x).reshape(seq_len, -1) #out (seq_len,512) == (s, feature)
            batch_of_seq_of_feature = self.__concat_tensor(batch_of_seq_of_feature, x) #out (batch_size_seq,seq_len,512) == (b,s, feature)
        x, _ = self.lstm(batch_of_seq_of_feature) #out1: (batch_size_len,seq_len,out) == (b,s, feature)
        x = self.regression_layer(x)
        x_transl = torch.squeeze(x[:,-1,-3:], 1)
        x_rot = compute_rotation_matrix_from_ortho6d(torch.squeeze(x[:,-1,:6], 1))
        return x_transl, x_rot
Many blogs suggest normalizing the input of the LSTM, each sequence on its own. My question is: since the input of my LSTM is directly the output of the CNN, am I supposed to scale this input, or should I avoid doing so? Help please.
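The per-sequence loop in the forward pass above can also be written by folding the batch and sequence axes together before the CNN. A minimal sketch follows. The small stand-in backbone replaces the ResNet-18 feature extractor purely to keep the example self-contained, and all sizes are made up.

```python
import torch
import torch.nn as nn

# Stand-in feature extractor (assumption: any CNN that maps images to
# (N, F, h, w) feature maps could take its place, e.g. ResNet-18 features).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),             # (N, 16, 1, 1)
)
lstm = nn.LSTM(16, 32, batch_first=True)

x = torch.randn(4, 5, 3, 32, 32)         # (batch, seq, channel, height, width)
b, s = x.shape[:2]

frames = x.flatten(0, 1)                 # (batch * seq, channel, height, width)
feats = backbone(frames).flatten(1)      # (batch * seq, 16)
feats = feats.view(b, s, -1)             # (batch, seq, 16)

out, _ = lstm(feats)                     # (batch, seq, 32)
```

If the backbone contains batch-norm layers in training mode, their statistics are then computed over all frames of the batch rather than per sequence, so the two formulations are not strictly identical in that case.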

Image Classification with Lstm and cnn with pytorch

I'm new to neural networks. I really need some help :D
I have a video dataset. From each video I extract one image frame and one audio spectrum as an image. I have two main folders: one contains the video image frames and the other contains the audio spectrums of each video. Each of the two main folders has 8 subfolders, which are the classes.
My model has two inputs: one image frame and one audio spectrum image. Each input is passed through a pretrained vgg16 model in parallel for feature extraction. Then the results of these two inputs are concatenated into an 8192-dimensional vector and passed to the classification step. My problem begins here. I have to use an LSTM for the classification part. I could not combine VGG and LSTM; maybe it is not possible.
The error:
Expected input batch_size (1) to match target batch_size (16).
Any ideas?
Thank you,
Best regards
vggmodel = vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
for param in vggmodel.features.parameters():
    param.requires_grad = False

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        m = vggmodel
        for param in m.parameters():
            param.requires_grad = False
        m.classifier[6] = nn.Identity() # replaced final FC layer with identity
        self.vgg16_modified = m
        self.rnn = nn.LSTM(
            input_size=8192,
            hidden_size=2048,
            num_layers=1,
            bidirectional=True)
        self.classifier2 = nn.Sequential(
            nn.Linear(2048, 256),
            nn.ReLU(),
            nn.Linear(256, 8),
        )

    def forward(self, x): # x contains two images, one is the video frame and the other is the audio spectrum
        y1 = self.vgg16_modified(x["videoFrame"])
        y2 = self.vgg16_modified(x["audioImage"])
        # Video frame and audio image are passed through vgg in parallel and then concatenated
        # Then the lstm is used for classification
        y = torch.concat((y1, y2), 1)
        hidden2 = Variable(torch.zeros(2, 2048).to(device))
        c_0 = Variable(torch.zeros(2, 2048).to(device))
        output, (hidden, final_cell_state) = self.rnn(y, (hidden2, c_0))
        hidden = hidden[0].view(-1, 2048)
        output = self.classifier2(hidden)
        return output

model = MyModel()
Expected input batch_size (1) to match target batch_size (16).
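The error message is consistent with how nn.LSTM interprets a 2D input: without a batch axis, recent PyTorch versions treat a (16, features) tensor as one unbatched sequence of 16 steps, so the final hidden state has no batch dimension. The sketch below uses made-up small sizes and assumes each sample is meant to be a sequence of length 1. It is an illustration of the shapes, not a statement about how the model should be designed.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, batch = 32, 8, 16      # toy sizes for illustration
y = torch.randn(batch, feature_dim)             # concatenated CNN features

lstm = nn.LSTM(feature_dim, hidden_dim, bidirectional=True, batch_first=True)

# 2D input: interpreted as ONE sequence of 16 steps, no batch axis.
_, (h_unbatched, _) = lstm(y)
print(h_unbatched.shape)                        # torch.Size([2, 8])

# 3D input: 16 sequences of length 1 (assumption about the intended layout).
_, (h_batched, _) = lstm(y.unsqueeze(1))
print(h_batched.shape)                          # torch.Size([2, 16, 8])

# Concatenate the forward and backward final states of every sample.
features = torch.cat((h_batched[0], h_batched[1]), dim=1)   # (16, 16)
classifier = nn.Linear(2 * hidden_dim, 8)
logits = classifier(features)                   # (16, 8): one row per sample
```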

Understanding the architecture of an LSTM for sequence classification

I have this model in pytorch that I have been using for sequence classification.
class RoBERT_Model(nn.Module):
    def __init__(self, hidden_size = 100):
        self.hidden_size = hidden_size
        super(RoBERT_Model, self).__init__()
        self.lstm = nn.LSTM(768, hidden_size, num_layers=1, bidirectional=False)
        self.out = nn.Linear(hidden_size, 2)

    def forward(self, grouped_pooled_outs):
        # chunks_emb = pooled_out.split_with_sizes(lengt) # splits the input tensor into a list of tensors where the length of each sublist is determined by length
        seq_lengths = torch.LongTensor([x for x in map(len, grouped_pooled_outs)]) # gets the length of each sublist in chunks_emb and returns it as an array
        batch_emb_pad = nn.utils.rnn.pad_sequence(grouped_pooled_outs, padding_value=-91, batch_first=True) # pads each sublist in chunks_emb to the largest sublist with value -91
        batch_emb = batch_emb_pad.transpose(0, 1) # (B,L,D) -> (L,B,D)
        lstm_input = nn.utils.rnn.pack_padded_sequence(batch_emb, seq_lengths, batch_first=False, enforce_sorted=False) # seq_lengths.cpu().numpy()
        packed_output, (h_t, h_c) = self.lstm(lstm_input, ) # (h_t, h_c))
        # output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, padding_value=-91)
        h_t = h_t.view(-1, self.hidden_size) # (-1, 100)
        return self.out(h_t) # logits
The issue that I am having is that I am not entirely sure what data is being passed to the final classification layer. I believe that only the final LSTM cell in the last layer is being used for classification. That is, there are hidden_size features that are passed to the feedforward layer.
I have depicted what I believe is going on in this figure here:
Is this understanding correct? Am I missing anything?
Thanks.
Your code is a basic LSTM for classification, working with a single rnn layer.
In your picture you have multiple LSTM layers, while, in reality, there is only one, H_n^0 in the picture.
The padded batch has shape (B, L, D) and is transposed to (L, B, D) before packing, as correctly pointed out in the comment.
packed_output and h_c are not used at all, hence you can change this line to: _, (h_t, _) = self.lstm(lstm_input) in order not to clutter the picture further.
h_t is the output of the last step for each batch element, in general of shape (D * L, B, hidden_size), where D is the number of directions and L here denotes the number of layers. As this neural network is not bidirectional, D=1, and as you have a single layer, L=1 as well, hence the output is of shape (1, B, hidden_size).
This output is reshaped to be nn.Linear compatible (this line: h_t = h_t.view(-1, self.hidden_size)), which gives a tensor of shape (B, hidden_size).
This input is fed to a single nn.Linear layer.
In general, the output of the last time step from RNN is used for each element in the batch, in your picture H_n^0 and simply fed to the classifier.
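The shapes discussed above can be checked with a small experiment. The sketch below uses made-up sizes (three toy sequences with 6 features per step) and mirrors the pad, transpose and pack steps of the question. It also compares h_t with the output at the last valid step of every sequence.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
hidden_size = 4
lstm = nn.LSTM(6, hidden_size, num_layers=1, bidirectional=False)

# Three sequences of different lengths, each step a 6-dim vector (toy data).
seqs = [torch.randn(n, 6) for n in (3, 5, 2)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)          # (3, 5, 6)
packed = pack_padded_sequence(padded.transpose(0, 1), lengths,
                              batch_first=False, enforce_sorted=False)

packed_out, (h_t, _) = lstm(packed)
print(h_t.shape)                                       # torch.Size([1, 3, 4])

# h_t holds, for every sequence, the output at its last *valid* step.
out, out_lengths = pad_packed_sequence(packed_out)     # out: (5, 3, 4)
last_valid = torch.stack([out[int(n) - 1, i] for i, n in enumerate(out_lengths)])
same = torch.allclose(h_t.view(-1, hidden_size), last_valid)
```

Because the sequences are packed, the padding value never reaches the LSTM, so h_t is not affected by the padded positions.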
By the way, having self.out = nn.Linear(hidden_size, 2) in classification is probably counter-productive; most likely you are performing binary classification, and self.out = nn.Linear(hidden_size, 1) with torch.nn.BCEWithLogitsLoss might be used instead. A single logit contains the information whether the label should be 0 or 1; everything below 0 is more likely to be 0 according to the network, and everything above 0 is considered a 1 label.
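A minimal sketch of the single-logit variant follows. The random tensor h_t is only a stand-in for the LSTM's final hidden state, and the labels are made up.

```python
import torch
import torch.nn as nn

hidden_size = 100
out = nn.Linear(hidden_size, 1)              # single logit per sample
criterion = nn.BCEWithLogitsLoss()           # applies the sigmoid internally

h_t = torch.randn(4, hidden_size)            # stand-in for the final hidden state
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])  # float targets, shape (B,)

logits = out(h_t).squeeze(1)                 # (B,)
loss = criterion(logits, labels)

# A logit above 0 corresponds to a sigmoid probability above 0.5.
predicted = (logits > 0).long()
```

BCEWithLogitsLoss expects float targets with the same shape as the logits, which is why the labels are floats and the trailing dimension of the logits is squeezed away.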

Pytorch and batches

I'm having trouble understanding what role batches play in the PyTorch framework.
In this model:
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # 28x28x1 => 26x26x32
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        self.d1 = nn.Linear(26 * 26 * 32, 128)
        self.d2 = nn.Linear(128, 10)

    def forward(self, x):
        # 32x1x28x28 => 32x32x26x26
        x = self.conv1(x)
        x = F.relu(x)
        # flatten => 32 x (32*26*26)
        x = x.flatten(start_dim = 1)
        #x = x.view(32, -1)
        # 32 x (32*26*26) => 32x128
        x = self.d1(x)
        x = F.relu(x)
        # logits => 32x10
        logits = self.d2(x)
        out = F.softmax(logits, dim=1)
        return out
In the forward definition, we pass in some x, ie. aggregated images for a batch from a DataLoader. Here, the 32x1x28x28 dimension indicates that there are 32 images in a batch. Do we just ignore this fact and Pytorch handles applying Conv2d to each sample? The forward propagation seems to be just relative to a single image.
Indeed, the network is agnostic to batches: The model is designed to classify a single image.
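That the forward pass treats each sample independently can be checked directly. The sketch below rebuilds the layers of the question as an nn.Sequential (without the final softmax) and compares one batched call with 32 single-image calls. This holds for layers such as Conv2d and Linear; layers like batch norm in training mode are an exception, since they use statistics of the whole batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3),   # 28x28 -> 26x26
    nn.ReLU(),
    nn.Flatten(start_dim=1),
    nn.Linear(26 * 26 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()

batch = torch.randn(32, 1, 28, 28)     # random stand-in images
with torch.no_grad():
    batched = model(batch)             # (32, 10)
    # Feed the images one at a time, each as a batch of size 1.
    one_by_one = torch.cat([model(img.unsqueeze(0)) for img in batch])

same = torch.allclose(batched, one_by_one, atol=1e-4)
```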
So why do we need batches?
Each model has weights (aka parameters) and one needs to optimize the weights using the training images so that the model will classify images as correctly as possible.
This optimization process is usually carried out using Stochastic Gradient Descent (SGD): we use the current values of the weights to classify a batch of images. Using the predictions the current model made and the predictions we know are expected (the "labels"), we can compute a gradient with respect to the weights and improve the model.
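One such SGD step on a single mini-batch might look as follows. This is a sketch with a toy linear classifier and random stand-in data, not the model from the question; the learning rate is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # toy classifier
criterion = nn.CrossEntropyLoss()        # averages the loss over the batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(32, 1, 28, 28)      # one mini-batch of random stand-in data
labels = torch.randint(0, 10, (32,))

loss_before = criterion(model(images), labels)
optimizer.zero_grad()
loss_before.backward()                   # gradient of the batch-averaged loss
optimizer.step()                         # one SGD update of the weights

loss_after = criterion(model(images), labels)
```

The batch only enters through the loss, which averages over the 32 samples; the gradient is therefore an estimate based on those samples rather than on the whole training set.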

How can I use LSTM in pytorch for classification?

My code is as below:
class Mymodel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers, batch_size):
        super(Mymodel, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.proj = nn.Linear(hidden_size, output_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (Variable(torch.zeros(self.num_layers, self.batch_size, self.hidden_size)),
                Variable(torch.zeros(self.num_layers, self.batch_size, self.hidden_size)))

    def forward(self, x):
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        output = self.proj(lstm_out)
        result = F.sigmoid(output)
        return result
I want to use an LSTM to classify a sentence as good (1) or bad (0). Using this code, I get a result of shape time_step * batch_size * 1, not 0 or 1. How should I edit the code in order to get the classification result?
Theory:
Recall that an LSTM outputs a vector for every input in the series. You are using sentences, which are a series of words (probably converted to indices and then embedded as vectors). This code from the LSTM PyTorch tutorial makes clear exactly what I mean (***emphasis mine):
lstm = nn.LSTM(3, 3) # Input dim is 3, output dim is 3
inputs = [autograd.Variable(torch.randn((1, 3)))
          for _ in range(5)] # make a sequence of length 5
# initialize the hidden state.
hidden = (autograd.Variable(torch.randn(1, 1, 3)),
          autograd.Variable(torch.randn((1, 1, 3))))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)
# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# *** (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (autograd.Variable(torch.randn(1, 1, 3)), autograd.Variable(
    torch.randn((1, 1, 3)))) # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)
One more time: compare the last slice of "out" with "hidden" below, they are the same. Why? Well...
If you're familiar with LSTMs, I'd recommend the PyTorch LSTM docs at this point. Under the output section, notice h_t is output at every t.
Now if you aren't used to LSTM-style equations, take a look at Chris Olah's LSTM blog post. Scroll down to the diagram of the unrolled network:
As you feed your sentence in word-by-word (x_i-by-x_i+1), you get an output from each timestep. You want to interpret the entire sentence to classify it. So you must wait until the LSTM has seen all the words. That is, you need to take h_t where t is the number of words in your sentence.
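The claim that the last slice of "out" equals the final hidden state can be verified in a few lines. This sketch uses current tensor syntax instead of autograd.Variable and random toy data, with a single-layer, unidirectional LSTM as in the tutorial snippet.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=3)   # single layer, unidirectional
inputs = torch.randn(5, 1, 3)                 # (seq_len, batch, input_size)

out, (h_n, c_n) = lstm(inputs)
print(out.shape)    # torch.Size([5, 1, 3]): one hidden state per time step
print(h_n.shape)    # torch.Size([1, 1, 3]): only the final hidden state

# The last slice of "out" is the final hidden state.
same = torch.allclose(out[-1], h_n[0])
```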
Code:
Here's a coding reference. I'm not going to copy-paste the entire thing, just the relevant parts. The magic happens at self.hidden2label(lstm_out[-1])
class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, batch_size):
        ...
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        x = embeds.view(len(sentence), self.batch_size , -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        # lstm_out[-1] is the output at the last time step: (batch_size, hidden_dim)
        y = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y, dim=1)
        return log_probs
The main problem you need to figure out is in which dimension you should put your batch size when you prepare your data. As far as I know, if you don't set it in your nn.LSTM() init function, it will automatically assume that the second dim is your batch size, which is quite different from other DNN frameworks. Maybe you can try:
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
like this to ask your model to treat your first dim as the batch dim.
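The effect of batch_first on the expected shapes can be seen side by side in the following sketch, which uses made-up sizes:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 4, 7, 10, 16   # toy sizes

default_lstm = nn.LSTM(input_size, hidden_size)           # expects (seq, batch, feature)
batch_first_lstm = nn.LSTM(input_size, hidden_size, batch_first=True)  # expects (batch, seq, feature)

x = torch.randn(batch, seq_len, input_size)

out_bf, (h_bf, _) = batch_first_lstm(x)
out_default, (h_default, _) = default_lstm(x.transpose(0, 1))

print(out_bf.shape)        # torch.Size([4, 7, 16])
print(out_default.shape)   # torch.Size([7, 4, 16])
# batch_first only affects input and output; the hidden state layout is unchanged.
print(h_bf.shape, h_default.shape)   # both torch.Size([1, 4, 16])
```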
As a last layer you need a linear layer with as many outputs as the classes you want, i.e. 10 if you are doing digit classification as in MNIST. In your case, since you are doing a yes/no (1/0) classification, you have two labels/classes, so your linear layer has two outputs. I suggest adding a linear layer as
nn.Linear ( feature_size_from_previous_layer , 2)
and then train the model using a cross-entropy loss.
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
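Put together, one training step with a two-class linear layer and cross-entropy loss might look like the sketch below. The class name SentenceClassifier, the embedding layer and all sizes are hypothetical and only serve to make the example self-contained; the sentences are random token indices.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class SentenceClassifier(nn.Module):   # hypothetical name, for illustration only
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        lstm_out, _ = self.lstm(self.embedding(tokens))
        return self.proj(lstm_out[:, -1, :])     # logits from the last step: (batch, 2)


net = SentenceClassifier(vocab_size=50, embedding_dim=8, hidden_size=16)
criterion = nn.CrossEntropyLoss()                # expects raw logits and class indices
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

tokens = torch.randint(0, 50, (4, 6))            # random stand-in sentences
labels = torch.tensor([0, 1, 1, 0])              # 1 = good, 0 = bad

optimizer.zero_grad()
logits = net(tokens)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

predicted = logits.argmax(dim=1)                 # hard 0/1 decision per sentence
```

Note that nn.CrossEntropyLoss takes the raw logits, so no sigmoid or softmax is applied in forward; the 0/1 decision is obtained with argmax.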
