Image Classification with LSTM and CNN in PyTorch

I'm new to neural networks and I really need some help :D
I have a video dataset. From each video I extract one image frame and one audio spectrum, saved as an image. I have two main folders: one contains the video frames and the other contains the audio spectrum images. Each main folder has 8 subfolders, one per class.
My model has two inputs: one image frame and one audio spectrum image. Each input is passed through a pretrained VGG16 in parallel for feature extraction. The two results are then concatenated into a vector of 8192 features and passed to the classification step. My problem begins here: I have to use an LSTM for the classification part. I could not combine VGG and LSTM; maybe it is not possible.
The error:
Expected input batch_size (1) to match target batch_size (16).
Any ideas?
Thank you,
Best regards
vggmodel = vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
for param in vggmodel.features.parameters():
    param.requires_grad = False
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        m = vggmodel
        for param in m.parameters():
            param.requires_grad = False
        m.classifier[6] = nn.Identity()  # replaced final FC layer with identity
        self.vgg16_modified = m
        self.rnn = nn.LSTM(
            input_size=8192,
            hidden_size=2048,
            num_layers=1,
            bidirectional=True)
        self.classifier2 = nn.Sequential(
            nn.Linear(2048, 256),
            nn.ReLU(),
            nn.Linear(256, 8),
        )
    def forward(self, x):  # x contains two images: a video frame and an audio spectrum
        y1 = self.vgg16_modified(x["videoFrame"])
        y2 = self.vgg16_modified(x["audioImage"])
        # Video frame and audio image are passed through VGG in parallel and then concatenated
        # Then the LSTM is used for classification
        y = torch.concat((y1, y2), 1)
        hidden2 = Variable(torch.zeros(2, 2048).to(device))
        c_0 = Variable(torch.zeros(2, 2048).to(device))
        output, (hidden, final_cell_state) = self.rnn(y, (hidden2, c_0))
        hidden = hidden[0].view(-1, 2048)
        output = self.classifier2(hidden)
        return output
model = MyModel()
Expected input batch_size (1) to match target batch_size (16).
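A possible reading of this error, offered as a sketch rather than a confirmed diagnosis: the concatenated tensor is 2D with shape (batch, 8192), and nn.LSTM treats a 2D input as a single unbatched sequence. The batch of 16 then becomes 16 time steps of one sample, so the final hidden state describes only one "sample". Adding an explicit sequence dimension of length 1 and using batch_first=True keeps the batch intact. The class name TwoStreamLSTMHead and the small sizes are hypothetical stand-ins for the VGG features in the post, chosen so the example runs quickly.

```python
import torch
import torch.nn as nn


class TwoStreamLSTMHead(nn.Module):
    """Hypothetical classification head: concatenates two feature vectors
    per sample and classifies them with a bidirectional LSTM."""

    def __init__(self, feat_dim=4096, hidden_size=2048, num_classes=8):
        super().__init__()
        self.rnn = nn.LSTM(
            input_size=2 * feat_dim,
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=True,
            batch_first=True,  # input is (batch, seq_len, features)
        )
        # Both directions are concatenated, hence 2 * hidden_size inputs.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, y1, y2):
        y = torch.cat((y1, y2), dim=1)   # (batch, 2 * feat_dim)
        y = y.unsqueeze(1)               # (batch, 1, 2 * feat_dim): sequence of length 1
        _, (h_n, _) = self.rnn(y)        # h_n: (2, batch, hidden_size)
        h = torch.cat((h_n[0], h_n[1]), dim=1)  # (batch, 2 * hidden_size)
        return self.classifier(h)        # (batch, num_classes)


# Small sizes for a quick check; the post uses feat_dim=4096, hidden_size=2048.
head = TwoStreamLSTMHead(feat_dim=16, hidden_size=8)
logits = head(torch.randn(16, 16), torch.randn(16, 16))
print(logits.shape)
```

Note that with a single frame per video the LSTM only ever sees a sequence of length 1, so it behaves much like a gated feed-forward layer; it starts to model time only if several frames per video are stacked along the sequence dimension.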

Related

Question about input dimension for conv2D-LSTM implement

I am a PyTorch beginner and would like help applying a conv2d-LSTM model.
I have a 2D image (1 channel x Time x Frequency) that contains time and frequency information.
I'd like to extract features automatically using conv2D and then feed them to an LSTM, because the 2D image contains time information.
According to the PyTorch documentation, the output shape of conv2D is (batch size, channels out, height out, width out) and the input shape of an LSTM with batch_first=True is (batch size, sequence length, input size). From that, I concluded that the conv2D output features need to be reshaped before they are fed into the LSTM.
I expected the CNN-LSTM model to perform well because it could learn both the characteristics and the time information of the image, but it did not reach the expected performance.
My question: when I feed data into the LSTM, is there a way for the LSTM to learn the data row by row without flattening? Should I always flatten the 2D output?
My network code and input/output shapes are as follows. (I kept the width unchanged in the conv layer to preserve time information.)
Thanks a lot
class CNN_LSTM(nn.Module):
    def __init__(self, paramArr1, paramArr2):
        super(CNN_LSTM, self).__init__()
        self.input_dim = paramArr2[0]
        self.hidden_dim = paramArr2[1]
        self.n_layers = paramArr2[2]
        self.batch_size = paramArr2[3]
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels=paramArr1[0],
                      kernel_size=(paramArr1[1], 1),
                      stride=(paramArr1[2], 1)),
            nn.BatchNorm2d(paramArr1[0]),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(paramArr1[3], 1), stride=(paramArr1[4], 1))
        )
        self.lstm = nn.LSTM(input_size=paramArr2[0],
                            hidden_size=paramArr2[1],
                            num_layers=paramArr2[2],
                            batch_first=True)
        self.linear = nn.Linear(in_features=paramArr2[1], out_features=1)
    def reset_hidden_state(self):
        self.hidden = (
            torch.zeros(self.n_layers, self.batch_size, self.hidden_dim).to(device),
            torch.zeros(self.n_layers, self.batch_size, self.hidden_dim).to(device)
        )
    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), x.size(1), -1)
        x = x.permute(0, 2, 1)
        out, (hn, cn) = self.lstm(x, self.hidden)
        out = out.squeeze()[-1, :]
        out = self.linear(out)
        return out
model input/output shape
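On the flattening question, a minimal sketch under stated assumptions: the LSTM needs one feature vector per time step, so one option is to keep the time axis as the sequence and flatten only channels and frequency bins at each step. The sketch assumes the input is laid out as (batch, 1, frequency, time), i.e. width is time, as the remark about preserving the width suggests. The class name ConvThenLSTM and all sizes are hypothetical. It also selects the last time step with out[:, -1, :], which works for any batch size.

```python
import torch
import torch.nn as nn


class ConvThenLSTM(nn.Module):
    """Hypothetical Conv2d -> LSTM model for inputs shaped (B, 1, freq, time)."""

    def __init__(self, n_freq=40, conv_channels=8, hidden_dim=16):
        super().__init__()
        # Kernels of shape (k, 1) only mix frequency bins, so the time axis
        # (width) keeps its length.
        self.conv = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=(3, 1)),
            nn.BatchNorm2d(conv_channels),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        freq_out = (n_freq - 2) // 2  # after the 3x1 conv and the 2x1 pooling
        self.lstm = nn.LSTM(conv_channels * freq_out, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (B, 1, F, T)
        x = self.conv(x)                   # (B, C, F', T)
        x = x.permute(0, 3, 1, 2)          # (B, T, C, F'): time becomes the sequence axis
        x = x.flatten(start_dim=2)         # (B, T, C * F'): one vector per time step
        out, _ = self.lstm(x)              # (B, T, hidden_dim)
        return self.linear(out[:, -1, :])  # last time step of every sample: (B, 1)


model = ConvThenLSTM()
y = model(torch.randn(4, 1, 40, 25))
print(y.shape)
```

Flattening everything into one long axis, as x.view(B, C, -1) followed by a permute does, produces a sequence that interleaves frequency and time positions, which may be one reason the LSTM has trouble using the time information.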

Understanding the architecture of an LSTM for sequence classification

I have this model in pytorch that I have been using for sequence classification.
class RoBERT_Model(nn.Module):
    def __init__(self, hidden_size = 100):
        super(RoBERT_Model, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(768, hidden_size, num_layers=1, bidirectional=False)
        self.out = nn.Linear(hidden_size, 2)
    def forward(self, grouped_pooled_outs):
        # chunks_emb = pooled_out.split_with_sizes(lengt) # splits the input tensor into a list of tensors where the length of each sublist is determined by length
        seq_lengths = torch.LongTensor([x for x in map(len, grouped_pooled_outs)]) # gets the length of each sublist in chunks_emb and returns it as an array
        batch_emb_pad = nn.utils.rnn.pad_sequence(grouped_pooled_outs, padding_value=-91, batch_first=True) # pads each sublist in chunks_emb to the largest sublist with value -91
        batch_emb = batch_emb_pad.transpose(0, 1) # (B,L,D) -> (L,B,D)
        lstm_input = nn.utils.rnn.pack_padded_sequence(batch_emb, seq_lengths, batch_first=False, enforce_sorted=False) # seq_lengths.cpu().numpy()
        packed_output, (h_t, h_c) = self.lstm(lstm_input, ) # (h_t, h_c))
        # output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, padding_value=-91)
        h_t = h_t.view(-1, self.hidden_size) # (-1, 100)
        return self.out(h_t) # logits
The issue that I am having is that I am not entirely convinced of what data is being passed to the final classification layer. I believe that only the final LSTM cell in the last layer is being used for classification. That is, there are hidden_size features that are passed to the feedforward layer.
I have depicted what I believe is going on in this figure here:
Is this understanding correct? Am I missing anything?
Thanks.
Your code is a basic LSTM for classification, working with a single RNN layer.
In your picture you have multiple LSTM layers, while in reality there is only one: H_n^0 in the picture.
The padded batch has shape (B, L, D) (batch, sequence length, features) and is transposed to (L, B, D) before packing, as the comment correctly points out.
packed_output and h_c are not used at all, hence you can change this line to: _, (h_t, _) = self.lstm(lstm_input) in order not to clutter the picture further.
h_t is the hidden state after the last step for each batch element, in general of shape (num_directions * num_layers, B, hidden_size). As this network is not bidirectional, num_directions=1, and as you have a single layer, num_layers=1 as well, hence h_t is of shape (1, B, hidden_size).
This output is reshaped to be nn.Linear compatible (this line: h_t = h_t.view(-1, self.hidden_size)), which gives you a tensor of shape (B, hidden_size).
This tensor is fed to a single nn.Linear layer.
In general, the output of the last time step of the RNN is used for each element in the batch (H_n^0 in your picture) and simply fed to the classifier.
By the way, having self.out = nn.Linear(hidden_size, 2) for classification is probably counter-productive; most likely you are performing binary classification, and self.out = nn.Linear(hidden_size, 1) with torch.nn.BCEWithLogitsLoss might be used instead. A single logit contains the information whether the label should be 0 or 1: everything smaller than 0 is considered more likely to be label 0 by the network, and everything above 0 is considered label 1.
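The shapes described in this answer can be checked with a short sketch. It assumes two made-up documents with 3 and 5 chunk embeddings of size 768 and uses the single-logit variant with BCEWithLogitsLoss; the variable names and labels are hypothetical, not taken from the original model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

lstm = nn.LSTM(768, 100, num_layers=1, bidirectional=False)
out_layer = nn.Linear(100, 1)  # one logit per sequence for binary classification

# Two hypothetical documents with 3 and 5 chunk embeddings each.
docs = [torch.randn(3, 768), torch.randn(5, 768)]
lengths = torch.tensor([len(d) for d in docs])

padded = pad_sequence(docs)  # (L_max, B, 768), since batch_first defaults to False
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

packed_out, (h_n, _) = lstm(packed)
print(h_n.shape)  # (num_directions * num_layers, B, hidden_size)

# h_n holds the hidden state at the last *valid* step of each sequence,
# which can be compared against the unpacked per-step outputs.
unpacked, _ = pad_packed_sequence(packed_out)  # (L_max, B, hidden_size)
last_step_first_doc = unpacked[lengths[0] - 1, 0]

logits = out_layer(h_n[-1]).squeeze(1)  # (B,)
labels = torch.tensor([0.0, 1.0])       # made-up binary labels
loss = nn.BCEWithLogitsLoss()(logits, labels)
```

Because the sequences are packed, the padding value never reaches the LSTM, so h_n is taken at step 3 for the first document and step 5 for the second rather than at the padded end.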

Pytorch and batches

I'm having trouble understanding how batches play a role in the PyTorch framework.
In this model:
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # 28x28x1 => 26x26x32
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        self.d1 = nn.Linear(26 * 26 * 32, 128)
        self.d2 = nn.Linear(128, 10)
    def forward(self, x):
        # 32x1x28x28 => 32x32x26x26
        x = self.conv1(x)
        x = F.relu(x)
        # flatten => 32 x (32*26*26)
        x = x.flatten(start_dim = 1)
        #x = x.view(32, -1)
        # 32 x (32*26*26) => 32x128
        x = self.d1(x)
        x = F.relu(x)
        # logits => 32x10
        logits = self.d2(x)
        out = F.softmax(logits, dim=1)
        return out
In the forward definition, we pass in some x, i.e. the aggregated images of a batch from a DataLoader. Here, the 32x1x28x28 dimension indicates that there are 32 images in a batch. Do we just ignore this fact, and PyTorch handles applying Conv2d to each sample? The forward propagation seems to be written relative to a single image.
Indeed, the network is agnostic to batches: the model is designed to classify a single image, and each layer applies the same computation to every sample along the first (batch) dimension.
So why do we need batches?
Each model has weights (a.k.a. parameters), and one needs to optimize the weights using the training images so that the model classifies images as correctly as possible.
This optimization process is usually carried out using Stochastic Gradient Descent (SGD): we use the current values of the weights to classify a batch of images. Using the predictions the current model made and the expected predictions (the "labels"), we can compute the gradient of a loss with respect to the weights and use it to improve the model.
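Both points can be illustrated with a small sketch: the batched forward pass gives the same result as classifying the images one at a time, and a single SGD step uses the loss averaged over the whole batch. TinyNet, the random images, and the random labels are hypothetical stand-ins for the model and the DataLoader output in the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyNet(nn.Module):
    """Hypothetical, smaller version of the model in the question."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
        self.d1 = nn.Linear(26 * 26 * 4, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return self.d1(x.flatten(start_dim=1))  # logits, one row per image


model = TinyNet()
batch = torch.randn(32, 1, 28, 28)        # stand-in for one DataLoader batch
labels = torch.randint(0, 10, (32,))

# 1) The batch dimension is handled sample by sample.
with torch.no_grad():
    batched = model(batch)                                               # (32, 10)
    one_by_one = torch.cat([model(img.unsqueeze(0)) for img in batch])   # (32, 10)
same = torch.allclose(batched, one_by_one, atol=1e-4)

# 2) One SGD step: the gradient comes from the loss averaged over the batch.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()
loss = F.cross_entropy(model(batch), labels)
loss.backward()
optimizer.step()
```

The sketch returns raw logits and uses F.cross_entropy, which applies log-softmax internally; this is a design choice of the example, not something stated in the original answer.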

How to train Pytorch CNN with two or more inputs

I have a big image, and multiple events in the image can affect the classification. I am thinking of splitting the big image into small chunks, extracting features from each chunk, and concatenating the outputs for prediction.
My code is like:
train_loader_1 = DataLoader(dataset=train_dataset_1, batch_size=100, shuffle=False)
train_loader_2 = DataLoader(dataset=train_dataset_2, batch_size=100, shuffle=False)
train_loader_3 = DataLoader(dataset=train_dataset_3, batch_size=100, shuffle=False)
test_loader_1 = DataLoader(dataset=test_dataset_1, batch_size=100, shuffle=True)
test_loader_2 = DataLoader(dataset=test_dataset_2, batch_size=100, shuffle=True)
test_loader_3 = DataLoader(dataset=test_dataset_3, batch_size=100, shuffle=True)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d( ... ) # set up your layer here
        self.fc1 = nn.Linear( ... ) # set up first FC layer
        self.fc2 = nn.Linear( ... ) # set up the other FC layer
    def forward(self, x1, x2, x3):
        o1 = self.conv(x1)
        o2 = self.conv(x2)
        o3 = self.conv(x3)
        # flatten each feature map per sample, then concatenate along the feature axis
        combined = torch.cat((o1.view(o1.size(0), -1),
                              o2.view(o2.size(0), -1),
                              o3.view(o3.size(0), -1)), dim=1)
        out = self.fc1(combined)
        out = self.fc2(out)
        return F.softmax(out, dim=1)
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)
for epoch in epochs:
    model.train()
    for batch_idx, (inputs, labels) in enumerate(train_loader_1):
        # I am stuck here: how do I enumerate all three train loaders to pass input_1, input_2, input_3 into the model and share the same label?
        # Please note that I have set shuffle=False in the train loaders to make sure train_loader_1, train_loader_2, train_loader_3 yield the same labels.
Thank you for your help!
Instead of using 3 separate DataLoader objects, you can use a single DataLoader in which each data point contains the 3 separate parts of the image.
Like this:
dataLoader = [[[img1_part1],[img1_part2],[img1_part3], label1], [[img2_part1],[img2_part2],[img2_part3], label2]....]
This way you can use it in the training loop as:
for img in dataLoader:
    part1, part2, part3, label = img
    optimizer.zero_grad()  # clear gradients left over from the previous step
    out = model(part1, part2, part3)
    loss = loss_fn(out, label)
    loss.backward()
    optimizer.step()
To get the image parts into that format:
You can loop over the images and append them to a list or a numpy array.
def make_parts(full_image):
    # some code
    # returns a list of image parts after converting them into torch tensors
    return [TorchTensor_of_part1, TorchTensor_of_part2, TorchTensor_of_part3]
list_of_parts_and_labels = []
for image, label in zip(full_img_data, labels):
    parts = make_parts(image)
    list_of_parts_and_labels.append([parts, torch.tensor(label)])
If you want to load your images into a DataLoader, assuming that you already have your image parts and labels in the above-mentioned format:
train_loader = torch.utils.data.DataLoader(list_of_parts_and_labels,
                                           shuffle=True, batch_size=BATCH_SIZE)
then use it as:
for data in train_loader:
    parts, label = data
    out = model(*parts)
    loss = loss_fn(out, label)
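The same idea can also be written as a custom Dataset, which avoids building the whole list in memory up front and lets shuffle=True be used safely because the parts and the label always travel together. This is a minimal sketch: ThreePartDataset is a hypothetical name, and splitting each image into three equal-width strips with torch.chunk is an assumption standing in for whatever make_parts does.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ThreePartDataset(Dataset):
    """Hypothetical dataset that yields three parts of an image plus its label."""

    def __init__(self, images, labels):
        self.images = images  # tensor of shape (N, C, H, W)
        self.labels = labels  # tensor of shape (N,)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Assumption: split along the width into three equal strips.
        part1, part2, part3 = torch.chunk(img, 3, dim=-1)
        return part1, part2, part3, self.labels[idx]


# Random stand-in data: 10 single-channel images of size 28 x 84.
images = torch.randn(10, 1, 28, 84)
labels = torch.randint(0, 2, (10,))

loader = DataLoader(ThreePartDataset(images, labels), batch_size=4, shuffle=True)

p1, p2, p3, y = next(iter(loader))
print(p1.shape, p2.shape, p3.shape, y.shape)
```

In a training loop each batch would then be unpacked as `for p1, p2, p3, y in loader:` and passed to the model as `model(p1, p2, p3)`.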

how to build a multidimensional autoencoder with pytorch

I followed this great answer on sequence autoencoders,
LSTM autoencoder always returns the average of the input sequence.
but I ran into some problems when I tried to change the code:
Question one:
Your explanation is very professional, but that problem is a little different from mine. I have attached some code that I changed from your example. My input features are 2-dimensional, and my output is the same as the input.
For example:
input_x = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
output_y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
input_x and output_y are the same: 5 timesteps, 2-dimensional features.
import torch
import torch.nn as nn
import torch.optim as optim
class LSTM(nn.Module):
    def __init__(self, input_dim, latent_dim, num_layers):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.latent_dim = latent_dim
        self.num_layers = num_layers
        self.encoder = nn.LSTM(self.input_dim, self.latent_dim, self.num_layers)
        # I changed here, to 40 dimensions, I think there is some problem
        # self.decoder = nn.LSTM(self.latent_dim, self.input_dim, self.num_layers)
        self.decoder = nn.LSTM(40, self.input_dim, self.num_layers)
    def forward(self, input):
        # Encode
        _, (last_hidden, _) = self.encoder(input)
        # It is way more general that way
        encoded = last_hidden.repeat(input.shape)
        # Decode
        y, _ = self.decoder(encoded)
        return torch.squeeze(y)
model = LSTM(input_dim=2, latent_dim=20, num_layers=1)
loss_function = nn.MSELoss()
optimizer = optim.Adam(model.parameters())
y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
x = y.view(len(y), -1, 2) # I changed here
while True:
    y_pred = model(x)
    optimizer.zero_grad()
    loss = loss_function(y_pred, y)
    loss.backward()
    optimizer.step()
    print(y_pred)
The above code learns very well. Can you help review the code and give some guidance?
When I pass 2 examples as input to the model, the model does not work.
For example, change the code:
y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
to:
y = torch.Tensor([[[0.0,0.0],[0.5,0.5]], [[0.1,0.1], [0.6,0.6]], [[0.2,0.2],[0.7,0.7]], [[0.3,0.3],[0.8,0.8]], [[0.4,0.4],[0.9,0.9]]])
When I compute the loss function, it raises errors. Can anyone help have a look?
Question two:
My training samples have different lengths.
For example:
x1 = [[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]] #with 5 timesteps
x2 = [[0.5,0.5], [0.6,0.6], [0.7,0.7]] #with only 3 timesteps
How can I feed these two training samples into the model at the same time for batch training?
Recurrent N-dimensional autoencoder
First of all, LSTMs work on 1D samples (one feature vector per time step), as they are usually used for words that are each encoded as a single vector; yours are 2D.
No worries though, one can flatten a 2D sample to 1D. An example for your case would be:
import torch
var = torch.randn(10, 32, 100, 100)
var.reshape((10, 32, -1)) # shape: [10, 32, 100 * 100]
Please notice that this is not really general: what if you were to have 3D input? The snippet below generalizes this notion to any dimensionality of your samples, provided the preceding dimensions are batch_size and seq_len:
import torch
input_size = 3  # number of trailing dimensions that make up a single sample
var = torch.randn(10, 32, 100, 100, 35)
var.reshape(var.shape[:-input_size] + (-1,)) # shape: [10, 32, 100 * 100 * 35]
Finally, you can employ it inside a neural network as follows. Look at the forward method and the constructor arguments in particular:
import torch
class LSTM(torch.nn.Module):
    # input_dim has to be size after flattening
    # For 20x20 single input it would be 400
    def __init__(
        self,
        input_dimensionality: int,
        input_dim: int,
        latent_dim: int,
        num_layers: int,
    ):
        super(LSTM, self).__init__()
        self.input_dimensionality: int = input_dimensionality
        self.input_dim: int = input_dim # It is 1d, remember
        self.latent_dim: int = latent_dim
        self.num_layers: int = num_layers
        self.encoder = torch.nn.LSTM(self.input_dim, self.latent_dim, self.num_layers)
        # You can have any latent dim you want, just output has to be exact same size as input
        # In this case, only encoder and decoder, it has to be input_dim though
        self.decoder = torch.nn.LSTM(self.latent_dim, self.input_dim, self.num_layers)
    def forward(self, input):
        # Save original size first:
        original_shape = input.shape
        # Flatten 2d (or 3d or however many you specified in constructor)
        input = input.reshape(input.shape[: -self.input_dimensionality] + (-1,))
        # Encode; input is expected as (seq_len, batch, input_dim), the nn.LSTM default layout
        _, (last_hidden, _) = self.encoder(input)
        # Repeat the last layer's hidden state once per time step: (seq_len, batch, latent_dim)
        encoded = last_hidden[-1:].repeat(input.shape[0], 1, 1)
        y, _ = self.decoder(encoded)
        # You have to reshape output to what the original was
        reshaped_y = y.reshape(original_shape)
        return torch.squeeze(reshaped_y)
Remember that you have to reshape your output in this case. It should work for any number of dimensions.
Batching
When it comes to batching sequences of different lengths, it is a little more complicated.
You have to pad each sequence in the batch before pushing it through the network. Usually the padding values are zeros; you can choose a different value with the padding_value argument of torch.nn.utils.rnn.pad_sequence.
You may check this link for an example. You will have to use functions like torch.nn.utils.rnn.pack_padded_sequence and others to make it work; you may check this answer.
Oh, since PyTorch 1.1 you don't have to sort your sequences by length in order to pack them (pass enforce_sorted=False). But when it comes to this topic, grab some tutorials; they should make things clearer.
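As a rough sketch of the padding and packing steps, using the two sequences from question two (5 and 3 timesteps, 2 features each): the encoder size of 20 mirrors latent_dim in the question, and the mask at the end is an assumption about how one might exclude padded steps from a reconstruction loss, not part of the original answer.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

x1 = torch.tensor([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]])  # 5 timesteps
x2 = torch.tensor([[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]])                          # 3 timesteps

sequences = [x1, x2]
lengths = torch.tensor([len(s) for s in sequences])

# Pad to the longest sequence: (max_len, batch, features) = (5, 2, 2).
padded = pad_sequence(sequences, padding_value=0.0)

# Pack so the LSTM skips the padded steps; no sorting needed with enforce_sorted=False.
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

encoder = nn.LSTM(input_size=2, hidden_size=20)
packed_out, (h_n, _) = encoder(packed)

# Back to a padded tensor: (max_len, batch, hidden) = (5, 2, 20).
out, out_lengths = pad_packed_sequence(packed_out)

# Boolean mask of valid (non-padded) steps, shape (max_len, batch).
mask = torch.arange(padded.size(0)).unsqueeze(1) < lengths.unsqueeze(0)
```

When computing a reconstruction loss on such a batch, the mask can be used to select only the valid steps so that the padding does not contribute to the gradient.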
Lastly: please separate your questions. Get the autoencoding working with a single example first, then move on to batching, and if you have issues there, please post a new question on StackOverflow. Thanks.

Resources