How can I use an LSTM in PyTorch for classification?

My code is as below:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class Mymodel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers, batch_size):
        super(Mymodel, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.proj = nn.Linear(hidden_size, output_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (Variable(torch.zeros(self.num_layers, self.batch_size, self.hidden_size)),
                Variable(torch.zeros(self.num_layers, self.batch_size, self.hidden_size)))

    def forward(self, x):
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        output = self.proj(lstm_out)
        result = F.sigmoid(output)
        return result
I want to use an LSTM to classify a sentence as good (1) or bad (0). With this code I get a result of shape time_step × batch_size × 1, not a 0 or 1. How should I edit the code to get the classification result?

Theory:
Recall that an LSTM outputs a vector for every input in the series. You are using sentences, which are a series of words (probably converted to indices and then embedded as vectors). This code from the LSTM PyTorch tutorial makes clear exactly what I mean (***emphasis mine):
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [autograd.Variable(torch.randn((1, 3)))
          for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (autograd.Variable(torch.randn(1, 1, 3)),
          autograd.Variable(torch.randn((1, 1, 3))))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# *** (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (autograd.Variable(torch.randn(1, 1, 3)),
          autograd.Variable(torch.randn((1, 1, 3))))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)
One more time: compare the last slice of "out" with "hidden" below, they are the same. Why? Well...
If you're familiar with LSTMs, I'd recommend the PyTorch LSTM docs at this point. Under the output section, notice that h_t is output at every t.
If you aren't used to LSTM-style equations, take a look at Chris Olah's LSTM blog post and scroll down to the diagram of the unrolled network.
As you feed your sentence in word by word (x_i, then x_{i+1}, and so on), you get an output from each timestep. But you want to interpret the entire sentence to classify it, so you must wait until the LSTM has seen all the words. That is, you need to take h_t where t is the number of words in your sentence.
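You can verify the "last slice of out equals hidden" claim directly; here is a minimal sketch using plain tensors (the Variable wrapper is no longer needed in current PyTorch):
import torch
import torch.nn as nn

lstm = nn.LSTM(3, 3)                    # input dim 3, hidden dim 3
inputs = torch.randn(5, 1, 3)           # (seq_len, batch, input_size)
hidden = (torch.zeros(1, 1, 3), torch.zeros(1, 1, 3))

out, (h_n, c_n) = lstm(inputs, hidden)
# out[-1] is the hidden state after the final word; h_n[0] is the same vector
print(torch.allclose(out[-1], h_n[0]))  # True for a single-layer, unidirectional LSTM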
Code:
Here's a coding reference. I'm not going to copy-paste the entire thing, just the relevant parts. The magic happens at self.hidden2label(lstm_out[-1]).
class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, batch_size):
        ...
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        x = embeds.view(len(sentence), self.batch_size, -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        y = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y, dim=1)  # explicit dim; calling log_softmax without it is deprecated
        return log_probs
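To get the hard 0/1 label the original question asks for, take the argmax of those log-probabilities over the class dimension. A minimal sketch with stand-in values (the real log_probs would come from LSTMClassifier.forward):
import torch

# stand-in for the (batch_size, label_size) log-probabilities the model returns
log_probs = torch.log_softmax(torch.randn(4, 2), dim=1)
preds = log_probs.argmax(dim=1)  # tensor of 0s and 1s, one label per sentence
print(preds)                     # e.g. tensor([1, 0, 0, 1])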

The main problem you need to figure out is which dimension your batch size should occupy when you prepare your data. As far as I know, if you don't set it in the nn.LSTM() constructor, it will automatically assume that the second dim is your batch size, which is quite different from other DNN frameworks. Maybe you can try:
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
to ask your model to treat the first dim as the batch dim.
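A quick shape check of the batch_first behavior; note that the returned h_n and c_n stay (num_layers, batch, hidden_size) regardless of batch_first:
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
x = torch.randn(32, 5, 10)  # (batch, seq_len, input_size) because batch_first=True
out, (h_n, c_n) = lstm(x)   # the hidden state defaults to zeros when omitted
print(out.shape)            # torch.Size([32, 5, 20]) -- (batch, seq_len, hidden_size)
print(h_n.shape)            # torch.Size([2, 32, 20]) -- (num_layers, batch, hidden_size)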

As the last layer you need a linear layer with one output per class, e.g. 10 outputs if you are doing digit classification as in MNIST. In your case you are doing a yes/no (1/0) classification, so you have two labels/classes and your linear layer has two outputs. I suggest adding a linear layer such as
nn.Linear(feature_size_from_previous_layer, 2)
and then training the model using a cross-entropy loss:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
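Note that nn.CrossEntropyLoss expects raw logits (it applies log-softmax internally), so no sigmoid or softmax should follow that final linear layer. A minimal sketch of one training step, with hypothetical feature and label tensors:
import torch
import torch.nn as nn
import torch.optim as optim

feature_size_from_previous_layer = 8  # hypothetical
net = nn.Linear(feature_size_from_previous_layer, 2)

criterion = nn.CrossEntropyLoss()     # combines log-softmax and NLL loss
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

features = torch.randn(4, feature_size_from_previous_layer)
labels = torch.tensor([0, 1, 1, 0])   # class indices, not one-hot vectors

optimizer.zero_grad()
loss = criterion(net(features), labels)  # logits go in raw
loss.backward()
optimizer.step()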

Related

TBPTT with a multivariate time series

I am trying to use TBPTT (truncated backpropagation through time) on a multivariate time series, and I am facing a problem: my loss doesn't decrease, and I don't know what I am doing wrong.
Inputs shape: (Batch_size, 1270, 6)
Output shape: (Batch_size, 1270)
There are some particularities with the inputs:
The 6 features correspond to A-B, A-C, A-D, where A is the time step.
Between two inputs (Inputs[0] and Inputs[1]) the features don't have the same length, so I padded all the inputs using
torch.nn.utils.rnn.pad_sequence(Mise_en_donnees, padding_value=-1, batch_first=True)
(I tried padding_value=0, but it doesn't change anything.)
All inputs are normalized using get_mean_std:
def get_mean_std(loader, ignore_idx=-1.):
    channels_sum, channels_squared_sum, num_batches = 0, 0, 0
    for data in loader:
        a = torch.sum((data[:, 0] != ignore_idx)).item() - 1
        channels_sum += torch.mean(data[:a], dim=[0])
        channels_squared_sum += torch.mean(data[:a] ** 2, dim=[0])
        num_batches += 1
    mean = channels_sum / num_batches
    std = (channels_squared_sum / num_batches - mean ** 2) ** 0.5
    return mean, std
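For reference on the padding step above, a minimal sketch of what pad_sequence produces for variable-length inputs, and how a mask of the real (non-padded) positions can be recovered (tensor contents are hypothetical):
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.randn(3, 6), torch.randn(5, 6)]  # two inputs of different lengths
padded = pad_sequence(seqs, padding_value=-1, batch_first=True)
print(padded.shape)                            # torch.Size([2, 5, 6])

# True where at least one feature differs from the padding value, i.e. real data
mask = (padded != -1).any(dim=-1)
print(mask[0])                                 # tensor([ True,  True,  True, False, False])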
Here is my model:
# A classic conv block
class conv_block(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(conv_block, self).__init__()
        self.relu = nn.LeakyReLU()
        self.conv = nn.Conv1d(in_channels, out_channels, **kwargs)
        self.batchnorm = nn.BatchNorm1d(out_channels)

    def forward(self, x):
        x = self.conv(x)
        x = self.batchnorm(x)
        return self.relu(x)

class Test(nn.Module):
    def __init__(self, in_channels, num_layers, hidden_size, p, out_size):
        super(Test, self).__init__()
        self.CNN = nn.Sequential(
            # I am trying to apply filters on every two columns (A-B A-C A-D) using groups
            conv_block(in_channels, 3, kernel_size=2, stride=1, padding=1, groups=3),  # ,padding_mode="reflect"
            conv_block(3, 32, kernel_size=2, stride=1, padding=0),
            # SqueezeExcitation(32, 16),  # I tried this, but got the same results
            conv_block(32, 16, kernel_size=3, stride=1, padding=1),
            conv_block(16, 8, kernel_size=3, stride=1, padding=1),
        )
        self.rnn = nn.LSTM(8, hidden_size, num_layers)
        self.rnn1 = nn.LSTM(hidden_size, hidden_size, num_layers)
        # self.fc_hidden = nn.Linear(hidden_size * 2, hidden_size)  # in case of using bidirectional
        # self.fc_cell = nn.Linear(hidden_size * 2, hidden_size)
        self.dropout = nn.Dropout(p)
        self.num_layers = num_layers
        self.fc_f = nn.Linear(out_size * hidden_size, out_size)

    def forward(self, x, hidden, cell):
        x = x.permute(0, 2, 1)
        x = self.CNN(x)
        x = x.permute(2, 0, 1)
        x, (hidden, cell) = self.rnn(x)  # I tried bidirectional but got the same results
        # hidden = self.dropout(self.fc_hidden(torch.cat((hidden[0:self.num_layers], hidden[self.num_layers:2*self.num_layers]), dim=2)))
        # cell = self.dropout(self.fc_cell(torch.cat((cell[0:self.num_layers], cell[self.num_layers:2*self.num_layers]), dim=2)))
        x, (hidden, cell) = self.rnn1(x, (hidden, cell))
        # hidden = hidden.repeat(2, 1, 1)
        # cell = cell.repeat(2, 1, 1)
        x = x.permute(1, 0, 2)
        x = x.reshape(x.shape[0], -1)
        x = self.fc_f(x)  # final result
        return x, hidden, cell
# hyperparameters
in_channels = 6
num_layers = 64
hidden_size = 90
p = 0.2
out_size = tbptt_steps = 20  # truncated BPTT steps
split_dim = 1
nb_epoch = 100
learning_rate = 3e-4

Model = Test(in_channels, num_layers, hidden_size, p, out_size).to(device)
optimizer = optim.Adam(Model.parameters(), lr=learning_rate)

# I tried to test my model on the same inputs
X = Inputs[:5, :500, :-1].to(device)
Y = Inputs[:5, :500, -1].to(device)
# training loop
hidden = None
cell = None
for ep in range(nb_epoch):
    Losses = 0
    for i, (x_, y_) in enumerate(zip(X.split(tbptt_steps, dim=split_dim),
                                     Y.split(tbptt_steps, dim=split_dim))):
        optimizer.zero_grad()
        # Model.train()

        # Detach the last hidden state, so the backprop graph will be cut
        if hidden is not None:
            hidden.detach_()
        if cell is not None:
            cell.detach_()

        # Forward pass
        y_pred, hidden, cell = Model(x_, hidden, cell)
        # print("predict", y_pred.shape, y_.shape)

        # Compute loss
        loss = nn.functional.mse_loss(y_pred, y_)  # (input, target) argument order

        # Backward pass
        loss.backward()
        Losses += loss.item()

        # Update weights
        optimizer.step()

        if i == 0:
            print("Epoch ", ep, " Loss ", loss.item())
    print("#################################################")
    print(Losses)
    print("#################################################")
There are two problems with this model:
- It doesn't catch the padding_value.
- The loss is high and doesn't decrease.
I really hope the model is understandable and that we can correct it. As you can see I am not a professional in machine learning, but I am really eager to understand my errors.
Thank you very much for your help.
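One common way to address the first problem, the padding leaking into the loss, is to mask the padded positions out before averaging. A hedged sketch under this thread's padding_value=-1 convention (the helper name is hypothetical):
import torch

def masked_mse_loss(y_pred, y_true, pad_value=-1.0):
    # only average the squared error over positions that hold real data
    mask = (y_true != pad_value)
    diff = (y_pred - y_true)[mask]
    return (diff ** 2).mean()

y_true = torch.tensor([[0.5, 0.2, -1.0, -1.0]])  # last two steps are padding
y_pred = torch.tensor([[0.4, 0.1,  0.3,  0.9]])
print(masked_mse_loss(y_pred, y_true))           # the padded steps do not contribute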

Question about input dimension for conv2D-LSTM implement

I am a PyTorch beginner and would like some help applying a conv2D-LSTM model.
I have a 2D image (1 channel × time × frequency) that contains time and frequency information.
I'd like to extract features automatically using conv2D followed by an LSTM, because the 2D image contains time information.
According to the PyTorch documentation, the output shape of conv2D is (batch size, channels out, height out, width out) and the input shape of an LSTM is (batch size, sequence length, input size). From that, I concluded that the output features of conv2D need to be reshaped before being fed into the LSTM.
I expected the CNN-LSTM model to perform well because it could learn both the characteristics and the time information of the image, but it did not reach the expected performance.
My question is: when I feed data into the LSTM, is there a way for the LSTM to learn the data row by row, without flattening? Or should I always flatten the 2D output?
My network code and input/output shapes are as follows. (I kept the width constant in the conv layers to preserve the time information.)
Thanks a lot.
class CNN_LSTM(nn.Module):
    def __init__(self, paramArr1, paramArr2):
        super(CNN_LSTM, self).__init__()
        self.input_dim = paramArr2[0]
        self.hidden_dim = paramArr2[1]
        self.n_layers = paramArr2[2]
        self.batch_size = paramArr2[3]
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels=paramArr1[0],
                      kernel_size=(paramArr1[1], 1),
                      stride=(paramArr1[2], 1)),
            nn.BatchNorm2d(paramArr1[0]),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(paramArr1[3], 1), stride=(paramArr1[4], 1))
        )
        self.lstm = nn.LSTM(input_size=paramArr2[0],
                            hidden_size=paramArr2[1],
                            num_layers=paramArr2[2],
                            batch_first=True)
        self.linear = nn.Linear(in_features=paramArr2[1], out_features=1)

    def reset_hidden_state(self):
        self.hidden = (
            torch.zeros(self.n_layers, self.batch_size, self.hidden_dim).to(device),
            torch.zeros(self.n_layers, self.batch_size, self.hidden_dim).to(device)
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), x.size(1), -1)
        x = x.permute(0, 2, 1)
        out, (hn, cn) = self.lstm(x, self.hidden)
        out = out.squeeze()[-1, :]
        out = self.linear(out)
        return out
[figure: model input/output shapes]
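On the "should I always flatten?" question, there are two reasonable reshapes of the conv output, and they mean different things; a sketch with hypothetical shapes:
import torch

# hypothetical conv2D output: (batch, channels, height, width)
x = torch.randn(8, 16, 4, 100)

# Option 1 (what the forward above does): flatten H and W into one long sequence
# of C-dimensional features -- the LSTM then walks over H*W "timesteps"
seq1 = x.view(x.size(0), x.size(1), -1).permute(0, 2, 1)        # (8, 400, 16)

# Option 2: keep one axis (say, width = time) as the sequence and fold the other
# two into the feature dimension -- one (C*H)-dimensional vector per timestep,
# so nothing is flattened across time
seq2 = x.permute(0, 3, 1, 2).reshape(x.size(0), x.size(3), -1)  # (8, 100, 64)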

Understanding the architecture of an LSTM for sequence classification

I have this model in PyTorch that I have been using for sequence classification.

class RoBERT_Model(nn.Module):
    def __init__(self, hidden_size=100):
        super(RoBERT_Model, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(768, hidden_size, num_layers=1, bidirectional=False)
        self.out = nn.Linear(hidden_size, 2)

    def forward(self, grouped_pooled_outs):
        # chunks_emb = pooled_out.split_with_sizes(lengt)  # splits the input tensor into a list of tensors, where the length of each sublist is given by `lengt`
        seq_lengths = torch.LongTensor([x for x in map(len, grouped_pooled_outs)])  # the length of each sublist
        batch_emb_pad = nn.utils.rnn.pad_sequence(grouped_pooled_outs, padding_value=-91, batch_first=True)  # pad each sublist to the longest one with value -91
        batch_emb = batch_emb_pad.transpose(0, 1)  # (B, L, D) -> (L, B, D)
        lstm_input = nn.utils.rnn.pack_padded_sequence(batch_emb, seq_lengths, batch_first=False, enforce_sorted=False)  # seq_lengths.cpu().numpy()
        packed_output, (h_t, h_c) = self.lstm(lstm_input)  # (h_t, h_c)
        # output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, padding_value=-91)
        h_t = h_t.view(-1, self.hidden_size)  # (-1, 100)
        return self.out(h_t)  # logits
The issue I am having is that I am not entirely convinced of what data is being passed to the final classification layer. I believe only the final LSTM cell in the last layer is being used for classification; that is, there are hidden_size features that are passed to the feedforward layer.
I have depicted what I believe is going on in the figure here.
Is this understanding correct? Am I missing anything?
Thanks.
Your code is a basic LSTM for classification, working with a single RNN layer.
In your picture you have multiple LSTM layers, while in reality there is only one: H_n^0 in the picture.
Your input to the LSTM is of shape (B, L, D), as correctly pointed out in the comment.
packed_output and h_c are not used at all, hence you can change that line to _, (h_t, _) = self.lstm(lstm_input) so as not to clutter the picture further.
h_t is the output of the last step for each batch element; in general its shape is (D * num_layers, B, hidden_size). As this neural network is not bidirectional, D = 1; as you have a single layer, num_layers = 1 as well; hence the output is of shape (1, B, hidden_size).
This output is reshaped into an nn.Linear-compatible shape (this line: h_t = h_t.view(-1, self.hidden_size)) and will give you an output of shape (B, hidden_size).
This input is fed to a single nn.Linear layer.
In general, the output of the last time step from the RNN is used for each element in the batch; in your picture it is H_n^0, and it is simply fed to the classifier.
By the way, having self.out = nn.Linear(hidden_size, 2) for this classification is probably counter-productive; most likely you are performing binary classification, and self.out = nn.Linear(hidden_size, 1) with torch.nn.BCEWithLogitsLoss could be used instead. A single logit contains the information whether the label should be 0 or 1: everything smaller than 0 the network considers more likely to be 0, and everything above 0 is considered a 1 label.
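A minimal sketch of that single-logit setup, with a stand-in for the LSTM's last hidden state:
import torch
import torch.nn as nn

hidden_size = 100
out = nn.Linear(hidden_size, 1)          # one logit instead of two class scores
criterion = nn.BCEWithLogitsLoss()       # applies the sigmoid internally

h_t = torch.randn(4, hidden_size)        # stand-in for the reshaped last hidden state
labels = torch.tensor([0., 1., 1., 0.])  # float targets for BCE

logits = out(h_t).squeeze(1)             # (4,)
loss = criterion(logits, labels)
preds = (logits > 0).long()              # logit > 0 corresponds to probability > 0.5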

LSTM/RNN in pytorch The relation between forward method and training model

I'm still fairly new to neural networks, so sorry in advance for any ambiguities in the following.
In a "standard" LSTM implementation for a language task, we have the following (sorry for the very rough sketches):
class LSTM(nn.Module):
    def __init__(self, *args):
        ...

    def forward(self, input, states):
        lstm_in = self.model['embed'](input)
        lstm_out, hidden = self.model['lstm'](lstm_in, states)
        return lstm_out, hidden
Later on, we call upon this model in the training step:
def train(*args):
    for epoch in range(epochs):
        ...
        # init zero states
        ...
        out, states = model(input, states)
        ...
    return model
Let's just say, that I have 3 sentences as input:
sents = [["The", "sun", "is", "shiny"],
         ["The", "beach", "was", "very", "windy"],
         ["Computer", "broke", "down", "today"]]
model = train(LSTM, sents)
All words in all sentences get converted to embeddings and loaded into the model.
Now the questions:
Does self.model['lstm'] iterate through all words from all sentences, and does it produce one output after every word, or after every sentence?
How does the model distinguish between the 3 sentences? For example, after getting "The", "sun", "is", "shiny", does something in the 'lstm' (such as the states) reset and begin anew?
Is the "out" in the training step after out, states = model(input, states) the output after running all 3 sentences, and hence the combined "information" from all 3 sentences?
Thanks!
When using LSTMs in PyTorch you usually use the nn.LSTM module. Here is a quick example, followed by an explanation of what happens inside:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.embedder = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.embedder(x)
        # every time you pass a new sentence into the model you need to create
        # a new hidden state (the LSTM requires, unlike plain RNNs, two hidden states in a tuple)
        hidden = (torch.zeros(num_layers, batch_size, hidden_size),
                  torch.zeros(num_layers, batch_size, hidden_size))
        x, hidden = self.lstm(x, hidden)
        # x contains the output states of every timestep;
        # for classification we mostly just want the last one
        x = x[:, -1]
        x = self.fc(x)
        x = self.softmax(x)
        return x
So, when taking a look at the nn.LSTM module, you see that all N embedded words are passed into it at once and you get all N outputs back (one from every timestep). That means that inside the lstm module, it iterates over all words in the sentence embeddings; we just don't see that in the code. It also returns the final hidden state, but you don't have to use that further. In most cases you can just ignore it.
As pseudo code:
def lstm(x):
    hiddenstate = init_with_zeros()
    outputs, hiddenstates = [], []
    for e in x:
        output, hiddenstate = neuralnet(e, hiddenstate)
        outputs.append(output)
        hiddenstates.append(hiddenstate)
    return outputs, hiddenstates
sentence = ["the", "sun", "is", "shiny"]
sentence = embedding(sentence)
outputs, hiddenstates = lstm(sentence)
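A concrete shape check ties this back to the three questions: the whole padded batch goes in at once, you get one output per word per sentence, and since a fresh zero hidden state is used for the batch, the sentences do not leak into each other:
import torch
import torch.nn as nn

embed_size, hidden_size = 8, 16
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

x = torch.randn(3, 5, embed_size)  # 3 sentences, 5 (padded) words each, already embedded
out, (h_n, c_n) = lstm(x)          # omitted hidden state defaults to zeros
print(out.shape)                   # torch.Size([3, 5, 16]) -- one output per word per sentence
print(out[:, -1].shape)            # torch.Size([3, 16])    -- last timestep, per sentence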

How to print the output weights for the output layer in BERT?

I would like to print the output vector/tensor in BERT and wasn't sure how to do it. I've been using the following example to walk myself through it:
https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX
It's a simple classification problem, but I want to be able to get the output vector before we classify the training examples. Can someone point out where in the code I can do this, and how?
Do you want the weights of the output layer or the logits? I think you want the logits. It is more work, but better in the long run, to subclass the model so you can play with it yourself. Here is part of a subclass I wrote where I wanted dropout and more control; I'll include it so you can see how to access all the parts of the model:
import torch
from transformers import BertModel, BertPreTrainedModel

class MyBert(BertPreTrainedModel):
    def __init__(self, config, dropout_prob):
        super().__init__(config)
        self.num_labels = 2
        self.bert = BertModel(config)
        self.dropout = torch.nn.Dropout(dropout_prob)
        self.classifier = torch.nn.Linear(config.hidden_size, self.num_labels)
        self.init_weights()

    def forward(self,
                input_ids=None,
                attention_mask=None,
                token_type_ids=None,
                position_ids=None,
                head_mask=None,
                inputs_embeds=None,
                labels=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attentions if they are here
        if labels is not None:
            loss_fct = torch.nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
        return outputs  # (loss), logits, (hidden_states), (attentions)
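With that subclass, the vector right before classification is pooled_output; you can grab it by running the BertModel part directly. A hedged usage sketch (the model name, toy token ids, and the positional dropout_prob argument are assumptions):
import torch

# from_pretrained forwards extra positional args to __init__(config, dropout_prob)
model = MyBert.from_pretrained("bert-base-uncased", 0.1)
model.eval()

input_ids = torch.tensor([[101, 7592, 2088, 102]])  # toy example: [CLS] hello world [SEP]
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    pooled = model.bert(input_ids, attention_mask=attention_mask)[1]  # vector before the classifier
    logits = model.classifier(model.dropout(pooled))

print(pooled.shape)  # torch.Size([1, 768]) for bert-base
print(logits.shape)  # torch.Size([1, 2])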
