LSTM/RNN in pytorch The relation between forward method and training model - pytorch

I'm still fairly new to neural networks, so sorry on beforehand for any ambiguities to the following.
In a "standard" LSTM implementation for language task, we have the following (sorry for the very rough sketches):
class LSTM(nn.Module):
def __init__(*args):
...
def forward(self, input, states):
lstn_in = self.model['embed'](input)
lstm_out, hidden = self.model['lstm'](lstm_in,states)
return lstm_out, hidden
Later on, we call upon this model in the training step:
def train(*args):
for epoch in range(epochs):
....
*init_zero_states
...
out, states = model(input, states)
...
return model
Let's just say, that I have 3 sentences as input:
sents = [[The, sun, is, shiny],
[The, beach, was, very, windy],
[Computer, broke, down, today]]
model = train(LSTM, sents)
All words in all sentences gets converted to embeddings and loaded into the model.
Now the question:
Does the self.model['lstm'] iterate though all words from all articles and makes one output after every word? or every sentence?
How does the model make distinction between the 3 sentences, such as after getting "The", "sun", "is", "shiny", does something (such as the states) in the 'lstm' reset and begin anew?
The "out" in the train step after out, states = model(input, states) is the output after running all 3 sentences and hence the combined "information" from all 3 sentences?
Thanks!

when using LSTMs in Pytorch you usually use the nn.LSTM function. Here is a quick example and then an explanation what happens inside:
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.embedder = nn.Embedding(voab_size, embed_size)
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
self.fc = nn.Linear(hidden_size, output_size)
self.softmax = nn.Softmax(dim=1)
def forward(self, x):
x = self.embedder(x)
# every time you pass a new sentence into the model you need to create
# a new hidden-state (the LSTM requires, unlike RNNs, two hidden-states in a tuple)
hidden = (torch.zeros(num_layers, batch_size, hidden_size), torch.zeros(num_layers, batch_size, hidden_size))
x, hidden = self.lstm(x, hidden)
# x contains the output states of every timestep,
# for classifiction we mostly just want the last one
x = x[:, -1]
x = self.fc(x)
x = self.softmax(x)
return x
So, when taking a look at the nn.LSTM function, you see all N embedded words are passed into it at once and you get as output all N outputs (one from every timestep). That means inside of the lstm function, it iterates over all words in the sentence embeddings. We just dont see that in the code. It also returns the hiddenstate of every timestep but you dont have to use that further. In most cases you can just ignore that.
As pseudo code:
def lstm(x):
hiddenstates = init_with_zeros()
outputs, hiddenstates = [], []
for e in x:
output, hiddenstate = neuralnet(e, hiddenstate)
outputs.append(output)
hiddenstates.append(hiddenstate)
return outputs, hiddenstates
sentence = ["the", "sun", "is", "shiny"]
sentence = embedding(sentence)
outputs, hiddenstates = lstm(sentence)

Related

Understanding the architecture of an LSTM for sequence classification

I have this model in pytorch that I have been using for sequence classification.
class RoBERT_Model(nn.Module):
def __init__(self, hidden_size = 100):
self.hidden_size = hidden_size
super(RoBERT_Model, self).__init__()
self.lstm = nn.LSTM(768, hidden_size, num_layers=1, bidirectional=False)
self.out = nn.Linear(hidden_size, 2)
def forward(self, grouped_pooled_outs):
# chunks_emb = pooled_out.split_with_sizes(lengt) # splits the input tensor into a list of tensors where the length of each sublist is determined by length
seq_lengths = torch.LongTensor([x for x in map(len, grouped_pooled_outs)]) # gets the length of each sublist in chunks_emb and returns it as an array
batch_emb_pad = nn.utils.rnn.pad_sequence(grouped_pooled_outs, padding_value=-91, batch_first=True) # pads each sublist in chunks_emb to the largest sublist with value -91
batch_emb = batch_emb_pad.transpose(0, 1) # (B,L,D) -> (L,B,D)
lstm_input = nn.utils.rnn.pack_padded_sequence(batch_emb, seq_lengths, batch_first=False, enforce_sorted=False) # seq_lengths.cpu().numpy()
packed_output, (h_t, h_c) = self.lstm(lstm_input, ) # (h_t, h_c))
# output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, padding_value=-91)
h_t = h_t.view(-1, self.hidden_size) # (-1, 100)
return self.out(h_t) # logits
The issue that I am having is that I am not entirely convinced of what data is being passed to the final classification layer. I believe what is being done is that only the final LSTM cell in the last layer is being used for classification. That is there are hidden_size features that are passed to the feedforward layer.
I have depicted what I believe is going on in this figure here:
Is this understanding correct? Am I missing anything?
Thanks.
Your code is a basic LSTM for classification, working with a single rnn layer.
In your picture you have multiple LSTM layers, while, in reality, there is only one, H_n^0 in the picture.
Your input to LSTM is of shape (B, L, D) as correctly pointed out in the comment.
packed_output and h_c is not used at all, hence you can change this line to: _, (h_t, _) = self.lstm(lstm_input) in order no to clutter the picture further
h_t is output of last step for each batch element, in general (B, D * L, hidden_size). As this neural network is not bidirectional D=1, as you have a single layer L=1 as well, hence the output is of shape (B, 1, hidden_size).
This output is reshaped into nn.Linear compatible (this line: h_t = h_t.view(-1, self.hidden_size)) and will give you output of shape (B, hidden_size)
This input is fed to a single nn.Linear layer.
In general, the output of the last time step from RNN is used for each element in the batch, in your picture H_n^0 and simply fed to the classifier.
By the way, having self.out = nn.Linear(hidden_size, 2) in classification is probably counter-productive; most likely your are performing binary classification and self.out = nn.Linear(hidden_size, 1) with torch.nn.BCEWithLogitsLoss might be used. Single logit contains information whether the label should be 0 or 1; everything smaller than 0 is more likely to be 0 according to nn, everything above 0 is considered as a 1 label.

What does PyTorch classifier output?

So i am new to deep learning and started learning PyTorch. I created a classifier model with following structure.
class model(nn.Module):
def __init__(self):
super(model, self).__init__()
resnet = models.resnet34(pretrained=True)
layers = list(resnet.children())[:8]
self.features1 = nn.Sequential(*layers[:6])
self.features2 = nn.Sequential(*layers[6:])
self.classifier = nn.Sequential(nn.BatchNorm1d(512), nn.Linear(512, 3))
def forward(self, x):
x = self.features1(x)
x = self.features2(x)
x = F.relu(x)
x = nn.AdaptiveAvgPool2d((1,1))(x)
x = x.view(x.shape[0], -1)
return self.classifier(x)
So basically I wanted to classify among three things {0,1,2}. While evaluating, I passed the image it returned a Tensor with three values like below
(tensor([[-0.1526, 1.3511, -1.0384]], device='cuda:0', grad_fn=<AddmmBackward>)
So my question is what are these three numbers? Are they probability ?
P.S. Please pardon me If I asked something too silly.
The final layer nn.Linear (fully connected layer) of self.classifier of your model produces values, that we can call a scores, for example, it may be: [10.3, -3.5, -12.0], the same you can see in your example as well: [-0.1526, 1.3511, -1.0384] which are not normalized and cannot be interpreted as probabilities.
As you can see it's just a kind of "raw unscaled" network output, in other words these values are not normalized, and it's hard to use them or interpret the results, that's why the common practice is converting them to normalized probability distribution by using softmax after the final layer, as #skinny_func has already described. After that you will get the probabilities in the range of 0 and 1, which is more intuitive representation.
So after training what you would want to do is to apply softmax to the output tensor to extract the probability of each class, then you choose the maximal value (highest probability).
in your case:
prob = torch.nn.functional.softmax(model(x), dim=1)
_, pred_class = torch.max(prob, dim=1)

How to print the output weights for the output layer in BERT?

I would like to print the output vector/tensor in BERT an wasn't sure how to do it. I've been using the following example to walk myself through it:
https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX
Its a simple classification problem, but I want to be able to get the output vector before we classify the training examples. Can someone point to where in the code I can do this and how?
Do you want the weights to the output layer or the logits? I think you want the logits, it is more work but better in the long run to subclass so you can play with it yourself. Here part of subclass I did where I wanted dropout and more control. I'll just include it here where you can access all the parts of the model
class MyBert(BertPreTrainedModel):
def __init__(self, config, dropout_prob):
super().__init__(config)
self.num_labels = 2
self.bert = BertModel(config)
self.dropout = torch.nn.Dropout(dropout_prob)
self.classifier = torch.nn.Linear(config.hidden_size, self.num_labels)
self.init_weights()
def forward(self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,):
outputs = self.bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
if labels is not None:
loss_fct = torch.nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
outputs = (loss,) + outputs
return outputs # (loss), logits, (hidden_states), (attentions)

how to build a multidimensional autoencoder with pytorch

I followed this great answer for sequence autoencoder,
LSTM autoencoder always returns the average of the input sequence.
but I met some problem when I try to change the code:
question one:
Your explanation is so professional, but the problem is a little bit different from mine, I attached some code I changed from your example. My input features are 2 dimensional, and my output is same with the input.
for example:
input_x = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
output_y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
the input_x and output_y are same, 5-timesteps, 2-dimensional feature.
import torch
import torch.nn as nn
import torch.optim as optim
class LSTM(nn.Module):
def __init__(self, input_dim, latent_dim, num_layers):
super(LSTM, self).__init__()
self.input_dim = input_dim
self.latent_dim = latent_dim
self.num_layers = num_layers
self.encoder = nn.LSTM(self.input_dim, self.latent_dim, self.num_layers)
# I changed here, to 40 dimesion, I think there is some problem
# self.decoder = nn.LSTM(self.latent_dim, self.input_dim, self.num_layers)
self.decoder = nn.LSTM(40, self.input_dim, self.num_layers)
def forward(self, input):
# Encode
_, (last_hidden, _) = self.encoder(input)
# It is way more general that way
encoded = last_hidden.repeat(input.shape)
# Decode
y, _ = self.decoder(encoded)
return torch.squeeze(y)
model = LSTM(input_dim=2, latent_dim=20, num_layers=1)
loss_function = nn.MSELoss()
optimizer = optim.Adam(model.parameters())
y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
x = y.view(len(y), -1, 2) # I changed here
while True:
y_pred = model(x)
optimizer.zero_grad()
loss = loss_function(y_pred, y)
loss.backward()
optimizer.step()
print(y_pred)
The above code can learn very well, can you help review the code and give some instructions.
When I input 2 examples as the input to the model, the model cannot work:
for example, change the code:
y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
to:
y = torch.Tensor([[[0.0,0.0],[0.5,0.5]], [[0.1,0.1], [0.6,0.6]], [[0.2,0.2],[0.7,0.7]], [[0.3,0.3],[0.8,0.8]], [[0.4,0.4],[0.9,0.9]]])
When I compute the loss function, it complain some errors? can anyone help have a look
question two:
my training samples are with different length:
for example:
x1 = [[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]] #with 5 timesteps
x2 = [[0.5,0.5], [0.6,0.6], [0.7,0.7]] #with only 3 timesteps
How can I input these two training sample into the model at the same time for a batch training.
Recurrent N-dimensional autoencoder
First of all, LSTMs work on 1D samples, yours are 2D as it's usually used for words encoded with a single vector.
No worries though, one can flatten this 2D sample to 1D, example for your case would be:
import torch
var = torch.randn(10, 32, 100, 100)
var.reshape((10, 32, -1)) # shape: [10, 32, 100 * 100]
Please notice it's really not general, what if you were to have 3D input? Snippet belows generalizes this notion to any dimension of your samples, provided the preceding dimensions are batch_size and seq_len:
import torch
input_size = 2
var = torch.randn(10, 32, 100, 100, 35)
var.reshape(var.shape[:-input_size] + (-1,)) # shape: [10, 32, 100 * 100 * 35]
Finally, you can employ it inside neural network as follows. Look at forward method especially and constructor arguments:
import torch
class LSTM(nn.Module):
# input_dim has to be size after flattening
# For 20x20 single input it would be 400
def __init__(
self,
input_dimensionality: int,
input_dim: int,
latent_dim: int,
num_layers: int,
):
super(LSTM, self).__init__()
self.input_dimensionality: int = input_dimensionality
self.input_dim: int = input_dim # It is 1d, remember
self.latent_dim: int = latent_dim
self.num_layers: int = num_layers
self.encoder = torch.nn.LSTM(self.input_dim, self.latent_dim, self.num_layers)
# You can have any latent dim you want, just output has to be exact same size as input
# In this case, only encoder and decoder, it has to be input_dim though
self.decoder = torch.nn.LSTM(self.latent_dim, self.input_dim, self.num_layers)
def forward(self, input):
# Save original size first:
original_shape = input.shape
# Flatten 2d (or 3d or however many you specified in constructor)
input = input.reshape(input.shape[: -self.input_dimensionality] + (-1,))
# Rest goes as in my previous answer
_, (last_hidden, _) = self.encoder(input)
encoded = last_hidden.repeat(input.shape)
y, _ = self.decoder(encoded)
# You have to reshape output to what the original was
reshaped_y = y.reshape(original_shape)
return torch.squeeze(reshaped_y)
Remember you have to reshape your output in this case. It should work for any dimensions.
Batching
When it comes to batching and different length of sequences it is a little more complicated.
You have to pad each sequence in batch before pushing it through network. Usually, values with which you pad are zeros, you may configure it inside LSTM though.
You may check this link for an example. You will have to use functions like torch.nn.pack_padded_sequence and others to make it work, you may check this answer.
Oh, since PyTorch 1.1 you don't have to sort your sequences by length in order to pack them. But when it comes to this topic, grab some tutorials, should make things clearer.
Lastly: Please, separate your questions. If you perform the autoencoding with single example, move on to batching and if you have issues there, please post a new question on StackOverflow, thanks.

How can I use LSTM in pytorch for classification?

My code is as below:
class Mymodel(nn.Module):
def __init__(self, input_size, hidden_size, output_size, num_layers, batch_size):
super(Discriminator, self).__init__()
self.input_size = input_size
self.hidden_size = hidden_size
self.output_size = output_size
self.num_layers = num_layers
self.batch_size = batch_size
self.lstm = nn.LSTM(input_size, hidden_size)
self.proj = nn.Linear(hidden_size, output_size)
self.hidden = self.init_hidden()
def init_hidden(self):
return (Variable(torch.zeros(self.num_layers, self.batch_size, self.hidden_size)),
Variable(torch.zeros(self.num_layers, self.batch_size, self.hidden_size)))
def forward(self, x):
lstm_out, self.hidden = self.lstm(x, self.hidden)
output = self.proj(lstm_out)
result = F.sigmoid(output)
return result
I want to use LSTM to classify a sentence to good (1) or bad (0). Using this code, I get the result which is time_step * batch_size * 1 but not 0 or 1. How to edit the code in order to get the classification result?
Theory:
Recall that an LSTM outputs a vector for every input in the series. You are using sentences, which are a series of words (probably converted to indices and then embedded as vectors). This code from the LSTM PyTorch tutorial makes clear exactly what I mean (***emphasis mine):
lstm = nn.LSTM(3, 3) # Input dim is 3, output dim is 3
inputs = [autograd.Variable(torch.randn((1, 3)))
for _ in range(5)] # make a sequence of length 5
# initialize the hidden state.
hidden = (autograd.Variable(torch.randn(1, 1, 3)),
autograd.Variable(torch.randn((1, 1, 3))))
for i in inputs:
# Step through the sequence one element at a time.
# after each step, hidden contains the hidden state.
out, hidden = lstm(i.view(1, 1, -1), hidden)
# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# *** (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (autograd.Variable(torch.randn(1, 1, 3)), autograd.Variable(
torch.randn((1, 1, 3)))) # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)
One more time: compare the last slice of "out" with "hidden" below, they are the same. Why? Well...
If you're familiar with LSTM's, I'd recommend the PyTorch LSTM docs at this point. Under the output section, notice h_t is output at every t.
Now if you aren't used to LSTM-style equations, take a look at Chris Olah's LSTM blog post. Scroll down to the diagram of the unrolled network:
As you feed your sentence in word-by-word (x_i-by-x_i+1), you get an output from each timestep. You want to interpret the entire sentence to classify it. So you must wait until the LSTM has seen all the words. That is, you need to take h_t where t is the number of words in your sentence.
Code:
Here's a coding reference. I'm not going to copy-paste the entire thing, just the relevant parts. The magic happens at self.hidden2label(lstm_out[-1])
class LSTMClassifier(nn.Module):
def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, batch_size):
...
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
self.hidden2label = nn.Linear(hidden_dim, label_size)
self.hidden = self.init_hidden()
def init_hidden(self):
return (autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)),
autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)))
def forward(self, sentence):
embeds = self.word_embeddings(sentence)
x = embeds.view(len(sentence), self.batch_size , -1)
lstm_out, self.hidden = self.lstm(x, self.hidden)
y = self.hidden2label(lstm_out[-1])
log_probs = F.log_softmax(y)
return log_probs
The main problem you need to figure out is the in which dim place you should put your batch size when you prepare your data. As far as I know, if you didn't set it in your nn.LSTM() init function, it will automatically assume that the second dim is your batch size, which is quite different compared to other DNN framework. Maybe you can try:
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
like this to ask your model to treat your first dim as the batch dim.
As a last layer you have to have a linear layer for however many classes you want i.e 10 if you are doing digit classification as in MNIST . For your case since you are doing a yes/no (1/0) classification you have two lablels/ classes so you linear layer has two classes. I suggest adding a linear layer as
nn.Linear ( feature_size_from_previous_layer , 2)
and then train the model using a cross-entropy loss.
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

Resources