The GRU model in PyTorch outputs two objects: the output features as well as the hidden states. I understand that for classification one uses the output features, but I'm not entirely sure which of them. Specifically, in a typical encoder-decoder architecture that uses a GRU in the encoder part, one would typically only pass the last (time-wise, i.e., t = N, where N is the length of the input sequence) output to the decoder. Which part of the output tensor refers to this time-wise last output?
The GRU is created like so (note that it is bidirectional):
self.gru = nn.GRU(
700,
700,
bidirectional=True,
batch_first=True,
)
Given some embedding vector representing a piece of text of size 150x700, I use the GRU like so (150 is the sequence length, 700 the embedding dimension):
gru_out, gru_hidden = self.gru(embedding)
gru_out will be of shape 150x1400, where 150 is again the sequence length and 1400 is double the embedding dimension, which is because of the GRU being a bidirectional one (in terms of pytorch's documentation, hidden_size*num_directions).
If I only want to access the time-wise last output, do I need to access it like so?
tmp = gru_out.view(150, 2, 700)
last_out_first_direction = tmp[149, 0, :]
last_out_second_direction = tmp[149, 1, :]
While this technically seems right and is similar to the answer posted here, it would also require that the actual input sequence is always of length 150, whereas typically you have also shorter actual input sequences that are simply padded to be of length 150. However, in GRU one is typically interested in the last actual input token, which can thus also be at a position <150. What is a common way to access the actual last token or time-step (<=150) instead of only the technically last step (always =150)?
Side question: Is the output of the second direction reversed (since the direction in which information is passed through the GRU is also reversed compared to the first direction) so I should actually access last_out_second_direction = tmp[0, 1, :] instead of tmp[149, 1, :]?
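A minimal sketch for checking the side question empirically, assuming small made-up sizes (6 steps, 4 input features, hidden size 5) and a batch of one instead of the 150x700 input above. It compares slices of gru_out against gru_hidden, which PyTorch documents as holding the final state of each direction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, emb, hid = 6, 4, 5          # illustrative sizes, not the 150/700 from the question
gru = nn.GRU(emb, hid, bidirectional=True, batch_first=True)

x = torch.randn(1, seq_len, emb)     # (batch, seq, features)
gru_out, gru_hidden = gru(x)         # gru_out: (1, 6, 10), gru_hidden: (2, 1, 5)

# Split the last axis into (direction, hidden) as described in the docs.
tmp = gru_out.view(1, seq_len, 2, hid)
fwd_last = tmp[0, -1, 0]             # forward direction finishes at the last step
bwd_last = tmp[0, 0, 1]              # backward direction finishes at the first step

forward_matches = torch.allclose(fwd_last, gru_hidden[0, 0], atol=1e-6)
backward_matches = torch.allclose(bwd_last, gru_hidden[1, 0], atol=1e-6)
print(forward_matches, backward_matches)
```

If both checks print True, the backward direction's final state sits at index 0 of the output rather than at the last index, and gru_hidden already contains both final states without any slicing.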
I have a pytorch model:
model = torch.nn.Sequential(
torch.nn.LSTM(40, 256, 3, batch_first=True),
torch.nn.Linear(256, 256),
torch.nn.ReLU()
)
And for the LSTM layer, I want to retrieve only the last hidden state from the batch to pass through the rest of the layers. Ex:
_, (hidden, _) = lstm(data)
hidden = hidden[-1]
However, that example only works for a subclassed model. I need to do this in an nn.Sequential() model so that, when I save it, it can be converted properly to a tensorflow.js model. The reason I can't build and train this model in tensorflow.js is that I'm trying to implement this repo, Resemblyzer, in tensorflow.js while still using the same weights as the pretrained Resemblyzer model, which was made in PyTorch as a subclassed model. I thought of using the torchvision.transforms.Lambda() transformation, but I assume that would make it incompatible with tensorflow.js. Is there any way to make this possible while still allowing the model to convert properly?
You could keep the nn.Sequential definition and only split it up at inference time, in the forward pass. Once defined:
model = nn.Sequential(nn.LSTM(40, 256, 3, batch_first=True),
nn.Linear(256, 256),
nn.ReLU())
You can split it:
>>> lstm, fc = model[0], model[1:]
Then infer in two steps:
>>> out, (hidden, _) = lstm(data)
>>> hidden = hidden[-1]
>>> out = fc(out) # <- or fc(hidden) / fc(out[:, -1]) for the last timestep only (batch_first=True)
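Another option that keeps everything inside a single nn.Sequential is a tiny adapter module placed after the LSTM. The sketch below is only an illustration: LastHidden is a hypothetical helper name, the batch and sequence sizes are made up, and whether such a module survives conversion to tensorflow.js is an assumption that would need to be checked separately.

```python
import torch
import torch.nn as nn

class LastHidden(nn.Module):
    """Hypothetical adapter: takes the LSTM's (output, (h_n, c_n)) tuple
    and returns the last layer's final hidden state."""
    def forward(self, lstm_output):
        _, (hidden, _) = lstm_output
        return hidden[-1]            # (batch, hidden_size)

model = nn.Sequential(
    nn.LSTM(40, 256, 3, batch_first=True),
    LastHidden(),                    # nn.Sequential passes the LSTM's tuple through as-is
    nn.Linear(256, 256),
    nn.ReLU(),
)

data = torch.randn(8, 20, 40)        # (batch, seq, features), illustrative sizes
out = model(data)
print(out.shape)
```

Since the adapter has no parameters, the state_dict keys of the LSTM and Linear layers shift by one index compared with the original three-module Sequential, which matters when loading pretrained weights.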
Though the answer is provided above, I thought of elaborating on the same as PyTorch LSTM documentation is confusing.
In TF, we directly get the last_state as the output. No further action needed.
Let us check the Torch output of LSTM:
There are 2 outputs - a sequence and a tuple. We are interested in the last state so we can ignore the sequence and focus on the tuple. The tuple consists of 2 values - the first is the hidden state of the last cell (of all layers in the LSTM) and the second is the cell state of the last cell (again of all layers in the LSTM). We are interested in the hidden state. So
_, tup = self.bilstm(inp)
We are interested in tup[0]. Let us dig further into this.
The shape of tup[0] is somewhat odd, with the batch size in the centre. To the left of the batch size is the number of layers in the LSTM (multiplied by 2 if it is a biLSTM). To the right is the hidden size you provided when defining the LSTM. You could take the output from the last layer by simply doing tup[0][-1], which is the answer provided above.
Alternatively if you want to make use of hidden states across layers, you may try something like:
out = tup[0].swapaxes(0,1)
out = out.reshape(*out.shape[:-2], -1)
The first line produces a shape of (batch_size, num_layers, hidden_size_specified). The second line produces a shape of (batch_size, num_layers x hidden_size_specified).
(For example, say yours is a biLSTM with 3 layers and a hidden size of 100: you could choose to concatenate the output so that you get one vector of 2 x 3 x 100 = 600 dimensions, and then run a simple linear layer on top of this to get the output you want.)
There is another way to get the output of the LSTM. We discussed that the first output of an LSTM is a sequence:
sequence, tup = self.bilstm(inp)
This sequence is the output of the LAST hidden layer of the LSTM. It is a sequence because it contains hidden states of EVERY cell in this layer. So its length will be the input sequence length that you have provided. We could choose to take the hidden state of the last element in the sequence by doing a:
#shape of sequence is: batch_size, seq_size, dim
sequence = sequence.swapaxes(0,1)
#shape of sequence is: seq_size, batch_size, dim
sequence = sequence[-1]
#shape of sequence is: batch_size, dim (ie last seq is taken)
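Needless to say, this will be the same value we got by taking the last layer from tup[0]. Well, not quite! If the LSTM is a biLSTM, the sequence approach returns a 2 x hidden_size dim output (which is correct), whereas the tup[0][-1] approach gives only a hidden_size dim output even for a biLSTM, namely the final state of the last layer's backward direction. Note also that for a biLSTM the backward half of sequence[-1] has only seen the last token; the backward direction's final state sits at position 0 of the sequence. OP's LSTM is a non-biLSTM, so both answers hold true.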
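The relationship between the two approaches can be checked with a short sketch. The sizes below (batch 4, 10 steps, 8 features, hidden size 16) are made up for illustration, and concatenating h[-2] and h[-1] is one common way to get a 2 x hidden_size summary from a biLSTM, not the only one:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10, 8)            # (batch, seq, features), illustrative sizes

# Unidirectional: last sequence element equals the last layer's final hidden state.
lstm = nn.LSTM(8, 16, num_layers=3, batch_first=True)
seq, (h, c) = lstm(x)
uni_same = torch.allclose(seq[:, -1], h[-1], atol=1e-6)

# Bidirectional: h has shape (num_layers * 2, batch, hidden).
bilstm = nn.LSTM(8, 16, num_layers=3, batch_first=True, bidirectional=True)
seq_bi, (h_bi, _) = bilstm(x)
# Last layer: h_bi[-2] is the forward direction, h_bi[-1] the backward direction.
final_bi = torch.cat([h_bi[-2], h_bi[-1]], dim=1)            # (batch, 2 * hidden)

fwd_same = torch.allclose(final_bi[:, :16], seq_bi[:, -1, :16], atol=1e-6)
bwd_same = torch.allclose(final_bi[:, 16:], seq_bi[:, 0, 16:], atol=1e-6)
print(uni_same, fwd_same, bwd_same)
```

The last check uses position 0 of the sequence for the backward half, because that is where the backward direction has processed the whole input.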
I have a simple rnn code below.
rnn = nn.RNN(1, 1, 1, bias = False, batch_first = True)
t = torch.ones(size = (1, 2, 1))
output, hidden = rnn(t)
print(rnn.weight_ih_l0)
print(rnn.weight_hh_l0)
print(output)
print(hidden)
# Outputs
Parameter containing:
tensor([[0.7199]], requires_grad=True)
Parameter containing:
tensor([[0.4698]], requires_grad=True)
tensor([[[0.6168],
[0.7656]]], grad_fn=<TransposeBackward1>)
tensor([[[0.7656]]], grad_fn=<StackBackward>)
My understanding from the PyTorch documentation is that the output from above is the hidden state.
So, I tried to manually calculate the output using the below
hidden_state1 = torch.tanh(t[0][0] * rnn.weight_ih_l0)
print(hidden_state1)
hidden_state2 = torch.tanh(t[0][1] * rnn.weight_ih_l0 + hidden_state1 * rnn.weight_hh_l0)
print(hidden_state2)
tensor([[0.6168]], grad_fn=<TanhBackward>)
tensor([[0.7656]], grad_fn=<TanhBackward>)
The result was correct. hidden_state1 and hidden_state2 match the output.
Shouldn’t the hidden_states get multiplied with output weights to get the output?
I checked for weights connecting from hidden state to output. But there are no weights at all.
If the objective of the RNN is to calculate only hidden states, could anyone tell me how to get the output?
"Shouldn't the hidden_states get multiplied with output weights to get the output?"
Yes and no. It depends on your problem formulation. Suppose you are dealing with a case where only the output from the last timestep matters. In that case it doesn't make sense to multiply the hidden state by an output weight at every timestep.
That's why PyTorch only gives you the hidden states as a generic output; after that you can do whatever you want with them, according to your problem.
In your particular case, suppose you want to apply another linear layer to the output at each timestep. You can do so simply by defining a linear layer and passing the hidden states through it.
# Linear layer
# hidden_feature_size = 1 in your case; output_feature_size is up to you
lin_layer = nn.Linear(hidden_feature_size, output_feature_size)
# output from the first timestep (batch_first=True, so time is axis 1)
lin_layer(output[:, 0])
# output from the second timestep
lin_layer(output[:, 1])
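As a runnable sketch of the same idea, the linear layer can also be applied to the whole output tensor at once, since nn.Linear acts on the last axis. The name head and the output size of 3 are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(1, 1, 1, bias=False, batch_first=True)
head = nn.Linear(1, 3)               # hidden size 1 -> 3 outputs (3 is an arbitrary choice)

t = torch.ones(size=(1, 2, 1))       # (batch, seq, features)
output, hidden = rnn(t)              # output: (1, 2, 1), hidden: (1, 1, 1)

# nn.Linear is applied to the last axis, so all timesteps are projected in one call.
y_all = head(output)                 # (1, 2, 3)

# Same thing, one timestep at a time.
y_steps = torch.stack([head(output[:, i]) for i in range(output.size(1))], dim=1)

same_projection = torch.allclose(y_all, y_steps, atol=1e-6)
last_is_hidden = torch.allclose(output[:, -1], hidden[-1], atol=1e-6)
print(same_projection, last_is_hidden)
```

The second check confirms what the question observed: for a single-layer, unidirectional RNN, the last element of output is the same tensor as hidden.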
I'm training a transformer model for text generation.
let's assume:
vocab size = 100
embedding size = 50
max sequence length = 30
batch size = 32
loss = cross entropy loss
the last layer in the model is a fully connected layer,
mapping from shape [30, 32, 50] to [30, 32, 100].
The idea is that for each of the 30 positions along the first dimension, I have a target vector that I want to calculate the loss against.
The issue is that, based on the docs, this loss only accepts 2 dims for the prediction and one for the target, so how can I fit my 3D prediction into it?
(and the 2D target?)
Use torch.nn.BCELoss() instead (binary cross entropy). It expects input and target to be the same size, but they can be any size, and the values should fall within the range [0,1]. It computes the cross-entropy loss element-wise.
EDIT: if you expect only one element from the vocab to be output, then you should use CrossEntropyLoss and instead encode your labels as a 1D vector rather than a 2D vector (i.e. do 1-hot decoding). BCE treats each element in the output for a single example as independent from the others, which is not a valid assumption for a multi-class style problem. I originally misread and thought the final output was an embedding, rather than an element from the vocabulary, hence my original suggestion.
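For the CrossEntropyLoss route, the extra sequence dimension can be handled in two ways, sketched below with the sizes from the question and random stand-in data. One option flattens the sequence and batch axes together; the other moves the class axis to position 1, which nn.CrossEntropyLoss accepts for inputs with extra dimensions. Targets are assumed to be integer class indices of shape [30, 32].

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, vocab = 30, 32, 100
logits = torch.randn(seq_len, batch, vocab)             # model output, [30, 32, 100]
targets = torch.randint(0, vocab, (seq_len, batch))     # class indices, [30, 32]

loss_fn = nn.CrossEntropyLoss()

# Option 1: merge sequence and batch axes -> [960, 100] vs [960].
loss_flat = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))

# Option 2: put classes on axis 1 -> [32, 100, 30] vs [32, 30].
loss_perm = loss_fn(logits.permute(1, 2, 0), targets.permute(1, 0))

same_loss = torch.allclose(loss_flat, loss_perm, atol=1e-5)
print(loss_flat.item(), loss_perm.item(), same_loss)
```

With the default mean reduction both options average over the same 960 predictions, so they should agree up to floating-point error.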
This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor of outputs, where T is the max number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent outputs will be all zeros.
Here are my questions.
For a single-output task where one needs the last output of every sequence, a simple outputs[-1] will give a wrong result, since this tensor contains lots of zeros for short sequences. One has to construct indices from the sequence lengths to fetch the individual last output of each sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O, and computes the cross entropy loss against the true targets TB (usually integers in a language model). In this situation, do these zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        # squeeze(1) drops only the time axis, so a batch of size 1 keeps its batch axis
        return unpacked.gather(1, idx).squeeze(1)

    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch (lengths are passed as plain Python ints)
        packed = pack_padded_sequence(embs, lengths.tolist(),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In older versions of PyTorch, masking was not supported, so you had to implement your own workaround. A solution that I used was masked_cross_entropy.py, by jihunchoi. You may also be interested in this discussion.
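A small sketch of what ignore_index does, using made-up logits and targets where class 0 is assumed to be the padding index. It compares the masked loss with a loss computed by hand on the non-padded positions only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 10)                      # 6 flattened timesteps, 10 classes
targets = torch.tensor([3, 7, 1, 0, 0, 0])       # trailing zeros are padding

# Padded targets contribute nothing to the loss or its gradient.
loss_masked = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)

# Manual equivalent: drop the padded positions before computing the loss.
keep = targets != 0
loss_manual = nn.CrossEntropyLoss()(logits[keep], targets[keep])

same_loss = torch.allclose(loss_masked, loss_manual, atol=1e-6)
print(loss_masked.item(), loss_manual.item(), same_loss)
```

Note that this only works if index 0 is reserved for padding and never used as a real class.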
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
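The sketch below compares the approaches on a toy batch: naive out[:, -1], the gather-based version, the indexing one-liner (written with torch.arange instead of np.arange), and the final hidden state returned by the RNN. It assumes a unidirectional single-layer GRU, made-up sizes, and lengths sorted in descending order as pack_padded_sequence expects by default:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.GRU(3, 5, batch_first=True)
x = torch.randn(4, 6, 3)                         # (batch, max_len, features)
lengths = torch.tensor([6, 4, 3, 1])             # true lengths, sorted descending

packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, h = rnn(packed)
out, _ = pad_packed_sequence(out_packed, batch_first=True)   # (4, 6, 5)

# Naive indexing hits the zero padding for every sequence shorter than max_len.
naive_is_padding = bool((out[1:, -1] == 0).all())

# Indexing one-liner: pick step (length - 1) for each batch element.
last_by_index = out[torch.arange(out.size(0)), lengths - 1]

# Gather-based version, as in last_timestep().
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))
last_by_gather = out.gather(1, idx).squeeze(1)

index_eq_gather = torch.allclose(last_by_index, last_by_gather)
index_eq_hidden = torch.allclose(last_by_index, h[-1], atol=1e-6)
print(naive_is_padding, index_eq_gather, index_eq_hidden)
```

For a packed input, the returned hidden state already holds each sequence's state at its own last valid step, so in the unidirectional case h[-1] can be used directly without any indexing.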
I built a convolutional neural network in Keras.
model.add(Convolution1D(nb_filter=111, filter_length=5, border_mode='valid', activation="relu", subsample_length=1))
According to the CS231 lecture, a convolution operation creates a feature map (i.e. activation map) for each filter, and these maps are then stacked together. In my case the convolutional layer has a 300-dimensional input. Hence, I expect the following computation:
Each filter has a window size of 5. Consequently, each filter produces 300-5+1=296 convolutions.
As there are 111 filters there should be a 111*296 output of the convolutional layer.
However, the actual output shapes look different:
convolutional_layer = model.layers[1]
conv_weights, conv_biases = convolutional_layer.get_weights()
print(conv_weights.shape) # (5, 1, 300, 111)
print(conv_biases.shape) # (111,)
The shape of the bias values makes sense, because there is one bias value for each filter. However, I do not understand the shape of the weights. Apparently, the first dimension depends on the filter size. The third dimension is the number of input neurons, which should have been reduced by the convolution. The last dimension probably refers to the number of filters. This does not make sense, because how should I easily get the feature map for a specific filter?
Keras either uses Theano or Tensorflow as a backend. According to their documentation the output of a convolving operation is a 4d tensor (batch_size, output_channel, output_rows, output_columns).
Can somebody explain the output shape to me in accordance with the CS231 lecture?
Your weight dimensions have to be [filter_height, filter_width, in_channel, out_channel].
In your example, I think the input channel count, which is the depth of the input, is 300, and you want the output channel count to be 111.
The total number of filters is 111, not 300*111.
As you said yourself, there is one bias per filter, so 111 biases for 111 filters.
Each of the 111 filters will produce one convolution over the input.
The weight shape in your case means that you are using a kernel patch of shape 5*1.
The third dimension means that the depth of the input feature map is 300.
The fourth dimension means that the depth of the output feature map is 111.
Actually it makes very good sense. You learn the weights of the filters. Each filter in turn produces an output (aka an activation map for your input data).
The first two axes of your conv_weights.shape are the dimensions of the filter that is being learned (as you already mentioned). Your filter size is 5 x 1. Your input has 300 dimensions and there are 111 filters, each of which spans all 300 input dimensions, so you end up with 300 * 111 kernel slices of 5 x 1 weights, grouped into 111 filters.
I assume that the weights of filter #0 for input dimension #0 are something like your_weights[:, :, 0, 0].
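To connect the weight shape to the feature maps, here is a sketch written in PyTorch rather than Keras, with random stand-in weights of the reported shape (5, 1, 300, 111). The sequence length of 20 is a made-up value, and the input is assumed to be a sequence of 300-dimensional vectors. Each filter sums over its 5-step window and all 300 input channels, giving one number per window position:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
steps, in_ch, k, n_filters = 20, 300, 5, 111     # steps=20 is a hypothetical sequence length
x = torch.randn(steps, in_ch)                    # one input sequence
w = torch.randn(k, 1, in_ch, n_filters)          # same layout as conv_weights
b = torch.zeros(n_filters)                       # same layout as conv_biases

kernel = w[:, 0]                                 # drop the dummy width axis -> (5, 300, 111)
out_steps = steps - k + 1                        # 'valid' mode: 20 - 5 + 1 = 16 positions

# Slide the window; each filter sums over 5 steps x 300 channels.
pre_act = torch.stack([
    torch.einsum("ki,kio->o", x[t:t + k], kernel) + b
    for t in range(out_steps)
])
out = torch.relu(pre_act)                        # (16, 111)

feature_map_0 = out[:, 0]                        # activation map of filter #0, length 16

# Cross-check against the built-in 1D convolution (weights as (out, in, k)).
reference = F.conv1d(x.T.unsqueeze(0), kernel.permute(2, 1, 0), b)[0].T
matches = torch.allclose(pre_act, reference, atol=1e-3)
print(out.shape, feature_map_0.shape, matches)
```

Under these assumptions the layer output has one column per filter, so the feature map for a specific filter is read from the layer's output, not from the weight tensor.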