Confusion about PyTorch LSTM implementation - pytorch

As we all know, PyTorch's LSTM implementation supports stacked (multi-layer) bidirectional LSTMs.
The first layer's input is supposed to have shape (L, N, H_in). If we use a bidirectional LSTM, then the output of the first layer has shape (L, N, 2*H_hiddensize) (official doc).
I can't figure out how this output is fed into the second LSTM layer. Will the outputs of the backward and forward directions be merged or concatenated?
I checked the source code of its implementation (source code), but I fail to understand it.
layers = [_LSTMLayer(self.input_size, self.hidden_size,
                     self.bias, batch_first=False,
                     bidirectional=self.bidirectional, **factory_kwargs)]
for layer in range(1, num_layers):
    layers.append(_LSTMLayer(self.hidden_size, self.hidden_size,
                             self.bias, batch_first=False,
                             bidirectional=self.bidirectional,
                             **factory_kwargs))
for idx, layer in enumerate(self.layers):
    x, hxcx[idx] = layer(x, hxcx[idx])
Why can the output of the first layer, with shape (L, N, 2*H_hiddensize), be fed into the second layer, which expects an input of shape (L, N, H_hiddensize) rather than (L, N, 2*H_hiddensize)?

I can't figure out how this output is fed into the second LSTM layer.
will the output of the backward layer and the forward layer be merged
or concatenated?
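They are concatenated. At each time step, the output of a bidirectional LSTM concatenates the forward and reverse hidden states along the feature dimension, so the last element of the output holds the last step of the forward direction and the first step of the reverse direction.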
reference:
Pytorch LSTM documentation
For bidirectional LSTMs, h_n is not equivalent to the last element of
output; the former contains the final forward and reverse hidden
states, while the latter contains the final forward hidden state and
the initial reverse hidden state.
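To make the shapes concrete, here is a minimal, framework-free sketch in plain Python (not PyTorch's actual code). It assumes the documented behaviour of torch.nn.LSTM, where the input-hidden weights of every layer after the first take num_directions * hidden_size input features, so a bidirectional layer consumes the concatenated (L, N, 2*H) output of the layer below it. The helper names layer_input_sizes and run_direction are hypothetical, and the recurrence h_t = 2 * h_prev + x_t is only a toy stand-in for an LSTM cell, used to show why the last element of output differs from h_n.

```python
def layer_input_sizes(input_size, hidden_size, num_layers, bidirectional):
    """Feature width each stacked layer receives (hypothetical helper)."""
    num_directions = 2 if bidirectional else 1
    # Layer 0 sees the raw input; every later layer sees the concatenated
    # forward and backward hidden states of the layer below it.
    return [input_size] + [num_directions * hidden_size] * (num_layers - 1)


def run_direction(xs, reverse=False):
    """Toy recurrence h_t = 2 * h_prev + x_t, a stand-in for an LSTM cell."""
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    h = 0
    states = [None] * len(xs)
    for t in order:
        h = 2 * h + xs[t]
        states[t] = h  # store the state at its original time index
    return states, h


xs = [1, 2, 3, 4]
fwd, fwd_final = run_direction(xs)
bwd, bwd_final = run_direction(xs, reverse=True)

# "output": per time step, forward and backward states side by side
output = list(zip(fwd, bwd))
# "h_n": the final state of each direction
h_n = (fwd_final, bwd_final)

print(layer_input_sizes(10, 20, 3, bidirectional=True))  # [10, 40, 40]
print(output[-1])  # (26, 4): final forward state, first backward step
print(h_n)         # (26, 49): final state of both directions
```

In this toy run the backward half of h_n matches output[0], not output[-1], which is the distinction the quoted documentation draws.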

Related

Keras CNN that takes a matrix (single-channel) input, with an MLP layer between the input and the first convolution

Using Keras, I want to build a CNN that takes a matrix (single-channel) input,
but with an MLP layer between the input and the first convolutional layer. The convolution is applied to the outputs of the first hidden layer.
How can I code this?

Keras custom layer function

I am following the self-attention example for Keras at the following link: How to add attention layer to a Bi-LSTM
I am new to Python. What do shape=(input_shape[-1],1) in self.add_weight and shape=(input_shape[1],1) for the bias mean?
The shape argument sets the shape of the weight tensor that add_weight creates. input_shape is the shape of the tensor the layer receives, so input_shape[-1] is its last dimension and input_shape[1] is its second dimension. In your case, the weight gets one row per entry of the last input dimension (the features), and the bias gets one row per entry of the second input dimension (the timesteps, for a sequence input of shape (batch, timesteps, features)).
Because these shapes are fixed when the layer is built, the input dimensions they depend on must be known at that point.
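As a small illustration of the indexing only, the sketch below computes the two shapes from an example input_shape in plain Python. The function name attention_param_shapes and the example shape (None, 10, 64) are assumptions for illustration, not part of Keras or of the linked answer.

```python
def attention_param_shapes(input_shape):
    """Return the shapes used for the weight and the bias (hypothetical helper)."""
    weight_shape = (input_shape[-1], 1)  # last axis: features per timestep
    bias_shape = (input_shape[1], 1)     # second axis: number of timesteps
    return weight_shape, bias_shape


# Assumed example: unknown batch size (None), 10 timesteps, 64 features
weight_shape, bias_shape = attention_param_shapes((None, 10, 64))
print(weight_shape)  # (64, 1)
print(bias_shape)    # (10, 1)
```

Note that the bias shape depends on the number of timesteps, so with this definition the sequence length has to be a concrete number rather than None.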

How to use masking with Convolution1D layer in keras?

I am trying to perform a sentiment classification task using an attention-based architecture that has both convolutional and BiLSTM layers. The first layer of my model is an Embedding layer followed by a Convolution1D layer. I have used mask_zero=True for the Embedding layer since I have padded the sequences with zeros. However, this causes an error in the Convolution1D layer, since that layer does not support masking. I do need to mask the zero inputs, though, because I have LSTM layers after the convolutional layers. Does anyone have a solution for this? I have attached sample code of my model up to the Convolution1D layer for reference.
wordsInputs = Input(shape=(maxSeq,), name='words_input')
embed_reg = l2(l=0.001)
emb = Embedding(vocabSize, 300, mask_zero=True, init='glorot_uniform', W_regularizer=embed_reg)(wordsInputs)
convOutput = Convolution1D(nb_filter=100, filter_length=3, activation='relu', border_mode='same')(emb)
It looks like you have defined a maxSeq length, and you say you are padding the sequences with zeros. mask_zero means something more specific: zero becomes a reserved input index that must not be used for a real word, because it marks the padded positions of a variable-length sequence so that downstream layers can skip them.
I think the solution is simply to remove the parameter mask_zero=True, as it is unneeded (it is meant for variable-length sequences), and to use zero as your padding word index.
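To show what reserving index 0 amounts to, here is a framework-free sketch of the boolean mask that corresponds to zero-padded input: True for real tokens, False for padding. The names PAD and padding_mask and the example batch are made up for illustration; this is not Keras code.

```python
PAD = 0  # index reserved for padding; no real word may use it


def padding_mask(batch):
    """True where a real token is present, False at padded positions."""
    return [[token != PAD for token in seq] for seq in batch]


# Two sequences padded with zeros to a common length of 5
batch = [[5, 9, 2, 0, 0],
         [7, 3, 0, 0, 0]]
mask = padding_mask(batch)
lengths = [sum(row) for row in mask]  # number of real tokens per sequence

print(mask[0])  # [True, True, True, False, False]
print(lengths)  # [3, 2]
```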

How are contents of hidden_states tuple in BertModel in the transformers library arranged

model = BertModel.from_pretrained('bert-base-uncased', config=BertConfig.from_pretrained('bert-base-uncased',output_hidden_states=True))
outputs = model(input_ids)
hidden_states = outputs[2]
hidden_states is a tuple of 13 torch.FloatTensors. Each tensor is of size: (batch_size, sequence_length, hidden_size).
According to the documentation, the 13 tensors are the hidden states of the embedding and the 12 encoder layers.
My question:
Is hidden_states[0] the embedding layer while hidden_states[12] is the 12th encoder layer or
Is hidden_states[0] the embedding layer while hidden_states[12] is the 1st encoder layer or
Is hidden_states[0] the 12th encoder layer while hidden_states[12] is the embedding layer or
Is hidden_states[0] the 1st encoder layer while hidden_states[12] is the embedding layer
I haven't found this clearly stated anywhere.
Looking at the source code for BertModel, it can be concluded that hidden_states[0] contains the output of the initial embedding layer, and the remaining elements of the tuple contain the hidden states of each layer in increasing order. Simply put, hidden_states[1] contains the output of the first BERT layer and hidden_states[12] contains the output of the last, i.e. 12th, layer.
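The ordering described above can be mimicked with a toy encoder loop in plain Python. This is only a sketch of the collection order, not the transformers implementation: the function encode and the string-tagging "layers" are hypothetical stand-ins.

```python
def encode(embedding_output, layers):
    """Collect hidden states: embedding output first, then one entry per layer."""
    hidden_states = [embedding_output]
    h = embedding_output
    for layer in layers:
        h = layer(h)
        hidden_states.append(h)
    return tuple(hidden_states)


# 12 stand-in "layers"; each one labels its output with its 1-based index
layers = [lambda h, i=i: f"layer_{i}" for i in range(1, 13)]
hidden_states = encode("embeddings", layers)

print(len(hidden_states))  # 13
print(hidden_states[0])    # embeddings
print(hidden_states[1])    # layer_1
print(hidden_states[12])   # layer_12
```

On a real model, one way to check the ordering yourself is to compare hidden_states[-1] with outputs[0] (the last hidden state); they should hold the same values.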

LSTM with variable sequences & return full sequences

How can I set up a keras model such that the final LSTM layer outputs a prediction for each time step while having variable sequence lengths as input?
I'd then like to provide labels for each of the timesteps after a dense layer with linear activation.
When I try to add a reshape or a dense layer to the LSTM model that is returning the full sequence and has a masking layer to take care of variable sequence lengths, it says:
The reshape and the dense layers do not support masking.
Would this be possible to do?
You can use the TimeDistributed layer wrapper for this. It applies the wrapped layer to each timestep independently, using the same weights at every step. In your case, you could also just use TimeDistributedDense, which is available in older Keras versions.
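As a rough picture of what the wrapper does, the plain-Python sketch below applies one linear layer, with a single shared set of weights, to every timestep of a sequence. The functions dense and time_distributed and the numbers are illustrative assumptions, not Keras code.

```python
def dense(features, weights, bias):
    """Linear layer for one timestep: dot(features, weights) + bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def time_distributed(layer, sequence, *params):
    """Apply the same layer, with the same parameters, to every timestep."""
    return [layer(step, *params) for step in sequence]


# One sequence with 3 timesteps and 2 features per timestep
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, bias = [0.5, -1.0], 0.25

predictions = time_distributed(dense, sequence, weights, bias)
print(predictions)  # [-1.25, -2.25, -3.25]
```

Because the layer is applied step by step, the same code works for a sequence of any length and yields one prediction per timestep.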
