Suppose I provide a list of sentences:
['I like python',
'I am learning python', # longest sentence of length 4 tokens
'Python is simple']
BERT will produce an output of shape (3, 4+2, 768),
because there are 3 sentences, a maximum of 4 tokens, and a hidden size of 768.
Suppose I provide another list of sentences:
['I like python',
'I am learning python',
'Python is simple',
'Python is fun to learn' # 5 tokens
]
The new embedding output would be of shape (4, 5+2, 768).
I understand that dim[0] becomes 4 because there are now 4 sentences. This is achieved by increasing the number of rows of the tensor (the batch size) during tensor computation.
I also understand that dim[1] becomes 5+2 because the maximum number of tokens is 5 and the [CLS] and [SEP] tokens are added at the start and end.
I also understand that there is a padding mechanism that accepts up to max_position_embeddings=512 tokens for the BERT model.
What I want to ask is:
During computation, does BERT pad all the values after the 5th element with zeros and run the computation on an input of shape (4, 512) (4 sentences, 512 max tokens),
and then trim the resulting (4, 512, 768) output to (4, 5+2, 768)?
If the above assumption is true, isn't it a huge waste of resources, since the majority of the 512 tokens do not require attention?
I read about the attention_mask matrix that tells the model which tokens are needed for computation, but I don't understand how attention_mask achieves this. When the architecture of the model is initialised with N-dimensional inputs, how does attention_mask help during computation to ignore or avoid computing the masked elements?
Which part of the BERT model explicitly restricts the output to (4, 5+2, 768)?
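To make the question concrete, here is a minimal single-head sketch in plain PyTorch of how an attention mask is typically applied. It is not the Hugging Face implementation: the toy hidden size of 8 (instead of 768), the random stand-in for the embedded inputs, and the choice q = k = v are all assumptions made for illustration. It shows two things: the sequence dimension is simply the padded length of the batch (7 here, not 512), and padded positions are still computed but receive zero attention weight.

```python
import torch

torch.manual_seed(0)

# Token counts including [CLS] and [SEP] for the four example sentences.
lengths = torch.tensor([3 + 2, 4 + 2, 3 + 2, 5 + 2])
batch_size = len(lengths)
seq_len = int(lengths.max())  # pad only to the longest sentence: 7
hidden = 8                    # toy size (assumption); BERT-base uses 768

# attention_mask: 1 for real tokens, 0 for padding. Shape (4, 7), not (4, 512).
attention_mask = (torch.arange(seq_len)[None, :] < lengths[:, None]).long()

# Random stand-in for the embedded inputs (assumption).
x = torch.randn(batch_size, seq_len, hidden)

# Scaled dot-product scores with q = k = v = x for brevity.
scores = x @ x.transpose(-1, -2) / hidden ** 0.5  # (4, 7, 7)

# Padded key positions get -inf, so softmax assigns them zero weight.
scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
context = weights @ x  # (4, 7, 8)

print(attention_mask)
print(context.shape)
```

Nothing in this sketch trims the output: its shape follows directly from the shape of the input batch. The limit of 512 only bounds how long a sequence may be, because the position embedding table has that many rows.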
Related
The GRU model in PyTorch outputs two objects: the output features as well as the hidden states. I understand that for classification one uses the output features, but I'm not entirely sure which of them. Specifically, in a typical encoder-decoder architecture that uses a GRU in the encoder part, one would typically only pass the last (time-wise, i.e., t = N, where N is the length of the input sequence) output to the decoder. Which part of the output tensor refers to this time-wise last output?
The GRU is created like so (note that it is bidirectional):
self.gru = nn.GRU(
    700,
    700,
    bidirectional=True,
    batch_first=True,
)
Given some embedding vector representing a piece of text of size 150x700, I use the GRU like so (150 is the sequence length, 700 the embedding dimension):
gru_out, gru_hidden = self.gru(embedding)
gru_out will be of shape 150x1400, where 150 is again the sequence length and 1400 is double the embedding dimension, which is because of the GRU being a bidirectional one (in terms of pytorch's documentation, hidden_size*num_directions).
If I only want to access the time-wise last output, do I need to access it like so?
tmp = gru_out.view(150, 2, 700)
last_out_first_direction = tmp[149, 0, :]
last_out_second_direction = tmp[149, 1, :]
While this technically seems right and is similar to the answer posted here, it would also require that the actual input sequence is always of length 150, whereas typically you also have shorter input sequences that are simply padded to length 150. However, with a GRU one is typically interested in the last actual input token, which can therefore also be at a position <150. What is a common way to access the actual last token or time-step (<=150) instead of only the technically last step (always =150)?
Side question: Is the output of the second direction reversed (since the direction in which information is passed through the GRU is also reversed compared to the first direction) so I should actually access last_out_second_direction = tmp[0, 1, :] instead of tmp[149, 1, :]?
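One way to check both points empirically is to pack the padded batch and compare the output tensor against the final hidden state that nn.GRU returns. The sketch below uses small made-up sizes (sequence length 6, embedding size 4, hidden size 3) rather than the 150 and 700 from the question, and a hypothetical batch of two sequences with true lengths 6 and 4.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

seq_len, emb_dim, hidden = 6, 4, 3  # toy sizes (assumption)
gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)

batch = torch.randn(2, seq_len, emb_dim)  # padded batch of 2 sequences
lengths = torch.tensor([6, 4])            # true lengths, sorted descending

# Packing lets the GRU stop at each sequence's true last step.
packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, h_n = gru(packed)             # h_n: (2 directions, batch, hidden)
gru_out, _ = pad_packed_sequence(packed_out, batch_first=True)

# Split the last dimension into (direction, hidden).
out = gru_out.reshape(2, seq_len, 2, hidden)
rows = torch.arange(2)

# Forward direction: final state sits at the last *actual* time step.
forward_last = out[rows, lengths - 1, 0]
# Backward direction: it reads right to left, so its final state is at t = 0.
backward_last = out[rows, 0, 1]

print(torch.allclose(forward_last, h_n[0]))
print(torch.allclose(backward_last, h_n[1]))
```

In this setup the returned hidden state already holds the per-direction final states, so indexing into the output is only needed when the intermediate steps matter as well.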
I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library.
I am using data from this Kaggle competition. Given a question title, question body and answer, the model must predict 30 values (regression problem). My goal is to get the following encoding as input to BERT:
[CLS] question_title question_body [SEP] answer [SEP]
However, when I try to use
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
and encode only the second input from train.csv as follows:
inputs = tokenizer.encode_plus(
    df_train["question_title"].values[1] + " " + df_train["question_body"].values[1],  # first sequence to be encoded
    df_train["answer"].values[1],  # second sequence to be encoded
    add_special_tokens=True,  # [CLS] and 2x [SEP]
    max_len = 512,
    pad_to_max_length=True
)
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (46 > 512). Running this sequence through the model will result in indexing errors
It says that the length of the token indices is longer than the specified maximum sequence length, but this is not true (as you can see, 46 is not > 512).
This happens for several of the rows in df_train. Am I doing something wrong here?
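For reference, the target layout [CLS] first_sequence [SEP] second_sequence [SEP] can be sketched in plain Python. The helper below is hypothetical and works on already-split tokens; it only illustrates the layout, the truncation budget and the padding, and is not how the Transformers library implements it. As a side note, the length keyword documented for encode_plus is max_length; the max_len in the snippet above is kept as posted.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Toy illustration of the [CLS] a [SEP] b [SEP] layout (hypothetical helper)."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP]")
    budget = max_length - 3  # room left after the three special tokens
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)

    # Trim the longer sequence one token at a time until the pair fits.
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + first sequence + [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    pad = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


enc = encode_pair(["what", "is", "bert"], ["a", "language", "model"], max_length=12)
print(enc["tokens"])
print(enc["token_type_ids"])
print(enc["attention_mask"])
```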
The 'bert-base-uncased' model is not pre-trained to handle long inputs of the form [CLS] + Question + [SEP] + Context + [SEP]. Any other Hugging Face model dedicated to the SQuAD question-answering datasets would handle the long sequence.
For example, if I were using ALBERT, I would go for the 'ktrapeznikov/albert-xlarge-v2-squad-v2' model.
How can I apply the SMOTE algorithm before the word embedding layer in an LSTM?
I have a binary text classification problem: Good (9500) or Bad (500) reviews, 10000 training samples in total, so the training set is unbalanced. I am using an LSTM with pre-trained word embeddings (a 100-dimensional vector for each word), so each training input consists of 50 word-dictionary ids (zero-padded when the description has fewer than 50 words, and trimmed to 50 when it exceeds 50 words).
Below is my general flow,
Input - 1000 (batch) X 50 (sequence length)
Word embedding - 200 (unique vocabulary words) X 100 (word representation)
After the word embedding layer (new input for the LSTM) - 1000 (batch) X 50 (sequence) X 100 (features)
Final state from the LSTM - 1000 (batch) X 100 (units)
Apply final layer - 1000 (batch) X 100 X [100 (units) X 2 (output classes)]
All I want is to generate more data for the Bad reviews with the help of SMOTE.
I faced the same issue.
I found this post on Stack Exchange, which proposes adjusting the class weights instead of oversampling. Apparently this is the standard way to deal with class imbalance in LSTMs / RNNs.
https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes
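As a rough sketch of what class weighting could look like for the 9500/500 split in the question, the PyTorch snippet below derives weights with the common "balanced" heuristic n_samples / (n_classes * class_count) and passes them to the loss. The heuristic, the class order [Good, Bad], and the dummy logits are assumptions for illustration, not something the linked post prescribes.

```python
import torch
import torch.nn as nn

# Class counts from the question, in the assumed order [Good, Bad].
counts = torch.tensor([9500.0, 500.0])

# "Balanced" heuristic: rarer classes get proportionally larger weights.
class_weights = counts.sum() / (len(counts) * counts)
print(class_weights)

# Errors on the rare class now contribute more to the loss.
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Dummy batch: 4 samples, 2 classes, uninformative logits (assumption).
logits = torch.zeros(4, 2)
targets = torch.tensor([0, 0, 0, 1])
loss = loss_fn(logits, targets)
print(loss.item())
```

With weights like these, a single misclassified Bad review costs about as much as nineteen misclassified Good reviews, which is the effect oversampling would otherwise try to achieve.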
I am confused about how to process character-level information in an RNN using Keras. I want to implement something like this model structure.
I have 1800 sentences; each sentence has a length (time_stamp) of 150 words and each word has 16 characters. A Gensim model gives me word embeddings of size 100. There are 69 unique characters in the sentences, and each character is represented by a one-hot encoding of size 70.
The shape of the word-level bi-LSTM input is: sentences X time_stamp X embedding_size (1800 x 150 x 100).
I know how to feed this into a Keras layer, but I am confused about the character-level input. The shape for the char level is: sentences X time_stamp X characters X char_embedding (1800 x 150 x 16 x 70).
I am a beginner with Keras.
What you have in the image is a simple bi-directional LSTM:
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional
model = Sequential()
model.add(Bidirectional(LSTM(128), input_shape=(maxlen, len(chars))))
# Bidirectional concatenates the output of both directions by default.
For a fuller example, you can check the text generation example, which uses character-level processing.
This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor of outputs, where T is the maximum number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent outputs will be all zeros.
Here are my questions.
For a single-output task where one needs the last output of every sequence, a simple outputs[-1] will give a wrong result, since this tensor contains lots of zeros for short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output of each sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O, and computes the cross-entropy loss against the true targets TB (usually integers in a language model). In this situation, do the zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        # squeeze(1) drops only the time dimension, so a batch of size 1 keeps its batch dimension
        return unpacked.gather(1, idx).squeeze(1)

    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch
        packed = pack_padded_sequence(embs, lengths.tolist(),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero-padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In older versions of PyTorch, masking was not supported, so you had to implement your own workaround. The solution that I used was masked_cross_entropy.py, by jihunchoi. You may also be interested in this discussion.
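A small check of what ignore_index does, under the assumption that index 0 is reserved for padding and with made-up sizes (T = 4, B = 2, O = 5): the masked loss should equal the plain loss computed only over the non-padded targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, B, O = 4, 2, 5  # toy sizes (assumption)
logits = torch.randn(T * B, O)                     # flattened TB x O outputs
targets = torch.tensor([3, 1, 2, 4, 1, 0, 0, 0])   # 0 marks padded positions

# Padded targets are skipped and the mean is taken over the remaining ones.
loss_masked = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)

# Equivalent manual masking: drop padded rows before computing the loss.
keep = targets != 0
loss_manual = nn.CrossEntropyLoss()(logits[keep], targets[keep])

print(torch.allclose(loss_masked, loss_manual))
```

Note that this only works cleanly if class 0 is never a real target, which is why the padding index is usually reserved in the vocabulary.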
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
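The equivalence of the two approaches can be checked on a toy batch-first tensor. The sizes and lengths below are made up for illustration, and torch.arange is used in place of np.arange so the snippet needs only PyTorch.

```python
import torch

torch.manual_seed(0)

B, T, N = 3, 5, 4  # toy sizes (assumption)
unpacked_out = torch.randn(B, T, N)  # stand-in for the pad_packed_sequence output
lengths = torch.tensor([5, 3, 2])    # true length of each sequence

# gather-based approach, as in the last_timestep() method above
idx = (lengths - 1).view(-1, 1, 1).expand(B, 1, N)
last_gather = unpacked_out.gather(1, idx).squeeze(1)

# indexing one-liner: pick row i at time step lengths[i] - 1
last_index = unpacked_out[torch.arange(B), lengths - 1, :]

print(torch.equal(last_gather, last_index))
```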