Token indices sequence length error when using encode_plus method - nlp

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library.
I am using data from this Kaggle competition. Given a question title, question body and answer, the model must predict 30 values (regression problem). My goal is to get the following encoding as input to BERT:
[CLS] question_title question_body [SEP] answer [SEP]
However, when I try to use
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
and encode only the second input from train.csv as follows:
inputs = tokenizer.encode_plus(
df_train["question_title"].values[1] + " " + df_train["question_body"].values[1], # first sequence to be encoded
df_train["answer"].values[1], # second sequence to be encoded
add_special_tokens=True, # [CLS] and 2x [SEP]
max_len = 512,
pad_to_max_length=True
)
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (46 > 512). Running this sequence through the model will result in indexing errors
It says that the length of the token indices is longer than the specified maximum sequence length, but this is not true (as you can see, 46 is not > 512).
This happens for several of the rows in df_train. Am I doing something wrong here?

The model 'bert-base-uncased' is not pre-trained to handle the long texts of [CLS] + Question + [SEP] + Context + [SEP]. Any other model from Huggingface models dedicated especially for the squad question-answer datasets would handle the long sequence.
For example if I am using the ALBERT model, I would go for 'ktrapeznikov/albert-xlarge-v2-squad-v2' model.

Related

BERT sentence embeddings: how to obtain sentence embeddings vector

I'm using the module bert-for-tf2 in order to wrap BERT model as Keras layer in Tensorflow 2.0 I've followed your guide for implementing BERT model as Keras layer.
I'm trying to extract embeddings from a sentence; in my case, the sentence is "Hello"
I have a question about the output of the model prediction; I've written this model:
model_word_embedding = tf.keras.Sequential([
tf.keras.layers.Input(shape=(4,), dtype='int32', name='input_ids'),
bert_layer
])
model_word_embedding .build(input_shape=(None, 4))
Then I want to extract the embeddings for the sentence written above:
sentences = ["Hello"]
predict = model_word_embedding .predict(sentences)
the object predict contains 4 arrays of 768 elements each:
print(predict)
print(len(predict))
print(len(predict[0][0]))
...
[[[-0.02768866 -0.7341324 1.9084396 ... -0.65953904 0.26496622
1.1610721 ]
[-0.19322394 -1.3134469 0.10383344 ... 1.1250225 -0.2988368
-0.2323082 ]
[-1.4576151 -1.4579685 0.78580517 ... -0.8898649 -1.1016986
0.6008501 ]
[ 1.41647 -0.92478925 -1.3651332 ... -0.9197768 -1.5469263
0.03305872]]]
4
768
I know that each array of that 4 represents my original sentence, but I want to obtain one array as the embeddings of my original sentence.
So, my question is: How can I obtain the embeddings for a sentence?
In BERT source code I read this:
For classification tasks, the first vector (corresponding to [CLS]) is used as the "sentence vector." Note that this only makes sense because the entire model is fine-tuned.
So I have to take the first array from the prediction output since it represents my sentence vector?
Thank you for your support
We should use [CLS] from the last hidden states as the sentence embeddings from BERT. According to the BERT paper [CLS] represent the encoded sentence of dimension 768. Following figure represents the use of [CLS] in more details. considering you have 2000 sentences.
#input_ids consist of all sentences padded to max_len.
last_hidden_states = model(input_ids)
features = last_hidden_states[0][:,0,:].numpy() # considering o only the [CLS] for each sentences
features.shape
# (2000, 768) dimension

LSTM getting caught up in loop

I recently implemented a name generating RNN "from scratch" which was doing ok but far from perfect. So I thought about trying my luck with pytorch's LSTM class to see if it makes a difference. Indeed it does and the outpus looks way better for the first 7 ~ 8 characters. But then the networks gets caught in a loop and outputs things like "laulaulaulau" or "rourourourou" (it is supposed the generate french names).
Is it a often occuring problem ? If so do you know a way to fix it ? I'm concern about the fact the network doesn't produce EOS tokens...
This is an issue which has already been asked here Why does my keras LSTM model get stuck in an infinite loop?
but not really answered hence my post.
here is the model :
class pytorchLSTM(nn.Module):
def __init__(self,input_size,hidden_size):
super(pytorchLSTM,self).__init__()
self.input_size = input_size
self.hidden_size = hidden_size
self.lstm = nn.LSTM(input_size, hidden_size)
self.output_layer = nn.Linear(hidden_size,input_size)
self.tanh = nn.Tanh()
self.softmax = nn.LogSoftmax(dim = 2)
def forward(self, input, hidden)
out, hidden = self.lstm(input,hidden)
out = self.tanh(out)
out = self.output_layer(out)
out = self.softmax(out)
return out, hidden
The input and target are two sequences of one-hot encoded vectors respectively with a start of sequence and end of sequence vector at the start and the end. They represent the characters inside of a name taken from the name list (database).
I use a and token on each name from the database. here are the function I use
def inputTensor(line):
#tensor starts with <start of sequence> token.
tensor = torch.zeros(len(line)+1, 1, n_letters)
tensor[0][0][n_letters - 2] = 1
for li in range(len(line)):
letter = line[li]
tensor[li+1][0][all_letters.find(letter)] = 1
return tensor
# LongTensor of second letter to end (EOS) for target
def targetTensor(line):
letter_indexes = [all_letters.find(line[li]) for li in range(len(line))]
letter_indexes.append(n_letters - 1) # EOS
return torch.LongTensor(letter_indexes)
training loop :
def train_lstm(model):
start = time.time()
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters())
n_iters = 20000
print_every = 1000
plot_every = 500
all_losses = []
total_loss = 0
for iter in range(1,n_iters+1):
line = randomChoice(category_line)
input_line_tensor = inputTensor(line)
target_line_tensor = targetTensor(line).unsqueeze(-1)
optimizer.zero_grad()
loss = 0
output, hidden = model(input_line_tensor)
for i in range(input_line_tensor.size(0)):
l = criterion(output[i], target_line_tensor[i])
loss += l
loss.backward()
optimizer.step()
the sampling function :
def sample():
max_length = 20
input = torch.zeros(1,1,n_letters)
input[0][0][n_letters - 2] = 1
output_name = ""
hidden = (torch.zeros(2,1,lstm.hidden_size),torch.zeros(2,1,lstm.hidden_size))
for i in range(max_length):
output, hidden = lstm(input)
output = output[-1][:][:]
l = torch.multinomial(torch.exp(output[0]),num_samples = 1).item()
if l == n_letters - 1:
break
else:
letter = all_letters[l]
output_name += letter
input = inputTensor(letter)
return output_name
The typical sampled output looks something like that :
Laurayeerauerararauo
Leayealouododauodouo
Courouauurourourodau
Do you know how I can improve that ?
I found the explanation :
When using instances of the LSTM class as part of a RNN, the default input dimensions are (seq_length,batch_dim,input_size). To be able to interpret the output of the lstm as a probability (over the set of inputs) I needed to pass it to a Linear layer before the Softmax call, which is where the problem happens : Linear instances expects the input to be in the format (batch_dim,seq_length,input_size).
To fix this, one needs to pass batch_first = True as an argument to the LSTM upon creation, and then feed the RNN with an input of the form (batch_dim, seq_length, input_size).
Some tips to improve the network in the order of importance (and ease of implementing):
1. Training data
If you want your generated samples to look real, you have to give some real data to the network. Find a set of names, split those into letters and transform into indices. This step alone would give way more realistic names.
2. Separate start and end tokens.
I would go with <SON> (Start Of Name) and <EON> (End Of Name). In this configuration neural network can learn combinations of letters leading to <EON> and combinations of letters coming after <SON>. ATM it's trying to fit two different concepts into this one custom token.
3. Unsupervised Pretaining
You may want to give your letters some semantic meaning instead of one-hot encoded vectors, check word2vec for basic approach.
Basically, each letter would be represented by N-dimensional vector (say 50 dimensions) and would be closer in space if the letter occurs more often next to another letter (a closer to k than x).
Simple way to implement that would be taking some text dataset and trying to predict next letter at each timestep. Each letter would be represented by random vector at the beginning, through backpropagation letter representations would be updated to reflect their similarity.
Check pytorch embedding tutorial for more info.
4. Different architecture
You may want to check Andrej Karpathy's idea for generating baby names. It is simply described here.
Essentially, after training, you feed your model with random letters (say 10) and tell it to predict the next letter.
You remove last letter from random seed and put the predicted one in it's place. Iterate until <EON> is outputted.

Can anybody help me implement hybrid embedding model using keras library as shown in the attached figure

I am confused that how to process character level information in RNN using keras. I want to implement something like this model structure.
I have 1800 sentences; each sentence length(time_stamp) 150 and each word length has 16 characters. Gensim model helps me to create word embedding of the size of 100. Unique characters in sentences are 69, for each character represented by one hot encoding is 70.
shape for word-level bi-lstm input is: sentences X time_stamp X embedding_size (1800 x 150 x 100)
I know that how to feed this into keras layer but I am confused with character level feeding. the shape for char-level is: sentences X time_stamp X characters X char_embedding (1800 x 150 x 16 x 70).
I am a beginner in for keras.
What you have in the image is a simple bi-directional LSTM:
model = Sequential()
model.add(Bidrectional(LSTM(128), input_shape=(maxlen, len(chars))))
# Bidrectional concatenates the output of both directions by default.
For a more full example you can check the text generation example which uses character level processing.

How to generate GloVe embeddings for POS tags? Python

For a sentence analysis task, I would like to take the sequence of POS tags associated with the sentence and feed it to my model as if the POS tags are words.
I am using GloVe to make representations of each word in the sentence and SpaCy to generate POS tags. However, GloVe embeddings do not make much sense for POS tags. So I will have to somehow create embeddings for each POS tag. What is the best way to do create embeddings for POS tags, so that I can feed POS sequences into my model in the same way I would feed sentences? Could anyone point to code examples of how to do this with GloVe in Python?
Added context
My task is a binary classification of sentence pairs, based on their resemblance (similar meaning vs different meaning).
I would like to use POS tags as words, so that the POS tags serve as an additional bit of information to compare the sentences. My current model does not use an LSTM as a way to predict sequences.
Most word embedding models still rely on an underlying assumption that the meaning of a word is induced by its usage context. For example, learning a word2vec embedding with skipgram or continuous bag of words formulations implicitly assumes a model in which the representation vector of the word is based on the context words that co-occur with the target word, specifically by learning to create embeddings that best solve the classification task of distinguishing pairs of words that contextually co-occur from random pairs of words (so-called negative sampling).
But if the input is changed to be a sequence of discrete labels (POS tags), this assumption doesn't seem like it needs to remain accurate or reasonable. Part of speech labels have an assigned meaning that is not really induced by the context of being surrounded by other part of speech labels, so it's unlikely that standard learning tasks which are used to produce word embeddings would work when treating POS labels as if they were words from a much smaller vocabulary.
What is the overall sentence analysis task in your situation?
Added after question was updated with the learning task at hand.
Let's assume you can create POS input vectors for each sentence example. If there are N different POS labels possible, it means your input will consist of one vector from word embeddings and another vector of length N, where the value in component i represents the number of terms in the input sentence that possess POS label P_i.
For example, let's pretend the only POS labels possible are 'article', 'noun' and 'verb', and you have a sentence with ['article', 'noun', 'verb', 'noun']. Then this transforms into [1, 2, 1], and probably you want to normalize it by the length of the sentence. Let's call this input pos1 for sentence number 1 and pos2 for sentence number 2.
Let's call the word embedding vector input for sentence 1 as sentence1. sentence1 will be calculated by looking up each word embedding from a separate source, like a pretrained word2vec model or fastText or GloVe, and summing them up (using continuous bag of words). And the same for sentence2.
It's assumed that your batches of training data would already be processed into these vector formats, so a given single input would be a 4-tuple of vectors: the looked up CBOW embedding vector for sentence 1, same for sentence 2, and the calculated discrete representation vector for POS labels of sentence 1, and same for sentence 2.
A model that could work from this data might be like this:
from keras.engine.topology import Input
from keras.layers import Concatenate
from keras.layers.core import Activation, Dense
from keras.models import Model
sentence1 = Input(shape=word_embedding_shape)
sentence2 = Input(shape=word_embedding_shape)
pos1 = Input(shape=pos_vector_shape)
pos2 = Input(shape=pos_vector_shape)
# Note: just choosing 128 as an embedding space dimension or intermediate
# layer size... in your real case, you'd choose these shape params
# based on what you want to model or experiment with. They don't mean
# anything here.
sentence1_branch = Dense(128)(sentence1)
sentence1_branch = Activation('relu')(sentence1_branch)
# ... do whatever other sentence1-only stuff
sentence2_branch = Dense(128)(sentence2)
sentence2_branch = Activation('relu')(sentence2_branch)
# ... do whatever other sentence2-only stuff
pos1_embedding = Dense(128)(pos1)
pos1_branch = Activation('relu')(pos1_embedding)
# ... do whatever other pos1-only stuff
pos2_embedding = Dense(128)(pos2)
pos2_branch = Activation('relu')(pos2_embedding)
# ... do whatever other pos2-only stuff
unified = Concatenate([sentence1_branch, sentence2_branch,
pos1_branch, pos2_branch])
# ... do dense layers, whatever, to the concatenated intermediate
# representations
# finally boil it down to whatever final prediction task you are using,
# whether it is predicting a sentence similarity score (Dense(1)),
# or predicting a binary label that indicates whether the sentence
# pairs are similar or not (Dense(2) then followed by softmax activation,
# or Dense(1) followed by some type of probability activation like sigmoid).
# Assume your data is binary labeled for similar sentences...
unified = Activation('softmax')(Dense(2)(unified))
unified.compile(loss='binary_crossentropy', other parameters)
# Do training to learn the weights...
# A separate model that will just produce the embedding output
# from a POS input vector, relying on weights learned from the
# training process.
pos_embedding_model = Model(inputs=[pos1], outputs=[pos1_embedding])

What should be the word vectors of token <pad>, <unknown>, <go>, <EOS> before sent into RNN?

In word embedding, what should be a good vector representation for the start_tokens _PAD, _UNKNOWN, _GO, _EOS?
Spettekaka's answer works if you are updating your word embedding vectors as well.
Sometimes you will want to use pretrained word vectors that you can't update, though. In this case, you can add a new dimension to your word vectors for each token you want to add and set the vector for each token to 1 in the new dimension and 0 for every other dimension. That way, you won't run into a situation where e.g. "EOS" is closer to the vector embedding of "start" than it is to the vector embedding of "end".
Example for clarification:
# assume_vector embeddings is a dictionary and word embeddings are 3-d before adding tokens
# e.g. vector_embedding['NLP'] = np.array([0.2, 0.3, 0.4])
vector_embedding['<EOS>'] = np.array([0,0,0,1])
vector_embedding['<PAD>'] = np.array([0,0,0,0,1])
new_vector_length = vector_embedding['<pad>'].shape[0] # length of longest vector
for key, word_vector in vector_embedding.items():
zero_append_length = new_vector_length - word_vector.shape[0]
vector_embedding[key] = np.append(word_vector, np.zeros(zero_append_length))
Now your dictionary of word embeddings contains 2 new dimensions for your tokens and all of your words have been updated.
As far as I understand you can represent these tokens by any vector.
Here's why:
Inputting a sequence of words to your model, you first convert each word to an ID and then look in your embedding-matrix which vector corresponds to that ID. With that vector, you train your model. But the embedding-matrix just contains also trainable weights which will be adjusted during training. The vector-representations from your pretrained vectors just serve as a good point to start to yield good results.
Thus, it doesn't matter that much what your special tokens are represented by in the beginning as their representation will change during training.

Resources