BERT sentence embeddings: how to obtain sentence embeddings vector - keras

I'm using the module bert-for-tf2 in order to wrap BERT model as Keras layer in Tensorflow 2.0 I've followed your guide for implementing BERT model as Keras layer.
I'm trying to extract embeddings from a sentence; in my case, the sentence is "Hello"
I have a question about the output of the model prediction; I've written this model:
model_word_embedding = tf.keras.Sequential([
tf.keras.layers.Input(shape=(4,), dtype='int32', name='input_ids'),
bert_layer
])
model_word_embedding .build(input_shape=(None, 4))
Then I want to extract the embeddings for the sentence written above:
sentences = ["Hello"]
predict = model_word_embedding .predict(sentences)
the object predict contains 4 arrays of 768 elements each:
print(predict)
print(len(predict))
print(len(predict[0][0]))
...
[[[-0.02768866 -0.7341324 1.9084396 ... -0.65953904 0.26496622
1.1610721 ]
[-0.19322394 -1.3134469 0.10383344 ... 1.1250225 -0.2988368
-0.2323082 ]
[-1.4576151 -1.4579685 0.78580517 ... -0.8898649 -1.1016986
0.6008501 ]
[ 1.41647 -0.92478925 -1.3651332 ... -0.9197768 -1.5469263
0.03305872]]]
4
768
I know that each array of that 4 represents my original sentence, but I want to obtain one array as the embeddings of my original sentence.
So, my question is: How can I obtain the embeddings for a sentence?
In BERT source code I read this:
For classification tasks, the first vector (corresponding to [CLS]) is used as the "sentence vector." Note that this only makes sense because the entire model is fine-tuned.
So I have to take the first array from the prediction output since it represents my sentence vector?
Thank you for your support

We should use [CLS] from the last hidden states as the sentence embeddings from BERT. According to the BERT paper [CLS] represent the encoded sentence of dimension 768. Following figure represents the use of [CLS] in more details. considering you have 2000 sentences.
#input_ids consist of all sentences padded to max_len.
last_hidden_states = model(input_ids)
features = last_hidden_states[0][:,0,:].numpy() # considering o only the [CLS] for each sentences
features.shape
# (2000, 768) dimension

Related

Understanding get_sentence_vector() and get_word_vector() for fasttext

The thing I want to do is to get embeddings of a pair of words or phrases and calculate similarity.
I observed that the similarity is the same when I switch between get_sentence_vector() and get_word_vector() for a word. For example, I can switch the method when calculating embedding_2 or embedding_3, but embedding_2 and embedding_3 are not euqal, which is weird:
from scipy.spatial.distance import cosine
import numpy as np
import fasttext
import fasttext.util
# download an english model
fasttext.util.download_model('en', if_exists='ignore') # English
model = fasttext.load_model('cc.en.300.bin')
# Getting word vectors for 'one' and 'two'.
embedding_1 = model.get_sentence_vector('baby dog')
embedding_2 = model.get_word_vector('puppy')
embedding_3 = model.get_sentence_vector('puppy')
def cosine_similarity(embedding_1, embedding_2):
# Calculate the cosine similarity of the two embeddings.
sim = 1 - cosine(embedding_1, embedding_2)
print('Cosine similarity: {:.2}'.format(sim))
# compare the embeddings
cosine_similarity(embedding_1, embedding_2)
# compare the embeddings
cosine_similarity(embedding_1, embedding_3)
# Checking if the two approaches yield the same result.
is_equal = np.array_equal(embedding_2, embedding_3)
# Printing the result.
print(is_equal)
If I switch methods, similarity is always 0.76 but is_equal is false. I have two questions:
(1) I probably have to use get_sentence_vector() for phrases, but in terms of words, which one should I use? What happens when I call get_sentence_vector() for a word?
(2) I use fasttext because it can handle out of vocabulary, is it a good idea to use fasttext's embedding for cosine similarity comparison?
You should use get_word_vector for words and get_sentence_vector for sentences.
get_sentence_vector divides each word vector by its norm and then average them. If you are interested in more details, read this.
Since fastText provides vector representations, it is a good idea to use this vectors in order to compare words and sentences.

Token indices sequence length error when using encode_plus method

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library.
I am using data from this Kaggle competition. Given a question title, question body and answer, the model must predict 30 values (regression problem). My goal is to get the following encoding as input to BERT:
[CLS] question_title question_body [SEP] answer [SEP]
However, when I try to use
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
and encode only the second input from train.csv as follows:
inputs = tokenizer.encode_plus(
df_train["question_title"].values[1] + " " + df_train["question_body"].values[1], # first sequence to be encoded
df_train["answer"].values[1], # second sequence to be encoded
add_special_tokens=True, # [CLS] and 2x [SEP]
max_len = 512,
pad_to_max_length=True
)
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (46 > 512). Running this sequence through the model will result in indexing errors
It says that the length of the token indices is longer than the specified maximum sequence length, but this is not true (as you can see, 46 is not > 512).
This happens for several of the rows in df_train. Am I doing something wrong here?
The model 'bert-base-uncased' is not pre-trained to handle the long texts of [CLS] + Question + [SEP] + Context + [SEP]. Any other model from Huggingface models dedicated especially for the squad question-answer datasets would handle the long sequence.
For example if I am using the ALBERT model, I would go for 'ktrapeznikov/albert-xlarge-v2-squad-v2' model.

Can anybody help me implement hybrid embedding model using keras library as shown in the attached figure

I am confused that how to process character level information in RNN using keras. I want to implement something like this model structure.
I have 1800 sentences; each sentence length(time_stamp) 150 and each word length has 16 characters. Gensim model helps me to create word embedding of the size of 100. Unique characters in sentences are 69, for each character represented by one hot encoding is 70.
shape for word-level bi-lstm input is: sentences X time_stamp X embedding_size (1800 x 150 x 100)
I know that how to feed this into keras layer but I am confused with character level feeding. the shape for char-level is: sentences X time_stamp X characters X char_embedding (1800 x 150 x 16 x 70).
I am a beginner in for keras.
What you have in the image is a simple bi-directional LSTM:
model = Sequential()
model.add(Bidrectional(LSTM(128), input_shape=(maxlen, len(chars))))
# Bidrectional concatenates the output of both directions by default.
For a more full example you can check the text generation example which uses character level processing.

How to generate GloVe embeddings for POS tags? Python

For a sentence analysis task, I would like to take the sequence of POS tags associated with the sentence and feed it to my model as if the POS tags are words.
I am using GloVe to make representations of each word in the sentence and SpaCy to generate POS tags. However, GloVe embeddings do not make much sense for POS tags. So I will have to somehow create embeddings for each POS tag. What is the best way to do create embeddings for POS tags, so that I can feed POS sequences into my model in the same way I would feed sentences? Could anyone point to code examples of how to do this with GloVe in Python?
Added context
My task is a binary classification of sentence pairs, based on their resemblance (similar meaning vs different meaning).
I would like to use POS tags as words, so that the POS tags serve as an additional bit of information to compare the sentences. My current model does not use an LSTM as a way to predict sequences.
Most word embedding models still rely on an underlying assumption that the meaning of a word is induced by its usage context. For example, learning a word2vec embedding with skipgram or continuous bag of words formulations implicitly assumes a model in which the representation vector of the word is based on the context words that co-occur with the target word, specifically by learning to create embeddings that best solve the classification task of distinguishing pairs of words that contextually co-occur from random pairs of words (so-called negative sampling).
But if the input is changed to be a sequence of discrete labels (POS tags), this assumption doesn't seem like it needs to remain accurate or reasonable. Part of speech labels have an assigned meaning that is not really induced by the context of being surrounded by other part of speech labels, so it's unlikely that standard learning tasks which are used to produce word embeddings would work when treating POS labels as if they were words from a much smaller vocabulary.
What is the overall sentence analysis task in your situation?
Added after question was updated with the learning task at hand.
Let's assume you can create POS input vectors for each sentence example. If there are N different POS labels possible, it means your input will consist of one vector from word embeddings and another vector of length N, where the value in component i represents the number of terms in the input sentence that possess POS label P_i.
For example, let's pretend the only POS labels possible are 'article', 'noun' and 'verb', and you have a sentence with ['article', 'noun', 'verb', 'noun']. Then this transforms into [1, 2, 1], and probably you want to normalize it by the length of the sentence. Let's call this input pos1 for sentence number 1 and pos2 for sentence number 2.
Let's call the word embedding vector input for sentence 1 as sentence1. sentence1 will be calculated by looking up each word embedding from a separate source, like a pretrained word2vec model or fastText or GloVe, and summing them up (using continuous bag of words). And the same for sentence2.
It's assumed that your batches of training data would already be processed into these vector formats, so a given single input would be a 4-tuple of vectors: the looked up CBOW embedding vector for sentence 1, same for sentence 2, and the calculated discrete representation vector for POS labels of sentence 1, and same for sentence 2.
A model that could work from this data might be like this:
from keras.engine.topology import Input
from keras.layers import Concatenate
from keras.layers.core import Activation, Dense
from keras.models import Model
sentence1 = Input(shape=word_embedding_shape)
sentence2 = Input(shape=word_embedding_shape)
pos1 = Input(shape=pos_vector_shape)
pos2 = Input(shape=pos_vector_shape)
# Note: just choosing 128 as an embedding space dimension or intermediate
# layer size... in your real case, you'd choose these shape params
# based on what you want to model or experiment with. They don't mean
# anything here.
sentence1_branch = Dense(128)(sentence1)
sentence1_branch = Activation('relu')(sentence1_branch)
# ... do whatever other sentence1-only stuff
sentence2_branch = Dense(128)(sentence2)
sentence2_branch = Activation('relu')(sentence2_branch)
# ... do whatever other sentence2-only stuff
pos1_embedding = Dense(128)(pos1)
pos1_branch = Activation('relu')(pos1_embedding)
# ... do whatever other pos1-only stuff
pos2_embedding = Dense(128)(pos2)
pos2_branch = Activation('relu')(pos2_embedding)
# ... do whatever other pos2-only stuff
unified = Concatenate([sentence1_branch, sentence2_branch,
pos1_branch, pos2_branch])
# ... do dense layers, whatever, to the concatenated intermediate
# representations
# finally boil it down to whatever final prediction task you are using,
# whether it is predicting a sentence similarity score (Dense(1)),
# or predicting a binary label that indicates whether the sentence
# pairs are similar or not (Dense(2) then followed by softmax activation,
# or Dense(1) followed by some type of probability activation like sigmoid).
# Assume your data is binary labeled for similar sentences...
unified = Activation('softmax')(Dense(2)(unified))
unified.compile(loss='binary_crossentropy', other parameters)
# Do training to learn the weights...
# A separate model that will just produce the embedding output
# from a POS input vector, relying on weights learned from the
# training process.
pos_embedding_model = Model(inputs=[pos1], outputs=[pos1_embedding])

What should be the word vectors of token <pad>, <unknown>, <go>, <EOS> before sent into RNN?

In word embedding, what should be a good vector representation for the start_tokens _PAD, _UNKNOWN, _GO, _EOS?
Spettekaka's answer works if you are updating your word embedding vectors as well.
Sometimes you will want to use pretrained word vectors that you can't update, though. In this case, you can add a new dimension to your word vectors for each token you want to add and set the vector for each token to 1 in the new dimension and 0 for every other dimension. That way, you won't run into a situation where e.g. "EOS" is closer to the vector embedding of "start" than it is to the vector embedding of "end".
Example for clarification:
# assume_vector embeddings is a dictionary and word embeddings are 3-d before adding tokens
# e.g. vector_embedding['NLP'] = np.array([0.2, 0.3, 0.4])
vector_embedding['<EOS>'] = np.array([0,0,0,1])
vector_embedding['<PAD>'] = np.array([0,0,0,0,1])
new_vector_length = vector_embedding['<pad>'].shape[0] # length of longest vector
for key, word_vector in vector_embedding.items():
zero_append_length = new_vector_length - word_vector.shape[0]
vector_embedding[key] = np.append(word_vector, np.zeros(zero_append_length))
Now your dictionary of word embeddings contains 2 new dimensions for your tokens and all of your words have been updated.
As far as I understand you can represent these tokens by any vector.
Here's why:
Inputting a sequence of words to your model, you first convert each word to an ID and then look in your embedding-matrix which vector corresponds to that ID. With that vector, you train your model. But the embedding-matrix just contains also trainable weights which will be adjusted during training. The vector-representations from your pretrained vectors just serve as a good point to start to yield good results.
Thus, it doesn't matter that much what your special tokens are represented by in the beginning as their representation will change during training.

Resources