Bert sentence embeddings - nlp

I'm trying to obtain sentence embeddings from BERT, but I'm not sure I'm doing it properly. Yes, I'm aware that tools such as bert-as-service already exist, but I want to do it myself and understand how it works.
Let's say I want to build a sentence embedding from the word embeddings of the sentence "I am.". As I understand it, BERT's output has the shape (12, seq_length, 768). I extracted each word embedding from the last encoder layer, each with shape (1, 768). My doubt is how to get the sentence vector from these two word vectors. If I have (2, 768), should I sum across the two words and obtain a vector of shape (1, 768)? Or should I concatenate the two words into (1, 1536) and apply (mean) pooling to get a sentence vector of shape (1, 768)? I'm not sure what the right approach is for this example.

As far as I know, BERT has this comment in its source code:
For classification tasks, the first vector (corresponding to [CLS]) is used as the "sentence vector." Note that this only makes sense because the entire model is fine-tuned.
So BERT provides the [CLS] vector as the sentence embedding directly, without any combination or further processing of the word vectors in the sentence.
Hope it helps.
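To make the two options concrete, here is a minimal sketch comparing the [CLS] vector with mean pooling over the token vectors. The numbers are made-up stand-ins for BERT's last-layer output (3 dimensions instead of 768), and the helper names `cls_vector` and `mean_pool` are hypothetical, not part of any library.

```python
# Toy stand-in for BERT's last-layer output for "[CLS] i am . [SEP]".
# Real vectors have 768 dimensions; 3 are used here to keep it readable.
last_hidden = [
    [0.9, 0.1, 0.0],  # [CLS]
    [0.2, 0.4, 0.6],  # i
    [0.4, 0.2, 0.0],  # am
    [0.0, 0.6, 0.3],  # .
    [0.1, 0.1, 0.1],  # [SEP]
]


def cls_vector(hidden):
    """Option 1: use the first position ([CLS]) as the sentence vector."""
    return list(hidden[0])


def mean_pool(hidden, skip_special=True):
    """Option 2: average the token vectors element-wise.

    With skip_special=True the first and last rows ([CLS] and [SEP])
    are left out of the average.
    """
    rows = hidden[1:-1] if skip_special else hidden
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


sentence_cls = cls_vector(last_hidden)
sentence_mean = mean_pool(last_hidden)
print(sentence_cls)
print(sentence_mean)
```

Either way the result has the same size as a single token vector, so there is no need to concatenate first. With real tensors you would do the same thing with a mean over the sequence axis.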

Related

Do BERT word embeddings change depending on context?

Before answering "yes, of course", let me clarify what I mean:
After BERT has been trained, and I want to use the pretrained embeddings for some other NLP task, can I once-off extract all the word-level embeddings from BERT for all the words in my dictionary, and then have a set of static key-value word-embedding pairs, from where I retrieve the embedding for let's say "bank", or will the embeddings for "bank" change depending on whether the sentence is "Trees grow on the river bank", or "I deposited money at the bank" ?
And if the latter is the case, how do I practically use the BERT embeddings for another NLP task, do I need to run every input sentence through BERT before passing it into my own model?
Essentially - do embeddings stay the same for each word / token after the model has been trained, or are they dynamically adjusted by the model weights, based on the context?
This is a great question (I had the same question but you asking it made me experiment a bit).
The answer is yes, it changes based on the context. You should not extract the embeddings and re-use them (at least for most of the problems).
I'm checking the embedding for the word "bank" in two cases: (1) when it appears on its own and (2) when it appears in context ("river bank"). The embeddings I get are different from each other (they have a cosine distance of ~0.4).
import numpy as np
import tensorflow as tf
from transformers import TFBertModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
print('bank is the second word in tokenization (index=1):', tokenizer.decode([i for i in tokenizer.encode('bank')]))
print('bank is the third word in tokenization (index=2):', tokenizer.decode([i for i in tokenizer.encode('river bank')]))
###output: bank is the second word in tokenization (index=1): [CLS] bank [SEP]
###output: bank is the third word in tokenization (index=2): [CLS] river bank [SEP]
bank_bank = model(tf.constant(tokenizer.encode('bank'))[None,:])[0][0,1,:] #use the index based on the tokenizer output above
river_bank_bank = model(tf.constant(tokenizer.encode('river bank'))[None,:])[0][0,2,:] #use the index based on the tokenizer output above
are_equal = np.allclose(bank_bank, river_bank_bank)
print(are_equal)
### output: False
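The snippet above only checks that the two vectors are not equal. To get a number like the ~0.4 cosine distance mentioned earlier, something along these lines could be used. This is a plain-Python sketch with toy vectors, and `cosine_distance` is a hypothetical helper name, not a library function.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors: same direction gives distance 0, orthogonal gives distance 1.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Assuming the tensors from the snippet above are eager TensorFlow tensors, they could be passed in as `cosine_distance(bank_bank.numpy().tolist(), river_bank_bank.numpy().tolist())`.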

How to get the word on which the text classification has been made?

I am doing a multi-label text classification using a pre-trained model of BERT. Here is an example of the prediction that has been made for one sentence-
pred_image
I want to get those words from the sentence on which the prediction has been made. Like this one - right_one
If anyone has any idea, please enlighten me.
Multi-Label Text Classification (first image) and Token Classification (second image) are two different tasks, and the model needs to be specifically trained for each of them.
The first one returns a probability for each label considering the entire sentence. The second returns such predictions for each single word in the sentence while usually considering the rest of the sentence as context.
So you can not really use the output from a Text Classifier and use it for Token Classification because the information you get is not detailed enough.
What you can and should do is train a Token Classification model, although you obviously will need token-level-annotated data to do so.
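The difference in output granularity can be illustrated with a toy sketch. The label names, tag names and logits below are invented for illustration and do not come from any trained model.

```python
import math


def sigmoid(x):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Multi-label text classification: ONE logit per label for the whole sentence.
labels = ["joy", "anger", "sadness"]   # hypothetical label set
sentence_logits = [2.0, -1.0, 0.0]     # made-up model output
sentence_probs = {lab: sigmoid(z) for lab, z in zip(labels, sentence_logits)}

# Token classification: one ROW of logits per token.
tokens = ["i", "feel", "wonderful"]
tags = ["O", "EMOTION"]                # hypothetical tag set
token_logits = [[1.5, -0.5], [0.2, 0.1], [-1.0, 2.0]]  # made-up model output
token_preds = [tags[argmax(row)] for row in token_logits]

print(sentence_probs)
print(list(zip(tokens, token_preds)))
```

The sentence-level probabilities carry no information about which token caused them, which is why the per-token view needs a model trained on token-level labels.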

what is the difference between pooled output and sequence output in bert layer?

Hi everyone! I was reading about BERT and wanted to do text classification with its word embeddings. I came across this line of code:
pooled_output, sequence_output = self.bert_layer([input_word_ids, input_mask, segment_ids])
and then:
clf_output = sequence_output[:, 0, :]
out = Dense(1, activation='sigmoid')(clf_output)
But I can't understand the use of pooled output. Doesn't sequence output contain all the information, including the embedding of the [CLS] token? If so, why do we have pooled output?
Thanks in advance!
Sequence output is the sequence of hidden states (embeddings) at the output of the last layer of the BERT model. It includes the embedding of the [CLS] token. Hence, for the sentence "You are on Stackoverflow", it gives one embedding for each of the four words (assuming the word "Stackoverflow" was tokenized into a single token), one for the [CLS] token, and one for the [SEP] token that the tokenizer appends, so 6 embeddings in total.
Pooled output is the embedding of the [CLS] token (from Sequence output), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. For further details, please refer to the BERT original paper.
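As a rough sketch of that relationship, the pooled output is tanh(W · h_cls + b), where h_cls is the first row of the sequence output. The tiny 2-dimensional vectors and the identity weight matrix below are placeholders for the real 768-dimensional values and learned weights, and `pooled_output` is a hypothetical helper name.

```python
import math


def pooled_output(cls_hidden, weight, bias):
    """Apply a dense layer followed by tanh to the [CLS] hidden state.

    weight is a list of rows, one row per output unit.
    """
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy sequence output: one row per token, [CLS] first.
sequence_output = [
    [0.5, -1.0],  # [CLS]
    [0.3, 0.3],   # some token
    [0.0, 0.2],   # [SEP]
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for the learned weights
bias = [0.0, 0.0]                  # placeholder for the learned bias

clf_output = sequence_output[0]    # same as sequence_output[:, 0, :] per example
pooled = pooled_output(clf_output, weight, bias)
print(clf_output, pooled)
```

So `sequence_output[:, 0, :]` is the raw [CLS] hidden state, while the pooled output is that same vector after one more learned transformation.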
Say you are given the sequence "You are on StackOverflow". The sequence_output will give one 768-dimensional embedding for each of these four words. The pooled output, on the other hand, will give you just one 768-dimensional embedding that summarizes the whole sequence.
As a number of other answers have pointed out, sequence_output is token-level with 2 dimensions - the first dimension corresponds to the number of tokens in the input text.
pooled_output is one-dimensional and seems to be some sort of a higher-order context embedding for the input text.
I initially felt they should contain practically the same information (or that sequence_output should contain more given the additional n_token dimension), but right now I'm training a semantic similarity model based on Bert and am seeing definitively better results using both sequence_output and pooled_output in the model, compared to using just sequence_output.

Get the attention vector from the last layers of BERT

Is there any way to get the attention vector with normalized values (0-1) from the last layer of BERT? I'm interested in getting the attention value that BERT assigns to each word in a sentence.
I'm working on emotion classification. I want to extract the relevant words associated with emotions. For example:
I feel wonderful today.
The words feel and wonderful are the more relevant words in the sentence for the classifier, so I want to get the attention scores that BERT assigns to each of them.
Thanks in advance
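One possible direction, sketched under assumptions: in Hugging Face transformers, calling the model with `output_attentions=True` should return one attention tensor per layer with shape (batch, num_heads, seq_len, seq_len), where each row is already softmax-normalized to sum to 1. The plain-Python sketch below uses a made-up 2-head, 3-token attention matrix in place of the real last-layer tensor, and `cls_attention` is a hypothetical helper name.

```python
def cls_attention(last_layer_attention):
    """Average over heads and return the attention from [CLS] to each token.

    last_layer_attention has shape [num_heads][seq_len][seq_len] and each
    row is assumed to be softmax-normalized already.
    """
    num_heads = len(last_layer_attention)
    seq_len = len(last_layer_attention[0][0])
    return [
        sum(head[0][j] for head in last_layer_attention) / num_heads
        for j in range(seq_len)
    ]


# Made-up attention weights for 2 heads over 3 positions (position 0 is [CLS]).
attention = [
    [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]],
    [[0.3, 0.2, 0.5], [0.4, 0.4, 0.2], [0.2, 0.2, 0.6]],
]
scores = cls_attention(attention)
print(scores)
```

The scores stay in the 0-1 range and sum to 1 because they are an average of normalized rows. Whether they reflect what the classifier actually relies on is a separate question, so treat them as a rough indication.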

What are some common ways to get sentence vector from corresponding word vectors?

I have successfully implemented a word2vec model to generate word embeddings (word vectors). Now I need to generate sentence vectors from these word vectors so that I can feed a neural network to summarize a text corpus. What are the common approaches to generate sentence vectors from word vectors?
You can try adding an LSTM/RNN encoder before your actual neural network and feeding the network the hidden states of the encoder (which would act as document representations).
The benefit of doing this is that your document embeddings will be trained for your specific task of text summarization.
I don't know what framework you are using; otherwise I would have helped you with some code to get you started.
EDIT 1: Add code snippet
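# Assumes the Keras API bundled with TensorFlow; replace the quoted placeholders with integers.
from tensorflow.keras.layers import Input, Embedding, LSTM
word_in = Input(shape=("<MAX SEQ LEN>",))
emb_word = Embedding(input_dim="<vocab size>", output_dim="<embd_dim>", input_length="<MAX SEQ LEN>", mask_zero=True)(word_in)
# return_sequences=False keeps only the final hidden state, one vector per document
lstm = LSTM(units="<size>", return_sequences=False,
            recurrent_dropout=0.5, name="lstm_1")(emb_word)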
Add any type of dense layer which takes vectors as inputs.
LSTM takes input of shape batch_size * sequence_length * word_vector_dimension and produces output of shape batch_size * rnn_size; which you can use as document embeddings.
Sentence representations can simply be the column-wise mean of all the word vectors in your sentence. There are also implementations of related ideas, such as doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html), where a document is just a collection of words such as a sentence or paragraph.
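A minimal sketch of the column-wise mean, assuming the word vectors are available as a plain dictionary from word to vector. The 2-dimensional vectors are toy values, `sentence_vector` is a hypothetical helper name, and skipping out-of-vocabulary words is a design choice for this sketch rather than the only option.

```python
def sentence_vector(sentence, word_vectors):
    """Average the vectors of the in-vocabulary words of a sentence.

    Returns None when none of the words has a vector.
    """
    known = [
        word_vectors[word]
        for word in sentence.lower().split()
        if word in word_vectors
    ]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy lookup table standing in for a trained word2vec model.
word_vectors = {"cats": [1.0, 0.0], "sleep": [0.0, 1.0]}
vec = sentence_vector("Cats sleep soundly", word_vectors)
print(vec)
```

Here "soundly" has no vector and is ignored, so the result is the mean of the two known words.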
