Before answering "yes, of course", let me clarify what I mean:
Once BERT has been trained and I want to use the pretrained embeddings for some other NLP task, can I extract all the word-level embeddings from BERT once, for all the words in my dictionary, and then keep a static set of key-value word-embedding pairs from which I retrieve the embedding for, say, "bank"? Or will the embedding for "bank" change depending on whether the sentence is "Trees grow on the river bank" or "I deposited money at the bank"?
And if the latter is the case, how do I practically use the BERT embeddings for another NLP task? Do I need to run every input sentence through BERT before passing it into my own model?
Essentially: do embeddings stay the same for each word/token after the model has been trained, or are they dynamically adjusted by the model weights based on the context?
This is a great question (I had the same question but you asking it made me experiment a bit).
The answer is yes: the embeddings change based on the context, so you should not extract them once and re-use them (at least not for most problems).
I'm checking the embedding for the word "bank" in two cases: (1) when it appears on its own and (2) when it appears with context ("river bank"). The embeddings I get are different from each other (they have a cosine distance of ~0.4).
import numpy as np
import tensorflow as tf
from transformers import TFBertModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
print('bank is the second word in tokenization (index=1):', tokenizer.decode([i for i in tokenizer.encode('bank')]))
print('bank is the third word in tokenization (index=2):', tokenizer.decode([i for i in tokenizer.encode('river bank')]))
###output: bank is the second word in tokenization (index=1): [CLS] bank [SEP]
###output: bank is the third word in tokenization (index=2): [CLS] river bank [SEP]
bank_bank = model(tf.constant(tokenizer.encode('bank'))[None,:])[0][0,1,:] #use the index based on the tokenizer output above
river_bank_bank = model(tf.constant(tokenizer.encode('river bank'))[None,:])[0][0,2,:] #use the index based on the tokenizer output above
are_equal = np.allclose(bank_bank, river_bank_bank)
print(are_equal)
### output: False
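For reference, the ~0.4 cosine distance mentioned above can be reproduced from the two vectors like this (a minimal sketch; the exact value depends on the model weights and library version):
from scipy.spatial.distance import cosine
# cosine() returns the cosine distance (1 - cosine similarity) between the two
# context-dependent vectors for "bank" computed above
print(cosine(bank_bank.numpy(), river_bank_bank.numpy()))
### output: approximately 0.4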
Related
I am trying to do a sentence classification task with BERT. Each sentence comes from a certain document. What is the best way to concatenate the document-level features?
If I were using another classifier without an input-length limitation, such as logistic regression, I could, for example, simply concatenate the tf-idf vector of the document after the tf-idf vector of the sentence.
My current solution is concatenating the sentence and the whole document with [SEP] in the inputs of the dataset:
text = ' '.join([sentence, self.tokenizer.sep_token, document])
inputs = self.tokenizer.encode_plus(text.lower(),
                                    truncation=True,
                                    padding='max_length',
                                    add_special_tokens=True,
                                    return_attention_mask=True,
                                    return_token_type_ids=True,
                                    max_length=self.max_len,
                                    return_tensors='pt')
But I am not sure whether this is comparable with the tf-idf example above. The words in the document that exceed the maximum length of 512 tokens will be clipped, right? To stay comparable with my other experiment settings, switching to a model like Longformer is not an option. Is there a way to concatenate the complete document-level features?
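For reference, a minimal sketch of the logistic-regression baseline mentioned above (concatenating the sentence's tf-idf vector with the document's tf-idf vector; sentences, documents and labels are illustrative names for parallel lists):
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# fit the vectorizer on the documents, then encode sentences and documents separately
vectorizer = TfidfVectorizer()
vectorizer.fit(documents)
# concatenate the sentence tf-idf vector with the document tf-idf vector
X = hstack([vectorizer.transform(sentences), vectorizer.transform(documents)])
clf = LogisticRegression(max_iter=1000).fit(X, labels)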
What would be the best huggingface model to fine-tune for this type of task:
Example input 1:
If there's one person you don't want to interrupt in the middle of a sentence it's a judge.
Example output 1:
sentence
Example input 2:
A good baker will rise to the occasion, it's the yeast he can do.
Example output 2:
yeast
Architecture
This looks like a Question Answering type of task, where the input is a sentence and the output is a span from the input sentence.
In transformers this corresponds to the AutoModelForQuestionAnswering class.
See the illustration of the question-answering setup in the original BERT paper, where the input is a Question and a Paragraph separated by a [SEP] token.
The only difference you have is that the input will be composed of the "question" only.
In other words, you won't have a Question, a [SEP] token, and a Paragraph, as shown in that figure.
Without knowing too much about your task, you might want to model this as a Token Classification type of task instead.
Here, the tokens of the answer span would be labelled with some positive tag and the rest of the words with some negative tag.
If this makes more sense for your use case, have a look at the AutoModelForTokenClassification class.
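As a rough sketch of this token-classification alternative (using bert-base-uncased as an example checkpoint; the two labels, answer-span vs. not, are an assumption about how you would encode your data):
from transformers import AutoModelForTokenClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# label 1 = token belongs to the answer span, label 0 = it does not
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)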
I will base the rest of my discussion on question-answering, but these concepts can be easily adapted.
Model
Since it seems that you're dealing with English sentences, you can probably use a pre-trained model such as bert-base-uncased.
Depending on the data distribution, your choice of language model can change.
It's not clear exactly what your task is, but unless there is already a fine-tuned model available that does it (you can try searching the HuggingFace model hub), you're going to have to fine-tune your own model.
To do so you need to have a dataset composed of sentences labelled with start & end indices corresponding to the answer span.
See the documentation for more information on how to train.
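If your labels are character offsets into the sentence, a fast tokenizer's offset mapping can be used to derive the token-level start & end indices the model expects. A rough sketch (the character-offset variables are illustrative):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "A good baker will rise to the occasion, it's the yeast he can do."
answer = "yeast"
answer_char_start = sentence.index(answer)
answer_char_end = answer_char_start + len(answer)
# map character offsets to token indices via the fast tokenizer's offset mapping
enc = tokenizer(sentence, return_offsets_mapping=True)
start_index = end_index = None
for i, (start, end) in enumerate(enc["offset_mapping"]):
    if start <= answer_char_start < end:
        start_index = i
    if start < answer_char_end <= end:
        end_index = i
# start_index and end_index are the labels used for fine-tuning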
Evaluation
Once you have a fine-tuned model you just need to run your test sentences through the model to extract answers.
The following code, adapted from the HuggingFace documentation, does that:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch
name = "path/to/your-fine-tuned-model"  # placeholder: path or Hub id of the QA model fine-tuned above
model = AutoModelForQuestionAnswering.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
input = "A good baker will rise to the occasion, it's the yeast he can do."
inputs = tokenizer(input, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[start_index:end_index])
)  # "yeast", hopefully!
I am working on text generation using a seq2seq model where GloVe embeddings are being used. I want to use a custom Word2Vec (CBOW/Gensim) embedding in this code. Can anyone please help me use my custom embedding instead of GloVe?
def initialize_embeddings(self):
    """Reads the GloVe word embeddings and creates the embedding matrix and the word-to-index and index-to-word mappings."""
    # load the word embeddings
    self.word2vec = {}
    with open(glove_path % self.EMBEDDING_DIM, 'r') as file:
        for line in file:
            vectors = line.split()
            self.word2vec[vectors[0]] = np.asarray(vectors[1:], dtype="float32")

    # build the embeddings matrix
    self.num_words = min(self.MAX_VOCAB_SIZE, len(self.word2idx) + 1)
    self.embeddings_matrix = np.zeros((self.num_words, self.EMBEDDING_DIM))
    for word, idx in self.word2idx.items():
        if idx < self.num_words:  # note: idx <= self.num_words would index out of bounds
            word_embeddings = self.word2vec.get(word)
            if word_embeddings is not None:
                self.embeddings_matrix[idx] = word_embeddings
    self.idx2word = {v: k for k, v in self.word2idx.items()}
This code reads the GloVe embeddings into a word2vec-style dictionary. I want to load my own Word2Vec embedding instead.
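One possible way (a sketch, assuming a Gensim Word2Vec model saved at w2v_path with the same dimensionality as self.EMBEDDING_DIM) is to replace the GloVe file-reading loop with Gensim's own lookup:
import numpy as np
from gensim.models import Word2Vec

def initialize_embeddings(self):
    """Builds the embedding matrix from a custom Gensim Word2Vec (CBOW/Skip-Gram) model."""
    w2v_model = Word2Vec.load(w2v_path)  # or KeyedVectors.load_word2vec_format(...) for text/binary formats
    self.word2vec = w2v_model.wv          # keyed word -> vector lookup

    # build the embeddings matrix exactly as before
    self.num_words = min(self.MAX_VOCAB_SIZE, len(self.word2idx) + 1)
    self.embeddings_matrix = np.zeros((self.num_words, self.EMBEDDING_DIM))
    for word, idx in self.word2idx.items():
        if idx < self.num_words and word in self.word2vec:
            self.embeddings_matrix[idx] = self.word2vec[word]
    self.idx2word = {v: k for k, v in self.word2idx.items()}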
word2vec and GloVe are techniques for producing word embeddings, i.e., for turning text (a set of sentences) into computer-readable vectors.
While word2vec trains on the local context (neighboring words), GloVe looks at word co-occurrences over the whole text or corpus, so its approach is more global.
word2vec
There are two main approaches for word2vec; in both, the algorithm loops through the words of the sentence. For each current word w it will try to predict either:
the neighboring words from w and its context, which is the Skip-Gram approach, or
w from its context, which is the CBOW approach.
Hence, word2vec will produce a similar embedding for words with similar contexts, for instance a noun in singular and its plural, or two synonyms.
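For reference, both variants are available in Gensim via the sg flag (a minimal sketch; the toy corpus is made up):
from gensim.models import Word2Vec
corpus = [["trees", "grow", "on", "the", "river", "bank"],
          ["i", "deposited", "money", "at", "the", "bank"]]
cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)   # Skip-Gram
print(cbow_model.wv["bank"].shape)  # (50,)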
GloVe
The main intuition underlying the GloVe model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. In other words, the embeddings are based on the computation of distances between pairs of target words. The model computes the distance between two target words in a text by analyzing the co-occurrence of those two target words with some other probe words (contextual words).
https://nlp.stanford.edu/projects/glove/
For example, consider the co-occurrence probabilities for the target words "ice" and "steam" with various probe words from the vocabulary; the GloVe project page linked above shows actual probabilities from a 6-billion-word corpus.
As one might expect, "ice" co-occurs more frequently with "solid" than it does with "gas", whereas "steam" co-occurs more frequently with "gas" than it does with "solid". Both words co-occur with their shared property "water" frequently, and both co-occur with the unrelated word "fashion" infrequently. Only in the ratio of probabilities does noise from non-discriminative words like "water" and "fashion" cancel out, so that large values (much greater than 1) correlate well with properties specific to "ice", and small values (much less than 1) correlate well with properties specific to "steam". In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.
Also, GloVe is very good at word analogies and performs well on the word2vec analogy dataset.
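As a quick illustration of the analogy property (a sketch using gensim's downloader; the pre-trained GloVe vectors are a sizeable download):
import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-100")  # pre-trained GloVe word vectors
# king - man + woman should land near "queen"
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))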
When I tried to get word embeddings for a sentence using Bio_ClinicalBERT, for a sentence of 8 words I am getting 11 token ids (plus start and end tokens), because "embeddings" is an out-of-vocabulary word/token that is split into em, ##bed, ##ding, ##s.
I would like to know whether there are any aggregation strategies that make sense apart from taking the mean of these vectors.
import torch
from transformers import AutoTokenizer, AutoModel
# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
sentences = ['This framework generates embeddings for each input sentence']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
print(encoded_input['input_ids'].shape)
Output:
torch.Size([1, 13])
for token in encoded_input['input_ids'][0]:
    print(tokenizer.decode([token]))
Output:
[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]
To my knowledge, mean aggregation is the most commonly used tool here, and in fact there is even scientific literature empirically showing that it works well:
Generalizing Word Embeddings using Bag of Subwords by Zhao, Mudgal and Liang. Formula 1 there describes exactly what you are proposing.
The one alternative you could theoretically employ is something like a mean aggregate over the entire input, essentially making a "context prediction" over all words (potentially excluding "embeddings" itself), thereby emulating something similar to the [MASK]ing done during training of transformer models. But this is just a suggestion from me, without any scientific evidence to back up whether it works (for better or worse).
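Concretely, reusing encoded_input and model_output from the snippet above, mean aggregation over the four pieces of "embeddings" would look roughly like this (the slice 4:8 is specific to that particular tokenization):
# hidden states of all tokens: shape (1, 13, hidden_size)
token_vectors = model_output.last_hidden_state
# "embeddings" was split into em, ##bed, ##ding, ##s at positions 4-7
embeddings_vector = token_vectors[0, 4:8, :].mean(dim=0)
print(embeddings_vector.shape)  # torch.Size([hidden_size])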
I have successfully implemented a word2vec model to generate word embeddings (word vectors). Now I need to generate sentence vectors from the generated word vectors so that I can feed them to a neural network to summarize a text corpus. What are the common approaches to generating sentence vectors from word vectors?
You can try adding an LSTM/RNN encoder before your actual neural network and feeding your network the hidden states of the encoder (which would act as document representations).
The benefit of doing this is that your document embeddings will be trained for your specific task of text summarization.
I don't know which framework you are using, otherwise I would have included some code to get you started.
EDIT 1: Add code snippet
from tensorflow.keras.layers import Input, Embedding, LSTM

word_in = Input(shape=("<MAX SEQ LEN>",))
emb_word = Embedding(input_dim="<vocab size>", output_dim="<embd_dim>",
                     input_length="<MAX SEQ LEN>", mask_zero=True)(word_in)
lstm = LSTM(units="<size>", return_sequences=False,
            recurrent_dropout=0.5, name="lstm_1")(emb_word)
Add any type of dense layer which takes vectors as inputs.
The LSTM takes input of shape batch_size × sequence_length × word_vector_dimension and produces output of shape batch_size × rnn_size, which you can use as document embeddings.
Sentence representations can simply be the column-wise mean of all the word vectors in your sentence. There are also implementations of related ideas, such as doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html), where a document is just a collection of words, like a sentence or a paragraph.
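A minimal sketch of the averaging approach with a trained Gensim word2vec model (w2v_model is an illustrative name for your model):
import numpy as np

def sentence_vector(sentence, w2v_model):
    """Column-wise mean of the word vectors of all in-vocabulary words in the sentence."""
    vectors = [w2v_model.wv[w] for w in sentence.split() if w in w2v_model.wv]
    if not vectors:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vectors, axis=0)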