Get the attention vector from the last layers of BERT

Is there any way to get the attention vector, with values normalized to (0-1), from the last layer of BERT? I'm interested in the attention value that BERT assigns to each word in a sentence.
I'm working on emotion classification. I want to extract the relevant words associated with emotions. For example:
I feel wonderful today.
The words feel and wonderful are the most relevant words in the sentence for the classifier, so I want to get the attention scores that BERT assigns to each of them.
Thanks in advance
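One way to do this is with the Hugging Face transformers library, which can return the attention matrices directly. A minimal sketch, assuming bert-base-uncased (note that averaging heads and reading the [CLS] row is only one rough proxy for word relevance):

import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

inputs = tokenizer('I feel wonderful today.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); each row is softmax-normalized,
# so the values already lie in (0, 1).
last_layer = outputs.attentions[-1]
cls_attention = last_layer[0].mean(dim=0)[0]  # average the heads, take the [CLS] row

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, score in zip(tokens, cls_attention.tolist()):
    print(f'{token}: {score:.3f}')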

Related

Do BERT word embeddings change depending on context?

Before answering "yes, of course", let me clarify what I mean:
After BERT has been trained, and I want to use the pretrained embeddings for some other NLP task, can I extract all the word-level embeddings from BERT once, for all the words in my dictionary, and then have a set of static key-value word-embedding pairs from which I retrieve the embedding for, let's say, "bank"? Or will the embeddings for "bank" change depending on whether the sentence is "Trees grow on the river bank" or "I deposited money at the bank"?
And if the latter is the case, how do I practically use the BERT embeddings for another NLP task, do I need to run every input sentence through BERT before passing it into my own model?
Essentially - do embeddings stay the same for each word / token after the model has been trained, or are they dynamically adjusted by the model weights, based on the context?
This is a great question (I had the same question, and your asking it made me experiment a bit).
The answer is yes, it changes based on the context. You should not extract the embeddings and re-use them (at least for most problems).
I checked the embedding for the word bank in two cases: (1) when it comes on its own and (2) when it comes with context ("river bank"). The embeddings I get are different from each other (they have a cosine distance of ~0.4).
import numpy as np
import tensorflow as tf
from transformers import TFBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

print('bank is the second token (index=1):', tokenizer.decode(tokenizer.encode('bank')))
print('bank is the third token (index=2):', tokenizer.decode(tokenizer.encode('river bank')))
### output: bank is the second token (index=1): [CLS] bank [SEP]
### output: bank is the third token (index=2): [CLS] river bank [SEP]

# Pick out the hidden state at the position of "bank" in each input,
# using the indices from the tokenizer output above
bank_bank = model(tf.constant(tokenizer.encode('bank'))[None, :])[0][0, 1, :]
river_bank_bank = model(tf.constant(tokenizer.encode('river bank'))[None, :])[0][0, 2, :]

are_equal = np.allclose(bank_bank, river_bank_bank)
print(are_equal)
### output: False

How to get the word on which the text classification has been made?

I am doing multi-label text classification using a pre-trained BERT model. Here is an example of the prediction made for one sentence:
[image: pred_image]
I want to get the words in the sentence on which the prediction was based, like this: [image: right_one]
If anyone has any idea, please enlighten me.
Multi-Label Text Classification (first image) and Token Classification (second image) are two different tasks, and the model needs to be specifically trained for each.
The first one returns a probability for each label considering the entire sentence. The second returns such predictions for each single word in the sentence, usually considering the rest of the sentence as context.
So you cannot take the output of a Text Classifier and use it for Token Classification, because the information you get is not detailed enough.
What you can and should do is train a Token Classification model, although you will obviously need token-level annotated data to do so. A sketch of that route follows below.
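As a rough sketch of what token classification looks like with the Hugging Face transformers pipeline (the checkpoint below is a public NER model, used here only as a stand-in for a model fine-tuned on your own token-level labels):

from transformers import pipeline

# dslim/bert-base-NER is a public token-classification checkpoint; for your
# task you would substitute a model fine-tuned on your own labels.
token_clf = pipeline('token-classification', model='dslim/bert-base-NER')
for pred in token_clf('Angela Merkel visited Paris.'):
    print(pred['word'], pred['entity'], round(pred['score'], 3))

Each prediction carries a per-token label and confidence, which is exactly the word-level information a sentence-level text classifier cannot give you.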

BERT sentence embeddings

I'm trying to obtain sentence embeddings for BERT, but I'm not quite sure if I'm doing it properly... and yes, I'm aware that tools such as bert-as-service already exist, but I want to do it myself and understand how it works.
Let's say I want to extract a sentence embedding from the word embeddings of the sentence "I am.". As I understand it, BERT outputs tensors of the form (12, seq_length, 768). I extracted each word embedding from the last encoder layer in the form of (1, 768). My doubt now lies in building the sentence embedding from these two word vectors. If I have (2, 768), should I sum over the token dimension and obtain a vector of (1, 768)? Or maybe concatenate the two words into (1, 1536) and apply (mean) pooling to get the sentence vector of shape (1, 768)? I'm not sure what the right approach is to obtain the sentence vector for this example.
As far as I know, BERT has a comment in its source code:
For classification tasks, the first vector (corresponding to [CLS]) is used as the "sentence vector." Note that this only makes sense because the entire model is fine-tuned.
So the [CLS] vector is what BERT provides for sentence embeddings, without any combination or processing of all the word vectors in the sentence.
Hope it helps.
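A minimal sketch of both options with the Hugging Face transformers library (taking the [CLS] vector versus mean-pooling over tokens; mean pooling is a common alternative when the model has not been fine-tuned):

import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('I am.', return_tensors='pt')
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

cls_embedding = hidden[0, 0, :]         # the [CLS] vector, shape (768,)
mean_embedding = hidden[0].mean(dim=0)  # mean pooling over all tokens, (768,)
print(cls_embedding.shape, mean_embedding.shape)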

Find whether a sentence is related to a medical term or not

Input: the user enters a sentence
if any word is related to a medical term, or if the user needs medical attention,
Output = True
else
Output = False
I am reading https://www.nltk.org/. I scraped https://www.merriam-webster.com/browse/medical/a to get medical-related words, but I am unable to figure out how to detect sentences that are related to a medical term. I haven't written any code because the algorithm is not clear to me.
I want to know what I should use and where to start; I need a tutorial link to implement this. Any guidance will be highly appreciated.
I will list the various ways you can do this, from naive to intelligent:
Get a large vocabulary of medical terms, iterate over the sentence, and return yes or no in case you find anything.
Get a large vocabulary of medical terms, iterate over the sentence, and do a fuzzy match with each word, so that words that are variations of the same word syntactically (alphabetically) are still detected and caught. [Check the fuzzywuzzy library in Python; a sketch follows after this list.]
Get a large vocabulary of medical terms with definitions for each. Use pre-trained word embeddings (word2vec, GloVe, etc.) for each word in the descriptions of those terms. Take a weighted sum of the word embeddings, with weights set to the TF-IDF of each word, to represent each medical term (its description, to be precise) as a vector. Repeat the process for the sentence as well. Then take the cosine similarity between them to calculate how contextually similar the text is to the description of the medical term. If the similarity is above a certain threshold that you fix, return True. [This approach doesn't need the exact term; even if the person is only talking about the condition, it should be able to detect it.]
Label a large number of sentences with the respective medical terms in them (annotate using something like the API.AI entity annotation tool or the RASA entity annotation tool). Create a neural network with an input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers, and a softmax output over the list of medical terms/conditions. This will get you the probability of each condition or term being associated with the sentence.
Create a neural network with an encoder-decoder architecture and an attention layer between them. Create encoder embeddings from the input sentence. Create a decoder whose output is a string of medical terms. Train the encoder-decoder attention layer with pre-annotated data.
Create a pointer network which takes as input a sentence with the respective medical terms and returns pointers, which point back to the inputs and mark them as medical terms or non-medical terms. (Not easy to build, FYI...)
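A minimal sketch of the fuzzy-matching option above; medical_terms is a hypothetical stand-in for the vocabulary you would scrape yourself:

from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

medical_terms = ['abdomen', 'aspirin', 'arrhythmia']  # stand-in vocabulary

def is_medical_fuzzy(sentence, threshold=90):
    # Flag the sentence if any word is a close (not just exact) match
    # to a known medical term.
    for word in sentence.lower().split():
        for term in medical_terms:
            if fuzz.ratio(word, term) >= threshold:
                return True
    return False

print(is_medical_fuzzy('My abdomen is hurting'))  # True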
OK, so I don't understand which part you do not understand, because the idea is rather simple and one Google search gives you great and easy results, unless the issue is that you don't know Python. In that case it will be very hard for you to implement this.
The idea itself is simple: tokenize the sentence (have each word by itself in a list) and search the list of medical terms. If the current word is in the list, the term is medical, so the sentence is related to that medical term as well. If you imagine that you have a list of medical terms in a medical_terms list, then in Python it would look something like this:
>>> import nltk
>>> nltk.download('punkt')  # tokenizer models, needed once
>>> medical_terms = ['abdomen']  # stand-in for your real vocabulary
>>> sentence = """At eight o'clock on Thursday morning
... Arthurs' abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthurs', "'", 'abdomen', 'was', 'hurting', '.']
>>> def is_medical(tokens):
...     for i in tokens:
...         if i in medical_terms:
...             return True
...     return False  # only after checking every token
>>> is_medical(tokens)
True
You just tokenize the input sentence with NLTK and then check whether any word in the sentence is a medical term. You can adapt this function to work with n-grams as well (a sketch follows below). There are many other approaches and special cases that would have to be handled, but this is a good start.
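For multi-word terms, a sketch of the n-gram variant using nltk.ngrams (medical_terms again stands in for a real vocabulary):

>>> from nltk import ngrams
>>> medical_terms = {'blood pressure'}
>>> tokens = nltk.word_tokenize('His blood pressure was high.')
>>> any(' '.join(gram) in medical_terms for gram in ngrams(tokens, 2))
True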

How to get the sentiment score of each word in sentence-based sentiment classification using RNN/LSTM?

If I use an RNN/LSTM for sentence-based sentiment analysis, how do I get the sentiment/distribution/confidence for each word? I just read one article, shown here: Neural network that remembers.
It includes a picture like the one below. How do I get the likelihood of each character if I only do classification at the sentence level? I know how to use an LSTM for sentence classification, but there we use only the last hidden representation for classification, so how do I get the likelihood of each character/word?
A practical example showing what the input and output are would be great!
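A minimal sketch of the idea in PyTorch (the model below is hypothetical; the point is that the LSTM already produces a hidden state per timestep, so you can apply the same classifier head to every step instead of only the last one):

import torch
import torch.nn as nn

class PerWordSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, hidden)
        # Sentence-level classification would use only states[:, -1, :];
        # applying the head to every timestep yields per-word scores instead.
        return self.head(states).softmax(dim=-1)      # (batch, seq_len, classes)

model = PerWordSentiment(vocab_size=10000)
per_word_probs = model(torch.randint(0, 10000, (1, 5)))  # a 5-token sentence
print(per_word_probs.shape)  # torch.Size([1, 5, 2])

If the model was trained only with a sentence-level loss, these per-word scores are a heuristic readout rather than calibrated probabilities.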
