Translating English to French, we may have this:
Input: "Please help me translate this sentence" 6 tokens
Output: "Merci de m'aider à traduire cette phrase" 7 tokens
We have 7 tokens in the output. How does Bert model know this length during the network processing? Which hyperparameters are involved?
A sequence to sequence model (for translation) does not usually "know" the length of the output it predicts, but rather:
encodes a full sentence to a (often fixed size) representation
then, from said representation, predicts one output item at a time, until an "EOS" (end of sequence) item (token) is predicted, which marks the end of the prediction.
BERT specifically only does the first part, and creates a mathematical sentence representation from an input.
Related
I'm trying to figure out what BERT preprocess does. I mean, how it is done. But I can't find a good explanation. I would appreciate, if somebody know, a link to a better and deeply explained solution.
If someone, by the other hand, wants to solve it here, I would be also extremly thankful!
My question is, how does BERT mathematically convert a string input into a vector of numbers with fixed size? Which are the logical steps that follows?
BERT provides its own tokenizer. Because BERT is a pretrained model that expects input data in a specific format, following are required:
A special token, [SEP], to mark the end of a sentence, or the
separation between two sentences
A special token, [CLS], at the
beginning of our text. This token is used for classification tasks,
but BERT expects it no matter what your application is.
Tokens that conform with the fixed vocabulary used in BERT
The Token IDs for the tokens, from BERT’s tokenizer
Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
Segment IDs used to distinguish different sentences
Positional Embeddings used to show token position within the sequence
.
from transformers import BertTokenizer
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# An example sentence
text = "Sentence to embed"
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
Have a look at this excellent tutorial for more details.
Refer to below image (the process of how word2vec skipgram extract training datasets-the word pair from the input sentences).
E.G. "I love you." ==> [(I,love), (I, you)]
May I ask what is the word pair when the sentence contains only one word?
Is it "Happy!" ==> [(happy,happy)] ?
I tested the word2vec algorithm in genism, when there is just one word in the training set sentences, (and this word is not included in other sentences), the word2vec algorithm can still construct an embedding vector for this specific word. I am not sure how the algorithm is able to do so.
===============UPDATE===============================
As the answer posted below, I think the word embedding vector created for the word in the 1-word-sentence is just the random initialization of neural network weights.
No word2vec training is possible from a 1-word sentence, because there's no neighbor words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the starting random-initialization of the word, with no further training. (And, you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently model-able rare words is removed.)
If that 1-word sentence actually appeared next-to other real sentences in your corpus, it could make sense to combine it with surrounding texts. There's nothing magic about actual sentences for this kind word-from-surroundings modeling - the algorithm is just working on 'neighbors', and it's common to use multi-sentence chunks as the texts for training, and sometimes even punctuation (like sentence-ending periods) is also retained as 'words'. Then words from an actually-separate sentence – but still related by having appeared in the same document – will appear in each other's contexts.
I have seen both terms used while reading papers about BERT and ELMo so I wonder if there is a difference between them.
A contextualized word embeding is a vector representing a word in a special context. The traditional word embeddings such as Word2Vec and GloVe generate one vector for each word, whereas a contextualized word embedding generates a vector for a word depending on the context. Consider the sentences The duck is swimmingand You shall duck when someone shoots at you. With traditional word embeddings, the word vector for duckwould be the same in both sentences, whereas it should be a different one in the contextualized case.
While word embeddings encode words into a vector representation, there is also the question on how to represent a whole sentence in a way a computer can easily work with. These sentence encodings can embedd a whole sentence as one vector , doc2vec for example which generate a vector for a sentence. But also BERT generates a representation for the whole sentence, the [CLS]-token.
So in short, a conextualized word embedding represents a word in a context, whereas a sentence encoding represents a whole sentence.
Input: user enters a sentence
if the word is related to any medical term , or if he needs any medical attention,
Output=True
else
Output=False
I am reading https://www.nltk.org/. I scraped 'https://www.merriam-webster.com/browse/medical/a' this website to get the medical related words but I am unable to figure out how to detect the sentence which are related to medical term . I haven't done any code because the algorithm is not clear to me.
I want to know what should I use , where to start, I need a tutorial link to implement this thing. Any guidance will be highly appreciated
I will list down the various ways you can do this with naive to intelligent ways -
Get a large vocabulary of medical terms, iterate over the sentence and return yes or no incase you find anything
Get a large vocabulary of medical terms, iterate over the sentence and do a fuzzy match with each word, so that words that are variations of the same work syntactically (alphabetically) are still detected and caught. [Check fuzzywuzzy library in python]
Get a large vocabulary of medical terms with definitions for each. Use pre-trained word embeddings (word2vec, Glove etc) for each word in the descriptions of those terms. Take a weighted sum of each word embeddings with weights set to the TFIDF of each word, to represent each medical term (its description to be precise) as a vector. Repeat the process for the sentence as well. Then take a cosine similary between them to calculate how contextually similar is the text to the description of the medical term. If the similarity is above a certain threshold that you fix, then return True. [This approach doesnt need the exact term, even if the person is talking about the condition, it should be able to detect]
Label a large number of sentences with their respective medical terms in them (annotate using something like the API.AI entity annotation tool or RASA entity annotation tool). Create a neural network with input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers and output with the list of medical terms / conditions with softmax. This will get you probability of each condition or term being associated with the sentence.
Create a neural network with encoder decoder architecture with attention layer between them. Create encoder embeddings from the input sentence. Create decoder with output as a string of medical terms. Train an encoder-decoder attention layer with pre-annotated data.
Create a pointer network which as input takes a sentence with the respective medical terms and return pointers, which point back to the inputs and marks them as medical term or non-medical term. (not easy to build fyi...)
OK so, I don't understand which part do you not understand? Because, the idea is rather simple and one google search gives you great and easy results. Unless the issue is that you don't know python. In that case it will be very hard for you to implement this.
The idea itself is simple - tokenize sentence (have each word for itself in a list) and search the list of medical terms. If the current word is in the list, the term is medical so the sentence is related to that medical term as well. If you imagine that you have a list of medical terms in a medical_terms list then in python it would look something like this:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthurs' abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthurs', 'abdomen', "was", 'hurting', '.']
>>> def is_medical(tokens):
... for i in tokens:
... if i in medical_terms:
... return True
... else:
... return False
>>> is_medical(tokens)
True
You just tokenize the input sentence with NLTK and then search the list if any of the words in the sentence are medical terms. You can adapt this function to work with n-grams as well. This has a lot of other approaches and different special cases that have to be handled by this is the good start.
Difference between tokenize.fit_on_text, tokenize.text_to_sequence and word embeddings?
Tried to search on various platforms but didn't get a suitable answer.
Word embeddings is a way of representing words such that words with the same/similar meaning have a similar representation. Two commonly used algorithms that learn word embedding are Word2Vec and GloVe.
Note that word embeddings can also be learnt from scratch while training your neural network for text processing, on your specific NLP problem. You can also use transfer learning; in this case, it would mean to transfer the learned representation of the words from huge datasets on your problem.
As for the tokenizer(I assume it's Keras that we're speaking of), taking from the documentation:
tokenize.fit_on_text() --> Creates the vocabulary index based on word frequency. For example, if you had the phrase "My dog is different from your dog, my dog is prettier", word_index["dog"] = 0, word_index["is"] = 1 (dog appears 3 times, is appears 2 times)
tokenize.text_to_sequence() --> Transforms each text into a sequence of integers. Basically if you had a sentence, it would assign an integer to each word from your sentence. You can access tokenizer.word_index() (returns a dictionary) to verify the assigned integer to your word.