About BertForMaskedLM - nlp

I have recently read about BERT and want to use BertForMaskedLM for a fill-mask task. I know about the BERT architecture. Also, as far as I know, BertForMaskedLM is built from BERT with a language modeling head on top, but I have no idea what the language modeling head means here. Can anyone give me a brief explanation?

BertForMaskedLM, as you have correctly understood, uses a Language Modeling (LM) head.
Generally, and in this case as well, the LM head is a linear layer whose input dimension is the hidden-state size (768 for BERT-base) and whose output dimension is the vocabulary size. It thus maps the hidden-state output of the BERT model at each position to a score for every token in the vocabulary. The loss is calculated from the scores obtained for a given token with respect to the target token.
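To make this concrete, here is a minimal sketch of my own (not part of the answer; the bert-base-uncased checkpoint is an assumption) showing the LM head in action: the head produces one score per vocabulary token at every position, and the highest-scoring token at the [MASK] position is the fill-mask prediction.
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    with torch.no_grad():
        # LM head output: shape (1, seq_len, vocab_size), i.e. one score per vocabulary token
        logits = model(**inputs).logits

    # Pick the highest-scoring vocabulary token at the [MASK] position.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    predicted_id = logits[0, mask_pos].argmax(-1).item()
    print(tokenizer.decode([predicted_id]))  # expected: "paris"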

In addition to @Ashwin Geet D'Sa's answer:
Here is Huggingface's definition of the model head:
The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension.
You can find Huggingface's definitions of other terms on this page: https://huggingface.co/docs/transformers/glossary

Related

Can I fine-tune BERT using only masked language model and next sentence prediction?

So, if I understand correctly, there are mainly two ways to adapt BERT to a specific task: fine-tuning (all weights are changed, even the pretrained ones) and feature-based (the pretrained weights are frozen). However, I am confused.
When to use which one? If you have unlabeled data (unsupervised learning), should you then use fine-tuning?
If I want to fine-tune BERT, isn't the only option to do that using masked language modeling and next sentence prediction? And also: is it necessary to put another neural network layer on top?
Thank you.
Your first approach should be to try the pre-trained weights; generally this works well. However, if you are working in a different domain (e.g. medicine), you'll need to fine-tune on data from the new domain. Again, you might be able to find models already pre-trained on such domains (e.g. BioBERT).
For adding a layer, there are slightly different approaches depending on your task. E.g., for question answering, have a look at the TANDA paper (Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection). It is a very nice, easily readable paper which explains the transfer-and-adapt strategy. Again, Huggingface has modified and pre-trained models for most of the standard tasks.
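As a hedged illustration of picking up such a domain checkpoint, loading it with a task head via the Auto* classes looks like this (the BioBERT hub id and num_labels below are my own assumptions; check the Huggingface hub for the exact name):
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Assumed hub id for BioBERT; any domain-pretrained checkpoint works the same way.
    model_name = "dmis-lab/biobert-base-cased-v1.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels depends on your downstream task; 2 here is just a placeholder.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)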

What does "fine-tuning of a BERT model" refer to?

I was not able to understand one thing: when people say "fine-tuning of BERT", what does it actually mean?
1. Are we retraining the entire model again with new data?
2. Or are we just training the top few transformer layers with new data?
3. Or are we training the entire model, but taking the pretrained weights as the initial weights?
4. Or are there already a few layers of ANN on top of the transformer layers, and only those get trained while the transformer weights stay frozen?
I tried Google but I am getting confused; it would be great if someone could help me with this.
Thanks in advance!
I remember reading about a Twitter poll with similar context, and it seems that most people tend to accept your suggestion 3. (or variants thereof) as the standard definition.
However, this obviously does not speak for every single work, but I think it's fairly safe to say that 1. is usually not included when talking about fine-tuning. Unless you have vast amounts of (labeled) task-specific data, this step would be referred to as pre-training a model.
2. and 4. could be considered fine-tuning as well, but from personal/anecdotal experience, allowing all parameters to change during fine-tuning has provided significantly better results. Depending on your use case, this is also fairly simple to experiment with, since freezing layers is trivial in libraries such as Huggingface transformers.
In either case, I would really consider them as variants of 3., since you're implicitly assuming that we start from pre-trained weights in these scenarios (correct me if I'm wrong).
Therefore, trying my best at a concise definition would be:
Fine-tuning refers to the step of training any number of parameters/layers with task-specific and labeled data, from a previous model checkpoint that has generally been trained on large amounts of text data with unsupervised MLM (masked language modeling).
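To illustrate the freezing mentioned above, here is a minimal sketch of my own in Huggingface transformers (model name and head are placeholders): freezing the encoder corresponds roughly to variants 2./4., while leaving everything trainable is variant 3.
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Variants 2./4.: freeze the pre-trained encoder and train only the new head.
    for param in model.bert.parameters():
        param.requires_grad = False

    # Variant 3. (full fine-tuning, usually better in practice): leave everything
    # trainable, i.e. simply skip the loop above or set requires_grad back to True.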

How to fine tune BERT on unlabeled data?

I want to fine-tune BERT on a specific domain. I have texts from that domain in text files. How can I use these to fine-tune BERT?
I am currently looking here.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1,000 or more labeled samples.
Pretraining, on the other hand, basically tries to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
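If you go the additional-pretraining route, a minimal sketch with Huggingface's Trainer could look like this (the corpus file name and hyperparameters are placeholders of my own, not from the answer):
    from datasets import load_dataset
    from transformers import (BertTokenizerFast, BertForMaskedLM, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Plain, unlabeled domain text; one sentence or document per line.
    dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    # The collator randomly [MASK]s 15% of the tokens, reproducing BERT's MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir="bert-domain-adapted",
                             per_device_train_batch_size=16, num_train_epochs=1)

    Trainer(model=model, args=args, train_dataset=tokenized["train"],
            data_collator=collator).train()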
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation" and "language modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, they provide fine-tuned models which are already capable of capturing semantic similarity between sentences. The "Continue Training on Other Data" section is what you want in order to further fine-tune the model on your domain. You do have to prepare a training dataset according to one of the available loss functions; e.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
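A hedged sketch of that kind of fine-tuning (the model name and the toy pairs below are my own placeholders) could look like this:
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained SBERT checkpoint

    # Each example is a pair of texts plus a label: 1 = similar, 0 = dissimilar.
    train_examples = [
        InputExample(texts=["How do I reset my password?",
                            "I forgot my password, how can I change it?"], label=1),
        InputExample(texts=["How do I reset my password?",
                            "What is the shipping cost to Canada?"], label=0),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)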
I believe transfer learning is useful for training the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training; to get started, you can take a look over here.

Creating input data for BERT modelling - multiclass text classification

I'm trying to build a Keras model to classify text into 45 different classes. I'm a little confused about preparing my data for the input as required by Google's BERT model.
Some blog posts insert data as a tf dataset with input_ids, segment ids, and mask ids, as in this guide, but then some only go with input_ids and masks, as in this guide.
Also, the second guide notes that the segment mask and attention mask inputs are optional.
Can anyone explain whether or not those two are required for a multiclass classification task?
If it helps, each row of my data can consist of any number of sentences within a reasonably sized paragraph. I want to be able to classify each paragraph/input to a single label.
I can't seem to find many guides/blogs about using BERT with Keras (TensorFlow 2) for a multiclass problem; indeed, many of them are for multi-label problems.
I guess it is too late to answer, but I had the same question. I went through the Huggingface code and found that if attention_mask and token_type_ids (the segment ids) are None, then by default the model pays attention to all tokens and all segments are given id 0.
If you want to check it out, you can find the code here
Let me know if this clarifies it or you think otherwise.
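Here is a small sketch of my own (using the PyTorch classes; the same default logic applies to the TF/Keras versions) showing that, for a single unpadded sequence, omitting attention_mask and token_type_ids gives the same logits as passing them explicitly:
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=45)
    model.eval()  # disable dropout so the two forward passes are comparable

    enc = tokenizer("An example paragraph to classify.", return_tensors="pt")

    with torch.no_grad():
        out_full = model(input_ids=enc["input_ids"],
                         attention_mask=enc["attention_mask"],
                         token_type_ids=enc["token_type_ids"])
        # With the optional inputs omitted, the model attends to every token and
        # assigns segment id 0 everywhere, so the logits match (no padding here).
        out_ids_only = model(input_ids=enc["input_ids"])

    print(torch.allclose(out_full.logits, out_ids_only.logits))  # True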

What does discriminative reranking do in NLP tasks?

Recently, I have read "Discriminative reranking for natural language processing" by Collins.
I'm confused about what the reranking actually does.
Does it add more global features to the reranking model, or something else?
If you mean this paper, then what is done is the following:
1. train a parser using a generative model, i.e. one where you compute P(term | tree) and use Bayes' rule to reverse that and get P(tree | term),
2. apply that to get an initial k-best ranking of trees from the model,
3. train a second model on features of the desired trees,
4. apply that to re-rank the output from 2.
The reason why the second model is useful is that in generative models (such as naïve Bayes, HMMs, PCFGs), it can be hard to add features other than word identity, because the model would try to predict the probability of the exact feature vector instead of the separate features, which might not have occurred in the training data and will have P(vector|tree) = 0 and therefore P(tree|vector) = 0 (+ smoothing, but the problem remains). This is the eternal NLP problem of data sparsity: you can't build a training corpus that contains every single utterance that you'll want to handle.
Discriminative models such as MaxEnt are much better at handling feature vectors, but take longer to fit and can be more complicated to handle (although CRFs and neural nets have been used to construct parsers as discriminative models). Collins et al. try to find a middle ground between the fully generative and fully discriminative approaches.
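To make the pipeline concrete, here is a tiny, purely illustrative sketch (everything in it is invented for illustration, not Collins' actual model): the generative parser supplies k-best candidates with log-probabilities, and a learned linear model re-scores them using arbitrary global features of the whole tree.
    import numpy as np

    def rerank(candidates, feature_fn, weights):
        """candidates: list of (tree, generative_log_prob) pairs from the base parser.
        feature_fn: maps a whole tree to a global feature vector.
        weights: weights of the discriminative reranker (learned separately)."""
        scores = [log_p + np.dot(weights, feature_fn(tree))
                  for tree, log_p in candidates]
        return candidates[int(np.argmax(scores))][0]  # best tree after re-scoring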
