Fine-tune BERT for a specific domain on a different language? - python-3.x

I want to fine-tune on a pre-trained BERT model.
However, my task uses data within a specific domain (say biomedical data).
Additionally, my data is also in a language different from English (say Dutch).
Now I could fine-tune the Dutch bert-base-dutch-cased pre-trained model.
However, how would I go about fine-tuning a Biomedical BERT model, like BioBERT,
which is in the correct domain, but wrong language?
I have thought about using NMT, but don't think it's viable and worth the effort.
If I fine-tune without any alterations to the model, I fear that the model will not learn the task well
since it was pre-trained on a completely different language.

I just want to know if there are any methods that allow for fine-tuning a pre-trained BERT model trained on a specific domain and use it for data within that same domain, but a different language
Probably not. BERT's vocabulary is fixed at the start of pre-training, and adding additional vocabulary leads to random weight initializations.
Instead, I would:
Look for a multi-lingual, domain-specific version of BERT as #Ashwin said.
Fine-tune Dutch BERT on your task and see if performance is acceptable. In general, BERT can adapt to different tasks quite well.
(If you have the available resources) Continue pre-training Dutch BERT on your specific domain (for example, like SciBERT) and then fine-tune on your task.

Related

How to extend the vocabulary of a pretrained transformer model?

I would like to extend a zero-shot text classification (NLI) model's vocabulary, to include domain-specific vocabulary or just to keep it up-to-date. For example, I would like the model to know the names of the latest COVID-19 variants are related to the topic 'Healthcare'.
I've added the tokens to the tokenizer and resized the token embeddings. However, I don't know how to finetune the weights in the embedding layer, as suggested here.
To do the finetuning, can I use simply use texts containing a mixture of new vocabulary and existing vocabulary, and have the tokenizer recognise the relations between tokens through co-occurrences in an unsupervised fashion?
Any help is appreciated, thank you!
If you resized the corresponding embedding weights with resize_token_embeddings, they will be initialised randomly.
Technically, you can fine-tune the model on your target task (NLI, in your case), without touching the embedding weights. In practice, it will be harder for your model to learn anything meaningful about the newly added tokens, since their embeddings are randomly initialised.
To learn the embedding weights you can do further pre-training, before fine-tuning on the target task. This is done by training the model on the pre-training objective(s) (such as Masked Language Modelling). Pre-training is more expensive than fine-tuning of course, but remember that you aren't pre-training from scratch, since you start pre-training from the checkpoint of the already pre-trained model. Therefore, the number of epochs/steps will be significantly less than what was used in the original pre-training setup.
When doing pre-training it will be beneficial to include in-domain documents, so that it can learn the newly added tokens. Depending on whether you want the model to be more domain specific or remain varied so as to not "forget" any previous domains, you might also want to include documents from a variety of domains.
The Don't Stop Pretraining paper might also be an interesting reference, which delves into specifics regarding the type of data used as well as training steps.

Can I fine-tune BERT using only masked language model and next sentence prediction?

So if I understand correctly there are mainly two ways to adapt BERT to a specific task: fine-tuning (all weights are changed, even pretrained ones) and feature-based (pretrained weights are frozen). However, I am confused.
When to use which one? If you have unlabeled data (unsupervised learning), should you then use fine-tuning?
If I want to fine-tuned BERT, isn't the only option to do that using masked language model and next sentence prediction? And also: is it necessary to put another layer of neural network on top?
Thank you.
Your first approach should be to try the pre-trained weights. Generally it works well. However if you are working on a different domain (e.g.: Medicine), then you'll need to fine-tune on data from new domain. Again you might be able to find pre-trained models on the domains (e.g.: BioBERT).
For adding layer, there are slightly different approaches depending on your task. E.g.: For question-answering, have a look at TANDA paper (Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection). It is a very nice easily readable paper which explains the transfer and adaptation strategy. Again, hugging-face has modified and pre-trained models for most of the standard tasks.

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need in the order of 1000 or more samples including labels.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you to have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form ofloss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
#dennlinger gave an exhaustive answer. Additional pretraining is also referred as "post-training", "domain adaptation" and "language modeling fine-tuning". here you will find an example how to do it.
But, since you want to have good sentence embeddings, you better use Sentence Transformers. Moreover, they provide fine-tuned models, which already capable of understanding semantic similarity between sentences. "Continue Training on Other Data" section is what you want to further fine-tune the model on your domain. You do have to prepare training dataset, according to one of available loss functions. E.g. ContrastLoss requires a pair of texts and a label, whether this pair is similar.
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
Tensorflow has some useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

Can you train a BERT model from scratch with task specific architecture?

BERT pre-training of the base-model is done by a language modeling approach, where we mask certain percent of tokens in a sentence, and we make the model learn those missing mask. Then, I think in order to do downstream tasks, we add a newly initialized layer and we fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and Book corpus, which sums to 3,300M words. BERT-base has 109M model parameters. So, if you think you have large enough data to train BERT, then the answer to your question is yes.
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why do you prefer to train BERT from scratch instead of fine-tuning it? Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
I can give help.
First of all, MLM and NSP (which are the original pre-training objectives from NAACL 2019) are meant to train language encoders with prior language knowledge. Like a primary school student who read many books in the general domain. Before BERT, many neural networks would be trained from scratch, from a clean slate where the model doesn't know anything. This is like a newborn baby.
So my question is, "is it a good idea to start teaching a newborn baby when you can begin with a primary school student?" My answer is no. This is supported by numerous State-of-The-Arts achieved by the pre-trained models, compared to the old methods of training a neural network from scratch.
As someone who works in the field, I can assure you that it is a much better idea to fine-tune a pre-trained model. It doesn't matter if you have a 200k dataset or a 1mil datapoints. In fact, more fine-tuning data will only make the downstream results better if you use the right hyperparameters.
Though I recommend the learning rate between 2e-6 ~ 5e-5 for sentence classification tasks, you can explore. If your dataset is very, very domain-specific, it's up to you to fine-tune with a higher learning rate, which will deviate the model further away from its "pre-trained" knowledge.
And also, regarding your question on
can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?
I'm negative about this idea. Even though you have a dataset with 200k instances, BERT is pre-trained on 3300mil words. BERT is too inefficient to be trained with 200k instances (both size-wise and architecture-wise). If you want to train a neural network from scratch, I'd recommend you look into LSTMs or RNNs.
I'm not saying I recommend LSTMs. Just fine-tune BERT. 200k is not even too big anyways.
All the best luck with your NLP studies :)

BERT fine tuning

I'm trying to create my model for question answering based on BERT und can't understand what is the meaning of fine tuning. Do I understand it right, that it is like adaption for specific domain? And if I want to use it with Wikipedia corpora, I just need to integrate unchanged pre-trained model in my network?
Fine tuning is adopting (refining) the pre-trained BERT model to two things:
Domain
Task (e.g. classification, entity extraction, etc.).
You can use pre-trained models as-is at first and if the performance is sufficient, fine tuning for your use case may not be needed.
Finetuning is more like adopting the pre-trained model to the downstream task. However, recent state-of-the-art proves that finetuning doesn't help much with QA tasks. See also the following post.

Resources