Fine-tune T5 pre-trained model on a specific domain for question answering - nlp

I need to build a question-answering system for a specific domain, finance. I have a corpus of documents containing all the information about the field.
Can I fine-tune the pre-trained T5 model (large) with unsupervised training on these documents, so that it can answer related questions based on my document corpus?
The document corpus is quite large, so I cannot simply pass it as the context to T5's existing QA setup.
I am open to your suggestions!

Related

How to extend the vocabulary of a pretrained transformer model?

I would like to extend a zero-shot text classification (NLI) model's vocabulary, either to include domain-specific vocabulary or just to keep it up to date. For example, I would like the model to know that the names of the latest COVID-19 variants are related to the topic 'Healthcare'.
I've added the tokens to the tokenizer and resized the token embeddings. However, I don't know how to finetune the weights in the embedding layer, as suggested here.
To do the finetuning, can I simply use texts containing a mixture of new vocabulary and existing vocabulary, and have the tokenizer recognise the relations between tokens through co-occurrences in an unsupervised fashion?
Any help is appreciated, thank you!
If you resized the corresponding embedding weights with resize_token_embeddings, they will be initialised randomly.
Technically, you can fine-tune the model on your target task (NLI, in your case), without touching the embedding weights. In practice, it will be harder for your model to learn anything meaningful about the newly added tokens, since their embeddings are randomly initialised.
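As a rough illustration of the two points above, here is a toy sketch of what growing an embedding table amounts to, assuming the random initialisation just described. It is plain Python, not the library's implementation; the function name resize_embeddings, the standard deviation of 0.02 and the toy values are all assumptions made for the example.

```python
import random

def resize_embeddings(matrix, new_vocab_size, std=0.02, seed=0):
    """Return a copy of `matrix` grown to `new_vocab_size` rows.

    Existing rows are copied unchanged; added rows are drawn from N(0, std^2).
    """
    rng = random.Random(seed)
    dim = len(matrix[0])
    resized = [row[:] for row in matrix]
    for _ in range(new_vocab_size - len(matrix)):
        resized.append([rng.gauss(0.0, std) for _ in range(dim)])
    return resized

# Toy "pre-trained" table: 3 tokens, 4 dimensions (made-up values).
pretrained = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]]

# Two hypothetical domain tokens are added, so the table grows to 5 rows.
resized = resize_embeddings(pretrained, new_vocab_size=5)
print(len(resized))               # 5
print(resized[:3] == pretrained)  # True: old embeddings are untouched
```

The last two rows carry no information yet, which is why the new tokens only become useful once those rows have been trained.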
To learn the embedding weights you can do further pre-training, before fine-tuning on the target task. This is done by training the model on the pre-training objective(s) (such as Masked Language Modelling). Pre-training is more expensive than fine-tuning of course, but remember that you aren't pre-training from scratch, since you start pre-training from the checkpoint of the already pre-trained model. Therefore, the number of epochs/steps will be significantly less than what was used in the original pre-training setup.
When doing pre-training it will be beneficial to include in-domain documents, so that it can learn the newly added tokens. Depending on whether you want the model to be more domain specific or remain varied so as to not "forget" any previous domains, you might also want to include documents from a variety of domains.
The Don't Stop Pretraining paper might also be an interesting reference, which delves into specifics regarding the type of data used as well as training steps.

How to prevent entities from being overwritten when retraining a pretrained language model for NER

For NER (named entity recognition), assume I have a pretrained language model that can recognize some predefined entities. If I retrain this model and don't want those predefined entities to be overwritten during retraining, what can I do? For example, a BERT-based model has been trained to recognize Databricks as SKILL; how can I make sure it is not relabelled as PRODUCT during the retraining process?
Thanks.

Fine-tune BERT for a specific domain on a different language?

I want to fine-tune a pre-trained BERT model.
However, my task uses data within a specific domain (say biomedical data).
Additionally, my data is also in a language different from English (say Dutch).
Now I could fine-tune the Dutch bert-base-dutch-cased pre-trained model.
However, how would I go about fine-tuning a Biomedical BERT model, like BioBERT,
which is in the correct domain, but wrong language?
I have thought about using NMT, but don't think it's viable and worth the effort.
If I fine-tune without any alterations to the model, I fear that the model will not learn the task well
since it was pre-trained on a completely different language.
I just want to know if there are any methods that allow for fine-tuning a pre-trained BERT model trained on a specific domain and using it on data within that same domain, but in a different language.
Probably not. BERT's vocabulary is fixed at the start of pre-training, and adding additional vocabulary leads to random weight initializations.
Instead, I would:
Look for a multi-lingual, domain-specific version of BERT as @Ashwin said.
Fine-tune Dutch BERT on your task and see if performance is acceptable. In general, BERT can adapt to different tasks quite well.
(If you have the available resources) Continue pre-training Dutch BERT on your specific domain (for example, like SciBERT) and then fine-tune on your task.

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1000 or more labeled samples.
Pretraining, on the other hand, tries to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict which word should be there), for which you do not need labeled data.
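To make the masking objective concrete, here is a minimal sketch of how training pairs can be built from unlabeled text. It follows the scheme from the original BERT paper (about 15% of positions are selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged). The function name mask_tokens, the whitespace tokenisation and the example sentence are assumptions for illustration, not a library API.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token at selected positions, None elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected
        labels[i] = tok                    # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return inputs, labels

# Unlabeled in-domain text is all that is needed (toy sentence, split on spaces).
tokens = "the central bank raised interest rates by fifty basis points".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

The labels come from the text itself, which is why no manual annotation is required for this stage.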
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation" and "language modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, they provide fine-tuned models that are already capable of understanding semantic similarity between sentences. The "Continue Training on Other Data" section is what you want in order to further fine-tune the model on your domain. You do have to prepare a training dataset that matches one of the available loss functions. E.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
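To show why a contrastive loss needs a pair of texts plus a similarity label, here is a small sketch of the general form of such a loss, computed on two already-embedded sentences. The toy vectors, the cosine distance and the margin of 0.5 are assumptions chosen for the example; this is not the Sentence Transformers implementation itself.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contrastive_loss(u, v, label, margin=0.5):
    """label = 1 for a similar pair, 0 for a dissimilar pair."""
    d = cosine_distance(u, v)
    similar_term = label * d ** 2                              # pull similar pairs together
    dissimilar_term = (1 - label) * max(0.0, margin - d) ** 2  # push others beyond the margin
    return 0.5 * (similar_term + dissimilar_term)

# Toy sentence embeddings (made-up 2-d vectors).
a = [1.0, 0.0]
b = [1.0, 0.0]
c = [0.0, 1.0]
print(contrastive_loss(a, b, label=1))  # 0.0   identical and labelled similar
print(contrastive_loss(a, c, label=0))  # 0.0   far apart and labelled dissimilar
print(contrastive_loss(a, b, label=0))  # 0.125 identical but labelled dissimilar
```

Each training example therefore has to carry both texts and the label, since the label decides which of the two terms is active.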
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
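The freeze-then-train-a-new-layer idea described in this answer can be sketched in a few lines of plain Python. This is a toy stand-in, assuming a tiny fixed feature extractor in place of the pretrained base model and a logistic-regression head as the added layer; all names, weights and data points are made up for the example.

```python
import math

# "Base model": fixed weights standing in for a pretrained encoder (toy values).
BASE_WEIGHTS = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 features from 2 inputs

def base_features(x):
    """Frozen feature extractor: tanh(W x). Its weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BASE_WEIGHTS]

def predict(head, feats):
    """Trainable head: logistic regression on top of the frozen features."""
    z = sum(w * f for w, f in zip(head["w"], feats)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, epochs=200, lr=0.5):
    """Fit only the head with plain stochastic gradient descent."""
    head = {"w": [0.0, 0.0, 0.0], "b": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = base_features(x)
            err = predict(head, feats) - y  # gradient of the log loss w.r.t. z
            head["w"] = [w - lr * err * f for w, f in zip(head["w"], feats)]
            head["b"] -= lr * err
    return head

# Labelled toy data: the label is 1 when the second input is the larger one.
data = [([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.2, 0.9], 1), ([0.8, 0.1], 0)]
head = train_head(data)
print([round(predict(head, base_features(x))) for x, _ in data])
```

Only the head's parameters change during training, and every example in data needs a label, which is the limitation the answer points out.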
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

NLP - Best document embedding library

Good day, fellow humans (?).
I have a methodological question, muddled by a lot of research crammed into a tiny amount of time.
The question arises from the following problem(s): I need to apply semi-supervised or unsupervised clustering on documents. I have ~300 documents classified with multi-labels and approximately 3400 documents not classified. The number of unlabeled documents could grow to ~10'000 in the next few days.
The main idea is to apply semi-supervised clustering based on the labels at hand. Alternatively, we could go fully unsupervised with soft clustering.
We thought of creating embeddings for the whole documents, but here lies the confusion: which library is the best for such a task?
I guess the utmost importance needs to lie in the context of the whole document. As far as I know, BERT and FastText provide context-dependent word embedding, but not whole document embedding. On the other hand, Gensim's Doc2Vec is context-agnostic, right?
I think I saw a way to train sentence embeddings with BERT, via the HuggingFace API, and was wondering whether it could be useful to consider the whole document as a single sentence.
Do you have any suggestion? I'm probably exposing my utter ignorance and confusion on the matter, but my brain is melted.
Thank you very much for your time.
Viva!
Edit to answer @gojomo:
My documents are on average ~180 words. The original task was multi-label text classification, i.e. each document can have from 1 to N labels, with the number of labels currently being N=18. They are highly imbalanced.
Having only 330 labeled documents so far due to several issues, we asked the documents' provider to also give us unlabeled data, which should reach the order of 10k documents.
I used FastText in classification mode, but the result is obviously atrocious. I also ran a K-NN with Doc2Vec document embeddings, but the result is obviously still atrocious.
I was going to use biomedical BERT-based models (like BioBERT and SciBERT) to produce NER tags (trained on domain-specific datasets) for the documents and later apply a classifier.
Now that we have unlabeled documents at our disposal, we wanted to venture into semi-supervised classification or unsupervised clustering, just to explore possibilities. I have to say that this is just a master's thesis.
