BERT multilingual model - For classification

I am trying to build a multilingual classification model with BERT.
I'm using a feature-based approach (concatenating the features from the top-4 hidden layers) and building a CNN classifier on top of that.
After that I test on a different language (say Chinese) from the same domain, but accuracy for these languages is near zero.
I am not sure that I understand the paper well, so here is my question:
Is it possible to fine-tune the BERT multilingual model on one language (e.g. English), or to use the feature-based approach to extract features and build a classifier, and after that use this model for different languages (other languages from the list of supported languages in the BERT documentation)?
Also, is my hypothesis correct that multilingual BERT's embedding layer maps words from different languages with the same context to similar clusters?
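For reference, here is a minimal sketch of the feature-based approach described above (this assumes the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint, neither of which the question specifies; the example sentences are made up):

```python
# Sketch: concatenate the top-4 hidden layers of multilingual BERT as features
# for a downstream classifier (e.g. a CNN), as described in the question.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)
model.eval()

def extract_features(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # hidden_states is a tuple of (embeddings + 12 layers), each (batch, seq_len, 768)
    top4 = out.hidden_states[-4:]
    # concatenate the last four layers along the hidden dimension -> (batch, seq_len, 3072)
    return torch.cat(top4, dim=-1)

# Token-level features for an English and a Chinese sentence from the same "domain"
features = extract_features(["This movie was great!", "这部电影很糟糕。"])
print(features.shape)
```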

Related

How to create a custom BERT language model for a different language?

I want to create a language translation model using transformers. However, TensorFlow seems to only have a BERT model for English https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4 . If I want a BERT for another language, what is the best way to go about it? Should I create a new BERT, or can I train TensorFlow's own BertTokenizer on another language?
The Hugging Face model hub contains a plethora of pre-trained monolingual and multilingual transformers (and relevant tokenizers) which can be fine-tuned for your downstream task.
However, if you are unable to locate a suitable model for your language, then yes, training from scratch is the only option. Beware, though, that training from scratch is a resource-intensive task that requires significant compute power. Here is an excellent blog post to get you started.
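As a hedged sketch of the "use the hub" route: the checkpoint name below is just one example (the multilingual BERT base model); you would search the hub for a monolingual model in your target language instead.

```python
# Load a pre-trained checkpoint from the Hugging Face hub instead of training
# from scratch, then fine-tune it on your labelled downstream data.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"  # or a language-specific checkpoint from the hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# `model` can now be fine-tuned, e.g. with transformers.Trainer.
```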

Fine-tune BERT for a specific domain on a different language?

I want to fine-tune a pre-trained BERT model.
However, my task uses data within a specific domain (say biomedical data).
Additionally, my data is also in a language different from English (say Dutch).
Now I could fine-tune the Dutch bert-base-dutch-cased pre-trained model.
However, how would I go about fine-tuning a Biomedical BERT model, like BioBERT,
which is in the correct domain, but wrong language?
I have thought about using neural machine translation (NMT), but I don't think it's viable or worth the effort.
If I fine-tune without any alterations to the model, I fear that the model will not learn the task well
since it was pre-trained on a completely different language.
I just want to know if there are any methods that allow for fine-tuning a pre-trained BERT model trained on a specific domain and using it for data within that same domain, but in a different language.
Probably not. BERT's vocabulary is fixed at the start of pre-training, and adding additional vocabulary leads to random weight initializations.
Instead, I would:
Look for a multi-lingual, domain-specific version of BERT, as @Ashwin said.
Fine-tune Dutch BERT on your task and see if performance is acceptable. In general, BERT can adapt to different tasks quite well.
(If you have the available resources) Continue pre-training Dutch BERT on your specific domain (for example, like SciBERT) and then fine-tune on your task.
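If you have the resources for that last option, a rough sketch of continued pre-training might look like the following (this assumes the Hugging Face transformers and datasets libraries, the GroNLP/bert-base-dutch-cased release of the Dutch model mentioned above, and a hypothetical domain.txt file of raw domain text):

```python
# Continue masked-language-model pre-training of Dutch BERT on unlabelled
# domain text, before fine-tuning on the actual task.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# domain.txt: one raw (unlabelled) domain sentence or paragraph per line
dataset = load_dataset("text", data_files={"train": "domain.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="dutch-domain-bert",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()
```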

How to fine tune BERT on unlabeled data?

I want to fine-tune BERT on a specific domain. I have texts from that domain in text files. How can I use these to fine-tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1,000 or more labelled samples.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
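To make the sentence-embedding suggestion concrete, here is a brief sketch (it assumes the sentence-transformers package; the all-MiniLM-L6-v2 checkpoint and example sentences are just illustrative choices):

```python
# Get sentence embeddings directly with Sentence Transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["BERT produces contextual token embeddings.",
             "Sentence Transformers pools them into sentence vectors."]
embeddings = model.encode(sentences)  # numpy array, one vector per sentence
print(embeddings.shape)
```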
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation" and "language modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, they provide fine-tuned models which are already capable of understanding semantic similarity between sentences. The "Continue Training on Other Data" section is what you want in order to further fine-tune the model on your domain. You do have to prepare a training dataset according to one of the available loss functions, e.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
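As a hedged sketch of that setup (classic sentence-transformers fit API; the checkpoint name and toy pairs are placeholders, not from the original answer):

```python
# Fine-tune a Sentence Transformers model on domain pairs with ContrastiveLoss.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# label 1 = similar pair, 0 = dissimilar pair
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=1),
    InputExample(texts=["A man is eating food.", "The girl is playing guitar."], label=0),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("domain-tuned-sbert")
```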
I believe transfer learning is useful for training the model on a specific domain. First you load the pre-trained base model and freeze its weights, then you add another layer on top of the base model and train that layer on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
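A minimal sketch of that frozen-encoder setup (shown here with PyTorch and Hugging Face transformers rather than the TensorFlow guide the answer links to; the checkpoint and two-class head are illustrative assumptions):

```python
# Freeze the pre-trained encoder and train only a new classification head.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
base = AutoModel.from_pretrained("bert-base-uncased")
for param in base.parameters():
    param.requires_grad = False  # freeze the pre-trained weights

head = torch.nn.Linear(base.config.hidden_size, 2)  # new trainable layer

enc = tokenizer(["an example sentence"], return_tensors="pt")
with torch.no_grad():
    cls_vec = base(**enc).last_hidden_state[:, 0]  # [CLS] representation
logits = head(cls_vec)  # during training, only `head` receives gradient updates
```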
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

Which Deep Learning Algorithm does Spacy uses when we train Custom model?

When we train a custom model, I see that we have dropout and n_iter parameters to tune, but which deep learning algorithm does spaCy use to train custom models? Also, when adding a new entity type, is it better to create a blank model or to train on an existing one?
Which learning algorithm does spaCy use?
spaCy has its own deep learning library called thinc, used under the hood for different NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on CNNs with a few tweaks. Specifically for Named Entity Recognition, spaCy uses:
A transition based approach borrowed from shift-reduce parsers, which is described in the paper Neural Architectures for Named Entity Recognition by Lample et al.
Matthew Honnibal describes how spaCy uses this on a YouTube video.
A framework that's called "Embed. Encode. Attend. Predict" (Starting here on the video), slides here.
Embed: Words are embedded using a Bloom filter, which means that word hashes are kept as keys in the embedding dictionary, instead of the word itself. This maintains a more compact embeddings dictionary, with words potentially colliding and ending up with the same vector representations.
Encode: List of words is encoded into a sentence matrix, to take context into account. spaCy uses CNN for encoding.
Attend: Decide which parts are more informative given a query, and get problem specific representations.
Predict: spaCy uses a multi-layer perceptron for inference.
Advantages of this framework, per Honnibal are:
Mostly equivalent to sequence tagging (another task spaCy offers models for)
Shares code with the parser
Easily excludes invalid sequences
Arbitrary features are easily defined
For a full overview, Matthew Honnibal describes how the model works in this YouTube video. Slides could be found here.
Note: This information is based on slides from 2017. The engine might have changed since then.
When adding a new entity type, should we create a blank model or train an existing one?
Theoretically, when fine-tuning a spaCy model with new entities, you have to make sure the model doesn't forget representations for previously learned entities. The best thing, if possible, is to train a model from scratch, but that might not be easy or possible due to lack of data or resources.
EDIT Feb 2021: spaCy version 3 now uses the Transformer architecture as its deep learning model.
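For illustration, here is a minimal sketch of adding a new entity type to the NER component and updating it in code (spaCy v3 API; the GADGET label and the single training example are invented):

```python
# Add a new entity label and update the NER component on a small annotated example.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")      # or spacy.load(...) to start from an existing model
ner = nlp.add_pipe("ner")    # for a loaded model, use nlp.get_pipe("ner") instead
ner.add_label("GADGET")      # hypothetical new entity type

train_data = [("I just bought a new phablet", {"entities": [(20, 27, "GADGET")]})]

optimizer = nlp.initialize()  # use nlp.resume_training() when updating an existing model
for _ in range(20):
    losses = {}
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, drop=0.3, losses=losses)
```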

Linear CRF Versus Word2Vec for NER

I have done a lot of reading about linear CRFs and Word2Vec and wanted to know which one is best for Named Entity Recognition. I trained my model using Stanford NER (which is a linear CRF implementation) and got a precision of 85%. I know that Word2vec groups similar words together, but is it a good model for NER?
CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
If you want to know which is better for your use case, the only way to find out is to try both.
For typical NER tasks, a linear CRF is a popular method, while Word2Vec provides features that can be leveraged to improve a CRF system's performance.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating the output of Word2Vec into a CRF-based NER system, including dense embeddings, binarized embeddings, cluster embeddings, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.
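A hedged sketch of the cluster-embedding idea mentioned above (using gensim, scikit-learn and sklearn-crfsuite; the toy corpus, labels and hyper-parameters are placeholders, not from the paper): cluster the word2vec vectors and feed each word's cluster ID to the CRF as a categorical feature, since most CRF toolkits expect binary or categorical rather than continuous features.

```python
# Use word2vec cluster IDs as categorical features in a linear-chain CRF.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import sklearn_crfsuite

sentences = [["John", "lives", "in", "Berlin"], ["Mary", "works", "at", "Google"]]
labels = [["B-PER", "O", "O", "B-LOC"], ["B-PER", "O", "O", "B-ORG"]]

w2v = Word2Vec(sentences, vector_size=50, min_count=1)
kmeans = KMeans(n_clusters=4, n_init=10).fit(w2v.wv.vectors)
cluster_of = {w: int(c) for w, c in zip(w2v.wv.index_to_key, kmeans.labels_)}

def word_features(sent, i):
    w = sent[i]
    return {"word.lower": w.lower(),
            "word.istitle": w.istitle(),
            "w2v_cluster": str(cluster_of.get(w, -1))}  # categorical, not continuous

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
```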
