I would like to fine-tune BERT for a specific domain on unlabeled data and use the output embeddings to check the similarity between texts. How can I do it? Do I need to first fine-tune on a task (classification, question answering, etc.) and then extract the embeddings? Or can I just take a pre-trained BERT model without any task head and fine-tune it with my own data?
There is no need to fine-tune for classification, especially if you do not have any supervised classification dataset.
You should continue training BERT the same unsupervised way it was originally trained, i.e., continue "pre-training" with the masked language modeling and next sentence prediction objectives. Hugging Face's implementation provides the BertForPreTraining class for this.
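For illustration, here is a minimal sketch of continued pre-training with the masked language modeling objective only (skipping next sentence prediction for simplicity); the corpus file name and hyperparameters are placeholders, not a recipe:

```python
# Minimal sketch of continued MLM pre-training with Hugging Face transformers.
# "domain_corpus.txt" is a placeholder for your own unlabeled text file.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Load the raw domain text and tokenize it.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# The collator masks 15% of tokens on the fly, as in the original BERT objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="bert-domain-adapted",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```

After this, the saved checkpoint can be loaded like any other BERT model to produce domain-adapted embeddings.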
Related
I want to get BERT word embeddings that will be used in another downstream task later. I have a corpus for my custom dataset and want to further pre-train the pre-trained Hugging Face BERT base model. I think this is called post-training. How can I do this using Hugging Face transformers? Can I use transformers.BertForMaskedLM?
I would like to know: when people say "pre-trained BERT model", is it only the final classification neural network that is trained, or is the transformer itself also updated through backpropagation along with the classification neural network?
During pre-training, the complete model is trained (all weights are updated). Moreover, BERT is pre-trained on the masked language modeling objective, not a classification objective.
In pre-training, you usually train a model on a huge amount of generic data. Thus, it has to be fine-tuned with task-specific data and a task-specific objective.
So, if your task is classification on a dataset X, you fine-tune BERT accordingly by adding a task-specific layer (a classification layer; in BERT this is a dense layer over the [CLS] token). While fine-tuning, you update the pre-trained model weights as well as the new task-specific layer.
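As a rough sketch of what that means in code (assuming the Hugging Face BertForSequenceClassification head and placeholder data), both the pre-trained encoder and the newly initialized head receive gradients:

```python
# Fine-tuning sketch: the classification head is a dense layer over the [CLS]
# representation, and all parameters (encoder + head) are trainable.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # no parameters are frozen

batch = tokenizer(["an example sentence", "another example"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()   # gradients flow into the encoder and the new head
optimizer.step()
```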
I am wondering if I am using word embeddings correctly.
I have combined contextualised word vectors with static word vectors because:
my domain corpus is too small to effectively train the model from scratch
my domain is too specialised to use general embeddings.
I used the off-the-shelf ELMo small model and trained a word2vec model on a small domain-specific corpus (around 500 academic papers). I then did a simple concatenation of the vectors from the two different embeddings.
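Roughly, the combination looks like this (a sketch rather than my exact code; it assumes gensim 4 and the older allennlp ElmoEmbedder API, and the tiny corpus here is just a stand-in):

```python
# Concatenating contextual ELMo vectors with static word2vec vectors per token.
import numpy as np
from gensim.models import Word2Vec
from allennlp.commands.elmo import ElmoEmbedder

# Tokenized domain corpus: a list of sentences, each a list of tokens.
corpus = [["protein", "folding", "is", "complex"],
          ["the", "enzyme", "catalyses", "the", "reaction"]]

# Static embeddings trained only on the small domain corpus.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

# Off-the-shelf contextual embeddings; passing the small model's options/weights
# files would select the small ELMo instead of the default one.
elmo = ElmoEmbedder()

def combined_vectors(tokens):
    # ELMo returns (3 layers, n_tokens, 1024); average the layers here.
    elmo_vecs = elmo.embed_sentence(tokens).mean(axis=0)
    static_vecs = np.stack([w2v.wv[t] for t in tokens])
    # Simple concatenation: (n_tokens, 1024 + 100)
    return np.concatenate([elmo_vecs, static_vecs], axis=1)

print(combined_vectors(corpus[0]).shape)
```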
I loosely followed the approach in this paper:
https://www.aclweb.org/anthology/P19-2041.pdf
But the approach in the paper trains the embeddings for a specific task. In my domain there is no labeled training data, hence I just trained the embeddings on the corpus alone.
I am new to NLP, so apologies if I am asking a stupid question.
Are the text embeddings also fine-tuned when fine-tuning for a classification task? Or up to which layer are the encodings fine-tuned (second-to-last layer)?
If you are using the original BERT repository published by Google, all layers are trainable, meaning there is no freezing at all. You can check that by printing tf.trainable_variables().
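For example, assuming the TF1-style graph from the original repository has already been built (e.g. after constructing modeling.BertModel), something like this lists every trainable variable:

```python
# Quick check that no layers are frozen in the original (TF1) BERT repo.
import tensorflow as tf

for var in tf.trainable_variables():
    print(var.name, var.shape)
# The embeddings, all encoder layers, and the pooler appear in this list,
# so every one of them receives gradient updates unless you filter it out.
```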
When I read the paper "Convolutional Neural Networks for Sentence Classification" by Yoon Kim (New York University), I noticed that it implements the "CNN-non-static" model: a model with pre-trained vectors from word2vec, where unknown words are randomly initialized and the pre-trained vectors are fine-tuned for each task.
I just do not understand how the pre-trained vectors are fine-tuned for each task, because as far as I know the input vectors, which are looked up from the pre-trained word2vec.bin, are just like an image matrix, which cannot change during CNN training. If they can change, how? Please help me out, thanks a lot in advance!
The word embeddings are weights of the neural network, and can therefore be updated during backpropagation.
E.g. http://sebastianruder.com/word-embeddings-1/ :
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.
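As a minimal sketch (PyTorch, with a random matrix standing in for the word2vec vectors), loading pre-trained vectors into an embedding layer with freeze=False makes them ordinary trainable weights:

```python
# Pre-trained word vectors become the weights of an embedding layer, so
# backpropagation updates them along with the rest of the network.
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300
pretrained = torch.randn(vocab_size, embed_dim)  # stand-in for word2vec vectors

# freeze=False keeps the embedding weights trainable ("non-static" in Kim's terms).
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[1, 5, 42]])
vectors = embedding(token_ids)            # (1, 3, 300) input to the CNN
loss = vectors.sum()                      # dummy loss for illustration
loss.backward()
print(embedding.weight.grad is not None)  # True: the vectors receive gradients
```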