BERT pre-training from scratch on custom text data - nlp

I want to pre-train BERT from scratch using the Hugging Face library. Originally, BERT was pre-trained on two tasks: masked language modeling (MLM) and next sentence prediction (NSP). I have succeeded in training it for MLM, but I have been running into this issue for weeks now:
Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors
Can anyone help me pre-train BERT on both of these tasks?
I have tried the run_mlm.py script from Hugging Face, which only trains on MLM. I have also tried the original BERT code, but it is not very intuitive to me, so I am sticking with the Hugging Face library.
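For reference, one way to set this up with the Hugging Face Trainer is sketched below; it is not the asker's code, and the corpus path, tokenizer choice, and hyperparameters are illustrative assumptions. BertForPreTraining carries both the MLM and the NSP heads, and TextDatasetForNextSentencePrediction (still shipped with transformers, though marked as legacy in recent versions) builds the sentence-pair examples. Keeping block_size at 512 also avoids the "longer than the specified maximum sequence length" warning quoted above.

from transformers import (BertConfig, BertForPreTraining, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments,
                          TextDatasetForNextSentencePrediction)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # or a tokenizer trained on your corpus
config = BertConfig()                       # default BERT-base architecture
model = BertForPreTraining(config)          # randomly initialised: MLM head + NSP head

# Builds NSP sentence pairs; expects one sentence per line and a blank line
# between documents. block_size=512 keeps examples within BERT's position limit.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="corpus.txt",   # hypothetical path to your custom text data
    block_size=512,
)

# Masks 15% of tokens for the MLM objective; NSP labels come from the dataset.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-from-scratch",
                         per_device_train_batch_size=8,
                         num_train_epochs=2)
trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()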

Related

Hugging Face transformer: model Bio_ClinicalBERT not trained for any of the tasks?

This may be the most beginner question of all :sweat:.
I just started learning about NLP and Hugging Face. The first thing I'm trying to do is apply one of the bioBERT models to some clinical note data and see what I get, before moving on to fine-tuning the model. And "emilyalsentzer/Bio_ClinicalBERT" looks like the closest model for my data.
But whenever I try to use it for any of the analyses, I always get this warning:
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
From chapter 2 of the Hugging Face course, I understand what this means:
This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.
So I went on to test which NLP task I can use "emilyalsentzer/Bio_ClinicalBERT" for, out of the box.
from transformers import pipeline, AutoModel

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
nlp_task = ['conversational', 'feature-extraction', 'fill-mask', 'ner',
            'question-answering', 'sentiment-analysis', 'text-classification',
            'token-classification', 'zero-shot-classification']

for task in nlp_task:
    print(task)
    process = pipeline(task=task, model=checkpoint)
I got the same warning message for all of the NLP tasks, so it appears that I shouldn't, or am advised not to, use the model for any of them. This really confuses me. The original Bio_ClinicalBERT paper reported good results on several tasks, so the model was certainly trained for those tasks. I have a similar issue with other models as well: a blog post or research paper says a model obtained good results on a specific task, but when I try to apply it with a pipeline, I get this warning message. Is there a reason why the head layers were not included in the model?
I only have a few hundred clinical notes (also unannotated :frowning_face:), so it doesn't look like that's big enough for training. Is there any way I could use the model on my data without training?
Thank you for your time.
This Bio_ClinicalBERT model is trained on the Masked Language Modeling (MLM) task. This task is basically used for learning the semantic relations between tokens in the language/domain. For downstream tasks, you can fine-tune the model's head with your small dataset, or you can use an already fine-tuned model like Bio_ClinicalBERT-finetuned-medicalcondition, which is a fine-tuned version of the same model. You can find all the fine-tuned models on Hugging Face by searching for 'bio-clinicalBERT', as in the link.
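As a rough sketch of what "fine-tune the model's head" looks like in code (the label count, example notes, and training loop details below are placeholders, not part of the original answer): loading the checkpoint into a task-specific class drops the pre-training heads, which is exactly what the warning describes, and adds a fresh head you then train on your labels.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The "weights not used / newly initialized" warning is expected here:
# the MLM/NSP heads are discarded and a new classification head is added.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["Patient denies chest pain.", "Severe shortness of breath on exertion."]  # toy examples
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

outputs = model(**batch, labels=labels)   # loss comes from the new, untrained head
outputs.loss.backward()                   # fine-tune with your optimiser of choice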

How to pad using Hugging Face for BERT training

I'm using this link to train Hugging Face BERT. But I noticed that different batches have different sequence lengths at training time, and I want to keep the same sequence length for all of the batches. How can I do that? And how does Hugging Face handle different sequence lengths in different batches?
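A small sketch of the two padding strategies (the model name and max_length below are illustrative, not taken from the linked tutorial): padding to a fixed max_length gives every batch the same sequence length, whereas the varying lengths usually come from dynamic padding, where a collator pads each batch only to its longest example.

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["short sentence", "a somewhat longer sentence for the batch"]

# 1) Fixed length: every example is padded/truncated to max_length,
#    so all batches share one sequence length.
fixed = tokenizer(sentences, padding="max_length", truncation=True,
                  max_length=128, return_tensors="pt")
print(fixed["input_ids"].shape)   # (2, 128)

# 2) Dynamic padding: pad only to the longest example in each batch,
#    so different batches can end up with different lengths.
collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = collator([dict(tokenizer(s, truncation=True)) for s in sentences])
print(batch["input_ids"].shape)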

Can we use GPT-2 sentence embedding for classification tasks?

I am experimenting with the use of transformer embeddings in sentence classification tasks without fine-tuning them. I have used BERT embeddings, and those experiments gave me very good results. Now I want to use GPT-2 embeddings (without fine-tuning). So I have a couple of questions:
Can I use GPT-2 embeddings like that (given that GPT-2 is trained left to right)?
Are there any examples of GPT-2 being used in classification tasks rather than generation tasks?
If I can use GPT-2 embeddings, how should I do it?
I basically solved the problem. Here I used embeddings extracted from GPT-2.
So yes, we can use the final token of the GPT-2 embedding sequence as the class token. Because of the left-to-right self-attention mechanism, the final token can represent the information of the whole sequence.
Please check the following GitHub issue for an implementation that uses GPT-2 embeddings: github issue
I conducted experiments comparing GPT-2 embeddings with RoBERTa embeddings, and I got better results with the RoBERTa embeddings, not with GPT-2.
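As a rough illustration of the idea above (not the implementation from the linked issue), one can take the hidden state of the last non-padding token from GPT2Model as a frozen sentence embedding and feed it to any downstream classifier; the model name and example sentences here are placeholders.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 has no pad token by default
model = GPT2Model.from_pretrained("gpt2").eval()

sentences = ["I loved this movie.", "The plot made no sense at all."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq_len, 768)

# Index of the last real (non-padding) token in each sequence.
last_idx = batch["attention_mask"].sum(dim=1) - 1
embeddings = hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, 768)
# `embeddings` can now serve as features for e.g. logistic regression or an SVM.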

Anomaly detection in Text Classification

I have built a text classifier using OneClassSVM.
My training set corresponds to only one label, i.e. "Yes", and I don't have any data for the other ("No") label. My task is to build a classifier that classifies a new, unseen sentence (test data) as 1 if it is very similar to the training data, and otherwise as -1, i.e. an anomaly.
I have used Word2Vec to build word embeddings for my training data. I then average the word vectors and feed them to a OneClassSVM to build an anomaly detection classifier, as sketched below.
This classifier currently gives an accuracy of about 50%-55%, and I need to improve it to build a robust classifier.
Any suggestions for this problem would be helpful.
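For reference, the described pipeline (averaged Word2Vec vectors fed to a OneClassSVM) condenses to something like the following sketch; the toy corpus, vector size, and nu value are placeholders, not the asker's actual setup.

import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import OneClassSVM

train_sentences = [["patient", "reports", "mild", "headache"],
                   ["follow", "up", "visit", "scheduled"]]

w2v = Word2Vec(sentences=train_sentences, vector_size=100, min_count=1)

def average_vector(tokens, model):
    # Mean of the word vectors of the in-vocabulary tokens (zeros if none).
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train = np.stack([average_vector(s, w2v) for s in train_sentences])
clf = OneClassSVM(kernel="rbf", nu=0.1).fit(X_train)

test = [["patient", "has", "headache"], ["quarterly", "sales", "numbers"]]
X_test = np.stack([average_vector(s, w2v) for s in test])
print(clf.predict(X_test))   # 1 = similar to training data, -1 = anomaly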
I'd suggest a very different approach since you have no training examples for the negative class at all.
You could train a language model on your training data. At inference time, you score the input with the language model and classify it according to some threshold on the perplexity that the LM assigns to the input sentence.
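A rough illustration of this suggestion: score sentences with a language model and flag high-perplexity inputs as anomalies. Here a pre-trained GPT-2 stands in for an LM trained (or fine-tuned) on the "Yes" class, and the threshold value is an assumption you would tune on held-out in-domain data.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss     # mean negative log-likelihood
    return torch.exp(loss).item()

THRESHOLD = 100.0   # hypothetical; pick it from perplexities on in-domain data

def classify(sentence):
    return 1 if perplexity(sentence) <= THRESHOLD else -1   # -1 = anomaly

print(classify("The patient was discharged in stable condition."))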

How does the input word2vec get fine-tuned when training a CNN?

When I read the paper "Convolutional Neural Networks for Sentence Classification" by Yoon Kim (New York University), I noticed that it implements a "CNN-non-static" model: a model with pre-trained vectors from word2vec, in which all words, including the unknown ones that are randomly initialized, as well as the pre-trained vectors, are fine-tuned for each task.
I just do not understand how the pre-trained vectors are fine-tuned for each task. Because as far as I know, the input vectors, which are converted from strings using the pre-trained word2vec.bin, are just like an image matrix, which cannot change during CNN training. So, if they can change, how? Please help me out; thanks a lot in advance!
The word embeddings are weights of the neural network, and can therefore be updated during backpropagation.
E.g. http://sebastianruder.com/word-embeddings-1/ :
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.
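A minimal PyTorch sketch of the point above (the weight matrix is random stand-in data for a real word2vec.bin, and the toy indices are illustrative): the embedding matrix is just another weight tensor, so initialising it from word2vec and leaving it unfrozen lets backpropagation fine-tune the vectors during CNN training.

import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300
pretrained = torch.randn(vocab_size, embed_dim)   # stand-in for word2vec vectors

# freeze=False => requires_grad is True, so the vectors receive gradients
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[12, 7, 431, 9]])       # a toy sentence of word indices
vectors = embedding(token_ids)                    # (1, 4, 300), part of the autograd graph
loss = vectors.sum()                              # dummy loss just to show the gradient flow
loss.backward()
print(embedding.weight.grad is not None)          # True: the embeddings get updated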
