Text classification using BERT - how to handle misspelled words - pytorch

I am not sure if this is the best place to submit that kind of question, perhaps CrossValdation would be a better place.
I am working on a text multiclass classification problem.
I built a model based on BERT concept implemented in PyTorch (huggingface transformer library). The model performs pretty well, except when the input sentence has an OCR error or equivalently it is misspelled.
For instance, if the input is "NALIBU DRINK" the Bert tokenizer generates ['na', '##lib', '##u', 'drink'] and model's prediction is completely wrong. On the other hand, if I correct the first character, so my input is "MALIBU DRINK", the Bert tokenizer generates two tokens ['malibu', 'drink'] and the model makes a correct prediction with very high confidence.
Is there any way to enhance Bert tokenizer to be able to work with misspelled words?

You can leverage BERT's power to rectify the misspelled word.
The article linked below beautifully explains the process with code snippets
https://web.archive.org/web/20220507023114/https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/
To summarize, you can identify misspelled words via a SpellChecker function and get replacement suggestions. Then, find the most appropriate replacement using BERT.

Related

How to get the word on which the text classification has been made?

I am doing a multi-label text classification using a pre-trained model of BERT. Here is an example of the prediction that has been made for one sentence-
pred_image
I want to get those words from the sentence on which the prediction has been made. Like this one - right_one
If anyone has any idea, Please enlighten me.
Multi-Label Text Classification (first image) and Token Classification (second image) are two different tasks for each which the model needs to be specifally trained for.
The first one returns a probability for each label considering the entire sentence. The second returns such predictions for each single word in the sentence while usually considering the rest of the sentence as context.
So you can not really use the output from a Text Classifier and use it for Token Classification because the information you get is not detailed enough.
What you can and should do is train a Token Classification model, although you obviously will need token-level-annotated data to do so.

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need in the order of 1000 or more samples including labels.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you to have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form ofloss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
#dennlinger gave an exhaustive answer. Additional pretraining is also referred as "post-training", "domain adaptation" and "language modeling fine-tuning". here you will find an example how to do it.
But, since you want to have good sentence embeddings, you better use Sentence Transformers. Moreover, they provide fine-tuned models, which already capable of understanding semantic similarity between sentences. "Continue Training on Other Data" section is what you want to further fine-tune the model on your domain. You do have to prepare training dataset, according to one of available loss functions. E.g. ContrastLoss requires a pair of texts and a label, whether this pair is similar.
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
Tensorflow has some useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

Sequence Labelling with BERT

I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.
If I use self-trained or other pre-trained word embedding vectors, this is straightforward.
But if I use the Huggingface transformers BertTokenizer.from_pretrained and BertModel.from_pretrained there is a '[CLS]' and '[SEP]' token added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
What I am unsure of is:
Are these two tags needed for the BertModel to embed each token of a sentence "correctly"?
If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?
Yes, BertModel needed them since without those special symbols added, the output representations would be different. However, my experience says, if you fine-tune BertModel on the labeling task without [CLS] and [SEP] token added, then you may not see a significant difference. If you use BertModel to extract fixed word features, then you better add those special symbols.
Yes, you can take out the embedding of those special symbols. In fact, this is a general idea for sequence labeling or tagging tasks.
I suggest taking a look at some sequence labeling or tagging examples using BERT to become confident about your modeling decisions. You can find NER tagging example using Huggingface transformers here.

Transformers PreTrainedTokenizer add_tokens Functionality

Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens functions.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))
I tried the above by adding previously absent words in the default vocabulary. However, keeping all else constant, I noticed a decrease in accuracy of the fine tuned classifier making use of this updated tokenizer. I was able to replicate similar behavior even when just 10% of the previously absent words were added.
My questions
Am I missing something?
Instead of whole words, is the add_tokens function expecting masked tokens, for example : '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such masked tokens?
Any help would be appreciated.
Thanks in advance.
If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings are not trained and the rest of the network never saw them in context. You would need a lot of data to teach BERT to deal with the newly added words.
There are also some ways how to compute a single word embedding, such that it would not hurt BERT like in this paper but it seems pretty complicated and should not make any difference.
BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
Regarding the ##-prefixed tokens, those are tokens can only be prepended as a suffix of another wordpiece. E.g., walrus gets split into ['wal', '##rus'] and you need both of the wordpieces to be in the vocabulary, but not ##wal or rus.

Do I need to provide sentences for training Spacy NER or are paragraphs fine?

I am trying to train a new Spacy model to recognize references to law articles. I start using a blank model, and train the ner pipe according to the example given in the documentation.
The performance of the trained model is really poor, even with several thousands on input points. I am tryong to figure out why.
One possible answer is that I am giving full paragraphs to train on, instead of sentences that are in the examples. Each of these paragraphs can have multiple references to law articles. Is this a possible issue?
Turns out I was making a huge mistake in my code. There is nothing wrong with paragraphs. As long as your code actually supplies them to spacy.
Paragraphs should be fine. Could you give an example input data point?

Resources