Sequence Labelling with BERT - pytorch

I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.
If I use self-trained or other pre-trained word embedding vectors, this is straightforward.
But if I use the Huggingface transformers BertTokenizer.from_pretrained and BertModel.from_pretrained there is a '[CLS]' and '[SEP]' token added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
What I am unsure of is:
Are these two tags needed for the BertModel to embed each token of a sentence "correctly"?
If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?

Yes, BertModel needed them since without those special symbols added, the output representations would be different. However, my experience says, if you fine-tune BertModel on the labeling task without [CLS] and [SEP] token added, then you may not see a significant difference. If you use BertModel to extract fixed word features, then you better add those special symbols.
Yes, you can take out the embedding of those special symbols. In fact, this is a general idea for sequence labeling or tagging tasks.
I suggest taking a look at some sequence labeling or tagging examples using BERT to become confident about your modeling decisions. You can find NER tagging example using Huggingface transformers here.

Related

Order/context-aware document / sentence to vectors in Spacy

I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classfication. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / Roberta / XLNet) to convert words / sentences to nlp objects. Directly calling the vector of the object will however will default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this being discussed in the page below. Maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950

Text classification using BERT - how to handle misspelled words

I am not sure if this is the best place to submit that kind of question, perhaps CrossValdation would be a better place.
I am working on a text multiclass classification problem.
I built a model based on BERT concept implemented in PyTorch (huggingface transformer library). The model performs pretty well, except when the input sentence has an OCR error or equivalently it is misspelled.
For instance, if the input is "NALIBU DRINK" the Bert tokenizer generates ['na', '##lib', '##u', 'drink'] and model's prediction is completely wrong. On the other hand, if I correct the first character, so my input is "MALIBU DRINK", the Bert tokenizer generates two tokens ['malibu', 'drink'] and the model makes a correct prediction with very high confidence.
Is there any way to enhance Bert tokenizer to be able to work with misspelled words?
You can leverage BERT's power to rectify the misspelled word.
The article linked below beautifully explains the process with code snippets
https://web.archive.org/web/20220507023114/https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/
To summarize, you can identify misspelled words via a SpellChecker function and get replacement suggestions. Then, find the most appropriate replacement using BERT.

Should I train embeddings using data from both training,validating and testing corpus?

I am in a case that I don't have any pre-trained words embedding for my domain (Vietnamese food reviews). so I got a though of embedding from the general and specific corpus.
And the point here is can I use the dataset of training, test and validating (did preprocess) as a source for creating my own word embeddings. If don't, hope you can give your experience.
Based on my intuition, and some experiments a wide corpus appears to be better, but I'd like to know if there's relevant research or other relevant results.
can I use the dataset of training, test and validating (did
preprocess) as a source for creating my own word embeddings
Sure, embeddings are not your features for your machine learning model. They are the "computational representation" of your data. In short, they are made of words represented in a vector space. With embeddings, your data is less sparse. Using word embeddings could be considered part of the pre-processing step of NLP.
Usually (I mean, using the most used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words that it commonly goes along with).
Therefore, to create embeddings, the larger the corpus, the better, since it can better place a word vector in the vector space (and hence compare it to other similar words).

Linear CRF Versus Word2Vec for NER

I have done lots of reading around Linear CRF and Word2Vec and wanted to know which one is the best to do Named Entity Recognition. I trained my model using Stanford NER(Which is a Linear CRF Implementation) and got a precision of 85%. I know that Word2vec groups similar words together but is it a good model to do NER?
CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
If you want to know which is better for your use case, the only way to find out is to try both.
For typical NER tasks, Linear CRF is a popular method, while Word2Vec is a feature that can be leveraged to improve the CRF systems performence.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating output of Word2Vec in a CRF-based NER system, including dense embedding, binerized embedding, cluster embedding, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.

Train some embeddings, keep others fixed

I do sequence classification with Keras, using an RNN and embeddings. My sequences are a bit weird. I have words mixed with special symbols. Words are associated with fixed, pre-trained embeddings, but the special symbol embeddings have to be modified during training.
In an Embedding layer during learning, how can I keep some embeddings fixed while updating others? Is there a way to mask those indices which shouldn't be modified? Or is this a case for a custom Embedding layer?
I do not believe that this is achievable with the existing Embedding layer. To get around it I would just create a custom layer that builds two embedding layers internally, and only puts the embedding matrix of one of them into the trainable_parameters.

Resources