Order/context-aware document/sentence to vectors in spaCy

I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classification. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / RoBERTa / XLNet) to convert words/sentences into nlp objects. Directly calling the vector of the object will, however, default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, this page discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!

I found this being discussed on the page below; maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950
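To make that concrete, here is a minimal sketch of grabbing the [CLS] hidden state with the Hugging Face transformers library directly rather than through spaCy (the bert-base-uncased model name and the pooling choice are just assumptions); the resulting vector could then be fed to XGBoost or any other classifier:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("spaCy makes NLP pipelines easy to build.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]  # hidden state of the [CLS] token, shape [1, 768]
# cls_vector can now be used as a context-aware sentence representation for a downstream classifier.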

Related

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings. I have looked into gensim and GloVe but am struggling to make them work.
How can I convert my data into embeddings such that two words that are semantically closer have a smaller distance between their respective vectors? How can this be achieved, and is there any helpful material on the topic?
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on part of the Google News dataset and generalizes well enough for most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model for your corpus.
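If your corpus does contain jargon or OOV words, a rough sketch of training a small custom model with gensim might look like the following (the toy corpus and hyperparameters are placeholders, not recommended settings):
from gensim.models import Word2Vec

# toy corpus: each sentence is a list of tokens (replace with your own data)
corpus = [
    ['father', 'sings', 'a', 'song'],
    ['the', 'child', 'listens', 'to', 'the', 'father'],
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)
vec = model.wv['father']                        # vector for an in-corpus word
score = model.wv.similarity('father', 'child')  # similarity between two in-corpus words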

Sequence Labelling with BERT

I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.
If I use self-trained or other pre-trained word embedding vectors, this is straightforward.
But if I use the Huggingface transformers BertTokenizer.from_pretrained and BertModel.from_pretrained there is a '[CLS]' and '[SEP]' token added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
What I am unsure of is:
Are these two tags needed for the BertModel to embed each token of a sentence "correctly"?
If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?
Yes, BertModel needs them, since without those special symbols added the output representations would be different. However, in my experience, if you fine-tune BertModel on the labeling task without the [CLS] and [SEP] tokens added, you may not see a significant difference. If you use BertModel to extract fixed word features, then you had better add those special symbols.
Yes, you can take out the embeddings of those special symbols. In fact, this is a common approach for sequence labeling or tagging tasks.
I suggest taking a look at some sequence labeling or tagging examples using BERT to become confident about your modeling decisions. You can find an NER tagging example using Huggingface transformers here.
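For illustration only, here is a minimal sketch of the slicing idea for a single, unpadded sentence (the model name and example words are placeholders):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

words = ['my', 'dog', 'is', 'cute']            # an already-tokenised sentence
inputs = tokenizer(words, is_split_into_words=True, return_tensors='pt')
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # shape: [1, sequence_length, 768]

word_states = hidden[:, 1:-1, :]               # drop [CLS] (first) and [SEP] (last)
# word_states can now go into the LSTM; note that WordPiece may still split rare
# words into several sub-tokens, which needs its own alignment with the labels.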

I want to classify some sentences on the basis of their semantic meaning. How can I use Doc2Vec for this? Or is there a better approach?

I want to implement doc2vec on various reviews that we extracted from a source, and I want to classify these reviews into different classes defined by the user. How can I do this?
I consider this an interesting question. I will give you a couple of approaches, depending on the number of observations/reviews you have.
You can apply LSA: run SVD on the document-term matrix (built from either incidence or TF-IDF vectors). You will get three matrices as output, U, S and V; the V transpose gives the sentence embeddings.
Use these embeddings as input to your model for classification.
I recommend using LSA when your corpus is large.
Resources: link
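As a rough illustration of this LSA route (not necessarily the exact pipeline meant above; the reviews, labels, and hyperparameters are placeholders), scikit-learn's TfidfVectorizer followed by TruncatedSVD implements the SVD-on-TF-IDF step:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# placeholder reviews and user-defined classes
reviews = [
    "great product, works really well",
    "terrible support, would not buy again",
    "fast delivery and easy to use",
    "broke after two days, very disappointed",
]
labels = [1, 0, 1, 0]

lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),   # the SVD step of LSA; each row becomes a sentence embedding
    LogisticRegression(),
)
lsa_clf.fit(reviews, labels)
print(lsa_clf.predict(["works great and easy to use"]))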
In a similar way, instead of using LSA you can use pre-trained embeddings, say GloVe. Here you get word embeddings, and to create document vectors you use the inverse weighted frequency method. Use these document vectors for classification.
Resources: link
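For illustration, here is one possible sketch of this second route, using gensim's pre-trained GloVe vectors and a simple inverse-frequency weighting (the model name, weighting constant, and toy reviews are assumptions, not a prescribed recipe):
import numpy as np
import gensim.downloader as api
from collections import Counter

wv = api.load('glove-wiki-gigaword-100')      # pre-trained GloVe vectors

docs = [['great', 'product'], ['terrible', 'support'], ['easy', 'to', 'use']]
counts = Counter(t for d in docs for t in d)  # corpus word frequencies
total = sum(counts.values())

def doc_vector(tokens, a=1e-3):
    vecs, weights = [], []
    for t in tokens:
        if t in wv:                           # skip OOV tokens
            vecs.append(wv[t])
            weights.append(a / (a + counts[t] / total))  # rarer words get larger weights
    if not vecs:
        return np.zeros(wv.vector_size)
    return np.average(vecs, axis=0, weights=weights)

X = np.vstack([doc_vector(d) for d in docs])  # one row per review, ready for a classifier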

Linear CRF Versus Word2Vec for NER

I have done lots of reading around linear CRFs and Word2Vec and wanted to know which one is best for Named Entity Recognition. I trained my model using Stanford NER (which is a linear CRF implementation) and got a precision of 85%. I know that Word2vec groups similar words together, but is it a good model for doing NER?
CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
If you want to know which is better for your use case, the only way to find out is to try both.
For typical NER tasks, a linear CRF is a popular method, while Word2Vec provides features that can be leveraged to improve a CRF system's performance.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating the output of Word2Vec into a CRF-based NER system, including dense embeddings, binarized embeddings, cluster embeddings, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.
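As an illustration of the cluster-embedding idea (not the authors' code), one possible sketch is to cluster pre-trained vectors with KMeans and use the cluster id as a categorical CRF feature via sklearn-crfsuite; the model name, cluster count, and feature names below are assumptions:
import gensim.downloader as api
from sklearn.cluster import KMeans
import sklearn_crfsuite

wv = api.load('glove-wiki-gigaword-100')
kmeans = KMeans(n_clusters=500, random_state=0).fit(wv.vectors)   # slow on the full vocabulary
cluster_of = {w: int(c) for w, c in zip(wv.index_to_key, kmeans.labels_)}

def token_features(sent, i):
    word = sent[i]
    return {
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'emb_cluster': str(cluster_of.get(word.lower(), -1)),  # categorical, CRF-friendly
    }

# X_train (lists of tokens) and y_train (matching tag sequences) are assumed to exist:
# X_feats = [[token_features(s, i) for i in range(len(s))] for s in X_train]
# crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
# crf.fit(X_feats, y_train)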

Does it make sense to talk about skip-gram and CBOW when using the GloVe method?

I'm trying different word embedding methods in order to pick the approach that works best for me. I tried word2vec and FastText. Now I would like to try GloVe. In both word2vec and FastText, there are two versions: skip-gram (predict context from word) and CBOW (predict word from context). But in the GloVe Python package, there is no parameter that lets you choose whether to use skip-gram or CBOW.
Given that GloVe does not work the same way as word2vec, I'm wondering: does it make sense to talk about skip-gram and CBOW when using the GloVe method?
Thanks in Advance
Not really: skip-gram and CBOW are simply the names of the two Word2vec model architectures. They are shallow neural networks that generate word embeddings by predicting a context from a word (or vice versa), and then treating the output of the hidden layer as the vector/representation. GloVe uses a different approach, making use of the global statistics of a corpus by training on a co-occurrence matrix rather than on local context windows.
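A small sketch may make the distinction concrete: in gensim, skip-gram versus CBOW is just the sg flag of Word2Vec, whereas GloVe vectors are typically loaded ready-made (the toy sentences and model names below are assumptions):
import gensim.downloader as api
from gensim.models import Word2Vec

sentences = [['word2vec', 'uses', 'local', 'context', 'windows'],
             ['glove', 'uses', 'global', 'co', 'occurrence', 'statistics']]

skipgram = Word2Vec(sentences, sg=1, vector_size=50, min_count=1)  # sg=1: skip-gram
cbow = Word2Vec(sentences, sg=0, vector_size=50, min_count=1)      # sg=0 (default): CBOW

glove = api.load('glove-wiki-gigaword-50')  # no such switch: the vectors are pre-computed
                                            # from a global co-occurrence matrix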
