Linear CRF Versus Word2Vec for NER - nlp

I have done a lot of reading on linear CRFs and Word2Vec and wanted to know which one is better for Named Entity Recognition. I trained my model using Stanford NER (which is a linear CRF implementation) and got a precision of 85%. I know that Word2Vec groups similar words together, but is it a good model for doing NER?

CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
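For illustration, here is a rough sketch of one common workaround: discretize the word vectors (e.g. by clustering them) and feed the cluster IDs to the CRF as categorical features. This is not the Stanford NER pipeline; sklearn-crfsuite, scikit-learn's KMeans and the toy sentences/labels below are just stand-ins.

# Sketch: word vectors -> cluster IDs -> categorical CRF features.
# Assumes gensim, scikit-learn and sklearn-crfsuite are installed;
# the toy data is made up for illustration only.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import sklearn_crfsuite

# Toy training data: tokenized sentences with NER-style labels.
sentences = [["John", "lives", "in", "Berlin"],
             ["Mary", "works", "in", "Paris"]]
labels = [["B-PER", "O", "O", "B-LOC"],
          ["B-PER", "O", "O", "B-LOC"]]

# Train tiny word vectors (in practice you would load pretrained ones).
w2v = Word2Vec(sentences, vector_size=25, min_count=1, epochs=50)

# Discretize the continuous vectors into cluster IDs.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_of = dict(zip(w2v.wv.index_to_key,
                      kmeans.fit_predict(w2v.wv.vectors)))

def word2features(sent, i):
    word = sent[i]
    return {
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "w2v_cluster": str(cluster_of[word]),  # categorical, not continuous
    }

X = [[word2features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))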
If you want to know which is better for your use case, the only way to find out is to try both.

For typical NER tasks, a linear CRF is a popular method, while Word2Vec provides features that can be leveraged to improve a CRF system's performance.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating the output of Word2Vec into a CRF-based NER system, including dense embeddings, binarized embeddings, cluster embeddings, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.
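For illustration, here is one way to read the "binarized embedding" idea: each embedding dimension becomes a categorical feature depending on whether its value is well above or well below that dimension's mean. The thresholds (mean plus/minus one standard deviation) are my assumption, not necessarily the paper's exact recipe.

# Sketch of per-dimension binarization of word vectors into CRF-friendly
# categorical features. Random vectors stand in for word2vec output.
import numpy as np

def binarize_embeddings(vectors):
    mean = vectors.mean(axis=0)
    std = vectors.std(axis=0)
    upper, lower = mean + std, mean - std
    features = []
    for vec in vectors:
        feats = {}
        for d, value in enumerate(vec):
            if value > upper[d]:
                feats[f"emb_{d}"] = "+"
            elif value < lower[d]:
                feats[f"emb_{d}"] = "-"
            # values near the mean produce no feature for that dimension
        features.append(feats)
    return features

word_features = binarize_embeddings(np.random.randn(100, 50))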

Related

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings. I have looked into gensim and GloVe but am struggling to make it work.
How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How may this be achieved, and is there any helpful material on the same?
Such as this
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on a part of the Google News Dataset and generalizes well enough to most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model on your corpus.
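As a small sketch, continuing from the wv object loaded above, you can guard against OOV lookups with a simple membership test (the example word is arbitrary):

word = 'some_rare_token'
if word in wv:      # gensim KeyedVectors supports membership tests
    vec = wv[word]
else:
    print(f"'{word}' is out of vocabulary for this model")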

Order/context-aware document / sentence to vectors in Spacy

I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classification. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / RoBERTa / XLNet) to convert words / sentences to nlp objects. Directly calling the vector of the object, however, will default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this being discussed in the page below. Maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950
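To make that concrete, here is a minimal sketch that bypasses spaCy and uses the Hugging Face transformers library directly to get a context-aware sentence vector, namely the last-layer hidden state of the [CLS] token as in the quote above. The model name 'bert-base-uncased' is just an example.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["This is the first sentence.", "And here is another one."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, hidden_size); row i is the [CLS] vector of sentence i.
cls_vectors = outputs.last_hidden_state[:, 0, :]
# These vectors can then be fed to a downstream classifier such as XGBoost.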

train word2vec with pretrained vectors

I am training word vectors on a particular text corpus using fastText.
fastText provides all the necessary mechanics and options for training word vectors, and when visualized with t-SNE, the vectors look amazing. I notice gensim has a wrapper for fastText which is good for accessing the vectors.
For my task, I have many text corpora. I need to continue training the above vectors on a new corpus, and then again on newly discovered corpora. fastText does not provide this functionality. I do not see any package that achieves this, or maybe I am missing it. I see on the Google group that gensim provides intersect_word2vec_format, but I cannot understand it or find a usage tutorial for it. There is another similar open question with no answer.
So, apart from gensim, is there any other way to train the models as described above?
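For reference, here is a rough sketch of seeding a new gensim Word2Vec model with previously trained vectors before continuing training on a new corpus. The attribute names are gensim 4.x, "old_vectors.bin" is a placeholder path, and gensim's intersect_word2vec_format does roughly the same thing in one call (its exact behaviour differs across gensim versions, so check the docs for yours).

from gensim.models import Word2Vec, KeyedVectors

# Previously trained vectors (word2vec text/binary format, placeholder path).
old_wv = KeyedVectors.load_word2vec_format("old_vectors.bin", binary=True)

new_corpus = [["tokens", "from", "the", "new", "corpus"],
              ["more", "tokenized", "sentences"]]

model = Word2Vec(vector_size=old_wv.vector_size, min_count=1)
model.build_vocab(new_corpus)

# Manually seed overlapping words with the previously trained vectors.
for word in model.wv.index_to_key:
    if word in old_wv:
        model.wv.vectors[model.wv.key_to_index[word]] = old_wv[word]

model.train(new_corpus, total_examples=model.corpus_count, epochs=model.epochs)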

Does it make sense to talk about skip-gram and cbow when using The Glove method?

I'm trying different word embedding methods in order to pick the approach that works best for me. I tried word2vec and fastText. Now, I would like to try GloVe. In both word2vec and fastText, there are two versions: skip-gram (predict context from word) and CBOW (predict word from context). But in the GloVe Python package, there is no parameter that lets you choose whether you want to use skip-gram or CBOW.
Given that GloVe does not work the same way as word2vec, I'm wondering: does it make sense to talk about skip-gram and CBOW when using the GloVe method?
Thanks in advance
Not really: skip-gram and CBOW are simply the names of the two word2vec models. They are shallow neural networks that generate word embeddings by predicting a context from a word (and vice versa), and then treating the output of the hidden layer as the vector/representation. GloVe uses a different approach, making use of the global statistics of a corpus by training on a co-occurrence matrix, rather than local context windows.
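To make the contrast concrete, in gensim (used here purely as an example trainer) skip-gram vs. CBOW is just a flag on the Word2Vec constructor, while GloVe has no such switch because it factorizes a co-occurrence matrix instead of predicting words from context windows:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"]]
skipgram_model = Word2Vec(sentences, sg=1, vector_size=50, min_count=1)  # skip-gram
cbow_model = Word2Vec(sentences, sg=0, vector_size=50, min_count=1)      # CBOW (default)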

how to create word vector

How do I create word vectors? I used one-hot encoding to create word vectors, but they are very large and do not generalize across semantically similar words. I have heard about neural-network-based word vectors that capture word similarity. So I wanted to know how to generate such vectors (which algorithm to use), or good material to get started creating word vectors.
Word vectors, or so-called distributed representations, have a long history by now, starting perhaps from the work of Y. Bengio (Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS.), where word vectors were obtained as a by-product of training a neural-net language model.
A lot of research has demonstrated that these vectors do capture semantic relationships between words (see for example http://research.microsoft.com/pubs/206777/338_Paper.pdf). Also, this important paper (http://arxiv.org/abs/1103.0398) by Collobert et al. is a good starting point for understanding word vectors, the way they are obtained, and how they are used.
Besides word2vec there are a lot of methods to obtain them. Examples include the SENNA embeddings by Collobert et al. (http://ronan.collobert.com/senna/), the RNN embeddings by T. Mikolov that can be computed using RNNToolkit (http://www.fit.vutbr.cz/~imikolov/rnnlm/), and many more. For English, ready-made embeddings can be downloaded from these web sites. word2vec uses the skip-gram (or CBOW) model, a very shallow network rather than a deep neural network. Another fast tool for computing word representations is GloVe (http://www-nlp.stanford.edu/projects/glove/). It is an open question whether deep neural networks are essential for obtaining good embeddings or not.
Depending on your application, you may prefer different types of word vectors, so it's a good idea to try several popular algorithms and see what works best for you.
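As one concrete starting point, here is a minimal gensim sketch of training word vectors from scratch and querying them for similar words. The toy corpus is far too small to give meaningful neighbours; a real corpus should have millions of tokens.

from gensim.models import Word2Vec

corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"],
          ["the", "dog", "chases", "the", "cat"]]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

vector = model.wv["king"]             # the learned 100-dimensional vector
print(model.wv.most_similar("king"))  # nearest words by cosine similarity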
I think the thing you mean is word2vec (https://code.google.com/p/word2vec/). It trains N-dimensional word vectors based on a given corpus. In my understanding of word2vec, the neural network is just used to learn the dimensions of the vectors while capturing some relationships between words. What should be mentioned, though, is that this relatedness is not truly semantic; it just reflects the distributional structure of your training corpus.
If you want to capture semantic relatedness, have a look at WordNet-based measures, for instance as implemented in these libraries:
Java: https://code.google.com/p/ws4j/
Perl: http://wn-similarity.sourceforge.net/
To get started with word2vec you can use their pretrained vectors. You should find all information about this at https://code.google.com/p/word2vec/.
If you are looking for a Java implementation, this is a good starting point: http://deeplearning4j.org/word2vec.html
I hope this helps
Best wishes
