Is it OK to combine domain-specific word2vec embeddings and off-the-shelf ELMo embeddings for a downstream unsupervised task? - nlp

I am wondering if I am using word embeddings correctly.
I have combined contextualised word vectors with static word vectors because:
my domain corpus is too small to effectively train the model from scratch
my domain is too specialised to use general embeddings.
I used the off-the-shelf small ELMo model and trained a word2vec model on a small domain-specific corpus (around 500 academic papers). I then simply concatenated the vectors from the two different embeddings.
I loosely followed the approach in this paper:
https://www.aclweb.org/anthology/P19-2041.pdf
However, the approach in the paper trains the embeddings for a specific downstream task. In my domain there is no labeled training data, which is why I only trained the embeddings on the raw corpus.
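For concreteness, a minimal sketch of the concatenation step might look like the following (assuming gensim for word2vec and the older allennlp < 2.0 ElmoEmbedder interface; the toy corpus, hyperparameters, and layer-averaging choice are placeholders, not a recommended recipe):

```python
import numpy as np
from gensim.models import Word2Vec
from allennlp.commands.elmo import ElmoEmbedder  # allennlp < 2.0 interface

# Placeholder: in practice this would be the ~500 tokenised papers.
corpus = [["a", "tokenised", "domain", "sentence"],
          ["another", "tokenised", "domain", "sentence"]]

# Static, domain-specific vectors trained on the small corpus.
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=20)

# Off-the-shelf contextual vectors (point options_file/weight_file at the
# small ELMo model; the default model here is just for illustration).
elmo = ElmoEmbedder()

def combined_vectors(tokens):
    layers = elmo.embed_sentence(tokens)       # shape (3, n_tokens, 1024)
    contextual = layers.mean(axis=0)           # average the three ELMo layers
    static = np.stack([w2v.wv[t] if t in w2v.wv
                       else np.zeros(w2v.vector_size) for t in tokens])
    return np.concatenate([contextual, static], axis=1)  # (n_tokens, 1124)
```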
I am new to NLP, so apologies if I am asking a stupid question.

Related

Training SVM classifier (word embeddings vs. sentence embeddings)

I want to experiment with different embeddings such as Word2Vec, ELMo, and BERT, but I'm a little confused about whether to use word embeddings or sentence embeddings, and why. I'm using the embeddings as input features to an SVM classifier.
Thank you.
Though both approaches can work well depending on the dataset, as a rule of thumb I would advise you to use word embeddings when your input is only a few words long, and sentence embeddings when your input is longer (e.g. large paragraphs).
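To make the two options concrete, here is a minimal sketch of both kinds of features feeding the same SVM; the libraries (gensim, sentence-transformers, scikit-learn), the model names, and the toy data are assumptions, not requirements:

```python
import numpy as np
import gensim.downloader as api
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

texts = ["great movie", "terrible plot", "loved it", "really boring"]
labels = [1, 0, 1, 0]

# Option 1: average pre-trained word vectors over the tokens of each text.
w2v = api.load("glove-wiki-gigaword-100")
def avg_vec(text):
    toks = [t for t in text.lower().split() if t in w2v]
    return np.mean([w2v[t] for t in toks], axis=0) if toks else np.zeros(100)
X_word = np.stack([avg_vec(t) for t in texts])

# Option 2: a dedicated sentence encoder (usually better for longer inputs).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_sent = encoder.encode(texts)

for X in (X_word, X_sent):
    clf = SVC(kernel="rbf").fit(X, labels)
    print(clf.score(X, labels))
```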

Unsupervised finetuning of BERT for embeddings only?

I would like to fine-tune BERT on unlabeled data for a specific domain and use the output layer to check the similarity between texts. How can I do this? Do I need to first fine-tune on a classification task (or question answering, etc.) and then extract the embeddings? Or can I just take a pre-trained BERT model, without any task head, and fine-tune it on my own data?
There is no need to fine-tune for classification, especially if you do not have any supervised classification dataset.
You should continue training BERT in the same self-supervised way it was originally trained, i.e., continue "pre-training" with the masked-language-model objective and next-sentence prediction. Hugging Face's implementation provides the BertForPreTraining class for this.
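As a minimal sketch of such continued pre-training with Hugging Face transformers (using only the masked-language-model objective via AutoModelForMaskedLM for simplicity; BertForPreTraining additionally needs sentence-pair inputs for next-sentence prediction, and the corpus path and hyperparameters below are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One raw-text file with one sentence/paragraph per line (path is hypothetical).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking, the same objective BERT was pre-trained with (minus NSP).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-domain-adapted",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```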

Should the vocabulary be restricted to the training-set vocabulary when training an NN model with pre-trained word vectors like GloVe?

I want to use pre-trained GloVe vectors for the embedding layer in my neural network. Do I need to restrict the vocabulary to the training set when constructing the word2index dictionary?
Wouldn't that lead to a limited, non-generalizable model?
Is it recommended practice to use the entire GloVe vocabulary?
Yes, it is better to restrict your vocabulary size. Pre-trained embeddings (GloVe, and likewise Word2Vec) contain many words that are not useful for your task, and the bigger the vocabulary, the more RAM you need, among other problems.
Select your tokens from all of your data. This won't lead to a limited, non-generalizable model if your data is big enough. If you think your data does not contain as many tokens as are needed, then you should keep two things in mind:
Your data is not good enough and you have to gather more.
Your model can't generalize well to tokens it hasn't seen during training, so there is no point in keeping many unused words in your embedding matrix; it is better to gather more data that covers those words.
I have an answer here that shows how you can select a small subset of word vectors from a pre-trained model.
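As a minimal sketch of restricting the embedding matrix to the training-set vocabulary (Keras is assumed here, and the GloVe file name and toy corpus are placeholders):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

train_texts = ["a toy training sentence", "another toy sentence"]  # placeholder

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)          # vocabulary = training set only
word_index = tokenizer.word_index            # word -> integer index (1-based)

embedding_dim = 100
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))

with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        if word in word_index:               # keep only training-set words
            embedding_matrix[word_index[word]] = vector

embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)  # freeze the pre-trained vectors
```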

Best tool for text representation to deep learning

So I want to ask: which is the best tool to prepare my text for deep learning?
What is the difference between Word2Vec, GloVe, Keras, LSA...?
You should use pre-trained embeddings to represent a sentence as a vector or a matrix. There are many sources of pre-trained embeddings, trained on different datasets (for instance, all of Wikipedia). The models can have different dimensionalities, but typically each word is represented with 100 or 300 dimensions.
Pre-trained embeddings
Pre-trained embeddings 2
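For example, a minimal sketch of turning a sentence into a matrix of pre-trained vectors (the gensim downloader model name is just one readily available choice):

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe

sentence = "deep learning needs numeric input"
tokens = sentence.lower().split()

# Each row is the vector of one token; unknown words are skipped here.
matrix = np.stack([vectors[t] for t in tokens if t in vectors])
print(matrix.shape)   # e.g. (5, 100)
```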

Linear CRF Versus Word2Vec for NER

I have done a lot of reading on linear CRFs and Word2Vec and wanted to know which one is best for Named Entity Recognition. I trained my model using Stanford NER (which is a linear CRF implementation) and got a precision of 85%. I know that Word2vec groups similar words together, but is it a good model for NER?
CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
If you want to know which is better for your use case, the only way to find out is to try both.
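As one concrete illustration of getting word2vec information into such a CRF, here is a minimal sketch that discretizes the word vectors with k-means and uses the cluster ID as a categorical feature; the libraries (gensim, scikit-learn, sklearn-crfsuite) and the toy data are assumptions:

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import sklearn_crfsuite

# Toy labelled sentences: (token, NER tag) pairs.
train_sents = [[("John", "B-PER"), ("lives", "O"), ("in", "O"), ("Paris", "B-LOC")],
               [("Mary", "B-PER"), ("visited", "O"), ("London", "B-LOC")]]

# Train (or load) word vectors on the raw token sequences.
w2v = Word2Vec([[tok for tok, _ in s] for s in train_sents],
               vector_size=50, min_count=1, epochs=50)

# Discretize the embedding space into clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(w2v.wv.vectors)
cluster_of = {w: int(c) for w, c in zip(w2v.wv.index_to_key, kmeans.labels_)}

def token_features(sent, i):
    word = sent[i][0]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # categorical feature derived from the word vector
        "w2v.cluster": str(cluster_of.get(word, -1)),
    }

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
y = [[tag for _, tag in s] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```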
For typical NER tasks, a linear CRF is a popular method, while Word2Vec provides features that can be leveraged to improve a CRF system's performance.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating Word2Vec output into a CRF-based NER system, including dense embeddings, binarized embeddings, cluster embeddings, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.

Resources