train word2vec with pretrained vectors - nlp

I am training word vectors on particular text corpus using fast text.
Fasttext provides all the necessary mechanics and options for training word vectors and when looked with tsne, the vectors are amazing. I notice gensim has a wrapper for fasttext which is good for accessing vectors.
for my task, I have many text corpuses. I need to use the above trained vectors again with new corpus and use the trained vectors again on new discovered corpuses. fasttext doesnot provide this function. I donot see any package that achieves this or may be I am lost. I see in google forum gensim provides intersect_word2vec_format, but cannot understand or find usage tutorial for this. There is another question open similar to this with no answer.
So apart from gensim, is there any other way to train the models like above.

Related

Order/context-aware document / sentence to vectors in Spacy

I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classfication. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / Roberta / XLNet) to convert words / sentences to nlp objects. Directly calling the vector of the object will however will default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this being discussed in the page below. Maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950

Is Google word2vec pertrained model CBOW or skipgram

Is Google's pretrained word2vec model CBO or skipgram.
We load pretrained model by:
from gensim.models.keyedvectors as word2vec
model= word2vec.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz')
How can we specifically load pretrained CBOW or skipgram model ?
The GoogleNews word-vectors were trained by Google, using a proprietary corpus, but they're never explicitly described all the training-parameters used. (It's not encoded in the file.)
It's been asked a number of times on the Google Group devoted to the word2vec-toolkit code, without a definitive answer. For example, there's a response from word2vec author Mikolov that he doesn't remember the training parameters. Elsewhere, another poster thinks one of the word2vec papers implies skip-gram was used – but as that passage doesn't precisely match other aspects (like vocabulary-size) of the released GoogleNews vectors, I wouldn't be completely confident of that.
As Google hasn't been clear, and in any case hasn't released alternate versions based on different training modes, if you want to run any tests or make any conclusions about the different modes, you'll have to use other vector-sets, or train your own vectors in varying ways.
Late to the party, but Mikolov describes the hyperparameters here. The Google News pretrained vectors were trained using CBOW. I believe that's the only option for you to load; there is no pretrained skip-gram version available.

Linear CRF Versus Word2Vec for NER

I have done lots of reading around Linear CRF and Word2Vec and wanted to know which one is the best to do Named Entity Recognition. I trained my model using Stanford NER(Which is a Linear CRF Implementation) and got a precision of 85%. I know that Word2vec groups similar words together but is it a good model to do NER?
CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
If you want to know which is better for your use case, the only way to find out is to try both.
For typical NER tasks, Linear CRF is a popular method, while Word2Vec is a feature that can be leveraged to improve the CRF systems performence.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating output of Word2Vec in a CRF-based NER system, including dense embedding, binerized embedding, cluster embedding, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.

Does it make sense to talk about skip-gram and cbow when using The Glove method?

I'm trying different word embeddings methods, in order to pick the approache that works the best for me. I tried word2vec and FastText. Now, I would like to try Glove. In both word2vec and FastText, there is two versions: Skip-gram (predict context from word) and CBOW (predict word from context). But in Glove python package, there is no parameter that enables you to choose whether you want to use skipg-gram or Cbow.
Given that Glove does not work the same way as w2v, I'm wondering: Does it make sense to talk about skip-gram and cbow when using The Glove method ?
Thanks in Advance
Not really, skip-gram and CBOW are simply the names of the two Word2vec models. They are shallow neural networks that generate word embeddings by predicting a context from a word and vice versa, and then treating the output of the hidden layer as the vector/representation. GloVe uses a different approach, making use of the global statistics of a corpus by training on a co-occurrence matrix, rather than local context windows.

Biasing word2vec towards special corpus

I am new to stackoverflow. Please forgive my bad English.
I am using word2vec for a school project. I want to work with a domain specific corpus (like Physics Textbook) for creating the word vectors using Word2Vec. This standalone does not provide good results due to lesser size of the corpus. This especially hurts as we want to evaluate on words that may very well be outside the vocabulary of the text book.
We want the textbook to encode the domain specific relationships and semantic "nearness". "Quantum" and "Heisenberg" are especially close in this textbook for eg. which may not hold true for background corpus. To handle the generic words (like "any") we need the basic background model(like the one provided by Google on word2vec site).
Is there any way that we can supplant to the background model using our newer corpus. Just training on the corpus etc. doesnot work well.
Are there any attempts to combine vector representations from two corpus- general and specific. I could not find any in my searches.
Let's talk about gensim since you tagged you question with it. You can load a previously trained model in python using gensim. Then you continue training it. Would it be useful?
# load from previous gensim file:
model = gensim.models.Word2Vec.load(fname)
# or from word2vec c format:
# model = gensim.models.Word2Vec.load_word2vec_format('/path/vectors.bin', binary=True)
# continue training:
model.train(other_sentences)
model.save(fname)

Resources