Is Google's word2vec pretrained model CBOW or skip-gram? - python-3.x

Is Google's pretrained word2vec model CBOW or skip-gram?
We load the pretrained model with:
import gensim.models.keyedvectors as word2vec
model = word2vec.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
How can we specifically load a pretrained CBOW or skip-gram model?

The GoogleNews word-vectors were trained by Google on a proprietary corpus, but Google never explicitly described all the training parameters used. (The training mode isn't encoded in the file.)
It's been asked a number of times on the Google Group devoted to the word2vec-toolkit code, without a definitive answer. For example, there's a response from word2vec author Mikolov that he doesn't remember the training parameters. Elsewhere, another poster thinks one of the word2vec papers implies skip-gram was used – but as that passage doesn't precisely match other aspects (like vocabulary-size) of the released GoogleNews vectors, I wouldn't be completely confident of that.
As Google hasn't been clear, and in any case hasn't released alternate versions based on different training modes, if you want to run any tests or make any conclusions about the different modes, you'll have to use other vector-sets, or train your own vectors in varying ways.
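If you train your own vectors with gensim, the mode is controlled by a single parameter; a minimal sketch (the toy corpus below is just a placeholder for your own tokenized data):
from gensim.models import Word2Vec
# Toy corpus standing in for your own tokenized sentences.
sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]
# sg=0 selects CBOW (the default); sg=1 selects skip-gram.
cbow_model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)
print(cbow_model.wv["fox"].shape)  # (300,)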

Late to the party, but Mikolov describes the hyperparameters here. The Google News pretrained vectors were trained using CBOW. I believe that's the only option for you to load; there is no pretrained skip-gram version available.

Related

Can I fine-tune BERT using only masked language model and next sentence prediction?

So if I understand correctly, there are mainly two ways to adapt BERT to a specific task: fine-tuning (all weights are changed, even pretrained ones) and feature-based (pretrained weights are frozen). However, I am confused.
When to use which one? If you have unlabeled data (unsupervised learning), should you then use fine-tuning?
If I want to fine-tune BERT, isn't the only option to do that using masked language modeling and next sentence prediction? And also: is it necessary to put another layer of neural network on top?
Thank you.
Your first approach should be to try the pre-trained weights; generally this works well. However, if you are working in a different domain (e.g. medicine), then you'll need to fine-tune on data from the new domain. Again, you might be able to find pre-trained models for such domains (e.g. BioBERT).
For adding a layer, there are slightly different approaches depending on your task. E.g. for question answering, have a look at the TANDA paper (Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection). It is a very nice, easily readable paper which explains the transfer and adaptation strategy. Again, Hugging Face has modified and pre-trained models for most of the standard tasks.
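As a rough illustration of the "add a layer" step, the Hugging Face transformers library attaches a freshly initialized task head on top of the pre-trained encoder for you; a minimal sketch for sentence classification (the checkpoint name and label count are placeholders, not a recommendation):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Loads the pre-trained encoder and adds a new, randomly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
inputs = tokenizer("A sample sentence to classify.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])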

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1,000 or more labeled samples.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
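A minimal sketch of that continued MLM pretraining with the Hugging Face transformers and datasets libraries (the file name and hyperparameters are placeholders, not recommendations):
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments)
checkpoint = "bert-base-uncased"  # placeholder starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
# "domain.txt" is a placeholder: raw domain text, one passage per line.
raw = load_dataset("text", data_files={"train": "domain.txt"})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), batched=True, remove_columns=["text"])
# Randomly [MASK]s 15% of tokens: the same MLM objective described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-domain-pretrained", num_train_epochs=1, per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()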
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation" and "language-modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, they provide fine-tuned models which are already capable of understanding semantic similarity between sentences. The "Continue Training on Other Data" section is what you want in order to further fine-tune the model on your domain. You do have to prepare a training dataset according to one of the available loss functions, e.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
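A minimal sketch of that kind of pair-based fine-tuning with the sentence-transformers package (the checkpoint name and example pairs are placeholders; check the library's docs for the exact loss you pick):
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder pre-trained checkpoint
# ContrastiveLoss expects text pairs with a 1 (similar) / 0 (dissimilar) label.
train_examples = [
    InputExample(texts=["A dog runs in the park.", "A dog is running outside."], label=1),
    InputExample(texts=["A dog runs in the park.", "The stock market fell today."], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)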
I believe transfer learning is useful for adapting the model to a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
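The same freeze-then-train idea in a hedged PyTorch/transformers sketch (the checkpoint and label count are placeholders; the TensorFlow guide above shows the Keras equivalent):
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Freeze the pre-trained encoder; only the new classification head stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False
print([name for name, p in model.named_parameters() if p.requires_grad])
# -> only the classifier head's weight and bias remain trainable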
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

Can you train a BERT model from scratch with task specific architecture?

BERT pre-training of the base model is done with a language-modeling approach, where we mask a certain percentage of tokens in a sentence and make the model predict those masked tokens. Then, to do downstream tasks, we add a newly initialized layer and fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task-specific layer + the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and Book corpus, which sums to 3,300M words. BERT-base has 109M model parameters. So, if you think you have large enough data to train BERT, then the answer to your question is yes.
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why you prefer to train BERT from scratch instead of fine-tuning it. Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
I can help with this.
First of all, MLM and NSP (which are the original pre-training objectives from NAACL 2019) are meant to train language encoders with prior language knowledge, like a primary-school student who has read many books in the general domain. Before BERT, many neural networks were trained from scratch, from a clean slate where the model doesn't know anything. This is like a newborn baby.
So my question is, "is it a good idea to start teaching a newborn baby when you can begin with a primary-school student?" My answer is no. This is supported by the numerous state-of-the-art results achieved by pre-trained models, compared to the old method of training a neural network from scratch.
As someone who works in the field, I can assure you that it is a much better idea to fine-tune a pre-trained model. It doesn't matter if you have a 200k dataset or 1 million data points. In fact, more fine-tuning data will only make the downstream results better if you use the right hyperparameters.
Though I recommend a learning rate between 2e-6 and 5e-5 for sentence classification tasks, you can explore. If your dataset is very, very domain-specific, you can fine-tune with a higher learning rate, which will deviate the model further from its "pre-trained" knowledge.
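For instance, with the Hugging Face Trainer that range simply goes into TrainingArguments; a minimal sketch (values other than the learning rate are illustrative placeholders, not tuned recommendations):
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="bert-finetuned",        # placeholder output path
    learning_rate=2e-5,                 # picked from the 2e-6 to 5e-5 range above
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)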
And also, regarding your question on
can we initialize the BERT base architecture from scratch, train both the additional downstream task-specific layer + the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
I'm negative about this idea. Even though you have a dataset with 200k instances, BERT is pre-trained on 3,300M words. BERT is too big and data-hungry to be trained from scratch with only 200k instances (both size-wise and architecture-wise). If you want to train a neural network from scratch, I'd recommend you look into LSTMs or RNNs.
I'm not saying I recommend LSTMs. Just fine-tune BERT. 200k is not even that big anyway.
All the best luck with your NLP studies :)

train word2vec with pretrained vectors

I am training word vectors on a particular text corpus using fastText.
fastText provides all the necessary mechanics and options for training word vectors, and when visualized with t-SNE the vectors look great. I notice gensim has a wrapper for fastText which is good for accessing the vectors.
For my task, I have many text corpora. I need to continue training the above vectors on each newly discovered corpus. fastText does not provide this functionality. I do not see any package that achieves this, or maybe I am lost. I see on the Google forum that gensim provides intersect_word2vec_format, but I cannot understand it or find a usage tutorial for it. There is another similar open question with no answer.
So apart from gensim, is there any other way to train the models as described above?
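For the gensim route mentioned in the question, here is a hedged sketch of intersect_word2vec_format (file names and parameters are placeholders, and the method's exact location and signature have shifted between gensim releases, so check your version's docs). Note that it seeds a new model's vocabulary with previously trained vectors rather than truly continuing fastText training:
from gensim.models import Word2Vec
# Tokenized sentences from the newly discovered corpus (placeholder data).
new_sentences = [["new", "domain", "sentence"], ["another", "domain", "sentence"]]
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(new_sentences)
# Merge previously trained vectors (word2vec text format; placeholder file) into the
# new vocabulary; by default the imported vectors are locked during further training.
model.intersect_word2vec_format("old_vectors.txt", binary=False)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)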

Linear CRF Versus Word2Vec for NER

I have done lots of reading around Linear CRF and Word2Vec and wanted to know which one is the best for doing Named Entity Recognition. I trained my model using Stanford NER (which is a Linear CRF implementation) and got a precision of 85%. I know that Word2Vec groups similar words together, but is it a good model to do NER?
CRFs and word2vec are apples and oranges, so comparing them doesn't really make sense.
CRFs are used for sequence labelling problems like NER. Given a sequence of items, represented as features and paired with labels, they'll learn a model to predict labels for new sequences.
Word2vec's word embeddings are representations of words as vectors of floating point numbers. They don't predict anything by themselves. You can even use the word vectors to build features in a CRF, though it's more typical to use them with a neural model like an LSTM.
Some people have used word vectors with CRFs with success. For some discussion of using word vectors in a CRF see here and here.
Do note that with many standard CRF implementations features are expected to be binary or categorical, not continuous, so you typically can't just shove word vectors in as you would another feature.
If you want to know which is better for your use case, the only way to find out is to try both.
For typical NER tasks, a linear CRF is a popular method, while Word2Vec is a feature that can be leveraged to improve a CRF system's performance.
In this 2014 paper (GitHub), the authors compared multiple ways of incorporating the output of Word2Vec into a CRF-based NER system, including dense embeddings, binarized embeddings, cluster embeddings, and a novel prototype method.
I implemented the prototype idea in my domain-specific NER project and it works pretty well for me.
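One simple way to turn dense vectors into the categorical features a CRF expects (in the spirit of the cluster-embedding variant above) is to cluster the embedding space and use each word's cluster ID as a feature; a minimal sketch, assuming you already have saved gensim KeyedVectors ("vectors.kv" and the cluster count are placeholders):
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans
wv = KeyedVectors.load("vectors.kv")  # placeholder path to previously trained vectors
# Cluster the embedding space; 500 clusters is an arbitrary illustrative choice.
kmeans = KMeans(n_clusters=500, random_state=0).fit(wv.vectors)
word_to_cluster = {word: int(label) for word, label in zip(wv.index_to_key, kmeans.labels_)}
def token_features(token):
    # Categorical features a CRF library such as sklearn-crfsuite can consume.
    return {"word.lower": token.lower(), "w2v_cluster": str(word_to_cluster.get(token.lower(), "UNK"))}
print(token_features("London"))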
