How to get word2vec from Google's pre-trained model

I want to fetch vector representation of words.
I tried to use the Gensim API but got the same error as here (for Python 3.6):
ValueError when downloading gensim data set
What is the best way to get the vector out of the pre-trained model?

You can download the compressed vectors directly from the Google link on this page:
https://code.google.com/archive/p/word2vec/
(Search for GoogleNews-vectors to find the link, about two-thirds of the way down the page.)
Take note of the local file path where you downloaded the file.
Then load the set of vectors as a Gensim KeyedVectors model:
from gensim.models import KeyedVectors
goog_model = KeyedVectors.load_word2vec_format('/WHERE/YOU/DOWNLOADED/GoogleNews-vectors-negative300.bin.gz', binary=True)
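Once it loads (the full file needs several GB of RAM; load_word2vec_format also accepts a limit parameter, e.g. limit=500000, to read only the most-frequent words), you can query individual vectors and similarities. A quick illustrative check, with an arbitrary example word:

vec = goog_model['apple']  # a 300-dimensional numpy array
print(vec.shape)  # (300,)
print(goog_model.most_similar('apple', topn=3))

Note the GoogleNews vectors are case-sensitive, so 'Apple' and 'apple' are distinct keys.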

Related

How to load pre-trained FastText Word Embeddings using Gensim?

I downloaded word embeddings from this link. I want to load them in Gensim to do some work but I am not able to. I have found many resources and none of them works. I am using Gensim version 4.1.
I have tried
gensim.models.fasttext.load_facebook_model('/home/admin1/embeddings/crawl-300d-2M.vec')
gensim.models.fasttext.load_facebook_vectors('/home/admin1/embeddings/crawl-300d-2M.vec')
and it is showing me
NotImplementedError: Supervised fastText models are not supported
I also tried to load it using FastText.load('/home/admin1/embeddings/crawl-300d-2M.vec'), but then it showed UnpicklingError: could not find MARK.
Per the NotImplementedError, those are the one kind of full Facebook FastText model (supervised mode) that Gensim does not support.
So sadly, the answer to "How do you load these?" is "you don't".
The .vec files contain just the full-word vectors in a plain-text format – no subword info for synthesizing OOV vectors, nor the supervised-classification output features. They can be loaded into a KeyedVectors model:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('crawl-300d-2M.vec')
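Because it's a plain KeyedVectors set, lookups only work for words present in the file; there is no subword fallback. A minimal illustrative check (the made-up token is just to show the behavior):

print(kv_model['the'].shape)  # (300,) for this 300-d file
print('notarealwordxyz' in kv_model)  # False; kv_model['notarealwordxyz'] would raise KeyError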

Extracting fixed vectors from BioBERT without using a terminal command?

If we want to use weights from the pretrained BioBERT model, we can execute the following command after downloading all the required BioBERT files.
import os

os.system('python3 extract_features.py \
    --input_file=trial.txt \
    --vocab_file=vocab.txt \
    --bert_config_file=bert_config.json \
    --init_checkpoint=biobert_model.ckpt \
    --output_file=output.json')
The above command reads an individual file containing the text, extracts the vectors for its content, and writes them to another file. The problem with this is that it cannot easily be scaled to very large datasets containing thousands of sentences/paragraphs.
Is there a way to extract these features on the fly (using an embedding layer), like it can be done for word2vec vectors in PyTorch or TF1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way it could be done with TF2.0 unless someone generates TF2.0-compatible checkpoint files.
I will be grateful for any hint or help.
You can get the contextual embeddings on the fly, but the total time spent on getting the embeddings will always be the same. There are two options: 1. import BioBERT into the Transformers package and use it in PyTorch (which is what I would do), or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch, but it also works with TensorFlow. It does not have BioBERT out of the box, so you need to convert it from TensorFlow format yourself. There is a convert_tf_checkpoint_to_pytorch.py script that does that. People have had some issues with this script and BioBERT (this seems to be resolved).
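For reference, the conversion is typically invoked roughly like this; the flag names follow the script's argparse interface, and the file names are the ones from your BioBERT download (treat this as a sketch and check the script's --help for your transformers version):

python convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path biobert_model.ckpt \
    --bert_config_file bert_config.json \
    --pytorch_dump_path pytorch_model.bin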
After you convert the model, you can load it like this.
import torch
from transformers import BertTokenizer, BertModel

# Load tokenizer and model from the converted checkpoint directory
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')

# Call the model in a standard PyTorch way; the input must be a tensor of token ids
input_ids = torch.tensor([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
with torch.no_grad():
    embeddings = model(input_ids)[0]  # last hidden states, shape (1, seq_len, hidden_size)
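To address the scaling concern from the question, you can push sentences through the tokenizer and model in batches; a minimal sketch, assuming a reasonably recent transformers version (the batch size and the choice of the [CLS] vector are illustrative):

sentences = ["First biomedical sentence.", "Second biomedical sentence."]  # thousands in practice
all_embeddings = []
for i in range(0, len(sentences), 32):
    batch = tokenizer(sentences[i:i + 32], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**batch)[0]  # (batch, seq_len, hidden_size)
    all_embeddings.append(hidden[:, 0, :])  # keep one vector per sentence, here the [CLS] position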
2. Use the BioBERT codebase directly
You can get the embeddings on the fly basically by reusing the code in extract_features.py. On lines 346-382, they initialize the model; you get the embeddings by calling estimator.predict(...).
For that, you need to format the input: first format the string (using the code on lines 326-337) and then call convert_examples_to_features on it.

Word2Vec word not found with Gensim but shows up on TensorFlow embedding projector?

I've recently started experimenting with pre-trained word embeddings to enhance the performance of my LSTM model on an NLP task. In this case, I looked into Google's Word2Vec. Based on online tutorials, I first downloaded Word2Vec with wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz and used Python's gensim package to query the embeddings, using the following code.
from gensim.models import KeyedVectors

if __name__ == "__main__":
    model = KeyedVectors.load_word2vec_format("./data/word2vec/GoogleNews-vectors-negative300.bin", binary=True)
    print(model["bosnia"])
However, after noticing that many common words weren't found in the model, I started to wonder if something was awry. I tried searching for bosnia in the embedding repo, as shown above, but it wasn't found. So, I went on the TensorFlow embedding projector, loaded the Word2Vec model, and searched for bosnia - it was there.
So, my question is: why is this happening? Was the version of Word2Vec I downloaded not complete? Or is gensim unable to load all words into memory and therefore omitting some?
You should check the length of the downloaded file(s), to ensure it's as expected (in case it was truncated or incompletely downloaded).
You should double-check that you're using the same file in both places, and also checking the exact same token (eg 'bosnia' vs 'Bosnia') via both paths. (None of the 5 options in the https://projector.tensorflow.org/ drop-down correspond to the GoogleNews 300-d, 3-million-token dataset, and the load button doesn't appear to support word2vec .bin files, so I'm not sure how that could be used to cross-check what's in that file.)
(There aren't any known bugs in gensim's load_word2vec_format() that would explain it missing vectors that are actually present.)
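A quick way to run that casing check on the gensim side; KeyedVectors supports membership tests directly:

for token in ('bosnia', 'Bosnia'):
    print(token, token in model)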

Fine tune spaCy's word embeddings

spaCy has great parsing capabilities and its API is very intuitive for the most part. Is there any way, from the spaCy API, to fine-tune its word embedding models? In particular, I would like to keep spaCy's tokens and give them a vector when possible.
The only thing I've come across so far is to train the embeddings using gensim (but then I wouldn't know how to load the embeddings from spaCy into gensim) and then load them back into spaCy, as in: https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/. This doesn't help with the first part: training on spaCy tokens.
Any help appreciated.
From the spacy documentation:
If you need to train a word2vec model, we recommend the implementation
in the Python library Gensim.
Besides gensim you can also use other implementations like FastText. The easiest way to use custom vectors from spaCy is to create a model using the init-model command-line utility, like this:
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en model --vectors-loc cc.la.300.vec.gz
Then simply load your model as usual: nlp = spacy.load('model'). There is detailed documentation on the spaCy website.
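After loading, spaCy assigns each token its vector from the imported table; a small illustrative check (the example text is arbitrary Latin, matching the cc.la vectors above):

import spacy

nlp = spacy.load('model')
doc = nlp("lorem ipsum dolor")
for token in doc:
    print(token.text, token.has_vector, token.vector[:3])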

How to train a pretrained binary file on my own corpus using gensim?

Hey guys, I have a pretrained binary file and I want to continue training it on my own corpus.
The approach I tried:
I extracted a txt file from the bin file I had, used it as a word2vec file at load time, trained it further on my own corpus, and saved the model; but the model performs badly on the words that are in the pre-trained bin file. (I used the intersect_word2vec_format method for this.)
Here is the script I used.
What should be my approach for my model to perform well on words from both the pre-trained file and my corpus?
Load your model and use build_vocab with update=True.
from gensim.models import Word2Vec

model = Word2Vec.load('w2vmodel.bin')
my_corpus = ... # load your corpus as sentences here
model.build_vocab(my_corpus, update=True)
model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)
It's not really clear to me when intersect_word2vec_format is helpful, but you can read more about the intended use case here. It does seem that it's not meant for ordinary re-training of vectors, though.
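For reference, in gensim 3.x it is called on a full Word2Vec model roughly like this; lockf=1.0 lets the merged vectors continue to be trained (a sketch only; the method's signature and availability have changed across gensim releases, so check your version):

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)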
