How to store Word vector Embeddings? - python-3.x

I am using BERT word embeddings for a sentence classification task with 3 labels, and I am using Google Colab for coding. My problem is that I have to execute the embedding part every time I restart the kernel. Is there any way to save these word embeddings once they are generated? It takes a lot of time to generate them.
The code I am using to generate BERT Word Embeddings is -
features = [get_features(text) for text in text_list]
Here, get_features is a function which returns the word embeddings for each entry in my list text_list.
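For reference, a get_features-style function typically looks something like this sketch using the Hugging Face transformers library (this is an illustration, not the question's actual code; the model name and per-token output are assumptions):
import torch
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')
def get_features(text):
    # Tokenize one sentence and return one embedding vector per token
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[0]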
I read that converting the embeddings into numpy arrays and then using np.save can do it, but I actually don't know how to code it.

You can save your embeddings data to a numpy file by following these steps:
import numpy as np
all_embeddings = here_is_your_function_return_all_data()
all_embeddings = np.array(all_embeddings)
np.save('embeddings.npy', all_embeddings)
If you're working in Google Colab, you can then download the file to your local computer. Whenever you need it, just upload it and load it:
all_embeddings = np.load('embeddings.npy')
That's it.
By the way, you can also save the file directly to Google Drive.
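For example, a minimal sketch of the Drive route (the mount point is standard in Colab; the target path is just an example):
from google.colab import drive
drive.mount('/content/drive')  # asks for authorization on first run
np.save('/content/drive/MyDrive/embeddings.npy', all_embeddings)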

Related

How to load pre-trained FastText Word Embeddings using Gensim?

I downloaded word embeddings from this link. I want to load them in Gensim to do some work, but I am not able to. I have found many resources, and none of them works. I am using Gensim version 4.1.
I have tried
gensim.models.fasttext.load_facebook_model('/home/admin1/embeddings/crawl-300d-2M.vec')
gensim.models.fasttext.load_facebook_vectors('/home/admin1/embeddings/crawl-300d-2M.vec')
and it is showing me
NotImplementedError: Supervised fastText models are not supported
I also tried to load it using FastText.load('/home/admin1/embeddings/crawl-300d-2M.vec'), but then it showed UnpicklingError: could not find MARK.
Per the NotImplementedError, those are the one kind of full Facebook FastText model (those trained in -supervised mode) that Gensim does not support.
So sadly, the answer to "How do you load these?" is "you don't".
The .vec files contain just the full-word vectors in a plain-text format – no subword info for synthesizing OOV vectors, or supervised-classification output features. Those can be loaded into a KeyedVectors model:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('crawl-300d-2M.vec')
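Once loaded, the KeyedVectors object supports the usual queries; for example (the query word here is just an illustration):
print(kv_model['king'].shape)         # the 300-d vector for a single word
print(kv_model.most_similar('king'))  # nearest neighbours by cosine similarity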

How to predict a masked word in a given sentence

FitBERT is a useful package, but I have a small doubt about BERT development for masked-word prediction: I trained a BERT model on a custom corpus using Google's scripts (create_pretraining_data.py, run_pretraining.py, extract_features.py, etc.), and as a result I got a vocab file, a .tfrecord file, a .json file and checkpoint files.
Now, how do I use those files with your package to predict a masked word in a given sentence?
From the tensorflow documentation:
A TFRecord file stores your data as a sequence of binary strings. This means you need to specify the structure of your data before you write it to the file. Tensorflow provides two components for this purpose: tf.train.Example and tf.train.SequenceExample. You have to store each sample of your data in one of these structures, then serialize it and use a tf.python_io.TFRecordWriter to write it to disk.
This document, along with the tensorflow documentation, explains quite well how to use those file types.
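As a minimal sketch of that serialization step (using the current tf.io.TFRecordWriter name rather than the older tf.python_io one from the quote; the feature names here are made up):
import tensorflow as tf
# One sample stored as a tf.train.Example, then serialized to disk
example = tf.train.Example(features=tf.train.Features(feature={
    'tokens': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'the', b'dog'])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))
with tf.io.TFRecordWriter('sample.tfrecord') as writer:
    writer.write(example.SerializeToString())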
To use FitBERT directly through the library, on the other hand, you can follow the examples on the project's GitHub page.
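As a rough illustration, based on FitBERT's README (the exact API may differ between versions, and the sentence and candidate words are made up):
from fitbert import FitBert
fb = FitBert()  # downloads a default pre-trained BERT if none is supplied
masked_string = "The dog ***mask*** over the fence."
print(fb.rank(masked_string, options=['jumped', 'slept', 'flew']))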

Word2Vec word not found with Gensim but shows up on TensorFlow embedding projector?

I've recently started experimenting with pre-trained word embeddings to enhance the performance of my LSTM model on an NLP task. In this case, I looked into Google's Word2Vec. Based on online tutorials, I first downloaded Word2Vec with wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz and used Python's gensim package to query the embeddings, using the following code.
from gensim.models import KeyedVectors
if __name__ == "__main__":
    model = KeyedVectors.load_word2vec_format("./data/word2vec/GoogleNews-vectors-negative300.bin", binary=True)
    print(model["bosnia"])
However, after noticing that many common words weren't found in the model, I started to wonder if something was awry. I tried searching for bosnia in the embedding repo, as shown above, but it wasn't found. So, I went on the TensorFlow embedding projector, loaded the Word2Vec model, and searched for bosnia - it was there.
So, my question is: why is this happening? Was the version of Word2Vec I downloaded not complete? Or is gensim unable to load all words into memory and therefore omitting some?
You should check the length of the downloaded file(s), to ensure it's as expected (in case it was truncated or incompletely downloaded).
You should double-check that you're using the same file in both places, and also that you're checking the exact same token (e.g. 'bosnia' vs 'Bosnia') via both paths. (None of the 5 options in the https://projector.tensorflow.org/ drop-down correspond to the GoogleNews 300-d, 3-million-token dataset, and the load button doesn't appear to support word2vec .bin files, so I'm not sure how that could be used to cross-check what's in that file.)
(There aren't any known bugs in gensim's load_word2vec_format() that would explain it missing vectors that are actually present.)
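A quick way to check both case variants against the file you downloaded (a sketch assuming gensim 4.x, where KeyedVectors exposes key_to_index):
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("./data/word2vec/GoogleNews-vectors-negative300.bin", binary=True)
for token in ("bosnia", "Bosnia"):
    print(token, token in model.key_to_index)  # case-sensitive membership test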

Use pretrained embedding in Spanish with Torchtext

I am using Torchtext in an NLP project. I have a pretrained embedding in my system, which I'd like to use. Therefore, I tried:
my_field.vocab.load_vectors(my_path)
But, apparently, this only accepts the names of a short list of pre-accepted embeddings, for some reason. In particular, I get this error:
Got string input vector "my_path", but allowed pretrained vectors are ['charngram.100d', 'fasttext.en.300d', ..., 'glove.6B.300d']
I found some people with similar problems, but the solutions I can find so far are "change Torchtext source code", which I would rather avoid if at all possible.
Is there any other way in which I can work with my pretrained embedding? A solution that allows to use another Spanish pretrained embedding is acceptable.
Some people seem to think it is not clear what I am asking. So, if the title and final question are not enough: "I need help using a pre-trained Spanish word-embedding in Torchtext".
It turns out there is a relatively simple way to do this without changing Torchtext's source code; the inspiration comes from this GitHub thread.
1. Create numpy word-vector tensor
You need to load your embedding so you end up with a numpy array with dimensions (number_of_words, word_vector_length):
my_vecs_array[word_index] should return your corresponding word vector.
IMPORTANT: the indices (word_index) for this array MUST be taken from Torchtext's word-to-index dictionary (field.vocab.stoi). Otherwise Torchtext will point to the wrong vectors! A sketch of this step is below.
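A hypothetical sketch of that step, assuming my_pretrained_kv is your already-loaded Spanish embedding (e.g. a gensim KeyedVectors) and 300 is its dimensionality:
import numpy as np
word_vect_len = 300  # assumed dimensionality of the pretrained embedding
my_vecs_array = np.zeros((len(my_field.vocab), word_vect_len), dtype=np.float32)
for word, word_index in my_field.vocab.stoi.items():
    if word in my_pretrained_kv:  # words missing from the embedding keep zero vectors
        my_vecs_array[word_index] = my_pretrained_kv[word]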
Don't forget to convert to tensor:
my_vecs_tensor = torch.from_numpy(my_vecs_array)
2. Load array to Torchtext
I don't think this step is strictly necessary given the next one, but it lets you keep the Torchtext field with both the dictionary and the vectors in one place.
my_field.vocab.set_vectors(my_field.vocab.stoi, my_vecs_tensor, word_vect_len)
3. Pass weights to model
In your model you will declare the embedding like this:
my_embedding = torch.nn.Embedding(vocab_len, word_vect_len)
Then you can load your weights using:
my_embedding.weight = torch.nn.Parameter(my_field.vocab.vectors, requires_grad=False)
Use requires_grad=True if you want to train the embedding, use False if you want to freeze it.
EDIT: It looks like there is another way that is a bit easier! The improvement is that apparently you can pass the pre-trained word vectors directly during the vocabulary-building step, which takes care of steps 1-2 here; see the sketch below.
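A hedged sketch of that easier route, using Torchtext's (legacy) Vectors class, which accepts a path to a custom vector file; the file name and dataset variable here are placeholders:
from torchtext.vocab import Vectors  # torchtext.legacy.vocab in newer releases
custom_vectors = Vectors(name='my_spanish_embeddings.vec', cache='./vector_cache')
my_field.build_vocab(train_data, vectors=custom_vectors)  # vocab and vectors in one step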

Get most similar words using GloVe

I am new to GloVe. I successfully ran the demo.sh given on their website, and after running it I got several files, such as vocab, vectors, etc. But there isn't any documentation describing which files to use and how to use them to find the most similar words.
Hence, please help me find the most similar words for a given word in GloVe, using cosine similarity (e.g., like most_similar in Gensim word2vec).
It doesn't really matter how the word vectors were generated; you can always calculate cosine similarity between words. The easiest way to achieve what you asked for (assuming you have gensim) is:
python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
This will convert the GloVe vector file to w2v format. You can do it manually too: just add an extra line at the top of your GloVe file containing the total number of vectors and their dimensionality. It looks something akin to:
180000 300
<The rest of your file>
After that you can just load the file into gensim and everything works as if it were a regular w2v model.
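For example (a sketch; 'vectors_w2v.txt' stands in for your converted file name and 'king' is just an illustrative query):
from gensim.models import KeyedVectors
glove_kv = KeyedVectors.load_word2vec_format('vectors_w2v.txt', binary=False)
print(glove_kv.most_similar('king', topn=5))  # most similar words by cosine similarity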
