Get most similar words using GloVe - nlp

I am new to GloVe. I successfully ran the demo.sh provided on their website, and it produced several files such as vocab, vectors, etc. However, there is no documentation describing which files to use, or how to use them, to find the most similar words.
How can I find the words most similar to a given word with GloVe, using cosine similarity (e.g., like most_similar in Gensim's word2vec)?

It doesn't really matter how the word vectors were generated; you can always compute cosine similarity between them. The easiest way to achieve what you asked for (assuming you have gensim) is:
python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
This converts the GloVe vector file to word2vec format. You can also do it manually: just add an extra line at the top of your GloVe file containing the total number of vectors and their dimensionality, something akin to:
180000 300
<The rest of your file>
After that you can just load the file into gensim and everything works as if it were a regular word2vec model.
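For example, a minimal sketch of loading the converted file and querying it (the file name here is a placeholder for whatever you passed to --output):
from gensim.models import KeyedVectors

# Load the converted GloVe vectors (word2vec text format).
glove_vectors = KeyedVectors.load_word2vec_format('vectors_w2v.txt', binary=False)

# Words most similar to "king" by cosine similarity.
print(glove_vectors.most_similar('king', topn=10))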

Related

Using spaCy with archaic/old English words?

I am using en_core_web_lg to compare some texts for similarity and I am not getting the expected results.
The issue I guess is that my texts are mostly religious, for example:
"Thus hath it been decreed by Him Who is the Source of Divine inspiration."
"He, verily, is the Expounder, the Wise."
"Whoso layeth claim to a Revelation direct from God, ere the expiration of a full thousand years, such a man is assuredly a lying impostor. "
My question is, is there a way I can check spacy's "dictionary"? Does it include words like "whoso" "layeth" "decreed" or "verily"?
To check if spaCy knows about individual words you can check tok.is_oov ("is out of vocabulary"), where tok is a token from a doc.
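A minimal sketch of that check (assuming en_core_web_lg is installed; the sample text just reuses words from the question):
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Whoso layeth claim to a Revelation, he verily hath decreed it.")

# True means the token has no word vector in the model's vocabulary.
for tok in doc:
    print(tok.text, tok.is_oov)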
spaCy is trained on a dataset called OntoNotes. While that does include some older texts, like the bible, it's mostly relatively recent newspapers and similar sources. The word vectors are trained on Internet text. I would not expect it to work well with documents of the type you are describing, which are very different from what it has seen before.
I would suggest you train custom word vectors on your dataset, which you can then load into spaCy. You could also look at the HistWords project.
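If you do train your own vectors (for example with gensim's word2vec on your religious texts), spaCy v3 can package a plain-text word2vec file into a loadable pipeline; a sketch, assuming the file is named custom_vectors.txt (check python -m spacy init vectors --help for the exact options in your version):
# Run from the shell:
#   python -m spacy init vectors en custom_vectors.txt ./custom_vectors_model
#
# Then load the resulting pipeline and check coverage of your vocabulary:
import spacy

nlp = spacy.load("./custom_vectors_model")
print(nlp("verily")[0].is_oov)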

How to store Word vector Embeddings?

I am using BERT word embeddings for a sentence classification task with 3 labels, and I am coding in Google Colab. Since I have to execute the embedding step every time I restart the kernel, is there any way to save these word embeddings once they are generated? It takes a lot of time to generate them.
The code I am using to generate BERT Word Embeddings is -
[get_features(text) for text in text_list]
Here, get_features is a function which returns the word embedding for each item in my list text_list.
I read that converting the embeddings into numpy arrays and then using np.save can do it, but I don't actually know how to write that code.
You can save your embeddings data to a numpy file by following these steps:
import numpy as np

all_embeddings = here_is_your_function_return_all_data()
all_embeddings = np.array(all_embeddings)
np.save('embeddings.npy', all_embeddings)
If you're working in Google Colab, you can download the file to your local computer; whenever you need it, just upload it again and load it:
all_embeddings = np.load('embeddings.npy')
That's it.
By the way, you can also save the file directly to Google Drive.
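A minimal sketch of the Drive route (Colab-specific; the folder path is just an example):
from google.colab import drive
import numpy as np

# Mount your Google Drive inside the Colab runtime.
drive.mount('/content/drive')

# Save the embeddings to Drive so they survive kernel restarts, and reload later.
np.save('/content/drive/MyDrive/embeddings.npy', all_embeddings)
all_embeddings = np.load('/content/drive/MyDrive/embeddings.npy')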

get closest vector from unknown vector with gensim

I am currently implementing a natural text generator for a school project. I have a dataset of sentences of predetermined length and key words, which I convert into vectors using gensim and GoogleNews-vectors-negative300.bin.gz. I train a recurrent neural network to produce a list of vectors that I compare to the list of vectors of the real sentence, so I try to get as close as possible to the "real" vectors.
My problem happens when I have to convert the vectors back into words: my output vectors aren't necessarily in the Google set. So I would like to know if there is an efficient way to find the vector in the Google set that is closest to an output vector.
I work with Python 3 and TensorFlow.
Thanks a lot, feel free to ask any questions about the project
Charles
The gensim method .most_similar() (on KeyedVectors & similar classes) will also accept raw vectors as the 'origin' from which to search.
Just be sure to explicitly name the positive parameter - a list of target words/vectors to combine to find the origin point.
For example:
from gensim.models import KeyedVectors

gvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
target_vec = gvecs['apple']
similars = gvecs.most_similar(positive=[target_vec,])

word2vec limit similar_by_vector() result to re-trained corpus

Assume you have a (Wikipedia) pre-trained word2vec model and train it further on an additional, very small corpus (1,000 sentences).
Can you imagine a way to limit a vector-search to the "re-trained" corpus only?
For example
model.wv.similar_by_vector()
will simply find the closest word for a given vector, no matter if it is part of the Wikipedia corpus, or the re-trained vocabulary.
On the other hand, for 'word' search the concept exists:
most_similar_to_given('house',['garden','boat'])
I have tried training on the small corpus from scratch, and it somewhat works as expected, but it could of course be much more powerful if the assigned vectors came from a pre-trained set.
Sharing an efficient way to do this manually:
re-train word2vec on the additional corpus
create full unique word-index of corpus
fetch re-trained vectors for each word in the index
instead of the canned function "similar_by_vector", use scipy.spatial.KDTree.query()
This finds the closest word within the given corpus only and works as expected.
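A rough sketch of the last two steps, where retrained_model, corpus_vocab and query_vector are assumed names for the updated gensim model, the unique word list from the corpus, and the vector you want to look up:
import numpy as np
from scipy.spatial import KDTree

# Fetch the re-trained vector for every word in the small corpus.
corpus_vectors = np.array([retrained_model.wv[w] for w in corpus_vocab])

# Build a KD-tree over just those vectors and query it with an arbitrary vector.
tree = KDTree(corpus_vectors)
distance, index = tree.query(query_vector)
print(corpus_vocab[index], distance)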
Similar to the approach for creating a subset of doc-vectors in a new KeyedVectors instance suggested here, assuming small_vocab is a list of the words in your new corpus, you could try:
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(small_vocab, w2v_model.wv[small_vocab])
Then subset_vectors contains just the words you've selected, but supports familiar operations like most_similar().
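For example, continuing from the snippet above (some_output_vector is a placeholder for whatever vector you want to look up):
# Query only within the small corpus, by word or by raw vector.
print(subset_vectors.most_similar(positive=['house'], topn=5))
print(subset_vectors.similar_by_vector(some_output_vector, topn=5))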

Adding additional words in word2vec or Glove (maybe using gensim)

I have two pretrained word embeddings: Glove.840b.300.txt and custom_glove.300.txt
One is pretrained by Stanford and the other is trained by me.
Both have different sets of vocabulary. To reduce oov, I'd like to add words that don't appear in file1 but do appear in file2 to file1.
How do I do that easily?
This is how I load and save the files in gensim 3.4.0.
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/thefile')
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
I don't know an easy way.
In particular, word-vectors that weren't co-trained together won't have compatible/comparable coordinate-spaces. (There's no one right place for a word – just a relatively-good place compared to the other words that are in the same model.)
So you can't just append the missing words from another model: you'd need to transform them into compatible locations. Fortunately, it seems to work to use some set of shared anchor words, present in both word-vector sets, to learn a transformation, and then apply that transformation to the words you want to move over.
There's a class, TranslationMatrix, and a demo notebook in gensim showing this process for language translation (an application mentioned in the original word2vec papers). You could conceivably use this, combined with the ability to append extra vectors to a gensim KeyedVectors instance, to create a new set of vectors with a superset of the words in either of your source models.
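A minimal sketch of the underlying idea, using plain least-squares on shared anchor words instead of the TranslationMatrix class (attribute names assume gensim 3.x, matching the version mentioned above; the files are assumed to already have the word2vec-format header line, and the anchor-selection heuristic is just an example):
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

big = KeyedVectors.load_word2vec_format('Glove.840b.300.txt')
small = KeyedVectors.load_word2vec_format('custom_glove.300.txt')

# Anchor words present in both vocabularies.
anchors = [w for w in small.index2word if w in big.vocab][:5000]

# Learn a linear map from the small model's space into the big model's space.
X = np.array([small[w] for w in anchors])
Y = np.array([big[w] for w in anchors])
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Project the words that exist only in the small model into the big model's space.
# These projected vectors can then be appended to the big vector set (e.g. via
# KeyedVectors.add in newer gensim releases) or written back out to the text file.
missing = [w for w in small.index2word if w not in big.vocab]
projected = {w: small[w] @ W for w in missing}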
