Combining vectors in Gensim Word2Vec vocabulary - nlp

Gensim's Word2Vec model has a great method which allows you to find the top n most similar words in the model's vocabulary, given a list of positive words and negative words.
wv.most_similar(positive=['word1', 'word2', 'word3'],
negative=['word4','word5'], topn=10)
What I am looking to do is create a word vector that represents an averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare to other vectors.
Something like this:
newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'
I know that vectors can be summed, but I am not sure if that is the best option. I am hoping to find out exactly how the above function (most_similar) combines the positive vectors and negative vectors, and if Gensim has a function to do so. Thank you in advance.

Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.
Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.
But since it is an open-source project, you can look at its exact Python code for that operation and use it as a model for your own calculations.
For the current code defining that function, see:
https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L687

Based on the advice above, I chose to look at the Gensim source code and copy its method for averaging the vectors. Here is the code in case it helps anyone else.
Note: this code is copied from Gensim and is just reformatted to return the averaged vector.
from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL
KEY_TYPES = (str, int, np.integer)
'''
FUNCTION : meanVector(...)
INPUT :
keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
positive : list of words or vectors to be applied positively [default = list()]
negative : list of words or vectors to be applied negatively [default = list()]
OUTPUT :
averaged word vector, [type = numpy.ndarray]
DESCRIPTION :
allows for simple averaging of positive and negative words and vectors given a gensim model's word vector library.
'''
def meanVector(keyedVectors, positive=list(), negative=list()):
    positive = [
        (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in positive
    ]
    negative = [
        (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in negative
    ]
    # compute the weighted average of all keys
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * keyedVectors.get_vector(key, norm=True))
            if keyedVectors.has_index_for(key):
                all_keys.add(keyedVectors.get_index(key))
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    return mean
Note: this has not been thoroughly tested.
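For completeness, here is a hedged usage sketch of the meanVector() helper above. It assumes a trained Word2Vec model in a variable called model and that the placeholder words exist in its vocabulary; because the result is unit-normed, feeding it back into most_similar() should reproduce the ranking the built-in method computes internally.
# Hypothetical usage, assuming `model` is a trained gensim Word2Vec model
# and the placeholder words are in its vocabulary.
vec = meanVector(model.wv,
                 positive=['word1', 'word2', 'word3'],
                 negative=['word4', 'word5'])

# Compare the combined vector against the rest of the vocabulary.
print(model.wv.most_similar(positive=[vec], topn=10))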

Related

Understanding get_sentence_vector() and get_word_vector() for fasttext

The thing I want to do is to get embeddings of a pair of words or phrases and calculate similarity.
I observed that the similarity is the same when I switch between get_sentence_vector() and get_word_vector() for a word. For example, I can switch the method when calculating embedding_2 or embedding_3, but embedding_2 and embedding_3 are not equal, which is weird:
from scipy.spatial.distance import cosine
import numpy as np
import fasttext
import fasttext.util
# download an english model
fasttext.util.download_model('en', if_exists='ignore') # English
model = fasttext.load_model('cc.en.300.bin')
# Getting vectors for 'baby dog' and 'puppy'.
embedding_1 = model.get_sentence_vector('baby dog')
embedding_2 = model.get_word_vector('puppy')
embedding_3 = model.get_sentence_vector('puppy')
def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))
# compare the embeddings
cosine_similarity(embedding_1, embedding_2)
# compare the embeddings
cosine_similarity(embedding_1, embedding_3)
# Checking if the two approaches yield the same result.
is_equal = np.array_equal(embedding_2, embedding_3)
# Printing the result.
print(is_equal)
If I switch methods, similarity is always 0.76 but is_equal is false. I have two questions:
(1) I probably have to use get_sentence_vector() for phrases, but in terms of words, which one should I use? What happens when I call get_sentence_vector() for a word?
(2) I use fasttext because it can handle out-of-vocabulary words. Is it a good idea to use fasttext's embeddings for cosine similarity comparison?
You should use get_word_vector for words and get_sentence_vector for sentences.
get_sentence_vector divides each word vector by its norm and then averages them. If you are interested in more details, read this.
Since fastText provides vector representations, it is a good idea to use these vectors to compare words and sentences.
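To make the difference concrete, here is a rough sketch (not the library's actual implementation, just an approximation of the behaviour described above) that normalizes each word vector before averaging; it assumes the cc.en.300.bin model from the question is already downloaded.
import numpy as np
import fasttext

model = fasttext.load_model('cc.en.300.bin')

def manual_sentence_vector(model, sentence):
    vectors = []
    for word in sentence.split():
        v = model.get_word_vector(word)
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm               # each word vector is divided by its norm...
        vectors.append(v)
    return np.mean(vectors, axis=0)    # ...and the normalized vectors are averaged

# For a single word, this should match get_sentence_vector() up to floating-point
# noise, while differing from get_word_vector() only by scale. That is why the
# cosine similarity is unchanged but the raw arrays are not equal.
print(np.allclose(manual_sentence_vector(model, 'puppy'),
                  model.get_sentence_vector('puppy'), atol=1e-5))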

Using annoy with Torchtext for nearest neighbor search

I'm using Torchtext for some NLP tasks, specifically using the built-in embeddings.
I want to be able to do an inverse vector search: generate a noisy vector, find the vector that is closest to it, then get back the word that is "closest" to the noisy vector.
From the torchtext docs, here's how to attach embeddings to a built-in dataset:
from torchtext.vocab import GloVe
from torchtext import data, datasets
embedding = GloVe(name='6B', dim=100)
# Set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, is_target=True)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# build the vocabulary
TEXT.build_vocab(train, vectors=embedding, max_size=100000)
LABEL.build_vocab(train)
# Get an example vector
embedding.get_vecs_by_tokens("germany")
Then we can build the annoy index:
from annoy import AnnoyIndex
num_trees = 50
embedding_dims = 100  # must match the GloVe dim used above
ann_index = AnnoyIndex(embedding_dims, 'angular')
# Iterate through each vector in the embedding and add it to the index
for vector_num, vector in enumerate(TEXT.vocab.vectors):
    ann_index.add_item(vector_num, vector)  # Here's the catch: will vector_num correspond to torchtext.vocab.Vocab.itos?
ann_index.build(num_trees)
Then say I want to retrieve a word using a noisy vector:
# Get an existing vector
original_vec = embedding.get_vecs_by_tokens("germany")
# Add some noise to it
noise = generate_noise_vector(ndims=100)
noisy_vector = original_vec + noise
# Get the vector closest to the noisy vector
closest_item_idx = ann_index.get_nns_by_vector(noisy_vector, 1)[0]
# Get word from noisy item
noisy_word = TEXT.vocab.itos[closest_item_idx]
My question comes in for the last two lines above: the ann_index was built by enumerating over the embedding object, which is a Torch tensor.
The Vocab object has its own itos list that, given an index, returns a word.
My question is this: can I be certain that the order in which words appear in the itos list is the same as the order in TEXT.vocab.vectors? How can I map one index to the other?
Can I be certain that the order in which words appear in the itos list, is the same as the order in TEXT.vocab.vectors?
Yes.
The Field class will always instantiate a Vocab object (source), and since you are passing the pre-trained vectors to TEXT.build_vocab, the Vocab constructor will call the load_vectors function.
if vectors is not None:
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
In load_vectors, the vectors are filled by enumerating the words in itos:
for i, token in enumerate(self.itos):
    start_dim = 0
    for v in vectors:
        end_dim = start_dim + v.dim
        self.vectors[i][start_dim:end_dim] = v[token.strip()]
        start_dim = end_dim
    assert(start_dim == tot_dim)
Therefore, you can be certain that itos and vectors will have the same order.
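If you want to convince yourself, a quick sanity check along these lines (hypothetical, assuming the TEXT and embedding objects from the question have been built) compares a row of TEXT.vocab.vectors against the pre-trained GloVe vector for the same word, and maps the index back to the word via itos:
import torch

# stoi is the inverse of itos, so this index is the row used for 'germany'
idx = TEXT.vocab.stoi['germany']

# The row in TEXT.vocab.vectors should equal the pre-trained GloVe vector
print(torch.allclose(TEXT.vocab.vectors[idx],
                     embedding.get_vecs_by_tokens('germany')))

# ...and the same index maps back to the word, so annoy item ids can be
# translated with TEXT.vocab.itos[closest_item_idx]
print(TEXT.vocab.itos[idx])  # 'germany'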

word2vec - find a word by a specific vector

I trained a gensim Word2Vec model.
Let's say I have a certain vector and I want to find the word it represents - what is the best way to do so?
Meaning, for a specific vector:
vec = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
I want to get a word:
'computer' = model.vec2word(vec)
Word-vectors are generated through an iterative, approximative process – so shouldn't be thought of as precisely right (even though they do have exact coordinates), just "useful within certain tolerances".
So, there's no lookup of exact-word-for-exact-coordinates. Instead, in gensim Word2Vec and related classes there's most_similar(), which gives the known words closest to given known-words or vector coordinates, in ranked order, with the cosine-similarities. So if you've just trained (or loaded) a full Word2Vec model into the variable model, you can get the closest words to your vector with:
vec = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
similars = model.wv.most_similar(positive=[vec])
print(similars)
If you just want the single closest word, it'd be in similars[0][0] (the first position of the top-ranked tuple).
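For example (a hypothetical continuation of the snippet above), each entry of similars is a (word, cosine_similarity) tuple:
closest_word, similarity = similars[0]   # top-ranked (word, cosine_similarity) pair
print(closest_word, similarity)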
In spaCy, this is now supported via vocab.vectors.most_similar:
import spacy
nlp = spacy.load('en_core_web_md')
word_vec = nlp(u"Test").vector
result = nlp.vocab.vectors.most_similar(word_vec.reshape((1, -1)))
print(nlp.vocab.strings[result[0][0,0]], result)

How Word Mover's Distance (WMD) uses word2vec embedding space?

According to the WMD paper, it's inspired by the word2vec model and uses the word2vec vector space for moving document 1 towards document 2 (in the context of the Earth Mover's Distance metric). From the paper:
Assume we are provided with a word2vec embedding matrix X ∈ R^(d×n) for a finite size vocabulary of n words. The i-th column, x_i ∈ R^d, represents the embedding of the i-th word in d-dimensional space. We assume text documents are represented as normalized bag-of-words (nBOW) vectors, d ∈ R^n. To be precise, if word i appears c_i times in the document, we denote d_i = c_i / Σ_{j=1..n} c_j. An nBOW vector d is naturally very sparse as most words will not appear in any given document. (We remove stop words, which are generally category independent.)
I understand the concept from the paper; however, I couldn't understand from the Gensim code how WMD uses the word2vec embedding space.
Can someone explain it in a simple way? Does it calculate the word vectors in a different way? I couldn't see where the word2vec embedding matrix is used in this code.
The WMD function from Gensim:
def wmdistance(self, document1, document2):
    # Remove out-of-vocabulary words.
    len_pre_oov1 = len(document1)
    len_pre_oov2 = len(document2)
    document1 = [token for token in document1 if token in self]
    document2 = [token for token in document2 if token in self]

    dictionary = Dictionary(documents=[document1, document2])
    vocab_len = len(dictionary)

    # Sets for faster look-up.
    docset1 = set(document1)
    docset2 = set(document2)

    # Compute distance matrix.
    distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

    def nbow(document):
        d = zeros(vocab_len, dtype=double)
        nbow = dictionary.doc2bow(document)  # Word frequencies.
        doc_len = len(document)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d

    # Compute nBOW representation of documents.
    d1 = nbow(document1)
    d2 = nbow(document2)

    # Compute WMD.
    return emd(d1, d2, distance_matrix)
For the purposes of WMD, a text is considered a bunch of 'piles' of meaning. Those piles are placed at the coordinates of the text's words – and that's why WMD calculation is dependent on a set of word-vectors from another source. Those vectors position the text's piles.
The WMD is then the minimal amount of work needed to shift one text's piles to match another text's piles. And the measure of the work needed to shift from one pile to another is the Euclidean distance between those piles' coordinates.
You could just try a naive shifting of the piles: look at the first word from text A, shift it to the first word from text B, and so forth. But that's unlikely to be the cheapest shifting – which would likely try to match nearer words, to send the 'meaning' on the shortest possible paths. So actually calculating the WMD is an iterative optimization problem – significantly more expensive than just a simple euclidean-distance or cosine-distance between two points.
That optimization is done inside the emd() call in the code you excerpt. But what that optimization requires is the pairwise distances between all words in text A, and all words in text B – because those are all the candidate paths across which meaning-weight might be shifted. You can see those pairwise distances calculated in the code to populate the distance_matrix, using the word-vectors already loaded in the model and accessible via self[t1], self[t2], etc.
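As a hedged usage sketch: you never touch the embedding matrix explicitly. You load word vectors (any KeyedVectors will do; the pre-trained GloVe set below is just an example) and pass two tokenized texts, and those vectors supply the coordinates used to fill distance_matrix above. Note that wmdistance() relies on an external EMD solver (pyemd in older Gensim versions, POT in newer ones).
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')   # any set of word vectors works here

doc_a = 'obama speaks to the media in illinois'.split()
doc_b = 'the president greets the press in chicago'.split()

# The word vectors in `wv` position each word's 'pile'; the EMD solver then finds
# the cheapest way to move the piles of doc_a onto the piles of doc_b.
print(wv.wmdistance(doc_a, doc_b))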

make a confusion matrix for a classifier with 2 classes

I have a file with some sentences (a Persian sentence, a tab, a Persian word (tag), a tab, an English word (tag)). The English words show the class of each sentence. There are 2 classes in this file, "passion" and "salty". I classified the sentences with the naive Bayes algorithm and now I have to calculate precision and recall. For that I have to make a confusion matrix, but I don't know how. I wrote a small code and assumed that "passion" is the positive group and "salty" is the negative group. The code returned the output for this case. But if I assume "salty" as positive and "passion" as negative, the numbers are totally different from the first case, and consequently when I want to calculate precision and recall, I don't get the correct answer. Should I calculate tp, tn, fp and fn separately for the 2 classes (once for passion and once for salty), then calculate the average, and then calculate precision and recall according to this average?
(hint 1: argmax is the output of the NB algorithm; it is the tag that the code predicted for each test sentence.
hint 2: I have some other files with more than 2 classes, too)
# t = line.strip().split("\t")
if t[2] == "passion" and argmax == "passion":
    tp += 1
elif t[2] == "passion" and argmax != "passion":
    fn += 1
elif t[2] == "salty" and argmax != "salty":
    fp += 1
elif t[2] == "salty" and argmax == "salty":
    tn += 1
print("tp", tp, "tn", tn, "fp", fp, "fn", fn)
You should use scikit-learn, which already provides confusion matrix and classification reports. A sample:
from sklearn.metrics import confusion_matrix, classification_report
# suppose your predictions are stored in a variable called preds
# and the true values are stored in a variable called y
print(confusion_matrix(y, preds))
print(classification_report(y, preds))
(btw, scikit-learn is intended to be used with python 2.7, but it is probably safe to use these functions, since you already have the model built).
Also, since I see you are in the NLP domain, you could use the facilities that the nltk library provides. I'm not an expert, but I suppose this should be useful.
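For instance, here is a hypothetical end-to-end illustration with the two labels from the question (the gold tags are the t[2] values and the predictions are the argmax outputs):
from sklearn.metrics import confusion_matrix, classification_report

y     = ["passion", "salty", "passion", "salty", "passion"]   # gold tags, t[2]
preds = ["passion", "passion", "passion", "salty", "salty"]   # NB output, argmax

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
print(confusion_matrix(y, preds, labels=["passion", "salty"]))

# classification_report prints precision/recall/F1 per class plus macro and weighted
# averages, which addresses the "should I average over the two classes?" question;
# the same call works unchanged for the files with more than 2 classes.
print(classification_report(y, preds, labels=["passion", "salty"]))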
