How does Word Mover's Distance (WMD) use the word2vec embedding space?

According to the WMD paper, the method is inspired by the word2vec model and uses the word2vec vector space to move document 1 towards document 2 (in the sense of the Earth Mover's Distance metric). From the paper:
Assume we are provided with a word2vec embedding matrix $X \in \mathbb{R}^{d \times n}$ for a finite size vocabulary of $n$ words. The $i$-th column, $x_i \in \mathbb{R}^d$, represents the embedding of the $i$-th word in $d$-dimensional space. We assume text documents are represented as normalized bag-of-words (nBOW) vectors, $d \in \mathbb{R}^n$. To be precise, if word $i$ appears $c_i$ times in the document, we denote $d_i = c_i / \sum_{j=1}^{n} c_j$. An nBOW vector $d$ is naturally very sparse as most words will not appear in any given document. (We remove stop words, which are generally category independent.)
I understand the concept from the paper; however, I couldn't work out from the Gensim code how WMD uses the word2vec embedding space.
Can someone explain it in a simple way? Does it calculate the word vectors in a different way? I couldn't see where in this code the word2vec embedding matrix is used.
WMD function from Gensim:
def wmdistance(self, document1, document2):
    # Remove out-of-vocabulary words.
    len_pre_oov1 = len(document1)
    len_pre_oov2 = len(document2)
    document1 = [token for token in document1 if token in self]
    document2 = [token for token in document2 if token in self]

    dictionary = Dictionary(documents=[document1, document2])
    vocab_len = len(dictionary)

    # Sets for faster look-up.
    docset1 = set(document1)
    docset2 = set(document2)

    # Compute distance matrix.
    distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                # Only pairs (word from document 1, word from document 2) are needed.
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

    def nbow(document):
        d = zeros(vocab_len, dtype=double)
        nbow = dictionary.doc2bow(document)  # Word frequencies.
        doc_len = len(document)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d

    # Compute nBOW representation of documents.
    d1 = nbow(document1)
    d2 = nbow(document2)

    # Compute WMD.
    return emd(d1, d2, distance_matrix)

For the purposes of WMD, a text is considered a bunch of 'piles' of meaning. Those piles are placed at the coordinates of the text's words – and that's why WMD calculation is dependent on a set of word-vectors from another source. Those vectors position the text's piles.
The WMD is then the minimal amount of work needed to shift one text's piles to match another text's piles. And the measure of the work needed to shift from one pile to another is the Euclidean distance between those piles' coordinates.
You could just try a naive shifting of the piles: look at the first word from text A, shift it to the first word from text B, and so forth. But that's unlikely to be the cheapest shifting, which would likely try to match nearer words, to send the 'meaning' on the shortest possible paths. So actually calculating the WMD is an iterative optimization problem, significantly more expensive than a simple Euclidean distance or cosine distance between two points.
That optimization is done inside the emd() call in the code you excerpt. But what that optimization requires is the pairwise distances between all words in text A, and all words in text B – because those are all the candidate paths across which meaning-weight might be shifted. You can see those pairwise distances calculated in the code to populate the distance_matrix, using the word-vectors already loaded in the model and accessible via self[t1], self[t2], etc.
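To make the role of the word vectors concrete, here is a minimal sketch of the same computation outside Gensim. It assumes a hypothetical dict `vectors` mapping words to NumPy arrays (standing in for `self[t]`), and it uses `scipy.optimize.linprog` as a stand-in for the `emd()` solver; `toy_wmd` is an illustrative name, not a Gensim function.

```python
import numpy as np
from scipy.optimize import linprog


def toy_wmd(doc1, doc2, vectors):
    """Illustrative WMD between two token lists (assumes every token is in `vectors`)."""
    vocab = sorted(set(doc1) | set(doc2))
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)

    def nbow(doc):
        # Normalized bag-of-words: word counts divided by document length.
        d = np.zeros(n)
        for word in doc:
            d[index[word]] += 1.0
        return d / d.sum()

    d1, d2 = nbow(doc1), nbow(doc2)

    # This is the only place the embeddings enter: pairwise Euclidean distances.
    X = np.array([vectors[word] for word in vocab], dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Transport problem over the flow matrix T (flattened row by row):
    # row i of T must sum to d1[i], column i of T must sum to d2[i].
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # mass leaving word i of document 1
        A_eq[n + i, i::n] = 1.0           # mass arriving at word i of document 2
    b_eq = np.concatenate([d1, d2])

    result = linprog(dist.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return result.fun


# Hypothetical 2-d word vectors, only for illustration.
vectors = {"a": np.array([0.0, 0.0]), "b": np.array([3.0, 4.0])}
distance = toy_wmd(["a"], ["b"], vectors)  # all mass moves from "a" to "b"
```

The embeddings are never recomputed: they only supply the costs in `dist`, and the solver decides how much mass travels along each pair.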


Combining vectors in Gensim Word2Vec vocabulary

The Gensim Word2Vec model has a great method that lets you find the top n most similar words in the model's vocabulary, given a list of positive words and negative words.
wv.most_similar(positive=['word1', 'word2', 'word3'],
                negative=['word4', 'word5'], topn=10)
What I am looking to do is create a word vector that represents an averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare to other vectors.
Something like this:
newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'
I know that vectors can be summed, but I am not sure if that is the best option. I am hoping to find out exactly how the above function (most_similar) combines the positive vectors and negative vectors, and if Gensim has a function to do so. Thank you in advance.
Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.
Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.
But as an open-source project, you can look at its exact Python code for that operation, and use it as a model for your own calculations.
For the current code defining that function, see:
Following the advice above, I chose to look at the Gensim source code and copy their method for averaging the vectors. Here is the code in case it helps anyone else.
Note: this code is copied from Gensim, and is just reformatted to return the averaged vector.
from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL

KEY_TYPES = (str, int, np.integer)


def meanVector(keyedVectors, positive=list(), negative=list()):
    """
    FUNCTION : meanVector(...)
    INPUT :
        keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
        positive : list of words or vectors to be applied positively [default = list()]
        negative : list of words or vectors to be applied negatively [default = list()]
    OUTPUT :
        averaged word vector, [type = numpy.ndarray]
    DESCRIPTION :
        allows for simple averaging of positive and negative words and vectors
        given a gensim model's word vector library.
    """
    # add weights for each key, if not already present:
    # 1.0 for positive keys and -1.0 for negative keys
    positive = [
        (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in positive
    ]
    negative = [
        (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in negative
    ]

    # compute the weighted average of all keys
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * keyedVectors.get_vector(key, norm=True))
            if keyedVectors.has_index_for(key):
                # kept from the Gensim original; not used by this function
                all_keys.add(keyedVectors.get_index(key))
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    return mean
Note: this has not been thoroughly tested.
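For readers who want to see the arithmetic without Gensim, the combination can be sketched in plain NumPy. This is only an illustration of the steps in the code above (unit-normalize each vector, weight it +1 or -1, average, re-normalize); `combine_vectors` and the toy vectors are hypothetical names and data, not part of Gensim.

```python
import numpy as np


def combine_vectors(vectors, positive=(), negative=()):
    """Unit-normalize each word vector, weight it +1 or -1, average, then re-normalize."""
    weighted = []
    for word in positive:
        v = np.asarray(vectors[word], dtype=float)
        weighted.append(v / np.linalg.norm(v))
    for word in negative:
        v = np.asarray(vectors[word], dtype=float)
        weighted.append(-v / np.linalg.norm(v))
    if not weighted:
        raise ValueError("cannot combine an empty list of words")
    mean = np.mean(weighted, axis=0)
    # Assumes the mean is not the zero vector; otherwise this division is undefined.
    return mean / np.linalg.norm(mean)


# Hypothetical 2-d vectors, only for illustration.
toy = {"king": [2.0, 0.0], "woman": [0.0, 3.0], "man": [1.0, 0.0]}
new_vector = combine_vectors(toy, positive=["king", "woman"], negative=["man"])
```

Because the result is re-normalized, averaging and summing the weighted vectors give the same direction, so the choice between the two does not affect cosine comparisons.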

Using annoy with Torchtext for nearest neighbor search

I'm using Torchtext for some NLP tasks, specifically using the built-in embeddings.
I want to be able to do an inverse vector search: generate a noisy vector, find the vector that is closest to it, then get back the word that is "closest" to the noisy vector.
From the torchtext docs, here's how to attach embeddings to a built-in dataset:
from torchtext.vocab import GloVe
from torchtext import data, datasets
embedding = GloVe(name='6B', dim=100)
# Set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, is_target=True)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# build the vocabulary
TEXT.build_vocab(train, vectors=embedding, max_size=100000)
# Get an example vector
Then we can build the annoy index:
from annoy import AnnoyIndex
num_trees = 50
embedding_dims = 100  # must match GloVe(dim=100) above
ann_index = AnnoyIndex(embedding_dims, 'angular')
# Iterate through each vector in the embedding and add it to the index
for vector_num, vector in enumerate(TEXT.vocab.vectors):
    ann_index.add_item(vector_num, vector) # Here's the catch: will vector_num correspond to torchtext.vocab.Vocab.itos?
ann_index.build(num_trees)  # the index must be built before it can be queried
Then say I want to retrieve a word using a noisy vector:
# Get an existing vector
original_vec = embedding.get_vecs_by_tokens("germany")
# Add some noise to it
noise = generate_noise_vector(ndims=100)
noisy_vector = original_vec + noise
# Get the vector closest to the noisy vector
closest_item_idx = ann_index.get_nns_by_vector(noisy_vector, 1)[0]
# Get word from noisy item
noisy_word = TEXT.vocab.itos[closest_item_idx]
My question concerns the last two lines above: the ann_index was built by enumerating TEXT.vocab.vectors, which is a Torch tensor.
The vocab object has its own itos list that, given an index, returns a word.
Can I be certain that the order in which words appear in the itos list is the same as the order in TEXT.vocab.vectors? How can I map one index to the other?
Can I be certain that the order in which words appear in the itos list, is the same as the order in TEXT.vocab.vectors?
The Field class will always instantiate a Vocab object, and since you are passing the pre-trained vectors to TEXT.build_vocab, the Vocab constructor will call the load_vectors function.
if vectors is not None:
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
In the load_vectors, the vectors are filled by enumerating the words in the itos.
for i, token in enumerate(self.itos):
    start_dim = 0
    for v in vectors:
        end_dim = start_dim + v.dim
        self.vectors[i][start_dim:end_dim] = v[token.strip()]
        start_dim = end_dim
    assert(start_dim == tot_dim)
Therefore, you can be certain that itos and vectors will have the same order.
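The index mapping can be illustrated with a small NumPy sketch that mimics how load_vectors fills the matrix. Everything here is a stand-in: `pretrained` and `itos` are hypothetical toy data instead of GloVe and TEXT.vocab, and `nearest_index` is a brute-force cosine search used in place of an Annoy lookup.

```python
import numpy as np

# Hypothetical pre-trained vectors and vocabulary, standing in for GloVe and TEXT.vocab.
pretrained = {
    "germany": np.array([1.0, 0.0, 0.0]),
    "france": np.array([0.0, 1.0, 0.0]),
    "banana": np.array([0.0, 0.0, 1.0]),
}
itos = ["banana", "germany", "france"]

# Fill the matrix by enumerating itos, as load_vectors does: row i belongs to itos[i].
vectors = np.zeros((len(itos), 3))
for i, token in enumerate(itos):
    vectors[i] = pretrained[token]


def nearest_index(query, matrix):
    """Brute-force cosine nearest neighbour; a stand-in for an Annoy lookup."""
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))


noisy_vector = pretrained["germany"] + np.array([0.1, -0.05, 0.02])
closest_item_idx = nearest_index(noisy_vector, vectors)
noisy_word = itos[closest_item_idx]
```

Since the row number used when adding items to the index is the same `i` used to fill the matrix, the index returned by the search can be passed directly to `itos`.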

What should be the word vectors of token <pad>, <unknown>, <go>, <EOS> before sent into RNN?

In word embedding, what should be a good vector representation for the start_tokens _PAD, _UNKNOWN, _GO, _EOS?
Spettekaka's answer works if you are updating your word embedding vectors as well.
Sometimes you will want to use pretrained word vectors that you can't update, though. In this case, you can add a new dimension to your word vectors for each token you want to add and set the vector for each token to 1 in the new dimension and 0 for every other dimension. That way, you won't run into a situation where e.g. "EOS" is closer to the vector embedding of "start" than it is to the vector embedding of "end".
Example for clarification:
import numpy as np

# assume vector_embedding is a dictionary and word embeddings are 3-d before adding tokens
# e.g. vector_embedding['NLP'] = np.array([0.2, 0.3, 0.4])
vector_embedding['<EOS>'] = np.array([0, 0, 0, 1])
vector_embedding['<PAD>'] = np.array([0, 0, 0, 0, 1])
new_vector_length = vector_embedding['<PAD>'].shape[0]  # length of longest vector
# pad every shorter vector with zeros up to the new length
for key, word_vector in vector_embedding.items():
    zero_append_length = new_vector_length - word_vector.shape[0]
    vector_embedding[key] = np.append(word_vector, np.zeros(zero_append_length))
Now your dictionary of word embeddings contains 2 new dimensions for your tokens and all of your words have been updated.
As far as I understand you can represent these tokens by any vector.
Here's why:
When you input a sequence of words to your model, you first convert each word to an ID and then look up in your embedding matrix which vector corresponds to that ID. With that vector, you train your model. But the embedding matrix itself consists of trainable weights, which will be adjusted during training. The vector representations from your pretrained vectors just serve as a good starting point for getting good results.
Thus, it doesn't matter that much what your special tokens are represented by in the beginning as their representation will change during training.
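For the trainable case, one possible way to set up the initial embedding matrix is sketched below with NumPy. It is only an illustration under assumptions: the token names, the ID layout, and the small random initialization are arbitrary choices, and the actual training step (done by whatever framework you use) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3

# Hypothetical pre-trained vectors for ordinary words.
pretrained = {"nlp": np.array([0.2, 0.3, 0.4]), "model": np.array([0.1, 0.0, 0.5])}
special_tokens = ["<pad>", "<unknown>", "<go>", "<eos>"]

# IDs 0..3 go to the special tokens, the remaining IDs to the vocabulary words.
id_to_token = special_tokens + sorted(pretrained)
token_to_id = {token: i for i, token in enumerate(id_to_token)}

# Small random vectors for the special tokens, pre-trained vectors for the words.
embedding_matrix = rng.normal(scale=0.1, size=(len(id_to_token), dim))
for token, vector in pretrained.items():
    embedding_matrix[token_to_id[token]] = vector

# Looking up a sentence is just row indexing; during training, a framework
# would update these rows along with the other weights.
sentence = ["<go>", "nlp", "model", "<eos>"]
ids = [token_to_id[token] for token in sentence]
inputs = embedding_matrix[ids]
```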

ALS model - how to generate full_u * v^t * v?

I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this stackoverflow answer. I've copied the answer below for the reader's convenience:
You can get predictions for new users using the trained model (without updating it):
To get predictions for a user in the model, you use its latent representation (vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (the matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have the full representation of size M, the number of different products). What you can do is use a similarity function to compute a similar latent representation for this new user by multiplying it by the transpose of the product matrix.
i.e. if your user latent matrix is u and your product latent matrix is v, then for user i in the model you get scores by doing: u_i * v. For a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v. This will approximate the latent factors for the new users and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).
To answer the question of training, this allows you to compute predictions for new users without redoing the heavy computation of the model, which you can now do only once in a while. So you have your batch processing at night and can still make predictions for new users during the day.
Note: MLLIB gives you access to the matrix u and v
The quoted text above is an excellent answer, however, I'm struggling to understand how to programmatically implement this solution. For example, the matrix u and v can be obtained with:
# pyspark example
# omitted for brevity ... loading movielens 1M ratings
model = ALS.train(ratings, rank, numIterations, lambdaParam)
matrix_u = model.userFeatures()
print(matrix_u.take(2)) # take a look at the dataset
This returns:
(2, array('d', [0.26341307163238525, 0.1650490164756775, 0.118405282497406, -0.5976635217666626, -0.3913084864616394, -0.1379186064004898, -0.3866392970085144, -0.1768060326576233, -0.38342711329460144, 0.48550787568092346, -0.18867433071136475, -0.02757863700389862, 0.1410026103258133, 0.11498363316059113, 0.03958914801478386, 0.034536730498075485, 0.08427099883556366, 0.46969038248062134, -0.8230801224708557, -0.15124185383319855, 0.2566414773464203, 0.04326820373535156, 0.19077207148075104, 0.025207923725247383, -0.02030213735997677, 0.1696728765964508, 0.5714617967605591, -0.03885050490498543, -0.09797532111406326, 0.29186877608299255, -0.12768596410751343, -0.1582849770784378, 0.01933656632900238, -0.09131495654582977, 0.26577943563461304, -0.4543033838272095, -0.11789630353450775, 0.05775507912039757, 0.2891307771205902, -0.2147761881351471, -0.011787488125264645, 0.49508437514305115, 0.5610293745994568, 0.228189617395401, 0.624510645866394, -0.009683617390692234, -0.050237834453582764, -0.07940001785755157, 0.4686132073402405, -0.02288617007434368])),
(4, array('d', [-0.001666820957325399, -0.12487432360649109, 0.1252429485321045, -0.794727087020874, -0.3804478347301483, -0.04577340930700302, -0.42346617579460144, -0.27448347210884094, -0.25846347212791443, 0.5107921957969666, 0.04229479655623436, -0.10212298482656479, -0.13407345116138458, -0.2059325873851776, 0.12777331471443176, -0.318756639957428, 0.129398375749588, 0.4351944327354431, -0.9031049013137817, -0.29211774468421936, -0.02933369390666485, 0.023618215695023537, 0.10542935132980347, -0.22032295167446136, -0.1861676126718521, 0.13154461979866028, 0.6130356192588806, -0.10089754313230515, 0.13624103367328644, 0.22037173807621002, -0.2966669499874115, -0.34058427810668945, 0.37738317251205444, -0.3755438029766083, -0.2408779263496399, -0.35355791449546814, 0.05752146989107132, -0.15478627383708954, 0.3418906629085541, -0.6939512491226196, 0.4279302656650543, 0.4875738322734833, 0.5659542083740234, 0.1479463279247284, 0.5280753970146179, -0.24357643723487854, 0.14329688251018524, -0.2137598991394043, 0.011986476369202137, -0.015219110995531082]))
I can also do similar to get the v matrix:
matrix_v = model.productFeatures()
print(matrix_v.take(2)) # take a look at the dataset
This results in:
(2, array('d', [0.019985994324088097, 0.0673416256904602, -0.05697149783372879, -0.5434763431549072, -0.40705952048301697, -0.18632276356220245, -0.30776089429855347, -0.13178342580795288, -0.27466219663619995, 0.4183739423751831, -0.24422742426395416, -0.24130797386169434, 0.24116989970207214, 0.06833088397979736, -0.01750543899834156, 0.03404173627495766, 0.04333991929888725, 0.3577033281326294, -0.7044714689254761, 0.1438472419977188, 0.06652364134788513, -0.029888223856687546, -0.16717877984046936, 0.1027146726846695, -0.12836599349975586, 0.10197233408689499, 0.5053384900093079, 0.019304445013403893, -0.21254844963550568, 0.2705852687358856, -0.04169371724128723, -0.24098040163516998, -0.0683765709400177, -0.09532768279314041, 0.1006036177277565, -0.08682398498058319, -0.13584329187870026, -0.001340558985248208, 0.20587041974067688, -0.14007550477981567, -0.1831497997045517, 0.5021498203277588, 0.3049483597278595, 0.11236990243196487, 0.15783801674842834, -0.044139936566352844, -0.14372406899929047, 0.058535050600767136, 0.3777201473712921, -0.045475270599126816])),
(4, array('d', [0.10334215313196182, 0.1881643384695053, 0.09297363460063934, -0.457258403301239, -0.5272660255432129, -0.0989445373415947, -0.2053477019071579, -0.1644461452960968, -0.3771175146102905, 0.21405018866062164, -0.18553146719932556, 0.011830524541437626, 0.29562288522720337, 0.07959598302841187, -0.035378433763980865, -0.11786794662475586, -0.11603366583585739, 0.3776192367076874, -0.5124108791351318, 0.03971947357058525, -0.03365595266222954, 0.023278912529349327, 0.17436474561691284, -0.06317273527383804, 0.05118614062666893, 0.4375131130218506, 0.3281322419643402, 0.036590900272130966, -0.3759073317050934, 0.22429685294628143, -0.0728025734424591, -0.10945595055818558, 0.0728464275598526, 0.014129920862615108, -0.10701996833086014, -0.2496117204427719, -0.09409723430871964, -0.11898282915353775, 0.18940524756908417, -0.3211393356323242, -0.035668935626745224, 0.41765937209129333, 0.2636736035346985, -0.01290816068649292, 0.2824321389198303, 0.021533429622650146, -0.08053319901227951, 0.11117415875196457, 0.22975310683250427, 0.06993964314460754]))
However, I'm not sure how to progress from this to full_u * v^t * v
This new user is not in the matrix U, so you don't have its latent representation in 'k' factors; you only know its full representation, i.e., all its ratings. full_u here means all of the new user's ratings in a dense format (not the sparse format ratings usually come in), e.g.:
[0 2 0 0 0 1 0] if user u has rated item 2 with a 2 and item 6 with a 1.
Then you can get v much as you did, and turn it into a NumPy matrix, for instance:
pf = model.productFeatures()
Vt = np.matrix(np.asarray(pf.values().collect()))
Then it is just a matter of multiplying:
full_u*Vt*Vt.T
Vt and V are transposed compared to the other answer but that's just a matter of convention.
Note that the Vt*Vt.T product is fixed, so if you are going to use this for multiple new users it will be computationally more efficient to pre-compute it. Actually for more than one user it would be better to put all their ratings in bigU (in the same format as my one new user example) and just do the matrix product:
bigU*Vt*Vt.T to get all the ratings for all the new users. Might still be worth checking that the product is done in the most efficient way in terms of number of operations.
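The products described above can be sketched with plain NumPy arrays. The factor values below are made up for illustration; with MLlib the rows of `Vt` would come from `model.productFeatures()`, and the sketch assumes the rows are ordered to match the item positions in `full_u`, which you would need to ensure yourself.

```python
import numpy as np

# Hypothetical product factors: one row per product (M = 4), k = 2 factors each.
Vt = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.0],
])

# Fixed M x M product, computed once and reused for every new user.
VtV = Vt @ Vt.T

# Dense ratings of one new user: rated item 2 with a 2 and item 4 with a 1.
full_u = np.array([0.0, 2.0, 0.0, 1.0])
scores = full_u @ VtV

# Several new users at once: one row of ratings per user.
bigU = np.array([
    [0.0, 2.0, 0.0, 1.0],
    [3.0, 0.0, 0.0, 0.0],
])
all_scores = bigU @ VtV
```

On the efficiency point: grouping the product as `(bigU @ Vt) @ Vt.T` avoids forming the M x M matrix at all, which may be cheaper when the number of products is much larger than the number of factors.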
Just a word of warning. People talk about the user and product matrices like they are left and right singular vectors. But as far as I understand, the method used to find U and V is an optimization of a straight squared error cost function, which makes none of the orthogonality guarantees of SVD.
In other words, think algebraically about what the above answer claims. We have a full ratings matrix R, an n by p matrix of ratings for n users over p products. We decompose it with k latent factors to approximate R = UV, where the rows of U, an n by k matrix, are the latent user representations, and the columns of V, a k by p matrix, are the latent product representations. In order to find latent user representations for a matrix R of entirely new users without refitting the model, we need:
R = U V
R V^{-1} = U V V^{-1}
R V^{-1} = U I_{k}
R V^{-1} = U
where I_{k} is the k dimensional identity matrix and V^{-1} is the p by k right inverse of V. The tip above assumes that V^{T} = V^{-1}. This is not guaranteed. And in general there is no guarantee that assuming this is true will give you anything but nonsense answers.
Let me know if I'm missing something in the optimization method behind MLLib's CF implementation. Is there a trick in the ALS model that guarantees orthogonality that I'm missing?
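The algebra above can be checked numerically on synthetic data. This sketch does not use ALS output: it builds random, noise-free factors so that R = UV holds exactly, then compares the pseudo-inverse (one way to obtain a right inverse of V) with the transpose shortcut. It ignores regularization and missing ratings, so it only illustrates the orthogonality point.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_users, n_products = 2, 4, 5

U = rng.normal(size=(n_users, k))      # latent user factors
V = rng.normal(size=(k, n_products))   # latent product factors, not orthonormal
R = U @ V                              # ratings that factor exactly as R = U V

# Right inverse of V via the pseudo-inverse: V @ pinv(V) = I_k when V has rank k.
U_pinv = R @ np.linalg.pinv(V)

# The shortcut from the quoted answer uses V^T in place of the right inverse.
U_transpose = R @ V.T

error_pinv = np.abs(U_pinv - U).max()
error_transpose = np.abs(U_transpose - U).max()
```

Here `error_pinv` is at the level of floating-point rounding, while `error_transpose` is not, because V V^T is not the identity for a generic V.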

How to use vector representations of words (as obtained from Word2Vec, etc.) as features for a classifier?

I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put 0/1 depending on the absence/presence of that word in that sentence/document.
However, now that I am trying to use vector representation of each word, is creating a global vocabulary essential?
Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing the traditional BOW is to replace each 0 bit (in BOW) with N zeros, and each 1 bit (in BOW) with the real vector (say from Word2Vec). Then the size of the feature vector would be N * |V| (compared to |V| in the BOW, where |V| is the size of the vocabulary). This simple generalization should work fine given a decent number of training instances.
To make the feature vectors smaller, people use various techniques, such as recursively combining vectors with various operations. (See Recursive/Recurrent Neural Networks and similar tricks.)
To get a fixed-length feature vector for each sentence, even though the number of words in each sentence differs, do as follows:
1. Tokenize each sentence into its constituent words.
2. For each word, get its word vector (if the word has no vector, ignore it).
3. Average all the word vectors you got.
This will always give you a d-dimensional vector (d is the word vector dimension).
Below is the code snippet:
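import numpy as np

def getWordVecs(words, w2v_dict):
    vecs = []
    for word in words:
        word = word.replace('\n', '')
        try:
            # reshape to a single row so the vectors can be stacked
            vecs.append(np.asarray(w2v_dict[word]).reshape(1, -1))
        except KeyError:
            # skip words that have no vector
            continue
    vecs = np.concatenate(vecs)
    vecs = np.array(vecs, dtype='float')
    # average, as in step 3 above, so the result does not grow with sentence length
    final_vec = np.mean(vecs, axis=0)
    return final_vec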
words is the list of tokens obtained after tokenizing a sentence.
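To turn several sentences into a feature matrix for a classifier, the per-sentence averages can be stacked row by row. The sketch below uses a hypothetical toy dictionary in place of a trained Word2Vec model, and `sentence_vector` is an illustrative helper; returning a zero vector for a sentence with no known words is one possible choice, not the only one.

```python
import numpy as np

# Hypothetical 3-d word vectors, only for illustration.
w2v_dict = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
}


def sentence_vector(words, w2v_dict, dim):
    """Average the vectors of the known words; all zeros if none is known."""
    vecs = [w2v_dict[w] for w in words if w in w2v_dict]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


sentences = [["good", "movie"], ["bad", "movie", "unseen"], ["unseen"]]
# One row per sentence, one column per embedding dimension.
features = np.vstack([sentence_vector(s, w2v_dict, dim=3) for s in sentences])
```

The resulting `features` array has a fixed number of columns regardless of sentence length, so no global vocabulary is needed to define the feature space.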
