word2vec - find a word by a specific vector - python-3.x

I trained a gensim Word2Vec model.
Let's say I have a certain vector and I want to find the word it represents. What is the best way to do so?
Meaning, for a specific vector:
vec = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
I want to get a word:
'computer' = model.vec2word(vec)

Word-vectors are generated through an iterative, approximative process – so shouldn't be thought of as precisely right (even though they do have exact coordinates), just "useful within certain tolerances".
So, there's no lookup of exact-word-for-exact-coordinates. Instead, in gensim Word2Vec and related classes there's most_similar(), which gives the known words closest to given known-words or vector coordinates, in ranked order, with the cosine-similarities. So if you've just trained (or loaded) a full Word2Vec model into the variable model, you can get the closest words to your vector with:
vec = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
similars = model.wv.most_similar(positive=[vec])
print(similars)
If you just want the single closest word, it'd be in similars[0][0] (the first position of the top-ranked tuple).
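To make the idea of "closest known word by cosine similarity" concrete, here is a minimal NumPy sketch of that kind of lookup. It is not gensim's implementation; the function name closest_word and the toy words and vectors are made up for illustration.

```python
import numpy as np

def closest_word(vec, words, vectors):
    """Return (word, cosine similarity) for the row of `vectors` nearest to `vec`."""
    vec = np.asarray(vec, dtype=np.float32)
    # Unit-normalise both sides so that a dot product equals cosine similarity.
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_vec = vec / np.linalg.norm(vec)
    sims = unit_rows @ unit_vec
    best = int(np.argmax(sims))
    return words[best], float(sims[best])

# Toy vocabulary and vectors, invented for this example only.
words = ["computer", "banana", "river"]
vectors = np.array([[0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.2],
                    [0.1, 0.0, 1.0]], dtype=np.float32)

word, score = closest_word([0.8, 0.2, 0.05], words, vectors)
print(word, round(score, 3))
```

The query vector is not identical to any stored row, yet the lookup still returns a word, which is the "useful within certain tolerances" behaviour described above.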

In spaCy, this is now supported via nlp.vocab.vectors.most_similar:
import spacy
nlp = spacy.load('en_core_web_md')
word_vec = nlp(u"Test").vector
result = nlp.vocab.vectors.most_similar(word_vec.reshape((1, -1)))
print(nlp.vocab.strings[result[0][0,0]], result)


Combining vectors in Gensim Word2Vec vocabulary

The Gensim Word2Vec model has a great method which allows you to find the top n most similar words in the model's vocabulary given a list of positive words and negative words.
wv.most_similar(positive=['word1', 'word2', 'word3'],
negative=['word4','word5'], topn=10)
What I am looking to do is create a word vector that represents an averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare to other vectors.
Something like this:
newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'
I know that vectors can be summed, but I am not sure if that is the best option. I am hoping to find out exactly how the above function (most_similar) combines the positive vectors and negative vectors, and if Gensim has a function to do so. Thank you in advance.
Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.
Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.
But as an open-source project, you can look at its exact Python code for that operation, and use it as a model for your own calculations.
For the current code defining that function, see:
https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L687
Following the advice above, I looked at the Gensim source code and copied their method for averaging the vectors. Here is the code in case it helps anyone else.
Note : this code is copied from gensim, and is just reformatted to return the averaged vector.
from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL

KEY_TYPES = (str, int, np.integer)

'''
FUNCTION : meanVector(...)
INPUT :
    keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
    positive : list of words or vectors to be applied positively [default = list()]
    negative : list of words or vectors to be applied negatively [default = list()]
OUTPUT :
    averaged word vector, [type = numpy.ndarray]
DESCRIPTION :
    allows for simple averaging of positive and negative words and vectors given a gensim model's word vector library.
'''
def meanVector(keyedVectors, positive=list(), negative=list()):
    # give bare keys/vectors a default weight of +1.0 (positive) or -1.0 (negative)
    positive = [
        (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in positive
    ]
    negative = [
        (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in negative
    ]
    # compute the weighted average of all keys
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * keyedVectors.get_vector(key, norm=True))
            if keyedVectors.has_index_for(key):
                all_keys.add(keyedVectors.get_index(key))
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    return mean
Note: this has not been thoroughly tested.

Using annoy with Torchtext for nearest neighbor search

I'm using Torchtext for some NLP tasks, specifically using the built-in embeddings.
I want to be able to do an inverse vector search: generate a noisy vector, find the vector that is closest to it, then get back the word that is "closest" to the noisy vector.
From the torchtext docs, here's how to attach embeddings to a built-in dataset:
from torchtext.vocab import GloVe
from torchtext import data, datasets
embedding = GloVe(name='6B', dim=100)
# Set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, is_target=True)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# build the vocabulary
TEXT.build_vocab(train, vectors=embedding, max_size=100000)
LABEL.build_vocab(train)
# Get an example vector
embedding.get_vecs_by_tokens("germany")
Then we can build the annoy index:
from annoy import AnnoyIndex
num_trees = 50
embedding_dims = 100  # matches GloVe(name='6B', dim=100) above
ann_index = AnnoyIndex(embedding_dims, 'angular')
# Iterate through each vector in the embedding and add it to the index
for vector_num, vector in enumerate(TEXT.vocab.vectors):
    ann_index.add_item(vector_num, vector)  # Here's the catch: will vector_num correspond to torchtext.vocab.Vocab.itos?
ann_index.build(num_trees)
Then say I want to retrieve a word using a noisy vector:
# Get an existing vector
original_vec = embedding.get_vecs_by_tokens("germany")
# Add some noise to it
noise = generate_noise_vector(ndims=100)
noisy_vector = original_vec + noise
# Get the vector closest to the noisy vector
closest_item_idx = ann_index.get_nns_by_vector(noisy_vector, 1)[0]
# Get word from noisy item
noisy_word = TEXT.vocab.itos[closest_item_idx]
My question comes in for the last two lines above: The ann_index was built using enumerate over the embedding object, which is a Torch tensor.
The vocab object has its own itos list that, given an index, returns a word.
My question is this: Can I be certain that the order in which words appear in the itos list, is the same as the order in TEXT.vocab.vectors? How can I map one index to the other?
Can I be certain that the order in which words appear in the itos list, is the same as the order in TEXT.vocab.vectors?
Yes.
The Field class will always instantiate a Vocab object, and since you are passing the pre-trained vectors to TEXT.build_vocab, the Vocab constructor will call the load_vectors function:
if vectors is not None:
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
In the load_vectors, the vectors are filled by enumerating the words in the itos.
for i, token in enumerate(self.itos):
    start_dim = 0
    for v in vectors:
        end_dim = start_dim + v.dim
        self.vectors[i][start_dim:end_dim] = v[token.strip()]
        start_dim = end_dim
    assert(start_dim == tot_dim)
Therefore, you can be certain that itos and vectors will have the same order.
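If you want to convince yourself of the index alignment without loading torchtext, the same pattern can be reproduced with toy data. This is only a sketch: itos and pretrained are hypothetical stand-ins for Vocab.itos and the GloVe lookup, and the brute-force Euclidean search stands in for the annoy index (which uses an angular metric in the question).

```python
import numpy as np

# Hypothetical stand-ins: `itos` plays the role of Vocab.itos and
# `pretrained` the role of the GloVe lookup; neither is real torchtext data.
itos = ["<unk>", "<pad>", "germany", "france"]
pretrained = {
    "germany": np.array([1.0, 0.0, 0.0]),
    "france": np.array([0.0, 1.0, 0.0]),
}
dim = 3

# Fill the matrix by enumerating itos, mirroring the load_vectors loop above.
# Tokens without a pretrained vector keep an all-zero row in this sketch.
vectors = np.zeros((len(itos), dim))
for i, token in enumerate(itos):
    vectors[i] = pretrained.get(token, np.zeros(dim))

def nearest_index(query, matrix):
    """Brute-force nearest row by Euclidean distance (stand-in for annoy)."""
    return int(np.argmin(np.linalg.norm(matrix - query, axis=1)))

noisy = pretrained["germany"] + np.array([0.05, -0.02, 0.01])
idx = nearest_index(noisy, vectors)
print(idx, itos[idx])
```

Because row i of the matrix was written while visiting itos[i], the index returned by the search can be passed straight back into itos.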

How to generate GloVe embeddings for POS tags? Python

For a sentence analysis task, I would like to take the sequence of POS tags associated with the sentence and feed it to my model as if the POS tags are words.
I am using GloVe to make representations of each word in the sentence and SpaCy to generate POS tags. However, GloVe embeddings do not make much sense for POS tags. So I will have to somehow create embeddings for each POS tag. What is the best way to create embeddings for POS tags, so that I can feed POS sequences into my model in the same way I would feed sentences? Could anyone point to code examples of how to do this with GloVe in Python?
Added context
My task is a binary classification of sentence pairs, based on their resemblance (similar meaning vs different meaning).
I would like to use POS tags as words, so that the POS tags serve as an additional bit of information to compare the sentences. My current model does not use an LSTM as a way to predict sequences.
Most word embedding models still rely on an underlying assumption that the meaning of a word is induced by its usage context. For example, learning a word2vec embedding with skipgram or continuous bag of words formulations implicitly assumes a model in which the representation vector of the word is based on the context words that co-occur with the target word, specifically by learning to create embeddings that best solve the classification task of distinguishing pairs of words that contextually co-occur from random pairs of words (so-called negative sampling).
But if the input is changed to be a sequence of discrete labels (POS tags), this assumption doesn't seem like it needs to remain accurate or reasonable. Part of speech labels have an assigned meaning that is not really induced by the context of being surrounded by other part of speech labels, so it's unlikely that standard learning tasks which are used to produce word embeddings would work when treating POS labels as if they were words from a much smaller vocabulary.
What is the overall sentence analysis task in your situation?
Added after question was updated with the learning task at hand.
Let's assume you can create POS input vectors for each sentence example. If there are N different POS labels possible, it means your input will consist of one vector from word embeddings and another vector of length N, where the value in component i represents the number of terms in the input sentence that possess POS label P_i.
For example, let's pretend the only POS labels possible are 'article', 'noun' and 'verb', and you have a sentence with ['article', 'noun', 'verb', 'noun']. Then this transforms into [1, 2, 1], and probably you want to normalize it by the length of the sentence. Let's call this input pos1 for sentence number 1 and pos2 for sentence number 2.
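As a small sketch of that transformation (the helper name pos_count_vector and its normalize flag are my own, not from any library), the count vector for the example sentence could be computed like this:

```python
from collections import Counter

def pos_count_vector(pos_tags, label_order, normalize=True):
    """Count how often each POS label occurs, in the fixed order `label_order`."""
    counts = Counter(pos_tags)
    vec = [float(counts[label]) for label in label_order]
    if normalize and pos_tags:
        # Divide by sentence length so long and short sentences are comparable.
        vec = [c / len(pos_tags) for c in vec]
    return vec

labels = ["article", "noun", "verb"]
tags = ["article", "noun", "verb", "noun"]

raw = pos_count_vector(tags, labels, normalize=False)
pos1 = pos_count_vector(tags, labels)
print(raw, pos1)
```

The order of labels must stay fixed across all sentences so that component i always refers to the same POS label.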
Let's call the word embedding vector input for sentence 1 as sentence1. sentence1 will be calculated by looking up each word embedding from a separate source, like a pretrained word2vec model or fastText or GloVe, and summing them up (using continuous bag of words). And the same for sentence2.
It's assumed that your batches of training data would already be processed into these vector formats, so a given single input would be a 4-tuple of vectors: the looked up CBOW embedding vector for sentence 1, same for sentence 2, and the calculated discrete representation vector for POS labels of sentence 1, and same for sentence 2.
A model that could work from this data might be like this:
from keras.layers import Activation, Concatenate, Dense, Input
from keras.models import Model

# Placeholder shapes, assumed only so the example runs: a 300-d summed word
# embedding and a POS count vector over 17 labels. Substitute your own sizes.
word_embedding_shape = (300,)
pos_vector_shape = (17,)

sentence1 = Input(shape=word_embedding_shape)
sentence2 = Input(shape=word_embedding_shape)
pos1 = Input(shape=pos_vector_shape)
pos2 = Input(shape=pos_vector_shape)

# Note: just choosing 128 as an embedding space dimension or intermediate
# layer size... in your real case, you'd choose these shape params
# based on what you want to model or experiment with. They don't mean
# anything here.
sentence1_branch = Dense(128)(sentence1)
sentence1_branch = Activation('relu')(sentence1_branch)
# ... do whatever other sentence1-only stuff

sentence2_branch = Dense(128)(sentence2)
sentence2_branch = Activation('relu')(sentence2_branch)
# ... do whatever other sentence2-only stuff

pos1_embedding = Dense(128)(pos1)
pos1_branch = Activation('relu')(pos1_embedding)
# ... do whatever other pos1-only stuff

pos2_embedding = Dense(128)(pos2)
pos2_branch = Activation('relu')(pos2_embedding)
# ... do whatever other pos2-only stuff

# Concatenate is a layer: instantiate it, then call it on the list of tensors.
unified = Concatenate()([sentence1_branch, sentence2_branch,
                         pos1_branch, pos2_branch])
# ... do dense layers, whatever, to the concatenated intermediate
# representations

# finally boil it down to whatever final prediction task you are using,
# whether it is predicting a sentence similarity score (Dense(1)),
# or predicting a binary label that indicates whether the sentence
# pairs are similar or not (Dense(2) then followed by softmax activation,
# or Dense(1) followed by some type of probability activation like sigmoid).
# Assume your data is binary labeled for similar sentences...
prediction = Activation('softmax')(Dense(2)(unified))

# compile() lives on a Model, not on a tensor, so wrap the graph first.
model = Model(inputs=[sentence1, sentence2, pos1, pos2], outputs=[prediction])
# A 2-unit softmax with one-hot labels pairs with categorical cross-entropy;
# the optimizer here is a placeholder choice.
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Do training to learn the weights...

# A separate model that will just produce the embedding output
# from a POS input vector, relying on weights learned from the
# training process.
pos_embedding_model = Model(inputs=[pos1], outputs=[pos1_embedding])

What should be the word vectors of token <pad>, <unknown>, <go>, <EOS> before sent into RNN?

In word embedding, what should be a good vector representation for the start_tokens _PAD, _UNKNOWN, _GO, _EOS?
Spettekaka's answer works if you are updating your word embedding vectors as well.
Sometimes you will want to use pretrained word vectors that you can't update, though. In this case, you can add a new dimension to your word vectors for each token you want to add and set the vector for each token to 1 in the new dimension and 0 for every other dimension. That way, you won't run into a situation where e.g. "EOS" is closer to the vector embedding of "start" than it is to the vector embedding of "end".
Example for clarification:
import numpy as np
# assume vector_embedding is a dictionary and word embeddings are 3-d before adding tokens
# e.g. vector_embedding['NLP'] = np.array([0.2, 0.3, 0.4])
vector_embedding['<EOS>'] = np.array([0,0,0,1])
vector_embedding['<PAD>'] = np.array([0,0,0,0,1])
new_vector_length = vector_embedding['<PAD>'].shape[0] # length of longest vector
# zero-pad every vector (words and tokens) up to the new length
for key, word_vector in vector_embedding.items():
    zero_append_length = new_vector_length - word_vector.shape[0]
    vector_embedding[key] = np.append(word_vector, np.zeros(zero_append_length))
Now your dictionary of word embeddings contains 2 new dimensions for your tokens and all of your words have been updated.
As far as I understand you can represent these tokens by any vector.
Here's why:
When you feed a sequence of words to your model, you first convert each word to an ID and then look up the vector for that ID in your embedding matrix. You train your model with that vector. But the embedding matrix itself consists of trainable weights, which are adjusted during training. The pretrained vectors just serve as a good starting point for getting good results.
Thus, it doesn't matter that much what your special tokens are represented by in the beginning as their representation will change during training.
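A rough NumPy sketch of that ID lookup, with special tokens given arbitrary starting rows, might look like the following. The toy vectors, the token list, and the choice to leave <PAD> at zero are all assumptions for illustration; in a real model the matrix would be the weights of a trainable embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-d pretrained vectors, for illustration only.
pretrained = {"hello": np.array([0.2, 0.3, 0.4]),
              "world": np.array([0.1, 0.0, 0.5])}
special_tokens = ["<PAD>", "<UNK>", "<GO>", "<EOS>"]

# Special tokens take the first IDs, followed by the pretrained vocabulary.
id_to_word = special_tokens + sorted(pretrained)
word_to_id = {w: i for i, w in enumerate(id_to_word)}

dim = 3
embedding_matrix = np.zeros((len(id_to_word), dim))
for word, idx in word_to_id.items():
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]
    elif word != "<PAD>":
        # Small random start; training is expected to move these rows.
        embedding_matrix[idx] = rng.normal(scale=0.1, size=dim)

def encode(tokens):
    """Map tokens to IDs (unknown words go to <UNK>) and look up their rows."""
    unk = word_to_id["<UNK>"]
    ids = [word_to_id.get(t, unk) for t in tokens]
    return embedding_matrix[ids]

seq = encode(["<GO>", "hello", "there", "<EOS>"])
print(seq.shape)
```

The word "there" is not in the toy vocabulary, so it falls back to the <UNK> row.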

How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?

I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put 0/1 depending on the absence/presence of that word in that sentence/document.
However, now that I am trying to use vector representation of each word, is creating a global vocabulary essential?
Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing traditional BOW is to replace each 0 bit (in BOW) with N zeros, and each 1 bit (in BOW) with the real vector (say, from Word2Vec). The size of the feature vector would then be N * |V| (compared to |V| features in BOW, where |V| is the size of the vocabulary). This simple generalization should work fine given a decent number of training instances.
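For illustration, the naive N * |V| construction could be sketched as follows. The function name naive_vector_bow and the toy 2-d vectors are made up; a real setup would take the vectors from a trained model.

```python
import numpy as np

def naive_vector_bow(tokens, vocab, word_vectors, dim):
    """Concatenate one block per vocabulary word: its vector if present, else zeros."""
    present = set(tokens)
    blocks = [
        word_vectors[word] if word in present and word in word_vectors
        else np.zeros(dim)
        for word in vocab
    ]
    return np.concatenate(blocks)

# Toy vocabulary with 2-d vectors, invented for this example.
vocab = ["cat", "dog", "sat"]
word_vectors = {"cat": np.array([1.0, 2.0]),
                "dog": np.array([3.0, 4.0]),
                "sat": np.array([5.0, 6.0])}

features = naive_vector_bow(["cat", "sat"], vocab, word_vectors, dim=2)
print(features)
```

With |V| = 3 and N = 2 the feature vector has 6 entries, and the block for the absent word "dog" is all zeros.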
To make the feature vectors smaller, people use various techniques like using recursive combination of vectors with various operations. (See Recursive/Recurrent Neural Network and similar tricks, for example: http://web.engr.illinois.edu/~khashab2/files/2013_RNN.pdf or http://papers.nips.cc/paper/4204-dynamic-pooling-and-unfolding-recursive-autoencoders-for-paraphrase-detection.pdf )
To get a fixed-length feature vector for each sentence, even though the number of words in each sentence differs, do as follows:
1. Tokenize each sentence into its constituent words.
2. For each word, get its word vector (if the word has no vector, ignore it).
3. Average all the word vectors you got.
This will always give you a d-dimensional vector (d is the word vector dimension).
Below is the code snippet:
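import numpy as np

def getWordVecs(words, w2v_dict):
    vecs = []
    for word in words:
        word = word.replace('\n', '')
        try:
            # reshape to a single row so the vectors can be stacked
            vecs.append(w2v_dict[word].reshape((1, -1)))
        except KeyError:
            # word has no vector: ignore it
            continue
    if not vecs:
        return None  # none of the words had a vector
    vecs = np.concatenate(vecs)
    vecs = np.array(vecs, dtype='float')
    # average (rather than sum) so the result does not depend on sentence length
    final_vec = np.mean(vecs, axis=0)
    return final_vec
words is the list of tokens obtained after tokenizing a sentence, and w2v_dict maps each known word to its vector.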
