get closest vector from unknown vector with gensim - python-3.x

I am currently implementing a natural-text generator for a school project. I have a dataset of sentences of predetermined length and key words, which I convert into vectors using gensim and GoogleNews-vectors-negative300.bin.gz. I train a recurrent neural network to produce a list of vectors that I compare to the list of vectors of the real sentence, so I try to get as close as possible to the "real" vectors.
My problem comes when I have to convert the vectors back into words: my vectors aren't necessarily in the Google set. So I would like to know whether there is an efficient way to find the vector in the Google set that is closest to an output vector.
I work with Python 3 and TensorFlow.
Thanks a lot, and feel free to ask any questions about the project.
Charles

The gensim method .most_similar() (on KeyedVectors & similar classes) will also accept raw vectors as the 'origin' from which to search.
Just be sure to explicitly name the positive parameter - a list of target words/vectors to combine to find the origin point.
For example:
from gensim.models import KeyedVectors

# the GoogleNews file is in binary word2vec format, so pass binary=True
gvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
target_vec = gvecs['apple']
similars = gvecs.most_similar(positive=[target_vec,])
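The same call also works with any correctly sized numpy array, which is what the question needs for mapping an RNN output back to a word. A minimal sketch, where rnn_output is a hypothetical stand-in for your network's 300-dimensional prediction:
import numpy as np
rnn_output = np.random.rand(300).astype(np.float32)  # stand-in for your model's output vector
closest = gvecs.most_similar(positive=[rnn_output], topn=1)  # list of (word, cosine similarity)
(gensim's KeyedVectors also provides similar_by_vector(), which does the same nearest-word lookup for a single raw vector.)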

Related

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings instead. I have looked at gensim and GloVe but am struggling to make them work.
How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How may this be achieved, and is there any helpful material on the topic?
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on part of the Google News dataset and generalizes well enough for most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model on your corpus.
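Since lookups fail for OOV words, it can help to check vocabulary membership before indexing. A minimal sketch, assuming gensim 4.x (where the vocabulary is exposed as key_to_index; older versions used wv.vocab) and a hypothetical sample of your 33 words:
words = ['father', 'sing', 'qwertyuiop']  # hypothetical sample; the last word is deliberately OOV
vectors = {w: wv[w] for w in words if w in wv.key_to_index}  # embeddings for known words only
missing = [w for w in words if w not in wv.key_to_index]     # words you'd need to handle separately
print(missing)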

Finding both target and center word2vec matrices

I've read and heard (in Stanford's CS224 course) that the word2vec algorithm actually trains two matrices (that is, two sets of vectors): the U set and the V set, one for words acting as targets and one for words acting as context. The final output is the average of these two.
I have two questions in mind. The first is:
Why do we take the average of the two vectors? Why does that make sense? Don't we lose some information?
The second question is: using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec model that includes both sets of vectors? I don't have enough resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et al. at Microsoft Research (in "Dual Embedding Space Models" and associated writeups), has noted that those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.
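A minimal sketch of pulling both matrices out of a gensim model you train yourself, assuming gensim 4.x and the default negative-sampling mode (the corpus and hyperparameters are toy placeholders):
from gensim.models import Word2Vec
sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]  # hypothetical toy corpus
model = Word2Vec(sentences, vector_size=50, negative=5, min_count=1)
input_vectors = model.wv.vectors   # the projection-matrix vectors, one row per word
output_vectors = model.syn1neg     # the output-side weights, rows aligned with model.wv.key_to_index
print(input_vectors.shape, output_vectors.shape)  # both (vocab_size, 50)
This only works for a model trained locally; as noted above, the released GoogleNews file contains just the input-side vectors.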

Efficient way to get best matching pairs given a similarity-outputting neural network?

I am trying to come up with a neural network that ranks pairs of two short texts (for example, a Stack Exchange title and body). Following the Deep Learning Cookbook's example, the network is basically this: take the two inputs (title and body), embed them, then calculate the cosine similarity between the embeddings. The inputs of the model are [title, body] and the output is [sim].
Now I'd like to find the closest matching body for a given title. Is there a more efficient way of doing this than iterating over every possible (title, body) pair and calculating the corresponding similarity? For very large datasets that is just not feasible.
Any help is much appreciated!
It is indeed not very efficient to iterate over every possible data pair. Instead, you could use your model to extract the embeddings of all your titles and text bodies and save them in a database (or simply a .npy file). That is, you don't use your model to output a similarity score; instead, you use it to output an embedding (from your embedding layer).
At inference time you can then use a library for efficient similarity search, such as faiss. Given a title, you simply look up its embedding and search the whole space of body embeddings to see which ones get the highest score. I have used this approach myself and was able to search 1M vectors in about 100 ms.
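A minimal sketch of that lookup with faiss, assuming the embeddings are L2-normalized so that inner product equals cosine similarity (the array names and sizes are hypothetical placeholders):
import faiss
import numpy as np
d = 300                                                         # embedding dimension
body_embeddings = np.random.rand(10000, d).astype('float32')    # stand-in for your precomputed body embeddings
faiss.normalize_L2(body_embeddings)
index = faiss.IndexFlatIP(d)                                    # exact inner-product (cosine) search
index.add(body_embeddings)
title_embedding = np.random.rand(1, d).astype('float32')        # stand-in for the query title's embedding
faiss.normalize_L2(title_embedding)
scores, ids = index.search(title_embedding, 5)                  # indices of the 5 closest bodies
For larger collections you can swap IndexFlatIP for an approximate index (e.g. IndexIVFFlat) at the cost of exactness.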

reducing word2vec dimension from Google News Vector Dataset

I loaded Google's News vectors (negative300) dataset. Each word is represented by a 300-dimensional vector. I want to use this in my neural network for classification, but 300 dimensions per word seems too big. How can I reduce the vectors from 300 down to, say, 100 dimensions without compromising on quality?
tl;dr Use a dimensionality reduction technique like PCA or t-SNE.
This is not a trivial operation that you are attempting. In order to understand why, you must understand what these word vectors are.
Word embeddings are vectors that attempt to encode information about what a word means, how it can be used, and more. What makes them interesting is that they manage to store all of this information as a collection of floating point numbers, which is nice for interacting with models that process words. Rather than pass a word to a model by itself, without any indication of what it means, how to use it, etc, we can pass the model a word vector with the intention of providing extra information about how natural language works.
As I hope I have made clear, word embeddings are pretty neat. Constructing them is an area of active research, though there are a couple of ways to do it that produce interesting results. It's not incredibly important to this question to understand all of the different ways, though I suggest you check them out. Instead, what you really need to know is that each of the values in the 300 dimensional vector associated with a word were "optimized" in some sense to capture a different aspect of the meaning and use of that word. Put another way, each of the 300 values corresponds to some abstract feature of the word. Removing any combination of these values at random will yield a vector that may be lacking significant information about the word, and may no longer serve as a good representation of that word.
So, picking the top 100 values of the vector is no good. We need a more principled way to reduce the dimensionality. What you really want is to sample a subset of these values such that as much information as possible about the word is retained in the resulting vector. This is where a dimensionality reduction technique like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) comes into play. I won't describe in detail how these methods work, but essentially they aim to capture the essence of a collection of information while reducing the size of the vector describing said information. As an example, PCA does this by constructing a new vector from the old one, where the entries in the new vector correspond to combinations of the main "components" of the old vector, i.e. those components which account for most of the variance in the old data.
To summarize, you should run a dimensionality reduction algorithm like PCA or t-SNE on your word vectors. There are a number of Python libraries that implement both (e.g. scikit-learn has PCA and t-SNE implementations). Be warned, however, that the dimensionality of these word vectors is already relatively low. To see why, consider the task of naively representing a word via its one-hot encoding (a one at one spot and zeros everywhere else). If your vocabulary is as big as the Google word2vec model's, then each word is suddenly associated with a vector containing millions of entries! As you can see, the dimensionality has already been reduced significantly to 300, and any reduction that makes the vectors much smaller is likely to lose a good deal of information.
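If you go the PCA route, a minimal sketch with gensim 4.x and scikit-learn (the variable names are hypothetical):
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
vectors_300d = wv.vectors                         # (vocab_size, 300) matrix of all word vectors
pca = PCA(n_components=100)
vectors_100d = pca.fit_transform(vectors_300d)    # (vocab_size, 100); rows stay aligned with wv.index_to_key
print(pca.explained_variance_ratio_.sum())        # fraction of variance the 100 components retain
The full GoogleNews vocabulary is around 3M words, so in practice you may want to fit the PCA on a random subsample of rows and then call pca.transform on the rest.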
@narasimman I suggest that you simply keep the first 100 numbers in the output vector of the word2vec model. The output is of type numpy.ndarray, so you can do something like:
>>> word_vectors = KeyedVectors.load_word2vec_format('modelConfig/GoogleNews-vectors-negative300.bin', binary=True)
>>> type(word_vectors["hello"])
<type 'numpy.ndarray'>
>>> word_vectors["hello"][:10]
array([-0.05419922,  0.01708984, -0.00527954,  0.33203125, -0.25      ,
       -0.01397705, -0.15039062, -0.265625  ,  0.01647949,  0.3828125 ], dtype=float32)
>>> word_vectors["hello"][:2]
array([-0.05419922, 0.01708984], dtype=float32)
I don't think that this will screw up the result if you do it to all the words (not sure though!)

Classification with array of strings as input vector

I have a question related to a machine learning task. The problem is to predict a value based on a vector of strings. The most straightforward idea that came to mind was to use linear regression. However, since my input is non-numeric, I thought I'd use the hash codes of my strings, but I've read somewhere here that the results would be meaningless. Another idea was to encode my strings in base 26 using the letter positions in the alphabet, but I haven't tested it yet, hence I'm asking for advice.
Could someone recommend a good (meaningful) way of encoding strings so that they can be used in linear regression algorithm? Or suggest another machine learning algorithm suitable for the task.
To summarise: the input to the classifier will consist of a fixed-size array of strings (the arrays are of fixed length, not the strings), and the output should be an integer in the range 0-100. The training data will consist of a collection of such input arrays (x-values) with corresponding numbers (y-values).
Transform each one of your M strings into an N-dimensional vector using a vector space model like word2vec or GloVe. Then concatenate these vectors into one vector with M*N components. Optionally normalize each component to e.g. 0-1. You should then be able to run any regression (or classification) algorithm on the result, e.g. logistic regression; a sketch follows at the end of this answer.
You might also try a clustering approach, where you cluster all the words in your vocabulary into N clusters, e.g. with k-means on the word vectors or using Brown clustering. You could then represent each word in your input array with a one-hot vector (i.e. N-1 zeros and a single one at the index of that word's cluster). Then concatenate them again and run regression on the result.
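A minimal sketch of the first approach (concatenate word vectors, then regress), assuming gensim's downloader and scikit-learn; the words and target values are hypothetical toy data:
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LinearRegression
wv = api.load('glove-wiki-gigaword-50')   # small pre-trained GloVe model with 50-d vectors
def encode(words):                        # fixed-size array of M strings -> one M*50 vector
    return np.concatenate([wv[w] if w in wv.key_to_index else np.zeros(50) for w in words])
X = np.stack([encode(['red', 'car']), encode(['blue', 'bike'])])  # toy x-values, M=2 strings each
y = np.array([40, 70])                                            # toy y-values in the 0-100 range
reg = LinearRegression().fit(X, y)
print(reg.predict(np.stack([encode(['green', 'truck'])])))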
I did a similar project with strings. I am suggesting one way you can implement it.
In machine learning, a naive Bayes classifier will make your problem easy; it works on probability theory. If you are working with Python, NLTK (a toolkit) and TextBlob (a library built on NLTK) will help you a lot.
Your question is very generic, so I can't describe everything here, but feel free to ask about anything you are struggling with and I would be happy to answer.
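A minimal sketch with TextBlob's built-in naive Bayes classifier; the training pairs are hypothetical toy data, and note that this predicts discrete labels rather than the 0-100 integers from the question (you would need to bucket the y-values, or use a regression approach as in the other answer):
from textblob.classifiers import NaiveBayesClassifier
train = [('red fast car', 'high'), ('old broken bike', 'low'), ('shiny new truck', 'high')]
clf = NaiveBayesClassifier(train)
print(clf.classify('fast shiny car'))  # expected: 'high'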
