what is dimensionality in word embeddings? - nlp

I want to understand what is meant by "dimensionality" in word embeddings.
When I embed a word in the form of a matrix for NLP tasks, what role does dimensionality play? Is there a visual example which can help me understand this concept?

Answer
A Word Embedding is just a mapping from words to vectors. Dimensionality in word
embeddings refers to the length of these vectors.
Additional Info
These mappings come in different formats. Most pre-trained embeddings are
available as a space-separated text file, where each line contains a word in the
first position, and its vector representation next to it. If you were to split
these lines, you would find out that they are of length 1 + dim, where dim
is the dimensionality of the word vectors, and 1 corresponds to the word being represented. See the GloVe pre-trained
vectors for a real example.
For example, if you download glove.twitter.27B.zip, unzip it, and run the following python code:
#!/usr/bin/python3
with open('glove.twitter.27B.50d.txt') as f:
    lines = f.readlines()
lines = [line.rstrip().split() for line in lines]
print(len(lines))           # number of words (aka vocabulary size)
print(len(lines[0]))        # length of a line
print(lines[130][0])        # word 130
print(lines[130][1:])       # vector representation of word 130
print(len(lines[130][1:]))  # dimensionality of word 130
you would get the output
1193514
51
people
['1.4653', '0.4827', ..., '-0.10117', '0.077996'] # shortened for illustration purposes
50
Somewhat unrelated, but equally important, is that lines in these files are sorted according to the word frequency found in the corpus in which the embeddings were trained (most frequent words first).
You could also represent these embeddings as a dictionary where
the keys are the words and the values are lists representing word vectors. The length
of these lists would be the dimensionality of your word vectors.
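As a minimal sketch of the dictionary representation: the two lines below are made up (they only imitate the space-separated file format, they are not real GloVe values), and `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Split one 'word v1 v2 ... vD' line into (word, list of floats)."""
    parts = line.rstrip().split(' ')
    return parts[0], [float(x) for x in parts[1:]]

# Made-up 4-dimensional lines standing in for lines of a real embeddings file.
sample_lines = [
    "people 0.1 -0.2 0.3 0.4\n",
    "roads -0.5 0.6 0.7 -0.8\n",
]

# Keys are words, values are lists of floats.
embeddings = dict(parse_line(line) for line in sample_lines)

# The dimensionality is the length of any word's vector.
dim = len(embeddings["people"])
print(dim)
```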
A more common practice is to represent them as matrices (also called lookup
tables), of dimension (V x D), where V is the vocabulary size (i.e., how
many words you have), and D is the dimensionality of each word vector. In
this case you need to keep a separate dictionary mapping each word to its
corresponding row in the matrix.
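A small sketch of the lookup-table layout, assuming a toy vocabulary of three words with made-up 3-dimensional vectors (the names `word_to_row` and `lookup` are only for illustration):

```python
# Hypothetical toy vocabulary with made-up 3-dimensional vectors.
vocab = ["the", "roads", "highways"]
matrix = [
    [0.1, 0.2, 0.3],   # row 0 -> "the"
    [0.7, -0.1, 0.4],  # row 1 -> "roads"
    [0.6, -0.2, 0.5],  # row 2 -> "highways"
]

# Separate dictionary mapping each word to its row in the matrix.
word_to_row = {word: i for i, word in enumerate(vocab)}

def lookup(word):
    """Return the vector for `word` by indexing its row in the matrix."""
    return matrix[word_to_row[word]]

V, D = len(matrix), len(matrix[0])  # vocabulary size and dimensionality
print((V, D), lookup("roads"))
```

In practice the matrix would usually be a numeric array rather than nested lists, but the word-to-row bookkeeping stays the same.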
Background
Regarding your question about the role dimensionality plays, you'll need some theoretical background. But in a few words, the space in which words are embedded presents nice properties that allow NLP systems to perform better. One of these properties is that words that have similar meaning are spatially close to each other, that is, have similar vector representations, as measured by a distance metric such as the Euclidean distance or the cosine similarity.
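To make "spatially close" concrete, here is a sketch of both measures on made-up 3-dimensional vectors (real embeddings are learned and much longer; the numbers here are only chosen so that two words are similar and one is not):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up vectors, for illustration only.
roads = [0.7, -0.1, 0.4]
highways = [0.6, -0.2, 0.5]
banana = [-0.3, 0.8, 0.1]

print(cosine_similarity(roads, highways), cosine_similarity(roads, banana))
print(euclidean_distance(roads, highways), euclidean_distance(roads, banana))
```

With these toy values, "roads" and "highways" have a higher cosine similarity and a smaller Euclidean distance than "roads" and "banana".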
You can visualize a 3D projection of several word embeddings here, and see, for example, that the closest words to "roads" are "highways", "road", and "routes" in the Word2Vec 10K embedding.
For a more detailed explanation I recommend reading the section "Word Embeddings" of this post by Christopher Olah.
For more theory on why using word embeddings, which are an instance of distributed representations, is better than using, for example, one-hot encodings (local representations), I recommend reading the first sections of Distributed Representations by Geoffrey Hinton et al.

Word embeddings like word2vec or GloVe don't embed words in two-dimensional matrices; they use one-dimensional vectors. "Dimensionality" refers to the size of these vectors. It is separate from the size of the vocabulary, which is the number of words you actually keep vectors for rather than discard.
In theory larger vectors can store more information since they have more possible states. In practice there's not much benefit beyond a size of 300-500, and in some applications even smaller vectors work fine.
Here's a graphic from the GloVe homepage.
The dimensionality of the vectors is shown on the left axis; decreasing it would make the graph shorter, for example. Each column is an individual vector with color at each pixel determined by the number at that position in the vector.

The "dimensionality" of a word embedding is the total number of features that it encodes. This is an oversimplification of the definition, but I will come to that later.
The features are usually not selected manually; they are learned automatically by the hidden layer during training. Depending on the corpus, the most useful dimensions (features) are selected. For example, if the corpus consists of romantic fiction, a dimension for gender is much more likely to be represented than if it consists of mathematics literature.
Once you have word embedding vectors of 100 dimensions (for example) generated by a neural network for 100,000 unique words, it is generally not very useful to investigate the purpose of each dimension and try to label each one with a "feature name". The feature(s) that each dimension represents may not be simple or orthogonal, and since the process is automatic, nobody knows exactly what each dimension represents.
For more insight to understand this topic you may find this post useful.

Textual data has to be converted into numeric data before feeding into any Machine Learning algorithm.
Word Embedding is an approach for this where each word is mapped to a vector.
In algebra, a vector is a point in space with magnitude and direction.
In simpler terms, a vector is a 1-dimensional vertical array (or, say, a matrix with a single column), and dimensionality is the number of elements in that 1-D array.
Pre-trained word embedding models like GloVe and word2vec are offered in several dimensionalities, for instance 50, 100, 200, or 300. Each word is a point in a D-dimensional space, and synonyms are points close to each other. Higher dimensionality generally gives better accuracy, but the computational cost also goes up.

I'm not an expert, but I think the dimensions just represent the variables (aka attributes or features) which have been assigned to the words, although there may be more to it than that. The meaning of each dimension and total number of dimensions will be specific to your model.
I recently saw this embedding visualisation from the TensorFlow library:
https://www.tensorflow.org/get_started/embedding_viz
This particularly helps reduce high-dimensional models down to something human-perceivable. If you have more than three variables it's extremely difficult to visualise the clustering (unless you are Stephen Hawking apparently).
This Wikipedia article on dimensionality reduction and related pages discuss how features are represented in dimensions, and the problems of having too many.

According to the book Neural Network Methods for Natural Language Processing by Goldberg, dimensionality in word embeddings (d_emb) refers to the number of columns in the first weight matrix (the weights between the input layer and the hidden layer) of embedding algorithms such as word2vec. N in the image is the dimensionality of the word embedding:
For more information you can refer to this link:
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Related

What is the difference between word2vec, glove, and elmo? [duplicate]

What is the difference between word2vec and glove?
Are both the ways to train a word embedding? if yes then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given a context (CBOW method) or the context words from the target (skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
GloVe is based on matrix factorization techniques on the word-context matrix. It first constructs a large matrix of (words x context) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (matrix values) we see this word in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (word x features) matrix, where each row now yields a vector representation for each word. In general, this is done by minimizing a “reconstruction loss”. This loss tries to find the lower-dimensional representations which can explain most of the variance in the high-dimensional data.
Before GloVe, algorithms for word representations fell into two main streams: statistics-based (LSA) and learning-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a three-layer neural network to do the center-context word pair classification task, where the word vectors are just a by-product.
The most amazing point about Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can expose semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA cannot maintain such linear relationships in the vector space.
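The vector arithmetic can be sketched with hand-made 2-dimensional toy vectors (one axis loosely standing for "royalty", the other for "gender"). These values are invented so that the analogy works exactly; learned embeddings only satisfy it approximately, and `analogy` is a hypothetical helper.

```python
import math

# Hand-made toy vectors; real embeddings are learned, not set by hand.
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three query words from the candidates is the usual convention, since the nearest vector to the result is otherwise often one of the inputs.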
The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. It is thus a hybrid method that applies machine learning to the statistics matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the deduction procedure of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), to consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it
does with gas, whereas steam co-occurs more frequently with gas than
it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific of steam.
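The effect of taking ratios can be illustrated with a few made-up probabilities. The numbers below are not the published GloVe figures; they are invented only to mimic the qualitative pattern described in the quote.

```python
# Made-up co-occurrence probabilities P(probe | target), for illustration only.
p_ice = {"solid": 2e-4, "gas": 5e-5, "water": 3e-3, "fashion": 2e-5}
p_steam = {"solid": 2e-5, "gas": 5e-4, "water": 3e-3, "fashion": 2e-5}

# Ratio P(probe | ice) / P(probe | steam) for each probe word.
ratios = {probe: p_ice[probe] / p_steam[probe] for probe in p_ice}

for probe, ratio in ratios.items():
    print(probe, round(ratio, 2))
```

Probes specific to ice ("solid") get a ratio well above 1, probes specific to steam ("gas") get a ratio well below 1, and probes that are shared or unrelated ("water", "fashion") end up near 1, even though their raw probabilities differ by orders of magnitude.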
Word2Vec, in contrast, works on the raw co-occurrence probabilities: it maximizes the probability that the words surrounding the target word are its context.
In practice, to speed up training, Word2Vec employs negative sampling, which replaces the softmax function with a sigmoid function operating on the real data and noise data. This implicitly results in the clustering of words into a cone in the vector space, while GloVe’s word vectors are located more discretely.

reducing word2vec dimension from Google News Vector Dataset

I loaded Google's news vector -300 dataset. Each word is represented by a 300-dimensional vector. I want to use this in my neural network for classification, but 300 values per word seems too big. How can I reduce the vectors from 300 to, say, 100 dimensions without compromising quality?
tl;dr Use a dimensionality reduction technique like PCA or t-SNE.
This is not a trivial operation that you are attempting. In order to understand why, you must understand what these word vectors are.
Word embeddings are vectors that attempt to encode information about what a word means, how it can be used, and more. What makes them interesting is that they manage to store all of this information as a collection of floating point numbers, which is nice for interacting with models that process words. Rather than pass a word to a model by itself, without any indication of what it means, how to use it, etc, we can pass the model a word vector with the intention of providing extra information about how natural language works.
As I hope I have made clear, word embeddings are pretty neat. Constructing them is an area of active research, though there are a couple of ways to do it that produce interesting results. It's not incredibly important to this question to understand all of the different ways, though I suggest you check them out. Instead, what you really need to know is that each of the values in the 300 dimensional vector associated with a word were "optimized" in some sense to capture a different aspect of the meaning and use of that word. Put another way, each of the 300 values corresponds to some abstract feature of the word. Removing any combination of these values at random will yield a vector that may be lacking significant information about the word, and may no longer serve as a good representation of that word.
So, picking the top 100 values of the vector is no good. We need a more principled way to reduce the dimensionality. What you really want is a smaller vector that retains as much information as possible about the word. This is where a dimensionality reduction technique like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) comes into play. I won't describe in detail how these methods work, but essentially they aim to capture the essence of a collection of information while reducing the size of the vector describing said information. As an example, PCA does this by constructing a new vector from the old one, where the entries in the new vector correspond to combinations of the main "components" of the old vector, i.e., those components which account for most of the variance in the old data.
To summarize, you should run a dimensionality reduction algorithm like PCA or t-SNE on your word vectors. There are a number of Python libraries that implement both (e.g., scikit-learn has a PCA implementation). Be warned, however, that the dimensionality of these word vectors is already relatively low. To see how this is true, consider the task of naively representing a word via its one-hot encoding (a one at one spot and zeros everywhere else). If your vocabulary size is as big as the Google word2vec model, then each word is suddenly associated with a vector containing hundreds of thousands of entries! As you can see, the dimensionality has already been reduced significantly to 300, and any reduction that makes the vectors significantly smaller is likely to lose a good deal of information.
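A rough sketch of the PCA route, written with NumPy's SVD rather than a dedicated PCA class. The random matrix stands in for the real (vocabulary x 300) embedding matrix, and `pca_reduce` is a hypothetical helper, so treat this as an outline under those assumptions rather than a tuned recipe.

```python
import numpy as np

def pca_reduce(vectors, k):
    """Project row vectors onto their top-k principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA assumes mean-centered data
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

# Random stand-in for a real embedding matrix with 300-dimensional rows.
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(1000, 300))

reduced = pca_reduce(fake_vectors, 100)
print(reduced.shape)  # (1000, 100)
```

The projection has to be fitted on the whole set of word vectors (or a large sample of it) and then applied consistently to every word. t-SNE is mostly used for 2-D or 3-D visualization, so PCA is probably the more natural fit for a 300-to-100 reduction.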
@narasimman I suggest that you simply keep the first 100 numbers in the output vector of the word2vec model. The output is of type numpy.ndarray so you can do something like:
>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('modelConfig/GoogleNews-vectors-negative300.bin', binary=True)
>>> type(word_vectors["hello"])
<type 'numpy.ndarray'>
>>> word_vectors["hello"][:10]
array([-0.05419922, 0.01708984, -0.00527954, 0.33203125, -0.25 ,
-0.01397705, -0.15039062, -0.265625 , 0.01647949, 0.3828125 ], dtype=float32)
>>> word_vectors["hello"][:2]
array([-0.05419922, 0.01708984], dtype=float32)
I don't think that this will screw up the result if you do it to all the words (not sure though!)

Creation of Position vectors in Convolution Neural Network for Relation Classification

This question pertains to the use of position vectors in CNN for relation classification as described in multiple publications such as the following by Zeng et al: http://www.aclweb.org/anthology/C14-1220
I am trying to implement such a model in tensorflow. My questions are as follows:
Is there any benefit to using randomly initialized vectors for denoting positional information? For instance, why not use one-hot vector encoding with say 100 dimensions to denote the positions? Is it not recommended to combine one-hot vectors with dense word vectors?
Is there a minimum dimension the positional vectors should have, depending on the dimensions of the word vectors? For instance, suppose the word vector dimension is 500, will a dimension of say 10 for the position vectors be too small to be of value in the model? Is there a range of dimensions that is known to perform well with position vectors?
Does the distance between the randomly initialized vectors for encoding positional information matter?
Thanks a lot for taking the time to look into this!
Regarding question 1, I don't have an explanation why combining one-hot and dense representations is bad, but empirically, looking at results reported by other people, it seems to be better to learn embeddings for the positions as well.
Yoav Goldberg also notes this in his NLP Deep Learning book (p. 96):
In the “traditional” NLP setup,
distances are usually encoded by binning the distances into several groups (i.e., 1, 2, 3, 4, 5–10,
10+) and associating each bin with a one-hot vector. In a neural architecture, where the input
vector is not composed of binary indicator features, it may seem natural to allocate a single input
entry to the distance feature, where the numeric value of that entry is the distance.
However, this
approach is not taken in practice. Instead, distance features are encoded similarly to the other
feature types: each bin is associated with a d-dimensional vector, and these distance-embedding
vectors are then trained as regular parameters in the network [dos Santos et al., 2015, Nguyen
and Grishman, 2015, Zeng et al., 2014, Zhu et al., 2015a].
Maybe you can find more insights into why embeddings are better by looking into the cited papers.
With regard to question 2, I would say as long as the dimensionality is big enough for the model to learn different embeddings for each position you want to encode, it should be fine. So they could be quite small in practice.
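As a sketch of the setup described in the quote: each distance bin gets its own small vector, which is concatenated with the word vector. The position dimension of 5, the word dimension of 50, and the handling of distance 0 are all assumptions for illustration. In a real model the position vectors would be trainable parameters updated by backpropagation, which this sketch leaves out.

```python
import random

BINS = ["1", "2", "3", "4", "5-10", "10+"]

def distance_bin(distance):
    """Map a token distance to one of the bins above.

    Assumption: a distance of 0 is folded into the "1" bin.
    """
    d = max(1, abs(distance))
    if d <= 4:
        return str(d)
    return "5-10" if d <= 10 else "10+"

rng = random.Random(0)
POS_DIM = 5  # hypothetical position-embedding size

# One randomly initialized vector per bin (these would be trained in practice).
position_embeddings = {
    b: [rng.uniform(-0.1, 0.1) for _ in range(POS_DIM)] for b in BINS
}

def token_input(word_vector, distance):
    """Concatenate a word vector with the embedding of its distance bin."""
    return word_vector + position_embeddings[distance_bin(distance)]

x = token_input([0.0] * 50, distance=7)
print(len(x))  # 55
```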

why we use input-hidden weight matrix to be the word vectors instead of hidden-output weight matrix?

In word2vec, after training, we get two weight matrices: (1) the input-hidden weight matrix and (2) the hidden-output weight matrix. People use the input-hidden weight matrix as the word vectors (each row corresponds to a word, i.e., is that word's vector). Here is where I am confused:
Why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?
Why don't we just apply the softmax activation function to the hidden layer rather than the output layer, to avoid the time-consuming computation?
Plus, clarifying remarks on the intuition of how word vectors can be obtained like this will be appreciated.
Regarding the two, input-hidden weight matrix and hidden-output weight matrix, there is an interesting research paper.
'A Dual Embedding Space Model for Document Ranking', Mitra et al., arXiv 2016. (https://arxiv.org/pdf/1602.01137.pdf).
Similar to your question, this paper studies how these two weight matrices differ, and claims that they encode different characteristics of words.
Overall, from my understanding, it is your choice to use either the input-hidden weight matrix (convention), hidden-output weight matrix, or the combined one as word embeddings, depending on your data and the problem to solve.
For question 1:
This is because the input weight matrix is for the target word, while the output weight matrix is for a context word. The vector we attempt to learn for a word is the vector of the word itself as the target word, since the intuition behind word2vec is that words (as target words!) which occur in similar contexts learn similar vector representations.
The vector for a context word exists only for training purposes. It's possible to use the same vector as for the target word, but learning the two separately is better. For example, if you use the same vector representations, the model would yield the highest probability for a word occurring in the context of itself (the dot product of two identical vectors), which is obviously counterintuitive (how often do you use two identical words one after another?).

How does word2vec or skip-gram model convert words to vector?

I have been reading a lot of papers on NLP, and came across many models. I got the SVD Model and representing it in 2-D, but I still did not get how do we make a word vector by giving a corpus to the word2vec/skip-gram model? Is it also co-occurrence matrix representation for each word? Can you explain it by taking an example corpus:
Hello, my name is John.
John works in Google.
Google has the best search engine.
Basically, how does skip gram convert John to a vector?
I think you will need to read a paper about the training process. Basically the values of the vectors are the node values of the trained neural network.
I tried to read the original paper but I think the paper "word2vec Parameter Learning Explained" by Xin Rong has a more detailed explanation.
The main concept can be easily understood with an example of Autoencoding with neural networks. You train the neural network to pass information from the input layer to the output layer through the middle layer which is smaller.
In a traditional autoencoder, you have an input vector of size N, a middle layer of length M<N, and the output layer, again of size N. You want only one unit at a time turned on in your input layer, and you train the network to replicate in the output layer the same unit that is turned on in the input layer.
After the training has completed successfully, you will see that the neural network, in order to transport the information from the input layer to the output layer, has adapted itself so that each input unit has a corresponding vector representation in the middle layer.
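Why each input unit ends up with "a corresponding vector representation" can be seen in a few lines: multiplying a one-hot input by the input-to-middle weight matrix simply picks out one row of that matrix. The weights below are made up, and `matvec_onehot` is a hypothetical helper.

```python
def matvec_onehot(one_hot, W):
    """Multiply a length-V one-hot row vector by a V x M weight matrix."""
    M = len(W[0])
    return [sum(one_hot[i] * W[i][j] for i in range(len(W))) for j in range(M)]

# Made-up weights: 4 input units (V = 4), middle layer of size 2 (M = 2).
W = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

one_hot = [0, 0, 1, 0]  # only the third input unit is turned on
hidden = matvec_onehot(one_hot, W)
print(hidden)  # [0.5, 0.6]
```

So the middle-layer activation for input unit i is exactly row i of the weight matrix, which is why that row can be read off as the unit's vector.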
Simplifying a bit, in the context of word2vec your input and output vectors work more or less in the same way, except for the fact that in the sample you submit to the network the unit turned on in the input layer is different from the unit turned on in the output layer.
In fact you train the network picking pairs of nearby (not necessarily adjacent) words from your corpus and submitting them to the network.
The size of the input and output vector is equal to the size of the vocabulary you are feeding to the network.
Your input vector has only one unit turned on (the one corresponding to the first word of the chosen pair) the output vector has one unit turned on (the one corresponding to the second word of chosen pair).
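As a sketch of how those training pairs could be produced from the example corpus in the question: the tokenizer is deliberately crude and the window size of 2 is an arbitrary choice, both are assumptions for illustration.

```python
corpus = [
    "Hello, my name is John.",
    "John works in Google.",
    "Google has the best search engine.",
]

def tokenize(sentence):
    """Lowercase and strip basic punctuation (a deliberately crude tokenizer)."""
    return [w.strip(".,").lower() for w in sentence.split()]

def skipgram_pairs(tokens, window=2):
    """Return (input word, output word) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = [p for sentence in corpus for p in skipgram_pairs(tokenize(sentence))]
vocab = sorted({w for sentence in corpus for w in tokenize(sentence)})

# The input and output layers each have len(vocab) units; every pair turns on
# one unit in the input layer and one unit in the output layer.
print(len(vocab), pairs[:3])
```

With this setup "john" is paired with "is", "name", "works", and "in", so its vector is shaped by those neighbors.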
For current readers who might also be wondering "what does a word vector exactly mean" as the OP was at that time: As described at http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf, a word vector is of dimension n, and n "is an arbitrary size which defines the size of our embedding space." That is to say, this word vector doesn't mean anything concretely. It's just an abstract representation of certain qualities that this word might have, that we can use to distinguish words.
In fact, to directly answer the original question of "how is a word converted to a vector representation", the values of a vector embedding for a word are usually just randomized at initialization, and improved iteration by iteration.
This is common in deep learning/neural networks, where the human beings who created the network themselves usually don't have much idea about what the values exactly stand for. The network itself is supposed to figure the values out gradually, through learning. They just abstractly represent something and distinguish things. One example would be AlphaGo, where it would be impossible for the DeepMind team to explain what each value in a vector stands for. It just works.
First of all, you normally don't use SVD with the Skip-Gram model, because Skip-Gram is based on a neural network. You use SVD when you want to reduce the dimension of your word vectors (e.g., for visualization in 2D or 3D space), but in a neural net you construct your embedding matrices with the dimension of your choice. You use SVD if you constructed your embedding matrix from a co-occurrence matrix.
Vector representation with co-occurrence matrix
I wrote an article about this here.
Consider the following two sentences: "all that glitters is not gold" + "all is well that ends well"
Co-occurrence matrix is then:
With co-occurrence matrix, each row is a word vector for the word. However as you can see in the matrix constructed above, each row has 10 columns. This means that the word vectors are 10-dimensional, and can't be visualized in 2D or 3D space. So we run SVD to reduce it to 2 dimension:
Now that the word vectors are 2-dimensional, they can be visualized in a 2D space:
However, reducing the word vectors to 2D results in a significant loss of meaningful data, which is why you shouldn't reduce the dimensionality too much.
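The two-sentence example can be sketched with NumPy as follows. The boundary tokens `<START>` and `<END>` and the window size of 1 are assumptions; with them the vocabulary has 10 distinct tokens, which is consistent with the 10 columns mentioned above.

```python
import numpy as np

sentences = [
    "<START> all that glitters is not gold <END>".split(),
    "<START> all is well that ends well <END>".split(),
]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of tokens appears next to each other (window = 1).
M = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):
            if 0 <= j < len(s):
                M[index[w], index[s[j]]] += 1

# Each row of M is a 10-dimensional word vector; SVD reduces it to 2 dimensions.
U, S, Vt = np.linalg.svd(M)
reduced = U[:, :2] * S[:2]

print(M.shape, reduced.shape)  # (10, 10) (10, 2)
```

Each row of `reduced` is a 2-dimensional coordinate for one word and can be plotted directly.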
Let's take another example: achieve and success. Let's say they have 10-dimensional word vectors:
Since achieve and success convey similar meanings, their vector representations are similar. Notice their similar values & color band pattern. However, since these are 10-dimensional vectors, these can't be visualized. So we run SVD to reduce the dimension to 3D, and visualize them:
Each value in the word vector represents the word's position within the vector space. Similar words will have similar vectors, and as a result, will be placed close to each other in the vector space.
Vector representation with Skip-Gram
I wrote an article about it here.
Skip-Gram uses neural net, and therefore does not use SVD because you can specify the word vector's dimension as a hyper-parameter when you first construct the network (if you really need to visualize, then we use a special technique called t-SNE, but not SVD).
Skip-Gram has the following structure:
With Skip-Gram, N-dimensional word vectors are randomly initialized. There are two embedding matrices: the input weight matrix W_input and the output weight matrix W_output.
Let's take W_input as an example. Assume that the words of interest are passes and should. Since the randomly initialized weight matrix is 3-dimensional, they can be visualized:
These weight matrices (W_input and W_output) are optimized by predicting a center word's neighboring words, and updating the weights in a way that minimizes prediction error. The predictions are computed for each context word of a center word, and their prediction errors are summed up to calculate the weight gradients.
The weight matrices update equations are:
These updates are applied for each training sample within the corpus (since Word2Vec uses stochastic gradient descent).
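A minimal sketch of one such update for vanilla Skip-Gram with a full softmax, under some simplifying assumptions: it handles a single (center, context) pair rather than summing errors over all context words, both matrices are stored as V x N lists with one row per word, and the learning rate and toy sizes are arbitrary. `sgd_step` is a hypothetical helper, not code from the article.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(W_input, W_output, center, context, lr=0.1):
    """Apply one SGD update for a (center, context) pair of word indices.

    Returns the cross-entropy loss measured before the update.
    """
    h = list(W_input[center])  # hidden layer = center word's row
    scores = [sum(a * b for a, b in zip(h, row)) for row in W_output]
    probs = softmax(scores)
    loss = -math.log(probs[context])
    # Prediction error per word: predicted probability minus one-hot target.
    err = [p - (1.0 if i == context else 0.0) for i, p in enumerate(probs)]
    # Gradient w.r.t. the hidden layer, computed before W_output changes.
    grad_h = [sum(err[i] * W_output[i][j] for i in range(len(W_output)))
              for j in range(len(h))]
    for i, row in enumerate(W_output):
        for j in range(len(h)):
            row[j] -= lr * err[i] * h[j]
    for j in range(len(h)):
        W_input[center][j] -= lr * grad_h[j]
    return loss

rng = random.Random(0)
V, N = 5, 3  # toy vocabulary size and embedding dimensionality
W_input = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_output = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

losses = [sgd_step(W_input, W_output, center=1, context=3) for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating the step on the same pair should drive the loss down, which is the "randomly initialized, then improved iteration by iteration" behavior in miniature.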
Vanilla Skip-Gram vs Negative Sampling
The above Skip-Gram illustration assumes that we use vanilla Skip-Gram. In real life, we don't use vanilla Skip-Gram because of its high computational cost. Instead, we use an adapted form of Skip-Gram, called negative sampling.
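To give a feel for why negative sampling is cheaper, here is a sketch of its per-pair loss as commonly formulated: one sigmoid term for the true context word plus one for each of a handful of sampled "noise" words, instead of a softmax over the whole vocabulary. The vectors are made up, and choosing the noise words is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, context, negatives):
    """-log sigmoid(context . center) - sum over negatives of log sigmoid(-neg . center)."""
    loss = -math.log(sigmoid(dot(context, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss

# Made-up 2-dimensional vectors, for illustration only.
center = [0.5, 0.1]

# True context aligned with the center word, noise words pointing away: low loss.
good = negative_sampling_loss(center, [1.0, 0.2], [[-1.0, 0.0], [-0.5, -0.5]])
# The opposite arrangement: high loss.
bad = negative_sampling_loss(center, [-1.0, -0.2], [[1.0, 0.0], [0.5, 0.5]])

print(good, bad)
```

Only the vectors of the center word, the true context word, and the few sampled noise words are touched per training pair, so the cost no longer grows with the vocabulary size.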

Resources