Word2vec CBOW model implementations, deviations from the original algorithm - pytorch

I am trying to implement the CBOW model in PyTorch.
My understanding from the word2vec explanation is that the model has 2 layers (and therefore 2 matrices). The first matrix holds the low-dimensional word vectors; it is effectively a lookup table, and the vector for a word is projected onto the projection layer (no non-linearity, so it is not a hidden layer). The word vectors are then multiplied by the second matrix, and the result goes to the output through a softmax function. After training, the first matrix can be used as the word embedding.
I see many implementations that use 3 layers (1 embedding layer plus 2 more layers), which contradicts my understanding above. Some example implementations are here, here and here.
The following three lines of code are commonly used to implement the 3-layer model:
self.embeddings = nn.Embedding(vocab_size, embedding_dim)
self.linear1 = nn.Linear(context_size * embedding_dim, 128)
self.linear2 = nn.Linear(128, vocab_size)
My questions are: if my understanding is correct, why are they using 3 layers? Are there any advantages?
One obvious disadvantage, I think, is that it will be computationally expensive.
Word2vec resembles the idea of an autoencoder (which also has two layers); deviating from this proven idea might harm the embedding quality. Am I right?
Another important point: according to the paper I mentioned above, for multiple context words the average of the vectors is projected onto the projection layer. But instead of averaging, these implementations concatenate the vectors. Why is that? Are there any advantages?
Also, they use a non-linearity at the hidden layer, which I think will create a serious performance issue when training on a huge amount of data. Right?
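For comparison with the 3-layer snippet above, here is a minimal sketch of the 2-matrix variant the question describes: an embedding lookup, an average over the context words, and a single linear map to vocabulary scores. The class name, the toy sizes and the random batch are assumptions made for illustration, not code from any of the linked implementations.

```python
import torch
import torch.nn as nn


class CBOW(nn.Module):
    """Two-matrix CBOW: embedding lookup, average, linear output (no hidden layer)."""

    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        # First matrix: the word vectors (a lookup table).
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Second matrix: maps the projection back to vocabulary scores.
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, context_size) integer word indices
        projected = self.embeddings(context_ids).mean(dim=1)  # average, not concatenate
        return self.output(projected)  # raw scores; softmax is applied inside the loss


vocab_size, embedding_dim = 50, 8                # toy sizes chosen for illustration
model = CBOW(vocab_size, embedding_dim)
context = torch.randint(0, vocab_size, (4, 6))   # batch of 4, 6 context words each
target = torch.randint(0, vocab_size, (4,))
logits = model(context)
loss = nn.CrossEntropyLoss()(logits, target)     # log-softmax + negative log-likelihood
loss.backward()
```

Because the context vectors are averaged, the size of the second matrix does not depend on the context size, unlike nn.Linear(context_size * embedding_dim, 128) in the concatenating version. This sketch still uses a full softmax over the vocabulary, which is the expensive part that hierarchical softmax or negative sampling is meant to avoid.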

Related

Can we deduce the relationship b/w a dimension of a word vector with the linguistic characteristic it represents?

Let's imagine we generated a 200-dimensional word vector for the word 'hello' using any pre-trained model, as shown in the image below.
So, by any means can we tell which linguistic feature is represented by each d_i of this vector?
For example, d1 might be looking at whether the word is a noun; d2 might tell whether the word is a named entity or not and so on.
Because these word vectors are dense distributional representations, it is often difficult or impossible to interpret individual neurons, and such models often do not localize interpretable features to a single neuron (though this is an active area of research). For example, see Analyzing Individual Neurons in Pre-trained Language Models for a discussion of this with respect to pre-trained language models.
A common method for studying how individual dimensions contribute to a particular phenomenon / task of interest is to train a linear model (i.e., logistic regression if the task is classification) to perform the task from fixed vectors, and then analyze the weights of the trained linear model.
For example, if you're interested in part of speech, you can train a linear model to map from the word vector to the POS [1]. Then, the weights of the linear model represent a linear combination of the dimensions that are predictive of the feature. For example, if the weight on the 5th neuron has large magnitude (very positive or very negative), you might expect that neuron to be somewhat correlated with the phenomenon of interest.
[1]: Note that defining a POS for a particular word is nontrivial, since the POS often depends on context. For example, "play" can be a noun ("he saw a play") or a verb ("I will play in the grass").
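The probing idea above can be sketched as follows. This is only an illustration on synthetic data: the "word vectors" are random numbers, and the binary label (think "noun vs. not noun") is deliberately planted on dimension 5 so that there is something for the probe to find. With real embeddings you would load the vectors and real POS labels instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real embeddings: 400 "words", 20 dimensions.
n_words, dim = 400, 20
vectors = rng.normal(size=(n_words, dim))
# Hypothetical binary label, planted on dimension 5 for the sake of the demo.
labels = (vectors[:, 5] > 0).astype(float)

# Logistic regression trained with plain gradient descent.
weights = np.zeros(dim)
bias = 0.0
for _ in range(500):
    probs = 1.0 / (1.0 + np.exp(-(vectors @ weights + bias)))
    error = probs - labels
    weights -= 0.5 * (vectors.T @ error) / n_words
    bias -= 0.5 * error.mean()

# Rank dimensions by the magnitude of their probe weight.
ranking = np.argsort(-np.abs(weights))
accuracy = ((vectors @ weights + bias > 0) == (labels == 1)).mean()
print("most predictive dimension:", ranking[0], "accuracy:", round(accuracy, 3))
```

On real vectors the weight is usually spread over many dimensions rather than concentrated in one, which is exactly the "features are not localized to a single neuron" point made above.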

What is the difference between word2vec, glove, and elmo? [duplicate]

What is the difference between word2vec and glove?
Are both of them ways to train a word embedding? If yes, then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given its context (the CBOW method) or the context words given the target (the skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions, it ends up with better embeddings.
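To make the two prediction tasks concrete, here is a small sketch of how training examples could be generated from a sentence. The function names and the fixed window size are assumptions for illustration; real implementations add refinements that are left out here.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word): predict the centre word from its neighbours."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target


def skipgram_examples(tokens, window):
    """Yield (centre_word, context_word): predict each neighbour from the centre word."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


tokens = "google has the best search engine".split()
cbow = list(cbow_examples(tokens, window=1))
skipgram = list(skipgram_examples(tokens, window=1))
print(cbow[1])       # (['google', 'the'], 'has')
print(skipgram[:3])  # [('google', 'has'), ('has', 'google'), ('has', 'the')]
```

The same window of text gives one CBOW example (many inputs, one output) but several skip-gram examples (one input, one output each).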
GloVe is based on matrix factorization of the word-context matrix. It first constructs a large matrix of (words x context) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (the matrix values) that word appears in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (word x features) matrix, where each row is the vector representation of one word. In general, this is done by minimizing a “reconstruction loss”, which tries to find the lower-dimensional representations that explain most of the variance in the high-dimensional data.
Before GloVe, word-representation algorithms fell into two main streams: statistics-based (LSA) and learning-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a three-layer neural network to perform the center-context word pair classification task, where the word vectors are just a by-product.
The most amazing point about Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can express semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA does not maintain such linear relationships in the vector space.
The motivation of GloVe is to force the model to learn such linear relationships explicitly from the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. It is a hybrid method that applies machine learning to the statistics matrix, and this is the general difference between GloVe and Word2Vec.
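For reference, the weighted least-squares objective mentioned above is usually written as follows, in the notation of the GloVe paper: X_{ij} is the co-occurrence count of words i and j, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are biases, and V is the vocabulary size.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function f down-weights rare co-occurrences and caps the influence of very frequent ones; since f(0) = 0, word pairs that never co-occur drop out of the sum.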
If we dive into the deduction procedure of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), to consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
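A toy calculation shows the effect. The counts below are made up purely for illustration (they are not the figures from the GloVe page), and normalizing over just these four probe words is a simplification.

```python
# Made-up co-occurrence counts, for illustration only (not taken from any corpus).
counts = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 2},
    "steam": {"solid": 3,  "gas": 60, "water": 280, "fashion": 2},
}


def probability(word, probe):
    """P(probe | word), estimated from the toy counts above."""
    return counts[word][probe] / sum(counts[word].values())


ratios = {
    probe: probability("ice", probe) / probability("steam", probe)
    for probe in counts["ice"]
}
for probe, ratio in ratios.items():
    print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")
```

With these numbers the ratio is far above 1 for solid, far below 1 for gas, and close to 1 for both water and fashion, even though water is frequent and fashion is rare.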
However, Word2Vec works on the raw co-occurrence probabilities, maximizing the probability that the words surrounding the target word are its context.
In practice, to speed up training, Word2Vec employs negative sampling, which replaces the softmax function with a sigmoid function applied to real data and noise data. This implicitly results in the clustering of words into a cone in the vector space, while GloVe’s word vectors are located more discretely.
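As a rough sketch of what "sigmoid on real data and noise data" means, the loss for one (centre, context) pair with k sampled noise words can be written like this. The vectors are arbitrary toy values and the function names are mine.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, true_context, noise_contexts):
    """Loss for one real (centre, context) pair plus k sampled noise words."""
    # Push the score of the real pair up ...
    loss = -math.log(sigmoid(dot(center, true_context)))
    # ... and the scores of the noise pairs down.
    for noise in noise_contexts:
        loss -= math.log(sigmoid(-dot(center, noise)))
    return loss


center = [0.5, -0.2, 0.1]
true_context = [0.4, -0.1, 0.3]
noise_contexts = [[-0.3, 0.2, 0.0], [0.1, 0.4, -0.5]]
print(negative_sampling_loss(center, true_context, noise_contexts))
```

Each training pair needs only k + 1 dot products instead of one score per vocabulary word, which is where the speed-up over the full softmax comes from.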

Reducing input dimensions for a deep learning model

I am following a course on deep learning and I have a model built with Keras. After data preprocessing and encoding of the categorical data, I get an array of shape (12500,) as the input to the model. An input this large makes training slow and laggy. Is there an approach to reduce the dimensionality of the inputs?
The inputs are categorised geo coordinates, weather info, time, and distance, and I am trying to predict the travel time between two geo coordinates.
The original dataset has 8 features, 5 of which are categorical. I used one-hot encoding for the categorical data. The geo coordinates have 6000 categories, weather has 15 categories, and time has 96 categories. Altogether, after one-hot encoding, I get an array of shape (12500,) as the input to the model.
When the number of categories is large, one-hot encoding becomes too inefficient. The extreme example of this is processing of sentences in a natural language: in this task the vocabulary often has 100k or even more words. Obviously the translation of a 10-word sentence into a [10, 100000] matrix, almost all of which is zero, would be a waste of memory.
What researchers use instead is an embedding layer, which learns a dense representation of a categorical feature. In the case of words, it's called a word embedding, e.g. word2vec. This representation is much smaller, something like 100-dimensional, and lets the rest of the network work efficiently with 100-d input vectors rather than 100000-d vectors.
In Keras, it's implemented by an Embedding layer, which I think would work perfectly for your geo and time features, while the others will probably work fine with one-hot encoding. This means that your model is no longer Sequential, but rather has several inputs, some of which go through an embedding layer. The main model would take the concatenation of the learned representations and do the regression inference.
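A minimal sketch of such a multi-input model with the Keras functional API might look like this. The category counts 6000 and 96 come from the question; the embedding widths (16 and 8), the width of the remaining features (20), the hidden layer size and all layer names are arbitrary assumptions, and the geo feature is treated here as a single integer id.

```python
from tensorflow import keras

# 6000 and 96 are from the question; n_other and the embedding widths are assumptions.
n_geo, n_time, n_other = 6000, 96, 20

geo_in = keras.Input(shape=(1,), dtype="int32", name="geo_id")
time_in = keras.Input(shape=(1,), dtype="int32", name="time_slot")
other_in = keras.Input(shape=(n_other,), name="other_features")  # one-hot / numeric columns

# Each integer id is mapped to a small dense vector that is learned during training.
geo_vec = keras.layers.Flatten()(keras.layers.Embedding(n_geo, 16)(geo_in))
time_vec = keras.layers.Flatten()(keras.layers.Embedding(n_time, 8)(time_in))

merged = keras.layers.Concatenate()([geo_vec, time_vec, other_in])
hidden = keras.layers.Dense(64, activation="relu")(merged)
travel_time = keras.layers.Dense(1)(hidden)  # regression output

model = keras.Model(inputs=[geo_in, time_in, other_in], outputs=travel_time)
model.compile(optimizer="adam", loss="mse")
```

With these (assumed) sizes the dense part of the network sees 16 + 8 + 20 = 44 inputs instead of 12500, and the categorical columns are fed as integer ids rather than one-hot vectors.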
You can use PCA to do dimensionality reduction.
It removes correlated variables and keeps the directions along which the data has the highest variance.
Wikipedia PCA
Analytics Vidhya PCA
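As a rough illustration of what PCA does, here is a sketch using NumPy's SVD on synthetic data (50 features that really vary along only 3 directions). The data and the choice k = 3 are assumptions for the demo; on real one-hot data you would have to pick k by looking at the explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 50 correlated features built from 3 latent directions.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))

X_centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained = S**2 / np.sum(S**2)

k = 3
X_reduced = X_centered @ Vt[:k].T   # shape (200, 3)
print(X_reduced.shape, round(explained[:k].sum(), 4))
```

Unlike an embedding layer, this projection is computed once from the inputs alone and is not tuned for the prediction task.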

How to calculate a One-Hot Encoding value into a real-valued vector?

In Word2Vec, I've learned that both CBOW and Skip-gram produce a one-hot encoding value to create a vector (correct me if I'm wrong). I wonder how to calculate or represent a one-hot encoding value as a real-valued vector, for example (source: DistrictDataLab's Blog about Distributed Representations)
from this:
into:
Please help; I have been struggling to find this information.
The word2vec algorithm itself is what incrementally learns the real-valued vector, with varied dimension values.
In contrast to the one-hot encoding, these vectors are often called "dense embeddings". They're "dense" because unlike the one-hot encoding, which is "sparse" with many dimensions and mostly zero values, they have fewer dimensions and (usually) no zero values. They're an "embedding" because they "embed" a discrete set of words into another, continuous coordinate system.
You'd want to read the original word2vec paper for a full formal description of how the dense embeddings are made.
But the gist is that the dense vectors start totally random, and so at first the algorithm's internal neural network is useless for predicting neighboring words. But each (context)->(target) word training example from a text corpus is tried against the network, and each time the difference from the desired prediction is used to apply a tiny nudge, towards a better prediction, to both word-vector and internal-network-weight values.
Repeated many times, initially with larger nudges (higher learning-rate) then with ever-smaller nudges, the dense vectors rearrange their coordinates from their initial randomness to a useful relative-arrangement – one that's about-as-good as possible for predicting the training text, given the limits of the model itself. (That is, any further nudge that improves predictions on some examples, worsens it on others – so you might as well consider training done.)
You then read the resulting dense embedding real-valued vectors out of the model, and use them for purposes other than just nearby-word prediction.
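To make the link between the one-hot vector and the dense vector concrete: multiplying a one-hot vector by the embedding matrix just picks out one row of that matrix. A small sketch, with a toy vocabulary and a randomly initialized matrix standing in for a trained one:

```python
import numpy as np

vocab = ["hello", "my", "name", "is", "john"]
word_to_id = {word: i for i, word in enumerate(vocab)}

rng = np.random.default_rng(0)
embedding_dim = 3
# Dense vectors start out random; training would nudge these values.
W = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

one_hot = np.zeros(len(vocab))
one_hot[word_to_id["john"]] = 1.0

via_matmul = one_hot @ W               # multiply the one-hot vector by the matrix
via_lookup = W[word_to_id["john"]]     # or simply read out the matching row
print(np.allclose(via_matmul, via_lookup))  # True
```

So nothing is "calculated" from the one-hot value itself; it only selects which row of learned numbers belongs to the word.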

How does word2vec or skip-gram model convert words to vector?

I have been reading a lot of papers on NLP and have come across many models. I understand the SVD model and how to represent it in 2-D, but I still do not get how we make a word vector by giving a corpus to the word2vec/skip-gram model. Is it also a co-occurrence matrix representation for each word? Can you explain it using an example corpus:
Hello, my name is John.
John works in Google.
Google has the best search engine.
Basically, how does skip gram convert John to a vector?
I think you will need to read a paper about the training process. Basically the values of the vectors are the node values of the trained neural network.
I tried to read the original paper but I think the paper "word2vec Parameter Learning Explained" by Xin Rong has a more detailed explanation.
The main concept can be easily understood with the example of autoencoding with neural networks. You train the neural network to pass information from the input layer to the output layer through a middle layer which is smaller.
In a traditional autoencoder, you have an input vector of size N, a middle layer of length M<N, and the output layer, again of size N. You want only one unit at a time turned on in your input layer, and you train the network to replicate in the output layer the same unit that is turned on in the input layer.
After the training has completed successfully, you will see that the neural network, in order to transport the information from the input layer to the output layer, has adapted itself so that each input unit has a corresponding vector representation in the middle layer.
Simplifying a bit, in the context of word2vec your input and output vectors work more or less in the same way, except for the fact that in the sample you submit to the network the unit turned on in the input layer is different from the unit turned on in the output layer.
In fact you train the network picking pairs of nearby (not necessarily adjacent) words from your corpus and submitting them to the network.
The size of the input and output vector is equal to the size of the vocabulary you are feeding to the network.
Your input vector has only one unit turned on (the one corresponding to the first word of the chosen pair), and the output vector has one unit turned on (the one corresponding to the second word of the chosen pair).
For current readers who might also be wondering "what does a word vector exactly mean" as the OP was at that time: As described at http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf, a word vector is of dimension n, and n "is an arbitrary size which defines the size of our embedding space." That is to say, this word vector doesn't mean anything concretely. It's just an abstract representation of certain qualities that this word might have, that we can use to distinguish words.
In fact, to directly answer the original question of "how is a word converted to a vector representation": the values of a word's vector embedding are usually just randomized at initialization, and improved iteration by iteration.
This is common in deep learning/neural networks, where the human beings who created the network usually don't have much idea of what the values exactly stand for. The network itself is supposed to figure the values out gradually, through learning. They just abstractly represent something and distinguish things from one another. One example would be AlphaGo, where it would be impossible for the DeepMind team to explain what each value in a vector stands for. It just works.
First of all, you normally don't use SVD with the Skip-Gram model, because Skip-Gram is based on a neural network. You use SVD because you want to reduce the dimension of your word vectors (e.g. for visualization in 2D or 3D space), but in a neural net you construct your embedding matrices with the dimension of your choice. You use SVD if you constructed your embedding matrix from a co-occurrence matrix.
Vector representation with co-occurrence matrix
I wrote an article about this here.
Consider the following two sentences: "all that glitters is not gold" + "all is well that ends well"
Co-occurrence matrix is then:
With co-occurrence matrix, each row is a word vector for the word. However as you can see in the matrix constructed above, each row has 10 columns. This means that the word vectors are 10-dimensional, and can't be visualized in 2D or 3D space. So we run SVD to reduce it to 2 dimension:
Now that the word vectors are 2-dimensional, they can be visualized in a 2D space:
However, reducing the word vectors to 2 dimensions results in a significant loss of meaningful information, which is why you shouldn't reduce the dimension too much.
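The co-occurrence-plus-SVD procedure above can be sketched for the two example sentences like this. Two things are assumptions on my part: a window size of 1, and the START/END boundary markers, which I added because 8 distinct words plus 2 markers matches the 10 columns mentioned above.

```python
import numpy as np

sentences = ["all that glitters is not gold", "all is well that ends well"]
# Assumption: boundary markers are added, giving 8 words + 2 markers = 10 columns.
corpus = [["<START>"] + s.split() + ["<END>"] for s in sentences]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

window = 1  # assumed window size
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[index[word], index[sent[j]]] += 1

# Truncated SVD: keep only the two largest singular values.
U, S, Vt = np.linalg.svd(M)
word_vectors_2d = U[:, :2] * S[:2]
print(M.shape, word_vectors_2d.shape)  # (10, 10) (10, 2)
```

Each row of M is the 10-dimensional count vector of one word, and each row of word_vectors_2d is its 2-dimensional approximation that can be plotted.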
Let's take another example: achieve and success. Let's say they have 10-dimensional word vectors:
Since achieve and success convey similar meanings, their vector representations are similar. Notice their similar values & color band pattern. However, since these are 10-dimensional vectors, these can't be visualized. So we run SVD to reduce the dimension to 3D, and visualize them:
Each value in the word vector represents the word's position within the vector space. Similar words will have similar vectors and, as a result, will be placed close to each other in the vector space.
Vector representation with Skip-Gram
I wrote an article about it here.
Skip-Gram uses neural net, and therefore does not use SVD because you can specify the word vector's dimension as a hyper-parameter when you first construct the network (if you really need to visualize, then we use a special technique called t-SNE, but not SVD).
Skip-Gram has the following structure:
With Skip-Gram, N-dimensional word vectors are randomly initialized. There are two embedding matrices: the input weight matrix W_input and the output weight matrix W_output.
Let's take W_input as an example. Assume that the words of interest are passes and should. Since the randomly initialized weight matrix is 3-dimensional, they can be visualized:
These weight matrices (W_input and W_output) are optimized by predicting a center word's neighboring words and updating the weights in a way that minimizes the prediction error. The predictions are computed for each context word of a center word, and their prediction errors are summed up to calculate the weight gradients.
The weight matrices update equations are:
These updates are applied for each training sample within the corpus (since Word2Vec uses stochastic gradient descent).
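In generic form, such an update is W_new = W_old - learning_rate * (gradient of the loss with respect to W), applied to both matrices. Below is a hedged sketch of one such step for vanilla (full-softmax) Skip-Gram, assuming a cross-entropy loss summed over the context words. The toy sizes, the learning rate and the word ids are arbitrary, and the function name is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 8, 3, 0.1          # toy sizes and learning rate (illustrative)
W_input = rng.normal(scale=0.1, size=(vocab_size, dim))
W_output = rng.normal(scale=0.1, size=(dim, vocab_size))


def sgd_step(center_id, context_ids):
    """One vanilla skip-gram update for a centre word and its context words."""
    global W_input, W_output
    h = W_input[center_id]                       # hidden layer = centre word's vector
    scores = h @ W_output
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # softmax over the whole vocabulary
    loss = -np.sum(np.log(probs[context_ids]))

    # Prediction errors (probs minus one-hot target) summed over all context words.
    error = len(context_ids) * probs
    np.subtract.at(error, context_ids, 1.0)

    grad_output = np.outer(h, error)             # gradient w.r.t. W_output
    grad_hidden = W_output @ error               # gradient w.r.t. the centre word's vector
    W_output = W_output - lr * grad_output
    W_input[center_id] -= lr * grad_hidden       # only the centre word's row changes
    return loss


losses = [sgd_step(center_id=2, context_ids=[1, 3]) for _ in range(100)]
print(losses[0], losses[-1])
```

Note that each step touches every column of W_output but only one row of W_input, which is why the softmax over the full vocabulary dominates the cost.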
Vanilla Skip-Gram vs Negative Sampling
The above Skip-Gram illustration assumes that we use vanilla Skip-Gram. In practice, we don't use vanilla Skip-Gram because of its high computational cost. Instead, we use an adapted form of Skip-Gram, called negative sampling.

Resources