How to interpret CBOW word embeddings?

In the context of word2vec, it is said that "words occurring in similar contexts have similar word embeddings"; for example, "love" and "hate" may have similar embeddings because they both appear near context words such as "I" and "movie".
I get the intuition with skip-gram: the embeddings of both "love" and "hate" should predict the context words "I" and "movie", so the embeddings should be similar. However, I can't see it with CBOW: there, the average of the embeddings of "I" and "movie" should predict "love" and "hate"; does that necessarily mean that the embeddings of "love" and "hate" end up similar? Or do we interpret the word embeddings of skip-gram and CBOW in different ways?

In practice it's all smoothed out by the diversity of contexts in CBOW – so the same intuition that works for skip-gram should also apply to CBOW.
Even if 'movie' is only 1/Nth of an influence on the average vector of all context words, when that average vector gets backpropagation-corrected to be slightly more predictive of 'love' (for a single training example), each word contributing to it is also backpropagation-corrected.
Over all examples, and all passes, corrections that pull in random directions tend to cancel each other out, but any consistent tendencies – like two words often co-occurring, or being alike in other respects – tend to reinforce similar correction-nudges on their word-vectors, moving them nearer each other.
Skip-gram is the stark, simple version: force word X to be more predictive of word Y – but expect that lots of other 1:1 corrections will all balance out. CBOW does things in batches: force words X^1, X^2, ... X^n to all be more predictive of word Y - but expect that lots of other somewhat-overlapping batches will pull distinct words together/apart as needed.
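A toy numpy sketch of that shared correction (purely illustrative: the vectors and words are made up, and real CBOW training uses negative sampling or hierarchical softmax rather than this bare dot-product score):

import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Made-up input vectors for the context words, and an output-layer vector for the target.
context_vecs = {w: rng.normal(size=dim) for w in ["I", "movie"]}
target_out = rng.normal(size=dim)  # output vector for "love"

lr = 0.1
avg = np.mean(list(context_vecs.values()), axis=0)  # CBOW averages the context vectors
p = 1 / (1 + np.exp(-avg @ target_out))             # current "predict 'love'" score
grad = (1 - p) * target_out                         # gradient w.r.t. the averaged input vector

# The same nudge is shared by every word that contributed to the average, so words
# that keep appearing in the same contexts keep receiving similar nudges.
for w in context_vecs:
    context_vecs[w] += lr * grad / len(context_vecs)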

Related

How to solve difficult sentences for nlp sentiment analysis

Take the following sentence:
"Don't pay attention to people if they say it's no good."
As humans, we understand that the overall sentiment of the sentence is positive.
The "Bag of Words" (BOW) technique
Here we have two categories: "positive" words with a polarity of 1, and "negative" words with a polarity of 0.
In this case, the word "good" falls into the positive category, but here it is only accidentally correct.
Thus, this technique is ruled out.
Still using the BOW technique (a sort of "word embedding"),
but taking its surrounding words into consideration, in this case the preceding "no": it becomes "no good", not the adjective "good" alone. However, "no good" is not what the author intended, given the context of the entire sentence.
Hence this question. Thanks in advance.
Word embeddings are one possible way to try to take into account the complexity coming from the sequence of terms in your example. Using models pre-trained on general English, such as BERT, should give you interesting results for your sentiment analysis problem. You can leverage the implementations provided by the Hugging Face library.
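For instance, a minimal sketch with the Hugging Face transformers pipeline (this downloads a default pre-trained sentiment model; whether it handles this particular tricky sentence correctly is not guaranteed):

from transformers import pipeline

# Default sentiment-analysis pipeline; a specific model can be chosen via the model= argument.
classifier = pipeline("sentiment-analysis")
print(classifier("Don't pay attention to people if they say it's no good."))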
Another approach, one that doesn't rely on compute-intensive techniques (such as word embeddings), would be to use n-grams, which capture the sequence aspect and should provide good features for sentiment estimation. You can try different depths (unigrams, bigrams, trigrams, ...) and combine them with different types of preprocessing and/or tokenizers. Scikit-learn provides a good reference implementation of n-grams in its CountVectorizer class.
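For example (a minimal sketch; get_feature_names_out assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Don't pay attention to people if they say it's no good."]

# Unigrams plus bigrams: the bigram features keep short sequences such as "no good".
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())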

How does gensim word2vec word embedding extract training word pair for 1 word sentence?

Refer to the image below (the process of how word2vec skip-gram extracts the training dataset – the word pairs – from the input sentences).
E.G. "I love you." ==> [(I,love), (I, you)]
May I ask what the word pair is when the sentence contains only one word?
Is it "Happy!" ==> [(happy,happy)] ?
I tested the word2vec algorithm in gensim: when there is just one word in a training sentence (and this word is not included in other sentences), the word2vec algorithm can still construct an embedding vector for this specific word. I am not sure how the algorithm is able to do so.
UPDATE
As per the answer posted below, I think the word embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.
No word2vec training is possible from a 1-word sentence, because there are no neighbor words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the starting random-initialization of the word, with no further training. (And, you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently model-able rare words is removed.)
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling – the algorithm is just working on 'neighbors', and it's common to use multi-sentence chunks as the texts for training; sometimes even punctuation (like sentence-ending periods) is retained as 'words'. Then words from an actually-separate sentence – but still related by having appeared in the same document – will appear in each other's contexts.
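A minimal gensim sketch of the skipped-sentence behavior (toy corpus; assumes gensim 4.x, where the dimensionality parameter is named vector_size):

from gensim.models import Word2Vec

corpus = [
    ["i", "love", "you"],  # normal sentence: yields (center, context) training pairs
    ["happy"],             # 1-word sentence: no neighbors, so no training pairs
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=5)

# 'happy' still has a vector, but it is just its random initialization, because
# the 1-word sentence contributed no training examples.
print(model.wv["happy"][:5])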

What is the difference between word2vec, glove, and elmo? [duplicate]

What is the difference between word2vec and glove?
Are both of them ways to train a word embedding? If yes, then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
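For example, a quick gensim sketch of the window effect (toy corpus just to show the API – real differences only show up on a large corpus; gensim 4.x parameter names assumed):

from gensim.models import Word2Vec

corpus = [
    ["the", "movie", "was", "good", "and", "the", "acting", "was", "great"],
    ["the", "film", "was", "bad", "but", "the", "music", "was", "good"],
    ["a", "good", "plot", "and", "great", "actors", "make", "a", "great", "film"],
] * 50

narrow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=20)   # 'drop-in replacement' flavour
wide = Word2Vec(corpus, vector_size=50, window=10, min_count=1, epochs=20)    # 'same topic' flavour

print(narrow.wv.most_similar("good", topn=3))
print(wide.wv.most_similar("good", topn=3))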
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given its context (the CBOW method) or the context words given the target (the skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
GloVe is based on matrix factorization techniques applied to the word-context matrix. It first constructs a large (words x contexts) matrix of co-occurrence information, i.e. for each “word” (the rows), you count how frequently (the matrix values) you see this word in some “context” (the columns) in a large corpus. The number of “contexts” is very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (words x features) matrix, where each row now yields a vector representation for each word. In general, this is done by minimizing a “reconstruction loss”, which tries to find the lower-dimensional representations that can explain most of the variance in the high-dimensional data.
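A toy illustration of the factorize-a-co-occurrence-matrix idea (the counts here are made up, and GloVe itself fits vectors with a weighted least-squares objective rather than a plain SVD):

import numpy as np

words = ["ice", "steam", "solid", "gas", "water"]
# Hypothetical co-occurrence counts: rows are words, columns are context words.
X = np.array([
    [0, 2, 8, 1, 6],
    [2, 0, 1, 7, 6],
    [8, 1, 0, 0, 3],
    [1, 7, 0, 0, 3],
    [6, 6, 3, 3, 0],
], dtype=float)

# Truncated SVD of the (dampened) counts: keep the top-k components as word vectors.
U, S, Vt = np.linalg.svd(np.log1p(X))
k = 2
word_vectors = U[:, :k] * S[:k]  # one k-dimensional row per word
print(dict(zip(words, word_vectors.round(2))))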
Before GloVe, algorithms for word representations could be divided into two main streams, the statistics-based (e.g. LSA) and the learning-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a shallow three-layer neural network to do a center-context word pair classification task, where the word vectors are just a by-product.
The most striking point about Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can expose semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA does not maintain such linear relationships in vector space.
The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. It is, in effect, a hybrid method that applies machine learning to the statistical co-occurrence matrix, and this is the general difference between GloVe and Word2Vec.
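For reference, the weighted least-squares objective from the GloVe paper, where w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are biases, X_{ij} is the co-occurrence count, and f is a weighting function that caps the influence of very frequent pairs:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2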
If we dive into the derivation of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from Stanford NLP's GloVe page (Global Vectors for Word Representation), which considers the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid. Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently. Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
Word2Vec, by contrast, works on the raw co-occurrence probabilities: it maximizes the probability that the words surrounding the target word are its context.
In practice, to speed up training, Word2Vec employs negative sampling, substituting the softmax function with sigmoid functions operating on real data and noise data. This implicitly results in the clustering of words into a cone in the vector space, while GloVe’s word vectors are located more discretely.
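For a single observed (target, context) pair, the negative-sampling objective looks like the following (in my notation: v_c is the input vector, u_o the output vector of the observed word, u_{w_i} the output vectors of k noise words drawn from the noise distribution P_n, and \sigma the sigmoid):

\log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-u_{w_i}^\top v_c) \right]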

Fasttext algorithm use only word and subword? or sentences too?

I read the paper and googled as well to see if there is any good example of the learning method (or, more precisely, the learning procedure).
For word2vec, suppose the corpus contains the sentence
I go to school with lunch box that my mother wrapped every morning
Then, with window size 2, it will try to obtain the vector for 'school' by using the surrounding words
['go', 'to', 'with', 'lunch']
Now, FastText says that it uses subwords to obtain the vector, so it definitely uses character n-gram subwords; for example, with n=3,
['sc', 'sch', 'cho', 'hoo', 'ool', 'school']
Up to here, I understood.
But it is not clear whether the other words are also used when learning 'school'. I can only guess that the surrounding words are used as well, as in word2vec, since the paper mentions
=> the terms Wc and Wt are both used in functions
where Wc is the context word and Wt is the word at position t.
However, it is not clear how FastText learns the vectors for a word.
Could you clearly explain how the FastText learning procedure works?
More precisely, I want to know whether FastText follows the same procedure as Word2Vec while learning the character n-gram subwords in addition, or whether only the character n-gram subwords together with the word are used.
How does it vectorize the subwords initially? Etc.
Any context word has its candidate input vector assembled from the combination of both its full-word token and all its character n-grams. So if the context word is 'school', and you're using 3-4 character n-grams, the in-training input vector is a combination of the full-word vector for school, and all the n-gram vectors for ['sch', 'cho', 'hoo', 'ool', 'scho', 'choo', 'hool'].
When that candidate vector is adjusted by training, all the constituent vectors are adjusted. (This is a little like how in word2vec CBOW mode, all the words of the single averaged context input vector get adjusted together, when their ability to predict a single target output word is evaluated and improved.)
As a result, those n-grams that happen to be meaningful hints across many similar words – for example, common word-roots or prefixes/suffixes – get positioned where they confer that meaning. (Other n-grams may remain mostly low-magnitude noise, because there's little meaningful pattern to where they appear.)
After training, reported vectors for individual in-vocabulary words are also constructed by combining the full-word vector and all n-grams.
Then, when you encounter an out-of-vocabulary word, to the extent it shares some or many n-grams with morphologically similar in-training words, it will get a similar calculated vector – and thus be better than nothing for guessing what that word's vector should be. (And in the case of small typos or slight variants of known words, the synthesized vector may be pretty good.)
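A minimal gensim sketch of that behavior (toy corpus; gensim 4.x parameter names assumed, with min_n/max_n controlling the character n-gram lengths):

from gensim.models import FastText

corpus = [
    ["i", "go", "to", "school", "with", "lunch"],
    ["the", "school", "bus", "was", "late"],
]

model = FastText(corpus, vector_size=32, window=2, min_count=1, min_n=3, max_n=4, epochs=10)

print("school" in model.wv.key_to_index)   # True: in-vocabulary word
print("schoool" in model.wv.key_to_index)  # False: a typo, not in the vocabulary...
print(model.wv["schoool"][:5])             # ...but a vector is still synthesized from its n-grams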
The fastText site states that at least two of the implemented algorithms do use surrounding words in sentences.
Moreover, the original fastText implementation is open source, so you can check exactly how it works by exploring the code.

CBOW v.s. skip-gram: why invert context and target words?

In this page, it is said that:
[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]
However, looking at the training dataset it produces, the content of the X and Y pairs seems to be interchangeable, as in these two (X, Y) pairs:
(quick, brown), (brown, quick)
So, why distinguish that much between context and targets if it is the same thing in the end?
Also, doing Udacity's Deep Learning course exercise on word2vec, I wonder why they draw such a distinction between those two approaches in this problem:
An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.
Would not this yield the same results?
Here is my oversimplified and rather naive understanding of the difference:
As we know, CBOW is learning to predict the word from the context, or to maximize the probability of the target word by looking at the context. And this happens to be a problem for rare words. For example, given the context yesterday was a really [...] day, a CBOW model will tell you that most probably the word is beautiful or nice. A word like delightful will get much less attention from the model, because it is designed to predict the most probable word; that word gets smoothed over a lot of examples alongside more frequent words.
On the other hand, the skip-gram model is designed to predict the context. Given the word delightful it must understand it and tell us that there is a huge probability that the context is yesterday was really [...] day, or some other relevant context. With skip-gram the word delightful will not try to compete with the word beautiful but instead, delightful+context pairs will be treated as new observations.
UPDATE
Thanks to #0xF for sharing this article
According to Mikolov
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
One more addition to the subject is found here:
In the "skip-gram" mode alternative to "CBOW", rather than averaging
the context words, each is used as a pairwise training example. That
is, in place of one CBOW example such as [predict 'ate' from
average('The', 'cat', 'the', 'mouse')], the network is presented with
four skip-gram examples [predict 'ate' from 'The'], [predict 'ate'
from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse'].
(The same random window-reduction occurs, so half the time that would
just be two examples, of the nearest words.)
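A small sketch of how those two kinds of training examples could be generated from the quoted sentence (illustrative only, with a fixed window of 2 and no random window-reduction):

sentence = ["The", "cat", "ate", "the", "mouse"]
window = 2

cbow_examples = []      # (list of context words, target)
skipgram_examples = []  # (context word, target) pairs

for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    cbow_examples.append((context, target))                 # one example per target; the context gets averaged
    skipgram_examples.extend((c, target) for c in context)  # one example per (context, target) pair

print(cbow_examples[2])      # (['The', 'cat', 'the', 'mouse'], 'ate')
print(skipgram_examples[:4])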
It has to do with what exactly you're calculating at any given point. The difference will become clearer if you start to look at models that incorporate a larger context for each probability calculation.
In skip-gram, you're calculating the context word(s) from the word at the current position in the sentence; you're "skipping" the current word (and potentially a bit of the context) in your calculation. The result can be more than one word (but not if your context window is just one word long).
In CBOW, you're calculating the current word from the context word(s), so you will only ever have one word as a result.
In the Deep Learning course from Coursera (https://www.coursera.org/learn/nlp-sequence-models?specialization=deep-learning) you can see that Andrew Ng doesn't switch the context-target concepts. It means that the target word will ALWAYS be treated as the word to be predicted, no matter whether it is CBOW or skip-gram.
