What is the relation between t-SNE and word2vec?

As far as I know, t-SNE reduces the dimensionality of word vectors.
Word2vec generates a word-embedding model from a huge amount of data.
What is the relation between the two?
Does Word2vec use t-SNE internally?
(I use Word2vec from Gensim)

Internally they both use gradient-descent to reach their final optimized states. And both can be considered dimensionality-reduction operations. But, word2vec does not internally use t-SNE (or vice-versa).
t-SNE ("t-distributed stochastic neighbor embedding") typically reduces many-dimensional data to 2- or 3-dimensions, for the purposes of plotting a visualization. It involves learning a mapping from the original dimensionality, to the fewer dimensions, which still keeps similar points near each other.
word2vec takes many text examples and learns a shallow neural-network that's good at predicting words from nearby words. A particular layer of that neural-network's weights, which represent individual words, then becomes the learned N-dimensional word-vectors, with the value of N often 100 to 600.
(There's an alternative way to create word-vectors called GloVe that works a little more like t-SNE, in that it trains directly from the high-dimensional co-occurrence matrix of words, rather than from the many in-context co-occurrence examples. But it's still not t-SNE itself.)
You could potentially run t-SNE with a target dimensionality of 100-400. But since that end-result wouldn't yet yield nice plots, the maintenance of 'nearness' that's central to t-SNE won't have delivered its usual intended benefit.
You could potentially learn word2vec (or GloVe) vectors of just 2- or 3-dimensions, but most of the useful similarities/arrangements that people seek from word-vectors would be lost in the crowding. And in a plot, you'd probably not see as strong visual 'clumping' of related-word categories, because t-SNE's specific high-to-low dimensionality nearness-preservation goal wasn't applied.

Related

Why my ELMo-CNN model gives worse performance than Word2vec?

I want to compare the performance of ELMo and word2vec as word embeddings for a CNN model that classifies 4,000 tweets into five classes, but the results show that ELMo performs worse than word2vec.
I used ELMoForManyLangs for ELMo and, for word2vec, a model pretrained on 1 million tweets.
[Figure: loss curve of word2vec-CNN]
[Figure: loss curve of ELMo-CNN]
It shows that the 2 models are overfitting, but why can ELMo be worse than word2vec?
From the elmoformanylangs project you've linked, it looks like your generic ELMo model was trained "on a set of 20-million-words data randomly sampled from the raw text released by the shared task (wikidump + common crawl)".
Given that many tweets are longer than 20 words, your 1-million-tweets training set for word2vec might be larger training data than was used for the ELMo model. And, coming from actual tweets, it may also reflect words/word-senses used in tweets better than generic wikidump/common-crawl text.
Given that, I'm not sure why you'd have expected the ELMo approach to necessarily be better.
But also, as you've noted, the fact that your classifier is performing worse with more training is highly indicative of extreme overfitting. You may want to fix that before attempting to reason any further about the relative merits of different approaches. (When both classifiers are massively broken, exactly why one's brokenness is a bit better than the other's brokenness should be a fairly moot point. After they're both fixed to do as well as they can, then the remaining difference may be interesting to choose between, or understand deeply.)

Finding both target and center word2vec matrices

I've read and heard (in Stanford's CS224 course) that the Word2Vec algorithm actually trains two matrices (that is, two sets of vectors). These are the U and V sets, one for words as targets and one for words as context. The final output is the average of the two.
I have two questions in mind. The first is:
Why do we take the average of the two vectors? Why does that make sense? Don't we lose some information?
The second question is, using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec with both sets of vectors? I don't have enough resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")
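To make the "one-hot times matrix equals row lookup" point concrete, here is a minimal pure-Python sketch. The vocabulary, the matrix values, and the function names are all made up for illustration; real libraries store the matrix as a dense array and only ever do the lookup.

```python
# Toy vocabulary and a made-up 3-word x 2-dimension "projection matrix".
vocab = {"cat": 0, "dog": 1, "runs": 2}
projection = [
    [0.1, 0.2],  # row for "cat"
    [0.3, 0.4],  # row for "dog"
    [0.5, 0.6],  # row for "runs"
]


def one_hot(index, size):
    """Return a list with 1.0 at `index` and 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project_via_matmul(word):
    """Multiply the word's one-hot row vector by the projection matrix."""
    x = one_hot(vocab[word], len(vocab))
    dims = len(projection[0])
    return [
        sum(x[i] * projection[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


def project_via_lookup(word):
    """Fetch the same vector directly: just read one row of the matrix."""
    return list(projection[vocab[word]])


print(project_via_matmul("dog"))
print(project_via_lookup("dog"))
```

Both calls return the same row, which is why the multiplication is never actually performed in practice.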
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et al. of Microsoft Research (in "Dual Embedding Space Models" & associated writeups) has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.

What is the difference between word2vec, glove, and elmo? [duplicate]

What is the difference between word2vec and glove?
Are they both ways to train a word embedding? If so, how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given a context (CBOW method) or the context words from the target (skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
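As a rough illustration of the two prediction tasks, the sketch below builds the training examples each mode would see for one sentence. It assumes a fixed symmetric window and skips everything else a real implementation does (subsampling, random window shrinking, negative sampling); the function names are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: predict each neighbor from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """(context_words, target) examples: predict the center from its neighbors."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = "my cat is running".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (center, neighbor) pair, while CBOW produces one example per position with all neighbors pooled together.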
GloVe is based on matrix factorization techniques applied to the word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (matrix values) that word appears in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (words x features) matrix, where each row now yields a vector representation for a word. In general, this is done by minimizing a “reconstruction loss”. This loss tries to find the lower-dimensional representations which can explain most of the variance in the high-dimensional data.
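The counting step described above can be sketched in a few lines. This is a simplified, unweighted version that takes "context" to mean any word within a fixed window; the function name and the tiny corpus are illustrative only, and the factorization step itself is not shown.

```python
from collections import Counter


def cooccurrence_counts(sentences, window):
    """Count how often each (word, context_word) pair occurs within `window`."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [
    "my cat is running".split(),
    "your dog is running".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("is", "running")])
print(counts[("cat", "running")])
print(counts[("my", "running")])
```

The resulting sparse table of counts is the matrix the vectors are then fitted to, as opposed to word2vec's pass over individual in-context examples.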
Before GloVe, word-representation algorithms fell into two main streams: statistics-based (LSA) and learning-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a three-layer neural network to do the center-context word pair classification task, where the word vectors are just a by-product.
The most amazing point from Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can expose semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA cannot maintain such linear relationships in vector space.
The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. It is a hybrid method that applies machine learning to the statistics matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the deduction procedure of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), to consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it
does with gas, whereas steam co-occurs more frequently with gas than
it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
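The ratio argument can be checked with a toy calculation. The counts below are invented purely for illustration (they are not the figures from the GloVe page), and the function names are hypothetical; only the pattern of the ratios matters.

```python
# Made-up co-occurrence counts: toy_counts[target][probe].
toy_counts = {
    "ice": {"solid": 80, "gas": 4, "water": 120, "fashion": 1},
    "steam": {"solid": 4, "gas": 80, "water": 110, "fashion": 1},
}


def cooccurrence_probability(target, probe):
    """P(probe | target): the probe's share of the target's total count."""
    row = toy_counts[target]
    return row[probe] / sum(row.values())


def probability_ratio(probe):
    """P(probe | ice) / P(probe | steam)."""
    return (cooccurrence_probability("ice", probe)
            / cooccurrence_probability("steam", probe))


for probe in ("solid", "gas", "water", "fashion"):
    print(probe, round(probability_ratio(probe), 3))
```

With these toy numbers the ratio is far above 1 for solid, far below 1 for gas, and close to 1 for both water and fashion, even though water is frequent and fashion is rare.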
Word2Vec, by contrast, works on the raw co-occurrence probabilities: it maximizes the probability that the words surrounding a target word are predicted as its context.
In practice, to speed up training, Word2Vec employs negative sampling, which replaces the softmax function with a sigmoid function applied to the real data and to noise data. This implicitly results in the clustering of words into a cone in the vector space, while GloVe’s word vectors are located more discretely.
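For readers who want to see what "sigmoid on real data and noise data" amounts to, here is a minimal sketch of the negative-sampling loss for a single training pair. The vectors are made up and the function names are hypothetical; a real implementation also samples the noise words and updates the weights by gradient descent.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center_vec, true_context_vec, noise_vecs):
    """Loss for one (center, context) pair plus a few sampled noise words.

    The true pair is pushed toward a high sigmoid score and each noise
    word toward a low one, so no softmax over the vocabulary is needed.
    """
    loss = -math.log(sigmoid(dot(center_vec, true_context_vec)))
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, noise_vec)))
    return loss


center = [1.0, 0.0]
well_fit = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0]])
badly_fit = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0]])
print(well_fit < badly_fit)
```

The cost per example depends only on the number of noise words drawn, not on the vocabulary size, which is where the speed-up comes from.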

What is the stochastic aspect of Word2Vec?

I'm vectorizing words on a few different corpora with Gensim and am getting results that are making me rethink how Word2Vec functions. My understanding was that Word2Vec was deterministic, and that the position of a word in a vector space would not change from training to training. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.
However, I've found that that value indeed does vary across models, and words keep changing where they are on a vector space when I train the model. The differences are not always hugely meaningful, but they do indicate the existence of some random process. What am I missing here?
This is well-covered in the Gensim FAQ, which I quote here:
Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization
during training. (For example, the training windows are randomly
truncated as an efficient way of weighting nearer words higher. The
negative examples in the default negative-sampling mode are chosen
randomly. And the downsampling of highly-frequent words, as controlled
by the sample parameter, is driven by random choices. These
behaviors were all defined in the original Word2Vec paper's algorithm
description.)
Even when all this randomness comes from a
pseudorandom-number-generator that's been seeded to give a
reproducible stream of random numbers (which gensim does by default),
the usual case of multi-threaded training can further change the exact
training-order of text examples, and thus the final model state.
(Further, in Python 3.x, the hashing of strings is randomized each
re-launch of the Python interpreter - changing the iteration ordering
of vocabulary dicts from run to run, and thus making even the same
string-of-random-number-draws pick different words in different
launches.)
So, it is to be expected that models vary from run to run, even
trained on the same data. There's no single "right place" for any
word-vector or doc-vector to wind up: just positions that are at
progressively more-useful distances & directions from other vectors
co-trained inside the same model. (In general, only vectors that were
trained together in an interleaved session of contrasting uses become
comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as
useful, from run-to-run, as each other. Testing and evaluation
processes should be tolerant of any shifts in vector positions, and of
small "jitter" in the overall utility of models, that arises from the
inherent algorithm randomness. (If the observed quality from
run-to-run varies a lot, there may be other problems: too little data,
poorly-tuned parameters, or errors/weaknesses in the evaluation
method.)
You can try to force determinism, by using workers=1 to limit
training to a single thread – and, if in Python 3.x, using the
PYTHONHASHSEED environment variable to disable its usual string hash
randomization. But training will be much slower than with more
threads. And, you'd be obscuring the inherent
randomness/approximateness of the underlying algorithms, in a way that
might make results more fragile and dependent on the luck of a
particular setup. It's better to tolerate a little jitter, and use
excessive jitter as an indicator of problems elsewhere in the data or
model setup – rather than impose a superficial determinism.
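To illustrate just one of the randomness sources the FAQ mentions (random window truncation), here is a simplified pure-Python sketch. It is not gensim's actual code: the function name is hypothetical and the window is drawn uniformly from 1 to max_window as an assumption. It only shows that a seeded generator reproduces the same training pairs, while a different seed generally does not.

```python
import random


def sample_pairs(tokens, max_window, rng):
    """Build (center, context) pairs using a randomly shrunk window per word."""
    pairs = []
    for i, center in enumerate(tokens):
        # A smaller effective window is drawn for each center word, so
        # nearer neighbors end up in more training pairs than distant ones.
        window = rng.randint(1, max_window)
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = "your dog can't be running".split()
same_a = sample_pairs(tokens, 4, random.Random(1))
same_b = sample_pairs(tokens, 4, random.Random(1))
other = sample_pairs(tokens, 4, random.Random(2))

print("same seed, same pairs:", same_a == same_b)
print("pairs with seed 1:", len(same_a), "pairs with seed 2:", len(other))
```

Seeding removes this particular source of variation, but as the FAQ notes, thread scheduling and string-hash randomization can still change results between runs.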
While I don't know any implementation details of Word2Vec in gensim, I do know that, in general, Word2Vec is trained by a simple neural network with an embedding layer as the first layer. The weight matrix of this embedding layer contains the word vectors that we are interested in.
This being said, it is in general also quite common to initialize the weights of a neural network randomly. So there you have the origin of your randomness.
But how can the results be different, regardless of different (random) starting conditions?
A well trained model will assign similar vectors to words that have similar meaning. This similarity is measured by the cosine of the angle between the two vectors. Mathematically speaking, if v and w are the vectors of two very similar words then
dot(v, w) / (norm(v) * norm(w)) # cosine of the angle between v and w, where norm() is the Euclidean length
will be close to 1.
Also, it will allow you to do arithmetic like the famous
king - man + woman = queen
For illustration purposes imagine 2D-vectors. Would these arithmetical properties get lost if you e.g. rotate everything by some angle around the origin? With a little mathematical background I can assure you: No, they won't!
So, your assumption
If "My cat is running" and "your dog can't be running" are the two
sentences in the corpus, then the value of "running" (or its stem)
seems necessarily fixed.
is wrong. The value of "running" is not fixed at all. What is (somehow) fixed, however, is the similarity (cosine) and arithmetical relationship to other words.
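The rotation argument is easy to check numerically. The 2D "word vectors" below are made up so that king - man + woman lands exactly on queen; the sketch rotates all of them by an arbitrary angle and compares cosines before and after.

```python
import math


def cosine(v, w):
    """Cosine of the angle between two 2D vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise around the origin."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def analogy(a, b, c):
    """Compute a - b + c component-wise."""
    return tuple(x - y + z for x, y, z in zip(a, b, c))


# Made-up 2D vectors, for illustration only.
king, man, woman, queen = (3.0, 3.0), (1.0, 1.0), (1.0, 2.0), (3.0, 4.0)

angle = 1.234  # arbitrary rotation angle in radians
r_king, r_man, r_woman, r_queen = (
    rotate(v, angle) for v in (king, man, woman, queen)
)

before = cosine(analogy(king, man, woman), queen)
after = cosine(analogy(r_king, r_man, r_woman), r_queen)
print(round(before, 6), round(after, 6))
print(round(cosine(king, woman), 6), round(cosine(r_king, r_woman), 6))
```

Every coordinate changes under the rotation, yet the pairwise cosines and the analogy result stay the same, which is why two training runs can disagree on positions and still be equally useful.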

Is there a semantic similarity method that outperforms word2vec approach for semantic accuracy?

I am looking at various semantic similarity methods such as word2vec, word mover distance (WMD), and fastText. fastText is not better than Word2Vec as far as semantic similarity is concerned. WMD and Word2Vec give nearly the same results.
I was wondering if there is an alternative which has outperformed the Word2Vec model for semantic accuracy?
My use case:
Finding word embeddings for two sentences, and then using cosine similarity to measure how similar they are.
Whether any technique "outperforms" another will depend highly on your training data, the specific metaparameter options you choose, and your exact end-task. (Even "semantic similarity" may have many alternate aspects depending on the application.)
There's no one way to go from word2vec word-vectors to a sentence/paragraph vector. You could add the raw vectors. You could average the unit-normalized vectors. You could perform some other sort of weighted-average, based on other measures of word-significance. So your implied baseline is unclear.
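As one possible baseline among those options, here is a sketch that averages unit-normalized word vectors and compares sentences by cosine similarity. The 3-dimensional vectors are invented for the example and the function names are hypothetical; with a real model you would look the vectors up in the trained model instead.

```python
import math

# Hypothetical word vectors, invented for this example.
word_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "runs": [0.1, 0.9, 0.2],
    "sleeps": [0.0, 0.8, 0.4],
    "stock": [0.1, 0.0, 0.9],
    "falls": [0.2, 0.3, 0.8],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sentence_vector(tokens, normalize_words=True):
    """Average the vectors of the known words in a sentence."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        raise ValueError("no known words in sentence")
    if normalize_words:
        vecs = [unit(v) for v in vecs]
    dims = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]


def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


a = sentence_vector("cat runs".split())
b = sentence_vector("dog sleeps".split())
c = sentence_vector("stock falls".split())
print(round(cosine(a, b), 3), round(cosine(a, c), 3))
```

Setting normalize_words=False averages the raw vectors instead, which lets words with longer vectors dominate; which variant works better is something to test on your own data.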
Essentially you have to try a variety of methods and parameters, for your data and goal, with your custom evaluation.
Word Mover's Distance doesn't reduce each text to a single vector, and the pairwise calculation between two texts can be expensive, but it has reported very good performance on some semantic-similarity tasks.
FastText is essentially word2vec with some extra enhancements and new modes. Some modes with the extras turned off are exactly the same as word2vec, so using FastText word-vectors in some wordvecs-to-textvecs scheme should closely approximate using word2vec word-vectors in the same scheme. Some modes might help the word-vector quality for some purposes, but make the word-vectors less effective inside a wordvecs-to-textvecs scheme. Some modes might make the word-vector better for sum/average composition schemes – you should look especially at the 'classifier' mode, which trains word-vecs to be good, when averaged, at a classification task. (To the extent you may have any semantic labels for your data, this might make the word-vecs more composable for semantic-similarity tasks.)
You may also want to look at the 'Paragraph Vectors' technique (available in gensim as Doc2Vec), or other research results that go by the shorthand names 'fastSent' or 'sent2vec'.
