Why does word2Vec use cosine similarity?

I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts.
However, I do not understand why cosine is the correct measure of word similarity. Cosine similarity says that two vectors point in the same direction, but they could have different magnitudes.
For example, cosine similarity makes sense when comparing bag-of-words vectors for documents. Two documents might be of different lengths but have similar distributions of words.
Why not, say, Euclidean distance?
Can anyone explain why cosine similarity works for word2Vec?

Those two distance metrics are probably strongly correlated, so it might not matter all that much which one you use. As you point out, cosine distance means we don't have to worry about the length of the vectors at all.
This paper indicates that there is a relationship between the frequency of the word and the length of the word2vec vector. http://arxiv.org/pdf/1508.02297v1.pdf
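For intuition, here is a minimal sketch (my own toy example, with random stand-ins for word vectors) showing that once vectors are L2-normalized, Euclidean distance and cosine similarity are monotonically related, so nearest-neighbour rankings agree:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)   # stand-in for a word vector
b = rng.normal(size=300)
a /= np.linalg.norm(a)     # L2-normalize both vectors
b /= np.linalg.norm(b)

cos = a @ b                   # cosine similarity of unit vectors
euc = np.linalg.norm(a - b)   # Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(euc ** 2, 2 - 2 * cos)  # the two agree up to floating-point error
```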

Cosine similarity of two n-dimensional vectors A and B is defined as

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}},$$

which is simply the cosine of the angle between A and B, while the Euclidean distance is defined as

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}.$$
Now think about the distance between two random elements of the vector space. The cosine distance is bounded: since cos ranges over [-1, 1], the distance 1 - cos(θ) lies in [0, 2]. The Euclidean distance, however, can be any non-negative value.
When the dimension n gets bigger, the angle between two randomly chosen (mean-zero) vectors concentrates around 90°, i.e. their cosine similarity approaches 0, whereas two random points in the unit cube of R^n have an expected Euclidean distance of roughly √(n/6) ≈ 0.41·√n (source).
TL;DR
cosine distance is better for vectors in a high-dimensional space because of the curse of dimensionality. (I'm not absolutely sure about it, though)
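A quick simulation of these two claims (my own sketch; note the angle claim assumes mean-zero vectors, while the √(n/6) distance claim is for points uniform in the unit cube):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 10, 100, 1000):
    # Angle claim: mean-zero random vectors concentrate around 90 degrees.
    u = rng.normal(size=(500, n))
    v = rng.normal(size=(500, n))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    mean_angle = np.degrees(np.arccos(cos)).mean()

    # Distance claim: points uniform in the unit cube [0, 1]^n.
    x = rng.random((500, n))
    y = rng.random((500, n))
    mean_dist = np.linalg.norm(x - y, axis=1).mean()

    print(n, round(mean_angle, 1), round(mean_dist, 2), round(np.sqrt(n / 6), 2))
```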

Related

Why use cosine similarity in Word2Vec when it's trained using dot-product similarity

According to several posts I found on Stack Overflow (for instance this one: Why does word2Vec use cosine similarity?), it's common practice to calculate the cosine similarity between two word vectors after we have trained a word2vec model (either CBOW or Skip-gram). However, this seems a little odd to me, since the model is actually trained with the dot product as a similarity score. One piece of evidence for this is that the norms of the word vectors we get after training are actually meaningful. So why is it that people still use cosine similarity instead of the dot product when calculating the similarity between two words?
Cosine similarity and dot product are both similarity measures, but the dot product is magnitude-sensitive while cosine similarity is not. Depending on its occurrence count, a word might have a large or small dot product with another word. We normally normalize our vectors to prevent this effect, so that all vectors have unit magnitude. If your particular downstream task requires occurrence count as a feature, then the dot product might be the way to go; if you do not care about counts, you can simply calculate the cosine similarity, which normalizes them.
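A toy sketch of the difference (the vectors and norms here are made up for illustration): cosine similarity is exactly the dot product after L2-normalization, which is why normalizing once removes the magnitude effect:

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = 5.0 * rng.normal(size=100)   # pretend "frequent" word: large norm
w2 = 0.5 * rng.normal(size=100)   # pretend "rare" word: small norm

dot = w1 @ w2                                             # magnitude-sensitive
cosine = dot / (np.linalg.norm(w1) * np.linalg.norm(w2))  # magnitude-free

# Normalizing first makes the plain dot product equal the cosine similarity.
u1 = w1 / np.linalg.norm(w1)
u2 = w2 / np.linalg.norm(w2)
print(dot, cosine, u1 @ u2)
```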

Lua - Finding the best match for a string

I was curious if anyone had a good method for choosing the best matching case between strings. For example, say I have a table with the keys “Hi there”, “Hello”, “Hiya”, “hi”, “Hi”, and “Hey there”, and I want to find the closest match for “Hi”. It would match “Hi” first; if that weren't found, it would fall back to “hi”, then “Hiya”, and so on, prioritizing perfect matches, then matches differing only in case, then whichever key has the fewest character differences or the smallest length difference.
My current method seems unwieldy: first checking for a perfect match, then looping around with string.match, saving whichever candidate has the closest string.len.
If you're not looking for a perfect match only, you need to use some metric as a measure of similarity and then look for the closest match.
As McBarby suggested in his comment, you can use the Levenshtein distance, which is the minimum number of single-character edits necessary to get from string 1 to string 2. Just research which metrics are available and which one suits your needs best; of course, you can also define your own metric.
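As a sketch of that idea (in Python rather than Lua, but the dynamic-programming algorithm translates directly; the key list is taken from the question, minus the exact match so the fallback is visible):

```python
def levenshtein(s, t):
    # prev[j] holds the edit distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion from s
                           cur[j - 1] + 1,             # insertion into s
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def best_match(query, keys):
    if query in keys:                       # 1) exact match
        return query
    for k in keys:                          # 2) case-insensitive match
        if k.lower() == query.lower():
            return k
    # 3) smallest edit distance (ties broken by list order)
    return min(keys, key=lambda k: levenshtein(query.lower(), k.lower()))

print(best_match("Hi", ["Hi there", "Hello", "Hiya", "hi", "Hey there"]))
# -> "hi" (no exact "Hi" in this list, so the case-insensitive match wins)
```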
https://en.wikipedia.org/wiki/String_metric lists a number of other string metrics:
Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Jaro–Winkler distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunters metric (SFS)
Maximal matches
Grammar-based distance
TFIDF distance metric

Threshold for TF-IDF cosine similarity scores

This question is very similar to this one: Systematic threshold for cosine similarity with TF-IDF weights
How should I cut off tiny similarities? In the link above, the answer gives a technique based on averages. But this could return documents even if all similarities are very small, for example, < 0.01.
How do I know if a given document query is so unrelated to the corpus that no other document should be considered similar to it? Is there a systematic way to define a cutoff value for this?
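The linked thread doesn't settle on one rule, but as a sketch, one simple statistical cutoff is to keep only documents whose similarity exceeds the mean by some number of standard deviations; the corpus, query, and factor k below are all made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat", "dogs and cats", "stock markets fell today"]
query = ["my cat sat down"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)      # TF-IDF vectors for the corpus
q = vec.transform(query)           # project the query into the same space

sims = cosine_similarity(q, X).ravel()
k = 1.0                            # assumption: tune on held-out data
threshold = sims.mean() + k * sims.std()
related = [doc for doc, s in zip(corpus, sims) if s >= threshold]
print(sims, threshold, related)    # nothing survives if all sims are tiny
```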

K-means text document clustering: how to calculate intra- and inter-cluster similarity?

I am clustering thousands of documents whose vector components are calculated according to tf-idf, and I use cosine similarity. I did a frequency analysis of the top words in each cluster to check how they differ, but I'm not sure how to quantify the similarity numerically for this kind of document collection.
I compute the internal (intra-cluster) similarity of a cluster as the average similarity of each document to the cluster's centroid; if I averaged over document pairs instead, it would be based on a small number of comparisons.
I compute the external (inter-cluster) similarity as the average similarity over all pairs of cluster centroids.
Am I computing this correctly? My intra-cluster averages range from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the broad range of computer-science topics in the documents; the inter-cluster values range from 0.3 to 0.7. Can the results look like that? On the Internet I found various ways of measuring this and don't know which to use instead of the one I came up with. I am quite desperate.
Thank you so much for your advice!
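For what it's worth, here is a sketch of the computation as the question describes it (the function name, and the assumption that X is a dense matrix with L2-normalized tf-idf rows, are mine):

```python
import numpy as np

def intra_inter_similarity(X, labels):
    """X: (n_docs, n_terms) dense tf-idf matrix with L2-normalized rows;
    labels: (n_docs,) integer cluster assignments."""
    ks = np.unique(labels)
    # Cluster centroids, re-normalized so dot products are cosine similarities.
    cents = np.vstack([X[labels == k].mean(axis=0) for k in ks])
    cents /= np.linalg.norm(cents, axis=1, keepdims=True)

    # Intra: mean cosine between each document and its own cluster centroid.
    intra = np.concatenate(
        [X[labels == k] @ cents[i] for i, k in enumerate(ks)]).mean()

    # Inter: mean cosine over all pairs of distinct centroids.
    pair_sims = cents @ cents.T
    inter = pair_sims[np.triu_indices(len(ks), k=1)].mean()
    return intra, inter
```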
Using k-means with anything but squared euclidean is risky. It may stop converging, as the convergence proof relies on both the mean and the distance assignment optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and have guaranteed convergence), you will need to look at k-medoids.
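As a sketch of that idea, a minimal PAM-style k-medoids on a precomputed distance matrix (my own toy implementation, not a reference library; for cosine, D could be 1 minus the pairwise cosine similarities):

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """D: (n, n) precomputed distance matrix, e.g. 1 - cosine similarity."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:                       # skip empty clusters
                # New medoid = member with smallest total intra-cluster distance.
                new[j] = members[
                    np.argmin(D[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(new, medoids):           # medoids stable: converged
            break
        medoids = new
    return medoids, labels
```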

How to know when to use a particular kind of Similarity index? Euclidean Distance vs. Pearson Correlation

What are some of the deciding factors to take into consideration when choosing a similarity index?
In what cases is a Euclidean Distance preferred over Pearson and vice versa?
Correlation is unit-independent; if you scale one of the objects by a factor of ten, you will get different Euclidean distances but the same correlation distances. Correlation metrics are therefore excellent when you want to measure the distance between objects such as genes defined by their expression profiles.
Often, absolute or squared correlation is used as a distance metric, because we are more interested in the strength of the relationship than in its sign.
However, correlation is only suitable for high-dimensional data; there is hardly any point in calculating it for two- or three-dimensional data points.
Also note that "Pearson distance" is a weighted type of Euclidean distance, not the "correlation distance" based on the Pearson correlation coefficient.
It really depends on the application scenario at hand. Very briefly: if you are dealing with data where the actual differences in attribute values matter, go with Euclidean distance; if you are looking for trend or shape similarity, go with correlation. Also note that if you perform z-score normalization on each object, Euclidean distance behaves similarly to the Pearson correlation coefficient, since Pearson is insensitive to linear transformations of the data. There are other types of correlation coefficients that take into account only the ranks of the values, making them insensitive to both linear and non-linear transformations. Note that the usual use of correlation as a dissimilarity is 1 - correlation, which does not satisfy all the requirements of a metric distance.
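A sketch of that z-scoring remark (a toy example; for z-scored vectors of length n, squared Euclidean distance and Pearson correlation r are tied by the identity ‖x − y‖² = 2n(1 − r)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * rng.normal(size=50) + 7.0   # different scale and offset

def zscore(v):
    return (v - v.mean()) / v.std()

xz, yz = zscore(x), zscore(y)
r = np.corrcoef(x, y)[0, 1]           # Pearson correlation (scale-invariant)
d2 = np.sum((xz - yz) ** 2)           # squared Euclidean on z-scored data
print(d2, 2 * len(x) * (1 - r))       # agree up to floating-point error
```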
There are some studies on which proximity measure to select for a particular application, for instance:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, and Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.
