Euclidean vs Cosine for text data

If I use a tf-idf feature representation (or just document length normalization), are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks, forums, and discussions I have read say cosine similarity works better for text...
I wrote some basic code to test this and found that they are indeed comparable: not exactly the same floating-point values, but one looks like a scaled version of the other. Below are the results of both measures on simple demo text data. Text no. 2 is a long line of about 50 words; the rest are short 10-word lines.
Cosine similarity:
0.0, 0.2967, 0.203, 0.2058
Euclidean distance:
0.0, 0.285, 0.2407, 0.2421
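For reference, a comparison like this can be reproduced with scikit-learn along these lines (the documents below are placeholders, not the original demo data):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = [
    "a short placeholder sentence about cats",
    "a much longer placeholder document with many more words in it than the other lines",
    "another short placeholder sentence about dogs",
    "one more short placeholder sentence about mats",
]

X = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default

print(cosine_similarity(X[0], X))           # doc 0 vs all docs
print(euclidean_distances(X[0], X))         # doc 0 vs all docs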
Note: If this question is more suitable to Cross Validation or Data Science, please let me know.

If your data is normalized to unit length, then it is very easy to prove that
Euclidean(A,B)^2 = ||A - B||^2 = 2 - 2·Cos(A,B)
This holds if ||A|| = ||B|| = 1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. For example, if you first normalize your documents to unit length and then apply IDF weighting, it will no longer hold...
Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.
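A quick numeric check of that identity, as a minimal sketch with plain NumPy (random vectors standing in for documents):
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10)
b = rng.random(10)

a /= np.linalg.norm(a)   # L2-normalize to unit length
b /= np.linalg.norm(b)

cos_sim = a @ b
sq_euclidean = np.sum((a - b) ** 2)

print(sq_euclidean, 2 - 2 * cos_sim)   # the two values agree up to float error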

Related

Understanding why two TF-IDF vectors are similar

I want some feedback on an approach to understanding the results of TF-IDF vectors, and possibly alternative approaches.
Right now, I have two corpora of text. The goal is to find which documents in each corpus are most similar.
When I find a match that is interesting, I want to know why, so I've implemented a simple function called why_match(), but I'd like some help to know whether it is a valid approach.
It works like this:
import numpy as np

def why_match(doc_vector_a, doc_vector_b, vectorizer):
    # element-wise absolute difference between the two TF-IDF vectors
    distance = np.abs(doc_vector_a - doc_vector_b)
    # keep terms whose weights differ, but only by a tiny amount
    nearest_words = (distance != 0) & (distance < 0.0015)
    vocab = np.array(vectorizer.get_feature_names())
    keywords = vocab[nearest_words]
    return keywords
The idea is to return all keywords whose TF-IDF weights differ by less than some threshold (0.0015) but not by exactly 0 (a difference of 0 most likely means the word isn't in either document).
Is this a valid way to 'explain' closeness in TF-IDF? My results are decent, but it seems to attach a great deal of value to very generic words, which is unfortunate, but very telling for my task.
It certainly is one way of explaining vector similarity. Your approach calculates the Manhattan distance (or L1-distance) between two TF-IDF vectors. Just like the Euclidean distance (or L2-distance), the Manhattan distance is overly influenced by "large" values in one of the vectors. A more common approach is therefore to measure the angle between the two vectors using cosine similarity.
A minimal example for using cosine similarity would be as follows:
from scipy.spatial.distance import cosine
sim = 1 - cosine(doc_vector_a, doc_vector_b)
The 1 - cosine(...) is necessary because the scipy function returns the distance rather than the similarity.
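If the vectors come from scikit-learn's TfidfVectorizer they will be sparse rows, so (as a sketch under that assumption) they need to be densified first:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["a toy document", "another toy document"])

doc_vector_a = X[0].toarray().ravel()   # dense 1-D arrays for the scipy function
doc_vector_b = X[1].toarray().ravel()

sim = 1 - cosine(doc_vector_a, doc_vector_b)   # scipy returns the distance
print(sim)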

Dissimilar Features between two documents

I am trying to find the dissimilarity between two documents. I am using gensim and so far have obtained a similarity score.
Is there any way to know the dissimilarity score and dissimilar features between two documents?
And how to evaluate it?
Cosine similarity using word vectors gives the semantic similarity between two sentences. First, let's understand how it is calculated. Suppose there are two vectors a and b representing two text documents. Then the dot product of the vectors is given by
a · b = ||a|| ||b|| cos(θ).
Geometrically, θ is the angle between the vectors a and b: the smaller the angle, the greater the similarity, and the cosine similarity method reports cos(θ) as that measure. If the two vectors are close to each other, the angle is small and the cosine similarity is high. If the angle is near 90°, the cosine is near zero.
So a low cosine similarity indicates unrelated vectors, and for text documents unrelatedness is one measure of dissimilarity. If the angle is close to 180°, the cosine similarity is close to -1; this could mean that the two documents have opposite meanings, which is a different type of dissimilarity.
To sum it up, you can use both unrelated and opposite vectors to measure dissimilarity depending upon your application.
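As a quick numeric illustration of these three cases (toy 2-D vectors, not real document representations):
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 1.0])
print(cos_sim(a, np.array([2.0, 2.0])))     #  1.0 -> same direction
print(cos_sim(a, np.array([1.0, -1.0])))    #  0.0 -> orthogonal, unrelated
print(cos_sim(a, np.array([-1.0, -1.0])))   # -1.0 -> opposite direction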
You may also consider syntactic differences, such as differences in dependency parse trees, named entities, etc. But without knowing what exactly you are trying to achieve, it's difficult to suggest a single method.

Cosine similarity alternative for tf-idf (triangle inequality)

I am trying to use tf-idf to cluster similar documents. One of the major drawbacks of my system is that it uses cosine similarity to decide which vectors should be grouped together.
The problem is that cosine similarity does not satisfy the triangle inequality. Because in my case I cannot have the same vector in multiple clusters, I have to merge every pair of clusters with an element in common, which can cause two documents to be grouped together even if they're not similar to each other.
Is there another way of measuring the similarity of two documents such that:
Vectors score as very similar based on their direction, regardless of their magnitude
The measure satisfies the triangle inequality: if A is similar to B and B is similar to C, then A is also similar to C
Not sure if it can help you, but have a look at the TS-SS method in this paper. It addresses some drawbacks of cosine similarity and Euclidean distance, which helps to identify similarity among vectors with higher accuracy. The higher accuracy helps you to understand which documents are highly similar and can be grouped together. The paper shows why TS-SS can help you with that.
On L2-normalized data, squared Euclidean distance is equivalent to cosine: ||A - B||^2 = 2 - 2·Cos(A,B), so Euclidean distance is a monotone (decreasing) function of cosine similarity.
So simply L2-normalize your vectors to unit length and use Euclidean distance, which is a proper metric and satisfies the triangle inequality.
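A minimal sketch of that idea, assuming scikit-learn: L2-normalize the tf-idf rows, then cluster with a Euclidean-based algorithm such as k-means:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

docs = [
    "cats and dogs",
    "dogs and cats playing",
    "stock market prices",
    "market prices and stocks",
]

X = normalize(TfidfVectorizer().fit_transform(docs))   # unit-length rows

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents about the same topic should share a label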

interpretation of SVD for text mining topic analysis

Background
I'm learning about text mining by building my own text mining toolkit from scratch - the best way to learn!
SVD
The Singular Value Decomposition is often cited as a good way to:
Visualise high dimensional data (word-document matrix) in 2d/3d
Extract key topics by reducing dimensions
I've spent about a month learning about the SVD... I must admit much of the online tutorials, papers, university lecture slides, and even proper printed textbooks are not that easy to digest.
Here's my understanding so far: SVD demystified (blog)
I think I have understood the following:
Any (real) matrix can be decomposed uniquely into 3 multiplied
matrices using SVD, A=U⋅S⋅V^T
S is a diagonal matrix of singular values, in descending order of magnitude
U and V^T are matrices of orthonormal vectors
I understand that we can reduce the dimensions by filtering out less significant information by zero-ing the smaller elements of S, and reconstructing the original data. If I wanted to reduce dimensions to 2, I'd only keep the 2 top-left-most elements of the diagonal S to form a new matrix S'
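For concreteness, a minimal numpy sketch of that truncation step (a random matrix stands in for the word-document matrix):
import numpy as np

A = np.random.rand(6, 4)                # rows = words, columns = documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
s_trunc = s.copy()
s_trunc[k:] = 0.0                       # zero out the smaller singular values
S_prime = np.diag(s_trunc)              # this is S'

A_approx = U @ S_prime @ Vt             # best rank-2 approximation of A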
My Problem
To see the documents projected onto the reduced dimension space, I've seen people use S'⋅V^T. Why? What's the interpretation of S'⋅V^T?
Similarly, to see the topics, I've seen people use U⋅S'. Why? What's the interpretation of this?
My limited school maths tells me I should look at these as transformations (rotation, scale) ... but that doesn't help clarify it either.
** Update **
I've added an update to my blog explanation at SVD demystified (blog) which reflects the rationale from one of the textbooks I looked at to explain why S'.V^T is a document view, and why U.S' is a word view. Still not really convinced ...

Why is the similarity between two bags-of-words in gensim.word2vec calculated this way?

def n_similarity(self, ws1, ws2):
    # look up the vector for every word in each word set
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    # average each set's vectors, scale to unit length, then take the dot
    # product, i.e. the cosine similarity of the two mean vectors
    return dot(matutils.unitvec(array(v1).mean(axis=0)),
               matutils.unitvec(array(v2).mean(axis=0)))
This is the code I excerpted from gensim.word2vec. I know that the similarity of two single words can be calculated with cosine distance, but what about two word sets? The code seems to take the mean of each set's word vectors and then compute the cosine similarity of the two mean vectors. I know little about word2vec; is there some foundation for such a process?
Taking the mean of all word vectors is the simplest way of reducing them to a single vector so that cosine similarity can be used. The intuition is that by adding up all the word vectors you get a bit of each of them (their meaning) in the result. You then divide by the number of vectors so that larger bags of words don't end up with longer vectors (not that it matters for cosine similarity anyway).
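As a toy sketch of the same computation with plain numpy (random stand-ins for word vectors; a real model would supply these):
import numpy as np

def mean_vector_similarity(vecs1, vecs2):
    m1 = np.mean(vecs1, axis=0)
    m2 = np.mean(vecs2, axis=0)
    m1 /= np.linalg.norm(m1)    # unit length, like matutils.unitvec
    m2 /= np.linalg.norm(m2)
    return float(m1 @ m2)       # cosine similarity of the two mean vectors

rng = np.random.default_rng(0)
set1 = rng.normal(size=(3, 50))   # 3 "word vectors" of dimension 50
set2 = rng.normal(size=(4, 50))   # 4 "word vectors" of dimension 50
print(mean_vector_similarity(set1, set2))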
Reducing an entire sentence to a single vector is a complex problem, and there are other ways to do it. I wrote a bit about it in a related question on SO. Since then a bunch of new algorithms have been proposed. One of the more accessible ones is Paragraph Vector, which you shouldn't have problems understanding if you are familiar with word2vec.
