Understanding why two TF-IDF vectors are similar - text

I want some feedback on an approach to understanding the results of TF-IDF vectors, and possibly alternative approaches.
Right now, I have two corpora of text. The goal is to find which documents in each corpus are most similar.
When I find a match that is interesting, I want to know why, so I've implemented a simple function called why_match(), but I'd like help to figure out whether it is a valid approach.
It works like this:
import numpy as np

def why_match(doc_vector_a, doc_vector_b, vectorizer):
    # Element-wise absolute difference between the two (dense) TF-IDF weight vectors.
    distance = np.abs(doc_vector_a - doc_vector_b)
    # Keep terms whose weights differ by less than the threshold but are not identical
    # (a difference of exactly 0 usually means the term is absent from both documents).
    nearest_words = (distance != 0) & (distance < 0.0015)
    vocab = np.array(vectorizer.get_feature_names())
    keywords = vocab[nearest_words]
    return keywords
The idea is to return all keywords whose weights are closer than some threshold (0.0015) but whose difference is not 0 (a difference of exactly 0 most likely means the word doesn't appear in either document).
Is this a valid way to 'explain' closeness in TF-IDF? My results are decent, but the function seems to attach a great deal of value to very generic words, which is unfortunate, but very telling for my task.

It certainly is one way of explaining vector similarity. Your approach calculates the Manhattan distance (or L1-distance) between two TF-IDF vectors. Just like the Euclidean distance (or L2-distance), the Manhattan distance is overly influenced by "large" values in one of the vectors. A more common approach is therefore to measure the angle between the two vectors using cosine similarity.
A minimal example for using cosine similarity would be as follows:
from scipy.spatial.distance import cosine
sim = 1 - cosine(doc_vector_a, doc_vector_b)
The 1 - cosine(...) is necessary because the scipy function returns the distance rather than the similarity.
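If you want a cosine-flavoured explanation of why two documents match, one option (not from the answer above, just a sketch assuming dense 1-D TF-IDF vectors and the same fitted vectorizer) is to rank terms by their contribution to the cosine similarity: for L2-normalized vectors the cosine is the sum of the element-wise products, so each term contributes a_i * b_i.
import numpy as np
from sklearn.preprocessing import normalize

def top_shared_terms(doc_vector_a, doc_vector_b, vectorizer, n=10):  # hypothetical helper
    # L2-normalize both vectors so their dot product equals the cosine similarity.
    a = normalize(doc_vector_a.reshape(1, -1)).ravel()
    b = normalize(doc_vector_b.reshape(1, -1)).ravel()
    contributions = a * b  # per-term share of the cosine similarity
    vocab = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
    order = np.argsort(contributions)[::-1]  # largest contributions first
    return list(zip(vocab[order[:n]], contributions[order[:n]]))
Terms with the largest products are the ones driving the similarity, which often pushes very generic (low IDF weight) words further down the list than the absolute-difference threshold does.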

Related

Dissimilar Features between two documents

I am trying to find the dissimilarity between two documents. I am using gensim and so far have obtained a similarity score.
Is there any way to get a dissimilarity score and the dissimilar features between two documents?
And how would I evaluate it?
Cosine similarity using word vectors gives the semantic similarity between two sentences. First, let's understand how this is calculated. Suppose there are two vectors a and b representing two text documents. Then the dot product of the vectors is given by a · b = ||a|| ||b|| cos(theta).
Geometrically, theta represents the angle between the vectors a and b in the plane. So the smaller the angle, the greater the similarity; the cosine similarity method thus reports this angle measure. If the difference between the two vectors is small, the angle is small and the cosine similarity is high. If the angle is near 90°, its cosine is near zero.
So, low cosine similarity scores represent unrelated vectors. Of course, unrelated vectors can be taken as a measure of dissimilarity in the case of text documents. On the other hand, if the angle is close to 180°, the cosine similarity will be close to -1. This could mean that the two documents have opposite meanings, which is again a different type of dissimilarity.
To sum it up, you can use both unrelated and opposite vectors to measure dissimilarity depending upon your application.
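As a small illustration (not part of the original answer), here is what those three cases look like numerically with scipy:
from scipy.spatial.distance import cosine
import numpy as np

a = np.array([1.0, 2.0, 3.0])
print(1 - cosine(a, 2 * a))        # ~1.0  -> same direction, highly similar
print(1 - cosine([1, 0], [0, 1]))  # ~0.0  -> orthogonal, unrelated
print(1 - cosine(a, -a))           # ~-1.0 -> opposite direction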
You may also consider syntactic differences, such as differences in dependency parse trees, named entities, etc. But without knowing what exactly you are trying to achieve, it's difficult to suggest a single method.

What is the difference between wmd (word mover distance) and wmd based similarity?

I am using WMD to calculate the similarity scale between sentences. For example:
distance = model.wmdistance(sentence_obama, sentence_president)
Reference: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
However, there is also a WMD-based similarity method (WmdSimilarity).
Reference: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
What is the difference between the two except the obvious that one is distance and another similarity?
Update: Both are essentially the same; they just represent the result differently.
n_queries = len(query)
result = []
for qidx in range(n_queries):
    # Compute similarity for each query.
    qresult = [self.w2v_model.wmdistance(document, query[qidx]) for document in self.corpus]
    qresult = numpy.array(qresult)
    qresult = 1./(1.+qresult)  # Similarity is the negative of the distance.
    # Append single query result to list of all results.
    result.append(qresult)
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/similarities/docsim.py
I think with the 'update' you've more-or-less answered your own question.
That one is a distance, and the other a similarity, is the only difference between the two calculations. As the notebook you link notes in the relevant section:
WMD is a measure of distance. The similarities in WmdSimilarity are simply the negative distance. Be careful not to confuse distances and similarities. Two similar documents will have a high similarity score and a small distance; two very different documents will have low similarity score, and a large distance.
As the code you've excerpted shows, the similarity measure being used there is not exactly the 'negative' distance, but scaled so all similarity values are from 0.0 (exclusive) to 1.0 (inclusive). (That is, a zero distance becomes a 1.0 similarity, but ever-larger distances become ever-closer to 0.0.)
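A minimal sketch of that conversion, reusing the example from the question (it assumes a trained gensim model named model and the tokenized sentences sentence_obama and sentence_president):
distance = model.wmdistance(sentence_obama, sentence_president)
similarity = 1.0 / (1.0 + distance)  # same scaling as WmdSimilarity: 0 distance -> 1.0, large distance -> ~0.0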

Cosine similarity alternative for tf-idf (triangle inequality)

I am trying to use TF-IDF to cluster similar documents. One of the major drawbacks of my system is that it uses cosine similarity to decide which vectors should be grouped together.
The problem is that cosine similarity does not satisfy the triangle inequality. Because in my case I cannot have the same vector in multiple clusters, I have to merge every pair of clusters with an element in common, which can cause two documents to be grouped together even if they're not similar to each other.
Is there another way of measuring the similarity of two documents so that:
Vectors score as very similar based on their direction, regardless of their magnitude
It satisfies the triangle inequality: if A is similar to B and B is similar to C, then A is also similar to C
Not sure if it can help you, but have a look at the TS-SS method in this paper. It addresses some drawbacks of cosine similarity and Euclidean distance, which helps to identify similarity among vectors with higher accuracy. The higher accuracy helps you to understand which documents are highly similar and can be grouped together. The paper shows why TS-SS can help you with that.
Cosine distance is proportional to squared Euclidean distance on L2-normalized data.
So simply L2-normalize your vectors to unit length and use Euclidean distance.
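A minimal sketch of that suggestion (docs is just a placeholder list of strings, and TfidfVectorizer already L2-normalizes its rows by default, so the explicit normalize call is only there to make the assumption obvious):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from scipy.spatial.distance import euclidean

docs = ["first example document", "second example document", "something entirely different"]
X = normalize(TfidfVectorizer().fit_transform(docs))  # unit-length TF-IDF rows
# On unit vectors, Euclidean distance satisfies the triangle inequality and
# ranks document pairs the same way cosine similarity does.
d = euclidean(X[0].toarray().ravel(), X[1].toarray().ravel())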

Euclidean vs Cosine for text data

If I use TF-IDF feature representation (or just document length normalization), are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks I have read and other forums and discussions say cosine similarity works better for text...
I wrote some basic code to test this and found that they are indeed comparable: not exactly the same floating-point values, but it looks like a scaled version. Given below are the results of both similarities on simple demo text data. Text no. 2 is a big line of about 50 words; the rest are small 10-word lines.
Cosine similarity:
0.0, 0.2967, 0.203, 0.2058
Euclidean distance:
0.0, 0.285, 0.2407, 0.2421
Note: If this question is more suitable to Cross Validation or Data Science, please let me know.
If your data is normalized to unit length, then it is very easy to prove that
Euclidean(A,B)^2 = 2 - 2*Cos(A,B)
This only holds if ||A|| = ||B|| = 1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. I.e., if you first normalize your document to unit length and next perform IDF weighting, then it will not hold...
Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.
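A quick numerical check of that identity (just a sketch with two hand-made unit vectors):
import numpy as np

a = np.array([3.0, 4.0]); a /= np.linalg.norm(a)
b = np.array([1.0, 2.0]); b /= np.linalg.norm(b)
print(np.sum((a - b) ** 2), 2 - 2 * np.dot(a, b))  # both print the same value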

Why is the similarity between two bags-of-words in gensim.word2vec calculated this way?

def n_similarity(self, ws1, ws2):
    # Look up the vector for every word in each bag of words.
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    # Average each set, normalize the means to unit length, and take their dot product (cosine).
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
This is code I excerpted from gensim.word2vec. I know that the similarity of two single words can be calculated with cosine distance, but what about two word sets? The code seems to take the mean of each set of word vectors and then compute the cosine distance between the two mean vectors. I know little about word2vec; is there some foundation for this process?
Taking the mean of all word vectors is the simplest way of reducing them to a single vector so that cosine similarity can be used. The intuition is that by adding up all word vectors you get a bit of all of them (of their meaning) in the result. You then divide by the number of vectors so that larger bags of words don't end up with longer vectors (not that it matters for cosine similarity anyway).
Reducing an entire sentence to a single vector is a complex problem, and there are other ways to do it. I wrote a bit about it in a related question on SO. Since then, a bunch of new algorithms have been proposed. One of the more accessible ones is Paragraph Vector, which you shouldn't have problems understanding if you are familiar with word2vec.
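For reference, here is a standalone sketch of the same averaging approach in plain numpy (it assumes wv is a loaded gensim KeyedVectors object, e.g. model.wv, and that every word is in the vocabulary):
import numpy as np

def bow_similarity(wv, words1, words2):  # hypothetical helper, mirrors n_similarity
    # Average the word vectors of each bag of words.
    v1 = np.mean([wv[w] for w in words1], axis=0)
    v2 = np.mean([wv[w] for w in words2], axis=0)
    # Normalize the means to unit length and take their dot product, i.e. cosine similarity.
    return float(np.dot(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)))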
