I am trying to use tf-idf to cluster similar documents. One of the major drawbacks of my system is that it uses cosine similarity to decide which vectors should be grouped together.
The problem is that cosine similarity does not satisfy the triangle inequality. Because in my case I cannot have the same vector in multiple clusters, I have to merge every pair of clusters with an element in common, which can cause two documents to be grouped together even if they're not similar to each other.
Is there another way to measure the similarity of two documents so that:
Vectors score as very similar based on their direction, regardless of their magnitude
The measure satisfies the triangle inequality: if A is similar to B and B is similar to C, then A is also similar to C
Not sure if it can help you, but have a look at the TS-SS method in this paper. It addresses some drawbacks of cosine similarity and Euclidean distance (ED), which helps to identify similarity among vectors with higher accuracy. The higher accuracy helps you understand which documents are highly similar and can be grouped together. The paper shows why TS-SS can help you with that.
Cosine distance is equivalent to squared Euclidean distance on normalized data: for unit-length vectors, ||a − b||² = 2(1 − cos(a, b)).
So simply L2-normalize your vectors to unit length and use Euclidean distance, which is a proper metric and satisfies the triangle inequality.
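A quick numerical check of this identity, with two arbitrary example vectors standing in for TF-IDF rows:

```python
import numpy as np

# Hypothetical example vectors standing in for TF-IDF rows.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# L2-normalize to unit length.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cos_sim = float(np.dot(a_n, b_n))
euclid_sq = float(np.sum((a_n - b_n) ** 2))

# Identity: squared Euclidean distance on unit vectors = 2 * (1 - cosine similarity)
```

So nearest-neighbour structures built on Euclidean distance (KD-trees, ball trees, Euclidean LSH) can be reused for cosine after normalization.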
These two distance measures seem to be the most common in NLP from what I've read. I'm currently using cosine similarity (as does the gensim.fasttext distance measurement). Is there any case to be made for using Jaccard instead? Does it even work with only single words as input (with the use of ngrams, I suppose)?
import fasttext
import scipy.spatial.distance

ft = fasttext.load_model('cc.en.300.bin')
distance = scipy.spatial.distance.cosine(ft['word1'], ft['word2'])
I suppose I could imagine Jaccard similarity over bags-of-ngrams being useful for something. You could try some experiments to see if it correlates with good performance on some particular word-to-word task.
Maybe: typo correction? Or perhaps, when using a plain, non-Fasttext set-of-word-vectors, you might try synthesizing vectors for OOV words, by some weighted average of the most ngram-Jaccard-similar existing words? (In both cases: other simple comparisons, like edit-distance or shared-substring counting, might do better.)
But, I've not noticed projects using Jaccard-over-ngrams in lieu of whole-word-vector to whole-word-vector comparisons, nor libraries offering it as part of their interfaces/examples.
You've also only described its potential use very vaguely ("with the use of ngrams I suppose"), with no code either demonstrating such a calculation or showing its results being put to any use.
So potential usefulness seems like a research conjecture that you'd need to probe with your own experiments.
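For such an experiment, a minimal sketch of Jaccard similarity over character n-grams might look like this (the function names here are illustrative, not an existing library API):

```python
# Hedged sketch: Jaccard similarity over character n-grams for two words.

def char_ngrams(word, n=3):
    # Pad with boundary markers so short words still produce n-grams,
    # similar in spirit to FastText's subword scheme.
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard_similarity(word1, word2, n=3):
    a, b = char_ngrams(word1, n), char_ngrams(word2, n)
    # |intersection| / |union|; 0.0 when both sets are empty.
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Note this captures surface-form overlap only (useful for typos or morphology), not the semantic similarity word vectors provide.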
The problem:
Suppose I have a group D of around 1,000,000 short documents (no more than 50 words each), and I want to let users supply a document from the same group D and get the top K most similar documents from D.
My approach:
My first approach was to preprocess the group D by applying simple tf-idf, and once I have a vector for each document (which is extremely sparse), to use a simple nearest-neighbours algorithm based on cosine similarity.
Then, at query time, to just use my static nearest-neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of dimension ~200,000, which means I now have a very sparse table (which can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000.
However, calculating the nearest-neighbours model has taken more than a day and still hasn't finished.
I tried to lower the vector dimension by applying HashingTF, which utilizes the hashing trick, so I can set the dimension to a constant (in my case, I used 2^13), but I still get the same bad performance.
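For context, the hashing trick maps each term to a fixed-size index, so the vector dimension stays constant regardless of vocabulary size. A minimal illustration (`hashed_tf` is a hypothetical name, not Spark's HashingTF API):

```python
import hashlib
import numpy as np

def hashed_tf(tokens, dim=2**13):
    # Fixed-dimension term-frequency vector via the hashing trick.
    vec = np.zeros(dim)
    for tok in tokens:
        # Stable hash of the token, folded into [0, dim).
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0  # hash collisions simply add up
    return vec
```

The trade-off is that distinct terms can collide into the same index, adding noise to similarity scores.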
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and sklearn's NearestNeighbors on the collected data.
Is there any more efficient way to achieve that goal?
Thanks in advance.
Edit:
I had the idea of trying an LSH-based approximate similarity algorithm like those implemented in Spark, as described here, but could not find one that supports the 'cosine' similarity metric.
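For reference, random hyperplane projection (SimHash) is an LSH family for cosine similarity; a minimal sketch, with illustrative names and sizes rather than any Spark API:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 300, 16  # illustrative vector dimension and signature length

# Each random hyperplane contributes one signature bit.
planes = rng.standard_normal((n_bits, dim))

def signature(v):
    # Bit i records which side of hyperplane i the vector falls on;
    # this depends only on direction, not magnitude, matching cosine.
    return tuple(bool(x) for x in (planes @ v) > 0)
```

Vectors sharing a signature land in the same bucket; the small candidate set is then re-ranked with exact cosine similarity.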
There were some requirements for the algorithm on the relation between the number of training instances and the dimensions of your vectors, but you can try DIMSUM.
You can find the paper here.
I want some feedback on an approach to understanding the results of TF-IDF vectors, and possibly alternative approaches.
Right now, I have two corpora of text. The goal is to find which documents in each corpus are most similar.
When I find a match that is interesting, I want to know why, so I've implemented a simple function called why_match(), but I'd like help determining whether it is a valid approach.
It works like this:
def why_match(doc_vector_a, doc_vector_b, sklearn_tfidfvectorizer):
    # Per-feature absolute difference between the two TF-IDF vectors.
    distance = abs(doc_vector_a - doc_vector_b)
    # Features with nonzero but nearly equal weights in both documents.
    nearest_words = (distance != 0) & (distance < 0.0015)
    vocab = np.array(sklearn_tfidfvectorizer.get_feature_names())
    keywords = vocab[nearest_words]
    return keywords
The idea is to return all keywords whose weights differ by less than some threshold (0.0015) but not by exactly 0 (a zero difference most likely means the word isn't in either document).
Is this a valid way to 'explain' closeness in TF-IDF? My results are decent, but the approach seems to attach a great deal of value to very generic words, which is unfortunate but very telling for my task.
It certainly is one way of explaining vector similarity. Your approach calculates the Manhattan distance (or L1 distance) between two TF-IDF vectors. Just like the Euclidean distance (or L2 distance), the Manhattan distance is overly influenced by "large" values in one of the vectors. A more common approach is therefore to measure the angle between the two vectors using cosine similarity.
A minimal example for using cosine similarity would be as follows:
from scipy.spatial.distance import cosine
sim = 1 - cosine(doc_vector_a, doc_vector_b)
The 1 - cosine(...) is necessary because the scipy function returns the distance rather than the similarity.
I am trying to find the dissimilarity between two documents. I am using gensim and so far have obtained a similarity score.
Is there any way to know the dissimilarity score and dissimilar features between two documents?
And how to evaluate it?
Cosine similarity using word vectors gives the semantic similarity between two sentences. First, let's understand how this is calculated. Suppose there are two vectors a = (a1, ..., an) and b = (b1, ..., bn) representing two text documents. Then the dot product of the vectors is given by a . b = |a| |b| cos(theta) = a1*b1 + ... + an*bn.
Geometrically, theta represents the angle between the vectors a and b in the plane. The smaller the angle, the greater the similarity; the cosine similarity method thus reports this angle measure. If the difference between the two vectors is small, the angle is small and the cosine similarity is high. If the angle is near 90°, the cosine is near zero.
So, low cosine similarity scores represent unrelated vectors. Of course, unrelated vectors may be a measure of dissimilarity in the case of text documents. Alternatively, if the angle is close to 180°, the cosine similarity will be close to −1: large in magnitude but negated. This could mean that the two documents have opposite meanings, which is a different type of dissimilarity.
To sum it up, you can use both unrelated and opposite vectors to measure dissimilarity depending upon your application.
You may also consider syntactic differences, such as differences in dependency parse trees, named entities, etc. But without knowing what exactly you are trying to achieve, it's difficult to suggest a single method.
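A tiny worked example of the two cases described above (orthogonal versus opposite vectors), using plain numpy:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between a and b.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # 90 degrees apart: unrelated
c = np.array([-1.0, 0.0])  # 180 degrees apart: opposite direction
```

Here cosine_sim(a, b) is 0 (unrelated) and cosine_sim(a, c) is −1 (opposite), the two kinds of dissimilarity discussed above.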
def n_similarity(self, ws1, ws2):
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    return dot(matutils.unitvec(array(v1).mean(axis=0)),
               matutils.unitvec(array(v2).mean(axis=0)))
This is code I excerpted from gensim's word2vec. I know that the similarity of two single words can be calculated via cosine distance, but what about two sets of words? The code seems to take the mean of each set's word vectors and then compute the cosine similarity of the two mean vectors. I know little about word2vec; is there some foundation for this process?
Taking the mean of all word vectors is the simplest way of reducing them to a single vector so that cosine similarity can be used. The intuition is that by adding up all the word vectors you get a bit of each of them (their meaning) in the result. You then divide by the number of vectors so that larger bags of words don't end up with longer vectors (not that it matters for cosine similarity anyway).
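The same idea as gensim's n_similarity, sketched in plain numpy (the function name is illustrative): average each set of word vectors, unit-normalize, and take the dot product.

```python
import numpy as np

def mean_vec_similarity(vecs1, vecs2):
    # Average each set of word vectors into one vector per set.
    m1 = np.mean(vecs1, axis=0)
    m2 = np.mean(vecs2, axis=0)
    # Unit-normalize so the dot product equals cosine similarity.
    m1 = m1 / np.linalg.norm(m1)
    m2 = m2 / np.linalg.norm(m2)
    return float(np.dot(m1, m2))
```

With identical word sets this returns 1.0; with sets whose mean vectors point in opposite directions it returns −1.0.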
Reducing an entire sentence to a single vector is a complex problem, and there are other approaches. I wrote a bit about it in a related question on SO. Since then, a bunch of new algorithms have been proposed. One of the more accessible ones is Paragraph Vector, which you shouldn't have problems understanding if you are familiar with word2vec.