Threshold for TF-IDF cosine similarity scores - document

This question is very similar to this one: Systematic threshold for cosine similarity with TF-IDF weights
How should I cut off tiny similarities? In the link above, the answer gives a technique based on averages, but that could still return documents even when all similarities are very small, for example < 0.01.
How do I know when a query document is so unrelated to the corpus that no other document should be considered similar to it? Is there a systematic way to define a cutoff value for this?
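For concreteness, here is a minimal sketch of one possible heuristic (not a canonical answer): combine the averaging idea from the linked question with an absolute floor, so that a query unrelated to the whole corpus returns nothing. It assumes scikit-learn, and the 0.1 floor is an arbitrary illustrative value.

```python
# A possible heuristic: a relative cutoff based on the similarity distribution,
# plus an absolute floor below which nothing counts as similar.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]
query = "my cat sleeps on the mat"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)
query_vec = vectorizer.transform([query])

sims = cosine_similarity(query_vec, doc_matrix).ravel()

# Relative cutoff: keep documents well above the average similarity.
relative_cutoff = sims.mean() + sims.std()

# Absolute floor: below this, nothing is considered similar, even if it is
# the "best" match.  The value 0.1 is an arbitrary choice for illustration.
ABSOLUTE_FLOOR = 0.1

threshold = max(relative_cutoff, ABSOLUTE_FLOOR)
matches = [(corpus[i], s) for i, s in enumerate(sims) if s >= threshold]
print(matches)  # a query unrelated to the whole corpus yields an empty list
```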

Related

Cosine Similarity with word2vec not giving good document similarity

Why is cosine similarity with word embeddings not providing good output? It gives high similarity values between a new document and most of the historical documents, even though the documents are not similar.
Cosine similarity measures how similar two vectors are based on the angle between them.
How do you construct the embedding of a document, given that word2vec only gives you embeddings of words?
Most people use tf-idf as a metric to rank documents.
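To illustrate the comments above, here is a toy sketch of the most common approach: average the word vectors of a document and compare documents with cosine similarity. The tiny hand-made vectors stand in for a real word2vec model.

```python
# One common (but lossy) way to turn word vectors into a document vector:
# average the vectors of the words in the document, then compare documents
# with cosine similarity.  Toy 3-dimensional vectors stand in for word2vec.
import numpy as np

word_vectors = {                       # stand-in for a trained word2vec model
    "cat":    np.array([0.9, 0.1, 0.0]),
    "dog":    np.array([0.8, 0.2, 0.0]),
    "stock":  np.array([0.0, 0.1, 0.9]),
    "market": np.array([0.1, 0.0, 0.8]),
}

def doc_embedding(tokens):
    """Average the vectors of known words; ignore out-of-vocabulary tokens."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_embedding("the cat chased the dog".split())
doc_b = doc_embedding("stock market report".split())
print(cosine(doc_a, doc_b))   # low: the documents share no topical words
```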

Why use cosine similarity in Word2Vec when it's trained using dot-product similarity?

According to several posts I found on Stack Overflow (for instance this one: Why does word2Vec use cosine similarity?), it's common practice to calculate the cosine similarity between two word vectors after we have trained a word2vec (either CBOW or Skip-gram) model. However, this seems a little odd to me, since the model is actually trained with the dot product as its similarity score. One piece of evidence for this is that the norms of the word vectors we get after training are actually meaningful. So why do people still use cosine similarity instead of the dot product when calculating the similarity between two words?
Cosine similarity and dot product are both similarity measures, but the dot product is magnitude-sensitive while cosine similarity is not. Depending on its occurrence count, a word might have a large or small dot product with another word. We normally normalize our vectors to prevent this effect, so that all vectors have unit magnitude. If your particular downstream task requires occurrence counts as a feature, then the dot product might be the way to go; but if you do not care about counts, you can simply calculate cosine similarity, which normalizes them.
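A small sketch of the difference, using plain NumPy and toy vectors:

```python
# Dot product vs. cosine similarity: the dot product grows with vector
# magnitude, cosine similarity does not.  A frequent word with a long vector
# can dominate dot-product rankings even when its direction is unchanged.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                       # same direction, ten times the magnitude

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(a @ a, a @ b)                # 14.0 vs 140.0 -- magnitude-sensitive
print(cosine(a, a), cosine(a, b))  # both ~1.0     -- magnitude-invariant
```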

Why does word2Vec use cosine similarity?

I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts.
However, I do not understand why cosine is the correct measure of word similarity. Cosine similarity only tells us whether two vectors point in the same direction; they could still have very different magnitudes.
For example, cosine similarity makes sense comparing bag-of-words for documents. Two documents might be of different length, but have similar distributions of words.
Why not, say, Euclidean distance?
Can anyone explain why cosine similarity works for word2vec?
Those two distance metrics are probably strongly correlated so it might not matter all that much which one you use. As you point out, cosine distance means we don't have to worry about the length of the vectors at all.
This paper indicates that there is a relationship between the frequency of the word and the length of the word2vec vector. http://arxiv.org/pdf/1508.02297v1.pdf
Cosine similarity of two n-dimensional vectors A and B is defined as
$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}},$$
which is simply the cosine of the angle between A and B, while the Euclidean distance is defined as
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}.$$
Now think about the distance between two random elements of the vector space. The cosine is bounded, since its range is [-1, 1], whereas the Euclidean distance can be any non-negative value.
As the dimension n grows, the angle between two randomly chosen directions gets closer and closer to 90° (so their cosine similarity concentrates near 0), whereas two random points in the unit cube of R^n are at an expected Euclidean distance of roughly $0.41\sqrt{n}$ (source).
TL;DR
cosine distance is better for vectors in a high-dimensional space because of the curse of dimensionality. (I'm not absolutely sure about it, though)
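For anyone who wants to check the claim above empirically, here is a small NumPy sketch; centring the vectors is my own choice so that the directions are random.

```python
# Empirical check: for random directions in R^n the angle concentrates near
# 90 degrees, and for random points in the unit cube [0, 1]^n the Euclidean
# distance grows roughly like 0.41 * sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10000):
    a, b = rng.random((2, n)) - 0.5   # centred, so directions are random
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    x, y = rng.random((2, n))         # points in the unit cube [0, 1]^n
    dist = np.linalg.norm(x - y)
    print(n, round(np.degrees(np.arccos(cos)), 1), round(dist / np.sqrt(n), 3))
```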

Does cosine similarity of sentence representations formed from word vectors now measure word order?

I know that the original cosine similarity, when applied to representations of two documents by frequencies of specific words, does not measure word order. I now see a whole bunch of papers applying cosine similarity to representations of pairs of sentences formed from word vectors. I assume they flatten the (number of tokens) × (embedding length) matrix of each sentence into a long vector of length (number of tokens) × (embedding length). So "I love you" and "you love me" (with "me" normalized to "I") would not yield 1 in this new way of applying cosine similarity, whereas the old way would yield 1. Am I correct? Thanks for any enlightening answer.
Exactly!
"I love you" and "you love me(normalized to "I") would not yield 1 in this new way of applying cosine similarity whereas the old way would yield 1.
This modification is made:
"A slight modification is made for sentence representation. Instead of using indexing words from a text collection, a set of words that appear in the sentence pair is used as a feature set. This is done to reduce the degree of data sparseness in sentence representation. The standard TF-IDF similarity (simTFIDF,vector) is defined as cosine similarity between vector representation of two sentences."
you can read more here
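A toy sketch of the point discussed above (the 2-dimensional vectors are made up; real embeddings would just be longer): concatenating per-token vectors is order-sensitive, while summing them, as in a bag-of-words-style representation, is not.

```python
# Concatenating (flattening) per-token vectors is order-sensitive; summing
# them is not.  Toy 2-dimensional vectors stand in for real embeddings.
import numpy as np

emb = {"i": np.array([1.0, 0.0]),
       "love": np.array([0.0, 1.0]),
       "you": np.array([1.0, 1.0])}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = "i love you".split()
s2 = "you love i".split()        # "you love me", with "me" normalised to "i"

flat1 = np.concatenate([emb[t] for t in s1])   # order-sensitive
flat2 = np.concatenate([emb[t] for t in s2])
bow1 = sum(emb[t] for t in s1)                 # order-insensitive
bow2 = sum(emb[t] for t in s2)

print(cosine(flat1, flat2))  # < 1: word order changes the flattened vector
print(cosine(bow1, bow2))    # == 1: the summed representations are identical
```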

K-means text document clustering: how to calculate intra- and inter-cluster similarity?

I am clustering thousands of documents whose vector components are calculated with TF-IDF, and I use cosine similarity. I did a frequency analysis of the top words in each cluster to check that the clusters differ, but I'm not sure how to measure the quality of the clustering numerically for this kind of document collection.
I compute the internal (intra-cluster) similarity of a cluster as the average similarity of each document to the centroid of the cluster. (If I averaged over pairs instead, the average would be based on only a small number of values.)
I compute the external (inter-cluster) similarity as the average similarity over all pairs of cluster centroids.
Am I computing this correctly? My inter-cluster similarity averages range from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the broad range of computer science topics in the documents; the intra-cluster similarity is between 0.3 and 0.7. Can the result really look like that? On the Internet I found various ways of measuring this and don't know whether to use one of them or the approach I described. I am quite desperate.
Thank you so much for your advice!
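For reference, here is a sketch of the two quantities as described in the question; tfidf, labels, and centroids are assumed to come from your own vectorisation and k-means run, so treat it as an illustration rather than a drop-in implementation.

```python
# Intra-cluster similarity: mean cosine similarity of each document to its own
# cluster centroid.  Inter-cluster similarity: mean cosine similarity over all
# pairs of centroids.
import numpy as np
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

def intra_similarity(tfidf, labels, centroids):
    """Mean similarity of documents to their own cluster centroid."""
    sims = []
    for k, centroid in enumerate(centroids):
        members = tfidf[labels == k]
        if members.shape[0]:
            sims.append(cosine_similarity(members, centroid.reshape(1, -1)).mean())
    return float(np.mean(sims))

def inter_similarity(centroids):
    """Mean similarity over all pairs of cluster centroids."""
    pair_sims = [cosine_similarity(centroids[i].reshape(1, -1),
                                   centroids[j].reshape(1, -1))[0, 0]
                 for i, j in combinations(range(len(centroids)), 2)]
    return float(np.mean(pair_sims))
```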
Using k-means with anything but squared euclidean is risky. It may stop converging, as the convergence proof relies on both the mean and the distance assignment optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
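As a side note (not part of the answer above): a common practical compromise is to L2-normalise the TF-IDF vectors, since for unit vectors the squared Euclidean distance equals 2 − 2·cos, so standard k-means on the normalised data behaves much like clustering by cosine similarity while keeping its convergence guarantee. A sketch with scikit-learn:

```python
# For unit vectors a and b, ||a - b||^2 = 2 - 2*cos(a, b), so Euclidean
# k-means on L2-normalised rows is closely related to cosine-based clustering.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tfidf = rng.random((100, 50))            # stand-in for a real TF-IDF matrix

unit = normalize(tfidf)                  # rows now have unit L2 norm
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(unit)

a, b = unit[0], unit[1]
print(np.linalg.norm(a - b) ** 2, 2 - 2 * (a @ b))   # equal up to rounding
```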

Resources