Why is cosine similarity with word embeddings not giving good output? It assigns high similarity values between a new document and most of the historical documents, even though the documents are not actually similar.
Cosine similarity measures how similar two vectors are based on the angle between them.
How do you construct the embedding of a document? word2vec only gives you embeddings of individual words.
Most people use tf-idf as a metric to rank documents.
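If you just need a quick document vector out of word2vec, one common baseline is to average the word vectors of the tokens in the document (a TF-IDF-weighted average is a popular variant). A minimal sketch, assuming a trained gensim model and a tokenized document (names are illustrative):

```python
import numpy as np

def document_vector(model, tokens):
    """Average the word2vec vectors of the in-vocabulary tokens.

    Falls back to a zero vector if none of the tokens are in the vocabulary.
    """
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
```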
According to several posts I found on Stack Overflow (for instance this one: Why does word2Vec use cosine similarity?), it is common practice to calculate the cosine similarity between two word vectors after we have trained a word2vec (either CBOW or Skip-gram) model. However, this seems a little odd to me, since the model is actually trained with the dot product as its similarity score. One piece of evidence for this is that the norms of the word vectors we get after training are actually meaningful. So why do people still use cosine similarity instead of the dot product when calculating the similarity between two words?
Cosine similarity and the dot product are both similarity measures, but the dot product is magnitude-sensitive while cosine similarity is not. Depending on the occurrence count of a word, it might have a large or small dot product with another word. We normally normalize our vectors to prevent this effect, so that all vectors have unit magnitude. But if your particular downstream task needs occurrence count as a feature, then the dot product might be the way to go; if you do not care about counts, you can simply calculate the cosine similarity, which normalizes them for you.
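To see the difference concretely, here is a small numpy sketch with toy vectors (purely illustrative): the dot product scales with the magnitudes, while cosine similarity divides them out, and normalizing first makes the two measures coincide.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

dot = a @ b                                          # magnitude-sensitive: 28.0
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # direction only: 1.0

# After unit-normalizing, the dot product equals the cosine similarity.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(a_unit @ b_unit, cos)
```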
I am working on a text classification use case. The text is basically the contents of legal documents, for example companies' annual reports, W9s, etc. There are 10 different categories and 500 documents in total, so 50 documents per category. The dataset therefore consists of 500 rows and 2 columns, the first column containing the text and the second column the target.
I have built a basic model using TF-IDF for my textual features. I have used Multinomial Naive Bayes, SVC, linear SGD, a multilayer perceptron, and random forest. These models give me an F1-score of approximately 70-75%.
I wanted to see if creating word embeddings would help me improve the accuracy. I trained word vectors using gensim Word2Vec and fed the word vectors to the same ML models as above, but I am getting a score of only about 30-35%. I have a very small dataset and a lot of categories; is that the problem? Is it the only reason, or is there something I am missing?
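For reference, this is roughly the pipeline I mean, simplified (averaging the word vectors per document; the variable names and the classifier are just placeholders):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# texts: list of token lists, y: list of category labels (not shown here)
model = Word2Vec(sentences=texts, vector_size=100, window=5, min_count=2)

def doc_vector(tokens):
    # Average the vectors of the tokens that made it into the vocabulary.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(t) for t in texts])
print(cross_val_score(SVC(), X, y, scoring="f1_macro", cv=5).mean())
```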
This question is very similar to this one: Systematic threshold for cosine similarity with TF-IDF weights
How should I cut off tiny similarities? In the link above, the answer gives a technique based on averages, but that could still return documents even if all the similarities are very small, for example < 0.01.
How do I know whether a given query document is so unrelated to the corpus that no other document should be considered similar to it? Is there a systematic way to define a cutoff value for this?
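To make the question concrete, the kind of cutoff logic I have in mind looks something like this (just a sketch; the absolute floor and the mean-plus-k-std rule are arbitrary choices, not a principled answer):

```python
import numpy as np

def related_documents(query_sims, corpus_sims, k=2.0, floor=0.05):
    """query_sims: cosine similarities of the query to each corpus document.
    corpus_sims: background pairwise similarities within the corpus.

    Keep documents whose similarity exceeds both an absolute floor and
    mean + k * std of the background distribution; an empty result means
    the query is treated as unrelated to the corpus.
    """
    threshold = max(floor, corpus_sims.mean() + k * corpus_sims.std())
    return np.where(query_sims > threshold)[0]
```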
I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts.
However, I do not understand why cosine is the correct measure of word similarity. Cosine similarity says that two vectors point in the same direction, but they could have different magnitudes.
For example, cosine similarity makes sense when comparing bag-of-words vectors for documents: two documents might have different lengths but similar distributions of words.
Why not, say, Euclidean distance?
Can anyone explain why cosine similarity works for word2vec?
Those two distance metrics are probably strongly correlated so it might not matter all that much which one you use. As you point out, cosine distance means we don't have to worry about the length of the vectors at all.
This paper indicates that there is a relationship between the frequency of a word and the length of its word2vec vector: http://arxiv.org/pdf/1508.02297v1.pdf
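One way to make the "strongly correlated" intuition precise: once the vectors are length-normalized, Euclidean distance is a monotone function of cosine similarity, because $\|a - b\|^2 = 2(1 - \cos(a, b))$ for unit vectors, so both measures produce the same nearest-neighbour ranking. A quick numpy check of that identity (random vectors, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)
a /= np.linalg.norm(a)   # unit-normalize
b /= np.linalg.norm(b)

cos = a @ b
euclid_sq = np.sum((a - b) ** 2)
assert np.isclose(euclid_sq, 2 * (1 - cos))   # identity holds on unit vectors
```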
Cosine similarity of two n-dimensional vectors A and B is defined as

$$\text{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}},$$

which is simply the cosine of the angle between A and B, while the Euclidean distance is defined as

$$d(A, B) = \|A - B\| = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}.$$
Now think about the distance between two random elements of the vector space. The cosine distance is bounded, because the cosine itself lies in [-1, 1]; the Euclidean distance, on the other hand, can be any non-negative value.
Moreover, as the dimension n gets bigger, the angle between two randomly chosen directions gets closer and closer to 90° (so their cosine similarity goes to 0), whereas two random points in the unit cube of $\mathbb{R}^n$ have an expected Euclidean distance of roughly $0.41\sqrt{n}$ (source).
TL;DR
cosine distance is better for vectors in a high-dimensional space because of the curse of dimensionality. (I'm not absolutely sure about it, though)
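A quick simulation of both claims (using zero-mean Gaussian vectors for the angle statement and uniform points in the unit cube for the distance statement; purely illustrative): as n grows, the cosine of the angle between random directions concentrates around 0, i.e. the angle approaches 90°, while the Euclidean distance between random points in the unit cube keeps growing roughly like $0.41\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 10, 100, 1000):
    # Angle between random directions: cosine concentrates around 0 as n grows.
    g1, g2 = rng.normal(size=(2, 10000, n))
    cos = np.sum(g1 * g2, axis=1) / (np.linalg.norm(g1, axis=1) * np.linalg.norm(g2, axis=1))
    # Distance between random points in the unit cube: grows like ~0.41 * sqrt(n).
    u1, u2 = rng.uniform(size=(2, 10000, n))
    euclid = np.linalg.norm(u1 - u2, axis=1).mean()
    print(n, round(np.abs(cos).mean(), 3), round(euclid, 2), round(0.41 * np.sqrt(n), 2))
```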
I know that the original cosine similarity, when applied to representations of two documents by frequencies of specific words, does not measure word order. I now see a whole bunch of papers applying cosine similarity to representations of pairs of sentences formed from word vectors. I assume they flatten the (number of tokens) x (embedding length) matrix of each sentence into one long vector whose length is (number of tokens) x (embedding length) of the original sentence. So "I love you" and "you love me" (with "me" normalized to "I") would not yield 1 under this new way of applying cosine similarity, whereas the old way would yield 1. Am I correct? Thanks for any enlightening answer.
Exactly!
"I love you" and "you love me(normalized to "I") would not yield 1 in this new way of applying cosine similarity whereas the old way would yield 1.
This modification is made:
A slight modification is made for sentence representation. Instead of using indexing words from a text collection, a set of words that appear in the sentence pair is used as a feature set. This is done to reduce the degree of data sparseness in sentence representation. The standard TF-IDF similarity (simTFIDF,vector) is defined as cosine similarity between vector representation of two sentences.
You can read more here.
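A toy sketch of the point in the question, with made-up 3-dimensional "embeddings" (everything here is illustrative): under the old bag-of-words representation the two sentences get identical vectors, so the cosine is 1, while flattening the ordered tokens x embedding matrix makes word order matter and the cosine drops below 1.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings; "me" has already been normalized to "I".
emb = {"I":    np.array([0.1, 0.9, 0.2]),
       "love": np.array([0.8, 0.1, 0.3]),
       "you":  np.array([0.4, 0.4, 0.7])}

s1 = ["I", "love", "you"]
s2 = ["you", "love", "I"]

# Old way: bag-of-words counts over a shared vocabulary -> identical vectors.
vocab = sorted(emb)
bow = lambda s: np.array([s.count(w) for w in vocab], dtype=float)
print(cosine(bow(s1), bow(s2)))   # 1.0

# New way: flatten the ordered tokens x embedding matrix -> order matters.
flat = lambda s: np.concatenate([emb[w] for w in s])
print(cosine(flat(s1), flat(s2)))   # ~0.76, i.e. below 1
```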