I'm clustering text documents. I'm using tf-idf and cosine similarity. However, there's something I don't really understand even though I'm using these measures: do the tf-idf weights affect the similarity calculation between two documents?
Suppose I have these two documents:
1- High trees.
2- High trees High trees High trees High trees.
Then the similarity between the two documents will be 1, although the tf-idf vectors of the two documents are different: the second should normally have higher weights for the terms than the first document.
Suppose the weights for the two vectors are (just suppose):
v1(1.0, 1.0)
v2(5.0, 8.0)
calculating the cosine similarity gives 1.0.
Here is a sketch of two random vectors that share the same terms but with different weights.
There's an obvious angle between the vectors, so the weights should play a role!
This raises the question: where do the tf-idf weights play a role in the similarity calculations? Because what I understood so far is that the similarity here only cares about the presence and absence of the terms.
First off, your calculations are flawed. The cosine similarity between (1, 1) and (5, 8) is
(1*5 + 1*8) / (||(1, 1)|| * ||(5, 8)||)
= 13 / (1.4142 * 9.434)
≈ 0.97
where ||x|| is the Euclidean norm of x.
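A quick check with numpy (just a sketch of the same calculation):

import numpy as np

v1 = np.array([1.0, 1.0])
v2 = np.array([5.0, 8.0])

# dot product divided by the product of the Euclidean norms
cos = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(cos, 2))  # 0.97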
"Because what I understood so far is that the similarity here only cares about the presence and absence of the terms."
That's not true. Consider
d1 = "hello world"
d2 = "hello world hello"
with tf vectors (no idf here)
v1 = [1, 1]
v2 = [2, 1]
The cosine similarity is 0.95, not 1.
Idf can have a further effect. Suppose we add
d3 = "hello"
then df("hello") = 3 and df("world") = 2, and the tf-idf vectors for d1, d2 become
v1' = [ 1. , 1.28768207]
v2' = [ 2. , 1.28768207]
with a slightly smaller cosine similarity of 0.94.
(Tf-idf and cosine similarities computed with scikit-learn; other packages may give different numbers due to the different varieties of tf-idf in use.)
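For reference, here is a minimal sketch of how those numbers can be reproduced with scikit-learn; norm=None is passed only so that the raw tf-idf weights are visible:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["hello world", "hello world hello", "hello"]

# plain term counts: cosine similarity of d1 and d2 is ~0.95
counts = CountVectorizer().fit_transform(docs)
print(cosine_similarity(counts[0], counts[1]))   # ~0.95

# tf-idf without L2 normalization, so the raw weights can be inspected
tfidf = TfidfVectorizer(norm=None).fit_transform(docs)
print(tfidf[:2].toarray())                       # [[1. 1.2877], [2. 1.2877]]
print(cosine_similarity(tfidf[0], tfidf[1]))     # ~0.94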
I think you are mixing two different concepts here.
Cosine similarity measures the angle between two different vectors in a Euclidean space, independently of how the weights have been calculated.
TF-IDF determines, for each term in a document within a given collection, the weight of the corresponding component of the vector, and that vector can then be used for cosine similarity (among other things).
I hope this helps.
See my reply to this question, and also the question
Python: tf-idf-cosine: to find document similarity
Basically, if you want to use both tf-idf and cosine similarity, you can compute the tf-idf vectors and then apply cosine similarity to them to get the final result. So here you are applying cosine similarity (in this case, the dot product of the tf-idf vectors) to the tf-idf scores.
That answer also links to three tutorials which you can refer to; they explain how this can work.
Related
I have two sets of short messages, and I want to compute the similarity between these two sets and identify whether they are talking about the same sub-topic based on their semantic similarity. I know how to use pairwise similarity; my problem is that I want to compute the overall similarity among all the sentences in the two sets, not just between 2 sentences. Is there a way to use tf-idf or word2vec/doc2vec with cosine similarity to calculate the overall score?
Basically what I did is take the vectors of each word in each sentence.
Then average those word vectors for each set and compute the cosine similarity between the two resulting vectors.
Of course, before you do that you need a trained word2vec model. doc2vec's similarity does the same thing, as internally it keeps a word2vec model.
So you have two options: train a doc2vec model and use its built-in similarity, or train a word2vec model and do the work yourself.
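A minimal sketch of the word2vec option, assuming gensim 4.x; the toy corpus and the two token sets are just placeholders:

import numpy as np
from gensim.models import Word2Vec

# toy training data; in practice you would train on a large corpus
sentences = [["high", "trees"], ["tall", "trees", "grow"], ["hello", "world"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

def average_vector(tokens):
    # average the vectors of the tokens that are in the vocabulary
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

v1 = average_vector(["high", "trees", "grow"])
v2 = average_vector(["tall", "trees"])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))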
InferSent can also help in finding semantic similarity.
I built two word embeddings (word2vec models) using gensim and saved them as word2vec1 and word2vec2 with the model.save(model_name) command, for two different corpora (the two corpora are somewhat similar; by similar I mean they are related, like part 1 and part 2 of a book). Suppose the top word (in terms of frequency or occurrence) for the two corpora is the same word (let's call it 'a').
How do I compute the degree of similarity (cosine similarity or otherwise) of the extracted top word (say 'a') between the two word2vec models? Will most_similar() work efficiently in this case?
I want to know to what degree the same word 'a' is similar across the two generated models.
Any idea is deeply appreciated.
You seem to have the wrong idea about word2vec. It doesn't provide one absolute vector for one word. It manages to find a representation for a word relative to other words. So, for the same corpus, if you run word2vec twice, you will get 2 different vectors for the same word. The meaning comes in when you compare it relative to other word vectors.
king - man will always be close (cosine-similarity-wise) to queen - woman, no matter how many times you train it, but the words will have different vectors after each training run.
In your case, since the 2 models are trained differently, comparing vectors of the same word is the same as comparing two random vectors. You should rather compare the relative relations. Maybe something like: model1.most_similar('dog') vs model2.most_similar('dog')
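As a rough sketch of comparing the relative relations (assuming two trained gensim models and a word they both contain; the topn of 10 is arbitrary):

# model1 and model2 are two separately trained gensim Word2Vec models
top1 = {word for word, _ in model1.wv.most_similar('dog', topn=10)}
top2 = {word for word, _ in model2.wv.most_similar('dog', topn=10)}

# overlap of the nearest-neighbour sets as a crude agreement score
print(len(top1 & top2) / 10)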
However, to answer your question, if you wanted to compare the 2 vectors, you could do it as below. But the results will be meaningless.
Just take the vectors from each model and manually calculate cosine similarity.
import numpy as np

# model1 and model2 are the two trained gensim word2vec models
vec1 = model1.wv['computer']
vec2 = model2.wv['computer']
print(np.sum(vec1*vec2)/(np.linalg.norm(vec1)*np.linalg.norm(vec2)))
I've been trying to determine the similarity between a set of documents, and one of the methods I'm using is cosine similarity on the TF-IDF results.
I tried to use both sklearn and gensim's implementations, which give me similar results, but my own implementation results in a different matrix.
After analyzing them, I noticed that their implementations differ from the ones I've studied and come across:
Sklearn and gensim use raw counts as the TF and apply an L2 norm on the resulting vectors.
On the other hand, the implementations I found normalize the term count, like
TF = term count / sum of all term counts in the document
My question is, what is the difference between these implementations? Do they give better results in the end, for clustering or other purposes?
EDIT (to make the question clearer):
What is the difference between normalizing the end result vs. normalizing the term count at the beginning?
I ended up understanding why the normalization is done at the end of the tf-idf calculations instead of doing it on the term frequencies.
After searching around, I noticed they use L2 normalization in order to facilitate cosine similarity calculations.
So, instead of using the formula dot(vector1, vector2) / (norm(vector1) * norm(vector2)) to get the similarity between 2 vectors, we can directly use the result of the fit_transform function: tfidf * tfidf.T, without the need to normalize, since the norm of the vectors is already 1.
I tried adding normalization to the term frequency as well, but it just gives the same results in the end when the whole vectors are normalized, so it ends up being a waste of time.
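A small sketch of that point with scikit-learn's defaults (norm='l2' is already applied by fit_transform):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["high trees", "high trees high trees", "hello world trees"]

tfidf = TfidfVectorizer().fit_transform(docs)   # rows already have norm 1

# plain dot products of the normalized rows ...
sims_dot = (tfidf * tfidf.T).toarray()
# ... match the explicit cosine similarity
sims_cos = cosine_similarity(tfidf)
print(np.allclose(sims_dot, sims_cos))          # True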
With scikit-learn, you can set the normalization as desired when calling TfidfTransformer() by setting norm to either 'l1', 'l2', or None.
If you try this with None, you may get results similar to your own hand-rolled tf-idf implementation.
The normalization is typically used to reduce the effects of document length on a particular tf-idf weighting so that words appearing in short documents are treated on more equal footing to words appearing in much longer documents.
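For illustration, a minimal sketch of switching the normalization off (the value to pass for no normalization is Python's None):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["high trees", "high trees high trees high trees high trees"]
counts = CountVectorizer().fit_transform(docs)

# default: each tf-idf row is L2-normalized, so its norm is 1
print(TfidfTransformer(norm='l2').fit_transform(counts).toarray())

# normalization disabled: raw tf-idf weights remain
print(TfidfTransformer(norm=None).fit_transform(counts).toarray())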
I have a set of short documents (1 or 2 paragraphs each). I have used three different approaches for document similarity:
- simple cosine similarity on the tf-idf matrix
- applying LDA on the whole corpus, then using the LDA model to create a vector for each document, then applying cosine similarity
- applying LSA on the whole corpus, then using the LSA model to create a vector for each document, then applying cosine similarity
Based on my experiments, I am getting better results with simple cosine similarity on the tf-idf matrix, without any LDA or LSA. Based on what I have read, LDA or LSA should improve the results, but in my case they do not!
Does anyone have an idea why LDA or LSA give worse results?
Both LDA and LSA, when trained for more than 1000 rounds, find similarity higher than 90% between some documents that are totally unrelated!
Is there any justification for that?
Thanks
I have used the LDA4j implementation and got better results than with TF-IDF, and similarly for LSI I have used the semantic-vector implementation. If you have your own implementation, share the model sketch. One more thing: you need to normalize the corpus for better results.
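What normalizing the corpus usually amounts to, as a rough sketch (the stop-word list here is only a placeholder):

import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}  # placeholder list

def normalize(doc):
    # lowercase, strip punctuation and numbers, and drop stop words
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The tall trees are growing in the forest."))  # ['tall', 'trees', 'growing', 'forest']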
In NLP, it's always the case that the dimension of the features is very high. For example, in one project at hand the dimension of the features is almost 20 thousand (p = 20,000), and each feature is a 0-1 integer indicating whether a specific word or bi-gram appears in a paper (one paper is a data point $x \in R^{p}$).
I know the redundancy among the features is huge, so dimension reduction is necessary. I have three questions:
1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to conduct dimension reduction? The matrix $X \in R^{n \times p}$ is so huge that both PCA (or SVD; truncated SVD is OK, but I don't think SVD is a good way to reduce the dimension of binary features) and Bag of Words (or K-means) are hard to conduct directly on $X$ (sure, it is sparse). I don't have a server, I just use my PC :-(.
2) How should I judge the similarity or distance between two data points? I think the Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?
3) If I want to use an SVM (or other kernel methods) to conduct classification, which kernel should I use?
Many Thanks!
1) You don't need dimensionality reduction. If you really want to, you can use an L1-penalized linear classifier to reduce to the most helpful features.
2) Cosine similarity is often used, or cosine similarity of the TFIDF rescaled vectors.
3) Linear SVMs work best with so many features.
There is a good tutorial on how to do classification like this in python here: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
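A rough sketch of that setup (tf-idf features plus a linear SVM; the L1 penalty doubles as feature selection, and the documents and labels below are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# placeholder data: papers as raw strings and their class labels
papers = ["deep parsing of noun phrases", "stochastic gradient descent converges", "syntax trees for parsing"]
labels = [0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram and bigram tf-idf features
    LinearSVC(penalty="l1", dual=False),  # L1 keeps only the most helpful features
)
clf.fit(papers, labels)
print(clf.predict(["parsing syntax trees"]))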