Suppose we are trying to measure similarity between two very similar documents.
Document A: "a b c d"
Document B: "a b c e"
This corresponds to a term-frequency matrix
a b c d e
A 1 1 1 1 0
B 1 1 1 0 1
where the cosine similarity on the raw vectors is the dot product of the two vectors A and B, divided by the product of their magnitudes:
3/4 = (1*1 + 1*1 + 1*1 + 1*0 + 1*0) / (sqrt(4) * sqrt(4)).
But when we apply an inverse document frequency transformation by multiplying each term's frequency by log(N / df_i), where N is the number of documents in the matrix (here, 2) and df_i is the number of documents in which term i appears, we get the tf-idf matrix
a b c d e
A: 0 0 0 log(2) 0
B: 0 0 0 0 log(2)
Since "a" appears in both documents, it has an inverse-document-frequency value of 0. This is the same for "b" and "c". Meanwhile, "d" is in document A, but not in document B, so it is multiplied by log(2/1). "e" is in document B, but not in document A, so it is also multiplied by log(2/1).
The cosine similarity between these two vectors is 0, suggesting the two are totally different documents. Obviously, this is incorrect. For these two documents to be considered similar to each other using tf-idf weightings, we would need a third document C in the matrix which is vastly different from documents A and B.
Thus, I am wondering whether and/or why we would use tf-idf weightings in combination with a cosine similarity metric to compare highly similar documents. None of the tutorials or StackOverflow questions I've read have been able to answer this question.
This post discusses similar failings with tf-idf weights using cosine similarities, but offers no guidance on what to do about them.
EDIT: as it turns out, the guidance I was looking for was in the comments of that blog post. It recommends using the formula
1 + log(N / n_i + 1)
as the inverse document frequency transformation instead. This would keep the weights of terms which are in every document close to their original weights, while inflating the weights of terms which are not present in a lot of documents by a greater degree. Interesting that this formula is not more prominently found in posts about tf-idf.
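For concreteness, here is a minimal Python sketch (standard library only) that reproduces the numbers above: raw-count cosine of 0.75, standard-idf cosine of 0, and a smoothed-idf variant (one reading of the formula above) that keeps the two documents similar.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Term-frequency vectors over the vocabulary [a, b, c, d, e]
A = [1, 1, 1, 1, 0]
B = [1, 1, 1, 0, 1]
N = 2                   # number of documents
df = [2, 2, 2, 1, 1]    # document frequency of each term

print(cosine(A, B))     # raw counts: 0.75

# Standard idf: terms appearing in every document get weight 0, so the two
# vectors no longer share any non-zero component and the similarity collapses.
idf = [math.log(N / d) for d in df]
print(cosine([a * w for a, w in zip(A, idf)],
             [b * w for b, w in zip(B, idf)]))    # 0.0

# Smoothed idf, reading the formula above as 1 + log(N / (n_i + 1)): terms in
# every document keep a weight near 1, so the shared terms still contribute.
idf_smooth = [1 + math.log(N / (d + 1)) for d in df]
print(cosine([a * w for a, w in zip(A, idf_smooth)],
             [b * w for b, w in zip(B, idf_smooth)]))    # roughly 0.5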
Since "a" appears in both documents, it has an inverse-document-frequency value of 0
This is where you have made an error in using inverse document frequency (idf). Idf is meant to be computed over a large collection of documents (not just across two documents), the purpose being to be able to predict the importance of term overlaps in document pairs.
You would expect common terms, such as 'the', 'a', etc., to overlap across all document pairs. Should that contribute anything to your similarity score? No.
That is precisely why the vector components are multiplied by the idf factor: to dampen or boost a particular term overlap (a component of the form a_i*b_i added to the numerator of the cosine-similarity sum).
Now consider you have a collection on computer science journals. Do you believe that an overlap of terms such as 'computer' and 'science' across a document pair is considered to be important? - No.
And this is indeed what happens, because the idf of these terms would be considerably low in this collection.
What do you think will happen if you extend the collection to scientific articles of any discipline? In that collection, the idf value of the word 'computer' will no longer be low. And that makes sense because in this general collection, you would like to think that two documents are similar enough if they are on the same topic - computer science.
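A toy numeric illustration of that effect (the collection sizes and document frequencies below are made-up numbers):

import math

# Hypothetical collection of 10,000 computer-science papers: 'computer' appears in
# almost every document, so its idf is near zero and an overlap on it barely counts.
print(math.log(10_000 / 9_500))    # ~0.05

# Hypothetical general scientific collection of 100,000 papers: only a fraction
# mention 'computer', so the same overlap now carries real weight.
print(math.log(100_000 / 8_000))   # ~2.5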
I want to create a dynamic array that returns X values based on given probabilities. For instance:
Imagine this is a gift box and you can open the box N times. What I want is to have N random results. For example, I want to randomly get 5 of these two rarities, but based on their chances.
I have this following formula for now:
=index(A2:A3,randarray(5,1,1,rows(A2:A3),1)). And this is the output I get:
The problem here is that I have a dynamic array with the 5 results BUT NOT BASED ON THE PROBABILITIES.
How can I add probabilities to the array?
Here is how you could generate a random outcome with defined probabilities for the entries (Google Sheets solution, not sure about Excel):
=ARRAYFORMULA(
VLOOKUP(
RANDARRAY(H1, 1),
{
{0; OFFSET(C2:C,,, COUNTA(C2:C) - 1)},
OFFSET(A2:A,,, COUNTA(C2:C))
},
2
)
)
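If it helps to see the same idea outside a spreadsheet, here is a minimal Python sketch of what the VLOOKUP against a cumulative-probability column is doing (the rarity names and probabilities are made-up placeholders):

import bisect
import random

# Hypothetical rarity table: names and their probabilities (must sum to 1).
names = ["Normal", "Rare"]
probs = [0.9, 0.1]

# Cumulative lower bounds: [0.0, 0.9]. A uniform random number in [0, 1) is
# mapped to the last bound it reaches -- the same trick as the {0; cumulative
# probabilities} lookup table in the formula above.
bounds = [0.0]
for p in probs[:-1]:
    bounds.append(bounds[-1] + p)

def draw(n):
    """Return n independent draws according to the probabilities."""
    return [names[bisect.bisect_right(bounds, random.random()) - 1] for _ in range(n)]

print(draw(5))   # e.g. ['Normal', 'Normal', 'Rare', 'Normal', 'Normal']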
This whole subject of random selection is treated very thoroughly in Donald Knuth's series of books, The Art of Computer Programming, Vol. 2, "Seminumerical Algorithms". In that book he presents an algorithm for selecting exactly X out of N items in a list using pseudo-random numbers. What you may not have considered is that after you have chosen your first item, the probability array has changed to (X-1)/(N-1) if your first outcome was "Normal" or X/(N-1) if your first outcome was "Rare". This means you'll want to keep track of some running totals based on your prior outcomes so that your probabilities are dynamically updated with each pick. You can do this with formulas, but I'm not certain how the back-reference will perform inside an array formula; Microsoft's dynamic array documentation indicates that such internal array references are considered "circular" and are prohibited.
In any case, trying to extend this to 3+ outcomes is very problematic. In order to implement that algorithm with 3 choices (X + Y + Z = N picks) you would need to break this up into one random number for an X or not X choice and then a second random number for a Y or not Y choice. This becomes a recursive algorithm, beyond Excel's ability to cope in formulas.
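For reference, a small Python sketch of that selection-sampling idea (exactly X picks of one outcome out of N total, with the probability updated after every draw; the labels are placeholders):

import random

def pick_exact(x, n, labels=("Rare", "Normal")):
    """Knuth-style selection sampling: exactly x picks of labels[0] out of n.
    After each draw the chance is updated to remaining_x / remaining_n."""
    outcomes = []
    remaining_x, remaining_n = x, n
    for _ in range(n):
        if random.random() < remaining_x / remaining_n:
            outcomes.append(labels[0])
            remaining_x -= 1
        else:
            outcomes.append(labels[1])
        remaining_n -= 1
    return outcomes

print(pick_exact(2, 5))   # e.g. ['Normal', 'Rare', 'Normal', 'Normal', 'Rare']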
I have two documents, for example:
Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}
And I also know the similarity (correlation) of each pair of words, e.g.:
Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1
What is the best way to measure the similarity of the two documents?
It seems that the traditional Jaccard distance and cosine distance are not a good metric in this situation.
I like a book by Peter Christen on this issue.
Here he describes a Monge-Elkan similarity measure between two sets of strings.
For each word in the first set you find the most similar word in the second set, sum those maximum similarities, and divide the sum by the number of elements in the first set.
You can see its description on page 30 here.
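A minimal Python sketch of that Monge-Elkan style measure, assuming you already have a pairwise word-similarity function available (stubbed here with a made-up lookup table):

def monge_elkan(doc1, doc2, sim):
    """For each word in doc1 take the best similarity against any word in doc2,
    then average over the words of doc1."""
    if not doc1:
        return 0.0
    return sum(max(sim(w1, w2) for w2 in doc2) for w1 in doc1) / len(doc1)

# Made-up similarity table for illustration; unlisted pairs score 0 (or 1 if identical).
pair_sims = {("python", "python"): 1.0, ("python", "pandas"): 0.8, ("numpy", "R"): 0.1}
def sim(w1, w2):
    return pair_sims.get((w1, w2), pair_sims.get((w2, w1), 1.0 if w1 == w2 else 0.0))

doc1 = ["python", "numpy", "machine learning"]
doc2 = ["python", "pandas", "tensorflow", "svm", "regression", "R"]
print(monge_elkan(doc1, doc2, sim))   # (1.0 + 0.1 + 0.0) / 3, roughly 0.37

Note that the measure is asymmetric (doc1 against doc2 is not the same as doc2 against doc1); a common trick is to average the two directions.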
What is the actual formula to compute sentiment using a sentiment-rated lexicon? The lexicon I am using contains ratings in the range -5 to 5. I want to compute sentiment for individual sentences. Should I compute the average of all sentiment-rated words in a sentence, or just sum them up?
There are several methods for computing an index from scored sentiment components of sentences. Each is based on comparing positive and negative words, and each has advantages and disadvantages.
For your scale, the central tendency of the scored words would be a fair measure, where the denominator is the number of scored words. This is a form of the "relative proportional difference" measure described below. You would probably not want to divide the total of the sentiment words' scores by all words, since that makes each sentence's measure strongly affected by non-sentiment terms.
If you do not believe that the 11-point rating you describe is accurate, you could just classify each word as positive or negative depending on its sign. Then you could apply the following methods,
where each P and N refer to the counts of the Positive and Negative coded sentiment words, and O is the count of all other words (so that the total number of words = P + N + O).
Absolute Proportional Difference. Bounds: [0,1]
Sentiment = (P − N) / (P + N + O)
Disadvantage: A sentence's score is affected by non-sentiment-related content.
Relative Proportional Difference. Bounds: [-1, 1]
Sentiment = (P − N) / (P + N)
Disadvantage: A sentence's score may tend to cluster very strongly near the scale endpoints (because sentences may contain content that is primarily or exclusively positive or negative).
Logit scale. Bounds: [-infinity, +infinity]
Sentiment = log(P + 0.5) - log(N + 0.5)
This tends to have the smoothest properties and is symmetric around zero. The 0.5 is a smoother to prevent log(0).
For details, please see William Lowe, Kenneth Benoit, Slava Mikhaylov, and Michael Laver. (2011) "Scaling Policy Preferences from Coded Political Texts." Legislative Studies Quarterly 36(1, Feb): 123-155, where we compare their properties for measuring right-left ideology, but everything we discuss also applies to positive-negative sentiment.
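A small Python sketch of the three indices above, given a sentence already reduced to positive, negative, and other word counts (the counts are made up for illustration):

import math

def sentiment_indices(P, N, O):
    """Compute the three indices described above from word counts."""
    total = P + N + O
    return {
        "absolute_proportional": (P - N) / total,                     # uses all words
        "relative_proportional": (P - N) / (P + N) if P + N else 0.0,
        "logit": math.log(P + 0.5) - math.log(N + 0.5),               # 0.5 avoids log(0)
    }

# Example: 3 positive words, 1 negative word, 6 other words in the sentence.
print(sentiment_indices(P=3, N=1, O=6))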
You can use R for sentiment computation. Here is a link you can refer to:
https://sites.google.com/site/miningtwitter/questions/sentiment/analysis
I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation to calculate the cosine similarity is as follows:
Cos(v1, v2) = cos(theta), where theta = (hamming distance / signature length) * pi = (h / b) * pi
Which means if the vectors are fully similar, then the hamming distance will be zero and the cosine value will be 1. But when the vectors are totally not similar, then the hamming distance will be equal to the signature length and so we have cos(pi) which will result in -1. Shouldn't the similarity be always between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, you want the value to be -1. I think what's confusing you is the nature of the representation: the other post is talking about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value for each dimension is non-negative (e.g., a word occurs in a document or not), resulting in a 0 to 1 range.
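To make the point concrete, a tiny Python check with arbitrary example vectors: opposite vectors give -1, while non-negative vectors (as in term-count representations) can never produce a negative cosine.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine([1, 0], [-1, 0]))       # opposite unit vectors: -1.0
print(cosine([1, 2, 0], [0, 1, 3]))  # non-negative vectors: always in [0, 1]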
We use libpuzzle ( http://www.pureftpd.org/project/libpuzzle/doc ) to compare 4 million images against each other for similarity.
It works quite well.
But rather than doing an image-vs-image compare using the libpuzzle functions, there is another method of comparing the images.
Here is some quick background:
Libpuzzle creates a rather small (544 bytes) hash of any given image. This hash can in turn be used to compare against other hashes using libpuzzles functions. There are a few APIs... PHP, C, etc etc... We are using the PHP API.
The other method of comparing the images is by creating vectors from the given hash, here is a paste from the docs:
Cut the vector in fixed-length words. For instance, let's consider the
following vector:
[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
With a word length (K) of 10, you can get the following words:
[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1
Then, index your vector with a compound index of (word + position).
Even with millions of images, K = 10 and N = 100 should be enough to
have very little entries sharing the same index.
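In code, the word-cutting step from the docs looks roughly like this (a Python sketch of the idea, not the actual libpuzzle API):

def vector_words(signature, k=10, n=100):
    """Cut a signature into overlapping fixed-length words, tagging each word
    with its position to form the compound (position + word) index keys."""
    words = []
    for pos in range(min(n, len(signature) - k + 1)):
        words.append(f"{pos}_{signature[pos:pos + k]}")
    return words

print(vector_words("abcdefghijklmnopqrstuvwxyz", k=10, n=3))
# ['0_abcdefghij', '1_bcdefghijk', '2_cdefghijkl']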
So, we have the vector method working. It actually works a bit better than the image-vs-image compare, since for the image-vs-image compare we use other data to reduce our sample size. What other data we use to reduce the sample size is a bit irrelevant and application specific, but with the vector method we would not have to do so; we could do a real test of each of the 4 million hashes against every other.
The issue we have is as follows:
With 4 million images, 100 vectors per image, this becomes 400 million rows. We have found MySQL tends to choke after about 60000 images (60000 x 100 = 6 million rows).
The query we use is as follows:
SELECT isw.itemid, COUNT(isw.word) as strength
FROM vectors isw
JOIN vectors isw_search ON isw.word = isw_search.word
WHERE isw_search.itemid = {ITEM ID TO COMPARE AGAINST ALL OTHER ENTRIES}
GROUP BY isw.itemid;
As mentioned, even with proper indexes, the above is quite slow when it comes to 400 million rows.
So, can anyone suggest any other technologies / algos to test these for similarity?
We are willing to give anything a shot.
Some things worth mentioning:
Hashes are binary.
Hashes are always the same length, 544 bytes.
The best we have been able to come up with is:
Convert image hash from binary to ascii.
Create vectors.
Create a string as follows: VECTOR1 VECTOR2 VECTOR3 etc etc.
Search using sphinx.
We have not yet tried the above, but this should probably yield somewhat better results than the MySQL query.
Any ideas? As mentioned, we are willing to install any new service (postgresql? hadoop?).
Final note: an outline of exactly how this vector + compare method works can be found in the question Libpuzzle Indexing millions of pictures?. We are in essence using the exact method provided by Jason (currently the last answer, awarded 200+ SO points).
Don't do this in a database; just use a simple file. Below I have shown a file with some of the words from the two vectors [abcdefghijklmnopqrst] (image 1) and [xxcdefghijklxxxxxxxx] (image 2):
<index> <image>
0abcdefghij 1
1bcdefghijk 1
2cdefghijkl 1
3defghijklm 1
4efghijklmn 1
...
...
0xxcdefghij 2
1xcdefghijk 2
2cdefghijkl 2
3defghijklx 2
4efghijklxx 2
...
Now sort the file:
<index> <image>
0abcdefghij 1
0xxcdefghij 2
1bcdefghijk 1
1xcdefghijk 2
2cdefghijkl 1
2cdefghijkl 2 <= the index is repeated, thus we have a match
3defghijklm 1
3defghijklx 2
4efghijklmn 1
4efghijklxx 2
When the file has been sorted, it's easy to find the records that have the same index. Write a small program that runs through the sorted list and finds the duplicates.
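For example, a small Python sketch of that scan over the sorted file (the file name and two-column format are assumptions based on the listing above):

import itertools

# Each line of the sorted file is "<index-word> <image-id>", as in the listing above.
with open("sorted_vectors.txt") as f:
    rows = [line.split() for line in f if line.strip()]

# Because the file is sorted by index word, all images sharing a word sit on
# adjacent lines; group by the word and report every group with 2+ images.
for word, group in itertools.groupby(rows, key=lambda r: r[0]):
    images = [image_id for _, image_id in group]
    if len(images) > 1:
        print(word, "matched by images:", images)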
I have opted to answer my own question, as we have found a solution that works quite well.
In the initial question, I mentioned we were thinking of doing this via Sphinx search.
Well, we went ahead and did it, and the results are much better than doing this via MySQL.
So, in essence, the process looks like this:
a) generate hash from image.
b) 'vectorize' this hash into 100 parts.
c) binhex (binary to hex) each of these vectors since they are in binary format.
d) store in sphinx search like so:
itemid | 0_vector0 1_vector1 2_vec... etc
e) search using sphinx search.
Initially, once we had this sphinxbase full of 4 million records, it would still take about 1 second per search.
We then enabled distributed indexing for this sphinxbase on 8 cores, and can now handle about 10+ searches per second. This is good enough for us.
One final step would be to distribute this sphinxbase over the multiple servers we have, further utilizing the unused CPU cycles we have available.
But for the time being, it's good enough. We add about 1000-2000 'items' per day, so searching through 'just the new ones' will happen quite quickly after we do the initial scan.