Sklearn and gensim's TF-IDF implementation

Sklearn and gensim's TF-IDF implementation - scikit-learn

I've been trying to determine the similarity between a set of documents, and one of the methods I'm using is the cosine similarity with the results of the TF-IDF.
I tried to use both sklearn and gensim's implementations, which give me similar results, but my own implementation results in a different matrix.
After analyzing, I noticed that the their implementations are different from the ones I've studied and came across:
Sklearn and gensim use raw counts as the TF, and apply L2 norm
on the resulting vectors.
On the other side, the implementations I found will normalize the term count,
like
TF = term count / sum of all term counts in the document
My question is, what is the difference with their implementations? Do they give better results in the end, for clustering or other purposes?
EDIT(So the question is clearer):
What is the difference between normalizing the end result vs normalizing the term count at the beggining?

I ended up understanding why the normalization is done at the end of the tf-idf calculations instead of doing it on the term frequencies.
After searching around, I noticed they use L2 normalization in order to facilitate cosine similarity calculations.
So, instead of using the formula dot(vector1, vector2) / (norm(vector1) * norm(vector2)) to get the similarity between 2 vectors, we can use directly the result from the fit_transform function: tfidf * tfidf.T, without the need to normalize, since the norm for the vectors is already 1.
I tried to add normalization on the term frequency, but it just gives out the same results in the end, when normalizing the whole vectors, ending up being a waste of time.

With scikit-learn, you can set the normalization as desired when calling TfidfTransformer() by setting norm to either l1, l2, or none.
If you try this with none, you may get similar results to your own hand-rolled tf-idf implementation.
The normalization is typically used to reduce the effects of document length on a particular tf-idf weighting so that words appearing in short documents are treated on more equal footing to words appearing in much longer documents.

Related

In the scikit learn implementation of LDA what is the difference between transform and decision_function?

I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that those two functions give a different AUC score, and thus indicate that it is not a monotonic transformation as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.

LDA projects your multivariate data onto a 1D space. The projection is based on a linear combination of all your attributes (columns in X). The weights of each attribute are determined by maximizing the class separation. Subsequently, a threshold value in 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space x' = transform(X). decision_function(X) gives you the log-likelihood of an attribute being a positive class log(P(y=1|x')).

What's a good measure for classifying text documents?

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency is a measure that counts how many times the given word appeared in all documents, and document-frequency is a measure that counts how many documents the given word appeared.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (afters stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?
Another example, I have a new text article that says "deer are funny", then this one is a totally new article and similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?

A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.
P(C|D) = P(D|C) * P(D) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times does a term occur in a given class, e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor P(D), you can estimate it by counting how many labeled documents are given from each class, may be you have more sports articles than finance ones, which makes you believe that there is a higher likelihood of an unseen document to be classified into the sports category.
It is very easy to incorporate factors, such as term importance (idf), or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plugin probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of Naive Bayes classifier can be found in the Stanford NLP package, Weka and Scipy among many others.

It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (on word-level or character-level): co-occurence of several tokens is taken into account.
Bag of words + grammatic features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use embedding as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) is(are) best for your specific application - you have to do experiments.
Now you can build the ML model. Each of 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have special structure: representations of documents of one class should be close to each other in sense of Euclidean distance metric. This sometimes can be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning of these transformations is something people call metric learning, and in general it is an problem which is not yet solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. manhattan distance), or different approaches, not based on vector representations of texts. For example, you can try to calculate Levenstein distance between texts, based on count of number of operations needed to transform one text to another. Or you can calculate "word mover distance" - the sum of distances of word pairs with closest embeddings.
For clustering, basic options are K-means and DBScan. Both these models require your feature space have this Euclidean property.
For anomaly detection you can use density estimations, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. mixture of gaussian models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential language structure, estimating probabilitiy of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets) - this is called language models, and it is usually more efficient than bag-of-word assumption of Naive Bayes, which ignores word order. Several language models (one for each class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have resourses or willingness to do multiple experiments, I would recommend to choose one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.

I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop out stop-words. It is even probable that whether you include stop-words or not would not make such a difference: the Inverse Document Frequency measure automatically downweighs stop-words since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.

When I search on df-idf I find nothing.
tf-idf with cosine similarity is very accepted and common practice
Filtering out stop words does not break it. For common words idf gives them low weight anyway.
tf-idf is used by Lucene.
Don't get why you want to reinvent the wheel here.
Don't get why you think the sum of df idf is a similarity measure.
For classification do you have some predefined classes and sample documents to learn from? If so can use Naive Bayes. With tf-idf.
If you don't have predefined classes you can use k means clustering. With tf-idf.
It depend a lot on your knowledge of the corpus and classification objective. In like litigation support documents produced to you, you have and no knowledge of. In Enron they used names of raptors for a lot of the bad stuff and no way you would know that up front. k means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits it makes that very complex and the stem will not be the length of the word.

Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.

While in English a word alone may be enough, it isn't the case in some other more complex languages.
A word has many meanings, and many different uses cases. One text can talk about the same things while using fews to none matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following api can help. It is doable to create something similar with some dictionaries.
synonyms("complex")
function synonyms(me){
var url = 'https://api.datamuse.com/words?ml=' + me;
fetch(url).then(v => v.json()).then((function(v){
syn = JSON.stringify(v)
syn = JSON.parse(syn)
for(var k in syn){
document.body.innerHTML += "<span>"+syn[k].word+"</span> "
}
})
)
}
From there comparing arrays will give much more accuracy, much less false positive.

A sufficient solution, in a possibly similar task:
Use of a binary bag-of-word (BOW) approach for the vector representation (frequent words aren't higher weighted than seldom words), rather than a real TF approach
The embedding "word2vec" approach, is sensitive to sequence and distances effects. It might make - depending on your hyper-parameters - a difference between 'a hunter saw a fox' and 'a fox saw a jumping hunter' ... so you have to decide, if this means adding noise to your task - or, alternatively, to use it as an averaged vector only, over all of your text
Extract high within-sentence-correlation words ( e.g., by using variables- mean-normalized- cosine-similaritities )
Second Step: Use this list of high-correlated words, as a positive list, i.e. as new vocab for an new binary vectorizer
This isolated meaningful words for the 2nd step cosine comparisons - in my case, even for rather small amounts of training texts

How can we use TFIDF vectors with multinomial naive bayes?

Say we have used the TFIDF transform to encode documents into continuous-valued features.
How would we now use this as input to a Naive Bayes classifier?
Bernoulli naive-bayes is out, because our features aren't binary anymore.
Seems like we can't use Multinomial naive-bayes either, because the values are continuous rather than categorical.
As an alternative, would it be appropriate to use gaussian naive bayes instead? Are TFIDF vectors likely to hold up well under the gaussian-distribution assumption?
The sci-kit learn documentation for MultionomialNB suggests the following:
The multinomial Naive Bayes classifier is suitable for classification
with discrete features (e.g., word counts for text classification).
The multinomial distribution normally requires integer feature counts.
However, in practice, fractional counts such as tf-idf may also work.
Isn't it fundamentally impossible to use fractional values for MultinomialNB?
As I understand it, the likelihood function itself assumes that we are dealing with discrete-counts (since it deals with counting/factorials)
How would TFIDF values even work with this formula?

Technically, you are right. The (traditional) Multinomial N.B. model considers a document D as a vocabulary-sized feature vector x, where each element xi is the count of term i i document D. By definition, this vector x then follows a multinomial distribution, leading to the characteristic classification function of MNB.
When using TF-IDF weights instead of term counts, our feature vectors are (most likely) not following a multinomial distribution anymore, so the classification function is not theoretically well-founded anymore. However, it does turn out that tf-idf weights instead of counts work (much) better.
How would TFIDF values even work with this formula?
In the exact same way, except that the feature vector x is now a vector of tf-idf weights and not counts.
You can also check out the Sublinear tf-idf weighting scheme, implemented in sklearn tfidf-vectorizer. In my own research I found this one performing even better: it uses a logarithmic version of the term frequency. The idea is that when a query term occurs 20 times in doc. a and 1 time in doc. b, doc. a should (probably) not be considered 20 times as important but more likely log(20) times as important.

Do tf-idf weights affect the cosine similarity?

I'm clustering text documents. I'm using tf-idf and cosine similarity. However there's something I don't really understand even tho I'm using these measures. Do the tf-idf weights affect the similarity calculations between two documents?
Suppose I have these two documents:
1- High trees.
2- High trees High trees High trees High trees.
Then the similarity between the two documents will be 1, although the tf-idf vectors of the two documents are different. Where the second should normally have higher weights for the terms compared to the first document.
Suppose the weights for the two vectors are (just suppose):
v1(1.0, 1.0)
v2(5.0, 8.0)
calculating the cosine similarity gives 1.0.
Here is a sketch of two random vectors that share the same terms but with different weights.
There's an obvious angel between the vectors, so the weights should play a role!
This triggers the question, where do the tf/idf weights play a role in the similarity calculations? Because what I understood so far is that the similarity here only cares about the presence and absence of the terms.

First off, your calculations are flawed. The cosine similarity between (1, 1) and (5, 8) is
1*5 + 1*8 / ||(1, 1)|| * ||(5, 8)||
= 13 / (1.4142 * 9.434)
= .97
where ||x|| is the Euclidean norm of x.
Because what I understood so far is that the similarity here only cares about the presence and absence of the terms.
That's not true. Consider
d1 = "hello world"
d2 = "hello world hello"
with tf vectors (no idf here)
v1 = [1, 1]
v2 = [2, 1]
The cosine similarity is 0.95, not 1.
Idf can have a further effect. Suppose we add
d3 = "hello"
then df("hello") = 3 and df("world") = 2, and the tf-idf vectors for d1, d2 become
v1' = [ 1. , 1.28768207]
v2' = [ 2. , 1.28768207]
with a slightly smaller cosine similarity of 0.94.
(Tf-idf and cosine similarities computed with scikit-learn; other packages may give different numbers due to the different varieties of tf-idf in use.)

I think you are mixing two different concepts here.
Cosine similarity measures the angle between two different vectors in a Euclidean space, independently of how the weights have been calculated.
TF-IDF decides, for each term in a document and a given collection, the weights for each one of the components of a vector that can be used for cosine similarity (among other things).
I hope this helps.

see my reply to this question and also the question
Python: tf-idf-cosine: to find document similarity
Basically if you want to use both tf-idf and cosine similarity then you can get the tf-idf vector and apply cosine similarity to that to get final result. So here you are applying cosine similarity(in this case dot product of tf - idf vectors) onto the tf-idf scores.
The answer also had 3 tutorials which you can refer to. They explain how this can work. thankz.

Re-normalize feature vectors after feature selection

I've performed χ² feature selection on my training documents already transformed to TF*IDF feature vectors using sklearn.feature_extraction.text.TfidfVectorizer, which produces normalized vectors by default. However, after selecting the top-K most informative features, the vectors are no longer normalized due to the removing of dimensions (all vectors are now with length < 1).
Is it advisable to re-normalize the feature vectors after feature selection? I'm also not very clear of the main difference B/T normalization and scaling. Do they server similar purposes for learners such as SVC?
Thank you in advance for your kind answer!

This is actually a lot of questions in one. The main reason for doing normalization on tf-idf vectors is so that their dot products (used by SVMs in their decision function) are readily interpretable as cosine similarities, the mainstay of document vector comparisons in information retrieval. Normalization makes sure that
"hello world" -> [1 2]
"hello hello world world" -> [2 4]
become the same vector, so concatenating a document onto itself doesn't change the decision boundary and the similarity between these two documents is exactly one (although with sublinear scaling, sublinear_tf in the vectorizer constructor, this is no longer true).
The main reasons for doing scaling is to avoid numerical instability issues. Normalization takes care of most of those because features will already be in the range [0, 1]. (I think it also relates to regularization, but I don't use SVMs that often.)
As you've noticed, chi² "denormalizes" feature vectors, so to answer the original question: you can try renormalizing them. I did that when adding chi² feature selection to the scikit-learn document classification example, and it helped with some estimators and hurt with others. You can also try doing the chi² on unnormalized tf-idf vectors (in which case I recommend you try setting sublinear_tf) and do either scaling or normalization afterwards.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string