In the scikit-learn implementation of LDA, what is the difference between transform and decision_function? - scikit-learn

I am currently working on a project that uses Linear Discriminant Analysis to transform a high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels, and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition was that decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that the two functions give different AUC scores, which indicates that one is not a monotonic transformation of the other, as I would have thought.
The documentation states that transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone can help me understand the difference between these two.

LDA projects your multivariate data onto a 1D space. The projection is based on a linear combination of all your attributes (columns in X). The weights of the attributes are determined by maximizing the class separation. Subsequently, a threshold value in this 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space, x' = transform(X). decision_function(X) gives you, for each observation, a confidence score for the positive class; per the scikit-learn documentation, in the two-class case this is the log-likelihood ratio of the positive class, and its sign determines the predicted label.
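A minimal sketch (on made-up synthetic data) of what the two calls return for a binary problem:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data, purely illustrative.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(1, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)

# 1D coordinate of each observation along the discriminant axis.
projected = lda.transform(X)        # shape (100, 1)

# Per-observation confidence score for the positive class.
scores = lda.decision_function(X)   # shape (100,)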

Related

How can we use TFIDF vectors with multinomial naive bayes?

Say we have used the TFIDF transform to encode documents into continuous-valued features.
How would we now use this as input to a Naive Bayes classifier?
Bernoulli naive-bayes is out, because our features aren't binary anymore.
Seems like we can't use Multinomial naive-bayes either, because the values are continuous rather than categorical.
As an alternative, would it be appropriate to use gaussian naive bayes instead? Are TFIDF vectors likely to hold up well under the gaussian-distribution assumption?
The scikit-learn documentation for MultinomialNB suggests the following:
The multinomial Naive Bayes classifier is suitable for classification
with discrete features (e.g., word counts for text classification).
The multinomial distribution normally requires integer feature counts.
However, in practice, fractional counts such as tf-idf may also work.
Isn't it fundamentally impossible to use fractional values for MultinomialNB?
As I understand it, the likelihood function itself assumes that we are dealing with discrete counts (since it deals with counting/factorials).
How would TFIDF values even work with this formula?
Technically, you are right. The (traditional) multinomial Naive Bayes model considers a document D as a vocabulary-sized feature vector x, where each element xi is the count of term i in document D. By definition, this vector x then follows a multinomial distribution, leading to the characteristic classification function of MNB.
When using TF-IDF weights instead of term counts, our feature vectors are (most likely) no longer multinomially distributed, so the classification function is no longer theoretically well-founded. However, it turns out in practice that tf-idf weights instead of counts work (much) better.
How would TFIDF values even work with this formula?
In the exact same way, except that the feature vector x is now a vector of tf-idf weights and not counts.
You can also check out the sublinear tf-idf weighting scheme, implemented in scikit-learn's TfidfVectorizer (sublinear_tf=True). In my own research I found it to perform even better: it uses a logarithmically scaled version of the term frequency. The idea is that when a query term occurs 20 times in document a and 1 time in document b, document a should (probably) not be considered 20 times as important but more likely log(20) times as important.
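A small sketch (toy corpus, illustrative labels) of feeding tf-idf weights, with the sublinear term-frequency option, straight into MultinomialNB:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["spam spam offer", "meeting agenda notes", "free offer now"]
labels = [1, 0, 1]

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # 1 + log(tf) instead of raw counts
    MultinomialNB(),
)
model.fit(docs, labels)
print(model.predict(["offer meeting"]))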

Spark Naive Bayes Result accuracy (Spark ML 1.6.0) [duplicate]

I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough and there are about 300 training examples for each category.
All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that, when classifying a new document, very often the classifier assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called the "raw prediction", and when I look at it I can see negative numbers, but they have more or less comparable magnitude, so even the category with the high probability has a comparable raw prediction score. I am finding it difficult to interpret these scores.
Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document and the xi are its features, Naive Bayes returns:

argmax_{c in C} P(c|d) = argmax_{c in C} P(d|c) * P(c) / P(d)

Since P(d) is the same for all classes we can simplify this to

argmax_{c in C} P(d|c) * P(c)

where

P(d|c) = P(x1, x2, ..., xn | c)

Since we assume that the features are conditionally independent (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:

argmax_{c in C} P(c) * P(x1|c) * P(x2|c) * ... * P(xn|c)

The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid that, we use the following property:

log(a * b) = log(a) + log(b)

and replace the initial condition with:

argmax_{c in C} [ log P(c) + log P(x1|c) + log P(x2|c) + ... + log P(xn|c) ]
These are the values you get as the raw prediction. Since each element is negative (the logarithm of a value in (0, 1]), the whole expression is negative as well. As you discovered yourself, these values are further normalized: each is exponentiated so that the maximum becomes equal to 1, and then divided by the sum of the normalized values, giving the probabilities.
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly the same (ignoring possible numerical issues). If no other class gets a prediction close to one, it means that, given the evidence, this is a very strong prediction. So it is actually something you want to see.
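A small numpy sketch of that normalization, using made-up raw scores, which also shows why log-scores of comparable magnitude can still give one class a probability close to one:

import numpy as np

raw = np.array([-245.3, -252.1, -260.8])   # hypothetical raw prediction values
probs = np.exp(raw - raw.max())            # subtracting the max avoids underflow
probs /= probs.sum()                       # normalize so they sum to one
print(probs)                               # the first class dominates (~0.999)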

What's the difference between predict_proba and decision_function in scikit-learn?

I'm studying a scikit-learn example (Classifier comparison) and got confused with predict_proba and decision_function.
They plot the classification results by drawing the contours using either Z = clf.decision_function(), or Z = clf.predict_proba().
What is the difference between these two? Is it the case that each classification method has one of the two as its score?
Which one is more appropriate for interpreting the classification result, and how should I choose between the two?
The latter, predict_proba, is a method of a (soft) classifier that outputs the probability of the instance being in each of the classes.
The former, decision_function, finds the distance to the separating hyperplane. For example, an SVM classifier finds hyperplanes separating the space into regions associated with classification outcomes. This function, given a point, finds the distance to those separators.
I'd guess that predict_proba is more useful in your case in general; the other method is more specific to the algorithm.
Your example is
if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
so the code uses decision_function if it exists. In the SVM case, predict_proba is computed (in the binary case) using Platt scaling, which is both "expensive" and has "theoretical issues". That's why decision_function is used here (as #Ami said, this is the margin, i.e. the distance to the hyperplane, which is accessible without much further computation). In the SVM case, it is advised to use decision_function instead of predict_proba.
There are other classifiers with a decision_function, for example SGDClassifier: there, predict_proba is only available for certain loss functions, while decision_function is always available.
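A quick sketch (toy data) comparing the two scores on the same fitted SVC; probability=True is what triggers the Platt-scaling calibration mentioned above:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(probability=True, random_state=0).fit(X, y)

margins = clf.decision_function(X[:5])    # signed distance to the hyperplane
probas = clf.predict_proba(X[:5])[:, 1]   # calibrated P(y=1 | x)
print(margins, probas)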

python sklearn plotting classification results

I'm new to sklearn and want to interpret classification results. I'm confused about what the difference between a decision surface and a decision boundary is. I saw two examples showing the differences between classifiers:
1) http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#example-svm-plot-iris-py
2) http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
Both are used to show the differences between classifiers, but the first one uses predict and the second one uses predict_proba or decision_function, so I'm confused.
The decision surface and the decision boundary are the same thing. For example, suppose in a classification task you have 2 classes you want to predict, and each example is described by three dimensions (N=3), e.g. length, width, height. The decision boundary is then a hyperplane of dimension N-1. The logic here is that to separate points in N dimensions, you need an object of N-1 dimensions.
Both of your examples show decision boundaries/surfaces.
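A short sketch (toy 2D data so it can be plotted) of drawing a decision surface from predict on a grid, in the same spirit as the linked examples:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_features=2, n_redundant=0, random_state=0)
clf = SVC().fit(X, y)

# Evaluate the classifier on a dense grid covering the data.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                    # filled regions: decision surface
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")    # the training points
plt.show()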

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
1. set the feature weights for each feature when "training" the KNN learner;
2. learn what the optimal weight values are, with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it is turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than the others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is by multiplying coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
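A small sketch of that idea (the weight values are just illustrative): scale each column by its weight before fitting, and apply the same scaling to anything you predict on.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 1] > 0.5).astype(int)          # feature #2 is the one that matters

weights = np.array([1.0, 10.0, 1.0])     # make differences in feature #2 count 10x
knn = KNeighborsClassifier(n_neighbors=5).fit(X * weights, y)

print(knn.predict(rng.rand(5, 3) * weights))   # same scaling at prediction time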
In case your features are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question two is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.
