How to find the LDA decision boundary in scikit-learn

I am trying to draw the decision boundary obtained with the scikit-learn LDA classifier.
I understand that you can use the transform method to project multivariate data onto the first component (in the two-class case). How do I obtain the value on that first component that acts as the classification pivot, i.e. the value that serves as the decision boundary?
Thanks!

LDA performs a PCA-like eigendecomposition of $Cov_{within}^{-1} Cov_{between}$. The classification pivots are just the first $n-1$ eigenvectors of that matrix, where $n$ is the number of classes.
The eigenvector matrix is stored in lda.scalings_, so the first component is the first column of lda.scalings_. It defines the decision boundary in the two-class case, but not in the multi-class case.
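As a minimal sketch of finding that pivot value (on a hypothetical synthetic dataset, not the asker's data): for two classes, decision_function is an affine function of the 1D projection returned by transform, so the threshold on the first component is where a line fitted between the two crosses zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical two-class dataset standing in for the asker's data
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)

# First (and, for two classes, only) discriminant direction
w = lda.scalings_[:, 0]

# Position of each observation on the first component
proj = lda.transform(X).ravel()

# decision_function is an affine function of the projection for binary LDA,
# so fit scores = a * proj + b and solve a * t + b = 0 for the pivot t.
scores = lda.decision_function(X)
a, b = np.polyfit(proj, scores, 1)
threshold = -b / a
print("decision boundary on first component:", threshold)
```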

Related

In the scikit-learn implementation of LDA, what is the difference between transform and decision_function?

I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that these two functions give different AUC scores, which indicates that one is not a monotonic transformation of the other, as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.
LDA projects your multivariate data onto a 1D space. The projection is a linear combination of all your attributes (columns of X), with the weights chosen to maximize class separation. A threshold value in this 1D space is then chosen to give the best classification results. transform(X) gives you the position of each observation in this 1D space, x' = transform(X). decision_function(X) gives you the confidence score for the positive class, which for LDA corresponds to the log posterior odds log(P(y=1|x) / P(y=0|x)).
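A small sketch (on made-up data) of the relationship in the binary case: both outputs live on the same discriminant axis, so for a plain binary LDA they differ only by an affine map.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)

proj = lda.transform(X).ravel()      # position on the 1D discriminant axis
scores = lda.decision_function(X)    # log posterior odds log P(y=1|x) / P(y=0|x)

# The two differ only by scale and offset, so the correlation should be +/-1
# (the sign depends on the orientation of the projection axis).
print(np.corrcoef(proj, scores)[0, 1])
```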

Predicting correct cluster for unseen data using a trained K-Means model

I know that k-means will have to be retrained from scratch with new points, but I would still like to know whether there is any workaround to use a trained model to predict on new, unseen data.
I am using the k-means algorithm to cluster a medical corpus. I create a term-document matrix to represent the corpus and, before feeding the data to k-means, I perform truncated singular value decomposition (SVD) for dimensionality reduction. I have been wondering whether there is a way to cluster a new, unseen document without retraining the entire model.
To get the vector representation of a new document and predict its cluster with the trained model, I need to ensure it uses the same vocabulary as the trained model and preserves the same column order in the term-document matrix. That is feasible, since the documents have a similar vocabulary. But how do I get the SVD representation of this document? Here is where my understanding gets shaky, so correct me if I'm wrong: to perform SVD on this vector representation, I would need to append it to the original term-document matrix. If I append the new document to the original term-document matrix and perform SVD on it to get a representation with a limited number of features (100 in this case), I am not sure how things will change. Will the new features selected by the SVD correspond semantically to the original ones? If the corresponding features capture different concepts, it won't make sense to measure the distance of the new document from the cluster centroids (with 100 features).
Is there a way to use a trained kmeans model for new text data? Or any other better-suited clustering approach for this task?
Your problem isn't k-means: a simple nearest-neighbor classifier using the cluster means as reference points will do that job.
Your problem is the SVD, which is not stable; adding new data can give you entirely different results.
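A minimal sketch of the workaround (with a tiny made-up corpus and small dimensions in place of the 100 SVD components), showing that a fitted vectorizer → TruncatedSVD → KMeans pipeline can assign a cluster to a new document without refitting anything:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for the medical documents
corpus = [
    "patient presents with fever and persistent cough",
    "chest x ray shows signs of pneumonia",
    "mri scan of the left knee after injury",
    "fracture of the tibia confirmed by ct scan",
    "antibiotic treatment for bacterial infection",
    "physical therapy recommended for knee rehabilitation",
]

# Fit the whole chain once; in the real setting n_components would be 100.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipe.fit(corpus)

# A new document is transformed with the *fitted* vocabulary and SVD basis,
# then assigned to the nearest existing centroid -- no retraining.
new_doc = ["ct scan reveals a hairline fracture"]
print(pipe.predict(new_doc))
```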

How should one decide between a linear regression model and a non-linear regression model?

How should one decide between using a linear regression model or a non-linear regression model?
My goal is to predict Y.
In the case of a simple x and y dataset, I could easily decide which regression model to use by making a scatter plot.
In the multivariate case, with x1, x2, ..., xn and y, how can I decide which regression model to use? That is, how do I decide between a simple linear model and non-linear models such as quadratic, cubic, etc.?
Is there any technique, statistical approach, or graphical plot to help infer which regression model should be used? Please advise.
That is a pretty complex question.
Start visually: if the data are roughly normally distributed and satisfy the assumptions of the classical linear model, use a linear model. I normally begin by making a scatter-plot matrix to observe the relationships (see the sketch below). If it is obvious that a relationship is non-linear, use a non-linear model. Much of the time I simply inspect visually, assuming the number of factors is not too large; a clearly curved pairwise pattern, for example, points to a non-linear model.
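For instance, a quick pairwise visual check of that kind might look like this (made-up predictors x1–x3 and a response y; any curved panel suggests a non-linear term):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Made-up multivariate data: y depends linearly on x1 but quadratically on x2
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "x3": rng.normal(size=200),
})
df["y"] = 2.0 * df["x1"] + df["x2"] ** 2 + rng.normal(scale=0.5, size=200)

# Each off-diagonal panel is a pairwise scatter plot; curvature in the
# y-vs-x2 panel hints at a non-linear relationship.
scatter_matrix(df, figsize=(8, 8), diagonal="kde")
plt.show()
```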
However, if you want to use data mining (and computationally demanding methods), I suggest starting with stepwise regression. First set a model evaluation criterion, R^2 for example. Start with an empty model and sequentially add predictors, or combinations of them, until the evaluation criterion is "maximized". The catch is that adding a new predictor almost always increases R^2, which is a form of over-fitting.
The solution is to split the data into training and test sets. Build the model on the training set and evaluate the mean error on the test set; the best model is the one that minimizes the mean error on the test set (see the sketch below).
If your data are sparse, try incorporating ridge or lasso regression into the model evaluation.
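A rough sketch of that procedure (forward stepwise selection on made-up data, scored by mean squared error on a held-out test set rather than in-sample R^2):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_err = np.inf

# Greedy forward selection: at each step add the single predictor that most
# reduces the test-set error; stop when no addition improves it.
improved = True
while improved and remaining:
    improved = False
    best_j = None
    for j in remaining:
        cols = selected + [j]
        model = LinearRegression().fit(X_tr[:, cols], y_tr)
        err = mean_squared_error(y_te, model.predict(X_te[:, cols]))
        if err < best_err:
            best_err, best_j, improved = err, j, True
    if improved:
        selected.append(best_j)
        remaining.remove(best_j)

print("selected predictors:", selected, "test MSE:", round(best_err, 2))
```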
Again, this is a fairly complex question. The answer also depends on whether you are building a descriptive or an explanatory model.

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be built from the training data. However, this sounds like it's turning KNN into a binary-tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than the others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is to multiply coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights (see the sketch below).
In case your features are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
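A short sketch of that trick on made-up data (the weights here are arbitrary placeholders, not learned):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Hypothetical per-feature weights: feature #2 counts 10x as much as the rest.
w = np.ones(X.shape[1])
w[2] = 10.0

# Scaling each column by its weight before building the tree is equivalent
# to using a weighted Euclidean distance on the original data.
Xw = X * w

knn = KNeighborsClassifier(n_neighbors=5)
print("CV accuracy with weighted features:", cross_val_score(knn, Xw, y, cv=5).mean())
```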
The answer to question 2 is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts, for uncorrelated features, to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.
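A companion sketch of that rescaling (per-feature standardization before Euclidean kNN, i.e. a diagonal, label-free Mahalanobis-style distance):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Standardizing each feature before Euclidean kNN removes arbitrary scale
# differences between features (a per-feature, label-free rescaling).
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("CV accuracy with standardized features:", cross_val_score(pipe, X, y, cv=5).mean())
```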

Regarding Probability Estimates predicted by LIBSVM

I am attempting 3-class classification using an SVM classifier. How do we interpret the probability estimates predicted by LIBSVM? Are they based on the perpendicular distance of the instance from the maximum-margin hyperplane?
Kindly shed some light on the interpretation of the probability estimates predicted by the LIBSVM classifier. The parameters C and gamma are tuned first, and then probability estimates are produced by using the -b option during both training and testing.
A multiclass SVM is always decomposed into several binary classifiers (typically one-vs-one or one-vs-all; LIBSVM uses one-vs-one). Any binary SVM classifier's decision function outputs a (signed) distance to the separating hyperplane. In short, an SVM maps the input domain to a one-dimensional real number (the decision value); the predicted label is determined by the sign of the decision value. The most common technique to obtain probabilistic output from SVM models is so-called Platt scaling (described in a paper by the LIBSVM authors).
Is it based on the perpendicular distance of the instance from the maximum-margin hyperplane?
Yes. Any classifier that outputs such a one-dimensional real value can be post-processed to yield probabilities, by calibrating a logistic function on the decision values of the classifier. This is the exact same approach as in standard logistic regression.
SVM performs binary classification. In order to achieve multiclass classification, LIBSVM uses a one-vs-one decomposition. What you get when you invoke -b is the probability estimate associated with this technique, which you can find explained here.
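A minimal sketch of obtaining such calibrated probabilities from the LIBSVM backend through scikit-learn's SVC wrapper (probability=True plays the role of the -b option; the iris data here is just a stand-in 3-class problem):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                      # a 3-class problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True makes the underlying LIBSVM fit Platt-scaled probability
# models on top of the decision values (the equivalent of training with -b 1).
clf = SVC(C=1.0, gamma="scale", probability=True, random_state=0)
clf.fit(X_tr, y_tr)

print(clf.decision_function(X_te[:2]))   # signed distances per binary problem
print(clf.predict_proba(X_te[:2]))       # calibrated class probabilities
```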

Resources