sklearn: AUC score for LinearSVC and OneSVM - scikit-learn

One option of the SVM classifier (SVC) is probability which is false by default. The documentation does not say what it does. Looking at libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC nor OneSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?

Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No, you need some kind of threshold in your model that you can change. The ROC curve is then drawn by changing this threshold. The point of the ROC curve being, of course, to see how well your model is reproducing the hypothesis by seeing how well it is ordering the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
using distance to the decision bondary, as I mentioned in my own question
using the bias term as your threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probabilities=True) then probabilities will be calculated for you in this manner, by using CV, which you can then use to draw the ROC curve. But as mentioned in the link I provide, it is much faster if you draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance in this particular case.

In order to calculate AUC, using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC does (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it's not a probability. I suppose you could scale the decision function to take values in the range [0,1], and compute AUC, however I'm not sure what statistical properties this will have; you certainly won't be able to use it to compare with ROC calculated using probabilities.

Related

What is the parameter which is varied when running sklearn.metrics.plot_roc_curve on a SVM?

I am confused by this example here: https://scikit-learn.org/stable/visualizations.html
If we plot the ROC curve for a Logistic Regression Classifier the ROC curve is parametrized by the threshold parameter. But a usual SVM spits out binary values instead of probabilities.
Consequently there should not be a threshold which can be varied to obtain an ROC curve.
But which parameter is then varied in the example above?
SVMs have a measure of confidence in their predictions using the distance from the separating hyperplane (before the kernel, if you're not doing a linear SVM). These are obviously not probabilities, but they do rank-order the data points, and so you can get an ROC curve. In sklearn, this is done via the decision_function method. (You can also set probability=True in the SVC to calibrate the decision function values into probability estimates.)
See this section of the User Guide for some of the details on the decision function.

Evaluating Gaussian Mixture model using a score metric?

I have 1D data (on column data). I used Gaussian Mixture Model (GMM) as a density estimation, using this implementation in Python: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. By relying on AIC/BIC criteron i was able to determine number of components. After i fit the GMM, i plotted kernel density estimation of original observation + that of sampled data drawn from GMM. the plot of original and sampled desnities are quiet similar( that is good). But, i would like some metrics to report how good is the fitted model.
g = GaussianMixture(n_components = 35)
data= df['x'].values.reshape(-1,1) # data taken from data frame (10,000 data pints)
clf= g.fit(data)# fit model
samples= clf.sample(10000)[0] # generate sample data points (same # as original data points)
I found score in the implementation, but not sure how to implememnt. Am i doing it wrong? or is there any better way to show how accuracy is the fitted model, apart from histogram or kernel densities plots?.
print(clf.score(data))
print(clf.score(samples))
You can use normalized_mutual_info_score, adjusted_rand_score or silhouette score to evaluate your clusters. All of these metrics are implemented under sklearn.metrics section.
EDIT: You can check this link for more detail explanations.
In a summary:
Adjusted Rand Index: measures the similarity of the two assignments.
Normalized Mutual Information: measures the agreement of the two assignments.
Silhouette Coefficient: measures how well-assigned each individual point is.
gmm.fit(x_vec)
pred = gmm.predict(x_vec)
print ("gmm: silhouttte: ", silhouette_score(x_vec, pred))
I would better use cross-validation and try to see the accuracy of the trained model.
Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the acurracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict
Toy example:
g = GaussianMixture(n_components = 35)
g.fit(train_data)# fit model
y_pred = g.predict(test_data)
EDIT:
There are several options to measure the performance of your unsupervised case. For GMM, which base on real probabilities, the most common are BIC and AIC. They are immediatly included in the scikit GMM class.

Does SciKit Have A InHouse Function That Tallies The Accuracy For Each Y Solution?

I have LinearSVC algorithm that predicts some data for stock. It has a 90% acc rating, but I think this might be due to the fact that some y's are far more likely than others. I want to see if there is a way to see if for each y I've defined, how accurately that y was predicted.
I haven't seen anything like this in the docs, but it just makes sense to have it.
If what your really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation or the probability calibration CalibratedClassifierCV using this documentation.
You can use a confusion matrix representation implemented in SciKit to generate an accuracy matrix between the predicted and real values of your classification problem for each individual attribute. The diagonal represents the raw accuracy, which can easily be converted to a percentage accuracy.

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand generally KNN does not require training but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like its turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others" it usually means difference in feature two is worth, say, 10x difference in other coords. Simple way to achive this is by multiplying coord #2 by its weight. So you put into the tree not the original coords but coords multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question to is called "metric learning" and currently not implemented in Scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler. Ideally you would want your metric to take into account the labels.

Regarding Probability Estimates predicted by LIBSVM

I am attempting 3 class classification by using SVM classifier. How do we interpret the probabililty estimates predicted by LIBSVM. Is it based on perpendicular distance of the instance from the maximal margin hyperplane?.
Kindly through some light on the interpretation of probability estimates predicted by LIBSVM classifier. Parameters C and gamma are first tuned and then probability estimates are outputted by using -b option with both training and testing.
Multiclass SVM is always decomposed into several binary classifiers (typically a set of one vs all classifiers). Any binary SVM classifier's decision function outputs a (signed) distance to the separating hyperplane. In short, an SVM maps the input domain to a one-dimensional real number (the decision value). The predicted label is determined by the sign of the decision value. The most common technique to obtain probabilistic output from SVM models is through so-called Platt scaling (paper of LIBSVM authors).
Is it based on perpendicular distance of the instance from the maximal margin hyperplane?
Yes. Any classifier that outputs such a one-dimensional real value can be post-processed to yield probabilities, by calibrating a logistic function on the decision values of the classifier. This is the exact same approach as in standard logistic regression.
SVM performs binary classification. In order to achieve multiclass classification libsvm performs what it's called one vs all. What you get when you invoke -bis the probability related to this technique that you can found explained here .

Resources