Multiclass AUC with 95% confidence interval - scikit-learn

I am trying to figure out whether there is a way to get the 95% CI of the AUC in Python. I have a ypred list that contains the highest-probability class prediction among my 4 classes (so a 0, 1, 2, or 3 at each position) and a yactual list that contains the actual label at each position. How exactly do I go about bootstrapping samples for multiple classes?
Edit: Currently I calculate the AUC with a one-vs-all scheme: I take the AUC of each class versus the rest and average those 4 values to get the final AUC.

Performing a one-vs-all classification scheme for each class and reporting out per class was good enough.
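One way to attach a 95% interval to each per-class AUC is a percentile bootstrap over samples. The sketch below is only an illustration under stated assumptions, not the asker's code: `bootstrap_ovr_auc` is a hypothetical helper, the data are synthetic, class labels are assumed to be 0..3, and the hard label predictions are one-hot encoded so that each class gets a 0/1 score column.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_ovr_auc(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the one-vs-rest AUC of each class.

    y_score has shape (n_samples, n_classes); column k scores class k.
    Returns {class: (auc, lower, upper)}.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    rng = np.random.default_rng(seed)
    n, n_classes = y_score.shape
    results = {}
    for k in range(n_classes):
        is_k = (y_true == k).astype(int)  # class k versus the rest
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)  # resample with replacement
            if is_k[idx].min() == is_k[idx].max():
                continue  # only one class in this resample; AUC is undefined
            aucs.append(roc_auc_score(is_k[idx], y_score[idx, k]))
        lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        results[k] = (roc_auc_score(is_k, y_score[:, k]), lower, upper)
    return results


# Synthetic stand-ins for yactual / ypred (assumption: labels are 0..3).
rng = np.random.default_rng(1)
yactual = rng.integers(0, 4, size=200)
ypred = np.where(rng.random(200) < 0.7, yactual, rng.integers(0, 4, size=200))

# One-hot encode the hard predictions so each class has a score column.
ci = bootstrap_ovr_auc(yactual, np.eye(4)[ypred], n_boot=200)
for k, (auc, lower, upper) in ci.items():
    print(f"class {k}: AUC={auc:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

If predicted probabilities are available, passing them as `y_score` instead of one-hot labels generally gives a more informative AUC.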

Related

Find threshold for 1-D data

I have 1-D data and a binary classification problem. Points above a certain threshold belong to class 0 and points below it belong to class 1. I want to find that threshold. I don't want to pass this data to any classifier from sklearn, but I do want to use functionality like cross-validation and the ROC curve from sklearn. How can I do this?
Thanks
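One possible approach, sketched here under assumptions, is to treat the raw 1-D values as the score and read a threshold off `roc_curve`. The data are synthetic, and choosing the threshold by maximizing TPR minus FPR (Youden's J) is just one of several reasonable criteria.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic 1-D data: class 0 sits high, class 1 sits low (as in the question).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(4.0, 1.0, 100), rng.normal(0.0, 1.0, 100)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Low values indicate class 1, so negate x to get a score that grows with class 1.
fpr, tpr, thresholds = roc_curve(y, -x)

# Pick the ROC point that maximizes TPR - FPR, then undo the negation.
best = np.argmax(tpr - fpr)
threshold = -thresholds[best]

# Points at or below the threshold are assigned to class 1.
y_hat = (x <= threshold).astype(int)
print(f"threshold={threshold:.3f}, accuracy={(y_hat == y).mean():.3f}")
```

To cross-validate, the same selection could be repeated on each training fold and the resulting threshold scored on the held-out fold.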

Get the probability of a sample in sklearn.linear_model.LogisticRegression instead of class label

I am using sklearn.linear_model.LogisticRegression for a text classification project. With the features I have extracted, the samples mostly receive a low probability score, so when I use predict() those samples are always classified as class 0. What I want instead is to get the actual probabilities for the samples and choose the top 25%-30% based on the probability score. How do I get the probability score for a sample? In linear regression, predict() provides the actual output, but that is not the case for logistic regression. I am not restricted to sklearn; a different package also works.
To be clear, I want the predict function to return the actual probability value (the output of the sigmoid function) instead of the class label, the way the linear regression predict function does.
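For reference, LogisticRegression exposes these values through `predict_proba`. A minimal sketch on synthetic data follows; the dataset, the 25% cutoff, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary data standing in for the text features.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one column per class, ordered as in clf.classes_.
proba = clf.predict_proba(X)
proba_pos = proba[:, 1]  # probability of class 1

# Keep the samples whose probability falls in the top 25%.
cutoff = np.quantile(proba_pos, 0.75)
top_idx = np.flatnonzero(proba_pos >= cutoff)
print(len(top_idx), "samples selected; cutoff =", round(float(cutoff), 3))
```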

What is the classifier used in scikit-learn's VotingClassifier?

I looked at the documentation of scikit-learn but it is not clear to me what sort of classification method is used under the hood of the VotingClassifier? Is it logistic regression, SVM or some sort of a tree method?
I'm interested in ways to vary the classifier method used under the hood. If Scikit-learn is not offering such an option is there a python package which can be integrated easily with scikit-learn which would offer such functionality?
EDIT:
I meant the classifier method used for the second level model. I'm perfectly aware that the first level classifiers can be any type of classifier supported by scikit-learn.
The second level classifier uses the predictions of the first level classifiers as inputs. So my question is - what method does this second level classifier use? Is it logistic regression? Or something else? Can I change it?
General
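The VotingClassifier is not limited to one specific method or algorithm. You can choose several different algorithms and combine them into one VotingClassifier. See the example below:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Each (...) is a placeholder for the hyperparameters of your choice.
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression(...)
clf2 = RandomForestClassifier(...)
clf3 = SVC(...)
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)], voting='hard')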
Read more about the usage here: VotingClassifier-Usage.
When it comes down to how the VotingClassifier "votes" you can either specify voting='hard' or voting='soft'. See the paragraph below for more detail.
Voting
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample
is the class label that represents the majority (mode) of the class
labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample
as “class 1” based on the majority class label.
Source: scikit-learn-majority-class-labels-majority-hard-voting
Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the
class label as argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights
parameter. When weights are provided, the predicted class
probabilities for each classifier are collected, multiplied by the
classifier weight, and averaged. The final class label is then derived
from the class label with the highest average probability.
Source/Read more here: scikit-learn-weighted-average-probabilities-soft-voting
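The two voting rules quoted above can be reproduced by hand with NumPy. The probabilities and weights below are made-up numbers for a single sample, chosen so that the weighted soft vote disagrees with the hard vote.

```python
import numpy as np

# Hypothetical predicted probabilities for one sample:
# rows are classifiers, columns are classes 0, 1, 2.
probas = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])

# Hard voting: each classifier votes for its most likely class, the mode wins.
votes = probas.argmax(axis=1)
hard_label = np.bincount(votes).argmax()

# Soft voting: average the probabilities, then take the argmax.
soft_equal = probas.mean(axis=0).argmax()

# Weighted soft voting: the second classifier counts three times as much.
weights = np.array([1, 3, 1])
soft_weighted = np.average(probas, axis=0, weights=weights).argmax()

print(hard_label, soft_equal, soft_weighted)
```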
The VotingClassifier does not fit any meta model on the output of the first-level classifiers.
It just aggregates the output of each first-level classifier, either by taking the mode (if voting is hard) or by averaging the probabilities (if voting is soft).
In simple terms, the VotingClassifier does not learn anything from the first-level classifiers. It only consolidates the output of the individual classifiers.
If you want your meta model to be more intelligent, try the AdaBoost or gradient boosting models.
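If what you are after is a second-level model that is actually fit on the first-level outputs, scikit-learn 0.22 and later also provide StackingClassifier, where that model is passed as `final_estimator`. This is a different estimator from VotingClassifier, and the sketch below uses illustrative hyperparameters and the full iris data rather than anything from the question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC()),
    ],
    # The second-level model; swap in any classifier you like.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # first-level outputs for the meta model come from cross-validation
)
stack.fit(X, y)
train_accuracy = stack.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```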

How do I test my classifier accuracy against random values?

I've set up my first scikit-learn example to play with and I'm trying to gauge accuracy on my predictions. I've got training and test lists set up fine, but I'm getting ~0.95 accuracy even if I give it random values.
This looks to be because I'm checking for 0/1 labels and 95% of the labels are zeros, so it's guessing 0 every time and getting 0.95 accuracy (I think?). Obviously this isn't what I want.
How do I go about deciding if my classifiers are working, and how do I get meaningful accuracy values?
You have a clear class imbalance issue. Your classifier is predicting 0 all the time, knowing it will be right 95% of the time. You can check this by calling predict(X_test) on your fitted classifier: if all the values are 0, you know this is the case.
To get a better idea of how the model performs, you can upsample the data labelled 1 or downsample the data labelled 0. You can use this package, which builds on scikit-learn and implements a number of resampling methods. Alternatively, you can use scikit-learn's resampling method, which will bootstrap new data points for you.
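As a rough illustration of the upsampling idea, `sklearn.utils.resample` can draw the minority class with replacement until it matches the majority class. The data below are synthetic and the variable names are illustrative; in practice you would resample only the training split so the test set keeps its real class balance.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic data with roughly 5% positives, mimicking the imbalance described.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)

X_maj, X_min = X[y == 0], X[y == 1]

# Bootstrap the minority class up to the size of the majority class.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])
print(np.bincount(y), "->", np.bincount(y_bal))
```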

sklearn: AUC score for LinearSVC and OneSVM

One option of the SVM classifier (SVC) is probability which is false by default. The documentation does not say what it does. Looking at libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC nor OneSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No, you need some kind of threshold in your model that you can change. The ROC curve is then drawn by changing this threshold. The point of the ROC curve being, of course, to see how well your model is reproducing the hypothesis by seeing how well it is ordering the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
1. using the distance to the decision boundary, as I mentioned in my own question
2. using the bias term as your threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probability=True), then probabilities will be calculated for you in this manner, using CV, and you can then use them to draw the ROC curve. But as mentioned in the link I provide, it is much faster to draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance in this particular case.
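The first option can be sketched as follows. `roc_auc_score` is documented to accept non-thresholded decision values such as those returned by `decision_function`; the dataset and hyperparameters here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem standing in for the real data.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(max_iter=10000).fit(X_train, y_train)

# Signed distance to the decision boundary, used as the ranking score.
scores = clf.decision_function(X_test)

auc = roc_auc_score(y_test, scores)
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(f"AUC from decision_function: {auc:.3f}")
```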
In order to calculate AUC, using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC does (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it's not a probability. I suppose you could scale the decision function to take values in the range [0,1], and compute AUC, however I'm not sure what statistical properties this will have; you certainly won't be able to use it to compare with ROC calculated using probabilities.
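On the rescaling idea: ROC AUC depends only on how the scores rank the samples, so a strictly increasing rescaling such as min-max scaling should leave it unchanged. A quick check on synthetic scores (all names and numbers here are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)

# Synthetic "decision values": positives are shifted up by one unit.
raw = rng.normal(loc=y * 1.0, scale=1.0)

# Min-max scale the same values into [0, 1].
scaled = (raw - raw.min()) / (raw.max() - raw.min())

auc_raw = roc_auc_score(y, raw)
auc_scaled = roc_auc_score(y, scaled)
print(auc_raw, auc_scaled)
```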
