How to plot ROC curve and precision-recall curve from BinaryClassificationMetrics - apache-spark

I was trying to plot the ROC curve and the Precision-Recall curve. The points are generated from Spark MLlib's BinaryClassificationMetrics by following the Spark documentation (https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html), which gives me:
[(1.0,1.0), (0.0,0.4444444444444444)] - Precision
[(1.0,1.0), (0.0,1.0)] - Recall
[(1.0,1.0), (0.0,0.6153846153846153)] - F1Measure
[(0.0,1.0), (1.0,1.0), (1.0,0.4444444444444444)] - Precision-Recall curve
[(0.0,0.0), (0.0,1.0), (1.0,1.0), (1.0,1.0)] - ROC curve

It looks like you have a problem similar to the one I experienced. You need to either flip the parameters you pass to the Metrics constructor or, perhaps, pass in the probability instead of the prediction. For example, if you are using BinaryClassificationMetrics with a RandomForestClassifier, then according to this page (under "Outputs") the model produces both a "prediction" and a "probability" column.
Then initialize your Metrics thus:
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.functions.col
val metrics = new BinaryClassificationMetrics(predictionsWithResponse
  .select(col("probability"), col("myLabel"))
  .rdd.map(r => (r.getAs[DenseVector](0)(1), r.getDouble(1))))
The DenseVector indexing is used to extract the probability of the 1 class.
As for the actual plotting, that's up to you (there are many fine tools for that), but at least you will get more than one point on your curve (besides the endpoints).
And in case it's not clear:
metrics.roc().collect() will give you the data for the ROC curve: tuples of (false positive rate, true positive rate).
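If you then want to actually draw the curves, here is a minimal Python sketch (mine, not part of the original answer). It assumes you are in PySpark with the same predictionsWithResponse DataFrame ("probability" and "myLabel" columns), collects the class-1 score and the label to the driver, and hands them to scikit-learn/matplotlib:
# Minimal plotting sketch; predictionsWithResponse, "probability" and
# "myLabel" are the (assumed) names from the snippet above.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

rows = predictionsWithResponse.select("probability", "myLabel").collect()
scores = [float(r["probability"][1]) for r in rows]   # P(class = 1)
labels = [float(r["myLabel"]) for r in rows]

fpr, tpr, _ = roc_curve(labels, scores)
prec, rec, _ = precision_recall_curve(labels, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr)
ax1.set_xlabel("False positive rate"); ax1.set_ylabel("True positive rate"); ax1.set_title("ROC curve")
ax2.plot(rec, prec)
ax2.set_xlabel("Recall"); ax2.set_ylabel("Precision"); ax2.set_title("Precision-Recall curve")
plt.show()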

Related

What is the parameter which is varied when running sklearn.metrics.plot_roc_curve on a SVM?

I am confused by this example here: https://scikit-learn.org/stable/visualizations.html
If we plot the ROC curve for a Logistic Regression classifier, the curve is parametrized by the threshold. But a plain SVM spits out binary labels instead of probabilities.
Consequently there should not be a threshold which can be varied to obtain an ROC curve.
But which parameter is then varied in the example above?
SVMs have a measure of confidence in their predictions using the distance from the separating hyperplane (before the kernel, if you're not doing a linear SVM). These are obviously not probabilities, but they do rank-order the data points, and so you can get an ROC curve. In sklearn, this is done via the decision_function method. (You can also set probability=True in the SVC to calibrate the decision function values into probability estimates.)
See this section of the User Guide for some of the details on the decision function.
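As a small illustration (my own sketch, not from the answer), assuming a recent scikit-learn where RocCurveDisplay.from_estimator plays the role of plot_roc_curve: for an SVC fitted without probability=True there is no predict_proba, so the display falls back to decision_function and sweeps a threshold over those scores.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import RocCurveDisplay

# Toy data, just to have something to fit.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf").fit(X_tr, y_tr)   # note: no probability=True
# With no predict_proba available, the curve is traced by thresholding
# the decision_function scores.
RocCurveDisplay.from_estimator(clf, X_te, y_te)
plt.show()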

Binary classifier too confident to plot ROC curve with sklearn?

I have created a binary classifier in TensorFlow that will output a generator object containing predictions. I extract the predictions (e.g. [0.98, 0.02]) from the object into a list, later converting this into a NumPy array. I have the corresponding array of labels for these predictions. Using these two arrays I believe I should be able to plot a ROC curve via:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(labels, predictions[:,1])
plt.plot(fpr, tpr)
plt.show()
print(fpr)
print(tpr)
print(thr)
Here predictions[:,1] gives the positive prediction score. However, running this code produces only a flat line and only three values for each of fpr, tpr, and thr:
[Figure: flat-line ROC plot and limited function outputs]
The only theory I have as to why this is happening is that my classifier is too sure of its predictions. Many, if not all, of the positive prediction scores are 1.0 or incredibly close to zero:
[[9.9999976e-01 2.8635742e-07]
[3.3693312e-11 1.0000000e+00]
[1.0000000e+00 9.8642090e-09]
...
[1.0106111e-15 1.0000000e+00]
[1.0000000e+00 1.0030269e-09]
[8.6156778e-15 1.0000000e+00]]
According to a few sources including this stackoverflow thread and this stackoverflow thread, the very polar values of my predictions could be creating an issue for roc_curve().
Is my intuition correct? If so is there anything I can do about it to plot my roc_curve?
I've tried to include all the information I think would be relevant to this issue but if you would like any more information about my program please ask.
ROC is generated by changing the threshold on your predictions and finding the sensitivity and specificity at each threshold. Generally, as you increase the threshold, your sensitivity decreases but your specificity increases, and the curve draws a picture of the overall quality of your predicted probabilities. In your case, since everything is either 0 or 1 (or very close to it), there are no meaningful thresholds to use. That's why the thr value is basically [1, 1, 1].
You can try to arbitrarily pull the values closer to 0.5 or alternatively implement your own ROC curve calculation with more tolerance for small differences.
On the other hand, you might want to review your network, because such output values often mean there is a problem there; maybe the labels leaked into the network somehow, and it therefore produces perfect results.
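A quick way to see the threshold problem concretely (my own sketch, not from the answer): roc_curve can only place one candidate threshold per distinct score value, so saturated 0/1 scores leave it almost nothing to sweep.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 1, 0, 1, 0, 1])
saturated = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # only two distinct scores
graded = np.array([0.2, 0.9, 0.6, 0.7, 0.4, 0.3])      # six distinct scores

# drop_intermediate=False keeps one point per distinct score, which makes
# the difference in curve resolution obvious.
print(roc_curve(labels, saturated, drop_intermediate=False)[2])
print(roc_curve(labels, graded, drop_intermediate=False)[2])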

How is the ROC curve plotted in Viola Jones face detection paper?

I am reading the paper by Viola and Jones, where they use an ROC curve to measure the accuracy of their classifier.
https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf
Could someone please explain how the ROC curve is plotted in the case of a binary classifier like face vs. non-face? I mean, how are the data points obtained?
(X, Y) = (false positives, correct detection rate)
Do I have to calculate these points for every positive and negative in my training data set? My positive and negative data sets are of different sizes, so I am a bit confused.
An ROC curve (receiver operating characteristic) is a measure of the accuracy of the classifier: the larger the area under the curve, the more accurate the classifier. To increase the area under the curve, the classifier needs high values on the y-axis, i.e. a good TPR (true positive rate).
To calculate the ROC, you first plot the number of instances as a function of the AdaBoost classifier's output score. Then, to plot the curve itself, you sweep the threshold of the AdaBoost classifier and calculate the TPR and FPR at each threshold.
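As a rough illustration (my own sketch, not from the paper), the threshold sweep can be done by hand: sort the distinct detector scores, treat each one as a threshold, and recompute the rates each time.
import numpy as np

def roc_points(scores, labels):
    # scores: real-valued detector outputs; labels: 1 = face, 0 = non-face.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    points = []
    for thr in np.unique(scores)[::-1]:                 # one point per distinct score
        pred = scores >= thr                            # detections at this threshold
        tpr = (pred & labels).sum() / labels.sum()      # correct detection rate
        fpr = (pred & ~labels).sum() / (~labels).sum()  # false positive rate
        points.append((fpr, tpr))
    return points
Because each rate is normalized by the size of its own class, it does not matter that the positive and negative sets have different sizes.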

Common way to plot a ROC Curve

I'm trying to obtain ROC Curve for GBTClassifier.
One way is to reuse BinaryClassificationMetrics; however, the approach shown in the documentation (https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html) provides only 4 values for the ROC curve, like:
[0.0|0.0]
[0.0|0.9285714285714286]
[1.0|1.0]
[1.0|1.0]
Another way is to use the "probability" column instead of "prediction". However, in the case of GBTClassifier I don't have it, so this solution works mostly for RandomForestClassifier.
How to plot ROC curve and precision-recall curve from BinaryClassificationMetrics
So what is the general/common way to get a ROC curve with enough points for an arbitrary classifier?

sklearn: AUC score for LinearSVC and OneSVM

One option of the SVM classifier (SVC) is probability, which is false by default. The documentation does not say what it does; looking at the libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC or OneSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No, you need some kind of threshold in your model that you can change. The ROC curve is then drawn by varying this threshold. The point of the ROC curve is, of course, to see how well your model reproduces the hypothesis by seeing how well it orders the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
using distance to the decision boundary, as I mentioned in my own question (see the sketch below)
using the bias term as your threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probability=True), then probabilities will be calculated for you in this manner, using CV, which you can then use to draw the ROC curve. But as mentioned in the link I provide, it is much faster if you draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance in this particular case.
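A minimal sketch of approach #1 (my own code, not from the answer): feed the decision_function scores straight into roc_curve and roc_auc_score, which need only a ranking, not probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)                 # has no probability option
scores = clf.decision_function(X_te)              # signed distances to the hyperplane
fpr, tpr, thresholds = roc_curve(y_te, scores)    # ROC points from the ranking
print(roc_auc_score(y_te, scores))                # accepts non-thresholded decision values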
In order to calculate AUC, using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC does (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it's not a probability. I suppose you could scale the decision function to take values in the range [0,1], and compute AUC, however I'm not sure what statistical properties this will have; you certainly won't be able to use it to compare with ROC calculated using probabilities.
