I am reading paper by Viola and Jones. There they have used ROC curve to measure the accuracy of their classifier.
https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf
Could someone please explain how the ROC curve is plotted in case of binary classifier like face or non face? I mean how is the data points obtained.
(X,Y)= (falsepositive, correctdetection rate)
Do I have to calculate these points for every positives and negatives of my training data set. But my positive and negative data sets are of different sizes. I am bit confused.
ROC curve - Receiver operating characteristic is a measure of the accuracy of their classifier. As much as the area under the curve is larger the classifier is more accurate. In order to increase the area under the curve, the classifier needs to have a high value on the y-axis. That means to have a good TPR = true positive rate.
To calculate the ROC you first need to plot a graph of No' of instance as a function of the result of the AdaBoost classifier. After doing that in order to plot the graph you need to move the threshold of the AdaBoost classifier and calculate the TPR and FPR of each point.
Related
I am confused by this example here: https://scikit-learn.org/stable/visualizations.html
If we plot the ROC curve for a Logistic Regression Classifier the ROC curve is parametrized by the threshold parameter. But a usual SVM spits out binary values instead of probabilities.
Consequently there should not be a threshold which can be varied to obtain an ROC curve.
But which parameter is then varied in the example above?
SVMs have a measure of confidence in their predictions using the distance from the separating hyperplane (before the kernel, if you're not doing a linear SVM). These are obviously not probabilities, but they do rank-order the data points, and so you can get an ROC curve. In sklearn, this is done via the decision_function method. (You can also set probability=True in the SVC to calibrate the decision function values into probability estimates.)
See this section of the User Guide for some of the details on the decision function.
I have 1D data (on column data). I used Gaussian Mixture Model (GMM) as a density estimation, using this implementation in Python: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. By relying on AIC/BIC criteron i was able to determine number of components. After i fit the GMM, i plotted kernel density estimation of original observation + that of sampled data drawn from GMM. the plot of original and sampled desnities are quiet similar( that is good). But, i would like some metrics to report how good is the fitted model.
g = GaussianMixture(n_components = 35)
data= df['x'].values.reshape(-1,1) # data taken from data frame (10,000 data pints)
clf= g.fit(data)# fit model
samples= clf.sample(10000)[0] # generate sample data points (same # as original data points)
I found score in the implementation, but not sure how to implememnt. Am i doing it wrong? or is there any better way to show how accuracy is the fitted model, apart from histogram or kernel densities plots?.
print(clf.score(data))
print(clf.score(samples))
You can use normalized_mutual_info_score, adjusted_rand_score or silhouette score to evaluate your clusters. All of these metrics are implemented under sklearn.metrics section.
EDIT: You can check this link for more detail explanations.
In a summary:
Adjusted Rand Index: measures the similarity of the two assignments.
Normalized Mutual Information: measures the agreement of the two assignments.
Silhouette Coefficient: measures how well-assigned each individual point is.
gmm.fit(x_vec)
pred = gmm.predict(x_vec)
print ("gmm: silhouttte: ", silhouette_score(x_vec, pred))
I would better use cross-validation and try to see the accuracy of the trained model.
Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the acurracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict
Toy example:
g = GaussianMixture(n_components = 35)
g.fit(train_data)# fit model
y_pred = g.predict(test_data)
EDIT:
There are several options to measure the performance of your unsupervised case. For GMM, which base on real probabilities, the most common are BIC and AIC. They are immediatly included in the scikit GMM class.
I want to evaluate a logistic regression model (binary event) using two measures:
1. model.score and confusion matrix which give me a 81% of classification accuracy
2. ROC Curve (using AUC) which gives back a 50% value
Are these two result in contradiction? Is that possible
I'missing something but still can't find it
y_pred = log_model.predict(X_test)
accuracy_score(y_test , y_pred)
cm = confusion_matrix( y_test,y_pred )
y_test.count()
print (cm)
tpr , fpr, _= roc_curve( y_test , y_pred, drop_intermediate=False)
roc = roc_auc_score( y_test ,y_pred)
enter image description here
The accuracy score is calculated based on the assumption that a class is selected if it has a prediction probability of more than 50%. This means that you are looking only at 1 case (one working point) out of many. Let's say you'd like to classify an instance as '0' even if it has a probability greater than 30% (this may happen if one of your classes is more important for you, and its a-priori probability is very low). In this case - you will have a very different confusion matrix with a different accuracy ([TP+TN]/[ALL]). The ROC auc score examines all of these working points and gives you an estimation of your overall model. A score of 50% means that the model is equal to a random selection of classes based on your a-priori probabilities of the classes. You would like the ROC to be much higher to say that you have a good model.
So in the above case - you can say that your model does not have a good prediction strength. As a matter of fact - a better prediction will be to predict everything as "1" - in your case it will lead to an accuracy of above 99%.
I was trying to plot ROC curve and Precision-Recall curve in graph. The points are generated from the Spark Mllib BinaryClassificationMetrics. By following the following Spark https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
[(1.0,1.0), (0.0,0.4444444444444444)] Precision
[(1.0,1.0), (0.0,1.0)] Recall
[(1.0,1.0), (0.0,0.6153846153846153)] - F1Measure
[(0.0,1.0), (1.0,1.0), (1.0,0.4444444444444444)]- Precision-Recall curve
[(0.0,0.0), (0.0,1.0), (1.0,1.0), (1.0,1.0)] - ROC curve
It looks like you have a similar problem to what I experienced. You need to either flip your parameters to the Metrics constructor or perhaps pass in the probability instead of the prediction. So, for example, if you are using the BinaryClassificationMetrics and a RandomForestClassifier, then according to this page (under outputs) there is "prediction" and "probability".
Then initialize your Metrics thus:
new BinaryClassificationMetrics(predictionsWithResponse
.select(col("probability"),col("myLabel"))
.rdd.map(r=>(r.getAs[DenseVector](0)(1),r.getDouble(1))))
With the DenseVector call used to extract the probability of the 1 class.
As for actual plotting, that's up to you (many fine tools for that), but at least you will get more than 1 point on you curve (besides the endpoints).
And in case it's not clear:
metrics.roc().collect() will give you the data for the ROC curve: Tuples of: (false positive rate, true positive rate).
One option of the SVM classifier (SVC) is probability which is false by default. The documentation does not say what it does. Looking at libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC nor OneSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No, you need some kind of threshold in your model that you can change. The ROC curve is then drawn by changing this threshold. The point of the ROC curve being, of course, to see how well your model is reproducing the hypothesis by seeing how well it is ordering the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
using distance to the decision bondary, as I mentioned in my own question
using the bias term as your threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probabilities=True) then probabilities will be calculated for you in this manner, by using CV, which you can then use to draw the ROC curve. But as mentioned in the link I provide, it is much faster if you draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance in this particular case.
In order to calculate AUC, using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC does (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it's not a probability. I suppose you could scale the decision function to take values in the range [0,1], and compute AUC, however I'm not sure what statistical properties this will have; you certainly won't be able to use it to compare with ROC calculated using probabilities.