Getting perfect ROC-AUC score for Linear SVC - scikit-learn

I am evaluating different classifiers for my sentiment analysis model. I am looking at all the available metrics, and whilst most achieve similar precision, recall, F1-score and ROC-AUC scores, Linear SVC appears to get a perfect ROC-AUC score. Look at the chart below:
Abbreviations: MNB=Multinomial Naive Bayes, SGD=Stochastic Gradient Descent, LR=Logistic Regression, LSVC=Linear Support Vector Classification
Here are the rest of the performance metrics for LSVC, which are very similar to the rest of the classifiers:
             precision    recall  f1-score   support

        neg       0.83      0.90      0.87     24979
        pos       0.90      0.82      0.86     25021

avg / total       0.87      0.86      0.86     50000
As you can see the dataset is balanced for pos and neg comments.
Here is the relevant code:
def evaluate(classifier):
    predicted = classifier.predict(testing_text)

    if isinstance(classifier.steps[2][1], LinearSVC):
        probabilities = np.array(classifier.decision_function(testing_text))
        scores = probabilities
    else:
        probabilities = np.array(classifier.predict_proba(testing_text))
        scores = np.max(probabilities, axis=1)

    pos_idx = np.where(predicted == 'pos')
    predicted_true_binary = np.zeros(predicted.shape)
    predicted_true_binary[pos_idx] = 1

    fpr, tpr, thresholds = metrics.roc_curve(predicted_true_binary, scores)
    auc = metrics.roc_auc_score(predicted_true_binary, scores)

    mean_acc = np.mean(predicted == testing_category)
    report = metrics.classification_report(testing_category, predicted)
    confusion_matrix = metrics.confusion_matrix(testing_category, predicted)

    return fpr, tpr, auc, mean_acc, report, confusion_matrix
I am using predict_proba for all classifiers apart from LSVC, which uses decision_function instead (since it does not have a predict_proba method).
What's going on?
EDIT: changes according to @Vivek Kumar's comments:
def evaluate(classifier):
    predicted = classifier.predict(testing_text)

    if isinstance(classifier.steps[2][1], LinearSVC):
        probabilities = np.array(classifier.decision_function(testing_text))
        scores = probabilities
    else:
        probabilities = np.array(classifier.predict_proba(testing_text))
        scores = probabilities[:, 1]  # NEW

    testing_category_array = np.array(testing_category)  # NEW
    pos_idx = np.where(testing_category_array == 'pos')
    predicted_true_binary = np.zeros(testing_category_array.shape)
    predicted_true_binary[pos_idx] = 1

    fpr, tpr, thresholds = metrics.roc_curve(predicted_true_binary, scores)
    auc = metrics.roc_auc_score(predicted_true_binary, scores)

    mean_acc = np.mean(predicted == testing_category)
    report = metrics.classification_report(testing_category, predicted)
    confusion_matrix = metrics.confusion_matrix(testing_category, predicted)

    return fpr, tpr, auc, mean_acc, report, confusion_matrix
This now yields this graph:

I don't think it is valid to compare the methods predict_proba and decision_function like for like. The first sentence in the docs for LinearSVC's decision_function, "Predict confidence scores for samples.", must not be read as "predict probabilities". The second sentence clarifies it: the scores are analogous to the decision function of the general SVC, i.e. signed distances to the separating hyperplane, not probabilities.
You can get predict_proba for a linear SVM in sklearn; for that you need to use the general SVC with kernel='linear' (and probability=True). However, you are then changing the implementation under the hood (away from LIBLINEAR).
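For illustration, a minimal sketch of the options (assuming hypothetical vectorized feature matrices X_train, X_test and labels y_train; this is not the poster's pipeline):

from sklearn.svm import SVC, LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Option 1: keep LinearSVC and feed its decision_function scores to the ROC metrics;
# roc_curve / roc_auc_score accept any continuous score, not only probabilities.
lsvc = LinearSVC().fit(X_train, y_train)
scores = lsvc.decision_function(X_test)  # signed distances to the hyperplane

# Option 2: use the kernelized SVC with a linear kernel and enable probability
# estimates (Platt scaling); this no longer uses the LIBLINEAR implementation.
svc = SVC(kernel='linear', probability=True).fit(X_train, y_train)
proba = svc.predict_proba(X_test)[:, 1]  # probability of the positive class

# Option 3: wrap LinearSVC in a calibrator to obtain probability estimates.
calibrated = CalibratedClassifierCV(LinearSVC()).fit(X_train, y_train)
proba_cal = calibrated.predict_proba(X_test)[:, 1]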

Related

F1 metric and LeaveOneOut validation strategy in scikit-learn

I want to use GridSearchCV to find the optimal n_neighbors parameter of KNeighborsClassifier
I want to use 'f1_score' metrics AND 'leave one out' strategy.
But this code
clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 2, 3]}, cv=LeaveOneOut(), scoring='f1')
clf.fit(x_train, y_train)
leads to an error
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
I want to compute f1 score not of each fold of cross validation (it is not possible to compute f1 score of the only one test example), but to compute f1 score based on the whole iteration set with n_neighbors = n.
Is it possible using GridSearchCV?
Not sure if this functionality is directly available in Scikit-Learn, but you can implement the following function to get the desired outcome.
In particular, we will make a dummy scorer which just returns the predicted class instead of computing any score from the ground truth and the prediction. In this way we can access the predictions of each hyperparameter combination on the different examples in the LOO CV.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.metrics import f1_score, make_scorer

def get_pred(y_true, y_predicted):
    # dummy "scorer": simply return the prediction instead of a score
    return y_predicted

get_pred_scorer = make_scorer(get_pred)

clf = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 2, 3]},
    cv=LeaveOneOut(),
    refit=False,
    scoring=get_pred_scorer
)
clf.fit(X_train, y_train)
The problem with this approach is that certain results available in the cv_results_ dictionary (and in certain attributes of GridSearchCV) won't have any meaning, but that probably is not a problem. We should just remember to put refit=False, since GridSearchCV doesn't have a way to determine the best model.
Now we can access the predictions through cv_results_ and just use f1_score to compute the metric for each hyperparams configuration.
def print_params_f1_scores(clf, y_true):
    y_preds = []  # will contain the predictions of each params combination
    results = clf.cv_results_
    params = results["params"]  # all params combinations

    for j in range(len(params)):        # for each combination
        y_preds.append([])
        for i in range(clf.n_splits_):  # for each split (sample in LOO)
            prediction_of_j_on_i = results[f"split{i}_test_score"][j]
            y_preds[j].append(prediction_of_j_on_i)

    # show the f1-scores of each combination
    for j in range(len(y_preds)):
        score = f1_score(y_true, y_preds[j])
        print(f"KNeighborsClassifier with {params[j]} obtained f1-score of {score}")

print_params_f1_scores(clf, y_train)
The function prints the following output:
KNeighborsClassifier with {'n_neighbors': 1} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 2} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 3} obtained f1-score of 0.92
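As a side note (not part of the original answer), a shorter route to the same result is cross_val_predict, which collects the held-out LOO prediction for every sample under a fixed n_neighbors, so f1_score can then be computed over the whole set; a minimal sketch assuming the same X_train, y_train:

from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

for n in [1, 2, 3]:
    # one prediction per sample, each obtained while that sample was held out
    y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=n),
                               X_train, y_train, cv=LeaveOneOut())
    print(f"n_neighbors={n}: f1-score {f1_score(y_train, y_pred)}")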

Understanding Python's roc_curve, svm example

I'm trying to understand how this Python code works, conceptually, so I can write a paper about it. I have an analogous question for the random forest algorithm; but maybe if I understand this, I'll understand that too. Here's just the part that I think is relevant to my question:
import numpy as np
from numpy import interp
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

statifiedFolds = StratifiedKFold(n_splits=5, shuffle=True)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
for train, test in statifiedFolds.split(x, y):
    svc = SVC(kernel='rbf', C=10000, gamma=0.1)
    x_train, x_test = x[train], x[test]
    y_train, y_test = y[train], y[test]
    svc.fit(x_train, y_train)
    y_pred = svc.decision_function(x_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
As I understand, the ROC curve plots false positive rate against true positive rate. But each time you run SVM on the testing set, you get a single binary prediction for each testing point. You then calculate the true positive rate and false positive rate by tallying true positives and false positives. So tpr should be just a single number, as should fpr. Thus (tpr,fpr) should be just a single point.
This leads me to expect that to get an ROC curve, one should run the classification algorithm under many different parameters. If you're lucky, the algorithm will have a parameter such that larger values tend to benefit sensitivity at the expense of specificity, or the other way around. But neither of SVM's parameters (C and gamma) does that. So I would have thought you'd have to try many values of C and gamma until the left, middle and right regions of the ROC curve are all well represented.
But this code looks nothing like that. Only one pair of parameter values (C=10000, gamma=0.1) is ever used. And the SVM is run only once, followed by a call to an interpolation function, within each fold of the 5-fold cross-validation.
My question is: How is it possible to interpolate the roc curve using only 1 point?
The mistake in this reasoning lies in the fact that svc.decision_function(x_test) is not returning binary data.
It actually returns a (signed) value proportional to the distance of each sample from the separating hyperplane. You can therefore plot a proper ROC curve by sweeping the threshold around the default value of 0.
NB: See the reference documentation for details, svc.decision_function will return slightly different formats depending on the decision_function_shape argument of svc.
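To make that concrete, here is a small sketch (using hypothetical arrays y_test and y_pred from one fold, where y_pred holds the continuous decision_function scores) of what roc_curve effectively does: every candidate threshold binarizes the scores once, and each threshold contributes one (fpr, tpr) point, so a single run yields an entire curve.

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# roughly equivalent manual sweep over the thresholds returned above
n_pos = np.sum(y_test == 1)
n_neg = np.sum(y_test == 0)
for t in thresholds:
    y_bin = (y_pred >= t).astype(int)      # binarize at this cut-off
    tp = np.sum((y_bin == 1) & (y_test == 1))
    fp = np.sum((y_bin == 1) & (y_test == 0))
    print(t, fp / n_neg, tp / n_pos)       # one point of the ROC curve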

Sklearn logistic regression - adjust cutoff point

I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important however prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
classifier = LogisticRegression(penalty = 'l2',solver = 'saga', multi_class = 'ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB
As mentioned in the comments, the procedure of selecting a threshold is done after training. You can find the threshold that maximizes a utility function of your choice, for example:
from sklearn import metrics

preds = classifier.predict_proba(test_data)
fpr, tpr, thresholds = metrics.roc_curve(test_y, preds[:, 1])
print(thresholds)

accuracy_ls = []
for thres in thresholds:
    y_pred = np.where(preds[:, 1] > thres, 1, 0)
    # Apply the desired utility function to y_pred, for example accuracy.
    accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose the threshold that maximizes the chosen utility function; in your case, the one that gives the best performance on class A.
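For completeness, a small follow-up sketch (not in the original answer) that picks the cut-off with the highest score from accuracy_ls and applies it:

import numpy as np

best_threshold = thresholds[np.argmax(accuracy_ls)]
# predict class 1 only when the predicted probability exceeds the chosen cut-off
final_pred = np.where(preds[:, 1] > best_threshold, 1, 0)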

why have high AUC and low accuracy in a balanced dataset for SVM

I used LIBSVM to classify 256 classes. My dataset has about 5000-10000 samples. For the SVM, I used the one-against-one strategy to train my models. Now I get results with low accuracy (15%~30%) but high AUC (>90%). I assumed that one cannot obtain a high AUC (0.9 and higher) when the accuracy of the corresponding predictive model is low (13-30%), so how is this possible?
I referred to the open-source Python library scikit-learn to compute the AUC for this kind of problem (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py).
This is the code I used to compute the AUC:
import numpy as np
from numpy import interp
from sklearn import metrics
from sklearn.preprocessing import label_binarize

# compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()

# test_label_kernel: the true label of each instance
# LensOfLabel: the number of classes
y = label_binarize(test_label_kernel, classes=list(range(0, LensOfLabel, 1)))

# sort_pval: the prediction probabilities from the SVM
for i in range(LensOfLabel):
    fpr[i], tpr[i], _ = metrics.roc_curve(y[:, i], sort_pval[:, i])
    roc_auc[i] = metrics.auc(fpr[i], tpr[i])

# First aggregate all false positive rates
n_classes = LensOfLabel
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = metrics.auc(fpr["macro"], tpr["macro"])
print(("macroAUC: %.4f") % roc_auc["macro"])

# compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = metrics.roc_curve(y.ravel(), sort_pval.ravel())
roc_auc["micro"] = metrics.auc(fpr["micro"], tpr["micro"])
print(("microAUC: %.4f") % roc_auc["micro"])
The ROC curves are:
https://i.stack.imgur.com/GEUqr.png
https://i.stack.imgur.com/ucbE6.png

Shouldn't a SVM binary classifier understand the threshold from the training set?

I'm very confused about SVM classifiers and I'm sorry if I'll sound stupid.
I'm using the Spark MLlib library for Java (http://spark.apache.org/docs/latest/mllib-linear-methods.html), specifically the first example from the Linear Support Vector Machines section. On this training set:
1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3
the predictions on the values 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative, negative. It gives a negative result only on 0 or negative values. I read that the standard threshold is "positive" if the prediction is a positive double and "negative" if it's negative, and I've seen that there is a method to manually set the threshold. But isn't this the exact reason I need a binary classifier? I mean, if I know in advance what the threshold is, I can distinguish between positive and negative values myself, so why bother training a classifier?
UPDATE:
Using this python code from a different library:
X = [[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]]
y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np

# we convert our lists of lists into numpy arrays
X = np.array(X)
y = np.array(y)

# we compute the general accuracy of the system - we need more "false questions" to continue the study
accuracy = []

# we do 5-fold stratified cross-validation - to be sure to test all possible combinations of training and test
kf_total = StratifiedKFold(y, n_folds=5, shuffle=True)
for train, test in kf_total:
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print X_train
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print "the classifier says: ", y_pred
    print "reality is: ", y_test
    print accuracy_score(y_test, y_pred)
    print ""
    accuracy.append(accuracy_score(y_test, y_pred))

print sum(accuracy)/len(accuracy)
the results are correct:
######
1 [0]
######
2 [0]
######
8 [1]
So I think it's possible for a SVM classifier to understand the threshold by itself; how can I do the same with the spark library?
SOLVED: I solved the issue by changing the example to this:
SVMWithSGD std = new SVMWithSGD();
std.setIntercept(true);
final SVMModel model = std.run(training.rdd());
From this:
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
The default value for "intercept" is false, and I needed it to be true.
If you search for probability calibration you will find some research on a related matter (recalibrating the outputs to return better scores).
If your problem is a binary classification problem, you can calculate the slope of the cost line by assigning values to the true/false positive/negative options, multiplied by the class ratio. You can then find the single point where a line with that slope touches the ROC curve; it is, in some sense, the optimal threshold for your problem.
The threshold is a single value that separates the two classes.
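As a rough illustration of the cost-slope idea (a sketch with assumed names, not code from the thread: y_true is a binary 0/1 label array, scores are continuous classifier outputs, and cost_fp / cost_fn are the costs you assign to false positives and false negatives):

import numpy as np
from sklearn.metrics import roc_curve

def cost_optimal_threshold(y_true, scores, cost_fp=1.0, cost_fn=1.0):
    n_neg = np.sum(y_true == 0)
    n_pos = np.sum(y_true == 1)
    # slope of the iso-cost lines in ROC space (costs weighted by the class ratio)
    slope = (cost_fp * n_neg) / (cost_fn * n_pos)
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    # the optimal operating point is where an iso-cost line touches the ROC curve,
    # i.e. the point maximizing tpr - slope * fpr
    best = np.argmax(tpr - slope * fpr)
    return thresholds[best]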
