Evaluating the performance of one class SVM - scikit-learn

I have been trying to evaluate the performance of my one-class SVM. I have tried plotting an ROC curve using scikit-learn, and the results have been a bit bizarre.
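import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
X_train, X_test = train_test_split(compressed_dataset, test_size=0.5, random_state=42)
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
y_score = clf.fit(X_train).decision_function(X_test)
pred = clf.predict(X_train)
fpr, tpr, thresholds = roc_curve(pred, y_score)
roc_auc = auc(fpr, tpr)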
# Plotting roc curve
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
The ROC curve I get:
Can somebody help me out with this?

What is bizarre about this plot? You fixed a single pair of nu and gamma values, so your model is neither over- nor underfitting. Moving the threshold (which is the quantity an ROC curve varies) does not lead to 100% TPR. Try a high gamma and a very small nu (which is an upper bound on the fraction of training errors) and you will get more "typical" plots.
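One thing worth checking in the question's code is that `roc_curve` expects ground-truth labels as its first argument, whereas the snippet passes the model's own predictions on the training set. The following is a minimal sketch of the evaluation under the assumption that labelled inliers and outliers are available for the test set. The synthetic data and the names `X_test_in`, `X_test_out` and `y_true` are illustrative only and do not come from the question.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Synthetic stand-in data (an assumption): inliers near the origin, outliers spread widely.
X_train = rng.normal(0.0, 1.0, size=(200, 2))        # training set contains inliers only
X_test_in = rng.normal(0.0, 1.0, size=(100, 2))
X_test_out = rng.uniform(-6.0, 6.0, size=(100, 2))
X_test = np.vstack([X_test_in, X_test_out])
y_true = np.r_[np.ones(100), -np.ones(100)]           # +1 = inlier, -1 = outlier

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
y_score = clf.decision_function(X_test)               # higher score = more inlier-like

# Ground truth goes first, continuous scores second.
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
roc_auc = auc(fpr, tpr)
print("AUC = %.3f" % roc_auc)
```

Repeating this for a few values of `nu` and `gamma` shows how the curve changes with the hyperparameters.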

In my opinion, first get the scores:
pred_scores = clf.score_samples(X_train)
Then the pred_scores need to be min-max normalized.
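A short sketch of that normalization, assuming synthetic data and hypothetical names `X_eval` and `y_eval` for a labelled evaluation set. Because min-max scaling preserves the ordering of the scores, it leaves the ROC curve and the AUC unchanged. It is mainly useful when scores in the range [0, 1] are wanted for other reasons.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Synthetic stand-in data (an assumption, not the asker's dataset).
X_train = rng.normal(size=(200, 2))
X_eval = np.vstack([rng.normal(size=(50, 2)), rng.uniform(-6, 6, size=(50, 2))])
y_eval = np.r_[np.ones(50), np.zeros(50)]   # 1 = inlier, 0 = outlier (hypothetical labels)

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
pred_scores = clf.score_samples(X_eval)

# Min-max normalization to [0, 1].
scaled = (pred_scores - pred_scores.min()) / (pred_scores.max() - pred_scores.min())

auc_raw = roc_auc_score(y_eval, pred_scores)
auc_scaled = roc_auc_score(y_eval, scaled)
print(auc_raw, auc_scaled)
```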

Related

Azure AutoML: how to find the best threshold on a precision-recall curve?

I use automl for a classification problem. I obtained the following precision recall curve:
Is it possible to find the threshold that maximizes the F-score from this curve, and if so, how?
Using any kind of optimization mechanism, we can improve the F1 score. In the example below I tried to reproduce this with StandardScaler as the optimization mechanism.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression as lrs
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# X_train, X_test, y_train, y_test are NumPy arrays from an earlier train/test split (not shown)
pipeline = make_pipeline(StandardScaler(), lrs(random_state=1))
#
# Fit on two features (columns 2 and 13) and score the test set
#
pipeline.fit(X_train[:,[2, 13]],y_train)
probs = pipeline.predict_proba(X_test[:,[2, 13]])
fpr1, tpr1, thresholds = roc_curve(y_test, probs[:, 1], pos_label=1)
roc_auc1 = auc(fpr1, tpr1)
#
# Fit on two different features (columns 4 and 14) and score the test set
#
pipeline.fit(X_train[:,[4, 14]],y_train)
probs2 = pipeline.predict_proba(X_test[:,[4, 14]])
fpr2, tpr2, thresholds = roc_curve(y_test, probs2[:, 1], pos_label=1)
roc_auc2 = auc(fpr2, tpr2)
#
# Fit on all features and score the test set
#
pipeline.fit(X_train,y_train)
probs3 = pipeline.predict_proba(X_test)
fpr3, tpr3, thresholds = roc_curve(y_test, probs3[:, 1], pos_label=1)
roc_auc3 = auc(fpr3, tpr3)
fig, ax = plt.subplots(figsize=(7.5, 7.5))
plt.plot(fpr1, tpr1, label='ROC Curve 1 (AUC = %0.2f)' % (roc_auc1))
plt.plot(fpr2, tpr2, label='ROC Curve 2 (AUC = %0.2f)' % (roc_auc2))
plt.plot(fpr3, tpr3, label='ROC Curve 3 (AUC = %0.2f)' % (roc_auc3))
plt.plot([0, 1], [0, 1], linestyle='--', color='red', label='Random Classifier')
plt.plot([0, 0, 1], [0, 1, 1], linestyle=':', color='green', label='Perfect Classifier')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc="lower right")
plt.show()
With StandardScaler we get the maximum F1 score together with the maximum AUC.
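To answer the threshold question directly, one option is to compute F1 at every threshold returned by `precision_recall_curve` and take the maximum. This is a sketch using scikit-learn on synthetic data, not Azure AutoML's own API. The dataset and the names `best_threshold` and `best_f1` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data (an assumption standing in for the real dataset).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipeline.fit(X_train, y_train)
probs = pipeline.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, probs)
# precision and recall have one more entry than thresholds; drop the final (1, 0) point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)   # guard against division by zero

best = np.argmax(f1)
best_threshold = thresholds[best]
best_f1 = f1[best]
print("best threshold = %.3f, F1 = %.3f" % (best_threshold, best_f1))
```

In practice the threshold would normally be chosen on a validation split rather than on the final test set, so that the reported F1 is not optimistically biased.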

Can a Siamese Network model draw ROC Curve?

Based on the Keras example at
https://keras.io/examples/vision/siamese_contrastive/
here is the code I use to get the ROC curve:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import seaborn as sns
sns.set_style("whitegrid")
pred = siamese.predict([x_test_1, x_test_2])
pred = pred[:,0]
pred_NN_01 = np.where(pred > 0.5, 1, 0) #Turn probability to 0-1 binary output
#Print accuracy
acc_NN = accuracy_score(labels_test, pred_NN_01)
print('Overall accuracy of Neural Network model:', acc_NN)
#Print Area Under Curve
false_positive_rate, recall, thresholds = roc_curve(labels_test, pred)
roc_auc = auc(false_positive_rate, recall)
plt.figure()
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out (1-Specificity)')
plt.show()
#Print Confusion Matrix
cm = confusion_matrix(labels_test, pred_NN_01)
labels = ['Unchange', 'Change']
plt.figure(figsize=(8,6))
sns.heatmap(cm,xticklabels=labels, yticklabels=labels, annot=True, fmt='d', cmap="Blues", vmin = 0.2);
plt.title('Confusion Matrix')
plt.ylabel('True Class')
plt.xlabel('Predicted Class')
plt.show()
Keras Example ROC Curve
Keras Example Confusion Matrix
Assuming the code and images above are right: the example uses 10 classes (digit images 0-9).
If I use this model with other images that have only two classes,
does any part of this ROC curve code need to change?
I ask because I get strange output when I run the same code with 2 classes:
the results look odd, and the confusion matrix doesn't match the ROC curve.
2 Classes ROC Curve
2 Classes Confusion Matrix
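One possible cause, offered as an assumption and not as a diagnosis of the plots above, is score direction. `roc_curve` treats higher scores as evidence for the positive label, so if the network's output behaves like a distance (low for the pairs labelled 1), the ROC curve falls below the diagonal even when a confusion matrix at a fixed cut-off looks reasonable. The sketch below uses simulated values in place of `labels_test` and `siamese.predict(...)[:, 0]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

# Simulated stand-ins (assumptions): binary pair labels and a distance-like score
# that is LOW for pairs labelled 1 and HIGH for pairs labelled 0.
labels_test = rng.randint(0, 2, size=500)
pred = np.clip(rng.normal(0.7 - 0.4 * labels_test, 0.15), 0.0, 1.0)

auc_as_is = roc_auc_score(labels_test, pred)      # scores point the "wrong" way
auc_flipped = roc_auc_score(labels_test, -pred)   # negate so that higher = more positive

print("AUC as-is:   %.3f" % auc_as_is)
print("AUC flipped: %.3f" % auc_flipped)
```

The two AUC values sum to 1, so an AUC well below 0.5 usually indicates inverted scores or labels and not an uninformative model. The ROC code itself does not depend on how many image classes the pairs were built from, since the pair labels are binary either way.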

Why is my MLP ROC_AUC plotting only 3 points

I am trying to get an ROC curve, but my code seems to plot a triangle instead. I think I am doing something wrong, and I would appreciate any help.
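from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

model = MLPClassifier()
model.fit(X_train, Y_train)
prediction = model.predict(X_test)  # predicted class labels
fpr, tpr, thresholds = roc_curve(Y_test, prediction)
roc_auc = auc(fpr, tpr)
# Resulting values:
# fpr = array([0. , 0.2473072, 1. ])
# tpr = array([0. , 0.70320656, 1. ])
# thresholds = array([2, 1, 0])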
# image drawing
plt.figure()
#plt.title('Receiver Operating Characteristic %d iter' %iter)
plt.plot(fpr, tpr, label = 'MLP AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Can you please review this and suggest what I can do? I am not surprised that I get a triangle, because fpr and tpr contain just 3 values each, but I do not understand why. I expected more values, which would produce a curve.
I saw that someone else had the same problem, but the solution did not seem to work for me, as I still expect fpr and tpr to return more than 3 values.
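The three points follow from passing hard class labels to `roc_curve`: with only two distinct score values there is only one meaningful threshold between the two end points. Passing continuous scores such as `predict_proba(X_test)[:, 1]` gives one candidate threshold per distinct score. A minimal sketch on synthetic data (the dataset is an assumption, not the asker's):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic noisy binary problem standing in for the real data.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

model = MLPClassifier(max_iter=500, random_state=0)
model.fit(X_train, Y_train)

# Hard labels (0/1): at most three ROC points, which plot as a triangle.
fpr_hard, tpr_hard, _ = roc_curve(Y_test, model.predict(X_test))

# Class-1 probabilities: many distinct scores, so many ROC points.
scores = model.predict_proba(X_test)[:, 1]
fpr_soft, tpr_soft, _ = roc_curve(Y_test, scores)

print(len(fpr_hard), len(fpr_soft))
```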

How to avoid overfitting with imbalanced data?

I am working on a classifier for binary classification. The data is imbalanced: class 0 makes up 83.41% and class 1 makes up 16.59%. I am using the Matthews correlation coefficient (MCC) to evaluate the performance of the classifier. Also note that the dataset is quite small, with shape (211, 800).
I am using logistic regression to address the problem. I used GridSearchCV for hyperparameter optimisation and came up with the following best hyperparameter values:
Best Params: {'C': 1000, 'class_weight': {1: 0.83, 0: 0.17000000000000004}, 'penalty': 'l1', 'solver': 'liblinear'}
Best MCC 0.7045053547679334
I plotted the validation curve over a range of C values to check whether the model is overfitting or underfitting.
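from sklearn.model_selection import validation_curve
from sklearn.metrics import make_scorer, matthews_corrcoef
# C is the array of candidate C values (defined elsewhere)
train_scores, test_scores = validation_curve(
    LogisticRegression(penalty='l1', solver='liblinear',
                       class_weight={1: 0.83, 0: 0.17000000000000004}),
    X, y, param_name='C', param_range=C, cv=5,
    scoring=make_scorer(matthews_corrcoef))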
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with Logistic Regression")
plt.xlabel("C")
plt.ylabel("MCC")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(C, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(C, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.semilogx(C, test_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(C, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
plt.show()
As I understand it, this curve shows that the model tends to overfit, since it scores high on the training set and low on the validation set. Could anyone point me in the right direction for addressing this on such a small dataset?
You could do a number of things:
Use SMOTE to oversample the minority class.
Reduce the number of parameter combinations tried by GridSearchCV, or use RandomizedSearchCV.
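A rough sketch of the second suggestion, using `RandomizedSearchCV` with stratified folds and an MCC scorer. Two things here are assumptions and not part of the answer above: `class_weight="balanced"` is used as a simple stand-in for resampling (SMOTE lives in the separate imbalanced-learn package), and the search range is restricted to smaller `C` values, which means stronger L1 regularization. The data is synthetic, with roughly the class ratio from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small, imbalanced synthetic dataset (an assumption standing in for the 211-row data).
X, y = make_classification(n_samples=211, n_features=50, n_informative=5,
                           weights=[0.83, 0.17], random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", class_weight="balanced"),
    param_distributions={"C": np.logspace(-3, 1, 20)},   # favours stronger regularization
    n_iter=8,                                            # only 8 sampled settings
    scoring=make_scorer(matthews_corrcoef),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

With so few minority samples, the cross-validated MCC will itself be noisy, so small differences between settings should not be over-interpreted.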

Very few distinct prediction probabilities for CV instances with sparse SVM

I’m having an issue using the prediction probabilities for a sparse SVM: many of the predictions come out the same for my test instances. These probabilities are produced during cross-validation, and when I plot an ROC curve for the folds, the results look very strange, as there are only a handful of clustered points on the graph. Here is my cross-validation code, which I based on the samples on the scikit-learn website:
import numpy as np
from sklearn import svm
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

# X_scaled, y, numfolds, C_val, gamma_val, mean_fpr and mean_tpr are defined earlier (not shown)
skf = StratifiedKFold(n_splits=numfolds)
for fold, (train_index, test_index) in enumerate(skf.split(X_scaled, y)):
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train on the subset for this fold
    print('Training on fold ' + str(fold))
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
    # compute the ROC curve and the area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
I’m just trying to figure out if there’s something I’m obviously missing here, since I used this same training set and the same SVM parameters with libsvm and got much better results. When I used libsvm, printed out the distances from the hyperplane for the CV test instances and then plotted the ROC, the curve looked much more like I expected, with a much better AUC.
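The libsvm experiment described above can be mirrored in scikit-learn by scoring with `decision_function`, which returns the signed distance-like value for each test instance, so `probability=True` is not needed. The sketch below uses synthetic data, and its `C` and `gamma` values are placeholders, not the asker's `C_val` and `gamma_val`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_fpr = np.linspace(0, 1, 100)
mean_tpr = np.zeros_like(mean_fpr)
aucs = []

for train_index, test_index in skf.split(X, y):
    clf = SVC(C=1.0, kernel="rbf", gamma="scale")     # placeholder hyperparameters
    clf.fit(X[train_index], y[train_index])
    dist = clf.decision_function(X[test_index])       # continuous score per instance
    fpr, tpr, _ = roc_curve(y[test_index], dist)
    mean_tpr += np.interp(mean_fpr, fpr, tpr)         # accumulate for the mean ROC
    aucs.append(auc(fpr, tpr))

mean_tpr /= skf.get_n_splits()
mean_tpr[0] = 0.0
print("per-fold AUC:", np.round(aucs, 3))
```

If the curve from `decision_function` looks sensible while the one from `predict_proba` does not, the difference most likely comes from the probability calibration step and not from the SVM fit itself.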

Resources