How to binarize RandomForest to plot a ROC in python? - python-3.x

I have 21 classes and I am using RandomForest. I want to plot a ROC curve, so I checked the ROC example in the scikit-learn documentation.
That example uses an SVM, which has parameters such as probability and decision_function_shape that RandomForest does not have.
So how can I binarize RandomForest and plot a ROC curve?
Thank you
EDIT
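To create the fake data, with 20 features and 21 classes (3 samples for each class):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.DataFrame(np.random.rand(63, 20))
label = np.arange(len(df)) // 3 + 1
df['label'] = label
df
# To train the model: a stratified shuffled split
# (x and y are the feature matrix and the label vector; result_list is defined elsewhere)
clf = make_pipeline(RandomForestClassifier())
xSSSmean10 = []
for i in range(10):
    sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=i)
    scoresSSS = cross_val_score(clf, x, y, cv=sss)
    xSSSmean10.append(scoresSSS.mean())
result_list.append(xSSSmean10)
print("")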

With a multilabel random forest, each of your 21 labels becomes a binary classification problem, so you can create one ROC curve per label.
Your y_train should be a matrix of 0s and 1s with one column per label.
Assume you have fit a multilabel random forest from sklearn called rf, and have X_test and y_test from a train/test split. For multilabel targets, predict_proba returns a list with one array per label, so you can compute the ROC curve for your first label like this:
from sklearn import metrics
probs = rf.predict_proba(X_test)
# probs[0] is the array for the first label; column 1 holds P(label == 1)
fpr, tpr, threshs = metrics.roc_curve(y_test['name_of_your_first_tag'], probs[0][:, 1])
Hope this helps. If you provide your code and data, I could write this more specifically.
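For a plain multiclass target (a single label column, as in the question), a one-vs-rest variant of the same idea can be sketched as follows. This is a minimal sketch on synthetic data, not the answerer's code: the dataset from make_classification, the choice of 4 classes, and the names y_test_bin, curves and roc_auc are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for the real data (assumption: 4 classes instead of 21)
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# For a multiclass target, predict_proba returns one array of shape
# (n_samples, n_classes); columns follow the order of rf.classes_
probs = rf.predict_proba(X_test)

# Binarize only the true labels: one 0/1 column per class, same column order
y_test_bin = label_binarize(y_test, classes=rf.classes_)

curves, roc_auc = {}, {}
for i, cls in enumerate(rf.classes_):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], probs[:, i])
    curves[cls] = (fpr, tpr)
    roc_auc[cls] = auc(fpr, tpr)
```

Each (fpr, tpr) pair in curves can then be drawn with plt.plot(fpr, tpr), one line per class. Note that the forest itself is not binarized here; only y_test is.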

Related

Generate Random Forest feature importance plots from 3D arrays

After carrying out a librosa MFCC feature extraction on 1000 audio files, I end up with an X_test array of size 1000 x 40 x 174 (40 features, as I set n_mfcc=40). To pass this through the random forest classifier, I scaled and then flattened the array, so my new X_test has a size of 1000 x 6960. How do I correctly generate the feature importance histogram?
This is the code that I used for the feature importance plot, but I am not sure it is the correct approach:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# x_train has a shape of 1000 x 40 x 174; scaler and y_train are defined elsewhere (not shown)
X_train_scaled = scaler.fit_transform(x_train.reshape(-1, x_train.shape[-1])).reshape(x_train.shape)
X_train = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X = pd.DataFrame(X_train)  # X_train here is already flattened to 1000 x 6960
feature_names = [f"feature {i}" for i in range(X.shape[1])]
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
# set the size when creating the figure; a separate plt.figure() call would open a second, empty figure
fig, ax = plt.subplots(figsize=(13, 10))
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
With this code, I get this plot:
Can you tell me if this is the correct approach? If this approach is correct, how can I generate a more "readable" plot for the Feature Importance? Thanks!
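One way to get a more readable plot, sketched here under assumptions, is to fold the 6960 flat importances back into the original 40 x 174 layout and sum along one axis, so the bar chart has 40 bars (one per MFCC coefficient) instead of 6960. The example uses small random data as a stand-in for the real features; n_samples, per_coef, per_frame and top10_coords are illustrative names, and the reshape relies on the features having been flattened in row-major order (as flatten() does by default).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_mfcc, n_frames = 60, 40, 174  # small synthetic stand-in
x_train = rng.normal(size=(n_samples, n_mfcc, n_frames))
y_train = rng.integers(0, 2, size=n_samples)

# Flatten each 40 x 174 example to 6960 features (row-major order)
X_flat = x_train.reshape(n_samples, -1)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_flat, y_train)

# Fold the flat importances back into (coefficient, frame) layout
imp_2d = clf.feature_importances_.reshape(n_mfcc, n_frames)
per_coef = imp_2d.sum(axis=1)   # total importance of each MFCC coefficient
per_frame = imp_2d.sum(axis=0)  # total importance of each time frame

# Alternatively, keep only the 10 most important individual features
top10 = np.argsort(clf.feature_importances_)[::-1][:10]
top10_coords = [divmod(int(i), n_frames) for i in top10]  # (coefficient, frame)
```

per_coef can be passed to ax.bar(range(n_mfcc), per_coef), and imp_2d itself can be shown as a heatmap with ax.imshow(imp_2d, aspect="auto").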

Can a Siamese Network model draw ROC Curve?

This is based on the Keras example at
https://keras.io/examples/vision/siamese_contrastive/
Here is the code I use to get the ROC curve:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import seaborn as sns
sns.set_style("whitegrid")
pred = siamese.predict([x_test_1, x_test_2])
pred = pred[:,0]
pred_NN_01 = np.where(pred > 0.5, 1, 0) #Turn probability to 0-1 binary output
#Print accuracy
acc_NN = accuracy_score(labels_test, pred_NN_01)
print('Overall accuracy of Neural Network model:', acc_NN)
#Print Area Under Curve
false_positive_rate, recall, thresholds = roc_curve(labels_test, pred)
roc_auc = auc(false_positive_rate, recall)
plt.figure()
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out (1-Specificity)')
plt.show()
#Print Confusion Matrix
cm = confusion_matrix(labels_test, pred_NN_01)
labels = ['Unchange', 'Change']
plt.figure(figsize=(8,6))
sns.heatmap(cm,xticklabels=labels, yticklabels=labels, annot=True, fmt='d', cmap="Blues", vmin = 0.2);
plt.title('Confusion Matrix')
plt.ylabel('True Class')
plt.xlabel('Predicted Class')
plt.show()
Keras Example ROC Curve
Keras Example Confusion Matrix
Assuming the code and images above are right for the example, which has 10 classes (digit images 0-9): if I use this model on other images that have only two classes, does any part of this ROC curve code need to change?
I ask because I get strange output from the same code with 2 classes: the results look odd, and the confusion matrix doesn't match the ROC curve.
2 Classes ROC Curve
2 Classes Confusion Matrix
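One thing worth checking, offered as a possibility rather than a diagnosis: roc_curve treats higher scores as more likely to belong to the positive class (label 1). A network trained with a contrastive loss may output something closer to a distance, so depending on how the pairs are labelled, the scores can point the other way and the curve falls below the diagonal. The sketch below uses synthetic, hypothetical distance-like scores (labels and dist are made up, not outputs of the Keras model) to show the effect of negating the score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)

# Hypothetical distance-like scores: pairs labelled 1 get SMALLER values
dist = np.where(labels == 1,
                rng.normal(0.3, 0.15, size=500),
                rng.normal(0.7, 0.15, size=500))

# Used as-is, the score ranks positives last, so the AUC is below 0.5
auc_raw = roc_auc_score(labels, dist)

# Negating the score reverses the ranking and mirrors the AUC around 0.5
auc_flipped = roc_auc_score(labels, -dist)
```

If this applies, the same direction has to be used for the 0.5 cut-off that feeds the confusion matrix, otherwise the two plots describe opposite decisions.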

Plotting AUC score for multiple model for multiclass classification in Python

I am working on a multiclass classification problem with 46 unique classes in my dataset. I have computed the AUC score for each class and plotted it, but I want to plot the AUC scores of several models in one graph: LogisticRegression, XGBoost, and two more models used to solve the multiclass problem. My code so far:
import pandas as pd
from sklearn.metrics import auc, roc_curve
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

n_classes = 46
best_C = 1000
best_gamma = 0.0001
svc_model_grid_param = SVC(C=best_C, kernel="rbf", gamma=best_gamma)
model_OVR_svc = OneVsRestClassifier(svc_model_grid_param)
y_score = model_OVR_svc.fit(X_train, y_train).decision_function(X_valid)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
# calculate dummies once
y_test_dummies = pd.get_dummies(y_valid, drop_first=False).values
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
Plotting--
import matplotlib.pylab as plt
lists = sorted(roc_auc.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
plt.xlabel('Class')
plt.ylabel('AUC Score')
plt.plot(x, y)
plt.show()
Graph--
What I want to do--
Can anyone help me do this? Thanks in advance.
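One possible approach, sketched under assumptions, is to keep the models in a dictionary, compute the per-class AUC for each, and draw one line per model on the same axes. The example uses synthetic data with 5 classes and two scikit-learn models as stand-ins; the models dictionary, auc_per_model and plot_auc_per_model are illustrative names, and any classifier exposing predict_proba (XGBoost's scikit-learn wrapper included) could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for the real data (assumption: 5 classes instead of 46)
X, y = make_classification(n_samples=800, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

classes = np.unique(y_train)
y_valid_bin = label_binarize(y_valid, classes=classes)  # one 0/1 column per class

auc_per_model = {}
for name, model in models.items():
    scores = model.fit(X_train, y_train).predict_proba(X_valid)
    # average=None returns one one-vs-rest AUC per class (per column)
    auc_per_model[name] = roc_auc_score(y_valid_bin, scores, average=None)


def plot_auc_per_model(auc_per_model, classes):
    """Draw one AUC-per-class line for each model on shared axes."""
    import matplotlib.pyplot as plt  # imported here so the code above runs without it
    fig, ax = plt.subplots()
    for name, aucs in auc_per_model.items():
        ax.plot(classes, aucs, marker="o", label=name)
    ax.set_xlabel("Class")
    ax.set_ylabel("AUC Score")
    ax.legend()
    return fig
```

Calling plot_auc_per_model(auc_per_model, classes) followed by plt.show() gives a single graph with a legend entry per model.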

Shouldn't a SVM binary classifier understand the threshold from the training set?

I'm very confused about SVM classifiers, and I'm sorry if this sounds stupid.
I'm using the Spark library for java http://spark.apache.org/docs/latest/mllib-linear-methods.html, the first example from the Linear Support Vector Machines paragraph. On this training set:
1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3
the predictions for the values 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative, negative. The model gives a negative prediction only for 0 or negative values. I read that the standard threshold is "positive" if the prediction is a positive double and "negative" if it's negative, and I've seen that there is a method to set the threshold manually. But isn't finding the threshold the exact reason I need a binary classifier? I mean, if I know the threshold in advance, I can already distinguish between positive and negative values, so why bother training a classifier?
UPDATE:
Using this python code from a different library:
X = [[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]]
y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np

# we convert our lists to numpy arrays
X = np.array(X)
y = np.array(y)
# we compute the general accuracy of the system - we need more "false questions" to continue the study
accuracy = []

# we do 5-fold stratified cross-validation, so that every sample is used for testing once
kf_total = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in kf_total.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print(X_train)
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("the classifier says: ", y_pred)
    print("reality is: ", y_test)
    print(accuracy_score(y_test, y_pred))
    print("")
    accuracy.append(accuracy_score(y_test, y_pred))
print(sum(accuracy) / len(accuracy))
the results are correct:
######
1 [0]
######
2 [0]
######
8 [1]
So I think it's possible for an SVM classifier to work out the threshold by itself; how can I do the same with the Spark library?
SOLVED: I solved the issue by changing the example to this:
SVMWithSGD std = new SVMWithSGD();
std.setIntercept(true);
final SVMModel model = std.run(training.rdd());
From this:
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
The standard value for "intercept" is false, which is what I needed to be true.
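The effect of the intercept can be reproduced outside Spark. The sketch below uses scikit-learn's LinearSVC rather than Spark's SVMWithSGD (a different solver, swapped in purely for illustration) on the training set from the question; C=100 and dual=False are assumptions chosen so that this tiny problem is fit cleanly. Without an intercept the decision function is w * x, so its sign can only change at x = 0; with an intercept the boundary can move between the two groups.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Training set from the question: one feature, labels 0/1
X = np.array([[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]])
y = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0])
X_new = np.array([[8], [2], [1]])

# No intercept: decision function is w * x, so the boundary is fixed at x = 0
no_intercept = LinearSVC(fit_intercept=False, C=100, dual=False).fit(X, y)
pred_no_intercept = no_intercept.predict(X_new)

# With intercept: decision function is w * x + b, so the boundary can be learned
with_intercept = LinearSVC(fit_intercept=True, C=100, dual=False).fit(X, y)
pred_with_intercept = with_intercept.predict(X_new)

print("without intercept:", pred_no_intercept)
print("with intercept:   ", pred_with_intercept)
```

In this sketch the model without an intercept labels 8, 2 and 1 all as positive, matching the behaviour described in the question, while the model with an intercept labels them positive, negative, negative.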
If you search for probability calibration, you will find some research on a related matter (recalibrating the outputs to return better scores).
If your problem is a binary classification problem, you can calculate the slope of the cost line by assigning values (costs) to the true/false positive/negative outcomes and multiplying by the class ratio. You can then draw a line with that slope that touches the ROC curve at only one point; that point is, in some sense, an optimal threshold for your problem.
The threshold is the single value that separates the classes.
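The cost-based threshold choice described above can be sketched numerically. Finding the point where a line of slope (cost_fp * n_neg) / (cost_fn * n_pos) touches the ROC curve is equivalent to picking the ROC point with the lowest expected cost, which is what the code does. The data are synthetic, and cost_fp and cost_fn are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Synthetic scores: positives are shifted upwards by 1.5 on average
scores = rng.normal(loc=y_true * 1.5, scale=1.0)

fpr, tpr, thresholds = roc_curve(y_true, scores)

cost_fp, cost_fn = 1.0, 5.0  # hypothetical misclassification costs
n_pos = int(y_true.sum())
n_neg = len(y_true) - n_pos

# Expected total cost at each candidate threshold on the ROC curve
expected_cost = cost_fp * fpr * n_neg + cost_fn * (1.0 - tpr) * n_pos

best = int(np.argmin(expected_cost))
best_threshold = thresholds[best]
slope = (cost_fp * n_neg) / (cost_fn * n_pos)  # slope of the iso-cost lines
```

Scores at or above best_threshold are then classified as positive; changing the cost ratio moves the chosen point along the curve.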

Very few distinct prediction probabilities for CV instances with sparse SVM

I’m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, and when I plot an ROC curve for the folds, the results look very strange, as there are a handful of clustered points on the graph. Here is my cross validation code, I based it off of the samples on the scikit website:
skf = StratifiedKFold(n_splits=numfolds)
for fold, (train_index, test_index) in enumerate(skf.split(X_scaled, y)):
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train on the subset for this fold
    print('Training on fold ' + str(fold))
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
    # compute the ROC curve and the area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    # np.interp replaces the removed scipy.interp; mean_tpr and mean_fpr are initialized before the loop
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
I’m just trying to figure out if there’s something I’m obviously missing here, since I used this same training set and SVM parameters with libsvm and got much better results. When I used libsvm and printed out the distances from the hyperplane for the CV test instances and then plotted the ROC, it came out much more like I expected, and a much better AUC.
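Since the libsvm run used distances from the hyperplane, a like-for-like comparison in scikit-learn would feed the output of decision_function to roc_curve instead of predict_proba. With probability=True, SVC derives its probabilities from an additional Platt-scaling step fit by internal cross-validation, so they need not track the raw distances exactly. The following is a minimal sketch on synthetic data; the dataset, C, gamma and the number of folds are placeholder assumptions, not the settings from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic binary problem standing in for the real (scaled) data
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

skf = StratifiedKFold(n_splits=5)
mean_fpr = np.linspace(0, 1, 100)
tprs, aucs = [], []

for train_index, test_index in skf.split(X, y):
    clf = SVC(C=1.0, kernel="rbf", gamma="scale")  # no probability=True needed
    clf.fit(X[train_index], y[train_index])

    # Signed distance-like score; larger means more likely class 1
    scores = clf.decision_function(X[test_index])

    fpr, tpr, _ = roc_curve(y[test_index], scores)
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(auc(fpr, tpr))

# Average the interpolated curves over the folds
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
```

Comparing this curve with the one built from predict_proba on the same folds shows whether the clustering of points comes from the probability step or from the model itself.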
