Plotting AUC scores for multiple models for multiclass classification in Python

I am working on a multiclass classification problem with 46 unique classes in my dataset. I have computed the AUC score for every class and plotted it, but I want to plot the AUC scores of several different models in one graph: LogisticRegression, XGBoost, and two more models used to solve the multiclass problem. My code so far:
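import pandas as pd
from sklearn.metrics import auc, roc_curve
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
n_classes = 46
best_C = 1000
best_gamma = 0.0001
svc_model_grid_param = SVC(C=best_C, kernel="rbf", gamma=best_gamma)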
model_OVR_svc = OneVsRestClassifier(svc_model_grid_param)
y_score = model_OVR_svc.fit(X_train, y_train).decision_function(X_valid)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
# calculate dummies once
y_test_dummies = pd.get_dummies(y_valid, drop_first=False).values
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
Plotting:
import matplotlib.pyplot as plt
lists = sorted(roc_auc.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
plt.xlabel('Class')
plt.ylabel('AUC Score')
plt.plot(x, y)
plt.show()
The resulting graph and a sketch of the graph I want were attached as images.
Can anyone help me do this? Thanks in advance.
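One possible way to get several models onto one graph is to compute the per-class AUC for each model and draw one line per model on shared axes. The sketch below is only an illustration under assumptions: it uses synthetic data with 5 classes instead of the 46 in the question, the helper names per_class_auc and plot_auc_by_class are hypothetical, and RandomForestClassifier stands in for the other models (XGBoost is not used here).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize


def per_class_auc(model, X_valid, y_valid):
    """One-vs-rest AUC for each class, ordered like model.classes_ (hypothetical helper)."""
    # Use decision_function when available (as with the SVC in the question),
    # otherwise fall back to class probabilities.
    if hasattr(model, "decision_function"):
        y_score = model.decision_function(X_valid)
    else:
        y_score = model.predict_proba(X_valid)
    y_onehot = label_binarize(y_valid, classes=model.classes_)
    return np.array([
        roc_auc_score(y_onehot[:, i], y_score[:, i])
        for i in range(len(model.classes_))
    ])


def plot_auc_by_class(auc_by_model, classes):
    """Draw one line per model, with the class on the x axis (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    positions = np.arange(len(classes))
    for name, aucs in auc_by_model.items():
        ax.plot(positions, aucs, marker="o", label=name)
    ax.set_xticks(positions)
    ax.set_xticklabels(classes)
    ax.set_xlabel("Class")
    ax.set_ylabel("AUC Score")
    ax.legend()
    return ax


# Synthetic stand-in data: 5 classes instead of the 46 in the question.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
}
auc_by_model = {
    name: per_class_auc(model.fit(X_train, y_train), X_valid, y_valid)
    for name, model in models.items()
}
ax = plot_auc_by_class(auc_by_model, classes=models["LogisticRegression"].classes_)
```

Calling plt.show() afterwards displays the figure. Any other classifier that exposes decision_function or predict_proba should be addable to the models dictionary in the same way.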

Related

Generate Random Forest feature importance plots from 3D arrays

After carrying out a librosa MFCC feature extraction on 1000 audio files, I end up with an X_test array of size 1000 x 40 x 174 (40 features, as I set n_mfcc=40). To pass this through the random forest classifier, I scaled and then flattened the array, so my new X_test has a size of 1000 x 6960. How do I correctly generate the feature importance histogram?
This is the code that I used for the feature importance plot, but I am not sure whether it is the correct approach:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# x_train has a shape of 1000 x 40 x 174; scaler is defined elsewhere (not shown)
X_train_scaled = scaler.fit_transform(x_train.reshape(-1, x_train.shape[-1])).reshape(x_train.shape)
X_train = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X = pd.DataFrame(X_train) # X_train here is already flattened to 1000 x 6960
feature_names = [f"feature {i}" for i in range(X.shape[1])]
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots(figsize=(13, 10))  # set the size on the figure that is actually drawn on
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
With this code, I get this plot:
Can you tell me if this is the correct approach? If this approach is correct, how can I generate a more "readable" plot for the Feature Importance? Thanks!
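One option for a more readable plot, offered here only as a sketch, is to fold the 6960 flat importances back into the original 40 x 174 layout and sum them per MFCC coefficient (or per time frame), which gives 40 (or 174) bars instead of 6960. The random importances array below is a stand-in for clf.feature_importances_, and the variable names are assumptions for illustration.

```python
import numpy as np

n_mfcc, n_frames = 40, 174

# Stand-in for clf.feature_importances_: 6960 non-negative values summing to 1.
rng = np.random.default_rng(0)
importances = rng.random(n_mfcc * n_frames)
importances /= importances.sum()

# flatten() is row-major, so flat index k corresponds to
# (coefficient, frame) = (k // n_frames, k % n_frames).
importance_grid = importances.reshape(n_mfcc, n_frames)

# Total importance per MFCC coefficient (40 values) and per time frame (174 values).
per_coefficient = importance_grid.sum(axis=1)
per_frame = importance_grid.sum(axis=0)

# Indices of the ten coefficients with the largest summed importance.
top_coefficients = np.argsort(per_coefficient)[::-1][:10]
```

per_coefficient can then be drawn as a 40-bar chart, and importance_grid can be shown as a heatmap to see which coefficient and frame combinations matter.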

Can a ROC curve be drawn for a Siamese network model?

Based on the Keras example at
https://keras.io/examples/vision/siamese_contrastive/
here is how I compute the ROC curve:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
sns.set_style("whitegrid")
pred = siamese.predict([x_test_1, x_test_2])
pred = pred[:,0]
pred_NN_01 = np.where(pred > 0.5, 1, 0) #Turn probability to 0-1 binary output
#Print accuracy
acc_NN = accuracy_score(labels_test, pred_NN_01)
print('Overall accuracy of Neural Network model:', acc_NN)
#Print Area Under Curve
false_positive_rate, recall, thresholds = roc_curve(labels_test, pred)
roc_auc = auc(false_positive_rate, recall)
plt.figure()
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out (1-Specificity)')
plt.show()
#Print Confusion Matrix
cm = confusion_matrix(labels_test, pred_NN_01)
labels = ['Unchange', 'Change']
plt.figure(figsize=(8,6))
sns.heatmap(cm,xticklabels=labels, yticklabels=labels, annot=True, fmt='d', cmap="Blues", vmin = 0.2);
plt.title('Confusion Matrix')
plt.ylabel('True Class')
plt.xlabel('Predicted Class')
plt.show()
Keras Example ROC Curve
Keras Example Confusion Matrix
Assuming the code and images above are right for the example, which has 10 classes (digit images 0~9), does any part of this ROC curve code need to change if I use the same model on other images that have only two classes?
I ask because I get strange output when I run the same code with 2 classes: the confusion matrix doesn't match the ROC curve.
2 Classes ROC Curve
2 Classes Confusion Matrix
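One way to check whether a ROC curve and a confusion matrix agree is to note that the confusion matrix at threshold 0.5 is a single point on the ROC curve, so its true and false positive rates should appear there. The sketch below uses synthetic labels and scores (not the Siamese model's output). It also shows that reversing the direction of the scores turns an AUC of a into 1 - a, because roc_curve expects higher scores for the positive class; whether that applies to the results above cannot be told from the question.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Synthetic binary labels and scores in (0, 1); higher scores lean towards label 1.
rng = np.random.default_rng(0)
labels_test = rng.integers(0, 2, size=500)
logits = 1.5 * (2 * labels_test - 1) + rng.normal(size=500)
pred = 1.0 / (1.0 + np.exp(-logits))

# Rates taken from the confusion matrix at threshold 0.5.
pred_01 = np.where(pred > 0.5, 1, 0)
tn, fp, fn, tp = confusion_matrix(labels_test, pred_01, labels=[0, 1]).ravel()
cm_tpr = tp / (tp + fn)
cm_fpr = fp / (fp + tn)

# Keep every threshold so the point for "score > 0.5" is present on the curve.
fpr, tpr, thresholds = roc_curve(labels_test, pred, drop_intermediate=False)
idx = np.sum(thresholds > 0.5) - 1  # last (smallest) threshold still above 0.5
roc_auc = auc(fpr, tpr)

# Same labels with the score direction reversed.
fpr_flip, tpr_flip, _ = roc_curve(labels_test, -pred)
auc_flipped = auc(fpr_flip, tpr_flip)

print("confusion-matrix point:", (cm_fpr, cm_tpr))
print("ROC point at 0.5:      ", (fpr[idx], tpr[idx]))
print("AUC, AUC with reversed scores:", roc_auc, auc_flipped)
```

If the two points differ on real data, the labels or scores passed to the two functions are probably not the same arrays; if the curve sits below the diagonal, the score direction is worth checking.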

Recommendations for fitting nonlinear vector function in tensorflow

I am new to TensorFlow, so apologies if the following code is not very elegant. I would like recommendations on fitting/training a neural network. A minimal working example is given below: the inputs (features, if I understand the terminology correctly) are m-dimensional vectors x with real-valued entries, and the outputs (labels, I think) are n-dimensional vectors f(x).
For the example below (in Python 3.6.5), let's use 2D vectors x (m=2) and 2D vectors f (n=2), where I approximate a square root function (first component of f) and a sine function (second component).
What would you recommend I do to improve the model? Are there any rules of thumb?
Later I need to deal with high-dimensional input x and high-dimensional output f(x), but first I want to know how to do better in small examples like the following.
I have been playing around with different numbers of neurons, numbers of layers, activation functions, optimizers, step sizes and numbers of epochs. Any recommendations? I already standardized / normalized the inputs x and outputs f, i.e., I computed their means and standard deviations and rescaled x and f accordingly.
Example in Python 3.6.5:
I will need these modules:
#%% Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import tensorflow as tf
The following lines generate some input data x (x1 from 0 to 8 and x2 from 0 to 15, each direction is discretized in 30 points) and compute the exact values of the functions f(x), and display the functions f1(x) and f2(x)
#%% Generate data
lx = (8,15)
nx = 30
x = np.array(np.meshgrid(*[l*np.linspace(0,1,nx) for l in lx])).reshape(len(lx),-1).T
f = np.array([[10*np.sqrt(sum(xx)), 10*np.sin(np.sum(xx))+20] for xx in list(x)])
dx = np.shape(x)[1]
df = np.shape(f)[1]
fig = plt.figure(figsize=(7,2))
fig.add_subplot(1,2,1, projection='3d').scatter(x[:,0],x[:,1],f[:,0])
fig.add_subplot(1,2,2, projection='3d').scatter(x[:,0],x[:,1],f[:,1])
The following lines standardize / normalize the input and output data and plot the data again (to check that x and f have been rescaled)
#%% Standardize data
def sdata(x):
    for i in range(np.shape(x)[1]):
        x[:,i] = (x[:,i]-np.mean(x[:,i]))/np.std(x[:,i])
    return x
x = sdata(x)
f = sdata(f)
fig = plt.figure(figsize=(7,2))
fig.add_subplot(1,2,1, projection='3d').scatter(x[:,0],x[:,1],f[:,0])
fig.add_subplot(1,2,2, projection='3d').scatter(x[:,0],x[:,1],f[:,1])
The following lines set up the neural network. You can change or add entries in the tuple of neuron counts nn; its length automatically defines the number of layers. The activation function is af, the optimizer is opt and the number of epochs is nep.
#%% Neural network setup
# Number of neurons in layers, activation function, optimizer and number of epochs
nn = (16,16,16,df)
af = tf.tanh
opt = tf.train.AdamOptimizer(0.01)
nep = 3000
# Placeholders for input x, output f
xp = tf.placeholder(dtype=tf.float32, shape=[None,dx])
fp = tf.placeholder(dtype=tf.float32, shape=[None,df])
# Build sequential model, last call is output
model = xp
for i in range(len(nn)-1):
    model = tf.layers.dense(model,nn[i],activation=af)
model = tf.layers.dense(model,nn[-1],activation=None)
# Loss function (MSE), training step and initializer
loss = tf.reduce_mean(tf.square(model-fp))
trainstep = opt.minimize(loss)
init = tf.global_variables_initializer()
The following lines start the TensorFlow session, plot the initial prediction fpre(x) of f(x), compute the value of the loss function (mean squared error, MSE), run the training, plot the loss over the epochs, plot the final prediction of f(x), and plot the absolute difference between the output data f(x) and fpre(x)
#%% Session, train, see results
# Start session
s = tf.Session()
s.run(init)
# Plot predicted values before training at positions x
fpre = s.run(model,{xp:x})
fig = plt.figure(figsize=(7,2))
p1 = fig.add_subplot(1,2,1, projection='3d')
p1.scatter(x[:,0],x[:,1],f[:,0])
p1.scatter(x[:,0],x[:,1],fpre[:,0])
p2 = fig.add_subplot(1,2,2, projection='3d')
p2.scatter(x[:,0],x[:,1],f[:,1])
p2.scatter(x[:,0],x[:,1],fpre[:,1])
# Prepare container for loss data and print initial loss
lossd = [None]*(nep+1)
lossd[0] = s.run(loss,{xp:x,fp:f})
print(lossd[0])
# Train
for i in range(1,nep+1):
    s.run(trainstep,{xp:x,fp:f})
    lossd[i] = s.run(loss,{xp:x,fp:f})
# See loss development
print('Last loss:\n'+str(lossd[-1]))
plt.figure()
plt.plot(list(range(nep+1)),lossd)
# Plot predicted values after training at positions x
fpre = s.run(model,{xp:x})
print('Computed MSE:\n'+str(np.mean((fpre-f)**2)))
fig = plt.figure(figsize=(7,2))
p1 = fig.add_subplot(1,2,1, projection='3d')
p1.scatter(x[:,0],x[:,1],f[:,0])
p1.scatter(x[:,0],x[:,1],fpre[:,0])
p2 = fig.add_subplot(1,2,2, projection='3d')
p2.scatter(x[:,0],x[:,1],f[:,1])
p2.scatter(x[:,0],x[:,1],fpre[:,1])
# Plot difference
fig = plt.figure(figsize=(7,2))
p1 = fig.add_subplot(1,2,1, projection='3d')
p1.scatter(x[:,0],x[:,1],np.abs(f[:,0]-fpre[:,0]))
p2 = fig.add_subplot(1,2,2, projection='3d')
p2.scatter(x[:,0],x[:,1],np.abs(f[:,1]-fpre[:,1]))
# Close sessions
s.close()
This was the initial prediction
This is the loss function over the epochs
These are the final predictions and the absolute difference
Any thoughts on how to improve this or make the code more efficient? Thank you!
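One small point about the standardization step in the question: sdata overwrites x and f in place and discards the means and standard deviations, so predictions can no longer be mapped back to the original units. A minimal NumPy sketch of an alternative follows. The function names standardize and unstandardize are hypothetical, and the data is generated like the question's but in vectorized form.

```python
import numpy as np


def standardize(a):
    """Return a standardized copy of `a` plus the column means and stds used."""
    a = np.asarray(a, dtype=float)
    mean = a.mean(axis=0)
    std = a.std(axis=0)
    return (a - mean) / std, mean, std


def unstandardize(a_scaled, mean, std):
    """Map standardized values back to the original units."""
    return a_scaled * std + mean


# Same grid and target functions as in the question.
lx, nx = (8, 15), 30
x = np.array(np.meshgrid(*[l * np.linspace(0, 1, nx) for l in lx])).reshape(len(lx), -1).T
s = x.sum(axis=1)
f = np.column_stack([10 * np.sqrt(s), 10 * np.sin(s) + 20])

x_scaled, x_mean, x_std = standardize(x)
f_scaled, f_mean, f_std = standardize(f)

# After training on (x_scaled, f_scaled), network outputs could be converted
# back with unstandardize(prediction, f_mean, f_std); here the round trip is
# shown on the targets themselves.
f_restored = unstandardize(f_scaled, f_mean, f_std)
```

Keeping x_mean and x_std also makes it possible to apply the same scaling to new inputs later instead of recomputing statistics on them.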

Why do I get high AUC but low accuracy on a balanced dataset with SVM?

I used LIBSVM to classify 256 classes. My dataset has about 5000-10000 samples. For SVM, I used the one-against-one strategy to train my models. I now get low accuracy (15%~30%) but high AUC (>90%). I assumed one cannot obtain a high AUC (0.9 and higher) if the accuracy of the corresponding predictive model is low (13-30 %). Is that right?
I followed the example from the open-source Python library scikit-learn to compute the AUC (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py)
This is the code I used to compute the AUC:
# compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
# test_label_kernel: the true label of each instance
# LensOfLabel: the number of classes
y = label_binarize( test_label_kernel, classes = list(range(0,LensOfLabel,1)) )
# sort_pval: the prediction probabilities of the SVM
for i in range(LensOfLabel):
    fpr[i], tpr[i], _ = metrics.roc_curve( y[:,i], sort_pval[:,i] )
    roc_auc[i] = metrics.auc( fpr[i], tpr[i] )
# First aggregate all false positive rates
n_classes = LensOfLabel
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = metrics.auc(fpr["macro"], tpr["macro"])
print( ("macroAUC: %.4f") %roc_auc["macro"] )
#compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = metrics.roc_curve( y.ravel(), sort_pval.ravel() )
roc_auc["micro"] = metrics.auc( fpr["micro"], tpr["micro"] )
print( ("microAUC: %.4f") %roc_auc["micro"] )
The ROC curve is:
https://i.stack.imgur.com/GEUqr.png
https://i.stack.imgur.com/ucbE6.png
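A small simulation may help show that high one-vs-rest AUC and low accuracy can coexist when there are many classes. In the sketch below every class score is standard normal noise and the true class gets a fixed boost of 2.0 (an arbitrary assumption). The binary_auc helper is a hypothetical rank-based substitute for metrics.roc_curve plus metrics.auc, used only to keep the example dependency-free. Under these assumptions the expected per-class AUC is Φ(√2) ≈ 0.92, while top-1 accuracy requires the true class to beat the maximum of 255 competitors.

```python
import numpy as np


def binary_auc(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney) identity; assumes no tied scores."""
    ranks = np.empty(y_score.size)
    ranks[np.argsort(y_score)] = np.arange(1, y_score.size + 1)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


rng = np.random.default_rng(0)
n_classes, n_per_class = 256, 8

# Balanced labels: every class appears exactly n_per_class times.
labels = np.tile(np.arange(n_classes), n_per_class)

# Noise scores for all classes, plus a modest boost for the true class.
scores = rng.normal(size=(labels.size, n_classes))
scores[np.arange(labels.size), labels] += 2.0

# Top-1 accuracy: the true class must have the largest of 256 scores.
accuracy = np.mean(scores.argmax(axis=1) == labels)

# Macro-averaged one-vs-rest AUC: each class is compared only with "the rest".
macro_auc = np.mean([
    binary_auc((labels == c).astype(int), scores[:, c])
    for c in range(n_classes)
])

print("accuracy:  %.3f" % accuracy)
print("macro AUC: %.3f" % macro_auc)
```

The two metrics answer different questions: AUC asks whether a positive instance is ranked above a random negative for one class, whereas accuracy asks whether the correct class wins against all 255 others at once.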

How to binarize RandomForest to plot a ROC in python?

I have 21 classes and am using RandomForest. I want to plot a ROC curve, so I checked the scikit-learn ROC example.
That example uses SVM, which has parameters such as probability and decision_function_shape that RF does not have.
So how can I binarize RandomForest and plot a ROC?
Thank you
EDIT
Code to create fake data, with 20 features and 21 classes (3 samples for each class):
df = pd.DataFrame(np.random.rand(63, 20))
label = np.arange(len(df)) // 3 + 1
df['label']=label
df
#TO TRAIN THE MODEL: IT IS A STRATIFIED SHUFFLED SPLIT
clf = make_pipeline(RandomForestClassifier())
xSSSmean10 = []
for i in range(10):
    sss = StratifiedShuffleSplit(y, 10, test_size=0.1, random_state=i)
    scoresSSS = cross_validation.cross_val_score(clf, x, y , cv=sss)
    xSSSmean10.append(scoresSSS.mean())
result_list.append(xSSSmean10)
print("")
For multilabel random forest, each of your 21 labels has a binary classification, and you can create a ROC curve for each of the 21 classes.
Your y_train should be a matrix of 0 and 1 for each label.
Assume you fit a multilabel random forest from sklearn and called it rf, and that you have an X_test and y_test after a train/test split. You can plot the ROC curve in Python for your first label using this:
from sklearn import metrics
probs = rf.predict_proba(X_test)
fpr, tpr, threshs = metrics.roc_curve(y_test['name_of_your_first_tag'],probs[0][:,1])
Hope this helps. If you provide your code and data I could write this more specifically.
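If the 21 classes are mutually exclusive (ordinary multiclass rather than multilabel), a possible alternative is to binarize only the test labels and use the columns of predict_proba as one-vs-rest scores. This is a sketch on synthetic data, not the asker's: the Gaussian clusters and all variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in data: 21 classes labelled 1..21, 20 features.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 21, 20, 20
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
y = np.repeat(np.arange(1, n_classes + 1), n_per_class)
X = centers[y - 1] + rng.normal(size=(y.size, n_features))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# For single-label multiclass, predict_proba returns one array of shape
# (n_samples, n_classes) whose columns follow rf.classes_.
probs = rf.predict_proba(X_test)

# One indicator column per class, in the same column order.
y_test_bin = label_binarize(y_test, classes=rf.classes_)

# One ROC curve and AUC per class (that class versus the rest).
fpr, tpr, roc_auc = {}, {}, {}
for i, cls in enumerate(rf.classes_):
    fpr[cls], tpr[cls], _ = roc_curve(y_test_bin[:, i], probs[:, i])
    roc_auc[cls] = auc(fpr[cls], tpr[cls])
```

Each fpr[cls] and tpr[cls] pair can then be passed to a plotting call to draw that class's curve. Every class needs at least one sample in the test set, which the stratified split takes care of here.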
