FastText precision and recall trade-off - nlp

In FastText I want to change the balance between precision and recall. Can it be done?

If you're referring to the Python fastText implementation, then I'm afraid there is no simple built-in method for this. What you can do is look at the returned probabilities and pass them to the AUC or ROC curve plotting method of your choice. Here is a code example that does just this for a binary classifier:
import re

# label the data
labels, probabilities = fasttext_classifier.predict(
    [re.sub('\n', ' ', sentence) for sentence in test_sentences])
# convert fasttext multilabel results to a binary classifier (probability of TRUE)
labels = list(map(lambda x: x == ['__label__TRUE'], labels))
probabilities = [probability[0] if label else (1 - probability[0])
                 for label, probability in zip(labels, probabilities)]
And then you are free to build your metrics using the common sklearn methods:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot
# testy: the ground-truth binary labels for test_sentences (defined elsewhere)
roc_auc = roc_auc_score(testy, probabilities)
print('ROC AUC=%.3f' % roc_auc)
# calculate roc curve
fpr, tpr, _ = roc_curve(testy, probabilities)
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.', label='ROC curve')
# axis labels
pyplot.xlabel('False Positive Rate (1 - specificity)')
pyplot.ylabel('True Positive Rate (sensitivity)')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
precision_values, recall_values, _ = precision_recall_curve(testy, probabilities)
f1 = f1_score(testy, labels)
# summarize scores
print('f1=%.3f auc=%.3f' % (f1, roc_auc))
# plot the precision-recall curves
pyplot.plot(recall_values, precision_values, marker='.', label='Precision,Recall')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
The command-line fastText version has a threshold parameter, and you can perform multiple runs with different thresholds, but this is needlessly time consuming.
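If you want to act on the trade-off rather than just plot it, you can pick a probability cut-off from the precision-recall curve and apply it yourself. A minimal sketch, assuming the testy/probabilities variables built above and an arbitrary precision target of 0.9:
from sklearn.metrics import precision_recall_curve

precision_values, recall_values, thresholds = precision_recall_curve(testy, probabilities)

# lowest cut-off that still reaches the desired precision (0.9 is just an example)
target_precision = 0.9
candidates = [t for p, t in zip(precision_values[:-1], thresholds) if p >= target_precision]
chosen_threshold = min(candidates) if candidates else 0.5

# re-label the test sentences with the chosen cut-off instead of the default 0.5
custom_predictions = [p >= chosen_threshold for p in probabilities]
Raising the threshold trades recall for precision; lowering it does the opposite.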

Related

How to plot heatmap for AUC score across parameters?

While performing GridSearchCV on a decision tree model I gave the values for min_samples_split as [10,100,200] and max_depth as [10,20,40], and then I collected the AUC scores.
Now I want to plot a heatmap such that the x-axis indicates min_samples_split, the y-axis indicates max_depth, and the cell values show the AUC scores that the model calculated.
But I'm unable to combine these three pieces of data: min_samples_split, max_depth, and the AUC score.
Please refer to the code below:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
param = {'max_depth': [10, 20, 40],
         'min_samples_split': [10, 100, 200]}
decision_tree = DecisionTreeClassifier(class_weight='balanced')
clf = GridSearchCV(decision_tree, param, cv=3, scoring='roc_auc', return_train_score=True, n_jobs=-1)
clf.fit(X_train_setOne,Y_train)
and then I'm finding out the AUC scores:
train_AUC=clf.cv_results_['mean_train_score']
train_AUC_std=clf.cv_results_['std_train_score']
CV_AUC=clf.cv_results_['mean_test_score']
CV_AUC_std=clf.cv_results_['std_test_score']
Now I want to plot a heatmap.
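One way to combine them (a sketch of my own, not from the original post) is to reshape the mean test scores from cv_results_ into the parameter grid and hand that to seaborn. This assumes the clf and param objects above, and that GridSearchCV enumerates the alphabetically sorted grid with the last parameter varying fastest:
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

max_depth_values = param['max_depth']            # [10, 20, 40]
min_samples_values = param['min_samples_split']  # [10, 100, 200]

# rows = max_depth, columns = min_samples_split (min_samples_split varies fastest)
CV_AUC = np.array(clf.cv_results_['mean_test_score']).reshape(
    len(max_depth_values), len(min_samples_values))

sns.heatmap(CV_AUC, annot=True, fmt='.3f',
            xticklabels=min_samples_values, yticklabels=max_depth_values)
plt.xlabel('min_samples_split')
plt.ylabel('max_depth')
plt.title('Mean CV AUC')
plt.show()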

Can a Siamese Network model draw ROC Curve?

Based on the example from Keras
https://keras.io/examples/vision/siamese_contrastive/
here is how I coded the ROC curve:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import seaborn as sns
sns.set_style("whitegrid")
pred = siamese.predict([x_test_1, x_test_2])
pred = pred[:,0]
pred_NN_01 = np.where(pred > 0.5, 1, 0) #Turn probability to 0-1 binary output
#Print accuracy
acc_NN = accuracy_score(labels_test, pred_NN_01)
print('Overall accuracy of Neural Network model:', acc_NN)
#Print Area Under Curve
false_positive_rate, recall, thresholds = roc_curve(labels_test, pred)
roc_auc = auc(false_positive_rate, recall)
plt.figure()
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out (1-Specificity)')
plt.show()
#Print Confusion Matrix
cm = confusion_matrix(labels_test, pred_NN_01)
labels = ['Unchange', 'Change']
plt.figure(figsize=(8,6))
sns.heatmap(cm,xticklabels=labels, yticklabels=labels, annot=True, fmt='d', cmap="Blues", vmin = 0.2);
plt.title('Confusion Matrix')
plt.ylabel('True Class')
plt.xlabel('Predicted Class')
plt.show()
Keras Example ROC Curve
Keras Example Confusion Matrix
If the code and images are right: the example uses 10 classes (the 0-9 digit images), so if I use other images that only have two classes with this model, does any part of this ROC curve code need to change?
I ask because I get strange output when I run the same code with 2 classes; the results keep looking weird, and the confusion matrix doesn't match the ROC curve.
2 Classes ROC Curve
2 Classes Confusion Matrix
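No answer was recorded for this one, but one thing worth checking (an assumption on my part, not a confirmed diagnosis) is the score orientation: roc_curve treats higher scores as more positive, so if the Siamese output behaves like a distance (lower means the pair belongs to the positive class), the ROC curve will look inverted while the thresholded confusion matrix can still look plausible. A quick sanity check:
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# if the AUC on the raw scores is well below 0.5, the score is probably
# oriented the other way round (e.g. a distance where lower = positive)
print('AUC as-is  :', roc_auc_score(labels_test, pred))
print('AUC flipped:', roc_auc_score(labels_test, 1 - pred))

# if the flipped version is the sensible one, invert the score before roc_curve...
false_positive_rate, recall, thresholds = roc_curve(labels_test, 1 - pred)
# ...and flip the hard predictions the same way so the confusion matrix matches
pred_NN_01 = np.where(pred <= 0.5, 1, 0)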

Generating a Learning Curve for Logistic Regression

I have a dataset of 260 microscopic images. I want to generate a learning curve for the logistic regression algorithm, but I am getting the error "'module' object is not iterable". Perhaps I don't understand something basic, as I'm a beginner who is just learning Python.
from sklearn.cross_validation import train_test_split
from imutils import paths
from scipy import misc
import numpy as np
import argparse
import imutils
import cv2
import os
from matplotlib import pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(50, 80, 110)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
#training with logistic regression
clfLR = LogisticRegression(random_state=0, solver='lbfgs')
clfLR.fit(trainFeat,trainLabels)
acc = clfLR.score(testFeat, testLabels)
print("accuracy of Logistic regression ",acc)
I am facing this problem only when I want to plot the curve. The rest of the code works fine.
#plotting the curve
estimator = LogisticRegression()
train_sizes, train_scores, valid_scores = plot_learning_curve(
    estimator, 'logistic learning curve ', trainFeat, trainLabels,
    cv=5, n_jobs=4, train_sizes=[50, 80, 110])
print(train_sizes)
plt.show()
Screenshot of the error
learning curve
Try running the code in an online Jupyter IDE. It plots automatically if you add a "%matplotlib" line to the import section.
If you want to keep working in this IDE, share your error message so maybe I can help you. You are probably missing one of the imports, or it might be a Python 2/3 problem.
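Judging only from the snippet above, another plausible cause (an assumption, since the full traceback isn't shown): plot_learning_curve returns the pyplot module (return plt), so unpacking its result into train_sizes, train_scores, valid_scores would raise exactly "'module' object is not iterable". A sketch of a fix under that assumption:
estimator = LogisticRegression()
# call the helper only for its plotting side effect
plot_learning_curve(estimator, 'logistic learning curve', trainFeat, trainLabels,
                    cv=5, n_jobs=4, train_sizes=[50, 80, 110])
plt.show()

# if the raw arrays are also needed, call sklearn's learning_curve directly
train_sizes, train_scores, valid_scores = learning_curve(
    estimator, trainFeat, trainLabels, cv=5, n_jobs=4, train_sizes=[50, 80, 110])
print(train_sizes)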

How to binarize RandomForest to plot a ROC in python?

I have 21 classes. I am using RandomForest. I want to plot a ROC curve, so I checked the example in scikit ROC with SVM
The example uses SVM, which has parameters like probability and decision_function_shape that RandomForest does not.
So how can I binarize RandomForest and plot a ROC?
Thank you
EDIT
To create the fake data (there are 20 features and 21 classes, 3 samples for each class):
df = pd.DataFrame(np.random.rand(63, 20))
label = np.arange(len(df)) // 3 + 1
df['label']=label
df
#TO TRAIN THE MODEL: IT IS A STRATIFIED SHUFFLED SPLIT
clf = make_pipeline(RandomForestClassifier())
xSSSmean10 = []
for i in range(10):
    sss = StratifiedShuffleSplit(y, 10, test_size=0.1, random_state=i)
    scoresSSS = cross_validation.cross_val_score(clf, x, y, cv=sss)
    xSSSmean10.append(scoresSSS.mean())
result_list.append(xSSSmean10)
print("")
For multilabel random forest, each of your 21 labels has a binary classification, and you can create a ROC curve for each of the 21 classes.
Your y_train should be a matrix of 0 and 1 for each label.
Assume you have fitted a multilabel random forest from sklearn and called it rf, and that you have X_test and y_test from a train/test split. You can plot the ROC curve in Python for your first label like this:
from sklearn import metrics
probs = rf.predict_proba(X_test)
fpr, tpr, threshs = metrics.roc_curve(y_test['name_of_your_first_tag'],probs[0][:,1])
Hope this helps. If you provide your code and data I could write this more specifically.
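To extend that snippet to all 21 classes at once, a sketch under the same assumptions (a fitted classifier rf with predict_proba, plus X_test and y_test): binarize the class labels one-vs-rest and draw one ROC curve per class.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

classes = np.unique(y_test)
y_test_bin = label_binarize(y_test, classes=classes)  # shape (n_samples, 21)
probs = rf.predict_proba(X_test)                      # shape (n_samples, 21)

# one ROC curve per class, one-vs-rest
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], probs[:, i])
    plt.plot(fpr, tpr, label='class %s (AUC=%.2f)' % (cls, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(fontsize='small', ncol=2)
plt.show()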

Shouldn't a SVM binary classifier understand the threshold from the training set?

I'm very confused about SVM classifiers and I'm sorry if I'll sound stupid.
I'm using the Spark library for Java (http://spark.apache.org/docs/latest/mllib-linear-methods.html), specifically the first example from the Linear Support Vector Machines paragraph. On this training set:
1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3
the predictions for the values 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative, negative. It gives negative only for 0 or negative values. I read that the standard threshold is "positive" if the prediction is a positive double and "negative" if it is negative, and I've seen that there is a method to set the threshold manually. But isn't this exactly what I need a binary classifier for? I mean, if I know the threshold in advance I can distinguish between positive and negative values myself, so why bother training a classifier?
UPDATE:
Using this python code from a different library:
X = [[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]]
y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np

# we convert our list of lists into numpy arrays
X = np.array(X)
y = np.array(y)
# we compute the general accuracy of the system - we need more "false questions" to continue the study
accuracy = []

# we do 10 fold cross-validation - to be sure to test all possible combinations of training and test
kf_total = StratifiedKFold(y, n_folds=5, shuffle=True)
for train, test in kf_total:
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print X_train
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print "the classifier says: ", y_pred
    print "reality is: ", y_test
    print accuracy_score(y_test, y_pred)
    print ""
    accuracy.append(accuracy_score(y_test, y_pred))
print sum(accuracy)/len(accuracy)
the results are correct:
######
1 [0]
######
2 [0]
######
8 [1]
So I think it's possible for a SVM classifier to understand the threshold by itself; how can I do the same with the spark library?
SOLVED: I solved the issue by changing the example to this:
SVMWithSGD std = new SVMWithSGD();
std.setIntercept(true);
final SVMModel model = std.run(training.rdd());
From this:
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
The default value for "intercept" is false, and I needed it to be true.
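The same effect can be reproduced in scikit-learn (my own illustration, not part of the original post): without an intercept, a linear SVM on this one-dimensional data has to put its decision boundary through the origin, so every positive feature value comes out positive, exactly like the Spark behaviour above.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]])
y = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0])

# boundary forced through the origin: 8, 2 and 1 are all predicted positive
print(LinearSVC(fit_intercept=False).fit(X, y).predict([[8], [2], [1]]))
# with an intercept the boundary can sit between the two groups (expected: 1, 0, 0)
print(LinearSVC(fit_intercept=True).fit(X, y).predict([[8], [2], [1]]))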
If you search for probability calibration you will find some research on a related matter (recalibrating the outputs to return better scores).
If your problem is a binary classification problem, you can compute the slope of the iso-cost lines by assigning costs to the true/false positive/negative outcomes and multiplying by the class ratio. You can then slide a line with that slope against the ROC curve until it touches it at a single point; that tangency point is, in this cost sense, the optimal threshold for your problem.
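A sketch of that tangent-line idea (my own illustration with hypothetical cost values, not code from the thread): on an ROC plot, points of equal expected cost lie on lines of slope m = (cost_FP * N) / (cost_FN * P), where N and P are the numbers of negative and positive examples, so the optimal operating point is the threshold that maximises TPR - m * FPR.
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, scores, cost_fp=1.0, cost_fn=1.0):
    # pick the ROC point where an iso-cost line is tangent to the curve
    y_true = np.asarray(y_true)
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    n_neg, n_pos = np.sum(y_true == 0), np.sum(y_true == 1)
    slope = (cost_fp * n_neg) / (cost_fn * n_pos)
    best = np.argmax(tpr - slope * fpr)
    return thresholds[best]

# hypothetical usage: false positives cost five times as much as false negatives
# threshold = best_threshold(y_test, clf.decision_function(X_test), cost_fp=5, cost_fn=1)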
The threshold is the single value that separates the two classes.
