Here is a very small example using precision_recall_curve():
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
y_true = [0, 1]
y_predict_proba = [0.25, 0.75]
precision, recall, thresholds = precision_recall_curve(y_true, y_predict_proba)
precision, recall
which results in:
(array([1., 1.]), array([1., 0.]))
The above does not match the "manual" calculation which follows.
There are three possible class vectors depending on the threshold: [0,0] (when the threshold is > 0.75), [0,1] (when the threshold is between 0.25 and 0.75), and [1,1] (when the threshold is < 0.25). We have to discard [0,0] because it gives an undefined precision (divide by zero). So, applying precision_score() and recall_score() to the other two:
y_predict_class=[0,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
which gives:
(1.0, 1.0)
and
y_predict_class=[1,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
which gives
(0.5, 1.0)
This seems not to match the output of precision_recall_curve() (which for example did not produce a 0.5 precision value).
Am I missing something?
I know I am late, but I had the same doubt as you and eventually solved it.
The main point here is that precision_recall_curve() stops outputting precision and recall values once full recall is obtained for the first time; moreover, it appends a 1 to the precision array and a 0 to the recall array so that the curve starts at the y-axis.
In your specific example, you effectively have two arrays that look like this (they are ordered the other way around because of sklearn's specific implementation):
precision, recall
(array([1., 0.5]), array([1., 1.]))
Then, the entries of the two arrays that correspond to the second occurrence of full recall are omitted, and a 1 and a 0 (for precision and recall, respectively) are appended as described above:
precision, recall
(array([1., 1.]), array([1., 0.]))
I have tried to explain it here in full detail; another useful link is certainly this one.
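For completeness, here is a minimal sketch (assuming a recent sklearn, and predicting class 1 when the score is >= the threshold) that reproduces the returned arrays by hand:
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

y_true = [0, 1]
y_scores = [0.25, 0.75]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision, recall, thresholds)   # [1. 1.] [1. 0.] [0.75]

# Manual computation at each candidate threshold:
for t in [0.25, 0.75]:
    y_pred = [int(s >= t) for s in y_scores]
    print(t, precision_score(y_true, y_pred), recall_score(y_true, y_pred))
# threshold 0.25 -> precision 0.5, recall 1.0
# threshold 0.75 -> precision 1.0, recall 1.0  (full recall already reached here)
# precision_recall_curve() keeps only the thresholds from the highest one at which
# full recall is reached (0.75), then appends precision=1, recall=0,
# giving ([1., 1.], [1., 0.]).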
I've got a multiclass problem. I'm using sklearn.metrics to calculate the confusion matrix, overall accuracy, per class precision, per class recall and per class F1-score.
Now I wanted to calculate the per-class accuracy. Since there is no method in sklearn for this, I used another one which I got from a Google search. I've now realised that the per-class recall equals the per-class accuracy. Can anyone explain to me whether this holds true and, if yes, why?
I found an explanation here, but I'm not sure, since there it is the micro-recall that equals the overall accuracy, if I'm understanding it correctly. And I'm looking for the per-class accuracy.
I too experienced the same results, because per-class recall = TP / (TP + FN), and here TP + FN is just the total number of samples of that class. So the formula becomes similar to accuracy.
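As an illustration of that point, here is a minimal sketch (with made-up labels) showing that, in sklearn, per-class recall is the confusion-matrix diagonal divided by the row sums, i.e. TP over all samples of each class:
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)               # rows = true class, columns = predicted class
print(cm.diagonal() / cm.sum(axis=1))               # [0.667 0.5   0.75 ]
print(recall_score(y_true, y_pred, average=None))   # same values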
This generally doesn't hold. Accuracy and recall are calculated with different formulas and are different measures that explain different things.
Recall is the percentage of actual positive data points that are predicted as positive by your classifier (TP / (TP + FN)).
Accuracy is the percentage of all examples that are classified correctly, including positive and negative.
If they are equal, this is either coincidence or you have an error in your method of calculating them. Most likely this will be coincidence.
EDIT:
I will show why it's not the case with an example that can be generalised to N classes.
Let's assume three classes: 0, 1, 2 with the following confusion matrix:
[[3 0 1]
[2 5 0]
[0 1 4]]
When we want to calculate measures per class, we do this in a binary, one-vs-rest fashion. For example, for class 0 we combine classes 1 and 2 into 'not 0'. This results in the following confusion matrix:
[[ 3  1]
 [ 2 10]]
Resulting in:
TP = 3
FP = 2
FN = 1
TN = 10
Accuracy = (TP + TN) / (P + N), where P + N is the total number of samples
Recall = TP / (TP + FN)
So you can already tell from these formulas that they are not equal in general. To disprove a hypothesis in mathematics it suffices to show a counterexample, in this case an example that shows that accuracy is not equal to recall.
Filling in the numbers from this example we get:
Accuracy = (3 + 10) / 16 = 13/16
Recall = 3 / (3 + 1) = 3/4
And 13/16 is not equal to 3/4, thus disproving the hypothesis that per-class accuracy is equal to per-class recall.
It is, however, also possible to construct examples for which the hypothesis holds. But because it does not hold in general, it is disproven.
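A minimal numpy sketch of the class-0 counterexample above (assuming the sklearn convention of rows = true class, columns = predicted class):
import numpy as np

cm = np.array([[3, 0, 1],
               [2, 5, 0],
               [0, 1, 4]])

tp = cm[0, 0]                 # true 0, predicted 0
fn = cm[0, 1:].sum()          # true 0, predicted 1 or 2
fp = cm[1:, 0].sum()          # true 1 or 2, predicted 0
tn = cm[1:, 1:].sum()         # neither true nor predicted 0

accuracy_class0 = (tp + tn) / cm.sum()    # 13/16 = 0.8125
recall_class0 = tp / (tp + fn)            # 3/4  = 0.75
print(accuracy_class0, recall_class0)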
Not sure if you are looking for average per-class accuracy as a single metric or per-class accuracy as separate metrics for each class.
For per-class accuracy as a separate metric for each class, see the code below. It's the same as recall-micro per class.
For average per-class accuracy as a single metric, it is equivalent to recall-macro (which is equivalent to balanced accuracy in sklearn). See the code below.
Here is the empirical demonstration in code.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
label_class1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels = label_class1 + label_class2
pred_class1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
pred = pred_class1 + pred_class2
# 1. calculate accuracy scores per class
score_accuracy_class1 = accuracy_score(label_class1, pred_class1)
score_accuracy_class2 = accuracy_score(label_class2, pred_class2)
print(score_accuracy_class1) # 0.6
print(score_accuracy_class2) # 0.9
# 2. calculate recall scores per class
score_recall_class1 = recall_score(label_class1, pred_class1, average='micro')
score_recall_class2 = recall_score(label_class2, pred_class2, average='micro')
print(score_recall_class1) # 0.6
print(score_recall_class2) # 0.9
assert score_accuracy_class1 == score_recall_class1
assert score_accuracy_class2 == score_recall_class2
# 3. this also means that average per-class accuracy is equivalent to averaged recall and balanced accuracy
score_balanced_accuracy1 = (score_accuracy_class1 + score_accuracy_class2) / 2
score_balanced_accuracy2 = (score_recall_class1 + score_recall_class2) / 2
score_balanced_accuracy3 = balanced_accuracy_score(labels, pred)
score_balanced_accuracy4 = recall_score(labels, pred, average='macro')
print(score_balanced_accuracy1) # 0.75
print(score_balanced_accuracy2) # 0.75
print(score_balanced_accuracy3) # 0.75
print(score_balanced_accuracy4) # 0.75
# balanced accuracy, average per-class accuracy and recall-macro are equivalent
assert score_balanced_accuracy1 == score_balanced_accuracy2 == score_balanced_accuracy3 == score_balanced_accuracy4
These official docs say: "balanced accuracy ... is defined as the average of recall obtained on each class."
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html
I tried to run the code from the notebook on self-generated data, to check whether the model would do any classification.
https://gpflow.readthedocs.io/en/master/notebooks/basics/classification.html
So I created X and Y as input data.
import numpy as np

X = np.array([-0.0259, -0.3579, -0.289, 0.0356, 0.0147, 0.0234]).reshape(-1, 1)
Y = np.array([0, 0, 0, 1, 1, 1]).reshape(-1, 1)
The values in X and Y were chosen as binary logic: a negative value in X corresponds to 0 in Y, and a positive value in X should be classified as 1 in Y.
Then I created a model and trained it:
import gpflow

Per = gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential())
model_Per = gpflow.models.VGP((X, Y), likelihood=gpflow.likelihoods.Bernoulli(), kernel=Per)
I tried to predict Y as a class with the same X that was used as input for the model training; I just wanted to see whether it gives the right result.
Ypred, VARpred = model_Per.predict_y(X)
For Ypred I get the output:
<tf.Tensor: shape=(6, 1), dtype=float64, numpy=
array([[0.5],
[0.5],
[0.5],
[0.5],
[0.5],
[0.5]])>
For VARpred I get:
<tf.Tensor: shape=(6, 1), dtype=float64, numpy=
array([[0.25],
[0.25],
[0.25],
[0.25],
[0.25],
[0.25]])>
I tried changing the kernel, combining kernels, running an optimization with Scipy before predicting, and changing the data, but I always get the same output for mean and variance. I was expecting Ypred = Y with this data set.
What am I doing wrong creating this classification model?
You have to actually optimise your model. Once you optimise it, the results actually look very reasonable. I would not expect a GP model to exactly predict p=1 -- this would mean 0.0% probability of ever observing a 0 at this point, which I would only believe if I had seen an infinite amount of data all saying 1...
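As a rough sketch (assuming GPflow 2.x, where VGP exposes a training_loss closure, and reusing model_Per and X from the question), optimising the model before predicting would look something like this:
import gpflow

# Fit the variational parameters and kernel hyperparameters with the Scipy optimizer.
opt = gpflow.optimizers.Scipy()
opt.minimize(model_Per.training_loss, model_Per.trainable_variables,
             options=dict(maxiter=100))

# After optimisation, the predicted means should move away from the 0.5 prior.
Ypred, VARpred = model_Per.predict_y(X)
print(Ypred)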
For the Bernoulli likelihood you are using, the variance is deterministically related to the mean. If y ~ Bernoulli, and Mean[y] = p, then Var[y] = p * (1 - p). For you, the mean is p=0.5, so the variance is 0.5 * (1 - 0.5) = 0.25.
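A quick sanity check of that relation, with simulated Bernoulli draws (just for illustration):
import numpy as np

samples = np.random.default_rng(0).binomial(n=1, p=0.5, size=100_000)
print(samples.mean(), samples.var())   # approximately 0.5 and 0.25 = 0.5 * (1 - 0.5)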
I am using GridSearchCV with cv = KFold(n_splits=10) and scoring='accuracy' with a test SVM (C=1, gamma=1).
For this test, I am using only a vector of 51 values and another one of 51 binary responses.
My results look like this:
'split0_test_score': array([ 0.16666667]), 'split1_test_score': array([ 0.4]), 'split2_test_score': array([ 0.8]), 'split3_test_score': array([ 0.6]), 'split4_test_score': array([ 0.2]), 'split5_test_score': array([ 1.]), 'split6_test_score': array([ 0.2]), 'split7_test_score': array([ 0.]), 'split8_test_score': array([ 0.4]), 'split9_test_score': array([ 0.6]),
'mean_test_score': array([ 0.43137255]) ...
The problem is that the mean score is not the actual mean of all the folds' test scores (it should be 0.4367). Is there a way to get the real mean of all folds from GridSearchCV, or do I have to compute it manually?
Thank you
I also noticed such discrepancies using GridSearchCV from Scikit-learn. Using my own test cases, the difference between the average (numpy.mean) over splitX_test_score[i] and mean_test_score from the attribute cv_results_ is noticeable from the 17th decimal with 2 folds. With 10 folds, there are discrepancies from the 6th decimal.
I think this issue may be related to floating-point precision. Please, could someone explain how exactly mean_test_score is computed (which function is used, and with which floating-point precision)? Many thanks in advance.
Edit: I read the answer from Leena in the following topic: sikit learn cv grid scores - Unexpected results. The difference is due to the parameter iid. If it is set to False, mean_test_score is computed as the plain mean across folds; with the old default iid=True, each fold's score is weighted by the number of test samples in that fold.
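A minimal sketch of where the two numbers in the question come from. With 51 samples, KFold(n_splits=10) gives the one extra sample to the first fold, so the fold sizes are 6, 5, 5, ..., 5:
import numpy as np

fold_scores = np.array([0.16666667, 0.4, 0.8, 0.6, 0.2, 1.0, 0.2, 0.0, 0.4, 0.6])
fold_sizes = np.array([6, 5, 5, 5, 5, 5, 5, 5, 5, 5])   # 51 samples in total

print(fold_scores.mean())                           # 0.43666... (plain mean, iid=False)
print(np.average(fold_scores, weights=fold_sizes))  # 0.43137255 (sample-weighted, old iid=True)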
Hello, I am working with sklearn and, in order to better understand the metrics, I followed the following example of precision_score:
from sklearn.metrics import precision_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print(precision_score(y_true, y_pred, average='macro'))
The result that I got was the following:
0.222222222222
I understand that sklearn computes that result following these steps:
for label 0 precision is tp / (tp + fp) = 2 / (2 + 1) = 0.66
for label 1 precision is 0 / (0 + 2) = 0
for label 2 precision is 0 / (0 + 1) = 0
and finally sklearn calculates the mean precision over all three labels: precision = (0.66 + 0 + 0) / 3 = 0.22
This result is obtained if we use these parameters:
precision_score(y_true, y_pred, average='macro')
On the other hand, if we use these parameters, changing to average='micro':
precision_score(y_true, y_pred, average='micro')
then we get:
0.33
and if we take average='weighted':
precision_score(y_true, y_pred, average='weighted')
then we obtain:
0.22.
I don't understand well how sklearn computes this metric when the average parameter is set to 'weighted' or 'micro'; I would really appreciate it if someone could give me a clear explanation of this.
'micro':
Calculate metrics globally by considering each element of the label indicator matrix as a label.
'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
'samples':
Calculate metrics for each instance, and find their average.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
For the support measure:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
Basically, support is the number of true instances of each class (class membership).
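To make the 'micro' and 'weighted' computations concrete for the example in the question, here is a small sketch (the per-label numbers match the manual calculation above):
import numpy as np
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# micro: pool TP and FP over all labels -> (2 + 0 + 0) / (3 + 2 + 1) = 2/6
print(precision_score(y_true, y_pred, average='micro'))      # 0.333...

# weighted: per-label precision averaged with weights equal to each label's support
per_label = precision_score(y_true, y_pred, average=None)    # [0.667, 0., 0.]
support = np.bincount(y_true)                                # [2, 2, 2]
print(np.average(per_label, weights=support))                # 0.222..., same as macro here
                                                             # because all supports are equal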
3.3.2.12. Receiver operating characteristic (ROC)
The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia :
“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”
TN / True Negative: case was negative and predicted negative.
TP / True Positive: case was positive and predicted positive.
FN / False Negative: case was positive but predicted negative.
FP / False Positive: case was negative but predicted positive
# Basic terminology
from sklearn import metrics

confusion = metrics.confusion_matrix(expected, predicted)  # expected = true labels, predicted = model output
print(confusion, "\n")

TN, FP = confusion[0, 0], confusion[0, 1]
FN, TP = confusion[1, 0], confusion[1, 1]

print('Specificity: ', round(TN / float(TN + FP), 3) * 100, "\n")
print('Sensitivity: ', round(TP / float(TP + FN), 3) * 100, "(Recall)")
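And, for the roc_curve function itself mentioned at the top, a small illustration with made-up labels and scores:
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr)                               # false positive rate at each threshold
print(tpr)                               # true positive rate (sensitivity / recall)
print(roc_auc_score(y_true, y_scores))   # area under the ROC curve, 0.75 here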
I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in the case of svm.SVC, svm.LinearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
import numpy as np

def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is the number of data points and p is the number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of sample weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        print('yes')
        s = 2. * y - 1  # map labels from (0, 1) to (-1, 1)
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y)) / len(y)
-clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y + 1) / 2).astype(np.int8)].mean() == logistic_loss(x, y, w, b, b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you would expect the classifier (here, LogisticRegression) to perform similarly (in terms of the loss function value) when the data is balanced and class_weight=None versus when the data is imbalanced and class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
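A quick numeric check of that equivalence, with made-up class counts (n_pos and n_neg are hypothetical):
import numpy as np

n_pos, n_neg = 30, 10   # hypothetical counts of y == 1 and y == -1

w1 = np.array([n_pos / n_neg, 1.0])        # first proposition:  {-1: n_pos / n_neg, 1: 1}
w2 = np.array([1.0 / n_neg, 1.0 / n_pos])  # second proposition: {-1: 1 / n_neg, 1: 1 / n_pos}

print(w1 / w1.sum())   # [0.75 0.25]
print(w2 / w2.sum())   # [0.75 0.25] -- identical after normalisation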
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
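Following the quoted description, the 'auto' weights are just the 'balanced' weights divided by their mean (the constant n_samples / n_classes cancels out), which you can verify with the arrays from the snippet above:
auto_weights = balanced_weights / balanced_weights.mean()
print(auto_weights)   # approximately [2.404, 0.346, 0.250], matching the 'auto' output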
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and LinearSVC the docstring is pretty clear:
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not directly influence methods such as predict_proba. They change its output because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first snippet above for the imports and variable definitions):
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.