How does sklearn compute the precision_score metric? - scikit-learn

Hello, I am working with sklearn, and in order to better understand the metrics, I followed this example of precision_score:
from sklearn.metrics import precision_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print(precision_score(y_true, y_pred, average='macro'))
The result that I got was the following:
0.222222222222
I understand that sklearn computes that result following these steps:
for label 0 precision is tp / (tp + fp) = 2 / (2 + 1) = 0.66
for label 1 precision is 0 / (0 + 2) = 0
for label 2 precision is 0 / (0 + 1) = 0
and finally sklearn calculates the mean precision over all three labels: precision = (0.66 + 0 + 0) / 3 = 0.22
This result is obtained with these parameters:
precision_score(y_true, y_pred, average='macro')
On the other hand, if we change the parameter to average='micro':
precision_score(y_true, y_pred, average='micro')
then we get:
0.33
and if we take average='weighted':
precision_score(y_true, y_pred, average='weighted')
then we obtain:
0.22.
I don't understand how sklearn computes this metric when the average parameter is set to 'weighted' or 'micro'. I would really appreciate it if someone could give me a clear explanation.

'micro':
Calculate metrics globally by considering each element of the label indicator matrix as a label.
'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
'samples':
Calculate metrics for each instance, and find their average.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
For the definition of support:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
Basically, support is the number of occurrences of each class in y_true (class membership).
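To make 'micro' and 'weighted' concrete for the example above, here is a small sketch (reusing the question's y_true and y_pred) that computes both by hand and compares them with sklearn:
from sklearn.metrics import precision_score
import numpy as np

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Per-class precision and support, as worked out in the question
per_class = np.array([2/3, 0.0, 0.0])   # classes 0, 1, 2
support = np.array([2, 2, 2])           # number of true instances per class

# 'micro': pool all predictions and compute precision once.
# For single-label multiclass data this is just the overall accuracy: 2 correct out of 6.
micro = 2 / 6
print(micro, precision_score(y_true, y_pred, average='micro'))        # 0.333...

# 'weighted': per-class precision averaged with support as weights.
# Here every class has the same support (2), so it coincides with the macro average.
weighted = np.sum(per_class * support) / support.sum()
print(weighted, precision_score(y_true, y_pred, average='weighted'))  # 0.222...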
3.3.2.12. Receiver operating characteristic (ROC)
The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia :
“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”
TN / True Negative: case was negative and predicted negative.
TP / True Positive: case was positive and predicted positive.
FN / False Negative: case was positive but predicted negative.
FP / False Positive: case was negative but predicted positive.
# Basic terminology
from sklearn import metrics

confusion = metrics.confusion_matrix(expected, predicted)
print(confusion, "\n")
TN, FP = confusion[0, 0], confusion[0, 1]
FN, TP = confusion[1, 0], confusion[1, 1]
print('Specificity: ', round(TN / float(TN + FP), 3) * 100, "\n")
print('Sensitivity: ', round(TP / float(TP + FN), 3) * 100, "(Recall)")
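Since the quoted section is about the ROC curve itself, here is a minimal sketch of roc_curve with made-up scores (the labels and scores are purely illustrative):
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # e.g. predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)                              # false positive rate at each threshold
print(tpr)                              # true positive rate (sensitivity) at each threshold
print(roc_auc_score(y_true, y_score))   # area under the ROC curve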

Related

How to measure sensitivity, specificity, PPV and NPV from a confusion matrix for multiclass classification [migrated]

I wonder how to compute precision and recall using a confusion matrix for a multi-class classification problem. Specifically, an observation can only be assigned to its most probable class / label. I would like to compute:
Precision = TP / (TP+FP)
Recall = TP / (TP+FN)
for each class, and then compute the micro-averaged F-measure.
In a 2-hypothesis case, the confusion matrix is usually:
Declare H1
Declare H0
Is H1
TP
FN
Is H0
FP
TN
where I've used something similar to your notation:
TP = true positive (declare H1 when, in truth, H1),
FN = false negative (declare H0 when, in truth, H1),
FP = false positive (declare H1 when, in truth, H0),
TN = true negative (declare H0 when, in truth, H0).
From the raw data, the values in the table would typically be the counts for each occurrence over the test data. From this, you should be able to compute the quantities you need.
Edit
The generalization to multi-class problems is to sum over rows / columns of the confusion matrix. Given that the matrix is oriented as above, i.e., that
a given row of the matrix corresponds to a specific value of the "truth", we have:
$\text{Precision}_{~i} = \cfrac{M_{ii}}{\sum_j M_{ji}}$
$\text{Recall}_{~i} = \cfrac{M_{ii}}{\sum_j M_{ij}}$
That is, precision is the fraction of events where we correctly declared $i$
out of all instances where the algorithm declared $i$. Conversely, recall is the fraction of events where we correctly declared $i$ out of all of the cases where the true state of the world is $i$.
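As a quick illustration of these formulas (with a made-up 3-class confusion matrix oriented as above, rows = truth), per-class precision and recall are just the diagonal divided by the column sums and row sums, respectively:
import numpy as np

# Hypothetical counts; rows = true class, columns = declared class
M = np.array([[10,  2,  3],
              [ 1, 12,  4],
              [ 2,  5, 20]])

precision = np.diag(M) / M.sum(axis=0)   # M_ii / sum_j M_ji (column sums)
recall    = np.diag(M) / M.sum(axis=1)   # M_ii / sum_j M_ij (row sums)
print(precision)
print(recall)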
Good summary paper, looking at these metrics for multi-class problems:
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45, p. 427-437. (pdf)
The abstract reads:
This paper presents a systematic analysis of twenty four performance
measures used in the complete spectrum of Machine Learning
classification tasks, i.e., binary, multi-class, multi-labelled, and
hierarchical. For each classification task, the study relates a set of
changes in a confusion matrix to specific characteristics of data.
Then the analysis concentrates on the type of changes to a confusion
matrix that do not change a measure, therefore, preserve a
classifier’s evaluation (measure invariance). The result is the
measure invariance taxonomy with respect to all relevant label
distribution changes in a classification problem. This formal analysis
is supported by examples of applications where invariance properties
of measures lead to a more reliable evaluation of classifiers. Text
classification supplements the discussion with several case studies.
Using sklearn or tensorflow and numpy:
from sklearn.metrics import confusion_matrix
# or:
# from tensorflow.math import confusion_matrix
import numpy as np
labels = ...
predictions = ...
cm = confusion_matrix(labels, predictions)
recall = np.diag(cm) / np.sum(cm, axis = 1)
precision = np.diag(cm) / np.sum(cm, axis = 0)
To get overall (macro-averaged) measures of precision and recall, then use:
np.mean(recall)
np.mean(precision)
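These unweighted means are exactly the macro averages, so (for a small made-up example) they match sklearn's built-in macro scores:
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels      = [0, 1, 2, 0, 1, 2]
predictions = [0, 2, 1, 0, 0, 1]

cm = confusion_matrix(labels, predictions)
recall    = np.diag(cm) / np.sum(cm, axis=1)
precision = np.diag(cm) / np.sum(cm, axis=0)

print(np.mean(recall), recall_score(labels, predictions, average='macro'))        # both 0.333...
print(np.mean(precision), precision_score(labels, predictions, average='macro'))  # both 0.222...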
Cristian Garcia's code can be reduced using sklearn:
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='micro')
0.3333333333333333
Here is a different view from the other answers that I think will be helpful to others. The goal here is to allow you to compute these metrics using basic laws of probability.
First, it helps to understand what a confusion matrix is telling us in general. Let $Y$ represent a class label and $\hat Y$ represent a class prediction. In the binary case, let the two possible values for $Y$ and $\hat Y$ be $0$ and $1$, which represent the classes. Next, suppose that the confusion matrix for $Y$ and $\hat Y$ is:
           $\hat Y = 0$    $\hat Y = 1$
$Y = 0$    10              20
$Y = 1$    30              40
Next, let us normalize this confusion matrix so that the sum of all of its elements is $1$. Currently, the sum of all elements is $10 + 20 + 30 + 40 = 100$, which is our normalization factor. After dividing each element of the confusion matrix by this factor, we get the following normalized confusion matrix:
           $\hat Y = 0$      $\hat Y = 1$
$Y = 0$    $\frac{1}{10}$    $\frac{2}{10}$
$Y = 1$    $\frac{3}{10}$    $\frac{4}{10}$
With this formulation of the confusion matrix, we can interpret $Y$ and $\hat Y$ slightly differently. We can interpret them as jointly Bernoulli (binary) random variables, where their normalized confusion matrix represents their joint probability mass function. When we interpret $Y$ and $\hat Y$ this way, the definitions of precision and recall are much easier to remember using Bayes' rule and the law of total probability:
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 0 , \hat Y = 1)} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 1 , \hat Y = 0)}
\end{align}
How do we determine these probabilities? We can estimate them using the normalized confusion matrix. From the table above, we see that
\begin{align}
P(Y = 0 , \hat Y = 0) &\approx \frac{1}{10} \\
P(Y = 0 , \hat Y = 1) &\approx \frac{2}{10} \\
P(Y = 1 , \hat Y = 0) &\approx \frac{3}{10} \\
P(Y = 1 , \hat Y = 1) &\approx \frac{4}{10}
\end{align}
Therefore, the precision and recall for this specific example are
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{2}{10}} = \frac{4}{4 + 2} = \frac{2}{3} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{3}{10}} = \frac{4}{4 + 3} = \frac{4}{7}
\end{align}
Note that, from the calculations above, we didn't really need to normalize the confusion matrix before computing the precision and recall. The reason for this is that, because of Bayes' rule, we end up dividing one value that is normalized by another value that is normalized, which means that the normalization factor can be cancelled out.
A nice thing about this interpretation is that it can be generalized to confusion matrices of any size. In the case where there are more than 2 classes, $Y$ and $\hat Y$ are no longer considered to be jointly Bernoulli, but rather jointly categorical. Moreover, we would need to specify which class we are computing the precision and recall for. In fact, the definitions above may be interpreted as the precision and recall for class $1$. We can also compute the precision and recall for class $0$, but these have different names in the literature.
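Here is a small numpy sketch of this interpretation, using the 2x2 table above (counts 10, 20, 30, 40), which reproduces the precision of 2/3 and the recall of 4/7:
import numpy as np

# Joint counts for (Y, Y_hat); rows = Y, columns = Y_hat
counts = np.array([[10, 20],
                   [30, 40]])
joint = counts / counts.sum()            # normalized confusion matrix = estimated joint pmf

# Precision = P(Y = 1 | Y_hat = 1), Recall = P(Y_hat = 1 | Y = 1)
precision = joint[1, 1] / joint[:, 1].sum()
recall    = joint[1, 1] / joint[1, :].sum()
print(precision)   # 0.666... = 2/3
print(recall)      # 0.571... = 4/7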

Multiclass classification per class recall equals per class accuracy?

I've got a multiclass problem. I'm using sklearn.metrics to calculate the confusion matrix, overall accuracy, per class precision, per class recall and per class F1-score.
Now I wanted to calculate the per-class accuracy. Since there is no method in sklearn for this, I used another one that I found in a Google search. I've now realised that the per-class recall equals the per-class accuracy. Can anyone explain to me whether this holds true and, if yes, why?
I found an explanation here, but I'm not sure, since there the micro-recall equals the overall accuracy (if I'm understanding it correctly), and I'm looking for the per-class accuracy.
I too experienced the same results, because per-class recall = TP / (TP + FN), and here TP + FN is the same as all the samples of a class. So the formula becomes similar to accuracy.
This generally doesn't hold. Accuracy and recall are calculated using different formulas and are different measures that explain different things.
Recall is the percentage of actual positive data points that your classifier correctly predicts as positive.
Accuracy is the percentage of all examples that are classified correctly, positive and negative alike.
If they are equal, this is either coincidence or you have an error in your method of calculating them. Most likely it is coincidence.
EDIT:
I will show why it's not the case with an example that can be generalised to N classes.
Let's assume three classes: 0, 1, 2 with the following confusion matrix:
[[3 0 1]
[2 5 0]
[0 1 4]]
When we want to calculate measures per class, we do this in a binary, one-vs-rest fashion. For example, for class 0 we combine classes 1 and 2 into 'not 0'. This results in the following confusion matrix:
[[ 3  1]
 [ 2 10]]
Resulting in:
TP = 3
FP = 2
FN = 1
TN = 10
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
So you can already tell from these formulas that they are really not equal. To disprove a hypothesis in mathematics it suffices to show a counterexample, in this case an example where accuracy is not equal to recall.
Filling in this example we get:
Accuracy = 13/16
Recall = 3/4
And 13/16 is not equal to 3/4, thus disproving the hypothesis that per-class accuracy is equal to per-class recall.
It is, however, also possible to construct examples for which the hypothesis holds. But because it does not hold in general, it is disproven.
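For reference, a small numpy sketch of the one-vs-rest calculation for class 0 on this confusion matrix:
import numpy as np

cm = np.array([[3, 0, 1],
               [2, 5, 0],
               [0, 1, 4]])

# One-vs-rest for class 0: classes 1 and 2 are collapsed into 'not 0'
TP = cm[0, 0]
FN = cm[0, 1:].sum()
FP = cm[1:, 0].sum()
TN = cm[1:, 1:].sum()

accuracy_class0 = (TP + TN) / cm.sum()   # 13/16 = 0.8125
recall_class0   = TP / (TP + FN)         # 3/4  = 0.75
print(accuracy_class0, recall_class0)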
Not sure if you are looking for average per-class accuracy as a single metric or per-class accuracy as separate metrics for each class.
For per-class accuracy as a separate metric for each class, see the code below. It's the same as recall-micro per class.
For average per-class accuracy as a single metric, it is equivalent to recall-macro (which is equivalent to balanced accuracy in sklearn). See the code below.
Here is the empirical demonstration in code.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
label_class1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels = label_class1 + label_class2
pred_class1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
pred = pred_class1 + pred_class2
# 1. calculate accuracy scores per class
score_accuracy_class1 = accuracy_score(label_class1, pred_class1)
score_accuracy_class2 = accuracy_score(label_class2, pred_class2)
print(score_accuracy_class1) # 0.6
print(score_accuracy_class2) # 0.9
# 2. calculate recall scores per class
score_recall_class1 = recall_score(label_class1, pred_class1, average='micro')
score_recall_class2 = recall_score(label_class2, pred_class2, average='micro')
print(score_recall_class1) # 0.6
print(score_recall_class2) # 0.9
assert score_accuracy_class1 == score_recall_class1
assert score_accuracy_class2 == score_recall_class2
# 3. this also means that average per-class accuracy is equivalent to averaged recall and balanced accuracy
score_balanced_accuracy1 = (score_accuracy_class1 + score_accuracy_class2) / 2
score_balanced_accuracy2 = (score_recall_class1 + score_recall_class2) / 2
score_balanced_accuracy3 = balanced_accuracy_score(labels, pred)
score_balanced_accuracy4 = recall_score(labels, pred, average='macro')
print(score_balanced_accuracy1) # 0.75
print(score_balanced_accuracy2) # 0.75
print(score_balanced_accuracy3) # 0.75
print(score_balanced_accuracy4) # 0.75
# balanced accuracy, average per-class accuracy and recall-macro are equivalent
assert score_balanced_accuracy1 == score_balanced_accuracy2 == score_balanced_accuracy3 == score_balanced_accuracy4
These official docs say: "balanced accuracy ... is defined as the average of recall obtained on each class."
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

python with lightgbm, predict the class label (0 or 1) rather than the probability

I am using lightgbm. I want to know how to get the class label (0 or 1) not the probability for classification.
I know that lightgbm provides a scikit-learn API. That API has the function "predict" to get the label and "predict_proba" to get the probability. I want to get the label directly, without the scikit-learn API.
The question is similar to how to get the label from the probability. We can use a threshold: probabilities exceeding the threshold are set to one, and probabilities below the threshold are set to zero. Obviously, the threshold 0.5 is not always appropriate.
I use the following code to find the best threshold, but there must be something wrong:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

def find_optimal_cutoff(target, predicted):
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr))
    roc = pd.DataFrame({'tf': pd.Series(tpr - (1 - fpr), index=i), 'threshold': pd.Series(threshold, index=i)})
    roc_t = roc.loc[(roc.tf - 0).abs().argsort()[:1]]
    return list(roc_t['threshold'])[0]

def find_optimal_cutoff2(target, predicted):
    fpr, tpr, threshold = roc_curve(target, predicted)
    optimal_idx = np.argmax(tpr - fpr)
    optimal_threshold = threshold[optimal_idx]
    return optimal_threshold

thre1 = find_optimal_cutoff(target, predicted)
thre2 = find_optimal_cutoff2(target, predicted)
thre1 is a little different from thre2, but this is not the core problem. If the parameters for the lightgbm model are not appropriate, my experiment shows that the AUC is approximately 0.5; however, thre1 is 1 and thre2 is 2.
You can use
np.where(predictions_lgbm_prob > 0.5, 1, 0)
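For example, a minimal sketch with made-up probabilities and labels (probs stands in for whatever the booster's predict returns on your validation data):
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical predicted probabilities and true labels
probs  = np.array([0.05, 0.2, 0.45, 0.55, 0.7, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 1])

# Fixed cut at 0.5
labels_default = np.where(probs > 0.5, 1, 0)

# Or pick the threshold that maximizes Youden's J = tpr - fpr,
# which is the idea behind find_optimal_cutoff2 above
fpr, tpr, thresholds = roc_curve(y_true, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]
labels_tuned = np.where(probs >= best_threshold, 1, 0)   # roc_curve counts score >= threshold as positive
print(labels_default, labels_tuned)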

PyTorch doesn't seem to be optimizing correctly

I have posted this question on Data Science StackExchange site since StackOverflow does not support LaTeX. Linking it here because this site is probably more appropriate.
The question with correctly rendered LaTeX is here: https://datascience.stackexchange.com/questions/48062/pytorch-does-not-seem-to-be-optimizing-correctly
The idea is that I am considering sums of sine waves with different phases. The waves are sampled with some sample rate s in the interval [0, 2pi]. I need to select phases in such a way, that the sum of the waves at any sample point is minimized.
Below is the Python code. Optimization does not seem to be computed correctly.
import numpy as np
import torch

def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3
    theta = torch.zeros([n, 1], requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta
    for jj in range(nsteps):
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
Below is a sample output.
phaseOptimize(5, nsteps=100)
Optimal theta:
tensor([[1.2812e-07],
[1.2812e-07],
[1.2812e-07],
[1.2812e-07],
[1.2812e-07]], requires_grad=True)
Maximum value: 5.0
I am assuming this has something to do with broadcasting in
T = t + theta
and/or the way I am computing the loss function.
One way to verify that optimization is incorrect, is to simply evaluate the loss function at random values for the array $\theta_1, \dots, \theta_n$, say uniformly distributed in $[0, 2\pi]$. The maximum value in this case is almost always much lower than the maximum value reported by phaseOptimize(). Much easier in fact is to consider the case with $n = 2$, and simply evaluate at $\theta_1 = 0$ and $\theta_2 = \pi$. In that case we get:
phaseOptimize(2, nsteps=100)
Optimal theta:
tensor([[2.8599e-08],
[2.8599e-08]])
Maximum value: 2.0
On the other hand,
theta = torch.FloatTensor([[0], [np.pi]])
l = torch.linspace(0, 2 * np.pi, 48000)
t = torch.stack([l] * 2)
T = t + theta
T.sin().sum(0).abs().max().item()
produces
3.2782554626464844e-07
You have to move the computation of T inside the loop, or it will always have the same constant value, and thus a constant loss.
Another thing is to initialize theta to different values at different indices; otherwise, because of the symmetric nature of the problem, the gradient is the same for every index.
Another thing is that you need to zero the gradient, because backward just accumulates it.
This seems to work:
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-1
    theta = torch.zeros([n, 1], requires_grad=True)
    theta.data[0][0] = 1
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
        theta.grad.zero_()
You're being bitten by both PyTorch and math. Firstly, you need to
Zero out the gradient by setting theta.grad = None before each backward step. Otherwise the gradients accumulate instead of overwriting the previous ones
You need to recalculate T at each step. PyTorch is not symbolic, unlike TensorFlow and T = t + theta means "T equals the sum of current t and current theta" and not "T equals the sum of t and theta, whatever their values may be at any time in the future".
With those fixes you get the following code:
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3
    theta = torch.zeros(n, 1, requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta
    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        theta.grad = None
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
    T = t + theta
    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
which will still not work as you expect because of math.
One can easily see that the minimum of your loss function is reached when the theta values are themselves uniformly spaced over [0, 2pi). The problem is that you are initializing your parameters with torch.zeros, which makes all of those values equal (the polar opposite of equispaced!). Since your loss function is symmetric with respect to permutations of theta, the computed gradients are all equal, and the gradient descent algorithm can never "differentiate" them. In more mathematical terms, you're unlucky enough to initialize your algorithm exactly on a saddle point, so it cannot make progress. If you add any noise, it will converge. For instance with
theta = torch.zeros(n, 1) + 0.001 * torch.randn(n, 1)
theta.requires_grad_(True)
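As a quick sanity check of the claim that uniformly spaced phases minimize the loss, here is a short sketch (assuming the same s and the same loss as above): with theta_k = 2*pi*k/n the n sine waves cancel exactly, so the summed wave is essentially zero.
import numpy as np
import torch

n, s = 5, 48000
theta = (2 * np.pi / n) * torch.arange(n, dtype=torch.float32).reshape(n, 1)
l = torch.linspace(0, 2 * np.pi, s)
t = torch.stack([l] * n)
T = t + theta
print(T.sin().sum(0).abs().max().item())   # close to 0, up to float32 rounding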

How to avoid NaN in numpy implementation of logistic regression?

EDIT: I already made significant progress. My current question is written after my last edit below and can be answered without the context.
I currently follow Andrew Ng's Machine Learning Course on Coursera and tried to implement logistic regression today.
Notation:
X is a (m x n)-matrix with vectors of input variables as rows (m training samples of n-1 variables, the entries of the first column are equal to 1 everywhere to represent a constant).
y is the corresponding vector of expected output samples (column vector with m entries equal to 0 or 1)
theta is the vector of model coefficients (row vector with n entries)
For an input row vector x the model will predict the probability sigmoid(x * theta.T) for a positive outcome.
This is my Python3/numpy implementation:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

def logistic_cost(X, y, theta):
    summands = np.multiply(y, np.log(vec_sigmoid(X*theta.T))) + np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
    return - np.sum(summands) / len(y)

def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]                              # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])  # init theta
    cost = [0.0 for i in range(num_iterations)]
    for it in range(num_iterations):
        error = np.repeat(vec_sigmoid(X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = logistic_cost(X, y, theta)
    return theta, cost
This implementation seems to work fine, but I encountered a problem when calculating the logistic-cost. At some point the gradient descent algorithm converges to a pretty good fitting theta and the following happens:
For some input row X_i with expected outcome 1, X_i * theta.T becomes positive with a good margin (for example 23.207). This causes sigmoid(X_i * theta.T) to become exactly 1.0000 (because of lost floating-point precision, I think). This is a good prediction (since the expected outcome is equal to 1), but it breaks the calculation of the logistic cost, since np.log(1 - vec_sigmoid(X*theta.T)) evaluates to -inf (log of zero). This shouldn't be a problem, since that term is multiplied by 1 - y = 0, but 0 * inf = NaN, and once a NaN occurs the whole calculation is broken.
How should I handle this in the vectorized implementation, since np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T))) is calculated in every row of X (not only where y = 0)?
Example input:
X = np.matrix([[1. , 0. , 0. ],
               [1. , 1. , 0. ],
               [1. , 0. , 1. ],
               [1. , 0.5, 0.3],
               [1. , 1. , 0.2]])
y = np.matrix([[0],
               [1],
               [1],
               [0],
               [1]])
Then theta, _ = gradient_descent(X, y, 10000, 10000) (yes, in this case we can set the learning rate this large) will set theta as:
theta = np.matrix([[-3000.04008972, 3499.97995514, 4099.98797308]])
This will lead to vec_sigmoid(X * theta.T) to be the really good prediction of:
np.matrix([[0.00000000e+00], # 0
[1.00000000e+00], # 1
[1.00000000e+00], # 1
[1.95334953e-09], # nearly zero
[1.00000000e+00]]) # 1
but logistic_cost(X, y, theta) evaluates to NaN.
EDIT:
I came up with the following solution. I just replaced the logistic_cost function with:
def new_logistic_cost(X, y, theta):
    term1 = vec_sigmoid(X*theta.T)
    term1[y == 0] = 1
    term2 = 1 - vec_sigmoid(X*theta.T)
    term2[y == 1] = 1
    summands = np.multiply(y, np.log(term1)) + np.multiply(1 - y, np.log(term2))
    return - np.sum(summands) / len(y)
By using the mask I just calculate log(1) at the places at which the result will be multiplied with zero anyway. Now log(0) will only happen in wrong implementations of gradient descent.
Open questions: How can I make this solution more clean? Is it possible to achieve a similar effect in a cleaner way?
If you don't mind using SciPy, you could import expit and xlog1py from scipy.special:
from scipy.special import expit, xlog1py
and replace the expression
np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
with
xlog1py(1 - y, -expit(X*theta.T))
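Putting that together, here is a sketch of the whole cost function written with scipy.special (the function name is just illustrative; xlogy handles the y * log(p) term the same way xlog1py handles the other one):
import numpy as np
from scipy.special import expit, xlogy, xlog1py

def scipy_logistic_cost(X, y, theta):
    p = expit(X * theta.T)
    # xlogy(0, 0) and xlog1py(0, -1) evaluate to 0, so the 0 * log(0) cases no longer produce NaN
    summands = xlogy(y, p) + xlog1py(1 - y, -p)
    return -np.sum(summands) / len(y)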
I know it is an old question, but I ran into the same problem, and maybe this can help others in the future: I actually solved it by normalizing the data before appending the X0 column.
def normalize_data(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
After this all worked well!

Resources