How to measure sensitivity, specificity, PPV and NPV from a confusion matrix for multiclass classification [migrated] - confusion-matrix

I wonder how to compute precision and recall using a confusion matrix for a multi-class classification problem. Specifically, an observation can only be assigned to its most probable class / label. I would like to compute:
Precision = TP / (TP+FP)
Recall = TP / (TP+FN)
for each class, and then compute the micro-averaged F-measure.

In a 2-hypothesis case, the confusion matrix is usually:
             Declare H1    Declare H0
Is H1            TP            FN
Is H0            FP            TN
where I've used something similar to your notation:
TP = true positive (declare H1 when, in truth, H1),
FN = false negative (declare H0 when, in truth, H1),
FP = false positive (declare H1 when, in truth, H0),
TN = true negative (declare H0 when, in truth, H0).
From the raw data, the values in the table would typically be the counts for each occurrence over the test data. From this, you should be able to compute the quantities you need.
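For example, the four quantities in the question's title follow directly from these counts. A small sketch with made-up numbers (not data from the question):

# Hypothetical counts read off a 2x2 confusion matrix of the form above
TP, FN, FP, TN = 40, 10, 5, 45

sensitivity = TP / (TP + FN)   # recall / true positive rate          -> 0.8
specificity = TN / (TN + FP)   # true negative rate                   -> 0.9
ppv = TP / (TP + FP)           # positive predictive value, precision -> ~0.889
npv = TN / (TN + FN)           # negative predictive value            -> ~0.818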
Edit
The generalization to multi-class problems is to sum over rows / columns of the confusion matrix. Given that the matrix is oriented as above, i.e., that
a given row of the matrix corresponds to a specific value for the "truth", we have:
$\text{Precision}_{~i} = \cfrac{M_{ii}}{\sum_j M_{ji}}$
$\text{Recall}_{~i} = \cfrac{M_{ii}}{\sum_j M_{ij}}$
That is, precision is the fraction of events where we correctly declared $i$
out of all instances where the algorithm declared $i$. Conversely, recall is the fraction of events where we correctly declared $i$ out of all of the cases where the true state of the world is $i$.
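As a small numerical illustration (a made-up 3-class matrix, rows = truth, columns = predictions, not data from the question):

import numpy as np

M = np.array([[5, 1, 0],
              [2, 6, 2],
              [0, 1, 8]])

precision = np.diag(M) / M.sum(axis=0)  # divide by column sums: everything declared as class i
recall    = np.diag(M) / M.sum(axis=1)  # divide by row sums: everything truly class i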

Good summary paper, looking at these metrics for multi-class problems:
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45, 427–437.
The abstract reads:
This paper presents a systematic analysis of twenty four performance
measures used in the complete spectrum of Machine Learning
classification tasks, i.e., binary, multi-class, multi-labelled, and
hierarchical. For each classification task, the study relates a set of
changes in a confusion matrix to specific characteristics of data.
Then the analysis concentrates on the type of changes to a confusion
matrix that do not change a measure, therefore, preserve a
classifier’s evaluation (measure invariance). The result is the
measure invariance taxonomy with respect to all relevant label
distribution changes in a classification problem. This formal analysis
is supported by examples of applications where invariance properties
of measures lead to a more reliable evaluation of classifiers. Text
classification supplements the discussion with several case studies.

Using sklearn or tensorflow and numpy:
from sklearn.metrics import confusion_matrix
# or:
# from tensorflow.math import confusion_matrix
import numpy as np

labels = ...       # true class labels
predictions = ...  # predicted class labels

cm = confusion_matrix(labels, predictions)    # rows: true class, columns: predicted class
recall = np.diag(cm) / np.sum(cm, axis=1)     # per-class recall (divide by row sums)
precision = np.diag(cm) / np.sum(cm, axis=0)  # per-class precision (divide by column sums)
To get overall (macro-averaged) measures of precision and recall, then use
np.mean(recall)
np.mean(precision)
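Note that np.mean gives the macro-average. For the micro-averaged precision, recall and F-measure asked about in the question, aggregate the counts over classes first. A sketch continuing from the cm above:

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP    # per-class false positives
FN = cm.sum(axis=1) - TP    # per-class false negatives

micro_precision = TP.sum() / (TP.sum() + FP.sum())
micro_recall    = TP.sum() / (TP.sum() + FN.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)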

Cristian Garcia's code above can be shortened by using sklearn directly:
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='micro')
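For this toy example, 2 of the 6 predictions are correct, so the call above should return 1/3 (for single-label multiclass data, micro-averaged precision is simply the accuracy). The micro-averaged F-measure asked about in the original question can be computed the same way:

>>> from sklearn.metrics import f1_score
>>> f1_score(y_true, y_pred, average='micro')  # ~0.333, same as the accuracy here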

Here is a different view from the other answers that I think will be helpful to others. The goal here is to allow you to compute these metrics using basic laws of probability.
First, it helps to understand what a confusion matrix is telling us in general. Let $Y$ represent a class label and $\hat Y$ represent a class prediction. In the binary case, let the two possible values for $Y$ and $\hat Y$ be $0$ and $1$, which represent the classes. Next, suppose that the confusion matrix for $Y$ and $\hat Y$ is:
            $\hat Y = 0$   $\hat Y = 1$
$Y = 0$          10             20
$Y = 1$          30             40
Now, let us normalize this confusion matrix so that the sum of all of its elements is $1$. Currently, the sum of all elements of the confusion matrix is $10 + 20 + 30 + 40 = 100$, which is our normalization factor. After dividing the elements of the confusion matrix by the normalization factor, we get the following normalized confusion matrix:
            $\hat Y = 0$     $\hat Y = 1$
$Y = 0$     $\frac{1}{10}$   $\frac{2}{10}$
$Y = 1$     $\frac{3}{10}$   $\frac{4}{10}$
With this formulation of the confusion matrix, we can interpret $Y$ and $\hat Y$ slightly differently. We can interpret them as jointly Bernoulli (binary) random variables, where their normalized confusion matrix represents their joint probability mass function. When we interpret $Y$ and $\hat Y$ this way, the definitions of precision and recall are much easier to remember using Bayes' rule and the law of total probability:
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 0 , \hat Y = 1)} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 1 , \hat Y = 0)}
\end{align}
How do we determine these probabilities? We can estimate them using the normalized confusion matrix. From the table above, we see that
\begin{align}
P(Y = 0 , \hat Y = 0) &\approx \frac{1}{10} \\
P(Y = 0 , \hat Y = 1) &\approx \frac{2}{10} \\
P(Y = 1 , \hat Y = 0) &\approx \frac{3}{10} \\
P(Y = 1 , \hat Y = 1) &\approx \frac{4}{10}
\end{align}
Therefore, the precision and recall for this specific example are
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{2}{10}} = \frac{4}{4 + 2} = \frac{2}{3} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{3}{10}} = \frac{4}{4 + 3} = \frac{4}{7}
\end{align}
Note that, from the calculations above, we didn't really need to normalize the confusion matrix before computing the precision and recall. The reason for this is that, because of Bayes' rule, we end up dividing one value that is normalized by another value that is normalized, which means that the normalization factor can be cancelled out.
A nice thing about this interpretation is that it can be generalized to confusion matrices of any size. In the case where there are more than 2 classes, $Y$ and $\hat Y$ are no longer considered to be jointly Bernoulli, but rather jointly categorical. Moreover, we would need to specify which class we are computing the precision and recall for. In fact, the definitions above may be interpreted as the precision and recall for class $1$. We can also compute the precision and recall for class $0$, but these have different names in the literature.
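As a small sketch of this interpretation in code (using the counts from the example above; variable names are mine, not from any particular library):

import numpy as np

cm = np.array([[10, 20],
               [30, 40]])    # rows: Y, columns: Y_hat
joint = cm / cm.sum()        # estimate of the joint pmf P(Y = i, Y_hat = j)

# Precision for class 1 = P(Y = 1 | Y_hat = 1)
precision_1 = joint[1, 1] / joint[:, 1].sum()   # = 2/3
# Recall for class 1 = P(Y_hat = 1 | Y = 1)
recall_1 = joint[1, 1] / joint[1, :].sum()      # = 4/7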

Related

Implementation of binomial coefficient in Google JAX

Trying to implement a custom MLE for the binomial distribution (for learning purposes), I am stuck on the implementation of the binomial coefficient in Google JAX: there is no analog of scipy.special.binom() implemented.
What shall I use instead?
The binomial coefficient for general real-valued inputs can be computed in terms of the gamma function, which is available in JAX via jax.scipy.special.gammaln. Here's one way you could define it:
import jax.numpy as jnp
from jax.scipy.special import gammaln

def binom(x, y):
    return jnp.exp(gammaln(x + 1) - gammaln(y + 1) - gammaln(x - y + 1))
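For example, a quick check (assuming jnp and gammaln are imported as above; the result is only approximate because it goes through exp and gammaln):

binom(5, 2)    # ~10.0, up to floating-point rounding
binom(5.5, 2)  # the gamma-function form also handles non-integer arguments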
Here is a (sequential) integer implementation using JAX.
import jax
import jax.numpy as jnp

def binom_int_seq(x: int, y: int):
    def scan_body(carry, values):
        n, d = values
        carry = (carry * n) // d
        return carry, None

    y = max(y, x - y)
    nd = jnp.concatenate(
        (jnp.arange(y + 2, x + 1, dtype='u8')[:, None],
         jnp.arange(2, x - y + 1, dtype='u8')[:, None],),
        axis=1
    )
    bc, *_ = jax.lax.scan(scan_body, jnp.array(y + 1, dtype='u8'), nd)
    return bc

binom_int_seq_jit = jax.jit(binom_int_seq, static_argnums=(0, 1))
which gives
x, y = 60, 31
bc_ref = sp.special.comb(x, y, exact=True)
# 114449595062769120
binom_int_seq(x, y)-bc_ref
# DeviceArray(0, dtype=uint64)
# Using above logarithmic gamma function based implementation
binom(x, y)-bc_ref
# DeviceArray(496., dtype=float64, weak_type=True)
Keep in mind the binom_int_seq implementation is only correct if
(x-max(x-y, y))*sp.special.comb(x, y, exact=True) < jnp.iinfo(jnp.uint64).max
Unlike the real-valued version, the error will be sudden and catastrophic if this condition is not satisfied.
There may be other ways to relax this constraint, such as running cancellations based upon prime factorisation, without resorting to larger unsigned integers (or arbitrary precision).
A monoidal version could be implemented which computes the binomial coefficient numerator and denominator reductions then integer divides, but this places stricter constraints on the maximum arguments.

How to avoid NaN in numpy implementation of logistic regression?

EDIT: I already made significant progress. My current question is written after my last edit below and can be answered without the context.
I currently follow Andrew Ng's Machine Learning Course on Coursera and tried to implement logistic regression today.
Notation:
X is a (m x n)-matrix with vectors of input variables as rows (m training samples of n-1 variables, the entries of the first column are equal to 1 everywhere to represent a constant).
y is the corresponding vector of expected output samples (column vector with m entries equal to 0 or 1)
theta is the vector of model coefficients (row vector with n entries)
For an input row vector x the model will predict the probability sigmoid(x * theta.T) for a positive outcome.
This is my Python3/numpy implementation:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

def logistic_cost(X, y, theta):
    summands = np.multiply(y, np.log(vec_sigmoid(X * theta.T))) + np.multiply(1 - y, np.log(1 - vec_sigmoid(X * theta.T)))
    return -np.sum(summands) / len(y)

def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]                              # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])  # init theta
    cost = [0.0 for i in range(num_iterations)]
    for it in range(num_iterations):
        error = np.repeat(vec_sigmoid(X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = logistic_cost(X, y, theta)
    return theta, cost
This implementation seems to work fine, but I encountered a problem when calculating the logistic-cost. At some point the gradient descent algorithm converges to a pretty good fitting theta and the following happens:
For some input row X_i with expected outcome 1, X_i * theta.T will become positive with a good margin (for example 23.207). This will lead to sigmoid(X_i * theta.T) becoming exactly 1.0000 (because of lost precision, I think). This is a good prediction (since the expected outcome is equal to 1), but it breaks the calculation of the logistic cost, since np.log(1 - vec_sigmoid(X * theta.T)) evaluates to log(0) for that row. This shouldn't be a problem, since the term is multiplied with 1 - y = 0, but 0 times -inf gives NaN, and once a NaN occurs the whole calculation is broken.
How should I handle this in the vectorized implementation, since np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T))) is calculated in every row of X (not only where y = 0)?
Example input:
X = np.matrix([[1. , 0. , 0. ],
               [1. , 1. , 0. ],
               [1. , 0. , 1. ],
               [1. , 0.5, 0.3],
               [1. , 1. , 0.2]])
y = np.matrix([[0],
               [1],
               [1],
               [0],
               [1]])
Then theta, _ = gradient_descent(X, y, 10000, 10000) (yes, in this case we can set the learning rate this large) will set theta as:
theta = np.matrix([[-3000.04008972, 3499.97995514, 4099.98797308]])
This will lead to vec_sigmoid(X * theta.T) to be the really good prediction of:
np.matrix([[0.00000000e+00],  # 0
           [1.00000000e+00],  # 1
           [1.00000000e+00],  # 1
           [1.95334953e-09],  # nearly zero
           [1.00000000e+00]]) # 1
but logistic_cost(X, y, theta) evaluates to NaN.
EDIT:
I came up with the following solution. I just replaced the logistic_cost function with:
def new_logistic_cost(X, y, theta):
    term1 = vec_sigmoid(X * theta.T)
    term1[y == 0] = 1
    term2 = 1 - vec_sigmoid(X * theta.T)
    term2[y == 1] = 1
    summands = np.multiply(y, np.log(term1)) + np.multiply(1 - y, np.log(term2))
    return -np.sum(summands) / len(y)
By using the mask I just calculate log(1) at the places at which the result will be multiplied with zero anyway. Now log(0) will only happen in wrong implementations of gradient descent.
Open questions: How can I make this solution more clean? Is it possible to achieve a similar effect in a cleaner way?
If you don't mind using SciPy, you could import expit and xlog1py from scipy.special:
from scipy.special import expit, xlog1py
and replace the expression
np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
with
xlog1py(1 - y, -expit(X*theta.T))
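The same idea can be applied to both terms so that the cost function never produces NaN. Here is a sketch of that approach (scipy.special.xlogy is the analogous helper for the y * log(p) term; this goes beyond the one-term replacement above):

import numpy as np
from scipy.special import expit, xlogy, xlog1py

def stable_logistic_cost(X, y, theta):
    p = expit(X * theta.T)                        # predicted probabilities
    summands = xlogy(y, p) + xlog1py(1 - y, -p)   # both terms evaluate to 0 where their weight is 0
    return -np.sum(summands) / len(y)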
I know it is an old question, but I ran into the same problem and maybe this can help others in the future: I actually solved it by normalizing the data before appending X0.
def normalize_data(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
After this all worked well!

Improper cost function outputs for Vectorized Logistic Regression

I'm trying to implement vectorized logistic regression on the Iris dataset. This is the implementation from Andrew Ng's YouTube series on deep learning. My best predictions using this method have been 81% accuracy, while sklearn's implementation achieves 100% with completely different values for the coefficients and bias. Also, I can't seem to get proper outputs from my cost function. I suspect it is an issue with computing the gradients of the weights and bias with respect to the cost function, though in the course he provides all of the necessary equations (unless there is something in the actual exercise, which I don't have access to, being left out). My code is as follows.
n = 4
m = 150
y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1, 1))

for epoch in range(1000):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)  # 1/(1 + e**(-Z))
    J = -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))  # cost function
    dz = A - y
    dw = 1/m * np.dot(X, dz.T)
    db = np.sum(dz)
    W = W - 0.01 * dw
    b = b - 0.01 * db
    if epoch % 100 == 0:
        print(J)
My output looks something like this.
-1.6126604413879289
-1.6185960074767125
-1.6242504226045396
-1.6296400635926438
-1.6347800862216104
-1.6396845400653066
-1.6443664703028427
-1.648838008214648
-1.653110451818512
-1.6571943378913891
W and b values are:
array([[-0.68262679, -1.56816916, 0.12043066, 1.13296948]])
array([[0.53087131]])
Where as sklearn outputs:
(array([[ 0.41498833, 1.46129739, -2.26214118, -1.0290951 ]]),
array([0.26560617]))
I understand sklearn uses L2 regularization but even when turned off it's still far from the correct values. Any help would be appreciated. Thanks
You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem, y can be one of three values:
> np.unique(iris.target)
> array([0, 1, 2])
The cross entropy cost function expects y to either be one or zero. One way to handle this is the one vs all method.
You can check each class by making y a boolean indicating whether the iris is in one class or not. For example, here you can make y a data set of either class 1 or not:
y = (iris.target == 1).astype(int)
With that your cost function and gradient descent should work, but you'll need to run it multiple times and pick the best score for each example. Andrew Ng's class talks about this method.
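For example, a rough one-vs-rest sketch (not code from the course; train_binary is a hypothetical stand-in for whatever binary trainer you use, e.g. the gradient-descent loop below, and X is iris.data with shape (n_samples, n_features)):

import numpy as np

scores = []
for c in np.unique(iris.target):
    y_c = np.expand_dims((iris.target == c).astype(int), axis=1)  # 1 for class c, else 0
    W_c, b_c = train_binary(X, y_c)                # hypothetical binary logistic regression trainer
    scores.append(sigmoid(np.dot(X, W_c) + b_c))   # P(class == c) for each sample
predictions = np.argmax(np.hstack(scores), axis=1) # pick the class with the highest probability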
EDIT:
It's not clear what you are starting with for data. When I do this, I don't reshape the inputs, so you should double-check that all your multiplications are delivering the shapes you want. One thing I notice that's a little odd is the last term in your cost function. I generally do this:
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
not:
-1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
Here's code that converges for me using the dataset from sklearn:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
# Iris is a multiclass problem. Here, just calculate the probability that
# the class is `iris_class`
iris_class = 0
Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
# Y is now a data set of booleans indicating whether the sample is or isn't a member of iris_class

# initialize w and b
W = np.random.randn(4, 1)
b = np.random.randn(1, 1)
a = 0.1         # learning rate
m = Y.shape[0]  # number of samples

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

for i in range(1000):
    Z = np.dot(X, W) + b
    A = sigmoid(Z)
    dz = A - Y
    dw = 1/m * np.dot(X.T, dz)
    db = np.mean(dz)
    W -= a * dw
    b -= a * db
    cost = -1/m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    if i % 100 == 0:
        print(cost)

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is number of data points and p is number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        print('yes')
        s = 2. * y - 1
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x)[range(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x, y, w, b, b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when the data is balanced and class_weight=None versus when the data is imbalanced and class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight

n_classes = 3
n_samples = 1000

X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
                           n_classes=n_classes, weights=[0.05, 0.4, 0.55])

print("Count of samples per class: ", np.bincount(y))

balanced_weights = n_samples / (n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not directly influence methods such as predict_proba. They change its output because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

scikit learn: how to check coefficients significance

I tried to do a logistic regression with sklearn for a rather large dataset with ~600 dummy and only a few interval variables (and 300K lines in my dataset), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
If you still want to stick to scikit-learn LogisticRegression, you can use the asymptotic approximation to the distribution of maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """ Calculate z-scores for scikit-learn LogisticRegression.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x: matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.matrix(np.insert(np.array(x), 0, 1, axis=1))
    ans = np.zeros((m, m))
    for i in range(n):
        ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(np.matrix(ans))
    se = np.sqrt(np.diag(vcov))
    t = coefs / se
    p = (1 - norm.cdf(abs(t))) * 2
    return p

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.
