I wonder how to compute precision and recall using a confusion matrix for a multi-class classification problem. Specifically, an observation can only be assigned to its most probable class / label. I would like to compute:
Precision = TP / (TP+FP)
Recall = TP / (TP+FN)
for each class, and then compute the micro-averaged F-measure.
In a 2-hypothesis case, the confusion matrix is usually:
|       | Declare H1 | Declare H0 |
| ----- | ---------- | ---------- |
| Is H1 | TP         | FN         |
| Is H0 | FP         | TN         |
where I've used something similar to your notation:

- TP = true positive (declare H1 when, in truth, H1),
- FN = false negative (declare H0 when, in truth, H1),
- FP = false positive (declare H1 when, in truth, H0),
- TN = true negative (declare H0 when, in truth, H0).
From the raw data, the values in the table would typically be the counts for each occurrence over the test data. From this, you should be able to compute the quantities you need.
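For instance, with hypothetical counts TP = 10, FN = 5, FP = 3, this gives $\text{Precision} = \frac{10}{10+3} \approx 0.77$ and $\text{Recall} = \frac{10}{10+5} \approx 0.67$.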
Edit
The generalization to multi-class problems is to sum over rows / columns of the confusion matrix. Given that the matrix is oriented as above, i.e., that a given row of the matrix corresponds to a specific value for the "truth", we have:
$\text{Precision}_i = \cfrac{M_{ii}}{\sum_j M_{ji}}$
$\text{Recall}_i = \cfrac{M_{ii}}{\sum_j M_{ij}}$
That is, precision is the fraction of events where we correctly declared $i$ out of all instances where the algorithm declared $i$. Conversely, recall is the fraction of events where we correctly declared $i$ out of all of the cases where the true state of the world is $i$.
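For the micro-averaged F-measure asked about, the counts are pooled over classes before forming the ratios. In this single-label setting, every false positive for one class is a false negative for another, so the micro-averaged quantities coincide:
$\text{Precision}_{\text{micro}} = \text{Recall}_{\text{micro}} = F_{\text{micro}} = \cfrac{\sum_i M_{ii}}{\sum_{i,j} M_{ij}}$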
Good summary paper, looking at these metrics for multi-class problems:
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45, 427–437.
The abstract reads:
This paper presents a systematic analysis of twenty four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. Then the analysis concentrates on the type of changes to a confusion matrix that do not change a measure, therefore, preserve a classifier’s evaluation (measure invariance). The result is the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.
Using sklearn or tensorflow and numpy:
from sklearn.metrics import confusion_matrix
# or:
# from tensorflow.math import confusion_matrix
import numpy as np

labels = ...       # true labels
predictions = ...  # predicted labels

cm = confusion_matrix(labels, predictions)
# rows of cm index the true labels and columns the predictions,
# so recall divides the diagonal by row sums and precision by column sums
recall = np.diag(cm) / np.sum(cm, axis=1)
precision = np.diag(cm) / np.sum(cm, axis=0)
To get overall (macro-averaged) measures of precision and recall, then use
np.mean(recall)
np.mean(precision)
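For the micro-averaged versions, pool the counts before dividing; for single-label multi-class data this reduces to overall accuracy (a minimal sketch reusing cm from above):

micro = np.trace(cm) / np.sum(cm)  # micro precision == micro recall == micro F1 here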
Cristian Garcia's code can be reduced by using sklearn:
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='micro')
0.3333333333333333
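The micro-averaged F-measure from the original question is likewise a one-liner (a sketch; in this single-label case it equals the micro-averaged precision above):

>>> from sklearn.metrics import f1_score
>>> f1_score(y_true, y_pred, average='micro')
0.3333333333333333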
Here is a different view from the other answers that I think will be helpful to others. The goal here is to allow you to compute these metrics using basic laws of probability.
First, it helps to understand what a confusion matrix is telling us in general. Let $Y$ represent a class label and $\hat Y$ represent a class prediction. In the binary case, let the two possible values for $Y$ and $\hat Y$ be $0$ and $1$, which represent the classes. Next, suppose that the confusion matrix for $Y$ and $\hat Y$ is:
|         | $\hat Y = 0$ | $\hat Y = 1$ |
| ------- | ------------ | ------------ |
| $Y = 0$ | 10           | 20           |
| $Y = 1$ | 30           | 40           |
Next, let us normalize the rows and columns of this confusion matrix such that the sum of all its elements is $1$. Currently, the sum of all elements is $10 + 20 + 30 + 40 = 100$, which serves as our normalization factor. After dividing each element of the confusion matrix by this factor, we get the following normalized confusion matrix:
|         | $\hat Y = 0$   | $\hat Y = 1$   |
| ------- | -------------- | -------------- |
| $Y = 0$ | $\frac{1}{10}$ | $\frac{2}{10}$ |
| $Y = 1$ | $\frac{3}{10}$ | $\frac{4}{10}$ |
With this formulation of the confusion matrix, we can interpret $Y$ and $\hat Y$ slightly differently. We can interpret them as jointly Bernoulli (binary) random variables, where their normalized confusion matrix represents their joint probability mass function. When we interpret $Y$ and $\hat Y$ this way, the definitions of precision and recall are much easier to remember using Bayes' rule and the law of total probability:
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 0 , \hat Y = 1)} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 1 , \hat Y = 0)}
\end{align}
How do we determine these probabilities? We can estimate them using the normalized confusion matrix. From the table above, we see that
\begin{align}
P(Y = 0 , \hat Y = 0) &\approx \frac{1}{10} \\
P(Y = 0 , \hat Y = 1) &\approx \frac{2}{10} \\
P(Y = 1 , \hat Y = 0) &\approx \frac{3}{10} \\
P(Y = 1 , \hat Y = 1) &\approx \frac{4}{10}
\end{align}
Therefore, the precision and recall for this specific example are
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{2}{10}} = \frac{4}{4 + 2} = \frac{2}{3} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{3}{10}} = \frac{4}{4 + 3} = \frac{4}{7}
\end{align}
Note that, from the calculations above, we didn't really need to normalize the confusion matrix before computing the precision and recall. The reason for this is that, because of Bayes' rule, we end up dividing one value that is normalized by another value that is normalized, which means that the normalization factor can be cancelled out.
A nice thing about this interpretation is that it generalizes to confusion matrices of any size. In the case where there are more than 2 classes, $Y$ and $\hat Y$ are no longer considered to be jointly Bernoulli, but rather jointly categorical. Moreover, we would need to specify which class we are computing the precision and recall for. In fact, the definitions above may be interpreted as the precision and recall for class $1$. We can also compute the precision and recall for class $0$, but these have different names in the literature (negative predictive value and specificity, respectively).
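As a quick check of the worked example above, a minimal numpy sketch (rows of cm index $Y$, columns index $\hat Y$, and class $1$ is the positive class):

import numpy as np

cm = np.array([[10, 20],
               [30, 40]])
joint = cm / cm.sum()                        # normalized confusion matrix (joint pmf)
precision = joint[1, 1] / joint[:, 1].sum()  # P(Y=1 | Yhat=1) = 2/3
recall = joint[1, 1] / joint[1, :].sum()     # P(Yhat=1 | Y=1) = 4/7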
I need to solve a nonlinear optimization problem in Python. I found out that SciPy solves optimization problems, but I don't know what I am doing wrong, since with some example input it can't find the correct solution that I get from the Knitro solver (AMPL) on the NEOS server.
My problem is this: given a set of points, find the biggest inscribed ellipse that at most touches those points; the points must never lie inside the ellipse.
Theory
Formulating the optimization problem, I have a and b as the semi-axes, phi as the rotation, xc and yc as the coordinates of the centre, and points as the list of points, each element of the form [x, y] (indices [0, 1]).
On paper, the problem and the constraints are these (a, b, phi, xc, yc are real; the points are integers):
NEOS
The files I used in NEOS are these:

- mod
- dat
- run
With successful results (complete):
xc = 143.012
yc = 262.634
a = 181.489
b = 140.429
phi = 1.43575
Python
So, this is my Python code. It is my first time using SciPy for optimization, so I don't rule out errors in my understanding of how it works from the documentation.
from typing import List
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

def ellipse_calc(
        points: List[List[int]],
        verbose: bool = False
):
    # centroid of the points, used as the initial guess for the centre
    centre = [0, 0]
    for i in range(len(points)):
        centre[0] += points[i][0]
        centre[1] += points[i][1]
    centre[0] /= len(points)
    centre[1] /= len(points)
    if verbose:
        print(f'centre: {centre[0]:.2f}, {centre[1]:.2f}')

    max_x = max([p[0] for p in points])
    max_y = max([p[1] for p in points])
    min_x = min([p[0] for p in points])
    min_y = min([p[1] for p in points])
    initial_axis = 0.25 * (max_x - min_x + max_y - min_y)
    if verbose:
        print(initial_axis)

    constraints = [
        NonlinearConstraint(lambda x: x[0], 1, np.inf),
        NonlinearConstraint(lambda x: x[1], 1, np.inf),
        NonlinearConstraint(lambda x: x[2], 0, np.inf),
    ]
    # one constraint per point: the point must lie on or outside the ellipse
    for i in range(len(points)):
        constraints += [NonlinearConstraint(
            lambda x:
                (points[i][0] - x[3]) ** 2 * (np.cos(x[2]) ** 2 / x[0]**2 + np.sin(x[2]) ** 2 / x[1]**2) +
                (points[i][1] - x[4]) ** 2 * (np.sin(x[2]) ** 2 / x[0]**2 + np.cos(x[2]) ** 2 / x[1]**2) +
                2 * (points[i][0] - x[3]) * (points[i][1] - x[4]) *
                np.cos(x[2]) * np.sin(x[2]) * (1 / x[1]**2 - 1 / x[0]**2), 1, np.inf)]

    # maximize the area by minimizing its negative; x = [a, b, phi, xc, yc]
    result = minimize(
        lambda x: -np.pi * x[0] * x[1],
        [initial_axis, initial_axis, 0, centre[0], centre[1]],
        constraints=constraints
    )
    print(result)

if __name__ == '__main__':
    points = [[50,44],[91,44],[161,44],[177,44],[44,88],[189,88],[239,88],[259,88],[2,132],[250,132],[2,176],[329,176],[2,220],[289,220],[2,264],[288,264],[2,308],[277,308],[2,352],[285,352],[2,396],[25,396],[35,396],[231,396],[284,396],[298,396],[36,440],[76,440],[106,440],[173,440]]
    ellipse_calc(points, True)
This attempt, which uses the same data I tried on NEOS, gives the following output:
fun: -8.992626773255127e+40
jac: array([-5.68832805e+20, -4.96651566e+20, -0.00000000e+00, -0.00000000e+00,
-0.00000000e+00])
message: 'Inequality constraints incompatible'
nfev: 54
nit: 10
njev: 9
status: 4
success: False
x: array([ 1.58089104e+20, 1.81065104e+20, -1.24564497e+15, -1.55647883e+10,
-2.76654483e+10])
Does anyone know what I am doing wrong and how to fix it? Also, I don't really know whether it is possible to solve this problem with SciPy; if not, I am looking for a free library to solve it, or even for alternative methods of finding that ellipse equation.
This isn't a complete answer, but it should help you to get started. Here are two hints:
Pass simple box constraints on the variables as boundaries, not as constraints. That is, use
bounds = [(1, None), (1, None), (0, None), (None, None), (None, None)]
and pass it to minimize via the bounds parameter.
You need to be really careful when defining constraints through lambda expressions inside a loop, see here. You need to capture the loop variable i by `lambda x, i=i: your_fun`. Otherwise, each of your constraints uses `i=29` and thus evaluates the last point. This can easily be observed by evaluating all constraints for a specific value.
Then you should at least get a feasible solution with an objective value of 79384. Note also that you can shorten your code significantly by using numpy functions instead of loops.
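Putting both hints together, a minimal sketch of the corrected constraint loop (same objective and constraint expression as in the question; only the bounds and the `i=i` capture change):

bounds = [(1, None), (1, None), (0, None), (None, None), (None, None)]

constraints = []
for i in range(len(points)):
    # i=i freezes the current loop value, so each constraint keeps its own point
    constraints.append(NonlinearConstraint(
        lambda x, i=i:
            (points[i][0] - x[3]) ** 2 * (np.cos(x[2]) ** 2 / x[0]**2 + np.sin(x[2]) ** 2 / x[1]**2) +
            (points[i][1] - x[4]) ** 2 * (np.sin(x[2]) ** 2 / x[0]**2 + np.cos(x[2]) ** 2 / x[1]**2) +
            2 * (points[i][0] - x[3]) * (points[i][1] - x[4]) *
            np.cos(x[2]) * np.sin(x[2]) * (1 / x[1]**2 - 1 / x[0]**2),
        1, np.inf))

result = minimize(
    lambda x: -np.pi * x[0] * x[1],
    [initial_axis, initial_axis, 0, centre[0], centre[1]],
    bounds=bounds,
    constraints=constraints
)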
I am working on the estimation module of a prototype. Its purpose is to send proper seasonality variation parameters to the forecaster module.
Initially, in the booking curve estimation, we were using a formula for day-of-year seasonality: a trigonometric function with 5 fixed orders. It goes like this:
doy_seasonality = np.exp(z[0]*np.sin(2*np.pi*doy/365.) + z[1]*np.cos(2*np.pi*doy/365.)
                         + z[2]*np.sin(4*np.pi*doy/365.) + z[3]*np.cos(4*np.pi*doy/365.)
                         + z[4]*np.sin(6*np.pi*doy/365.) + z[5]*np.cos(6*np.pi*doy/365.)
                         + z[6]*np.sin(8*np.pi*doy/365.) + z[7]*np.cos(8*np.pi*doy/365.)
                         + z[8]*np.sin(10*np.pi*doy/365.) + z[9]*np.cos(10*np.pi*doy/365.))
i.e. we had 5 fixed orders [2, 4, 6, 8, 10]
Now we have found a better way to get the orders, through the Fast Fourier Transform. Depending on the estimation key we use as input in the simulation, the order array can have a different number of values.
For instance, let's say the order array is as follows
orders = [2, 6, 10, 24]
Corresponding to every order value, there would be two values of z (it's a trigonometric parameter - one value for SIN part and one value for COS part). For example, it could look like this
z = [0.08 0.11 0.25 0.01 0.66 0.19 0.45 0.07]
To achieve this, I would need to define a for-loop with two parallel iterations:

- `z[0]` to `z[2*len(orders)-1]`, i.e. `z[0]` to `z[7]`
- `orders[0]` to `orders[len(orders)-1]`, i.e. `orders[0]` to `orders[3]`

Ultimately, the formula should compute this:
doy_seasonality = np.exp(z[0]*np.sin(orders[0]*np.pi*doy/365.)+z[1]*np.cos(orders[0]*np.pi*doy/365.)
+z[2]*np.sin(orders[1]*np.pi*doy/365.)+ z[3]*np.cos(orders[1]*np.pi*doy/365.)
+z[4]*np.sin(orders[2]*np.pi*doy/365.)+ z[5]*np.cos(orders[2]*np.pi*doy/365.)
+z[6]*np.sin(orders[3]*np.pi*doy/365.)+ z[7]*np.cos(orders[3]*np.pi*doy/365.))
I am not able to design the appropriate syntax for this.
doy (day of year) is a vector of equally spaced values: 1, 2, 3, ..., 364, 365.
orders = np.array([2, 6, 10, 24])
z = np.array([0.08, 0.11, 0.25, 0.01, 0.66, 0.19, 0.45, 0.07])
doy = np.arange(365) + 1

s = 0
for k in range(len(orders)):
    # z[2*k] multiplies the sine term and z[2*k + 1] the cosine term of order k
    s += z[2 * k] * np.sin(orders[k] * np.pi * doy / 365.)
    s += z[2 * k + 1] * np.cos(orders[k] * np.pi * doy / 365.)
s = np.exp(s)
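If you want to avoid the explicit loop entirely, an equivalent vectorized sketch (assuming the same orders, z, and doy as above):

angles = orders[:, None] * np.pi * doy[None, :] / 365.  # shape (len(orders), len(doy))
s = np.exp(z[0::2] @ np.sin(angles) + z[1::2] @ np.cos(angles))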
EDIT: I already made significant progress. My current question is written after my last edit below and can be answered without the context.
I currently follow Andrew Ng's Machine Learning Course on Coursera and tried to implement logistic regression today.
Notation:
- X is an (m x n)-matrix with the input vectors as rows (m training samples of n-1 variables; the entries of the first column are all equal to 1 to represent a constant term).
- y is the corresponding vector of expected output samples (a column vector with m entries equal to 0 or 1).
- theta is the vector of model coefficients (a row vector with n entries).
For an input row vector x the model will predict the probability sigmoid(x * theta.T) for a positive outcome.
This is my Python3/numpy implementation:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

def logistic_cost(X, y, theta):
    summands = np.multiply(y, np.log(vec_sigmoid(X*theta.T))) + np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
    return - np.sum(summands) / len(y)

def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]  # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])  # init theta
    cost = [0.0 for i in range(num_iterations)]

    for it in range(num_iterations):
        error = np.repeat(vec_sigmoid(X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = logistic_cost(X, y, theta)

    return theta, cost
This implementation seems to work fine, but I encountered a problem when calculating the logistic cost. At some point the gradient descent algorithm converges to a pretty well-fitting theta, and the following happens: for some input row X_i with expected outcome 1, X_i * theta.T will become positive with a good margin (for example 23.207). This will lead to sigmoid(X_i * theta.T) becoming exactly 1.0000 (because of lost floating-point precision, I think). This is a good prediction (since the expected outcome is 1), but it breaks the calculation of the logistic cost, since np.log(1 - vec_sigmoid(X*theta.T)) will evaluate to NaN. This shouldn't be a problem, since the term is multiplied by 1 - y = 0, but once a NaN occurs, the whole calculation is broken (0 * NaN = NaN).
How should I handle this in the vectorized implementation, since np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T))) is calculated for every row of X (not only where y = 0)?
Example input:
X = np.matrix([[1. , 0. , 0. ],
[1. , 1. , 0. ],
[1. , 0. , 1. ],
[1. , 0.5, 0.3],
[1. , 1. , 0.2]])
y = np.matrix([[0],
[1],
[1],
[0],
[1]])
Then theta, _ = gradient_descent(X, y, 10000, 10000) (yes, in this case we can set the learning rate this large) will set theta as:
theta = np.matrix([[-3000.04008972, 3499.97995514, 4099.98797308]])
This leads to vec_sigmoid(X * theta.T) giving the really good prediction of:
np.matrix([[0.00000000e+00], # 0
[1.00000000e+00], # 1
[1.00000000e+00], # 1
[1.95334953e-09], # nearly zero
[1.00000000e+00]]) # 1
but logistic_cost(X, y, theta) evaluates to NaN.
EDIT:
I came up with the following solution. I just replaced the logistic_cost function with:
def new_logistic_cost(X, y, theta):
    term1 = vec_sigmoid(X*theta.T)
    term1[y == 0] = 1
    term2 = 1 - vec_sigmoid(X*theta.T)
    term2[y == 1] = 1
    summands = np.multiply(y, np.log(term1)) + np.multiply(1 - y, np.log(term2))
    return - np.sum(summands) / len(y)
By using the mask, I just calculate log(1) at the places where the result will be multiplied by zero anyway. Now log(0) will only happen in incorrect implementations of gradient descent.
Open questions: How can I make this solution cleaner? Is it possible to achieve a similar effect in a cleaner way?
If you don't mind using SciPy, you could import expit and xlog1py from scipy.special:
from scipy.special import expit, xlog1py
and replace the expression
np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
with
xlog1py(1 - y, -expit(X*theta.T))
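If you want the whole cost function in that style, here is a minimal sketch; xlog1py(x, y) computes x*log1p(y), and xlogy is the analogous scipy.special helper for the y*log(p) term (both treat 0 * log(0) as 0):

from scipy.special import expit, xlogy, xlog1py
import numpy as np

def stable_logistic_cost(X, y, theta):
    p = expit(X * theta.T)                       # predicted probabilities
    summands = xlogy(y, p) + xlog1py(1 - y, -p)  # y*log(p) + (1-y)*log(1-p)
    return -np.sum(summands) / len(y)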
I know it is an old question, but I ran into the same problem, and maybe this can help others in the future: I actually solved it by normalizing the data before appending X0.
def normalize_data(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
After this all worked well!
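Hypothetical usage with the feature columns from the question's example (normalize the raw features first, then prepend the column of ones):

import numpy as np

X_raw = np.array([[0., 0.], [1., 0.], [0., 1.], [0.5, 0.3], [1., 0.2]])
X = np.matrix(np.hstack([np.ones((X_raw.shape[0], 1)), normalize_data(X_raw)]))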
Assuming that I have defined two random variables in SymPy:
from sympy.stats import Normal

x = Normal('x', 0, 2)
y = 2*x + Normal('z', 0, 3)
Now, given the evidence that y = 4, is it possible to define a new random variable that follows the posterior distribution P(x | y=4)?
It is easy to simply multiply the two probability density functions manually, but I wonder whether SymPy has a feature to yield such a random variable directly.
The typical way is to pass the condition as the second argument, without creating a new random symbol. For example:
density(x, Eq(y, 4)) # Lambda(x, 5*sqrt(2)*exp(8/25)*exp(-x**2/8)*exp(-2*(-x + 2)**2/9)/(12*sqrt(pi)))
P(x > 0, Eq(y, 4)) # -erfc(8*sqrt(2)/15)/2 + 1
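A minimal runnable version of the above, assuming the setup from the question:

from sympy import Eq
from sympy.stats import Normal, density, P

x = Normal('x', 0, 2)
y = 2*x + Normal('z', 0, 3)

print(density(x, Eq(y, 4)))  # the posterior density as a Lambda in x
print(P(x > 0, Eq(y, 4)))    # posterior probability that x is positive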
But it's also possible to create a random variable with a custom density using ContinuousRV:
from sympy import Symbol
from sympy.stats import ContinuousRV

x_post = Symbol("x_post")
X_post = ContinuousRV(x_post, density(x, Eq(y, 4))(x_post))
For example, simplify(E(X_post)) returns 16*erf(3*sqrt(2)/10)/25 + 16*erfc(3*sqrt(2)/10)/25 + 16/25.