imlementation of binomial coefficient in Google JAX - jax

trying to implement custom MLE for binomial distribution (for learning purpose) stuck with implantation of binomial coefficient in google JAX . there is no analog for scipy.special.binom() implemented.
what shall i use instead ?

The binomial coefficient for general real-valued inputs can be computed in terms of the gamma function, which is available in JAX via jax.scipy.special.gammaln. Here's one way you could define it:
def binom(x, y):
return jnp.exp(gammaln(x + 1) - gammaln(y + 1) - gammaln(x - y + 1))

Here is a (sequential) integer implementation using JAX.
def binom_int_seq(x : int, y : int):
def scan_body(carry, values):
n, d = values
carry = (carry*n)//d
return carry, None
y = max(y, x-y)
nd = jnp.concatenate(
(jnp.arange(y+2, x+1, dtype = 'u8')[:,None],
jnp.arange(2, x-y+1, dtype = 'u8')[:,None],),
axis = 1
)
bc, *_ = jax.lax.scan(scan_body, jnp.array(y+1, dtype = 'u8'), nd)
return bc
binom_int_seq_jit = jax.jit(binom_int_seq, static_argnums = (0, 1))
which gives
x, y = 60, 31
bc_ref = sp.special.comb(x, y, exact=True)
# 114449595062769120
binom_int_seq(x, y)-bc_ref
# DeviceArray(0, dtype=uint64)
# Using above logarithmic gamma function based implementation
binom(x, y)-bc_ref
# DeviceArray(496., dtype=float64, weak_type=True)
Keep in mind the binom_int_seq implementation is only correct if
(x-max(x-y, y))*sp.special.comb(x, y, exact=True) < jnp.iinfo(jnp.uint64).max
Unlike the real-valued version, the error will be sudden and catastrophic if this condition is not satisfied.
There may be other ways to increase this constraint, such as running cancellations based upon prime factorisation, without resorting to larger unsigned integers (/arbitrary precision).
A monoidal version could be implemented which computes the binomial coefficient numerator and denominator reductions then integer divides, but this places stricter constraints on the maximum arguments.

Related

How to measure sensitivity, speficity, PPV and NPV from confusion matrix for multiclass multiclassification [migrated]

I wonder how to compute precision and recall using a confusion matrix for a multi-class classification problem. Specifically, an observation can only be assigned to its most probable class / label. I would like to compute:
Precision = TP / (TP+FP)
Recall = TP / (TP+FN)
for each class, and then compute the micro-averaged F-measure.
In a 2-hypothesis case, the confusion matrix is usually:
Declare H1
Declare H0
Is H1
TP
FN
Is H0
FP
TN
where I've used something similar to your notation:
TP = true positive (declare H1 when, in truth, H1),
FN = false negative (declare H0 when, in truth, H1),
FP = false positive
TN = true negative
From the raw data, the values in the table would typically be the counts for each occurrence over the test data. From this, you should be able to compute the quantities you need.
Edit
The generalization to multi-class problems is to sum over rows / columns of the confusion matrix. Given that the matrix is oriented as above, i.e., that
a given row of the matrix corresponds to specific value for the "truth", we have:
$\text{Precision}_{~i} = \cfrac{M_{ii}}{\sum_j M_{ji}}$
$\text{Recall}_{~i} = \cfrac{M_{ii}}{\sum_j M_{ij}}$
That is, precision is the fraction of events where we correctly declared $i$
out of all instances where the algorithm declared $i$. Conversely, recall is the fraction of events where we correctly declared $i$ out of all of the cases where the true of state of the world is $i$.
Good summary paper, looking at these metrics for multi-class problems:
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45, p. 427-437. (pdf)
The abstract reads:
This paper presents a systematic analysis of twenty four performance
measures used in the complete spectrum of Machine Learning
classification tasks, i.e., binary, multi-class, multi-labelled, and
hierarchical. For each classification task, the study relates a set of
changes in a confusion matrix to specific characteristics of data.
Then the analysis concentrates on the type of changes to a confusion
matrix that do not change a measure, therefore, preserve a
classifier’s evaluation (measure invariance). The result is the
measure invariance taxonomy with respect to all relevant label
distribution changes in a classification problem. This formal analysis
is supported by examples of applications where invariance properties
of measures lead to a more reliable evaluation of classifiers. Text
classification supplements the discussion with several case studies.
Using sklearn or tensorflow and numpy:
from sklearn.metrics import confusion_matrix
# or:
# from tensorflow.math import confusion_matrix
import numpy as np
labels = ...
predictions = ...
cm = confusion_matrix(labels, predictions)
recall = np.diag(cm) / np.sum(cm, axis = 1)
precision = np.diag(cm) / np.sum(cm, axis = 0)
To get overall measures of precision and recall, use then
np.mean(recall)
np.mean(precision)
#Cristian Garcia code can be reduced by sklearn.
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='micro')
Here is a different view from the other answers that I think will be helpful to others. The goal here is to allow you to compute these metrics using basic laws of probability.
First, it helps to understand what a confusion matrix is telling us in general. Let $Y$ represent a class label and $\hat Y$ represent a class prediction. In the binary case, let the two possible values for $Y$ and $\hat Y$ be $0$ and $1$, which represent the classes. Next, suppose that the confusion matrix for $Y$ and $\hat Y$ is:
$\hat Y = 0$
$\hat Y = 1$
$Y = 0$
10
20
$Y = 1$
30
40
With hindsight, let us normalize the rows and columns of this confusion matrix, such that the sum of all elements of the confusion matrix is $1$. Currently, the sum of all elements of the confusion matrix is $10 + 20 + 30 + 40 = 100$, which is our normalization factor. After dividing the elements of the confusion matrix by the normalization factor, we get the following normalized confusion matrix:
$\hat Y = 0$
$\hat Y = 1$
$Y = 0$
$\frac{1}{10}$
$\frac{2}{10}$
$Y = 1$
$\frac{3}{10}$
$\frac{4}{10}$
With this formulation of the confusion matrix, we can interpret $Y$ and $\hat Y$ slightly differently. We can interpret them as jointly Bernoulli (binary) random variables, where their normalized confusion matrix represents their joint probability mass function. When we interpret $Y$ and $\hat Y$ this way, the definitions of precision and recall are much easier to remember using Bayes' rule and the law of total probability:
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 0 , \hat Y = 1)} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{P(Y = 1 , \hat Y = 1)}{P(Y = 1 , \hat Y = 1) + P(Y = 1 , \hat Y = 0)}
\end{align}
How do we determine these probabilities? We can estimate them using the normalized confusion matrix. From the table above, we see that
\begin{align}
P(Y = 0 , \hat Y = 0) &\approx \frac{1}{10} \\
P(Y = 0 , \hat Y = 1) &\approx \frac{2}{10} \\
P(Y = 1 , \hat Y = 0) &\approx \frac{3}{10} \\
P(Y = 1 , \hat Y = 1) &\approx \frac{4}{10}
\end{align}
Therefore, the precision and recall for this specific example are
\begin{align}
\text{Precision} &= P(Y = 1 \mid \hat Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{2}{10}} = \frac{4}{4 + 2} = \frac{2}{3} \\
\text{Recall} &= P(\hat Y = 1 \mid Y = 1) = \frac{\frac{4}{10}}{\frac{4}{10} + \frac{3}{10}} = \frac{4}{4 + 3} = \frac{4}{7}
\end{align}
Note that, from the calculations above, we didn't really need to normalize the confusion matrix before computing the precision and recall. The reason for this is that, because of Bayes' rule, we end up dividing one value that is normalized by another value that is normalized, which means that the normalization factor can be cancelled out.
A nice thing about this interpretation is that it can be generalized to confusion matrices of any size. In the case where there are more than 2 classes, $Y$ and $\hat Y$ are no longer considered to be jointly Bernoulli, but rather jointly categorical. Moreover, we would need to specify which class we are computing the precision and recall for. In fact, the definitions above may be interpreted as the precision and recall for class $1$. We can also compute the precision and recall for class $0$, but these have different names in the literature.

Pytorch: Custom thresholding activation function - gradient

I created an activation function class Threshold that should operate on one-hot-encoded image tensors.
The function performs min-max feature scaling on each channel followed by thresholding.
class Threshold(nn.Module):
def __init__(self, threshold=.5):
super().__init__()
if threshold < 0.0 or threshold > 1.0:
raise ValueError("Threshold value must be in [0,1]")
else:
self.threshold = threshold
def min_max_fscale(self, input):
r"""
applies min max feature scaling to input. Each channel is treated individually.
input is assumed to be N x C x H x W (one-hot-encoded prediction)
"""
for i in range(input.shape[0]):
# N
for j in range(input.shape[1]):
# C
min = torch.min(input[i][j])
max = torch.max(input[i][j])
input[i][j] = (input[i][j] - min) / (max - min)
return input
def forward(self, input):
assert (len(input.shape) == 4), f"input has wrong number of dims. Must have dim = 4 but has dim {input.shape}"
input = self.min_max_fscale(input)
return (input >= self.threshold) * 1.0
When I use the function I get the following error, since the gradients are not calculated automatically I assume.
Variable._execution_engine.run_backward(RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I already had a look at How to properly update the weights in PyTorch? but could not get a clue how to apply it to my case.
How is it possible to calculate the gradients for this function?
Thanks for your help.
The issue is you are manipulating and overwriting elements, this time of operation can't be tracked by autograd. Instead, you should stick with built-in functions. You example is not that tricky to tackle: you are looking to retrieve the minimum and maximum values along input.shape[0] x input.shape[1]. Then you will scale your whole tensor in one go i.e. in vectorized form. No for loops involved!
One way to compute min/max along multiple axes is to flatten those:
>>> x_f = x.flatten(2)
Then, find the min-max on the flattened axis while retaining all shapes:
>>> x_min = x_f.min(axis=-1, keepdim=True).values
>>> x_max = x_f.max(axis=-1, keepdim=True).values
The resulting min_max_fscale function would look something like:
class Threshold(nn.Module):
def min_max_fscale(self, x):
r"""
Applies min max feature scaling to input. Each channel is treated individually.
Input is assumed to be N x C x H x W (one-hot-encoded prediction)
"""
x_f = x.flatten(2)
x_min, x_max = x_f.min(-1, True).values, x_f.max(-1, True).values
x_f = (x_f - x_min) / (x_max - x_min)
return x_f.reshape_as(x)
Important note:
You would notice that you can now backpropagate on min_max_fscale... but not on forward. This is because you are applying a boolean condition which is not a differentiable operation.

How to calculate geometric mean in a differentiable way?

How to calculate goemetric mean along a dimension using Pytorch? Some numbers can be negative. The function must be differentiable.
A known (reasonably) numerically-stable version of the geometric mean is:
import torch
def gmean(input_x, dim):
log_x = torch.log(input_x)
return torch.exp(torch.mean(log_x, dim=dim))
x = torch.Tensor([2.0] * 1000).requires_grad_(True)
print(gmean(x, dim=0))
# tensor(2.0000, grad_fn=<ExpBackward>)
This kind of implementation can be found, for example, in SciPy (see here), which is a quite stable lib.
The implementation above does not handle zeros and negative numbers. Some will argue that the geometric mean with negative numbers is not well-defined, at least when not all of them are negative.
torch.prod() helps:
import torch
x = torch.FloatTensor(3).uniform_().requires_grad_(True)
print(x)
y = x.prod() ** (1.0/x.shape[0])
print(y)
y.backward()
print(x.grad)
# tensor([0.5692, 0.7495, 0.1702], requires_grad=True)
# tensor(0.4172, grad_fn=<PowBackward0>)
# tensor([0.2443, 0.1856, 0.8169])
EDIT: ?what about
y = (x.abs() ** (1.0/x.shape[0]) * x.sign() ).prod()

Masking and Instance Normalization in PyTorch

Assume I have a PyTorch tensor, arranged as shape [N, C, L] where N is the batch size, C is the number of channels or features, and L is the length. In this case, if one wishes to perform instance normalization, one does something like:
N = 20
C = 100
L = 40
m = nn.InstanceNorm1d(C, affine=True)
input = torch.randn(N, C, L)
output = m(input)
This will perform a normalization in the L-wise dimension for each N*C = 2000 slices of data, subtracting 2000 means, scaling by 2000 standard deviations, and re-scaling by 100 learnable weight and bias parameters (one per channel). The unspoken assumption here is that all of these values exist and are meaningful.
But I have a situation where, for the slice N=1, I would like to exclude all data after (say) L=35. For the slice N=2 (say) all the data are valid. For the slice N=3, exclude all data after L=30, etc. This mimics data which are one dimensional time sequences, having multiple features, but which are not the same length.
How can I perform an instance norm on such data, get correct statistics, and maintain differentiability/AutoGrad information in PyTorch?
Update: While maintaining GPU performance, or at least not killing it dead.
I cannot...
...Mask with zero values, as this destroys the computer means and variances giving erroneous results
...Mask with np.nan or np.inf, as PyTorch tensors do not ignore such values, but treat them as errors. They are sticky, and lead to garbage results. PyTorch currently lacks the equivalent of np.nanmean and np.nanvar.
...Permute or transpose to an amenable arrangement of data; no such approach gives me what I need
...Use a pack_padded_sequence; instance normalization does not operate on that data structure, and one cannot import data into that structure as far as I know. Also, data re-arrangement would still be necessary, see 3 above.
Am I missing an approach which would give me what I need? Or perhaps am I missing a method of data re-arrangement which would allow 3 or 4 above to work?
This is an issue faced by recurrent neural networks all the time, hence the pack_padded_sequence functionality, but it isn't quite applicable here.
I don't think this is directly possible to implement using the existing InstanceNorm1d, the easiest way would probably be implementing it yourself from scratch. I did a quick implementation that should work. To make it a little bit more general this module requires a boolean mask (a boolean tensor of the same size as the input) that specifies which elements should be considered when passing through the instance norm.
import torch
class MaskedInstanceNorm1d(torch.nn.Module):
def __init__(self, num_features, eps=1e-6, momentum=0.1, affine=True, track_running_stats=False):
super().__init__()
self.num_features = num_features
self.eps = eps
self.momentum = momentum
self.affine = affine
self.track_running_stats = track_running_stats
self.gamma = None
self.beta = None
if self.affine:
self.gamma = torch.nn.Parameter(torch.ones((1, self.num_features, 1), requires_grad=True))
self.beta = torch.nn.Parameter(torch.zeros((1, self.num_features, 1), requires_grad=True))
self.running_mean = None
self.running_variance = None
if self.affine:
self.running_mean = torch.zeros((1, self.num_features, 1), requires_grad=True)
self.running_variance = torch.zeros((1, self.num_features, 1), requires_grad=True)
def forward(self, x, mask):
mean = torch.zeros((1, self.num_features, 1), requires_grad=False)
variance = torch.ones((1, self.num_features, 1), requires_grad=False)
# compute masked mean and variance of batch
for c in range(self.num_features):
if mask[:, c, :].any():
mean[0, c, 0] = x[:, c, :][mask[:, c, :]].mean()
variance[0, c, 0] = (x[:, c, :][mask[:, c, :]] - mean[0, c, 0]).pow(2).mean()
# update running mean and variance
if self.training and self.track_running_stats:
for c in range(self.num_features):
if mask[:, c, :].any():
self.running_mean[0, c, 0] = (1-self.momentum) * self.running_mean[0, c, 0] \
+ self.momentum * mean[0, c, 0]
self.running_variance[0, c, 0] = (1-self.momentum) * self.running_variance[0, c, 0] \
+ self.momentum * variance[0, c, 0]
# compute output
x = (x - mean)/(self.eps + variance).sqrt()
if self.affine:
x = x * self.gamma + self.beta
return x

Improper cost function outputs for Vectorized Logistic Regression

I'm trying to implement vectorized logistic regression on the Iris dataset. This is the implementation from Andrew Ng's youtube series on deep learning. My best predictions using this method have been 81% accuracy while sklearn's implementation achieves 100% with completely different values for coefficients and bias. Also, I cant seem to get get proper outputs from my cost function. I suspect it is an issue with computing the gradients of the weights and bias with respect to the cost function though in the course he provides all of the necessary equations ( unless there is something in the actual exercise which I don't have access to being left out.) My code is as follows.
n = 4
m = 150
y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1,1))
for epoch in range(1000):
Z = np.dot(W.T, X) + b
A = sigmoid(Z) # 1/(1 + e **(-Z))
J = -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A))) #cost function
dz = A - y
dw = 1/m * np.dot(X, dz.T)
db = np.sum(dz)
W = W - 0.01 * dw
b = b - 0.01 * db
if epoch % 100 == 0:
print(J)
My output looks something like this.
-1.6126604413879289
-1.6185960074767125
-1.6242504226045396
-1.6296400635926438
-1.6347800862216104
-1.6396845400653066
-1.6443664703028427
-1.648838008214648
-1.653110451818512
-1.6571943378913891
W and b values are:
array([[-0.68262679, -1.56816916, 0.12043066, 1.13296948]])
array([[0.53087131]])
Where as sklearn outputs:
(array([[ 0.41498833, 1.46129739, -2.26214118, -1.0290951 ]]),
array([0.26560617]))
I understand sklearn uses L2 regularization but even when turned off it's still far from the correct values. Any help would be appreciated. Thanks
You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem, y can be one of three values:
> np.unique(iris.target)
> array([0, 1, 2])
The cross entropy cost function expects y to either be one or zero. One way to handle this is the one vs all method.
You can check each class by making y a boolean of whether the iris in in one class or not. For example here you can make y a data set of either class 1 or not:
y = (iris.target == 1).astype(int)
With that your cost function and gradient descent should work, but you'll need to run it multiple times and pick the best score for each example. Andrew Ng's class talks about this method.
EDIT:
It's not clear what you are starting with for data. When I do this, don't reshape the inputs. So you should double check that all your multiplication is delivering the shapes you want. On thing I notice that's a little odd, is the last term in your cost function. I generally do this:
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
not:
-1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
Here's code that converges for me using the dataset from sklearn:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
# Iris is a multiclass problem. Here, just calculate the probabily that
# the class is `iris_class`
iris_class = 0
Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
# Y is now a data set of booleans indicating whether the sample is or isn't a member of iris_class
# initialize w and b
W = np.random.randn(4, 1)
b = np.random.randn(1, 1)
a = 0.1 # learning rate
m = Y.shape[0] # number of samples
def sigmoid(Z):
return 1/(1 + np.exp(-Z))
for i in range(1000):
Z = np.dot(X ,W) + b
A = sigmoid(Z)
dz = A - Y
dw = 1/m * np.dot(X.T, dz)
db = np.mean(dz)
W -= a * dw
b -= a * db
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
if i%100 == 0:
print(cost)

Resources