Confidence thresholds on mean average precision calculation - scikit-learn

Are there any rules for choosing PR-curve thresholds? sklearn.metrics.average_precision_score derives the thresholds automatically from the probabilities/confidences, which can give a weird result with inputs like this:
y_true = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_scores = np.array([ 0.7088982, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It outputs mAP = 0.93333. The sklearn implementation gets that number because it automatically uses [0.7088982, 0] as the thresholds. When the probability threshold is 0, every zero-score sample is counted as positive, resulting in a high mAP. Is this correct behavior?

A couple of considerations on your example:
The peculiarity of your y_scores, which has only two distinct values, determines how many thresholds there are. As you can see from the source code, and as you might expect, one threshold is created per distinct value in y_scores.
Your reasoning is then correct and follows from what the threshold represents: if the score is greater than or equal to the threshold, the instance is assigned to the positive class. With score = threshold = 0, every instance is predicted positive, so precision at that threshold is 14/15 ≈ 0.9333 and recall is 1. At the threshold 0.7088982 only the single negative instance is predicted positive, so precision and recall are both 0. Average precision is a mean of the precisions achieved at each threshold, weighted by the increase in recall, which here gives (1 - 0) * 14/15 ≈ 0.9333.
Have a look at the precision_recall_curve documentation as well, which describes the returned values as
Precision values such that element i is the precision of predictions with score >= thresholds[i]
and
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i]
I'd also suggest having a look at its source code to get a glimpse of how precision, recall and thresholds are computed within precision_recall_curve().
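To make the thresholding concrete, here is a minimal pure-Python sketch that recomputes the number from the question by hand. It is not scikit-learn's implementation: the helper name average_precision_by_hand is made up for illustration, and it assumes that every distinct score is used as a threshold and that score >= threshold counts as a positive prediction.

```python
def average_precision_by_hand(y_true, y_scores):
    """Step-wise average precision: sum of (R_n - R_{n-1}) * P_n over thresholds."""
    total_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    steps = []
    # One threshold per distinct score, from the strictest to the loosest.
    for thr in sorted(set(y_scores), reverse=True):
        predicted_pos = [t for t, s in zip(y_true, y_scores) if s >= thr]
        tp = sum(predicted_pos)
        precision = tp / len(predicted_pos)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        steps.append((thr, precision, recall))
    return ap, steps


y_true = [0] + [1] * 14
y_scores = [0.7088982] + [0.0] * 14

ap, steps = average_precision_by_hand(y_true, y_scores)
for thr, p, r in steps:
    print(f"threshold={thr}: precision={p:.4f}, recall={r:.4f}")
print(f"average precision = {ap:.5f}")
```

Only the second threshold adds any recall, so its precision of 14/15 is the whole result.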

Related

PyTorch differentiable mask

How would I go about blacking out a portion of an image or feature map such that AutoGrad can backprop through the operation?
Specifically I want to black out everything except for n layers of border pixels. So if we consider a single channel of the feature map which looks like:
[
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
]
I set a constant n=1 so my operation does the following to the input:
[
[1, 1, 1, 1],
[1, 0, 0, 1],
[1, 0, 0, 1],
[1, 1, 1, 1],
]
In my case I'd be doing it to a multi channel feature map and all channels would be treated the same way.
If possible, I want to do it in a functional manner.
Considering the comments you added, i.e. that you don't need the output to be differentiable with respect to the mask (in other words, the mask is constant), you could just store the indices of the 1s in the mask and act only on the corresponding elements of whatever Tensor you're considering. Or, if you don't want to deal with fancy indexing, you could keep the mask as a Tensor of 0s and 1s and multiply it element-wise with whatever Tensor you're considering. Or, if you truly just need to compute a loss along the border pixels, extract the first and last rows and the first and last columns, taking care not to double-count the corners. This last solution is essentially the first solution recast for a special case.
To address the question in your comment to my answer:
x = torch.tensor([[1.0,2,3],[4,5,6]], requires_grad = True)
print(x[:,0])
gives
tensor([1., 4.], grad_fn=<SelectBackward>)
so we see that slicing does not interfere with the autograd engine (it is still tracking the contribution to the gradient). It is not too surprising that this works automatically; slicing can be viewed as the (mathematical) function that projects onto a subspace of R^n, for which it's easy to compute the gradient.
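The element-wise multiplication option could look like the sketch below. The helper border_mask and its arguments are hypothetical names chosen for illustration, the mask is treated as a constant, and n is assumed to be at least 1.

```python
import torch


def border_mask(height, width, n=1):
    """Constant 0/1 mask keeping n layers of border pixels (assumes n >= 1)."""
    mask = torch.ones(height, width)
    if height > 2 * n and width > 2 * n:
        mask[n:-n, n:-n] = 0.0  # black out the interior
    return mask


x = torch.ones(1, 3, 4, 4, requires_grad=True)  # (batch, channels, H, W)
mask = border_mask(4, 4, n=1)

masked = x * mask          # broadcasts over batch and channel dimensions
masked.sum().backward()    # gradients flow through the multiplication

print(masked[0, 0])
print(x.grad[0, 0])        # equals the mask: 1 on the border, 0 inside
```

Because the mask broadcasts, all channels are treated the same way, and the operation stays purely functional.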

Loss for binary sparsity

I have binary images (as the one below) at the output of my net. I need the '1's to be further from each other (not connected), so that they would form a sparse binary image (without white blobs). Something like salt-and-pepper noise. I am looking for a way to define a loss (in pytorch) that would punish based on the density of the '1's.
Thanks.
It depends on how you're generating said image. Since neural networks have to be trained by backpropagation, I'm rather sure your binary image is not the direct output of your neural network (i.e. not the thing you're applying the loss to), because gradients can't flow through binary (discrete) variables. I suspect you do something like pixel-wise binary cross entropy or similar and then threshold.
I assume your code works like that: you densely regress real-valued numbers and then apply thresholding, likely using sigmoid to map from [-inf, inf] to [0, 1]. If it is so, you can do the following. Build a convolution kernel which is 0 in the center and 1 elsewhere, of size related to how big you want your "sparsity gaps" to be.
kernel = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
Then you apply sigmoid to your real-valued output to squash it to [0, 1]:
squashed = torch.sigmoid(nn_output)
then you convolve squashed with kernel, which gives you the relaxed number of non-zero neighbors.
neighborhood = nn.functional.conv2d(squashed, kernel, padding=2)
and your loss will be the product of each pixel's value in squashed with the corresponding value in neighborhood:
sparsity_loss = (squashed * neighborhood).mean()
If you think of this loss applied to a binary image, the term for a given pixel p is non-zero if and only if p and at least one of its neighbors both have value 1; in that case it equals the number of such neighbors. Since we apply it to non-binary numbers in the [0, 1] range, it is a differentiable approximation of that.
Please note that I left out some of the details from the code above (like correctly reshaping kernel to work with nn.functional.conv2d).
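One way to fill in the omitted details is sketched below. It assumes the network output is a single-channel batch of shape (N, 1, H, W); the function name sparsity_loss and the size argument are illustrative choices, not part of the answer above.

```python
import torch
import torch.nn.functional as F


def sparsity_loss(nn_output, size=5):
    """Penalise active pixels that have active neighbours (sketch only)."""
    # conv2d expects a weight of shape (out_channels, in_channels, kH, kW).
    kernel = torch.ones(1, 1, size, size,
                        dtype=nn_output.dtype, device=nn_output.device)
    kernel[0, 0, size // 2, size // 2] = 0.0  # ignore the pixel itself

    squashed = torch.sigmoid(nn_output)
    # padding = size // 2 keeps the spatial size unchanged for odd sizes.
    neighborhood = F.conv2d(squashed, kernel, padding=size // 2)
    return (squashed * neighborhood).mean()


logits = torch.randn(2, 1, 16, 16, requires_grad=True)
loss = sparsity_loss(logits)
loss.backward()
print(loss.item(), logits.grad.shape)
```

A larger size widens the gap that the loss tries to enforce between active pixels.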

Count number of repeated elements in list considering the ones larger than them

I am trying to do some clustering analysis on a dataset. I am using a number of different approaches to estimate the number of clusters, then I put what every approach gives (number of clusters) in a list, like so:
total_pred = [0, 0, 1, 1, 0, 1, 1]
Now I want to estimate the real number of clusters, so I let the methods above vote. In the example above, more models found 1 cluster than 0, so I take 1 as the real number of clusters.
I do this by:
counts = np.bincount(np.array(total_pred))
real_nr_of_clusters = np.argmax(counts)
There is a problem with this method, however. If the above list contains something like:
[2, 0, 1, 0, 1, 0, 1, 0, 1]
I will get 0 clusters as the average, since 0 is repeated more often. However, if one model found 2 clusters, it's safe to assume it considers at least 1 cluster is there, hence the real number would be 1.
How can I do this by modifying the above snippet?
To make the problem clear, here are a few more examples:
[1, 1, 1, 0, 0, 0, 3]
should return 1,
[0, 0, 0, 1, 1, 3, 4]
should also return 1 (since most of them agree there is AT LEAST 1 cluster).
There is a problem with your logic
Here is an implementation of the described algorithm.
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
l = sorted(l, reverse=True)
votes = {x: i for i, x in enumerate(l, start=1)}
Output
{2: 1, 1: 5, 0: 9}
Notice that since you define a vote as agreeing with anything smaller than itself, then min(l) will always win, because everyone will agree that there are at least min(l) clusters. In this case min(l) == 0.
How to fix it
Mean and median
Beforehand, notice that taking the mean or the median are valid and light-weight options that both satisfy the desired output on your examples.
Bias
However, taking the mean might not be what you want if, say, you encounter votes with high variance such as [0, 0, 7, 8, 10], where it is unlikely that the answer is 5.
A more general fix is to account for each voter's bias toward votes close to their own. Surely a voter for 2 will agree more with 1 than with 0.
You do that by implementing a metric (note: this is not a metric in the mathematical sense) that determines, on a scale of 0 to 1, how much a voter for x is willing to agree with a vote for y.
Note that this approach will allow voters to agree on a number that is not on the list.
We need to update our code to account for applying that pseudometric.
def d(x, y):
    # A voter for x agrees with y whenever x is at least y
    return x >= y
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
Output
{0: 9, 1: 5, 2: 1}
The above metric is a sanity check. It is the one you provided in your question, and it indeed ends up determining that 0 wins.
Metric choices
You will have to toy a bit with your metrics, but here are a few which may make sense.
Inverse of the linear distance
def d(x, y):
    return 1 / (1 + abs(x - y))
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.33, 1: 6.5, 2: 4.33}
Inverse of the nth power of the distance
This one is a generalization of the previous. As n grows, voters tend to agree less and less with distant vote casts.
def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 5.25, 2: 2.44}
Upper-bound distance
Similar to the previous metric, this one is close to what you described at first in the sense that a voter will never agree to a vote higher than theirs.
def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n if x >= y else 0
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 4.25, 2: 1.0}
Normal distribution
Another sensible option is a normal distribution or a skewed normal distribution.
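A normal-distribution weighting could be sketched as follows, reusing the same vote-counting pattern. The width sigma is an assumed tuning parameter that does not come from the question, and the skewed variant is not covered here.

```python
import math


def d(x, y, sigma=1.0):
    """Gaussian agreement: 1 when x == y, decaying with distance.

    sigma is an assumed width; larger values make voters more tolerant.
    """
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))


l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
winner = max(votes, key=votes.get)
print(votes)
print(winner)  # 1
```

With sigma=1.0 the single vote for 2 lends more support to 1 than to 0, which is enough to tip the result toward 1.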
While the other answer provides a comprehensive review of possible metrics and methods, it seems that what you are looking for is the number of clusters closest to the mean.
So something as simple as:
cluster_num=int(np.round(np.mean(total_pred)))
Which returns 1 for all your cases as you expect.

Scikit-learn R2 always zero

I'm trying to test my Scikit-learn machine learning algorithm with a simple R^2 score, but for some reason it always returns zero.
import numpy
from sklearn.metrics import r2_score
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294]).reshape(1, -1)
training = numpy.array([0, 3, 1, 0]).reshape(1, -1)
r2 = r2_score(training, prediction, multioutput="raw_values")
print(r2)
[ 0. 0. 0. 0.]
This is a single four-part value, not four separate values. How do I get proper R^2 scores?
If you are trying to calculate the R^2 value between two vectors, just pass two one-dimensional arrays. See the documentation.
In the example you provided, reshape(1, -1) turns each array into a single row of four columns, so r2_score sees one sample with four outputs. Each output is then scored on its own: 0.1567 against 0, 4.7528 against 3, and so on. With a single sample the total sum of squares is zero, so R^2 is not well defined and 0 is reported for every output. It sounds like you want the R^2 for the two vectors, like the following:
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294])
training = numpy.array([0, 3, 1, 0])
print(r2_score(training, prediction))
0.472439485
If you have multi-dimensional arrays you can use the multioutput flag to determine what the output should look like:
#modified from the scikit-learn example
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(r2_score(y_true, y_pred, multioutput='raw_values'))
array([ 0.96543779, 0.90816327])
Here each column is scored separately: the first item of each list in y_true is compared to the first item of each list in y_pred, the second item to the second, and so on.
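The numbers above can be checked by hand with the usual definition R^2 = 1 - SS_res / SS_tot. The sketch below is plain Python rather than scikit-learn's code, and r2_by_hand is a made-up helper name.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for two equally long 1-D sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot  # ss_tot is 0 when there is only one sample


training = [0, 3, 1, 0]
prediction = [0.1567, 4.7528, 1.1260, 0.2294]
single = r2_by_hand(training, prediction)
print(single)

# Column-wise scores for the multi-output example.
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
per_output = [r2_by_hand(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
print(per_output)
```

Calling r2_by_hand with a single sample would divide by zero, which is the same degenerate case that the reshaped arrays in the question run into.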

Explanation for coverage_error metric in scikit learn

I am not understanding how the coverage_error is calculated in scikit learn, available in sklearn.metrics module. Explanation in the docs is as below:
The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted.
For example:
import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_score = np.array([[1, 0, 0], [0, 1, 1]])
print(coverage_error(y_true, y_score))
1.5
As per my understanding, here we need to include 3 labels from the prediction to get all labels in y_true. So coverage error = 3/2, ie, 1.5. But I am not able to understand what happens in the below cases:
>>> y_score = np.array([[1, 0, 0], [0, 0, 1]])
>>> print(coverage_error(y_true, y_score))
2.0
>>> y_score = np.array([[1, 0, 1], [0, 1, 1]])
>>> print(coverage_error(y_true, y_score))
2.0
How come the error is the same in both cases?
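You can have a look at User Guide section 3.3.3, Multilabel ranking metrics, which defines
coverage(y, \hat{f}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} \max_{j : y_{ij} = 1} rank_{ij}
with rank_{ij} = |\{k : \hat{f}_{ik} \geq \hat{f}_{ij}\}|.
One thing you need to take care of is how ranks are computed and how ties are broken when ranking y_score.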
To be specific, the first case:
In [4]: y_true
Out[4]:
array([[1, 0, 0],
[0, 1, 1]])
In [5]: y_score
Out[5]:
array([[1, 0, 0],
[0, 0, 1]])
For the 1st sample, the 1st label is true, and the rank of its score is 1.
For the 2nd sample, the 2nd and 3rd labels are true, and the ranks of their scores are 3 and 1 respectively, so the max rank is 3.
The average is (3+1)/2=2.
The second case:
In [7]: y_score
Out[7]:
array([[1, 0, 1],
[0, 1, 1]])
For the 1st sample, the 1st label is true, and the rank of its score is 2 (it is tied with the 3rd score).
For the 2nd sample, the 2nd and 3rd labels are true, and the ranks of their scores are 2 and 2 respectively, so the max rank is 2.
The average is (2+2)/2=2.
Edit:
The rank is within one sample of y_score. The formula says the rank of a label is the number of labels (including itself) whose score is greater than or equal to its score.
It is just like sorting the labels by y_score: the label with the largest score is ranked 1, the second largest is ranked 2, the third largest is ranked 3, etc. But if the second and third largest labels have the same score, they are both ranked 3.
Notice that y_score is
Target scores, can either be probability estimates of the positive class, confidence values, or binary decisions.
The goal is to have all true labels predicted, so we need to include all the labels with higher or equal scores than the true label.
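The ranking rule described above can be reproduced in a few lines of plain Python. This is a sketch of the formula, not scikit-learn's implementation; coverage_error_by_hand is a made-up name, and every sample is assumed to have at least one true label.

```python
def coverage_error_by_hand(y_true, y_score):
    """Average, over samples, of the worst rank among the true labels."""
    worst_ranks = []
    for labels, scores in zip(y_true, y_score):
        # rank of label j = number of labels whose score is >= scores[j],
        # so tied labels all receive the larger (worse) rank
        ranks = [sum(s >= scores[j] for s in scores)
                 for j, is_true in enumerate(labels) if is_true]
        worst_ranks.append(max(ranks))
    return sum(worst_ranks) / len(worst_ranks)


y_true = [[1, 0, 0], [0, 1, 1]]
print(coverage_error_by_hand(y_true, [[1, 0, 0], [0, 1, 1]]))  # 1.5
print(coverage_error_by_hand(y_true, [[1, 0, 0], [0, 0, 1]]))  # 2.0
print(coverage_error_by_hand(y_true, [[1, 0, 1], [0, 1, 1]]))  # 2.0
```

The two cases from the question reach 2.0 by different routes: worst ranks of 1 and 3 in the first, and 2 and 2 in the second.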
