How to combine False positives and false negatives into one single measure - statistics

I'm trying to measure the performance of a computer vision program that tries to detect objects in video. I have 3 different versions of the program, which have different parameters.
I've benchmarked each of these versions and got 3 pairs of (false positive percent, false negative percent).
Now I want to compare the versions with each other, and I wonder whether it makes sense to combine false positives and false negatives into a single value and use that to do the comparison. For example, take the ratio falsePositives/falseNegatives and see which is smaller.

In addition to the popular Area Under the ROC Curve (AUC) measure mentioned by @alchemist-al, there's a score that combines both precision and recall (which are defined in terms of TP/FP/TN/FN) called the F-measure, which goes from 0 to 1 (0 being the worst, 1 the best):
F-measure = 2 * precision * recall / (precision + recall)
where
precision = TP / (TP + FP), recall = TP / (TP + FN)
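For concreteness, here is a minimal Python sketch of the F-measure computed straight from TP/FP/FN counts; the counts for the three program versions below are made up purely for illustration:

    def f_measure(tp, fp, fn):
        """F1 score: harmonic mean of precision and recall."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # Hypothetical (TP, FP, FN) counts for the three program versions:
    versions = {"v1": (80, 20, 10), "v2": (75, 10, 25), "v3": (85, 30, 5)}
    for name, (tp, fp, fn) in versions.items():
        print(name, round(f_measure(tp, fp, fn), 3))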

A couple of other possible solutions:
-Your false-positive rate (fp) and false-negative rate (fn) may depend on a threshold. If you plot the curve whose y-value is (1-fn) and whose x-value is fp, you'll be plotting the Receiver Operating Characteristic (ROC) curve. The Area Under the ROC Curve (AUC) is one popular measure of quality (see the sketch after this list).
-The AUC can be weighted if there are certain regions of interest.
-Report the Equal Error Rate: for some threshold, fp = fn; report this value.
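A rough Python sketch (not a library implementation) of sweeping a detection threshold to get the ROC curve, its AUC, and an approximate equal-error-rate point; the scores and labels arrays are invented detector outputs:

    import numpy as np

    # Made-up detector outputs: higher score = more confident "object present".
    labels = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
    scores = np.array([0.9, 0.8, 0.7, 0.65, 0.5, 0.45, 0.4, 0.3, 0.2, 0.1])

    tpr, fpr = [], []
    for t in np.sort(np.unique(scores))[::-1]:   # sweep thresholds high -> low
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        tpr.append(tp / (tp + fn))               # 1 - false-negative rate
        fpr.append(fp / (fp + tn))               # false-positive rate

    auc = np.trapz(tpr, fpr)                     # area under the ROC curve
    eer_i = np.argmin(np.abs(np.array(fpr) - (1 - np.array(tpr))))
    print("AUC:", round(auc, 3), "EER (approx):", round(fpr[eer_i], 3))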

It depends on how much detail you want in the comparison.
Combining the two figures will give you an overall sense of error margin but no insight into what sort of error, so if you just want to know which version is "more correct" in an overall sense, then it's fine.
If, on the other hand, you actually want to use the results for some more in-depth determination of whether the process is suited to a particular problem, then I would imagine keeping them separate is a good idea.
e.g. Sometimes false negatives are a very different problem to false positives in a real-world setting. Did the robot just avoid an object that wasn't there... or fail to notice it was heading off the side of a cliff?
In short, there's no hard-and-fast global rule for determining how effective a vision system is from one super calculation. What you're planning to do with the information is the important bit.

You need to factor in how "important" false positives are relative to false negatives.
For example, if your program is designed to recognise people's faces, then both false positives and false negatives are equally harmless and you can probably just combine them linearly.
But if your program was designed to detect bombs, then false positives aren't a huge deal (i.e. saying "this is a bomb" when it's actually not), but false negatives (that is, saying "this isn't a bomb" when it actually is) would be catastrophic.

Well, one conventional way is to assign a weight to each of the two error types (e.g., some integer to indicate the relative significance of each to model validation). Then:
-multiply each error count by the appropriate weighting factor;
-square each weighted term;
-sum the terms;
-take the square root.
This leaves you with a single number--something like a "total error".
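A minimal Python sketch of that recipe; the specific weights (false negatives counted three times as important as false positives) and the error counts are assumptions for illustration only:

    import math

    def total_error(fp, fn, w_fp=1.0, w_fn=3.0):
        """Square root of the sum of squared, weighted error counts."""
        return math.sqrt((w_fp * fp) ** 2 + (w_fn * fn) ** 2)

    # One number per program version, smaller is better:
    print(total_error(fp=12, fn=4), total_error(fp=6, fn=9))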

If you want to maximize both the true positives and the true negatives you can use the Diagnostic Efficiency:
Diagnostic Efficiency = Sensitivity * Specificity
Where...
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
(TP = number of true positives, FN = number of false negatives, TN = number of true negatives, FP = number of false positives)
This metric works well for datasets that have an unbalanced number of classes (i.e. the dataset is skewed).
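A small sketch of that metric computed from confusion-matrix counts; the counts below describe a hypothetical skewed dataset (900 negatives, 100 positives):

    def diagnostic_efficiency(tp, fn, tn, fp):
        sensitivity = tp / (tp + fn)   # true-positive rate
        specificity = tn / (tn + fp)   # true-negative rate
        return sensitivity * specificity

    # 80 of 100 positives caught, 850 of 900 negatives correctly rejected:
    print(round(diagnostic_efficiency(tp=80, fn=20, tn=850, fp=50), 3))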

Related

Does lowering the p-value threshold always result in lower FDR?

Intuitively I would think that lowering the p-value threshold would result in fewer false discoveries, but does this always hold, or are there situations where it does not?
This is a very interesting question, and I think there may be situations in which reducing your p-value threshold without good justification and attention to your design does increase the false discovery rate, even though it would theoretically reduce the number of false discoveries across your tests. This is my guess based on the relationship between FDR and power.
This is my intuition because the FDR is "false discoveries" divided by "true and false discoveries". So if you reduce the likelihood of true discoveries without proportionately reducing the likelihood of false discoveries, your FDR will increase. The way this relates to reducing the p-value threshold is that reducing the threshold without changing anything else reduces your power. As the power of your test decreases, its ability to make true discoveries also decreases, and thus you might end up with a higher FDR. See this paper for a discussion of how the FDR, power and p-value relate.
This doesn't mean you should go for a higher p-value threshold, by the way; it means you should go for as stringent a threshold as is appropriate and then increase your power (increasing sample size, etc.) until the power is also appropriate.
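Purely illustrative arithmetic for the point above, with invented discovery counts: FDR = false discoveries / (false + true discoveries), so if tightening the threshold cuts true discoveries faster than false ones, the ratio goes up:

    def fdr(false_disc, true_disc):
        """Proportion of discoveries that are false."""
        return false_disc / (false_disc + true_disc)

    print(fdr(false_disc=5, true_disc=45))   # 0.10 at the looser threshold
    print(fdr(false_disc=1, true_disc=4))    # 0.20 if power collapses at the stricter one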

Need for Bonferroni correction in A/B testing

I am a newbie in the field of Data Science. I came across the below statements which read:
The more metrics we choose in our A/B testing, the higher the chance of getting a significant difference by chance.
To eliminate this problem we use the Bonferroni correction method.
What does the 1st statement mean? How does it increase the chances of getting false positives? And how does the Bonferroni correction method help us here?
With a p-value threshold of 0.05 (a commonly used level of statistical significance), you will get false positive results 5% of the time. Thus if your analysis contains one test, your chance of a false positive is 5%. If you have two tests, you'll have 5% for the first AND 5% for the second. Et cetera.
So with each additional test, your risk increases. Still, since you want to keep your total risk level at 0.05, you either set a more strict level of statistical significance (a smaller p-value threshold) or use a statistical method that corrects for multiple comparisons. The Bonferroni correction is one such method.
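A quick Python sketch of that arithmetic: the family-wise chance of at least one false positive across m independent tests, and the Bonferroni-corrected per-test threshold (m = 10 is an arbitrary example):

    alpha, m = 0.05, 10
    family_wise_error = 1 - (1 - alpha) ** m   # ~0.40: chance of >= 1 false positive
    bonferroni_threshold = alpha / m           # 0.005 significance level per test
    print(round(family_wise_error, 3), bonferroni_threshold)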

Is the akaike information criterion (AIC) unit-dependent?

One formula for AIC is:
AIC = 2k + n*log(RSS/n)
Intuitively, if you add a parameter to your model, your AIC will decrease (and hence you should keep the parameter) if the increase in the 2k term due to the new parameter is offset by the decrease in the n*log(RSS/n) term due to the decreased residual sum of squares. But isn't this RSS value unit-specific? So if I'm modeling money, and my units are in millions of dollars, the change in RSS from adding a parameter might be very small, and won't offset the increase in the 2k term. Conversely, if my units are pennies, the change in RSS would be very large, and could greatly offset the increase in the 2k term. This arbitrary change in units would lead to a change in my decision about whether to keep the extra parameter.
So: does the RSS have to be in standardized units for AIC to be a useful criterion? I don't see how it could be otherwise.
No, I don't think so (partially rowing back from what I said in my earlier comment). For the simplest possible case (least-squares regression for y = ax + b), from Wikipedia, RSS = Syy - a*Sxy.
From the definitions given in that article, both a and Sxy grow by a factor of 100 and Syy grows by a factor of 100^2 if you change the unit for y from dollars to cents. So, after rescaling, the new RSS for that model will be 100^2 times the old one. I'm quite sure that the same result holds for models with k ≠ 2 parameters.
Hence nothing changes for the AIC difference, where the key part is log(RSS_B/RSS_A). After rescaling, both RSS values will have grown by the same factor, and you'll get the exact same AIC difference between models A and B as before.
Edit:
I've just found this one:
"It is correct that the choice of units introduces a multiplicative
constant into the likelihood. Thence the log likelihood has an
additive constant which contributes (after doubling) to the AIC. The difference of AICs is unchanged."
Note that this comment even talks about the general case where the exact log-likelihood is used.
I had the same question, and I felt like the existing answer above could have been clearer and more direct. Hopefully the following clarifies it a bit for others as well.
When using the AIC to compare models, it is the difference that is of interest. The portion in question here is the n*log(RSS/n) term. Since both models are fit to the same data, n1 = n2 = n, and the difference between the two AICs is:
AIC1 - AIC2 = 2k1 - 2k2 + n*log(RSS1/n) - n*log(RSS2/n)
From the logarithmic identities, we know that log(a) - log(b) = log(a/b). AIC1 - AIC2 therefore simplifies to:
2k1 - 2k2 + n*log(RSS1/RSS2)
If a change of units rescales every RSS by the same gain factor G (for example G = 100^2 when going from dollars to cents), that difference becomes:
2k1 - 2k2 + n*log(G*RSS1/(G*RSS2)) = 2k1 - 2k2 + n*log(RSS1/RSS2)
As you can see, we are left with the same AIC difference, regardless of which units we choose.
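A small numerical check of this argument, assuming least-squares fits and the AIC = 2k + n*log(RSS/n) form used above; the data and the two models here are invented (an intercept-only model A versus a straight-line model B):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 3.0 * x + 5.0 + rng.normal(0, 2.0, x.size)             # "dollars"

    def aic(y, X):
        _, rss, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares fit
        n, k = X.shape
        return 2 * k + n * np.log(rss[0] / n)

    for scale in (1, 100):                                     # dollars vs cents
        ys = y * scale
        aic_a = aic(ys, np.ones((x.size, 1)))                  # model A: intercept only
        aic_b = aic(ys, np.column_stack([np.ones_like(x), x])) # model B: straight line
        print(scale, round(aic_b - aic_a, 6))                  # same difference both times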

How to interpret IRStatisticsImpl data in Mahout

I want to read the IRStatisticsImpl data but have some problems:
my result is:
IRStatisticsImpl[precision:0.04285714285714287,recall:0.04275534441805227,fallOut:0.0018668022652391654,nDCG:0.04447353132522083,reach:0.997624703087886]
Does it mean that I got only 4% good recommendations (precision) and about the same level of bad recommendations (recall)?
What should the numbers look like at best - precision at 1.0 and recall at 0.0?
Well, by definition:
Precision represents how many of the results in your result set are correct.
Recall represents the probability that a correct element in the test set is actually selected and ends up in the result set.
To be perfect, precision and recall should both be at 100%. What counts as a good result must be evaluated according to your domain.
For example, if you have a bucket with good and bad mushrooms, you should aim at 100% precision no matter how low your recall is. Because precision is critical for your health, you can even leave out a lot of good mushrooms; the important thing is not to eat the ugly ones.
You could pick one good mushroom and so get 100% precision, but if there were four good mushrooms in your bucket, your recall would be 25%.
Ideally, if precision and recall are both 100%, it means that all the mushrooms in your result set are good and also that all the good mushrooms are in your result set - none is left behind in the test set.
So the same values might have different meanings in different domains.
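The mushroom example in numbers - a tiny Python sketch, assuming a bucket with four good mushrooms from which you pick exactly one (good) one:

    def precision_recall(tp, fp, fn):
        return tp / (tp + fp), tp / (tp + fn)

    # Picked 1 good mushroom, picked 0 bad ones, left 3 good ones behind:
    p, r = precision_recall(tp=1, fp=0, fn=3)
    print(p, r)   # 1.0 precision, 0.25 recall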
Sadly your results seem very ugly: you're getting many false positives and too many false negatives.
Take a look here.

Numerical Integration

Generally speaking, when you are numerically evaluating an integral, say in MATLAB, do I just pick a large number for the bounds, or is there a way to tell MATLAB to "take the limit"?
I am assuming that you just use a large number because different machines are able to handle numbers of different magnitudes.
I am just wondering if there is a way to improve my code. I am doing lots of expected-value calculations via Monte Carlo and often use the trapezoid method to check myself when my degrees of freedom are small enough.
Strictly speaking, it's impossible to evaluate a numerical integral out to infinity. In most cases, if the integral in question is finite, you can simply integrate over a reasonably large range. For example, integrating the normal error out to 10 sigma gives a value that is, for all practical purposes, as close as you are going to get to evaluating the same integral all the way out to infinity.
It depends very much on what type of function you want to integrate. If it is "smooth" (no jumps - preferably not in any derivatives either, but that becomes progressively less important) and finite, then you have two main choices (limiting myself to the simplest approach):
1. If it is periodic - here meaning: could you put the left and right ends together and also have no jumps there in value (and derivatives...) - distribute your points evenly over the interval, sample the function values to get the estimated average, and then multiply by the length of the interval to get your integral.
2. If it is not periodic: use Gauss-Legendre integration.
Monte Carlo is almost invariably a poor method: it progresses very slowly towards (machine) precision: for each additional significant digit you need roughly 100 times more points!
The two methods above, for periodic and non-periodic "nice" (smooth etcetera) functions, give fair results already with a very small number of sample points and then progress very rapidly towards higher precision: 1 or 2 more points usually add several digits to your precision! This far outweighs the burden of having to throw away the previous result when you move to a larger number of sample points: you REPLACE the previous set of points with a fresh new one, whereas in Monte Carlo you can simply add points to the existing set and so refine the outcome.
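To make the comparison concrete, here is a rough Python sketch (the question mentions MATLAB, but the idea carries over) comparing 20-point Gauss-Legendre quadrature with plain Monte Carlo on exp(-x^2) over [0, 5], used as a stand-in for the truncated "integral to infinity"; the node and sample counts are arbitrary choices:

    import numpy as np

    def f(x):
        return np.exp(-x**2)

    a, b = 0.0, 5.0
    exact = np.sqrt(np.pi) / 2                     # integral of exp(-x^2) from 0 to infinity

    # Gauss-Legendre with only 20 nodes, mapped from [-1, 1] to [a, b]
    nodes, weights = np.polynomial.legendre.leggauss(20)
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    gauss_legendre = 0.5 * (b - a) * np.sum(weights * f(x))

    # Plain Monte Carlo with 100,000 uniform samples
    rng = np.random.default_rng(1)
    monte_carlo = (b - a) * f(rng.uniform(a, b, 100_000)).mean()

    print("Gauss-Legendre error:", abs(gauss_legendre - exact))
    print("Monte Carlo error:   ", abs(monte_carlo - exact))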
