Does lowering the p-value threshold always result in lower FDR?

Intuitively I would think that lowering the p-value threshold would result in fewer false discoveries, but does this always hold, or are there situations where it does not?

This is a very interesting question, and I think there may be situations in which reducing your p-value threshold, without good justification and without attention to your design, does increase the false discovery rate, even though it would theoretically reduce the number of false discoveries across your tests. My guess is based on the relationship between FDR and power.
The intuition is that the FDR is "false discoveries" divided by "true and false discoveries". So if you reduce the likelihood of true discoveries without proportionately reducing the likelihood of false discoveries, your FDR will increase. The way this relates to the p-value threshold is that lowering the threshold without changing anything else reduces your power. As the power of your test decreases, its ability to make true discoveries also decreases, and you may end up with a higher FDR. See this paper for a discussion of how FDR, power and the p-value relate.
This doesn't mean you should go for a higher p-value threshold, by the way; it means you should choose as stringent a threshold as is appropriate and then increase your power (larger sample size, etc.) until the power is also adequate.
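To see how the pieces interact, here is a rough simulation sketch (the effect size, share of true effects, and sample sizes below are arbitrary assumptions, not anything from the question): it runs many two-sample t-tests, counts true and false discoveries, and reports FDR = FP / (FP + TP) at two thresholds and two sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_fdr(alpha, n_per_group, n_tests=5000, prop_true_effects=0.2, effect=0.3):
    """Run many two-sample t-tests, some with a real effect, and return the empirical FDR."""
    false_pos = true_pos = 0
    for _ in range(n_tests):
        has_effect = rng.random() < prop_true_effects
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect if has_effect else 0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            if has_effect:
                true_pos += 1   # true discovery
            else:
                false_pos += 1  # false discovery
    discoveries = true_pos + false_pos
    return false_pos / discoveries if discoveries else float("nan")

# With a small sample (low power), lowering the threshold buys much less FDR
# reduction than it does in the well-powered design.
for n in (20, 200):
    for alpha in (0.05, 0.005):
        print(f"n={n:4d}  alpha={alpha:.3f}  FDR ~= {simulate_fdr(alpha, n):.3f}")
```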

Related

Would the width of a simulation 95% confidence interval for a k<2 proportion in a multinomial sample be narrow, wide, or the same compared to an individual CI?

A categorical response variable can take on K different values.
Apparently a narrow confidence interval implies that there is a smaller chance of obtaining an observation within that interval; therefore, our accuracy is higher. Also, a 95% confidence interval is narrower than a 99% confidence interval.
Would this be a correct explanation? If not, what would be the best reason?
Sort of.
Recall that CIs are used to estimate the true mean of a population, based on sample data from that population.
It is true that a 95% CI is smaller (narrower) than a 99% one.
The width is directly tied to your alpha value (significance level), so an alpha of 5% means you run a theoretical 5% risk that your CI does not contain the actual/real mean value.
Thus, if we want to minimize this risk (i.e. increase the confidence), we could set alpha to 1%, and the CI would become a (100 - alpha)% = 99% CI. Just as you observed, this 99% interval is larger: because we want a lower risk, the interval has to cover a broader range of values, making it less likely that we miss the real mean.
Like I mentioned, this risk is theoretical and is pretty much: "What is the chance that the data (and the CI we created from it) just happened, by sheer chance, to be so far off that the real mean value is outside of our CI?"
So it is only a probability calculation based on how much data you had and how widely spread it was.
And that's why the width differs between them. A 95% CI indicates that you have set your significance level (alpha) to 5%.
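As a quick illustration of the width difference, here is a minimal sketch using a t-based interval for a made-up sample (the data and sample size are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=50)  # made-up sample data

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

for conf in (0.95, 0.99):
    # t-based CI for the mean: mean +/- t_crit * SEM
    lo, hi = stats.t.interval(conf, len(sample) - 1, loc=mean, scale=sem)
    print(f"{conf:.0%} CI: ({lo:.2f}, {hi:.2f})  width = {hi - lo:.2f}")
```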

Need of bonferroni correction in A/B testing

I am a newbie in the field of Data Science. I came across the below statements which read:
The more metrics we choose in our A/B testing, the higher the chance of getting a significant difference by chance.
To eliminate this problem we use the Bonferroni correction method.
What does the first statement mean? How does it increase the chances of getting false positives? And how does the Bonferroni correction method help us here?
With a significance threshold of 0.05 (a commonly used level of statistical significance), you will get a false positive result 5% of the time when there is no real effect. Thus if your analysis has one test, your chance of a false positive is 5%. If you have two tests, you'll have a 5% chance for the first AND a 5% chance for the second, so the chance of at least one false positive grows to nearly 10%. Et cetera.
So with each additional test, your risk increases. Since you want to keep your total risk level at 0.05, you either set a more stringent level of statistical significance (a smaller threshold) or use a statistical method to correct for multiple comparisons. The Bonferroni correction is one such method.
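A small simulation can make this concrete. The sketch below (the number of metrics, sample sizes, and repetition counts are arbitrary assumptions) estimates how often at least one of several metrics comes out "significant" when there is truly no difference, with and without the Bonferroni-adjusted threshold alpha/m:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def familywise_error_rate(alpha, n_metrics, n_experiments=2000, n=100):
    """Fraction of experiments with at least one 'significant' metric
    when in truth there is no difference at all (every null is true)."""
    hits = 0
    for _ in range(n_experiments):
        any_sig = False
        for _ in range(n_metrics):
            a = rng.normal(size=n)
            b = rng.normal(size=n)  # same distribution, so any significance is a false positive
            _, p = stats.ttest_ind(a, b)
            if p < alpha:
                any_sig = True
                break
        hits += any_sig
    return hits / n_experiments

m = 10  # number of metrics tracked in the hypothetical A/B test
print("uncorrected:", familywise_error_rate(0.05, m))      # roughly 1 - 0.95**10, i.e. ~0.40
print("Bonferroni :", familywise_error_rate(0.05 / m, m))  # back down to roughly 0.05
```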

What's considered a strong p-value?

Hi I'm new to statistics and just wanted some clarifications on p-values.
So far I've learned that if we use a 5% significance level then we reject the null hypothesis and accept the alternative hypothesis if the p-value is less than 0.05.
If the p-value is greater than 0.05, then we say there is insufficient evidence and we can't reject the null hypothesis. I've learned that we can't accept the null hypothesis if the p-value is greater than 0.05, but at the same time, if we have a strong p-value, we can't ignore it.
So my question is: what is considered a high p-value at which I should consider accepting the null hypothesis? Where should I cut off: 0.7 and higher? 0.8? 0.9?
Can't argue with the link to the ASA statement.
An example that helped me with this:
If you are working to a 5% significance level (alpha=0.05), and calculate a p-value of 0.5, your data does not provide sufficient evidence to reject the null hypothesis.
There are two possible scenarios here:
- The null hypothesis is indeed true.
- The null hypothesis is actually false (Type II error, false negative).
Once that point has been reached, you shouldn't do much more with the p-value. It is tempting to try to justify inconvenient results by saying that (for example) a p-value of 0.07 is quite close to 0.05, so there is some evidence to support the alternative hypothesis, but that is not a very robust approach. It is good practice to set your significance level in advance, and stick to it.
As a side-note, significance levels are an expression of how much uncertainty in your results you are willing to accept. A value of 5% indicates that you are willing (on average, over a large number of experiments) to be wrong about 5% of the time, or 1 in 20 experiments. In this case, by 'wrong' we mean falsely reject a true null hypothesis in favour of the alternative hypothesis (that is not true). By increasing the significance level we are saying we are willing to be wrong more often (with the trade-off of having to gather less data).
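One way to convince yourself of that "wrong about 5% of the time" figure is to simulate experiments where the null hypothesis is true and count how often it gets rejected. This is just an illustrative sketch with arbitrary sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# When the null hypothesis is true, p-values are (approximately) uniform on [0, 1],
# so the fraction falling below any alpha is about alpha -- the false-rejection rate.
pvals = []
for _ in range(10_000):
    a = rng.normal(size=30)
    b = rng.normal(size=30)  # drawn from the same distribution, so the null is true
    pvals.append(stats.ttest_ind(a, b).pvalue)

pvals = np.array(pvals)
for alpha in (0.05, 0.01):
    print(f"alpha={alpha}: rejected {np.mean(pvals < alpha):.3f} of true nulls")
```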

how to interpret IRStatisticsImpl data in mahout

I want to read the IRStatisticsImpl data but have some problems:
my result is:
IRStatisticsImpl[precision:0.04285714285714287,recall:0.04275534441805227,fallOut:0.0018668022652391654,nDCG:0.04447353132522083,reach:0.997624703087886]
Does it mean that I got only 4% good recommendations (precision) and about the same level of bad recommendations (recall)?
What should the numbers look like at best: precision at 1.0 and recall at 0.0?
Well, by definition:
Precision represents the fraction of the results in your result set that are correct.
Recall represents the probability that a correct element in the test set is actually picked into your result set.
To be perfect, precision and recall should both be at 100%. What counts as a good result must be evaluated according to your domain.
For example, if you are picking from a bucket of good and bad mushrooms, you should aim for 100% precision no matter how low your recall is. Because precision is critical for your health, you can afford to leave a lot of good mushrooms behind; the important thing is not to eat the bad ones.
You could pick one good mushroom and so get 100% precision, but if there were four good mushrooms in your bucket, your recall is 25%.
Ideally, precision and recall of 100% means that all the mushrooms in your result set are good and all the good mushrooms are in your result set, with none left behind in your test set.
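To make the mushroom numbers concrete, here is a tiny sketch (the counts are just the ones from the example above):

```python
# One good mushroom picked out of a bucket that contains four good ones in total.
picked = ["good"]            # our result set
good_in_bucket = 4           # relevant items in the test set

true_positives = picked.count("good")
false_positives = picked.count("bad")
false_negatives = good_in_bucket - true_positives

precision = true_positives / (true_positives + false_positives)  # 1.00
recall = true_positives / (true_positives + false_negatives)     # 0.25
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```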
So values might have different meanings.
Sadly, your results look quite poor: you have many false positives and too many false negatives.
Take a look here.

How to combine False positives and false negatives into one single measure

I'm trying to measure the performance of a computer vision program that tries to detect objects in video. I have 3 different versions of the program which have different parameters.
I've benchmarked each of these versions and got 3 pairs of (false positive percent, false negative percent).
Now I want to compare the versions with each other, and I wonder if it makes sense to combine false positives and false negatives into a single value and use that for the comparison. For example, take the ratio falsePositives/falseNegatives and see which is smaller.
In addition to the popular Area Under the ROC Curve (AUC) measure mentioned by #alchemist-al, there's a score that combines both precision and recall (which are defined in terms of TP/FP/TN/FN) called the F-measure that goes from 0 to 1 (0 being the worst, 1 the best):
F-measure = 2*precision*recall / (precision+recall)
where
precision = TP/(TP+FP), recall = TP/(TP+FN)
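As a minimal sketch, the F-measure is easy to compute directly from the counts (the counts in the example call are hypothetical):

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (the F1 score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 80 correct detections, 20 false alarms, 40 missed objects
print(f_measure(tp=80, fp=20, fn=40))  # ~0.727
```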
A couple of other possible solutions:
-Your false-positive rate (fp) and false-negative rate (fn) may depend on a threshold. If you plot the curve where the y-value is (1-fn), and the x-value is (fp), you'll be plotting the Receiver-Operator-Characteristic (ROC) curve. The Area Under the ROC Curve (AUC) is one popular measure of quality.
-AUC can be weighted if there are certain regions of interest
-Report the Equal-Error Rate (EER): the threshold at which fp = fn; report that value (both AUC and EER are sketched below).
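Here is a rough sketch of both ideas on made-up detector scores (the score distributions and sizes are assumptions): AUC via its pairwise interpretation, and the equal-error rate found by sweeping the decision threshold.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical detector scores: higher means "object present".
scores_pos = rng.normal(1.0, 1.0, 500)  # frames that really contain the object
scores_neg = rng.normal(0.0, 1.0, 500)  # frames that do not

# AUC via its pairwise interpretation: probability that a random positive
# frame scores higher than a random negative frame.
auc = (scores_pos[:, None] > scores_neg[None, :]).mean()

# Equal-error rate: sweep the decision threshold until fp rate ~= fn rate.
thresholds = np.linspace(scores_neg.min(), scores_pos.max(), 500)
fp_rate = np.array([(scores_neg >= t).mean() for t in thresholds])
fn_rate = np.array([(scores_pos < t).mean() for t in thresholds])
eer_idx = np.argmin(np.abs(fp_rate - fn_rate))

print(f"AUC ~= {auc:.3f}")
print(f"EER ~= {(fp_rate[eer_idx] + fn_rate[eer_idx]) / 2:.3f} at threshold {thresholds[eer_idx]:.2f}")
```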
It depends on how much detail you want in the comparison.
Combining the two figures will give you an overall sense of error margin, but no insight into what sort of error, so if you just want to know which version is "more correct" in an overall sense, then it's fine.
If, on the other hand, you actually want to use the results for a more in-depth determination of whether the process is suited to a particular problem, then I would imagine keeping them separate is a good idea.
e.g. Sometimes false negatives are a very different problem to false positives in a real world setting. Did the robot just avoid an object that wasn't there... or fail to notice it was heading off the side of a cliff?
In short, there's no hard and fast global rule for determining how effective a vision system is based on one super calculation. It comes down to what you're planning to do with the information; that's the important bit.
You need to factor in how "important" false positives are relative to false negatives.
For example, if your program is designed to recognise people's faces, then both false positives and false negatives are equally harmless, and you can probably just combine them linearly.
But if your program was designed to detect bombs, then false positives aren't a huge deal (i.e. saying "this is a bomb" when it's actually not) but false negatives (that is, saying "this isn't a bomb" when it actually is) would be catastrophic.
Well, one conventional way is to assign a weight to each of the two event types (e.g., some integer to indicate the relative significance of each to model validation). Then:
- multiply each error count by the appropriate weighting factor;
- square each weighted term;
- sum the terms;
- take the square root.
This leaves you with a single number, something like a "total error".
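A minimal sketch of that weighted combination (the weights and error counts below are hypothetical, with misses treated as three times worse than false alarms):

```python
import math

def weighted_total_error(false_pos, false_neg, w_fp=1.0, w_fn=3.0):
    """Weighted root-sum-of-squares of the two error counts."""
    return math.sqrt((w_fp * false_pos) ** 2 + (w_fn * false_neg) ** 2)

# Compare the three program versions (counts are made up for illustration).
versions = {"v1": (12, 5), "v2": (8, 9), "v3": (15, 2)}
for name, (fp, fn) in versions.items():
    print(name, round(weighted_total_error(fp, fn), 1))
```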
If you want to maximize both the true positives and the true negatives you can use the Diagnostic Efficiency:
Diagnostic Efficiency = Sensitivity * Specificity
Where...
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
(TP = number of true positives, FN = number of false negatives, TN = number of true negatives, FP = number of false positives)
This metric works well for datasets that have an unbalanced number of classes (i.e. the dataset is skewed).
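A minimal sketch with a skewed, made-up dataset (the counts are hypothetical):

```python
def diagnostic_efficiency(tp, tn, fp, fn):
    """Product of sensitivity and specificity, as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity * specificity

# Skewed example: 1000 true negatives vs. only 50 true positives in the ground truth.
print(diagnostic_efficiency(tp=40, tn=950, fp=50, fn=10))  # 0.8 * 0.95 = 0.76
```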
