Need for Bonferroni correction in A/B testing

I am a newbie in the field of Data Science. I came across the following statements:
The more metrics we choose in our A/B testing, the higher the chance of finding a significant difference purely by chance.
To eliminate this problem, we use the Bonferroni correction method.
What does the first statement mean? How does adding metrics increase the chance of false positives, and how does the Bonferroni correction method help here?

With a p-value threshold of 0.05 (a commonly used level of statistical significance), you will get a false positive result 5% of the time. Thus, if your analysis contains one test, your chance of a false positive is 5%. If you have two tests, you have 5% for the first and 5% for the second, and so on; across the two tests the chance of at least one false positive is already about 1 - 0.95² ≈ 9.75%.
So with each additional test, your risk increases. If you still want to keep your overall risk at 0.05, you either set a stricter per-test significance threshold or use a statistical method that corrects for multiple comparisons. The Bonferroni correction is one such method.
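To make this concrete, here is a minimal sketch in plain Python, assuming independent tests and that every null hypothesis is true, of how the family-wise error rate grows with the number of tests and how dividing the threshold by the number of tests (the Bonferroni correction) pulls it back down:

```python
# Family-wise error rate (FWER) for m independent tests, assuming every
# null hypothesis is true and each test uses the threshold alpha.
alpha = 0.05

for m in (1, 2, 5, 10, 20):
    fwer_uncorrected = 1 - (1 - alpha) ** m      # P(at least one false positive)

    # Bonferroni correction: test each hypothesis at alpha / m instead.
    per_test = alpha / m
    fwer_bonferroni = 1 - (1 - per_test) ** m    # stays at or just below alpha

    print(f"{m:2d} tests: uncorrected FWER = {fwer_uncorrected:.3f}, "
          f"Bonferroni per-test threshold = {per_test:.4f}, "
          f"corrected FWER = {fwer_bonferroni:.3f}")
```

With 20 metrics, the uncorrected chance of at least one false positive is already around 64%, which is exactly the problem the first statement describes.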

Related

Would the width of a simulated 95% confidence interval for a k < 2 proportion in a multinomial sample be narrow, wide, or the same compared with an individual CI?

A categorical response variable can take on K different values.
Apparently, a narrow confidence interval implies that there is a smaller chance of obtaining an observation within that interval; therefore, our accuracy is higher. Also, a 95% confidence interval is narrower than a 99% confidence interval.
Would this be a correct explanation? If not, what would be the best reasoning?
Sort of.
Recall that CIs are used to estimate the real mean of a population, based on some sample data from that population.
It is true that a 95% CI would be smaller (narrower) than a 99% one.
It is directly tied to your alpha value (significance level): an alpha of 5% means you run a theoretical 5% risk that your CI does not contain the actual/real mean value.
Thus, if we want to minimize this risk and thereby raise the confidence level, we could set alpha to 1%, and the CI would become a (100 - alpha)% = 99% CI. And just as you observed, this 99% interval is wider: because we want a lower risk, the interval has to cover a broader range of values, making it less likely that we miss the real mean.
As I mentioned, this risk is theoretical and essentially asks: "What is the chance that the data (and the CI we built from it) just happened, by sheer chance, to be so far off that the real mean value lies outside our CI?"
So it is only a probability calculation based on how much data you had and how widely spread it was.
And that is why the widths differ between them. A 95% CI indicates that you have set your significance level (alpha) to 5%.
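As a quick illustration of the width difference, here is a small sketch (my own example using NumPy and SciPy; the sample itself is made up) that computes a 95% and a 99% t-interval for the mean of the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=50)   # made-up sample data

mean = sample.mean()
sem = stats.sem(sample)                          # standard error of the mean

for conf in (0.95, 0.99):
    # t-based confidence interval for the mean with n - 1 degrees of freedom
    lo, hi = stats.t.interval(conf, len(sample) - 1, loc=mean, scale=sem)
    print(f"{conf:.0%} CI: ({lo:.2f}, {hi:.2f}), width = {hi - lo:.2f}")
```

The 99% interval comes out wider than the 95% one for exactly the reason above: to run a smaller risk of missing the real mean, the interval has to cover more ground.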

Does lowering the p-value threshold always result in lower FDR?

Intuitively I would think that lowering the p-value threshold would result in fewer false discoveries, but does this always hold, or are there situations where it does not?
This is a very interesting question, and I think there may be situations in which reducing your p-value threshold without good justification and attention to your design does increase the false discovery rate, even though it would theoretically reduce the number of false discoveries across your tests. This is my guess based on the relationship between FDR and power.
My intuition is that the FDR is "false discoveries" divided by "true and false discoveries". So if you reduce the likelihood of true discoveries without proportionately reducing the likelihood of false discoveries, your FDR will increase. This relates to the p-value threshold because lowering the threshold without changing anything else reduces your power. As the power of your test decreases, its ability to make true discoveries also decreases, and so you might end up with a higher FDR. See this paper for a discussion of how the FDR, power, and p-value relate.
This doesn't mean you should go for a higher p-value threshold, by the way; it means you should choose as stringent a threshold as is appropriate and then increase your power (larger sample size, etc.) until the power is also appropriate.
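If you want to see how the empirical FDR and the power move together as the threshold changes, here is a rough simulation sketch (NumPy/SciPy; the effect size, group sizes, and the share of real effects are arbitrary assumptions, so the printed numbers only illustrate the trade-off for this one setup, not a general rule):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, frac_real, n_per_group, effect = 10_000, 0.10, 20, 0.5   # assumed values

is_real = rng.random(n_tests) < frac_real          # which tests have a true effect
p_values = np.empty(n_tests)
for i in range(n_tests):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(effect if is_real[i] else 0.0, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(a, b).pvalue     # two-sample t-test

for threshold in (0.05, 0.01, 0.001):
    discovered = p_values < threshold
    fdr = (discovered & ~is_real).sum() / max(discovered.sum(), 1)   # false / all discoveries
    power = (discovered & is_real).sum() / is_real.sum()             # true discoveries / real effects
    print(f"threshold {threshold}: discoveries = {discovered.sum():5d}, "
          f"empirical FDR = {fdr:.3f}, power = {power:.3f}")
```

Varying the effect size or the per-group sample size lets you see the point of the answer: the same threshold can give very different FDRs depending on how much power you have.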

Violation of PH assumption

Running a survival analysis, assume the p-value for a variable is statistically significant, say with a positive association with the outcome. However, according to the Schoenfeld residuals, the proportional hazards (PH) assumption is violated.
Which of the scenarios below could possibly happen after correcting for the PH violation?
1. The p-value may no longer be significant.
2. The p-value remains significant, but the size of the HR may change.
3. The p-value remains significant, but the direction of association may be altered (i.e., a positive association may end up being negative).
A PH assumption violation usually means that there is an interaction effect that needs to be included in the model. In simple linear regression, including a new variable may alter the direction of the existing variables' coefficients because of collinearity. Can we use the same rationale in the case above?
Therneau and Grambsch have written a very useful text, "Modeling Survival Data", that has an entire chapter on testing proportionality. At the end of that chapter is a section on causes and modeling alternatives, which I think can be used to answer this question. Since you mention interactions, it makes your question about a particular p-value rather ambiguous and vague.
1) Certainly, if you have chosen a particular measurement as the subject of your interest and it turns out that all of its effects are due to its interaction with another variable that you also happened to measure, then you may be in a position where the variable of interest's significance decreases, possibly to the point where it is no longer significant.
2) It is almost certain that moving to a model with a different structure (say, with the addition of time-varying covariates or a different treatment of time) will result in a different estimated HR for a particular covariate, and I think it would be impossible to predict the direction of the change.
3) As to whether the sign of the coefficient could change, I am quite sure that would be possible as well. The scenario I am thinking of would have a mixture of two groups, say men and women, where one of the groups had a subgroup whose early mortality was greatly increased (e.g., breast cancer), while the surviving members of that group had a more favorable survival expectation. The base model might show a positive coefficient (higher risk), while a model capable of identifying the subgroup at risk would then allow the gender-related coefficient to become negative (lower risk).
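For readers who want to reproduce this kind of check in code, here is a hedged sketch using the Python lifelines package (my choice of tool, not something mentioned in the question). It fits a Cox model, tests proportionality via the scaled Schoenfeld residuals, and refits with stratification on one covariate so the remaining coefficients, HRs, and p-values of the two fits can be compared. The rossi example data set and the choice of "wexp" as the stratification variable are purely for illustration:

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                        # example data shipped with lifelines

# Base Cox proportional hazards model.
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()                      # coefficients, HRs, p-values

# Schoenfeld-residual-based checks of the PH assumption for each covariate.
cph.check_assumptions(df, p_value_threshold=0.05)

# One common remedy: stratify on a covariate that violates PH, then compare
# the remaining coefficients (sign, HR, significance) with the original fit.
cph_strat = CoxPHFitter()
cph_strat.fit(df, duration_col="week", event_col="arrest", strata=["wexp"])
cph_strat.print_summary()
```

Comparing the two summaries side by side is a quick way to see, for your own data, which of the three scenarios above actually occurs.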

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data are weighted, because each person surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28,000 young men in the general population. The problem is that SPSS seems to treat each of that young man's database entries as representing 28,000 measurements when each actually represents just one, which makes SPSS think we have far more data than we actually do. As a result, SPSS is giving extremely low standard error estimates and extremely narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. That gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find the mean and standard error (among other things), in case that matters.
You do need to scale the weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
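For what it's worth, here is a small Python sketch (using statsmodels' DescrStatsW, which also treats the supplied weights essentially as frequency weights; all numbers are made up) of why raw population weights shrink the standard error and why rescaling them to sum to the actual sample size, which is what dividing by the mean weight does, gives a more plausible figure. It is not a substitute for a proper design-based (Complex Samples) estimate:

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

rng = np.random.default_rng(1)
n = 200
values = rng.normal(50, 10, n)             # survey measurements (made up)
weights = rng.uniform(5_000, 30_000, n)    # how many people each respondent represents

# Raw weights: the implied "sample size" is sum(weights), so the SE is tiny.
raw = DescrStatsW(values, weights=weights, ddof=1)
print(f"raw weights:    mean = {raw.mean:.2f}, SE = {raw.std_mean:.4f}")

# Rescale so the weights sum to the real sample size (equivalent to dividing
# by the mean weight). The weighted mean is unchanged, but the implied
# number of observations is now n rather than ~sum(weights).
scaled = DescrStatsW(values, weights=weights * n / weights.sum(), ddof=1)
print(f"scaled weights: mean = {scaled.mean:.2f}, SE = {scaled.std_mean:.4f}")
```

The scaled version reproduces the questioner's workaround; a design-based procedure would additionally account for stratification and clustering, which this sketch does not attempt.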

How to interpret IRStatisticsImpl data in Mahout

I want to read the IRStatisticsImpl data but have some problems.
My result is:
IRStatisticsImpl[precision:0.04285714285714287,recall:0.04275534441805227,fallOut:0.0018668022652391654,nDCG:0.04447353132522083,reach:0.997624703087886]
Does it mean that I got only 4% good recommendations (precision) and about the same level of bad recommendations (recall)?
What should the numbers look like at best: precision at 1.0 and recall at 0.0?
Well, by definition:
Precision represents how many of the results in your result set are correct.
Recall represents the probability that a correct element in the test set is selected as correct and included in the result set.
To be perfect, precision and recall should both be at 100%. What counts as good values must be evaluated according to your domain.
For example, if you have a bucket with good and bad mushrooms, you should aim at 100% precision no matter how low your recall is, because precision is critical for your health; you can afford to leave behind a lot of good mushrooms. The important thing is not to eat the bad ones.
You could pick one good mushroom and get 100% precision, but if there were four good mushrooms in your bucket, your recall is only 25%.
Ideally, precision and recall of 100% mean that all the mushrooms in your result set are good and that every good mushroom is in your result set, with none left behind in your test set.
So the same values can have different meanings.
Sadly, your results look quite poor, because you are getting many false positives and many false negatives.
Take a look here.
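If it helps, here is a tiny sketch of the mushroom arithmetic above, with a hypothetical helper (precision_recall is my own name, not a Mahout API) that computes both metrics from counts:

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Precision = correct picks / all picks; recall = correct picks / all good items."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# You picked 1 mushroom and it was good (no bad ones picked), but 3 other good
# mushrooms were left behind: precision = 1.0, recall = 0.25.
print(precision_recall(true_pos=1, false_pos=0, false_neg=3))
```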
