What is wrong with the following reasoning when doing a t-test for a difference in means?
(Assume this is an independent two-sample t-test with equal sample sizes and equal variances.)
Perform a two-tailed test and see that there is a statistically significant difference between the two means.
From here, run a one-tailed test in each direction to test whether one group has a statistically significantly larger mean.
In my mind that makes sense: test for a difference, then test for the direction of the difference.
What is the issue with performing a two-tailed test before a one-tailed test?
Note: I asked this before with a more concrete example, but the responses went stale.
I conducted a study on a group of cases and a group of controls. I observed 7 variants in the cases and none in the controls.
I would like to test whether the 7:0 finding represents a significant difference.
I thought of using Fisher's exact test, but I am not sure how it can be performed on a 1x2 table, or whether it is a suitable test for this analysis. I also thought of applying a correction to avoid the zero, for example comparing 7.5 and 0.5.
Is there a better test for such cases?
All suggestions are welcome. Thank you.
You have a 2x2 table, but you are ignoring one of the rows. You need to add how many non-variants you had in each group. The answer to your question depends on that. Say you had 7 variants out of 30 cases and 0 variants out of 20 controls. Your contingency table would look like this:
              Cases  Controls
Variants          7         0
Non-variants     23        20
Without this information, testing makes no sense. Now, you can use an exact test, such as Fisher's (conditional) or Barnard's (unconditional). I have described an unconditional test that seems to be more powerful than those (link). In this case,
> library(mtest)
> m.test(list(c(7, 23), c(0, 20)))
[1] 0.01702871
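For comparison, Fisher's exact test on the same hypothetical table can be sketched in plain Python. This is a minimal illustration that assumes the 7/30 and 0/20 counts used above; the function name `fisher_exact_two_sided` is made up for this example, and it is a different test from `m.test`, so the p-value is not expected to match the output above.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small factor guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: variants, non-variants; columns: cases, controls.
p_value = fisher_exact_two_sided(7, 0, 23, 20)
print(round(p_value, 4))
```

With these assumed counts the two-sided p-value comes out at roughly 0.033. With different numbers of non-variants the result changes, which is why the full table is needed.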
In the Analytics Vidhya post linked below, an ANOVA test is performed on COVID data to check whether the difference in positive cases across the denser regions is statistically significant.
I believe an ANOVA test can't be performed on this COVID time-series data, at least not in the way it has been done in this post.
Sample data has been drawn randomly from different groups (denser1, denser2…denser4). The data is a time series, so the positive-case counts in each group's random sample are likely to come from different points in time.
It may be that denser1's sample comes from early in the pandemic and another region's sample comes from a different period. If so, the F-statistic will almost certainly be high.
Can anyone explain, or offer a different opinion?
https://www.analyticsvidhya.com/blog/2020/06/introduction-anova-statistics-data-science-covid-python/
ANOVA should not be applied to time-series data, as the independence assumption is violated. The issue with independence is that days tend to correlate very highly. For example, if you know that today you have 1400 positive cases, you would expect tomorrow to have a similar number of positive cases, regardless of any underlying trends.
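A rough simulation can illustrate the point. The sketch below assumes four groups with identical true means, models day-to-day correlation as an AR(1) process with a made-up coefficient of 0.9, and calibrates the 5% critical value empirically from independent data. All names and parameters are illustrative and are not taken from the linked post.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def series(n, phi, rng):
    """AR(1) series x[t] = phi * x[t-1] + noise; phi = 0 gives independent draws."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def simulate_f(phi, n_sims, rng, k=4, n=60):
    # Every group has the same true mean, so any "significant" F is a false positive.
    return [f_statistic([series(n, phi, rng) for _ in range(k)]) for _ in range(n_sims)]

rng = random.Random(0)
f_iid = sorted(simulate_f(0.0, 1000, rng))
threshold = f_iid[int(0.95 * len(f_iid))]  # empirical 5% critical value under independence
f_ar = simulate_f(0.9, 1000, rng)
false_positive_rate = sum(f > threshold for f in f_ar) / len(f_ar)
print(f"empirical critical F: {threshold:.2f}")
print(f"rejection rate with phi=0.9: {false_positive_rate:.2f}")
```

Under these assumptions the correlated series exceed the critical value far more often than the nominal 5%, even though no group truly differs from another.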
It sounds like you're trying to determine the causal effect of different treatments (i.e., mask mandates or other restrictions) on positive cases. The best way to infer causality is usually to perform A/B testing, but obviously in this case it would not be reasonable to give different populations different treatments. One method that is good for going back and retroactively inferring causality is called "synthetic control".
https://economics.mit.edu/files/17847
The paper linked above is a basic introduction to the methodology. The hard part of this analysis will be in constructing synthetic counterfactuals or "controls" to test your actual population against.
If this is not what you're looking for, please reply with a clarifying question, but I think this should be an appropriate method that is well-suited to studying time-series data.
We know from this article that ending an A/B test early due to "significant" results is a mistake.
But what about when a test runs for the desired time period and shows insignificant results – is it fine to prolong it? What are the risks?
It would be great with a simple mathematical example of any risks, similar to the example in that linked article.
I have only a basic knowledge of probability theory and maths, so I would appreciate an answer I can understand with that knowledge.
My intuition is that it could be problematic, because you had an experiment with a calculated reliability (will show false positives in X% and false negatives in Y% of such experiments), but now you're effectively waiting indefinitely for the first true-positive or false-positive significance.
So I should think you get more false positives than you accounted for when setting up the original experiment. But presumably the likelihood of false positives also decreases as we get more data. I would love to get specific numbers on this, if it's true at all.
This is an area of current research. We've done some modeling and advise our customers to follow this principle:
• If the experiment reaches statistical significance, i.e. when the CI
ribbon entirely rises above zero or entirely falls below it, and
remains significant for 50% more observations than it took to get to
significance for 0.10 level tests (65% more observations than it took
to get to significance for .05 level tests), the experiment is called
by accepting the alternative hypothesis, or, in other words, the
treatment wins.
• If the experiment does not reach statistical significance, while the
CI ribbon has narrowed to where its width represents a difference
between the treatment and the control that is not consequential to the
application semantics, the experiment is called by rejecting the
research hypothesis, or, in other words, the treatment fails to win
and we stick with the control.
For more, here's the White Paper.
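The intuition in the question, that extending a test which has not yet reached significance raises the false-positive rate, can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the modeling referred to above: it simulates A/A experiments (no real difference) on a metric with unit variance, checks a two-sided z-test at the 5% level after each equally sized batch, and stops at the first significant result. The function name and all parameters are hypothetical.

```python
import math
import random

def false_positive_rate(max_looks, n_sims=20000, batch=1000, z_crit=1.96, seed=1):
    """Share of A/A experiments declared significant at any of `max_looks` checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        diff_sum = 0.0  # running sum of (A - B) over all paired observations
        n = 0           # observations per arm so far
        for _ in range(max_looks):
            # Sum of `batch` differences of two unit-variance metrics ~ N(0, 2 * batch).
            diff_sum += rng.gauss(0.0, math.sqrt(2 * batch))
            n += batch
            z = diff_sum / math.sqrt(2 * n)
            if abs(z) > z_crit:
                hits += 1
                break  # the experimenter stops at the first "significant" result
    return hits / n_sims

for looks in (1, 2, 5):
    print(looks, round(false_positive_rate(looks), 3))
```

With a single planned look the rate stays near the nominal 5%. Allowing the test to be prolonged and re-checked pushes it noticeably higher, and it keeps growing with each extra look, although each additional look adds less than the previous one.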
I write a lot of statistical methods for an application.
The problem is that I don't know how to test them appropriately.
For example, in a unit test I check whether the sum of all probabilities of a distribution is 1, but it is never exactly 1.
The sum might be 0.9999999 or even 1.0000000005; the actual value strongly depends on how many different outcomes the distribution has.
Maybe I can test like so:
the value should be less than 1.1
the value should be more than 0.9
But I am not sure that this test is reliable; maybe there is a distribution that, due to numerical error, will output 1.1.
How should I test this appropriately?
There is a related discussion here that you might find interesting.
The short version is that you want to break up your statistical methods into pieces that can be tested deterministically.
Where that's not the case, you probably want to use some epsilon value to compare your expected and actual outputs. You could also run several iterations of the test and perform a simpler statistical test (a t-test perhaps?) to see if the distribution looks like what you think it should be.
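As a minimal sketch of the epsilon approach, the check below compares the sum against 1 with an explicit absolute tolerance instead of the wide 0.9 to 1.1 band. The binomial distribution, the helper names, and the tolerance of 1e-9 are assumptions chosen for illustration; an appropriate tolerance depends on the distribution and how it is computed.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials (example distribution under test)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def probabilities_sum_to_one(probs, abs_tol=1e-9):
    # math.fsum tracks partial sums exactly, so rounding error does not grow
    # with the number of outcomes the way a naive running sum can.
    return math.isclose(math.fsum(probs), 1.0, rel_tol=0.0, abs_tol=abs_tol)

probs = [binomial_pmf(k, 200, 0.3) for k in range(201)]
assert probabilities_sum_to_one(probs)

# Sanity check of the test itself: dropping the most likely outcome must fail.
assert not probabilities_sum_to_one(probs[:60] + probs[61:])
```

A tight tolerance like this catches real bugs (a missing outcome, a wrong normalizing constant) that a 0.9 to 1.1 band would let through.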
I've had an A/B test running in Google Website Optimizer for six weeks now, and there's still no end in sight. Google is still saying: "We have not gathered enough data yet to show any significant results. When we collect more data we should be able to show you a winning combination."
Is there any way of telling how close Google is to making up its mind? (Does anyone know what algorithm does it use to decide if there's been any "high confidence winners"?)
According to the Google help documentation:
Sometimes we simply need more data to
be able to reach a level of high
confidence. A tested combination
typically needs around 200 conversions
for us to judge its performance with
certainty.
But all of our combinations have over 200 conversions at the moment:
230 / 4061 (Original)
223 / 3937 (Variation 1)
205 / 3984 (Variation 2)
205 / 4007 (Variation 3)
How much longer is it going to have to run??
Thanks for any help.
Is there any way of telling how close Google is to making up its mind?
You can use the GWO calculator to help determine how long a test will take based on a number of assumptions that you provide. Keep in mind, though, that there may be no real difference between your test combinations, in which case a test to see which is best would take an infinite amount of time, because there is no winner to find.
(Does anyone know what algorithm does it use to decide if there's been any "high confidence winners"?)
That is a mystery, but most, if not all, statistical tests produce what's called a p-value: the probability of obtaining a result at least as extreme as the one observed if there were no real difference, i.e. by chance alone. GWO tests presumably run until the p-value falls below some threshold, probably 5%. To be more clear, GWO tests run until a combination is significantly better than the original combination, such that a difference that large would have less than a 5% chance of occurring by chance alone.
For your test there appears to be no significant winner; it's a tie.
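The "tie" reading can be checked with a standard pooled two-proportion z-test of each variation against the original, using the counts from the question. This is a sketch only: it is not GWO's actual algorithm, which is unknown, and the function name is made up for this example.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

original = (230, 4061)  # (conversions, visitors)
variations = {
    "Variation 1": (223, 3937),
    "Variation 2": (205, 3984),
    "Variation 3": (205, 4007),
}
p_values = {
    name: two_proportion_p_value(*original, *counts)
    for name, counts in variations.items()
}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

None of the three comparisons comes close to the 5% threshold, which is consistent with no combination being a clear winner.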