Google Web Optimizer (A/B Testing): Why no clear winner?

I've previously asked how long it takes for a winning combination to appear on Google's Web Optimizer, but now I have another weird problem during an A/B test:
For the past two days, Google has announced that there was a "High Confidence Winner" that had a 98.5% chance of beating the original variation by 27.4%. Great!
I decided to leave it running to make absolutely sure, but something weird happened: Today Google is saying that they "haven't collected enough data yet to show any significant results" (as shown below). Sure, the figures have changed slightly, but they're still very high: 96.6% chance of beating the original by 22%.
So, why is Google not so sure now?
How could it have gone from having a statistically significant "High Confidence" winner, to not having enough data to calculate one? Are my numbers too tiny for Google to be absolutely sure or something?
Thanks for any insights!

How could it have gone from having a statistically significant "High Confidence" winner, to not having enough data to calculate one?
With all statistical tests there is what's called a p-value, which is the probability of obtaining the observed result by random chance, assuming there is no real difference between what's being tested. So when you run a test, you want a small p-value so that you can be confident in your results.
So GWO must use a p-value threshold between 1.5% and 3.4% (I'm guessing it's 2.5%, at least in this case; it may depend on the number of combinations).
So when (100% - chance to beat %) > p-value, GWO will say that it has not collected enough information, and if a combination has (100% - chance to beat %) < p-value, then a winner is found. Obviously, if that line has only just been crossed, it could easily cross back with a little more data.
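A minimal sketch of that decision rule, assuming the 2.5% threshold guessed above (GWO's real threshold isn't documented here):

    # Sketch of the decision rule described above. The 2.5% threshold is a
    # guess, not a documented GWO value.
    P_VALUE_THRESHOLD = 0.025

    def gwo_verdict(chance_to_beat_original):
        """chance_to_beat_original: e.g. 0.985 for 98.5%."""
        if (1 - chance_to_beat_original) < P_VALUE_THRESHOLD:
            return "High Confidence Winner"
        return "Not enough data yet"

    print(gwo_verdict(0.985))   # two days ago: winner declared
    print(gwo_verdict(0.966))   # today: back over the line, so "not enough data"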
To summarize: you shouldn't be checking the results frequently. You should set up a test, ignore it for a good while, and only then check the results.
Are my numbers too tiny for Google to be absolutely sure or something?
No

Related

Is it OK to prolong a non-significant A/B test?

We know from this article that ending an A/B test early due to "significant" results is a mistake.
But what about when a test runs for the desired time period and shows insignificant results – is it fine to prolong it? What are the risks?
It would be great to have a simple mathematical example of any risks, similar to the example in that linked article.
I have only a basic knowledge of probability theory and maths, so I would appreciate an answer I can understand with that knowledge.
My intuition is that it could be problematic, because you had an experiment with a calculated reliability (it will show false positives in X% and false negatives in Y% of such experiments), but now you're effectively waiting indefinitely for the first true-positive or false-positive significance.
So I should think you get more false positives than you accounted for when setting up the original experiment. But presumably the likelihood of false positives also decreases as we get more data. I would love to get specific numbers on this, if it's true at all.
This is an area of current research. We've done some modeling and advise our customers to follow this principle:
• If the experiment reaches statistical significance, i.e. when the CI ribbon entirely rises above zero or entirely falls below it, and remains significant for 50% more observations than it took to get to significance for 0.10-level tests (65% more observations than it took to get to significance for 0.05-level tests), the experiment is called by accepting the alternative hypothesis, or, in other words, the treatment wins.
• If the experiment does not reach statistical significance, while the CI ribbon has narrowed to where its width represents a difference between the treatment and the control that is not consequential to the application semantics, the experiment is called by rejecting the research hypothesis, or, in other words, the treatment fails to win and we stick with the control.
For more, here's the White Paper.
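A minimal sketch of that stopping rule, assuming a hypothetical ci_history list of (low, high) confidence-interval bounds for the treatment-minus-control difference, one entry per batch of observations in time order; the "negligible width" cutoff stands in for whatever difference is immaterial to your application:

    # Sketch of the stopping rule above, for 0.10-level tests
    # (use extra=0.65 for 0.05-level tests).
    def call_experiment(ci_history, extra=0.50, negligible_width=0.005):
        first_sig = None
        for i, (low, high) in enumerate(ci_history, start=1):
            significant = low > 0 or high < 0      # CI entirely above or below zero
            if significant:
                if first_sig is None:
                    first_sig = i                  # observations to first reach significance
                elif i >= first_sig * (1 + extra):
                    return "treatment wins"        # stayed significant 50% longer
            else:
                first_sig = None                   # significance lost; start over
        last_low, last_high = ci_history[-1]
        if not (last_low > 0 or last_high < 0) and (last_high - last_low) < negligible_width:
            return "treatment fails to win; keep the control"
        return "keep running"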

Why can't I target the complement of my goal in Optimizely?

Optimizely's Sample Size calculator shows that a higher baseline conversion rate leads to a smaller required sample size for an A/B test. So, instead of maximizing my conversion goal, I'd like to minimize the opposite, i.e. not reaching the goal.
For every goal with a conversion rate less than 50%, its complement would be higher than 50% and would thus require a smaller sample size if targeted.
An example: instead of measuring all users that visit payment-success.html, I'd rather measure all users that don't visit it, and try minimizing that. That would usually require a much smaller sample size, if my reasoning is correct!
Optimizely only lets me target pageviews as goals, not the absence of a pageview.
I realize I'm probably missing or misunderstanding something important here, but if so, what is it?
Statistically there's nothing wrong with your approach, but unfortunately it won't have the desired effect of lowering the duration.
While you'll reduce the margin of error, you'll proportionately decrease the lift, causing you to take the same amount of time to reach confidence.
Since the lift is calculated as a percentage of the baseline conversion rate, the same change in conversion rate of a larger baseline will produce a smaller lift.
Say your real conversion rate is 10% and the test winds up increasing it to 12%. The inverse conversion rate would be 90%, which gets lowered to 88%. In both cases it's a change of 2 percentage points, but that is a much greater relative change to 10% (a 20% lift) than it is to 90% (only a -2.22% lift).
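A rough back-of-the-envelope check using a generic two-proportion sample-size approximation (not necessarily the formula Optimizely's calculator uses): because p*(1-p) and the absolute difference are identical for a goal and its complement, the required sample size comes out the same.

    from math import ceil

    # Rough per-variation sample size: n ~ (z_alpha/2 + z_beta)^2 * 2 * p(1-p) / d^2
    # with alpha = 0.05 and 80% power. A generic approximation, not Optimizely's exact formula.
    def sample_size_per_variation(p_baseline, p_variant, z_alpha=1.96, z_beta=0.84):
        d = abs(p_variant - p_baseline)              # absolute difference is 0.02 either way
        variance = p_baseline * (1 - p_baseline)     # 0.09 for both 10% and 90%
        return ceil((z_alpha + z_beta) ** 2 * 2 * variance / d ** 2)

    print(sample_size_per_variation(0.10, 0.12))     # goal: 10% -> 12%
    print(sample_size_per_variation(0.90, 0.88))     # complement: 90% -> 88%
    # Both print the same n, so flipping the goal doesn't shorten the test.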
Practically, you run a much larger risk of incorrectly bucketing people into the goal with the inverse. You know that someone who hits the success page should be counted toward the goal. I'm pretty sure what you're suggesting would cause every pageview that wasn't on the success page after the user saw the experiment to count as a goal.
Say you're testing the home page. Person A and B both land on the home page and view the experiment.
Person A visits 1 other page and leaves
Person B visits 1 other page and buys something
If your goal was set up on the success page, only person B would trigger the goal. If the goal was set up on all other pages, both people would trigger the goal. That's obviously not the inverse.
In fact, if there are any pages necessary to reach the success page after the user first sees the experiment (so unless you're testing the final step of checkout), everyone will trigger the inverse pageview goal (whether they hit the success page or not).
Optimizely pageview goals aren't just for pages included in the URL Targeting of your experiment. They're counted for anyone who's seen the experiment and at any point afterward hit that page.
Just to answer whether this is possible (not addressing whether your setup will produce the same outcome): you're right that Optimizely's pageview goal doesn't allow for "not", but you can probably use the Regex match type to achieve what you want (see 'URL Match Type' in point 3 here). In this case it would look like the pattern below, taken from this answer here (which also explains the complexity involved with negative matching in Regex, and suggests why Optimizely hasn't built "not" pageviews into the product).
^((?!payment-success\.html).)*$
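A quick sanity check of that pattern, assuming Python-flavoured regex behaviour (Optimizely's own matcher may differ slightly):

    import re

    # Negative lookahead: matches any URL that does NOT contain "payment-success.html".
    pattern = re.compile(r'^((?!payment-success\.html).)*$')

    print(bool(pattern.match('https://example.com/checkout/step-2')))       # True
    print(bool(pattern.match('https://example.com/payment-success.html')))  # False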
Hopefully that helps you get to where you want.

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution on the training data, but I am also aware that this is mainly due to its tendency to over-fit during the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day, the GA only buys one from the list, and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA, since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters which will produce an optimal output for two different datasets. If we take this further, does this no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
• The blue line is the GA output.
• The red line is the training data (slightly different because of the aforementioned randomness).
• The yellow line is the stubborn test data, which shows no generalization. In fact, this is the most flattering graph I could produce.
The y-axis is profit; the x-axis is the trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
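A minimal sketch of the "swap the fitness cases each generation" idea, using toy stand-ins (a genome as a list of weights, a stock as a list of daily returns, and a trivial backtest); your real genome, variation operators, and backtester would slot in here:

    import random

    # Toy stand-ins: a genome is a list of weights, a "stock" is a list of daily
    # returns, and fitness is the strategy's total weighted return. These are
    # placeholders for your real trading-rule genome and backtester.
    def backtest(genome, stocks):
        return sum(w * r for stock in stocks for w, r in zip(genome, stock))

    def mutate(genome):
        return [w + random.gauss(0, 0.1) for w in genome]

    def evolve(population, all_training_stocks, generations=200, stocks_per_gen=4):
        for _ in range(generations):
            # Re-sample the fitness cases each generation so no single subset
            # of stocks can be memorised by the population.
            stocks = random.sample(all_training_stocks, stocks_per_gen)
            ranked = sorted(population, key=lambda g: backtest(g, stocks), reverse=True)
            elite = ranked[: len(ranked) // 2]
            population = elite + [mutate(random.choice(elite)) for _ in elite]
        return population

    # Usage with random toy data: 12 "stocks" of 50 daily returns each.
    stocks = [[random.gauss(0, 0.01) for _ in range(50)] for _ in range(12)]
    population = [[random.gauss(0, 1) for _ in range(50)] for _ in range(30)]
    best = max(evolve(population, stocks), key=lambda g: backtest(g, stocks))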
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no-free-lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within the set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.

How long will it take to audit 29k lines of Drupal code?

A client is asking how long it would take to audit the security of his Drupal module, which is 29k lines long. Does anyone know at least what ballpark I should give him? His main concerns are file encryption and user permissions.
Nope, not a damn clue :-)
However, whatever value you choose, may I suggest one thing?
Monitor your progress! Tell your client that your initial estimate is (for example) twenty-nine working days but that it depends on a great many factors outside your control.
Tell them you plan to mitigate the risk of budget overrun by providing a daily snapshot of progress (a quick sketch of the arithmetic follows the list):
• current number of lines audited in total [a].
• days spent [b].
• current "run rate" (average number of lines per day) [c = a / b].
• number of lines yet to be audited [d = 29,000 - a].
• estimated days to completion [e = d / c].
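For illustration, a minimal sketch of that snapshot (the sample inputs are made up):

    # Daily progress snapshot from the list above, assuming the 29,000-line total.
    TOTAL_LINES = 29_000

    def snapshot(lines_audited, days_spent):
        run_rate = lines_audited / days_spent         # c = a / b
        remaining = TOTAL_LINES - lines_audited       # d = 29,000 - a
        days_to_completion = remaining / run_rate     # e = d / c
        return run_rate, remaining, days_to_completion

    # Made-up example: 4,500 lines audited in the first 5 days.
    rate, left, eta = snapshot(lines_audited=4_500, days_spent=5)
    print(f"run rate: {rate:.0f} lines/day, {left} lines left, ~{eta:.1f} days to go")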
Allow them to pull the plug at any time if the run rate is well below what you estimated.
This basic project management/reporting should give them the confidence that you know what you're doing, and will minimise their exposure considerably, to the point where they'll feel a lot more comfortable about taking you on.
Just on that last bullet point above, you may want to consider giving them a range (say +/-5% of the estimate), but don't get too clever about working out best and worst case based on your best and worst days to date. The power of averaging is that it gives you a "best" guess without having to fiddle too much with figures.
Typical estimates I've seen are that you can expect a developer to review 100-150 lines of code per hour. This is a very rough estimate, and it will vary greatly depending upon the nature of the code and the thoroughness of the review. Also, if you can review code for 8 hours a day, 5 days a week, straight, you're inhuman and amazing; the rest of us need a change of activity to clear the brain.
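As a rough cross-check of those rates against the 29k-line module (the figure of four effective review hours per day is an assumption):

    import math

    # Cross-check of the estimate using the 100-150 lines/hour range. The
    # "effective review hours per day" figure is an assumption, since nobody
    # reviews code attentively for a full 8-hour day.
    lines = 29_000
    hours_per_day = 4
    for rate in (100, 150):                          # lines reviewed per hour
        hours = lines / rate
        days = math.ceil(hours / hours_per_day)
        print(f"{rate} lines/hr -> {hours:.0f} hours, roughly {days} working days")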

Google Web Optimizer -- How long until winning combination?

I've had an A/B Test running in Google Web Optimizer for six weeks now, and there's still no end in sight. Google is still saying: "We have not gathered enough data yet to show any significant results. When we collect more data we should be able to show you a winning combination."
Is there any way of telling how close Google is to making up its mind? (Does anyone know what algorithm it uses to decide whether there's been a "high confidence winner"?)
According to the Google help documentation:
Sometimes we simply need more data to be able to reach a level of high confidence. A tested combination typically needs around 200 conversions for us to judge its performance with certainty.
But all of our combinations have over 200 conversions at the moment:
230 / 4061 (Original)
223 / 3937 (Variation 1)
205 / 3984 (Variation 2)
205 / 4007 (Variation 3)
How much longer is it going to have to run??
Thanks for any help.
Is there any way of telling how close Google is to making up its mind?
You can use the GWO calculator to help determine how long a test will take, based on a number of assumptions that you provide. Keep in mind, though, that it is possible there is no significant difference between your test combinations, in which case a test to see which is best would take an infinite amount of time, because there is no winner to find.
(Does anyone know what algorithm it uses to decide whether there's been a "high confidence winner"?)
That is a mystery, but with most, if not all, statistical tests there is what's called a p-value, which is the probability of obtaining a result as extreme as the one observed by chance alone. GWO tests run until the p-value passes some threshold, probably 5%. To be clearer: GWO tests run until a combination is significantly better than the original combination, such that the result has only a 5% chance of occurring by chance alone.
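As a rough illustration using your numbers (a plain two-proportion z-test, not necessarily the exact test GWO runs), comparing the original against Variation 1:

    from math import sqrt
    from statistics import NormalDist

    def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
        """Two-sided two-proportion z-test; not necessarily GWO's exact method."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    # Original vs. Variation 1, using the counts from the question.
    print(two_proportion_p_value(230, 4061, 223, 3937))   # far above 0.05 -> no winner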
For your test there appears to be no significant winner; it's a tie.

Resources