Why can't I target the complement of my goal in Optimizely?

Optimizely's Sample Size calculator shows that a higher baseline conversion rate leads to a smaller required sample size for an A/B test. So instead of maximizing my conversion goal, I'd like to minimize its opposite, i.e. not reaching the goal.
For every goal with a conversion rate less than 50%, its complement would be higher than 50% and would thus require a smaller sample size if targeted.
An example: instead of measuring all users who visit payment-success.html, I'd rather measure all users who don't visit it, and try to minimize that, which would usually require a much smaller sample size if my reasoning is correct!
Optimizely only lets me target pageviews as goals, not the absence of a pageview.
I realize I'm probably missing or misunderstanding something important here, but if so, what is it?

Statistically there's nothing wrong with your approach, but unfortunately it won't have the desired effect of lowering the duration.
While you'll reduce the margin of error, you'll proportionately decrease the lift, causing you to take the same amount of time to reach confidence.
Since the lift is calculated as a percentage of the baseline conversion rate, the same change in conversion rate of a larger baseline will produce a smaller lift.
Say your real conversion rate is 10% and the test winds up increasing it to 12%. The inverse conversion rate would be 90% which gets lowered to 88%. In both cases it's a change of 2%, but 2% is a much greater change to 10% (it's a 20% lift) than it is to 90% (only a -2.22% lift).
Practically, you run a much larger risk of incorrectly bucketing people into the goal with the inverse. You know that someone who hits the success page should be counted toward the goal; I'm pretty sure what you're suggesting would count every pageview that wasn't the success page, after the user saw the experiment, as a goal.
Say you're testing the home page. Person A and B both land on the home page and view the experiment.
Person A visits 1 other page and leaves
Person B visits 1 other page and buys something
If your goal was set up on the success page, only person B would trigger the goal. If the goal was set up on all other pages, both people would trigger the goal. That's obviously not the inverse.
In fact, if there are any pages necessary to reach the success page after the user first sees the experiment (so unless you're testing the final step of checkout), everyone will trigger the inverse pageview goal (whether they hit the success page or not).
Optimizely pageview goals aren't just for pages included in the URL Targeting of your experiment. They're counted for anyone who's seen the experiment and at any point afterward hit that page.

Just to answer whether this is possible (not addressing whether your setup will produce the same outcome): you're right that Optimizely's pageview goal doesn't support negation, but you can probably use the Regex match type to achieve what you want (see 'URL Match Type' in point 3 here). In this case it would look like the pattern below, taken from this answer here (which also explains the complexity involved with negative matching in regex, suggesting why Optimizely hasn't built "not" pageviews into the product).
^((?!payment-success\.html).)*$
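A quick sanity check of that pattern (using Python's regex engine; the URLs are hypothetical):

```python
import re

# Negative-lookahead pattern from the answer above: it matches any URL
# that does NOT contain "payment-success.html" anywhere.
pattern = re.compile(r'^((?!payment-success\.html).)*$')

print(bool(pattern.match('/cart.html')))             # True  (no success page)
print(bool(pattern.match('/payment-success.html')))  # False (contains it)
```

At every position the lookahead asserts the forbidden string does not start there, so the pattern can only match strings that never contain it.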
Hopefully that helps you get to where you want.

Related

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day the GA only buys one from the list and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA, since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters that will produce an optimal output for two different datasets. Taking this further, does the no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
- The blue line is the GA output.
- The red line is the training data (slightly different because of the aforementioned randomness).
- The yellow line is the stubborn test data, which shows no generalization. In fact this is the most flattering graph I could produce.
The y-axis is profit; the x-axis is the trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
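A minimal sketch of the simpler time-based rotation idea (not SCALP itself; the selection scheme and all names are illustrative):

```python
import random

def mutate(ind, sigma=0.1):
    """Gaussian perturbation of a real-valued genome."""
    return [g + random.gauss(0, sigma) for g in ind]

def evolve(population, training_cases, fitness, generations=50, subset=2):
    """Toy generational GA: each generation scores fitness on a *different*
    random subset of training cases, so no single case can be memorized."""
    for _ in range(generations):
        cases = random.sample(training_cases, subset)  # rotate the cases
        population.sort(key=lambda ind: fitness(ind, cases), reverse=True)
        survivors = population[:len(population) // 2]  # keep the top half
        population = survivors + [mutate(ind) for ind in survivors]
    return population
```

The key line is the `random.sample` call inside the loop: an individual that merely memorizes one training series gets punished the next generation, when it is scored on a different one.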
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.

Will different website A/B tests interfere with either test's results?

I have a question about running an A/B test against different pages on a website and if I should worry about them interfering with either test's results. Not that it matters, but I'm using Visual Website Optimizer to do the testing.
For example, if I have two A/B tests running on different pages in the order placement flow, should I worry about the tests having an effect on one another's conversion rate for the same conversion goal? For example, I have two tests running on a website, one against the product detail page and another running on the shopping cart. Ultimately I want to know if a variation of either page affects the order placement conversion rate. I'm not sure if I should be concerned with the different tests' results interfering with one another if they are run at the same time.
My gut is telling me we don't have to worry about it, as the visitors on each page will be distributed across each variation of the other page. So the product detail page version A visitors will be distributed across the A and B variations of the cart, therefore the influence of the product detail page's variation A on order conversion will still be measured correctly even though the visitor sees different versions of the cart from the other test. Of course, I may be completely wrong, and hopefully someone with a statistics background can answer this question more precisely.
The only issue I can think of is a combination of one page's variation and another page's variation working together better than other combinations. But this seems unlikely.
I'm not sure if I'm explaining the issue clearly enough, so please let me know if my question makes sense. I searched the web and Stackoverflow for an answer, but I'm not having any luck finding anything.
I understand your problem; there is no quick answer, and it depends on the types of test you are running. There are times when A/B tests on different pages influence each other, especially if they are within the same sequence of actions, e.g. checkout.
A simple example: if on your first page variation A says "Click here to view pricing" and variation B says "Click here to get $500 cash", you may find that click-through on B is higher and declare that one successful. Once the user clicks, on the following page they are asked to enter their credit card details, with the variations being a "Pay" button that is either green or red. In a situation like this, people from variation A might have a better chance of actually entering their CC details and converting, as opposed to variation B, who may feel cheated.
I have noticed that when websites are in their seminal stages and are trying to get a feel for what customers respond to, drastic changes are made and these multivariate tests are more important. When there is some stability and traffic, however, the changes tend to be very subtle, the overall message and flow stay the same, and A/B tests become micro-refinements. In those cases there might be less value in multi-page cross-testing (does the background colour on page one mean anything three pages down the process? Probably not!).
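When the variations don't interact (unlike the $500-cash example above), the asker's intuition holds: independent randomization spreads the other test's variants evenly across each of yours. A quick simulation sketch with made-up, interaction-free conversion rates:

```python
import random

random.seed(1)

# Hypothetical true conversion rates by (test-1 variant, test-2 variant).
# Test 2 adds a flat +0.01 in variant B, i.e. no interaction with test 1.
RATES = {('A', 'A'): 0.10, ('A', 'B'): 0.12,
         ('B', 'A'): 0.11, ('B', 'B'): 0.13}

def run(n=200_000):
    stats = {'A': [0, 0], 'B': [0, 0]}   # test-1 variant -> [conversions, visitors]
    for _ in range(n):
        v1 = random.choice('AB')         # independent assignment for test 1
        v2 = random.choice('AB')         # ...and for test 2
        stats[v1][1] += 1
        if random.random() < RATES[(v1, v2)]:
            stats[v1][0] += 1
    return {v: c / t for v, (c, t) in stats.items()}

rates = run()
# Marginal rate for test-1 A comes out near mean(0.10, 0.12) = 0.11,
# and for B near mean(0.11, 0.13) = 0.12: test 1's +0.01 lift survives
# intact even though visitors saw both versions of the other page.
```

The caveat from the answer above still applies: if the (A, B) cells had interacted, the marginal comparison would blend those interaction effects together.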
Hope this answer helps!

Google Web Optimizer (A/B Testing) Why no clear winner?

I've previously asked how long it takes for a winning combination to appear on Google's Web Optimizer, but now I have another weird problem during an A/B test:
For the past two days Google has announced that there was a "High Confidence Winner" that had a 98.5% chance of beating the original variation by 27.4%. Great!
I decided to leave it running to make absolutely sure, but something weird happened: Today Google is saying that they "haven't collected enough data yet to show any significant results" (as shown below). Sure, the figures have changed slightly, but they're still very high: 96.6% chance of beating the original by 22%.
So, why is Google not so sure now?
How could it have gone from having a statistically significant "High Confidence" winner, to not having enough data to calculate one? Are my numbers too tiny for Google to be absolutely sure or something?
Thanks for any insights!
How could it have gone from having a statistically significant "High Confidence" winner, to not having enough data to calculate one?
With all statistical tests there is what's called a p-value, which is the probability of obtaining the observed result by random chance, assuming there is no difference between what's being tested. So when you run a test, you want a small p-value so that you can be confident in your results.
So GWO must use a p-value threshold between 1.5% and 3.4% (I'm guessing it's 2.5%, at least in this case; it might depend on the number of combinations).
So when (100% - chance to beat %) > the p-value threshold, GWO will say it has not collected enough information, and when a combination has (100% - chance to beat %) < the threshold, a winner is declared. Obviously, if that line has only just been crossed, it could easily go back with a little more data.
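In code, the guessed rule looks like this (the 2.5% threshold is an assumption from above, not a documented GWO value):

```python
P_THRESHOLD = 0.025  # assumed cutoff; GWO's real value is not published

def verdict(chance_to_beat):
    """Translate a 'chance to beat original' into the guessed GWO message."""
    p = 1.0 - chance_to_beat          # implied p-value
    return ("high-confidence winner" if p < P_THRESHOLD
            else "not enough data")

print(verdict(0.985))  # the earlier reading: 1.5% < 2.5%, winner declared
print(verdict(0.966))  # the later reading: 3.4% > 2.5%, back to "not enough data"
```

Both of the question's readings sit close to the cutoff, which is exactly why the verdict flip-flopped as a little more data arrived.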
To summarize: you shouldn't be checking the results frequently. Set up a test, ignore it for a long while, then check the results.
Are my numbers too tiny for Google to be absolutely sure or something?
No

Google Web Optimizer -- How long until winning combination?

I've had an A/B Test running in Google Web Optimizer for six weeks now, and there's still no end in sight. Google is still saying: "We have not gathered enough data yet to show any significant results. When we collect more data we should be able to show you a winning combination."
Is there any way of telling how close Google is to making up its mind? (Does anyone know what algorithm it uses to decide whether there's been a "high confidence winner"?)
According to the Google help documentation:
Sometimes we simply need more data to be able to reach a level of high confidence. A tested combination typically needs around 200 conversions for us to judge its performance with certainty.
But all of our combinations have over 200 conversions at the moment:
230 / 4061 (Original)
223 / 3937 (Variation 1)
205 / 3984 (Variation 2)
205 / 4007 (Variation 3)
How much longer is it going to have to run??
Thanks for any help.
Is there any way of telling how close Google is to making up its mind?
You can use the GWO calculator to help determine how long a test will take, based on a number of assumptions that you provide. Keep in mind, though, that it is possible there is no significant difference between your combinations, in which case a test to see which is best would take an infinite amount of time, because there is no winner to find.
(Does anyone know what algorithm it uses to decide whether there's been a "high confidence winner"?)
That is a mystery, but with most, if not all, statistical tests, there is what's called a p-value which is the probability of obtaining a result as extreme as the one observed by chance alone. GWO tests run until the p-value passes some threshold, probably 5%. To be more clear, GWO tests run until a combination is significantly better than the original combination, such that the result only has a 5% chance of occurring by chance alone.
For your test there appears to be no significant winner; it's a tie.
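You can verify the tie with a standard two-proportion z-test on the counts from the question (a generic test, not GWO's actual algorithm):

```python
import math

def z_test(c1, n1, c2, n2):
    """Two-sided two-proportion z-test: returns (z, p_value)."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)                       # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error of the diff
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided tail probability
    return z, p_value

# Original (230/4061) vs Variation 1 (223/3937), from the question
z, p = z_test(230, 4061, 223, 3937)
# Both rates are ~5.66%, so z is essentially 0 and the p-value is far
# above any reasonable threshold: no detectable difference.
```

The other variations are even closer to the original, so no amount of extra traffic at these rates is likely to produce a high-confidence winner.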

How do you measure if an interface change improved or reduced usability?

For an ecommerce website how do you measure if a change to your site actually improved usability? What kind of measurements should you gather and how would you set up a framework for making this testing part of development?
Multivariate testing and reporting is a great way to actually measure these kind of things.
It allows you to test what combination of page elements has the greatest conversion rate, providing continual improvement on your site design and usability.
Google Web Optimiser has support for this.
Use methods similar to those you used to identify the usability problems to begin with: usability testing. Typically you identify your use cases and then run a lab study evaluating how users go about accomplishing certain goals. Lab testing typically works well with 8-10 people.
A more data-driven methodology we have adopted to understand our users is anonymous data collection (you may need user permission; make your privacy policies clear, etc.). This simply means recording which buttons/navigation menus users click on, or how users delete something (e.g. when changing quantity, are more users entering 0 and updating the quantity, or hitting X?). This is a bit more complex to set up: you have to develop an infrastructure to hold this data (which is really just counters, e.g. "Times clicked X: 138838383, Times entered 0: 390393") and allow data points to be created as needed and plugged into the design.
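The counter infrastructure described above can start as simply as a named-event tally (the event names here are made up for illustration):

```python
from collections import Counter

# In-memory tally of UI events; a real deployment would persist these
# counts, but the data model is just name -> count.
events = Counter()

def track(event_name):
    """Record one occurrence of a UI event."""
    events[event_name] += 1

# Example session: one user zeroes a quantity, another clicks X twice
track('qty_set_to_zero')
track('remove_clicked')
track('remove_clicked')
# events now holds {'remove_clicked': 2, 'qty_set_to_zero': 1}
```

New data points cost nothing to add: tracking a new interaction is just calling `track` with a new name.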
To push the measurement of an improvement of a UI change up the stream from end-user (where the data gathering could take a while) to design or implementation, some simple heuristics can be used:
Is the number of actions it takes to perform a scenario lower? (If yes, it has improved.) Measurement: # of steps reduced/added.
Does the change reduce the number of kinds of input devices used (even if the # of steps is the same)? By this, I mean if you take something that relied on both the mouse and keyboard and changed it to rely only on the mouse or only on the keyboard, then you have improved usability. Measurement: change in # of devices used.
Does the change make different parts of the website consistent? E.g. if one part of the e-commerce site loses changes made while you are not logged in and another part does not, this is inconsistent. Changing it so that they have the same behavior improves usability (preferably toward the more fault-tolerant one, please!). Measurement: make a graph (a flow chart, really) mapping the ways a particular action could be done. Improvement is a reduction in the # of edges on the graph.
And so on... find some general UI tips, figure out some metrics like the above, and you can approximate usability improvement.
Once you have these design-level approximations of usability improvement and have gathered longer-term data, you can see whether the design-level improvements have any predictive ability for the end-user reaction (like: over the last 10 projects, we've seen scenarios get an average of 1% quicker for each action removed, with a range of 0.25% and a standard deviation of 0.32%).
The first way can be fully subjective or partly quantified: user complaints and positive feedback. The problem with this is that you may have some strong biases when it comes to filtering that feedback, so you had better make it as quantitative as possible. Having a ticketing system to file every report from users and gathering statistics about each version of the interface might be useful. Just get your statistics right.
The second way is to measure the difference via a questionnaire about the interface, taken by end users. Answers to each question should be a set of discrete values, and then again you can gather statistics for each version of the interface.
The latter way may be much harder to set up (designing a questionnaire, possibly a controlled environment for it, and the guidelines to interpret the results is a craft by itself), but the former makes it unpleasantly easy to mess up the measurements. For example, you have to consider the fact that the number of tickets you get for each version depends on how long it is used, and that all time ranges are not equal (e.g. a whole class of critical issues may never be discovered before the third or fourth week of usage, or users might tend not to file tickets in the first days of use, even if they find issues, etc.).
Torial stole my answer. Although, if there is a way to measure how long it takes to do a certain task: if the time is reduced and the task is still completed, that's a good thing.
Also, if there is a way to record the number of cancels, then that would work too.
