I am learning about A/B testing and have run into some questions.
In the event of a borderline-significant p-value, say p = 0.049 versus p = 0.051, are the two really that different?
If I get a p-value of 0.051, what should I do? Gathering further data would be expensive, but I'm also hesitant to accept the null.
Also, suppose I do a follow-up analysis on subsets of the data split by one feature (i.e., I got p = 0.051 for the overall study, then divided the data into sports/movies/books and found p_sports = 0.01, p_movies = 0.07, p_books = 0.055). Can I conclude that the sports category is statistically significant?
Thanks!
In any case, keep in mind that testing each hypothesis has a price: if you test multiple hypotheses, you must be aware of the inflation of the type I error rate that can result ( https://en.wikipedia.org/wiki/Multiple_comparisons_problem ).
What you are suggesting is not the best way; any tests run after the results are known fall under the category of post-hoc analysis ( https://en.wikipedia.org/wiki/Post_hoc_analysis ). Such analyses should be planned before observing the data, otherwise it is just a blind chase.
In your specific situation, you should state your null hypothesis and the alpha level (the threshold below which you reject the null) up front. If the test gives you 0.051 and you set alpha to 0.05, then you should not reject the null (the software does that for you). Also, make sure that this is your final answer (check for missing cases, etc.). After this, you always have the discussion section to elaborate on why you got your results, and even to present findings from your post-hoc analyses and raise questions. If a post-hoc analysis gives a significant result, just state it and relate it to findings from previous research. If these results mean something, then a follow-up study with a proper design should answer that particular question.
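To make the type I error inflation concrete, here is a minimal sketch that applies two standard corrections (Bonferroni and Holm) to the three subset p-values from the question. The function names are made up for this illustration, and the family-wise error rate formula assumes the three tests are independent, which may not hold for real subsets.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni(p_values, alpha=0.05):
    """Reject H0 only for p-values at or below alpha / m."""
    m = len(p_values)
    return {name: p <= alpha / m for name, p in p_values.items()}


def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values with alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected


# p-values from the question's sports/movies/books split
subset_p = {"sports": 0.01, "movies": 0.07, "books": 0.055}
bonferroni_result = bonferroni(subset_p)
holm_result = holm(subset_p)

print(f"FWER for 3 uncorrected tests: {family_wise_error_rate(3):.3f}")
print("Bonferroni:", bonferroni_result)
print("Holm:", holm_result)
```

A corrected threshold only accounts for the number of tests. Because the split was chosen after seeing the overall result, a subset that passes is still best reported as an exploratory finding, as described above.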
I am aware that consulting a statistician is not free, and it is something I cannot afford, so I am trying my luck here. I've already finished gathering data for my research and am now calculating the results. However, I am stuck on which statistical treatment I should use for the data.
For background, I am using ISO 25010 to test my software's quality and user acceptance. The questionnaire consists of a number of questions for each cluster (functionality, reliability, usability, efficiency, maintainability, and portability). I've used an agreement-type Likert scale. The hypothesis of my research says "There is no significant difference in the user acceptance results in terms of [clusters]". As of now, I've used descriptive statistics to calculate the results: the mean (for each question), the average mean (the average of the means for each cluster), and the mode.
I feel that the results I currently have might be lacking when the final defense comes. As far as I know, using a combination of statistical methods is fine and gives a stronger foundation for your results.
Based on the background of my research, what other statistical methods should I use?
I am thinking of the sample standard deviation, but I don't know if I should compute it per question or per cluster.
Sorry, statistics is not really my forte.
Thank you in advance for your answers
We know from this article that ending an A/B test early due to "significant" results is a mistake.
But what about when a test runs for the desired time period and shows insignificant results – is it fine to prolong it? What are the risks?
It would be great to have a simple mathematical example of any risks, similar to the example in that linked article.
I have only a basic knowledge of probability theory and maths, so I would appreciate an answer I can understand with that knowledge.
My intuition is that it could be problematic, because you had an experiment with a calculated reliability (will show false positives in X% and false negatives in Y% of such experiments), but now you're effectively waiting indefinitely for the first true-positive or false-positive significance.
So I should think you get more false positives than you accounted for when setting up the original experiment. But presumably the likelihood of false positives also decreases as we get more data. I would love to get specific numbers on this, if it's true at all.
This is an area of current research. We've done some modeling and advise our customers to follow this principle:
• If the experiment reaches statistical significance, i.e. when the CI
ribbon entirely rises above zero or entirely falls below it, and
remains significant for 50% more observations than it took to get to
significance for 0.10 level tests (65% more observations than it took
to get to significance for .05 level tests), the experiment is called
by accepting the alternative hypothesis, or, in other words, the
treatment wins.
• If the experiment does not reach statistical significance, while the
CI ribbon has narrowed to where its width represents a difference
between the treatment and the control that is not consequential to the
application semantics, the experiment is called by rejecting the
research hypothesis, or, in other words, the treatment fails to win
and we stick with the control.
For more, here's the White Paper.
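As a rough illustration of the risk the question asks about, the sketch below simulates experiments with no true effect and compares two policies: test once at the planned sample size, or keep prolonging the test and re-checking until it turns significant. Everything here is an assumption for illustration: each observation's treatment-minus-control difference is modeled as a standard normal draw with known variance, a two-sided z-test at the 5% level is used, and the sample sizes and number of extensions are made-up numbers.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% critical value of the standard normal


def run_null_experiment(rng, planned_n, extra_n, max_extensions):
    """Simulate one experiment with no true effect.

    Returns (significant_at_planned_end, significant_at_any_look).
    """
    # The sum of n independent N(0, 1) differences is N(0, n).
    total = rng.gauss(0.0, math.sqrt(planned_n))
    n = planned_n
    first_look = abs(total) / math.sqrt(n) > Z_CRIT
    any_look = first_look
    for _ in range(max_extensions):
        if any_look:
            break  # stop as soon as the result looks significant
        total += rng.gauss(0.0, math.sqrt(extra_n))
        n += extra_n
        any_look = abs(total) / math.sqrt(n) > Z_CRIT
    return first_look, any_look


def false_positive_rates(trials=20000, planned_n=1000, extra_n=250,
                         max_extensions=8, seed=0):
    rng = random.Random(seed)
    fixed = extended = 0
    for _ in range(trials):
        first, any_look = run_null_experiment(rng, planned_n, extra_n,
                                              max_extensions)
        fixed += first
        extended += any_look
    return fixed / trials, extended / trials


fixed_rate, extended_rate = false_positive_rates()
print(f"test once at planned end:   {fixed_rate:.3f}")
print(f"prolong until significant:  {extended_rate:.3f}")
```

By construction the second rate can never be lower than the first, since every extra look is one more chance for a null effect to cross the threshold. This is the intuition in the question: more data narrows the estimate, but repeated checking still adds false positives beyond the nominal 5%.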
I am seeking a method to allow me to analyse/search for patterns in asset price movements using 5 variables that move and change with price (from historical data).
I'd like to be able to assign a probability to a forecasted price move: for example, when var1 and var2 do this and var3..5 do that, then price should do this with x amount of certainty.
Q1: Could someone point me in the right direction as to what framework / technique can help me achieve this?
Q2: Would this be a multivariate continuous random series analysis?
Q3: A Hidden Markov modelling?
Q4: Or perhaps is it a data-mining problem?
I'm looking for what rather than how.
One may opt to use Machine-Learning tools to build a learner to either
• classify what kind of "asset price movement" will occur, and also provide statistical probability measures for that Classifier prediction, or
• regress a real target value to which the asset price will move, and also provide statistical probability measures for that Regressor prediction.
A1: ( while StackOverflow strongly discourages users from asking for opinions about a tool or a particular framework ) it would cost little extra time to survey academic papers, which would yield a fairly long list of tools used repeatedly for ML in academic R&D. It would be no surprise to meet the scikit-learn ML classes a lot; other papers work with R-based quantitative-finance / statistical libraries. The tools, however, with all due respect, are not the key to resolving the doubts and initial confusion present in the mix of your questions. The confusion about the subject is.
A2: No, it would not. Well, unless you beat all the advanced quantitative research and happen to prove that the Market exhibits random behaviour ( which it does not, and it would be a waste of time to re-cite the remarkable research published on why it is indeed not a random process ).
A3: Do not jump on any wagon just because of its attractive tag or "contemporary popularity" in marketing-minded texts. With all due respect, understanding HMMs is still beyond your horizon while you are only starting to work out what to look for.
A4: This is a nice example of a missed target. This point shows, better than the others, how little research effort of your own went into covering the problem domain and acquiring at least some elementary knowledge before typing the last two questions.
StackOverflow encourages users to ask high-quality questions, so do not hesitate to edit your post and polish it.
If in need of inspiration, review a nice and powerful approach to fast Machine Learning in which both Classification and Regression tasks also yield probability estimates for each predicted target value.
To give some idea: highly performant ML predictors typically operate on many more than 5 variables ( called "features" in the ML domain ). ( Think rather of several hundred to a few thousand features, typically heavily non-linear transformations of the original time-series data. )
There you go, if you are indeed willing to master ML for algorithmic trading.
You may like to read about state-of-the-art research in this direction:
[1] Mondrian Forests: Efficient Online Random Forests, arXiv:1406.2673v2 [stat.ML], 16 Feb 2015
[2] Mondrian Forests for Large-Scale Regression when Uncertainty Matters, arXiv:1506.03805v4 [stat.ML], 27 May 2016
You may also enjoy other posts on the subject: StackOverflow Algorithmic-Trading
I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day, the GA buys only one from the list, chosen at random. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA, since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters which will produce an optimal output for two different datasets. If we take this further, does the no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
-> The blue line is the GA output.
-> The red line is the training data (slightly different because of the aforementioned randomness).
-> The yellow line is the stubborn test data, which shows no generalization. In fact, this is the most flattering graph I could produce.
The y-axis is profit; the x-axis is the trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.
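One way to make "stopping evolution sooner" less of a guess is to track fitness on a held-out set of stocks and stop once it no longer improves. The sketch below is only an illustration under that assumption: the held-out set, the `patience` parameter, the function name, and the fitness curve are all made up here and are not part of the approaches described above.

```python
def pick_stopping_generation(validation_fitness, patience=20):
    """Return the generation with the best held-out fitness seen before
    `patience` generations pass without any improvement."""
    if not validation_fitness:
        raise ValueError("need at least one generation")
    best_gen, best_fit = 0, validation_fitness[0]
    for gen, fit in enumerate(validation_fitness):
        if fit > best_fit:
            best_gen, best_fit = gen, fit
        elif gen - best_gen >= patience:
            break  # held-out fitness has stalled; later gains are likely over-fit
    return best_gen


# Made-up curve: held-out profit improves until generation 30, then decays
# while (not shown) training profit would keep rising.
validation_curve = [10.0 - 0.1 * abs(gen - 30) for gen in range(200)]
stop_at = pick_stopping_generation(validation_curve)
print("stop at generation", stop_at)
```

The held-out stocks used for this check can no longer serve as an unbiased final test, so a separate test set is still needed to judge generalization.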
I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them by the score assigned.
My question is: has anyone tried introducing a certain level of randomness into a search algorithm and seen a positive effect from it? I am thinking that it might be useful for reducing bias and promoting lower-ranking items, giving them a chance to be seen more easily and to become popular if they deserve it. I know that some machine learning algorithms introduce randomization to reduce bias, so I thought it might apply to search as well.
The closest I could find is this, but it is not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID.
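As a rough sketch of how such a request body might be assembled, the snippet below builds a `function_score` query with a `random_score` function seeded per session. The helper name, the `description` field, and the choice of `boost_mode` are assumptions for illustration. Hashing the session ID to an integer with `zlib.crc32` is also my choice, made so the seed is numeric. Newer Elasticsearch versions expect a `field` alongside `seed`, so check the documentation for your version before relying on this shape.

```python
import json
import zlib


def build_random_score_query(user_query, session_id, seed_field=None):
    """Build a search body whose random component is stable per session."""
    # Stable numeric seed derived from the session ID (an assumption; the
    # quoted guide passes the session ID itself as the seed).
    seed = zlib.crc32(session_id.encode("utf-8"))
    random_score = {"seed": seed}
    if seed_field is not None:
        random_score["field"] = seed_field  # e.g. "_seq_no" on newer versions
    return {
        "query": {
            "function_score": {
                # "description" is a hypothetical field name
                "query": {"match": {"description": user_query}},
                "functions": [{"random_score": random_score}],
                # add the random value (0 to 1) to the relevance score
                "boost_mode": "sum",
            }
        }
    }


body = build_random_score_query("running shoes", "session-a", seed_field="_seq_no")
print(json.dumps(body, indent=2))
```

With `boost_mode` set to `sum`, the random value reorders documents with equal scores and can also swap documents whose scores differ by less than 1, so how much randomness this adds depends on the scale of the relevance scores.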
Your intuition is right - randomization can help surface results that get a lower score than they deserve due to uncertainty in the estimation. Empirically, Google search ads seem to have been randomized at times, and e.g. this paper hints at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
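The idea of ordering by an optimistic estimate can be sketched with UCB1, one well-known bandit rule, used here only as an example of the general approach. The click and impression counts are made-up numbers, click-through rate stands in for "utility", and the function names are hypothetical.

```python
import math


def ucb_score(clicks, impressions, total_impressions, exploration=2.0):
    """Observed click rate plus an uncertainty bonus that shrinks with data."""
    if impressions == 0:
        return float("inf")  # never shown: maximally uncertain, show it first
    mean = clicks / impressions
    bonus = math.sqrt(exploration * math.log(total_impressions) / impressions)
    return mean + bonus


def rank_items(stats):
    """Sort item names by optimistic score; stats maps name -> (clicks, impressions)."""
    total = max(sum(impressions for _, impressions in stats.values()), 1)
    return sorted(stats, key=lambda name: ucb_score(*stats[name], total),
                  reverse=True)


# Made-up statistics: a well-measured popular item, a barely-seen item with a
# lower observed click rate, and an item that has never been shown.
stats = {"popular": (500, 10000), "niche": (2, 50), "unseen": (0, 0)}
ranking = rank_items(stats)
print(ranking)
```

The barely-seen item ranks above the popular one despite its lower observed rate, because 50 impressions leave a lot of room for its true rate to be higher. As it collects impressions, the bonus shrinks and the ordering converges toward the observed rates.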