Sorry, I posted the wrong question!
Knowing the per-customer probability p of purchasing gives you no information about how many people will show up. However, if you know that N people showed up and purchase decisions are independent, the distribution of how many will make a purchase is binomial with parameters N and p.
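For example, in R (the numbers below are made up): if N = 200 customers show up and each buys independently with probability p = 0.3, then

    # made-up numbers: N = 200 customers arrive, each buys with probability p = 0.3
    N <- 200
    p <- 0.3
    dbinom(60, size = N, prob = p)     # P(exactly 60 purchases)
    pbinom(70, size = N, prob = p)     # P(at most 70 purchases)
    qbinom(c(0.025, 0.975), N, p)      # central 95% range for the purchase count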
I am seeking a method to allow me to analyse/search for patterns in asset price movements using 5 variables that move and change with price (from historical data).
I'd like to be able to assign a probability to a forecast price move: for example, when var1 and var2 do this and var3..5 do that, then the price should do this with x amount of certainty.
Q1: Could someone point me in the right direction as to what framework / technique can help me achieve this?
Q2: Would this be a multivariate continuous random series analysis?
Q3: A Hidden Markov modelling?
Q4: Or perhaps is it a data-mining problem?
I'm looking for the what rather than the how.
One may opt to use machine-learning tools to build a learner that will either
- classify what kind the said "asset price movement" will be, and also serve statistical probability measures for each such Classifier prediction, or
- regress a real-valued target, i.e. the level to which the asset price will move, and also serve statistical probability measures for each such Regressor prediction.
A1: ( while StackOverflow strongly discourages asking for opinions about a tool or a particular framework ) there is not much damage done, nor much extra time spent, if one surveys the academic papers, where a remarkably consistent list of tools appears again and again in ML-oriented R&D. For that reason it is no surprise to meet scikit-learn ML classes a lot, while other papers work with R-based quantitative-finance / statistical libraries. The tools, however, with all due respect, are not the core of the answer to the doubts and initial confusion present in the mix of your questions. The confusion about the subject is.
A2: No, it would not be. Well, unless you out-do all the advanced quantitative research and happen to prove that the market exhibits random behaviour ( which it does not, and it would be a waste of time to re-cite the remarkable research already published on why it is indeed not a random process ).
A3: Do not try to jump on any bandwagon just because of its attractive tag or "contemporary popularity" in marketing-minded texts. With all due respect, understanding HMMs is beyond your current horizon while you are still working out the nearer question of what to look for in the first place.
A4: This is a nice proof of a missed target. This question, more than the others, shows how little of your own research effort went into covering the problem domain and acquiring at least some elementary knowledge before typing the last two questions.
StackOverflow encourages users to ask high-quality questions, so do not hesitate to re-edit your post and put some polishing effort into this subject.
If in need of inspiration, try reviewing a nice and powerful approach to a fast machine-learning process in which both classification and regression tasks also yield probability estimates for each predicted target value.
To get some idea of highly performant ML predictors: these typically operate on much more than a set of 5 variables ( called "features" in the ML domain ). Think rather of several hundred to a few thousand features, typically heavily non-linear transformations of the original time-series data.
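As a minimal sketch of the first option listed above — a classifier that also reports class probabilities — here is roughly what that looks like in R with the randomForest package on purely synthetic features ( this is not the Mondrian-forest method referenced below, just the general shape of the workflow; all names and numbers are invented ):

    library(randomForest)
    set.seed(1)
    # synthetic stand-in for engineered features and a 3-class "move" label
    n <- 1000
    X <- as.data.frame(matrix(rnorm(n * 20), ncol = 20))   # pretend: 20 engineered features
    y <- factor(sample(c("up", "flat", "down"), n, replace = TRUE))
    fit <- randomForest(x = X[1:800, ], y = y[1:800])
    # class probabilities for held-out rows, one column per class
    head(predict(fit, newdata = X[801:1000, ], type = "prob"))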
There you go, if you are indeed willing to master ML for algorithmic trading.
You may like to read about state-of-the-art research in this direction:
[1] Mondrian Forests: Efficient Online Random Forests, arXiv:1406.2673v2 [stat.ML], 16 Feb 2015
[2] Mondrian Forests for Large-Scale Regression when Uncertainty Matters, arXiv:1506.03805v4 [stat.ML], 27 May 2016
You may also enjoy other posts on the subject under the StackOverflow algorithmic-trading tag.
I coded a program in which people rate different products. For each rating, people get a bonus point, and the more bonus points people get, the more reputation they get. My issue is that people sometimes give ratings not to genuinely rate the product but just to earn bonus points. Is there a mathematical solution to identify fake raters?
Absolutely. Search for "shilling recommender systems" in Google Scholar or elsewhere. There has been a decent amount of scholarly work on identifying bad actors in recommender systems. Generally there's a focus on preventing robot actions (which doesn't seem to be your concern) as well as on finding humans who rate differently from the norm (e.g., via rating distributions and time-of-rating distributions).
https://scholar.google.com/scholar?hl=en&q=shilling+recommender+systems
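Not from the cited literature, but as a minimal sketch of the "rates differently than the norm" idea: flag raters whose average rating sits far from the crowd, or who always give the same score ( all data below is made up ):

    # made-up ratings: 50 users, 1-5 scale
    set.seed(42)
    ratings <- data.frame(user   = sample(paste0("u", 1:50), 2000, replace = TRUE),
                          rating = sample(1:5, 2000, replace = TRUE))
    mean.by.user <- tapply(ratings$rating, ratings$user, mean)
    sd.by.user   <- tapply(ratings$rating, ratings$user, sd)
    z <- (mean.by.user - mean(mean.by.user)) / sd(mean.by.user)
    # far-from-the-crowd averages, or suspiciously uniform ratings
    suspects <- names(z)[abs(z) > 2 | sd.by.user < 0.5]
    suspects   # candidates for manual review, not automatic punishment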
I'm performing survival analysis in R using the 'survival' package and coxph. My goal is to compare survival between individuals with different chronic diseases. My data are structured like this:
id, time, event, disease, age.at.dx
1, 342, 0, A, 8247
2, 2684, 1, B, 3879
3, 7634, 1, A, 3847
where 'time' is the number of days from diagnosis to event, 'event' is 1 if the subject died, 0 if censored, 'disease' is a factor with 8 levels, and 'age.at.dx' is the age in days when the subject was first diagnosed. I am new to using survival analysis. Looking at the cox.zph output for a model like this:
combi.age<-coxph(Surv(time,event)~disease+age.at.dx,data=combi)
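Concretely, the check I mean is roughly this ( 'combi' is my data frame from above ):

    zp <- cox.zph(combi.age)
    zp           # per-term tests of the proportional-hazards assumption
    plot(zp)     # Schoenfeld residuals over time, one panel per term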
Two of the disease levels violate the PH assumption, having p-values <0.05. Plotting the Schoenfeld residuals over time shows that for one disease the hazard falls steadily over time, and with the second, the line is predominantly parallel, but with a small upswing at the extreme left of the graph.
My question is how to deal with these disease levels? I'm aware from my reading that I should attempt to add a time interaction to the disease whose hazard drops steadily, but I'm unsure how to do this, given that most examples of coxph I've come across only compare two groups, whereas I am comparing 8. Also, can I safely ignore the assumption violation of the disease level with the high hazard at early time points?
I wonder whether this is an inappropriate way to structure my data, because it does not preclude a single individual appearing multiple times in the data - is this a problem?
Thanks for any help, please let me know if more information is needed to answer these questions.
I'd say you have a fairly good understanding of the data already and should present what you found. This sounds like a descriptive study rather than one where you will be presenting to the FDA with a request to honor your p-values. Since your audience will (or should) be expecting that the time-course of risk for different diseases will be heterogeneous, I'd think you can just describe these results and talk about the biological/medical reasons why the first "non-conformist" disease becomes less important with time and the other non-conforming condition might become more potent over time. You have already done a more thorough analysis than most descriptive articles in the medical literature exhibit. I rarely see a description of the nature of the non-proportionality.
The last question, regarding data that "does not preclude a single individual appearing multiple times in the data", may require some more thorough discussion. The first approach would be to account for the correlation between a patient's repeated records by adding cluster(id) to the model, which gives a robust variance estimate.
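A minimal sketch of that, assuming a column 'id' identifies the patient and using the same model as above:

    # robust (sandwich) variance that accounts for repeated subjects
    combi.rob <- coxph(Surv(time, event) ~ disease + age.at.dx + cluster(id),
                       data = combi)
    summary(combi.rob)   # robust standard errors reported alongside the usual ones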
I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them based on the assigned score.
My question is: has anyone ever tried introducing a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful for reducing bias and promoting lower-ranking items, giving them a chance to be seen more easily and to become popular if they deserve it. I know that some machine-learning algorithms introduce some randomization to reduce bias, so I thought it might apply to search as well.
The closest I can get here is this, but it is not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user's session ID.
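For illustration only — the exact request shape varies across Elasticsearch versions, and the field and search terms here are invented — the function_score / random_score query described above looks roughly like this when built from R:

    library(jsonlite)
    # hypothetical query: match on a made-up "title" field, then add a consistently
    # seeded random component so equal-score ties are shuffled the same way per session
    query <- list(
      query = list(
        function_score = list(
          query        = list(match = list(title = "running shoes")),
          random_score = list(seed = "user-session-42"),
          boost_mode   = "sum"
        )
      )
    )
    cat(toJSON(query, auto_unbox = TRUE, pretty = TRUE))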
Your intuition is right - randomization can help surface results that get a lower-than-deserved score due to uncertainty in the estimation. Empirically, Google search ads seem at times to have been randomized, and e.g. this paper hints at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
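As a toy sketch of the "optimistic estimate" idea in R ( invented numbers, not a production bandit ): rank items by observed click-through rate plus an uncertainty bonus that shrinks as an item accumulates impressions, so rarely shown items still get a chance to surface:

    # made-up per-item statistics
    items <- data.frame(item        = c("A", "B", "C", "D"),
                        clicks      = c(120,  15,   3,   0),
                        impressions = c(4000, 400,  40,   5))
    total <- sum(items$impressions)
    ctr   <- items$clicks / items$impressions
    # UCB1-style bonus: larger for items with few impressions
    bonus <- sqrt(2 * log(total) / items$impressions)
    items$score <- ctr + bonus
    items[order(-items$score), ]   # items with little data get boosted up the ranking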
I have a dataset that contains the admission rates of all providers we work with. I need to divide the data into quartiles so that each provider can see where their rate lies in comparison to other providers. The rates range from 7% to 89%. Can anyone suggest how to do this? I am not sure if this is the right place to ask, but if somebody can help me with this, I would really appreciate it.
The other concern is that if a provider's numbers are really small, e.g. 2/4 = 50%, the provider might fall into a worse quartile, but that doesn't mean the provider's performance is bad, because the numbers are so small. I hope this makes sense. Please let me know if I can clarify further.
There are ways to obtain quantiles without doing a complete sort, but unless you've got huge amounts of data there is no point in implementing those algorithms if you haven't already got them available. Presuming you have a sort() function available, all you need to do is:
Given n data points.
Sort the data points.
Take the points at positions n/4, n/2 and 3n/4 in the sorted data; these are your quartile boundaries.
As you say, if n is less than some number (that you'll have to decide for yourself) you may want to say that the quartile result is "not applicable" or some such.
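In R ( with made-up rates ), the sort-and-pick step above is what quantile() does for you, and cut() then assigns each provider to a quartile:

    rates <- c(0.07, 0.18, 0.23, 0.31, 0.42, 0.55, 0.61, 0.74, 0.80, 0.89)  # made-up admission rates
    q <- quantile(rates, probs = c(0.25, 0.50, 0.75))
    q
    quartile <- cut(rates, breaks = c(-Inf, q, Inf),
                    labels = c("Q1", "Q2", "Q3", "Q4"))
    data.frame(rate = rates, quartile)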
First concern: for small n, do not use quartiles at all. The cutoff for what counts as "small" is necessarily somewhat arbitrary.