efficiently predicting the likelihood of a user clicking a hyperlink [duplicate] - statistics

So I have a bunch of hyperlinks on a web page. From past observation I know the probabilities that a user will click on each of these hyperlinks. I can therefore calculate the mean and standard deviation of these probabilities.
I now add a new hyperlink to this page. After a short amount of testing I find that of the 20 users that see this hyperlink, 5 click on it.
Taking into account the known mean and standard deviation of the click-through probabilities on other hyperlinks (this forms a "prior expectation"), how can I efficiently estimate the probability of a user clicking on the new hyperlink?
A naive solution would be to ignore the other probabilities, in which case my estimate is just 5/20 or 0.25 - however this means we are throwing away relevant information, namely our prior expectation of what the click-through probability is.
So I'm looking for a function that looks something like this:
double estimate(double priorMean, double priorStandardDeviation, int clicks, int views);
Since I'm more familiar with code than with mathematical notation, I'd ask that any answers use code or pseudocode in preference to math.

I hate to give a non-answer here, but doesn't the likelihood that a user clicks the link depend on the type of link it is? Or are these ads of some sort? Even then, it would be content-dependent, wouldn't it? I'm tempted to say this is more of a psychology question than a statistics one... ;]
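For the estimation itself, one standard approach is to treat the click-through rate as Beta-distributed, moment-match the Beta's parameters to the prior mean and standard deviation observed on the other links, and then do the conjugate update with the new link's clicks and views. A minimal Python sketch of that idea, mirroring the requested signature (the fallback for an unusably diffuse prior is my own choice):

    def estimate(prior_mean, prior_sd, clicks, views):
        """Posterior-mean estimate of a click-through rate under a Beta prior.

        The Beta prior is moment-matched to the mean/SD observed on the
        existing links, then updated with the new link's clicks and views.
        """
        var = prior_sd ** 2
        # Moment matching: for Beta(a, b), mean = a/(a+b) and
        # var = mean*(1-mean)/(a+b+1), so a+b = mean*(1-mean)/var - 1.
        strength = prior_mean * (1 - prior_mean) / var - 1
        if strength <= 0:
            # Prior is too diffuse to moment-match; fall back to the raw rate.
            return clicks / views
        a = prior_mean * strength
        b = (1 - prior_mean) * strength
        # Conjugate update: clicks are "successes", non-clicks are "failures".
        return (a + clicks) / (a + b + views)

    # Example numbers (mine, not from the question): other links average a 5%
    # click-through rate with SD 2%, and the new link gets 5 clicks in 20 views.
    print(estimate(0.05, 0.02, 5, 20))   # lands between 0.05 and 0.25

The estimate is pulled from the raw 5/20 toward the prior mean in proportion to how concentrated the prior is; with more views, the observed data dominates.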

Related

Framework for minimizing time complexity of generalized search

I have training in pure math but not in statistics, computer science, or information theory, so I am a bit lost here and would really appreciate any guidance.
I am looking for some helpful ways to frame a general search approach which would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20 questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly, and I want to develop a strategy that minimizes my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have a yes-or-no answer or even a descriptive answer. So, for instance, it is allowed to ask for the first name of the person.
Intuitively, it seems to me that immediately (as a first question) asking something like "Is it Barack Obama?" is a terrible question, because it splits the sample space (or search space) into two sets, one with 1 person, namely the former US President, and the second containing the rest of the US population.
Asking what their sex is (or, old school, their gender) may be a better question, as it splits the answers into sets of roughly equal size.
Instead of asking a binary question, asking an n-ary question is likely better, because it splits the sample space into n sub-spaces of varying sizes, and if the sizes are similar, so much the better. For instance, the question could be: what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have a last name beginning with "J" than with "X".
Of course, I can conceivably ask a 329.5 million-ary question whereby I'll have the answer in one-shot.
My questions for you guys are as follows:
If we fix "n", so asking only binary or ternary or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions which would divide the sample space into "n" roughly equal parts, if I am minimizing time complexity. How can I prove this? What is the right approach or mathematical fraemwork to prove this? Assuming that I am only minimizing time complexity or the average number of questions I need to ask to get to the solution.
If we don't fix "n" then what would be a general way to frame this mathematically? Now I have two variables over which I am operating, "n" and "the relative size of subsets the answer to a n-ary question splits the sample space", to minimize the time complexity. How can I frame this problem mathematically?
Is my intuition even correct? Or are there faster ways to approach this?
What I am describing sounds an awful lot like a classification decision tree in machine learning. Is minimizing entropy the right way to frame my question?
Who would know or think about this type of stuff? Information theorists? Computer scientists? Statisticians? Probability theorists? Machine learning folks? Someone else?
What's the right forum on the internet to get help on this question? Reddit? Some specific stackexchange? Anything else?
Thx
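As a toy illustration of the entropy framing (a uniform population of N candidates, made up rather than the real 329.5 million), here is a short Python sketch comparing the one-at-a-time strategy against repeated halving and the entropy lower bound:

    import math

    # Toy demo: N equally likely candidates. Compare the expected number of
    # yes/no questions needed by two strategies against the entropy lower bound.
    N = 1024

    # Strategy 1: guess candidates one at a time ("Is it Barack Obama?").
    # Roughly N/2 questions on average (ignoring the small saving on the last guess).
    expected_one_at_a_time = sum(range(1, N + 1)) / N

    # Strategy 2: always split the remaining candidates in half (binary search).
    worst_case_halving = math.ceil(math.log2(N))

    # Lower bound: entropy of a uniform distribution over N outcomes, in bits.
    entropy_bits = math.log2(N)

    print(f"one at a time : {expected_one_at_a_time:.1f} questions on average")
    print(f"halving       : {worst_case_halving} questions, worst case")
    print(f"entropy bound : {entropy_bits:.1f} bits")

    # For n-ary questions the bound becomes log(N)/log(n): each answer carries at
    # most log2(n) bits, and that maximum is reached only when the n answer
    # buckets are (close to) equally likely -- which is the intuition behind
    # preferring questions that split the space into roughly equal parts.

This is exactly the source-coding view: a questioning strategy is a prefix code over the candidates, and the average number of questions cannot beat the entropy of the distribution over candidates, which is why decision-tree learners greedily pick the split with the largest information gain.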

Optimiser for excel spreadsheet

I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size steel members for lifting beams. The setback is that I need to do some trial and error in selecting the member until I get one that is as close to the allowable limits as possible.
What I'm hoping to do is develop a function where, based on a length and a weight that I enter, the program runs a loop and automatically selects the best member size(s) from a list of the members and their physical properties. Is this possible?
Yeah, depending on the complexity, a simple search through the parameters (less than, greater than, etc.) might bring you the answer. You can do it quite easily with the pandas library: just load the Excel file as a pandas DataFrame (pandas.read_excel()), which will then let you perform the searches on that DataFrame object.
If you want to run an optimization algorithm, look into SciPy's optimize module to get what you're looking for based on the input data (it handles both unconstrained and constrained problems).
Of course, the question as stated is quite general, so I can only point in a direction. More info would help.
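For the table-search route, here is a rough sketch of what that could look like; the file name, the column names (capacity_kN, mass_per_m) and the single capacity check are placeholders for your actual sheet and sizing rules:

    import pandas as pd

    # Load the member table straight from the spreadsheet.
    # File and column names below are placeholders for whatever your sheet uses.
    members = pd.read_excel("lifting_beam_members.xlsx")

    def best_member(required_capacity_kN, table=members):
        """Return the lightest member whose capacity meets the requirement."""
        adequate = table[table["capacity_kN"] >= required_capacity_kN]
        if adequate.empty:
            raise ValueError("No member in the table is adequate")
        # Among adequate members, pick the lightest; you could instead pick the
        # one whose capacity is closest to the requirement (least over-sized).
        return adequate.sort_values("mass_per_m").iloc[0]

    # Example use (your existing spreadsheet formulas would turn the entered
    # length and weight into required_capacity_kN):
    # print(best_member(required_capacity_kN=150.0))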

How to predict when next event occurs based on previous events? [closed]

Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely let me do so accurately, but I can't find a link explaining how to build a Hidden Markov Model from just a list of times. I also found that running a Kalman filter on the list may be useful, but basically I'd like to get more information from someone who has actually used these techniques and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into these kinds of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is that you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation. Specifically, kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The resulting estimate is often 'smoother'.)
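A rough illustration of both ideas in Python, with made-up event times in hours: hourly binning next to a Gaussian KDE over the same data (scipy's gaussian_kde is one readily available implementation):

    import numpy as np
    from scipy.stats import gaussian_kde

    # Made-up event times, in hours since the start of observation.
    event_times = np.array([0.2, 0.9, 1.1, 1.3, 6.5, 7.0, 7.2, 23.8, 24.1, 30.0])

    # Simple binning: count events per hour.
    bins = np.arange(0, 31, 1.0)
    counts, _ = np.histogram(event_times, bins=bins)

    # Kernel density estimate over the same axis: a smoother picture of
    # "how likely is an event around time t".
    kde = gaussian_kde(event_times)
    grid = np.linspace(0, 31, 311)
    density = kde(grid)

    print("events in each hour:", counts)
    print("density peaks near t =", grid[np.argmax(density)])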
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a timeline of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies.) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is that there's extra legwork to set up the systems that collect the data in the first place.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every weekday at 4:30 Suzy prints out her end-of-day report. This happens at specific times every day of the week, and that kind of thing is easy to detect with fixed, predetermined intervals (every day, every weekday, every weekend day, every Tuesday, every 1st of the month, etc.) -- just create a curve of the estimated probability density function that's one week long, then go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data isn't so overwhelming that you couldn't just brute-force this.)
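A rough sketch of that fixed-interval idea: fold all the print timestamps onto a single week, bin them, and optionally down-weight older weeks. The bin width and the decay factor here are arbitrary placeholders:

    import numpy as np

    SECONDS_PER_WEEK = 7 * 24 * 3600

    def weekly_profile(event_timestamps, bin_seconds=1800, decay=0.9):
        """Histogram of events folded onto one week, weighting recent weeks more.

        event_timestamps: 1-D array of UNIX timestamps (or any seconds scale).
        decay: per-week weight applied to older weeks (1.0 = plain average).
        """
        t = np.asarray(event_timestamps, dtype=float)
        weeks_ago = (t.max() - t) // SECONDS_PER_WEEK
        weights = decay ** weeks_ago             # older events count less
        position_in_week = t % SECONDS_PER_WEEK  # fold onto a single week
        bins = np.arange(0, SECONDS_PER_WEEK + bin_seconds, bin_seconds)
        profile, _ = np.histogram(position_in_week, bins=bins, weights=weights)
        return profile / profile.sum()           # probability mass per bin

    # weekly_profile(list_of_print_timestamps) then gives, for "now % week",
    # the estimated probability of the next job falling in each half-hour slot.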
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan, who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free-form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and resampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
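A very rough sketch of the time-independent idea, using k-means from scikit-learn as a simple stand-in for the vector quantization step; the window lengths, prediction horizon, and cluster count are arbitrary, the certainty weighting discussed above is left out, and it needs a reasonable amount of history to be meaningful:

    import numpy as np
    from sklearn.cluster import KMeans

    # Trailing windows (seconds) used to describe "recent activity" at a moment.
    WINDOWS = [30, 60, 120, 300, 600, 1800, 3600]

    def activity_vector(event_times, now):
        """Counts of events inside each trailing window ending at `now`."""
        t = np.asarray(event_times, dtype=float)
        return np.array([np.sum((t > now - w) & (t <= now)) for w in WINDOWS])

    def build_model(event_times, horizon=600, n_clusters=8, step=60):
        """Cluster historical activity vectors; record, per cluster, how often
        an event followed within `horizon` seconds."""
        t = np.asarray(sorted(event_times), dtype=float)
        moments = np.arange(t.min(), t.max() - horizon, step)
        X = np.array([activity_vector(t, m) for m in moments])
        followed = np.array([np.any((t > m) & (t <= m + horizon)) for m in moments])
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        rates = np.array([followed[km.labels_ == c].mean()
                          if np.any(km.labels_ == c) else 0.0
                          for c in range(n_clusters)])
        return km, rates

    def predict(km, rates, event_times, now):
        """Estimated probability of an event within the horizon, given recent activity."""
        c = km.predict(activity_vector(event_times, now).reshape(1, -1))[0]
        return rates[c]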
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization: discretize the data and use something more like a PPM scheme. It can be much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If there's a deadline, I'd emphasize getting something working first and then making it work well. Something suboptimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and release it as functional, open-source, useful software. I highly recommend open source, since you'll want to build a community that can contribute data-source providers for more environments than you have access to, the will to support, or the time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
As for how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data that you can run through any of the Markov examples that do text prediction.
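A minimal sketch of that letter-bucket idea with a first-order Markov transition table; the bucket boundaries are the ones from the list above, and the example gaps (in minutes) are made up:

    from collections import Counter, defaultdict

    # Bucket boundaries in minutes, as in the A/B/C scheme above.
    BUCKETS = [(1, 2, "A"), (2, 5, "B"), (5, 10, "C"), (10, float("inf"), "D")]

    def to_letter(gap_minutes):
        for lo, hi, letter in BUCKETS:
            if lo <= gap_minutes < hi:
                return letter
        return "A"  # gaps under a minute lumped into the first bucket

    def train(gaps):
        """Build first-order transition counts from a list of inter-job gaps."""
        letters = [to_letter(g) for g in gaps]
        transitions = defaultdict(Counter)
        for prev, nxt in zip(letters, letters[1:]):
            transitions[prev][nxt] += 1
        return transitions

    def predict_next(transitions, last_gap_minutes):
        """Most likely bucket for the next gap, given the last observed gap."""
        counts = transitions[to_letter(last_gap_minutes)]
        return counts.most_common(1)[0][0] if counts else None

    # Example with made-up gaps between print jobs (minutes):
    model = train([1.5, 3.0, 3.5, 8.0, 1.2, 3.1, 9.0, 1.4])
    print(predict_next(model, last_gap_minutes=1.3))  # prints "B"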
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think the predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction in, for example, weather forecasting, the stock market, and sunspot activity.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a Markov chain as a graph whose vertices are connected to each other by edges carrying a weight or distance. Moving around this graph accumulates the sum of the weights or distances you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Crowdsourcing reliability measurements - spam/fraud detection

I'd like to collect some kind of geographical information from website users - for a given set of data they will mark a checkbox indicating whether a place has or does not have a given property. Are there any tools/frameworks for detecting fraud or spam submissions based on the whole collected data set (and possibly other info)? I'd like to get filtered, more reliable data.
Not sure if that's exactly what you're asking for, but here are some tips from my experience using Amazon Turk:
There are several academic papers dealing with such problems; here is a good one.
In addition, based on the following general recommendations, I've created a custom procedure which worked on my data:
a. Include an open question, and filter out cases where it wasn't answered. It's harder to answer such a question automatically, and it might also be more time-consuming, thus less attractive, for a fraudster.
b. If possible, don't use a binary scale (i.e. a checkbox), but some grade (e.g. 1-4 or 1-6). This would give you more data to work with.
c. If available, filter out cases where the time spent in filling your form was too short. (especially useful if you include that open question)
d. If you have a multiplicity of inputs per user, check for repetitive answers, and for users who consistently give far-from-average answers.
If each user submits only a single "form", consider putting more than a single element/question in it, so you'll get multiple submissions per-user.
e. If you have only a single submission per user or user-id, your options are more limited. I can suggest filtering out outliers (e.g. data points farther than 3 standard deviations from the average), provided you have enough data.
f. After all the filtering, check the agreement or disagreement in your data (e.g. by checking what proportion of your data points fall within x standard deviations from the average). In case of agreement, use the average; in case of disagreement, collect some more data.
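A minimal sketch of (e) and (f), assuming one numeric answer per user for a given item; the 3-standard-deviation cut-off and the agreement threshold are arbitrary placeholders:

    import numpy as np

    def filter_and_aggregate(answers, sd_cutoff=3.0, agreement_fraction=0.8):
        """Drop outliers beyond sd_cutoff standard deviations, then check agreement.

        Returns (aggregated_value, needs_more_data).
        """
        x = np.asarray(answers, dtype=float)
        mean, sd = x.mean(), x.std()
        kept = x if sd == 0 else x[np.abs(x - mean) <= sd_cutoff * sd]
        # Agreement check: what fraction of the kept answers sit within
        # one standard deviation of their own mean?
        m, s = kept.mean(), kept.std()
        within = 1.0 if s == 0 else np.mean(np.abs(kept - m) <= s)
        if within >= agreement_fraction:
            return m, False   # answers agree; use the average
        return m, True        # answers disagree; collect more data

    # Example: one 1-6 grade per user for the same place.
    print(filter_and_aggregate([4, 5, 4, 4, 6, 1]))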
Hope it helps,

natural language question creation

I am trying to build questions based on information available for about 10 variables - e.g. shape (square, circle, rectangle, parallelogram), length, width, circumference, area, diagonal length, etc.
E.g. if I want to set a question to calculate area based on shape, length and width, the question gets created as: calculate the area of a 'rectangle' given length='10' and width='5'. If I provide the area and ask for the width, the question automatically forms as: calculate the width of a 'rectangle' given length='10' and area='50'.
I am not too ambitious and am willing to build this under constraints - any pointers on how I can achieve this? My initial thought is to have a question and answer fragment for each variable, but initial attempts produce very messy grammar.
I have been advised on other forums to look at 'natural language generators' and to focus on data-to-text as the feature to look for. I have seen a few products and am evaluating whether they are over-engineered for my needs.
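If a full NLG product turns out to be overkill, a plain template per unknown variable may already get you most of the way and avoids the messy-grammar problem. A tiny Python sketch (the template wording and the dict layout are my own placeholders):

    # One template per "unknown" variable; the known values are interpolated.
    TEMPLATES = {
        "area":   "Calculate the area of a {shape} given length = {length} and width = {width}.",
        "width":  "Calculate the width of a {shape} given length = {length} and area = {area}.",
        "length": "Calculate the length of a {shape} given width = {width} and area = {area}.",
    }

    def make_question(ask_for, **known):
        """Render the question for the requested unknown from the known values."""
        template = TEMPLATES.get(ask_for)
        if template is None:
            raise ValueError(f"No template for '{ask_for}'")
        return template.format(**known)

    print(make_question("area", shape="rectangle", length=10, width=5))
    print(make_question("width", shape="rectangle", length=10, area=50))

Adding a new question type is then just adding a dictionary entry, and the grammar stays under your control because each template is written by hand.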
