How to predict when next event occurs based on previous events? [closed] - statistics

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: Following Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there.

I'll admit it, I'm not a statistics kind of guy. But I've run into these kinds of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation, specifically kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
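To make the difference between binning and kernel density estimation concrete, here is a minimal sketch in plain Python. The event times, the one-hour bin width, and the 0.5-hour Gaussian bandwidth are all made-up assumptions, and the estimate ignores wrap-around at midnight.

```python
import math

def bin_counts(event_hours, bin_width=1.0, day_length=24.0):
    """Simple binning: count events per fixed-width time-of-day bin."""
    n_bins = int(round(day_length / bin_width))
    counts = [0] * n_bins
    for t in event_hours:
        counts[int((t % day_length) // bin_width) % n_bins] += 1
    return counts

def gaussian_kde(event_hours, bandwidth=0.5):
    """Return a density function: the average of one Gaussian bump per event."""
    n = len(event_hours)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(
            math.exp(-0.5 * ((t - e) / bandwidth) ** 2) for e in event_hours
        )

    return density

# Hypothetical print times, in hours since midnight.
events = [9.1, 9.3, 9.4, 13.0, 16.4, 16.5, 16.6]
counts = bin_counts(events)
density = gaussian_kde(events)
print(counts)
for hour in (9.25, 12.0, 16.5):
    print(hour, round(density(hour), 3))
```

The binned counts jump from one hour to the next, while the density can be evaluated at any time and falls off smoothly around each cluster of events. The bandwidth plays the role of the bin width and needs tuning in the same way.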
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a time line of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies.) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is that there's extra legwork to set up the systems that collect the data then.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every weekday at 4:30 Suzy prints out her end of the day report. This happens at specific times every day of the week. This kind of thing is easy to detect with fixed, predetermined intervals (every day, every weekday, every weekend day, every Tuesday, every 1st of the month, etc.) -- just create a curve of the estimated probability density function that's one week long and go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
If you want to get more sophisticated, find a way to automate the detection of such intervals. (The data is likely small enough that you could just brute force this.)
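The "average the weekly curves" idea can be sketched as follows. This is only an illustration under assumptions of my own: one-hour bins instead of a smooth density curve, an exponential decay of 0.9 per week standing in for the windowing function, and every timestamp lying before `now`. The sample data is hypothetical.

```python
from datetime import datetime, timedelta

HOURS_PER_WEEK = 7 * 24

def hour_of_week(ts):
    """Map a datetime to a bin 0..167 (Monday 00:00-00:59 is bin 0)."""
    return ts.weekday() * 24 + ts.hour

def weekly_profile(timestamps, now, decay=0.9):
    """Weighted average of per-week hourly counts; recent weeks count more."""
    ages = [(now - ts).days // 7 for ts in timestamps]  # age in whole weeks
    total_weight = sum(decay ** age for age in range(max(ages) + 1))
    profile = [0.0] * HOURS_PER_WEEK
    for ts, age in zip(timestamps, ages):
        profile[hour_of_week(ts)] += decay ** age / total_weight
    return profile

# Hypothetical data: one print job every weekday at 16:30 for four weeks.
start = datetime(2024, 1, 1)  # a Monday
events = [start + timedelta(weeks=w, days=d, hours=16, minutes=30)
          for w in range(4) for d in range(5)]
profile = weekly_profile(events, now=datetime(2024, 1, 29))
busy = [b for b, rate in enumerate(profile) if rate > 0.5]
print(busy)  # hour-of-week bins with roughly one expected job per week
```

Each entry of `profile` is a weighted average number of events for that hour of the week, so the largest entries are the candidate times for the next job.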
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and subsampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization and discretize the data and use something more like a PPM scheme. It can be very much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd like to emphasize that you worry about getting something working first, and then make it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and post it up as functional, open source, useful software. I highly recommend open source since you'll want to build a community that can contribute data source providers in more environments than you have access to, the will to support, or the time to support.
Best of luck!

I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
To model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data, and that you can run through any of the Markov examples that do text prediction.
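The letter encoding above can be sketched like this. The bucket boundaries follow the A/B/C example; the catch-all bucket D, the choice to put gaps shorter than one minute into A, and the first-order model are my own assumptions for illustration.

```python
from collections import Counter, defaultdict

# Upper bound (minutes) for each letter; the last bucket catches everything longer.
BUCKETS = [(2, "A"), (5, "B"), (10, "C"), (float("inf"), "D")]

def to_letter(gap_minutes):
    """Return the letter of the first bucket whose upper bound exceeds the gap."""
    for upper, letter in BUCKETS:
        if gap_minutes < upper:
            return letter

def encode(event_minutes):
    """Turn sorted event times into a string of gap letters."""
    gaps = [b - a for a, b in zip(event_minutes, event_minutes[1:])]
    return "".join(to_letter(g) for g in gaps)

def train(text, order=1):
    """Count which letter follows each context of length `order`."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def predict(model, context):
    """Most frequent follower of `context`, or None if it was never seen."""
    followers = model.get(context)
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical print times in minutes: gaps alternate 1.5 and 3.5 minutes.
times = [0, 1.5, 5, 6.5, 10, 11.5, 15]
text = encode(times)
model = train(text)
print(text, predict(model, text[-1]))
```

The predicted letter only tells you which bucket the next gap is likely to fall in, so the resolution of the prediction is limited by how finely you cut the buckets.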

If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.

I think the predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction in, for example, weather forecasting, the stock market, and sunspots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/

Think of a Markov chain as a graph whose vertices (the states) are connected to each other by weighted edges. In a Markov chain each weight is the probability of moving from one state to the next, so walking around the graph means repeatedly picking the next vertex according to those weights. Here is an example with text generation: http://phpir.com/text-generation.

A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
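To show what "tracking a state" means in practice, here is a minimal scalar Kalman filter. It assumes a random-walk model for a single continuous quantity (say, an underlying daily print volume) observed with noise; the variances and the measurements are made-up values, not anything from the question.

```python
def kalman_1d(measurements, process_var=1.0, measurement_var=4.0,
              x0=0.0, p0=100.0):
    """Track one continuous value: x_k = x_{k-1} + noise, z_k = x_k + noise."""
    x, p = x0, p0                      # state estimate and its variance
    estimates = []
    for z in measurements:
        p = p + process_var            # predict: state unchanged, uncertainty grows
        k = p / (p + measurement_var)  # Kalman gain: how much to trust z
        x = x + k * (z - x)            # update: move toward the measurement
        p = (1.0 - k) * p              # uncertainty shrinks after the update
        estimates.append(x)
    return estimates

# Hypothetical noisy daily counts hovering around 20.
noisy = [18, 23, 19, 21, 22, 17, 20, 24, 19, 21]
smoothed = kalman_1d(noisy)
print([round(v, 1) for v in smoothed])
```

Note that the input is a regularly sampled, continuous-valued measurement. A bare list of event times has no such measurement stream, which is why the filter is a poor fit unless you first build a model that produces one.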
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Related

Traveling Salesman Alternate - How would one code it if the cities were all the same distances from each other?

First time asking question, apologies if incorrect.
What would be the best way to approach this problem (Similar to travelling salesman, but I'm not sure if it runs into the same issues).
You have a list of "tasks" at certain locations (Cities) and a group of "people" that can complete those tasks (Salesmen). This is structured over a day, where some tasks may need to be completed before a specific time and may require specific "tools" (Set number available). The difference is that the length between each location is the same in all circumstances, but they all have to return to the start. Therefore, rather than trying to minimise the distance travelled, instead you want to maximise the time each salesman spends moving and stays at the initial starting node. This also gives you pre-defined requirements.
The program doesn't need to find an optimal solution, just an acceptable one (Greater than a certain value.) Would you just bash out each case? If so, what would be the best language to use for bashing out the solutions?
Thanks
EDIT - Just to confirm, the pre-requisite where all the cities are the same distance from each other is just for simplification of the problem, not reflective of real life.

If I interrupt sklearn grid_search.fit() before completion can I access the current .best_score_, .best_params_?

If I interrupt grid_search.fit() before completion will I lose everything it's done so far?
I got a little carried away with my grid search and provided an obscenely large search space. I can see scores that I'm happy with already but my stdout doesn't display which params led to those scores..
I've searched the docs: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
And there is a discussion from a couple years ago about adding a feature for parallel search here: https://sourceforge.net/p/scikit-learn/mailman/message/31036457/
But nothing definitive. My search has been working for ~48hrs, so I don't want to lose what's been discovered, but I also don't want to continue.
Thanks!
Welcome to SO!
To my understanding there aren't any intermediate variables that get returned from the grid_search function, only the resulting grid and their scores (see grid_search.py for more information).
So if you cancel it you might lose the work that's been done so far.
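One way to avoid this in future runs is to drive the search loop yourself and record each result as it arrives. The sketch below is not part of scikit-learn: `interruptible_grid_search` and the `evaluate` callback are hypothetical names, and in practice `evaluate` would be something you write yourself, for example a wrapper that cross-validates your estimator with the given parameters.

```python
import itertools

def interruptible_grid_search(param_grid, evaluate):
    """Try every combination; on Ctrl-C return the best result found so far."""
    names = sorted(param_grid)
    best_score, best_params, tried = float("-inf"), None, []
    try:
        for values in itertools.product(*(param_grid[n] for n in names)):
            params = dict(zip(names, values))
            score = evaluate(params)          # may take a long time
            tried.append((params, score))     # recorded before moving on
            if score > best_score:
                best_score, best_params = score, params
    except KeyboardInterrupt:
        print("interrupted; keeping partial results")
    return best_score, best_params, tried

# Toy stand-in for a real model evaluation (higher is better).
def toy_score(params):
    return -abs(params["C"] - 1) - params["gamma"]

grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
best_score, best_params, tried = interruptible_grid_search(grid, toy_score)
print(best_params, len(tried))
```

Because each `(params, score)` pair is stored as soon as it is computed, interrupting the loop costs you only the combination that was in progress.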
But a bit of advice, 48 hours is a long time (obviously this depends on the rows, columns and number of hyper parameters being tuned). You might want to start with a more broad grid search first and then refine your parameter search off that.
That will benefit you in two ways:
Run time might end up being much shorter (see caveats above) meaning you don't have to wait so long and risk losing results
You might find that your model prediction score is only impacted by one or two hyper parameters, letting you keep the other searches more broad and focussing your efforts on the parameters that influence your prediction accuracy most.
Hopefully by the time I've written this response your grid search has completed!!

Test multiple algorithms in one experiment

Is there any way to test multiple algorithms rather than doing it once for each and every algorithm; then checking the result? There are a lot of times where I don’t really know which one to use, so I would like to test multiple and get the result (error rate) fairly quick in Azure Machine Learning Studio.
You could connect the scores of multiple algorithms to an 'Evaluate Model' module to evaluate the algorithms against each other.
Hope this helps.
The module you are looking for is the one called “Cross-Validate Model”. By default it splits whatever comes in from the input port (the dataset) into 10 pieces (folds). Each fold in turn is held back as the “answer” while a model is trained on the other nine, and the module returns a set of accuracy statistics measured against the held-back fold. What you would look at is the column called “Mean absolute error”, which is the average error for the trained models. You can connect whatever algorithm you want to one of the ports, and you will then get the result for that algorithm in particular when you right-click the output port that gives the score.
After that you can assess which algorithm did the best. And as a pro-tip; you could use the Filter-based-feature selection to actually see which column had a significant impact on the result.
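Outside of the Studio, the same comparison can be sketched in a few lines. This is only an illustration of k-fold cross-validation in general, not of how the Azure module is implemented: the two toy "algorithms" (a constant mean predictor and a least-squares line), the interleaved fold assignment, and the data are all assumptions.

```python
def k_fold_mae(xs, ys, fit, k=10):
    """Mean absolute error over all held-out points of a k-fold split."""
    n = len(xs)
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th point forms one fold
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)     # train on the other k-1 folds
        errors.extend(abs(predict(xs[i]) - ys[i]) for i in test_idx)
    return sum(errors) / len(errors)

def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

# Hypothetical data: a line plus a small deterministic wobble.
xs = list(range(50))
ys = [2 * x + 1 + ((x * 7) % 5 - 2) * 0.1 for x in xs]

results = {name: k_fold_mae(xs, ys, fit)
           for name, fit in [("mean", fit_mean), ("line", fit_line)]}
print(results)
```

Running every candidate through the same folds gives you one error figure per algorithm, which is the number you would compare.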
You can check section 6.2.4 of hands-on-lab at GitHub https://github.com/Azure-Readiness/hol-azure-machine-learning/blob/master/006-lab-model-evaluation.md which focuses on the evaluation of multiple algorithms etc.

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day the GA only buys one from the list and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem which demonstrates (I believe) that there is no perfect set of parameters which will produce an optimal output for two different datasets. If we take this further, does this no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
->The blue line is the GA output.
->The red line is the training data (slightly different because of the aforementioned randomness)
-> The yellow line is the stubborn test data which shows no generalization. In fact this is the most flattering graph I could produce..
The y-axis is profit, the x-axis is the trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.
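For the "stop evolution sooner" option, one common way to pick the stopping point is to score the best individual of each generation on stocks the GA never trains on and keep the generation where that score peaks. The sketch below assumes such a held-out validation set exists (the answer above does not describe one), and `best_generation`, `patience`, and the score list are hypothetical.

```python
def best_generation(validation_scores, patience=20):
    """Generation to keep: stop once validation has not improved
    for `patience` generations in a row."""
    best_i, best = 0, validation_scores[0]
    for i, score in enumerate(validation_scores):
        if score > best:
            best_i, best = i, score
        elif i - best_i >= patience:
            break                      # no recent improvement, stop looking
    return best_i

# Hypothetical validation profit per generation: rises, then over-fitting sets in.
scores = [1.0, 2.0, 3.0, 4.0, 3.5, 3.2, 3.0, 2.8, 2.5]
keep = best_generation(scores, patience=3)
print(keep)
```

In a real run you would compute the validation score inside the GA loop and save the population at the generation this function selects, rather than scanning a finished list.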

Comparing audio recordings

I have 5 recorded wav files. I want to compare the new incoming recordings with these files and determine which one it resembles most.
In the final product I need to implement it in C++ on Linux, but now I am experimenting in Matlab. I can see FFT plots very easily. But I don't know how to compare them.
How can I compute the similarity of two FFT plots?
Edit: There is only speech in the recordings. Actually, I am trying to identify the response of answering machines of a few telecom companies. It's enough to distinguish two messages "this person can not be reached at the moment" and "this number is not used anymore"
This depends a lot on your definition of "resembles most". Depending on your use case this can be a lot of things. If you just want to compare the bare spectra of the whole file you can just correlate the values returned by the two FFTs.
However, spectra tend to change a lot when the files get warped in time. To measure the difference in that case, you need to do a windowed FFT and compare the spectra for each window. This then defines the difference function you can use in a dynamic time warping algorithm.
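Dynamic time warping itself is short enough to sketch. The version below assumes each recording has already been turned into a sequence of per-window feature vectors (windowed FFT magnitudes, for instance); computing those frames is left out, and the tiny one-dimensional frames in the demo are made up.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (one per window)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=frame_distance):
    """Cost of the cheapest monotonic alignment between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or step both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical frames: `slow` is `ref` stretched in time, `other` is reversed.
ref = [[0.0], [1.0], [2.0], [3.0]]
slow = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]
other = [[3.0], [2.0], [1.0], [0.0]]
print(dtw_distance(ref, slow), dtw_distance(ref, other))
```

The stretched copy aligns at zero cost while the reversed one does not, which is the property that makes the measure tolerant of recordings played or spoken at slightly different speeds.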
If you need perceptual resemblance an FFT probably does not get you what you need. An MFCC of the recordings is most likely much closer to this problem. Again, you might need to calculate windowed MFCCs instead of MFCCs of the whole recording.
If you have musical recordings, again, you need completely different approaches. There is a blog post that describes how Shazam works, so you might be able to find it on Google. Or, if you want real musical similarity, have a look at this book.
EDIT:
The best solution for the problem specified above would be the one described here ("Shazam algorithm" as mentioned above). This is, however, a bit complicated to implement, and an easier solution might do well enough.
If you know that there are only 5 different possible incoming files, I would suggest first trying something as easy as computing the Euclidean distance between the two signals (in the time or Fourier domain). It is likely to give you good results.
Edit: So, with different possible starts, try cross-correlating the incoming recording with each stored file and see which file gives the highest peak.
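Sliding one signal over the other and scoring every offset is a cross-correlation, and taking the peak handles the unknown start. Here is a plain-Python sketch of both that and the Euclidean distance; the normalisation, the synthetic sine "recordings", and the function names are assumptions for illustration, and real audio would need a faster (FFT-based) implementation.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length, already aligned signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def xcorr_peak(signal, template):
    """Best normalised correlation of `template` at any offset inside `signal`."""
    m = len(template)
    t_norm = math.sqrt(sum(v * v for v in template))
    best = 0.0
    for lag in range(len(signal) - m + 1):
        window = signal[lag:lag + m]
        w_norm = math.sqrt(sum(v * v for v in window))
        if w_norm == 0 or t_norm == 0:
            continue                   # silent window, nothing to compare
        score = sum(x * y for x, y in zip(window, template)) / (w_norm * t_norm)
        best = max(best, score)
    return best

# Hypothetical stored messages: two tones of different frequency.
refs = {
    "a": [math.sin(2 * math.pi * 3 * i / 40) for i in range(40)],
    "b": [math.sin(2 * math.pi * 7 * i / 40) for i in range(40)],
}
# Incoming recording: message "a" with silence before and after.
incoming = [0.0] * 7 + refs["a"] + [0.0] * 5
match = max(refs, key=lambda name: xcorr_peak(incoming, refs[name]))
print(match)
```

A peak close to 1 means the stored message appears almost unchanged somewhere in the incoming recording; you can also threshold the peak to reject recordings that match none of the files.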
I suggest you compute a simple sound parameter such as the fundamental frequency. There are several methods of getting this value - I tried autocorrelation and cepstrum, and for voice signals they worked fine. With such a function working, you can do a time analysis and compare the two signals (base - the one you compare against, in - the one you would like to match) interval by interval on their frequency. Comparing several intervals on that criterion can tell you which base sample matches best.
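The autocorrelation method can be sketched as follows: the lag at which a signal best matches a shifted copy of itself is one period of the fundamental. The 50-500 Hz search range and the synthetic 200 Hz tone are assumptions, and real speech usually needs windowing and some peak-picking care that this sketch leaves out.

```python
import math

def fundamental_frequency(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate pitch as the lag with the highest autocorrelation
    inside a plausible range of periods."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Hypothetical input: 0.1 s of a pure 200 Hz tone sampled at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(800)]
pitch = fundamental_frequency(tone, rate)
print(round(pitch, 1))
```

Applied to successive short intervals of each recording, this gives a pitch track per file, and those tracks are what you would then compare.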
Of course, everything depends on what you mean by "resembles most". You can add other parameters to the comparison, such as volume, noise, clicks, pitch...

Resources