Solr: Modeling opening hours in schema.xml - search

I'm currently trying to use Solr to query for open restaurants. These restaurants have opening hours that are represented by 30-minute time slots, e.g. 10:00 - 10:30, 10:30 - 11:00, and so on. These time slots can be different for each day.
I've seen that a very similar problem was discussed in solr: multivalued dateranges (representing opening hours), but without a proper solution being provided.
Besides the opening hours, I have a lot of other data in my index for facet searches and so on, so I cannot duplicate the restaurant data for each start and end time.
What would be the best approach to model this feature?

You may want to read this. It describes an approach that uses Solr's spatial capabilities to model time spans.
Here is a short roundup of that approach:
You define a rectangular (instead of spherical) "world", where both axes have "time" as their unit. A time span (opening time) is then modeled as a single point (starttime, endtime) in that world. You can then use spatial polygon queries to query the time spans.
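For illustration, here is a minimal Python sketch of that technique. It assumes a non-geodetic RPT field (e.g. solr.SpatialRecursivePrefixTreeFieldType with geo="false" and world bounds of one week in minutes), a multiValued field called open_hours, and a core called restaurants; those names and the minutes-of-week encoding are assumptions, not part of the linked approach.

    import requests  # assumed: querying a local Solr instance over HTTP

    MINUTES_PER_WEEK = 7 * 24 * 60  # the rectangular "world" is [0, 10080] on both axes

    def minutes_of_week(day, hm):
        # day: 0 = Monday ... 6 = Sunday, hm: "HH:MM"
        h, m = map(int, hm.split(":"))
        return day * 24 * 60 + h * 60 + m

    def slot_to_point(day, start_hm, end_hm):
        # One opening slot becomes a single indexed point: x = start, y = end.
        return "POINT(%d %d)" % (minutes_of_week(day, start_hm), minutes_of_week(day, end_hm))

    def open_during_fq(day, start_hm, end_hm, field="open_hours"):
        # Match slots with start <= query start AND end >= query end, i.e. points
        # inside the rectangle x in [0, qs], y in [qe, MINUTES_PER_WEEK].
        qs = minutes_of_week(day, start_hm)
        qe = minutes_of_week(day, end_hm)
        # Solr's envelope syntax is ENVELOPE(minX, maxX, maxY, minY).
        return '%s:"Intersects(ENVELOPE(0, %d, %d, %d))"' % (field, qs, MINUTES_PER_WEEK, qe)

    # Index e.g. slot_to_point(0, "10:00", "10:30") -> "POINT(600 630)", then filter for
    # restaurants open on Monday between 10:15 and 10:30:
    params = {"q": "*:*", "fq": open_during_fq(0, "10:15", "10:30"), "wt": "json"}
    resp = requests.get("http://localhost:8983/solr/restaurants/select", params=params)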

Related

How to solve this using genetic algorithm?

I need suggestions for solving the following optimization problem
I have N jobs, J1,J2,J3,J4... JN.
I have n persons to finish the above jobs: P1, P2, P3, ..., Pn.
I have some constraints, such as only specific persons being able to do certain jobs. For example, P1 can do only J1, J4, J5; P2 can do only J2, J7, J14, J8, J9; and so on.
The persons are available only for certain periods of the day. For example, P1 is available from 9am to 11am, then from 12 noon to 3pm, and then from 3:30pm to 6pm. The other persons likewise have their own availability times.
The traveling time also needs to be considered, based on the job locations and the current locations of the persons.
So ultimately the genetic algorithm needs to provide an optimal solution in the form of a schedule.
On first sight, it looks like Project Job Scheduling, which is a form of Job Shop Scheduling. The Mista2013 competition featured that problem, and several papers describe implementations of it.
But a closer look reveals that this is more of a Vehicle Routing Problem with Time Windows (VRPTW), for which there are also many papers available.

How to predict when next event occurs based on previous events? [closed]

Closed. This question is off-topic and is not currently accepting answers.
Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into these kinds of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is that you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation. Specifically, kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval, such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
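As a hedged illustration (not from the original answer), here is a small Python sketch that contrasts simple quarter-hour binning with a Gaussian kernel density estimate over a list of event timestamps; the bin width, bandwidth, and toy data are arbitrary choices.

    import numpy as np
    from scipy.stats import gaussian_kde

    # event_times: seconds since some epoch, one entry per observed event (toy data here)
    event_times = np.sort(np.random.default_rng(0).uniform(0, 7 * 24 * 3600, 500))

    # Simple binning: count events per quarter hour.
    bin_width = 15 * 60
    bins = np.arange(0, event_times.max() + bin_width, bin_width)
    counts, _ = np.histogram(event_times, bins=bins)

    # Kernel density estimation: a smoother estimate of event density over time.
    kde = gaussian_kde(event_times, bw_method=0.05)
    grid = np.linspace(0, event_times.max(), 1000)
    density = kde(grid)  # integrates to 1; multiply by len(event_times) for an event rate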
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a timeline of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies.) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is the extra legwork of setting up the systems that collect the data in the first place.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every weekday at 4:30 Suzy prints out her end-of-day report. This happens at specific times and is easy to detect with predetermined intervals (every day, every weekday, every weekend day, every Tuesday, every 1st of the month, etc.) -- just create a curve of the estimated probability density function that's one week long, go back in time, and average the curves (possibly a weighted average via a windowing function for better predictions).
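A rough numpy sketch of that averaging, assuming timestamps in seconds, an hour-of-week resolution, and a simple exponential weighting by week age (all of these are illustrative choices):

    import numpy as np

    SECONDS_PER_WEEK = 7 * 24 * 3600

    def weekly_profile(event_times, n_bins=7 * 24, decay=0.9):
        # Fold each event onto its position within the week, build one histogram per
        # observed week, and average the histograms with more weight on recent weeks.
        event_times = np.asarray(event_times)
        weeks = (event_times // SECONDS_PER_WEEK).astype(int)
        within = event_times % SECONDS_PER_WEEK
        newest = weeks.max()
        profile = np.zeros(n_bins)
        total_weight = 0.0
        for w in np.unique(weeks):
            hist, _ = np.histogram(within[weeks == w], bins=n_bins,
                                   range=(0, SECONDS_PER_WEEK))
            weight = decay ** (newest - w)  # the "windowing function" mentioned above
            profile += weight * hist
            total_weight += weight
        return profile / total_weight  # expected events per hour-of-week slot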
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data isn't so overwhelming that you couldn't just brute force this.)
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan, who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free-form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, ... 1 hour, 2 hours, 3 hours, ...) and subsampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with the certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
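A very rough sketch of that pipeline, using k-means as the vector-quantization step and trailing event counts at a handful of time scales as the vector; every name, scale, and threshold here is illustrative rather than prescriptive.

    import numpy as np
    from sklearn.cluster import KMeans

    SCALES = [30, 60, 120, 300, 900, 3600]  # trailing windows, in seconds

    def activity_vector(event_times, now):
        # Count how many events fell inside each trailing window ending at `now`.
        event_times = np.asarray(event_times)
        return np.array([np.sum((event_times > now - s) & (event_times <= now))
                         for s in SCALES], dtype=float)

    def quantize_history(event_times, sample_points, n_clusters=8, min_size=5):
        # One activity vector per sample point, clustered into "interesting" patterns.
        # Small clusters get flagged as unreliable, echoing the certainty caveat above.
        X = np.array([activity_vector(event_times, t) for t in sample_points])
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        sizes = np.bincount(km.labels_, minlength=n_clusters)
        return km, sizes >= min_size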
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization: discretize the data and use something more like a PPM scheme. It can be much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd like to emphasize that you worry about getting something working first, and then make it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and post it up as functional, open-source, useful software. I highly recommend open source since you'll want to build a community that can contribute data source providers for more environments than you have access to, the will to support, or the time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
As for how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data, which you can run through any of the Markov examples that do text prediction.
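A minimal Python sketch of that encoding plus a first-order Markov transition table; the bucket boundaries mirror the letters above and are otherwise arbitrary.

    from collections import Counter, defaultdict

    BUCKETS = [(60, 120, 'A'), (120, 300, 'B'), (300, 600, 'C')]  # gap ranges in seconds

    def gap_to_letter(gap):
        for lo, hi, letter in BUCKETS:
            if lo <= gap < hi:
                return letter
        return 'Z'  # everything that falls outside the defined buckets

    def encode(event_times):
        # Turn the sorted event times into a string of inter-arrival letters.
        return ''.join(gap_to_letter(b - a) for a, b in zip(event_times, event_times[1:]))

    def transitions(text):
        # First-order Markov model: for each letter, count which letter follows it.
        model = defaultdict(Counter)
        for a, b in zip(text, text[1:]):
            model[a][b] += 1
        return model

    def predict_next(model, last_letter):
        counts = model.get(last_letter)
        return counts.most_common(1)[0][0] if counts else None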
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think a predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction in, for example, weather forecasting, the stock market, and sun spots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a Markov chain as a graph with vertices connected to each other by a weight or distance. Moving around this graph accumulates the sum of the weights or distances you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Find UK PostCodes closest to other UK Post Codes by matching the Post Code String

Here is a question that has kept me awake for a number of days now. The only conclusion I have come to so far is that Red Bull does not usually help coders.
I have a scenario in my application where I have a couple of jobs (1 to 50). Each job has an address, and I have the following properties for an address: Postcode, Latitude, and Longitude.
I also have a table of workers, and they too have addresses. When jobs or workers are created through screens, I use Google Maps queries to make sure the provided Postcode is valid and is in the UK, so all the addresses are verified.
I am using a scheduler control to display some workers on y-axis and a timeline on x-axis. Every job has a date and can only move vertically on the scheduler on the job’s date. The user selects a number of jobs and they are displayed in a basket close to the scheduler. The user can then drag and drop job against workers. All this is manual so it works.
My task is to automate this so that the user does not do much except verify and allot the jobs.
Every worker has a property called WillingMaximumDistanceTravel, which is an integer representing the number of miles the worker is willing to travel for a job.
Now here is the headache: I have over 1500 workers. I have a utility function that uses Newtonsoft's Json Convert to de-serialize a response stream from Google Maps. I need to feed it Postcodes A and B.
I also plan to introduce a new table to the DB to store the distances found as Postcode A, Postcode B, and Distance. Then, if I find myself comparing the same postcodes again, I will just retrieve the result from the DB instead, and eventually I would no longer need to bother Google at all, as this table would become very comprehensive.
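A small language-agnostic sketch of that caching idea (in Python here), assuming a SQLite table and leaving the actual Google Maps call as a placeholder passed in by the caller; none of this is the poster's real code.

    import sqlite3

    conn = sqlite3.connect("distances.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS postcode_distance (
                        postcode_a TEXT, postcode_b TEXT, miles REAL,
                        PRIMARY KEY (postcode_a, postcode_b))""")

    def cached_distance(a, b, fetch):
        # fetch(a, b) stands in for the Google Maps / Json.NET lookup described above.
        a, b = sorted((a.strip().upper(), b.strip().upper()))  # (A, B) and (B, A) share a row
        row = conn.execute(
            "SELECT miles FROM postcode_distance WHERE postcode_a = ? AND postcode_b = ?",
            (a, b)).fetchone()
        if row:
            return row[0]
        miles = fetch(a, b)
        conn.execute("INSERT INTO postcode_distance VALUES (?, ?, ?)", (a, b, miles))
        conn.commit()
        return miles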
I cannot use the simple Haversine formula, as a crow-fly path is not what I need here. The pain is that it takes a lot of time to calculate. Some workers can travel over 10 miles, while others vary from 15 to 80. I have to take the first job from the list and run it against every applicable worker on the system! I was wondering whether the UK postcode has a pattern to it. If we sort a list of UK postcodes, can we roughly estimate, from the alphanumeric pattern, where we will hit a 100-mile mark, a 200-mile mark, and so on?
If anyone is interested in the code, please drop a line and I will paste it.
(I work for Google, but I'm not speaking on behalf of Google. I have nothing to do with the maps API.)
I suspect this isn't a great situation for using the Google Maps API, simply because you're pushing so much data through. You really don't want to make that many requests, even if you could do so under the directions limits.
When I tackled something similar in a previous job, we bought into a locally-hosted maps API - but even that wasn't fast enough for this sort of work. We ended up precomputing the time to travel from the centroid of each postcode "area" (probably the wrong name for it, but the first part of the postcode followed by the first digit of the remainder, e.g. "SW1W 9" for "SW1W 9TQ") to every other area, storing the result in a giant table. I think we only did it for postcodes which were within 100 miles or something similar, to cut down on the amount of preprocessing.
Even then, a simple DB wasn't quite as fast as we wanted - so we stored the results in a giant file, with a single byte per source/destination pair. (We had a fixed sequence of source postcodes and target postcodes, so we didn't need to specify those.) At that point, computing a travel time consisted of:
Work out postcode areas (substring work)
Find the index of each postcode area within the sequence
Check if we'd loaded that part of the file (we lazy loaded for startup speed)
Load the row if necessary, and just access it otherwise
The bytes were on a sliding scale of accuracy, so for the first 60 minutes it was on a per-minute basis, then each extra value meant an extra 2 minutes, then 5 etc. (Those aren't the exact values, but it was something like that.)
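Purely for illustration (as noted above, the real break-points were different), here is a sketch of such a sliding-scale byte encoding and the flat-file lookup:

    def minutes_to_byte(minutes):
        # 1-minute steps up to 60 minutes, then 2-minute steps, then 5-minute steps.
        if minutes <= 60:
            return minutes
        if minutes <= 300:
            return 60 + (minutes - 60) // 2
        return min(255, 180 + (minutes - 300) // 5)

    def byte_to_minutes(b):
        if b <= 60:
            return b
        if b <= 180:
            return 60 + (b - 60) * 2
        return 300 + (b - 180) * 5

    def travel_time(data, src_index, dst_index, n_areas):
        # `data` is the giant file read as bytes (or an mmap): one byte per
        # source/destination pair, laid out in the fixed source/target sequence.
        return byte_to_minutes(data[src_index * n_areas + dst_index])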
When you've worked out "good candidates" you can ask an on-site API or the Google Maps API for more accurate directions for your exact postcodes, of course.
You want to look for a spatial index or a space-filling curve. A spatial index reduces the 2D problem to a 1D problem by recursively subdividing the surface into smaller tiles; it is basically a reordering of the tiles. You can subdivide the surface either with a numeric index or with a string over 4 characters. The latter can be useful to you because it lets you query the string with all the string operations hidden in the database engine. You want to look for Nick's spatial index quadtree hilbert-curve blog.
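As a hedged illustration of the 4-character string idea (my own sketch, not from the linked blog), here is a tiny quadtree key encoder over latitude/longitude; points in the same tile share a key prefix, so a rough proximity filter becomes a string prefix query that the database engine can handle.

    def quadtree_key(lat, lon, depth=10, bounds=(-90.0, 90.0, -180.0, 180.0)):
        # Each level splits the current tile into 4 quadrants, labelled A/B/C/D.
        min_lat, max_lat, min_lon, max_lon = bounds
        key = []
        for _ in range(depth):
            mid_lat = (min_lat + max_lat) / 2
            mid_lon = (min_lon + max_lon) / 2
            if lat >= mid_lat and lon >= mid_lon:
                key.append('A'); min_lat, min_lon = mid_lat, mid_lon
            elif lat >= mid_lat:
                key.append('B'); min_lat, max_lon = mid_lat, mid_lon
            elif lon >= mid_lon:
                key.append('C'); max_lat, min_lon = mid_lat, mid_lon
            else:
                key.append('D'); max_lat, max_lon = mid_lat, mid_lon
        return ''.join(key)

    # Nearby locations share a long common prefix, so a coarse filter could look like
    # SELECT * FROM workers WHERE quad_key LIKE 'DBCA%'  (hypothetical column name)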

Feeding a UITableViewController calculated values

Very quick question. I need to feed a UITableViewController daily averages calculated from multiple objects with an NSNumber attribute (each object is timestamped, with 8-10 objects per day usually). Is there any straightforward way to calculate on the fly using my own version of lazy loading (I would average data in chronological order till I had a screenful of daily averages) or should I just take the easy way out and maintain an Averages entity pre-populated with averages for all possible days, which I present conventionally in my UITableViewController?
Thanks.
It depends on how long the calculation takes. Why not just calculate it in the UITableViewDataSource method that asks for cellForRowAtIndexPath? Unless it takes time to calculate - thereby causing a delay in populating the values onscreen - I'd say there's no harm in calculating on the fly.
There's plenty of other places to do the calculation. You could do it on app startup, or on loading the view. You could even start an NSOperation that does it in the background with the controller as a delegate that tells it when to reloadData.
You should do a fetch by value (see Core Data Programming Guide) and then use the #avg collections operator on the attribute name.

How do search engines conduct 'AND' operation?

Consider the following search results:
Google for 'David' - 591 millions hits in 0.28 sec
Google for 'John' - 785 millions hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 millions hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations, like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and you still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
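A hedged sketch of that postings-list walk in Python (a plain sorted-list intersection; real engines layer on compression, skip pointers, and the early-termination trick just described):

    def intersect_postings(a, b):
        # a, b: postings lists (sorted document ids) for the two query terms.
        i = j = 0
        result = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result

    # e.g. intersect_postings(index['david'], index['john']) -> documents containing both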
I think you're approaching the problem from the wrong angle.
Google doesn't have its tables/indices on a single machine. Instead they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on metafilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, sum(score) total_score, count(score) matches FROM rev_index
WHERE word IN ('david', 'john') GROUP BY document_id HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set, and a simple count determines whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
The search for "david" resulted in setting the relevant bit in one of the bitmaps to signify that the record had the word "david" in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. A quick scan of the resulting bitmap gives you back the list of records that match both terms.
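A small Python sketch of that bitmap trick, using an arbitrary-precision integer as each bitmap (the record ids below are made up; the 128K-bit sizing comes from the answer):

    N_BITS = 128 * 1024  # one bit per record

    def build_bitmap(matching_record_ids):
        bm = 0
        for rec in matching_record_ids:
            bm |= 1 << rec  # set the bit for each record containing the term
        return bm

    def records_in(bitmap):
        # Scan the combined bitmap and return the record numbers whose bits are set.
        return [i for i in range(N_BITS) if (bitmap >> i) & 1]

    david = build_bitmap([3, 17, 42, 1000])
    john = build_bitmap([17, 99, 1000])
    both = david & john      # binary 'and' of the two bitmaps
    print(records_in(both))  # -> [17, 1000]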
This technique wouldn't work for google though, so consider this my $0.02 worth.
