Find UK postcodes closest to other UK postcodes by matching the postcode string - c#-4.0

Here is a question that has kept me awake for a number of days now. The only conclusion I have reached so far is that Red Bull does not usually help coders.
I have a scenario in my application where I have a number of jobs (1 to 50). Each job has an address, and an address has the following properties: Postcode, Latitude, and Longitude.
I also have a table of workers, and they too have addresses. When jobs or workers are created through the screens, I use Google Maps queries to make sure the provided postcode is valid and is in the UK, so all the addresses are verified.
I am using a scheduler control to display workers on the y-axis and a timeline on the x-axis. Every job has a date and can only move vertically on the scheduler, on the job's date. The user selects a number of jobs and they are displayed in a basket next to the scheduler. The user can then drag and drop jobs onto workers. All this is manual, so it works.
My task is to automate this so that the user does not have to do much beyond verifying and allotting the jobs.
Every worker has a property called WillingMaximumDistanceTravel, which is an integer representing the number of miles the worker is willing to travel for a job.
Now here is the headache: I have over 1,500 workers. I have a utility function that uses Newtonsoft's JsonConvert to deserialize a stream of response from Google Maps. I need to feed it Postcode A and Postcode B.
I also plan to introduce a new table to the DB to store the distances found, as Postcode A, Postcode B, and Distance. That way, if I find myself comparing the same postcodes again, I can just retrieve the result from the DB instead, and slowly but surely I would no longer need to bother Google at all, as this table would become very comprehensive.
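Roughly, the caching idea I have in mind looks like the sketch below. IDistanceCache is a stand-in for the new table, and the exact Google endpoint and response shape are assumptions that would need checking against the Distance Matrix documentation.

// Sketch only: a cache-first lookup for the driving distance between two postcodes.
using System;
using System.Net;
using Newtonsoft.Json.Linq;

public interface IDistanceCache
{
    bool TryGet(string postcodeA, string postcodeB, out double miles);
    void Store(string postcodeA, string postcodeB, double miles);
}

public class PostcodeDistanceService
{
    private readonly IDistanceCache _cache;
    private readonly string _apiKey;

    public PostcodeDistanceService(IDistanceCache cache, string apiKey)
    {
        _cache = cache;
        _apiKey = apiKey;
    }

    public double GetDrivingMiles(string postcodeA, string postcodeB)
    {
        double cached;
        if (_cache.TryGet(postcodeA, postcodeB, out cached))
            return cached;                               // cache hit: no need to bother Google

        string url = string.Format(
            "https://maps.googleapis.com/maps/api/distancematrix/json" +
            "?origins={0}&destinations={1}&key={2}",
            Uri.EscapeDataString(postcodeA), Uri.EscapeDataString(postcodeB), _apiKey);

        string json;
        using (var client = new WebClient())
        {
            json = client.DownloadString(url);
        }

        // Assumed JSON path (metres in rows[0].elements[0].distance.value) - verify against the docs.
        var response = JObject.Parse(json);
        double metres = (double)response["rows"][0]["elements"][0]["distance"]["value"];
        double miles = metres / 1609.344;

        _cache.Store(postcodeA, postcodeB, miles);       // remember the pair for next time
        return miles;
    }
}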
I cannot use the simple Haversine formula, as a crow-flies distance is not what I need here. The pain is that this takes a lot of time to calculate: some workers can travel over 10 miles, while others range from 15 to 80, and I have to take the first job from the list and run it against every applicable worker in the system! I was wondering: UK postcodes have a pattern to them. If we sort a list of UK postcodes, can we roughly estimate, from the alphanumeric pattern, where we will hit a 100-mile mark, a 200-mile mark, and so on?
If anyone is interested in the code, please drop a line and I will paste it.

(I work for Google, but I'm not speaking on behalf of Google. I have nothing to do with the maps API.)
I suspect this isn't a great situation for using the Google Maps API, simply because you're pushing so much data through. You really don't want to make that many requests, even if you could do so under the directions limits.
When I tackled something similar in a previous job, we bought into a locally-hosted maps API - but even that wasn't fast enough for this sort of work. We ended up precomputing the time to travel from the centroid of each postcode "area" (probably the wrong name for it, but the first part of the postcode followed by the first digit of the remainder, e.g. "SW1W 9" for "SW1W 9TQ") to every other area, storing the result in a giant table. I think we only did it for postcodes which were within 100 miles or something similar, to cut down on the amount of preprocessing.
Even then, a simple DB wasn't quite as fast as we wanted - so we stored the results in a giant file, with a single byte per source/destination pair. (We had a fixed sequence of source postcodes and target postcodes, so we didn't need to specify those.) At that point, computing a travel time consisted of:
Work out postcode areas (substring work)
Find the index of each postcode area within the sequence
Check if we'd loaded that part of the file (we lazy loaded for startup speed)
Load the row if necessary, and just access it otherwise
The bytes were on a sliding scale of accuracy, so for the first 60 minutes it was on a per-minute basis, then each extra value meant an extra 2 minutes, then 5 etc. (Those aren't the exact values, but it was something like that.)
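To make that concrete, here is a rough sketch of the area parsing and byte decoding; the thresholds and the file layout are placeholders, not the exact scheme we used.

// Illustrative sketch of the lookup described above.
using System;

public static class TravelTimeTable
{
    // "SW1W 9TQ" -> "SW1W 9": outward code plus the first digit of the inward code.
    // Assumes the postcode contains the usual single space.
    public static string PostcodeArea(string postcode)
    {
        string[] parts = postcode.Trim().ToUpperInvariant().Split(' ');
        return parts[0] + " " + parts[1].Substring(0, 1);
    }

    // Decode one byte of the table back into minutes, using placeholder thresholds:
    // 0..59 per-minute, then 2-minute steps, then 5-minute steps.
    public static int DecodeMinutes(byte b)
    {
        if (b < 60) return b;
        if (b < 120) return 60 + (b - 60) * 2;
        return 180 + (b - 120) * 5;
    }

    // One byte per (source, destination) pair, stored row by row in a fixed area order.
    public static long ByteOffset(int sourceIndex, int destinationIndex, int areaCount)
    {
        return (long)sourceIndex * areaCount + destinationIndex;
    }
}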
When you've worked out "good candidates" you can ask an on-site API or the Google Maps API for more accurate directions for your exact postcodes, of course.

You want to look at a spatial index or a space-filling curve. A spatial index reduces the 2D problem to a 1D problem by recursively subdividing the surface into smaller tiles; it is basically a reordering of the tiles. You can subdivide the surface either with an index or with a string over a 4-character alphabet. The latter can be useful to you because it lets you query the string with all the string operations hidden in the database engine. You want to look for Nick's spatial index quadtree hilbert-curve blog.
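Not taken from that blog, but as an illustration of the 4-character-alphabet idea, here is a Bing-style quadkey encoder. Each extra character splits the current tile into four, so rows sharing a longer prefix are spatially closer and proximity-ish queries become LIKE 'prefix%' over an indexed string column (neighbouring tiles still need their own prefixes, since prefix adjacency does not capture all spatial adjacency).

// Sketch: encode a lat/lon into a base-4 quadkey string at a given zoom level.
using System;
using System.Text;

public static class QuadKey
{
    public static string FromLatLon(double latitude, double longitude, int levelOfDetail)
    {
        // Clamp to the Web Mercator range and project to normalized tile coordinates.
        latitude = Math.Min(Math.Max(latitude, -85.05112878), 85.05112878);
        double x = (longitude + 180.0) / 360.0;
        double sinLat = Math.Sin(latitude * Math.PI / 180.0);
        double y = 0.5 - Math.Log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI);

        int mapSize = 1 << levelOfDetail;
        int tileX = (int)Math.Min(mapSize - 1, Math.Max(0, Math.Floor(x * mapSize)));
        int tileY = (int)Math.Min(mapSize - 1, Math.Max(0, Math.Floor(y * mapSize)));

        // Interleave the tile bits into base-4 digits '0'..'3'.
        var key = new StringBuilder();
        for (int i = levelOfDetail; i > 0; i--)
        {
            int mask = 1 << (i - 1);
            int digit = ((tileX & mask) != 0 ? 1 : 0) + ((tileY & mask) != 0 ? 2 : 0);
            key.Append((char)('0' + digit));
        }
        return key.ToString();
    }
}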

Related

Optimization of resources in Excel

I'm struggling with a resource-optimization task, and I wonder if someone knows an efficient way to find the optimal solution for it using only Excel. Let me explain what I'm trying to achieve:
Suppose you are managing an assembly factory for metal tubes. Your raw material is standard-size tubes, and in your factory you need to cut these tubes according to a list of requests from clients with very specific sizes. All tubes are of the same type, so we can reuse the leftover from each cut if the length of that leftover is sufficient to satisfy another tube request.
We can also group small-length requests to be cut from one single tube; for example, from the attached list, we could use one 8-metre tube to deliver the last four entries (1.615 + 1.62 + 1.625 + 1.67), with 1.47 m of leftover wasted.
Assuming a long list of requests, and that the tubes supplied are 8 metres each, do you know of any way of calculating how many tubes I have to order to satisfy the list of requests, minimising the losses from each cut?
Example of request list, each entry is in metres
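For reference, what is being described is the classic one-dimensional cutting-stock problem. A first-fit-decreasing sketch (in C#, not Excel, and a heuristic rather than a guaranteed optimum) gives the flavour of the calculation:

// Sketch only: greedy first-fit-decreasing packing of cut requests onto stock tubes.
using System;
using System.Collections.Generic;
using System.Linq;

public static class TubeCutting
{
    public static List<List<double>> FirstFitDecreasing(IEnumerable<double> requests, double stockLength)
    {
        var tubes = new List<List<double>>();      // one inner list of cuts per stock tube
        var remaining = new List<double>();        // leftover length on each tube

        foreach (double request in requests.OrderByDescending(r => r))
        {
            int i = remaining.FindIndex(left => left >= request);   // first tube it still fits on
            if (i < 0)
            {
                tubes.Add(new List<double>());     // start a new stock tube
                remaining.Add(stockLength);
                i = tubes.Count - 1;
            }
            tubes[i].Add(request);
            remaining[i] -= request;
        }
        return tubes;
    }
}

Running it on the four entries above (1.615, 1.62, 1.625, 1.67) against an 8-metre tube packs all four cuts onto a single tube with 1.47 m of waste, matching the example.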

Performing a location proximity search on a database using S2 Geometry Library

I am working on a project that requires fast performing proximity queries on a database with location data.
In my database I want to store locations with additional information. The idea is that the user opens a map at a certain location and my program only fetches the markers visible to the user. If I plan on having millions of values, fetching markers from NYC when I'm zoomed in on London would make the map activity extremely slow, and the data I send back from the DB would be HUGE.
That's why, when the user opens the map, I want to fetch all the markers that are, for example, within 10 km of the center of the map. (I'm okay with fetching markers outside of the visible area; I just don't want to fetch markers that are 100 km away.)
After a thorough research I chose the S2 Geometry Library approach with Hilbert's space filling curve.
The idea of mapping a 2D value to one integer value, where the longer a shared prefix between two indexes is, the spatially closer they are together, was a big selling point.
I need my database to be able to perform this SELECT query lightning fast and I expect to have A LOT of data in the future so operating on only one column is a big plus.
Also, the thing that intrigued me the most was the ability to perform fast proximity searches, because two points that are close to each other on the map will have 1D indexes that are also close to each other.
The idea looks very simple (if I'm not missing anything).
The thing I'm having problems with is how (if it's even possible) to pick the min and max values on the 1D line so that I'm sure I'm scanning the whole visible area.
Most of the answers and tutorials I find on the internet propose a solution where you take a bounding area full of smaller S2 index "boxes" and then scan every index in the database to see if it's contained in one of the "boxes" from the array. This is easy to do, but when you have 50 million records it's not feasible to go through every single one of them to see if it's in one of the "boxes".
What I have in mind is a solution where you take the minimum value and the maximum value of the area you're searching in and perform something along the lines of SELECT (...) WHERE s2cellid BETWEEN min AND max.
For example, I'm at location 47194c and want to fetch all markers within 10 km, so I take a value that's x to the left of the index and a value that's x to the right of the index and perform a BETWEEN 47194c-x AND 47194c+x query.
Is something like that possible with the S2 library?
If no then what approach should I take to make my queries as quick as possible?
Thanks in advance :)
[I plan on using PostgreSQL]
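For what it's worth, a single global BETWEEN around your own cell won't quite work: the Hilbert curve keeps nearby cells mostly close, but neighbouring cells on opposite sides of a curve boundary can have ids that are far apart, and one big range can also sweep in far-away cells. The usual recipe is to cover the search circle with a handful of cells and run one indexed BETWEEN per cell, along the lines of this sketch (the type and method names mentioned in the comments are modelled on the C++/Java S2 API and may differ in a given C# port):

// Sketch only. In the C++/Java S2 API the ranges come from covering a spherical cap of
// the wanted radius (S2RegionCoverer over an S2Cap) and taking RangeMin()/RangeMax()
// of each covering cell.
using System.Collections.Generic;
using System.Text;

public struct CellIdRange
{
    public ulong Min;   // smallest leaf cell id contained in the covering cell
    public ulong Max;   // largest leaf cell id contained in the covering cell
}

public static class S2ProximityQuery
{
    // One OR'd BETWEEN clause per covering cell; with an index on s2cellid each clause
    // is a cheap range scan, so no full table scan is needed. Ids are plain unsigned
    // integers, so formatting them into the SQL string is safe here.
    public static string BuildMarkersQuery(IEnumerable<CellIdRange> covering)
    {
        var sql = new StringBuilder("SELECT * FROM markers WHERE ");
        bool first = true;
        foreach (CellIdRange range in covering)
        {
            if (!first) sql.Append(" OR ");
            sql.AppendFormat("(s2cellid BETWEEN {0} AND {1})", range.Min, range.Max);
            first = false;
        }
        return sql.ToString();
    }
}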

Sort results in Azure Maps Search

I'm using Azure Maps Search and I'm trying to retrieve all POIs (points of interest) in a location, but I can't find anywhere in the documentation how to sort my results, for example by distance.
Has anyone had the same problem?
https://atlas.microsoft.com/search/poi/json?subscription-key=key&api-version=1.0&query=restaurant&lat=45&lon=9
I don't think the current Search POI API provides sorting as part of the API itself, so you'll have to do that in memory afterwards. The results are sorted by "score" (relevancy) by default.
There is no way to order results with the POI search, which I guess is what you're looking for here. As per the best practices, you could use a nearby search:
https://atlas.microsoft.com/search/address/json?subscription-key={subscription-key}&api-version=1&query=400%20Broad%20Street%2C%20Seattle%2C%20WA&countrySet=US
If you would like straight-line distances, you can loop through the results and calculate the distances using the haversine formula. If you're using the Azure Maps Web SDK, you can use the atlas.math.getDistanceTo function instead. Once you have calculated a distance to each point, you can sort accordingly.
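A rough sketch of that straight-line approach (Poi is just a stand-in for whatever DTO you deserialize the search results into):

// Sketch: haversine distance from the map centre to each POI, then an in-memory sort.
using System;
using System.Collections.Generic;
using System.Linq;

public class Poi
{
    public string Name { get; set; }
    public double Lat { get; set; }
    public double Lon { get; set; }
}

public static class PoiSorter
{
    private const double EarthRadiusKm = 6371.0;

    public static double HaversineKm(double lat1, double lon1, double lat2, double lon2)
    {
        double dLat = ToRadians(lat2 - lat1);
        double dLon = ToRadians(lon2 - lon1);
        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(ToRadians(lat1)) * Math.Cos(ToRadians(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
        return EarthRadiusKm * 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
    }

    public static List<Poi> SortByDistance(IEnumerable<Poi> results, double centerLat, double centerLon)
    {
        return results
            .OrderBy(p => HaversineKm(centerLat, centerLon, p.Lat, p.Lon))
            .ToList();
    }

    private static double ToRadians(double degrees)
    {
        return degrees * Math.PI / 180.0;
    }
}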
If you want to get the driving distance to each point, there are two approaches you can take:
Use the Route Matrix API. This is fairly easy to use, is less error prone than the second option below, and the response is easy to work with. The only negative with this approach is that you will need the S1 pricing tier to access this service, and each cell generates a transaction, which can get expensive fast.
Use the Routing Directions API with a large number of waypoints that go from your origin to each destination and back (A->B->A->C...). This is a bit more work to understand the results, and if any leg of the route is unroutable for any reason, the whole route calculation fails. However, it is significantly cheaper than option one, as you can use the S0 pricing tier, which has free limits, and this would only generate one transaction in most cases (if you have a large number of locations you might need to break them up and spread them across a few calls). Because this calculates the route from the origin to each destination and back, twice as many calculations are made as you need, which could make this slower than approach 1. When parsing the response, you would look at the odd-numbered route legs, as those go from the origin to each destination (see the sketch below). In some scenarios it might be desirable to know the travel time from the destinations to the origin (i.e. how long it would take all employees to get to work), in which case the even-numbered legs are what you would want to use.
Again, once you have the distance, or better yet, travel time, you can then sort the results accordingly.
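Here is a sketch of the leg bookkeeping for option 2 (RouteLeg is a stand-in for the leg type your routing response deserializes to; the indexing is the only point being made):

// Sketch: pull the origin->destination travel times out of a route shaped A->B->A->C->A->...
using System.Collections.Generic;

public class RouteLeg
{
    public int TravelTimeInSeconds { get; set; }
}

public static class WaypointRouteParser
{
    // Counting legs from 1 as in the text above (leg 1 = origin -> first destination,
    // leg 2 = back to origin, ...), the odd-numbered legs are the outbound ones, which
    // sit at zero-based indexes 0, 2, 4, ...
    public static List<int> OutboundTravelTimesSeconds(IList<RouteLeg> legs)
    {
        var outbound = new List<int>();
        for (int i = 0; i < legs.Count; i += 2)
            outbound.Add(legs[i].TravelTimeInSeconds);
        return outbound;
    }
}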

Adding weight to variables in a line graph in Tableau

I have a dataset consisting of calls going to agents (actually 10 of them) per day. These agents can either answer calls or transfer them to a call center. What we are interested in is whether each of these agents answers more calls than he transfers. To answer this, I have created a variable for each of these agents:
Answered/Transferred
I am using line graph to depict these variables per agent over time.
Now, if this variable is less than 1, then the agent transferred more calls than he answered. The problem is that this is not a safe way to measure the overall impact of transferred calls, because the traffic pertaining to agents 1, 2, 3 is far greater than that pertaining to agents 5, 6, 7, and so on. Therefore, I am trying to come up with a way to "weight" the variables I created before, that is, to somehow include the total number of calls reaching each agent (irrespective of whether they get transferred or answered) in my calculations. That means that if one agent is getting 5 calls per day while another is getting 5,000 per day, I should find a way to reflect this in my graphs.
Do you guys have any ideas?
The easiest would be to drag the weight measure to Colors and choose something like a temperature-diverging palette. Depending on your viz you can also drag the weight measure to Size and, for example, make bars or lines thicker to show there are more records there.

How do search engines conduct 'AND' operation?

Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This ticks me off ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task, and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations, like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
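A sketch of that walk over two sorted postings lists, with plain document ids standing in for real posting entries:

// Sketch: intersect two postings lists sorted by document id in a single linear pass.
using System.Collections.Generic;

public static class PostingsIntersection
{
    public static List<int> Intersect(IList<int> davidDocs, IList<int> johnDocs)
    {
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < davidDocs.Count && j < johnDocs.Count)
        {
            if (davidDocs[i] == johnDocs[j])
            {
                result.Add(davidDocs[i]);      // document contains both terms
                i++; j++;
            }
            else if (davidDocs[i] < johnDocs[j])
            {
                i++;                           // advance the list that is behind
            }
            else
            {
                j++;
            }
        }
        return result;
    }
}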
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
I think you're approaching the problem from the wrong angle.
Google doesn't have tables/indices on a single machine. Instead, they partition their dataset heavily across their servers. Reports indicate that as many as 1,000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how Google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, sum(score) total_score, count(score) matches
FROM rev_index
WHERE word IN ('david', 'john')
GROUP BY document_id
HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance and will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set and it's using a simple count to determine whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that Google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so there was a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
Searching for 'david' resulted in me setting the relevant bit in one of the bitmaps to signify that the record had the word 'david' in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
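A sketch of the same trick using .NET's BitArray, just to show the shape of it (the original ran on a 16-bit machine with hand-rolled bitmaps):

// Sketch: one bit per record, one bitmap per term, bitwise AND to find records with both.
using System.Collections;
using System.Collections.Generic;

public static class BitmapSearch
{
    public static List<int> RecordsMatchingBoth(BitArray davidBits, BitArray johnBits)
    {
        // BitArray.And modifies the instance in place, so AND into a copy.
        var both = new BitArray(davidBits).And(johnBits);
        var matches = new List<int>();
        for (int record = 0; record < both.Length; record++)
            if (both[record]) matches.Add(record);
        return matches;
    }
}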
This technique wouldn't work for Google though, so consider this my $0.02 worth.
