For one of our clients we are providing a system for retrieving the closest N landmarks from the user's zipcode location.
We have a database of all available zipcodes (650,000+) with the corresponding coordinates (latitude and longitude), as well as all 400+ landmarks in the country.
For now we are using the following process for finding the closest N landmarks:
Retrieve the lat and lng of the selected zipcode
Get the coordinates of all the landmarks
Order them by using a geographic distance formula
Take the closest N+2 landmarks and get the real distance to them using the following process:
check if the distance between the coordinates is stored in the distance cache table
if not, go to a map engine, retrieve the distance and store it in the cache
Reorder the list and return the first N closest landmarks
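Roughly, the pipeline looks like this (a simplified Python sketch for illustration; the function names and the cache are placeholders, not our real code):

```python
import heapq
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    # the "geographic distance formula" used for the pre-sort
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 6371.0 * 2 * asin(sqrt(a))

def closest_landmarks(zip_coord, landmarks, n, cache, map_engine_distance):
    # pre-sort every landmark by straight-line distance and keep the closest N+2
    lat, lng = zip_coord
    candidates = heapq.nsmallest(
        n + 2, landmarks,
        key=lambda lm: haversine_km(lat, lng, lm["lat"], lm["lng"]))
    # resolve the real distance via the cache table or the map engine
    for lm in candidates:
        key = (zip_coord, (lm["lat"], lm["lng"]))
        if key not in cache:
            cache[key] = map_engine_distance(zip_coord, (lm["lat"], lm["lng"]))  # ~30 s call
        lm["real_distance"] = cache[key]
    # reorder by the real distance and return the first N
    return sorted(candidates, key=lambda lm: lm["real_distance"])[:n]
```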
The problem is that we need to optimize this, both from a database-access point of view and in terms of 3rd-party access.
We have tried caching, for every zipcode, the distance to the closest M landmarks, but the table would grow by an additional 6 GB and it would take around 250 days to fill, since a request takes approx. 30 seconds.
We were thinking of partitioning the data and grouping close postcodes together, but that would lose the exact distances.
What optimising solutions do you see in this situation?
Thank you.
You could try an iterative approach:
Pick a value to use as your "radius"
Go through all results and pick only the ones within +- radius horizontally and vertically (according to geolocation)
If not enough rows are returned, increase the "radius" and start again
Now perform the distance calculation, using a PriorityQueue to minimise the number of calculations needed to sort and select the required items
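A minimal sketch of this in Python, assuming the landmarks are held in memory as dicts with lat/lng (heapq.nsmallest plays the role of the priority queue):

```python
import heapq

def candidates_in_box(lat, lng, landmarks, radius):
    # cheap pre-filter: keep only landmarks within +- radius on both axes
    return [lm for lm in landmarks
            if abs(lm["lat"] - lat) <= radius and abs(lm["lng"] - lng) <= radius]

def closest_n(lat, lng, landmarks, n, distance, radius=0.5):
    box = candidates_in_box(lat, lng, landmarks, radius)
    while len(box) < n and radius < 180:       # not enough rows: widen the box and retry
        radius *= 2
        box = candidates_in_box(lat, lng, landmarks, radius)
    # priority-queue selection: only O(len(box) * log n) work for the final pick
    return heapq.nsmallest(n, box, key=lambda lm: distance(lat, lng, lm["lat"], lm["lng"]))
```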
This should be done at the database level. You should use a database with a geographic extension, such as SQL Server 2008 R2, or the excellent open-source choice PostgreSQL with the PostGIS extension. With those you store geography objects instead of raw coordinates, and there are many built-in functions for geographic calculations that will take care of steps 2 to 5 for you.
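As a rough illustration of what that buys you, here is an assumed query against a landmarks(name, geog geography(Point, 4326)) table with a GiST index on geog, run from Python with psycopg2; the table and column names are placeholders:

```python
import psycopg2

def nearest_landmarks(conn, lat, lng, n):
    # ST_MakePoint takes (lng, lat); <-> is the index-assisted nearest-neighbour
    # operator (available on geography in PostGIS 2.2+); ST_Distance returns meters.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT name,
                   ST_Distance(geog, ST_MakePoint(%s, %s)::geography) AS meters
            FROM landmarks
            ORDER BY geog <-> ST_MakePoint(%s, %s)::geography
            LIMIT %s;
            """,
            (lng, lat, lng, lat, n))
        return cur.fetchall()

# usage sketch
conn = psycopg2.connect("dbname=geo user=app")
print(nearest_landmarks(conn, 45.76, 21.23, 5))
```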
I suggest you start here:
http://postgis.refractions.net/
Regards
Related
Given a dataset consisting of geographic coordinates and the corresponding timestamp for each record, I want to know whether there is a suitable measure that can determine the closeness between two points by taking both the spatial and the temporal distance into consideration.
The approaches I've tried so far include implementing a distance measure between the two coordinate values and calculating the time difference separately. But in this case I'd require two threshold values, one each for the spatial and temporal distances, to determine the overall proximity.
I wanted to know whether there is any single function that can take these values as input together and give a single measure of their proximity. Ultimately, I want to be able to use this measure to cluster similar records together.
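For reference, the separate spatial/temporal check I described looks roughly like this (the threshold values are arbitrary examples):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    # great-circle distance between two (lat, lng) pairs, in km
    dlat, dlng = radians(q[0] - p[0]), radians(q[1] - p[1])
    a = sin(dlat / 2) ** 2 + cos(radians(p[0])) * cos(radians(q[0])) * sin(dlng / 2) ** 2
    return 6371.0 * 2 * asin(sqrt(a))

def are_close(p, t_p, q, t_q, max_km=1.0, max_seconds=600):
    # two separate thresholds: the pair is "close" only if both are satisfied
    return haversine_km(p, q) <= max_km and abs(t_p - t_q) <= max_seconds
```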
I am currently working on a financial data problem. I want to detect trades for which anomalous theta values are being generated by the models (due to several factors).
My data mainly consists of trades with their profile variables (dealId, portfolio, etc.) along with the theta values and theta components for different dates (going back 3 years).
The data I am currently using looks like this:
TradeId    Date1    Date2    (and so on)
id1        1234     1238
id2        1289     1234
Currently I am tracking the daily theta movement for all trades and reporting trades whose theta has moved by more than 20k (absolute value).
I want to build an ML model which tracks theta movement and detects, for the current date, which particular deal id(s) have an anomalous theta.
So far, I have tried clustering trades based on their theta-movement correlation, using DBSCAN with a distance matrix. I have also tried an Isolation Forest, but it does not generalize very well on the dataset.
All the examples I have seen so far for anomaly detection are more like finding a rotten apple in a bunch of apples. Is there any algorithm that would be best suited to my case, or that could be modified to fit my problem?
Your problem seems to be too simple for the machine learning world.
You can manually define a threshold above which the data is considered anomalous and identify those records.
To do that, you can easily analyze your data with pandas to find the mean, max, min, etc., and then define a threshold.
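A minimal pandas sketch, assuming the data is shaped like the table in the question (one row per trade id, one column per date); the 20k cut-off comes from the question:

```python
import pandas as pd

# toy frame shaped like the question's table
df = pd.DataFrame({"Date1": [1234, 1289], "Date2": [1238, 1234]},
                  index=["id1", "id2"])

movement = df.diff(axis=1)            # day-over-day theta change per trade
threshold = 20_000                    # the 20k absolute cut-off
anomalous = (movement.abs() > threshold).any(axis=1)
print(df.index[anomalous].tolist())   # [] for this tiny sample: no move exceeds 20k
```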
How can I find the city for a current location (lat, lng) without reverse geocoding? I tried saving Indian city data as GeoJSON, but it is of type MultiPolygon, and I didn't find a way to convert it to a single polygon, which would be easier to search. If I test the lat/lng against every polygon it will take a lot of time. I want to do this efficiently and quickly.
If you do not want to use an API, you will need to maintain a list of cities and their coordinates yourself.
For example, here is a good starting point for such a list:
https://www.latlong.net/category/cities-102-15.html
With the list, take the current lat/lng and calculate the distance to each city. Euclidean distance works well enough for most practical applications. Sort the results in ascending order and take the first one: that is your nearest city.
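A small sketch of that lookup, assuming the city list has already been loaded into a dict (the sample coordinates are approximate and only for illustration):

```python
cities = {
    "Delhi":   (28.61, 77.21),
    "Mumbai":  (19.08, 72.88),
    "Chennai": (13.08, 80.27),
}

def nearest_city(lat, lng):
    # plain Euclidean distance on lat/lng; good enough for picking the nearest city
    return min(cities, key=lambda c: (cities[c][0] - lat) ** 2 + (cities[c][1] - lng) ** 2)

print(nearest_city(28.70, 77.10))  # -> Delhi
```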
I am looking for an algorithm that can perform an efficient search in a grid.
I have a large array which contains all the centroid points (x, y, z).
Now, for a given location (xp, yp, zp), I want to find the closest centroid to that location p.
Currently I am doing a brute-force search: for each point p I go through all centroids, calculate the distance to location p, and in this way find the closest centroid.
I know that an octree search or a kd-tree might help, but I am not sure how to tackle it or which one would be better.
I would use a spatial index, such as a kd-tree or quadtree/octree (which you suggested), or maybe an R-tree-based solution.
Put all your centroids into the index. Usually you can associate any point in the index with some additional data, so if you need that, you could store a back-reference into the grid (for example, the grid coordinates).
Finding the nearest point in the index should be very fast. The returned data then allows you to go back into the grid.
In a way, a quadtree/octree is in itself nothing but a discretizing grid that gets finer as the point density increases. The difference from a plain grid is that it is hierarchical and that empty areas are not stored at all.
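A minimal kd-tree sketch using SciPy (assuming the centroids fit in an (N, 3) array; the same idea applies to an octree or R-tree library):

```python
import numpy as np
from scipy.spatial import cKDTree

centroids = np.random.rand(100_000, 3)   # all centroid points (x, y, z)
tree = cKDTree(centroids)                # build once, O(N log N)

p = np.array([0.5, 0.5, 0.5])            # the query location (xp, yp, zp)
dist, idx = tree.query(p)                # nearest neighbour, roughly O(log N) per query
print(idx, dist)                         # idx is the back-reference into your centroid array
```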
Does anyone have any handy algorithms that could be used to reduce the number of geo-points?
I am using a list of 2,000,000 postcodes, each of which comes with its own geo-point. I am using them to collect data from an API to be used offline. The program is written in C++.
I have to go through each postcode, calculate a bounding box based on the postcode's location, and then send it to the API, which gives me some data near that postcode.
However, 2,000,000 is a lot to process, and some of the postcodes are next to each other, or close enough to each other that they would share some of the same data.
So far I've come up with two ways I could reduce them, but I am not sure whether they would work:
1 - Use a data structure to record which postcodes overlap, and then run a routine a few times to remove the overlapping ones one by one, until only non-overlapping postcodes are left.
2 - Start at the top-left geo-point of the UK and slowly increment it by the rough size of a postcode area until the entire UK is covered.
Is there an easy way to reduce the number of postcodes so that as few of them as possible overlap, whilst still making sure I get data covering as much of the UK as possible? I was thinking there may be a handy algorithm for this that people use elsewhere.
You can use a quadtree, or more specifically a quadkey. A quadkey plots the points along a space-filling curve, which is similar to sorting the points into a grid. You can then traverse the grid to search deeper in the tree, and you can also search around a center point. Alternatively, you can use a database with a spatial index. How well this works depends on how much the data overlaps, but with a quadtree you can choose the size of the grid.
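A sketch of the grid idea, in Python for brevity (the same logic ports directly to C++); the cell size is a tunable assumption, and postcodes falling into the same cell are collapsed into one representative point:

```python
def reduce_points(points, cell_deg=0.05):
    # points: iterable of (lat, lng); returns one point per occupied grid cell
    kept = {}
    for lat, lng in points:
        cell = (int(lat // cell_deg), int(lng // cell_deg))  # quadkey-like bucket
        kept.setdefault(cell, (lat, lng))                    # keep the first point seen
    return list(kept.values())
```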