How do you ID GeoHash points that are NOT over land? - geospatial

I have a list of coordinates in latitude/longitude that I have converted to GeoHash. My goal is to ID the points that are reported to be over water (oceans, seas, etc. outside of any countries borders). I also have a data set of all the shapes of all the worlds countries borders in latitude/longitude converted to GeoHash too.
So for a given GeoHash point I am trying to be able to classify it as being over (international) water or not. I thought about picking points manually in the middle of the ocean and using a short GeoHash prefix to create a large box in the ocean but that is fairly limited.
Perhaps generally there is a way to understand what it means to be a GeoHash point outside of any countries borders?

It is not a good use of geohash. Geohash is good at identifying specific points, but not great at describing complex shapes like country borders or ocean.
I thought about picking points manually in the middle of the ocean and
using a short GeoHash prefix to create a large box in the ocean but
that is fairly limited.
Yes, that will give very imprecise result. What you need is to test each point, whether it belongs to any country's polygon. How you do this depends on the platform you use, e.g. in SQL you run an ST_Intersects(point, country) query.
I would just convert geohash back to lat/lon pair and check them.
If you do want to use geohash or if you have too many (billions) of points, you can use the short GeoHash prefix trick - but you would need to use many prefixes to represent each ocean. Something like the following, using prefix tree:
start with GeoHash length of couple letters,
for every possible GeoHash string, compute whether its box is fully contained by the ocean or land (using ST_Intersects or similar precise method).
if whole box belongs to one class - add it to prefix tree.
if not - add more letters (again, all possible combinations) and continue recursively up to some limit, where you need to stop.
Once you've built such tree - you can use GeoHash to lookup your answer quickly in this tree.

Related

Performing a location proximity search on a database using S2 Geometry Library

I am working on a project that requires fast performing proximity queries on a database with location data.
In my database I want to store locations with additional information. Idea is that user opens a map on a certain location and my program only fetches the markers visible to the user. If I plan on having millions of values, fetching markers from NYC when I'm zoomed in on London would make the map activity work extremely slow and the data I send back from the db would be HUGE.
That's why when the user opens the map I want to fetch all the markers that are for example in 10km distance from the center of the map. (I'm okay with fetching markers outside of the visible area. I just don't want to fetch markers that are 100km away)
After a thorough research I chose the S2 Geometry Library approach with Hilbert's space filling curve.
The idea of mapping a 2D value to one integer value, where the longer a shared prefix between two indexes is, the spatially closer they are together, was a big selling point.
I need my database to be able to perform this SELECT query lightning fast and I expect to have A LOT of data in the future so operating on only one column is a big plus.
Also the thing that intrigued me the most was the ability to perform fast proximity searches because of the fact that two numbers that are close to each other on the map will have 1D indexes also close to each other.
The idea looks very simple (If I don't miss anything).
The thing I'm having problems with is how to (If it's even possible) pick the min value and max value on the 1D plane to be sure I'm scanning the whole visible area.
Most of the answers and tutorials I find on the internet propose a solution where you take a bounding area full of smaller S2 index "boxes" and then scan every index in the database to see if it's contained in one of the "boxes" from the array. This is easy to do but when you have 50 milion records it's not possible to go through every single one of them to see if it's in on of the "boxes".
What I have in mind is a solution where you take the minimum value of the area and the maximum value of the area you're searching in and you perform something in the lines of SELECT (...) WHERE s2cellid BETWEEN min AND max
For example I'm in a location 47194c and want to fetch all markers in 10km distance so I take a value that's x to the left of the indeks and a value that's x to the right of the index and perform a BETWEEN 47194c-x AND 47194c+x query
Is something like that possible with the S2 library?
If no then what approach should I take to make my queries as quick as possible?
Thanks in advance :)
[I plan on using PostgreSQL]

Finding all geohashes within two bounding coordinates

I have coordinates, which are assigned a corresponding geohash in my database. Now I want to retrieve all of the coordinates within two bounding coordinates (top right and top left corner). How can I do that properly?
I tried getting the geohash that fits both of those bounding coordinates, but this solution does not work when they are in completely different regions of the world (so they are not sharing anything in common).
Is there a better way to do that?
Thanks for your help
Unfortunately, this isn't something you can do out-of-the-box with datastore / App engine. (There are no built in spatial queries.)
For early prototyping, etc., you can do it the hard way - retrieve all the rows, and discard the ones not meeting your query in code. Obviously, probably not viable with real production data.
See related question Query for Entities Nearby with Geopt for some possible production solutions.

How to get only the nearby subset of a way's nodes

I'm using the Overpass API to query Open Street Maps for nearby road segments. I am pretty sure that my query is returning all of the nodes of the nearby way... but I only want nearby nodes of the nearby way.
In the documentation it references this problem:
In general, you will be rather interested in complete data than just
elements of a single type. First, there are several valid definitions
of what "complete map data" means. The first unclear topic is what to
do with nodes outside the bounding box which are members of ways that
lie partly inside the bounding box.
The same question repeats for relations. If you wait for a turn
restriction, you may prefer to get all elements of the relation
included. If your bounding box hits for example the border of Russia,
you likely don't want to download ten thousands kilometers of boundary
around half the world.
But I looked at the subsequent examples and didn't see the solution.
Basically, in their example, how would I restrict the elements returned to those strictly in the bounding box (rather than returning the whole boundary of Russia)?
My current query is
way (around:100,50.746,7.154) [highway~"^(secondary|tertiary)$"];
>;
out ids geom;
I'm thinking maybe I need to change it to node (around:...) and then recurse upwards to the way to query for the highway tag but I'm not sure if I am even on the right track.
Actually, it's even a bit more complicated, as you need the set intersection of all nodes in a 100m distance and those nodes belonging to one of the relevant ways. Here's how your query should look like: Adjust distance, tags for ways as needed.
Note that depending on the tagging, there's no guarantee that you will find a node in a certain distance, especially if roads tend to be rather straight and longish. This for sure will impact your results, so a bit experimenting with a suitable radius is probably needed.
// Find nodes up to 100m around center point
// (center is overpass turbo specific for center point lat/lon in current map view)
node(around:100,{{center}})->.aroundnodes;
// recurse up to ways with highway = secondary/tertiary
way(bn.aroundnodes)[highway~"^(secondary|tertiary)$"]->.allways;
// determine nodes belonging to found ways
node(w.allways)->.waynodes;
(
// determine intersection of all ways' nodes and nodes around center point
node.waynodes.aroundnodes;
// and return ways (intersection is just a workaround for a bug)
way.allways.allways;
);
out;
check it out in overpass turbo: http://overpass-turbo.eu/s/hPV

What's the best way to determine if a point is within a certain distance of a GEOJSON polygon?

What's the best way to determine if a point is within a certain distance of a GEOJSON polygon? Should one use TurfJS buffer method (https://github.com/Turfjs/turf-buffer#turf-buffer)? Can one perform queries on the buffered polygon?
It's clear to me one could us the TurfJS' inside method (https://github.com/Turfjs/turf-inside) to determine whether a point is within a polygon. I'm just curious what the best approach would be for finding whether or not a point is inside of a buffered polygon.
For example:
I have a number of neighborhoods provided as a GEOJSON polygon files. I also have a set of locations/addresses for employees (already geocoded to lat/long coordinates). What would be the best way to see whether or not my employees live within 10 miles of a given neighborhood polygon?
Thanks!
Yes, you can use buffer in conjunction with inside to find points within 10 miles of something else, eg, expanding on the existing examples,
var pt = point(14.616599, -90.548630)
var unit = 'miles'
var buffered = buffer(pt, 10, unit)
var ptTest = point(-1, 52)
var bIn = inside (ptIn, buffer)
which should obviously be false.
In general, though, buffering is somewhat expensive, so you would not necessarily want to do this every time you run the query. There are a couple of things you can do to speed things up:
1). Pre-buffer your search areas
2). Use some kind of R-tree type index, which will first check bounding box intersection, and avoid lots of unecessary point in polygon operations. turfjs, which I hadn't heard of until seeing your post, uses jsts under the hood for a number of operations, including buffering. This library has an implemention of R-tree indexes that you could potentially use. Here is a fun example of this being done.
In general, in situations where you have a spatial (R-tree type) index in place, such as a spatially enabled database like Postgis on top of Postgres, you would use an operator like ST_Dwithin (geom1, geom2, distance) in a where clause to find all points within some distance of another geometry, and this would be very efficient as many candidates would be rejected for failing an initial bounding box test.
Really, it depends on the size of your data and frequency of queries. There is nothing, in principle, wrong with doing contains queries on a buffer. I hope I haven't created more questions than answers.
I'm using GeoScript to do that sort of calculations in JavaScript. It has a distance method in the geom.Geometry class which can return the minimum distance between two geometries. You could use that, or take a look at the source on GitHub to see how they do it if you want to roll your own solution.

How do I sort search results by relevance?

I'm working on a project which searches through a database, then sorts the search results by relevance, according to a string the user inputs. I think my current search is fairly decent, but the comparator I wrote to sort the results by relevance is giving me funny results. I don’t know what to consider relevant. I know this is a big branch of information retrieval, but I have no idea where to start finding examples of searches which sort objects by relevance and would appreciate any feedback.
To give a little more background about my specific issue, the user will input a string in a website database, which stores objects (items in the store) with various fields, such as a minor and major classification (for example, an XBox 360 game might be stored with major=video_games and minor=xbox360 fields along with its specific name). The four main fields that I think should be considered in the search are the specific name, major, minor, and genre of the type of object, if that helps.
In case you don't wanna use lucene/Solr, you can always use distance metrics to find the similarity between query and the rows retrieved from database. Once you get the score you can sort them and they will be considered as sorted by relevance.
This is what exactly happens behind the scene of lucene. You can use simple similarity metrics like manhattan distance, distance of points in n-dimensional space etc. Look for lucene scoring formula for more insight.

Resources