I am looking to use Cassandra for a nearby-search type of query: based on my lon/lat coordinates I want to retrieve the closest points. I do not need 100% accuracy, so I am comfortable using a bounding box instead of a circle (better performance too), but I can't find concrete instructions (hopefully with an example) on how to implement a bounding box.
From my experience, there's no easy way to have a generic geospatial index search on top of Cassandra. I believe you only have two options:
Geohashing: split your dataset into square/rectangular elements. For example, use the integer parts of lat/lon as indexes into a grid. When searching, you load all elements in the enclosing grid element and perform a full neighbour scan inside your application (a minimal sketch of this appears after the two options).
This works well if you have an evenly distributed dataset, like the grid points in the NWP simulation I worked with.
It works really badly on datasets like "restaurants in the USA", where most of the points cluster around large cities. You'll get an unbalanced, high load on grid elements like the New York area and absolutely empty index buckets somewhere in the Atlantic Ocean.
External indexes like ElasticSearch/Solr/Sphinx/etc.
All of them have geospatial indexing support out of the box, so there is no need to develop your own in the application layer.
You have to set up a separate indexing service and keep Cassandra and the index data in sync. There are some Cassandra/search integrations like DSE (commercial) and stargate-core (I've never heard of anyone using it in production), or you can roll your own, but all of these require time and effort.
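A minimal sketch of option 1 in Python (a hedged illustration, not any library's API): grid cells keyed by the integer parts of lat/lon, a lookup of the enclosing cell and its neighbours, and an exact distance check in the application. In Cassandra the cell pair would typically become the partition key.

from math import cos, floor, radians, sqrt

def bucket(lat, lon):
    # Grid cell: integer parts of lat/lon (roughly 111 km per degree at the equator).
    return (floor(lat), floor(lon))

def neighbour_buckets(lat, lon):
    # The cell containing the query point plus its 8 neighbours.
    blat, blon = bucket(lat, lon)
    return {(blat + i, blon + j) for i in (-1, 0, 1) for j in (-1, 0, 1)}

def nearby(points, lat, lon, radius_km):
    # Full neighbour scan inside the enclosing cells, then an exact distance check.
    cells = neighbour_buckets(lat, lon)
    hits = []
    for plat, plon in points:
        if bucket(plat, plon) not in cells:
            continue
        # Equirectangular approximation; fine for small distances.
        dx = (plon - lon) * 111.32 * cos(radians(lat))
        dy = (plat - lat) * 111.32
        if sqrt(dx * dx + dy * dy) <= radius_km:
            hits.append((plat, plon))
    return hits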
This issue was touched on at the Euro Cassandra Summit in 2014.
RedHat: Scalable Geospatial Indexing with Cassandra
The presenter explains how he created a spatial index using User Defined Types that is well suited to querying geospatial data with a region- or bounding-box-based lookup.
The general idea is to break up your data into regions defined by bounding boxes. Each region then becomes a rowkey, which you can use to access any data associated with that region. If you have an area of interest, you query the keyspace for the regions that fall inside that area.
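A hedged sketch of that region/rowkey idea with the DataStax Python driver, assuming a fixed 0.5-degree grid; the keyspace, table and column names (geo, points_by_region, region, lat, lon, name) are hypothetical and not taken from the presentation.

from cassandra.cluster import Cluster  # DataStax Python driver

GRID = 0.5  # degrees per region (one bounding box per grid cell)

def region_key(lat, lon):
    # Rowkey of the grid cell containing a point.
    return "%d:%d" % (int(lat // GRID), int(lon // GRID))

def regions_for_bbox(min_lat, min_lon, max_lat, max_lon):
    # All region rowkeys whose cells overlap the query bounding box.
    return ["%d:%d" % (la, lo)
            for la in range(int(min_lat // GRID), int(max_lat // GRID) + 1)
            for lo in range(int(min_lon // GRID), int(max_lon // GRID) + 1)]

session = Cluster(["127.0.0.1"]).connect("geo")
rows = []
for key in regions_for_bbox(40.5, -74.3, 41.0, -73.7):  # roughly the New York area
    rows.extend(session.execute(
        "SELECT lat, lon, name FROM points_by_region WHERE region = %s", (key,)))
# Exact filtering against the query bounding box then happens client-side.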
Related
I have a list of input geocodes (latitude/longitude pairs) from which we can create a geo shape. I have data indexed in rows, and each row has a geocode (latitude/longitude) pair. For a Java-based application, what is the best technology to create a shape from my input geocode list and then search against the indexed columns (latitude and longitude) to find the rows whose geocodes fall inside the shape?
You are looking for Spatial4J, which adds spatial support to Solr and Lucene. The support is provided via the Java Topology Suite (JTS), which can also be used in any standalone Java application. It provides functions for contains and intersects (and many more). The C++ port, GEOS, provides much of the functionality behind the PostGIS spatial predicate functions. JTS is mature, fast and well tested.
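If you end up in Python rather than Java, the same contains/intersects predicates are available through shapely, the Python bindings over the GEOS library mentioned above; a minimal sketch with made-up coordinates (lon/lat ordering is an assumption of this sketch):

from shapely.geometry import Point, Polygon

# Shape built from the input geocode list.
shape = Polygon([(-74.05, 40.70), (-73.90, 40.70), (-73.90, 40.85), (-74.05, 40.85)])

rows = [("a", -73.98, 40.75), ("b", -73.50, 40.60)]  # (row id, lon, lat)
inside = [rid for rid, lon, lat in rows if shape.contains(Point(lon, lat))]
print(inside)  # ['a']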
Scenario: I have a large dataset, with each entry containing a location (x,y - coordinates).
I want to be able to request every entry in this dataset that is within 100 m of a given location and have the results returned as an array.
How does one go about implementing something like this? Are there any recommended patterns or frameworks? I've previously only worked with relational or simple key-value data.
The data structure that solves this problem efficiently is a k-d tree. There are many implementations available, including a node.js module.
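For example, a minimal sketch with SciPy's k-d tree, assuming the coordinates are already in a planar metric (e.g. metres in a local projection) so that 100 really means 100 m:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100_000, 2) * 10_000   # fake (x, y) positions in metres
tree = cKDTree(points)

query = np.array([5_000.0, 5_000.0])
idx = tree.query_ball_point(query, r=100.0)    # indices of all points within 100 m
nearby = points[idx]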
Put your data set into PostgreSQL and use an R-tree index. You can then do a bounding box query to get all points within ±100 miles of a location, then calculate the radial distance and accept the points within 100 miles. You can roll your own schema and queries or use PostGIS.
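To see that two-step pattern (cheap bounding-box pre-filter, then an exact great-circle check) outside the database, here is a hedged sketch in plain Python; the miles-per-degree constants are approximations:

from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8

def haversine_mi(lat1, lon1, lat2, lon2):
    # Exact great-circle distance in miles.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

def within_radius(points, lat, lon, radius_mi=100.0):
    dlat = radius_mi / 69.0                           # ~69 miles per degree of latitude
    dlon = radius_mi / (69.0 * cos(radians(lat)))     # degrees of longitude shrink with latitude
    box = [(plat, plon) for plat, plon in points
           if abs(plat - lat) <= dlat and abs(plon - lon) <= dlon]  # cheap pre-filter
    return [(plat, plon) for plat, plon in box
            if haversine_mi(lat, lon, plat, plon) <= radius_mi]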
Unlike R-trees, k-d trees are not inherently balanced, so depending on how a k-d tree is built you can get inconsistent performance due to unbalanced trees and long worst-case paths.
I am interested in building a domain-specific image search application capable of searching for images similar to a given image. With a little google-fu I managed to find this question on this site. If I understand the top-rated answer correctly, then what I am looking to do is achievable by storing the luminosity data for each image in my library.
This is all well and good, but I need a way to quickly search through and compare against 25,000+ records. I have used PostgreSQL and so I immediately thought of it. The problem I find myself facing is that to store luminance data for 256 discrete possible values across 3 colors, I would need a table with 768 columns (r0,g0,b0,...,r255,g255,b255) and in order to effectively search across all records for similarities I would need 768 indices. I have never really worked with large scale data at this level before but that number seems a little unwieldy to me (although I don't know, my experience doesn't extend into this realm).
My other idea is to store the luminance data in one large text column (formatted like this: r0:rrr g0:ggg b0:bbb ... r255:rrr g255:ggg b255:bbb) and construct a full text search index on that column in order to allow searches across the data for similar images.
Another possibility is using the Hamming distance between a query histogram and a stored histogram, but I do not believe that can be computed quickly against all records in the database.
Am I even approaching this the right way? I am also open to any alternatives to relational databases that could provide fast, real-time search across my dataset.
It looks like you are putting each image into a three-dimensional space -- have you tried looking at any geospatial/multidimensional query engines? Similar images should be near each other in 3-space with your approach.
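One hedged way to act on that suggestion: reduce each image to a single point in 3-space (here the mean R, G, B values, purely as an illustrative reduction) and index those points with a k-d tree, so "find similar images" becomes a nearest-neighbour query:

import numpy as np
from scipy.spatial import cKDTree

# features[i] = (mean_r, mean_g, mean_b) for image i, values in 0..255
features = np.random.rand(25_000, 3) * 255
tree = cKDTree(features)

query = np.array([120.0, 95.0, 60.0])   # features of the query image
dist, idx = tree.query(query, k=10)     # the 10 most similar images by this measure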
I am relatively new to Cassandra and its data model. I have a large set of data described by locations on chromosomes (chromosome:start-end), where we have 24 chromosomes and start and end are integers. The query I would like to support is to find all locations in the genome that overlap with a set of other locations. I can create a simple R-tree-based "indexing" scheme if there are no other ideas, but I thought someone might have run into this problem and come up with a solution.
As you need to query on two dimensions, you could either use another database like MongoDB, which supports this kind of geospatial indexing/queries (see Bounds Queries),
or, in Cassandra, I think the best you could do is use geocell (doc) or other space-filling curves.
You would convert start and end to a geohash for each of your data points; you can then search the bounding box, with start in [s1, s2] and end in [e1, e2], by scanning the geocells between geohash(s1, e1) and geohash(s2, e2), which gives contiguous locations within the bounding box.
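For illustration, here is a minimal sketch of a Z-order (Morton) key, the bit-interleaving trick behind geohashes and geocells; the 32-bit width and the post-filtering caveat are assumptions of this sketch, not of any particular library:

def interleave_bits(x, y, bits=32):
    # Interleave the bits of x and y into a single Morton key.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Each (start, end) location gets one key; nearby intervals get nearby keys.
row_key = interleave_bits(1_200_000, 1_250_000)

# Query [s1, s2] x [e1, e2]: scan keys between the two corner keys, then
# post-filter, since a Z-order range can include some points outside the box.
s1, s2, e1, e2 = 1_000_000, 2_000_000, 1_100_000, 2_100_000
low, high = interleave_bits(s1, e1), interleave_bits(s2, e2)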
If I want to find all restaurants within a zip code, I can do a string search on the address; if I want to find all restaurants within 10 miles of a zip code, I need to do a location search. I have a database full of addresses, and geocodes should be no problem. But how do I compute the bounding box of an irregularly shaped area, like a zip code, city, state, or metro area?
Is there a tool around that does this? Is this information for sale somewhere?
My initial solution is to create an estimate of the areas by searching for all addresses within them, deriving the simplest polygon that surrounds them, and using that as a bounding box. However, this seems like a really brute-force way to do it. Do I do this calculation for every city, state, and zip in my database and store it? How have other people solved this problem?
Companies such as Maponics have polygon data on neighborhoods, counties, cities, states, provinces, townships, etc. There may be other providers.
Many of these polygons have huge numbers of points, so you should either:
compute bounding boxes, or
precompute the zip, neighborhood, city, etc. identifier for each address, and index a search collection by these regions.
But why build your application by storing a database of places and computing geographic data on your own? You can partner with providers such as CityGrid; they provide APIs for places that can be searched by neighborhood, zip, etc.; you can use their data for free in your own local application.
If you happen to be using PostgreSQL for your database, you can use box(geometry) or a variation thereof to compute the bounding box for a geometry. You can also implicitly use the bounding box for a geometry in your SQL. For example (from Using PostGIS: Data Management and Queries):
SELECT road_id, road_name FROM roads WHERE roads_geom && ST_GeomFromText('POLYGON((...))',-1);
where && "tells whether the bounding box of one geometry intersects the bounding box of another".
To get the bounding box for a collection of geometries, you can first use Collect or Union to aggregate or combine all the geometries together.
Of course, if you are not using PostGIS, the functionality really comes from GEOS, which is the underlying library that PostGIS actually uses. The basic geometry functions can be used directly (from Python, for example) to do what you want.
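For example, a minimal sketch with shapely (the Python bindings over GEOS): combine a few geometries, take their bounding box, and do the same kind of box-vs-box intersects test as the && operator above; the sample geometries are made up.

from shapely.geometry import MultiPoint, box
from shapely.ops import unary_union

geoms = [MultiPoint([(0, 0), (2, 3)]), MultiPoint([(5, 1), (4, 4)])]
combined = unary_union(geoms)             # analogous to Collect/Union
minx, miny, maxx, maxy = combined.bounds  # the bounding box of the collection
bbox = box(minx, miny, maxx, maxy)

query_area = box(3, 0, 6, 2)
print(bbox.intersects(query_area))        # True: the bounding boxes overlap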