Cassandra data model for linear spatial data

I am relatively new to Cassandra and its data model. I have a large set of data described by locations on chromosomes (chromosome:start-end), where there are 24 chromosomes and start and end are integers. The query I would like to support is to find all locations in the genome that overlap with a set of other locations. I can create a simple R-tree-based "indexing" scheme if there are no other ideas, but I thought someone might have run into this problem and come up with a solution.

Since you need to query on two dimensions, you could either use another database such as MongoDB, which supports this kind of geospatial indexing/querying out of the box (see Bounds Queries),
or stay with Cassandra, where I think the best you can do is use geocells (doc) or another space-filling curve.
You would convert each (start, end) pair to a geohash. Then you can search a bounding box with start in [s1, s2] and end in [e1, e2] by scanning the geocells between geohash(s1, e1) and geohash(s2, e2), which gives roughly contiguous locations inside the bounding box.
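A minimal sketch of that idea, assuming start/end fit in 32 bits; the table layout in the comment is illustrative, not a prescribed schema. The bits of (start, end) are interleaved into one Morton (Z-order) key that can serve as a clustering column:

def interleave_bits(a, b, bits=32):
    # Morton / Z-order key: interleave the bits of a and b
    key = 0
    for i in range(bits):
        key |= ((a >> i) & 1) << (2 * i)
        key |= ((b >> i) & 1) << (2 * i + 1)
    return key

# Hypothetical table, one partition per chromosome, clustered by the key:
#   CREATE TABLE locations (
#       chromosome int, zkey bigint, start int, end int,
#       PRIMARY KEY (chromosome, zkey)
#   )
# A query for the box start in [s1, s2], end in [e1, e2] scans
#   zkey >= interleave_bits(s1, e1) AND zkey <= interleave_bits(s2, e2)
# and filters false positives on start/end client-side, since a single
# Z-order range only approximately covers the 2-D box.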

Related

Updatable nearest neighbor search

I'm trying to come up with a good design for a nearest neighbor search application. This would be somewhat similar to this question:
Saving and incrementally updating nearest-neighbor model in R
In my case this would be in Python, but the main point is that when new data comes in, the model / index must be updated. I'm currently playing around with the scikit-learn neighbors module, but I'm not convinced it's a good fit.
The goal of the application:
The user comes in with a query, and then the n (probably fixed to 5) nearest neighbors in the existing data set will be shown. For this step a search structure from sklearn would help, but it would have to be regenerated when adding new records. Also, this is a first step that happens once per query and hence could be somewhat "slow", as in 2-3 seconds, compared to "instantly".
Then the user can click on one of the records and see that record's nearest neighbors, and so forth. This means we are now within the existing dataset, and the NNs could be precomputed and stored in Redis (200k records for now, but this could grow to tens or hundreds of millions). Browsing around should be very fast.
But here I would face the same problem of how to update the precomputed data without doing a full recomputation of the distance matrix, especially since there will be very few new records (around 100 per week).
Does such a tool, method or algorithm exist for updatable NN searching?
EDIT, April 3rd:
As is indicated in many places, KDTree or BallTree isn't really suited to high-dimensional data. I've realized that for a proof-of-concept with a small data set of 200k records and 512 dimensions, brute force isn't much slower at all: roughly 550 ms vs 750 ms.
However, for larger data sets in the millions and beyond, the question remains open. I've looked at datasketch LSH Forest, but it seems that in my case it simply is not accurate enough, or I'm using it wrong. I will ask a separate question about this.
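For reference, a minimal sketch of that brute-force baseline with sklearn (the array shapes match the proof-of-concept above; the data itself is hypothetical):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(200_000, 512).astype(np.float32)   # hypothetical records

# algorithm="brute" builds no tree, it just stores the data, so "updating"
# is simply refitting on the extended matrix.
nn = NearestNeighbors(n_neighbors=5, algorithm="brute")
nn.fit(X)

query = np.random.rand(1, 512).astype(np.float32)
distances, indices = nn.kneighbors(query)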
You should look into FAISS and its IVFPQ index.
What you can do there is build a new index for each batch of updates and merge it with the old one.
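A minimal sketch of adding to a trained IVFPQ index (parameter values are illustrative, not tuned; merging separately built indexes, as suggested above, is not shown here):

import numpy as np
import faiss

d = 512                        # vector dimensionality
nlist, m = 1024, 64            # IVF cells and PQ sub-quantizers (illustrative)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

xb = np.random.rand(200_000, d).astype(np.float32)    # hypothetical data
index.train(xb)                # train once on representative data
index.add(xb)

new_rows = np.random.rand(100, d).astype(np.float32)  # weekly additions
index.add(new_rows)            # new vectors can be added without retraining

D, I = index.search(xb[:1], 5) # distances and ids of the 5 nearest neighbors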
You could try out Milvus, which supports adding vectors and near real-time search.
Here are the benchmarks of Milvus.
nmslib supports adding new vectors. It's used by OpenSearch as part of their Similarity Search Engine, and it's very fast (see the sketch below).
One caveat:
While the HNSW algorithm allows incremental addition of points, it forbids deletion and modification of indexed points.
You can also look into solutions like Milvus or Vearch.
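To illustrate the nmslib route, a minimal HNSW sketch (data and parameters are hypothetical):

import numpy as np
import nmslib

data = np.random.rand(200_000, 512).astype(np.float32)   # hypothetical data

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(data)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=False)

ids, distances = index.knnQuery(data[0], k=5)   # 5 nearest neighbours

# Per the caveat above: HNSW allows incremental addition of points but
# forbids deletion and modification of indexed points.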

Cassandra bounding box search

I am looking to use Cassandra for a nearby-search type of query. Based on my lon/lat coordinates I want to retrieve the closest points. I do not need 100% accuracy, so I am comfortable using a bounding box instead of a circle (better performance too), but I can't find concrete instructions (hopefully with an example) on how to implement a bounding box.
From my experience, there's no easy way to have a generic geospatial index search on top of Cassandra. I believe you only have two options:
Geohashing: split your dataset into square/rectangular elements, for example by using the integer parts of lat/lon as indexes into a grid. When searching, you load all elements in the enclosing grid cells and perform a full neighbour scan inside your application (see the sketch after the next option).
This works well if you have an evenly distributed dataset, like the grid points in an NWP simulation that I've worked with.
It works really badly on datasets like "restaurants in the USA", where most of the points crowd around large cities. You'll get an unbalanced, high load on grid elements like the New York area and absolutely empty index buckets somewhere in the Atlantic Ocean.
External indexes like ElasticSearch/Solr/Sphinx/etc.
All of them have geospatial indexing support out-of-the-box, no need to develop your own in your application layer.
You have to set up a separate indexing service and keep the Cassandra and index data in sync. There are some Cassandra/search integrations, like DSE (commercial) or stargate-core (I've never heard of anyone using this in production), or you can roll your own, but all of these require time and effort.
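A minimal sketch of the geohashing option, assuming a table partitioned by an integer grid cell; fetch_partition and haversine are hypothetical helpers, and the cell size is illustrative:

import math

CELL_DEG = 1.0    # grid cell size in degrees (illustrative)

def cell_of(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

def cells_for_bbox(lat_min, lat_max, lon_min, lon_max):
    # All grid cells touched by the bounding box; each cell maps to one
    # Cassandra partition, e.g. PRIMARY KEY ((cell_lat, cell_lon), point_id)
    la0, lo0 = cell_of(lat_min, lon_min)
    la1, lo1 = cell_of(lat_max, lon_max)
    return [(la, lo) for la in range(la0, la1 + 1)
                     for lo in range(lo0, lo1 + 1)]

# Load every candidate partition, then filter exactly in the application:
# rows = [r for cell in cells_for_bbox(...) for r in fetch_partition(cell)]
# nearby = [r for r in rows if haversine(r.lat, r.lon, q_lat, q_lon) <= radius]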
This issue was touched on in the Euro Cassandra Summit in 2014.
RedHat: Scalable Geospatial Indexing with Cassandra
The presenter explains how he created a spatial index using User Defined Types that is well suited to querying geospatial data with a region- or bounding-box-based lookup.
The general idea is to break your data up into regions defined by bounding boxes. Each region then becomes a row key, which you can use to access any data associated with that region. If you have an area of interest, you query the keyspace for the regions that fall inside that area.

Getting data based on location and a specified radius

Scenario: I have a large dataset, with each entry containing a location (x,y - coordinates).
I want to be able to request every entry in this dataset that is within 100 m of a given location and have the results returned as an array.
How does one go about implementing something like this? Are there any recommended patterns or frameworks? I've previously only worked with relational or simple key-value data.
The data structure that solves this problem efficiently is a k-d tree. There are many implementations available, including a node.js module.
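For example, a minimal in-memory sketch with SciPy's cKDTree (one of many such implementations), assuming planar (x, y) coordinates in metres:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1_000_000, 2) * 10_000   # hypothetical (x, y) in metres
tree = cKDTree(points)

query = np.array([5_000.0, 5_000.0])
idx = tree.query_ball_point(query, r=100.0)      # indices of points within 100 m
nearby = points[idx]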
Put your data set into PostgreSQL and use an R-tree index. You can then do a bounding box query to get all points within ±100 miles of any location, then calculate the radial distance and accept the points within 100 miles. You can roll your own schema and queries or use PostGIS.
Unlike R-trees, k-d trees are not inherently balanced, so depending on how a k-d tree is built you can get inconsistent performance due to unbalanced trees and long worst-case paths.

Storing and quickly comparing luminosity histograms

I am interested in building a domain-specific image search application capable of searching for images similar to a given image. With a little google-fu I managed to find this question on this site. If I understand the top rated answer correctly then what I am looking to do is achievable by storing the luminosity data for each image in my library.
This is all well and good, but I need a way to quickly search through and compare against 25,000+ records. I have used PostgreSQL and so I immediately thought of it. The problem I find myself facing is that to store luminance data for 256 discrete possible values across 3 colors, I would need a table with 768 columns (r0,g0,b0,...,r255,g255,b255) and in order to effectively search across all records for similarities I would need 768 indices. I have never really worked with large scale data at this level before but that number seems a little unwieldy to me (although I don't know, my experience doesn't extend into this realm).
My other idea is to store the luminance data in one large text column (formatted like this: r0:rrr g0:ggg b0:bbb ... r255:rrr g255:ggg b255:bbb) and construct a full text search index on that column in order to allow searches across the data for similar images.
Another possibility is using the hamming distance between a query histogram and a stored histogram, but I do not believe that is possible to do quickly against all records in the database.
Am I even approaching this the right way? I am also open to any alternatives to relational databases that could provide fast, real-time search across my dataset.
It looks like you are putting each image into a 3-dimensional space -- have you tried looking at any geospatial / multidimensional query engines? Similar images should be near each other in 3-space with your approach.
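For scale, a minimal numpy sketch of the brute-force comparison the question asks about; at 25,000 x 768 values the whole histogram matrix fits comfortably in memory and a vectorized L1 distance over all records is fast (array names are hypothetical):

import numpy as np

# hists: one 768-value luminosity histogram per image (r0..r255, g0..g255, b0..b255)
hists = np.random.rand(25_000, 768).astype(np.float32)
hists /= hists.sum(axis=1, keepdims=True)    # normalize so images are comparable

def most_similar(query_hist, k=10):
    # L1 (city-block) distance between the query and every stored histogram
    dists = np.abs(hists - query_hist).sum(axis=1)
    return np.argsort(dists)[:k]             # indices of the k closest images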

Fast(er) way of matching feature to database

I'm working on a project where I have a feature in an image described as a set of X & Y coordinates (5-10 points per feature) which are unique for this feature. I also have a database with thousands of features where each have the same type of descriptor. The result looks like this:
myFeature: (x1,y1), (x2,y2), (x3,y3)...
myDatabase: Feature1: (x1,y1), (x2,y2), (x3,y3)...
Feature2: (x1,y1), (x2,y2), (x3,y3)...
Feature3: (x1,y1), (x2,y2), (x3,y3)...
...
I want to find the best match of myFeature in the features in myDatabase.
What is the fastest way to match these features? Currently I am stepping through each feature in the database and comparing each individual point:
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

bestScore = 0
bestMatch = None
for feature in myDatabase:          # each feature is a list of (x, y) points
    score = 0
    for point in myFeature:
        # minimum distance from the current point to the points
        # describing the current feature in the database
        if min(dist(point, q) for q in feature) < threshold:
            score += 1              # the current point matches the target feature
    if score > bestScore:
        bestScore = score
        bestMatch = feature         # save feature as the new best match
This search works, but clearly it gets painfully slow on large databases. Does anyone know of a faster method to do this type of search, or at least if there is a way to quickly rule out features that clearly won't match the descriptor?
Create a bitset (an array of 1s and 0s) from each feature.
Create such a bitmask for your search criteria and then just use a bitwise and to compare the search mask to your features.
With this approach, you can shift most work to the routines responsible for saving the stuff. Also, creating the bitmasks should not be that computationally intensive.
If you just want to rule out features that absolutely can't match, then your mask-creation algorithm should take care of that and create the bitmasks a bit fuzzy.
The easiest way to create such masks is probably to create a matrix as big as the matrix of your features, put a one in every coordinate that is set for the feature and a zero in every coordinate that isn't, and then flatten that matrix into a one-dimensional row. Compare that feature row to the search mask bitwise.
This is similar to the way bitmap indexes work in large databases (e.g. Oracle), but with a different intention and without a full bitmap image of all database rows in memory.
The power of this is in the bitwise comparisons.
On a 32-bit machine you can perform 32 comparisons per instruction, whereas a point-by-point comparison with integers does only one. The gain is even larger compared with floating-point operations, depending on the architecture.
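A minimal sketch of that idea, using a coarse grid over the image (image and grid sizes are illustrative) and Python's arbitrary-precision integers as the bitset:

IMG_W, IMG_H = 640, 480          # image size (illustrative)
GRID_W, GRID_H = 64, 64          # coarse grid resolution (illustrative)

def feature_mask(points):
    # One bit per occupied grid cell; a "fuzzy" mask could also set the
    # neighbouring cells so near-misses still overlap.
    mask = 0
    for x, y in points:
        cx = min(int(x * GRID_W / IMG_W), GRID_W - 1)
        cy = min(int(y * GRID_H / IMG_H), GRID_H - 1)
        mask |= 1 << (cy * GRID_W + cx)
    return mask

def overlap(mask_a, mask_b):
    return bin(mask_a & mask_b).count("1")   # number of shared cells

# Precompute masks when saving features, then rank candidates cheaply:
# best = max(db_masks, key=lambda m: overlap(m, feature_mask(myFeature)))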
This in general looks like a spatial index problem. It's not my field, but you'll probably need to build a sort of tree index, such as a quadtree, that you can use to easily search for features. You can find some links from this wikipedia article: http://en.wikipedia.org/wiki/Spatial_index
It might be a problem that you can easily implement in an existing spatial database. It's very GIS-like in its description.
One thing you can do is calculate a point of gravity (centroid) for every feature and use that to whittle down the search space a bit (a one-dimensional search is a lot easier to build an index for), but that has the downside of being just a heuristic (depending on the shapes of your features, the point of gravity may end up in strange places).
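A minimal sketch of that heuristic, reusing the hypothetical myDatabase / myFeature names from the question; the full point-by-point scoring is then run only on the shortlisted candidates:

import numpy as np

# Precompute once: one centroid (point of gravity) per database feature
centroids = np.array([np.mean(f, axis=0) for f in myDatabase])

def candidates(query_feature, k=50):
    # Rank database features by centroid distance and keep the closest k
    c = np.mean(query_feature, axis=0)
    dists = np.linalg.norm(centroids - c, axis=1)
    return np.argsort(dists)[:k]

# Run the slow per-point matching only on myDatabase[i] for i in candidates(myFeature)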
