How to index a d-dimensional vector in Solr using geospatial search?

I am trying to create a plugin for Solr that will enable indexing 3D models. I will take 'screenshots' of each model from several different views and preprocess those images so each model is represented as a 1-d vector.
I wanted to use Lucene/Solr geospatial search for that purpose, as I saw there is an option to index a vector (larger than 2 dimensions) and search according to distance from the vector (i.e. according to location).
Unfortunately the documentation for this option disappeared last week and it isn't cached in Google.
How can I index a location vector with dimension > 2?
The link for the documentation was here:
https://wiki.apache.org/solr
And I found it from here:
https://lucene.apache.org/solr/guide/6_6/spatial-search.html#SpatialSearch-LatLonPointSpatialField

Geospatial search in Lucene is designed to work with n-dimensional points, but the Solr implementation for more than 2 dimensions does not appear to be available.
You can still do 3D spatial search with Solr if you index one coordinate per field (using a double fieldType).
Then, in order to query and sort documents by distance from a given point, instead of using geodist(sfield2D,x,y) you may use dist():
Returns the distance between two vectors (points) in an n-dimensional
space. Takes in the power, plus two or more ValueSource instances and
calculates the distances between the two vectors.
To compute the Euclidean distance between an arbitrary point (0,0,0) and the indexed points for each document, you would use:
dist(2, fieldx, fieldy, fieldz, 0, 0, 0)
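For illustration, here is a hedged sketch of the query side from Python, assuming each coordinate was indexed into its own double field (the field names x_d, y_d, z_d and the collection name models are assumptions, not from the original post):

import requests

# Sort documents by Euclidean (power 2) distance from the point (0, 0, 0),
# returning that distance as a pseudo-field "d". Adjust field and collection
# names to your schema.
params = {
    "q": "*:*",
    "fl": "id,d:dist(2,x_d,y_d,z_d,0,0,0)",
    "sort": "dist(2,x_d,y_d,z_d,0,0,0) asc",
    "rows": 10,
}
resp = requests.get("http://localhost:8983/solr/models/select", params=params)
print(resp.json()["response"]["docs"])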
See also :
- Calculate distance in 3D space
- Distance Sorting or Boosting (Function Queries)
- Difference between geodist() and dist() for Geo-Spacial Search

Related

Difference between distance() and geo_distance() in arangodb

What is the difference between the Arango functions DISTANCE() and GEO_DISTANCE()? I know both of them calculate distance using the haversine formula.
Thanks,
Nilotpal
Both are used for two different purposes
DISTANCE(latitude1, longitude1, latitude2, longitude2) → distance
The value is computed using the haversine formula, which is based on a spherical Earth model. It’s fast to compute and is accurate to around 0.3%, which is sufficient for most use cases such as location-aware services.
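For reference, a minimal Python sketch of the haversine formula that DISTANCE() is based on (a spherical Earth with a mean radius of about 6,371 km is assumed here):

import math

def haversine(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    # Great-circle distance in metres between two lat/lon points on a sphere.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

print(haversine(52.5200, 13.4050, 48.8566, 2.3522))  # Berlin -> Paris, roughly 878 km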
GEO_DISTANCE(geoJsonA, geoJsonB, ellipsoid) → distance
Returns the distance between two GeoJSON objects, measured from the centroid of each shape. For a list of supported types see the geo index page. (Ref: https://www.arangodb.com/docs/3.8/aql/functions-geo.html#geo-index-functions)
These GeoJSON objects can be anything like GEO_LINESTRING, GEO_MULTILINESTRING, GEO_MULTIPOINT, GEO_POINT, GEO_POLYGON and GEO_MULTIPOLYGON (see the second reference below).
Reference:
https://www.arangodb.com/docs/3.8/aql/functions-geo.html#geo-utility-functions
https://www.arangodb.com/docs/3.8/aql/functions-geo.html#geojson-constructors

quanteda: how much scale can textstat_simil handle?

I have been using quanteda for the past couple of months and really enjoy using the package. One question I have is how many rows of a dfm the textstat_simil function can handle before the time to create the similarity matrix becomes too long.
I have a search corpus containing 15 million documents. Each document is a short sentence containing anywhere from 5 to 10 words (the documents sometimes include some 3-4 digit numbers too). I have tokenized this search corpus using character bigrams and created a dfm from it.
I also have another corpus that I call the match corpus. It has a couple hundred documents of similar length, has had the same tokenization, and a dfm created for it also. The aim is to find the closest matching document from the search corpus for each of the match corpus documents.
A combined dfm is made by rbinding the match dfm with the search dfm. The number of unique tokens for the combined dfm is about 1580. I then run textstat_simil on this combined dfm using "cosine" method, "documents" as the margin, and the selection being just one of the match corpus documents for now to test. However, when I run textstat_simil it takes over 5 minutes to run.
Is this sort of volume too much for this type of approach using quanteda?
Cheers,
Sof
In quanteda v1.3.13, we reprogrammed the function for computing cosine similarities so that it is more efficient in memory and storage. However, it sounds like you are still trying to get a document-by-document distance matrix, which (excluding the diagonal) has 15,000,000 * (15,000,000 - 1) / 2 ≈ 1.125e+14 cells. If you are able to get this to run at all, I'm very impressed with your machine!
For your few-hundred-document target set, however, you can narrow this down by using the selection argument.
Also, look for the experimental textstat_proxy() function in v1.3.13, which we created for this sort of problem. You can specify a minimum distance below which a distance will not be recorded, and it returns a distance matrix using a sparse matrix object. This is still experimental because the sparse values are not zeroes, but will be treated as zeroes by any operations on the sparse matrix. (This violates some distance properties - see the discussion here.)
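Outside of quanteda, the shape of that targeted computation (a few hundred match documents against the whole search dfm, rather than all pairs) can be sketched with sparse matrices; the matrices below are random stand-ins for the real dfms:

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

# Random stand-ins: match_dfm would be ~200 x 1580, search_dfm 15,000,000 x 1580.
match_dfm = sp.random(200, 1580, density=0.01, format="csr", random_state=1)
search_dfm = sp.random(100_000, 1580, density=0.01, format="csr", random_state=2)

# Cosine similarity is the dot product of L2-normalised rows, so only a
# 200 x N block is ever computed, never the full N x N matrix.
sims = normalize(match_dfm) @ normalize(search_dfm).T
best = np.asarray(sims.argmax(axis=1)).ravel()   # index of the closest search doc per match doc
print(best[:10])

For the real 15-million-row dfm you would process search_dfm in row chunks, but the point is the same: only a couple-hundred-by-N block is ever materialised.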

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group D of around 1,000,000 short documents (no more than 50 words each), and I want to let users supply a document from the same group D and get the top K most similar documents from D.
My approach:
My first approach was to preprocess the group D by applying simple tf-idf, and after I have vector for each document, which is extremely sparse, to use a simple nearest neighbours algorithm based on cosine similarity.
Then, at query time, to just use my static nearest-neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of dimension ~200,000, which means I now have a very sparse table (that can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000.
However, calculating the nearest-neighbours model has taken more than a day and still hasn't finished.
I tried to lower the vector dimension by applying HashingTF instead, which uses the hashing trick, so I can set the dimension to a constant (in my case I used 2^13 hash buckets), but I still get the same bad performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and scikit-learn's NearestNeighbors on the collected data.
Is there any more efficient way to achieve that goal?
Thanks in advance.
Edit:
I had an idea to try a LSH based approximation similarity algorithm like those implemented in spark as described here, but could not find one that supports the 'cosine' similarity metric.
There are some requirements on the relation between the number of training instances and the dimensionality of your vectors, but you can try DIMSUM.
You can find the paper here.
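If you do try DIMSUM, Spark MLlib exposes a DIMSUM-based implementation through RowMatrix.columnSimilarities(threshold). Note that it compares the columns of the matrix, so documents have to be laid out as columns (rows = hashed features). A toy sketch with the Spark 2.x MLlib API (the 0.1 threshold is an arbitrary example):

from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("dimsum-sketch").getOrCreate()
sc = spark.sparkContext

# Toy matrix: each ROW is one feature, each COLUMN is one document
# (3 features x 4 documents), because columnSimilarities() compares columns.
rows = sc.parallelize([
    Vectors.dense([1.0, 0.0, 2.0, 0.0]),
    Vectors.dense([0.0, 3.0, 0.0, 1.0]),
    Vectors.dense([4.0, 0.0, 0.0, 1.0]),
])
mat = RowMatrix(rows)

# Approximate cosine similarities between documents, dropping pairs whose
# similarity is estimated to fall below the 0.1 threshold.
sims = mat.columnSimilarities(0.1)
for entry in sims.entries.collect():
    print(entry.i, entry.j, entry.value)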

Euclidean vs Cosine for text data

If I use a tf-idf feature representation (or just document length normalization), then are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks I have read and other forums and discussions say cosine similarity works better for text...
I wrote some basic code to test this and found they are indeed comparable: not exactly the same floating-point values, but it looks like a scaled version. Given below are the results of both similarities on simple demo text data. Text no. 2 is a big line of about 50 words; the rest are small 10-word lines.
Cosine similarity:
0.0, 0.2967, 0.203, 0.2058
Euclidean distance:
0.0, 0.285, 0.2407, 0.2421
Note: If this question is more suitable to Cross Validation or Data Science, please let me know.
If your data is normalized to unit length, then it is very easy to prove that
Euclidean(A,B)^2 = 2 - 2*Cos(A,B)
since ||A-B||^2 = ||A||^2 + ||B||^2 - 2<A,B> = 2 - 2*Cos(A,B) when ||A|| = ||B|| = 1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. I.e. if you first normalize your document to unit length and then perform IDF weighting, it will not hold...
Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.
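A quick numerical sanity check of that relation on random unit-length vectors (numpy, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=5); A /= np.linalg.norm(A)   # unit-length vectors
B = rng.normal(size=5); B /= np.linalg.norm(B)

cos = A @ B
euclid_sq = np.sum((A - B) ** 2)
print(euclid_sq, 2 - 2 * cos)   # the two values agree up to floating-point error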

Fast(er) way of matching feature to database

I'm working on a project where I have a feature in an image described as a set of X & Y coordinates (5-10 points per feature) which are unique for this feature. I also have a database with thousands of features where each have the same type of descriptor. The result looks like this:
myFeature: (x1,y1), (x2,y2), (x3,y3)...
myDatabase: Feature1: (x1,y1), (x2,y2), (x3,y3)...
Feature2: (x1,y1), (x2,y2), (x3,y3)...
Feature3: (x1,y1), (x2,y2), (x3,y3)...
...
I want to find the best match of myFeature in the features in myDatabase.
What is the fastest way to match these features? Currently I am stepping through each feature in the database and comparing each individual point:
import math

bestScore, bestMatch = 0, None
for feature in myDatabase:                      # each feature is a list of (x, y) points
    score = 0
    for point in myFeature:
        # minimum distance from this point to any point of the database feature
        if min(math.dist(point, q) for q in feature) < threshold:
            score += 1                          # this point matches the current feature
    if score > bestScore:
        bestScore, bestMatch = score, feature
This search works, but clearly it gets painfully slow on large databases. Does anyone know of a faster method to do this type of search, or at least if there is a way to quickly rule out features that clearly won't match the descriptor?
Create a bitset (an array of 1s and 0s) from each feature.
Create such a bitmask for your search criteria and then just use a bitwise and to compare the search mask to your features.
With this approach, you can shift most of the work to the routines responsible for storing the features. Also, creating the bitmasks should not be that computationally intensive.
If you just want to rule out features that absolutely can't match, your mask-creation algorithm should take care of that and make the bitmasks a bit fuzzy (e.g. also set the cells neighbouring each point).
The easiest way to create such masks is probably to build a matrix as big as the matrix of your features, put a one in every coordinate that is set for the feature and a zero in every coordinate that isn't, and then flatten that matrix into a one-dimensional row. Compare that feature row to the search mask bitwise.
This is similar to the way bitmap indexes work on large databases (oracle e.g.), but with a different intention and without a full bitmap-image of all database rows in memory.
The power of this is in the bitwise comparisons.
On a 32-bit machine you can perform 32 comparisons per instruction, versus just one per instruction when you compare points as integer numbers one at a time. The gains are even bigger for floating-point operations, depending on the architecture.
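For illustration, a minimal Python sketch of this idea; the grid resolution, cell size and toy data are arbitrary choices, and myFeature / myDatabase stand in for the question's structures:

def to_mask(points, cell=8, grid=64):
    # Rasterize (x, y) points onto a grid x grid raster packed into one integer.
    mask = 0
    for x, y in points:
        gx, gy = int(x) // cell, int(y) // cell
        if 0 <= gx < grid and 0 <= gy < grid:
            mask |= 1 << (gy * grid + gx)
    return mask

myFeature = [(3, 4), (10, 12), (40, 41)]                       # toy stand-in data
myDatabase = [[(3, 5), (11, 12), (39, 40)],
              [(100, 100), (120, 130), (140, 150)]]

searchMask = to_mask(myFeature)
dbMasks = [to_mask(f) for f in myDatabase]                     # precompute and store once

# Score = number of grid cells shared between the search mask and each candidate.
scores = [bin(searchMask & m).count("1") for m in dbMasks]
print(max(range(len(dbMasks)), key=scores.__getitem__))        # index of best match: 0

To make the masks fuzzy, as suggested above, you would also set the neighbouring cells of each point when building the mask.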
This in general looks like a spatial index problem. It's not my field, but you'll probably need to build a sort of tree index, such as a quadtree, that you can use to easily search for features. You can find some links from this wikipedia article: http://en.wikipedia.org/wiki/Spatial_index
It might be a problem that you can easily implement in an existing spatial database. It's very GIS-like in its description.
One thing you can do is calculate a centre of gravity (centroid) for every feature and use that to whittle down the search space a bit (a search over single points is a lot easier to build an index for), but that has the downside of being just a heuristic (depending on the shapes of your features, the centroid may end up in weird places).
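And a hedged sketch of that centre-of-gravity pruning using a k-d tree (scipy's cKDTree); the k=2 candidate count and the toy data are arbitrary, and the exact point-by-point comparison from the question would then run only on the surviving candidates:

import numpy as np
from scipy.spatial import cKDTree

myFeature = [(3, 4), (10, 12), (40, 41)]                       # toy stand-in data
myDatabase = [[(3, 5), (11, 12), (39, 40)],
              [(100, 100), (120, 130), (140, 150)],
              [(5, 6), (9, 14), (42, 38)]]

# Index every database feature by its centre of gravity (centroid).
centroids = np.array([np.mean(f, axis=0) for f in myDatabase])
tree = cKDTree(centroids)

# Prune: keep only the k features whose centroid is closest to the query's
# centroid, then run the exact point-by-point comparison on just those few.
_, idx = tree.query(np.mean(myFeature, axis=0), k=2)
candidates = [myDatabase[i] for i in np.atleast_1d(idx)]
print(sorted(np.atleast_1d(idx).tolist()))                     # e.g. [0, 2]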
