Levenshtein distance and modern indexes in Neo4j

I have been using legacy indexes (aka Lucene) to run queries like the following in Neo4j Cypher, in order to go through millions of simhashes in real time and find similar items:
START file = node:File("simhash:THE_SIMHASH_HERE~0.8")
RETURN file.name
Since Lucene indexes are being deprecated now, I wonder if there is a way to run similar queries on the modern Label-based indexes.

Related

Does IBM-Graph use a search index? If so, what one?

My understanding is that IBM-Graph uses Titan, backed by Cassandra as its persistent datastore.
In this stack it is usual to have a separate search index (Solr, Lucene or Elasticsearch) in order to enable more advanced queries like full-text search and geo-related queries.
Does IBM-Graph implement a search index such as this? If so, which one? And are these more advanced queries exposed via 'gremlin', i.e. can we make use of this search index manually in order to perform full-text queries?

IBM Graph supports a search index: set composite to false when you create an index, and a mixed index will be created. FYI, the API doc: https://ibm-graph-docs.ng.bluemix.net/api.html#index-apis
But IBM Graph only supports first-level indexes. For example:
An index on the field name is used for the Gremlin query g.V().has("name","Jack")
But not for the second criterion, has("age",20), in the Gremlin query g.V().has("name","Jack").out().has("age",20)

Neo4j as a search engine

I ran several tests and read a lot of case studies on using Neo4j for graph-based search. I am convinced by features such as the flexible schema and real-time search and retrieval. But I also realise it is not designed to store documents for full-text search. For me, the potential of this product lies in the business value of data relationships.
The product matches 99% of my use case: an 'internal Google' for the company where I work, except for full-text search on documents (Word, PDF, etc.). This is not a hard requirement, but a nice-to-have. Nevertheless, should I drop the specific Neo4j features and go for a product like Elasticsearch, or is Neo4j the product we are looking for?

There are a few options for text search in Neo4j:
Cypher (the Neo4j query language) includes a few string comparison operators: CONTAINS, STARTS WITH and ENDS WITH. For example:
MATCH (d:Document) WHERE d.title STARTS WITH "Graph"
RETURN d
You can also make use of Lucene queries with Neo4j through "legacy" indexes. For example:
START doc=node:node_auto_index("title:graph*")
...
See this post for more information.
You can also model documents as graphs, and query them using Cypher as a graph model. For example, see the Neo4j Doc Manager project for converting data from MongoDB to Neo4j.
Finally, you can also use Neo4j and Elasticsearch together, indexing text data in Elasticsearch and using Neo4j for graph traversals. See this project.

mongodb approximate string matching

I am trying to implement a search engine for my recipes website using MongoDB.
I want to display search suggestions to users in a type-ahead widget.
I also want to support misspelled queries (Levenshtein distance).
For example: whenever users type 'pza', the type-ahead should display 'pizza' as one of the suggestions.
How can I implement such functionality using MongoDB?
Please note, the search should be instantaneous, since the search results will be fetched by the type-ahead widget. The collections I would run search queries over have at most 1 million entries.
I thought of implementing the Levenshtein distance algorithm, but this would slow down performance, as the collection is huge.
I read that FTS (Full Text Search) in MongoDB 2.6 is quite stable now, but my requirement is approximate matching, not FTS. FTS won't return 'pizza' for 'pza'.
Please recommend an efficient way to do this.
I am using the Node.js MongoDB native driver.

The text search feature in MongoDB (as of 2.6) does not have any built-in support for fuzzy/partial string matching. As you've noted, the use case currently focuses on language & stemming support with basic boolean operators and word/phrase matching.
There are several possible approaches to consider for fuzzy matching depending on your requirements and how you want to qualify "efficient" (speed, storage, developer time, infrastructure required, etc):
Implement support for fuzzy/partial matching in your application logic using some of the readily available soundalike and similarity algorithms. Benefits of this approach include not having to add any extra infrastructure and being able to closely tune matching to your requirements.
For some more detailed examples, see: Efficient Techniques for Fuzzy and Partial matching in MongoDB.
Integrate with an external search tool that provides more advanced search features. This adds some complexity to your deployment and is likely overkill just for typeahead, but you may find other search features you would like to incorporate elsewhere in your application (e.g. "like this", word proximity, faceted search, ..).
For example see: How to Perform Fuzzy-Matching with Mongo Connector and Elastic Search. Note: ElasticSearch's fuzzy query is based on Levenshtein distance.
Use an autocomplete library like Twitter's open source typeahead.js, which includes a suggestion engine and query/caching API. Typeahead is actually complementary to any of the other backend approaches, and its (optional) suggestion engine Bloodhound supports prefetching as well as caching data in local storage.
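For the first approach above (fuzzy matching in application logic), a minimal sketch in Python might look like the following. It is only an illustration: the function names, the in-memory list of recipe names, and the cutoff of 2 edits are all assumptions, and a linear scan like this over a million entries on every keystroke would likely need prefiltering or caching to be fast enough.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(query, candidates, max_distance=2, limit=5):
    """Return candidates within max_distance edits, closest first."""
    scored = [(levenshtein(query.lower(), c.lower()), c) for c in candidates]
    matches = sorted((d, c) for d, c in scored if d <= max_distance)
    return [c for _, c in matches[:limit]]


# Hypothetical sample data standing in for the recipes collection.
recipes = ["pizza", "pasta", "paella", "pie"]
print(suggest("pza", recipes))
```

Note that short queries match many short names at distance 2 (here 'pie' as well as 'pizza'), so in practice you would probably combine the distance with popularity or prefix weighting when ranking suggestions.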

The best fit for this would be the Elasticsearch fuzzy query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
It supports the Levenshtein distance algorithm out of the box and has additional features which can be useful for your requirements, e.g.:
- more like this
- powerful facets / aggregations
- autocomplete
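As a rough sketch of what such a request body looks like, the snippet below builds a fuzzy query as a Python dict and prints it as JSON. The field name "name" and the helper build_fuzzy_query are hypothetical; only the "fuzzy", "value" and "fuzziness" keys come from the query DSL. The fuzziness is set to 2 explicitly because 'pza' is two edits away from 'pizza'; as far as I know, the AUTO setting allows only one edit for a term this short.

```python
import json


def build_fuzzy_query(field, text, fuzziness=2):
    """Build the JSON body for an Elasticsearch fuzzy term query.

    `field` is whatever field holds the recipe name (an assumption here).
    """
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": text,
                    "fuzziness": fuzziness,  # maximum edit distance allowed
                }
            }
        }
    }


body = build_fuzzy_query("name", "pza")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request against whichever index holds the recipes.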

Which Neo4j index method is better

According to the Neo4j documentation, indexing can be done in 2 ways:
"Indexing in Neo4j can be done in two different ways:
1. The database itself is a natural index consisting of its relationships of different types between nodes. For example a tree structure can be layered on top of the data and used for index lookups performed by a traverser.
2. Separate index engines can be used, with Apache Lucene being the default backend included with Neo4j."
But there is no comparison of which approach is better in which cases.
Which one is better and why?

Is this a data warehouse/mart or reporting database? If you have both transactions and search going against the database it might give interesting pros or cons.
Lucene exists for one reason, search, and it does it really well. If you have a large system with multiple services, for ultimate scalability it is always best to split the services up and keep each one doing its single responsibility. This would give you the flexibility of using that Lucene index from other services if necessary. Also, if you ever got rid of Neo4j, you would still have your index/search artifacts around, not coupled to Neo4j.
I would look at it from the overall system architecture not just specific functionality.

Efficient, database-independent PHP implementation of geospatial index? Zend_Search_Lucene extension?

I'm storing lat/lon information in a MySQL database, which doesn't have great geospatial search support. I'm already maintaining a separate Lucene text search index for efficient full text search, so I looked at the geospatial extension for Lucene; but it only seems to be available for the Java implementation, not the Zend_Search_Lucene PHP version I use.
Is there something similar that would allow me to maintain a separate, database-independent geospatial index? A good implementation of an R-Tree variant in PHP or something similar? A geospatial extension for Zend_Search_Lucene?
It'd need to allow efficient geospatial queries, mostly within-radius-of-x and within-bounding-box-y queries, and return the id of the entry in the database.

http://www.ideacode.com/content/spatial-searches-with-zendsearchlucene helped me in this situation
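To make the two query types from the question concrete, here is a minimal Python sketch (not taken from the linked article) of a within-radius search that first applies a bounding-box test and then confirms with the haversine distance. The function names, the (id, lat, lon) tuple layout and the sample coordinates are assumptions for illustration; it scans a list linearly, whereas with a real index the bounding-box step would become range queries on the lat and lon fields. It also assumes the search area does not touch a pole or cross the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Smallest lat/lon box containing the circle (away from the poles)."""
    angular = radius_km / EARTH_RADIUS_KM
    dlat = math.degrees(angular)
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def within_radius(entries, lat, lon, radius_km):
    """Return ids of (id, lat, lon) entries within radius_km of the point."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    ids = []
    for entry_id, e_lat, e_lon in entries:
        # Cheap bounding-box rejection first, exact distance check second.
        if not (min_lat <= e_lat <= max_lat and min_lon <= e_lon <= max_lon):
            continue
        if haversine_km(lat, lon, e_lat, e_lon) <= radius_km:
            ids.append(entry_id)
    return ids


# Hypothetical rows: (database id, latitude, longitude).
entries = [(1, 48.8566, 2.3522), (2, 48.8049, 2.1204), (3, 51.5074, -0.1278)]
print(within_radius(entries, 48.8566, 2.3522, 25))
```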
