I ran several tests and read a lot of use cases for Neo4j for graph-based search. I am convinced by features such as the flexible schema and real-time search and retrieval, but I also realise it is not designed to store documents for full-text search. For me, the potential of this product lies in the business value of data relationships.
The product matches my case for 99%: an 'internal Google' for the company where I work, except for full-text search on documents (Word, PDF, etc.). This is not a hard requirement, but a nice-to-have. Nevertheless, should I drop the specific Neo4j features and go for a product like Elasticsearch, or is Neo4j the product we are looking for?
There are a few options for text search in Neo4j:
Cypher (the Neo4j query language) includes a few string comparison operators: CONTAINS, STARTS WITH and ENDS WITH. For example:
MATCH (d:Document) WHERE d.title STARTS WITH "Graph"
RETURN d
You can also make use of Lucene queries with Neo4j through "legacy" indexes. For example:
START doc=node:node_auto_index("title:graph*")
...
See this post for more information.
You can also model documents as graphs, and query them using Cypher as a graph model. For example, see the Neo4j Doc Manager project for converting data from MongoDB to Neo4j.
Finally, you can also use Neo4j and Elasticsearch together, indexing text data in Elasticsearch and using Neo4j for graph traversals. See this project.
Related
My understanding is that IBM Graph uses Titan, backed by Cassandra as its persistent datastore.
In this stack it is usual to have a separate search index (Solr, Lucene, or Elasticsearch) to enable more advanced queries such as full-text search and geo-related queries.
Does IBM Graph implement a search index like this? If so, which one? Also, are these more advanced queries exposed via Gremlin, i.e. can we use this search index directly to perform full-text queries?
IBM Graph does support a search index: if you set composite to false when you create an index, a mixed index is created. See the API docs: https://ibm-graph-docs.ng.bluemix.net/api.html#index-apis
However, IBM Graph only supports the index at the first level of a traversal. For example:
An index on the name field is used for the Gremlin query g.V().has("name","Jack"),
but not for the second criterion has("age",20) in the query g.V().has("name","Jack").out().has("age",20).
I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results of a Whoosh search are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document that gets indexed, and once I have my result use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664, stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
Thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of a query result. Usually people still store the original data in a regular database, which gives you more reliability and flexibility for reindexing. Keep in mind that ES indexes non-relational data: you can keep your data stored in relational form and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
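A minimal sketch of that pattern, using plain dicts to stand in for database rows and search hits (all names here are illustrative, not any specific client API): put the primary key into the denormalized document you index, then use the ids in the hits to resolve back to the authoritative rows.

```python
# Sketch: denormalize relational rows into search documents that carry the
# row id, then map search hits back to the authoritative DB rows.
# No specific search client is assumed; names are illustrative.

def build_document(company_row, country_rows):
    """Flatten a company row plus its related rows into one search document."""
    return {
        "id": company_row["id"],                        # primary key travels with the doc
        "name": company_row["name"],
        "countries": [c["name"] for c in country_rows]  # denormalized relation
    }

def rows_for_hits(hits, rows_by_id):
    """Resolve search hits (which only need to carry ids) back to full DB rows."""
    return [rows_by_id[h["id"]] for h in hits if h["id"] in rows_by_id]

rows_by_id = {1: {"id": 1, "name": "Acme", "founded": 1999}}
hits = [{"id": 1, "score": 2.3}]           # what the search engine returns
print(rows_for_hits(hits, rows_by_id))     # full rows, not just the indexed fields
```

In practice `rows_by_id` would be a bulk `SELECT ... WHERE id IN (...)` keyed by primary key rather than an in-memory dict.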
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full-text search is quite different from standard relational database storage, so your data may end up looking quite different in the search engine from the way you stored it.
This is all driven by your expected search results. You may increase granularity of the data or - opposite - denormalize it so the parent/related record content shows up in the records you actually want returned as part of search. Text processing (copyField, tokenization, pre-processing, etc) is also where a lot of content modifications happen to make a record findable.
Sometimes relational databases support full-text search; PostgreSQL is getting better and better at it. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out, and then merge them in the client code with the details from the original database records.
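One detail of that merge step, sketched below with a mock lookup (no specific database or client is assumed): the engine returns ids ordered by relevance, but a bulk `WHERE id IN (...)` fetch returns rows in arbitrary order, so the client has to re-sort the rows by the engine's ranking.

```python
# Sketch: re-sort DB rows by the relevance order the search engine returned.
# A bulk fetch (e.g. WHERE id IN (...)) does not preserve the engine's order.

def merge_by_relevance(ranked_ids, fetched_rows):
    """Order fetched_rows to match ranked_ids, dropping rows the engine didn't return."""
    order = {doc_id: rank for rank, doc_id in enumerate(ranked_ids)}
    return sorted(
        (r for r in fetched_rows if r["id"] in order),
        key=lambda r: order[r["id"]],
    )

ranked_ids = [7, 2, 5]                              # relevance order from the engine
fetched_rows = [{"id": 2}, {"id": 5}, {"id": 7}]    # DB returns them unordered
print(merge_by_relevance(ranked_ids, fetched_rows)) # [{'id': 7}, {'id': 2}, {'id': 5}]
```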
I am trying to implement a search engine for my recipe website using MongoDB.
I am trying to display search suggestions to users in a type-ahead widget box.
I am even trying to support misspelled queries (Levenshtein distance).
For example: whenever a user types 'pza', the type-ahead should display 'pizza' as one of the suggestions.
How can I implement such functionality using MongoDB?
Please note, the search should be instantaneous, since the search results will be fetched by the type-ahead widget. The collections I would run search queries over have at most 1 million entries.
I thought of implementing the Levenshtein distance algorithm myself, but this would slow down performance, as the collection is huge.
I read that FTS (full-text search) in MongoDB 2.6 is quite stable now, but my requirement is approximate matching, not FTS; FTS won't return 'pizza' for 'pza'.
Please recommend an efficient way to do this.
I am using the Node.js MongoDB native driver.
The text search feature in MongoDB (as at 2.6) does not have any built-in features for fuzzy/partial string matching. As you've noted, the use case currently focuses on language & stemming support with basic boolean operators and word/phrase matching.
There are several possible approaches to consider for fuzzy matching depending on your requirements and how you want to qualify "efficient" (speed, storage, developer time, infrastructure required, etc):
Implement support for fuzzy/partial matching in your application logic using some of the readily available soundalike and similarity algorithms. Benefits of this approach include not having to add any extra infrastructure and being able to closely tune matching to your requirements.
For some more detailed examples, see: Efficient Techniques for Fuzzy and Partial matching in MongoDB.
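As a minimal illustration of this first approach (plain Python, no MongoDB-specific API assumed; the candidate list would come from your collection, e.g. distinct recipe names held in memory), you can prefilter candidates by length and then rank by edit distance:

```python
# Sketch: rank typeahead candidates by Levenshtein edit distance.
# The candidate list is illustrative; in practice it would be built from
# your collection and kept in memory for fast lookups.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(query, candidates, max_distance=2, limit=5):
    """Return up to `limit` candidates within `max_distance` edits of query."""
    scored = [(levenshtein(query, c), c) for c in candidates
              if abs(len(c) - len(query)) <= max_distance]  # cheap length prefilter
    return [c for d, c in sorted(scored) if d <= max_distance][:limit]

print(suggest("pza", ["pizza", "pasta", "pie", "pilaf"]))
# ['pie', 'pizza'] -- both are within distance 2; the tie is broken alphabetically
```

The length prefilter is what keeps this viable: the expensive distance computation only runs on candidates that could possibly be within range.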
Integrate with an external search tool that provides more advanced search features. This adds some complexity to your deployment and is likely overkill just for typeahead, but you may find other search features you would like to incorporate elsewhere in your application (e.g. "more like this", word proximity, faceted search, etc.).
For example see: How to Perform Fuzzy-Matching with Mongo Connector and Elastic Search. Note: ElasticSearch's fuzzy query is based on Levenshtein distance.
Use an autocomplete library like Twitter's open source typeahead.js, which includes a suggestion engine and query/caching API. Typeahead is actually complementary to any of the other backend approaches, and its (optional) suggestion engine Bloodhound supports prefetching as well as caching data in local storage.
The best fit for this would be Elasticsearch's fuzzy query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
It supports the Levenshtein distance algorithm out of the box and has additional features that can be useful for your requirements, e.g.:
- more like this
- powerful facets / aggregations
- autocomplete
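For reference, a fuzzy query body looks roughly like this (shown as a Python dict; the field name "name" is illustrative for a recipe collection):

```python
# Sketch: an Elasticsearch fuzzy query body for the 'pza' -> 'pizza' case.
# "fuzziness" is the maximum edit distance allowed; 2 covers 'pza' -> 'pizza'
# (Elasticsearch also accepts "AUTO" to pick it based on term length).
fuzzy_query = {
    "query": {
        "fuzzy": {
            "name": {            # illustrative field name
                "value": "pza",
                "fuzziness": 2,
            }
        }
    }
}
```

This body would be POSTed to the index's `_search` endpoint via whatever client you use.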
I have a Solr instance that fetches and indexes data about companies from a DB. The data about a single company can be provided in several languages (English and Russian, for example). All the companies, of course, have a unique key that is the uniqueKey in the Solr index too. I need to present Solr search across all the languages at once.
How can it be performed?
1. Multicore? I've built two separate cores, one with each language's data, but I can't search the two indexes simultaneously.
localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0/,localhost:8983/solr/core1/&indent=true&q=*:*&distributed=true
or
localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0/,localhost:8983/solr/core1/&indent=true&id:123456
gives no results, while searching in each core individually succeeds.
Making the Name field (for example) multivalued is not a solution, because the data for the different languages is fetched from the DB by different procedures, and the value just gets overwritten.
I'm not sure about the multicore piece, but have you considered creating two fields in a single core, one for each language? You could then combine them with "OR", which is the default, so for example a query like:
en:"query test here" OR ru:"query test here"
It sounds like you are using the DataImportHandler to load your data. You can implement @Mike Sokolov's answer, or implement the multivalued solution via a Solr client. You would need to write some custom code in a client like SolrJ (or one of the other clients listed on IntegratingSolr in the Solr wiki) to pull both languages in separate queries from your database, then merge the data from both results into a common result set that can be transformed into a single Solr document.
I have a Sphinx server that indexes a MySQL database for a Django app. My search is working fine, but my content includes medical words/phrases. So, for example, I need a search for "dvt" to also match "deep venous thrombosis" and even "deep vein thrombosis". I looked through the documentation and see options for "wordforms" and "morphology". Which of these (or something else) should I use? Also, will it work backwards, i.e. will a search for "deep venous thrombosis"/"deep vein thrombosis" match "dvt"?
I would also appreciate some advice on how to set these up, since I'm new to Sphinx in general.
You will need to provide your own list of word/term synonyms to be used in query expansion.
Since Sphinx does not currently support synonym expansion in queries, you'll need to massage the query based on your list of synonyms before submitting it to the search engine.
So, using your example:
1. User queries for 'dvt remediation procedures'.
2. Server receives the query and checks each term against its list of synonyms.
3. Server finds a match and adds 'deep vein thrombosis' to the query.
4. Server submits the newly expanded query 'dvt deep vein thrombosis remediation procedures' to the search engine.
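The expansion step can be sketched as follows (the synonym table is illustrative and would be hand-curated for your medical vocabulary; Sphinx is not involved until the expanded query is submitted):

```python
# Sketch: expand query terms using a hand-maintained synonym table before
# submitting the query to Sphinx. The table here is a tiny illustration.

SYNONYMS = {
    "dvt": ["deep vein thrombosis"],
    # Reverse entries make expansion work backwards, but matching a
    # multi-word key would need phrase detection before the per-term loop.
    "deep vein thrombosis": ["dvt"],
}

def expand_query(query):
    """Append known synonyms after each matching term in the query."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term.lower(), []))
    return " ".join(expanded)

print(expand_query("dvt remediation procedures"))
# 'dvt deep vein thrombosis remediation procedures'
```

The expanded string is what gets handed to Sphinx in place of the user's original query.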
Finally, if the stemmer built into Sphinx is doing its job, you shouldn't have to support both 'venous' and 'vein' as separate terms since they both should stem to the same term. If this is not the case, you might need to do additional pre-stemming to handle words specific to your corpora (medical terms).