Solr vs Elasticsearch for nested documents - search

I have been using Solr for my project, but recently I encountered Elasticsearch, which seems very promising. My project requires the ability to handle nested documents, and I would like to know which one does a better job. Solr added child documents only recently, but is its support as good as Elasticsearch's? Can Elasticsearch query both parent and children at once? Thanks

I've been looking into the subject recently, and to my understanding Elasticsearch makes life a lot easier when working with nested documents, although Solr also supports nesting (but is less flexible in querying).
The relevant features of Elasticsearch are:
"Seamlessly" supports nesting: you don't have to change your nested document structure or add specific fields. However, you need to indicate in the mapping which fields are nested when creating the index.
Supports nested queries with "nested" and "path".
Supports aggregation and filtering with nested docs, also via "nested" and "path".
With Solr you will have to:
Modify your schema.xml by adding the _root_ field.
Modify your dataset so that parent and child documents have a specific distinguishing field, in particular _childDocuments_ to indicate children (see more at this question).
Aggregation and filtering on nested documents promise to be very complicated, if not impossible.
Also, nested fields are not supported at all.
Recent Solr versions (5.1 and up) can be configured to support nesting (which includes changing your input data structure); however, the documentation is not very clear and there is not much information on the Internet because these features are recent.
The bottom line is that, as far as nested documents go, Elasticsearch can do everything that Solr can and more, with less effort and a smoother learning curve. So going with Elasticsearch seems more reasonable in this case.
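To make the "nested" and "path" points from this answer concrete, here is a minimal Python sketch that only builds the JSON request bodies. The index layout (a post with a title and a comments field holding author and stars) is a made-up example, and the typeless mapping layout assumes a recent Elasticsearch version; older releases wrap the properties in a type name. No client library or running server is assumed.

```python
import json

# Hypothetical mapping: "comments" is declared as nested, so each comment
# is indexed as its own hidden document alongside the parent post.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "comments": {
                "type": "nested",
                "properties": {
                    "author": {"type": "keyword"},
                    "stars": {"type": "integer"},
                },
            },
        }
    }
}

# One request that constrains the parent (title) and a child (comments) at once.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "solr"}},
                {
                    "nested": {
                        "path": "comments",
                        "query": {
                            "bool": {
                                "must": [
                                    {"term": {"comments.author": "alice"}},
                                    {"range": {"comments.stars": {"gte": 4}}},
                                ]
                            }
                        },
                    }
                },
            ]
        }
    }
}

# Print the bodies so they can be pasted into any HTTP client.
print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

Because both conditions inside the nested clause are evaluated against the same comment, a post only matches when a single comment satisfies both of them, which also answers the question about querying parent and children together.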

I am not familiar with Elasticsearch, so this is only half an answer.
Solr works best with denormalized data. However, given that you have nested documents, you can use Solr in two scenarios:
Query for a parent by a child attribute.
Query for all children of a parent.
You can use block join to perform the above queries. Even though you deal with nested levels, Solr internally manages them as denormalized documents. That is, when a parent has 2 children, you end up with three top-level documents in Solr, and Solr manages the relationship between them.
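The two block join scenarios above can be sketched as query strings. This is a minimal Python sketch that only assembles the strings; the field names (doc_type, color, id) and their values are hypothetical, and the one real requirement is that the parent filter matches every parent document and no children.

```python
from urllib.parse import urlencode

# Hypothetical schema: every document carries a "doc_type" field
# ("parent" or "child") plus ordinary fields such as "color" and "id".
PARENT_FILTER = "doc_type:parent"  # must match all parents and only parents


def parents_with_child(child_query: str) -> str:
    """Block join parent query: parents whose children match child_query."""
    return f'{{!parent which="{PARENT_FILTER}"}}{child_query}'


def children_of(parent_query: str) -> str:
    """Block join child query: children of parents matching parent_query."""
    return f'{{!child of="{PARENT_FILTER}"}}{parent_query}'


q1 = parents_with_child("color:red")  # scenario 1: parent by child attribute
q2 = children_of("id:10")             # scenario 2: all children of a parent

# URL-encoded form, ready to append to a /select request.
print(urlencode({"q": q1}))
print(urlencode({"q": q2}))
```

Block join only works when a parent and its children were indexed together as one block, which is what the nested input structure is for.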

Related

Field.Store and Field.Index both set to `NO` in a Lucene document?

I am aware of what Field.Store and Field.Index mean in a Lucene document, and of the use cases in which either Field.Store or Field.Index is set to NO.
But recently I came across a piece of code in which both are set to NO. Could anybody explain, with an example, the use case in which we need to set both to NO?
PS: I referred to this SO question, which explains with good use cases why one is set to NO and the other to YES, but it doesn't answer my question.
Lucene is a generic full-text indexing and search library; it is not a framework in itself like Elasticsearch or Solr.
So, if you are developing your search application directly on Lucene, you have full control over which fields from your app are indexed and/or stored in the Lucene inverted index.
Frameworks like Elasticsearch or Solr, which are built on top of Lucene, may use a schema for indexing or may be schemaless.
I think that in schemaless cases it makes sense to explicitly ignore the fields that we want neither indexed nor stored.

Cloudant: Indexes vs Views

Is Cloudant's concept of indexes native to CouchDB? It appears that Cloudant has built its index feature on top of CouchDB; is this correct? If so, what is the difference between an index and a view?
The Query interface is (currently) a simplifying API for creating and accessing the underlying CouchDB views. The indexes you define via the _index endpoint are actually translated into views, and those views can be accessed and used in the same way as a normal CouchDB view, as well as via the _find endpoint (note: the inverse is not true; Query doesn't currently use existing JavaScript views). The views stay in the Erlang layer, which gives us the opportunity for performance enhancements etc.
You can also filter result data to only return document fields you're interested in, as opposed to hard coding the returned fields in the view or running the view result through a list function.
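As a rough illustration of the two endpoints mentioned above, here is a minimal Python sketch of the JSON bodies one might send to _index and _find. The field names (year, title) and the index name are hypothetical, and the sketch only builds the bodies without contacting a server.

```python
import json

# Body for POST /<db>/_index: a JSON index on a hypothetical "year" field.
index_definition = {
    "index": {"fields": ["year"]},
    "name": "year-index",
    "type": "json",
}

# Body for POST /<db>/_find: "selector" picks the documents, while "fields"
# trims each result to the listed document fields.
find_request = {
    "selector": {"year": {"$gt": 2010}},
    "fields": ["_id", "title", "year"],
    "sort": [{"year": "asc"}],
    "limit": 10,
}

print(json.dumps(index_definition, indent=2))
print(json.dumps(find_request, indent=2))
```

The "fields" list is the part that replaces hard coding the returned fields in a view or post-processing with a list function.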
Cheers
Simon

Heterogeneous Data Storage in CouchDB

I would like to know the best practices for storing heterogeneous data in CouchDB. In MongoDB you have collections, which help with modelling the data (i.e., typical usage is one document type per collection). What is the best way to handle this kind of requirement in CouchDB? Tagging documents with a _type field? Or is there some other method that I am not aware of?
The main benefit of Mongo's collections is that indexes are defined and calculated per collection. With Couch you have even more freedom and flexibility in this respect. Each index is defined by a view in the map/reduce style, and you limit the data used to calculate the index by filtering it in the map function. Because of this flexibility, it is up to you how to decide which document belongs to which view.
If you really like the fixed Mongo-like style of dividing documents into a set of distinct partitions with separate indexes, just create a field named collection and never mix two different collections in a single view. In my opinion, giving up one of the few benefits of Couch over Mongo (where Mongo is in general the more powerful and flexible system) does not seem to be a good idea.
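Filtering in the map function, as described in this answer, might look like the following minimal sketch. The design document holds the real JavaScript map function as a string, and the Python generator below it is only a stand-in to show which rows such a view would produce. The field names (type, created_at) and document ids are hypothetical; a plain type field is used because CouchDB reserves top-level field names that start with an underscore.

```python
# Design document as it could be stored in CouchDB; the map function
# is JavaScript, kept here as a string.
design_doc = {
    "_id": "_design/articles",
    "views": {
        "by_date": {
            "map": (
                "function (doc) {"
                " if (doc.type === 'article') { emit(doc.created_at, null); }"
                " }"
            )
        }
    },
}


def map_articles(doc):
    """Python stand-in for the JavaScript map function above."""
    if doc.get("type") == "article":
        yield doc["created_at"], None


# Heterogeneous documents living in the same database.
docs = [
    {"_id": "a1", "type": "article", "created_at": "2014-02-01"},
    {"_id": "u1", "type": "user", "name": "bob"},
    {"_id": "a2", "type": "article", "created_at": "2014-01-15"},
]

# A view returns its rows sorted by key; only articles are emitted.
rows = sorted((key, doc["_id"]) for doc in docs for key, _ in map_articles(doc))
print(rows)
```

The user document never reaches the index, so each view effectively behaves like an index over one "collection" without the database enforcing any partitioning.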

Data relationships as a context for search in Marklogic

I am using MarkLogic's search functionality to create a search page. As of right now, I'm running an XQuery to get search results through search:search. As a bare-bones example, see this code:
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
search:search('test',
<options xmlns='http://marklogic.com/appservices/search'></options>)
This search searches all content in the database, which is fine in many cases. In other cases, I search based on collections with cts:collection-query. The collections serve as great contexts for my searches.
Now, I would like to limit my search results based on a relationship of data in a "main" document. This "main" document has all the relationships in an object model. If that object model has a reference to a document, I want that document included in the search. Essentially, the "main"/model document is the context of the search.
I was trying to brainstorm some ideas for the best way to do this. Here's what I've come up with thus far, but I was hoping someone more familiar with MarkLogic (I've only been working with it for 6 months) could lead me in a good direction:
1. Add all documents referenced in the model document to a unique collection. Then query search based on that collection. However, the collections would have to be updated as the model changed.
2. Load the model document into my code and get a list of all the references and add them to a query by cts:document-query (or the like).
3. Restructure my concept of a "model" somehow in my XML documents.
Thanks for any input or suggestions.
I would start with (2) and see if the performance is good enough. That will depend on your use-case, but I expect it should be fine for thousands or even hundreds of thousands of references.
Be sure to use a single-term cts:document-query($list-of-references). That will be faster than cts:or-query(for $ref in $list-of-references return cts:document-query($ref)), because the index lookup can be a single pass instead of N separate lookups.
All of these ideas would work fine. Deciding which to use depends on the particulars of your application, such as how often the main document changes (and whether you are in control of it) and how hard it is to remodel your XML.
Another thing to consider is you can set a trigger on document updates which could perform the collection changes automatically.
-David Lee

Which Neo4j index method is better

According to the Neo4j documentation, indexing can be done in 2 ways:
Indexing in Neo4j can be done in two different ways:
1. The database itself is a natural index consisting of its relationships of different types between nodes. For example a tree structure can be layered on top of the data and used for index lookups performed by a traverser.
2. Separate index engines can be used, with Apache Lucene being the default backend included with Neo4j.
But there is no comparison of which approach is better in which cases.
Which one is better and why?
Is this a data warehouse/mart or reporting database? If you have both transactions and search going against the database it might give interesting pros or cons.
Lucene exists for one reason, search, and it does it really well. If you have a large system with multiple services, for ultimate scalability it is always best to split the services up and keep each one to its single responsibility. This would give you the flexibility of using that Lucene index from other services if necessary. Also, if you ever got rid of Neo4j, you would still have your index/search artifacts around, not coupled to Neo4j.
I would look at it from the overall system architecture, not just the specific functionality.

Resources