Is Solr Distributed search same as Federated Search?


I am trying to understand the difference between Solr distributed search and the concept of federated search. Can I use Solr distributed search to implement federated search? The requirement is that two or more domain models exist, and each domain system indexes its own data into a Lucene-based index. My use case is a federated search: a single query that cuts across the different domain systems, each of which has its own index.

No, distributed search is not the same as federated.
"Federated search" is the term more typically used when searching
across heterogeneous data sources - think about things like
meta-search engines, as a common example of this.
Distributed search is when you have a homogeneous data source, but it
needs to be distributed in order to scale properly.
(taken from here - http://wiki.apache.org/solr/FederatedSearch)
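To make the contrast concrete, here is a minimal sketch of what a classic Solr distributed request looks like: one query is fanned out to shards that share the same schema by listing them in the shards parameter. The host names and the core name core1 are made-up placeholders, and Python is used only for illustration.

```python
from urllib.parse import urlencode


def distributed_select_url(base_url, query, shards):
    """Build a Solr select URL that fans one query out to several shards.

    `shards` is a list of "host:port/solr/core" strings (placeholders here).
    """
    params = {"q": query, "shards": ",".join(shards)}
    # Keep ':', '/' and ',' readable instead of percent-encoding them.
    return f"{base_url}/select?{urlencode(params, safe=':/,')}"


if __name__ == "__main__":
    print(distributed_select_url(
        "http://host1:8983/solr/core1",
        "title:lucene",
        ["host1:8983/solr/core1", "host2:8983/solr/core1"],
    ))
```

Every shard in that list is expected to hold the same kind of document, which is exactly the homogeneous case described in the quote.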
As for the second question, whether federated search can be implemented using Solr: I'm pretty sure it's possible; the only question is how much effort it will require from you.
One possible solution is to create separate collections in Solr, query each of them, and merge all the results at query time, but this is just a rough idea.
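A rough sketch of the "merge at query time" step, assuming each collection has already been queried separately and returned (id, score) pairs. Scores from different indexes are not directly comparable, so this sketch min-max normalizes each source before merging; that normalization is my own assumption, not something Solr does for you, and the source names are invented.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, raw score from one source)


def normalize(hits: List[Hit]) -> List[Hit]:
    """Min-max scale one source's scores into [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in hits]


def federated_merge(results_by_source: Dict[str, List[Hit]], limit: int = 10):
    """Merge per-source hit lists into one ranking of (source, id, score)."""
    merged = []
    for source, hits in results_by_source.items():
        for doc_id, score in normalize(hits):
            merged.append((source, doc_id, score))
    # Stable sort: ties keep the order in which the sources were given.
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:limit]


if __name__ == "__main__":
    sample = {
        "orders": [("o1", 12.0), ("o2", 3.0)],
        "customers": [("c1", 4.0), ("c2", 2.0), ("c3", 3.0)],
    }
    for hit in federated_merge(sample, limit=3):
        print(hit)
```

How to make scores from heterogeneous sources comparable is the hard part of federated search; min-max scaling is only the simplest possible choice.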

Related

Benefits of using a hosted search service over building your own

I'm building a B2B Node app which has heavily related data models. We currently have our own search queries, but as we scale some of the queries appear to be becoming sluggish.
We will need to support multilingual search as well as content-based searches (searching matching content within related data).
The queries are growing more and more complicated (each has multiple joins on joins on joins) and I'm now considering a hosted search tool such as Algolia.
Given my concerns below, why should I use a hosted cloud search service rather than continue building my own queries?
Data privacy is important
Data is hosted in our own postgres DB - integrations with that are important (e.g.: will I now need to manually maintain our DB data and data in Algolia?)
Speed will be important, but not so much now
Must be able to do content-based searches across multiple languages
We are a tiny team of devs now, so dev resource time is vital
What other things should I be concerned about that can help make a decision in search capabilities?
Regarding maintenance of both DB and Cloud data, it seems it's as simple as getting all data, caching it, and storing it in the cloud:
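// Assumes the algoliasearch v3 Node client; the credentials are placeholders.
var algoliasearch = require('algoliasearch');
var client = algoliasearch('YourApplicationID', 'YourAdminAPIKey');
var index = client.initIndex('contacts');
var contactsJSON = require('./contacts.json');
// Upload every record in the JSON file to the 'contacts' index.
index.addObjects(contactsJSON, function(err, content) {
  if (err) {
    console.error(err);
  }
});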

Search services like Algolia or self-hosted Elasticsearch/Solr operate as full-text search engines, not relational DB query engines.
But it sounds like the bottleneck is the continual rejoining. If you can make your relational data act like a full-text document DB, that could be a more efficient type of index (pre-joined, in a sense).
You might also look into views, or a data warehouse (maybe a star schema).
But if you are going the search route, maybe investigate hosting your own Elasticsearch.
You could specify database, schema, SQL, index, and query details if you want more help.
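To illustrate what "pre-joined" could mean, here is a small sketch that flattens three related tables into one search document per parent row, so the joins happen once at indexing time instead of on every query. The table and field names (companies, contacts, notes) are invented for the example.

```python
from collections import defaultdict


def build_search_documents(companies, contacts, notes):
    """Flatten three related tables into one document per company."""
    contacts_by_company = defaultdict(list)
    for contact in contacts:
        contacts_by_company[contact["company_id"]].append(contact["name"])

    notes_by_company = defaultdict(list)
    for note in notes:
        notes_by_company[note["company_id"]].append(note["body"])

    return [
        {
            "id": company["id"],
            "name": company["name"],
            # Child rows become multi-valued fields on the parent document.
            "contact_names": contacts_by_company[company["id"]],
            "note_bodies": notes_by_company[company["id"]],
        }
        for company in companies
    ]


if __name__ == "__main__":
    docs = build_search_documents(
        [{"id": 1, "name": "Acme"}],
        [{"company_id": 1, "name": "Ada"}],
        [{"company_id": 1, "body": "Renewal due in March"}],
    )
    print(docs)
```

The trade-off is that a change to a child row means re-indexing the parent document.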

Full Disclosure: I founded a company called SearchStax on the premise that companies and developers should not spend time setting up, managing, scaling or building tools for the search infrastructure (ops) - they are better off investing time of their employees into building value for the company, whether that be features, capabilities, product or customers.
Open source search solutions built on top of Lucene (Apache Solr / Elasticsearch) have what you need now, and what you might need in the near future, from a search engine capability perspective. Find a mature service provider / as-a-service company that specializes in open source search and let them deal with all of it. It may look like a small effort right now, but it's probably not worth your devs' time and effort to spend on operations.
For your concerns mentioned above:
Data privacy is important
Your concerns around privacy and security are addressable. There are multiple ways you can secure your Solr environment, and the right MSP (managed solution provider) should be able to address them.
a. Security at the transport layer can be addressed by SSL certificates. All the data going over the wire is encrypted.
b. IP filtering and user-based authentication should address who has access to what. The Solr-as-a-Service offering by Measured Search supports both.
c. Security at rest can be addressed in multiple ways, such as OS-level or file encryption, but you can go even further and ensure that not even your service provider has access to that data by using searchable encryption technology.
Privacy concerns are addressed by Terms & Conditions; I am sure your legal department will handle that from a service provider's perspective.
Data is hosted in our own postgres DB - integrations with that are important
Solr provides the ability to import data directly from a traditional relational database (MySQL, Postgres, Oracle, etc.) through the Data Import Handler (DIH). You can either use that so Solr pulls data periodically, or write your own simple script to push data through the Solr APIs.
If you are hosted in the cloud (AWS), a tunnel can be created so only the Solr deployments have the ability to pull data from your servers and your database servers are not exposed to the world, if you choose to go the DIH route.
Speed will be important, but not so much now
Solr is built for search speed; I don't think that's where your problems are going to be. With a service offering like Measured Search's, you can spin up a cluster in any data center supported by AWS or Azure and make sure your search deployments are close to your application servers so the latency overhead is minimal.
Must be able to do content-based searches across multiple languages
Yes, Solr supports that. More than 30 languages.
We are a tiny team of devs now, so dev resource time is vital
I am biased here, but I would not have my developers spend much time on operations and let them focus on what they do best - build great product capabilities to push the limits and deliver business value.
If you are interested in comparing the ROI of doing it yourself vs. using a Solr-as-a-service offering like the one from SearchStax, check this paper out - https://www.searchstax.com/white-papers/why-measured-search-is-better-than-diy-solr-infrastructure/
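For the "write your own simple script to push data through the Solr APIs" option mentioned above, a minimal sketch could look like this. It only builds the HTTP request for Solr's JSON update endpoint and does not send it; the base URL, the collection name contacts and the sample row are placeholders, and fetching the rows from Postgres is left to your own code.

```python
import json
from urllib.request import Request


def build_solr_update_request(solr_base_url, collection, rows):
    """Build (but do not send) a JSON update request for Solr.

    `rows` is a list of dicts, e.g. rows fetched from Postgres by your own code.
    """
    url = f"{solr_base_url}/solr/{collection}/update?commit=true"
    body = json.dumps(rows).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


if __name__ == "__main__":
    request = build_solr_update_request(
        "http://localhost:8983", "contacts",
        [{"id": "1", "name": "Ada Lovelace"}],
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

Committing on every request is fine for a small nightly sync but is usually too expensive for frequent updates.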

Using Lucene to work with graph data

Is there a way to use Lucene to work with graph data?
Example
One user has a relationship with many lucene documents (Document Connections)
One User has a relationship with other Users (User Connections [Graph])
If a user searches the Index, he gets back the documents that he has a relationship with. This is simple and straightforward.
What would be a way to get back the documents that the User Connections have a relationship with?
Indexing each document with all the users that have a relationship with it in a user_id field is one approach. However, when you query the index and supply the User Connections of the user performing the search, the query size is unpredictable. Think of Users that have thousands of User Connections. This will not scale.
It's almost as if the User Connections and User Documents stored in a graph DB can easily provide us with the documents to search against, but what is an effective way to communicate that to Lucene so that it only searches against those documents for the given query? If any results are returned, this guarantees that at least one of the User Connections has a relationship with the documents returned in the results.

I don't believe there is currently any graph technology that sits on top of solr or lucene.
You would probably be best looking at either one of these two camps:
Neo4j with SpringData (free for single instance)
OR
Tinkerpop Blueprints (possibly Rexster if not using Java/Scala)
on one of these technologies:
Titan on Cassandra with Hadoop (multi master, no point of failure)
OrientDb
Neo4j
These databases are graph databases.
Tinkerpop Blueprints is a standard that allows you to abstract the specific implementation.
Spring Data currently only supports Neo4j for graph technologies.
Neo4j costs money if you cluster (free license is single instance only).
You can read discussion on solr/lucene with graphing here.
http://lucene.472066.n3.nabble.com/indexing-directed-graph-td2949556.html
Note neo4j supports full text search.

Graph traversal queries are supported since Solr 6.0 (Solr itself is still a search server, not a graph database); if you don't have Solr installed, it's probably still better to use a graph database instead, but now at least you have a choice. I found this; documentation is still sparse:
https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/
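As a hedged illustration of what such a traversal looks like, here is a small helper that builds a filter string for Solr's graph query parser. The field names id and connected_to are assumptions about how user nodes might be indexed, not something from the question, so check the direction of from/to against the Solr reference guide for your own schema.

```python
def graph_filter(root_query, from_field, to_field, max_depth=-1, return_root=True):
    """Build a Solr {!graph} filter string starting from `root_query`.

    Documents join the result when their `from_field` value matches a
    `to_field` value of a document that is already part of the result.
    """
    local_params = f"from={from_field} to={to_field}"
    if max_depth >= 0:
        local_params += f" maxDepth={max_depth}"
    if not return_root:
        local_params += " returnRoot=false"
    return "{!graph " + local_params + "}" + root_query


if __name__ == "__main__":
    # Direct connections of a hypothetical user node, excluding the user itself.
    print(graph_filter("id:user42", "id", "connected_to",
                       max_depth=1, return_root=False))
```

The resulting string would go into an fq parameter next to the normal text query.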

Apache Jena may be relevant here since it has some graph capabilities (SPARQL, RDF) and makes use of Lucene.
See Apache Jena Fuseki and Jena Text.

Solr vs. ElasticSearch [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
What are the core architectural differences between these technologies?
Also, what use cases are generally more appropriate for each?

Update
Now that the question scope has been corrected, I might add something in this regard as well:
There are many comparisons between Apache Solr and ElasticSearch available, so I'll reference those I found most useful myself, i.e. covering the most important aspects:
Bob Yoplait already linked kimchy's answer to ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?, which summarizes the reasons why he went ahead and created ElasticSearch, which in his opinion provides a much superior distributed model and ease of use in comparison to Solr.
Ryan Sonnek's Realtime Search: Solr vs Elasticsearch provides an insightful analysis/comparison and explains why he switched from Solr to ElasticSearch, despite already being a happy Solr user. He summarizes this as follows:
Solr may be the weapon of choice when building standard search
applications, but Elasticsearch takes it to the next level with an
architecture for creating modern realtime search applications.
Percolation is an exciting and innovative feature that singlehandedly
blows Solr right out of the water. Elasticsearch is scalable, speedy
and a dream to integrate with. Adios Solr, it was nice knowing you. [emphasis mine]
The Wikipedia article on ElasticSearch quotes a comparison from the reputed German iX magazine, listing advantages and disadvantages, which pretty much summarize what has been said above already:
Advantages:
ElasticSearch is distributed. No separate project required. Replicas are near real-time too, which is called "Push replication".
ElasticSearch fully supports the near real-time search of Apache
Lucene.
Handling multitenancy is not a special configuration, where
with Solr a more advanced setup is necessary.
ElasticSearch introduces
the concept of the Gateway, which makes full backups easier.
Disadvantages:
Only one main developer [not applicable anymore according to the current elasticsearch GitHub organization, besides having a pretty active committer base in the first place]
No autowarming feature [not applicable anymore according to the new Index Warmup API]
Initial Answer
They are completely different technologies addressing completely different use cases, thus cannot be compared at all in any meaningful way:
Apache Solr - Apache Solr offers Lucene's capabilities in an easy to use, fast search server with additional features like faceting, scalability and much more
Amazon ElastiCache - Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud.
Please note that Amazon ElastiCache is protocol-compliant with Memcached, a widely adopted memory object caching system, so code, applications, and popular tools that you use today with existing Memcached environments will work seamlessly with the service (see Memcached for details).
[emphasis mine]
Maybe this has been confused with the following two related technologies one way or another:
ElasticSearch - It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Apache Lucene.
Amazon CloudSearch - Amazon CloudSearch is a fully-managed search service in the cloud that allows customers to easily integrate fast and highly scalable search functionality into their applications.
The Solr and ElasticSearch offerings sound strikingly similar at first sight, and both use the same backend search engine, namely Apache Lucene.
While Solr is older, quite versatile and mature and widely used accordingly, ElasticSearch has been developed specifically to address Solr shortcomings with scalability requirements in modern cloud environments, which are hard(er) to address with Solr.
As such it would probably be most useful to compare ElasticSearch with the recently introduced Amazon CloudSearch (see the introductory post Start Searching in One Hour for Less Than $100 / Month), because both claim to cover the same use cases in principle.

I see some of the above answers are now a bit out of date. From my perspective, and I work with both Solr(Cloud and non-Cloud) and ElasticSearch on a daily basis, here are some interesting differences:
Community: Solr has a bigger, more mature user, dev, and contributor community. ES has a smaller, but active community of users and a growing community of contributors
Maturity: Solr is more mature, but ES has grown rapidly and I consider it stable
Performance: hard to judge. I/we have not done direct performance benchmarks. A person at LinkedIn did compare Solr vs. ES vs. Sensei once, but the initial results should be ignored because they used non-expert setup for both Solr and ES.
Design: People love Solr. The Java API is somewhat verbose, but people like how it's put together. Solr code is unfortunately not always very pretty. Also, ES has sharding, real-time replication, document and routing built-in. While some of this exists in Solr, too, it feels a bit like an after-thought.
Support: there are companies providing tech and consulting support for both Solr and ElasticSearch. I think the only company that provides support for both is Sematext (disclosure: I'm Sematext founder)
Scalability: both can be scaled to very large clusters. ES is easier to scale than pre-Solr 4.0 version of Solr, but with Solr 4.0 that's no longer the case.
For more thorough coverage of Solr vs. ElasticSearch topic have a look at https://sematext.com/blog/solr-vs-elasticsearch-part-1-overview/ . This is the first post in the series of posts from Sematext doing direct and neutral Solr vs. ElasticSearch comparison. Disclosure: I work at Sematext.

I see that a lot of folks here have answered this ElasticSearch vs Solr question in terms of features and functionality but I don't see much discussion here (or elsewhere) regarding how they compare in terms of performance.
That is why I decided to conduct my own investigation. I took an already coded heterogeneous data source micro-service that already used Solr for term search. I switched out Solr for ElasticSearch, then ran both versions on AWS with an already coded load test application and captured the performance metrics for subsequent analysis.
Here is what I found. ElasticSearch had 13% higher throughput when it came to indexing documents but Solr was ten times faster. When it came to querying for documents, Solr had five times more throughput and was five times faster than ElasticSearch.

Given the long history of Apache Solr, I think one strength of Solr is its ecosystem. There are many Solr plugins for different types of data and purposes.
A search platform consists of the following layers, from bottom to top:
Data - Purpose: Represent various data types and sources
Document building - Purpose: Build document information for indexing
Indexing and searching - Purpose: Build and query a document index
Logic enhancement - Purpose: Additional logic for processing search queries and results
Search platform service - Purpose: Add functionality on top of the search engine core to provide a service platform
UI application - Purpose: End-user search interface or applications
Reference article : Enterprise search

I have been working with both Solr and Elasticsearch for .NET applications.
The major differences I have faced are:
Elasticsearch:
More code and less configuration; there are APIs to change settings, but that is still a code change
Better for complex types, i.e. types within types / nested types (I wasn't able to achieve this in Solr)
Solr:
Less code and more configuration, and hence less maintenance
Better for grouping results during querying (lots of work to achieve in Elasticsearch; in short, no straightforward way)
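To show the two features being compared, here is a sketch that builds an Elasticsearch nested mapping and nested query as plain dictionaries, and the request parameters for Solr result grouping. The index fields (comments, author, brand) are invented, and the mapping uses the typeless layout of newer Elasticsearch versions, which is an assumption about your version.

```python
from urllib.parse import urlencode


def es_nested_mapping(path, properties):
    """Index mapping that declares `path` as a nested object field."""
    return {"mappings": {"properties": {
        path: {"type": "nested", "properties": properties},
    }}}


def es_nested_query(path, field, value):
    """Search body that matches `value` inside one nested object."""
    return {"query": {"nested": {
        "path": path,
        "query": {"match": {f"{path}.{field}": value}},
    }}}


def solr_group_params(query, field):
    """Query string that asks Solr to group results by `field`."""
    return urlencode({"q": query, "group": "true", "group.field": field})


if __name__ == "__main__":
    print(es_nested_mapping("comments", {"author": {"type": "keyword"}}))
    print(es_nested_query("comments", "author", "alice"))
    print(solr_group_params("laptop", "brand"))
```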

I have created a table of major differences between elasticsearch and Solr and splunk, you can use it as 2016 update:

While all of the above links have merit, and have benefited me greatly in the past, as a linguist "exposed" to various Lucene search engines for the last 15 years, I have to say that elastic-search development is very fast in Python. That being said, some of the code felt non-intuitive to me. So, I reached out to one component of the ELK stack, Kibana, from an open source perspective, and found that I could generate the somewhat cryptic code of elasticsearch very easily in Kibana. Also, I could pull Chrome Sense es queries into Kibana as well. If you use Kibana to evaluate es, it will further speed up your evaluation. What took hours to run on other platforms was up and running in JSON in Sense on top of elasticsearch (RESTful interface) in a few minutes at worst (largest data sets); in seconds at best. The documentation for elasticsearch, while 700+ pages, didn't answer questions I had that normally would be resolved in SOLR or other Lucene documentation, which obviously took more time to analyze. Also, you may want to take a look at Aggregates in elastic-search, which have taken Faceting to a new level.
Bigger picture: if you're doing data science, text analytics, or computational linguistics, elasticsearch has some ranking algorithms that seem to innovate well in the information retrieval area. If you're using any TF/IDF algorithms (Term Frequency/Inverse Document Frequency), elasticsearch extends this decades-old algorithm to a new level, even using BM25 (Best Match 25) and other relevancy ranking algorithms. So, if you are scoring or ranking words, phrases or sentences, elasticsearch does this scoring on the fly, without the large overhead of other data analytics approaches that take hours, which is another elasticsearch time saving.
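For readers who have not seen the two scoring schemes side by side, here is a small worked sketch for a single term. It uses a textbook TF-IDF and the commonly cited BM25 formula with k1=1.2 and b=0.75; real engines use their own variants (for example, dampening the raw term frequency), so treat the numbers as illustrative only.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency in the smoothed form often used with BM25."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_idf(tf, num_docs, doc_freq):
    """Textbook TF-IDF: grows linearly with the term frequency."""
    return tf * idf(num_docs, doc_freq)


def bm25(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 for one term: term frequency saturates, long documents are penalized."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    for tf in (1, 5, 10):
        print(tf,
              round(tf_idf(tf, 1000, 10), 3),
              round(bm25(tf, 100, 100, 1000, 10), 3))
```

The practical difference is that repeating a term ten times does not make a document ten times more relevant under BM25.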
With es, combining some of the strengths of bucketing from aggregations with the real-time JSON data relevancy scoring and ranking, you could find a winning combination, depending on either your agile (stories) or architectural(use cases) approach.
Note: did see a similar discussion on aggregations above, but not on aggregations and relevancy scoring--my apology for any overlap.
Disclosure: I don't work for Elastic and won't be able to benefit in the near future from their excellent work due to a different architectural path, unless I do some charity work with elasticsearch, which wouldn't be a bad idea.

If you are already using Solr, stick with it. If you are starting out, go for Elasticsearch.
Most major issues have been fixed in Solr, and it is quite mature.

Imagine the use case:
A lot (100+) of small (10 MB-100 MB, 1,000-100,000 documents) search indexes.
They are used by a lot of applications (microservices).
Each application can use more than one index.
The indexes are small, yes, but the load is huge (hundreds of search requests per second) and the requests are complex (multiple aggregations, conditions and so on).
Downtime is not allowed.
All of this has been running for years and is constantly growing.
Having an individual ES instance per index would be a huge overhead in this case.
Based on my experience, this kind of use case is very complex to support with Elasticsearch.
Why?
FIRST.
The major problem is a fundamental disregard for backward compatibility.
Breaking changes are so cool!
(Note: imagine a SQL server that requires you to make a small change to all your SQL statements when you upgrade... I can't imagine it. But for ES it's normal.)
Deprecations that will be dropped in the next major release are so sexy!
(Note: you know, Java contains some deprecations that are 20+ years old but still work in the current Java version...)
And not only that, sometimes you even run into something that is documented nowhere (I personally came across this only once, but...)
So if you want to upgrade ES (because you need new features for some app or you want to get bug fixes), you are in hell. Especially if it is a major version upgrade.
The client API will not be backward compatible. Index settings will not be backward compatible.
And upgrading all apps/services at the same moment as the ES upgrade is not realistic.
But you must do it from time to time. There is no other way.
Are existing indexes upgraded automatically? Yes. But that does not help you when you need to change some old index settings.
To live with that, you need to constantly invest a lot of effort in... the forward compatibility of your apps/services with future releases of ES.
Or you need to build (and, in any case, constantly support) some kind of middleware between your apps/services and ES that provides you with a backward-compatible client API.
(And you can't use the Transport Client, because it requires a jar upgrade for every minor ES version upgrade, and this fact does not make your life easier.)
Does it look simple and cheap? No, it does not. Far from it.
Continuous maintenance of a complex infrastructure based on ES is way too expensive in all possible senses.
SECOND.
Simple API? Well... not really.
When you are really using complex conditions and aggregations... a JSON request with 5 nested levels is anything but simple.
Unfortunately, I have no experience with Solr, so I can't say anything about it.
But Sphinxsearch is much better in this scenario, because of its fully backward-compatible SphinxQL.
Note:
Sphinxsearch/Manticore are indeed interesting. They are not Lucene-based and, as a result, seriously different. They contain several unique features out of the box that ES does not have, and they are crazy fast with small and medium-sized indexes.

I have used Elasticsearch for 3 years and Solr for about a month. I feel an Elasticsearch cluster is quite easy to install compared to a Solr installation. Elasticsearch has a pool of help documents with great explanations. One use case I got stuck on was Histogram Aggregation, which was available in ES but which I could not find in Solr.
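For context, this is roughly what a histogram aggregation request looks like, together with a local function that mimics the bucketing rule (each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval). The field name price and the aggregation name are placeholders.

```python
import math
from collections import Counter


def histogram_request(field, interval, name="value_histogram"):
    """Search body asking Elasticsearch for fixed-width buckets over `field`."""
    return {
        "size": 0,  # only the buckets are wanted, not the matching documents
        "aggs": {name: {"histogram": {"field": field, "interval": interval}}},
    }


def local_histogram(values, interval):
    """Mimic the bucketing locally: key = floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(histogram_request("price", 50))
    print(local_histogram([5, 12, 49, 50, 120], 50))
```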

Adding a nested document in Solr is very complex, and searching nested data is also very complex. In Elasticsearch, it is easy to add nested documents and search them.

I only use Elasticsearch, since I found Solr very hard to get started with.
Elasticsearch's features:
Easy to start, with very few settings. Even a newbie can set up a cluster step by step.
Simple RESTful API that uses a NoSQL-style query. And many language libraries for easy access.
Good documentation; you can read the book: . There is a web version on the official website.

What's the best deployment for "like" search in MVC/Azure

I use MVC3 on Azure, and I would like to have a "like" kind of search,
e.g. http://msdn.microsoft.com/en-us/library/ms179859.aspx
First question: Does Lucene support "like" search? I tried asking Google, but it's very difficult to search for the word "like" without getting results like: I like to use Lucene :)
Second: What kind of performance can I get from SQL Azure for a "like" search, with only id (int) as key and text (string(100)) for the "like" search, and around 10 million rows? I tried, but it doesn't seem to work; it always times out. Or you can answer the question as: I know there's a way to improve "like" search in SQL Azure.
Third question: Is there any other product that works well with the Azure platform and can support "like" search with reasonable performance (less than 2 seconds for the above sample database)?
Thanks.

SQL Azure doesn't support full text indexing so 'LIKE' is limited to the ANSI SQL operator. This is wholly inadequate for general searching. In general, on the cloud (Azure) you want to avoid using SQL for searching anyway; it is the wrong place for it from a scalability point of view.
As you suggest, a lucene-based search engine is the way to go, but I would recommend using Solr (the Apache/Java lucene server). Solr can still be hosted in Azure and you will find a lot more community support, documentation and help for it.

Lucene does support LIKE search and there is a library specific for Lucene.NET that leverages Azure Storage for the Lucene index. This allows you to provide a fault tolerant Lucene index that will scale well in the cloud.
http://code.msdn.microsoft.com/windowsazure/Azure-Library-for-83562538
Solr is a good option, but you will have to manage the storage of the index yourself unless you extend Solr to run on Azure storage yourself.
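The closest Lucene equivalent of LIKE is a wildcard query, where * matches any run of characters and ? matches exactly one. Here is a small sketch that translates a simple LIKE pattern into that syntax. It is my own helper, not part of Lucene or the Azure library, and it ignores the [ ] character classes and the ESCAPE clause of T-SQL. Also note that the classic Lucene query parser rejects leading wildcards unless that is explicitly enabled, and that a pattern like %text% is slow on a large index.

```python
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def like_to_lucene_wildcard(pattern):
    """Translate a simple SQL LIKE pattern into Lucene wildcard syntax."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append("*")  # any run of characters
        elif ch == "_":
            out.append("?")  # exactly one character
        elif ch in LUCENE_SPECIAL or ch.isspace():
            out.append("\\" + ch)  # escape query-syntax characters
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(like_to_lucene_wildcard("%comp_ter%"))
    print(like_to_lucene_wildcard("C++ guide%"))
```

Wildcard terms are usually not run through the analyzer, so you may need to lowercase the pattern yourself if the indexed field is lowercased.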

You may want to look into implementing Solr on Azure. There's a good write-up with demos and tutorials here:
http://wiki.apache.org/solr/SolrOnWindowsAzure

Restricting resource access in CouchDB to exactly 2 users

Currently I'm in the process of evaluating CouchDB for a new project.
Key constraint for this project is strong privacy. There need to be resources that are readable by exactly two users.
One usecase may be something similar to Direct Messages (DMs) on Twitter. Another usecase would be User / SuperUser access level.
I currently don't have any ideas about how to solve these kinds of problems with CouchDB other than creating one database that is accessible only by these 2 users. I wonder how I would then build views aggregating data from several databases?
Do you have any hints / suggestions for me?

I've asked this question several times on couchdb mailing lists, and never got an answer.
There are a number of things that couchdb is missing.
One of them is document-level security, which would:
allow only certain users to view a doc
filter the documents indexed in a view on a user level permission base
I don't think that there is a solution to the permission considerations with the current couchdb implementation.
One solution would be to use an external indexing tool like lucene, and tag your documents with user rights, then issue a lucene query with user right definition in order to get the docs. It also implies extra load on your server(s) (lucene requires a JVM) and an extra delay for the data to be available (lucene indexing time ... )
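A minimal sketch of the "issue a lucene query with user right definition" idea: each indexed document carries a field listing the users allowed to read it, and every search is wrapped so that both the user's query and the permission clause are required. The field name allowed_users is an assumption, and the wrapping has to happen on the server, never in the client.

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def restrict_to_user(user_query, username, acl_field="allowed_users"):
    """Require both the user's query and a match on the permission field."""
    return f"+({user_query}) +{acl_field}:{quote_term(username)}"


if __name__ == "__main__":
    print(restrict_to_user("title:invoice", "alice"))
```

For the Direct Message case, a message between two users would simply list both of them in that field.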
As for the several-databases solution, there are language framework implementations that simply don't allow the use of more than one database (for instance couch_potato for Ruby).
Having several databases also means that you'll have several replication processes if your databases are replicated.
Also, this means that the views will be updated for each of the databases. In some cases this is better than having huge views indexed in a single database, but it also means that distinct users might not be up to date for a single source of information (i.e. some will have their views updated, others won't). So you cannot guarantee that the data is consistent for all users.
So unless something is implemented in the couch core in order to manage document level authorizations, CouchDB does not seem appropriate for managing data with privacy constraints.

There are a bunch of details missing about what you are trying to accomplish and what the data looks like, so it's hard to make a specific recommendation. You may be able to create a database per user and copy items into each user's database (for the DM use case you described). Each user would only be able to access their own database, and then you could have an admin user that could access all databases. If you need to update those records later, copying them to multiple databases might not be a good idea, and then you might consider whether you want to control permissions at a different level from storage.
For views that aggregate data from several databases, I recommend looking at lounge and bigcouch, which take different approaches.
http://tilgovi.github.com/couchdb-lounge/
http://support.cloudant.com/faqs/views/chained-mapreduce-views
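If you go with the question's idea of one database that is readable by exactly two users, a sketch of the bookkeeping could look like this: a stable database name derived from the pair, and a security object naming only those two users as members. The dm_ prefix and the hashing scheme are my own assumptions; the returned dictionary is meant to be stored as the database's _security object.

```python
import hashlib


def pair_database_name(user_a, user_b):
    """Derive one stable database name for an unordered pair of users."""
    first, second = sorted([user_a, user_b])
    digest = hashlib.sha1(f"{first}\0{second}".encode("utf-8")).hexdigest()
    # The name starts with a lowercase letter; the hex digest follows it.
    return f"dm_{digest}"


def pair_security_document(user_a, user_b):
    """Security object that lists exactly the two users as database members."""
    return {
        "admins": {"names": [], "roles": []},
        "members": {"names": sorted([user_a, user_b]), "roles": []},
    }


if __name__ == "__main__":
    print(pair_database_name("alice", "bob"))
    print(pair_security_document("alice", "bob"))
```

Server admins can still read every database, which covers the User / SuperUser case, but aggregating views across many such databases remains the open problem discussed above.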
