How to use Solr for multiple data sources?

I am a newbie to Solr and am facing the challenges below.
I have two data sources: a portal and a CMS. I need to provide a Solr search solution for these two sources, so that when a user searches in a custom portlet (on the portal), they see results from both sources in the same place; in other words, Solr should fetch results from both sources. The user should also be able to open those results by clicking on them.
What should I consider when implementing this use case? Should I use multiple Solr cores or a single core? Also, how can I achieve features like faceted search, search filters, stop words, etc.?
Regards.

It should be perfectly fine to go with a single core (and it will also be faster).
To import data from multiple data sources, check out the Solr Data Import Handler configuration:
http://wiki.apache.org/solr/DataImportHandler
and set up two entities, one for each of your data sources.
You will probably need to add a field to each imported document to record which data source it came from.
Your question is a little too general to answer precisely. Go and experiment a bit with the documentation you have; it should not be very hard to get some basic search functionality working.
You can find a lot of information about configuring Solr on the LucidWorks wiki:
http://docs.lucidworks.com/display/solr/Faceting
and on the Solr wiki: http://wiki.apache.org/solr/
You may also try some books, e.g. http://www.packtpub.com/apache-solr-4-cookbook/book
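To give a feel for the faceted search and filter features you asked about, here is a minimal SolrJ sketch of a faceted query. It assumes a hypothetical schema with a full-text field called "text" and a "source" field recording which system each document came from, and it uses the SolrJ 4.x-era HttpSolrServer API (newer versions use HttpSolrClient instead):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetedSearchExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL and field names ("text", "source"), for illustration only.
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery("text:laptop");
            query.setFacet(true);
            query.addFacetField("source");          // facet on the data-source field
            query.addFilterQuery("source:portal");  // search filter: restrict results to one source
            query.setRows(10);

            QueryResponse response = solr.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());
            for (FacetField facet : response.getFacetFields()) {
                for (FacetField.Count count : facet.getValues()) {
                    System.out.println(facet.getName() + ": " + count.getName()
                            + " (" + count.getCount() + ")");
                }
            }
        }
    }

Stop words are handled on the schema side (a StopFilterFactory in the field type's analyzer chain), not in the query code.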

I figured out a way to do this. We can use SolrJ (http://wiki.apache.org/solr/Solrj) as a Java client for Solr. Alfresco content can be exported to XML, and those XML files can be pushed into Solr using SolrJ.
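For anyone following the same route, a minimal SolrJ indexing sketch looks like this. The field names ("title", "content", "source") and the idea of tagging each document with its origin are assumptions that match the advice above, not a fixed schema:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexFromTwoSources {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // A document coming from the portal
            SolrInputDocument portalDoc = new SolrInputDocument();
            portalDoc.addField("id", "portal-1");
            portalDoc.addField("title", "Portal page title");
            portalDoc.addField("content", "Body text extracted from the portal page");
            portalDoc.addField("source", "portal");   // marks which data source this came from

            // A document coming from the CMS (e.g. Alfresco content parsed from its XML export)
            SolrInputDocument cmsDoc = new SolrInputDocument();
            cmsDoc.addField("id", "cms-1");
            cmsDoc.addField("title", "CMS article title");
            cmsDoc.addField("content", "Body text extracted from the CMS export");
            cmsDoc.addField("source", "cms");

            solr.add(portalDoc);
            solr.add(cmsDoc);
            solr.commit();
        }
    }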

Related

Yii2: How should site-wide search work?

What is the best-practice methodology for implementing site-wide search in Yii2?
This question is not about how to implement search specifically, but rather about what kind of approach to use. Should we use Sphinx? Elasticsearch? Or do we use UNION selects to get the data into a DataProvider?
Assume the application is using a relational database to store data. We want to search and display multiple different models. For example, our database contains tables of Books, Authors and Stores. When we search for a keyword we want to display results from all 3 tables (matching Books by title or content, Authors by full name and Stores by name etc).
There are tutorials which show how to use Elasticsearch, but they assume that our data is stored in Elasticsearch, which does not make sense. Our data is already stored in MySQL or PostgreSQL. Does this mean we need to maintain a duplicate of our data in the Elasticsearch database?
What is the best-practice methodology for implementing site-wide search in Yii2?
That depends on many factors, so I can't give you a specific recommendation for your case. Some of the factors to think about are:
What would you like to achieve with this search? Is every little bit in your database a significant search term?
Do you need only full-text search, or a wide range of analytics?
Do you have any limits on time or cost?
Can your (tech) infrastructure handle your ideas?
Is it worth bringing another extensive technology into the project?
Can you handle the additional maintenance tasks needed to run such a search engine?
And many more ...
In my internal Yii2 project with a PostgreSQL RDBMS, I decided to use the PostgreSQL text search type called tsvector. That's good enough for my needs. Why?
You can use stemming.
It supports fuzzy search.
It supports basic ranking.
It supports multiple languages.
I highly recommend this blog post: Postgres full-text search is Good Enough.
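The query itself is plain SQL, so it works from any data-access layer (Yii2's included). Just to show the shape of a tsvector query, here is a small JDBC sketch; the "books" table, its "title" column and the connection details are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TsvectorSearchExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details and table.
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/app", "app", "secret");

            // to_tsvector/plainto_tsquery give stemming and per-language dictionaries;
            // ts_rank provides the basic ranking mentioned above.
            String sql = "SELECT id, title, ts_rank(to_tsvector('english', title), query) AS rank "
                       + "FROM books, plainto_tsquery('english', ?) AS query "
                       + "WHERE to_tsvector('english', title) @@ query "
                       + "ORDER BY rank DESC LIMIT 20";

            PreparedStatement ps = conn.prepareStatement(sql);
            ps.setString(1, "search words");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("title"));
            }
            rs.close();
            ps.close();
            conn.close();
        }
    }

In practice you would store the tsvector in its own column with a GIN index rather than computing it per query.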

Optimal Indexing strategy for Multilingual requirement using solr

We use IBM WCS v7 for an e-commerce requirement, in which Apache Solr is embedded for the search implementation.
As per a new requirement, there will be multiple language support per website; for example, the France version of the site can support English, French, etc. (en_FR, fr_FR, etc.). In order to configure Solr for this, what would be the optimal indexing strategy using a single Solr core?
I have a couple of ideas: 1) using multiple fields in schema.xml for the different languages, or 2) using different Solr cores for different languages.
But neither approach seems to fit the current requirement well, as the e-commerce website will support 18 languages. Using different fields for every language would be very complicated, and using different Solr cores is also not ideal, because any configuration change would have to be applied to every core.
Is there any other approach, or is there a way I can associate the localeId with the indexed data and process the search results with respect to the detected language?
Any help on this topic will be highly appreciated.
Thanks and Regards,
Jitendriya Dash
This post has already been answered by the original poster and others; just summarizing that as an answer:
The recommended solution is to create one index core per locale/language. This is especially important if either the catalog or the content (such as product name, description, keywords) differs per locale and the business prefers to manage it separately for each locale. It also gives Solr the added benefit of performing its stemming and tokenization specific to that locale, where applicable.
I have been part of solutions where this approach was preferred over maintaining multiple fields or documents in the same core for each locale/language. The most index cores I have worked with is 6.
One must also remember that adding an index core requires updates to the supporting processes (Product Information Management updates, catalog load, workspace management, stage propagation, reindexing, cache invalidation).
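With one core per locale, the application simply picks the core from the request's locale before querying. A rough SolrJ sketch, assuming a hypothetical naming convention like product_en_US, product_fr_FR (the core names and SolrJ 4.x API are illustrative assumptions, not WCS specifics):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class LocaleAwareSearch {
        // Hypothetical convention: one core per locale, e.g. product_en_US, product_fr_FR.
        private static final String SOLR_BASE = "http://localhost:8983/solr/";

        public static QueryResponse search(String localeId, String userQuery) throws Exception {
            SolrServer core = new HttpSolrServer(SOLR_BASE + "product_" + localeId);
            SolrQuery query = new SolrQuery(userQuery);
            query.setRows(20);
            return core.query(query);
        }

        public static void main(String[] args) throws Exception {
            QueryResponse fr = search("fr_FR", "chaussures");
            System.out.println("French hits: " + fr.getResults().getNumFound());
        }
    }

Each core's schema.xml can then use the analyzer chain appropriate to its language without affecting the others.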

SphinxSearch or a spider - which one to choose?

We own SiteA and SiteB; they share the same server and database, where we have full control.
SiteC, SiteD and SiteE are some of the sites we own as well, but they reside on different web hosts.
The goal is to create unified search functionality for all of the sites mentioned above. That is, if somebody searches for a term on SiteA, the search results should automatically include results from SiteB, SiteC, SiteD and SiteE too. The results should be shown under the website they were found on.
Each of these websites stores its content in its own database.
If I use SphinxSearch to index the above sites, I would then require the sites we don't have complete control over to set up a web service where I can download a database dump or CSV file for indexing.
I'm not quite sure how a spider would come into play here, so I need your opinion.
Sphinx or a spider?
Thanks!
If you can ask the owners of the other websites to give you their content for free, then there is no need for a spider. Just use Sphinx to index the content.
If you can't get the content directly from them, a spider is the only choice for you. There is little else to think about here.
Sphinx is a full-text search engine, while a spider is for fetching content from the internet. They are not replacements for each other. Even if you use a spider, you still have to use some full-text search engine software, for example Sphinx or Lucene/Solr.
So you have to make a decision first: do I want to use Sphinx for searching? If the answer is yes, then there is only one thing left: how can I index the content for searching?
Sphinx supports using a database or XML as a data source. A database as the data source is more popular, because preparing and updating XML documents in a specific format is very tedious (compared to maintaining a database table). So I guess you will eventually have to store all of the data in databases. As you described, all of the data is already in databases, but some of those databases are out of your control. For your own databases there is no problem. For the databases out of your control, I suggest you use distributed Sphinx searching: http://sphinxsearch.com/docs/2.0.6/distributed.html
The key idea is to horizontally partition (HP) the searched data across search nodes and then process it in parallel. Partitioning is done manually. You should:
set up several instances of the Sphinx programs (indexer and searchd) on different servers;
make the instances index (and search) different parts of the data;
configure a special distributed index on some of the searchd instances;
and query this index.
The distributed index only contains references to other local and remote indexes, so it cannot be directly reindexed; instead, you should reindex the indexes which it references.
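Once the distributed index is defined, querying it from your site code is the same as querying a local one. One option is SphinxQL over the MySQL protocol, which lets you use an ordinary MySQL client or driver; the index name "dist_all_sites", the "site" string attribute and the port are assumptions for illustration, and depending on your MySQL driver version you may need extra connection options:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SphinxDistributedQuery {
        public static void main(String[] args) throws Exception {
            // Assumes searchd is configured with a mysql41 listener (commonly port 9306) and
            // a distributed index "dist_all_sites" spanning local indexes for SiteA/SiteB
            // plus remote agents for SiteC/D/E. "site" is assumed to be a string attribute.
            Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:9306");

            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT id, site FROM dist_all_sites WHERE MATCH('search term') LIMIT 20");
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " found on " + rs.getString("site"));
            }
            rs.close();
            st.close();
            conn.close();
        }
    }

Grouping the results by the "site" attribute gives you the "shown under the website they were found on" behaviour.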

Creating a web indexer in Java?

I'm supposed to write a web crawler in Java. The crawling part is easy, but the indexing part is difficult. I need to be able to query the index and have it return matches for multi-word queries. What would be the best data structure for doing such a thing?
Use an indexing tool such as Lucene, Solr or Compass.
The solution to the indexing & search step is to use an inverted index data structure, and the best available open-source package that implements this for indexing & search is Lucene.
There are also open-source projects that provide a composite solution to the crawling, indexing & searching steps which may be of interest, e.g. Nutch.
This free online book on information retrieval may help you (see the chapter on constructing an inverted index).
If you're building this from scratch you should look at the inverted index data structure. If you can use something off the shelf, then look at the Nutch project.
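If you do decide to roll your own, a toy in-memory version of an inverted index might look like this (integer document ids and AND-intersection of posting sets are just one simple design choice):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** A toy in-memory inverted index: term -> set of document ids. */
    public class InvertedIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

        public void addDocument(int docId, String text) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                Set<Integer> docs = postings.get(token);
                if (docs == null) {
                    docs = new HashSet<Integer>();
                    postings.put(token, docs);
                }
                docs.add(docId);
            }
        }

        /** AND query: returns documents containing every word of the query. */
        public Set<Integer> search(String query) {
            Set<Integer> result = null;
            for (String token : query.toLowerCase().split("\\W+")) {
                Set<Integer> docs = postings.get(token);
                if (docs == null) return new HashSet<Integer>();   // a missing term means no match
                if (result == null) {
                    result = new HashSet<Integer>(docs);
                } else {
                    result.retainAll(docs);                        // intersect posting sets
                }
            }
            return result == null ? new HashSet<Integer>() : result;
        }

        public static void main(String[] args) {
            InvertedIndex index = new InvertedIndex();
            index.addDocument(1, "Solr is built on top of Lucene");
            index.addDocument(2, "Lucene implements an inverted index");
            System.out.println(index.search("inverted index"));    // prints [2]
        }
    }

A real engine adds term positions, scoring and on-disk posting lists, which is exactly what Lucene gives you for free.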

What's the best approach for using SOLR with web projects?

OK, I'm totally new to Solr and Lucene, but I have Solr running out of the box under Tomcat 6.x and have just gone over some of the basic wiki entries.
I have a few questions, and require some suggestions too.
Solr can index data in files (XML, CSV) and it can also index DBs. Can you also just point it at a URI/domain and have it index a website the way Google would?
If I have a website with "Pages" data ("Page Name", "Page Content", etc.) and "Products" data ("Product Name", "SKU", etc.), do I need two different schema.xml files? And if so, does that mean two different instances of Solr?
Finally, if you have a project with a large relational and normalized database, what would you say is the best approach from the 3 options below?
Have a middleware service running in the background which mines the DB and manually creates the relevant XML files to then send to Solr.
Have Solr index the DB directly. In this case, would it be best to just point Solr at views, which would abstract away all the table relationships?
Any other options I'm unaware of?
Context: We're running in a Windows 2003 environment, .NET 3.5, SQLServer 2005/2008
cheers!
No, you need a crawler for that, e.g. Nutch.
Yes, you want two separate indexes (= two schema.xml files), since the datasets don't seem to be related. This doesn't mean two instances of Solr; you can manage the two indexes with cores.
As for populating the Solr index, it depends on your particular project; for example, can it tolerate stale data or does it have to be absolutely fresh?
Other options to index data include:
Database triggers
If you're using some sort of ORM, use its interception capabilities. For example, you can use NHibernate events to update the index on update, insert or delete. If you use NHibernate and SolrNet this is taken care of automatically.
I think Mauricio's advice is dead on. The only point I would add is about deciding between a "middleware" indexer and using the database directly: if your database (or the views) maps very closely to what a good Solr schema wants, then DIH is great. But if you are indexing from multiple sources of data, or if you have to munge the data in your database into what Solr would like, then a dedicated middleware indexer is better.
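For the middleware-indexer route, the shape of such a process is roughly the following: read a denormalized view, build documents, push them in batches. This sketch uses SolrJ; in your .NET stack the equivalent would be SolrNet, and the connection string, view name and field names are illustrative assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MiddlewareIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical denormalized view that flattens the relational schema for indexing.
            Connection db = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=shop", "indexer", "secret");
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");

            Statement st = db.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT ProductId, ProductName, Sku, CategoryName FROM vw_ProductSearch");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("ProductId"));
                doc.addField("name", rs.getString("ProductName"));
                doc.addField("sku", rs.getString("Sku"));
                doc.addField("category", rs.getString("CategoryName"));
                batch.add(doc);
                if (batch.size() == 500) {        // send in batches to keep memory bounded
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();

            rs.close();
            st.close();
            db.close();
        }
    }

If the view already matches the Solr schema one-to-one, DIH pointed at that same view achieves the same result with no custom code.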
