SphinxSearch or a spider - which one to choose?

SphinxSearch or a spider - which one to choose? - search

We own SiteA and SiteB and they share the same server and database where we have full control.
SiteC , siteD and siteE are some of the sites we own as well but reside on a different web hosts.
The goal is to create a unified search functionality for all of the sites mentioned above. That is if somebody search for a term in SiteA, the search result will automatically come up with results from SiteB,SiteC,SiteD and Site E too. The search results should be shown under the website they were found in.
All these websites content are stored in their own databases.
If I use SphinxSearch to index the above sites,I would then require those sites that we dont have complete control with to setup a web service where i can download a database dump or csv file for indexing.
Im not quite sure about how a sphider will come into play here so need your opinion.
Sphinx or a spider?
THanks!

If you can ask the owner of other websites to give you content for free, then there is no need for a spider. Just use sphinxsearch to index the content.
If you can't get content directly from them, a spider is the only choice for you. There is little to think about this issue.

Sphinx is a full-text search engine solution, while a spider is for fetching contents from internet. They are not replacements to each other. Even if you use a spider, you still have to use some full-text search engine software for example sphinx or lucene/solr.
So you have to make a decision first: Do I want to use sphinx for searching? If the answer is yes, then there is only one thing left: how can I index the contents for searching?
sphinx supports using database or XML as data source. Database as data source is more popular because preparing and updating XML documents in a specific format is very tedious(compared to maintaining a database table). So I guess finally you have to store all of the data into database. As you described, all of the data are all ready in databases, but some of the databases are out of your control. For you own database, there is no problem. For the databases that out of your control, I suggest that you use distributed sphinx searching: http://sphinxsearch.com/docs/2.0.6/distributed.html
The key idea is to horizontally partition (HP) searched data accross
search nodes and then process it in parallel.
Partitioning is done manually. You should
setup several instances of Sphinx programs (indexer and searchd) on
different servers;
make the instances index (and search) different parts of data;
configure a special distributed index on some of the searchd
instances;
and query this index.
This index only contains references to other local and remote indexes
- so it could not be directly reindexed, and you should reindex those indexes which it references instead.

Related

Yii2: How should site-wide search work?

What is the best practice methododology of implementing site-wide search in Yii2?
This question is not about how to implement search specifically, but rather about what kind of approach to use. Should we use Sphinx? Elasticsearch? Or do we use UNION selects to get the data into a DataProvider?
Assume the application is using a relational database to store data. We want to search and display multiple different models. For example, our database contains tables of Books, Authors and Stores. When we search for a keyword we want to display results from all 3 tables (matching Books by title or content, Authors by full name and Stores by name etc).
There are tutorials which show how to use Elasticsearch but assume that our data is stored in the Elasticsearch database, which does not make sense. Our data is already stored in MySQL or PostgreSQL. Does this mean
we need to maintain a duplicate of our data in the Elasticsearch database?

What is the best practice methododology of implementing site-wide search in Yii2?
That depends on many factors, so I cant give you a specific recommendation for your case. Some of the factors to think about are:
What would you like to achieve with this search? Is every little bit in your database a significant search term?
Do you need only full-text-search or a wide range of analytics?
Have you any limits in time or costs?
Can your (tech-)infrastructure handle your ideas?
Is it worth to bring in another extensive technology in the project?
Can you handle additional maintenance tasks to run such a search engine?
And many more ...
In my internal Yii2 Project with a PostgreSQL RDBMS, I decided to use a PostgreSQL Text Search Type called tsvector. Thats good enough for my needs. Why?
You can use Stemming.
Supports Fuzzy search.
Supports basic ranking.
Supports multiple languages.
I highly recommend this blog post Postgres full-text search is Good Enough.

How to use Solr for multiple data sources?

I am a newbie to Solr & is facing challenges as below.
I have two data sources : a portal & a cms. I need to provide Solr search solution for these two sources so that when user searches on custom portlet(on portal), he should see results from both the sources at same place or Solr should fetch results from both sources. Also user should be able to access these results by clicking on same.
What all should i consider for implementing this use case. Should i use multiple Solr cores or single core? Also how can i achieve features like faceted search, search filter, stop words etc.?
Regards.

It should be perfectly fine to go with single core (and it will also work faster).
To import data from multiple data sources check out Solr Data Import Handler configuration:
http://wiki.apache.org/solr/DataImportHandler
and setup two entities - one for each of your data sources.
You will probably need to set some field to keep information about data source in imported document.
Your question is little bit too general to really answer. Go and experiment a little bit with documentation you have. It should not be very hard to get some basic search functionality.
You can find a lot of info about configuring Solr on LucidWorks wiki:
http://docs.lucidworks.com/display/solr/Faceting
and on Solr wiki: http://wiki.apache.org/solr/
You may also try with some books. Ex: http://www.packtpub.com/apache-solr-4-cookbook/book

I figured out a way to do the same. We can use http://wiki.apache.org/solr/Solrj as java client for Solr. Alfresco content can be put into XMLs & these XMLs can be dumped into SOlr using Solrj.

How to implement search on a webapp / website

How do you implement a search "engine" on your website / webapp?
Suppose that you have some products, news, events, and so on, all stored in database in different tables.
You have free text hardcoded inside website in static pages, or at least you have them as gettext files.
You want to be able to list pages that contains some of the query terms requested.
Personally, i create another table (fulltext with mysql) that contains url and contents of the page, and then i do the fulltext search on that table and report results.
This table is periodically filled by a script that reads the db and insert the data.
Are there better methods to implement a "simple" search?

Well "simple" is subjective. Your approach to search will not scale and certainly not suited to complex queries (tihnk boolean searches, or range queries etc.)
My recommendation would be to denormalize your data into a flat structure and write it to Apache Solr. It offers a RESTful interface for integrating into PHP or whatever platform you prefer. It offers faceting, caching, a sophisticated query language etc.

What's the best approach for using SOLR with web projects?

ok, I'm totally new to SOLR and Lucene, but have got Solr running out-of-the-box under Tomcat 6.x and have just gone over some of the basic Wiki entries.
I have a few questions, and require some suggestions too.
Solr can index data in files (XML, CSV) and it can also index DBs. Can you also just point it to a URI/domain, and have it index a website in the way google would?
If I have a website with "Pages" data, so "Page Name", "Page Content" etc, and "Products Data", so "Product Name", "SKU" etc, do I need two different Schema.xml files? and if so, does that mean two different instances of Solr?
Finally, if you have a project with a large relational and normalized database, what would you say is the best approach from the 3 options below?:
Have a middleware service running in the background, which mines the DB and manually creates the relevant XML files to then send to SOLR
Have SOLR index the DB directly. In this case, would it be best to just point SOLR to views, which would abstract all the table relationships?
Any other options I'm unaware of?
Context: We're running in a Windows 2003 environment, .NET 3.5, SQLServer 2005/2008
cheers!

No, you need a crawler for that, e.g. Nutch
Yes, you want two separate indexes ( = two schema.xml) since the datasets don't seem to be related. This doesn't mean two instances of Solr, you can manage the two indexes with Cores.
As for populating the Solr index, it depends on your particular project, for example, can it tolerate stale data or does it have to absolutely fresh.
Other options to index data include:
Database triggers
If you're using some sort of ORM use its interception capabilities. For example you can use NHibernate events to update the index on update, insert or delete. If you use NHibernate and SolrNet this is taken care of automatically

I think Mauricio is dead on for his advice. The only point I would make is that when deciding to have a "middleware" indexer, or use the database directly. If your database (or the views?) map very closely to what a good Solr schema wants, then DIH is great. But, if you are indexing from multiple sources of data, or if you have to munge about the data in your database to meet what Solr would like, then having a dedicated middleware indexer is better.

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that, we will fire search again for getting project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data we will end up updating at a SINGLE PLACE only. Otherwise if we store this meta data with all the pdf doument indexes, we will end up updating all of the documents, which is not the way I am looking for.
please advise.

If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.

Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time. For example, "documentText:foo OR projectName:bar" . If you have no such requirement, then seems like storing the ID in Lucene which refers to a database row is a fine thing to do.

I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.

This is definitely possible. But always be aware of the fact that you're using Lucene for something that it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your system your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.

You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string