How do you implement a search "engine" on your website / webapp?
Suppose that you have some products, news, events, and so on, all stored in database in different tables.
You have free text hardcoded inside website in static pages, or at least you have them as gettext files.
You want to be able to list pages that contains some of the query terms requested.
Personally, i create another table (fulltext with mysql) that contains url and contents of the page, and then i do the fulltext search on that table and report results.
This table is periodically filled by a script that reads the db and insert the data.
Are there better methods to implement a "simple" search?
Well "simple" is subjective. Your approach to search will not scale and certainly not suited to complex queries (tihnk boolean searches, or range queries etc.)
My recommendation would be to denormalize your data into a flat structure and write it to Apache Solr. It offers a RESTful interface for integrating into PHP or whatever platform you prefer. It offers faceting, caching, a sophisticated query language etc.
Related
What is the best practice methododology of implementing site-wide search in Yii2?
This question is not about how to implement search specifically, but rather about what kind of approach to use. Should we use Sphinx? Elasticsearch? Or do we use UNION selects to get the data into a DataProvider?
Assume the application is using a relational database to store data. We want to search and display multiple different models. For example, our database contains tables of Books, Authors and Stores. When we search for a keyword we want to display results from all 3 tables (matching Books by title or content, Authors by full name and Stores by name etc).
There are tutorials which show how to use Elasticsearch but assume that our data is stored in the Elasticsearch database, which does not make sense. Our data is already stored in MySQL or PostgreSQL. Does this mean
we need to maintain a duplicate of our data in the Elasticsearch database?
What is the best practice methododology of implementing site-wide search in Yii2?
That depends on many factors, so I cant give you a specific recommendation for your case. Some of the factors to think about are:
What would you like to achieve with this search? Is every little bit in your database a significant search term?
Do you need only full-text-search or a wide range of analytics?
Have you any limits in time or costs?
Can your (tech-)infrastructure handle your ideas?
Is it worth to bring in another extensive technology in the project?
Can you handle additional maintenance tasks to run such a search engine?
And many more ...
In my internal Yii2 Project with a PostgreSQL RDBMS, I decided to use a PostgreSQL Text Search Type called tsvector. Thats good enough for my needs. Why?
You can use Stemming.
Supports Fuzzy search.
Supports basic ranking.
Supports multiple languages.
I highly recommend this blog post Postgres full-text search is Good Enough.
I have a Solr instance that gets and indexes data about companies from DB. A DB data about a single company can be provided in several languages(english and russian for example).All the companies, of course, have a unikue key that is a uniqueKey in solr index too. I need to present solr search in all the languages at once.
How can it be performed?
1. Multicore? I've build two seperate cores with each language data, but i can't search in two indexes simultaneously.
localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0/,localhost:8983/solr/core1/&indent=true&q=*:*&distributed=true
or
localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0/,localhost:8983/solr/core1/&indent=true&id:123456
gives no results. while searching in each core is succesful.
Enable Name field(for example) as a multivalued is not a solution, because a different language data data from DB are get by different procedures. And the value is just rewritten.
I'm not sure about the multicore piece, but have you considered creating two fields in a single core - one for each language? You could then combine with an "OR" which is the default, so a query for:
en:"query test here" OR ru:"query test here"
would be an example
Sounds like you are possibly using the DataImportHandler to load your data. You can implement #Mike Sokolov's answer or implement the multivalued solution via the use of a Solr client. You would need to write some custom code in a client like SolrJ (or one of the other clients listed on IntegratingSolr in the Solr Wiki) to pull both languages in separate queries from your database and then parse the data from both results into a common data/result set that can be transformed into a single Solr document.
We own SiteA and SiteB and they share the same server and database where we have full control.
SiteC , siteD and siteE are some of the sites we own as well but reside on a different web hosts.
The goal is to create a unified search functionality for all of the sites mentioned above. That is if somebody search for a term in SiteA, the search result will automatically come up with results from SiteB,SiteC,SiteD and Site E too. The search results should be shown under the website they were found in.
All these websites content are stored in their own databases.
If I use SphinxSearch to index the above sites,I would then require those sites that we dont have complete control with to setup a web service where i can download a database dump or csv file for indexing.
Im not quite sure about how a sphider will come into play here so need your opinion.
Sphinx or a spider?
THanks!
If you can ask the owner of other websites to give you content for free, then there is no need for a spider. Just use sphinxsearch to index the content.
If you can't get content directly from them, a spider is the only choice for you. There is little to think about this issue.
Sphinx is a full-text search engine solution, while a spider is for fetching contents from internet. They are not replacements to each other. Even if you use a spider, you still have to use some full-text search engine software for example sphinx or lucene/solr.
So you have to make a decision first: Do I want to use sphinx for searching? If the answer is yes, then there is only one thing left: how can I index the contents for searching?
sphinx supports using database or XML as data source. Database as data source is more popular because preparing and updating XML documents in a specific format is very tedious(compared to maintaining a database table). So I guess finally you have to store all of the data into database. As you described, all of the data are all ready in databases, but some of the databases are out of your control. For you own database, there is no problem. For the databases that out of your control, I suggest that you use distributed sphinx searching: http://sphinxsearch.com/docs/2.0.6/distributed.html
The key idea is to horizontally partition (HP) searched data accross
search nodes and then process it in parallel.
Partitioning is done manually. You should
setup several instances of Sphinx programs (indexer and searchd) on
different servers;
make the instances index (and search) different parts of data;
configure a special distributed index on some of the searchd
instances;
and query this index.
This index only contains references to other local and remote indexes
- so it could not be directly reindexed, and you should reindex those indexes which it references instead.
PROBLEM:
I need to write an advanced search functionality for a website. All the data is stored in MySQL and I'm using Zend Framework on top. I know that I can write a script that takes the search page and builds an SQL query out of it, but this becomes extremely slow if there's a lot of hits. Then I would have to get down to the gritty details of optimizing the database tables/fields/etc. which I'm trying to avoid if possible.
Lucene: I gave Lucene a try, but since it's a full-text search engine, it does not allow any mathematical operators!! So if I wanted to get all the records where field_x > 5, there is no way to do it (correct?)
General Practice? I would like to know how large sites deal with this dilemma. Is there a standard way of doing this that I don't know about, or does everyone have to deal with the nasty details of optimizing the database at some point? I was hoping that some fast indexing/searching technology existed (e.g. Lucene) that would address this problem.
ANY OTHER COMMENTS OR SUGGESTION ARE MOST WELCOME!!
Thanks a lot guys!
Ali
You can use Zend Lucene for textual search, and combine it with MySQL for joins.
Please see Mark Krellenstein's Search Engine vs DBMS paper about the choice; Basically, search engines are better for ranked text search; Databases are better for more complex data manipulations, such as joins, using different record structures.
For a simple x>5 type query, you can use a range query inside Lucene.
Use Lucene for your text-based searches, and use SQL for field_x > 5 searches. I say this because text-based search is hard to get right, and you're probably better off leaving that to an expert.
If you need your users to have the capability of building mathematical expression searches, consider writing an expression builder dialog like this example to collect the search phrase. Then use a parameterized SQL query to execute the search.
SqlWhereBuilder ASP.NET Server Control
http://www.codeproject.com/KB/custom-controls/SqlWhereBuilder.aspx
You can use filters in Lucene to carry out a text search of a reduced set of records. So if you query the database first to get all records where field_x > 5, build a filter (a list of lucene document IDs) and pass this into the lucene search method along with the text query. I'm just learning about this, here's a link to a question I asked (it uses Lucene.Net and C# but it may help) - ignore my question, just check out the accepted answer:
How do you implement a custom filter with Lucene.net?
I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that, we will fire search again for getting project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data we will end up updating at a SINGLE PLACE only. Otherwise if we store this meta data with all the pdf doument indexes, we will end up updating all of the documents, which is not the way I am looking for.
please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time. For example, "documentText:foo OR projectName:bar" . If you have no such requirement, then seems like storing the ID in Lucene which refers to a database row is a fine thing to do.
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware of the fact that you're using Lucene for something that it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your system your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.