Multilingual free-text search in an app with normalized data?

We have enums, free text, and referenced fields, etc., in our DB.
Each enum has its own translation, and the free text could be in any language. We'd like to do efficient, large-scale free-text searching and enum-value-based searching.
I know of solutions like Solr, which are nice, but that would mean we'd have to index entire denormalized records with all the text of all the languages in the system. This seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?

ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
Take a good look at how to configure and use SOLR and learn about SOLR cores. You may want to put some languages in separate cores because that way you can more effectively use the various stemming algorithms in SOLR. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
Having multiple cores makes searching a bit more complex since you can try either a single language index, or an all-languages index. But it is much more effective to group language data and apply language specific stopwords, protected words, stemming and language analysis tools.
Normally you would include some key data in the index so that when you find a record via SOLR search, you can then reference directly into the source db. Also, you can have normalised and non-normalised data together, for instance an enum could be recorded in a normalised field in English as well as a non-normalised field in the same language as the free-text. A field can be duplicated in order to apply two different analysis and filtering treatments.
It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.
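To make the ETL step concrete, here is a minimal SolrJ sketch of loading one denormalised record into a per-language core. The core name, the field names, and the idea of duplicating the enum into a normalised and an analysed field are assumptions for illustration only, not a prescribed schema:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class LoadExample {
        public static void main(String[] args) throws Exception {
            // One client per language core; "companies_en" is a hypothetical core name.
            SolrClient en = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/companies_en").build();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "record-42");          // uniqueKey
            doc.addField("source_id", 42);            // key back into the source DB
            doc.addField("category_enum", "RETAIL");  // normalised enum value, matched as-is
            doc.addField("category_text", "retail");  // same value, analysed like the free text
            doc.addField("text_en", "Free text extracted from the source record...");

            en.add(doc);
            en.commit();
            en.close();
        }
    }

At query time you would search text_en (or the corresponding core for another language) and use the stored source_id to reference directly back into the source database, as described above.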

Related

Yii2: How should site-wide search work?

What is the best practice methodology of implementing site-wide search in Yii2?
This question is not about how to implement search specifically, but rather about what kind of approach to use. Should we use Sphinx? Elasticsearch? Or do we use UNION selects to get the data into a DataProvider?
Assume the application is using a relational database to store data. We want to search and display multiple different models. For example, our database contains tables of Books, Authors and Stores. When we search for a keyword we want to display results from all 3 tables (matching Books by title or content, Authors by full name and Stores by name etc).
There are tutorials which show how to use Elasticsearch but assume that our data is stored in the Elasticsearch database, which does not make sense. Our data is already stored in MySQL or PostgreSQL. Does this mean we need to maintain a duplicate of our data in the Elasticsearch database?
What is the best practice methodology of implementing site-wide search in Yii2?
That depends on many factors, so I can't give you a specific recommendation for your case. Some of the factors to think about are:
What would you like to achieve with this search? Is every little bit in your database a significant search term?
Do you need only full-text-search or a wide range of analytics?
Do you have any limits on time or costs?
Can your (tech-)infrastructure handle your ideas?
Is it worth bringing another extensive technology into the project?
Can you handle additional maintenance tasks to run such a search engine?
And many more ...
In my internal Yii2 project with a PostgreSQL RDBMS, I decided to use the PostgreSQL text search type called tsvector. That's good enough for my needs. Why?
You can use Stemming.
Supports Fuzzy search.
Supports basic ranking.
Supports multiple languages.
I highly recommend the blog post "Postgres full-text search is Good Enough".
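For a sense of what the tsvector approach looks like in practice, here is a rough sketch via JDBC; the books table, its columns, and the connection details are invented for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TsQueryExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
                // to_tsvector/to_tsquery take a text search configuration, so the
                // same column can be handled per language ('english', 'russian', ...).
                PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, title, " +
                    "       ts_rank(to_tsvector('english', title || ' ' || body), q) AS rank " +
                    "FROM books, to_tsquery('english', ?) AS q " +
                    "WHERE to_tsvector('english', title || ' ' || body) @@ q " +
                    "ORDER BY rank DESC LIMIT 20");
                ps.setString(1, "search & engine"); // tsquery syntax: & = AND, | = OR
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getString("title"));
                    }
                }
            }
        }
    }

In a real setup you would typically store the tsvector in its own column, keep it current with a trigger or a generated column, and put a GIN index on it rather than computing it per query.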

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene; other content such as Word documents, etc. would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that if anyone is kind enough to answer, they will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
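To make the single-index approach concrete: the usual pattern is to index the owner's ID as an untokenised field and have the server attach it as a mandatory filter clause to every query, so one user's ad-hoc query can never match another user's documents. A minimal sketch against a recent Lucene API (field names are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class PerUserFilterExample {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            try (IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("ownerId", "userX", Field.Store.NO)); // security field
                doc.add(new TextField("body", "private note about pricing", Field.Store.YES));
                writer.addDocument(doc);
            }

            try (IndexReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // The ownerId clause is added server-side, never parsed from user input.
                Query query = new BooleanQuery.Builder()
                        .add(new TermQuery(new Term("ownerId", "userX")), BooleanClause.Occur.FILTER)
                        .add(new TermQuery(new Term("body", "pricing")), BooleanClause.Occur.MUST)
                        .build();
                System.out.println(searcher.search(query, 10).totalHits);
            }
        }
    }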
Until you are using several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
If your machine has several IO devices that can be utilized in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches use different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (presumably, name updates are far rarer than price updates), updating only part of the index would require fewer IO resources. This would give the system more overall throughput.

SOLR: Simultaneous search in different languages

I have a Solr instance that gets and indexes data about companies from a DB. The DB data about a single company can be provided in several languages (English and Russian, for example). All the companies, of course, have a unique key that is the uniqueKey in the Solr index too. I need to present Solr search in all the languages at once.
How can it be performed?
1. Multicore? I've built two separate cores, one with each language's data, but I can't search in the two indexes simultaneously.
localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0/,localhost:8983/solr/core1/&indent=true&q=*:*&distributed=true
or
localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0/,localhost:8983/solr/core1/&indent=true&id:123456
gives no results, while searching in each core separately is successful.
Making the Name field (for example) multivalued is not a solution, because the data in different languages is fetched from the DB by different procedures, and the value just gets overwritten.
I'm not sure about the multicore piece, but have you considered creating two fields in a single core, one for each language? You could then combine them with "OR" (which is the default), so a query such as:
en:"query test here" OR ru:"query test here"
would be an example.
Sounds like you are possibly using the DataImportHandler to load your data. You can implement @Mike Sokolov's answer or implement the multivalued solution via the use of a Solr client. You would need to write some custom code in a client like SolrJ (or one of the other clients listed on IntegratingSolr in the Solr Wiki) to pull both languages in separate queries from your database and then parse the data from both results into a common data/result set that can be transformed into a single Solr document.
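A rough sketch of that client-side merge with SolrJ; the fetch helpers stand in for whatever DB procedures load each language, and all names here are hypothetical:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class MergeLanguagesExample {
        // Stand-ins for the two separate DB procedures mentioned in the question.
        static String fetchNameEn(String companyId) { return "Acme Ltd"; }
        static String fetchNameRu(String companyId) { return "ООО Акме"; }

        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/companies").build()) {
                String companyId = "123456";
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", companyId);                   // shared uniqueKey
                doc.addField("name_en", fetchNameEn(companyId)); // English analysis chain
                doc.addField("name_ru", fetchNameRu(companyId)); // Russian analysis chain
                solr.add(doc); // one document carrying both languages, nothing overwritten
                solr.commit();
            }
        }
    }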

Calling search gurus: Numeric range search performance with Lucene?

I'm working on a system that performs matching on large sets of records based on strings, numeric ranges, and date ranges. The string matches are mostly exact matches as far as I can tell, as opposed to the less exact full-text-search type results that I understand Lucene is generally designed for. Numeric precision is important as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but it's not something it was originally designed for.
Currently the system uses procedural SQL to do the matching, and the limits of the system's scalability are being reached. I'm researching ways to scale the system horizontally, and using search engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while returning search results very fast. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching with the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. I would like to aim eventually for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric stuff as a trie; a SQL implementation will probably store it as a B-tree or an R-tree. The way Lucene stores its trie and the way SQL uses an R-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes from Solr).
As a general question of the performance of Lucene vs. SQL full-text search, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. "An Empirical Study on Performance Comparison of Lucene and Relational Database." In Communication Software and Networks (ICCSN '09), 336-340. IEEE, 2009.
First, when executing an exact query, the performance of Lucene is much better than that of an unindexed RDB, while it is almost the same as that of an indexed RDB. Second, when the wildcard query is a prefix query, the indexed RDB and Lucene both still perform very well by leveraging the index... Third, for a combinational query, Lucene performs smoothly and usually costs little time, while the query time of the RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven't been indexed, the search will cost much more time. Fourth, the query time of Lucene and an unindexed RDB is related to record complexity, but an indexed RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
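For illustration, such a combinational query is a single BooleanQuery in Lucene. The sketch below uses the point-based numeric fields of current Lucene versions (the successor to the trie-backed numeric support this thread refers to); the field names are made up:

    import org.apache.lucene.document.DoublePoint;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CombinedQueryExample {
        public static void main(String[] args) {
            // An exact string match plus a numeric price range, as one Boolean query.
            Query byType = new TermQuery(new Term("type", "futures"));
            Query byPrice = DoublePoint.newRangeQuery("price", 10.0, 99.99);
            Query combined = new BooleanQuery.Builder()
                    .add(byType, BooleanClause.Occur.MUST)
                    .add(byPrice, BooleanClause.Occur.MUST)
                    .build();
            System.out.println(combined); // prints the combined clause structure
        }
    }

(The "price" field would have to be indexed with new DoublePoint("price", value) for the range query to match.)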
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word-density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, return results at blistering speed. For text-based searching it is possible, and very probable, that search results will return quicker in Lucene as opposed to SQL Server/Oracle/MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS as they have completely different usages.

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that, we will fire another search to get the project metadata.
This way we could avoid duplicating data. Also, if we want to update the project metadata we will end up updating it in a SINGLE PLACE only. Otherwise, if we store this metadata with all the pdf document indexes, we will end up updating all of the documents, which is not what I am looking for.
Please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene which refers to a database row seems like a fine thing to do.
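A minimal sketch of that two-step lookup, assuming a recent Lucene version and an invented projects table; the field and column names are illustrative only:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.Directory;

    public class TwoStepLookup {
        // Step 1: full-text search in Lucene, returning only the stored projectId keys.
        static List<String> searchProjectIds(Directory dir, String keyword) throws Exception {
            try (IndexReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                List<String> ids = new ArrayList<>();
                for (ScoreDoc hit : searcher.search(
                        new TermQuery(new Term("content", keyword)), 50).scoreDocs) {
                    ids.add(searcher.doc(hit.doc).get("projectId"));
                }
                return ids;
            }
        }

        // Step 2: fetch the project metadata from the relational DB by that key.
        static void printProject(Connection db, String projectId) throws Exception {
            PreparedStatement ps = db.prepareStatement(
                    "SELECT name, start_date FROM projects WHERE id = ?");
            ps.setString(1, projectId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println(rs.getString("name") + " (" + rs.getDate("start_date") + ")");
                }
            }
        }
    }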
I am not sure about your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a full-text search engine like Lucene. The metadata could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware of the fact that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way:
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.
