I am building a multilingual search, and I will use Lucene as the tool to do it.
I already have the translated content; there will be 3 or 4 languages for each document.
For indexing and search, there are 4 possible strategies. For each document/content:
each language is indexed in a different index/directory.
each language is indexed in a different document but in the same index.
each language is indexed in a different field but in the same document.
all the languages are indexed in the same field in a document.
I have not tested each of these approaches yet. Could anyone with experience tell me which one is the better way to do multilingual search?
Thanks!
Although the question has been asked a couple of years ago, it's still a great question.
There are a couple of aspects to consider when evaluating the different solution approaches:
1. are language-specific analyzers used at indexing time?
2. is the query language always known (e.g. user-selectable)?
3. does the query language always match one of the "content" languages?
4. should only content matching the query language be returned?
5. is relevancy important?
If (1.) and (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, as the term frequencies for the various languages get mixed up (independent of whether you index your multilingual content as one document or as multiple documents). It is worth knowing that adding "n" language-specific fields does not result in an "n"-times larger index, but for obvious reasons it comes with some overhead.
Single Field (Strategies 2 & 4)
+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents and an extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)
Multiple Fields (Strategy 3)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
Multiple Indices (Strategy 1)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index
Independent of a single-field or multiple-field approach, your solution might need to handle result collapsing for matches in the "wrong" language if you index your content as multiple documents. One approach could be to add a language field and filter on it.
Recommendation: The approach/strategy you choose depends on a project's requirements. Whenever possible I would opt for the multiple-fields or multiple-indices approach.
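To make the multiple-fields strategy (3) concrete, here is a minimal indexing sketch using Lucene's PerFieldAnalyzerWrapper so each language field gets its own analyzer. The field names (content_en, content_fr, content_de), the index path, and the sample text are illustrative assumptions, not part of the original question.

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MultiLangIndexer {
    public static void main(String[] args) throws Exception {
        // One analyzer per language-specific field; StandardAnalyzer as fallback.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("content_en", new EnglishAnalyzer());
        perField.put("content_fr", new FrenchAnalyzer());
        perField.put("content_de", new GermanAnalyzer());
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        try (Directory dir = FSDirectory.open(Paths.get("multilang-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            // One Lucene document holds all translations of the same source document.
            Document doc = new Document();
            doc.add(new StringField("id", "doc-42", Field.Store.YES));
            doc.add(new TextField("content_en", "The quick brown fox", Field.Store.YES));
            doc.add(new TextField("content_fr", "Le renard brun rapide", Field.Store.YES));
            doc.add(new TextField("content_de", "Der schnelle braune Fuchs", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```

The same wrapper analyzer (or at least compatible per-field analyzers) should be used on the query side, so query terms are analyzed the same way as the indexed text.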
In short, it depends on your needs, but I would go with option 3 or 1.
1) would probably be the best way if there is no overlap / shared fields between the languages at all.
3) would be the way to go if there are several fields that need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.
I would not recommend 2): this makes your search queries more complex and forces Lucene to consider more documents.
4) will make your search queries very complex, unless you want users to be able to search in any language without selecting one first.
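On the query side, the difference between "language is known" and "language is unknown" boils down to targeting one language field versus OR-ing several of them. A rough sketch against the hypothetical content_* fields from the indexing example above; in practice you would build these queries through a QueryParser with the matching analyzer rather than raw TermQuerys:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MultiLangQueries {
    // When the query language is known, target only that language's field.
    static Query singleLanguage(String lang, String term) {
        return new TermQuery(new Term("content_" + lang, term));
    }

    // When the language is unknown, OR the per-language fields together.
    // Note: TermQuery bypasses analysis; the term must already be normalized.
    static Query anyLanguage(String term) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String lang : new String[] {"en", "fr", "de"}) {
            b.add(new TermQuery(new Term("content_" + lang, term)),
                  BooleanClause.Occur.SHOULD);
        }
        return b.build();
    }
}
```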
Related
I have a database which holds keywords for items, and also their localizations in different languages (supporting around 30 different languages right now), if there are any for that item. I want to be able to search these items using Azure Search. However, I'm not sure about how to set up the index architecture. Two solutions come to my mind in this scenario:
Either I will
1) have a different index for each language, and use that language's analyzer for that index. Later on, when I want to search using this index, I will also need to detect the query language coming from the user, and then search on the index corresponding to that language.
or
2) have a single index with a lot of fields that correspond to the different localizations of the item. Azure Search has support on having language priorities when searching, so knowing the user's language may come in handy, but is not necessarily a must.
I'm kind of new to this stuff, so any pointers, links, ideas etc. will be of tremendous help, even if they don't answer the question directly.
Option 2 is what we recommend (having a single index with one field per language). You can set some static priorities by assigning field weights using a scoring profile. If you are able to detect the language used in a query, you can scope the search to just that language using the searchFields option.
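As a rough sketch of that scoping idea, the request below restricts an Azure Search query to a single language field via the searchFields parameter. The service name, index name, field name, and api-version are placeholders for whatever your deployment uses; treat this as an illustration, not a complete client.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class AzureSearchByLanguage {
    public static void main(String[] args) throws Exception {
        // Hypothetical service/index/field names; adjust api-version to your deployment.
        String service = "my-service";
        String index = "items";
        String queryText = URLEncoder.encode("chaise de bureau", StandardCharsets.UTF_8);
        // Detected query language "fr" -> scope the search to the French field only.
        String url = "https://" + service + ".search.windows.net/indexes/" + index
                + "/docs?api-version=2020-06-30"
                + "&search=" + queryText
                + "&searchFields=keywords_fr";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("api-key", System.getenv("AZURE_SEARCH_KEY"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```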
I am pretty new to Solr. When I came across query parsers such as the standard (lucene) parser, DisMax, eDisMax, etc., I wondered: what is the main difference among them? Is it the way they score fields? What should I pay the most attention to when I just want a simple keyword search (or a boolean combination of keywords), which may involve specifying fields?
I think the standard parser will suffice for simple boolean manipulation and querying by fields. If you need some special features, do look at the others just to see if they perform better for your needs.
About parsers
There are a number of differences (allowed syntax, features, error handling), but first let's state what a parser does: it takes the query text as submitted and converts it into a Lucene Query object that the Solr/Lucene combo can understand and use when searching.
The Lucene parser is the default one. It has a fairly intuitive syntax, but lacks power. It works with the Solr-recommended operators (+ and -) as well as with the not-recommended Boolean ones (AND, OR, NOT).
The DisMax parser was aimed at simple consumer apps: "more Google-like" and "less stringent". It rarely throws errors (compared to the standard one), works with + and - to avoid the inner query building problem, and its name stems from Maximum Disjunction:
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
eDisMax is DisMax Extended, with more power / features added. See the links at the bottom for the full feature list, but it basically combines the full Lucene syntax with the DisMax features (something neither the standard parser nor the DisMax parser offers on its own).
More? Sure! There are LOTS of parsers, some by the Lucene team, some by the Solr team, some by others. See the third link below for the full list and some info.
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
https://cwiki.apache.org/confluence/display/solr/Other+Parsers
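A small SolrJ sketch of the practical difference: the standard (lucene) parser expects you to spell out fields and operators, while eDisMax takes free-form user input and spreads it across weighted fields via qf. The core URL and field names are made up for illustration.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParserComparison {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name and fields.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // Standard (lucene) parser: explicit fields and boolean operators.
            SolrQuery standard = new SolrQuery("title:solr AND body:parser");

            // eDisMax: free-form user input, searched across weighted fields.
            SolrQuery edismax = new SolrQuery("solr parser");
            edismax.set("defType", "edismax");
            edismax.set("qf", "title^2 body");

            QueryResponse r1 = solr.query(standard);
            QueryResponse r2 = solr.query(edismax);
            System.out.println(r1.getResults().getNumFound() + " / " + r2.getResults().getNumFound());
        }
    }
}
```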
I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content such as Word documents, etc. would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that if anyone is kind enough to answer, they will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to work with Solr) to manage security.
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
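If you stay with one index and enforce security in the query (as the question suggests), one common pattern is to AND every user query with a mandatory owner filter. A minimal sketch, assuming each document was indexed with a hypothetical ownerId keyword field:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PerUserSearch {
    // Combine the user's ad-hoc query with a mandatory owner filter so a
    // search can never match another user's documents. Assumes each document
    // was indexed with an "ownerId" StringField (name chosen for illustration).
    static Query restrictToUser(Query userQuery, String userId) {
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("ownerId", userId)), BooleanClause.Occur.FILTER)
                .build();
    }
}
```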
Until you use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
if your machine has several IO devices that can be utilized in parallel. In this case, if you are IO-bound, splitting indexes is a good idea.
splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (presumably name updates are far rarer than price updates), updating only part of the index requires fewer IO resources. This gives more overall throughput to the system.
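Field splitting with ParallelReader is fairly involved; for the simpler case of splitting documents across separate physical indexes (e.g. on different IO devices), Lucene's MultiReader lets you search them as one logical index. A minimal sketch with made-up paths:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class SplitIndexSearch {
    public static void main(String[] args) throws Exception {
        // Two physically separate indexes (e.g. on different IO devices),
        // searched as one logical index.
        DirectoryReader r1 = DirectoryReader.open(FSDirectory.open(Paths.get("/disk1/index")));
        DirectoryReader r2 = DirectoryReader.open(FSDirectory.open(Paths.get("/disk2/index")));
        try (MultiReader multi = new MultiReader(r1, r2)) {
            IndexSearcher searcher = new IndexSearcher(multi);
            // searcher.search(query, 10) now works exactly as with a single index.
        }
    }
}
```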
We have enums, free-text, and referenced fields etc. in our DB.
Each enum has its own translation; free text could be in any language. We'd like to do efficient large-scale free-text searching and enum-value-based searching.
I know of solutions like Solr which are nice, but that would mean we'd have to index entire de-normalized records with all the text of all the languages in the system. This seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?
ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
Take a good look at how to configure and use SOLR and learn about SOLR cores. You may want to put some languages in separate cores because that way you can more effectively use the various stemming algorithms in SOLR. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
Having multiple cores makes searching a bit more complex since you can try either a single language index, or an all-languages index. But it is much more effective to group language data and apply language specific stopwords, protected words, stemming and language analysis tools.
Normally you would include some key data in the index so that when you find a record via SOLR search, you can then reference directly into the source db. Also, you can have normalised and non-normalised data together, for instance an enum could be recorded in a normalised field in English as well as a non-normalised field in the same language as the free-text. A field can be duplicated in order to apply two different analysis and filtering treatments.
It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.
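As a trivial sketch of the "load a denormalised record into a language core and keep a key back to the source database" idea, using SolrJ; the core name and field names are purely illustrative:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class EtlToSolr {
    public static void main(String[] args) throws Exception {
        // Hypothetical English-language core; one core per language group.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items_en").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "item-42");           // key back into the source database
            doc.addField("enum_code", "COLOR_RED");  // normalised enum value
            doc.addField("enum_label_en", "Red");    // its English translation
            doc.addField("freetext_en", "Bright red office chair, adjustable height");
            solr.add(doc);
            solr.commit();
        }
    }
}
```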
I'm working on a system that performs matching on large sets of records based on strings, numeric ranges, and date ranges. The string matches are mostly exact matches as far as I can tell, as opposed to the less exact full-text-search type of results that I understand Lucene is generally designed for. Numeric precision is important as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but it's not something it was originally designed for.
Currently the system uses procedural SQL to do the matching and the limits are being reached as to the scalability of the system. I'm researching ways to scale the system horizontally and using search engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while performing very fast search results. I'd like to investigate if it's possible to take a lot of load off the database by doing the matching with the lucene generated metadata without hitting the database for the full records until the matching rules have determined what should be retrieved. I would like to aim eventually for near real time results although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric values in a trie; a SQL implementation will probably store them in a B-tree or an R-tree. The way Lucene uses its trie and the way SQL uses an R-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes from Solr).
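For reference, recent Lucene versions expose that numeric support through point fields (e.g. DoublePoint) rather than the older trie-encoded NumericRangeQuery. A minimal sketch of indexing a price and querying a range, with an assumed "price" field name:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.Query;

public class PriceRange {
    // Index the price both as a point (for range queries) and as a stored field
    // (so the value can be returned with the hit).
    static Document priceDoc(String id, double price) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new DoublePoint("price", price));
        doc.add(new StoredField("price", price));
        return doc;
    }

    // Matches low <= price <= high, e.g. priceBetween(10.00, 19.99).
    static Query priceBetween(double low, double high) {
        return DoublePoint.newRangeQuery("price", low, high);
    }
}
```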
As a general question of the performance of Lucene vs. SQL fulltext, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.
First, when executing
exact query, the performance of Lucene is much better than
that of unindexed-RDB, while is almost same as that of
indexed-RDB. Second, when the wildcard query is a prefix
query, then the indexed-RDB and Lucene both perform very
well still by leveraging the index... Third, for combinational query, Lucene performs
smoothly and usually costs little time, while the query time
of RDB is related to the combinational search conditions and
the number of indexed fields. If some fields in the
combinational condition haven’t been indexed, search will
cost much more time. Fourth, the query time of Lucene and
unindexed-RDB has relations with the record complexity,
but the indexed-RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
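A rough Lucene equivalent of such a combined condition, roughly "category = 'books' AND price BETWEEN 10 AND 25", with made-up field names:

```java
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CombinedMatch {
    // Exact term match on one field combined with a numeric range on another.
    static Query booksBetween10And25() {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("category", "books")), BooleanClause.Occur.MUST)
                .add(DoublePoint.newRangeQuery("price", 10.0, 25.0), BooleanClause.Occur.MUST)
                .build();
    }
}
```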
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word-density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, returns results at blistering speed. For text-based searching it is possible and very probable that search results will return quicker from Lucene than from SQL Server/Oracle/MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as they have completely different usages.