Confused about main difference among query parsers in Solr - search

I am pretty new to Solr, when I met the concept of query parser such as basicLucene, dismax, edismac etcs, I wonder what is the main difference among them? Is it the way it scoring fields? What should I pay the major attention to when I just want a simple keyword (or boolean logic combination) search (which may involve specifying field)?

I think the standard will suffice for simple boolean manipulation and querying by fields. If you need some special features, do look at others just to see if they perform better for your needs.
About parsers
There are number of differences (allowed syntax, features, error handling), but first let's state what a parser does: it takes query text as submitted and converts it into Lucene Query object, that Solr/Lucene combo can understand and use when searching.
Lucene parser is the default one. It has fairly intuitive syntax, but lacks power. It works with Solr recommended operators (+ and -) as well as with not recommended Boolean ones (AND, OR, NOT).
Dismax parser was aimed for simple consumer-apps: "more Google-like" and "less stringent". It rarely throws errors (compared to standard one), works with + and - to avoid inner query building problem and it's name stems from Maximum Dysjunction:
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
EDixMax was DisMax Extended - with more power / features added. See links at the bottom for full feature list, but it basically allows to use full Lucene syntax (which neither standard Solr nor DisMax parser didn't).
More? Sure! There's LOTS of parsers, some by Lucene team, some by Solr team, some by others. See third link below for full list and some info.
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
https://cwiki.apache.org/confluence/display/solr/Other+Parsers

Related

Azure Cognitive Search - how to prevent searching for plural form also returning singular matches

We have an Azure Cognitive Search index that we use for full text searches.
Currently, when the user searches for a plural word (e.g. Buildings), the singular forms are also being matched (building).
We want to restrict this behaviour so that only the plural matches are returned.
I've read through the odata documentation but cannot find any reference to how we could accomplish this either through parameters in the search.ismatch in the filter or in the index config.
Plural and singular forms are likely both matching because the field is configured with the default language analyzer, which performs stemming of terms. If you're looking for an exact match, you can use the 'eq' operator in a filter. If you want a case-insensitive (but otherwise exact) match, you can try normalizers (note that this feature is in preview at the time of this writing.)
If you need matching behaviour that is somewhat more sophisticated than a case-insensitive match, you should look into custom analyzers. They allow you to customize the behaviour of tokenization, as well as selectively use (or not use) stemming and other lexical analysis techniques.
To add onto Bruce's answer,
Custom normalizers support many of the same token and character filters as custom analyzers do. In order to decide which one best fits your needs, consider the following questions:
Will this plural/singular matching behaviour be used in filtering/sorting/faceting operations? If so, pre-configured or custom normalizers will enable you to refine what results are returned by your search filter. You can build your own or choose from the list of pre-configured ones, and it supports more than case insensitivity. See the list of supported char and token filters.
Will you need this plural/singular matching behaviour in full-text search, regardless of whether a filter is used? If so, consider using the custom analyzer Bruce suggested above.
Afaik, please note that normalizers will only affect filtering/sorting/faceting results. Also, normalizers are the only way to perform this "normalization" to filter/sort/facet queries. Setting an analyzer will not affect these types of queries.

Multilingual Search using lucene

I am doing a multilingual search. And I will use lucene as the tool to do it.
I have the translated contents already, there will be 3 or 4 languages of each document.
For indexing and search, there could be the 4 strategies, For each document/contents:
each language are indexed in different index/directory.
each language are indexed in different document but in the same index.
each language are indexed in different Field but in the same document.
all the languages are indexed in the same Field in a document
But I have not test each of the way yet, could anyone experienced tell me which one is a better way to do the multilingual search?
Thanks!
Although the question has been asked a couple of years ago, it's still a great question.
There are a couple of aspects to consider evaluating the different solution approaches:
are language specific analyzers used at indexing time?
is the query language always known (e.g. user selectable)?
does the query language always match one of the "content" languages?
should only content matching the query language be retuned?
is relevancy important?
If (1.) & (5.) are valid in your project you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, as term frequencies for the various languages are all mixed up (independent of whether you index your multilingual content as one document or as multiple documents). It might be interesting to know, that adding "n" language specific fields does not result in an "n"-times larger index, but for obvious reasons it comes with some overhead.
Single Field (Strategies 2 & 4)
+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)
Multiple Fields (Strategy 3)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
Multiple Indices (Strategy 1)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- additional languages requires all their own index
Independent of a single or multiple fields approach, your solution might need to handle result collapsing for matches in the "wrong" language, if you index your content as multiple documents. One approach might could be by adding a language field and filter for that.
Recommendation: The approach/strategy you choose, depends on a projects requirements. Whenever possible I would opt for a multiple fields or multiple indices approach.
In short, it depends on your needs, but I would go with option 3 or 1.
1) would probably the best way, if there is no overlap / shared fields between the languages at all.
3) would be the way to go if there are several fields that need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache
I would not recommend 2): this makes your search queries more complex and forces lucene to consider more documents.
4) will make your search query very complex, unless you want users to be able to search in any language without selecting it first.

How do I handle word forms in sphinx search

I have a sphinx server to index a mysql database for a django app. My search is working fine but my content includes medical words/phrases. So, for example, I need a search for "dvt" to also match against "deep venous thrombosis" and even "deep vein thrombosis". I looked through the documentation and see an option for "wordforms" and "morphology". Which of these (or something else) should I use? Also, what will work backwards? ie, a search for "deep venous thrombosis"/"deep vein thrombosis" will match against "dvt".
Also, I would appreciate some advice on how to set these up since I'm new to sphinx in general.
You will need to provide your own list of word/term synonyms to be used in query expansion.
Since Sphinx does not currently support synonym expansion in queries, you'll need to massage the query based on your list of synonyms before submitting it to the search engine.
So, using your example:
User queries for: 'dvt remediation procedures'.
Server receives query and checks each term against its list of synonyms.
Server finds a match and adds 'deep vein thrombosis' to query.
Server submits newly expanded query 'dvt deep vein thrombosis remediation procedures' to search engine.
Finally, if the stemmer built into Sphinx is doing its job, you shouldn't have to support both 'venous' and 'vein' as separate terms since they both should stem to the same term. If this is not the case, you might need to do additional pre-stemming to handle words specific to your corpora (medical terms).

Tagging and Analysing a Search Query

I'm developing a search engine which functions taking the semantics of data into account, unlike the usual keyword based index. I managed to develop a reasonable index for the search using metadata extraction methods and RDF, but I have difficulty in using such methods on the search query itself since the search query is very much shorter that the actual data. any idea how to perform a successful tagging of a search query, using similar methods, natural language processing, etc. ?
Thank You!
Yes, the sample size of a typical query is too small for semantic analysis to be of any value.
One approach might be to constrain or expand your query using drop-down menus for things like "Named Entities" or "Subject Verb Object" tuples.
Another approach would be to expand simple keywords using rules created from your metadata so that, for example, a query for 'car' might be expanded to the tuple pattern
(*,[drive,operate,sell],[car,automobile,vehicle])
before submission.
Finally, you might try expanding the query with a non-semantically valuable prefix and/or suffix to get the query size large enough to trigger OpenCalais' recognizer.
Something like 'The user has specified the following terms in her query: one, two, three.'.
And once the results are returned, filter out all results that match only the added prefix/suffix.
Just a few quick thoughts.
You need to build semantic tree. It will based on the combination of keywords.
For example, automobile -->vehicle --> car this relation technical aspect of car. travel --
hire/rent-->vehicle-->car this is something related to travel and rent a car.
In this case MongoDB will help you a lot.

Calling search gurus: Numeric range search performance with Lucene?

I'm working on a system that performs matching on large sets of records based on strings and numeric ranges, and date ranges. The String matches are mostly exact matches as far as I can tell, as opposed to less exact full text search type results that I understand lucene is generally designed for. Numeric precision is important as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching but it's not something it's originally designed for.
Currently the system uses procedural SQL to do the matching and the limits are being reached as to the scalability of the system. I'm researching ways to scale the system horizontally and using search engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while performing very fast search results. I'd like to investigate if it's possible to take a lot of load off the database by doing the matching with the lucene generated metadata without hitting the database for the full records until the matching rules have determined what should be retrieved. I would like to aim eventually for near real time results although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric stuff as a trie; a SQL implementation will probably store it as a b-tree or an r-tree. The way Lucene stores its trie and SQL uses an R-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes from Solr).
As a general question of the performance of Lucene vs. SQL fulltext, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.
First, when executing
exact query, the performance of Lucene is much better than
that of unindexed-RDB, while is almost same as that of
indexed-RDB. Second, when the wildcard query is a prefix
query, then the indexed-RDB and Lucene both perform very
well still by leveraging the index... Third, for combinational query, Lucene performs
smoothly and usually costs little time, while the query time
of RDB is related to the combinational search conditions and
the number of indexed fields. If some fields in the
combinational condition haven’t been indexed, search will
cost much more time. Fourth, the query time of Lucene and
unindexed-RDB has relations with the record complexity,
but the indexed-RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word density search engine. Lucene can scale to handle extremely large data sets and when indexed correctly return results in a blistering speed. For text based searching it is possible and very probable that search results will return quicker in Lucene as opposed to SQL Server/Oracle/My SQL. That being said it is unfair to compare Lucene to traditional RDBMS as they both have completely different usages.

Resources