How does Lucene order data in composite queries?


I need to know how Lucene orders the records in a result set when I use composite queries.
It looks like it sorts by the "score" value for exact queries and lexicographically for range queries. But what if I have a query that looks like this:
q = type:TAG OR type:POST AND date:[111 to 999]

You are mixing together logical search and scoring. When you pass a query like date:[111 to 999], Lucene searches for all documents with a date in the specified range. But you give it no hint about how to sort them: is date 111 preferable to 555? Is 701 better than 398? Lucene has no idea, so the score is the same for all found documents. Just to produce some order, Lucene sorts the results lexicographically, but that is mostly an implementation detail, not a key idea.
On the other hand, if you pass other parameters with the query, be it keywords or tags, Lucene can apply its similarity algorithm and assign different scores to different docs in the results. You may find more on Lucene's scoring here.
So, to give you the short answer: Lucene sorts results by score, and only if the score of two documents is the same does it fall back to other orderings such as lexicographical order.
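The "score first, something else only on ties" rule can be sketched in a few lines of Python. This is only an illustration of the ordering logic, not Lucene code: the Hit type and the doc_id tie-break key are hypothetical stand-ins for whatever secondary ordering the engine applies.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    doc_id: int   # hypothetical secondary key, used only to break ties
    score: float  # relevance score assigned by the engine


def rank(hits):
    """Order hits by descending score; equal scores fall back to doc_id."""
    return sorted(hits, key=lambda h: (-h.score, h.doc_id))


# Three hits share the same score (as in a pure range query), one scores higher.
hits = [Hit(7, 1.0), Hit(2, 1.0), Hit(5, 2.5), Hit(3, 1.0)]
ranked_ids = [h.doc_id for h in rank(hits)]
print(ranked_ids)
```

The hit with the higher score comes first, and the three equal-score hits are ordered purely by the tie-break key.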

Related

Do high cardinality fields affect performance for searches?

The Azure Search docs state that:
A high cardinality field consists of a facetable or filterable field that has a significant number of unique values, and as a result, consumes significant resources when computing results
But it's not clear whether this poor performance is limited to cases where the fields are specifically used in a filter/facet query, or whether it also affects performance when the field is queried using search terms.
Can anyone with some deeper Azure Search knowledge weigh in?

After getting clarification from Microsoft, I can confirm that the answer is "no, performance is only affected when using the field in a facet/filter".
This poor performance is limited to when the fields are specifically used in a filter/facet query. The searchable terms will not be affected.
Fields that work best in faceted navigation have low cardinality: a small number of distinct values that repeat throughout documents in your search corpus (for example, a list of colors, countries/regions, or brand names).
If the field has a significant number of unique values, it will consume significant resources when computing the facet navigation, because each distinct value becomes one facet that needs to be calculated.
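To make the cost concrete, here is a minimal Python sketch of facet counting. It is not how Azure Search is implemented; the facet_counts helper and the sample color and sku fields are made up for illustration. The point is only that the number of buckets grows with the number of distinct values.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents fall into each distinct value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


docs = [
    {"id": 1, "color": "blue", "sku": "A-001"},
    {"id": 2, "color": "blue", "sku": "A-002"},
    {"id": 3, "color": "red", "sku": "A-003"},
]

color_facets = facet_counts(docs, "color")  # low cardinality: values repeat
sku_facets = facet_counts(docs, "sku")      # high cardinality: one bucket per document
print(len(color_facets), len(sku_facets))
```

A field like sku produces as many buckets as there are documents, which is the situation the documentation warns about.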
At query time, a filter parser accepts criteria as input, converts the expression into atomic Boolean expressions represented as a tree, and then evaluates the filter tree over filterable fields in an index.
If the field has a significant number of unique values, the tree will be deep and consume significant computing resources. Because each unique value has to be evaluated in the filter, there are no cached results for duplicate items to reduce the calculation.
Searchable fields are not affected by having a significant number of unique values, because searchable fields have an inverted index to accelerate queries.
When you load the index, each field's inverted index is populated with all of the unique, tokenized words from each document, with a map to corresponding document IDs. For example, when indexing a hotels data set, an inverted index created for a City field might contain terms for Seattle, Portland, and so forth. Documents that include Seattle or Portland in the City field would have their document ID listed alongside the term.
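The City example above can be sketched as a tiny inverted index in Python. This is a simplified illustration: the build_inverted_index helper, the lowercasing, and the whitespace tokenizer are assumptions, and a real analyzer does considerably more.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each token of `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Assumed analyzer: lowercase, then split on whitespace.
        for token in doc[field].lower().split():
            index[token].add(doc_id)
    return index


hotels = {
    1: {"City": "Seattle"},
    2: {"City": "Portland"},
    3: {"City": "Seattle"},
}

city_index = build_inverted_index(hotels, "City")
print(sorted(city_index["seattle"]))
```

Looking up a term is a single dictionary access regardless of how many distinct terms exist, which is why a large vocabulary hurts much less here than in faceting.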

I reached out to MS as well, this is the answer that I got:
“High cardinality” means different things to filterable vs searchable fields. Cardinality for filterable fields amounts to the uniqueness of the full value of the field. For searchable fields, it’s about the aggregate number of indexed terms that results from writing a document to the index. Complex custom analyzers, for example, can bloat the index by producing several tokens for each word in a string. Inverted indexes scale really well, so I wouldn’t be too concerned about having a high number of unique words in the index. But this should help you understand the unit of scale for each.
This mention in the documentation is primarily to raise awareness about what contributes to query performance and why they may see reduced performance as they add additional fields to the filter clause. I will add…You can improve the performance of individual queries by scaling up the number of partitions in your service. Going from 1 to 2 not only doubles the storage available to your service, it also doubles the amount of compute power available to execute queries. The data workload is divided roughly equally between each partition. It doesn’t usually equate to exactly twice the performance for your queries, but it can have a significant impact if you are seeing slow queries.

Algolia search keywords

I want to build a smart search with Algolia. The point is to use keywords to rank the results. Let's say the user types "smartphone blue cheap good camera". This should find all blue smartphones and order them by price and camera characteristics.
The idea is to somehow map those keywords to a ranking formula.
Does anyone know if this is possible with Algolia and, if so, what the best way to achieve the desired result is?

To automatically detect and filter by facet values (like blue, good camera), you could use Query Rules, in particular Dynamic Filtering.
However, that shouldn't be necessary. If you include the color (containing for instance the blue value) and characteristics (containing for instance the good camera value) attributes in your searchableAttributes list, then the search request will return relevant results based on purely textual relevance matched in those attributes.
On the other hand, sorting strategies impact the Algolia indices at build time. Therefore, in order to change the sorting strategy based on the query (e.g. sort results by ascending price if the search query contains cheap), you will need to set up a new replica index in which results are sorted by price. On the frontend, when detecting a relevant keyword (e.g. cheap), you can decide whether to send the search queries to the primary index or to the sorted replica.
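The keyword detection on the frontend could look roughly like the Python sketch below. It makes no Algolia API calls; the index names products and products_price_asc and the keyword table are hypothetical, and you would pass the returned index name and query text to whatever client you use.

```python
# Hypothetical mapping from a trigger keyword to a sorted replica index.
SORT_KEYWORDS = {"cheap": "products_price_asc"}
PRIMARY_INDEX = "products"  # hypothetical primary index name


def choose_index(query):
    """Return (index_name, query_text) for a raw user query.

    If a trigger keyword is present, route to its replica and drop the
    keyword from the text so it is not matched as a search term.
    """
    words = query.lower().split()
    for keyword, replica in SORT_KEYWORDS.items():
        if keyword in words:
            remaining = " ".join(w for w in words if w != keyword)
            return replica, remaining
    return PRIMARY_INDEX, " ".join(words)


print(choose_index("smartphone blue cheap good camera"))
```

Removing the trigger word from the query text is a design choice; you may prefer to keep it if "cheap" also appears in your searchable attributes.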

How to sort this Solr query by distance?

I am querying results like this:
?q=*&wt=json&rows=1000&fq={!geofilt%20pt=36.722484,-4.371908%20sfield=location%20d=50}
This is using the geofilt function to find all results within 50km of a given point. But the results are returning in a strange order. I want to sort them by proximity to the given point, ascending. How can I add that to the above query?

I'd bet you need to apply an additional sort parameter, which is described here: Spatial Search.
So in your case it would look like:
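?q=*&wt=json&rows=1000&fq={!geofilt}&sfield=location&pt=36.722484,-4.371908&d=50&sort=geodist()+asc
Note that pt, sfield and d are passed as top-level parameters here, because geodist() with no arguments reads sfield and pt from the request rather than from the filter's local parameters.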

If the value you want to sort by isn't a field value but the result of, say, a calculation (i.e. a dynamically generated value such as the distance between points), then you can sort by the output of a function. Have a look at this link:
https://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
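To illustrate what "filter by radius, then sort by a computed distance" means, here is a small Python sketch that does the same thing client-side. It is not Solr code: the haversine formula is an assumed stand-in for the engine's distance function, and the sample documents are made up.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def within_and_sorted(docs, pt, d_km):
    """Keep docs within d_km of pt, ordered by ascending distance."""
    lat, lon = pt
    with_dist = [(haversine_km(lat, lon, *doc["location"]), doc) for doc in docs]
    with_dist.sort(key=lambda pair: pair[0])
    return [doc for dist, doc in with_dist if dist <= d_km]


docs = [
    {"id": "c", "location": (40.4168, -3.7038)},    # far outside the 50 km radius
    {"id": "b", "location": (36.80, -4.40)},        # a few km away
    {"id": "a", "location": (36.7225, -4.3719)},    # practically at the query point
]
result_ids = [d["id"] for d in within_and_sorted(docs, (36.722484, -4.371908), 50)]
print(result_ids)
```

The distance is never stored on the documents; it is computed per query, which is exactly the case where sorting by a function rather than by a field applies.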

Customize azure search scoring in a specific way

Consider a scenario where all documents have the following fields: email, first name, last name, and title.
The requirement is that for email the score should be either 100 (on an exact match) or 0.
For the remaining fields, it is 0 to 100 based on edit distance.
Suppose the records in an index look like the following:
1. abcd#gmail.com, Peterr, Parker, Developer
2. xyz#yahoo.com, Steve, Smith, Manager
The query is a fuzzy search on all the fields, with parameters like
abcd#gmail.com,Pet,Par,Devl
The search result should have a score for the first record like
score of email + score of first name + score of last name + score of title
= 100 + 50 (approx. edit distance of 'Pet' and 'Peterr') + 50 (approx. edit distance of 'Par' and 'Parker') + 44 (approx. edit distance of 'Devl' and 'Developer')
= 244
The other records in the search results should be scored the same way.
I just checked: Azure Search scoring has weights, but I don't think those would be of much help in a scenario like this. The main thing we are looking for is a way to make the search score returned by Azure Search for each record match the score described above.
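For reference, the score described above can be written down as a short Python sketch. This is purely a client-side illustration of the desired formula, not something Azure Search computes. The normalization 100 * (1 - distance / longer length) is an assumption, chosen because it reproduces the 50, 50 and 44 in the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(query, value):
    """0..100 score; assumed formula: 100 * (1 - distance / longer length)."""
    longest = max(len(query), len(value))
    if longest == 0:
        return 100
    dist = levenshtein(query.lower(), value.lower())
    return 100 * (longest - dist) // longest


def record_score(query, record):
    """Email is all-or-nothing; the other fields use edit-distance similarity."""
    q_email, *q_rest = query
    r_email, *r_rest = record
    email_score = 100 if q_email == r_email else 0
    return email_score + sum(similarity(q, r) for q, r in zip(q_rest, r_rest))


query = ("abcd#gmail.com", "Pet", "Par", "Devl")
record_1 = ("abcd#gmail.com", "Peterr", "Parker", "Developer")
print(record_score(query, record_1))
```

For the first record this yields 100 + 50 + 50 + 44 = 244, matching the arithmetic in the question.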

To clarify, it seems what you need is the scoring formula to be a function of the edit distance between the query term and the indexed term - the shorter the distance, the higher the score. Unfortunately, this is not possible in Azure Search.
Azure Search engine executes the search query in two phases: retrieval and scoring.
During retrieval, search query terms processed by the lexical analyzer are looked up in the inverted index. Documents that contain those terms are returned. When you use fuzzy search, we expand your search query by adding terms from the inverted index that are within edit distance of a given query term (fuzzy expansion). This way your query can match more documents.
During scoring, we assign a relevance score to the retrieved documents using the Lucene scoring formula. This formula is based on TF/IDF. In practice, it means that documents that matched rare terms will be ranked higher in the result set.
It's important to know that the Lucene scoring formula only applies to documents that matched the original query terms and terms added through fuzzy expansion. Documents that matched terms added through prefix expansion or regex/wildcard expansion are given a constant score of 1. This way those documents will be in the result set but won't have an impact on ranking that's based on the frequency of terms.
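The "rare terms rank higher" effect of TF/IDF can be seen in a toy Python example. This is a simplified textbook variant, not Lucene's exact formula or Azure Search's implementation; the tf_idf_scores helper, the log(1 + N/df) weighting, and the sample titles are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms):
    """Score each matching doc as the sum of tf * log(1 + N / df) over query terms."""
    n = len(docs)
    tokenized = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for counts in tokenized.values():
        df.update(counts.keys())

    scores = {}
    for doc_id, counts in tokenized.items():
        matched = [t for t in query_terms if counts[t] > 0]
        if matched:
            scores[doc_id] = sum(counts[t] * math.log(1 + n / df[t]) for t in matched)
    return scores


titles = {1: "developer", 2: "manager", 3: "developer", 4: "developer"}
scores = tf_idf_scores(titles, ["developer", "manager"])
best = max(scores, key=scores.get)
print(best)
```

The document containing "manager" wins because that term occurs in only one of the four documents. Note that the score depends on term statistics across the index, not on how close the indexed term is to the query term.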
Hope that helps

How does ElasticSearch rank filter queries (rather than text queries)?

I know that ElasticSearch uses relevance ranking algorithms such as Lucene's tf/idf, length normalization, and a couple more to rank term queries applied to textual fields (e.g. searching for the words "medical" AND "journal" in the "title" and "body" fields).
My question is: how does ElasticSearch rank and retrieve the results of a filter or range query (e.g. age=25, or weight>60)?
I know these types of queries just filter documents based on the condition(s). But let's say I have 200 documents whose age field value is 25. Which of those documents will be retrieved as the top 10 results?
Does ElasticSearch retrieve them in the order it indexed them?

From the Elasticsearch documentation:
Filters: As a general rule, filters should be used instead of queries:
for binary yes/no searches
for queries on exact values
Queries: As a general rule, queries should be used instead of filters:
for full text search
where the result depends on a relevance score
So when running a search such as "age=25, or weight>60" you should be using a filter.
However, filters do not affect scoring, i.e. if you only used a filter, your search results would all have the same score.
There is a range query; this is a query that would affect the score, and I would guess that it scores documents based on things like the document timestamp (most recent gets a higher score).
You'd need to explore the documentation further and dig into the Lucene documentation to understand exactly how and why a document got its score, but as above, you may be better off using filters, which don't affect scoring.
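Since filtered hits all share the same score, the practical way to control which 10 come back is to ask for an explicit sort. Below is a sketch of such a request body built as a Python dict; the age and weight field names come from the question, while the choice of sorting by weight descending is only an assumption for the example. It builds the JSON only and does not call a cluster.

```python
import json


def filtered_search_body(age, min_weight, size=10):
    """Build a search body that filters without scoring and sorts explicitly."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Filter context: yes/no matching, no relevance score computed.
                "filter": [
                    {"term": {"age": age}},
                    {"range": {"weight": {"gt": min_weight}}},
                ]
            }
        },
        # Assumed ordering for the example: heaviest first.
        "sort": [{"weight": {"order": "desc"}}],
    }


body = filtered_search_body(age=25, min_weight=60)
print(json.dumps(body, indent=2))
```

Without the sort clause, the order among equally scored hits should be treated as unspecified rather than relied upon.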
