The implication of #search.score in Azure Search Service - azure

I understood the reason for having search profile and boosting results based on some fields e.g. distance, rating, etc. To me, that's most likely applicable to structured documents like json files. The scenario that I cannot make sense of it is when indexer gets search service index let's say a MS Word or PDF document in azure blob. We have two entries of "id" and "content" which I don't know how the search score would apply to it.
For e.g. there are two documents with different contents. I searched for a keyword and the same keyword found in two documents resulted into getting two different scores for two MS Word documents. My challenge is why this score should be different while both documents contain the same keyword?

The score is determined by many factors, for example, the count of terms in each document, and the number of searchable fields in which query terms were found. In your example, the documents have different lengths, so naturally they'll have different scores. HTH.

Related

Multiple analyzers for a single field in a search index of Azure Cognitive Search

We need two different types of search (based on user input), partial and exact for few fields that we have and for the same requirement, we require two different analyzers for each field to produce the required output.
Now, the problem is, I'm not able to configure 2 analyzers for a single field. The only option for me is to create two different indexes altogether and then query respective index based on the user input, but clearly, this is not the right solution, it is not scalable, mostly redundant data and takes almost double the space.
I'm trying to create a duplicate field in the same index with different analyzers and use the output of them based on the user input, but I'm not sure how I can configure that in the index. The name of the field is what is used to search for, during query time. Is there a possibility for me to have 2 different fields with different names, which actually point to one field but have different analyzers?
You can have 2 different fields with different names, which actually point to one field with two different analyzers. This can be done using field mappings in indexer definition.
I have created index as shown below,
As highlighted in above screen shot, I have taken two new fields with name cont01 and cont02.
These two new fields will point to field merged_content with two different analyzers.
In indexer definition I have configured field mappings as shown below,
Ran indexer and results are as shown below,
Reference link

Solr default search field for multiple fields which has different analyzers

I have a document which has title, stockCode, category fields.
I have different field types (and analysis chains) for each. For instance title has EdgeNGram 2 to 20, category has EdgeNGram 3 to 10 with different range and stockCode just has lowercase filter.
So that, I don't want to search from documents with keyword "sample" with building the query like title:sample OR stockCode:sample OR category:sample.
I'd like to search with just "q=sample".
I copied my fields to text but It does not work. Because all fields analyzed as same. But I don't want to index stockCode as EdgeNGram or any other filters. I'd like to index my fields as I configured and I'd like to search a keyword over them base on my indexes.
I've been researching about that for three days, and Solr has a little bit poor documentation.
You can use the edismax handler, as this will allow you to give a list of fields to query and supply the query by itself. You can also give separate weights to each field for scoring them differently.
defType=edismax&q=sample&qf=title^10 stockCode category
.. will search for sample in each of the three fields, giving a 10x boost to any hits in the title field.
You can find the documentation about the edismax query parser under Searching in the reference guide.

Do high cardinality fields affect performance for searches?

The Azure Search docs state that:
A high cardinality field consists of a facetable or filterable field that has a significant number of unique values, and as a result, consumes significant resources when computing results
But it's not clear on whether this poor performance is limited to when the fields are specifically used in a filter/facet query, or whether it also affects performance when the field is queried against using search terms.
Can anyone with some deeper Azure Search knowledge weigh in?
After getting clarification from Microsoft, I can confirm that the answer is "no, performance is only affected when using the field in a facet/filter".
This poor performance is limited to when the fields are specifically used in a filter/facet query. The searchable terms will not be affected.
Fields that work best in faceted navigation have low cardinality: a small number of distinct values that repeat throughout documents in your search corpus (for example, a list of colors, countries/regions, or brand names).
If the field that has a significant number of unique values, it will consume significant resources when computing the facet navigation. Because each distinct value will be 1 facet and need to be calculated.
At query time, a filter parser accepts criteria as input, converts the expression into atomic Boolean expressions represented as a tree, and then evaluates the filter tree over filterable fields in an index.
If the field that has a significant number of unique values, the tree will be deep and consume significant computing resources. Because each unique value will be calculated in filter, there will be no cached result for duplicate items to reduce the calculation.
The searchable fields will not be affected if the fields have a significant number of unique values. Because searchable fields have inverted index to accelerate query.
When you load the index, each field's inverted index is populated with all of the unique, tokenized words from each document, with a map to corresponding document IDs. For example, when indexing a hotels data set, an inverted index created for a City field might contain terms for Seattle, Portland, and so forth. Documents that include Seattle or Portland in the City field would have their document ID listed alongside the term.
I reached out to MS as well, this is the answer that I got:
“High cardinality” means different things to filterable vs searchable fields. Cardinality for filterable fields amounts to the uniqueness of the full value of the field. For searchable fields, it’s about the aggregate number of indexed terms that results from writing a document to the index. Complex custom analyzers, for example, can bloat the index by producing several tokens for each word in a string. Inverted indexes scale really well, so I wouldn’t be too concerned about having a high number of unique words in the index. But, this should help understand the unit of scale each.
This mention in the documentation is primarily to raise awareness about what contributes to query performance and why they may see reduced performance as they add additional fields to the filter clause. I will add…You can improve the performance of individual queries by scaling up the number of partitions in your service. Going from 1 to 2 not only doubles the storage available to your service, it also doubles the amount of compute power available to execute queries. The data workload is divided roughly equally between each partition. It doesn’t usually equate to exactly twice the performance for your queries, but it can have a significant impact if you are seeing slow queries.

Customize azure search scoring in a specific way

Consider a scenario where all documents have following fields
The requirement is that for email the score should be either 100 (if exact match) or 0.
For remaining fields, it is 0 to 100 based on edit distance .
Suppose in an index the records are like the following
1.abcd#gmail.com,Peterr,Parker,Developer
2.xyz#yahoo.com,Steve,Smith,Manager
The query is made on fuzzy search of all the fields and parameters are like
abcd#gmail.com,Pet,Par,Devl
The search result should have a score for first record like
score for email + score of last name +score of first name+score of title
=100+50(approx edit distance of 'Peterr and Pet')+50(approx edit distance of 'Peterr and Parker')+44(approx edit distance of 'Devl and Developer')
=244
Similarly ,the search result should have a score in similar way.
I just checked Azure search scoring has weights but those I don't think would be of much helpful in scenarios like this .The main thing we are looking for is to find a way where the search score returned for each record by Azure search would be in accordance with the score I discussed above
To clarify, it seems what you need is the scoring formula to be a function of the edit distance between the query term and the indexed term - the shorter the distance, the higher the score. Unfortunately, this is not possible in Azure Search.
Azure Search engine executes the search query in two phases: retrieval and scoring.
During retrieval search query terms processed by the lexical analyzer are looked up in the inverted index. Documents that had those terms are returned. When you use fuzzy search we expand your search query by adding terms from the inverted index that are within edit distance from a given query term - fuzzy expansion. This way your query can match more documents.
During scoring we assign a relevance score to retrieved documents using the Lucene scoring formula. This formula is based on TF/IDF. Practically, it means that documents that matched terms that are rare will be ranked higher up in the results set.
It's important to know that the Lucene scoring formula only applies to documents that matched the original query terms and terms added through fuzzy expansion. Documents that matched terms added through prefix expansion or regex/wildcard expansion are given constant score 1. This way those documents will be in the results set but won't have impact on ranking that's based on frequency of terms.
Hope that helps

How does ElasticSearch rank filter queries (rather than text queries)?

I know that ElasticSearch uses relevance ranking algorithms such as Lucene's tf/idf, length normalization and couple of more algorithms to rank term queries applied on textual fields (i.e. searching words "medical" AND "journal" in the "title" and "body" fields).
My question is how does ElasticSearch rank and retrieve results of a filter or range query (i.e. age=25, or weight>60)?
I know these types of queries are just filtering documents based on the condition(s). But lets say I have 200 documents which their age field value is 25. Which of those documents will be retrieved as top 10 results?
Does ElasticSearch retrieve them by the order it indexed them?
From the Elasticsearch documentation:
Filters: As a general rule, filters should be used instead of queries:
for binary yes/no searches
for queries on exact values
Queries: As a general rule, queries should be used instead of filters:
for full text search
where the result depends on a relevance score
So when running a search such as "age=25, or weight>60" you should be using a filter.
However - Filters do not affect the scoring - i.e. if you only used a filter your search results would all have the same score.
There is a range query - this is a query that would affect score and I would guess that it scores documents based on things like the document timestamp (most recent gets a higher score).
You'd need to explore the documentation further and dig into Lucene documentation to understand exactly how and why the a document got its score - but as above, you may be better using Filters that don't affect scoring.

Resources