Why does Azure Search give higher score to less relevant document? - azure

I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via Azure Portal, I see Document A returned first with "#search.score": 7.93229 and Document B returned second with "#search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results as neither have "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.

It's possible that document A is 'newer' than document B and that's the reason why it's being displayed first (has a higher score). Besides Term relevance, freshness can also impact the score.
EDIT:
After some research it looks like that newer created Azure Cognitive Search uses BM25 algorithm by default. (source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.

Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
If you are using the BM25 algorithm, you also may want to tune your k1 and b values. See Set BM25 Parameters.
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.

Related

How to implement synonyms for use in a search engine?

I am working on a pet search engine (SE).
What I have right now is boolean keyword SE, as a library that is split in two parts:
index: this is a inverted index ie. it associate terms with the original document where it appears
query: which is supplied by the user and can be arbitrarily complex boolean expression that looks like (mobile OR android OR iphone) AND game
I'd like to improve the search engine, in a way that does automatically extend simple queries to boolean queries so that it includes search terms that do no appear in the original query ie. I'd like to support synonyms.
I need some help to build the synonyms graph.
How can I compute list of words that appears in similar context?
Here is example of list of synonyms I'd like to compute:
psql, pgsql, postgres, postgresql
mobile, iphone, android
and also synonyms that includes ngrams like:
rdbms, relational database management systems, ...
The algorithm doesn't have to be perfect, I can post-process by hand the result, but at least I need to have a clue about what terms are similar to what other terms.
In the standard Information Retrieval (IR) literature, this enrichment of a query with additional terms (that don't appear in the initial/original query) is known as query expansion.
There're a plenty of standard approaches which, generally speaking, are based on the idea of scoring terms based on some factors and then selecting a number of terms (say K, a parameter) that have the highest scores.
To compute the term selection score, it is assumed that the top (M) ranked documents retrieved after initial retrieval are relevant, this being called pseudo-relevance feedback.
The factors on which the term selection function generally depend are:
The term frequency of a term in a top ranked document - higher the better.
The number of documents (out of top M) in which the term occurs in - higher the better.
How many times does an additional term co-occur with a query term - the higher the better.
The co-occurrence factor is the most important and would be give you terms such as 'pgsql' if the original query contains 'psql'.
Note that if documents are too short, this method would not work well and you have to use other methods that are necessarily semantics based such as i) word-vector based expansion or ii) wordnet-based expansion.

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

Paging in Azure search when results have equal scores

I'm using Azure Search on my e-commerce site, and now i faced the problem with paging on my search page. When i reload the search page i can get different order of products. So when i'm using paging i can see same products on different pages, and this is critical.
I started researching what's going wrong, and i've found this info on Microsoft docs https://learn.microsoft.com/en-us/rest/api/searchservice/add-scoring-profiles-to-a-search-index#what-is-default-scoring
Search score values can be repeated throughout a result set. For
example, you might have 10 items with a score of 1.2, 20 items with a
score of 1.0, and 20 items with a score of 0.5. When multiple hits
have the same search score, the ordering of same scored items is not
defined, and is not stable. Run the query again, and you might see
items shift position. Given two items with an identical score, there
is no guarantee which one appears first.
So if i got it correctly, i face this issue because products has same score.
How to fix this?
You got it correctly! Because the products you are getting have the same score, there is no guarantee which one appears first.
In order to avoid it in this stage, you can add to your $orderby parameter a field that has unique values, and that way you guarantee the same order. However, this approach doesn’t take scoring into account. We are currently working on a solution to this problem. We will update this answer once the solution is available (the ETA at this point is weeks, not months).
Please note that you can now use search.score() function to order by score:
From the link below:
https://learn.microsoft.com/en-us/rest/api/searchservice/odata-expression-syntax-for-azure-search.
"You can specify multiple sort criteria. The order of expressions determines the final sort order. For example, to sort descending by score, followed by rating, the syntax would be $orderby=search.score() desc,rating desc."

Azure Search Distance filter with variable distance

Suppose I have the following scenario:
A search UI to allow individuals to find plumbers who are able to service their home location.
When a plumber enters their info into the system, they provide their coordinates and a maximum distance they are willing to travel.
The individual can then enter their home coordinates and should be presented with a list of plumbers who are eligible.
Looking at the Azure Search geo.distance function, I cannot see how to do this. Scenarios where the searcher provides a distance are well covered but not where the distance is different for each search record.
The documentation provides the following example:
$filter=geo.distance(location, geography'POINT(-122.131577 47.678581)') le 10
This works correctly but if I try and change the 10 to the maxDistance field, it fails with
Comparison must be between a field, range variable or function call
and a literal value
My requirement seems fairly basic but am now wondering if this is currently possible with Azure Search?
I found an azure feedback suggestion asking for this feature but no news on if/when it will be implemented. Therefore it is safe to assume that this scenario is not currently supported.
To add to Paul's answer, one possible workaround is to use a conservatively large constant value instead of referencing the maxDistance field in your $filter expression. Then, you can filter the resulting list of plumbers on the client to take each plumber's max distance into account and produce final list of plumbers.

Issue in azure search result when use both search keyword and Orderby clues

Having issue when I do document search in index, I use keywords as search param and distance as order by clues in api parameter.
The outcome result has sorted the result by distance, but the keyword based best data never come up into result.
https://****/indexes/IndexName/docs?api-version=2014-10-20-Preview&$filter= geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)') le 16091.8615317766&search=the beach villas &$orderby=geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)')&$skip=0&$top=10&$count=true
It is very possible that there is an issue, but I would like to step back and make sure you actually want to use sorting as opposed to scoring profiles. Based on the query, it seems as though what you want to do is boost items that are close to the user. A good way to do this is to use our Distance scoring profile that allows you to provide additional weighting to documents that are closer to the location specified by the user. You can also apply an exponential or linear interpolation to this scoring. Using exponential the villa closest to the location get a really large boost and the further ones get a small boost. Or using linear it is more of a gradual degradation of weighted boosting as it gets farther from the point.
Liam
Please see this page for more details on this: https://msdn.microsoft.com/en-us/library/azure/dn798928.aspx

Resources