Paging in Azure search when results have equal scores - azure

I'm using Azure Search on my e-commerce site, and now i faced the problem with paging on my search page. When i reload the search page i can get different order of products. So when i'm using paging i can see same products on different pages, and this is critical.
I started researching what's going wrong, and i've found this info on Microsoft docs https://learn.microsoft.com/en-us/rest/api/searchservice/add-scoring-profiles-to-a-search-index#what-is-default-scoring
Search score values can be repeated throughout a result set. For
example, you might have 10 items with a score of 1.2, 20 items with a
score of 1.0, and 20 items with a score of 0.5. When multiple hits
have the same search score, the ordering of same scored items is not
defined, and is not stable. Run the query again, and you might see
items shift position. Given two items with an identical score, there
is no guarantee which one appears first.
So if i got it correctly, i face this issue because products has same score.
How to fix this?

You got it correctly! Because the products you are getting have the same score, there is no guarantee which one appears first.
In order to avoid it in this stage, you can add to your $orderby parameter a field that has unique values, and that way you guarantee the same order. However, this approach doesn’t take scoring into account. We are currently working on a solution to this problem. We will update this answer once the solution is available (the ETA at this point is weeks, not months).

Please note that you can now use search.score() function to order by score:
From the link below:
https://learn.microsoft.com/en-us/rest/api/searchservice/odata-expression-syntax-for-azure-search.
"You can specify multiple sort criteria. The order of expressions determines the final sort order. For example, to sort descending by score, followed by rating, the syntax would be $orderby=search.score() desc,rating desc."

Related

Why does Azure Search give higher score to less relevant document?

I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via Azure Portal, I see Document A returned first with "#search.score": 7.93229 and Document B returned second with "#search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results as neither have "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.
It's possible that document A is 'newer' than document B and that's the reason why it's being displayed first (has a higher score). Besides Term relevance, freshness can also impact the score.
EDIT:
After some research it looks like that newer created Azure Cognitive Search uses BM25 algorithm by default. (source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.
Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
If you are using the BM25 algorithm, you also may want to tune your k1 and b values. See Set BM25 Parameters.
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

Issue in azure search result when use both search keyword and Orderby clues

Having issue when I do document search in index, I use keywords as search param and distance as order by clues in api parameter.
The outcome result has sorted the result by distance, but the keyword based best data never come up into result.
https://****/indexes/IndexName/docs?api-version=2014-10-20-Preview&$filter= geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)') le 16091.8615317766&search=the beach villas &$orderby=geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)')&$skip=0&$top=10&$count=true
It is very possible that there is an issue, but I would like to step back and make sure you actually want to use sorting as opposed to scoring profiles. Based on the query, it seems as though what you want to do is boost items that are close to the user. A good way to do this is to use our Distance scoring profile that allows you to provide additional weighting to documents that are closer to the location specified by the user. You can also apply an exponential or linear interpolation to this scoring. Using exponential the villa closest to the location get a really large boost and the further ones get a small boost. Or using linear it is more of a gradual degradation of weighted boosting as it gets farther from the point.
Liam
Please see this page for more details on this: https://msdn.microsoft.com/en-us/library/azure/dn798928.aspx

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now assuming that I want to provide my end-user with the best experience searching, but also the best variety of products in my catalog, how would I go about providing a type of demerit to reduce the same brand from showing up in the search results more than 5 times. For the record I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.
Some ideas here, although I've not tried them myself.
You can use Carrot plugin for Solr to cluster search results lets say on manufacturer and then feed it to custom RequestHandler to re-order (cherry picking from each mfr. cluster) the result for diversity.
However, there is a downside to the approach that you may need to fetch larger than necessary and secondly the search results will be synthetic.
To achieve this is a lengthy and complex process but worth trying. Let's say the main field on which you are searching is a single field called title, first you'll need to make sure that all the documents containing "dvd player" in it have same score. This you can do by neglecting solr scoring parameteres like field norm (set omitNorms=true) & term frequency (write a solr plugin to neglect it) code attached..
Implementation Details:
1) compile the following class and put it into Solr WEB-INF/classes
package my.package;
import org.apache.lucene.search.DefaultSimilarity;
public class CustomSimilarity extends DefaultSimilarity {
public float tf(float freq) {
return freq > 0 ? 1.0f : 0.0f;
}
}
In solrconfig.xml use this new similarity class add
similarity class="my.package.CustomSimilarity"
All this will help you to make score for all the documents with "dvd player" in their title same. After that you can define one field of random type. Then when you query solr you can arrange first by score, then by the random field. Since score for all the documents containing DVD players would be same, results will get arranged by random field, giving the customer better variety of products in your catalog.

How do I sort search results by relevance?

I'm working on a project which searches through a database, then sorts the search results by relevance, according to a string the user inputs. I think my current search is fairly decent, but the comparator I wrote to sort the results by relevance is giving me funny results. I don’t know what to consider relevant. I know this is a big branch of information retrieval, but I have no idea where to start finding examples of searches which sort objects by relevance and would appreciate any feedback.
To give a little more background about my specific issue, the user will input a string in a website database, which stores objects (items in the store) with various fields, such as a minor and major classification (for example, an XBox 360 game might be stored with major=video_games and minor=xbox360 fields along with its specific name). The four main fields that I think should be considered in the search are the specific name, major, minor, and genre of the type of object, if that helps.
In case you don't wanna use lucene/Solr, you can always use distance metrics to find the similarity between query and the rows retrieved from database. Once you get the score you can sort them and they will be considered as sorted by relevance.
This is what exactly happens behind the scene of lucene. You can use simple similarity metrics like manhattan distance, distance of points in n-dimensional space etc. Look for lucene scoring formula for more insight.

Resources