Negative boost in Azure Search Scoring Profiles

We have been working on creating scoring profiles for our search. We need a way to "bury" or give "negative" boosts to some fields for the scoring function types "Magnitude", "Freshness", and "Tags". We noticed that we cannot set a negative boost value. Is there any other way to achieve this kind of behavior (burying results based on a field)?
We cannot use $orderby because it takes precedence over the scoring profile.
Please advise. Thanks!

You should only set positive boosting values, as described [here][1]. There may be a few things you could do. The first thing I would try is to set the weight to 0 for the fields that you do not care about. In that case, they will simply not impact the relevance.
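A minimal sketch of that first suggestion (field names hypothetical; I haven't verified the service accepts an exact 0 weight, so treat this as the shape of the payload rather than a guaranteed-valid one):
"scoringProfiles": [
  {
    "name": "ignore-summary",
    "text": {
      "weights": {
        "title": 1,
        "summary": 0
      }
    }
  }
]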
Another option: If you know that a field should not impact relevance you could simply make that field not 'searchable'. That said, this is a property of the index definition -- so you would need to create a different index for each combination of non-searchable fields.
Depending on your scenario, you could also make a field filterable, and filter based on that field. Something like $filter=Freshness eq 'Really Fresh'. See the OData filter documentation for more information on using filters.
thanks!
-Luis Cabrera

For "Magnitude", "Freshness", you can set the set the range start as higher value and range end as lower value. Would this be considered as negative impact?
Like this:
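Something like this, perhaps (a sketch assuming a hypothetical daysOld field; whether the service accepts an inverted range is exactly the open question here):
"functions": [
  {
    "type": "magnitude",
    "fieldName": "daysOld",
    "boost": 10,
    "interpolation": "linear",
    "magnitude": {
      "boostingRangeStart": 365,
      "boostingRangeEnd": 0
    }
  }
]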

I resolved that scenario by putting negative values (in an INT field) on the documents we wanted to bury. That gave us the negative boost we needed.
I used a similar technique for date "Freshness" too: we counted the days since some event, so the higher the number, the less fresh the date, and we used a "magnitude" function on it.
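A sketch of how that can look (field name and value range hypothetical): buried documents get a negative buryScore, everything else gets 0, and a magnitude function hands the full boost to documents at the top of the range.
"functions": [
  {
    "type": "magnitude",
    "fieldName": "buryScore",
    "boost": 10,
    "interpolation": "linear",
    "magnitude": {
      "boostingRangeStart": -100,
      "boostingRangeEnd": 0
    }
  }
]
Documents with buryScore near 0 receive most of the boost, while those near -100 receive little, which effectively buries them.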
Thanks!

I have thought about the need for this too.
One idea I have, but haven't tried, is to run a second search on just the negative keywords. That search result will have scores as well.
Then use those scores in a function to reduce the first search result's scores.
(Yes, it would be nicer if it could be done as part of ACS.)
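Sketched out (untested, and the combining step would have to happen client-side; the index name and weight are made up):
POST /indexes/hotels/docs/search with { "search": "<positive terms>" }
POST /indexes/hotels/docs/search with { "search": "<negative terms>" }
finalScore(doc) = positiveScore(doc) - k * negativeScore(doc)
where k is a weight you would tune, matching documents up by key between the two result sets.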


Why does Azure Search give higher score to less relevant document?

I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via the Azure Portal, I see Document A returned first with "@search.score": 7.93229 and Document B returned second with "@search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results as neither document has "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.
It's possible that Document A is 'newer' than Document B, and that's why it's being displayed first (has a higher score). Besides term relevance, freshness can also impact the score.
EDIT:
After some research, it looks like newer Azure Cognitive Search services use the BM25 algorithm by default. (Source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.
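For reference, the per-term BM25 contribution has roughly this shape (the standard formulation, not anything Azure-specific):
score(t, d) = IDF(t) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * |d| / avgdl))
The (1 - b + b * |d| / avgdl) term in the denominator is what penalizes length: a field much longer than the average (avgdl) drags the score down even when the term frequency tf is high.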
Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
If you are using the BM25 algorithm, you may also want to tune your k1 and b values. See Set BM25 Parameters.
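In the index definition that looks like this (the values shown are the documented defaults; lowering b reduces how much field length is normalized):
"similarity": {
  "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
  "k1": 1.2,
  "b": 0.75
}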
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.

Internal Search optimization for relevance

My team is using Solr and I have a question regarding it.
There are some search terms that don't give relevant results, or that miss results which should have been displayed. For example:
Searching for Macy's without the apostrophe, like "Macys", doesn't give back any results for Macy's.
Searching for JPMorgan vs. JP Morgan gives different results.
Searching for IBM doesn't show results that contain its full name, i.e. International Business Machines.
How can we improve and optimize such cases so that the fix applies across the board, even to cases we haven't caught beyond these three?
Any suggestions?
All these issues are related to how you process the incoming text for those fields. You'll have to create an analysis chain for the field that processes the input values to do what you want - possibly using multiple fields for different use cases and prioritizing them with qf.
Your first case can be solved by using a PatternReplaceFilter to remove any apostrophes - depending on your use case and tokenizer, you might want to use the CharFilter version, as it processes the text before it's split into multiple tokens.
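A sketch of a field type using the char filter variant (the field type name is made up; adjust the tokenizer and filters to your schema):
<fieldType name="text_noapostrophe" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip apostrophes before tokenization so "Macy's" and "Macys" index identically -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="['’]" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>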
Your second case is a straightforward synonym filter or a WordDelimiterFilter: either expand JPMorgan to "JP Morgan" as a synonym, or use the WordDelimiterFilter to split on case changes into separate tokens. That'll also allow you to search for JP and get JPMorgan-related entries. These might have different effects on the score; use debugQuery=true to see exactly how each term in your query contributes to it.
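The word-delimiter side might look like this in the index-time analyzer (a sketch; preserveOriginal keeps the unsplit token so exact matches on JPMorgan still work):
<filter class="solr.WordDelimiterGraphFilterFactory" splitOnCaseChange="1" generateWordParts="1" preserveOriginal="1"/>
<filter class="solr.FlattenGraphFilterFactory"/>
(FlattenGraphFilterFactory is needed after graph filters at index time.)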
The third case is in general the same as the second case. You'll have to create a decent synonym list for the terms used; this is usually something you build as you get feedback from your users, from existing dictionaries, and from domain knowledge. There's also the option of preprocessing the text using NLP - or, in this case, something as primitive as indexing the initials of consecutive capitalized words could help.
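The synonym side is a SynonymGraphFilterFactory pointing at a plain comma-separated file (the entries here are just examples):
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
And in synonyms.txt:
IBM, International Business Machines
JPMorgan, JP Morgan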

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read it isn't a good idea anyway - e.g. if a query returns only 10 listings, I would want to display them all instead of filtering any out, and it seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise, please do!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc. It also has the same problems as the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and don't say anything absolute about how relevant a document is.
You'll have to decide why the documents being included aren't relevant and exclude them based on that criteria. Then either use the review score to boost documents further up (if you want the results to appear organic / ordered by relevance), or just exclude the irrelevant ones and sort by user score. But remember that making a user score feel relevant, as an experience for the user, is usually a harder problem than just ordering by the average of the votes.
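If you do end up cutting off by score despite the caveats above, one way to express it in Solr is a function range filter over the main query's score (a sketch; the threshold value is something you would have to pick empirically):
q=some query&fq={!frange l=0.4}query($q)&sort=user_rating desc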
Usually the client can choose between different ordering options, by relevance or by rating for example. But you are right that ordering by rating alone is probably not useful enough. What you could do is take the rating into account in the relevance scoring, for example by multiplying an "organic" score with the rating transformed into a small boost. In Solr you can do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
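With the edismax parser that multiplication can be a boost function (a sketch; avg_rating is a hypothetical numeric field, and the log transform keeps the multiplier gentle):
q=some query&defType=edismax&boost=sum(1,log(sum(avg_rating,1)))
The sum(1,...) wrapper keeps the boost neutral (x1) for unrated documents and grows it slowly as ratings climb.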
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users filter the results by rating themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content-similarity scoring is not the only thing that constitutes relevancy. Many information retrieval researchers and engineers agree that contextual information should be used besides content similarity. This opens up a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular, where different features are learned from search logs to deliver more relevant documents to users given their profiles and prior search behavior. Solr offers this as a module.

Issue in Azure Search results when using both a search keyword and an $orderby clause

I'm having an issue when I do a document search on an index: I use keywords as the search parameter and distance as the $orderby clause in the API parameters.
The outcome has the results sorted by distance, but the best keyword-based matches never come up in the results.
https://****/indexes/IndexName/docs?api-version=2014-10-20-Preview&$filter= geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)') le 16091.8615317766&search=the beach villas &$orderby=geo.distance(geolocation, geography'POINT(-157.825459241867 21.2753200113279)')&$skip=0&$top=10&$count=true
It is very possible that there is an issue, but I would like to step back and make sure you actually want to use sorting as opposed to scoring profiles. Based on the query, it seems that what you want to do is boost items that are close to the user. A good way to do this is to use our distance scoring function, which allows you to give additional weight to documents that are closer to the location specified by the user. You can also apply an exponential or linear interpolation to this scoring: with exponential, the villa closest to the location gets a really large boost and the farther ones get a small boost; with linear, the weighted boost degrades more gradually as a villa gets farther from the point.
Liam
Please see this page for more details on this: https://msdn.microsoft.com/en-us/library/azure/dn798928.aspx
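For illustration, such a distance function in a scoring profile might look like this (a sketch - the profile and parameter names are made up, and the exact scoringParameter syntax varies by API version):
"scoringProfiles": [
  {
    "name": "geo-boost",
    "functions": [
      {
        "type": "distance",
        "fieldName": "geolocation",
        "boost": 5,
        "interpolation": "linear",
        "distance": {
          "referencePointParameter": "currentLocation",
          "boostingDistance": 16
        }
      }
    ]
  }
]
The query would then drop $orderby and pass the reference point instead, along the lines of:
&scoringProfile=geo-boost&scoringParameter=currentLocation:-157.825459241867,21.2753200113279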

How do I write a Solr FunctionQuery to boost documents with future dates?

I am trying to boost records with a particular date in the future, closest to now, towards the top of the results, and make those with dates in the past less relevant. I've seen a number of posts about how to boost results that are simply closer to now, but that's not really what I need.
What you want to do is:
Know if the date is in the future or the past
Use different boost functions for each case
Assuming the field name is due_date, we'll start building your query.
First, you want to get the time difference.
&timediff=ms(due_date,NOW)
You can use NOW/HOUR, NOW/DAY for better performance
Second, we need to know if the duration is positive or negative: adding a number to its absolute value returns a positive (truthy) value for positive numbers and 0 (falsy) for negative ones.
&future=sum($timediff,abs($timediff))
Now depending whether the number is positive or negative you want to apply different boost functions. You can use any function you want here.
&futureboost=recip($timediff,1,36000000,36000000)
&pastboost=recip($timediff,1,3600000,3600000)
&finalboost=if($future,$futureboost,$pastboost)
&boost=$finalboost
Notice that the futureboost parameters are 10x those of pastboost, which gives a higher boost to future documents than to past ones. The recip function is documented on the Solr Function Query page, and you can tune the parameters of both the futureboost and pastboost functions to your case.
To return the function value, you can use:
&fl=_DATE_BOOST_:$finalboost
Full Query will be a combination of all the above:
&timediff=ms(due_date,NOW/HOUR)
&future=sum($timediff,abs($timediff))
&futureboost=recip($timediff,1,36000000,36000000)
&pastboost=recip($timediff,1,3600000,3600000)
&finalboost=if($future,$futureboost,$pastboost)
&boost=$finalboost
&fl=_DATE_BOOST_:$finalboost
Solved it by applying the boost query below:
bq=movie_release_date:[NOW/DAY-1MONTH TO NOW/DAY+2MONTHS]^10
Not as accurate as the previous answer, but if you lack Solr 4, the result is pretty much the same.
