Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When the results are sorted by relevance this is fine, as the most relevant ones are listed on the first page.
However, when sorting by another field (e.g. user rating), the first page is full of hardly-relevant results, which is a problem for our client. Somehow we need to show only the 'relevant' results with the highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read it isn't a good idea anyway: if a query returns only 10 listings I would want to display them all rather than filter any out, and it seems impossible to determine a threshold that works across the board. If anyone can show me otherwise, please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that I can no longer implement pagination, because I have no way to determine the total number of filtered results without fetching the whole set, which would hurt performance/bandwidth etc. It also has the same problems as the first point.
3 - Create a sort of 'combined' sort that aggregates a score from relevancy and user rating, and sort the results on that. Firstly I'm not sure whether this is even possible, and secondly it would be confusing for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks

If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define what is "relevant enough", since scores aren't really comparable between queries and don't say anything like "this was xyz relevant!".
You'll have to decide why the documents that are included aren't actually relevant and exclude them based on that criterion, and then either use the review score to boost the remaining documents further up (if you want the search to still appear organic / by relevance), or simply exclude the irrelevant documents and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make feel relevant than just ordering by the average of the votes.
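As a rough illustration of "exclude, then sort by rating" (only a sketch; the core name, field names and mm value are assumptions you would tune for your own schema), edismax's mm (minimum-should-match) can be used to require that the query terms actually match before ordering is handed over to the rating field:

import requests

# Hypothetical Solr core and field names; adjust to your schema.
SOLR_URL = "http://localhost:8983/solr/products/select"

params = {
    "q": "garden chair",
    "defType": "edismax",
    "qf": "name description",   # fields used for text relevance
    "mm": "100%",                # require all terms to match, excluding barely-relevant docs
    "sort": "rating desc",       # order the remaining documents by user rating
    "fl": "id,name,rating,score",
    "rows": 20,
}

docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
for doc in docs:
    print(doc["rating"], doc["name"])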

Usually the client can choose between different ordering options, for example by relevance or by rating. But you are right that ordering by rating alone is probably not useful enough. What you could do is take the rating into account in the relevance scoring, for example by multiplying an "organic" score with the rating transformed into a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved; much of it is common sense. And it requires some very good evaluation and testing to see what works best.
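A minimal sketch of that multiplication, assuming an edismax query and a numeric rating field (the core and field names are hypothetical): the edismax boost parameter multiplies the organic text score by the given function query.

import requests

SOLR_URL = "http://localhost:8983/solr/products/select"  # hypothetical core

params = {
    "q": "garden chair",
    "defType": "edismax",
    "qf": "name description",
    # Multiply the organic text score by a dampened rating boost;
    # log(rating + 1) keeps a top-rated item from drowning out text relevance.
    "boost": "log(sum(rating,1))",
    "fl": "id,name,rating,score",
}

docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]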
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content-similarity scoring is not the only thing that constitutes relevancy: many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens up a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular, where features are learnt from search logs to deliver more relevant documents to users given their profiles and prior search behaviour. Solr offers this as a module.
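If you go that route, reranking with Solr's LTR module is applied through the rq parameter; this is only a sketch, and the model name below is a placeholder for a feature store and model you would have to configure and train beforehand.

import requests

SOLR_URL = "http://localhost:8983/solr/products/select"  # hypothetical core

params = {
    "q": "garden chair",
    "defType": "edismax",
    "qf": "name description",
    # Rerank the top 100 results with a previously uploaded LTR model
    # ("myRankingModel" is a placeholder; it must exist in the model store).
    "rq": "{!ltr model=myRankingModel reRankDocs=100 efi.user_query='garden chair'}",
    "fl": "id,name,score",
}

docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]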

Related

Paging in Azure search when results have equal scores

I'm using Azure Search on my e-commerce site, and now I've run into a problem with paging on my search page. When I reload the search page I can get a different order of products, so when I use paging I can see the same products on different pages, and this is critical.
I started researching what was going wrong, and I found this info in the Microsoft docs: https://learn.microsoft.com/en-us/rest/api/searchservice/add-scoring-profiles-to-a-search-index#what-is-default-scoring
Search score values can be repeated throughout a result set. For example, you might have 10 items with a score of 1.2, 20 items with a score of 1.0, and 20 items with a score of 0.5. When multiple hits have the same search score, the ordering of same scored items is not defined, and is not stable. Run the query again, and you might see items shift position. Given two items with an identical score, there is no guarantee which one appears first.
So if I got it correctly, I'm facing this issue because the products have the same score.
How can I fix this?
You got it correctly! Because the products you are getting have the same score, there is no guarantee which one appears first.
In order to avoid it at this stage, you can add to your $orderby parameter a field that has unique values, and that way you guarantee a stable order. However, this approach doesn't take scoring into account. We are currently working on a solution to this problem and will update this answer once the solution is available (the ETA at this point is weeks, not months).
Please note that you can now use the search.score() function to order by score. From the documentation at
https://learn.microsoft.com/en-us/rest/api/searchservice/odata-expression-syntax-for-azure-search:
"You can specify multiple sort criteria. The order of expressions determines the final sort order. For example, to sort descending by score, followed by rating, the syntax would be $orderby=search.score() desc,rating desc."

Effect of randomness on search results

I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them based on the assigned score.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful to reduce bias and promote lower-ranking items, to give them a chance to be seen more easily and become popular if they deserve it. I know that some machine learning algorithms introduce some randomization to reduce bias, so I thought it might apply to search as well.
The closest I can find is this, but it's not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID.
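A sketch of what that looks like as a query body (the index and field names are made up, and the seed would normally be the user's session id; recent Elasticsearch versions also expect a field such as _seq_no alongside the seed):

import requests

ES_URL = "http://localhost:9200/products/_search"  # hypothetical index
session_seed = 42

body = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "garden chair"}},
            "functions": [
                # Same seed -> same "random" order for this user across pages 2, 3, ...
                {"random_score": {"seed": session_seed, "field": "_seq_no"}}
            ],
            # Add the random factor to the original score instead of replacing it.
            "boost_mode": "sum",
        }
    },
    "from": 0,
    "size": 20,
}

hits = requests.post(ES_URL, json=body).json()["hits"]["hits"]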
Your intuition is right - randomization can help surface results that get a lower-than-deserved score due to uncertainty in the estimation. Empirically, Google search ads seem to have sometimes been randomized, and e.g. this paper hints at it (see Section 6).
This describes an instance of a class of problems called explore/exploit algorithms, or multi-armed bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by the expected "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
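A toy sketch of that "optimistic estimate" idea in the UCB1 style: rank items by observed click-through rate plus an uncertainty bonus, so rarely-shown items get a chance to surface (the numbers are invented purely for illustration).

import math

# (impressions, clicks) per item; invented data for the sketch.
stats = {
    "item_a": (1000, 120),
    "item_b": (50, 8),
    "item_c": (5, 1),
}

total_impressions = sum(n for n, _ in stats.values())

def ucb1(impressions, clicks):
    # Observed click-through rate plus an exploration bonus that grows
    # when an item has been shown only a few times.
    mean = clicks / impressions
    bonus = math.sqrt(2 * math.log(total_impressions) / impressions)
    return mean + bonus

ranked = sorted(stats, key=lambda item: ucb1(*stats[item]), reverse=True)
print(ranked)  # items with little data are ranked optimistically high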

Boost SolR results using users behavior

I would like Solr to be able to "learn" from my website users' choices. By that I mean that I know which product the user clicks after he performs a search, so I have collected a list of [term searched => number of clicks] for each product indexed in Solr. But I can't figure out how to apply a boost that depends on this user input. Is it possible to index some key/value pairs for a document and retrieve the value with a function usable in the boost parameter?
I'm not sure I'm being clear, so I'll add a concrete example:
Let's say that when a user searches for "garden chair", Solr returns 3 products: "green garden chair", "blue chair", and "hamac for garden".
"green garden chair" ranks first and the hamac ranks last, as expected.
But then all the users searching for "garden chair" end up clicking on the hamac.
I would like to help the hamac rank first for the search "garden chair", WITHOUT altering the rank it gets on other searches. So I would like to be able to perform a key=>value based boost.
Is that possible to achieve with Solr?
I'm sure I can't be the first one needing this kind of user-behaviour-based search improvement.
Thanks in advance.
You could use edismax bq, if you are using edismax (or maybe bf). For this to work, you obviously need to store the info somewhere (in a DB, Redis, whatever you fancy):
searched "garden chair":
clicked "hamac for garden": 10
clicked "green garden chair": 4
searched "green table":
...
And so forth. Look this up when there is a search, and if there is info available for that search, send a bq boosting what you want.
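A sketch of that lookup, assuming the per-query click counts live in a plain dict (they could equally come from a DB or Redis); the document ids, field names and weights are made up.

import requests

SOLR_URL = "http://localhost:8983/solr/products/select"  # hypothetical core

# query text -> {document id: click count}, collected from your search logs.
clicks = {
    "garden chair": {"hamac42": 10, "greenchair7": 4},
}

def build_bq(user_query):
    # Turn click counts into additive boost queries, e.g. id:hamac42^10
    return [f"id:{doc_id}^{count}" for doc_id, count in clicks.get(user_query, {}).items()]

user_query = "garden chair"
params = {
    "q": user_query,
    "defType": "edismax",
    "qf": "name description",
    "bq": build_bq(user_query),   # requests sends a repeated bq param for each entry
}

docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]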
Also check out the QueryElevationComponent. It might fit your purpose (although it is stronger than just boosting...). There are two things to consider though:
Every time you change the click numbers you would need to modify the XML and reload, so it would be better if you could batch it nightly or something like that.
There was a recent JIRA issue to allow you to provide similar functionality via request params, with no need for XML/reload, so check that out too.

How do I sort search results by relevance?

I'm working on a project which searches through a database, then sorts the search results by relevance, according to a string the user inputs. I think my current search is fairly decent, but the comparator I wrote to sort the results by relevance is giving me funny results. I don’t know what to consider relevant. I know this is a big branch of information retrieval, but I have no idea where to start finding examples of searches which sort objects by relevance and would appreciate any feedback.
To give a little more background about my specific issue, the user will input a string in a website database, which stores objects (items in the store) with various fields, such as a minor and major classification (for example, an XBox 360 game might be stored with major=video_games and minor=xbox360 fields along with its specific name). The four main fields that I think should be considered in the search are the specific name, major, minor, and genre of the type of object, if that helps.
In case you don't want to use Lucene/Solr, you can always use distance metrics to find the similarity between the query and the rows retrieved from the database. Once you have the scores you can sort on them, and the results will effectively be sorted by relevance.
This is essentially what happens behind the scenes in Lucene. You can use simple similarity metrics like Manhattan distance, distance between points in n-dimensional space, etc. Look at the Lucene scoring formula for more insight.
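A minimal sketch of that idea using cosine similarity over term counts; the name/major/minor/genre fields come from the question, while the items and the flat weighting are invented.

import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bags of words.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score(query: str, item: dict) -> float:
    # Concatenate the searchable fields into one bag of words and compare to the query.
    text = " ".join([item["name"], item["major"], item["minor"], item["genre"]])
    return cosine(Counter(query.lower().split()), Counter(text.lower().split()))

items = [
    {"name": "Halo 3", "major": "video_games", "minor": "xbox360", "genre": "shooter"},
    {"name": "garden chair", "major": "furniture", "minor": "outdoor", "genre": "seating"},
]

query = "xbox360 shooter"
ranked = sorted(items, key=lambda item: score(query, item), reverse=True)
print([item["name"] for item in ranked])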

Ways to do "related searches" functionality

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering about the best way to model this for a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query, and then, when a new search is performed, find all the historical searches that match some of those top 10 results but ideally not all of them (matching all of them might suggest an equivalent search and hence not be that useful as a suggestion).
I imagine that some people have built this functionality before and may be able to offer ideas on different ways to do it. I'm not necessarily looking for one winning idea, since the solution will no doubt vary substantially depending on the size and nature of the site.
Have you considered a matrix with keywords on one axis and documents on the other? Once you have the set of vectors representing the keywords, find the sets of keyword(s) found in your initial result set and then rank the other keywords by how many documents they reference, or by how many times they intersect the initial result set.
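A rough sketch of that idea: build a keyword -> documents index, then rank other keywords by how often their documents overlap the initial result set (all data below is invented).

# keyword -> set of document ids containing it (a toy inverted index).
index = {
    "garden": {1, 2, 5},
    "chair": {1, 3},
    "hammock": {2, 5},
    "table": {3, 4},
}

def related_keywords(initial_results, exclude):
    # Rank other keywords by how many of their documents fall inside the initial result set.
    scores = {}
    for keyword, docs in index.items():
        if keyword in exclude:
            continue
        overlap = len(docs & initial_results)
        if overlap:
            scores[keyword] = overlap
    return sorted(scores, key=scores.get, reverse=True)

initial_results = index["garden"] | index["chair"]   # results for "garden chair"
print(related_keywords(initial_results, exclude={"garden", "chair"}))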
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
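For what it's worth, here is a sketch of the approach you describe: store the top 10 result ids per query, then suggest past queries whose result sets partially (but not fully) overlap the new one. The stored data and overlap thresholds are invented.

# query -> set of top-10 result ids, stored from previous searches (toy data).
history = {
    "garden chair": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "outdoor furniture": {2, 3, 5, 8, 11, 12, 13, 14, 15, 16},
    "xbox games": {20, 21, 22, 23, 24, 25, 26, 27, 28, 29},
}

def suggest(current_top10, min_overlap=0.2, max_overlap=0.9):
    suggestions = []
    for past_query, past_results in history.items():
        jaccard = len(current_top10 & past_results) / len(current_top10 | past_results)
        # Too little overlap -> unrelated; too much -> essentially the same search.
        if min_overlap <= jaccard <= max_overlap:
            suggestions.append((jaccard, past_query))
    return [q for _, q in sorted(suggestions, reverse=True)]

print(suggest({2, 3, 5, 8, 9, 30, 31, 32, 33, 34}))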
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (e.g. with the Porter Stemming Algorithm) and then use the stemmed terms instead of the initially provided queries, giving the user the option of searching for exactly the terms they specified (or do the opposite: search the exact terms first, and use stemming to find other terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexer finds them.) There is a small stemming sketch after this list.
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm you can look at for keyword extraction techniques).
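For the stemming option above, a small sketch using NLTK's PorterStemmer (any stemmer would do; the query is just an example), mapping each stem back to its surface forms so the exact terms can still be offered as an alternative search.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

query = "searching gardening chairs"
stems = {}
for term in query.lower().split():
    # Group the original terms under their common stem.
    stems.setdefault(stemmer.stem(term), []).append(term)

print(stems)  # e.g. {'search': ['searching'], 'garden': ['gardening'], 'chair': ['chairs']}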
