We are building a search engine where users give a relevance score (1 to 5) for the retrieved query results. We want to use this feedback (the results together with their relevance scores) to improve future query results.
So far we have built the first part, i.e., a BERT-based similarity search model. Now we are looking to build the second part. If anyone has any ideas, please share.
Well, as far as I understand from your description, you have BERT-encoded the documents, and when a user enters a query, you BERT-encode it and find the documents most similar to the query.
Whenever you perform the similarity search, you get more than one result, depending on how many documents you are configured to retrieve. Let's say you are set up to return the 10 documents most similar to the user query. Now, if the user gives a high relevance score to the third document, next time you might want to show that document in first position instead of third.
In that case, you can maintain a table in your database that stores a relevance score for every document. Whenever the search engine retrieves documents for a query, you look up the relevance scores of the retrieved documents, rearrange them according to those scores, and show them to the user.
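To make that concrete, here is a minimal Python sketch of the re-ranking step. The dictionary standing in for the feedback table, the document ids, and the tie-breaking rule are just illustrative, not something from your setup:

```python
# Minimal sketch: re-rank BERT similarity results using stored user feedback.
# `feedback_scores` stands in for the database table of doc_id -> relevance (1..5);
# the names and the tie-breaking choice here are illustrative.

def rerank(retrieved, feedback_scores, default_relevance=0):
    """retrieved: list of (doc_id, similarity) pairs from the BERT search,
    ordered by similarity. Returns the same docs re-ordered so that higher
    user relevance wins, with similarity as the tie-breaker."""
    return sorted(
        retrieved,
        key=lambda pair: (feedback_scores.get(pair[0], default_relevance), pair[1]),
        reverse=True,
    )

# Example: the third hit was rated 5 by users, so it moves to the top.
retrieved = [("doc_a", 0.91), ("doc_b", 0.88), ("doc_c", 0.85)]
feedback_scores = {"doc_c": 5, "doc_a": 2}
print(rerank(retrieved, feedback_scores))
# [('doc_c', 0.85), ('doc_a', 0.91), ('doc_b', 0.88)]
```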
I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping
English word -> A set of tweetIds having this word
Now if we want to find all the tweets that contain some word, all we need to do is query all servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity", but isn't that a heavy task, especially if the word is a hot word?
What is done in practice in such search systems?
Maybe some tradeoffs are being used?
Thanks!
First of all, there are two types of indexes: local and global.
A local index is stored on the same computer as the tweet data. For example, you may have 10 shards, and each of these shards will have its own index; e.g. the word "car" -> a sorted list of tweet ids.
When a search is run, we have to send the query to every server, since we don't know where the most popular tweets are. The query asks every server to return its top results. All of these results are collected on the same box - the one executing the user request - and that process picks the top 10 out of the entire population.
Since each list is already sorted in the index itself, picking the overall top 10 is cheap - effectively constant time with respect to the corpus size, because we do a simple heap/watermark merge over a fixed number of candidate tweets.
A second nice property is that we can do pagination: the next query is also sent to every box with additional data - "give me your top 10 with popularity below X", where X is the popularity of the last tweet returned to the customer.
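To make the scatter-gather part concrete, here is a rough Python sketch. The shard layout, ids, and popularity values are made up, and for simplicity the popularity cutoff is applied after collecting the results, whereas in the real system each shard would apply it before returning:

```python
import heapq

# Sketch of the scatter-gather merge described above. Each shard returns its
# own top hits as (popularity, tweet_id), already sorted by popularity descending.

def merge_top_k(shard_results, k=10, below=None):
    """Merge per-shard result lists and return the overall top k.
    `below` implements the pagination trick: only keep tweets whose
    popularity is strictly below the last popularity already shown."""
    candidates = []
    for results in shard_results:
        for popularity, tweet_id in results:
            if below is None or popularity < below:
                candidates.append((popularity, tweet_id))
    # Only a fixed number of candidates (at most k per shard), so this heap
    # selection is constant-time with respect to the corpus size.
    return heapq.nlargest(k, candidates)

shard_results = [
    [(950, "t1"), (800, "t4")],   # shard 0
    [(990, "t2"), (700, "t5")],   # shard 1
    [(900, "t3"), (650, "t6")],   # shard 2
]
page1 = merge_top_k(shard_results, k=3)
page2 = merge_top_k(shard_results, k=3, below=page1[-1][0])
print(page1)  # [(990, 't2'), (950, 't1'), (900, 't3')]
print(page2)  # [(800, 't4'), (700, 't5'), (650, 't6')]
```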
A global index is a different beast - it does not live on the same boxes as the data (it could, but does not have to). In that case, when we search for a keyword, we know exactly where to look. And the index itself is also sorted, so it is fast to get the 10 most popular results (or to paginate).
Since the global index returns only tweet ids and not the tweets themselves, we have to look up the tweet for every id - this is the N+1 problem: one query to get the list of ids and then one query per id. There are several ways to soften this - caching and data duplication are by far the most common approaches.
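A toy illustration of the caching side, with a hypothetical batched `fetch_tweets_from_store` standing in for a single "get many by ids" call to the datastore (not a real API):

```python
# Toy illustration of softening the N+1 problem with a cache plus one batched lookup.

tweet_cache = {}

def fetch_tweets_from_store(ids):
    # Placeholder for one batched datastore query instead of one query per id.
    return {tweet_id: {"id": tweet_id, "text": f"tweet {tweet_id}"} for tweet_id in ids}

def hydrate(tweet_ids):
    missing = [tid for tid in tweet_ids if tid not in tweet_cache]
    if missing:
        tweet_cache.update(fetch_tweets_from_store(missing))
    return [tweet_cache[tid] for tid in tweet_ids]

print(hydrate(["t2", "t1", "t3"]))  # one batched fetch, then served from the cache
```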
I am building a search page with Azure Search. On my page, I have a search box. I want to provide suggestions to the users. In an attempt to do this, I'm using the Suggestions endpoint on my index. At this time, I have a request that includes the following query string:
search=sta&suggesterName=sites&$top=3
My question is, how does $top determine which three results to return? Is it the first three matches it encounters when going through the search index, or is it something else? Based on the URL structure, I don't think it's using a scoring profile, so I ruled out relevance. But then I started reading about the minimumCoverage field and I got confused.
If the suggest endpoint just returns the first [top] matches it encounters, then why is the minimumCoverage field even needed?
In general, $top will give you the top N results based on whatever order the rest of the query specifies. For queries with no $orderby, the sort order is descending by relevance score. This applies to both Suggest and Search.
Note that just because you don't have a scoring profile (such as with Suggest), that doesn't mean Azure Search doesn't calculate relevance scores for each document. Scoring profiles can influence the score, but they do not completely define it.
For queries with an $orderby, the order of results is defined first by the fields in the $orderby, and then by score if there are any ties to be broken.
minimumCoverage has nothing to do with ordering or $top. It has to do with the way search queries are distributed. Every query is executed concurrently against different subsets of the index (this happens regardless of whether or not you have multiple search units). Sometimes one of these subsets fails to execute for whatever reason, usually when your search service is under heavy load. The minimumCoverage parameter provides a way to relax the rule that normally says "X% of the index must successfully execute the query in order to consider the overall query a success" (X is 100 by default for Search and 80 by default for Suggest). This is a way to trade off completeness of search results for higher availability in case of heavy load or partial outages.
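For concreteness, a Suggest request with these knobs might look roughly like this over the REST API. The service name, index name, api-version, key, and the `lastUpdated` field are placeholders, not taken from your setup:

```python
import requests

# Rough sketch of a Suggest request with explicit ordering and relaxed coverage.
url = "https://<your-service>.search.windows.net/indexes/sites-index/docs/suggest"
params = {
    "api-version": "2020-06-30",
    "search": "sta",
    "suggesterName": "sites",
    "$top": 3,                       # top 3 by relevance score when no $orderby is given
    "$orderby": "lastUpdated desc",  # optional: overrides score-based ordering
    "minimumCoverage": 70,           # accept results if >= 70% of the index answered
}
response = requests.get(url, params=params, headers={"api-key": "<query-key>"})
print(response.json())
```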
How does page ranking in Elasticsearch work? Once we create an index, is there an underlying intelligent layer that creates a metadata repository and returns query results based on relevance? I have created several indices, and I want to know how the results are ordered once a query is issued. Also, is there a way to influence these results based on relationships between different records?
Do you mean how documents are scored in elasticsearch? Or are you talking about the 'page-rank' in elasticsearch?
Documents are scored based on how well the query matches the document. This approach is based on TF-IDF (term frequency-inverse document frequency). There is, however, no 'page-rank' in Elasticsearch. 'Page-rank' takes into consideration how many documents point towards a given document: a document with many in-links is weighted higher than others. This is meant to reflect whether a document is authoritative or not.
In Elasticsearch, however, relations between documents are not taken into account when it comes to scoring.
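To illustrate the TF-IDF idea, here is a toy scorer. This is not Lucene's exact similarity formula (which adds length normalization and other factors, and newer Elasticsearch versions default to BM25); the documents and tokenization are made up:

```python
import math
from collections import Counter

# Toy TF-IDF scorer: a term contributes more if it is frequent in the document
# (TF) and rare across the corpus (IDF).

docs = [
    "elasticsearch scores documents by term relevance",
    "page rank counts links between documents",
    "term frequency matters for relevance",
]
tokenized = [d.split() for d in docs]

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / (1 + df)) + 1

def score(query, doc_tokens):
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf(t) for t in query.split())

for d in docs:
    print(round(score("term relevance", d.split()), 3), "-", d)
```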
How would one go about setting up Elasticsearch so that it returns personalized results?
For example, I would want results returned to a particular user to rank higher if they clicked on a result previously, or if they "starred" that result in the past. You could also have a "hide" option that pushes results further down the ranking. From what I've seen with elasticsearch so far, it seems difficult to return different rankings to users based on that user's own dynamic data.
The solution would have to scale to thousands of users doing a dozen or so searches per day. Ideally, I would like the ranking to change in real-time, but it's not critical.
Elasticsearch provides a wide variety of scoring options, but to achieve what you have described you will need to do some additional work.
The function score query and the terms lookup mechanism of the terms filter would be the tools of choice here.
First, create a document per user recording the links (or link IDs) they have visited and the links they have liked. This should be housed in a separate index, and it should be maintained from the client side, i.e. updated as the user interacts with results.
Now, when a user queries the data index, run a function score query with a filter function pointing at these fields, as sketched below.
With this approach, since the filter is cached, you should get decent performance too.
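A hedged sketch of what such a query body could look like. The index names (`links`, `user-profiles`), field names (`link_id`, `liked_link_ids`, `content`), the user id, and the weight are illustrative, and very old Elasticsearch versions also require a `type` in the terms lookup:

```python
import json

# function_score + terms lookup: the filter pulls the user's liked link ids
# from their profile document at query time and boosts matching results.
query = {
    "query": {
        "function_score": {
            "query": {"match": {"content": "holiday deals"}},
            "functions": [
                {
                    "filter": {
                        "terms": {
                            "link_id": {            # field on the data index
                                "index": "user-profiles",
                                "id": "user_42",    # the user's profile document
                                "path": "liked_link_ids",
                            }
                        }
                    },
                    "weight": 5,                    # boost previously liked results
                }
            ],
            "boost_mode": "multiply",
        }
    }
}

# Send this body to POST /links/_search (via the elasticsearch client or curl).
print(json.dumps(query, indent=2))
```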
We are switching from SQL Full-Text Search to Lucene (the SOLR stack) in the next few months. One last wrinkle in figuring out our strategy here has to do with replicating one current part of our search platform.
First, some nomenclature to describe the problem: our site has a bunch of documents. People might "add" those documents, they might "favorite" those documents, they might "read" those documents, etc. Let's call the union of such documents for a given user their "personal documents". Some documents are public, and some are private so that only the logged-in user can see them.
Currently, we have a weighting function that will always show a given user's "personal" documents FIRST in the search list, for any search. This outranks the normal order (but a document must be valid in the result set -- it just ranks above any other less important document). In SQL, we are able to achieve this by having a user-defined-function that returns a score, and it varies by user.
An analogy is Facebook -- where, when you type "Joe", it will first find all the Joes that you know, followed by any other Joe that meets the criteria. My search for "Joe" will return a different ordered set than your search for Joe.
In the world of Lucene/SOLR, as I understand it, I cannot figure out how to have such user-centric weighting of documents without two separate queries that are then effectively UNIONed together (I know, it's not relational, but you get the idea). We have millions of users, and hundreds of thousands of documents. If a user is logged in, we want "their documents" to show up first in any search, then the rest of all documents. And in each case, we want the search results to show only those documents that match the original search -- we're just talking about rank-order.
Can you think of any strategies here to reproduce this user-defined-function feature?
Can you afford to have a field in each document indicating that this particular document belongs to Jim (e.g. user123Doc:1)? If so, you could solve it by sorting the result set by {user123Doc, score, ...}.
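To sketch one way of expressing that sort in Solr, here I assume a single multivalued `owner_ids` field listing the users a document is "personal" to (rather than one boolean field per user), plus a `visibility` field for the public/private split; the URL, collection, and field names are illustrative:

```python
import requests

# "My documents first, then everything else", still restricted to documents
# that match the query and that this user is allowed to see.
user_id = "123"
params = {
    "q": "contract renewal",
    "fq": "visibility:public OR owner_ids:" + user_id,  # only docs this user may see
    # termfreq() is 1 for the user's own documents and 0 otherwise, so they sort
    # first; relevance score breaks ties within each group.
    "sort": f"termfreq(owner_ids,'{user_id}') desc, score desc",
    "fl": "id,title,score",
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/documents/select", params=params)
print(response.json()["response"]["docs"])
```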
Or, if you don't want to store this information in Lucene, you can store this elsewhere (e.g. in the database) and implement FieldComparator so it works with these values. More on this is available here.