How does page ranking in Elasticsearch work? Once we create an index, is there an underlying intelligent layer that builds a metadata repository and returns query results based on relevance? I have created several indices and want to know how the results are ordered once a query is issued. And is there a way to influence these results based on relationships between different records?
Do you mean how documents are scored in Elasticsearch, or are you talking about 'page rank' in Elasticsearch?
Documents are scored based on how well the query matches the document. This approach is based on TF-IDF (term frequency–inverse document frequency). There is, however, no 'page rank' in Elasticsearch. PageRank takes into consideration how many documents point to a given document: a document with many in-links is weighted higher than others, which is meant to reflect whether or not that document is authoritative.
In Elasticsearch, however, relations between documents are not taken into account when it comes to scoring.
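To make the TF-IDF idea concrete, here is a minimal Python sketch with a toy, made-up corpus. It only illustrates "term frequency times inverse document frequency"; Lucene's real scoring function adds length and other normalizations, and recent Elasticsearch versions default to BM25.

```python
import math
from collections import Counter

# Toy corpus; in Elasticsearch these would be the indexed documents.
docs = {
    "d1": "elasticsearch stores json documents",
    "d2": "documents are scored by relevance",
    "d3": "relevance scoring uses term statistics",
}

def tf_idf_scores(query, docs):
    """Rank documents by a bare TF-IDF score against the query."""
    n_docs = len(docs)
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized.values() if term in t)  # document frequency
            if df == 0:
                continue
            idf = math.log(n_docs / df) + 1.0   # rarer terms weigh more
            score += tf[term] * idf             # term frequency * inverse document frequency
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(tf_idf_scores("relevance scoring", docs))
# "relevance" appears in d2 and d3, "scoring" only in d3, so d3 ranks first.
```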
We are building a search engine where users give a relevance score (1 to 5) for the retrieved query results. We want to use this feedback (results with relevance scores) to improve future query results.
So far we have built the first part, i.e., a BERT-based similarity search model. Now we are looking to build the second part. If anyone has ideas, please share.
As far as I understand from your description, you have BERT-encoded the documents, and when a user enters a query you BERT-encode it as well and retrieve the documents most similar to the query.
Whenever you perform the similarity search, you get more than one result, depending on how many documents you have configured it to retrieve. Say you return the 10 documents most similar to the user query, and the user gives a high relevance score to the third document; next time you might want to show that document first instead of third.
In that case, you can maintain a table in your database that stores a relevance score for each document. Whenever the search engine retrieves documents for a query, you look up the relevance scores of the retrieved documents, rearrange them accordingly, and show them to the user.
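As a rough illustration, here is a minimal Python sketch of that re-ranking step. The feedback table is stood in for by a plain dict, and the document IDs and field layout are made up; the point is simply to reorder the retriever's output by stored user relevance, falling back to the similarity score as a tie-breaker.

```python
# Feedback table: doc_id -> average user relevance (1-5); stands in for the DB table.
relevance_feedback = {"doc_17": 5, "doc_03": 4, "doc_42": 2}

def rerank(retrieved, feedback):
    """Reorder (doc_id, similarity) pairs: stored relevance first, similarity as tie-breaker."""
    return sorted(
        retrieved,
        key=lambda hit: (feedback.get(hit[0], 0), hit[1]),
        reverse=True,
    )

# Output of the BERT similarity search as (doc_id, cosine similarity) pairs.
retrieved = [("doc_03", 0.91), ("doc_55", 0.88), ("doc_17", 0.85), ("doc_42", 0.80)]
print(rerank(retrieved, relevance_feedback))
# doc_17 (relevance 5) is now shown first even though its raw similarity was lower.
```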
I saved 5 identical documents to my Azure Search Index, with a weight of 1 applied to a name field (below).
var fieldWeights = new Dictionary<string, double>
{
    {"name", 1},
};
As all the saved documents are identical, I was expecting them all to be returned with the same search score. From the image below you can see that the first two scores are the same, but the last three are a bit lower.
You might find the following article useful: How full-text search works in Azure Search, especially the section about scoring in a distributed index. It explains that, because Azure Search indexes are sharded to facilitate efficient scaling, the relevance scores of documents mapped to different shards can differ slightly, as term statistics are computed at the shard level. In general, we don't recommend building any programmatic dependency on the value of the relevance score, as it is not stable or consistent, for several reasons. An accurate relative order of documents in the result set is what we optimize for.
I'm using Solr to store the documents used by search in my application. The Solr instance is shared by multiple applications, and the data is grouped by an application id that is unique for each application.
When calculating the TF-IDF score, Solr uses the total number of documents in the index. How do I change the configuration so that the IDF is based only on the documents belonging to the given application id, rather than on all documents across applications?
Even if you store all docs in one collection, there is still something you can do!
Unless you enable ExactStatsCache in your solrconfig.xml like this:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
similarity calculations are per shard, not per total collection.
So, if you shard your docs by application_id, you will get 'better' scores, closer to what you want. It will be exactly what you want if you have one application_id per shard, but if you have many applications and few shards, you will end up with more than one application per shard.
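In SolrCloud with the default compositeId router, one way to do that sharding is to prefix each document id with the application id, so all of an application's documents land on the same shard. A rough sketch using plain HTTP against Solr's JSON update endpoint (the collection name, URL, and field names are assumptions):

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update?commit=true"

docs = [
    # "app42!" is the compositeId routing prefix: same prefix -> same shard,
    # so per-shard term statistics are dominated by that application's data.
    {"id": "app42!doc1", "application_id": "app42", "text": "first document"},
    {"id": "app42!doc2", "application_id": "app42", "text": "second document"},
    {"id": "app7!doc1",  "application_id": "app7",  "text": "another app's document"},
]

resp = requests.post(SOLR_UPDATE, json=docs)
resp.raise_for_status()
```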
If you store them in one collection, I am afraid it's not possible with built-in functionality.
I think you have a couple of choices. You could store each application's data in a separate collection; then you get IDF based only on that application's data out of the box.
If that is not suitable, you will need to write your own Similarity, probably by extending https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html and overriding the method public abstract float idf(long docFreq, long docCount), which is responsible for calculating IDF.
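Writing the actual Similarity is Java/Lucene work, but as a rough Python sketch of the arithmetic such an idf override would need: restrict both counts to one application's documents. The per-application statistics here are made-up assumptions that you would have to gather yourself (e.g. via a facet query).

```python
import math

# Hypothetical per-application statistics, e.g. gathered via facet queries.
app_doc_count = {"app42": 120_000, "app7": 3_500}                      # docs per application
app_doc_freq  = {("app42", "invoice"): 900, ("app7", "invoice"): 1_200}  # docs containing the term

def per_app_idf(app_id, term):
    """Roughly Lucene's classic IDF, restricted to one application's documents."""
    doc_count = app_doc_count[app_id]
    doc_freq = app_doc_freq.get((app_id, term), 0)
    return math.log(doc_count / (doc_freq + 1)) + 1.0

print(per_app_idf("app42", "invoice"))  # IDF as if only app42's docs existed
print(per_app_idf("app7", "invoice"))   # same term, very different IDF for app7
```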
Overall, I think the first approach will suit your needs much better.
How would one go about setting up Elasticsearch so that it returns personalized results?
For example, I would want results returned to a particular user to rank higher if they clicked on a result previously, or if they "starred" that result in the past. You could also have a "hide" option that pushes results further down the ranking. From what I've seen with Elasticsearch so far, it seems difficult to return different rankings to users based on each user's own dynamic data.
The solution would have to scale to thousands of users doing a dozen or so searches per day. Ideally, I would like the ranking to change in real-time, but it's not critical.
Elasticsearch provides a wide variety of scoring options, but to achieve what you've described you will need to do some additional work.
The function score query, combined with a terms filter that uses the terms-lookup mechanism, would be the tools of choice.
First, create one document per user listing the link IDs they have visited and the links they have liked. This should live in a separate index, and it should be maintained from the client side, updating the record as the user interacts with results.
Now, when a user searches the data index, run a function score query whose filter functions point to these fields.
In this approach, since the filter is cached, you should get decent performance too.
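A hedged sketch of what such a query could look like, using the Python Elasticsearch client. The index names, field names, and the layout of the per-user profile document are all assumptions, and the exact terms-lookup syntax varies a little between Elasticsearch versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

user_id = "user_42"  # id of this user's profile document in a separate index

query = {
    "function_score": {
        "query": {"match": {"title": "kayaking trips"}},
        "functions": [
            {   # boost results the user previously clicked or starred
                "filter": {
                    "terms": {
                        "doc_id": {
                            "index": "user_profiles",
                            "id": user_id,
                            "path": "starred_ids",  # terms lookup: read IDs from the profile doc
                        }
                    }
                },
                "weight": 2.0,
            },
            {   # demote results the user chose to hide
                "filter": {
                    "terms": {
                        "doc_id": {
                            "index": "user_profiles",
                            "id": user_id,
                            "path": "hidden_ids",
                        }
                    }
                },
                "weight": 0.2,
            },
        ],
        "score_mode": "multiply",
        "boost_mode": "multiply",
    }
}

results = es.search(index="articles", query=query)
```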
How do you implement a "similar items" system for items described by a set of tags?
In my database, I have three tables: Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article I want to find the five most similar articles, to implement an "if you like this article you will like these too" system.
I am familiar with cosine similarity, and using that algorithm works very well. But it is way too slow. For each article, I need to iterate over all articles, calculate the cosine similarity for the article pair, and then select the five articles with the highest similarity rating.
With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces roughly as good results as cosine similarity, but that can be run in real time and does not require me to iterate over the whole document corpus each time.
Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I looked at do not support document-similarity searching.
Some questions:
How is ArticleTag different from Tag? Or is that the M2M mapping table?
Can you sketch out how you've implemented the cosine matching algorithm?
Why don't you store your document tags in an in-memory data structure of some sort, using it only to retrieve document IDs? That way, you only hit the database at retrieval time.
Depending on the frequency of document additions, this structure can be designed for fast or slow updates.
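For example, a minimal in-memory sketch with made-up data: keep an inverted index from tag to article IDs, so for a given article you only score the candidates that share at least one tag, rather than the whole 200k corpus.

```python
import heapq
import math
from collections import defaultdict

# Tag sets per article, loaded once from the Article/ArticleTag/Tag tables.
article_tags = {
    1: {"python", "search", "tags"},
    2: {"python", "similarity"},
    3: {"cooking", "recipes"},
    4: {"search", "similarity", "tags"},
}

# Inverted index: tag -> set of article IDs carrying that tag.
tag_index = defaultdict(set)
for art_id, tags in article_tags.items():
    for tag in tags:
        tag_index[tag].add(art_id)

def top_similar(art_id, k=5):
    """Cosine similarity over binary tag vectors, restricted to candidates sharing a tag."""
    tags = article_tags[art_id]
    candidates = set().union(*(tag_index[t] for t in tags)) - {art_id}
    def cosine(other):
        shared = len(tags & article_tags[other])
        return shared / math.sqrt(len(tags) * len(article_tags[other]))
    return heapq.nlargest(k, ((cosine(c), c) for c in candidates))

print(top_similar(1))  # article 4 ranks highest: it shares "search" and "tags"
```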
Initial intuition towards an answer: I'd suggest an online clustering algorithm (perhaps a principal component analysis on the tag co-occurrence matrix, which will approximate a k-means clustering?). This can be refined once you answer some of the questions above.
Cheers.
You can do it with the Lemur toolkit. With its KeyfileIncIndex, you have to re-retrieve the document from its source; the IndriIndex supports retrieving the document from the index.
But anyway, you index your documents, and then you build a query from the document you want to find similar documents to. You can then do a search with that query, and it will score the other documents for similarity. It's pretty fast in my experience. It treats both source documents and basic queries as documents, so finding similarities is really what it does (unless you're using the Indri parser stuff - that's a bit different, and I'm not sure how it works).