Cloudant Lucene boost factor during indexing

At indexing time it is possible to set a boost factor value, which then changes the position of a specific record in the array of returned documents.
Example:
index("default", doc.my_field, {"index": "analyzed", "boost": doc.boostFactor});
When applying this boost factor I can see that the sorting changes. However, it appears to be rather random.
I would expect a number greater than 1 to sort the document higher.
Has anybody managed to get the boost factor in Cloudant to work correctly?

Yes, the Cloudant boost factor should work correctly. Setting a boost on a field of a specific doc will modify the score of this doc when searching on this field: Score = OriginalScore * boost.
Do you search on the same field you boost? What does your query look like? Does the field my_field consist of multiple tokens? This may also influence scoring (e.g. longer fields get scored lower).
You can observe the scores of docs in the order field of the results, and then modify the boost and observe how the scores change.
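As a way to check this, here is a minimal Python sketch against the plain HTTP search API (the account URL, database, design document, index name, query, and credentials are all hypothetical placeholders) that prints each hit's order field so you can watch the score move as you change the boost:

import requests

# All names and credentials below are hypothetical placeholders.
ACCOUNT = "https://myaccount.cloudant.com"
DB, DDOC, INDEX = "mydb", "mydesign", "default"

resp = requests.get(
    f"{ACCOUNT}/{DB}/_design/{DDOC}/_search/{INDEX}",
    params={"q": "my_field:hello", "limit": 10},
    auth=("username", "password"),
)
resp.raise_for_status()

for row in resp.json()["rows"]:
    # The first element of "order" is the Lucene relevance score.
    print(row["id"], row["order"])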

Related

Azure Search Score showing different values for identical documents

I saved 5 identical documents to my Azure Search Index, with a weight of 1 applied to a name field (below).
var fieldWeights = new Dictionary<string, double>
{
    { "name", 1 },
};
As all documents saved are identical, I was expecting all of them to be returned with the same search score. Instead, the first two scores are the same but the last three are slightly lower.
You might find the following article useful: How full-text search works in Azure Search, especially the section about scoring in a distributed index. It explains that because Azure Search indexes are sharded to facilitate efficient scaling operations, the relevance scores of documents mapped to different shards can differ slightly, as term statistics are computed at the shard level. In general, we don't recommend developing any programmatic dependency on the value of the relevance score, as it is not stable or consistent for different reasons. An accurate relative order of documents in the result set is what we're optimizing for.
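To make the shard effect concrete, here is a small back-of-the-envelope Python sketch. It uses the classic Lucene TF-IDF idf formula purely as an illustration (an assumption; it is not Azure Search's exact scoring function) with made-up per-shard statistics, showing how identical documents on different shards can end up with slightly different scores:

import math

def idf(doc_freq: int, doc_count: int) -> float:
    # Classic Lucene TF-IDF idf: 1 + ln(docCount / (docFreq + 1)).
    return 1.0 + math.log(doc_count / (doc_freq + 1))

# Made-up statistics: the five identical docs split across two shards
# that also hold different amounts of unrelated data.
print(idf(doc_freq=2, doc_count=1000))  # shard A
print(idf(doc_freq=3, doc_count=1400))  # shard B: slightly different idf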

SOLR IDF Max docs configuration

I'm using Solr for storing the documents used by search in my application. The Solr instance is shared by multiple applications, and the data is grouped based on the application id, which is unique for each application.
For calculating the TF-IDF score, Solr uses the total number of documents available in it. How do I change that configuration so that the IDF is computed based only on the documents available for the application id, rather than counting all the documents across applications?
Even if you store all docs in one collection, there is still something you can do!
Unless you enable ExactStatsCache in your solrconfig.xml like this:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
similarity calculations are per shard, not for the total collection.
So, if you shard your docs by your application_id, then you will get 'better' scores, closer to what you want. It will be exactly what you want if you have one application_id per shard, but if you have many applications and not many shards you will get more than one app per shard.
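If you take the sharding route, one way to do it in SolrCloud (a sketch, assuming the default compositeId router and a hypothetical collection name) is to prefix each document id with the application id, which routes all of an application's documents to the same shard:

import requests

SOLR = "http://localhost:8983/solr/myapps"  # hypothetical collection

docs = [
    # With the compositeId router, the "app42!" prefix ensures all of
    # this application's documents land on the same shard.
    {"id": "app42!doc1", "application_id": "app42", "text": "first doc"},
    {"id": "app42!doc2", "application_id": "app42", "text": "second doc"},
]

resp = requests.post(f"{SOLR}/update?commit=true", json=docs)
resp.raise_for_status()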
If you store them in one collection, I am afraid it's not possible with built-in functionality.
I think you have several choices. Store each application's data in a separate collection; then you will have IDF based only on the specific application's data out of the box.
If this is not suitable for you, you will need to write your own Similarity, probably by extending https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html and overriding the method public abstract float idf(long docFreq, long docCount), which is responsible for calculating the IDF.
Overall, I think the first approach will suit your needs much better.

Finding the number of documents that contain a term in elasticsearch

I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.
You can use the Count API to just return the count from a query, instead of a full document listing.
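For example, a minimal Python sketch against the _count endpoint (the index and field names are placeholders); for thousands of terms you would loop or batch these, but each count is still executed as a query:

import requests

ES = "http://localhost:9200"
INDEX, FIELD = "docs", "body"  # hypothetical index and field

def doc_frequency(term: str) -> int:
    # _count runs the query but returns only the number of matching
    # documents, skipping fetching and scoring of the documents.
    resp = requests.post(
        f"{ES}/{INDEX}/_count",
        json={"query": {"term": {FIELD: term}}},
    )
    resp.raise_for_status()
    return resp.json()["count"]

for term in ["alpha", "beta", "gamma"]:
    print(term, doc_frequency(term))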
As far as whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't have a store of that information outside the index, because that is exactly what a Lucene index already is. That's what an inverted index is: a map of documents indexed by term. Lucene is designed around making it efficient to look up documents by term.

Solr: Boosting documents based on a numeric 'popularity' field - do it at index time or query time?

I'm reading the solr cookbook and it suggests using a boost function bf=product(popularity) parameter to boost certain documents based on the "popularity" score.
This could also be implemented using an index-time boost on the document, right?
So which is the better option? Is there a difference in terms of:
Functionality?
Performance?
This depends on how often your popularity changes. If it is pre-baked and changes infrequently, then you can boost at index time. If it changes frequently (e.g. based on live searches), then you probably want to store it externally to the index, using (for example) ExternalFileField.
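For the query-time variant, here is a rough sketch (hypothetical core and field names, using the edismax parser) showing how the boost function travels with the request, so changing how popularity is weighted never requires re-indexing:

import requests

SOLR = "http://localhost:8983/solr/products"  # hypothetical core

params = {
    "q": "laptop",
    "defType": "edismax",
    "qf": "title description",
    "bf": "product(popularity)",  # additive boost, as in the cookbook example
    "fl": "id,title,popularity,score",
    "wt": "json",
}
resp = requests.get(f"{SOLR}/select", params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc)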

How well does Solr scale over large number of facet values?

I'm using Solr and I want to facet over a field "group".
Since "group" is created by users, potentially there can be a huge number of values for "group".
Would Solr be able to handle a use case like this? Or is Solr not really appropriate for facet fields with a large number of values?
I understand that I can set facet.limit to restrict the number of values returned for a facet field. Would this help in my case?
Say there are 100,000 matching values for "group" in a search, if I set facet.limit to 50. would that speed up the query, or would the query still be slow because Solr still needs to process and sort through all the facet values and return the top 50 ones?
Any tips on how to tune Solr for large number of facet values?
Thanks.
Since 1.4, Solr handles facets with a large number of values pretty well, as it uses a simple facet count by default (facet.method is 'fc' by default).
Prior to 1.4, Solr used a filter-based facet method (enum), which is definitely faster for faceting on attributes with a small number of values. That method requires one filter per facet value.
About facet.limit, think of it as a way to navigate through the facet space (in conjunction with facet.offset), just as you navigate through the result space with rows/offset. So a value of 10 to 50 is sensible.
As with rows/offset, and due to the nature of Solr, you can expect the performance of facet.limit/facet.offset to degrade as the offset gets bigger, but it should be perfectly fine if you stay within reasonable boundaries.
By default, Solr outputs the more frequent facets first.
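As an illustration, here is a minimal sketch of such a faceted request (hypothetical core name; the group field is from the question):

import requests

SOLR = "http://localhost:8983/solr/mycore"  # hypothetical core

params = {
    "q": "*:*",
    "rows": 0,             # only the facet counts are needed here
    "facet": "true",
    "facet.field": "group",
    "facet.method": "fc",  # the default since 1.4
    "facet.limit": 50,     # top 50 groups
    "facet.offset": 0,     # page through the facet space
    "wt": "json",
}
resp = requests.get(f"{SOLR}/select", params=params)
resp.raise_for_status()
counts = resp.json()["facet_counts"]["facet_fields"]["group"]
# Solr returns a flat [value, count, value, count, ...] list.
print(list(zip(counts[::2], counts[1::2])))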
To sum up:
Use Solr 1.4
Make sure facet.method is 'fc' (well, that's the default anyway).
Navigate through your facet space with facet.limit/facet.offset.
Don't forget to enable the faceting-related cache parameters (try different cache sizes to choose the values that fit your system well):
<filterCache class="solr.FastLRUCache" size="4096" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="5000" initialSize="5000" autowarmCount="5000"/>
