How well does Solr scale over a large number of facet values?

I'm using Solr and I want to facet over a field "group".
Since "group" is created by users, potentially there can be a huge number of values for "group".
Would Solr be able to handle a use case like this? Or is Solr not really appropriate for facet fields with a large number of values?
I understand that I can set facet.limit to restrict the number of values returned for a facet field. Would this help in my case?
Say there are 100,000 matching values for "group" in a search and I set facet.limit to 50. Would that speed up the query, or would the query still be slow because Solr needs to process and sort all the facet values before returning the top 50?
Any tips on how to tune Solr for a large number of facet values?
Thanks.

Since 1.4, Solr handles facets with a large number of values pretty well, as it uses a simple facet count by default (facet.method is 'fc' by default).
Prior to 1.4, Solr used a filter-based facet method (enum), which is definitely faster for faceting on attributes with a small number of values. That method requires one filter per facet value.
As for facet.limit, think of it as a way to navigate through the facet space (in conjunction with facet.offset), much like you navigate through the result space with rows/offset. So a value of 10 to 50 is sensible.
As with rows/offset, and due to the nature of Solr, you can expect the performance of facet.limit/facet.offset to degrade as the offset grows, but it should be perfectly fine as long as you stay within reasonable bounds.
By default, Solr returns the most frequent facet values first.
To sum up:
Use Solr 1.4
Make sure facet.method is 'fc' (well, that's the default anyway).
Navigate through your facet space with facet.limit/facet.offset.
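For illustration, a request combining these suggestions might look like this (using the "group" field from the question; the paging values are placeholders):
/select/?q=*:*&facet=on&facet.field=group&facet.method=fc&facet.limit=50&facet.offset=0
Increasing facet.offset in steps of facet.limit pages through the facet values, most frequent first.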

Don't forget to tune the faceting-related cache parameters (try different cache sizes to choose values that fit your system well):
<filterCache class="solr.FastLRUCache" size="4096" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="5000" initialSize="5000" autowarmCount="5000"/>

Related

Finding the number of documents that contain a term in elasticsearch

I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.
You can use the Count API to just return the count from a query, instead of a full document listing.
As far as whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't keep that information anywhere outside the index, because that is exactly what a Lucene index already provides. That's what an inverted index is: a map from terms to the documents that contain them. Lucene is designed to make looking up documents by term efficient.
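As a minimal sketch, a Count API request for a single term might look like this (the index name my_index, field name my_field, and term are placeholders):
curl -XGET 'localhost:9200/my_index/_count' -d '{"query": {"term": {"my_field": "some_term"}}}'
For thousands of terms you would still issue one count per term, so batching the lookups through the multi-search API may be worth testing.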

Cloudant Lucene boost factor during indexing

At indexing time it is possible to set a boost factor, which then changes the position of a specific record in the array of returned documents.
Example:
index("default", doc.my_field, {"index": "analyzed", "boost": doc.boostFactor});
When applying this boost factor I can see that the sorting changes. However, it appears to be rather random.
I would expect a number greater than 1 to sort the document higher.
Has anybody managed to get the boost factor to work correctly with Cloudant?
Yes, the Cloudant boost factor should work correctly. Setting a boost on a field of a specific doc modifies that doc's score when searching on that field: Score = OriginalScore * boost.
Do you search on the same field you boost? What does your query look like? Does the field my_field consist of multiple tokens? That can also influence scoring (e.g. longer fields are scored lower).
You can observe the scores of the docs in the order field of the results, and then, by modifying the boost, observe how the scores change.
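For illustration, a search against such an index might look like the following (account, database, design document, and search index names are all placeholders); each row of the response carries an order field whose first element is the relevance score:
GET https://$ACCOUNT.cloudant.com/mydb/_design/mydesign/_search/mysearch?q=my_field:roses
Comparing the order values across runs with different boost values should show whether the boost is actually being applied.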

Multiple queries in Solr

My problem is I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to run a query first on my whole index of say 5000 docs, which will hit around 500 docs on average. Next I would like to query using a different set of keywords on these 500 docs and NOT on the whole index.
So the first time I send a query, a score is generated. The second time I run a query, the new score should be based only on the 500 documents from the previous query; in other words, Solr should treat those 500 docs as the whole index.
To summarise: an index of 5000 will be filtered to 500 and then to 50 (5000 > 500 > 50). It's basically filtering, but I would like to do it in Solr.
I have reasonably basic knowledge and am still learning.
Update: If represented mathematically it would look like this:
results1=f(query1)
results2=f(query2, results1)
final_results=f(query3, results2)
I would like this to be accomplished programmatically; the end user will only see 50 results. So faceting is not an option.
Two likely implementations occur to me. The simplest approach would be to just add the first query to the second query:
+(first query) +(new query)
This is a good approach if the first query, which you want to filter on, changes often. If the first query is something like a category of documents, or something similar where you can benefit from reuse of the same filter, then a filter query is the better approach, using the fq parameter, something like:
q=field:query2&fq=categoryField:query1
Filter queries cache a set of document IDs to filter against, so for commonly used searches (categories, common date ranges, etc.) they can yield a significant performance benefit. For uncommon searches or user-entered search strings, though, caching may just incur needless overhead and pollute the cache with a useless result set.
Filter queries (fq) are specifically designed to do quick restriction of the result set by not doing any score calculation.
So, if you put your first query into fq parameter and your second score-generating query in the normal 'q' parameter, it should do what you ask for.
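As a sketch, the three-stage narrowing from the question (5000 > 500 > 50) could then be expressed as one scoring query plus two filters; the keyword sets here are placeholders:
/select/?q=keywords3&fq=keywords1&fq=keywords2&rows=50
Each fq narrows the candidate set without affecting scores, and the q parameter scores only the documents that survive the filters.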
See also a question discussing this issue from the opposite direction.
I believe you want to use a nested query like this:
text:"roses are red" AND _query_:"type:poems"
You can read more about nested queries here:
http://searchhub.org/2009/03/31/nested-queries-in-solr/
You should take a look at "faceted search" in Solr: http://wiki.apache.org/solr/SolrFacetingOverview It will help you with this kind of "iterative" search.

Solr: Boosting documents based on a numeric 'popularity' field - do it at index time or query time?

I'm reading the solr cookbook and it suggests using a boost function bf=product(popularity) parameter to boost certain documents based on the "popularity" score.
This could also be implemented using an index-time boost on the document, right?
So which is the better option? Is there a difference in terms of:
Functionality?
Performance?
This depends on how often your popularity changes. If it is pre-baked and changes infrequently, then you can boost at index time. If it changes frequently (e.g. based on the live searches), then you probably want to store it externally to specific records, using (for example) ExternalFileField.
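For reference, the query-time variant from the cookbook is applied per request; with the dismax/edismax parsers (bf is a dismax parameter), a sketch might look like this, with the query term as a placeholder:
/select/?q=ipod&defType=edismax&bf=product(popularity)
An index-time boost, by contrast, is baked into the index, so a document must be reindexed whenever its popularity changes; that is what makes an external source like ExternalFileField attractive for frequently changing values.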

SOLR - How to have facet counts restricted to rows returned in resultset

/select/?q=*:*&rows=100&facet=on&facet.field=category
I have around 100,000 documents indexed, but I return only 100 documents using rows=100. The facet counts returned for category, however, are counts over all matching documents.
Can we somehow restrict the facets to the result set returned, i.e. the 100 rows only?
I don't think it is possible in any direct manner, as was pointed out by Pascal.
I can see two ways to achieve this:
Method I: do the counting yourself visiting the 100 results returned. This is very easy and fast if they are categorical fields, but harder if they are text fields that need to be tokenized, etc.
Method II: do two passes:
Do a normal query without facets (you only need to request doc ids at this point)
Collect all the IDs of the documents returned
Do a second query for all fields and facets, adding a filter to restrict results to the IDs collected in step 2. Something like:
select/?q=*:*&facet=on&facet.field=category&fq=id:(312 OR 28 OR 1231 ...)
The first is far more efficient, and I would recommend it for non-textual fields. The second is computationally expensive but has the advantage of working for all types of fields.
Sorry, but I don't think it is possible. Facets are always based on all the documents matching the query.
Not a real answer, but maybe better than nothing: the result grouping feature (check it out from trunk!):
http://wiki.apache.org/solr/FieldCollapsing
where facet.field=category is then similar to group.field=category, and you will get only as many groups ('facet hits') as you specified!
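A sketch of such a grouped request (the group.limit value, i.e. docs returned per group, is illustrative):
/select/?q=*:*&group=true&group.field=category&group.limit=1&rows=100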
If you always execute the same query (q=*:*), maybe you can use facet.limit, for example:
select/?q=*:*&rows=100&facet=on&facet.field=category&facet.limit=100
Tell us if the order Solr uses in the facets is the same as in the query results.
