How do the index and inverted index work for facets in Solr?

I understand the theoretical concepts of indexes and inverted indexes. Primarily, Solr indexes documents using an inverted index (searching by tokens instead of by documents).
I've also read that Solr uses indexing for features such as facets.
As I understand it, for facets,
searching for a term and creating facets would require Solr to scan all the terms in a field and match them against all the retrieved documents containing the search term, which would be costly, so an index is used.
From what I understand, once all the documents matching the search terms are retrieved, the index is used to traverse them and count the unique values of the faceted fields.
Is this a correct understanding of the concept, or is there something else?

There is not only one way faceting works in Solr.
Solr has a heuristic to select the best method, but there is also the
facet.method parameter to select one yourself.
Your description is mostly right, but Solr is fast because it caches an
UnInvertedField instead of collecting the values from the inverted index on each request.
With DocValues there is also an efficient storage format for uninverted fields.
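For illustration, here is a minimal sketch (Python with requests; not from the original answer) of setting the facet method explicitly. The collection name mycollection and the field category are hypothetical:
import requests

params = {
    "q": "*:*",
    "rows": 0,                  # only facet counts are needed, no documents
    "facet": "true",
    "facet.field": "category",  # hypothetical field
    "facet.method": "fc",       # or "enum" / "fcs"; Solr picks a default if omitted
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycollection/select", params=params)
# With wt=json the counts come back as a flat [value, count, value, count, ...] list
print(resp.json()["facet_counts"]["facet_fields"]["category"])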
These answers may also help you:
How does Lucene/Solr achieve high performance in multi-field / faceted search?
Solr faceted search performance recommendations
http://de.slideshare.net/lucenerevolution/seeley-solr-facetseurocon2011

Related

Finding the number of documents that contain a term in Elasticsearch

I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.
You can use the Count API to return just the count from a query, instead of a full document listing.
As far as whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't keep that information anywhere outside the index, because that is exactly what a Lucene index already provides. That's what an inverted index is: a map from each term to the documents that contain it. Lucene is designed around making lookups of documents by term efficient.
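As a sketch (Python with requests; the index myindex and the field body are hypothetical), a per-term document-frequency lookup with the Count API could look like this:
import requests

# One _count request per term; a term query avoids scoring and listing hits
for term in ["elasticsearch", "lucene", "solr"]:
    query = {"query": {"term": {"body": term}}}
    resp = requests.get("http://localhost:9200/myindex/_count", json=query)
    print(term, resp.json()["count"])  # number of documents containing the term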

full text search in databases

I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results of Whoosh are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document that gets searched, and after I have my result, use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it? Or am I wrong?
Thanks.
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB; that gives you more reliability and flexibility for reindexing. Mind that ES indexes non-relational data: you can keep your data stored in relational form and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full-text search is quite different from standard relational database storage. So, the data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or, the opposite, denormalize it so the parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes relational databases support full-text search; PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out, and then merge them in the client code with the details from the original database records.
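A minimal sketch of that last pattern (Python; the index, table, and field names are all hypothetical), assuming each indexed document carries the row's primary key in an id field:
import requests, sqlite3

# 1) Ask the search engine only for the matching IDs
resp = requests.get(
    "http://localhost:9200/products/_search",
    json={"query": {"match": {"description": "wireless headphones"}},
          "_source": ["id"], "size": 20},
)
ids = [hit["_source"]["id"] for hit in resp.json()["hits"]["hits"]]

# 2) Fetch the authoritative rows from the relational database
if ids:
    conn = sqlite3.connect("shop.db")
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT * FROM products WHERE id IN (%s)" % placeholders, ids
    ).fetchall()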

How does Lucene organize and walk through the inverted index?

In SQL, an index is typically some kind of balanced tree (ordered nodes pointing to the real table rows) that makes searching possible in O(log n). Walking through such a tree is actually the searching process.
Now Lucene uses an inverted index with term frequencies: for each term, it stores how often it occurs in which documents. This is easy to understand. But it doesn't explain how a search on such an index is actually performed.
The search string is analyzed and split into terms the same way, of course, and then "the index is searched" for these terms to find the documents containing them – but how? Is the Lucene index itself also ordered and organized in some tree-like way so that O(log n) is possible? Or is walking the Lucene index at search time actually linear, i.e. O(n)?
There is no simple answer to this. First, because the internal format has been improved from release to release, and second, because with the advent of Lucene 4, configurable codecs were introduced, which serve as an abstraction between the logical format and the actual physical format.
An (Elasticsearch) index is made up of shards and replicas, each of them being a Lucene index itself. A Lucene index is in turn made up of multiple segments, where each segment is again a Lucene index. A segment is read-only and made up of multiple artefacts which can be held in the file system or in RAM.
What's in a Lucene index from Adrien Grand is an excellent presentation on the organisation of a Lucene index. This blog article from Michael McCandless and this blog article from Elastic cover the codecs introduced with Lucene 4.
So querying a Lucene index actually means querying multiple segments in parallel, making use of a specific codec. A codec can represent a structure in the file system or in RAM, typically optimized/compressed for a particular need. The internal format can be anything from a tree to a hash map to a finite state machine, just to name a few. As soon as you use wildcard characters ("?" or "*") in your search query, this automatically results in a more or less deep traversal of your index.
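To make the O(log n) point concrete, here is a toy illustration (deliberately not Lucene's actual format): if the term dictionary is kept sorted, a term lookup is a binary search rather than a linear scan:
import bisect
from collections import defaultdict

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}

postings = defaultdict(list)        # term -> list of doc ids (the postings)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        postings[term].append(doc_id)

terms = sorted(postings)            # the sorted "term dictionary"

def lookup(term):
    i = bisect.bisect_left(terms, term)   # binary search: O(log n)
    if i < len(terms) and terms[i] == term:
        return postings[term]
    return []

print(lookup("quick"))  # -> [1, 3]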

Autocomplete and fuzzy search across multiple indices in Elasticsearch

I have multiple indices populated in my Elasticsearch engine, and I have one text search box which is supposed to query all indices for possible hits. I am planning to query these indices with fuzzy matching and autocomplete. Any suggestion on what the implementation should look like?
Use either the GET /_all/_search endpoint, or create an alias that gathers all the indices you want under it and use GET /[alias_name]/_search.
As for which field to search, I think the _all field could be a good match, depending on how you have your mappings configured (whether _all is disabled or not).
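As a sketch (Python; the field title and the example query are hypothetical), a fuzzy search across all indices could look like this:
import requests

body = {
    "query": {
        "match": {
            "title": {                    # hypothetical field; _all would also work if enabled
                "query": "elasticsarch",  # note the typo: fuzziness will still match
                "fuzziness": "AUTO",
            }
        }
    }
}
resp = requests.get("http://localhost:9200/_all/_search", json=body)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_index"], hit["_score"])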

How well does Solr scale over a large number of facet values?

I'm using Solr and I want to facet over a field "group".
Since "group" is created by users, potentially there can be a huge number of values for "group".
Would Solr be able to handle a use case like this? Or is Solr not really appropriate for facet fields with a large number of values?
I understand that I can set facet.limit to restrict the number of values returned for a facet field. Would this help in my case?
Say there are 100,000 matching values for "group" in a search and I set facet.limit to 50: would that speed up the query, or would the query still be slow because Solr still needs to process and sort all the facet values before returning the top 50?
Any tips on how to tune Solr for large number of facet values?
Thanks.
Since 1.4, Solr has handled facets with a large number of values pretty well, as it uses a simple facet count by default (facet.method defaults to 'fc').
Prior to 1.4, Solr used a filter-based facet method (enum), which is definitely faster for faceting on attributes with a small number of values. That method requires one filter per facet value.
As for facet.limit, think of it as a way to navigate through the facet space (in conjunction with facet.offset), just like you navigate through the result space with rows/offset. So a value of 10–50 is sensible.
As with rows/offset, and due to the nature of Solr, you can expect the performance of facet.limit/facet.offset to degrade as the offset gets bigger, but it should be perfectly fine if you stay within reasonable boundaries.
By default, Solr outputs the most frequent facet values first.
To sum up:
Use Solr 1.4
Make sure facet.method is 'fc' (well, that's the default anyway).
Navigate through your facet space with facet.limit/facet.offset.
Don't forget to enable and tune the faceting-related cache parameters (try different cache sizes to choose values that fit your system):
<filterCache class="solr.FastLRUCache" size="4096" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="5000" initialSize="5000" autowarmCount="5000"/>
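For illustration, a sketch (Python; the collection name mycollection is hypothetical) of walking the facet space for "group" page by page with facet.limit/facet.offset:
import requests

offset = 0
while True:
    params = {
        "q": "*:*", "rows": 0, "facet": "true", "facet.field": "group",
        "facet.limit": 50, "facet.offset": offset, "wt": "json",
    }
    data = requests.get("http://localhost:8983/solr/mycollection/select",
                        params=params).json()
    values = data["facet_counts"]["facet_fields"]["group"]
    if not values:
        break                  # no more facet values
    print(values)              # flat [value, count, value, count, ...] list
    offset += 50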
