How does Lucene organize and walk through the inverted index?

In SQL an index is typically some kind of balanced tree: ordered nodes that point to the actual table rows, making lookups possible in O(log n). Walking down such a tree is effectively the search process itself.
Now Lucene uses an inverted index with term frequencies: for each term it stores in which documents it occurs and how often. This is easy to understand. But it doesn't explain how a search on such an index is actually performed.
The search string is, of course, analyzed and split into terms the same way, and then "the index is searched" for these terms to find the documents containing them – but how? Is the Lucene index itself also ordered and organized in some tree-like way so that O(log n) is possible? Or is walking the Lucene index at search time actually linear, i.e. O(n)?

There is no simple answer to this. First, because the internal format has been improved from release to release, and second, because with the advent of Lucene 4 configurable codecs were introduced which serve as an abstraction between the logical format and the actual physical format.
An index is made up of shards and replicas, each of which is itself a Lucene index. A Lucene index is in turn made up of multiple segments, where each segment is again a Lucene index. A segment is read-only and consists of multiple artefacts which can be held in the file system or in RAM.
What's in a Lucene index from Adrien Grand is an excellent presentation on the organisation of a Lucene index. This blog article from Michael McCandless and this blog article from Elastic are about codecs introduced with Lucene 4.
So querying a Lucene index actually means querying multiple segments in parallel, making use of a specific codec. A codec can represent a structure in the file system or in RAM, typically optimized or compressed for a particular need. The internal format can be anything from a tree to a hash map to a finite state machine, just to name a few. As soon as you use wildcard characters ("?" or "*") in your query, this automatically results in a more or less deep traversal of your index.
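To make the O(log n) vs. O(n) question concrete, here is a deliberately simplified sketch of how a two-term AND query against a single segment can be answered: the term dictionary is an ordered structure (real Lucene uses an FST-based dictionary; a plain TreeMap stands in for it here), so looking up a term is logarithmic, and the matching documents come from merging the terms' sorted postings lists. All terms and document ids below are invented for illustration.

```java
import java.util.*;

// Toy model of a segment: a sorted term dictionary mapping each term to a
// sorted postings list of document ids. Real Lucene adds frequencies,
// positions, compression and skip lists on top of this idea.
public class ToyInvertedIndex {
    private final TreeMap<String, int[]> postings = new TreeMap<>();

    void add(String term, int... docIds) {
        postings.put(term, docIds);
    }

    // Term lookup is O(log n) in the number of terms (tree/FST lookup),
    // not a linear scan over the whole dictionary.
    int[] postingsFor(String term) {
        int[] p = postings.get(term);
        return p != null ? p : new int[0];
    }

    // AND of two terms = merge-intersection of their sorted postings lists.
    List<Integer> and(String t1, String t2) {
        int[] a = postingsFor(t1), b = postingsFor(t2);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0, j = 0; i < a.length && j < b.length; ) {
            if (a[i] == b[j]) { hits.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add("lucene", 1, 3, 7);
        idx.add("index",  2, 3, 7, 9);
        System.out.println(idx.and("lucene", "index")); // [3, 7]
    }
}
```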

Related

Data structure for fast full text search

A trie seems like it would work for small strings, but not for large documents (from one page to hundreds of pages of text), so I'm not sure. Maybe it is possible to combine an inverted index with a suffix tree to get the best of both worlds. Or perhaps use a b-tree with words stored as nodes, and a trie for each node. Not sure. I'm wondering what a good data structure would be (b-tree, linked list, etc.).
I'm thinking of searching documents such as regular books, web pages, and source code, so the idea of storing just words in an inverted index doesn't seem quite right. It would be helpful to know whether you need an alternative solution for each, or whether there is a general one that works for them all, or a combination of them.
You do need an inverted index at the end of the day for interleaving matching results from each of your query terms, but an inverted index can be built from either a trie or a hash map. A trie would allow fuzzy look-ups, while a hash-map-based inverted index would only allow an exact look-up of a token.
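As a toy illustration of that difference, the sketch below puts the same postings behind a hash map (exact look-ups only) and behind an ordered map standing in for a trie/radix tree, which can answer prefix queries by walking a contiguous range of keys. The terms and document ids are made up.

```java
import java.util.*;

// Same postings data behind two term dictionaries: a HashMap for exact
// token look-ups, and a TreeMap (standing in for a trie) for prefix queries.
public class TermDictionaries {
    public static void main(String[] args) {
        Map<String, List<Integer>> hashDict = new HashMap<>();
        TreeMap<String, List<Integer>> sortedDict = new TreeMap<>();
        Map<String, List<Integer>> data = Map.of(
                "search", List.of(1, 4),
                "searching", List.of(2),
                "segment", List.of(3, 4));
        hashDict.putAll(data);
        sortedDict.putAll(data);

        // Exact look-up: both work; the hash map does it in O(1) on average.
        System.out.println(hashDict.get("search"));              // [1, 4]

        // Prefix look-up "sear*": only feasible on the ordered structure,
        // by walking the contiguous key range sharing the prefix.
        Set<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, List<Integer>> e :
                sortedDict.subMap("sear", "sear" + Character.MAX_VALUE).entrySet()) {
            docs.addAll(e.getValue());
        }
        System.out.println(docs);                                 // [1, 2, 4]
    }
}
```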
To optimize for memory usage, you can use memory-optimized trie variants like the Radix Tree or the Adaptive Radix Tree (ART). I've had great success using ART for an open-source fuzzy search engine project I've been working on: https://github.com/typesense/typesense
With Typesense, I was able to index about 1 million Hacker News titles in about 165 MB of RAM (the uncompressed size on disk was 85 MB). You can probably squeeze it down even further if your use case is more specific and you don't need some of the metadata fields I added to the data structure.

Finding the number of documents that contain a term in elasticsearch

I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.
You can use the Count API to just return the count from a query, instead of a full document listing.
As far as whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't keep that information anywhere outside the index, because that is exactly what a Lucene index already does. That's what an inverted index is: a map of documents indexed by term. Lucene is designed around making look-ups of documents by term efficient.
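If you can work at the Lucene level underneath, that per-term statistic is already sitting in the term dictionary: IndexReader.docFreq(term) returns the number of documents containing a term without collecting or scoring any hits. A minimal sketch (the index path, field name and terms below are made up):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

// Reads the document frequency for a handful of terms directly from the
// term dictionary of an existing Lucene index.
public class DocFreqDemo {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {
            String[] terms = {"lucene", "elasticsearch", "solr"};
            for (String t : terms) {
                int df = reader.docFreq(new Term("body", t));
                System.out.println(t + " occurs in " + df + " documents");
            }
        }
    }
}
```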

How do index and inverted index work in facets in Solr?

I understand the theoretical concepts of indexes and inverted indexes. Primarily, Solr indexes documents using an inverted index (searching tokens instead of documents).
I've also read that Solr uses indexing for features such as facets.
As I understand it, for facets,
searching for a term and creating facets would require Solr to search all the terms in a field and match them against all the retrieved documents containing the search term, which would be costly, so indexing is used.
From what I understand, the index is used once all the documents matching the search terms have been retrieved: they are traversed and a count of the unique values of the relevant fields is calculated.
Is this a correct understanding of the concept, or is there something else?
There is not only one way in which faceting in Solr works. Solr has a heuristic to select the best method, but there is also the facet.method parameter to select one yourself.
Mainly your description is right, but Solr is fast because it caches the UnInvertedField instead of reading the values for each request from the inverted index. With DocValues there is also an efficient storage format for an uninverted field.
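As a stripped-down illustration of that uninverted approach, the sketch below keeps a per-document column for the facet field (roughly what UnInvertedField or DocValues provide) and counts values only over the documents that matched the query. All data in it is invented.

```java
import java.util.*;

// Faceting without scanning the inverted index: look up each matching
// document's facet value in an uninverted (doc -> value) column and count.
public class ToyFaceting {
    public static void main(String[] args) {
        // docId -> value of the "category" facet field (an uninverted column).
        String[] categoryByDoc = {"books", "music", "books", "games", "books", "music"};

        // Document ids that matched the main query (normally produced by the search).
        int[] matchingDocs = {0, 2, 3, 5};

        // Facet counting = one pass over the matches, one counter per value.
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            counts.merge(categoryByDoc[doc], 1, Integer::sum);
        }
        System.out.println(counts); // {books=2, games=1, music=1} (order may vary)
    }
}
```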
Possibly these answers will also help you:
How does Lucene/Solr achieve high performance in multi-field / faceted search?
Solr faceted search performance recommendations
http://de.slideshare.net/lucenerevolution/seeley-solr-facetseurocon2011

Multilingual free-text search in an app with normalized data?

We have enums, free-text, and referenced fields etc. in our DB.
Each enum has its own translation, free-text could be in any language. We'd like to do efficient large-scale free-text searching and enum value based searching.
I know of solutions like Solr which are nice, but that would mean we'd have to index entire de-normalized records with all the text of all the languages in the system. This seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?
ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
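A rough sketch of what such an ETL job could look like with SolrJ and JDBC is below. The connection URL, core name, table and field names are all made up, and a real transform step would usually do more than the SQL join shown here.

```java
import java.sql.*;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Extract rows (already joined/denormalized by the SQL), transform them into
// flat documents, and load them into a SOLR core.
public class EtlToSolr {
    public static void main(String[] args) throws Exception {
        String select = "SELECT p.id, p.title_en, p.description, c.name AS category "
                      + "FROM products p JOIN categories c ON c.id = p.category_id";
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "secret");
             Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery(select);
             SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products_en").build()) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getLong("id"));               // key back into the source database
                doc.addField("title", rs.getString("title_en"));
                doc.addField("description", rs.getString("description"));
                doc.addField("category", rs.getString("category")); // normalized enum value, denormalized here
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```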
Take a good look at how to configure and use SOLR and learn about SOLR cores. You may want to put some languages in separate cores because that way you can more effectively use the various stemming algorithms in SOLR. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
Having multiple cores makes searching a bit more complex, since you can try either a single-language index or an all-languages index. But it is much more effective to group language data and apply language-specific stopwords, protected words, stemming and language analysis tools.
Normally you would include some key data in the index so that when you find a record via SOLR search, you can reference directly back into the source db. Also, you can have normalised and non-normalised data together; for instance, an enum could be recorded in a normalised field in English as well as in a non-normalised field in the same language as the free text. A field can be duplicated in order to apply two different analysis and filtering treatments.
It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.

Calling search gurus: Numeric range search performance with Lucene?

I'm working on a system that performs matching on large sets of records based on strings, numeric ranges, and date ranges. The string matches are mostly exact matches as far as I can tell, as opposed to the less exact full-text-search kind of results that I understand Lucene is generally designed for. Numeric precision is important as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but it's not something it was originally designed for.
Currently the system uses procedural SQL to do the matching, and the limits of the system's scalability are being reached. I'm researching ways to scale the system horizontally, and using search engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while returning search results very quickly. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching with the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. I would eventually like to aim for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric data as a trie; a SQL implementation will probably store it as a b-tree or an r-tree. The way Lucene stores its trie and the way SQL uses an r-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes with Solr).
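For completeness, here is a minimal sketch of indexing and range-querying a numeric field with the Lucene API. Recent Lucene versions (roughly 6.x and later) index numerics as points, i.e. a BKD tree, rather than the older trie encoding, but the usage looks about the same; the field name and prices below are invented and the example assumes something like Lucene 8.x/9.x.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Indexes a few prices and runs a numeric range query over them.
public class PriceRangeDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();               // in-memory index for the demo
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (long price : new long[] {999, 1499, 2599, 4999}) {
                Document doc = new Document();
                doc.add(new LongPoint("price", price));            // indexed for range queries (points/BKD)
                doc.add(new StoredField("price", price));          // stored copy so we can read the value back
                writer.addDocument(doc);
            }
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // All documents with 1000 <= price <= 3000
            Query q = LongPoint.newRangeQuery("price", 1000L, 3000L);
            TopDocs hits = searcher.search(q, 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).getField("price").numericValue());
            }
        }
    }
}
```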
As a general question of the performance of Lucene vs. SQL fulltext, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.
First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven't been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
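A minimal sketch of how such a combinational condition could be expressed against the Lucene API, assuming exact term matches on hypothetical fields x and y: every leaf is a term look-up in the inverted index, and the boolean layers merge the resulting postings.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Builds the equivalent of (x = y OR (x = z AND y = x)) as a Lucene query.
public class CombinationalQuery {
    public static void main(String[] args) {
        Query inner = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("x", "z")), Occur.MUST)
                .add(new TermQuery(new Term("y", "x")), Occur.MUST)
                .build();
        Query query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("x", "y")), Occur.SHOULD)
                .add(inner, Occur.SHOULD)
                .build();
        System.out.println(query); // prints something like: x:y (+x:z +y:x)
    }
}
```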
The study doesn't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing, etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word-density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, returns results at blistering speed. For text-based searching it is possible, and very probable, that search results will come back quicker from Lucene than from SQL Server/Oracle/MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as they have completely different usages.

Resources