Finding the number of documents that contain a term in elasticsearch

Finding the number of documents that contain a term in elasticsearch - search

I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.

You can use the Count API to just return the count from a query, instead of a full document listing.
As far as whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't have a store of that information outside the index, because that is exactly what a lucene index already does. That's what an inverted index is, a map of documents indexed by term. Lucene is designed around making looking up documents by term efficient.

Related

Designing Twitter Search - How to sort large datasets?

I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping
English word -> A set of tweetIds having this word
Now if we want to find all the tweets that have some word all we need is to query all servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity" but isn't that a heavy task, especially if the word is an hot word?
What is done in practice in such search systems?
Maybe some tradeoff are being used?
Thanks!

First of all, there are two types of indexes: local and global.
A local index is stored on the same computer as tweet data. For example, you may have 10 shards and each of these shards will have its own index; like word "car" -> sorted list of tweet ids.
When search is run we will have to send the query to every server. As we don't know where the most popular tweets are. That query will ask every server to return their top results. All of these results will be collected on the same box - the one executing the user request - and that process will pick top 10 of of entire population.
Since all results are already sorted in the index itself, it is a O(1) operation to pick top 10 results from all lists - as we will be doing simple heap/watermarking on set number of tweets.
Second nice property, we can do pagination - the next query will be also sent to every box with additional data - give me top 10, with popularity below X, where X is the popularity of last tweet returned to customer.
Global index is a different beast - it does not live on the same boxes as data (it could, but does not have to). In that case, when we search for a keyword, we know exactly where to look for. And the index itself is also sorted, hence it is fast to get top 10 most popular results (or get pagination).
Since the global index returns only tweet Ids and not tweet itself, we will have to lookup tweets for every id - this is called N+1 problem - 1 query to get a list of ids and then one query for every id. There are several ways to solve this - caching and data duplication are by far most common approaches.

full text search in databases

I have two fairly general question about full text search in a database. I was looking into elastic search and solr and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into whoosh search, which does index table columns and the result of whoosh are actual table rows.
When using solr or elastic search, should I put the row id into the document which gets searched and after I have my result use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have is if I have a id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with a FTS? It seems to me there is not much to be gained by indexing? Or am I wrong?
thanks

Elasticsearch can store the indexed document, and you can retrieve it as a part of query result. Usually ppl still store the original data in an usual DB, it gives you more reliability and flexibility on reindexing. Mind that ES indexes non-relational data. You can have you data stored in relational manner and compose denormalized documents for indexing.
As for "abc/123.64664" you can index it as tokenized string or you can tune the index for prefix search etc. It's up to you

(TL;DR) Don't think about what your data is structured in your RDBS. Think about what you are searching.
Content storage for good full text search is quite different from relational database standard storage. So, your data going into Search Engine can end up looking quite differently from the way you stored it.
This is all driven by your expected search results. You may increase granularity of the data or - opposite - denormalize it so the parent/related record content shows up in the records you actually want returned as part of search. Text processing (copyField, tokenization, pre-processing, etc) is also where a lot of content modifications happen to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to only use search engine to get the right - relevant - IDs out and then merge them in the client code with the details from the original database records.

How index and inverted index works in facets in solr?

I understand the theory concepts of Inverted index and indexes. Primarily, Solr indexes documents using inverted index (Searching tokens instead of documents).
I've also read that Solr uses indexing for features such as facets.
As I understand it, for facets,
searching for a term and creating facets require Solr to search all the terms in a field and match all the retrieved documents containing the search term, which will be costly, so indexing is used.
From what I understand, index is used when all the documents referring to the search terms are retrieved, they are traversed and a count of unique values regarding the fields are calculated.
Is this a correct understanding of this concept or there is something else ?

The is not only one way, how faceting in solr works.
Solr has a heuristic to select a best but there is also a the
facet.method parameter to select it by your own.
Mainly your description is right, but solr is fast because of caching the
UnInvertedField instead of selecting the values for each request from the inverted index.
With DocValues there is also an efficient storage of an uninverted field.
Possible also this answers will help you:
How does Lucene/Solr achieve high performance in multi-field / faceted search?
Solr faceted search performance recommendations
http://de.slideshare.net/lucenerevolution/seeley-solr-facetseurocon2011

What indexer do I use to find the list in the collection that is most similar to my list?

Lets say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.

I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will – by default – return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I unerstand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.

SOLR - How to have facet counts restricted to rows returned in resultset

/select/?q=*:*&rows=100&facet=on&facet.field=category
I have around 100 000 documents indexed. But I return only 100 documents using rows=100. The facet counts returned for category, however return the counts for all documents indexed.
Can we somehow restrict the facets to the result set returned? i.e 100 rows only?

I don't think it is possible in any direct manner, as was pointed out by Pascal.
I can see two ways to achieve this:
Method I: do the counting yourself visiting the 100 results returned. This is very easy and fast if they are categorical fields, but harder if they are text fields that need to be tokenized, etc.
Method II: do two passes:
Do a normal query without facets (you only need to request doc ids at this point)
Collect all the IDs of the documents returned
Do a second query for all fields and facets, adding a filter to restrict result to those IDs collected in setp 2. Something like:
select/?q=:&facet=on&facet.field=category&fq=id:(312
OR 28 OR 1231 ...)
The first is way more efficient and I would recommend for non-textual filds. The second is computationally expensive but has the advantage of working for all types od fields.

Sorry, but i don't think it is possible. The facets are always based on all the documents matching the query.

Not a real answer but maybe better than nothing: the results grouping feature (check out from trunk!):
http://wiki.apache.org/solr/FieldCollapsing
where facet.field=category is then similar to group.field=category and you will get only as much groups ('facet hits') as you specified!

If you always execute the same query (q=*:*), maybe you can use facet.limit, for example:
select/?q=*:*&rows=100&facet=on&facet.field=category&facet.limit=100
Tell us if the order that solr uses is the same in the facet as in the query :.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string