How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:
term | document_id
---------------------
program | 1, 2, 3, 5...
python | 1, 4
code | 4
c++ | 4, 5
The term program is present in many documents, so a query of program AND code would seem to require intersecting a very large set of documents.
Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?
Yes. Say you have the following query:
term1 AND term2 AND term3
You first need to compute the document frequency of each positive term, then pick the term with the lowest count.
You retrieve the documents that contain that least common term from the query. Those are the candidates. Then you filter and score those candidates against the query with a finite state machine.
So the database has several subspaces:
A mapping from lemma or stem or term to document frequency (as in tfidf)
The actual inverted index, which lets you retrieve the documents that contain a given lemma
A mapping between the document id and the full-text representation of the document, or just a bag of words, depending on how advanced your query logic is.
The filter + score step can then happen in parallel.
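To make that concrete, here is a minimal sketch of the idea, assuming a tiny in-memory index (the dictionary below just mirrors the example from the first question; the scoring/finite-state-machine step is omitted):

# Toy in-memory inverted index; the "rarest term first" trick keeps the candidate set small.
index = {
    "program": {1, 2, 3, 5},
    "python": {1, 4},
    "code": {4},
    "c++": {4, 5},
}

def and_query(terms):
    # Document frequency = posting-list size; start with the rarest term.
    terms = sorted(terms, key=lambda t: len(index.get(t, set())))
    candidates = set(index.get(terms[0], set()))
    # Filter the small candidate set against the remaining terms
    # (this is the filter/score step that can run in parallel).
    for term in terms[1:]:
        candidates &= index.get(term, set())
    return candidates

# "program AND code": only the single document from the rare term "code" is examined,
# not the long posting list for "program".
print(and_query(["program", "code"]))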
In my application, I have a collection of 50 million documents. I am doing a LIKE-style search and then counting the results on a particular field (i.e. Patientfirstname). I also created an index on the Patientfirstname field; it improved the performance, but it is still taking a lot of time.
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count() without index 40 sec
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count() after adding index on the Patientfirstname field 31 sec
I tried a different approach (aggregation), but the response is still very slow:
db.patients.aggregate([{$match:{"Patientfirstname":{"$regex":"Testuser"}}},
{$project:{"Patientfirstname":1,"_id":1}},
{$group : {_id:"$Patientfirstname", count:{$sum:1}}},
{$sort:{"count":-1}} ])
This query also takes about the same time to fetch the results: 31 sec.
Another approach was tried, but the results are not correct:
Select only the field from the entire collection, then apply the LIKE search and count the result.
db.patients.find({},{Patientfirstname:1,_id:1}).count({"Patientfirstname":{"$regex":"Testuser"}})
Applying a filter inside count() is not working; the count of the entire collection is displayed.
Please help me make this query fetch results faster. Thanks in advance.
So here is the deal:
As rightly pointed out in the comments, $regex is an operator that will not perform well with or without indexes. Here is the reason why:
Queries without indexes are slow because they are executed using a COLLSCAN - essentially an iteration over all 50 million documents on disk, one by one, filtering the data and returning only the ones that match. Disks being an inherently slow piece of hardware does not help the situation either.
Now, when indexed, MongoDB creates a B-tree in RAM. And the $regex operator, being not very selective in nature, forces a complete tree scan (as compared to a reduced/partial tree scan in the case of equalities or ranges) in the index B-tree - which is about as bad as a collection scan itself. The only reason you gain roughly 9 seconds is that this tree scan occurs in RAM and not on disk.
Having said that, there are a few alternatives to it:
Optimize your $regex (a short pymongo sketch follows this list). From the MongoDB documentation itself:
For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a "prefix expression", which means that all potential matches start with the same string. This allows MongoDB to construct a "range" from that prefix and only match against those values from the index that fall within that range.
A regular expression is a "prefix expression" if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
Additionally, while /^a/, /^a./, and /^a.$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a./ and /^a.$/ are slower. /^a/ can stop scanning after matching the prefix.
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
Create a Text Index - This would tokenize your text string and enable faster text based searches
If you are deployed on MongoDB Atlas, then you can use Atlas Search, which is a Lucene-based text search engine (works almost like Elasticsearch on steroids). This offers significantly greater performance and functionality like fuzzy text search, text autocomplete, etc.
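To tie the first alternative back to code, here is a rough pymongo sketch of an anchored (prefix) $regex; the connection string and the database/collection names are assumptions, not taken from the question:

# Minimal pymongo sketch; connection string and db/collection names are placeholders.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
patients = client["hospital"]["patients"]

# Single-field index on Patientfirstname, as in the question.
patients.create_index([("Patientfirstname", ASCENDING)])

# The ^ anchor makes this a "prefix expression", so MongoDB can turn it into an
# index range scan instead of walking the whole B-tree.
count = patients.count_documents({"Patientfirstname": {"$regex": "^Testuser"}})
print(count)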
I'm using Azure Search with the rich Lucene query parser syntax. I appended "~1" to terms to allow fuzzy matching with an edit distance of one symbol. But I ran into the problem that results are not ordered with the exact match first. (For example, "blue~1" would return "blues", "blue", "glue". Or when searching a product SKU like "P002", I would get the results "P003", "P005", "P004", "P002", "P001", "P006".)
So my question: is there some way to specify that the entity with the exact match must be first in the list, or be the single search result, even when I'm using fuzzy search "~1"?
With the Lucene query syntax you can boost individual subqueries, for example: term^2 | term~1 - this translates to "find documents that match 'term' OR 'term' with edit distance 1, and score the exact matches higher relative to fuzzy matches by a factor of two".
search=blue^2|blue~1&queryType=full
There is no guarantee that the exact match will always be first in the result set, as the document score is a function of term frequency and inverse document frequency. If the fuzzy sub-query expands the input term to a term that's very rare in your document corpus, you may need to bump the boosting factor (2 in my example). In general, relying on the relevance score for ordering is not a practical idea. Take a look at my answer in the following post for more information: Azure Search scoring
Let me know if this helps
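If it helps, here is a rough Python sketch of submitting that boosted query through the Azure Search REST API; the service name, index name, API version and key below are placeholders, not values from the question:

# Hedged sketch of a POST to the Azure Search "search documents" endpoint.
import requests

SERVICE = "<your-search-service>"   # placeholder
INDEX = "<your-index>"              # placeholder

resp = requests.post(
    f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/search",
    params={"api-version": "2021-04-30-Preview"},   # assumed API version
    headers={"api-key": "<query-key>", "Content-Type": "application/json"},
    json={"search": "blue^2|blue~1", "queryType": "full"},
)
for doc in resp.json().get("value", []):
    print(doc.get("@search.score"), doc)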
I'm using Redis in order to build an inverted index system for words and the documents that contain those words.
The setup is really simple: Redis Sets where the key of the Set is i:word and the members of the Set are the ids of the documents that contain this word.
Let's say I have 2 sets: i:example and i:result.
The query "example result" will intersect i:example and i:result and return all the ids that are members of both sets.
But what I'm looking for is a way to perform (in an efficient manner) a query like "ex res". The result set should contain at least all the ids from the query "example result".
Solutions I thought of:
Create prefix sets of size 2: p:ex contains {"example", "expertise", "ex", ...}. The lookup running time will not be a problem - O(1) to get the set and O(n) to check all elements in the set for words that start with the prefix (where n = set.size()) - but I worry about the added size cost.
Use SCAN: but I'm not sure about the running time - a query like scan 0 match ex* will take O(n) where n is the number of keys in the db? I know Redis is fast, but it's probably not an optimized solution for a query like "ex machi cont".
The usual way to go about this is the first approach you mentioned, but usually you'd go with segments that are 3+ chars long. Note that you'll need a set for each segment, e.g. i:exa, i:exam, i:examp, i:exampl and of course i:example.
This will naturally take up space in your database (hence the suggestion to start at 3 rather than 2 characters). A possible tweak is to keep in the i:len(3) sets only references to the i:len(4+) sets instead of document ids. This will require more read operations but will bring significant savings in terms of RAM.
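For what it's worth, a small redis-py sketch of that segment-set layout (the i: key prefix follows the question's convention, the 3-character minimum follows the suggestion above, and the sample data is made up):

# Illustrative redis-py sketch of per-segment prefix sets; assumes a local Redis.
import redis

r = redis.Redis()

def index_word(word, doc_id, min_seg=3):
    # One set per 3+ character prefix: i:exa, i:exam, ..., i:example.
    for end in range(min_seg, len(word) + 1):
        r.sadd(f"i:{word[:end]}", doc_id)

index_word("example", 1)
index_word("result", 1)

# A query like "exa res" becomes a plain set intersection of segment sets.
print(r.sinter("i:exa", "i:res"))  # {b'1'}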
You should explore v2.8.9's addition of lexicographical ranges for Sorted Sets. By calling ZRANGEBYLEX you can get ranges of members (e.g. all the words that start with ex). While this could be useful in this context by itself, consider that you can also use your Sorted Set's members creatively to encode a word and its document reference. This can help you get over the "loss" of the score (since all scores need to be the same for lexicographical ordering to work). For example, assuming the word "beg" in doc 1 and "bed" in doc 2:
ZADD index 0 "beg:1" 0 "bed:2"
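In redis-py the same idea might look like this (the key name and the word:doc encoding follow the illustration above; the extra entries are made up):

# Sketch of the sorted-set index and a ZRANGEBYLEX prefix lookup.
import redis

r = redis.Redis()

# All members share score 0 so that lexicographical ordering applies.
r.zadd("index", {"beg:1": 0, "bed:2": 0, "example:7": 0, "expertise:9": 0})

# Every member starting with "ex" (for ASCII words, anything below "ey").
for member in r.zrangebylex("index", "[ex", "(ey"):
    word, doc_id = member.decode().rsplit(":", 1)
    print(word, doc_id)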
Lastly, here's a little something to think about too - adding suffix searching (e.g. everything that ends with "ample"): https://redislabs.com/blog/how-to-use-redis-at-least-x1000-more-efficiently
Boosting by date field in solr is defined as:
{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}
I looked everywhere (examples: Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field, both of which reference the SolrRelevancyFAQ), and the same definition is used everywhere. But I found that this does not boost my results sufficiently. How can I make this date boosting stronger?
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
And the Solr debug output is way too confusing for me to understand the problem.
Now, this is not a huge problem. 99% of queries work fine and produce the expected results, so it's not like Solr is not working at all; I just found this situation very confusing and don't know how to proceed.
recip(x, m, a, b) implements f(x) = a/(xm+b) with:
x : the document age in ms, defined as ms(NOW,<datefield>).
m : a constant that defines a time scale used to apply the boost. It should be the inverse of what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) means using its inverse: 3.16e-11 (1/3.16e10 rounded).
a and b are constants (defined arbitrarily).
xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
xm ≈ 0 when the document is new, resulting in a value close to a/b.
Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.
With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
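A quick numeric check in plain Python (document ages made up) confirms those multipliers:

# recip(x, m, a, b) = a / (x*m + b) with a = b = 1 and m = 3.16e-11 (one year in ms, inverted).
MS_PER_YEAR = 3.16e10

def recip(x, m=3.16e-11, a=1.0, b=1.0):
    return a / (x * m + b)

for years in (0, 1, 2):
    print(years, round(recip(years * MS_PER_YEAR), 2))  # 1.0, ~0.5, ~0.33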
How to make date boosting stronger?
Increase m: choose a lower reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1-year reference, the multiplier decreases twice as fast as the document age increases.
Decreasing a and b expands the response curve of the function. This can be very aggressive; see this example (page 8).
Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a dismax parameter, so it requires using the DisMax or eDisMax query parser), e.g.:
bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
It is important to note a few things:
bf is an additive boost and acts as a bonus added to the score of newer documents.
{!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.
A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), meaning that a result set with higher scores may not be impacted as much as a result set with lower scores. In contrast, multiplicative boosts affect scores the same way regardless of the result set's relevancy, which is why they are usually preferred.
Do not use recip() for dates more than one reference_time in the future or it will yield negative values.
See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
Well, from your example it is clear that your results have landed in a tie situation. To make sense of the confusing debug output and devise a tie-breaker policy, it is important to understand DisMax.
With DisMax queries, the different terms of the user input are executed against different fields; if many of them hit (the term appears in different fields of the same document), the hit that scores highest is used. But what happens with the other sub-queries that hit in that document for the term? Well, that's what the tie parameter defines. DisMax will calculate the score for a term query as:
score= [score of the top scoring subquery] + tie * (sum of other hitting subqueries)
In consequence, the tie parameter is a value between 0 and 1 that defines whether DisMax will only consider the maximum hit score for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
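A toy calculation (the per-field scores below are invented) shows how tie blends the sub-query scores for a single term:

# score = top scoring subquery + tie * (sum of the other hitting subqueries)
def dismax_term_score(field_scores, tie):
    top = max(field_scores)
    return top + tie * (sum(field_scores) - top)

# e.g. a term that hits title (score 2.0) and description (score 0.5):
print(dismax_term_score([2.0, 0.5], tie=0.0))  # 2.0  -> only the best field counts
print(dismax_term_score([2.0, 0.5], tie=1.0))  # 2.5  -> every hitting field counts fully
print(dismax_term_score([2.0, 0.5], tie=0.1))  # 2.05 -> mostly the best field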
The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the Extended Dismax Query Parser or the Lucid Query Parser.
There is an interesting article Comparing Boost Methods of SOLR which may be useful to you.
References for this answer:
Advanced Apache Solr boosting: a case study
Using Solr’s Dismax Tie Parameter
Shishir
There is an example presented very well in the ReciprocalFloatFunction that will give you a clear view of how the boosting recipe works. If you find that dismax does not offer you enough control over the boosting, you will have to do some tinkering with the BoostQParserPlugin.
A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1, a date a year in the past will get a multiplier of about 1/(1+1) or 1/2, and a date two years old will yield 1/(2+1) or 1/3.
I'm trying to build my own search engine for experimenting.
I know about inverted indexes, for example when indexing words:
the key is the word and it maps to a list of the document ids containing that word, so when you search for that word you get the documents right away.
How does it work for multiple words?
Do you get all the documents for every word and traverse those documents to see if they have both words?
I feel that is not the case.
Does anyone know the real answer to this, without speculating?
An inverted index is very efficient for computing intersections, using a zig-zag algorithm:
Assume your terms are in a list T:
lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
if (currTerm > T.last): //if we have passed the last term:
insert lastDoc into result
currTerm <- 0
lastDoc <- lastDoc + 1
continue
docID <- T[currTerm].getFirstAfter(lastDoc-1)
if (docID != lastDoc):
lastDoc <- docID
currTerm <- 0
else:
currTerm <- currTerm + 1
This algorithm assumes an efficient getFirstAfter(), which gives you the first document that matches the term and whose docID is greater than the specified parameter. It should return infinity if there is none.
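For illustration, here is a minimal runnable Python sketch of the same loop, using bisect for getFirstAfter; representing each posting list as a sorted Python list is an assumption made for the example:

# Zig-zag intersection over sorted posting lists (illustrative sketch).
import bisect

INF = float("inf")

def get_first_after(postings, doc_id):
    # Smallest doc id in postings that is >= doc_id, or infinity if there is none.
    i = bisect.bisect_left(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(T):
    result = []
    last_doc, curr_term = 0, 0
    while last_doc != INF:
        if curr_term > len(T) - 1:      # passed the last term: every list contains last_doc
            result.append(last_doc)
            curr_term = 0
            last_doc += 1
            continue
        doc_id = get_first_after(T[curr_term], last_doc)
        if doc_id != last_doc:
            last_doc = doc_id
            curr_term = 0
        else:
            curr_term += 1
    return result

print(zigzag_intersect([[1, 2, 3, 5], [4, 5]]))  # [5]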
The algorithm will be most efficient if the terms are sorted such that the rarest term is first.
The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer iterations.
Note: though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
You need to store the position of each word in a document in your index file.
Your index file structure should be like this:
word id - doc id - no. of hits - positions of hits
Now suppose the query contains 4 words, "w1 w2 w3 w4". Choose the documents that contain most of the words. Now calculate the words' relative distances in each document. The document where most of the words occur and their relative distance is smallest will have high priority in the search results.
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here - Search Engine.
For more info, read this paper by the Google founders - click here.
You find the intersection of document sets as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you do then is iterate through all matches for B and search for each of them in the list for A. If indeed the lists can be searched in logarithmic time, this means you need
occ(B) * log(occ(A))
computational steps to identify all matches that contain both terms.
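As a small illustration of that bound, here is a Python sketch with sorted posting lists (the lists themselves are made up): walk the shorter list and binary-search each doc id in the longer one.

# Roughly occ(B) * log(occ(A)) steps: iterate the shorter list, bisect into the longer one.
import bisect

def intersect(postings_a, postings_b):
    short, long_ = sorted((postings_a, postings_b), key=len)
    hits = []
    for doc_id in short:
        i = bisect.bisect_left(long_, doc_id)
        if i < len(long_) and long_[i] == doc_id:
            hits.append(doc_id)
    return hits

print(intersect([1, 2, 3, 5], [4, 5]))  # [5]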
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combining queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
Example (Lucene.NET, C#):
var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too, although a QueryParser might be easier to use.
You can view this answer for example on how to split the string using the same analyzer that you used for indexing:
No hits when searching for "mvc2" with lucene.net