Stronger boosting by date in Solr - search

Boosting by date field in solr is defined as:
{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}
Every reference I found (for example, Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field, which both point to the SolrRelevancyFAQ) uses this same definition. But I found that it does not boost my results sufficiently. How can I make this date boosting stronger?
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
And the Solr debug output is waaay too confusing for me to understand the problem.
Now, this is not a huge problem. 99% of queries work fine and produce the expected results, so it's not like Solr is not working at all. I just found this situation very confusing and don't know how to proceed.

recip(x, m, a, b) implements f(x) = a/(xm+b) with :
x : the document age in ms, defined as ms(NOW,<datefield>).
m : a constant that defines the time scale over which the boost is applied. It should be relative to what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) means using its inverse: 3.16e-11 (1/3.16e10, rounded).
a and b are constants (defined arbitrarily).
xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
xm ≈ 0 when the document is new, resulting in a value close to a/b.
Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.
With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
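To make the numbers above concrete, here is a minimal Python sketch that mirrors the recip() formula outside of Solr. The constant MS_PER_YEAR and the helper name recip are only illustrative; Solr evaluates the function itself at query time.

```python
MS_PER_YEAR = 3.16e10  # approximate milliseconds per year, as used in the answer


def recip(x, m, a, b):
    """Same shape as Solr's recip(x, m, a, b): a / (m*x + b)."""
    return a / (m * x + b)


m = 3.16e-11  # inverse of a one-year reference_time
for years in (0, 1, 2):
    age_ms = years * MS_PER_YEAR
    # Prints a multiplier close to 1, 1/2 and 1/3 respectively.
    print(years, round(recip(age_ms, m, 1, 1), 3))
```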
How to make date boosting stronger?
Increase m : choose a shorter reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1 year reference, the multiplier decreases 2x faster as the document age increases.
Decrease a and b : this makes the response curve steeper (with a = b, a/(xm+a) = 1/(x(m/a)+1), so it is equivalent to multiplying m by 1/a). This can be very aggressive, see this example (page 8).
Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a dismax parameter, so it requires the DisMax or eDisMax query parser), e.g.:
bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
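As a rough illustration of the first two options, the sketch below compares the multiplier of a one-year-old document under the default settings, a 6 month reference_time, and smaller a and b. The value a = b = 0.1 is an arbitrary choice for the example, not a recommendation.

```python
MS_PER_YEAR = 3.16e10  # approximate milliseconds per year


def recip(x, m, a, b):
    """Same shape as Solr's recip(x, m, a, b): a / (m*x + b)."""
    return a / (m * x + b)


one_year_old = MS_PER_YEAR

baseline = recip(one_year_old, 3.16e-11, 1, 1)         # 1 year reference_time
shorter_ref = recip(one_year_old, 6.33e-11, 1, 1)      # 6 month reference_time
smaller_ab = recip(one_year_old, 3.16e-11, 0.1, 0.1)   # a = b = 0.1 (arbitrary)

# Each option penalises the same one-year-old document more than the baseline.
print(round(baseline, 3), round(shorter_ref, 3), round(smaller_ab, 3))
```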
It is important to note a few things :
bf is an additive boost and acts as a bonus added to the score of newer documents.
{!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.
A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), meaning that a resultset with higher scores may not be impacted as much as a resultset with lower scores. In contrast, multiplicative boosts affect scores the same way regardless of the resultset relevancy, which is why they are usually preferred.
Do not use recip() for dates more than one reference_time in the future or it will yield negative values.
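The difference between the two kinds of boost can be sketched with plain arithmetic. The relevancy scores 0.2 and 20.0 below are made up, and both documents are assumed to have the same relevancy score so that only the date boost separates them.

```python
def additive(score, boost):
    """bf-style: the boost is a bonus added to the relevancy score."""
    return score + boost


def multiplicative(score, boost):
    """{!boost b}-style: the boost multiplies the relevancy score."""
    return score * boost


def new_over_old(base_score, combine, boost_new=1.0, boost_old=0.5):
    """How many times higher the new document ends up than the one-year-old one."""
    return combine(base_score, boost_new) / combine(base_score, boost_old)


for base in (0.2, 20.0):  # hypothetical low-scoring and high-scoring result sets
    print(base,
          round(new_over_old(base, additive), 3),
          round(new_over_old(base, multiplicative), 3))
```

With the additive form the advantage of the newer document shrinks as the base score grows, while the multiplicative form keeps it at the same factor.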
See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.

User is searching for two keywords. Both items contain both keywords
(in same order) in both title and description. Neither of the keywords
is repeated.
Well, from your example, it is clear that your results have ended up in a tie. To understand the confusing debug output and devise a tie-breaker policy, it is important to understand dismax.
With DisMax queries, each term of the user input is executed against several fields. If a term hits in more than one field of the same document, the highest-scoring hit is used. What happens to the other sub-queries that hit in that document for the term is what the tie parameter defines. DisMax calculates the score for a term query as:
score= [score of the top scoring subquery] + tie * (sum of other hitting subqueries)
Consequently, the tie parameter is a value between 0 and 1 that defines whether DisMax considers only the max hit score for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
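The formula can be tried out with a few lines of Python. This is only a sketch of the arithmetic for one term; the per-field scores are hypothetical and the function name dismax_score is not part of Solr.

```python
def dismax_score(subquery_scores, tie=0.0):
    """Score of one term: top-scoring field hit + tie * (sum of the other hits)."""
    if not subquery_scores:
        return 0.0
    top = max(subquery_scores)
    others = sum(subquery_scores) - top
    return top + tie * others


hits = [0.8, 0.5, 0.2]  # hypothetical scores of one term in title, description, tags

for tie in (0.0, 0.1, 1.0):
    print(tie, round(dismax_score(hits, tie), 3))
```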
The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the Extended Dismax Query Parser or the Lucid Query Parser.
There is an interesting article Comparing Boost Methods of SOLR which may be useful to you.
References for this answer:
Advanced Apache Solr boosting: a case study
Using Solr’s Dismax Tie Parameter
Shishir

There is an example very well presented in the ReciprocalFloatFunction that will give you a clear view on how the boosting recipe works. If you find that dismax does not offer you enough control over the boosting, you will have to do some tinkering with BoostQParserPlugin.
A multiplier of 3.16e-11 changes the units from milliseconds to years
(since there are about 3.16e10 milliseconds per year). Thus, a very
recent date will yield a value close to 1/(0+1) or 1, a date a year in
the past will get a multiplier of about 1/(1+1) or 1/2, and date two
years old will yield 1/(2+1) or 1/3.

Related

Azure Search - exact match as first or single result

I'm using Azure Search with the full Lucene query parser syntax. I append "~1" to a term to allow an edit distance of one symbol. But I ran into a problem: the results are not ordered with the exact match first, even when there is one. (For example, "blue~1" would return "blues", "blue", "glue". Or when searching for a product SKU like "P002", I would get "P003", "P005", "P004", "P002", "P001", "P006".)
So my question: is there some way to require that the entity with the exact match be first in the list, or be the single search result, even when I'm using the fuzzy search "~1"?
With Lucene query syntax you can boost individual subqueries, for example: term^2 | term~1 - this translates to "find documents that match 'term' OR 'term' with edit distance 1, and score the exact matches higher relative to fuzzy matches by a factor of two".
search=blue^2|blue~1&queryType=full
There is no guarantee that the exact match will always be first in the results set as the document score is a function of term frequency and inverse document frequency. If the fuzzy sub-query expands the input term to a term that's very unique in your document corpus you may need to bump the boosting factor (2 in my example). In general, relying on the relevance score for ordering is not a practical idea. Take a look at my answer in the following post for more information: Azure Search scoring
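If the boosting factor needs to be tuned, it can help to build the search expression from its parts. The helper below is a hypothetical convenience, not part of any Azure SDK; it only assembles the string and URL-encodes the two query parameters shown above.

```python
from urllib.parse import urlencode


def exact_first_query(term, boost=2, distance=1):
    """Build 'term^boost|term~distance' so exact matches score above fuzzy ones."""
    return f"{term}^{boost}|{term}~{distance}"


params = {"search": exact_first_query("blue"), "queryType": "full"}
query_string = urlencode(params)  # special characters such as ^ and | get percent-encoded
print(params["search"])
print(query_string)
```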
Let me know if this helps

solr how to properly use boost factor in a query?

Ok, so I am using many fields with qf, like:
[qf] => frpId^5 fundraise_title^3 fundraiser_display_name^3 charity_name^2 participantFname^2 participantLname^2 participantEmail^1 groupName^3 fundraise_text^ fundraiseTitleExact^15 fundraiserDisplayNameExact^15 charityNameExact^15 participantFnameExact^10 participantLnameExact^10 groupNameExact^10 all^
but I really want exact matches on the field fundraiseTitleExact to be on top.
With this set up of qf, they are at position 32.
Let's say that I am boosting fundraiseTitleExact like:
[qf] => frpId^5 fundraise_title^3 fundraiser_display_name^3 charity_name^2 participantFname^2 participantLname^2 participantEmail^1 groupName^3 fundraise_text^ fundraiseTitleExact^15000000000000000 fundraiserDisplayNameExact^15 charityNameExact^15 participantFnameExact^10 participantLnameExact^10 groupNameExact^10 all^
But even now the fundraiseTitleExact exact match is only at position 27 (5 positions up) and does not go any higher.
How can I prioritise this field over the rest?
This looks more like a tuning problem, however you have several options:
Tune your relevancy by modifying all the boosts until you get the expected results (I would advise working with lower boosts than the ones in your question, then increasing the boost of the most important field);
If you are using the edismax query parser, you probably want to check the bq and bf parameters in order to boost your term;
If worst comes to worst, you could use the Query Elevation Component to put some entries at the top of the list.
I advise reading the following books to widen your knowledge of Solr boosting and relevancy mechanisms:
Solr in Action
Relevant Search

Sparql 'langmatch' seems extremely slow on Virtuoso (DBpedia)

I have a sparql performance issue with DBpedia. I'd like to extract ordered information from DBpedia sparql endpoint page by page. My first example query looked like this:
select distinct ?objProperty ?label where {
?x ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
}order by ?label limit 10 offset 3
It took about 2s on average for me (if you try it yourself and see timings under a second, increment 'offset', because it seems that DBpedia's Virtuoso is caching request results).
However the result returned is not suitable for pagination, because it is a mess of lines with labels from different languages. I want English language for labels and for precise pagination I want exactly 10 different object properties to be returned as a result. Also they have to be ordered by label. Ok. Another try:
select distinct ?objProperty ?label where {
?a ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
FILTER ( LANGMATCHES(lang(?label),"EN") || LANG(?label) = "")
}order by ?label limit 10 offset 3
For me this request returned what I expected, but it took about 7 seconds on average!!! So sloooow!!! Without order by and langmatch, the query takes about 1s on average. Without order by but with langmatch, it takes about 6s, so it seems that langmatch eats ~5s on average for this query.
I do not understand (these are questions by the way):
Am I doing something wrong? :)
Why langmatch slows query SOOO much? I wish langmatch is not regex based? If this performance issue is unavoidable using langmatch, is there a faster way to work with languages? If no, I can't imagine how semantic technologies would conquer the world in nearest future as people expect :))
Is there a better (faster) way to build pagination based requests than using limit/offset? If no, what is the best way to avoid performance issues like mentioned above with limit/offset?
1. Am I doing something wrong? :)
I think there's a slight issue that could make your query a bit faster. You've got the ?label as optional, but I think that the filter will only succeed when ?label is bound, effectively making ?label non-optional. My reasoning is as follows: in the case where ?label is not bound, the expression lang(?label) will be an error (unless an implementation extends lang()), and both langMatches and = expect non-error values, so we'd have this reduction:
langMatches(lang(?label),"en") || lang(?label) = ""
langMatches(error, "en") || error = ""
error || error
error (which FILTER treats as false)
I'm basing this on section 17.2 of the SPARQL 1.1 recommendation, which says:
17.2 Filter Evaluation
Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean
(EBV)" in the operator mapping table below), are coerced to
xsd:boolean using the EBV rules in section 17.2.2.
Apart from BOUND, COALESCE, NOT EXISTS and EXISTS, all functions and operators operate on RDF Terms and will produce a type error if any
arguments are unbound.
Any expression other than logical-or (||) or logical-and (&&) that encounters an error will produce that error.
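The way logical-or treats errors can be modelled in a few lines of Python. This is only a toy model of the truth table in the recommendation, with the string "error" standing in for a type error; it is not how any SPARQL engine is implemented.

```python
ERROR = "error"  # stand-in marker for a SPARQL type error (illustrative only)


def sparql_or(left, right):
    """Logical-or: true wins over an error, otherwise an error propagates."""
    if left is True or right is True:
        return True
    if left == ERROR or right == ERROR:
        return ERROR
    return False


def filter_keeps(value):
    """A FILTER keeps a solution only when its expression evaluates to true."""
    return value is True


# Unbound ?label: both operands are errors, so the solution is dropped.
print(filter_keeps(sparql_or(ERROR, ERROR)))
# Bound English label: langMatches is true, so the solution is kept.
print(filter_keeps(sparql_or(True, False)))
```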
Based on that, I'd rewrite the query as the following. My impression is that it's a little bit faster, but that might just be confirmation bias. It's not much faster, though.
select distinct ?p ?label where {
?x ?p dbpedia:United_States .
?p a owl:ObjectProperty ;
rdfs:label ?label .
filter( langMatches(lang(?label),"en") || lang(?label) = "" )
}
order by ?label
limit 10
offset 3
SPARQL results
2. Why langmatch slows query SOOO much? I wish langmatch is not regex based? If this performance issue is unavoidable using langmatch, is there a faster way to work with languages?
The public DBpedia SPARQL endpoint can be a bit slow at times, but that doesn't seem to be the issue here. When I run your original query, or the new one above, it takes six or seven seconds to get the results. Two things to note though:
langMatches isn't regular expression based. The docs for langMatches say that it "Returns true if language-tag (first argument) matches language-range (second argument) per the basic filtering scheme defined in RFC4647 section 3.3.1. language-range is a basic language range per Matching of Language Tags RFC4647 section 2.1. A language-range of "*" matches any non-empty language-tag string." The basic filtering is case insensitive, but it's not regex.
langMatches isn't the only thing that might be causing some slower results. Note that to find the first 10 of something (or, in general, the mth through the nth), you have to visit all the elements. You don't have to sort all of them, but you have to visit all of them, which means that there's no way to get just the results from the desired page (unless there's some special indexing going on; keep making this query and maybe it will speed up over time :)). This leads us into the next point, though.
3. Is there a better (faster) way to build pagination based requests than using limit/offset? If no, what is the best way to avoid performance issues like mentioned above with limit/offset?
While the original and updated queries take six or seven seconds to retrieve the 10 results with limit 10, asking for limit 1000, or limit 5000, also only take about six or seven seconds. Using limit/offset is the correct way to do pagination, but ordering the results can be expensive, since to find the elements in some particular range, you have to look at all the elements (though you don't necessarily have to order all the elements). It probably makes sense, then, to make those pages as big as possible, and to do any presentation paging locally. E.g., instead of running 100 queries for 10 results each (100 queries × 7 seconds = 700 seconds = 11 minutes and 40 seconds), you can run 1 query for 1000 results (1 query × 7 seconds = 7 seconds), and do any important paged presentation locally.
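Local paging over one large result can be as simple as slicing a list. The sketch below assumes the rows of a single "limit 1000" query have already been fetched and are already ordered; the generated strings are placeholders for real result rows.

```python
def local_pages(rows, page_size):
    """Split one large, already ordered result list into presentation pages."""
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


# Placeholder for the rows returned by one 'limit 1000' query.
rows = [f"property-{i:04d}" for i in range(1000)]

pages = local_pages(rows, 10)
print(len(pages))   # number of pages available without any further query
print(pages[3])     # the fourth page, served from memory
```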
Handling of the language filter is up to the SPARQL engine. How does it store literals? Can it use indexes or another technique to avoid a full scan to get the literals for the desired language?
You can store a literal as a "chat"#en string, but selecting all English literals for a given property would then require scanning all of the property's literals for an #en match.
In some SPARQL engines, you can get the actual execution plan. For example, here is the way to do it in Virtuoso: Virtuoso execution plan. However, you can't use it on the public endpoint.
Query optimization, execution, and query hints are very well documented for RDBMSs: you can easily find out what the database really does to answer your query and how to modify the schema or query to get the best results. IMHO, SPARQL engines are not that mature in this respect.

How does a search index work when querying many words?

I'm trying to build my own search engine as an experiment.
I know about inverted indexes, for example when indexing words:
the key is the word, and the value is a list of ids of the documents containing that word. So when you search for that word you get the documents right away.
How does it work for multiple words?
Do you get all documents for every word and traverse those documents to see if they have both words?
I feel that is not the case.
Does anyone know the real answer for this, without speculating?
An inverted index is very efficient for getting an intersection, using a zig-zag algorithm:
Assume your terms are a list T:
lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
    if (currTerm > T.last): //if we have passed the last term:
        insert lastDoc into result
        currTerm <- 0
        lastDoc <- lastDoc + 1
        continue
    docId <- T[currTerm].getFirstAfter(lastDoc-1)
    if (docId != lastDoc):
        lastDoc <- docId
        currTerm <- 0
    else:
        currTerm <- currTerm + 1
This algorithm assumes an efficient getFirstAfter(), which returns the first document that matches the term and whose docId is greater than the specified parameter. It should return infinity if there is none.
The algorithm will be most efficient if the terms are sorted such that the rarest term is first.
The algorithm needs at most #docs_matching_first_term * #terms iterations, but in practice it will usually need far fewer.
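A runnable Python version of the same idea could look like the sketch below. It assumes each term's postings are a sorted list of unique integer doc ids held in memory, and it uses bisect as a stand-in for an efficient getFirstAfter(); a real index would use skip lists or similar structures over compressed postings.

```python
from bisect import bisect_left


def first_at_or_after(postings, doc_id):
    """Smallest posting >= doc_id, or None when the list is exhausted."""
    i = bisect_left(postings, doc_id)
    return postings[i] if i < len(postings) else None


def zigzag_intersect(posting_lists):
    """Intersect sorted doc-id lists by leapfrogging a shared candidate doc id."""
    if not posting_lists or any(not p for p in posting_lists):
        return []
    lists = sorted(posting_lists, key=len)  # rarest term first
    result = []
    candidate, term = lists[0][0], 0
    while True:
        doc = first_at_or_after(lists[term], candidate)
        if doc is None:                 # one list is exhausted: no more matches
            return result
        if doc != candidate:            # overshoot: restart with the new candidate
            candidate, term = doc, 0
            continue
        term += 1
        if term == len(lists):          # every list contains the candidate
            result.append(candidate)
            candidate, term = candidate + 1, 0


print(zigzag_intersect([[1, 2, 3, 4, 8], [2, 4, 6, 8]]))
```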
Note: Though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page]
You need to store the position of each word in a document in the index file.
Your index file structure should be like this:
word id - doc id - no. of hits - pos of hits.
Now suppose the query contains 4 words "w1 w2 w3 w4". Choose the documents containing most of the words. Now calculate the words' relative distance in each document. The document where most of the words occur and their relative distance is minimal will have high priority in the search results.
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine
For more info read this paper by the Google founders: click here
You find the intersection of document sets as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs in more documents than B. What you do then is iterate through all matches for B and search for each of them in the list for A. If the lists can indeed be searched in logarithmic time, this means you need
occ(B) * log(occ(A))
computational steps to identify all matches that contain both terms.
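As a small sketch of this strategy, the Python function below probes the longer list with a binary search for every id in the shorter one. It assumes both match lists are uncompressed, sorted Python lists of doc ids, which sidesteps the compression issue mentioned above.

```python
from bisect import bisect_left


def intersect_by_lookup(list_a, list_b):
    """AND two sorted doc-id lists: binary-search the longer one for each id of the shorter."""
    short, long_ = sorted((list_a, list_b), key=len)
    matches = []
    for doc_id in short:                      # occ(B) iterations
        i = bisect_left(long_, doc_id)        # about log(occ(A)) steps each
        if i < len(long_) and long_[i] == doc_id:
            matches.append(doc_id)
    return matches


print(intersect_by_lookup([1, 2, 3, 5, 8, 13, 21], [2, 8, 9]))
```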
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
Example (Lucene.NET, C#):
var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too. Although, a QueryParser might be easier to use.
You can view this answer for example on how to split the string using the same analyzer that you used for indexing:
No hits when searching for "mvc2" with lucene.net

inverse document frequency

The inverse document frequency is defined as follows:
IDF(term,document) = tf(term) * log(1 + n/df(term))
where tf(term) = 'frequency of term in document', n = 'number of documents', df(term) = 'number of docs containing term'.
Just curious about df(term) - do I only count a document once even if it contains the term more than once?
Also, is it easy to determine this stat with Lucene(.NET)? I am only starting to use the latter and use a relational db at the moment.
Thanks.
Christian
For using idf with Lucene, check the API for example here.
You are right about the docs being counted only once. The idea is to get a function with a lower bound in the log part. Like this:
If you are interested in the idf theory behind the scenes, you may peep at this paper.
HTH!
Of course, each document counts only once toward DF(term). Therefore, you should group the words in each document to get the distinct words.
See my class IDF here
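Counting each document once is easy to get right by turning every document's tokens into a set first. The sketch below is plain Python and independent of Lucene; the toy documents are made up, and idf() computes only the log(1 + n/df(term)) part of the formula in the question.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Per term, the number of documents containing it (each document counts once)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() collapses repeats inside one document
    return df


def idf(term, docs):
    """log(1 + n/df(term)); returns 0.0 for a term that appears nowhere."""
    df = document_frequency(docs)
    return math.log(1 + len(docs) / df[term]) if df[term] else 0.0


docs = [["solr", "solr", "boost"], ["boost"], ["date"]]  # toy tokenised documents
print(document_frequency(docs)["solr"])  # "solr" occurs twice but in one document
print(idf("boost", docs))
```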
