inverse document frequency - search

The inverse document frequency is defined as follows:
IDF(term,document) = tf(term) * log(1 + n/df(term))
where tf(term) = 'frequency of term in document', n = 'number of documents', df(term) = 'number of docs containing term'.
Just curious about df(term): do I count a document only once even if it contains the term more than once?
Also, is it easy to determine this stat with Lucene(.NET)? I am only starting to use the latter and use a relational db at the moment.
Thanks.
Christian

To use idf with Lucene, check the API, for example here.
You are right about the docs being counted only once. The idea is to get a function with a lower bound in the log part. Like this:
If you are interested in the idf theory behind the scenes, you may peep at this paper.
HTH!

Of course, each document counts only once towards DF(term). Therefore, you should group the words within each document so that you work with distinct words.
See my class IDF here
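To make the counting rule concrete, here is a minimal Python sketch of the formula from the question, tf(term) * log(1 + n/df(term)). It is only an illustration, not the Lucene implementation: the function names and the toy corpus are made up, and documents are assumed to be already tokenized into lists of words.

```python
import math
from collections import Counter

def document_frequency(term, corpus):
    """Number of documents containing `term`; each document counts once."""
    return sum(1 for doc in corpus if term in doc)

def tf_idf(term, document, corpus):
    """Weight from the question: tf(term) * log(1 + n / df(term))."""
    tf = Counter(document)[term]          # raw count of term in this document
    n = len(corpus)                       # number of documents
    df = document_frequency(term, corpus)
    if df == 0:                           # term never seen: no weight
        return 0.0
    return tf * math.log(1 + n / df)

# Hypothetical toy corpus: "apple" occurs twice in the first document,
# but that document still contributes only 1 to df("apple").
corpus = [
    ["apple", "apple", "banana"],
    ["apple", "cherry"],
    ["banana", "cherry", "cherry"],
]

print(document_frequency("apple", corpus))
print(tf_idf("apple", corpus[0], corpus))
```

The repeated "apple" raises tf for that document, while df stays at the number of distinct documents containing the word.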

Related

OpenNLP doccat trainer always results in "1 outcome patterns"

I am evaluating OpenNLP for use as a document categorizer. I have a sanitized training corpus with roughly 4k files, in about 150 categories. The documents have many shared, mostly irrelevant words - but many of those words become relevant in n-grams, so I'm using the following parameters:
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 20000);
params.put(TrainingParameters.CUTOFF_PARAM, 10);
DoccatFactory dcFactory = new DoccatFactory(new FeatureGenerator[] { new NGramFeatureGenerator(3, 10) });
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
Some of these categories apply to documents that are almost completely identical (think boiler-plate legal documents, with maybe only names and addresses different between document instances) - and will be mostly identical to documents in the test set. However, no matter how I tweak these params, I can't break out of the "1 outcome patterns" result. When running a test, every document in the test set is tagged with "Category A."
I did manage to effect a single minor change in output, by moving from previous use of the BagOfWordsFeatureGenerator to the NGramFeatureGenerator, and from maxent to Naive Bayes; before the change, every document in the test set was assigned "Category A", but after the change, all the documents were now assigned to "Category B." But other than that, I can't seem to move the dial at all.
I've tried fiddling with iterations, cutoff, ngram sizes, using maxent instead of bayes, etc; but all to no avail.
Example code from tutorials that I've found online uses much smaller training sets with fewer iterations, and is able to perform at least some rudimentary differentiation.
Usually in such a situation - bewildering lack of expected behavior - the engineer has forgotten to flip some simple switch, or has some fatal lack of fundamental understanding. I am eminently capable of both those failures. Also, I have no Data Science training, although I have read a couple of O'Reilly books on the subject. So the problem could be procedural. Is the training set too small? Is the number of iterations off by an order of magnitude? Would a different algo be a better fit? I'm utterly surprised that no tweaks have even slightly moved the dial away from the "1 outcome" outcome.
Any response appreciated.
Well, the answer to this one did not come from the direction in which the question was asked. It turns out that there was a code sample in the OpenNLP documentation that was wrong, and no amount of parameter tuning would have solved it. I've submitted a jira to the project so it should be resolved; but for those who make their way here before then, here's the rundown:
Documentation (wrong):
String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
Should be something like:
String inputText = ... // sanitized document to be classified
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText.split(" "));
String category = myCategorizer.getBestCategory(outcomes);
DocumentCategorizerME.categorize() needs an array; since this is an obviously self-documenting bug the second you run the code, I had assumed the necessary array parameter should be an array of documents in string form; instead it needs an array of tokens from a single document.

inverted index sets - querying key prefixes

I'm using Redis to build an inverted index system for words and the documents that contain those words.
The setup is really simple: Redis Sets where the key of the Set is i:word and the values of the Set are the ids of the documents that have this word.
Let's say I have 2 sets: i:example and i:result.
The query "example result" will intersect i:example and i:result and return all the ids that have both example and result as members.
But what I'm looking for is a way to efficiently perform a query like "ex res". The result set should contain at least all the ids from the query "example result".
Solutions that i thought of:
create prefix sets of size 2: p:ex - contains {"example", "expertise", "ex"...}. the lookup running time will not be a problem - O(1) to get the set and O(n) to check all elements in the set for words that start with the prefix (where n = set.size()) but i worry about the added size price.
Using scan: but i'm not sure about the running time - query like scan 0 match ex* will take O(n) where n is the number of keys in the db? I know redis is fast but it's probably not an optimized solution for query like "ex machi cont".
The usual way to go about this is the first approach you mentioned, but usually you'd go with segments that are 3+ chars long. Note that you'll need to have a set for each segment, e.g. i:exa, i:exam, i:examp, i:exampl and of course i:example.
This will naturally take up space in your database (hence the suggestion to start at 3 rather than 2 characters). A possible tweak is to keep in the i:len(3) sets only references to i:len(4+) sets instead of document ids. This will require more read operations but will yield significant savings in terms of RAM.
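As a rough illustration of the per-segment sets described above, the following Python sketch uses plain dicts and sets in place of Redis Sets (so the intersection stands in for SINTER). The helper names, the sample documents and the choice to index words shorter than 3 characters whole are assumptions made for the example.

```python
from collections import defaultdict

MIN_PREFIX = 3  # assumed minimum segment length, as suggested above

def build_index(docs):
    """Map every prefix (length >= MIN_PREFIX) of every word to doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Words shorter than MIN_PREFIX are indexed as a whole.
            for end in range(min(MIN_PREFIX, len(word)), len(word) + 1):
                index["i:" + word[:end]].add(doc_id)
    return index

def search(index, query):
    """Intersect the sets of all query fragments."""
    sets = [index.get("i:" + frag, set()) for frag in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Hypothetical documents.
docs = {1: "example result", 2: "expert resolution", 3: "example only"}
index = build_index(docs)

print(search(index, "exa res"))
print(search(index, "example result"))
```

Every additional character of every word becomes one more key, which is where the extra memory goes.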
You should explore v2.8.9's addition of lexicographical ranges for Sorted Sets. By calling ZRANGEBYLEX you can get ranges of members (e.g. all the words that start with ex). While this could be useful in this context by itself, consider that you can also use your Sorted Set's members creatively to encode a word and its document reference. This can help you get over the "loss" of the score (since all scores need to be the same for lexicographical ordering to work). For example, assuming the words "bed" and "beg" in docs 1 and 2:
ZADD index 0 "beg:1" 0 "bed:2"
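With members encoded like that, a prefix lookup becomes a lexicographical range query, for instance ZRANGEBYLEX index "[be" "[be\xff". The Python sketch below only imitates that behaviour with a sorted list and bisect so it can run without a server. Redis compares raw bytes whereas this compares Python strings, and the extra members "bet:1" and "cat:3" are made up for the example.

```python
import bisect

# Same-score members of a Sorted Set are ordered lexicographically;
# a sorted Python list stands in for that ordering here.
members = sorted(["beg:1", "bed:2", "bet:1", "cat:3"])

def range_by_prefix(sorted_members, prefix):
    """All members starting with `prefix`, found by two binary searches."""
    lo = bisect.bisect_left(sorted_members, prefix)
    hi = bisect.bisect_left(sorted_members, prefix + "\uffff")
    return sorted_members[lo:hi]

def docs_for_prefix(sorted_members, prefix):
    """Decode the document id stored after the last ':' of each member."""
    return {int(m.rsplit(":", 1)[1])
            for m in range_by_prefix(sorted_members, prefix)}

print(range_by_prefix(members, "be"))
print(docs_for_prefix(members, "be"))
```

The word and the document id travel together in the member, so no score is needed to recover the reference.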
Lastly, here's a little something to think about too: adding suffix searching (e.g. everything that ends with "ample"): https://redislabs.com/blog/how-to-use-redis-at-least-x1000-more-efficiently

Stronger boosting by date in Solr

Boosting by date field in solr is defined as:
{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}
I looked everywhere (examples: Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field and they all reference the SolrRelevancyFAQ), same definition that is used. But I found that this is not boosting my results sufficiently. How can I make this date boosting stronger?
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
And the Solr debug output is far too confusing for me to understand the problem.
Now, this is not a huge problem. 99% of queries work fine and produce expected results, so it's not like Solr is not working at all; I just found this situation very confusing and don't know how to proceed.
recip(x, m, a, b) implements f(x) = a/(xm+b) with :
x : the document age in ms, defined as ms(NOW,<datefield>).
m : a constant that defines a time scale which is used to apply boost. It should be relative to what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10ms) implies to use its inverse : 3.16e-11 (1/3.16e10 rounded).
a and b are constants (defined arbitrarily).
xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
xm ≈ 0 when the document is new, resulting in a value close to a/b.
Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.
With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
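The numbers above are easy to check with a few lines of Python. This is just the formula f(x) = a/(xm+b) written out, not Solr code; the constant MS_PER_YEAR is the rounded 3.16e10 figure used above.

```python
MS_PER_YEAR = 3.16e10  # approximate number of milliseconds in one year

def recip(x, m, a, b):
    """Same shape as Solr's recip(x, m, a, b): a / (m*x + b)."""
    return a / (m * x + b)

m = 3.16e-11  # inverse of a 1-year reference_time, rounded

for years in (0, 1, 2, 5):
    age_ms = years * MS_PER_YEAR
    print(years, "year(s) old ->", round(recip(age_ms, m, 1, 1), 3))
```

Because m is rounded, the one-year value comes out very slightly above 1/2 rather than exactly 1/2.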
How to make a date boosting stronger ?
Increase m : choose a lower reference_time for example 6 months, that gives us m = 6.33e-11. Comparing to a 1 year reference, the multiplier decreases 2x faster as the document age increases.
Decreasing a and b expands the response curve of the function. This can be very aggressive, see this example (page 8).
Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a dismax parameter so it requires using DisMax or eDisMax query parser), eg. :
bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
It is important to note a few things :
bf is an additive boost and acts as a bonus added to the score of newer documents.
{!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.
A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), meaning that a resultset with higher scores may not be impacted as much as a resultset with lower scores. In contrast, multiplicative boosts affect scores the same way regardless of the resultset relevancy, that's why it is usually preferred.
Do not use recip() for dates more than one reference_time in the future or it will yield negative values.
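To see how these options differ, here is a small Python sketch that applies the same recip() value either as a multiplier (like {!boost b}) or as an added bonus (like bf). The relevancy scores 10.0 and 1.0 are invented, and real Solr scoring involves more than this; the sketch only shows the arithmetic.

```python
MS_PER_YEAR = 3.16e10  # approximate number of milliseconds in one year

def recip(x, m, a, b):
    """a / (m*x + b), the function behind the date boost."""
    return a / (m * x + b)

def multiplicative(score, age_ms, m):
    """Like {!boost b=recip(...)}: scales the relevancy score."""
    return score * recip(age_ms, m, 1, 1)

def additive(score, age_ms, m, weight=1.0):
    """Like bf=recip(...)^weight: adds a bonus to the relevancy score."""
    return score + weight * recip(age_ms, m, 1, 1)

age = MS_PER_YEAR  # a one-year-old document
for m in (3.16e-11, 6.33e-11):  # 1-year and 6-month reference_time
    print("m =", m)
    for score in (10.0, 1.0):  # hypothetical relevancy scores
        print("  score", score,
              "multiplicative:", round(multiplicative(score, age, m), 3),
              "additive:", round(additive(score, age, m), 3))
```

With the 6-month reference the multiplier for a one-year-old document drops from about 1/2 to about 1/3, and the additive bonus barely moves a score of 10 while it changes a score of 1 considerably.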
See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.
User is searching for two keywords. Both items contain both keywords
(in same order) in both title and description. Neither of the keywords
is repeated.
Well, by your example, it is clear that your results have landed into a tie situation. To understand this problem of confusing debug output and devise a tie-breaker policy, it is important to understand dismax.
With DisMax queries, the different terms of the user input are executed against different fields, if many of them hit (the term appears in different fields in the same document) the hit that scores higher is used, but what happens with the other sub-queries that hit in that document for the term? Well, that’s what the tie parameter defines. DisMax will calculate the score for a term query as:
score= [score of the top scoring subquery] + tie * (sum of other hitting subqueries)
In consequence, the tie parameter is a value between 0 and 1 that will define if the Dismax will only consider the max hit score for a term (setting tie=0), all the hits for a term (setting tie=1) or something between those two extremes.
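The formula above can be written out in Python to see what the tie parameter does. The sub-query scores are invented, and this is only the arithmetic of the formula, not Solr's implementation.

```python
def dismax_score(subquery_scores, tie):
    """Top sub-query score plus tie times the sum of the other scores."""
    if not subquery_scores:
        return 0.0
    top = max(subquery_scores)
    others = sum(subquery_scores) - top
    return top + tie * others

# Hypothetical scores of one term hitting in title, description and body.
scores = [0.8, 0.5, 0.2]

for tie in (0.0, 0.1, 1.0):
    print("tie =", tie, "->", round(dismax_score(scores, tie), 3))
```

With tie=0 only the best field counts; with tie=1 all fields are summed, which is one way to separate documents that would otherwise tie.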
The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the Extended Dismax Query Parser or the Lucid Query Parser.
There is an interesting article Comparing Boost Methods of SOLR which may be useful to you.
References for this answer:
Advanced Apache Solr boosting: a case study
Using Solr’s Dismax Tie Parameter
Shishir
There is an example very well presented in the ReciprocalFloatFunction that will give you a clear view on how the boosting recipe works. If you find that dismax does not offer you enough control over the boosting, you will have to do some tinkering with BoostQParserPlugin.
A multiplier of 3.16e-11 changes the units from milliseconds to years
(since there are about 3.16e10 milliseconds per year). Thus, a very
recent date will yield a value close to 1/(0+1) or 1, a date a year in
the past will get a multiplier of about 1/(1+1) or 1/2, and date two
years old will yield 1/(2+1) or 1/3.

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from 3rd party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number in each attribute's payload and using this payload at the posting list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular' at all stages of posting list processing the 'x.1234' entries would be accepted until the intersection stage where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would it be to incorporate it into Lucene?
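The idea in the draft can be tried out independently of Lucene. The Python sketch below keeps, for each attribute value, a set of (token position, parse number) pairs and intersects on the whole pair, so attributes from different parses of the same word never match together. The data structures, function names and the annotations for "man" and "elephant" are assumptions for illustration; it says nothing about how Lucene payloads would be wired in.

```python
from collections import defaultdict

# postings["pos=verb"] = {(token_position, parse_number), ...}
postings = defaultdict(set)

def index_token(position, parses):
    """Record every attribute of every parse of the token at `position`."""
    for parse_number, attrs in enumerate(parses, start=1):
        for key, value in attrs.items():
            postings[f"{key}={value}"].add((position, parse_number))

def search(*attributes):
    """Positions where a single parse carries all the given attributes."""
    if not attributes:
        return []
    hits = set.intersection(*(postings.get(a, set()) for a in attributes))
    return sorted({position for position, _ in hits})

# "A man saw an elephant." with token positions 0..4
index_token(1, [{"lemma": "man", "pos": "noun", "number": "singular"}])
index_token(2, [{"lemma": "see", "pos": "verb", "tense": "past"},
                {"lemma": "saw", "pos": "noun", "number": "singular"}])
index_token(4, [{"lemma": "elephant", "pos": "noun", "number": "singular"}])

print(search("pos=verb", "number=singular"))
print(search("pos=noun", "number=singular"))
```

The first query finds nothing because "verb" belongs to parse 1 of "saw" and "singular" to parse 2, while the second query returns "saw" only through its noun parse.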
So... the cheater way of doing this is (indeed) to control how you build the lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes noun-past-see -- index it as that.
saw also becomes noun-past-singular-see -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

How does a search index work when querying many words?

I'm trying to build my own search engine for experimenting.
I know about inverted indexes, for example when indexing words:
the key is the word, and its value is a list of ids of the documents containing that word. So when you search for that word you get the documents right away.
How does it work for multiple words?
Do you get all documents for every word and traverse those documents to see if they have both words?
I feel that is not the case.
Does anyone know the real answer for this, without speculating?
An inverted index is very efficient for computing an intersection, using a zig-zag algorithm:
Assume your terms is a list T:
lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
    if (currTerm > T.last): //if we have passed the last term:
        insert lastDoc into result
        currTerm <- 0
        lastDoc <- lastDoc + 1
        continue
    docID <- T[currTerm].getFirstAfter(lastDoc-1)
    if (docID != lastDoc):
        lastDoc <- docID
        currTerm <- 0
    else:
        currTerm <- currTerm + 1
This algorithm assumes an efficient getFirstAfter(), which returns the first document that matches the term and whose docID is greater than the specified parameter. It should return infinity if there is none.
The algorithm will be most efficient if the terms are sorted such that the rarest term is first.
The algorithm needs at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer.
Note: Though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page]
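For anyone who wants to run the zig-zag idea, here is one possible Python version. It assumes each term's posting list is a sorted list of integer doc ids starting at 0, and it implements getFirstAfter() with a binary search; both choices are assumptions of this sketch, not something a particular engine prescribes.

```python
import bisect

INF = float("inf")

def first_after(postings, doc_id):
    """First id in sorted `postings` strictly greater than doc_id, else INF."""
    i = bisect.bisect_right(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(terms):
    """`terms` is a list of sorted doc-id lists, ideally rarest term first."""
    result = []
    if not terms:
        return result
    last_doc, curr = 0, 0
    while last_doc != INF:
        if curr == len(terms):       # every term contains last_doc
            result.append(last_doc)
            curr = 0
            last_doc += 1
            continue
        doc_id = first_after(terms[curr], last_doc - 1)
        if doc_id != last_doc:       # this term skips ahead: restart there
            last_doc, curr = doc_id, 0
        else:
            curr += 1
    return result

# Hypothetical posting lists for three terms.
print(zigzag_intersect([[2, 5, 9], [1, 2, 3, 5, 8, 9], [2, 4, 5, 9, 10]]))
```

Each time a term jumps past the current candidate, the search restarts from that later doc id, so large stretches of the longer lists are never visited.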
You need to store the position of each word in a document in the index file.
Your index file structure should be like this:
word id - doc id - no. of hits - pos of hits.
Now suppose the query contains 4 words "w1 w2 w3 w4". Choose the documents containing most of the words. Then calculate the relative distance of the words within each document. The documents where most of the words occur and where their relative distance is smallest get high priority in the search results.
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine
For more info, read this paper by the Google founders: click here
You find the intersection of document sets as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, and assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you do then is to iterate through all matches for B and search for each of them in the list for A. If indeed the lists can be searched in logarithmic time, this means you need
occ(B) * log(occ(A))
computational steps to identify all matches that contain both terms.
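A small Python sketch of that approach, assuming both match lists are plain sorted lists of doc ids held in memory (real engines store them compressed, as noted above). The function names and sample lists are made up.

```python
import bisect

def contains(sorted_list, doc_id):
    """Binary search: O(log n) membership test on a sorted list."""
    i = bisect.bisect_left(sorted_list, doc_id)
    return i < len(sorted_list) and sorted_list[i] == doc_id

def intersect(list_a, list_b):
    """Walk the shorter list and look each id up in the longer one."""
    short, long_ = sorted((list_a, list_b), key=len)
    return [doc_id for doc_id in short if contains(long_, doc_id)]

# Hypothetical match lists: A is frequent, B is rare.
matches_a = [1, 2, 3, 5, 8, 9, 12]
matches_b = [2, 9, 11]

print(intersect(matches_a, matches_b))
```

The loop runs occ(B) times and each lookup costs about log(occ(A)) steps, which is where the bound above comes from.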
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
Example (Lucene.NET, C#):
var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too. Although, a QueryParser might be easier to use.
You can view this answer for example on how to split the string using the same analyzer that you used for indexing:
No hits when searching for "mvc2" with lucene.net
