Understanding Lucene Queries - search

I am interested in knowing a little more specifically about how Lucene queries are scored. In their documentation, they mention the VSM. I am familiar with VSM, but it seems inconsistent with the types of queries they allow.
I tried stepping through the source code for BooleanScorer2 and BooleanWeight, to no real avail.
My question is: can somebody step through the execution of a BooleanScorer to explain how it combines queries?
Also, is there a way to simply send out several terms and just get the raw tf.idf score for those terms, the way it is described in the documentation?

The place to start is http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Similarity.html
I think it clears up your inconsistency? Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
The next thing to look at is Searcher.explain, which can give you a string explaining how the score for a (query, document) pair is calculated.
Tracing through the execution of BooleanScorer can be challenging, I think. It's probably easiest to understand BooleanScorer2 first, which uses subscorers like ConjunctionScorer/DisjunctionSumScorer, and to think of BooleanScorer as an optimization.
If this is confusing, then start even simpler at TermScorer. Personally I look at it "bottom-up" anyway:
A Query creates a Weight valid across the whole index: this incorporates boost, idf, queryNorm, and even, confusingly, boosts of any 'outer'/'parent' queries (like a BooleanQuery) that are holding the term. This weight is computed a single time.
A Weight creates a Scorer (e.g. TermScorer) for each index segment. For a single term, this scorer has everything it needs in the formula except for what is document-dependent: the within-document term frequency (TF), which it must read from the postings, and the document's length normalization value (norm). This is why TermScorer scores a document as weight * sqrt(tf) * norm. In practice this is cached for tf values < 32 so that scoring most documents is a single multiply.
BooleanQuery really doesn't do much, except that its scorers are responsible for nextDoc()'ing and advance()'ing subscorers; when the Boolean model is satisfied, it combines the scores of the subscorers, applying the coordination factor (coord()) based on how many subscorers matched.
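To make the arithmetic above concrete, here is a minimal sketch in Ruby rather than Lucene's Java. The names (`build_score_cache`, `term_score`, `boolean_score`, `SCORE_CACHE_SIZE`) are hypothetical and are not Lucene's API. It assumes the combined score is the sum of the matching subscorer scores and that coord is simply matched clauses divided by total clauses.

```ruby
# Illustrative only: mirrors the arithmetic described above, not Lucene's API.
SCORE_CACHE_SIZE = 32

# Precompute weight * sqrt(tf) for small tf values, so that scoring a
# typical document is a single multiply by its norm.
def build_score_cache(weight)
  Array.new(SCORE_CACHE_SIZE) { |tf| weight * Math.sqrt(tf) }
end

# Score one term in one document: weight * sqrt(tf) * norm.
def term_score(weight, cache, tf, norm)
  raw = tf < SCORE_CACHE_SIZE ? cache[tf] : weight * Math.sqrt(tf)
  raw * norm
end

# Combine the scores of the clauses that matched one document.
# Assumption: coord = matched clauses / total clauses.
def boolean_score(sub_scores, total_clauses)
  return 0.0 if sub_scores.empty?
  coord = sub_scores.size.to_f / total_clauses
  sub_scores.sum * coord
end

cache = build_score_cache(2.0)
first_clause = term_score(2.0, cache, 4, 0.5) # 2.0 * sqrt(4) * 0.5
puts boolean_score([first_clause, 1.0], 3)    # two of three clauses matched
```

The point of the cache is only that the document-independent part (weight) and the small-tf square roots are computed once, not per document.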
In general, it's definitely difficult to trace through how Lucene scores documents, because in all released forms the Scorers are responsible for two things: matching and calculating scores. In Lucene's trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/) these are now separated, in such a way that a Similarity is basically responsible for all aspects of scoring, and this is separate from matching. So the API there might be easier to understand, maybe harder, but at least you can refer to implementations of many other scoring models (BM25, language models, divergence from randomness, information-based models) if you get confused: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/similarities/

Related

How to implement synonyms for use in a search engine?

I am working on a pet search engine (SE).
What I have right now is boolean keyword SE, as a library that is split in two parts:
index: this is an inverted index, i.e. it associates terms with the original documents where they appear
query: which is supplied by the user and can be arbitrarily complex boolean expression that looks like (mobile OR android OR iphone) AND game
I'd like to improve the search engine so that it automatically extends simple queries into boolean queries that include search terms which do not appear in the original query, i.e. I'd like to support synonyms.
I need some help to build the synonyms graph.
How can I compute lists of words that appear in similar contexts?
Here is an example of the lists of synonyms I'd like to compute:
psql, pgsql, postgres, postgresql
mobile, iphone, android
and also synonyms that include n-grams, like:
rdbms, relational database management systems, ...
The algorithm doesn't have to be perfect; I can post-process the result by hand, but I at least need a clue about which terms are similar to which other terms.
In the standard Information Retrieval (IR) literature, this enrichment of a query with additional terms (that don't appear in the initial/original query) is known as query expansion.
There are plenty of standard approaches which, generally speaking, are based on the idea of scoring terms on some factors and then selecting the K terms (K being a parameter) that have the highest scores.
To compute the term selection score, it is assumed that the top (M) ranked documents retrieved after initial retrieval are relevant, this being called pseudo-relevance feedback.
The factors on which the term selection function generally depends are:
The term frequency of a term in a top ranked document - the higher the better.
The number of documents (out of the top M) in which the term occurs - the higher the better.
How many times an additional term co-occurs with a query term - the higher the better.
The co-occurrence factor is the most important and would give you terms such as 'pgsql' if the original query contains 'psql'.
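The three factors above can be sketched as a small term-selection function. This is only an illustration: the function name `expansion_terms` is hypothetical and the way the three factors are added together is an arbitrary choice made for this sketch, not a standard formula.

```ruby
# Pseudo-relevance feedback sketch: pick k expansion terms from the top M docs.
# top_docs is an array of token arrays (the top M documents, assumed relevant).
def expansion_terms(query_terms, top_docs, k)
  scores = Hash.new(0.0)
  top_docs.each do |tokens|
    counts = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
    # How often the query terms themselves occur in this document.
    query_hits = query_terms.sum { |q| counts.fetch(q, 0) }
    counts.each do |term, tf|
      next if query_terms.include?(term)
      # tf: term frequency; 1: one more top document containing the term;
      # tf * query_hits: co-occurrence with query terms (arbitrary weighting).
      scores[term] += tf + 1 + tf * query_hits
    end
  end
  scores.sort_by { |term, score| [-score, term] }.first(k).map(&:first)
end

top_docs = [
  %w[psql pgsql connect database],
  %w[psql pgsql pgsql dump],
  %w[mysql dump]
]
puts expansion_terms(["psql"], top_docs, 2)
```

The selected terms can then be OR-ed into the original boolean query, after a manual review if the lists are noisy.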
Note that if documents are too short, this method would not work well and you have to use other methods that are necessarily semantics based such as i) word-vector based expansion or ii) wordnet-based expansion.

Content based recommendation in scale

This question has probably been asked many times on blogs and Q&A websites, but I couldn't find any concrete answer yet.
I am trying to build a recommendation system for customers using only their purchase history.
Let's say my application has n products.
Compute item similarities for all the n products based on their attributes (like country, type, price)
When a user needs a recommendation - loop over the products p previously purchased by user u and fetch the similar products (the similarity was computed in the previous step)
If I am right, we call this content-based recommendation, as opposed to collaborative filtering, since it doesn't involve co-occurrence of items or user preferences for an item.
My problem is multi-fold:
Is there any existing scalable ML platform that addresses content-based recommendation? (I am fine with adopting different technologies/languages.)
Is there a way to tweak Mahout to get this result?
Is classification a way to handle content based recommendation?
Is it something that a graph database is good at solving?
Note: I looked at Mahout (since I am familiar with Java, and Mahout apparently utilizes Hadoop for distributed processing) for doing this at scale, with the advantage of having well tested ML algorithms.
Your help is appreciated. Any examples would be really great. Thanks.
The so-called item-item recommenders are natural candidates for precomputing the similarities, because the attributes of the items rarely change. I would suggest you precompute the similarity between each pair of items, and perhaps store the top K for each item. If you have enough resources, you could load the similarity matrix into main memory for real-time recommendation.
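A minimal sketch of that precomputation, in plain Ruby rather than Mahout. All names here (`attribute_similarity`, `top_k_similar`, `recommend`) are hypothetical, and the similarity measure (fraction of attributes with equal values) is just a placeholder assumption; any attribute-based measure could be swapped in.

```ruby
# Similarity of two items: fraction of attributes on which they agree.
def attribute_similarity(a, b)
  keys = a.keys | b.keys
  return 0.0 if keys.empty?
  keys.count { |key| a.key?(key) && b.key?(key) && a[key] == b[key] }.to_f / keys.size
end

# Precompute, for every item id, its k most similar other items.
# items is a Hash of id => attribute Hash.
def top_k_similar(items, k)
  items.each_with_object({}) do |(id, attrs), table|
    scored = items.reject { |other_id, _| other_id == id }
                  .map { |other_id, other| [other_id, attribute_similarity(attrs, other)] }
    table[id] = scored.sort_by { |other_id, sim| [-sim, other_id.to_s] }.first(k)
  end
end

# At request time: sum the precomputed similarities over the user's purchases,
# skipping items the user already bought.
def recommend(purchased, table, n)
  totals = Hash.new(0.0)
  purchased.each do |id|
    table.fetch(id, []).each do |other_id, sim|
      totals[other_id] += sim unless purchased.include?(other_id)
    end
  end
  totals.sort_by { |id, score| [-score, id.to_s] }.first(n).map(&:first)
end

items = {
  "a" => { country: "FR", type: "wine", price: "low" },
  "b" => { country: "FR", type: "wine", price: "high" },
  "c" => { country: "US", type: "beer", price: "low" }
}
table = top_k_similar(items, 1)
puts recommend(["a"], table, 2)
```

The all-pairs loop is quadratic in the number of items, which is why it belongs in an offline step; only the small top-K table is needed at request time.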
Check out my answer to this question for a way to do this in Mahout: Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
The example shows how to compute the textual similarity between the items, and then load the precomputed values into main memory.
For performance comparison about different data structures to hold the values check out this question: Mahout precomputed Item-item similarity - slow recommendation

Effect of randomness on search results

I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them based on the score assigned.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and seen a positive effect from it? I am thinking that it might be useful to reduce bias and promote the lower-ranking items, to give them a chance to be seen more easily and become popular if they deserve it. I know that some machine learning algorithms introduce some randomization to reduce bias, so I thought it might be applied to search as well.
Closest I can get here is this but not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID.
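To illustrate what "consistently random" means, here is a small Ruby sketch of the idea only; it does not use or reproduce Elasticsearch's implementation. The function names and the choice of an MD5 hash of seed and document id are assumptions made for this example.

```ruby
require "digest"

# Deterministic pseudo-random number in [0, 1) derived from a seed
# (e.g. a session ID) and a document id.
def seeded_random(seed, doc_id)
  Digest::MD5.hexdigest("#{seed}:#{doc_id}")[0, 8].to_i(16) / 2.0**32
end

# Sort by score, descending; documents with the same score are ordered by
# the seeded random value, so one seed always yields the same order.
def consistently_random_order(hits, seed)
  hits.sort_by { |hit| [-hit[:score], seeded_random(seed, hit[:id])] }
end

hits = [
  { id: 1, score: 1.0 }, { id: 2, score: 1.0 },
  { id: 3, score: 2.0 }, { id: 4, score: 1.0 }
]
puts consistently_random_order(hits, "session-abc").map { |hit| hit[:id] }.inspect
```

Because the tie-breaker depends only on the seed and the document id, page 2 of the results is consistent with page 1 for the same user, while a different seed shuffles the ties differently.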
Your intuition is right - randomization can help surface results that get a lower than deserved score due to uncertainty in the estimation. Empirically, Google search ads seemed to have sometimes been randomized, and e.g. this paper is hinting at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
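One well-known instance of ordering by an optimistic estimate is the UCB1 rule: the observed mean plus a bonus that shrinks as an item gets more impressions. The sketch below applies it to click rates; the field names (`:clicks`, `:impressions`) and function names are hypothetical, and a real ranking would have to combine this with the relevance score.

```ruby
# UCB1-style optimistic score: observed click rate plus an uncertainty bonus.
# Items never shown get an infinite score so that they are tried at least once.
def optimistic_score(clicks, impressions, total_impressions)
  return Float::INFINITY if impressions.zero?
  mean = clicks.to_f / impressions
  bonus = Math.sqrt(2.0 * Math.log(total_impressions) / impressions)
  mean + bonus
end

# Order items by their optimistic score, highest first.
def rank_optimistically(items)
  total = items.sum { |item| item[:impressions] }
  items.sort_by { |item| -optimistic_score(item[:clicks], item[:impressions], total) }
end

items = [
  { id: :popular, clicks: 50, impressions: 1000 },
  { id: :new,     clicks: 1,  impressions: 5 },
  { id: :unseen,  clicks: 0,  impressions: 0 }
]
puts rank_optimistically(items).map { |item| item[:id] }.inspect
```

Rarely shown items get a large bonus and float upwards until enough impressions have been collected to trust their observed rate.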

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.
Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.
How do you go about finding the possibly related documents?
I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.
edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.
This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:
The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
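As a rough illustration of a bag of N-grams, the following Ruby sketch builds word n-gram counts and compares two texts. The function names are hypothetical, and Jaccard overlap of the n-gram sets is used here only because it is short; it stands in for whatever similarity or classifier you would really apply to the vectors.

```ruby
# Bag of word n-grams: maps each run of n consecutive words to its count.
def ngram_bag(text, n)
  words = text.downcase.scan(/[a-z0-9']+/)
  words.each_cons(n).each_with_object(Hash.new(0)) do |gram, bag|
    bag[gram.join(" ")] += 1
  end
end

# Jaccard overlap of the two n-gram sets: shared n-grams / all n-grams.
def ngram_overlap(text_a, text_b, n)
  a = ngram_bag(text_a, n).keys
  b = ngram_bag(text_b, n).keys
  union = a | b
  union.empty? ? 0.0 : (a & b).size.to_f / union.size
end

puts ngram_bag("the cat sat on the cat", 2).inspect
puts ngram_overlap("a survey of user opinion", "a survey of graph minors", 2)
```

With n = 1 this degenerates to the plain bag of words; larger n makes matches stricter and the vectors larger, as described above.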
Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.
These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
You should read the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications" (ISBN 0596529325)!
For some method and code: First ask yourself, whether you want to find direct similarities based on word matches, or whether you want to show similar articles that may not directly relate to the current one, but belong to the same cluster of articles.
See Cluster analysis / Partitional clustering.
A very simple (but theoretical and slow) method for finding direct similarities would be:
Preprocess:
Store flat word list per article (do not remove duplicate words).
"Cross join" the articles: count number of words in article A that match same words in article B. You now have a matrix int word_matches[narticles][narticles] (you should not store it like that, similarity of A->B is same as B->A, so a sparse matrix saves almost half the space).
Normalize the word_matches counts to range 0..1! (find max count, then divide any count by this) - you should store floats there, not ints ;)
Find similar articles:
select the X articles with highest matches from word_matches
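The steps above can be sketched in Ruby as follows. The names are hypothetical, "matching words" is interpreted here as the number of word occurrences two articles share (duplicates counted), and only one entry per unordered pair is stored, since A->B equals B->A.

```ruby
# Number of word occurrences shared by two word lists (duplicates counted).
def word_matches(words_a, words_b)
  remaining = words_b.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  words_a.count do |w|
    next false unless remaining[w] > 0
    remaining[w] -= 1
    true
  end
end

# articles is a Hash of name => flat word list. Returns a Hash of
# [name_a, name_b] => match count normalized to 0..1 by the maximum count.
def similarity_table(articles)
  raw = {}
  articles.keys.combination(2) do |a, b|
    raw[[a, b]] = word_matches(articles[a], articles[b])
  end
  max = raw.values.max.to_f
  return raw.transform_values { 0.0 } if max.zero?
  raw.transform_values { |count| count / max }
end

# The x articles with the highest normalized match score for one article.
def most_similar(table, article, x)
  table.select { |pair, _| pair.include?(article) }
       .sort_by { |pair, score| [-score, pair] }
       .first(x)
       .map { |pair, score| [(pair - [article]).first, score] }
end

articles = {
  "a" => %w[ruby search index ruby],
  "b" => %w[ruby ruby index],
  "c" => %w[cooking pasta index]
}
puts most_similar(similarity_table(articles), "a", 1).inspect
```

As the answer says, this is slow: every pair of articles is compared, so it only suits small collections.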
This is a typical case of Document Classification, which is studied in every Machine Learning class. If you like statistics, mathematics and computer science, I recommend that you have a look at unsupervised methods like kmeans++, Bayesian methods and LDA. In particular, Bayesian methods are pretty good at what you are looking for; their only problem is being slow (but unless you run a very large site, that shouldn't bother you much).
For a more practical and less theoretical approach, I recommend that you have a look at this and this other great code example.
A small vector-space-model search engine in Ruby. The basic idea is that two documents are related if they contain the same words. So we record which words occur in each document and then compute the cosine between these vectors (each term has a fixed index; if it appears there is a 1 at that index, if not a zero). Cosine will be 1.0 if two documents have all terms in common, and 0.0 if they have no common terms. You can directly translate that to % values.
# each new term gets the next free index
terms = Hash.new { |h, k| h[k] = h.size }
docs = DATA.collect { |line|
  name = line[/^\d+/]
  words = line.downcase.scan(/[a-z]+/)
  vector = []
  words.each { |word| vector[terms[word]] = 1 }
  { :name => name, :vector => vector }
}
current = docs.first # or any other
# assume we have defined cosine on arrays;
# sort by descending cosine and leave out the current document itself
ranked = docs.reject { |doc| doc.equal?(current) }.sort_by { |doc|
  -doc[:vector].cosine(current[:vector])
}
related = ranked.first(5).collect { |doc| doc[:name] }
puts related
__END__
0 Human machine interface for Lab ABC computer applications
1 A survey of user opinion of computer system response time
2 The EPS user interface management system
3 System and human system engineering testing of EPS
4 Relation of user-perceived response time to error measurement
5 The generation of random, binary, unordered trees
6 The intersection graph of paths in trees
7 Graph minors IV: Widths of trees and well-quasi-ordering
8 Graph minors: A survey
The definition of Array#cosine is left as an exercise to the reader (it should deal with nil values and different lengths, but well, for that we've got Array#zip, right?).
BTW, the example documents are taken from the SVD paper by Deerwester et al. :)
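For readers who do not want to do the exercise, here is one possible definition of Array#cosine. It is a sketch under the assumptions stated in the answer: nil entries and positions past the end of the shorter array are treated as 0. It uses an index loop instead of Array#zip so that both arrays are padded the same way.

```ruby
class Array
  # Cosine similarity between two numeric arrays.
  # nil entries and missing trailing entries count as 0.
  def cosine(other)
    length = [size, other.size].max
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    length.times do |i|
      a = (self[i] || 0).to_f
      b = (other[i] || 0).to_f
      dot += a * b
      norm_a += a * a
      norm_b += b * b
    end
    # A vector with no non-zero entries has no direction; call it unrelated.
    return 0.0 if norm_a.zero? || norm_b.zero?
    dot / Math.sqrt(norm_a * norm_b)
  end
end

puts [1, nil, 1].cosine([1, 1])
```

Reopening Array keeps the call site in the example unchanged; in a larger program a plain helper method would avoid patching a core class.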
Some time ago I implemented something similar. Maybe this idea is now outdated, but I hope it can help.
I ran an ASP 3.0 website about common programming tasks and started from this principle: a user has a question and will stay on the website as long as he/she can find interesting content on that subject.
When a user arrived, I started an ASP 3.0 Session object and recorded all of the user's navigation, just like a linked list. At the Session.OnEnd event, I took the first link, looked for the next link and incremented a counter column, like:
<Article Title="Cookie problem A">
<NextPage Title="Cookie problem B" Count="5" />
<NextPage Title="Cookie problem C" Count="2" />
</Article>
So, to check related articles I just had to list top n NextPage entities, ordered by counter column descending.
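The same bookkeeping can be sketched in a few lines of Ruby (not ASP). The function names are hypothetical, and an in-memory Hash stands in for the counter column; each session is assumed to be the ordered list of article titles the user visited.

```ruby
# Nested counter: counts[from_title][to_title] = number of times users
# went from one article to the next.
def new_transition_counts
  Hash.new { |hash, key| hash[key] = Hash.new(0) }
end

# Called once per finished session with the pages in visiting order.
def record_session(counts, pages)
  pages.each_cons(2) do |from, to|
    counts[from][to] += 1 unless from == to
  end
  counts
end

# Top n "next pages" for an article, ordered by counter, descending.
def related_articles(counts, title, n)
  counts.fetch(title, {}).sort_by { |to, count| [-count, to] }.first(n).map(&:first)
end

counts = new_transition_counts
record_session(counts, ["Cookie problem A", "Cookie problem B", "Cookie problem C"])
record_session(counts, ["Cookie problem A", "Cookie problem B"])
record_session(counts, ["Cookie problem A", "Cookie problem C"])
puts related_articles(counts, "Cookie problem A", 2)
```

Note that this needs real traffic before it suggests anything, so it complements rather than replaces the text-based methods.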

Ways to do "related searches" functionality

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering the best way to model this in a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query, then when a new search is performed to find all the historical searches that match some amount of the top 10 results but ideally not matching all of them (matching all of them might suggest an equivalent search and hence not that useful as a suggestion).
I imagine that some people have done this functionality before and may be able to provide some ideas of different ways to do this. I'm not necessarily looking for one winning idea since the solution will no doubt vary substantially depending on the size and nature of the site.
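The idea described in the question could be sketched like this. Everything here is hypothetical: the function name, the `min_shared` and `limit` parameters, and the assumption that the history is a Hash from past query to the ids of its stored top results.

```ruby
# history: Hash of past query string => array of its stored top result ids.
# Suggest past queries that share at least min_shared results with the
# current search, but not all of them (that would be an equivalent search).
def related_searches(history, query, results, min_shared: 3, limit: 5)
  history.map { |past, past_results| [past, (past_results & results).size] }
         .reject { |past, shared| past == query || shared < min_shared || shared == results.size }
         .sort_by { |past, shared| [-shared, past] }
         .first(limit)
         .map(&:first)
end

history = {
  "ruby gems"   => [1, 2, 3, 4, 5],
  "gem install" => [1, 2, 3, 4, 9],
  "bundler"     => [1, 2, 3, 8, 9],
  "python"      => [20, 21]
}
puts related_searches(history, "rubygems", [1, 2, 3, 4, 5])
```

The threshold is the part that needs tuning per site: too low and unrelated queries leak in, too high and sparse data yields no suggestions at all.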
Have you considered a matrix with keywords on one axis and documents on the other? Once you find the set of vectors representing the keywords, find the keywords that occur in your initial result set, and then find a way to rank the other keywords by how many documents they reference or how many times they intersect the initial result set.
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (e.g. with the Porter Stemming Algorithm), then use the stemmed terms instead of the initially provided queries, and give the user the option of searching for exactly the terms they specified. (Or do the opposite: search the exact terms first, and use stemming to find the terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexing process finds them.)
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)
