SOLR relevance seems tied heavily to length of document indexed

We have a lot of documents in SOLR and a certain type of them tends to score too highly in results (it appears mainly due to them generally being quite short in content). So if I search for a name it will always return a load of short documents before anything longer.
How can I weight results so that the length of the document is taken more into account when ranking for relevance?
If it helps (as a kludge), we have a flag set on the documents this generally applies to, so if it is possible to boost all documents that don't have this flag set, that would be a temporary option for us.

This is caused by the lengthNorm in scoring. Longer documents with the same matching terms receive a somewhat lower score than shorter documents. See TFIDFSimilarity's documentation (scroll down to "6. norm(t,d)"), as well as the Solr documentation here.
This tends to work well for full-text search applications, the idea being that the document with the higher proportion of its content matching the query is more relevant to the query.
For instance, if I search wikipedia article titles for the term Monkey, relevance of articles found might be:
Monkey - Precise match, it would be reasonable to assume this is what I was looking for
Spider Monkey - A well-known type of monkey, still quite relevant
Monkey: Journey to the West - A stage play featuring a main character who is a monkey. Likely less relevant.
African green monkey lymphotropic polyomavirus - A human tumor virus. Relevance to query limited.
If it's really necessary, this can be overridden in a custom DefaultSimilarity by overriding computeNorm(state, norm) so that it encodes only state.getBoost() and ignores the field length.

Check the source of DefaultSimilarity for Lucene 4.0:
@Override
public void computeNorm(FieldInvertState state, Norm norm) {
    final int numTerms;
    // Optionally ignore overlapping tokens (e.g. synonyms indexed at the same position)
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    // norm = boost * 1/sqrt(numTerms): the more terms in the field, the smaller the norm
    norm.setByte(encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)))));
}
So numTerms has an inverse impact on the score.
You can create a custom class overriding this behaviour, for example by:
Treating numTerms as a constant 1
Changing the calculation so that it increases the score of longer documents rather than the inverse, as it does now
Removing the calculation ((float) (1.0 / Math.sqrt(numTerms))) to eliminate the lengthNorm effect, as sketched below
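A minimal sketch of the last option, modelled on the 4.0 source above (the package and class names are placeholders, and the exact computeNorm/Norm signatures depend on your Lucene 4.x version):
package my.pkg;

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Similarity that drops length normalization, so short documents no longer
// get an automatic boost over longer ones.
public class NoLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
        // Keep only the index-time boost; ignore the number of terms in the field.
        norm.setByte(encodeNormValue(state.getBoost()));
    }
}
You would then point Solr at this class with a <similarity> element in the schema and re-index, since norms are baked in at index time.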

Related

How to implement synonyms for use in a search engine?

I am working on a search engine (SE) as a pet project.
What I have right now is a boolean keyword SE, built as a library that is split into two parts:
index: this is an inverted index, i.e. it associates terms with the original documents in which they appear
query: supplied by the user; it can be an arbitrarily complex boolean expression that looks like (mobile OR android OR iphone) AND game
I'd like to improve the search engine so that it automatically extends simple queries into boolean queries that include search terms which do not appear in the original query, i.e. I'd like to support synonyms.
I need some help to build the synonyms graph.
How can I compute a list of words that appear in similar contexts?
Here is an example of the lists of synonyms I'd like to compute:
psql, pgsql, postgres, postgresql
mobile, iphone, android
and also synonyms that include n-grams, like:
rdbms, relational database management systems, ...
The algorithm doesn't have to be perfect, I can post-process the result by hand, but at least I need to have a clue about which terms are similar to which other terms.
In the standard Information Retrieval (IR) literature, this enrichment of a query with additional terms (that don't appear in the initial/original query) is known as query expansion.
There are plenty of standard approaches which, generally speaking, are based on the idea of scoring terms by some set of factors and then selecting a number of terms (say K, a parameter) that have the highest scores.
To compute the term selection score, it is assumed that the top M ranked documents retrieved by the initial query are relevant; this is called pseudo-relevance feedback.
The factors on which the term selection function generally depends are:
The term frequency of a term in a top-ranked document - the higher the better.
The number of documents (out of the top M) in which the term occurs - the higher the better.
How many times an additional term co-occurs with a query term - the higher the better.
The co-occurrence factor is the most important and would give you terms such as 'pgsql' if the original query contains 'psql'.
Note that if documents are too short, this method would not work well, and you would have to use other methods that are necessarily semantics-based, such as i) word-vector based expansion or ii) WordNet-based expansion.
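As a rough, made-up sketch of this kind of scoring (the class, method, and weighting choices here are invented for illustration; a real implementation would use properly normalized weights per the IR literature):
import java.util.*;
import java.util.stream.Collectors;

// Toy pseudo-relevance feedback: score candidate expansion terms from the
// tokenized top-M documents returned for the original query, then keep the top K.
public class QueryExpansion {

    public static List<String> expansionTerms(List<List<String>> topDocs,
                                              Set<String> queryTerms,
                                              int k) {
        Map<String, Double> score = new HashMap<>();
        for (List<String> doc : topDocs) {
            // Does this pseudo-relevant document also contain one of the query terms?
            boolean coOccurs = !Collections.disjoint(doc, queryTerms);
            for (String term : doc) {
                if (queryTerms.contains(term)) continue;   // don't re-add original terms
                double s = score.getOrDefault(term, 0.0);
                s += 1.0;                                  // frequency in top-ranked documents
                if (coOccurs) s += 2.0;                    // crude co-occurrence bonus
                score.put(term, s);
            }
        }
        // Keep the K highest-scoring candidate terms.
        return score.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}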

Effect of randomness on search results

I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them based on the assigned score.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful to reduce bias and promote lower-ranking items, to give them a chance to be seen more easily and become popular if they deserve it. I know that some machine learning algorithms introduce some randomization to reduce bias, so I thought it might be applied to search as well.
The closest I can find here is this, but it's not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID
Your intuition is right - randomization can help surface results that get a lower score than they deserve due to uncertainty in the estimation. Empirically, Google search ads seem to have sometimes been randomized, and e.g. this paper hints at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
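As a rough illustration of that idea (not part of the original answer; the class name and the clicks/impressions utility are made up), here is a UCB1-style re-ranking sketch where each result is ordered by its estimated utility plus an optimism bonus that shrinks as impressions accumulate:
import java.util.*;

// Toy "optimism in the face of uncertainty" re-ranking: rarely shown items get a
// larger bonus, so they occasionally surface above established ones.
public class OptimisticRanker {

    public static <T> List<T> rank(Map<T, long[]> stats) {   // item -> {clicks, impressions}
        long sum = 1;
        for (long[] s : stats.values()) sum += s[1];
        final long totalImpressions = sum;

        List<T> items = new ArrayList<>(stats.keySet());
        items.sort(Comparator.comparingDouble((T item) -> {
            long[] s = stats.get(item);
            double impressions = Math.max(1, s[1]);
            double meanUtility = s[0] / impressions;          // e.g. click-through rate
            double bonus = Math.sqrt(2.0 * Math.log(totalImpressions) / impressions);
            return meanUtility + bonus;                       // optimistic estimate
        }).reversed());
        return items;
    }
}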

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long-time browser here, but have never had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now, assuming that I want to provide my end users with the best search experience, but also the best variety of products in my catalog, how would I go about applying a kind of demerit to keep the same brand from showing up in the search results more than 5 times? For the record, I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly, it would group similar results, much like http://news.google.com/ groups related news stories.
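For example (the manufacturer field name is only an assumption about your schema), a grouped query might look something like:
q=dvd+players&group=true&group.field=manufacturer&group.limit=5
Here group.limit caps how many documents come back per manufacturer group; check the wiki page above for the exact parameters your Solr version supports.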
Some ideas here, although I've not tried them myself.
You can use the Carrot clustering plugin for Solr to cluster the search results, let's say on manufacturer, and then feed them to a custom RequestHandler that re-orders the results (cherry-picking from each manufacturer cluster) for diversity.
However, there are downsides to the approach: you may need to fetch a larger result set than necessary, and the resulting order will be synthetic.
Achieving this is a lengthy and complex process, but worth trying. Let's say the main field you are searching on is a single field called title. First you'll need to make sure that all the documents containing "dvd player" in that field get the same score. You can do this by neutralizing Solr scoring parameters like the field norm (set omitNorms=true) and term frequency (write a Solr plugin to neutralize it; code attached below).
Implementation Details:
1) Compile the following class and put it into Solr's WEB-INF/classes:
package my.package;

import org.apache.lucene.search.DefaultSimilarity;

// Flattens term frequency so that a term either matches (1.0) or doesn't (0.0),
// removing tf differences between documents from the score.
public class CustomSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
2) Register the new similarity class in schema.xml by adding:
<similarity class="my.package.CustomSimilarity"/>
All this will help you make the score the same for all documents with "dvd player" in their title. After that you can define a field of a random sort type. Then, when you query Solr, you can sort first by score and then by the random field. Since the score for all documents containing DVD players would be the same, results will effectively get arranged by the random field, giving the customer a better variety of products from your catalog.
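A rough sketch of that last part, assuming Solr's built-in RandomSortField (the field and dynamic-field names are illustrative; check the example schema for your Solr version):
<fieldType name="random" class="solr.RandomSortField" indexed="true"/>
<dynamicField name="random_*" type="random"/>
You could then query with something like sort=score desc, random_1234 desc; keeping the same suffix (the seed) gives a stable order within a session, while changing it reshuffles the tied results.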

Understanding Lucene Queries

I am interested in knowing a little more specifically about how Lucene queries are scored. In their documentation, they mention the VSM. I am familiar with VSM, but it seems inconsistent with the types of queries they allow.
I tried stepping through the source code for BooleanScorer2 and BooleanWeight, to no real avail.
My question is: can somebody step through the execution of a BooleanScorer to explain how it combines queries?
Also, is there a way to simply send out several terms and just get the raw tf.idf score for those terms, the way it is described in the documentation?
The place to start is http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Similarity.html
I think it clears up your inconsistency? Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
The next thing to look at is Searcher.explain, which can give you a string explaining how the score for a (query, document) pair is calculated.
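For example, a minimal sketch that dumps that breakdown for the top hits of a query, using IndexSearcher.explain (package locations as in recent 3.x/4.x releases):
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Prints the scoring breakdown (tf, idf, norms, boosts) for each of the top hits.
public class ExplainExample {
    public static void explainTopHits(IndexSearcher searcher, Query query) throws Exception {
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            Explanation explanation = searcher.explain(query, hit.doc);
            System.out.println(explanation.toString());
        }
    }
}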
Tracing through the execution of BooleanScorer can be challenging, I think; it's probably easiest to understand BooleanScorer2 first, which uses subscorers like ConjunctionScorer/DisjunctionSumScorer, and to think of BooleanScorer as an optimization.
If this is confusing, then start even simpler, at TermScorer. Personally I look at it "bottom-up" anyway:
A Query creates a Weight valid across the whole index: this incorporates boost, idf, queryNorm, and even, confusingly, boosts of any 'outer'/'parent' queries such as a BooleanQuery holding the term. This weight is computed a single time.
A Weight creates a Scorer (e.g. TermScorer) for each index segment. For a single term, this scorer has everything it needs for the formula except what is document-dependent: the within-document term frequency (tf), which it must read from the postings, and the document's length normalization value (norm). This is why TermScorer scores a document as weight * sqrt(tf) * norm. In practice this is cached for tf values < 32, so that scoring most documents is a single multiply.
BooleanQuery really doesn't do "much", except that its scorers are responsible for nextDoc()'ing and advance()'ing subscorers, and when the Boolean model is satisfied, it combines the scores of the subscorers, applying the coordination factor (coord()) based on how many subscorers matched.
In general, it's definitely difficult to trace through how Lucene scores documents because, in all released forms, the Scorers are responsible for two things: matching and calculating scores. In Lucene's trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/) these are now separated, in such a way that a Similarity is basically responsible for all aspects of scoring, and this is kept separate from matching. So the API there might be easier to understand, maybe harder, but at least you can refer to implementations of many other scoring models (BM25, language models, divergence from randomness, information-based models) if you get confused: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/similarities/

Is it OK to have a precision value of 100% in text retrieval system?

Since the formula for precision is:
retrieved_and_relevant/(retrieved_and_relevant+retrieved_and_irrelevant)
I am wondering if the value of precision in a text-retrieval system will ever be different from 100%. I don't think it will, because we programmers put a hell of a lot of effort into not forgetting to squeeze out each and every bit of text from all the documents out there. So, when a query is fired into the text-retrieval system, it will output all the documents containing the query text. This means that all the documents retrieved are relevant documents, essentially making the score 100%.
Is this true or am I missing some point ?
You're slightly confused about the concept behind precision.
A simple example would be searching for the terms iraq war. Depending on how the search engine is designed, the results may or may not be what the user is looking for. It might return:
Wars that Iraq, the country is involved in
A fictional story about a soldier in the current Iraq war,
A news article that talks about various wars and their financial impact.
Each document could be completely different and contain the exact search terms, but might be irrelevant to what the user was looking for.
The search engine would definitely LIKE to have a precision of 100% but it's very rare that this is the case.
Precision can ONLY be determined by the user who performs the search query, as they are the only one who knows without a doubt whether a result is relevant or not. It's definitely something to strive for, but don't believe it will always equal 100%.
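As a made-up worked example: suppose a query retrieves 10 documents, all of which contain the query text, but the user judges only 6 of them relevant. Then precision = 6 / (6 + 4) = 60%, even though every retrieved document matched the query terms.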
