I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using Solr where some broad searches (let's say DVD Players) tend to return a lot of results from the same manufacturer in the first 50 results.
Now assuming that I want to give my end user the best search experience, but also the best variety of products from my catalog, how would I go about applying some kind of demerit so that the same brand doesn't show up in the search results more than 5 times? For the record, I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players' or 'Hard Drives'; naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if Solr has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.
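A minimal sketch of what such a request might look like, built in Python. The group, group.field and group.limit parameters are Solr's result grouping parameters; the field name "manufacturer", the core URL and the helper name are assumptions for illustration, not something taken from the question.

```python
from urllib.parse import urlencode


def grouped_query_url(base_url, user_query, brand_field="manufacturer", per_brand=5):
    """Build a Solr select URL that groups hits by brand (hypothetical field name)."""
    params = {
        "q": user_query,
        "defType": "dismax",
        "group": "true",             # turn on result grouping
        "group.field": brand_field,  # one group per distinct brand value
        "group.limit": per_brand,    # at most this many documents per group
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local core named "products".
url = grouped_query_url("http://localhost:8983/solr/products/select", "DVD Players")
print(url)
```

The response then comes back as groups rather than a flat list, so the front end would need to flatten or interleave them itself.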
Some ideas here, although I've not tried them myself.
You can use the Carrot2 clustering plugin for Solr to cluster the search results, say by manufacturer, and then feed the clusters to a custom RequestHandler that re-orders the results for diversity (cherry-picking from each manufacturer cluster).
However, the approach has downsides: you may need to fetch more results than you actually display, and the resulting order will be synthetic.
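As a rough illustration of the re-ordering step only, here is a sketch in Python. It is a simpler stand-in for the clustering idea: plain post-processing that keeps the relevance order but pushes hits beyond a per-brand cap to the end. The "brand" key and the function name are assumptions, and it only helps if you over-fetch, as noted above.

```python
from collections import defaultdict


def cap_per_brand(results, max_per_brand=5, brand_key="brand"):
    """Keep relevance order, but move hits beyond the per-brand cap to the end."""
    seen = defaultdict(int)
    kept, demoted = [], []
    for doc in results:
        seen[doc[brand_key]] += 1
        if seen[doc[brand_key]] <= max_per_brand:
            kept.append(doc)
        else:
            demoted.append(doc)
    return kept + demoted


hits = [
    {"id": 1, "brand": "A"},
    {"id": 2, "brand": "A"},
    {"id": 3, "brand": "A"},
    {"id": 4, "brand": "B"},
]
print([d["id"] for d in cap_per_brand(hits, max_per_brand=2)])
```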
Achieving this is a lengthy and complex process, but worth trying. Let's say the main field you are searching is a single field called title. First you'll need to make sure that all the documents containing "dvd player" in it get the same score. You can do this by neutralizing Solr scoring factors such as the field norm (set omitNorms=true) and term frequency (write a Solr plugin to ignore it; code below).
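Implementation details:
1) Compile the following class and put it into Solr's WEB-INF/classes (the package is named my.pkg here because "package" is a reserved word in Java):
    package my.pkg;

    import org.apache.lucene.search.DefaultSimilarity;

    // Ignores term frequency: a matching term counts once, however often it occurs.
    public class CustomSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
    }
2) Register the new similarity class by adding this element to schema.xml:
    <similarity class="my.pkg.CustomSimilarity"/>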
All this makes the score the same for every document with "dvd player" in its title. After that you can define a field of the random type. Then, when you query Solr, sort first by score and then by the random field. Since the score for all the documents containing "dvd player" will be the same, the results get ordered by the random field, giving the customer a better variety of products from your catalog.
Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define what "relevant enough" means, since scores aren't really comparable between queries and don't tell you "this was xyz relevant".
You'll have to work out why the included documents aren't relevant and exclude them based on those criteria. Then either use the review score to boost documents further up (if you want the search to appear organic, i.e. ordered by relevance), or simply exclude the irrelevant ones and sort by user score. But remember that turning user score into an ordering that feels relevant to the user is usually a harder problem than just sorting by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
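To make the "multiply by a small boost" idea concrete, here is a hedged sketch in Python. The first function shows the arithmetic locally; the second builds request parameters using edismax's multiplicative boost parameter with Solr's sum, product and ln functions. The field name "rating", the 0.2 weight and both helper names are assumptions, and the weight would need the evaluation and testing mentioned above.

```python
import math


def boosted_score(organic_score, rating, weight=0.2):
    """Multiply the text score by a small, rating-dependent factor (always >= 1)."""
    return organic_score * (1.0 + weight * math.log1p(rating))


def boosted_params(user_query, rating_field="rating", weight=0.2):
    """Request parameters asking edismax to apply the same multiplier server-side."""
    return {
        "q": user_query,
        "defType": "edismax",
        # edismax multiplies each document's score by the value of this function
        "boost": f"sum(1,product({weight},ln(sum({rating_field},1))))",
    }


print(boosted_score(2.0, 0), boosted_score(2.0, 5))
print(boosted_params("dvd player")["boost"])
```

With a logarithm, an unrated product keeps its organic score and a highly rated one gets a modest lift, so text relevance still dominates the ordering.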
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens up a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular: different features are learnt from search logs to deliver more relevant documents to users given their profiles and prior search behavior. Solr offers this as a module.
I was explaining to a friend of mine how great CouchDB was and I was actually doing a very good job of it, until he asked me if you could do a car sale database. After giving this quite a lot of thought, I have no answer, I kind of think this is impossible.
My dilemma is this. Let's say a car has an owner_id, manufacturer, year, type, color, mileage and price.
My first thought was to just emit all the keys. But you might want to search for a car that is blue or red or yellow, has been driven between 30,000 and 80,000 miles, and falls within some price range. And given this query, what if you don't search by color?
The only way I can think of is to run many queries and do a manual brute-force array diff in my database layer code. But that seems quite excessive, even if there are only a few thousand cars.
So, in short, is this possible to do in a viable way?
CouchDB is designed for scalability, and infinitely flexible ad-hoc queries are not scalable, so they are discouraged. It is possible though, with temporary views. You can POST your view (query) as a JSON object to /db/_temp_view.
See for more details http://wiki.apache.org/couchdb/HTTP_view_API#Temporary_Views and https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Concept.
Also, this answer might be of use to you.
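A different technique from temporary views, sketched here under an assumption: on CouchDB 2.0 or later there is a declarative query endpoint (Mango, POST /db/_find) that accepts a JSON selector. The Python below only builds the request body; the field names come from the question, while the helper name is made up for illustration. Criteria the user leaves out (for example color) are simply omitted from the selector.

```python
import json


def car_selector(colors=None, mileage=None, price=None):
    """Build a Mango query body; criteria left as None are omitted."""
    selector = {}
    if colors:
        selector["color"] = {"$in": list(colors)}
    if mileage:
        selector["mileage"] = {"$gte": mileage[0], "$lte": mileage[1]}
    if price:
        selector["price"] = {"$gte": price[0], "$lte": price[1]}
    return {"selector": selector}


# Blue, red or yellow cars driven between 30,000 and 80,000 miles, any price.
body = car_selector(colors=["blue", "red", "yellow"], mileage=(30000, 80000))
print(json.dumps(body, indent=2))
```

Without a matching index such a query falls back to scanning documents, so the scalability caveat above still applies.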
As you say, there's no nice way to spell this in CouchDB the way you might in a relational database with support for spatial indices. That's not to say that there's no hope at all. You could, for instance, do some simple clustering to index cars that share a certain set of attributes together.
Mileage and price look like good candidates for this approach. Most queries will probably specify these values to a single significant digit:
function oneDigit(value) {
  var strValue = String(value);
  // Keep the leading digit and zero out the rest, e.g. 67500 -> 60000.
  return Number(strValue[0]) * Math.pow(10, strValue.length - 1);
}
With this, we can build a view which organizes cars into bins based on their price and mileage.
function (doc) {
  // The key is [mileage bin, price bin]; no value is needed.
  emit([oneDigit(doc.mileage), oneDigit(doc.price)], null);
}
It's then a simple matter of fetching all of the cars in the matching bins, one mileage bin at a time:
cars = []
for mileage in range(60000, 100000, 10000):
    cars.extend(db.view('cars/mileageAndPrice', startkey=[mileage, minPrice], endkey=[mileage, maxPrice]))
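To see which bins a given search touches, the binning can be mirrored on the client. This is a sketch only: one_digit follows the JavaScript helper above, while bins_between and view_ranges are hypothetical names, and it assumes non-negative whole-number mileage and price.

```python
def one_digit(value):
    """Round a non-negative number down to one significant digit (67500 -> 60000)."""
    digits = str(int(value))
    return int(digits[0]) * 10 ** (len(digits) - 1)


def bins_between(low, high):
    """List every one-digit bin that can contain values in [low, high]."""
    bins = []
    current = one_digit(low)
    while current <= high:
        bins.append(current)
        # The bin width grows with the magnitude: 1000 below 10000, 10000 above.
        current += 10 ** (len(str(current)) - 1)
    return bins


def view_ranges(mileage, price):
    """One (startkey, endkey) pair per mileage bin, spanning the price bins."""
    low_price, high_price = one_digit(price[0]), one_digit(price[1])
    return [([m, low_price], [m, high_price]) for m in bins_between(*mileage)]


print(bins_between(8500, 25000))
print(view_ranges((60000, 80000), (5000, 12000)))
```

Because the bins are coarse, the rows that come back still need an exact check against the requested mileage and price range on the client side.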
I would like Solr to be able to "learn" from my website users' choices. By that I mean that I know which product a user clicks after performing a search, so I have collected a list of [term searched => number of clicks] for each product indexed in Solr. But I can't figure out how to apply a boost that depends on the user input. Is it possible to index some key/value pairs for a document and retrieve the value with a function usable in the boost parameter?
I'm not sure I'm being clear, so I'll add a concrete example:
Let's say that when a user searches for "garden chair", Solr returns 3 products: "green garden chair", "blue chair", and "hamac for garden".
"green garden chair" ranks first, the hamac ranks last, as expected.
But then all the users searching for "garden chair" end up clicking on the hamac.
I would like to help the hamac rank first for the search "garden chair" WITHOUT altering its rank for other searches. So I would like to be able to perform a key=>value based boost.
Is that possible to achieve with Solr?
I'm sure that I can't be the first one needing this kind of user-based search result improvement.
Thanks in advance.
You could use the edismax bq parameter, if you are using edismax (or maybe bf). For this to work, you obviously need to store the click info somewhere (in a db, redis, whatever you fancy):
searched "garden chair":
clicked "hamac for garden": 10
clicked "green garden chair": 4
searched "green table":
...
And so forth. Look this up when a search comes in, and if there is info available for that search, send a bq that boosts what you want.
Also, check out the QueryElevationComponent. It might serve your purpose (although it is stronger than just boosting). There are two things to consider though:
Every time the click numbers change you would need to modify the XML and reload, so it would be better to batch the updates nightly or something like that.
There was a recent JIRA issue about providing similar functionality through request params, with no need for XML changes or a reload, so check that out too.
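The bq lookup described above could be sketched like this in Python. The click table mirrors the example; the product ids, the "id" field name and the helper name are assumptions. Each clicked product becomes a boosted clause using the usual field:value^boost query syntax.

```python
def click_boost(clicks_by_search, search_terms, id_field="id"):
    """Return a bq string for a known search, or None when no clicks are recorded."""
    clicks = clicks_by_search.get(search_terms.strip().lower())
    if not clicks:
        return None
    return " ".join(
        f'{id_field}:"{doc_id}"^{count}' for doc_id, count in clicks.items()
    )


# Hypothetical click log, keyed by normalized search terms.
clicks_by_search = {
    "garden chair": {"hamac-for-garden": 10, "green-garden-chair": 4},
}

print(click_boost(clicks_by_search, "Garden Chair"))
print(click_boost(clicks_by_search, "green table"))
```

Raw click counts are probably too aggressive as boosts once the numbers grow, so scaling them down (for example with a logarithm) before building the string is worth considering.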
I'm designing a Lucene search index that includes ranked tags for each document.
Example:
Document 1
tag: java , rank 1.2
tag: learning, rank 2.1
tag: bugs, rank 1.2
tag: architecture, rank 0.3
The tags come from an automated classification algorithm that also assigns a score.
How do I design the index so I can search for a combination of tags and get back the most relevant results? For example, a search for java+learning.
I've initially created a FIELD for each tag and used the rank to boost the field for each document. Is this a good approach in terms of performance? What if I have 10,000 possible tags? Is it a good idea to have 10,000 FIELDS in Lucene?
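    // One field per tag; renamed from "tag" so it no longer clashes with the tag object.
    Field tagField = new Field(
        FIELD_TAG + tag.getId(),
        "y",
        Field.Store.NO,
        Field.Index.NOT_ANALYZED);
    // Use the tag's rank as the boost for this field.
    tagField.setBoost(tag.getRank());
    luceneDoc.add(tagField);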
If I instead add all the tags to the same field, how can I take into account the rank?
I had this problem in my search too... Tell me if I'm wrong...
Ideally you would have one field like "Tags" containing the value "java learning bugs architecture", analyzed with a WhitespaceTokenizer:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WhitespaceTokenizerFactory
But doing this you are not able to boost each word; you are only able to boost the field "Tags" as a whole.
With this setup Lucene will not give a good score when a user searches for "java bugs" or "architecture in java", but it will return all documents that contain these words.
But you can do as you said, with a lot of "Tags" fields and a boost on each one. Or you can create a new query parser (http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html) that extends edismax, for example, to make a field work the way you want.
Is that what you want?
Oh, one more thing: adding a lot of fields will make indexing slower and the index bigger (probably not good for searching either).
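To pin down the ranking behaviour being asked for, independent of how it ends up stored, here is a small pure-Python illustration. It is not Lucene code: each document simply carries a tag-to-rank mapping, and a query for several tags scores a document by the sum of the matching ranks. The function name and the extra documents are assumptions. In Lucene, per-term payloads are one mechanism that can carry such weights inside a single field, though whether that fits here is untested.

```python
def score_by_tags(documents, query_tags):
    """Rank documents that carry every query tag by the sum of those tags' ranks."""
    scored = []
    for doc_id, tags in documents.items():
        if all(tag in tags for tag in query_tags):
            scored.append((sum(tags[tag] for tag in query_tags), doc_id))
    # Highest combined rank first.
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


documents = {
    "doc1": {"java": 1.2, "learning": 2.1, "bugs": 1.2, "architecture": 0.3},
    "doc2": {"java": 0.5, "learning": 0.4},
    "doc3": {"java": 3.0},
}

print(score_by_tags(documents, ["java", "learning"]))
```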
I'm working on a project which searches through a database, then sorts the search results by relevance, according to a string the user inputs. I think my current search is fairly decent, but the comparator I wrote to sort the results by relevance is giving me funny results. I don’t know what to consider relevant. I know this is a big branch of information retrieval, but I have no idea where to start finding examples of searches which sort objects by relevance and would appreciate any feedback.
To give a little more background about my specific issue, the user will input a string in a website database, which stores objects (items in the store) with various fields, such as a minor and major classification (for example, an XBox 360 game might be stored with major=video_games and minor=xbox360 fields along with its specific name). The four main fields that I think should be considered in the search are the specific name, major, minor, and genre of the type of object, if that helps.
In case you don't want to use Lucene/Solr, you can always use distance metrics to measure the similarity between the query and the rows retrieved from the database. Once you have a score for each row, you can sort by it and treat the result as sorted by relevance.
This is roughly what happens behind the scenes in Lucene. You can use simple similarity metrics like Manhattan distance, Euclidean distance between points in n-dimensional space, etc. Look up the Lucene scoring formula for more insight.
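As a starting point, here is a hedged sketch in Python. It uses weighted term overlap, a very simple similarity measure, rather than Manhattan or Euclidean distance: each field contributes the fraction of query terms it contains, multiplied by a weight. The four fields come from the question; the weights and function names are assumptions to tune.

```python
# Assumed weights: a hit in the name matters most, then the minor classification.
FIELD_WEIGHTS = {"name": 3.0, "minor": 2.0, "major": 1.0, "genre": 1.0}


def relevance(query, item, weights=FIELD_WEIGHTS):
    """Score an item by the weighted share of query terms found in each field."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    score = 0.0
    for field, weight in weights.items():
        words = set(str(item.get(field, "")).lower().replace("_", " ").split())
        score += weight * len(terms & words) / len(terms)
    return score


def sort_by_relevance(query, items):
    """Most relevant items first."""
    return sorted(items, key=lambda item: relevance(query, item), reverse=True)


items = [
    {"name": "Halo Soundtrack", "major": "music", "minor": "cd", "genre": "soundtrack"},
    {"name": "Halo 3", "major": "video_games", "minor": "xbox360", "genre": "shooter"},
]

for item in sort_by_relevance("halo xbox360", items):
    print(item["name"], relevance("halo xbox360", item))
```

Sorting with a key function instead of a hand-written comparator also avoids the inconsistent orderings that pairwise comparisons can produce.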