Product search engines, and filtering on attributes à la eBay - how is it done? - search

Apologies in advance if this is a common question... I think I'm having trouble finding answers because I'm not sure what the problem is actually called!
The background to the problem: if you look at a service like eBay, when you make a query you can select a category to drill down in your results. Once you drill down to a leaf category, you can start using filters. So if you select televisions, you might get a variety of filters, like panel technology (OLED, LCD, CRT), screen size (22", 32", 40" etc.), and brand (Sony, Samsung, LG etc.). Each filter value shows the number of results it will produce.
Key point: as you select filters, the available filters update. So if you select Sony and OLED, the screen size filter (and the others) will update to match the results available within the constraints of the previously chosen filters.
My question is: how would you implement this kind of filter system in a search engine? Or specifically, how would you calculate the number of results available for a given combination of filters? How do you work out and update the 'filter histogram' as the user makes filter choices?
It seems like a complex problem. Does eBay precalculate the number of results for every possible combination of filters under a leaf category?
Or is there some other smarter way of handling this?
I hope my question makes sense :) Thanks for ANY help! :)
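For what it's worth, this problem is usually called faceted search (or faceted navigation). I can't say what eBay does internally, but search engines such as Solr typically count facet values on the fly over the set of documents matching the current query and filters, rather than precalculating every combination. Below is a minimal Python sketch of that idea. The catalogue, the attribute names and the helper functions are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical catalogue of televisions (illustrative data only).
PRODUCTS = [
    {"brand": "sony", "panel": "oled", "size": 40},
    {"brand": "sony", "panel": "lcd", "size": 32},
    {"brand": "samsung", "panel": "oled", "size": 40},
    {"brand": "lg", "panel": "oled", "size": 32},
    {"brand": "sony", "panel": "oled", "size": 32},
]


def build_index(products):
    """Inverted index: (attribute, value) -> set of product positions."""
    index = defaultdict(set)
    for pos, product in enumerate(products):
        for attr, value in product.items():
            index[(attr, value)].add(pos)
    return index


def facet_counts(products, index, selected):
    """Count attribute values among the products matching every selected filter."""
    matching = set(range(len(products)))
    for attr, value in selected.items():
        # Intersect with the posting set for each chosen filter value.
        matching &= index.get((attr, value), set())
    counts = defaultdict(Counter)
    for pos in matching:
        for attr, value in products[pos].items():
            counts[attr][value] += 1
    return counts


index = build_index(PRODUCTS)
histogram = facet_counts(PRODUCTS, index, {"brand": "sony", "panel": "oled"})
print(dict(histogram["size"]))
```

Each time the user picks another filter, the counts are simply recomputed over the smaller matching set, so nothing has to be stored per combination.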

Related

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a query returns only 10 listings I would want to display them all instead of filtering any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... It also has the same problems as the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and don't say anything like "this was xyz relevant!".
You'll have to decide why the documents that are included aren't relevant and exclude them based on those criteria. Then either use the review score to boost documents further up (if you want the search to appear organic / by relevance), or just exclude the irrelevant ones and sort by user score. But remember that ordering by user score in a way that feels relevant to the user is usually a harder problem than just ordering by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
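As a rough illustration of multiplying the organic score by a small rating boost, here is a Python sketch that builds request parameters for the edismax parser, whose `boost` parameter is multiplied into the main query score. The field name `rating`, the weight of 0.1 and both helper functions are assumptions for the example, not values from the question, and would need tuning and evaluation.

```python
from urllib.parse import urlencode


def rating_boost_params(user_query, rating_field="rating", weight=0.1):
    """Build edismax parameters that multiply the score by (1 + weight * rating).

    `rating_field` is a hypothetical numeric field holding the user rating.
    """
    boost = f"sum(1,product({weight},{rating_field}))"
    return {"q": user_query, "defType": "edismax", "boost": boost}


def boosted_score(organic_score, rating, weight=0.1):
    """Same arithmetic in plain Python, handy for reasoning about the effect."""
    return organic_score * (1 + weight * rating)


params = rating_boost_params("dvd player")
print(urlencode(params))
```

With these numbers a 5-star document scores 1.5 times its organic score, so a highly relevant but poorly rated document can still outrank a barely relevant one with a top rating.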
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular: different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as a module.

Spark Item Similarity Interpretation (Cross-Similarity and Similarity)

I've been using Spark Item Similarity through Mahout by following the steps in this article:
https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
I was able to clean my data, set up a local-only Spark/Hadoop node and all that.
Now, my question is more about the interpretation of the matrices. I've tried some Google queries with limited success.
I'm creating a multi-modal recommender - and one of my datasets is very similar to the Mahout example.
Example input:
Customer ActionName Product
11064612 view 241505
11086047 purchase 110915
11121878 view CERT_DL
11149030 purchase CERT_FS
11104130 view 111401
The output of Mahout is 2 sets of matrices: a similarity matrix and a cooccurrence matrix.
This is my similarity matrix (I assume Mahout uses my "filter1" purchases):
**791207-WP** 791520-WP:11.350536461453885 791520:9.547158147208393 76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283
So how would I interpret this? If someone purchased 791207-WP they could be interested in 791520-WP? (so I'd use the left part against purchases of a customer and rank products in the right part?).
The row for 791520-WP looks like this:
791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268
So, in theory, I'd recommend 76151220 to someone who bought 791520-WP, correct?
Part 2 of the question is interpreting the cross-similarity matrix. Remember my filter2 is "views".
How would I interpret this:
**790907** 76120956:14.2824428207241 791500-LXQ2:13.864741460885853 190907:10.735807818360627
I take this matrix as "someone who visited the 76120956 web page ended up purchasing 790907". So I should promote 790907 to customers who bought 76120956 and maybe even add a link between these 2 products on our site, for example.
Or is it "people who visited the webpage of 790907 ended up buying 76120956"?
My plan is not to use these as-is. I'll still use RowSimilarity and different sources to rank products - but I'm missing the basic interpretation of the outputs from mahout.
If you know of any documentation that clarifies this, that would be a great asset to have.
Thank you.
In both cases the matrix is telling you that the item-id key is similar to the listed items, with the strength given by the LLR (log-likelihood ratio) value attached to each similar item. Similar in the sense that similar users purchased the items. In the second case it is saying that similar people viewed the items and that this view also appears to have led to a purchase of the same item.
Cooccurrence works for purchases alone; cross-occurrence adds the check to make sure the view also correlated with a purchase. This allows you to use both for recommendations.
The output is generally meant to be used with a search engine: you would use a user's history of purchases and views as a two-field query against the matrices, one in each field.
There are analogous methods to find item-based recommendations.
Better yet, use something like the Universal Recommender here: actionml.com/docs/ur with PredictionIO for an end-to-end system.
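To make the two-field query idea more concrete, here is a small Python sketch that stands in for the search engine. It treats each row key as a candidate item and the listed items as indicators to match against the user's history, which is my reading of the answer above rather than anything stated in the Mahout docs. Summing LLR values is also a simplification chosen for the example (a real search engine would apply its own scoring). The rows are the ones quoted in the question, and the function names are made up.

```python
from collections import defaultdict

SIMILARITY_ROWS = [
    "791207-WP 791520-WP:11.350536461453885 791520:9.547158147208393 "
    "76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283",
    "791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268",
]
CROSS_SIMILARITY_ROWS = [
    "790907 76120956:14.2824428207241 791500-LXQ2:13.864741460885853 "
    "190907:10.735807818360627",
]


def parse_rows(lines):
    """Parse 'item other:llr other:llr ...' rows into {item: {other: llr}}."""
    matrix = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        item, pairs = parts[0], parts[1:]
        matrix[item] = {}
        for pair in pairs:
            other, llr = pair.rsplit(":", 1)
            matrix[item][other] = float(llr)
    return matrix


def recommend(similarity, cross_similarity, purchases, views, top_n=5):
    """Score each row key by the indicators that appear in the user's history."""
    scores = defaultdict(float)
    # Purchases are matched against the similarity rows, views against cross-similarity.
    for matrix, history in ((similarity, purchases), (cross_similarity, views)):
        for item, indicators in matrix.items():
            for seen in history:
                if seen in indicators:
                    scores[item] += indicators[seen]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Do not recommend what the user has already bought.
    return [item for item, _ in ranked if item not in purchases][:top_n]


similarity = parse_rows(SIMILARITY_ROWS)
cross = parse_rows(CROSS_SIMILARITY_ROWS)
print(recommend(similarity, cross, purchases={"791520-WP"}, views={"76120956"}))
```

Under these assumptions, a user who bought 791520-WP and viewed 76120956 would be shown 790907 first and 791207-WP second.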

Boost SolR results using users behavior

I would like Solr to be able to "learn" from my website users' choices. By that I mean that I know which product the user clicks after performing a search. So I collected a list of [term searched => number of clicks] for each product indexed in Solr. But I can't figure out how to have a boost that depends on the user input. Is it possible to index some key/value pairs for a document and retrieve the value with a function usable in the boost parameter?
I'm not sure I'm being clear, so I'll add a concrete example:
Let's say that when a user searches for "garden chair", Solr returns 3 products: "green garden chair", "blue chair", and "hamac for garden".
"green garden chair" ranks first, the hamac ranks last, as expected.
But then all the users searching for "garden chair" end up clicking on the hamac.
I would like to help the hamac rank first on the search "garden chair", WITHOUT altering the rank it gets on other searches. So I would like to be able to perform a key=>value based boost.
Is that possible to achieve with Solr?
I'm sure that I can't be the first one needing such user-based search result improvement.
Thanks in advance.
You could use edismax bq, if you are using edismax (or maybe bf). For this to work, you obviously need to store the info (in a db, Redis, whatever you fancy):
searched "garden chair":
    clicked "hamac for garden": 10
    clicked "green garden chair": 4
searched "green table":
    ...
And so forth. Look this up when there is a search, and if there is info available for that search, send a bq boosting what you want.
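A minimal Python sketch of that lookup could look like the following. The click counts are the ones from the example above, but the in-memory dictionary (standing in for the db/Redis store), the document ids, the `id` field name and the `build_bq` helper are all hypothetical. It also assumes the ids contain no double quotes.

```python
# Hypothetical click store: normalised search term -> {document id: clicks}.
CLICK_LOG = {
    "garden chair": {
        "hamac-for-garden": 10,
        "green-garden-chair": 4,
    },
}


def build_bq(search_term, click_log, id_field="id"):
    """Return an edismax bq string boosting clicked documents, or '' if no data."""
    clicks = click_log.get(search_term.strip().lower(), {})
    ranked = sorted(clicks.items(), key=lambda kv: kv[1], reverse=True)
    # One boosted clause per document, e.g. id:"hamac-for-garden"^10
    return " ".join(f'{id_field}:"{doc_id}"^{count}' for doc_id, count in ranked)


params = {"q": "garden chair", "defType": "edismax"}
bq = build_bq("garden chair", CLICK_LOG)
if bq:
    params["bq"] = bq
print(params)
```

Because the boost is keyed on the search term, other searches that also return the hamac are not affected. Raw click counts are used as boost factors only to keep the sketch short; in practice you would probably want to scale or cap them.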
Also, check out the QueryElevationComponent. It might serve your purpose (although it is stronger than just boosting...). There are two things to consider though:
Every time you change the click numbers you would need to modify the XML and reload, so it would be better if you could batch it nightly or something like that.
There was a recent JIRA issue to provide similar functionality through request params, with no need for XML changes or a reload, so check that out too.

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now assuming that I want to provide my end-user with the best search experience, but also the best variety of products in my catalog, how would I go about applying a type of demerit to keep the same brand from showing up in the search results more than 5 times? For the record I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.
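As a rough sketch of what such a grouped request might look like, the Python snippet below builds Solr result grouping parameters (`group`, `group.field`, `group.limit`). The field name `manufacturer` and the helper function are assumptions for illustration. Note that this returns results grouped per manufacturer, which is a different presentation from demoting repeats within one flat list.

```python
from urllib.parse import urlencode


def grouped_query_params(user_query, group_field="manufacturer", per_group=5, rows=50):
    """Build dismax parameters that return at most `per_group` docs per group.

    `group_field` is a hypothetical single-valued field holding the brand.
    """
    return {
        "q": user_query,
        "defType": "dismax",
        "group": "true",
        "group.field": group_field,
        "group.limit": str(per_group),
        # With grouping enabled, rows counts groups rather than documents.
        "rows": str(rows),
    }


print(urlencode(grouped_query_params("dvd players")))
```

The application would only add these parameters for broad queries, and leave them out for something like 'Samsung DVD Players'.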
Some ideas here, although I've not tried them myself.
You can use the Carrot2 clustering plugin for Solr to cluster search results, let's say on manufacturer, and then feed them to a custom RequestHandler that re-orders the results for diversity (cherry picking from each mfr. cluster).
However, there are downsides to this approach: you may need to fetch more results than necessary, and the search results will be synthetic.
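The cherry-picking step could be sketched in Python as a round-robin over per-manufacturer clusters. This runs on the application side on an already fetched, relevance-ordered result list, so it is a stand-in for the custom RequestHandler rather than an implementation of it. The `manufacturer` key and the `diversify` function are hypothetical.

```python
from itertools import zip_longest


def diversify(results, key="manufacturer", limit=None):
    """Interleave results so each manufacturer contributes one document per round.

    `results` is assumed to be ordered by relevance already; the order inside
    each manufacturer cluster is preserved.
    """
    clusters = {}
    for doc in results:
        clusters.setdefault(doc[key], []).append(doc)
    interleaved = [
        doc
        for round_ in zip_longest(*clusters.values())
        for doc in round_
        if doc is not None  # shorter clusters are padded with None
    ]
    return interleaved if limit is None else interleaved[:limit]


results = [
    {"id": "a1", "manufacturer": "A"},
    {"id": "a2", "manufacturer": "A"},
    {"id": "a3", "manufacturer": "A"},
    {"id": "b1", "manufacturer": "B"},
    {"id": "c1", "manufacturer": "C"},
    {"id": "b2", "manufacturer": "B"},
]
print([doc["id"] for doc in diversify(results)])
```

This also shows the downside mentioned above: to fill a page of 50 diverse results you may have to fetch considerably more than 50 documents first.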
Achieving this is a lengthy and complex process, but worth trying. Let's say the main field you are searching is a single field called title. First you'll need to make sure that all the documents containing "dvd player" in it have the same score. You can do this by neutralizing Solr scoring parameters like the field norm (set omitNorms=true) and term frequency (write a Solr plugin to neglect it); code attached.
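Implementation Details:
1) Compile the following class and put it into Solr's WEB-INF/classes:

    package mypackage;  // "package" is a reserved word, so "my.package" would not compile

    import org.apache.lucene.search.DefaultSimilarity;

    // Similarity that ignores term frequency: any match counts as 1.
    public class CustomSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
    }

2) In schema.xml (where Solr reads the similarity declaration), use this new similarity class by adding:

    <similarity class="mypackage.CustomSimilarity"/>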
All this will help make the score the same for all the documents with "dvd player" in their title. After that you can define one field of random type. Then when you query Solr you can sort first by score, then by the random field. Since the score for all the documents containing "dvd player" would be the same, the results will be ordered by the random field, giving the customer a better variety of products from your catalog.
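The sort described above might be expressed as in the Python sketch below. It assumes the schema declares a dynamic field `random_*` backed by `solr.RandomSortField` (as the stock example schemas do); the helper name and the seed handling are my own.

```python
import random


def random_tiebreak_sort(seed=None):
    """Return a Solr sort string: by score first, then by a seeded random field.

    Assumes a dynamicField named random_* of type solr.RandomSortField exists.
    """
    if seed is None:
        seed = random.randrange(1_000_000)
    # A different field name suffix gives a different random ordering.
    return f"score desc, random_{seed} asc"


params = {"q": "dvd player", "sort": random_tiebreak_sort(seed=42)}
print(params["sort"])
```

Reusing the same seed for all pages of one search keeps the ordering stable while the user paginates.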

Organizing Lots of Data in Search Results

I'm working on a pretty basic web app (not much more than CRUD stuff). However, the requirements call for a bunch of data to be displayed with each item in the search results - IDs, dates, email addresses, long descriptions... too much to fit neatly into a simple grid, and too dissimilar to make them flow together (like the natural language example from this article.)
Is there a design pattern for attractively displaying many descriptive fields with each search result?
(Please don't tell me to just remove some fields from the results; that's not an option for this project.)
Obviously there are many ways you can handle this, and to a degree it's a factor of your information design abilities and preferences.
Natural Data Groupings
What I would do is try to organize your data into a small number of "buckets." You state that the data are too dissimilar to be arranged into a sentence, but it's likely you can create a few logical groups. Since we can't see all your data, I'll guess that you have information about a person (email, name, ID?), about some sort of event (dates? type?), or maybe about some kind of object related to the person (orders? classes?). Whatever they are, some of the data will be more closely related to each other than others.
Designing in Chunks
Take each loose "bucket" and design a kind of "plate" -- a grouping just for the information in that bucket. The design problem within this constrained chunk is easier to tackle: maybe it's a little table-like layout, maybe it's something non-tabular, like the Stack Overflow user "nameplate". Maybe long textual data have their own plates, or maybe they're grouped into a single plate, but with a preview/detail click-for-more arrangement.
Using a Grid
Now that you have a small number of "plates," go back to a grid-like approach for your overall search result row design. Arrange the plates as units within the row, and be sure to keep them aligned. Following an overall grid (HTML table or otherwise) for the plates will avoid an "information soup" problem. You'll have clean columns that scan well, and a readable, natural information hierarchy. The natural language example you cite would indeed be difficult to parse if it were one of many rows displayed in a search results grid.
Consistency
Be sure to use a common "design vocabulary" when you're working on the chunks -- consistent styling of labels, consistent spacing... so when everything's displayed, despite the bulk of information, it all feels like it's part of the same family.
It's an interesting design exercise. Many comps, lots of iteration, and some brainstorming should get you where you need to be.
It probably depends on the content you're displaying. Look at the StackOverflow layout for this question. It has Votes, Title, Description, Tags, Author, etc. The content wouldn't work well in a grid for sure, nor does it flow nicely on its own.
I think it's time to get creative ;)
No one ever thinks about what this is going to look like on their screen, do they?
One thing you can do is truncate the displayed text, and then display the expanded version in a tooltip on hover, or after the user clicks on it.
For example, display only the two-letter state abbreviation but show the full state name on hover.
Or, to save even more space, only display the state abbreviation, and put the entire address in the tooltip.
For long descriptions, you can display only the first few characters, followed by an ellipsis or the word "More". Then, show the full text either on hover or on click.
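A server-side helper for the truncated version could be as small as the Python sketch below. The function name, the 40-character default and the "... More" suffix are arbitrary choices for illustration. The full text would still be sent along for the tooltip or the expanded view.

```python
def preview(text, max_chars=40, suffix="... More"):
    """Return `text` unchanged if short, else cut at a word boundary and add a suffix."""
    if len(text) <= max_chars:
        return text
    # Drop the (possibly partial) last word so the preview never ends mid-word.
    cut = text[:max_chars].rsplit(" ", 1)[0]
    return cut.rstrip() + suffix


description = "The quick brown fox jumps over the lazy dog"
print(preview(description, max_chars=20))
```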
One disadvantage of the hover approach is that you can't sort the column on that text. There's nothing for the user to click to request the sort.

Resources