Azure search solr index definition for supporting multiple markets - azure

I am building a product catalog for an e-comm website. I am having a requirement to build a azure search/solr/elastic search based index. The problem is saving the market specific attributes. The website is supporting 109 markets and there is each market specific data like ratings, price, views, wish-listed, etc. that I need to save in the index eg: Product1 will have 109 ratings(rating is different in each market)/109 prices(price might be different in each market) corresponding to 109 markets. Also I will have to use this attributes to add a boosting function so that when people are searching for this, products with higher view/ratings surfaces up. How do I design the index definition to support this? Can I achieve this by 1 index doc per product or do I have to create 1 index doc per market? Some pointers will be very helpful. I have spent couple of days on this and could not reach to a conclusion that is optimized for this use case. Thank you!
My proposed index definition:
-id
-mktUSA
--mktId
--rating
--views
--price
...
-mktCanada
--mktId
--rating
--views
--price
...
-locales
--En
--Fr
--Zh
...
...other properties
The problem with this approach is configuring a magnitude scoring functions inside scoring profile, to boost products based on the market
Say eg: If user is from Canada, only the Canada based rating/views should be considered and not the other market ratings while Cognitive search is calculating the search relevance score.
Is there any possible work around this? Elastic search has a neat solution of Function score query that can be used to configure the scoring function dynamically

From what I understand, your problem is that you want to have a single index with products that support 109 different markets. Many different properties for your Product model can then be market-specific. Your concern is that the model gets to big, or if it's a scalable design. It is. You can have 1000+ properties without a problem.
I have built a similar search solution for e-commerce for multiple markets.
For price, I specify one price per market. I have about 80 or so markets, so that's 80 prices. There is no way around it. I would probably do the same for ratings and views too. One per market.
In our application we use separate dimensions for market, language and country. A market can be Scandinavia, BeNeLux or Asia-Pacific. You need to clearly define what a market is in your case, and agree with the business which markets you have and how you handle changes. Countries can map directly to markets, but it may also differ. Finally, language is usually shared across markets/countries and you usually only have to support 20-25 languages.
Suggested data model
Id
TitleEnGb
TitleDeDe
TitleFrFr
...
PriceGb
PriceUs
PriceNo
PriceDe
...
RatingsGb
RatingsUs
RatingsNo
RatingsDe
...
DescriptionEnGb
DescriptionDeDe
DescriptionFrFr
...
I try to illustrate that the Title and Description are language-specific. The price and ratings are market-specific.
For the 20-25 language-specific properties, you have to think about what analyzers to use. You want to use language-specific analyzers, and preferably the Microsoft analyzers since they have much better linguistics support with full lemmatization and so on.
When you develop your frontend application you have to keep track of which market, country and language you then refer to the specific properties. This is the easiest way to support boosting and so on.
Per-market index is not recommended
You could create one index per market. I have gone down this route before. I would not recommend this. This means you have to update 109 indexes every time you add, change or delete an item. And Azure Search supports 50 indexes per service at the most anyways.

Related

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

Spark Item Similarity Interpretation (Cross-Similarity and Similarity)

I've been using Spark Item Similarity through mahout by following the steps in this article:
https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
I was able to clean my data, setup a local-only spark/hadoop node and all that.
Now, my question relies more in the interpretation of the matrices. I've tried some Google queries with limited success.
I'm creating a multi-modal recommender - and one of my datasets is very similar to the Mahout example.
Example input:
Customer ActionName Product
11064612 view 241505
11086047 purchase 110915
11121878 view CERT_DL
11149030 purchase CERT_FS
11104130 view 111401
The output of mahout is 2 sets of matrices. A similarity matrix and a coocurrence matrix.
This is my similarity matrix (I assume mahout uses my "filter1" purchases)
**791207-WP** 791520-WP:11.350536461453885 791520:9.547158147208393 76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283
So how would I interpret this? If someone purchased 791207-WP they could be interested in 791520-WP? (so I'd use the left part against purchases of a customer and rank products in the right part?).
The row for 791520-WP looks like this:
791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268
So, in theory, I'd recommend 76151220 to someone who bought 791520-WP, correct?
Part 2 of the question is interpreting the cross-similarity matrix. Remember my filter2 is "views".
How would I interpret this:
**790907** 76120956:14.2824428207241 791500-LXQ2:13.864741460885853 190907:10.735807818360627
I take this matrix as "someone who visited the 76120956 web page ended up purchasing 790907". So I should promote 790907 to customers who bought 76120956 and maybe even add a link between these 2 products on our site, for example.
Or is it "people who visited the webpage of 790907 ended up buying 76120956"?
My plan is not to use these as-is. I'll still use RowSimilarity and different sources to rank products - but I'm missing the basic interpretation of the outputs from mahout.
If you know of any documentation that clarifies this, that would be a great asset to have.
Thank you.
In both cases the matrix is telling you that the item-id key is similar to the listed items by the LLR value attached to each similar item. Similar in the sense that similar users purchased the items. In the second case it is saying that similar people viewed the items and this view also appears to have led of a purchase of the same item.
Cooccurrence works for purchases alone, cross-occurrence adds the check to make sure the view also correlated with a purchase. This allows you to use both for recommendations.
The output is meant to be used with a search engine generally and you would use a user's history of purchases and views as a 2 field query against the matrices, one in each field.
There are analogous methods to find item-based recommendations.
Better yet, use something like the Universal Recommender here: actionml.com/docs/ur with PredictionIO for an end-to-end system.

Azure ML Recommendations

I want to use Azure ML to find related products using information from receipts from a store.
I got a file of reciepts:
44366,136778
79619,88975
78861,78864
53395,78129,78786,79295,79353,79406,79408,79417,85829,136712
32340,33973
31897,32905
32476,32697,33202,33344,33879,34237,34422,48175,55486,55490,55498
17800
32476,32697,33202,33344,33879,34237,34422,48175,55490,55497,55498,55503
47098
136974
85832
Each row represent one receipt and each number is a product id.
Given a product id I want to get a list of similar products, i.e. products that was bought together by other customers.
Can anyone point me in the right direction of how do to this?
This seems a good fit for their frequently bought together service (https://datamarket.azure.com/dataset/amla/mba). You may have to preprocess the dataset to get it in the required format. This service has a web UI as well: https://marketbasket.cloudapp.net/
This is a typical problem for Recommender, you can use a model called Machbox recommender to cover such a problem.
Recommender typically use Scoring about items to propose and the use some tricky calculation to predict scores for items users had not scored yet ( a score would be typically 1 user bought the item, 0 he did not)
If you need more details let me know ..(you have access to a free version of Azure ML where you can try all this)
Regards

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now assuming that I want to provide my end-user with the best experience searching, but also the best variety of products in my catalog, how would I go about providing a type of demerit to reduce the same brand from showing up in the search results more than 5 times. For the record I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.
Some ideas here, although I've not tried them myself.
You can use Carrot plugin for Solr to cluster search results lets say on manufacturer and then feed it to custom RequestHandler to re-order (cherry picking from each mfr. cluster) the result for diversity.
However, there is a downside to the approach that you may need to fetch larger than necessary and secondly the search results will be synthetic.
To achieve this is a lengthy and complex process but worth trying. Let's say the main field on which you are searching is a single field called title, first you'll need to make sure that all the documents containing "dvd player" in it have same score. This you can do by neglecting solr scoring parameteres like field norm (set omitNorms=true) & term frequency (write a solr plugin to neglect it) code attached..
Implementation Details:
1) compile the following class and put it into Solr WEB-INF/classes
package my.package;
import org.apache.lucene.search.DefaultSimilarity;
public class CustomSimilarity extends DefaultSimilarity {
public float tf(float freq) {
return freq > 0 ? 1.0f : 0.0f;
}
}
In solrconfig.xml use this new similarity class add
similarity class="my.package.CustomSimilarity"
All this will help you to make score for all the documents with "dvd player" in their title same. After that you can define one field of random type. Then when you query solr you can arrange first by score, then by the random field. Since score for all the documents containing DVD players would be same, results will get arranged by random field, giving the customer better variety of products in your catalog.

Magento geographical search and product recommendation

I'm evaluating Magento for a travel company who will need to do product searches and recommendations based on geographical distance. The company is creating custom holiday packages based on various components (eg: accommodation, tours, restaurant vouchers, etc). These components potentially have overlapping locations (ie: a particular tour might be close enough to several hotels to be considered related to each of them).
As a user builds up their custom package by adding stays at various hotels, I'd like related product recommendations to appear based on geographical location. And, if they search for tours, I'd like closer tours to be weighted toward the top of the catalogue search results.
Nice to have: the ability for the user to select how close / far they consider "close enough" to be (eg: 10km, 50km, 200km, etc).
My research indicates there isn't out of the box support for any sort of spatial queries in Magento. The best solution I could come up with was custom product attributes which list "location" where each product is manually assigned to various locations. But I think that's going to get pretty hard to manage for more than ~50 locations. Is my research correct? Is there an add-on / extension which will fulfil this scenario? Do you think overlapping 50 locations will be manageable in the backend?
Coming from a Microsoft background, my natural inclination would be to enable SQL Server 2008+ spacial functionality and do the queries in the database. Obviously, this option isn't available in the LAMP stack. Am I wrong? Does MySQL support spatial queries like WHERE productA.Location.GetDistanceFrom(productB.Location) < 50km?
Mysql supports spatial queries as well http://dev.mysql.com/doc/refman/4.1/en/spatial-extensions.html but nothing will help you or free you from entering the relations between products and it's location and you have to implement it yourself as well as extend the search based on location

Resources