Azure Cognitive Search - how to rank child objects by relevancy

Let's say I have a product catalog index like below, where I have a list of products that have an array of individual sku child objects. I want to be able to perform a search that returns the matching product documents, but also indicate the relevancy of the child sku elements (or sort them, or something).
{
"productId": "1",
"name": "Cool Shirt",
"type": "t-shirt",
"skus": [
{
"skuNumber": "1-a",
"color": "green",
"image": "..."
},
{
"skuNumber": "1-b",
"color": "red",
"image": "..."
}
]
},
{
...additional documents
}
A search for red t-shirt should return this document, but I'd like to know that the second sku (color:red) was more relevant than the first sku - maybe by having a relevancy score applied to these child objects, or having Azure sort them accordingly. The goal is to be able to present a search result to a user as a product tile that highlights the most relevant child sku - in this case by displaying this "Cool Shirt" product with the red shirt sku's image.
Real world example of this in practice:
Search https://www.amazon.com/s?k=Hanes+Unisex+T-Shirt+red and the top result is the red "sku" of the product, search https://www.amazon.com/s?k=Hanes+Unisex+T-Shirt+green and you'll see the green "sku".
Are there any techniques to accomplish this with Azure Cognitive Search?
The investigation my team has done so far has not yielded good results. We're migrating from a Solr search implementation where this is accomplished a bit differently - by indexing the individual skus and then grouping them by a parent id. Newer versions of Solr suggest this approach https://solr.apache.org/guide/6_6/collapse-and-expand-results.html. My understanding is that Azure search does not support these capabilities.
Our workaround
The most promising option we've come up with is to have two indexes. One of the products (same as above) and another of just the skus, like so:
{
"productId": "1",
"skuNumber": "1-a",
"color": "green",
"image": "..."
},
{
"productId": "1",
"skuNumber": "1-b",
"color": "red",
"image": "..."
}
We'd first perform a search to get a list of relevant products, and then follow up with an identical search against the sku index, filtered to only the skus whose parent product id appeared in the first result: red t-shirt with $filter productId eq '1' ...etc for all product ids returned by the first search. The relevancy score of this second search would then allow us to rank the child skus as I am describing. But this seems far from an ideal solution. Any other options?
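To make the follow-up query concrete, here is a minimal sketch of how its parameters could be assembled. The function name and the parameter dictionary are hypothetical, and the `$filter` key assumes the REST query-string form; only the `productId eq '...'` clauses come from the description above.

```python
def build_sku_query(search_text, product_ids):
    """Build parameters for the follow-up query against the (hypothetical) sku index."""
    # OData string literals escape a single quote by doubling it.
    clauses = [
        "productId eq '{}'".format(pid.replace("'", "''"))
        for pid in product_ids
    ]
    return {
        "search": search_text,            # same free text as the first query
        "$filter": " or ".join(clauses),  # restrict to skus of the returned products
    }


# Product ids as they might come back from the first (product) query.
params = build_sku_query("red t-shirt", ["1", "2"])
print(params["$filter"])
```

With a full page of products the `or` chain grows with the page size, which is one reason this approach feels clumsy.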
Notes
Please note:
I'm willing to restructure our index(es) in any way feasible
There will be dozens of additional fields at the sku level beyond just "color"
We don't want less/non-relevant skus to be completely filtered out; for red t-shirt we still want to display a product tile that indicates there's a green version too, for instance
Relevancy of skus would need to work for filtering and faceting, in addition to text search. E.g. red t-shirt, filter=inStock, facet=price[$5-$10] would need to surface the sku that most closely matches these criteria
We'll be using traditional paging of results (as opposed to infinite-scroll)

Showing multiple product variants in search results is a typical e-commerce requirement. We have solved this with Azure Search, without using collapsing or grouping. The search engine we migrated from supported collapsing, making it easy to boost the most relevant SKU to the top while presenting a tail of related SKUs.
See this related post: How to get only one item from each category in azure cognitive search?
I'll try to explain in more detail how to solve this use case with Azure Search. The constraints you list are great pointers. It's good to know that you still have the option to restructure your index to solve this use case.
SUGGESTED SOLUTION #1 (INFINITE SCROLL)
Store each SKU as a separate item in the index, without child items.
Tag each item with an ID for grouping
The grouping ID should be refinable
You are not limiting the grouping to color or any specific property. The grouping ID is an independent property for grouping products.
Submit your query as normal. Including any free text queries, boosting, filtering, or sorting options you want. This will work as expected. Make sure you include your grouping property as a refiner.
Then traverse your results going through the items one by one. Keep the first item for each group. Skip any subsequent items from a group you have already seen.
Now you can choose if you want to only present the head of each group. E.g. you only present the red t-shirt from your example. The grouping refiner will contain the exact SKU count for your query. You can also produce a link that filters by the item's group ID to list all variants.
This solution ensures you only show the most relevant SKU of each group. I.e. having the word red in your query effectively filters the results to the red variants.
This would also work if you had applied a filter to only show shirts in size XL. The red t-shirts unavailable in size:XL would then disappear.
If you also want black t-shirts to appear in your free text query for red t-shirts, you would need to process your items before indexing so that each one contains a description of the available variants. Use a searchable text property like "this item also comes in other variants like black, blue, green, ..."
{
"value": [
{
"id": "1",
"sku": "9001234",
"title": "Hayne's Unisex T-Shirt",
"group": "HAY2022",
"color": "green",
"variants": "available in green, black, red and blue"
},
{
"id": "2",
"sku": "9005678",
"title": "Hayne's Unisex T-Shirt",
"group": "HAY2022",
"color": "red",
"variants": "available in green, black, red and blue"
},
{
"id": "3",
"sku": "8001234",
"title": "Levi's T-Shirt",
"group": "LEV2022",
"color": "red",
"variants": "available in black and red"
}
]
}
It's worth noting that you may have to request a larger number of results than you actually present. For example, if your goal is to present 10 items on a page you may have a scenario where the first item has 20 variants. You would then only present/keep the head entry.
Therefore, you have to request a larger result set. It will have a slight impact on your performance, but we have found that is negligible for end users. We have used this solution in production for a few years now, and it works well. It resolves all the points you have mentioned.
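The traversal step described above can be sketched as follows. This is a minimal illustration, not our production code: the function name is hypothetical, the results are assumed to arrive in relevance order, and the `group` field matches the sample documents shown earlier.

```python
def collapse_by_group(results, page_size, group_field="group"):
    """Keep the first (most relevant) item of each group, up to page_size items."""
    seen = set()
    heads = []
    for item in results:  # assumed to be ordered by descending relevance
        if len(heads) >= page_size:
            break
        group_id = item[group_field]
        if group_id in seen:
            continue  # a better-ranked variant of this group was already kept
        seen.add(group_id)
        heads.append(item)
    return heads


# An over-fetched result set for the query "red t-shirt".
results = [
    {"id": "2", "group": "HAY2022", "color": "red"},
    {"id": "3", "group": "LEV2022", "color": "red"},
    {"id": "1", "group": "HAY2022", "color": "green"},
]
heads = collapse_by_group(results, page_size=10)
print([item["id"] for item in heads])
```

If fewer than `page_size` heads come back, the caller would need to request more results and continue, which is why this fits infinite scroll better than numbered pages.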
SUGGESTED SOLUTION #2
Updated with the new constraint to not use infinite scroll. Your Amazon examples for red or green t-shirts only show the corresponding colors. This would indicate that each SKU is stored as an individual item in the index, containing only information about the SKU without information about the variants.
In your case, you also want the variants not matching the original query to be included. When the end user query is 'red t-shirt', you want to show red t-shirts as the top results (if there are any matches). However, you also want to include green t-shirts, if there are any variants containing the token 'green'.
Store each SKU as a separate item in the index, without child items.
Each item should only have keywords relevant for that SKU. I.e. red t-shirts do not have a searchable token containing green if there is a green version.
Tag each item with an ID for grouping
The grouping ID should be refinable
You are not limiting the grouping to color or any specific property. The grouping ID is an independent property for grouping products.
Query: Generate a query with the free text input from the end user. Apply any filtering and boosting- or sorting rules to the query.
To present results you have two options. Both require two queries.
Present results in order. Traverse the presented results and collect the grouping ID from each result. Submit a secondary query without the end user free text, using a $filter with search.in(). E.g. search=*&$filter=search.in(groupid, 'groupA,groupC,groupX', ','). Then either append the results from the secondary query as separate tiles, or render them as variants for your existing tiles.
Submit the first query in your backend only. Then collect the group IDs from the results and submit a secondary query as an OR-query combining your original query with a filter query based on the group IDs returned by the group ID refiner, i.e. (original query) OR (group ID query). This will give you a result containing both your red t-shirts at the top AND the variants from the matching groups with other colors further down.
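As a rough sketch of the first option, the secondary query could be built from the first page of results like this. The helper name is hypothetical, the `groupid` field name follows the example filter above, and the code assumes group IDs never contain a comma.

```python
def build_variant_query(first_page, group_field="groupid"):
    """Build the secondary query that fetches all variants of the groups shown."""
    group_ids = []
    for item in first_page:
        group_id = item[group_field]
        if group_id not in group_ids:  # keep each group once, in display order
            group_ids.append(group_id)
    joined = ",".join(group_ids)
    return {
        "search": "*",  # no end user free text in the secondary query
        "$filter": f"search.in({group_field}, '{joined}', ',')",
    }


first_page = [
    {"id": "2", "groupid": "groupA"},
    {"id": "7", "groupid": "groupC"},
    {"id": "9", "groupid": "groupA"},
    {"id": "4", "groupid": "groupX"},
]
variant_query = build_variant_query(first_page)
print(variant_query["$filter"])
```

The returned variants can then be attached to their tiles by matching on the group ID.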
AZURE USER VOICE
The optimal solution would be to have collapsing support in Azure Search. You could vote for collapsing in the Azure Search user voice as mentioned in the related SO post. The Azure Search user voice entry for collapsing was moved and hasn't been updated in 7 years it seems:
https://feedback.azure.com/d365community/idea/0c5a17be-0225-ec11-b6e6-000d3a4f07b8

Dan Gøran Lunde's answer is worth careful consideration, especially if implementing an "infinite scroll" type search result. However, if one needs to implement traditional pagination, I don't find the solution satisfactory. Frankly, what this really means is Azure Cognitive Search isn't a satisfactory platform for search if one needs grouping/collapsing.
In any case, I'm stuck building a solution for this with Azure search, so I wanted to share my planned approach. This isn't production battle-tested, but it is so far working in development.
Approach
We have two different indexes. First, the product index, which contains the set of grouped skus that comprise each product, like so:
{
"productId": "1",
"name": "Cool Shirt",
"skus": [
{
"productId": "1",
"skuNumber": "1-a",
"color": "green",
"image": "...",
...all other sku data
},
{
"productId": "1",
"skuNumber": "1-b",
"color": "red",
"image": "...",
...all other sku data
}
]
}, {product2...}, {product3...}, etc
Then there's a sku index, which is a flattened list of all skus:
{
"productId": "1",
"skuNumber": "1-a",
"color": "green",
"image": "...",
...all other sku data
},
{
"productId": "1",
"skuNumber": "1-b",
"color": "red",
"image": "...",
...all other sku data
},
{
"productId": "2",
"skuNumber": "2-x"
...etc
}, etc
The Sku objects would be identical across both indexes, loaded at the same time, etc.
Performing a Search
To perform a search, a query is issued to the first index. All filters/facets/text queries are performed on the Skus collection. If any sku meets the criteria, then the entire product is returned. These are the products presented to the user, so result counts & pagination for the search index match exactly how pagination is executed in the UI.
What we don't know from this first query is which sku among each product is the most relevant. All we know is at least one sku for each product met the search criteria. So, next we perform a functionally identical search on the second (sku) index, with an added filter to only match skus with a productId from the first result. Take the result of this, and grab the top sku within each productId and we've found the most relevant sku for each product. Combine the result of the first query with this info and we've got a result of products and the primary sku within each that we want to display.
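The combining step could look roughly like the sketch below. It is illustrative only: the function name and the `primarySku` key are made up for this example, and the sku hits are assumed to be sorted by descending relevance, as a search response normally is.

```python
def attach_primary_sku(products, sku_hits):
    """Pair each product from query 1 with its best-ranked sku from query 2."""
    best_by_product = {}
    for sku in sku_hits:  # assumed sorted by descending relevance score
        # setdefault keeps only the first (highest-ranked) sku per product.
        best_by_product.setdefault(sku["productId"], sku)
    return [
        {**product, "primarySku": best_by_product.get(product["productId"])}
        for product in products
    ]


products = [{"productId": "1", "name": "Cool Shirt"}, {"productId": "2", "name": "Other"}]
sku_hits = [
    {"productId": "1", "skuNumber": "1-b", "color": "red"},
    {"productId": "1", "skuNumber": "1-a", "color": "green"},
]
merged = attach_primary_sku(products, sku_hits)
print(merged[0]["primarySku"]["skuNumber"])
```

A product with no hit in the second query gets `None`, which matches the "can't highlight the best sku" fallback discussed under Pitfalls.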
Pitfalls
Aside from having to execute two queries for each search, I see the following pitfalls:
Consistency issues between 2 different indexes. I'm confident our processes to index the data will ensure integrity between both indexes. Could Azure's infrastructure (different replica sets, for example) introduce unexpected inconsistencies? I don't have the expertise to quite understand that. Worst case, the second query would fail to identify the correct most relevant sku. All that would mean is that a product result might not be able to highlight the best matching sku. I can live with that.
Query syntax is different for each index. For the first query, everything would have to be scoped to the Sku collection level, but for the second query, everything would be top-level field queries. Thus, we'd have to ensure we generate different query parameters depending on which index is being queried.
Performance? This is laughable if we're already resigned to perform 2 queries for every search, but there's a theoretical performance hit I'd imagine when searching the first index. There, we're searching on fields within a collection (i.e. Skus/color) instead of top-level fields on the document (as would be the case in Dan's solution where you perform the queries on a single Skus index). Initial testing with our data sets indicates this has a negligible impact, so I don't personally consider this a problem for my use case.
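To illustrate the query syntax pitfall, here is a small sketch of generating the same equality filter for each index. The function and the `index_kind` values are hypothetical; it assumes the collection field is named `skus` as in the product documents above and uses the OData `any` lambda form for collection fields.

```python
def equality_filter(field, value, index_kind):
    """Render `field eq value` for either the product index or the sku index."""
    literal = value.replace("'", "''")  # OData escapes quotes by doubling them
    if index_kind == "product":
        # Product index: the field lives inside the skus collection.
        return f"skus/any(s: s/{field} eq '{literal}')"
    if index_kind == "sku":
        # Sku index: the same field is top-level.
        return f"{field} eq '{literal}'"
    raise ValueError(f"unknown index kind: {index_kind}")


print(equality_filter("color", "red", "product"))
print(equality_filter("color", "red", "sku"))
```

Keeping both renderings behind one function is one way to make sure the two queries stay functionally identical.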
I would appreciate any additional feedback if you have any concerns with this approach. For now, this seems to be the most viable solution to the problem for us.

Related

mongoDB Search without joins

I have two collections: Profiles and Employees.
Employees consists of firstName, lastName etc.
The Profiles collection, among other data, has a bunch of key-value pairs that describe the profession or level of experience, e.g. "software-engineer": true, "javascript": 3
Since you can't have joins in MongoDB I need to search each collection individually and then "join" that result. That leaves me with 2 options:
1) Have two separate search bars on the frontend so that I know which search query belongs to which collection
2) Have a single search bar and search both collections with the same query
Option two is implemented in a way that a search on a single collection either returns the desired data when the search query has a match or returns ALL data when the search query finds no match. That means searching for "John Doe" gives us John and searching for "angular" gives us all employees that work with Angular. But it also means searching for "john angular" gives us all employees that work with Angular OR are called John.
What I actually want is an AND search (like in option 1) but with a single search bar. Is there a way to implement this in MongoDB or is this only possible in a relational database?

Algolia search keywords

I want to build a smart search with Algolia. The point is to use keywords to rank the results. Let's say the user types "smartphone blue cheap good camera". This should find all blue smartphones and order them by price and camera characteristics.
The idea is to somehow map those keywords to a ranking formula.
Does anyone know if it is possible with Algolia and, if so, what is the best way to achieve the desired result?
To automatically detect and filter by facet values (like blue, good camera), you could use Query Rules, in particular Dynamic Filtering.
However, that shouldn't be necessary. If you include the color (containing for instance the blue value) and characteristics (containing for instance the good camera value) attributes in your searchableAttributes list, then the search request will return relevant results based on purely textual relevance matched in those attributes.
On the other hand, sorting strategies impact the Algolia indices at build time. Therefore, in order to change the sorting strategy based on the query (e.g. sort results by ascending price if the search query contains cheap), you will need to set up a new replica index for which results are sorted by price. On the frontend, when detecting a relevant keyword (e.g. cheap), you can decide whether to send the search queries to the primary index or to the sorted replica.
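The keyword detection itself can be very small. The sketch below only decides which index name to query; the index names `products` and `products_price_asc` and the keyword list are assumptions for illustration, and the replica would still have to be created and configured in Algolia separately.

```python
# Hypothetical keywords that should switch to the price-sorted replica.
PRICE_KEYWORDS = {"cheap"}


def choose_index(query, primary="products", price_replica="products_price_asc"):
    """Pick the replica sorted by price when the query contains a price keyword."""
    words = query.lower().split()
    if any(word in PRICE_KEYWORDS for word in words):
        return price_replica
    return primary


print(choose_index("smartphone blue cheap good camera"))
print(choose_index("smartphone blue good camera"))
```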

The implication of @search.score in Azure Search Service

I understand the reason for having a scoring profile and boosting results based on some fields, e.g. distance, rating, etc. To me, that's mostly applicable to structured documents like JSON files. The scenario I cannot make sense of is when an indexer has the search service index, say, an MS Word or PDF document in Azure Blob storage. We have two fields, "id" and "content", and I don't know how the search score would apply to them.
For example, there are two documents with different contents. I searched for a keyword, and the same keyword found in both documents resulted in two different scores for the two MS Word documents. My question is why the scores should be different when both documents contain the same keyword.
The score is determined by many factors, for example, the count of terms in each document, and the number of searchable fields in which query terms were found. In your example, the documents have different lengths, so naturally they'll have different scores. HTH.

Solr: Apply faceting when query contains particular terms

I have a database of product information indexed by name, type, manufacturer, etc. Users often submit search queries whose results would be contained neatly in one or more facets. When this situation arises, I would like for Solr to parse the query and apply the relevant facets.
For example, searching shoes should return results in the shoe category. More ambitiously, searching plaid shirt should query plaid on items in the shirt category.
Is it possible to configure Solr to do this?
Thanks in advance.
Asking Solr to do what you want is a tall order. Your best bet would be to store categories in a field that is weighted very highly. For example, if you have a category field with the value of "shoes", having a hit on that field will increase the relevance of documents on that category, thus having them show up first. Same goes for the second example.
As for faceting, your question is not clear on how you want to apply faceting.

Show specific document on top in search for specific keywords in solr

Suppose I have 1000 sellers (S1.....S1000) of apparel listed on my site. Since all the sellers are paying some amount to me, I am giving them equal weightage, and the results are shown based on relevancy.
Now, I am planning to start a premium service, where I am thinking of listing one supplier on top for each keyword in the search results. Let's say S1 has been given premium search for the keyword 'Jeans', so if a user searches 'jeans', I first want to display this supplier on top, then display the other suppliers based on relevancy. Plus, this premium service is only for one month. So, another supplier, say S2, can avail this service in the next month and so on.
Is there any plugin in which I can store which supplier should be shown for which keyword? I am even OK with making 2 queries to get the desired results.
Please suggest
I think the Query Elevation Component is your friend, you can configure which documents (and hence which suppliers) come first for any given query, see
https://wiki.apache.org/solr/QueryElevationComponent
If that's too much work, you could also add a new boolean field in your documents, indicating whether the document is to be promoted or not, and in the query, sort by this field first (so promoted documents come on top), and by score next (so most relevant documents come right after the promoted ones).
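A minimal sketch of the request parameters for this second approach is shown below. The field name `promoted` is hypothetical, and it assumes the field is a boolean that sorts true before false in descending order; `sort` and `score` are standard Solr query syntax.

```python
from urllib.parse import urlencode


def promoted_query_params(keyword):
    """Sort promoted documents first, then everything else by relevance."""
    return {
        "q": keyword,
        # `promoted` is an assumed boolean field set at index time.
        "sort": "promoted desc,score desc",
    }


solr_params = promoted_query_params("jeans")
print(urlencode(solr_params))
```

Note that a single boolean promotes a document for every query it matches, so keyword-specific promotion would need one such field per keyword or the elevation component mentioned above.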
You could maybe also use the query re-ranking component:
https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking
Using a query like this:
q=jean&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}&rqq=(brand:S1)
The top 1000 results from the query jean will be re-ranked, with a boost (of 3) added to the documents whose brand field has the value S1.
It can be useful, but in your case I think the QueryElevationComponent is the best.
Be careful, reRanking is only available since version 4.9.
