Deep nested document search in Solr

I have been indexing and searching Solr child documents without problems. We now have a requirement to search fields in deeply nested documents (grandchild, great-grandchild and so on). Indexing them in the nested structure works, but I can't get searches on the deeply nested fields to work. Below is a piece of the data I use for testing; "sub2", "sub3" and "sub4" are the nest paths. I am using Solr 8.8.2 (with SolrJ).
[
{ "id": "1_1",
"hierarchy": 1,
"X1_11_str": "10001",
"X1_12_text": "ancester one",
"sub2": [
{ "id": "1_1_2_1",
"hierarchy": 2,
"X2_21_text": "child one",
"sub3": [
{ "id": "1_1_2_1_3_1",
"hierarchy": 3,
"X3_31_text": "grand one",
"sub4": [
{"id": "1_1_2_1_3_1_4_1",
"hierarchy": 4,
"X4_41_text": "great grand one",
"X4_42_int": 1,
"X4_43_str": "red"
},
{"id": "1_1_2_1_3_1_4_2",
"hierarchy": 4,
"X4_41_text": "great grand two",
"X4_42_int": 2,
"X4_43_str": "blue"
}
]
}
]
}
]
}
]
The root document has a "sub2" child document, each sub2 document has a "sub3" child document, and each sub3 document has "sub4" child documents. I want to search on sub4 fields but get root documents back. Here are my queries:
{!parent hierarchy=1}"X4_42_int":1 AND "X4_43_str":"red"
This returns the root document with "id":"1_1", which is correct.
{!parent hierarchy=1}"X4_42_int":1 AND "X4_43_str":"blue"
This still returns the root document with "id":"1_1".
I understand this is because I ask for the root document as the result, but it is not what I expected. I would like a query that returns nothing here, since no single sub4 document matches both conditions.
Could anyone help with the right query syntax?
Thanks,
Simon

Related

I can't query over populated children attributes

I am trying to query on attributes of populated child documents using Mongoose, but it doesn't work: the query always returns an empty array.
Even hardcoding values that I know exist returns an empty array.
My schema is a business schema with a one-to-one relationship to a user schema through the createdBy attribute. The user schema has a name attribute, which is what I am trying to query on.
So if I make a query like this:
business.find({'createdBy.name': {$regex:"steve"}}).populate('createdBy')
The above never returns any documents, although without the find condition everything works fine.
Can I search by the name inside a populated child or not? All the tutorials say this should work, but it doesn't.
EDIT: an example of what a record looks like:
{
"_id": "5fddedd00e8a7e069085964f",
"status": 6,
"addInfo": "",
"descProduit": "",
"createdBy": {
"_id": "5f99b1bea9ba194dec3bd6aa",
"status": 1,
"fcmtokens": [
],
"emailVerified": 1,
"phoneVerified": 0,
"userType": "User",
"name": "steve buschemi",
"firstName": "steve",
"lastName": "buschemi",
"tel": "",
"email": "steve@buschemi.com",
"register_token": "747f1e1e8fa1ecd2f1797bb402563198",
"createdAt": "2020-10-28T18:00:30.814Z",
"updatedAt": "2020-12-18T13:52:07.430Z",
"__v": 19,
"business": "5f99b1e101bfff39a8259457",
"credit": 635
},
"createdAt": "2020-12-19T12:10:57.703Z",
"updatedAt": "2020-12-19T12:11:16.538Z",
"__v": 0,
"nid": "187"
}
It seems there is no way to filter parent documents by conditions on child documents:
From the official documentation:
In general, there is no way to make populate() filter stories based on properties of the story's author. For example, the below query won't return any results, even though author is populated.
const story = await Story.
findOne({ 'author.name': 'Ian Fleming' }).
populate('author').
exec();
story; // null
If you want to filter stories by their author's name, you should use denormalization.
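Besides denormalization, one way to filter on fields of the referenced user is to let MongoDB perform the join with an aggregation `$lookup` stage and apply the match afterwards. This is a different technique from `populate()`, not something the quoted documentation prescribes. A minimal sketch, assuming the users are stored in a collection named `users` and that `createdBy` holds the user's `_id` (the helper name is made up for illustration):

```javascript
// Hypothetical helper: builds an aggregation pipeline that joins each business
// with its creator and then filters on the creator's name.
// Assumptions: user documents live in a collection named "users" and
// business.createdBy stores the user's _id.
function buildBusinessesByCreatorNamePipeline(namePattern) {
  return [
    // Join the referenced user document into an array field.
    {
      $lookup: {
        from: "users",
        localField: "createdBy",
        foreignField: "_id",
        as: "createdBy",
      },
    },
    // Turn the single-element array into an embedded object (1-to-1 relationship).
    { $unwind: "$createdBy" },
    // The joined fields are now visible to the filter.
    { $match: { "createdBy.name": { $regex: namePattern, $options: "i" } } },
  ];
}

const pipeline = buildBusinessesByCreatorNamePipeline("steve");
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array could be passed to `business.aggregate(pipeline)`. Keep in mind that an aggregation returns plain objects rather than Mongoose documents.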

Azure Search Complex types faceting by filtered nested properties is based on found objects

I'm using the Azure Search complex types preview API (2017-11-11-Preview) to filter and facet on complex types. All of my filters and facets are defined on properties of a nested type (not the root type), and it looks like they are combined at the document root rather than at the right nesting level.
For example, I have the following document in the search index:
{
apartmentComplexId: "1",
apartmentTypes: [
{
bedroomCount: 1,
bathroomCount: 2
},
{
bedroomCount: 2,
bathroomCount: 3
}
]
}
apartmentTypes.bedroomCount and apartmentTypes.bathroomCount are both facetable and filterable. The facet result for this dataset is:
{
"apartmentTypes/bedroomCount": [
{
"count": 1,
"value": 1
},
{
"count": 1,
"value": 2
}
],
"apartmentTypes/bathroomCount": [
{
"count": 1,
"value": 2
},
{
"count": 1,
"value": 3
}
]
}
When I execute the following query:
$filter=apartmentTypes/any(x: x/bedroomCount eq 1)&facet=apartmentTypes/bathroomCount
the facets collection in the response contains both possible values for bathroomCount, 2 and 3, each with a count of 1.
{
"apartmentTypes/bathroomCount": [
{
"count": 1,
"value": 2
},
{
"count": 1,
"value": 3
}
]
}
As the next step, I try to use the facet data in a more specific filter:
$filter=apartmentTypes/any(x: x/bedroomCount eq 1 and x/bathroomCount eq 3)
Oops, I get an empty result set.
I understand that a more correct filter string would be something like
$filter=apartmentTypes/any(x: x/bedroomCount eq 1) and apartmentTypes/any(x: x/bathroomCount eq 3)
but I need exactly the behaviour of the correlated filter: every returned entity should contain a single item in its collection that matches all of the selected facet values.
Faceting and filtering both operate at document-scope, not at the scope of items in a complex collection (although you can write correlated filters on a complex collection, as in your first example). This is by design.
In your scenario, this is leading to a mismatch of user expectations with the system's behavior. As a user, if I'm clicking through facets describing apartments, I'm naturally going to assume that filtering is happening on apartments too, but it's actually happening on apartment complexes. This is why the empty resultset in your example is so unintuitive.
I would recommend modeling your indexes according to how users will navigate. Assuming users typically search for apartments, not apartment complexes, try making apartmentType the document type and denormalize the apartment complex information if necessary.
In the meantime, please consider creating an item on User Voice to help us prioritize adding support for correlated facets over complex collections.
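The difference between document-scope faceting and the denormalized model can be illustrated with a small in-memory simulation. This is only a sketch of the semantics described above, not Azure Search code; the `facet` helper and variable names are made up for illustration.

```javascript
// Sample index content copied from the question: one document, two apartment types.
const complexes = [
  {
    apartmentComplexId: "1",
    apartmentTypes: [
      { bedroomCount: 1, bathroomCount: 2 },
      { bedroomCount: 2, bathroomCount: 3 },
    ],
  },
];

// Count how many documents contain each value produced by valuesOf(doc).
function facet(docs, valuesOf) {
  const counts = {};
  for (const doc of docs) {
    for (const value of new Set(valuesOf(doc))) {
      counts[value] = (counts[value] || 0) + 1;
    }
  }
  return counts;
}

// Document scope: the filter selects whole complexes, and the facet then
// counts every bathroomCount found anywhere in those complexes.
const matchingComplexes = complexes.filter((c) =>
  c.apartmentTypes.some((t) => t.bedroomCount === 1)
);
const complexFacets = facet(matchingComplexes, (c) =>
  c.apartmentTypes.map((t) => t.bathroomCount)
);

// Denormalized model: one document per apartment type, so filter and facet
// operate on the same unit.
const apartments = complexes.flatMap((c) =>
  c.apartmentTypes.map((t) => ({ apartmentComplexId: c.apartmentComplexId, ...t }))
);
const matchingApartments = apartments.filter((a) => a.bedroomCount === 1);
const apartmentFacets = facet(matchingApartments, (a) => [a.bathroomCount]);

console.log(complexFacets);   // bathroomCount values 2 and 3, each counted once
console.log(apartmentFacets); // only bathroomCount value 2
```

With one document per apartment type, the facet value 3 is never offered for one-bedroom apartments, so the follow-up filter cannot end up empty.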

Training a model with LUIS using Phrase List Features with overlapping words

I have a phrase, for example ABC SSS, that needs to be recognised as one entity. At the same time, ABC SSS precedes many other words, and each such combination needs to be recognised as one entity that is not interchangeable with the others, for example ABC SSS word. How can I train LUIS to do this? I tried ABC SSS as a phrase list feature, but then LUIS doesn't recognise ABC SSS word as an entity. Currently I have marked ABC SSS as one phrase list feature and word as a separate one, which is not ideal. Thanks for your help.
You'll want to create composite entities, not use phrase lists for this.
On the entities creation page in LUIS, I created three simple entities and one composite entity that contains the other three.
Here are some snippets from a response I got from LUIS on a query. This first bit indicates the actual query and matched intent.
"query": "order large pepperoni pizza",
"topScoringIntent": {
"intent": "OrderPizza",
"score": 0.9999995
},
Under the entities list you'll find your simple and composite entities together, like the following.
{
"entity": "large",
"type": "PizzaSize",
"startIndex": 6,
"endIndex": 10,
"score": 0.9186653
},
{
"entity": "large",
"type": "Pizza", // This is the composite entity!
"startIndex": 6,
"endIndex": 10,
"score": 0.940835536
}
And here is the list for composite entities:
"compositeEntities": [
{
"parentType": "Pizza",
"value": "large",
"children": [
{
"type": "PizzaSize",
"value": "large"
}
]
},
{
"parentType": "Pizza",
"value": "pepperoni",
"children": [
{
"type": "PizzaTopping",
"value": "pepperoni"
}
]
},
{
"parentType": "Pizza",
"value": "pizza",
"children": []
}
]
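On the client side, the `compositeEntities` list can be read with a few lines of code. The following is a minimal sketch that works on a response shaped like the snippets above; the helper name is hypothetical, and flattening all children of one parent type into a single object is a design choice made for this example.

```javascript
// Response fragment shaped like the snippets above (only the fields used here).
const luisResponse = {
  query: "order large pepperoni pizza",
  compositeEntities: [
    { parentType: "Pizza", value: "large", children: [{ type: "PizzaSize", value: "large" }] },
    { parentType: "Pizza", value: "pepperoni", children: [{ type: "PizzaTopping", value: "pepperoni" }] },
    { parentType: "Pizza", value: "pizza", children: [] },
  ],
};

// Hypothetical helper: collects the children of every composite entity with
// the given parent type into one object keyed by child entity type.
function collectChildren(response, parentType) {
  const result = {};
  for (const composite of response.compositeEntities || []) {
    if (composite.parentType !== parentType) continue;
    for (const child of composite.children) {
      result[child.type] = child.value;
    }
  }
  return result;
}

const pizza = collectChildren(luisResponse, "Pizza");
console.log(pizza);
```

If the same child type can occur more than once in an utterance, collecting the values into arrays instead of overwriting them would be the safer variant.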
Composite Entities are ideal for this case:
Set "ABC SSS" as entity 1, then tag "ABC SSS" plus those other words as composite entity 2. This should be enough to capture both "ABC SSS" as entity 1 and the whole phrase as entity 2 whenever those other words appear.
You can also tag those other words as entities by themselves if you want to capture them while you are at it.

Searching parent id in cloudant

I have a Cloudant DB with the following structure:
{id: 1, resource: "john doe", manager: "john smith", amount: 13}
{id: 2, resource: "mary doe", manager: "john smith", amount: 3}
{id: 3, resource: "john smith", manager: "peter doe", amount: 10}
I needed a query to return the sum of amount, so I've built a query with emit(doc.manager, doc.amount) which returns
{"rows":[
{"key":"john smith","value":16},
{"key":"peter doe","value":10}]}
It works like a charm. However, I need the manager's ID along with the manager's name. The result I am looking for is:
{"rows":[
{"key":{"john smith",3},"value":16},
{"key":{"peter doe",null},"value":10}]}
How should I build a map view to search the parent ID?
Thanks,
Erik
Unfortunately I don't think there's a way to do exactly what you want in one query. Assuming you have the following three documents in your database:
{
"_id": "1",
"resource": "john doe",
"manager": "john smith",
"amount": 13
}
--
{
"_id": "2",
"resource": "mary doe",
"manager": "john smith",
"amount": 3
}
--
{
"_id": "3",
"resource": "john smith",
"manager": "peter doe",
"amount": 10
}
The closest thing to what you want would be the following map function (which uses a compound key) and a _sum reduce:
function(doc) {
emit([doc.manager, doc._id], doc.amount);
}
This would give you the following results with reduce=false:
{"total_rows":3,"offset":0,"rows":[
{"id":"1","key":["john smith","1"],"value":13},
{"id":"2","key":["john smith","2"],"value":3},
{"id":"3","key":["peter doe","3"],"value":10}
]}
With reduce=true and group_level=1, you essentially get the same results as what you already have:
{"rows":[
{"key":["john smith"],"value":16},
{"key":["peter doe"],"value":10}
]}
If you instead do reduce=true and group=true (exact grouping) then you get the following results:
{"rows":[
{"key":["john smith","1"],"value":13},
{"key":["john smith","2"],"value":3},
{"key":["peter doe","3"],"value":10}
]}
Each unique combination of the manager and _id fields is summed, which unfortunately doesn't give you what you want. To get the result you are after, I think your best bet would be to sum up the values after querying the database.
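Summing on the client could look roughly like the sketch below. It assumes the three documents have already been fetched (for example with include_docs=true) and that a manager's own ID is the _id of the document whose resource equals the manager's name; the function name is made up for illustration.

```javascript
// The three sample documents from above, as they would arrive on the client.
const docs = [
  { _id: "1", resource: "john doe", manager: "john smith", amount: 13 },
  { _id: "2", resource: "mary doe", manager: "john smith", amount: 3 },
  { _id: "3", resource: "john smith", manager: "peter doe", amount: 10 },
];

// Hypothetical helper: sums the amounts per manager and attaches the
// manager's own document ID, or null when the manager has no document.
function summarizeByManager(allDocs) {
  const idByResource = new Map(allDocs.map((d) => [d.resource, d._id]));
  const totals = new Map();
  for (const doc of allDocs) {
    totals.set(doc.manager, (totals.get(doc.manager) || 0) + doc.amount);
  }
  return Array.from(totals, ([manager, total]) => ({
    key: [manager, idByResource.has(manager) ? idByResource.get(manager) : null],
    value: total,
  }));
}

const summary = summarizeByManager(docs);
console.log(JSON.stringify(summary));
```

For this data the summary contains "john smith" with ID "3" and a total of 16, and "peter doe" with a null ID and a total of 10.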

Optimal way to model documents hierarchy in CouchDB

I'm trying to model a document hierarchy in CouchDB for a system that is conceptually similar to a blog. Each blog post belongs to at least one category, and each category can have many posts. Categories are hierarchical, meaning that if a post belongs to CatB in the hierarchy "CatA->CatB" ("CatB is in CatA"), it also belongs to CatA.
Users must be able to quickly find all posts in a category (and all its children).
Solution 1
Each document of the post type contains a "category" array representing its position in the hierarchy (see 2).
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category":["OO","Programming","C++"]
}
Solution 2
Each document of the post type contains the "category" string representing its path in the hierarchy (see 4).
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category": "OO/Programming/C++"
}
Solution 3
Each document of the post type contains its parent "category" id representing its path in the hierarchy (see 3). A hierarchical category structure is built through linked "category" document types.
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category_id": "3"
}
{
"_id": "1",
"type": "category",
"name": "OO"
}
{
"_id": "2",
"type": "category",
"name": "Programming",
"parent": "1"
}
{
"_id": "3",
"type": "category",
"name": "C++",
"parent": "2"
}
Question
What's the best way to store this kind of relationship in CouchDB? What's the most efficient solution in terms of disk space, scalability and retrieval speed?
Can such a relation be modelled to take into account localised category names?
Disclaimer
I know this question has been asked a few times already here on SO, but it seems there's no definitive answer to it nor an answer which deals with the pros and cons of each solution. Sorry for the length of the question :)
Read so far
CouchDB - The Definitive Guide
Storing Hierarchical Data in CouchDB
Retrieving Hierarchical/Nested Data From CouchDB
Using CouchDB group_level for hierarchical data
There's no right answer to this question, hence the lack of a definitive answer. It mostly depends on what kind of usage you want to optimize for.
You state that retrieval speed of documents that belong to a certain category (and its children) is most important. The first two solutions allow you to create a view that emits a blog post multiple times, once for each category in the chain from the leaf to the root. Selecting all documents in a category can then be done with a single (and thus fast) query. The only difference between the second solution and the first is that the parsing of the category path into components moves from the code that inserts the document to the map function of the view. I would prefer the first solution, as its map function is simpler to implement and it is a bit more flexible (e.g. it allows a category's name to contain a slash character).
In your scenario you probably also want to create a reduced view that counts the number of blog posts in each category. This is very simple with either of these solutions. With a fitting reduce function, the number of posts in every category can be retrieved with a single request.
A downside of the first two solutions is that renaming or moving a category from one parent to another requires every document to be updated. The third solution allows that without touching the documents. But from the description of your scenario I assume that retrieval by category is very frequent and category renaming/moving is very rare.
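A map function for the first solution could look roughly like this. It is a sketch, not the only possible key layout: it emits one row per prefix of the category path, so that a post filed under "C++" is also found under "Programming" and "OO". The function name and the local `emit` stand-in are only there so the example runs outside CouchDB.

```javascript
// Local stand-in for CouchDB's emit(), only so the sketch runs outside the database.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function for Solution 1: emit one row per prefix of the category path,
// from the root category down to the leaf.
function mapPostByCategoryPath(doc) {
  if (doc.type === "post" && Array.isArray(doc.category)) {
    for (var i = 1; i <= doc.category.length; i++) {
      emit(doc.category.slice(0, i), null);
    }
  }
}

mapPostByCategoryPath({
  _id: "8e7a440862347a22f4a1b2ca7f000e83",
  type: "post",
  author: "dexter",
  title: "Hello",
  category: ["OO", "Programming", "C++"],
});

console.log(JSON.stringify(emitted.map(function (row) { return row.key; })));
```

Querying such a view with key=["OO","Programming"] would then return every post in "Programming" or below; adding include_docs=true fetches the post documents without storing them in the view.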
Solution 4
I propose a fourth solution in which blog post documents hold references to category documents but still reference all the ancestors of the post's category. This allows categories to be renamed without touching the blog posts and allows you to store additional metadata with a category (e.g. translations of the category name or a description):
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category_ids": ["3", "2", "1"]
}
{
"_id": "1",
"type": "category",
"name": "OO"
}
{
"_id": "2",
"type": "category",
"name": "Programming",
"parent": "1"
}
{
"_id": "3",
"type": "category",
"name": "C++",
"parent": "2"
}
You will still have to store each category's parent with the category, which duplicates data already held in the posts, so that the categories can be traversed (e.g. to display a category tree for navigation).
You can extend this solution or any of your solutions to allow a post to be categorized under multiple categories, or a category to have multiple parents. When a post is categorized in multiple categories, you will need to store the union of the ancestors of each category in the post's document while preserving the categories selected by the author to allow them to be displayed with the post or edited later.
Let's assume that there is an additional category named "Ajax" with ancestors "JavaScript", "Programming" and "OO". To simplify the following example, I've chosen the document IDs of the categories to equal the category names.
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category_ids": ["C++", "Ajax"],
"category_anchestor_ids": ["C++", "Programming", "OO", "Ajax", "JavaScript"]
}
To allow a category to have multiple parents, just store multiple parent IDs with a category. You will need to eliminate duplicates while finding all the ancestors of a category.
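Collecting the ancestors with duplicate elimination could be sketched as follows. The `parents` field name and the helper name are assumptions for the multi-parent variant; following the example document above, the result also includes the selected categories themselves.

```javascript
// Category documents keyed by _id; "parents" holds zero or more parent IDs.
// The field name "parents" is an assumption for the multi-parent variant.
const categoriesById = {
  "OO": { parents: [] },
  "Programming": { parents: ["OO"] },
  "C++": { parents: ["Programming"] },
  "JavaScript": { parents: ["Programming"] },
  "Ajax": { parents: ["JavaScript"] },
};

// Hypothetical helper: walks up from the selected categories and returns each
// category on the way exactly once, even when paths share ancestors.
function collectAncestorIds(categoryIds, categories) {
  const seen = new Set();
  const stack = [...categoryIds];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue; // already visited through another path
    seen.add(id);
    const category = categories[id];
    if (category) stack.push(...category.parents);
  }
  return [...seen];
}

const ancestorIds = collectAncestorIds(["C++", "Ajax"], categoriesById);
console.log(ancestorIds);
```

The set of visited IDs also protects against accidental cycles in the category graph, because no category is expanded twice.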
View for Solution 4
Suppose you want to get all the blog posts for a specific category. We will use a database with the following sample data:
{ "_id": "100", "type": "category", "name": "OO" }
{ "_id": "101", "type": "category", "name": "Programming", "parent_id": "100" }
{ "_id": "102", "type": "category", "name": "C++", "parent_id": "101" }
{ "_id": "103", "type": "category", "name": "JavaScript", "parent_id": "101" }
{ "_id": "104", "type": "category", "name": "AJAX", "parent_id": "103" }
{ "_id": "200", "type": "post", "title": "OO Post", "category_id": "100", "category_anchestor_ids": ["100"] }
{ "_id": "201", "type": "post", "title": "Programming Post", "category_id": "101", "category_anchestor_ids": ["101", "100"] }
{ "_id": "202", "type": "post", "title": "C++ Post", "category_id": "102", "category_anchestor_ids": ["102", "101", "100"] }
{ "_id": "203", "type": "post", "title": "AJAX Post", "category_id": "104", "category_anchestor_ids": ["104", "103", "101", "100"] }
In addition to that, we use a view called posts_by_category in a design document called _design/blog with the following map function:
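function (doc) {
  if (doc.type == 'post') {
    // Emit one row per ancestor category so the post is found under each of them.
    for (var i = 0; i < doc.category_anchestor_ids.length; i++) {
      emit([doc.category_anchestor_ids[i]], doc);
    }
  }
}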
Then we can get all the posts in the Programming category (which has ID "101") or one of its subcategories using a GET request to the following URL:
http://localhost:5984/so/_design/blog/_view/posts_by_category?reduce=false&key=["101"]
This will return a view result with the keys set to the category ID and the values set to the post documents. The same view can also be used to get a summary list of all categories and the number of posts in each category and its children. We add the following reduce function to the view:
function (keys, values, rereduce) {
if (rereduce) {
return sum(values)
} else {
return values.length
}
}
And then we use the following URL:
http://localhost:5984/so/_design/blog/_view/posts_by_category?group_level=1
This will return a reduced view result with the keys again set to the category ID and the values set to the number of posts in each category. In this example, the category names would have to be fetched separately, but it is possible to create a view where each row in the reduced view result already contains the category name.
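To check what the grouped view should return for the sample data, the map and reduce steps can be replayed locally. This is only an in-memory approximation of the view with group_level=1, not CouchDB itself; the function name is made up for illustration.

```javascript
// The sample posts from above (only the fields the view uses).
const posts = [
  { _id: "200", type: "post", title: "OO Post", category_anchestor_ids: ["100"] },
  { _id: "201", type: "post", title: "Programming Post", category_anchestor_ids: ["101", "100"] },
  { _id: "202", type: "post", title: "C++ Post", category_anchestor_ids: ["102", "101", "100"] },
  { _id: "203", type: "post", title: "AJAX Post", category_anchestor_ids: ["104", "103", "101", "100"] },
];

// Hypothetical helper: replays the map step (one row per ancestor category)
// and then counts the rows per key, like the reduce step with group_level=1.
function countPostsByCategory(docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  for (const doc of docs) {
    if (doc.type === "post") {
      for (const id of doc.category_anchestor_ids) {
        emit([id], doc._id);
      }
    }
  }
  const counts = {};
  for (const row of rows) {
    const categoryId = row.key[0];
    counts[categoryId] = (counts[categoryId] || 0) + 1;
  }
  return counts;
}

const counts = countPostsByCategory(posts);
console.log(counts);
```

For this data, category "100" has 4 posts, "101" has 3, and "102", "103" and "104" have 1 each. CouchDB's built-in _count reduce should produce the same numbers as the custom reduce function.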

Resources