Best practices for structuring hierarchical/classified data in mongodb - node.js

Summary:
I am building my first large-scale full-stack application (MERN stack), which tries to mimic a large clothing store. Each article of clothing has many 'tags' that represent its features: a category (top/bottom/accessory/shoes/etc.), subcategories (for example, under top there is shirt/outerwear/sweatshirt/etc.), and sub-subcategories (for example, under shirt there is blouse/t-shirt/etc.). Each article also has tags for primary colors, hemline, pockets, technical features, and the list goes on.
Main question:
How should I best organize the data in MongoDB with Mongoose schemas so that it is quickly searchable when I plan on having 50,000 or more articles? And, out of genuine curiosity, how do large clothing retailers typically design databases to be easily searchable by customers when items have so many identifying features?
Things I have tried or thought of:
The MongoDB website recommends a tree structure with child references: https://docs.mongodb.com/manual/tutorial/model-tree-structures-with-child-references/ I like this idea, but I read here: https://developer.mongodb.com/article/mongodb-schema-design-best-practices/ that when storing more than a few thousand pieces of data, using ObjectID references is no longer sufficient and could create issues because of data limits.
Further, each clothing item would fall into many different parts of the tree. For example it could be a blouse so it would be in the blouse "leaf" of the tree, and then if its blue, it would be in the blue "leaf" of the tree, and if it is sustainably sourced, it would fall into that "leaf" of the tree as well. Considering this, a tree like data structure seems not the right way to go. It would be storing the same ObjectID in many different leaves.
My other idea was to store the article information (description, price, and picture) separately from the tagging/hierarchical information. Each tagging object would then have an ObjectID reference to the item. This way I could take advantage of Mongoose's populate method if I wanted to collect that information.
I also created part of the large tree structure as a proof of concept for a design idea I had. It is only for the front end right now, but it also makes for bad lookups, because they would look like taxonomy[0].options[0].options[0].options[0].title to get to 'blouse', which, from what I learned in my classes, does not seem like a good way to keep the code readable. This is only a snippet of a long, heavily branching object. I was going to try to turn it into a Mongoose schema, but it is a lot of work and I want to make sure that I do it well.
const taxonomy = [
{
title: 'Category',
selected: false,
options: [
{
title: 'top',
selected: false,
options: [
{
title: 'Shirt',
selected: false,
options: [
{
title: 'Blouse',
selected: false,
},
{
title: 'polo',
selected: false,
},
{
title: 'button down',
selected: false,
},
],
},
{
title: 'T-Shirt',
selected: false,
},
{
title: 'Sweater',
selected: false,
},
{
title: 'Sweatshirt and hoodie',
selected: false,
},
],
},
Moving forward:
I am not looking for a perfect answer, but I am sure that someone has tackled this issue before (all big businesses that sell lots of categorized products have). If someone could just point me in the right direction, for example, give me some terms to google, some articles to read, or some videos to watch, that would be great.
Thank you for any direction you can provide.

MongoDB is a document based database. Each record in a collection is a document, and every document should be self-contained (it should contain all information that you need inside it).
The best practice would be to create one collection for each logical whole that you can think of. This is the best practice when you have documents with a lot of data, because it is scalable.
For example, you should create Collections for: Products, Subproducts, Categories, Items, Providers, Discounts...
Now, when you are creating schemas, instead of building a nested structure, you can store a reference to a document in one collection as a property of a document in another collection.
NOTE: The maximum document size is 16 megabytes.
BAD PRACTICE
Let us first see what would be the bad practice. Consider this structure:
Product = {
"name": "Product_name",
"sub_products": [{
"sub_product_name": "Subproduct_name_1",
"sub_product_description": "Description",
"items": [{
"item_name": "item_name_1",
"item_description": "Description",
"discounts": [{
"discount_name": "Discount_1",
"percentage": 25
}]
},
{
"item_name": "item_name_2",
"item_description": "Description",
"discounts": [{
"discount_name": "Discount_1",
"percentage": 25
},
{
"discount_name": "Discount_2",
"percentage": 50
}]
},
]
},
...
]
}
Here the product document has a sub_products property, which is an array of sub-products. Each sub-product has items, and each item has discounts. Because everything is nested inside one document, that document keeps growing as sub-products, items, and discounts are added, and it can eventually hit the maximum document size.
GOOD PRACTICE
Consider this structure:
Product = {
"name": "Product_name",
"sub_products": [
'sub_product_1_id',
'sub_product_2_id',
'sub_product_3_id',
'sub_product_4_id',
'sub_product_5_id',
...
]
}
Subproduct = {
"id": "sub_product_1_id",
"sub_product_name": "Subproduct_name",
"sub_product_description": "Description",
"items": [
'item_1_id',
'item_2_id',
'item_3_id',
'item_4_id',
'item_5_id',
...
]
}
Item = {
"id": "item_1_id",
"item_name": "item_name_1",
"item_description": "Description",
"discounts": [
'discount_1_id',
'discount_2_id',
'discount_3_id',
'discount_4_id',
'discount_5_id',
...
]
}
Discount = {
"id": "discount_1_id",
"discount_name": "Discount_1",
"percentage": 25
}
Now you have a collection for each logical whole, and you are just storing a reference to a document in one collection as a property of a document in another collection.
This lets you use one of Mongoose's best features, called population. If you store a reference to a document in one collection as a property of a document in another collection, Mongoose can replace the references with the actual documents when you query the database.

Related

How to make a hierarchically structured database in MongoDB?

I would like to build a hierarchically structured database in MongoDB.
Can someone explain to me how to structure this?
Something like this:
Here, all the leaf nodes will hold multiple entries, like an array. For example, all leaf nodes will have employee details.
Or, to make it clearer, can I achieve a database like this:
Additional info as requested:
Suppose I have an e-commerce website, and I wish to make one node for each type of item. Each node will individually have a list of products.
E.g. main nodes: Food, Stationery, Games
Food has a list of food items, each stored as a document.
Similarly, Stationery and Games each have many items.
Approach 1: In mongodb, you can have embedded document as following
{
parent: {
level1: [{
level2: [{
myField: "myValue1"
},
{
myField: "myValue2"
}]
},
{
level2: [{
myField: "myValue1"
},
{
myField: "myValue2"
}]
}]
}
}
Note that the maximum size of a single MongoDB document is 16 MB, so this approach works fine only as long as the children are not numerous enough to exceed that limit. The limit is fixed rather than a setting to tune, so if you expect to get anywhere near it, prefer the second approach.
Approach 2: Create different collections for each with a reference field for parent
//Collection1: parent
{
id: "1",
....
}
//Collection2: level1
{
id: "dsf",
parentId: "1",
...
}
//Collection3: level2
{
id: "bs",
level1Id: "dsf",
...
}

Conditionally update an array in mongoose [duplicate]

Currently I am working on a mobile app. Basically, people can post their photos and followers can like the photos, as on Instagram. I use MongoDB as the database. As on Instagram, there might be a lot of likes for a single photo, so using a separate indexed document for every single "like" does not seem reasonable because it would waste a lot of memory. However, I'd like a user to be able to add a like quickly. So my question is: how should I model the "like"? The data model is very similar to Instagram's, but using MongoDB.
No matter how you structure your overall document there are basically two things you need. That is basically a property for a "count" and a "list" of those who have already posted their "like" in order to ensure there are no duplicates submitted. Here's a basic structure:
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"photo": "imagename.png",
"likeCount": 0,
"likes": []
}
Whatever the case, there is a unique "_id" for your "photo post" and whatever information you want, but then the other fields as mentioned. The "likes" property here is an array, and that is going to hold the unique "_id" values from the "user" objects in your system. So every "user" has their own unique identifier somewhere, either in local storage or OpenId or something, but a unique identifier. I'll stick with ObjectId for the example.
When someone submits a "like" to a post, you want to issue the following update statement:
db.photos.update(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"likes": { "$ne": ObjectId("54bb2244a3a0f26f885be2a4") }
},
{
"$inc": { "likeCount": 1 },
"$push": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
}
)
Now the $inc operation there will increase the value of "likeCount" by the number specified, so increase by 1. The $push operation adds the unique identifier for the user to the array in the document for future reference.
The most important thing here is to keep a record of the users who voted, and to note what is happening in the "query" part of the statement. Apart from selecting the document to update by its own unique "_id", the other important condition checks the "likes" array to make sure the current voting user is not in there already.
The same is true for the reverse case or "removing" the "like":
db.photos.update(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"likes": ObjectId("54bb2244a3a0f26f885be2a4")
},
{
"$inc": { "likeCount": -1 },
"$pull": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
}
)
The main important thing here is the query conditions being used to make sure that no document is touched if all conditions are not met. So the count does not increase if the user had already voted or decrease if their vote was not actually present anymore at the time of the update.
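To make the effect of those query conditions concrete, here is a small in-memory sketch in plain JavaScript. It mimics the two updates on an ordinary object rather than calling MongoDB; `like` and `unlike` are hypothetical helper names, and the user ids are plain strings for brevity.

```javascript
// Mimics the "like" update: only touch the document if the user is not
// already in `likes` (the { "$ne": userId } condition in the query).
function like(photo, userId) {
  if (photo.likes.includes(userId)) return false; // no document matched
  photo.likes.push(userId); // $push
  photo.likeCount += 1; // $inc by 1
  return true;
}

// Mimics the "unlike" update: only touch the document if the user is
// currently in `likes`.
function unlike(photo, userId) {
  const index = photo.likes.indexOf(userId);
  if (index === -1) return false; // no document matched
  photo.likes.splice(index, 1); // $pull (ids are unique, so one removal)
  photo.likeCount -= 1; // $inc by -1
  return true;
}

const photo = { photo: 'imagename.png', likeCount: 0, likes: [] };
like(photo, 'user-a');
like(photo, 'user-a'); // ignored: already liked
unlike(photo, 'user-b'); // ignored: never liked
console.log(photo.likeCount, photo.likes);
```

Because the count and the array only ever change together, `likeCount` always equals the length of `likes`, which is what the guarded updates are meant to achieve.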
Of course it is not practical to read an array with a couple of hundred entries in a document back in any other part of your application. But MongoDB has a very standard way to handle that as well:
db.photos.find(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
},
{
"photo": 1,
"likeCount": 1,
"likes": {
"$elemMatch": { "$eq": ObjectId("54bb2244a3a0f26f885be2a4") }
}
}
)
This usage of $elemMatch in projection will only return the current user if they are present or just a blank array where they are not. This allows the rest of your application logic to be aware if the current user has already placed a vote or not.
That is the basic technique and may work for you as is, but you should be aware that embedded arrays should not be extended indefinitely, and there is also a hard 16MB limit on BSON documents. So the concept is sound, but it cannot be used on its own if you are expecting thousands of "like votes" on your content. There is a concept known as "bucketing", discussed in some detail in this example for Hybrid Schema design, that offers one solution for storing a high volume of "likes". You can combine it with the basic concepts here as a way to do this at volume.

MongoDB Relational Data Structures with array of _id's

We have been using MongoDB for some time now, and there is one thing I just can't wrap my head around. Let's say I have a collection of Users that have a Watch List or Favorite Items List like this:
usersCollection = [
{
_id: 1,
name: "Rob",
itemWatchList:[
"111111",
"222222",
"333333"
]
}
];
and a separate Collection of Items
itemsCollection = [
{
_id:"111111",
name: "Laptop",
price:1000.00
},
{
_id:"222222",
name: "Bike",
price:123.00
},
{
_id:"333333",
name: "House",
price:500000.00
}
];
Obviously we would not want to insert the whole item obj inside the itemWatchList array because the items data could change i.e. price.
Let's say we pull that user to the GUI and want to display a grid of the user's itemWatchList. We can't, because all we have is a list of IDs. Is the only option to do a second collection.find([itemWatchList]) and then, in the results callback, manipulate the user record to display the current items? The problem with that is: what if I return an array of multiple Users, each with its own itemWatchList array? That would be a callback nightmare to try to keep the results straight. I know Map Reduce or the Aggregation framework can't traverse multiple collections.
What is the best practice here and is there a better data structure that should be used to avoid this issue all together?
You have 3 different options with how to display relational data. None of them are perfect, but the one you've chosen may not be the best option for your use case.
Option 1 - Reference the IDs
This is the option you've chosen. Keep a list of Ids, generally in an array of the objects you want to reference. Later to display them, you do a second round-trip with an $in query.
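For several users at once, the second round-trip does not need one query per user. A minimal sketch, assuming the items are fetched with a single `$in` query over the combined ids: `collectItemIds` and `attachItems` are hypothetical helpers, the second user is made up for the example, and the `items` array is hard-coded in place of a real query result.

```javascript
// Gather every distinct item id referenced by any user's watch list.
function collectItemIds(users) {
  return [...new Set(users.flatMap((user) => user.itemWatchList))];
}

// Return copies of the users with each id in itemWatchList replaced by
// the matching item document. Ids without a matching item are dropped.
function attachItems(users, items) {
  const byId = new Map(items.map((item) => [item._id, item]));
  return users.map((user) => ({
    ...user,
    itemWatchList: user.itemWatchList
      .map((id) => byId.get(id))
      .filter((item) => item !== undefined),
  }));
}

const users = [
  { _id: 1, name: 'Rob', itemWatchList: ['111111', '222222'] },
  { _id: 2, name: 'Ann', itemWatchList: ['222222'] }, // hypothetical user
];

const ids = collectItemIds(users);
// The filter for the single second query would be: { _id: { $in: ids } }

// Stand-in for the result of that query.
const items = [
  { _id: '111111', name: 'Laptop', price: 1000.0 },
  { _id: '222222', name: 'Bike', price: 123.0 },
];

const display = attachItems(users, items);
console.log(display.map((user) => user.itemWatchList.map((item) => item.name)));
```

The lookup map keeps the merge step linear in the number of watch-list entries, and the original user objects are left untouched.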
Option 2 - Subdocuments
This is probably a bad solution for your situation. It means putting the entire array of documents that are stored in the items collection into your user collection as a sub-document. This is great if only one user can own an item at a time. (For example, different shipping and billing addresses.)
Option 3 - A combination
This may be the best option for you, but it'll mean changing your schema. For example, lets say that your items have 20 properties, but you really only care about the name and price for the majority of your screens. You then have a schema like this:
usersCollection = [
{
_id: 1,
name: "Rob",
itemWatchList:[
{
_id:"111111",
name: "Laptop",
price:1000.00
},
{
_id:"222222",
name: "Bike",
price:123.00
},
{
_id:"333333",
name: "House",
price:500000.00
}
]
}
];
itemsCollection = [
{
_id:"111111",
name: "Laptop",
price:1000.00,
otherAttributes: ...
},
{
_id:"222222",
name: "Bike",
price:123.00,
otherAttributes: ...
},
{
_id:"333333",
name: "House",
price:500000.00,
otherAttributes: ...
}
];
The difficulty is that you then have to keep these items in sync with each other. (This is what is meant by eventual consistency.) If you have a low-stakes application (not banking, health care etc) this isn't a big deal. You can have the two update queries happen successively, updating the users that have that item to the new price. You'll notice this sort of latency on some websites if you pay attention. Ebay for example often has different prices on the search results pages than the actual price once you open the actual page, even if you return and refresh the search results.
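One way the second of those two updates could look, sketched as plain filter and update documents. `buildPriceSync` is a hypothetical helper and the new price is made up; the result would be handed to something like `updateMany` on the users collection, which is an assumption about how the application talks to MongoDB. The positional `$` operator refers to the first array element matched by the filter.

```javascript
// Build the filter/update pair that copies a new price into every user's
// embedded watch-list entry for the given item.
function buildPriceSync(itemId, newPrice) {
  return {
    filter: { 'itemWatchList._id': itemId },
    update: { $set: { 'itemWatchList.$.price': newPrice } },
  };
}

const sync = buildPriceSync('222222', 129.0);
console.log(JSON.stringify(sync, null, 2));
// Assumed usage: db.collection('users').updateMany(sync.filter, sync.update)
```

This relies on each item appearing at most once in a given watch list, since `$` only updates the first matching element per document.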
Good luck!

How should I model my MongoDB collection for nested documents?

I'm managing a MongoDB database for a building products store. The most immediate collection is products, right?
There are quite a few products; however, they all belong to one of a set of 5-8 categories and then to one subcategory among a small set of subcategories.
For example:
-Electrical
*Wires
p1
p2
..
*Tools
p5
pn
..
*Sockets
p11
p23
..
-Plumber
*Pipes
..
*Tools
..
*PVC
..
I will use Angular at web site client side to show whole products catalog, I think about AJAX for querying the right subset of products I want.
Then, I wonder whether I should manage one only collection like:
{
MainCategory1: {
SubCategory1: {
{},{},{},{},{},{},{}
}
SubCategory2: {
{},{},{},{},{},{},{}
}
SubCategoryn: {
{},{},{},{},{},{},{}
}
},
MainCategory2: {
SubCategory1: {
{},{},{},{},{},{},{}
}
SubCategory2: {
{},{},{},{},{},{},{}
}
SubCategoryn: {
{},{},{},{},{},{},{}
}
},
MainCategoryn: {
SubCategory1: {
{},{},{},{},{},{},{}
}
SubCategory2: {
{},{},{},{},{},{},{}
}
SubCategoryn: {
{},{},{},{},{},{},{}
}
}
}
Or a single collection per each category. The number of documents might not be higher than 500. However I care about a balance for:
quick DB answer,
easy server side DB querying, and
client-side Angular code for rendering results to html.
I'm using mongodb node.js module, not Mongoose now.
What CRUD operations will I do?
Inserts of products, I'd also like to have a way to obtain autogenerated ids (maybe sequential) per each new register. However, as it might seem natural I wouldn't offer the _id to the user.
Querying the whole documents set of a subcategory. Maybe just obtaining a few attributes at first.
Querying whole or a specific subset of attributes of a document (product) in particular.
Modifying a product's attributes values.
I agree that the client side should get the result that is easiest to render. However, nesting categories into products is still a bad idea. The trade-off is that once you want to change something, for example the name of a category, it will be a disaster. And think about the possible use cases, for example:
list all categories
find all subcategories of a certain category
find all products in a certain category
You'll find it hard to do these things with your data structure.
I had the same situation in my current project, so here is what I do, for your reference.
First, categories should be in a separate collection. DON'T nest categories into each other, as it will complicate the procedure for finding all subcategories. The traditional way of finding all subcategories is to maintain an idPath property. For example, if your categories are divided into 3 levels:
{
_id: 100,
name: "level1 category",
parentId: 0, // means it's the top category
idPath: "0-100"
}
{
_id: 101,
name: "level2 category",
parentId: 100,
idPath: "0-100-101"
}
{
_id: 102,
name: "level3 category",
parentId: 101,
idPath: "0-100-101-102"
}
Note with idPath, parentId is not necessary anymore. It's for you to understand the structure easier.
Once you need to find all subcategories of category 100, simply do the query:
db.collection("category").find({idPath: /^0-100-/}).toArray().then(function(docs) {
// whatever you want to do
})
With category stored in a separate collection, in your product you'll need to reference them by _id, just like when we use RDBMS. For example:
{
... // other fields of product
categories: [100, 101, 102, ...]
}
Now if you want to find all products in a certain category:
// idPath holds the idPath of the selected category, e.g. "0-100"
db.collection("category").find({idPath: new RegExp("^" + idPath + "-")}).toArray().then(function(categories) {
var cateIds = categories.map(function(category) { return category._id; }); // collect the category ids
return db.collection("product").find({categories: { $in: cateIds }}).toArray();
}).then(function(products) {
// products are here
})
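The prefix match on idPath can be tried out without a database. A minimal sketch, assuming the three sample categories above; `descendantPattern` is a hypothetical helper, and it escapes the path before building the regular expression so that any special characters in it are matched literally.

```javascript
const categories = [
  { _id: 100, name: 'level1 category', idPath: '0-100' },
  { _id: 101, name: 'level2 category', idPath: '0-100-101' },
  { _id: 102, name: 'level3 category', idPath: '0-100-101-102' },
];

// Build a regex that matches the idPath of every descendant of `idPath`.
// The trailing "-" keeps "0-100" from matching a sibling such as "0-1000".
function descendantPattern(idPath) {
  const escaped = idPath.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped + '-');
}

const pattern = descendantPattern('0-100');
const descendantIds = categories
  .filter((category) => pattern.test(category.idPath))
  .map((category) => category._id);

console.log(descendantIds); // [ 101, 102 ]
```

Note that the pattern matches descendants only; the category itself (`0-100`) is not included, so its own `_id` has to be added separately if products filed directly under it should be found too.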
Fortunately, the category collection is usually very small, with only hundreds (or at most thousands) of records, and it doesn't change much. So you can always keep a live copy of the categories in memory, built as nested objects like:
[{
id: 100,
name: "level 1 category",
... // other fields
subcategories: [{
id: 101,
... // other fields
subcategories: [...]
}, {
id: 103,
... // other fields
subcategories: [...]
},
...]
}, {
// another top1 category
}, ...]
You may want to refresh this copy periodically (every hour in this example), so:
setInterval(function() {
// refresh your memory copy of categories.
}, 3600000); // 3,600,000 ms = 1 hour
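Building that nested in-memory copy from the flat category documents could look like the following. This is a sketch under the assumption that each category carries the `parentId` field shown earlier and that top-level categories use a `parentId` (such as 0) that matches no real category; `buildCategoryTree` is a hypothetical name.

```javascript
// Turn flat category documents into nested objects with `subcategories`.
function buildCategoryTree(flatCategories) {
  const nodes = new Map();

  // First pass: create one node per category with an empty child list.
  for (const category of flatCategories) {
    nodes.set(category._id, {
      id: category._id,
      name: category.name,
      subcategories: [],
    });
  }

  // Second pass: attach each node to its parent, or treat it as a root
  // when no category with that parentId exists.
  const roots = [];
  for (const category of flatCategories) {
    const node = nodes.get(category._id);
    const parent = nodes.get(category.parentId);
    if (parent) {
      parent.subcategories.push(node);
    } else {
      roots.push(node);
    }
  }
  return roots;
}

const tree = buildCategoryTree([
  { _id: 100, name: 'level1 category', parentId: 0 },
  { _id: 101, name: 'level2 category', parentId: 100 },
  { _id: 102, name: 'level3 category', parentId: 101 },
]);

console.log(JSON.stringify(tree, null, 2));
```

The two passes mean the input does not have to be sorted so that parents come before their children.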
That's all I get in mind right now. Hope it helps.
EDIT:
To provide an integer ID for each user, $inc and findAndModify are very useful. You may have an idSeed collection:
{
_id: ...,
seedValue: 1,
forCollection: "user"
}
When you want to get an unique ID:
db.collection("idSeed").findAndModify({forCollection: "user"}, {}, {$inc: {seedValue: 1}}, {}, function(err, doc) {
var newId = doc.seedValue;
});
findAndModify is an atomic operation provided by MongoDB: the find and the modify happen as a single step on the document, so two concurrent callers cannot receive the same seed value.
The 2nd question is covered in my answer already.
Querying subsets of properties is described in the MongoDB manual, and the Node.js API is almost the same. Read the documentation for the projection parameter.
Updating a subset of properties is supported by MongoDB's $set operator.

Update offspring in nested tree mongoDB, node.js

Is there any way to update nested documents by id or some other field?
I use "Full Tree in Single Document" and don't know beforehand how deep nesting can go. Need to Update, for example, answer with {id:'104'}. I can do that via 'dot notation', but since I don't know the level (depth) of nesting I can't predict how long my 'comment.answers.answers....answers.' can go.
Is there any way to directly find and update id:'104', or I still need to pass some kind of depth mark?
{
title:'some title',
comment:
{
id:'101',
author:'Joe',
text:'some comment',
answers:
[
{
id:'102',
author:'Joe',
text:'first answer to comment',
answers:
[
{
id:'103',
author:'Done',
text:'first answer to first answer to comment',
answers:[]
},
{
id:'104',
author:'Bob',
text:'Second answer to first answer to comment',
answers:[]
}
]
},
{
},
{
},
]
}
}
I use The Node.JS MongoDB Driver
In short there's no really good way to do a query of this sort. There are a few options:
You can create a query with a long $or statement, specifying each of the possible nested locations for the document:
{ $or: [
{ "comment.id": "103" },
{ "comment.answers.id": "103" },
{ "comment.answers.answers.id": "103" },
{ "comment.answers.answers.answers.id": "103" }
]
}
For more information about the $or operator see the docs.
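Because the depth is not known in advance, the `$or` list can be generated up to some maximum depth. A minimal sketch: `buildNestedIdFilter` and the `maxDepth` cap are assumptions for illustration, and the field names follow the sample document in the question.

```javascript
// Build a filter that matches a document containing the given id either on
// the top-level comment or on an answer nested up to `maxDepth` levels deep.
function buildNestedIdFilter(id, maxDepth) {
  const clauses = [{ 'comment.id': id }];
  let path = 'comment';
  for (let depth = 1; depth <= maxDepth; depth += 1) {
    path += '.answers';
    clauses.push({ [path + '.id']: id });
  }
  return { $or: clauses };
}

const filter = buildNestedIdFilter('104', 2);
console.log(JSON.stringify(filter));
```

A filter like this only locates the containing document; it does not say at which depth the match was found, so it does not by itself give the path needed to update that nested answer.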
In truth, the better and more sustainable solution would be to use a different schema, where all comments and answers are stored in a flat array and then store information about the relationships between the comments in a comments.parent field. For example:
{ comment: [
{ id: 1 },
{ id: 2, parent: 1 }
] }
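Moving from the nested tree to this flat layout is a short recursive walk. A sketch only: `flattenComments` is a hypothetical helper, and the input reuses the ids and texts from the question's sample document (the empty placeholder answers are left out).

```javascript
// Walk a nested comment tree and return a flat array in which every entry
// except the root records its parent's id in a `parent` field.
function flattenComments(node, parent = null, out = []) {
  const { answers = [], ...fields } = node;
  out.push(parent === null ? fields : { ...fields, parent });
  for (const answer of answers) {
    flattenComments(answer, node.id, out);
  }
  return out;
}

const nested = {
  id: '101',
  author: 'Joe',
  text: 'some comment',
  answers: [
    {
      id: '102',
      author: 'Joe',
      text: 'first answer to comment',
      answers: [
        { id: '103', author: 'Done', text: 'first answer to first answer to comment', answers: [] },
        { id: '104', author: 'Bob', text: 'Second answer to first answer to comment', answers: [] },
      ],
    },
  ],
};

const flat = flattenComments(nested);
console.log(flat.map((comment) => [comment.id, comment.parent]));
```

Once the comments are flat, finding the entry with id '104' no longer depends on how deep it sat in the original tree.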
For additional options and a more in depth discussion of possible ways of modeling comment hierarchies, view the Storing Comments Use Case in the MongoDB documentation.
There are also a number of strategies in Trees in MongoDB that you might want to consider.
I think you should store depth level of each node and then dynamically create queries.
