I'm not yet ready to let this go, which is why I re-thought the problem and edited the Q (original below).
I am using mongoDB for a weekend project and it requires some relations in the DB, which is what the misery is all about:
I have three collections:
Users
Lists
Texts
A user can have texts and lists - lists 'contain' texts. Texts can be in multiple lists.
I decided to go with separate collections (not embeds) because child documents don't always appear in context of their parent (eg. all texts, without being in a list).
So what needs to be done is to reference the texts that belong to certain lists from exactly those lists. There can be an unlimited number of lists and texts, though there will be fewer lists than texts.
In contrast to what I first thought of, I could also put the reference in every single text-document and not all text-ids in the list-documents. It would actually make a difference, because I could get away with one query to find every snippet in a list. Could even index that reference.
var TextSchema = new Schema({
_id: Number,
name: String,
inListID: { type : Array , "default" : [] },
[...]
It is also rather seldom the case that texts will be in MANY lists, so the array would not really explode. The question kind of remains though, is there a chance this scales or actually a better way of implementing it with mongoDB? Would it help to limit the amount of lists a text can be in (probably)? Is there a recipe for few:many relations?
It would even be awesome to get references to projects where this has been done and how it was implemented (few:many relations). I can't believe everybody shies away from mongo DB as soon as some relations are needed.
Original Question
I'll break it down in two problems I see so far:
1) Let's assume a list consists of 5 texts. How do I reference the texts contained in a list? Just open an array and store the text's _ids in there? Seems like those arrays might grow to the moon and back, slowing the app down? On the other hand texts need to be available without a list, so embedding is not really an option. What if I want to get all texts of a list that contains 100 texts.. sounds like two queries and an array with 100 fields :-/. So is this way of referencing the proper way to do it?
var ListSchema = new Schema({
_id: Number,
name: String,
textids: { type : Array , "default" : [] },
[...]
2) The second problem I see with this approach is cleaning up the references if a text is deleted. Its reference will still be in every list that contained the text, and I wouldn't want to iterate through all the lists to clean out those dead references. Or would I? Is there a smart way to solve this? Just making the texts hold the reference (in which list they are) just moves the problem around, so that's not an option.
I guess I'm not the first with this sort of problem but I was also unable to find a definitive answer on how to do it 'right'.
I'm also interested in general thoughts on best-practice for this sort of referencing (many-to-many?) and especially scalability/performance.
Relations are usually not a big problem, though certain operations involving relations might be. That depends largely on the problem you're trying to solve, and very strongly on the cardinality of the result set and the selectivity of the keys.
I have written a simple testbed that generates data following a typical long-tail distribution to play with. It turns out that MongoDB is usually better at relations than people believe.
After all, there are only three differences to relational databases:
Foreign key constraints: You have to manage these yourself, so there's some risk for dead links
Transaction isolation: Since there are no multi-document transactions, there's some likelihood of creating dead links even if the code is correct (in the sense that it never tries to create a dead link) but is merely interrupted at runtime. Also, it is hard to check for dead links because you could be observing a race condition
Joins: MongoDB doesn't support joins, though a manual subquery with $in does scale well up to several thousand items in the $in-clause, provided the reference values are indexed, of course
Iff you need to perform large joins, i.e. if your queries are truly relational and you need a large amount of data joined accordingly, MongoDB is probably not a good fit. However, many joins required in relational databases aren't truly relational; they are required because you had to split up your object into multiple tables, for instance because it contains a list.
An example of a 'truly' relational query could be "Find me all customers who bought products that got >4 star reviews by customers that ranked high in turnover in June". Unless you have a very specialized schema that essentially was built to support this query, you'll most likely need to find all the orders, group them by customer ids, take the top n results, use these to query ratings using $in and use another $in to find the actual customers. Still, if you can limit yourself to the top, say 10k customers of June, this is three round-trips and some fast $in queries.
That will probably be in the range of 10-30ms on typical cloud hardware as long as your queries are supported by indexes in RAM and the network isn't completely congested. In this example, things get messy if the data is too sparse, i.e. the top 10k users hardly wrote >4 star reviews, which would force you to write program logic that is smart enough to keep iterating the first step which is both complicated and slow, but if that is such an important scenario, there is probably a better suited data structure anyway.
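To make the three round-trips concrete, here is a minimal in-memory sketch of that flow. The collections, field names and sample data are all made up for illustration; against MongoDB, step 1 would be an aggregation and steps 2 and 3 would be queries using $in.

```javascript
// In-memory stand-ins for three collections; all names and fields are hypothetical.
const orders = [
  { customerId: 1, total: 500 },
  { customerId: 2, total: 120 },
  { customerId: 1, total: 300 },
  { customerId: 3, total: 50 },
];
const reviews = [
  { productId: "p1", reviewerId: 1, stars: 5 },
  { productId: "p2", reviewerId: 3, stars: 5 },
  { productId: "p3", reviewerId: 2, stars: 3 },
];
const purchases = [
  { customerId: 7, productId: "p1" },
  { customerId: 8, productId: "p2" },
  { customerId: 9, productId: "p3" },
];

// Step 1: group orders by customer and keep the top n by turnover.
function topCustomers(orders, n) {
  const turnover = new Map();
  for (const o of orders) {
    turnover.set(o.customerId, (turnover.get(o.customerId) || 0) + o.total);
  }
  return [...turnover.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([customerId]) => customerId);
}

// Step 2: products with reviews above minStars written by those customers
// (on MongoDB: a query with $in on the reviewer id).
function wellReviewedProducts(reviews, reviewerIds, minStars) {
  const ids = new Set(reviewerIds);
  const hits = reviews.filter(r => ids.has(r.reviewerId) && r.stars > minStars);
  return [...new Set(hits.map(r => r.productId))];
}

// Step 3: customers who bought any of those products (a second $in).
function buyersOf(purchases, productIds) {
  const ids = new Set(productIds);
  const hits = purchases.filter(p => ids.has(p.productId));
  return [...new Set(hits.map(p => p.customerId))];
}

const top = topCustomers(orders, 2);
const products = wellReviewedProducts(reviews, top, 4);
const buyers = buyersOf(purchases, products);
console.log(top, products, buyers);
```

The sparse-data problem mentioned above shows up here as well: if step 2 returns too few products, step 1 has to be repeated with a larger n.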
Using MongoDB with references is a gateway to performance issues. Perfect example of what not to use. This is a m:n kind of relation where m and n can scale to millions. MongoDB works well where we have 1:n(few), 1:n(many), m(few):n(many). But not in situations where you have m(many):n(many). It will obviously result in 2 queries and lot of housekeeping.
I am not sure whether this question is still relevant, but I have had a similar experience.
First of all, I want to quote what the official mongo documentation says:
Use embedded data models when: you have one-to-one or one-to-many model.
For model many-to-many use relationships with document references.
I think that is the answer, but it comes with a lot of problems:
As was mentioned, mongo doesn't provide transactions at all.
And you don't have foreign key constraints.
Even if you have references (DBRefs) between documents, you will face the problem of how to dereference these documents.
Each of these items is a huge piece of responsibility, even if you work on a weekend project. It might mean that you have to write a lot of code to provide simple behaviour in your system (for example, you can see how to implement transactions in mongo here).
I have no idea how to implement foreign key constraints, and I haven't seen anything in this direction in the mongo documentation, which is why I think it is a considerable challenge (and a risk for the project).
Lastly, a mongo reference isn't a MySQL join: you don't receive all the data from the parent collection together with the data from the child collection (like all fields from a table and all fields from the joined table in MySQL). You receive just a REFERENCE to another document in another collection, and you need to do something with this reference (dereference it).
This is easily done in Node with a callback, but only when you need just one text from one list. If you need all texts in one list, it's terrible, and if you need all texts in more than one list, it becomes a nightmare...
Perhaps my experience is not the best... but I think you should think about it...
Using arrays in MongoDB is generally not preferable, and generally not advised by experts.
Here is a solution that came to my mind :
Each document of Users is always unique. There can be Lists and Texts for individual document in Users. So therefore, Lists and Texts have a Field for USER ID, which will be the _id of Users.
Lists always have an owner in Users so they are stored as they are.
Owner of Texts can be either Users or List, so you should keep a Field of LIST ID also in it, which will be _id of Lists.
Now mind that Texts cannot have both USER ID and LIST ID, so you will have to keep a condition that there should be only ONE out of both, the other should be null so that we can easily know who is the primary owner of the Texts.
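The "only one owner" condition could be checked in application code before saving. This is just a sketch; the field names userId and listId are my assumption for the USER ID and LIST ID fields described above.

```javascript
// A text must have exactly one primary owner: either a user or a list.
// userId and listId are hypothetical field names.
function hasExactlyOneOwner(text) {
  const hasUser = text.userId != null; // true unless null or undefined
  const hasList = text.listId != null;
  return hasUser !== hasList; // exactly one of the two is set
}

console.log(hasExactlyOneOwner({ userId: 1, listId: null }));
console.log(hasExactlyOneOwner({ userId: 1, listId: 5 }));
```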
Writing an answer as I want to explain how I will proceed from here.
Taking into consideration the answers here and my own research on the topic, it might actually be fine to store those references (not really relations) in an array, trying to keep it relatively small: less than 1000 fields is very likely in my case.
Especially because I can get away with one query (which I first thought I couldn't) that doesn't even require using $in so far, I'm confident that the approach will scale. After all it's 'just a weekend-project', so if it doesn't and I end up re-writing - that's fine.
With a text-schema like this:
var textSchema = new Schema({
_id: {type: Number, required: true, index: { unique: true }},
...
inList: { type : [Number] , "default" : [], index: true }
});
I can simply get all texts in a list with this query, where inList is an indexed array containing the _ids of the lists the text belongs to.
Text.find({inList: listID}, function(err, text) {
...
});
I will still have to deal with foreign key constraints and write my own "clean-up" functions that take care of removing references if a list is removed - remove reference in every text that was in the list.
Luckily this will happen very rarely, so I'm okay with going through every text once in a while.
On the other hand I don't have to care about deleting references in a list-document if a text is removed, because I only store the reference on one side of the relation (in the text-document). Quite an important point in my opinion!
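For illustration, here is roughly what that clean-up amounts to, sketched on a plain in-memory array instead of a real collection. The sample data is made up; on MongoDB itself the same thing could be done with a single multi-document update that uses $pull on inList.

```javascript
// In-memory stand-in for the texts collection (hypothetical sample data).
const texts = [
  { _id: 1, name: "a", inList: [10, 11] },
  { _id: 2, name: "b", inList: [11] },
  { _id: 3, name: "c", inList: [] },
];

// All texts in a list: mirrors Text.find({ inList: listID }).
function textsInList(texts, listId) {
  return texts.filter(t => t.inList.includes(listId));
}

// Remove a deleted list's id from every text that referenced it.
// Returns how many texts were changed.
function removeListReferences(texts, listId) {
  let touched = 0;
  for (const t of texts) {
    const before = t.inList.length;
    t.inList = t.inList.filter(id => id !== listId);
    if (t.inList.length !== before) touched += 1;
  }
  return touched;
}
```

Deleting a text needs no such step, because no list document holds a reference to it.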
@mnemosyn: thanks for the link and for pointing out that this is indeed not a large join, or in other words: just a very simple relation. Also, some numbers on how long those complex operations take (hardware dependent, of course) are a big help.
PS: Greetings from Bielefeld.
What I found most helpful during my own research was this vid, where Alvin Richards also talks about many-to-many relations at around min. 17. This is where I got the idea of making the relation one-sided to save myself some work cleaning up the dead references.
Thanks for the help guys
I would like to ask a question about a possible solution for an e-commerce database design in terms of scalability and flexibility.
We are going to use MongoDB and Node on the backend.
I included an image for you to see what we have so far. We currently have a Products table that can be used to add a product into the system. The interesting part is that we would like to be able to add different types of products to the system with varying attributes.
For example, in the admin management page, we could select a Clothes item where we should fill out a form with fields such as Height, Length, Size ... etc. The question is how could we model this way of structure in the database design?
What we were thinking of was creating tables such as ClothesProduct and many more and respectively connect the Products table to one of these. But we could have 100 different tables for the varying product types. We would like to add a product type dynamically from the admin management. Is this possible in Mongoose? Because creating all possible fields in the Products table is not efficient and it would hit us hard for the long-term.
Database design snippet
Maybe we should just create separate tables for each unique product type and from the front-end, we would select one of them to display the correct form?
Could you please share your thoughts?
Thank you!
We've got a mongoose backend that I've been working on since its inception about 3 years ago. Here are some of my lessons:
Mongodb is noSQL: By linking all these objects by ID, it becomes very painful to find all products of "Shop A": You would have to make many queries before getting the list of products for a particular shop (shop -> brand category -> subCategory -> product). Consider nesting certain objects in other objects (e.g. subcategories inside categories, as they are semantically the same). This will save immense loading times.
Dynamically created product fields: We built a (now) big module that allows users to create their own database keys & values, and assign them to different objects. In essence, it looks something like this:
SpecialFieldModel: new Schema({
...,
key: String,
value: String,
...,
})
This way, your users can "make their own products".
Number of products: MongoDB queries can handle huge data loads, so I wouldn't worry too much about some tables being thousands of objects large. However, if you want large reports on all the data, you will need to make sure your IDs are in the right place. Then you can use the Aggregation framework to construct big queries that might have to tie together multiple collections in the db, and fetch the data in an efficient manner.
Don't reference IDs in both directions, unless you know what you're doing: Saving a reference to the category ID in subcategories and vice versa is incredibly confusing. Which field do you have to update if you want to switch subcategories? One or the other? Or both? Even with strong tests, it can be very confusing for new developers to understand "which direction the queries are running in" (if you are building a product that might have to be extended in the future). We've done both, which has led to a few problems. However, those modules that saved references to upper objects (rather than lower ones), I found to be consistently more pleasant and simple to work with.
created/updatedAt: Consider adding these fields to every single model & Schema. This will help with debugging, extensibility, and general features that you will be able to build in the future, which might otherwise be impossible. (ProductSchema.set('timestamps', true);)
Take my advice with a grain of salt, as I haven't designed most of our modules. But these are the sorts of things I consider as I continue working on our applications.
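To give an idea of how the dynamically created product fields from above could be used, here is a small sketch in plain JavaScript. It is not our actual module: the field definitions, the `required` flag and the `specialFields` property are assumptions made up for this example.

```javascript
// Hypothetical field definitions an admin might create for a "Clothes" type.
const clothesFields = [
  { key: "size", required: true },
  { key: "length", required: false },
];

// A product carries its type-specific attributes as key/value pairs,
// in the spirit of the SpecialFieldModel above.
const shirt = {
  name: "Shirt",
  specialFields: [{ key: "size", value: "M" }],
};

// Return the keys of required fields that the product is missing.
function missingRequiredFields(product, fieldDefs) {
  const present = new Set(product.specialFields.map(f => f.key));
  return fieldDefs
    .filter(d => d.required && !present.has(d.key))
    .map(d => d.key);
}

console.log(missingRequiredFields(shirt, clothesFields));
```

A new product type is then just a new set of field definitions stored as data, rather than a new collection or schema.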
I have two sets of data in the same collection in cosmos, one are 'posts' and the other are 'users', they are linked by the posts users create.
Currently my structure is as follows;
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is the fungible nature of it, code has to enforce the link and if there's a bug data will very easily be lost with no clear way to recover it.
I'm also concerned about performance, if a user has 10,000 posts that's 10,000 lookups I'll have to do to resolve all the posts..
Is this the correct method for modelling entity relationships?
As said by David, it's a long discussion, but it is a very common one, so, since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it's able to store the entire document, and then just store the document ALONG WITH all the owner data: name, surname, or all the other data you have about the user who created the document. Yes, I’m saying that you may want to evaluate not having post and user, but just posts, with the user info inside them. This may actually be very correct, as you will be sure to get the EXACT data for the user as it existed at the moment of post creation. Say for example I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
Of course you may want to also display a biography in an author page. In this case you'll have a problem. Which one you'll use? Probably the last one.
If all authors, in order to exist in your system, MUST have blog post published, that may well be enough. But maybe you want to have an author write its biography and being listed in your system, even before he writes a blog post.
In such a case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When the author updates their own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let’s assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (but NoSQL in general) databases DO NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check if there are any blog post done by him/her that will remain orphans otherwise.
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
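As a rough illustration of the orphan check mentioned above, here is an in-memory sketch using the post documents from the question (the third post is added for the example). Against Cosmos DB the filter would be a query on ownerId, which is where the index helps.

```javascript
// Sample post documents shaped like the ones in the question.
const posts = [
  { id: "id1", ownerId: 123 },
  { id: "id2", ownerId: 123 },
  { id: "id3", ownerId: 456 },
];

// Posts that would become orphans if the given user were deleted.
function postsOwnedBy(posts, ownerId) {
  return posts.filter(p => p.ownerId === ownerId);
}

// Only allow deleting a user who no longer owns any posts.
function canDeleteUser(posts, userId) {
  return postsOwnedBy(posts, userId).length === 0;
}

console.log(canDeleteUser(posts, 123));
console.log(canDeleteUser(posts, 789));
```

Note that this also answers the "10,000 lookups" worry: with ownerId on each post, all posts of a user come from one filtered query, and the postIds array on the user document is not needed for that.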
PERFORMANCE
Performance COULD be an issue, but you don't usually model in order to support performance in the first place. You model in order to make sure your model can represent and store the information you need from the real world, and then you optimize it in order to have decent performance with the database you have chosen to use. As different databases have different constraints, the model will then be adapted to deal with those constraints. This is nothing more and nothing less than the good old “logical” vs “physical” modeling discussion.
In Cosmos DB case, you should not have queries that go cross-partition as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to have clear in your mind which are the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your values will be spread across 5 partitions. Each partition will have a top limit of 2000 RU. If all your queries use just one partition, your real maximum performance is that 2000 and not 10,000 RUs.
I really hope this helps you start to figure out the answer. And I really hope this helps to foster and grow a discussion (how to model for a document database) that I think is really due and mature now.
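The arithmetic behind those numbers, as a tiny sketch. It assumes the provisioned RUs are split evenly across partitions, as described above; the function names are made up.

```javascript
// Throughput available to a single partition, assuming an even split.
function perPartitionRU(totalRU, partitionCount) {
  return totalRU / partitionCount;
}

// Upper bound on throughput when queries only ever touch
// `partitionsHit` of the available partitions.
function effectiveMaxRU(totalRU, partitionCount, partitionsHit) {
  const hit = Math.min(partitionsHit, partitionCount);
  return perPartitionRU(totalRU, partitionCount) * hit;
}

console.log(perPartitionRU(10000, 5));    // 2000
console.log(effectiveMaxRU(10000, 5, 1)); // 2000: everything lands on one partition
console.log(effectiveMaxRU(10000, 5, 5)); // 10000: load spread over all partitions
```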
So I'm writing an application in NodeJS & ExpressJS. It's my first time I'm using a noSQL database like MongoDB and I'm trying to figure out how to fix my data model.
At start for our project we have written down everything in relationship database terms but since we recently switched from Laravel to ExpressJS for our project I'm a bit stuck on what to do with all my different tables layouts.
So far I have figured out it's better to denormalize your schema, but it does have to end somewhere, right? In the end you can end up storing your whole data in one collection. Well, not entirely, but you get the point.
1. So is there a rule or standard that defines where to cut to make multiple collections?
I'm having a relation database with users (which are both a client or a store user), stores, products, purchases, categories, subcategories ..
2. Is it bad to define a relationship in a noSQL database?
Like every product has a category but I want to relate to the category by an id (parent does the job in MongoDB) but is it a bad thing? Or is this where you choose performance vs structure?
3. Is NoSQL/MongoDB meant to be used for such large databases which have many relationships (if they were made in MySQL)?
Thanks in advance
As already written, there are no rules like the second normal form for SQL.
However, there are some best practices and common pitfalls related to optimization for MongoDB which I will list here.
Overuse of embedding
The BSON limit
Contrary to popular belief, there is nothing wrong with references. Assume you have a library of books, and you want to track the rentals. You could begin with a model like this
{
// We use ISBN for its uniqueness
_id: "9783453031456",
title: "Schismatrix",
author: "Bruce Sterling",
rentals: [
{
name:"Markus Mahlberg",
start:"2015-05-05T03:22:00Z",
due:"2015-05-12T12:00:00Z"
}
]
}
While there are several problems with this model, the most important isn't obvious – there will be a limited number of rentals because of the fact that BSON documents have a size limit of 16MB.
The document migration problem
The other problem with storing rentals in an array would be that this would cause relatively frequent document migrations, which is a rather costly operation. BSON documents are never partitioned and created with some additional space allocated in advance used when they grow. This additional space is called padding. When the padding is exceeded, the document is moved to another location in the datafiles and new padding space is allocated. So frequent additions of data cause frequent document migrations.
Hence, it is best practice to prevent frequent updates increasing the size of the document and use references instead.
So for the example, we would change our single model and create a second one. First, the model for the book
{
_id: "9783453031456",
title:"Schismatrix",
author: "Bruce Sterling"
}
The second model for the rental would look like this
{
_id: new ObjectId(),
book: "9783453031456",
rentee: "Markus Mahlberg",
start: ISODate("2015-05-05T03:22:00Z"),
due: ISODate("2015-05-05T12:00:00Z"),
returned: ISODate("2015-05-05T11:59:59.999Z")
}
The same approach of course could be used for author or rentee.
The problem with over normalization
Let's look back some time. A developer would identify the entities involved into a business case, define their properties and relations, write the according entity classes, bang his head against the wall for a few hours to get the triple inner-outer-above-and-beyond JOIN working required for the use case and all lived happily ever after. So why use NoSQL in general and MongoDB in particular? Because nobody lived happily ever after. This approach scales horribly and almost exclusively the only way to scale is vertical.
But the main difference of NoSQL is that you model your data according to the questions you need to get answered.
That being said, let's look at a typical n:m relation and take the relation from authors to books as our example. In SQL, you'd have 3 tables: two for your entities (books and authors) and one for the relation (Who is the author of which book?). Of course, you could take those tables and create their equivalent collections. But, since there are no JOINs in MongoDB, you'd need three queries (one for the first entity, one for its relations and one for the related entities) to find the related documents of an entity. This wouldn't make sense, since the three table approach for n:m relations was specifically invented to overcome the strict schemas SQL databases enforce.
Since MongoDB has a flexible schema, the first question would be where to store the relation, keeping the problems arising from overuse of embedding in mind. Since an author might write quite a few books in the years coming, but the authorship of a book rarely, if at all, changes, the answer is simple: We store the authors as a reference to the authors in the books data
{
_id: "9783453526723",
title: "The Difference Engine",
authors: ["idOfBruceSterling","idOfWilliamGibson"]
}
And now we can find the authors of that book by doing two queries:
var book = db.books.findOne({title:"The Difference Engine"})
var authors = db.authors.find({_id: {$in: book.authors}})
I hope the above helps you to decide when to actually "split" your collections and to get around the most common pitfalls.
Conclusion
As to your questions, here are my answers
1. As written before: No, but keeping the technical limitations in mind should give you an idea when it could make sense.
2. It is not bad, as long as it fits your use case(s). If you have a given category and its _id, it is easy to find the related products. When loading the product, you can easily get the categories it belongs to, even efficiently so, as _id is indexed by default.
3. I have yet to find a use case which can't be done with MongoDB, though some things can get a bit more complicated with MongoDB. What you should do imho is to take the sum of your functional and non-functional requirements and check whether the advantages outweigh the disadvantages. My rule of thumb: if one of "scalability" or "high availability/automatic failover" is on your list of requirements, MongoDB is worth more than a look.
The very "first" thing to consider when choosing an "NoSQL" solution for storage over an "Relational" solution is that things "do not work in the same way" and therefore respond differently by design.
More specifically, solutions such as MongoDB are "not meant" to "emulate" the "relational join" structure that is present in many SQL and therefore "relational" backends, and that they are moreover intended to look at data "joins" in a very different way.
This arrives at your "questions" as follows:
1. There really is no set "rule", and understand that the "rules" of denormalization do not apply here for the basic reason of why NoSQL solutions exist. And that is to offer something "different" that may work well for your situation.
2. Is it bad? Is it Good? Both are subjective. Considering point "1" here, there is the basic consideration that "non-relational" or "NoSQL" databases are designed to do things "differently" than a relational system is. So therefore there is usually a "penalty" to "emulating joins" in a relational manner. Specifically for MongoDB this means "additional requests". But that does not mean you "cannot" or "should not" do that. Rather it is all about how your usage pattern works for your application.
3. Re-capping on the basic points made above, NoSQL in general is designed to solve problems that do not suit the traditional SQL and/or "relational" design pattern, and therefore replace them with something else. The "ultimate goal" here is for you to "rethink your data access patterns" and evolve your application to use a storage model that is more suited to how you access it in your application usage.
In short, there are no strict rules, and that is also part of the point in moving away from "nth-normal-form" rules. NoSQL solutions such as MongoDB allow for "nested structure" storage that typical SQL/Relational solutions do not provide in an efficient form.
Another side of this is considering that operations such as "joins" do not "scale" well over "big data" forms, therefore there exists the different way to "join" by offering concepts such as "embedded data structures", such as MongoDB does.
You would do well to read some guides on the subject of how the various NoSQL solutions approach storing and accessing data. This is ultimately what you need to decide on to determine which is best for you and your application.
At the end of the day, it should be about realising when a SQL/Relational model does not meet your needs, and then choosing something else.
I'm using CouchDB with node.js. Right now there is one node involved, and even in the remote future it's not planned to change that. While I can remove most of the cases where a short and auto-incremental-like (it can be sparse but not like random) ID is required, there remains one place where the user actually needs to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human-readable format than something like '4ab234acde242349b', as it sometimes has to be typed by hand and so on.
However, in the database it can be stored with whatever ID pleases CouchDB (using the default auto-generated UUID), but it should be possible to give it a number that can be used to identify it as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When I create a new product in node, I would run an update handler that updates said document with the new unique ID at the end. To obtain the product's ID I'd then query the array, and client-side, using indexOf, I could get the index as a short ID.
I don't know if this is feasible. From the performance point of view I can say the following: There are more queries that should do numerical ID -> uuid than uuid -> numerical ID. There will be at most 7000 new entries a year in the database. Also, there is no use case where a product can be deleted, yet I'd like not to rely on that.
Are there any other applicable ways to generate a shorter and more human-readable ID that can be associated with my document?
/EDIT
From a technical point of view: It seems to be working. I can do both conversions number <-> uuid and it seems to go well. I don't know if this works well with replication and stuff, but as there is said array I guess it should, right?
You have two choices here:
Set your human-readable id as the _id field. Basically you can just set it in the create-document calls to the DB, and it will accept it. This can be a more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients trying to create documents, but instead overwrite existing ones.
It can only contain alphanumeric or a few special characters. In my experience it is asking for trouble to have extra character types.
It cannot be longer than a theoretical string length limit (CouchDB doesn't define any, but you should). Long ids will increase the size of your views (indexes) really badly. And it might make it slower.
If these things are no problem with you, then you should go with this solution.
As you said yourself, let the _id be a UUID and store the human-readable ID in another field. To look up a document by its human-readable ID, create a view that emits the human-readable ID as the key, and then either emit the document as the value or fetch the document with the include_docs=true option. Whenever the view is queried, CouchDB updates it incrementally and returns the matching rows. This is essentially the same as maintaining a document with an array/object of IDs yourself, except that a CouchDB view gives you better performance.
This might also be slightly slower on querying and inserting. If the IDs are inserted sequentially, it's fine; if not, CouchDB will take slightly more time to insert each one at the right place. Non-sequential IDs don't work well with a huge volume of inserts hitting the DB.
Querying shouldn't take more than 10% longer than with the first option, and I think 10% is really a big number; it will most probably be less than 5%. I remember that in my CouchDB application I switched from reading by _id to reading from a view by key, and the slowdown was so small that, from the user's end, it wasn't noticeable even when making 100 queries at the same time.
This is how people query documents by fields other than the ID, for example fetching a user document by email when the user is logging in.
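As a rough sketch of the view described above, assuming the human-readable ID lives in a field called humanId (a hypothetical name, as are the sample IDs): the map function only emits the human-readable ID as the key. The emit stand-in and the rows array exist only so the function can be tried outside CouchDB.

```javascript
// Map function for the view. "humanId" is a hypothetical field name.
function mapByHumanId(doc) {
  if (doc.humanId) {
    // Key: the human-readable ID. Value: null, because the document itself
    // can be fetched by querying the view with include_docs=true.
    emit(doc.humanId, null);
  }
}

// Stand-in for CouchDB's emit(), only so the map function can be run locally.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Sample documents (made-up IDs): _id stays a UUID, humanId is the short ID.
const docs = [
  { _id: "6e1295ed-6c0b-4c5a-9b0e-0f3a2b1c4d5e", humanId: "P-1001" },
  { _id: "8a0c3f7e-2d4b-4e6f-8a1b-9c0d1e2f3a4b" }, // no humanId, so not emitted
];
docs.forEach(mapByHumanId);

console.log(rows.length); // 1
console.log(rows[0].key); // P-1001
```

In CouchDB itself the function would be stored in a design document, and the view would be queried with the human-readable ID as the key plus include_docs=true.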
If you don't know how CouchDB views work, you should read the views chapter of the book CouchDB: The Definitive Guide.
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB has a limit of 4GB per document. I remember having many such documents, and querying took really long because the view had to iterate over each array item. In the end I created one document per array item instead, and it was way faster.
In a relational database, 1-n and n-n relationships mean two or more tables.
But in MongoDB it is possible to store those things directly in one model, like this:
Article{
content: String,
uid: String,
comments:[Comment]
}
I am getting confused about how to manage those relations. For example, in the article-comments model, should I store all the comments directly in the article model and then read the entire article object out into JSON every time? But what if the comments grow really large? If there are 1,000 comments in an article object, will such a strategy make the GET process very slow every time?
I am by no means an expert on this, however I've worked through similar situations before.
From the few demos I've seen, yes, you should store all the comments directly inline. This is going to give you the best performance (unless you're expecting some ridiculous number of comments). This way you have everything in your document.
In the future, if things start going great and you do notice things getting slower, you could do a few things. You could store only the latest (insert arbitrary number) comments inline, with a reference to where the other comments are stored, and then map-reduce old comments out into a "bucket" to keep loading times quick.
However initially I'd store it in one document.
So you would have a model that looks maybe something like this:
Article{
content: String,
uid: String,
comments:[
{"comment":"hi", "user":"jack"},
{"comment":"hi", "user":"jack"},
],
"oldCommentsIdentifier":12345
}
Then only populate oldCommentsIdentifier if you did move comments out of your comments array; however, I really wouldn't do this for fewer than 1000 comments, and maybe even more. It would take a bit of testing to see what the "sweet" spot would be.
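To make the "bucket" idea a bit more concrete, here is a minimal sketch in plain JavaScript. The function name splitOldComments and the limit parameter are made up for illustration, and it assumes the comments array is ordered oldest to newest; where the old comments end up, and how oldCommentsIdentifier gets set, is left to the application.

```javascript
// Keep only the newest `limit` comments inline and return the rest so they
// can be stored elsewhere (for example in a separate "bucket" document).
// Assumes `article.comments` is ordered oldest to newest.
function splitOldComments(article, limit) {
  const overflow = Math.max(article.comments.length - limit, 0);
  return {
    article: { ...article, comments: article.comments.slice(overflow) },
    oldComments: article.comments.slice(0, overflow),
  };
}

const article = {
  content: "Hello world",
  uid: "a1",
  comments: [
    { comment: "first", user: "jack" },
    { comment: "second", user: "jill" },
    { comment: "third", user: "jack" },
  ],
};

const result = splitOldComments(article, 2);
console.log(result.article.comments.length); // 2
console.log(result.oldComments.length); // 1
```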
I think a large part of the answer depends on how many comments you are expecting. Having a document that contains an array that could grow to an arbitrarily large size is a bad idea, for a couple of reasons. First, the $push operator tends to be slow because it often increases the size of the document, forcing it to be moved. Second, there is a maximum BSON document size of 16MB, so eventually you will not be able to grow the array any more.
If you expect each article to have a large number of comments, you could create a separate "comments" collection, where each document has an "article_id" field that contains the _id of the article that it is tied to (or the uid, or some other field unique to the article). This would make retrieving all comments for a specific article easy, by querying the "comments" collection for any documents whose "article_id" field matches the article's _id. Indexing this field would make the query very fast.
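A small sketch of that layout, using in-memory arrays as a stand-in for the two collections (the sample documents are invented; the field name article_id is the one suggested above). The comments show the mongo shell calls this would roughly correspond to.

```javascript
// Articles and comments live in separate collections; each comment points
// back to its article through "article_id".
const articles = [{ _id: 1, content: "Hello world", uid: "u1" }];
const comments = [
  { _id: 10, article_id: 1, comment: "hi", user: "jack" },
  { _id: 11, article_id: 1, comment: "hello", user: "jill" },
  { _id: 12, article_id: 2, comment: "other article", user: "jack" },
];

// In-memory stand-in for the query. In the mongo shell the equivalent would be
//   db.comments.createIndex({ article_id: 1 })
//   db.comments.find({ article_id: articleId })
function commentsForArticle(allComments, articleId) {
  return allComments.filter((c) => c.article_id === articleId);
}

const found = commentsForArticle(comments, articles[0]._id);
console.log(found.map((c) => c._id)); // [ 10, 11 ]
```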
The link that limelights posted as a comment on your question is also a great reference for general tips about schema design.
But if I solve this problem by linking articles and comments with _id, won't it kind of go back to relational database design? And somehow lose the essence of being NoSQL?
Not really, NoSQL isn't all about embedding models. In fact, embedding should be considered carefully for your scenario.
It is true that the aggregation framework solves quite a few of the problems you can get from embedding objects that you need to use as documents themselves. I define subdocuments that need to be used as documents as:
Documents that need to be paged in the interface
Documents that might exist across multiple root documents
Documents that require advanced sorting within their group
Documents that, when in a group, will exceed the root document's 16 MB limit
As I said, the aggregation framework does solve this a little; however, you're still looking at performing a query that, in real time or close to it, would be much like performing the same in SQL on the same number of documents.
This effect is not always desirable.
You can achieve paging (sort of) of subdocuments with normal querying using the $slice operator, but this can suffer from pretty much the same problems as using skip() and limit() over large result sets, which again is undesirable since you cannot fix it so easily with a range query (the aggregation framework would be required again). Even with 1000 subdocuments I have seen speed problems, and so have other people.
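For illustration, a sketch of what $slice paging looks like, assuming the embedded array is called comments as in the question. commentPageProjection and applySlice are made-up helper names; applySlice only mimics locally what the projection does on the server for a non-negative skip.

```javascript
// Build a projection that returns one page of the embedded "comments" array.
// $slice: [skip, limit] skips `skip` elements and returns the next `limit`.
function commentPageProjection(page, pageSize) {
  return { comments: { $slice: [page * pageSize, pageSize] } };
}

// Local stand-in that mimics what the projection does to the array.
function applySlice(array, [skip, limit]) {
  return array.slice(skip, skip + limit);
}

const allComments = ["c0", "c1", "c2", "c3", "c4"];
const projection = commentPageProjection(1, 2);

console.log(JSON.stringify(projection)); // {"comments":{"$slice":[2,2]}}
console.log(applySlice(allComments, projection.comments.$slice)); // [ 'c2', 'c3' ]
```

The projection object would be passed as the second argument to find() in the mongo shell. Note that the server still has to walk past the skipped elements, which is the skip()/limit()-style cost mentioned above.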
So let's get back to the original question: how to manage the schema.
Now the answer, which you're not going to like, is: it all depends.
Do your comments meet the criteria above for being separated out? If so, then separating them is probably a good bet.
There is no single best way to do this. In MongoDB you should design your collections according to the application that is going to use them.
If your application needs to display comments with the article, then I would say it is better to embed these comments in the article collection. Otherwise, you will end up with several round trips to your database.
There is one scenario where embedding does not work. As far as I know, document size is limited to 16 MB in MongoDB. This is quite large actually. However, if you think your document size can exceed this limit, it is better to have a separate collection.