Implementing user ratings / favorites on CouchDB - couchdb

I'm considering using CouchDB for an upcoming site, but I'm a little confused as to how to implement a system of user ratings for the site. Basically, each item of content can be rated by a given user. What way of doing this makes the most sense in the CouchDB model? I would think the DRYest and most logical way would be to have 3 different document types: Content, Users, and a user_rating doc that looks something like this.
{
user_id: "USERID",
content_id: "CONTENTID",
rating: 6
}
Then, I'd create a view where the map was the set of all content docs and user_rating docs keyed by content doc ids, and where the reduce tallied the mean of the ratings and returned the content doc keyed by content doc id.
Is that the best way of doing this? I haven't yet found much in the way of resources on CouchDB best practices so I'm pretty unsure of all this stuff.
My Conclusion:
The accepted answer below, which is what I pretty much was going to implement does work, but beware, the docs need to be keyed by content doc id which makes advanced queries based on other document properties troublesome. I'm going back to SQL for my needs in this app.

Sounds like you've got a reasonable idea going. CouchDB is so new that I think it'll take awhile for best practices to shake out.
A map/reduce pair like this might form a reasonable starting point.
map:
function(doc) {
if(doc.type === 'rating' && doc.content_id) {
emit(doc.content_id, doc.rating);
}
}
reduce:
function(keys, values) {
return sum(values)/values.length
}
NB: That map function requires adding the proper type to your Rating model:
{
type: 'rating',
user_id: "USERID",
content_id: "CONTENTID",
rating: 6
}
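One caveat with averaging inside a reduce: CouchDB may call the reduce function again on its own partial results (a "rereduce"), and a mean of means is not generally the overall mean. A minimal sketch of a rereduce-safe variant follows, assuming the client is happy to divide the total by the count itself. The local sum() is only a stand-in for CouchDB's built-in helper so the sketch runs outside CouchDB, and mean() is a hypothetical client-side helper. The built-in _stats reduce, which reports sum and count among other values, may be a simpler option.

```javascript
// Stand-in for CouchDB's built-in sum() helper, so this runs outside CouchDB.
function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Reduce that stays correct when CouchDB re-reduces partial results.
// It returns [total, count]; the caller computes total / count.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    // values is a list of [total, count] pairs from earlier reduce calls
    return [
      sum(values.map(function (v) { return v[0]; })),
      sum(values.map(function (v) { return v[1]; }))
    ];
  }
  // values is a list of raw ratings emitted by the map function
  return [sum(values), values.length];
}

// Hypothetical client-side helper: turn a reduced [total, count] into a mean.
function mean(reduced) {
  return reduced[0] / reduced[1];
}
```

Querying the view with group=true would then return one [total, count] pair per content_id.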

Well, Damien Katz, one of the CouchDB developers, gives a description of a similar process, so you might be doing it the way that the CouchDB folks intend.

I wrote about a similar situation (although simpler than your example). I was adding article ratings to my blog and decided to use CouchDB to store the ratings themselves. I think you've got the right idea.
Here's a thought, though. Do you care who rated what, like for display somewhere or tracking? If so, carry on :)
If not then why not just update the content document's rating attribute to += 1 (and perhaps the user document's rated attribute to .push( doc._id ) if you want to prevent a user from rating content more than once).
This would greatly simplify your document handling and give better performance when 'reading' ratings to display on pages (since you'll already have the content document assumingly)... This would be at the cost of making the actual process of rating more expensive (bigger documents going to the server, etc).
Seems to me that sometimes CouchDB (and other key-value databases) are at their best when things aren't totally normalized.

Related

In CouchDB, should I use the _id for relation and _changes?

I've been reading a lot of best practices and how I should embrace the _id. To be honest, I'm getting kind of paranoid about the repercussions I might face if I don't do this when I start scaling up my application.
Currently I have about 50k documents per database. It's only been a few months with heavy usage. I expect this to grow A LOT. I do a lot of .find() Mango Queries, not much indexing; and to be honest working off a relational style document structuring.
For example:
First get Project from ID.
Then do a find query that:
grabs all type:signature where project_id: X.
grabs all type:revisions where project_id: X.
The reason for this is I try VERY hard not to update documents. A lot of these documents are created offline, so doing a write once workflow is very important for me to avoid conflicts.
I'm currently at a point of no return as scheduling is getting pretty intense. If I want to change the way I'm doing things now is the best time before it gets too crazy.
I'd love to hear your thoughts about using the _id for data structuring and what people think.
Being able to make one call with a _all_docs grab like this sounds appealing to me:
{
"include_docs": true,
"startkey": "project:{ID}",
"endkey": "project:{ID}:\ufff0"
}
An example of how ONE type of my documents are set is like so:
Main Document
{
_id: {COUCH_GENERATED_1},
type: "project",
..
.
}
Signature Document
{
_id: {COUCH_GENERATED_2},
type: "signature",
project_id: {COUCH_GENERATED_1},
created_at: {UNIX_TIMESTAMP}
}
Change to Main Document
{
_id: {COUCH_GENERATED_3},
type: "revision",
project_id: {COUCH_GENERATED_1},
created_at: {UNIX_TIMESTAMP},
data: [{..}]
}
I was wondering whether I should do something like this:
Main Document: _id: project:{kuuid_1}
Signature Document: _id: project:{kuuid_1}:signature:{kuuid_2}
Change to Main Document: _id: project:{kuuid_1}:rev:{kuuid_3}
I'm just trying to set up my database in a way that isn't going to mess with me in the future. I know problems are going to come up but I'd like not to heavily change the structure if I can avoid it.
Another reason I am thinking of this is that I watch for _changes in my databases, and being able to know what types are coming through without getting each document every time a document changes sounds appealing too.
Setting up your database structure so that it makes data retrieval easier is good practice. It seems to me you have some options:
If there is a field called project_id in the documents of interest, you can create an index on project_id which would allow you to fetch all documents pertaining to a known project_id cheaply. see CouchDB Find
Create a MapReduce index keyed on project_id, e.g. if (doc.project_id) { emit(doc.project_id)}. The index that this produces would allow you to fetch documents by known project_id with judicious use of start_key & end_key when querying the view. See Introduction to views.
As you say, packing more information into the _id field allows you to perform range queries on the _all_docs endpoint.
If you choose a key design of:
project{project_id}:signature{kuuid}
then the primary index of the database has all of a single project's documents grouped together. Putting the project_id before the ':' character is preparation for a forthcoming CouchDB feature called "partitioned databases", which groups logically related documents in their own partition, making it quicker and easier to perform queries on a single partition, in your case a project. This feature isn't ready yet, but it's likely to have a {partition_key}:{document_key} format for the _id field, so there's no harm in getting your document _ids ready for when it lands (see the CouchDB mailing list). In the meantime, a range query on _all_docs will work.
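As a rough sketch of how such composite ids might be built and used, the helpers below follow the project:{id}:{type}:{id} layout proposed in the question rather than the key design shown in this answer. The function names are hypothetical, and the range parameters simply mirror the _all_docs example from the question.

```javascript
// Build a composite _id such as "project:abc:signature:xyz".
function makeChildId(projectId, type, childId) {
  return "project:" + projectId + ":" + type + ":" + childId;
}

// Query parameters for _all_docs that select one project's documents.
function projectRange(projectId) {
  const prefix = "project:" + projectId;
  return {
    include_docs: true,
    startkey: prefix,
    endkey: prefix + ":\ufff0"
  };
}

// Recover the document type from an _id seen on the _changes feed,
// without fetching the document itself.
function typeFromId(id) {
  const parts = id.split(":");
  // "project:{id}" is the main document; longer ids carry the type in slot 2
  return parts.length >= 4 ? parts[2] : parts[0];
}
```

Because the type is part of the id, a _changes listener can route each change by calling typeFromId(change.id) before deciding whether to fetch the document.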

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in cosmos, one are 'posts' and the other are 'users', they are linked by the posts users create.
Currently my structure is as follows;
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is how fragile it is: code has to enforce the link, and if there's a bug, data can very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As said by David, it's a long discussion, but it is a very common one, so since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational database since it helps a lot to avoid data duplication which usually creates more problem than what it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it's able to store the entire document, and then just store the document ALONG WITH all the owner data: name, surname, or all the other data you have about the user who created the document. Yes, I’m saying that you may want to evaluate not having post and user, but just posts, with user info inside them. This may actually be very correct, as you will be sure to get the EXACT data for the user existing at the moment of post creation. Say for example I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies and this is just right, as they have exactly captured reality.
Of course you may want to also display a biography in an author page. In this case you'll have a problem. Which one you'll use? Probably the last one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want to have an author write their biography and be listed in your system even before they write a blog post.
In such a case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When the author updates their own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let’s assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (but NoSQL in general) databases DO NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check if there are any blog post done by him/her that will remain orphans otherwise.
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
PERFORMANCES
Performance COULD be an issue, but you don't usually model in order to support performance in the first place. You model in order to make sure your model can represent and store the information you need from the real world, and then you optimize it in order to have decent performance with the database you have chosen to use. As different databases will have different constraints, the model will then be adapted to deal with those constraints. This is nothing more and nothing less than the good old “logical” vs “physical” modeling discussion.
In Cosmos DB case, you should not have queries that go cross-partition as they are more expensive.
Unfortunately partitioning is something you choose once and for all, so you really need to have clear in your mind which common use cases you most want to support. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your values will be spread across 5 partitions. Each partition will have a top limit of 2,000 RU. If all your queries use just one partition, your real maximum performance is that 2,000 and not 10,000 RUs.
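The arithmetic above can be written out as a small sketch. It assumes, as the example does, that provisioned RUs are divided evenly among partitions; both function names are hypothetical and nothing here calls the Cosmos DB SDK.

```javascript
// Throughput ceiling for a single partition, assuming an even split of RUs.
function perPartitionRu(totalRu, partitionCount) {
  return totalRu / partitionCount;
}

// Best-case throughput for a workload that only touches some partitions.
function effectiveRu(totalRu, partitionCount, partitionsHit) {
  const hit = Math.min(partitionsHit, partitionCount);
  return perPartitionRu(totalRu, partitionCount) * hit;
}
```

With the numbers from the example, perPartitionRu(10000, 5) gives the 2,000 RU ceiling, and effectiveRu(10000, 5, 1) shows that a single-author workload cannot use more than that.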
I really hope this helps you start to figure out the answer. And I really hope this helps to foster and grow a discussion (how to model for a document database) that I think is really due and mature now.

MongoDB query comments along with user information

I am creating an application with Node.js and MongoDB (not Mongoose). I have a problem that has given me a headache for a few days; can anyone please suggest a way to solve this?
I have a mongodb design like this
post{
_id:ObjectId(...),
picture: 'some_url',
comments:[
{_id:ObjectId(...),
user_id:Object('123456'),
body:"some content"
},
{_id:ObjectId(...),
user_id:Object('...'),
body:"other content"
}
]
}
user{
_id:ObjectId('123456'),
name: 'some name', --> changable at any times
username: 'some_name', --> changable at any times
picture: 'url_link' --> changable at any times
}
I want to query the post along with all the user information so the query will look like this:
[{
_id:ObjectId(...),
picture: 'some_url',
comments:[
{_id:ObjectId(...),
user_id:Object('123456'),
user_data:{
_id:ObjectId('123456'),
name: 'some name',
username: 'some_name',
picture: 'url_link'
},
body:"some content"
},
{_id:ObjectId(...),
user_id:Object('...'),
body:"other content"
}
]
}]
I tried to use a loop to manually get the user data and add it to each comment, but it proved to be difficult and beyond my coding skill :(
If anybody has any suggestions, I would really appreciate it.
P.S. I am trying another approach where I embed all the user data in the comment, and whenever users update their username, name or picture, it is updated in all their comments as well.
The problem(s)
As written before, there are several problems when over-embedding:
Problem 1: BSON size limit
As of the time of this writing, BSON documents are limited to 16MB. If that limit is reached, MongoDB would throw an exception and you simply could not add more comments and in worst case scenarios not even change the (user-)name or the picture if the change would increase the size of the document.
Problem 2: Query limitations and performance
It is not easily possible to query or sort the comments array under certain conditions. Some things would require a rather costly aggregation, others rather complicated statements.
While one could argue that once the queries are in place, this isn't much of a problem, I beg to differ. First, the more complicated a query is, the harder it is to optimize, both for the developer and subsequently MongoDB's query optimizer. I have had the best results with simplifying data models and queries, speeding up responses by a factor of 100 in one instance.
When scaling, the resources needed for complicated and/or costly queries might even sum up to whole machines when compared to a simpler data model and according queries.
Problem 3: Maintainability
Last but not least you might well run into problems maintaining your code. As a simple rule of thumb
The more complicated your code becomes, the harder it is to maintain. The harder code is to maintain, the more time it needs to maintain the code. The more time it needs to maintain code, the more expensive it gets.
Conclusion: Complicated code is expensive.
In this context, "expensive" both refers to money (for professional projects) and time (for hobby projects).
(My!) Solution
It is pretty easy: simplify your data model. Consequently, your queries will become less complicated and (hopefully) faster.
Step 1: Identify your use cases
That's going to be a wild guess for me, but the important thing here is to show you the general method. I'd define your use cases as follows:
For a given post, users should be able to comment
For a given post, show the author and the comments, along with the commenters' and author's usernames and their pictures
For a given user, it should be easily possible to change the name, username and picture
Step 2: Model your data accordingly
Users
First of all, we have a straightforward user model
{
_id: new ObjectId(),
name: "Joe Average",
username: "HotGrrrl96",
picture: "some_link"
}
Nothing new here, added just for completeness.
Posts
{
_id: new ObjectId(),
title: "A post",
content: " Interesting stuff",
picture: "some_link",
created: new ISODate(),
author: {
username: "HotGrrrl96",
picture: "some_link"
}
}
And that's about it for a post. There are two things to note here: first, we store the author data we immediately need when displaying a post, since this saves us a query for a very common, if not ubiquitous, use case. Why don't we save the comments and commenters' data accordingly? Because of the 16 MB size limit, we are trying to prevent the storage of references in a single document. Rather, we store the references in comment documents:
Comments
{
_id: new ObjectId(),
post: someObjectId,
created: new ISODate(),
commenter: {
username: "FooBar",
picture: "some_link"
},
comment: "Awesome!"
}
The same as with posts, we have all the necessary data for displaying a comment.
The queries
What we have achieved now is that we circumvented the BSON size limit and we don't need to refer to the user data in order to be able to display posts and comments, which should save us a lot of queries. But let's come back to the use cases and some more queries
Adding a comment
That's totally straightforward now.
Getting all or some comments for a given post
For all comments
db.comments.find({post:objectIdOfPost})
For the 3 latest comments
db.comments.find({post:objectIdOfPost}).sort({created:-1}).limit(3)
So for displaying a post and all (or some) of its comments including the usernames and pictures we are at two queries. More than you needed before, but we circumvented the size limit and basically you can have an indefinite number of comments for every post. But let's get to something real
Getting the latest 5 posts and their latest 3 comments
This is a two step process. However, with proper indexing (will come back to that later) this still should be fast (and hence resource saving):
var posts = db.posts.find().sort({created:-1}).limit(5)
posts.forEach(
function(post) {
doSomethingWith(post);
var comments = db.comments.find({"post":post._id}).sort({"created":-1}).limit(3);
doSomethingElseWith(comments);
}
)
Get all posts of a given user sorted from newest to oldest and their comments
var posts = db.posts.find({"author.username": "HotGrrrl96"},{_id:1}).sort({"created":-1});
var postIds = [];
posts.forEach(
function(post){
postIds.push(post._id);
}
)
var comments = db.comments.find({post: {$in: postIds}}).sort({post:1, created:-1});
Note that we have only two queries here. Although you need to "manually" make the connection between posts and their respective comments, that should be pretty straightforward.
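Making that connection "manually" could look something like the sketch below. It assumes both result sets have already been read into plain arrays (for example with toArray()), and the helper names are hypothetical. Ids are converted to strings so that ObjectId values and plain strings both work as map keys.

```javascript
// Group comments by the post they belong to.
function groupCommentsByPost(comments) {
  const byPost = new Map();
  for (const comment of comments) {
    const key = String(comment.post);
    if (!byPost.has(key)) {
      byPost.set(key, []);
    }
    byPost.get(key).push(comment);
  }
  return byPost;
}

// Return copies of the posts, each with its comments attached.
// The order of comments within a post follows the order of the input array.
function attachComments(posts, comments) {
  const byPost = groupCommentsByPost(comments);
  return posts.map(function (post) {
    const own = byPost.get(String(post._id)) || [];
    return Object.assign({}, post, { comments: own });
  });
}
```

Since the comments query already sorts by post and then by created, each post's comments arrive grouped and in newest-first order, and the grouping step preserves that order.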
Change a username
This is presumably a rarely executed use case. However, it isn't very complicated with said data model.
First, we change the user document
db.users.update(
{ username: "HotGrrrl96"},
{
$set: { username: "Joe Cool"},
$push: {oldUsernames: "HotGrrrl96" }
},
{
writeConcern: {w: "majority"}
}
);
We push the old username to an according array. This is a security measure in case something goes wrong with the following operations. Furthermore, we set the write concern to a rather high level in order to make sure the data is durable.
db.posts.update(
{ "author.username": "HotGrrrl96"},
{ $set:{ "author.username": "Joe Cool"} },
{
multi:true,
writeConcern: {w:"majority"}
}
)
Nothing special here. The update statement for the comments looks pretty much the same. While those queries take some time, they are rarely executed.
The indices
As a rule of thumb, one can say that MongoDB can only use one index per query. While this is not entirely true since there are index intersections, it is easy to deal with. Another thing is that individual fields in a compound index can be used independently. So an easy approach to index optimization is to find the query with the most fields used in operations which make use of indices and create a compound index of them. Note that the order of occurrence in the query matters. So, let's go ahead.
Posts
db.posts.createIndex({"author.username":1,"created":-1})
Comments
db.comments.createIndex({"post":1, "created":-1})
Conclusion
A fully embedded document per post admittedly is the fastest way of loading it and its comments. However, it does not scale well, and due to the nature of the possibly complex queries necessary to deal with it, this performance advantage may be reduced or even eliminated.
With the above solution, you trade some speed (if!) against basically unlimited scalability and a much more straightforward way of dealing with the data.
Hth.
You are following the normalized data model approach. With this model, you have to write another query to get the user info. If you use the embedded document approach instead, every embedded copy of the user data must be updated whenever the user document changes.
http://docs.mongodb.org/v3.0/reference/database-references/ read this link for more information.

Mongoose - "object" in "object" [duplicate]

I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?
A question with some comments, like stackoverflow, would have a structure like this:
Question
title = 'aaa'
content = 'bbb'
comments = ???
At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:
Question
title = 'aaa'
content = 'bbb'
comments = [ { content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'} ]
It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)
Do I have to use ref rather than embed? Do I then have to create a new collection for comments?
This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:
Put as much in as possible
The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.
Separate data that can be referred to from multiple places into its own collection.
This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
Document size considerations
MongoDB imposes a 4MB (16MB with 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
Complex data structures:
MongoDB can store arbitrary deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well)
It has also been pointed out that it is impossible to return a subset of elements in a document. If you need to pick-and-choose a few bits of each document, it will be easier to separate them out.
Data Consistency
MongoDB makes a trade off between efficiency and consistency. The rule is changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using for example a "lock" field). When you design your schema consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
For what you are describing, I would embed the comments, and give each comment an id field with an ObjectID. The ObjectID has a time stamp embedded in it so you can use that instead of created at if you like.
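To illustrate the embedded timestamp: the first 4 bytes of an ObjectID (the first 8 hex characters) hold the creation time in seconds since the Unix epoch. A small sketch of reading it from the hex string form follows; the function name is hypothetical, and in the mongo shell the ObjectId's own getTimestamp() method does the same job.

```javascript
// Extract the creation time from the 24-character hex form of an ObjectID.
function objectIdToDate(hexId) {
  if (!/^[0-9a-fA-F]{24}$/.test(hexId)) {
    throw new Error("expected a 24-character hex ObjectID string");
  }
  // The first 8 hex characters are the creation time in seconds since the epoch
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}
```

Note that the resolution is one second, so comments created within the same second cannot be ordered by this value alone.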
In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.
Well, I'm a bit late but still would like to share my way of schema creation.
I have schemas for everything that can be described by a word, like you would do it in the classical OOP.
E.G.
Comment
Account
User
Blogpost
...
Every schema can be saved as a Document or Subdocument, so I declare this for each schema.
Document:
Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in your application. (E.g. the blogpost -> there is a page about the blogpost)
Subdocument:
Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in your application. (The comment just shows up in the blogpost page but the page is still about the blogpost)
I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.
http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents
It summarized:
As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.
Smaller and/or fewer documents tend to be a natural fit for embedding.
Actually, I'm quite curious why nobody spoke about the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.
And you will quickly understand why it is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you just can't embed it in a non-existing parent, you have to make it live in its own data structure. And if a parent exists, just link them together by adding a ref of the object in the parent.
Don't really know what the difference between the two relationships is?
Here is a link explaining them:
Aggregation vs Composition in UML
If I want to edit a specified comment, how to get its content and its question?
You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).
This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.
In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.
Yes, we can use a reference in the document and then populate the referenced document, much like a join in SQL. MongoDB doesn't have joins to map one-to-many relationships between documents. Instead, we can use Mongoose's populate to cover this scenario.
var mongoose = require('mongoose')
, Schema = mongoose.Schema
var personSchema = Schema({
_id : Number,
name : String,
age : Number,
stories : [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});
var storySchema = Schema({
_creator : { type: Number, ref: 'Person' },
title : String,
fans : [{ type: Number, ref: 'Person' }]
});
The population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query. Let's look at some examples.
For more information, please visit: http://mongoosejs.com/docs/populate.html
I know this is quite old, but if you are looking for the answer to the OP's question on how to return only the specified comment, you can use the $ (projection) operator like this:
db.question.find({'comments.content': 'xxx'}, {'comments.$': true})
MongoDB gives freedom to be schema-less and this feature can result in pain in the long term if not thought or planned well,
There are 2 options either Embed or Reference. I will not go through definitions as the above answers have well defined them.
When embedding you should answer one question is your embedded document going to grow, if yes then how much (remember there is a limit of 16 MB per document) So if you have something like a comment on a post, what is the limit of comment count, if that post goes viral and people start adding comments. In such cases, reference could be a better option (but even reference can grow and reach 16 MB limit).
So how to balance it, the answer is a combination of different patterns, check these links, and create your own mix and match based on your use case.
https://www.mongodb.com/blog/post/building-with-patterns-a-summary
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
If I want to edit a specified comment, how do I get its content and
its question?
If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use the dot operator (SO example).
You could do f.ex.
db.questions.update(
{
"title": "aaa"
},
{
// $set changes only this one field and leaves the rest of the document intact
$set: { "comments.0.content": "new text" }
}
)
(as another way to edit the comments inside the question)

What is the best practice for mongoDB to handle 1-n n-n relationships?

In relational database, 1-n n-n relationships mean 2 or more tables.
But in mongoDB, since it is possible to directly store those things into one model like this:
Article{
content: String,
uid: String,
comments:[Comment]
}
I am getting confused about how to manage those relations. For example, in an article-comments model, should I directly store all the comments in the article model and then read out the entire article object into JSON every time? But what if the comments grow really large? If there are 1,000 comments in an article object, will such a strategy make the GET process very slow every time?
I am by no means an expert on this, however I've worked through similar situations before.
From the few demos I've seen yes you should store all the comments directly in line. This is going to give you the best performance (unless you're expecting some ridiculous amount of comments). This way you have everything in your document.
In the future, if things start going great and you do notice things going slower, you could do a few things. You could look to store the latest (insert arbitrary number) comments with a reference to where the other comments are stored, then map-reduce old comments out into a "bucket" to keep loading times quick.
However initially I'd store it in one document.
So would have a model that looked maybe something like this:
Article{
content: String,
uid: String,
comments:[
{"comment":"hi", "user":"jack"},
{"comment":"hi", "user":"jack"},
],
"oldCommentsIdentifier":12345
}
Then only have oldCommentsIdentifier populated if you did move comments out of your comments array. However, I really wouldn't do this for fewer than 1000 comments and maybe even more. It would take a bit of testing to see what the "sweet" spot would be.
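A minimal in-memory sketch of that idea follows: keep only the newest comments embedded and hand the overflow back to the caller, who would write it to wherever old comments live and record that location in oldCommentsIdentifier. The function name and the maxEmbedded parameter are hypothetical, and no database calls are shown.

```javascript
// Add a comment to an article, keeping at most maxEmbedded comments inline.
// Returns the oldest comments that overflowed, for storage in a separate bucket.
function addComment(article, comment, maxEmbedded) {
  article.comments.push(comment);
  const overflowCount = Math.max(0, article.comments.length - maxEmbedded);
  // splice removes the oldest comments from the front of the array
  return article.comments.splice(0, overflowCount);
}
```

When the returned array is empty, nothing needs to move and the article can be saved as a single document.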
I think a large part of the answer depends on how many comments you are expecting. Having a document that contains an array that could grow to an arbitrarily large size is a bad idea, for a couple reasons. First, the $push operator tends to be slow because it often increases the size of the document, forcing it to be moved. Second, there is a maximum BSON size of 16MB, so eventually you will not be able to grow the array any more.
If you expect each article to have a large number of comments, you could create a separate "comments" collection, where each document has an "article_id" field that contains the _id of the article that it is tied to (or the uid, or some other field unique to the article). This would make retrieving all comments for a specific article easy, by querying the "comments" collection for any documents whose "article_id" field matches the article's _id. Indexing this field would make the query very fast.
The link that limelights posted as a comment on your question is also a great reference for general tips about schema design.
But if I solve this problem by linking article and comments with _id, won't it kinda go back to the relational database design? And somehow lose the essence of being NoSQL?
Not really, NoSQL isn't all about embedding models. In fact, embedding should be considered carefully for your scenario.
It is true that the aggregation framework solves quite a few of the problems you can get from embedding objects that you need to use as documents themselves. I define subdocuments that need to be used as documents as:
Documents that need to be paged in the interface
Documents that might exist across multiple root documents
Document that require advanced sorting within their group
Documents that when in a group will exceed the root documents 16meg limit
As I said, the aggregation framework does solve this a little; however, you're still looking at performing a query that, in realtime or close to it, would be much like performing the same in SQL on the same number of documents.
This effect is not always desirable.
You can achieve paging (sort of) of subdocuments with normal querying using the $slice operator, but then this can have pretty much the same problems as using skip() and limit() over large result sets, which again is undesirable since you cannot fix it so easily with a range query (the aggregation framework would be required again). Even with 1000 subdocuments I have seen speed problems, and not just me but other people too.
So let's get back to the original question: how to manage the schema.
Now the answer, which you're not going to like, is: it all depends.
Do your comments meet the criteria above for being separated out? If so, then that is probably a good bet.
There is no best way to do this. In MongoDB you should design your collections according to the application that is going to use them.
If your application needs to display comments with article, then I can say it is better to embed these comments in article collection. Otherwise, you will end up with several round trips to your database.
There is one scenario where embedding does not work. As far as I know, document size is limited to 16 MB in MongoDB. This is quite large actually. However, If you think your document size can exceed this limit it is better to have separate collection.

Resources