Fetching 'related documents' from MongoDB - Node.js

So I have a very specific question about the optimal way of storing and then fetching data from a MongoDB database, and I'll try my best to explain the use case:
I have a content publishing platform that I've built. On this platform, a user can, say, write a story, and the story gets saved as a document in the 'stories' collection in the database, in a structure such as this:
{
  "_id": "s_12345",
  "title": "This is a story",
  ...
}
Now, on the same platform, let's say another user writes a 'news article', which gets saved as a document in a separate 'news' collection. The interesting thing is that while writing this news article, the user can 'tag' a story of their choice, so that the news article will show up in a 'related content' section when some user on the platform is viewing that particular story. So the data structure of this news article could be:
{
  "_id": "n_12345",
  "title": "This is a news article",
  "related_to_tag": "s_12345", // id of the story
  ...
}
Now, from my understanding, there are two ways of doing this:
OPTION 1: When a user tries to view this story (s_12345), we make a GET request to the server, fetch this particular story document from the 'stories' collection, then cycle through ALL the documents in the 'news' collection and pick up every document that has related_to_tag === s_12345, and return the story document plus all these related news documents to the client. However, this operation seems pretty expensive to me, especially if I have, let's say, 10,000 news articles in the news collection.
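For reference, a minimal sketch of what Option 1 could look like with Mongoose (all schema, model, and function names here are assumptions, not from the question); note that with an index on related_to_tag, the read becomes an indexed lookup rather than a scan of all 10,000 news documents:

const mongoose = require('mongoose');

// Hypothetical schemas mirroring the document structures above.
const storySchema = new mongoose.Schema({ _id: String, title: String });
const newsSchema = new mongoose.Schema({
  _id: String,
  title: String,
  related_to_tag: String, // id of the tagged story
});
newsSchema.index({ related_to_tag: 1 }); // indexed lookup instead of a scan
const Story = mongoose.model('Story', storySchema, 'stories');
const News = mongoose.model('News', newsSchema, 'news');

async function getStoryWithRelated(storyId) {
  const story = await Story.findById(storyId);
  const relatedNews = await News.find({ related_to_tag: storyId });
  return { story, relatedNews };
}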
OPTION 2: At the time of posting the 'news article' to the database, I also find the story (s_12345) in the stories collection and write a reference to the news article in this story document itself, like so:
{
  "_id": "s_12345",
  "title": "This is a story",
  "related_content": "n_12345", // id of the news article
  ...
}
The second option seems better to me, because then, when a user tries to get the story, I already know all the news articles that are related to it, and simply have to run a Mongoose populate() to pull them in. But it brings up other complications, such as:
What happens when the author of the news article deletes it? That means I will have to find the story document (s_12345) and delete the related_content reference (n_12345) as well. Or maybe I could run a weekly cron job that does this sort of cleanup.
Also, what happens if, while I am doing this double write (writing the news article to the database + writing a reference to it into the story document), the second operation fails for whatever reason? That would create data inconsistency.
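On the double-write failure specifically: if you are on MongoDB 4.0+ with a replica set, both writes could be wrapped in a multi-document transaction so that they succeed or fail together. A rough sketch, reusing the hypothetical models from the Option 1 sketch above (and assuming the surrounding code is an async function):

const session = await mongoose.startSession();
try {
  // Both writes commit atomically, or neither does.
  await session.withTransaction(async () => {
    await News.create(
      [{ _id: 'n_12345', title: 'This is a news article', related_to_tag: 's_12345' }],
      { session }
    );
    await Story.updateOne(
      { _id: 's_12345' },
      { $set: { related_content: 'n_12345' } },
      { session }
    );
  });
} finally {
  session.endSession();
}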
Anyway, this is a question that I have been struggling with for quite some time now, hope I have explained my use case clearly.
Awaiting your responses!
Abrar

Related

MongoDB: big data structure

I'm rebuilding my website, which is a search engine for nicknames from the most active forum in France: you search for a nickname and you get all of its messages.
My current database contains more than 60 GB of data, stored in a MySQL database. I'm now rewriting it into a MongoDB database, and after importing 1 million messages (1 message = 1 document), find() started to take a while.
The structure of a document is as such:
{
  "_id": ObjectId(),
  "message": "<p>Hai guys</p>",
  "pseudo": "mahnickname", // from a nickname (*pseudo* in my db)
  "ancre": "774497928", // its id in the forum
  "datepost": "30/11/2015 20:57:44"
}
I set the ancre field as unique, so I don't get the same entry twice.
Then the user enters the nickname and it finds all documents that have that nickname.
Here is the request:
Model.find({ pseudo: "danickname" })
  .sort('-datepost')
  .skip((r_page - 1) * 20)
  .limit(20)
  .exec(function(err, bears) { ... });
Should I structure it differently? Instead of having one document for each message, should I have a document for each nickname, and update that document whenever I get a new message from that nickname?
I was using the first method with MySQL and it wasn't taking that long.
Edit: Or should I maybe just index the nicknames (pseudo)?
Thanks!
Here are some recommendations for your big data problem:
The ObjectId already contains a timestamp, and you can also sort on it, so you could save some disk space by removing the datepost field.
Do you absolutely need the ancre field? The ObjectId is already unique and indexed. If you absolutely need it, and need to keep datepost separate too, you could replace the _id field with your ancre value.
As many have mentioned, you should add an index on pseudo. This will make the "get all messages where the pseudo is mahnickname" query much faster.
If the amount of messages per user is low, you could store all of them inside a single document per user. This would avoid having to skip to a specific page, which can be slow. However, be aware of the 16 MB document limit. I would personally still keep them in multiple documents.
To keep queries fast, ensure that all your indexed fields fit in RAM. You can see the RAM consumption of the indexes by running db.collection.stats() and looking at the indexSizes sub-document.
Would there be a way for you to not skip documents, but instead use the time a message was written as your paging boundary? If so, use the datepost field or the timestamp in _id for your paging strategy. If you decide on datepost, make a compound index on pseudo and datepost, as in the sketch below.
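A rough sketch of those last two recommendations with Mongoose (the schema, model, and function names are assumptions for illustration; it also stores datepost as a real Date rather than a "DD/MM/YYYY" string, since such strings do not sort chronologically):

const mongoose = require('mongoose');

const messageSchema = new mongoose.Schema({
  message: String,
  pseudo: String,
  ancre: { type: String, unique: true }, // forum id, kept unique as in the question
  datepost: Date, // a real Date sorts correctly; "30/11/2015 ..." strings do not
});
// Compound index: serves "all messages by this pseudo, newest first".
messageSchema.index({ pseudo: 1, datepost: -1 });
const Message = mongoose.model('Message', messageSchema);

// Range-based paging: instead of skip(), remember the datepost of the last
// message on the previous page and continue from just before it.
function nextPage(pseudo, lastSeenDate, cb) {
  Message.find({ pseudo: pseudo, datepost: { $lt: lastSeenDate } })
    .sort('-datepost')
    .limit(20)
    .exec(cb);
}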
As for your benchmarks, you can closely monitor MongoDB by using mongotop and mongostat.

Node.js + Mongoose: Find data and get other data for each

I have two models, Users and News. On a page built with the Express framework, news items are published, and under each news item are comments. Inside the News model is a comments subdocument array, where each comment contains two fields: user (with subfields name and objectid) and comment. Because each comment carries the user's name in addition to the comment text, I would like to add some additional information about the user (like number of comments, link to their website, ...).
And this is my question: how do I get the user's data (from the Users model) for each comment in the subdocument (from the News model)?
Add a populate call to your find query to pull in the user details. I'm not quite clear on your schema, but something like:
News.find().populate('comments.userId').exec(...);
This relies on your schema defining userId as an ObjectId ref to Users.
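For reference, a minimal sketch of a schema under which that populate() call would work (the field and model names are guesses based on the question):

const mongoose = require('mongoose');

const newsSchema = new mongoose.Schema({
  title: String,
  body: String,
  comments: [{
    // ObjectId ref to the Users model; populate() follows this reference.
    userId: { type: mongoose.Schema.Types.ObjectId, ref: 'Users' },
    comment: String,
  }],
});
const News = mongoose.model('News', newsSchema);

// After populate(), each comment's userId holds the full user document,
// so extra user details (comment count, website, ...) are available.
News.find().populate('comments.userId').exec(function (err, news) {
  // e.g. news[0].comments[0].userId.name
});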

DDD: should "Comment" in an "Article" be an aggregate root?

I am designing my first simple application DDD-style, and I am starting to see how the concepts fit together.
If I design a classic blog application, the Article class will be one of my aggregate roots. I want to retrieve articles, and manage and delete all the related data (publication date, authors, ...). I am having difficulties with comments. At first, Comment seems to be an entity that is part of the Article aggregate: a comment is created in relation to an article, and if I delete an Article, I will delete the related comments.
Then I want to display a small box on the blog with the latest comments posted anywhere on the blog, for any article. So it looks like I want to retrieve comments (and only comments) from my data store. From my understanding of DDD ideas, that makes Comment an aggregate root. But that does not seem totally right, as Comment seems to depend strongly on Article.
How would you model it?
Thanks.
When you think about it, you will probably find various reasons why a Comment should be an Aggregate itself:
You want to list the latest comments
You may want to list all comments by a particular user
You may want comments to be nested (a comment being a reply to another comment)
You may want to approve/reject comments through an admin interface
A user may want to edit or delete his/her comment
...
I take the following as a general rule of thumb: Keep your Aggregates as small as possible. When in doubt, err on the side of more Aggregates.
By modelling it this way, you can attach the comments to multiple objects, like Article and User:
Comment
  string Text
  string UserName
  bool IsApproved

Article
  string Title
  string Body
  ...
  List<CommentIds> CommentIds

User
  string UserName
  ...
  List<CommentIds> CommentIds

ListOfTenLatestComments
  List<CommentIds> CommentIds

How do document-based databases, and CouchDB in particular, handle ID references?

I really like the document-based approach of storing data like blog posts as a whole document, with all the information needed saved inside of it. Therefore the author's username is stored as plain text. The author himself has his own document with personal information attached to it. What happens when the author decides to change his username? Do I have to update every document that contains a blog post by that author, or is this just one of the drawbacks of using a document-based database?
Thanks for any suggestions!
If you need to write a query (view) that uses both the content of the blog post and the name of the author, then the name must be included in the blog post document, and therefore all blog posts must be updated when it changes.
If the name is only informational (i.e. you never query on blog post content AND the author's name together), you can store the author's id in the blog document instead (and can then query on blog content AND author id), and emit {'_id': doc.author_id} as the value.
Adding include_docs=true to the query then gives you the author's doc for each row (rather than the blog post doc; if you need that too, you have to fetch it explicitly with the id that's in the result rows). No need to update the blog posts.
Example:
Case 1:
If you query authors by name, you have to include the name, and therefore update ALL docs when it changes.
{
  "_id": "blogpost1",
  "author": "Oliver",
  "keyword": "couchDB"
}
To look for all couchDB posts from Oliver:
emit([doc.author, doc.keyword], 1)
call:
?key=["Oliver","couchDB"]
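Written out in full, the Case 1 map function might look like this (the guard condition is an assumption for illustration):

function (doc) {
  if (doc.author && doc.keyword) {
    // Composite key, queried with ?key=["Oliver","couchDB"]
    emit([doc.author, doc.keyword], 1);
  }
}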
Case 2:
No need to query by name
{
  "_id": "blogpost1",
  "author_id": "author-123",
  "keyword": "couchDB"
}
emit(doc.keyword, {'_id': doc.author_id})
and the author's doc:
{
  "_id": "author-123",
  "name": "Oliver"
}
call:
?key="couchDB"&include_docs=true
result:
...
{"id":"blogpost1","key":"couchDB","value":{"_id":"author-123"},"doc":{"_id":"author-123","_rev":"xxx","name":"Oliver"}}
...
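Putting Case 2 together, the full map function might look like this (again, the guard condition is an assumption):

function (doc) {
  if (doc.author_id && doc.keyword) {
    // The {'_id': ...} value makes include_docs=true fetch the author's doc.
    emit(doc.keyword, { _id: doc.author_id });
  }
}
// queried with ?key="couchDB"&include_docs=true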

CouchDB view collation, join on one key, search on other values

Looking at the example described in CouchDB Joins.
It discusses view collation and how you can have one document for your blog post, with each comment being a separate document in CouchDB. So, for example, I could have "My Post" and 5 comments associated with "My Post", for a total of 6 documents. In their example, "myslug" is stored both in the post document and in each comment document, so that when I search CouchDB with the key "myslug" it returns all the documents.
Here's the problem/question. Let's say I want to search on the author in the comments and a post that also has a category of "news". How would this work exactly?
So for example:
function(doc) {
  if (doc.type == "post") {
    emit([doc._id, 0], doc);
  } else if (doc.type == "comment") {
    emit([doc.post, 1], doc);
  }
}
That will load my blog post and comments based on this: ?startkey=["myslug"]
However, what I want is to grab the comments by author bob and the post that has the category news. For this example, bob has written three comments on the blog post with the category news. It seems as if CouchDB only allows me to search on keys that exist in both documents, not on a key in one document and a key in another that are "joined" together by the map function.
In other words, if the post and comments are joined by a slug, how do I search on one field in one document and another field in another document, joined by the id, a.k.a. the slug?
In SQL it would be something like this:
SELECT * FROM comments
JOIN posts ON comments.post = posts.id
WHERE comments.author = 'bob' AND posts.category = 'news';
I've been investigating CouchDB for about a week, so I'm hardly qualified to answer your question, but I think I've come to the conclusion that it can't be done. View results need to be tied to one and only one document so the view can be updated. You are going to have to denormalize, at least if you don't want to do a grunt search. If anyone's come up with a clever way to do this, I'd really like to know.
There are several ways you can approximate a SQL join in CouchDB. I've just asked a similar question here: Why is CouchDB's reduce_limit enabled by default? (Is it better to approximate SQL JOINS in MapReduce views or List views?)
You can use MapReduce (not a good option).
You can use lists (these iterate over a result set before emitting results, meaning you can 'combine' documents in a number of creative ways).
You can also apparently use 'collation', though I haven't figured this out yet (it seems like I always get a count, and can only use the feature with reduce, if I'm on the right track).
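If you take the denormalization route from the first answer, one sketch (the post_category field is invented for illustration) is to copy the post's category onto each comment when it is written, so that a single view can answer both filters at once:

function (doc) {
  if (doc.type == "comment") {
    // post_category is denormalized from the parent post at write time.
    emit([doc.post_category, doc.author], doc);
  }
}
// queried with ?key=["news","bob"]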
