Mongoose populate option and query time - node.js

I am working on a platform where I use Mongoose's .populate() many times across all my queries. I turned on Mongoose debug mode and found that there is hardly any difference in query execution time with and without populate (for 100 documents now; there will be around 100,000 documents in the future).
I know that populate basically runs a findOne query internally. My question is: will using .populate() increase my query time, or otherwise affect performance once the number of records reaches the millions? Also, is there an alternative I can use to improve performance?

In general, you're correct - you want to avoid using populate() since it will issue another query for each row. Keep in mind that this is a full round-trip to the server. Mongo doesn't have any sort of concept for a join, so when you do populate you're issuing an additional query for each row in your returned set.
The technique to work around this is to denormalize your data - don't design a mongo database like a relational one. The mongo docs have lots of information on how to do this. https://docs.mongodb.org/manual/core/data-model-design/ One important thing to keep in mind with Mongo design is that you never want to have a subdocument with unbounded growth. Due to the way mongo space allocation and paging works, this can cause severe performance problems, so if you're in a situation like this it's best to normalize.
Another very common technique is subdocument caching. This is where you take partial data from the "joined" collection and cache it on the collection you're querying. In this case, you're trading space for performance because you have duplicate data. Also, you'll have to make sure you keep the data updated whenever there's a change. With mongoose, it is easy to do this as a post-save hook on the model of the foreign collection.
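As a rough illustration of subdocument caching, here is a minimal in-memory sketch. The collections, field names, and the postSaveUser function are hypothetical stand-ins; with Mongoose the same idea would typically live in a post-save hook on the user schema that runs an update against the comments model.

```javascript
// In-memory stand-ins for two collections (hypothetical names and shapes).
const users = new Map(); // userId -> { _id, name }
const comments = [];     // { _id, text, user: { _id, name } } with a cached partial user

// Stand-in for a post-save hook on the "foreign" (user) model:
// after a user is saved, refresh every cached copy of that user's name.
function postSaveUser(user) {
  let touched = 0;
  for (const c of comments) {
    if (c.user._id === user._id && c.user.name !== user.name) {
      c.user.name = user.name;
      touched += 1;
    }
  }
  return touched;
}

function saveUser(user) {
  users.set(user._id, { ...user });
  return postSaveUser(user);
}

function addComment(_id, text, userId) {
  const u = users.get(userId);
  // Copy only the fields the comment view needs (the cached subdocument).
  comments.push({ _id, text, user: { _id: u._id, name: u.name } });
}

saveUser({ _id: 1, name: "jack" });
addComment(10, "hi", 1);
addComment(11, "hello again", 1);

// Reading comments needs no second lookup: the name is already on each document.
const before = comments.map((c) => c.user.name);

// Renaming the user triggers the hook, which rewrites the cached copies.
const updatedCount = saveUser({ _id: 1, name: "jack_b" });
const after = comments.map((c) => c.user.name);
console.log(before, after, updatedCount);
```

The read path gets cheaper, while every rename now costs one extra write per cached copy, which is the space-for-performance trade described above.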

Related

MongoDb slow aggregation with many collections (lookup)

I'm working on a MEAN stack project. My aggregation touches many collections, so I use a lot of lookups, and that hurts performance and makes the aggregation very slow. I was wondering if you have any suggestions. I found that we can reduce the lookups by storing, for each collection I need, an array of objects in one global collection; however, I'm looking for an optimal and secure solution.
For information, I have defined indexes on all collections in Mongo.
Thanks for sharing your ideas!
This is a very involved question. Even if you gave all your schemas and queries, it would take too long to answer, and be very specific to your case (ie. not useful to anyone else coming along later).
Instead, as a general answer, I'd advise you to read up on denormalization and consider some database redesign if this query is core to your project.
Here is a good article to get you started.
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
A simple example to outline it:
Say you have a blog with a comment collection and a user collection.
You want to display each comment with the name of its user, so you have to load the user for every comment.
Instead you could save the username on the comment collection as well as the user collection.
Then you will have a fast query to show comments, as you don't need to load the users too. But if the user changes their name, then you will have to update all of the comments with the new name. This is the main tradeoff.
If a DB redesign is too difficult, I suggest splitting into multiple aggregates and combining them in memory (ie. in your node server side code)
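A minimal sketch of combining results in memory, assuming two separate queries have already returned plain arrays. The sample data and the joinInMemory helper are hypothetical; the point is only that indexing one result set by key makes the merge a single pass.

```javascript
// Hypothetical results of two independent queries or aggregations.
const orders = [
  { _id: "o1", customerId: "c1", total: 30 },
  { _id: "o2", customerId: "c2", total: 12 },
  { _id: "o3", customerId: "c1", total: 5 },
];
const customers = [
  { _id: "c1", name: "Ana" },
  { _id: "c2", name: "Bo" },
];

// Index one result set by _id, then attach matches to the other in one pass.
function joinInMemory(left, right, leftKey, as) {
  const byId = new Map(right.map((doc) => [doc._id, doc]));
  return left.map((doc) => ({ ...doc, [as]: byId.get(doc[leftKey]) ?? null }));
}

const joined = joinInMemory(orders, customers, "customerId", "customer");
const summary = joined.map((o) => `${o._id}:${o.customer.name}`);
console.log(summary);
```

This only pays off when each partial result is small enough to hold in the Node process comfortably.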

MongoDB schema design

I'm planning to implement this schema in MongoDB. I have been doing some reading about schema design, and the takeaway was that whenever you structure your data like a relational database, you must be doing something wrong.
My questions:
What should I do when the collection size gets larger than the 16MB limit?
app_log in server_log collections might, in some cases, grow larger than 16MB depending on how busy the server is.
I'm aware of the cap feature that I could use, but the requirement is to store all logs for 90 days.
Do you see any potential issues with my design?
Is it good practice to have the application check the collection size and create a new collection by day, hour, etc. to accommodate log size growth?
Thanks
Your collection size is not restricted to 16MB, as one of the comments pointed out, you can check in the MongoDB manual that it is the largest document size. So there is no need to separate the same class of data between different collections, in fact it would be a major headache for you to do so :) One user collection, one for your servers and one for your server_logs. You can then create references from one collection to the next by using the id field.
Whether this is a good design or not will depend on your queries. In general, you want to avoid using joins in Mongo (they're still possible, but if you're doing a bunch of joins, you're using it wrong, and really should use a relational DB :-)
For example, if most of your queries are on the server_log collection and only use the fields in that collection, then you'll be fine. OTOH, if your server_log queries always need to pull in data from the server collection as well (say for example the name and userId fields), then it might be worth selectively denormalizing that data. That's a fancy way of saying, you may wish to copy the name and userId fields into your server_log documents, so that your queries can avoid having to join with the server collection. Of course, every time you denormalize, you add complexity to your application which must now ensure that the data is consistent across multiple collections (e.g., when you change the server name, you have to make sure you change it in the server_logs, too).
You may wish to make a list of the queries you expect to perform, and see if they can be done with a minimum of joins with your current schema. If not, see if a little denormalization will help. If you're getting to the point where either you need to do a bunch of joins or a lot of manual management of denormalized data in order to satisfy your queries, then you may need to rethink your schema or even your choice of DB.
what should I do when collection size gets larger than 16MB limit
In MongoDB there is no limit on collection size. The limit applies to each document: a single document cannot exceed 16 MB.
Do you see any potential issues with my design?
No issues with the above design.

How do you count the amount of documents in a MongoDB collection within Node?

Quite an odd situation, I've tried numerous solutions and I can't seem to crack it.
A lot of the resources I've found only include mongo console commands, I'm not sure how you'd write this in Node.js.
The reason I'm trying to work this out is that I want each document's 'id' to increment from the last one, so the plan is to find the number of documents in a collection, add one, and then use that number as the 'id' for the new document.
To get the number of documents in a collection, use db.collection.find({}).count(). In more recent driver versions, db.collection.countDocuments({}) is the preferred call for this.
However, what you are trying to do will not work. When there are many parallel accesses to the database, multiple threads can do this at the same time, receive the same count, and thus insert documents with the same id. According to the CAP theorem, a distributed database like MongoDB cannot provide this kind of consistency.
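The race is easy to reproduce with a small simulation. This sketch uses an in-memory array as a stand-in for the collection and interleaves the two steps by hand; the function names are made up for the illustration.

```javascript
// In-memory stand-in for a collection (illustration only).
const collection = [];

// Step 1 of the naive scheme: read the current count.
const readCount = () => collection.length;
// Step 2: insert a document using count + 1 as its id.
const insertWithId = (count, name) => collection.push({ id: count + 1, name });

// Two clients interleave: both read the count before either one writes.
const countSeenByA = readCount();
const countSeenByB = readCount();
insertWithId(countSeenByA, "from A");
insertWithId(countSeenByB, "from B");

const ids = collection.map((d) => d.id);
console.log(ids); // both documents end up with the same id
```

Nothing in the read-then-write sequence ties the two steps together, so any gap between them is enough for a duplicate.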
What you should do instead is rely on MongoDB ObjectIds as unique identifiers for documents. MongoDB generates these automatically for each document when you don't provide your own value for _id. ObjectIds are globally unique (unique enough for any practical purpose), so you won't get any collisions. They also begin with a timestamp, so when you order by _id you get a roughly chronological order (as previously stated, a distributed system cannot provide a strict chronological order).
As a rule of thumb, whenever you would use AUTO_INCREMENT in SQL, you would likely use ObjectIds in MongoDB.
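To see the embedded timestamp, the first 4 bytes (8 hex characters) of an ObjectId can be decoded by hand. This is a small sketch with a sample id chosen for the illustration; the objectIdToDate helper is hypothetical, and the driver's ObjectId type also offers getTimestamp() for the same purpose.

```javascript
// Decode the creation time stored in the first 4 bytes (8 hex chars) of an ObjectId.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // seconds since the Unix epoch
  return new Date(seconds * 1000);
}

const sampleId = "507f1f77bcf86cd799439011"; // sample value, not taken from the question
const created = objectIdToDate(sampleId);
console.log(created.toISOString()); // 2012-10-17T21:13:27.000Z
```

Because the resolution is one second, two ids created within the same second are not ordered by time alone, which is why sorting by _id is only roughly chronological.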

What is the best practice for mongoDB to handle 1-n n-n relationships?

In a relational database, 1-n and n-n relationships mean two or more tables.
But in MongoDB, it is possible to store those things directly in one model, like this:
Article{
content: String,
uid: String,
comments:[Comment]
}
I am getting confused about how to manage those relations. For example, in the article-comments model, should I store all the comments directly in the article model and then read the entire article object out as JSON every time? But what if the comments grow really large? If there are 1,000 comments in an article object, will this strategy make the GET process very slow every time?
I am by no means an expert on this, however I've worked through similar situations before.
From the few demos I've seen, yes, you should store all the comments directly inline. This is going to give you the best performance (unless you're expecting some ridiculous number of comments). This way you have everything in your document.
In the future, if things start going great and you do notice things getting slower, you could do a few things. You could store the latest (insert arbitrary number) comments with a reference to where the other comments are stored, then map-reduce old comments out into a "bucket" to keep loading times quick.
However initially I'd store it in one document.
So would have a model that looked maybe something like this:
Article{
content: String,
uid: String,
comments:[
{"comment":"hi", "user":"jack"},
{"comment":"hi", "user":"jack"}
],
"oldCommentsIdentifier":12345
}
Then only have oldCommentsIdentifier populated if you did move comments out of your comments array. However, I really wouldn't do this for fewer than 1,000 comments, and maybe even more. It would take a bit of testing to find the "sweet spot".
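A minimal in-memory sketch of that bucket idea follows. Note that it moves overflow out at write time rather than with a later map-reduce pass, which is a simplification; the threshold, the bucket store, and the id counter are all hypothetical.

```javascript
const MAX_INLINE = 3;                // arbitrary threshold for the demo
const oldCommentBuckets = new Map(); // bucketId -> array of older comments
let nextBucketId = 12345;            // hypothetical id source

function addComment(article, comment) {
  article.comments.push(comment);
  if (article.comments.length > MAX_INLINE) {
    // Create the bucket lazily, the first time the inline array overflows.
    if (article.oldCommentsIdentifier == null) {
      article.oldCommentsIdentifier = nextBucketId++;
      oldCommentBuckets.set(article.oldCommentsIdentifier, []);
    }
    // Move the oldest comments out so only the latest MAX_INLINE stay inline.
    const overflow = article.comments.splice(0, article.comments.length - MAX_INLINE);
    oldCommentBuckets.get(article.oldCommentsIdentifier).push(...overflow);
  }
}

const article = { content: "post", uid: "a1", comments: [], oldCommentsIdentifier: null };
for (let i = 1; i <= 5; i++) addComment(article, { comment: `c${i}`, user: "jack" });

const inline = article.comments.map((c) => c.comment);
const archived = oldCommentBuckets.get(article.oldCommentsIdentifier).map((c) => c.comment);
console.log(inline, archived);
```

Loading the article then returns only the recent comments, and the bucket is fetched only when a reader asks for older ones.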
I think a large part of the answer depends on how many comments you are expecting. Having a document that contains an array that could grow to an arbitrarily large size is a bad idea, for a couple reasons. First, the $push operator tends to be slow because it often increases the size of the document, forcing it to be moved. Second, there is a maximum BSON size of 16MB, so eventually you will not be able to grow the array any more.
If you expect each article to have a large number of comments, you could create a separate "comments" collection, where each document has an "article_id" field that contains the _id of the article that it is tied to (or the uid, or some other field unique to the article). This would make retrieving all comments for a specific article easy, by querying the "comments" collection for any documents whose "article_id" field matches the article's _id. Indexing this field would make the query very fast.
The link that limelights posted as a comment on your question is also a great reference for general tips about schema design.
But if I solve this problem by linking articles and comments with _id, won't it kind of go back to relational database design, and somehow lose the essence of being NoSQL?
Not really; NoSQL isn't all about embedding models. In fact, embedding should be considered carefully for your scenario.
It is true that the aggregation framework solves quite a few of the problems you can get from embedding objects that you need to use as documents themselves. I define subdocuments that need to be used as documents as:
Documents that need to be paged in the interface
Documents that might exist across multiple root documents
Documents that require advanced sorting within their group
Documents that, when in a group, will exceed the root document's 16MB limit
As I said, the aggregation framework does solve this a little; however, you're still looking at performing a query that, in real time or close to it, would be much like performing the same in SQL on the same number of documents.
This effect is not always desirable.
You can achieve paging (sort of) of subdocuments with normal querying using the $slice operator, but this can suffer from much the same problems as using skip() and limit() over large result sets, which again is undesirable since you cannot fix it so easily with a range query (the aggregation framework would be required again). Even with 1,000 subdocuments I have seen speed problems, and so have other people.
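For reference, a sketch of what $slice paging looks like. The projection form { $slice: [skip, limit] } is MongoDB's; the helper names, the "comments" field, and the page arithmetic are assumptions for the example, and slicePage only mimics the result in plain JavaScript.

```javascript
// Build a projection that returns one "page" of an embedded array.
function commentsPageProjection(page, pageSize) {
  return { comments: { $slice: [page * pageSize, pageSize] } };
}

// Plain-JS equivalent of what that projection returns for a single document.
function slicePage(array, page, pageSize) {
  return array.slice(page * pageSize, page * pageSize + pageSize);
}

const projection = commentsPageProjection(2, 10);
const sample = Array.from({ length: 25 }, (_, i) => i + 1);
const lastPage = slicePage(sample, 2, 10);
console.log(JSON.stringify(projection), lastPage);
```

The projection object would be passed as the projection of a find call. The server still has to load the whole document before slicing it, which is one reason deep pages over large arrays behave like skip() and limit().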
So let's get back to the original question: how to manage the schema.
Now the answer, which you're not going to like, is: it all depends.
Do your comments meet the criteria above for being separate documents? If so, then separating them is probably a good bet.
There is no best way to do this. In MongoDB you should design your collections according to the application that is going to use them.
If your application needs to display comments with article, then I can say it is better to embed these comments in article collection. Otherwise, you will end up with several round trips to your database.
There is one scenario where embedding does not work. As far as I know, document size is limited to 16 MB in MongoDB. This is actually quite large. However, if you think your document size can exceed this limit, it is better to have a separate collection.

Should I implement auto-incrementing in MongoDB?

I'm making the switch to MongoDB from MySQL. A familiar architecture to me for a very basic users table would have auto-incrementing of the uid. See Mongo's own documentation for this use case.
I'm wondering whether this is the best architectural decision. From a UX standpoint, I like having UIDs as external references, for example in shorter URLs: http://example.com/users/12345
Is there a third way? Someone in IRC Freenode's #mongodb suggested creating a range of IDs and caching them. I'm unsure of how to actually implement that, or whether there's another route I can go. I don't necessarily even need the _id itself to be incremented this way. As long as the users all have a unique numerical uid within the document, I would be happy.
I strongly disagree with the author of the selected answer that there is "No auto-increment id in MongoDB and there are good reasons." We don't know the reasons why 10gen didn't encourage the use of auto-incremented IDs; it's speculation. I think 10gen made this choice because it's just easier to ensure the uniqueness of 12-byte IDs in a clustered environment. It's a default solution that suits most newcomers and therefore increases product adoption, which is good for 10gen's business.
Now let me tell everyone about my experience with ObjectIds in commercial environment.
I'm building a social network. We have roughly 6M users and each user has roughly 20 friends.
Now imagine we have a collection which stores the relationships between users (who follows whom). It looks like this:
_id : ObjectId
user_id : ObjectId
followee_id : ObjectId
on which we have a unique composite index {user_id, followee_id}. We can estimate the size of this index at 12*2*6M*20 = 2.88GB of raw key data. That's the index for fast look-up of people I follow. For fast look-up of people who follow me I need a reverse index. That's another 2.88GB.
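The estimate can be worked through for different id widths. This is back-of-the-envelope arithmetic on raw key bytes only, using the user and follow counts from the text; real index sizes also depend on per-entry overhead and compression, which are not modelled here.

```javascript
const users = 6e6;          // roughly 6M users
const followsPerUser = 20;  // roughly 20 follows each
const entries = users * followsPerUser; // 120 million index entries

// Each entry holds two ids (user_id and followee_id).
const keyBytes = (bytesPerId) => bytesPerId * 2 * entries;

const objectIdIndex = keyBytes(12);  // two 12-byte ObjectIds per entry
const int64Index = keyBytes(8);      // two 8-byte integers per entry
const hexStringIndex = keyBytes(24); // ObjectIds stored as 24-character hex strings

const toGB = (bytes) => (bytes / 1e9).toFixed(2);
console.log(toGB(objectIdIndex), toGB(int64Index), toGB(hexStringIndex));
```

Under these assumptions the ObjectId keys come to about 2.9 GB per index, 64-bit integers to about 1.9 GB, and hex strings to about 5.8 GB.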
And this is just the beginning. I have to carry these IDs everywhere. We have activity cluster where we store your News Feed. That's every event you or your friends do. Imagine how much space it takes.
And finally, one of our engineers made an ill-considered decision to store references as strings representing the ObjectId, which doubles their size.
What happens if an index does not fit into RAM? Nothing good, says 10gen:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
That means reads are slow. Lock contention goes up. Writes get slower as well. Seeing lock contention around 80% is no longer a shock to me.
Before you know it, you end up with a 460GB cluster that you have to split into shards and that is quite hard to manipulate.
Facebook uses a 64-bit long as its user id :) There is a reason for that. You can generate sequential IDs:
using 10gen's advice.
using MySQL as storage for counters (if you are concerned about speed, take a look at HandlerSocket).
using an ID-generating service you built, or something like Snowflake by Twitter.
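As a sketch of the last option, here is a Snowflake-style 64-bit id generator: 41 bits of milliseconds since a custom epoch, 10 bits of worker id, and 12 bits of per-millisecond sequence. The epoch, the createIdGenerator name, and the injectable clock are assumptions for the illustration, not Twitter's implementation.

```javascript
const EPOCH_MS = BigInt(Date.UTC(2020, 0, 1)); // arbitrary custom epoch (assumption)

function createIdGenerator(workerId, now = () => Date.now()) {
  if (workerId < 0 || workerId > 1023) throw new Error("workerId must fit in 10 bits");
  let lastMs = -1;
  let sequence = 0;
  return function nextId() {
    let ms = now();
    if (ms < lastMs) throw new Error("clock moved backwards");
    if (ms === lastMs) {
      sequence = (sequence + 1) & 0xfff; // 12-bit sequence within one millisecond
      if (sequence === 0) {
        // Sequence exhausted for this millisecond: wait for the clock to advance.
        while ((ms = now()) <= lastMs) { /* spin */ }
      }
    } else {
      sequence = 0;
    }
    lastMs = ms;
    // Pack timestamp, worker id, and sequence into one integer.
    return ((BigInt(ms) - EPOCH_MS) << 22n) | (BigInt(workerId) << 12n) | BigInt(sequence);
  };
}

// Demo with a fixed clock so the ids are reproducible.
const fixedClock = () => Date.UTC(2020, 0, 1) + 1; // one millisecond after the epoch
const nextId = createIdGenerator(5, fixedClock);
const first = nextId();
const second = nextId();
console.log(first.toString(), second.toString());
```

Ids from one worker increase over time and fit in 8 bytes, but they are only roughly ordered across workers, and each worker needs its own distinct workerId. The fixed clock is for the demo only; with a real clock the spin loop ends as soon as the next millisecond starts.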
So here is my general advice to everyone. Please please make your data as small as possible. When you grow it will save you lots of sleepless nights.
Josh,
No auto-increment id in MongoDB and there are good reasons.
I would say go with ObjectIds which are unique in the cluster.
You can add auto-increment by using a sequence collection and findAndModify to get the next id to use. This will definitely add complexity to your application and may also affect your ability to shard your database.
As long as you can guarantee that your generated ids will be unique, you will be fine.
But the headache will be there.
You can look at this post for more info about this question in the dedicated google group for MongoDB:
http://groups.google.com/group/mongodb-user/browse_thread/thread/f57b712b2aae6f0b/b4315285e689b9a7?lnk=gst&q=projapati#b4315285e689b9a7
Hope this helps.
Thanks
So, there's a fundamental problem with "auto-increment" IDs. When you have 10 different servers (shards in MongoDB), who picks the next ID?
If you want a single set of auto-incrementing IDs, you have to have a single authority for picking those IDs. In MySQL, this is generally pretty easy as you just have one server accepting writes. But big deployments of MongoDB are running sharding which doesn't have this "central authority".
MongoDB uses 12-byte ObjectIds so that each server can create new documents uniquely without relying on a single authority.
So here's the big question: "can you afford to have a single authority"?
If so, then you can use findAndModify to keep track of the "last highest ID" and then you can insert with that.
That's the process described in your link. The obvious weakness here is that you technically have to do two writes for each insert. This may not scale very well, so you probably want to avoid it on data with a high insertion rate. It may work for users; it probably won't work for tracking clicks.
There is nothing like an auto-increment in MongoDB but you may store your own counters in a dedicated collection and $inc the related value of counter as needed. Since $inc is an atomic operation you won't see duplicates.
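The counter approach can be sketched in memory as follows. The Map stands in for a dedicated counters collection, and nextSequence mirrors what a single atomic "increment and return" update with upsert would do on the server; the names and the "userid" counter key are hypothetical, and the in-memory version is not safe across processes.

```javascript
// In-memory stand-in for a "counters" collection (illustration only).
const counters = new Map(); // _id -> { _id, seq }

// Mirrors an atomic "find, $inc seq by 1, return the updated document" with upsert.
function nextSequence(name) {
  const doc = counters.get(name) ?? { _id: name, seq: 0 };
  doc.seq += 1;
  counters.set(name, doc);
  return doc.seq;
}

const users = [];
function insertUser(username) {
  // Two writes per insert: one to bump the counter, one to store the document.
  const user = { uid: nextSequence("userid"), username };
  users.push(user);
  return user;
}

insertUser("ann");
insertUser("bob");
const third = insertUser("cy");
const uids = users.map((u) => u.uid);
console.log(uids, third.uid);
```

Against a real database, the increment and the read of the new value must happen in one atomic operation (for example findAndModify with $inc); reading the counter and writing it back in two separate steps would reintroduce duplicates.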
The default Mongo ObjectId -- the one used in the _id field -- is incrementing.
Mongo uses a timestamp (seconds since the Unix epoch) as the first 4-byte portion of its 4-3-2-3 composition, very similar to (if not exactly the same as) the composition of a Version 1 UUID. That ObjectId is generated at insert time (if no other type of _id is provided by the user/client).
Thus the ObjectId is ordinal in nature; further, the default sort is based on this incrementing timestamp.
One might consider it an updated version of the auto-incrementing (index++) ids used in many dbms.

Resources