Getting slow response to node from mongodb - node.js

I have 20 million documents in my collection and was running into slow response times, so I wrote a script to split the collection into multiple collections. Now I have one collection of 1.3 million documents. I took a random query to compare response times: for the same query, the bigger collection responds in around 19000 ms and the smaller collection in around 13000 ms. I am using the mongoose library on Node to connect to MongoDB.
I am new to MongoDB, so please tell me in which direction I should look to resolve this issue.

MongoDB cannot perform JOINs in the relational sense, so there is little gain from combining collections. With MongoDB, and most other NoSQL databases, it is easy to use separate databases in a single instance. So if your data set is large, you can group documents into different collections by range keys, etc., and further break those collections down into different databases to gain performance. I'm a little sceptical about the gain in performance, but in theory it should help.
For instance, you could have two databases serving different slices of the entire data set in a single instance:
db["collectionSet1"].AppCollection.find({application_status: {$in: ["pending", "approved", "rejected"]}}).sort({_id: -1})
db["collectionSet2"].OrigCollection.find({origination_status: {$in: ["originated", "approved", "rejected"]}}).sort({_id: -1})
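A minimal sketch of routing a query to one of the two databases above. The range key (`createdYear`) and the routing rule are illustrative assumptions, not part of the original answer; mongoose's `createConnection`/`useDb` shown in the comment is the usual way to address multiple databases on one instance.

```javascript
// Hypothetical helper that picks one of the two databases sketched
// above based on a range key. The cutoff year is an assumption.
function pickDatabase(createdYear) {
  // Older records go to the first set, recent ones to the second.
  return createdYear < 2018 ? 'collectionSet1' : 'collectionSet2';
}

// Usage with mongoose (assumed installed and connected):
//   const conn = mongoose.createConnection(uri);
//   const db = conn.useDb(pickDatabase(doc.createdYear));
//   db.collection('AppCollection').find({ ... });

console.log(pickDatabase(2015)); // → collectionSet1
console.log(pickDatabase(2021)); // → collectionSet2
```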

Related

Large Data Processing and Management in MongoDB using NodeJS

I am trying to do CRUD operations on MongoDB with very large data, around 20GB, and there can be multiple such versions of the data. Can anyone guide me on how to handle such large data for CRUD operations while maintaining the previous versions of the data in MongoDB?
I am using Node.js as the backend, and I can also use another database if required.
MongoDB is a reliable database; I am using it to process 10-11 billion records every single day. Node.js should also be fine, as long as you are handling the files as streams of data.
Things you should do to optimize:
Indexing - this will be the biggest part. If you want faster queries, look into indexing in MongoDB: the fields your queries filter and sort on need to be covered by indexes, or you are going to have a tough time with query performance.
Sharding and replication - these will help you organise the data and increase query speed. Replication lets you separate your reads and writes (there are cons to replication; you can read about them in the MongoDB documentation).
These are the main things you need to consider. There are many more, but this should get you started... ;) If you need any help, please do let me know.
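A small sketch of the indexing advice above. The collection and field names (`orders`, `status`, `created_at`) are illustrative assumptions; the point is that the index key should match the shape of the query.

```javascript
// Compound index matching a filter-then-sort query shape.
// Field names here are hypothetical examples.
const indexSpec = { status: 1, created_at: -1 };

// With the official Node driver (assumed installed) this would be:
//   const client = await MongoClient.connect(uri);
//   await client.db('mydb').collection('orders').createIndex(indexSpec);
//
// The query it supports uses the same leading field(s):
//   db.orders.find({ status: 'pending' }).sort({ created_at: -1 })

console.log(indexSpec);
```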

MongoDB unnormalized data and the change stream

I have an application in which most of the collections are heavily read rather than written, so I denormalized the data in them, and now I need to handle keeping the denormalized copies in sync. For some collections I used jobs to sync the data, but that is not good enough, as in some cases I need the data to be synced in real time.
for example:
Let's say I have an orders collection and a users collection.
Orders store the user email (for search):
{
  _id: ObjectId(),
  user_email: 'test@email.email'
  ...
}
Now whenever I change the user's email in users, I want to change it in orders as well.
I found that MongoDB has change streams, which look like a pretty awesome feature. I have played with them a bit and they give me the results I need to update my other collections. My questions: does anyone use change streams in production? Can I trust the stream to always deliver the updates so I can sync the other collections? How does it affect DB performance if I have many streams open? Also, I use the Node.js MongoDB driver - does that have any effect?
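For reference, a minimal sketch of the change-stream setup the question describes. The pipeline filters for updates that touch the `email` field; the `user_id` link between orders and users in the usage comment is an assumption (the question's orders only show `user_email`), so treat the whole thing as illustrative.

```javascript
// Watch pipeline: only react to updates of the users.email field.
// Field and collection names follow the question; the rest is assumed.
const pipeline = [
  { $match: {
      operationType: 'update',
      'updateDescription.updatedFields.email': { $exists: true }
  } }
];

// With the Node.js driver (assumed connected):
//   const stream = db.collection('users').watch(pipeline);
//   stream.on('change', async (change) => {
//     const { documentKey, updateDescription } = change;
//     // Assumes orders also store user_id for this lookup:
//     await db.collection('orders').updateMany(
//       { user_id: documentKey._id },
//       { $set: { user_email: updateDescription.updatedFields.email } }
//     );
//   });

console.log(JSON.stringify(pipeline));
```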
I've not worked with change streams yet, but these cases are very common and can be solved by building a more normalized schema.
First normal form says, among other things, "don't repeat data" - so you would save the email in the users collection only.
The orders collection won't have the email field but will have a user_id for joining with the users collection, using the $lookup aggregation stage to join collections:
https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
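A sketch of the $lookup join suggested above. Collection and field names follow the answer (`orders.user_id` referencing `users._id`); the `$unwind`/`$project` stages are an illustrative addition.

```javascript
// Aggregation pipeline joining orders to users on user_id.
const pipeline = [
  { $lookup: {
      from: 'users',          // collection to join with
      localField: 'user_id',  // field in orders
      foreignField: '_id',    // field in users
      as: 'user'              // output array field
  } },
  { $unwind: '$user' },       // one joined user per order
  { $project: { _id: 1, user_email: '$user.email' } }
];

// In the mongo shell or Node driver (assumed connected):
//   db.orders.aggregate(pipeline)

console.log(JSON.stringify(pipeline, null, 2));
```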

is it good to use different collections in a database in mongodb

We are going to do a project using Node.js and MongoDB. We are designing the database schema and are not sure whether we should use different collections or a single collection to store the data, because each has its own pros and cons.
If we use a single collection, then whenever the database is invoked the whole collection will be loaded into memory, which eats into RAM. If we use different collections, then to retrieve data we need to write different queries. With one collection retrieval will be easy, and with different collections the application will become faster. We are confused about whether to use a single collection or multiple collections. Please guide me on which one is better.
Usually you use different collections for different things. For example, when you have users and articles in the system, you usually create a "users" collection for users and an "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there, but it would mean you would have to add a type field and use it for searches and storage of data. You can use a single collection in the database, but it would make usage more complicated. Of course it would let you load the entire collection at once, but whether or not that matters for the performance of your application is something that would have to be profiled and tested to give you the performance impact for your particular use case.
Usually, developers create different collections for different things. For post management, for example, people create a "post" collection and save posts there, and likewise for users and so on.
Using different collections for different purposes is good practice.
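The one-collection-per-entity layout described above can be sketched like this. The schema shapes are illustrative assumptions; with mongoose (assumed installed), each shape would back its own model and therefore its own collection.

```javascript
// Hypothetical per-entity schema shapes, one collection each.
const userShape = { name: String, email: String };
const postShape = { title: String, body: String, author_id: Object };

// With mongoose (assumed installed and connected):
//   const User = mongoose.model('User', new mongoose.Schema(userShape));
//   const Post = mongoose.model('Post', new mongoose.Schema(postShape));
// mongoose stores these in separate 'users' and 'posts' collections.

console.log(Object.keys(userShape), Object.keys(postShape));
```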
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, queryable collection of your data.
So having a smaller collection size is not really a pro, and I am not sure where this theory comes from; it isn't true in SQL and it isn't in MongoDB. The performance of sharding, if done well, should be comparable to the performance of querying a single small collection of data (with a small overhead). If it isn't, then you have set up your sharding wrong.
MongoDB is not great at scaling vertically; as @Sushant quoted, the ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also affect the ns size, hence why it says:
By default MongoDB has a limit of approximately 24,000 namespaces per database. Each namespace is 628 bytes, the .ns file is 16MB by default.
Each collection counts as a namespace, as does each index. Thus if every collection had one index, we can create up to 12,000 collections. The --nssize parameter allows you to increase this limit (see below).
Be aware that there is a certain minimum overhead per collection - a few KB. Further, any index will require at least 8KB of data space as the b-tree page size is 8KB. Certain operations can get slow if there are a lot of collections and the metadata gets paged out.
So you won't be able to handle it gracefully if your users exceed the namespace limit, and performance won't hold up as your userbase grows.
UPDATE
For MongoDB 3.0 and above using the WiredTiger storage engine, this limit no longer applies.
Yes, personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. Collections are used by a lot of developers to cut up their db into, for example, posts, comments, and users.
Sorry about my grammar and lack of explanation - I'm on my phone.

Mongoose insertMany limit

Mongoose 4.4 now has an insertMany function which lets you validate an array of documents and insert them if valid all with one operation, rather than one for each document:
var arr = [{ name: 'Star Wars' }, { name: 'The Empire Strikes Back' }];
Movies.insertMany(arr, function(error, docs) {});
If I have a very large array, should I batch these? Or is there no limit on the size of the array?
For example, I want to create a new document for every Movie, and I have 10,000 movies.
Based on personal experience, I'd recommend batches of 100-200: that gives you a good blend of performance without putting strain on your system.
An insertMany group of operations can have at most 1000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or fewer. For example, if the queue consists of 2000 operations, MongoDB creates 2 groups, each with 1000 operations.
The sizes and grouping mechanics are internal performance details and are subject to change in future versions.
Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list since with an ordered list, each operation must wait for the previous operation to finish.
Mongo 3.6 update:
The limit for insertMany() has increased in Mongo 3.6 from 1000 to 100,000.
Meaning that now, groups above 100,000 operations will be divided into smaller groups accordingly.
For example: a queue that has 200,000 operations will be split and Mongo will create 2 groups of 100,000 each.
The method takes that array and inserts the documents through MongoDB's insertMany, so the size of the array itself really depends on how much your machine can handle.
But note another point, which is not a limitation but worth keeping in mind: by default MongoDB handles a batch of 1000 operations at a time internally and splits anything beyond that.
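The batching advice above can be sketched like this. The batch size of 200 follows the first answer's recommendation; `Movies` is the question's mongoose model (assumed defined and connected elsewhere).

```javascript
// Split a large array into fixed-size batches for insertMany.
function chunk(arr, size) {
  const out = [];
  for (let i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
}

// Usage with mongoose (assumed connected):
//   for (const batch of chunk(movies, 200)) {
//     await Movies.insertMany(batch, { ordered: false });
//   }

console.log(chunk([1, 2, 3, 4, 5], 2)); // three batches: [1,2], [3,4], [5]
```

Passing `ordered: false` lets MongoDB continue inserting the remaining documents if one insert fails, which is usually what you want for bulk loads.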

Does the size of a document affect performance of a find() query?

Can the size of a MongoDB document affect the performance of a find() query?
I'm running the following query on a collection, in the MongoDB shell
r.find({_id:ObjectId("5552966b380c2dbc29472755")})
The entire document is 3MB. When I run this query the operation takes about 8 seconds. The document has a "salaries" property which makes up the bulk of the document's size (about 2.9MB). So when I omit the salaries property and run the following query, it takes less than a second:
r.find({_id:ObjectId("5552966b380c2dbc29472755")},{salaries:0})
I only notice this performance difference when I run the find() query alone. When I run a find().count() query there is no difference. It appears that performance degrades only when I want to fetch the entire document.
The collection is never updated (it never changes in size), there is an index on _id, and I've run repairDatabase() on the database. I've searched around the web but can't find a satisfactory answer to why there is a performance difference. Any insight and recommendations would be appreciated. Thanks.
I think the experiments you've just run answer your own question.
Mongo indexes the _id field by default, so document size shouldn't affect how long it takes to locate the document, but if it's 3MB then you will likely notice a difference in actually downloading that data. I imagine that's why it takes less time if you omit some of the fields.
To get a better sense of how long your query actually takes to run, try explain() in the shell:
r.find({
  _id: ObjectId("5552966b380c2dbc29472755")
}).explain("executionStats")
If salaries is the 3MB culprit and it's structured data, then to speed things up you could try (A) splitting it up into separate Mongo documents, or (B) querying based on sub-properties of that document; in both cases you can build indexes to keep those queries fast.
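Option (B) above might look like this. The sub-field name (`salaries.year`) is an illustrative assumption about the structure of the salaries data; the idea is to index into the large field and project out only what you need instead of fetching all 3MB.

```javascript
// Index a sub-property of the large salaries array, and project
// only the matching element. Field names are hypothetical.
const indexSpec = { 'salaries.year': 1 };
const projection = { 'salaries.$': 1 }; // positional: only the matched element

// In the mongo shell this would be:
//   r.createIndex({ 'salaries.year': 1 })
//   r.find({ 'salaries.year': 2015 }, { 'salaries.$': 1 })

console.log(indexSpec, projection);
```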
