Getting 300k records from MongoDB - Node.js

I need to fetch 300k records from the database with Node.js and pass them to the client (HTML side) for reporting and virtual scrolling. I started with MySQL and it took around 14 seconds (2 tables joined). Hoping to speed this up, I switched to MongoDB, but it takes around 31.2 seconds on a single collection, with no joins, using the native MongoDB npm package.
Node.js MongoDB source code:
const { MongoClient } = require('mongodb');
// ... (client connection setup elided)
const db = client.db(dbName);
const collection = db.collection('test');

// time the full-collection read
const startTime = performance.now();
const findResult = await collection.find({}).toArray();
const endTime = performance.now();

console.log(`Time elapsed: ${endTime - startTime} milliseconds`);
console.log('Found documents 1st element =>', findResult[0]);
I assumed MongoDB would take less time than MySQL, so I guess there is something wrong with my code (I also tried fastify-mongodb, fast-json-stringify, and streaming, but none of them sped things up). What do you suggest? I have read a lot of articles where most people fetch 3M+ records in around 10 seconds, and I know MongoDB is a good choice for big data, even if my data is not that big :)

In general, MongoDB is good at many small single-document reads based on indexes (already in memory). This comes from the fact that the data is allocated on storage in random 32k compressed blocks and is not written sequentially, so when you read the full collection there are many IOPS against your storage, which turns out to be the bottleneck for sequential reads in your case. If you immediately read the same collection a second time, you will find the read is much faster, since the documents are already cached in memory and in the OS filesystem cache.

To improve this performance for bigger volumes, you may shard the collection and distribute the data across multiple shards, so each individual shard needs fewer IOPS when fetching the full collection.

By the way, by "300k data" do you mean 300k documents, or that the total size of the collection is 300k bytes? 31 seconds is a huge amount of time; maybe the documents in this 300k collection are relatively big, or you have an infrastructure issue.
(300,000 documents × 16 MB max BSON document size) / (1024 × 1024) ≈ 4.5 TB theoretical maximum. How big is your collection? What filesystem and storage is it on?
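If it helps, you can answer the size question directly with collStats. A minimal check in mongosh against the test collection from the question (the printed fields come from the standard collStats output):

// mongosh: check document count and sizes of the collection
const stats = db.test.stats();
printjson({
  count: stats.count,             // number of documents
  avgObjSize: stats.avgObjSize,   // average document size in bytes
  size: stats.size,               // total uncompressed data size in bytes
  storageSize: stats.storageSize  // compressed size on disk in bytes
});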
Testing on my machine, it seems there is no issue in mongo itself but in the stages after you fetch the data:
db.collection.find({}).explain("executionStats")
"executionStats" : {
    "nReturned" : 300000,
    "executionTimeMillis" : 174,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 300000,
    ...
}
Some interesting results:
In the mongos shell:
var a = new Date().getTime();
var doc = db.collection.find({});
var b = new Date().getTime();
var doc2 = doc.toArray();
var c = new Date().getTime();
printjson(doc2);
var d = new Date().getTime();
print("find: " + (b - a) + "ms  toArray: " + (c - b) + "ms  printjson(doc2): " + (d - c) + "ms");
doc=find({}): 0ms
doc2=doc.toArray(): 3296ms
printjson(doc2): 67122ms
Some things you may try:
Increase the cursor batch size:
var doc = db.collection.find({}).batchSize(3000000)
(In my tests, higher batch sizes gave 2-3x better execution times.)
Use eachAsync with parallelized processing. (I haven't checked that, but you may try it.)
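A rough sketch of the batch-size idea combined with streaming the cursor (instead of toArray()) using the native Node.js driver; the connection URI, database name, and batch size below are placeholders, not taken from the question:

// Sketch: bigger batch size + iterating the cursor instead of buffering everything with toArray()
const { MongoClient } = require('mongodb');
const { performance } = require('perf_hooks');

async function streamAll() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const collection = client.db('dbName').collection('test');

  const start = performance.now();
  const cursor = collection.find({}).batchSize(10000); // fewer round trips, larger batches

  let count = 0;
  for await (const doc of cursor) {
    // documents arrive batch by batch; serialize/forward each one here
    // instead of holding all 300k in a single array
    count++;
  }

  console.log(`Streamed ${count} docs in ${performance.now() - start} ms`);
  await client.close();
}

streamAll().catch(console.error);

Streaming doesn't reduce the total bytes transferred, but it avoids building one 300k-element array before the client can start rendering.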

Related

MongoDB/Mongoose bulkWrite (upsert) performance issues

I am using MongoDB with Mongoose for our Node.js API, where we need to do a sort of seed for collections whose data source is JSON. I am using Model.bulkWrite, which internally uses MongoDB's bulkWrite (https://docs.mongodb.com/manual/core/bulk-write-operations).
Code below:
await Model.bulkWrite(docs.map(doc => ({
  // each mapped element is a single operation, either:
  updateOne: { ..... }   // update document
  // or
  insertOne: { ....... } // insert document
  // ... n operations in total
})))
This works fine for our current use case with just a few hundred documents, but we are worried about how it will scale and how it will perform when the number of documents increases a lot. For example, will there be any issues when the number of documents is in the tens of thousands?
We just want to confirm whether we are on the right path or whether there is room for improvement.
bulkWrite in MongoDB currently has a maximum limit of 100,000 write operations in a single batch. From the docs:
The number of operations in each group cannot exceed the value of the maxWriteBatchSize of the database. As of MongoDB 3.6, this value is 100,000. This value is shown in the isMaster.maxWriteBatchSize field.
This limit prevents issues with oversized error messages. If a group exceeds this limit, the client driver divides the group into smaller groups with counts less than or equal to the value of the limit. For example, with the maxWriteBatchSize value of 100,000, if the queue consists of 200,000 operations, the driver creates 2 groups, each with 100,000 operations.
So, you won't face any performance issues until you exceed this limit.
For your reference:
Mongodb Bulkwrite: db.collection.bulkWrite()
Write Command Batch Limit Size
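If you ever approach that limit, or just want to bound memory usage while seeding, you could also split the operations yourself before calling bulkWrite. A minimal sketch (the chunk size and the ordered: false choice are assumptions, not requirements):

// Sketch: run bulkWrite in smaller chunks; CHUNK_SIZE is an arbitrary example value
const CHUNK_SIZE = 10000;

async function bulkWriteInChunks(Model, operations) {
  for (let i = 0; i < operations.length; i += CHUNK_SIZE) {
    const chunk = operations.slice(i, i + CHUNK_SIZE);
    // ordered: false lets the server keep processing the remaining
    // operations in a chunk even if one of them fails
    await Model.bulkWrite(chunk, { ordered: false });
  }
}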

Will I hit the maximum writes per second per database if I create documents using Promise.all like this?

I am developing an app, and I want to send a message to all my users' inboxes. The code in my Cloud Functions looks like this:
const query = db.collection(`users`)
  .where("lastActivity", "<=", now)
  .where("lastActivity", ">=", last30Days)

const usersQuerySnapshot = await query.get()

const promises = []
usersQuerySnapshot.docs.forEach(userSnapshot => {
  const user = userSnapshot.data()
  const userID = user.userID

  // set promise to create data in user inbox
  const p1 = db.doc(`users/${userID}/inbox/${notificationID}`).set(notificationData)
  promises.push(p1)
})

return await Promise.all(promises)
There is a limit in Firebase:
Maximum writes per second per database: 10,000 (up to 10 MiB per second)
Say I send a message to 25k users (create a document for 25k users): how long will that await Promise.all(promises) take? I am worried it will complete in under 1 second, and I don't know whether this code will hit that limit or not; I am not sure about its write rate.
If I do hit that limit, how do I spread the writes out over time? Could you please give me a clue? Sorry, I am a newbie.
If you want to throttle the rate at which document writes happen, you should probably not blindly kick off very large batches of writes in a loop. While there is no guarantee how fast they will occur, it's possible that you could exceed the 10K/second/database limit (depending on how good the client's network connection is, and how fast Firestore responds in general). Over a mobile or web client, I doubt that you'll exceed the limit, but on a backend that's in the same region as your Firestore database, who knows - you would have to benchmark it.
Your client code could simply throttle itself with some simple logic that measures its progress.
If you have a lot of documents to write as fast as possible, and you don't want to throttle your client code, consider throttling them as individual items of work using a Cloud Tasks queue. The queue can be configured to manage the rate at which the queue of tasks will be executed. This will drastically increase the amount of work you have to do to implement all these writes, but it should always stay in a safe range.
You could use e.g. p-limit to reduce promise concurrency in the general case, or preferably use batched writes.
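As a hedged sketch of the p-limit approach, reusing the variables from the question's snippet (the concurrency value of 500 is arbitrary, and p-limit is a third-party package you would need to install):

// Sketch: cap how many inbox writes are in flight at any moment
const pLimit = require('p-limit');
const limit = pLimit(500); // at most 500 pending set() calls at a time (arbitrary value)

const promises = usersQuerySnapshot.docs.map(userSnapshot => {
  const userID = userSnapshot.data().userID;
  return limit(() =>
    db.doc(`users/${userID}/inbox/${notificationID}`).set(notificationData)
  );
});

await Promise.all(promises);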

How to perform massive data uploads to Firebase Firestore

I have about ~300 MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to take the Node.js way, but any solution in a language I know (Java, Python) is welcome.
Whenever I perform a batch set using the Node.js firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6 GB!), it also tends to crash with errors that have no clear cause (up to page 4 of a Google search without a meaningful answer).
My code is frankly simple; this is it:
var collection = db.collection("items");
var batch = db.batch();

array.forEach(item => {
  var ref = collection.doc(item.id);
  batch.set(ref, item);
});

batch.commit().then((res) => {
  console.log("YAY", res);
});
I haven't found anywhere whether there is a limit on the number of writes within a limited span of time (I understand 50-60k writes should be easy for a backend the size of Firebase), and I also found that this can consume on the order of 4-6 GB of RAM.
I can confirm that when the errors are thrown, or the RAM usage clogs my laptop (whichever happens first), I am still at less than 1-4% of my daily usage quotas, so that is not the issue.
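One likely factor: a single Firestore WriteBatch has historically been capped at 500 operations, so queuing ~180k set() calls into one batch both blows up memory and can fail outright. A minimal sketch of committing in fixed-size chunks instead (the chunk size and the sequential commits are my assumptions, chosen to keep memory bounded):

// Sketch: commit the data in chunks of at most 500 writes per batch
const BATCH_SIZE = 500; // Firestore's historical per-batch write limit

async function uploadInChunks(db, array) {
  const collection = db.collection("items");
  for (let i = 0; i < array.length; i += BATCH_SIZE) {
    const batch = db.batch();
    array.slice(i, i + BATCH_SIZE).forEach(item => {
      batch.set(collection.doc(item.id), item);
    });
    await batch.commit(); // waiting between commits keeps RAM and the write rate bounded
    console.log(`committed ${Math.min(i + BATCH_SIZE, array.length)} / ${array.length}`);
  }
}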

Alternatives to MongoDB cursor.toArray() in node.js

I am currently using the MongoDB cursor's toArray() function to convert database results into an array:
run = true;
count = 0;
var start = process.hrtime();

db.collection.find({}, {limit: 2000}).toArray(function(err, docs){
  var diff = process.hrtime(start);
  run = false;
  socket.emit('result', {
    result: docs,
    time: diff[0] * 1000 + diff[1] / 1000000,
    ticks: count
  });
  if(err) console.log(err);
});
This operation takes about 7 ms on my computer. If I remove the .toArray() call, the operation takes about 0.15 ms. Of course that won't work, because I need to forward the data, but I'm wondering what the function is doing, since it takes so long? Each document in the database simply consists of 4 numbers.
In the end I'm hoping to run this on a much smaller processor, like a Raspberry Pi, where fetching 500 documents from the database and converting them to an array takes about 230 ms. That seems like a lot to me. Or am I just expecting too much?
Are there any alternative ways to get data from the database without using toArray()?
Another thing that I noticed is that the entire Node application slows remarkably down while getting the database results. I created a simple interval function that should increment the count value every 1 ms:
setInterval(function(){
  if(run) count++;
}, 1);
I would then expect the count value to be almost the same as the time, but for a time of 16 ms on my computer the count value was 3 or 4. On the Raspberry Pi the count value was never incremented. What is taking so much CPU usage? The monitor told me that my computer was using 27% CPU and the Raspberry Pi was using 92% CPU and 11% RAM, when asked to run the database query repeatedly.
I know that was a lot of questions. Any help or explanations are much appreciated. I'm still new to Node and MongoDB.
db.collection.find() returns a cursor, not results, and opening a cursor is pretty fast.
Once you start reading the cursor (using .toArray() or by traversing it using .each() or .next()), the actual documents are being transferred from the database to your client. That operation is taking up most of the time.
I doubt that using .each()/.next() (instead of .toArray(), which under the hood uses one of those two) will improve the performance much, but you could always try (who knows). Since .toArray() reads everything into memory, trying the incremental approach may be worthwhile, although it doesn't sound like your data set is that large.
I really think that MongoDB on Raspberry Pi (esp a Model 1) is not going to work well. If you don't depend on the MongoDB query features too much, you should consider using an alternative data store. Perhaps even an in-memory storage (500 documents times 4 numbers doesn't sound like lots of RAM is required).
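One alternative sketch, staying close to the original snippet: consume the cursor as a stream and push documents as they arrive instead of waiting for one big array (cursor.stream() returns a Node readable stream in the MongoDB driver):

// Sketch: stream the cursor instead of calling .toArray()
var start = process.hrtime();
var docs = [];
var stream = db.collection.find({}, {limit: 2000}).stream();

stream.on('data', function(doc) {
  docs.push(doc); // or emit each doc (or small chunks) over the socket as it arrives
});
stream.on('error', function(err) {
  console.log(err);
});
stream.on('end', function() {
  var diff = process.hrtime(start);
  socket.emit('result', { result: docs, time: diff[0] * 1000 + diff[1] / 1000000 });
});

This won't reduce the total work, but it lets you forward data incrementally and gives the event loop a chance to run between batches.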

MongoDB cursor.toArray() has become the bottleneck

MongoDB cursor.toArray() has become the bottleneck. I need to process 2 million documents and output them to a file. I am processing 10,000 at a time using the skip and limit options, but it didn't quite work, so I was looking for a driver that is more memory efficient. I have also tried processing 10 documents at a time, and it takes forever, so I am not sure whether .each() can solve the problem. Also, does .nextObject make a network call every time we retrieve a single document?
Node.js also has an internal limit of about 1.5 GB on memory, so I am not sure how I can process these documents. I believe this problem can be solved just by using the mongo cursor in the right way at the application level, without doing any database-level aggregation.
There shouldn't be any need to hold all the documents since you can write each document to the file as it is received from the server. If you use a cursor with .each and a batchSize, you can write each document to the file, holding no more than batchSize documents on the client side:
db.collection.find(query, { "batchSize" : 100 }).each(writeToFile)
From the Node.js driver API docs
the cursor will only hold a maximum of batch size elements at any given time if batch size is specified
Using skip and limit to break up results is a bad idea. A query with a skip of n and a limit of m generally has to scan at least n + m documents or index entries. If you paginate with skip and limit, the total work the queries do grows quadratically with the number of pages (total number of results / limit). For example, for 1000 docs and a limit of 100, the total docs scanned would be on the order of
100 + 200 + 300 + 400 + ... + 1000 = 100 × (1 + 2 + 3 + ... + 10) = 5,500 documents scanned to return 1,000 results.
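One common alternative, sketched below: page on an indexed field (here _id) instead of skip, so each query seeks directly to where the previous page ended (the field choice and page size are arbitrary, and writeToFile is the same callback used above):

// Sketch: range-based paging on _id instead of skip/limit
var lastId = null;

function writePage(done) {
  var query = lastId ? { _id: { $gt: lastId } } : {};
  db.collection.find(query).sort({ _id: 1 }).limit(10000).toArray(function(err, docs) {
    if (err) return done(err);
    if (docs.length === 0) return done(null, false); // no more pages
    docs.forEach(writeToFile);                       // write this page out
    lastId = docs[docs.length - 1]._id;              // next page starts after the last _id seen
    done(null, true);
  });
}

Each page is then an index seek plus at most one page of documents, so the total work stays linear in the number of documents instead of quadratic in the number of pages.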
