Alternatives to MongoDB cursor.toArray() in node.js

I am currently using MongoDB cursor's toArray() function to convert the database results into an array:
run = true;
count = 0;
var start = process.hrtime();
db.collection.find({}, {limit: 2000}).toArray(function(err, docs){
    var diff = process.hrtime(start);
    run = false;
    socket.emit('result', {
        result: docs,
        time: diff[0] * 1000 + diff[1] / 1000000,
        ticks: count
    });
    if(err) console.log(err);
});
This operation takes about 7 ms on my computer. If I remove the .toArray() call, the operation takes about 0.15 ms. Of course that won't work, because I need to forward the data, but I'm wondering what the function is doing that takes so long, since each document in the database simply consists of 4 numbers.
In the end I'm hoping to run this on a much smaller processor, like a Raspberry Pi, and there the operation that fetches 500 documents from the database and converts them to an array takes about 230 ms. That seems like a lot to me. Or am I just expecting too much?
Are there any alternative ways to get data from the database without using toArray()?
Another thing I noticed is that the entire Node application slows down remarkably while getting the database results. I created a simple interval function that should increment the count value every 1 ms:
setInterval(function(){
    if(run) count++;
}, 1);
I would then expect the count value to be almost the same as the measured time, but for a time of 16 ms on my computer the count value was only 3 or 4, and on the Raspberry Pi the count was never incremented. What is using so much CPU? The system monitor showed my computer at 27% CPU and the Raspberry Pi at 92% CPU and 11% RAM when running the database query repeatedly.
I know that was a lot of questions. Any help or explanations are much appreciated. I'm still new to Node and MongoDB.

db.collection.find() returns a cursor, not results, and opening a cursor is pretty fast.
Once you start reading the cursor (using .toArray() or by traversing it using .each() or .next()), the actual documents are being transferred from the database to your client. That operation is taking up most of the time.
I doubt that using .each()/.next() (instead of .toArray(), which, under the hood, uses one of those two) will improve performance much, but you could always try (who knows). Since .toArray() reads everything into memory, it may be worthwhile, although it doesn't sound like your data set is that large.
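For what it's worth, here is a minimal sketch of forwarding documents as they arrive instead of buffering them all with .toArray(); it assumes a reasonably recent Node.js driver whose cursors are async iterable, and the collection and event names are only illustrative:
async function forwardDocs(db, socket) {
    var start = process.hrtime();
    // Iterate the cursor directly; the driver still fetches documents in batches under the hood.
    var cursor = db.collection('measurements').find({}).limit(2000);
    for await (const doc of cursor) {
        socket.emit('doc', doc); // forward each document as soon as it arrives
    }
    var diff = process.hrtime(start);
    socket.emit('done', { time: diff[0] * 1000 + diff[1] / 1000000 });
}
Whether this ends up faster than .toArray() depends mostly on the transfer itself, but it avoids holding the whole result set in memory at once.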
I really think that MongoDB on a Raspberry Pi (especially a Model 1) is not going to work well. If you don't depend too much on MongoDB's query features, you should consider an alternative data store, perhaps even an in-memory one (500 documents of 4 numbers each doesn't sound like it requires much RAM).

Related

Getting 300k data from mongodb

I need to fetch 300k records with Node.js from the database and pass them to the client (HTML side) for reporting and virtual scrolling. I started with MySQL and it was taking around 14 seconds (2 tables joined). Since I'd like to speed this up, I switched to MongoDB, but it takes around 31.2 seconds on a single collection, without any joins, using the native MongoDB npm package.
NodeJS MongoDB source code
const { MongoClient } = require('mongodb');
.
.
.
const db = client.db(dbName);
const collection = db.collection('test');
const startTime = performance.now()
const findResult = await collection.find({}).toArray();
const endTime = performance.now();
console.log(`Time elapsed: ${endTime - startTime} milliseconds`);
console.log('Found documents 1st element =>', findResult[0]);
I'd assume MongoDB would take less time than MySQL, so I guess there is something wrong with my code (I also tried fastify-mongodb, fast-json-stringify and streaming, but none of them sped things up). What do you suggest? I've read lots of articles where people fetch 3M+ records in around 10 seconds, and I know MongoDB is a good choice for big data, even if my data isn't that big :)
In general, MongoDB is good at many small single-document reads based on indexes (already in memory). This comes from the fact that data is allocated on storage in random 32k compressed blocks rather than written sequentially, so when you read the full collection there are many IOPS against your storage, and that turns out to be the bottleneck for sequential reads in your case. If you immediately read the same collection a second time, you will find the read is much faster, since the documents are already cached in memory and in the OS filesystem cache. To improve this performance for bigger volumes you could shard the collection and distribute the data across multiple shards, so each individual shard needs fewer IOPS when fetching the full collection. By the way, does "300k data" mean 300k documents, or a total collection size of 300k bytes? 31 seconds is a huge amount of time; maybe the documents in this 300k collection are relatively big, or you have some infrastructure issue?
(300,000 documents × 16 MB max per document) / (1024 × 1024) ≈ 4.5 TB max. How big is your collection?
Filesystem? Storage?
Testing on my machine, it seems there is no issue in Mongo itself but in the stages after you fetch the data:
db.collection.find({}).explain("executionStats")
"executionStats" : {
"nReturned" : 300000,
"executionTimeMillis" : 174,
"totalKeysExamined" : 0,
"totalDocsExamined" : 300000
Some interesting results:
// run in the mongos shell
var a = new Date().getTime();
var doc = db.collection.find({});
var b = new Date().getTime();
var doc2 = doc.toArray();
var c = new Date().getTime();
printjson(doc2);
var d = new Date().getTime();
var ba = b - a, cb = c - b, dc = d - c;
print("find:" + ba + "ms toArray:" + cb + "ms printjson(doc2):" + dc + "ms");
doc=find({}): 0ms
doc2=doc.toArray(): 3296ms
printjson(doc2): 67122ms
Some things you may try:
Increase the cursor batch size:
var doc = db.collection.find({}).batchSize(3000000)
(In my tests, higher batch sizes gave 2-3 times better execution times.)
Use eachAsync parallelized, check here.
(I haven't checked that, but you may try.)
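As a rough illustration of the batch-size idea with the native Node.js driver (the batch size here is a placeholder, and it assumes a driver version whose cursors are async iterable; the best value depends on document size and available memory):
async function fetchAll(db) {
    // Sketch: a larger batch size means fewer getMore round trips to the server.
    const cursor = db.collection('test').find({}).batchSize(10000);
    const docs = [];
    for await (const doc of cursor) {
        docs.push(doc); // or process/serialize each document as it arrives
    }
    return docs;
}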

AWS DocumentDB Performance Issue with Concurrency of Aggregations

I'm working with DocumentDB in AWS, and I've been having trouble when I try to read from the same collection simultaneously with different aggregation queries.
The issue is not that I cannot read from the database, but rather that it takes a lot of time to complete the queries. It doesn't matter if I trigger the queries simultaneously or one after the other.
I'm using a Lambda function with Node.js to run my code, and Mongoose to handle the connection to the database.
Here's a sample code that I put together to illustrate my problem:
function query1() {
    return Collection.aggregate([...]);
}
function query2() {
    return Collection.aggregate([...]);
}
function query3() {
    return Collection.aggregate([...]);
}
It takes the same amount of time if I run them using Promise.all:
Promise.all([ query1(), query2(), query3() ])
as it does if I run them sequentially, waiting for the previous one to finish:
query1().then(result1 => query2().then(result2 => query3()))
If I instead run each query in a separate Lambda execution, each individual query takes significantly less time to finish (between 1 and 2 seconds).
So if they were running in parallel, the whole execution should finish in roughly the time of the slowest query (2 seconds), not the 7 seconds it takes now.
My guess is that the DocumentDB instance is running the queries sequentially no matter how I send them. The collection contains around 19,000 documents with a total size of almost 25 MB.
When I check the instance metrics, CPUUtilization barely goes over 8% and the available RAM only drops by about 20 MB, so I don't think the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?

Mongodb empty find({}) query taking long time to return docs

I have ~19,000 docs in an mLab sandbox, total size ~160 MB. I will have queries that may want a heavily filtered subset of those docs, or a completely unfiltered set. In all cases, the information I need returned is a 100-element array that is an attribute of each document. So in the case of no filter, I expect to get back 19k arrays, which I will then process further. The query I'm doing to test this is db.find({}, { vector: 1 });.
Using .explain(), I can see that this query doesn't take long at all. However, it takes upwards of 50 seconds for my code to see the returned data. At first I thought it was simply a matter of downloading a bunch of data taking a long time, but the total vector data size is only around 20 MB, which shouldn't take that long. (For context, my Node.js server is running locally on my PC while the DB is hosted remotely, so there will be some transfer.)
What is taking this so long? An empty query shouldn't take any time to execute, and projecting the results into only vectors means I should only have to download ~20mb of data, which should be a matter of seconds.
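One thing worth double-checking, purely as an assumption since the driver version isn't stated: in the 3.x+ Node.js driver the second argument to find() is an options object, so a bare { vector: 1 } is silently ignored and the full documents are transferred. A minimal sketch with an explicit projection (the database and collection names are illustrative):
const { MongoClient } = require('mongodb');

async function fetchVectors(uri) {
    const client = await MongoClient.connect(uri);
    try {
        const collection = client.db('mydb').collection('docs'); // hypothetical names
        // Only the vector field crosses the wire; _id is excluded explicitly.
        return await collection
            .find({}, { projection: { vector: 1, _id: 0 } })
            .toArray();
    } finally {
        await client.close();
    }
}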

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that is more time to pull from the server and more time and memory to plot. At larger trends the small movements are noise anyway, so I would be better off retrieving the same number of documents (720) but only every 3rd one, still covering 3 hours with the same resources, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
You can store your data in collections that each hold no more than a week's or a month's worth of data; a new month/week means storing data in a different collection. That way you won't be doing a full collection scan and won't be fetching every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0" and, in the background, run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
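As a small sketch of the "application code decides which collection to query" idea, here is one possible routing helper; it uses an absolute week index rather than the rotating week0 scheme, and the collection naming and fields are purely illustrative:
// Sketch: route reads/writes to a per-week collection derived from the timestamp.
function weeklyCollection(db, date = new Date()) {
    const oneWeekMs = 7 * 24 * 60 * 60 * 1000;
    const weekIndex = Math.floor(date.getTime() / oneWeekMs);
    return db.collection(`temperature_week_${weekIndex}`);
}

// usage: store a reading in the current week's collection
// weeklyCollection(db).insertOne({ epoch: Math.floor(Date.now() / 1000), temp: 21.5 });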
I think the $bucketAuto stage might help you with it.
You can do something like:
db.collection.aggregate([
    {
        $bucketAuto: {
            groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
            buckets: 5       // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
        }
    }
])
Each document in the result of the above query would look something like this:
{
    "_id": {
        "max": 3,
        "min": 1
    },
    "count": 2
}
If you had grouped by temperature, each document would contain the minimum and maximum temperature found in that bucket.
You might have another problem. The docs state not to rely on natural ordering:
"This ordering is an internal implementation feature, and you should not rely on any particular structure within it."
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
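A minimal sketch of that idea, assuming each document stores an epoch field in seconds, samples land exactly on 5-second boundaries, and the field and collection names are illustrative:
// Sketch: keep only every 3rd sample (5 s interval -> one document per 15 s)
// by applying the $mod query operator to a stored epoch-seconds field.
const n = 3;                 // take every nth document
const intervalSeconds = 5;   // sampling interval
const docs = await db.collection('temperatures')
    .find({ epoch: { $mod: [n * intervalSeconds, 0] } })
    .sort({ epoch: -1 })
    .limit(720)
    .toArray();
If the timestamps don't align exactly, the same modulo arithmetic can be applied to a stored sample counter instead of the epoch value.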

How to perform massive data uploads to firebase firestore

I have about ~300 MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to go the Node.js way, but any solution in a language I know (Java, Python) is welcome.
Whenever I perform a batch set using the Node.js firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6 GB!), but it also tends to crash with errors that don't have a clear reason (I got to page 4 of a Google search without a meaningful answer).
My code is frankly simple, this is it:
var collection = db.collection("items");
var batch = db.batch();

array.forEach(item => {
    var ref = collection.doc(item.id);
    batch.set(ref, item);
});

batch.commit().then((res) => {
    console.log("YAY", res);
});
I haven't found anything about whether there is a limit on the number of writes in a given span of time (I understand 50-60k writes should be easy for a backend the size of Firebase), nor why the RAM usage can climb to 4-6 GB.
I can confirm that when the errors are thrown, or the RAM usage clogs my laptop, whichever happens first, I am still at less than 1-4% of my daily usage quotas, so that is not the issue.
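For reference, Firestore batched writes have historically been capped at 500 operations per commit, and building a single batch with ~180k set() calls also keeps every pending write in memory at once. A minimal sketch of committing in chunks, reusing the names from the question's code (the chunk size is an assumption):
// Sketch: commit in chunks instead of one giant batch.
const CHUNK_SIZE = 500; // historical per-batch write limit; adjust if needed

async function uploadInChunks(db, array) {
    const collection = db.collection("items");
    for (let i = 0; i < array.length; i += CHUNK_SIZE) {
        const batch = db.batch();
        for (const item of array.slice(i, i + CHUNK_SIZE)) {
            batch.set(collection.doc(item.id), item);
        }
        await batch.commit(); // wait before building the next batch
        console.log(`committed ${Math.min(i + CHUNK_SIZE, array.length)} of ${array.length}`);
    }
}
Committing sequentially is slower than firing everything at once, but it keeps memory usage flat and makes failures easier to retry.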
