MongoDB cursor.toArray() has become the bottleneck - Node.js

MongoDB cursor.toArray() has become the bottleneck. I need to process 2 million documents and write them out to a file. I am processing 10,000 at a time using the skip and limit options, but that didn't quite work, so I was looking for a driver that is more memory efficient. I have also tried processing 10 documents at a time, and it takes forever, so I am not sure whether .each() can solve the problem. Also, does .nextObject() make a network call every time it retrieves a single document?
Node.js also has an internal memory limit of about 1.5 GB, so I am not sure how I can process these documents. I do believe this problem can be solved just by using the Mongo cursor in the right way at the application level, without doing any database-level aggregation.

There shouldn't be any need to hold all the documents in memory, since you can write each document to the file as it is received from the server. If you use a cursor with .each() and a batchSize, you can write each document to the file while holding no more than batchSize documents on the client side:
db.collection.find(query, { "batchSize" : 100 }).each(writeToFile)
From the Node.js driver API docs:
the cursor will only hold a maximum of batch size elements at any given time if batch size is specified
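A fuller sketch of that approach, writing each document to the file as it arrives. The use of fs, the output path, and the classic callback-style .each() API are assumptions for illustration, not part of the original answer:

var fs = require('fs');
var out = fs.createWriteStream('output.json');   // illustrative output path

var cursor = db.collection.find(query, { "batchSize": 100 });
cursor.each(function (err, doc) {
    if (err) return console.log(err);
    if (doc === null) {
        out.end();   // cursor exhausted: every document has been written
    } else {
        out.write(JSON.stringify(doc) + '\n');   // one JSON document per line
    }
});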
Using skip and limit to break up results is a bad idea. A query with a skip of n and a limit of m generally has to scan at least n + m documents or index entries, so if you paginate with skip and limit, the total work the query does grows quadratically with (total number of results / limit). For example, for 1,000 docs and a limit of 100, the total number of documents scanned is on the order of
100 + 200 + 300 + 400 + ... + 1000 = 100 (1 + 2 + 3 + 4 + ... + 10) = 5,500
just to return 1,000 documents.

Related

How do MongoDB queries work in Node.js?

I have a very simple search query in a Node.js/Express.js app using MongoDB with Mongoose:
await Model.find({}).limit(10);
My question is: how does this work under the hood? Does it first get all the Model's data and then limit it to 10, or does it select only 10 items from the database in the first place? I mean these two possibilities:
1. Find all documents for the Model and return them as a list (array) --> then keep the first 10 items and remove the rest from the list (array).
2. Find only the first 10 items and return them as a list (array).
The difference in performance is significant: with the first approach, if we have a million documents, the query returns a million items (taking perhaps 10-20 seconds) and only then limits them to 10, so we lose that time on every request, and with more users the server would be overwhelmed. With the second approach, even with 100 million items it should always take about the same time.
The limit function specifies the maximum number of documents the cursor will return. In your example, the cursor returns only the first 10 items matching the query (option 2). You can find more information on how cursor.limit() works via the links below:
https://docs.mongodb.com/manual/reference/method/cursor.limit/
http://mongodb.github.io/node-mongodb-native/3.5/api/Cursor.html#limit
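One way to confirm this yourself is to ask the server for its execution stats (a minimal sketch, assuming a Mongoose version that exposes Query#explain()): the plan shows that the server examined roughly as many documents as the limit, not the whole collection.

const stats = await Model.find({}).limit(10).explain('executionStats');
console.log(stats.executionStats.nReturned);          // 10
console.log(stats.executionStats.totalDocsExamined);  // on the order of 10, not millions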

How to divide one MongoDB collection into two or more collections

I'm using MongoDB to scrape a dataset with Node.js. The collection has 0.2 million documents, and Node.js is crashing with a segmentation fault. Is there a way to split/divide the collection into two or more collections so that Node.js doesn't crash?
Thanks!!
Did you try using limit to constrain the number of documents returned? You can take the total document count of the collection and then split it up using limit and skip. For example, if the collection has 200 docs:
First time, limit to 100 docs and skip 0.
Second time, limit to 100 again, but this time skip 100.
This is one way I can think of; there may be other ways.
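A minimal sketch of that batching idea, using the Node.js driver's promise API (the collection handle, batch size, and countDocuments() on a reasonably recent driver are illustrative assumptions):

async function processInBatches(collection) {
    const batchSize = 100;
    const total = await collection.countDocuments();   // 200 in the example above
    for (let skip = 0; skip < total; skip += batchSize) {
        const batch = await collection
            .find({})
            .skip(skip)
            .limit(batchSize)
            .toArray();                                 // at most batchSize docs in memory
        // process `batch` here (write it out, insert into another collection, etc.)
    }
}

Note that, as the first answer above points out, skip/limit paging makes the server rescan documents on every page; for a single pass over a large collection, iterating one cursor is usually cheaper.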

Alternatives to MongoDB cursor.toArray() in Node.js

I am currently using MongoDB cursor's toArray() function to convert the database results into an array:
var run = true;
var count = 0;
var start = process.hrtime();

db.collection.find({}, { limit: 2000 }).toArray(function (err, docs) {
    var diff = process.hrtime(start);
    run = false;
    socket.emit('result', {
        result: docs,
        time: diff[0] * 1000 + diff[1] / 1000000,  // elapsed time in milliseconds
        ticks: count
    });
    if (err) console.log(err);
});
This operation takes about 7 ms on my computer. If I remove the .toArray() call, the operation takes about 0.15 ms. Of course that won't work, because I need to forward the data, but I'm wondering what the function is doing, since it takes so long. Each document in the database simply consists of 4 numbers.
In the end I'm hoping to run this on a much smaller processor, like a Raspberry Pi, where the operation of fetching 500 documents from the database and converting them to an array takes about 230 ms. That seems like a lot to me. Or am I just expecting too much?
Are there any alternative ways to get data from the database without using toArray()?
Another thing I noticed is that the entire Node application slows down remarkably while getting the database results. I created a simple interval function that should increment the count value every millisecond:
setInterval(function () {
    if (run) count++;
}, 1);
I would then expect the count value to be almost the same as the measured time, but for a time of 16 ms on my computer the count value was only 3 or 4. On the Raspberry Pi the count value was never incremented. What is using so much CPU? The monitor showed my computer at 27% CPU, and the Raspberry Pi at 92% CPU and 11% RAM, when running the database query repeatedly.
I know that was a lot of questions. Any help or explanations are much appreciated. I'm still new to Node and MongoDB.
db.collection.find() returns a cursor, not results, and opening a cursor is pretty fast.
Once you start reading the cursor (using .toArray() or by traversing it using .each() or .next()), the actual documents are being transferred from the database to your client. That operation is taking up most of the time.
I doubt that using .each()/.next() (instead of .toArray(), which uses one of those two under the hood) will improve the performance much, but you could always try (who knows). Since .toArray() reads everything into memory, it may be worthwhile, although it doesn't sound like your data set is that large.
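If you do want to avoid .toArray(), one hedged alternative is to traverse the cursor as a stream (cursor.stream() in the classic driver); db, socket, and the query here are the same objects as in the question's code, and whether this is actually faster is exactly what you would have to measure:

var docs = [];
var stream = db.collection.find({}, { limit: 2000 }).stream();

stream.on('data', function (doc) {
    docs.push(doc);   // or forward each document immediately instead of buffering
});
stream.on('error', function (err) {
    console.log(err);
});
stream.on('end', function () {
    socket.emit('result', { result: docs });
});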
I really think that MongoDB on a Raspberry Pi (especially a Model 1) is not going to work well. If you don't depend too much on MongoDB's query features, you should consider using an alternative data store, perhaps even in-memory storage (500 documents of 4 numbers each doesn't sound like it requires much RAM).

Mongoose limiting query to 1000 results when I want more/all (migrating from 2.6.5 to 3.1.2)

I'm migrating my app from Mongoose 2.6.5 to 3.1.2, and I'm running into some unexpected behavior. Namely I notice that query results are automatically being limited to 1000 records, while pretty much everything else works the same. In my code (below) I set a value maxIvDataPoints that limits the number of data points returned (and ultimately sent to the client browser), and that value was set elsewhere to 1500. I use a count query to determine the total number of potential results, and then a subsequent mod to limit the actual query results using the count and the value of maxIvDataPoints to determine the value of the mod. I'm running node 0.8.4 and mongo 2.0.4, writing server-side code in coffeescript.
Prior to installing mongoose 3.1.x the code was working as I had wanted, returning just under 1500 data points each time. After installing 3.1.2 I'm getting exactly 1000 data points returned each time (assuming there are more than 1000 data points in the specified range). The results are truncated, so that data points 1001 to ~1500 are the ones no longer being returned.
It seems there may be some setting somewhere that governs this behavior, but I can't find anything in the docs, on here, or in the Google group. I'm still a relative n00b so I may have missed something obvious.
DataManager::ivDataQueryStream = (testId, minTime, maxTime, callback) ->
  # If minTime and maxTime have been provided, set a flag to limit time extents of query
  unless isNaN(minTime)
    timeLimits = true
  # Load the max number of IV data points to be displayed from CONFIG
  maxIvDataPoints = CONFIG.maxIvDataPoints
  # Construct a count query to determine the number of IV data points in range
  ivCountQuery = TestDataPoint.count({})
  ivCountQuery.where "testId", testId
  if timeLimits
    ivCountQuery.gt "testTime", minTime
    ivCountQuery.lt "testTime", maxTime
  ivCountQuery.exec (err, count) ->
    ivDisplayQuery = TestDataPoint.find({})
    ivDisplayQuery.where "testId", testId
    if timeLimits
      ivDisplayQuery.gt "testTime", minTime
      ivDisplayQuery.lt "testTime", maxTime
    # If the data set is too large, use modulo to sample, keeping the total data series
    # for display below maxIvDataPoints
    if count > maxIvDataPoints
      dataMod = Math.ceil count/maxIvDataPoints
      ivDisplayQuery.mod "dataPoint", dataMod, 1
    ivDisplayQuery.sort "dataPoint" #, 1 <-- new sort syntax for Mongoose 3.x
    callback ivDisplayQuery.stream()
You're getting tripped up by a pair of related factors:
Mongoose's default query batchSize changed to 1000 in 3.1.2.
MongoDB has a known issue where a query that requires an in-memory sort limits the number of documents returned to the query's batch size.
So your options are either to add a compound index on TestDataPoint that lets MongoDB use the index to sort by dataPoint for this type of query, or to increase the batch size to at least the total count of documents you're expecting.
Wow that's awful. I'll publish a fix to mongoose soon removing the batchSize default (was helpful when streaming large result sets). Thanks for the pointer.
UPDATE: 3.2.1 and 2.9.1 have been released with the fix (removed batchSize default).
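For reference, a hedged sketch of the two workarounds from the answer above, written as plain Mongoose/JavaScript rather than the question's CoffeeScript; the exact index keys and calls are illustrative assumptions:

// Option 1: a compound index so the sort on dataPoint can be satisfied by the index
// instead of an in-memory sort (older drivers call this ensureIndex).
TestDataPoint.collection.createIndex({ testId: 1, dataPoint: 1 }, function (err) {
    if (err) console.log(err);
});

// Option 2: raise the cursor's batch size above the expected number of results.
ivDisplayQuery.batchSize(count);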

Count(*) on a SimpleDB table of millions of entries

How long should it take to get a response for the statement SELECT count(*) FROM db_name on a SimpleDB table of millions of entries? (My table currently has >16M entries.)
Shouldn't there be some sort of "pagination" using the next_token parameter if the operation takes too long? (It's been hanging for minutes now!)
There's something wrong. No count query will take more than 5 seconds, because after 5 seconds it cuts off and gives you a next token.
If the count request takes more than five seconds, Amazon SimpleDB returns the number of items that it could count and a next token to return additional results. The client is responsible for accumulating the partial counts.
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/CountingDataSelect.html
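A hedged sketch of accumulating those partial counts in Node.js, assuming the aws-sdk (v2) SimpleDB client; the region is illustrative and db_name is the domain from the question:

var AWS = require('aws-sdk');
var sdb = new AWS.SimpleDB({ region: 'us-east-1' });

function countAll(nextToken, total, done) {
    var params = { SelectExpression: 'select count(*) from db_name' };
    if (nextToken) params.NextToken = nextToken;
    sdb.select(params, function (err, data) {
        if (err) return done(err);
        // each partial response carries a single "Domain" item whose "Count"
        // attribute holds the number of items counted in this chunk
        total += parseInt(data.Items[0].Attributes[0].Value, 10);
        if (data.NextToken) return countAll(data.NextToken, total, done);
        done(null, total);
    });
}

countAll(null, 0, function (err, total) {
    console.log(err || 'total count: ' + total);
});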
SimpleDB responses are typically under 200ms, not counting data transfer speed (from Amazon's server to yours, which is less than 50ms if you're on EC2).
However, the total size of a SimpleDB response cannot exceed 2,500 rows or 1MB, whichever is smaller.
See "Limit" here
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/index.html?UsingSelect.html
