Mongodb toArray() performance - node.js

I have a collection 'matches' with 727,000 documents. Each document has 6 fields, no arrays, just simple integers and ObjectIds. I am querying the collection as follows:
matches.find({
    $or: [
        { homeTeamId: getObjectId(teamId) },
        { awayTeamId: getObjectId(teamId) }
    ],
    season: season,
    seasonDate: {
        '$gt': dayMin,
        '$lt': dayMax
    }
}).sort({
    seasonDate: 1
}).toArray(function (e, res) {
    callback(res);
});
The query returns only around 7-8 documents.
The query itself takes about ~100ms, which I think is quite reasonable, but the main problem is that calling toArray() adds about ~600ms!
I am running the server on my laptop (Intel Core i5, 6GB RAM), and I can't believe it adds 600ms for 7-8 documents.
I tried the mongodb-native driver, then switched to mongoskin, and I am still getting the same slow results.
Any suggestions?

The toArray() method iterates through every element of the cursor and loads them all into memory, which is a costly operation. You can add an index to improve the query performance, and/or avoid toArray() by iterating over the cursor yourself.
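For example, a rough sketch in native-driver style (the field names come from the query above; the exact index choice and the error-first callback are assumptions, so verify with explain()):

// One index per $or branch so each clause can be resolved without a collection scan:
matches.createIndex({ homeTeamId: 1, season: 1, seasonDate: 1 });
matches.createIndex({ awayTeamId: 1, season: 1, seasonDate: 1 });

// Iterate the cursor instead of buffering everything with toArray():
var results = [];
matches.find({
    $or: [
        { homeTeamId: getObjectId(teamId) },
        { awayTeamId: getObjectId(teamId) }
    ],
    season: season,
    seasonDate: { '$gt': dayMin, '$lt': dayMax }
}).sort({ seasonDate: 1 }).each(function (err, doc) {
    if (err) return callback(err);
    if (doc === null) return callback(null, results); // cursor exhausted
    results.push(doc);
});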
Regards,
Moacy

Related

Which of these mongodb queries is more efficient? for loop vs deleteMany

For a MongoDB collection with 30+ million documents, is it better to:
1: Query multiple times like this:
for (const _id of array) {
    await Schema.deleteOne({ _id: _id })
}
2: One query using $in:
await Schema.deleteMany({ _id: { $in: array } })
Since a for loop needs multiple round trips to the database, the better way is to use deleteMany.
The MongoDB docs for deleteOne say:
To delete multiple documents, see db.collection.deleteMany()
Also, for small databases a for loop may perform acceptably, but with a large amount of data it is not a good idea because you are calling your DB multiple times. On top of that, each await stops execution until that round trip completes, which makes it even less efficient.
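As a rough sketch (model and variable names are taken from the question; the chunk size is an assumption), the single deleteMany call can also be chunked if the id array is very large, so each command stays a reasonable size:

// One round trip instead of array.length round trips:
await Schema.deleteMany({ _id: { $in: array } })

// For very large id arrays, delete in chunks (10,000 is an assumed batch size):
const chunkSize = 10000
for (let i = 0; i < array.length; i += chunkSize) {
    await Schema.deleteMany({ _id: { $in: array.slice(i, i + chunkSize) } })
}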

Avoid Aggregate 16MB Limit

I have a collection of about 1M documents. Each document has an internalNumber property, and I need to get all internalNumbers in my node.js code.
Previously I was using
db.docs.distinct("internalNumber")
or
collection.distinct('internalNumber', {}, {},(err, result) => { /* ... */ })
in Node.
But with the growth of the collection I started to get the error: distinct is too big, 16m cap.
Now I want to use aggregation. It consumes a lot of memory and is slow, but that is OK since I only need to do it once, at script startup. I've tried the following in the Robo 3T GUI tool:
db.docs.aggregate([{$group: {_id: '$internalNumber'} }]);
It works, and I wanted to use it in node.js code the following way:
collection.aggregate([{$group: {_id: '$internalNumber'} }],
    (err, docs) => { /* ... */ });
But in Node I get an error: "MongoError: aggregation result exceeds maximum document size (16MB) at Function.MongoError.create".
Please help to overcome that limit.
The problem is that the native driver differs from how the shell method works by default: the shell actually returns a "cursor" object, whereas the native driver needs that option "explicitly".
Without a "cursor", .aggregate() returns the whole result as a single BSON document wrapping an array of documents, so we turn it into a cursor to avoid the limitation:
let cursor = collection.aggregate(
    [{ "$group": { "_id": "$internalNumber" } }],
    { "cursor": { "batchSize": 500 } }
);

cursor.toArray((err, docs) => {
    // work with results
});
Then you can use regular methods like .toArray() to turn the results into a JavaScript array, which on the "client" side does not share the same limitation, or other methods for iterating the "cursor".
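If even the final array of values is too large to buffer comfortably, you can also consume the cursor document by document instead of calling .toArray(). A minimal sketch with the same pipeline (the variable names are mine; forEach(iterator, callback) is the driver's cursor iterator in recent versions):

let cursor = collection.aggregate(
    [{ "$group": { "_id": "$internalNumber" } }],
    { "cursor": { "batchSize": 500 } }
);

let internalNumbers = [];
cursor.forEach(
    doc => internalNumbers.push(doc._id),   // called once per grouped value
    err => {
        if (err) throw err;
        // the cursor is exhausted here; internalNumbers holds every distinct value
    }
);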
For Casbah users:
val pipeline = ...
collection.aggregate(pipeline, AggregationOptions(batchSize = 500, outputMode = AggregationOptions.CURSOR))

MongoDB fulltext search: Overflow sort stage buffered data usage

I am trying to implement mongo text search in my node(express.js) application.
Here are my codes:
Collection.find({$text: {$search: searchString}},
    {score: {$meta: "textScore"}})
    .sort({score: {$meta: 'textScore'}})
    .exec(function(err, docs) {
        // Process docs
    });
I am getting the following error when the text search is performed on a large dataset:
MongoError: Executor error: Overflow sort stage buffered data usage of 33554558 bytes exceeds internal limit of 33554432 bytes
I am aware that MongoDB can sort a maximum of 32MB of data in memory, and that this error can normally be avoided by adding an index on the field the collection is sorted by. But in my case I am sorting by textScore, and I am not sure whether it is possible to create an index for that field. If not, is there any workaround?
NOTE: I am aware there are similar questions on SO, but most of them do not use textScore as the sort criterion, and therefore my question is different.
You can use aggregate to circumvent the limit.
Collection.aggregate([
    { $match: { $text: { $search: searchString } } },
    { $sort: { score: { $meta: "textScore" } } }
])
The $sort stage has a 100 MB limit. If you need more, you can use allowDiskUse, which writes to temp files while the sort takes place. To do that, just add allowDiskUse: true to the aggregate options.
If your result is greater than 16MB (i.e. MongoDB's document size limit), you need to request a cursor to iterate through your data. Just add .cursor() before your exec; there is a detailed example at http://mongoosejs.com/docs/api.html#aggregate_Aggregate-cursor
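As a minimal sketch on a Mongoose Aggregate object like the one above (allowDiskUse is a standard aggregation option; treat the exact chaining as an assumption for your Mongoose version):

Collection.aggregate([
    { $match: { $text: { $search: searchString } } },
    { $sort: { score: { $meta: "textScore" } } }
])
.allowDiskUse(true)   // lets the $sort stage spill to temp files past the in-memory limit
.exec(function (err, docs) {
    // process docs; if the result itself can exceed 16MB, add .cursor() as described above
});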

How to bulk save an array of objects in MongoDB?

I have looked for a long time and not found an answer. The Node.JS MongoDB driver docs say you can do bulk inserts using insert(docs), which is good and works well.
I now have a collection with over 4,000,000 items, and I need to add a new field to all of them. Usually MongoDB can only write 1 transaction per 100ms, which means I would be waiting for days to update all those items. How can I do a "bulk save/update" to update them all at once? update() and save() seem to only work on a single object.
Pseudo-code:
var stuffToSave = [];
db.collection('blah').find({}).toArray(function (err, stuff) {
    stuff.forEach(function (item) {
        item.newField = someComplexCalculationInvolvingALookup();
        stuffToSave.push(item);
    });
    db.saveButNotSuperSlow(stuffToSave); // <-- the bulk operation I am looking for
});
Sure, I'll need to put some limit on it, like doing 10,000 at once rather than all 4 million at once, but I think you get the point.
MongoDB allows you to update many documents that match a specific query using a single db.collection.update(query, update, options) call; see the documentation. For example:
db.blah.update(
    { },
    {
        $set: { newField: someComplexValue }
    },
    {
        multi: true
    }
)
The multi option allows the command to update all documents that match the query criteria. Note that the exact same thing applies when using the Node.JS driver, see that documentation.
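For instance, with the Node.JS driver the same blanket update looks roughly like this (updateMany is the modern equivalent of multi: true; note this only works when newField gets the same value for every document):

db.collection('blah').updateMany(
    { },                                        // match every document
    { $set: { newField: someComplexValue } },
    function (err, result) {
        // result.modifiedCount reports how many documents were changed
    }
);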
If you're performing many different updates on a collection, you can wrap them all in a Bulk() builder to avoid some of the overhead of sending multiple updates to the database.
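Since your newField depends on a per-document calculation, a single multi-update will not do; here is a rough sketch of the batched approach instead, shown with bulkWrite (the newer equivalent of the Bulk() builder). The batch size of 1,000 is an assumption, and error handling for the writes is omitted for brevity:

var batch = [];
db.collection('blah').find({}).each(function (err, item) {
    if (err) throw err;
    if (item === null) {
        // cursor exhausted: flush whatever is left
        if (batch.length) db.collection('blah').bulkWrite(batch);
        return;
    }
    batch.push({
        updateOne: {
            filter: { _id: item._id },
            update: { $set: { newField: someComplexCalculationInvolvingALookup() } }
        }
    });
    if (batch.length === 1000) {                // flush in chunks of 1,000
        db.collection('blah').bulkWrite(batch);
        batch = [];
    }
});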

How to do a massive random update with MongoDB / NodeJS

I have a mongoDB collection with more than 1,000,000 documents, and I would like to update each document one by one with dedicated information (each doc gets information coming from another collection).
Currently I'm using a cursor that fetches all the data from the collection, and I update each record through the async module of Node.js.
Fetch all docs:
inst.db.collection(association.collection, function(err, collection) {
    collection.find({}, {}, function(err, cursor) {
        cursor.toArray(function(err, items){
            ......
        });
    });
});
Update each doc:
items.forEach(function(item) {
    // *** do some stuff with item, add field etc.
    tasks.push(function(nextTask) {
        inst.db.collection(association.collection, function(err, collection) {
            if (err) callback(err, null);
            collection.save(item, nextTask);
        });
    });
});
call the "save" task in parallel
async.parallel(tasks, function(err, results) {
callback(err, results);
});
How would you do this type of operation in a more efficient way? I mean, how can I avoid the initial "find" that loads a cursor? Is there a way to do a doc-by-doc operation, knowing that all docs should be updated?
Thanks for your support.
Your question inspired me to create a Gist to do some performance testing of different approaches to your problem.
Here are the results running on a small EC2 instance with MongoDB at localhost. The test scenario is to uniquely operate on every document of a 100,000 element collection.
108.661 seconds -- Uses find().toArray to pull in all the items at once then replaces the documents with individual "save" calls.
99.645 seconds -- Uses find().toArray to pull in all the items at once then updates the documents with individual "update" calls.
74.553 seconds -- Iterates on the cursor (find().each) with batchSize = 10, then uses individual update calls.
58.673 seconds -- Iterates on the cursor (find().each) with batchSize = 10000, then uses individual update calls.
4.727 seconds -- Iterates on the cursor with batchSize = 10000, and does inserts into a new collection 10000 items at a time.
Though not included above, I also did a test with MapReduce used as a server-side filter, which ran at about 19 seconds. I would have liked to similarly use "aggregate" as a server-side filter, but it doesn't yet have an option to output to a collection.
The bottom line is that if you can get away with it, the fastest option is to pull items from an initial collection via a cursor, update them locally, and insert them into a new collection in big chunks. You can then swap the new collection in for the old one.
If you need to keep the database active, then the best option is to use a cursor with a big batchSize and update the documents in place. The "save" call is slower than "update" because it needs to replace the whole document, and probably needs to reindex it as well.
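A minimal sketch of the fastest variant above, assuming you can write into a fresh collection and swap it in afterwards (the collection names, the computeNewField transform and the 10,000 batch size are placeholders):

var buffer = [];
db.collection('source').find({}).batchSize(10000).each(function (err, doc) {
    if (err) throw err;
    if (doc === null) {
        // cursor exhausted: write the remaining documents, then swap collections
        if (buffer.length) db.collection('source_new').insertMany(buffer);
        return;
    }
    doc.newField = computeNewField(doc);   // placeholder for the per-document work
    buffer.push(doc);
    if (buffer.length === 10000) {
        db.collection('source_new').insertMany(buffer);
        buffer = [];
    }
});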
