How to do a massive random update with MongoDB / NodeJS - node.js

I have a MongoDB collection with more than 1,000,000 documents, and I would like to update each document one by one with dedicated information (each doc has information coming from another collection).
Currently I'm using a cursor that fetches all the data from the collection, and I update each record through the async module of Node.js.
Fetch all docs:
inst.db.collection(association.collection, function(err, collection) {
  collection.find({}, {}, function(err, cursor) {
    cursor.toArray(function(err, items){
      ......
    });
  });
});
Update each doc:
items.forEach(function(item) {
  // *** do some stuff with item, add field etc.
  tasks.push(function(nextTask) {
    inst.db.collection(association.collection, function(err, collection) {
      if (err) callback(err, null);
      collection.save(item, nextTask);
    });
  });
});
call the "save" task in parallel
async.parallel(tasks, function(err, results) {
  callback(err, results);
});
How would you do this type of operation in a more efficient way? I mean, how can I avoid the initial "find" that loads a cursor? Is there a way to operate doc by doc, knowing that all docs should be updated?
Thanks for your support.

Your question inspired me to create a Gist to do some performance testing of different approaches to your problem.
Here are the results running on a small EC2 instance with MongoDB at localhost. The test scenario is to uniquely operate on every document of a 100,000-element collection.
108.661 seconds -- Uses find().toArray to pull in all the items at once then replaces the documents with individual "save" calls.
99.645 seconds -- Uses find().toArray to pull in all the items at once then updates the documents with individual "update" calls.
74.553 seconds -- Iterates on the cursor (find().each) with batchSize = 10, then uses individual update calls.
58.673 seconds -- Iterates on the cursor (find().each) with batchSize = 10000, then uses individual update calls.
4.727 seconds -- Iterates on the cursor with batchSize = 10000, and does inserts into a new collection 10000 items at a time.
Though not included, I also did a test with MapReduce used as a server side filter which ran at about 19 seconds. I would have liked to have similarly used "aggregate" as a server side filter, but it doesn't yet have an option to output to a collection.
The bottom line answer is that if you can get away with it, the fastest option is to pull items from an initial collection via a cursor, update them locally and insert them into a new collection in big chunks. Then you can swap in the new collection for the old.
If you need to keep the database active, then the best option is to use a cursor with a big batchSize, and update the documents in place. The "save" call is slower than "update" because it needs to replace the whole document, and probably needs to reindex it as well.
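For reference, here is a rough sketch of that fastest approach with the Node.js driver: iterate the cursor with a large batchSize and insert the modified documents into a new collection in chunks. The addFields() transform and the items_new destination collection are placeholders of mine, not something from the original question.
inst.db.collection(association.collection, function(err, collection) {
  if (err) return callback(err);
  var target = inst.db.collection('items_new'); // placeholder destination collection
  var buffer = [];

  // A large batchSize makes the driver fetch documents from the server in big chunks.
  collection.find({}, { batchSize: 10000 }).each(function(err, doc) {
    if (err) return callback(err);

    if (doc === null) {
      // End of the cursor: flush whatever is left.
      if (buffer.length > 0) return target.insert(buffer, callback);
      return callback(null);
    }

    buffer.push(addFields(doc)); // addFields() stands in for the per-document enrichment

    if (buffer.length === 10000) {
      var chunk = buffer;
      buffer = [];
      // Insert the accumulated chunk into the new collection in one call.
      target.insert(chunk, function(err) {
        if (err) console.log(err);
      });
    }
  });
});
Once the new collection is complete, you can rename it over the old one.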

Related

MongoDB Cursor - Efficient way to count AND enumerate documents (in Node)

I would like to know how many documents are going to be processed by my MongoDB cursor before I iterate over the cursor. Do I have to run the query twice, or can I determine the size of a cursor and then iterate over the cursor without running the query twice?
myCount = myCollection.find({complex query}).count(); // will get my count
myCursor = myCollection.find({complex query}); // will get my cursor for iteration
but these two commands will run the query twice, which is presumably inefficient?
What is the most efficient way to determine the number of documents that I will iterate over without running the query twice? My docs are quite large.
Does MongoDB 'know' how many docs are going to be returned before it starts returning them via the cursor? Where can I read up on Mongo internals for this info?
I am running this within node, and I'm using the standard MongoDB node driver. So, the solution needs to deal with the usual node callback mechanisms etc, etc.
Thanks
EDIT
I have amended the original question to state I'm using the node driver
EDIT2
yogesh's answer is correct; I just needed to figure out the syntax for Node. I've shown the working code below in case it helps anyone.
db.collection('collectionName', function(err, collection) {
  function processItem(err, item) {
    if (item === null) {
      db.close();
      callback(null, {info: "yippee"});
      return;
    }
    cursor.nextObject(processItem); // continue looping
  }
  var cursor = collection.find({foo: 1});
  cursor.count(function(err, count) {
    console.log('resultCursor size=' + count);
    cursor.nextObject(processItem); // Start of the loop
  });
});
Check cursor.next: it finds the next document in the cursor returned by the db.collection.find() method. So in your case you should write it as follows (tested in the mongo shell):
var collectionData = myCollection.find({complex query})
collectionData.count() // return matching documents count
collectionData.next() // return next documents
Or you can also check hasNext, which returns true if the cursor returned by the db.collection.find() query can iterate further to return more documents.
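If you are on the 2.x Node driver, the cursor itself also exposes hasNext() and next(), so the shell pattern above translates roughly to the sketch below (the callback wiring is my assumption, not part of the original answer):
var cursor = collection.find({ foo: 1 });

cursor.count(function(err, count) {
  if (err) return callback(err);
  console.log('resultCursor size=' + count);

  // Walk the same cursor without issuing the query a second time.
  (function iterate() {
    cursor.hasNext(function(err, hasMore) {
      if (err) return callback(err);
      if (!hasMore) return callback(null, { info: 'done' });
      cursor.next(function(err, item) {
        if (err) return callback(err);
        // ... process item here ...
        iterate();
      });
    });
  })();
});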

How do I see output of SQL query in node-sqlite3?

I read all the documentation and this seemingly simple operation seems completely ignored throughout the entire README.
Currently, I am trying to run a SELECT query and console.log the results, but it is simply returning a database object. How do I view the results from my query in Node console?
exports.runDB = function() {
  db.serialize(function() {
    console.log(db.run('SELECT * FROM archive'));
  });
  db.close();
}
run does not have retrieval capabilities. You need to use all, each, or get
According to the documentation for all:
Note that it first retrieves all result rows and stores them in memory. For queries that have potentially large result sets, use the Database#each function to retrieve all rows or Database#prepare followed by multiple Statement#get calls to retrieve a previously unknown amount of rows.
As an illustration:
db.all('SELECT url, rowid FROM archive', function(err, table) {
  console.log(table);
});
That will return all entries in the archive table as an array of objects.
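If the archive table can get large, the Database#each route mentioned in the quoted docs avoids buffering the whole result set; a minimal sketch:
db.each('SELECT url, rowid FROM archive', function(err, row) {
  // Called once per row, so the full result set is never held in memory.
  if (err) return console.error(err);
  console.log(row.rowid + ': ' + row.url);
}, function(err, rowCount) {
  // Optional completion callback, fired after the last row.
  console.log('Fetched ' + rowCount + ' rows');
});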

Comparing ObjectIDs in Mongoose Query

I'm trying to update every document in an expanding Mongo database.
My plan is to start with the youngest, most recently created document and work back from there, one-by-one querying the next oldest document.
The problem is that my Mongoose query is skipping documents that were created in the same second. I thought greater than/less than operators would work on _ids generated in the same second. But though there are 150 documents in the database right now, this function gets from the youngest to the oldest document in only 8 loops.
Here's my Mongoose query within the recursive node loop:
function loopThroughDatabase(i, doc, sizeOfDatabase) {
  if (i < sizeOfDatabase) {
    (function() {
      myMongooseCollection.model(false)
        .find()
        .where("_id")
        .lt(doc._id)
        .sort("id")
        .limit(1)
        .exec(function(err, docs) {
          if (err) {
            console.log(err);
          }
          else {
            updateDocAndSaveToDatabase(docs[0]);
            loopThroughDatabase(i + 1, docs[0], sizeOfDatabase); // recursion here
          }
        });
    })();
  }
}
loopThroughDatabase(1, youngestDoc, sizeOfDatabase);
Error found.
In the Mongoose query, I was sorting by "id" rather than "_id"
If you read the MongoDB documentation, you will see that ObjectId ordering depends on the process in which the item was created (http://docs.mongodb.org/manual/reference/glossary/#term-objectid). Therefore, to guarantee the ordering you need, add a date stamp to the records and use that instead of the _id.
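For illustration, a sketch of that date-stamp approach, assuming the schema carries a createdAt field (the field name is an assumption, e.g. set via Mongoose timestamps; it is not in the original question):
myMongooseCollection.model(false)
  .find()
  .where("createdAt")
  .lt(doc.createdAt)
  .sort("-createdAt") // newest first, so limit(1) yields the next-oldest document
  .limit(1)
  .exec(function(err, docs) {
    if (err) return console.log(err);
    updateDocAndSaveToDatabase(docs[0]);
  });
Documents sharing the exact same createdAt value would still be skipped by lt, so a tie-breaker (for example on _id) may be needed.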

Inserting records without failing on duplicate

I'm inserting a lot of documents in bulk with the latest node.js native driver (2.0).
My collection has an index on the URL field, and I'm bound to get duplicates out of the thousands of lines I insert. Is there a way for MongoDB to not crash when it encounters a duplicate?
Right now I'm batching records 1000 at a time and using insertMany. I've tried various things, including adding {continueOnError=true}. I tried inserting my records one by one, but it's just too slow; I have thousands of workers in a queue and can't really afford the delay.
Collection definition:
self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});
Insert:
MongoProcessor.prototype._batchInsert = function(coll, items) {
  var self = this;
  if (items.length > 0) {
    var batch = [];
    var l = items.length;
    for (var i = 0; i < 999; i++) {
      if (i < l) {
        batch.push(items.shift());
      }
      if (i === 998) {
        coll.insertMany(batch, {continueOnError: true}, function(err, res) {
          if (err) console.log(err);
          if (res) console.log('Inserted products: ' + res.insertedCount + ' / ' + batch.length);
          self._batchInsert(coll, items);
        });
      }
    }
  } else {
    self._terminate();
  }
};
I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky. My workers are clustered and I have no idea what would happen if they try to insert records while another process is reindexing... Does anyone have a better idea?
Edit:
I forgot to mention one thing. The items I insert have a 'processed' field which is set to 'false'. However the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be untouched by upsert?
The 2.6 Bulk API is what you're looking for, which will require MongoDB 2.6+* and node driver 1.4+.
There are 2 types of bulk operations:
Ordered bulk operations. These operations execute all the operations in order and error out on the first write error.
Unordered bulk operations. These operations execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
So in your case Unordered is what you want. The previous link provides an example:
MongoClient.connect("mongodb://localhost:27017/test", function(err, db) {
  // Get the collection
  var col = db.collection('batch_write_ordered_ops');
  // Initialize the Unordered Batch
  var batch = col.initializeUnorderedBulkOp();
  // Add some operations to be executed
  batch.insert({a: 1});
  batch.find({a: 1}).updateOne({$set: {b: 1}});
  batch.find({a: 2}).upsert().updateOne({$set: {b: 2}});
  batch.insert({a: 3});
  batch.find({a: 3}).remove({a: 3});
  // Execute the operations
  batch.execute(function(err, result) {
    console.dir(err);
    console.dir(result);
    db.close();
  });
});
*The docs do state that: "for older servers than 2.6 the API will downconvert the operations. However it’s not possible to downconvert 100% so there might be slight edge cases where it cannot correctly report the right numbers."
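As for the 'processed' field from the edit: one option (an assumption on my part, not something required by the Bulk API) is to upsert with $setOnInsert, which only applies when the upsert actually inserts a new document, so an existing processed: true is left alone. A sketch, where price stands in for whatever per-item fields you do want to refresh:
var bulk = self.prods.initializeUnorderedBulkOp();

items.forEach(function(item) {
  bulk.find({ url: item.url }).upsert().updateOne({
    // Fields you want refreshed on existing documents go in $set.
    $set: { price: item.price },
    // $setOnInsert only applies when a new document is inserted,
    // so existing documents keep their current 'processed' value.
    $setOnInsert: { processed: false }
  });
});

bulk.execute(function(err, result) {
  if (err) console.log(err);
  if (result) console.log('Upserted: ' + result.nUpserted + ', matched: ' + result.nMatched);
});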

How to bulk save an array of objects in MongoDB?

I have looked for a long time and not found an answer. The Node.JS MongoDB driver docs say you can do bulk inserts using insert(docs), which is good and works well.
I now have a collection with over 4,000,000 items, and I need to add a new field to all of them. Usually MongoDB can only write 1 transaction per 100ms, which means I would be waiting for days to update all those items. How can I do a "bulk save/update" to update them all at once? update() and save() seem to only work on a single object.
pseudo-code:
var stuffToSave = [];
db.collection('blah').find({}, function(err, stuff) {
  stuff.toArray().forEach(function(item) {
    item.newField = someComplexCalculationInvolvingALookup();
    stuffToSave.push(item);
  });
});
db.saveButNotSuperSlow(stuffToSave);
Sure, I'll need to put some limit on it, like doing 10,000 at a time so as not to try all 4 million at once, but I think you get the point.
MongoDB allows you to update many documents that match a specific query using a single db.collection.update(query, update, options) call; see the documentation. For example:
db.blah.update(
  { },
  {
    $set: { newField: someComplexValue }
  },
  {
    multi: true
  }
)
The multi option allows the command to update all documents that match the query criteria. Note that the exact same thing applies when using the Node.JS driver, see that documentation.
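Through the driver that might look roughly like this (the callback handling is my assumption):
db.collection('blah').update(
  { },
  { $set: { newField: someComplexValue } },
  { multi: true },
  function(err, result) {
    if (err) return console.log(err);
    console.log('Update result:', result);
  }
);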
If you're performing many different updates on a collection, you can wrap them all in a Bulk() builder to avoid some of the overhead of sending multiple updates to the database.
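When newField has to be computed per document (as in the someComplexCalculationInvolvingALookup() case), here is a sketch of the Bulk() route, flushing every 10,000 operations as the question suggests (the batching details are assumptions, not a prescribed implementation):
db.collection('blah', function(err, collection) {
  var bulk = collection.initializeUnorderedBulkOp();
  var pending = 0;

  collection.find({}).each(function(err, item) {
    if (err) return console.log(err);

    if (item === null) {
      // End of the cursor: flush whatever is left in the current batch.
      if (pending > 0) bulk.execute(function(err) { if (err) console.log(err); });
      return;
    }

    bulk.find({ _id: item._id }).updateOne({
      $set: { newField: someComplexCalculationInvolvingALookup() }
    });
    pending++;

    if (pending === 10000) {
      bulk.execute(function(err) { if (err) console.log(err); });
      bulk = collection.initializeUnorderedBulkOp(); // start a fresh batch
      pending = 0;
    }
  });
});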
