How can I cancel MongoDB query from .each callback - node.js

I implemented a little Node.js web server that stores log entries and provides a backend for a web-based log browser. The web interface also provides an "Export to CSV" function and lets the user download the logs in CSV format. My code looks similar to this:
this.log_entries(function(err, collection) {
    collection.find(query)
        .sort({_id: 1})
        .each(function (err, doc) {
            if (doc) {
                WriteLineToCSVFile(doc);
            } else {
                ZipCSVFileAndSendIt();
            }
        });
});
The export operation may take a significant amount of time and disk space if the user doesn't specify the right filters for the query. I need to implement a fail-safe mechanism to prevent this. One important requirement is that the user should be able to abort an ongoing export operation at any point in time. Currently my solution is that I stop writing data to the CSV file; however, the callback passed to .each() still gets called. I could not find any information on how to stop the each loop, so the question is: how can I do this?
UPDATE, THE ANSWER:
Use cursor.nextObject().
For the correct answer see the comments by #dbra below: db.currentOp() and db.killOp() don't work for this case.
The final solution looks like this:
this.log_entries(function(err, collection) {
    var cursor = collection.find(query);
    cursor.sort("_id", 1, function(err, sorted) {
        function exportFinished(aborted) {
            ...
        }
        function processItem(err, doc) {
            if (doc === null) {
                exportFinished(false);
            } else if (abortCurrentExport) {
                exportFinished(true);
            } else {
                var line = formatCSV(doc);
                WriteFile(line);
                process.nextTick(function() {
                    sorted.nextObject(processItem);
                });
            }
        }
        sorted.nextObject(processItem);
    });
});
Note the usage of process.nextTick - without it there will be a stack overflow!

You could search the running query with db.currentOp and then kill it with db.killOp, but it would be a nasty solution.
A better way could be to work with limited subsequent batches; the easiest way would be simple pagination with "limit" and "skip", but it depends on how your collection changes while you read it.
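For illustration only, here is a minimal sketch of that batched approach, assuming the same collection, query, WriteLineToCSVFile, abortCurrentExport and exportFinished names used elsewhere in this question; sorting by _id keeps the pages stable as long as new log entries only ever get higher ids:

var BATCH_SIZE = 1000; // page size is an arbitrary assumption

function exportBatch(skip) {
    if (abortCurrentExport) {                 // the user pressed "abort": stop paging
        return exportFinished(true);
    }
    collection.find(query)
        .sort({_id: 1})
        .skip(skip)
        .limit(BATCH_SIZE)
        .toArray(function (err, docs) {
            if (err) return exportFinished(true);                // treat an error like an abort (assumption)
            if (docs.length === 0) return exportFinished(false); // no more data, export complete
            docs.forEach(WriteLineToCSVFile);
            exportBatch(skip + BATCH_SIZE);                      // fetch the next page
        });
}

exportBatch(0);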

Related

How do I make a large but unknown number of REST http calls in nodejs?

I have an orientdb database. I want to use nodejs with RESTful calls to create a large number of records. I need to get the #rid of each for some later processing.
My pseudo-code is:
for each record
    write.to.db(record)
    when the async of write.to.db() finishes
        process based on #rid
carryon()
I have landed in serious callback hell from this. The version that came closest used tail recursion in the .then function to write the next record to the db. However, I couldn't carry on with the rest of the processing.
A final constraint is that I am behind a corporate proxy and cannot use any other packages without going through the network administrator, so using the native nodejs packages is essential.
Any suggestions?
With a completion callback, the general design pattern for this type of problem makes use of a local function for doing each write:
var records = ....; // array of records to write
var index = 0;

function writeNext(r) {
    write.to.db(r, function(err) {
        if (err) {
            // error handling
        } else {
            ++index;
            if (index < records.length) {
                writeNext(records[index]); // write the next record only after this one finished
            }
        }
    });
}

writeNext(records[0]);
The key here is that you can't use synchronous iterators like .forEach() because they won't iterate one at a time and wait for completion. Instead, you do your own iteration.
If your write function returns a promise, you can use the .reduce() pattern that is common for iterating an array.
var records = ...; // some array of records to write

records.reduce(function(p, r) {
    return p.then(function() {
        return write.to.db(r);
    });
}, Promise.resolve()).then(function() {
    // all done here
}, function(err) {
    // error here
});
This solution chains promises together, waiting for each one to resolve before executing the next save.
It's kinda hard to tell which function would be best for your scenario w/o more detail, but I almost always use asyncjs for this kind of thing.
From what you say, one way to do it would be with async.map:
var recordsToCreate = [...];

function functionThatCallsTheApi(record, cb){
    // do the api call, then call cb(null, rid)
}

async.map(recordsToCreate, functionThatCallsTheApi, function(err, results){
    // here, err will be set if anything failed in any function
    // results will be an array of the rids
});
You can also check out the other functions it offers to enable throttling, which is probably a good idea.
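For example, a minimal sketch using async.mapLimit from the same asyncjs library to cap concurrency; the limit of 10 is an arbitrary assumption:

// at most 10 API calls are in flight at any time
async.mapLimit(recordsToCreate, 10, functionThatCallsTheApi, function(err, results){
    // results is an array of the rids, in the same order as recordsToCreate
});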

Mongo Bulk Updates - which succeeded (matched and modified) and which did not?

In order to improve the performance of many single Mongo document updates in Node, I am considering using Mongo's Bulk operations to update as many as 1000 documents in each iteration.
In this use case, each individual update operation may or may not occur: an update will only occur if the document version has not changed since it was last read by the updater. If a document was not updated, the application needs to retry and/or do other stuff to handle the situation.
Currently the Node code looks like this:
col.update(
    {_id: someid, version: someversion},
    {$set: {stuf: toupdate, version: (someversion + 1)}},
    function(err, res) {
        if (err) {
            console.log('wow, something is seriously wrong!');
            // do something about it...
            return;
        }
        if (!res || !res.result || !res.result.nModified) { // no update
            console.log('oops, seems like someone updated the doc before me');
            // do something about it...
            return;
        }
        // Great! - Document was updated, continue as usual...
    });
Using Mongo's unordered Bulk operations, is there a way to know which of the group of (1000) updates succeeded and which were not performed (in this case due to a wrong version)?
The code I started playing with looks like:
var bulk = col.initializeUnorderedBulkOp();
bulk.find({_id: someid1, version: someversion1}).updateOne(
    {$set: {stuf: toupdate1, version: (someversion1 + 1)}});
bulk.find({_id: someid2, version: someversion2}).updateOne(
    {$set: {stuf: toupdate2, version: (someversion2 + 1)}});
...
bulk.find({_id: someid1000, version: someversion1000}).updateOne(
    {$set: {stuf: toupdate1000, version: (someversion1000 + 1)}});
bulk.execute(function(err, result) {
    if (err) {
        console.log('wow, something is seriously wrong!');
        // do something about it...
        return;
    }
    if (result.nMatched < 1000) { // not all got updated
        console.log('oops, seems like someone updated at least one doc before me');
        // But which of the 1000 got updated OK and which had not!!!!
        return;
    }
    // Great! - All 1000 documents got updated, continue as usual...
});
I was unable to find a Mongo solution for that.
The solution I used was to revert to per-document operations if the bulk operation failed... This gives reasonable performance in most cases.
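For illustration only, a rough sketch of what that per-document fallback can look like, assuming an updates array of {_id, version, stuf} items and a hypothetical handleConflict(u) routine for documents whose version check fails:

var pending = updates.length;

updates.forEach(function(u) {
    col.update(
        {_id: u._id, version: u.version},
        {$set: {stuf: u.stuf, version: u.version + 1}},
        function(err, res) {
            if (err) return console.log('update failed:', err);
            if (!res || !res.result || !res.result.nModified) {
                handleConflict(u); // this particular document was changed by someone else
            }
            if (--pending === 0) {
                // every per-document update has reported back at this point
            }
        });
});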

Nodejs behaviour

I have been working with Node.js + MongoDB, using the Express and Mongoose frameworks, for a few months, and I wanted to ask you guys what is really happening in a situation such as the following:
Model1.find({}, function (err, elems) {
    if (err) {
        console.log('ERROR');
    } else {
        elems.forEach(function (el) {
            Model2.find({[QUERY RELATED WITH FIELDS IN 'el']}, function (err, elems2) {
                if (err) {
                    console.log('ERROR');
                } else {
                    //DO STUFF.
                }
            });
        });
    }
});
My best guess is that there's a main thread looping over elems, and then different threads handling each query over Model2, but I'm not really sure.
Is that correct? And also, is this a good solution? And if not, how would you code in a situation such as this, where you need the information in each of the elements you get from Model1 to get elements from Model2, and perform the actual functionality you are looking for?
I know I could elaborate a more complex query that would get all the elements each of the 'el' in elems would yield, but I'd rather not do that, because in that case I would be worried about the memory expense.
Also, I've been thinking about changing the data model, but I've gone over it and I'm confident it is well thought out, and I don't think that's the best solution for my application.
Thanks!
NodeJS is a single-threaded environment, and it handles blocking operations such as network requests (as in your case) asynchronously. So there is only one thread, and your query results are delivered asynchronously so that nothing blocks during intensive network operations.
In your scenario, if the first query returns a lot of records, say 100,000, you may exhaust your Mongo server inside your loop, because you will fire as many queries as the first query returned results, all at once. This happens because Node won't wait for the results of each query before issuing the next one; it works asynchronously.
So manually throttling requests to network operations is usually good practice. This is not trivial in an asynchronous environment. One way to do it is with a recursive function call: basically, you split your tasks into groups and handle each group as a batch; once you are done with one batch, you start with the next group.
Here is a simple example of how to do it. I have used promises instead of callback functions; Q is a promise library that is very useful for handling promises:
var rows = [...]; // array of many

function handleRecursively(startIndex, batchSize){
    var promises = [];
    for(var i = 0; i < batchSize && startIndex + i < rows.length; i++){
        var theRow = rows[startIndex + i];
        promises.push(doAsynchronousJobWithTheRow(theRow));
    }
    // wait until all tasks in this batch are handled
    Q.all(promises).then(function(){
        startIndex += batchSize;
        if(startIndex < rows.length){ // if there are still tasks to do, continue with the next batch
            handleRecursively(startIndex, batchSize);
        }
    });
}

handleRecursively(0, 1000);
Here is the best solution:
Model1.find({}, function (err, elems) {
    if (err) {
        console.log('ERROR');
    } else {
        loopAllElements(0, elems);
    }
});

function loopAllElements(startIndex, elems){
    if (startIndex == elems.length) {
        return "success";
    } else {
        Model2.find({[QUERY RELATED WITH FIELDS IN elems[startIndex] ]}, function (err, elems2) {
            if (err) {
                console.log('ERROR');
                return "error";
            } else {
                //DO STUFF.
                loopAllElements(startIndex + 1, elems);
            }
        });
    }
}

MongoDB NodeJS driver, how to know when .update() 's are complete

As the code is too large to post in here, I've appended my GitHub repo: https://github.com/DiegoGallegos4/Mongo
I am trying to use the NodeJS driver to update some records fulfilling one criterion, but first I have to find some records fulfilling another criterion. In the update part, the records found and filtered by the find operation are used. That is,
file: weather1.js
MongoClient.connect(some url, function(err, db){
    db.collection(collection_name).find({}, {}, sort criteria).toArray(){
        .... find the data and append to an array
        .... this data inside a for loop
        db.collection(collection_name).update(data[i], {$set...}, callback)
    }
})
That's the structure used to solve the problem. As for when to close the connection: it is when the length of the data array equals the number of callbacks from the update operation. For more details you can refer to the repo.
file: weather.js
In the other approach, instead of toArray, .each is used to iterate over the cursor.
I've been looking for a solution to this for a week now on several forums.
I've read about pooling connections, but I want to know what the conceptual error in my code is. I would appreciate a deep insight into this topic.
The way you pose your question is very misleading. All you want to know is "When is the processing complete so I can close?"
The answer to that is that you need to respect the callbacks and generally only move through the cursor of results once each update is complete.
The simple way, without other dependencies, is to use the stream interface supported by the driver:
var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/data', function(err, db) {
    if (err) throw err;
    var coll = db.collection('weather');
    console.log('connection established');

    var stream = coll.find().sort([['State', 1], ['Temperature', -1]]);

    stream.on('error', function(err) {
        throw err;
    });

    stream.on('end', function() {
        db.close();
    });

    var month_highs = [];
    var state = '';
    var length = 0;

    stream.on('data', function(doc) {
        stream.pause(); // pause processing documents
        length = month_highs.length;
        if (state != doc['State']) {
            month_highs.push(doc['State']);
            //console.log(doc);
        }
        state = doc['State'];
        if (month_highs.length > length) {
            coll.update(doc, {$set: {'month_high': true}}, function(err, updated) {
                if (err) throw err;
                console.log(updated);
                stream.resume(); // resume processing documents
            });
        } else {
            stream.resume();
        }
    });
});
That's just a copy of the code from your repo, refactored to use a stream. So all the important parts are where the word "stream" appears, and most importantly where they are being called.
In a nutshell the "data" event is emitted by each document from the cursor results. First you call .pause() so new documents do not overrun the processing. Then you do your .update() and within it's callback on return you call .resume(), and the flow continues with the next document.
Eventually "end" is emitted when the cursor is depleted, and that is where you call db.close().
That is basic flow control. For other approaches, look at the node async library as a good helper. But do not loop arrays with no async control, and do not use .each() which is DEPRECATED.
You need to signal when the .update() callback is complete in order to follow a new "loop iteration" at any rate. This is the basic approach with no additional dependencies.
P.S. I am a bit suspicious about the general logic of your code, especially testing whether the length of something is greater when you read it without possibly changing that length. But this is all about how to implement "flow control", and not about fixing the logic in your code.
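As an aside, for the async-library route mentioned above, here is a minimal sketch (not the author's code) that walks the same cursor with async.whilst and cursor.nextObject(), assuming a hypothetical processDoc(doc, done) function that wraps the update logic:

var async = require('async');

// inside the connect callback, after coll has been obtained
var cursor = coll.find().sort([['State', 1], ['Temperature', -1]]);
var finished = false;

async.whilst(
    function() { return !finished; },      // keep going until the cursor is exhausted
    function(next) {
        cursor.nextObject(function(err, doc) {
            if (err) return next(err);
            if (doc === null) {            // no more documents
                finished = true;
                return next();
            }
            processDoc(doc, next);         // call next() only once the update is done
        });
    },
    function(err) {
        if (err) console.error(err);
        db.close();                        // every update has completed here
    }
);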

Iterate through Array, update/create Objects asynchronously, when everything is done call callback

I have a problem, but I have no idea how one would go about solving it.
I'm using loopback, but I think I would've faced the same problem in mongodb sooner or later. Let me explain what I am doing:
1. I fetch entries from another REST service, then I prepare the entries for my API response (the entries are not ready yet, because they don't have ids from my database).
2. Before I send the response I check whether each entry already exists in the database (determined by source_id): if it doesn't, I create it; if it does, I use it and update it to the newer version.
3. I send the response with the entries (which now have database ids assigned to them).
This seems okay and easy to implement, but it isn't, as far as my knowledge goes. I will try to explain further in code:
//This will not work since there are many async calls, and fixedResults will be empty at the end
var fixedResults = [];
//results is an array of entries
results.forEach(function(item) {
    Entry.findOne({where: {source_id: item.source_id}}, function(err, res) {
        //Did we find it in the database?
        if (res === null) {
            //Create object, another async call here
            fixedResults.push(newObj);
        } else {
            //Update object, another async call here
            fixedResults.push(updatedObj);
        }
    });
});
callback(null, fixedResults);
Note: I left some of the code out, but I think it's pretty self-explanatory if you read through it.
So I want to iterate through all objects, create or update them in database, then when all are updated/created, use them. How would I do this?
You can use promises. They let you register callbacks that are invoked after an asynchronous operation has completed. Here's an example of chaining together promises: https://coderwall.com/p/ijy61g.
The q library is a good one - https://github.com/kriskowal/q
This question how to use q.js promises to work with multiple asynchronous operations gives a nice code example of how you might build these up.
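For illustration only, a rough sketch of that idea with Q, assuming hypothetical createEntry(item) and updateEntry(res, item) helpers that each return a promise:

var Q = require('q');

var promises = results.map(function(item) {
    // Q.ninvoke turns the node-style Entry.findOne callback into a promise
    return Q.ninvoke(Entry, 'findOne', {where: {source_id: item.source_id}})
        .then(function(res) {
            return res === null ? createEntry(item) : updateEntry(res, item);
        });
});

Q.all(promises).then(function(fixedResults) {
    callback(null, fixedResults);   // every entry has been created/updated by now
}, function(err) {
    callback(err);
});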
This pattern is generically called an 'async map'
var fixedResults = [];
var outstanding = 0;
//results is an array of entries
results.forEach(function(item, i) {
    outstanding++;                      // count this item before its async work starts
    Entry.findOne({where: {source_id: item.source_id}}, function(err, res) {
        //Did we find it in the database?
        if (res === null) {
            //Create object, another async call here
            DoCreateObject(function (err, result) {
                if (err) return callback(err);
                fixedResults[i] = result;
                if (--outstanding === 0) callback(null, fixedResults);
            });
        } else {
            //Update object, another async call here
            DoOtherCall(function (err, result) {
                if (err) return callback(err);
                fixedResults[i] = result;
                if (--outstanding === 0) callback(null, fixedResults);
            });
        }
    });
});
You could use async.map for this. For each element in the array, run the iterator function doing what you want to do to that element, then run its callback with the result (instead of fixedResults.push), triggering the map callback when all are done. Each iteration and database call would then run in parallel.
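For illustration only, a minimal sketch of that async.map approach, reusing the DoCreateObject / DoOtherCall placeholders from the answer above and a hypothetical createOrUpdateEntry iterator:

var async = require('async');

function createOrUpdateEntry(item, cb) {
    Entry.findOne({where: {source_id: item.source_id}}, function(err, res) {
        if (err) return cb(err);
        if (res === null) {
            DoCreateObject(cb);   // create it, then cb(err, createdObj)
        } else {
            DoOtherCall(cb);      // update it, then cb(err, updatedObj)
        }
    });
}

async.map(results, createOrUpdateEntry, function(err, fixedResults) {
    if (err) return callback(err);
    callback(null, fixedResults); // same order as results, all done
});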
Mongo's update has an upsert option.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
It does exactly what you ask for without needing the checks. You can fire all three requests async and just validate that the result comes back as true. No need for additional processing.
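For illustration only, a minimal sketch of the upsert variant, assuming a plain MongoDB collection col rather than the loopback model, and that source_id is the natural key of an entry:

var pending = results.length;

results.forEach(function(item) {
    col.update(
        {source_id: item.source_id},   // match on the natural key
        {$set: item},                  // refresh the entry's fields
        {upsert: true},                // insert the document if it does not exist yet
        function(err) {
            if (err) return callback(err);
            if (--pending === 0) callback(null); // all upserts have completed
        });
});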
