While coding in Node.js, I have run into many situations where it is hard to implement somewhat elaborate logic mixed with database queries (I/O).
Consider an example written in Python. We need to iterate over an array of values, query the database for each value, and then, based on the results, compute the average.
def foo():
    a = [1, 2, 3, 4, 5]
    result = 0
    for i in a:
        record = find_from_db(i)  # I/O operation
        if not record:
            raise Exception('No record exists for %d' % i)
        result += record.value
    return result / len(a)
The same task in Node.js
function foo(callback) {
  var a = [1, 2, 3, 4, 5];
  var result = 0;
  var itemsProcessed = 0;
  var error;

  function final() {
    if (itemsProcessed === a.length) {
      if (error) {
        callback(error);
      } else {
        callback(null, result / a.length);
      }
    }
  }

  a.forEach(function(i) {
    // I/O operation
    findFromDb(i, function(err, record) {
      itemsProcessed++;
      if (err) {
        error = err;
      } else if (!record) {
        error = new Error('No record exists for ' + i);
      } else {
        result += record.value;
      }
      final();
    });
  });
}
You can see that such code is much harder to write and read, and it is more prone to errors.
My questions:
Is there a way to make the above Node.js code cleaner?
Imagine more sophisticated logic. For example, once we obtain a record from the db, we might need to do another db query based on some conditions. In Node.js that becomes a nightmare. What are common patterns for dealing with such tasks?
Based on your experience, does the performance gain justify the productivity loss when you code with Node.js?
Is there another asynchronous I/O framework or language that is easier to work with?
To answer your questions:
There are libraries such as async which provide a variety of solutions for common scenarios when working with asynchronous tasks. For "callback hell" concerns, there are many ways to avoid that as well, including (but not limited to) naming your functions and pulling them out, modularizing your code, and using promises.
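For example, the original loop could be written with async.map, which collects each record's value and calls a single final callback. This is only a minimal sketch, assuming findFromDb(i, callback) follows the usual (err, record) convention:
var async = require('async');

function foo(callback) {
  var a = [1, 2, 3, 4, 5];
  async.map(a, function(i, done) {
    findFromDb(i, function(err, record) {
      if (err) return done(err);
      if (!record) return done(new Error('No record exists for ' + i));
      done(null, record.value);
    });
  }, function(err, values) {
    if (err) return callback(err);
    // values is an array of record values, in the same order as a
    var sum = values.reduce(function(s, v) { return s + v; }, 0);
    callback(null, sum / a.length);
  });
}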
More or less what you currently have is a fairly common pattern: having counter and function index variables with an array of functions to call. Again, async can help here because it reduces this kind of boilerplate that you will probably find yourself repeating often. async currently doesn't have methods that really allow for skipping individual tasks, but you could easily do this yourself if you are writing the boilerplate (just increment the function index variable by 2 for example).
From my own experience, if you properly design your javascript code with asynchrony in mind and use tools like async, you will find it easier to develop with node. Writing asynchronous code in node is typically going to be more complicated than writing synchronous code (although less so with generators, fibers, etc. as compared to callbacks/promises).
I personally think that deciding on a language based upon that single aspect is not worthwhile. You have to consider much more than just the design of the language: the size of the community, the availability of third-party libraries, performance, technical support options, ease of debugging, and so on.
Just write your code more compactly:
// parallel version
function foo (cb) {
  var items = [ 1, 2, 3, 4, 5 ];
  var pending = items.length;
  var result = 0;
  items.forEach(function (item) {
    findFromDb(item, function (err, record) {
      if (err) return cb(err);
      if (!record) return cb(new Error('No record for: ' + item));
      result += record.value / items.length;
      if (--pending === 0) cb(null, result);
    });
  });
}
That clocks in at 13 source lines of code compared to the 9 sloc for python that you posted. However, unlike the python that you posted, this code runs all the jobs in parallel.
To do the same thing in series, a trick I usually do is a next() function defined inline that invokes itself and pops a job off of an array:
// sequential version
function foo (cb) {
  var items = [ 1, 2, 3, 4, 5 ];
  var len = items.length;
  var result = 0;
  (function next () {
    if (items.length === 0) return cb(null, result);
    var item = items.shift();
    findFromDb(item, function (err, record) {
      if (err) return cb(err);
      if (!record) return cb(new Error('No record for: ' + item));
      result += record.value / len;
      next();
    });
  })();
}
This time, 15 lines. The nice thing is that you can easily control whether the actions happen in parallel, sequentially, or somewhere in between. That is not so easy in a language like python where everything is synchronous and you have to use lots of work-arounds like threads or evented libraries to get things back to being asynchronous. Try implementing a parallel version of what you have in python! It would almost certainly be longer than the node version.
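For the "somewhere in between" case, here is one possible sketch that caps the number of in-flight queries (the limit of 2 is just an illustrative choice, and findFromDb is assumed to have the same (item, callback) signature as above):
function foo (cb) {
  var items = [ 1, 2, 3, 4, 5 ];
  var limit = 2;                // at most this many queries in flight at once
  var result = 0;
  var pending = items.length;
  var next = 0;
  var failed = false;

  function launch () {
    if (failed || next >= items.length) return;
    var item = items[next++];
    findFromDb(item, function (err, record) {
      if (failed) return;
      if (err) { failed = true; return cb(err); }
      if (!record) { failed = true; return cb(new Error('No record for: ' + item)); }
      result += record.value / items.length;
      if (--pending === 0) return cb(null, result);
      launch();                 // start the next queued item
    });
  }

  // prime the pump with `limit` parallel calls
  for (var i = 0; i < limit && i < items.length; i++) launch();
}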
As for the promise/async route: it's not actually all that hard or bad to use ordinary functions for these relatively simple kinds of tasks. In the future (or in node 0.11+ with --harmony) you can use generators and a library like co, but that feature isn't widely deployed yet.
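For the curious, here is a rough sketch of what that generator route could look like with co, assuming a recent co that accepts yielded promises, a runtime with native Promise, and a hypothetical promise-returning wrapper around findFromDb:
var co = require('co');

// hypothetical promise wrapper around the callback-style findFromDb
function findFromDbAsync(i) {
  return new Promise(function (resolve, reject) {
    findFromDb(i, function (err, record) {
      if (err) return reject(err);
      resolve(record);
    });
  });
}

co(function* () {
  var a = [1, 2, 3, 4, 5];
  var result = 0;
  for (var i = 0; i < a.length; i++) {
    // looks synchronous, but yields to the event loop on each query
    var record = yield findFromDbAsync(a[i]);
    if (!record) throw new Error('No record exists for ' + a[i]);
    result += record.value;
  }
  return result / a.length;
}).then(function (avg) {
  console.log('average:', avg);
}, function (err) {
  console.error(err);
});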
Everyone here seems to be suggesting async, which is a great library. But to give another suggestion, you should take a look at Promises, which are a new built-in being introduced to the language (and which currently have several very good polyfills). They allow you to write asynchronous code in a way that looks much more structured. For example, take a look at this code:
var items = [ 1, 2, 3, 4 ];

var processItem = function(item, callback) {
  // do something async ...
};

var values = [ ];

items.forEach(function(item) {
  processItem(item, function(err, value) {
    if (err) {
      // something went wrong
    }
    values.push(value);
    // all of the items have been processed, move on
    if (values.length === items.length) {
      doSomethingWithValues(values, function(err) {
        if (err) {
          // something went wrong
        }
        // and we're done
      });
    }
  });
});

function doSomethingWithValues(values, callback) {
  // do something async ...
}
Using promises, it would be written something like this:
var items = [ 1, 2, 3, 4 ];

var processItem = function(item) {
  return new Promise(function(resolve, reject) {
    // do something async ...
  });
};

var doSomethingWithValues = function(values) {
  return new Promise(function(resolve, reject) {
    // do something async ...
  });
};

// Promise.all returns a new promise that will resolve when all of the promises passed to it have resolved
Promise.all(items.map(processItem))
  .then(doSomethingWithValues)
  .then(function() {
    // and we're done
  })
  .catch(function(err) {
    // something went wrong
  });
The second version is much cleaner and simpler, and that barely even scratches the surface of promises' real power. And, like I said, Promises are in ES6 as a new language built-in, so (eventually) you won't even need to load a library; it will just be available.
Don't use anonymous (unnamed) functions; they make the code ugly and they make debugging much harder. Always name your functions and define them outside the enclosing scope rather than inline (see the sketch after these points).
That is a real issue with Node.js (it is called callback hell or the pyramid of doom). You can solve it by using promises, or by using async.js, which has many functions for handling different situations (waterfall, parallel, series, auto, ...).
The performance gain is absolutely a good thing, and the productivity loss is not that big once you start to master it; the Node.js community is also great.
Check out async.js and Q.
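To illustrate the first point about named functions, the original foo might be flattened like this. This is only a sketch, with the same assumed findFromDb(i, callback) signature as in the question:
function foo(callback) {
  var a = [1, 2, 3, 4, 5];
  var result = 0;
  var itemsProcessed = 0;
  var failed = false;

  a.forEach(processItem);

  function processItem(i) {
    findFromDb(i, onRecord);

    function onRecord(err, record) {
      if (failed) return;                       // an earlier item already reported an error
      if (err) { failed = true; return callback(err); }
      if (!record) { failed = true; return callback(new Error('No record exists for ' + i)); }
      result += record.value;
      if (++itemsProcessed === a.length) callback(null, result / a.length);
    }
  }
}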
The more I work with async, the more I love it and the more I like node. Let me give you a simple example of what I have for a server initialization.
async.parallel({
  "job1": loadFromCollection1,
  "job2": loadFromCollection2
},
function (initError, results) {
  if (initError) {
    console.log("[INIT] Server initialization error occurred: " + JSON.stringify(initError, null, 3));
    return callback(initError);
  }
  // Do more stuff with the results
});
In fact, this very same approach can be followed and one can pass different arguments to the different functions that correspond to the various jobs; see for example Passing arguments to async.parallel in node.js.
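Passing arguments usually comes down to wrapping each job in a small closure (or using Function.prototype.bind). A rough sketch, with a made-up loadFromCollection(collectionName, callback) signature and with callback assumed to be in scope as in the snippet above:
async.parallel({
  "job1": function (done) { loadFromCollection('users', done); },   // explicit wrapper
  "job2": loadFromCollection.bind(null, 'orders')                   // same idea with bind
},
function (initError, results) {
  if (initError) { return callback(initError); }
  // results.job1 and results.job2 hold each job's data
});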
To be perfectly honest with you, I prefer the node way, which is also non-blocking. I think node forces you to have a better design: you sometimes spend time creating more definitions and grouping functions and objects in arrays so that you can write better code. The reason, I think, is that in the end you want to exploit some variant of async and mix and merge things accordingly. In my opinion, spending some extra time and thinking about the code a bit more is well worth it when you also take into account that node is asynchronous.
Other than that, I think it is a habit. The more you write code for node, the more you improve and the better asynchronous code you write. What is good about node is that it really forces you to write more robust code, since you start respecting all the error codes from all the functions much more. For example, how often do people check whether, say, malloc or new succeeded, rather than issuing the call with no error handling for a NULL pointer? Writing asynchronous code forces you to respect the events and the error codes that come with them. I guess one obvious reason is that you respect the code you write, and in the end we have to write code that returns errors so that the caller knows what happened.
I really think that you need to give it more time and start working with async more. That's all.
"If you try to code bussiness db login using pure node.js, you go straight to callback hell"
I've recently created a simple abstraction named WaitFor to call async functions in sync mode (based on Fibers): https://github.com/luciotato/waitfor
check the database example:
Database example (pseudocode)
pure node.js (mild callback hell):
var db = require("some-db-abstraction");

function handleWithdrawal(req, res){
  try {
    var amount = req.param("amount");
    db.select("* from sessions where session_id=?", req.param("session_id"), function(err, sessiondata) {
      if (err) throw err;
      db.select("* from accounts where user_id=?", sessiondata.user_ID, function(err, accountdata) {
        if (err) throw err;
        if (accountdata.balance < amount) throw new Error('insufficient funds');
        db.execute("withdrawal(?,?)", accountdata.ID, req.param("amount"), function(err, data) {
          if (err) throw err;
          res.write("withdrawal OK, amount: " + req.param("amount"));
          db.select("balance from accounts where account_id=?", accountdata.ID, function(err, balance) {
            if (err) throw err;
            res.end("your current balance is " + balance.amount);
          });
        });
      });
    });
  }
  catch(err) {
    res.end("Withdrawal error: " + err.message);
  }
}
Note: the above code, although it looks like it will catch the exceptions, will not.
Catching exceptions with callback hell adds a lot of pain, and I'm not sure whether you will still have the 'res' parameter available to respond to the user. If somebody would like to fix this example... be my guest.
using wait.for:
var db = require("some-db-abstraction"), wait = require('wait.for');

function handleWithdrawal(req, res){
  try {
    var amount = req.param("amount");
    var sessiondata = wait.forMethod(db, "select", "* from sessions where session_id=?", req.param("session_id"));
    var accountdata = wait.forMethod(db, "select", "* from accounts where user_id=?", sessiondata.user_ID);
    if (accountdata.balance < amount) throw new Error('insufficient funds');
    wait.forMethod(db, "execute", "withdrawal(?,?)", accountdata.ID, req.param("amount"));
    res.write("withdrawal OK, amount: " + req.param("amount"));
    var balance = wait.forMethod(db, "select", "balance from accounts where account_id=?", accountdata.ID);
    res.end("your current balance is " + balance.amount);
  }
  catch(err) {
    res.end("Withdrawal error: " + err.message);
  }
}
Note: exceptions will be caught as expected.
The db methods (db.select, db.execute) will be called with this=db.
Your Code
In order to use wait.for, you'll have to STANDARDIZE YOUR CALLBACKS to function(err,data)
If you STANDARDIZE YOUR CALLBACKS, your code might look like:
var wait = require('wait.for');

//run in a Fiber
function process() {
  var a = [1, 2, 3, 4, 5];
  var result = 0;
  a.forEach(function(i) {
    // I/O operation
    var record = wait.for(findFromDb, i); //call & wait for async function findFromDb(i,callback)
    if (!record) throw new Error('No record exist for ' + i);
    result += record.value;
  });
  return result / a.length;
}

function inAFiber(){
  console.log('result is: ', process());
}

// run the loop in a Fiber (keep node spinning)
wait.launchFiber(inAFiber);
See? Much closer to Python, and no callback hell.
Related
I have an OrientDB database. I want to use nodejs with RESTful calls to create a large number of records. I need to get the #rid of each for some later processing.
My pseudocode is:
for each record
    write.to.db(record)
    when the async of write.to.db() finishes
        process based on #rid
carryon()
I have landed in serious callback hell trying to do this. The version that came closest used tail recursion in the .then function to write the next record to the db. However, I couldn't carry on with the rest of the processing.
A final constraint is that I am behind a corporate proxy and cannot use any other packages without going through the network administrator, so using the native nodejs packages is essential.
Any suggestions?
With a completion callback, the general design pattern for this type of problem makes use of a local function for doing each write:
var records = ....; // array of records to write
var index = 0;

function writeNext(r) {
  write.to.db(r, function(err) {
    if (err) {
      // error handling
    } else {
      ++index;
      if (index < records.length) {
        writeNext(records[index]);
      }
    }
  });
}

writeNext(records[0]);
The key here is that you can't use synchronous iterators like .forEach() because they won't iterate one at a time and wait for completion. Instead, you do your own iteration.
If your write function returns a promise, you can use the .reduce() pattern that is common for iterating an array.
var records = ...; // some array of records to write

records.reduce(function(p, r) {
  return p.then(function() {
    return write.to.db(r);
  });
}, Promise.resolve()).then(function() {
  // all done here
}, function(err) {
  // error here
});
This solution chains promises together, waiting for each one to resolve before executing the next save.
It's kind of hard to tell which function would be best for your scenario without more detail, but I almost always use asyncjs for this kind of thing.
From what you say, one way to do it would be with async.map:
var recordsToCreate = [...];

function functionThatCallsTheApi(record, cb){
  // do the api call, then call cb(null, rid)
}

async.map(recordsToCreate, functionThatCallsTheApi, function(err, results){
  // here, err will be set if anything failed in any function
  // results will be an array of the rids
});
You can also check out the other ones that enable throttling, which is probably a good idea.
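For example, async.mapLimit works like async.map but caps how many calls run at once; the limit of 5 here is just an illustrative choice:
// at most 5 API calls running at any one time
async.mapLimit(recordsToCreate, 5, functionThatCallsTheApi, function(err, results){
  // err is set if any call failed; results is an array of the rids
});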
I am working with zombie.js to scrape a site, and I must use the callback style to connect to each url. The point is that I have an array of urls and I need to process each url using an async function. This is my first approach:
var urls = ["http...", "http..."];

function process_url(index)
{
  if (index == urls.length)
    return;
  async_function(urls[index],
    function() {
      ...
      // parse the url
      ...
      // Process the next url
      process_url(index + 1);
    }
  );
}

process_url(0);
Without using a third-party nodejs library to call the async function as if it were a sync function, or to wait for the function (wait.for, synchronized, mocha), this is the way I thought of to solve this problem. I don't know what would happen if the array is too big. Is each function released from memory when the next function is called, or do all the functions stay in memory until the end?
Any ideas?
Your scheme will work. I call it "manually sequencing async operations".
A general purpose version of what you're doing would look like this:
function processItem(data, callback) {
  // do your async function here
  // for example, let's suppose it was an http request using the request module
  request(data, callback);
}

function processArray(array, fn) {
  var index = 0;

  function next() {
    if (index < array.length) {
      fn(array[index++], function(err, result) {
        // process error here
        if (err) return;
        // process result here
        next();
      });
    }
  }

  next();
}

processArray(arr, processItem);
As to your specific questions:
I don't know what would happen if the array is too big. Is the function released from the memory when the next function is called? Or are all the functions in memory until the end?
Memory in Javascript is released when it is no longer referenced by any running code and when the garbage collector gets time to run. Since you are running a series of asynchronous operations here, the garbage collector likely gets a chance to run regularly while waiting for the http response from the async operation, so memory can get cleaned up then. Functions are just another type of object in Javascript and they get garbage collected just like anything else. When they are no longer referenced by running code, they are eligible for garbage collection.
In your specific code, because you are re-calling process_url() only in an async callback, there is no stack build-up (as in normal recursion). The prior instance of process_url() has already completed BEFORE the async callback is called and BEFORE you call the next iteration of process_url().
In general, management and coordination of multiple async operations is much, much easier using promises which are built into the current versions of node.js and are part of the ES6 ECMAScript standard. No external libraries are required to use promises in current versions of node.js.
For a list of a number of different techniques for sequencing your asynchronous operations on your array, both using promises and not using promises, see:
How to synchronize a sequence of promises?.
The first step in using promises is to "promisify" your async function so that it returns a promise instead of taking a callback.
function async_function_promise(url) {
  return new Promise(function(resolve, reject) {
    async_function(url, function(err, result) {
      if (err) {
        reject(err);
      } else {
        resolve(result);
      }
    });
  });
}
Now, you have a version of your function that returns promises.
If you want your async operations to proceed one at a time so the next one doesn't start until the previous one has completed, then a usual design pattern for that is to use .reduce() like this:
function process_urls(array) {
  return array.reduce(function(p, url) {
    return p.then(function(priorResult) {
      return async_function_promise(url);
    });
  }, Promise.resolve());
}
Then, you can call it like this:
var myArray = ["url1", "url2", ...];
process_urls(myArray).then(function(finalResult) {
// all of them are done here
}, function(err) {
// error here
});
There are also Promise libraries that have some helpful features that make this type of coding simpler. I, myself, use the Bluebird promise library. Here's how your code would look using Bluebird:
var Promise = require('bluebird');
var async_function_promise = Promise.promisify(async_function);

function process_urls(array) {
  return Promise.map(array, async_function_promise, {concurrency: 1});
}

process_urls(myArray).then(function(allResults) {
  // all of them are done here and allResults is an array of the results
}, function(err) {
  // error here
});
Note, you can change the concurrency value to whatever you want here. For example, you would probably get faster end-to-end performance if you increased it to something between 2 and 5 (it depends on the server implementation and how it is best optimized).
I have been working with nodeJS + MongoDB, using the Express and Mongoose frameworks, for a few months, and I wanted to ask you guys what is really happening in a situation such as the following:
Model1.find({}, function (err, elems) {
  if (err) {
    console.log('ERROR');
  } else {
    elems.forEach(function (el) {
      Model2.find({[QUERY RELATED WITH FIELDS IN 'el']}, function (err, elems2) {
        if (err) {
          console.log('ERROR');
        } else {
          //DO STUFF.
        }
      });
    });
  }
});
My best guess is that there's a main thread looping over elems, and then different threads attending each query over Model2, but I'm not really sure.
Is that correct? And is this a good solution? If not, how would you code a situation such as this, where you need the information in each of the elements you get from Model1 in order to get elements from Model2 and perform the actual functionality you are looking for?
I know I could elaborate a more complex query that would get all the elements each of the 'el' in elems would yield, but I'd rather not do that, because then I would be worried about the memory expense.
Also, I've been thinking about changing the data model, but I've gone over it and I'm confident it is well thought out, and I don't think that's the best solution for my application.
Thanks!
NodeJS is a single-threaded environment, and it works asynchronously for blocking function calls such as the network requests in your case. So there is only one thread, and your query results are delivered asynchronously so that nothing is blocked by the intensive network operation.
In your scenario, if the first query returns a lot of records, say 100,000, you may exhaust your mongo server in your loop, because you will instantly issue as many queries as there are results in the first query. This happens because node won't stop to wait for the results of each query, as it works asynchronously.
So manually throttling requests to network operations is usually good practice. This is not trivial in an asynchronous environment. One way to do it is with a recursive function call: basically, you split your tasks into groups and handle each group as a batch; once you are done with one batch, you start the next group.
Here is a simple example of how to do it. I have used promises instead of callback functions; Q is a promise library that is very useful for handling promises:
var rows = [...]; // array of many

function handleRecursively(startIndex, batchSize){
  var promises = [];
  for (var i = 0; i < batchSize && startIndex + i < rows.length; i++){
    var theRow = rows[startIndex + i];
    promises.push(doAsynchronousJobWithTheRow(theRow));
  }
  // wait until all tasks in this batch are handled
  Q.all(promises).then(function(){
    startIndex += batchSize;
    if (startIndex < rows.length){ // if there are still tasks to do, continue with the next batch
      handleRecursively(startIndex, batchSize);
    }
  });
}

handleRecursively(0, 1000);
Here is the best solution:
Model1.find({}, function (err, elems) {
  if (err) {
    console.log('ERROR');
  } else {
    loopAllElements(0, elems);
  }
});

function loopAllElements(startIndex, elems){
  if (startIndex == elems.length) {
    return "success";
  } else {
    Model2.find({[QUERY RELATED WITH FIELDS IN elems[startIndex] ]}, function (err, elems2) {
      if (err) {
        console.log('ERROR');
        return "error";
      } else {
        //DO STUFF.
        loopAllElements(startIndex + 1, elems);
      }
    });
  }
}
I am using a variable that is used by many functions at a time. I need to synchronize it. How do I do it?
var x = 0;

var a = function(){
  x = x + 1;
};

var b = function(){
  x = x + 2;
};

var c = function(){
  var t = x;
  return t;
};
This is the simplified logic of my code. To give more insight, x is essentially my mongoDB object, which needs to be used by only one function at a time. Also, the 3 functions are like REST API calls, so there is a probability they will be called at the same time.
I need to write a getX function which should manage locking and unlocking.
Any suggestions?
Node is single threaded, so there is no chance of the 3 functions being executed at the same time. Synchronization and race conditions only apply in multithreaded environments. There is a case, though, if the first function blocks for I/O.
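If the functions do perform async I/O and you really need them to touch x one at a time, a minimal sketch of a hand-rolled callback queue (not any standard API, just an illustration) might look like this:
// naive serializer: runs queued tasks one after another
var queue = [];
var busy = false;

function withX(task) {          // task(release) gets exclusive access until it calls release()
  queue.push(task);
  drain();
}

function drain() {
  if (busy || queue.length === 0) return;
  busy = true;
  var task = queue.shift();
  task(function release() {
    busy = false;
    drain();                    // start the next waiting task, if any
  });
}

// usage: the body must call release() when it is done with x
withX(function (release) {
  x = x + 1;                    // pretend this line wraps some async DB work
  release();
});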
You are asking about keeping a single object synchronized as several asynchronous operations modify that object. This is a bit vague (do you need to execute them in order? do they change the same properties?). It's hard to make a catch-all solution, so I suggest that you determine what order, if any, the operations must take place in, and use the async library to handle the control flow.
The async.waterfall method (example below) is useful if you want to pass results down a chain of functions that execute in order. There are many other useful functions included in the library, like async.eachSeries (execute a function once per array item, in order) and async.parallel (execute an array of functions simultaneously). All docs are available at https://github.com/caolan/async
var async = require('async');

function calculateX(callback){
  async.waterfall(
    [
      function(done){
        var x = 0;
        asyncCall1(x, function(x1){ // add x1 = x + 1;
          done(null, x1);
        });
      },
      function(x1, done){
        asyncCall2(x1, function(x2){ // add x2 = x1 + 2;
          done(null, x2);
        });
      }
    ],
    function(err, x2){
      var t = x2;
      callback(t);
    });
}

calculateX(function(x2){
  mongo.save(x2, function(err){ // or something, idk, mongo
    if (err){ console.log(err); }
  });
});
I'm using Mongoose with Node.js and have the following code that will call the callback after all the save() calls have finished. However, I feel that this is a very dirty way of doing it and would like to see the proper way to get this done.
function setup(callback) {
  // Clear the DB and load fixtures
  Account.remove({}, addFixtureData);

  function addFixtureData() {
    // Load the fixtures
    fs.readFile('./fixtures/account.json', 'utf8', function(err, data) {
      if (err) { throw err; }
      var jsonData = JSON.parse(data);
      var count = 0;
      jsonData.forEach(function(json) {
        count++;
        var account = new Account(json);
        account.save(function(err) {
          if (err) { throw err; }
          if (--count == 0 && callback) callback();
        });
      });
    });
  }
}
You can clean up the code a bit by using a library like async or Step.
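For example, with async.each the fixture loading might look roughly like this (the file reading and parsing kept as in your code, and callback assumed to follow the (err) convention):
var async = require('async');

function addFixtureData(callback) {
  fs.readFile('./fixtures/account.json', 'utf8', function (err, data) {
    if (err) return callback(err);
    var jsonData = JSON.parse(data);
    async.each(jsonData, function (json, done) {
      new Account(json).save(done);   // save reports its error (if any) to done
    }, callback);                     // called once everything is saved, or on the first error
  });
}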
Also, I've written a small module that handles loading fixtures for you, so you just do:
var fixtures = require('./mongoose-fixtures');
fixtures.load('./fixtures/account.json', function(err) {
//Fixtures loaded, you're ready to go
};
Github:
https://github.com/powmedia/mongoose-fixtures
It will also load a directory of fixture files, or objects.
I did a talk about common asynchronous patterns (serial and parallel) and ways to solve them:
https://github.com/masylum/i-love-async
I hope it's useful.
I've recently created a simpler abstraction called wait.for to call async functions in sync mode (based on Fibers). It's at an early stage but works. It is at:
https://github.com/luciotato/waitfor
Using wait.for, you can call any standard nodejs async function, as if it were a sync function, without blocking node's event loop. You can code sequentially when you need it.
Using wait.for, your code will be:
//in a fiber
function setup(callback) {
  // Clear the DB and load fixtures
  wait.for(Account.remove, {});

  // Load the fixtures
  var data = wait.for(fs.readFile, './fixtures/account.json', 'utf8');
  var jsonData = JSON.parse(data);

  jsonData.forEach(function(json) {
    var account = new Account(json);
    wait.forMethod(account, 'save');
  });

  callback();
}
That's actually the proper way of doing it, more or less. What you're doing there is a parallel loop. You can abstract it into its own "async parallel foreach" function if you want (and many do), but that's really the only way of doing a parallel loop.
Depending on what you intended, one thing that could be done differently is the error handling. Because you're throwing, if there's a single error, that callback will never get executed (count won't be decremented). So it might be better to do:
account.save(function(err) {
  if (err) return callback(err);
  if (!--count) callback();
});
And handle the error in the callback. It's better node-convention-wise.
I would also change another thing to save you the trouble of incrementing count on every iteration:
var jsonData = JSON.parse(data)
  , count = jsonData.length;

jsonData.forEach(function(json) {
  var account = new Account(json);
  account.save(function(err) {
    if (err) return callback(err);
    if (!--count) callback();
  });
});
If you are already using underscore.js anywhere in your project, you can leverage its after method. You need to know in advance how many async calls will be made, but aside from that it's a pretty elegant solution.
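A sketch of how that might look here, reusing the jsonData array and callback from the code above; _.after(n, fn) returns a function that invokes fn on its nth call:
var _ = require('underscore');

var finished = _.after(jsonData.length, callback);   // fires callback once every save has reported in

jsonData.forEach(function (json) {
  new Account(json).save(function (err) {
    if (err) return callback(err);                   // error handling as discussed above
    finished();
  });
});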