Patterns for async list processing in node.js

I have an application where a database query returns a number of rows (typically, less than 100). For each row, I need to make an http call to get supplemental data. I'd like to fire off all of the requests, and then when the last callback completes, move on to rendering the result page.
So far, the answers to similar questions I've looked at have either chained the requests, making request #2 in the callback for request #1 (advantages: simple, avoids burying the server in multiple requests), or fired all of the requests with no tracking of whether they have all completed (which works well in the browser, where the callback just updates the UI).
My current plan is to keep a counter of requests made and have the callback decrement the counter; if it reaches zero, I can call the render function. I may also need to handle the case where responses come in faster than requests are being made (not likely, but a possible edge case).
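A minimal sketch of that counter plan (fetchSupplement and renderPage are hypothetical placeholders for the HTTP call and the render step):
var remaining = rows.length; // set before any request fires
rows.forEach(function (row) {
  fetchSupplement(row, function (err, extra) {
    row.extra = extra;
    if (--remaining === 0) { // the last callback to finish triggers the render
      renderPage(rows);
    }
  });
});
Because remaining is initialised to rows.length before the first request is made, a response that arrives quickly can never drive the counter to zero prematurely.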
Are there other useful patterns for this type of problem?

When using async, the code could roughly look like this:
var async = require('async');
var results = [];
var queue = async.queue(function(row, callback) {
  http.fetchResultForRow(row, function(data) {
    results.push(data);
    callback();
  });
}, 1); // concurrency of 1 processes rows one at a time; raise it to run several requests in parallel
queue.drain = function() {
  console.log("All results loaded");
  renderEverything(results);
};
database.fetch(function(rows) {
  for (var i = 0; i < rows.length; i++) {
    queue.push(rows[i]);
  }
});
If the order of processing does not matter, you could also use map (see the sketch below).
Look around in the documentation of async; there are a lot of useful patterns.
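For instance, a rough sketch with async.map, reusing the http.fetchResultForRow placeholder from the example above (error handling omitted, as in the queue example):
var async = require('async');

async.map(rows, function (row, cb) {
  http.fetchResultForRow(row, function (data) {
    cb(null, data); // async.map collects the results in the same order as the input rows
  });
}, function (err, results) {
  renderEverything(results);
});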

You can implement this quite nicely with promises using the when library, though if you want to rate-limit the calls that fetch the extra info, you will need to do a little more work than in TheHippo's async approach, I think.
Here's an example:
var when = require('when');
// This is the function that gets the extra info.
// I've added a setTimeout to show how it is async.
function get_extra_info_for_row(x, callback) {
  setTimeout(function() { return callback(null, x + 10); }, 1);
}
var rows = [1, 2, 3, 4, 5];
var row_promises = rows.map(function(x) {
  var deferred = when.defer();
  get_extra_info_for_row(x, function(err, extra_info) {
    if (err) return deferred.reject(err);
    deferred.resolve([x, extra_info]);
  });
  return deferred.promise;
});
when.all(row_promises)
  .then(
    function(augmented_rows) { console.log(augmented_rows); },
    function(err) { console.log("Error", err); }
  );
This outputs
[ [ 1, 11 ], [ 2, 12 ], [ 3, 13 ], [ 4, 14 ], [ 5, 15 ] ]

Related

How to run asynchronous tasks synchronously?

I'm developing an app with the following node.js stack: Express/Socket.IO + React. In React I have DataTables, wherein you can search and with every keystroke the data gets dynamically updated! :)
I use Socket.IO for data fetching, so on every keystroke the client socket emits some parameters and the server then calls the callback to return data. This works like a charm, but it is not guaranteed that the returned data comes back in the same order as the client sent the requests.
To simulate this: when I type 'a', the server should respond with that same 'a', and likewise for every character.
I found the async module for node.js and tried to use its queue to return tasks in the same order they were received. For simplicity I delayed the first incoming task with setTimeout to simulate a slow database query:
Declaration:
const async = require('async');
var queue = async.queue(function(task, callback) {
  if (task.count == 1) {
    setTimeout(function() {
      callback();
    }, 3000);
  } else {
    callback();
  }
}, 10);
Usage:
socket.on('result', function(data, fn) {
  var filter = data.filter;
  if (filter.length === 1) { // TEST SYNCHRONOUSLY
    queue.push({name: filter, count: 1}, function(err) {
      fn(filter);
      // console.log('finished processing slow');
    });
  } else {
    // add some items to the queue
    queue.push({name: filter, count: filter.length}, function(err) {
      fn(data.filter);
      // console.log('finished processing fast');
    });
  }
});
But what I receive in the client console when I search for "abc" is as follows:
ab -> abc -> a(after 3 sec)
I want it to return it like this: a(after 3sec) -> ab -> abc
My thought is that the queue starts the setTimeout and then moves on, and the setTimeout only fires later on the event loop. This results in the later search filters being returned earlier than the slow-performing one.
How can I solve this problem?
First a few comments, which might help clear up your understanding of async calls:
Using "timeout" to try and align async calls is a bad idea, that is not the idea about async calls. You will never know how long an async call will take, so you can never set the appropriate timeout.
I believe you are misunderstanding the usage of queue from async library you described. The documentation for the queue can be found here.
Copy pasting the documentation in here, in-case things are changed or down:
Creates a queue object with the specified concurrency. Tasks added to the queue are processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one becomes available. Once a worker completes a task, that task's callback is called.
The above means that the queue can simply be used to prioritize which async tasks the workers pick up. The different async tasks can still finish at different times.
Potential solutions
There are a few solutions to your problem, depending on your requirements.
1. You only send one async call at a time and wait for it to finish before sending the next one.
2. You store the results and only display them to the user when all calls have finished.
3. You disregard all calls except for the latest async call.
In your case I would pick solution 3, since you are implementing a search. Why would you care about the results for "a" if the user is already searching for "abc" before the response for "a" arrives?
This can be done by giving each request a timestamp (or counter) and then keeping only the response that carries the latest one.
SOLUTION:
Server:
exports = module.exports = function(io) {
  io.sockets.on('connection', function (socket) {
    socket.on('result', function(data, fn) {
      var filter = data.filter;
      var counter = data.counter;
      if (filter.length === 1 || filter.length === 5) { // TEST SYNCHRONOUSLY
        setTimeout(function() {
          fn({ filter: filter, counter: counter }); // return to client
        }, 3000);
      } else {
        fn({ filter: filter, counter: counter }); // return to client
      }
    });
  });
};
Client:
export class FilterableDataTable extends Component {
  constructor(props) {
    super(props);
    this.state = {
      endpoint: "http://localhost:3001",
      filters: {},
      counter: 0
    };
    this.onLazyLoad = this.onLazyLoad.bind(this);
  }

  onLazyLoad(event) {
    var offset = event.first;
    if (offset === null) {
      offset = 0;
    }
    var filter = ''; // filter is the search character
    if (event.filters.result2 != undefined) {
      filter = event.filters.result2.value;
    }
    this.state.counter++;
    this.socket.emit('result', {
      offset: offset,
      limit: 20,
      filter: filter,
      counter: this.state.counter
    }, (data) => { // arrow function so `this` still refers to the component
      var returnedData = data;
      console.log(returnedData);
      if (returnedData.counter === this.state.counter) {
        console.log('DATA: ' + JSON.stringify(returnedData));
      }
    });
  }
}
This does, however, send unneeded data to the client, which then ignores it. Does anybody have ideas for further optimizing this kind of communication? For example, a way to keep stale results at the server and only send the latest?
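One hedged sketch of that idea: keep the newest counter per connection on the server and simply never answer stale requests (runQuery here is a hypothetical stand-in for the real database call):
io.sockets.on('connection', function (socket) {
  var latestCounter = 0; // newest counter seen on this connection
  socket.on('result', function (data, fn) {
    if (data.counter > latestCounter) {
      latestCounter = data.counter;
    }
    runQuery(data.filter, function (rows) {
      if (data.counter === latestCounter) {
        fn({ filter: data.filter, counter: data.counter, rows: rows }); // only the latest request is answered
      }
      // otherwise a newer request has already arrived, so this stale result is dropped server-side
    });
  });
});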

How do I optimise this callback hierarchy in my simple express app?

I'm creating a simple RSS feed aggregator, for which I'm using a module called 'blind-parser'. This module parses an rss/xml string and returns an object.
My aim was to obtain such objects from multiple RSS feeds and render them. So initially I did this:
var obj1 = {}, obj2 = {}, obj3 = {}, obj4 = {};
app.get('/', function(req, res) {
  parser.parseURL('https://www.toptal.com/blog.rss', function (err, parsed1) {
    obj1 = parsed1;
  });
  parser.parseURL('https://www.toptal.com/designers/blog.rss', function (err, parsed2) {
    obj2 = parsed2;
  });
  parser.parseURL('http://jsfeeds.com/feed', function (err, parsed3) {
    obj3 = parsed3;
  });
  parser.parseURL('http://www.mironov.com/feed/', function (err, parsed4) {
    obj4 = parsed4;
  });
  res.render('index', {'toptal': obj1, 'toptaldesign': obj2, 'jsf': obj3, 'product': obj4});
});
However, it didn't work: obj1, obj2, obj3 and obj4 turned out to be undefined, which I couldn't understand. I tried declaring the variables both inside app.get and outside, to no effect.
So I did this instead:
app.get('/', function(req, res) {
  parser.parseURL('https://www.toptal.com/blog.rss', function (err, parsed1) {
    parser.parseURL('https://www.toptal.com/designers/blog.rss', function (err, parsed2) {
      parser.parseURL('http://jsfeeds.com/feed', function (err, parsed3) {
        parser.parseURL('http://www.mironov.com/feed/', function (err, parsed4) {
          res.render('index', {'toptal': parsed1, 'toptaldesign': parsed2, 'rs': parsed3, 'product': parsed4});
        });
      });
    });
  });
});
and voilà, it worked! However, the major downside to this approach is that the time it takes grows with every RSS link I nest; already, loading the page with three levels of nesting takes ~20 seconds.
Can someone please help me out and offer a solution to this? I'm new to Node, so please be thorough.
Thanks.
I'm assuming the parser.parseURL function is actually fetching the specified URL and parsing its contents. Each of your four parseURL requests is asynchronous: each function is invoked synchronously, but its callback is executed at some later point in time, once the URL has been fetched. Meanwhile, your res.render call also runs synchronously, immediately after the preceding parseURL calls but before any of their callbacks have been invoked, and therefore before obj1 etc. have been set.
When you nest the calls within each callback, as in your second example, you are ensuring that each URL result is fetched before proceeding, but because you are fetching each URL sequentially (you don't fetch the second URL until the first has returned), the execution time will be slower.
You can possibly speed things up by fetching all four URLs in parallel (assuming that your requests are handled concurrently). To do this, it will be simpler if you first use a library like async, or alternatively, if you convert your code to use promises. If you use async, then your code could be changed to fetch the URLs in parallel like so:
async.parallel([
( cb ) => {
parser.parseURL('https://www.toptal.com/blog.rss', cb );
},
( cb ) => {
parser.parseURL('https://www.toptal.com/designers/blog.rss', cb );
},
( cb ) => {
parser.parseURL('http://jsfeeds.com/feed', cb );
},
( cb ) => {
parser.parseURL('http://www.mironov.com/feed/', cb );
}
],
( err, results ) => {
// Called once all requests have completed.
res.render('index', {'toptal': results[0] , 'toptaldesign' : results[1] , 'jsf': results[2] , 'product': results[3] });
});
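Alternatively, with promises (a sketch assuming parser.parseURL follows the standard (err, result) Node callback convention and that you are on Node 8+ for util.promisify):
const util = require('util');
const parseURL = util.promisify(parser.parseURL.bind(parser));

app.get('/', (req, res, next) => {
  Promise.all([
    parseURL('https://www.toptal.com/blog.rss'),
    parseURL('https://www.toptal.com/designers/blog.rss'),
    parseURL('http://jsfeeds.com/feed'),
    parseURL('http://www.mironov.com/feed/')
  ]).then(([toptal, toptaldesign, jsf, product]) => {
    res.render('index', { toptal: toptal, toptaldesign: toptaldesign, jsf: jsf, product: product });
  }).catch(next); // a failure in any feed falls through to Express's error handler
});
Either way, the total time is then roughly that of the slowest feed rather than the sum of all four.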

Perform arbitrary set of asynchronous tasks

My input is streamed from another source, which makes it difficult to use async.forEach. I am pulling data from an API endpoint, but I have a limit of 1000 objects per request, and I need to get hundreds of thousands of them (basically all of them). I will know I'm finished when a response contains fewer than 1000 objects. Here is what I have tried:
/* List all deposits */
var depositsAll = [];
var depositsIteration = [];
async.doWhilst(this._post(endpoint_path, function (err, response) {
// check err
/* Loop through the data and gather only the deposits */
for (var key in response) {
//do some stuff
}
depositsAll += depositsIteration;
return callback(null, depositsAll);
}, {limit: 1000, offset: 0, sort: 'desc'}),
response.length > 1000, function (err, depositsAll) {
// check for err
// return the complete result
return callback(null, depositsAll);
});
With this code I get an internal async error saying that iterator is not a function. But in general I am fairly sure the logic is not correct either.
If it's not clear what I'm trying to achieve - I need to perform a request multiple times, and add the response data to a result that at the end contains all the results, so I can return it. And I need to perform requests until the response contains less than 1000 objects.
I also looked into async.queue but could not get the hang of it...
Any ideas?
You should be able to do it with doWhilst, but if that example is from your real code, you have misunderstood some of how async works. doWhilst takes three arguments, each of them being a function:
1. The function to be called by async. It receives a callback argument that must be invoked when the iteration finishes. In your case, you need to wrap this._post inside another function.
2. The test function. (You are passing the value of response.length > 1000, i.e. a boolean evaluated immediately, and response is not even defined at that point.)
3. The final function to be called once execution has stopped.
Example with each needed function separated for readability:
var depositsAll = [];
var responseLength = 1000;
var self = this;
var post = function(asyncCb) {
  self._post(endpoint_path, function(err, res) {
    ...
    responseLength = res.length;
    asyncCb(err, depositsAll);
  });
};
var check = function() {
  return responseLength >= 1000;
};
var done = function(err, deposits) {
  console.log(deposits);
};
async.doWhilst(post, check, done);
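Tying that together with the accumulation made explicit; a sketch only, which assumes _post takes the options as a third argument (as in your snippet) and that each response is an array of objects. Note that concat is used because += on two arrays produces a string:
var async = require('async');

var depositsAll = [];
var responseLength = 1000;
var offset = 0;
var self = this;

async.doWhilst(
  function (asyncCb) {
    self._post(endpoint_path, function (err, response) {
      if (err) return asyncCb(err);
      responseLength = response.length;
      depositsAll = depositsAll.concat(response); // accumulate this page of results
      offset += response.length;
      asyncCb(null);
    }, {limit: 1000, offset: offset, sort: 'desc'});
  },
  function () { return responseLength >= 1000; }, // keep paging while responses are full
  function (err) {
    if (err) return callback(err);
    callback(null, depositsAll); // all pages collected
  }
);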

node asynchronous with response

function getResultsForOneDev(devID, res) {
  var Contribution = require('../db/Contribution.js').model;
  var SurveyState = require('../db/SurveyState.js').model;
  var SurveyAnswer = require('../db/SurveyAnswer.js').model;
  var contributionList = {
    "dev": [ {
      "contribs": [ {
        "surveyStates": [ {
          "surveyAnswers": [ { } ]
        } ]
      } ]
    } ]
  };
  Contribution.find({dev: devID}).exec(function (error, contribs) {
    // console.log("contribs:"+contribs);
    contributionList = contribs;
    console.log("contribs length:"+contribs.length);
    for (var i = 0; i < contribs.length; i++) {
      (function(oneContrib) {
        //console.log('contribs ID '+oneContrib._id);
        SurveyState.find({contrib: oneContrib._id}).exec(function (error, surveyStates) {
          // console.log("surveyStates:"+surveyStates);
          oneContrib.surveyStates = surveyStates;
          console.log("surveyStates length:"+surveyStates.length);
          for (var j = 0; j < surveyStates.length; j++) {
            (function(oneSurveyState) {
              SurveyAnswer.find({surveyState: oneSurveyState._id}).exec(function (error, surveyAnswers) {
                // console.log("surveyAnswers:"+surveyAnswers);
                oneSurveyState.surveyAnswers = surveyAnswers;
                console.log("surveyAnswers length:"+surveyAnswers.length);
              });
            })(surveyStates[j]);
          }
        });
      })(contribs[i]);
    }
  });
  res.jsonp(contributionList);
}
This program does not run as I want; res.jsonp returns an empty contributionList.
I already tried with async (https://github.com/caolan/async). What is the good practice for filling contributionList before sending res.jsonp?
.find() is asynchronous. It returns immediately, before the callback has populated values into contributionList.
Move your res.jsonp() to the end of the callback code where contributionList is populated rather than outside the callback.
Since you seem to have multiple find() calls inside loops and whatnot, and you cannot guarantee the order the callbacks will run in, you can use async (as you mention) to create a workflow that ensures they all finish and then runs a final callback (executed by async) that invokes res.jsonp().
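A rough sketch of that workflow with async.each, nested so the jsonp only fires after every inner query has finished (error handling kept minimal):
var async = require('async');

Contribution.find({dev: devID}).exec(function (error, contribs) {
  async.each(contribs, function (oneContrib, contribDone) {
    SurveyState.find({contrib: oneContrib._id}).exec(function (error, surveyStates) {
      oneContrib.surveyStates = surveyStates;
      async.each(surveyStates, function (oneSurveyState, stateDone) {
        SurveyAnswer.find({surveyState: oneSurveyState._id}).exec(function (error, surveyAnswers) {
          oneSurveyState.surveyAnswers = surveyAnswers;
          stateDone(error);
        });
      }, contribDone); // this contrib is done once all of its survey states are done
    });
  }, function (err) {
    res.jsonp(contribs); // every nested query has completed (or err is set)
  });
});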
Because your database queries are asynchronous (they finish some time later) and the rest of your code does not wait for them, your two for loops will finish long before the actual async responses do. As such, you have to keep track (somehow) of when the last async response is done and thus all the data is now in the contributionList data structure, so you can then send your response.
My preference would be to use promises for this and Promise.all() to trigger an action when an arbitrary number of asynchronous operations are complete, but I don't know the database interfaces you're using to know which ones are promisified, so here's a generic method that simply uses a manual counter to keep track of how many async operations are still in flight and when the counter gets to zero, you have all the data now and you can send the response.
The additions to this code are the lines of code that use the variable remaining.
function getResultsForOneDev(devID, res) {
  var Contribution = require('../db/Contribution.js').model;
  var SurveyState = require('../db/SurveyState.js').model;
  var SurveyAnswer = require('../db/SurveyAnswer.js').model;
  var contributionList = {
    "dev": [ {
      "contribs": [ {
        "surveyStates": [ {
          "surveyAnswers": [ { } ]
        } ]
      } ]
    } ]
  };
  Contribution.find({dev: devID}).exec(function (error, contribs) {
    // console.log("contribs:"+contribs);
    contributionList = contribs;
    console.log("contribs length:"+contribs.length);
    // keep track of how many async responses are left to be processed
    // in a variable at a higher scope
    var remaining = 0;
    for (var i = 0; i < contribs.length; i++) {
      (function(oneContrib) {
        //console.log('contribs ID '+oneContrib._id);
        SurveyState.find({contrib: oneContrib._id}).exec(function (error, surveyStates) {
          // console.log("surveyStates:"+surveyStates);
          oneContrib.surveyStates = surveyStates;
          console.log("surveyStates length:"+surveyStates.length);
          // add how many more responses are pending
          remaining += surveyStates.length;
          for (var j = 0; j < surveyStates.length; j++) {
            (function(oneSurveyState) {
              SurveyAnswer.find({surveyState: oneSurveyState._id}).exec(function (error, surveyAnswers) {
                // console.log("surveyAnswers:"+surveyAnswers);
                oneSurveyState.surveyAnswers = surveyAnswers;
                console.log("surveyAnswers length:"+surveyAnswers.length);
                // mark one more processed and see if all remaining ones are done
                --remaining;
                if (remaining === 0) {
                  res.jsonp(contributionList);
                }
              });
            })(surveyStates[j]);
          }
        });
      })(contribs[i]);
    }
  });
}
P.S. You should realize that you are somewhat flooding your database with a whole bunch of requests all at once (all attempting to run in parallel) and then sometime later the database will actually finish all of them. Depending upon the structure of the database and its ability to handle this flood of requests efficiently or share load with other users also using the database, this is sometimes not a best practice. So, sometimes it is better to send some small number of requests at once (e.g. 3-5) and each time one completes, you launch the next waiting request. The async library can do that type of management for you or you can fairly simply build your own little queue of requests and each time one finishes, you send another.
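For example, async.eachLimit caps how many of those queries are in flight at once (the limit of 3 below is an arbitrary illustration):
var async = require('async');

async.eachLimit(contribs, 3, function (oneContrib, done) {
  SurveyState.find({contrib: oneContrib._id}).exec(function (error, surveyStates) {
    oneContrib.surveyStates = surveyStates;
    done(error); // freeing this slot lets the next queued contrib start
  });
}, function (err) {
  // all contribs processed, never more than 3 queries running at the same time
});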

Idiomatic way to wait for multiple callbacks in Node.js

Suppose you need to do some operations that depend on some temp file. Since
we're talking about Node here, those operations are obviously asynchronous.
What is the idiomatic way to wait for all operations to finish in order to
know when the temp file can be deleted?
Here is some code showing what I want to do:
do_something(tmp_file_name, function(err) {});
do_something_other(tmp_file_name, function(err) {});
fs.unlink(tmp_file_name);
But if I write it this way, the third call can be executed before the first two
get a chance to use the file. I need some way to guarantee that the first two
calls have already finished (invoked their callbacks) before moving on, without nesting
the calls (which would effectively serialize them).
I thought about using event emitters on the callbacks and registering a counter
as receiver. The counter would receive the finished events and count how many
operations were still pending. When the last one finished, it would delete the
file. But there is the risk of a race condition and I'm not sure this is
usually how this stuff is done.
How do Node people solve this kind of problem?
Update:
Now I would advise having a look at:
Promises
The Promise object is used for deferred and asynchronous computations.
A Promise represents an operation that hasn't completed yet, but is
expected in the future.
A popular promises library is bluebird. I would advise having a look at why promises.
You should use promises to turn this:
fs.readFile("file.json", function (err, val) {
if (err) {
console.error("unable to read file");
}
else {
try {
val = JSON.parse(val);
console.log(val.success);
}
catch (e) {
console.error("invalid json in file");
}
}
});
Into this:
fs.readFileAsync("file.json").then(JSON.parse).then(function (val) {
console.log(val.success);
})
.catch(SyntaxError, function (e) {
console.error("invalid json in file");
})
.catch(function (e) {
console.error("unable to read file");
});
generators: For example via co.
Generator based control flow goodness for nodejs and the browser,
using promises, letting you write non-blocking code in a nice-ish way.
var co = require('co');
co(function *(){
// yield any promise
var result = yield Promise.resolve(true);
}).catch(onerror);
co(function *(){
// resolve multiple promises in parallel
var a = Promise.resolve(1);
var b = Promise.resolve(2);
var c = Promise.resolve(3);
var res = yield [a, b, c];
console.log(res);
// => [1, 2, 3]
}).catch(onerror);
// errors can be try/catched
co(function *(){
try {
yield Promise.reject(new Error('boom'));
} catch (err) {
console.error(err.message); // "boom"
}
}).catch(onerror);
function onerror(err) {
// log any uncaught errors
// co will not throw any errors you do not handle!!!
// HANDLE ALL YOUR ERRORS!!!
console.error(err.stack);
}
If I understand correctly, I think you should have a look at the very good async library, especially at async.series. Here is a copy of the snippets from the GitHub page:
async.series([
function(callback){
// do some stuff ...
callback(null, 'one');
},
function(callback){
// do some more stuff ...
callback(null, 'two');
},
],
// optional callback
function(err, results){
// results is now equal to ['one', 'two']
});
// an example using an object instead of an array
async.series({
one: function(callback){
setTimeout(function(){
callback(null, 1);
}, 200);
},
two: function(callback){
setTimeout(function(){
callback(null, 2);
}, 100);
},
},
function(err, results) {
// results is now equals to: {one: 1, two: 2}
});
As a plus this library can also run in the browser.
The simplest way is to increment an integer counter when you start an async operation and then, in the callback, decrement the counter. Depending on the complexity, the callback could check the counter for zero and then delete the file.
A little more complex would be to maintain a list of objects, and each object would have any attributes that you need to identify the operation (it could even be the function call) as well as a status code. The callbacks would set the status code to completed.
Then you would have a loop that waits (using process.nextTick) and checks whether all tasks are completed. The advantage of this method over the counter is that, if all outstanding tasks could complete before all tasks have been issued, the counter technique would cause you to delete the file prematurely.
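A sketch of that task-list idea; it uses setImmediate rather than process.nextTick for the wait loop, since in current Node a recursive nextTick would starve the I/O callbacks it is waiting for (fs is assumed to be required, as in the question):
var tasks = [
  { name: 'do_something',       done: false },
  { name: 'do_something_other', done: false }
];

do_something(tmp_file_name, function (err) { tasks[0].done = true; });
do_something_other(tmp_file_name, function (err) { tasks[1].done = true; });

(function waitForAll() {
  if (tasks.every(function (t) { return t.done; })) {
    fs.unlink(tmp_file_name, function () {}); // everything finished, safe to delete
  } else {
    setImmediate(waitForAll); // check again after pending I/O has had a chance to run
  }
})();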
// simple countdown latch
function CDL(countdown, completion) {
this.signal = function() {
if(--countdown < 1) completion();
};
}
// usage
var latch = new CDL(10, function() {
console.log("latch.signal() was called 10 times.");
});
There is no "native" solution, but there are a million flow control libraries for node. You might like Step:
Step(
  function() {
    do_something(tmp_file_name, this.parallel());
    do_something_else(tmp_file_name, this.parallel());
  },
  function(err) {
    if (err) throw err;
    fs.unlink(tmp_file_name);
  }
);
Or, as Michael suggested, counters could be a simpler solution. Take a look at this semaphore mock-up. You'd use it like this:
do_something1(file, queue('myqueue'));
do_something2(file, queue('myqueue'));
queue.done('myqueue', function(){
fs.unlink(file);
});
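The linked mock-up isn't reproduced here, but a minimal version of such a queue()/queue.done() helper could look like this:
var counts = {};
var drains = {};

function queue(name) {
  counts[name] = (counts[name] || 0) + 1;
  return function () {               // the callback handed to each async task
    if (--counts[name] === 0 && drains[name]) {
      drains[name]();                // the last task to finish runs the drain handler
    }
  };
}

queue.done = function (name, fn) {
  drains[name] = fn;
  if (counts[name] === 0) fn();      // covers the case where everything already finished
};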
I'd like to offer another solution that utilizes the speed and efficiency of the programming paradigm at the very core of Node: events.
Everything you can do with Promises or modules designed to manage flow-control, like async, can be accomplished using events and a simple state-machine, which I believe offers a methodology that is, perhaps, easier to understand than other options.
For example, assume you wish to sum the lengths of multiple files in parallel:
const fs = require('fs'); // needed for fs.readFile below
const EventEmitter = require('events').EventEmitter;
// simple event-driven state machine
const sm = new EventEmitter();
// running state
let context = {
  tasks: 0,    // number of total tasks
  active: 0,   // number of active tasks
  results: []  // task results
};
const next = (result) => { // must be called when each task chain completes
  if (result) { // preserve result of task chain
    context.results.push(result);
  }
  // decrement the number of running tasks
  context.active -= 1;
  // when all tasks complete, trigger done state
  if (!context.active) {
    sm.emit('done');
  }
};
// operational states
// start state - initializes context
sm.on('start', (paths) => {
  const len = paths.length;
  console.log(`start: beginning processing of ${len} paths`);
  context.tasks = len;  // total number of tasks
  context.active = len; // number of active tasks
  sm.emit('forEachPath', paths); // go to next state
});
// start processing of each path
sm.on('forEachPath', (paths) => {
  console.log(`forEachPath: starting ${paths.length} process chains`);
  paths.forEach((path) => sm.emit('readPath', path));
});
// read contents from path
sm.on('readPath', (path) => {
  console.log(`  readPath: ${path}`);
  fs.readFile(path, (err, buf) => {
    if (err) {
      sm.emit('error', err);
      return;
    }
    sm.emit('processContent', buf.toString(), path);
  });
});
// compute length of path contents
sm.on('processContent', (str, path) => {
  console.log(`  processContent: ${path}`);
  next(str.length);
});
// when processing is complete
sm.on('done', () => {
  const total = context.results.reduce((sum, n) => sum + n);
  console.log(`The total of ${context.tasks} files is ${total}`);
});
// error state
sm.on('error', (err) => { throw err; });
// ======================================================
// start processing - ok, let's go
// ======================================================
sm.emit('start', ['file1', 'file2', 'file3', 'file4']);
Which will output:
start: beginning processing of 4 paths
forEachPath: starting 4 process chains
readPath: file1
readPath: file2
processContent: file1
readPath: file3
processContent: file2
processContent: file3
readPath: file4
processContent: file4
The total of 4 files is 4021
Note that the ordering of the process chain tasks is dependent upon system load.
You can envision the program flow as:
start -> forEachPath -+-> readPath1 -> processContent1 -+-> done
                      +-> readPath2 -> processContent2 -+
                      +-> readPath3 -> processContent3 -+
                      +-> readPath4 -> processContent4 -+
For reuse, it would be trivial to create a module to support the various flow-control patterns, i.e. series, parallel, batch, while, until, etc.
The simplest solution is to run the do_something* and unlink in sequence as follows:
do_something(tmp_file_name, function(err) {
  do_something_other(tmp_file_name, function(err) {
    fs.unlink(tmp_file_name);
  });
});
Unless, for performance reasons, you want to execute do_something() and do_something_other() in parallel, I suggest keeping it simple and going this way.
Wait.for https://github.com/luciotato/waitfor
using Wait.for:
var wait=require('wait.for');
...in a fiber...
wait.for(do_something,tmp_file_name);
wait.for(do_something_other,tmp_file_name);
fs.unlink(tmp_file_name);
With pure Promises it could be a bit more messy, but if you use Deferred Promises then it's not so bad:
Install:
npm install --save @bitbar/deferred-promise
Modify your code:
const DeferredPromise = require('@bitbar/deferred-promise');
const promises = [
  new DeferredPromise(),
  new DeferredPromise()
];

do_something(tmp_file_name, (err) => {
  if (err) {
    promises[0].reject(err);
  } else {
    promises[0].resolve();
  }
});

do_something_other(tmp_file_name, (err) => {
  if (err) {
    promises[1].reject(err);
  } else {
    promises[1].resolve();
  }
});

Promise.all(promises).then(() => {
  fs.unlink(tmp_file_name);
});
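For comparison, a sketch of the same thing with plain Promise constructors and no extra dependency (fs.unlink is given a callback, as modern Node requires):
const p1 = new Promise((resolve, reject) => {
  do_something(tmp_file_name, (err) => err ? reject(err) : resolve());
});
const p2 = new Promise((resolve, reject) => {
  do_something_other(tmp_file_name, (err) => err ? reject(err) : resolve());
});

Promise.all([p1, p2]).then(() => {
  fs.unlink(tmp_file_name, () => {});
});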
