Non blocking Loop in Node.js and pooling? - node.js

I'm starting to play around with node.js and I have an application which basically iterating over dozens thousands of object and performing some various asynchronous http requests for all of them and populate the object with various data returned from the http requests..
This question is more concerning best practices with Node.js, non blocking operations and probably related to pooling.
Forgive me If I'm using the wrong term, as I'm new to this and please don't hesitate to correct me.
So below is a brief summary of the code
I have got a loop which kind of doing iterating over thousands
//Loop briefly summarized
for (var i = 0; i < arrayOfObjects.length; i++) {
do_something(arrayOfObjects[i], function (error, result){
if(err){
//various log
}else{
console.log(result);
}
});
}
//dosomething briefly summarized
function do_something (Object, callback){
http.request(url1, function(err, result){
if(!err){
insert_in_db(result.value1, function (error,result){
//Another http request with asynchronous
});
}else{
//various logging error
}
});
http.request(url2, function(err, result){
//some various logic including db call
});
}
In reality in do_something there is a complex logic but it's not really the matter right now
So my problem are the following
I think the main issue is in my loop is not really optimized because it's kind of a blocking event.
So the first http request results within dosomething are avaialble are after the loops is finished processing and then it's cascading.
If there a way somehow to make kind of pool of 10 or 20 max simualtenous execution of do_something and the rest are kind of queued when a pool ressource is available?
I hope I clearly explained myself , don't hesitate to ask me if I need to details.
Thanks in advance for your feedbacks,
Anselme

Your loop isn't blocking, per se, but it's not optimal. One of the things it does is schedules arrayOfObjects.length http requests. Those requests will all be scheduled right away, as your loop progresses.
In older versions of node.js, you would have had the benefit of default of 5 concurrent requests per host, but that default is later changed.
But then the actual opening of sockets, sending requests, waiting for responses, this will be individual for each loop. And each entry will finish in it's own time (depending, in this case, on the remote host, or e.g. database response times etc).
Take a look at async, vasync, or some of it's many alternatives, as suggested in comments, for pooling.
You can take it even a step further and use something like Bluebird Promise.map, with concurrency option set, depending on your use case.

Related

Are Database connections asynchronous in Node?

Node is single-threaded, but there are a lot of functions(modules like http, fs) that allow us to do a background task and the event loop takes care of executing the callbacks.
However, is this true for a database connection?
Let's say I have the following code.
const mysql = require('mysql');
function callDatabase(id) {
var result;
var connection = mysql.createConnection(
{
host : '192.168.1.14',
user : 'root',
password : '',
database : 'test'
}
);
connection.connect();
var queryString = 'SELECT name FROM test WHERE id = 1';
connection.query(queryString, function(err, rows, fields) {
if (err) throw err;
for (var i in rows) {
result = rows[i].name;
}
connection.end();
return result;
});
}
Does, mysql.createConnection, connection.connect, connection.query, connection.end spin up a new thread to execute in the background, leaving Node to run the remaining synchronous code?
If yes, in what queue will the callback be enqueued and how to write this sort of code such that a background task is initiated.
Anything that may be blocking (file system operations, network connections, etc) are generally asynchronous in Node, in order to avoid blocking on the main thread. That these functions take a parameter for a callback function is a sure hint that you have asynchronous operations (or "background tasks") in progress.
You don't show it in your sample code, but connect() and end() do take callback functions so you know when a connection is actually made or ends. It looks like the mysql library, however, also maintains an internal queue to make sure you can't attempt a query until a connection has been made and that only one operation at a time can be executed.
Note that createConnection() does not have a callback function. All it does is create a new data structure (connection) that gets used. It doesn't do any I/O itself, so doesn't need to run asynchronously.
Also note that you don't generally "spin up" your own threads. Node takes care of this thread management for you (largely by running things on the main worker thread), for the most part, and hides how threads themselves work for most developers. You typically hear that Node is "single threaded", and you should treat it this way.
Modern Node code makes extensive use of async/await and Promises to do this sort of thing. Slightly older code uses callback functions. Even older code uses Node events. In reality - if you dig far enough down, they're all using events and possibly presenting the simplified (more modern) interfaces.
The mysql module appears to date from the "callback" era and hasn't yet been updated for Promises/async/await. Under the covers, as noted, it uses Node events to track network (or unix domain socket) connections and transfers.

Patterns for asynchronous but sequential requests

I have been writing a lot of NodeJS recently and that has forced me to attack some problems from a different perspective. I was wondering what patterns had developed for the problem of processing chunks of data sequentially (rather than in parallel) in an asynchronous request-environment, but I haven't been able to find anything directly relevant.
So to summarize the problem:
I have a list of data stored in an array format that I need to process.
I have to send this data to a service asynchronously, but the service will only accept a few at a time.
The data must be processed sequentially to meet the restrictions on the service, meaning making a number of parallel asynchronous requests is not allowed
Working in this domain, the simplest pattern I've come up with is a recursive one. Something like
function processData(data, start, step, callback){
if(start < data.length){
var chunk = data.split(start, step);
queryService(chunk, start, step, function(e, d){
//Assume no errors
//Could possibly do some matching between d and 'data' here to
//Update data with anything that the service may have returned
processData(data, start+step, step, callback);
});
}
else{
callback(data);
}
}
Conceptually, this should step through each item, but it's intuitively complex. I feel like there should be a simpler way of doing this. Does anyone have a pattern they tend to follow when approaching this kind of problem?
My first thought process would be to rely on object encapsulation. Create an object that contains all of the information about what needs to be processed and all of the relevant data about what has been processed and is being processed and the callback function will just call the 'next' function for the object, which will in turn start processing on the next piece of data and update the object. Essentially working like a n asynchronous for-loop.

Rendering page asynchronously issue

I'm trying to render a page via something similar to this:
var content = '';
db.query(imageQuery,function(images){
content += images;
});
db.query(userQuery,function(users){
content += users;
});
response.end('<div id="page">'+content+'</div>');
Unfortunately, content is empty. I already know that these Asynchronous Queries cause the problem, but I can't find a way to fix it.
Somebody please helps me out of this.
The problem with your code is that you're saying "go do these two things for a while and then send my response." -- in other words you've told node to go into the other room to get the next pages of a book, and told it to do when it was done doing that, but then when it was out of the room, you just continued trying to read the book without the new pages.
What you need to do is instead send your response only when the two database queries are done.
There are several ways you can do that, how you do it is up to you.
You can chain the queries. This is inefficient since you're doing one query, waiting for it to return, doing the second, waiting for it to return and then sending your response, but it's the most basic way to do it.
var content = '';
db.query(imageQuery,function(images){
content += images;
db.query(userQuery,function(users){
content += users;
response.end('<div id="page">'+content+'</div>');
});
});
See how the response.end is now inside the last db.query's callback, which inside the first db.query's callback? This guarantees order of operations however. Your first query will ALWAYS complete first.
You could also write some sort of primitive latching system to run the queries in parallel. This is a little more efficient (they don't necessarily happen simultaneously, but it'll be faster than chaining them.) However, with this method you can't guarantee order of operations.
var _latch = 0;
var resp = '';
var complete = function(content){
resp += content;
++_latch;
if(_latch === 2){
response.end('<div id="page">'+resp+'</div>');
}
};
db.query(imageQuery, complete);
db.query(userQuery, complete);
So what you're doing there is saying run these queries and then call the same function. That function aggregates the responses and then counts the number of time it's been called. When it's been called the number of times you're making queries, it then returns the results to the user.
These are the two basic ways of handling multiple asynchronous methods. However, there are a lot of utilities to help you do this so you don't have to handle it manually.
async is a great library that will help you run async functions in series, parallel, waterfall, etc. Takes a TON of pain out of async management.
runnel is a similar library, but with a much smaller focus than async
q or bluebird are promises librarys implementing promises/a+. This provides a different concept behind flow control (if you're familiar with jQuery's deferred object, this is the idea that they were trying to implement.
You can read more about promises here, but a quick google will also help explain the concept.

Parallel Request at different paths in NodeJS: long running path 1 is blocking other paths

I am trying out simple NodeJS app so that I could to understand the async nature.
But my problem is as soon as I hit "/home" from browser it waits for response and simultaneously when "/" is hit, it waits for the "/home" 's response first and then responds to "/" request.
My concern is that if one of the request needs heavy processing, in parallel we can't request another one? Is this correct?
app.get("/", function(request, response) {
console.log("/ invoked");
response.writeHead(200, {'Content-Type' : 'text/plain'});
response.write('Logged in! Welcome!');
response.end();
});
app.get("/home", function(request, response) {
console.log("/home invoked");
var obj = {
"fname" : "Dead",
"lname" : "Pool"
}
for (var i = 0; i < 999999999; i++) {
for (var i = 0; i < 2; i++) {
// BS
};
};
response.writeHead(200, {'Content-Type' : 'application/json'});
response.write(JSON.stringify(obj));
response.end();
});
Good question,
Now, although Node.js has it's asynchronous nature, this piece of code:
for (var i = 0; i < 999999999; i++) {
for (var i = 0; i < 2; i++) {
// BS
};
};
Is not asynchronous actually blocking the node main thread. And therefore, all other requests has to wait until this big for loop will end.
In order to do some heavy calculations in parallel I recommend using setTimeout or setInterval to achieve your goal:
var i=0;
var interval = setInterval(function() {
if(i++>=999999999){
clearInterval(interval);
}
//do stuff here
},5);
For more information I recommend searching for "Node.js event loop"
As Stasel, stated, code running like will block the event loop. Basically whenever javascript is running on the server, nothing else is running. Asynchronous I/O events such as disk I/O might be processing in the background, but their handler/callback won't be call unless your synchronous code has finished running. Basically as soon as it's finished, node will check for pending events to be handled and call their handlers respectively.
You actually have couple of choices to fix this problem.
Break the work in pieces and let the pending events be executed in between. This is almost same as Stasel's recommendation, except 5ms between a single iteration is huge. For something like 999999999 items, that takes forever. Firstly I suggest batch process the loop for about sometime, then schedule next batch process with setimmediate. setimmediate basically will schedule it after the pending I/O events are handled, so if there is not new I/O event to be handled(like no new http requests) then it will executed immediately. It's fast enough. Now the question comes that how much processing should we do for each batch/iteration. I suggest first measure how much does it on average manually, and for schedule about 50ms of work. For example if you have realized 1000 items take 100ms. Then let it process 500 items, so it will be 50ms. You can break it down further, but the more broken down, the more time it takes in total. So be careful. Also since you are processing huge amount of items, try not to make too much garbage, so the garbage collector won't block it much. In this not-so-similar question, I've explained how to insert 10000 documents into MongoDB without blocking the event loop.
Use threads. There are actually a couple nice thread implementations that you won't shoot yourself in foot with them. This is really a good idea for this case, if you are looking for performance for huge processings, since it would be tricky as I said above to implement CPU bound task playing nice with other stuff happening in the same process, asynchronous events are perfect for data-bound task not CPU bound tasks. There's nodejs-threads-a-gogo module you can use. You can also use node-webworker-threads which is built on threads-a-gogo, but with webworker API. There's also nPool, which is a bit more nice looking but less popular. They all support thread pools and should be straight forward to implement a work queue.
Make several processes instead of threads. This might be slower than threads, but for huge stuff still way better than iterating in the main process. There's are different ways. Using processes will bring you a design that you can extend it to using multiple machines instead of just using multiple CPUs. You can either use a job-queue(basically pull the next from the queue whenever finished a task to process), a multi process map-reduce or AWS elastic map reduce, or using nodejs cluster module. Using cluster module you can listen to unix domain socket on each worker and for each job just make a request to that socket. Whenever the worker finished processing the job, it will just write back to that particular request. You can search about this stuff, there are many implementations and modules existing already. You can use 0MQ, rabbitMQ, node built-in ipc, unix domain sockets or a redis queue for multi process communications.

Why does Node.js have both async & sync version of fs methods?

In Node.js, I can do almost any async operation one of two ways:
var file = fs.readFileSync('file.html')
or...
var file
fs.readFile('file.html', function (err, data) {
if (err) throw err
console.log(data)
})
Is the only benefit of the async one custom error handling? Or is there really a reason to have the file read operation non-blocking?
These exist mostly because node itself needs them to load your program's modules from disk when your program starts. More broadly, it is typical to do a bunch a synchronous setup IO when a service is initially started but prior to accepting network connections. Once the program is ready to go (has it's TLS cert loaded, config file has been read, etc), then a network socket is bound and at that point everything is async from then on.
Asynchronous calls allow for the branching of execution chains and the passing of results through that execution chain. This has many advantages.
For one, the program can execute two or more calls at the same time, and do work on the results as they complete, not necessarily in the order they were first called.
For example if you have a program waiting on two events:
var file1;
var file2;
//Let's say this takes 2 seconds
fs.readFile('bigfile1.jpg', function (err, data) {
if (err) throw err;
file1 = data;
console.log("FILE1 Done");
});
//let's say this takes 1 second.
fs.readFile('bigfile2.jpg', function (err, data) {
if (err) throw err;
file2 = data;
console.log("FILE2 Done");
});
console.log("DO SOMETHING ELSE");
In the case above, bigfile2.jpg will return first and something will be logged after only 1 second. So your output timeline might be something like:
#0:00: DO SOMETHING ELSE
#1:00: FILE2 Done
#2:00: FILE1 Done
Notice above that the log to "DO SOMETHING ELSE" was logged right away. And File2 executed first after only 1 second... and at 2 seconds File1 is done. Everything was done within a total of 2 seconds though the callBack order was unpredictable.
Whereas doing it synchronously it would look like:
file1 = fs.readFileSync('bigfile1.jpg');
console.log("FILE1 Done");
file2 = fs.readFileSync('bigfile2.jpg');
console.log("FILE2 Done");
console.log("DO SOMETHING ELSE");
And the output might look like:
#2:00: FILE1 Done
#3:00: FILE2 Done
#3:00 DO SOMETHING ELSE
Notice it takes a total of 3 seconds to execute, but the order is how you called it.
Doing it synchronously typically takes longer for everything to finish (especially for external processes like filesystem reads, writes or database requests) because you are waiting for one thing to complete before moving onto the next. Sometimes you want this, but usually you don't. It can be easier to program synchronously sometimes though, since you can do things reliably in a particular order (usually).
Executing filesystem methods asynchronously however, your application can continue executing other non-filesystem related tasks without waiting for the filesystem processes to complete. So in general you can continue executing other work while the system waits for asynchronous operations to complete. This is why you find database queries, filesystem and communication requests to generally be handled using asynchronous methods (usually). They basically allow other work to be done while waiting for other (I/O and off-system) operations to complete.
When you get into more advanced asynchronous method chaining you can do some really powerful things like creating scopes (using closures and the like) with a little amount of code and also create responders to certain event loops.
Sorry for the long answer. There are many reasons why you have the option to do things synchronously or not, but hopefully this will help you decide whether either method is best for you.
The benefit of the asynchronous version is that you can do other stuff while you wait for the IO to complete.
fs.readFile('file.html', function (err, data) {
if (err) throw err;
console.log(data);
});
// Do a bunch more stuff here.
// All this code will execute before the callback to readFile,
// but the IO will be happening concurrently. :)
You want to use the async version when you are writing event-driven code where responding to requests quickly is paramount. The canonical example for Node is writing a web server. Let's say you have a user making a request which is such that the server has to perform a bunch of IO. If this IO is performed synchronously, the server will block. It will not answer any other requests until it has finished serving this request. From the perspective of the users, performance will seem terrible. So in a case like this, you want to use the asynchronous versions of the calls so that Node can continue processing requests.
The sync version of the IO calls is there because Node is not used only for writing event-driven code. For instance, if you are writing a utility which reads a file, modifies it, and writes it back to disk as part of a batch operation, like a command line tool, using the synchronous version of the IO operations can make your code easier to follow.

Resources