I have been writing a lot of NodeJS recently, and that has forced me to attack some problems from a different perspective. I was wondering what patterns have developed for processing chunks of data sequentially (rather than in parallel) in an asynchronous request environment, but I haven't been able to find anything directly relevant.
So to summarize the problem:
I have a list of data stored in an array format that I need to process.
I have to send this data to a service asynchronously, but the service will only accept a few at a time.
The data must be processed sequentially to meet the restrictions of the service, meaning that firing off a number of parallel asynchronous requests is not allowed.
Working in this domain, the simplest pattern I've come up with is a recursive one. Something like
function processData(data, start, step, callback){
    if(start < data.length){
        // Take the next `step` items (Array#slice, not split)
        var chunk = data.slice(start, start + step);
        queryService(chunk, start, step, function(e, d){
            // Assume no errors
            // Could possibly do some matching between d and 'data' here to
            // update data with anything that the service may have returned
            processData(data, start + step, step, callback);
        });
    }
    else{
        callback(data);
    }
}
Conceptually, this steps through each chunk, but it feels more convoluted than it should be. I feel like there should be a simpler way of doing this. Does anyone have a pattern they tend to follow when approaching this kind of problem?
My first thought would be to rely on object encapsulation: create an object that holds everything about what needs to be processed, what has already been processed, and what is currently being processed, and have the callback simply call the object's 'next' function, which in turn starts processing the next piece of data and updates the object. Essentially, it would work like an asynchronous for-loop.
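For instance, a minimal sketch of that asynchronous for-loop idea using async/await (assuming queryService can be wrapped to return a promise; processInChunks and queryServiceAsync are made-up names for illustration):
// Hypothetical helper: processes `data` in sequential chunks of `step` items.
// Assumes queryServiceAsync(chunk) returns a promise (e.g. via util.promisify).
async function processInChunks(data, step) {
    for (let start = 0; start < data.length; start += step) {
        const chunk = data.slice(start, start + step);
        // Each chunk is awaited before the next is sent,
        // so the service only ever sees one request at a time.
        await queryServiceAsync(chunk);
    }
    return data;
}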
Related
I am currently working on a backend with NestJS which provides REST endpoints for my frontend. In some endpoints I receive, for example, an array of elements which I need to process.
Concrete Example:
I receive an array of 50 elements. For each element I need to make a SQL request. Therefore I need to loop over the array and do stuff in SQL.
I always ask myself: at what number of elements should I start using, for example, worker threads so that I don't block the event loop?
Maybe I have misunderstood how blocking the event loop works, and someone can enlighten me.
I don't think that you'll need worker threads in this scenario. As long as the SQL queries are executed asynchronously, i.e. the query calls do not block, you will be fine. You can use Promise.all to speed up the processing of the loop, as the queries will be executed in parallel, e.g.
const dbQueryPromises = [];
for(const entry of data) {
    dbQueryPromises.push(dbConnection.query(buildQuery(entry)));
}
await Promise.all(dbQueryPromises);
If, however, your code performs computation-heavy operations inside the loop, then you should consider worker threads, as long-running operations on your call stack will block the event loop.
Only use them if you need to do CPU-intensive tasks with large amounts of data. They allow you to avoid the serialization step for the data. 50 is not enough, I believe.
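For the CPU-bound case, a rough sketch of offloading work to a worker thread might look like this (the worker script name heavy-work.js is a placeholder; it would do the heavy computation on workerData and post the result back via parentPort):
// Wrap a one-off worker thread in a promise.
const { Worker } = require('worker_threads');

function runInWorker(payload) {
    return new Promise((resolve, reject) => {
        // './heavy-work.js' is a placeholder worker script.
        const worker = new Worker('./heavy-work.js', { workerData: payload });
        worker.on('message', resolve);
        worker.on('error', reject);
        worker.on('exit', (code) => {
            if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
        });
    });
}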
I'm starting to play around with node.js and I have an application which basically iterates over tens of thousands of objects, performs various asynchronous HTTP requests for each of them, and populates the objects with data returned from those requests.
This question is more about best practices with Node.js, non-blocking operations, and probably pooling.
Forgive me if I'm using the wrong terms, as I'm new to this, and please don't hesitate to correct me.
So below is a brief summary of the code
I have got a loop which iterates over thousands of objects:
// Loop, briefly summarized
for (var i = 0; i < arrayOfObjects.length; i++) {
    do_something(arrayOfObjects[i], function (err, result){
        if(err){
            // various logging
        }else{
            console.log(result);
        }
    });
}
// do_something, briefly summarized
function do_something(obj, callback){
    http.request(url1, function(err, result){
        if(!err){
            insert_in_db(result.value1, function (err, result){
                // Another asynchronous http request
            });
        }else{
            // various error logging
        }
    });
    http.request(url2, function(err, result){
        // some logic, including a db call
    });
}
In reality there is more complex logic inside do_something, but that's not really the point right now.
So my problem are the following
I think the main issue is that my loop is not really optimized, because it behaves like a blocking operation.
So the first HTTP request results within do_something only become available after the loop has finished processing, and then everything cascades from there.
Is there a way to make some kind of pool of at most 10 or 20 simultaneous executions of do_something, with the rest queued until a pool resource becomes available?
I hope I explained myself clearly; don't hesitate to ask me if you need more details.
Thanks in advance for your feedback,
Anselme
Your loop isn't blocking, per se, but it's not optimal. One of the things it does is schedule arrayOfObjects.length HTTP requests. Those requests will all be scheduled right away, as your loop progresses.
In older versions of node.js, you would have had the benefit of a default of 5 concurrent requests per host, but that default was later changed.
But then the actual opening of sockets, sending of requests, and waiting for responses will happen individually for each iteration. And each entry will finish in its own time (depending, in this case, on the remote host, or e.g. database response times, etc.).
Take a look at async, vasync, or one of their many alternatives, as suggested in the comments, for pooling.
You can take it even a step further and use something like Bluebird's Promise.map with the concurrency option set, depending on your use case.
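A minimal sketch of that approach, assuming Bluebird is installed and do_something has been wrapped into a promise-returning doSomethingAsync (a made-up name here):
const Promise = require('bluebird');

// Run doSomethingAsync over all objects, but never more than 10 at a time.
Promise.map(arrayOfObjects, function (obj) {
    return doSomethingAsync(obj); // hypothetical promise-returning wrapper
}, { concurrency: 10 })
    .then(function (results) {
        console.log('All done:', results.length);
    })
    .catch(function (err) {
        // various logging
        console.error(err);
    });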
I'm trying to render a page via something similar to this:
var content = '';
db.query(imageQuery, function(images){
    content += images;
});
db.query(userQuery, function(users){
    content += users;
});
response.end('<div id="page">' + content + '</div>');
Unfortunately, content is empty. I already know that these asynchronous queries cause the problem, but I can't find a way to fix it.
Could somebody please help me out with this?
The problem with your code is that you're saying "go do these two things for a while, and then send my response." In other words, you've told node to go into the other room to get the next pages of a book, and told it what to do when it's done, but then, while it was out of the room, you just continued trying to read the book without the new pages.
What you need to do is instead send your response only when the two database queries are done.
There are several ways you can do that, how you do it is up to you.
You can chain the queries. This is inefficient since you're doing one query, waiting for it to return, doing the second, waiting for it to return and then sending your response, but it's the most basic way to do it.
var content = '';
db.query(imageQuery, function(images){
    content += images;
    db.query(userQuery, function(users){
        content += users;
        response.end('<div id="page">' + content + '</div>');
    });
});
See how the response.end is now inside the last db.query's callback, which is inside the first db.query's callback? This does, however, guarantee the order of operations: your first query will ALWAYS complete first.
You could also write some sort of primitive latching system to run the queries in parallel. This is a little more efficient (they don't necessarily happen simultaneously, but it'll be faster than chaining them). However, with this method you can't guarantee the order of operations.
var _latch = 0;
var resp = '';
var complete = function(content){
    resp += content;
    ++_latch;
    if(_latch === 2){
        response.end('<div id="page">' + resp + '</div>');
    }
};
db.query(imageQuery, complete);
db.query(userQuery, complete);
So what you're doing there is saying "run these queries and then call the same function." That function aggregates the responses and counts the number of times it has been called. When it has been called as many times as there are queries, it returns the results to the user.
These are the two basic ways of handling multiple asynchronous methods. However, there are a lot of utilities to help you do this so you don't have to handle it manually.
async is a great library that will help you run async functions in series, parallel, waterfall, etc. Takes a TON of pain out of async management.
runnel is a similar library, but with a much smaller focus than async
q or bluebird are promise libraries implementing Promises/A+. These provide a different concept of flow control (if you're familiar with jQuery's deferred object, this is the idea it was trying to implement).
You can read more about promises here, but a quick google will also help explain the concept.
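For example, here is a rough sketch of the original two-query problem using plain promises (assuming db.query can be wrapped so that a hypothetical queryAsync returns a promise):
// Run both queries in parallel and respond once both have resolved.
Promise.all([queryAsync(imageQuery), queryAsync(userQuery)])
    .then(function (results) {
        var images = results[0];
        var users = results[1];
        response.end('<div id="page">' + images + users + '</div>');
    })
    .catch(function (err) {
        // log and send an error response instead
        console.error(err);
    });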
In Node.js, I can do almost any async operation one of two ways:
var file = fs.readFileSync('file.html')
or...
var file
fs.readFile('file.html', function (err, data) {
    if (err) throw err
    console.log(data)
})
Is the only benefit of the async one custom error handling? Or is there really a reason to have the file read operation non-blocking?
These exist mostly because node itself needs them to load your program's modules from disk when your program starts. More broadly, it is typical to do a bunch of synchronous setup IO when a service is initially started, but prior to accepting network connections. Once the program is ready to go (it has its TLS cert loaded, the config file has been read, etc.), then a network socket is bound, and from that point on everything is async.
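A small sketch of that startup pattern (the file names, port field and handler are placeholders):
const fs = require('fs');
const http = require('http');

// Synchronous, blocking reads are fine here: nothing is listening yet.
const config = JSON.parse(fs.readFileSync('config.json', 'utf8'));

// From here on, everything is asynchronous.
http.createServer(function (req, res) {
    fs.readFile('file.html', function (err, data) {
        if (err) { res.statusCode = 500; return res.end(); }
        res.end(data);
    });
}).listen(config.port);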
Asynchronous calls allow for the branching of execution chains and the passing of results through that execution chain. This has many advantages.
For one, the program can execute two or more calls at the same time, and do work on the results as they complete, not necessarily in the order they were first called.
For example if you have a program waiting on two events:
var file1;
var file2;
//Let's say this takes 2 seconds
fs.readFile('bigfile1.jpg', function (err, data) {
    if (err) throw err;
    file1 = data;
    console.log("FILE1 Done");
});
//Let's say this takes 1 second.
fs.readFile('bigfile2.jpg', function (err, data) {
    if (err) throw err;
    file2 = data;
    console.log("FILE2 Done");
});
console.log("DO SOMETHING ELSE");
In the case above, bigfile2.jpg will return first and something will be logged after only 1 second. So your output timeline might be something like:
#0:00: DO SOMETHING ELSE
#1:00: FILE2 Done
#2:00: FILE1 Done
Notice above that "DO SOMETHING ELSE" was logged right away, File2 finished first after only 1 second, and at 2 seconds File1 was done. Everything was done within a total of 2 seconds, though the callback order was unpredictable.
Whereas doing it synchronously it would look like:
file1 = fs.readFileSync('bigfile1.jpg');
console.log("FILE1 Done");
file2 = fs.readFileSync('bigfile2.jpg');
console.log("FILE2 Done");
console.log("DO SOMETHING ELSE");
And the output might look like:
#2:00: FILE1 Done
#3:00: FILE2 Done
#3:00 DO SOMETHING ELSE
Notice it takes a total of 3 seconds to execute, but the order is how you called it.
Doing it synchronously typically takes longer for everything to finish (especially for external processes like filesystem reads, writes or database requests) because you are waiting for one thing to complete before moving onto the next. Sometimes you want this, but usually you don't. It can be easier to program synchronously sometimes though, since you can do things reliably in a particular order (usually).
When you execute filesystem methods asynchronously, however, your application can continue with other, non-filesystem-related tasks without waiting for the filesystem work to complete. In general, you can keep doing other work while the system waits for asynchronous operations to finish. This is why database queries, filesystem access and communication requests are generally handled with asynchronous methods: they allow other work to be done while waiting for I/O and off-system operations to complete.
When you get into more advanced asynchronous method chaining, you can do some really powerful things, like creating scopes (using closures and the like) with a small amount of code, and building responders for certain events in the event loop.
Sorry for the long answer. There are many reasons why you have the option to do things synchronously or not, but hopefully this will help you decide whether either method is best for you.
The benefit of the asynchronous version is that you can do other stuff while you wait for the IO to complete.
fs.readFile('file.html', function (err, data) {
    if (err) throw err;
    console.log(data);
});
// Do a bunch more stuff here.
// All this code will execute before the callback to readFile,
// but the IO will be happening concurrently. :)
You want to use the async version when you are writing event-driven code where responding to requests quickly is paramount. The canonical example for Node is writing a web server. Let's say you have a user making a request which is such that the server has to perform a bunch of IO. If this IO is performed synchronously, the server will block. It will not answer any other requests until it has finished serving this request. From the perspective of the users, performance will seem terrible. So in a case like this, you want to use the asynchronous versions of the calls so that Node can continue processing requests.
The sync version of the IO calls is there because Node is not used only for writing event-driven code. For instance, if you are writing a utility which reads a file, modifies it, and writes it back to disk as part of a batch operation, like a command line tool, using the synchronous version of the IO operations can make your code easier to follow.
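A small sketch of that kind of one-shot command line tool (the file names and the transformation are placeholders):
const fs = require('fs');

// Blocking calls are fine here: nothing else is waiting on this process.
const input = fs.readFileSync('input.txt', 'utf8');
const output = input.toUpperCase(); // stand-in for the real modification
fs.writeFileSync('output.txt', output);
console.log('done');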
I'm using Redis to generate IDs for my in-memory models. The Redis client requires a callback for the INCR command, which means the code looks like
client.incr('foo', function(err, id) {
    // ... continue on here
});
The problem is, that I already have written the other part of the app, that expects the incr call to be synchronous and just return the ID, so that I can use it like
var id = client.incr('foo');
The reason I got to this problem is that, up until now, I was generating the IDs in memory with a simple closure counter function, like
var counter = (function() {
    var count = 0;
    return function() {
        return ++count;
    };
})();
to simplify the testing and just general setup.
Does this mean that my app is flawed by design and I need to rewrite it to expect callback on generating IDs? Or is there any simple way to just synchronize the call?
Node.js in its essence is an async I/O library (with plugins). So, by definition, there's no synchronous I/O there and you should rewrite your app.
It is a bit of a pain, but what you have to do is wrap the logic that you had after the counter was generated into a function, and call that from the Redis callback. If you had something like this:
var id = get_synchronous_id();
processIdSomehow(id);
you'll need to do something like this.
var runIdLogic = function(id){
    processIdSomehow(id);
};

client.incr('foo', function(err, id) {
    runIdLogic(id);
});
You'll need the appropriate error checking, but something like that should work for you.
There are a couple of sequential programming layers for Node (such as TameJS) that might help with what you want, but those generally do recompilation or things like that: you'll have to decide how comfortable you are with that if you want to use them.
@Sergio said this briefly in his answer, but I wanted to write a slightly more expanded one. node.js is asynchronous by design. It runs in a single thread, which means that in order to remain fast and handle many concurrent operations, all blocking calls must take a callback for their return value so that they can run asynchronously.
That does not mean that synchronous calls are not possible. They are, and it's a matter of how much you trust 3rd party plugins. If someone decides to write a call in their plugin that does block, you are at the mercy of that call, and it might even be something internal that is not exposed in their API. Thus, it can block your entire app. Consider what might happen if Redis took a significant amount of time to return, and then multiply that by the number of clients that could potentially be hitting that same routine. The entire logic has been serialized and they all wait.
In answer to your last question, you should not work towards accommodating a blocking approach. It may seem like a simple solution now, but it runs counter to the benefits of node.js in the first place. If you are simply more comfortable with a synchronous design workflow, you may want to consider another framework that is designed that way (with threads). If you want to stick with node.js, rewrite your existing logic to conform to a callback style. From the code examples I have seen, it tends to look like a nested set of functions, as one callback uses another callback, and so on, until everything can return back up that stack.
The application state in node.js is normally passed around as an object. What I would do is closer to:
var state = {};

client.incr('foo', function(err, id) {
    state.id = id;
    doSomethingWithId(state.id);
});

function doSomethingWithId(id) {
    // reuse state if necessary
}
It's just a different way of doing things.