Long (IO-bound) requests in Express - Node.js

I have an Express route which has to wait for data to arrive from an external device over a serial port (or another network connection). This can take up to two seconds. I understand that if my GET handler blocks, it blocks the entire Node process, so I want to avoid this:
app.get('/ext-data', function(req, res){
  var data = wait_for_external_data();
  res.send(data);
});
I do have an emitter for the external data, so I can get a callback when external data arrives.
I'm unclear on how to tell Express to do other things while my code is waiting for the external data to become available, and how to pass the data on to the response object once I have it.

Generally you would pass a callback to your wait_for_external_data function that will be called once the data is received, and you need to write wait_for_external_data so that it does not block. To do this you would use the event emitter to get the data, as you mentioned. I can give more info if you elaborate on which library you are using to get the data.
app.get('/ext-data', function(req, res){
  wait_for_external_data(function(data){
    res.send(data);
  });
});
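For reference, a non-blocking wait_for_external_data can be a thin wrapper around your emitter. This is only a sketch: the serialPort emitter and its 'data' event are assumptions, since the question doesn't say which library produces the data.

// Sketch only: `serialPort` and its 'data' event stand in for whatever
// emitter your serial/network library exposes.
function wait_for_external_data(callback) {
  serialPort.once('data', function (data) {
    // Hand the data to the route handler; nothing blocks while we wait.
    callback(data);
  });
}

In practice you would probably also add a timeout so a request that never receives data doesn't hang indefinitely.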

Related

Non-blocking loop in Node.js and pooling?

I'm starting to play around with Node.js and I have an application which basically iterates over tens of thousands of objects, performs various asynchronous HTTP requests for each of them, and populates the objects with the data returned from those requests.
This question is more about best practices with Node.js, non-blocking operations, and probably pooling.
Forgive me if I'm using the wrong terms, as I'm new to this, and please don't hesitate to correct me.
Below is a brief summary of the code.
I have a loop which iterates over thousands of objects:
// Loop, briefly summarized
for (var i = 0; i < arrayOfObjects.length; i++) {
  do_something(arrayOfObjects[i], function (err, result) {
    if (err) {
      // various logging
    } else {
      console.log(result);
    }
  });
}
// do_something, briefly summarized
function do_something(obj, callback) {
  http.request(url1, function (err, result) {
    if (!err) {
      insert_in_db(result.value1, function (err, result) {
        // another asynchronous http request
      });
    } else {
      // various error logging
    }
  });
  http.request(url2, function (err, result) {
    // some further logic, including a db call
  });
}
In reality do_something contains more complex logic, but that's not really the issue right now.
So my problems are the following:
I think the main issue is that my loop is not really optimized, because it behaves like a blocking operation.
The first HTTP request results within do_something only become available after the loop has finished processing, and then everything cascades from there.
Is there a way to make a kind of pool of at most 10 or 20 simultaneous executions of do_something, with the rest queued until a pool resource becomes available?
I hope I explained myself clearly; don't hesitate to ask if you need more details.
Thanks in advance for your feedback,
Anselme
Your loop isn't blocking, per se, but it's not optimal. One of the things it does is schedule arrayOfObjects.length HTTP requests. Those requests will all be scheduled right away, as your loop progresses.
In older versions of Node.js you would have had the benefit of a default limit of 5 concurrent requests per host, but that default was changed later.
The actual opening of sockets, sending of requests, and waiting for responses happens separately for each iteration, and each entry will finish in its own time (depending, in this case, on the remote host, database response times, etc.).
Take a look at async, vasync, or one of their many alternatives, as suggested in the comments, for pooling.
You can take it a step further and use something like Bluebird's Promise.map with the concurrency option set, depending on your use case.
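For illustration, here is a sketch of the loop using async's eachLimit, which caps the number of do_something calls in flight at any one time (the limit of 10 is just an example):

var async = require('async');

// Run at most 10 do_something calls concurrently; the rest wait in line.
async.eachLimit(arrayOfObjects, 10, function (obj, done) {
  do_something(obj, function (err, result) {
    if (err) {
      // various logging
    } else {
      console.log(result);
    }
    // Call done() without an error so one failed item doesn't stop the rest,
    // mirroring the behaviour of the original loop.
    done();
  });
}, function () {
  console.log('all objects processed');
});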

Using callbacks with Socket IO

I'm using Node and Socket.IO to stream a Twitter feed to the browser, but the stream is too fast. In order to slow it down, I'm attempting to use setInterval, but it either only delays the start of the stream (without setting evenly spaced intervals between the tweets) or complains that I can't use callbacks when broadcasting. Server-side code below:
function start(){
  stream.on('tweet', function(tweet){
    if(tweet.coordinates && tweet.coordinates != null){
      io.sockets.emit('stream', tweet);
    }
  });
}

io.sockets.on("connection", function(socket){
  console.log('connected');
  setInterval(start, 4000);
});
I think you're misunderstanding how .on() works for streams. It's an event handler. Once it is installed, it's there and the stream can call you at any time. Your interval is actually just making things worse because it's installing multiple .on() handlers.
It's unclear what you mean by "data coming too fast". Too fast for what? If it's just faster than you want to display it, then you can just store the tweets in an array and then use timers to decide when to display things from the array.
If data from a stream is coming too quickly to even store, and this is a flowing Node.js stream, then you can pause the stream with the .pause() method and then, when you're able to go again, call .resume(). See http://nodejs.org/api/stream.html#stream_readable_pause for more info.
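Here is a sketch of the buffering approach described above; the 4-second interval mirrors the question, and the variable names are illustrative:

var tweetBuffer = [];

// Install the stream handler once and just collect tweets as they arrive.
stream.on('tweet', function (tweet) {
  if (tweet.coordinates) {
    tweetBuffer.push(tweet);
  }
});

// Every 4 seconds, broadcast at most one buffered tweet to connected clients.
setInterval(function () {
  var tweet = tweetBuffer.shift();
  if (tweet) {
    io.sockets.emit('stream', tweet);
  }
}, 4000);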

One-shot Streams

The following will not work properly:
var http = require('http');
var fs = require('fs');

var theIndex = fs.createReadStream('index.html');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/html'});
  theIndex.pipe(res);
}).listen(9000);
It works great on the first request, but for all subsequent requests no index.html is sent to the client. The createReadStream call seems to need to be inside the createServer callback. I think I can conceptualize why, but can you articulate it in words? It seems that once the stream has completed, the file handle is closed and the stream must be created again? It can't simply be "restarted"? Is this correct?
Thanks
Streams contain internal state that tracks where they are--in the case of a file stream, a file descriptor object, a read buffer, and the current position the file has been read to. Thus it doesn't make sense to "rewind" a Node.js stream, because Node.js is an asynchronous environment--this is an important point to keep in mind, as it means that two HTTP requests can be in the middle of processing at the same time.
If one HTTP request causes the stream to begin streaming from disk, and midway through the streaming process another HTTP request came in, there would be no way to use the same stream in the second HTTP request (the internal record-keeping would incorrectly send the second HTTP response the wrong data). Similarly, rewinding the stream when the second HTTP request is processed would cause the wrong data to be sent to the original HTTP request.
If Node.js were not an asynchronous environment, and it was guaranteed that the stream was completely used up before you rewound it, it might make sense to be able to rewind a stream (though there are other considerations, such as the timing of the open, end, and close events).
You do have access to the low-level fs.read mechanisms, so you could theoretically create an API that only opened a single file descriptor but spawned multiple streams; each stream would contain its own buffer and read position, but share a file descriptor. Perhaps something like:
var http = require('http');
var fs = require('fs');

var theIndexSpawner = createStreamSpawner('index.html');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/html'});
  theIndexSpawner.spawnStream().pipe(res);
}).listen(9000);
Of course, you'll have to figure out when it's time to close the file descriptor, making sure you don't hold onto it for too long, etc. Unless you find that opening the file multiple times is an actual bottleneck in your application, it's probably not worth the mental overhead.
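For completeness, the straightforward fix hinted at in the question is simply to create a fresh read stream for each request:

var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/html'});
  // A new stream (and file descriptor) per request, so every response
  // reads index.html from the beginning.
  fs.createReadStream('index.html').pipe(res);
}).listen(9000);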

Reporting upload progress from node.js

I'm writing a small Node.js application that receives a multipart POST from an HTML form and pipes the incoming data to Amazon S3. The formidable module provides the multipart parsing, exposing each part as a Node stream. The knox module handles the PUT to S3.
var form = new formidable.IncomingForm()
  , s3 = knox.createClient(conf);

form.onPart = function(part) {
  var put = s3.putStream(part, filename, headers, handleResponse);
  put.on('progress', handleProgress);
};

form.parse(req);
I'm reporting the upload progress to the browser client via socket.io, but am having difficulty getting these numbers to reflect the real progress of the node to s3 upload.
When the browser-to-node upload happens near instantaneously, as it does when the Node process is running on the local network, the progress indicator reaches 100% immediately. If the file is large, e.g. 300MB, the progress indicator rises slowly, but still faster than our upstream bandwidth would allow. After hitting 100% progress, the client then hangs, presumably waiting for the S3 upload to finish.
I know putStream uses Node's stream.pipe method internally, but I don't understand the detail of how this really works. My assumption is that node gobbles up the incoming data as fast as it can, throwing it into memory. If the write stream can take the data fast enough, little data is kept in memory at once, since it can be written and discarded. If the write stream is slow though, as it is here, we presumably have to keep all that incoming data in memory until it can be written. Since we're listening for data events on the read stream in order to emit progress, we end up reporting the upload as going faster than it really is.
Is my understanding of this problem anywhere close to the mark? How might I go about fixing it? Do I need to get down and dirty with write, drain and pause?
Your problem is that stream.pause isn't implemented on the part, which is a very simple readstream of the output from the multipart form parser.
Knox instructs the s3 request to emit "progress" events whenever the part emits "data". However since the part stream ignores pause, the progress events are emitted as fast as the form data is uploaded and parsed.
The formidable form, however, does know how to both pause and resume (it proxies the calls to the request it's parsing).
Something like this should fix your problem:
form.onPart = function(part) {
  // once pause is implemented, the part will be able to throttle the speed
  // of the incoming request
  part.pause = function() {
    form.pause();
  };

  // resume is the counterpart to pause, and will fire after the `put` emits
  // "drain", letting us know that it's ok to start emitting "data" again
  part.resume = function() {
    form.resume();
  };

  var put = s3.putStream(part, filename, headers, handleResponse);
  put.on('progress', handleProgress);
};

How to queue HTTP GET requests in Node.js in order to control their rate?

I have a NodeJS app which sends HTTP GET requests from various places in the code, some of them dependent on each other (send a request, wait for the reply, process it, and send another request based on the results).
I need to limit the rate of the requests (e.g., 10 requests per hour).
I thought about queuing the requests and then releasing them in a controlled manner from some central point, but got stuck on how to queue the callback functions and their dependent parameters.
I would be happy to hear suggestions on how to handle this scenario with minimal restructuring of the app.
Thanks
I think you have answered your question already. A central queue that can throttle your requests is the way to go. The only problem is that the queue has to have the full information for the request as well as the callback(s) that should be used. I would abstract this into a QueueableRequest object that could look something like this:
var QueueableRequest = function(url, params, httpMethod, success, failure){
  this.url = url;
  this.params = params;
  ...
}
//Then you can queue your request with
queue.add(new QueueableRequest(
  "api.test.com",
  {"test": 1},
  "GET",
  function(data){ console.log('success'); },
  function(err){ console.log('error'); }
));
Of course this is just sample code that could be much prettier, but I hope you get the picture.
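To complete the picture, a minimal queue that releases these QueueableRequest objects at a fixed rate could look roughly like this. This is only a sketch: the sendRequest helper is a placeholder, and the 10-per-hour figure comes from the question.

function RequestQueue(requestsPerHour) {
  this.pending = [];
  // e.g. 10 requests per hour means one request every 6 minutes
  this.intervalMs = (60 * 60 * 1000) / requestsPerHour;
  this.timer = null;
}

RequestQueue.prototype.add = function (queueableRequest) {
  this.pending.push(queueableRequest);
  if (!this.timer) {
    this.timer = setInterval(this.processNext.bind(this), this.intervalMs);
  }
};

RequestQueue.prototype.processNext = function () {
  var next = this.pending.shift();
  if (!next) {
    clearInterval(this.timer);
    this.timer = null;
    return;
  }
  // sendRequest is a placeholder for whatever HTTP helper you use; it should
  // invoke next.success(...) or next.failure(...) when the request finishes.
  sendRequest(next);
};

var queue = new RequestQueue(10);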
The Async module has a number of control flow options that could help you. queue sounds like a good fit, where you can limit concurrency.
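As a sketch of that idea (the task shape and the 6-minute delay derived from the 10-per-hour limit are assumptions, not part of async itself):

var async = require('async');
var http = require('http');

// One task is processed at a time (concurrency = 1), and the worker holds on
// for 6 minutes after each request so roughly 10 requests go out per hour.
var q = async.queue(function (task, done) {
  http.get(task.url, function (res) {
    task.onResponse(res);
    setTimeout(done, 6 * 60 * 1000);
  }).on('error', function (err) {
    task.onError(err);
    setTimeout(done, 6 * 60 * 1000);
  });
}, 1);

// Queue a request together with its callbacks.
q.push({
  url: 'http://example.com/something',
  onResponse: function (res) { console.log('status', res.statusCode); },
  onError: function (err) { console.error(err); }
});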
I would use Deferreds and return one for every queued request. You can then add succeed/fail callbacks to the deferred promise after it has been queued.
var deferred = queue.add('http://example.com/something');
deferred.fail(function(error) { /* handle failure */ });
deferred.done(function(response) { /* handle response */ });
You can hold [url, deferred] pairs in your queue, and each time you dequeue a URL you'll also have the Deferred that goes with it, which you can resolve or reject after you process the request.
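A rough sketch of that idea; createDeferred() is a placeholder for whichever Deferred/promise library you prefer, and the 6-minute interval again reflects the 10-requests-per-hour limit:

var http = require('http');

var queued = []; // holds [url, deferred] pairs

var queue = {
  add: function (url) {
    var deferred = createDeferred(); // placeholder for your promise library
    queued.push([url, deferred]);
    return deferred;
  }
};

// Dequeue one URL every 6 minutes and settle its deferred with the outcome.
setInterval(function () {
  var next = queued.shift();
  if (!next) return;
  var url = next[0], deferred = next[1];
  http.get(url, function (res) {
    deferred.resolve(res);
  }).on('error', function (err) {
    deferred.reject(err);
  });
}, 6 * 60 * 1000);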
