Node async stream piping

I have this code working in Node 0.10, but it prints nothing in 0.8
var http = require('http');
var req = http.request('http://www.google.com:80', function(res) {
  setTimeout(function() {
    res.pipe(process.stdout);
  }, 0);
});
req.end();
After some guessing I found a workaround:
var http = require('http');
var req = http.request('http://www.google.com:80', function(res) {
  res.pause();
  setTimeout(function() {
    res.resume();
    res.pipe(process.stdout);
  }, 0);
});
req.end();
But the documentation says that pause() is advisory, and this confuses me. Why should I pause a stream that isn't connected to anything yet?

0.10 revamped the Streams API and added the following change in behavior:
WARNING: If you never add a 'data' event handler, or call resume(), then it'll sit in a paused state forever and never emit 'end'.
So, in 0.10, the stream will wait for a valid listener, like a pipe, or a forced resume without an explicit pause.
Streams in 0.8 and older, on the other hand, start emitting 'data' immediately unless instructed to pause. In this case, that creates a race condition between the timeout and the stream -- the stream may run in part, or even to completion, before the timeout expires.
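As a quick illustration of that warning (a minimal sketch of the 0.10+ behavior, reusing the question's request): if nothing ever consumes the response, it just sits paused and its 'end' event never fires.
var http = require('http');
var req = http.request('http://www.google.com:80', function(res) {
  // No pipe(), no 'data' handler, no resume(): the stream stays paused.
  res.on('end', function() {
    console.log('end'); // never printed, because nothing consumes the data
  });
});
req.end();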

Related

NodeJS Child Process can't send net.socket fd:3

When writing a Node.js multi-threading package, I hit a problem where the main thread can send content through fd:3 and the threads can receive the message, but the threads cannot send anything back through fd:3.
Is there something I am doing wrong? (Lines threader.js:45-59 are where the problem shows itself.)
Package (Only on github for now while I get the package working)
Start up code:
var Thread = require("threader");
var task = Thread.task(function(){
  //Do some calculation
}, function(){
  //When the calculation response has been sent
});
task('a', 2);
I just figured out the problem:
thread.js is like a socket server and threader.js is like a client.
The server has to respond within the context of the connection.
Since the write happens inside a setTimeout callback, it runs later, outside the context in which the connection was handled, so threader never receives the data.
thread.js - old code
pipe.on('data', function(chunk){
  console.log('RECEIVED CONTENT THROUGH fd:3 in thread');
  console.log(chunk.toString());
});
setTimeout(function () {
  pipe.write('I piped a thing');
}, 2000);
thread.js - new code
pipe.on('data', function(chunk){
  console.log('RECEIVED CONTENT THROUGH fd:3 in thread');
  console.log(chunk.toString());
});
pipe.write('I piped a thing');
OR
thread.js - new code - best way
pipe.on('data', function(chunk){
  console.log('RECEIVED CONTENT THROUGH fd:3 in thread');
  console.log(chunk.toString());
  //Do the real ~2 seconds of work here, rather than deferring the write with setTimeout
  pipe.write('I piped a thing');
});
I just rewrote the entire package from a different angle and now it works...
I think the problem had to do with the thread picking.
The fixes will be pushed to GitHub soon.
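For reference, here is a minimal sketch of the pattern in question on Unix-like systems: a parent and child talking over fd:3, with the child replying from inside its 'data' handler rather than from a timer. The file names and messages are illustrative assumptions, not code from the threader package.
// parent.js
var spawn = require('child_process').spawn;
var child = spawn(process.execPath, ['child.js'], {
  stdio: ['inherit', 'inherit', 'inherit', 'pipe'] // expose fd:3 as a duplex pipe
});
var pipe = child.stdio[3];
pipe.on('data', function(chunk) {
  console.log('parent received: ' + chunk.toString());
});
pipe.write('hello child');

// child.js
var net = require('net');
var pipe = new net.Socket({ fd: 3 }); // wrap the inherited fd:3
pipe.on('data', function(chunk) {
  console.log('child received: ' + chunk.toString());
  pipe.write('hello parent'); // reply within the 'data' handler
});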

NodeJS streams and premature end

Assuming a Readable Stream in NodeJS and a Data (on('data', ...)) event handler tied to it that is relatively slow, is it possible for the End event to fire before the last Data handler(s) has finished, and if so, will it prematurely terminate that handler? Or, will all Data events get dispatched and run?
In my case, I am working with large files and want to commit to a DB every data chunk. I am worried that I may lose the last record or two (or more) if End is fired before the last DB calls in the handler actually complete.
The 'end' event fires after the last 'data' event, but it may happen before the last 'data' handler has finished. It is also possible for the next 'data' handler to start before the previous one has finished. It depends on what you do in your code, but a later 'data' invocation may finish before an earlier one, which can cause errors and problems in your code.
An example of how to cause such problems (for your own tests):
var fs = require('fs');
var rr = fs.createReadStream('somebigfile.jpg');
var i = 0;
rr.on('data', function(chunk) {
  i++;
  var s = i;
  console.log('readable:' + s);
  setTimeout(function(){
    console.log('timeout:' + s);
  }, 50 - i * 10);
});
rr.on('end', function() {
  console.log('end');
});
It will print to your console when each 'data' handler starts, and again some milliseconds later when it finishes. The finishing messages may appear in a different order.
Solution:
Readable streams have two modes: 'flowing mode' and 'paused mode'. When you add a 'data' event handler, you automatically switch the stream into flowing mode.
From the documentation:
When in flowing mode, data is read from the underlying system and
provided to your program as fast as possible
In this mode, events will not wait for your slow actions to finish. What you need is 'paused mode'.
From the documentation:
In paused mode, you must explicitly call stream.read() to get chunks
of data out. Streams start out in paused mode.
In other words: you request a chunk of data, you get it, you work with it, and when you are ready you ask for a new chunk. In this mode you control when you get your data.
How to change to 'paused mode':
It is the default mode for this stream, but when you register a 'data' event handler the stream switches to 'flowing mode'. Therefore, do not use readstream.on('data', ...).
Instead, use readstream.on('readable', function(){...}); when it fires, the stream is ready to give you a chunk of data. To get the chunk, call var chunk = readstream.read();
Example from docs:
var fs = require('fs');
var rr = fs.createReadStream('foo.txt');
rr.on('readable', function() {
  console.log('readable:', rr.read());
});
rr.on('end', function() {
  console.log('end');
});
Please read the documentation for more details, because there are more cases in which a stream is automatically switched to 'flowing mode'.
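Applied to the original question, paused mode lets you pull the next chunk only after the previous commit has completed. This is only a sketch; saveChunkToDb is a hypothetical async DB write standing in for your real driver call.
var fs = require('fs');
var rr = fs.createReadStream('somebigfile.jpg');

// Hypothetical async DB write, for illustration only.
function saveChunkToDb(chunk, callback) {
  setTimeout(callback, 50);
}

var busy = false;
var ended = false;

rr.on('readable', processNext);
rr.on('end', function() {
  ended = true;
  if (!busy) console.log('all chunks committed');
});

function processNext() {
  if (busy) return;            // wait for the current DB write to finish
  var chunk = rr.read();
  if (chunk === null) return;  // nothing buffered yet; 'readable' will fire again
  busy = true;
  saveChunkToDb(chunk, function() {
    busy = false;
    if (ended) {
      console.log('all chunks committed');
    } else {
      processNext();           // ask for the next chunk only after the commit
    }
  });
}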
Working with slow handlers in flowing mode:
If you want or need to work in 'flowing mode', there is also a solution: you can pause and resume the stream. When you get a chunk from the 'data' event, pause the stream, and when you finish your work, resume it.
Example from the documentation:
var readable = getReadableStreamSomehow();
readable.on('data', function(chunk) {
  console.log('got %d bytes of data', chunk.length);
  readable.pause();
  console.log('there will be no more data for 1 second');
  setTimeout(function() {
    console.log('now data will start flowing again');
    readable.resume();
  }, 1000);
});
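Adapted to the DB-commit use case from the question (again only a sketch, with the same hypothetical saveChunkToDb), pausing during each commit also makes it easy to tell when the last record has been saved:
var fs = require('fs');
var rr = fs.createReadStream('somebigfile.jpg');

// Hypothetical async DB write, for illustration only.
function saveChunkToDb(chunk, callback) {
  setTimeout(callback, 50);
}

var pending = 0;
var ended = false;

rr.on('data', function(chunk) {
  rr.pause();                 // stop the flow while this chunk is being committed
  pending++;
  saveChunkToDb(chunk, function() {
    pending--;
    rr.resume();              // pull the next chunk only after the commit finished
    if (ended && pending === 0) console.log('all chunks committed');
  });
});

rr.on('end', function() {
  ended = true;
  if (pending === 0) console.log('all chunks committed');
});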

NodeJS sockets initialized as unpaused?

A net.Socket object in NodeJS is a Readable Stream, however one note in the docs got me concerned:
For the Net.Socket 'data' event, the docs say
Note that the data will be lost if there is no listener when a Socket emits a 'data' event.
That seems to imply a Socket is returned to the calling script in "flowing-mode" and already un-paused? However, for a generic Readable Stream, the documentation for the 'data' event says
If you attach a data event listener, then it will switch the stream into flowing mode, and data will be passed to your handler as soon as it is available.
That "If" seems to imply if you wait a bit to bind to the 'data' event, the stream will wait for you, and if you intentionally want to miss the 'data' events, the example in the resume() method seems to indicate you must call the resume() method to start the flow of data.
My concern is that when working with a net.Server and you receive a net.Socket as part of a 'connection' event, is it imperative that you start handling its 'data' events right away, since the socket is already open? Meaning, if I do:
var s = new net.Server();
s.on('connection', function(socket) {
  // Do some lengthy setup process here, blocking execution for a few seconds...
  socket.on('data', function(d) { console.log(d); });
});
s.listen(8080);
Meaning, if I do not bind to the 'data' event right away, could I lose data? If so, is this a more robust way to handle incoming connections when each one requires a lengthy setup?
var s = new net.Server();
s.on('connection', function(socket) {
  socket.pause(); // Not ready for you yet!
  // Do some lengthy setup process here, blocking execution for a few seconds...
  socket.on('data', function(d) { console.log(d); });
  socket.resume(); // Okay, go!
});
s.listen(8080);
Anyone have experience working with listening on raw socket streams to know if this data loss is an issue?
I'm hoping this is an instance where the net.Socket documentation wasn't updated since v0.10, since the stream documentation has a section mentioning that 'data' events started being emitted right away in versions prior to 0.10. Were TCP sockets properly updated so they no longer start emitting 'data' right away, with the documentation simply not updated to match?
Yes, this is a flaw in the docs. Here is an example:
var net = require('net')
var server = net.createServer(onConnection)

function onConnection (socket) {
  console.log('onConnection')
  setTimeout(startReading, 1000)

  function startReading () {
    socket.on('data', read)
    socket.on('end', stopReading)
  }

  function stopReading () {
    socket.removeListener('data', read)
    socket.removeListener('end', stopReading)
  }
}

function read (data) {
  console.log('Received: ' + data.toString('utf8'))
}

server.listen(1234, onListening)

function onListening () {
  console.log('onListening')
  net.connect(1234, onConnect)
}

function onConnect () {
  console.log('onConnect')
  this.write('1')
  this.write('2')
  this.write('3')
  this.write('4')
  this.write('5')
  this.write('6')
}
All the data is received. If you explicitly resume() the socket before attaching a listener, you will lose it.
Also, if you do your "lengthy" setup in a blocking manner (which you shouldn't), you can't lose any I/O because it has no chance to be processed, so no events will be emitted.
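A minimal sketch of the loss case described above (my own variation on the example, not part of the original answer): resuming the socket with no 'data' listener puts it into flowing mode with no consumer, so whatever arrives in the meantime is discarded.
function onConnection (socket) {
  console.log('onConnection')
  socket.resume() // flowing mode with no consumer: incoming bytes are dropped
  setTimeout(function () {
    socket.on('data', function (data) {
      console.log('Received: ' + data.toString('utf8')) // the early writes never show up here
    })
  }, 1000)
}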

Node js - http.request() problems with connection pooling

Consider the following simple Node.js application:
var http = require('http');
http.createServer(function() { }).listen(8124); // Prevent process shutting down
var requestNo = 1;
var maxRequests = 2000;
function requestTest() {
  http.request({ host: 'www.google.com', method: 'GET' }, function(res) {
    console.log('Completed ' + (requestNo++));
    if (requestNo <= maxRequests) {
      requestTest();
    }
  }).end();
}
requestTest();
It makes 2000 HTTP requests to google.com, one after the other. The problem is it gets to request No. 5 and pauses for about 3 minutes, then continues processing requests 6 - 10, then pauses for another 3 minutes, then requests 11 - 15, pauses, and so on. Edit: I tried changing www.google.com to localhost, an extremely basic Node.js app running on my machine that returns "Hello world", and I still get the 3 minute pause.
Now I read I can increase the connection pool limit:
http.globalAgent.maxSockets = 20;
Now if I run it, it processes requests 1 - 20, then pauses for 3 mins, then requests 21 - 40, then pauses, and so on.
Finally, after a bit of research, I learned I could disable connection pooling entirely by setting agent: false in the request options:
http.request({ host: 'www.google.com', method: 'GET', agent: false }, function(res) {
...snip....
...and it'll run through all 2000 requests just fine.
My question is: is it a good idea to do this? Is there a danger that I could end up with too many open HTTP connections? And why does it pause for 3 minutes? Surely, if I've finished with the connection, it should be added straight back into the pool, ready for the next request to use, so why does it wait 3 minutes? Forgive my ignorance.
Failing that, what is the best strategy for a Node.js app making a potentially large number of HTTP requests without locking up or crashing?
I'm running Node.js version 0.10 on Mac OSX 10.8.2.
Edit: I've found if I convert the above code into a for loop and try to establish a bunch of connections at the same time, I start getting errors after about 242 connections. The error is:
Error was thrown: connect EMFILE
(libuv) Failed to create kqueue (24)
...and the code...
for (var i = 1; i <= 2000; i++) {
  (function(requestNo) {
    var request = http.request({ host: 'www.google.com', method: 'GET', agent: false }, function(res) {
      console.log('Completed ' + requestNo);
    });
    request.on('error', function(e) {
      console.log(e.name + ' was thrown: ' + e.message);
    });
    request.end();
  })(i);
}
I don't know if a heavily loaded Node.js app could ever reach that many simultaneous connections.
You have to consume the response.
Remember, in v0.10, we landed streams2. That means that data events don't happen until you start looking for them. So, you can do stuff like this:
http.createServer(function(req, res) {
  // this does some I/O, async
  // in 0.8, you'd lose data chunks, or even the 'end' event!
  lookUpSessionInDb(req, function(er, session) {
    if (er) {
      res.statusCode = 500;
      res.end("oopsie");
    } else {
      // no data lost
      req.on('data', handleUpload);
      // end event didn't fire while we were looking it up
      req.on('end', function() {
        res.end('ok, got your stuff');
      });
    }
  });
});
However, the flip side of streams that don't lose data when you're not reading it, is that they actually don't lose data if you're not reading it! That is, they start out paused, and you have to read them to get anything out.
So, what's happening in your test is that you're making a bunch of requests and not consuming the responses, and then eventually the socket gets killed by google because nothing is happening, and it assumes you've died.
There are some cases where it's impossible to consume the incoming message: that is, if you don't add a 'response' event handler on a request, or if you completely write and finish the response message on a server without ever reading the request. In those cases, we just dump the data in the garbage for you.
However, if you are listening to the 'response' event, it's your responsibility to handle the object. Add a response.resume() in your first example, and you'll see it processes on through at a reasonable pace.
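Applied to the code from the question, that suggestion would look roughly like this (a sketch; resume() drains the body so the pooled socket is freed for the next request):
var http = require('http');
http.createServer(function() { }).listen(8124); // Prevent process shutting down
var requestNo = 1;
var maxRequests = 2000;

function requestTest() {
  http.request({ host: 'www.google.com', method: 'GET' }, function(res) {
    res.resume(); // consume the response body so the agent can free the socket
    console.log('Completed ' + (requestNo++));
    if (requestNo <= maxRequests) {
      requestTest();
    }
  }).end();
}

requestTest();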

Better way to make node not exit?

In a node program I'm reading from a file stream with fs.createReadStream, but when I pause the stream the program exits. I thought the program would keep running since the file is still open, just not being read.
Currently, to keep it from exiting, I'm setting an interval that does nothing.
setInterval(function() {}, 10000000);
When I'm ready to let the program exit, I clear it. But is there a better way?
Example Code where node will exit:
var fs = require('fs');
var rs = fs.createReadStream('file.js');
rs.pause();
Node will exit when there is no more queued work. Calling pause on a ReadableStream simply pauses the data event. At that point, there are no more events being emitted and no outstanding work requests, so Node will exit. The setInterval works since it counts as queued work.
Generally this is not a problem since you will probably be doing something after you pause that stream. Once you resume the stream, there will be a bunch of queued I/O and your code will execute before Node exits.
Let me give you an example. Here is a script that exits without printing anything:
var fs = require('fs');
var rs = fs.createReadStream('file.js');
rs.pause();
rs.on('data', function (data) {
  console.log(data); // never gets executed
});
The stream is paused, there is no outstanding work, and my callback never runs.
However, this script does actually print output:
var fs = require('fs');
var rs = fs.createReadStream('file.js');
rs.pause();
rs.on('data', function (data) {
  console.log(data); // prints stuff
});
rs.resume(); // queues I/O
In conclusion, as long as you are eventually calling resume later, you should be fine.
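For example (a small sketch of that point; the one-second delay is arbitrary), the pending timer is what keeps the process alive until resume() queues the file reads:
var fs = require('fs');
var rs = fs.createReadStream('file.js');
rs.pause();
rs.on('data', function (data) {
  console.log('read ' + data.length + ' bytes');
});
setTimeout(function () {
  rs.resume(); // the queued timer kept Node running until now; resuming queues I/O
}, 1000);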
A shorter way, based on the other answers:
require('fs').createReadStream('file.js').pause();
