Node.js stream pipes and garbage collection - node.js

Question:
In Node.js, if a Readable stream is piped to a Writable stream and both go out of scope, is the pair of streams liable to be garbage collected before the reader is complete? (because they are now inaccessible)
Background:
I am trying to understand the effects of pipes on stream object lifetimes. I am operating under the assumption that a pipe represents a bidirectional reference between the two streams, so that if one is accessible, neither will be garbage collected (until, of course, the reader ends and the pipe is closed).
So with that assumption: Is there anything under the hood in the runtime that holds streams in existence while a pipe is active, or while a 'data' listener on the Readable is doing the equivalent? (apart from the obvious, like references embedded in event listener functions and other objects)
A concrete example would be piping a file read stream to an http response object. If I "pipe-and-forget", and retain no reference to the file or response stream, is this process liable to be interrupted mid-stream?
Or alternatively, if something is holding off the GC for piped streams, would a bidirectional pipe between two sockets exist indefinitely, even if they were both inaccessible? (and be totally unclosable?)

The answer, as usual, is: it depends.
If nothing references the stream objects then they will be garbage collected. However, there might be references elsewhere than in your code.
Note that the writable stream is referenced by the readable stream via event handlers, so all we really need is a reference to the readable stream to keep both alive.
If the source is a stream that is capable of producing data all by itself (it reads from a file, the network, or something in memory) and the pipeline is flowing, then it will be referenced by a callback somewhere (an I/O continuation, event handler, or closure given to setTimeout() or process.nextTick()).
If the source is a stream that is waiting for you to push data into it or the pipeline is paused, then there are likely no references and both streams will eventually be garbage-collected.
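For example, a hedged sketch of the "pipe-and-forget" case from the question (the file path and port are placeholders): even though no variable keeps the read stream alive, the pending file I/O callbacks and the handlers installed by pipe() do, so the transfer is not interrupted mid-stream.
const fs = require('fs');
const http = require('http');

http.createServer((req, res) => {
  // No reference to the read stream is kept after this handler returns.
  // While data is in flight, the pending file I/O callbacks and the
  // event handlers installed by pipe() keep both streams reachable,
  // so the transfer is not garbage-collected mid-stream.
  fs.createReadStream('/tmp/some-large-file.bin').pipe(res);
}).listen(8080);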

Related

Node.js Streams: When will _writev Be Invoked?

The Node.js documentation makes the following comments about a Writable stream's _writev method.
The writable._writev() method may be implemented in addition or alternatively to writable._write() in stream implementations that are capable of processing multiple chunks of data at once. If implemented and if there is buffered data from previous writes, _writev() will be called instead of _write().
Emphasis mine. In what scenarios can a Node.js writable stream have buffered data from previous writes?
Is the _writev method only called after uncorking a corked stream that's had data written to it? Or are there other scenarios where a stream can have buffered data from previous writes? Bonus point if you can point to the place in the Node.js source code where it makes the decision about calling _write or _writev.
_writev() will be called whenever there is more than one chunk buffered in the stream and the function has been defined. Using cork() can cause more data to be buffered, but so can writes that arrive faster than the underlying sink processes them.
The code that guards _writev is in lib/internal/streams/writable.js: when the buffer is cleared, _writev() is chosen if it is defined and more than one chunk is buffered, otherwise _write() is used.
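As a rough sketch (not the exact internals) of when each path is taken: a Writable constructed with both write and writev options will have its writev called when more than one chunk has been buffered, for example after cork()/uncork().
const { Writable } = require('stream');

const w = new Writable({
  // Used when a single chunk is handed to the underlying sink.
  write(chunk, encoding, callback) {
    console.log('_write:', chunk.toString());
    callback();
  },
  // Used when more than one chunk has been buffered.
  writev(chunks, callback) {
    console.log('_writev:', chunks.map(c => c.chunk.toString()));
    callback();
  }
});

w.cork();
w.write('a');
w.write('b');
w.write('c');
process.nextTick(() => w.uncork()); // the buffered chunks are flushed via _writev()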

Node JS Streams: Understanding data concatenation

One of the first things you learn when you look at node's http module is this pattern for concatenating all of the data events coming from the request read stream:
let body = [];
request.on('data', chunk => {
  body.push(chunk);
}).on('end', () => {
  body = Buffer.concat(body).toString();
});
However, if you look at a lot of streaming library implementations they seem to gloss over this entirely. Also, when I inspect the request.on('data',...) event, it almost always emits only once for a typical JSON payload with a few to a dozen properties.
You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.
Is this because the request stream, when handling POST and PUT bodies, pretty much only ever emits one data event because the payload is way below the chunk partition size limit? In practice, how large would a JSON encoded object need to be to be streamed in more than one data chunk?
It seems to me that objectMode streams don't need to worry about concatenating because if you're dealing with an object it is almost always no larger than one data emitted chunk, which atomically transforms to one object? I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful as long as it could parse the individual objects in the collection and emit them one by one or in batches).
I find this to probably be the most confusing aspect of really understanding the node.js specifics of streams, there is a weird disconnect between streaming raw data, and dealing with atomic chunks like objects. Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries? If someone could clarify this it would be very appreciated.
The job of the code you show is to collect all the data from the stream into one buffer so when the end event occurs, you then have all the data.
request.on('data',...) may emit only once or it may emit hundreds of times. It depends upon the size of the data, the configuration of the stream object and the type of stream behind it. You cannot ever reliably assume it will only emit once.
You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.
You only use this concatenating pattern when you are trying to get the entire data from this stream into a single variable. The whole point of piping to another stream is that you don't need to fetch the entire data from one stream before sending it to the next stream. .pipe() will just send data as it arrives to the next stream for you. Same for transforms.
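For instance, a hedged sketch (the upper-casing transform and the use of process.stdin/stdout are just stand-ins for your real request, processing, and response streams): each chunk is handed on as it arrives, and nothing ever holds the whole body in memory.
const { Transform } = require('stream');

// A stand-in transform: upper-cases each chunk as it arrives.
const upperCase = new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  }
});

// No concatenation anywhere: chunks flow straight through as they arrive.
// In an http handler this would be request.pipe(upperCase).pipe(response).
process.stdin.pipe(upperCase).pipe(process.stdout);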
Is this because the request stream, when handling POST and PUT bodies, pretty much only ever emits one data event because the payload is way below the chunk partition size limit?
It is likely because the payload is below some internal buffer size and the transport is sending all the data at once and you aren't running on a slow link and .... The point here is you cannot make assumptions about how many data events there will be. You must assume there can be more than one and that the first data event does not necessarily contain all the data or data separated on a nice boundary. Lots of things can cause the incoming data to get broken up differently.
Keep in mind that a readStream reads data until there's momentarily no more data to read (up to the size of the internal buffer) and then it emits a data event. It doesn't wait until the buffer fills before emitting a data event. So, since all data at the lower levels of the TCP stack is sent in packets, all it takes is a momentary delivery delay with some packet and the stream will find no more data available to read and will emit a data event. This can happen because of the way the data is sent, because of things that happen in the transport over which the data flows or even because of local TCP flow control if lots of stuff is going on with the TCP stack at the OS level.
In practice, how large would a JSON encoded object need to be to be streamed in more than one data chunk?
You really should not know or care because you HAVE to assume that any size object could be delivered in more than one data event. You can probably safely assume that a JSON object larger than the internal stream buffer size (which you could find out by studying the stream code or examining internals in the debugger) WILL be delivered in multiple data events, but you cannot assume the reverse because there are other variables such as transport-related things that can cause it to get split up into multiple events.
It seems to me that objectMode streams don't need to worry about concatenating because if you're dealing with an object it is almost always no larger than one data emitted chunk, which atomically transforms to one object? I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful as long as it could parse the individual objects in the collection and emit them one by one or in batches).
Object mode streams must do their own internal buffering to find the boundaries of whatever objects they are parsing so that they can emit only whole objects. At some low level, they are concatenating data buffers and then examining them to see if they yet have a whole object.
Yes, you are correct that if you were using an object mode stream and the objects themselves were very large, they could consume a lot of memory. Likely this wouldn't be the optimal way of dealing with that type of data.
Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries?
Yes, they do.
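As a hedged sketch of what such internal logic looks like (NdjsonParser is a made-up example, not a built-in): raw chunks are concatenated into a leftover buffer, and whole objects are pushed out only once a boundary, here a newline, is found.
const { Transform } = require('stream');

class NdjsonParser extends Transform {
  constructor() {
    super({ readableObjectMode: true });
    this.leftover = ''; // partial line carried over between chunks
  }
  _transform(chunk, encoding, callback) {
    const lines = (this.leftover + chunk.toString()).split('\n');
    this.leftover = lines.pop(); // the last element may be an incomplete line
    for (const line of lines) {
      if (line.trim()) this.push(JSON.parse(line)); // emit whole objects only
    }
    callback();
  }
  _flush(callback) {
    if (this.leftover.trim()) this.push(JSON.parse(this.leftover));
    callback();
  }
}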
FYI, the first thing I do when making http requests is to use the request-promise library so I don't have to do my own concatenating. It handles all this for you. It also provides a promise-based interface and about 100 other features which I find helpful.

How to get a readable stream to 'close'

I'm getting a readable stream (require('stream').Readable) from a library I'm using*.
In a general sense, how can I close this (any) readable stream once all data is consumed? I'm seeing the end event, but the close event is never received.
Tried: .close() and .destroy() don't seem to be available anymore on require('stream').Readable, while they were available on require('fs') streams.
I believe the above is causing some erratic behavior under load, e.g. running out of file descriptors, memory leaks, etc., so any help is much appreciated.
Thanks.
*) x-ray. Under the covers it uses enstore, which uses an adapted require('stream').Readable
Readable streams typically don't emit close (they emit end). The close event is more for Writable streams, to indicate, for example, that an underlying file descriptor has been closed.
There is no need to manually close a Readable stream once all of the data has been consumed, it ends automatically (this is done when the stream implementation calls push(null)).
Of course if the stream implementation isn't cleaning up any resources it uses behind the scenes, then that is a bug and should be filed on the appropriate project's issue tracker.
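For illustration, a minimal custom Readable sketch (CountStream is a made-up example): calling push(null) is what makes the stream end, and any behind-the-scenes resource cleanup belongs in _destroy().
const { Readable } = require('stream');

class CountStream extends Readable {
  constructor(limit) {
    super();
    this.n = 0;
    this.limit = limit;
  }
  _read() {
    if (this.n < this.limit) {
      this.push(String(this.n++) + '\n');
    } else {
      this.push(null); // signals end-of-data; consumers then get 'end'
    }
  }
  _destroy(err, callback) {
    // Release file descriptors, sockets, timers, etc. here.
    callback(err);
  }
}

new CountStream(3)
  .on('data', chunk => process.stdout.write(chunk))
  .on('end', () => console.log('done'));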

Any benefit to using streams if all the data fits in a single chunk?

When I do fs.createReadStream in Node.js, the data seems to come in 64KB chunks (I assume this varies between computers).
Let's say I'm piping a read stream through a series of transformations (which each operate on a single chunk) and then finally piping it to a write stream to save it to disk...
If I know in advance that the files I'm working on are guaranteed to be less than 64KB each (ie, they'll each be read in a single chunk), is there any benefit to using streams, as opposed to plain old async code?
First of all, you can configure the chunk size using the highWaterMark parameter: it defaults to 16k for byte-mode streams (16 objects for object-mode streams), but fs.ReadStream defaults to 64k chunks (see the relevant source code).
If you are absolutely sure that all of your data fits in a single chunk, there is indeed no immediate benefit to using streams.
But remember that streams are flexible; they are the unifying abstraction of your code: you can read data from a file, a socket or a random generator. You can add or remove a duplex stream from a streams pipeline and your code will still work in the same way.
You can also pipe a single readable stream into multiple writable streams, which would be a pain to do using only asynchronous callbacks.
Also note that streams don't emit data synchronously (i.e. the readable event is emitted on the next tick), which nicely protects you from the common mistake of calling an asynchronous callback synchronously, a pattern that can lead to stack overflow bugs.
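For example, a minimal sketch of both points (the 16 KB value and the file names are arbitrary): tuning the chunk size via highWaterMark and fanning one readable out to two writables.
const fs = require('fs');

// Read in ~16 KB chunks instead of the fs default of 64 KB.
const source = fs.createReadStream('input.bin', { highWaterMark: 16 * 1024 });

// One readable can feed several writables at the same time.
source.pipe(fs.createWriteStream('copy-a.bin'));
source.pipe(fs.createWriteStream('copy-b.bin'));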

What is Streams3 in Node.js and how does it differ from Streams2?

I've often heard of Streams2 and old-streams, but what is Streams3? It gets mentioned in this talk by Thorsten Lorenz.
Where can I read about it, and what is the difference between Streams2 and Streams3?
Doing a search on Google, I also see it mentioned in the Changelog of Node 0.11.5,
stream: Simplify flowing, passive data listening (streams3) (isaacs)
I'm going to give this a shot, but I've probably got it wrong. Having never written Streams1 (old-streams) or Streams2, I'm probably not the right guy to self-answer this one, but here goes. It seems as if the Streams1 API still persists to some degree. In Streams2, there are two modes of streams: flowing (legacy) and non-flowing. In short, the shim that supported flowing mode is going away. This was the message that led to the patch now called Streams3:
Same API as streams2, but remove the confusing modality of flowing/old
mode switch.
Every time read() is called, and returns some data, a data event fires.
resume() will make it call read() repeatedly. Otherwise, no change.
pause() will make it stop calling read() repeatedly.
pipe(dest) and on('data', fn) will automatically call resume().
No switches into old-mode. There's only flowing, and paused. Streams start out paused.
Unfortunately, to understand the description above, which defines Streams3 pretty well, you first need to understand Streams1 and the legacy streams.
Backstory
First, let's take a look at what the Node v0.10.25 docs say about the two modes,
Readable streams have two "modes": a flowing mode and a non-flowing mode. When in flowing mode, data is read from the underlying system and provided to your program as fast as possible. In non-flowing mode, you must explicitly call stream.read() to get chunks of data out. — Node v0.10.25 Docs
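In code, the two modes the docs describe look roughly like this (a sketch; use one style per stream, since attaching a 'data' listener switches the stream into flowing mode):
// Flowing mode: attach a 'data' listener and chunks are pushed to you.
function consumeFlowing(stream) {
  stream.on('data', chunk => console.log('pushed', chunk.length, 'bytes'));
}

// Non-flowing (paused) mode: wait for 'readable' and pull chunks with read().
function consumePaused(stream) {
  stream.on('readable', () => {
    let chunk;
    while ((chunk = stream.read()) !== null) {
      console.log('pulled', chunk.length, 'bytes');
    }
  });
}

// Use one style or the other on a given stream, e.g.
// consumeFlowing(process.stdin);  or  consumePaused(process.stdin);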
Isaac Z. Schlueter said in November slides I dug up:
streams2
"suck streams"
Instead of 'data' events spewing, call read() to pull data from source
Solves all problems (that we know of)
So it seems as if in Streams1, you'd create an object and call .on('data', cb) on that object. This would set the event to be triggered, and then you were at the mercy of the stream. In Streams2, streams have internal buffers and you request data from those streams explicitly (using .read()). Isaac goes on to specify how backwards compatibility works in Streams2 to keep Streams1 (old-stream) modules functioning:
old-mode streams1 shim
New streams can switch into old-mode, where they spew 'data'
If you add a 'data' event handler, or call pause() or resume(), then switch
Making minimal changes to existing tests to keep us honest
So in Streams2, adding a 'data' handler or calling .pause() or .resume() triggers the shim. And it should, right? In Streams2 you have control over when to .read(), and you're not catching stuff being thrown at you. Triggering the shim switched the stream into a legacy mode that acted independently of Streams2.
Let's take an example from Isaac's slide,
createServer(function(q, s) {
  // ADVISORY only!
  q.pause()
  session(q, function(ses) {
    q.on('data', handler)
    q.resume()
  })
})
In Streams1, q starts reading and emitting right away (likely losing data), until the call to q.pause() advises q to stop pulling in data, but does not stop it from emitting events for what it has already read.
In Streams2, q starts off paused until the call to .pause(), which signals it to emulate the old mode.
In Streams3, q starts off paused, having never read from the file handle, so the q.pause() is a no-op; the call to q.on('data', cb) then implicitly calls q.resume(), which reads and emits until there is no more data in the buffer, and then does the same again as more data arrives.
Seems like Streams3 was introduced in io.js, then in Node 0.11+
Streams 1 Supported data being pushed to a stream. There was no consumer control, data was thrown at the consumer whether it was ready or not.
Streams 2 allows data to be pushed to a stream as per Streams 1, or for a consumer to pull data from a stream as needed. The consumer could control the flow of data in pull mode (using stream.read() when notified of available data). The stream cannot support both push and pull at the same time.
Streams 3 allows pull and push data on the same stream.
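A hedged sketch of what that means in practice on a modern (Streams3) readable, assuming a Node version with Readable.from() and the readableFlowing property: the stream starts out paused, attaching a 'data' handler implicitly resumes it, and pause()/resume() just toggle flowing without any legacy mode switch.
const { Readable } = require('stream');

const r = Readable.from(['a', 'b', 'c']);   // Readable.from() needs a fairly recent Node
console.log(r.readableFlowing);             // null: paused, nothing has asked for data yet

r.on('data', chunk => console.log('chunk:', chunk)); // attaching 'data' implicitly resumes
console.log(r.readableFlowing);             // true: flowing, no legacy mode switch involved

r.pause();                                  // just toggles flowing off...
console.log(r.readableFlowing);             // false
r.resume();                                 // ...and back on; the chunks arrive on later ticks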
Great overview here:
https://strongloop.com/strongblog/whats-new-io-js-beta-streams3/
A cached version (accessed 8/2020) is here: https://hackerfall.com/story/whats-new-in-iojs-10-beta-streams-3
I suggest you read the documentation, more specifically the section "API for Stream Consumers"; it's actually very understandable. Besides, I think the other answer is wrong: http://nodejs.org/api/stream.html#stream_readable_read_size