Converting WriteStream to TransformStream - node.js

I have a (somewhat weird) writable stream that I need to convert to a transform stream.
The writable stream normally sits at the end of a pipe chain and emits custom events once it has collected enough data for its output. I want it to go in the middle so I can pipe it to another write stream, i.e.:
readStream.pipe(writeStreamToConvert).pipe(finalWriteStream);
What I have done is the following, and it works.
const through2 = require('through2')
var writeStreamToConvert = new WriteStreamToConvert();
return through2.obj(function (chunk, enc, callback) {
writeStreamToConvert.write(chunk)
// object is the event emitted from the writestream
writeStreamToConvert.on('object', (name, obj ) => {
this.push(JSON.stringify(obj, null, 4) + '\n')
});
callback()
})
This works fine, does not seem to leak memory and is fairly quick. However node gives me a warning:
Warning: Possible EventEmitter memory leak detected. 11 object listeners added. Use emitter.setMaxListeners() to increase limit
So I am a little bit curious if this is the correct way of converting writestreams?

The event handler would be best placed in a Transform stream constructor. Since through2 does not support such initialization, you would need to use node's stream API directly.
Currently, a new event handler (which is never removed -- that is how .on() works) is being added for every object written to the through2 stream. That is why the warning occurs.
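A minimal sketch of that suggestion, using node's stream API directly so the listener is attached once in the constructor. ObjectCollector is a hypothetical stand-in for WriteStreamToConvert (assumed here to emit an 'object' event per chunk), and CollectorTransform is an illustrative name, not part of any library:

```javascript
const { Transform, Writable } = require('stream');

// Hypothetical stand-in for WriteStreamToConvert: emits an 'object' event
// for every chunk it receives.
class ObjectCollector extends Writable {
  constructor() {
    super({ objectMode: true });
  }

  _write(chunk, encoding, callback) {
    this.emit('object', 'item', chunk);
    callback();
  }
}

class CollectorTransform extends Transform {
  constructor(inner) {
    super({ objectMode: true });
    this.inner = inner;
    // Registered once, in the constructor, instead of once per chunk.
    this.inner.on('object', (name, obj) => {
      this.push(JSON.stringify(obj, null, 4) + '\n');
    });
  }

  _transform(chunk, encoding, callback) {
    // Forward the chunk; the listener above pushes whatever the inner stream emits.
    this.inner.write(chunk, callback);
  }

  _flush(callback) {
    // End the inner stream before the transform itself finishes.
    this.inner.end(callback);
  }
}

// Usage (stream names taken from the question):
// readStream.pipe(new CollectorTransform(new ObjectCollector())).pipe(finalWriteStream);
```

Because only one 'object' listener ever exists, the listener count stays at 1 no matter how many chunks pass through, so the warning should not appear.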

Related

Why while loop is needed for reading a non-flowing mode stream in Node.js?

In the node.js documentation, I came across the following code
const readable = getReadableStreamSomehow();
// 'readable' may be triggered multiple times as data is buffered in
readable.on('readable', () => {
let chunk;
console.log('Stream is readable (new data received in buffer)');
// Use a loop to make sure we read all currently available data
while (null !== (chunk = readable.read())) {
console.log(`Read ${chunk.length} bytes of data...`);
}
});
// 'end' will be triggered once when there is no more data available
readable.on('end', () => {
console.log('Reached end of stream.');
});
Here is the comment from the node.js documentation concerning the usage of the while loop, saying it's needed to make sure all data is read
// Use a loop to make sure we read all currently available data
while (null !== (chunk = readable.read())) {
I couldn't understand why it is needed, so I tried replacing the while with just an if statement, and the process terminated after the very first read. Why?
From the node.js documentation
The readable.read() method should only be called on Readable streams operating in paused mode. In flowing mode, readable.read() is called automatically until the internal buffer is fully drained.
Be careful: this method is only meant for streams that are in paused mode.
And even further, if you understand what a stream is, you'll understand that you need to process chunks of data.
Each call to readable.read() returns a chunk of data, or null. The chunks are not concatenated. A while loop is necessary to consume all data currently in the buffer.
So I hope you understand that if you are not looping over your readable stream and only execute one read, you won't get your full data.
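As a small illustration, here is a sketch that uses an object-mode stream, where a single read() hands back exactly one buffered item, so the need for the loop is easy to see. The drain helper and the sample objects are made up for this example:

```javascript
const { Readable } = require('stream');

// Collect everything currently buffered; read() returns null once the
// internal buffer is empty.
function drain(readable) {
  const chunks = [];
  let chunk;
  while ((chunk = readable.read()) !== null) {
    chunks.push(chunk);
  }
  return chunks;
}

// Illustrative source: three objects are buffered before anyone reads.
const source = new Readable({ objectMode: true, read() {} });
source.push({ id: 1 });
source.push({ id: 2 });
source.push({ id: 3 });
source.push(null); // no more data

source.on('readable', () => {
  const chunks = drain(source);
  if (chunks.length > 0) {
    console.log(`Read ${chunks.length} objects in one 'readable' callback`);
  }
});
source.on('end', () => console.log('Reached end of stream.'));
```

With an if in place of the while, each 'readable' callback would take only the first buffered object and leave the rest sitting in the buffer.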
Ref: https://nodejs.org/api/stream.html

What are the roles of _read and read in Node JS streams?

I'm really just looking for clarification on how these work. IMO the documentation on streams is somewhat lacking, and there actually aren't a lot of resources out there that comprehensively explain how they're meant to work and be extended.
My question can be broken down into two parts
One, what is the role of the _read function within the stream module? When I run this code it endlessly prints out "hello world" until null is pushed onto the stream buffer. This seems to indicate that _read is called in some kind of loop that waits for a null in the buffer, but I can't find documentation anywhere that states this in explicit terms.
var Readable = require('stream').Readable
var rs = Readable()
rs._read = function () {
rs.push("hello world")
rs.push(null)
};
rs.on("data", function(data){
console.log("some data", data)
})
Two, what does read actually do? My understanding is that read consumes data from the read stream buffer, and fires the data event. Is that all that's going on here?
read() is something that a consumer of the readStream calls if they want to specifically read some bytes from the stream (when the stream is not flowing).
_read() is an internal method that is part of the internal implementation of the read stream. The internals of the stream call this method (it is NOT to be called from the outside) when the stream is flowing and the stream wants to get more data from the source. When called the _read() method pushes data with .push(data) or if it has no more data, then it does a .push(null).
You can see an explanation and example here in this article.
_read(size) {
if (this.data.length) {
const chunk = this.data.slice(0, size);
this.data = this.data.slice(size, this.data.length);
this.push(chunk);
} else {
this.push(null); // 'end', no more data
}
}
If you were implementing a read stream to some custom source of data, then you would implement the _read() method to fetch up to the size amount of data from your source and .push() that data into the stream.
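To put that _read() snippet in context, here is a minimal self-contained sketch of a Readable subclass built around it. StringSource is a hypothetical class name, and the in-memory buffer stands in for a real data source:

```javascript
const { Readable } = require('stream');

// Hypothetical custom source: serves an in-memory buffer in pieces of at
// most `size` bytes, then signals the end of the data.
class StringSource extends Readable {
  constructor(text, options) {
    super(options);
    this.data = Buffer.from(text);
  }

  // Called by the stream internals whenever more data is wanted.
  _read(size) {
    if (this.data.length) {
      const chunk = this.data.subarray(0, size);
      this.data = this.data.subarray(size);
      this.push(chunk);
    } else {
      this.push(null); // 'end', no more data
    }
  }
}

// A consumer never calls _read() itself; it pipes, listens for 'data',
// or calls read().
new StringSource('hello world\n', { highWaterMark: 4 }).pipe(process.stdout);
```

The small highWaterMark is only there to make the stream call _read() several times for a short string.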

Node.js stream.on('end'... does not make file readable

I try to catch the completion of writing the canvas stream thusly:
var out = fs.createWriteStream(out_fs);
var stream = canvas.createPNGStream({
bufsize: 2048
});
stream.on('end', function () {
// can we use out_fs now? why not?
});
stream.pipe(out);
But when I try to load out_fs in a sub-function, I get
Error: Image given has not completed loading
at this line:
fs.readFile(out_fs, function (err, data) {
if (err) throw err;
var img = new Canvas.Image; // Create a new Image
img.src = data;
ctx2.drawImage(img, 0, 50, img.width, img.height); <--
http://nodejs.org/api/stream.html#stream_event_end
But I don't see any other way to continue with the control flow after the stream is written. If I let the entire parent function return, the file then seems readable. I've tried wrapping my child functions in setImmediate(), but that only seems to work intermittently.
What is the definitive way to catch the final usable end result of writing the stream?
The node-canvas documentation claims that the end event signals the final writing of the file: https://www.npmjs.com/package/canvas#canvaspngstream
But this generates the error above if you immediately try to use it.
'finish' does not seem to be implemented at all.
Since you have piped stream to out, out will be close()'d automatically on stream's end event (this is part of what gets set up automatically when you .pipe() a stream). So, to know when the file has finished being written, listen to the close event of the out stream.
You saw intermittent results because the stream's end event is the same event that the out writable stream uses to finalize the file, so the file may not be fully written yet when your own end handler runs.
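A minimal sketch of that approach, written against a generic readable stream rather than node-canvas. writeStreamToFile is a hypothetical helper name, and `source` stands in for canvas.createPNGStream():

```javascript
const fs = require('fs');

// Hypothetical helper: pipes `source` into a file and calls `done` only
// after the file stream has been closed (or has failed).
function writeStreamToFile(source, filePath, done) {
  const out = fs.createWriteStream(filePath);
  let settled = false;
  const settle = (err) => {
    if (settled) return; // 'close' can follow 'error'; report only once
    settled = true;
    done(err, filePath);
  };
  out.once('error', settle);
  out.once('close', () => settle(null));
  source.pipe(out);
}

// Usage (names from the question):
// writeStreamToFile(canvas.createPNGStream(), out_fs, function (err) {
//   if (err) throw err;
//   fs.readFile(out_fs, function (err, data) { /* safe to read here */ });
// });
```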
I would put this in a comment (but can't):
You need to close your WriteStream called 'out' - use the event aarosil suggests and do out.close()

Proper way to unpipe a streams2 pipeline and empty it (not just flush)

Premise
I'm trying to find the correct way to prematurely terminate a series of piped streams (pipeline) in Node.js: sometimes I want to gracefully abort the stream before it has finished. Specifically I'm dealing with mostly objectMode: true and non-native parallel streams, but this shouldn't really matter.
Problem
The problem is when I unpipe the pipeline, data remains in each stream's buffer and is drained. This might be okay for most of the intermediate streams (e.g. Readable/Transform), but the last Writable still drains to its write target (e.g. a file or a database or socket or w/e). This could be problematic if the buffer contains hundreds or thousands of chunks which takes a significant amount of time to drain. I want it to stop immediately, i.e. not drain; why waste cycles and memory on data that doesn't matter?
Depending on the route I go, I receive either a "write after end" error, or an exception when the stream cannot find existing pipes.
Question
What is the proper way to gracefully kill off a pipeline of streams in the form a.pipe(b).pipe(c).pipe(z)?
Solution?
The solution I have come up with is 3-step:
unpipe each stream in the pipeline in reverse order
Empty each stream's buffer that implements Writable
end each stream that implements Writable
Some pseudo code illustrating the entire process:
var pipeline = [ // define the pipeline
readStream,
transformStream0,
transformStream1,
writeStream
];
// build and start the pipeline
var tmpBuildStream;
pipeline.forEach(function(stream) {
if ( !tmpBuildStream ) {
tmpBuildStream = stream;
return; // skip to the next stream (continue is not valid inside forEach)
}
tmpBuildStream = tmpBuildStream.pipe(stream);
});
// sleep, timeout, event, etc...
// tear down the pipeline
var tmpTearStream;
pipeline.slice(0).reverse().forEach(function(stream) {
if ( !tmpTearStream ) {
tmpTearStream = stream;
return; // skip to the next stream (continue is not valid inside forEach)
}
tmpTearStream = stream.unpipe(tmpTearStream);
});
// empty and end the pipeline
pipeline.forEach(function(stream) {
if ( typeof stream._writableState === 'object' ) { // empty
stream._writableState.length -= stream._writableState.buffer.length;
stream._writableState.buffer = [];
}
if ( typeof stream.end === 'function' ) { // kill
stream.end();
}
});
I'm really worried about the usage of stream._writableState and modifying the internal buffer and length properties (the _ signifies a private property). This seems like a hack. Also note that since I'm piping, things like pause and resume are out of the question (based on a suggestion I received from IRC).
I also put together a runnable version (pretty sloppy) you can grab from github: https://github.com/zamnuts/multipipe-proto (git clone, npm install, view readme, npm start)
In this particular case I think we should get rid of the structure where you have 4 different, not fully customised streams. Piping them together creates a chain of dependencies that will be hard to control unless we implement our own mechanism.
I would like to focus on your actual goal here:
INPUT >----[read] → [transform0] → [transform1] → [write]-----> OUTPUT
| | | |
KILL_ALL------o----------o--------------o------------o--------[nothing to drain]
I believe that the above structure can be achieved via combining custom:
duplex stream - for your own _write(chunk, encoding, cb) and _read(bytes) implementation, with
transform stream - for your own _transform(chunk, encoding, cb) implementation.
Since you are using the writable-stream-parallel package you may also want to go over their libs, as their duplex implementation can be found here: https://github.com/Clever/writable-stream-parallel/blob/master/lib/duplex.js .
And their transform stream implementation is here: https://github.com/Clever/writable-stream-parallel/blob/master/lib/transform.js. Here they handle the highWaterMark.
Possible solution
Their write stream : https://github.com/Clever/writable-stream-parallel/blob/master/lib/writable.js#L189 has an interesting function writeOrBuffer, I think you might be able to tweak it a bit to interrupt writing the data from buffer.
Note: These 3 flags are controlling the buffer clearing:
( !finished && !state.bufferProcessing && state.buffer.length )
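Separately from the tweak suggested above, and not something the writable-stream-parallel package is known to provide: recent Node.js versions expose a public destroy() method on streams, which is documented as an immediate teardown in which earlier write() calls may not have drained. A rough sketch, assuming every stage is a standard stream and using the hypothetical helper name killPipeline:

```javascript
// Hypothetical helper: detach every stage from the next one, then destroy
// each stage so that buffered chunks are discarded instead of drained.
function killPipeline(streams) {
  for (let i = 0; i < streams.length - 1; i++) {
    streams[i].unpipe(streams[i + 1]);
  }
  for (const stream of streams) {
    stream.destroy();
  }
}

// Usage (stream names from the question):
// killPipeline([readStream, transformStream0, transformStream1, writeStream]);
```

A write that is already in flight when destroy() is called still completes; the point is only that queued chunks are no longer handed to the write target. Whether custom or third-party stream implementations honour destroy() the same way would need to be checked.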
References:
Node.js Transform Stream Doc
Node.js Duplex Stream Doc
Writing Transform Stream in Node.js
Writing Duplex Stream in Node.js

Basic streams issue: Difficulty sending a string to stdout

I'm just starting learning about streams in node. I have a string in memory and I want to put it in a stream that applies a transformation and pipe it through to process.stdout. Here is my attempt to do it:
var through = require('through');
var stream = through(function write(data) {
this.push(data.toUpperCase());
});
stream.push('asdf');
stream.pipe(process.stdout);
stream.end();
It does not work. When I run the script on the cli via node, nothing is sent to stdout and no errors are thrown. A few questions I have:
If you have a value in memory that you want to put into a stream, what is the best way to do it?
What is the difference between push and queue?
Does it matter if I call end() before or after calling pipe()?
Is end() equivalent to push(null)?
Thanks!
Just use the vanilla stream API
var Transform = require("stream").Transform;
// create a new Transform stream
var stream = new Transform({
decodeStrings: false,
encoding: "ascii"
});
// implement the _transform method
stream._transform = function _transform(str, enc, done) {
this.push(str.toUpperCase() + "\n");
done();
};
// connect to stdout
stream.pipe(process.stdout);
// write some stuff to the stream
stream.write("hello!");
stream.write("world!");
// output
// HELLO!
// WORLD!
Or you can build your own stream constructor. This is really the way the stream API is intended to be used:
var Transform = require("stream").Transform;
function MyStream() {
// call Transform constructor with `this` context
// {decodeStrings: false} keeps data as `string` type instead of `Buffer`
// {encoding: "ascii"} sets the encoding for our strings
Transform.call(this, {decodeStrings: false, encoding: "ascii"});
// our function to do "work"
function _transform(str, encoding, done) {
this.push(str.toUpperCase() + "\n");
done();
}
// export our function
this._transform = _transform;
}
// extend the Transform.prototype to your constructor
MyStream.prototype = Object.create(Transform.prototype, {
constructor: {
value: MyStream
}
});
Now use it like this
// instantiate
var a = new MyStream();
// pipe to a destination
a.pipe(process.stdout);
// write data
a.write("hello!");
a.write("world!");
Output
HELLO!
WORLD!
Some other notes about .push vs .write.
.write(str) adds data to the writable buffer. It is meant to be called externally. If you think of a stream like a duplex file handle, it's just like fwrite, only buffered.
.push(str) adds data to the readable buffer. It is only intended to be called from within our stream.
.push(str) can be called many times. Watch what happens if we change our function to
function _transform(str, encoding, done) {
this.push(str.toUpperCase());
this.push(str.toUpperCase());
this.push(str.toUpperCase() + "\n");
done();
}
Output
HELLO!HELLO!HELLO!
WORLD!WORLD!WORLD!
First, you want to use write(), not push(). write() puts data in to the stream, push() pushes data out of the stream; you only use push() when implementing your own Readable, Duplex, or Transform streams.
Second, you'll only want to write() data to the stream after you've set up the pipe() (or added some event listeners). If you write to a stream with nothing wired to the other end, the data you've written will be lost. As @naomik pointed out, this isn't true in general, since a Writable stream will buffer write()s. In your example you do need to write() after pipe(), though. Otherwise, the process will end before writing anything to STDOUT. This is possibly due to how the through module is implemented, but I don't know that for sure.
So, with that in mind, you can make a couple simple changes to your example to get it to work:
var through = require('through');
var stream = through(function write(data) {
this.push(data.toUpperCase());
});
stream.pipe(process.stdout);
stream.write('asdf');
stream.end();
Now, for your questions:
The easiest way to get data from memory into a writable stream is to simply write() it, just like we're doing with stream.write('asdf') in your example.
As far as I know, the stream doesn't have a queue() function; did you mean write()? Like I said above, write() is used to put data into a stream, and push() is used to push data out of the stream. Only call push() in your own stream implementations.
Only call end() after all your data has been written to your stream. end() basically says: "Ok, I'm done now. Please finish what you're doing and close the stream."
push(null) is pretty much equivalent to end(). That being said, don't call push(null) unless you're doing it inside your own stream implementation (as stated above). It's almost always more appropriate to call end().
Based on the examples for stream (http://nodejs.org/api/stream.html#stream_readable_pipe_destination_options)
and through (https://www.npmjs.org/package/through)
it doesn't look like you are using your stream correctly... What happens if you use write(...) instead of push(...)?
