Node.js: splitting a readable stream pipe to multiple sequential writable streams

Given a Readable stream (which may be process.stdin or a file stream), is it possible/practical to pipe() to a custom Writable stream that will fill a child Writable until a certain size; then close that child stream; open a new Writable stream and continue?
(The context is to upload a large piece of data from a pipeline to a CDN, dividing it up into blocks of a reasonable size as it goes, without having to write the data to disk first.)
I've tried creating a Writable that handles the opening and closing of the child stream in the _write function, but the problem comes when the incoming chunk is too big to fit in the existing child stream: it has to write some of the chunk to the old stream; create the new stream; and then wait for the open event on the new stream before completing the _write call.
The other thought I had was to create an extra Duplex or Transform stream to buffer the pipe and ensure that the chunk coming into the Writable is definitely equal to or less than the amount the existing child stream can accept, to give the Writable time to change the child stream over.
Alternatively, is this overcomplicating everything and there's a much easier way to do the original task?
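For context, here is a minimal sketch of the kind of Writable described above (the class name and the file-based child streams are only illustrative; it assumes a single chunk never spans more than two blocks, and it is untested):

var util = require('util');
var fs = require('fs');
var Writable = require('stream').Writable;

util.inherits(ChunkedWriter, Writable);

// Illustrative only: fill numbered child file streams up to maxBytes each,
// splitting a chunk across two children when it straddles the boundary.
function ChunkedWriter(maxBytes) {
  Writable.call(this);
  this.maxBytes = maxBytes;
  this.bytesInChild = 0;
  this.index = 0;
  this.child = null;
}

ChunkedWriter.prototype._openChild = function() {
  this.child = fs.createWriteStream('block-' + this.index++);
  this.bytesInChild = 0;
};

ChunkedWriter.prototype._write = function(chunk, encoding, callback) {
  if (!this.child) this._openChild();
  var room = this.maxBytes - this.bytesInChild;
  if (chunk.length <= room) {
    this.bytesInChild += chunk.length;
    this.child.write(chunk, callback);
  } else {
    // end the current child with the part that fits, then open a new
    // child for the remainder before completing this _write call
    var self = this;
    this.child.end(chunk.slice(0, room), function() {
      self._openChild();
      self.bytesInChild = chunk.length - room;
      self.child.write(chunk.slice(room), callback);
    });
  }
};

// usage: readable.pipe(new ChunkedWriter(4 * 1024 * 1024));
// (the last child still needs an end() when the parent emits 'finish')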

I came across this question while looking for an answer to a related problem: how to parse a file and split its lines into separate files depending on some category value in each line.
I did my best to adapt my code to make it more relevant to your problem, but it was adapted quickly and is not tested. Treat it as pseudo-code.
var fs = require('fs'),
    through = require('through');

var destCount = 0, dest, size = 0, MAX_SIZE = 1000;

readableStream
  .on('data', function(data) {
    var out = data.toString() + "\n";
    // roll over to a new destination once the current one is full
    if (dest && size + out.length > MAX_SIZE) {
      dest.end();
      dest = null;
      size = 0;
    }
    if (!dest) {
      destCount++;
      // option 1. manipulate data before saving it
      dest = through();
      dest.pipe(fs.createWriteStream("log" + destCount));
      // option 2. write directly to file
      // dest = fs.createWriteStream("log" + destCount);
    }
    size += out.length;
    dest.write(out);
  })
  .on('end', function() {
    if (dest) dest.end();
  });

I would introduce a Transform between the Readable and the Writable stream, and do all the logic I need in its _transform method.
Maybe I would only have a Readable and a Transform. The _transform method would create all the Writable streams I need (a rough sketch follows below).
Personally, I only use a Writable stream when I'm dumping data somewhere and will be done processing that chunk.
I avoid implementing _read and _write as much as I can and abuse Transform streams instead.
But the point I don't understand in your question is what you write about size. What do you mean by it?
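A rough sketch of that Transform idea, with made-up names and untested: it rolls over to a new child file stream at chunk boundaries (so a chunk never has to be split across two children), and it uses the child's drain event so backpressure still works.

var util = require('util');
var fs = require('fs');
var Transform = require('stream').Transform;

util.inherits(BlockSplitter, Transform);

function BlockSplitter(maxBytes) {
  Transform.call(this);
  this.maxBytes = maxBytes;
  this.written = 0;
  this.index = 0;
  this.dest = null;
}

BlockSplitter.prototype._transform = function(chunk, encoding, done) {
  // start a new numbered destination when the current one would overflow
  if (!this.dest || this.written + chunk.length > this.maxBytes) {
    if (this.dest) this.dest.end();
    this.dest = fs.createWriteStream('block-' + this.index++);
    this.written = 0;
  }
  this.written += chunk.length;
  // only ask for the next chunk once the child has accepted this one
  if (this.dest.write(chunk)) done();
  else this.dest.once('drain', done);
};

BlockSplitter.prototype._flush = function(done) {
  // close the last destination when the source ends
  if (this.dest) this.dest.end(done);
  else done();
};

// usage: readableStream.pipe(new BlockSplitter(4 * 1024 * 1024));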

Related

What are the roles of _read and read in Node JS streams?

I'm really just looking for clarification on how these work. IMO the documentation on streams is somewhat lacking, and there actually aren't a lot of resources out there that comprehensively explain how they're meant to work and be extended.
My question can be broken down into two parts.
One, what is the role of the _read function within the stream module? When I run this code it endlessly prints out "hello world" until null is pushed onto the stream buffer. This seems to indicate that _read is called in some kind of loop that waits for a null in the buffer, but I can't find documentation anywhere that states this in explicit terms.
var Readable = require('stream').Readable;
var rs = Readable();
rs._read = function () {
  rs.push("hello world");
  rs.push(null);
};
rs.on("data", function(data) {
  console.log("some data", data);
});
Two, what does read actually do? My understanding is that read consumes data from the read stream buffer, and fires the data event. Is that all that's going on here?
read() is something that a consumer of the readStream calls if they want to specifically read some bytes from the stream (when the stream is not flowing).
_read() is an internal method that is part of the internal implementation of the read stream. The internals of the stream call this method (it is NOT to be called from the outside) when the stream is flowing and the stream wants to get more data from the source. When called, the _read() method pushes data with .push(data), or, if it has no more data, it calls .push(null).
You can see an explanation and example here in this article.
_read(size) {
  if (this.data.length) {
    const chunk = this.data.slice(0, size);
    this.data = this.data.slice(size, this.data.length);
    this.push(chunk);
  } else {
    this.push(null); // 'end', no more data
  }
}
If you were implementing a read stream to some custom source of data, then you would implement the _read() method to fetch up to the size amount of data from your source and .push() that data into the stream.
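To make that concrete, here is a self-contained toy version of that pattern (the StringStream name and the in-memory string source are just for illustration):

var Readable = require('stream').Readable;
var util = require('util');

util.inherits(StringStream, Readable);

// A toy Readable that serves chunks of an in-memory string.
function StringStream(text) {
  Readable.call(this);
  this.data = text;
}

StringStream.prototype._read = function(size) {
  if (this.data.length) {
    var chunk = this.data.slice(0, size);
    this.data = this.data.slice(size);
    this.push(chunk);
  } else {
    this.push(null); // no more data, signal 'end'
  }
};

new StringStream('hello world\n').pipe(process.stdout);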

Basic streams issue: Difficulty sending a string to stdout

I'm just starting to learn about streams in node. I have a string in memory and I want to put it in a stream that applies a transformation and pipe it through to process.stdout. Here is my attempt to do it:
var through = require('through');
var stream = through(function write(data) {
  this.push(data.toUpperCase());
});
stream.push('asdf');
stream.pipe(process.stdout);
stream.end();
It does not work. When I run the script on the cli via node, nothing is sent to stdout and no errors are thrown. A few questions I have:
If you have a value in memory that you want to put into a stream, what is the best way to do it?
What is the difference between push and queue?
Does it matter if I call end() before or after calling pipe()?
Is end() equivalent to push(null)?
Thanks!
Just use the vanilla stream API
var Transform = require("stream").Transform;
// create a new Transform stream
var stream = new Transform({
  decodeStrings: false,
  encoding: "ascii"
});
// implement the _transform method
stream._transform = function _transform(str, enc, done) {
  this.push(str.toUpperCase() + "\n");
  done();
};
// connect to stdout
stream.pipe(process.stdout);
// write some stuff to the stream
stream.write("hello!");
stream.write("world!");
// output
// HELLO!
// WORLD!
Or you can build your own stream constructor. This is really the way the stream API is intended to be used
var Transform = require("stream").Transform;

function MyStream() {
  // call Transform constructor with `this` context
  // {decodeStrings: false} keeps data as `string` type instead of `Buffer`
  // {encoding: "ascii"} sets the encoding for our strings
  Transform.call(this, {decodeStrings: false, encoding: "ascii"});
  // our function to do "work"
  function _transform(str, encoding, done) {
    this.push(str.toUpperCase() + "\n");
    done();
  }
  // export our function
  this._transform = _transform;
}

// extend the Transform.prototype to your constructor
MyStream.prototype = Object.create(Transform.prototype, {
  constructor: {
    value: MyStream
  }
});
Now use it like this
// instantiate
var a = new MyStream();
// pipe to a destination
a.pipe(process.stdout);
// write data
a.write("hello!");
a.write("world!");
Output
HELLO!
WORLD!
Some other notes about .push vs .write.
.write(str) adds data to the writable buffer. It is meant to be called externally. If you think of a stream like a duplex file handle, it's just like fwrite, only buffered.
.push(str) adds data to the readable buffer. It is only intended to be called from within our stream.
.push(str) can be called many times. Watch what happens if we change our function to
function _transform(str, encoding, done) {
  this.push(str.toUpperCase());
  this.push(str.toUpperCase());
  this.push(str.toUpperCase() + "\n");
  done();
}
Output
HELLO!HELLO!HELLO!
WORLD!WORLD!WORLD!
First, you want to use write(), not push(). write() puts data into the stream, push() pushes data out of the stream; you only use push() when implementing your own Readable, Duplex, or Transform streams.
Second, you'll only want to write() data to the stream after you've set up the pipe() (or added some event listeners). If you write to a stream with nothing wired to the other end, the data you've written will be lost. As @naomik pointed out, this isn't true in general, since a Writable stream will buffer write()s. In your example you do need to write() after pipe(), though; otherwise the process will end before anything is written to STDOUT. This is possibly due to how the through module is implemented, but I don't know that for sure.
So, with that in mind, you can make a couple simple changes to your example to get it to work:
var through = require('through');
var stream = through(function write(data) {
  this.push(data.toUpperCase());
});
stream.pipe(process.stdout);
stream.write('asdf');
stream.end();
Now, for your questions:
The easiest way to get data from memory into a writable stream is to simply write() it, just like we're doing with stream.write('asdf') in your example.
As far as I know, the stream doesn't have a queue() function, did you mean write()? Like I said above, write() is used to put data into a stream, push() is used to push data out of the stream. Only call push() in your own stream implementations.
Only call end() after all your data has been written to your stream. end() basically says: "Ok, I'm done now. Please finish what you're doing and close the stream."
push(null) is pretty much equivalent to end(). That being said, don't call push(null) unless you're doing it inside your own stream implementation (as stated above). It's almost always more appropriate to call end().
Based on the examples for stream (http://nodejs.org/api/stream.html#stream_readable_pipe_destination_options)
and through (https://www.npmjs.org/package/through)
it doesn't look like you are using your stream correctly... What happens if you use write(...) instead of push(...)?

Performing piped operations on individual chunks (node-wav)

I'm new to node and I'm working on an audio stream server. I'm trying to process / transform the chunks of a stream as they come out of each pipe.
So, file = fs.createReadStream(path) (the file stream) is piped into wavy (which removes the headers and outputs raw PCM), which is piped into waver (which adds a proper wav header to the chunk), which is piped into spark (which outputs the chunk to the client).
The idea is that each filestream chunk has its headers removed if any (this only applies to the first chunk), then using the node-wav Writer that chunk is endowed with headers and then sent to the client. As I'm sure you guessed, this doesn't work.
The pipe operations into node-wav are acting on the entire filestream, not the individual chunks. To confirm this, I've checked the output client-side, and it is effectively dropping the headers and re-adding them to the entire data stream.
From what I've read of the Node Stream docs it seems like what I'm trying to do should be possible, just not the way I'm doing it. I just can't pin down how to accomplish this.
Is it possible, and if so what am I missing?
Complete function:
processAudio = (path, spark) ->
  wavy = new wav.Reader()
  waver = new wav.Writer()
  file = fs.createReadStream(path)
  file.pipe(wavy).pipe(waver).pipe(spark)
I don't really know about wavs and headers, but if you're "trying to process / transform the chunks of a stream as they come out of each pipe", you can use a Transform stream.
It lets you sit between two streams and modify the bytes passing between them:
var util = require('util');
var Transform = require('stream').Transform;

util.inherits(Test, Transform);

function Test(options) {
  Transform.call(this, options);
}

Test.prototype._transform = function(chunk, encoding, cb) {
  // do something with chunk, then pass a modified chunk (or not)
  // to the downstream
  cb(null, chunk);
};
To observe the stream and potentially modify it, pipe like:
file.pipe(wavy).pipe(new Test()).pipe(waver).pipe(spark)

gzipping a file with nodejs streams causes memory leaks

I'm trying to do something that should be seemingly quite simple: take a file with filename X and create a gzipped version as "X.gz". Node.js's zlib module does not come with a convenient zlib.gzip(infile, outfile), so I figured I'd use an input stream, an output stream, and a zlib gzipper, and pipe them together:
var zlib = require("zlib"),
    zipper = zlib.createGzip(),
    fs = require("fs");

var tryThing = function(logfile) {
  var input = fs.createReadStream(logfile, {autoClose: true}),
      output = fs.createWriteStream(logfile + ".gz");
  input.pipe(zipper).pipe(output);
  output.on("end", function() {
    // delete original file, it is no longer needed
    fs.unlink(logfile);
    // clear listeners
    zipper.removeAllListeners();
    input.removeAllListeners();
  });
}
However, every time I run this function, the memory footprint of Node.js grows by about 100 kB. Am I forgetting to tell the streams they should just kill themselves off again because they won't be needed any longer?
Or, alternatively, is there a way to just gzip a file without bothering with streams and pipes? I tried googling for "node.js gzip a file", but the results are just links to the API docs and Stack Overflow questions on gzipping streams and buffers, not how to just gzip a file.
I think you need to properly unpipe and close the streams. Simply calling removeAllListeners() may not be enough to clean things up, as the streams may be waiting for more data (and thus staying alive in memory unnecessarily).
Also, you're not closing the output stream, and IMO I'd listen on the input stream's end event instead of the output's.
// cleanup
input.once('end', function() {
  zipper.removeAllListeners();
  zipper.close();
  zipper = null;
  input.removeAllListeners();
  input.close();
  input = null;
  output.removeAllListeners();
  output.close();
  output = null;
});
Also, I don't think the stream returned from zlib.createGzip() can be reused once it has ended. You should create a new one on every invocation of tryThing:
var input = fs.createReadStream(logfile, {autoClose: true}),
    output = fs.createWriteStream(logfile + ".gz"),
    zipper = zlib.createGzip();
input.pipe(zipper).pipe(output);
I haven't tested this though, as I don't have a memory profiling tool nearby right now.
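Putting those suggestions together, a sketch of tryThing that creates fresh streams on every call and cleans up once the output has finished (untested; listening for 'finish' on the output and the fs.unlink callback are my own adjustments):

var zlib = require("zlib"),
    fs = require("fs");

var tryThing = function(logfile) {
  // create a fresh gzip stream every time; a gzip stream cannot be reused once ended
  var zipper = zlib.createGzip(),
      input = fs.createReadStream(logfile, {autoClose: true}),
      output = fs.createWriteStream(logfile + ".gz");
  input.pipe(zipper).pipe(output);
  // 'finish' fires on the writable side once all data has been flushed
  output.on("finish", function() {
    // delete the original file, it is no longer needed
    fs.unlink(logfile, function(err) {
      if (err) console.error(err);
    });
  });
};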

How to wrap a buffer as a stream2 Readable stream?

How can I transform a node.js Buffer into a Readable stream using the stream2 interface?
I already found this answer and the stream-buffers module, but that module is based on the stream1 interface.
The easiest way is probably to create a new PassThrough stream instance, and simply push your data into it. When you pipe it to other streams, the data will be pulled out of the first stream.
var stream = require('stream');
// Initiate the source
var bufferStream = new stream.PassThrough();
// Write your buffer
bufferStream.end(Buffer.from('Test data.'));
// Pipe it to something else (i.e. stdout)
bufferStream.pipe(process.stdout)
As natevw suggested, it's even more idiomatic to use a stream.PassThrough, and end it with the buffer:
var stream = require('stream');
var buffer = Buffer.from('foo');
var bufferStream = new stream.PassThrough();
bufferStream.end(buffer);
bufferStream.pipe(process.stdout);
This is also how buffers are converted/piped in vinyl-fs.
A modern, simple approach that is usable anywhere you would use fs.createReadStream(), but without having to first write the file to a path:
const {Duplex} = require('stream'); // native Node module

function bufferToStream(myBuffer) {
  let tmp = new Duplex();
  tmp.push(myBuffer);
  tmp.push(null);
  return tmp;
}
const myReadableStream = bufferToStream(your_buffer);
myReadableStream is re-usable.
The buffer and the stream exist only in memory without writing to local storage.
I use this approach often when the actual file is stored at some cloud service and our API acts as a go-between. Files never get written to local disk.
I have found this to be very reliable regardless of the buffer (up to 10 MB) or the destination, as long as the destination accepts a Readable stream. For larger files it is probably better to stream from the original source rather than hold everything in a single in-memory buffer.
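As a side note, on recent Node versions stream.Readable.from() can do the same job; it accepts any iterable, and wrapping the buffer in an array keeps it as a single chunk (a small sketch, not taken from the answers above):

const {Readable} = require('stream');

// a one-element array yields the whole buffer as a single chunk
const myReadableStream = Readable.from([Buffer.from('Test data.')]);
myReadableStream.pipe(process.stdout);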
