Using stream-combiner and Writable Streams (stream-adventure) - node.js

i'm working on nodeschool.io's stream-adventure. The challenge:
Write a module that returns a readable/writable stream using the
stream-combiner module. You can use this code to start with:
var combine = require('stream-combiner')
module.exports = function () {
return combine(
// read newline-separated json,
// group books into genres,
// then gzip the output
)
}
Your stream will be written a newline-separated JSON list of science fiction
genres and books. All the books after a "type":"genre" row belong in that
genre until the next "type":"genre" comes along in the output.
{"type":"genre","name":"cyberpunk"}
{"type":"book","name":"Neuromancer"}
{"type":"book","name":"Snow Crash"}
{"type":"genre","name":"space opera"}
{"type":"book","name":"A Deepness in the Sky"}
{"type":"book","name":"Void"}
Your program should generate a newline-separated list of JSON lines of genres,
each with a "books" array containing all the books in that genre. The input
above would yield the output:
{"name":"cyberpunk","books":["Neuromancer","Snow Crash"]}
{"name":"space opera","books":["A Deepness in the Sky","Void"]}
Your stream should take this list of JSON lines and gzip it with
zlib.createGzip().
HINTS
The stream-combiner module creates a pipeline from a list of streams,
returning a single stream that exposes the first stream as the writable side and
the last stream as the readable side like the duplexer module, but with an
arbitrary number of streams in between. Unlike the duplexer module, each
stream is piped to the next. For example:
var combine = require('stream-combiner');
var stream = combine(a, b, c, d);
will internally do a.pipe(b).pipe(c).pipe(d) but the stream returned by
combine() has its writable side hooked into a and its readable side hooked
into d.
As in the previous LINES adventure, the split module is very handy here. You
can put a split stream directly into the stream-combiner pipeline.
Note that split can send empty lines too.
If you end up using split and stream-combiner, make sure to install them
into the directory where your solution file resides by doing:
`npm install stream-combiner split`
Note: when you test the program, the source stream is automatically inserted into the program, so it's perfectly fine to have split() as the first parameter in combine(split(), etc., etc.)
I'm trying to solve this challenge without using the 'through' package.
My code:
var combiner = require('stream-combiner');
var stream = require('stream')
var split = require('split');
var zlib = require('zlib');
module.exports = function() {
var ws = new stream.Writable({decodeStrings: false});
function ResultObj() {
name: '';
books: [];
}
ws._write = function(chunk, enc, next) {
if(chunk.length === 0) {
next();
}
chunk = JSON.parse(chunk);
if(chunk.type === 'genre') {
if(currentResult) {
this.push(JSON.stringify(currentResult) + '\n');
}
var currentResult = new ResultObj();
currentResult.name = chunk.name;
} else {
currentResult.books.push(chunk.name);
}
next();
var wsObj = this;
ws.end = function(d) {
wsObj.push(JSON.stringify(currentResult) + '\n');
}
}
return combiner(split(), ws, zlib.createGzip());
}
My code does not work and returns 'Cannot pipe. Not readable'. Can someone point out to me where i'm going wrong?
Any other comments on how to improve are welcome too...

Related

How to sequentially read a csv file with node.js (using stream API)

I am trying to figure out how to create a stream pipe which reads entries in a csv file on-demand. To do so, I thought of using the following approach using pipes (pseudocode)
const stream_pipe = input_file_stream.pipe(csv_parser)
// Then getting entries through:
let entry = stream_pipe.read()
Unfortunately, after lots of testing I figured that them moment I set up the pipe, it is automatically consumed until the end of the csv file. I tried to pause it on creation by appending .pause() at the end, but it seems to not have any effect.
Here's my current code. I am using the csv_parse library (part of the bigger csv package):
// Read file stream
const file_stream = fs.createReadStream("filename.csv")
const parser = csvParser({
columns: ['a', 'b'],
on_record: (record) => {
// A simple filter as I am interested only in numeric entries
let a = parseInt(record.a)
let b = parseInt(record.b)
return (isNaN(a) || isNaN(b)) ? undefined : record
}
})
const reader = stream.pipe(parser) // Adding .pause() seems to have no effect
console.log(reader.read()) // Prints `null`
// I found out I can use this strategy to read a few entries immediately, but I cannot break out of it and then resume as the stream will automatically be consumed
//for await (const record of reader) {
// console.log(record)
//}
I have been banging my head on this for a while and I could not find easy solutions on both the csv package and node official documentation.
Thanks in advance to anyone able to put me on the right track :)
You can do one thing while reading the stream you can create a readLineInterface and pass the input stream and normal output stream like this:
const inputStream = "reading the csv file",
outputStream = new stream();
// now create a readLineInterface which will read
// line by line you should use async/await
const res = await processRecord(readline.createInterface(inputStream, outputStream));
async function processRecord(line) {
return new Promise((res, rej) => {
if (line) {
// do the processing
res(line);
}
rej('Unable to process record');
})
}
Now create processRecord function should get the things line by line and you can you promises to make it sequential.
Note: the above code is a pseudo code just to give you an idea if things work because I have been doing same in my project to read the csv file line and line and it works fine.

Reasons why a node stream read() function might return null?

I am building a node application that reads a CSV file from the file system, analyzes the file, and then parses the file using the csv-parse module. Each of these steps are in the form of a stream, piped one into the next.
The trouble I am having is that for some files, the parse step can read the stream, but for others, the read() method returns null on the readable event and I don't know why.
In the code below, specifically, I will sometimes see data come through on calling read() on the parser stream, and other times it will return null. CSV files that succeed always succeed, and CSV files that fail always fail. I tried to determine some difference between the files, but other than using different field names in the first row, and slightly different data in the body, I can't see any significant difference between the source files.
What are some reasons that a node stream's read() function might return null after a readable event?
Sample code:
var parse = require('csv-parse');
var fs = require('fs');
var stream = require('stream');
var byteCounter = new stream.Transform({objectMode : true});
byteCounter.setEncoding('utf8');
var totalBytes = 0;
// count the bytes we have read
byteCounter._transform = function (chunk, encoding, done) {
var data = chunk.toString();
if (this._lastLineData) {
data = this._lastLineData + data ;
}
var lines = data.split('\n');
// this is because each chunk will probably not end precisely at the end of a line:
this._lastLineData = lines.splice(lines.length-1,1)[0];
lines.forEach(function(line) {
totalBytes += line.length + 1 // we add an extra byte for the end-of-line
this.push(line);
}, this);
done();
};
byteCounter._flush = function (done) {
if (this._lastLineData) {
this.push(this._lastLineData);
}
this._lastLineData = null;
done();
};
// csv parser
var parser = parse({
delimiter: ",",
comment: "#",
skip_empty_lines: true,
auto_parse: true,
columns: true
});
parser.on('readable', function(){
var row;
while( null !== (row = parser.read()) ) {
// do stuff
}
});
// start by reading a file, piping to byteCounter, then pipe to parser.
var myPath = "/path/to/file.csv";
var options = {
encoding : "utf-8"
};
fs.createReadStream(myPath, options).pipe(byteCounter).pipe(parser);

How can I chain streams internally within a custom through2 stream

I'm writing my own through stream in Node which takes in a text stream and outputs an object per line of text. This is what the end result should look like:
fs.createReadStream('foobar')
.pipe(myCustomPlugin());
The implementation would use through2 and event-stream to make things easy:
var es = require('event-stream');
var through = require('through2');
module.exports = function myCustomPlugin() {
var parse = through.obj(function(chunk, enc, callback) {
this.push({description: chunk});
callback();
});
return es.split().pipe(parse);
};
However, if I were to pull this apart essentially what I did was:
fs.createReadStream('foobar')
.pipe(
es.split()
.pipe(parse)
);
Which is incorrect. Is there a better way? Can I inherit es.split() instead of use it inside the implementation? Is there an easy way to implement splits on lines without event-stream or similar? Would a different pattern work better?
NOTE: I'm intentionally doing the chaining inside the function as the myCustomPlugin() is the API interface I'm attempting to expose.
Based on the link in the previously accepted answer that put me on the right googling track, here's a shorter version if you don't mind another module: stream-combiner (read the code to convince yourself of what's going on!)
var combiner = require('stream-combiner')
, through = require('through2')
, split = require('split2')
function MyCustomPlugin() {
var parse = through(...)
return combine( split(), parse )
}
I'm working on something similar.
See this solution: Creating a Node.js stream from two piped streams
var outstream = through2().on('pipe', function(source) {
source.unpipe(this);
this.transformStream = source.pipe(stream1).pipe(stream2);
});
outstream.pipe = function(destination, options) {
return this.transformStream.pipe(destination, options);
};
return outstream;

Is it possible to register multiple listeners to a child process's stdout data event? [duplicate]

I need to run two commands in series that need to read data from the same stream.
After piping a stream into another the buffer is emptied so i can't read data from that stream again so this doesn't work:
var spawn = require('child_process').spawn;
var fs = require('fs');
var request = require('request');
var inputStream = request('http://placehold.it/640x360');
var identify = spawn('identify',['-']);
inputStream.pipe(identify.stdin);
var chunks = [];
identify.stdout.on('data',function(chunk) {
chunks.push(chunk);
});
identify.stdout.on('end',function() {
var size = getSize(Buffer.concat(chunks)); //width
var convert = spawn('convert',['-','-scale',size * 0.5,'png:-']);
inputStream.pipe(convert.stdin);
convert.stdout.pipe(fs.createWriteStream('half.png'));
});
function getSize(buffer){
return parseInt(buffer.toString().split(' ')[2].split('x')[0]);
}
Request complains about this
Error: You cannot pipe after data has been emitted from the response.
and changing the inputStream to fs.createWriteStream yields the same issue of course.
I don't want to write into a file but reuse in some way the stream that request produces (or any other for that matter).
Is there a way to reuse a readable stream once it finishes piping?
What would be the best way to accomplish something like the above example?
You have to create duplicate of the stream by piping it to two streams. You can create a simple stream with a PassThrough stream, it simply passes the input to the output.
const spawn = require('child_process').spawn;
const PassThrough = require('stream').PassThrough;
const a = spawn('echo', ['hi user']);
const b = new PassThrough();
const c = new PassThrough();
a.stdout.pipe(b);
a.stdout.pipe(c);
let count = 0;
b.on('data', function (chunk) {
count += chunk.length;
});
b.on('end', function () {
console.log(count);
c.pipe(process.stdout);
});
Output:
8
hi user
The first answer only works if streams take roughly the same amount of time to process data. If one takes significantly longer, the faster one will request new data, consequently overwriting the data still being used by the slower one (I had this problem after trying to solve it using a duplicate stream).
The following pattern worked very well for me. It uses a library based on Stream2 streams, Streamz, and Promises to synchronize async streams via a callback. Using the familiar example from the first answer:
spawn = require('child_process').spawn;
pass = require('stream').PassThrough;
streamz = require('streamz').PassThrough;
var Promise = require('bluebird');
a = spawn('echo', ['hi user']);
b = new pass;
c = new pass;
a.stdout.pipe(streamz(combineStreamOperations));
function combineStreamOperations(data, next){
Promise.join(b, c, function(b, c){ //perform n operations on the same data
next(); //request more
}
count = 0;
b.on('data', function(chunk) { count += chunk.length; });
b.on('end', function() { console.log(count); c.pipe(process.stdout); });
You can use this small npm package I created:
readable-stream-clone
With this you can reuse readable streams as many times as you need
For general problem, the following code works fine
var PassThrough = require('stream').PassThrough
a=PassThrough()
b1=PassThrough()
b2=PassThrough()
a.pipe(b1)
a.pipe(b2)
b1.on('data', function(data) {
console.log('b1:', data.toString())
})
b2.on('data', function(data) {
console.log('b2:', data.toString())
})
a.write('text')
I have a different solution to write to two streams simultaneously, naturally, the time to write will be the addition of the two times, but I use it to respond to a download request, where I want to keep a copy of the downloaded file on my server (actually I use a S3 backup, so I cache the most used files locally to avoid multiple file transfers)
/**
* A utility class made to write to a file while answering a file download request
*/
class TwoOutputStreams {
constructor(streamOne, streamTwo) {
this.streamOne = streamOne
this.streamTwo = streamTwo
}
setHeader(header, value) {
if (this.streamOne.setHeader)
this.streamOne.setHeader(header, value)
if (this.streamTwo.setHeader)
this.streamTwo.setHeader(header, value)
}
write(chunk) {
this.streamOne.write(chunk)
this.streamTwo.write(chunk)
}
end() {
this.streamOne.end()
this.streamTwo.end()
}
}
You can then use this as a regular OutputStream
const twoStreamsOut = new TwoOutputStreams(fileOut, responseStream)
and pass it to to your method as if it was a response or a fileOutputStream
If you have async operations on the PassThrough streams, the answers posted here won't work.
A solution that works for async operations includes buffering the stream content and then creating streams from the buffered result.
To buffer the result you can use concat-stream
const Promise = require('bluebird');
const concat = require('concat-stream');
const getBuffer = function(stream){
return new Promise(function(resolve, reject){
var gotBuffer = function(buffer){
resolve(buffer);
}
var concatStream = concat(gotBuffer);
stream.on('error', reject);
stream.pipe(concatStream);
});
}
To create streams from the buffer you can use:
const { Readable } = require('stream');
const getBufferStream = function(buffer){
const stream = new Readable();
stream.push(buffer);
stream.push(null);
return Promise.resolve(stream);
}
What about piping into two or more streams not at the same time ?
For example :
var PassThrough = require('stream').PassThrough;
var mybiraryStream = stream.start(); //never ending audio stream
var file1 = fs.createWriteStream('file1.wav',{encoding:'binary'})
var file2 = fs.createWriteStream('file2.wav',{encoding:'binary'})
var mypass = PassThrough
mybinaryStream.pipe(mypass)
mypass.pipe(file1)
setTimeout(function(){
mypass.pipe(file2);
},2000)
The above code does not produce any errors but the file2 is empty

Parse output of spawned node.js child process line by line

I have a PhantomJS/CasperJS script which I'm running from within a node.js script using process.spawn(). Since CasperJS doesn't support require()ing modules, I'm trying to print commands from CasperJS to stdout and then read them in from my node.js script using spawn.stdout.on('data', function(data) {}); in order to do things like add objects to redis/mongoose (convoluted, yes, but seems more straightforward than setting up a web service for this...) The CasperJS script executes a series of commands and creates, say, 20 screenshots which need to be added to my database.
However, I can't figure out how to break the data variable (a Buffer?) into lines... I've tried converting it to a string and then doing a replace, I've tried doing spawn.stdout.setEncoding('utf8'); but nothing seems to work...
Here is what I have right now
var spawn = require('child_process').spawn;
var bin = "casperjs"
//googlelinks.js is the example given at http://casperjs.org/#quickstart
var args = ['scripts/googlelinks.js'];
var cspr = spawn(bin, args);
//cspr.stdout.setEncoding('utf8');
cspr.stdout.on('data', function (data) {
var buff = new Buffer(data);
console.log("foo: " + buff.toString('utf8'));
});
cspr.stderr.on('data', function (data) {
data += '';
console.log(data.replace("\n", "\nstderr: "));
});
cspr.on('exit', function (code) {
console.log('child process exited with code ' + code);
process.exit(code);
});
https://gist.github.com/2131204
Try this:
cspr.stdout.setEncoding('utf8');
cspr.stdout.on('data', function(data) {
var str = data.toString(), lines = str.split(/(\r?\n)/g);
for (var i=0; i<lines.length; i++) {
// Process the line, noting it might be incomplete.
}
});
Note that the "data" event might not necessarily break evenly between lines of output, so a single line might span multiple data events.
I've actually written a Node library for exactly this purpose, it's called stream-splitter and you can find it on Github: samcday/stream-splitter.
The library provides a special Stream you can pipe your casper stdout into, along with a delimiter (in your case, \n), and it will emit neat token events, one for each line it has split out from the input Stream. The internal implementation for this is very simple, and delegates most of the magic to substack/node-buffers which means there's no unnecessary Buffer allocations/copies.
I found a nicer way to do this with just pure node, which seems to work well:
const childProcess = require('child_process');
const readline = require('readline');
const cspr = childProcess.spawn(bin, args);
const rl = readline.createInterface({ input: cspr.stdout });
rl.on('line', line => /* handle line here */)
Adding to maerics' answer, which does not deal properly with cases where only part of a line is fed in a data dump (theirs will give you the first part and the second part of the line individually, as two separate lines.)
var _breakOffFirstLine = /\r?\n/
function filterStdoutDataDumpsToTextLines(callback){ //returns a function that takes chunks of stdin data, aggregates it, and passes lines one by one through to callback, all as soon as it gets them.
var acc = ''
return function(data){
var splitted = data.toString().split(_breakOffFirstLine)
var inTactLines = splitted.slice(0, splitted.length-1)
var inTactLines[0] = acc+inTactLines[0] //if there was a partial, unended line in the previous dump, it is completed by the first section.
acc = splitted[splitted.length-1] //if there is a partial, unended line in this dump, store it to be completed by the next (we assume there will be a terminating newline at some point. This is, generally, a safe assumption.)
for(var i=0; i<inTactLines.length; ++i){
callback(inTactLines[i])
}
}
}
usage:
process.stdout.on('data', filterStdoutDataDumpsToTextLines(function(line){
//each time this inner function is called, you will be getting a single, complete line of the stdout ^^
}) )
You can give this a try. It will ignore any empty lines or empty new line breaks.
cspr.stdout.on('data', (data) => {
data = data.toString().split(/(\r?\n)/g);
data.forEach((item, index) => {
if (data[index] !== '\n' && data[index] !== '') {
console.log(data[index]);
}
});
});
Old stuff but still useful...
I have made a custom stream Transform subclass for this purpose.
See https://stackoverflow.com/a/59400367/4861714
#nyctef's answer uses an official nodejs package.
Here is a link to the documentation: https://nodejs.org/api/readline.html
The node:readline module provides an interface for reading data from a Readable stream (such as process.stdin) one line at a time.
My personal use-case is parsing json output from the "docker watch" command created in a spawned child_process.
const dockerWatchProcess = spawn(...)
...
const rl = readline.createInterface({
input: dockerWatchProcess.stdout,
output: null,
});
rl.on('line', (log: string) => {
console.log('dockerWatchProcess event::', log);
// code to process a change to a docker event
...
});

Resources