Long Object Stream with Asynchronous Transform Ending Too Soon - node.js

I am piping the response from a Node request into a transform stream using through2Concurrent. This response comes in as a buffer and is parsed into objects using JSONStream. Those objects then get piped into my transform stream. The transform stream function makes an HTTP request, formats the response, and stores it in MongoDB. We are using concurrent streams because it would take an unacceptably long time to handle everything otherwise.
response Stream -> JSONStream.parse() -> Transform Stream
Problem Description
The initial response stream contains roughly 18,000 objects once parsed. However, the stream terminates and a finish event is received before all 18,000 objects are handled. No error is thrown, but only about 2,000 to 5,000 objects are actually handled before the stream ends. The exact number handled varies.
Here is the relevant code:
const analyticsTransformer = through2Concurrent.obj({
maxConcurrency: 15
}, async (doc, enc, cb) => {
// Make an http request. This is a relatively long request.
const res = await apim.getAnalytics(doc);
// Save response to mongo.
await UsageData.save(res);
cb();
});
// Kick off the streaming.
broker.getInstances()
.pipe(JSONStream.parse('*'))
.pipe(analyticsTransformer)
.on('finish', () => {
// We reach this way too quickly before we have handled all 18,000 objects
})
.on('error', err => {
// No errors are caught.
})
What I have Tried
Waiting for an 'end' event: Same result. Unhandled objects and early termination.
Using through2 (not through2Concurrent): Receive ETIMEOUT after several thousand objects have come through.
Setting the highWaterMark to 18,000:
This is the only thing that has worked. I can handle all of the objects if I change this highWatermark value, but this is really just a bandaid on the problem. I want to know why this works and what I can do to fix my streaming problems in a robust way.
Setting the highWaterMark looks like this:
const analyticsTransformer = through2Concurrent.obj({
highWaterMark: 18000,
maxConcurrency: 15
}, async (doc, enc, cb) => {
// ...
});
Why does changing the highWaterMark value work?
What is the real cause of my early terminated stream?
How can I fix it?
Thanks in advance to anyone that can help! :)
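The thread does not include an accepted explanation, so the following is only a minimal sketch of one way to bound concurrency while making sure 'finish' is not emitted until every in-flight operation has settled. It uses only Node's built-in stream module (Node 15+ for stream/promises) and is not a diagnosis of through2Concurrent. The names concurrentSink, handleDoc, and run are hypothetical; handleDoc stands in for the apim.getAnalytics() call plus the UsageData.save() call.

```javascript
const { Readable, Writable } = require('stream');
const { pipeline } = require('stream/promises');

// Hypothetical stand-in for the HTTP request plus the MongoDB save.
const handleDoc = (doc) => new Promise((resolve) => setTimeout(() => resolve(doc), 1));

// Writable that runs up to `maxConcurrency` handlers at once and holds back
// 'finish' (via final()) until every in-flight handler has settled.
function concurrentSink(handler, maxConcurrency) {
  let inFlight = 0;
  let pendingWriteCb = null; // write callback held back while at capacity
  let pendingFinalCb = null; // final callback waiting for in-flight work
  let failure = null;

  const settle = (err) => {
    inFlight--;
    if (err && !failure) failure = err;
    if (pendingWriteCb) {
      const cb = pendingWriteCb;
      pendingWriteCb = null;
      cb(failure); // a slot is free again: accept the next object
    } else if (pendingFinalCb && inFlight === 0) {
      const cb = pendingFinalCb;
      pendingFinalCb = null;
      cb(failure); // everything has settled: allow 'finish'
    }
  };

  return new Writable({
    objectMode: true,
    write(doc, _encoding, cb) {
      inFlight++;
      handler(doc).then(() => settle(), settle);
      // Below capacity: ask for more right away. At capacity: wait for a slot.
      if (inFlight < maxConcurrency) cb(failure);
      else pendingWriteCb = cb;
    },
    final(cb) {
      if (inFlight === 0) cb(failure);
      else pendingFinalCb = cb;
    },
  });
}

// Usage: resolves with the number of handled objects once the pipeline is done.
async function run(total) {
  let handled = 0;
  const source = Readable.from(Array.from({ length: total }, (_, id) => ({ id })));
  await pipeline(source, concurrentSink(async (doc) => {
    await handleDoc(doc);
    handled++;
  }, 15));
  return handled;
}
```

Because the write callback is withheld while all slots are busy, backpressure propagates upstream instead of relying on a large highWaterMark.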

Related

How do I stream a chunked file using Node.js Readable?

I have a 400Mb file split into chunks that are ~1Mb each.
Each chunk is a MongoDB document:
{
name: 'stuff.zip',
index: 15,
buffer: Binary('......'),
totalChunks: 400
}
I am fetching each chunk from my database and then streaming it to the client.
Every time I get a chunk from the DB I push it to the readableStream, which is being piped to the client.
Here is the code:
import { Readable } from 'stream'
const name = 'stuff.zip'
const contentType = 'application/zip'
app.get('/api/download-stuff', async (req, res) => {
res.set('Content-Type', contentType)
res.set('Content-Disposition', `attachment; filename=${name}`)
res.attachment(name)
// get `totalChunks` from random chunk
let { totalChunks } = await ChunkModel.findOne({ name }).select('totalChunks')
let index = 0
const readableStream = new Readable({
async read() {
if (index < totalChunks) {
let { buffer } = await ChunkModel.findOne({ name, index }).select('buffer')
let canContinue = readableStream.push(buffer)
console.log(`pushed chunk ${index}/${totalChunks}`)
index++
// sometimes it logs false
// which means I should be waiting before pushing more
// but I don't know how
console.log('canContinue = ', canContinue)
} else {
readableStream.push(null)
readableStream.destroy()
console.log(`all ${totalChunks} chunks streamed to the client`)
}
}
})
readableStream.pipe(res)
})
The code works.
But I'm wondering whether I risk having memory overflows on my local server memory, especially when the requests for the same file are too many or the chunks are too many.
Question: My code is not waiting for readableStream to finish reading the chunk that was just pushed to it, before pushing the next one. I thought it was, and that is why I'm using read(){..} in this probably wrong way. So how should I wait for each chunk to be pushed, read, streamed to the client and cleared from my server's local memory, before I push the next one in ?
I have created this sandbox in case it helps anyone
In general, when the readable interface is implemented correctly (i.e., the backpressure signal is respected), the readable interface will prevent the code from overflowing the memory regardless of source size.
When implemented according to the API spec, the readable itself does not keep references to data that has finished passing through the stream. The memory requirement of a readable's buffer is adjusted by specifying a highWaterMark.
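As a small illustration of how highWaterMark bounds the buffer, the sketch below pushes into a paused Readable with a deliberately tiny limit and records what push() returns. The 4-byte limit and the variable names are chosen only for the demonstration.

```javascript
const { Readable } = require('stream');

// A readable with a 4-byte buffer limit and a no-op read(); nothing consumes it,
// so pushed data stays in the internal buffer.
const readable = new Readable({ highWaterMark: 4, read() {} });

// 3 bytes buffered, still below the mark: push() reports that more is welcome.
const firstPush = readable.push(Buffer.from('abc'));

// 6 bytes buffered, above the mark: push() signals backpressure.
const secondPush = readable.push(Buffer.from('def'));

console.log(firstPush, secondPush, readable.readableLength);
```

Note that push() still accepts the data when it returns false; the return value is only a signal, which is why ignoring it lets the buffer grow past the limit.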
In this case, the snippet does not conform to the readable interface. It violates the following two concepts:
No data shall be pushed to the readable's buffer unless read() has been called. Currently, this implementation proceeds to push data from DB immediately. Consequently, the readable buffer will start to fill before the sink has begun to consume data.
The readable's push() method returns a boolean flag. When the flag is false, the implementation must wait for read() to be called again before pushing additional data. If the flag is ignored, the buffer will grow beyond the highWaterMark.
Note that ignoring these core criteria of Readables circumvents the backpressure logic.
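The two rules above can be sketched with a custom Readable that fetches and pushes exactly one chunk per read() call. This is a minimal illustration, not the original poster's code: fetchChunk is a hypothetical stand-in for ChunkModel.findOne({ name, index }), and createChunkStream is a made-up name.

```javascript
const { Readable } = require('stream');

// Hypothetical stand-in for the database lookup of one chunk.
const fetchChunk = async (index) => Buffer.from(`chunk-${index};`);

function createChunkStream(totalChunks) {
  let index = 0;
  return new Readable({
    // The stream calls read() only when it wants more data, and it does not
    // call it again until push() has been called for the current request.
    read() {
      if (index >= totalChunks) {
        this.push(null); // signal end of stream
        return;
      }
      const current = index++;
      fetchChunk(current).then(
        (buffer) => this.push(buffer), // exactly one push per read()
        (err) => this.destroy(err)     // surface lookup errors on the stream
      );
    },
  });
}
```

With one push per read(), the stream stops requesting chunks once its buffer reaches the highWaterMark and resumes when the consumer drains it, so createChunkStream(totalChunks).pipe(res) holds only a bounded number of chunks in memory.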
An alternative implementation, if this is a Mongoose query:
app.get('/api/download-stuff', async (req, res) => {
// ... truncated handler
// A helper variable to relay data from the stream to the response body
const passThrough = new stream.PassThrough({objectMode: false});
// Pipe data using pipeline() to simplify handling stream errors
stream.pipeline(
// Create a cursor that fetch all relevant documents using a single query
ChunkModel.find().limit(chunksLength).select("buffer").sort({index: 1}).lean().cursor(),
// Cherry pick the `buffer` property
new stream.Transform({
objectMode: true,
transform: ({ buffer }, encoding, next) => {
next(null, buffer);
}
}),
// Write the retrieved documents to the helper variable
passThrough,
error => {
if(error){
// Log and handle error. At this point the HTTP headers are probably already sent,
// and it is therefore too late to return HTTP500
}
}
);
// Express does not send res.body, so pipe the relayed data to the response
passThrough.pipe(res);
});

Avoid re-fetching data while streaming in Expressjs

I've just started playing with streaming data in Expressjs.
Not entirely sure, but I think the request will start to execute the handler again. For example, here is my handler:
import getDataAsync from "./somewhere";
function handler(req, res) {
console.log('requesting', req.path);
getDataAsync()
.then(data => {
let stream = renderContent(data);
stream.pipe(res);
})
.catch(err => {
res.end();
})
}
What I found was that it continues to print console.log('requesting', req.path) (which I think means it re-executes getDataAsync).
My question is:
Is it true it will re-execute getDataAsync?
If it does, what's your approach?
Thanks heaps!
Node.js is non-blocking, so every new request to an endpoint that uses this handler runs the handler again. The handler calls getDataAsync() and then returns, leaving the call stack. This is repeated for each request.
If you want the handler to wait for the current stream to finish before it does the work again, you could do:
import getDataAsync from "./somewhere";
let streamComplete = true;
function handler(req, res) {
if(!streamComplete) {
// Stop here; without the return, the handler would continue after ending the response
return res.end();
}
console.log('requesting', req.path);
getDataAsync()
.then(data => {
streamComplete = false;
let stream = renderContent(data);
stream.pipe(res);
stream.on('end', () => streamComplete = true);
})
.catch(err => {
res.end();
})
}
I did need to sort this problem out in one of my projects. Node or in fact any other environment/language will have the same issue, that once you start streaming the data to one client, it's rather hard to stream it to another. This is due to the fact that once you do this:
inputStream.pipe(outputStream);
...the input data will be pushed out to output and will be removed from memory. So if you just pipe the inputStream again, you'll have some initial part of the data missing.
The solution I came up with was to write a Transform stream that keeps the data in memory so that you can reuse it afterwards. Such a stream holds all the original chunks, and once a later reader catches up with the first request, it just keeps pushing new chunks directly. I packaged the solution as an npm module and published it, so now you can use it.
This is how you use it:
const {ReReadable} = require("rereadable-stream");
// We'll use this for caching - you can use a Map if you have more streams
let cachedStream;
// This function will get the stream and
const getCachedStream = () =>
(cachedStream || (cachedStream =
getDataAsync()
.then(
data => renderContent(data).pipe(new ReReadable())
))
)
.then(readable => readable.rewind())
Such a function will call your getDataAsync once and then push the data into the rewindable stream, and every time the function is executed the stream will be rewound to the beginning.
You can read a bit more about the rereadable-stream module here.
A word of warning though - remember, that you will keep all that data in memory now, so be careful to clean it up if there's more chunks there and control your memory usage.
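To show the caching idea without the module, here is a simplified sketch that uses only Node's built-in streams. It is not how rereadable-stream is implemented: it waits for the source to finish before replaying, whereas the module described above can serve readers while the source is still flowing. createSource, getReplayStream, and sourceCreated are hypothetical names; createSource stands in for renderContent(data) after getDataAsync() resolves.

```javascript
const { Readable } = require('stream');

let sourceCreated = 0; // counts how often the expensive source is built
let cachedChunks;      // Promise<Buffer[]>, filled on first use

// Hypothetical source of rendered content.
const createSource = () => {
  sourceCreated++;
  return Readable.from([Buffer.from('hello '), Buffer.from('world')]);
};

// Returns a promise for a fresh readable that replays the cached chunks.
function getReplayStream() {
  if (!cachedChunks) {
    cachedChunks = (async () => {
      const chunks = [];
      for await (const chunk of createSource()) chunks.push(chunk);
      return chunks;
    })();
  }
  return cachedChunks.then((chunks) => Readable.from(chunks));
}
```

In a handler this could be used as getReplayStream().then(stream => stream.pipe(res)). The same memory warning applies: every cached chunk stays in memory until cachedChunks is cleared.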

Node.js Streams: Is there a way to convert or wrap a fs write stream to a Transform stream?

With a node http server I'm trying to pipe the request read stream to the response write stream with some intermediary transforms, one of which is a file system write.
The pipeline looks like this with non pertinent code removed for simplicity:
function handler (req, res) {
req.pipe(jsonParse())
.pipe(addTimeStamp())
.pipe(jsonStringify())
.pipe(saveToFs('saved.json'))
.pipe(res);
}
The custom Transform streams are pretty straightforward, but I have no elegant way of writing saveToFs. It looks like this:
function saveToFs (filename) {
const write$ = fs.createWriteStream(filename);
write$.on('open', () => console.log('opened'));
write$.on('close', () => console.log('closed'));
const T = new Transform();
T._transform = function (chunk, encoding, cb) {
write$.write(chunk);
cb(null, chunk);
}
return T;
}
The idea is simply to pipe the data to the write stream and then through to the response stream, but fs.createWriteStream(<file.name>) is only a writable stream, so it makes this approach difficult.
Right now this code has two problems that I can see: the write stream never fires a close event (memory leak?), and I would like the data to pass through the file system write before returning data to the response stream instead of essentially multicasting to two sinks.
Any suggestions, or pointing out fundamental things I've missed would be greatly appreciated.
What you should do is save the stream returned by the .pipe before saveToFs, and then pipe that to a file and res.
function handler(req, res) {
const transformed = req.pipe(jsonParse())
.pipe(addTimeStamp())
.pipe(jsonStringify());
transformed.pipe(fs.createWriteStream('saved.json'));
transformed.pipe(res);
}
To sum it up, you can pipe the same readable stream (transformed) to multiple writable streams.
And I would like the data to pass through the file system write
before returning data to the response stream instead of essentially
multicasting to two sinks.
Use { end: false } option when piping to res.
transformed.pipe(res, { end: false });
And then call res.end() when the file is written or whenever you want.
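Putting the two suggestions together, a minimal sketch might look like the following. It pipes one readable to a file and to a second writable with { end: false }, and ends the second writable only after the file stream has emitted 'finish'. teeToFile is a hypothetical helper name, and sink stands in for res.

```javascript
const fs = require('fs');
const { Readable, PassThrough } = require('stream');

// Pipe `source` into a file and into `sink`; end `sink` only after the
// file stream has flushed everything and emitted 'finish'.
function teeToFile(source, filename, sink) {
  return new Promise((resolve, reject) => {
    const file = fs.createWriteStream(filename);
    source.on('error', reject);
    file.on('error', reject);
    file.on('finish', () => {
      sink.end(); // the file is written, so the response can be completed
      resolve();
    });
    source.pipe(file);
    source.pipe(sink, { end: false }); // keep the sink open after the source ends
  });
}
```

In the handler from the question this would be called as teeToFile(transformed, 'saved.json', res). The data still reaches both destinations as it flows; only the end of the response is delayed until the file is complete.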

Node CSV pull parser

I need to parse a CSV document from Node.JS, performing database operations for each record (= each line). However, I'm having trouble finding a suitable CSV parser using a pull approach, or at least a push approach that waits for my record operations before parsing the next row.
I've looked at csv-parse, csvtojson, csv-streamify, but they all seem to push events in a continuous stream without any flow control. If parsing a 1000 line CSV document, I basically get all 1000 callbacks in quick sequence. For each record, I perform an operation returning a promise. Currently I've had to resort to pushing all my promises into an array and after getting the done/end event I also wait for Promise.all(myOperations) to know when the document has been fully processed. But this is not very nice, and also, I'd prefer parsing one line at a time and fully processing it, before getting the next record, instead of concurrently processing all records - it's hard to debug and uses a lot of memory as opposed to simply dealing with each record sequentially.
So, is there a CSV parser that supports pull mode, or a way to get any stream-based CSV parser (preferably csvtojson as that's the one I'm using at the moment) to only produce events for new records when my handler for the previous record is finished (using promises)?
I solved this myself by creating my own Writable and piping the CSV parser to it. My write method does its stuff and wraps a promise to the node callback passed to _write() (here implemented using Q.nodeify):
class CsvConsumer extends stream.Writable {
_write(data, encoding, cb) {
console.log('Got data: ', data);
Q.delay(1000).then(() => {
console.log('Waited 1 s');
}).nodeify(cb);
}
}
csvtojson()
.fromStream(is)
.pipe(new CsvConsumer())
.on('finish', err => {
if (err) {
console.log('Error!');
} else {
console.log('Done!');
}
});
This will process lines one by one:
Got data: {"a": "1"}
Waited 1 s
Got data: {"a": "2"}
Waited 1 s
Got data: {"a": "3"}
Waited 1 s
Done!
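On current Node versions, a similar one-at-a-time flow can be sketched without a custom Writable or Q, because readable streams are async iterable: the loop requests the next record only after the loop body has finished. This is a minimal sketch under that assumption; processSequentially and handleRecord are hypothetical names, and any object-mode readable (such as the parser's output) could take the place of records.

```javascript
const { Readable } = require('stream');

// Consume records one by one, awaiting the handler before pulling the next.
async function processSequentially(records, handleRecord) {
  const log = [];
  for await (const record of records) {
    log.push(`got ${JSON.stringify(record)}`);
    await handleRecord(record); // the next record is not read until this settles
    log.push('handled');
  }
  return log;
}

// Usage with an in-memory stream standing in for the CSV parser output.
const demo = processSequentially(
  Readable.from([{ a: '1' }, { a: '2' }]),
  () => new Promise((resolve) => setTimeout(resolve, 5))
);
```

An error thrown by the handler rejects the returned promise and destroys the stream, so there is no separate collection of promises to wait on at the end.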
If you want to process each line asynchronously, you can do that with Node's built-in readline module.
const fs = require('fs');
const readline = require('readline');
const lineStream = readline.createInterface({
input: fs.createReadStream('data/test.csv'),
});
lineStream.on('line', (eachLine) =>{
//process each line
});
If you want to process the lines one at a time instead, you can use line-by-line. It doesn't buffer the entire file into memory, and it lets you pause and resume emission of the 'line' event.
const LineByLineReader = require('line-by-line');
const lr = new LineByLineReader('data/test.csv');
lr.on('line', function (line) {
// pause emitting of lines...
lr.pause();
// ...do your asynchronous line processing..
setTimeout(function () {
// ...and continue emitting lines. (1 sec delay)
lr.resume();
}, 1000);
});

How to close a readable stream (before end)?

How to close a readable stream in Node.js?
var input = fs.createReadStream('lines.txt');
input.on('data', function(data) {
// after closing the stream, this will not
// be called again
if (gotFirstLine) {
// close this stream and continue the
// instructions from this if
console.log("Closed.");
}
});
This would be better than:
input.on('data', function(data) {
if (isEnded) { return; }
if (gotFirstLine) {
isEnded = true;
console.log("Closed.");
}
});
But this would not stop the reading process...
Edit: Good news! Starting with Node.js 8.0.0 readable.destroy is officially available: https://nodejs.org/api/stream.html#stream_readable_destroy_error
ReadStream.destroy
You can call the ReadStream.destroy function at any time.
var fs = require("fs");
var readStream = fs.createReadStream("lines.txt");
readStream
.on("data", function (chunk) {
console.log(chunk);
readStream.destroy();
})
.on("end", function () {
// This may not be called since we are destroying the stream
// the first time "data" event is received
console.log("All the data in the file has been read");
})
.on("close", function (err) {
console.log("Stream has been destroyed and file has been closed");
});
The public function ReadStream.destroy is not documented (Node.js v0.12.2) but you can have a look at the source code on GitHub (Oct 5, 2012 commit).
The destroy function internally marks the ReadStream instance as destroyed and calls the close function to release the file.
You can listen to the close event to know exactly when the file is closed. The end event will not fire unless the data is completely consumed.
Note that the destroy (and the close) functions are specific to fs.ReadStream. They are not part of the generic stream.Readable "interface".
Invoke input.close(). It's not in the docs, but
https://github.com/joyent/node/blob/cfcb1de130867197cbc9c6012b7e84e08e53d032/lib/fs.js#L1597-L1620
clearly does the job :) It actually does something similar to your isEnded.
EDIT 2015-Apr-19 Based on comments below, and to clarify and update:
This suggestion is a hack, and is not documented.
Though, looking at the current lib/fs.js, it still works more than 1.5 years later.
I agree with the comment below about calling destroy() being preferable.
As correctly stated below, this works for fs ReadStreams, not for a generic Readable.
As for a generic solution: it doesn't appear as if there is one, at least from my understanding of the documentation and from a quick look at _stream_readable.js.
My proposal would be to put your readable stream in paused mode, at least preventing further processing in your upstream data source. Don't forget to unpipe() and remove all data event listeners so that pause() actually pauses, as mentioned in the docs.
Today, in Node 10
readableStream.destroy()
is the official way to close a readable stream
see https://nodejs.org/api/stream.html#stream_readable_destroy_error
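A minimal sketch of this with a generic Readable (not only fs.ReadStream) follows. It takes the first chunk, calls destroy(), and reports whether 'end' fired before 'close'. readFirstChunk is a hypothetical helper name.

```javascript
const { Readable } = require('stream');

// Resolve with the first chunk and whether 'end' was emitted before 'close'.
function readFirstChunk(readable) {
  return new Promise((resolve, reject) => {
    let first;
    let ended = false;
    readable.on('data', (chunk) => {
      if (first !== undefined) return; // ignore anything after the first chunk
      first = chunk;
      readable.destroy(); // stop reading and release underlying resources
    });
    readable.on('end', () => { ended = true; });
    readable.on('error', reject);
    readable.on('close', () => resolve({ first, ended }));
  });
}

// Usage with an in-memory stream of three lines.
const result = readFirstChunk(Readable.from(['line 1', 'line 2', 'line 3']));
```

Since the stream is destroyed before its data is fully consumed, 'close' is emitted but 'end' is not, so cleanup logic belongs in the 'close' handler.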
You can't. There is no documented way to close/shutdown/abort/destroy a generic Readable stream as of Node 5.3.0. This is a limitation of the Node stream architecture.
As other answers here have explained, there are undocumented hacks for specific implementations of Readable provided by Node, such as fs.ReadStream. These are not generic solutions for any Readable though.
If someone can prove me wrong here, please do. I would like to be able to do what I'm saying is impossible, and would be delighted to be corrected.
EDIT: Here was my workaround: implement .destroy() for my pipeline through a complex series of unpipe() calls. And after all that complexity, it doesn't work properly in all cases.
EDIT: Node v8.0.0 added a destroy() api for Readable streams.
As of version 4.*.*, pushing a null value into the stream will trigger an EOF signal.
From the nodejs docs
If a value other than null is passed, The push() method adds a chunk of data into the queue for subsequent stream processors to consume. If null is passed, it signals the end of the stream (EOF), after which no more data can be written.
This worked for me after trying numerous other options on this page.
This destroy module is meant to ensure a stream gets destroyed, handling different APIs and Node.js bugs. Right now it is one of the best choices.
NB. From Node 10 you can use the .destroy method without further dependencies.
You can clear and close the stream with yourstream.resume(), which will dump everything on the stream and eventually close it.
From the official docs:
readable.resume():
Return: this
This method will cause the readable stream to resume emitting 'data' events.
This method will switch the stream into flowing mode. If you do not want to consume the data from a stream, but you do want to get to its 'end' event, you can call stream.resume() to open the flow of data.
var readable = getReadableStreamSomehow();
readable.resume();
readable.on('end', () => {
console.log('got to the end, but did not read anything');
});
It's an old question but I too was looking for the answer and found the best one for my implementation. Both end and close events get emitted so I think this is the cleanest solution.
This will do the trick in node 4.4.* (stable version at the time of writing):
var input = fs.createReadStream('lines.txt');
input.on('data', function(data) {
if (gotFirstLine) {
this.end(); // Simple isn't it?
console.log("Closed.");
}
});
For a very detailed explanation see:
http://www.bennadel.com/blog/2692-you-have-to-explicitly-end-streams-after-pipes-break-in-node-js.htm
This code here will do the trick nicely:
function closeReadStream(stream) {
if (!stream) return;
if (stream.close) stream.close();
else if (stream.destroy) stream.destroy();
}
writeStream.end() is the go-to way to close a write stream.
To stop callbacks from executing after a certain point,
you can use process.kill with the process ID, which terminates the whole process:
const csv = require('csv-parser');
const fs = require('fs');
const filepath = "./demo.csv"
let readStream = fs.createReadStream(filepath, {
autoClose: true,
});
let MAX_LINE = 0;
readStream.on('error', (e) => {
console.log(e);
console.log("error");
})
.pipe(csv())
.on('data', (row) => {
if (MAX_LINE == 2) {
process.kill(process.pid, 'SIGTERM')
}
// console.log("not 2");
MAX_LINE++
console.log(row);
})
.on('end', () => {
// handle end of CSV
console.log("read done");
}).on("close", function () {
console.log("closed");
})

Resources