I have a 400 MB file split into chunks of ~1 MB each.
Each chunk is a MongoDB document:
{
name: 'stuff.zip',
index: 15,
buffer: Binary('......'),
totalChunks: 400
}
I am fetching each chunk from my database and then streaming it to the client.
Every time I get a chunk from the DB, I push it to the readableStream, which is piped to the client.
Here is the code:
import { Readable } from 'stream'
const name = 'stuff.zip'
const contentType = 'application/zip'
app.get('/api/download-stuff', async (req, res) => {
res.set('Content-Type', contentType)
res.set('Content-Disposition', `attachment; filename=${name}`)
res.attachment(name)
// get `totalChunks` from random chunk
let { totalChunks } = await ChunkModel.findOne({ name }).select('totalChunks')
let index = 0
const readableStream = new Readable({
async read() {
if (index < totalChunks) {
let { buffer } = await ChunkModel.findOne({ name, index }).select('buffer')
let canContinue = readableStream.push(buffer)
console.log(`pushed chunk ${index}/${totalChunks}`)
index++
// sometimes it logs false
// which means I should be waiting before pushing more
// but I don't know how
console.log('canContinue = ', canContinue)
} else {
readableStream.push(null)
readableStream.destroy()
console.log(`all ${totalChunks} chunks streamed to the client`)
}
}
})
readableStream.pipe(res)
})
The code works.
But I'm wondering whether I risk running out of memory on my server, especially when there are many concurrent requests for the same file or the file has many chunks.
Question: My code does not wait for readableStream to finish reading the chunk that was just pushed to it before pushing the next one. I thought it did, and that is why I'm using read() { ... } in this probably wrong way. So how should I wait for each chunk to be pushed, read, streamed to the client, and cleared from my server's memory before I push the next one?
I have created this sandbox in case it helps anyone
In general, when the readable interface is implemented correctly (i.e., the backpressure signal is respected), the readable interface will prevent the code from overflowing the memory regardless of source size.
When implemented according to the API spec, the readable itself does not keep references to data that has finished passing through the stream. The memory requirement of a readable's buffer is adjusted by specifying a highWaterMark.
In this case, the snippet does not conform to the readable interface. It violates the following two concepts:
No data shall be pushed to the readable's buffer unless read() has been called. Currently, this implementation proceeds to push data from DB immediately. Consequently, the readable buffer will start to fill before the sink has begun to consume data.
The readable's push() method returns a boolean flag. When the flag is false, the implementation must wait for read() to be called again before pushing additional data. If the flag is ignored, the buffer will grow beyond the highWaterMark.
Note that ignoring these core criteria of Readables circumvents the backpressure logic.
An alternative implementation, if this is a Mongoose query:
app.get('/api/download-stuff', async (req, res) => {
// ... truncated handler
// A helper variable to relay data from the stream to the response body
const passThrough = new stream.PassThrough({objectMode: false});
// Pipe data using pipeline() to simplify handling stream errors
stream.pipeline(
// Create a cursor that fetch all relevant documents using a single query
ChunkModel.find().limit(chunksLength).select("buffer").sort({index: 1}).lean().cursor(),
// Cherry pick the `buffer` property
new stream.Transform({
objectMode: true,
transform: ({ buffer }, encoding, next) => {
next(null, buffer);
}
}),
// Write the retrieved documents to the helper variable
passThrough,
error => {
if(error){
// Log and handle error. At this point the HTTP headers are probably already sent,
// and it is therefore too late to return HTTP500
}
}
);
// Send the relayed data to the client (in Express, pipe it into the response)
passThrough.pipe(res);
});
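For comparison, the two rules above can also be followed without a cursor by pushing exactly one chunk per read() call. The following is a minimal sketch, not the asker's actual code: fetchChunk is a hypothetical stand-in for the ChunkModel.findOne({ name, index }) lookup, and createChunkStream is a name made up for illustration. It relies on the documented behaviour that read() is not called again until push() has been called, so at most one lookup is in flight at a time.

```javascript
const { Readable } = require('stream');

// Hypothetical stand-in for the database lookup of one chunk.
function fetchChunk(index) {
  return new Promise(resolve =>
    setImmediate(() => resolve(Buffer.from(`chunk-${index};`)))
  );
}

// Builds a Readable that fetches one chunk per read() call.
function createChunkStream(totalChunks, fetch = fetchChunk) {
  let index = 0;
  return new Readable({
    read() {
      if (index >= totalChunks) {
        this.push(null); // signal end of stream
        return;
      }
      const current = index++;
      // Push only in response to read(); the stream calls read() again
      // when its internal buffer has room for more data.
      fetch(current).then(
        chunk => this.push(chunk),
        err => this.destroy(err)
      );
    }
  });
}
```

In the handler this could be used as createChunkStream(totalChunks).pipe(res), or passed to stream.pipeline together with res so that an error tears down both streams.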
I am using the excellent Papa Parse library in nodejs mode, to stream a large (500 MB) CSV file of over 1 million rows, into a slow persistence API, that can only take one request at a time. The persistence API is based on Promises, but from Papa Parse, I receive each parsed CSV row in a synchronous event like so: parseStream.on("data", row => { ... }
The challenge I am facing is that Papa Parse dumps its CSV rows from the stream so fast that my slow persistence API can't keep up. Because Papa is synchronous and my API is Promise-based, I can't just call await doDirtyWork(row) in the on event handler, because sync and async code don't mix.
Or can they mix and I just don't know how?
My question is, can I make Papa's event handler wait for my API call to finish? Kind of doing the persistence API request directly in the on("data") event, making the on() function linger around somehow until the dirty API work is done?
The solution I have so far is not much better than using Papa's non-streaming mode, in terms of memory footprint. I actually need to queue up the torrent of on("data") events, in the form of generator function iterations. I could also have queued up promise factories in an array and worked them off in a loop. Either way, I end up holding almost the entire CSV file in memory as a huge collection of future Promises (promise factories) until my slow API calls have worked all the way through.
async importCSV(filePath) {
let parsedNum = 0, processedNum = 0;
async function* gen() {
let pf = yield;
do {
pf = yield await pf();
} while (typeof pf === "function");
};
var g = gen();
g.next();
await new Promise((resolve, reject) => {
try {
const dataStream = fs.createReadStream(filePath);
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, {delimiter: ",", header: false});
dataStream.pipe(parseStream);
parseStream.on("data", row => {
// Received a CSV row from Papa.parse()
try {
console.log("PA#", parsedNum, ": parsed", row.filter((e, i) => i <= 2 ? e : undefined)
);
parsedNum++;
// Simulate some really slow async/await dirty work here, for example
// send requests to a one-at-a-time persistence API
g.next(() => { // don't execute now, call in sequence via the generator above
return new Promise((res, rej) => {
console.log(
"DW#", processedNum, ": dirty work START",
row.filter((e, i) => i <= 2 ? e : undefined)
);
setTimeout(() => {
console.log(
"DW#", processedNum, ": dirty work STOP ",
row.filter((e, i) => i <= 2 ? e : undefined)
);
processedNum++;
res();
}, 1000)
})
});
} catch (err) {
console.log(err.stack);
reject(err);
}
});
parseStream.on("finish", () => {
console.log(`Parsed ${parsedNum} rows`);
resolve();
});
} catch (err) {
console.log(err.stack);
reject(err);
}
});
while(!(await g.next()).done);
}
So why the rush Papa? Why not allow me to work down the file a bit slower -- the data in the original CSV file isn't gonna run away, we have hours to finish the streaming, why hammer me with on("data") events that I can't seem to slow down?
So what I really need is for Papa to become more of a grandpa, and minimize or eliminate any queuing or buffering of CSV rows. Ideally I would be able to completely sync Papa's parsing events with the speed (or lack thereof) of my API. So if it weren't for the dogma that async code can't make sync code "sleep", I would ideally send each CSV row to the API inside the Papa event, and only then return control to Papa.
Suggestions? Some kind of "loose coupling" of the event handler with the slowness of my async API is fine too. I don't mind if a few hundred rows get queued up. But when tens of thousands pile up, I will run out of heap fast.
Why hammer me with on("data") events that I can't seem to slow down?
You can, you just were not asking papa to stop. You can do this by calling stream.pause(), then later stream.resume() to make use of Node stream's builtin back-pressure.
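A minimal sketch of that pause()/resume() approach follows. processSequentially and worker are hypothetical names chosen for illustration; worker stands for any function that returns a promise, such as a call to the slow persistence API. The 'end' handler waits for the last call, because 'end' may be emitted while that call is still in flight.

```javascript
const { Readable } = require('stream');

// Runs `worker` for each item of `stream`, one at a time.
function processSequentially(stream, worker) {
  return new Promise((resolve, reject) => {
    let inFlight = Promise.resolve();
    stream.on('data', item => {
      stream.pause(); // stop further 'data' events while we work
      inFlight = worker(item).then(() => stream.resume());
      inFlight.catch(err => stream.destroy(err)); // surfaces as 'error'
    });
    // Resolve only once the last worker call has settled.
    stream.on('end', () => inFlight.then(resolve, reject));
    stream.on('error', reject);
  });
}
```

With the code from the question, this would be called as processSequentially(parseStream, row => doDirtyWork(row)).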
However, there's a much nicer API to use than dealing with this on your own in callback-based code: use the stream as an async iterator! When you await in the body of a for await loop, the generator has to pause as well. So you can write
async importCSV(filePath) {
let parsedNum = 0;
const dataStream = fs.createReadStream(filePath);
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, {delimiter: ",", header: false});
dataStream.pipe(parseStream);
for await (const row of parseStream) {
// Received a CSV row from Papa.parse()
const data = row.filter((e, i) => i <= 2 ? e : undefined);
console.log("PA#", parsedNum, ": parsed", data);
parsedNum++;
await dirtyWork(data);
}
console.log(`Parsed ${parsedNum} rows`);
}
importCSV('sample.csv').catch(console.error);
let processedNum = 0;
function dirtyWork(data) {
// Simulate some really slow async/await dirty work here,
// for example send requests to a one-at-a-time persistence API
return new Promise((res, rej) => {
console.log("DW#", processedNum, ": dirty work START", data)
setTimeout(() => {
console.log("DW#", processedNum, ": dirty work STOP ", data);
processedNum++;
res();
}, 1000);
});
}
Async code in JavaScript can sometimes be a little hard to grok. It's important to remember how Node handles concurrency.
The node process is single-threaded, but it uses a concept called an event loop. The consequence of this is that async code and callbacks are essentially equivalent representations of the same thing.
Of course, you need an async function to use await, but your callback from Papa Parse can be an async function:
parse.on("data", async row => {
await sync(row)
})
Once the await operation completes, the arrow function ends, and all references to row will be eliminated, so the garbage collector can successfully collect row, releasing that memory.
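Note that the stream does not wait for the promise returned by the handler, so the effect is that sync executes concurrently for every parsed row. If you can only sync one record at a time, I would recommend wrapping the sync function so that its calls are queued and run one after another.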
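One way to make the calls run strictly one after another is to chain them on a single promise. This is a minimal sketch; serialize is a hypothetical helper, not part of Papa Parse or Node. It only orders the calls: rows still wait in memory until their turn comes, so it is not a substitute for pausing the stream.

```javascript
// Wraps an async function so that calls run one at a time, in call order.
function serialize(fn) {
  let tail = Promise.resolve();
  return (...args) => {
    // Start this call only after the previous one has settled.
    const result = tail.then(() => fn(...args));
    // Keep the chain alive even if this call rejects.
    tail = result.catch(() => {});
    return result;
  };
}
```

It would be used as const syncOneAtATime = serialize(sync), with syncOneAtATime(row) called from the "data" handler.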
What's the correct way to handle errors with streams? I already know there's an 'error' event you can listen on, but I want to know some more details about arbitrarily complicated situations.
For starters, what do you do when you want to do a simple pipe chain:
input.pipe(transformA).pipe(transformB).pipe(transformC)...
And how do you properly create one of those transforms so that errors are handled correctly?
More related questions:
when an error happens, what happens to the 'end' event? Does it never get fired? Does it sometimes get fired? Does it depend on the transform/stream? What are the standards here?
are there any mechanisms for propagating errors through the pipes?
do domains solve this problem effectively? Examples would be nice.
do errors that come out of 'error' events have stack traces? Sometimes? Never? is there a way to get one from them?
transform
Transform streams are both readable and writeable, and thus are really good 'middle' streams. For this reason, they are sometimes referred to as through streams. They are similar to a duplex stream in this way, except they provide a nice interface to manipulate the data rather than just sending it through. The purpose of a transform stream is to manipulate the data as it is piped through the stream. You may want to do some async calls, for example, or derive a couple of fields, remap some things, etc.
For how to create a transform stream see here and here. All you have to do is:
include the stream module
instantiate (or inherit from) the Transform class
implement a _transform method which takes (chunk, encoding, callback).
The chunk is your data. Most of the time you won't need to worry about encoding if you are working in objectMode = true. The callback is called when you are done processing the chunk. This chunk is then pushed on to the next stream.
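The three steps above can be sketched as follows. This is an illustrative example only: createFullNameTransform and the first, last, and fullName fields are made up, and the transform is passed as a constructor option, which is equivalent to implementing _transform.

```javascript
const { Transform } = require('stream');

// Hypothetical example: derive a `fullName` field on each record.
function createFullNameTransform() {
  return new Transform({
    objectMode: true, // chunks are plain objects, so encoding is irrelevant
    transform(record, encoding, callback) {
      const fullName = `${record.first} ${record.last}`;
      // First argument is an error (or null); second is the output chunk.
      callback(null, { ...record, fullName });
    }
  });
}
```

Passing a value as the second callback argument is shorthand for calling this.push(value) followed by callback().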
If you want a helper module that makes writing through streams very easy, I suggest through2.
For error handling, keep reading.
pipe
In a pipe chain, handling errors is indeed non-trivial. According to this thread .pipe() is not built to forward errors. So something like ...
var a = createStream();
a.pipe(b).pipe(c).on('error', function(e){handleError(e)});
... would only listen for errors on the stream c. If an error event was emitted on a, that would not be passed down and, in fact, would throw. To do this correctly:
var a = createStream();
a.on('error', function(e){handleError(e)})
.pipe(b)
.on('error', function(e){handleError(e)})
.pipe(c)
.on('error', function(e){handleError(e)});
Now, though the second way is more verbose, you can at least keep the context of where your errors happen. This is usually a good thing.
If you only want to capture errors at the destination and don't care much about where they happened, one library I find helpful is event-stream.
end
When an error event is fired, the end event will not be fired (explicitly). The emitting of an error event will end the stream.
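A small way to check this behaviour yourself is to record which events a failing stream emits. This is a sketch under the assumption of a reasonably recent Node.js version with the default stream options; observeFailingStream is a made-up name.

```javascript
const { Readable } = require('stream');

// Resolves with the names of the events emitted by a stream that fails
// on its first read.
function observeFailingStream() {
  return new Promise(resolve => {
    const events = [];
    const failing = new Readable({
      read() {
        this.destroy(new Error('boom')); // fail instead of producing data
      }
    });
    failing.on('end', () => events.push('end'));
    failing.on('error', () => events.push('error'));
    failing.on('close', () => {
      events.push('close');
      resolve(events);
    });
    failing.resume(); // start flowing so that read() gets called
  });
}
```

The resolved list should contain 'error' and 'close' but no 'end', so cleanup code is better attached to 'close' (or to stream.finished) than to 'end'.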
domains
In my experience, domains work really well most of the time. If you have an unhandled error event (i.e. emitting an error on a stream without a listener), the server can crash. Now, as the above article points out, you can wrap the stream in a domain which should properly catch all errors.
var d = domain.create();
d.on('error', handleAllErrors);
d.run(function() {
fs.createReadStream(tarball)
.pipe(gzip.Gunzip())
.pipe(tar.Extract({ path: targetPath }))
.on('close', cb);
});
The above code sample is from this post.
The beauty of domains is that they will preserve the stack traces. Though event-stream does a good job of this as well.
For further reading, check out the stream-handbook. Pretty in depth, but super useful and gives some great links to lots of helpful modules.
If you are using node >= v10.0.0 you can use stream.pipeline and stream.finished.
For example:
const { pipeline, finished } = require('stream');
pipeline(
input,
transformA,
transformB,
transformC,
(err) => {
if (err) {
console.error('Pipeline failed', err);
} else {
console.log('Pipeline succeeded');
}
});
finished(input, (err) => {
if (err) {
console.error('Stream failed', err);
} else {
console.log('Stream is done reading');
}
});
See this GitHub PR for more discussion.
Domains are deprecated; you don't need them.
For this question, the distinction between transform and writable streams is not so important.
mshell_lauren's answer is great, but as an alternative you can also explicitly listen for the error event on each stream you think might error, and reuse the handler function if you prefer.
var a = createReadableStream()
var b = anotherTypeOfStream()
var c = createWriteStream()
a.on('error', handler)
b.on('error', handler)
c.on('error', handler)
a.pipe(b).pipe(c)
function handler (err) { console.log(err) }
Doing so prevents the infamous uncaught exception should one of those streams fire its error event.
Errors from the whole chain can be propagated to the rightmost stream using a simple function:
function safePipe (readable, transforms) {
while (transforms.length > 0) {
var new_readable = transforms.shift();
readable.on("error", function(e) { new_readable.emit("error", e); });
readable.pipe(new_readable);
readable = new_readable;
}
return readable;
}
which can be used like:
safePipe(readable, [ transform1, transform2, ... ]);
.on("error", handler) only takes care of stream errors. If you are using custom Transform streams, .on("error", handler) doesn't catch errors thrown inside the _transform function. So one can do something like this to control application flow:
The this keyword in the _transform function refers to the stream itself, which is an EventEmitter. So you can use try/catch as shown below to catch the errors and then pass them on to custom event handlers.
// CustomTransform.js
CustomTransformStream.prototype._transform = function (data, enc, done) {
var stream = this
try {
// Do your transform code
} catch (e) {
// Now based on the error type, with an if or switch statement
stream.emit("CTError1", e)
stream.emit("CTError2", e)
}
done()
}
// StreamImplementation.js
someReadStream
.pipe(CustomTransformStream)
.on("CTError1", function (e) { console.log(e) })
.on("CTError2", function (e) { /*Lets do something else*/ })
.pipe(someWriteStream)
This way, you can keep your logic and error handlers separate. Also, you can opt to handle only some errors and ignore others.
UPDATE
Alternative: RXJS Observable
Use the multipipe package to combine several streams into one duplex stream and handle errors in one place.
const pipe = require('multipipe')
// pipe streams
const stream = pipe(streamA, streamB, streamC)
// centralized error handling
stream.on('error', fn)
Use the Node.js pattern of creating a Transform stream and calling its callback done with an argument in order to propagate the error:
var stream = require('stream');
var transformStream1 = new stream.Transform(/*{objectMode: true}*/);
// Assign _transform on the instance itself; an instance has no `prototype` property
transformStream1._transform = function (chunk, encoding, done) {
//var stream = this;
try {
// Do your transform code
/* ... */
} catch (error) {
// nodejs style for propagating an error
return done(error);
}
// Here, everything went well
done();
}
// Let's use the transform stream, assuming `someReadStream`
// and `someWriteStream` have been defined before
someReadStream
.pipe(transformStream1)
.on('error', function (error) {
console.error('Error in transformStream1:');
console.error(error);
process.exit(-1);
})
.pipe(someWriteStream)
.on('close', function () {
console.log('OK.');
process.exit();
})
.on('error', function (error) {
console.error(error);
process.exit(-1);
});
const http = require('http');
const fs = require('fs');
const server = http.createServer();
server.on('request',(req,res)=>{
const readableStream = fs.createReadStream(__dirname+'/README.md');
const writeableStream = fs.createWriteStream(__dirname+'/assets/test.txt');
readableStream
.on('error',()=>{
res.end("File not found")
})
.pipe(writeableStream)
.on('error',(error)=>{
console.log(error)
res.end("Something went wrong!")
})
.on('finish',()=>{
res.end("Done!")
})
})
server.listen(8000,()=>{
console.log("Server is running on port 8000")
})
try/catch won't capture errors that occur in the stream, because they are raised after the calling code has already exited. You can refer to the documentation:
https://nodejs.org/dist/latest-v10.x/docs/api/errors.html
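To see why try/catch does not help here, the following sketch opens a file that does not exist. The path is generated only for this demonstration and readMissingFile is a made-up name. fs.createReadStream returns without throwing, and the failure arrives later through the 'error' event.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolves with how the failure was observed when reading a missing file.
function readMissingFile() {
  const missing = path.join(
    os.tmpdir(),
    `no-such-file-${process.pid}-${Date.now()}.txt`
  );
  return new Promise(resolve => {
    try {
      fs.createReadStream(missing)
        // The open failure is reported here, after the try block has exited.
        .on('error', err => resolve({ caughtByTryCatch: false, code: err.code }))
        .on('data', () => {});
    } catch (err) {
      // Not expected to run: createReadStream does not throw for a missing file.
      resolve({ caughtByTryCatch: true, code: err.code });
    }
  });
}
```

The promise resolves with caughtByTryCatch set to false and the code ENOENT, which is why each stream needs its own 'error' listener (or a helper such as stream.pipeline).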
How to close a readable stream in Node.js?
var input = fs.createReadStream('lines.txt');
input.on('data', function(data) {
// after closing the stream, this will not
// be called again
if (gotFirstLine) {
// close this stream and continue the
// instructions from this if
console.log("Closed.");
}
});
This would be better than:
input.on('data', function(data) {
if (isEnded) { return; }
if (gotFirstLine) {
isEnded = true;
console.log("Closed.");
}
});
But this would not stop the reading process...
Edit: Good news! Starting with Node.js 8.0.0 readable.destroy is officially available: https://nodejs.org/api/stream.html#stream_readable_destroy_error
ReadStream.destroy
You can call the ReadStream.destroy function at any time.
var fs = require("fs");
var readStream = fs.createReadStream("lines.txt");
readStream
.on("data", function (chunk) {
console.log(chunk);
readStream.destroy();
})
.on("end", function () {
// This may not be called since we are destroying the stream
// the first time "data" event is received
console.log("All the data in the file has been read");
})
.on("close", function (err) {
console.log("Stream has been destroyed and file has been closed");
});
The public function ReadStream.destroy is not documented (Node.js v0.12.2) but you can have a look at the source code on GitHub (Oct 5, 2012 commit).
The destroy function internally marks the ReadStream instance as destroyed and calls the close function to release the file.
You can listen to the close event to know exactly when the file is closed. The end event will not fire unless the data is completely consumed.
Note that the destroy (and the close) functions are specific to fs.ReadStream. They are not part of the generic stream.Readable "interface".
Invoke input.close(). It's not in the docs, but
https://github.com/joyent/node/blob/cfcb1de130867197cbc9c6012b7e84e08e53d032/lib/fs.js#L1597-L1620
clearly does the job :) It actually does something similar to your isEnded.
EDIT 2015-Apr-19 Based on comments below, and to clarify and update:
This suggestion is a hack, and is not documented.
Though, looking at the current lib/fs.js, it still works more than 1.5 years later.
I agree with the comment below about calling destroy() being preferable.
As correctly stated below, this works for fs ReadStreams, not for a generic Readable.
As for a generic solution: it doesn't appear as if there is one, at least from my understanding of the documentation and from a quick look at _stream_readable.js.
My proposal would be to put your readable stream in paused mode, at least preventing further processing in your upstream data source. Don't forget to unpipe() and remove all data event listeners so that pause() actually pauses, as mentioned in the docs.
Today, in Node 10
readableStream.destroy()
is the official way to close a readable stream
see https://nodejs.org/api/stream.html#stream_readable_destroy_error
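As a sketch of how destroy() can be used on a generic Readable rather than only on fs.ReadStream, the helper below takes the first chunk and then closes the stream. takeFirstChunk is a hypothetical name, and the example assumes the stream emits 'close' after being destroyed, which is the default for streams created by Node's own stream classes.

```javascript
const { Readable } = require('stream');

// Resolves with the first chunk of `stream`, then destroys the stream.
function takeFirstChunk(stream) {
  return new Promise((resolve, reject) => {
    let first;
    stream.once('data', chunk => {
      first = chunk;
      stream.destroy(); // stop reading and release underlying resources
    });
    // 'close' fires once the stream has been torn down.
    stream.once('close', () => resolve(first));
    stream.once('error', reject);
  });
}
```

For the question's case this would be called as takeFirstChunk(fs.createReadStream('lines.txt')); any stream built on Readable should behave the same way.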
You can't. There is no documented way to close/shutdown/abort/destroy a generic Readable stream as of Node 5.3.0. This is a limitation of the Node stream architecture.
As other answers here have explained, there are undocumented hacks for specific implementations of Readable provided by Node, such as fs.ReadStream. These are not generic solutions for any Readable though.
If someone can prove me wrong here, please do. I would like to be able to do what I'm saying is impossible, and would be delighted to be corrected.
EDIT: Here was my workaround: implement .destroy() for my pipeline through a complex series of unpipe() calls. And after all that complexity, it doesn't work properly in all cases.
EDIT: Node v8.0.0 added a destroy() API for Readable streams.
As of version 4.*.*, pushing a null value into the stream will trigger an EOF signal.
From the nodejs docs
If a value other than null is passed, The push() method adds a chunk of data into the queue for subsequent stream processors to consume. If null is passed, it signals the end of the stream (EOF), after which no more data can be written.
This worked for me after trying numerous other options on this page.
The destroy module is meant to ensure a stream gets destroyed, handling different APIs and Node.js bugs. Right now it is one of the best choices.
NB. From Node 10 you can use the .destroy method without further dependencies.
You can clear and close the stream with yourstream.resume(), which will dump everything on the stream and eventually close it.
From the official docs:
readable.resume():
Return: this
This method will cause the readable stream to resume emitting 'data' events.
This method will switch the stream into flowing mode. If you do not want to consume the data from a stream, but you do want to get to its 'end' event, you can call stream.resume() to open the flow of data.
var readable = getReadableStreamSomehow();
readable.resume();
readable.on('end', () => {
console.log('got to the end, but did not read anything');
});
It's an old question but I too was looking for the answer and found the best one for my implementation. Both end and close events get emitted so I think this is the cleanest solution.
This will do the trick in node 4.4.* (stable version at the time of writing):
var input = fs.createReadStream('lines.txt');
input.on('data', function(data) {
if (gotFirstLine) {
this.end(); // Simple isn't it?
console.log("Closed.");
}
});
For a very detailed explanation see:
http://www.bennadel.com/blog/2692-you-have-to-explicitly-end-streams-after-pipes-break-in-node-js.htm
This code here will do the trick nicely:
function closeReadStream(stream) {
if (!stream) return;
if (stream.close) stream.close();
else if (stream.destroy) stream.destroy();
}
writeStream.end() is the go-to way to close a writeStream...
To stop callback execution after a certain call, you can use process.kill with the current process ID. Note that this terminates the whole Node.js process, not just the stream:
const csv = require('csv-parser');
const fs = require('fs');
const filepath = "./demo.csv"
let readStream = fs.createReadStream(filepath, {
autoClose: true,
});
let MAX_LINE = 0;
readStream.on('error', (e) => {
console.log(e);
console.log("error");
})
.pipe(csv())
.on('data', (row) => {
if (MAX_LINE == 2) {
process.kill(process.pid, 'SIGTERM')
}
// console.log("not 2");
MAX_LINE++
console.log(row);
})
.on('end', () => {
// handle end of CSV
console.log("read done");
}).on("close", function () {
console.log("closed");
})