Node Unzipper - how to know it's finished - node.js

I'm trying to use the unzipper node module to extract and process a number of files (exact number is unknown). However, I can't seem to figure out how to know when all the files are processed. So far, my code looks like this:
s3.getObject(params).createReadStream()
.pipe(unzipper.Parse())
.on('entry', async (entry) => {
var fileName = entry.path;
if (fileName.match(someRegex)) {
await processEntry(entry);
console.log("Uploaded");
} else {
entry.autodrain();
console.log("Drained");
}
});
I'm trying to figure out how to know that unzipper has gone through all the files (i.e., no more entry events are forthcoming) and all the entry handlers have finished so that I know I've finished processing all the files I care about.
I've tried experimenting with the close and finish events but when I have, they both trigger before console.log("Uploaded"); has printed, so that doesn't seem right.
Help?

Directly from the docs:
The parser emits finish and error events like any other stream. The parser additionally provides a promise wrapper around those two events to allow easy folding into existing Promise-based structures.
Example:
fs.createReadStream('path/to/archive.zip')
.pipe(unzipper.Parse())
.on('entry', entry => entry.autodrain())
.promise()
.then( () => console.log('done'), e => console.log('error',e));

Related

ReadableStream's 'data' event isn't emitted if stream is not piped beforehand

I'm using the promise-ftp Node.js library to download a file from an FTP server. Function ftpClient.get(...) returns a Promise of ReadableStream, data of which should be accessible using data and end events, however, I encounter a weird behaviour that is not described in the documentation.
//Some code that returns a to-be-downloaded file's name
.then((fileName) => {
return ftpClient.get(fileName)
})
.then((fileStream) => {
let chunks = [];
return new Promise((resolve, reject) => {
fileStream.on('data', (chunk) => {
chunks.push(chunk);
});
fileStream.on('end', () => {
resolve(Buffer.concat(chunks).toString('utf8'));
});
});
});
This is a solution on how to read a stream into a variable that I derived from answers to this question. However, it doesn't work: the data and end events are just never emitted.
Comparing my code to this example in library documentation, the only difference is that the ReadableStream is piped somewhere beforehand.
Now, if I add fileStream.pipe(...),
//...
.then((fileStream) => {
fileStream.pipe(fs.createWriteStream('foo.txt'));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
let chunks = [];
return new Promise((resolve, reject) => {
//...
everything works perfectly.
So my question is: what exactly is the purpose of adding a .pipe() and how is it connected to data and end events?
Pipe is the correct way to read data from a stream. The idea comes from Linux Pipes. Something like:
fileStream | stdout
Here fileStream is being pipe'd to stdout
on is to register an event listener i.e. if some part of the program has the pipe set up, you can listen in to the same.
Not streaming any data when no one is reading it (i.e. its not being piped to anything) is an obvious optimisation (it could have other constraints too).
Yes I understand the mentioned answer can be misleading. Office node docs on streams is better. And there is a new upward trending better answer.

Node CSV pull parser

I need to parse a CSV document from Node.JS, performing database operations for each record (= each line). However, I'm having trouble finding a suitable CSV parser using a pull approach, or at least a push approach that waits for my record operations before parsing the next row.
I've looked at csv-parse, csvtojson, csv-streamify, but they all seem to push events in a continuous stream without any flow control. If parsing a 1000 line CSV document, I basically get all 1000 callbacks in quick sequence. For each record, I perform an operation returning a promise. Currently I've had to resort to pushing all my promises into an array and after getting the done/end event I also wait for Promise.all(myOperations) to know when the document has been fully processed. But this is not very nice, and also, I'd prefer parsing one line at a time and fully processing it, before getting the next record, instead of concurrently processing all records - it's hard to debug and uses a lot of memory as opposed to simply dealing with each record sequentially.
So, is there a CSV parser that supports pull mode, or a way to get any stream-based CSV parser (preferably csvtojson as that's the one I'm using at the moment) to only produce events for new records when my handler for the previous record is finished (using promises)?
I solved this myself by creating my own Writable and piping the CSV parser to it. My write method does its stuff and wraps a promise to the node callback passed to _write() (here implemented using Q.nodeify):
class CsvConsumer extends stream.Writable {
_write(data, encoding, cb) {
console.log('Got data: ', data);
Q.delay(1000).then(() => {
console.log('Waited 1 s');
}).nodeify(cb);
}
}
csvtojson()
.fromStream(is)
.pipe(new CsvConsumer())
.on('finish', err => {
if (err) {
console.log('Error!');
} else {
console.log('Done!');
}
});
This will process lines one by one:
Got data: {"a": "1"}
Waited 1 s
Got data: {"a": "2"}
Waited 1 s
Got data: {"a": "3"}
Waited 1 s
Done!
If you want to process each line asynchronously you can do that with node's native LineReader.
const lineStream = readline.createInterface({
input: fs.createReadStream('data/test.csv'),
});
lineStream.on('line', (eachLine) =>{
//process each line
});
If you want to do the same in synchronous fashion you can use line-by-line. It doesn't buffer the entire file into memory. It provides event handlers to pause and resume the 'line' emit event.
lr.on('line', function (line) {
// pause emitting of lines...
lr.pause();
// ...do your asynchronous line processing..
setTimeout(function () {
// ...and continue emitting lines. (1 sec delay)
lr.resume();
}, 1000);
});

How to force sequential statement execution in Node?

I am new to Node, so please forgive me if my question is too simple. I fully appreciate the Async paradigm and why it is useful in single threads. But some logical operations are synchronous by nature.
I have found many posts about the async/sync issue, and have been reading for whole two days about callbacks, promises, async/await etc...
But still I cannot figure what should be straight forward and simple thing to do. Am I missing something!
Basically for the code below:
const fs = require('fs');
var readline = require('readline');
function getfile (aprodfile) {
var prodlines = [];
var prodfile = readline.createInterface({input: fs.createReadStream(aprodfile)});
prodfile.on('line', (line) => {prodlines.push(line)});
prodfile.on('close', () => console.log('Loaded '+prodlines.length));
// the above block is repeated several times to read several other
// files, but are omitted here for simplicity
console.log(prodlines.length);
// then 200+ lines that assume prodlines already filled
};
the output I get is:
0
Loaded 11167
whereas the output I expect is:
Loaded 11167
11167
This is because the console.log statement executes before the prodfile.on events are completed.
Is there a nice clean way to tell Node to execute commands sequentially, even if blocking? or better still to tell console.log (and the 200+ lines of code following it) to wait until prodlines is fully populated?
Here's the execution order of what you wrote :
prodfile.on('line', line => { // (1) subscribing to the 'line' event
prodlines.push(line) // (4), whenever the 'line' event is triggered
});
prodfile.on('close', () => { // (2) subscribing to the 'close' event
console.log('Loaded '+prodlines.length) // (5), whenever the 'close' event is triggered
});
console.log(prodlines.length); // (3) --> so it logs 0, nothing has happened yet
What you can do is this :
function getfile (aprodfile) {
var prodlines = [];
var prodfile = readline.createInterface({input: fs.createReadStream(aprodfile)});
prodfile.on('line', line => { prodlines.push(line) });
prodfile.on('close', () => {
console.log('Loaded '+prodlines.length)
finishedFetching( prodlines );
});
};
function finishedFetching( prodlines ) {
console.log(prodlines.length) // 200!
}

Complex sequencing of promises - nested

After a lot of googling I have not been able to confirm the correct approach to this problem. The following code runs as expected but I have a grave feeling that I am not approaching this in the correct way, and I am setting myself up for problems.
The following code is initiated by the main app.js file and is passed a location to start loading XML files from and processing into a mongoDB
exports.processProfiles = function(path) {
var deferrer = q.defer();
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
.then(function(deleteResult) {
return loadFilenames(path); // method to load all filenames in the given path using fs
})
.then(function(filenames) {
// now we have all the file names lets load and save
filenames.forEach(function(filename) {
// Here is where i think the problem is!
// kick off another promise chain for the dynamically sized array of files to process
q(loadFileContent(path, filename)) // first we load the data in the file
.then(function(inboundFile) {
// then parse XML structure to my new shiny JSON structure
// and ask Mongo to store it for me
return dataService.createProfile(processProfileXML(filename, inboundFile));
})
.done(function(result) {
console.log(result);
})
});
})
.catch(function(err) {
deferrer.reject('Unable to Process Profile records : ' + err);
})
.done(function() {
deferrer.resolve('Profile Processing Completed');
});
return deferrer.promise;
}
Whilst this code works these are my main concerns but cannot solve them on my own after a few hours of Google and reading.
1) Is this blocking? The read out to the console is difficult to understand if this is running asynchronously as i want it to - i think it is but advice on if I am doing something fundamentally wrong would be great
2) Is having a nested promise a bad idea, should I be linking it to the outter promise - I have tried but could not get anything to compile or run.
I haven't used Q in a really long time, but I think that you'd need to do is let it know you're about to hand back an array of promises that need to all be satisfied before moving on.
Additionally as you're waiting for multiple promises on one section of code, rather than nesting further, throw the 'set' of promises back up once they're all satisfied.
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
.then(function (deleteResult) {
return loadFilenames(path); // method to load all filenames in the given path using fs
})
.then(function (filenames) {
return q.all(
filenames.map(function (filename) {
return q(loadFileContent(path, filename)) { /* Do stuff with your filenames */ });
})
);
.then(function (resultsOfLoadFileContentsPromises) {
console.log('I did stuff with all the things');
)
.catch(function(err) {});
What you have is not 'blocking'. But really what you're doing with promises is moving things into a new 'block'ing section. The more blocks you have, the more async-ish your code will appear. If nothing else is running apart from this promise, it will still appear procedural.
But inner promises must still resolve before the parent promises resolve thereafter.
Inner promises like what you have aren't an inherently bad, personally I will break them out into seperate files to makes easier to reason about, but I wouldn't define that as 'bad' unless there's no need for that inner promise to exist, however where possible (and in your example here) I've adjusted so I throw back up the next set of promises for a new section to deal with the data after it's gotten it.
(I'm not great with Q though, this code will probably require a little further tweaking).

How to close a readable stream (before end)?

How to close a readable stream in Node.js?
var input = fs.createReadStream('lines.txt');
input.on('data', function(data) {
// after closing the stream, this will not
// be called again
if (gotFirstLine) {
// close this stream and continue the
// instructions from this if
console.log("Closed.");
}
});
This would be better than:
input.on('data', function(data) {
if (isEnded) { return; }
if (gotFirstLine) {
isEnded = true;
console.log("Closed.");
}
});
But this would not stop the reading process...
Edit: Good news! Starting with Node.js 8.0.0 readable.destroy is officially available: https://nodejs.org/api/stream.html#stream_readable_destroy_error
ReadStream.destroy
You can call the ReadStream.destroy function at any time.
var fs = require("fs");
var readStream = fs.createReadStream("lines.txt");
readStream
.on("data", function (chunk) {
console.log(chunk);
readStream.destroy();
})
.on("end", function () {
// This may not been called since we are destroying the stream
// the first time "data" event is received
console.log("All the data in the file has been read");
})
.on("close", function (err) {
console.log("Stream has been destroyed and file has been closed");
});
The public function ReadStream.destroy is not documented (Node.js v0.12.2) but you can have a look at the source code on GitHub (Oct 5, 2012 commit).
The destroy function internally mark the ReadStream instance as destroyed and calls the close function to release the file.
You can listen to the close event to know exactly when the file is closed. The end event will not fire unless the data is completely consumed.
Note that the destroy (and the close) functions are specific to fs.ReadStream. There are not part of the generic stream.readable "interface".
Invoke input.close(). It's not in the docs, but
https://github.com/joyent/node/blob/cfcb1de130867197cbc9c6012b7e84e08e53d032/lib/fs.js#L1597-L1620
clearly does the job :) It actually does something similar to your isEnded.
EDIT 2015-Apr-19 Based on comments below, and to clarify and update:
This suggestion is a hack, and is not documented.
Though for looking at the current lib/fs.js it still works >1.5yrs later.
I agree with the comment below about calling destroy() being preferable.
As correctly stated below this works for fs ReadStreams's, not on a generic Readable
As for a generic solution: it doesn't appear as if there is one, at least from my understanding of the documentation and from a quick look at _stream_readable.js.
My proposal would be put your readable stream in paused mode, at least preventing further processing in your upstream data source. Don't forget to unpipe() and remove all data event listeners so that pause() actually pauses, as mentioned in the docs
Today, in Node 10
readableStream.destroy()
is the official way to close a readable stream
see https://nodejs.org/api/stream.html#stream_readable_destroy_error
You can't. There is no documented way to close/shutdown/abort/destroy a generic Readable stream as of Node 5.3.0. This is a limitation of the Node stream architecture.
As other answers here have explained, there are undocumented hacks for specific implementations of Readable provided by Node, such as fs.ReadStream. These are not generic solutions for any Readable though.
If someone can prove me wrong here, please do. I would like to be able to do what I'm saying is impossible, and would be delighted to be corrected.
EDIT: Here was my workaround: implement .destroy() for my pipeline though a complex series of unpipe() calls. And after all that complexity, it doesn't work properly in all cases.
EDIT: Node v8.0.0 added a destroy() api for Readable streams.
At version 4.*.* pushing a null value into the stream will trigger a EOF signal.
From the nodejs docs
If a value other than null is passed, The push() method adds a chunk of data into the queue for subsequent stream processors to consume. If null is passed, it signals the end of the stream (EOF), after which no more data can be written.
This worked for me after trying numerous other options on this page.
This destroy module is meant to ensure a stream gets destroyed, handling different APIs and Node.js bugs. Right now is one of the best choice.
NB. From Node 10 you can use the .destroy method without further dependencies.
You can clear and close the stream with yourstream.resume(), which will dump everything on the stream and eventually close it.
From the official docs:
readable.resume():
Return: this
This method will cause the readable stream to resume emitting 'data' events.
This method will switch the stream into flowing mode. If you do not want to consume the data from a stream, but you do want to get to its 'end' event, you can call stream.resume() to open the flow of data.
var readable = getReadableStreamSomehow();
readable.resume();
readable.on('end', () => {
console.log('got to the end, but did not read anything');
});
It's an old question but I too was looking for the answer and found the best one for my implementation. Both end and close events get emitted so I think this is the cleanest solution.
This will do the trick in node 4.4.* (stable version at the time of writing):
var input = fs.createReadStream('lines.txt');
input.on('data', function(data) {
if (gotFirstLine) {
this.end(); // Simple isn't it?
console.log("Closed.");
}
});
For a very detailed explanation see:
http://www.bennadel.com/blog/2692-you-have-to-explicitly-end-streams-after-pipes-break-in-node-js.htm
This code here will do the trick nicely:
function closeReadStream(stream) {
if (!stream) return;
if (stream.close) stream.close();
else if (stream.destroy) stream.destroy();
}
writeStream.end() is the go-to way to close a writeStream...
for stop callback execution after some call,
you have to use process.kill with particular processID
const csv = require('csv-parser');
const fs = require('fs');
const filepath = "./demo.csv"
let readStream = fs.createReadStream(filepath, {
autoClose: true,
});
let MAX_LINE = 0;
readStream.on('error', (e) => {
console.log(e);
console.log("error");
})
.pipe(csv())
.on('data', (row) => {
if (MAX_LINE == 2) {
process.kill(process.pid, 'SIGTERM')
}
// console.log("not 2");
MAX_LINE++
console.log(row);
})
.on('end', () => {
// handle end of CSV
console.log("read done");
}).on("close", function () {
console.log("closed");
})

Resources