I'm quite unfamiliar with streaming. Suppose I have an input stream:
let inputStream = fs.createReadStream(getTrgFilePath());
I'm going to pipe this input to some output stream:
inputStream.pipe(someGenericOutputStream);
In my case, getTrgFilePath() may not produce a valid filepath. This will cause a failure which will result in no content being sent to someGenericOutputStream.
How do I set things up so that when inputStream encounters an error, it pipes some default value (e.g. "Invalid filepath!") instead of failing?
Example 1:
If getTrgFilePath() is invalid, and someGenericOutputStream is process.stdout, I want to see stdout say "Invalid filepath!"
Example 2:
If getTrgFilePath() is invalid, and someGenericOutputStream is the result of fs.createWriteStream(outputFilePath), I would expect to find a file at outputFilePath with the contents "Invalid filepath!".
I'm interested in a solution which doesn't need to know what specific kind of stream someGenericOutputStream is.
If you are only worried about the path being invalid, you could first check the path with fs.access, but as I understand it, you don't want additional "handling" code in your file...
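For reference, the fs.access pre-check mentioned above could look roughly like the sketch below. The helper name pipeFileOrDefault and its parameters are hypothetical, not something from the question. Note that the file can still disappear between the check and the read, so this only covers the "path is wrong up front" situations.

```javascript
const fs = require("fs");

// Hypothetical helper (not part of the original answer): pipes the file into
// `out`, or ends `out` with `fallback` when the path is not readable.
function pipeFileOrDefault(filePath, out, fallback = "Invalid filepath!") {
  return new Promise((resolve) => {
    fs.access(filePath, fs.constants.R_OK, (err) => {
      if (err) {
        out.end(fallback); // write the default value and close the output
        resolve(false);
        return;
      }
      fs.createReadStream(filePath).pipe(out);
      resolve(true);
    });
  });
}
```

It works with any writable stream, for example pipeFileOrDefault(getTrgFilePath(), someGenericOutputStream).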
So let's take into account what may go wrong:
File path is not valid,
File or path does not exist,
File is not readable,
File is being read, but something fails partway through.
Now I'm going to leave the 4th case alone; it is a separate problem, so we'll just ignore it here. We need two files (so that your code looks clean and all the mess lives in a separate file). Here's the, let's say, ./lib/create-fs-with-default.js file:
const fs = require("fs");

module.exports = // or export default if you use es6 modules
  function(filepath, def = "Could not read file") {
    // We open the file normally
    const _in = fs.createReadStream(filepath);
    // We'll need a list of targets later on
    let _piped = [];
    // Here's a handler that ends all piped outputs with the default value.
    // If nothing has been piped yet, there is nowhere to send the default, so we rethrow.
    const _handler = (e) => {
      if (!_piped.length) {
        throw e;
      }
      _piped.forEach(
        out => out.end(def)
      );
    };
    _in.once("error", _handler);
    // We keep the original `pipe` method in a variable
    const _orgPipe = _in.pipe;
    // And override it with our alternative version...
    _in.pipe = function(to, ...args) {
      const _out = _orgPipe.call(this, to, ...args);
      // ...which, apart from calling the original, also records the outputs
      _piped.push(_out);
      return _out;
    }
    // Optionally we could handle the `unpipe` method here.
    // Here we remove the handler once data flow is started.
    _in.once("data", () => _in.removeListener("error", _handler));
    // And pause the stream again so that the `data` listener doesn't consume the first chunk.
    _in.pause();
    // Finally we return the read stream
    return _in;
  };
Now it's just a matter of using it:
const createReadStreamWithDefault = require("./lib/create-fs-with-default");
const inputStream = createReadStreamWithDefault(getTrgFilePath(), "Invalid input!");
// ... and at some point
inputStream.pipe(someOutput);
And there you go.
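If overriding pipe feels too invasive, a different way to get a similar effect is to hand the caller a PassThrough stream and end it with the default value when the file stream fails. This is only a sketch of that alternative, not the method above; the name createReadStreamWithFallback is made up for illustration. As before, a failure in the middle of a read is not replaced by the default; here the error is forwarded instead.

```javascript
const fs = require("fs");
const { PassThrough } = require("stream");

// Hypothetical alternative: returns a stream that yields the file contents,
// or `fallback` if the file cannot be read at all.
function createReadStreamWithFallback(filePath, fallback = "Could not read file") {
  const source = fs.createReadStream(filePath);
  const proxy = new PassThrough();
  let bytesSeen = 0;

  // Count what has been read so we can tell "never started" from "failed midway"
  source.on("data", (chunk) => { bytesSeen += chunk.length; });

  source.once("error", (err) => {
    if (bytesSeen === 0) {
      proxy.end(fallback); // nothing was read: emit the default value instead
    } else {
      proxy.destroy(err);  // failed midway: pass the error on
    }
  });

  source.pipe(proxy);
  return proxy;
}
```

Usage stays the same as with a plain read stream: createReadStreamWithFallback(getTrgFilePath(), "Invalid filepath!").pipe(someGenericOutputStream).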
M.
Related
I am trying to figure out how to create a stream pipe which reads entries in a csv file on-demand. To do so, I thought of using the following approach using pipes (pseudocode)
const stream_pipe = input_file_stream.pipe(csv_parser)
// Then getting entries through:
let entry = stream_pipe.read()
Unfortunately, after lots of testing I figured out that the moment I set up the pipe, it is automatically consumed until the end of the csv file. I tried to pause it on creation by appending .pause() at the end, but it seems to have no effect.
Here's my current code. I am using the csv-parse library (part of the bigger csv package):
// Read file stream
const file_stream = fs.createReadStream("filename.csv")
const parser = csvParser({
columns: ['a', 'b'],
on_record: (record) => {
// A simple filter as I am interested only in numeric entries
let a = parseInt(record.a)
let b = parseInt(record.b)
return (isNaN(a) || isNaN(b)) ? undefined : record
}
})
const reader = file_stream.pipe(parser) // Adding .pause() seems to have no effect
console.log(reader.read()) // Prints `null`
// I found out I can use this strategy to read a few entries immediately, but I cannot break out of it and then resume as the stream will automatically be consumed
//for await (const record of reader) {
// console.log(record)
//}
I have been banging my head on this for a while and I could not find an easy solution in either the csv package docs or the official Node documentation.
Thanks in advance to anyone able to put me on the right track :)
One option is to wrap the input stream in a readline interface and process the file line by line, like this:
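const fs = require("fs");
const readline = require("readline");

// Handle one line; resolve with the result or reject if there is nothing to process
function processRecord(line) {
  return new Promise((res, rej) => {
    if (line) {
      // do the processing
      res(line);
    } else {
      rej(new Error('Unable to process record'));
    }
  });
}

async function main() {
  const inputStream = fs.createReadStream("filename.csv");
  const rl = readline.createInterface({ input: inputStream, crlfDelay: Infinity });
  // Read line by line; each record is awaited before the next one is handled
  for await (const line of rl) {
    const res = await processRecord(line);
  }
}

main().catch(console.error);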
The processRecord function receives the lines one at a time, and you can use promises to keep the processing sequential.
Note: the code above is only a sketch to give you the idea. I have been doing the same in my project to read a CSV file line by line, and it works fine.
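To pull entries strictly on demand, as the question asks, one possibility is to take the readline interface's async iterator and call next() on it yourself rather than looping over everything. The sketch below assumes a simple CSV without quoted fields that span lines; createLineReader is a hypothetical name. readline may still read ahead internally, but your code only sees a line when it asks for one.

```javascript
const fs = require("fs");
const readline = require("readline");

// Hypothetical helper: exposes a file as "give me the next line" calls.
function createLineReader(filePath) {
  const input = fs.createReadStream(filePath);
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  const lines = rl[Symbol.asyncIterator]();

  return {
    // Resolves with the next line, or null once the file is exhausted
    async next() {
      const { value, done } = await lines.next();
      return done ? null : value;
    },
    // Stop reading early and release the file handle
    close() {
      rl.close();
      input.destroy();
    },
  };
}
```

A caller would then do something like: const reader = createLineReader("filename.csv"); const first = await reader.next(); and later, whenever it wants more, await reader.next() again.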
I want to pipe data from my readable stream to a writable stream but validate in between.
In my case:
Readable Stream: http response as a stream (Axios.post response as a stream to be more specific)
Writable Stream: AWS S3
The Axios.post response comes in XML format, so the readable stream emits chunks of XML. I convert each chunk to a string and check whether the opening <specificTag> and the closing </specificTag> are present. The two tags may turn up in different, arbitrary chunks.
If both the opening and closing tags are found, I have to pass the data on to the writable stream.
My code looks like this:
let openTagFound: boolean = false;
let closingTagFound: boolean = false;
readableStream.pipe(this.validateStreamData()).pipe(writableStream);
I have also defined a _transform method for validateStreamData(), like this:
private validateStreamData(): Transform {
  let data = '', transformStream = new Transform();
  let openTagFound: boolean = false;
  let closingTagFound: boolean = false;
  try {
    transformStream._transform = function (chunk, _encoding, done) {
      // Keep chunk in memory
      data += chunk.toString();
      if(!openTagFound) {
        // Check whether openTag e.g <specificTag> is found, if yes
        openTagFound = true;
      }
      if(!closingTagFound) {
        // parse the chunk using parser
        // Check whether closingTag e.g </specificTag> is found, if yes
        closingTagFound = true;
      }
      // we are not writing anything out at this
      // time, only at end during _flush
      // so we don't need to call push
      done();
    };
    transformStream._flush = function (done) {
      if(openTagFound && closingTagFound) {
        this.push(data);
      }
      done();
    };
    return transformStream;
  } catch (ex) {
    this.logger.error(ex);
    transformStream.end();
    throw Error(ex);
  }
}
Now, you can see that I am using a variable data at:
// Keep chunk in memory
data += chunk.toString();
I want to get rid of this. I do not want to utilize memory explicitly. The final goal is to get data from Axios.post and transfer it to AWS S3, only if my validation succeeds. If not, then it should not write to S3.
Any help is much appreciated.
Thanks in Advance!!!
So, what I finally did was let the pipe run to completion while keeping some flags that record whether the data is valid or invalid. Then, in the on('end') callback, if the flags say the data is invalid, I explicitly destroy the destination object.
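A sketch of that idea, with one variation: the flags are checked in the transform's flush step and a failure is raised as an error, so stream.pipeline takes care of destroying the destination. createTagValidator and copyIfValid are hypothetical names, and the check only looks for the presence of both tags, not their order. Chunks reach the destination before validation finishes, so whether destroying it really discards a partial S3 upload depends on how the upload is set up; treat that part as an assumption.

```javascript
const { Transform, pipeline } = require("stream");

// Hypothetical helper: forwards every chunk unchanged while tracking whether
// both tags have been seen; no chunk is accumulated in memory.
function createTagValidator(openTag, closeTag) {
  const keep = Math.max(openTag.length, closeTag.length) - 1;
  let tail = ""; // last few characters, so a tag split across two chunks is still found
  let openFound = false;
  let closeFound = false;

  return new Transform({
    transform(chunk, _encoding, done) {
      const text = tail + chunk.toString();
      openFound = openFound || text.includes(openTag);
      closeFound = closeFound || text.includes(closeTag);
      tail = keep > 0 ? text.slice(-keep) : "";
      done(null, chunk); // pass the chunk straight through
    },
    flush(done) {
      // Runs after the last chunk: fail the stream if either tag never appeared
      done(openFound && closeFound ? null : new Error("Validation failed: tag missing"));
    },
  });
}

// On failure, pipeline destroys every stream in the chain, destination included.
function copyIfValid(readable, writable, callback) {
  pipeline(
    readable,
    createTagValidator("<specificTag>", "</specificTag>"),
    writable,
    callback
  );
}
```

The only state kept between chunks is a tail of a few characters, which is what removes the need for the data variable.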
I am not sure where I am going wrong, but I think the event listener is being invoked multiple times and the files are being parsed multiple times.
I have five files in the directory and they are getting parsed. However, the PDF file at array index 0 gets parsed once, the next one twice, and the third one three times.
I want each file in the directory to be parsed once, creating a text file from the data extracted from the PDF.
The idea is to parse the PDF, get the content as text, and convert the text into JSON in a specific format.
To keep it simple, the plan is to complete one task first, then use the output from the code below to perform the next task.
I hope someone can point out where I am going wrong and explain my mistake a bit so I understand it. (I'm new to JS and Node.)
Regards,
Jai
Using the module from here:
https://github.com/modesty/pdf2json
var fs = require('fs')
PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser')
var pdfParser = new PDFParser(this, 1)
fs.readdir('C:/Users/Administrator/Desktop/Project/Input/',function(err,pdffiles){
  //console.log(pdffiles)
  pdffiles.forEach(function(pdffile){
    console.log(pdffile)
    pdfParser.once("pdfParser_dataReady",function(){
      fs.writeFile('C:/Users/Administrator/Desktop/Project/Jsonoutput/'+pdffile, pdfParser.getRawTextContent())
      pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/'+pdffile)
    })
  })
})
As mentioned in the comment, I'm just contributing workaround ideas so the OP can temporarily resolve this issue.
Assuming performance is not an issue, you should be able to parse the PDF files asynchronously but in sequence. That is, only parse the next file when the previous one is done.
Unfortunately I have never used the pdf2json module (PDFParser) before, so I could not test the code below. It may need some minor tweaks to work, but it should be syntactically fine since it was written in an IDE.
Example:
var fs = require('fs');
var PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser');

var parseFile = function(files, done) {
  var pdfFile = files.pop();
  if (pdfFile) {
    // A fresh parser per file, so listeners never pile up on a shared instance
    var pdfParser = new PDFParser();
    pdfParser.on("pdfParser_dataError", errData => { return done(errData); });
    pdfParser.on("pdfParser_dataReady", pdfData => {
      // Write the result, then move on to the next file once the write has finished
      fs.writeFile("C:/Users/Administrator/Desktop/Project/Jsonoutput/" + pdfFile, JSON.stringify(pdfData), writeErr => {
        if (writeErr) { return done(writeErr); }
        parseFile(files, done);
      });
    });
    pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/' + pdfFile);
  }
  else {
    return done(null, "All pdf files parsed.")
  }
};

fs.readdir('C:/Users/Administrator/Desktop/Project/Input/',function(err,pdffiles){
  if (err) { return console.error(err); }
  parseFile(pdffiles, (err, message) => {
    if (err) { console.error(err.parseError || err); }
    else { console.log(message); }
  })
});
In the code above, I have moved the parsing logic into a separate function called parseFile. It pops the next file off the array and checks whether there is still a file to process. If there is none, it invokes the callback function done; otherwise it starts parsing that file.
When a file has been parsed, parseFile calls itself recursively until the last file is done.
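The same one-file-at-a-time idea can also be written with promises and a plain loop instead of recursion. In the sketch below, parseOne and parseAll are hypothetical names, and createParser is a factory you supply (for example () => new PDFParser()); the event names and loadPDF are the ones used in the code above. Unlike the pop-based version, files are processed in array order.

```javascript
// Wrap a single parse in a promise that settles on the parser's events.
function parseOne(parser, file) {
  return new Promise((resolve, reject) => {
    parser.once("pdfParser_dataReady", resolve);
    parser.once("pdfParser_dataError", reject);
    parser.loadPDF(file);
  });
}

// Parse the files strictly one after another, each with its own parser.
// `createParser` is an assumed factory function supplied by the caller.
async function parseAll(files, createParser) {
  const results = [];
  for (const file of files) {
    results.push(await parseOne(createParser(), file));
  }
  return results;
}
```

Because each file gets a new parser and the loop awaits each result, no listener from an earlier file can fire again for a later one.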
Currently I'm using node-csv (http://www.adaltas.com/projects/node-csv/) for csv file parsing.
Is there a way to skip first few lines of the file before starting to parse the data? As some csv reports for example have report details in the first few lines before the actual headers and data start.
LOG REPORT <- data about the report
DATE: 1.1.1900
DATE,EVENT,MESSAGE <- data headers
1.1.1900,LOG,Hello World! <- actual data starts here
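All you need to do is pass the option { from_line: 2 } to the parse() function, as in the snippet below. (For the sample above, where the headers are on the third line, that would be from_line: 3.)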
const fs = require('fs');
const parse = require('csv-parse');

fs.createReadStream('path/to/file')
  .pipe(parse({ delimiter: ',', from_line: 2 }))
  .on('data', (row) => {
    // it will start from 2nd row
    console.log(row)
  })
Assuming you're using v0.4 or greater with the new refactor (i.e. csv-generate, csv-parse, stream-transform, and csv-stringify), you can use the built-in transform to skip the first line, with a bit of extra work.
var fs = require('fs'),
    csv = require('csv');

var skipHeader = true; // config option

var read = fs.createReadStream('in.csv'),
    write = fs.createWriteStream('out.jsonish'),
    parse = csv.parse(),
    rowCount = 0, // to keep track of where we are
    transform = csv.transform(function(row,cb) {
      var result;
      if ( skipHeader && rowCount === 0 ) { // if the option is turned on and this is the first line
        result = null; // pass null to cb to skip
      } else {
        result = JSON.stringify(row)+'\n'; // otherwise apply the transform however you want
      }
      rowCount++; // next time we're not at the first line anymore
      cb(null,result); // let node-csv know we're done transforming
    });

read
  .pipe(parse)
  .pipe(transform)
  .pipe(write).once('finish',function() {
    // done
  });
Essentially we track the number of rows that have been transformed, and if we're on the very first one (and we do in fact wish to skip the header, per the skipHeader bool), we pass null to the callback as the second param (the first one is always the error); otherwise we pass the transformed result.
This will also work with synchronous parsing, but it requires a change since there are no callbacks in synchronous mode. Also, the same logic could be applied to the older v0.2 library since it also has row transforming built in.
See http://csv.adaltas.com/transform/#skipping-and-creating-records
This is pretty easy to apply, and IMO has a pretty low footprint. Usually you want to keep track of rows processed for status purposes, and I almost always transform the result set before sending it to Writable, so it is very simple to just add in the extra logic to check for skipping the header. The added benefit here is that we're using the same module to apply skipping logic as we are to parse/transform - no extra dependencies are needed.
You have two options here:
You can process the file line by line. I posted a code snippet in an earlier answer that you can use:
var rl = readline.createInterface({
  input: instream,
  output: outstream,
  terminal: false
});
rl.on('line', function(line) {
  console.log(line);
  //Do your stuff ...
  //Then write to outstream
  outstream.write(line + '\n');
});
You can give your file stream an offset, which will skip those bytes. This is described in the documentation:
fs.createReadStream('sample.txt', {start: 90, end: 99});
This is much easier if you know the offset is fixed.
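When the preamble is a known number of lines rather than a known number of bytes, a small Transform placed in front of the parser can drop those lines. This is a sketch under that assumption; createSkipLines is a hypothetical name, and it treats "\n" as the line terminator (a "\r\n" file works too, since each line still ends in "\n").

```javascript
const { Transform } = require("stream");

// Hypothetical helper: swallows the first `count` lines, then passes the
// rest of the stream through untouched.
function createSkipLines(count) {
  let remaining = count;

  return new Transform({
    transform(chunk, _encoding, done) {
      if (remaining === 0) return done(null, chunk);

      let buf = chunk;
      while (remaining > 0) {
        const newline = buf.indexOf(0x0a); // position of the next "\n" byte
        if (newline === -1) return done(); // whole chunk is part of a skipped line
        buf = buf.subarray(newline + 1);
        remaining--;
      }
      // Forward whatever follows the last skipped line (may be nothing)
      done(null, buf.length > 0 ? buf : undefined);
    },
  });
}
```

It would sit between the file and the parser, for example fs.createReadStream('in.csv').pipe(createSkipLines(2)).pipe(parser), so the parser only ever sees the header row and the data.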
Basically I have a file, say, 100mb.qs and I need to pass its entire contents through the following function:
function process(input){
  // `in` is a reserved word in JavaScript, so the parameter is named `input`
  var out = JSON.stringify(require('querystring').parse(input));
  return out;
}
And then replace the file's contents with the result.
I imagine that I'll have to stream it, so...
require('fs').createReadStream('100mb.qs').pipe( /* ??? */ )
What do I do?
You should take a look at clarinet for parsing JSON as a stream.
var createReadStream = require('fs').createReadStream
  , createWriteStream = require('fs').createWriteStream
  , parseJson = require('clarinet').createStream()
  ;

parseJson.on('error', function(err){
  if (err) throw err
})
// On the stream interface, event names carry no "on" prefix
parseJson.on('value', function(v){
  // do stuff with value
})
parseJson.on('openobject', function (key) {
  // I bet you got the idea how this works :)
})

createReadStream('100mb.qs')
  .pipe(parseJson)
  .pipe(createWriteStream('newerFile.qs'))
There are many more events to listen to, so you should definitely take a look.
Also, it will send data downstream whenever a JSON node is ready to be written. It couldn't get better than this.
Hope this helps
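Since the input in the question is a query string rather than JSON, here is a rough sketch of doing the conversion itself as a stream, using only Node's built-in modules. The names createQsToJson and convertFileInPlace are made up for illustration. It assumes the file is a plain "&"-separated query string; unlike querystring.parse on the whole input, repeated keys are not merged into arrays but are written as repeated JSON keys.

```javascript
const fs = require("fs");
const querystring = require("querystring");
const { Transform, pipeline } = require("stream");

// Hypothetical helper: turns a streamed query string into one JSON object,
// emitting each key/value pair as soon as it is complete.
function createQsToJson() {
  let pending = ""; // incomplete pair carried over to the next chunk
  let first = true;

  function emitPair(stream, pair) {
    if (pair === "") return;
    for (const [key, value] of Object.entries(querystring.parse(pair))) {
      stream.push((first ? "{" : ",") + JSON.stringify(key) + ":" + JSON.stringify(value));
      first = false;
    }
  }

  return new Transform({
    transform(chunk, _encoding, done) {
      const pieces = (pending + chunk.toString()).split("&");
      pending = pieces.pop(); // the last piece may continue in the next chunk
      for (const pair of pieces) emitPair(this, pair);
      done();
    },
    flush(done) {
      emitPair(this, pending);
      this.push(first ? "{}" : "}"); // close the object (or emit an empty one)
      done();
    },
  });
}

// Write to a temporary file first, then swap it in for the original.
function convertFileInPlace(filePath, callback) {
  const tmpPath = filePath + ".tmp";
  pipeline(
    fs.createReadStream(filePath),
    createQsToJson(),
    fs.createWriteStream(tmpPath),
    (err) => {
      if (err) return callback(err);
      fs.rename(tmpPath, filePath, callback);
    }
  );
}
```

Writing to a temporary file and renaming it avoids reading from and writing to the same file at once, which is what "replace the file's contents" would otherwise require.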