Processing a 100MB file in Node.js

Processing a 100MB file in Node.js - node.js

Basically I have a file, say, 100mb.qs and I need to pass its entire contents through the following function:
function process(in){
var out = JSON.stringify(require('querystring').parse(in));
return out;
}
And then replace the file's contents with the result.
I imagine that I'll have to stream it, so...
require('fs').createReadStream('1mb.qs').pipe( /* ??? */ )
What to I do?

You should take a look at clarinet for parsing JSON as a stream.
var createReadStream = require('fs').createReadStream
, createWriteStream = require('fs').createReadStream
, parseJson = require('clarinet').createStream()
;
parseJson.on('error', function(err){
if (err) throw err
})
parseJson.on('onvalue', function(v){
// do stuff with value
})
parseJson.on('onopenobject', function (key) {
// I bet you got the idea how this works :)
})
createReadStream('100mb.qs')
.pipe(parseJson)
.pipe(createWriteStream('newerFile.qs'))
there is many more events to listen to, so you definitely show take a look.
Also, it will send data down stream whenever a JSON node is ready to be written. It couldn't get better then this.
Hope this helps

Related

How can I write a buffer data to a file from readable.stream in Nodejs?

How can I write a buffer data to a file from readable.stream in Nodejs? I know there are already npm package, I am asking this question for learning purpose only. I am also wondering why there is no such method available in npm 'fs' where user can pass readablestream and create a file directly?
I tried to write a stream.readableBuffer to a file using fs.write by passing the buffer directly, but somehow a small portion of file, is corrupt, after writing, I can see image but a small portion look black in it, my guess buffer has not written completely.
I pass formdata from ajax XMLHttpRequest to serverside controller (node js router in this case).
and I used npm 'parse-formdata' to parse the request. below is the code:
parseFormdata(req, function (err, data) {
if (err) {
logger.error(err);
throw err
}
console.log('fields:', data.fields); // I have data here but how to write this data to a file?
/** perhaps a bad way to write the data to a file, looking for a better way **/
var chunk = data.parts[0].stream.readableBuffer.head.chunk;
fs.writeFile(data.parts[0].filename, chunk, function(err) {
if(err) {
console.log(err);
} else {
console.log("The file was saved!");
}
});
could some body tell me a better approach to write the data (that I got from parsing of FormData) ?

According to parseFormData
You may use the provided sample:
var pump = require('pump')
var concat = require('concat-stream')
pump(stream, concat(function (buf) {
assert.equal(String(buf), String(file), 'file received')
// then write to your file
res.end()
}))
But you may do shorter:
const ws = fs.createWriteStream('out.txt')
data.parts[0].stream.pipe(ws)
Finally, note that library has not been updated since 2017, so there may be some vulnerabilities or so..

Node - Abstracting Pipe Steps into Function

I'm familiar with Node streams, but I'm struggling on best practices for abstracting code that I reuse a lot into a single pipe step.
Here's a stripped down version of what I'm writing today:
inputStream
.pipe(csv.parse({columns:true})
.pipe(csv.transform(function(row) {return transform(row); }))
.pipe(csv.stringify({header: true})
.pipe(outputStream);
The actual work happens in transform(). The only things that really change are inputStream, transform(), and outputStream. Like I said, this is a stripped down version of what I actually use. I have a lot of error handling and logging on each pipe step, which is ultimately why I'm try to abstract the code.
What I'm looking to write is a single pipe step, like so:
inputStream
.pipe(csvFunction(transform(row)))
.pipe(outputStream);
What I'm struggling to understand is how to turn those pipe steps into a single function that accepts a stream and returns a stream. I've looked at libraries like through2 but I'm but not sure how that get's me to where I'm trying to go.

You can use the PassThrough class like this:
var PassThrough = require('stream').PassThrough;
var csvStream = new PassThrough();
csvStream.on('pipe', function (source) {
// undo piping of source
source.unpipe(this);
// build own pipe-line and store internally
this.combinedStream =
source.pipe(csv.parse({columns: true}))
.pipe(csv.transform(function (row) {
return transform(row);
}))
.pipe(csv.stringify({header: true}));
});
csvStream.pipe = function (dest, options) {
// pipe internal combined stream to dest
return this.combinedStream.pipe(dest, options);
};
inputStream
.pipe(csvStream)
.pipe(outputStream);

Here's what I ended up going with. I used the through2 library and the streaming API of the csv library to create the pipe function I was looking for.
var csv = require('csv');
through = require('through2');
module.exports = function(transformFunc) {
parser = csv.parse({columns:true, relax_column_count:true}),
transformer = csv.transform(function(row) {
return transformFunc(row);
}),
stringifier = csv.stringify({header: true});
return through(function(chunk,enc,cb){
var stream = this;
parser.on('data', function(data){
transformer.write(data);
});
transformer.on('data', function(data){
stringifier.write(data);
});
stringifier.on('data', function(data){
stream.push(data);
});
parser.write(chunk);
parser.removeAllListeners('data');
transformer.removeAllListeners('data');
stringifier.removeAllListeners('data');
cb();
})
}
It's worth noting the part where I remove the event listeners towards the end, this was due to running into memory errors where I had created too many event listeners. I initially tried solving this problem by listening to events with once, but that prevented subsequent chunks from being read and passed on to the next pipe step.
Let me know if anyone has feedback or additional ideas.

how to use Node.JS foreach function with Event listerner

I am not sure where I am going wrong but I think that the event listener is getting invoked multiple times and parsing the files multiple times.
I have five files in the directory and they are getting parsed. However the pdf file with array 0 gets parsed once and the next one twice and third one three times.
I want the each file in the directory to be parsed once and create a text file by extracting the data from pdf.
The Idea is to parse the pdf get the content as text and convert the text in to json in a specific format.
To make it simple, the plan is to complete one task first then use the output from the below code to perform the next task.
Hope anyone can help and point out where i am going wrong and explain a bit about my mistake so i understand it. (new to the JS and Node)
Regards,
Jai
Using the module from here:
https://github.com/modesty/pdf2json
var fs = require('fs')
PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser')
var pdfParser = new PDFParser(this, 1)
fs.readdir('C:/Users/Administrator/Desktop/Project/Input/',function(err,pdffiles){
//console.log(pdffiles)
pdffiles.forEach(function(pdffile){
console.log(pdffile)
pdfParser.once("pdfParser_dataReady",function(){
fs.writeFile('C:/Users/Administrator/Desktop/Project/Jsonoutput/'+pdffile, pdfParser.getRawTextContent())
pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/'+pdffile)
})
})
})

As mentioned in the comment, just contributing 'work-around' ideas for OP to temporary resolve this issue.
Assuming performance is not an issue then you should be able to asynchronously parse the pdf files in a sequential matter. That is, only parse the next file when the first one is done.
Unfortunately I have never used the npm module PDFParser before so it is really difficult for me to try the code below. Pardon me as it may require some minor tweaks to make it to work, syntactically they should be fine as they were written using an IDE.
Example:
var fs = require('fs');
PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser');
var parseFile = function(files, done) {
var pdfFile = files.pop();
if (pdfFile) {
var pdfParser = new PDFParser();
pdfParser.on("pdfParser_dataError", errData => { return done(errData); });
pdfParser.on("pdfParser_dataReady", pdfData => {
fs.writeFile("'C:/Users/Administrator/Desktop/Project/Jsonoutput/" + pdfFile, JSON.stringify(pdfData));
parseFile(files, done);
});
pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/' + pdfFile);
}
else {
return done(null, "All pdf files parsed.")
}
};
fs.readdir('C:/Users/Administrator/Desktop/Project/Input/',function(err,pdffiles){
parseFile(pdffiles, (err, message) => {
if (err) { console.error(err.parseError); }
else { console.log(message); }
})
});
In the code above, I have isolated out the parsing logic into a separated function called parseFile. In this function it first checks to see if there are still files to process or not, if none then it invokes the callback function done otherwise it will do an array.pop operation to get the next file in queue and starts parsing it.
When parsing is done then it recursively call the parseFile function until the last file is parsed.

node, How to handle/catch pipe

I've seen and read a few tutorials that state you can pipe one stream to another almost like lego blocks, but I can't find anything on how to catch a pipe command when a stream is piped to your object.
What I mean is how do I create an object with functions so I can do:
uploadWrapper = function (client, file, callback) {
upload = function (client,file,callback){
var file = file
// this.data = 'undefined'
stream.Writable.call(this);
this.end = function () {
if(typeof this.data !== 'undefined') file.data = this.data
callback(file.data,200)
}
// var path = urlB.host('upload').object('files',file.id).action('content').url
// // client.upload(path,file,callback)
}
util.inherits(upload,stream.Writable)
upload.prototype._write = function (chunk, encoding, callback) {
this.data = this.data + chunk.toString('utf8')
callback()
}
return new upload(client,file,callback)
}
exports.upload = uploadWrapper
How do I handle when data is piped to my object?
I've looked but I can't really find anything about this (maybe I haven't looked in the write places?).
Can any one point me in the right direction?
If it helps to know it, all I Want to be able to do is catch a data stream and build a string containing data with binary encoding; whether it's from a file-stream or a request stream from a server(i.e. the data from a file of a multipart request) object.
EDIT: I've updated the code to log the data
EDIT: I've fixed it, I can now receive piped data, I had to put the code in a wrapper that returned the function that implemented stream.
EDIT: different problem now, this.data in _read isn't storing in a way that this.data in the upload function can read.
EDIT: OK, now I can deal with the callback and catch the data, I need to work out how to tell if data is being piped to it or if it's being used as a normal function.

If you want to create your own stream that can be piped to and/or from, look at the node docs for implementing streams.

How can I parse a large delimited text file in node

I'm using Node to process log files from an application and due to the traffic volumes these can be a gigabyte or so in size each day.
The files are gripped every night and I need to read the files without having to unzip them to disk.
From what I understand I can use zlib to decompress the file to some form of stream but I don't know how to get at the data and not sure how i can then easily handle a line at a time (though I know some kind of while loop searching for \n will be involved.
The closest answer I found so far was demonstrating how to pipe the stream to a sax parser, but the whole node pipes/stream is a little confusing
fs.createReadStream('large.xml.gz').pipe(zlib.createUnzip()).pipe(saxStream);

You should take a look at sax.
It is developed by the isaacs!
I haven't tested this code, but I would start by writing something along these lines.
var Promise = Promise || require('es6-promise').Promise
, thr = require('through2')
, createReadStream = require('fs').createReadStream
, createUnzip = require('zlib').createUnzip
, createParser = require('sax').createStream
;
function processXml (filename) {
return new Promise(function(resolve, reject){
var unzip = createUnzip()
, xmlParser = createParser()
;
xmlParser.on('opentag', function(node){
// do stuff with the node
})
xmlParser.on('attribute', function(node){
// do more stuff with attr
})
// instead of rejecting, you may handle the error instead.
xmlParser.on('error', reject)
xmlParser.on('end', resolve)
createReadStream(filename)
.pipe(unzip)
.pipe(xmlParser)
.pipe(thr(function(chunk, enc, next){
// as soon xmlParser is done with a node, it passes down stream.
// change the chunk if you wish
next(null, newerChunk)
}))
rl = readline.createInterface({
input: unzip
, ouput: xmlParser
})
})
}
processXml('large.xml.gz').then(function(){
console.log('done')
})
.catch(function(err){
// handle error.
})
I hope that helps

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Processing a 100MB file in Node.js - node.js

Related

How can I write a buffer data to a file from readable.stream in Nodejs?

Node - Abstracting Pipe Steps into Function

how to use Node.JS foreach function with Event listerner

node, How to handle/catch pipe

How can I parse a large delimited text file in node

Categories

Resources