How can I parse a large delimited text file in node - node.js

I'm using Node to process log files from an application and due to the traffic volumes these can be a gigabyte or so in size each day.
The files are gzipped every night and I need to read them without unzipping them to disk first.
From what I understand I can use zlib to decompress the file into some form of stream, but I don't know how to get at the data, and I'm not sure how I can then easily handle a line at a time (though I know some kind of loop searching for \n will be involved).
The closest answer I found so far demonstrates how to pipe the stream into a sax parser, but the whole Node pipes/streams topic is a little confusing:
fs.createReadStream('large.xml.gz').pipe(zlib.createUnzip()).pipe(saxStream);

You should take a look at sax.
It is developed by isaacs!
I haven't tested this code, but I would start by writing something along these lines:
var Promise = Promise || require('es6-promise').Promise
  , thr = require('through2')
  , createReadStream = require('fs').createReadStream
  , createUnzip = require('zlib').createUnzip
  , createParser = require('sax').createStream
  ;

function processXml (filename) {
  return new Promise(function (resolve, reject) {
    var unzip = createUnzip()
      , xmlParser = createParser()
      ;

    xmlParser.on('opentag', function (node) {
      // do stuff with the node
    })
    xmlParser.on('attribute', function (attr) {
      // do more stuff with the attribute
    })

    // instead of rejecting, you may handle the error here.
    xmlParser.on('error', reject)
    xmlParser.on('end', resolve)

    createReadStream(filename)
      .pipe(unzip)
      .pipe(xmlParser)
      .pipe(thr(function (chunk, enc, next) {
        // as soon as xmlParser is done with a node, it passes it downstream.
        // change the chunk here if you wish
        next(null, chunk)
      }))
  })
}
processXml('large.xml.gz')
  .then(function () {
    console.log('done')
  })
  .catch(function (err) {
    // handle the error.
  })
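Since the logs in the question are delimited text rather than XML, a line-at-a-time variant may be closer to what you need: pipe the gunzip stream into readline and handle each line as it arrives. A rough, untested sketch (the file name is just a placeholder):

var fs = require('fs')
  , zlib = require('zlib')
  , readline = require('readline')
  ;

function processLog (filename, onLine) {
  return new Promise(function (resolve, reject) {
    var gunzip = zlib.createGunzip()

    fs.createReadStream(filename)
      .on('error', reject)
      .pipe(gunzip)
      .on('error', reject)

    // readline splits the decompressed stream on \n for us
    var rl = readline.createInterface({ input: gunzip })
    rl.on('line', onLine)
    rl.on('close', resolve)
  })
}

processLog('large.log.gz', function (line) {
  // handle a single log line here
})
.then(function () { console.log('done') })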
I hope that helps

Related

How to sequentially read a csv file with node.js (using stream API)

I am trying to figure out how to create a stream pipe which reads entries from a csv file on demand. To do so, I thought of using the following approach with pipes (pseudocode):
const stream_pipe = input_file_stream.pipe(csv_parser)
// Then getting entries through:
let entry = stream_pipe.read()
Unfortunately, after lots of testing I figured out that the moment I set up the pipe, it is automatically consumed until the end of the csv file. I tried to pause it on creation by appending .pause() at the end, but that seems to have no effect.
Here's my current code. I am using the csv_parse library (part of the bigger csv package):
// Read file stream
const file_stream = fs.createReadStream("filename.csv")
const parser = csvParser({
  columns: ['a', 'b'],
  on_record: (record) => {
    // A simple filter as I am interested only in numeric entries
    let a = parseInt(record.a)
    let b = parseInt(record.b)
    return (isNaN(a) || isNaN(b)) ? undefined : record
  }
})
const reader = file_stream.pipe(parser) // Adding .pause() seems to have no effect
console.log(reader.read()) // Prints `null`
// I found out I can use this strategy to read a few entries immediately, but I cannot break out of it and then resume as the stream will automatically be consumed
//for await (const record of reader) {
// console.log(record)
//}
I have been banging my head on this for a while and I could not find an easy solution in either the csv package documentation or the official Node documentation.
Thanks in advance to anyone able to put me on the right track :)
One thing you can do while reading the stream is to create a readline interface, pass it the input stream, and then process each line sequentially, something like this:
const fs = require('fs');
const readline = require('readline');

async function processFile(filename) {
  const rl = readline.createInterface({ input: fs.createReadStream(filename) });
  // for await pulls one line at a time, so each record is
  // processed before the next one is read
  for await (const line of rl) {
    await processRecord(line);
  }
}

function processRecord(line) {
  return new Promise((resolve, reject) => {
    if (line) {
      // do the processing
      resolve(line);
    } else {
      reject(new Error('Unable to process record'));
    }
  });
}
The processRecord function gets things line by line, and you can use promises to make the processing sequential.
Note: the above code is closer to pseudocode, just to give you an idea of how things work; I have been doing the same in my project to read a csv file line by line and it works fine.
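For reference, the pull-style reading asked about in the question can also be done directly on the parser stream by waiting for its 'readable' event and calling read() explicitly, so records are only pulled out when you ask for them. A rough sketch, assuming csv-parse v5+ where parse is a named export:

const fs = require('fs')
const { parse } = require('csv-parse')

const parser = fs.createReadStream('filename.csv')
  .pipe(parse({ columns: ['a', 'b'] }))

parser.on('readable', () => {
  // read() returns one parsed record, or null when the internal
  // buffer is currently empty
  let record
  while ((record = parser.read()) !== null) {
    console.log(record)
  }
})

parser.on('end', () => console.log('done'))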

How to download multiple links from a .csv file using multithreading in node.js?

I am trying to download links from a .csv file and store the downloaded files in a folder. I have used a multithreading library for this, i.e. mt-files-downloader. The files download fine, but it takes too much time to download about 313 files. These files are about 400 KB in size at most. When I tried a normal download using Node I could download them in a minute or two, but with this library the download should be fast since it is multithreaded, yet it takes a lot of time. Below is my code; any help would be useful. Thanks!
var rec;
csv
  .fromStream(stream, { headers: ["Recording", , , , , , , ,] })
  .on("data", function (records) {
    rec = records.Recording;
    //console.log(rec);
    download(rec);
  })
  .on("end", function () {
    console.log('Reading complete')
  });

function download(rec) {
  var filename = rec.replace(/\//g, '');
  var filePath = './recordings/' + filename;
  var downloadPath = path.resolve(filePath)
  var fileUrl = 'http:' + rec;
  var downloader = new Downloader();
  var dl = downloader.download(fileUrl, downloadPath);
  dl.start();
  dl.on('error', function (dl) {
    var dlUrl = dl.url;
    console.log('error downloading = > ' + dl.url + ' restarting download....');
    if (!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')) {
      console.log('resuming file download => ' + dlUrl);
      dl.resume();
    }
  });
}
You're right, downloading 313 files of 400kB should not take long - and I don't think this has to do with your code - maybe the connection is bad? Have you tried downloading a single file via curl?
Anyway, I see two problems in your approach that I can help with:
First, you download all the files at the same time (which may introduce some overhead on the server).
Second, your error handling will run in a loop without waiting or checking the actual file, so if there's a 404 you'll flood the server with requests.
Using streams with on('data') events has the major drawback of executing all the chunks more or less synchronously as they are read. This means that your code will execute whatever is in the on('data') handler without ever waiting for your downloads to complete. The only limiting factor is then how fast the csv can be read, and I'd expect millions of lines per second to be normal.
From the server's perspective, you're simply requesting 313 files at once, which, without speculating on the server's actual technical mechanisms, will result in some of those requests waiting and interfering with each other.
This can be solved by using a streaming framework like scramjet, event-stream, or highland, for instance. I'm the author of the first and it's IMHO the easiest in this case, but you can use any of them by changing the code a little to match their API; it's pretty similar in all cases anyway.
Here's some heavily commented code that will run a couple of downloads in parallel:
const {StringStream} = require("scramjet");
const sleep = require("sleep-promise");
const Downloader = require('mt-files-downloader');
const downloader = new Downloader();
// First we create a StringStream class from your csv stream
StringStream.from(csvStream)
  // we parse it as CSV without columns
  .CSVParse({header: false})
  // we set the limit of parallel operations, it will get propagated.
  .setOptions({maxParallel: 16})
  // now we extract the first column as `recording` and create a
  // download request.
  .map(([recording]) => {
    // here's the first part of your code
    const filename = recording.replace(/\//g, '');
    const filePath = './recordings/' + filename;
    const downloadPath = path.resolve(filePath)
    const fileUrl = 'http:' + recording;
    // at this point we return the dl object so we can keep these
    // parts separate.
    // see that the download hasn't been started yet
    return downloader.download(fileUrl, downloadPath);
  })
  // what we get is a stream of not-yet-started download objects
  // so we run this asynchronous function. If it returns a Promise
  // it will wait
  .map(
    async (dl) => new Promise((res, rej) => {
      // let's assume we allow a couple of retries
      let retries = 10;
      dl.on('error', async (dl) => {
        try {
          // here we reject if the download fails too many times.
          if (retries-- === 0) throw new Error(`Download of ${dl.url} failed too many times`);
          var dlUrl = dl.url;
          console.log('error downloading = > ' + dl.url + ' restarting download....');
          if (!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')) {
            console.log('resuming file download => ' + dlUrl);
            // let's wait half a second before retrying
            await sleep(500);
            dl.resume();
          }
        } catch (e) {
          // here we call the `reject` function - meaning that
          // this file wasn't downloaded despite retries.
          rej(e);
        }
      });
      // here we call the `resolve` function to confirm that the file was
      // downloaded.
      dl.on('end', () => res());
      // start the actual download
      dl.start();
    })
  )
  // we log a message and ignore the result in case of an error
  .catch(e => {
    console.error('An error occurred:', e.message);
    return;
  })
  // Every stream must have some sink to flow to; the `run` method runs
  // every operation above.
  .run();
You can also use the stream to push out some kind of log messages and use pipe(process.stderr) at the end, instead of those console.logs. Please check the scramjet documentation for additional info and the MDN docs on async functions.
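If you'd rather not pull in a streaming framework at all, a plain Promise-based concurrency limit is another option. A rough, untested sketch, reusing only the Downloader API already shown in the question and assuming the recording URLs have been collected into an array first:

const path = require('path');
const Downloader = require('mt-files-downloader');

const downloader = new Downloader();

// wrap a single download in a Promise so it can be awaited
function downloadOne(rec) {
  return new Promise((resolve, reject) => {
    const downloadPath = path.resolve('./recordings/' + rec.replace(/\//g, ''));
    const dl = downloader.download('http:' + rec, downloadPath);
    dl.on('end', resolve);
    dl.on('error', reject);
    dl.start();
  });
}

// run the downloads with at most `limit` of them in flight at a time
async function downloadAll(recordings, limit = 16) {
  const queue = recordings.slice();
  const workers = Array.from({ length: limit }, async () => {
    while (queue.length) {
      const rec = queue.shift();
      try {
        await downloadOne(rec);
      } catch (e) {
        console.error('failed:', rec, e.message);
      }
    }
  });
  await Promise.all(workers);
}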

Node - Abstracting Pipe Steps into Function

I'm familiar with Node streams, but I'm struggling with best practices for abstracting code that I reuse a lot into a single pipe step.
Here's a stripped down version of what I'm writing today:
inputStream
  .pipe(csv.parse({columns: true}))
  .pipe(csv.transform(function (row) { return transform(row); }))
  .pipe(csv.stringify({header: true}))
  .pipe(outputStream);
The actual work happens in transform(). The only things that really change are inputStream, transform(), and outputStream. Like I said, this is a stripped down version of what I actually use. I have a lot of error handling and logging on each pipe step, which is ultimately why I'm trying to abstract the code.
What I'm looking to write is a single pipe step, like so:
inputStream
  .pipe(csvFunction(transform))
  .pipe(outputStream);
What I'm struggling to understand is how to turn those pipe steps into a single function that accepts a stream and returns a stream. I've looked at libraries like through2, but I'm not sure how that gets me to where I'm trying to go.
You can use the PassThrough class like this:
var PassThrough = require('stream').PassThrough;

var csvStream = new PassThrough();
csvStream.on('pipe', function (source) {
  // undo piping of source
  source.unpipe(this);
  // build own pipe-line and store internally
  this.combinedStream =
    source.pipe(csv.parse({columns: true}))
      .pipe(csv.transform(function (row) {
        return transform(row);
      }))
      .pipe(csv.stringify({header: true}));
});

csvStream.pipe = function (dest, options) {
  // pipe internal combined stream to dest
  return this.combinedStream.pipe(dest, options);
};

inputStream
  .pipe(csvStream)
  .pipe(outputStream);
Here's what I ended up going with. I used the through2 library and the streaming API of the csv library to create the pipe function I was looking for.
var csv = require('csv'),
    through = require('through2');

module.exports = function (transformFunc) {
  var parser = csv.parse({columns: true, relax_column_count: true}),
      transformer = csv.transform(function (row) {
        return transformFunc(row);
      }),
      stringifier = csv.stringify({header: true});

  return through(function (chunk, enc, cb) {
    var stream = this;

    parser.on('data', function (data) {
      transformer.write(data);
    });
    transformer.on('data', function (data) {
      stringifier.write(data);
    });
    stringifier.on('data', function (data) {
      stream.push(data);
    });

    parser.write(chunk);

    parser.removeAllListeners('data');
    transformer.removeAllListeners('data');
    stringifier.removeAllListeners('data');

    cb();
  })
}
It's worth noting the part where I remove the event listeners towards the end; this was due to running into memory errors after creating too many event listeners. I initially tried solving this problem by listening to events with once, but that prevented subsequent chunks from being read and passed on to the next pipe step.
Let me know if anyone has feedback or additional ideas.
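As one more idea (not from the original answers): if pulling in another small dependency is acceptable, the pumpify package can combine several streams into a single duplex stream, which is exactly the "one pipe step" shape asked for. A sketch, assuming pumpify is installed:

var csv = require('csv'),
    pumpify = require('pumpify');

// returns a single duplex stream: write csv text in, read transformed csv text out
function csvFunction(transformFunc) {
  return pumpify(
    csv.parse({columns: true}),
    csv.transform(function (row) { return transformFunc(row); }),
    csv.stringify({header: true})
  );
}

inputStream
  .pipe(csvFunction(transform))
  .pipe(outputStream);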

How to use Node.js forEach function with an event listener

I am not sure where I am going wrong but I think that the event listener is getting invoked multiple times and parsing the files multiple times.
I have five files in the directory and they are getting parsed. However, the pdf file at array index 0 gets parsed once, the next one twice, and the third one three times.
I want each file in the directory to be parsed once, creating a text file by extracting the data from the pdf.
The idea is to parse the pdf, get the content as text, and convert the text into JSON in a specific format.
To keep it simple, the plan is to complete one task first and then use the output from the code below to perform the next task.
I hope someone can help, point out where I am going wrong, and explain my mistake a bit so I understand it (new to JS and Node).
Regards,
Jai
Using the module from here:
https://github.com/modesty/pdf2json
var fs = require('fs'),
    PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser');

var pdfParser = new PDFParser(this, 1);

fs.readdir('C:/Users/Administrator/Desktop/Project/Input/', function (err, pdffiles) {
  //console.log(pdffiles)
  pdffiles.forEach(function (pdffile) {
    console.log(pdffile)
    pdfParser.once("pdfParser_dataReady", function () {
      fs.writeFile('C:/Users/Administrator/Desktop/Project/Jsonoutput/' + pdffile, pdfParser.getRawTextContent())
    })
    pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/' + pdffile)
  })
})
As mentioned in the comment, I'm just contributing "work-around" ideas for the OP to temporarily resolve this issue.
Assuming performance is not an issue, you should be able to asynchronously parse the pdf files in a sequential manner, that is, only parse the next file when the first one is done.
Unfortunately I have never used the pdf2json module before, so it is really difficult for me to test the code below. Pardon me if it requires some minor tweaks to work; syntactically it should be fine, as it was written using an IDE.
Example:
var fs = require('fs'),
    PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser');

var parseFile = function (files, done) {
  var pdfFile = files.pop();
  if (pdfFile) {
    var pdfParser = new PDFParser();
    pdfParser.on("pdfParser_dataError", errData => { return done(errData); });
    pdfParser.on("pdfParser_dataReady", pdfData => {
      fs.writeFile("C:/Users/Administrator/Desktop/Project/Jsonoutput/" + pdfFile, JSON.stringify(pdfData), err => {
        if (err) { console.error(err); }
      });
      parseFile(files, done);
    });
    pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/' + pdfFile);
  }
  else {
    return done(null, "All pdf files parsed.")
  }
};

fs.readdir('C:/Users/Administrator/Desktop/Project/Input/', function (err, pdffiles) {
  parseFile(pdffiles, (err, message) => {
    if (err) { console.error(err.parseError); }
    else { console.log(message); }
  })
});
In the code above, I have isolated the parsing logic into a separate function called parseFile. The function pops the next file off the array; if there is one, it starts parsing it, otherwise it invokes the done callback to signal that all files have been processed.
When parsing of one file is done, it recursively calls parseFile again until the last file has been parsed.
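A slightly more modern take on the same idea (untested; it only relies on the pdf2json events and methods already used above, and assumes the module's default export is the parser class) wraps one parse in a Promise and drives the loop with async/await, so every pdf is parsed exactly once:

const fs = require('fs');
const PDFParser = require('pdf2json');

// parse a single pdf and resolve with its raw text content
function parsePdf(inputPath) {
  return new Promise((resolve, reject) => {
    const pdfParser = new PDFParser(null, 1);
    pdfParser.on('pdfParser_dataError', reject);
    pdfParser.on('pdfParser_dataReady', () => resolve(pdfParser.getRawTextContent()));
    pdfParser.loadPDF(inputPath);
  });
}

async function processAll(inputDir, outputDir) {
  const pdffiles = fs.readdirSync(inputDir);
  // each file is awaited before the next one starts, so every
  // pdf is parsed exactly once
  for (const pdffile of pdffiles) {
    const text = await parsePdf(inputDir + pdffile);
    fs.writeFileSync(outputDir + pdffile + '.txt', text);
  }
}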

Processing a 100MB file in Node.js

Basically I have a file, say, 100mb.qs and I need to pass its entire contents through the following function:
function process(input) {
  var out = JSON.stringify(require('querystring').parse(input));
  return out;
}
And then replace the file's contents with the result.
I imagine that I'll have to stream it, so...
require('fs').createReadStream('1mb.qs').pipe( /* ??? */ )
What do I do?
You should take a look at clarinet for parsing JSON as a stream.
var createReadStream = require('fs').createReadStream
  , createWriteStream = require('fs').createWriteStream
  , parseJson = require('clarinet').createStream()
  ;

parseJson.on('error', function (err) {
  if (err) throw err
})

parseJson.on('onvalue', function (v) {
  // do stuff with the value
})

parseJson.on('onopenobject', function (key) {
  // I bet you got the idea how this works :)
})

createReadStream('100mb.qs')
  .pipe(parseJson)
  .pipe(createWriteStream('newerFile.qs'))
There are many more events to listen to, so you should definitely take a look.
Also, it will send data downstream whenever a JSON node is ready to be written. It couldn't get better than this.
Hope this helps
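For completeness, since the input here is a querystring rather than JSON: if the process has enough memory to hold the whole file as a string, a rough non-streaming sketch of the read/convert/replace step could look like this (file name as in the question):

var fs = require('fs')
  , querystring = require('querystring')
  ;

// read the whole file, convert it, then overwrite it in place
fs.readFile('100mb.qs', 'utf8', function (err, contents) {
  if (err) throw err
  var out = JSON.stringify(querystring.parse(contents))
  fs.writeFile('100mb.qs', out, function (err) {
    if (err) throw err
    console.log('done')
  })
})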
