How to skip first lines of the file with node-csv parser? - node.js

Currently I'm using node-csv (http://www.adaltas.com/projects/node-csv/) for csv file parsing.
Is there a way to skip the first few lines of the file before starting to parse the data? Some CSV reports, for example, have report details in the first few lines before the actual headers and data start:
LOG REPORT <- data about the report
DATE: 1.1.1900
DATE,EVENT,MESSAGE <- data headers
1.1.1900,LOG,Hello World! <- actual data starts here

All you need to do is pass the argument { from_line: 2 } inside the parse() function, like the snippet below:
const fs = require('fs');
const parse = require('csv-parse');

fs.createReadStream('path/to/file')
  .pipe(parse({ delimiter: ',', from_line: 2 }))
  .on('data', (row) => {
    // it will start from the 2nd row
    console.log(row);
  });
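For the sample report in the question, the real headers sit on line 3, so from_line: 3 is what you would pass there. A minimal sketch of that, assuming csv-parse v5 or later (where parse is a named export) and a made-up file name:

const fs = require('fs');
const { parse } = require('csv-parse'); // in v5+ parse is a named export

fs.createReadStream('log_report.csv') // hypothetical path to the report shown above
  .pipe(parse({ delimiter: ',', from_line: 3, columns: true })) // skip the two report-detail lines, treat line 3 as headers
  .on('data', (record) => {
    // e.g. { DATE: '1.1.1900', EVENT: 'LOG', MESSAGE: 'Hello World!' }
    console.log(record);
  });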

Assuming you're using v0.4 or greater with the new refactor (i.e. csv-generate, csv-parse, stream-transform, and csv-stringify), you can use the built-in transform to skip the first line, with a bit of extra work.
var fs = require('fs'),
    csv = require('csv');

var skipHeader = true; // config option

var read = fs.createReadStream('in.csv'),
    write = fs.createWriteStream('out.jsonish'),
    parse = csv.parse(),
    rowCount = 0, // to keep track of where we are
    transform = csv.transform(function(row, cb) {
      var result;
      if (skipHeader && rowCount === 0) { // if the option is turned on and this is the first line
        result = null; // pass null to cb to skip
      } else {
        result = JSON.stringify(row) + '\n'; // otherwise apply the transform however you want
      }
      rowCount++; // next time we're not at the first line anymore
      cb(null, result); // let node-csv know we're done transforming
    });

read
  .pipe(parse)
  .pipe(transform)
  .pipe(write)
  .once('finish', function() {
    // done
  });
Essentially we track the number of rows that have been transformed and, if we're on the very first one (and we in fact wish to skip the header via the skipHeader flag), pass null to the callback as the second param (the first one is always the error); otherwise we pass the transformed result.
This will also work with synchronous parsing, but requires a change since there is no callback in synchronous mode (see the sketch below). Also, the same logic could be applied to the older v0.2 library, since it also has row transforming built in.
See http://csv.adaltas.com/transform/#skipping-and-creating-records
This is pretty easy to apply, and IMO has a pretty low footprint. Usually you want to keep track of rows processed for status purposes, and I almost always transform the result set before sending it to Writable, so it is very simple to just add in the extra logic to check for skipping the header. The added benefit here is that we're using the same module to apply skipping logic as we are to parse/transform - no extra dependencies are needed.
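For completeness, here is a minimal sketch of how the same skip could look in synchronous mode, where the one-argument handler returns the transformed record (or null to skip) instead of calling a callback; treat the exact sync semantics as something to verify against the linked docs for your version:

var csv = require('csv');

var skipHeader = true; // config option
var rowCount = 0;

// With a single-argument handler, stream-transform runs synchronously:
// returning null skips the record, anything else is passed downstream.
var transform = csv.transform(function(row) {
  var result = (skipHeader && rowCount === 0) ? null : JSON.stringify(row) + '\n';
  rowCount++;
  return result;
});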

You have two options here:
You can process the file line-by-line. I posted a code snippet in an earlier answer; you can use that:
var rl = readline.createInterface({
  input: instream,
  output: outstream,
  terminal: false
});

rl.on('line', function(line) {
  console.log(line);
  // Do your stuff ...
  // Then write to outstream
  rl.write(line);
});
You can give an offset to your file stream, which will skip those bytes. You can see it in the documentation:
fs.createReadStream('sample.txt', {start: 90, end: 99});
This is much easier if you know the offset is fixed.
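If the metadata block at the top of the CSV has a known, fixed byte length, a minimal sketch of combining the offset trick with the parser; the 120-byte offset and file name are made-up assumptions for the example:

const fs = require('fs');
const { parse } = require('csv-parse'); // v5+ named export; plain require('csv-parse') on v4

// Assume the report-detail lines occupy exactly the first 120 bytes (hypothetical).
fs.createReadStream('report.csv', { start: 120 })
  .pipe(parse({ delimiter: ',', columns: true })) // the first line the parser sees is now the header row
  .on('data', (record) => {
    console.log(record);
  });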

Related

How to sequentially read a csv file with node.js (using stream API)

I am trying to figure out how to create a stream pipe which reads entries in a CSV file on demand. To do so, I thought of using the following approach using pipes (pseudocode):
const stream_pipe = input_file_stream.pipe(csv_parser)
// Then getting entries through:
let entry = stream_pipe.read()
Unfortunately, after lots of testing I figured out that the moment I set up the pipe, it is automatically consumed until the end of the CSV file. I tried to pause it on creation by appending .pause() at the end, but it seems to have no effect.
Here's my current code. I am using the csv-parse library (part of the bigger csv package):
// Read file stream
const file_stream = fs.createReadStream("filename.csv")

const parser = csvParser({
  columns: ['a', 'b'],
  on_record: (record) => {
    // A simple filter as I am interested only in numeric entries
    let a = parseInt(record.a)
    let b = parseInt(record.b)
    return (isNaN(a) || isNaN(b)) ? undefined : record
  }
})

const reader = file_stream.pipe(parser) // Adding .pause() seems to have no effect

console.log(reader.read()) // Prints `null`

// I found out I can use this strategy to read a few entries immediately, but I cannot break
// out of it and then resume, as the stream will automatically be consumed
// for await (const record of reader) {
//   console.log(record)
// }
I have been banging my head on this for a while and I could not find easy solutions in either the csv package docs or the official Node documentation.
Thanks in advance to anyone able to put me on the right track :)
One thing you can do is, while reading the stream, create a readline interface and pass it the input stream and a normal output stream, like this:
const fs = require('fs');
const readline = require('readline');

const inputStream = fs.createReadStream("filename.csv"); // the stream reading the csv file

// now create a readline interface which will read the file
// line by line; you should use async/await
const rl = readline.createInterface({ input: inputStream });

rl.on('line', async (line) => {
  const res = await processRecord(line);
  // res now holds the processed record for this line
});

async function processRecord(line) {
  return new Promise((res, rej) => {
    if (line) {
      // do the processing
      res(line);
    } else {
      rej('Unable to process record');
    }
  });
}
The processRecord function gets things line by line, and you can use promises to make the processing sequential.
Note: the above code is pseudocode, just to give you an idea of how things work; I have been doing the same in my project to read a CSV file line by line and it works fine.
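If you would rather stay with the csv-parse API than switch to readline, here is a minimal sketch of pulling records with the stream's 'readable' event and read(), instead of 'data' handlers or a for await loop that runs to completion. read() returns null until the parser has something buffered, which is why calling it immediately after piping prints null. The file name and column options are taken from the question:

const fs = require('fs');
const { parse } = require('csv-parse'); // csv-parse v5+ named export

const parser = fs.createReadStream('filename.csv').pipe(
  parse({
    columns: ['a', 'b'],
    on_record: (record) => {
      const a = parseInt(record.a);
      const b = parseInt(record.b);
      return (isNaN(a) || isNaN(b)) ? undefined : record;
    },
  })
);

// No 'data' listener is attached, so nothing drains the parser for you;
// records are pulled explicitly whenever you want the next one.
parser.on('readable', () => {
  let record;
  // read() returns null once the currently buffered records are consumed
  while ((record = parser.read()) !== null) {
    console.log(record);
  }
});

parser.on('end', () => console.log('done'));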

How to read big csv file batch by batch in Nodejs?

I have a CSV file which contains more than 500k records. The fields of the CSV are:
name
age
branch
Without loading the huge data set into memory, I need to process all the records from the file: read a few records, insert them into a collection, manipulate them, and then continue reading the remaining records. As I'm new to this, I can't understand how it would work. If I try to print the batch, it prints buffered data; will the code below work for my requirement? With that buffered value, how can I get the CSV records and insert/manipulate the file data?
var stream = fs.createReadStream(csvFilePath)
  .pipe(csv())
  .on('data', (data) => {
    batch.push(data)
    counter++;
    if (counter == 100) {
      stream.pause()
      setTimeout(() => {
        console.log("batch in ", data)
        counter = 0;
        batch = []
        stream.resume()
      }, 5000)
    }
  })
  .on('error', (e) => {
    console.log("er ", e);
  })
  .on('end', () => {
    console.log("end");
  })
I have written you some sample code showing how to work with streams.
You basically create a stream and process its chunks. A chunk is an object of type Buffer. To work on it as text, call toString().
I don't have a lot of time to explain more, but the comments should help out.
Also consider using a module, since CSV parsing has already been done many times.
Hope this helps.
import * as fs from 'fs'
// end of line delimiter, system specific.
import { EOL } from 'os'

// the delimiter used in the csv
var delimiter = ','

// add your own implementation of parsing a portion of the text here.
const parseChunk = (text, index) => {
  // first chunk, the header is included here.
  if (index === 0) {
    // The first row will be the header. So take it
    var headerLine = text.substring(0, text.indexOf(EOL))
    // remove the header from the text for further processing.
    // also replace the new line character..
    text = text.replace(headerLine + EOL, '')
    // Do something with the header here..
  }
  // Now you have a part of the file to process without headers.
  // The csv parse function you need to figure out yourself. Best
  // is to use some module for that. There are plenty of edge cases
  // when parsing csv.
  // custom csv parser here => https://stackoverflow.com/questions/1293147/example-javascript-code-to-parse-csv-data
  // if the csv is well formatted it could be enough to use this
  var lines = text.split(EOL)
  for (var line of lines) {
    var values = line.split(delimiter)
    console.log('line received', values)
    // StoreToDb(values)
  }
}

// create the stream
const stream = fs.createReadStream('file.csv')

// variable to count the chunks, for knowing if the header is included..
var chunkCount = 0

// handle the data event of the stream
stream.on('data', chunk => {
  // the stream sends you a Buffer;
  // to have it as text, convert it to a string
  const text = chunk.toString()
  // Note that chunks have a fixed size but mostly consist of multiple lines.
  // A chunk may also end in the middle of a line, so a robust implementation
  // should carry any trailing partial line over to the next chunk.
  parseChunk(text, chunkCount)
  // increment the count.
  chunkCount++
})

stream.on('end', () => {
  console.log('parsing finished')
})

stream.on('error', (err) => {
  // error, handle properly here, maybe roll back changes already made to the db
  // and parse again. You may also use the chunkCount to start the parsing
  // again and omit the first x chunks, so you can restart at a given point
  console.log('parsing error ', err)
})
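As a counterpoint to hand-rolling the chunk handling, here is a minimal sketch of batching with the csv-parse module and async iteration, which pauses the stream automatically while each batch is being awaited; insertBatch and the file path are hypothetical placeholders:

const fs = require('fs');
const { parse } = require('csv-parse'); // csv-parse v5+ named export

async function processInBatches(filePath, batchSize = 100) {
  const parser = fs.createReadStream(filePath).pipe(
    parse({ columns: true, skip_empty_lines: true })
  );

  let batch = [];
  // The for-await loop only pulls the next records after the awaited
  // work for the current batch has finished, so memory stays bounded.
  for await (const record of parser) {
    batch.push(record);
    if (batch.length >= batchSize) {
      await insertBatch(batch); // hypothetical: insert into your collection
      batch = [];
    }
  }
  if (batch.length > 0) {
    await insertBatch(batch); // flush the final partial batch
  }
}

// hypothetical stand-in for a database insert
async function insertBatch(records) {
  console.log('inserting', records.length, 'records');
}

processInBatches('big.csv').catch(console.error);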

Nodejs streams - convert errors into default values

I'm quite unfamiliar with streaming. Suppose I have an input stream:
let inputStream = fs.createReadStream(getTrgFilePath());
I'm going to pipe this input to some output stream:
inputStream.pipe(someGenericOutputStream);
In my case, getTrgFilePath() may not produce a valid filepath. This will cause a failure which will result in no content being sent to someGenericOutputStream.
How do I set things up so that when inputStream encounters an error, it pipes some default value (e.g. "Invalid filepath!") instead of failing?
Example 1:
If getTrgFilePath() is invalid, and someGenericOutputStream is process.stdout, I want to see stdout say "Invalid filepath!"
Example 2:
If getTrgFilePath() is invalid, and someGenericOutputStream is the result of fs.createWriteStream(outputFilePath), I would expect to find a file at outputFilePath with the contents "Invalid filepath!".
I'm interested in a solution which doesn't need to know what specific kind of stream someGenericOutputStream is.
If you are only worried about the path being invalid, you could first check the path with fs.access, but as I understand it you don't want additional "handling" code in your file...
So let's take into account what may go wrong:
File path is not valid,
File or path does not exist,
File is not readable,
File is being read but something goes wrong part-way through.
Now I'm going to leave the 4th case alone; it's a separate case, so we'll just ignore such a situation. We need two files (so that your code looks clean and all the mess is in a separate file) - here's the, let's say, ./lib/create-fs-with-default.js file:
const fs = require("fs");

module.exports = // or export default if you use es6 modules
  function(filepath, def = "Could not read file") {
    // We open the file normally
    const _in = fs.createReadStream(filepath);
    // We'll need a list of targets later on
    let _piped = [];
    // Here's a handler that ends all piped outputs with the default value.
    const _handler = (e) => {
      if (!_piped.length) {
        throw e;
      }
      _piped.forEach(
        out => out.end(def)
      );
    };
    _in.once("error", _handler);
    // We keep the original `pipe` method in a variable
    const _orgPipe = _in.pipe;
    // And override it with our alternative version...
    _in.pipe = function(to, ...args) {
      const _out = _orgPipe.call(this, to, ...args);
      // ...which, apart from calling the original, also records the outputs
      _piped.push(_out);
      return _out;
    }
    // Optionally we could handle the `unpipe` method here.
    // Here we remove the handler once data flow has started.
    _in.once("data", () => _in.removeListener("error", _handler));
    // And pause the stream again so that the `data` listener doesn't consume the first chunk.
    _in.pause();
    // Finally we return the read stream
    return _in;
  };
Now it's just a small matter of using it:
const createReadStreamWithDefault = require("./lib/create-fs-with-default");

const inputStream = createReadStreamWithDefault(getTrgFilePath(), "Invalid input!");
// ... and at some point
inputStream.pipe(someOutput);
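To tie this back to Example 2 in the question, a small usage sketch (the file names are made up): if the path is invalid, the output file should end up containing the default text instead of the pipeline dying with an unhandled error.

const fs = require("fs");
const createReadStreamWithDefault = require("./lib/create-fs-with-default");

createReadStreamWithDefault("no/such/file.csv", "Invalid filepath!")
  .pipe(fs.createWriteStream("out.txt")); // out.txt should contain "Invalid filepath!" on failure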
And there you go.
M.

node slow and unresponsive with large data file

I've written a simple node program to parse an Excel-formatted HTML table returned from a corporate ERP, pull out the data, and save it as JSON.
This uses FS to open the file and Cheerio to extract the data.
The program works fine for small files (<10MB) but takes many minutes for large files (>30MB).
The data file I'm having trouble with is 38MB and has about 30,0000 rows of data.
Question 1: shouldn't this be faster?
Question 2: I can only get one console.log statement to output. I can put one statement anywhere and it works; if I add more than one, only the first one outputs anything.
var fs = require('fs'); // for file system streaming

function oracleParse(file, callback) {
  var headers = []; // array to store the data table column headers
  var myError;      // module error holder
  var XMLdata = []; // array to store the parsed XML data to be returned
  var cheerio = require('cheerio');

  // open relevant file
  var reader = fs.readFile(file, function (err, data) {
    if (err) {
      myError = err; // catch errors returned from file open
    } else {
      $ = cheerio.load(data); // load data returned from fs into cheerio for parsing

      // the data returned from Oracle consists of a variable number of tables, however the last one is
      // always the one that contains the data. We can select this with cheerio and reset the cheerio $ object
      var dataTable = $('table').last();
      $ = cheerio.load(dataTable);

      // table column headers in the table of data returned from Oracle include headers under 'tr td b' elements.
      // We extract these headers and load them into the 'headers' array for future use as keys in the JSON
      // data array to be constructed
      $('tr td b').each(function (i, elem) {
        headers.push($(this).text());
      });

      // remove the headers from the cheerio data object so that they don't interfere with the data
      $('tr td b').remove();

      // for the actual data, each row of data (this corresponds to a customer, account, transaction record etc.) is
      // extracted using cheerio and stored in a key/value object. These objects are then stored in an array
      var dataElements = [];
      var dataObj = {};
      var headersLength = headers.length;
      var headerNum;

      // the actual data is returned from Oracle in 'tr td nobr' elements. Using cheerio, we can extract all of
      // these elements although they are not separated into individual rows. It is possible to return individual
      // rows using cheerio (e.g. 'tr') but this is very slow as cheerio needs to requery each subsequent row.
      // In our case, we simply select all data elements using the 'tr td nobr' selector and then iterate through
      // them, aligning them with the relevant key and grouping them into relevant rows by taking the modulus of
      // the element number returned and the number of headers there are.
      $('tr td nobr').each(function (i, elem) {
        headerNum = i % headersLength; // pick which column is associated with each element
        dataObj[headers[headerNum]] = $(this).text(); // build the row object
        // if we find the header number is equal to the header length less one, we have reached the end of
        // elements for the row and push the row object onto the array in which we store the final result
        if (headerNum === headersLength - 1) {
          XMLdata.push(dataObj);
          dataObj = {};
        }
      });

      console.log(headersLength);

      // once all the data in the file has been parsed, run the callback function passed in
      callback(JSON.stringify(XMLdata));
    }
  });

  return myError;
}

// parse promo dates data
var file = './data/Oracle/signups_01.html';
var output = './data/Oracle/signups_01.JSON';
//var file = './data/Oracle/detailed_data.html';
//var output = './data/Oracle/detailed_data.JSON';

var test = oracleParse(file, function(data) {
  fs.writeFile(output, data, function(err) {
    if (err) throw err;
    console.log('File write complete: ' + output);
  });
});

console.log(test);
You might want to check out a streaming solution like substack's trumpet or (shameless self-plug) cornet. Otherwise, you're traversing the document multiple times, which will always take some time.
My guess is that Chrome defers heavy lifting intelligently - you probably only care about the first couple of rows, so that's what you get. Try including jQuery and running your code; it will still take some time. To be fair, Chrome's DOM isn't garbage collected and therefore will always outperform cheerio.
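To make the streaming suggestion concrete, here is a rough sketch using htmlparser2 (the parser cheerio itself sits on top of) rather than trumpet or cornet, so treat it as an illustration of the approach, not those libraries' APIs. It also simplifies the question's logic: it treats every 'b' cell as a header and every 'nobr' cell as data, and ignores the "only the last table" detail:

const fs = require('fs');
const { Parser } = require('htmlparser2');

const headers = [];
const rows = [];
let currentRow = {};
let inBold = false; // inside a 'b' element => header cell text
let inNobr = false; // inside a 'nobr' element => data cell text
let cellIndex = 0;

const parser = new Parser({
  onopentag(name) {
    if (name === 'b') inBold = true;
    if (name === 'nobr') inNobr = true;
  },
  ontext(text) {
    if (inBold) {
      headers.push(text.trim());
    } else if (inNobr && headers.length) {
      currentRow[headers[cellIndex % headers.length]] = text.trim();
      if (cellIndex % headers.length === headers.length - 1) {
        rows.push(currentRow);
        currentRow = {};
      }
      cellIndex++;
    }
  },
  onclosetag(name) {
    if (name === 'b') inBold = false;
    if (name === 'nobr') inNobr = false;
  },
});

// Feed the file through the parser in chunks instead of loading a whole DOM.
// Note: a text node can be split across chunk boundaries; a production version
// would buffer cell text until the closing tag.
fs.createReadStream('./data/Oracle/signups_01.html')
  .on('data', (chunk) => parser.write(chunk.toString()))
  .on('end', () => {
    parser.end();
    console.log(rows.length + ' rows parsed');
  });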

Parse output of spawned node.js child process line by line

I have a PhantomJS/CasperJS script which I'm running from within a node.js script using child_process.spawn(). Since CasperJS doesn't support require()ing modules, I'm trying to print commands from CasperJS to stdout and then read them in from my node.js script using spawn.stdout.on('data', function(data) {}); in order to do things like add objects to redis/mongoose (convoluted, yes, but seems more straightforward than setting up a web service for this...). The CasperJS script executes a series of commands and creates, say, 20 screenshots which need to be added to my database.
However, I can't figure out how to break the data variable (a Buffer?) into lines... I've tried converting it to a string and then doing a replace; I've tried spawn.stdout.setEncoding('utf8'); but nothing seems to work...
Here is what I have right now:
var spawn = require('child_process').spawn;
var bin = "casperjs";
// googlelinks.js is the example given at http://casperjs.org/#quickstart
var args = ['scripts/googlelinks.js'];

var cspr = spawn(bin, args);

//cspr.stdout.setEncoding('utf8');
cspr.stdout.on('data', function (data) {
  var buff = new Buffer(data);
  console.log("foo: " + buff.toString('utf8'));
});

cspr.stderr.on('data', function (data) {
  data += '';
  console.log(data.replace("\n", "\nstderr: "));
});

cspr.on('exit', function (code) {
  console.log('child process exited with code ' + code);
  process.exit(code);
});
https://gist.github.com/2131204
Try this:
cspr.stdout.setEncoding('utf8');
cspr.stdout.on('data', function(data) {
  var str = data.toString(), lines = str.split(/(\r?\n)/g);
  for (var i = 0; i < lines.length; i++) {
    // Process the line, noting it might be incomplete.
  }
});
Note that the "data" event might not necessarily break evenly between lines of output, so a single line might span multiple data events.
I've actually written a Node library for exactly this purpose, it's called stream-splitter and you can find it on Github: samcday/stream-splitter.
The library provides a special Stream you can pipe your casper stdout into, along with a delimiter (in your case, \n), and it will emit neat token events, one for each line it has split out from the input Stream. The internal implementation for this is very simple, and delegates most of the magic to substack/node-buffers which means there's no unnecessary Buffer allocations/copies.
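A rough sketch of how the wiring typically looks, based on the project's README; treat the exact event names and the encoding property as assumptions and double-check them against the repo:

var StreamSplitter = require("stream-splitter");

var splitter = cspr.stdout.pipe(StreamSplitter("\n"));
splitter.encoding = "utf8"; // emit tokens as strings instead of Buffers (assumed option)

splitter.on("token", function(line) {
  // one complete line of CasperJS output at a time
  console.log("line: " + line);
});

splitter.on("done", function() {
  // the child's stdout has ended
});

splitter.on("error", function(err) {
  console.error(err);
});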
I found a nicer way to do this with just pure node, which seems to work well:
const childProcess = require('child_process');
const readline = require('readline');

const cspr = childProcess.spawn(bin, args);
const rl = readline.createInterface({ input: cspr.stdout });

rl.on('line', (line) => { /* handle line here */ });
Adding to maerics' answer, which does not deal properly with cases where only part of a line is fed in a data dump (theirs will give you the first part and the second part of the line individually, as two separate lines.)
var _breakOffFirstLine = /\r?\n/
function filterStdoutDataDumpsToTextLines(callback){ // returns a function that takes chunks of stdout data, aggregates them, and passes complete lines one by one through to callback, as soon as it gets them.
  var acc = ''
  return function(data){
    var splitted = data.toString().split(_breakOffFirstLine)
    var inTactLines = splitted.slice(0, splitted.length-1)
    if (inTactLines.length) {
      inTactLines[0] = acc + inTactLines[0] // if there was a partial, unended line in the previous dump, it is completed by the first section.
      acc = ''
    }
    acc += splitted[splitted.length-1] // if there is a partial, unended line in this dump, store it to be completed by the next (we assume there will be a terminating newline at some point. This is, generally, a safe assumption.)
    for(var i=0; i<inTactLines.length; ++i){
      callback(inTactLines[i])
    }
  }
}
usage:
cspr.stdout.on('data', filterStdoutDataDumpsToTextLines(function(line){
  // each time this inner function is called, you will be getting a single, complete line of the stdout ^^
}))
You can give this a try. It will ignore any empty lines or empty new line breaks.
cspr.stdout.on('data', (data) => {
  data = data.toString().split(/(\r?\n)/g);
  data.forEach((item, index) => {
    if (data[index] !== '\n' && data[index] !== '') {
      console.log(data[index]);
    }
  });
});
Old stuff but still useful...
I have made a custom stream Transform subclass for this purpose.
See https://stackoverflow.com/a/59400367/4861714
#nyctef's answer uses an official nodejs package.
Here is a link to the documentation: https://nodejs.org/api/readline.html
The node:readline module provides an interface for reading data from a Readable stream (such as process.stdin) one line at a time.
My personal use-case is parsing json output from the "docker watch" command created in a spawned child_process.
const dockerWatchProcess = spawn(...)
...
const rl = readline.createInterface({
  input: dockerWatchProcess.stdout,
  output: null,
});

rl.on('line', (log: string) => {
  console.log('dockerWatchProcess event::', log);
  // code to process a change to a docker event
  ...
});

Resources