node slow and unresponsive with large data file - node.js

I've written a simple node program to parse an excel formatted HTML table returned from a corporate ERP, pull out the data, and save it as JSON.
This uses FS to open the file and Cheerio to extract the data.
The program works fine for small files (<10MB) but takes many minutes for large files (>30MB).
The data file I'm having trouble with is 38MB and has about 30,000 rows of data.
Question 1: shouldn't this be faster?
Question 2: I can only get one console.log statement to output. I can put one statement anywhere and it works, if I add more than one, only the first one outputs anything.
var fs = require('fs'); // for file system streaming

function oracleParse(file, callback) {
    var headers = []; // array to store the data table column headers
    var myError; // module error holder
    var XMLdata = []; // array to store the parsed XML data to be returned
    var cheerio = require('cheerio');

    // open relevant file
    var reader = fs.readFile(file, function (err, data) {
        if (err) {
            myError = err; // catch errors returned from file open
        } else {
            $ = cheerio.load(data); // load data returned from fs into cheerio for parsing

            // the data returned from Oracle consists of a variable number of tables; however, the last one is
            // always the one that contains the data. We can select this with cheerio and reset the cheerio $ object
            var dataTable = $('table').last();
            $ = cheerio.load(dataTable);

            // table column headers in the table of data returned from Oracle are found under 'tr td b' elements.
            // We extract these headers and load them into the 'headers' array for future use as keys in the JSON
            // data array to be constructed
            $('tr td b').each(function (i, elem) {
                headers.push($(this).text());
            });

            // remove the headers from the cheerio data object so that they don't interfere with the data
            $('tr td b').remove();

            // for the actual data, each row of data (this corresponds to a customer, account, transaction record etc.) is
            // extracted using cheerio and stored in a key/value object. These objects are then stored in an array
            var dataElements = [];
            var dataObj = {};
            var headersLength = headers.length;
            var headerNum;

            // the actual data is returned from Oracle in 'tr td nobr' elements. Using cheerio, we can extract all of
            // these elements, although they are not separated into individual rows. It is possible to return individual
            // rows using cheerio (e.g. 'tr') but this is very slow as cheerio needs to requery each subsequent row.
            // In our case, we simply select all data elements using the 'tr td nobr' selector and then iterate through
            // them, aligning them with the relevant key and grouping them into relevant rows by taking the modulus of
            // the element number returned and the number of headers there are.
            $('tr td nobr').each(function (i, elem) {
                headerNum = i % headersLength; // pick which column is associated with each element
                dataObj[headers[headerNum]] = $(this).text(); // build the row object
                // if we find the header number is equal to the header length less one, we have reached the end of
                // elements for the row and push the row object onto the array in which we store the final result
                if (headerNum === headersLength - 1) {
                    XMLdata.push(dataObj);
                    dataObj = {};
                }
            });

            console.log(headersLength);

            // once all the data in the file has been parsed, run the callback function passed in
            callback(JSON.stringify(XMLdata));
        }
    });

    return myError;
}

// parse promo dates data
var file = './data/Oracle/signups_01.html';
var output = './data/Oracle/signups_01.JSON';
//var file = './data/Oracle/detailed_data.html';
//var output = './data/Oracle/detailed_data.JSON';

var test = oracleParse(file, function (data) {
    fs.writeFile(output, data, function (err) {
        if (err) throw err;
        console.log('File write complete: ' + output);
    });
});

console.log(test);

You might want to check out a streaming solution like substack's trumpet or (shameless self-plug) cornet. Otherwise, you're traversing the document multiple times, which will always take some time.
My guess is that Chrome defers heavy lifting intelligently - you probably only care about the first couple of rows, so that's what you get. Try including jQuery and running your code; it will still take some time. To be fair, Chrome's DOM isn't garbage collected and therefore will always outperform cheerio.
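For reference, here is a rough single-pass sketch using htmlparser2 (the parser cheerio is built on), rather than trumpet or cornet. The tag layout is an assumption based on the selectors above ('tr td b' for headers, 'tr td nobr' for data), and unlike the original it walks every table in the file rather than only the last one, so treat it as a starting point, not a drop-in replacement:

var fs = require('fs');
var htmlparser = require('htmlparser2');

function streamingOracleParse(file, callback) {
    var headers = [];
    var rows = [];
    var currentRow = {};
    var cellIndex = 0;
    var inHeader = false; // inside a 'b' element (column header)
    var inCell = false;   // inside a 'nobr' element (data cell)
    var buffer = '';

    var parser = new htmlparser.Parser({
        onopentag: function (name) {
            if (name === 'b') { inHeader = true; buffer = ''; }
            else if (name === 'nobr') { inCell = true; buffer = ''; }
        },
        ontext: function (text) {
            if (inHeader || inCell) buffer += text;
        },
        onclosetag: function (name) {
            if (name === 'b' && inHeader) {
                headers.push(buffer.trim());
                inHeader = false;
                inCell = false; // a header wrapped in a nobr shouldn't also count as data
            } else if (name === 'nobr' && inCell && headers.length > 0) {
                currentRow[headers[cellIndex % headers.length]] = buffer.trim();
                if (cellIndex % headers.length === headers.length - 1) {
                    rows.push(currentRow);
                    currentRow = {};
                }
                cellIndex++;
                inCell = false;
            }
        }
    });

    // feed the file through the parser chunk by chunk instead of loading it whole
    fs.createReadStream(file)
        .on('data', function (chunk) { parser.write(chunk.toString()); })
        .on('end', function () {
            parser.end();
            callback(JSON.stringify(rows));
        });
}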

Related

How to read big csv file batch by batch in Nodejs?

I have a CSV file which contains more than 500k records. The fields of the CSV are
name
age
branch
Without loading the huge file into memory, I need to process all the records from it: read a few records, insert them into a collection, manipulate them, and then continue reading the remaining records. As I'm new to this, I'm unable to understand how it would work. If I try to print the batch, it prints buffered data. Will the code below work for my requirement? With that buffered value, how can I get the CSV records to insert and manipulate?
var stream = fs.createReadStream(csvFilePath)
    .pipe(csv())
    .on('data', (data) => {
        batch.push(data)
        counter++;
        if (counter == 100) {
            stream.pause()
            setTimeout(() => {
                console.log("batch in ", data)
                counter = 0;
                batch = []
                stream.resume()
            }, 5000)
        }
    })
    .on('error', (e) => {
        console.log("er ", e);
    })
    .on('end', () => {
        console.log("end");
    })
I have written some sample code showing how to work with streams.
You basically create a stream and process its chunks. A chunk is an object of type Buffer; to work on it as text, call toString().
I don't have a lot of time to explain more, but the comments should help out.
Also consider using a module, since CSV parsing has already been solved many times.
Hope this helps.
import * as fs from 'fs'
// end of line delimiter, system specific.
import { EOL } from 'os'

// the delimiter used in the csv
var delimiter = ','

// add your own implementation of parsing a portion of the text here.
const parseChunk = (text, index) => {
    // first chunk, the header is included here.
    if (index === 0) {
        // The first row will be the header. So take it
        var headerLine = text.substring(0, text.indexOf(EOL))
        // remove the header from the text for further processing.
        // also replace the new line character..
        text = text.replace(headerLine + EOL, '')
        // Do something with the header here..
    }
    // Now you have a part of the file to process without headers.
    // The csv parse function you need to figure out yourself. Best
    // is to use some module for that. there are plenty of edge cases
    // when parsing csv.
    // custom csv parser here => https://stackoverflow.com/questions/1293147/example-javascript-code-to-parse-csv-data
    // if the csv is well formatted it could be enough to use this
    var lines = text.split(EOL)
    for (var line of lines) {
        var values = line.split(delimiter)
        console.log('line received', values)
        // StoreToDb(values)
    }
}

// create the stream
const stream = fs.createReadStream('file.csv')
// variable to count the chunks, for knowing whether the header is included..
var chunkCount = 0

// handle the data event of the stream
stream.on('data', chunk => {
    // the stream sends you a Buffer
    // to have it as text, convert it to string
    const text = chunk.toString()
    // Note that chunks will be a fixed size
    // but mostly consist of multiple lines,
    parseChunk(text, chunkCount)
    // increment the count.
    chunkCount++
})

stream.on('end', () => {
    console.log('parsing finished')
})

stream.on('error', (err) => {
    // error, handle properly here, maybe roll back changes already made to the db
    // and parse again. You may also use the chunkCount to start the parsing
    // again and omit the first x chunks, so you can restart at a given point
    console.log('parsing error ', err)
})

Having difficulties with node.js res.send() loop

I'm attempting to write a very basic scraper that loops through a few pages and outputs all the data from each url to a single json file. The url structure goes as follows:
http://url/1
http://url/2
http://url/n
Each of the urls has a table, which contains information pertaining to the ID of the url. This is the data I am attempting to retrieve and store inside a json file.
I am still extremely new to this and having a difficult time moving forward. So far, my code looks as follows:
app.get('/scrape', function(req, res){
    var json;
    for (var i = 1163; i < 1166; i++){
        url = 'https://urlgoeshere.com' + i;
        request(url, function(error, response, html){
            if(!error){
                var $ = cheerio.load(html);
                var mN, mL, iD;
                var json = { mN: "", mL: "", iD: "" };
                $('html body div#wrap h2').filter(function(){
                    var data = $(this);
                    mN = data.text();
                    json.mN = mN;
                })
                $('table.vertical-table:nth-child(7)').filter(function(){
                    var data = $(this);
                    mL = data.text();
                    json.mL = mL;
                })
                $('table.vertical-table:nth-child(8)').filter(function(){
                    var data = $(this);
                    iD = data.text();
                    json.iD = iD;
                })
            }
            fs.writeFile('output' + i + '.json', JSON.stringify(json, null, 4), function(err){
                console.log('File successfully written! - Check your project directory for the output' + i + '.json file');
            })
        });
    }
    res.send(json);
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;
When I run the code as displayed above, the output within the output.json file only contains data for the last url. I presume that's because I attempt to save all the data within the same variable?
If I include res.send() inside the loop, so the data writes after each page, I receive the error that multiple headers cannot be sent.
Can someone provide some pointers as to what I'm doing wrong? Thanks in advance.
Ideal output I would like to see:
Page ID: 1
Page Name: First Page
Color: Blue
Page ID: 2
Page Name: Second Page
Color: Red
Page ID: n
Page Name: Nth Page
Color: Green
I can see a number of problems:
Your loop doesn't wait for the asynchronous operations in the loop, thus you do some things like res.send() before the asynchronous operations in the loop have completed.
Inappropriate use of cheerio's .filter().
Your json variable is constantly being overwritten so it only has the last data in it.
Your loop variable i would lose its value by the time you tried to use it in the fs.writeFile() statement.
Here's one way to deal with those issues:
const rp = require('request-promise');
const fsp = require('fs').promises;

app.get('/scrape', async function(req, res) {
    let data = [];
    for (let i = 1163; i < 1166; i++) {
        const url = 'https://urlgoeshere.com/' + i;
        try {
            const html = await rp(url);
            const $ = cheerio.load(html);
            const mN = $('html body div#wrap h2').first().text();
            const mL = $('table.vertical-table:nth-child(7)').first().text();
            const iD = $('table.vertical-table:nth-child(8)').first().text();
            // create object for this iteration of the loop
            const obj = {iD, mN, mL};
            // add this object to our overall array of all the data
            data.push(obj);
            // write a file specifically for this invocation of the loop
            await fsp.writeFile('output' + i + '.json', JSON.stringify(obj, null, 4));
            console.log('File successfully written! - Check your project directory for the output' + i + '.json file');
        } catch(e) {
            // stop further processing on an error
            console.log("Error scraping ", url, e);
            res.sendStatus(500);
            return;
        }
    }
    // send all the data we accumulated (in an array) as the final result
    res.send(data);
});
Things different in this code:
Switch over all variable declarations to let or const
Declare route handler as async so we can use await inside.
Use the request-promise module instead of request. It has the same features, but returns a promise instead of using a plain callback.
Use the promise-based fs module (in latest versions of node.js).
Use await in order to serialize our two asynchronous (now promise-returning) operations so the for loop will pause for them and we can have proper sequencing.
Catch errors and stop further processing and return an error status.
Accumulate an object of data for each iteration of the for loop into an array.
Change .filter() to .first().
Make the response to the request handler be a JSON array of data.
FYI, you can tweak the organization of the data in obj however you want, but the point here is that you end up with an array of objects, one for each iteration of the for loop.
EDIT Jan, 2020 - request() module in maintenance mode
FYI, the request module and its derivatives like request-promise are now in maintenance mode and will not be actively developed to add new features. You can read more about the reasoning here. There is a list of alternatives in this table with some discussion of each one. I have been using got() myself and it's built from the beginning to use promises and is simple to use.
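As a rough illustration only (assuming a got v11-style API), the request-promise call inside the loop above could be swapped for got along these lines:

const got = require('got');

// inside the try block of the same async route handler:
const response = await got(url);        // resolves once the body has been received
const $ = cheerio.load(response.body);  // response.body is the HTML string by default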

Using stream-combiner and Writable Streams (stream-adventure)

I'm working on nodeschool.io's stream-adventure. The challenge:
Write a module that returns a readable/writable stream using the
stream-combiner module. You can use this code to start with:
var combine = require('stream-combiner')

module.exports = function () {
    return combine(
        // read newline-separated json,
        // group books into genres,
        // then gzip the output
    )
}
Your stream will be written a newline-separated JSON list of science fiction
genres and books. All the books after a "type":"genre" row belong in that
genre until the next "type":"genre" comes along in the output.
{"type":"genre","name":"cyberpunk"}
{"type":"book","name":"Neuromancer"}
{"type":"book","name":"Snow Crash"}
{"type":"genre","name":"space opera"}
{"type":"book","name":"A Deepness in the Sky"}
{"type":"book","name":"Void"}
Your program should generate a newline-separated list of JSON lines of genres,
each with a "books" array containing all the books in that genre. The input
above would yield the output:
{"name":"cyberpunk","books":["Neuromancer","Snow Crash"]}
{"name":"space opera","books":["A Deepness in the Sky","Void"]}
Your stream should take this list of JSON lines and gzip it with
zlib.createGzip().
HINTS
The stream-combiner module creates a pipeline from a list of streams,
returning a single stream that exposes the first stream as the writable side and
the last stream as the readable side like the duplexer module, but with an
arbitrary number of streams in between. Unlike the duplexer module, each
stream is piped to the next. For example:
var combine = require('stream-combiner');
var stream = combine(a, b, c, d);
will internally do a.pipe(b).pipe(c).pipe(d) but the stream returned by
combine() has its writable side hooked into a and its readable side hooked
into d.
As in the previous LINES adventure, the split module is very handy here. You
can put a split stream directly into the stream-combiner pipeline.
Note that split can send empty lines too.
If you end up using split and stream-combiner, make sure to install them
into the directory where your solution file resides by doing:
`npm install stream-combiner split`
Note: when you test the program, the source stream is automatically inserted into the program, so it's perfectly fine to have split() as the first parameter in combine(split(), etc., etc.)
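Just to make that hint concrete, here is a minimal sketch of the pipeline shape (not the adventure's official solution): because combine() pipes each stream into the next, the middle stage must be readable as well as writable, which is what a stream.Transform gives you. The grouping logic itself is left as comments:

var combine = require('stream-combiner');
var split = require('split');
var zlib = require('zlib');
var stream = require('stream');

module.exports = function () {
    // A Transform stream is both writable (lines come in) and readable
    // (grouped genre lines go out), so it can sit in the middle of the pipeline.
    var grouper = new stream.Transform({ objectMode: true });

    grouper._transform = function (line, enc, next) {
        // skip empty lines from split(), JSON.parse the rest,
        // collect books under the current genre, and this.push()
        // a finished genre line whenever a new genre starts
        next();
    };

    grouper._flush = function (done) {
        // this.push() the final genre before the stream ends
        done();
    };

    return combine(split(), grouper, zlib.createGzip());
};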
I'm trying to solve this challenge without using the 'through' package.
My code:
var combiner = require('stream-combiner');
var stream = require('stream')
var split = require('split');
var zlib = require('zlib');

module.exports = function() {
    var ws = new stream.Writable({decodeStrings: false});
    function ResultObj() {
        name: '';
        books: [];
    }
    ws._write = function(chunk, enc, next) {
        if(chunk.length === 0) {
            next();
        }
        chunk = JSON.parse(chunk);
        if(chunk.type === 'genre') {
            if(currentResult) {
                this.push(JSON.stringify(currentResult) + '\n');
            }
            var currentResult = new ResultObj();
            currentResult.name = chunk.name;
        } else {
            currentResult.books.push(chunk.name);
        }
        next();
        var wsObj = this;
        ws.end = function(d) {
            wsObj.push(JSON.stringify(currentResult) + '\n');
        }
    }
    return combiner(split(), ws, zlib.createGzip());
}
My code does not work and returns 'Cannot pipe. Not readable'. Can someone point out to me where I'm going wrong?
Any other comments on how to improve are welcome too...

Add a mongo request into a file and archive this file

I'm having some trouble while trying to use streams with a MongoDB request. I want to:
Get the results from a collection
Put these results into a file
Put this file into an archive
I'm using the archiver package for the file compression. The file contains CSV-formatted values, so for each row I have to format them as CSV.
My function takes a res (output) parameter, which means that I can send the result to a client directly. For the moment, I can put the results into a file without streams. I think I'll run into memory trouble with a large amount of data, which is why I want to use streams.
Here is my code (with no stream)
function getCSV(res, query) {
    <dbRequest>.toArray(function(err, docs) {
        var csv = '';
        if (docs !== null) {
            for (var i = 0; i < docs.length; i++) {
                var line = '';
                for (var index in docs[i]) {
                    if (docs[i].hasOwnProperty(index) && (index !== '_id')) {
                        if (line !== '') line += ',';
                        line += docs[i][index];
                    }
                }
                console.log("line", line);
                csv += line += '\r\n';
            }
        }
    }.bind(this));
    fileManager.addToFile(csv);
    archiver.initialize();
    archiver.addToArchive(fileManager.getName());
    fileManager.deleteFile();
    archiver.sendToClient(res);
};
Once the CSV is complete, I add it to a file with a FileManager object. The latter handles file creation and manipulation. The addToArchive method adds the file to the current archive, and the sendToClient method sends the archive through the output (the res parameter of the function).
I'm using Express.js, so I call this method from a server request.
Sometimes the file contains data, sometimes it is empty; could you explain why?
I'd also like to understand how streams work and how I could implement them in my code.
Regards
I'm not quite sure why you're having issues with the data sometimes showing up, but here is a way to send it with a stream. A couple of points of info before the code:
.stream({transform: someFunction})
takes a stream of documents from the database and runs whatever data manipulation you want on each document as it passes through the stream. I put this function into a closure to make it easier to keep the column headers, as well as allow you to pick and choose which keys from the document to use as columns. This will allow you to use it on different collections.
Here is the function that runs on each document as it passes through:
// this is a closure containing knowledge of the keys you want to use,
// as well as whether or not to add the headers before the current line
function createTransformFunction(keys) {
var hasHeaders = false;
// this is the function that is run on each document
// as it passes through the stream
return function(document) {
var values = [];
var line;
keys.forEach(function(key) {
// explicitly use 'undefined'.
// if using !key, the number 0 would get replaced
if (document[key] !== "undefined") {
values.push(document[key]);
}
else {
values.push("");
}
});
// add the column headers only on the first document
if (!hasHeaders) {
line = keys.join(",") + "\r\n";
line += values.join(",");
hasHeaders = true;
}
else {
// add the line breaks at the beginning of each line
// to avoid having an extra line at the end
line = "\r\n";
line += values.join(",");
}
// return the document to the stream and move on to the next one
return line;
}
}
You pass that function into the transform option for the database stream. Now assuming you have a collection of people with the keys _id, firstName, lastName:
function (req, res) {
    // create a transform function with the keys you want to keep
    var transformPerson = createTransformFunction(["firstName", "lastName"]);
    // Create the mongo read stream that uses your transform function
    var readStream = personCollection.find({}).stream({
        transform: transformPerson
    });
    // write stream to file
    var localWriteStream = fs.createWriteStream("./localFile.csv");
    readStream.pipe(localWriteStream);
    // write stream to download
    res.setHeader("content-type", "text/csv");
    res.setHeader("content-disposition", "attachment; filename=downloadFile.csv");
    readStream.pipe(res);
}
If you hit this endpoint, you'll trigger a download in the browser and write a local file. I didn't use archiver because I think it would add a level of complexity and take away from the concept of what's actually happening. The streams are all there, you'd just need to fiddle with it a bit to work it in with archiver.
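If you do want the archive as well, archiver accepts streams directly, so working it in could look roughly like this (the zip format and the file names here are assumptions):

const archiver = require('archiver');

function downloadAsZip(req, res) {
    const transformPerson = createTransformFunction(["firstName", "lastName"]);
    const readStream = personCollection.find({}).stream({ transform: transformPerson });

    res.setHeader("content-type", "application/zip");
    res.setHeader("content-disposition", "attachment; filename=downloadFile.zip");

    const archive = archiver('zip');
    archive.pipe(res);                                   // archive output goes to the response
    archive.append(readStream, { name: 'people.csv' }); // stream the CSV straight into the archive
    archive.finalize();                                  // no more entries; flush the archive
}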

How to skip first lines of the file with node-csv parser?

Currently I'm using node-csv (http://www.adaltas.com/projects/node-csv/) for csv file parsing.
Is there a way to skip first few lines of the file before starting to parse the data? As some csv reports for example have report details in the first few lines before the actual headers and data start.
LOG REPORT <- data about the report
DATE: 1.1.1900
DATE,EVENT,MESSAGE <- data headers
1.1.1900,LOG,Hello World! <- actual data starts here
All you need to do is pass the argument {from_line: 2} inside the parse() function, like the snippet below:
const fs = require('fs');
const parse = require('csv-parse');

fs.createReadStream('path/to/file')
    .pipe(parse({ delimiter: ',', from_line: 2 }))
    .on('data', (row) => {
        // it will start from the 2nd row
        console.log(row)
    })
Assuming you're using v0.4 or greater with the new refactor (i.e. csv-generate, csv-parse, stream-transform, and csv-stringify), you can use the built-in transform to skip the first line, with a bit of extra work.
var fs = require('fs'),
    csv = require('csv');

var skipHeader = true; // config option

var read = fs.createReadStream('in.csv'),
    write = fs.createWriteStream('out.jsonish'),
    parse = csv.parse(),
    rowCount = 0, // to keep track of where we are
    transform = csv.transform(function(row, cb) {
        var result;
        if (skipHeader && rowCount === 0) { // if the option is turned on and this is the first line
            result = null; // pass null to cb to skip
        } else {
            result = JSON.stringify(row) + '\n'; // otherwise apply the transform however you want
        }
        rowCount++; // next time we're not at the first line anymore
        cb(null, result); // let node-csv know we're done transforming
    });

read
    .pipe(parse)
    .pipe(transform)
    .pipe(write).once('finish', function() {
        // done
    });
Essentially we track the number of rows that have been transformed, and if we're on the very first one (and we in fact wish to skip the header via the skipHeader bool), we pass null to the callback as the second param (the first one is always the error); otherwise we pass the transformed result.
This will also work with synchronous parsing, but requires a change since there are no callbacks in synchronous mode. Also, the same logic could be applied to the older v0.2 library since it also has row transforming built in.
See http://csv.adaltas.com/transform/#skipping-and-creating-records
This is pretty easy to apply, and IMO has a pretty low footprint. Usually you want to keep track of rows processed for status purposes, and I almost always transform the result set before sending it to a Writable, so it is very simple to just add in the extra logic to check for skipping the header. The added benefit here is that we're using the same module to apply the skipping logic as we are to parse/transform - no extra dependencies are needed.
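For the synchronous case mentioned above, a rough sketch (reusing skipHeader and rowCount from the snippet, and relying on stream-transform's convention that a one-argument handler runs synchronously and returns its result) would look like this:

transform = csv.transform(function(row) {
    // returning null tells stream-transform to skip the record
    var result = (skipHeader && rowCount === 0) ? null : JSON.stringify(row) + '\n';
    rowCount++;
    return result;
});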
You have two options here:
You can process the file line-by-line. I posted a code snippet in an answer earlier. You can use that
var readline = require('readline'); // needed for createInterface below

var rl = readline.createInterface({
    input: instream,
    output: outstream,
    terminal: false
});

rl.on('line', function(line) {
    console.log(line);
    // Do your stuff ...
    // Then write to outstream
    rl.write(line);
});
You can give an offset to your filestream which will skip those bytes. You can see it in the documentation
fs.createReadStream('sample.txt', {start: 90, end: 99});
This is much easier if you know the offset is fixed.
