Node CSV pull parser - node.js

I need to parse a CSV document from Node.js, performing database operations for each record (i.e. each line). However, I'm having trouble finding a suitable CSV parser that uses a pull approach, or at least a push approach that waits for my record operations to finish before parsing the next row.
I've looked at csv-parse, csvtojson and csv-streamify, but they all seem to push events in a continuous stream without any flow control. If I parse a 1000-line CSV document, I basically get all 1000 callbacks in quick succession. For each record, I perform an operation that returns a promise. Currently I've had to resort to pushing all my promises into an array, and after getting the done/end event I also wait for Promise.all(myOperations) to know when the document has been fully processed. But this is not very nice, and I'd also prefer to parse one line at a time and fully process it before getting the next record, instead of processing all records concurrently: it's hard to debug and uses a lot of memory compared to simply dealing with each record sequentially.
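For reference, the current workaround looks roughly like this (a sketch; processRecord() is a stand-in for my per-record database operation, which returns a promise):

const csvtojson = require('csvtojson');

const myOperations = [];
csvtojson()
  .fromStream(is) // `is` is the input CSV stream
  .on('data', data => {
    // Depending on the parser/version, `data` may be a parsed object or a JSON string/Buffer
    myOperations.push(processRecord(data));
  })
  .on('end', () => {
    // Resolves only once every record operation has settled
    Promise.all(myOperations).then(() => console.log('Fully processed'));
  });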
So, is there a CSV parser that supports pull mode, or a way to get any stream-based CSV parser (preferably csvtojson as that's the one I'm using at the moment) to only produce events for new records when my handler for the previous record is finished (using promises)?

I solved this myself by creating my own Writable and piping the CSV parser into it. My _write() method does its work and then signals completion via the Node-style callback passed to _write(), here by attaching it to a promise with Q's nodeify:
const stream = require('stream');
const Q = require('q');

class CsvConsumer extends stream.Writable {
  _write(data, encoding, cb) {
    console.log('Got data: ', data);
    Q.delay(1000).then(() => {
      console.log('Waited 1 s');
    }).nodeify(cb);
  }
}
const csvtojson = require('csvtojson');

csvtojson()
  .fromStream(is)
  .pipe(new CsvConsumer())
  .on('error', err => {
    console.log('Error!', err);
  })
  .on('finish', () => {
    console.log('Done!');
  });
This will process lines one by one:
Got data: {"a": "1"}
Waited 1 s
Got data: {"a": "2"}
Waited 1 s
Got data: {"a": "3"}
Waited 1 s
Done!

If you want to process each line asynchronously you can do that with Node's built-in readline module:
const fs = require('fs');
const readline = require('readline');

const lineStream = readline.createInterface({
  input: fs.createReadStream('data/test.csv'),
});

lineStream.on('line', (eachLine) => {
  // process each line
});
If you want to process lines strictly one at a time, pausing until each one is done, you can use the line-by-line module. It doesn't buffer the entire file into memory, and it lets you pause and resume the emission of 'line' events:
const LineByLineReader = require('line-by-line');
const lr = new LineByLineReader('data/test.csv');

lr.on('line', function (line) {
  // pause emitting of lines...
  lr.pause();

  // ...do your asynchronous line processing...
  setTimeout(function () {
    // ...and continue emitting lines (1 s delay).
    lr.resume();
  }, 1000);
});
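On recent Node versions, the readline interface is also async-iterable, which gives the same one-line-at-a-time behavior without manual pause()/resume() bookkeeping (a sketch; processLine() is a placeholder for your own async per-line work):

const fs = require('fs');
const readline = require('readline');

async function processFile() {
  const rl = readline.createInterface({
    input: fs.createReadStream('data/test.csv'),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    // The loop body finishes before the next line is consumed
    await processLine(line);
  }
  console.log('Done!');
}

processFile().catch(console.error);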

Related

Node Unzipper - how to know it's finished

I'm trying to use the unzipper node module to extract and process a number of files (exact number is unknown). However, I can't seem to figure out how to know when all the files are processed. So far, my code looks like this:
s3.getObject(params).createReadStream()
  .pipe(unzipper.Parse())
  .on('entry', async (entry) => {
    var fileName = entry.path;
    if (fileName.match(someRegex)) {
      await processEntry(entry);
      console.log("Uploaded");
    } else {
      entry.autodrain();
      console.log("Drained");
    }
  });
I'm trying to figure out how to know that unzipper has gone through all the files (i.e., no more entry events are forthcoming) and all the entry handlers have finished so that I know I've finished processing all the files I care about.
I've tried experimenting with the close and finish events, but in my tests they both fire before console.log("Uploaded"); has printed, so that doesn't seem right.
Help?
Directly from the docs:
The parser emits finish and error events like any other stream. The parser additionally provides a promise wrapper around those two events to allow easy folding into existing Promise-based structures.
Example:
const fs = require('fs');
const unzipper = require('unzipper');

fs.createReadStream('path/to/archive.zip')
  .pipe(unzipper.Parse())
  .on('entry', entry => entry.autodrain())
  .promise()
  .then(() => console.log('done'), e => console.log('error', e));
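To also wait for your own entry handlers (the processEntry() calls from the question), one option is to collect those promises and await them after the parser's promise() resolves; a sketch, reusing s3, params, someRegex and processEntry from the question:

async function processArchive() {
  const pending = [];

  await s3.getObject(params).createReadStream()
    .pipe(unzipper.Parse())
    .on('entry', entry => {
      if (entry.path.match(someRegex)) {
        // Track each upload so we can wait for it once the archive is fully read
        pending.push(processEntry(entry).then(() => console.log('Uploaded')));
      } else {
        entry.autodrain();
        console.log('Drained');
      }
    })
    .promise();

  // All entries have been seen; now wait for the per-entry work to finish
  await Promise.all(pending);
  console.log('All files processed');
}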

Understanding fs.writeFileSync(path, data[, options]) Node.js

I have a function that writes data to a file then uploads that file to cloud storage. The file isn't finished writing before it starts uploading so I am getting a partial file in cloud storage. I found that fs.writeFileSync(path, data[, options]) could help, but I am not exactly sure how it works.
It is my understanding that node runs asynchronously, and I have several async processes running prior to this portion of code. I understand what synchronous vs asynchronous means, but I am having a little trouble understanding how it plays out in this example. Here are my questions if I replace the code below with fs.writeFileSync(path, data[, options]):
What do the docs mean by "Synchronously append data to a file"?
a. Will the next lines of code be halted until fs.writeFileSync(path, data) is finished?
b. Are previous asynchronous processes halted by this line of code?
If other async processes are not affected, how is writeFileSync different from writeFile?
Is there a callback feature in writeFileSync that I am misunderstanding?
Code for reference
outCsv = "x","y","z"
filename = "file.csv"
fs.writeFile(filename, outCsv, function (err) {
  if (err) {
    return console.log(err);
  }
  console.log('The file was saved!');
  bucket.upload(filename, (err, file) => {
    if (err) {
      return console.log(err);
    }
    console.log('The file was uploaded!');
  });
});
Will the next lines of code be halted until the fs.writeFileSync(path, data) is finished?
Yes. It is a blocking operation. (Note that this assumes fs.writeFileSync actually finishes rather than throwing.)
Are previous asynchronous processes halted by this line of code?
Kinda. Since JavaScript is single-threaded, their callbacks will also not run while the file is being written; they will queue up and run on a later tick of the event loop.
If other async processes are not affected, how is writeFileSync different from writeFile?
It blocks any code that comes after it. For an easier example consider the following:
setTimeout(() => console.log('3'), 5);
console.log('1'); // fs.writeFileSync
console.log('2');
vs
setTimeout(() => console.log('3'), 5);
setTimeout(() => console.log('1'), 0); // fs.writeFile
console.log('2');
The first will print 1 2 3 because the call to console.log blocks what comes after. The second will print 2 1 3 because the setTimeout is non-blocking. The code that prints 3 isn't affected either way: 3 will always come last.
Is there a callback feature in writeFileSync that I am misunderstanding?
writeFileSync doesn't take a callback, so there is no callback feature to misunderstand; it simply returns (or throws) once the write has completed.
This all raises the question of why you would prefer fs.writeFile over the sync alternative. The answer is this:
The sync version blocks. While it's taking however long it takes, your webserver (for example) isn't handling requests.
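For the original problem (the upload starting before the write has finished), the non-blocking fix is simply to sequence the two steps, e.g. with the promise-based fs API; a sketch, assuming bucket.upload takes a callback as in the question:

const fs = require('fs');

async function saveAndUpload(filename, outCsv) {
  // Resolves only once the file has been fully written
  await fs.promises.writeFile(filename, outCsv);
  console.log('The file was saved!');

  // Wrap the callback-style upload in a promise so we can await it too
  await new Promise((resolve, reject) => {
    bucket.upload(filename, (err, file) => (err ? reject(err) : resolve(file)));
  });
  console.log('The file was uploaded!');
}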

Difficulty processing CSV file, browser timeout

I was asked to import a csv file from a server daily and map the respective headers to the appropriate fields in mongoose.
My first idea was to make it to run automatically with a scheduler using the cron module.
const CronJob = require('cron').CronJob;
const fs = require("fs");
const csv = require("fast-csv");

new CronJob('30 2 * * *', async function() {
  await parseCSV();
  this.stop();
}, function() {
  this.start();
}, true);
Next, the parseCSV() function code is as follows (I have simplified some of the data):
function parseCSV() {
  let buffer = [];
  let stream = fs.createReadStream("data.csv");

  csv.fromStream(stream, { headers: ["lot", "order", "cwotdt"], trim: true })
    .on("data", async (data) => {
      let row = { "order": data.order, "lot": data.lot, "date": data.cwotdt };
      // Only add products that fulfil the following condition
      if (data.cwotdt !== "000000") {
        let product = { "order": data.order, "lot": data.lot };
        // Check whether the product exists in the database or not
        await db.Product.find(product, function (err, foundProduct) {
          if (foundProduct && foundProduct.length !== 0) {
            console.log("Product exists");
          } else {
            buffer.push(product);
            console.log("Product does not exist");
          }
        });
      }
    })
    .on("end", function () {
      db.Product.find({}, function (err, productAvailable) {
        // Check whether the database is empty or not
        if (productAvailable.length !== 0) {
          // console.log("Database Exists");
          // Add subsequent data onward
          db.Product.insertMany(buffer);
          buffer = [];
        } else {
          // Add for the first time
          db.Product.insertMany(buffer);
          buffer = [];
        }
      });
    });
}
It is not a problem if there are just a few rows in the csv file, but at only around 2k rows I already ran into trouble. The culprit is the if-condition check in the "data" event handler: every single row needs a database query to see whether the data is already there or not.
The reason I'm doing this is that the csv file will have new data added to it, and I need to add all the data the first time if the database is empty, or else look at every single row and only add the new data into mongoose.
The 1st approach I took from here (as in the code) was using async/await to make sure that all the data had been read before proceeding to the end event handler. This helps, but I see from time to time (with mongoose.set("debug", true);) that some data is being queried twice, and I have no idea why.
The 2nd approach was not to use async/await. The downside is that the data was not fully queried: execution proceeded straight to the end event handler, which then ran insertMany with only the part of the data that had made it into the buffer.
If I stick with the current approach it is not an issue, but the queries take 1 to 2 minutes, and even longer as the database keeps growing. During those few minutes of querying, the event queue is blocked, so requests to the server time out.
I used stream.pause() and stream.resume() before this code, but I can't get it to work: it just jumps straight to the end event handler. This causes the buffer to be empty every single time, since the end event handler runs before the data event handler.
I can't remember the links that I used, but the fundamentals I picked up came from this:
Import CSV Using Mongoose Schema
I saw these threads:
Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
Can't populate big chunk of data to mongodb using Node.js
to be similar to what I need, but they're a bit too complicated for me to understand what is going on. It seems like using sockets or a child process, maybe? Furthermore, I still need to check conditions before adding into the buffer.
Anyone care to guide me on this?
Edit: await is removed from console.log as it is not asynchronous
Forking a child process approach:
When the web service gets a request with a CSV data file, save it somewhere in the app.
Fork a child process -> child process example
Pass the file path to the child process to run the insert checks (see the sketch below).
When the child process finishes processing the csv file, delete the file.
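A minimal sketch of that flow (processCsv.js is a hypothetical worker script that receives the file path via process.argv and does the parse + insert checks):

// parent.js
const { fork } = require('child_process');

function processCsvInChild(filePath) {
  return new Promise((resolve, reject) => {
    const child = fork('processCsv.js', [filePath]);
    child.on('error', reject);
    child.on('exit', code =>
      code === 0 ? resolve() : reject(new Error('child exited with code ' + code)));
  });
}

// processCsv.js (worker) would read process.argv[2], run the CSV parsing and
// database checks there, and delete the file before exiting.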
Like what Joe said, indexing the DB would speed up the processing time by a lot when there are lots(millions) of tuples.
If you create an index on order and lot, the query should be very fast.
db.Product.createIndex({ order: 1, lot: 1 })
Note: This is a compound index and may not be the ideal solution. Index strategies
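In mongoose terms, the equivalent is declared on the schema (a sketch, assuming a schema named productSchema):

// Compound index on order + lot, built by mongoose when the model is compiled
productSchema.index({ order: 1, lot: 1 });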
Also, your await on console.log is weird. That may be causing your timing issues; console.log is not async. Additionally, the function is not marked async.
// removing await from console.log
let product = { "order": data.order, "lot": data.lot };
// Check whether the product exists in the database or not
await db.Product.find(product, function (err, foundProduct) {
  if (foundProduct && foundProduct.length !== 0) {
    console.log("Product exists");
  } else {
    buffer.push(product);
    console.log("Product does not exist");
  }
});
I would try removing the await on console.log (that may be a red herring if console.log is only there for Stack Overflow and is hiding the actual async method; if so, be sure to mark the function as async).
Lastly, if the problem still exists, I would look into a two-tiered approach (sketched below):
Insert all lines from the CSV file into a mongo collection.
Process that mongo collection after the CSV has been parsed, taking the CSV out of the equation.
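Roughly like this (a sketch; StagingProduct is a hypothetical model holding the raw CSV rows):

// Tier 1: dump every CSV row into a staging collection as fast as possible
async function loadStaging(rows) {
  await db.StagingProduct.insertMany(rows, { ordered: false });
}

// Tier 2: reconcile staging against the real collection afterwards, row by row,
// without holding up the CSV parsing
async function reconcile() {
  const cursor = db.StagingProduct.find({ date: { $ne: "000000" } }).cursor();
  for await (const row of cursor) {
    const product = { order: row.order, lot: row.lot };
    const exists = await db.Product.exists(product);
    if (!exists) {
      await db.Product.create(product);
    }
  }
  await db.StagingProduct.deleteMany({});
}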

Can an event based read function ever run out of order?

Given a situation where I use the nodejs readline library to iterate over each line in the STDIN stream, do some processing on it and write it back out to STDOUT as in the following example:
var readline = require('readline');

var rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: false
});

function my_function(line) {
  var output = ...(line);
  process.stdout.write(output);
}

rl.on('line', my_function);
I'm concerned that the processing I'm doing will take very different amounts of time depending on the line content, so some lines will return very quickly while others take some time to sort out. Is it possible that my_function() will ever run out of order and hence cause the output stream to be scrambled? Should I be looking into using a synchronous loop of some kind instead of this asynchronous event handler?
The JavaScript execution itself is single-threaded, so as long as you're only performing synchronous operations inside the event handler, there is no problem.
If you are performing asynchronous operations inside the event handler, then it is possible that another 'line' event could be emitted before your asynchronous operation(s) are complete. In that case, you would need to rl.pause() first and then rl.resume() once you are finished with your asynchronous operations. However, this isn't foolproof since 'line' events could still be emitted after a rl.pause() if the current chunk of data read from the input stream had multiple line breaks.
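A sketch of that pause/resume pattern (doAsyncWork() is a placeholder for your own asynchronous processing):

rl.on('line', line => {
  rl.pause(); // suppress (most) further 'line' events while we work
  doAsyncWork(line)
    .then(output => process.stdout.write(output))
    .catch(console.error)
    .then(() => rl.resume()); // allow the next line(s) through
});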
So if you are performing asynchronous operations inside the event handler, you are probably better off just reading from the stream yourself so that you have more control over the parsing behavior. This is actually pretty easy to do, for example:
function parseStream(stream, callback) {
  // Assuming all stream data is text and not binary ...
  var buffer = '';
  var RE_EOL = /\r?\n/;

  stream.on('data', function(data) {
    buffer += data;
    processBuffer();
  });
  stream.on('end', callback);
  stream.on('error', callback);

  function processBuffer() {
    var m = RE_EOL.exec(buffer);
    if (m) {
      // Found a line ending
      var line = buffer.slice(0, m.index);
      buffer = buffer.slice(m.index + m[0].length);
      stream.pause();
      callback(null, line, processBuffer);
    } else {
      stream.resume();
    }
  }
}
// ...
parseStream(process.stdin, function(err, line, done) {
  if (err) throw err;
  if (line === undefined) {
    // No more data will be available (stream ended)
    console.log('(Stream ended!)');
    return;
  }
  // Do something with `line`
  console.log(line);
  // Call `done()` whenever your async operation(s) are all finished
  done();
});

How to access data only once ALL of it is ready

I have seen some code such as this:
.on('error', console.error)
.on('data', function (data) {})
.on('info', function(info) {})
.on('end', function() {
// All data retrieved.
});
I have read some docs about streams, but am having trouble understanding them. Say I only want to do the operations once all the data is received (not partial). How can I do this? I would think I would have to read the data object inside of the 'end' function, but the data object is not accessible from there.
From my understanding, if I put some logic inside of the 'data' function I could be operating on incomplete data? Is this true? Say the data is a list of friends (some lists have 1 friend, some can have 10,000, so the size of the data returned will differ). How can I do the operation only once ALL the friends are returned, no matter the size of the data coming back?
The data handler will usually be called multiple times, each time with a fraction of the complete data.
If you want to perform an action once with all data, the usual way is as follows:
Buffer all the items received in the data handler in some variable (e.g. push them onto an array) and perform your final action in the end handler (although the idea of a stream is, naturally, to "act" on data right away).
var allData = [];

stream
  .on('error', console.error)
  .on('data', function (data) {
    allData.push(data);
  })
  .on('info', function(info) {})
  .on('end', function() {
    // TODO do something more intelligent,
    // where buffering in memory makes sense
    console.log(allData.join());
  });
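If the stream is a binary one (chunks are Buffers rather than objects or strings), the same buffering idea is usually written with Buffer.concat, e.g.:

var chunks = [];
stream
  .on('error', console.error)
  .on('data', function (chunk) {
    chunks.push(chunk);
  })
  .on('end', function () {
    var all = Buffer.concat(chunks); // the complete payload
    // parse/use `all` here, e.g. all.toString('utf8')
  });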
