Node.js: Performance issues parsing CSV and Zip

The files are submitted to my server, and I'm trying to determine whether the CSV is valid and whether all the images referenced from the CSV are present in the zip. I have to populate a Mongo database with all that information, but I want to do that in the background and send a response to the client as fast as possible.
So I have two readable streams, and I have tried three different approaches:
Unzipping the file takes 24 seconds, so unzip + parsing the CSV + fs.exists is not an option.
Parsing the whole CSV, saving the filenames in an array, and then reading the zip using node-unzip and pipe takes 5 seconds.
Reading the CSV and, in parallel, reading the zip and using a shared array to determine simultaneously whether the files are present is the fastest option and takes 4 seconds.
Does anyone have an idea of how to do it faster?
EDIT: The code used for validation is:
// lib/validator.js
function validateParallel(csv, zip) {
  const shared = {};
  return new Promise((resolve, reject) => {
    const l = CSV_VALIDATORS.length - 1;
    csv
      .pipe(split())
      .pipe(through2(validateLine.bind({ zip, reject, n: 0, l, shared })))
      .on('finish', () => {
        zip
          .pipe(unzip.Parse())
          .on('entry', (entry) => {
            delete shared[entry.path];
          })
          .on('close', () => {
            resolve(Object.keys(shared).length === 0);
          });
      });
  });
}
// performance/validate.spec.js
const zip = fs.createReadStream('./performance/imports/import.zip');
const csv = fs.createReadStream('./performance/imports/stress-test.csv');
const hrstart = process.hrtime();
validator
  .validateParallel(csv, zip)
  .then(function (isValid) {
    console.log(`valid=${isValid}`);
    const hrend = process.hrtime(hrstart);
    console.info("Execution time (hr): %ds %dms", hrend[0], hrend[1] / 1000000);
  });
The validateLine function takes the image name and adds it to the shared object. The output is:
valid=true
Execution time (hr): 4s 926.031869ms
I have simplified the code and removed error management to make it more readable.
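For illustration only, here is a rough sketch of what a validateLine transform along these lines could look like; the column layout and the shape of CSV_VALIDATORS are assumptions, not taken from the question:

// Hypothetical sketch of validateLine (not the actual implementation).
// Assumes the image filename is the last CSV column and that CSV_VALIDATORS
// holds one check function per column.
function validateLine(line, enc, callback) {
  const fields = line.toString().split(',');

  fields.forEach((field, i) => {
    if (CSV_VALIDATORS[i] && !CSV_VALIDATORS[i](field)) {
      this.reject(new Error(`Invalid value "${field}" on line ${this.n}`));
    }
  });

  // Remember the referenced image; matching zip entries are deleted later,
  // so an empty shared object at the end means every image was found.
  this.shared[fields[this.l]] = true;
  this.n += 1;
  callback();
}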

Do you have to validate the images themselves, or just make sure the files referenced in the CSV are present in the zip? If the latter, you can run a shell process that executes unzip -l on the zip file, which prints only the file listing and should be quick.
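A minimal sketch of that idea, assuming unzip is available on the server's PATH; the exact header/footer line counts in unzip -l's output are an approximation here:

const { execFile } = require('child_process');

// List the archive contents without extracting anything.
// `unzip -l` prints a short header, one line per entry and a summary footer;
// the slice()/split() below is an approximation of that layout.
function listZipEntries(zipPath) {
  return new Promise((resolve, reject) => {
    execFile('unzip', ['-l', zipPath], { maxBuffer: 10 * 1024 * 1024 }, (err, stdout) => {
      if (err) return reject(err);
      const names = stdout
        .split('\n')
        .slice(3, -3) // drop the header and footer lines
        .map(line => line.trim().split(/\s+/).slice(3).join(' ')) // keep only the Name column
        .filter(Boolean);
      resolve(new Set(names));
    });
  });
}

// Usage: compare the filenames collected from the CSV against the listing.
// listZipEntries('./performance/imports/import.zip')
//   .then(entries => console.log(entries.has('some-image.png')));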

Related

Node Read Streams - How can I limit the number of open files?

I'm running into AggregateError: EMFILE: too many open files while streaming multiple files.
Machine Details:
MacOS Monterey,
MacBook Pro (14-inch, 2021),
Chip Apple M1 Pro,
Memory 16GB,
Node v16.13.0
I've tried increasing the limits with no luck.
Ideally I would like to be able to limit the number of files open at one time, or to close files as soon as they have been used.
Code below. I've tried to remove the unrelated code and replace it with '//...'.
const MultiStream = require('multistream');
const fs = require('fs-extra'); // Also tried graceful-fs and the standard fs
const { fdir } = require("fdir");
// Also have a require for the bz2 and split2 functions but editing from phone right now
//...
let files = [];
//...
(async () => {
  const crawler = await new fdir()
    .filter((path, isDirectory) => path.endsWith(".bz2"))
    .withFullPaths()
    .crawl("Dir/Sub Dir")
    .withPromise();

  for (const file of crawler) {
    files = [...files, fs.createReadStream(file)]
  }

  multi = await new MultiStream(files)
    // Unzip
    .pipe(bz2())
    // Create chunks from lines
    .pipe(split2())
    .on('data', function (obj) {
      // Code to filter data and extract what I need
      //...
    })
    .on("error", function (error) {
      // Handling parsing errors
      //...
    })
    .on('end', function (error) {
      // Output results
      //...
    })
})();
To prevent pre-opening a filehandle for every single file in your array, you want to only open the files upon demand when it's that particular file's turn to be streamed. And, you can do that with multi-stream.
Per the multi-stream doc, you can lazily create the readStreams by changing this:
for (const file of crawler) {
  files = [...files, fs.createReadStream(file)]
}
to this:
let files = crawler.map((f) => {
  return function () {
    return fs.createReadStream(f);
  }
});
After reading over the npm page for multistream, I think I have found something that will help. I have also edited where you are adding the streams to the files array, as I don't see a need to instantiate a new array and spread the existing elements like you are doing.
To lazily create the streams, wrap them in a function:
var streams = [
  fs.createReadStream(__dirname + '/numbers/1.txt'),
  function () { // will be executed when the stream is active
    return fs.createReadStream(__dirname + '/numbers/2.txt')
  },
  function () { // same
    return fs.createReadStream(__dirname + '/numbers/3.txt')
  }
]

new MultiStream(streams).pipe(process.stdout) // => 123
With that, we can update your logic to include this functionality by simply wrapping the readStreams in functions; that way the streams will not be created until they are needed, which will prevent you from having too many open at once. We can do this by updating your file loop:
for (const file of crawler) {
  files.push(function () {
    return fs.createReadStream(file)
  })
}

How to sequentially read a csv file with node.js (using stream API)

I am trying to figure out how to create a stream pipe which reads entries in a CSV file on demand. To do so, I thought of the following approach using pipes (pseudocode):
const stream_pipe = input_file_stream.pipe(csv_parser)
// Then getting entries through:
let entry = stream_pipe.read()
Unfortunately, after lots of testing I figured out that the moment I set up the pipe, it is automatically consumed until the end of the CSV file. I tried to pause it on creation by appending .pause() at the end, but it seems to not have any effect.
Here's my current code. I am using the csv_parse library (part of the bigger csv package):
// Read file stream
const file_stream = fs.createReadStream("filename.csv")
const parser = csvParser({
  columns: ['a', 'b'],
  on_record: (record) => {
    // A simple filter as I am interested only in numeric entries
    let a = parseInt(record.a)
    let b = parseInt(record.b)
    return (isNaN(a) || isNaN(b)) ? undefined : record
  }
})

const reader = file_stream.pipe(parser) // Adding .pause() seems to have no effect
console.log(reader.read()) // Prints `null`
// I found out I can use this strategy to read a few entries immediately, but I cannot break out of it and then resume as the stream will automatically be consumed
//for await (const record of reader) {
// console.log(record)
//}
I have been banging my head on this for a while and I could not find an easy solution in either the csv package docs or the official Node documentation.
Thanks in advance to anyone able to put me on the right track :)
One thing you can do while reading the stream is to create a readline interface, pass it the input stream, and read line by line, like this:
const readline = require('readline');

const inputStream = fs.createReadStream('filename.csv'); // the csv file

(async () => {
  // create a readline interface which reads line by line;
  // use async/await so each record is processed before the next one is read
  const rl = readline.createInterface({ input: inputStream });
  for await (const line of rl) {
    const res = await processRecord(line);
    // do something with res
  }
})();

async function processRecord(line) {
  return new Promise((res, rej) => {
    if (line) {
      // do the processing
      res(line);
    } else {
      rej('Unable to process record');
    }
  });
}
The processRecord function gets things line by line, and you can use promises to make the processing sequential.
Note: the above code is pseudocode, just to give you an idea of how things work; I have been doing the same in my project to read a CSV file line by line and it works fine.
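If you want to stay with the csv-parse pipe from the question, here is a sketch based only on the standard stream API: a piped parser buffers up to its highWaterMark and then backpressure pauses the file stream, so you can wait for the 'readable' event and pull one record at a time with read(). The import below assumes a recent csv-parse; adjust it to your version:

const fs = require('fs');
const { parse } = require('csv-parse'); // assumes csv-parse v5+; older versions export the parser directly

const parser = fs.createReadStream('filename.csv').pipe(
  parse({ columns: ['a', 'b'] })
);

// Pull exactly one record, waiting for the parser to buffer more data if needed.
// Resolves with null once the parser has ended.
function readOne(stream) {
  return new Promise((resolve, reject) => {
    const record = stream.read();
    if (record !== null) return resolve(record);

    const onReadable = () => { cleanup(); resolve(stream.read()); };
    const onEnd = () => { cleanup(); resolve(null); };
    const onError = (err) => { cleanup(); reject(err); };
    const cleanup = () => {
      stream.off('readable', onReadable);
      stream.off('end', onEnd);
      stream.off('error', onError);
    };
    stream.once('readable', onReadable);
    stream.once('end', onEnd);
    stream.once('error', onError);
  });
}

// Records are consumed only when you ask for them.
(async () => {
  const first = await readOne(parser);
  const second = await readOne(parser);
  console.log(first, second);
})();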

Cloning a Node File Object to use multiple streams in parallel (Multer)

Is it possible to clone a Node.JS File object?
I've written a custom storage driver for Multer which takes an array of storage drivers in its constructor and calls ._handleFile of each driver. The goal is to save one file to multiple destinations in parallel.
However, it seems that the file stream that's opened by the disk driver messes up any subsequent reads. In my particular case I'm trying to save to a local disk + AWS-S3.
Through debugging (setTimeouts, etc.) I found out that:
If the file gets uploaded to S3 first, the file written to my local disk is empty.
If the file gets written to my local disk first, the S3 upload simply dies without any errors.
So my assumption is that multiple streams on the same file cause strange issues.
The multer disk driver does the following:
...
var outStream = fs.createWriteStream(finalPath)
file.stream.pipe(outStream)
The multer AWS S3 driver does this:
...
var upload = this.s3.upload(params)
I assume the library opens a stream.
I don't want to save the file first and manually create two streams after. I'd prefer to somehow duplicate the file object and send the copies off to each individual ._handleFile method.
MultiStorage.prototype._handleFile = async function _handleFile (req, file, cb) {
  // I removed some code for this example
  ...
  const results = await Promise.all(drivers.map(({ driver }, i) => {
    return new Promise((fulfill, reject) => {
      // file -> this I believe I need to duplicate
      driver._handleFile(req, file, (error, info) => {
        fulfill({ info, error })
      })
    })
  }))
  ....
Answering my own question
I wrote a little helper which creates new PassThrough streams and then writes to them as data comes in.
const { PassThrough } = require('stream');

// Split stream into $count amount of new streams and return them
const splitStream = (stream, count) => {
  const streams = [...Array(count)].map(() => new PassThrough());
  stream.on('data', chunk => {
    streams.forEach(s => s.push(chunk));
  })
  stream.on('end', () => {
    // signal end-of-stream on every copy
    streams.forEach(s => s.push(null));
  })
  return streams;
}
Now you just need to pass on your new stream(s) instead of the original stream.
myFn(streams[0]);
myFn(streams[1]);
Disclaimer: This method does not take care of error handling and can cause memory leaks. You might want to consider using the pipeline() wrapper from the 'stream' module.
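A variant of the same idea that uses pipe() instead of manual push() calls; pipe() handles backpressure and end-of-stream propagation for you. This is just a sketch of the alternative, not something tested against Multer:

const { PassThrough } = require('stream');

// Fan one readable stream out into `count` PassThrough copies.
const splitStream = (stream, count) => {
  const streams = [...Array(count)].map(() => new PassThrough());
  for (const s of streams) {
    // pipe() forwards chunks, ends each copy when the source ends,
    // and pauses the source when any destination is backed up.
    stream.pipe(s);
  }
  // Propagate read errors to every copy so consumers see them.
  stream.on('error', err => streams.forEach(s => s.destroy(err)));
  return streams;
};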

Load 1 million records from a file and save to PSQL database

I have a file of 1 million records, and I have to pass each record to Elasticsearch and save the resulting data into the database.
But the issue is that it takes a very long time, as the records are streamed one by one to Elasticsearch and the data is then saved into the PSQL database.
I would like some suggestions on how I can improve on this, or whether I should use some other tools.
Right now I am using Node.js with some packages:
I upload the file in the Node.js application and convert it to JSON using
const csv=require('csvtojson')
I use
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
for reading the JSON and parsing it through these packages as a stream, since the file is too big. I use this code:
const fileStream = fs.createReadStream(this.fileName);
const jsonStream = StreamArray.withParser();
const incomingThis = this;

const processingStream = new Writable({
  write({ key, value }, encoding, callback) {
    incomingThis.recordParser(value, (val, data) => { // pass the data to elasticsearch to get search data
      incomingThis.processQueue(data); // save the data to the PSQL database
      callback();
    });
  },
  // Don't skip this, as we need to operate with objects, not buffers
  objectMode: true
});

// Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);

// So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', async () => {
  console.log('stream end');
  const statistics = new Statistics(jobId);
  await statistics.update(); // update the job table for completion of data
});
Please suggest how I can improve on this to process a file of 1 million records in a couple of hours rather than days, or at least in less time.
I am open to using other tools too, like Redis or Spark, if they will help me out.
Thanks.
Instead of processing records one by one from the stream, use a batch approach (create multiple batches) to query Elasticsearch and save to the database in batches.
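A rough sketch of that idea, building on the Writable from the question: collect records into an in-memory batch and flush the whole batch with one Elasticsearch round trip and one multi-row insert. bulkSearch and bulkInsert below are placeholders for whatever your Elasticsearch client and PSQL layer provide; they are not functions from the question:

const { Writable } = require('stream');

const BATCH_SIZE = 500; // tune against your cluster and database
let batch = [];

const processingStream = new Writable({
  objectMode: true,
  write({ key, value }, encoding, callback) {
    batch.push(value);
    if (batch.length < BATCH_SIZE) return callback();
    const full = batch;
    batch = [];
    flushBatch(full).then(() => callback(), callback);
  },
  final(callback) {
    // flush whatever is left when the file ends
    flushBatch(batch).then(() => callback(), callback);
  }
});

// Placeholders: one bulk query to Elasticsearch and one bulk insert
// into PSQL for the whole batch instead of one round trip per record.
async function flushBatch(records) {
  if (records.length === 0) return;
  const enriched = await bulkSearch(records); // e.g. the client's msearch / _bulk API
  await bulkInsert(enriched);                 // e.g. a multi-row INSERT
}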

How to download multiple links from a .csv file using multithreading in node.js?

I am trying to download links from a .csv file and store the downloaded files in a folder. I have used a multithreading library for this, mt-files-downloader. The files download fine, but it takes too much time to download about 313 files, each about 400 KB in size at most. When I tried a normal download using Node I could download them in a minute or two, but with this library, which should be fast since it is multithreaded, it takes a lot of time. Below is my code; any help would be useful. Thanks!
var rec;
csv
  .fromStream(stream, { headers: ["Recording", , , , , , , ,] })
  .on("data", function (records) {
    rec = records.Recording;
    //console.log(rec);
    download(rec);
  })
  .on("end", function () {
    console.log('Reading complete')
  });

function download(rec) {
  var filename = rec.replace(/\//g, '');
  var filePath = './recordings/' + filename;
  var downloadPath = path.resolve(filePath)
  var fileUrl = 'http:' + rec;
  var downloader = new Downloader();
  var dl = downloader.download(fileUrl, downloadPath);
  dl.start();
  dl.on('error', function (dl) {
    var dlUrl = dl.url;
    console.log('error downloading = > ' + dl.url + ' restarting download....');
    if (!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')) {
      console.log('resuming file download => ' + dlUrl);
      dl.resume();
    }
  });
}
You're right, downloading 313 files of 400kB should not take long - and I don't think this has to do with your code - maybe the connection is bad? Have you tried downloading a single file via curl?
Anyway I see two problems in your approach with which I can help:
first - you download all the files at the same time (which may introduce some overhead on the server)
second - your error handling will run in a loop without waiting or checking the actual file, so if there's a 404 - you'll flood the server with requests.
Using streams with on('data') events has a major drawback: all the chunks are executed more or less synchronously as they are read. This means that your code will execute whatever is in the on('data') handler without ever waiting for your downloads to complete. The only limiting factor is how fast the server can read the csv - and I'd expect millions of lines per second to be normal.
From the remote server's perspective, you're simply requesting 313 files at once, which (without speculating on the server's actual technical mechanisms) will result in some of those requests waiting and interfering with each other.
This can be solved by using a streaming framework, like scramjet, event-stream or highland for instance. I'm the author of the first and it's IMHO the easiest in this case, but you can use any of those by changing the code a little to match their API - it's pretty similar in all cases anyway.
Here's a heavily commented code that will run a couple downloads in parallel:
const {StringStream} = require("scramjet");
const sleep = require("sleep-promise");
const Downloader = require('mt-files-downloader');
const downloader = new Downloader();
// First we create a StringStream class from your csv stream
StringStream.from(csvStream)
  // we parse it as CSV without columns
  .CSVParse({ header: false })
  // we set the limit of parallel operations, it will get propagated.
  .setOptions({ maxParallel: 16 })
  // now we extract the first column as `recording` and create a
  // download request.
  .map(([recording]) => {
    // here's the first part of your code
    const filename = recording.replace(/\//g, '');
    const filePath = './recordings/' + filename;
    const downloadPath = path.resolve(filePath)
    const fileUrl = 'http:' + recording;
    // at this point we return the dl object so we can keep these
    // parts separate.
    // see that the download hasn't been started yet
    return downloader.download(fileUrl, downloadPath);
  })
  // what we get is a stream of not started download objects
  // so we run this asynchronous function. If this returns a Promise
  // it will wait
  .map(
    async (dl) => new Promise((res, rej) => {
      // let's assume a couple of retries we allow
      let retries = 10;
      dl.on('error', async (dl) => {
        try {
          // here we reject if the download fails too many times.
          if (retries-- === 0) throw new Error(`Download of ${dl.url} failed too many times`);
          var dlUrl = dl.url;
          console.log('error downloading = > ' + dl.url + ' restarting download....');
          if (!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')) {
            console.log('resuming file download => ' + dlUrl);
            // let's wait half a second before retrying
            await sleep(500);
            dl.resume();
          }
        } catch (e) {
          // here we call the `reject` function - meaning that
          // this file wasn't downloaded despite retries.
          rej(e);
        }
      });
      // here we call the `resolve` function to confirm that the file was
      // downloaded.
      dl.on('end', () => res());
      // start the download now that the handlers are attached
      dl.start();
    })
  )
  // we log some message and ignore the result in case of an error
  .catch(e => {
    console.error('An error occurred:', e.message);
    return;
  })
  // Every stream must have some sink to flow to; the `run` method runs
  // every operation above.
  .run();
You can also use the stream to push out some kind of log messages and pipe them to process.stderr at the end, instead of those console.logs. Please check the scramjet documentation for additional info, and the Mozilla docs on async functions.
