PapaParse and Highland - node.js

I have to parse a very big CSV file in Node.js and save it in a database (an async operation) that accepts up to 500 entries at a time. Due to memory limits I have to stream the CSV file, and I want to use PapaParse to parse it (as that worked best in my case).
As PapaParse uses a callback-style approach to parse Node.js streams, I didn't see an easy way to combine highland (for batching and data transforms) and PapaParse. So I tried to use a PassThrough stream to write data to, and to read that stream with highland for batching:
const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');

const fileStream = fs.createReadStream('file.csv'); // the big CSV file
const passThroughStream = new PassThrough({ objectMode: true });

csv.parse(fileStream, {
  step: function (row) {
    // Write data to stream
    passThroughStream.write(row.data[0]);
  },
  complete: function () {
    // Somehow "end" the stream
    passThroughStream.end();
  },
});
highland(passThroughStream)
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });
Obviously that doesn't work as is and doesn't really do anything. Is something like that even possible, or is there a better way to parse very big CSV files and save the rows in a database (in batches of up to 500)?
Edit: Using the csv package (https://www.npmjs.com/package/csv) it would be possible like so (same for fast-csv):
highland(fileStream.pipe(csv.parse()))
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });
But unfortunately both NPM packages do not parse the CSV files properly in all cases.

After a quick look at papaparse, I decided to implement a CSV parser in scramjet.
fileStream.pipe(new scramjet.StringStream('utf-8'))
  .csvParse(options)
  .batch(500)
  .map(items => db.insertArray('some_table', items))
I hope that works for you. :)
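For completeness, the PassThrough idea from the question can also be made to work: end the stream in complete, and let highland drive the async saves with flatMap so only one batch is in flight at a time. A rough sketch, where transformRow and db.insertBatch are placeholders and the database call is assumed to return a promise:

const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');

const fileStream = fs.createReadStream('file.csv');
const passThroughStream = new PassThrough({ objectMode: true });

csv.parse(fileStream, {
  // row.data is the parsed row (row.data[0] in older PapaParse versions, as in the question);
  // write() backpressure is ignored here for brevity
  step: (row) => passThroughStream.write(row.data),
  // end() signals EOF, which highland picks up as the end of the stream
  complete: () => passThroughStream.end(),
});

highland(passThroughStream)
  .map(transformRow)                                  // placeholder per-row transform
  .batch(500)                                         // group rows into arrays of up to 500
  .flatMap((rows) => highland(db.insertBatch(rows)))  // placeholder async save; one batch at a time
  .done(() => console.log('all rows saved'));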

Related

Node Read Streams - How can I limit the number of open files?

I'm running into AggregateError: EMFILE: too many open files while streaming multiple files.
Machine Details:
MacOS Monterey,
MacBook Pro (14-inch, 2021),
Chip Apple M1 Pro,
Memory 16GB,
Node v16.13.0
I've tried increasing the limits with no luck.
Ideally I would like to be able to set the limit of the number of files open at one time or resolve by closing files as soon as they have been used.
Code below. I've tried to remove the unrelated code and replace it with '//...'.
const MultiStream = require('multistream');
const fs = require('fs-extra'); // Also tried graceful-fs and the standard fs
const { fdir } = require("fdir");
// Also have a require for the bz2 and split2 functions but editing from phone right now
//...
let files = [];
//...
(async () => {
  const crawler = await new fdir()
    .filter((path, isDirectory) => path.endsWith(".bz2"))
    .withFullPaths()
    .crawl("Dir/Sub Dir")
    .withPromise();

  for (const file of crawler) {
    files = [...files, fs.createReadStream(file)];
  }

  multi = await new MultiStream(files)
    // Unzip
    .pipe(bz2())
    // Create chunks from lines
    .pipe(split2())
    .on('data', function (obj) {
      // Code to filter data and extract what I need
      //...
    })
    .on("error", function (error) {
      // Handling parsing errors
      //...
    })
    .on('end', function (error) {
      // Output results
      //...
    });
})();
To avoid pre-opening a file handle for every single file in your array, you want to open each file only on demand, when it's that particular file's turn to be streamed. And you can do that with multistream.
Per the multi-stream doc, you can lazily create the readStreams by changing this:
for (const file of crawler) {
  files = [...files, fs.createReadStream(file)];
}
to this:
let files = crawler.map((f) => {
  return function () {
    return fs.createReadStream(f);
  };
});
After reading over the npm page for multistream I think I have found something that will help. I have also edited where you are adding the streams to the files array, as I don't see a need to instantiate a new array and spread the existing elements like you are doing.
To lazily create the streams, wrap them in a function:
var streams = [
  fs.createReadStream(__dirname + '/numbers/1.txt'),
  function () { // will be executed when the stream is active
    return fs.createReadStream(__dirname + '/numbers/2.txt')
  },
  function () { // same
    return fs.createReadStream(__dirname + '/numbers/3.txt')
  }
]

new MultiStream(streams).pipe(process.stdout) // => 123
With that, we can update your logic by simply wrapping the read streams in functions so that the streams are not created until they are needed. This will prevent you from having too many files open at once. We can do this by updating your file loop:
for (const file of crawler) {
  files.push(function () {
    return fs.createReadStream(file);
  });
}
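Putting both answers together, a minimal sketch of the whole pipeline with lazily created read streams might look like this (unbzip2-stream is assumed here for the bz2 step, since the question elides that require):

const MultiStream = require('multistream');
const fs = require('fs');
const { fdir } = require('fdir');
const bz2 = require('unbzip2-stream'); // assumed decompressor; the question's require is elided
const split2 = require('split2');

(async () => {
  const crawler = await new fdir()
    .filter((path, isDirectory) => path.endsWith('.bz2'))
    .withFullPaths()
    .crawl('Dir/Sub Dir')
    .withPromise();

  // Each entry is a factory, so only the file currently being streamed is actually open
  const factories = crawler.map((file) => () => fs.createReadStream(file));

  new MultiStream(factories)
    .pipe(bz2())
    .pipe(split2())
    .on('data', (line) => {
      // filter the line and extract what you need
    })
    .on('error', (error) => {
      // handle parsing errors
    })
    .on('end', () => {
      // output results
    });
})();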

Stream large JSON from REST API using NodeJS/ExpressJS

I have to return a large JSON result from a MongoDB query from a REST API server built with ExpressJS. This JSON has to be converted into CSV so the client can directly save the resulting CSV file. I know that the best solution is to use Node.js streams and pipe. Could anyone suggest a working example? Thanks.
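A minimal sketch of one way to do this, assuming the native MongoDB driver and the csv-stringify package (the connection string, database, collection, and route are placeholders; nested fields or ObjectIds may need a projection or an extra transform in practice):

const express = require('express');
const { MongoClient } = require('mongodb');
const { stringify } = require('csv-stringify');

const app = express();

app.get('/export.csv', async (req, res) => {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const collection = client.db('mydb').collection('items');              // placeholder db/collection

  res.setHeader('Content-Type', 'text/csv');
  res.setHeader('Content-Disposition', 'attachment; filename="export.csv"');

  // Stream documents from the cursor, turn each into a CSV row, and pipe straight to the response
  collection.find({})
    .stream()
    .pipe(stringify({ header: true })) // column names are taken from the document keys
    .on('error', (err) => res.destroy(err))
    .pipe(res);
});

app.listen(3000);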
Typically, parsing JSON in Node is fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or even simpler with a require statement like this
const data = require('./file.json');
Both of these work great with small or even moderately sized files, but what if you need to parse a really large JSON file, one with millions of lines? Reading the entire file into memory is no longer a great option.
Because of this, I needed a way to “stream” the JSON and process it as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use a Node.js file stream to process our large data file in chunks.
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');

const jsonStream = StreamArray.withParser();

// Internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);

// You'll get JSON objects here
// Key is the array index here
jsonStream.on('data', ({ key, value }) => {
  console.log(key, value);
});

jsonStream.on('end', () => {
  console.log('All Done');
});
Now our data can be processed without running out of memory. However, in the use case I was working on, there was an asynchronous process inside the stream. Because of this, I was still consuming huge amounts of memory, as this just queued up a very large number of unresolved promises that had to be kept in memory until they completed.
To solve this I also had to use a custom Writable stream, like this.
const StreamArray = require('stream-json/streamers/StreamArray');
const { Writable } = require('stream');
const fs = require('fs');

const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();

const processingStream = new Writable({
  write({ key, value }, encoding, callback) {
    // Some async operations
    setTimeout(() => {
      console.log(key, value);
      // Runs one at a time, need to use a callback for that part to work
      callback();
    }, 1000);
  },
  // Don't skip this, as we need to operate with objects, not buffers
  objectMode: true
});

// Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);

// So we're waiting for the 'finish' event when everything is done
processingStream.on('finish', () => console.log('All done'));
The Writable stream also allows each asynchronous process to complete and its promise to resolve before continuing on to the next item, thus avoiding the memory backup.
This Stack Overflow question is where I got the examples for this post:
Parse large JSON file in Nodejs and handle each object independently
Another thing I learned in this process: if you want to start Node with more than the default amount of memory, you can use the following command.
node --max-old-space-size=4096 file.js
Node.js caps its heap size by default (the exact limit depends on the Node version and the machine), and you can raise it with the --max-old-space-size flag to avoid hitting that limit. The command above gives Node 4 GB of old-space heap to use.

Cloning a Node File Object to use multiple streams in parallel (Multer)

Is it possible to clone a Node.JS File object?
I've written a custom storage driver for Multer which takes an array of storage drivers in its constructor and calls ._handleFile on each driver. The goal is to save one file to multiple destinations in parallel.
However, it seems that the file stream that's opened by the disk driver messes up any subsequent reads. In my particular case I'm trying to save to a local disk + AWS-S3.
Through debugging (setTimeouts, etc.) I found out that:
If the file gets uploaded to S3 first, the file written to my local disk is empty.
If the file gets written to my local disk first, the S3 upload simply dies without any errors.
So my assumption is that multiple streams on the same file cause strange issues.
The multer disk driver does the following:
...
var outStream = fs.createWriteStream(finalPath)
file.stream.pipe(outStream)
The multer AWS S3 driver does this:
...
var upload = this.s3.upload(params)
I assume the library opens a stream.
I don't want to save the file first and manually create two streams after. I'd prefer to somehow duplicate the file object and send them off to each individual ._handleFile method.
MultiStorage.prototype._handleFile = async function _handleFile(req, file, cb) {
  // I removed some code for this example
  ...
  const results = await Promise.all(drivers.map((driver, i) => {
    return new Promise((fulfill, reject) => {
      // file -> this I believe I need to duplicate
      driver._handleFile(req, file, (error, info) => {
        fulfill({ info, error })
      })
    })
  }))
  ....
Answering my own question
I wrote a little helper which creates new PassThrough streams and then writes to them as data comes in.
const { PassThrough } = require('stream');

// Split stream into $count amount of new streams and return them
const splitStream = (stream, count) => {
  const streams = [...Array(count)].map(() => new PassThrough());

  stream.on('data', (chunk) => {
    streams.forEach((s) => s.push(chunk));
  });

  stream.on('end', () => {
    streams.forEach((s) => s.push(null)); // signal EOF on every fork
  });

  return streams;
};
Now you just need to pass on your new stream(s) instead of the original stream.
myFn(streams[0]);
myFn(streams[1]);
Disclaimer: This method does not take care of error handling and can cause memory leaks. You might want to consider using the pipeline() helper from the 'stream' module instead.

Load 1 million records from a file and save to PSQL database

I have a file of 1 million records, and I have to pass each record one by one to Elasticsearch and save the resulting data into the database.
But the issue is that it takes a very long time, as the records stream one by one to Elasticsearch and then the data is saved into the PSQL database.
I would like some suggestions on how I can improve on this, or whether I should use other tools.
Right now I am using Node.js with some packages:
I upload the file in the Node.js application and convert it to JSON using
const csv=require('csvtojson')
I use
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
for reading the JSON and parsing it through these packages using streams, as the file is too big.
I use this code
const fileStream = fs.createReadStream(this.fileName);
const jsonStream = StreamArray.withParser();
const incomingThis = this;

const processingStream = new Writable({
  write({ key, value }, encoding, callback) {
    incomingThis.recordParser(value, (val, data) => { // pass the data to Elasticsearch to get search data
      incomingThis.processQueue(data); // save the data to the PSQL database
      callback();
    });
  },
  // Don't skip this, as we need to operate with objects, not buffers
  objectMode: true
});

// Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);

// So we're waiting for the 'finish' event when everything is done
processingStream.on('finish', async () => {
  console.log('stream end');
  const statistics = new Statistics(jobId);
  await statistics.update(); // update the job table for completion of data
});
Please suggest how I can improve on this to process the 1-million-record file in a couple of hours rather than days, or at least in less time.
I am open to using other tools too, like Redis or Spark, if they will help me out.
Thanks.
Instead of processing records one by one from the stream, use a batch approach (create multiple batches) to query Elasticsearch and save to the database in bulk.
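For example, a rough sketch of batching inside the question's Writable stream, reusing the Writable import and incomingThis from the code above and assuming hypothetical bulk helpers (recordParserBulk for the Elasticsearch lookups, processQueueBulk for the PSQL inserts) that accept arrays of records:

const BATCH_SIZE = 500;
let batch = [];

const flushBatch = async () => {
  if (batch.length === 0) return;
  const records = batch;
  batch = [];
  const enriched = await incomingThis.recordParserBulk(records); // hypothetical bulk Elasticsearch call
  await incomingThis.processQueueBulk(enriched);                 // hypothetical bulk insert into PSQL
};

const processingStream = new Writable({
  objectMode: true,
  write({ key, value }, encoding, callback) {
    batch.push(value);
    if (batch.length >= BATCH_SIZE) {
      // Wait for the whole batch to be saved before accepting more rows (backpressure)
      flushBatch().then(() => callback(), callback);
    } else {
      callback();
    }
  },
  final(callback) {
    // Flush whatever is left in the last partial batch
    flushBatch().then(() => callback(), callback);
  }
});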

Get hash of ReadStream and output data of stream

I have a ReadStream that I want to read multiple times. The readStream is created with fs.createReadStream.
The first time I'm using it to get its md5 hash with the hasha module's fromStream function, and the second time I'm using it with FormData to upload the file to web hosting.
How can I use this one ReadStream to do both of these things?
const readStream = fs.createReadStream("/tmp/test.txt");
const hash = await hasha.fromStream(readStream, hashOptions);

readStream
  .on("data", (chunk) => console.log("data chunk", chunk))
  .on("end", () => console.log("finished"));
It's not logging the content to the console as it should, probably because hasha.fromStream is piping the stream. If I don't execute hasha.fromStream, it works fine and the chunks are logged.
The module I'm using, hasha, is on GitHub: https://github.com/sindresorhus/hasha/blob/master/index.js#L45
I don't want to buffer the data before getting the hash, because I'll be using it with large files.
I have also made a RunKit script showing my problem; you can play with it there:
https://runkit.com/5942fba4653ae70012196b77/5942fba4653ae70012196b78
Here's a standalone example on how to "fork" a stream so you can pipe it to two destinations:
const PassThrough = require('stream').PassThrough;

async function hashAndPost(stream) {
  let pass1 = new PassThrough();
  let pass2 = new PassThrough();
  stream.pipe(pass1);
  stream.pipe(pass2);

  // Destination #1
  pass1.on('data', chunk =>
    console.log('data chunk', chunk.toString())
  ).on('end', () =>
    console.log('finished')
  );

  // Destination #2
  let hash = await hasha.fromStream(pass2, { algorithm: 'md5' });
  console.log('hash', hash);
}
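Calling it with the stream from the question would then be, for example:

hashAndPost(fs.createReadStream('/tmp/test.txt'));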
You can either recreate the stream by re-reading the file, or you can rewind the stream, as explained here: How to reset nodejs stream?
