I have to return a large JSON result, coming from a MongoDB query, from a REST API server built with ExpressJS. This JSON has to be converted into .csv so the client can directly save the resulting CSV file. I know that the best solution is to use NodeJS streams and pipe. Could anyone suggest a working example? Thanks.
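A minimal sketch of the streaming approach, assuming the native MongoDB driver with an already-connected db handle, an Express app, and placeholder field and collection names: the query is streamed as a cursor, each document is turned into a CSV row by a small Transform, and the result is piped straight to the response.
const { Transform } = require('stream');

// Hypothetical Express route; `db` is assumed to be an already-connected MongoDB Db instance
app.get('/export.csv', (req, res) => {
  const fields = ['name', 'email', 'createdAt']; // assumed field names

  // Turn each MongoDB document into one CSV row
  const toCsv = new Transform({
    objectMode: true,
    transform(doc, encoding, callback) {
      const row = fields
        .map(f => `"${String(doc[f] ?? '').replace(/"/g, '""')}"`)
        .join(',');
      callback(null, row + '\n');
    }
  });

  res.setHeader('Content-Type', 'text/csv');
  res.setHeader('Content-Disposition', 'attachment; filename="export.csv"');
  res.write(fields.join(',') + '\n'); // header row

  db.collection('users')   // assumed collection name
    .find({})              // your query here
    .stream()              // readable stream of documents
    .pipe(toCsv)
    .pipe(res);
});
Everything stays streaming, so the full result set is never held in memory; error handling on the cursor and the transform is left out for brevity.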
Typically, parsing JSON in Node is fairly simple. In the past I would do something like the following:
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or, even simpler, with a require statement like this:
const data = require('./file.json');
Both of these work great with small or even moderately sized files, but what if you need to parse a really large JSON file, one with millions of lines? Reading the entire file into memory is no longer a great option.
Because of this, I needed a way to “stream” the JSON and process it as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use a NodeJS file stream to process our large data file in chunks.
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
const jsonStream = StreamArray.withParser();
//internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);
//You'll get json objects here
//Key is the array-index here
jsonStream.on('data', ({key, value}) => {
console.log(key, value);
});
jsonStream.on('end', () => {
console.log('All Done');
});
Now our data can be processed without running out of memory. However, in the use case I was working on, I had an asynchronous process inside the stream. Because of this, I was still consuming huge amounts of memory, as this just queued up a very large number of unresolved promises that had to be kept in memory until they completed.
To solve this, I also had to use a custom Writable stream, like this:
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
const fs = require('fs');
const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();
const processingStream = new Writable({
write({key, value}, encoding, callback) {
//some async operations
setTimeout(() => {
console.log(key,value);
//Runs one at a time, need to use a callback for that part to work
callback();
}, 1000);
},
//Don't skip this, as we need to operate with objects, not buffers
objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', () => console.log('All done' ));
The Writable stream also allows each asynchronous process to complete and its promise to resolve before continuing on to the next, thus avoiding the memory backup.
This Stack Overflow question is where I got the examples for this post:
Parse large JSON file in Nodejs and handle each object independently
Another thing I learned in this process: if you want to start Node with more than the default amount of memory, you can use the following command.
node --max-old-space-size=4096 file.js
By default, Node.js caps its heap memory (the exact limit depends on the Node version and platform); to raise it, you can use the --max-old-space-size flag. The command above would give Node 4 GB of heap to use.
Related
Is it possible to clone a Node.JS File object?
I've written a custom storage driver for Multer which takes an array of storage drivers in its constructor and calls ._handleFile of each driver. The goal is to save one file to multiple destinations in parallel.
However, it seems that the file stream that's opened by the disk driver messes up any subsequent reads. In my particular case I'm trying to save to a local disk + AWS-S3.
Through debugging (setTimeouts, etc.) I found out that:
If the file gets uploaded to S3 first, the file written to my local disk is empty.
If the file gets written to my local disk first, the S3 upload simply dies without any errors.
So my assumption is that multiple streams on the same file cause strange issues.
The multer disk driver does the following:
...
var outStream = fs.createWriteStream(finalPath)
file.stream.pipe(outStream)
The multer AWS S3 driver does this:
...
var upload = this.s3.upload(params)
I assume the library opens a stream.
I don't want to save the file first and manually create two streams afterwards. I'd prefer to somehow duplicate the file object and send a copy off to each individual ._handleFile method.
MultiStorage.prototype._handleFile = async function _handleFile (req, file, cb) {
  // I removed some code for this example
  ...
  const results = await Promise.all(drivers.map(({ driver }, i) => {
    return new Promise((fulfill, reject) => {
      // file -> this I believe I need to duplicate
      driver._handleFile(req, file, (error, info) => {
        fulfill({ info, error })
      })
    })
  ....
Answering my own question
I wrote a little helper which creates new PassThrough streams and then writes to them as data comes in.
const { PassThrough } = require('stream');
// Split stream into `count` new streams and return them
const splitStream = (stream, count) => {
  const streams = [...Array(count)].map(() => new PassThrough());
  stream.on('data', chunk => {
    streams.forEach(s => s.push(chunk));
  });
  stream.on('end', () => {
    streams.forEach(s => s.push(null));
  });
  return streams;
};
Now you just need to pass on your new stream(s) instead of the original stream.
myFn(streams[0]);
myFn(streams[1]);
Disclaimer: This method does not take care of error handling and can cause memory leaks. You might want to consider using the pipeline() helper from the built-in 'stream' module.
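For reference, here is a sketch of the same splitting idea wired up with the built-in pipeline() helper so that errors propagate and the streams get cleaned up; dest1 and dest2 are hypothetical writable destinations, and file.stream stands for the incoming Multer file stream from the question.
const { PassThrough, pipeline } = require('stream');

// Fan the source out into `count` PassThrough branches
const splitStreamSafe = (source, count) => {
  const branches = [...Array(count)].map(() => new PassThrough());
  source.on('data', chunk => branches.forEach(b => b.write(chunk)));
  source.on('end', () => branches.forEach(b => b.end()));
  source.on('error', err => branches.forEach(b => b.destroy(err)));
  // Note: backpressure from slow branches is not handled here
  return branches;
};

// Usage sketch: each branch gets its own pipeline with an error callback
const [branch1, branch2] = splitStreamSafe(file.stream, 2);
pipeline(branch1, dest1, err => err && console.error('branch 1 failed', err));
pipeline(branch2, dest2, err => err && console.error('branch 2 failed', err));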
Here is the test code (in an express environment just because that's what I happen to be messing around with):
const fs = require('fs-extra');
const fsPromises = fs.promises;
const express = require('express');
const app = express();
const speedtest = async function (req, res, next) {
const useFsPromises = (req.params.promises == 'true');
const jsonFileName = './json/big-file.json';
const hrstart = process.hrtime();
if (useFsPromises) {
await fsPromises.readFile(jsonFileName);
} else {
fs.readFileSync(jsonFileName);
}
res.send(`time taken to read: ${process.hrtime(hrstart)[1]/1000000} ms`);
};
app.get('/speedtest/:promises', speedtest);
The big-file.json file is around 16 MB. Using node 12.18.4.
Typical results (varies quite a bit around these values, but the following are "typical"):
https://dev.mydomain.com/speedtest/false
time taken to read: 3.948152 ms
https://dev.mydomain.com/speedtest/true
time taken to read: 61.865763 ms
UPDATE to include two more variants... plain fs.readFile() and also a promisified version of this:
const fs = require('fs-extra');
const fsPromises = fs.promises;
const util = require('util');
const readFile = util.promisify(fs.readFile);
const express = require('express');
const app = express();
const speedtest = async function (req, res, next) {
const type = req.params.type;
const jsonFileName = './json/big-file.json';
const hrstart = process.hrtime();
if (type == 'readFileFsPromises') {
await fsPromises.readFile(jsonFileName);
} else if (type == 'readFileSync') {
fs.readFileSync(jsonFileName);
} else if (type == 'readFileAsync') {
return fs.readFile(jsonFileName, function (err, jsondata) {
res.send(`time taken to read: ${process.hrtime(hrstart)[1]/1000000} ms`);
});
} else if (type == 'readFilePromisified') {
await readFile(jsonFileName);
}
res.send(`time taken to read: ${process.hrtime(hrstart)[1]/1000000} ms`);
};
app.get('/speedtest/:type', speedtest);
I am finding that fsPromises.readFile() is the slowest, while the others are much faster and all roughly the same in terms of reading time. I should add that in a different example (which I can't fully verify, so I'm not sure what was going on) the time difference was vastly bigger than reported here. It seems to me at present that fsPromises.readFile() should simply be avoided, because there are other async/promise options.
After stepping through each implementation in the debugger (fs.readFileSync and fs.promises.readFile), I can confirm that the synchronous version reads the entire file in one large chunk (the size of the file), whereas fs.promises.readFile() reads 16,384 bytes at a time in a loop, with an await on each read. This is going to make fs.promises.readFile() go back to the event loop multiple times before it can read the entire file. Besides giving other things a chance to run, it's extra overhead to go back to the event loop on every cycle of the loop. There's also memory management overhead, because fs.promises.readFile() allocates a series of Buffer objects and then combines them all at the end, whereas fs.readFileSync() allocates one large Buffer object at the beginning and just reads the entire file into that one Buffer.
So, the synchronous version, which is allowed to hog the entire CPU, is just faster from a pure time to completion point of view (it's significantly less efficient from a CPU cycles used point of view in a multi-user server because it blocks the event loop from doing anything else during the read). The asynchronous version is reading in smaller chunks, probably to avoid blocking the event loop too much so other things can effectively interleave and run while fs.promises.readFile() is doing its thing.
For a project I worked on a while ago, I wrote my own simple asynchronous version of readFile() that reads the entire file at once, and it was significantly faster than the built-in implementation. I was not concerned about event-loop blockage in that particular project, so I did not investigate whether that's an issue.
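For illustration, a minimal sketch of that kind of read-it-all-at-once asynchronous read, using the fs.promises FileHandle API (a reconstruction of the general idea, not the original code from that project):
const fsp = require('fs').promises;

// Async readFile that reads into one pre-allocated Buffer
// instead of accumulating 16 KB chunks
async function readFileWhole(path) {
  const handle = await fsp.open(path, 'r');
  try {
    const { size } = await handle.stat();
    const buffer = Buffer.allocUnsafe(size);
    let offset = 0;
    while (offset < size) {
      const { bytesRead } = await handle.read(buffer, offset, size - offset, offset);
      if (bytesRead === 0) break; // unexpected end of file
      offset += bytesRead;
    }
    return buffer;
  } finally {
    await handle.close();
  }
}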
In addition, fs.readFile() reads the file in 524,288-byte chunks (much larger chunks than fs.promises.readFile()) and does not use await, using just plain callbacks. It is apparently just coded more optimally than the promise implementation. I don't know why they rewrote the function in a slower way for the fs.promises.readFile() implementation. For now, it appears that wrapping fs.readFile() with a promise would be faster.
I have a file of 1 million records, and I have to pass each record one by one to Elasticsearch and save the resulting data into the database.
But the issue is that it is taking a very long time, as the records are streamed one by one to Elasticsearch and then the data is saved into the PSQL database.
I want some suggestions on how I can improve on this, or whether I should use some other tools.
Right now I am using Nodejs with some packages:
I upload the file in the NodeJS application and convert it to a JSON file using:
const csv=require('csvtojson')
I use
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
for reading and parsing the JSON through a stream, as the file is too big.
I use this code
const fileStream = fs.createReadStream(this.fileName);
const jsonStream = StreamArray.withParser();
const incomingThis = this;
const processingStream = new Writable({
write({key, value}, encoding, callback) {
incomingThis.recordParser(value, (val, data) => { // pass the data to elasticsearch to get search data
incomingThis.processQueue(data); // save the data to the PSQL database
callback();
});
},
//Don't skip this, as we need to operate with objects, not buffers
objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', async () => {
console.log('stream end');
const statistics = new Statistics(jobId);
await statistics.update(); // update the job table for completion of data
});
Please suggest how I can improve on this to parse a file of 1 million records in a couple of hours rather than days, or at least in less time.
I am open to using other tools too, like Redis or Spark, if these will help me out.
Thanks.
Instead of processing records one by one from the stream, use a batch approach (create multiple batches) to get the data into Elasticsearch and save it in batches.
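A minimal sketch of that batching idea, built on the same stream-json/Writable setup used in the question; bulkSearch and bulkSave are hypothetical helpers standing in for an Elasticsearch bulk request and a multi-row PSQL insert, and the file name is a placeholder.
const StreamArray = require('stream-json/streamers/StreamArray');
const { Writable } = require('stream');
const fs = require('fs');

const BATCH_SIZE = 500; // tune to what Elasticsearch and PSQL handle comfortably
let batch = [];

// bulkSearch(records) and bulkSave(results) are hypothetical helpers that
// would call the Elasticsearch bulk API and do a multi-row PSQL insert
const flush = async () => {
  if (batch.length === 0) return;
  const records = batch;
  batch = [];
  const results = await bulkSearch(records);
  await bulkSave(results);
};

const processingStream = new Writable({
  objectMode: true,
  write({key, value}, encoding, callback) {
    batch.push(value);
    if (batch.length >= BATCH_SIZE) {
      flush().then(() => callback(), callback);
    } else {
      callback();
    }
  },
  final(callback) {
    // Flush whatever is left over when the input ends
    flush().then(() => callback(), callback);
  }
});

const fileStream = fs.createReadStream('records.json'); // placeholder file name
const jsonStream = StreamArray.withParser();
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
processingStream.on('finish', () => console.log('stream end'));
One network round trip per 500 records instead of per record is usually where most of the time is saved.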
I have to parse a very big CSV file in NodeJS and save it in a database (async operation) that allows up to 500 entries at a time. Due to memory limits I have to stream the CSV file and want to use PapaParse to parse the CSV file (as that worked best in my case).
As PapaParse uses a callback-style approach to parse Node.js streams, I didn't see an easy way to combine highland (for batching and data transform) and PapaParse. So, I tried to use a PassThrough stream to write data to, and then read that stream with highland for batching:
const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');
const passThroughStream = new PassThrough({ objectMode: true });
csv.parse(fileStream, {
step: function(row) {
// Write data to stream
passThroughStream.write(row.data[0]);
},
complete: function() {
// Somehow "end" the stream
passThroughStream.write(null);
},
});
highland(passThroughStream)
.map((data) => {
// data transform
})
.batch(500)
.map((data) => {
// Save up to 500 entries in database (async call)
});
Obviously that doesn't work as is and doesn't really do anything. Is something like that even possible, or is there a better way to parse very big CSV files and save the rows in a database (in batches of up to 500)?
Edit: Using the csv package (https://www.npmjs.com/package/csv) it would be possible like so (same for fast-csv):
highland(fileStream.pipe(csv.parse()))
.map((data) => {
// data transform
})
.batch(500)
.map((data) => {
// Save up to 500 entries in database (async call)
});
But unfortunately, neither NPM package parses the CSV files properly in all cases.
After a quick look at papaparse I decided to implement a CSV parser in scramjet.
fileStream.pipe(new scramjet.StringStream('utf-8'))
.csvParse(options)
.batch(500)
.map(items => db.insertArray('some_table', items))
I hope that works for you. :)
I am writing an image manipulation service and I have to transform an image into multiple sizes:
const writable1 = storage(name1).writableStream();
const writable2 = storage(name2).writableStream();
const writable3 = storage(name3).writableStream();
//piping the file stream to their respective storage stream
file.stream.pipe(imageTransformer).pipe(writable1);
file.stream.pipe(imageTransformer).pipe(writable2);
file.stream.pipe(imageTransformer).pipe(writable3);
I want to know when all the streams have finished writing to their destinations.
Right now I have only checked for one stream, like:
writable3.on('finish', callback);
//error handling
writable3.on('error', callback);
I have seen libraries like https://github.com/mafintosh/pump and https://github.com/maxogden/mississippi but these libraries only show writing to a single destination with multiple transforms.
How would I be able to check if all the streams are finished writing or one of them has errored out? How can I handle them in an array?
You can use a combination of converting each stream to a promise and Promise.all.
In the example, I have used the stream-to-promise library for the stream-to-promise conversion.
For each stream, a promise is created. The promise is resolved when the stream is completed and rejected when the stream fails.
const streamToPromise = require('stream-to-promise')
const promise1 = streamToPromise(readable1.pipe(writable1));
const promise2 = streamToPromise(readable2.pipe(writable2));
Promise.all([promise1, promise2])
  .then(() => console.log('all the streams are finished'));
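Alternatively, a sketch using only built-in modules: promisifying stream.finished() gives you a promise per stream (resolved on 'finish', rejected on 'error'), which you can collect in an array and await with Promise.all. writable1 to writable3 are the destination streams from the question.
const { finished } = require('stream');
const { promisify } = require('util');
const finishedAsync = promisify(finished);

// One promise per destination stream
const writables = [writable1, writable2, writable3];

Promise.all(writables.map(w => finishedAsync(w)))
  .then(() => console.log('all the streams are finished'))
  .catch(err => console.error('one of the streams failed', err));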