Cloning a Node File Object to use multiple streams in parallel (Multer) - node.js

Is it possible to clone a Node.JS File object?
I've written a custom storage driver for Multer which takes an array of storage drivers in it's constructor and calls ._handleFile of each driver. The goal is to save one file to multiple destinations in parralel.
However, it seems that the file stream that's opened by the disk driver messes up any subsequent reads. In my particular case I'm trying to save to a local disk + AWS-S3.
Through debugging (setTimeouts, etc.) I found out that:
If the file gets uploaded to S3 first, the file written to my local disk is empty.
If the file gets written my local disk first the S3 upload simply dies without any errors
So my assumption is that multiple streams on the same file cause strange issues.
The multer disk driver does the following:
...
var outStream = fs.createWriteStream(finalPath)
file.stream.pipe(outStream)
The multer AWS S3 driver does this:
...
var upload = this.s3.upload(params)
I assume the library opens a stream.
I don't want to save the file first and manually create two streams after. I'd prefer to somehow duplicate the file object and send them off to each individual ._handleFile method.
MultiStorage.prototype._handleFile = async function _handleFile (req, file, cb) {
// I removed some code for this example
...
const results = await Promise.all(drivers.map({ driver }, i) => {
return new Promise((fulfill, reject) => {
// file -> this I believe I need to duplicate
driver._handleFile(req, file, (error, info) => {
fulfill({ info, error })
})
})
....

Answering my own questions
I wrote little helper which creates new PassThrough streams and then writes to them as data comes in.
const { PassThrough } = require('stream');
// Split stream into $count amount of new streams and return them
const splitStream = (stream, count) => {
const streams = [...Array(count)].map(() => new PassThrough());
stream.on('data', chunk => {
streams.map(s => s.push(chunk));
})
stream.on('end', chunk => {
streams.on('end', () => {
streams.map(s => s.push(null));
})
})
return streams;
}
Now you just need to pass on your new stream(s) instead of the original stream.
myFn(streams[0]);
myFn(streams[1]);
Disclaimer: This method does not take care of error handling and can cause memory leaks. You might want to consider using the Pipeline() wrapper from the 'stream' lib.

Related

How do I stream a chunked file using Node.js Readable?

I have a 400Mb file split into chunks that are ~1Mb each.
Each chunk is a MongoDB document:
{
name: 'stuff.zip',
index: 15,
buffer: Binary('......'),
totalChunks: 400
}
I am fetching each chunk from my database and then streaming it to the client.
Every time I get chunk from the DB I push it to the readableStream which is being piped to the client.
Here is the code:
import { Readable } from 'stream'
const name = 'stuff.zip'
const contentType = 'application/zip'
app.get('/api/download-stuff', (req, res) => {
res.set('Content-Type', contentType)
res.set('Content-Disposition', `attachment; filename=${name}`)
res.attachment(name)
// get `totalChunks` from random chunk
let { totalChunks } = await ChunkModel.findOne({ name }).select('totalChunks')
let index = 0
const readableStream = new Readable({
async read() {
if (index < totalChunks) {
let { buffer } = await ChunkModel.findOne({ name, index }).select('buffer')
let canContinue = readableStream.push(buffer)
console.log(`pushed chunk ${index}/${totalChunks}`)
index++
// sometimes it logs false
// which means I should be waiting before pushing more
// but I don't know how
console.log('canContinue = ', canContinue)
} else {
readableStream.push(null)
readableStream.destroy()
console.log(`all ${totalChunks} chunks streamed to the client`)
}
}
})
readableStream.pipe(res)
})
The code works.
But I'm wondering whether I risk having memory overflows on my local server memory, especially when the requests for the same file are too many or the chunks are too many.
Question: My code is not waiting for readableStream to finish reading the chunk that was just pushed to it, before pushing the next one. I thought it was, and that is why I'm using read(){..} in this probably wrong way. So how should I wait for each chunk to be pushed, read, streamed to the client and cleared from my server's local memory, before I push the next one in ?
I have created this sandbox in case it helps anyone
In general, when the readable interface is implemented correctly (i.e., the backpressure signal is respected), the readable interface will prevent the code from overflowing the memory regardless of source size.
When implemented according to the API spec, the readable itself does not keep references for data that has finished passing through the stream. The memory requirement of a readable buffer is adjusted by specifying a highWatermark.
In this case, the snippet does not conform to the readable interface. It violates the following two concepts:
No data shall be pushed to the readable's buffer unless read() has been called. Currently, this implementation proceeds to push data from DB immediately. Consequently, the readable buffer will start to fill before the sink has begun to consume data.
The readable's push() method returns a boolean flag. When the flag is false, the implementation must wait for .read() to be called before pushing additional data. If the flag is ignored, the buffer will overflow wrt. the highWatermark.
Note that ignoring these core criteria of Readables circumvents the backpressure logic.
An alternative implementation, if this is a Mongoose query:
app.get('/api/download-stuff', async (req, res) => {
// ... truncated handler
// A helper variable to relay data from the stream to the response body
const passThrough = new stream.PassThrough({objectMode: false});
// Pipe data using pipeline() to simplify handling stream errors
stream.pipeline(
// Create a cursor that fetch all relevant documents using a single query
ChunkModel.find().limit(chunksLength).select("buffer").sort({index: 1}).lean().cursor(),
// Cherry pick the `buffer` property
new stream.Transform({
objectMode: true,
transform: ({ buffer }, encoding, next) => {
next(null, buffer);
}
}),
// Write the retrieved documents to the helper variable
passThrough,
error => {
if(error){
// Log and handle error. At this point the HTTP headers are probably already sent,
// and it is therefore too late to return HTTP500
}
}
);
res.body = passThrough;
});

Stream large JSON from REST API using NodeJS/ExpressJS

I have to return a large JSON, resulting from a query to MongoDB, from a REST API server build-up using ExpressJS. This JSON has to be converted into .csv so the client can directly save the resulting CSV file. I know that the best solution is to use NodeJS streams and pipe. Could anyone suggest to me a working example? Thanks.
Typically when wanting to parse JSON in Node its fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or even simpler with a require statement like this
const data = require('./file.json');
Both of these work great with small or even moderate size files, but what if you need to parse a really large JSON file, one with millions of lines, reading the entire file into memory is no longer a great option.
Because of this I needed a way to “Stream” the JSON and process as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use the NodeJS file stream to process our large data file in chucks.
const StreamArray = require( 'stream-json/streamers/StreamArray');
const fs = require('fs');
const jsonStream = StreamArray.withParser();
//internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);
//You'll get json objects here
//Key is the array-index here
jsonStream.on('data', ({key, value}) => {
console.log(key, value);
});
jsonStream.on('end', ({key, value}) => {
console.log('All Done');
});
Now our data can process without running out of memory, however in the use case I was working on, inside the stream I had an asynchronous process. Because of this, I still was consuming huge amounts of memory as this just up a very large amount of unresolved promises to keep in memory until they completed.
To solve this I had to also use a custom Writeable stream like this.
const StreamArray = require( 'stream-json/streamers/StreamArray');
const {Writable} = require('stream');
const fs = require('fs');
const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();
const processingStream = new Writable({
write({key, value}, encoding, callback) {
//some async operations
setTimeout(() => {
console.log(key,value);
//Runs one at a time, need to use a callback for that part to work
callback();
}, 1000);
},
//Don't skip this, as we need to operate with objects, not buffers
objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', () => console.log('All done' ));
The Writeable stream also allows each asynchronous process to complete and the promises to resolve before continuing on to the next, thus avoiding the memory backup.
This stack overflow is where I got the examples for this post.
Parse large JSON file in Nodejs and handle each object independently
Also note another thing I learned in this process is if you want to start Node with more than the default amount of RAM you can use the following command.
node --max-old-space-size=4096 file.js
By default the memory limit in Node.js is 512 mb, to solve this issue you need to increase the memory limit using command –max-old-space-size. This can be used to avoid the memory limits within node. The command above would give Node 4GB of RAM to use.

PapaParse and Highland

I have to parse a very big CSV file in NodeJS and save it in a database (async operation) that allows up to 500 entries at a time. Due to memory limits I have to stream the CSV file and want to use PapaParse to parse the CSV file (as that worked best in my case).
As PapaParse uses a callback style approach to parse Node.js streams I didn't see an easy to combine highland (for batching and data transform) and PapaParse. So, I tried to use a ParseThrough stream to write data to and read that stream with highland for batching:
const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');
const passThroughStream = new PassThrough({ objectMode: true });
csv.parse(fileStream, {
step: function(row) {
// Write data to stream
passThroughStream.write(row.data[0]);
},
complete: function() {
// Somehow "end" the stream
passThroughStream.write(null);
},
});
highland(passThroughStream)
.map((data) => {
// data transform
})
.batch(500)
.map((data) => {
// Save up to 500 entries in database (async call)
});
Obviously that doesn't work as is and doesn't do anything really. Is something like that even possible or even an better way to parse very big CSV files and save the rows in a database (in batches of up to 500)?
Edit: Using the csv package (https://www.npmjs.com/package/csv) it would be possible like so (same for fast-csv):
highland(fileStream.pipe(csv.parse()))
.map((data) => {
// data transform
})
.batch(500)
.map((data) => {
// Save up to 500 entries in database (async call)
});
But unfortunately both NPM packages do not parse the CSV files properly in all cases.
After a quick look at papaparse I decided to implement CSV parser in scramjet.
fileStream.pipe(new scramjet.StringStream('utf-8'))
.csvParse(options)
.batch(500)
.map(items => db.insertArray('some_table', items))
I hope that works for you. :)

Get hash of ReadStream and output data of stream

I have a ReadStream that I want to read multiple times. The readStream is created with fs.createReadStream.
First time I'm using it to get it's md5 hash, I'm doing it with module hasha, function fromStream, and the second time I'm using it with FormData to upload a file to web hosting.
How can I use this one ReadStream to do both of these things?
readStream = fs.createReadStream("/tmp/test.txt");
hash = await hasha.fromStream(readStream, hashOptions);
readStream.on("data", (chunk) => console.log("data chunk", chunk)).on("end", () => console.log("finished"));
It's not loging the content to the console as it should, probably because in the hasha.fromStream it's pipe-ing the stream. If I don't execute hasha.fromStream it's working fine, the chunks are logged.
The module I'm using, called hasha is on github: https://github.com/sindresorhus/hasha/blob/master/index.js#L45
I don't want to save the data to buffer before getting hash, because I'll be using it with large files.
I have also made a runkit script showing my problem, you can play with it there:
https://runkit.com/5942fba4653ae70012196b77/5942fba4653ae70012196b78
Here's a standalone example on how to "fork" a stream so you can pipe it to two destinations:
const PassThrough = require('stream').PassThrough;
async function hashAndPost(stream) {
let pass1 = new PassThrough();
let pass2 = new PassThrough();
stream.pipe(pass1);
stream.pipe(pass2);
// Destination #1
pass1.on('data', chunk =>
console.log('data chunk', chunk.toString())
).on('end', () =>
console.log('finished')
);
// Destination #2
let hash = await hasha.fromStream(pass2, { algorithm : 'md5' });
console.log('hash', hash);
};
You can either recreate the stream by re-reading the file, or you can rewind the stream, as explained here: How to reset nodejs stream?

Nodejs: Performance issues parsing CSV and Zip

The files are submitted to my server and I'm trying to determine if the CSV is valid and if all the images referenced from the CSV are present in the zip. I have to populate a Mongo database with all that information but I want to do it in the background, and send a response to the client as fast as possible.
So I have two readable streams and I have three different approaches:
Unzipping the file takes 24sec, so unzip + parsing the csv + fs.exists is not an option.
Parsing the whole csv, save filenames in array and then reading the zip using node-unzip and pipe takes 5 seconds.
Reading the csv and in parallel read the zip and use a shared array to determine simultaneusly if the files are present, which is the fastest option, takes 4 seconds.
Does anyone have an idea of how to do it faster?
EDIT: The code used for validation is:
// lib/validator.js
function validateParallel(csv, zip) {
const shared = {};
return new Promise((resolve, reject) => {
const l = CSV_VALIDATORS.length - 1;
csv
.pipe(split())
.pipe(through2(validateLine.bind({ zip, reject, n: 0, l, shared })))
.on('finish', () => {
zip
.pipe(unzip.Parse())
.on('entry', (entry) => {
delete shared[entry.path];
})
.on('close', () => {
resolve(Object.keys(shared).length === 0);
});
});
});
}
// perfomance/validate.spec.js
const zip = fs.createReadStream('./performance/imports/import.zip');
const csv = fs.createReadStream('./performance/imports/stress-test.csv');
const hrstart = process.hrtime();
validator
.validateParallel(csv, zip)
.then(function(isValid) {
console.log(`valid=${isValid}`);
const hrend = process.hrtime(hrstart);
console.info("Execution time (hr): %ds %dms", hrend[0], hrend[1]/1000000);
});
ValidateLine takes the image name and pushes it into the shared object. The output is:
valid=true
Execution time (hr): 4s 926.031869ms
I have simplified the code and removed error management to make it more readable.
Do you have to also validate the images themselves, or just make sure their paths exist in the CSV file? If the latter, you can run a shell process that executes unzip -l on the zip file, which will print only the file names, should be quick.

Resources