Get hash of ReadStream and output data of stream - node.js

I have a ReadStream that I want to read multiple times. The ReadStream is created with fs.createReadStream.
The first time I use it to get its md5 hash (with the hasha module's fromStream function), and the second time I use it with FormData to upload the file to a web host.
How can I use this one ReadStream for both of these things?
const fs = require("fs");
const hasha = require("hasha");
const readStream = fs.createReadStream("/tmp/test.txt");
const hash = await hasha.fromStream(readStream, hashOptions);
readStream.on("data", (chunk) => console.log("data chunk", chunk)).on("end", () => console.log("finished"));
It's not logging the content to the console as it should, probably because hasha.fromStream is piping the stream. If I don't call hasha.fromStream it works fine and the chunks are logged.
The module I'm using, called hasha is on github: https://github.com/sindresorhus/hasha/blob/master/index.js#L45
I don't want to save the data to buffer before getting hash, because I'll be using it with large files.
I have also made a RunKit script showing my problem; you can play with it here:
https://runkit.com/5942fba4653ae70012196b77/5942fba4653ae70012196b78

Here's a standalone example of how to "fork" a stream so you can pipe it to two destinations:
const { PassThrough } = require('stream');
const hasha = require('hasha');

async function hashAndPost(stream) {
  const pass1 = new PassThrough();
  const pass2 = new PassThrough();
  stream.pipe(pass1);
  stream.pipe(pass2);

  // Destination #1: log the chunks
  pass1.on('data', chunk =>
    console.log('data chunk', chunk.toString())
  ).on('end', () =>
    console.log('finished')
  );

  // Destination #2: compute the hash
  const hash = await hasha.fromStream(pass2, { algorithm: 'md5' });
  console.log('hash', hash);
}
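To try it out, you could call it with a fresh read stream, for example (reusing the file path from the question):
const fs = require('fs');
hashAndPost(fs.createReadStream('/tmp/test.txt'))
  .catch(err => console.error(err));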

You can either recreate the stream by re-reading the file, or you can rewind the stream, as explained here: How to reset nodejs stream?

Related

Cloning a Node File Object to use multiple streams in parallel (Multer)

Is it possible to clone a Node.JS File object?
I've written a custom storage driver for Multer which takes an array of storage drivers in its constructor and calls ._handleFile on each driver. The goal is to save one file to multiple destinations in parallel.
However, it seems that the file stream that's opened by the disk driver messes up any subsequent reads. In my particular case I'm trying to save to local disk and to AWS S3.
Through debugging (setTimeouts, etc.) I found out that:
If the file gets uploaded to S3 first, the file written to my local disk is empty.
If the file gets written to my local disk first, the S3 upload simply dies without any errors.
So my assumption is that multiple streams on the same file cause strange issues.
The multer disk driver does the following:
...
var outStream = fs.createWriteStream(finalPath)
file.stream.pipe(outStream)
The multer AWS S3 driver does this:
...
var upload = this.s3.upload(params)
I assume the library opens a stream.
I don't want to save the file first and manually create two streams afterwards. I'd prefer to somehow duplicate the file object and send a copy off to each individual ._handleFile method.
MultiStorage.prototype._handleFile = async function _handleFile (req, file, cb) {
  // I removed some code for this example
  ...
  const results = await Promise.all(drivers.map(({ driver }, i) => {
    return new Promise((fulfill, reject) => {
      // file -> this I believe I need to duplicate
      driver._handleFile(req, file, (error, info) => {
        fulfill({ info, error })
      })
    })
  ....
Answering my own question
I wrote a little helper which creates new PassThrough streams and then writes to them as data comes in.
const { PassThrough } = require('stream');
// Split a stream into $count new PassThrough streams and return them
const splitStream = (stream, count) => {
  const streams = [...Array(count)].map(() => new PassThrough());
  stream.on('data', chunk => {
    streams.forEach(s => s.push(chunk));
  });
  stream.on('end', () => {
    streams.forEach(s => s.push(null));
  });
  return streams;
};
Now you just need to pass on your new stream(s) instead of the original stream.
myFn(streams[0]);
myFn(streams[1]);
Disclaimer: this method does not take care of error handling and can cause memory leaks. You might want to consider using the pipeline() helper from the 'stream' module instead.
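For reference, wiring the helper into the _handleFile method from the question could look roughly like this; it is only a sketch, and passing each driver a shallow copy of the file object with its own stream is an assumption about what the individual drivers expect:
MultiStorage.prototype._handleFile = async function _handleFile (req, file, cb) {
  // one PassThrough copy of the upload stream per driver (splitStream is defined above)
  const streams = splitStream(file.stream, drivers.length);
  const results = await Promise.all(drivers.map(({ driver }, i) =>
    new Promise((fulfill) => {
      // each driver gets its own copy of the file object with its own stream
      driver._handleFile(req, { ...file, stream: streams[i] }, (error, info) => {
        fulfill({ info, error });
      });
    })
  ));
  cb(null, { results });
};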

NodeJS Stream flushed during the Event Loop iteration

I'm trying to pipe one Axios response stream into multiple files. It's not working, and I can reproduce the issue with the simple code below:
Will work:
const { PassThrough } = require('stream')
const inputStream = new PassThrough()
inputStream.write('foo')
// Now I have a stream with content
inputStream.pipe(process.stdout)
inputStream.pipe(process.stderr)
// prints 'foo' on both stdout and stderr (so 'foofoo' in the terminal)
Will not work:
const { PassThrough } = require('stream')
const inputStream = new PassThrough()
inputStream.write('foo')
inputStream.pipe(process.stdout)
setImmediate(() => {
  inputStream.pipe(process.stderr)
})
// Will print 'foo' only once (only to stdout)
The question is: can I rely on the content already in the stream being piped to both destinations only if the two pipe calls execute in the same event-loop iteration?
Doesn't that make the situation non-deterministic?
By the time the callback scheduled with setImmediate executes, the stream data has already been flushed. This can be checked via the stream's .readableLength property.
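A quick way to observe this (just an illustrative check of .readableLength before and after the data is handed off):
const { PassThrough } = require('stream')
const inputStream = new PassThrough()
inputStream.write('foo')
console.log(inputStream.readableLength) // 3 - 'foo' is still buffered
inputStream.pipe(process.stdout)
setImmediate(() => {
  console.log(inputStream.readableLength) // 0 - the buffer was already flushed to stdout
})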
You can use cork and uncork in order to control when the buffered stream data is flushed.
const { PassThrough } = require('stream')
const inputStream = new PassThrough()
inputStream.cork()
inputStream.write('foo')
inputStream.pipe(process.stdout)
setImmediate(() => {
  inputStream.pipe(process.stderr)
  inputStream.uncork()
})

PapaParse and Highland

I have to parse a very big CSV file in NodeJS and save it in a database (an async operation) that allows up to 500 entries at a time. Due to memory limits I have to stream the CSV file, and I want to use PapaParse to parse it (as that worked best in my case).
As PapaParse uses a callback-style approach when parsing Node.js streams, I didn't see an easy way to combine highland (for batching and data transforms) with PapaParse. So I tried to use a PassThrough stream to write the data to, and then read that stream with highland for batching:
const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');

const fileStream = fs.createReadStream('my-data.csv'); // file path is just an example
const passThroughStream = new PassThrough({ objectMode: true });

csv.parse(fileStream, {
  step: function(row) {
    // Write data to stream
    passThroughStream.write(row.data[0]);
  },
  complete: function() {
    // Somehow "end" the stream
    passThroughStream.write(null);
  },
});
highland(passThroughStream)
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });
Obviously that doesn't work as is and doesn't really do anything. Is something like that even possible, or is there a better way to parse very big CSV files and save the rows in a database (in batches of up to 500)?
Edit: Using the csv package (https://www.npmjs.com/package/csv) it would be possible like this (the same goes for fast-csv):
highland(fileStream.pipe(csv.parse()))
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });
But unfortunately both NPM packages do not parse the CSV files properly in all cases.
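For the batching-plus-async-save part, highland can also wrap the promise returned by the database call so the stream waits for each batch before pulling more rows. A sketch building on the csv variant above, where db.batchInsert is a hypothetical function returning a promise:
const csv = require('csv');
const highland = require('highland');

highland(fileStream.pipe(csv.parse()))
  .batch(500)
  // wrapping the promise in a highland stream makes the pipeline wait for the insert
  .flatMap((rows) => highland(db.batchInsert(rows)))
  .done(() => console.log('all rows saved'));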
After a quick look at papaparse I decided to implement a CSV parser in scramjet.
const scramjet = require('scramjet');

fileStream.pipe(new scramjet.StringStream('utf-8'))
  .csvParse(options)
  .batch(500)
  .map(items => db.insertArray('some_table', items))
I hope that works for you. :)

Cannot pipe after data has been emitted from the response nodejs

I've been experiencing a problem with the request library for Node.js. When I try to pipe the response to both a file and a stream, I get the error: "You cannot pipe after data has been emitted from the response." This is because I do some computations before actually piping the data.
Example:
var request = require('request')
var fs = require('fs')
var through2 = require('through2')

var options = {
  url: 'url-to-fetch-a-file'
};

var req = request(options)

req.on('response', function (res) {
  // Some computations to potentially remove files.
  // These computations take quite some time.
  // Function that creates the path recursively
  createPath(path, function () {
    var file = fs.createWriteStream(path + fname)
    var stream = through2.obj(function (chunk, enc, callback) {
      this.push(chunk)
      callback()
    })
    req.pipe(file)
    req.pipe(stream)
  })
})
If I just pipe to the stream without any computations, it works just fine. How can I pipe to both a file and a stream using the request module in Node.js?
I found this: Node.js Piping the same readable stream into multiple (writable) targets, but it is not the same thing. There, the piping happens twice in different ticks. This example pipes the same way as the answer to that question and still gets the error.
Instead of piping directly to the file you can add a listener to the stream you defined. So you can replace req.pipe(file) with
stream.on('data', function(data){
  file.write(data)
})
stream.on('end', function(){
  file.end()
})
or
stream.pipe(file)
This will pause the stream until it's read, something that doesn't happen with the request module.
More info: https://github.com/request/request/issues/887
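Applying that suggestion to the original example, the piping inside the response handler would look roughly like this (createPath, path and fname are from the question; this is only a sketch of the suggested approach):
req.on('response', function (res) {
  createPath(path, function () {
    var file = fs.createWriteStream(path + fname)
    var stream = through2.obj(function (chunk, enc, callback) {
      this.push(chunk)
      callback()
    })
    req.pipe(stream)   // pipe the response into the through2 stream...
    stream.pipe(file)  // ...and let that stream feed the file
  })
})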

write base64 to file using stream

I am sending a base64 string to my server. On the server I want to create a readable stream that I push the base64 chunks onto, which then goes to a writable stream and is written to a file. My problem is that only the first chunk is written to the file. My guess is that creating a new Buffer for each chunk is what causes the problem, but if I push the string chunks without creating the Buffer, the image file is corrupt.
var stream = require('stream');

var readable = new stream.Readable();
readable._read = function() {};

req.on('data', function(data) {
  var dataText = data.toString();
  var dataMatch = dataText.match(/^data:([A-Za-z-+\/]+);base64,(.+)$/);
  var bufferData = null;
  if (dataMatch) {
    bufferData = new Buffer(dataMatch[2], 'base64');
  }
  else {
    bufferData = new Buffer(dataText, 'base64');
  }
  readable.push(bufferData);
})
req.on('end', function() {
  readable.push(null);
})
This is not as trivial as you might think:
1. Use a Transform, not a Readable. You can pipe the request stream into the transform, thus handling back pressure.
2. You can't use regular expressions, because the text you are expecting can be broken across two or more chunks. You could try to accumulate chunks and exec the regular expression each time, but if the format of the stream is incorrect (that is, not a data URI) you will end up buffering the whole request and running the regular expression many times over a megabytes-long string.
3. You can't take an arbitrary chunk and do new Buffer(chunk, 'base64'), because it may not be valid on its own. Example: new Buffer('AQID', 'base64') yields new Buffer([1, 2, 3]), but Buffer.concat([new Buffer('AQ', 'base64'), new Buffer('ID', 'base64')]) yields new Buffer([1, 32]).
For problem 3 you can use one of the available modules (like base64-stream). Here is an example:
var base64 = require('base64-stream');
var stream = require('stream');
var decoder = base64.decode();
var input = new stream.PassThrough();
var output = new stream.PassThrough();
input.pipe(decoder).pipe(output);
output.on('data', function (data) {
  console.log(data);
});
input.write('AQ');
input.write('ID');
You can see that it buffers the input and emits data as soon as enough has arrived.
As for problem 2, you need to implement a simple stream parser. As an idea: wait for the data: string, then buffer chunks (if you need them) until ;base64, is found, then pipe the rest to base64-stream.
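A rough sketch of such a parser as a Transform stream (DataUriParser is a hypothetical name; this only illustrates the idea of stripping the header before handing the payload to base64-stream):
const { Transform } = require('stream');
const base64 = require('base64-stream');
const fs = require('fs');

// Accumulates bytes until the "data:<mime>;base64," header has been seen,
// then passes the remaining raw base64 text straight through.
class DataUriParser extends Transform {
  constructor() {
    super();
    this.header = '';
    this.headerDone = false;
  }
  _transform(chunk, enc, callback) {
    if (this.headerDone) {
      this.push(chunk);
      return callback();
    }
    this.header += chunk.toString();
    const idx = this.header.indexOf(';base64,');
    if (idx !== -1) {
      this.headerDone = true;
      // everything after ";base64," is payload
      this.push(this.header.slice(idx + ';base64,'.length));
    }
    callback();
  }
}

// usage sketch: strip the header, decode, write to disk
// req.pipe(new DataUriParser()).pipe(base64.decode()).pipe(fs.createWriteStream('upload.png'));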
