Node uploaded image save - stream vs buffer - node.js

I am working on image upload and don't know how to properly store the received file. I'd like to analyze the file first to check whether it is really an image or someone just changed the extension. Luckily, the sharp package I use has exactly such a feature. I am currently considering two approaches.
Buffering approach
I can parse the multipart form as a buffer and easily decide whether to save the file or not.
const metadata = await sharp(buffer).metadata();
if (metadata) {
  saveImage(buffer);
} else {
  throw new Error('It is not an image');
}
Streaming approach
I can parse the multipart form as a readable stream. First I need to pipe the readable stream to a writable one and store the file to disk. Afterward, I need to create another readable stream from the saved file and verify whether it is really an image. If it is not, revert everything.
// save uploaded file to file system with stream
readableStream.pipe(createWriteStream('./uploaded-file.jpg'));

// verify whether it is an image
createReadStream('./uploaded-file.jpg').pipe(
  sharp().metadata((err, metadata) => {
    if (!metadata) {
      revertAll();
      throw new Error('It is not an image');
    }
  })
);
My intention was to avoid using a buffer because, as far as I know, it keeps the whole file in RAM. On the other hand, the streaming approach seems really clunky.
Can someone help me understand how these two approaches differ in terms of performance and resource usage? Or is there a better approach to this situation?

In buffer mode, all the data coming from a resource is collected into a buffer (think of it as a data pool) until the operation is completed; it is then passed back to the caller as one single blob of data. Buffers in V8 are limited in size: you cannot allocate more than a few gigabytes, so you may hit a wall well before running out of physical memory if you need to read a big file.
Streams, on the other hand, allow you to process the data as soon as it arrives from the resource, without holding all of it in memory at once. Streams can therefore be more efficient in terms of both space (memory usage) and time (wall-clock time).

Related

What is the difference between async and stream writing files?

I know that it's possible to use async methods (like fs.appendFile) and streams (like fs.createWriteStream) to write files.
But why do we need both of them if streams are asynchronous as well and can provide us with better functionality?
Let's say you're downloading a huge file, a 1 TB file, and you want to write it to your filesystem.
You could download the whole file into an in-memory buffer and then fs.appendFile() or fs.writeFile() it to a local file. Or you could try, at least: you'd run out of memory.
Or you could create a read-stream for the downloading file, and pipe it to a write-stream for the write to your file-system:
const fs = require('fs');

const readStream = magicReadStreamFromUrl('https://example.com/large.txt');
const writeStream = fs.createWriteStream('large.txt');
readStream.pipe(writeStream);
This means the file is downloaded in chunks, and those chunks are piped to the writeStream (which writes them to disk), without you having to hold the whole file in memory yourself.
That is the reason for Streaming abstractions in general, and in Node in particular.
The http module supports streaming in this way, as do most other HTTP libraries such as request and axios. I've left out the specifics of how to create the read stream for brevity, as an exercise for the reader.

How to synchronously read from a ReadStream in node

I am trying to read UTF-8 text from a file in a memory- and time-efficient way. There are two ways to read directly from a file synchronously:
fs.readFileSync will read the entire file and return a buffer containing the file's entire contents
fs.readSync will read a set number of bytes from a file and return a buffer containing just those contents
I initially just used fs.readFileSync because it's easiest, but I'd like to be able to efficiently handle potentially large files by only reading in chunks of text at a time. So I started using fs.readSync instead. But then I realized that fs.readSync doesn't handle UTF-8 decoding. UTF-8 is simple, so I could whip up some logic to manually decode it, but Node already has services for that, so I'd like to avoid that if possible.
I noticed fs.createReadStream, which returns a ReadStream that can be used for exactly this purpose, but unfortunately it seems to only be available in an asynchronous mode of operation.
Is there a way to read from a ReadStream in a synchronous way? I have a massive stack built on top of this already, and I'd rather not have to refactor it to be asynchronous.
I discovered the string_decoder module, which handles all that UTF-8 decoding logic I was worried I'd have to write. At this point, it seems like a no-brainer to use this on top of fs.readSync to get the synchronous behavior I was looking for.
You basically just keep feeding bytes to it, and as it is able to successfully decode characters, it emits them. The Node documentation does a good job of describing how it works.

NodeJS Request Pipe buffer size

How can I set up the maximum buffer size on a NodeJS Request Pipe? I'm trying to use AWS Lambda to download from a source and pipe upload to a destination like in the code below:
request(source).pipe(request(destination))
This code works fine, but if the file size is bigger than the AWS Lambda memory size, it crashes. If I increase the memory, it works, so I know it is not the timeout or the link, only the memory allocation. I'd rather not increase the number, but even the maximum is still 1.5 GB, and I expect to transfer files bigger than that.
Is there a global variable for NodeJS on AWS Lambda for this? Or any other suggestion?
Two things to consider:
Do not use request(source).pipe(request(destination)) with or within a promise (async/await). For some reason it leaks memory when done with promises.
"However, STREAMING THE RESPONSE (e.g. .pipe(...)) is DISCOURAGED because Request-Promise would grow the memory footprint for large requests unnecessarily high. Use the original Request library for that. You can use both libraries in the same project." Source: https://www.npmjs.com/package/request-promise
To control how much memory the pipe uses: Set the highWaterMark for BOTH ends of the pipe. I REPEAT: BOTH ENDS OF THE PIPE. This will force the pipe to let only so much data into the pipe and out of the pipe, and thus limits its occupation in memory. (But does not limit how fast data moves through the pipe...see Bonus)
request.get(sourceUrl, {highWaterMark: 1024000, encoding: null}).pipe(request(destinationUrl, {highWaterMark: 1024000}));
1024000 is in bytes and is approximately 1 MB.
Source for highWaterMark background:
"Because Duplex and Transform streams are both Readable and Writable, each maintains two separate internal buffers used for reading and writing, allowing each side to operate independently of the other while maintaining an appropriate and efficient flow of data. For example, net.Socket instances are Duplex streams whose Readable side allows consumption of data received from the socket and whose Writable side allows writing data to the socket. Because data may be written to the socket at a faster or slower rate than data is received, it is important for each side to operate (and buffer) independently of the other." <- last sentence here is the important part.
https://nodejs.org/api/stream.html#stream_readable_pipe_destination_options
Bonus: If you want to throttle how fast data passes through the pipe, check something like this out: https://www.npmjs.com/package/stream-throttle
const throttle = require('stream-throttle');
let th = new throttle.Throttle({rate: 10240000}); // if you don't want to transfer data faster than ~10 MB/sec
request.get(sourceUrl, {highWaterMark: 1024000, encoding: null}).pipe(th).pipe(request(destinationUrl, {highWaterMark: 1024000}));

Are buffers generally faster to work with than streams?

I've tried a couple of Imagemagick wrapper libraries and some S3 libraries. I'm having trouble choosing the best concept due to big performance differences.
I have settled with the node library "gm", which is a joy to work with and very well documented.
As for S3 I have tried both Amazon's own AWS library as well as "S3-Streams"
Edit: I just discovered that the AWS library can handle streams. I suppose s3.upload is a new function (or have I just missed it?). Anyway, I ditched s3-streams, which uses s3.uploadPart and is much more complicated. After switching libraries, streaming is equal to uploading buffers in my test case.
My test case is to split a 2 MB JPG file into approximately 30 tiles of 512 px and send each tile to S3. ImageMagick has a really fast automatic way of generating tiles via the crop command. Unfortunately, I have not found any Node library that can catch the multi-file output of the autogenerated tiles. Instead, I have to generate the tiles in a loop by calling the crop command individually for each tile.
I'll present the total timings before the details:
A: 85 seconds (s3-streams)
A: 34 seconds (aws.s3.upload) (EDIT)
B: 35 seconds (buffers)
C: 25 seconds (buffers in parallel)
Clearly buffers are faster to work with than streams in this case. I don't know whether gm or s3-streams has a bad implementation of streams or whether I should have tweaked something. For now I'll go with solution B. C is even faster, but eats more memory.
I'm running this on a low-end DigitalOcean Ubuntu machine. This is what I have tried:
A. Generate tiles and stream them one by one
I have an array prepared with crop information and s3Key for each tile to generate
The array is looped with async.eachLimit(1). I have not succeeded in generating more than one tile at once, hence the limit of 1.
As the tiles are generated, they are directly streamed to S3
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
  gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .stream()
    .pipe(s3Stream({Key: tile.key, Bucket: tile.bucket})) // using the "s3-streams" package
    .on('finish', callback);
});
B. Generate tiles to buffers and upload each buffer directly with AWS-package
As the tiles are generated to buffers, they are directly uploaded to S3
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
  gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .toBuffer(function(err, buffer) {
      s3.upload(..
      callback()
    )
  })
});
C. Same as B, but store all buffers in the tile array for later upload in parallel
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
  gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .toBuffer(function(err, buffer) {
      tile.buffer = buffer;
      callback()
    })
});
..this next step is done after finishing the first each-loop. I don't seem to gain speed by pushing the limit above 10.
async.eachLimit(tiles, 10, function(tile, callback) {
  s3.upload(tile.buffer..
  callback()
  )
});
Edit: Some more background as per Mark's request
I originally left out the details in the hope that I would get a clear answer about buffer vs stream.
The goal is to serve our app with images in a responsive way via a node/Express API. Backend db is Postgres. Bulk storage is S3.
Incoming files are mostly photos, floor plan drawings, and PDF documents. The photos need to be stored in several sizes so I can serve them to the app responsively: thumbnail, low-res, mid-res and original resolution.
Floor plans have to be tiled so I can load them incrementally (scrolling tiles) in the app. A full-resolution A1 drawing can be about 50 megapixels.
Files uploaded to S3 span from 50 kB (tiles) to 10 MB (floor plans).
The files come from various directions, but always as streams:
Form posts via web or some other API (SendGrid)
Uploads from the app
Downloaded stream from S3 when uploaded files needs more processing
I'm not keen on having the files temporarily on local disk, hence the buffer-vs-stream question only. If I could use the disk, I'd use ImageMagick's own tile function for really speedy tiling.
Why not local disk?
Images are encrypted before uploading to S3. I don't want unencrypted files to linger in a temp directory.
There is always the issue of cleaning up temp files, with possible orphan files after unintended crashes etc.
After some more tinkering I feel obliged to answer my own question.
Originally I used the npm package s3-streams for streaming to S3. This package uses aws.s3.uploadPart.
Now I found out that the aws package has a neat function aws.s3.upload which takes a buffer or a stream.
After switching to AWS's own streaming function, there is no time difference between buffer and stream uploads.
I might have used s3-streams in the wrong way. But I also discovered a possible bug in this library (regarding files > 10 MB). I posted an issue but haven't gotten any answer. My guess is that the library has been abandoned since the s3.upload function appeared.
So, the answer to my own question:
There might be differences between buffers and streams, but in my test case they perform equally, which makes this a non-issue for now.
Here is the new "save"-part in the each loop:
let fileStream = gm(originalFileBuffer)
  .crop(tile.width, tile.height, tile.x, tile.y)
  .stream();
let params = {Bucket: 'myBucket', Key: tile.s3Key, Body: fileStream};
let s3options = {partSize: 10 * 1024 * 1024, queueSize: 1};
s3.upload(params, s3options, function(err, data) {
  console.log(err, data);
  callback();
});
Thank you for reading.

Saving a base 64 string to a file via createWriteStream

I have an image coming into my Node.js application via email (through cloud service provider Mandrill). The image comes in as a base64 encoded string, email.content in the example below. I'm currently writing the image to a buffer, and then a file like this:
// create buffer and write to file
var dataBuffer = Buffer.from(email.content, 'base64');
var writeStream = fs.createWriteStream(tmpFileName);
writeStream.once('open', function(fd) {
  console.log('Our stream is open, let\'s write to it');
  writeStream.write(dataBuffer);
  writeStream.end();
}); // writeStream.once('open')
writeStream.on('close', function() {
  fileStats = fs.statSync(tmpFileName);
This works fine and is all well and good, but am I essentially doubling the memory requirements for this section of code, since I have my image in memory (as the original string), and then create a buffer of that same string before writing the file? I'm going to be dealing with a lot of inbound images so doubling my memory requirements is a concern.
I tried several ways to write email.content directly to the stream, but it always produced an invalid file. I'm a rank amateur with modern coding, so you're welcome to tell me this concern is completely unfounded, as long as you tell me why, so some light will dawn on this marble head.
Thanks!
Since you already have the entire file in memory, there's no point in creating a write stream. Just use fs.writeFile
fs.writeFile(tmpFileName, email.content, 'base64', callback)
@Jonathan's answer is a better way to shorten the code you already have, so definitely do that.
I will expand on your question about memory, though. The fact is that Node will not write anything to a file without converting it to a Buffer first, so given what you have told us about email.content, there is nothing more you can do.
If you are really worried about this, you would need some way to process the value of email.content as a stream as it comes in from wherever you are getting it. Then, as the data streams into the server, you write it to a file immediately, thus not taking up any more RAM than needed.
If you elaborate more, I can try to fill in more info.
