Saving a base 64 string to a file via createWriteStream - node.js

I have an image coming into my Node.js application via email (through cloud service provider Mandrill). The image comes in as a base64 encoded string, email.content in the example below. I'm currently writing the image to a buffer, and then a file like this:
//create buffer and write to file
var dataBuffer = new Buffer(email.content, 'base64');
var writeStream = fs.createWriteStream(tmpFileName);
writeStream.once('open', function(fd) {
    console.log('Our stream is open, let\'s write to it');
    writeStream.write(dataBuffer);
    writeStream.end();
}); //writeStream.once('open')
writeStream.on('close', function() {
    fileStats = fs.statSync(tmpFileName);
    // ...continue working with the written file here
});
This works fine and is all well and good, but am I essentially doubling the memory requirements for this section of code, since I have my image in memory (as the original string), and then create a buffer of that same string before writing the file? I'm going to be dealing with a lot of inbound images so doubling my memory requirements is a concern.
I tried several ways to write email.content directly to the stream, but it always produced an invalid file. I'm a rank amateur with modern coding, so you're welcome to tell me this concern is completely unfounded as long as you tell me why so some light will dawn on marble head.
Thanks!

Since you already have the entire file in memory, there's no point in creating a write stream. Just use fs.writeFile
fs.writeFile(tmpFileName, email.content, 'base64', callback)
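If you want error handling as well, a minimal sketch could look like this (tmpFileName and email.content are the same values as in the question; the logging is just illustrative):
const fs = require('fs');

// fs.writeFile accepts an encoding, so Node decodes the base64 string for you.
fs.writeFile(tmpFileName, email.content, 'base64', (err) => {
    if (err) {
        return console.error('Failed to write image:', err);
    }
    const fileStats = fs.statSync(tmpFileName); // same follow-up as in the question
    console.log('Wrote', fileStats.size, 'bytes');
});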

@Jonathan's answer is a better way to shorten the code you already have, so definitely do that.
I will expand on your question about memory, though. The fact is that Node will not write anything to a file without converting it to a Buffer first, so given what you have told us about email.content, there is nothing more you can do.
If you are really worried about this, then you would need some way to process the value of email.content as a stream as it comes in from wherever you are getting it. Then, as the data is streamed into the server, you immediately write it to a file, never holding more of it in RAM than necessary.
If you elaborate more, I can try to fill in more info.
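As a rough illustration of that idea only: if the base64 text arrived as a readable stream (incomingStream below is a placeholder, not something the question's setup necessarily provides), you could decode it on the fly with a small Transform and pipe it straight to disk. Base64 decodes in 4-character groups, so any partial group has to be carried over to the next chunk.
const fs = require('fs');
const { Transform, pipeline } = require('stream');

// Sketch of an incremental base64 decoder; assumes the input is plain base64 text.
class Base64Decode extends Transform {
    constructor() {
        super();
        this.remainder = ''; // holds a partial 4-character group between chunks
    }
    _transform(chunk, encoding, callback) {
        const text = (this.remainder + chunk.toString('ascii')).replace(/\s/g, '');
        const usable = text.length - (text.length % 4); // decode only whole groups
        this.remainder = text.slice(usable);
        callback(null, Buffer.from(text.slice(0, usable), 'base64'));
    }
    _flush(callback) {
        callback(null, Buffer.from(this.remainder, 'base64'));
    }
}

// incomingStream is a placeholder for wherever the base64 text streams from
pipeline(incomingStream, new Base64Decode(), fs.createWriteStream(tmpFileName), (err) => {
    if (err) console.error('Streaming write failed:', err);
});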

Related

Node uploaded image save - stream vs buffer

I am working on image upload and don't know how to properly deal with storing the received file. It would be nice to analyze the file first to check whether it is really an image or someone just changed the extension. Luckily, I use the package sharp, which has exactly such a feature. I currently work with two approaches.
Buffering approach
I can parse the multipart form as a buffer and easily decide whether to save the file or not.
const metadata = await sharp(buffer).metadata();
if (metadata) {
    saveImage(buffer);
} else {
    throw new Error('It is not an image');
}
Streaming approach
I can parse the multipart form as a readable stream. First I need to pipe the readable stream into a writable one and store the file to disk. Afterward, I need to create another readable stream from the saved file and verify whether it really is an image; otherwise, revert everything.
// save uploaded file to file system with stream
readableStream.pipe(createWriteStream('./uploaded-file.jpg'));

// verify whether it is an image
createReadStream('./uploaded-file.jpg').pipe(
    sharp().metadata((err, metadata) => {
        if (!metadata) {
            revertAll();
            throw new Error('It is not an image');
        }
    })
);
My intention was to avoid using a buffer because, as far as I know, it requires holding the whole file in RAM. But on the other hand, the approach using streams seems really clunky.
Can someone help me understand how these two approaches differ in terms of performance and resource usage? Or is there a better approach for dealing with such a situation?
In buffer mode, all the data coming from a resource is collected into a buffer (think of it as a data pool) until the operation is completed; it is then passed back to the caller as one single blob of data. Buffers in V8 are limited in size: you cannot allocate more than a few gigabytes of data, so you may hit a wall well before running out of physical memory if you need to read a big file.
Streams, on the other hand, allow us to process the data as soon as it arrives from the resource, without first storing it all in memory. Streams can therefore be more efficient in terms of both space (memory usage) and time (clock time).
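To make the difference concrete, here is a minimal sketch (not specific to sharp; file names are placeholders) that copies a file both ways. The buffered version holds the entire file in memory at once, while the streamed version only ever holds one chunk (64 KB by default for fs streams) at a time.
const fs = require('fs');
const { pipeline } = require('stream');

// Buffered: the whole file is read into a single Buffer before anything is written.
fs.readFile('input.jpg', (err, wholeFile) => {
    if (err) throw err;
    fs.writeFile('copy-buffered.jpg', wholeFile, (err) => {
        if (err) throw err;
    });
});

// Streamed: data flows chunk by chunk, so memory use stays flat regardless of file size.
pipeline(
    fs.createReadStream('input.jpg'),
    fs.createWriteStream('copy-streamed.jpg'),
    (err) => {
        if (err) console.error('Copy failed:', err);
    }
);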

How to synchronously read from a ReadStream in node

I am trying to read UTF-8 text from a file in a memory and time efficient way. There are two ways to read directly from a file synchronously:
fs.readFileSync will read the entire file and return a buffer containing the file's entire contents
fs.readSync will read a set amount of bytes from a file and return a buffer containing just those contents
I initially just used fs.readFileSync because it's easiest, but I'd like to be able to efficiently handle potentially large files by only reading in chunks of text at a time. So I started using fs.readSync instead. But then I realized that fs.readSync doesn't handle UTF-8 decoding. UTF-8 is simple, so I could whip up some logic to manually decode it, but Node already has services for that, so I'd like to avoid that if possible.
I noticed fs.createReadStream, which returns a ReadStream that can be used for exactly this purpose, but unfortunately it seems to only be available in an asynchronous mode of operation.
Is there a way to read from a ReadStream in a synchronous way? I have a massive stack built on top of this already, and I'd rather not have to refactor it to be asynchronous.
I discovered the string_decoder module, which handles all that UTF-8 decoding logic I was worried I'd have to write. At this point, it seems like a no-brainer to use this on top of fs.readSync to get the synchronous behavior I was looking for.
You basically just keep feeding bytes to it, and as soon as it can successfully decode complete characters, it returns them. The Node documentation describes how it works well enough.
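For example, a sketch of that combination might look like the following. readFileSyncInChunks and the chunk size are my own invention for illustration; fs.readSync and StringDecoder are used as documented.
const fs = require('fs');
const { StringDecoder } = require('string_decoder');

// Read a file synchronously in fixed-size chunks, decoding UTF-8 as we go.
// StringDecoder holds back the trailing bytes of any character split across chunks.
function readFileSyncInChunks(path, onText, chunkSize = 64 * 1024) {
    const fd = fs.openSync(path, 'r');
    const decoder = new StringDecoder('utf8');
    const buffer = Buffer.alloc(chunkSize);
    try {
        let bytesRead;
        while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
            onText(decoder.write(buffer.slice(0, bytesRead)));
        }
        onText(decoder.end()); // flush any remaining partial character
    } finally {
        fs.closeSync(fd);
    }
}

// Usage: print each decoded chunk as it is read
readFileSyncInChunks('./big-file.txt', (text) => process.stdout.write(text));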

Write stream into buffer object

I have a stream that is being read from an audio source and I'm trying to store it into a Buffer. From the documentation that I've read, you are able to pipe the stream into one using fs.createWriteStream(~buffer~) instead of a file path.
I'm doing this currently as:
const outputBuffer = Buffer.alloc(150000)
const stream = fs.createWriteStream(outputBuffer)
but when I run it, it throws an error saying that the Path: must be a string without null bytes for the file system call.
If I'm misunderstanding the docs or missing something obvious please let me know!
The first parameter to fs.createWriteStream() is the path of the file to write to. That is why you receive that particular error.
There is no way to read from a stream directly into an existing Buffer. There was a node EP to support this, but it more or less died off because there are some potential gotchas with it.
For now you will need to either copy the bytes manually or, if you don't want node to allocate extra Buffers, call fs.open(), fs.read() (the method that lets you pass in your own Buffer instance, along with an offset), and fs.close() yourself.
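Two hedged sketches of what those options can look like (audioStream, the file name and the sizes are placeholders; the second option uses the synchronous fd variants for brevity):
const fs = require('fs');

// Option 1 (works for any readable stream): collect the chunks as Buffers
// and concatenate them once the stream ends.
const chunks = [];
audioStream.on('data', (chunk) => chunks.push(chunk));
audioStream.on('end', () => {
    const outputBuffer = Buffer.concat(chunks); // sized exactly to the bytes received
    // ...use outputBuffer here
});
audioStream.on('error', (err) => console.error(err));

// Option 2 (only if the source is a file): read directly into a Buffer you allocated,
// using the lower-level fd-based API mentioned above.
const existing = Buffer.alloc(150000);
const fd = fs.openSync('input.raw', 'r');
const bytesRead = fs.readSync(fd, existing, 0, existing.length, 0);
fs.closeSync(fd);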

How to upload file in node.js with http module?

I have this code so far, but cannot get the Buffer binary.
var http = require('http');
var myServer = http.createServer(function(request, response) {
    var data = '';
    request.on('data', function (chunk) {
        data += chunk;
    });
    request.on('end', function() {
        if (request.headers['content-type'] == 'image/jpg') {
            var binary = Buffer.concat(data);
            //some file handling would come here if binary would be OK
            response.write(binary.size)
            response.writeHead(201)
            response.end()
        }
    });
});
But I get this error: throw new TypeError('Usage: Buffer.concat(list, [length])');
You're doing three bad things:
Using the Buffer API wrong - hence the error message.
Concatenating binary data as strings
Buffering data in memory
Mukesh has dealt with #1, so I'll cover the deeper problems.
First, you're receiving binary Buffer chunks and converting them to strings with the default (utf8) encoding, then concatenating them. This will corrupt your data. Besides the fact that some byte sequences aren't valid utf8 at all, if a valid multi-byte sequence is cut in half at a chunk boundary you'll lose that data too.
Instead, you should keep the data as binary throughout: maintain an array of Buffers, push each chunk onto it, and concatenate them all at the end.
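A minimal sketch of that fix (within the question's request handler):
const chunks = [];
request.on('data', (chunk) => chunks.push(chunk)); // keep each chunk as a Buffer
request.on('end', () => {
    const binary = Buffer.concat(chunks); // Buffer.concat expects an array of Buffers
    console.log('Received', binary.length, 'bytes'); // note: .length, not .size
});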
This leads to problem #3. You are buffering the whole upload into memory and then writing it to a file, instead of streaming it directly to a (temporary) file. This puts a lot of load on your application, both in memory used and in time spent allocating it all. You should just pipe the request to a file output stream, then inspect it on disk.
If you are only accepting very small files you may get away with keeping them in memory, but you need to protect yourself from clients sending too much data (and indeed lying about how much they're going to send).
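Putting #2 and #3 together, a hedged sketch of such a handler might look like this. The temp-file naming, the size limit and the port are arbitrary choices, and a real server would also count bytes as they arrive rather than trusting Content-Length alone.
const http = require('http');
const fs = require('fs');
const os = require('os');
const path = require('path');
const { pipeline } = require('stream');

const MAX_BYTES = 5 * 1024 * 1024; // arbitrary upload limit

http.createServer((request, response) => {
    // note: the standard MIME type is image/jpeg, not image/jpg
    if (request.headers['content-type'] !== 'image/jpeg') {
        response.writeHead(415);
        return response.end();
    }
    const declared = Number(request.headers['content-length'] || 0);
    if (!declared || declared > MAX_BYTES) {
        response.writeHead(413);
        return response.end();
    }

    // Stream the body straight to a temporary file instead of buffering it in memory.
    const tmpFile = path.join(os.tmpdir(), 'upload-' + Date.now() + '.jpg');
    pipeline(request, fs.createWriteStream(tmpFile), (err) => {
        if (err) {
            response.writeHead(500);
            return response.end();
        }
        response.writeHead(201, { 'Content-Type': 'text/plain' });
        response.end('Stored ' + tmpFile);
    });
}).listen(3000);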

Are buffers generally faster to work with than streams?

I've tried a couple of Imagemagick wrapper libraries and some S3 libraries. I'm having trouble choosing the best concept due to big performance differences.
I have settled with the node library "gm", which is a joy to work with and very well documented.
As for S3, I have tried both Amazon's own AWS library and "s3-streams".
Edit: I just discovered that the AWS library can handle streams. I suppose this is a new function, s3.upload (or have I just missed it?). Anyway, I ditched s3-streams, which makes use of s3.uploadPart and is much more complicated. After switching libraries, streaming is equal to uploading buffers in my test case.
My test case is to split a 2 MB jpg file into approx 30 512px tiles and send each of the tiles to S3. ImageMagick has a really fast automatic way of generating tiles via the crop command. Unfortunately I have not found any node library that can catch the multi-file output from the autogenerated tiles. Instead I have to generate the tiles in a loop, calling the crop command individually for each tile.
I'll present the total timings before the details:
A: 85 seconds (s3-streams)
A: 34 seconds (aws.s3.upload) (EDIT)
B: 35 seconds (buffers)
C: 25 seconds (buffers in parallel)
Clearly buffers are faster to work with than streams in this case. I don't know if gm or s3-streams has a bad implementation of streams or if I should have tweaked something. For now I'll go with solution B. C is even faster, but eats more memory.
I'm running this on a low end Digital Ocean Ubuntu machine. This is what I have tried:
A. Generate tiles and stream them one by one
I have an array prepared with crop information and s3Key for each tile to generate
The array is looped with "async.eachLimit(1)". I have not succeeded in generating more than one tile at once, hence limit(1).
As the tiles are generated, they are directly streamed to S3
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
    gm(originalFileBuffer)
        .crop(tile.width, tile.height, tile.x, tile.y)
        .stream()
        .pipe(s3Stream({Key: tile.key, Bucket: tile.bucket})) //using "s3-streams" package
        .on('finish', callback);
});
B. Generate tiles to buffers and upload each buffer directly with AWS-package
As the tiles are generated to buffers, they are directly uploaded to S3
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
    gm(originalFileBuffer)
        .crop(tile.width, tile.height, tile.x, tile.y)
        .toBuffer(function(err, buffer) {
            s3.upload(..
                callback()
            )
        })
});
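For reference, a filled-in version of B could look roughly like this (bucket, key and content type are placeholders; s3 is assumed to be an AWS.S3 instance from the aws-sdk v2 package):
async.eachLimit(tiles, 1, function(tile, callback) {
    gm(originalFileBuffer)
        .crop(tile.width, tile.height, tile.x, tile.y)
        .toBuffer('JPG', function(err, buffer) {
            if (err) return callback(err);
            s3.upload(
                {Bucket: tile.bucket, Key: tile.key, Body: buffer, ContentType: 'image/jpeg'},
                function(err) { callback(err); }
            );
        });
});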
C. Same as B, but store all buffers in the tile array for later upload in parallel
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
    gm(originalFileBuffer)
        .crop(tile.width, tile.height, tile.x, tile.y)
        .toBuffer(function(err, buffer) {
            tile.buffer = buffer;
            callback()
        })
});
...this next step is done after the first each-loop has finished. I don't seem to gain any speed by pushing the limit above 10.
async.eachLimit(tiles, 10, function(tile, callback) {
    s3.upload(tile.buffer..
        callback()
    )
});
Edit: Some more background as per Mark's request
I originally left out the details in the hope that I would get a clear answer about buffer vs stream.
The goal is to serve our app with images in a responsive way via a node/Express API. Backend db is Postgres. Bulk storage is S3.
Incoming files are mostly photos, floor plan drawings and pdf documents. The photos need to be stored in several sizes so I can serve them to the app in a responsive way: thumbnail, low-res, mid-res and original resolution.
Floor plans have to be tiled so I can load them incrementally (scrolling tiles) in the app. A full-resolution A1 drawing can be about 50 MPixels.
Files uploaded to S3 span from 50 kB (tiles) to 10 MB (floor plans).
The files come from various directions, but always as streams:
Form posts via web or some other API (SendGrid)
Uploads from the app
Downloaded streams from S3 when uploaded files need more processing
I'm not keen on having the files temporarily on local disk, hence only buffer vs stream. If I could use the disk I'd use IM's own tile function for really speedy tiling.
Why not local disk?
Images are encrypted before uploading to S3. I don't want unencrypted files to linger in a temp directory.
There is always the issue of cleaning up temp files, with possible orphan files after unintended crashes etc.
After some more tinkering I feel obliged to answer my own question.
Originally I used the npm package s3-streams for streaming to S3. This package uses aws.s3.uploadPart.
Now I found out that the aws package has a neat function, aws.s3.upload, which takes a buffer or a stream.
After switching to AWS's own streaming function, there is no time difference between buffer and stream uploads.
I might have used s3-streams in the wrong way. But I also discovered a possible bug in this library (regarding files > 10 MB). I posted an issue, but haven't gotten any answer. My guess is that the library has been abandoned since the s3.upload function appeared.
So, the answer to my own question:
There might be differences between buffers and streams, but in my test case they are equal, which makes this a non-issue for now.
Here is the new "save"-part in the each loop:
let fileStream = gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .stream();
let params = {Bucket: 'myBucket', Key: tile.s3Key, Body: fileStream};
let s3options = {partSize: 10 * 1024 * 1024, queueSize: 1};
s3.upload(params, s3options, function(err, data) {
    console.log(err, data);
    callback();
});
Thank you for reading.
