Are buffers generally faster to work with than streams? - node.js

I've tried a couple of ImageMagick wrapper libraries and some S3 libraries. I'm having trouble choosing the best approach because of the big performance differences.
I have settled with the node library "gm", which is a joy to work with and very well documented.
As for S3, I have tried both Amazon's own AWS library and "s3-streams".
Edit: I just discovered that the AWS library can handle streams. I suppose this is a new function, s3.upload (or have I just missed it?). Anyway, I ditched s3-streams, which makes use of s3.uploadPart and is much more complicated. After switching libraries, streaming is equal to uploading buffers in my test case.
My test case is to split a 2 MB JPEG file into approximately 30 512-px tiles and send each of the tiles to S3. ImageMagick has a really fast automatic way of generating tiles via the crop command. Unfortunately I have not found any node library that can catch the multi-file output from the autogenerated tiles. Instead I have to generate the tiles in a loop by calling the crop command individually for each tile.
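For reference, the array of tile definitions (crop geometry plus s3Key) that the loops below iterate over is prepared roughly like this. This is a simplified sketch; the tile size, bucket name and key naming scheme are placeholders, not my actual values:
// Simplified sketch: build the list of tiles (crop geometry + S3 key) for one image.
// tileSize, bucket and the key scheme are placeholders.
function buildTiles(imageWidth, imageHeight, tileSize, bucket) {
  const tiles = [];
  for (let y = 0; y < imageHeight; y += tileSize) {
    for (let x = 0; x < imageWidth; x += tileSize) {
      tiles.push({
        x: x,
        y: y,
        width: Math.min(tileSize, imageWidth - x),   // edge tiles may be smaller
        height: Math.min(tileSize, imageHeight - y),
        bucket: bucket,
        s3Key: 'tiles/' + x + '_' + y + '.jpg'
      });
    }
  }
  return tiles;
}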
I'll present the total timings before the details:
A: 85 seconds (s3-streams)
A: 34 seconds (aws.s3.upload) (EDIT)
B: 35 seconds (buffers)
C: 25 seconds (buffers in parallel)
Clearly buffers are faster to work with than streams in this case. I don't know if gm or s3-streams has a bad implementation of streams or if I should have tweaked something. For now I'll go with solution B. C is even faster, but eats more memory.
I'm running this on a low end Digital Ocean Ubuntu machine. This is what I have tried:
A. Generate tiles and stream them one by one
I have an array prepared with crop information and s3Key for each tile to generate
The array is looped with "async.eachLimit(1)". I have not succeeded in generating more than one tile at once, hence limit(1).
As the tiles are generated, they are directly streamed to S3
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
  gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .stream()
    .pipe(s3Stream({Key: tile.key, Bucket: tile.bucket})) // using the "s3-streams" package
    .on('finish', callback)
});
B. Generate tiles to buffers and upload each buffer directly with AWS-package
As the tiles are generated to buffers, they are directly uploaded to S3
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
  gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .toBuffer(function(err, buffer) {
      s3.upload(..
        callback()
      )
    })
});
C. Same as B, but store all buffers in the tile array for later upload in parallel
Pseudo code:
async.eachLimit(tiles, 1, function(tile, callback) {
  gm(originalFileBuffer)
    .crop(tile.width, tile.height, tile.x, tile.y)
    .toBuffer(function(err, buffer) {
      tile.buffer = buffer;
      callback()
    })
});
..this next step is done after finishing the first each-loop. I don't seem to gain speed by pushing the limit higher than 10.
async.eachLimit(tiles, 10, function(tile, callback) {
  s3.upload(tile.buffer..
    callback()
  )
});
Edit: Some more background as per Mark's request
I originally left out the details in the hope that I would get a clear answer about buffer vs stream.
The goal is to serve our app with images in a responsive way via a node/Express API. Backend db is Postgres. Bulk storage is S3.
Incoming files are mostly photos, floor plan drawings and PDF documents. The photos need to be stored in several sizes so I can serve them to the app in a responsive way: thumbnail, low-res, mid-res and original resolution.
Floor plans have to be tiled so I can load them incrementally (scrolling tiles) in the app. A full-resolution A1 drawing can be about 50 megapixels.
Files uploaded to S3 span from 50 kB (tiles) to 10 MB (floor plans).
The files come from various directions, but always as streams:
Form posts via web or some other API (SendGrid)
Uploads from the app
Downloaded streams from S3 when uploaded files need more processing
I'm not keen on having the files temporarily on local disk, hence only buffer vs stream. If I could use the disk I'd use IM's own tile function for really speedy tiling.
Why not local disk?
Images are encrypted before uploading to S3. I don't want unencrypted files to linger in a temp directory.
There is always the issue of cleaning up temp files, with possible orphan files after unintended crashes etc.

After some more tinkering I feel obliged to answer my own question.
Originally I used the npm package s3-streams for streaming to S3. This package uses aws.s3.uploadPart.
Now I found out that the aws package has a neat function aws.s3.upload which takes a buffer or a stream.
After switching to AWS's own streaming function there is no time difference between buffer and stream upload.
I might have used s3-streams in the wrong way, but I also discovered a possible bug in this library (regarding files > 10 MB). I posted an issue, but haven't gotten any answer. My guess is that the library has been abandoned since the s3.upload function appeared.
So, the answer to my own question:
There might be differences between buffers and streams, but in my test case they are equal, which makes this a non-issue for now.
Here is the new "save"-part in the each loop:
let fileStream = gm(originalFileBuffer)
  .crop(tile.width, tile.height, tile.x, tile.y)
  .stream();
let params = {Bucket: 'myBucket', Key: tile.s3Key, Body: fileStream};
let s3options = {partSize: 10 * 1024 * 1024, queueSize: 1};
s3.upload(params, s3options, function(err, data) {
  console.log(err, data);
  callback();
});
Thank you for reading.

Related

Node uploaded image save - stream vs buffer

I am working on image upload and don't know how to properly deal with storing the received file. It would be nice to analyze the file first to check whether it is really an image or someone just changed the extension. Luckily I use the sharp package, which has exactly such a feature. I currently work with two approaches.
Buffering approach
I can parse the multipart form as a buffer and easily decide whether to save the file or not.
const metadata = await sharp(buffer).metadata();
if (metadata) {
  saveImage(buffer);
} else {
  throw new Error('It is not an image');
}
Streaming approach
I can parse the multipart form as a readable stream. First I need to forward the readable stream to a writable one and store the file to disk. Afterwards, I need to create a readable stream from the saved file again and verify whether it is really an image. Otherwise, revert everything.
// save uploaded file to file system with stream
readableStream.pipe(createWriteStream('./uploaded-file.jpg'));
// verify whether it is an image
createReadStream('./uploaded-file.jpg').pipe(
  sharp().metadata((err, metadata) => {
    if (!metadata) {
      revertAll();
      throw new Error('It is not an image');
    }
  })
);
It was my intention to avoid using a buffer because, as far as I know, it needs to hold the whole file in RAM. But on the other hand, the approach using streams seems really clunky.
Can someone help me understand how these two approaches differ in terms of performance and resource usage? Or is there some better approach to deal with such a situation?
In buffer mode, all the data coming from a resource is collected into a buffer (think of it as a data pool) until the operation is completed; it is then passed back to the caller as one single blob of data. Buffers in V8 are limited in size: you cannot allocate more than a few gigabytes of data, so you may hit a wall well before running out of physical memory if you need to read a big file.
On the other hand, streams allow us to process the data as soon as it arrives from the resource, without holding all of it in memory. Streams can be more efficient in terms of both space (memory usage) and time (overall clock time, since processing can start before all the data has arrived).
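As an illustration of the difference (this uses hashing as a stand-in for any per-chunk processing and is unrelated to sharp itself), here is the same job done both ways. The buffered version has to hold the whole file in RAM, while the streamed version only ever holds one chunk at a time:
// Illustrative sketch only: hash a large file with a buffer vs. with a stream.
const fs = require('fs');
const crypto = require('crypto');

// Buffered: the entire file is read into one Buffer first.
async function hashBuffered(path) {
  const data = await fs.promises.readFile(path);            // whole file in RAM
  return crypto.createHash('sha256').update(data).digest('hex');
}

// Streamed: chunks are hashed as they arrive and then discarded.
function hashStreamed(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('error', reject)
      .on('end', () => resolve(hash.digest('hex')));
  });
}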

Icecast: I have strange behaviour with repeats of end of tracks, as well as pitch changes from my Icecast Server

I only began using icecast a few days ago, so if I stuffed something up somewhere, please let me know.
I have a weird problem with Icecast. Every time a track "finishes" on Icecast, a section of the end of the currently playing track (I think 64 kB of it) is repeated about 2 to 3 times before the next song plays, and the next song doesn't begin at the start but a few seconds of the way through. I can also notice that the playback speed (and hence the pitch) sometimes differs from the original.
I consulted this post and this post (the latter quoted below), which taught me what the <burst-on-connect> and <burst-size> tags are used for. It also taught me this:
What's happening here is that nothing is being added to the buffer, so clients connect, get the contents of that buffer, and then the stream ends. The client must be re-connecting repeatedly, and it keeps getting that same buffer.
Cheers to Brad for that post. A solution to this problem was provided in the comments section of that post: decrease the <source-timeout> of the Icecast server so that it closes the connection quicker and stops any repeating. But that assumes I want to close the mountpoint, and I don't, because what I am using Icecast for is actually a 24/7 radio player. If I did close my mountpoint, VLC would just turn off and not attempt to reconnect anymore. Unless this is wrong. I don't know.
I use VLC to listen to the playback of the Icecast streams, and I use nodeshout, which is a set of libshout bindings for Node.js. I use nodeshout to send data to a bunch of mounts on my Icecast server. In the future I plan to make a site that will listen to the Icecast streams, replacing VLC.
icecast.xml
<limits>
    <clients>100</clients>
    <sources>4</sources>
    <queue-size>1008576</queue-size>
    <client-timeout>30</client-timeout>
    <header-timeout>15</header-timeout>
    <source-timeout>30</source-timeout>
    <burst-on-connect>1</burst-on-connect>
    <burst-size>252144</burst-size>
</limits>
This is a summary of the audio sending code on my node.js server.
nodejs
// These lines are a smaller part of a function that sets all the information. The variables name, description etc. come from the function's arguments.
var nodeshout = require("nodeshout");
let shout = nodeshout.create();
shout.setHost('localhost');
shout.setPort(8000);
shout.setUser('source');
shout.setPassword(process.env.icecastPassword); //password in .env file
shout.setName(name);
shout.setDescription(description);
shout.setMount(mount);
shout.setGenre(genre);
shout.setFormat(1); // 0=ogg, 1=mp3
shout.setAudioInfo('bitrate', '128');
shout.setAudioInfo('samplerate', '44100');
shout.setAudioInfo('channels', '2');
return shout
// Meanwhile, somewhere lower in the file, this is a summary of how the audio is sent to the Icecast server
var nodeshout = require("nodeshout")
var {FileReadStream, ShoutStream} = require("nodeshout") //here is where the FileReadStream and ShoutStream functions come from
const filecontent = new FileReadStream(pathToSong, 65536); //if I change the 65536 to a higher value, then more bytes are being repeated at the end of the track. If I decrease this, it starts sounding buggy and off.
var streamcontent = filecontent.pipe(new ShoutStream(shoutstream))
streamcontent.on('finish', () => {
  next();
  console.log("Track has finished on " + stream.name + ": " + chosenTrack);
});
I also notice weirder behaviour: only after the previous song has had its last chunk repeated a few times does the streamcontent.on('finish') event in the Node.js script fire and warn me that the track is finished.
What I have tried
I tried messing around with the <source-timeout> tag, the number of bytes (or bits, I'm not sure) that are sent from Node.js, and the burst size. I also tried turning bursting off completely, but that results in super strange behaviour.
I also thought creating a new stream for every song (as seen in new ShoutStream(shoutstream) when piping the file data) was a bad idea, but using the same stream meant the program would throw an error, because it would write the next track to the shoutstream after it had already said it was closed.
If any more information is necessary to figure out what is going on, I can provide it. Thanks for your time.
Edit: I would like to add: Do you think I should manually control how many bytes are sent to icecast and then use the same stream object instead of calling a new one every time?
I found out why the stream didn't play some tracks as opposed to others.
How I got there
I could not switch to Ogg/Vorbis or Ogg/Opus for my stream, so I had to do something with my source client. I double-checked that everything was correct and that my audio files were at the correct bitrate. When I ran ffprobe audio.mp3, the bitrates sometimes did not adhere to the typical rates of 128 kbps, 192 kbps, 320 kbps and so on. It was always some strange value, such as 129852, just to provide an example.
I then downloaded the checkmate mp3 checker here and checked my audio files, and they were all encoded in a variable bitrate!!! VBR damnit!
TLDR
I fixed my problem by re-encoding all my tracks to a constant bitrate of 128 kbps using ffmpeg.
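For reference, the re-encode can be scripted from Node with child_process; something along these lines (the folder path and the output naming are placeholders, not my exact script):
// Sketch: re-encode every .mp3 in a folder to 128 kbps CBR with ffmpeg.
// Note: this launches one ffmpeg process per file; for a large library you
// would want to limit concurrency.
const { execFile } = require('child_process');
const { readdirSync } = require('fs');
const path = require('path');

const musicDir = './tracks'; // placeholder folder

for (const file of readdirSync(musicDir)) {
  if (path.extname(file) !== '.mp3') continue;
  const input = path.join(musicDir, file);
  const output = path.join(musicDir, path.basename(file, '.mp3') + '-cbr.mp3');
  execFile('ffmpeg', ['-i', input, '-codec:a', 'libmp3lame', '-b:a', '128k', output], (err) => {
    if (err) console.error('Failed to re-encode ' + file, err);
    else console.log('Re-encoded ' + file + ' -> ' + output);
  });
}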
Quick edit: I am pretty sure that programs such as Darkice might already support variable-bitrate transfers to Icecast servers, but it would be impractical for me to use Darkice, which is why I stuck with nodeshout.

NodeJS Simulate Live Video Stream

I have a video file that I would like to start broadcasting from NodeJS, preferably through Express, at a given time. That is, if the video starts being available at timestamp t0, then if a client hits the video endpoint at time t0+60, the video playback would start at 60 seconds in.
My key requirement is that when a client connects at a given time, no more of the video is available than what would have been seen so far, so a client connecting at t0+60 would not be able to watch past the minute mark (plus some error threshold) initially, and every ~second another second of video availability would be added, simulating a live experience synced across all clients regardless of when each loads the stream.
So far, I've tried my luck converting videos to Apple's HLS protocol (because the name sounds promising) and I was able to host the m3u8 files using Node's hls-server library, where the call is very straightforward:
import HLSServer = require('hls-server');
import http = require('http');

const source = __dirname + '/resources';
const server = http.createServer();
const hls = new HLSServer(server, {
  path: '/streams', // Base URI to output HLS streams
  dir: source       // Directory that input files are stored
});
server.listen(8000);
However, it sends the entire video to the browser when asked, and appears to offer no option of forcing a start at a given frame. (I imagine forcing the start position can be done out of band by simply sending the current time to the client and then having the client do whatever is necessary with HTML and Javascript to advance to the latest position).
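For instance, the out-of-band part could be as small as an Express endpoint that reports how far into the "broadcast" we currently are, and the client then seeks the player to that offset. The route name and the fixed start time below are only placeholders:
// Sketch of the out-of-band idea: the server knows t0 and tells each client
// how far in to start; the client sets e.g. video.currentTime = elapsedSeconds.
const express = require('express');
const app = express();

const broadcastStart = Date.now(); // t0: when the video became "live"

app.get('/stream-position', (req, res) => {
  const elapsedSeconds = Math.floor((Date.now() - broadcastStart) / 1000);
  res.json({ elapsedSeconds });
});

app.listen(3000);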
There are some vague approaches that I saw online that use MP4, but from what I understand, due to its compression it is hard to know how many bytes of video data correspond to a given footage duration, as it may vary widely.
There are also some other tutorials which have a direct pipe from an input source such as a webcam, thereby requiring liveness, but for my comparatively simple use case where the video file is already present, I'm content with the ability to maintain a limited amount of precision, such as ±10 seconds, just as long as all clients are forced to be approximately in sync.
Thank you very much in advance, and I appreciate any pointers.

What is the difference between async and stream writing of files?

I know that it's possible to use async methods (like fs.appendFile) and streams (like fs.createWriteStream) to write files.
But why do we need both of them if streams are asynchronous as well and can provide us with better functionality?
Let's say you're downloading a file, a huge file, a 1 TB file, and you want to write that file to your filesystem.
You could download the whole file into an in-memory buffer, then fs.appendFile() or fs.writeFile() the buffer to a local file. Or try to, at least; you'd run out of memory.
Or you could create a read stream for the file being downloaded and pipe it to a write stream that writes it to your file system:
const readStream = magicReadStreamFromUrl/*[1]*/('https://example.com/large.txt');
const writeStream = fs.createWriteStream('large.txt');
readStream.pipe(writeStream);
This means that the file is downloaded in chunks, and those chunks get piped to the writeStream (which would write them to disk), without having to store it in-memory yourself.
That is the reason for Streaming abstractions in general, and in Node in particular.
The http module supports streaming in this way, as do most other HTTP libraries like request and axios. I've left out the specifics of how to create the read stream as an exercise for the reader, for brevity.
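For reference, one concrete way to create such a read stream is the built-in https module; the response object handed to the callback is itself a readable stream, so it can be piped straight into the write stream:
const https = require('https');
const fs = require('fs');

https.get('https://example.com/large.txt', (response) => {
  // response is an http.IncomingMessage, which is a readable stream
  response.pipe(fs.createWriteStream('large.txt'));
});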

Saving a base 64 string to a file via createWriteStream

I have an image coming into my Node.js application via email (through cloud service provider Mandrill). The image comes in as a base64 encoded string, email.content in the example below. I'm currently writing the image to a buffer, and then a file like this:
//create buffer and write to file
var dataBuffer = new Buffer(email.content, 'base64');
var writeStream = fs.createWriteStream(tmpFileName);
writeStream.once('open', function(fd) {
  console.log('Our stream is open, lets write to it');
  writeStream.write(dataBuffer);
  writeStream.end();
}); //writeStream.once('open')
writeStream.on('close', function() {
  fileStats = fs.statSync(tmpFileName);
  // ...rest of the handler omitted
});
This works fine and is all well and good, but am I essentially doubling the memory requirements for this section of code, since I have my image in memory (as the original string), and then create a buffer of that same string before writing the file? I'm going to be dealing with a lot of inbound images so doubling my memory requirements is a concern.
I tried several ways to write email.content directly to the stream, but it always produced an invalid file. I'm a rank amateur with modern coding, so you're welcome to tell me this concern is completely unfounded as long as you tell me why so some light will dawn on marble head.
Thanks!
Since you already have the entire file in memory, there's no point in creating a write stream. Just use fs.writeFile
fs.writeFile(tmpFileName, email.content, 'base64', callback)
@Jonathan's answer is a better way to shorten the code you already have, so definitely do that.
I will expand on your question about memory though. The fact is that Node will not write anything to a file without converting it to a Buffer first, so given what you have told us about email.content, there is nothing more you can do.
If you are really worried about this though, then you would need some way to process the value of email.content as it comes in from wherever you are getting it, as a stream. Then, as the data is streamed into the server, you immediately write it to a file, thus not taking up any more RAM than needed.
If you elaborate more, I can try to fill in more info.
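To make that concrete, here is a rough sketch of what the streaming path could look like, assuming you could get the base64 data as a readable stream in the first place (someBase64Stream below is hypothetical; I don't know whether Mandrill's webhook gives you one). The transform keeps any partial 4-character base64 group until the next chunk arrives, so chunk boundaries don't corrupt the decoded output:
const { Transform } = require('stream');
const fs = require('fs');

// Decodes base64 text arriving in arbitrary chunk sizes.
class Base64Decode extends Transform {
  constructor() {
    super();
    this.remainder = '';
  }
  _transform(chunk, encoding, callback) {
    const data = this.remainder + chunk.toString('ascii');
    const usable = data.length - (data.length % 4); // only decode whole 4-char groups
    this.remainder = data.slice(usable);
    callback(null, Buffer.from(data.slice(0, usable), 'base64'));
  }
  _flush(callback) {
    callback(null, Buffer.from(this.remainder, 'base64'));
  }
}

// someBase64Stream is a stand-in for a readable stream of the base64 text.
someBase64Stream.pipe(new Base64Decode()).pipe(fs.createWriteStream(tmpFileName));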
