Node.js streams vs callbacks

I'm reading this article: http://elegantcode.com/2011/04/06/taking-baby-steps-with-node-js-pumping-data-between-streams/ and having some trouble understanding streams.
Quote:
"Suppose we want to develop a simple web application
that reads a particular file from disk and send it to the browser.
The following code shows a very simple and naïve implementation
in order to make this happen."
So the code sample is as follows:
var readStream = fileSystem.createReadStream(filePath);
readStream.on('data', function(data) {
    response.write(data);
});
readStream.on('end', function() {
    response.end();
});
Why would we use the above approach when we could simply do:
fs.readFile(filePath, function(err, data) {
    response.write(data);
    response.end();
});
When or why would I use streams?

You'd use streams when working with large files. With a callback, all of the file's contents must be loaded into memory at once, while with a stream, only a chunk of the file is in memory at any given time.
Also, the stream interface is arguably more elegant. Instead of explicitly attaching data, drain, and end callbacks, you can use pipe:
var readStream = fileSystem.createReadStream(filePath);
readStream.pipe(response);
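As an aside, newer Node versions (10+) also provide stream.pipeline, which, unlike plain .pipe(), forwards errors to a single callback. A minimal sketch, assuming the same filePath and response:

const fs = require('fs');
const { pipeline } = require('stream');

pipeline(fs.createReadStream(filePath), response, (err) => {
    if (err) console.error('Pipeline failed:', err);
});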

One big reason is that you can begin doing work on the data before it is all in memory. Think "streaming video", where you can begin watching a clip while it is still loading. In many use cases, a stream will allow you to begin processing data from a file before you have loaded the entire thing.
The other common use case is when you only want to read a file up until you detect some condition in the data. Say you need to check whether a large file contains the word "rabbit". With a callback pattern, you have to read the entire file into memory and then search it. With a stream, you might detect the word on line 5 of the file and close the stream right there, without ever loading the rest (see the sketch below).
There are obviously many more complex use cases, and there are still plenty of times when a callback makes more sense for simplicity (such as when you need to count the total number of times "rabbit" appears, in which case you have to read the entire file anyway).
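For illustration, a minimal sketch of that early-exit idea, assuming a filePath variable; the chunk-boundary handling here is deliberately simple:

const fs = require('fs');

const search = fs.createReadStream(filePath, { encoding: 'utf8' });
let tail = '';
search.on('data', (chunk) => {
    const text = tail + chunk;
    if (text.includes('rabbit')) {
        console.log('found it');
        search.destroy(); // stop reading; the rest of the file is never loaded
    }
    // keep a short tail so a match straddling two chunks isn't missed
    tail = text.slice(-('rabbit'.length - 1));
});
search.on('close', () => console.log('done'));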

Related

Difference between response.write vs stream.pipe(response) in NodeJS

As I understand it, response.write gives more control over the chunks of data being written, while pipe doesn't give any control over the chunks.
I am trying to stream files and I don't need any control over the chunks, so is it recommended to go with stream.pipe(response)? Is there any advantage, such as performance, over response.write?
downloadStream = readBucket.openDownloadStream(trackID);
downloadStream.on('data', chunk => {
    console.log('chunk');
    res.write(chunk);
});
downloadStream.on('error', error => {
    console.log('error occurred', error);
    res.sendStatus(500);
});
downloadStream.on('end', () => {
    res.end();
});
For my scenario, both snippets do the same thing. I prefer pipe because it's less code. Are there any performance benefits or memory/IO efficiency advantages to pipe() over response.write?
downloadStream = readBucket.openDownloadStream(trackID);
downloadStream.pipe(res);
.pipe() is just a ready-made way to send a readstream to a writestream. You can certainly code it manually if you want, but .pipe() handles a number of things for you.
I'd suggest it's kind of like fs.readFile(). If what you want to do is read a whole file into memory, fs.readFile() does the work of opening the file for reading, reading all the data into a buffer, closing the target file and giving you all the data at the end. If there are any errors, it makes sure the file you were reading gets closed.
The same is true of .pipe(). It hooks up to the data, finish and error events for you and handles all of those while streaming the data out to your write stream. Depending on the type of writestream, it also takes care of "finishing" or "closing" both the readstream and the writestream, even if there are errors.
And .pipe() has backpressure handling, something your code does not. When you call res.write(), it returns a boolean. If that boolean is false, the write buffer is full and you should not call res.write() again until the drain event fires. Note that your code does not do that. So .pipe() is more complete than what many people would typically write themselves.
The only situation I've seen where you're doing a pipe-like operation but can't use .pipe() is when you need very custom behavior during error conditions and want something significantly different from the default error handling. For just streaming the data and finishing both input and output streams, terminating both on error, it does exactly what you want, so there's really no reason to code it yourself when the desired behavior is already built in.
For my scenario, both snippets do the same thing. I prefer pipe because it's less code.
Same here.
Are there any performance benefits or memory/IO efficiency advantages to pipe() over response.write?
Yes, sort of. It probably has fewer bugs than the code you write yourself (like the missing backpressure handling in your example, which might only show up in some circumstances: large data, a slow connection).
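For comparison, here is roughly what a manual version with backpressure handling looks like; this is a sketch only, since the real .pipe() also handles unpiping and cleanup:

downloadStream.on('data', chunk => {
    if (!res.write(chunk)) {
        // res.write() returned false: the buffer is full, pause until it drains
        downloadStream.pause();
        res.once('drain', () => downloadStream.resume());
    }
});
downloadStream.on('end', () => res.end());
downloadStream.on('error', error => {
    console.log('error occurred', error);
    res.sendStatus(500);
});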

What's better, readSync or createReadStream (with Symbol.asyncIterator)?

createReadStream (with Symbol.asyncIterator)
async function* readChunkIter(chunksAsync) {
    for await (const chunk of chunksAsync) {
        // magic
        yield chunk;
    }
}
const fileStream = fs.createReadStream(filePath, { highWaterMark: 1024 * 64 });
const readChunk = readChunkIter(fileStream);
readSync
function* readChunkIter(fd) {
    const chunkSize = 1024 * 64;
    const buffer = Buffer.alloc(chunkSize);
    let position = 0, bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, position)) > 0) {
        position += bytesRead;
        // magic
        // note: the same buffer is reused, so consume each chunk before the next read
        yield buffer.subarray(0, bytesRead);
    }
}
const fd = fs.openSync(filePath, 'r');
const readChunk = readChunkIter(fd);
What's better to use with a generator function and why?
Update: I'm not looking for a better way; I want to know the difference between using these features.
To start with, you're comparing a synchronous file operation, fs.readSync(), with an asynchronous one, the stream (which uses fs.read() internally). So that's a bit like apples and oranges for server use.
If this is on a server, then NEVER use synchronous file I/O except at server startup time, because when processing requests or any other server events, synchronous file I/O blocks the entire event loop during the file read operation, which drastically reduces your server's scalability. Only use asynchronous file I/O, which between your two cases would be the stream.
Otherwise, if this is not on a server or any process that cares about blocking the node.js event loop during a synchronous file operation, then it's entirely up to you on which interface you prefer.
Other comments:
It's also unclear why you wrap the for await() in a generator. The caller can just use for await() themselves and avoid the extra wrapping, as in the sketch below.
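Since Node readstreams are async iterable, the loop can live directly in the consuming code; a minimal sketch reusing the fileStream setup from the question (inside an async function):

const fs = require('fs');

async function processFile(filePath) {
    const fileStream = fs.createReadStream(filePath, { highWaterMark: 1024 * 64 });
    for await (const chunk of fileStream) {
        // process chunk directly; no wrapper generator needed
    }
}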
Streams for reading files are usually used in an event-driven manner, by adding an event listener to the data event and responding to data as it arrives. If you're just going to asynchronously read chunks of data from the file, there's really no benefit to a stream. You may as well just use fs.read() or the promise-based filehandle.read().
We can't really comment on the best/better way to solve a problem without seeing the overall problem you're trying to code for. You've just shown one little snippet of reading data. The best way to structure that depends upon how the higher level code can most conveniently use/consume the data (which you don't show).
I really didn't ask the right question. I'm not looking for a better way; I want to know the difference between using these features.
Well, the main difference is that fs.readSync() is blocking and synchronous and thus blocks the event loop, ruining the scalability of a server and should never be used (except during startup code) in a server environment. Streams in node.js are asynchronous and do not block the event loop.
Other than that difference, streams are a higher-level construct than just reading the file directly: use them when you're actually using stream features, and probably avoid them when you're just reading chunks from the file directly and aren't using any features of streams.
In particular, error handling is not always so clear with streams, particularly when trying to use await and promises with them. That's probably because readstreams were originally designed as event-driven objects, which means errors are communicated indirectly via an error event, and that complicates error handling for straight read operations. If you're not using the event-driven nature of readstreams, a transform feature, or some other major feature of streams, I wouldn't use them; I'd use the more traditional fs.promises.readFile() to just read the data.
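For the plain "just read the data" case, that might look like this minimal sketch (loadFile is a hypothetical name here):

const { readFile } = require('fs/promises');

async function loadFile(filePath) {
    const data = await readFile(filePath); // a Buffer; pass 'utf8' for a string
    return data;
}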

Does write() (without callback) preserve order in node.js write streams?

I have a node.js program in which I use a stream to write information to an SFTP server. Something like this (simplified version):
var conn = new SSHClient();
process.nextTick(function () {
    conn.on('ready', function () {
        conn.sftp(function (error, sftp) {
            var writeStream = sftp.createWriteStream(filename);
            ...
            writeStream.write(line1);
            writeStream.write(line2);
            writeStream.write(line3);
            ...
        });
    }).connect(...);
});
Note that I'm not using the (optional) callback argument (described in the write() API specification), and I'm not sure whether this may cause undesired behaviour (i.e. lines not written in the order line1, line2, line3). In other words, I don't know whether this alternative (more complex code, and possibly less efficient) should be used instead:
writeStream.write(line1, ..., function() {
    writeStream.write(line2, ..., function() {
        writeStream.write(line3);
    });
});
(or equivalent alternative using async series())
Empirically, in my tests I have always gotten the file written in the desired order (first line1, then line2 and finally line3). However, I don't know whether this has happened just by chance or whether the above is the right way to use write().
I understand that writing to a stream is in general asynchronous (as all I/O work should be), but I wonder whether streams in node.js keep an internal buffer or similar that keeps data ordered, so that each write() call doesn't return until the data has been put in this buffer.
Examples of write() usage in real programs are very welcome. Thanks!
Does write() (without callback) preserve order in node.js write streams?
Yes, it does. It preserves the order of your writes to that specific stream. All the data you write goes through the stream buffer, which serializes it.
but I wonder if streams in node.js keep an internal buffer or similar that keeps data ordered, so each write() call doesn't return until the data has been put in this buffer.
Yes, all data does go through a stream buffer. The .write() operation does not return until the data has been successfully copied into the buffer unless an error occurs.
Note, that if you are writing any significant amount of data, you may have to pay attention to flow control (often called back pressure) on the stream. It can back up and may tell you that you need to wait before writing more, but it does buffer your writes in the order you send them.
If the .write() operation returns false, then the stream is telling you that you need to wait for the drain event before writing any more. You can read about this issue in the node.js docs for .write() and in this article about backpressure.
Your code also needs to listen for the error event to detect any errors upon writing the stream. Because the writes are asynchronous, they may occur at some later time and are not necessarily reflected in either the return value from .write() or in the err parameter to the .write() callback. You have to listen for the error event to make sure you see errors on the stream.
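If you want to respect backpressure while keeping the simple sequential style, a small helper can wrap write() in a promise. A hedged sketch (writeLine is a hypothetical helper; a production version should also reject on the stream's error event):

function writeLine(stream, data) {
    return new Promise((resolve) => {
        if (stream.write(data)) {
            resolve(); // buffered immediately, safe to continue
        } else {
            stream.once('drain', resolve); // wait for the buffer to empty first
        }
    });
}

// usage inside an async function; order is preserved
await writeLine(writeStream, line1);
await writeLine(writeStream, line2);
await writeLine(writeStream, line3);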

What is the correct way to build up a value from a stream?

I'm running an ffmpeg command in a child process; it converts a video file into a new format, emitting chunks of that new video to stdout as it goes, and I capture that with event handlers.
If I want to save the video as a file, I can create a writable stream for that file, and pipe the child process's stdout to it, that's fine. But now I want to generate screenshots from the videos, not to save as a file, but to create a base64 representation of that image in memory, then save it in a database. (I am aware that saving images in a database generally isn't recommended.)
I'm wondering now what the recommended way to do that is, to build up a value over time.
Right now, I've declared an array, chunks. Every time I get a new chunk of data from my ffmpeg process, I base64-encode it, and push it into chunks. When I get the close event from the stream, I call join() on that array, and that's my value.
This works fine, since my screenshots are 1 MB max. But is it a dumb thing to do? Is there something in the Node library, a structure like a stream or a buffer or a Uint8Array, that's intended for this kind of purpose, rather than building up an array and then joining it?
I'm using an array of Buffers, like this:
let chunks = [];
stream.on('data', chunk => chunks.push(chunk));
stream.on('end', () => {
    let result = Buffer.concat(chunks).toString('base64');
    // Do something with result
});
This is the most memory-efficient way to do it: no extra structures and no per-chunk string conversions; the raw chunks are copied just once, when Buffer.concat() assembles the final buffer.

Streaming / Piping JSON.stringify output in Node.js / Express

I have a scenario where I need to return a very large object, converted to a JSON string, from my Node.js/Express RESTful API.
res.end(JSON.stringify(obj));
However, this does not appear to scale well. Specifically, it works great on my testing machine with 1-2 clients connecting, but I suspect this operation may be hammering CPU and memory usage when many clients request large JSON objects simultaneously.
I've poked around looking for an async JSON library, but the only one I found seems to have an issue (specifically, I get a [RangeError]). Not only that, but it returns the string in one big chunk (e.g., the callback is called once with the entire string, meaning the memory footprint is not decreased).
What I really want is a completely asynchronous piping/streaming version of the JSON.stringify function, such that it writes the data as it is packed directly into the stream... thus saving me both memory footprint, and also from consuming the CPU in a synchronous fashion.
Ideally, you should stream your data as you have it and not buffer everything into one large object. If you can't change this, then you need to break stringify into smaller units and allow the main event loop to process other events using setImmediate. Example code (I'll assume the main object has lots of top-level properties and use them to split the work):
function sendObject(obj, stream) {
    var keys = Object.keys(obj);
    function sendSubObj() {
        setImmediate(function () {
            var key = keys.shift();
            // JSON.stringify(key) escapes the key properly
            stream.write(JSON.stringify(key) + ':' + JSON.stringify(obj[key]));
            if (keys.length > 0) {
                stream.write(',');
                sendSubObj();
            } else {
                stream.write('}');
            }
        });
    }
    stream.write('{');
    sendSubObj();
}
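Hypothetical usage in an Express handler (bigObj stands in for your large object); note that sendObject as written never ends the stream, so a real version would take a completion callback to know when to call res.end():

app.get('/big', function (req, res) {
    res.setHeader('Content-Type', 'application/json');
    sendObject(bigObj, res);
});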
It sounds like you want Dominic Tarr's JSONStream. Obviously, there is some assembly required to merge this with express.
However, if you are maxing out the CPU attempting to serialize (Stringify) an object, then splitting that work into chunks may not really solve the problem. Streaming may reduce the memory footprint, but won't reduce the total amount of "work" required.
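For what it's worth, a minimal sketch of the JSONStream route, assuming the data is available as an array of rows:

var JSONStream = require('JSONStream');

var out = JSONStream.stringify(); // emits '[', rows separated by commas, then ']'
out.pipe(res);
rows.forEach(function (row) {
    out.write(row);
});
out.end();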
