Correct usage of _writev in node.js

What is the correct usage of _writev() in node.js?
The documentation says:
If a stream implementation is capable of processing multiple chunks of data at once, the writable._writev() method should be implemented.
It also says:
The primary intent of writable.cork() is to avoid a situation where writing many small chunks of data to a stream do not cause a backup in the internal buffer that would have an adverse impact on performance. In such situations, implementations that implement the writable._writev() method can perform buffered writes in a more optimized manner.
From a stream implementation perspective this is okay. But from a writable stream consumer's perspective, the only way that _write or _writev gets invoked is through writable.write() and writable.cork().
I would like to see a small example that depicts a practical use case for implementing _writev().

A writev method can be added to the instance, in addition to write, and if the stream's buffer contains several chunks that method will be picked instead of write. For example, Elasticsearch allows you to bulk insert records; so if you are creating a Writable stream to wrap Elasticsearch, it makes sense to have a writev method that does a single bulk insert rather than several individual ones, which is far more efficient. The same holds true, for example, for MongoDB and so on.
This post (not mine) shows an Elasticsearch implementation: https://medium.com/@mark.birbeck/using-writev-to-create-a-fast-writable-stream-for-elasticsearch-ac69bd010802
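For illustration, here is a minimal sketch of such a stream (not taken from the linked post); insertOne and bulkInsert are hypothetical stand-ins for whatever calls your datastore client actually exposes:
const { Writable } = require('stream');

class BulkWriter extends Writable {
  constructor(client) {
    super({ objectMode: true });
    this.client = client; // hypothetical datastore client
  }
  // Called when only a single chunk is pending.
  _write(record, encoding, callback) {
    this.client.insertOne(record).then(() => callback(), callback);
  }
  // Called when several chunks have been buffered: one round trip instead of many.
  _writev(chunks, callback) {
    const records = chunks.map(({ chunk }) => chunk);
    this.client.bulkInsert(records).then(() => callback(), callback);
  }
}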

_writev() will be invoked when using uncork(). There is a simple example in the Node documentation:
stream.cork();
stream.write('some ');
stream.write('data ');
process.nextTick(() => stream.uncork());
See also:
https://nodejs.org/api/stream.html#stream_writable_uncork
https://github.com/nodejs/node/blob/master/lib/_stream_writable.js#L257

Related

Node.js Streams: When will _writev Be Invoked?

The Node.js documentation makes the following comments about a Writable stream's _writev method.
The writable._writev() method may be implemented in addition or alternatively to writable._write() in stream implementations that are capable of processing multiple chunks of data at once. If implemented and if there is buffered data from previous writes, _writev() will be called instead of _write().
Emphasis mine. In what scenarios can a Node.js writable stream have buffered data from previous writes?
Is the _writev method only called after uncorking a corked stream that's had data written to it? Or are there other scenarios where a stream can have buffered data from previous writes? Bonus points if you can point to the place in the Node.js source code where it makes the decision between calling _write or _writev.
_writev() will be called whenever there is more than one piece of data buffered by the stream and the function has been defined. Using cork() could cause more data to be buffered, but so could slow processing.
The code that guards _writev is in lib/internal/streams/writable.js: the buffered chunks are checked, and the guard then decides whether to flush them all through _writev or one at a time through _write.
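As a rough sketch of the second case (no cork() involved), a slow _write callback is enough to make later writes pile up in the buffer and arrive via _writev:
const { Writable } = require('stream');

const slow = new Writable({
  // The first write goes here; delaying the callback makes later writes buffer up.
  write(chunk, encoding, callback) {
    console.log('_write:', chunk.toString());
    setTimeout(callback, 100);
  },
  // Once more than one chunk is buffered, they are flushed here in a single call.
  writev(chunks, callback) {
    console.log('_writev:', chunks.map(c => c.chunk.toString()));
    callback();
  }
});

slow.write('a'); // handled by _write
slow.write('b'); // buffered while _write is still "busy"
slow.write('c'); // buffered as well
slow.end();      // 'b' and 'c' arrive together via _writev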

Node JS Streams: Understanding data concatenation

One of the first things you learn when you look at node's http module is this pattern for concatenating all of the data events coming from the request read stream:
let body = [];
request.on('data', chunk => {
  body.push(chunk);
}).on('end', () => {
  body = Buffer.concat(body).toString();
});
However, if you look at a lot of streaming library implementations they seem to gloss over this entirely. Also, when I inspect the request.on('data', ...) event, it almost always emits only once for a typical JSON payload with a few to a dozen properties.
You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.
Is this because the request stream, when handling POST and PUT bodies, pretty much only ever emits one data event because their payload is way below the chunk partition size limit? In practice, how large would a JSON encoded object need to be to be streamed in more than one data chunk?
It seems to me that objectMode streams don't need to worry about concatenating, because if you're dealing with an object it is almost always no larger than one emitted data chunk, which atomically transforms into one object. I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful, as long as it could parse the individual objects in the collection and emit them one by one or in batches).
I find this to probably be the most confusing aspect of really understanding the Node.js specifics of streams: there is a weird disconnect between streaming raw data and dealing with atomic chunks like objects. Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries? If someone could clarify this it would be very appreciated.
The job of the code you show is to collect all the data from the stream into one buffer so when the end event occurs, you then have all the data.
request.on('data',...) may emit only once or it may emit hundreds of times. It depends upon the size of the data, the configuration of the stream object and the type of stream behind it. You cannot ever reliably assume it will only emit once.
You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.
You only use this concatenating pattern when you are trying to get the entire data from this stream into a single variable. The whole point of piping to another stream is that you don't need to fetch the entire data from one stream before sending it to the next stream. .pipe() will just send data as it arrives to the next stream for you. Same for transforms.
Is this because the request stream, when handling POST and PUT bodies, pretty much only ever emits one data event because their payload is way below the chunk partition size limit?
It is likely because the payload is below some internal buffer size and the transport is sending all the data at once and you aren't running on a slow link and .... The point here is you cannot make assumptions about how many data events there will be. You must assume there can be more than one and that the first data event does not necessarily contain all the data or data separated on a nice boundary. Lots of things can cause the incoming data to get broken up differently.
Keep in mind that a readStream reads data until there's momentarily no more data to read (up to the size of the internal buffer) and then it emits a data event. It doesn't wait until the buffer fills before emitting a data event. So, since all data at the lower levels of the TCP stack is sent in packets, all it takes is a momentary delivery delay with some packet and the stream will find no more data available to read and will emit a data event. This can happen because of the way the data is sent, because of things that happen in the transport over which the data flows or even because of local TCP flow control if lots of stuff is going on with the TCP stack at the OS level.
In practice, how large would a JSON encoded object need to be to be streamed in more than one data chunk?
You really should not know or care because you HAVE to assume that any size object could be delivered in more than one data event. You can probably safely assume that a JSON object larger than the internal stream buffer size (which you could find out by studying the stream code or examining internals in the debugger) WILL be delivered in multiple data events, but you cannot assume the reverse because there are other variables such as transport-related things that can cause it to get split up into multiple events.
It seems to me that objectMode streams don't need to worry about concatenating, because if you're dealing with an object it is almost always no larger than one emitted data chunk, which atomically transforms into one object. I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful, as long as it could parse the individual objects in the collection and emit them one by one or in batches).
Object mode streams must do their own internal buffering to find the boundaries of whatever objects they are parsing so that they can emit only whole objects. At some low level, they are concatenating data buffers and then examining them to see if they yet have a whole object.
Yes, you are correct that if you were using an object mode stream and the objects themselves were very large, they could consume a lot of memory. Likely this wouldn't be the optimal way of dealing with that type of data.
Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries?
Yes, they do.
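As a rough sketch of what that internal logic looks like, here is a hypothetical newline-delimited JSON parser: it buffers raw chunks until it has complete lines, then pushes one object per line:
const { Transform } = require('stream');

class NdjsonParse extends Transform {
  constructor() {
    super({ readableObjectMode: true }); // bytes in, objects out
    this.tail = ''; // leftover partial line from the previous chunk
  }
  _transform(chunk, encoding, callback) {
    const lines = (this.tail + chunk.toString('utf8')).split('\n');
    this.tail = lines.pop(); // the last piece may be an incomplete line
    try {
      for (const line of lines) {
        if (line.trim()) this.push(JSON.parse(line));
      }
      callback();
    } catch (err) {
      callback(err); // malformed JSON surfaces as a stream error
    }
  }
  _flush(callback) {
    if (!this.tail.trim()) return callback();
    try { this.push(JSON.parse(this.tail)); callback(); } catch (err) { callback(err); }
  }
}

// Usage: request.pipe(new NdjsonParse()).on('data', obj => console.log(obj));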
FYI, the first thing I do when making http requests is to go use the request-promise library so I don't have to do my own concatenating. It handles all this for you. It also provides a promise-based interface and about 100 other features which I find helpful.
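For example, a minimal usage sketch (the json: true option also parses the body for you; the URL is made up):
const rp = require('request-promise');

rp({ uri: 'https://example.com/api/items', json: true })
  .then(items => console.log(items.length)) // body already concatenated and parsed
  .catch(err => console.error(err));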

How to use a NodeJS Stream twice?

I have a readable NodeJS stream which I want to use twice. Disclaimer: I'm not very comfortable with streams.
Why?
My Service allows uploading of images for users. I want to avoid uploading of the same images.
My workflow is as follows:
upload image per ajax
get hash of image
if hash in database
    return url from database
else
    pass hash to resize & optimize pipeline
    upload image to s3 bucket
    get hash of image and write it to database with url
    return s3 url
I get the hash of my stream with hashstream and optimize my image with gm.
Hashstream takes a stream, closes it, creates a hash and returns it with a callback.
My question is: What would be the best approach to combine both methods?
There are two ways to solve it:
Buffer the stream
Since you don't know if your stream will be used again, you can simply buffer it up somehow (somehow meaning handling data events, or using some module, for example accum). As soon as you know what the outcome of the hash function is, you'd simply write the whole accumulated buffer into the gm stream.
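A rough sketch of that approach, using Node's crypto module for the hash instead of hashstream; upload is the incoming image stream, and findByHash, respond and optimizeAndUpload are hypothetical placeholders for your database lookup and gm/S3 pipeline:
const crypto = require('crypto');
const { PassThrough } = require('stream');

const chunks = [];
upload.on('data', chunk => chunks.push(chunk));
upload.on('end', () => {
  const image = Buffer.concat(chunks);
  const hash = crypto.createHash('sha256').update(image).digest('hex');

  findByHash(hash).then(existingUrl => {
    if (existingUrl) return respond(existingUrl); // duplicate: reuse the stored url
    // New image: replay the buffered data into the resize/optimize pipeline.
    const replay = new PassThrough();
    replay.end(image);
    replay.pipe(optimizeAndUpload(hash));
  });
});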
Use stream.pipe twice to "tee"
You probably know the posix command tee; likewise, you can push all the data into two places. Here's an example implementation of a tee method in my "scramjet" stream, but I guess for you it'd be quite sufficient to simply pipe twice. Then, as soon as you get your hash calculated and run into the first condition, I'd simply send an end.
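A sketch of that tee, where upload is the incoming image stream and hashOf, findByHash, respond, saveHash, optimizePipeline and s3Upload are hypothetical placeholders for your existing pieces:
const { PassThrough } = require('stream');

// Two copies of the same upload: one for hashing, one for the image pipeline.
const forHash = new PassThrough();
const forImage = new PassThrough();
upload.pipe(forHash);
upload.pipe(forImage);

forImage.pipe(optimizePipeline).pipe(s3Upload); // starts working right away

hashOf(forHash).then(hash => {
  findByHash(hash).then(existingUrl => {
    if (existingUrl) {
      forImage.destroy();   // duplicate: abandon the in-flight optimization
      return respond(existingUrl);
    }
    saveHash(hash);         // new image: remember the hash alongside the s3 url
  });
});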
The right choice depends on whether you want to conserve memory or CPU. For less memory, use two pipes (your optimization process will start, but you'll cancel it before it outputs anything). For less CPU and process usage, I'd go for buffering.
All in all, I would consider buffering only if you can easily scale to more incoming images or you know exactly how much load there is and can handle it. Either way there will be limits, and these limits need to be handled somehow; if you can start a couple more instances, then you are better off using more CPU and keeping memory at a sensible level.

Any benefit to using streams if all the data fits in a single chunk?

When I do fs.createReadStream in Node.js, the data seems to come in 64KB chunks (I assume this varies between computers).
Let's say I'm piping a read stream through a series of transformations (which each operate on a single chunk) and then finally piping it to a write stream to save it to disk...
If I know in advance that the files I'm working on are guaranteed to be less than 64KB each (ie, they'll each be read in a single chunk), is there any benefit to using streams, as opposed to plain old async code?
First of all, you can configure the chunk size using the highWaterMark parameter: it defaults to 16k for byte-mode streams (16 objects for object-mode streams), but fs.ReadStream defaults to 64k chunks (see the relevant source code).
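For example, to read a file in 1 KiB chunks instead of the 64 KiB default:
const fs = require('fs');

const stream = fs.createReadStream('data.bin', { highWaterMark: 1024 });
stream.on('data', chunk => console.log('got', chunk.length, 'bytes'));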
If you are absolutely sure that all of your data fits in a single chunk, there is indeed no immediate benefit to using streams.
But remember that streams are flexible; they are the unifying abstraction of your code: you can read data from a file, a socket or a random generator. You can add or remove a duplex stream from a streams pipeline and your code will still work in the same way.
You can also pipe a single readable stream into multiple writable streams, which would be a pain to do using only asynchronous callbacks…
Also note that streams don't emit data synchronously (i.e. the readable event is emitted on the next tick), which nicely protects you from the common mistake of calling an asynchronous callback synchronously, a possible source of stack overflow bugs.

Are callbacks for requests a bad practice in node.js?

Imagine you want to download an image or a file, this would be the first way the internet will teach you to go ahead:
request(url, function(err, res, body) {
  fs.writeFile(filename, body);
});
But doesn't this accumulate all data in body, filling the memory?
Would a pipe be totally more efficient?
request(url).pipe(fs.createWriteStream(filename));
Or is this handled internally in a similar manner, buffering the stream anyway and making this irrelevant?
Furthermore, if I want to use the callback but not the body (because you can still pipe), will this memory buffer still be filled?
I am asking because the first (callback) method allows me to chain downloads instead of launching them in parallel (*), but I don't want to fill a buffer I'm not gonna use either. So I need the callback if I don't want to resort to something fancy like async just to use queue to prevent this.
(*) Which is bad because if you just request too many files before they are complete, the async nature of request will cause node to choke to death in an overdose of events and memory loss. First you'll get these:
"possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit."
And when stretching it, 500 piped requests will fill your memory up and crash node. That's why you need the callback instead of the pipe, so you know when to start the next file.
But doesn't this accumulate all data in body, filling the memory?
Yes, many operations such as your first snippet buffer data into memory for processing. Yes this uses memory, but it is at least convenient and sometimes required depending on how you intend to process that data. If you want to load an HTTP response and parse the body as JSON, that is almost always done via buffering, although it's possible with a streaming parser, it is much more complicated and usually unnecessary. Most JSON data is not sufficiently large such that streaming is a big win.
Or is this handled internally in a similar manner, making this irrelevant?
No, APIs that provide you an entire piece of data as a string use buffering and not streaming.
However, multimedia data, yes, you cannot realistically buffer it to memory and thus streaming is more appropriate. Also that data tends to be opaque (you don't parse it or process it), which is also good for streaming.
Streaming is nice when circumstances permit it, but that doesn't mean there's anything necessarily wrong with buffering. The truth is buffering is how the vast majority of things work most of the time. In the big picture, streaming is just buffering 1 chunk at a time and capping them at some size limit that is well within the available resources. Some portion of the data needs to go through memory at some point if you are going to process it.
Because if you just request too many files one by one, the async nature of request will cause node to choke to death in an overdose of events and memory loss.
Not sure exactly what you are stating/asking here, but yes, writing effective programs requires thinking about resources and efficiency.
See also substack's rant on streaming/pooling in the hyperquest README.
I figured out a solution that renders the questions about memory irrelevant (although I'm still curious).
if I want to use the callback but not the body (because you can still pipe), will this memory buffer still be filled?
You don't need the callback from request() in order to know when the request is finished. The pipe() destination will close itself when the source stream ends; that close emits an event which can be listened for:
request(url).pipe(fs.createWriteStream(filename)).on('close', function() {
  next();
});
Now you can queue all your requests and download files one by one.
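A sketch of that queue, downloading a list of urls one at a time; filenameFor is a made-up helper that maps a url to a local file name:
const fs = require('fs');
const request = require('request');

const urls = ['...']; // whatever list your script already has

function next() {
  const url = urls.shift();
  if (!url) return; // done
  request(url)
    .pipe(fs.createWriteStream(filenameFor(url)))
    .on('close', next); // only start the next download once this one is written
}

next();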
Of course you can vacuum the internet using 8 parallel requests all the time with libraries such as async.queue, but if all you want to do is get some files with a simple script, async is probably overkill.
Besides, you're not gonna want to max out your system resources for a single trick on a multi-user system anyway.

Resources