Memory efficient growing Nodejs Duplex/Transform stream - node.js

I am trying to add variables into a template at specific indices through streams.
The idea is that I have a readable stream coming in and a list of variables that can each be a readable stream, a buffer, or a string of undetermined size. These variables can be inserted at a predefined list of indices. I have a few questions based on my assumptions and what I have tried so far.
My first attempt was to do it manually with readable streams. However, I couldn't call const buffer = templateIn.read(size) (the buffers were still empty) before the combined template stream was trying to read it. The solution to that problem is similar to how you'd use a transform stream, so that was the next step I took.
However, I have a problem with the transform streams. My problem is that something like this pseudo code will pile up buffers into memory until done() is called.
public _transform(chunk: Buffer, encoding: string, done: (err?: Error, data?: any) => void): void {
  let index = 0;
  while (index < chunk.length) {
    if (index === this.variableIndex) { // the basic idea (the actual logic is a bit more complex)
      this.insertStreamHere(index);
      index++;
    } else {
      // continue reading stream normally
    }
  }
  done();
}
From: https://github.com/nodejs/node/blob/master/lib/_stream_transform.js
In a transform stream, the written data is placed in a buffer. When
_read(n) is called, it transforms the queued up data, calling the
buffered _write cb's as it consumes chunks. If consuming a single
written chunk would result in multiple output chunks, then the first
outputted bit calls the readcb, and subsequent chunks just go into
the read buffer, and will cause it to emit 'readable' if necessary.
This way, back-pressure is actually determined by the reading side,
since _read has to be called to start processing a new chunk. However,
a pathological inflate type of transform can cause excessive buffering
here. For example, imagine a stream where every byte of input is
interpreted as an integer from 0-255, and then results in that many
bytes of output. Writing the 4 bytes {ff,ff,ff,ff} would result in
1kb of data being output. In this case, you could write a very small
amount of input, and end up with a very large amount of output. In
such a pathological inflating mechanism, there'd be no way to tell
the system to stop doing the transform. A single 4MB write could
cause the system to run out of memory.
So TL;DR: How do I insert (large) streams at a specific index, without building up a huge back pressure of buffers in memory? Any advice is appreciated.

After a lot of reading the documentation and the source code, a lot of trial and error and some testing. I have come up with a solution for my problem. I can just copy and paste my solution, but for the sake of completeness I will explain my findings here.
Handling back pressure with pipes consists of a few parts. We've got the Readable that writes data to the Writable. The Writable receives a callback with each chunk, with which it can tell the Readable that it is ready to receive the next chunk of data. The reading part is simpler: the Readable has an internal buffer, and Readable.push() adds data to that buffer. When the data is being read, it comes from this internal buffer. Next to that, we can use Readable.readableHighWaterMark and Readable.readableLength to make sure we don't push too much data at once.
Readable.readableHighWaterMark - Readable.readableLength
is the maximum amount of bytes we should push to this internal buffer.
This means that, since we want to read from two Readable streams at the same time, we need two Writable streams to control the flow. To merge the data we will need to buffer it ourselves, since there is (as far as I know) no internal buffer in the Writable stream. So a Duplex stream is the best option, because we want to handle buffering, writing and reading ourselves.
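To make that push rule concrete, here is a minimal sketch (not part of the merging stream itself, just an illustration of the formula above) of a plain Readable subclass that never pushes more than the remaining buffer space:
import { Readable } from 'stream';

// Minimal illustrative sketch: push at most readableHighWaterMark - readableLength
// bytes per _read() call, so the internal read buffer never overshoots its limit.
class BoundedSource extends Readable {
  private offset = 0;

  constructor(private readonly data: Buffer) {
    super();
  }

  _read(): void {
    const available = this.readableHighWaterMark - this.readableLength;
    if (available <= 0) {
      return; // buffer is full, wait until the consumer drains it
    }
    if (this.offset >= this.data.length) {
      this.push(null); // nothing left, end the stream
      return;
    }
    const end = Math.min(this.offset + available, this.data.length);
    this.push(this.data.slice(this.offset, end));
    this.offset = end;
  }
}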
Writing
So let's get to the code now. To control the state of multiple streams we will create a state interface, which looks as follows:
declare type StreamCallback = (error?: Error | null) => void;

interface MergingState {
  callback: StreamCallback;
  queue: BufferList;
  highWaterMark: number;
  size: number;
  finalizing: boolean;
}
The callback holds the last callback provided by either write or final (we'll get to final later). highWaterMark indicates the maximum size of our queue and size is the current size of the queue. Lastly, the finalizing flag indicates that the current queue is the last queue, so once the queue is empty we're done reading the stream belonging to that state.
BufferList is a copy of the internal Node.js implementation used for the built-in streams.
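The newMergingState() factory used further down is not shown in the snippets; assuming it simply initializes this interface, it would look roughly like this (a hypothetical sketch, the real one lives in the complete code):
// Hypothetical sketch of the factory used below; assumes callback may be null
// in practice, since stateCallbackAndSet(state, null) is called elsewhere.
function newMergingState(highWaterMark: number): MergingState {
  return {
    callback: null,          // no pending write callback yet
    queue: new BufferList(), // the copied internal Node.js BufferList
    highWaterMark,           // maximum number of bytes we are willing to queue
    size: 0,                 // current number of queued bytes
    finalizing: false,       // set once the last chunk has been written
  };
}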
As mentioned before the writable handles the back pressure, so the generalized method for both our writables looks like the following:
/**
 * Method to write to provided state if it can
 *
 * (Will unshift the bytes that cannot be written back to the source)
 *
 * @param src the readable source that writes the chunk
 * @param chunk the chunk to be written
 * @param encoding the chunk encoding, currently not used
 * @param cb the streamCallback provided by the writing state
 * @param state the state which should be written to
 */
private writeState(src: Readable, chunk: Buffer, encoding: string, cb: StreamCallback, state: MergingState): void {
  this.mergeNextTick();
  const bytesAvailable = state.highWaterMark - state.size;
  if (chunk.length <= bytesAvailable) {
    // safe to write to our local buffer
    state.queue.push(chunk);
    state.size += chunk.length;
    if (chunk.length === bytesAvailable) {
      // our queue is full, so store our callback
      this.stateCallbackAndSet(state, cb);
    } else {
      // we still have some space, so we can call the callback immediately
      cb();
    }
    return;
  }
  if (bytesAvailable === 0) {
    // no space available, unshift the entire chunk
    src.unshift(chunk);
  } else {
    state.size += bytesAvailable;
    const leftOver = Buffer.alloc(chunk.length - bytesAvailable);
    chunk.copy(leftOver, 0, bytesAvailable);
    // push the amount of bytes available
    state.queue.push(chunk.slice(0, bytesAvailable));
    // unshift what we cannot fit in our queue
    src.unshift(leftOver);
  }
  this.stateCallbackAndSet(state, cb);
}
First we check how much space is available to buffer. If there is enough space for our full chunk, we'll buffer it. If there is no space available, we will unshift the buffer to its readable source. If there is some space available, we'll only unshift what we cannot fit. If our buffer is full, we will store the callback that requests a new chunk. If there is space we will request our next chunk.
this.mergeNextTick() is called because our state has changed, so that it can be read in the next tick:
private mergeNextTick(): void {
  if (!this.mergeSync) {
    // make sure it is only called once per tick
    // we don't want to call it multiple times
    // since there will be nothing left to read the second time
    this.mergeSync = true;
    process.nextTick(() => this._read(this.readableHighWaterMark));
  }
}
this.stateCallbackAndSet is a helper function that will just call our last callback, to make sure we don't get into a state that makes our stream stop flowing, and will then set the new one provided.
/**
 * Helper function to call the callback if it exists and set the new callback
 * @param state the state which holds the callback
 * @param cb the new callback to be set
 */
private stateCallbackAndSet(state: MergingState, cb: StreamCallback): void {
  if (!state) {
    return;
  }
  if (state.callback) {
    const callback = state.callback;
    // do callback next tick, such that we can't get stuck in a writing loop
    process.nextTick(() => callback());
  }
  state.callback = cb;
}
Reading
Now onto the reading side: this is the part where we handle selecting the correct stream.
First our function to read a state, which is pretty straightforward: it reads as many bytes as it is able to and returns the amount of bytes it read, which is useful information for our other functions.
/**
 * Method to read the provided state if it can
 *
 * @param size the number of bytes to consume
 * @param state the state from which needs to be read
 * @returns the amount of bytes read
 */
private readState(size: number, state: MergingState): number {
  if (state.size === 0) {
    // our queue is empty so we read 0 bytes
    return 0;
  }
  let buffer = null;
  if (state.size < size) {
    buffer = state.queue.consume(state.size, false);
  } else {
    buffer = state.queue.consume(size, false);
  }
  this.push(buffer);
  this.stateCallbackAndSet(state, null);
  state.size -= buffer.length;
  return buffer.length;
}
The doRead method is where all the merging takes place: it fetches the nextMergingIndex. If the merging index is END, we can just read the writingState until the end of the stream. If we are at the merging index, we read from the mergingState. Otherwise we read from the writingState until we reach the next merging index.
/**
 * Method to read from the correct Queue
 *
 * The doRead method is called multiple times by the _read method until
 * it is satisfied with the returned size, or until no more bytes can be read
 *
 * @param n the number of bytes that can be read until highWaterMark is hit
 * @throws Errors when something goes wrong, so wrap this method in a try catch.
 * @returns the number of bytes read from either buffer
 */
private doRead(n: number): number {
  // first check all constants below 0,
  // which is only Merge.END right now
  const nextMergingIndex = this.getNextMergingIndex();
  if (nextMergingIndex === Merge.END) {
    // read writing state until the end
    return this.readWritingState(n);
  }
  const bytesToNextIndex = nextMergingIndex - this.index;
  if (bytesToNextIndex === 0) {
    // We are at the merging index, thus should read merging queue
    return this.readState(n, this.mergingState);
  }
  if (n <= bytesToNextIndex) {
    // We are safe to read n bytes
    return this.readWritingState(n);
  }
  // read the bytes until the next merging index
  return this.readWritingState(bytesToNextIndex);
}
readWritingState reads the state and updates the index:
/**
 * Method to read from the writing state
 *
 * @param n maximum number of bytes to be read
 * @returns number of bytes written.
 */
private readWritingState(n: number): number {
  const bytesWritten = this.readState(n, this.writingState);
  this.index += bytesWritten;
  return bytesWritten;
}
Merging
For selecting our streams to merge we'll use a generator function. The generator function yields an index and a stream to merge at that index:
export interface MergingStream { index: number; stream: Readable; }
In doRead, getNextMergingIndex() is called. This function returns the index of the next MergingStream. If there is no next mergingStream, the generator is called to fetch a new one. If there is no new merging stream, we'll just return END.
/**
 * Method to get the next merging index.
 *
 * Also fetches the next merging stream if merging stream is null
 *
 * @returns the next merging index, or Merge.END if there is no new mergingStream
 * @throws Error when invalid MergingStream is returned by streamGenerator
 */
private getNextMergingIndex(): number {
  if (!this.mergingStream) {
    this.setNewMergeStream(this.streamGenerator.next().value);
    if (!this.mergingStream) {
      return Merge.END;
    }
  }
  return this.mergingStream.index;
}
In setNewMergeStream we create a new Writable which we can pipe our new merging stream into. For our Writable we need to handle the write callback for writing to our state and the final callback for handling the last chunk. We should also not forget to reset our state.
/**
 * Method to set the new merging stream
 *
 * @throws Error when mergingStream has an index less than the current index
 */
private setNewMergeStream(mergingStream?: MergingStream): void {
  if (this.mergingStream) {
    throw new Error('There already is a merging stream');
  }
  // Set a new merging stream
  this.mergingStream = mergingStream;
  if (mergingStream == null || mergingStream.index === Merge.END) {
    // set new state
    this.mergingState = newMergingState(this.writableHighWaterMark);
    // We're done, for now...
    // mergingStream will be handled further once nextMainStream() is called
    return;
  }
  if (mergingStream.index < this.index) {
    throw new Error('Cannot merge at ' + mergingStream.index + ' because current index is ' + this.index);
  }
  // Create a new writable our new mergingStream can write to
  this.mergeWriteStream = new Writable({
    // Create a write callback for our new mergingStream
    write: (chunk, encoding, cb) => this.writeMerge(mergingStream.stream, chunk, encoding, cb),
    final: (cb: StreamCallback) => {
      this.onMergeEnd(mergingStream.stream, cb);
    },
  });
  // Create a new mergingState for our new merging stream
  this.mergingState = newMergingState(this.mergeWriteStream.writableHighWaterMark);
  // Pipe our new merging stream to our sink
  mergingStream.stream.pipe(this.mergeWriteStream);
}
Finalizing
The last step in the process is to handle our final chunks, so that we know when to end merging and can send an end chunk. In our main read loop we first read until our doRead() method returns 0 twice in a row, or until it has filled our read buffer. Once that happens, we end our read loop and check our states to see if they have finished.
public _read(size: number): void {
  if (this.finished) {
    // we've finished, there is nothing left to read
    return;
  }
  this.mergeSync = false;
  let bytesRead = 0;
  do {
    const availableSpace = this.readableHighWaterMark - this.readableLength;
    bytesRead = 0;
    READ_LOOP: while (bytesRead < availableSpace && !this.finished) {
      try {
        const result = this.doRead(availableSpace - bytesRead);
        if (result === 0) {
          // either there is nothing in our buffers
          // or our states are outdated (since they get updated in doRead)
          break READ_LOOP;
        }
        bytesRead += result;
      } catch (error) {
        this.emit('error', error);
        this.push(null);
        this.finished = true;
      }
    }
  } while (bytesRead > 0 && !this.finished);
  this.handleFinished();
}
Then in our handleFinished() we check our states.
private handleFinished(): void {
  if (this.finished) {
    // merge stream has finished, so nothing to check
    return;
  }
  if (this.isStateFinished(this.mergingState)) {
    this.stateCallbackAndSet(this.mergingState, null);
    // set our mergingStream to null, to indicate we need a new one
    // which will be fetched by getNextMergingIndex()
    this.mergingStream = null;
    this.mergeNextTick();
  }
  if (this.isStateFinished(this.writingState)) {
    this.stateCallbackAndSet(this.writingState, null);
    this.handleMainFinish(); // checks if there are still mergingStreams left, and sets finished flag
    this.mergeNextTick();
  }
}
isStateFinished() checks whether our state has the finalizing flag set and whether the queue size equals 0:
/**
 * Method to check if a specific state has completed
 * @param state the state to check
 * @returns true if the state has completed
 */
private isStateFinished(state: MergingState): boolean {
  if (!state || !state.finalizing || state.size > 0) {
    return false;
  }
  return true;
}
The finalizing flag is set in the final callback of our merging Writable stream. For our main stream we have to approach it a little differently, since we have little control over when our stream ends, because the readable calls end() on our writable by default. We want to remove this behavior so that we can decide when to finish our stream. This might cause some issues when other end listeners are set, but for most use cases this should be fine.
private onPipe(readable: Readable): void {
  // prevent our stream from being closed prematurely and unpipe it instead
  readable.removeAllListeners('end'); // Note: will cause issues if another end listener is set
  readable.once('end', () => {
    this.finalizeState(this.writingState);
    readable.unpipe();
  });
}
The finalizeState() sets the flag and the callback to end the stream.
/**
 * Method to put a state in finalizing mode
 *
 * Finalizing mode: the last chunk has been received, when size is 0
 * the stream should be removed.
 *
 * @param state the state which should be put in finalizing mode
 */
private finalizeState(state: MergingState, cb?: StreamCallback): void {
  state.finalizing = true;
  this.stateCallbackAndSet(state, cb);
  this.mergeNextTick();
}
And that is how you merge multiple streams in one single sink.
TL;DR: The complete code
This code has been fully tested with my jest test suite on multiple edge cases, and has a few more features than explained here, such as appending streams and merging into that appended stream by providing Merge.END as the index.
Test result
You can see the tests I have run here; if I forgot any, send me a message and I may write another test for it.
MergeStream
✓ should throw an error when nextStream is not implemented (9ms)
✓ should throw an error when nextStream returns a stream with lower index (4ms)
✓ should reset index after new main stream (5ms)
✓ should write a single stream normally (50ms)
✓ should be able to merge a stream (2ms)
✓ should be able to append a stream on the end (1ms)
✓ should be able to merge large streams into a smaller stream (396ms)
✓ should be able to merge at the correct index (2ms)
Usage
const mergingStream = new Merge({
  *nextStream(): IterableIterator<MergingStream> {
    for (let i = 0; i < 10; i++) {
      const stream = new Readable();
      stream.push(i.toString());
      stream.push(null);
      yield {index: i * 2, stream};
    }
  },
});

const template = new Readable();
template.push(', , , , , , , , , ');
template.push(null);

template.pipe(mergingStream).pipe(getSink());
The result of our sink would be
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
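The appending feature mentioned above (providing Merge.END as the index) would be used roughly like this. This is a hedged sketch based on that description; createStream is just a hypothetical helper that wraps a string in a Readable:
// Sketch only: the exact appending semantics are defined in the complete code.
function createStream(text: string): Readable {
  const stream = new Readable();
  stream.push(text);
  stream.push(null);
  return stream;
}

const appendingMerge = new Merge({
  *nextStream(): IterableIterator<MergingStream> {
    yield {index: 4, stream: createStream('inserted at byte 4')};
    // Merge.END appends the stream after the main stream has finished
    yield {index: Merge.END, stream: createStream(' and appended at the end')};
  },
});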
Final Thoughts
This is not the most time-efficient way of doing it, since we only manage one merging buffer at a time, so there is a lot of waiting. For my use case that is fine: I care about it not eating up my memory, and this solution works for me. But there is definitely some room for optimization. The complete code has some extra features that are not fully explained here, such as appending streams and merging into that appended stream; they are explained with comments though.

Related

Node.js readable maximize throughput/performance for compute intense readable - Writable doesn't pull data fast enough

General setup
I developed an application using AWS Lambda with Node.js 14.
I use a custom Readable implementation FrameCreationStream that uses node-canvas to draw images, SVGs and more on a canvas. The result is then extracted as a raw image buffer in BGRA. A single image buffer contains 1920 * 1080 * 4 bytes = 8,294,400 bytes, roughly 8 MB.
This is then piped to stdin of a child_process running ffmpeg.
The highWaterMark of my Readable (in objectMode: true) is set to 25, so that the internal buffer can use up to 8 MB * 25 = 200 MB.
All this works fine and doesn't consume too much RAM. But I noticed after some time that the performance is not ideal.
Performance not optimal
I have an example input that generates a video of 315 frames. If I set highWaterMark to a value above 25, the performance increases, up to the point where I set it to 315 or above.
For some reason ffmpeg doesn't start to pull any data until highWaterMark is reached. Obviously that's not what I want. ffmpeg should always consume data if at least one frame is cached in the Readable and it has finished processing the previous frame. And the Readable should keep producing frames as long as highWaterMark isn't reached and the last frame hasn't been produced yet. So ideally the Readable and the Writable are busy all the time.
I found another way to improve the speed: if I add a timeout of, let's say, 100 ms in the _read() method of the Readable after every tenth frame, then the ffmpeg Writable will use this timeout to write some frames to ffmpeg.
It seems like frames aren't passed to ffmpeg during frame creation because the Node.js main thread is busy?
I get the fastest result if I increase highWaterMark above the number of frames, which doesn't work for longer videos as it would make the AWS Lambda RAM explode, and it makes the whole streaming idea useless. Using timeouts always gives me stomach pain; also, a well-fitting timeout might differ depending on the execution environment. Any ideas?
FrameCreationStream
import canvas from 'canvas';
import {Readable} from 'stream';
import {IMAGE_STREAM_BUFFER_SIZE, PerformanceUtil, RenderingLibraryError, VideoRendererInput} from 'vm-rendering-backend-commons';
import {AnimationAssets, BufferType, DrawingService, FullAnimationData} from 'vm-rendering-library';

/**
 * This is a proper back pressure compatible implementation of readable for having a stream to read single frames from.
 * Whenever read() is called a new frame is created and added to the stream.
 * read() will be called internally until options.highWaterMark has been reached.
 * Then calling read will be paused until one frame is read from the stream.
 */
export class FrameCreationStream extends Readable {
  drawingService: DrawingService;
  endFrameIndex: number;
  currentFrameIndex: number = 0;
  startFrameIndex: number;
  frameTimer: [number, number];
  readTimer: [number, number];
  fullAnimationData: FullAnimationData;

  constructor(animationAssets: AnimationAssets, fullAnimationData: FullAnimationData, videoRenderingInput: VideoRendererInput, frameTimer: [number, number]) {
    super({highWaterMark: IMAGE_STREAM_BUFFER_SIZE, objectMode: true});
    this.frameTimer = frameTimer;
    this.readTimer = PerformanceUtil.startTimer();
    this.fullAnimationData = fullAnimationData;
    this.startFrameIndex = Math.floor(videoRenderingInput.startFrameId);
    this.currentFrameIndex = this.startFrameIndex;
    this.endFrameIndex = Math.floor(videoRenderingInput.endFrameId);
    this.drawingService = new DrawingService(animationAssets, fullAnimationData, videoRenderingInput, canvas);
    console.time("read");
  }

  /**
   * This method is only overwritten for debugging
   * @param size
   */
  read(size?: number): string | Buffer {
    console.log("read(" + size + ")");
    const buffer = super.read(size);
    console.log(buffer);
    console.log(buffer?.length);
    if (buffer) {
      console.timeLog("read");
    }
    return buffer;
  }

  // _read() will be called when the stream wants to pull more data in.
  // _read() will be called again after each call to this.push(dataChunk) once the stream is ready to accept more data. https://nodejs.org/api/stream.html#readable_readsize
  // This way it is ensured that, even though this.createImageBuffer() is async, only one frame is created at a time and the order is kept.
  _read(): void {
    // as frame numbers are consecutive and unique, we have to draw each frame number (also the first and the last one)
    if (this.currentFrameIndex <= this.endFrameIndex) {
      PerformanceUtil.logTimer(this.readTimer, 'WAIT -> READ\t');
      this.createImageBuffer()
        .then(buffer => this.optionalTimeout(buffer))
        // push means adding a buffered raw frame to the stream
        .then((buffer: Buffer) => {
          this.readTimer = PerformanceUtil.startTimer();
          // the following two frame numbers start with 1 as first value
          const processedFrameNumberOfScene = 1 + this.currentFrameIndex - this.startFrameIndex;
          const totalFrameNumberOfScene = 1 + this.endFrameIndex - this.startFrameIndex;
          // the overall frameId or frameIndex starts with frameId 0
          const processedFrameIndex = this.currentFrameIndex;
          this.currentFrameIndex++;
          this.push(buffer); // nothing besides logging should happen after calling this.push(buffer)
          console.log(processedFrameNumberOfScene + ' of ' + totalFrameNumberOfScene + ' processed - full video frameId: ' + processedFrameIndex + ' - buffered frames: ' + this.readableLength);
        })
        .catch(err => {
          // errors will be finally handled when subscribing to the frameCreation stream in the ffmpeg service
          // this log is just generated for tracing errors, in case the handling in the ffmpeg service doesn't work for some reason
          console.log("createImageBuffer: ", err);
          this.emit("error", err);
        });
    } else {
      // push(null) makes clear that this stream has ended
      this.push(null);
      PerformanceUtil.logTimer(this.frameTimer, 'FRAME_STREAM');
    }
  }

  private optionalTimeout(buffer: Buffer): Promise<Buffer> {
    if (this.currentFrameIndex % 10 === 0) {
      return new Promise(resolve => setTimeout(() => resolve(buffer), 140));
    }
    return Promise.resolve(buffer);
  }

  // prevent memory leaks - without this, lambda memory will increase with every call
  _destroy(): void {
    this.drawingService.destroyStage();
  }

  /**
   * This creates a raw pixel buffer that contains a single frame of the video drawn by the rendering library
   */
  public async createImageBuffer(): Promise<Buffer> {
    const drawTimer = PerformanceUtil.startTimer();
    try {
      await this.drawingService.drawForFrame(this.currentFrameIndex);
    } catch (err: any) {
      throw new RenderingLibraryError(err);
    }
    PerformanceUtil.logTimer(drawTimer, 'DRAW -> FRAME\t');

    const bufferTimer = PerformanceUtil.startTimer();
    // Creates a raw pixel buffer containing simple binary data.
    // The exact same information (BGRA/screen ratio) has to be provided to ffmpeg, because ffmpeg cannot detect the format of raw input.
    const buffer = await this.drawingService.toBuffer(BufferType.RAW);
    PerformanceUtil.logTimer(bufferTimer, 'CANVAS -> BUFFER');
    return buffer;
  }
}
FfmpegService
import {ChildProcess, execFile} from 'child_process';
import {Readable} from 'stream';
import {FPS, StageSize} from 'vm-rendering-library';
import {
  FfmpegError,
  LOCAL_MERGE_VIDEOS_TEXT_FILE, LOCAL_SOUND_FILE_PATH,
  LOCAL_VIDEO_FILE_PATH,
  LOCAL_VIDEO_SOUNDLESS_MERGE_FILE_PATH
} from "vm-rendering-backend-commons";

/**
 * This class bundles all ffmpeg usages for rendering one scene.
 * FFmpeg is a console program which can transcode nearly all types of sounds, images and videos from one format to another.
 */
export class FfmpegService {
  ffmpegPath: string = null;

  constructor(ffmpegPath: string) {
    this.ffmpegPath = ffmpegPath;
  }

  /**
   * Convert a stream of raw images into an .mp4 video using the command line program ffmpeg.
   *
   * @param inputStream an input stream containing images in raw BGRA format
   * @param stageSize the size of a single frame in pixels (minimum is 2*2)
   * @param outputPath the filepath to write the resulting video to
   */
  public imageToVideo(inputStream: Readable, stageSize: StageSize, outputPath: string): Promise<void> {
    const args: string[] = [
      '-f',
      'rawvideo',
      '-r',
      `${FPS}`,
      '-pix_fmt',
      'bgra',
      '-s',
      `${stageSize.width}x${stageSize.height}`,
      '-i',
      // input "-" means input will be passed via pipe (streamed)
      '-',
      // codec that also the QuickTime player can understand
      '-vcodec',
      'libx264',
      '-pix_fmt',
      'yuv420p',
      /*
       * "-movflags faststart":
       *   metadata at beginning of file
       *   needs more RAM
       *   file will be broken if not finished properly
       *   higher application compatibility
       *   better for browser streaming
       */
      '-movflags',
      'faststart',
      // "-preset ultrafast", // use this to speed up compression, but quality/compression ratio gets worse
      // don't overwrite an existing file here,
      // but delete the file at the beginning of execution in index.ts
      // (this is better for local testing, believe me)
      outputPath
    ];
    return this.execFfmpegPromise(args, inputStream);
  }

  private execFfmpegPromise(args: string[], inputStream?: Readable): Promise<void> {
    const ffmpegServiceSelf = this;
    return new Promise(function (resolve, reject) {
      const executionProcess: ChildProcess = execFile(ffmpegServiceSelf.ffmpegPath, args, (err) => {
        if (err) {
          reject(new FfmpegError(err));
        } else {
          console.log('ffmpeg finished');
          resolve();
        }
      });
      if (inputStream) {
        // it's important to listen for errors on the input stream before piping it into the write stream
        // if we don't do this here, we get an unhandled promise exception for every issue in the input stream
        inputStream.on("error", err => {
          reject(err);
        });
        // don't reject the promise here, as the error will also be thrown inside execFile and will contain more debugging info
        // this log is just generated for tracing errors, in case the handling in execFile doesn't work for some reason
        inputStream.pipe(executionProcess.stdin).on("error", err => console.log("pipe stream: ", err));
      }
    });
  }
}

Process only first x bytes of a stream, preserve stream contents

I have a function that takes in a stream
function processStream(stream) {
}
Other things process this stream after the function, so it needs to be left intact. This function only needs the first 20 bytes of a stream that could be gigabytes long in order to complete its processing. I can get this via:
function processStream(stream) {
  const data = stream.read(20)
  return stream
}
But by consuming those 20 bytes we've changed the stream for future functions, so we have to recombine it. What's the fastest way to do this?
In the end I went with combined-stream2 to combine my streams quickly and efficiently:
const CombinedStream = require('combined-stream2')

async function processStream(stream) {
  const bytes = stream.read(2048)
  const original = CombinedStream.create()
  original.append(bytes)
  if (!stream._readableState.ended) {
    original.append(stream)
  }
  return original
}
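For completeness, a usage sketch under some assumptions (the file path is just a stand-in, and we wait for 'readable' so that stream.read() has buffered data to return):
const { createReadStream } = require('fs')

// Illustrative only: peek at the head of the stream, then keep piping the
// recombined stream onward as if it were untouched.
async function run() {
  const source = createReadStream('big-file.bin') // stand-in for the real input
  // wait until read() can return buffered bytes
  await new Promise(resolve => source.once('readable', resolve))
  const intact = await processStream(source)
  intact.pipe(process.stdout)
}

run()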

How do I apply back-pressure to node streams?

While attempting to experiment with Node.js streams I ran into an interesting conundrum. When the input (Readable) stream pushes more data than the destination (Writable) cares about, I was unable to apply back-pressure correctly.
The two methods I attempted were to return false from Writable.prototype._write and to retain a reference to the Readable so I could call Readable.pause() from the Writable. Neither solution helped much, which I'll explain.
In my exercise (which you can view the full source as a Gist) I have three streams:
Readable - PasscodeGenerator
util.inherits(PasscodeGenerator, stream.Readable);
function PasscodeGenerator(prefix) {
  stream.Readable.call(this, {objectMode: true});
  this.count = 0;
  this.prefix = prefix || '';
}
PasscodeGenerator.prototype._read = function() {
  var passcode = '' + this.prefix + this.count;
  if (!this.push({passcode: passcode})) {
    this.pause();
    this.once('drain', this.resume.bind(this));
  }
  this.count++;
};
I thought that the return code from this.push() was enough to self pause and wait for the drain event to resume.
Transform - Hasher
util.inherits(Hasher, stream.Transform);
function Hasher(hashType) {
  stream.Transform.call(this, {objectMode: true});
  this.hashType = hashType;
}
Hasher.prototype._transform = function(sample, encoding, next) {
  var hash = crypto.createHash(this.hashType);
  hash.setEncoding('hex');
  hash.write(sample.passcode);
  hash.end();
  sample.hash = hash.read();
  this.push(sample);
  next();
};
Simply add the hash of the passcode to the object.
Writable - SampleConsumer
util.inherits(SampleConsumer, stream.Writable);
function SampleConsumer(max) {
  stream.Writable.call(this, {objectMode: true});
  this.max = (max != null) ? max : 10;
  this.count = 0;
}
SampleConsumer.prototype._write = function(sample, encoding, next) {
  this.count++;
  console.log('Hash %d (%s): %s', this.count, sample.passcode, sample.hash);
  if (this.count < this.max) {
    next();
  } else {
    return false;
  }
};
Here I want to consume the data as fast as possible until I reach my max number of samples and then end the stream. I tried using this.end() instead of return false but that caused the dreaded write called after end problem. Returning false does stop everything if the sample size is small but when it is large I get an out of memory error:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted (core dumped)
According to this SO answer, in theory the Writable stream would return false, causing the streams to buffer until the buffers were full (16 by default for objectMode) and eventually the Readable would call its this.pause() method. But 16 + 16 + 16 = 48; that's 48 objects in buffers until things fill up and the system is clogged. Actually less, because there is no cloning involved, so the objects passed between them are the same reference. Would that not mean only 16 objects in memory until the high water mark halts everything?
Lastly, I realize I could have the Writable reference the Readable to call its pause method using closures. However, this solution means the Writable stream knows too much about another object. I'd have to pass in a reference:
var foo = new PasscodeGenerator('foobar');
foo
  .pipe(new Hasher('md5'))
  .pipe(new SampleConsumer(samples, foo));
And this feels out of the norm for how streams should work. I thought back-pressure was enough to cause a Writable to stop a Readable from pushing data and prevent out of memory errors.
An analogous example would be the Unix head command. Implementing that in Node, I would assume that the destination could end rather than just ignoring the data, which causes the source to keep pushing even though the destination already has enough data to satisfy the beginning portion of the file.
How do I idiomatically construct custom streams such that when the destination is ready to end the source stream doesn't attempt to push more data?
This is a known issue with how _read() is called internally. Since your _read() is always pushing synchronously/immediately, the internal stream implementation can get into a loop in the right conditions. _read() implementations are generally expected to do some sort of async I/O (e.g. reading from disk or network).
The workaround for this (as noted in the link above) is to make your _read() asynchronous at least some of the time. You could also just make it async every time it's called with:
PasscodeGenerator.prototype._read = function(n) {
  var passcode = '' + this.prefix + this.count;
  var self = this;
  // `setImmediate()` delays the push until the beginning
  // of the next tick of the event loop
  setImmediate(function() {
    self.push({passcode: passcode});
  });
  this.count++;
};

How to get a unified onFinish from separate streams (some created from within original stream)

I have a stream process like this:
Incomming file via HTTP (original stream)
-> Check if zipfile
- Yes -> push through an unzip2-stream
- No -> push to S3
When the unzip2-stream finds zip-entries, these are pushed through the same chain of streams, i.e.
Incomming file entry from zip file ("child" stream)
-> Check if zipfile
- Yes -> push through an unzip2-stream
- No -> push to S3
Thanks to https://stackoverflow.com/users/3580261/eljefedelrodeodeljefe I managed to solve the main problem after this conversation:
How to redirect a stream to other stream depending on data in first chunk?
The problem with creating new "child" streams for every zip entry is that these will have no connection to the original stream, so I cannot get a unified onFinish for all the streams.
I don't want to send a 202 off to the sender before I have processed (unzipped and sent to S3) every file. How can I accomplish this?
I'm thinking that I might need some kind of control object which awaits onFinish for all child streams and forces the process to dwell in the original onFinish event until all files are processed. Would this be overkill? Is there a simpler solution?
I ended up making a separate counter for the streams. There is probably a better solution, but this works.
I send the counter object as an argument to the first call to my saveFile() function. The counter is passed along to the unzip stream so it can be passed to saveFile for every file entry.
Just before a stream is started (i.e. piped) I call streamCounter.streamStarted().
In the last onFinish in the pipe chain I call streamCounter.streamFinished()
In the event of a stream going bad I call streamCounter.streamFailed()
Just before I send the 202 in the form post route I wait for streamCounter.streamPromise to resolve.
I'm not very proud of the setInterval solution. It'd probably be better with some kind of event emitting.
module.exports.streamCounter = function() {
  let streamCount = 0;
  let isStarted = false;
  let errors = [];

  this.streamStarted = function(options) {
    isStarted = true;
    streamCount += 1;
    log.debug(`Stream started for ${options.filename}. New streamCount: ${streamCount}`);
  };

  this.streamFinished = function(options) {
    streamCount -= 1;
    log.debug(`Finished stream for ${options.filename}. New streamCount: ${streamCount}`);
  };

  this.streamFailed = function(err) {
    streamCount -= 1;
    errors.push(err);
    log.debug(`Failed stream because (${err.message}). New streamCount: ${streamCount}`);
  };

  this.streamPromise = new Promise(function(resolve, reject) {
    let interval = setInterval(function() {
      if (isStarted && streamCount === 0) {
        clearInterval(interval);
        if (errors.length === 0) {
          log.debug('StreamCounter back on 0. Resolving streamPromise');
          resolve();
        } else {
          log.debug('StreamCounter back on 0. Errors encountered. Rejecting streamPromise');
          reject(errors[errors.length - 1]);
        }
      }
    }, 100);
  });
};
At first I tried this concept with a promise array and waited for Promise.all() before sending status 202. But Promise.all() only works with static arrays as far as I can tell. My "streamCount" is changing during the streaming so I needed a more dynamic "Promise.all".
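As a sketch of the event-emitting variant mentioned above (same interface as streamCounter, but resolving from an event instead of a 100 ms polling interval; it shares the caveat that the count may momentarily hit 0 between streams):
const EventEmitter = require('events');

// Sketch only: resolve/reject when the counter reaches 0, driven by an event
// instead of setInterval. Logging from the original version is omitted.
module.exports.streamCounter = function() {
  const emitter = new EventEmitter();
  let streamCount = 0;
  let isStarted = false;
  const errors = [];

  const checkSettled = function() {
    if (isStarted && streamCount === 0) {
      emitter.emit('settled');
    }
  };

  this.streamStarted = function() {
    isStarted = true;
    streamCount += 1;
  };
  this.streamFinished = function() {
    streamCount -= 1;
    checkSettled();
  };
  this.streamFailed = function(err) {
    streamCount -= 1;
    errors.push(err);
    checkSettled();
  };
  this.streamPromise = new Promise(function(resolve, reject) {
    emitter.once('settled', function() {
      if (errors.length === 0) {
        resolve();
      } else {
        reject(errors[errors.length - 1]);
      }
    });
  });
};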

Nodejs: Set highWaterMark of socket object

Is it possible to set the highWaterMark of a socket object after it was created:
var http = require('http');
var server = http.createServer();

server.on('upgrade', function(req, socket, head) {
  socket.on('data', function(chunk) {
    var frame = new WebSocketFrame(chunk);
    // skip invalid frames
    if (!frame.isValid()) return;
    // if the length in the head is unequal to the chunk
    // node has maybe split it
    if (chunk.length != WebSocketFrame.getLength()) {
      socket.once('data', listenOnMissingChunks);
    }
  });
});

function listenOnMissingChunks(chunk, frame) {
  frame.addChunkToPayload(chunk);
  if (WebSocketFrame.getLength()) {
    // if still corrupted listen once more
  } else {
    // else proceed
  }
}
The above code example does not work. But how do I do it instead?
Further explanation:
When I receive big WebSocket frames they get split into multiple data events. This makes it hard to parse the frames, because I do not know whether a given chunk is a split or corrupted frame.
I think you misunderstand the nature of a TCP socket. Despite the fact that TCP sends its data over IP packets, TCP is not a packet protocol. A TCP socket is simply a stream of data. Thus, it is incorrect to view the data event as a logical message. In other words, one socket.write on one end does not equate to a single data event on the other.
There are many reasons that a single write to a socket does not map 1:1 to a single data event:
The sender's network stack may combine multiple small writes into a single IP packet. (The Nagle algorithm)
An IP packet may be fragmented (split into multiple packets) along its journey if its size exceeds any one hop's MTU.
The receiver's network stack may combine multiple packets into a single data event (as seen by your application).
Because of this, a single data event might contain multiple messages, a single message, or only part of a message.
In order to correctly handle messages sent over a stream, you must buffer incoming data until you have a complete message.
var net = require('net');

var max = 1024 * 1024 // 1 MB, the maximum amount of data that we will buffer (prevent a bad server from crashing us by filling up RAM)
  , allocate = 4096 // how much memory to allocate at once, 4 kB (there's no point in wasting 1 MB of RAM to buffer a few bytes)
  , buffer = new Buffer(allocate) // create a new buffer that allocates 4 kB to start
  , nread = 0 // how many bytes we've buffered so far
  , nproc = 0 // how many bytes in the buffer we've processed (to avoid looping over the entire buffer every time data is received)
  , client = net.connect({host: 'example.com', port: 8124}); // connect to the server

client.on('data', function(chunk) {
  if (nread + chunk.length > buffer.length) { // if the buffer is too small to hold the data
    var need = Math.min(chunk.length, allocate); // allocate at least 4kB
    if (nread + need > max) throw new Error('Buffer overflow'); // uh-oh, we're all full - TODO you'll want to handle this more gracefully
    var newbuf = new Buffer(buffer.length + need); // because Buffers can't be resized, we must allocate a new one
    buffer.copy(newbuf); // and copy the old one's data to the new one
    buffer = newbuf; // the old, small buffer will be garbage collected
  }
  chunk.copy(buffer, nread); // copy the received chunk of data into the buffer
  nread += chunk.length; // add this chunk's length to the total number of bytes buffered
  pump(); // look at the buffer to see if we've received enough data to act
});

client.on('end', function() {
  // handle disconnect
});

client.on('error', function(err) {
  // handle errors
});

function find(byte) { // look for a specific byte in the buffer
  for (var i = nproc; i < nread; i++) { // look through the buffer, starting from where we left off last time
    if (buffer.readUInt8(i, true) == byte) { // we've found one
      return i;
    }
  }
}

function slice(bytes) { // discard bytes from the beginning of the buffer
  buffer = buffer.slice(bytes); // slice off the bytes
  nread -= bytes; // note that we've removed bytes
  nproc = 0; // and reset the processed bytes counter
}

function pump() {
  var pos; // position of a NULL character
  while ((pos = find(0x00)) >= 0) { // keep going while there's a NULL (0x00) somewhere in the buffer
    if (pos == 0) { // if there's more than one NULL in a row, the buffer will now start with a NULL
      slice(1); // discard it
      continue; // so that the next iteration will start with data
    }
    process(buffer.slice(0, pos)); // hand off the message
    slice(pos + 1); // and slice the processed data off the buffer
  }
}

function process(msg) { // here's where we do something with a message
  if (msg.length > 0) { // ignore empty messages
    // here's where you have to decide what to do with the data you've received
    // experiment with the protocol
  }
}
You don't need to. Incoming data will almost certainly be split across two or more reads: this is the nature of TCP and there is nothing you can do about it. Fiddling with obscure socket parameters certainly won't change it. And the data may well be split, but certainly not corrupted. Just treat the socket as what it is: a byte stream.
