Process only first x bytes of a stream, preserve stream contents - node.js

I have a function that takes in a stream
function processStream(stream) {
}
Other things process this stream after the function, so it needs to be left intact. This function only needs the first 20 bytes of a stream that could be gigabytes long in order to complete its processing. I can get this via:
function processStream(stream) {
const data = stream.read(20)
return stream
}
But by consuming those 20 bytes we've changed the stream for future functions, so we have recombine it. What's the fastest way to do this?

In the end I went with combined-streams2 to combine my streams quickly and efficiently:
const CombinedStream = require('combined-stream2')
function async processStream(stream) {
const bytes = stream.read(2048)
const original = CombinedStream.create()
original.append(bytes)
if (!stream._readableState.ended) {
original.append(stream)
}
return original
}

Related

How can I limit the size of WriteStream buffer in NodeJS?

I'm using a WriteStream in NodeJS to write several GB of data, and I've identified the write loop as eating up ~2GB of virtual memory during runtime (which is the GC'd about 30 seconds after the loop finishes). I'm wondering how I can limit the size of the buffer node is using when writing the stream so that Node doesn't use up so much memory during that part of the code.
I've reduced it to this trivial loop:
let ofd = fs.openSync(fn, 'w')
let ws = fs.createWriteStream('', { fd: ofd })
:
while { /*..write ~4GB of binary formatted 32bit floats and uint32s...*/ }
:
:
ws.end()
The stream.write function will return a boolean value which indicate if the internal buffer is full. The buffer size is controlled by the option highWaterMark. However, this option is a threshold instead of a hard limitation, which means you can still call stream.write even if the internal buffer is full, and the memory will be used continuously if you code like this.
while (foo) {
ws.write(bar);
}
In order to solve this issue, you have to handle the returned value false from the ws.write and waiting until the drain event of this stream is called like the following example.
async function write() {
while (foo) {
if (!ws.write(bar)) {
await new Promise(resolve => ws.once('drain', resolve));
}
}
}

Writing large amounts of streamed data frequently with writestream

I'm trying to write a live websocket feed line-by-line to a file - I think for this I should be using a writeable stream.
My problem here is that the data received is in the region of 10 lines per second, which quickly fills the buffer.
I understand when using streams from sources you control, you would normally add some sort of backpressure logic here, but what should I do if I do not control the source? Should I be batching up the writes and writing, say 500 lines at a time, instead of per line, or should I be using some other way to save this data?
I'm wondering how big are the lines? 10 lines per second sounds trivial to stream to a disk unless the lines are gigantic or the disk really slow. Ultimately, if you have no ability to apply backpressure logic, the source can overwhelm you if they go fast or your storage goes slow and you'd have to decide how much you can reasonably buffer and eventually just drop some of the data if you get behind.
But, you should be able to write a lot of data. On a my regular hard disk (using the generic stream code below with no additional buffering) I can do sequential writes of 100,000,000 bytes at a speed of 55 MBytes/sec:
So, if you have 10 lines per second coming in, as long as the lines were below 10,000,000 bytes each, my hard drive could keep up.
Here's the code I used to test it:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");
const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');
let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
let linesRemaining = 1_000_000;
let b = new Bench();
let bytes = 0;
function write() {
do {
linesRemaining--;
let readyMore;
bytes += lineData.length;
if (linesRemaining === 0) {
readyForMore = stream.write(lineData, done);
} else {
readyForMore = stream.write(lineData);
}
} while (linesRemaining > 0 && readyForMore);
if (linesRemaining > 0) {
stream.once('drain', write);
}
}
function done() {
b.markEnd();
console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
stream.end();
}
b.markBegin();
write();
});
Theoretically, it is more efficient for your disk to do fewer writes that are larger, than tons of small writes. In practice, because of the way the writeStream works, as soon as an inefficient write gets slow, the next write will get buffered and it kind of self corrects. If you were really trying to minimize the load on the disk, you would buffer writes until you had at least something like 4k to write. The issue is that each write has potentially allocate some bytes to the file (which involves writing to a table on the disk), then seek to where the bytes should be written on the disk, then write the bytes. Fewer and larger writes that are larger (up to some limit that depends upon internal implementation) will reduce the number of times it has to do the file allocation overhead.
So, I ran a test. I modified the above code (shown below) to buffer into 4k chunks and write them out in 4k chunks. The write through increased from 55 MBytes/sec to 284.2 MBytes/sec.
So, the theory holds true that you will write faster if you buffer into larger chunks.
But, even the simpler, non-buffered version may be plenty fast.
Here's the test code for the buffered version:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");
const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');
let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
let linesRemaining = 1_000_000;
let b = new Bench();
let bytes = 0;
let cache = [];
let cacheTotal = 0;
const maxBuffered = 4 * 1024;
stream.myWrite = function(data, callback) {
if (callback) {
cache.push(data);
return stream.write(Buffer.concat(cache), callback);
} else {
cache.push(data);
cacheTotal += data.length;
if (cacheTotal >= maxBuffered) {
let ready = stream.write(Buffer.concat(cache));
cache.length = 0;
cacheTotal = 0;
return ready;
} else {
return true;
}
}
}
function write() {
do {
linesRemaining--;
let readyMore;
bytes += lineData.length;
if (linesRemaining === 0) {
readyForMore = stream.myWrite(lineData, done);
} else {
readyForMore = stream.myWrite(lineData);
}
} while (linesRemaining > 0 && readyForMore);
if (linesRemaining > 0) {
stream.once('drain', write);
}
}
function done() {
b.markEnd();
console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
stream.end();
}
b.markBegin();
write();
});
This code uses a couple of my local libraries for measuring the time and formatting the output. If you want to run this yourself, you can substitute your own logic for those.

Memory efficient growing Nodejs Duplex/Transform stream

I am trying to add variables into a template at specific indices through streams.
The idea is that I have a readable stream in and a list of variables that can be either a readable stream a buffer or a string of an undetermined size. These variables can be inserted at a predefined list of indices. I have a few questions based on my assumptions and what I have tried so far.
My first attempt was to do it manually with readable streams. However, I couldn't const buffer = templateIn.read(size) (since the buffers were still empty) before template combined was trying to read it. The solution for that problem is similar to how you'd use a transform stream so that was the next step I took.
However, I have a problem with the transform streams. My problem is that something like this pseudo code will pile up buffers into memory until done() is called.
public _transform(chunk: Buffer, encoding: string, done: (err?: Error, data?: any) => void ): void {
let index = 0;
while (index < chunk.length) {
if (index === this.variableIndex) { // the basic idea (the actual logic is a bit more complex)
this.insertStreamHere(index);
index++;
} else {
// continue reading stream normally
}
}
done()
}
From: https://github.com/nodejs/node/blob/master/lib/_stream_transform.js
In a transform stream, the written data is placed in a buffer. When
_read(n) is called, it transforms the queued up data, calling the
buffered _write cb's as it consumes chunks. If consuming a single
written chunk would result in multiple output chunks, then the first
outputted bit calls the readcb, and subsequent chunks just go into
the read buffer, and will cause it to emit 'readable' if necessary.
This way, back-pressure is actually determined by the reading side,
since _read has to be called to start processing a new chunk. However,
a pathological inflate type of transform can cause excessive buffering
here. For example, imagine a stream where every byte of input is
interpreted as an integer from 0-255, and then results in that many
bytes of output. Writing the 4 bytes {ff,ff,ff,ff} would result in
1kb of data being output. In this case, you could write a very small
amount of input, and end up with a very large amount of output. In
such a pathological inflating mechanism, there'd be no way to tell
the system to stop doing the transform. A single 4MB write could
cause the system to run out of memory.
So TL;DR: How do I insert (large) streams at a specific index, without having a huge back pressure of buffers in memory. Any advice is appreciated.
After a lot of reading the documentation and the source code, a lot of trial and error and some testing. I have come up with a solution for my problem. I can just copy and paste my solution, but for the sake of completeness I will explain my findings here.
Handling the back pressure with pipes consists out of a few parts. We've got the Readable that writes data to the Writable. The Readable provides a callback for the Writable with which it can tell the Readable it is ready to receive a new chunk of data. The reading part is simpler. The Readable has an internal buffer. Using Readable.push() will add data to the buffer. When the data is being read, it will come from this internal buffer. Next to that we can use Readable.readableHighWaterMark and Readable.readableLength to make sure we don't push to much data at once.
Readable.readableHighWaterMark - Readable.readableLength
is the maximum amount of bytes we should push to this internal buffer.
So this means, since we want to read from two Readable streams at the same time we need two Writable streams to control the flow. To merge data we will need to buffer it ourselves, since there is (as far as I know) no internal buffer in the Writable stream. So the Duplex stream will be the best option, because we want to handle buffering, writing and reading our selves.
Writing
So let's get to the code now. To control the state of multiple streams we will create a state interface. which looks as follows:
declare type StreamCallback = (error?: Error | null) => void;
interface MergingState {
callback: StreamCallback;
queue: BufferList;
highWaterMark: number;
size: number;
finalizing: boolean;
}
The callback holds the last callback provided by either write or final (we'll get to final later). highWaterMark indicates the maximum size for the our queue and the size is our current size of the queue. Lastly the finalizing flag indicates that the current queue is the last queue. So once the queue is empty we're done reading the stream belonging to that state.
BufferList is a copy of the internal Nodejs implementation used for the build in streams.
As mentioned before the writable handles the back pressure, so the generalized method for both our writables looks like the following:
/**
* Method to write to provided state if it can
*
* (Will unshift the bytes that cannot be written back to the source)
*
* #param src the readable source that writes the chunk
* #param chunk the chunk to be written
* #param encoding the chunk encoding, currently not used
* #param cb the streamCallback provided by the writing state
* #param state the state which should be written to
*/
private writeState(src: Readable, chunk: Buffer, encoding: string, cb: StreamCallback, state: MergingState): void {
this.mergeNextTick();
const bytesAvailable = state.highWaterMark - state.size;
if (chunk.length <= bytesAvailable) {
// save to write to our local buffer
state.queue.push(chunk);
state.size += chunk.length;
if (chunk.length === bytesAvailable) {
// our queue is full, so store our callback
this.stateCallbackAndSet(state, cb);
} else {
// we still have some space, so we can call the callback immediately
cb();
}
return;
}
if (bytesAvailable === 0) {
// no space available unshift entire chunk
src.unshift(chunk);
} else {
state.size += bytesAvailable;
const leftOver = Buffer.alloc(chunk.length - bytesAvailable);
chunk.copy(leftOver, 0, bytesAvailable);
// push amount of bytes available
state.queue.push(chunk.slice(0, bytesAvailable));
// unshift what we cannot fit in our queue
src.unshift(leftOver);
}
this.stateCallbackAndSet(state, cb);
}
First we check how much space is available to buffer. If there is enough space for our full chunk, we'll buffer it. If there is no space available, we will unshift the buffer to its readable source. If there is some space available, we'll only unshift what we cannot fit. If our buffer is full, we will store the callback that requests a new chunk. If there is space we will request our next chunk.
this.mergeNextTick() is called because our state has changed and that it should be read in the next tick:
private mergeNextTick(): void {
if (!this.mergeSync) {
// make sure it is only called once per tick
// we don't want to call it multiple times
// since there will be nothing left to read the second time
this.mergeSync = true;
process.nextTick(() => this._read(this.readableHighWaterMark));
}
}
this.stateCallbackAndSet is a helper function that will just call our last callback to make sure we'll not get in a state that makes our stream stop flowing. And will the new one provided.
/**
* Helper function to call the callback if it exists and set the new callback
* #param state the state which holds the callback
* #param cb the new callback to be set
*/
private stateCallbackAndSet(state: MergingState, cb: StreamCallback): void {
if (!state) {
return;
}
if (state.callback) {
const callback = state.callback;
// do callback next tick, such that we can't get stuck in a writing loop
process.nextTick(() => callback());
}
state.callback = cb;
}
Reading
Now onto the reading side this is the part where we handle selecting the correct stream.
First our function to read the state, which is pretty straight forward. it reads the amount of bytes it is able to read. It returns the amount of bytes written, which is useful information for our other function.
/**
* Method to read the provided state if it can
*
* #param size the number of bytes to consume
* #param state the state from which needs to be read
* #returns the amount of bytes read
*/
private readState(size: number, state: MergingState): number {
if (state.size === 0) {
// our queue is empty so we read 0 bytes
return 0;
}
let buffer = null;
if (state.size < size) {
buffer = state.queue.consume(state.size, false);
} else {
buffer = state.queue.consume(size, false);
}
this.push(buffer);
this.stateCallbackAndSet(state, null);
state.size -= buffer.length;
return buffer.length;
}
The doRead method is where all the merging takes place: it fetches the nextMergingIndex. If the merging index is the END then we can just read the writingState until the end of the stream. If we are at the merging index, we read from the mergingState. Otherwise we read as much from the writingState until we reach the next merging index.
/**
* Method to read from the correct Queue
*
* The doRead method is called multiple times by the _read method until
* it is satisfied with the returned size, or until no more bytes can be read
*
* #param n the number of bytes that can be read until highWaterMark is hit
* #throws Errors when something goes wrong, so wrap this method in a try catch.
* #returns the number of bytes read from either buffer
*/
private doRead(n: number): number {
// first check all constants below 0,
// which is only Merge.END right now
const nextMergingIndex = this.getNextMergingIndex();
if (nextMergingIndex === Merge.END) {
// read writing state until the end
return this.readWritingState(n);
}
const bytesToNextIndex = nextMergingIndex - this.index;
if (bytesToNextIndex === 0) {
// We are at the merging index, thus should read merging queue
return this.readState(n, this.mergingState);
}
if (n <= bytesToNextIndex) {
// We are safe to read n bytes
return this.readWritingState(n);
}
// read the bytes until the next merging index
return this.readWritingState(bytesToNextIndex);
}
readWritingState reads the state and updates the index:
/**
* Method to read from the writing state
*
* #param n maximum number of bytes to be read
* #returns number of bytes written.
*/
private readWritingState(n: number): number {
const bytesWritten = this.readState(n, this.writingState);
this.index += bytesWritten;
return bytesWritten;
}
Merging
For selecting our streams to merge we'll use a generator function. The generator function yields an index and a stream to merge at that index:
export interface MergingStream { index: number; stream: Readable; }
In doRead getNextMergingIndex() is called. This function returns the index of the next MergingStream. If there is no next mergingStream the generator is called to fetch a new mergingStream. If there is no new merging stream, we'll just return END.
/**
* Method to get the next merging index.
*
* Also fetches the next merging stream if merging stream is null
*
* #returns the next merging index, or Merge.END if there is no new mergingStream
* #throws Error when invalid MergingStream is returned by streamGenerator
*/
private getNextMergingIndex(): number {
if (!this.mergingStream) {
this.setNewMergeStream(this.streamGenerator.next().value);
if (!this.mergingStream) {
return Merge.END;
}
}
return this.mergingStream.index;
}
In the setNewMergeStream we are creating a new Writable which we can pipe our new merging stream into. For our Writable We will need to handle the write callback for writing to our state and the final callback to handle the last chunk. We should also not forget to reset our state.
/**
* Method to set the new merging stream
*
* #throws Error when mergingStream has an index less than the current index
*/
private setNewMergeStream(mergingStream?: MergingStream): void {
if (this.mergingStream) {
throw new Error('There already is a merging stream');
}
// Set a new merging stream
this.mergingStream = mergingStream;
if (mergingStream == null || mergingStream.index === Merge.END) {
// set new state
this.mergingState = newMergingState(this.writableHighWaterMark);
// We're done, for now...
// mergingStream will be handled further once nextMainStream() is called
return;
}
if (mergingStream.index < this.index) {
throw new Error('Cannot merge at ' + mergingStream.index + ' because current index is ' + this.index);
}
// Create a new writable our new mergingStream can write to
this.mergeWriteStream = new Writable({
// Create a write callback for our new mergingStream
write: (chunk, encoding, cb) => this.writeMerge(mergingStream.stream, chunk, encoding, cb),
final: (cb: StreamCallback) => {
this.onMergeEnd(mergingStream.stream, cb);
},
});
// Create a new mergingState for our new merging stream
this.mergingState = newMergingState(this.mergeWriteStream.writableHighWaterMark);
// Pipe our new merging stream to our sink
mergingStream.stream.pipe(this.mergeWriteStream);
}
Finalizing
The last step in the process is to handle our final chunks. Such that we know when to end merging and can send an end chunk. In our main read loop we first read until our doRead() method returns 0 twice in a row, or has filled our read buffer. Once that happens we end our read loop and check our states to see if they have finished.
public _read(size: number): void {
if (this.finished) {
// we've finished, there is nothing to left to read
return;
}
this.mergeSync = false;
let bytesRead = 0;
do {
const availableSpace = this.readableHighWaterMark - this.readableLength;
bytesRead = 0;
READ_LOOP: while (bytesRead < availableSpace && !this.finished) {
try {
const result = this.doRead(availableSpace - bytesRead);
if (result === 0) {
// either there is nothing in our buffers
// or our states are outdated (since they get updated in doRead)
break READ_LOOP;
}
bytesRead += result;
} catch (error) {
this.emit('error', error);
this.push(null);
this.finished = true;
}
}
} while (bytesRead > 0 && !this.finished);
this.handleFinished();
}
Then in our handleFinished() we check our states.
private handleFinished(): void {
if (this.finished) {
// merge stream has finished, so nothing to check
return;
}
if (this.isStateFinished(this.mergingState)) {
this.stateCallbackAndSet(this.mergingState, null);
// set our mergingStream to null, to indicate we need a new one
// which will be fetched by getNextMergingIndex()
this.mergingStream = null;
this.mergeNextTick();
}
if (this.isStateFinished(this.writingState)) {
this.stateCallbackAndSet(this.writingState, null);
this.handleMainFinish(); // checks if there are still mergingStreams left, and sets finished flag
this.mergeNextTick();
}
}
The isStateFinished() checks if our state has the finalizing flag set and if the queue size equals 0
/**
* Method to check if a specific state has completed
* #param state the state to check
* #returns true if the state has completed
*/
private isStateFinished(state: MergingState): boolean {
if (!state || !state.finalizing || state.size > 0) {
return false;
}
return true;
}
The finalized flag is set once our end callback is in the final callback for our merging Writable stream. For our main stream we have to approach it a little differently, since we have little control over when our stream ends, because the readable calls the end of our writable by default. We want to remove this behavior such that we can decide when we finish our stream. This might cause some issues when other end listeners are set, but for most use cases this should be fine.
private onPipe(readable: Readable): void {
// prevent our stream from being closed prematurely and unpipe it instead
readable.removeAllListeners('end'); // Note: will cause issues if another end listener is set
readable.once('end', () => {
this.finalizeState(this.writingState);
readable.unpipe();
});
}
The finalizeState() sets the flag and the callback to end the stream.
/**
* Method to put a state in finalizing mode
*
* Finalizing mode: the last chunk has been received, when size is 0
* the stream should be removed.
*
* #param state the state which should be put in finalizing mode
*
*/
private finalizeState(state: MergingState, cb?: StreamCallback): void {
state.finalizing = true;
this.stateCallbackAndSet(state, cb);
this.mergeNextTick();
}
And that is how you merge multiple streams in one single sink.
TL;DR: The complete code
This code has been fully tested with my jest test suite on multiple edge cases And has a few more features than explained in my code. Such as appending streams and merging into that appended stream. By providing Merge.END as index.
Test result
You can see the tests I have ran here, if I forgot any, send me a message and I may write another test for it
MergeStream
✓ should throw an error when nextStream is not implemented (9ms)
✓ should throw an error when nextStream returns a stream with lower index (4ms)
✓ should reset index after new main stream (5ms)
✓ should write a single stream normally (50ms)
✓ should be able to merge a stream (2ms)
✓ should be able to append a stream on the end (1ms)
✓ should be able to merge large streams into a smaller stream (396ms)
✓ should be able to merge at the correct index (2ms)
Usage
const mergingStream = new Merge({
*nextStream(): IterableIterator<MergingStream> {
for (let i = 0; i < 10; i++) {
const stream = new Readable();
stream.push(i.toString());
stream.push(null);
yield {index: i * 2, stream};
}
},
});
const template = new Readable();
template.push(', , , , , , , , , ');
template.push(null);
template.pipe(mergingStream).pipe(getSink());
The result will of our sink would be
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Final Thoughts
This is not the most time efficient way of doing it, since we only manage one merging buffer at once. So there is a lot of waiting. For my use case that is fine. I care about it not eating up my memory and this solution works for me. But there is definitely some space for optimization. The complete code has some extra features that are not fully explained here, such as appending streams and merging into that appended stream. They have been explained with comments though.

How do I apply back-pressure to node streams?

While attempting to experiment with Node.JS streams I ran into an interesting conundrum. When the input (Readable) stream pushes more data then the destination (Writable) cares about I was unable to apply back-pressure correctly.
The two methods I attempted was to return false from the Writable.prototype._write and to retain a reference to the Readable so I can call Readable.pause() from the Writable. Neither solution helped much which I'll explain.
In my exercise (which you can view the full source as a Gist) I have three streams:
Readable - PasscodeGenerator
util.inherits(PasscodeGenerator, stream.Readable);
function PasscodeGenerator(prefix) {
stream.Readable.call(this, {objectMode: true});
this.count = 0;
this.prefix = prefix || '';
}
PasscodeGenerator.prototype._read = function() {
var passcode = '' + this.prefix + this.count;
if (!this.push({passcode: passcode})) {
this.pause();
this.once('drain', this.resume.bind(this));
}
this.count++;
};
I thought that the return code from this.push() was enough to self pause and wait for the drain event to resume.
Transform - Hasher
util.inherits(Hasher, stream.Transform);
function Hasher(hashType) {
stream.Transform.call(this, {objectMode: true});
this.hashType = hashType;
}
Hasher.prototype._transform = function(sample, encoding, next) {
var hash = crypto.createHash(this.hashType);
hash.setEncoding('hex');
hash.write(sample.passcode);
hash.end();
sample.hash = hash.read();
this.push(sample);
next();
};
Simply add the hash of the passcode to the object.
Writable - SampleConsumer
util.inherits(SampleConsumer, stream.Writable);
function SampleConsumer(max) {
stream.Writable.call(this, {objectMode: true});
this.max = (max != null) ? max : 10;
this.count = 0;
}
SampleConsumer.prototype._write = function(sample, encoding, next) {
this.count++;
console.log('Hash %d (%s): %s', this.count, sample.passcode, sample.hash);
if (this.count < this.max) {
next();
} else {
return false;
}
};
Here I want to consume the data as fast as possible until I reach my max number of samples and then end the stream. I tried using this.end() instead of return false but that caused the dreaded write called after end problem. Returning false does stop everything if the sample size is small but when it is large I get an out of memory error:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted (core dumped)
According to this SO answer in theory the Write stream would return false causing the streams to buffer until the buffers were full (16 by default for objectMode) and eventually the Readable would call it's this.pause() method. But 16 + 16 + 16 = 48; that's 48 objects in buffer till things fill up and the system is clogged. Actually less because there is no cloning involved so the objects passed between them is the same reference. Would that not mean only 16 objects in memory till the high water mark halts everything?
Lastly I realize I could have the Writable reference the Readable to call it's pause method using closures. However, this solution means the Writable stream knows to much about another object. I'd have to pass in a reference:
var foo = new PasscodeGenerator('foobar');
foo
.pipe(new Hasher('md5'))
.pipe(new SampleConsumer(samples, foo));
And this feels out of norm for how streams would work. I thought back-pressure was enough to cause a Writable to stop a Readable from pushing data and prevent out of memory errors.
An analogous example would be the Unix head command. Implementing that in Node I would assume that the destination could end and not just ignore causing the source to keep pushing data even if the destination has enough data to satisfy the beginning portion of the file.
How do I idiomatically construct custom streams such that when the destination is ready to end the source stream doesn't attempt to push more data?
This is a known issue with how _read() is called internally. Since your _read() is always pushing synchronously/immediately, the internal stream implementation can get into a loop in the right conditions. _read() implementations are generally expected to do some sort of async I/O (e.g. reading from disk or network).
The workaround for this (as noted in the link above) is to make your _read() asynchronous at least some of the time. You could also just make it async every time it's called with:
PasscodeGenerator.prototype._read = function(n) {
var passcode = '' + this.prefix + this.count;
var self = this;
// `setImmediate()` delays the push until the beginning
// of the next tick of the event loop
setImmediate(function() {
self.push({passcode: passcode});
});
this.count++;
};

NodeJS: How to write a file parser using readStream?

I have a file in a binary format:
The format is as follows:
[4 - header bytes] [8 bytes - int64 - how many bytes to read following] [variable num of bytes (size of the int64) - read the actual information]
And then it repeats, so I must first read the first 12 bytes to determine how many more bytes I need to read.
I have tried:
var readStream = fs.createReadStream('/path/to/file.bin');
readStream.on('data', function(chunk) { ... })
The problem I have is that chunk always comes back in chunks of 65536 bytes at a time whereas I need to be more specific on the number of bytes that I am reading.
I have always tried readStream.on('readable', function() { readStream.read(4) })
But it is also not very flexible, because it seems to turn asynchronous code into synchronous code because, I have to put the 'reading' in a while loop
Or maybe readStream is not appropriate in this case and I should use this instead? fs.read(fd, buffer, offset, length, position, callback)
Here's what I'd recommend as an abstract handler of a readStream to process abstract data like you're describing:
var pending = new Buffer(9999999);
var cursor = 0;
stream.on('data', function(d) {
d.copy(pending, cursor);
cursor += d.length;
var test = attemptToParse(pending.slice(0, cursor));
while (test !== false) {
// test is a valid blob of data
processTheThing(test);
var rawSize = test.raw.length; // How many bytes of data did the blob actually take up?
pending.copy(pending.copy, 0, rawSize, cursor); // Copy the data after the valid blob to the beginning of the pending buffer
cursor -= rawSize;
test = attemptToParse(pending.slice(0, cursor)); // Is there more than one valid blob of data in this chunk? Keep processing if so
}
});
For your use-case, ensure the initialized size of the pending Buffer is large enough to hold the largest possible valid blob of data you'll be parsing (you mention an int64; that max size plus the header size) plus one extra 65536 bytes in case the blob boundary happens just on the edge of a stream chunk.
My method requires a attemptToParse() method that takes a buffer and tries to parse the data out of it. It should return false if the length of the buffer is too short (data hasn't come in enough yet). If it is a valid object, it should return some parsed object that has a way to show the raw bytes it took up (.raw property in my example). Then you do any processing you need to do with the blob (processTheThing()), trim out that valid blob of data, shift the pending Buffer to just be the remainder and keep going. That way, you don't have a constantly growing pending buffer, or some array of "finished" blobs. Maybe process on the receiving end of processTheThing() is keeping an array of the blobs in memory, maybe it's writing them to a database, but in this example, that's abstracted away so this code just deals with how to handle the stream data.
Add the chunk to a Buffer, and then parse the data from there. Being aware not to go beyond the end of the buffer (if your data is large). I'm using my tablet right now so can't add any example source code. Maybe somebody else can?
Ok, mini source, very skeletal.
var chunks = [];
var bytesRead= 0;
stream.on('data', function(chunk) {
chunks.push(chunk);
bytesRead += chunk.length;
// look at bytesRead...
var buffer = Buffer.concat(chunks);
chunks = [buffer]; // trick for next event
// --> or, if memory is an issue, remove completed data from the beginning of chunks
// work with the buffer here...
}

Resources