WriteStream Node.js out of memory

I'm trying to create a 20MB file, but it throws an out-of-memory error. I set max-old-space-size to 2GB, but it still fails. Can someone explain to me why writing a 20MB stream consumes so much memory?
I have 2.3GB of free memory.
let size = 20 * 1024 * 1024; // 20MB
for (let i = 0; i < size; i++) {
  writeStream.write('A');
}
writeStream.end();

As mentioned in the Node documentation, a Writable stores data in an internal buffer. The amount of data that can be buffered depends on the highWaterMark option passed to the stream's constructor.
As long as the size of the buffered data is below highWaterMark, calls to writable.write(chunk) will return true. Once the buffered data exceeds the limit specified by highWaterMark, it returns false. This is when you should stop writing more data to the Writable and wait for the 'drain' event, which indicates that it's now appropriate to resume writing data.
Your program crashes because it keeps writing even when the internal buffer has exceeded highWaterMark.
Check the docs for Event: 'drain'; they include an example program.
This looks like a nice use case for Readable.pipe(Writable)
You can create a generator function that returns a character and then create a Readable from that generator by using Readable.from(). Then pipe the output of Readable to a Writable file.
The reason it's beneficial to use pipe here is that:
A key goal of the stream API, particularly the stream.pipe() method,
is to limit the buffering of data to acceptable levels such that
sources and destinations of differing speeds will not overwhelm the
available memory. link
and
The flow of data will be automatically managed so that the destination
Writable stream is not overwhelmed by a faster Readable stream. link
const { Readable } = require('stream');
const fs = require('fs');

const size = 20 * 1024 * 1024; // 20MB

function* generator(numberOfChars) {
  while (numberOfChars--) {
    yield 'A';
  }
}

const writeStream = fs.createWriteStream('./output.txt');
const readable = Readable.from(generator(size));
readable.pipe(writeStream);


How can I limit the size of WriteStream buffer in NodeJS?

I'm using a WriteStream in NodeJS to write several GB of data, and I've identified the write loop as eating up ~2GB of virtual memory during runtime (which is then GC'd about 30 seconds after the loop finishes). I'm wondering how I can limit the size of the buffer Node is using when writing the stream, so that Node doesn't use up so much memory during that part of the code.
I've reduced it to this trivial loop:
let ofd = fs.openSync(fn, 'w')
let ws = fs.createWriteStream('', { fd: ofd })
// ...
while { /*..write ~4GB of binary formatted 32bit floats and uint32s...*/ }
// ...
ws.end()
The stream.write function returns a boolean value which indicates whether the internal buffer is full. The buffer size is controlled by the option highWaterMark. However, this option is a threshold rather than a hard limit, which means you can still call stream.write even when the internal buffer is full, and memory usage will keep growing if your code looks like this.
while (foo) {
  ws.write(bar);
}
In order to solve this issue, you have to handle the false return value from ws.write and wait until the stream's 'drain' event is emitted, as in the following example.
async function write() {
  while (foo) {
    if (!ws.write(bar)) {
      await new Promise(resolve => ws.once('drain', resolve));
    }
  }
}

Reading data a block at a time, synchronously

What is the Node.js (TypeScript) equivalent of the following Python snippet? I've put an attempt at the corresponding Node.js below the Python.
Note that I want to read a chunk at a time (later, that is; in this example I'm just reading the first kilobyte), synchronously.
Also, I do not want to read the entire file into virtual memory at once; some of my input files will (eventually) be too big for that.
The Node.js snippet always returns null. I want it to return a string or buffer or something along those lines. If the file is >= 1024 bytes long, I want a 1024-character return; otherwise I want the entire file.
I googled about this for an hour or two, but all I found were things that synchronously read an entire file at a time, or read pieces at a time asynchronously.
Thanks!
Here's the Python:
def readPrefix(filename: str) -> str:
    with open(filename, 'rb') as infile:
        data = infile.read(1024)
        return data
Here's the nodejs attempt:
const readPrefix = (filename: string): string => {
  const readStream = fs.createReadStream(filename, { highWaterMark: 1024 });
  const data = readStream.read(1024);
  readStream.close();
  return data;
};
To read synchronously, you would use fs.openSync(), fs.readSync() and fs.closeSync().
Here's some regular Javascript code (hopefully you can translate it to TypeScript) that synchronously reads a certain number of bytes from a file and returns a buffer object containing those bytes (or throws an exception in case of error):
const fs = require('fs');

function readBytesSync(filePath, filePosition, numBytesToRead) {
  const buf = Buffer.alloc(numBytesToRead, 0);
  let fd;
  try {
    fd = fs.openSync(filePath, "r");
    fs.readSync(fd, buf, 0, numBytesToRead, filePosition);
  } finally {
    if (fd) {
      fs.closeSync(fd);
    }
  }
  return buf;
}
For your application, you can just pass 1024 as the number of bytes to read, and if there are fewer than that in the file, it will just read up to the end of the file. The returned buffer object contains the bytes read, which you can access as binary or convert to a string.
For the benefit of others reading this, I mentioned in earlier comments that synchronous I/O should never be used in a server environment (servers should always use asynchronous I/O except at startup time). Synchronous I/O can be used for stand-alone scripts that only do one thing (like build scripts, as an example) and don't need to be responsive to multiple incoming requests.
Do I need to loop on readSync() in case of EINTR or something?
Not that I'm aware of.

Listen to write when piping read stream to write stream?

I have the code:
const readStream = fs.createReadStream(readFilename, {
  highWaterMark: 10 * 1024
});
const writeStream = fs.createWriteStream(writeFilename, {
  highWaterMark: 1 * 1024
});

readStream.pipe(writeStream);
As you can see, the buffer (highWaterMark) size is different for the two. The read stream has a larger buffer, so when it pipes to the write stream it is indeed too much for the write buffer to handle. It holds 9 * 1024 bytes in memory, and after it has handled the entire load it emits 'drain'. This is fine.
However, when writing to the writable manually via writable.write, false is returned, so you may alter the read stream to have a lower buffer (if that's what you wish).
My question is: since I'm piping directly, is there any way to listen to the write event on the writable? The only thing I seem to be able to listen to is the 'drain' event, after it has already taken in too much.
The general answer is "no, because there's no need to", but the less strict one would be "kinda, but in another way and with consequences".
First, there's a misunderstanding about what the 'drain' event means in a piped stream:
You're assuming it's called when the Writable buffer is depleted, but that's only the node.js internal buffer, not the actual pipe to the filesystem.
Additionally, you're not the only one reading it - the pipe method actually creates a lot of listeners and pause/resume logic around both streams.
So what's actually happening is that the Readable is listening on the Writable's 'drain' event to push some more data into the buffer.
Second, as said - the Writable does not implement any confirmation that a specific chunk has been written. That's simply because on string and Buffer chunks it would be very hard to tell when those are actually written (even impossible at times, as in the simple case of a gzip stream, where only part of a chunk may have reached the actual disk).
There is a way to get close enough though (get nearly precise confirmation per chunk):
const { PassThrough } = require("stream");

fs.createReadStream(readFilename, {
  highWaterMark: 10 * 1024
})
  /* we pipe the readable to a buffer in a passthrough stream */
  .pipe(new PassThrough({
    highWaterMark: 1024
  }))
  /* pipe returns the stream we piped to */
  /* now we pipe again, but to a stream with a minimal highWaterMark */
  .pipe(
    new PassThrough({
      highWaterMark: 1
    })
      .on("data", () => {
        /* here's your confirmation, called just before this chunk is
           written and after the last one has started to be written */
      })
  )
  /* and there we push to the write stream */
  .pipe(
    fs.createWriteStream(writeFilename, {
      highWaterMark: 1
    })
  );
Sure, that will definitely come with a performance impact, and I don't know how big, but it will keep the reading side more or less efficient and the writable will get the buffer it needs - at the cost of some extra CPU and perhaps some micro-latency for every chunk.
It's up to you to test.
See more on streams, especially PassThrough here.
Yes there is. You can listen to the data event:
readStream.on('data', data => console.log(data))

Node.js: splitting a readable stream pipe to multiple sequential writable streams

Given a Readable stream (which may be process.stdin or a file stream), is it possible/practical to pipe() to a custom Writable stream that will fill a child Writable until a certain size; then close that child stream; open a new Writable stream and continue?
(The context is to upload a large piece of data from a pipeline to a CDN, dividing it up into blocks of a reasonable size as it goes, without having to write the data to disk first.)
I've tried creating a Writable that handles the opening and closing of the child stream in the _write function, but the problem comes when the incoming chunk is too big to fit in the existing child stream: it has to write some of the chunk to the old stream; create the new stream; and then wait for the open event on the new stream before completing the _write call.
The other thought I had was to create an extra Duplex or Transform stream to buffer the pipe and ensure that the chunk coming into the Writable is definitely equal to or less than the amount the existing child stream can accept, to give the Writable time to change the child stream over.
Alternatively, is this overcomplicating everything and there's a much easier way to do the original task?
I stumbled across this question when looking for an answer to a related problem: how to parse a file and split its lines into separate files depending on some category value in the line.
I did my best to change my code to make it more relevant to your problem. However, it was adapted quickly and isn't tested. Treat it as pseudo-code.
var fs = require('fs'),
    through = require('through');

var destCount = 0, dest, size = 0, MAX_SIZE = 1000;

readableStream
  .on('data', function(data) {
    var out = data.toString() + "\n";
    size += out.length;
    if (dest && size > MAX_SIZE) {
      dest.emit("end");
      dest = null;
      size = 0;
    }
    if (!dest) {
      // option 1. manipulate data before saving it.
      dest = through();
      dest.pipe(fs.createWriteStream("log" + destCount++));
      // option 2. write directly to file
      // dest = fs.createWriteStream("log" + destCount++);
    }
    dest.emit("data", out);
  })
  .on('end', function() {
    dest.emit('end');
  });
I would introduce a Transform in between the Readable and Writable streams, and do all the logic I need in its _transform.
Maybe I would even have only a Readable and a Transform: the _transform method would create all the Writable streams I need.
Personally, I only use a Writable stream when I'm dumping data somewhere and am done processing that chunk.
I avoid implementing _read and _write as much as I can and abuse Transform streams instead.
But the point I don't understand in your question is what you write about size. What do you mean by it?

gzipping a file with nodejs streams causes memory leaks

I'm trying to do what should be seemingly quite simple: take a file with filename X, and create a gzipped version as "X.gz". Node.js's zlib module does not come with a convenient zlib.gzip(infile, outfile), so I figured I'd use an input stream, an output stream, and a zlib gzipper, then pipe them:
var zlib = require("zlib"),
    zipper = zlib.createGzip(),
    fs = require("fs");

var tryThing = function(logfile) {
  var input = fs.createReadStream(logfile, { autoClose: true }),
      output = fs.createWriteStream(logfile + ".gz");

  input.pipe(zipper).pipe(output);

  output.on("end", function() {
    // delete original file, it is no longer needed
    fs.unlink(logfile);
    // clear listeners
    zipper.removeAllListeners();
    input.removeAllListeners();
  });
}
however, every time I run this function, the memory footprint of Node.js grows by about 100kb. Am I forgetting to tell the streams they should just kill themselves off again because they won't be needed any longer?
Or, alternatively, is there a way to just gzip a file without bothering with streams and pipes? I tried googling for "node.js gzip a file" but it's just links to the API docs, and stack overflow questions on gzipping streams and buffers, not how to just gzip a file.
I think you need to properly unpipe and close the streams. Simply calling removeAllListeners() may not be enough to clean things up, as streams may be waiting for more data (and thus staying alive in memory unnecessarily).
Also, you're not closing the output stream, and IMO I'd listen on the input stream's 'end' instead of the output's.
// cleanup
input.once('end', function() {
  zipper.removeAllListeners();
  zipper.close();
  zipper = null;
  input.removeAllListeners();
  input.close();
  input = null;
  output.removeAllListeners();
  output.close();
  output = null;
});
Also, I don't think the stream returned from zlib.createGzip() can be reused once ended. You should create a new one at every invocation of tryThing:
var input = fs.createReadStream(logfile, { autoClose: true }),
    output = fs.createWriteStream(logfile + ".gz"),
    zipper = zlib.createGzip();

input.pipe(zipper).pipe(output);
Haven't tested this though, as I don't have a memory profiling tool nearby right now.
