gzipping a file with nodejs streams causes memory leaks


I'm trying to do what should be seemingly quite simple: take a file with filename X, and create a gzipped version as "X.gz". Nodejs's zlib module does not come with a convenient zlib.gzip(infile, outfile), so I figured I'd use an input stream, an output stream, and a zlib gzipper, then pipe them:
var zlib = require("zlib"),
zipper = zlib.createGzip(),
fs = require("fs");
var tryThing = function(logfile) {
var input = fs.createReadStream(logfile, {autoClose: true}),
output = fs.createWriteStream(logfile + ".gz");
input.pipe(zipper).pipe(output);
output.on("end", function() {
// delete original file, it is no longer needed
fs.unlink(logfile);
// clear listeners
zipper.removeAllListeners();
input.removeAllListeners();
});
}
However, every time I run this function, the memory footprint of Node.js grows by about 100 kB. Am I forgetting to tell the streams they should just kill themselves off again because they won't be needed any longer?
Or, alternatively, is there a way to just gzip a file without bothering with streams and pipes? I tried googling for "node.js gzip a file" but it's just links to the API docs, and Stack Overflow questions on gzipping streams and buffers, not how to just gzip a file.

I think you need to properly unpipe and close the streams. Calling removeAllListeners() alone may not be enough to clean things up, because the streams may still be waiting for more data and thus staying alive in memory unnecessarily.
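Also, you're not closing the output stream, and a writable stream never emits 'end'; it emits 'finish' once everything has been flushed. I'd listen for that instead, since cleaning up on the input stream's 'end' could cut off data that is still inside the gzip stream:
// cleanup once the output has been fully written
output.once('finish', function() {
zipper.removeAllListeners();
zipper.close();
zipper = null;
input.removeAllListeners();
input.close();
input = null;
output.removeAllListeners();
output.close();
output = null;
});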
Also I don't think the stream returned from zlib.createGzip() can be shared once ended. You should create a new one at every iteration of tryThing:
var input = fs.createReadStream(logfile, {autoClose: true}),
output = fs.createWriteStream(logfile + ".gz"),
zipper = zlib.createGzip();
input.pipe(zipper).pipe(output);
I haven't tested this, though, as I don't have a memory profiling tool nearby right now.
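As a rough sketch of the advice above, the built-in stream.pipeline() can do the wiring and the teardown in one place: it destroys all of the streams when one of them fails and calls back once the output has been written. The helper name gzipFile and its callback signature are assumptions made for illustration, not part of the question's code.

```javascript
const fs = require('fs');
const zlib = require('zlib');
const { pipeline } = require('stream');

// Hypothetical helper: gzips `file` to `file + ".gz"`, then deletes the original.
function gzipFile(file, callback) {
  pipeline(
    fs.createReadStream(file),
    zlib.createGzip(), // a fresh gzip stream for every file, never a shared one
    fs.createWriteStream(file + '.gz'),
    function (err) {
      if (err) return callback(err);
      // The compressed output is complete, so the original can be removed.
      fs.unlink(file, callback);
    }
  );
}
```

Called as gzipFile(logfile, function (err) { ... }), this keeps no reference to any stream after the callback runs, so nothing from one call should be held alive by the next.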

Related

Performing piped operations on individual chunks (node-wav)

I'm new to node and I'm working on an audio stream server. I'm trying to process / transform the chunks of a stream as they come out of each pipe.
So, file = fs.createReadStream(path) (filestream) is piped into file.pipe(wavy) (remove headers and output raw PCM), which is piped into .pipe(waver) (add proper wav header to chunk), which is piped into .pipe(spark) (output chunk to client).
The idea is that each filestream chunk has headers removed if any (only applies to first chunk), then using the node-wav Writer that chunk is endowed with headers and then sent to the client. As I'm sure you guessed this doesn't work.
The pipe operations into node-wav are acting on the entire filestream, not the individual chunks. To confirm I've checked the output client side and it is effectively dropping the headers and re-adding them to the entire data stream.
From what I've read of the Node Stream docs it seems like what I'm trying to do should be possible, just not the way I'm doing it. I just can't pin down how to accomplish this.
Is it possible, and if so what am I missing?
Complete function:
processAudio = (path, spark) ->
wavy = new wav.Reader()
waver = new wav.Writer()
file = fs.createReadStream(path)
file.pipe(wavy).pipe(waver).pipe(spark)

I don't really know about wavs and headers but if you're "trying to process / transform the chunks of a stream as they come out of each pipe." you can use the Transform stream.
It permits you to sit between 2 streams and modify the bytes between them:
var util = require('util');
var Transform = require('stream').Transform;
util.inherits(Test, Transform);
function Test(options) {
Transform.call(this, options);
}
Test.prototype._transform = function(chunk, encoding, cb) {
// do something with chunk, then pass a modified chunk (or not)
// to the downstream
cb(null, chunk);
};
To observe the stream and potentially modify it, pipe like:
file.pipe(wavy).pipe(new Test()).pipe(waver).pipe(spark)
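The same idea can be written with the simplified constructor options instead of util.inherits. This is only a sketch: createChunkTap and its onChunk callback are made-up names, and the chunks it sees are whatever sizes the upstream stream happens to produce.

```javascript
const { Transform } = require('stream');

// Hypothetical helper: calls onChunk for every chunk, then forwards it unchanged.
function createChunkTap(onChunk) {
  return new Transform({
    transform(chunk, encoding, callback) {
      onChunk(chunk);        // inspect (or replace) the chunk here
      callback(null, chunk); // pass it on to the next stream in the pipe
    }
  });
}
```

It would slot into the chain the same way, for example file.pipe(wavy).pipe(createChunkTap(function (chunk) { console.log(chunk.length); })).pipe(waver).pipe(spark).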

Node.js: splitting a readable stream pipe to multiple sequential writable streams

Given a Readable stream (which may be process.stdin or a file stream), is it possible/practical to pipe() to a custom Writable stream that will fill a child Writable until a certain size; then close that child stream; open a new Writable stream and continue?
(The context is to upload a large piece of data from a pipeline to a CDN, dividing it up into blocks of a reasonable size as it goes, without having to write the data to disk first.)
I've tried creating a Writable that handles the opening and closing of the child stream in the _write function, but the problem comes when the incoming chunk is too big to fit in the existing child stream: it has to write some of the chunk to the old stream; create the new stream; and then wait for the open event on the new stream before completing the _write call.
The other thought I had was to create an extra Duplex or Transform stream to buffer the pipe and ensure that the chunk coming into the Writable is definitely equal to or less than the amount the existing child stream can accept, to give the Writable time to change the child stream over.
Alternatively, is this overcomplicating everything and there's a much easier way to do the original task?

I came across this question while looking for an answer to a related problem: how to parse a file and split its lines into separate files depending on some category value in each line.
I did my best to adapt my code to make it more relevant to your problem. However, it was adapted quickly and is not tested, so treat it as pseudo-code.
var fs = require('fs'),
    through = require('through');
var destCount = 0, dest = null, size = 0, MAX_SIZE = 1000;
// readableStream is the source stream from the question (e.g. process.stdin)
readableStream
  .on('data', function(data) {
    var out = data.toString() + "\n";
    size += out.length;
    if (dest && size > MAX_SIZE) {
      // the current file is full: finish it and start counting again
      dest.end();
      dest = null;
      size = out.length;
    }
    if (!dest) {
      // option 1. manipulate data before saving them.
      dest = through();
      dest.pipe(fs.createWriteStream("log" + destCount++));
      // option 2. write directly to file
      // dest = fs.createWriteStream("log" + destCount++);
    }
    dest.write(out);
  })
  .on('end', function() {
    // finish the last file, if one was ever opened
    if (dest) dest.end();
  });

I would introduce a Transform between the Readable and the Writable stream, and put all the logic I need in its _transform.
Maybe I would have only a Readable and a Transform. The _transform method would then create all the Writable streams I need.
Personally, I use a Writable stream only when I'm dumping data somewhere and am done processing that chunk.
I avoid implementing _read and _write as much as I can and lean heavily on Transform streams instead.
What I don't understand in your question is the part about size. What do you mean by it?
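One way to read the Transform suggestion is a stream that re-chunks its input into fixed-size blocks, so that whatever sits downstream only ever receives one block at a time. The sketch below assumes that interpretation; BlockSplitter and blockSize are hypothetical names, and opening one upload per block is left to the consumer.

```javascript
const { Transform } = require('stream');

// Hypothetical helper: re-chunks arbitrary input into blocks of exactly
// blockSize bytes (only the final block may be shorter).
class BlockSplitter extends Transform {
  constructor(blockSize, options) {
    super(options);
    this.blockSize = blockSize;
    this.pending = Buffer.alloc(0);
  }

  _transform(chunk, encoding, callback) {
    this.pending = Buffer.concat([this.pending, chunk]);
    // Emit as many full blocks as the buffered data allows.
    while (this.pending.length >= this.blockSize) {
      this.push(this.pending.subarray(0, this.blockSize));
      this.pending = this.pending.subarray(this.blockSize);
    }
    callback();
  }

  _flush(callback) {
    // Emit whatever is left as the last, possibly shorter, block.
    if (this.pending.length > 0) this.push(this.pending);
    callback();
  }
}
```

With source.pipe(new BlockSplitter(1024 * 1024)), each 'data' event then carries one block, which sidesteps the case in the question where a single incoming chunk straddles two child streams.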

Read File in Node and process the same

I want to read a file and process each line of it. I use a read stream to read the file and then invoke the processRecord method for each line. processRecord needs to make multiple calls and assemble the final data before it is written to the store.
The file has 500K records.
The issue I'm facing is that the file is read at a significant pace, and I believe Node is not getting enough time to actually run processRecord for each line. Hence the memory shoots up to 800MB and then everything slows down.
Any help is appreciated.
The code that I'm using is given below:
var instream = fs.createReadStream('C:/data.txt');
var outstream = new stream;
var rl = readline.createInterface({
input: instream,
output: outstream,
terminal: false
});
outstream.readable = true;
rl.on('line', function(line) {
processRecord(line);
});

The Node.js readline module is intended more for user interaction than line-by-line streaming from files. You may have better luck with the popular byline package.
var fs = require('fs');
var byline = require('byline');
// You'll need to check the encoding.
var lineStream = byline(fs.createReadStream('C:/data.txt', { encoding: 'utf8' }));
lineStream.on('data', function (line) {
processRecord(line);
});
You'll have a better chance of avoiding memory leaks if the data is piped to another stream. I'm assuming here that processRecord is feeding into one. If you make it a transform stream object, then you can use pipes.
var out = fs.createWriteStream('output.txt');
lineStream.pipe(processRecordStream).pipe(out);
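The processRecordStream above is not defined anywhere in the question. A minimal sketch of what it could look like follows, assuming processRecord returns a promise (or a plain value) for the processed line; createProcessRecordStream is a made-up name. Because the callback is only invoked when the record is done, the pipe stops pulling lines from the file while records are still being processed.

```javascript
const { Transform } = require('stream');

// Hypothetical factory: wraps a processRecord(line) function, assumed to
// return a promise or a plain value, in an object-mode Transform.
function createProcessRecordStream(processRecord) {
  return new Transform({
    objectMode: true,
    transform(line, encoding, callback) {
      Promise.resolve(processRecord(line.toString()))
        // Calling back only after the record is processed is what
        // applies backpressure to the line stream.
        .then(function (result) { callback(null, result + '\n'); }, callback);
    }
  });
}
```

Usage would then be lineStream.pipe(createProcessRecordStream(processRecord)).pipe(out).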

How do I close a stream that has no more data to send in node.js?

I am using node.js and reading input from a serial port by opening a /dev/tty file. I send a command, read the result of the command, and want to close the stream once I've read and parsed all the data. I know that I'm done reading data by an end-of-data marker. I'm finding that once I've closed the stream my program does not terminate.
Below is an example of what I am seeing but uses /dev/random to slowly generate data (assuming your system isn't doing much). What I find is that the process will terminate once the device generates data after the stream has been closed.
var util = require('util'),
PassThrough = require('stream').PassThrough,
fs = require('fs');
// If the system is not doing enough to fill the entropy pool
// /dev/random will not return much data. Feed the entropy pool with :
// ssh <host> 'cat /dev/urandom' > /dev/urandom
var readStream = fs.createReadStream('/dev/random');
var pt = new PassThrough();
pt.on('data', function (data) {
console.log(data)
console.log('closing');
readStream.close(); //expect the process to terminate immediately
});
readStream.pipe(pt);
Update 1:
I am back on this issue and have another sample; this one just uses a pty and is easily reproduced in the node REPL. Log in on two terminals and, in the call to createReadStream below, use the pty of the terminal you are not running node in.
var fs = require('fs');
var rs = fs.createReadStream('/dev/pts/1'); // a pty that is allocated in another terminal by my user
//wait just a second, don't copy and paste everything at once
process.exit(0);
At this point node will just hang and not exit. This is on 10.28.

Instead of using readStream.close(), try using readStream.pause().
But if you are using the newest version of node, wrap the read stream in a Readable from the stream module (by isaacs), like this:
var Readable = require('stream').Readable;
var myReader = new Readable().wrap(readStream);
and use myReader in place of readStream after that.
Best of luck! Tell me if this works.

You are closing the /dev/random stream, but you still have a listener for the 'data' event on the pass-through, which will keep the app running until the pass-through is closed.
I'm guessing there is some buffered data from the read stream and until that is flushed the pass-through is not closed. But this is just a guess.
To get the desired behaviour you can remove the event listener on the pass-through like this:
pt.on('data', function (data) {
console.log(data)
console.log('closing');
pt.removeAllListeners('data');
readStream.close();
});

I am actually piping to an HTTP request, so for me it comes down to:
pt.on('close', (chunk) => {
req.abort();
});

How to wrap a buffer as a stream2 Readable stream?

How can I transform a node.js buffer into a Readable stream using the stream2 interface?
I already found this answer and the stream-buffers module but this module is based on the stream1 interface.

The easiest way is probably to create a new PassThrough stream instance, and simply push your data into it. When you pipe it to other streams, the data will be pulled out of the first stream.
var stream = require('stream');
// Initiate the source
var bufferStream = new stream.PassThrough();
// Write your buffer
bufferStream.end(Buffer.from('Test data.'));
// Pipe it to something else (i.e. stdout)
bufferStream.pipe(process.stdout)

As natevw suggested, it's even more idiomatic to use a stream.PassThrough, and end it with the buffer:
var stream = require('stream');
// Buffer.from() replaces the deprecated new Buffer() constructor
var buffer = Buffer.from( 'foo' );
var bufferStream = new stream.PassThrough();
bufferStream.end( buffer );
bufferStream.pipe( process.stdout );
This is also how buffers are converted/piped in vinyl-fs.

A modern simple approach that is usable everywhere you would use fs.createReadStream() but without having to first write the file to a path.
const {Duplex} = require('stream'); // Native Node Module
function bufferToStream(myBuffer) {
let tmp = new Duplex();
tmp.push(myBuffer);
tmp.push(null);
return tmp;
}
const myReadableStream = bufferToStream(your_buffer);
A stream instance can be consumed only once, so call bufferToStream again whenever you need another stream over the same buffer.
The buffer and the stream exist only in memory without writing to local storage.
I use this approach often when the actual file is stored at some cloud service and our API acts as a go-between. Files never get written to a local file.
I have found this to be very reliable no matter the buffer (up to 10 MB) or the destination that accepts a Readable stream. Larger files should implement
