Read File in Node and process the same - node.js

I wanted to read a file and process each line of it. I have used a read stream to read the file and then invoke the processRecord method for every line. processRecord needs to make multiple calls and build up the final data before it is written to the store.
The file has 500K records.
The issue I'm facing is that the file is read at a significant pace, and I believe Node is not getting enough priority to actually run the processRecord method for each line. As a result, memory shoots up to 800 MB and then everything slows down.
Any help is appreciated.
The code I'm using is given below:
var fs = require('fs');
var stream = require('stream');
var readline = require('readline');

var instream = fs.createReadStream('C:/data.txt');
var outstream = new stream;
var rl = readline.createInterface({
  input: instream,
  output: outstream,
  terminal: false
});
outstream.readable = true;
rl.on('line', function (line) {
  processRecord(line);
});

The Node.js readline module is intended more for user interaction than line-by-line streaming from files. You may have better luck with the popular byline package.
var fs = require('fs');
var byline = require('byline');
// You'll need to check the encoding.
var lineStream = byline(fs.createReadStream('C:/data.txt', { encoding: 'utf8' }));
lineStream.on('data', function (line) {
  processRecord(line);
});
You'll have a better chance of avoiding memory leaks if the data is piped to another stream. I'm assuming here that processRecord is feeding into one. If you make it a transform stream object, then you can use pipes.
var out = fs.createWriteStream('output.txt');
lineStream.pipe(processRecordStream).pipe(out);
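For completeness, here is a minimal sketch of what processRecordStream could look like as a transform stream, assuming processRecord can be given a callback-based form (that signature is an assumption, not part of the original question). The callback is what creates back-pressure, so the reader cannot run ahead of the slower processing and writing:
var stream = require('stream');

// Hypothetical wrapper: processRecord(line, callback) is assumed to call back
// with the finished record once its multiple calls have completed.
var processRecordStream = new stream.Transform({ objectMode: true });

processRecordStream._transform = function (line, encoding, done) {
  var self = this;
  processRecord(line, function (err, record) {
    if (err) return done(err);
    self.push(record + '\n');
    done(); // the next line is not accepted until this one has finished
  });
};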

Related

Memory limit exceeded error when reading a large file with node js and readline

I am getting a memory error when using readline. The section of code is shown below:
var lineReader = require('readline').createInterface({
  input: require('fs').createReadStream('/tmp/temp.ttl')
});

let entity;
let tripleKey;
let triple;
let number_of_rows = 0;

console.log('file ready for processing');

lineReader.on('line', function (line) {
  triple = parser.parse(line)[0]; // parser and datastore come from elsewhere in the app
  if (triple) {
    tripleKey = datastore.key('triple');
    entity = prepare_entity(tripleKey, triple);
    lineReader.pause();
    datastore.save(entity).then(() => lineReader.resume());
    number_of_rows += 1;
  }
});
I thought all the memory for the on('line') handler is pre-allocated, as it is declared outside the loop. So my question is: what could be causing the memory consumption in this section of code?
In response to Doug, changing the readline to be fully streaming now shows the memory limit error after 140,000 entities (rather than 40,000 before).
See below:
const remoteFile = bucket.file(file.name);
var lineReader = require('readline').createInterface({
  input: remoteFile.createReadStream()
});
console.log('file ready for processing');
lineReader.on('line', function (line) { ...
In Cloud Functions, /tmp is a memory-based filesystem. That means your 1.2GB file living there is actually taking 1.2GB of memory. That's a lot of memory.
If you need more memory, you can try to increase the memory limit for your functions in the Cloud Console, up to 2GB max.
Instead, you might want to try to stream the file from its origin and process the stream, rather than downloading the entire thing locally. You'll save money that way on the memory used over time.
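A minimal sketch of that approach, reusing the pause/resume pattern from the original snippet together with the bucket stream from the update above (bucket, file, parser, datastore and prepare_entity are assumed to be defined as in the question):
const readline = require('readline');

const remoteFile = bucket.file(file.name); // bucket and file as in the question
const lineReader = readline.createInterface({
  input: remoteFile.createReadStream()
});

lineReader.on('line', (line) => {
  const triple = parser.parse(line)[0];
  if (triple) {
    lineReader.pause(); // stop reading while the entity is being saved
    const entity = prepare_entity(datastore.key('triple'), triple);
    datastore.save(entity)
      .then(() => lineReader.resume())
      .catch((err) => {
        console.error(err);
        lineReader.close();
      });
  }
});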

gzipping a file with nodejs streams causes memory leaks

I'm trying to do what should seemingly be quite simple: take a file with filename X and create a gzipped version as "X.gz". Node.js's zlib module does not come with a convenient zlib.gzip(infile, outfile), so I figured I'd use an input stream, an output stream and a zlib gzipper, then pipe them:
var zlib = require("zlib"),
    zipper = zlib.createGzip(),
    fs = require("fs");

var tryThing = function (logfile) {
  var input = fs.createReadStream(logfile, { autoClose: true }),
      output = fs.createWriteStream(logfile + ".gz");

  input.pipe(zipper).pipe(output);

  output.on("end", function () {
    // delete original file, it is no longer needed
    fs.unlink(logfile);
    // clear listeners
    zipper.removeAllListeners();
    input.removeAllListeners();
  });
}
However, every time I run this function, the memory footprint of Node.js grows by about 100 kB. Am I forgetting to tell the streams that they should kill themselves off because they won't be needed any longer?
Or, alternatively, is there a way to gzip a file without bothering with streams and pipes? I tried googling for "node.js gzip a file", but it only turns up links to the API docs and Stack Overflow questions about gzipping streams and buffers, not how to just gzip a file.
I think you need to properly unpipe and close the streams. Simply calling removeAllListeners() may not be enough to clean things up, as streams may be waiting for more data (and thus staying alive in memory unnecessarily). Also, you're not closing the output stream, and IMO I'd listen on the input stream's end instead of the output's.
// cleanup
input.once('end', function () {
  zipper.removeAllListeners();
  zipper.close();
  zipper = null;
  input.removeAllListeners();
  input.close();
  input = null;
  output.removeAllListeners();
  output.close();
  output = null;
});
Also, I don't think the stream returned from zlib.createGzip() can be reused once it has ended. You should create a new one on every call of tryThing:
var input = fs.createReadStream(logfile, { autoClose: true }),
    output = fs.createWriteStream(logfile + ".gz"),
    zipper = zlib.createGzip();

input.pipe(zipper).pipe(output);
Haven't tested this though, as I don't have a memory profiling tool nearby right now.
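Putting both points together, a sketch of tryThing with a fresh gzip stream per call (untested, same caveat as above) could look like this:
var zlib = require("zlib"),
    fs = require("fs");

var tryThing = function (logfile) {
  var input = fs.createReadStream(logfile, { autoClose: true }),
      output = fs.createWriteStream(logfile + ".gz"),
      zipper = zlib.createGzip(); // a fresh gzip stream for every call

  input.pipe(zipper).pipe(output);

  // The original can be deleted as soon as it has been fully read;
  // pipe() takes care of ending the gzip and output streams.
  input.once("end", function () {
    fs.unlink(logfile, function (err) {
      if (err) console.error(err);
    });
  });
};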

How do I close a stream that has no more data to send in node.js?

I am using node.js and reading input from a serial port by opening a /dev/tty file. I send a command, read the result of the command, and want to close the stream once I've read and parsed all the data. I know that I'm done reading data by an end-of-data marker. I'm finding that once I've closed the stream, my program does not terminate.
Below is an example of what I am seeing, but it uses /dev/random to slowly generate data (assuming your system isn't doing much). What I find is that the process will only terminate once the device generates more data after the stream has been closed.
var util = require('util'),
    PassThrough = require('stream').PassThrough,
    fs = require('fs');

// If the system is not doing enough to fill the entropy pool,
// /dev/random will not return much data. Feed the entropy pool with:
// ssh <host> 'cat /dev/urandom' > /dev/urandom
var readStream = fs.createReadStream('/dev/random');
var pt = new PassThrough();

pt.on('data', function (data) {
  console.log(data);
  console.log('closing');
  readStream.close(); // expect the process to terminate immediately
});

readStream.pipe(pt);
Update 1:
I am back on this issue and have another sample; this one just uses a pty and is easily reproduced in the node REPL. Log in on two terminals and, in the createReadStream call below, use the pty of the terminal you're not running node in.
var fs = require('fs');
var rs = fs.createReadStream('/dev/pts/1'); // a pty that is allocated in another terminal by my user
//wait just a second, don't copy and paste everything at once
process.exit(0);
At this point node will just hang and not exit. This is on 10.28.
Instead of using readStream.close(), try using readStream.pause().
But if you are using the newest version of node, wrap the read stream with a Readable object created from the stream module by isaacs, like this:
var Readable = require('stream').Readable;
var myReader = new Readable().wrap(readStream);
and use myReader in place of readStream after that.
Best of luck! Tell me if this works.
You are closing the /dev/random stream, but you still have a listener for the 'data' event on the pass-through, which will keep the app running until the pass-through is closed.
I'm guessing there is some buffered data from the read stream and until that is flushed the pass-through is not closed. But this is just a guess.
To get the desired behaviour you can remove the event listener on the pass-through like this:
pt.on('data', function (data) {
  console.log(data);
  console.log('closing');
  pt.removeAllListeners('data');
  readStream.close();
});
I am actually piping to an HTTP request, so for me it's about:
pt.on('close', (chunk) => {
  req.abort();
});

Node.js Readable file stream not getting data

I'm attempting to create a Readable file stream that I can read individual bytes from. I'm using the code below.
var rs = fs.createReadStream(file).on('open', function () {
  var buff = rs.read(8); // Read first 8 bytes
  console.log(buff);
});
Given that file is an existing file of at least 8 bytes, why am I getting 'null' as the output for this?
The open event means that the stream has been initialized; it does not mean you can read from it yet. You would have to listen for either the readable or the data event.
var rs = fs.createReadStream(file);

rs.once('readable', function () {
  var buff = rs.read(8); // Read first 8 bytes only once
  console.log(buff.toString());
});
It looks like you're calling the rs.read() method. However, that method is only available in the streams interface, and with streams you're looking for the 'data' event, not the 'open' event.
That said, the docs actually recommend against doing this. Instead, you should probably handle a chunk at a time if you want to stream the file:
var rs = fs.createReadStream('test.txt');

rs.on('data', function (chunk) {
  console.log(chunk);
});
If you want to read just a specific portion of a file, you may want to look at fs.open() and fs.read() which are lower level.
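For example, a sketch of that lower-level approach, reading only the first 8 bytes of file (the same path as above):
var fs = require('fs');

fs.open(file, 'r', function (err, fd) {
  if (err) throw err;
  var buff = Buffer.alloc(8);
  // Read 8 bytes from position 0 of the file into buff.
  fs.read(fd, buff, 0, 8, 0, function (err, bytesRead, buffer) {
    if (err) throw err;
    console.log(buffer);
    fs.close(fd, function () {});
  });
});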

How to wrap a buffer as a stream2 Readable stream?

How can I transform a Node.js buffer into a Readable stream using the streams2 interface?
I already found this answer and the stream-buffers module, but that module is based on the streams1 interface.
The easiest way is probably to create a new PassThrough stream instance, and simply push your data into it. When you pipe it to other streams, the data will be pulled out of the first stream.
var stream = require('stream');
// Initiate the source
var bufferStream = new stream.PassThrough();
// Write your buffer
bufferStream.end(Buffer.from('Test data.'));
// Pipe it to something else (i.e. stdout)
bufferStream.pipe(process.stdout)
As natevw suggested, it's even more idiomatic to use a stream.PassThrough, and end it with the buffer:
var buffer = new Buffer( 'foo' );
var bufferStream = new stream.PassThrough();
bufferStream.end( buffer );
bufferStream.pipe( process.stdout );
This is also how buffers are converted/piped in vinyl-fs.
A modern, simple approach that is usable anywhere you would use fs.createReadStream(), but without having to first write the file to a path:
const { Duplex } = require('stream'); // Native Node module

function bufferToStream(myBuffer) {
  let tmp = new Duplex();
  tmp.push(myBuffer);
  tmp.push(null);
  return tmp;
}
const myReadableStream = bufferToStream(your_buffer);
myReadableStream is re-usable.
The buffer and the stream exist only in memory without writing to local storage.
I use this approach often when the actual file is stored at some cloud service and our API acts as a go-between. Files never get written to a local file.
I have found this to be very reliable no matter the buffer (up to 10 MB) or the destination that accepts a Readable stream. Larger files should implement
