Handle huge amount of data in node.js through stdin - node.js

Old question title:
"node.js readline form net.Socket (process.stdin) cause error: heap out of memory (conversion of net.Socket Duplex to Readable stream)"
... I've changed it because nobody answered, and it seems like an important question for the node.js ecosystem.
The question is how to solve the "heap out of memory" error when reading line by line from a huge stdin. The error does not happen when you dump stdout to a file (e.g. test.log) and feed it to the 'readline' interface through fs.createReadStream('test.log').
It looks like process.stdin is not a regular Readable stream, as mentioned here:
https://nodejs.org/api/process.html#process_process_stdin
To reproduce the issue I've created two scripts. The first one just generates a huge amount of data (a.js):
// a.js
// loop in this form generates about 7.5G of data
// you can check yourself running:
// node a.js > test.log && ls -lah test.log
// will return
// -rw-r--r-- 1 sd staff 7.5G 31 Jan 22:29 test.log
for (let i = 0; i < 8000000; i += 1) {
console.log(`${i} ${".".repeat(1000)}\n`);
}
The script that consumes it through a bash pipe with readline (b.js):
const fs = require('fs');
const readline = require('readline');
const rl = readline.createInterface({
input: process.stdin, // doesn't work
//input: fs.createReadStream('test.log'), // works
});
let s;
rl.on('line', line => {
// deliberately commented out to demonstrate that the issue
// involves nothing beyond readline and process.stdin
// s = line.substring(0, 7);
//
// if (s === '100 ...' || s === '400 ...' || s === '7500000') {
//
// process.stdout.write(`${line}\n`);
// }
});
rl.on('error', e => {
console.log('general error', e)
})
Now when you run:
node a.js | node b.js
it will fail with the error:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
but if you swap
const rl = readline.createInterface({
input: process.stdin,
});
to
const rl = readline.createInterface({
input: fs.createReadStream('test.log')
});
and run
node a.js > test.log
node b.js
everything works fine.
The problem actually comes down to: how do you convert a net.Socket into a fully functional Readable stream, if that is possible at all?
Edit:
Basically, my problem is that it seems impossible to handle a huge amount of data from stdin as a stream, which is natural for Unix-style pipes. So despite the fact that node.js is brilliant at handling streams, you can't write a program that handles a huge amount of data through Unix-style pipes.
In some cases it would be completely unnecessary to dump the data to disk and only then handle it with fs.createReadStream('test.log'), were it not for this limitation.
I thought that streams are all about handling huge amounts of data (among other use cases) on the fly, without saving them to disk.

You can always treat process.stdin as a normal Node.js stream and handle the reading yourself:
function onReadLine(line) {
// do stuff with line
console.info(line);
}
// read input and split into lines
let BUFF = '';
process.stdin.on('data', (buff) => {
const content = buff.toString('utf-8');
for (let i = 0; i < content.length; i++) {
// split on '\n' rather than os.EOL, which is the two-character '\r\n' on Windows
if (content[i] === '\n') {
// drop a trailing '\r' so Windows line endings are handled too
onReadLine(BUFF.endsWith('\r') ? BUFF.slice(0, -1) : BUFF);
BUFF = '';
} else {
BUFF += content[i];
}
}
});
// flush last line
process.stdin.on('end', () => {
if (BUFF.length > 0) {
onReadLine(BUFF);
}
});
Example:
// unix
cat ./somefile.txt | node ./script.js
// windows
Start-Process -FilePath "node" -ArgumentList @(".\script.js") -RedirectStandardInput .\somefile.txt -NoNewWindow -Wait

The problem is neither the input data size nor Node, but a flaw in the design of your data generator: it does not pause and resume data generation when the consuming output stream asks it to. Instead of just pushing data through console.log(..), you should interact correctly with the standard output stream and handle its pause and resume signals.
The file input stream created by fs.createReadStream() is implemented properly: it pauses and resumes as necessary and therefore does not crash the code.
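As an illustration of that point, here is a minimal sketch (an assumption about how a.js could be rewritten, not code from the thread) that produces the same output but honours backpressure: it stops writing when process.stdout.write() returns false and continues once the 'drain' event fires.
// a-backpressure.js (hypothetical name): same output as a.js, but the loop
// yields whenever the stdout buffer is full instead of queueing 8M lines in memory
let i = 0;
const MAX = 8000000;

function writeChunk() {
  let ok = true;
  while (i < MAX && ok) {
    // console.log(x) is equivalent to writing x + '\n', hence the double '\n'
    ok = process.stdout.write(`${i} ${".".repeat(1000)}\n\n`);
    i += 1;
  }
  if (i < MAX) {
    // the internal buffer is above the high-water mark; wait until it drains
    process.stdout.once('drain', writeChunk);
  }
}

writeChunk();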

Related

Stream uint8 through stdin/stdout in nodejs

I have two simple node scripts that I would like to pipe together in bash. I want to stream 2 integers from one script to the other. Something goes wrong when moving to the next bit: e.g. 127 can be expressed in 7 bits while 128 needs 8 bits, if I understand correctly. My guess is that it has something to do with the sign of the integer, e.g. plus or minus. I have specifically used writeUInt8 and readUInt8 for this reason, though...
Script in.js, sends 2 integers to stdout:
process.stdout.setEncoding('binary');
const buff1 = Buffer.alloc(1);
const buff2 = Buffer.alloc(1);
buff1.writeUInt8(127);
buff2.writeUInt8(128);
process.stdout.write(buff1);
process.stdout.write(buff2);
process.stdout.end();
Script out.js, reads from stdin and writes to stdout again:
process.stdin.setEncoding('binary');
process.stdin.on('data', function(data) {
for(const uInt of data) {
const v = Buffer.from(uInt).readUInt8();
process.stdout.write(v + '\n');
}
});
In bash I connect in and out:
$ node in.js | node out.js
Expected result:
127
128
Actual result:
127
194
Setting the encoding to 'binary' is what mangles the received data in out.js: process.stdin.setEncoding('binary') turns the byte 128 into the single character U+0080, and Buffer.from() then re-encodes that character as UTF-8 (the two bytes 0xC2 0x80), which is why readUInt8() returns 194.
According to the Readable Stream documentation of Node.js:
By default, no encoding is assigned and stream data will be returned
as Buffer objects.
I tested the code below and it works:
// out.js
process.stdin.on('data', function (data) {
for (let i = 0; i < data.length; ++i) {
const v = data.readUInt8(i);
process.stdout.write(v + '\n');
}
});
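With in.js unchanged and this corrected out.js, piping the two scripts together should now print both values:
$ node in.js | node out.js
127
128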

Memory limit exceeded error when reading a large file with node js and readline

I am getting a memory error when using readline. The section of code is shown below:
var lineReader = require('readline').createInterface({
input: require('fs').createReadStream('/tmp/temp.ttl')
});
let entity;
let tripleKey;
let triple;
console.log('file ready for processing');
lineReader.on('line', function (line) {
triple = parser.parse(line)[0];
if (triple) {
tripleKey = datastore.key('triple');
entity = prepare_entity(tripleKey, triple);
lineReader.pause();
datastore.save(entity).then(()=>lineReader.resume());
number_of_rows += 1;
}
});
I thought all memory for the 'line' event handler was pre-allocated, since it is declared outside the loop. So my question is, what could be causing the memory consumption in this section of code?
In response to Doug: after changing readline to be fully streaming, the memory limit error now shows up after 140,000 entities (rather than 40,000 before).
See below:
const remoteFile = bucket.file(file.name);
var lineReader = require('readline').createInterface({
input: remoteFile.createReadStream()
});
console.log('file ready for processing');
lineReader.on('line', function (line) { ...
In Cloud Functions, /tmp is a memory-based filesystem. That means your 1.2GB file living there is actually taking 1.2GB of memory. That's a lot of memory.
If you need more memory, you can try to increase the memory limit for your functions in the Cloud Console, up to 2GB max.
Instead, you might want to stream the file from its origin and process the stream, rather than downloading the entire thing locally. You'll also save money that way, since you are billed for memory used over time.
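Putting the pieces from the question together, here is a minimal sketch (reusing the bucket, file, datastore, parser and prepare_entity objects from the question; none of this is from the answer itself) of streaming straight from the bucket and throttling the reader while each save is in flight:
const readline = require('readline');

// stream directly from Cloud Storage instead of staging the file in /tmp
const lineReader = readline.createInterface({
  input: bucket.file(file.name).createReadStream()
});

lineReader.on('line', (line) => {
  const triple = parser.parse(line)[0];
  if (!triple) return;
  const tripleKey = datastore.key('triple');
  const entity = prepare_entity(tripleKey, triple);
  // pause line events while the save is in flight, as in the question
  lineReader.pause();
  datastore.save(entity).then(() => lineReader.resume());
});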

node.js child_process#spawn bypass stdin/stdout inner buffers

I'm using child_process#spawn to run external binaries from node.js. Each binary searches for specific words in a string, depending on the language, and produces output based on the input text. They don't have internal buffers. Usage examples:
echo "I'm a random input" | ./my-english-binary
produces text like The word X is in the sentence
cat /dev/urandom | ./my-english-binary produces infinite output
I want to use each of these binaries as a "server": launch a new binary instance when a language not seen before is encountered, send data to it with stdin.write() when necessary, and get its output directly from the stdout.on('data') event. The problem is that stdout.on('data') isn't called until a huge amount of data has been sent with stdin.write(). stdout or stdin (or both) might have internal blocking buffers... But I want the output as soon as possible, because otherwise the program might wait hours before new input shows up and unblocks stdin.write() or stdout.on('data'). How can I change their internal buffer size? Or can I maybe use another non-blocking approach?
My code is:
const spawn = require('child_process').spawn;
const path = require('path');
class Driver {
constructor() {
// I have one binary per language
this.instances = {
frFR: {
instance: null,
path: path.join(__dirname, './my-french-binary')
},
enGB: {
instance: null,
path: path.join(__dirname, './my-english-binary')
}
}
};
// this function just checks whether an instance is running for a language
isRunning(lang) {
if (this.instances[lang] === undefined)
throw new Error("Language not supported by TreeTagger: " + lang);
return this.instances[lang].instance !== null;
}
// launch a binary according to a language and attach the function 'onData' to the stdout.on('data') event
run(lang, onData) {
const instance = spawn(this.instances[lang].path,{cwd:__dirname});
instance.stdout.on('data', buf => onData(buf.toString()));
// if a binary instance is killed, it will be relaunched later
instance.on('close', () => this.instances[lang].instance = null );
this.instances[lang].instance = instance;
}
/**
* Writes to instance.stdin indefinitely.
* I want to avoid this behaviour and write to stdin only once,
* but if I write only once, stdout.on('data') is never called.
* Everything works if I use stdin.end(), but I don't want to use it.
*/
write(lang, text) {
const id = setInterval(() => {
console.log('setInterval');
this.instances[lang].instance.stdin.write(text + '\n');
}, 1000);
}
};
// simple usage example
const driver = new Driver;
const txt = "This is a random input.";
if (driver.isRunning('enGB') === true)
driver.write('enGB', txt);
else {
/**
* the arrow function is called once every N stdin.write() calls,
* while I want it to be called after each write
*/
driver.run('enGB', data => console.log('Data received!', data));
driver.write('enGB', txt);
}
I tried to:
use cork() and uncork() around stdin.write().
Pipe the child process's stdout to a custom Readable and to a Socket.
Override the highWaterMark value to 1 and 0 on stdin, stdout and the aforementioned Readable.
Lots of other things I forgot...
Moreover, I can't use stdin.end() because I don't want to kill my binary instances each time new text arrives. Does anyone have an idea?

NODEJS: Uncork() method on writable stream doesn't really flush the data

I am writing a fairly simple application to transform data: read one file and write to another. The files are relatively large, around 2 GB. However, what I found is that the flush to the file system does not happen on each cork/uncork cycle; it only happens on end(), so end() basically hangs the system until everything is fully flushed.
I simplified the example so it just writes a line to the stream a lot of times.
var PREFIX = 'E:\\TEST\\';
var line = 'AA 11 999999999 20160101 123456 20160101 AAA 00 00 00 0 0 0 2 2 0 0 20160101 0 00';
var fileSystem = require('fs');
function writeStrings() {
var stringsCount = 0;
var stream = fileSystem.createWriteStream(PREFIX +'output.txt');
stream.once('drain', function () {
console.log("drained");
});
stream.once('open', function (fileDescriptor) {
var started = false;
console.log('writing file ');
stream.cork();
for (let i = 0; i < 2000000; i++) {
stream.write(line + i);
if (i % 10000 == 0) {
// console.log('passed ',i);
}
if (i % 100000 == 0) {
console.log('uncorked ', i, stream._writableState.writing);
stream.uncork();
stream.cork();
}
}
stream.end();
});
stream.once('finish', function () {
console.log("done");
});
}
writeStrings();
Going inside Node's _stream_writable.js, I found that it flushes the buffer only on this condition:
if (!state.writing &&
!state.corked &&
!state.finished &&
!state.bufferProcessing &&
state.buffer.length)
clearBuffer(this, state);
and, as you can see from the example, the writing flag is not reset after the first uncork(), which prevents uncork() from flushing.
Also, I don't see any drain events firing at all. Playing with highWaterMark doesn't help (it actually doesn't seem to have an effect on anything). Manually setting writing to false (plus some other flags) did help, but that is surely wrong.
Am I misunderstanding the concept here?
From the Node.js documentation I found that the number of uncork() calls should match the number of cork() calls. I am not seeing a matching stream.uncork() call for the stream.cork() that is called before the for loop. That might be the issue.
Looking at a guide on nodejs.org, you aren't supposed to call stream.uncork() twice in the same event loop tick. Here is an excerpt:
// Using .uncork() twice here makes two calls on the C++ layer, rendering the
// cork/uncork technique useless.
ws.cork();
ws.write('hello ');
ws.write('world ');
ws.uncork();
ws.cork();
ws.write('from ');
ws.write('Matteo');
ws.uncork();
// The correct way to write this is to utilize process.nextTick(), which fires
// on the next event loop.
ws.cork();
ws.write('hello ');
ws.write('world ');
process.nextTick(doUncork, ws);
ws.cork();
ws.write('from ');
ws.write('Matteo');
process.nextTick(doUncork, ws);
// as a global function
function doUncork(stream) {
stream.uncork();
}
.cork() can be called as many times as we want; we just need to be careful to call .uncork() the same number of times to make the data flow again.
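To make that concrete, here is a minimal sketch (an assumption about how the questioner's loop could be restructured; not code from the answer) that writes in corked batches, defers each uncork() to process.nextTick() as the guide suggests, and waits for 'drain' before starting the next batch so the buffer really gets flushed:
const fs = require('fs');

const line = 'AA 11 999999999 20160101 123456 20160101 AAA 00 00 00 0 0 0 2 2 0 0 20160101 0 00';
const stream = fs.createWriteStream('output.txt');
const TOTAL = 2000000;
const BATCH = 100000;

let i = 0;
function writeBatch() {
  stream.cork();
  const end = Math.min(i + BATCH, TOTAL);
  let ok = true;
  for (; i < end; i += 1) {
    ok = stream.write(line + i + '\n');
  }
  // defer the uncork so it runs on the next tick, as the guide recommends
  process.nextTick(() => stream.uncork());
  if (i < TOTAL) {
    // honour backpressure: wait for 'drain' if the internal buffer is full
    if (ok) setImmediate(writeBatch);
    else stream.once('drain', writeBatch);
  } else {
    stream.end();
  }
}

stream.once('finish', () => console.log('done'));
writeBatch();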

Read File in Node and process the same

I wanted to read a file and process each line of it. I have used a read stream to read the file and then invoke the processRecord method. The processRecord method needs to make multiple calls and build the final data before it is written to the store.
The file has 500K records.
The issue I'm facing is that the file is read at a significant pace, and I believe Node is not getting enough priority to actually run the processRecord method. Hence the memory shoots up to 800MB and then everything slows down.
Any help is appreciated.
The code that I'm using is given below:
var fs = require('fs');
var readline = require('readline');
var stream = require('stream');
var instream = fs.createReadStream('C:/data.txt');
var outstream = new stream;
var rl = readline.createInterface({
input: instream,
output: outstream,
terminal: false
});
outstream.readable = true;
rl.on('line', function(line) {
processRecord(line);
});
The Node.js readline module is intended more for user interaction than line-by-line streaming from files. You may have better luck with the popular byline package.
var fs = require('fs');
var byline = require('byline');
// You'll need to check the encoding.
var lineStream = byline(fs.createReadStream('C:/data.txt', { encoding: 'utf8' }));
lineStream.on('data', function (line) {
processRecord(line);
});
You'll have a better chance of avoiding memory leaks if the data is piped to another stream. I'm assuming here that processRecord is feeding into one. If you make it a transform stream object, then you can use pipes.
var out = fs.createWriteStream('output.txt');
lineStream.pipe(processRecordStream).pipe(out);
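For that last suggestion, here is a minimal sketch (assuming processRecord synchronously returns a string per line; this helper stream is not part of the original answer) of wrapping it in a Transform so that backpressure from the output file propagates all the way back to the file read:
var Transform = require('stream').Transform;

var processRecordStream = new Transform({
  decodeStrings: false, // byline emits strings; keep them as strings
  transform: function (line, encoding, callback) {
    // push the processed line downstream; if processRecord is asynchronous,
    // invoke the callback from its completion handler instead
    callback(null, processRecord(line) + '\n');
  }
});
With that in place, the lineStream.pipe(processRecordStream).pipe(out) chain above slows the file read whenever the output cannot keep up.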
