Stream uint8 through stdin/stdout in nodejs

I have two simple Node scripts that I'd like to pipe together in bash. I want to stream 2 integers from one script to the other. Something goes wrong at the point where an extra bit is needed, e.g. 127 can be expressed in 7 bits while 128 needs 8 bits, if I understand correctly. My guess is that it has something to do with the sign of the integer, i.e. plus or minus. I have specifically used writeUInt8 and readUInt8 for this reason, though...
Script in.js, sends 2 integers to stdout:
process.stdout.setEncoding('binary');
const buff1 = Buffer.alloc(1);
const buff2 = Buffer.alloc(1);
buff1.writeUInt8(127);
buff2.writeUInt8(128);
process.stdout.write(buff1);
process.stdout.write(buff2);
process.stdout.end();
Script out.js, reads from stdin and writes to stdout again:
process.stdin.setEncoding('binary');
process.stdin.on('data', function(data) {
  for (const uInt of data) {
    const v = Buffer.from(uInt).readUInt8();
    process.stdout.write(v + '\n');
  }
});
In bash I connect in and out:
$ node in.js | node out.js
Expected result:
127
128
Actual Result:
127
194

Setting the encoding to binary is messing up the data received in out.js.
According to the Readable Stream documentation of Node.js:
By default, no encoding is assigned and stream data will be returned
as Buffer objects.
I tested the code below and it works:
// out.js
process.stdin.on('data', function (data) {
  for (let i = 0; i < data.length; ++i) {
    const v = data.readUInt8(i);
    process.stdout.write(v + '\n');
  }
});
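For what it's worth, the sending side does not need an encoding either, since it already writes Buffers directly; a minimal sketch of in.js without the setEncoding call:
// in.js -- write the raw bytes; no encoding is needed when writing Buffers
const buff1 = Buffer.alloc(1);
const buff2 = Buffer.alloc(1);
buff1.writeUInt8(127);
buff2.writeUInt8(128);
process.stdout.write(buff1);
process.stdout.write(buff2);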

Handle huge amount of data in node.js through stdin

Old question title:
"node.js readline form net.Socket (process.stdin) cause error: heap out of memory (conversion of net.Socket Duplex to Readable stream)"
... I've changed it because nobody answered, and it seems like an important question in the Node.js ecosystem.
The question is how to solve the "heap out of memory" error when reading line by line from a huge stdin. The error does not happen when you dump stdout to a file (e.g. test.log) and feed the 'readline' interface through fs.createReadStream('test.log').
It looks like process.stdin is not the fully functional Readable stream that is described here:
https://nodejs.org/api/process.html#process_process_stdin
To reproduce the issue I've created two scripts. The first one just generates a huge amount of data (file a.js):
// a.js
// a loop in this form generates about 7.5G of data
// you can check yourself by running:
// node a.js > test.log && ls -lah test.log
// which will return
// -rw-r--r-- 1 sd staff 7.5G 31 Jan 22:29 test.log
for (let i = 0; i < 8000000; i += 1) {
  console.log(`${i} ${".".repeat(1000)}\n`);
}
The script that consumes it through a bash pipe with readline (file b.js):
// b.js
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: process.stdin, // doesn't work
  // input: fs.createReadStream('test.log'), // works
});

let s;

rl.on('line', line => {
  // deliberately commented out to demonstrate that the issue
  // has nothing to do with anything beyond readline and process.stdin
  // s = line.substring(0, 7);
  //
  // if (s === '100 ...' || s === '400 ...' || s === '7500000') {
  //   process.stdout.write(`${line}\n`);
  // }
});

rl.on('error', e => {
  console.log('general error', e);
});
Now when you run:
node a.js | node b.js
it fails with the error:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
but if you swap
const rl = readline.createInterface({
  input: process.stdin,
});
to
const rl = readline.createInterface({
  input: fs.createReadStream('test.log'),
});
and run
node a.js > test.log
node b.js
everything works fine
The problem actually comes down to this: how do you convert a net.Socket into a fully functional Readable stream, if that is possible at all?
Edit:
Basically my problem is that it seems impossible to handle a huge amount of data from stdin as a stream, which is the natural thing to do with Unix-style pipes. So despite the fact that Node.js is brilliant at handling streams, you can't write a program that handles a huge amount of data through Unix-style pipes.
In some cases it would be completely unnecessary to dump the data to disk and only then process it with fs.createReadStream('test.log'), were it not for this limitation.
I thought streams were all about handling huge amounts of data (among other use cases) on the fly, without saving it to disk.
You can always treat process.stdin as a normal Node.js stream and handle the reading yourself:
function onReadLine(line) {
  // do stuff with line
  console.info(line);
}

// read input and split into lines
// (splitting on '\n' and trimming a trailing '\r' also covers Windows
// line endings; comparing a single character against os.EOL would never
// match on Windows, where os.EOL is the two-character '\r\n')
let BUFF = '';
process.stdin.on('data', (buff) => {
  const content = buff.toString('utf-8');
  for (let i = 0; i < content.length; i++) {
    if (content[i] === '\n') {
      onReadLine(BUFF.endsWith('\r') ? BUFF.slice(0, -1) : BUFF);
      BUFF = '';
    } else {
      BUFF += content[i];
    }
  }
});

// flush the last line
process.stdin.on('end', () => {
  if (BUFF.length > 0) {
    onReadLine(BUFF);
  }
});
Example:
// unix
cat ./somefile.txt | node ./script.js
// windows
Start-Process -FilePath "node" -ArgumentList @(".\script.js") -RedirectStandardInput .\somefile.txt -NoNewWindow -Wait
The problem is not the input data size, and not Node, but a faulty design of your data generator: it does not pause and resume data generation when the consuming output stream asks it to. Instead of just pushing data with console.log(...), you should interact with the standard output stream correctly and handle the pause and resume signals coming from that stream.
The file input stream created by fs.createReadStream() is implemented properly: it pauses and resumes as necessary, and therefore does not crash the code.
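To illustrate the point, here is a rough sketch (my own, not the answerer's code) of a generator that honours backpressure: it checks the return value of process.stdout.write() and waits for the 'drain' event before producing more data, using the same line count and padding as a.js:
// a-backpressure.js -- sketch of a generator that respects backpressure
const TOTAL = 8000000;
let i = 0;

function produce() {
  while (i < TOTAL) {
    const ok = process.stdout.write(`${i} ${".".repeat(1000)}\n`);
    i += 1;
    if (!ok) {
      // the internal buffer is full; resume once the consumer catches up
      process.stdout.once('drain', produce);
      return;
    }
  }
}

produce();
With this version, `node a-backpressure.js | node b.js` should stay within a bounded memory footprint, because the producer only writes as fast as the pipe is drained.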

Using node.js to execute 'tracert' in cmd, but my output is not consistent

I am trying to write a node.js script that executes tracert in cmd, and I would like to parse the output of tracert so I can use it in Node. My issue is that the output I am receiving does not arrive consistently.
const { spawn } = require('child_process');

let argument = process.argv[2]; // what the user enters as the first argument
const command = spawn(process.env.comspec, ['/c', 'tracert', argument]);

command.stdout.on('data', (data) => {
  console.log(`stdout: ${data}`);
});
tracert outputs in this format on each line, for each hop.
1 5 ms 6 ms 4 ms 192.168.1.1
The expected output of the console.log SHOULD be:
stdout: 1
stdout: 5 ms
stdout: 6 ms
stdout: 4 ms
stdout: 192.168.1.1
and indeed that is what happens about 90% of the time, but sometimes the data comes in so fast that some lines arrive together, like so:
stdout: 1
stdout: 5 ms
stdout: 6 ms 4 ms
stdout: 192.168.1.1
I don't want this to happen. I want each incoming data chunk to contain only "one element" from each "column".
Unfortunately, you can't force tracert to send data column by column. Moreover, you cannot stop the OS from "joining" columns in the pipe buffer when your process is not fast enough to read them.
The only possible solution is to wait until a whole line is ready and only then parse it. Something like this (warning: this is only a sketch):
let buffer = '';
command.stdout.on('data', (data) => {
  buffer += data;
  // loop, because a single chunk can contain several lines
  while (buffer.indexOf('\n') !== -1) {
    const idx = buffer.indexOf('\n');
    const line = buffer.substring(0, idx).trim(); // one complete line, extra whitespace trimmed
    buffer = buffer.substring(idx + 1);           // rest of the buffer, usually an empty string
    // now parse the line with a regexp or something similar
    const match = line.match(/(\d+)\s+(\d+) ms\s+(\d+) ms\s+(\d+) ms\s+(\S+)/);
    if (match) {
      // use match[1]..match[5] here (hop number, the three round-trip times, the address)
    }
  }
});
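An alternative sketch, assuming the same spawn call as in the question, is to let the readline module do the line buffering on the child's stdout:
// sketch: readline assembles complete lines from the child's stdout
const { spawn } = require('child_process');
const readline = require('readline');

const argument = process.argv[2];
const command = spawn(process.env.comspec, ['/c', 'tracert', argument]);

const rl = readline.createInterface({ input: command.stdout });
rl.on('line', (line) => {
  // each 'line' event carries one complete tracert row, e.g.
  //   1     5 ms     6 ms     4 ms  192.168.1.1
  console.log(`line: ${line.trim()}`);
});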

NODEJS: Uncork() method on writable stream doesn't really flush the data

I am writing a quite simple application to transform data: read one file and write to another. The files are relatively large, about 2 GB. However, what I found is that the flush to the file system is not happening on each cork/uncork cycle; it only happens on end(), so end() basically hangs the process until everything is flushed.
I simplified the example so it just writes a line to the stream a lot of times.
var PREFIX = 'E:\\TEST\\';
var line = 'AA 11 999999999 20160101 123456 20160101 AAA 00 00 00 0 0 0 2 2 0 0 20160101 0 00';
var fileSystem = require('fs');

function writeStrings() {
    var stringsCount = 0;
    var stream = fileSystem.createWriteStream(PREFIX + 'output.txt');
    stream.once('drain', function () {
        console.log("drained");
    });
    stream.once('open', function (fileDescriptor) {
        var started = false;
        console.log('writing file ');
        stream.cork();
        for (var i = 0; i < 2000000; i++) {
            stream.write(line + i);
            if (i % 10000 == 0) {
                // console.log('passed ', i);
            }
            if (i % 100000 == 0) {
                console.log('uncorked ', i, stream._writableState.writing);
                stream.uncork();
                stream.cork();
            }
        }
        stream.end();
    });
    stream.once('finish', function () {
        console.log("done");
    });
}
writeStrings();
Going inside Node's _stream_writable.js, I found that it flushes the buffer only on this condition:
if (!state.writing &&
    !state.corked &&
    !state.finished &&
    !state.bufferProcessing &&
    state.buffer.length)
  clearBuffer(this, state);
and, as you can see from the example, the writing flag is not set back to false after the first uncork(), which prevents uncork() from flushing.
Also, I don't see any drain events firing at all. Playing with highWaterMark doesn't help (actually it doesn't seem to have an effect on anything). Manually setting writing to false (plus some other flags) did help, but that is surely wrong.
Am I misunderstanding the concept here?
From the Node.js documentation I found that the number of uncork() calls should match the number of cork() calls. I am not seeing a matching stream.uncork() call for the stream.cork() that is called before the for loop. That might be the issue.
Looking at a guide on nodejs.org, you aren't supposed to call stream.uncork() twice in the same event loop. Here is an excerpt:
// Using .uncork() twice here makes two calls on the C++ layer, rendering the
// cork/uncork technique useless.
ws.cork();
ws.write('hello ');
ws.write('world ');
ws.uncork();
ws.cork();
ws.write('from ');
ws.write('Matteo');
ws.uncork();
// The correct way to write this is to utilize process.nextTick(), which fires
// on the next event loop.
ws.cork();
ws.write('hello ');
ws.write('world ');
process.nextTick(doUncork, ws);
ws.cork();
ws.write('from ');
ws.write('Matteo');
process.nextTick(doUncork, ws);
// as a global function
function doUncork(stream) {
  stream.uncork();
}
.cork() can be called as many times as we want; we just need to be careful to call .uncork() the same number of times to make it flow again.
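For the original use case, an alternative sketch (my own, not from either answer) skips cork()/uncork() entirely and simply honours write()'s return value together with the 'drain' event, reusing the file path and line from the question:
// sketch: write two million lines while honouring backpressure
var fileSystem = require('fs');

var line = 'AA 11 999999999 20160101 123456 20160101 AAA 00 00 00 0 0 0 2 2 0 0 20160101 0 00';
var stream = fileSystem.createWriteStream('E:\\TEST\\output.txt');
var total = 2000000;
var i = 0;

function writeChunk() {
    var ok = true;
    while (i < total && ok) {
        ok = stream.write(line + i + '\n');
        i += 1;
    }
    if (i < total) {
        // the internal buffer is above the highWaterMark; continue after it drains
        stream.once('drain', writeChunk);
    } else {
        stream.end();
    }
}

stream.once('finish', function () { console.log('done'); });
writeChunk();
This way the data is flushed to disk continuously as the buffer drains, instead of piling up until end().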

From node.js, which is faster, shell grep or fs.readFile?

I have a long-running node.js process and I need to scan a log file for a pattern. I have at least two obvious choices: spawn a grep process, or read the file using fs.read* and parse the buffer/stream in node.js. I haven't found a comparison of the two methods on the intarwebs. My question is twofold:
which is faster?
why might I prefer one technique over the other?
Here's my Node.js implementation; the results are pretty much as expected: small files run faster than a forked grep (files up to 2-3k short lines), large files run slower. The larger the file, the bigger the difference. (And perhaps the more complex the regex, the smaller the difference -- see below.)
I used my own qfgets package for fast line-at-a-time file I/O; there may be better ones out there, I don't know.
I saw an unexpected anomaly that I did not investigate: the timings below are for the constant string regexp /foobar/. When I changed it to /[f][o][o][b][a][r]/ to actually exercise the regex engine, grep slowed down 3x, while node sped up! The 3x slowdown of grep is reproducible on the command line.
filename = "/var/log/apache2/access.log";  // 2,540,034 lines, 187MB
//filename = "/var/log/messages";          // 25,703 lines, 2.5MB
//filename = "out";                        // 2000 lines, 188K (head -2000 access.log)
//filename = "/etc/motd";                  // 7 lines, 286B
regexp = /foobar/;

child_process = require('child_process');
qfgets = require('qfgets');

function grepWithFs( filename, regexp, done ) {
    fp = new qfgets(filename, "r");
    function loop() {
        for (i=0; i<40; i++) {
            line = fp.fgets();
            if (line && line.match(regexp)) process.stdout.write(line);
        }
        if (!fp.feof()) setImmediate(loop);
        else done();
    }
    loop();
}

function grepWithFork( filename, regexp, done ) {
    cmd = "egrep '" + regexp.toString().slice(1, -1) + "' " + filename;
    child_process.exec(cmd, {maxBuffer: 200000000}, function(err, stdout, stderr) {
        process.stdout.write(stdout);
        done(err);
    });
}
The test:
function fptime() { t = process.hrtime(); return t[0] + t[1]*1e-9 }

t1 = fptime();
if (0) {
    grepWithFs(filename, regexp, function(){
        console.log("fs done", fptime() - t1);
    });
}
else {
    grepWithFork(filename, regexp, function(err){
        console.log("fork done", fptime() - t1);
    });
}
Results:
/**
results (all file contents memory resident, no disk i/o):
times in seconds, best run out of 5

                 fork      fs        fs/fork
/foobar/
  motd          .00876    .00358     0.41 x         7 lines
  out           .00922    .00772     0.84 x      2000 lines
  messages      .0101     .0335      3.32 x      25.7 k lines
  access.log    .1367    1.032       7.55 x      2.54 m lines

/[f][o][o][b][a][r]/
  access.log    .4244     .8348      1.97 x      2.54 m lines
**/
(The above code was all one file; I split it up to avoid the scrollbar.)
Edit: to highlight the key results:
185MB, 2.54 million lines, search RegExp /[f][o][o][b][a][r]/:
grepWithFs
elapsed: .83 sec
grepWithFork
elapsed: .42 sec
To answer this question, I wrote this little program.
#!/usr/local/bin/node

'use strict';

const fs = require('fs');
const log = '/var/log/maillog';
const fsOpts = { flag: 'r', encoding: 'utf8' };
const wantsRe = new RegExp(process.argv[2]);

function handleResults (err, data) {
    console.log(data);
}

function grepWithFs (file, done) {
    fs.readFile(file, fsOpts, function (err, data) {
        if (err) throw (err);
        let res = '';
        data.toString().split(/\n/).forEach(function (line) {
            if (wantsRe && !wantsRe.test(line)) return;
            res += line + '\n';
        });
        done(null, res);
    });
}

function grepWithShell (file, done) {
    const spawn = require('child_process').spawn;
    let res = '';
    const child = spawn('grep', [ '-e', process.argv[2], file ]);
    child.stdout.on('data', function (buffer) { res += buffer.toString(); });
    child.stdout.on('end', function () { done(null, res); });
}

for (let i = 0; i < 10; i++) {
    // grepWithFs(log, handleResults);
    grepWithShell(log, handleResults);
}
Then I alternately ran both functions inside a loop 10x and measured the time it took them to grep the result from a log file that's representative of my use case:
$ ls -alh /var/log/maillog
-rw-r--r-- 1 root wheel 37M Feb 8 16:44 /var/log/maillog
The file system is a pair of mirrored SSDs which are generally quick enough that they aren't the bottleneck. Here are the results:
grepWithShell
$ time node logreader.js 3E-4C03-86DD-FB6EF
real 0m0.238s
user 0m0.181s
sys 0m1.550s
grepWithFs
$ time node logreader.js 3E-4C03-86DD-FB6EF
real 0m6.599s
user 0m5.710s
sys 0m1.751s
The difference is huge. Using a shell grep process is dramatically faster. As Andras points out, Node's I/O can be tricky, and I didn't try any other fs.read* methods. If there's a better way, please do point it out (preferably with a similar test scenario and results).
Forking a grep is simpler and quicker, and grep would most likely run faster and use less CPU. Although fork has a moderately high overhead (much more than opening a file), you would only fork once and stream the results. Plus, it can be tricky to get good performance out of Node's file I/O.
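A rough sketch of that approach (my own illustration, not the answerer's code): spawn grep and stream its matches line by line instead of buffering the whole output, reusing the /var/log/maillog path from the earlier answer.
// sketch: spawn grep and stream matching lines as they arrive
const { spawn } = require('child_process');
const readline = require('readline');

const file = '/var/log/maillog';   // log file from the earlier answer
const pattern = process.argv[2];   // search pattern supplied on the command line

const child = spawn('grep', ['-e', pattern, file]);
const rl = readline.createInterface({ input: child.stdout });

rl.on('line', (line) => {
  // handle each matching line as soon as it is available
  process.stdout.write(line + '\n');
});

child.on('close', (code) => {
  // grep exits with 1 when nothing matched and >1 on real errors
  if (code > 1) console.error('grep exited with code ' + code);
});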

Is there a Node.js console.log length limit?

Is there a limit to the length of console.log output in Node.js? The following prints numbers up to 56462, then stops. This came up because we were returning datasets from MySQL and the output would just quit after 327k characters.
var out = "";
for (var i = 0; i < 100000; i++) {
    out += " " + i;
}
console.log(out);
The string itself seems fine, as this returns the last few numbers up to 99999:
console.log(out.substring(out.length - 23));
Returns:
99996 99997 99998 99999
This is using Node v0.6.14.
Have you tried writing that much on a machine with more memory?
According to the Node source code, console writes into a stream: https://github.com/joyent/node/blob/cfcb1de130867197cbc9c6012b7e84e08e53d032/lib/console.js#L55
And streams may buffer the data in memory: http://nodejs.org/api/stream.html#stream_writable_write_chunk_encoding_callback
So if you put a really large amount of data into a stream, you may hit the memory ceiling.
I'd recommend you split up your data and feed it into the process.stdout.write method; here's an example: http://nodejs.org/api/stream.html#stream_event_drain
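A minimal sketch of that suggestion (my own, with an arbitrary chunk size): split the large string into chunks and pause on backpressure, resuming on 'drain':
// sketch: write a large string to stdout in chunks, waiting for 'drain'
// whenever the internal buffer fills up
function writeBig(str, chunkSize) {
    chunkSize = chunkSize || 64 * 1024;
    var offset = 0;
    function writeMore() {
        while (offset < str.length) {
            var ok = process.stdout.write(str.slice(offset, offset + chunkSize));
            offset += chunkSize;
            if (!ok) {
                // buffer full; resume once stdout has drained
                process.stdout.once('drain', writeMore);
                return;
            }
        }
    }
    writeMore();
}

writeBig(out); // `out` being the large string built in the question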
I would recommend writing the output to a file when using Node > 6.0:
const fs = require('fs');
const { Console } = require('console');

const output = fs.createWriteStream('./stdout.log');
const errorOutput = fs.createWriteStream('./stderr.log');

// custom simple logger
const logger = new Console(output, errorOutput);

// use it like console
var count = 5;
logger.log('count: %d', count);
// in stdout.log: count 5
