child_process.execFile() without buffering - node.js

I'm using Node's child_process.execFile() to start and communicate with a process that puts all of its output into its standard output and error streams. The process runs for an indeterminate amount of time and may theoretically generate any amount of output, i.e.:
const { execFile } = require('child_process');

const child = execFile('path/to/executable', [], { encoding: 'buffer' });

child.stdout.on('data', (chunk) => {
  doSomethingWith(chunk);
});

child.stderr.on('data', (chunk) => {
  renderLogMessage(chunk);
});
Notice that I'm not using the last argument to execFile() because I never need an aggregated view of all data that ever came out of either of those streams. Despite this omission, Node appears to be buffering the output anyway and I can reliably make the process end with the SIGTERM signal just by giving it enough input for it to generate a large amount of output. That is problematic because the process is stateful and cannot simply be restarted periodically.
How can I alter or work around this behavior?

You don't want to use execFile, which will wait for the child process to exit before "returning" (by calling the callback that you're not passing).
The documentation for execFile also describes why your child process is being terminated:
maxBuffer <number> Largest amount of data in bytes allowed on stdout or stderr. (Default: 200*1024) If exceeded, the child process is terminated.
For long-running processes for which you want to incrementally read stdout/stderr, use child_process.spawn().
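In sketch form, switching to spawn() while keeping the handlers from the question might look like this (the executable path and the doSomethingWith/renderLogMessage handlers are the question's placeholders):

const { spawn } = require('child_process');

// spawn() streams stdout/stderr incrementally and imposes no maxBuffer limit.
const child = spawn('path/to/executable', []);

child.stdout.on('data', (chunk) => {
  doSomethingWith(chunk);   // chunk is a Buffer by default
});

child.stderr.on('data', (chunk) => {
  renderLogMessage(chunk);
});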


Node.js - process.exit() vs childProcess.kill()

I have a node application that runs long running tasks so whenever a task runs a child process is forked to run the task. The code creates a fork for the task to be run and sends a message to the child process to start.
Originally, when the task was complete, I was sending a message back to the parent process and the parent process would call .kill() on the child process. I noticed in my activity monitor that the node processes weren't being removed. All the child processes were hanging around. So, instead of sending a message to the parent and calling .kill(), I called process.exit() in the child process code once the task was complete.
The second approach seems to work fine and I see the node processes being removed from the activity monitor but I'm wondering if there is a downside to this approach that I don't know about. Is one method better than the other? What's the difference between the 2 methods?
My code looks like this for the messaging approach.
// Parent process
const { fork } = require('child_process');

const forked = fork('./app/jobs/onlineConcurrency.js');

forked.send({
  clientId: clientData.clientId,
  schoolYear: schoolYear
});

forked.on("message", (msg) => {
  console.log("message", msg);
  forked.kill();
});

// Child process
process.on('message', (data) => {
  console.log("Message received");
  onlineConcurrencyJob(data.clientId, data.schoolYear, function() {
    console.log("Killing process");
    process.send("done");
  });
});
The code looks like this for the child process when just exiting
// Child process
process.on('message', (data) => {
  console.log("Message received");
  onlineConcurrencyJob(data.clientId, data.schoolYear, function() {
    console.log("Killing process");
    process.exit();
  });
});
kill sends a signal to the child process. Without an argument, it sends a SIGTERM (where TERM is short for "termination"), which typically, as the name suggests, terminates the process.
However, sending a signal like that is a forceful way of stopping a process. If the process is in the middle of a task such as writing to a file when it receives a termination signal, the file may end up corrupted because the process never gets a chance to write the remaining data and close the file. (There are mitigations for this, like installing a signal handler that "catches" the signal and either ignores it or finishes all pending tasks before exiting, but that requires explicit code to be added to the child process.)
With process.exit(), on the other hand, the process exits itself, and typically it does so at a point where it knows there are no more pending tasks, so it can exit cleanly. Generally speaking, this is the best way to stop a (child) process.
As for why the processes aren't being removed, I'm not sure. It could be that the parent process isn't cleaning up the resources for the child processes, but I would expect that to happen automatically (I don't even think you can perform so-called "child reaping" explicitly in Node.js).
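Regarding the signal-handler mitigation mentioned above, a child process could install a handler along these lines to finish its current work before exiting (a minimal sketch; finishPendingWork() is a hypothetical helper):

// In the child process: catch SIGTERM and shut down cleanly
// instead of being terminated mid-write.
process.on('SIGTERM', () => {
  finishPendingWork(() => {  // hypothetical: flush buffers, close files, etc.
    process.exit(0);
  });
});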
Calling process.exit(0) is the best mechanism, though there are cases where you might want to .kill from the parent (e.g. a distributed search where one node returning means all nodes can stop).
.kill is probably failing due to some handling of the signal it is getting. Try .kill('SIGTERM'), or even 'SIGKILL'.
Also note that subprocesses which aren't killed when the parent process exits will be moved to the grandparent process. See here for more info and a proposed workaround: https://github.com/nodejs/node/issues/13538
In summary, this is default Unix behavior, and the workaround is to call process.on("exit", () => child.kill()) in the parent.
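A minimal sketch of that workaround, reusing the fork from the question:

const { fork } = require('child_process');

const child = fork('./app/jobs/onlineConcurrency.js');

// Kill the child when the parent exits so it is not reparented
// and left running under the grandparent process.
process.on('exit', () => child.kill());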

Highest performance way to fork/spawn many node.js processes from parent process

I am using Node.js to spawn upwards of 100 child processes, maybe even 1000. What concerns me is that the parent process could become some sort of bottleneck if all the stdout/stderr of the child processes has to go through the parent process in order to get logged somewhere.
So my assumption is that in order to achieve highest performance/throughput, we should ignore stdout/stderr in the parent process, like so:
const cp = require('child_process');

items.forEach(function(exec){
  const n = cp.spawn('node', [exec], {
    stdio: ['ignore', 'ignore', 'ignore', 'ipc']
  });
});
My question is, how much of a performance penalty is it to use pipe in this manner:
// (100+ items to iterate over)
items.forEach(function(exec){
  const n = cp.spawn('node', [exec], {
    stdio: ['ignore', 'pipe', 'pipe', 'ipc']
  });
});
such that stdout and stderr are piped to the parent process? I assume the performance penalty could be drastic, especially if we handle stdout/stderr in the parent process like so:
// (100+ items to iterate over)
items.forEach(function(exec){
  const n = cp.spawn('node', [exec], {
    stdio: ['ignore', 'pipe', 'pipe', 'ipc']
  });
  n.stdout.setEncoding('utf8');
  n.stderr.setEncoding('utf8');
  n.stdout.on('data', function(d){
    // do something with the data
  });
  n.stderr.on('data', function(d){
    // do something with the data
  });
});
My assumptions are:
I assume that using 'ignore' for stdout and stderr in the parent process is more performant than piping stdout/stderr to the parent process.
I assume that streaming stdout/stderr to a file, like so:
stdio: ['ignore', fs.openSync('/some/file.log', 'a'), fs.openSync('/some/file.log', 'a'), 'ipc']
is almost as performant as using 'ignore' for stdout/stderr (which should send stdout/stderr to /dev/null).
Are these assumptions correct or not? With regard to stdout/stderr, how can I achieve highest performance, if I want to log the stdout/stderr somewhere (not to /dev/null)?
Note: this is for a library, so the amount of stdout/stderr could vary quite a bit. Also, it will most likely rarely fork more processes than there are cores, running at most about 15 processes simultaneously.
You have the following options:
you can have the child process ignore stdout/stderr entirely and do its own logging by any other means (logging to a file, syslog, ...)
if you're logging the output of your parent process, you can set stdout/stderr to process.stdout and process.stderr respectively; the child's output then goes to the same destination as the main process's output, and nothing flows through the main process
you can pass file descriptors directly, so the output of the child process goes to the given files without going through the parent process (both of these are sketched below)
however, if you don't have any control over the child processes AND you need to do something with the logs (filter them, prefix them with the associated child process, etc.), then you probably need to pipe them through the parent process.
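A minimal sketch of the second and third options, assuming a hypothetical ./worker.js child script and log file path:

const cp = require('child_process');
const fs = require('fs');

// Option 2: the child writes straight to the parent's stdout/stderr;
// nothing flows through the parent's event loop.
const inherited = cp.spawn('node', ['./worker.js'], {
  stdio: ['ignore', process.stdout, process.stderr, 'ipc']
});

// Option 3: the child writes straight to a file descriptor opened by the parent.
const fd = fs.openSync('/some/file.log', 'a');
const toFile = cp.spawn('node', ['./worker.js'], {
  stdio: ['ignore', fd, fd, 'ipc']
});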
As we have no idea of the volume of logs you're talking about, we can't tell whether this is critical or just premature optimisation. Node.js being asynchronous, I wouldn't expect your parent process to become a bottleneck unless it's really busy and you have lots of logs.
Are these assumptions correct or not?
how can I achieve highest performance?
Test it. That's how you can achieve the highest performance. Test on the same type of system you will use in production, with the same number of CPUs and similar disks (SSD or HDD).
I assume your concern is that the children might become blocked if the parent does not read quickly enough. That is a potential problem, depending on the buffer size of the pipe and how much data flows through it. However, if the alternative is to have each child process write to disk independently, this could be better, the same, or worse. We don't know for a whole bunch of reasons, starting with the fact that we have no idea how many cores you have, how quickly your processes produce data, and what I/O subsystem you're writing to.
If you have a single SSD you might be able to write 500 MB per second. That's great, but if that SSD is 512 GB in size, you'll only last 16 minutes before it is full! You'll need to narrow down the problem space a lot more before anyone can know what's the most efficient approach.
If your goal is simply to get logged data off the machine with as little system utilization as possible, your best bet is to directly write your log messages to the network.

Are the extra stdio streams in node.js child_process.spawn blocking?

When creating a child process using spawn(), you can pass options to create multiple streams via the options.stdio argument. After the standard three (stdin, stdout, stderr) you can pass extra streams and pipes, which become file descriptors in the child process. You can then use fs.createReadStream/fs.createWriteStream to access them.
See http://nodejs.org/api/child_process.html#child_process_child_process_spawn_command_args_options
var child_process = require('child_process');

var opts = {
  stdio: [process.stdin, process.stdout, process.stderr, 'pipe']
};
var child = child_process.spawn('node', ['./child.js'], opts);
But the docs are not really clear on whether these pipes are blocking. I know stdin/stdout/stderr are blocking, but what about the extra 'pipe' entries?
In one part they say:
"Please note that the send() method on both the parent and child are
synchronous - sending large chunks of data is not advised (pipes can
be used instead, see child_process.spawn"
But elsewhere they say:
process.stderr and process.stdout are unlike other streams in Node in
that writes to them are usually blocking.
They are blocking in the case that they refer to regular files or TTY file descriptors.
In the case they refer to pipes:
They are blocking in Linux/Unix.
They are non-blocking like other streams in Windows.
Can anybody clarify this? Are pipes blocking on Linux?
I need to transfer large amounts of data without blocking my worker processes.
Related:
How to send huge amounts of data from child process to parent process in a non-blocking way in Node.js?
How to transfer/stream big data from/to child processes in node.js without using the blocking stdio?

Reading stdout of child process unbuffered

I'm trying to read the output of a Python script launched by Node.js as it arrives. However, I only get access to the data once the process has finished.
var spawn = require('child_process').spawn;

var proc, args;

args = [
  './bin/build_map.py',
  '--min_lon', opts.sw.lng,
  '--max_lon', opts.ne.lng,
  '--min_lat', opts.sw.lat,
  '--max_lat', opts.ne.lat,
  '--city', opts.city
];

proc = spawn('python', args);

proc.stdout.on('data', function (buf) {
  console.log(buf.toString());
  socket.emit('map-creation-response', buf.toString());
});
If I launch the process with { stdio: 'inherit' } I can see the output as it happens directly in the console, but handling it with the proc.stdout.on('data', ...) listener shown above does not work.
How do I make sure I can read the output from the child process as it arrives and direct it somewhere else?
The process doing the buffering is Python: it detects that its output has been redirected to a pipe rather than a terminal, so it buffers instead of writing immediately. You can easily tell Python not to do this: just run "python -u" instead of "python". It should be as easy as that.
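In terms of the code from the question, that means inserting the -u flag before the existing args array (which already starts with the script path):

var spawn = require('child_process').spawn;

// '-u' forces Python to leave stdout/stderr unbuffered, so output reaches
// the Node.js 'data' handler as soon as the script produces it.
var proc = spawn('python', ['-u'].concat(args));

proc.stdout.on('data', function (buf) {
  console.log(buf.toString());
});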
When a process is spawned by child_process.spawn(), the streams connected to the child process's standard output and standard error are actually unbuffered on the Nodejs side. To illustrate this, consider the following program:
const spawn = require('child_process').spawn;

var proc = spawn('bash', [
  '-c',
  'for i in $(seq 1 80); do echo -n .; sleep 1; done'
]);

proc.stdout
  .on('data', function (b) {
    process.stdout.write(b);
  })
  .on('close', function () {
    process.stdout.write("\n");
  });
This program runs bash and has it emit . characters every second for 80 seconds, while consuming this child process's standard output via data events. You should notice that the dots are emitted by the Node program every second, helping to confirm that buffering does not occur on the Nodejs side.
Also, as explained in the Nodejs documentation on child_process:
By default, pipes for stdin, stdout and stderr are established between
the parent Node.js process and the spawned child. It is possible to
stream data through these pipes in a non-blocking way. Note, however,
that some programs use line-buffered I/O internally. While that does
not affect Node.js, it can mean that data sent to the child process
may not be immediately consumed.
You may want to confirm that your Python program does not buffer its output. If you feel you're emitting data from your Python program as separate distinct writes to standard output, consider running sys.stdout.flush() following each write to suggest that Python should actually write data instead of trying to buffer it.
Update: In this commit that passage from the Nodejs documentation was removed for the following reason:
doc: remove confusing note about child process stdio
It’s not obvious what the paragraph is supposed to say. In particular,
whether and what kind of buffering mechanism a process uses for its
stdio streams does not affect that, in general, no guarantees can be
made about when it consumes data that was sent to it.
This suggests that there could be buffering at play before the Nodejs process receives data. In spite of this, care should be taken to ensure that processes within your control upstream of Nodejs are not buffering their output.

Is there a race consuming stdout from child_process.spawn in node.js?

I'm using node.js to spawn a child process and consume its output, using the child.stdout.on('data') hook.
child.stdout is itself a stream, and over on the streams page I notice a fat warning informing me that if no handler is registered when a data event arrives, then the data is dropped on the floor.
There is a brief moment between the spawn of my child process, and the registration of my stdout handler. Is there a race condition here? I don't want my first read to be lost.
It is reasonable to guess that the pipe between parent and child would buffer, but does node guarantee it?
You have to call child.stdout.on('data', ...) later in the same turn (i.e., in the same synchronous sequence of statements) as the child_process.spawn(...) call to ensure that the listener is in place when the I/O operations are next handled. Alternatively, you can pause the stream with child.stdout.pause() in that same turn and resume it in a later turn after adding the listener, if the listener really has to be installed later.
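A minimal sketch of the pause/resume approach, assuming a hypothetical ./worker.js child script:

const { spawn } = require('child_process');

const child = spawn('node', ['./worker.js']);

// Pause in the same turn as spawn(), so no 'data' events can be emitted yet.
child.stdout.pause();

setImmediate(() => {
  // Attach the listener in a later turn, then resume the stream;
  // nothing was dropped while the stream was paused.
  child.stdout.on('data', (chunk) => {
    console.log('chunk:', chunk.toString());
  });
  child.stdout.resume();
});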
