pm2-runtime inside docker container receives SIGTERM - why?

I'm running a virtual machine with docker, which implements our CI/CD infrastructure.
The docker-compose setup has an nginx reverse proxy and one other service. Essentially, this container's start command is a shell script, which creates local copies of files from a central repository. The shell script then starts (by means of yarn start) a Node.js script that selects a couple of services and creates a pm2 application startup JSON file.
Finally, pm2-runtime is launched with this application definition file. This is done by
const { exec } = require("child_process");

const child = exec("pm2-runtime build/pm2startup.json");
child.stdout.on("data", data => { process.stdout.write(data); });
child.stderr.on("data", data => { process.stderr.write(data); });
child.on("close", (code, signal) => {
  process.stdout.write(`pm2-runtime process closed with code ${code}, signal ${signal}\n`);
});
child.on("error", error => {
  process.stderr.write(`pm2-runtime process error ${error}\n`);
});
child.on("exit", (code, signal) => {
  process.stdout.write(`pm2-runtime process exited with code ${code}, signal ${signal}\n`);
});
There are about 10 apps managed by pm2, and according to docker stats the container's memory consumption is greater than 850 MB. However, I have not set any explicit memory limit anywhere, and I cannot find an implicit one either.
Every now and then the container of services is restarted. According to the dockerd logs, its task has exited. That's true: the pm2-runtime process (see above) is reported to have been closed because of SIGTERM.
And that's the only message I get related to this. No other pm2 message, no service message, no docker event.
Now I'm seeking advice on how to find the cause of this SIGTERM, because I'm running out of ideas.

As it turned out, it was indeed the snippet inside the question that caused the problem.
pm2startup.json references long-running apps. Over time, depending on usage, they produce quite a lot of output on stdout and/or stderr. At some point the buffer that exec keeps for the child's output (maxBuffer, 1 MB by default in current Node versions) fills up, exec kills the child with its default killSignal of SIGTERM, and the node process that runs pm2-runtime stops. Unfortunately, it stops without any hint about the reason for the crash. But that's another story.
The solution in my case was to do without exec or execFile and use spawn instead, with the stdio option { stdio: "inherit" } (or the verbose form ["inherit", "inherit", "inherit"]).
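A minimal sketch of that change, assuming the same pm2startup.json path as in the snippet above:

const { spawn } = require("child_process");

// "inherit" streams the child's stdout/stderr straight through to this
// process instead of buffering them, so exec's maxBuffer limit never applies.
const child = spawn("pm2-runtime", ["build/pm2startup.json"], { stdio: "inherit" });

child.on("exit", (code, signal) => {
  process.stdout.write(`pm2-runtime process exited with code ${code}, signal ${signal}\n`);
});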

Related

Node.js process doesn't exit when run under pm2

I have a node.js script that runs and exits fine in the console, but under pm2 it doesn't exit unless I call process.exit(). The PM2 config is:
{
name: "worker",
script: "./worker.js",
restart_delay: 60000,
out_file: "/tmp/worker.log",
error_file: "/tmp/worker_err.log"
},
I've installed why-is-node-running to see what keeps the process running 10 seconds after the expected exit, and the output is:
There are 9 handle(s) keeping the process running
# TLSWRAP
node:internal/async_hooks:200
# TLSWRAP
node:internal/async_hooks:200
# ZLIB
node:internal/async_hooks:200
/Users/r/code/app/node_modules/decompress-response/index.js:43 - const decompressStream = isBrotli ? zlib.createBrotliDecompress() : zlib.createUnzip();
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:586
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:768
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:786
# TLSWRAP
node:internal/async_hooks:200
# ZLIB
node:internal/async_hooks:200
/Users/r/code/app/node_modules/decompress-response/index.js:43 - const decompressStream = isBrotli ? zlib.createBrotliDecompress() : zlib.createUnzip();
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:586
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:768
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:786
# TLSWRAP
node:internal/async_hooks:200
# ZLIB
node:internal/async_hooks:200
/Users/r/code/app/node_modules/decompress-response/index.js:43 - const decompressStream = isBrotli ? zlib.createBrotliDecompress() : zlib.createUnzip();
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:586
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:768
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:786
# TLSWRAP
node:internal/async_hooks:200
# Timeout
node:internal/async_hooks:200
node:internal/async_hooks:468
node:internal/timers:162
node:internal/timers:196
file:///Users/r/code/app/worker.js:65
node:internal/process/task_queues:94
Why doesn't node exit? How do I further debug this?
PS: Sorry for a large paste
UPDATE
I've managed to reproduce this in a comically small 2-liner:
import got from "got";
await got.post('https://anty-api.com/browser_profiles', {form: {a: 123}}).json();
The above code throws as expected when run from the console, yet keeps running forever when called by pm2.
UPDATE 2
It does reproduce with an empty app file too.
I think this is just the way pm2 works. You can expect that, when running under pm2, the node process will continue to run forever (whether your app is responsible for pending async event sources or not) unless you either crash or do something to explicitly terminate it, such as process.exit().
As you've discovered, this has nothing to do with any code in your app.js. Even an empty app.js exhibits this behaviour. This is a fundamental design aspect of pm2: it wraps your program, and it's the wrapper that keeps the node process alive.
This is because pm2 runs your program (in forked mode, as opposed to cluster mode) by launching a node process that runs ProcessContainerFork.js (the wrapper). This module establishes and maintains a connection to pm2's managing process (a.k.a "god daemon") and loads your app's main module with require('module')._load(...). The communication channel will always count as an event source that keeps the actual node process alive.
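You can reproduce the effect of that open channel with a hedged two-liner (empty.js here stands in for a module that does nothing):

const { fork } = require("child_process");

// fork() always opens an IPC channel between parent and child; as long as
// it stays connected, both processes are kept alive even with nothing to do.
const child = fork("./empty.js");
// child.disconnect() would close the channel and let both exit naturally.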
Even if your program does nothing, the status of your program will be "online". Even if your program reaches the state where, had it been launched directly, node would have exited, the state is still "online" in this case because of the wrapper.
This leaves the designers of pm2 with the challenge of knowing whether your program is no longer responsible for any events (in which case node would normally exit). pm2 has no way to distinguish event sources that keep node alive because of code in your app.js from those created by the infrastructure established by ProcessContainerFork.js. One could certainly imagine pm2 using async_hooks to track event sources originating from your app rather than from ProcessContainerFork.js (much like why-is-node-running does), and then tearing down properly once none remain. Perhaps pm2 chooses not to do this to avoid the performance penalty associated with async hooks? Perhaps an app that exits on purpose but is intended to be restarted seems too much like a cron job? I'm speculating that yours is not the primary use case for pm2. I suppose you could make a feature request and see what the pm2 authors have to say about it.
I think this means that if you want to gracefully exit and have pm2 restart your program, you'll need to call process.exit. You won't be able to rely on node knowing that there are no more event sources, because pm2 is responsible for some of them. You will, of course, have to ensure that all your relevant pending promises or timers have resolved before calling process.exit, because it terminates the process immediately without waiting for pending work.
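A minimal sketch of that pattern, where doWork() is a hypothetical stand-in for whatever your worker actually does:

async function doWork() {
  // fetch, process, store ... (placeholder for the worker's real job)
}

doWork()
  .catch(err => { console.error(err); process.exitCode = 1; })
  // everything we care about has settled by now, so an explicit exit is
  // safe; pm2 then restarts the worker after the configured restart_delay
  .finally(() => process.exit());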

Lots of "Uncaught signal: 6" errors in Cloud Run

I have a Python (3.x) web service deployed in GCP. Every time Cloud Run is shutting down instances, most noticeably after a big load spike, I get many logs like Uncaught signal: 6, pid=6, tid=6, fault_addr=0. together with [CRITICAL] WORKER TIMEOUT (pid:6). They are always signal 6.
The service uses FastAPI and Gunicorn, running in Docker with this start command:
CMD gunicorn -w 2 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8080 app.__main__:app
The service is deployed using Terraform with 1 GiB of RAM, 2 CPUs, and the timeout set to 2 minutes:
resource "google_cloud_run_service" <ressource-name> {
name = <name>
location = <location>
template {
spec {
service_account_name = <sa-email>
timeout_seconds = 120
containers {
image = var.image
env {
name = "GCP_PROJECT"
value = var.project
}
env {
name = "BRANCH_NAME"
value = var.branch
}
resources {
limits = {
cpu = "2000m"
memory = "1Gi"
}
}
}
}
}
autogenerate_revision_name = true
}
I have already tried tweaking the resources and timeout in Cloud Run, and using the --timeout and --preload flags for gunicorn, as that is what people always seem to recommend when googling the problem, but all without success. I also don't know exactly why the workers are timing out.
Extending the top answer, which is correct: you are using Gunicorn, a process manager that manages the Uvicorn workers that run the actual app.
When Cloud Run wants to shut down the instance (probably due to a lack of requests), it sends signal 6 to process 1. However, Gunicorn occupies this process as the manager and will not pass the signal on to the Uvicorn workers for handling, so you receive the unhandled signal 6.
The simplest solution is to run Uvicorn directly instead of through Gunicorn (possibly with a smaller instance) and let Cloud Run handle the scaling:
CMD ["uvicorn", "app.__main__:app", "--host", "0.0.0.0", "--port", "8080"]
Unless you have enabled "CPU is always allocated", background threads and processes might stop receiving CPU time after all HTTP requests return. This means background threads and processes can fail, connections can time out, etc. I cannot think of any benefit to running background workers on Cloud Run except when setting the --no-cpu-throttling flag. Cloud Run instances that are not processing requests can be terminated.
Signal 6 is SIGABRT (abort), which terminates the process. This probably means your container is being terminated due to a lack of requests to process.
Run more workloads on Cloud Run with new CPU allocation controls
What if my application is doing background work outside of request processing?
This error happens when a background process is aborted. There are advantages to running background threads on Cloud Run, just as for other applications, and you can still use them without processes getting aborted. To do so, when deploying, choose the option "CPU always allocated" instead of "CPU only allocated during request processing".
For more details, check https://cloud.google.com/run/docs/configuring/cpu-allocation

How to restart a Node.js application and hand the new process over to the console

The following Node.js script can restart itself and will even still print to the correct console (or terminal, if you prefer), but it will no longer be running in the foreground, meaning you can't exit it with Ctrl+C anymore:
console.log("This is pid " + process.pid);
setTimeout(function () {
  process.on("exit", function () {
    require("child_process").spawn(process.argv.shift(), process.argv, {
      cwd: process.cwd(),
      detached: true,
      stdio: "inherit"
    });
  });
  process.exit();
}, 5000);
I've already tried detached: true vs detached: false, but obviously this didn't solve the problem...
Is there a way to make the new node process run in the foreground, replacing the old one? Or is this not possible?
I know that in Bash you can pull a program back from the background like this:
$ watch echo "runs in background" &
$ fg # pulls the background process to the foreground
But I'm not looking for a Bash command or so, I'm looking for a programmatic solution within the Node.js script that works on any platform.
No, once a process has exited it cannot perform any more operations, and there's no such thing as a "please foreground this after I exit"-type API for terminals that I've ever heard of.
The proper way to solve this is via a wrapper that monitors your process for failures and restarts it. The wrapper then has control of stdio and passes it to its children.
You could achieve this with a simple bash loop, another node script, or you might be able to leverage the Node.js cluster module.
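For example, a hedged sketch of such a wrapper in node, assuming your real program lives in a hypothetical worker.js:

const { spawn } = require("child_process");

function start() {
  // stdio: "inherit" keeps the child attached to this terminal, so Ctrl+C
  // still reaches the whole foreground process group
  const child = spawn(process.execPath, ["worker.js"], { stdio: "inherit" });
  child.on("exit", (code, signal) => {
    console.log(`worker exited (code ${code}, signal ${signal}), restarting...`);
    setTimeout(start, 1000); // small delay to avoid a tight crash loop
  });
}

start();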
As Jonny said, you'd need a process manager that handles running your application. Per the Node.js documentation for child_process, spawn() functions similarly to popen() at the system level, which creates a forked process. This generally doesn't go into the foreground. Also, when a parent process exits, control is returned to either the calling process or the shell itself.
A popular process management solution is PM2, which can be installed via npm i -g pm2. (Not linking to their site/documentation here) PM2 is cross-platform, but it does require an "external" dependency that doesn't live within the codebase itself.
I would also be curious why you want a script that restarts itself on exit in the manner you're describing, since it seems like just re-running the script -- which is what PM2 and similar tools do -- would yield the same results and not involve mucking around with process management manually.

Node v8.5 with --trace-events-enabled not producing trace log file

I'm running node v8.5 and I'm trying to play around with the experimental Tracing feature.
Starting my application with node --trace-events-enabled app.js, I would expect a trace log file to be generated per the node documentation (https://nodejs.org/api/tracing.html), which I could then view in Chrome by visiting chrome://tracing and loading the generated file.
However, it doesn't seem like node is generating that log file at all. Are there settings I'm missing, or is the log file saved outside my project directory?
I recently tried with node v8.9.1, and the correct creation of the logs depends on how you close your app.js.
This example app works correctly: it creates a file called node_trace.1.log in the directory where you start node (node --trace-events-enabled ./bin/trace-me.js will create the file in ./):
console.log("Trace me");
const interv = setInterval(()=>console.log("Runnning"), 1000);
// quit on ctrl-c when running docker in terminal
process.on('SIGINT', function onSigint() {
console.info('Got SIGINT (aka ctrl-c). Graceful shutdown ', new Date().toISOString());
clearInterval(interv);
});
process.on('beforeExit', function (exitCode) {
console.log("Before exit: "+ exitCode);
});
If you kill your process with Ctrl+C without handling the signal, for example, the beforeExit event will not be called and the trace logs will never be created.
The same happens if you call process.exit():
it will terminate as soon as possible even if there are still
asynchronous operations pending that have not yet completed fully,
including I/O operations to process.stdout and process.stderr.
as described in the docs.
So the solution is to handle the SIGINT and SIGTERM events correctly and to check that beforeExit is called, since it is emitted only when Node.js empties its event loop without being killed.
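A hedged sketch of that combination, extending the example above with a SIGTERM handler:

const interv = setInterval(() => console.log("Running"), 1000);

function shutdown(sig) {
  console.info(`Got ${sig}. Graceful shutdown`, new Date().toISOString());
  // with the interval cleared there are no handles left, node drains its
  // event loop, beforeExit fires, and the trace file gets written
  clearInterval(interv);
}

process.on("SIGINT", () => shutdown("SIGINT"));
process.on("SIGTERM", () => shutdown("SIGTERM"));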

Send command to running node process, get back data from inside the app

I start a node.js app from the command line in Linux.
I can see the app running, e.g. by entering top.
Is there a way to send some command to the running app (maybe via its pid?) and get back info from inside it (maybe it could listen for some input and return the requested info)?
Use the repl module. There are examples in the docs doing exactly what you need: running JS in the context of your application and returning the output.
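A minimal sketch of that approach, assuming a hypothetical Unix socket path of /tmp/app-repl.sock:

const net = require("net");
const repl = require("repl");

const myApp = { startedAt: new Date() }; // stand-in for your real application object

net.createServer(socket => {
  // every connection gets a REPL evaluated inside this running process
  const r = repl.start({ prompt: "app> ", input: socket, output: socket });
  r.context.app = myApp; // expose whatever state you want to inspect
  r.on("exit", () => socket.end());
}).listen("/tmp/app-repl.sock");

You can then connect from a shell with, for example, socat - UNIX-CONNECT:/tmp/app-repl.sock and query the app interactively.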
One simple solution is to use process signals. You can define a handler for a signal in your program that outputs some data to the console (or writes to a file or a database, if your application is running as a service not attached to a terminal you can see):
process.on('SIGUSR1', function() {
  console.log('hello. you called?');
});
and then send a signal to it from your shell:
kill -USR1 <pid of node app.js>
This will invoke the signal handler you have defined in your node.js application. (Note that Node.js reserves SIGUSR1 for starting its debugger, so SIGUSR2 may be the safer choice in practice.)
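If the app runs detached from any terminal, a hedged variant is to dump a snapshot to a file instead, here on SIGUSR2 and to a hypothetical /tmp/app-status.json:

const fs = require("fs");

process.on("SIGUSR2", () => {
  // gather whatever internal state you want to expose
  const status = { pid: process.pid, uptime: process.uptime(), memory: process.memoryUsage() };
  fs.writeFileSync("/tmp/app-status.json", JSON.stringify(status, null, 2));
});

// from a shell:  kill -USR2 <pid>   then   cat /tmp/app-status.json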
