Lots of "Uncaught signal: 6" errors in Cloud Run - python-3.x

I have a Python (3.x) web service deployed in GCP. Every time Cloud Run shuts down instances, most noticeably after a big load spike, I get many logs like Uncaught signal: 6, pid=6, tid=6, fault_addr=0. together with [CRITICAL] WORKER TIMEOUT (pid:6). They are always signal 6.
The service uses FastAPI and Gunicorn running in a Docker container with this start command:
CMD gunicorn -w 2 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8080 app.__main__:app
The service is deployed using Terraform with 1 GiB of RAM, 2 CPUs, and the timeout set to 2 minutes:
resource "google_cloud_run_service" <ressource-name> {
name = <name>
location = <location>
template {
spec {
service_account_name = <sa-email>
timeout_seconds = 120
containers {
image = var.image
env {
name = "GCP_PROJECT"
value = var.project
}
env {
name = "BRANCH_NAME"
value = var.branch
}
resources {
limits = {
cpu = "2000m"
memory = "1Gi"
}
}
}
}
}
autogenerate_revision_name = true
}
I have already tried tweaking the resources and timeout in Cloud Run, and using the --timeout and --preload flags for gunicorn, as that is what people usually recommend when googling the problem, but all without success. I also don't know exactly why the workers are timing out.

Extending the top answer, which is correct: you are using Gunicorn, a process manager that manages the Uvicorn worker processes that run the actual app.
When Cloud Run wants to shut down the instance (probably due to a lack of requests), it sends signal 6 to process 1. However, Gunicorn occupies this process as the manager and will not pass the signal on to the Uvicorn workers for handling, and thus you see the uncaught signal 6.
The simplest solution is to run Uvicorn directly instead of through Gunicorn (possibly with a smaller instance) and let Cloud Run handle the scaling:
CMD ["uvicorn", "app.__main__:app", "--host", "0.0.0.0", "--port", "8080"]

Unless you have enabled "CPU is always allocated", background threads and processes might stop receiving CPU time after all HTTP requests return. This means background threads and processes can fail, connections can time out, etc. I cannot think of any benefit to running background workers on Cloud Run except when the --no-cpu-throttling flag is set. Cloud Run instances that are not processing requests can be terminated.
Signal 6 means abort, which terminates the process. This probably means your container is being terminated due to a lack of requests to process.
Run more workloads on Cloud Run with new CPU allocation controls
What if my application is doing background work outside of request processing?
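For reference, a minimal sketch of turning that setting on for an existing service with gcloud (service name and region are placeholders):
gcloud run services update <service-name> --region <region> --no-cpu-throttling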

This error happens when a background process is aborted. There are advantages to running background threads in the cloud, just as for other applications, and luckily you can still use them on Cloud Run without processes getting aborted. To do so, when deploying, choose the option "CPU always allocated" instead of "CPU only allocated during request processing".
For more details, check https://cloud.google.com/run/docs/configuring/cpu-allocation
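Since the service in the question is managed with Terraform, a minimal sketch of the same setting there (assuming the console option maps to the run.googleapis.com/cpu-throttling annotation on the revision template):
template {
  metadata {
    annotations = {
      # "false" = CPU always allocated; "true" = CPU only allocated during request processing
      "run.googleapis.com/cpu-throttling" = "false"
    }
  }
  spec {
    # containers, timeout_seconds, etc. as in the question
  }
}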

Related

error Command failed with signal "SIGKILL" on fargate

I have a Fargate cluster running a Node.js API on Fargate platform version 1.4.0.
I have maybe 8-25 instances running depending on the load. The instances are defined with these parameters using the AWS CDK:
cpu: 512,
assignPublicIp: true,
memoryLimitMiB: 2048,
publicLoadBalancer: true,
A few times a day I get an error like this:
error Command failed with signal "SIGKILL".
I thought I was running out of memory, so I configured Node to start with a smaller old-space limit, like this: NODE_OPTIONS=--max_old_space_size=900
This made it less likely to occur, but I am still getting some SIGKILLs.
When looking at the instances at runtime I see they have plenty of memory free on the OS level:
{
  "freemem": "6.95GB",
  "totalmem": "7.79GB",
  "max_old_space_size": 813.1680679321289,
  "processUptime": "46m",
  "osUptime": "49m",
  "rssMemory": "396.89MB"
}
Why is Fargate still killing those instances? Is there a way to find the most memory-hungry processes just before the SIGKILL?
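For anyone wanting to capture a similar snapshot right before the kill, a minimal sketch using only built-in Node modules (the 30-second interval is an arbitrary choice; the fields mirror most of the output above):
const os = require("os");

// Log a memory snapshot periodically so the last entry before a SIGKILL is visible in the logs.
setInterval(() => {
  console.log({
    freemem: `${(os.freemem() / 1024 ** 3).toFixed(2)}GB`,
    totalmem: `${(os.totalmem() / 1024 ** 3).toFixed(2)}GB`,
    processUptime: `${Math.round(process.uptime() / 60)}m`,
    osUptime: `${Math.round(os.uptime() / 60)}m`,
    rssMemory: `${(process.memoryUsage().rss / 1024 ** 2).toFixed(2)}MB`,
  });
}, 30_000);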

Node.js process doesn't exit when run under pm2

I have a Node.js script that runs and exits fine in the console, but under pm2 it doesn't exit unless I call process.exit(). The pm2 config is:
{
  name: "worker",
  script: "./worker.js",
  restart_delay: 60000,
  out_file: "/tmp/worker.log",
  error_file: "/tmp/worker_err.log"
},
I've installed why-is-node-running to see what keeps the process running 10 seconds after the expected exit, and the output is:
There are 9 handle(s) keeping the process running
# TLSWRAP
node:internal/async_hooks:200
# TLSWRAP
node:internal/async_hooks:200
# ZLIB
node:internal/async_hooks:200
/Users/r/code/app/node_modules/decompress-response/index.js:43 - const decompressStream = isBrotli ? zlib.createBrotliDecompress() : zlib.createUnzip();
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:586
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:768
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:786
# TLSWRAP
node:internal/async_hooks:200
# ZLIB
node:internal/async_hooks:200
/Users/r/code/app/node_modules/decompress-response/index.js:43 - const decompressStream = isBrotli ? zlib.createBrotliDecompress() : zlib.createUnzip();
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:586
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:768
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:786
# TLSWRAP
node:internal/async_hooks:200
# ZLIB
node:internal/async_hooks:200
/Users/r/code/app/node_modules/decompress-response/index.js:43 - const decompressStream = isBrotli ? zlib.createBrotliDecompress() : zlib.createUnzip();
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:586
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:768
file:///Users/r/code/app/node_modules/got/dist/source/core/index.js:786
# TLSWRAP
node:internal/async_hooks:200
# Timeout
node:internal/async_hooks:200
node:internal/async_hooks:468
node:internal/timers:162
node:internal/timers:196
file:///Users/r/code/app/worker.js:65
node:internal/process/task_queues:94
Why doesn't node exit? How do I further debug this?
PS: Sorry for a large paste
UPDATE
I've managed to reproduce this in a comically small 2-liner:
import got from "got";
await got.post('https://anty-api.com/browser_profiles', {form: {a: 123}}).json();
The above code throws as expected when run from the console, yet keeps running forever when called by pm2.
UPDATE 2
It does reproduce with an empty app file too.
I think this is just the way pm2 works. You can expect that, when running under pm2, the node process will continue to run forever (whether your app is responsible for pending async event sources or not) unless you either crash or do something to explicitly terminate it, such as calling process.exit().
As you've discovered, this has nothing to do with any code in your app.js. Even an empty app.js exhibits this behaviour. This is a fundamental design aspect of pm2: it wraps your program, and it's the wrapper that keeps the node process alive.
This is because pm2 runs your program (in forked mode, as opposed to cluster mode) by launching a node process that runs ProcessContainerFork.js (the wrapper). This module establishes and maintains a connection to pm2's managing process (a.k.a "god daemon") and loads your app's main module with require('module')._load(...). The communication channel will always count as an event source that keeps the actual node process alive.
Even if your program does nothing, the status of your program will be "online". Even if your program reaches the state where, had it been launched directly, node would have exited, the state is still "online" in this case because of the wrapper.
This leaves the designers of pm2 with the challenge of trying to know whether your program is no longer responsible for any events (in which case node would normally exit). pm2 does not have a feature that distinguishes node being kept alive by code you wrote in your app.js from node being kept alive by the infrastructure established by ProcessContainerFork.js. One could certainly imagine pm2 using async_hooks to keep track of event sources originating from your app rather than from ProcessContainerFork.js (much like why-is-node-running does), and then tearing down properly once that state is reached. Perhaps pm2 chooses not to do this to avoid the performance penalty associated with async hooks? Perhaps an app that exits on purpose but is intended to be restarted seems too much like a cron job? I'm speculating that yours is not the primary use case for pm2. I suppose you could make a feature request and see what the pm2 authors have to say about it.
I think this means if you want to gracefully exit and have pm2 restart your program, you'll need to call process.exit to do so. You won't be able to rely on node knowing that there are no more event sources because pm2 is responsible for some of them. You will, of course, have to ensure that all your relevant pending promises or timers have resolved before calling process.exit because that will immediately terminate the process without waiting for pending things to happen.
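A minimal sketch of that pattern applied to the two-liner above (the URL is a placeholder; the explicit exit codes are what pm2 will see):
import got from "got";

try {
  // Placeholder for the real work, e.g. the got.post call from the question.
  await got.post("https://example.com/browser_profiles", { form: { a: 123 } }).json();
  process.exit(0);   // explicit exit: pm2's IPC channel would otherwise keep the event loop alive
} catch (err) {
  console.error(err);
  process.exit(1);   // non-zero exit so the failed run is recorded as errored by pm2
}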

Sharing resources across uwsgi processes in a flask app

I have a dictionary in my Flask app which is global and contains information about the logged-in user:
globalDict = {}
globalDict[username] = "Some Important info"
When a person logs in, this globalDict is populated; when the user logs out, it is depopulated. When I use uWSGI with a single process, there is obviously no problem.
When I use it with multiple processes, sometimes the dictionary turns out to be empty on printing. I guess this is because there are multiple globalDicts across different processes.
How to share globalDict across all processes in my flask application?
P.S. I use only uwsgi for hosting my server and nothing else.
Give the uWSGI cache a try.
uwsgi startup
uwsgi --cache2 name=mycache,items=100 --ini $conffile
write example:
uwsgi.cache_update(this_item["sha"], encoded_blob, 0, "mycache")
read example:
value_from_cache = uwsgi.cache_get(this_item["sha"], "mycache")
More Details
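A minimal sketch of how this could look for the per-user data in the question (assumes the app runs under uWSGI so the uwsgi module is importable, that values are stored as bytes, and that the cache name matches the startup example above):
import uwsgi  # only available when the app runs under uWSGI

def save_user_info(username, info):
    # cache_update(key, value, expires, cache_name); expires=0 means the item never expires
    uwsgi.cache_update(username, info.encode("utf-8"), 0, "mycache")

def load_user_info(username):
    value = uwsgi.cache_get(username, "mycache")
    return value.decode("utf-8") if value is not None else None

def clear_user_info(username):
    # remove the entry when the user logs out
    uwsgi.cache_del(username, "mycache")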

pm2-runtime inside docker container receives SIGTERM - why?

I'm running a virtual machine with Docker, which implements our CI/CD infrastructure.
The docker-compose setup has an nginx reverse proxy and another service. Essentially, that service container's start command is a shell script which creates local copies of files from a central repository. The shell script then starts (by means of yarn start) a Node.js script that selects a couple of services and creates a pm2 application startup JSON file.
Finally, pm2-runtime is launched with this application definition file. This is done by:
const { exec } = require("child_process");

const child = exec(`pm2-runtime build/pm2startup.json`);
child.stdout.on("data", data => { process.stdout.write(data); });
child.stderr.on("data", data => { process.stderr.write(data); });
child.on("close", (code, signal) => {
  process.stdout.write(`pm2-runtime process closed with code ${code}, signal ${signal}\n`);
});
child.on("error", error => {
  process.stderr.write(`pm2-runtime process error ${error}\n`);
});
child.on("exit", (code, signal) => {
  process.stdout.write(`pm2-runtime process exited with code ${code}, signal ${signal}\n`);
});
There are about 10 apps managed by pm2, and docker stats says the container's memory consumption is greater than 850 MB. However, I have not set any explicit memory limit anywhere, and I cannot find an implicit one either.
Every now and then the container of services is restarted. According to the dockerd logs its task has exited. That's true: the pm2-runtime process (see above) is reported to be closed because of SIGTERM.
And that's the only message I get related to this. No other pm2 message, no service message, no docker event.
Now I'm seeking advice on how to find the cause of this SIGTERM, because I'm running out of ideas.
As it turned out, it was indeed the snippet inside the question that caused the problem.
pm2startup.json references long-running apps. Over time, depending on usage, they produce quite a lot of output on stdout and/or stderr. At some point the buffer that exec keeps for that output (its maxBuffer limit) fills up and the node process running pm2-runtime is stopped. Unfortunately it stops without any hint about the reason for the crash. But that's another story.
The solution in my case was to do without exec or execFile and use spawn instead, with the stdio option { stdio: "inherit" } (or the verbose version ["inherit", "inherit", "inherit"]).
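A minimal sketch of that replacement (same pm2startup.json path as above):
const { spawn } = require("child_process");

// "inherit" streams the child's stdout/stderr straight to the parent's,
// so no in-memory buffer can fill up and take the process down.
const child = spawn("pm2-runtime", ["build/pm2startup.json"], { stdio: "inherit" });

child.on("exit", (code, signal) => {
  process.stdout.write(`pm2-runtime process exited with code ${code}, signal ${signal}\n`);
});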

Add worker to PM2 pool. Don't reload/restart existing workers

Env.: Node.js on Ubuntu, using PM2 programmatically.
I have started PM2 with 3 instances via Node in my main code. Suppose I use the PM2 command line to delete one of the instances. Can I add another worker back to the pool? Can this be done without affecting the operation of the other workers?
I suppose I should use the start method:
pm2.start({
  name: 'worker',
  script: 'api/workers/worker.js',  // Script to be run
  exec_mode: 'cluster',             // or 'fork'
  instances: 1,                     // Number of instances to launch
  max_memory_restart: '100M',       // Optional: restart the app if it reaches 100 MB
  autorestart: true
}, function (err, apps) {
  pm2.disconnect();
});
However, if you use pm2 monit you'll see that the 2 existing instances are restarted and no new one is created. The result is still 2 running instances.
Update
It doesn't matter whether it's cluster or fork -- the behavior is the same.
Update 2
The command line has the scale option ( https://keymetrics.io/2015/03/26/pm2-clustering-made-easy/ ), but I don't see this method in the programmatic API documentation ( https://github.com/Unitech/PM2/blob/master/ADVANCED_README.md#programmatic-api ).
I actually think this can't be done in PM2 as I have the exact same problem.
I'm sorry, but I think the solution is to use something else as PM2 is fairly limited. The lack of ability to add more workers is a deal breaker for me.
I know you can "scale" on the command line if you are using clustering but I have no idea why you can not start more instances if you are using fork. It makes no sense.
As far as I know, all PM2 commands can also be used programmatically, including scale. Check out CLI.js to see all available methods.
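A minimal sketch of how that could look (assuming pm2.scale is exposed on the programmatic API as suggested above; 'worker' is the app name from the question):
const pm2 = require("pm2");

pm2.connect(err => {
  if (err) throw err;
  // Scale the existing 'worker' app to 3 instances; already-running instances are left alone.
  pm2.scale("worker", 3, (scaleErr, procs) => {
    pm2.disconnect();
    if (scaleErr) console.error(scaleErr);
  });
});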
Try the force attribute in the application declaration. If force is true, you can start the same script several times, which PM2 usually does not allow (according to the Application Declaration docs).
By the way, autorestart is true by default.
You can do so by using an ecosystem.config file.
Inside that file you can specify as many worker processes as you want (see the sketch at the end of this answer).
For example, we used BullJS to develop a microservice architecture of different workers that are started with the help of PM2 on multiple cores: the same worker is started multiple times as named instances.
When jobs are run, BullJS load-balances the workload for a specific worker across all available instances of that worker.
You can of course start or stop any instance via the CLI, and also start additional named workers via the command line to increase the number of workers (e.g. if many jobs need to be run and you want to process more jobs at a time):
pm2 start './script/to/start.js' --name additional-worker-4
pm2 start './script/to/start.js' --name additional-worker-5
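For completeness, a minimal sketch of such an ecosystem file (script paths and names are placeholders mirroring the commands above):
module.exports = {
  apps: [
    // The same worker script started as several named instances
    { name: "worker-1", script: "./script/to/start.js" },
    { name: "worker-2", script: "./script/to/start.js" },
    { name: "worker-3", script: "./script/to/start.js" }
  ]
};
Start them all with pm2 start ecosystem.config.js.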
