I'm running a Kubernetes cluster and one microservice is constantly crashing with exit code 134. I already changed the resource memory limit to 6Gi:
resources: {
  limits: {
    memory: "6Gi"
  }
}
but the pod never goes above 1.6-1.7Gi.
What might be missing?
It's not about the Kubernetes memory limit. The default JavaScript heap limit is about 1.76 GB when running in Node.js (V8 engine).
The command line in the Deployment/Pod should be changed to something like node --max-old-space-size=6144 index.js.
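For instance, a minimal Deployment sketch (the names, image and index.js entry point are placeholders) that overrides the container command:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:latest   # placeholder image
          # Raise the V8 old-space limit (in MiB) so the heap can grow toward the 6Gi pod limit.
          # Some people set this slightly below the pod limit to leave headroom for native memory.
          command: ["node", "--max-old-space-size=6144", "index.js"]
          resources:
            limits:
              memory: "6Gi"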
Related
I have a Fargate cluster running a Node.js API on Fargate platform version 1.4.0.
I have maybe 8-25 instances running depending on the load. The instances are defined with these parameters using the AWS CDK:
cpu: 512,
assignPublicIp: true,
memoryLimitMiB: 2048,
publicLoadBalancer: true,
A few times a day I get an error like this:
error Command failed with signal "SIGKILL".
I thought I was running out of memory, so I configured Node to start with less memory, like this: NODE_OPTIONS=--max_old_space_size=900
This made it less likely to occur, but I am still getting some SIGKILLs.
When looking at the instances at runtime, I see they have plenty of free memory at the OS level:
{
"freemem": "6.95GB",
"totalmem": "7.79GB",
"max_old_space_size": 813.1680679321289,
"processUptime": "46m",
"osUptime": "49m",
"rssMemory": "396.89MB"
}
Why is Fargate still killing those instances? Is there a way to find out which processes are the most memory-hungry just before the SIGKILL?
I have a Python (3.x) web service deployed on GCP. Every time Cloud Run shuts down instances, most noticeably after a big load spike, I get many logs like Uncaught signal: 6, pid=6, tid=6, fault_addr=0. together with [CRITICAL] WORKER TIMEOUT (pid:6). They are always signal 6.
The service uses FastAPI and Gunicorn running in Docker with this start command:
CMD gunicorn -w 2 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8080 app.__main__:app
The service is deployed using Terraform with 1 GiB of RAM, 2 CPUs, and the timeout set to 2 minutes:
resource "google_cloud_run_service" <ressource-name> {
name = <name>
location = <location>
template {
spec {
service_account_name = <sa-email>
timeout_seconds = 120
containers {
image = var.image
env {
name = "GCP_PROJECT"
value = var.project
}
env {
name = "BRANCH_NAME"
value = var.branch
}
resources {
limits = {
cpu = "2000m"
memory = "1Gi"
}
}
}
}
}
autogenerate_revision_name = true
}
I have already tried tweaking the resources and timeout in Cloud Run, and using the --timeout and --preload flags for Gunicorn, as that is what people always seem to recommend when googling the problem, but all without success. I also don't know exactly why the workers are timing out.
Extending the top answer, which is correct: you are using Gunicorn, a process manager that manages the Uvicorn workers which run the actual app.
When Cloud Run wants to shut down the instance (probably due to a lack of requests), it sends signal 6 to process 1. However, Gunicorn occupies this process as the manager and does not pass the signal on to the Uvicorn workers for handling, which is why you see the uncaught signal 6.
The simplest solution is to run Uvicorn directly instead of through Gunicorn (possibly with a smaller instance) and let Cloud Run handle the scaling:
CMD ["uvicorn", "app.__main__:app", "--host", "0.0.0.0", "--port", "8080"]
Unless you have enabled "CPU is always allocated", background threads and processes might stop receiving CPU time after all HTTP requests return. This means background threads and processes can fail, connections can time out, etc. I cannot think of any benefit to running background workers on Cloud Run except when setting the --no-cpu-throttling flag. Cloud Run instances that are not processing requests can be terminated.
Signal 6 means abort (SIGABRT), which terminates processes. This probably means your container is being terminated due to a lack of requests to process.
Run more workloads on Cloud Run with new CPU allocation controls
What if my application is doing background work outside of request processing?
This error happens when a background process is aborted. There are some advantages to running background threads in the cloud, just as for other applications. Luckily, you can still use them on Cloud Run without processes getting aborted. To do so, when deploying, choose the option "CPU always allocated" instead of "CPU only allocated during request processing".
For more details, check https://cloud.google.com/run/docs/configuring/cpu-allocation
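Since the service in the question is deployed with Terraform, the same setting can, as far as I can tell from the docs linked above (please verify), be expressed as an annotation on the revision template, e.g.:

resource "google_cloud_run_service" <ressource-name> {
  # ... name, location and the spec block stay as in the question ...
  template {
    metadata {
      annotations = {
        # "false" = CPU always allocated; "true" (the default) = CPU only during requests.
        # Roughly equivalent to `gcloud run services update <name> --no-cpu-throttling`.
        "run.googleapis.com/cpu-throttling" = "false"
      }
    }
    spec {
      # unchanged from the question
    }
  }
  autogenerate_revision_name = true
}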
In the official documentation here, the pod.spec.containers[].resources.limits field is defined as follows:
"Limits describes the maximum amount of compute resources allowed."
I understand that k8s prohibits a pod from consuming more resources than specified in limits.
The documentation does not say that a pod will not be scheduled on a node that lacks the amount of resources specified in limits.
For instance, if each node in my cluster has 2 CPUs and I try to deploy a pod with a CPU limit of 3, my pod never runs and stays in status Pending.
Here is the example template, mypod.yml:
apiVersion: v1
kind: Pod
metadata:
  name: secondbug
spec:
  containers:
    - name: container
      image: nginx
      resources:
        limits:
          cpu: 3
Is this behaviour intended, and if so, why?
There are three kinds of pods: pods with no requests or limits defined, burstable pods which have both limits and requests, and a third kind which has only limits or only requests defined.
For the third case: if you define only limits, the requests default to the same values, meaning:
resources:
  limits:
    cpu: 3
Is actually:
resources:
  limits:
    cpu: 3
  requests:
    cpu: 3
So you can't really define limits without requests being set as well. In your case, you want to define requests explicitly so they are not defaulted from the limits.
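For example, a sketch with illustrative values that keeps the limit but lets the scheduler place the pod on a 2-CPU node:

apiVersion: v1
kind: Pod
metadata:
  name: secondbug
spec:
  containers:
    - name: container
      image: nginx
      resources:
        # The scheduler only looks at requests, so these must fit on a node.
        requests:
          cpu: 1
        # Limits are not considered for scheduling; they only cap usage at runtime.
        limits:
          cpu: 3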
I tried to use K8s to set up a Spark cluster (I use standalone deployment mode, and I cannot use the k8s deployment mode for some reason).
I didn't set any CPU-related arguments.
For Spark, that means:
Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
http://spark.apache.org/docs/latest/spark-standalone.html
For k8s pods, that means:
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
The Container is running in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
...
Addresses:
  InternalIP:  172.16.197.133
  Hostname:    ubuntu
Capacity:
  cpu:     4
  memory:  3922Mi
  pods:    110
Allocatable:
  cpu:     4
  memory:  3822Mi
  pods:    110
...
But my Spark worker only uses 1 core (I have 4 cores on the worker node and the namespace has no resource limits).
That means the Spark worker pod only used 1 core of the node (it should have been 4).
How can I write the YAML file so that the pod uses all available CPU cores?
Here is my YAML file:
---
apiVersion: v1
kind: Namespace
metadata:
  name: spark-standalone
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: spark-slave
  namespace: spark-standalone
  labels:
    k8s-app: spark-slave
spec:
  selector:
    matchLabels:
      k8s-app: spark-slave
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      name: spark-slave
      namespace: spark-standalone
      labels:
        k8s-app: spark-slave
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/edge
                    operator: Exists
      hostNetwork: true
      containers:
        - name: spark-slave
          image: spark:2.4.3
          command: ["/bin/sh", "-c"]
          args:
            - "
              ${SPARK_HOME}/sbin/start-slave.sh
              spark://$(SPARK_MASTER_IP):$(SPARK_MASTER_PORT)
              --webui-port $(SPARK_SLAVE_WEBUI_PORT)
              &&
              tail -f ${SPARK_HOME}/logs/*
              "
          env:
            - name: SPARK_MASTER_IP
              value: "10.4.20.34"
            - name: SPARK_MASTER_PORT
              value: "7077"
            - name: SPARK_SLAVE_WEBUI_PORT
              value: "8081"
---
Kubernetes - No upper bound
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running
Unless you configure a CPU limit for your pod, it can use all available CPU resources on the node.
Consider dedicated nodes
If you are running other workloads on the same node, they also consume CPU resources, and they may be guaranteed CPU if they have configured CPU requests. Consider using a dedicated node for your workload, using a NodeSelector together with Taints and Tolerations, as sketched below.
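A rough sketch of that approach (the label and taint key dedicated=spark are hypothetical):

# One-off node preparation, e.g.:
#   kubectl label nodes <node-name> dedicated=spark
#   kubectl taint nodes <node-name> dedicated=spark:NoSchedule
spec:
  nodeSelector:
    dedicated: spark            # schedule only onto the labelled node
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "spark"
      effect: "NoSchedule"      # tolerate the taint that keeps other pods off the node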
Spark - No upper bound
You configure the slave via parameters to start-slave.sh, e.g. --cores X, to control how many CPU cores it uses, as shown below.
Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
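Applied to the DaemonSet in the question, the worker start command could pass the core count explicitly (a sketch; 4 matches the node described above):

args:
  - "
    ${SPARK_HOME}/sbin/start-slave.sh
    spark://$(SPARK_MASTER_IP):$(SPARK_MASTER_PORT)
    --webui-port $(SPARK_SLAVE_WEBUI_PORT)
    --cores 4
    &&
    tail -f ${SPARK_HOME}/logs/*
    "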
Multithreaded workload
In the end, whether a pod can use multiple CPU cores depends on how your application uses threads. Some applications only use a single thread, so the application must be designed for multithreading and have work that can be parallelized.
I hit exactly the same issue with a Spark worker. Knowing that Java sometimes fails to detect the CPU count correctly, I tried specifying a CPU request or limit in the Pod spec, and the worker automatically understood what environment it was in, so the needed cores were assigned to the Spark worker executors.
Also, I only saw this behavior in k8s. In Docker Swarm, the worker picked up all available CPU cores.
What's more, the default templates mention 1 core for the Spark worker 'cores' parameter. I believe that value can be picked up when the CPU core count is calculated incorrectly.
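A sketch of what that could look like on the worker container in the question's DaemonSet (the values are illustrative and match the 4-core node above):

resources:
  requests:
    cpu: 4
  limits:
    # With a CPU limit set, the JVM's container support reports that many
    # available processors, so the Spark worker advertises 4 cores instead of 1.
    cpu: 4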
I am using the sharp image processing module to resize images and render them in the UI.
app.get('/api/preview-small/:filename', (req, res) => {
  let filename = req.params.filename;
  sharp('files/' + filename)
    .resize(200, 200, {
      fit: sharp.fit.inside,
      withoutEnlargement: true
    })
    .toFormat('jpeg')
    .toBuffer()
    .then(function (outputBuffer) {
      res.writeHead(200, { "Content-Type": "image/jpeg" });
      res.write(outputBuffer);
      res.end();
    })
    .catch(function (err) {
      // Without a catch, a failed resize leaves the request hanging.
      res.writeHead(500, { "Content-Type": "text/plain" });
      res.end(err.message);
    });
});
I am running the above code on a single-board computer (Rock64 with 1 GB of RAM). When I run the Linux htop command and monitor memory utilization, I can see the memory usage keep climbing with every call to the Node.js app, from 10% up to 60%, and it never comes down.
[screenshot: htop CPU/memory usage]
Though the application runs without issues, my concern is that the memory usage does not come down even when the app is idle, and I am not sure whether this will eventually crash the application if it runs continuously.
Or, if I move a similar code snippet to the cloud, will it keep occupying memory even when it's not running?
Is anyone else using the sharp module facing a similar issue, or is this a known issue with Node.js? Is there a way to flush/clear the memory, or will Node garbage-collect it?
Any help is appreciated. Thanks.
sharp has some memory debugging stuff built in:
http://sharp.dimens.io/en/stable/api-utility/#cache
You can control the libvips cache, and get stats about resource usage.
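For example (a sketch using the utility API documented at the link above):

const sharp = require('sharp');

// Disable (or shrink) the libvips operation cache so memory is released
// between requests instead of being held for reuse.
sharp.cache(false);
// ...or keep a smaller cache, e.g. 50 MB / 20 files / 100 operations:
// sharp.cache({ memory: 50, files: 20, items: 100 });

// Optionally reduce concurrency on a 1 GB board to lower peak usage.
sharp.concurrency(1);

// Inspect current cache state and task counters.
console.log(sharp.cache());
console.log(sharp.counters());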
The Node version has a very strong effect on memory behaviour. This has been discussed a lot on the sharp issue tracker; see for example:
https://github.com/lovell/sharp/issues/429
Or perhaps:
https://github.com/lovell/sharp/issues/778