Managing Eviction on Kubernetes for Node.js and Puppeteer - node.js

I am currently seeing a strange issue where I have a Pod that is constantly being Evicted by Kubernetes.
My Cluster / App Information:
Node size: 7.5GB RAM / 2vCPU
Application Language: nodejs
Use Case: puppeteer website extraction (I have code that loads a website, then extracts an element and repeats this a couple of times per hour)
Running on Azure Kubernetes Service (AKS)
What I tried:
Checked that Puppeteer is closed correctly and that I am removing any Chrome instances. After adding a force kill it seems to be doing this.
Checked kubectl get events, which shows the lines:
8m17s Normal NodeHasSufficientMemory node/node-1 Node node-1 status is now: NodeHasSufficientMemory
2m28s Warning EvictionThresholdMet node/node-1 Attempting to reclaim memory
71m Warning FailedScheduling pod/my-deployment 0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 3 node(s) didn't match node selector
Checked kubectl top pods, which shows the pod only utilizing ~30% of the node's memory.
Added resource limits in my Kubernetes .yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-d
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: main
        image: my-image
        imagePullPolicy: Always
        resources:
          limits:
            memory: "2Gi"
Current way of thinking:
A node has X memory in total, but only Y of that X is actually allocatable due to reserved space. However, when running os.totalmem() in Node.js, I can still see that Node reports the full X as available.
What I think is happening is that Node.js allocates up to X before its garbage collection kicks in, when it should really treat Y as the ceiling. With my container limit set, I expected Node to see the limit instead of the K8s node's total memory.
Question
Are there any other things I should try to resolve this? Did anyone have this before?

Your Node.js app is not aware that it runs in a container. It only sees what the Linux kernel reports, which is the total memory of the node, not the cgroup limit. You should make your app aware of cgroup limits; see https://medium.com/the-node-js-collection/node-js-memory-management-in-container-environments-7eb8409a74e8
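One common way to make the heap respect the container boundary is to cap V8 explicitly through NODE_OPTIONS in the Deployment, leaving headroom below the 2Gi limit for the Chrome instances. A rough sketch; the 1536 MB figure is an assumption, not something from the question:
containers:
- name: main
  image: my-image
  env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=1536"  # cap V8's old-space heap well below the 2Gi container limit (illustrative value)
  resources:
    limits:
      memory: "2Gi"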
With regard to evictions: when you set memory limits, did that solve your problems with evictions?
And don't trust kubectl top pods too much. It always shows data with some delay.

Related

Spark on Kubernetes driver pod cleanup

I am running Spark 3.1.1 on Kubernetes 1.19. Once a job finishes, the executor pods get cleaned up, but the driver pod remains in Completed state. How do I clean up the driver pod once it has completed? Is there any configuration option to set?
NAME READY STATUS RESTARTS AGE
my-job-0e85ea790d5c9f8d-driver 0/1 Completed 0 2d20h
my-job-8c1d4f79128ccb50-driver 0/1 Completed 0 43h
my-job-c87bfb7912969cc5-driver 0/1 Completed 0 43h
Concerning the initial question "Spark on Kubernetes driver pod cleanup", it seems that there is no way to pass, at spark-submit time, a TTL parameter to Kubernetes to avoid driver pods in Completed status never being removed.
From Spark documentation:
https://spark.apache.org/docs/latest/running-on-kubernetes.html
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
It is not very clear what actually performs this "eventual garbage collection".
spark.kubernetes.driver.service.deleteOnTermination was added to Spark in 3.2.0. This should solve the issue. Source: https://spark.apache.org/docs/latest/core-migration-guide.html
Update: this only deletes the service pointing to the pod, not the pod itself.
According to the official documentation since Kubernetes 1.12:
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.
Example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      ...
The Job pi-with-ttl will be eligible to be automatically deleted, 100 seconds after it finishes.
If the field is set to 0, the Job will be eligible to be automatically deleted immediately after it finishes.
If customisation of the Job resource is not possible you may use an external tool to clean up completed jobs. For example check https://github.com/dtan4/k8s-job-cleaner

nodejs web application in k8s gets OOM

I'm running a NestJS web application implemented with Fastify on Kubernetes.
I split my application into multiple zones and deploy it into Kubernetes clusters in different physical locations (Cluster A & Cluster B).
Everything goes well, except for Zone X in Cluster A, which has the maximum traffic of all zones.
(Here is a 2-day metrics dashboard for Zone X during a normal period.)
The problem only happens in Zone X in Cluster A and never happens in any other zone or cluster.
At first some 499 responses appear in Cluster A's Ingress dashboard, and soon the memory of the pods suddenly expands to the memory limit, one pod after another.
It seems that the 499 status is caused by pods not sending responses back to the client.
At the same time, other zones in Cluster A work normally.
To avoid affecting users, I switch all network traffic to Cluster B and everything works properly, which rules out dirty data as the cause.
I tried to kill and redeploy all pods of Zone X in Cluster A, but when I switch traffic back to Cluster A the problem occurs again. However, after waiting 2-3 hours and then switching the traffic back, the problem disappears!
Since I don't know what causes this, the only thing I can do is switch traffic and check whether everything is back to normal.
I've looked into several known Node memory issues, but none of them seems to be the cause. Any ideas or inspiration for this problem?
Name          Version
nestjs        v6.1.1
fastify       v2.11.0
Docker Image  node:12-alpine (v12.18.3)
Ingress       v0.30.0
Kubernetes    v1.18.12

How to tell the Knative Pod Autoscaler not to kill an in-progress, long-running pod

My goal:
Implement a cron job that runs once per week. I intend to implement this topology on Knative to save computing resources:
PingSource -> Knative Service
The PingSource will emit a dummy event to a Knative Service once per week, just to bring up one Knative Service pod. That pod will fetch a huge amount of data and then process it.
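For reference, a weekly PingSource wired to a Knative Service might look roughly like this (the names, schedule and payload below are made up for illustration):
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: weekly-trigger              # hypothetical name
spec:
  schedule: "0 3 * * 0"             # once per week (Sunday 03:00)
  contentType: "application/json"
  data: '{"trigger": "weekly-run"}' # the dummy event payload
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: data-processor          # hypothetical Knative Service name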
My concern:
If I set enable-scale-to-zero to true, the Knative Pod Autoscaler will probably shut down the Knative service pod even when the pod has not finished its work.
So far, I have explored:
The scale-to-zero-grace-period, which can be configured to tell the autoscaler how long it should wait after the last traffic ends before shutting the pod down. But I don't think this approach is precise enough. I would prefer something similar to a readinessProbe or livenessProbe: the autoscaler should probe the pod to know whether it is still processing something before sending the kill signal.
In addition, according to Knative's docs, there are two types of event sink: callable and addressable. Both return a response or acknowledgement. Would the Knative autoscaler consider the pod to be handling the request until the pod returns the response/acknowledgement? If so, as long as the pod has not responded, it won't be removed by the autoscaler.
The Knative autoscaler relies on the pod working strictly in a request/response fashion. As long as the "huge amount of data" is processed as part of an HTTP request (or a WebSocket session, gRPC session, etc.), the pod will not even be considered for deletion.
What will not work is receiving the request, returning immediately, and then munging the data in the background. The autoscaler will then think there is no activity at all and shut the pod down. There is a sandbox project that tries to implement such asynchronous semantics, though.
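Since the whole run has to fit inside a single request, the per-revision request timeout is worth checking too. A hedged sketch of raising it on the Knative Service (the one-hour value is an assumption, and the cluster-wide max-revision-timeout-seconds setting in the config-defaults ConfigMap caps what you can set here):
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: data-processor                         # hypothetical, matching the PingSource sketch above
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0" # allow scale to zero between the weekly runs
    spec:
      timeoutSeconds: 3600                     # keep the request, and therefore the pod, alive for up to 1 h
      containers:
      - image: my-processor-image              # hypothetical image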

How many replicas for express (NodeJS)?

Before I used Kubernetes, the general rule I followed for running multiple Express instances on a VM was one per CPU. That seemed to give the best performance.
For Kubernetes, would it be wise to have a replica per node CPU? Or should I let the HorizontalPodAutoscaler decide? The cluster has a node autoscaler.
Thanks for any advice!
Good question!
You need to consider 4 things:
Run the pod using a Deployment so you get replication, rolling updates, and so on.
Set resources.limits in your container definition so that a single pod cannot consume the whole node.
Set resources.requests. This helps the scheduler estimate how much the app needs, so the pod is assigned to a node with suitable capacity. It is also the baseline the HPA's utilization percentage is computed against: with no request there is never a percentage, so the HPA can never reach its threshold.
Set the HPA threshold: the utilization percentage (CPU, memory) at which the HPA triggers a scale out or scale in (see the HPA sketch after the example below).
For your situation, you said "one per CPU", so it should be something like:
containers:
- name: express
  image: myapp-node
  #.....
  resources:
    requests:
      memory: "256Mi"
      cpu: "750m"
    limits:
      memory: "512Mi"
      cpu: "1000m" # <-- 🔴 match what you have in the legacy deployment
You may wonder why I put memory limits/requests without any input from your side.
The answer is that I chose those values arbitrarily. Your task is to monitor your application and adjust all of these values accordingly.
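For the HPA threshold mentioned in point 4, a sketch of a HorizontalPodAutoscaler against such a Deployment might look like this (the Deployment name, replica bounds and the 70% target are illustrative, not recommendations):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: express-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: express                 # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # scale out when average CPU usage crosses 70% of the requested CPU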

Executing multiple HTTP client request in node

I'm writing a backend app using Node.js which executes a lot of HTTP requests to external services and S3.
I have reached roughly 800 requests per second on a single Kubernetes pod.
The pod is limited to a single vCPU, and it has reached 100% usage.
I can scale out to tens of pods to handle thousands of requests,
but it seems that this limit is reached too soon.
I tested this in my real backend app and then in a demo pod that does nothing but send HTTP requests using axios.
Does it make sense that a single-vCPU Kubernetes pod can only handle 800 req/sec (as a client, not as a server)?
It's quite hard to give advice on the best approach to choosing the proper capacity for your specific needs. However, when you set a 1x vCPU limit on a Pod, it is equivalent to 1 CPU unit on the VMs of most widely used cloud providers.
Thus, I would bet on adding more CPU units to your Pod rather than spinning up more Pods with the same number of vCPUs via the HPA (Horizontal Pod Autoscaler). If you don't have enough capacity on your nodes, it's very easy to end up with lots of overloaded Pods, and that certainly won't have a positive influence on the nodes.
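To make "more CPU per Pod" concrete, the container's resources could ask for more than one CPU unit; the numbers below are placeholders to tune against your own measurements:
resources:
  requests:
    cpu: "2"          # two full CPU units for this single pod
    memory: "512Mi"   # placeholder value
  limits:
    cpu: "2"
    memory: "1Gi"     # placeholder value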
In your example there are two key metrics to analyze for the HTTP requests: latency (the time to send a request and receive the answer) and throughput (requests per second). The rule of thumb is always: increasing latency will decrease the overall throughput of your requests.
You can also read about the Vertical Pod Autoscaler as an option for managing compute resources in a Kubernetes cluster.
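If you would rather have the cluster recommend (or apply) those numbers, a VerticalPodAutoscaler object looks roughly like this; note that the VPA is a separate add-on, and the target name and update mode here are illustrative:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: http-client-vpa           # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http-client             # hypothetical Deployment name
  updatePolicy:
    updateMode: "Off"             # "Off" only produces recommendations; "Auto" also applies them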
