I am getting the following Twistlock vulnerability:
(CIS_Docker_CE_v1.1.0 - 5.28) Use PIDs cgroup limit
How can I set the cgroup PID limit using the Kubernetes deployment YAML file?
I know docker run has a flag for setting the PID limit, but we are not using docker run.
Can anyone please advise?
In Kubernetes 1.14, a feature was added that allows the kubelet to be configured to limit the number of PIDs a given pod can consume. If a machine supports 32,768 PIDs and 100 pods, one can give each pod a budget of 300 PIDs to prevent total exhaustion of PIDs.
The parameter (PodPidsLimit) is part of the kubelet configuration (see the KubeletConfiguration reference):
// The maximum number of processes per pod. If -1, the kubelet defaults to the node allocatable pid capacity.
PodPidsLimit int64
To see the current configuration and check whether the parameter is available in your version, you can generate the kubelet configuration file from a running node.
Take a look at the documentation on the kubelet PID limit and the per-pod PID limit.
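For reference, here is a minimal sketch of a kubelet configuration file that sets this limit (the value 300 is only an illustration). Note that this is a per-node kubelet setting, not something you can put in a Deployment YAML:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Limit every pod on this node to at most 300 process IDs (illustrative value)
podPidsLimit: 300

The kubelet is then started with --config pointing at this file, for example kubelet --config=/var/lib/kubelet/config.yaml.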
When running a job as a pipeline in a GitLab Runner Kubernetes pod, the job completes successfully only when running on a small instance such as m5*.large, which offers 2 vCPUs and 8 GB of RAM. We set limits for the build, helper, and services containers, as shown below. Still, the job fails with an Out Of Memory (OOM) error, with the node process killed by the cgroup, when running on a far more powerful instance, for example m5d*.2xlarge, which offers 8 vCPUs and 32 GB of RAM.
Note that we tried dedicating more resources to the containers, especially the build container (of which the node process is a child), and nothing changed on the powerful instances: the node process still got killed because of OOM, and each time we gave it more memory, it simply consumed more.
Also, regarding CPU usage: on the powerful instances, the more vCPUs we gave it, the more it consumed, and we noticed CPU throttling at ~100% almost all the time, whereas on the small instances like m5*.large the CPU throttling never exceeded 3%.
Note that we specified a maximum amount of memory to be used by the node process, but it does not seem to take effect. We tried setting it to 1 GB, 1.5 GB and 3 GB.
NODE_OPTIONS: "--max-old-space-size=1536"
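For context, a hedged sketch of how such a variable might be passed in .gitlab-ci.yml (the value is taken from the question, the placement is an assumption). Keep in mind that --max-old-space-size only caps V8's old-generation heap, so the container memory limit needs extra headroom for other heap spaces, buffers, and child processes:

variables:
  # Cap the V8 old-generation heap at ~1.5 GB (value from the question)
  NODE_OPTIONS: "--max-old-space-size=1536"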
Node Version
v16.19.0
Platform
amzn2.x86_64
Logs of the host where the job runs
"message": "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=....
....
"message": "Memory cgroup out of memory: Killed process 16828 (node) total-vm:1667604kB
Resource requests/limits configuration
memory_request = "1Gi"
memory_limit = "4Gi"
service_cpu_request = "100m"
service_cpu_limit = "500m"
service_memory_request = "250Mi"
service_memory_limit = "2Gi"
helper_cpu_request = "100m"
helper_cpu_limit = "250m"
helper_memory_request = "250Mi"
helper_memory_limit = "1Gi"
Resource consumption of a successful job running on m5d.large
Resource consumption of a failing job running on m5d.2xlarge
When a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.
Check whether you have enabled persistent journaling in your container(s).
One way: mkdir /var/log/journal && systemctl restart systemd-journald
Another way: configure it in journald.conf (see man journald.conf).
If not, and your container uses systemd, it will log to memory with limits derived from the host RAM, which can lead to unexpected OOM situations.
Also, if possible, you can increase the amount of RAM (clamav does use quite a bit).
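For example, a minimal journald.conf sketch that enables persistent, size-capped logging (the 100M cap is only an illustration):

[Journal]
# Store logs on disk under /var/log/journal instead of RAM-backed /run/log/journal
Storage=persistent
# Cap the on-disk journal size (illustrative value)
SystemMaxUse=100M

After editing the file, restart journald with systemctl restart systemd-journald for the change to take effect.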
If the node experiences an out of memory (OOM) event prior to the kubelet being able to reclaim memory, the node depends on the oom_killer to respond.
Node out of memory behavior is well described in Kubernetes best practices: Resource requests and limits. Adjust memory requests (minimal threshold) and memory limits (maximal threshold) in your containers.
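As an illustration, a minimal sketch of memory requests and limits on a container in a pod spec (the pod name, image, and values are placeholders, not taken from the question):

apiVersion: v1
kind: Pod
metadata:
  name: build-example            # hypothetical name
spec:
  containers:
  - name: build
    image: node:16               # illustrative image
    resources:
      requests:
        memory: "1Gi"            # minimal threshold, used for scheduling
      limits:
        memory: "4Gi"            # maximal threshold; exceeding it triggers the cgroup OOM kill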
See also: Pods crash and OS syslog shows the OOM killer killing the container process, and Pod memory limit and cgroup memory settings. Kubernetes manages the pod memory limit with cgroups and the OOM killer, so we need to be careful to distinguish OS-level OOM from pod-level OOM.
Try to use the --oom-score-adj option to docker run or even --oom-kill-disable. Refer to Runtime constraints on resources for more info.
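A hedged sketch of those docker run options (image name and values are placeholders); note that these are Docker-level flags and do not apply to containers started by Kubernetes:

# Make this container a less likely OOM-kill target (negative score = lower priority)
docker run --oom-score-adj=-500 my-image:latest

# Or disable the OOM killer for the container entirely; use with care and
# only in combination with a memory limit
docker run --oom-kill-disable --memory=2g my-image:latest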
Also refer to this similar Stack Overflow question for more related information.
My cluster is in AKS with 5 Nodes of size Standard_D4s_v3 and with K8s version 1.14.8.
As soon as a pod is started/restarted it shows Running (kubectl get pods), and up until the pods reach the Running state the CPU usage shows 150m, or as much as they require.
But when I run kubectl top po after a pod has moved to the Running state, that specific pod shows only 1m of CPU usage, while memory usage is where it should be, and the service is down as well.
kubectl logs -f <pod_name> returns nothing, but I can exec into the pods (kubectl exec -it ....).
It's totally normal behavior: when you create a pod it needs more CPU resources during creation; once it's created, it doesn't need that many resources anymore.
You can always use CPU/memory requests and limits; more about it, with examples of how to do it, here:
Pod CPU/Memory requests define a set amount of CPU and memory that the pod needs on a regular basis.
When the Kubernetes scheduler tries to place a pod on a node, the pod requests are used to determine which node has sufficient resources available for scheduling.
Not setting a pod request will default it to the limit defined.
It is very important to monitor the performance of your application to adjust these requests. If insufficient requests are made, your application may receive degraded performance due to over scheduling a node. If requests are overestimated, your application may have increased difficulty getting scheduled.
Pod CPU/Memory limits are the maximum amount of CPU and memory that a pod can use. These limits help define which pods should be killed in the event of node instability due to insufficient resources. Without proper limits set, pods will be killed until resource pressure is lifted.
Pod limits help define when a pod has lost control of resource consumption. When a limit is exceeded, the pod is prioritized for killing to maintain node health and minimize impact to pods sharing the node.
Not setting a pod limit defaults it to the highest available value on a given node.
Don't set a pod limit higher than your nodes can support. Each AKS node reserves a set amount of CPU and memory for the core Kubernetes components. Your application may try to consume too many resources on the node for other pods to successfully run.
Again, it is very important to monitor the performance of your application at different times during the day or week. Determine when the peak demand is, and align the pod limits to the resources required to meet the application's max needs.
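For reference, a few commands that can be used for that kind of monitoring (the namespace and node names are placeholders; kubectl top requires a metrics pipeline such as metrics-server):

# Current CPU/memory usage of pods in a namespace
kubectl top pod -n my-namespace

# Current usage per node
kubectl top node

# Configured requests/limits and allocated resources on a node
kubectl describe node <node-name>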
I am operating Kubernetes.
There are many terminating pods.
And many crond daemons are running in the VM.
Both /var/log/messages and /var/log/crond are empty.
I don't know why so many crond daemons have appeared.
500 crond daemons are executing:
ps -ef | grep crond | wc -l
648
And the load average is 16.
I want to know the relationship between crond and pod termination on Kubernetes.
How could I determine this?
I checked /etc/rsyslog.conf - it's normal.
By default, cron emails the program output to the user who owns a particular crontab, so you can check whether any emails have been delivered under the default path /var/spool/mail.
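A quick sketch of that check (the paths are the usual defaults and may differ per distribution):

# List local mailboxes that cron may have written to
ls -l /var/spool/mail/

# Inspect the most recent cron mail for root, if present
tail -n 50 /var/spool/mail/root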
When you have a long-running or continuous script that never finishes in cron, it can produce multiple cron processes in the process list, so it might be useful to get a tree view of the crontab-specific parent/child processes:
pstree -ap | grep crond
I assume that you have high CPU utilization on your VM, which can potentially degrade overall performance and affect the Kubernetes engine. Although Kubernetes provides a comprehensive mechanism for managing compute resources, it distributes the resources allocated on a specific node among the pods consuming CPU and RAM on that node.
To check general resource utilization on a particular Node, you can use this command:
kubectl describe node <node-name>
To check a pod's termination reason, you can use a similar command as in the above example:
kubectl describe pod <pod_name>
However, when you need to dig deeper into troubleshooting your Kubernetes cluster, I would recommend looking at the official troubleshooting guide.
What I see: Kubernetes takes into account only the memory used by its components when scheduling new pods, and considers the remaining memory as free, even if it's being used by other system processes outside Kubernetes. So, when creating new deployments, it attempts to schedule new pods on a node that is already starved for memory.
What I expected to see: Kubernetes automatically takes into consideration the total memory usage (Kubernetes components + system processes) and schedules the pod on another node.
As a workaround, is there a configuration parameter that I need to set, or is this a bug?
Yes, there are a few parameters for allocating resources:
You can reserve memory and CPU for your pods and for your system daemons manually. In the documentation you can find how it works, with an example:
Example Scenario
Here is an example to illustrate Node Allocatable computation:
Node has 32Gi of memory, 16 CPUs and 100Gi of Storage
--kube-reserved is set to cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved is set to cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard is set to memory.available<500Mi,nodefs.available<10%
Under this scenario, Allocatable will be 14.5 CPUs, 28.5Gi of memory and 98Gi of local storage. The scheduler ensures that the total memory requests across all pods on this node do not exceed 28.5Gi and that storage doesn't exceed 88Gi. The kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi, or if overall disk usage exceeds 88Gi. If all processes on the node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons exceed their reservation, the kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi or storage is greater than 90Gi.
You can reserve as much as you need for Kubernetes with the --kube-reserved flag and for the system with the --system-reserved flag.
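For reference, a sketch of the same reservations from the example scenario expressed in a kubelet configuration file instead of command-line flags (values copied from the scenario above):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserved for Kubernetes system components (kubelet, container runtime, ...)
kubeReserved:
  cpu: "1"
  memory: "2Gi"
  ephemeral-storage: "1Gi"
# Reserved for OS system daemons (sshd, udev, ...)
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"
# Hard eviction thresholds
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"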
Additionally, if you need stricter rules for spawning pods, you could try using Pod Affinity.
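As an illustration of that idea, here is a hedged sketch of the anti-affinity variant of the same mechanism, keeping pods that carry a hypothetical app=memory-heavy label off nodes that already run one (fragment of a pod spec):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: memory-heavy        # hypothetical label
        topologyKey: kubernetes.io/hostname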
The kubelet has the parameter --system-reserved that allows you to reserve CPU and memory for system processes.
It is not dynamic (you reserve resources only at launch), but it is the only way to tell the kubelet not to use all the resources on the node.
--system-reserved mapStringString
A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]
I have celery running in a Docker container on GCP with Kubernetes. Its workers have recently started getting kill -9'd, which looks like the work of the OOM killer. There are no OOM events in kubectl get events, which is to be expected if these events only appear when a pod has exceeded its resources.limits.memory value.
So my theory is that the celery process getting killed is the work of Linux's own OOM killer. This doesn't make sense though: if so much memory is consumed that the OOM killer enters the stage, how is it possible that this pod was scheduled in the first place? (This assumes that Kubernetes does not allow scheduling of new pods if the sum of resources.limits.memory exceeds the amount of memory available to the system.)
However, I am not aware of any other plausible reason for these SIGKILLs than the OOM killer.
An example of celery error (there is one for every worker):
[2017-08-12 07:00:12,124: ERROR/MainProcess] Process 'ForkPoolWorker-7' pid:16 exited with 'signal 9 (SIGKILL)'
[2017-08-12 07:00:12,208: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Containers can be OOMKilled for two reasons.
If they exceed the memory limits set for them. Limits are specified on a per-container basis, and if the container uses more memory than the limit it will be OOMKilled. From the process's point of view this is the same as if the system ran out of memory.
If the system runs out of memory. There are two kinds of resource specifications in Kubernetes: requests and limits. Limits specify the maximum amount of memory the container can use before being OOMKilled. Requests are used to schedule Pods and default to the limits if not specified. Requests must be less than or equal to container limits. That means that containers could be overcommitted on nodes and OOMKilled if multiple containers are using more memory than their respective requests at the same time.
For instance, if both process A and process B have request of 1GB and limit of 2GB, they can both be scheduled on a node that has 2GB of memory because requests are what is used for scheduling. Having requests less than the limit generally means that the container can burst up to 2GB but will usually use less than 1GB. Now, if both burst above 1GB at the same time the system can run out of memory and one container will get OOMKilled while still being below the limit set on the container.
You can debug whether the container is being OOMKilled by examining the containerStatuses field on the Pod.
$ kubectl get pod X -o json | jq '.status.containerStatuses'
If the pod was OOMKilled it will usually say something to that effect in the lastState field. In your case it looks like it may have been an OOM error based on issues filed against celery (like this one).
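As a follow-up, a hedged one-liner that pulls out just the last termination reason (the pod name X and the container index 0 are placeholders); it prints OOMKilled when the container was killed by the OOM killer:

kubectl get pod X -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'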