Kubernetes Pods not using CPU more than 1m - azure

My cluster is in AKS with 5 Nodes of size Standard_D4s_v3 and with K8s version 1.14.8.
As soon as a pod is started/restarted it shows Running (kubectl get pods), and until the pods reach the Running state the CPU usage shows 150m or as much as they require.
But when I top it (kubectl top po) after a pod has moved to the Running state, that specific pod shows only 1m of CPU usage, while memory usage is where it should be, and the service is down as well.
kubectl logs -f (pod_name) returns nothing, but I can exec into the pods (kubectl exec -it ...).

This is totally normal behavior: when you create a pod it needs more CPU resources during startup, and once it's created it doesn't need that many resources anymore.
You can always set CPU/memory requests and limits; the AKS documentation covers this in more detail, with examples of how to do it.
Pod CPU/Memory requests define a set amount of CPU and memory that the pod needs on a regular basis.
When the Kubernetes scheduler tries to place a pod on a node, the pod requests are used to determine which node has sufficient resources available for scheduling.
Not setting a pod request will default it to the limit defined.
It is very important to monitor the performance of your application to adjust these requests. If insufficient requests are made, your application may receive degraded performance due to over scheduling a node. If requests are overestimated, your application may have increased difficulty getting scheduled.
Pod CPU/Memory limits are the maximum amount of CPU and memory that a pod can use. These limits help define which pods should be killed in the event of node instability due to insufficient resources. Without proper limits set, pods will be killed until resource pressure is lifted.
Pod limits help define when a pod has lost control of its resource consumption. When a limit is exceeded, the pod is prioritized for killing to maintain node health and minimize impact to pods sharing the node.
Not setting a pod limit defaults it to the highest available value on a given node.
Don't set a pod limit higher than your nodes can support. Each AKS node reserves a set amount of CPU and memory for the core Kubernetes components. Your application may try to consume too many resources on the node for other pods to successfully run.
Again, it is very important to monitor the performance of your application at different times during the day or week. Determine when the peak demand is, and align the pod limits to the resources required to meet the application's max needs.
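As a sketch, a pod spec that sets both requests and limits might look like the following (the name, image, and values are illustrative placeholders, not taken from your cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                        # hypothetical name
spec:
  containers:
  - name: my-app
    image: myregistry/my-app:latest   # hypothetical image
    resources:
      requests:
        cpu: 250m        # amount used by the scheduler to place the pod
        memory: 256Mi
      limits:
        cpu: 500m        # hard cap; CPU is throttled above this
        memory: 512Mi    # exceeding this gets the container OOMKilled
```

Adjust the values based on what monitoring shows the application actually uses at steady state and at peak.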

Related

Kubernetes workload scaling on multi-threaded code

I'm getting started with Kubernetes, so I have the following question:
Say a microservice has the following C# code snippet:
var tasks = _componentBuilders.Select(b =>
{
    return Task.Factory.StartNew(() => b.SetReference(context, typedModel));
});
Task.WaitAll(tasks.ToArray());
On my box, I understand that each task's thread will be executed on a vCPU. So if I have 4 cores with hyperthreading enabled I will be able to execute 8 tasks concurrently. Therefore, if I have about 50,000 tasks, it will take roughly
(50,000/8) * approximate time per task
to complete this work. This ignores context switch, etc.
Now, moving to the cloud, and assuming this code is in a Docker container managed by a Kubernetes Deployment with a single container per VM to keep things simple: how does the above code scale horizontally across the VMs in the deployment? I can't find very clear guidance on this, so if anyone has any reference material, that would be helpful.
You'll typically use a Kubernetes Deployment object to deploy application code. That has a replicas: setting, which launches some number of identical disposable Pods. Each Pod has a container, and each pod will independently run the code block you quoted above.
The challenge here is distributing work across the Pods. If each Pod generates its own 50,000 work items, they'll all do the same work and things won't happen any faster. Just running your application in Kubernetes doesn't give you any prebuilt way to share thread pools or task queues between Pods.
A typical approach here is to use a job-queue system; RabbitMQ is a popular open-source option. One part of the system generates the tasks and writes them into RabbitMQ. One or more workers read jobs from the queue and run them. You can set this up and demonstrate it to yourself without using container technology, then repackage it in Docker or Kubernetes, changing only the RabbitMQ broker address at deploy time.
In this setup I'd probably have each worker run jobs serially, one at a time, with no threading. That simplifies the implementation of the worker. If you want to run more jobs in parallel, run more workers; in Kubernetes, increase the Deployment's replicas: count.
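A minimal in-process sketch of that producer/worker pattern, using Python's standard-library queue in place of RabbitMQ (the task function and counts are made up for illustration; in the real setup each worker would be a separate pod consuming from the broker):

```python
import queue
import threading

def run_task(item):
    # Stand-in for the real work (e.g. b.SetReference(...) in the C# example)
    return item * 2

def worker(q, results):
    # Each worker processes jobs serially, one at a time
    while True:
        item = q.get()
        if item is None:          # sentinel: no more work for this worker
            q.task_done()
            break
        results.append(run_task(item))
        q.task_done()

def run(num_workers, items):
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:            # the "producer" writes jobs to the queue
        q.put(item)
    for _ in threads:             # one sentinel per worker
        q.put(None)
    q.join()
    for t in threads:
        t.join()
    return results

print(sorted(run(4, range(10))))  # each job is handled exactly once
```

In Kubernetes, each Deployment replica plays the role of one of these worker threads, and RabbitMQ plays the role of the queue; scaling out means raising replicas, not adding threads.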
In Kubernetes, when we deploy containers as Pods we can include the resources.limits.cpu and resources.requests.cpu fields for each container in the Pod's manifest:
resources:
  requests:
    cpu: "1000m"
  limits:
    cpu: "2000m"
In the example above we have a request for 1 CPU and a limit of a maximum of 2 CPUs. This means the Pod will be scheduled onto a worker node that can satisfy the above resource requirements.
One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyperthread on bare-metal Intel processors.
We can vertically scale by increasing / decreasing the values for the requests and limits fields. Or we can horizontally scale by increasing / decreasing the number of replicas of the pod.
For more details, see the Kubernetes documentation on resource units.
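Putting the two together, a sketch of a Deployment that can be scaled either way (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                      # hypothetical name
spec:
  replicas: 3                       # horizontal scaling: change this count
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
      - name: worker
        image: example/worker:1.0   # placeholder image
        resources:
          requests:
            cpu: "1000m"            # vertical scaling: adjust these values
          limits:
            cpu: "2000m"
```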

Start kubernetes pod memory depending on size of data job

Is there a way to dynamically scale the memory size of a Pod based on the size of the data job (my use case)?
Currently we have Job and Pods that are defined with memory amounts, but we wouldn't know how big the data will be for a given time-slice (sometimes 1000 rows, sometimes 100,000 rows).
So it will break if the data is bigger than the memory we have allocated beforehand.
I have thought of slicing by data volume, i.e. cutting every 10,000 rows, since we would know the memory requirement for processing a fixed number of rows. But we are trying to aggregate by time, hence the need for a time-slice.
Or any other solutions, like Spark on kubernetes?
Another way of looking at it:
How can we do an implementation of Cloud Dataflow in Kubernetes on AWS
It's a best practice to always define resources in your container definition, in particular:
limits: the upper bound of CPU and memory
requests: the minimal level of CPU and memory
This allows the scheduler to make a better decision, and it eases the assignment of a Quality of Service (QoS) class for each pod (https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/), which falls into three possible classes:
Guaranteed (highest priority): when requests = limits
Burstable: when requests < limits
BestEffort (lowest priority): when requests and limits are not set.
The QoS class provides a criterion for killing pods when the system is overcommitted.
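For illustration, two container resource blocks and the QoS class each would produce (values are arbitrary):

```yaml
# Guaranteed: requests equal limits for every resource
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi
---
# Burstable: requests set lower than limits
resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
```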
If you don’t know the memory requirement for your pod a priori for a given time-slice, then it is difficult for the Kubernetes Cluster Autoscaler to automatically scale the node pool for you, as per this documentation [1]. Therefore both of your suggestions, running either Cloud Dataflow or Spark on Kubernetes with the Cluster Autoscaler, may not work for your case.
However, you can use custom scaling as a workaround. For example, you can export memory-related metrics of the pod to Stackdriver, then deploy a HorizontalPodAutoscaler (HPA) resource to scale your application, as in [2].
[1] https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#how_cluster_autoscaler_works
[2] https://cloud.google.com/kubernetes-engine/docs/tutorials/custom-metrics-autoscaling
I have found the partial solution to this.
Note there are 2 parts to this problem.
1. Make the Pod request the correct amount of memory depending on size of data job
2. Ensure that this Pod can find a Node to run on.
The Kubernetes Cluster Autoscaler (CA) can solve part 2.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
According to the readme:
Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when there are pods that failed to run in the cluster due to insufficient resources.
Thus if there is a data job that needs more memory than available in the currently running nodes, it will start a new node by increasing the size of a node group.
Details:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
I am still unsure how to do point 1.
An alternative to point 1 is to start the container without a specific memory request or limit:
https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#if-you-don-t-specify-a-memory-limit
If you don’t specify a memory limit for a Container, then one of these
situations applies:
The Container has no upper bound on the amount of memory it uses.
or
The Container could use all of the memory available on the Node where it is running.
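One caveat with that alternative: if the namespace has a default memory limit configured via a LimitRange, that default is applied instead of "no limit". A sketch of such a default (name and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults       # hypothetical name
spec:
  limits:
  - type: Container
    default:               # limit applied when a container specifies none
      memory: 512Mi
    defaultRequest:        # request applied when a container specifies none
      memory: 256Mi
```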

Kubernetes doesn't take into account total node memory usage when starting Pods

What I see: Kubernetes takes into account only the memory used by its components when scheduling new Pods, and considers the remaining memory as free, even if it's being used by other system processes outside Kubernetes. So, when creating new deployments, it attempts to schedule new pods on a suffocated node.
What I expected to see: Kubernetes automatically takes into consideration the total memory usage (by Kubernetes components + system processes) and schedules the pod on another node.
As a work-around, is there a configuration parameter that I need to set or is it a bug?
Yes, there are a few parameters to allocate resources:
You can manually allocate memory and CPU for your pods, and allocate memory and CPU for your system daemons. The documentation explains how it works with an example:
Example Scenario
Here is an example to illustrate Node Allocatable computation:
Node has 32Gi of memory, 16 CPUs and 100Gi of Storage
--kube-reserved is set to cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved is set to cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard is set to memory.available<500Mi,nodefs.available<10%
Under this scenario, Allocatable will be 14.5 CPUs, 28.5Gi of memory and 98Gi of local storage. The scheduler ensures that the total memory requests across all pods on this node do not exceed 28.5Gi and that storage doesn’t exceed 88Gi. Kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi, or if overall disk usage exceeds 88Gi. If all processes on the node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons exceed their reservation, kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi or storage is greater than 90Gi
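The numbers in that scenario can be reproduced with a quick back-of-the-envelope calculation (a sketch; Gi values treated as plain numbers):

```python
# Node capacity from the example scenario
node_cpu = 16
node_mem_gi = 32
node_storage_gi = 100

# --kube-reserved:   cpu=1,    memory=2Gi, ephemeral-storage=1Gi
# --system-reserved: cpu=500m, memory=1Gi, ephemeral-storage=1Gi
# --eviction-hard:   memory.available<500Mi, nodefs.available<10%
kube_cpu, kube_mem, kube_storage = 1.0, 2.0, 1.0
sys_cpu, sys_mem, sys_storage = 0.5, 1.0, 1.0
evict_mem_gi = 0.5
evict_nodefs_frac = 0.10

# Allocatable = capacity - kube-reserved - system-reserved
# (memory additionally subtracts the hard eviction threshold)
alloc_cpu = node_cpu - kube_cpu - sys_cpu                        # 14.5 CPUs
alloc_mem = node_mem_gi - kube_mem - sys_mem - evict_mem_gi      # 28.5 Gi
alloc_storage = node_storage_gi - kube_storage - sys_storage     # 98 Gi

# Disk eviction kicks in at allocatable minus the 10% nodefs threshold
disk_evict_gi = alloc_storage - evict_nodefs_frac * node_storage_gi  # 88 Gi

# Without enforcement of the reservations, memory eviction happens at
mem_evict_gi = node_mem_gi - evict_mem_gi                        # 31.5 Gi

print(alloc_cpu, alloc_mem, alloc_storage, disk_evict_gi, mem_evict_gi)
```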
You can allocate as much as you need for Kubernetes with the flag --kube-reserved and for the system with the flag --system-reserved.
Additionally, if you need stricter rules for spawning pods, you could try to use Pod Affinity.
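As a sketch of that last suggestion, a podAntiAffinity rule that prefers spreading replicas of the same app across nodes (the label key/value is a placeholder):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app          # placeholder label
        topologyKey: kubernetes.io/hostname
```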
Kubelet has the parameter --system-reserved that allows you to make a reservation of cpu and memory for system processes.
It is not dynamic (you reserve resources only at launch), but it is the only way to tell Kubelet not to use all of the resources on the node.
--system-reserved mapStringString
A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]

I/O monitoring on Kubernetes / CoreOS nodes

I have a Kubernetes cluster, provisioned with kops, running on CoreOS workers. From time to time I see significant load spikes that correlate with I/O spikes reported in Prometheus by the node_disk_io_time_ms metric. The thing is, I seem to be unable to use any metric to pinpoint where this I/O workload actually originates. Metrics like container_fs_* seem to be useless, as I always get zero values for actual containers, and data only for the whole node.
Any hints on how I can approach the issue of locating what is to blame for the I/O load on a kube cluster / CoreOS node are very welcome.
If you are using nginx ingress you can configure it with
enable-vts-status: "true"
This will give you a bunch of Prometheus metrics for each pod that is behind the ingress. The metric names start with nginx_upstream_.
In case it is a cronjob creating the spikes, install the node-exporter daemonset and check the container_fs_* metrics.

Possible OOM in GCP container – how to debug?

I have Celery running in a Docker container on GCP with Kubernetes. Its workers have recently started to get kill -9'd – this looks like the work of the OOMKiller. There are no OOM events in kubectl get events, which is to be expected if these events only appear when a pod has exceeded its resources.limits.memory value.
So my theory is that the Celery process getting killed is the work of Linux's own OOMKiller. This doesn't make sense, though: if so much memory is consumed that the OOMKiller enters the stage, how is it possible that this pod was scheduled in the first place? (Assuming that Kubernetes does not allow scheduling of new pods if the sum of resources.limits.memory exceeds the amount of memory available to the system.)
However, I am not aware of any other plausible reason for these SIGKILLs than the OOMKiller.
An example of celery error (there is one for every worker):
[2017-08-12 07:00:12,124: ERROR/MainProcess] Process 'ForkPoolWorker-7' pid:16 exited with 'signal 9 (SIGKILL)'
[2017-08-12 07:00:12,208: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Containers can be OOMKilled for two reasons.
If they exceed the memory limits set for them. Limits are specified on a per-container basis, and if the container uses more memory than its limit it will be OOMKilled. From the process's point of view this is the same as if the system ran out of memory.
If the system runs out of memory. There are two kinds of resource specifications in Kubernetes: requests and limits. Limits specify the maximum amount of memory the container can use before being OOMKilled. Requests are used to schedule Pods and default to the limits if not specified. Requests must be less than or equal to container limits. That means that containers could be overcommitted on nodes and OOMKilled if multiple containers are using more memory than their respective requests at the same time.
For instance, if both process A and process B have request of 1GB and limit of 2GB, they can both be scheduled on a node that has 2GB of memory because requests are what is used for scheduling. Having requests less than the limit generally means that the container can burst up to 2GB but will usually use less than 1GB. Now, if both burst above 1GB at the same time the system can run out of memory and one container will get OOMKilled while still being below the limit set on the container.
You can debug whether the container is being OOMKilled by examining the containerStatuses field on the Pod.
$ kubectl get pod X -o json | jq '.status.containerStatuses'
If the pod was OOMKilled it will usually say something to that effect in the lastState field. In your case it looks like it may have been an OOM error based on issues filed against celery (like this one).
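When the container was OOM-killed, the lastState entry in that output typically takes this shape (trimmed; the restart count is illustrative, exit code 137 is 128 + SIGKILL's signal number 9):

```json
{
  "lastState": {
    "terminated": {
      "reason": "OOMKilled",
      "exitCode": 137
    }
  },
  "restartCount": 3
}
```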
