I/O monitoring on Kubernetes / CoreOS nodes - io

I have a Kubernetes cluster. Provisioned with kops, running on CoreOS workers. From time to time I see a significant load spikes, that correlate with I/O spikes reported in Prometheus from node_disk_io_time_ms metric. The thing is, I seem to be unable to use any metric to pinpoint where this I/O workload actually originates from. Metrics like container_fs_* seem to be useless as I always get zero values for actual containers, and any data only for whole node.
Any hints on how can I approach the issue of locating what is to be blamed for I/O load in kube cluster / coreos node very welcome

If you are using nginx ingress you can configure it with
enable-vts-status: "true"
This will give you a bunch of prometheus metrics for each pod that has on ingress. The metric names start with nginx_upstream_
In case it is the cronjob creating the spikes, install node-exporter daemonset and check the metrics container_fs_

Related

kube-api server high cpu [migrated]

This question was migrated from Stack Overflow because it can be answered on Server Fault.
Migrated 20 days ago.
I want to know how I can check why one of my ctrl node and kubernetes consumes more cpu than the others.
I have a cluster with 3 ctrl nodes and 4 worker nodes.
I have an nginx load balancer with the least_conn algorithm to distribute the requests to the ctrl nodes.
Monitoring the resources with the top command, I observe that of the three ctrl nodes, the kube api server process always in the first ctrl node gives me a cpu usage above 100%, unlike the other ctrl nodes where the kube-api server uses less than 20%.
I want to know why?
And how can I see that same representation of consumption, be it pod, containers. nodes in grafana
After finding what happens in your cluster using kubctl top node and kubectl top pod, you can further diagnose what is happening with kubectl logs $pod -c $container on the pod.
At this point, it is up to the container to provide information on what it is doing, so ideally, you would collect metrics in the pods to get a quick insight into what is happening on your cluster using e.g. Grafana. You can also have a look at the resources assigned to your pod using kubectl get pod $pod -o jsonpath='{.spec.containers[].resources}'.
In your case, the log messages of the kubernetes apiserver should give you a hint. Probably, something (another container/pod maybe) is clogging up your API server.

Azure AKS Prometheus-operator double metrics

I'm running Azure AKS Cluster 1.15.11 with prometheus-operator 8.15.6 installed as a helm chart and I'm seeing some different metrics displayed by Kubernetes Dashboard compared to the ones provided by prometheus Grafana.
An application pod which is being monitored has three containers in it. Kubernetes-dashboard shows that the memory consumption for this pod is ~250MB, standard prometheus-operator dashboard is displaying almost exactly double value for the memory consumption ~500MB.
At first we thought that there might be some misconfiguration on our monitoring setup. Since prometheus-operator is installed as standard helm chart, Daemon Set for node exporter ensures that every node has exactly one exporter deployed so duplicate exporters shouldn't be the reason. However, after migrating our cluster to different node pools I've noticed that when our application is running on user node pool instead of system node pool metrics does match exactly on both tools. I know that system node pool is running CoreDNS and tunnelfront but I assume these are running as separate components also I'm aware that overall it's not the best choice to run infrastructure and applications in the same node pool.
However, I'm still wondering why running application under system node pool causes metrics by prometheus to be doubled?
I ran into a similar problem (aks v1.14.6, prometheus-operator v0.38.1) where all my values were multiplied by a factor of 3. Turns out you have to remember to remove the extra endpoints called prometheus-operator-kubelet that are created in the kube-system-namespace during install before you remove / reinstall prometheus-operator since Prometheus aggregates the metric types collected for each endpoint.
Log in to the Prometheus-pod and check the status page. There should be as many endpoints as there are nodes in the cluster, otherwise you may have a surplus of endpoints:

Kubernetes Pods not using CPU more than 1m

My cluster is in AKS with 5 Nodes of size Standard_D4s_v3 and with K8s version 1.14.8.
As soon as a pod is started/restarted it shows Running (kubectl get pods) and up until the pods are in Running state the CPU usage shows 150m or as much as they require.
But when I top it (kubectl top po) after a pod has moved to Running state, the specific pod shows only 1m CPU usage, but Memory usage is where they should be and the service is down as well.
Kubectl logs -f (pod_name) returns nothing but I can ssh into the pods(kubectl exec -it ....)
It's totally normal behavior, if You create pod it needs more CPU resources to create it, once it's created it doesn't need that much resources anymore.
You can always use cpu/memory limits and resources, more about it with examples how to do it here
Pod CPU/Memory requests define a set amount of CPU and memory that the pod needs on a regular basis.
When the Kubernetes scheduler tries to place a pod on a node, the pod requests are used to determine which node has sufficient resources available for scheduling.
Not setting a pod request will default it to the limit defined.
It is very important to monitor the performance of your application to adjust these requests. If insufficient requests are made, your application may receive degraded performance due to over scheduling a node. If requests are overestimated, your application may have increased difficulty getting scheduled.
Pod CPU/Memory limits are the maximum amount of CPU and memory that a pod can use. These limits help define which pods should be killed in the event of node instability due to insufficient resources. Without proper limits set pods will be killed until resource pressure is lifted.
Pod limits help define when a pod has lost of control of resource consumption. When a limit is exceeded, the pod is prioritized for killing to maintain node health and minimize impact to pods sharing the node.
Not setting a pod limit defaults it to the highest available value on a given node.
Don't set a pod limit higher than your nodes can support. Each AKS node reserves a set amount of CPU and memory for the core Kubernetes components. Your application may try to consume too many resources on the node for other pods to successfully run.
Again, it is very important to monitor the performance of your application at different times during the day or week. Determine when the peak demand is, and align the pod limits to the resources required to meet the application's max needs.

Start kubernetes pod memory depending on size of data job

is there a way to scale dynamically the memory size of Pod based on size of data job (my use case)?
Currently we have Job and Pods that are defined with memory amounts, but we wouldn't know how big the data will be for a given time-slice (sometimes 1000 rows, sometimes 100,000 rows).
So it will break if the data is bigger than the memory we have allocated beforehand.
I have thought of using slices by data volume, i.e. cut by every 10,000 rows, we will know memory requirement of processing a fixed amount of rows. But we are trying to aggregate by time hence the need for time-slice.
Or any other solutions, like Spark on kubernetes?
Another way of looking at it:
How can we do an implementation of Cloud Dataflow in Kubernetes on AWS
It's a best practice always define resources in your container definition, in particular:
limits:the upper level of CPU and memory
requests: the minimal level of CPU and memory
This allows the scheduler to take a better decision and it eases the assignment of Quality of Service (QoS) for each pod (https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) which falls into three possible classes:
Guaranteed (highest priority): when requests = limits
Burstable: when requests < limits
BestEffort (lowest priority): when requests and limits are not set.
The QoS enables a criterion for killing pods when the system is overcommited.
If you don’t know the memory requirement for your pod a priori for a given time-slice, then it is difficult for Kubernete Cluster Autoscaler to automatically scale node pool for you as per this documentation [1]. Therefore for both of your suggestions like running either Cloud Dataflow or Spark on Kubernete with Kubernete Cluster Autoscaler, may not work for your case.
However, you can use custom scaling as a workaround. For example, you can export memory related metrics of the pod to Stackdriver, then deploy HorizontalPodAutoscaler (HPA) resource to scale your application as [2].
[1] https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#how_cluster_autoscaler_works
[2] https://cloud.google.com/kubernetes-engine/docs/tutorials/custom-metrics-autoscaling
I have found the partial solution to this.
Note there are 2 parts to this problem.
1. Make the Pod request the correct amount of memory depending on size of data job
2. Ensure that this Pod can find a Node to run on.
The Kubernetes Cluster Autoscaler (CA) can solve part 2.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
According to the readme:
Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when there are pods that failed to run in the cluster due to insufficient resources.
Thus if there is a data job that needs more memory than available in the currently running nodes, it will start a new node by increasing the size of a node group.
Details:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
I am still unsure how to do point 1.
An alternative to point 1, start the container without specific memory request or limit:
https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#if-you-don-t-specify-a-memory-limit
If you don’t specify a memory limit for a Container, then one of these
situations applies:
The Container has no upper bound on the amount of memory it uses.
or
The Container could use all of the memory available on the Node where it is running.

Kubernetes NodeLost/NotReady / High IO Disks

I am experiencing a very complicated issue with Kubernetes in my production environments losing all their Agent Nodes, they change from Ready to NotReady, all the pods change from Running to NodeLost state. I have discovered that Kubernetes is making intensive usage of disks:
My cluster is deployed using acs-engine 0.17.0 (and I tested previous versions too and the same happened).
On the other hand, we decided to deploy the Standard_DS2_VX VM series which contains Premium disks and we incresed the IOPS to 2000 (It was previously under 500 IOPS) and same thing happened. I am going to try with a higher number now.
Any help on this will be appreaciated.
It was a microservice exhauting resources and then Kubernetes just halt the nodes. We have worked on establishing resources/limits based so we can avoid the entire cluster disruption.

Resources