Excessive CPU use of azure mdsd - azure

I am running a Kubernetes cluster on Azure deployed using "azure acs ...". Recently I noticed that the pods on one of the nodes were not responsive and that the CPU on the node was maxed out. I logged in, executed top and found that a process called "mdsd" was using up all available CPU.
When I killed that process with "sudo kill -9", the CPU usage returned to normal and my pods were working fine.
It seems to me like "mdsd" is part of the Azure linux monitoring framework. I installed omi-1.4.0-6.ssl_100.ulinux.x64.deb.
Is there a way to make sure that mdsd is not eating up all my CPU and stopping my pods from working properly?

Related

Azure AKS Prometheus-operator double metrics

I'm running an Azure AKS cluster (1.15.11) with prometheus-operator 8.15.6 installed as a helm chart, and Kubernetes Dashboard shows different metrics from the ones shown in the Prometheus Grafana dashboards.
The application pod being monitored has three containers in it. Kubernetes Dashboard shows that the memory consumption for this pod is ~250MB, while the standard prometheus-operator dashboard displays almost exactly double that value, ~500MB.
At first we thought our monitoring setup might be misconfigured. Since prometheus-operator is installed as a standard helm chart, the DaemonSet for node exporter ensures that every node has exactly one exporter deployed, so duplicate exporters shouldn't be the reason. However, after migrating our cluster to different node pools, I noticed that when our application runs on a user node pool instead of the system node pool, the metrics match exactly in both tools. I know that the system node pool runs CoreDNS and tunnelfront, but I assume these run as separate components. I'm also aware that running infrastructure and applications in the same node pool is not the best choice overall.
However, I'm still wondering why running the application on the system node pool causes the metrics reported by Prometheus to be doubled.
I ran into a similar problem (aks v1.14.6, prometheus-operator v0.38.1) where all my values were multiplied by a factor of 3. It turns out that before you remove or reinstall prometheus-operator, you have to remove the extra endpoints called prometheus-operator-kubelet that are created in the kube-system namespace during install, since Prometheus aggregates the metric types collected for each endpoint.
Log in to the Prometheus pod and check the status page. There should be as many endpoints as there are nodes in the cluster; if there are more, you have a surplus of endpoints.
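The same check can be scripted against the JSON that Prometheus serves from its /api/v1/targets HTTP endpoint. What follows is only a minimal sketch, not the method used in the answer above: it assumes the response has already been fetched and parsed into a dict, and the function name duplicated_scrape_urls, the IP addresses, and the service label values in the sample are hypothetical. It reports any scrape URL that appears more than once among the active targets, which is what a leftover kubelet endpoint would look like.

```python
from collections import Counter


def duplicated_scrape_urls(targets_response):
    """Return {scrape URL: count} for URLs scraped by more than one active target.

    `targets_response` is the parsed JSON body of Prometheus' /api/v1/targets.
    """
    active = targets_response["data"]["activeTargets"]
    counts = Counter(target["scrapeUrl"] for target in active)
    return {url: n for url, n in counts.items() if n > 1}


# Hypothetical sample: node 10.0.0.4 is scraped twice because two
# kubelet services (one left over from an earlier install) point at it.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "https://10.0.0.4:10250/metrics/cadvisor",
             "labels": {"service": "prometheus-operator-kubelet"}},
            {"scrapeUrl": "https://10.0.0.4:10250/metrics/cadvisor",
             "labels": {"service": "leftover-kubelet"}},
            {"scrapeUrl": "https://10.0.0.5:10250/metrics/cadvisor",
             "labels": {"service": "prometheus-operator-kubelet"}},
        ]
    },
}

if __name__ == "__main__":
    for url, n in duplicated_scrape_urls(sample).items():
        print(f"{url} is scraped by {n} targets")
```

An empty result means every URL is scraped once; any entry in the result is worth comparing against the services and endpoints that exist in kube-system.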

How to change host machine time for Kubernetes cluster on Azure

How do I change system time (not timezone) for all containers deployed on Azure Kubernetes cluster?
Can this be changed from inside container / pods? I guess it should be changeable from host machine. How to do that?
I don't believe this is possible.
Time comes from the underlying kernel and that is not something that you will be able to adjust from code that runs in a pod.
Even if you could, I suspect it would cause a whole heap of trouble; the pod time and api-server time would be inconsistent and that won't end well!

Node goes to unusable state when using GPU Container supported VMs in Azure batch pool

I am trying to create a pool of GPU-based VMs with container support. I have a valid ContainerConfiguration and start task. The VM size is Standard_NC6. But whenever I create a pool, the nodes always go to the unusable state. If I remove the ContainerConfiguration setting, the nodes reach the idle state. However, I don't see a problem with the ContainerConfiguration settings themselves, because if I choose the VM size standard_f2s_v2 (non-GPU) and keep the same ContainerConfiguration settings, it works fine and installs all images on the machine. I think it has to do with the installation of some NVIDIA libraries while the nodes are being set up.

Docker containers freezing

I'm currently trying to deploy a node.js app in Docker containers. I need to deploy 30 of them, but at some point they start behaving strangely and some of them freeze.
I am currently running Docker for Windows 18.03.0-ce, build 0520e24302. My computer specs (CPU and memory):
I5 4670 K
24 GB of ram
My Docker default machine resource allocation is the following:
Allocated RAM : 10 GB
Allocated vCPUs : 4
My node application runs on Alpine 3.8 and node.js 11.4 and mostly makes HTTP requests every 2-3 seconds.
When I deploy 20 containers, everything runs like a charm: my application does its job and I can see activity on every one of my containers through the logs and activity stats.
The problem comes when I try to deploy more than 20 containers. I notice that some of the previously deployed containers stop their activity (0% CPU usage, logs freezing). When everything is deployed (30 containers), Docker starts to block the activity of some of them and unblocks them at some point, only to block some others (the blocking/unblocking is random). It seems to be sequential. I tried waiting to see what happened, and the result is that some of the containers are able to pursue their activity while others are stuck forever (still running, but with no more activity).
It's important to note that I used the following resource restrictions on each of my containers:
MemoryReservation : 160mb
Memory soft limit : 160mb
NanoCPUs : 250000000 (0.25 cpus)
I had to increase my Docker default machine resource allocation and decrease the containers' resource allocation because it was using almost 100% of my CPU. Maybe I made a mistake in my configuration. I tried to tweak those values, but with no success: I still have some containers freezing.
I'm kind of lost right now.
Any help would be appreciated, even a little, thank you in advance!
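One thing the figures above allow is a quick budget check. This is only a rough sketch of the arithmetic, not a diagnosis: it takes the numbers from the question (NanoCPUs of 250000000 per container, 4 vCPUs allocated to the Docker machine), and the function name cpu_budget is hypothetical. NanoCPUs is an upper cap rather than a reservation, so a total above the allocated vCPUs only means the containers can contend for CPU when they are all busy at once; it does not prove that contention is what freezes them.

```python
def cpu_budget(container_count, nano_cpus_per_container, host_vcpus):
    """Sum the per-container CPU caps and compare them with the available vCPUs.

    Returns (total_cap_in_cpus, exceeds_host_vcpus).
    """
    # Docker expresses NanoCPUs in units of 1e-9 CPU, so 250000000 is 0.25 CPU.
    cap_per_container = nano_cpus_per_container / 1_000_000_000
    total_cap = container_count * cap_per_container
    return total_cap, total_cap > host_vcpus


if __name__ == "__main__":
    for count in (20, 30):
        total, exceeds = cpu_budget(count, 250_000_000, 4)
        print(f"{count} containers: caps sum to {total} CPUs, "
              f"more than the 4 allocated vCPUs: {exceeds}")
```

With these values the caps already add up to more than 4 CPUs at 20 containers, so the CPU cap alone does not explain why the trouble starts only beyond 20. The memory side looks less tight on paper: 30 containers at 160 MB each come to 4800 MB, under the 10 GB allocated.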

Kubernetes NodeLost/NotReady / High IO Disks

I am experiencing a very complicated issue with Kubernetes in my production environments: they lose all their agent nodes, which change from Ready to NotReady, and all the pods change from Running to the NodeLost state. I have discovered that Kubernetes is making intensive use of the disks.
My cluster is deployed using acs-engine 0.17.0 (and I tested previous versions too and the same happened).
On the other hand, we decided to deploy the Standard_DS2_VX VM series, which comes with Premium disks, and we increased the IOPS to 2000 (it was previously under 500 IOPS), and the same thing happened. I am going to try with a higher number now.
Any help on this will be appreciated.
It was a microservice exhausting resources, after which Kubernetes simply halted the nodes. We have since worked on establishing resource requests/limits so that we can avoid disrupting the entire cluster.
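For readers unfamiliar with what such limits look like, here is a minimal sketch that builds a pod manifest with the standard resources.requests and resources.limits fields and prints it as JSON, which kubectl accepts as well as YAML. It is not the configuration used in the answer above: the helper name container_with_limits, the pod name, the image, and every CPU and memory value are hypothetical placeholders.

```python
import json


def container_with_limits(name, image, cpu_request, cpu_limit, mem_request, mem_limit):
    """Build a container spec whose CPU and memory use is bounded by limits."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # What the scheduler sets aside for the container on a node.
            "requests": {"cpu": cpu_request, "memory": mem_request},
            # The ceiling the container is not allowed to exceed.
            "limits": {"cpu": cpu_limit, "memory": mem_limit},
        },
    }


# Hypothetical pod; the name, image, and quantities are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-microservice"},
    "spec": {
        "containers": [
            container_with_limits(
                "app", "example/app:1.0", "250m", "500m", "256Mi", "512Mi"
            )
        ]
    },
}

if __name__ == "__main__":
    print(json.dumps(pod, indent=2))
```

With limits in place, a misbehaving container is throttled or killed on its own instead of starving the rest of the node.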
