Kubernetes NodeLost/NotReady / High I/O Disks - Azure

I am experiencing a very complicated issue with Kubernetes in my production environments: they lose all their agent nodes, the nodes change from Ready to NotReady, and all the pods change from Running to NodeLost state. I have discovered that Kubernetes is making intensive use of the disks.
My cluster is deployed using acs-engine 0.17.0 (I tested previous versions too and the same thing happened).
On the other hand, we decided to deploy the Standard_DS2_VX VM series, which uses Premium disks, and we increased the IOPS to 2000 (it was previously under 500 IOPS), but the same thing happened. I am going to try with a higher number now.
Any help on this will be appreciated.

It turned out to be a microservice exhausting node resources, after which Kubernetes effectively halted the nodes. We have since worked on setting resource requests/limits so we can avoid disrupting the entire cluster.
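As a minimal sketch of the kind of per-container requests/limits that keep a single workload from starving a node (the names, image, and values below are illustrative assumptions, not our actual manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api              # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
      - name: api
        image: example.azurecr.io/api:1.0   # placeholder image
        resources:
          requests:              # what the scheduler reserves on a node
            cpu: 250m
            memory: 256Mi
          limits:                # hard ceiling enforced by the kubelet
            cpu: 500m
            memory: 512Mi

With limits in place, a misbehaving pod gets CPU-throttled or OOM-killed instead of dragging the whole node into NotReady.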

Related

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.
Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the nodes' VMs with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing the VMs of the nodes is a destructive action, meaning the cluster will have to be replaced and reconfigured, and all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practices when it comes to scaling and performance. How can I perform such an increase in resources without any downtime of our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, tutorial you can point me to it's highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count", which is equal to 1 by default. What it means is that when you update your cluster or node configuration, it will first create one extra node with the updated configuration - and only then will it safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade
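For example, a larger surge can be set per node pool with the az CLI (the resource group, cluster, and pool names below are placeholders):

# allow up to a third of the pool to be surged at once during upgrades
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name nodepool1 \
  --max-surge 33%

Max surge accepts either a node count or a percentage, and if memory serves the azurerm Terraform provider exposes the same knob as max_surge inside the node pool's upgrade_settings block, so it can live in your existing Terraform code.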

Azure AKS Prometheus-operator double metrics

I'm running an Azure AKS cluster 1.15.11 with prometheus-operator 8.15.6 installed as a Helm chart, and I'm seeing different metrics displayed by the Kubernetes Dashboard compared to the ones provided by the prometheus-operator Grafana dashboards.
An application pod which is being monitored has three containers in it. The Kubernetes Dashboard shows that the memory consumption for this pod is ~250MB, while the standard prometheus-operator dashboard displays almost exactly double that value, ~500MB.
At first we thought that there might be some misconfiguration in our monitoring setup. Since prometheus-operator is installed as the standard Helm chart, the node-exporter DaemonSet ensures that every node has exactly one exporter deployed, so duplicate exporters shouldn't be the reason. However, after migrating our cluster to different node pools I've noticed that when our application is running on the user node pool instead of the system node pool, the metrics match exactly in both tools. I know that the system node pool runs CoreDNS and tunnelfront, but I assume these run as separate components, and I'm aware that overall it's not the best choice to run infrastructure and applications in the same node pool.
However, I'm still wondering why running the application in the system node pool causes the Prometheus metrics to be doubled.
I ran into a similar problem (AKS v1.14.6, prometheus-operator v0.38.1) where all my values were multiplied by a factor of 3. It turns out you have to remember to remove the extra endpoints called prometheus-operator-kubelet that are created in the kube-system namespace during install before you remove/reinstall prometheus-operator, since Prometheus aggregates the metrics collected for each endpoint.
Log in to the Prometheus pod and check the status page. There should be as many endpoints as there are nodes in the cluster; otherwise you may have a surplus of endpoints.
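A quick way to check and clean this up from the command line (the exact object name depends on the Helm release name, so treat prometheus-operator-kubelet as a placeholder):

# see what an earlier install left behind in kube-system
kubectl get service,endpoints -n kube-system | grep prometheus-operator-kubelet
# delete the stale objects before reinstalling the chart
kubectl delete service,endpoints prometheus-operator-kubelet -n kube-system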

Is IMap.get() expensive in Hazelcast if the Hazelcast cluster is running in the cloud?

I have a distributed map stored in Hazelcast. My Hazelcast cluster runs in a cloud, either private or public. My app may not run on the same network where the Hazelcast cluster is running.
My app accesses the distributed map using IMap.get(), possibly thousands of times per second. I tried to measure the performance of this operation against a local cluster by running the Hazelcast cluster on my local machine, and I could read everything in 15-20ms. But I am not getting the same performance when the Hazelcast cluster runs in the cloud.
If you read a map more frequently, will it increase the load on Hazelcast in a cloud environment? If yes, for what reasons?
Performance of running software locally will always be different from running it in a distributed environment, all the more so when the servers are located elsewhere - network latency being the most prominent factor.
Servers in the cloud with the app running locally is not a recipe for best performance. Either move all cluster components - servers and app clients - into one network (aim for the same availability zone if looking for best performance) or expect delays. It's not the cloud in particular that deteriorates performance, it's the way the VMs are set up in the cloud. For example, if one VM is in us-east-1, another is in London, and your app is in Tokyo, then expect inferior performance numbers.
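As a rough, back-of-the-envelope illustration (the latencies are assumptions, not measurements): a synchronous IMap.get() over a ~1 ms round trip caps a single client thread at roughly 1,000 gets per second, while the same call over a ~50 ms cross-region round trip caps it at roughly 20 per second. Most of the extra "load" you see is therefore threads waiting on the network rather than extra work on the Hazelcast members; asynchronous gets, batching with getAll(), or a client-side Near Cache are the usual ways to compensate.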

I/O monitoring on Kubernetes / CoreOS nodes

I have a Kubernetes cluster, provisioned with kops, running on CoreOS workers. From time to time I see significant load spikes that correlate with I/O spikes reported in Prometheus by the node_disk_io_time_ms metric. The thing is, I seem unable to use any metric to pinpoint where this I/O workload actually originates: metrics like container_fs_* seem useless, as I always get zero values for actual containers and only get data for the whole node.
Any hints on how I can approach the issue of locating what is to blame for the I/O load on a kube cluster / CoreOS node are very welcome.
If you are using nginx ingress you can configure it with
enable-vts-status: "true"
This will give you a bunch of Prometheus metrics for each pod that is served through the ingress. The metric names start with nginx_upstream_.
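For reference, that key goes into the controller's ConfigMap - the name and namespace below are assumptions and must match whatever your controller's --configmap flag points at:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller   # assumption: matches the controller's --configmap flag
  namespace: ingress-nginx         # assumption: wherever the controller runs
data:
  enable-vts-status: "true"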
In case it is a cronjob creating the spikes, install the node-exporter DaemonSet and check the container_fs_* metrics.
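If the container_fs_* metrics are populated on your setup (label names vary between versions - pod vs pod_name, container vs container_name), a query along these lines can break write I/O down by pod:

# bytes written per second over the last 5 minutes, grouped by pod
sum by (namespace, pod) (rate(container_fs_writes_bytes_total{container!=""}[5m]))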

Pausing Dataproc cluster - Google Compute engine

Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/
only show how to destroy a cluster, but I have installed the Spark Cassandra connector API, for example. Is my only alternative to create an image that I'll need to install every time?
In general, the best thing to do is to distill the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization actions to automate the installation during cluster deployment.
This way, you can easily reproduce the customizations without requiring manual involvement if you ever want, for example, to do the same setup on multiple concurrent Dataproc clusters, or want to change machine types, or receive sub-minor-version bug fixes that Dataproc releases occasionally.
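A minimal sketch of what that looks like, assuming a hypothetical script in your own GCS bucket that installs the Cassandra connector:

# cluster, region, bucket and script names are placeholders
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://my-bucket/install-cassandra-connector.sh

Every node (including any node added later by scaling) runs the script at boot, so the customization is reproducible instead of living only on one hand-configured cluster.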
There's indeed no officially supported way of pausing a Dataproc cluster at the moment, in large part because reproducible cluster deployments, along with the other considerations listed below, mean that 99% of the time it's better to use initialization-action customizations instead of pausing a cluster in place. That said, there are possible short-term hacks, such as going into the Google Compute Engine page, selecting the instances that are part of the Dataproc cluster you want to pause, and clicking "stop" without deleting them (gcloud equivalents are sketched after the caveats below).
The Compute Engine hourly charges and Dataproc's per-vCPU charges are only incurred when the underlying instance is running, so while you've "stopped" the instances manually, you won't incur Dataproc or Compute Engine's instance-hour charges despite Dataproc still listing the cluster as "RUNNING", albeit with warnings that you'll see if you go to the "VM Instances" tab of the Dataproc cluster summary page.
You should then be able to just click "start" from the Google Compute Engine page to have the cluster running again, but it's important to consider the following caveats:
The cluster may occasionally fail to start up into a healthy state again; anything using local SSDs already can't be stopped and started again cleanly, but beyond that, Hadoop daemons may have failed for whatever reason to flush something important to disk if the shutdown wasn't orderly, or even user-installed settings may have broken the startup process in unknown ways.
Even when VMs are "stopped", they depend on the underlying Persistent Disks remaining, so you'll continue to incur charges for those even while "paused"; if we assume $0.04 per GB-month, and a default 500GB disk per Dataproc node, that comes out to continuing to pay ~$0.028/hour per instance; generally your data will be more accessible and also cheaper to just put in Google Cloud Storage for long term storage rather than trying to keep it long-term on the Dataproc cluster's HDFS.
If you come to depend on a manual cluster setup too much, then it'll become much more difficult to re-do if you need to size up your cluster, or change machine types, or change zones, etc. In contrast, with Dataproc's initialization actions, you can use Dataproc's cluster scaling feature to resize your cluster and automatically run the initialization actions for new workers created.
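If you do go the short-term route described above, the console clicks have gcloud equivalents (the instance names below assume Dataproc's default naming for a hypothetical cluster called my-cluster with two workers):

# stop the master and workers without deleting them
gcloud compute instances stop my-cluster-m my-cluster-w-0 my-cluster-w-1 --zone=us-central1-a
# later, bring them back
gcloud compute instances start my-cluster-m my-cluster-w-0 my-cluster-w-1 --zone=us-central1-a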
Update
Dataproc recently launched the ability to stop and start clusters: https://cloud.google.com/dataproc/docs/guides/dataproc-start-stop
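With that feature available, pausing is just (cluster name and region are placeholders):

gcloud dataproc clusters stop my-cluster --region=us-central1
gcloud dataproc clusters start my-cluster --region=us-central1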
