Troubleshooting Azure AKS Autoscale

We are working with the preview feature for the AKS Cluster Autoscaler. We have it all set up with VMSS support. However, in our tests it does not scale up / add nodes. We get errors that 0/2 nodes are available and no more replicas can be added. A while later I did see it add one node, but a lot of replicas are still red/failed.

What is the best way to go about troubleshooting this? I looked at the autoscaler status, but it did not show any errors. I recall doing something similar with the regular Kubernetes autoscaler and was able to pore through some logs to find the issues, but I can't find anything in the logs about this. Which logs should I look at? And, if you are feeling generous: how long are you seeing the cluster take to add a node and those pod scheduling failures to go away?
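For reference, a rough sketch of the places the autoscaler reports its decisions. This assumes the managed AKS autoscaler writes the usual cluster-autoscaler-status ConfigMap; the pod, namespace, resource-group and cluster names below are placeholders:

```bash
# The cluster autoscaler records its decisions in a status ConfigMap in kube-system
# (the managed AKS autoscaler does this too, even though its pod is not visible):
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Pending pods carry scheduler/autoscaler events explaining why they can't be placed
# and whether a scale-up was triggered (TriggeredScaleUp / NotTriggerScaleUp):
kubectl describe pod <pending-pod-name> -n <namespace>

# Recent events often show the scale-up decisions as well:
kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp | grep -i autoscaler

# Confirm the min/max counts actually applied to the node pool:
az aks show --resource-group <rg> --name <cluster> \
  --query "agentPoolProfiles[].{name:name,min:minCount,max:maxCount,autoscale:enableAutoScaling}" -o table
```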

Related

Azure Databricks - monitor job cluster for up or down status

Does anyone have a way to monitor a group of job clusters in Azure Databricks?
We just want to make sure the job clusters are up and running, ideally with a dashboard or Workbook in Azure that can be red or green depending on the status of the job cluster.
We have these NRT interfaces pulling data from a source application via these job clusters and would like to see when they are down. We already get an alert when the service goes down, but having a panel where we can see these interfaces would be really useful. Perhaps something that makes use of an API call would be needed, unless there is something out of the box like those Ganglia reports, but I haven't seen anything close to what I'm looking for.
Thanks in advance for any answer you may provide.
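One possible starting point, sketched below: the Databricks Clusters API exposes a list endpoint that includes each cluster's state, which could feed a Workbook panel or an Azure Monitor alert. The host/token variables and the jq filter on cluster_source are illustrative assumptions, not a tested setup:

```bash
# Minimal sketch: poll the Databricks Clusters API for the state of each job cluster.
# DATABRICKS_HOST and DATABRICKS_TOKEN are placeholders for your workspace URL and a PAT.
curl -s -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  "${DATABRICKS_HOST}/api/2.0/clusters/list" \
  | jq -r '.clusters[] | select(.cluster_source == "JOB") | "\(.cluster_name)\t\(.state)"'
```

Output like this could then be pushed to a Log Analytics custom table or polled by a Workbook to get the red/green view.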

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one node pool with 3 nodes, which are fairly low on resources because we still don't have that many simultaneous active users/requests.
Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the nodes' VMs with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing the VMs of a node pool is a destructive action, meaning the cluster will have to be replaced and a new config plus all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practice when it comes to scaling and performance. How can I perform such an increase in resources without any downtime of our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, tutorial you can point me to it's highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count", which is equal to 1 by default. It means that when you update your cluster or node configuration, it will first create one extra node with the updated configuration, and only then safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade
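A rough sketch of the extra node pool idea from the question, using the Azure CLI (the same steps map onto a second azurerm_kubernetes_cluster_node_pool resource in Terraform). Pool names, VM size and node names are placeholders, not a tested procedure:

```bash
# Add a second node pool with the larger VM size:
az aks nodepool add --resource-group myRG --cluster-name myAKS \
  --name biggerpool --node-count 3 --node-vm-size Standard_D4s_v3

# Move workloads off the old pool one node at a time:
kubectl cordon <old-node-name>
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data

# Remove the old pool once everything has been rescheduled:
az aks nodepool delete --resource-group myRG --cluster-name myAKS --name oldpool

# For in-place node pool upgrades, the surge behaviour mentioned above is controlled
# with --max-surge (1 by default):
az aks nodepool update --resource-group myRG --cluster-name myAKS \
  --name biggerpool --max-surge 1
```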

Silencing Alerts on Multiple AlertManagers Simultaneously

We are running a large number of Kubernetes clusters in our network, each with its own prometheus-operator deployment. Every deployment has its own Alertmanager deployment. We are finding it very time consuming to silence an alert across all the clusters.
Currently what we have to do is go to each individual Alertmanager and silence the alert there.
What we are hoping to achieve is an easy way of silencing the alerts for all the clusters (ideally from a single GUI).
We don't want to use inhibit rules, as that defeats the purpose of the alert.
Does anyone have any idea how to do that?
The most obvious way would be to turn off the per-cluster Alertmanager deployments and use a single central Alertmanager cluster where the firing alerts would be silenced.
A concern might be the performance of that single Alertmanager if there is a very large number of K8s clusters. That requires testing.
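Short of centralising, a possible stopgap is scripting the same silence against every Alertmanager with amtool (or by POSTing to each instance's /api/v2/silences endpoint). A minimal sketch, with placeholder URLs and an example matcher:

```bash
# Push the same silence to every per-cluster Alertmanager from one place.
# The URL list is a placeholder for your actual Alertmanager endpoints.
for am in https://alertmanager.cluster-a.example.com \
          https://alertmanager.cluster-b.example.com; do
  amtool silence add alertname="KubePodCrashLooping" \
    --alertmanager.url="$am" \
    --comment="planned maintenance" \
    --author="ops" \
    --duration="2h"
done
```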

Azure AKS Prometheus-operator double metrics

I'm running an Azure AKS cluster 1.15.11 with prometheus-operator 8.15.6 installed as a Helm chart, and I'm seeing different metrics displayed by the Kubernetes Dashboard compared to the ones provided by the Prometheus Grafana dashboards.
An application pod which is being monitored has three containers in it. Kubernetes Dashboard shows that the memory consumption for this pod is ~250MB, while the standard prometheus-operator dashboard displays almost exactly double that value, ~500MB.
At first we thought there might be some misconfiguration in our monitoring setup. Since prometheus-operator is installed as a standard Helm chart, the DaemonSet for node-exporter ensures that every node has exactly one exporter deployed, so duplicate exporters shouldn't be the reason. However, after migrating our cluster to different node pools I noticed that when our application runs on a user node pool instead of the system node pool, the metrics match exactly in both tools. I know that the system node pool runs CoreDNS and tunnelfront, but I assume these run as separate components, and I'm also aware that overall it's not the best choice to run infrastructure and applications in the same node pool.
However, I'm still wondering why running the application in the system node pool causes the Prometheus metrics to be doubled?
I ran into a similar problem (AKS v1.14.6, prometheus-operator v0.38.1) where all my values were multiplied by a factor of 3. It turns out you have to remember to remove the extra endpoints called prometheus-operator-kubelet that are created in the kube-system namespace during install before you remove/reinstall prometheus-operator, since Prometheus aggregates the metrics collected for each endpoint.
Log in to the Prometheus pod and check the status page. There should be as many endpoints as there are nodes in the cluster; otherwise you may have a surplus of endpoints.
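A rough way to do that comparison from the outside, assuming a default prometheus-operator Helm install (the service and namespace names below are assumptions, adjust for your release):

```bash
# Number of nodes the endpoints should correspond to:
kubectl get nodes --no-headers | wc -l

# Stale kubelet Endpoints objects left behind in kube-system by a previous install:
kubectl get endpoints -n kube-system | grep -i kubelet

# Or check the active scrape targets directly on the Prometheus status page:
kubectl port-forward svc/prometheus-operated 9090 -n monitoring
# then open http://localhost:9090/targets and look for duplicate kubelet/cadvisor targets
```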

How to undo kubectl delete node

I have a k8s cluster on Azure created with acs-engine. It has 4 Windows agent nodes.
Recently 2 of the nodes went into a not-ready state and remained there for over a day. In an attempt to correct the situation I ran "kubectl delete node" on both of the not-ready nodes, thinking that they would simply be restarted in the same way that a pod belonging to a deployment is restarted.
No such luck. The nodes no longer appear in the "kubectl get nodes" list. The virtual machines backing the nodes are still there and still running. I tried restarting the VMs, thinking this might cause them to self-register, but no luck.
How do I get the nodes back into the k8s cluster? Otherwise, how do I recover from this situation? Worst case I can simply throw away the entire cluster and recreate it, but I would really like to simply fix what I have.
You can delete the virtual machines and rerun your acs-engine template; that should bring the nodes back (although I didn't really test your exact scenario). Or you could simply create a new cluster, not that it takes a lot of time, since you just need to run your template.
There is no way of recovering from the deletion of an object in k8s. I'm pretty sure they are purged from etcd as soon as you delete them.
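A sketch of what rerunning the template might look like with acs-engine, following the suggestion above to delete the broken VMs first. The file names, DNS prefix, VM name and resource group are placeholders, and this is untested for this exact scenario:

```bash
# Delete the broken agent VMs first, as suggested above:
az vm delete --resource-group my-k8s-rg --name k8s-windowspool-12345678-2 --yes

# Regenerate the ARM template from the original cluster definition:
acs-engine generate kubernetes-windows.json

# Redeploy into the same resource group; an incremental ARM deployment leaves existing
# resources in place and recreates the missing agent VMs, which then re-register:
az group deployment create \
  --resource-group my-k8s-rg \
  --template-file _output/<dnsPrefix>/azuredeploy.json \
  --parameters _output/<dnsPrefix>/azuredeploy.parameters.json
```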
