Azure Kubernetes Cluster Node Failure Scenario

Let's say I have 3 nodes in my cluster and I want to run 300 jobs.
If I run 1 job per pod and 100 pods per node, what will happen if a node fails in Azure Kubernetes Service?

Those jobs will go to Pending. Kubernetes supports at most 110 pods per node by default, so each of the two surviving nodes (already running 100 pods) can only accept about 10 more, leaving roughly 80 of the failed node's 100 pods with nowhere to run. You could look at using the Cluster Autoscaler (Beta), which would provision more hosts to run the jobs stuck in the Pending state.

If a node fails, the Cluster Autoscaler (CA) can handle the failure in Azure, where it is backed by virtual machine scale sets:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md
https://learn.microsoft.com/en-us/azure/aks/autoscaler
https://learn.microsoft.com/en-us/azure/aks/scale-cluster
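As a sketch of what enabling the autoscaler looks like with the Azure CLI: the resource group name, cluster name, and node counts below are placeholders, not values from the question.

```shell
# Enable the cluster autoscaler on an existing AKS cluster.
# "myResourceGroup" and "myAKSCluster" are placeholder names;
# adjust --min-count/--max-count to your workload.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 6
```

With this in place, pods left Pending after a node failure would trigger the autoscaler to add nodes up to the configured maximum.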

Related

How to check kubectl execution history of a kubernetes cluster?

Suppose I have a Kubernetes cluster that the DevOps team manages using kubectl. How can I track kubectl executions against the cluster to monitor for suspicious activity?

What is an agent node in AKS?

I found a mention of an agent node in the AKS documentation, but I can't find a definition of it. Can anyone please explain it? I also want to know whether it is an Azure concept or a Kubernetes concept.
In Kubernetes the term node refers to a compute node. Depending on its role, it is usually referred to as a control plane node or a worker node. From the docs:
A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
Agent nodes in AKS refer to the worker nodes (not to be confused with the kubelet, which is the primary "node agent" that runs on each worker node).
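You can see the agent (worker) nodes of an AKS cluster directly with kubectl; node names, counts, and versions will of course vary per cluster:

```shell
# List the cluster's nodes; on AKS these are the agent nodes
# (the control plane is managed by Azure and does not appear here).
kubectl get nodes -o wide
```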

Custom metrics using active TLS connections for AKS HPA

I am running a service in AKS pods that establishes TLS connections with clients. There is a hard limit of 5K active connections per pod. I need a way to determine the number of active TLS connections per pod and auto-scale (HPA) when it reaches a threshold (say 3.5K TLS connections), and scale down when active connections drop below 1K.
Is there a way to collect such metrics in AKS and scale based on them? Kindly suggest.
By default, scale-up operations performed manually or by the cluster autoscaler require the allocation and provisioning of new nodes, and scale-down operations delete nodes. Scale-down Mode allows you to decide whether you would like to delete or deallocate the nodes in your Azure Kubernetes Service (AKS) cluster upon scaling down.
There is no Microsoft documentation describing autoscaling based on TLS connections per pod.
Kubernetes has a cluster autoscaler that adjusts the number of nodes based on the requested compute resources in the node pool. By default, the cluster autoscaler checks the Metrics API server every 10 seconds for any required changes in node count. If the cluster autoscaler determines that a change is required, the number of nodes in your AKS cluster is increased or decreased accordingly. The cluster autoscaler works with RBAC-enabled AKS clusters that run Kubernetes 1.10.x or higher.
Cluster autoscaler is typically used alongside the horizontal pod autoscaler. When combined, the horizontal pod autoscaler increases or decreases the number of pods based on application demand, and the cluster autoscaler adjusts the number of nodes as needed to run those additional pods accordingly.
To get started with the cluster autoscaler in AKS, see Cluster Autoscaler on AKS.
Reference : https://learn.microsoft.com/en-us/azure/aks/concepts-scale#cluster-autoscaler
Counting the TLS connections to particular nodes can be done using platform metrics: Microsoft.Blockchain/blockchainMembers -> ClusterCommEgressTlsConnectionCount.
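One possible approach, offered as a sketch rather than anything from the Microsoft docs: expose the per-pod connection count as a custom metric (for example through the Prometheus adapter) and reference it from an autoscaling/v2 HorizontalPodAutoscaler. The metric name active_tls_connections, the Deployment name tls-service, and the thresholds below are all assumptions.

```shell
# Sketch of an HPA scaling on a custom per-pod metric.
# Assumes a metrics pipeline (e.g. Prometheus adapter) already
# exposes "active_tls_connections" via the custom metrics API.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tls-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tls-service          # placeholder deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: active_tls_connections
      target:
        type: AverageValue
        averageValue: "3500"   # scale out around 3.5K connections per pod
EOF
```

The HPA then adds replicas whenever the average connection count per pod exceeds the target, and removes them as the average falls, which approximates the 3.5K scale-up / 1K scale-down behaviour described in the question.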

How to run a node auto-scaler script without using a cron-job in Kubernetes ( PKS)

I have a node auto-scaling shell script which takes care of auto-scaling the worker nodes based on the average CPU/memory of all the nodes in the Kubernetes cluster.
I currently run this script from a bastion host where I have the pks and kubectl CLIs installed, and I have also configured a cron job to run it every 5 minutes.
Is there any other way to do this in Kubernetes (PKS on AWS)?
Or perhaps without using a cron job, since the auto-scaling becomes completely dependent on the cron.
TL;DR: Autoscale with k8s
To setup autoscaling on k8s use:
kubectl autoscale -f <controller>.yaml --min=3 --max=5
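For example, against an existing Deployment (the name my-app and the 70% CPU target are placeholders):

```shell
# Create an HPA for a Deployment named "my-app" (placeholder),
# keeping 3-5 replicas and targeting 70% average CPU utilization.
kubectl autoscale deployment my-app --cpu-percent=70 --min=3 --max=5
```

This removes the dependency on an external cron: the HPA controller inside the cluster evaluates the metric continuously.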
Note: PKS over AWS is overkill
You mentioned PKS.
Using PKS on AWS infrastructure seems like overkill, simply because AWS already has EKS.
To work with the AWS cloud, VMware recommends VMC on AWS.
PKS autoscale
If you do insist on using PKS over AWS, you may try this sample repo: pks-autoscale
The author of the repo also has a great PKS quickstart guide for AWS
Scaling on AWS
EKS autoscaling
AWS EKS supports three kinds of scaling:
Cluster Autoscaler — The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in your cluster when pods fail to launch due to lack of resources or when nodes in the cluster are underutilized and their pods can be rescheduled on to other nodes in the cluster.
Horizontal Pod Autoscaler — The Kubernetes Horizontal Pod Autoscaler automatically scales the number of pods in a deployment, replication controller, or replica set based on that resource's CPU utilization.
Vertical Pod Autoscaler — The Kubernetes Vertical Pod Autoscaler automatically adjusts the CPU and memory reservations for your pods to help "right size" your applications. This can help you to better use your cluster resources and free up CPU and memory for other pods.
EC2 Auto Scaling
If you decided to build your own k8s cluster using PKS, you may use EC2 auto scaling - just create an Auto Scaling Group.
Using aws-cli:
aws autoscaling create-auto-scaling-group --auto-scaling-group-name <my-asg> --launch-configuration-name <my-launch-config> --min-size 3 --max-size 5 --vpc-zone-identifier "<zones>"
EC2 predictive scaling
Recently, AWS introduced predictive scaling for EC2:
... predictive scaling. Using data collected from your actual EC2 usage and further informed by billions of data points drawn from our own observations, we use well-trained Machine Learning models to predict your expected traffic (and EC2 usage) including daily and weekly patterns.
If you mean EKS on AWS, then there are different auto-scaling options.

Marathon on Azure Container Service - cannot scale to all nodes

I have setup up a VM cluster using Azure Container Service. The container orchestrator is DC/OS. There are 3 Master nodes and 3 slave agents.
I have a Docker app that I am trying to launch on my cluster using Marathon. Each time I launch it, I notice that the CPU utilization of 3 nodes is always 0, i.e. the app is never scheduled on them. The other 3 nodes, on the other hand, reach almost 100% CPU utilization as I scale the application. At that point, scaling stops and Marathon shows the state "waiting" for resource offers from Mesos.
I don't understand why Marathon is not scheduling more containers, despite there being empty nodes when I try to scale the application.
I know that Marathon runs on the Master nodes; is it unaware of the presence of the slave agents? (Assuming that the 3 free nodes are the slaves.)
Here is the config file of the application: pastebin-config-file
How can I make full use of the machines using Marathon?
Tasks are not scheduled on the masters; those are reserved for managing the cluster.
