Azure Kubernetes - System and User pool

I configured system and user pools on my Azure AKS cluster. I followed this guide:
Microsoft Guide
Before this activity we only had system-type pools, hosting application pods as well as system pods.
I did the following steps:
creation of a system-type pool and setting of the taint "CriticalAddonsOnly=true:NoSchedule" (to prevent application microservices from being deployed on the system pool)
conversion of the old pools from system to user mode
restart of the following deployments:
gatekeeper-system:
gatekeeper-audit
gatekeeper-controller
kube-system:
coredns
coredns-autoscaler
metrics-server
azure-policy
azure-policy-webhook
konnectivity-agent
ama-logs-rs
to allow the system pods to be scheduled on the new system pool as well, since they are not automatically rescheduled there after pool creation.
Now I'm noticing that the system pods have been scheduled on the system pool, but I keep seeing the same pods on all the other nodes too. Even if I forcibly delete them from the user pools, they are immediately recreated there. Is this behavior correct? Logically, if I have a system pool, shouldn't all system pods run only on that pool and none on the user pools?
Thanks

As per the official Microsoft documentation, these are some of the features of system node pools and user node pools.
System Node Pool:
Must be running Linux.
They can have a minimum of 1 node, but it is recommended to have 2 nodes or 3 if it is your only Linux node pool.
They are only supported on AKS clusters running on Virtual Machine Scale Sets.
The nodes need at least 2 vCPUs and 4GB memory.
They need to support at least 30 pods.
Cannot be made up of Spot VMs.
A cluster can have multiple system node pools.
If it is the only system node pool, it cannot be deleted.
Can be changed to a user node pool if you have another system node pool.
User Node Pool:
User node pools can be either Linux or Windows.
Can scale down to 0 nodes.
Can be deleted with no issues.
Spot VMs can be used.
Can be changed to a system node pool.
Can have as many user node pools as Azure will let you.
As per the pod definitions, system pods are bound to be scheduled on the system node pool unless they are controlled by a DaemonSet. If a system pod is controlled by a DaemonSet, it is bound to be scheduled on every node in the cluster regardless of pool type. My cluster has 4 nodes: 2 system, 2 user. So these system pods in the kube-system namespace have one replica per node.
kubectl get ds -n kube-system
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ama-logs                   4         4         4       4            4           <none>          14d
azure-cni-networkmonitor   4         4         4       4            4           <none>          540d
azure-ip-masq-agent        4         4         4       4            4           <none>          540d
kube-proxy                 4         4         4       4            4           <none>          540d
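For illustration, this is roughly what such a DaemonSet looks like: with no nodeSelector and a toleration for the system-pool taint, one pod is placed on every node, whichever pool it belongs to. This is a minimal sketch (name and image are placeholders, not the actual ama-logs manifest):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-agent           # placeholder name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-node-agent
  template:
    metadata:
      labels:
        app: example-node-agent
    spec:
      # No nodeSelector, so every node in every pool is a candidate.
      tolerations:
        # Tolerate the system-pool taint so the agent also runs on tainted system nodes.
        - key: CriticalAddonsOnly
          operator: Exists
          effect: NoSchedule
      containers:
        - name: agent
          image: nginx               # placeholder image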
To further ensure that application pods are not scheduled on the system pool, you can add a taint to the system node pool with the command below; application pods will then only be scheduled on the user node pools.
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name systempool \
--node-count 3 \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--mode System
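For reference, the AKS system pods are able to run on the tainted system pool because they carry a toleration for this taint; application pods without it will not be scheduled there. A minimal sketch of what such a toleration looks like in a pod spec (the name and image are placeholders, not an actual AKS manifest):
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon-example       # placeholder name
  namespace: kube-system
spec:
  # This toleration matches the CriticalAddonsOnly=true:NoSchedule taint,
  # allowing the pod onto the tainted system nodes.
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: nginx                   # placeholder image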

AKS prefers the system node pool when scheduling system pods, but it is not guaranteed that system pods won't be placed on a user node pool when the system node pool does not have enough capacity to schedule all system pods.
Have you checked if your system pool has the required capacity for all system pods?
See the limitations section of the page you mentioned.

Related

AKS nodepool in a failed state, PODS all pending

Yesterday I was using kubectl in my command line and was getting this message after trying any command. Everything was working fine the previous day and I had not touched anything in my AKS.
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-01-11T12:57:51-05:00 is after 2022-01-11T13:09:11Z
After doing some google to solve this issue I found a guide about rotating certificates:
https://learn.microsoft.com/en-us/azure/aks/certificate-rotation
After following the rotate guide it fixed my certificate issue however all my pods were still in a pending state so I then followed this guide: https://learn.microsoft.com/en-us/azure/aks/update-credentials
Then one of my node pools, the one of type user, started working again, but the one of type system is still in a failed state with all pods pending.
I am not sure of the next steps I should be taking to solve this issue. Does anyone have any recommendations? I was going to delete the nodepool and make a new one but I can't do that either because it is the last system node pool.
Assuming you are using an API version older than 2020-03-01 for creating the AKS cluster:
A few limitations apply when you create and manage AKS clusters that support system node pools.
• An API version of 2020-03-01 or greater must be used to set a node pool mode. Clusters created on API versions older than 2020-03-01 contain only user node pools, but can be migrated to contain system node pools by following the update pool mode steps.
• The mode of a node pool is a required property and must be explicitly set when using ARM templates or direct API calls.
You can use the Bicep/JSON code provided in the MS document to create the AKS cluster, as it uses an upgraded API version.
You can also follow this MS document if you want to create a new AKS cluster with a system node pool, or add a dedicated system node pool to an existing AKS cluster.
The following command adds a dedicated node pool of mode type system with a default count of three nodes.
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name systempool \
--node-count 3 \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--mode System

How to dictate a master pod with NodeJS app

I'm trying to run a deployment of my NodeJS application in EKS with a ReplicaSet dictating that 3 pods of the application should be run. However, I'm trying to make some logic exclusive to one of the pods, calling it the "master" version of the application.
Is it possible to either a) have a different environment variable like IS_MASTER passed to just that pod, or b) otherwise tell from within the application that it's running on the "master" pod, without multiple deployments?
You can have a sticky identity for each pod using StatefulSets:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
Quoting the docs:
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of their Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
Pods will have the hostnames foo-{0..N-1} given N replicas, so you can do a simple check for the master: if the hostname is foo-0, it is the master.
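A minimal sketch of that idea, assuming a hypothetical StatefulSet named web: the pod name (which matches the hostname) is exposed to the application through the Downward API, and the app treats the ordinal-0 pod as the master:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                          # hypothetical name
spec:
  serviceName: web                   # assumes a matching headless Service named "web"
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: node:18-alpine      # placeholder image for the NodeJS app
          env:
            # Expose the pod name (web-0, web-1, web-2) to the application;
            # the app can check for the "-0" suffix to decide whether it is the master.
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
Inside the NodeJS app, checking whether POD_NAME (or os.hostname()) ends in "-0" gives you the IS_MASTER-style switch without a second deployment.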

Azure Kubernetes - Auto-scaling & NodeSelector, Taints and Tolerations?

I have an AKS cluster with the below configuration
Windows Node Pools - 1
Nodes - 2
Node Labels - 2 : app1, app2
Pods - 4 : two pods for each app; the node is selected based on the nodeSelector
Pods use taints & tolerations
Node auto-scaling is enabled
Now, let's say a new node is created to support the additional load of app1. Would that new node be labelled automatically and the taint applied, so that app1 can be deployed on that node?
When you create a node pool, you can specify labels and taints (--labels, --node-taints) that are applied automatically to every node in that pool, including nodes the autoscaler adds later. Once the node pool is created, I don't think you can currently go back and add that auto-labelling or auto-tainting ability.
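For context, the pod side of that setup would look roughly like this (the app1 label key, taint and image are illustrative, taken from the question rather than a real cluster): the nodeSelector pins the pods to nodes carrying the pool's label, and the toleration matches the pool's taint, so app1 pods only land on nodes created in that pool, including nodes the autoscaler adds later:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1                         # illustrative name from the question
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      nodeSelector:
        app: app1                    # label assumed to be set on the node pool at creation
      tolerations:
        # Matches a hypothetical app=app1:NoSchedule taint set on that node pool.
        - key: app
          operator: Equal
          value: app1
          effect: NoSchedule
      containers:
        - name: app1
          image: mcr.microsoft.com/windows/servercore/iis   # placeholder Windows image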

Manage Docker containers at low scale

I have deployed 5 apps using Azure Container Instances. These are working fine; the issue I have is that currently all containers are running all the time, which gets expensive.
What I want to do is start/stop instances when required, using a master container or VM that will be running all the time.
E.g.:
This master service gets a request to spin up service number 3 for 2 hours and then shut it down, while all other containers stay off until they receive a similar request.
For my use case, each service will be used for less than 5 hours a day most of the time.
Now, I know Kubernetes is an engine made to manage containers, but all the examples I have found are for high-scale services, not for 5 services with only one container each. I'm also not sure whether Kubernetes allows keeping all the containers off most of the time.
What I was thinking of is handling all of this through some API, but I'm not finding any service in Azure that allows something like this; I have only found options to create new containers, not to spin them up and shut them down.
EDIT:
Also, these apps run processes that are too heavy to host on a serverless platform.
The solution is to define a Horizontal Pod Autoscaler for your deployment.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can’t be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
The configuration file should look like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-images-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicas: 2
  maxReplicas: 100
  targetCPUUtilizationPercentage: 75
scaleTargetRef should refer to your deployment definition; minReplicas can be set as low as 1 (a standard HPA cannot scale a workload down to 0), and the targetCPUUtilizationPercentage value you can set according to your preferences. Such an approach should help you save money, since replicas are removed when CPU utilization drops below the target.
Kubernetes official documentation: kubernetes-hpa.
GKE autoscaler documentation: gke-autoscaler.
Useful blog about saving cash using GCP: kubernetes-google-cloud.

How to limit amount of pods with attached managed disks per node

Imagine there is a cluster with lots of different deployments running on it. Some pods use PersistentVolumes (Azure Disks). There is a limit in Azure on how many disks can be attached to a VM, and this leads to scheduling errors like
Status=409 Code="OperationNotAllowed" Message="The maximum number of data disks allowed to be attached to a VM of this size is 8
Pods stay in the Waiting: ContainerCreating state forever, even though some nodes had far fewer pods with attached disks at the moment of scheduling. It would be great to limit the number of pods with attached disks per node so this error never happens. I believe podAntiAffinity is what I need, and I know I can restrict pods with the same label from being scheduled on the same node, but I don't know how to allow it until a node reaches the maximum number of pods with disks.
My installation is AKS.
az acs create \
--orchestrator-type=kubernetes \
--orchestrator-version 1.7.9 \
--resource-group <resource_group_here> \
--name=<name_here> \
...
KUBE_MAX_PD_VOLS is what you are looking for. By default its value is 16 for Azure Disks, so you can either use instance sizes which have the same limit of attached disks (16) or set it to your preferred value. You can see where it is declared on GitHub.
You should set this environment variable in your scheduler declaration. I found my scheduler declaration in /etc/kubernetes/manifests/kube-scheduler.yaml. This is what it looks like now:
apiVersion: "v1"
kind: "Pod"
metadata:
name: "kube-scheduler"
...
spec:
containers:
- name: "kube-scheduler"
...
env:
- name: KUBE_MAX_PD_VOLS
value: "8"
...
Note the KUBE_MAX_PD_VOLS setting under spec.containers.env: it prevents the scheduler from placing more than 8 Azure Disk volumes on each node.
This way pods spread among the nodes without any issues; pods which cannot fit stay in the Pending state until a node with free disk slots is available.
