Kubernetes FailedScheduling using nodeSelector - linux

I have set up an on-prem Kubernetes cluster using Rancher, with 3 CentOS nodes and 1 Windows node.
I wanted a Deployment that never runs on the Windows node, so I set spec.template.spec.nodeSelector in the Deployment to kubernetes.io/os: linux
The Deployment is created, but its pods get stuck in Pending with this error:
Warning FailedScheduling default-scheduler 0/4 nodes are
available: 1 node(s) didn't match node selector, 3 node(s) had taint
{cattle.io/os: linux}, that the pod didn't tolerate.
Any insights?

The scheduler cannot place the pod on the Linux nodes because those nodes carry the taint cattle.io/os=linux, which the pod does not tolerate. Add a matching toleration to the pod spec of the Deployment (spec.template.spec.tolerations):
tolerations:
- key: "cattle.io/os"
  operator: "Equal"
  value: "linux"
  effect: "NoSchedule"
Also add a taint to the Windows node, so that only pods with a matching toleration can be scheduled onto it:
kubectl taint nodes windowsnode cattle.io/os=windows:NoSchedule
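To show where the two settings sit relative to each other, here is a minimal sketch of a complete Deployment that combines the nodeSelector from the question with the toleration above. The name demo-app, the app label and the image are placeholders, and the NoSchedule effect is an assumption, since the scheduler message does not show the effect of the taint.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                      # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      # Restricts the pod to Linux nodes (from the question).
      nodeSelector:
        kubernetes.io/os: linux
      # Allows the pod onto nodes tainted cattle.io/os=linux.
      # The effect must match the taint; NoSchedule is assumed here.
      tolerations:
      - key: "cattle.io/os"
        operator: "Equal"
        value: "linux"
        effect: "NoSchedule"
      containers:
      - name: demo-app
        image: registry.example.com/demo-app:1.0   # placeholder image
```

The actual effect of the taint can be checked in the Taints field of kubectl describe node <node-name> before applying this.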

Related

Azure Kubernetes - System and User pool

I configured system and user node pools on an Azure AKS cluster, following this guide:
Microsoft Guide
Before this change we only had system-type pools, which hosted both application and system pods.
I did the following steps:
1. Created a system-type pool with the taint "CriticalAddonsOnly=true:NoSchedule" (to keep application microservices off the system pool).
2. Converted the old pools from system to user.
3. Restarted the following deployments, so that the system pods would also be scheduled on the system pool, since they are not scheduled there automatically after pool creation:
   gatekeeper-system:
     gatekeeper-audit
     gatekeeper-controller
   kube-system:
     coredns
     coredns-autoscaler
     metrics-server
     azure-policy
     azure-policy-webhook
     konnectivity-agent
     ama-logs-rs
Now I notice that the system pods have been scheduled on the system pool as well, but I keep seeing the same pods on all the other nodes. Even if I forcibly delete them from the user pools, they are immediately redeployed there. Is this behavior correct? Logically, if I have a system pool, shouldn't all system pods run only on that pool and none on the user pools?
Thanks
As per the official Microsoft documentation, these are some of the characteristics of system and user node pools.
System Node Pool:
Must be running Linux.
They can have a minimum of 1 node, but it is recommended to have 2 nodes or 3 if it is your only Linux node pool.
They only support AKS cluster running on Virtual Machine Scale Sets.
The nodes need at least 2 vCPUs and 4GB memory.
They need to support at least 30 pods.
Cannot be made up of Spot VMs.
Can have multiple system node pools.
If only one system node pool, it cannot be deleted.
Can be changed to a user node pool if you have another system node pool.
User Node Pool:
User node pools can be either Linux or Windows.
Can scale down to 0 nodes.
Can be deleted with no issues.
Spot VMs can be used.
Can be changed to a system node pool.
Can have as many user node pools as Azure will let you.
Going by their pod definitions, system pods are meant to be scheduled on the system node pool unless they are controlled by a DaemonSet. A system pod controlled by a DaemonSet is scheduled on every node in the cluster, regardless of pool type. My cluster has 4 nodes (2 system, 2 user), so each of these DaemonSets in the kube-system namespace has one pod per node:
kubectl get ds -n kube-system
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ama-logs                   4         4         4       4            4           <none>          14d
azure-cni-networkmonitor   4         4         4       4            4           <none>          540d
azure-ip-masq-agent        4         4         4       4            4           <none>          540d
kube-proxy                 4         4         4       4            4           <none>          540d
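As an illustration of why such pods come back on user nodes after being deleted, here is a minimal sketch of a DaemonSet with no node selector and a catch-all toleration. It is not the actual AKS manifest: the name node-agent and the image are placeholders, and whether a given AKS add-on uses exactly this toleration is an assumption that can be checked with kubectl get ds kube-proxy -n kube-system -o yaml.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                    # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      # No nodeSelector: the DaemonSet controller keeps one pod
      # on every eligible node, system and user pools alike.
      tolerations:
      # An empty key with operator Exists tolerates every taint,
      # including CriticalAddonsOnly=true:NoSchedule.
      - operator: Exists
      containers:
      - name: agent
        image: registry.example.com/node-agent:1.0   # placeholder image
```

Because the controller reconciles toward one pod per node, deleting a pod from a user node only causes it to be recreated there.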
To keep application pods off the system pool, add a taint to the system node pool when creating it. Application pods, which do not tolerate the taint, will then be scheduled only on user node pools:
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name systempool \
    --node-count 3 \
    --node-taints CriticalAddonsOnly=true:NoSchedule \
    --mode System
AKS prefers the system node pool when scheduling system pods, but there is no guarantee that system pods stay off a user node pool when the system node pool does not have enough capacity to schedule all of them.
Have you checked if your system pool has the required capacity for all system pods?
See the limitations section of the page you mentioned.

How to request scale up from 0 to X the number of nodes in a Nodepool in Azure Kubernetes using nodeSelector?

I have a kubernetes cluster (v1.24.3) running in Azure with 3 nodepools called small, standard and large. For each of these nodepools I have added a label named type, where the value is SMALL-2CPU-8GB, STANDARD-4CPU-16GB and LARGE-8CPU-32GB respectively. These nodepools are also configured with the autoscaler from Azure, and the min is 0 and the max is 10.
Now, I am deploying my applications which are required to run in each of these nodepools depending on the specification - for example, one of the apps requires a small node, so it is requesting to run in the nodepool called small with a label type=SMALL-2CPU-8GB and so on.
I request this by setting the nodeSelector in the manifest of the application. This is the relevant portion of the template:
# App 1
podTemplate:
  spec:
    nodeSelector:
      type: LARGE-8CPU-32GB
      agentpool: large
# App 2
podTemplate:
  spec:
    nodeSelector:
      type: STANDARD-4CPU-16GB
      agentpool: standard
# App 3
podTemplate:
  spec:
    nodeSelector:
      type: SMALL-4CPU-16GB
      agentpool: small
...
When I apply the manifest to the cluster, the pods are in pending state with the message:
Normal NotTriggerScaleUp 43m (x13 over 45m) cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 1 not ready for scale-up
And I can see that the node count is still 0, so the pod is not triggering the autoscaler to request a new node.
My question is, how to make the autoscaler work when I am requesting nodes (even when the nodepool has zero nodes) via the nodeSelector? Should I specify a different label or use taints?

AKS PersistentVolume Affinity?

Disclaimer: This question is very specific to the platforms we use and the use case we are trying to solve. It also compares two approaches that we currently use, at least at a development stage, but perhaps don't fully understand yet. I am asking for guidance on this very specific topic...
A) We are running a Kafka cluster as Kafka tasks on DC/OS, where data persistence is maintained via local disk storage provisioned on the very same host as the corresponding Kafka broker instance.
B) We are trying to run Kafka on Kubernetes (via Strimzi Operator), specifically Azure Kubernetes Service (AKS) and are struggling to get reliable Data Persistence using the StorageClasses you get in AKS. We tried three possibilities:
(Default) Azure Disk
Azure File
emptyDir
I see two major issues with Azure Disk. While we are able to set the Kafka pod affinity so that the pods do not end up in the same maintenance zone / on the same host, we have no instrument to bind the corresponding PersistentVolume anywhere near the pod. There is nothing like nodeAffinity for Azure Disks. Also, it is fairly common that an Azure Disk ends up on a different host than its corresponding pod, which might then be limited by network bandwidth?
With Azure File we don't have issues with maintenance zones going down temporarily, but as a high-latency storage option it doesn't seem to be a good fit, and Kafka also has trouble deleting / updating files on retention.
So I ended up using an ephemeral Storage Cluster which is commonly NOT recommended but doesn't come with the problems above. The Volume "lives" near the pod and is available to it as long as the pod itself runs on any node. In the maintenance case pod AND volume die together. As long as I am able to maintain a quorum, I don't see where this might cause issues.
Is there anything like podAffinity for PersistentVolumes as Azure-Disk is per definition Node bound?
What are the major downsides in using emptyDir for persistence in a Kafka Cluster on Kubernetes?
Is there anything like podAffinity for PersistentVolumes as Azure-Disk
is per definition Node bound?
As far as I know, there is nothing like podAffinity for PersistentVolumes backed by Azure Disk. An Azure disk has to be attached to a node, so if the pod moves to a different host node, the pod can't use the volume on that disk. Only an Azure file share can follow the pod in this way.
What are the major downsides in using emptyDir for persistence in a
Kafka Cluster on Kubernetes?
Take a look at the documented uses of emptyDir, for example:
scratch space, such as for a disk-based merge sort
Disk space is the main thing you need to watch out for when you use emptyDir on AKS. You need to calculate the disk space required, and you may need to attach multiple Azure disks to the nodes.
Starting off - I'm not sure what you mean about an Azure Disk ending up on a node other than where the pod is assigned - that shouldn't be possible, per my understanding (for completeness, you can do this on a VM with the shared disks feature outside of AKS, but as far as I'm aware that's not supported in AKS for dynamic disks at the time of writing). If you're looking at the volume.kubernetes.io/selected-node annotation on the PVC, I don't believe that's updated after initial creation.
You can reach the configuration you're looking for by using a statefulset with antiaffinity. Consider this statefulset. It creates three pods, which must be in different availability zones. I'm deploying this to an AKS cluster with a nodepool (nodepool2) with two nodes per AZ:
❯ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{","}{.metadata.labels.topology\.kubernetes\.io\/zone}{"\n"}{end}'
aks-nodepool1-25997496-vmss000000,0
aks-nodepool2-25997496-vmss000000,westus2-1
aks-nodepool2-25997496-vmss000001,westus2-2
aks-nodepool2-25997496-vmss000002,westus2-3
aks-nodepool2-25997496-vmss000003,westus2-1
aks-nodepool2-25997496-vmss000004,westus2-2
aks-nodepool2-25997496-vmss000005,westus2-3
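The StatefulSet manifest itself is not reproduced here. A minimal sketch of what such a manifest could look like follows. The names echo, demo, zonetest and managed-premium are taken from the command output in this answer; the app label, the headless Service name, the image, the mount path and the agentpool node selector are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: echo
  namespace: zonetest
spec:
  serviceName: echo                   # assumes a headless Service named echo exists
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      # Assumption: keep the pods on the zonal node pool.
      nodeSelector:
        agentpool: nodepool2
      affinity:
        podAntiAffinity:
          # Hard rule: no two echo pods in the same availability zone.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: echo
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: echo
        image: registry.example.com/echo:1.0   # placeholder image
        volumeMounts:
        - name: demo
          mountPath: /data                     # hypothetical mount path
  # One PVC per pod, named <template>-<pod>, e.g. demo-echo-0.
  volumeClaimTemplates:
  - metadata:
      name: demo
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 1Gi
```

With a required anti-affinity rule on the zone topology key, a fourth replica would stay Pending in a three-zone pool, which is the trade-off for guaranteeing zone spread.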
Once the statefulset is deployed and spun up, you can see each pod was assigned to one of the nodepool2 nodes:
❯ kubectl get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
echo-0   1/1     Running   0          3m42s   10.48.36.102   aks-nodepool2-25997496-vmss000001   <none>           <none>
echo-1   1/1     Running   0          3m19s   10.48.36.135   aks-nodepool2-25997496-vmss000002   <none>           <none>
echo-2   1/1     Running   0          2m55s   10.48.36.72    aks-nodepool2-25997496-vmss000000   <none>           <none>
Each pod created a PVC based on the template:
❯ kubectl get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
demo-echo-0   Bound    pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d   1Gi        RWO            managed-premium   25m
demo-echo-1   Bound    pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4   1Gi        RWO            managed-premium   25m
demo-echo-2   Bound    pvc-d914a745-688f-493b-9b82-21598d4335ca   1Gi        RWO            managed-premium   24m
Let's take a look at one of the PVs that was created:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
    volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
  creationTimestamp: "2021-04-05T14:08:12Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: westus2
    failure-domain.beta.kubernetes.io/zone: westus2-3
  name: pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
  resourceVersion: "19275047"
  uid: 945ad69a-92cc-4d8d-96f4-bdf0b80f9965
spec:
  accessModes:
  - ReadWriteOnce
  azureDisk:
    cachingMode: ReadOnly
    diskName: kubernetes-dynamic-pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
    diskURI: /subscriptions/02a062c5-366a-4984-9788-d9241055dda2/resourceGroups/rg-sandbox-aks-mc-sandbox0-westus2/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
    fsType: ""
    kind: Managed
    readOnly: false
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: demo-echo-1
    namespace: zonetest
    resourceVersion: "19275017"
    uid: 9d9fbd5f-617a-4582-abc3-ca34b1b178e4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - westus2
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - westus2-3
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-premium
  volumeMode: Filesystem
status:
  phase: Bound
As you can see, that PV has a required nodeAffinity for nodes whose failure-domain.beta.kubernetes.io/zone label has the value westus2-3. This ensures that the pod that owns that PV will only ever be placed on a node in westus2-3, and that the disk backing the PV will be attached to the node the pod is running on when the pod starts.
At this point, I deleted all the pods to get them on the other nodes:
❯ kubectl get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
echo-0   1/1     Running   0          4m4s    10.48.36.168   aks-nodepool2-25997496-vmss000004   <none>           <none>
echo-1   1/1     Running   0          3m30s   10.48.36.202   aks-nodepool2-25997496-vmss000005   <none>           <none>
echo-2   1/1     Running   0          2m56s   10.48.36.42    aks-nodepool2-25997496-vmss000003   <none>           <none>
There's no way to see it via Kubernetes, but you can see via the Azure portal that managed disk kubernetes-dynamic-pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d, which backs PV pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d, which in turn backs PVC zonetest/demo-echo-0, is listed as Managed by: aks-nodepool2-25997496-vmss_4. So it has been detached from its previous node and attached to the node where the pod is now running.
(Image: portal screenshot showing the disk attached to node 4)
If I were to remove nodes such that I didn't have nodes in AZ 3, I wouldn't be able to start pod echo-1, since it's bound to a disk in AZ 3, which can't be attached to a node not in AZ 3.

Azure Kubernetes - Auto-scaling & Nodeselector, Taint and Tolerance?

I have an AKS cluster with the below configuration
Windows Node Pools - 1
Nodes - 2
Node Labels - 2 : app1, app2
Pods - 4 : two pods for each app, node is selected based on the nodeselector
Pods use taints & tolerations
Node auto-scaling is enabled
Now, let's say a new node is created to support additional load on app1. Would that new node be labelled and tainted automatically, so that app1 can be deployed on it?
When you create a nodepool, you can specify labels (--labels) and taints (--node-taints) that are applied automatically to every node in the pool, including nodes added by the autoscaler. Once the nodepool is created, I don't think you can currently go back and add that auto-label or auto-tainting ability.

Problem with Kubernetes Cluster Autoscaler on Azure

I have a Kubernetes cluster running on an Azure Virtual Machine Scale Set (VMSS). I use the Kubernetes Cluster Autoscaler to scale the number of nodes. It works fine if I set the limits from 1 to 10, but a problem appears when I set the minimum to 0, in one particular case:
The number of nodes has been scaled down to 0, and after that the cluster autoscaler pod restarted. I then want to run a pod on this VMSS (a pod with nodeSelector agentpool: memory), but it looks like the autoscaler can't read the appropriate labels from the VMSS when the number of instances is scaled to 0.
According to the documentation, I added the following tag to the VMSS: k8s.io_cluster-autoscaler_node-template_label_agentpool: memory.
I have logs from autoscaler pod:
GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
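For reference, the setup described above corresponds to a pod spec along the following lines. This is only a sketch of the configuration being discussed, not a fix. The pod name and image are placeholders, and the label key and value in the nodeSelector are expected to match the suffix and value of the VMSS tag.

```yaml
# VMSS tag set on the scale set (from the question):
#   k8s.io_cluster-autoscaler_node-template_label_agentpool: memory
apiVersion: v1
kind: Pod
metadata:
  name: memory-job                    # hypothetical name
spec:
  # Must match the label the autoscaler derives from the tag above.
  nodeSelector:
    agentpool: memory
  containers:
  - name: main
    image: registry.example.com/memory-job:1.0   # placeholder image
```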
