Can't run GPU pod - 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true} - Azure

I am trying to create a GPU node in my Azure cluster.
I am following these instructions - https://learn.microsoft.com/en-us/azure/aks/gpu-cluster
I already had a K8s cluster, so I added a new pool:
az aks nodepool add \
--resource-group XXX \
--cluster-name XXX \
--name spotgpu \
--node-vm-size standard_nv12s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--node-count 1 \
--min-count 1 \
--max-count 2 \
--max-pods 12 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price 0.2
The node pool was created successfully:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
...
aks-spotgpu-XXX-XXX Ready agent 11m v1.21.9
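For reference, Spot node pools also get the kubernetes.azure.com/scalesetpriority=spot:NoSchedule taint added automatically, on top of the sku=gpu taint from --node-taints. A quick way to confirm which taints actually ended up on the new node (a diagnostic sketch; the node name is the placeholder from the output above):
kubectl describe node aks-spotgpu-XXX-XXX | grep -A 3 Taints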
After that I applied this Job - https://learn.microsoft.com/en-us/azure/aks/gpu-cluster#run-a-gpu-enabled-workload
But the new pod can't run; it is stuck in the Pending state:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 76s default-scheduler 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Warning FailedScheduling 75s default-scheduler 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Normal NotTriggerScaleUp 39s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate
I tried different max/min/node count variants but always got the same warning messages and can't start the pod.
Where am I going wrong?
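For context, every taint named in those scheduler events needs a matching toleration on the Job's pod template before the pod can land on the spot GPU node. A sketch built only from the taint keys and values shown in the events above:
tolerations:
- key: "sku"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"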

Related

Unable to get Azure Key Vault integrated with Azure Kubernetes Service

I'm stuck on getting this integration working. I'm following the documentation step by step.
The following is everything I have done starting from scratch, so if it isn't listed here, I haven't tried it (I apologize in advance for the long series of commands):
# create the resource group
az group create -l westus -n k8s-test
# create the azure container registry
az acr create -g k8s-test -n k8stestacr --sku Basic -l westus
# create the azure key vault and add a test value to it
az keyvault create --name k8stestakv --resource-group k8s-test -l westus
az keyvault secret set --vault-name k8stestakv --name SECRETTEST --value abc123
# create the azure kubernetes service
az aks create -n k8stestaks -g k8s-test --kubernetes-version=1.19.7 --node-count 1 -l westus --enable-managed-identity --attach-acr k8stestacr -s Standard_B2s
# switch to the aks context
az aks get-credentials -n k8stestaks -g k8s-test
# install helm charts for secrets store csi
helm repo add csi-secrets-store-provider-azure https://raw.githubusercontent.com/Azure/secrets-store-csi-driver-provider-azure/master/charts
helm install csi-secrets-store-provider-azure/csi-secrets-store-provider-azure --generate-name
# create role managed identity operator
az role assignment create --role "Managed Identity Operator" --assignee <k8stestaks_clientId> --scope /subscriptions/<subscriptionId>/resourcegroups/MC_k8s-test_k8stestaks_westus
# create role virtual machine contributor
az role assignment create --role "Virtual Machine Contributor" --assignee <k8stestaks_clientId> --scope /subscriptions/<subscriptionId>/resourcegroups/MC_k8s-test_k8stestaks_westus
# install more helm charts
helm repo add aad-pod-identity https://raw.githubusercontent.com/Azure/aad-pod-identity/master/charts
helm install pod-identity aad-pod-identity/aad-pod-identity
# create identity
az identity create -g MC_k8s-test_k8stestaks_westus -n TestIdentity
# give the new identity a reader role for AKV
az role assignment create --role "Reader" --assignee <TestIdentity_principalId> --scope /subscriptions/<subscriptionId>/resourceGroups/k8s-test/providers/Microsoft.KeyVault/vaults/k8stestakv
# allow the identity to get secrets from AKV
az keyvault set-policy -n k8stestakv --secret-permissions get --spn <TestIdentity_clientId>
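Before moving on, the role assignments and the Key Vault access policy can also be double-checked from the CLI (a diagnostic sketch, reusing the placeholders above):
az role assignment list --assignee <TestIdentity_principalId> -o table
az keyvault show -n k8stestakv --query properties.accessPolicies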
That is pretty much it for az cli commands. Everything up to this point executes fine with no errors. I can go into the portal, see these roles for the MC_ group, the TestIdentity with read-only for secrets, etc.
After that, the documentation has you build secretProviderClass.yaml:
apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: azure-kvname
spec:
  provider: azure
  parameters:
    usePodIdentity: "true"
    useVMManagedIdentity: "false"
    userAssignedIdentityID: ""
    keyvaultName: "k8stestakv"
    cloudName: ""
    objects: |
      array:
        - |
          objectName: SECRETTEST
          objectType: secret
          objectVersion: ""
    resourceGroup: "k8s-test"
    subscriptionId: "<subscriptionId>"
    tenantId: "<tenantId>"
And also the podIdentityBinding.yaml:
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  name: azureIdentity
spec:
  type: 0
  resourceID: /subscriptions/<subscriptionId>/resourcegroups/MC_k8s-test_k8stestaks_westus/providers/Microsoft.ManagedIdentity/userAssignedIdentities/TestIdentity
  clientID: <TestIdentity_clientId>
---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: azure-pod-identity-binding
spec:
  azureIdentity: azureIdentity
  selector: azure-pod-identity-binding-selector
Then just apply them:
# this one executes fine
kubectl apply -f k8s/secret/secretProviderClass.yaml
# this one does not
kubectl apply -f k8s/identity/podIdentityBinding.yaml
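For completeness, a pod that consumes the secret would need the aadpodidbinding label matching the AzureIdentityBinding selector plus a CSI volume pointing at the SecretProviderClass; a sketch assuming the names defined above (the pod name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: secret-test-pod
  labels:
    aadpodidbinding: azure-pod-identity-binding-selector
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: secrets-store
      mountPath: /mnt/secrets-store
      readOnly: true
  volumes:
  - name: secrets-store
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: azure-kvname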
Problem #1
With the second one (podIdentityBinding.yaml) I get:
unable to recognize "k8s/identity/podIdentityBinding.yaml": no matches for kind "AzureIdentity" in version "aadpodidentity.k8s.io/v1"
unable to recognize "k8s/identity/podIdentityBinding.yaml": no matches for kind "AzureIdentityBinding" in version "aadpodidentity.k8s.io/v1"
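These messages mean the API server has no AzureIdentity/AzureIdentityBinding CRDs registered. Whether the chart actually installed them can be checked with (a diagnostic sketch):
kubectl get crd | grep aadpodidentity
kubectl api-resources | grep -i azureidentity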
I'm not sure why, because the helm install pod-identity aad-pod-identity/aad-pod-identity command was successful. Looking at my Pods, however...
Problem #2
I've followed these steps three times, and every time the issue is the same: the aad-pod-identity-nmi-xxxxx pod will not launch:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
aad-pod-identity-mic-7b4558845f-hwv8t 1/1 Running 0 37m
aad-pod-identity-mic-7b4558845f-w8mxt 1/1 Running 0 37m
aad-pod-identity-nmi-4sf5q 0/1 CrashLoopBackOff 12 37m
csi-secrets-store-provider-azure-1613256848-cjlwc 1/1 Running 0 41m
csi-secrets-store-provider-azure-1613256848-secrets-store-m4wth 3/3 Running 0 41m
$ kubectl describe pod aad-pod-identity-nmi-4sf5q
Name: aad-pod-identity-nmi-4sf5q
Namespace: default
Priority: 0
Node: aks-nodepool1-40626841-vmss000000/10.240.0.4
Start Time: Sat, 13 Feb 2021 14:57:54 -0800
Labels: app.kubernetes.io/component=nmi
app.kubernetes.io/instance=pod-identity
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=aad-pod-identity
controller-revision-hash=669df55fd8
helm.sh/chart=aad-pod-identity-3.0.3
pod-template-generation=1
tier=node
Annotations: <none>
Status: Running
IP: 10.240.0.4
IPs:
IP: 10.240.0.4
Controlled By: DaemonSet/aad-pod-identity-nmi
Containers:
nmi:
Container ID: containerd://5f9e17e95ae395971dfd060c1db7657d61e03052ffc3cbb59d01c774bb4a2f6a
Image: mcr.microsoft.com/oss/azure/aad-pod-identity/nmi:v1.7.4
Image ID: mcr.microsoft.com/oss/azure/aad-pod-identity/nmi#sha256:0b4e296a7b96a288960c39dbda1a3ffa324ef33c77bb5bd81a4266b85efb3498
Port: <none>
Host Port: <none>
Args:
--node=$(NODE_NAME)
--http-probe-port=8085
--operation-mode=standard
--kubelet-config=/etc/default/kubelet
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Sat, 13 Feb 2021 15:34:40 -0800
Finished: Sat, 13 Feb 2021 15:34:40 -0800
Ready: False
Restart Count: 12
Limits:
cpu: 200m
memory: 512Mi
Requests:
cpu: 100m
memory: 256Mi
Liveness: http-get http://:8085/healthz delay=10s timeout=1s period=5s #success=1 #failure=3
Environment:
NODE_NAME: (v1:spec.nodeName)
FORCENAMESPACED: false
Mounts:
/etc/default/kubelet from kubelet-config (ro)
/run/xtables.lock from iptableslock (rw)
/var/run/secrets/kubernetes.io/serviceaccount from aad-pod-identity-nmi-token-8sfh4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
iptableslock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
kubelet-config:
Type: HostPath (bare host directory volume)
Path: /etc/default/kubelet
HostPathType:
aad-pod-identity-nmi-token-8sfh4:
Type: Secret (a volume populated by a Secret)
SecretName: aad-pod-identity-nmi-token-8sfh4
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38m default-scheduler Successfully assigned default/aad-pod-identity-nmi-4sf5q to aks-nodepool1-40626841-vmss000000
Normal Pulled 38m kubelet Successfully pulled image "mcr.microsoft.com/oss/azure/aad-pod-identity/nmi:v1.7.4" in 14.677657725s
Normal Pulled 38m kubelet Successfully pulled image "mcr.microsoft.com/oss/azure/aad-pod-identity/nmi:v1.7.4" in 5.976721016s
Normal Pulled 37m kubelet Successfully pulled image "mcr.microsoft.com/oss/azure/aad-pod-identity/nmi:v1.7.4" in 627.112255ms
Normal Pulling 37m (x4 over 38m) kubelet Pulling image "mcr.microsoft.com/oss/azure/aad-pod-identity/nmi:v1.7.4"
Normal Pulled 37m kubelet Successfully pulled image "mcr.microsoft.com/oss/azure/aad-pod-identity/nmi:v1.7.4" in 794.669637ms
Normal Created 37m (x4 over 38m) kubelet Created container nmi
Normal Started 37m (x4 over 38m) kubelet Started container nmi
Warning BackOff 3m33s (x170 over 38m) kubelet Back-off restarting failed container
I'm not sure whether the two problems are related, and I haven't been able to get the failing Pod to start up.
Any suggestions?
It looks like it is related to the default network plugin that AKS picks for you if you don't specify "Advanced" for the network options: kubenet.
This integration can still be done with kubenet, as outlined here:
https://azure.github.io/aad-pod-identity/docs/configure/aad_pod_identity_on_kubenet/
If you are creating a new cluster, enable Advanced networking or add the --network-plugin azure flag.
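Applied to the cluster-creation command used above, that would look roughly like this (a sketch; only the --network-plugin flag is new relative to the original command):
az aks create -n k8stestaks -g k8s-test --kubernetes-version=1.19.7 --node-count 1 -l westus --enable-managed-identity --attach-acr k8stestacr -s Standard_B2s --network-plugin azure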

Spark submit on Kubernetes cloud engine using only one node, one cpu requested

I have set up a cluster with 4 nodes, each having 2 CPUs, so 8 in total.
My code ends up running on only one CPU; no matter the settings, the requested CPU count in the pod description is always 1 and the execution time stays the same. I tried the Spark examples and the same thing applies.
The spark-submit script I use:
./bin/spark-submit \
--master k8s://https://34.64.87.144 \
--deploy-mode cluster \
--name spark-counter \
--class DataCounter \
--driver-java-options "-Dlog4j.configuration=file:////opt/spark/data/log4j.properties" \
--conf spark.executor.cores=4 \
--conf spark.kubernetes.executor.request.cores=3.6 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=spark-counter \
--conf spark.kubernetes.container.image=asia.gcr.io/profound-media-298808/spark-base:latest \
local:///opt/spark/data/spark_counter-1.0.jar /opt/spark/data/input1
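Note that spark.executor.cores and spark.kubernetes.executor.request.cores only shape the executor pods; the driver pod described below gets its own, separate CPU request. When spreading work across all nodes, the executor count and the driver request are often pinned explicitly as well; a sketch of extra flags for the same spark-submit (property names as in the Spark on Kubernetes configuration docs, values purely illustrative):
--conf spark.executor.instances=4 \
--conf spark.kubernetes.driver.request.cores=1 \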
and the description of the (driver) pod:
spark-role=driver
Annotations: <none>
Status: Succeeded
IP: 10.32.4.37
IPs:
IP: 10.32.4.37
Containers:
spark-kubernetes-driver:
Container ID: containerd://bd31b8112159145169ab1b6397af8bc2f10cee5429b11c8025f2359ab5194882
Image: asia.gcr.io/profound-media-298808/spark-base:latest
Image ID: asia.gcr.io/profound-media-298808/spark-base#sha256:6aaf817da5606a39bf2aeea769c4ec2d62c7986d06109cb4a38f4f7157702ff1
Ports: 7078/TCP, 7079/TCP, 4040/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
driver
--properties-file
/opt/spark/conf/spark.properties
--class
DataCounter
spark-internal
/opt/spark/data/input1
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 18 Dec 2020 07:13:45 +0000
Finished: Fri, 18 Dec 2020 07:14:28 +0000
Ready: False
Restart Count: 0
Limits:
memory: 1408Mi
Requests:
cpu: 1
memory: 1408Mi
Environment:
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SPARK_LOCAL_DIRS: /var/data/spark-f4832458-7450-4d96-b7ed-c672d5ec0eda
SPARK_CONF_DIR: /opt/spark/conf
Mounts:
/opt/spark/conf from spark-conf-volume (rw)
/var/data/spark-f4832458-7450-4d96-b7ed-c672d5ec0eda from spark-local-dir-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from spark-token-5zk2c (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
spark-local-dir-1:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-conf-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: spark-counter-1608275621712-driver-conf-map
Optional: false
spark-token-5zk2c:
Type: Secret (a volume populated by a Secret)
SecretName: spark-token-5zk2c
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m44s default-scheduler Successfully assigned default/spark-counter to gke-cluster-1-default-pool-f51b7df5-g048
Warning FailedMount 7m43s (x2 over 7m43s) kubelet MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "spark-counter-1608275621712-driver-conf-map" not found
Normal Pulled 7m42s kubelet Container image "asia.gcr.io/profound-media-298808/spark-base:latest" already present on machine
Normal Created 7m42s kubelet Created container spark-kubernetes-driver
Normal Started 7m42s kubelet Started container spark-kubernetes-driver
I have no additional configuration files set up; I used the Spark image builder script as a base to build the image, only adding my own jar and data to it. Should I have something more?
How do I set up my cluster to utilize all nodes?

nodeSelector constraint ignored on AKS on mixed node pools (windows/linux)?

I tried installing a basic NGINX ingress using Helm by running the following command:
helm install nginx-ingress --namespace ingress-basic ingress-nginx/ingress-nginx \
--set controller.service.loadBalancerIP='52.232.109.226' \
--set controller.nodeSelector."beta\.kubernetes\.io/os"='linux' \
--set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"='linux' \
--set controller.replicaCount=1 \
--set rbac.create=true
Shortly after installing I noticed the pod was scheduled onto a Windows node instead of a Linux node:
wesley@Azure:~$ kubectl get pods -n ingress-basic -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-ingress-ingress-nginx-admission-create-jcp6x 0/1 ContainerCreating 0 18s <none> akswin000002 <none> <none>
wesley@Azure:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-agentpool-59412422-vmss000000 Ready agent 5h32m v1.17.11 10.240.0.4 <none> Ubuntu 16.04.7 LTS 4.15.0-1096-azure docker://19.3.12
aks-linuxpool-59412422-vmss000000 Ready agent 5h32m v1.17.11 10.240.0.128 <none> Ubuntu 16.04.7 LTS 4.15.0-1096-azure docker://19.3.12
akswin000000 Ready agent 5h28m v1.17.11 10.240.0.35 <none> Windows Server 2019 Datacenter 10.0.17763.1397 docker://19.3.11
akswin000001 Ready agent 5h28m v1.17.11 10.240.0.66 <none> Windows Server 2019 Datacenter 10.0.17763.1397 docker://19.3.11
akswin000002 Ready agent 5h28m v1.17.11 10.240.0.97 <none> Windows Server 2019 Datacenter 10.0.17763.1397 docker://19.3.11
Running a describe on the nginx pod revealed that the Node-Selectors field is still set to <none>.
Name: nginx-ingress-ingress-nginx-admission-create-jcp6x
Namespace: ingress-basic
Priority: 0
PriorityClassName: <none>
Node: akswin000002/10.240.0.97
Start Time: Fri, 16 Oct 2020 20:09:36 +0000
Labels: app.kubernetes.io/component=admission-webhook
app.kubernetes.io/instance=nginx-ingress
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=ingress-nginx
app.kubernetes.io/version=0.40.2
controller-uid=d03091cd-8138-4923-a369-afeca669099c
helm.sh/chart=ingress-nginx-3.7.1
job-name=nginx-ingress-ingress-nginx-admission-create
Annotations: <none>
**Status: Pending**
IP:
Controlled By: Job/nginx-ingress-ingress-nginx-admission-create
Containers:
create:
Container ID:
Image: docker.io/jettech/kube-webhook-certgen:v1.3.0
Image ID:
Port: <none>
Host Port: <none>
Args:
create
--host=nginx-ingress-ingress-nginx-controller-admission,nginx-ingress-ingress-nginx-controller-admission.$(POD_NAMESPACE).svc
--namespace=$(POD_NAMESPACE)
--secret-name=nginx-ingress-ingress-nginx-admission
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment:
POD_NAMESPACE: ingress-basic (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-ingress-nginx-admission-token-8x6ct (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nginx-ingress-ingress-nginx-admission-token-8x6ct:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-ingress-nginx-admission-token-8x6ct
Optional: false
QoS Class: BestEffort
**Node-Selectors: <none>**
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 5m (x5401 over 2h) kubelet, akswin000002 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 31s (x5543 over 2h) kubelet, akswin000002 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-ingress-ingress-nginx-admission-create-jcp6x": Error response from daemon: container a734c23d20338d7fed800752c19f5e94688fd38fe82c9e90fc14533bae90c6bc encountered an error during hcsshim::System::CreateProcess: failure in a Windows system call: The user name or password is incorrect. (0x52e) extra info: {"CommandLine":"cmd /S /C pauseloop.exe","User":"2000","WorkingDirectory":"C:\\","Environment":{"PATH":"c:\\Windows\\System32;c:\\Windows"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0]}
I expected the pod to be scheduled onto a Linux node instead. Does anyone have a clue why this is happening? I saw no taints or anything, and this is a newly spun-up cluster. The only workaround for now seems to be to scale the Windows nodes back to 0, install the NGINX ingress, and then scale the Windows nodes up again.
Kubernetes version: 1.17.11
It works with the command below; the change is the added admissionWebhooks.patch.nodeSelector.
https://learn.microsoft.com/en-us/azure/aks/ingress-basic#create-an-ingress-controller
helm install nginx-ingress ingress-nginx/ingress-nginx \
--namespace ingress-basic \
--set controller.replicaCount=2 \
--set controller.nodeSelector."beta\.kubernetes\.io/os"=linux \
--set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"=linux \
--set controller.admissionWebhooks.patch.nodeSelector."beta\.kubernetes\.io/os"=linux
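To confirm where the admission and controller pods land after this change, the same check used earlier can be reused (a usage sketch):
kubectl get pods -n ingress-basic -o wide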

aks reporting "Insufficient pods"

I've gone through the Azure Cats&Dogs tutorial described here, and I am getting an error in the final step where the apps are launched in AKS. Kubernetes is reporting that I have insufficient pods, but I'm not sure why this would be. I ran through this same tutorial a few weeks ago without problems.
$ kubectl apply -f azure-vote-all-in-one-redis.yaml
deployment.apps/azure-vote-back created
service/azure-vote-back created
deployment.apps/azure-vote-front created
service/azure-vote-front created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
azure-vote-back-655476c7f7-mntrt 0/1 Pending 0 6s
azure-vote-front-7c7d7f6778-mvflj 0/1 Pending 0 6s
$ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
84s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
70s Warning FailedScheduling Pod skip schedule deleting pod: default/azure-vote-back-655476c7f7-l5j28
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-kjld6
99s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-l5j28
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-mntrt
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
99s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-rmbqb
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-mvflj
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
53m Normal EnsuringLoadBalancer Service Ensuring load balancer
52m Normal EnsuredLoadBalancer Service Ensured load balancer
46s Normal DeletingLoadBalancer Service Deleting load balancer
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 7d4h v1.9.9
The only thing I can think of that has changed is that I have other (larger) clusters running now as well, and the main reason I went through this Cats&Dogs tutorial again was because I hit this same problem today with my other clusters. Is this a resource limit issue with my Azure account?
Update 10-20/3:15 PST: Notice how these three clusters all show that they use the same nodepool, even though they were created in different resource groups. Also note how the "get-credentials" call for gem2-cluster reports an error. I did have a cluster earlier called gem2-cluster which I deleted and recreated using the same name (in fact I deleted the whole resource group). What's the correct process for doing this?
$ az aks get-credentials --name gem1-cluster --resource-group gem1-rg
Merged "gem1-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 3h26m v1.9.11
$ az aks get-credentials --name gem2-cluster --resource-group gem2-rg
A different object named gem2-cluster already exists in clusters
$ az aks get-credentials --name gem3-cluster --resource-group gem3-rg
Merged "gem3-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem2
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem3
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
What is your max-pods set to? This is a normal error when you've reached the limit of pods per node.
You can check your current maximum number of pods per node with:
$ kubectl get nodes -o yaml | grep pods
pods: "30"
pods: "30"
And your current pod count with:
$ kubectl get pods --all-namespaces | grep Running | wc -l
18
I hit this because I exceeded the max pods; I found out how many pods I could handle with:
$ kubectl get nodes -o json | jq -r .items[].status.allocatable.pods | paste -sd+ - | bc
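If the per-node pod limit is what's being hit, note that on AKS the maximum number of pods per node is fixed when a node pool is created, via the same --max-pods flag used in the GPU node-pool command at the top of this page; a sketch with placeholder names:
az aks nodepool add \
--resource-group <resource-group> \
--cluster-name <cluster-name> \
--name <poolname> \
--node-count 1 \
--max-pods 60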
Check to make sure you are not hitting core limits for your subscription.
az vm list-usage --location "<location>" -o table
If you are, you can request more quota: https://learn.microsoft.com/en-us/azure/azure-supportability/resource-manager-core-quotas-request

Azure aks no nodes found

I created an Azure AKS cluster with 3 nodes (Standard DS3 v2: 4 vCPUs, 14 GB memory). I was fiddling with the cluster and created a Deployment with 1000 replicas. After this the complete cluster went down.
azureuser@saa:~$ k get cs
NAME STATUS MESSAGE ERROR
controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused
scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused
etcd-0 Healthy {"health": "true"}
From debugging it seems both the Scheduler and the Controller-manager went down. How do I fix this?
What exactly happened when I created a Deployment with 1000 replicas? Shouldn't that be handled by k8s?
Output from a few debugging commands:
kubectl cluster-info
Kubernetes master is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443
Heapster is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubernetes-dashboard is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
Logs for kubectl cluster-info dump # http://termbin.com/e6wb
azureuser@sim:~$ az aks scale -n cg -g cognitive-games -c 4 --verbose
Deployment failed. Correlation ID: 4df797b2-28bf-4c18-a26a-4e341xxxxx. Operation failed with status: 200. Details: Resource state Failed
No nodes are displayed:
azureuser@si:~$ k get nodes
No resources found
It looks silly, but when an AKS cluster is created in a resource group, two resource groups actually appear: one with the AKS resource itself and another one, with a random hash in its name, containing all the VMs. I had deleted that second resource group, and the AKS cluster stopped working.
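For reference, that second resource group is the cluster's node resource group, which AKS manages on the cluster's behalf; its name can be looked up with (a sketch, with placeholders for the cluster and resource group names):
az aks show -n <cluster-name> -g <resource-group> --query nodeResourceGroup -o tsv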
