AKS reporting "Insufficient pods" - Azure

I've gone through the Azure Cats&Dogs tutorial described here and I am getting an error in the final step where the apps are launched in AKS. Kubernetes is reporting that I have insufficient pods, but I'm not sure why this would be. I ran through this same tutorial a few weeks ago without problems.
$ kubectl apply -f azure-vote-all-in-one-redis.yaml
deployment.apps/azure-vote-back created
service/azure-vote-back created
deployment.apps/azure-vote-front created
service/azure-vote-front created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
azure-vote-back-655476c7f7-mntrt 0/1 Pending 0 6s
azure-vote-front-7c7d7f6778-mvflj 0/1 Pending 0 6s
$ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
84s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
70s Warning FailedScheduling Pod skip schedule deleting pod: default/azure-vote-back-655476c7f7-l5j28
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-kjld6
99s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-l5j28
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-mntrt
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
99s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-rmbqb
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-mvflj
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
53m Normal EnsuringLoadBalancer Service Ensuring load balancer
52m Normal EnsuredLoadBalancer Service Ensured load balancer
46s Normal DeletingLoadBalancer Service Deleting load balancer
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 7d4h v1.9.9
The only thing I can think of that has changed is that I have other (larger) clusters running now as well, and the main reason I went through this Cats&Dogs tutorial again was because I hit this same problem today with my other clusters. Is this a resource limit issue with my Azure account?
Update 10-20/3:15 PST: Notice how these three clusters all show that they use the same nodepool, even though they were created in different resource groups. Also note how the "get-credentials" call for gem2-cluster reports an error. I did have a cluster earlier called gem2-cluster which I deleted and recreated using the same name (in fact I deleted the whole resource group). What's the correct process for doing this?
$ az aks get-credentials --name gem1-cluster --resource-group gem1-rg
Merged "gem1-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 3h26m v1.9.11
$ az aks get-credentials --name gem2-cluster --resource-group gem2-rg
A different object named gem2-cluster already exists in clusters
$ az aks get-credentials --name gem3-cluster --resource-group gem3-rg
Merged "gem3-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem2
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem3
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
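(On the get-credentials error above: deleting and recreating a cluster with the same name leaves a stale gem2-cluster entry in ~/.kube/config, and az aks get-credentials refuses to overwrite it by default. Roughly, clearing it looks like this; the clusterUser_<rg>_<cluster> user name is the convention az uses, so adjust if yours differs:
$ az aks get-credentials --name gem2-cluster --resource-group gem2-rg --overwrite-existing
or remove the stale entries by hand before merging again:
$ kubectl config delete-context gem2-cluster
$ kubectl config delete-cluster gem2-cluster
$ kubectl config unset users.clusterUser_gem2-rg_gem2-cluster)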

What is your max-pods set to? This is a normal error when you've reached the limit of pods per node.
You can check your current maximum number of pods per node with:
$ kubectl get nodes -o yaml | grep pods
pods: "30"
pods: "30"
And your current pod count with:
$ kubectl get pods --all-namespaces | grep Running | wc -l
18

I hit this because I had exceeded the max pods. I found out how many pods my nodes could handle in total by doing:
$ kubectl get nodes -o json | jq -r .items[].status.allocatable.pods | paste -sd+ - | bc
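If the totals confirm you really are at the per-node pod limit, the usual remedies are to add nodes or to add a node pool with a higher --max-pods (max-pods is normally fixed when a node pool is created). A rough sketch, where myResourceGroup/myAKSCluster are placeholder names:
# Add capacity by scaling the existing pool
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3
# Or add a pool whose nodes each allow more pods
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name bigpods \
  --node-count 1 \
  --max-pods 110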

Check to make sure you are not hitting core limits for your subscription.
az vm list-usage --location "<location>" -o table
If you are, you can request more quota: https://learn.microsoft.com/en-us/azure/azure-supportability/resource-manager-core-quotas-request
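For example, to look at just the VM family your nodes use in one region (the location and family name here are placeholders):
az vm list-usage --location eastus -o table | grep -i "DSv2"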

Related

Azure Add-on Built-In Policies constraint templates not visible

I enabled the Azure Policy add-on on my Azure cluster. The audit and gatekeeper pods were created in the gatekeeper-system namespace, but I do not see any default constraint templates or constraints created.
/home/azure> kubectl get pods -n gatekeeper-system
NAME READY STATUS RESTARTS AGE
gatekeeper-audit-5c96c9b6d6-tlwrd 1/1 Running 0 6m34s
gatekeeper-controller-7c4c5b8667-kthqv 1/1 Running 0 6m34s
gatekeeper-controller-7c4c5b8667-pjtn7 1/1 Running 0 6m34s
/home/azure>
/home/azure> kubectl get constraints
No resources found
/home/azure> kubectl get constrainttemplates
No resources found
/home/azure>
I see no resources found. Ideally, the built-in Policy constraint templates should be visible, like this:
$ kubectl get constrainttemplate
NAME AGE
k8sazureallowedcapabilities 23m
k8sazureallowedusersgroups 23m
k8sazureblockhostnamespace 23m
k8sazurecontainerallowedimages 23m
k8sazurecontainerallowedports 23m
k8sazurecontainerlimits 23m
k8sazurecontainernoprivilege 23m
k8sazurecontainernoprivilegeescalation 23m
k8sazureenforceapparmor 23m
k8sazurehostfilesystem 23m
k8sazurehostnetworkingports 23m
k8sazurereadonlyrootfilesystem 23m
k8sazureserviceallowedports 23m
Can someone please help me understand why they are not visible?
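One hedged suggestion before digging further: confirm the add-on actually reports as enabled and give it some time, since the constraint templates are synced onto the cluster by the add-on; as far as I understand, they correspond to Azure Policy assignments for Kubernetes, so if no policy or initiative is assigned at a scope covering the cluster there may be nothing for the add-on to create. The resource group and cluster names below are placeholders:
# Confirm the add-on is enabled on the cluster
az aks show --resource-group myResourceGroup --name myAKSCluster --query addonProfiles.azurepolicy.enabled
# Enable it if it is not
az aks enable-addons --resource-group myResourceGroup --name myAKSCluster --addons azure-policy
# Watch for the templates to be synced in
kubectl get constrainttemplates -w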

Can't run GPU pod - 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}

Trying to create a GPU node in my Azure cluster.
I am following these instructions - https://learn.microsoft.com/en-us/azure/aks/gpu-cluster
I already had a K8s cluster, so I added a new pool:
az aks nodepool add \
--resource-group XXX \
--cluster-name XXX \
--name spotgpu \
--node-vm-size standard_nv12s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--node-count 1 \
--min-count 1 \
--max-count 2 \
--max-pods 12 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price 0.2
So, node pool was successfully created:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
...
aks-spotgpu-XXX-XXX Ready agent 11m v1.21.9
After that I applied this Job - https://learn.microsoft.com/en-us/azure/aks/gpu-cluster#run-a-gpu-enabled-workload
But the new pod can't run; it is stuck in the Pending state -
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 76s default-scheduler 0/12 nodes are available: 1
node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Warning FailedScheduling 75s default-scheduler 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Normal NotTriggerScaleUp 39s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate
I tried different max/min/node count variants but always got the same warning messages and couldn't start the pod.
Where am I going wrong?
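For what it's worth, the scheduler output reads like a toleration problem rather than a capacity problem: the pool was created with the taint sku=gpu:NoSchedule, and Spot pools additionally carry kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so the GPU job has to tolerate both. A sketch of a minimal test Job with both tolerations (the image is the sample one I believe the linked doc uses, so treat it and the other names as illustrative):
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-toleration-test
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      tolerations:
      # taint set on the pool via --node-taints sku=gpu:NoSchedule
      - key: sku
        operator: Equal
        value: gpu
        effect: NoSchedule
      # taint AKS adds automatically to Spot node pools
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      containers:
      - name: gpu-test
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "100"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
The NotTriggerScaleUp message lists the spot taint as something the pod didn't tolerate, which is consistent with the missing spot toleration blocking both scheduling and autoscaler scale-up.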

Gitlab Autodevops: Resetting a kubernetes cluster

I'm currently on a self-hosted Gitlab 11.9 instance. I have the ability to add a kube cluster to projects on an individual level, but not on a group level (that was introduced in 11.10).
I created a Kubernetes cluster on AWS EKS and successfully connected it to Gitlab's Autodevops for a specific project. I was able to successfully install Helm tiller, Prometheus, and Gitlab Runner. Autodevops was working fine for that project.
Before I discovered that having a cluster run at the group-level was introduced in Gitlab 11.10, I disconnected the kube cluster from the first project and connected it at the group-level. I successfully installed Helm Tiller but failed to install Ingress or Cert-Manager. After I discovered my version doesn't contain group-level Auto DevOps functionality, I connected the cluster to another, different application and attempted to install Prometheus and GitLab Runner. However, the operation failed.
My pods are as follows:
% kubectl get pods --namespace=gitlab-managed-apps
NAME READY STATUS RESTARTS AGE
install-prometheus 0/1 Error 0 18h
install-runner 0/1 Error 0 18h
prometheus-kube-state-metrics-8668948654-8p4d5 1/1 Running 0 18h
prometheus-prometheus-server-746bb67956-789ln 2/2 Running 0 18h
runner-gitlab-runner-548ddfd4f4-k5r8s 1/1 Running 0 18h
tiller-deploy-6586b57bcb-p8kdm 1/1 Running 0 18h
Here's some output from my log file:
% kubectl logs install-prometheus --namespace=gitlab-managed-apps --container=helm
+ helm init --upgrade
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
+ seq 1 30
+ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Error: cannot connect to Tiller
+ sleep 1s
Retrying (1)...
+ echo 'Retrying (1)...'
+ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Error: cannot connect to Tiller
...
+ sleep 1s
+ echo 'Retrying (30)...'
+ helm upgrade prometheus stable/prometheus --install --reset-values --tls --tls-ca-cert /data/helm/prometheus/config/ca.pem --tls-cert /data/helm/prometheus/config/cert.pem --tls-key /data/helm/prometheus/config/key.pem --version 6.7.3 --set 'rbac.create=false,rbac.enabled=false' --namespace gitlab-managed-apps -f /data/helm/prometheus/config/values.yaml
Retrying (30)...
Error: UPGRADE FAILED: remote error: tls: bad certificate
This cluster doesn't contain anything except the services, pods, and deployments created specifically for Auto DevOps. How should I go about 'resetting' the cluster or uninstalling these services?
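Since everything installed from GitLab's Kubernetes page lives in the gitlab-managed-apps namespace, the usual way to "reset" is to delete that namespace and then reinstall the apps from the GitLab UI; if the cluster was added with a dedicated service account, remove that as well. A rough sketch (the gitlab/gitlab-admin names are the ones GitLab's docs commonly use, so adjust to whatever you actually created):
# Remove Tiller, Prometheus, the runner, and the failed install-* pods in one go
kubectl delete namespace gitlab-managed-apps
# Optional: remove the service account / binding created when the cluster was added
kubectl delete serviceaccount gitlab -n kube-system
kubectl delete clusterrolebinding gitlab-admin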

Azure AKS no nodes found

I created an Azure AKS cluster with 3 nodes (Standard DS3 v2: 4 vCPUs, 14 GB memory). I was fiddling with the cluster and created a Deployment with 1000 replicas. After this, the complete cluster went down.
azureuser@saa:~$ k get cs
NAME STATUS MESSAGE ERROR
controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused
scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused
etcd-0 Healthy {"health": "true"}
From debugging, it seems both the scheduler and controller-manager went down. How do I fix this?
What exactly happened when I created a Deployment with 1000 replicas? Shouldn't k8s handle that?
Few debugging commands output:
kubectl cluster-info
Kubernetes master is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443
Heapster is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubernetes-dashboard is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
Logs for kubectl cluster-info dump # http://termbin.com/e6wb
azureuser@sim:~$ az aks scale -n cg -g cognitive-games -c 4 --verbose
Deployment failed. Correlation ID: 4df797b2-28bf-4c18-a26a-4e341xxxxx. Operation failed with status: 200. Details: Resource state Failed
No nodes are displayed:
azureuser@si:~$ k get nodes
No resources found
It looks silly, but when an AKS cluster is created in a resource group, a second resource group is surprisingly also created (one with a random hash in its name) containing all the VMs. I deleted that second resource group and the AKS cluster stopped working.
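That auto-created second group is the cluster's node resource group (typically named MC_<resourceGroup>_<clusterName>_<location>); it holds the VMs/scale sets, NICs, and disks, so deleting it removes every node. You can check which group a cluster uses as below (names taken from the az aks scale call above), but once that group is gone the practical fix is to delete and recreate the AKS cluster:
az aks show --resource-group cognitive-games --name cg --query nodeResourceGroup -o tsv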

kube-dns stays in ContainerCreating status

I have 5 machines running Ubuntu 16.04.1 LTS. I want to set them up as a Kubernetes cluster. I'm trying to follow this getting started guide where they're using kubeadm.
It all worked fine until step 3/4, Installing a pod network. I looked at their add-on page for a pod network and chose the flannel overlay network. I copied the YAML file to the machine and executed:
root@up01:/home/up# kubectl apply -f flannel.yml
Which resulted in:
configmap "kube-flannel-cfg" created
daemonset "kube-flannel-ds" created
So I thought it went OK, but when I display all the pods:
root@up01:/etc/kubernetes/manifests# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system dummy-2088944543-d5f50 1/1 Running 0 50m
kube-system etcd-up01 1/1 Running 0 48m
kube-system kube-apiserver-up01 1/1 Running 0 50m
kube-system kube-controller-manager-up01 1/1 Running 0 49m
kube-system kube-discovery-1769846148-jvx53 1/1 Running 0 50m
kube-system kube-dns-2924299975-prlgf 0/4 ContainerCreating 0 49m
kube-system kube-flannel-ds-jb1df 2/2 Running 0 32m
kube-system kube-proxy-rtcht 1/1 Running 0 49m
kube-system kube-scheduler-up01 1/1 Running 0 49m
The problem is that kube-dns stays in the ContainerCreating state. I don't know what to do.
It is very likely that you missed this critical piece of information from the guide:
If you want to use flannel as the pod network, specify
--pod-network-cidr 10.244.0.0/16 if you’re using the daemonset manifest below.
If you omit this, kube-dns will never leave the ContainerCreating status.
Your kubeadm init command should be:
# kubeadm init --pod-network-cidr 10.244.0.0/16
and not
# kubeadm init
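Since the cluster in the question was already initialized without that flag, the control plane has to be torn down and re-initialized before kube-dns will start; roughly (a generic kubeadm sequence run as root, not something specific to this guide):
# kubeadm reset
# kubeadm init --pod-network-cidr 10.244.0.0/16
# kubectl apply -f flannel.yml
kubeadm reset wipes the previous kubeadm state on the master (run it on any nodes that already joined too), and the flannel manifest has to be re-applied after the new init.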
Did you try restarting NetworkManager? It worked for me. It also worked when I disabled IPv6.
