I have an AKS cluster with 3 nodes. I tried to manually scale out from 3 to 4 nodes, and the scale-up itself went fine.
After about 20 minutes, however, all 4 nodes went NotReady and none of the kube-system services were in a Ready state.
NAME STATUS ROLES AGE VERSION
aks-agentpool-40760006-vmss000000 Ready agent 16m v1.18.14
aks-agentpool-40760006-vmss000001 Ready agent 17m v1.18.14
aks-agentpool-40760006-vmss000002 Ready agent 16m v1.18.14
aks-agentpool-40760006-vmss000003 Ready agent 11m v1.18.14
NAME STATUS ROLES AGE VERSION
aks-agentpool-40760006-vmss000000 NotReady agent 23m v1.18.14
aks-agentpool-40760006-vmss000002 NotReady agent 24m v1.18.14
aks-agentpool-40760006-vmss000003 NotReady agent 19m v1.18.14
k get po -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-748cdb7bf4-7frq2 0/1 Pending 0 10m
coredns-748cdb7bf4-vg5nn 0/1 Pending 0 10m
coredns-748cdb7bf4-wrhxs 1/1 Terminating 0 28m
coredns-autoscaler-868b684fd4-2gb8f 0/1 Pending 0 10m
kube-proxy-p6wmv 1/1 Running 0 28m
kube-proxy-sksz6 1/1 Running 0 23m
kube-proxy-vpb2g 1/1 Running 0 28m
metrics-server-58fdc875d5-sbckj 0/1 Pending 0 10m
tunnelfront-5d74798f6b-w6rvn 0/1 Pending 0 10m
The node events show the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 25m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 25m kubelet Updated Node Allocatable limit across pods
Normal Starting 25m kube-proxy Starting kube-proxy.
Normal NodeReady 24m kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeReady
Warning FailedToCreateRoute 5m5s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 50.264754ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m55s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 45.945658ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m45s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.180158ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m35s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.550858ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m25s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 44.74355ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m15s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 42.428456ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m5s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 41.664858ms: timed out waiting for the condition
Warning FailedToCreateRoute 3m55s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 48.456954ms: timed out waiting for the condition
Warning FailedToCreateRoute 3m45s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 38.611964ms: timed out waiting for the condition
Warning FailedToCreateRoute 65s (x16 over 3m35s) route_controller (combined from similar events): Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 13.972487ms: timed out waiting for the condition
You can use the cluster autoscaler to avoid such situations in the future.
To keep up with application demands in Azure Kubernetes Service (AKS),
you may need to adjust the number of nodes that run your workloads.
The cluster autoscaler component can watch for pods in your cluster
that can't be scheduled because of resource constraints. When issues
are detected, the number of nodes in a node pool is increased to meet
the application demand. Nodes are also regularly checked for a lack of
running pods, with the number of nodes then decreased as needed. This
ability to automatically scale up or down the number of nodes in your
AKS cluster lets you run an efficient, cost-effective cluster.
You can update an existing AKS cluster in your current resource group to enable the cluster autoscaler:
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
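If you need to change the node count range later on a cluster where the autoscaler is already enabled, the same command takes an update flag. A sketch based on the AKS docs (adjust the counts to your workload):
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--update-cluster-autoscaler \
--min-count 1 \
--max-count 5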
It seems to be OK now. I was lacking the permission to scale up the nodes.
I enabled the Azure Policy add-on on my AKS cluster. When I enabled it, the audit and gatekeeper controller pods were created in the gatekeeper-system namespace, but I do not see any default constraint templates or constraints created.
/home/azure> kubectl get pods -n gatekeeper-system
NAME READY STATUS RESTARTS AGE
gatekeeper-audit-5c96c9b6d6-tlwrd 1/1 Running 0 6m34s
gatekeeper-controller-7c4c5b8667-kthqv 1/1 Running 0 6m34s
gatekeeper-controller-7c4c5b8667-pjtn7 1/1 Running 0 6m34s
/home/azure>
/home/azure> kubectl get constraints
No resources found
/home/azure> kubectl get constrainttemplates
No resources found
/home/azure>
I see "No resources found". Ideally, in a good scenario, the built-in policy constraint templates should be visible, like this:
$ kubectl get constrainttemplate
NAME AGE
k8sazureallowedcapabilities 23m
k8sazureallowedusersgroups 23m
k8sazureblockhostnamespace 23m
k8sazurecontainerallowedimages 23m
k8sazurecontainerallowedports 23m
k8sazurecontainerlimits 23m
k8sazurecontainernoprivilege 23m
k8sazurecontainernoprivilegeescalation 23m
k8sazureenforceapparmor 23m
k8sazurehostfilesystem 23m
k8sazurehostnetworkingports 23m
k8sazurereadonlyrootfilesystem 23m
k8sazureserviceallowedports 23m
Can someone please help me understand why they are not visible?
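One way to narrow this down is to check that the add-on actually reports as enabled and that an Azure Policy assignment targets the cluster, since the built-in constraint templates are typically only created once an assignment has synced to the cluster (which can take 15-20 minutes). A sketch with placeholder names; the query path is an assumption about the add-on profile key:
az aks show --resource-group <myResourceGroup> --name <myAKSCluster> --query addonProfiles.azurepolicy.enabled -o tsv
az policy assignment list --disable-scope-strict-match -o table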
I am new to YugabyteDB. I have 6 pods with the names and states given below. One pod, yb-tserver-3, is in CrashLoopBackOff state. Now I am not able to connect to my DB through DBeaver, as I am getting two errors:
Error 1: FATAL: Remote error: Service unavailable (yb/rpc/service_pool.cc:223): OpenTable request on yb.tserver.PgClientService from [xxx.xxx.xxx.xxx.xxx]:xxxx dropped due to backpressure
Error 2: connection timeout
yb-master-0 2/2 Running 0 24h
yb-master-1 2/2 Running 0 24h
yb-master-2 2/2 Running 0 23h
yb-tserver-0 2/2 Running 230 14d
yb-tserver-1 2/2 Running 0 25h
yb-tserver-2 2/2 Running 0 23h
yb-tserver-3 1/2 CrashLoopBackOff 4 6m33s
Now my question is: if one of my yb-tservers is down, the YugabyteDB service should still be up and running, but here my DB is down and rejecting both application and DBeaver connectivity. How can I resolve this issue? I have seen many times that if one tablet server stops working, all connectivity is lost.
Please help me out.
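To dig into why the tserver keeps restarting, the crash reason can usually be pulled from the pod events and the previous container's logs. A sketch; the namespace and container name are assumptions based on the standard YugabyteDB Helm chart:
kubectl describe pod yb-tserver-3 -n yb-demo
kubectl logs yb-tserver-3 -n yb-demo -c yb-tserver --previous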
I have a small Node.js application for testing with Kubernetes, but it seems that the application does not keep running.
I put all the code that I developed for this test on GitHub.
I run kubectl create -f deploy.yaml.
It works, but...
[webapp@srvapih ex-node]$ kubectl get pods
NAME READY STATUS RESTARTS AGE
api-7b89bd4755-4lc6k 1/1 Running 0 5s
api-7b89bd4755-7x964 0/1 ContainerCreating 0 5s
api-7b89bd4755-dv299 1/1 Running 0 5s
api-7b89bd4755-w6tzj 0/1 ContainerCreating 0 5s
api-7b89bd4755-xnm8l 0/1 ContainerCreating 0 5s
[webapp@srvapih ex-node]$ kubectl get pods
NAME READY STATUS RESTARTS AGE
api-7b89bd4755-4lc6k 0/1 CrashLoopBackOff 1 11s
api-7b89bd4755-7x964 0/1 CrashLoopBackOff 1 11s
api-7b89bd4755-dv299 0/1 CrashLoopBackOff 1 11s
api-7b89bd4755-w6tzj 0/1 CrashLoopBackOff 1 11s
api-7b89bd4755-xnm8l 0/1 CrashLoopBackOff 1 11s
Events from kubectl describe pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 6m48s (x5 over 8m14s) kubelet, srvweb05.beirario.intranet Container image "node:8-alpine" already present on machine
Normal Created 6m48s (x5 over 8m14s) kubelet, srvweb05.beirario.intranet Created container
Normal Started 6m48s (x5 over 8m12s) kubelet, srvweb05.beirario.intranet Started container
Normal Scheduled 6m9s default-scheduler Successfully assigned default/api-7b89bd4755-4lc6k to srvweb05.beirario.intranet
Warning BackOff 3m2s (x28 over 8m8s) kubelet, srvweb05.beirario.intranet Back-off restarting failed container
All I can say here is that you are providing a task that finishes: with command: ["/bin/sh","-c", "node", "servidor.js"], only "node" is passed to -c as the command string, so the process exits immediately.
Instead, you should provide the command in a way that never completes, as in the sketch below.
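A minimal sketch of that fix in the container spec, assuming the rest of your deploy.yaml matches the describe output below:
containers:
- name: ex-node
  image: node:8-alpine
  ports:
  - containerPort: 3000
  # Run the script directly so the Node.js server stays in the foreground.
  command: ["node", "servidor.js"]
  # or equivalently: command: ["/bin/sh", "-c", "node servidor.js"]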
Describing your pod shows that the container completed successfully with exit code 0:
Containers:
ex-node:
Container ID: docker://836ffd771b3514fd13ae3e6b8818a7f35807db55cf8f756e962131823a476675
Image: node:8-alpine
Image ID: docker-pullable://node@sha256:8e9987a6d91d783c56980f1bd4b23b4c05f9f6076d513d6350fef8fe09ed01fd
Port: 3000/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
node
servidor.js
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Mar 2019 14:29:54 +0000
Finished: Fri, 08 Mar 2019 14:29:54 +0000
you may use "process.stdout.write" method in your code ,This will cause the k8s session to be lost. Do not print anything in stdout!
Try to use pm2: https://pm2.io/docs/runtime/integration/docker/. It starts your Node.js app as a background process.
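A sketch of what the pod command could look like if you go the pm2 route (this assumes pm2 is installed in the image, which node:8-alpine does not include by default):
# requires e.g. RUN npm install -g pm2 in the Dockerfile
command: ["pm2-runtime", "servidor.js"]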
I'd like to configure the cluster autoscaler on AKS. When scaling down, it fails due to a PDB:
I1207 14:24:09.523313 1 cluster.go:95] Fast evaluation: node aks-nodepool1-32797235-0 cannot be removed: no enough pod disruption budget to move kube-system/metrics-server-5cbc77f79f-44f9w
I1207 14:24:09.523413 1 cluster.go:95] Fast evaluation: node aks-nodepool1-32797235-3 cannot be removed: non-daemonset, non-mirrored, non-pdb-assignedkube-system pod present: cluster-autoscaler-84984799fd-22j42
I1207 14:24:09.523438 1 scale_down.go:490] 2 nodes found to be unremovable in simulation, will re-check them at 2018-12-07 14:29:09.231201368 +0000 UTC m=+8976.856144807
All system pods have a minAvailable: 1 PDB assigned manually. I can imagine that this does not work for pods with only a single replica, like the metrics-server:
❯ k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-nodepool1-32797235-0 Ready agent 4h v1.11.4 10.240.0.4 <none> Ubuntu 16.04.5 LTS 4.15.0-1030-azure docker://3.0.1
aks-nodepool1-32797235-3 Ready agent 4h v1.11.4 10.240.0.6 <none> Ubuntu 16.04.5 LTS 4.15.0-1030-azure docker://3.0.1
❯ ks get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
cluster-autoscaler-84984799fd-22j42 1/1 Running 0 2h 10.244.1.5 aks-nodepool1-32797235-3 <none>
heapster-5d6f9b846c-g7qb8 2/2 Running 0 1h 10.244.0.16 aks-nodepool1-32797235-0 <none>
kube-dns-v20-598f8b78ff-8pshc 4/4 Running 0 3h 10.244.1.4 aks-nodepool1-32797235-3 <none>
kube-dns-v20-598f8b78ff-plfv8 4/4 Running 0 1h 10.244.0.15 aks-nodepool1-32797235-0 <none>
kube-proxy-fjvjv 1/1 Running 0 1h 10.240.0.6 aks-nodepool1-32797235-3 <none>
kube-proxy-szr8z 1/1 Running 0 1h 10.240.0.4 aks-nodepool1-32797235-0 <none>
kube-svc-redirect-2rhvg 2/2 Running 0 4h 10.240.0.4 aks-nodepool1-32797235-0 <none>
kube-svc-redirect-r2m4r 2/2 Running 0 4h 10.240.0.6 aks-nodepool1-32797235-3 <none>
kubernetes-dashboard-68f468887f-c8p78 1/1 Running 0 4h 10.244.0.7 aks-nodepool1-32797235-0 <none>
metrics-server-5cbc77f79f-44f9w 1/1 Running 0 4h 10.244.0.3 aks-nodepool1-32797235-0 <none>
tiller-deploy-57f988f854-z9qln 1/1 Running 0 4h 10.244.0.8 aks-nodepool1-32797235-0 <none>
tunnelfront-7cf9d447f9-56g7k 1/1 Running 0 4h 10.244.0.2 aks-nodepool1-32797235-0 <none>
What needs to be changed (number of replicas? PDB configuration?) for down-scaling to work?
Basically, this is an administration issue that arises when draining nodes whose pods are covered by a PDB (Pod Disruption Budget).
This is because evictions are forced to respect the PDB you specify.
You have two options:
Either force the drain:
kubectl drain foo --force --grace-period=0
You can check the other options in the docs: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain
Or use the Eviction API:
{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
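You can POST that object to the pod's eviction subresource. A sketch, assuming the JSON above is saved as eviction.json and kubectl proxy is running locally on its default port:
kubectl proxy &
curl -v -H 'Content-Type: application/json' \
  http://127.0.0.1:8001/api/v1/namespaces/default/pods/quux/eviction \
  -d @eviction.json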
Either way, the drain or the Eviction API attempts to delete the pods so that they can be scheduled elsewhere before the node is completely drained.
As mentioned in the docs:
the API can respond in one of three ways:
If the eviction is granted, then the pod is deleted just as if you had sent a DELETE request to the pod’s URL and you get back 200 OK.
If the current state of affairs wouldn’t allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests
If there is some kind of misconfiguration, like multiple budgets pointing at the same pod, you will get 500 Internal Server Error.
For a given eviction request, there are two cases:
There is no budget that matches this pod. In this case, the server always returns 200 OK.
There is at least one budget. In this case, any of the three above responses may apply.
If it gets stuck, then you might need to do it manually.
You can read more here or here.
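As for what actually needs to change for down-scaling to work: for the autoscaler to evict a pod covered by a minAvailable: 1 PDB, that deployment either needs more than one replica, or its PDB has to allow at least one disruption. A sketch for the metrics-server case; the name and label selector are illustrative and should match your deployment's labels:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1   # lets the single replica be evicted during scale-down
  selector:
    matchLabels:
      k8s-app: metrics-server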
I've gone through the Azure Cats&Dogs tutorial described here and I am getting an error in the final step where the apps are launched in AKS. Kubernetes is reporting that I have insufficient pods, but I'm not sure why this would be. I ran through this same tutorial a few weeks ago without problems.
$ kubectl apply -f azure-vote-all-in-one-redis.yaml
deployment.apps/azure-vote-back created
service/azure-vote-back created
deployment.apps/azure-vote-front created
service/azure-vote-front created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
azure-vote-back-655476c7f7-mntrt 0/1 Pending 0 6s
azure-vote-front-7c7d7f6778-mvflj 0/1 Pending 0 6s
$ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
84s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
70s Warning FailedScheduling Pod skip schedule deleting pod: default/azure-vote-back-655476c7f7-l5j28
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-kjld6
99s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-l5j28
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-mntrt
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
99s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-rmbqb
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-mvflj
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
53m Normal EnsuringLoadBalancer Service Ensuring load balancer
52m Normal EnsuredLoadBalancer Service Ensured load balancer
46s Normal DeletingLoadBalancer Service Deleting load balancer
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 7d4h v1.9.9
The only thing I can think of that has changed is that I have other (larger) clusters running now as well, and the main reason I went through this Cats&Dogs tutorial again was because I hit this same problem today with my other clusters. Is this a resource limit issue with my Azure account?
Update 10-20/3:15 PST: Notice how these three clusters all show that they use the same nodepool, even though they were created in different resource groups. Also note how the "get-credentials" call for gem2-cluster reports an error. I did have a cluster earlier called gem2-cluster which I deleted and recreated using the same name (in fact I deleted the whole resource group). What's the correct process for doing this?
$ az aks get-credentials --name gem1-cluster --resource-group gem1-rg
Merged "gem1-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 3h26m v1.9.11
$ az aks get-credentials --name gem2-cluster --resource-group gem2-rg
A different object named gem2-cluster already exists in clusters
$ az aks get-credentials --name gem3-cluster --resource-group gem3-rg
Merged "gem3-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem2
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem3
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
What is your max-pods set to? This is a normal error when you've reached the limit of pods per node.
You can check your current maximum number of pods per node with:
$ kubectl get nodes -o yaml | grep pods
pods: "30"
pods: "30"
And your current with:
$ kubectl get pods --all-namespaces | grep Running | wc -l
18
I hit this because I exceeded the max pods. I found out how many I could handle by doing:
$ kubectl get nodes -o json | jq -r .items[].status.allocatable.pods | paste -sd+ - | bc
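Note that max-pods is normally fixed when a cluster or node pool is created, so raising it usually means creating a new node pool with a recent Azure CLI. A sketch; names and counts are placeholders:
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name biggerpool \
  --node-count 1 \
  --max-pods 110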
Check to make sure you are not hitting core limits for your subscription.
az vm list-usage --location "<location>" -o table
If you are, you can request more quota: https://learn.microsoft.com/en-us/azure/azure-supportability/resource-manager-core-quotas-request