AKS nodepool in a failed state, pods all Pending - Azure

Yesterday I was using kubectl from my command line and kept getting this message after any command. Everything was working fine the previous day and I had not touched anything in my AKS cluster.
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-01-11T12:57:51-05:00 is after 2022-01-11T13:09:11Z
After some googling to solve this issue I found a guide about rotating certificates:
https://learn.microsoft.com/en-us/azure/aks/certificate-rotation
Following the rotation guide fixed my certificate issue, but all my pods were still in a Pending state, so I then followed this guide: https://learn.microsoft.com/en-us/azure/aks/update-credentials
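(For reference, the rotation step from the first guide boils down to the following; the resource group and cluster names here are placeholders, not my real ones.)
# Rotate all cluster certificates and keys (this restarts components, expect some downtime)
az aks rotate-certs --resource-group myResourceGroup --name myAKSCluster
# Refresh the local kubeconfig so kubectl stops presenting the old certificate
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster --overwrite-existing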
After following the second guide, one of my node pools (the one of type User) started working again, but the one of type System is still in a failed state with all pods Pending.
I am not sure what next steps I should take to solve this issue. Does anyone have any recommendations? I was going to delete the node pool and make a new one, but I can't do that either because it is the last System node pool.

Assuming you are using an API version older than 2020-03-01 for creating the AKS cluster:
A few limitations apply when you create and manage AKS clusters that support system node pools.
• An API version of 2020-03-01 or greater must be used to set a node pool mode. Clusters created on API versions older than 2020-03-01 contain only user node pools, but can be migrated to contain system node pools by following the update pool mode steps.
• The mode of a node pool is a required property and must be explicitly set when using ARM templates or direct API calls.
You can use the Bicep/JSON code provided in the MS document to create the AKS cluster, since it uses the upgraded API version.
You can also follow this MS document if you want to create a new AKS cluster with a system node pool or add a dedicated system node pool to an existing AKS cluster.
The following command adds a dedicated node pool of mode System with a default count of three nodes.
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name systempool \
--node-count 3 \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--mode System
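To verify that the new pool came up in the right mode before touching the failed one, listing the pools with their modes is a quick check; a minimal sketch with the same placeholder names:
# List every node pool with its mode and provisioning state
az aks nodepool list \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --query "[].{name:name, mode:mode, state:provisioningState}" \
    --output table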

Related

AKS Cluster deployment fails with "ReconcileMSICredentialError"

When I try to deploy a fresh AKS cluster with "Dev/Test" settings via the Portal, I get the following error during deployment:
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed.
Please list deployment operations for details. Please see
https://aka.ms/DeployOperations for usage details.","details":
[{"code":"ReconcileMSICredentialError","message":"Reconcile MSI credential failed.
Details: autorest/azure: Service returned an error. Status=409 Code=\"Conflict\"
Message=\"Secret bf905bf9e9ad86526b26e98d2ea490a0a500ff23907f9df987d95de3a649e751 is
currently being deleted and cannot be re-created; retry later.\" InnerError=
{\"code\":\"ObjectIsBeingDeleted\"}."}]}
However, the resource still gets deployed, but with a notification that "the resource is in a failed state". When I stop the cluster and start it again, the notification disappears, but I'm not sure if the error remains.
I can avoid the error altogether, if I pick a new name for the cluster. However, I'd like to keep the old name.
The same happens when I deploy with different settings (CPU, number of nodes, etc.). I also tried deleting the cluster entirely and deploying it completely anew, but the error persists. I haven't found any explanation for this error on either Stack Overflow or Google.
What could be the reason for this error, and how can I avoid it?
I tried to reproduce the same issue in my environment and got the results below.
I created the AKS cluster with the dev/test settings.
The reference cluster was created successfully.
I retrieved credentials for the cluster using the below command:
az aks get-credentials --resource-group Alldemorg --name cluster_name
I created a sample application and deployed it into the cluster; I used the following reference for the example manifest file.
The deployment succeeded and I am able to see all the pods and nodes that were created for the application.
Note:
1) The "ReconcileMSICredentialError" occurs because of the version; please check the cluster version and upgrade to the latest.
2) If the resource is in a failed state, delete the entire resource instead of just the cluster and create it again; if you merely stop and start the resource, there is a chance of getting "ReconcileMSICredentialError" again.
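For note 1, a minimal sketch of the version check and upgrade, assuming placeholder resource names:
# Show which Kubernetes versions the cluster can be upgraded to
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
# Upgrade the control plane and node pools (the version number is only an example)
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.24.9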

Pull images from an Azure container registry to a Kubernetes cluster

I have followed this tutorial microsoft_website to pull images from an Azure container registry. My YAML successfully creates a pod job, which can pull the image, BUT only when it runs on the agentpool node in my cluster.
For example, adding nodeName: aks-agentpool-33515997-vmss000000 to the YAML works fine, but when specifying a different node name, e.g. nodeName: aks-cpu1-33515997-vmss000000, the pod fails. The error message I get with describe pods is Failed to pull image, followed by kubelet Error: ErrImagePull.
What am I missing?
Create secret:
kubectl create secret docker-registry <secret-name> \
--docker-server=<container-registry-name>.azurecr.io \
--docker-username=<service-principal-ID> \
--docker-password=<service-principal-password>
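The secret only takes effect once a pod references it; here is a minimal sketch of wiring it into a pod spec via imagePullSecrets (the pod name and image are hypothetical, substitute your own):
# Hypothetical pod that pulls from the registry using the secret created above
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: acr-test-pod
spec:
  containers:
  - name: app
    image: <container-registry-name>.azurecr.io/<image>:<tag>
  imagePullSecrets:
  - name: <secret-name>
EOF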
As #user1571823 said, the solution to the problem is deleting the old image from the ACR and creating/pushing a new one.
The problem was related to some sort of corruption in the image saved in the Azure Container Registry (ACR). The reason one agent pool could pull the image was actually that the image already existed on the VM.
Hence, as #andov said, it is a good option to open an incident case with Azure support for AKS from the subscription where AKS is deployed. The support team has full access to the AKS service backend and they can tell you exactly what was causing your problem.
Four things to check:
Is it a subscription issue? Are the nodes in different subscriptions?
Is it a rights issue? Does the service principal of the node have rights to pull the image? (See the sketch after this list.)
Is it a network issue? Are the nodes on different subnets?
Is there something about the image size or configuration that means it cannot run on the other cluster?
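For the rights check in particular, newer Azure CLI versions ship a diagnostic that validates the pull path from the cluster's nodes; a sketch with placeholder names:
# Validates that the cluster's identity can authenticate to and pull from the registry
az aks check-acr --resource-group myResourceGroup --name myAKSCluster --acr myregistry.azurecr.io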
Edit
New-AzAksNodePool has a parameter -DefaultProfile. It can be AzContext, AzureRmContext, or AzureCredential. If this differs between your nodes, it would explain the error.

Azure Kubernetes Service (AKS) and the primary node pool

Foreword
When you create a Kubernetes cluster on AKS you specify the type of VMs you want to use for your nodes (--node-vm-size). I read that you can't change this after you create the Kubernetes cluster, which would mean that you'd be scaling horizontally instead of vertically whenever you add resources.
However, you can create different node pools in an AKS cluster that use different types of VMs for your nodes. So, I thought: if you want to "change" the type of VM that you chose initially, maybe add a new node pool and remove the old one ("nodepool1")?
I tried that through the following steps:
Create a node pool named "stda1v2" with a VM type of "Standard_A1_v2"
Delete "nodepool1" (az aks nodepool delete --cluster-name ... -g ... -n nodepool1
Unfortunately I was met with Primary agentpool cannot be deleted.
Question
What is the purpose of the "primary agentpool" which cannot be deleted, and does it matter (a lot) what type of VM I choose when I create the AKS cluster (in a real world scenario)?
Can I create other node pools and let the primary one live its life? Will it cause trouble in the future if I have node pools that use larger VMs for their nodes while the primary one is still using "Standard_A1_v2", for example?
The primary node pool is the first node pool in the cluster, and you cannot delete it because that is currently not supported. You can create and delete additional node pools and just let the primary one be as it is. It will not cause any trouble.
For the primary node pool I suggest picking a VM size that makes more sense in the long run (since you cannot change it). The B-series would be a good fit, since they are cheap and their CPU/memory ratio is good for average workloads.
P.S. You can always scale the primary node pool to 0 nodes, cordon the node, and shut it down. You will have to repeat this after each upgrade, but otherwise it will work.
It looks like this functionality was introduced around the time of your question, allowing you to add new system nodepools and delete old ones, including the initial nodepool. After encountering the same error message myself while trying to tidy up a cluster, I discovered I had to set another nodepool to a system type in order to delete the first.
There's more info about it here, but in short, Azure nodepools are split into two types ('modes' as they call it): System and User. When creating a single pool to begin with, it will be of a system type (favouring system pod scheduling -- so it might be good to have a dedicated pool of a node or two for system use, then a second user nodepool for the actual app pods).
So if you wish to delete your only system pool, you need to first create another nodepool with the --mode switch set to 'system' (with your preferred VM size etc.), then you'll be able to delete the first (and nodepool modes can't be changed after the fact, only on creation).
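Putting that together, a minimal sketch of the swap (pool names, counts, and resource names are placeholders):
# Create a second pool explicitly marked as System
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name sysnp2 \
    --node-count 2 \
    --mode System
# The initial pool is no longer the only System pool, so it can now be deleted
az aks nodepool delete \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name nodepool1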

Unable to pull image from Azure Container Registry

We recently had an issue with our Azure Kubernetes Cluster not reporting back any data through the Azure Portal. To fix this, I updated the Kubernetes version to the latest version as was recommended on GitHub. After the upgrade was complete, we were able to view logs and monitoring data through the portal, but one of the containers stored in our Azure Container Registry is not able to be pulled by the Kubernetes Cluster.
The error I see in the Kubernetes management page is:
Failed to pull image "myacr.azurecr.io/container:190305.191": [rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/mycontainer/manifests/190305.191: unauthorized: authentication required, rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/mycontainer/manifests/190305.191: unauthorized: authentication required]
My original setup used the first script provided in this document and it worked correctly without issue. Once I started receiving the error, I ran it again just to make sure.
Once I saw that failed, I then deleted the account from the permissions on both the ACR and the AKS. Again, it failed to pull the image.
After that, I tried using the second method of creating a Kubernetes secret and received the same error.
At this point, I'm unsure what else to check. I've verified that I can run docker pull on my machine and pull the image, but there seems to be a breakdown between AKS and the ACR that I cannot sort out.
It's been a while since I originally posted this, but I did stumble across a currently stable solution to the problem.
The service principal, for whatever reason, is not able to maintain a connection to the ACR. So if your cluster ever goes down, you lose the ability to pull from the ACR. I had this happen multiple times over the last year and as I moved more of my Kubernetes deployment to Azure, it became a bigger and bigger issue.
I stumbled across this Microsoft Doc and noticed the mention of the --attach-acr flag.
This is what the full command looks like:
az aks create -n myAKSCluster -g myResourceGroup --generate-ssh-keys --attach-acr $MYACR
Since setting it up with that flag, I have had 0 issues with it.
knock on wood
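For an existing cluster, the same role assignment can also be attached after the fact via az aks update; a sketch assuming the same placeholder names:
# Grant the cluster's identity pull access on the registry without recreating the cluster
az aks update --name myAKSCluster --resource-group myResourceGroup --attach-acr $MYACR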

Unable to connect AKS cluster: connection time out

I've created an AKS cluster in the UK region in Azure.
Currently, I can no longer access my AKS cluster. Connecting to the public IPs fails; all connections time out.
Furthermore, I can't run the kubectl command either:
fcarlier@ubuntu:~$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout
Is there a known issue with AKS in that region or is it something on my side?
Is there a known issue with AKS in that region or is it something on my side?
Sorry to give you a bad experience.
For now, Azure AKS is still in preview; please try to recreate the cluster, ukwest works fine now.
Here is a similar case to yours, please refer to it.
I just successfully created a single-node AKS cluster on UK West with no issues. Can you please retest? For now, I would avoid provisioning in West US 2 until the threshold issues are fixed. I'm aware the AKS team is actively engaged in restoring service in West US. Sorry for the inconvenience. Below are the sample commands to create a cluster in the UK if you need the reference. Hope this helps.
Create resource group (UK West):
az group create --name myResourceGroupUK --location ukwest
Create AKS cluster (UK West):
az aks create --resource-group myResourceGroupUK --name myK8sClusterUK --agent-count 1 --generate-ssh-keys
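Once the cluster is up, fetching credentials and listing nodes is a quick way to confirm it is reachable:
# Merge the new cluster's credentials into your kubeconfig, then test connectivity
az aks get-credentials --resource-group myResourceGroupUK --name myK8sClusterUK
kubectl get nodes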
I just finished a big post over here on this topic (which is not as straightforward as a single solution/workaround): 'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)
That being said, the solution to this one for me was to scale the nodes up, and then back down, for my impacted cluster from the Azure Kubernetes service blade web console.
Workaround / Potential Solution
Log into the Azure Console — Kubernetes Service blade.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
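If you prefer the CLI over the portal blade, the same bump can be sketched like this (resource names, pool name, and node counts are placeholders for your own values):
# Scale up by one node...
az aks scale --resource-group myResourceGroup --name myAKSCluster --nodepool-name nodepool1 --node-count 4
# ...then back down once kubectl connects again
az aks scale --resource-group myResourceGroup --name myAKSCluster --nodepool-name nodepool1 --node-count 3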
More Background Info on the Issue
I also added this solution to the full ticket description write-up that I posted over here (if you want more info, have a read):
'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)
