This container service is in a failed state - Azure

All of our AKS clusters have the following error reported in Azure Portal:
This container service is in a failed state. Click here to open a new support request.
It seems we also cannot edit the cluster. When trying to scale out the nodes, I am getting the following error:
Failed to save container service 'test-aks'. Error: Operation is not allowed while cluster is being upgrading or failed in upgrade
When looking into the AKS properties, I see there is a provisioning state of "Failed":
We don't know how to troubleshoot this problem.

Use the az aks scale command to scale the cluster nodes from the Azure CLI, as described here: https://learn.microsoft.com/en-us/azure/aks/scale-cluster#scale-the-cluster-nodes
You can also inspect the agent pool profiles:
az aks show --resource-group myResourceGroup --name myAKSCluster --query agentPoolProfiles
This shows the descriptive error message in the Azure CLI. It is likely that you exceeded your core quota limit.
More details are discussed in this thread: https://github.com/Azure/AKS/issues/542
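For reference, the scale command from the linked doc follows this pattern; a minimal sketch, assuming the same myResourceGroup/myAKSCluster placeholder names used above (add --nodepool-name if the cluster has more than one node pool):
# placeholder names; adjust the node count to your target
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3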

For the issue that you see:
This container service is in a failed state. Click here to open a new
support request.
It also happened to me. Usually there is a limit on the resources a subscription can use; in my case I can only use 10 vCPUs, so I got the error when scaling up to more nodes once no vCPUs were left in the quota. I think that is a possible reason in your case as well, so it is worth checking.
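One way to check how much of the regional vCPU quota is already in use is sketched below; <location> is a placeholder for the region your cluster runs in:
# placeholder region; look at the total regional vCPUs row and the per-family rows
az vm list-usage --location <location> --output table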

Related

Failing to update the cluster

One of the test AKS clusters I am trying to update gives the following error:
Error: SkuNotAvailable.
Message: The requested VM size for resource 'Following SKUs have failed for capacity restrictions: Standard_D4s_v4' is currently not available in location 'SouthAfricaNorth'. Please try another size or deploy to a different location or different size.
I have checked and found that the quota is available in the subscription for this SKU and region selected.
Now the cluster and its node pools have gone into a failed state.
As far as I know, the "SkuNotAvailable" error means either a capacity issue in the region or that your subscription doesn't have access to that specific size.
You can verify that by running the Azure CLI command below:
az vm list-skus --location centralus --size Standard_D --all --output table
If a SKU isn't available for your subscription in a location or zone that meets your business needs, submit a SKU request to Azure Support.
If the subscription doesn't have access, please reach out to the Azure subscription and quota management support team through a support case to check whether the particular size can be enabled on your subscription; if they cannot enable it for any reason, they will give an appropriate explanation.
At this point there is nothing that can be done on the AKS side.
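To check the specific size from the question against your own subscription, a sketch like the following should work (southafricanorth matches the region in the error message; if the size is blocked for the subscription, the Restrictions column typically shows NotAvailableForSubscription):
# region and size taken from the error above; adjust as needed
az vm list-skus --location southafricanorth --size Standard_D4s_v4 --all --output table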

Kubernetes pods using invalid Azure Active Directory access tokens

When deploying new jobs and services to Azure Kubernetes Service cluster, the pods fail to request valid AAD access tokens with all permissions available. If new permissions were added on the same day, before or after a deployment, the tokens still do not pick them up. This issue has been observed with permissions granted to Active Directory Groups over Key Vaults, Storage Accounts, and SQL databases scopes so far.
Example: I have a .NET 5.0 C# API running on 3 pods with antiaffinity rules located each on a separate node. The application reads information from a SQL database. I made a release and added the database permissions afterwards. Things I have tried so far to make the application reset the access tokens:
kubectl delete pods --all -n <namespace> which essentially created 3 new pods again failing due to insufficient permissions.
kubectl apply -f deployment.yaml to deploy a new version of the image running in the containers, again all 3 pods kept failing.
kubectl delete -f deployment.yaml followed by kubectl apply -f deployment.yaml to erase the old kubernetes object and create a new one. This resolved the issue on 2/3 pods, however, the third one kept failing due to insufficient permissions.
kubectl delete namespace <namespace> to erase the entire namespace with all configuration available and recreated it again. Surprisingly, again 2/3 pods were running with the correct permissions and the last one did not.
The commands were run more than one hour after the permissions were added to the group. The database tokens are valid for 24 hours, and when I have seen this issue occur with cronjobs, I had to wait a day for the task to execute correctly (none of the above steps worked in the cronjob scenario). The validity of the tokens kept changing, which implies the pods are requesting new access tokens that again exclude the most recently added permissions. The only solution I have found that works 100% of the time is to destroy the cluster and recreate it, which is not viable in any production scenario.
The failing pod from my example was the one always running on node 00, which made me think there may be an extra caching layer on the first node of the cluster. However, I still do not understand why the other 2 pods were running with no issue, and what the way is to restart my pods or refresh the access token to minimise the wait time until resolution.
Kubernetes version: 1.21.7.
The cluster has no AKS-managed AAD or pod-identity enabled. All RBAC is granted to the cluster MSI via AD groups.
Please check whether the points below can work around the issue in your case.
To access the Kubernetes resources, you must have access to the AKS cluster, the Kubernetes API, and the Kubernetes objects. Ensure that you're either a cluster administrator or a user with the appropriate permissions to access the AKS cluster.
Things you need to do, if you haven't already:
Enable Azure RBAC on your existing AKS cluster, using:
az aks update -g myResourceGroup -n myAKSCluster --enable-azure-rbac
Create a role that allows read access to all other Pods and Services.
Add the necessary roles (Azure Kubernetes Service Cluster User Role, Azure Kubernetes Service RBAC Reader/Writer/Admin/Cluster Admin) to the user; see the Microsoft Docs. A sample role-assignment command is also sketched below.
Also check the Troubleshooting documentation.
Also check whether you need the "Virtual Machine Contributor" and "Storage Account Contributor" roles for the resource group containing the pods, and whether the namespace is specified for that pod in case you have missed it (Stack Overflow reference). Also check whether a firewall is restricting network access for that pod.
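As a sketch of the role assignments mentioned above (the resource names mirror the placeholders used elsewhere in this answer, and <user-object-id> is a placeholder for the Azure AD object ID of the user):
# placeholder names and object ID; grants cluster access plus data-plane read access
AKS_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query id -o tsv)
az role assignment create --assignee <user-object-id> --role "Azure Kubernetes Service Cluster User Role" --scope $AKS_ID
az role assignment create --assignee <user-object-id> --role "Azure Kubernetes Service RBAC Reader" --scope $AKS_ID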
Resetting the kubeconfig context using the az aks get-credentials command may clear the previously cached authentication token for the user:
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster --overwrite-existing
Please also check the other references below:
kubernetes - Permissions error - Stack Overflow
create-role-assignments-for-users-to-access-cluster | Microsoft Docs
user can't access to AKS cluster with RBAC enabled (github.com)
kubernetes - Stack Overflow

Azure UI shows wrong number of nodes after deleting nodes with kubectl

I removed two nodes of my Kubernetes cluster manually, first calling "kubectl drain <node>" and then "kubectl delete <node>" for each. While the cluster seems to work without a problem, the Azure UI shows me exactly two nodes more than I see when I use "kubectl get nodes". So when I configure Kubernetes to have 9 nodes in the Azure UI, only 7 nodes are there if I take a look with kubectl. Scaling up or down does not solve the problem, as Azure is always off by two nodes.
How can I solve this problem? Is there a way I can notify Azure that a node has been deleted?
If you want to solve the issue, you need a deeper understanding of how the AKS cluster works.
When you use the command kubectl delete to remove a node from the agent pool, it only means the agent pool no longer has control over that node; it does not actually delete the underlying machine. That is why the number of machines shown in the Azure portal does not change, which is what you are seeing.
How can I solve this problem? Is there a way I can notify Azure that a
node has been deleted?
There are really two questions here. The first one can be rephrased like this:
How do I restore a previously deleted node to the agent pool?
That is simple to solve: you only need to restart the kubelet service on that node. For example, if you use a VMSS as the agent pool of the AKS cluster and that node's instance ID is 4, you can do it like this:
az vmss run-command invoke --resource-group group_name --name vmss_name --instance-id 4 --command-id RunShellScript --scripts "service kubelet restart"
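If you are not sure of the VMSS name or the instance ID (both are just example values above), you can list them from the node resource group; a sketch, with <node-resource-group> and <vmss-name> as placeholders:
# the node resource group is usually named MC_<resource-group>_<cluster>_<region>;
# it can also be read with: az aks show -g <resource-group> -n <cluster> --query nodeResourceGroup -o tsv
az vmss list --resource-group <node-resource-group> --output table
az vmss list-instances --resource-group <node-resource-group> --name <vmss-name> --output table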
For the second question, you can only use an Azure command to make Azure aware of the change. In practice that means scaling the agent pool, for example with the Azure CLI command:
az aks nodepool scale --resource-group group_name --name agentpool_name --cluster-name cluster_name --node-count 2

Changing --network-plugin in Azure Kubernetes Service for existing cluster

I'm trying to implement Azure Key Vault such that API keys, credentials and other Kubernetes secrets are read into production and staging environments. Ultimately, I'd like to try to expand that to local development environments so devs don't have to mess with it at all. It is just read in when they start their cluster.
Anyway, I'm following this to enable Pod Identities:
https://learn.microsoft.com/en-us/azure/aks/use-azure-ad-pod-identity
When I get to this step, I'm modifying:
az aks create -g myResourceGroup -n myAKSCluster --enable-managed-identity --enable-pod-identity --network-plugin azure
To the following because I'm trying to change an existing cluster:
az aks update -g myResourceGroup -n myAKSCluster --enable-managed-identity --enable-pod-identity --network-plugin azure
This doesn't work, and I figured out I need to run each flag one at a time, so I had to run --enable-managed-identity first since --enable-pod-identity depends on it.
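In other words, the sequence ends up being roughly the following two calls (same resource group and cluster name as above):
# run separately; --enable-pod-identity requires the managed identity to be enabled first
az aks update -g myResourceGroup -n myAKSCluster --enable-managed-identity
az aks update -g myResourceGroup -n myAKSCluster --enable-pod-identity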
At any rate, when I get to the --enable-pod-identity I get the following error:
Operation failed with status: 'Bad Request'. Details: Network plugin kubenet is not supported to use with PodIdentity addon.
So I try the --network-plugin azure and get:
az: error: unrecognized arguments: --network-plugin azure
Apparently this flag is not available with update.
Poking around in the Azure portal for the AKS resource, I do see kubenet listed, but I'm not able to change it.
So, the question: is it possible to change the network plugin on an existing cluster, or do I need to start a new one?
EDIT: Looks like others are having similar issues on existing clusters:
https://github.com/Azure/AKS/issues/2094
Is it possible to change the network plugin on the existing cluster or do
I need to start a new one?
It's not possible to change the network plugin on an existing cluster, so you need to create a new cluster and set the network plugin to azure at creation time. You can see that the CLI command az aks update has no --network-plugin parameter, even if you install the aks-preview extension, which means changing the network plugin of an existing cluster is not supported.
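For reference, recreating the cluster with the Azure CNI plugin and pod identity enabled follows the same pattern as the create command quoted in the question; a sketch, where myNewAKSCluster is just a placeholder name:
# placeholder names; mirrors the az aks create example from the pod identity docs
az aks create -g myResourceGroup -n myNewAKSCluster --enable-managed-identity --enable-pod-identity --network-plugin azure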

Error while applying Node Autoscaler for existing AKS cluster

I am trying to experiment with a preview feature available in Azure AKS. As per the documentation, we need to meet the following requirements:
Kubernetes version 1.12.4 or later
Azure CLI version 2.0.55 or later
add the aks-preview extension: az extension add --name aks-preview
register the scale set provider: az feature register --name VMSSPreview --namespace Microsoft.ContainerService
ensure that it is registered
I created the AKS cluster with Terraform.
When I try to run the following command:
az aks update --resource-group rg-euwest-d04-dvag-001 --name k8s-euwest-d04-dvag-dfs-dfsapp-001 --enable-cluster-autoscaler --min-count 3 --max-count 5
I get this error:
Operation failed with status: 'Bad Request'. Details: AgentPool '' has set auto scaling as enabled but is not on Virtual Machine Scale Sets, this is not allowed
As per my understanding, it is not supported at this time through Terraform or from the Azure portal, but only possible from the Azure CLI.
Your cluster needs to be created via the Azure CLI to enable autoscaling. So if you have created one via the Azure portal, you need to delete it and create a new one through the Azure CLI. Ref: https://github.com/MicrosoftDocs/azure-docs/issues/29199
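For illustration, a cluster created from the Azure CLI with VMSS-based node pools and the autoscaler enabled could look like the sketch below; the resource group and cluster name are placeholders, and on recent CLI versions VirtualMachineScaleSets is the default --vm-set-type anyway:
# placeholder names; enables the cluster autoscaler between 3 and 5 nodes
az aks create --resource-group myResourceGroup --name myAKSCluster --node-count 3 --vm-set-type VirtualMachineScaleSets --enable-cluster-autoscaler --min-count 3 --max-count 5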
