AKS Cluster deployment fails with "ReconcileMSICredentialError" - azure

When I try to deploy a fresh AKS cluster with "Dev/Test" settings via the Portal, I get the following error during deployment:
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed.
Please list deployment operations for details. Please see
https://aka.ms/DeployOperations for usage details.","details":
[{"code":"ReconcileMSICredentialError","message":"Reconcile MSI credential failed.
Details: autorest/azure: Service returned an error. Status=409 Code=\"Conflict\"
Message=\"Secret bf905bf9e9ad86526b26e98d2ea490a0a500ff23907f9df987d95de3a649e751 is
currently being deleted and cannot be re-created; retry later.\" InnerError=
{\"code\":\"ObjectIsBeingDeleted\"}."}]}
However, the resource still gets deployed, but with a notification that "the resource is in a failed state". When I stop the cluster and start it again, the notification disappears, but I'm not sure whether the underlying error remains.
I can avoid the error altogether if I pick a new name for the cluster, but I'd like to keep the old name.
The same happens when I deploy with different settings (CPU, number of nodes, etc.). I also tried deleting the cluster entirely and deploying it completely new, but the error persists. I haven't found any explanation for this error on Stack Overflow or Google.
What could be the reason for this error, and how can I avoid it?

I tried to reproduce the same issue in my environment and got the results below.
I created an AKS cluster with the Dev/Test configuration.
The cluster was created successfully.
I fetched the cluster credentials for kubectl using the command below:
az aks get-credentials --resource-group Alldemorg --name cluster_name
I created a sample application and deployed it to the cluster, using a reference sample manifest.
The deployment succeeded and I can see all the pods and nodes that were created for the application.
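For reference, the standard kubectl commands to list them are:
kubectl get nodes
kubectl get pods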
Note:
1) The "ReconcileMSICredentialError" error can occur because of the cluster version; check the version and upgrade to the latest (a sketch of the upgrade commands follows after these notes).
2) If the resource is in a failed state, delete the entire resource instead of only deleting the cluster and create it again; if you just stop and start the resource, there is a chance of getting "ReconcileMSICredentialError" again.
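A minimal sketch of the version check and upgrade with the Azure CLI, using placeholder resource group, cluster, and version names:
az aks get-upgrades --resource-group <resource-group> --name <cluster-name> --output table
az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version <version>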

Related

GKE: Impossible to delete a cluster

I have a weird issue with GKE. The cluster was created by Terraform, and I tried to make a change that required a deletion and re-creation.
It failed at the re-creation because I was missing an API, so I enabled it and retried.
The thing is, I now have a cluster that exists, empty, but with a "failed to delete cluster" message on it.
I never had this issue before, and I had already destroyed and re-created this very resource. I tried to destroy all the resources created by Terraform on this project, but I still get the "failed to delete cluster" error.
Also I tried to do it by hand on the UI but still get the same error.
I tried to do it using
gcloud container clusters delete <cluster_name> and got
"Failed to delete cluster, name: operation-xxx-xxx..." and got a link to the operation failed.
It's a JSON with a 401 code, with the following message:
Request is missing required authentication credential. Expected OAuth
2 access token, login cookie or other valid authentication credential.
See
https://developers.google.com/identity/sign-in/web/devconsole-project.
I tried to re-authenticate, but it doesn't help; I get the same error.
I'm running out of idea, can you help me here?
A 401 (Unauthorized) suggests that you have insufficient permissions to delete the cluster.
Either get a role that permits your user account to delete clusters,
or ask someone whose account has sufficient permissions to delete it for you,
or authenticate gcloud (gcloud auth activate-service-account) with the service account that you used to create the cluster (assuming it can also delete clusters) and then run gcloud container clusters delete ...; optionally include --account=${SERVICE_ACCOUNT_EMAIL}, or just ensure the service account is ACTIVE with gcloud auth list.
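A minimal sketch of that last option, with a hypothetical key file, cluster name, zone, and service-account email:
gcloud auth activate-service-account --key-file=key.json
gcloud auth list
gcloud container clusters delete <cluster_name> --zone <zone> --account=${SERVICE_ACCOUNT_EMAIL}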
I did not find a proper solution, but what did work was deleting the whole project and starting over.
Luckily for me it was a lab, not a production project...

Pull images from an Azure container registry to a Kubernetes cluster

I have followed this tutorial microsoft_website to pull images from an Azure container registry. My YAML successfully creates a pod job, which can pull the image, BUT only when it runs on the agentpool node in my cluster.
For example, adding nodeName: aks-agentpool-33515997-vmss000000 to the YAML works fine, but specifying a different node name, e.g. nodeName: aks-cpu1-33515997-vmss000000, makes the pod fail. The error message I get with describe pods is Failed to pull image and then kubelet Error: ErrImagePull.
What am I missing?
Create secret:
kubectl create secret docker-registry <secret-name> \
--docker-server=<container-registry-name>.azurecr.io \
--docker-username=<service-principal-ID> \
--docker-password=<service-principal-password>
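A minimal sketch of referencing that secret from a pod spec via imagePullSecrets, using the same placeholders as above plus a hypothetical pod name, image, and tag:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: acr-pull-test
spec:
  containers:
  - name: app
    image: <container-registry-name>.azurecr.io/<image>:<tag>
  imagePullSecrets:
  - name: <secret-name>
EOF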
As @user1571823 suggested, the solution to the problem was deleting the old image from the ACR and creating/pushing a new one.
The problem was related to some sort of corruption in the image saved in the Azure Container Registry (ACR). The reason one agent pool could pull the image was actually that the image already existed on the VM.
Also, as @andov said, it is a good option to open an incident case with Azure support for AKS from the subscription where AKS is deployed. The support team has full access to the AKS service backend and can tell exactly what was causing your problem.
Four things to check:
Is it a subscription issue? Are the nodes in different subscriptions?
Is it a rights issue? Does the service principal of the node have rights to pull the image? (A quick check is sketched after this list.)
Is it a network issue? Are the nodes on different subnets?
Is there something about the image size or configuration that means it cannot run on the other node pool?
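A minimal sketch of the rights check, assuming az aks check-acr is available in your CLI version and using placeholder names:
az aks check-acr --resource-group <resource-group> --name <cluster-name> --acr <container-registry-name>.azurecr.io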
Edit:
New-AzAksNodePool has a parameter -DefaultProfile.
It can be an AzContext, AzureRmContext, or AzureCredential.
If this is different between your node pools, it would explain the error.

Azure Data Explorer error when creating cluster: subscription '' is not registered

While working on the official tutorial Create an Azure Data Explorer cluster and database, I am getting the following error when creating a cluster. Question: what may I be missing, and how can the issue be resolved?
Remarks:
I'm using a Visual Studio Enterprise Subscription - MPN.
My online search shows a similar error here, but the context seems different, since those error messages relate to the subscription not being registered to use a namespace. Not sure if there is any relevance to my error.
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.","details":[{"code":"SubscriptionNotRegistered","message":"The subscription 'a86d7e9f-210d-48e8-8f5e-528015d1c998' is not registered."}]}
Using the link provided in the error, I got the following:
When I click on the 'write cluster resource' link from the above screen:
The error is because you did not register the Kusto resource provider, as described here.
However, once you create a new cluster for the first time on a given subscription and it fails because the provider is not registered, Kusto tries to register it for you. So if you try again it should just work; if not, follow the process in the link.
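A minimal sketch of registering the provider manually with the Azure CLI, assuming you are logged in to the affected subscription and using a placeholder subscription ID:
az account set --subscription <subscription-id>
az provider register --namespace Microsoft.Kusto
az provider show --namespace Microsoft.Kusto --query registrationState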

Unable to pull image from Azure Container Registry

We recently had an issue with our Azure Kubernetes Cluster not reporting back any data through the Azure Portal. To fix this, I updated the Kubernetes version to the latest version as was recommended on GitHub. After the upgrade was complete, we were able to view logs and monitoring data through the portal, but one of the containers stored in our Azure Container Registry is not able to be pulled by the Kubernetes Cluster.
The error I see in the Kubernetes management page is:
Failed to pull image "myacr.azurecr.io/container:190305.191": [rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/mycontainer/manifests/190305.191: unauthorized: authentication required, rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/mycontainer/manifests/190305.191: unauthorized: authentication required]
My original setup used the first script provided in this document and it worked correctly without issue. Once I started receiving the error, I ran it again just to make sure.
Once I saw that failed, I then deleted the account from the permissions on both the ACR and the AKS. Again, it failed to pull the image.
After that, I tried the second method of creating a Kubernetes secret and received the same error.
At this point, I'm unsure what else to check. I've verified that I can run docker pull on my machine and pull the image, but there seems to be a breakdown between the AKS and the ACR that I can not sort out.
It's been a while since I originally posted this, but I did stumble across a currently stable solution to the problem.
The service principal, for whatever reason, is not able to maintain a connection to the ACR. So if your cluster ever goes down, you lose the ability to pull from the ACR. I had this happen multiple times over the last year and as I moved more of my Kubernetes deployment to Azure, it became a bigger and bigger issue.
I stumbled across this Microsoft doc and noticed the mention of the --attach-acr flag.
This is what the full command looks like:
az aks create -n myAKSCluster -g myResourceGroup --generate-ssh-keys --attach-acr $MYACR
Since setting it up with that flag, I have had 0 issues with it.
knock on wood
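For an existing cluster, the equivalent would be a minimal sketch like the following, reusing the names from the command above:
az aks update -n myAKSCluster -g myResourceGroup --attach-acr $MYACR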

Can't list HDInsight clusters

I'm trying to use the azure command-line interface.
I imported the manifest file and am able to run azure hdinsight -h and azure account list (which shows the correct credentials).
However, I'm unable to list my HDInsight clusters with
azure hdinsight cluster list
This returns the following error:
- Getting HDInsight servers
error: tunneling socket could not be established, cause=1500:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:openssl\ssl\s23_clnt.c:766:
info: Error information has been recorded to azure.err
error: hdinsight cluster list command failed
I get a similar error message when doing azure hdinsight account storage create storagename
Did I miss a step in the installation, or is something else going wrong? I'm working behind a proxy and have http_proxy and https_proxy set.
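For reference, a minimal sketch of the proxy variables the Node-based classic CLI picks up, with a hypothetical proxy address (it can be worth double-checking both the lowercase and uppercase forms):
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
azure hdinsight cluster list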
In order to proceed with the project, you could also launch PowerShell from the portal itself and execute the cmdlet from there (it is called Cloud Shell in Azure; click the highlighted icon in the portal). I launched the PowerShell window from the portal, executed "azure hdinsight cluster list", and it returned the list of my clusters.
more details about the Azure Powershell at :
https://azure.microsoft.com/en-us/blog/powershell-comes-to-azure-cloud-shell/
and
https://learn.microsoft.com/en-us/azure/cloud-shell/quickstart-powershell
