GKE: Impossible to delete a cluster - terraform

I have a weird issue with GKE. The cluster was created by Terraform, and I tried to make a change that required deleting and re-creating it.
The re-creation failed because I was missing an API, so I enabled the API and retried.
The thing is, I now have a cluster that exists and is empty, but it shows a "failed to delete cluster" message.
I never had this issue before, and I have already destroyed and re-created this very resource. I tried to destroy all the resources created by Terraform in this project, but I still get the "failed to delete cluster" error.
I also tried to delete it by hand in the UI, but I get the same error.
I tried to delete it using
gcloud container clusters delete <cluster_name>
and got "Failed to delete cluster, name: operation-xxx-xxx..." along with a link to the failed operation.
It's a JSON response with a 401 code and the following message:
Request is missing required authentication credential. Expected OAuth
2 access token, login cookie or other valid authentication credential.
See
https://developers.google.com/identity/sign-in/web/devconsole-project.
I tried to re-authenticate, but it doesn't help; I get the same error.
I'm running out of ideas. Can you help me here?

A 401 (Unauthorized) suggests that you have insufficient permissions to delete the cluster.
Either get a role that permits your user account to delete clusters,
or ask someone who has an account with sufficient permissions to delete it for you,
or authenticate gcloud (gcloud auth activate-service-account) with the service account that you used to create the cluster (assuming it can delete clusters too) and then use gcloud container clusters delete ..., optionally including --account=${SERVICE_ACCOUNT_EMAIL}, or just confirm the service account is ACTIVE with gcloud auth list.
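A minimal sketch of that last option; the key file, account email, cluster name, and zone below are all placeholders to substitute with your own values:

# Activate the service account that created the cluster
gcloud auth activate-service-account --key-file=sa-key.json

# Confirm which account is ACTIVE
gcloud auth list

# Retry the deletion as that service account
gcloud container clusters delete my-cluster --zone=us-central1-a \
  --account=my-sa@my-project.iam.gserviceaccount.com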

I did not find a proper solution, but what did work was to delete the whole project and start over.
Luckily for me it was a lab, not a production project...
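If you end up with the same nuclear option, it is a one-liner; the project ID is a placeholder, and this shuts down everything in the project, so be certain:

# Schedule the entire project for deletion (recoverable only during the grace period)
gcloud projects delete my-project-id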

Related

AKS Cluster deployment fails with "ReconcileMSICredentialError"

When I try to deploy a fresh AKS cluster with the "Dev/Test" settings via the Portal, I get the following error during deployment:
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed.
Please list deployment operations for details. Please see
https://aka.ms/DeployOperations for usage details.","details":
[{"code":"ReconcileMSICredentialError","message":"Reconcile MSI credential failed.
Details: autorest/azure: Service returned an error. Status=409 Code=\"Conflict\"
Message=\"Secret bf905bf9e9ad86526b26e98d2ea490a0a500ff23907f9df987d95de3a649e751 is
currently being deleted and cannot be re-created; retry later.\" InnerError=
{\"code\":\"ObjectIsBeingDeleted\"}."}]}
However, the resource still gets deployed, but with a notification that "the resource is in a failed state". When I stop the cluster and start it again, the notification disappears, but I'm not sure whether the error remains.
I can avoid the error altogether if I pick a new name for the cluster. However, I'd like to keep the old name.
The same happens when I deploy with different settings (CPU, number of nodes, etc.). I also tried deleting the cluster entirely and deploying it completely fresh, but the error persists. I haven't found any explanation for this error on Stack Overflow or Google.
What could be the reason for this error, and how can I avoid it?
I tried to reproduce the same issue in my environment and got the results below.
I created an AKS cluster with the Dev/Test configuration.
The cluster was created successfully.
I fetched credentials for the cluster using the command below:
az aks get-credentials --resource-group Alldemorg --name cluster_name
I then created a sample application and deployed it to the cluster; I used a reference sample file for the example.
The deployment succeeded, and I am able to see all the pods and nodes that were created for the application.
Note:
1) The "ReconcileMSICredentialError" can occur because of the cluster version; check your version and upgrade to the latest one.
2) If the resource is in a failed state, delete the entire resource instead of just the cluster and create it again; if you only stop and start the resource, there is a chance of getting "ReconcileMSICredentialError" again.
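A sketch of that version check and upgrade, reusing the resource group and cluster name from above; the target version is a placeholder to pick from the listed options:

# List the Kubernetes versions this cluster can upgrade to
az aks get-upgrades --resource-group Alldemorg --name cluster_name --output table

# Upgrade the cluster to a chosen available version
az aks upgrade --resource-group Alldemorg --name cluster_name --kubernetes-version 1.27.3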

Azure terraform storage account permission

I want to learn more about Azure OpenVPN configurations and how they work. So, looking around, I found an open source project on GitHub, at the following link:
https://github.com/terraform-azurerm-examples/example-hub.git (Thank you for your code)
I set all the variables I wanted and removed the version pin from the Azure provider,
but when I run terraform apply, I get an error on the Azure storage account.
The error is this one:
Error: reading queue properties for AzureRM Storage Account "examplehubw6sr1wyncn": queues.Client#GetServiceProperties: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationPermissionMismatch" Message="This request is not authorized to perform this operation using this permission.\nRequestId:cce5a313-b003-005c-2bb2-9d8a2f000000\nTime:2021-08-30T15:19:07.9036073Z"
As far as I understand, the error is due to secret permissions, which I updated to grant Get, List, and Set, but the error keeps showing up.
I am using Terraform version 0.14.5,
and my azurerm version is 2.74.0.
I have never had this type of error; on my subscription I have the administrator role.
Did anyone get this error and know how to solve it? I would really appreciate your help.
The error is probably because your user does not have data-plane permissions on your storage account, which is where Terraform wants to put the state file. Give your user the Storage Blob Data Contributor role: https://learn.microsoft.com/en-us/azure/storage/blobs/assign-azure-role-data-access?tabs=portal
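A sketch of that role assignment via the CLI; the assignee and the subscription/resource-group parts of the scope are placeholders to replace with your own IDs:

# Grant data-plane access on the storage account
az role assignment create \
  --assignee "user@example.com" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/examplehubw6sr1wyncn"

Since the 403 in your output is on reading queue properties, the Storage Queue Data Contributor role may be needed on the same scope as well.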

"insufficient authentication scopes" from Google API when calling from K8S cluster

I'm trying to report Node.js errors to Google Error Reporting, from one of our kubernetes deployments running on a GCP/GKE cluster with RBAC. (i.e. permissions defined in a service account associated to the cluster)
const googleCloud = require('@google-cloud/error-reporting');
const googleCloudErrorReporting = new googleCloud.ErrorReporting();
googleCloudErrorReporting.report('[test] dummy error message');
This works only in certain environments:
it works when run on my laptop, using a service account that has the "Errors Writer" role
it works when running in my cluster as a K8S job, after having added the "Errors Writer" role to that cluster's service account
it causes the following error when called from my Node.js application running in one of my K8S deployments:
ERROR:@google-cloud/error-reporting: Encountered an error while attempting to transmit an error to the Stackdriver Error Reporting API.
Error: Request had insufficient authentication scopes.
It feels like the job did pick up the permission changes of the cluster's service account, whereas my deployment did not.
I did try to re-create the deployment to make it refresh its auth token, but the error is still happening...
Any ideas?
UPDATE: I ended up following Jérémie Girault's suggestion: create a service account and bind it to my deployment. It works!
The error message has to do with the access scopes set on the cluster when it uses the default service account. You must enable access to the appropriate API.
As you mentioned, creating a separate service account, giving it the appropriate IAM permissions, and linking it to your cluster or workload will bypass this error as well.
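A minimal sketch of the separate-service-account approach, assuming placeholder project and account names; the key-file route shown here is one way to link the account to a workload:

# Create a dedicated service account and grant it the Errors Writer role
gcloud iam service-accounts create error-reporter --project=my-project
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:error-reporter@my-project.iam.gserviceaccount.com" \
  --role="roles/errorreporting.writer"

# Export a key and store it in the cluster as a secret
gcloud iam service-accounts keys create key.json \
  --iam-account=error-reporter@my-project.iam.gserviceaccount.com
kubectl create secret generic error-reporter-key --from-file=key.json

The deployment then mounts that secret and points GOOGLE_APPLICATION_CREDENTIALS at the mounted key file, which the error-reporting client picks up automatically.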

Unable to get access to Key Vault using Azure MSI on App Service

I have enabled Managed Service Identities on an App Service. However, my WebJobs seem unable to access the keys.
They report:
Tried the following 3 methods to get an access token, but none of them worked.
Parameters: Connectionstring: [No connection string specified], Resource: https://vault.azure.net, Authority: . Exception Message: Tried to get token using Managed Service Identity. Unable to connect to the Managed Service Identity (MSI) endpoint. Please check that you are running on an Azure resource that has MSI setup.
Parameters: Connectionstring: [No connection string specified], Resource: https://vault.azure.net, Authority: https://login.microsoftonline.com/common. Exception Message: Tried to get token using Active Directory Integrated Authentication. Access token could not be acquired. password_required_for_managed_user: Password is required for managed user
Parameters: Connectionstring: [No connection string specified], Resource: https://vault.azure.net, Authority: . Exception Message: Tried to get token using Azure CLI. Access token could not be acquired. 'az' is not recognized as an internal or external command,
Kudu does not show any MSI_ environment variables.
How is this supposed to work? This is an existing App Service Plan.
The AppAuthentication library leverages an internal endpoint in App Service that receives the tokens on your site's behalf. This endpoint is non-static and is therefore set in an environment variable. After activating MSI for your site through ARM, your site needs to be restarted to pick up two new environment variables:
MSI_ENDPOINT and MSI_SECRET
The presence of these variables is essential to the MSI feature working properly at runtime, as the AppAuthentication library uses them to get the authorization token. The error message reflects this:
Exception Message: Tried to get token using Managed Service Identity. Unable to connect to the Managed Service Identity (MSI) endpoint. Please check that you are running on an Azure resource that has MSI setup.
If these variables are absent, you might need to restart the site.
https://learn.microsoft.com/en-us/azure/app-service/app-service-managed-service-identity
If the environment variables are set and you still see the same error, the article above has a code sample showing how to send requests to that endpoint manually.
public static async Task<HttpResponseMessage> GetToken(string resource, string apiversion)
{
    HttpClient client = new HttpClient();
    // The MSI_SECRET header proves the request originates from within the site
    client.DefaultRequestHeaders.Add("Secret", Environment.GetEnvironmentVariable("MSI_SECRET"));
    // Ask the local MSI endpoint for a token scoped to the requested resource
    return await client.GetAsync(String.Format("{0}/?resource={1}&api-version={2}",
        Environment.GetEnvironmentVariable("MSI_ENDPOINT"), resource, apiversion));
}
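You can also hit the endpoint by hand from a shell where those variables are visible; this sketch assumes api-version 2017-09-01, the version documented for the App Service MSI endpoint:

# Request a Key Vault token directly from the local MSI endpoint
curl -H "Secret: $MSI_SECRET" "$MSI_ENDPOINT/?resource=https://vault.azure.net&api-version=2017-09-01"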
I would try that and see what kind of response I get back.
I just solved this issue when trying to use MSI with a Function app, though I already had the environment variables set. I tried restarting multiple times to no success. What I ended up doing was manually turning off MSI for the Function, then re-enabling it. This wasn't ideal, but it worked.
Hope it helps!
I've found that if you enable MSI and then swap the slot, the functionality leaves with the slot change. You can re-enable it by switching it off and on again, but that will create a new identity in AD and will require you to reset permissions on the Key Vault for it to work.
Enable the identity and give your Azure Function app access in Key Vault via an access policy.
You can find Identity in the Platform features tab.
These two steps work for me.
In my case I had forgotten to add an access policy for the application in the Key Vault.
I just switched the Status to ON, like @Sebastian Inones showed,
then added an access policy for the Key Vault.
This resolved the issue!
For the ones, like myself, wondering how to enable MSI:
My scenario:
I have an App Service already deployed and running for a long time.
In addition, on Azure DevOps I have my pipeline configured to auto-swap my deployment slots (Staging/Production). Suddenly, after a normal push, Production started failing because of the described issue.
So, to enable MSI again (I don't know why it has to be re-enabled; I believe this is only a workaround, not a solution, as it should have stayed enabled in the first place):
Go to your App Service, then under Settings --> Identity.
Check the status: in my case, it was Off.
I have attached an image below to make it easier to follow.
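If you prefer the CLI over the portal, a sketch of re-enabling the system-assigned identity; the resource group and app names are placeholders:

# Turn the system-assigned managed identity back on for an App Service
az webapp identity assign --resource-group my-rg --name my-app

# The equivalent for a Function app
az functionapp identity assign --resource-group my-rg --name my-func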
For the folks that will come across these answers, I would like to share my experience.
I got this problem with an Azure Synapse pipeline run. Essentially, I had added the access policies properly to the Key Vault, and I had also added a linked service in Azure Synapse pointing to my Key Vault.
If I trigger the notebook manually it works, but in the pipeline it fails.
Initially, I used the following statement:
url = TokenLibrary.getSecret("mykeyvault", "ConnectionString")
Then I added the name of the linked service as a third parameter, and the pipeline was able to leverage that linked service to obtain the MSI token for a Vault.
url = TokenLibrary.getSecret("mykeyvault", "ConnectionString", "AzureKeyVaultLinkedServiceName")
Might be unrelated to your issue but I was getting the same error message.
For me, the issue was using pip3's azure-cli. I was able to fix this issue by using brew packages for both azure-cli and azure-functions-core-tools.
Uninstall pip3 azure-cli
pip3 uninstall azure-cli
Install brew azure-cli
brew update
brew install azure-cli
Double check if the error message ends with:
Please go to Tools->Options->Azure Services Authentication, and re-authenticate the account you want to use.

DC/OS private registry with authentication fails

I have a running DC/OS cluster on Azure, and I'm trying to configure it to use private registry credentials.
I'm running an Azure private registry with admin enabled; I can log in and use the images.
I followed the guide provided by DC/OS, but it recommends saving the credentials on the nodes themselves. I want to use Azure File Storage instead.
I saved the config.json file used to authenticate to the login server to a blob, and I provide the URI in the deployment configuration.
config.json:
{
  "auths": {
    "stageon.azurecr.io": {
      "auth": "..."
    }
  }
}
Now the deployment just keeps running without any output, so I assume it's hanging while pulling the image.
I am providing the direct URL to the file, and when I access it through a web browser it returns the JSON.
Has anyone done something similar before? I found this thread for Amazon, but I can't seem to get it working.
I've used a customization to acs-engine a few times to push registry credentials to the agent nodes.
This approach makes sure that the credentials will be present even when you add nodes later on.
The code is here: https://github.com/xtophs/acs-engine-1/tree/xtoph-registry. Example cluster API model is at: https://github.com/xtophs/acs-engine-1/blob/xtoph-registry/examples/privateregistry/dcos1.8.4.json
