I've been trying to manage an Azure Kubernetes Service (AKS) instance via Terraform. When I create the AKS instance via the Azure CLI per this MS tutorial, then install an ingress controller with a static public IP, per this MS tutorial, everything works fine. This method implicitly creates a service principal (SP).
When I create an otherwise exact duplicate of the AKS cluster via Terraform, I am forced to supply the service principal explicitly. I gave this new SP "Contributor" access to the cluster's entire resource group yet, when I get to the step to create the ingress controller (using the same command that tutorial 2 provided, above: helm install stable/nginx-ingress --set controller.replicaCount=2 --set controller.service.loadBalancerIP="XX.XX.XX.XX"), the ingress service comes up but it never acquires its public IP. The IP status remains "<pending>" indefinitely, and I can find nothing in any log about why. Are there logs that should tell me why my IP is still pending?
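For reference, these are the commands I've been using to look for clues; the service name below is a placeholder for whatever the chart actually created in my namespace:
# the release/service name is a placeholder, adjust to your helm release
kubectl describe service my-release-nginx-ingress-controller
kubectl get events --sort-by=.metadata.creationTimestamp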
Again, I am fairly certain that, other than the SP, the Terraform AKS cluster is an exact duplicate of the one created based on the MS tutorial. Running terraform plan finds no differences between the two. Does anyone have any idea what permission my AKS SP might need or what else I might be missing here? Strangely, I can't find ANY permissions assigned to the implicitly created principal via the Azure portal, but I can't think of anything else that might be causing this behavior.
Not sure if it's a red herring or not, but other users have complained about a similar problem in the context of issues opened against the second tutorial. Their fix always appears to be "tear down your cluster and retry", but that isn't an acceptable solution in this context. I need a reproducible working cluster and azurerm_kubernetes_cluster doesn't currently allow for building an AKS instance with an implicitly created SP.
I'm going to answer my own question, for posterity. It turns out the problem was the resource group where I created the static public IP. AKS clusters use two resource groups: the group that you explicitly created the cluster in, and a second group which is implicitly created by the cluster. That second, implicit resource group always gets a name starting with "MC_" (the rest of the name is derived from the explicit RG, the cluster name, and the region).
Anyhow, the default AKS configuration requires that the public IP be created within that implicit resource group. Assuming that you created the AKS cluster with Terraform, its name will be exported in ${azurerm_kubernetes_cluster.NAME.node_resource_group}.
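If you're not on Terraform (or just want to double-check), here is a rough CLI sketch of the same idea, with placeholder resource group, cluster, and IP names:
# placeholders: myResourceGroup, myAKSCluster, myIngressPublicIP
NODE_RG=$(az aks show -g myResourceGroup -n myAKSCluster --query nodeResourceGroup -o tsv)
az network public-ip create -g "$NODE_RG" -n myIngressPublicIP --allocation-method Static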
EDIT 2019-05-23
Since writing this, we found a use case that the workaround of using the MC_* resource group wasn't good enough for. I opened a support ticket with MS and they directed me to this solution. Add the following annotation to your LoadBalancer (or Ingress controller), and make sure that the AKS SP has at least Network Contributor rights in the destination resource group (myResourceGroup in the example below):
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: myResourceGroup
This solved it completely for us.
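For reference, a minimal sketch of the matching role assignment; the service principal appId below is a placeholder:
# <aks-sp-app-id> is a placeholder for the AKS service principal's appId
az role assignment create --assignee <aks-sp-app-id> --role "Network Contributor" --resource-group myResourceGroup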
Set Static IP Resource Group when Installing Helm Chart
Here is a minimal helm install command for nginx-controller that works when the static IP is in a different resource group than the cluster managed node resource group.
helm upgrade --install ingress-nginx ingress-nginx \
--repo https://kubernetes.github.io/ingress-nginx \
--namespace ingress-nginx \
--set controller.replicaCount=1 \
--set controller.service.externalTrafficPolicy=Local \
--set controller.service.loadBalancerIP=$ingress_controller_ip \
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-resource-group"=$STATIC_IP_RESOURCE_GROUP
The key is the last override to provide the resource group of the static IP.
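For completeness, the $ingress_controller_ip value used above can be pulled from the existing static IP, for example (the IP resource name below is a placeholder):
# my-static-ip is a placeholder for the name of your public IP resource
ingress_controller_ip=$(az network public-ip show -g "$STATIC_IP_RESOURCE_GROUP" -n my-static-ip --query ipAddress -o tsv)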
Additional Note: Health Probe Endpoints
Also, note that you may need to customize the load balancer health probe if your root path doesn't return a successful HTTP response. We do this by additionally adding the following override (replace /healthz with your probe endpoint):
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz
Versions
Kubernetes 1.22.6
ingress-nginx-4.1.0
ingress-nginx/controller:v1.2.0
I can't comment just yet, so I'm putting this addition as an answer.
Derek is right: you can absolutely use an existing IP from a resource group different from the one where the AKS cluster was provisioned. There is a documentation page for this. Just make sure you've done these two steps below:
Add "Network Contributor" role assignment for your AKS service principal to the resource group where your existing static IP is.
Add service.beta.kubernetes.io/azure-load-balancer-resource-group: myResourceGroup to the ingress controller with the following command:
kubectl annotate service ingress-nginx-controller -n ingress service.beta.kubernetes.io/azure-load-balancer-resource-group=datagate
Related
I'm working with Azure Kubernetes Service and would like to manage my infrastructure using Terraform.
When you create a new AKS cluster in Azure, a separate resource group is created to manage the resources that the cluster depends on (e.g. virtual machine sets, load balancer, etc.)
This is no different when creating an AKS cluster using the Terraform azurerm_kubernetes_cluster resource.
However, I'd like to be able to work with the resources created in this resource group within Terraform. For example, when using the Application Gateway Ingress Controller, I'd like to be able to grab the public IP address that is created in this resource group so I can assign a DNS A record in my DNS zone. This is one such example, but the scope of my question includes any resources created in this AKS-managed resource group.
I have attempted to reference these as data resources in Terraform that depend on the creation of the AKS cluster, however this requires a role assignment to the new resource group, which my service principal will not have. Assigning the Terraform service principal to the entire subscription also feels like too much of a sledgehammer approach.
It seems I must be missing something, as this seems like a big flaw in the current approach with Terraform. Can anyone enlighten me to something I am missing?
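For context, the manual equivalent of what I'd like Terraform to do is roughly this (the MC_ group name below is just an example; yours will differ):
# example managed resource group name, substitute your own
az network public-ip list -g MC_myResourceGroup_myAKSCluster_westeurope --query "[].{name:name, ip:ipAddress}" -o table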
I recently created an Azure Container App (with an environment & the rest) inside one of my resource groups, just for learning purposes (in the West Europe region). After I played with it, I decided to delete it. I tried to delete it from the portal without any success.
Looking around in the portal I found out that a new resource group had been created with name MC_braverock-518cbd83-rg_braverock-518cbd83_westeurope. This resource group was never generated by me. It appears that inside it there are a public IP address, a NSG & 2 Kubernetes Load Balancers.
I then tried to delete that auto-generated (somehow) resource group, but again with no success. I literally can't even touch it. I tried to delete all the resources one by one. Nothing again. I even issued the command az group delete --resource-group "MC_braverock-518cbd83-rg_braverock-518cbd83_westeurope" from inside the Azure Cloud Shell, and it seems that the CLI gets stuck in Running.... When I issued the command from the portal, it was still running after a whole hour. So, obviously something is going wrong.
I visited the page https://resources.azure.com/, then navigated to that resource group, and the JSON returned for the resource group contains the following:
"provisioningState": "Deleting".
Do you know how I can delete the resources & the resource group?
I am almost confident that this is not being deleted... :(
EDIT:
Trying to manually delete one of the Load Balancers in that resource group, I get a message that the Load Balancer cannot be deleted because it is in use by a virtual machine scale set that is in a totally different subscription (a subscription that I am not aware of).
To delete the resources and the resource group, you can try using the Resource Explorer (resources.azure.com) portal, as there might be some dependencies blocking deletion of the resource group/resources.
You can try deleting the resources like below:
Go to Resource Explorer (azure.com) portal -> Click on your subscription -> Expand Resource groups -> Select your Resource Group
Expand Providers -> Microsoft.Network -> networkSecurityGroups -> Select your NSG -> Action(POST,DELETE) -> Delete
In my environment, the testnsg network security group was deleted successfully, as confirmed in the Azure Portal.
You can try deleting the other Azure resources by following the same process. If you are still not able to delete the Azure resource group, try checking the child resources associated with that resource group.
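For example, the remaining child resources can be listed like below (substitute your own resource group name):
# resource group name taken from the question above
az resource list --resource-group MC_braverock-518cbd83-rg_braverock-518cbd83_westeurope -o table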
I tried to reproduce the same in my environment and got the same error like below:
az network lb delete -g ResourceGroupName -n LoadBalancerName
The error "LoadBalancerUseByVirtualMachineScaleSet" usually occurs if backendpool is being used by any other resource.
To resolve the error, try executing below commands in CloudShell like below:
Remove the load balancer backend pool reference from the VMSS network profile:
az vmss update --resource-group ResourceGroupName1 --name VmssName --remove virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].ipConfigurations[0].loadBalancerBackendAddressPools 0
Update the VMSS instance:
az vmss update-instances --instance-ids "*" -n VmssName -g ResourceGroupName1
Now, delete load balancer and it will be deleted successfully like below:
az network lb delete -g ResourceGroupName -n LoadBalancerName
Reference:
Update or delete an existing load balancer used by virtual machine scale sets - Azure Load Balancer
This is a side artifact of the Container App managedEnvironment resource. You need to delete the environment first in order for these artifacts to be removed automatically.
As JJ mentioned, the MC_* resource group is created when you create ACAs with the internal configuration. Try to find out whether you have any Container App environments in your subscription. It could be that you created your test app in the wrong resource group and can't find it now. :)
Try deleting all Container App environments; this resource group will then automatically be gone.
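A rough sketch of that with the CLI, assuming the containerapp extension is installed (the environment and resource group names are placeholders):
# placeholders: my-containerapp-env, myResourceGroup
az containerapp env list -o table
az containerapp env delete --name my-containerapp-env --resource-group myResourceGroup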
When deploying new jobs and services to Azure Kubernetes Service cluster, the pods fail to request valid AAD access tokens with all permissions available. If new permissions were added on the same day, before or after a deployment, the tokens still do not pick them up. This issue has been observed with permissions granted to Active Directory Groups over Key Vaults, Storage Accounts, and SQL databases scopes so far.
Example: I have a .NET 5.0 C# API running on 3 pods with antiaffinity rules located each on a separate node. The application reads information from a SQL database. I made a release and added the database permissions afterwards. Things I have tried so far to make the application reset the access tokens:
kubectl delete pods --all -n <namespace> which essentially created 3 new pods again failing due to insufficient permissions.
kubectl apply -f deployment.yaml to deploy a new version of the image running in the containers, again all 3 pods kept failing.
kubectl delete -f deployment.yaml followed by kubectl apply -f deployment.yaml to erase the old kubernetes object and create a new one. This resolved the issue on 2/3 pods, however, the third one kept failing due to insufficient permissions.
kubectl delete namespace <namespace> to erase the entire namespace with all configuration available and recreated it again. Surprisingly, again 2/3 pods were running with the correct permissions and the last one did not.
The commands were run more than one hour after the permissions were added to the group. The database tokens are active for 24 hours, and when I have seen this issue occur with cronjobs, I had to wait a day for the task to execute correctly (none of the above steps worked in the cronjob scenario). The validity period of the tokens kept changing, which implied that the pods were requesting new access tokens that again excluded the most recently added permissions. The only solution I have found that works 100% of the time is to destroy the cluster and recreate it, which is not viable in any production scenario.
The failing pod from my example was the one always running on node 00 which made me think there may be an extra caching layer on the first initial node of the cluster. However, I still do not understand why the other 2 pods were running with no issue and also what is the way to restart my pods or refresh the access token to minimise the wait time until resolution.
Kubernetes version: 1.21.7.
The cluster has no AKS-managed AAD or pod-identity enabled. All RBAC is granted to the cluster MSI via AD groups.
Please check whether the suggestions below can be applied as a workaround in your case.
To access the Kubernetes resources, you must have access to the AKS cluster, the Kubernetes API, and the Kubernetes objects. Ensure that you're either a cluster administrator or a user with the appropriate permissions to access the AKS cluster.
Things you need to do, if you haven't already:
Enable Azure RBAC on your existing AKS cluster, using:
az aks update -g myResourceGroup -n myAKSCluster --enable-azure-rbac
Create a Role that allows read access to all other Pods and Services:
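A minimal kubectl sketch of such a Role (the role name and namespace are placeholders):
# placeholders: pod-and-service-reader, my-namespace
kubectl create role pod-and-service-reader --verb=get --verb=list --verb=watch --resource=pods --resource=services -n my-namespace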
Add the necessary roles (Azure Kubernetes Service Cluster User Role, Azure Kubernetes Service RBAC Reader/Writer/Admin/Cluster Admin) to the user. See Microsoft Docs.
Also check Troubleshooting
Also check whether you need the "Virtual Machine Contributor" and "Storage Account Contributor" roles for the resource group containing the pods, and see if the namespace is specified for that pod in case you missed it (Stack Overflow reference). Also check whether a firewall is restricting network access for that pod.
Resetting the kubeconfig context using the az aks get-credentials command may clear the previously cached authentication token for the user in question:
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster --overwrite-existing
Reference
Please also check the other references below:
kubernetes - Permissions error - Stack Overflow
create-role-assignments-for-users-to-access-cluster | microsoft docs
user can't access to AKS cluster with RBAC enabled (github.com)
kubernetes - Stack Overflow
I'm trying to implement Azure Key Vault such that API keys, credentials and other Kubernetes secrets are read into production and staging environments. Ultimately, I'd like to try to expand that to local development environments so devs don't have to mess with it at all. It is just read in when they start their cluster.
Anyway, I'm following this to enable Pod Identities:
https://learn.microsoft.com/en-us/azure/aks/use-azure-ad-pod-identity
When I get to this step, I'm modifying the following command:
az aks create -g myResourceGroup -n myAKSCluster --enable-managed-identity --enable-pod-identity --network-plugin azure
To the following because I'm trying to change an existing cluster:
az aks update -g myResourceGroup -n myAKSCluster --enable-managed-identity --enable-pod-identity --network-plugin azure
This doesn't work, and I figured out I need to run each flag one at a time, so I had to run --enable-managed-identity first since --enable-pod-identity depends on it.
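In other words, I ended up running something like this (same placeholder names as in the tutorial):
az aks update -g myResourceGroup -n myAKSCluster --enable-managed-identity
az aks update -g myResourceGroup -n myAKSCluster --enable-pod-identity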
At any rate, when I get to the --enable-pod-identity I get the following error:
Operation failed with status: 'Bad Request'. Details: Network plugin kubenet is not supported to use with PodIdentity addon.
So I try the --network-plugin azure and get:
az: error: unrecognized arguments: --network-plugin azure
Apparently this flag is not available with update.
Poking around in the Azure portal for the AKS resource, I do see kubenet listed, but I'm not able to change it.
So, the question: Is it possible to change the Network Plugin on an existing cluster, or do I need to start a new one?
EDIT: Looks like others are having similar issues on existing clusters:
https://github.com/Azure/AKS/issues/2094
Is it possible to change the Network Plugin on the existing cluster or do I need to start a new one?
It's impossible to change the network plugin on an existing cluster, so you need to create a new cluster and set the network plugin to azure at creation time. You can see that there is no --network-plugin parameter in the CLI command az aks update, even if you install the aks-preview extension, which means it does not support changing the network plugin of an existing cluster.
I created a Kubernetes cluster in my Azure resource group using Azure Kubernetes Service and logged into the cluster with the resource group credentials through the Azure CLI. I was able to open the Kubernetes dashboard successfully the first time. After that I deleted my resource group, along with the other resource groups that were created by default with the Kubernetes cluster. I then created a resource group and Kubernetes cluster one more time in my Azure account. When I try to open the Kubernetes dashboard this time, I get an error that port 8001 is not open. I tried proxying and port-forwarding, but I don't know how to hit the dashboard URL with a different port.
Could anybody suggest how I can resolve this issue?
I think you need to delete your Kubernetes config and pull a new one with az aks get-credentials or whatever you are using, because you are probably still using the config from the previous cluster (hint: it won't work because it's a different cluster).
You can do that by deleting the ~/.kube/config file, pulling the new one, and trying kubectl get nodes. If that works, try the port-forward. It is not port related; something is wrong with your config/az cli.
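A rough sketch of that (resource group and cluster names are placeholders):
rm ~/.kube/config
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
kubectl get nodes
# az aks browse is one way to open the dashboard again afterwards
az aks browse --resource-group myResourceGroup --name myAKSCluster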
OK, I recall that in the previous question you mentioned you started using RBAC; you need to add a cluster role binding for the dashboard:
kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard
https://learn.microsoft.com/en-us/azure/aks/kubernetes-dashboard#for-rbac-enabled-clusters