`kubectl delete service` gets stuck in 'Terminating' state - azure

I'm trying to delete a service I wrote & deployed to Azure Kubernetes Service (along with the required Dask components that accompany it). When I run kubectl delete -f my-manifest.yaml, my service gets stuck in the Terminating state. The console tells me that it was deleted, but the command hangs:
> kubectl delete -f my-manifest.yaml
service "dask-scheduler" deleted
deployment.apps "dask-scheduler" deleted
deployment.apps "dask-worker" deleted
service "my-service" deleted
deployment.apps "my-deployment" deleted
I have to Ctrl+C this command. When I check my services, Dask has been successfully deleted, but my custom service hasn't. If I try to manually delete it, it similarly hangs/fails:
> kubectl get services
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
kubernetes   ClusterIP      x.x.x.x      <none>        443/TCP                      18h
my-service   LoadBalancer   x.x.x.x      x.x.x.x       80:30786/TCP,443:31934/TCP   18h
> kubectl delete service my-service
service "my-service" deleted
This question says to delete the pods first, but all my pods are deleted (kubectl get pods returns nothing). There's also this closed K8s issue that says --wait=false might fix foreground cascade deletion, but this doesn't work and doesn't seem to be the issue here anyway (as the pods themselves have already been deleted).
I assume that I could completely wipe out my AKS cluster and re-create it, but that's an option of last resort here. I don't know whether it's relevant, but my service uses the service.beta.kubernetes.io/azure-load-balancer-internal: "true" annotation, and I have a webapp deployed to my VNet that uses this service.
Is there any other way to force shutdown this service?

Thanks to #4c74356b41's suggestion of looking at kubectl describe service my-service (which I hadn't considered for some reason), I saw this warning:
Code="LinkedAuthorizationFailed" Message="The client 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' has permission to perform action 'Microsoft.Network/loadBalancers/write' on scope '/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Network/loadBalancers/kubernetes-internal'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>' or the linked scope(s) are invalid.
(The client and object id GUIDs are the same value.)
This indicated that it's not exactly a Kubernetes issue, but more so a permissions issue within the Azure ecosystem. I looked through the portal and didn't find that GUID among any of my users, groups, or apps, so I'm not sure what it refers to. However, I granted the Owner role to this client id, and after a few minutes the service deleted.
az role assignment create `
--role Owner `
--assignee xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
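Granting Owner to that client id works, but it is broader than necessary. If the GUID turns out to be the cluster's service principal or managed identity, a narrower assignment that should also satisfy the missing 'Microsoft.Network/virtualNetworks/subnets/join/action' permission is Network Contributor scoped to just the subnet named in the error (a sketch, reusing the placeholders from the error message):
az role assignment create `
  --role "Network Contributor" `
  --assignee xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx `
  --scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"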

I had a similar issue with a svc not connecting to the pod because the pod was already deleted:
HTTPConnectionPool(host='scv-name-not-shown-because-prod.namespace-prod', port=7999): Max retries exceeded with url: my-url-not-shown-because-prod (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faee4b112b0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
I was able to solve this with the patch command:
kubectl patch service scv-name-not-shown-because-prod -n namespace-prod -p '{"metadata":{"finalizers":null}}'
I think the service went into some invalid state and was not able to recover.
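Before clearing the finalizers it can be worth checking what is actually set on the service, since removing them skips whatever cleanup the controller was waiting for (for example, deleting the cloud load balancer). A quick way to print them, reusing the same service name as above:
kubectl get service scv-name-not-shown-because-prod -n namespace-prod -o jsonpath='{.metadata.finalizers}'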

Related

Azure DevOps Release Pipeline || To sign in, use a web browser to open

I created the AKS cluster with an Azure service principal id and gave it the Contributor role on the subscription and resource group.
Every time I execute the pipeline it asks me to sign in, and only after I authenticate does it get the data.
Also, the "kubectl get" task takes more than 30 min and ends with "Kubectl Server Version: Could not find kubectl server version".
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code CRA2XssWEXUUA to authenticate
Thanks in advance
What is the version of the created cluster?
I'm assuming from your screenshot that you are using az in order to get credentials for it.
The old azure auth plugin is deprecated in v1.22+. If you are using v1.22 or above you should use kubelogin in order to authenticate.
You will also need to update your kube config accordingly:
kubelogin convert-kubeconfig
and specifically if you're logging via az:
kubelogin convert-kubeconfig -l azurecli
Note that the flag -l azurecli is important here: the default value is "devicecode", which will not treat your az login as a valid login method - and you will still be prompted for browser authentication.
Alternatively, you can set environment variable:
AAD_LOGIN_METHOD=azurecli
Because you are getting a sign-in request and not the deprecation warning for the auth plugin, I suspect that you already have kubelogin installed on your agent, and you just need to update the kubeconfig file.
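One way to confirm the conversion actually took effect is to inspect the user entry in the kubeconfig afterwards; kubelogin rewrites it to use an exec plugin. A rough check, assuming a single-cluster kubeconfig:
kubelogin convert-kubeconfig -l azurecli
kubectl config view --minify -o jsonpath='{.users[0].user.exec.command}'   # should print kubelogin
kubectl config view --minify -o jsonpath='{.users[0].user.exec.args}'      # should include --login azurecli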
What task are you using? There is an official kubectl task: https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/deploy/kubernetes?view=azure-devops
It requires a service connection.
If you still want to execute kubectl directly, you should run the following before the kubectl inside the AzureCLI task:
az aks get-credentials --resource-group "$(resourceGroup)" --name "$(k8sName)" --overwrite-existing
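Putting it together, a minimal AzureCLI task that fetches credentials, converts the kubeconfig and then runs kubectl might look roughly like this (the service connection name and the pipeline variables are assumptions):
- task: AzureCLI@2
  inputs:
    azureSubscription: 'my-arm-service-connection'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials --resource-group "$(resourceGroup)" --name "$(k8sName)" --overwrite-existing
      kubelogin convert-kubeconfig -l azurecli
      kubectl get pods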
Please use self-hosted agents for executing your commands. It looks like you have private endpoints for your AKS and requests are only allowed from trusted devices.
I ran into the same issue and for me the fix was to change the Connection Type in the stage definition from Azure Resource Manager to Kubernetes Service Connection - check on the screenshot below.
Then you should be able to also specify the connection type in each of the tasks where you are running kubectl or helm commands. For example, in a kubectl task, under Kubernetes Cluster --> Service connection type use the Kubernetes Service Connection:
As mentioned by #DevOpsEngg, the problem could be related to private endpoints, but I wouldn't say it is about self-hosted agents, because I'm using those. As an extra comment - this started happening when I added more than one user to the cluster, so you might want to check user permissions and authentication. Unfortunately, I'm still getting used to K8s, so I don't have more info about that.
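In YAML-based pipelines the equivalent of that screenshot is setting the task's connection type explicitly, roughly like this (the input names are based on the built-in Kubernetes task and the service connection name is a placeholder, so treat this as a sketch):
- task: Kubernetes@1
  inputs:
    connectionType: 'Kubernetes Service Connection'
    kubernetesServiceEndpoint: 'my-k8s-service-connection'
    namespace: 'default'
    command: 'get'
    arguments: 'pods'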

Docker: "Deploying Docker containers on Azure" directions

I'm trying to follow these directions to use docker on Azure:
Here is what I did:
1. Ensured Docker Desktop is downloaded. It is, and it is running.
2. Created an Azure account & logged in via the CLI (finished logging in via the browser).
3. Created an ACI context & confirmed it exists.
4. Ran a working, existing container using docker --context mycontext run -p 8080:8080 <container-name>
Step 4 returns...
containerinstance.ContainerGroupsClient#CreateOrUpdate: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client 'client#domain.edu' with object id 'id' does not have authorization to perform action 'Microsoft.ContainerInstance/containerGroups/write' over scope '/subscriptions/subscriptionNumber/resourceGroups/TLT-WVD-Pilot/providers/Microsoft.ContainerInstance/containerGroups/<group-name-here>' or the scope is invalid. If access was recently granted, please refresh your credentials."
How do I debug this or even "refresh credentials"? Repeating the login process via CLI (finishing with browser) does not change anything.
According to the error message, your Azure account does not have enough permission to create the ACI in that resource group, so you need to check which permissions your account has on the resource group. The built-in Contributor role on the resource group is probably suitable and safe for you.
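If it does turn out to be a missing role assignment, someone with sufficient rights on the subscription or resource group could grant it along these lines (a sketch using the resource group from the error message; the assignee is a placeholder for your account's UPN or object id):
az role assignment create \
  --assignee "<your-user-or-object-id>" \
  --role "Contributor" \
  --scope "/subscriptions/<subscriptionNumber>/resourceGroups/TLT-WVD-Pilot"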

Pull images from an Azure container registry to a Kubernetes cluster

I have followed this tutorial microsoft_website to pull images from an Azure container registry. My yaml successfully creates a pod job, which can pull the image, BUT only when it runs on the agentpool node in my cluster.
For example, adding nodeName: aks-agentpool-33515997-vmss000000 to the yaml works fine, but specifying a different node name, e.g. nodeName: aks-cpu1-33515997-vmss000000, makes the pod fail. The error message I get with describe pods is Failed to pull image and then kubelet Error: ErrImagePull.
What am I missing?
Create secret:
kubectl create secret docker-registry <secret-name> \
--docker-server=<container-registry-name>.azurecr.io \
--docker-username=<service-principal-ID> \
--docker-password=<service-principal-password>
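For that secret to take effect, the pod (or the job's pod template) also has to reference it via imagePullSecrets; a minimal sketch reusing the placeholders above:
spec:
  containers:
  - name: my-job
    image: <container-registry-name>.azurecr.io/<image>:<tag>
  imagePullSecrets:
  - name: <secret-name>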
As #user1571823 said, the solution to the problem is deleting the old image from the ACR and creating/pushing a new one.
The problem was related to some sort of corruption in the image saved in the Azure container registry (ACR). The reason one agent pool could pull the image was actually that the image already existed on the VM.
Hence, as #andov said, it is a good option to open an incident case with Azure support for AKS from the subscription where AKS is deployed. The support team has full access to the AKS service backend and they can tell exactly what was causing your problem.
Four things to check:
Is it a subscription issue? Are the nodes in different subscriptions?
Is it a rights issue? Does the service principal of the node have rights to pull the image?
Is it a network issue? Are the nodes on different subnets?
Is there something about the image size or configuration that means it cannot run on the other cluster?
Edit
New-AzAksNodePool has a parameter -DefaultProfile.
It can be AzContext, AzureRmContext, or AzureCredential.
If this is different between your nodes, it would explain the error.
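For the rights check in particular, the Azure CLI can validate that the cluster's identity is able to pull from the registry (placeholder names assumed):
az aks check-acr --resource-group <resourceGroup> --name <clusterName> --acr <container-registry-name>.azurecr.io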

How does kubernetes work from within a docker container

How does kubectl (e.g. kubectl get no) work from within a docker container?
I know that it has to talk to the API server, but nowhere can I find a config file containing the details of this (like the .kube/config file found under my user).
I've done an env to check out what variables are set.
I've gone to the home directory which has a .kube directory but no config file.
As per documentation:
The recommended way to authenticate to the apiserver is with a service account credential. By default, a pod is associated with a service account, and a credential (token) for that service account is placed into the filesystem tree of each container in that pod, at /var/run/secrets/kubernetes.io/serviceaccount/token
When kubectl connects to the api using a serviceaccount, the token is read from /var/run/secrets/kubernetes.io/serviceaccount/token
When you create a pod, if you do not specify a service account, it is automatically assigned the default service account in the same namespace
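To make that concrete, this is roughly what kubectl does in-cluster when there is no .kube/config: it reads the mounted token and CA certificate and talks to the API server address exposed through environment variables. A sketch you could run from inside the container:
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
curl --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
  "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/namespaces/$NAMESPACE/pods"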
When you perform "config operation" with kubectl like:
kubectl config set-context test
.kube/config will be created automatically.
You can also pass a different serviceAccountName into your pod and auto-mount the token like:
spec:
  serviceAccountName: <your_service_account>
  automountServiceAccountToken: true
You can find more information about Configure Service Accounts for Pods here.
Hope this helps.

Azure AKS 'Kube-Proxy' Kubernetes Node Log file location?

My question is 'probably' specific to Azure.
How can I review the Kube-Proxy logs?
After SSH'ing into an Azure AKS Node (done) I can use the following to view the Kubelet logs:
journalctl -u kubelet -o cat
Azure docs on the Azure Kubelet logs can be found here:
https://learn.microsoft.com/en-us/azure/aks/kubelet-logs
I have reviewed the following Kubernetes resource regarding logs but Kube-Proxy logs on Azure do not appear in any of the suggested locations on the AKS node:
https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#looking-at-logs
This is part of a troubleshooting effort related to a Kubernetes NGINX Ingress temporarily returning a '504 Gateway Time-out' when a service has not been accessed / has gone idle for some period of time (perhaps 5 to 10 minutes) but then becoming accessible on the next attempt(s).
On AKS, kube-proxy runs as a DaemonSet in the kube-system namespace
You can list the kube-proxy pods + node information with:
kubectl get pods -l component=kube-proxy -n kube-system -o wide
And then you can review the logs by running:
kubectl logs kube-proxy-<suffix> -n kube-system
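If you don't know which node (and therefore which kube-proxy pod) handled the failing request, you can also pull logs from all of them at once using the same label selector:
kubectl logs -l component=kube-proxy -n kube-system --tail=100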
On the same note as Acanthamoeba's answer, the logs for the Kube-Proxy pod can also be accessed via the browser-based Kubernetes dashboard UI that can be launched via:
az aks browse --resource-group <ClusterResourceGroup> --name <ClusterName>
The above should pop open a new browser window pointed at the following URL: http://127.0.0.1:8001/#!/overview?namespace=default
Switch to Kube-System Namespace
Once the browser window is open, change to the Kube-System namespace, by selecting that option from the drop down on the left side:
Kube-System namespace is all the way at the bottom of the drop down... and probably requires scrolling.
Navigate to Pods
From there click "pods" (also on the left hand side menu, below the namespaces drop down) and then click the Kube-Proxy pod:
View Kube-Proxy Logs
Click to view the logs of your Azure AKS based Kube-Proxy pod; the 'Logs' button is in the top right-hand menu, to the left of 'Delete' and 'Edit', just below 'Create':
Other Azure AKS Troubleshooting Resources
Since you are trying to view the Kube-Proxy logs you are probably troubleshooting some networking issues or something along those lines. Here are some other resources that I used during my troubleshooting tour of my Azure AKS cluster:
View Kubelet Logs on Azure AKS: https://learn.microsoft.com/en-us/azure/aks/kubelet-logs
nGinx Ingress Troubleshooting: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md
SSH into an Azure AKS Cluster VM: https://learn.microsoft.com/en-us/azure/aks/aks-ssh
