"timeout expired" error when mouning PVC in Azure Kubernetes AKS - azure

After a high-load problem caused a pod in my deployment to be evicted, I keep getting the following error, even after deleting the deployment and creating it again:
Warning FailedMount 15s kubelet Unable to mount volumes for pod "XXX(YYY)": timeout expired waiting for volumes to attach or mount for pod "qa"/"XXX". list of unmounted volumes=[ZZZ-volume]. list of unattached volumes=[shared dockersocket ZZZ-volume default-token-kks6d]
The PV is in RWO mode, so it can only be attached to one pod at a time. I guess the system still considers the PV attached to the evicted pod (which I have deleted), so it does not allow it to be attached to a new pod.
How can I "free" my PV/PVC so it can be attached to the new POD?
Edit: I added get PV and get PVC outputs as requested:
kubectl get pvc:
XXX-pvc-default Bound pvc-XXX-7d98-11ea-91c2-XXX 5Gi RWO default 469d
kubectl get pv:
pvc-XXX-7d98-11ea-91c2-XXX 5Gi RWO Delete Bound qa/XXX-pvc-default default 469d

Related

Persistent Volume is not available in Volume Attachment

I am trying to expose Blob Storage to Kubernetes pods. During testing, I got the error below:
Unable to attach or mount volumes: unmounted volumes=[blob-secret],
unattached volumes=[secrets-store]: error processing PVC
namespace/pvc-blob-claim: PVC is being deleted
Then I executed the below command for debugging
kubectl get pv
pv-blob 10Gi RWX Retain
kubectl get volumeattachment
This returned nothing (pv-blob is not listed here).
Why is the PV not available under Volume Attachment? Any suggestions on debugging this scenario are much appreciated.

AKS PersistentVolume Affinity?

Disclaimer: This question is very specific about the used platforms and the UseCase we are trying to solve with it. Also it compares two approaches we currently use at least in a development stage and are trying to compare, but perhaps don't fully understand yet. I am asking for guidance on this very specific topic...
A) We are running a Kafka cluster as Kafka Tasks on DC/OS, where data persistence is maintained via local disk storage provisioned on the very same host as the corresponding Kafka broker instance.
B) We are trying to run Kafka on Kubernetes (via Strimzi Operator), specifically Azure Kubernetes Service (AKS) and are struggling to get reliable Data Persistence using the StorageClasses you get in AKS. We tried three possibilities:
(Default) Azure Disk
Azure File
emptyDir
I see two major issues with Azure Disk. While we are able to set the Kafka pod affinity so that the brokers do not end up in the same maintenance zone / on the same host, we have no instrument to bind the corresponding PersistentVolume anywhere near the pod; there is nothing like NodeAffinity for Azure Disks. Also, it is fairly common for an Azure Disk to end up on a different host than its corresponding pod, in which case it might be limited by network bandwidth?
With Azure File we don't have issues with maintenance zones going down temporarily, but as a high-latency storage option it doesn't seem to be a good fit, and Kafka also has trouble deleting / updating files on retention.
So I ended up using an ephemeral Storage Cluster which is commonly NOT recommended but doesn't come with the problems above. The Volume "lives" near the pod and is available to it as long as the pod itself runs on any node. In the maintenance case pod AND volume die together. As long as I am able to maintain a quorum, I don't see where this might cause issues.
Is there anything like podAffinity for PersistentVolumes as Azure-Disk is per definition Node bound?
What are the major downsides in using emptyDir for persistence in a Kafka Cluster on Kubernetes?
Is there anything like podAffinity for PersistentVolumes as Azure-Disk
is per definition Node bound?
As far as I know, there is nothing like podAffinity for PersistentVolumes backed by Azure Disk. An Azure disk has to be attached to a node, so if the pod moves to a different host node, it can't use the volume on that disk. Only an Azure file share can follow the pod regardless of the node.
What are the major downsides in using emptyDir for persistence in a
Kafka Cluster on Kubernetes?
Take a look at the use cases listed for emptyDir:
scratch space, such as for a disk-based merge sort
Disk space is the main thing you need to watch out for when you use emptyDir on AKS. You need to calculate the disk space required, and you may need to attach multiple Azure disks to the nodes.
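To make the disk-space concern concrete, an emptyDir volume can carry a sizeLimit, so that a pod exceeding it is evicted instead of silently filling the node's disk. The following is only a sketch with a plain Pod: the pod name, image, mount path, and the 10Gi figure are assumptions for illustration, and a Strimzi-managed cluster would configure its storage through the Kafka custom resource rather than a hand-written Pod.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kafka-scratch-demo            # hypothetical name
spec:
  containers:
  - name: broker
    image: registry.example.com/kafka:latest   # placeholder image
    volumeMounts:
    - name: data
      mountPath: /var/lib/kafka/data  # assumed data directory
  volumes:
  - name: data
    emptyDir:
      sizeLimit: 10Gi                 # assumed budget; size it to your retention settings
```

The data still disappears whenever the pod is removed from its node, so the limit only bounds disk usage; it does not add durability.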
Starting off - I'm not sure what you mean about an Azure Disk ending up on a node other than where the pod is assigned - that shouldn't be possible, per my understanding (for completeness, you can do this on a VM with the shared disks feature outside of AKS, but as far as I'm aware that's not supported in AKS for dynamic disks at the time of writing). If you're looking at the volume.kubernetes.io/selected-node annotation on the PVC, I don't believe that's updated after initial creation.
You can reach the configuration you're looking for by using a statefulset with antiaffinity. Consider this statefulset. It creates three pods, which must be in different availability zones. I'm deploying this to an AKS cluster with a nodepool (nodepool2) with two nodes per AZ:
❯ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{","}{.metadata.labels.topology\.kubernetes\.io\/zone}{"\n"}{end}'
aks-nodepool1-25997496-vmss000000,0
aks-nodepool2-25997496-vmss000000,westus2-1
aks-nodepool2-25997496-vmss000001,westus2-2
aks-nodepool2-25997496-vmss000002,westus2-3
aks-nodepool2-25997496-vmss000003,westus2-1
aks-nodepool2-25997496-vmss000004,westus2-2
aks-nodepool2-25997496-vmss000005,westus2-3
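The manifest itself is not reproduced here. A minimal sketch of a StatefulSet along these lines might look like the following. The names echo, demo, zonetest, and managed-premium are taken from the output shown in this answer; the app label, the headless Service name, the container image, the mount path, and the agentpool node selector are assumptions for illustration, not the original manifest.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: echo
  namespace: zonetest
spec:
  serviceName: echo                   # assumed headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: echo                       # assumed label
  template:
    metadata:
      labels:
        app: echo
    spec:
      nodeSelector:
        agentpool: nodepool2          # assumption: keeps the pods on the zonal node pool
      affinity:
        podAntiAffinity:
          # At most one echo pod per availability zone
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: echo
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: echo
        image: registry.example.com/echo:latest   # placeholder image
        volumeMounts:
        - name: demo
          mountPath: /data            # assumed mount path
  volumeClaimTemplates:
  - metadata:
      name: demo                      # yields PVCs named demo-echo-0, demo-echo-1, ...
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 1Gi
```

The required anti-affinity rule with a zone topology key is what forces the three replicas into three different zones.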
Once the statefulset is deployed and spun up, you can see each pod was assigned to one of the nodepool2 nodes:
❯ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
echo-0 1/1 Running 0 3m42s 10.48.36.102 aks-nodepool2-25997496-vmss000001 <none> <none>
echo-1 1/1 Running 0 3m19s 10.48.36.135 aks-nodepool2-25997496-vmss000002 <none> <none>
echo-2 1/1 Running 0 2m55s 10.48.36.72 aks-nodepool2-25997496-vmss000000 <none> <none>
Each pod created a PVC based on the template:
❯ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
demo-echo-0 Bound pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d 1Gi RWO managed-premium 25m
demo-echo-1 Bound pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4 1Gi RWO managed-premium 25m
demo-echo-2 Bound pvc-d914a745-688f-493b-9b82-21598d4335ca 1Gi RWO managed-premium 24m
Let's take a look at one of the PVs that was created:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
    volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
  creationTimestamp: "2021-04-05T14:08:12Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: westus2
    failure-domain.beta.kubernetes.io/zone: westus2-3
  name: pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
  resourceVersion: "19275047"
  uid: 945ad69a-92cc-4d8d-96f4-bdf0b80f9965
spec:
  accessModes:
  - ReadWriteOnce
  azureDisk:
    cachingMode: ReadOnly
    diskName: kubernetes-dynamic-pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
    diskURI: /subscriptions/02a062c5-366a-4984-9788-d9241055dda2/resourceGroups/rg-sandbox-aks-mc-sandbox0-westus2/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
    fsType: ""
    kind: Managed
    readOnly: false
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: demo-echo-1
    namespace: zonetest
    resourceVersion: "19275017"
    uid: 9d9fbd5f-617a-4582-abc3-ca34b1b178e4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - westus2
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - westus2-3
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-premium
  volumeMode: Filesystem
status:
  phase: Bound
As you can see, that PV has a required nodeAffinity for nodes whose failure-domain.beta.kubernetes.io/zone label has the value westus2-3. This ensures that the pod that owns that PV will only ever get placed on a node in westus2-3, and the disk will be attached to whichever node the pod is running on when the pod is started.
At this point, I deleted all the pods to get them on the other nodes:
❯ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
echo-0 1/1 Running 0 4m4s 10.48.36.168 aks-nodepool2-25997496-vmss000004 <none> <none>
echo-1 1/1 Running 0 3m30s 10.48.36.202 aks-nodepool2-25997496-vmss000005 <none> <none>
echo-2 1/1 Running 0 2m56s 10.48.36.42 aks-nodepool2-25997496-vmss000003 <none> <none>
There's no way to see it via Kubernetes, but you can see via the Azure portal that managed disk kubernetes-dynamic-pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d, which backs PV pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d, which in turn backs PVC zonetest/demo-echo-0, is listed as Managed by: aks-nodepool2-25997496-vmss_4, so it has been detached from its previous node and attached to the node where the pod is now running.
[Image: portal screenshot showing the disk attached to node 4]
If I were to remove nodes such that I didn't have nodes in AZ 3, I wouldn't be able to start pod echo-1, since it's bound to a disk in AZ 3, which can't be attached to a node not in AZ 3.

`kubectl delete service` gets stuck in 'Terminating' state

I'm trying to delete a service I wrote and deployed to Azure Kubernetes Service (along with the required Dask components that accompany it), and when I run kubectl delete -f my-manifest.yaml, my service gets stuck in the Terminating state. The console tells me that it was deleted, but the command hangs:
> kubectl delete -f my-manifest.yaml
service "dask-scheduler" deleted
deployment.apps "dask-scheduler" deleted
deployment.apps "dask-worker" deleted
service "my-service" deleted
deployment.apps "my-deployment" deleted
I have to Ctrl+C this command. When I check my services, Dask has been successfully deleted, but my custom service hasn't. If I try to manually delete it, it similarly hangs/fails:
> kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP x.x.x.x <none> 443/TCP 18h
my-service LoadBalancer x.x.x.x x.x.x.x 80:30786/TCP,443:31934/TCP 18h
> kubectl delete service my-service
service "my-service" deleted
This question says to delete the pods first, but all my pods are deleted (kubectl get pods returns nothing). There's also this closed K8s issue that says --wait=false might fix foreground cascade deletion, but this doesn't work and doesn't seem to be the issue here anyway (as the pods themselves have already been deleted).
I assume that I can completely wipe out my AKS cluster and re-create, but that's an option of last resort here. I don't know whether it's relevant, but my service is using the azure-load-balancer-internal: "true" annotation for the service, and I have a webapp deployed to my VNet that uses this service.
Is there any other way to force shutdown this service?
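For context, a Service of the kind described above might be declared roughly as follows. This is only a sketch: the name, type, and the two ports come from the kubectl get services output, and the annotation is the one mentioned in the question written with its full key, while the selector label and target ports are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Asks Azure for an internal load balancer inside the VNet
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: my-deployment        # assumed pod label
  ports:
  - name: http
    port: 80
    targetPort: 80            # assumed container port
  - name: https
    port: 443
    targetPort: 443           # assumed container port
```

Deleting such a Service requires the cluster's Azure identity to modify the load balancer and its subnet, which is why a permissions problem on the Azure side can leave it stuck.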
Thanks to @4c74356b41's suggestion of looking at kubectl describe service my-service (which I hadn't considered for some reason), I saw this warning:
Code="LinkedAuthorizationFailed" Message="The client 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' has permission to perform action 'Microsoft.Network/loadBalancers/write' on scope '/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Network/loadBalancers/kubernetes-internal'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>' or the linked scope(s) are invalid.
(The client and object id GUIDs are the same value.)
This indicated that it's not exactly a Kubernetes issue, but more a matter of permissions within the Azure ecosystem. I looked through the portal and didn't find that GUID in any of my users, groups, or apps, so I'm not sure what it's referring to. However, I granted the Owner role to this client id, and after a few minutes, the service was deleted.
az role assignment create `
--role Owner `
--assignee xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
I had a similar issue with a service not connecting to the pod because the pod was already deleted:
HTTPConnectionPool(host='scv-name-not-shown-because-prod.namespace-prod', port=7999): Max retries exceeded with url:
my-url-not-shown-because-prod (Caused by
NewConnectionError('<urllib3.connection.HTTPConnection object at
0x7faee4b112b0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
I was able to solve this with the patch command:
kubectl patch service scv-name-not-shown-because-prod -n namespace-prod -p '{"metadata":{"finalizers":null}}'
I think the service went into some illegal state and was not able to recover.

Kubernetes FailedScheduling using nodeSelector

I have set up an on-prem Kubernetes cluster using Rancher, with 3 CentOS nodes and 1 Windows node.
I wanted to set a Deployment that will never run over the Windows node, so I set in the Deployment spec.template.spec.nodeSelector: kubernetes.io/os: linux
It seems to run but the deployment gets stuck in Pending, with this error:
Warning FailedScheduling default-scheduler 0/4 nodes are
available: 1 node(s) didn't match node selector, 3 node(s) had taint
{cattle.io/os: linux}, that the pod didn't tolerate.
Any insights?
The scheduler is not able to schedule the pod on the Linux nodes because those nodes have taints. You need to add tolerations to the pod spec of the deployment:
tolerations:
- key: "cattle.io/os"
  operator: "Equal"
  value: "linux"
  effect: "NoSchedule"
Also add a specific taint to the Windows node so that only pods with a matching toleration can be scheduled onto it:
kubectl taint nodes windowsnode cattle.io/os=windows:NoSchedule
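Putting the nodeSelector from the question and the toleration above together, the Deployment might look roughly like this. It is a sketch only: the Deployment name, labels, and image are placeholders, and the NoSchedule effect is the one assumed in this answer, since the scheduler message does not show the taint's effect.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: linux-only-app                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: linux-only-app
  template:
    metadata:
      labels:
        app: linux-only-app
    spec:
      # Restrict scheduling to Linux nodes
      nodeSelector:
        kubernetes.io/os: linux
      # Tolerate the taint reported on the Linux nodes
      tolerations:
      - key: "cattle.io/os"
        operator: "Equal"
        value: "linux"
        effect: "NoSchedule"          # assumed effect; check with kubectl describe node
      containers:
      - name: app
        image: registry.example.com/app:latest   # placeholder image
```

The nodeSelector keeps the pods off the Windows node, while the toleration is what lets them onto the tainted Linux nodes; both are needed here.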

How to limit amount of pods with attached managed disks per node

Imagine there is a cluster with lots of different deployments running on it. Some pods use PersistentVolumes (Azure Disks). Azure limits how many disks can be mounted to a VM, and this leads to scheduling errors like
Status=409 Code="OperationNotAllowed" Message="The maximum number of data disks allowed to be attached to a VM of this size is 8
Pods stay in the "Waiting: Container creating" state forever, even though some nodes had far fewer pods with attached disks at the moment of scheduling. It would be great to limit the number of pods with attached disks per node so that this error never happens. I believe podAntiAffinity is what I need, and I know I can prevent pods with the same label from being scheduled on the same node, but I don't know how to allow it until a node has reached the maximum number of pods with disks.
My installation is AKS.
az acs create \
--orchestrator-type=kubernetes \
--orchestrator-version 1.7.9 \
--resource-group <resource_group_here> \
--name=<name_here> \
...
KUBE_MAX_PD_VOLS is what you are looking for. By default its value is 16 for Azure Disks. So you can either use instances which have the same limit of attached disks (16) or set it to your preferred value. You can see where it's declared on GitHub.
You should set this environment variable in your scheduler declaration. I found my scheduler declaration in /etc/kubernetes/manifests/kube-scheduler.yaml. This is what it looks like now:
apiVersion: "v1"
kind: "Pod"
metadata:
  name: "kube-scheduler"
  ...
spec:
  containers:
  - name: "kube-scheduler"
    ...
    env:
    - name: KUBE_MAX_PD_VOLS
      value: "8"
    ...
Note the spec.containers.env.KUBE_MAX_PD_VOLS setting: it prevents more than 8 disks from being scheduled on each node.
This way pods spread across the nodes without any issues, and pods which cannot fit stay in the Pending state until a node with enough free disk slots is available.

Resources