k8s pods stuck in failed/shutdown state after preemption (gke v1.20) - node.js

TL;DR - GKE 1.20 preemptible nodes cause pods to zombie into Failed/Shutdown
We have been using GKE for a few years with clusters containing a mixture of stable and preemptible node pools. Recently, since GKE v1.20, we have started seeing preempted pods enter a weird zombie state where they are described as:
Status: Failed
Reason: Shutdown
Message: Node is shutting, evicting pods
When this started occurring we were convinced it was related to our pods failing to properly handle the SIGTERM at preemption. We decided to eliminate our service software as the source of the problem by boiling it down to a simple service that mostly sleeps:
/* eslint-disable no-console */
let exitNow = false

// Signal handlers: just record that a shutdown was requested; the main loop checks the flag.
process.on( 'SIGINT', () => {
  console.log( 'INT shutting down gracefully' )
  exitNow = true
} )

process.on( 'SIGTERM', () => {
  console.log( 'TERM shutting down gracefully' )
  exitNow = true
} )

const sleep = ( seconds ) => {
  return new Promise( ( resolve ) => {
    setTimeout( resolve, seconds * 1000 )
  } )
}

// Sleep in short cycles so a pending SIGTERM/SIGINT is noticed within one delaySec interval.
const Main = async ( cycles = 120, delaySec = 5 ) => {
  console.log( `Starting ${cycles}, ${delaySec} second cycles` )
  for ( let i = 1; i <= cycles && !exitNow; i++ ) {
    console.log( `---> ${i} of ${cycles}` )
    await sleep( delaySec ) // eslint-disable-line
  }
  console.log( '*** Cycle Complete - exiting' )
  process.exit( 0 )
}

Main()
This code is built into a Docker image that uses tini as the init process to spawn the Node.js process (fermium-alpine image). No matter how we shuffle the signal handling, the pods never seem to shut down cleanly, even though the logs suggest they do.
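For context, a test Deployment like this might be pinned to the preemptible pool roughly as sketched below; the names, image, and replica count are illustrative, while cloud.google.com/gke-preemptible is the standard GKE node label for preemptible nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: preempt-pod              # illustrative name matching the log excerpt below
spec:
  replicas: 3
  selector:
    matchLabels:
      app: preempt-pod
  template:
    metadata:
      labels:
        app: preempt-pod
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"   # schedule only onto the preemptible pool
      terminationGracePeriodSeconds: 30            # default; the SIGTERM handler above must finish within this window
      containers:
      - name: preempt-pod
        image: example.com/preempt-test:latest     # hypothetical image built FROM node:fermium-alpine with tini as entrypoint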
Another oddity is that, according to the Kubernetes pod logs, we see the pod termination start and then get cancelled:
2021-08-06 17:00:08.000 EDT Stopping container preempt-pod
2021-08-06 17:02:41.000 EDT Cancelling deletion of Pod preempt-pod
We have also tried adding a 15-second preStop delay (sketched below) just to see if that has any effect, but nothing we try seems to matter - the pods become zombies. New replicas are started on the other nodes available in the pool, so the system always maintains the minimum number of successfully running pods.
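For reference, the 15-second preStop delay amounts to a lifecycle hook roughly like this (container name is illustrative; terminationGracePeriodSeconds must cover the sleep plus the app's own shutdown):
    spec:
      terminationGracePeriodSeconds: 30      # must be >= the preStop sleep plus application shutdown time
      containers:
      - name: preempt-pod
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]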
We are also testing the preemption cycle using a simulated maintenance event:
gcloud compute instances simulate-maintenance-event node-id

After poking around various posts I finally relented and set up a CronJob that runs every 9 minutes to avoid the Alertmanager alert that fires after pods have been stuck in Shutdown for 10+ minutes. This still feels like a hack to me, but it works, and it forced me to dig into Kubernetes CronJobs and RBAC.
This post started me on the path:
How to remove Kubernetes 'shutdown' pods
And the resultant cronjob spec:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-accessor-role
  namespace: default
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "delete", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-access
  namespace: default
subjects:
- kind: ServiceAccount
  name: cronjob-sa
  namespace: default
roleRef:
  kind: Role
  name: pod-accessor-role
  apiGroup: rbac.authorization.k8s.io # roleRef requires the rbac.authorization.k8s.io group
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cronjob-sa
  namespace: default
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-zombie-killer
  namespace: default
spec:
  schedule: "*/9 * * * *"
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          name: cron-zombie-killer
          namespace: default
        spec:
          serviceAccountName: cronjob-sa
          restartPolicy: Never
          containers:
          - name: cron-zombie-killer
            imagePullPolicy: IfNotPresent
            image: bitnami/kubectl
            command:
            - "/bin/sh"
            args:
            - "-c"
            - "kubectl get pods -n default --field-selector='status.phase==Failed' -o name | xargs kubectl delete -n default 2> /dev/null"
status: {}
Note that the redirect of stderr to /dev/null is simply to avoid the error output from kubectl delete when kubectl get doesn't find any pods in the Failed state.
Update: added the missing "delete" verb to the Role, and added the missing RoleBinding.
Update: added imagePullPolicy.

Starting with GKE 1.20.5, the kubelet graceful node shutdown feature is enabled on preemptible nodes. From the note on the feature page:
When pods were evicted during the graceful node shutdown, they are marked as failed. Running kubectl get pods shows the status of the evicted pods as Shutdown. And kubectl describe pod indicates that the pod was evicted because of node shutdown:

Status: Failed
Reason: Shutdown
Message: Node is shutting, evicting pods

Failed pod objects will be preserved until explicitly deleted or cleaned up by the GC. This is a change of behavior compared to abrupt node termination.
These pods should eventually be garbage collected, although I'm not sure of the threshold value.

Related

kube-proxy changes reverting after a couple of minutes on my AKS cluster

I am experimenting and tweaking a bit on my sandbox AKS cluster with the intention of configuring it in a production-ready state. As part of that, I am following a book where the author redeploys the initial kube-proxy daemonset with some modifications (the only difference is that he is doing it on AWS EKS).
The problem is that the daemonset and pods revert to their initial state after 2-3 minutes. AKS is just doing a rollback, which I can see when I execute the rollout history command:
> kubectl rollout history daemonset kube-proxy -n kube-system
daemonset.apps/kube-proxy
REVISION CHANGE-CAUSE
2 <none>
8 <none>
10 <none>
14 <none>
16 <none>
I tried to redeploy the daemonset with my minor changes (changed cpu from 100m to 120m and changed the -v flag from 3 to 2) declaratively by applying the following manifest:
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    component: kube-proxy
    tier: node
    deployment: custom
  name: kube-proxy
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: kube-proxy
      tier: node
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: kube-proxy
        tier: node
        deployedBy: Luka
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.azure.com/cluster
                operator: Exists
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - command:
        - kube-proxy
        - --conntrack-max-per-core=0
        - --metrics-bind-address=0.0.0.0:10249
        - --kubeconfig=/var/lib/kubelet/kubeconfig
        - --cluster-cidr=10.244.0.0/16
        - --detect-local-mode=ClusterCIDR
        - --pod-interface-name-prefix=
        - --v=2
        image: mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.23.12-hotfix.20220922.1
        imagePullPolicy: IfNotPresent
        name: kube-proxy
        resources:
          requests:
            cpu: 120m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/kubernetes/certs
          name: certificates
          readOnly: true
        - mountPath: /run/xtables.lock
          name: iptableslock
        - mountPath: /lib/modules
          name: modules
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /bin/sh
        - -c
        - |
          SYSCTL=/proc/sys/net/netfilter/nf_conntrack_max
          echo "Current net.netfilter.nf_conntrack_max: $(cat $SYSCTL)"
          DESIRED=$(awk -F= '/net.netfilter.nf_conntrack_max/ {print $2}' /etc/sysctl.d/999-sysctl-aks.conf)
          if [ -z "$DESIRED" ]; then
            DESIRED=$((32768*$(nproc)))
            if [ $DESIRED -lt 131072 ]; then
              DESIRED=131072
            fi
            echo "AKS custom config for net.netfilter.nf_conntrack_max not set."
            echo "Setting nf_conntrack_max to $DESIRED (32768 * $(nproc) cores, minimum 131072)."
            echo $DESIRED > $SYSCTL
          else
            echo "AKS custom config for net.netfilter.nf_conntrack_max set to $DESIRED."
            echo "Setting nf_conntrack_max to $DESIRED."
            echo $DESIRED > $SYSCTL
          fi
        image: mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.23.12-hotfix.20220922.1
        imagePullPolicy: IfNotPresent
        name: kube-proxy-bootstrap
        resources:
          requests:
            cpu: 100m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/sysctl.d
          name: sysctls
        - mountPath: /lib/modules
          name: modules
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet
          type: ""
        name: kubeconfig
      - hostPath:
          path: /etc/kubernetes/certs
          type: ""
        name: certificates
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: iptableslock
      - hostPath:
          path: /etc/sysctl.d
          type: Directory
        name: sysctls
      - hostPath:
          path: /lib/modules
          type: Directory
        name: modules
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 4
  desiredNumberScheduled: 4
  numberAvailable: 4
  numberMisscheduled: 0
  numberReady: 4
  observedGeneration: 1
  updatedNumberScheduled: 4
I also tried it without the initContainer. Even the solution of editing the daemonset, explained in this Stack Overflow post, didn't work.
Am I missing something? Why does the kube-proxy daemonset always roll back?
In Kubernetes, a rolling update is the default strategy for updating the running version of an application.
When I upgrade the pods from version 1 to version 2, the deployment creates a new ReplicaSet and increases its replica count, while the count of the previous ReplicaSet goes to 0.
After the rolling update, the previous ReplicaSet is not deleted.
If we execute another rolling update from version 2 to 3, we may notice that at the end of the upgrade we have two ReplicaSets with a count of 0.
I created the deployment file and deployed it; when I check the history of the daemonset I see the results below:
kubectl rollout history daemonset kube-proxy -n kube-system
We can roll back to a specific revision:
kubectl rollout undo daemonset kube-proxy --to-revision=4 -n kube-system
After the undo, the revisions of my daemonset look like below:
kubectl rollout history daemonset kube-proxy -n kube-system
The output of that command has two columns, REVISION and CHANGE-CAUSE, and the change-cause is always set to <none>.
I set the change-cause to "Kube" with the annotation below (see the sketch after this answer), re-applied the manifest, and then checked the rollout history again:
kubernetes.io/change-cause: "Kube" # for a particular revision
kubectl apply -f filename
kubectl rollout history daemonset kube-proxy -n kube-system
Reference: To know more about rolling updates, see the Kubernetes documentation on rolling updates.
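For reference, the kubernetes.io/change-cause annotation mentioned above is set in the DaemonSet metadata, roughly like this (only the relevant fields are shown):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
  annotations:
    kubernetes.io/change-cause: "Kube"   # recorded as CHANGE-CAUSE for the revision created by the next apply
spec:
  # ... rest of the DaemonSet spec unchanged ...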

KEDA: Azure Service Bus scaled object not scaling deployment - "Scaling is not performed because triggers are not active"

Trying to autoscale a pod based on inbound messages from Azure Service Bus using KEDA. The scaled object is defined as:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: main-router-scaledobject
  namespace: rehmannazar-camel-dev
spec:
  minReplicaCount: 0
  maxReplicaCount: 10
  scaleTargetRef:
    name: mainrouter
    kind: Deployment
  triggers:
  - type: azure-servicebus
    metadata:
      topicName: topic4test
      subscriptionName: sub3
      messageCount: "10"
      activationMessageCount: "0"
    authenticationRef:
      name: trigger-auth-service
with trigger-auth-service defined as
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: trigger-auth-service
spec:
  secretTargetRef:
  - parameter: connection
    name: connectionsecret
    key: connection
and connectionsecret holds the connection string to Azure Service Bus.
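A minimal sketch of that Secret, assuming the connection string sits under the connection key referenced by the TriggerAuthentication above (the value is a placeholder):
apiVersion: v1
kind: Secret
metadata:
  name: connectionsecret
  namespace: rehmannazar-camel-dev
type: Opaque
stringData:
  # Placeholder: a Service Bus connection string with rights on the namespace or topic
  connection: "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>"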
kubectl describe scaledobject main-router-scaledobject
shows the following status:
Status:
Conditions:
Message: ScaledObject is defined correctly and is ready for scaling
Reason: ScaledObjectReady
Status: True
Type: Ready
Message: Scaling is not performed because triggers are not active
Reason: ScalerNotActive
Status: False
Type: Active
Message: No fallbacks are active on this scaled object
Reason: NoFallbackFound
Status: False
Type: Fallback
External Metric Names:
s0-azure-servicebus-topic4test
Health:
s0-azure-servicebus-topic4test:
Number Of Failures: 0
Status: Happy
Hpa Name: keda-hpa-main-router-scaledobject
Original Replica Count: 1
Scale Target GVKR:
Group: apps
Kind: Deployment
Resource: deployments
Version: v1
Scale Target Kind: apps/v1.Deployment
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal KEDAScaleTargetDeactivated 3m53s (x191 over 98m) keda-operator Deactivated apps/v1.Deployment rehmannazar-camel-dev/mainrouter from 1 to 0
kubectl get ScaledObject main-router-scaledobject
NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK AGE
main-router-scaledobject apps/v1.Deployment mainrouter 0 10 azure-servicebus trigger-auth-service True False False 101m
yet pods are not scaled to zero, and when posting messages to subscription sub3 the pods are not scaled up. Pods are also not scaled down to zero when sub3 has no messages. There is always a single pod in the Running state. The only activity I am observing is pods getting terminated and new pods getting started, but the replica count always remains 1. Is there something I missed in the KEDA configuration?
The KEDA configuration is working. The issue was the KEDA integration with Camel K.

Event Hub triggered Azure Function running on AKS with KEDA does not scale out

I have deployed an Event Hub triggered Azure Function written in Java on AKS. The function should scale out using KEDA.
The function is correctly triggered and working, but it's not scaling out when the load increases. I have added sleep calls to the function implementation to make sure it's not burning through the events too fast and should be forced to scale out, but this did not change anything either.
kubectl get hpa shows the following output
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
keda-hpa-eventlogger Deployment/eventlogger 64/64 (avg) 1 20 1 3m41s
This seems to be a first indicator that something is not correct, as I assume the first number in the TARGETS column is the number of unprocessed events in Event Hub. It stays the same no matter how many events I pump into the hub.
The function was deployed using the following Kubernetes manifest:
data:
  AzureWebJobsStorage: <removed>
  FUNCTIONS_WORKER_RUNTIME: amF2YQ==
  EventHubConnectionString: <removed>
apiVersion: v1
kind: Secret
metadata:
  name: eventlogger
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventlogger
  labels:
    app: eventlogger
spec:
  selector:
    matchLabels:
      app: eventlogger
  template:
    metadata:
      labels:
        app: eventlogger
    spec:
      containers:
      - name: eventlogger
        image: <removed>
        env:
        - name: AzureFunctionsJobHost__functions__0
          value: eventloggerHandler
        envFrom:
        - secretRef:
            name: eventlogger
        readinessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 240
          httpGet:
            path: /
            port: 80
            scheme: HTTP
        startupProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 240
          httpGet:
            path: /
            port: 80
            scheme: HTTP
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: eventlogger
  labels:
    app: eventlogger
spec:
  scaleTargetRef:
    name: eventlogger
  pollingInterval: 5
  cooldownPeriod: 5
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: azure-eventhub
    metadata:
      storageConnectionFromEnv: AzureWebJobsStorage
      connectionFromEnv: EventHubConnectionString
---
The Event Hub connection string contains the "EntityPath=" section, as described in the KEDA Event Hub scaler documentation, and has Manage permissions on the Event Hub namespace.
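For reference, a connection string of that shape looks roughly like the placeholder below; the EntityPath segment is what scopes it to a single Event Hub:
# Placeholder only - illustrates the expected layout, not real values
EventHubConnectionString: "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<event-hub-name>"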
The output of kubectl describe ScaledObject is
Name: eventlogger
Namespace: default
Labels: app=eventlogger
scaledobject.keda.sh/name=eventlogger
Annotations: <none>
API Version: keda.sh/v1alpha1
Kind: ScaledObject
Metadata:
Creation Timestamp: 2022-04-17T10:30:36Z
Finalizers:
finalizer.keda.sh
Generation: 1
Managed Fields:
API Version: keda.sh/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:labels:
.:
f:app:
f:spec:
.:
f:cooldownPeriod:
f:maxReplicaCount:
f:minReplicaCount:
f:pollingInterval:
f:scaleTargetRef:
.:
f:name:
f:triggers:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-04-17T10:30:36Z
API Version: keda.sh/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"finalizer.keda.sh":
f:labels:
f:scaledobject.keda.sh/name:
f:status:
.:
f:conditions:
f:externalMetricNames:
f:lastActiveTime:
f:originalReplicaCount:
f:scaleTargetGVKR:
.:
f:group:
f:kind:
f:resource:
f:version:
f:scaleTargetKind:
Manager: keda
Operation: Update
Time: 2022-04-17T10:30:37Z
Resource Version: 1775052
UID: 3b6a68c1-c3b9-4cdf-b5d5-41a9721ac661
Spec:
Cooldown Period: 5
Max Replica Count: 20
Min Replica Count: 0
Polling Interval: 5
Scale Target Ref:
Name: eventlogger
Triggers:
Metadata:
Connection From Env: EventHubConnectionString
Storage Connection From Env: AzureWebJobsStorage
Type: azure-eventhub
Status:
Conditions:
Message: ScaledObject is defined correctly and is ready for scaling
Reason: ScaledObjectReady
Status: False
Type: Ready
Message: Scaling is performed because triggers are active
Reason: ScalerActive
Status: True
Type: Active
Status: Unknown
Type: Fallback
External Metric Names:
s0-azure-eventhub-$Default
Last Active Time: 2022-04-17T10:30:47Z
Original Replica Count: 1
Scale Target GVKR:
Group: apps
Kind: Deployment
Resource: deployments
Version: v1
Scale Target Kind: apps/v1.Deployment
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal KEDAScalersStarted 10s keda-operator Started scalers watch
Normal ScaledObjectReady 10s keda-operator ScaledObject is ready for scaling
So I'm a bit stuck, as I don't see any errors but it's still not behaving as expected.
Versions:
Kubernetes version: 1.21.9
KEDA Version: 2.6.1 installed using kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.6.1/keda-2.6.1.yaml
Azure Functions using Java 11 and extensionBundle in host.json is configured using version [2.8.4, 3.0.0)
I was able to find a solution to the problem.
Event Hub triggered Azure Functions deployed on AKS show the same scaling characteristics as Azure Functions on App Service:
You only get one consumer per partition to allow for ordering per partition.
This characteristic overrules maxReplicaCount in the Kubernetes manifest.
So to solve my own issue: by increasing the number of partitions for the Event Hub, I get a pod per partition and KEDA scales the workload as expected.

Error in NMI pod after adding and installing Bitnami External DNS via Terraform and Helm - No AzureIdentityBinding found for pod

I am struggling to get the azureIdentity for ExternalDNS bound and get DNS entries into our zone(s).
Key error: I0423 19:27:52.830107 1 mic.go:610] No AzureIdentityBinding found for pod default/external-dns-84dcc5f68c-cl5h5 that matches selector: external-dns. it will be ignored
Also, no azureAssignedIdentity is created since there is no match for the pod and selector/aadpodidbinding.
I'm building IaC using Terraform, Helm, Azure, Azure AKS, VS Code, and, so far, three Kubernetes add-ons - AAD Pod Identity, application-gateway-kubernetes-ingress, and Bitnami external-dns.
Since the identity isn't being bound, an azureAssignedIdentity isn't being created and ExternalDNS isn't able to put records into our DNS zone(s).
The names and aadpodidbindings seem correct. I've tried passing in fullnameOverride in the Terraform kubectl_manifest provider for the Helm install of Bitnami ExternalDNS. I've tried suppressing the suffixes on ExternalDNS names and labels. I've tried editing the Helm and Kubernetes YAML on the cluster itself to try to force a binding. I've tried using the AKS user managed identity which is used for AAD Pod Identity and is located in the cluster's nodepools resource group. I've tried letting the Bitnami ExternalDNS configure and add an azure.json file, and I've also done so manually prior to adding and installing ExternalDNS. I've tried assigning the managed identity to the VMSS of the AKS cluster.
Thanks!
JBP
PS C:\Workspace\tf\HelmOne> kubectl logs pod/external-dns-84dcc5f68c-542mv
: Refresh request failed. Status Code = '404'. Response body: getting assigned identities for pod default/external-dns-84dcc5f68c-542mv in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors\n"
time="2021-04-24T19:57:30Z" level=debug msg="Retrieving Azure DNS zones for resource group: one-hi-sso-dnsrg-tf."
time="2021-04-24T20:06:02Z" level=error msg="azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/8fb55161-REDACTED-3400b5271a8c/resourceGroups/one-hi-sso-dnsrg-tf/providers/Microsoft.Network/dnsZones?api-version=2018-05-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: getting assigned identities for pod default/external-dns-84dcc5f68c-542mv in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors\n"
time="2021-04-24T20:06:02Z" level=debug msg="Retrieving Azure DNS zones for resource group: one-hi-sso-dnsrg-tf."
PS C:\Workspace\tf\HelmOne> kubectl logs pod/aad-pod-identity-nmi-vtmwm
I0424 20:07:22.400942 1 server.go:196] status (404) took 80007557875 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.0.8.7
E0424 20:08:44.427353 1 server.go:375] failed to get matching identities for pod: default/external-dns-84dcc5f68c-542mv, error: getting assigned identities for pod default/external-dns-84dcc5f68c-542mv in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0424 20:08:44.427400 1 server.go:196] status (404) took 80025612263 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.0.8.7
PS C:\Workspace\TF\HelmOne> kubectl logs pod/aad-pod-identity-mic-86944f67b8-k4hds
I0422 21:05:11.298958 1 main.go:114] starting mic process. Version: v1.7.5. Build date: 2021-04-02-21:14
W0422 21:05:11.299031 1 main.go:119] --kubeconfig not passed will use InClusterConfig
I0422 21:05:11.299038 1 main.go:136] kubeconfig () cloudconfig (/etc/kubernetes/azure.json)
I0422 21:05:11.299205 1 main.go:144] running MIC in namespaced mode: false
I0422 21:05:11.299223 1 main.go:148] client QPS set to: 5. Burst to: 5
I0422 21:05:11.299243 1 mic.go:139] starting to create the pod identity client. Version: v1.7.5. Build date: 2021-04-02-21:14
I0422 21:05:11.318835 1 mic.go:145] Kubernetes server version: v1.18.14
I0422 21:05:11.319465 1 cloudprovider.go:122] MIC using user assigned identity: c380##### REDACTED #####814b for authentication.
I0422 21:05:11.392322 1 probes.go:41] initialized health probe on port 8080
I0422 21:05:11.392351 1 probes.go:44] started health probe
I0422 21:05:11.392458 1 metrics.go:341] registered views for metric
I0422 21:05:11.392544 1 prometheus_exporter.go:21] starting Prometheus exporter
I0422 21:05:11.392561 1 metrics.go:347] registered and exported metrics on port 8888
I0422 21:05:11.392568 1 mic.go:244] initiating MIC Leader election
I0422 21:05:11.393053 1 leaderelection.go:243] attempting to acquire leader lease default/aad-pod-identity-mic...
E0423 01:47:52.730839 1 leaderelection.go:325] error retrieving resource lock default/aad-pod-identity-mic: etcdserver: request timed out
resource "helm_release" "external-dns" {
name = "external-dns"
repository = "https://charts.bitnami.com/bitnami"
chart = "external-dns"
namespace = "default"
version = "4.0.0"
set {
name = "azure.cloud"
value = "AzurePublicCloud"
}
#MyDnsResourceGroup
set {
name = "azure.resourceGroup"
value = data.azurerm_resource_group.dnsrg.name
}
set {
name = "azure.tenantId"
value = data.azurerm_subscription.currenttenantid.tenant_id
}
set {
name = "azure.subscriptionId"
value = data.azurerm_subscription.currentSubscription.subscription_id
}
set {
name = "azure.userAssignedIdentityID"
value = azurerm_user_assigned_identity.external-dns-mi-tf.client_id
}
#Verbosity of the logs (options: panic, debug, info, warning, error, fatal, trace)
set {
name = "logLevel"
value = "trace"
}
set {
name = "sources"
value = "{service,ingress}"
}
set {
name = "domainFilters"
value = "{${var.child_domain_prefix}.${lower(var.parent_domain)}}"
}
#DNS provider where the DNS records will be created (mandatory) (options: aws, azure, google, ...)
set {
name = "provider"
value = "azure"
}
#podLabels: {aadpodidbinding: <selector>} # selector you defined above in AzureIdentityBinding
set {
name = "podLabels.aadpodidbinding"
value = "external-dns"
}
set {
name = "azure.useManagedIdentityExtension"
value = true
}
}
resource "helm_release" "aad-pod-identity" {
name = "aad-pod-identity"
repository = "https://raw.githubusercontent.com/Azure/aad-pod-identity/master/charts"
chart = "aad-pod-identity"
}
resource "helm_release" "ingress-azure" {
name = "ingress-azure"
repository = "https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/"
chart = "ingress-azure"
namespace = "default"
version = "1.4.0"
set {
name = "debug"
value = "true"
}
set {
name = "appgw.name"
value = data.azurerm_application_gateway.appgwpub.name
}
set {
name = "appgw.resourceGroup"
value = data.azurerm_resource_group.appgwpubrg.name
}
set {
name = "appgw.subscriptionId"
value = data.azurerm_subscription.currentSubscription.subscription_id
}
set {
name = "appgw.usePrivateIP"
value = "false"
}
set {
name = "armAuth.identityClientID"
value = azurerm_user_assigned_identity.agic-mi-tf.client_id
}
set {
name = "armAuth.identityResourceID"
value = azurerm_user_assigned_identity.agic-mi-tf.id
}
set {
name = "armAuth.type"
value = "aadPodIdentity"
}
set {
name = "rbac.enabled"
value = "true"
}
set {
name = "verbosityLevel"
value = "5"
}
set {
name = "appgw.environment"
value = "AZUREPUBLICCLOUD"
}
set {
name = "metadata.name"
value = "ingress-azure"
}
}
PS C:\Workspace\tf\HelmOne> kubectl get azureassignedidentities
NAME AGE
ingress-azure-68c97fd496-qbptf-default-ingress-azure 23h
PS C:\Workspace\tf\HelmOne> kubectl get azureidentity
NAME AGE
ingress-azure 23h
one-hi-sso-agic-mi-tf 23h
one-hi-sso-external-dns-mi-tf 23h
PS C:\Workspace\tf\HelmOne> kubectl edit azureidentity one-hi-sso-external-dns-mi-tf
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"aadpodidentity.k8s.io/v1","kind":"AzureIdentity","metadata":{"annotations":{},"name":"one-hi-sso-external-dns-mi-tf","namespace":"default"},"spec":{"clientID":"f58e7c55-REDACTED-a6e358e53912","resourceID":"/subscriptions/8fb55161-REDACTED-3400b5271a8c/resourceGroups/one-hi-sso-kuberg-tf/providers/Microsoft.ManagedIdentity/userAssignedIdentities/one-hi-sso-external-dns-mi-tf","type":0}}
  creationTimestamp: "2021-04-22T20:44:42Z"
  generation: 2
  name: one-hi-sso-external-dns-mi-tf
  namespace: default
  resourceVersion: "432055"
  selfLink: /apis/aadpodidentity.k8s.io/v1/namespaces/default/azureidentities/one-hi-sso-external-dns-mi-tf
  uid: f8e22fd9-REDACTED-6cdead0d7e22
spec:
  clientID: f58e7c55-REDACTED-a6e358e53912
  resourceID: /subscriptions/8fb55161-REDACTED-3400b5271a8c/resourceGroups/one-hi-sso-kuberg-tf/providers/Microsoft.ManagedIdentity/userAssignedIdentities/one-hi-sso-external-dns-mi-tf
  type: 0
PS C:\Workspace\tf\HelmOne> kubectl edit azureidentitybinding external-dns-mi-binding
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"aadpodidentity.k8s.io/v1","kind":"AzureIdentityBinding","metadata":{"annotations":{},"name":"external-dns-mi-binding","namespace":"default"},"spec":{"AzureIdentity":"one-hi-sso-external-dns-mi-tf","Selector":"external-dns"}}
  creationTimestamp: "2021-04-22T20:44:42Z"
  generation: 1
  name: external-dns-mi-binding
  namespace: default
  resourceVersion: "221101"
  selfLink: /apis/aadpodidentity.k8s.io/v1/namespaces/default/azureidentitybindings/external-dns-mi-binding
  uid: f39e7418-e896-4b8e-b596-035cf4b66252
spec:
  AzureIdentity: one-hi-sso-external-dns-mi-tf
  Selector: external-dns
resource "kubectl_manifest" "one-hi-sso-external-dns-mi-tf" {
yaml_body = <<YAML
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentity
metadata:
name: one-hi-sso-external-dns-mi-tf
namespace: default
spec:
type: 0
resourceID: /subscriptions/8fb55161-REDACTED-3400b5271a8c/resourceGroups/one-hi-sso-kuberg-tf/providers/Microsoft.ManagedIdentity/userAssignedIdentities/one-hi-sso-external-dns-mi-tf
clientID: f58e7c55-REDACTED-a6e358e53912
YAML
}
resource "kubectl_manifest" "external-dns-mi-binding" {
yaml_body = <<YAML
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
name: external-dns-mi-binding
spec:
AzureIdentity: one-hi-sso-external-dns-mi-tf
Selector: external-dns
YAML
}
The managed identity I'm using was not added to the virtual machine scale set (VMSS). Once I added it, the binding worked and the azureAssignedIdentity was created.
Also - I converted the AzureIdentity and Selector keys in my AzureIdentityBinding YAML from upper-case to lower-case first letters.
Correct:
azureIdentity:
selector:
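Putting both fixes together, the corrected binding looks roughly like this (same names as in the manifests above):
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: external-dns-mi-binding
  namespace: default
spec:
  azureIdentity: one-hi-sso-external-dns-mi-tf   # the AzureIdentity created above
  selector: external-dns                         # must match the pod label aadpodidbinding: external-dns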

Kubernetes CronJob Failure to start a job "Timeout" and "Job already exists"

I am trying to run a cronjob in Kubernetes, but I keep getting these two errors:
type: 'Warning' reason: 'FailedCreate' Error creating job: jobs.batch "dev-cron-1516702680" already exists
and
type: 'Warning' reason: 'FailedCreate' Error creating job: Timeout: request did not complete within allowed duration
Below is my cronjob YAML:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: 2018-01-23T09:45:10Z
  name: dev-cron
  namespace: dev
  resourceVersion: "16768201"
  selfLink: /apis/batch/v1beta1/namespaces/dev/cronjobs/dev-cron
  uid: 1a32eb94-0022-11e8-9256-065eb556d6a2
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - for country in th;
            - do
            - 'curl -X POST -d "{'footprint':'xxxx-xxxx'}"-H "Content-Type: application/json" https://dev.xxx.com/xxx/xxx'
            - done
            image: appropriate/curl:latest
            imagePullPolicy: Always
            name: cron
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: '* * * * *'
  startingDeadlineSeconds: 10
  successfulJobsHistoryLimit: 3
  suspend: false
status: {}
I am not sure why this keeps happening. I am running Kubernetes version 1.9.1 in an AWS cluster. Any idea why?
It turned out this was happening because of the automatic sidecar injector from the Istio initializer. Once I disabled Istio initializer injection for cronjobs, it worked fine.
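The answer does not show how injection was disabled; a common way is to annotate the job's pod template, sketched below on the assumption that the injector in use honors the standard sidecar.istio.io/inject annotation:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: dev-cron
  namespace: dev
spec:
  schedule: '* * * * *'
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # assumed annotation; opts these pods out of sidecar injection
        spec:
          containers:
          - name: cron
            image: appropriate/curl:latest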
