Getting hudson.remoting.ChannelClosedException error in Jenkins

Getting hudson.remoting.ChannelClosedException error in Jenkins - node.js

I have an error while running a pipeline in Jenkins using a Kubernetes Cloud server.
Everything works fine until the moment of the npm install where i get Cannot contact nodejs-rn5f3: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#3b1e0041:nodejs-rn5f3": Remote call on nodejs-rn5f3 failed. The channel is closing down or has closed down
How can I fix this error ?
Here are my logs :
[Pipeline] Start of Pipeline
[Pipeline] podTemplate
[Pipeline] {
[Pipeline] node
Still waiting to schedule task
‘nodejs-rn5f3’ is offline
Agent nodejs-rn5f3 is provisioned from template nodejs
---
apiVersion: "v1"
kind: "Pod"
metadata:
labels:
jenkins: "slave"
jenkins/label-digest: "XXXXXXXXXXXXXXXXXXXXXXXXXX"
jenkins/label: "nodejs"
name: "nodejs-rn5f3"
spec:
containers:
- args:
- "cat"
command:
- "/bin/sh"
- "-c"
image: "node:15.5.1-alpine3.10"
imagePullPolicy: "IfNotPresent"
name: "node"
resources:
limits: {}
requests: {}
tty: true
volumeMounts:
- mountPath: "/home/jenkins/agent"
name: "workspace-volume"
readOnly: false
workingDir: "/home/jenkins/agent"
- env:
- name: "JENKINS_SECRET"
value: "********"
- name: "JENKINS_AGENT_NAME"
value: "nodejs-rn5f3"
- name: "JENKINS_WEB_SOCKET"
value: "true"
- name: "JENKINS_NAME"
value: "nodejs-rn5f3"
- name: "JENKINS_AGENT_WORKDIR"
value: "/home/jenkins/agent"
- name: "JENKINS_URL"
value: "http://XX.XX.XX.XX/"
image: "jenkins/inbound-agent:4.3-4"
name: "jnlp"
resources:
requests:
cpu: "100m"
memory: "256Mi"
volumeMounts:
- mountPath: "/home/jenkins/agent"
name: "workspace-volume"
readOnly: false
hostNetwork: false
nodeSelector:
kubernetes.io/os: "linux"
restartPolicy: "Never"
volumes:
- emptyDir:
medium: ""
name: "workspace-volume"
Running on nodejs-rn5f3 in /home/jenkins/agent/workspace/something
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Test)
[Pipeline] checkout
Selected Git installation does not exist. Using Default
[... cloning repository]
[Pipeline] container
[Pipeline] {
[Pipeline] sh
+ ls -la
total 1240
drwxr-xr-x 5 node node 4096 Feb 26 07:33 .
drwxr-xr-x 4 node node 4096 Feb 26 07:33 ..
-rw-r--r-- 1 node node 1689 Feb 26 07:33 package.json
and some other files and folders
[Pipeline] sh
+ cat package.json
{
[...]
"dependencies": {
[blabla....]
},
"devDependencies": {
[blabla...]
}
}
[Pipeline] sh
+ npm install
Cannot contact nodejs-rn5f3: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#3b1e0041:nodejs-rn5f3": Remote call on nodejs-rn5f3 failed. The channel is closing down or has closed down
At this stage, here are the logs of the container jnlp in my pod nodejs-rnf5f3 :
INFO: Connected
Feb 26, 2021 8:05:53 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Read side closed
Feb 26, 2021 8:05:53 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Feb 26, 2021 8:05:53 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
INFO: Restarting agent via jenkins.slaves.restarter.UnixSlaveRestarter#1a39588e
Feb 26, 2021 8:05:55 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: nodejs-rnf5f3
Feb 26, 2021 8:05:55 AM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Feb 26, 2021 8:05:55 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.3
Feb 26, 2021 8:05:55 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
Feb 26, 2021 8:05:55 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
Feb 26, 2021 8:05:55 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: WebSocket connection open
Feb 26, 2021 8:05:58 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
.... same as above
I don't know where this error come from. Is this related to the usage of resources ?
Here are the usage of my containers :
POD NAME CPU(cores) MEMORY(bytes)
jenkins-1-jenkins-0 jenkins-master 61m 674Mi
nodejs-rnf5f3 jnlp 468m 104Mi
nodejs-rnf5f3 node 1243m 1284Mi
My cluster is a e2-medium in GKE with 2 nodes.

If I had to bet (but its just a wild guess) I had say that the pod was killed due to running out of memory (OOM Killed).
The ChannelClosedException is a symptom, not the problem.
Its kind of hard to debug because the agent pod is being deleted, you can try kubectl get events in the relevant namespace, but events only last for 1 hour by default.

Related

Files in AIRFLOW_HOME (which is an Azure File Share MOUNT) are created as root

I have set up the airflow in Azure Cloud (Azure Container Apps) and attached an Azure File Share as an external mount/volume
1. I ran **airflow init service**, it had created the airflow.cfg and `'webserver_config.py'` file in the **AIRFLOW_HOME (/opt/airflow)**, which is actually an azure mounted file system
2. I ran **airflow webserver service**, it had created the `airflow-webserver.pid` file in the **AIRFLOW_HOME (/opt/airflow)**, which is actually an azure mounted file system
Now the problem is all the files created above are created with root user&groups, not as airflow user(50000),
I have also set the env variable AIRFLOW_UID to 50000 during the creation of the container app. due to this my webservers are not starting, throwing the below error
PermissionError: [Errno 1] Operation not permitted: '/opt/airflow/airflow-webserver.pid'
Note: Azure Containers Apps does not allow use root/sudo commands, otherwise I could solve this problem with simple chown commands
Another problem is the airflow configurations passed through environment variables are never picked up by Docker, eg
- name: AIRFLOW__API__AUTH_BACKENDS
value: 'airflow.api.auth.backend.basic_auth'
Attached screenshot for reference
Your help is much appreciated!
YAML file that I use to create my container app:
id: /subscriptions/1234/resourceGroups/<my-res-group>/providers/Microsoft.App/containerApps/<app-name>
identity:
type: None
location: eastus2
name: webservice
properties:
configuration:
activeRevisionsMode: Single
registries: []
managedEnvironmentId: /subscriptions/1234/resourceGroups/<my-res-group>/providers/Microsoft.App/managedEnvironments/container-app-env
template:
containers:
- command:
- /bin/bash
- -c
- exec /entrypoint airflow webserver
env:
- name: AIRFLOW__API__AUTH_BACKENDS
value: 'airflow.api.auth.backend.basic_auth'
- name: AIRFLOW__CELERY__BROKER_URL
value: redis://:#myredis.redis.cache.windows.net:6379/0
- name: AIRFLOW__CELERY__RESULT_BACKEND
value: db+postgresql://user:pass#postres-db-servconn.postgres.database.azure.com/airflow?sslmode=require
- name: AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION
value: 'true'
- name: AIRFLOW__CORE__EXECUTOR
value: CeleryExecutor
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
value: postgresql+psycopg2://user:pass#postres-db-servconn.postgres.database.azure.com/airflow?sslmode=require
- name: AIRFLOW__DATABASE__SQL_ALCHEMY_CONN
value: postgresql+psycopg2://user:pass#postres-db-servconn.postgres.database.azure.com/airflow?sslmode=require
- name: AIRFLOW__CORE__LOAD_EXAMPLES
value: 'false'
- name: AIRFLOW_UID
value: 50000
image: docker.io/apache/airflow:latest
name: wsr
volumeMounts:
- volumeName: randaf-azure-files-volume
mountPath: /opt/airflow
probes: []
resources:
cpu: 0.25
memory: 0.5Gi
scale:
maxReplicas: 3
minReplicas: 1
volumes:
- name: randaf-azure-files-volume
storageName: randafstorage
storageType: AzureFile
resourceGroup: RAND
tags:
tagname: ws-only
type: Microsoft.App/containerApps

Cannot mount volume to pod in Kubernetes using Azure file provisioner

I have the problem that I cannot mount volumes to pods in Kubernetes using the Azure File CSI in Azure cloud.
The error message I am receiving in the pod is
Warning FailedMount 38s kubelet Unable to attach or mount volumes: unmounted volumes=[sensu-backend-etcd], unattached volumes=[default-token-42kfh sensu-backend-etcd sensu-asset-server-ca-cert]: timed out waiting for the condition
My storageclass looks like the following:
items:
- allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"azure-csi-standard-lrs"},"mountOptions":["dir_mode=0640","file_mode=0640","uid=0","gid=0","mfsymlinks","cache=strict","nosharesock"],"parameters":{"location":"eastus","resourceGroup":"kubernetes-resource-group","shareName":"kubernetes","skuName":"Standard_LRS","storageAccount":"kubernetesrf"},"provisioner":"kubernetes.io/azure-file","reclaimPolicy":"Delete","volumeBindingMode":"Immediate"}
storageclass.kubernetes.io/is-default-class: "true"
creationTimestamp: "2020-12-21T19:16:19Z"
managedFields:
- apiVersion: storage.k8s.io/v1
fieldsType: FieldsV1
fieldsV1:
f:allowVolumeExpansion: {}
f:metadata:
f:annotations:
.: {}
f:kubectl.kubernetes.io/last-applied-configuration: {}
f:storageclass.kubernetes.io/is-default-class: {}
f:mountOptions: {}
f:parameters:
.: {}
f:location: {}
f:resourceGroup: {}
f:shareName: {}
f:skuName: {}
f:storageAccount: {}
f:provisioner: {}
f:reclaimPolicy: {}
f:volumeBindingMode: {}
manager: kubectl-client-side-apply
operation: Update
time: "2020-12-21T19:16:19Z"
name: azure-csi-standard-lrs
resourceVersion: "15914"
selfLink: /apis/storage.k8s.io/v1/storageclasses/azure-csi-standard-lrs
uid: 3de65d08-14e7-4d0b-a6fe-39ab9a714191
mountOptions:
- dir_mode=0640
- file_mode=0640
- uid=0
- gid=0
- mfsymlinks
- cache=strict
- nosharesock
parameters:
location: eastus
resourceGroup: kubernetes-resource-group
shareName: kubernetes
skuName: Standard_LRS
storageAccount: kubernetesrf
provisioner: kubernetes.io/azure-file
reclaimPolicy: Delete
volumeBindingMode: Immediate
kind: List
metadata:
resourceVersion: ""
selfLink: ""
My PV and PVC are bound:
sensu-backend-etcd 10Gi RWX Retain Bound sensu-system/sensu-backend-etcd azure-csi-standard-lrs 4m31s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
sensu-backend-etcd Bound sensu-backend-etcd 10Gi RWX azure-csi-standard-lrs 4m47s
In the kubelet log I get the following:
Dez 21 19:26:37 kubernetes-3 kubelet[34828]: E1221 19:26:37.766476 34828 pod_workers.go:191] Error syncing pod bab5a69a-f8af-43f1-a3ae-642de8daa05d ("sensu-backend-0_sensu-system(bab5a69a-f8af-43f1-a3ae-642de8daa05d)"), skipping: unmounted volumes=[sensu-backend-etcd], unattached volumes=[sensu-backend-etcd sensu-asset-server-ca-cert default-token-42kfh]: timed out waiting for the condition
Dez 21 19:26:58 kubernetes-3 kubelet[34828]: I1221 19:26:58.002474 34828 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "sensu-backend-etcd" (UniqueName: "kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd") pod "sensu-backend-0" (UID: "bab5a69a-f8af-43f1-a3ae-642de8daa05d")
Dez 21 19:26:58 kubernetes-3 kubelet[34828]: E1221 19:26:58.006699 34828 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd podName: nodeName:}" failed. No retries permitted until 2020-12-21 19:29:00.006639988 +0000 UTC m=+3608.682310977 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"sensu-backend-etcd\" (UniqueName: \"kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd\") pod \"sensu-backend-0\" (UID: \"bab5a69a-f8af-43f1-a3ae-642de8daa05d\") "
Dez 21 19:28:51 kubernetes-3 kubelet[34828]: E1221 19:28:51.768309 34828 kubelet.go:1594] Unable to attach or mount volumes for pod "sensu-backend-0_sensu-system(bab5a69a-f8af-43f1-a3ae-642de8daa05d)": unmounted volumes=[sensu-backend-etcd], unattached volumes=[sensu-backend-etcd sensu-asset-server-ca-cert default-token-42kfh]: timed out waiting for the condition; skipping pod
Dez 21 19:28:51 kubernetes-3 kubelet[34828]: E1221 19:28:51.768335 34828 pod_workers.go:191] Error syncing pod bab5a69a-f8af-43f1-a3ae-642de8daa05d ("sensu-backend-0_sensu-system(bab5a69a-f8af-43f1-a3ae-642de8daa05d)"), skipping: unmounted volumes=[sensu-backend-etcd], unattached volumes=[sensu-backend-etcd sensu-asset-server-ca-cert default-token-42kfh]: timed out waiting for the condition
Dez 21 19:29:00 kubernetes-3 kubelet[34828]: I1221 19:29:00.103881 34828 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "sensu-backend-etcd" (UniqueName: "kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd") pod "sensu-backend-0" (UID: "bab5a69a-f8af-43f1-a3ae-642de8daa05d")
Dez 21 19:29:00 kubernetes-3 kubelet[34828]: E1221 19:29:00.108069 34828 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd podName: nodeName:}" failed. No retries permitted until 2020-12-21 19:31:02.108044076 +0000 UTC m=+3730.783715065 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"sensu-backend-etcd\" (UniqueName: \"kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd\") pod \"sensu-backend-0\" (UID: \"bab5a69a-f8af-43f1-a3ae-642de8daa05d\") "
Dez 21 19:31:02 kubernetes-3 kubelet[34828]: I1221 19:31:02.169246 34828 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "sensu-backend-etcd" (UniqueName: "kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd") pod "sensu-backend-0" (UID: "bab5a69a-f8af-43f1-a3ae-642de8daa05d")
Dez 21 19:31:02 kubernetes-3 kubelet[34828]: E1221 19:31:02.172474 34828 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd podName: nodeName:}" failed. No retries permitted until 2020-12-21 19:33:04.172432877 +0000 UTC m=+3852.848103766 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"sensu-backend-etcd\" (UniqueName: \"kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd\") pod \"sensu-backend-0\" (UID: \"bab5a69a-f8af-43f1-a3ae-642de8daa05d\") "
Dez 21 19:31:09 kubernetes-3 kubelet[34828]: E1221 19:31:09.766084 34828 kubelet.go:1594] Unable to attach or mount volumes for pod "sensu-backend-0_sensu-system(bab5a69a-f8af-43f1-a3ae-642de8daa05d)": unmounted volumes=[sensu-backend-etcd], unattached volumes=[default-token-42kfh sensu-backend-etcd sensu-asset-server-ca-cert]: timed out waiting for the condition; skipping pod
In the kube-controller-manager pod I get:
E1221 20:21:34.069309 1 csi_attacher.go:500] kubernetes.io/csi: attachdetacher.WaitForDetach timeout after 2m0s [volume=sensu-backend-etcd; attachment.ID=csi-9a83de4bef35f5d01e10e3a7d598204c459cac705371256e818e3a35b4b29e4e]
E1221 20:21:34.069453 1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd podName: nodeName:kubernetes-3}" failed. No retries permitted until 2020-12-21 20:21:34.569430175 +0000 UTC m=+6862.322990347 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"sensu-backend-etcd\" (UniqueName: \"kubernetes.io/csi/file.csi.azure.com^sensu-backend-etcd\") from node \"kubernetes-3\" : attachdetachment timeout for volume sensu-backend-etcd"
I1221 20:21:34.069757 1 event.go:291] "Event occurred" object="sensu-system/sensu-backend-0" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"sensu-backend-etcd\" : attachdetachment timeout for volume sensu-backend-etcd"
Anyone who knows this error and how to mitigate it?
Thanks in advance.
Best regards,
rforberger

I fixed it.
I switched to the disk.csi.azure.com provisioner and I had to use a volume name as a resource link to Azure like
volumeHandle: /subscriptions/XXXXXXXXXXXXXXXXXXXXXX/resourcegroups/kubernetes-resource-group/providers/Microsoft.Compute/disks/sensu-backend-etcd
in the PV.
Also, I had some mount options in the PV, which did not work with the Azure disk provisioner.

How does the master bootstrap process work and how can I debug it?

I am working to stand up 3 instances of the yugabyte master and tserver in separate k8s clusters connected over LoadBalancer services on bare metal. However, on all three master instances it looks like the bootstrap process is failing:
I0531 19:50:28.081645 1 master_main.cc:94] NumCPUs determined to be: 2
I0531 19:50:28.082594 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.082682 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.082937 1 mem_tracker.cc:249] MemTracker: hard memory limit is 1.699219 GB
I0531 19:50:28.082963 1 mem_tracker.cc:251] MemTracker: soft memory limit is 1.444336 GB
I0531 19:50:28.083189 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.090148 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.090863 1 rpc_server.cc:86] yb::server::RpcServer created at 0x1a7e210
I0531 19:50:28.090924 1 master.cc:146] yb::master::Master created at 0x7ffe2d4bd140
I0531 19:50:28.090958 1 master.cc:147] yb::master::TSManager created at 0x1a90850
I0531 19:50:28.090975 1 master.cc:148] yb::master::CatalogManager created at 0x1dea000
I0531 19:50:28.091152 1 master_main.cc:115] Initializing master server...
I0531 19:50:28.093097 1 server_base.cc:462] Could not load existing FS layout: Not found (yb/util/env_posix.cc:1482): /mnt/disk0/yb-data/master/instance: No such file or directory (system error 2)
I0531 19:50:28.093150 1 server_base.cc:463] Creating new FS layout
I0531 19:50:28.193439 1 fs_manager.cc:463] Generated new instance metadata in path /mnt/disk0/yb-data/master/instance:
uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa"
format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.238484 1 fs_manager.cc:463] Generated new instance metadata in path /mnt/disk1/yb-data/master/instance:
uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa"
format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.377483 1 fs_manager.cc:251] Opened local filesystem: /mnt/disk0,/mnt/disk1
uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa"
format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.378015 1 server_base.cc:245] Auto setting FLAGS_num_reactor_threads to 2
I0531 19:50:28.380707 1 thread_pool.cc:166] Starting thread pool { name: Master queue_limit: 10000 max_workers: 1024 }
I0531 19:50:28.382266 1 master_main.cc:118] Starting Master server...
I0531 19:50:28.382313 24 async_initializer.cc:74] Starting to init ybclient
I0531 19:50:28.382365 1 master_main.cc:119] ulimit cur(max)...
ulimit: core file size unlimited(unlimited) blks
ulimit: data seg size unlimited(unlimited) kb
ulimit: open files 1048576(1048576)
ulimit: file size unlimited(unlimited) blks
ulimit: pending signals 22470(22470)
ulimit: file locks unlimited(unlimited)
ulimit: max locked memory 64(64) kb
ulimit: max memory size unlimited(unlimited) kb
ulimit: stack size 8192(unlimited) kb
ulimit: cpu time unlimited(unlimited) secs
ulimit: max user processes unlimited(unlimited)
W0531 19:50:28.383322 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:50:28.383525 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
I0531 19:50:28.383685 1 service_pool.cc:148] yb.master.MasterBackupService: yb::rpc::ServicePoolImpl created at 0x1a82b40
I0531 19:50:28.384888 1 service_pool.cc:148] yb.master.MasterService: yb::rpc::ServicePoolImpl created at 0x1a83680
I0531 19:50:28.385342 1 service_pool.cc:148] yb.tserver.TabletServerService: yb::rpc::ServicePoolImpl created at 0x1a838c0
I0531 19:50:28.388526 1 thread_pool.cc:166] Starting thread pool { name: Master-high-pri queue_limit: 10000 max_workers: 1024 }
I0531 19:50:28.388588 1 service_pool.cc:148] yb.consensus.ConsensusService: yb::rpc::ServicePoolImpl created at 0x201eb40
I0531 19:50:28.393231 1 service_pool.cc:148] yb.tserver.RemoteBootstrapService: yb::rpc::ServicePoolImpl created at 0x201ed80
I0531 19:50:28.393501 1 webserver.cc:148] Starting webserver on 0.0.0.0:7000
I0531 19:50:28.393544 1 webserver.cc:153] Document root: /home/yugabyte/www
I0531 19:50:28.394471 1 webserver.cc:240] Webserver started. Bound to: http://0.0.0.0:7000/
I0531 19:50:28.394668 1 service_pool.cc:148] yb.server.GenericService: yb::rpc::ServicePoolImpl created at 0x201efc0
I0531 19:50:28.395015 1 rpc_server.cc:169] RPC server started. Bound to: 0.0.0.0:7100
I0531 19:50:28.420223 23 tcp_stream.cc:308] { local: 10.233.80.35:55710 remote: 172.16.0.34:7100 }: Recv failed: Network error (yb/util/net/socket.cc:537): recvmsg error: Connection refused (system error 111)
E0531 19:51:28.523921 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 293) passed its deadline 2074493.105s (passed: 60.140s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:51:29.524827 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:51:29.524914 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
E0531 19:52:29.524785 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/outbound_call.cc:512): Could not locate the leader master: GetMasterRegistration RPC (request call id 2359) to 172.29.1.1:7100 timed out after 0.033s
W0531 19:52:30.525079 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:52:30.525205 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:53:28.114395 36 master-path-handlers.cc:150] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
W0531 19:53:29.133951 36 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:53:30.625366 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074615.247s (passed: 60.099s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:53:31.625660 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:53:31.625742 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:53:34.024369 37 master-path-handlers.cc:150] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:54:31.870801 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 300) passed its deadline 2074676.348s (passed: 60.244s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:54:32.871065 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:54:32.871222 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:55:28.190217 41 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
W0531 19:55:31.745038 42 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:55:33.164300 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074737.593s (passed: 60.292s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:55:34.164574 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:55:34.164667 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
E0531 19:56:34.315380 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074798.886s (passed: 60.150s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
As far as connectivity goes, I am able to verify the LoadBalancer endpoints are responding across the different network boundaries by curling the same service endpoint but on the UI port:
[root#yb-master-0 yugabyte]# curl -I http://yb-master-blue.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *
[root#yb-master-0 yugabyte]# curl -I http://yb-master-white.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *
[root#yb-master-0 yugabyte]# curl -I http://yb-master-black.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *
What strategies are there to debug the bootstrap process?
EDIT:
Here are the startup flags for the master:
/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-white.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=XXXX --placement_zone=XXXX
/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-blue.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=YYYY --placement_zone=YYYY
/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-black.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=ZZZZ --placement_zone=ZZZZ
For the sake of completeness here is one of the k8s manifest that I've modified from one of the helm examples. It is modified to utilize LoadBalancer for the master service:
---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
name: "yb-masters"
labels:
app: "yb-master"
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
type: LoadBalancer
loadBalancerIP: 172.16.0.34
ports:
- name: "rpc-port"
port: 7100
- name: "ui"
port: 7000
selector:
app: "yb-master"
---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
name: "yb-tservers"
labels:
app: "yb-tserver"
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
clusterIP: None
ports:
- name: "rpc-port"
port: 7100
- name: "ui"
port: 9000
- name: "yedis-port"
port: 6379
- name: "yql-port"
port: 9042
- name: "ysql-port"
port: 5433
selector:
app: "yb-tserver"
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: "yb-master"
namespace: "yugabytedb"
labels:
app: "yb-master"
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
serviceName: "yb-masters"
podManagementPolicy: Parallel
replicas: 1
volumeClaimTemplates:
- metadata:
name: datadir0
annotations:
volume.beta.kubernetes.io/storage-class: rook-ceph-block
labels:
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
accessModes:
- "ReadWriteOnce"
storageClassName: rook-ceph-block
resources:
requests:
storage: 10Gi
- metadata:
name: datadir1
annotations:
volume.beta.kubernetes.io/storage-class: rook-ceph-block
labels:
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
accessModes:
- "ReadWriteOnce"
storageClassName: rook-ceph-block
resources:
requests:
storage: 10Gi
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0
selector:
matchLabels:
app: "yb-master"
template:
metadata:
labels:
app: "yb-master"
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
affinity:
# Set the anti-affinity selector scope to YB masters.
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- "yb-master"
topologyKey: kubernetes.io/hostname
containers:
- name: "yb-master"
image: "yugabytedb/yugabyte:2.1.6.0-b17"
imagePullPolicy: IfNotPresent
lifecycle:
postStart:
exec:
command:
- "sh"
- "-c"
- >
mkdir -p /mnt/disk0/cores;
mkdir -p /mnt/disk0/yb-data/scripts;
if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
fi;
fi
env:
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
limits:
cpu: 2
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
command:
- "/home/yugabyte/bin/yb-master"
- "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
- "--server_broadcast_addresses=yb-master-blue.example.com:7100"
- "--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, "
- "--replication_factor=3"
- "--enable_ysql=true"
- "--rpc_bind_addresses=0.0.0.0:7100"
- "--metric_node_name=$(HOSTNAME)"
- "--memory_limit_hard_bytes=1824522240"
- "--stderrthreshold=0"
- "--num_cpus=2"
- "--undefok=num_cpus,enable_ysql"
- "--default_memory_limit_to_ram_ratio=0.85"
- "--leader_failure_max_missed_heartbeat_periods=10"
- "--placement_cloud=AAAA"
- "--placement_region=YYYY"
- "--placement_zone=YYYY"
ports:
- containerPort: 7100
name: "rpc-port"
- containerPort: 7000
name: "ui"
volumeMounts:
- name: datadir0
mountPath: /mnt/disk0
- name: datadir1
mountPath: /mnt/disk1
- name: yb-cleanup
image: busybox:1.31
env:
- name: USER
value: "yugabyte"
command:
- "/bin/sh"
- "-c"
- >
mkdir /var/spool/cron;
mkdir /var/spool/cron/crontabs;
echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
crond;
while true; do
sleep 86400;
done
volumeMounts:
- name: datadir0
mountPath: /home/yugabyte/
subPath: yb-data
volumes:
- name: datadir0
hostPath:
path: /mnt/disks/ssd0
- name: datadir1
hostPath:
path: /mnt/disks/ssd1
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: "yb-tserver"
namespace: "yugabytedb"
labels:
app: "yb-tserver"
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
serviceName: "yb-tservers"
podManagementPolicy: Parallel
replicas: 1
volumeClaimTemplates:
- metadata:
name: datadir0
annotations:
volume.beta.kubernetes.io/storage-class: rook-ceph-block
labels:
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
accessModes:
- "ReadWriteOnce"
storageClassName: rook-ceph-block
resources:
requests:
storage: 10Gi
- metadata:
name: datadir1
annotations:
volume.beta.kubernetes.io/storage-class: rook-ceph-block
labels:
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
accessModes:
- "ReadWriteOnce"
storageClassName: rook-ceph-block
resources:
requests:
storage: 10Gi
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0
selector:
matchLabels:
app: "yb-tserver"
template:
metadata:
labels:
app: "yb-tserver"
heritage: "Helm"
release: "blue"
chart: "yugabyte"
component: "yugabytedb"
spec:
affinity:
# Set the anti-affinity selector scope to YB masters.
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- "yb-tserver"
topologyKey: kubernetes.io/hostname
containers:
- name: "yb-tserver"
image: "yugabytedb/yugabyte:2.1.6.0-b17"
imagePullPolicy: IfNotPresent
lifecycle:
postStart:
exec:
command:
- "sh"
- "-c"
- >
mkdir -p /mnt/disk0/cores;
mkdir -p /mnt/disk0/yb-data/scripts;
if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
fi;
fi
env:
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 500m
memory: 2Gi
command:
- "/home/yugabyte/bin/yb-tserver"
- "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
- "--server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100"
- "--rpc_bind_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
- "--cql_proxy_bind_address=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
- "--enable_ysql=true"
- "--pgsql_proxy_bind_address=$(POD_IP):5433"
- "--tserver_master_addrs=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, "
- "--metric_node_name=$(HOSTNAME)"
- "--memory_limit_hard_bytes=3649044480"
- "--stderrthreshold=0"
- "--num_cpus=2"
- "--undefok=num_cpus,enable_ysql"
- "--leader_failure_max_missed_heartbeat_periods=10"
- "--placement_cloud=AAAA"
- "--placement_region=YYYY"
- "--placement_zone=YYYY"
- "--use_cassandra_authentication=false"
ports:
- containerPort: 7100
name: "rpc-port"
- containerPort: 9000
name: "ui"
- containerPort: 6379
name: "yedis-port"
- containerPort: 9042
name: "yql-port"
- containerPort: 5433
name: "ysql-port"
volumeMounts:
- name: datadir0
mountPath: /mnt/disk0
- name: datadir1
mountPath: /mnt/disk1
- name: yb-cleanup
image: busybox:1.31
env:
- name: USER
value: "yugabyte"
command:
- "/bin/sh"
- "-c"
- >
mkdir /var/spool/cron;
mkdir /var/spool/cron/crontabs;
echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
crond;
while true; do
sleep 86400;
done
volumeMounts:
- name: datadir0
mountPath: /home/yugabyte/
subPath: yb-data
volumes:
- name: datadir0
hostPath:
path: /mnt/disks/ssd0
- name: datadir1
hostPath:
path: /mnt/disks/ssd1

This was mostly resolved (looks like I've now run into an unrelated issue), by dropping the extraneous comma on the master addresses list:
--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100,
vs
--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100

K8s did not kill my airflow webserver pod

I have airflow running in k8s containers.
The webserver encountered a DNS error (could not translate the url for my db to an ip) and the webserver workers were killed.
What is troubling me is that the k8s did not attempt to kill the pod and start a new one its place.
Pod log output:
OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
[2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
[2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
[2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
[2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
[2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
[2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
[2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.
The k8s status is RUNNING but when I open an exec shell in the k8s UI i get the following output (gunicorn appears to realize it's dead):
root#webserver-373771664-3h4v9:/# ps -Al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 0 80 0 - 107153 - ? 00:06:42 /usr/local/bin/
4 Z 0 13 1 0 80 0 - 0 - ? 00:01:24 gunicorn: maste <defunct>
4 S 0 2206 0 0 80 0 - 4987 - ? 00:00:00 bash
0 R 0 2224 2206 0 80 0 - 7486 - ? 00:00:00 ps
The following is the YAML for my deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: webserver
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow-webserver
spec:
volumes:
- name: webserver-dags
emptyDir: {}
containers:
- name: airflow-webserver
image: my.custom.image :latest
imagePullPolicy: Always
resources:
requests:
cpu: 100m
limits:
cpu: 500m
ports:
- containerPort: 80
protocol: TCP
env:
- name: AIRFLOW_HOME
value: /var/lib/airflow
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
valueFrom:
secretKeyRef:
name: db1
key: sqlalchemy_conn
volumeMounts:
- mountPath: /var/lib/airflow/dags/
name: webserver-dags
command: ["airflow"]
args: ["webserver"]
- name: docker-s3-to-backup
image: my.custom.image:latest
imagePullPolicy: Always
resources:
requests:
cpu: 50m
limits:
cpu: 500m
env:
- name: ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws
key: access_key_id
- name: SECRET_KEY
valueFrom:
secretKeyRef:
name: aws
key: secret_access_key
- name: S3_PATH
value: s3://my-s3-bucket/dags/
- name: DATA_PATH
value: /dags/
- name: CRON_SCHEDULE
value: "*/5 * * * *"
volumeMounts:
- mountPath: /dags/
name: webserver-dags
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: webserver
namespace: airflow
spec:
scaleTargetRef:
apiVersion: apps/v1beta1
kind: Deployment
name: webserver
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 75
---
apiVersion: v1
kind: Service
metadata:
labels:
name: webserver
namespace: airflow
spec:
type: NodePort
ports:
- port: 80
selector:
app: airflow-webserver

you need to define the readiness and liveness probe Kubernetes to detect the POD status.
like documented on this page. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe
- containerPort: 8080
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 15
periodSeconds: 20

Well, when process dies in a container, this container will exit and kubelet will restart the container on the same node / within the same pod. What happened here is by no means a fault of kubernetes, but in fact a problem of your container. The main process that you launch in the container (be it just from CMD or via ENTRYPOINT) needs to die, for the above to happen, and the ones you launch did not (one went zombie mode, but was not reaped, which is an example of another issue all together - zombie reaping. Liveness probe will help in this case (as mentioned by #sfgroups) as it will terminate the pod if it fails, but this is treating symptoms rather then root cause (not that you shouldn't have probes defined in general as a good practice).

cassandra cluster joining delay

I am trying to install 3 cassandra using bosh release.I am getting error
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
On searching in net i found that we need to put some delay when cluster joins.Let me know how to introduced delay. Do we have any attribute for this ?
- name: cassandra_seed
templates:
- name: cassandra
release: cassandra
- name: collectd
release: metrics
- name: logstash-shipper
release: cassandra
- name: consul
release: consul
instances: 1
resource_pool: service-net-medium
persistent_disk: 10240
networks:
- name: ccc-service-net
default: [dns, gateway]
properties:
collectd:
plugin_templates: [cassandra]
cassandra:
broadcast_address: 0.cassandra-seed.ccc-service-net.<%= $deployment_name %>.microbosh
consul:
bootstrap_expect: 0
join_hosts: ["0.vault-consul.ccc-service-net.<%= $deployment_name %>.microbosh"]
service:
name: cassandra
process:
name: ps -ef |grep cassandra |grep -v grep || exit 2
server: false
default_recursor: 8.8.8.8
update:
serial: false
Error
root#9e3c9ac3-1832-48cf-a58c-3ef25ee17869:/var/vcap/sys/log/cassandra# vim cassandra.stderr.log
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:584)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:855)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:366)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:581)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:710)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Getting hudson.remoting.ChannelClosedException error in Jenkins - node.js

Related

Files in AIRFLOW_HOME (which is an Azure File Share MOUNT) are created as root

Cannot mount volume to pod in Kubernetes using Azure file provisioner

How does the master bootstrap process work and how can I debug it?

K8s did not kill my airflow webserver pod

cassandra cluster joining delay

Categories

Resources