How does the master bootstrap process work and how can I debug it? - yugabytedb

I am working to stand up 3 instances of the yugabyte master and tserver in separate k8s clusters connected over LoadBalancer services on bare metal. However, on all three master instances it looks like the bootstrap process is failing:
I0531 19:50:28.081645 1 master_main.cc:94] NumCPUs determined to be: 2
I0531 19:50:28.082594 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.082682 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.082937 1 mem_tracker.cc:249] MemTracker: hard memory limit is 1.699219 GB
I0531 19:50:28.082963 1 mem_tracker.cc:251] MemTracker: soft memory limit is 1.444336 GB
I0531 19:50:28.083189 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.090148 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.090863 1 rpc_server.cc:86] yb::server::RpcServer created at 0x1a7e210
I0531 19:50:28.090924 1 master.cc:146] yb::master::Master created at 0x7ffe2d4bd140
I0531 19:50:28.090958 1 master.cc:147] yb::master::TSManager created at 0x1a90850
I0531 19:50:28.090975 1 master.cc:148] yb::master::CatalogManager created at 0x1dea000
I0531 19:50:28.091152 1 master_main.cc:115] Initializing master server...
I0531 19:50:28.093097 1 server_base.cc:462] Could not load existing FS layout: Not found (yb/util/env_posix.cc:1482): /mnt/disk0/yb-data/master/instance: No such file or directory (system error 2)
I0531 19:50:28.093150 1 server_base.cc:463] Creating new FS layout
I0531 19:50:28.193439 1 fs_manager.cc:463] Generated new instance metadata in path /mnt/disk0/yb-data/master/instance:
uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa"
format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.238484 1 fs_manager.cc:463] Generated new instance metadata in path /mnt/disk1/yb-data/master/instance:
uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa"
format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.377483 1 fs_manager.cc:251] Opened local filesystem: /mnt/disk0,/mnt/disk1
uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa"
format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.378015 1 server_base.cc:245] Auto setting FLAGS_num_reactor_threads to 2
I0531 19:50:28.380707 1 thread_pool.cc:166] Starting thread pool { name: Master queue_limit: 10000 max_workers: 1024 }
I0531 19:50:28.382266 1 master_main.cc:118] Starting Master server...
I0531 19:50:28.382313 24 async_initializer.cc:74] Starting to init ybclient
I0531 19:50:28.382365 1 master_main.cc:119] ulimit cur(max)...
ulimit: core file size unlimited(unlimited) blks
ulimit: data seg size unlimited(unlimited) kb
ulimit: open files 1048576(1048576)
ulimit: file size unlimited(unlimited) blks
ulimit: pending signals 22470(22470)
ulimit: file locks unlimited(unlimited)
ulimit: max locked memory 64(64) kb
ulimit: max memory size unlimited(unlimited) kb
ulimit: stack size 8192(unlimited) kb
ulimit: cpu time unlimited(unlimited) secs
ulimit: max user processes unlimited(unlimited)
W0531 19:50:28.383322 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:50:28.383525 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
I0531 19:50:28.383685 1 service_pool.cc:148] yb.master.MasterBackupService: yb::rpc::ServicePoolImpl created at 0x1a82b40
I0531 19:50:28.384888 1 service_pool.cc:148] yb.master.MasterService: yb::rpc::ServicePoolImpl created at 0x1a83680
I0531 19:50:28.385342 1 service_pool.cc:148] yb.tserver.TabletServerService: yb::rpc::ServicePoolImpl created at 0x1a838c0
I0531 19:50:28.388526 1 thread_pool.cc:166] Starting thread pool { name: Master-high-pri queue_limit: 10000 max_workers: 1024 }
I0531 19:50:28.388588 1 service_pool.cc:148] yb.consensus.ConsensusService: yb::rpc::ServicePoolImpl created at 0x201eb40
I0531 19:50:28.393231 1 service_pool.cc:148] yb.tserver.RemoteBootstrapService: yb::rpc::ServicePoolImpl created at 0x201ed80
I0531 19:50:28.393501 1 webserver.cc:148] Starting webserver on 0.0.0.0:7000
I0531 19:50:28.393544 1 webserver.cc:153] Document root: /home/yugabyte/www
I0531 19:50:28.394471 1 webserver.cc:240] Webserver started. Bound to: http://0.0.0.0:7000/
I0531 19:50:28.394668 1 service_pool.cc:148] yb.server.GenericService: yb::rpc::ServicePoolImpl created at 0x201efc0
I0531 19:50:28.395015 1 rpc_server.cc:169] RPC server started. Bound to: 0.0.0.0:7100
I0531 19:50:28.420223 23 tcp_stream.cc:308] { local: 10.233.80.35:55710 remote: 172.16.0.34:7100 }: Recv failed: Network error (yb/util/net/socket.cc:537): recvmsg error: Connection refused (system error 111)
E0531 19:51:28.523921 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 293) passed its deadline 2074493.105s (passed: 60.140s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:51:29.524827 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:51:29.524914 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
E0531 19:52:29.524785 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/outbound_call.cc:512): Could not locate the leader master: GetMasterRegistration RPC (request call id 2359) to 172.29.1.1:7100 timed out after 0.033s
W0531 19:52:30.525079 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:52:30.525205 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:53:28.114395 36 master-path-handlers.cc:150] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
W0531 19:53:29.133951 36 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:53:30.625366 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074615.247s (passed: 60.099s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:53:31.625660 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:53:31.625742 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:53:34.024369 37 master-path-handlers.cc:150] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:54:31.870801 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 300) passed its deadline 2074676.348s (passed: 60.244s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:54:32.871065 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:54:32.871222 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:55:28.190217 41 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
W0531 19:55:31.745038 42 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:55:33.164300 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074737.593s (passed: 60.292s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:55:34.164574 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:55:34.164667 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
E0531 19:56:34.315380 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074798.886s (passed: 60.150s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
As far as connectivity goes, I am able to verify the LoadBalancer endpoints are responding across the different network boundaries by curling the same service endpoint but on the UI port:
[root@yb-master-0 yugabyte]# curl -I http://yb-master-blue.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *
[root@yb-master-0 yugabyte]# curl -I http://yb-master-white.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *
[root@yb-master-0 yugabyte]# curl -I http://yb-master-black.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *
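To rule out the RPC port specifically (the curl check above only proves port 7000 is reachable), a raw TCP check can be run from inside one of the master pods, and the service wiring can be checked with kubectl from outside. This is only a sketch: it assumes nc is available in the image and that the services live in the yugabytedb namespace used in the StatefulSets below.
# Check raw TCP reachability of the master RPC port across the cluster boundaries
for h in yb-master-black.example.com yb-master-blue.example.com yb-master-white.example.com; do
  nc -zv -w 3 "$h" 7100
done
# Confirm the LoadBalancer service actually exposes 7100 and has pod endpoints behind it
kubectl -n yugabytedb get svc yb-masters -o wide
kubectl -n yugabytedb get endpoints yb-masters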
What strategies are there to debug the bootstrap process?
EDIT:
Here are the startup flags for each of the three masters:
/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-white.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=XXXX --placement_zone=XXXX
/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-blue.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=YYYY --placement_zone=YYYY
/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-black.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=ZZZZ --placement_zone=ZZZZ
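Since the UI port is reachable, the flags the running process actually parsed can also be inspected through the master web server, which exposes a /varz page listing the active gflags (a sketch, assuming the default web UI pages are enabled in this build):
curl -s http://yb-master-blue.example.com:7000/varz | grep -i master_addresses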
For the sake of completeness, here is one of the k8s manifests that I've modified from one of the Helm examples. It is modified to use a LoadBalancer for the master service:
---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "yb-masters"
  labels:
    app: "yb-master"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  type: LoadBalancer
  loadBalancerIP: 172.16.0.34
  ports:
    - name: "rpc-port"
      port: 7100
    - name: "ui"
      port: 7000
  selector:
    app: "yb-master"
---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "yb-tservers"
  labels:
    app: "yb-tserver"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  clusterIP: None
  ports:
    - name: "rpc-port"
      port: 7100
    - name: "ui"
      port: 9000
    - name: "yedis-port"
      port: 6379
    - name: "yql-port"
      port: 9042
    - name: "ysql-port"
      port: 5433
  selector:
    app: "yb-tserver"
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "yb-master"
  namespace: "yugabytedb"
  labels:
    app: "yb-master"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  serviceName: "yb-masters"
  podManagementPolicy: Parallel
  replicas: 1
  volumeClaimTemplates:
    - metadata:
        name: datadir0
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datadir1
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: "yb-master"
  template:
    metadata:
      labels:
        app: "yb-master"
        heritage: "Helm"
        release: "blue"
        chart: "yugabyte"
        component: "yugabytedb"
    spec:
      affinity:
        # Set the anti-affinity selector scope to YB masters.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - "yb-master"
                topologyKey: kubernetes.io/hostname
      containers:
        - name: "yb-master"
          image: "yugabytedb/yugabyte:2.1.6.0-b17"
          imagePullPolicy: IfNotPresent
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - >
                    mkdir -p /mnt/disk0/cores;
                    mkdir -p /mnt/disk0/yb-data/scripts;
                    if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
                      if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
                        cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
                      fi;
                    fi
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 2
              memory: 2Gi
            requests:
              cpu: 500m
              memory: 1Gi
          command:
            - "/home/yugabyte/bin/yb-master"
            - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
            - "--server_broadcast_addresses=yb-master-blue.example.com:7100"
            - "--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, "
            - "--replication_factor=3"
            - "--enable_ysql=true"
            - "--rpc_bind_addresses=0.0.0.0:7100"
            - "--metric_node_name=$(HOSTNAME)"
            - "--memory_limit_hard_bytes=1824522240"
            - "--stderrthreshold=0"
            - "--num_cpus=2"
            - "--undefok=num_cpus,enable_ysql"
            - "--default_memory_limit_to_ram_ratio=0.85"
            - "--leader_failure_max_missed_heartbeat_periods=10"
            - "--placement_cloud=AAAA"
            - "--placement_region=YYYY"
            - "--placement_zone=YYYY"
          ports:
            - containerPort: 7100
              name: "rpc-port"
            - containerPort: 7000
              name: "ui"
          volumeMounts:
            - name: datadir0
              mountPath: /mnt/disk0
            - name: datadir1
              mountPath: /mnt/disk1
        - name: yb-cleanup
          image: busybox:1.31
          env:
            - name: USER
              value: "yugabyte"
          command:
            - "/bin/sh"
            - "-c"
            - >
              mkdir /var/spool/cron;
              mkdir /var/spool/cron/crontabs;
              echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
              crond;
              while true; do
                sleep 86400;
              done
          volumeMounts:
            - name: datadir0
              mountPath: /home/yugabyte/
              subPath: yb-data
      volumes:
        - name: datadir0
          hostPath:
            path: /mnt/disks/ssd0
        - name: datadir1
          hostPath:
            path: /mnt/disks/ssd1
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "yb-tserver"
  namespace: "yugabytedb"
  labels:
    app: "yb-tserver"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  serviceName: "yb-tservers"
  podManagementPolicy: Parallel
  replicas: 1
  volumeClaimTemplates:
    - metadata:
        name: datadir0
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datadir1
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: "yb-tserver"
  template:
    metadata:
      labels:
        app: "yb-tserver"
        heritage: "Helm"
        release: "blue"
        chart: "yugabyte"
        component: "yugabytedb"
    spec:
      affinity:
        # Set the anti-affinity selector scope to YB masters.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - "yb-tserver"
                topologyKey: kubernetes.io/hostname
      containers:
        - name: "yb-tserver"
          image: "yugabytedb/yugabyte:2.1.6.0-b17"
          imagePullPolicy: IfNotPresent
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - >
                    mkdir -p /mnt/disk0/cores;
                    mkdir -p /mnt/disk0/yb-data/scripts;
                    if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
                      if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
                        cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
                      fi;
                    fi
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 2
              memory: 4Gi
            requests:
              cpu: 500m
              memory: 2Gi
          command:
            - "/home/yugabyte/bin/yb-tserver"
            - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
            - "--server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100"
            - "--rpc_bind_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
            - "--cql_proxy_bind_address=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
            - "--enable_ysql=true"
            - "--pgsql_proxy_bind_address=$(POD_IP):5433"
            - "--tserver_master_addrs=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, "
            - "--metric_node_name=$(HOSTNAME)"
            - "--memory_limit_hard_bytes=3649044480"
            - "--stderrthreshold=0"
            - "--num_cpus=2"
            - "--undefok=num_cpus,enable_ysql"
            - "--leader_failure_max_missed_heartbeat_periods=10"
            - "--placement_cloud=AAAA"
            - "--placement_region=YYYY"
            - "--placement_zone=YYYY"
            - "--use_cassandra_authentication=false"
          ports:
            - containerPort: 7100
              name: "rpc-port"
            - containerPort: 9000
              name: "ui"
            - containerPort: 6379
              name: "yedis-port"
            - containerPort: 9042
              name: "yql-port"
            - containerPort: 5433
              name: "ysql-port"
          volumeMounts:
            - name: datadir0
              mountPath: /mnt/disk0
            - name: datadir1
              mountPath: /mnt/disk1
        - name: yb-cleanup
          image: busybox:1.31
          env:
            - name: USER
              value: "yugabyte"
          command:
            - "/bin/sh"
            - "-c"
            - >
              mkdir /var/spool/cron;
              mkdir /var/spool/cron/crontabs;
              echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
              crond;
              while true; do
                sleep 86400;
              done
          volumeMounts:
            - name: datadir0
              mountPath: /home/yugabyte/
              subPath: yb-data
      volumes:
        - name: datadir0
          hostPath:
            path: /mnt/disks/ssd0
        - name: datadir1
          hostPath:
            path: /mnt/disks/ssd1

This was mostly resolved (it looks like I've now run into an unrelated issue) by dropping the extraneous trailing comma from the master addresses list; the trailing comma is what produced the empty {:7100} entry in the "Updating master addrs" log lines above:
--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100,
vs
--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100
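For anyone else hitting this, a quick way to confirm the masters actually formed a Raft group after fixing the flag is to list them with yb-admin (a sketch; adjust the addresses to your environment):
# Exactly one master should report the LEADER role
/home/yugabyte/bin/yb-admin \
  -master_addresses yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100 \
  list_all_masters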

Related

kube-proxy changes reverting after a couple of minutes on my AKS cluster

I am experimenting and tweaking a bit on my sandbox AKS cluster with the intention of configuring it in a production-ready state. As part of that, I am following a book in which the writer redeploys the initial kube-proxy daemonset with some modifications (the only difference is that he is doing it on AWS EKS).
The problem is that the daemonset and pod revert to their initial state after 2-3 minutes. AKS is just doing a rollback, which I can see when I execute the rollout history command
> kubectl rollout history daemonset kube-proxy -n kube-system
daemonset.apps/kube-proxy
REVISION CHANGE-CAUSE
2 <none>
8 <none>
10 <none>
14 <none>
16 <none>
I tried to redeploy the daemonset with my minor changes (changed cpu from 100m to 120m and changed the -v flag from 3 to 2) declaratively by applying the following manifest
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
component: kube-proxy
tier: node
deployment: custom
name: kube-proxy
namespace: kube-system
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
component: kube-proxy
tier: node
template:
metadata:
creationTimestamp: null
labels:
component: kube-proxy
tier: node
deployedBy: Luka
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.azure.com/cluster
operator: Exists
- key: type
operator: NotIn
values:
- virtual-kubelet
- key: kubernetes.io/os
operator: In
values:
- linux
containers:
- command:
- kube-proxy
- --conntrack-max-per-core=0
- --metrics-bind-address=0.0.0.0:10249
- --kubeconfig=/var/lib/kubelet/kubeconfig
- --cluster-cidr=10.244.0.0/16
- --detect-local-mode=ClusterCIDR
- --pod-interface-name-prefix=
- --v=2
image: mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.23.12-hotfix.20220922.1
imagePullPolicy: IfNotPresent
name: kube-proxy
resources:
requests:
cpu: 120m
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet
name: kubeconfig
readOnly: true
- mountPath: /etc/kubernetes/certs
name: certificates
readOnly: true
- mountPath: /run/xtables.lock
name: iptableslock
- mountPath: /lib/modules
name: modules
dnsPolicy: ClusterFirst
hostNetwork: true
initContainers:
- command:
- /bin/sh
- -c
- |
SYSCTL=/proc/sys/net/netfilter/nf_conntrack_max
echo "Current net.netfilter.nf_conntrack_max: $(cat $SYSCTL)"
DESIRED=$(awk -F= '/net.netfilter.nf_conntrack_max/ {print $2}' /etc/sysctl.d/999-sysctl-aks.conf)
if [ -z "$DESIRED" ]; then
DESIRED=$((32768*$(nproc)))
if [ $DESIRED -lt 131072 ]; then
DESIRED=131072
fi
echo "AKS custom config for net.netfilter.nf_conntrack_max not set."
echo "Setting nf_conntrack_max to $DESIRED (32768 * $(nproc) cores, minimum 131072)."
echo $DESIRED > $SYSCTL
else
echo "AKS custom config for net.netfilter.nf_conntrack_max set to $DESIRED."
echo "Setting nf_conntrack_max to $DESIRED."
echo $DESIRED > $SYSCTL
fi
image: mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.23.12-hotfix.20220922.1
imagePullPolicy: IfNotPresent
name: kube-proxy-bootstrap
resources:
requests:
cpu: 100m
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/sysctl.d
name: sysctls
- mountPath: /lib/modules
name: modules
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- effect: NoExecute
operator: Exists
- effect: NoSchedule
operator: Exists
volumes:
- hostPath:
path: /var/lib/kubelet
type: ""
name: kubeconfig
- hostPath:
path: /etc/kubernetes/certs
type: ""
name: certificates
- hostPath:
path: /run/xtables.lock
type: FileOrCreate
name: iptableslock
- hostPath:
path: /etc/sysctl.d
type: Directory
name: sysctls
- hostPath:
path: /lib/modules
type: Directory
name: modules
updateStrategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 1
type: RollingUpdate
status:
currentNumberScheduled: 4
desiredNumberScheduled: 4
numberAvailable: 4
numberMisscheduled: 0
numberReady: 4
observedGeneration: 1
updatedNumberScheduled: 4
I also tried removing the initContainer. Even the solution of editing the daemonset, explained in this Stack Overflow post, didn't work.
Am I missing something? Why does the kube-proxy daemonset always roll back?
In Kubernetes, rolling updates are the default strategy for updating the running version of an application.
When you upgrade the pods from version 1 to 2, the Deployment creates a new ReplicaSet and increases its replica count while the previous ReplicaSet's count goes to 0.
After the rolling update, the previous ReplicaSet is not deleted.
If we execute another rolling update from version 2 to 3, we may notice that at the end of the upgrade there are two ReplicaSets with a count of 0.
I created the deployment file and deployed it; when I check the history of the daemonset I see the results below:
kubectl rollout history daemonset kube-proxy -n kube-system
We can roll back to a specific revision:
kubectl rollout undo daemonset kube-proxy --to-revision=4 -n kube-system
After the undo, the revision history of my daemonset looks like the below:
kubectl rollout history daemonset kube-proxy -n kube-system
The output of the above command has two columns, revision and change-cause, and change-cause is always set to none.
I set the change-cause to 'Kube' as shown below and got the following results
If I try to get the rollout history again:
kubernetes.io/change-cause: "Kube" #for particular revision
kubectl apply -f filename
kubectl rollout history daemonset kube-proxy -n kube-system
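For completeness, the change-cause can also be set without editing the manifest, and an individual revision can be inspected; this is a sketch using standard kubectl subcommands:
# Record a change-cause for the current revision without editing the manifest
kubectl annotate daemonset kube-proxy -n kube-system kubernetes.io/change-cause="Kube" --overwrite
# Show the pod template recorded for a specific revision
kubectl rollout history daemonset kube-proxy -n kube-system --revision=2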
Reference: to learn more about rolling updates, see the Kubernetes documentation on the topic.

AKS timeout container kubectl Rollout status check failed

I have a sporadic issue that I am struggling to understand: an Azure pipeline promote stage fails on kubectl rollout status Deployment/name --timeout 120s --namespace xyz.
I have tried increasing progressDeadlineSeconds, but I know it may not take effect; I have also tried setting the replicas to 2, but that still does not seem to apply. I am not fully understanding this error, and there is a rollout issue (a quick triage sketch follows the error output below).
YAML file:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: #{KubeComponentName}#
namespace: #{Namespace}#
spec:
selector:
matchLabels:
app: #{KubeComponentName}#
progressDeadlineSeconds: 600
replicas: #{ReplicaCount}#
template:
metadata:
labels:
app: #{KubeComponentName}#
annotations:
spec:
securityContext:
runAsUser: 999
serviceAccountName: #{KubeComponentName}#
containers:
- name: #{KubeComponentName}#
image: #{ImageRegistry}#/datahub/#{KubeComponentName}#:latest
#command: ["/bin/bash", "-c", "--"]
#args: [ "while true; do sleep 30; done;" ]
volumeMounts:
ports:
env:
- name: NodeName
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: PodName
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: PodNamespace
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: PodIp
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: PodServiceAccount
valueFrom:
fieldRef:
fieldPath: spec.serviceAccountName
- name: ComponentInfo__ComponentName
value: #{KubeComponentName}#
- name: ComponentInfo__ComponentHost
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: ComponentInfo__ServiceUser
valueFrom:
fieldRef:
fieldPath: spec.serviceAccountName
- name: MongoDbUserName
valueFrom:
secretKeyRef:
name: mongodb-xyz-username
key: secret-value
- name: MongoDbPassword
valueFrom:
secretKeyRef:
name: mongodb-xyz-password
key: secret-value
- name: MongoDbKubernetesHosts
value: #{MongoDbKubernetesHosts}#
- name: MongoDbScriptBasePath
value: #{MongoDbScriptBasePath}#
volumes:
My errors keep happening: I either get a timeout waiting for the rollout or an exceeded progress deadline.
----
/opt/vsts-agent/_work/_tool/kubectl/1.18.6/x64/kubectl rollout status Deployment/datahub-recon --timeout 120s --namespace xyz
Waiting for deployment "datahub-recon" rollout to finish: 0 of 1 updated replicas are available...
error: timed out waiting for the condition
##[error]Error: error: timed out waiting for the condition
/opt/vsts-agent/_work/_tool/kubectl/1.18.6/x64/kubectl describe Deployment datahub-recon --namespace xyz
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
OldReplicaSets: <none>
NewReplicaSet: datahub-recon-567c7d6958 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 2m1s deployment-controller Scaled up replica set datahub-recon-567c7d6958 to 1
For more information, go to https://dev.azure.com/pbc/Premera/_environments/23
##[error]Rollout status check failed.
----
/opt/vsts-agent/_work/_tool/kubectl/1.18.6/x64/kubectl rollout status Deployment/datahub-recon --timeout 120s --namespace xyz
error: deployment "datahub-recon" exceeded its progress deadline
##[error]Error: error: deployment "datahub-recon" exceeded its progress deadline
/opt/vsts-agent/_work/_tool/kubectl/1.18.6/x64/kubectl describe Deployment datahub-recon --namespace xyz
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
OldReplicaSets: datahub-recon-6bc6f85fc6 (2/2 replicas created)
NewReplicaSet: datahub-recon-bd7d9754 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 13m deployment-controller Scaled up replica set datahub-recon-bd7d9754 to 1
For more information, go to https://dev.azure.com/pbc/Premera/_environments/23
##[error]Rollout status check failed.
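A sketch of how the stuck rollout can be triaged from the command line; the deployment name and namespace are taken from the output above, and the app label is an assumption based on the #{KubeComponentName}# templating in the manifest:
# See whether the new pod is Pending, crash-looping, or failing its probes
kubectl get pods -n xyz -l app=datahub-recon -o wide
kubectl describe pods -n xyz -l app=datahub-recon
# Check the container logs of the newest pod
kubectl logs -n xyz deployment/datahub-recon --tail=100
# Re-run the status check with a longer timeout once the cause is fixed
kubectl rollout status deployment/datahub-recon -n xyz --timeout=600s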

Elasticsearch PVC

I am trying to build an ES cluster using the Helm chart with the following values in my es YAML:
resources:
  requests:
    cpu: ".1"
    memory: "2Gi"
  limits:
    cpu: "1"
    memory: "3.5Gi"
volumeClaimTemplate:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 500Gi
esConfig:
  elasticsearch.yml: |
    path.data: /mnt/azure
The problem is that the pods are throwing the following error at the start
"Caused by: java.nio.file.AccessDeniedException: /mnt/azure"
I set the Azure disk as the default storage class in order not to have to specify one. I don't know if this is best practice, or whether I should create the storage first and then mount it into the pods.
You need to keep an init container that changes the ownership of the mounted directory. You can update the path as needed; for you, the changes will be for /mnt/azure:
initContainers:
  - command:
      - sh
      - -c
      - |
        chown -R 1000:1000 /usr/share/elasticsearch/data
        sysctl -w vm.max_map_count=262144
        chmod 777 /usr/share/elasticsearch/data
        chmod 777 /usr/share/elasticsearch/data/node
        chmod g+rwx /usr/share/elasticsearch/data
        chgrp 1000 /usr/share/elasticsearch/data
Example StatefulSet file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app : elasticsearch
component: elasticsearch
release: elasticsearch
name: elasticsearch
spec:
podManagementPolicy: Parallel
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app : elasticsearch
component: elasticsearch
release: elasticsearch
serviceName: elasticsearch
template:
metadata:
creationTimestamp: null
labels:
app : elasticsearch
component: elasticsearch
release: elasticsearch
spec:
containers:
- env:
- name: cluster.name
value: <SET THIS>
- name: discovery.type
value: single-node
- name: ES_JAVA_OPTS
value: -Xms512m -Xmx512m
- name: bootstrap.memory_lock
value: "false"
image: elasticsearch:6.5.0
imagePullPolicy: IfNotPresent
name: elasticsearch
ports:
- containerPort: 9200
name: http
protocol: TCP
- containerPort: 9300
name: transport
protocol: TCP
resources:
limits:
cpu: 250m
memory: 1Gi
requests:
cpu: 150m
memory: 512Mi
securityContext:
privileged: true
runAsUser: 1000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /usr/share/elasticsearch/data
name: elasticsearch-data
dnsPolicy: ClusterFirst
initContainers:
- command:
- sh
- -c
- chown -R 1000:1000 /usr/share/elasticsearch/data
- sysctl -w vm.max_map_count=262144
- chmod 777 /usr/share/elasticsearch/data
- chmod 777 /usr/share/elasticsearch/data/node
- chmod g+rwx /usr/share/elasticsearch/data
- chgrp 1000 /usr/share/elasticsearch/data
image: busybox:1.29.2
imagePullPolicy: IfNotPresent
name: set-dir-owner
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /usr/share/elasticsearch/data
name: elasticsearch-data
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 10
updateStrategy:
type: OnDelete
volumeClaimTemplates:
- metadata:
creationTimestamp: null
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
The mounted Elasticsearch data directory is owned by root by default. Try the following init container to change the ownership before Elasticsearch starts:
initContainers:
  - name: chown
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
      - chown
    args:
      - 1000:1000
      - /mnt/azure
    volumeMounts:
      - name: <your volume claim template name>
        mountPath: /mnt/azure

OpenShift Access Mongodb Pod from another Pod

I'm currently trying to deploy a mongodb pod on OpenShift and to access this pod from another Node.js application via mongoose. At first everything seems fine. I have created a route to the mongodb, and when I open it in my browser I get
It looks like you are trying to access MongoDB over HTTP on the
native driver port.
So far so good. But when I try opening a connection to the database from another pod it refuses the connection. I'm using the username and password provided by OpenShift and connect to
mongodb://[username]:[password]@[host]:[port]/[dbname]
unfortunately without luck. It seems the database is only accepting connections from localhost, but I could not find out how to change that. It would be great if someone had an idea.
Here's the DeploymentConfig:
apiVersion: v1
kind: DeploymentConfig
metadata:
annotations:
template.alpha.openshift.io/wait-for-ready: "true"
creationTimestamp: null
generation: 1
labels:
app: mongodb-persistent
template: mongodb-persistent-template
name: mongodb
spec:
replicas: 1
selector:
name: mongodb
strategy:
activeDeadlineSeconds: 21600
recreateParams:
timeoutSeconds: 600
resources: {}
type: Recreate
template:
metadata:
creationTimestamp: null
labels:
name: mongodb
spec:
containers:
- env:
- name: MONGODB_USER
valueFrom:
secretKeyRef:
key: database-user
name: mongodb
- name: MONGODB_PASSWORD
valueFrom:
secretKeyRef:
key: database-password
name: mongodb
- name: MONGODB_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
key: database-admin-password
name: mongodb
- name: MONGODB_DATABASE
valueFrom:
secretKeyRef:
key: database-name
name: mongodb
image: registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:82c79f0e54d5a23f96671373510159e4fac478e2aeef4181e61f25ac38c1ae1f
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: 27017
timeoutSeconds: 1
name: mongodb
ports:
- containerPort: 27017
protocol: TCP
readinessProbe:
exec:
command:
- /bin/sh
- -i
- -c
- mongo 127.0.1:27017/$MONGODB_DATABASE -u $MONGODB_USER -p $MONGODB_PASSWORD
--eval="quit()"
failureThreshold: 3
initialDelaySeconds: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
memory: 512Mi
securityContext:
capabilities: {}
privileged: false
terminationMessagePath: /dev/termination-log
volumeMounts:
- mountPath: /var/lib/mongodb/data
name: mongodb-data
dnsPolicy: ClusterFirst
restartPolicy: Always
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- name: mongodb-data
persistentVolumeClaim:
claimName: mongodb
test: false
triggers:
- imageChangeParams:
automatic: true
containerNames:
- mongodb
from:
kind: ImageStreamTag
name: mongodb:3.2
namespace: openshift
type: ImageChange
- type: ConfigChange
status:
availableReplicas: 0
latestVersion: 0
observedGeneration: 0
replicas: 0
unavailableReplicas: 0
updatedReplicas: 0
The Service Config
apiVersion: v1
kind: Service
metadata:
annotations:
template.openshift.io/expose-uri: mongodb://{.spec.clusterIP}:{.spec.ports[?(.name=="mongo")].port}
creationTimestamp: null
labels:
app: mongodb-persistent
template: mongodb-persistent-template
name: mongodb
spec:
ports:
- name: mongo
port: 27017
protocol: TCP
targetPort: 27017
selector:
name: mongodb
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
and the pod
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/created-by: |
{"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"some-name-space","name":"mongodb-3","uid":"xxxx-xxx-xxx-xxxxxx","apiVersion":"v1","resourceVersion":"244413593"}}
kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container
mongodb'
openshift.io/deployment-config.latest-version: "3"
openshift.io/deployment-config.name: mongodb
openshift.io/deployment.name: mongodb-3
openshift.io/scc: nfs-scc
creationTimestamp: null
generateName: mongodb-3-
labels:
deployment: mongodb-3
deploymentconfig: mongodb
name: mongodb
ownerReferences:
- apiVersion: v1
controller: true
kind: ReplicationController
name: mongodb-3
uid: a694b832-5dd2-11e8-b2fc-40f2e91e2433
spec:
containers:
- env:
- name: MONGODB_USER
valueFrom:
secretKeyRef:
key: database-user
name: mongodb
- name: MONGODB_PASSWORD
valueFrom:
secretKeyRef:
key: database-password
name: mongodb
- name: MONGODB_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
key: database-admin-password
name: mongodb
- name: MONGODB_DATABASE
valueFrom:
secretKeyRef:
key: database-name
name: mongodb
image: registry.access.redhat.com/rhscl/mongodb-32-rhel7@sha256:82c79f0e54d5a23f96671373510159e4fac478e2aeef4181e61f25ac38c1ae1f
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: 27017
timeoutSeconds: 1
name: mongodb
ports:
- containerPort: 27017
protocol: TCP
readinessProbe:
exec:
command:
- /bin/sh
- -i
- -c
- mongo 127.0.1:27017/$MONGODB_DATABASE -u $MONGODB_USER -p $MONGODB_PASSWORD
--eval="quit()"
failureThreshold: 3
initialDelaySeconds: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
memory: 512Mi
requests:
cpu: 250m
memory: 512Mi
securityContext:
capabilities:
drop:
- KILL
- MKNOD
- SETGID
- SETUID
- SYS_CHROOT
privileged: false
runAsUser: 1049930000
seLinuxOptions:
level: s0:c223,c212
terminationMessagePath: /dev/termination-log
volumeMounts:
- mountPath: /var/lib/mongodb/data
name: mongodb-data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-rfvr5
readOnly: true
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: default-dockercfg-3mpps
nodeName: thenode.name.net
nodeSelector:
region: primary
restartPolicy: Always
securityContext:
fsGroup: 1049930000
seLinuxOptions:
level: s0:c223,c212
supplementalGroups:
- 5555
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
volumes:
- name: mongodb-data
persistentVolumeClaim:
claimName: mongodb
- name: default-token-rfvr5
secret:
defaultMode: 420
secretName: default-token-rfvr5
status:
phase: Pending
OK, that was a long search, but I was finally able to solve it. My first mistake was that routes are not suited to making a connection to a database, as they only handle the HTTP protocol.
Now there were two use cases left for me:
1. You're working on your local machine and want to test code that you later upload to OpenShift.
2. You deploy that code to OpenShift (it has to be in the same project, but it is a different app than the database).
1. Local Machine
Since the route doesn't work, port forwarding is used. I had read about that before but didn't really understand what it meant (I thought the service itself was already forwarding ports).
When you are on your local machine you do the following with the oc CLI:
oc port-forward <pod-name> <local-port>:<remote-port>
You'll get a message that the port is being forwarded. The important point is that in your app you now connect to localhost, even on your local machine.
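For example (the pod name here is hypothetical; use the one reported by oc get pods):
oc get pods
oc port-forward mongodb-3-abcde 27017:27017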
2. App running on OpenShift
After you upload your code to OpenShift (in my case, just Add to Project --> Node.js --> Add your repo), localhost will no longer work.
What took a while for me to understand is that, as long as you are in the same project, a lot of information is available in your environment variables.
So just check the name of your database's service (in my case mongodb) and you will find the host and port to use.
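A quick way to see those service environment variables from inside the application pod (a sketch; the pod name is hypothetical):
oc exec <your-app-pod> -- env | grep MONGODB_SERVICE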
Summary
Here's a little code example that now works both on the local machine and on OpenShift. I have already set up a persistent MongoDB on OpenShift called mongodb.
The code doesn't do much, but it will make a connection and tell you that it did, so you know it's working.
var mongoose = require('mongoose');
// Connect to Mongodb
var username = process.env.MONGO_DB_USERNAME || 'someUserName';
var password = process.env.MONGO_DB_PASSWORD || 'somePassword';
var host = process.env.MONGODB_SERVICE_HOST || '127.0.0.1';
var port = process.env.MONGODB_SERVICE_PORT || '27017';
var database = process.env.MONGO_DB_DATABASE || 'sampledb';
console.log('---DATABASE PARAMETERS---');
console.log('Host: ' + host);
console.log('Port: ' + port);
console.log('Username: ' + username);
console.log('Password: ' + password);
console.log('Database: ' + database);
var connectionString = 'mongodb://' + username + ':' + password + '@' + host + ':' + port + '/' + database;
console.log('---CONNECTING TO---');
console.log(connectionString);
mongoose.connect(connectionString);
mongoose.connection.once('open', (data) => {
console.log('Connection has been made');
console.log(data);
});

K8s did not kill my airflow webserver pod

I have airflow running in k8s containers.
The webserver encountered a DNS error (could not translate the url for my db to an ip) and the webserver workers were killed.
What is troubling me is that k8s did not attempt to kill the pod and start a new one in its place.
Pod log output:
OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
[2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
[2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
[2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
[2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
[2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
[2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
[2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.
The k8s status is Running, but when I open an exec shell in the k8s UI I get the following output (gunicorn appears to realize it's dead):
root@webserver-373771664-3h4v9:/# ps -Al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 0 80 0 - 107153 - ? 00:06:42 /usr/local/bin/
4 Z 0 13 1 0 80 0 - 0 - ? 00:01:24 gunicorn: maste <defunct>
4 S 0 2206 0 0 80 0 - 4987 - ? 00:00:00 bash
0 R 0 2224 2206 0 80 0 - 7486 - ? 00:00:00 ps
The following is the YAML for my deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: webserver
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow-webserver
spec:
volumes:
- name: webserver-dags
emptyDir: {}
containers:
- name: airflow-webserver
image: my.custom.image :latest
imagePullPolicy: Always
resources:
requests:
cpu: 100m
limits:
cpu: 500m
ports:
- containerPort: 80
protocol: TCP
env:
- name: AIRFLOW_HOME
value: /var/lib/airflow
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
valueFrom:
secretKeyRef:
name: db1
key: sqlalchemy_conn
volumeMounts:
- mountPath: /var/lib/airflow/dags/
name: webserver-dags
command: ["airflow"]
args: ["webserver"]
- name: docker-s3-to-backup
image: my.custom.image:latest
imagePullPolicy: Always
resources:
requests:
cpu: 50m
limits:
cpu: 500m
env:
- name: ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws
key: access_key_id
- name: SECRET_KEY
valueFrom:
secretKeyRef:
name: aws
key: secret_access_key
- name: S3_PATH
value: s3://my-s3-bucket/dags/
- name: DATA_PATH
value: /dags/
- name: CRON_SCHEDULE
value: "*/5 * * * *"
volumeMounts:
- mountPath: /dags/
name: webserver-dags
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: webserver
namespace: airflow
spec:
scaleTargetRef:
apiVersion: apps/v1beta1
kind: Deployment
name: webserver
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 75
---
apiVersion: v1
kind: Service
metadata:
labels:
name: webserver
namespace: airflow
spec:
type: NodePort
ports:
- port: 80
selector:
app: airflow-webserver
You need to define readiness and liveness probes so that Kubernetes can detect the pod status,
as documented on this page: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe
ports:
  - containerPort: 8080
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
Well, when a process dies in a container, the container exits and the kubelet restarts it on the same node, within the same pod. What happened here is by no means a fault of Kubernetes; it is in fact a problem with your container. For that restart to happen, the main process you launch in the container (be it from CMD or via ENTRYPOINT) needs to die, and the ones you launched did not (one went into zombie mode but was not reaped, which is an example of another issue altogether: zombie reaping). A liveness probe will help in this case (as mentioned by @sfgroups), since it will terminate the pod when it fails, but that is treating the symptom rather than the root cause (not that you shouldn't have probes defined in general, as a good practice).
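One quick thing that can be checked is what PID 1 inside the container actually is; if gunicorn is launched by a wrapper that stays alive, the kubelet never sees the main process exit. A sketch, using the pod name from the ps output above and the container and namespace from the deployment manifest:
# Show the command line of the container's PID 1
kubectl -n airflow exec webserver-373771664-3h4v9 -c airflow-webserver -- cat /proc/1/cmdline | tr '\0' ' '; echo
# Show how many times the container has actually been restarted
kubectl -n airflow get pod webserver-373771664-3h4v9 -o jsonpath='{.status.containerStatuses[*].restartCount}'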
