RabbitMQ cannot start after upgrading Azure Kubernetes Service (AKS)
I had the same problem as @Amir Soleimani, but the error was slightly different, and none of the solutions in that post worked for me. I'm using Azure Kubernetes Service (AKS), and after upgrading from 1.13.xx to 1.18.xx RabbitMQ can no longer start.
UPDATED - Solution that worked for me (weigh this carefully, as it can affect your existing queues):
Remove the current rabbitmq StatefulSet, including its persistent disks.
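For reference, a minimal sketch of that cleanup with kubectl, assuming the manifest below (StatefulSet rabbitmq with a volume claim template named volume and three replicas). Deleting the PVCs permanently destroys the Mnesia data on those disks, which is exactly what clears the stale cluster state:

# Delete the StatefulSet and its pods
kubectl delete statefulset rabbitmq
# Delete the PVCs created from the volumeClaimTemplates
# (StatefulSet PVCs are named <template>-<statefulset>-<ordinal>)
kubectl delete pvc volume-rabbitmq-0 volume-rabbitmq-1 volume-rabbitmq-2

After that, re-apply the manifest so the pods start with empty disks.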
Here is my StatefulSet file:
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 80
    targetPort: 15672
    name: http
  selector:
    app: rabbitmq
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 5672
    name: amqp
  - port: 4369
    name: epmd
  - port: 25672
    name: rabbitmq-dist
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-config
  namespace: default
type: Opaque
data:
  erlang.cookie: samplecookie==
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  serviceName: rabbitmq
  selector:
    matchLabels:
      app: rabbitmq
  replicas: 3
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: 'rabbitmq:3.6.6-management-alpine'
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - >
                if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                  cat /etc/resolv.conf.new > /etc/resolv.conf;
                  rm /etc/resolv.conf.new;
                fi;
                until rabbitmqctl node_health_check; do sleep 1; done;
                if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                  rabbitmqctl stop_app;
                  rabbitmqctl join_cluster rabbit@rabbitmq-0;
                  rabbitmqctl start_app;
                fi;
                rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
        env:
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-config
              key: erlang.cookie
        - name: RABBITMQ_DEFAULT_USER
          value: username
        - name: RABBITMQ_DEFAULT_PASS
          value: password
        ports:
        - containerPort: 5672
          name: amqp
        - containerPort: 15672
          name: amqp-management
        volumeMounts:
        - mountPath: /var/lib/rabbitmq
          name: volume
  volumeClaimTemplates:
  - metadata:
      name: volume
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
Result of kubectl describe pod rabbitmq-0
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0']
rabbit@rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
current node details:
- node name: 'rabbitmq-cli-91@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: unable to connect to node 'rabbit@rabbitmq-0': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0']
rabbit@rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
current node details:
- node name: 'rabbitmq-cli-26@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: rabbit application is not running on node rabbit@rabbitmq-0.
* Suggestion: start it with "rabbitmqctl start_app" and try again
, message: "Timeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nError: unable to connect to node 'rabbit@rabbitmq-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n * connected to epmd (port 4369) on rabbitmq-0\n * epmd reports: node 'rabbit' not running at all\n no other nodes on rabbitmq-0\n * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq-cli-91@rabbitmq-0'\n- home dir: /var/lib/rabbitmq\n- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\nError: unable to connect to node 'rabbit@rabbitmq-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n * connected to epmd (port 4369) on rabbitmq-0\n * epmd reports: node 'rabbit' not running at all\n no other nodes on rabbitmq-0\n * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq-cli-26@rabbitmq-0'\n- home dir: /var/lib/rabbitmq\n- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: rabbit application is not running on node rabbit@rabbitmq-0.\n * Suggestion: start it with \"rabbitmqctl start_app\" and try again\n"
Warning FailedPostStartHook 23m kubelet Exec lifecycle hook ([/bin/sh -c if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit@rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(3ac91d73-de7b-4cde-81f6-c31bacd10252)" failed - error: command '/bin/sh -c if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit@rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to connect to node 'rabbit@rabbitmq-0': nodedown
Result of kubectl logs rabbitmq-0
=CRASH REPORT==== 18-Jul-2021::11:06:01 ===
crasher:
initial call: application_master:init/4
pid: <0.156.0>
registered_name: []
exception exit: {{timeout_waiting_for_tables,
[rabbit_user,rabbit_user_permission,rabbit_vhost,
rabbit_durable_route,rabbit_durable_exchange,
rabbit_runtime_parameters,rabbit_durable_queue]},
{rabbit,start,[normal,[]]}}
in function application_master:init/4 (application_master.erl, line 134)
ancestors: [<0.155.0>]
messages: [{'EXIT',<0.157.0>,normal}]
links: [<0.155.0>,<0.31.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 987
stack_size: 27
reductions: 98
neighbours:
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: rabbit
exited: {{timeout_waiting_for_tables,
[rabbit_user,rabbit_user_permission,rabbit_vhost,
rabbit_durable_route,rabbit_durable_exchange,
rabbit_runtime_parameters,rabbit_durable_queue]},
{rabbit,start,[normal,[]]}}
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: amqp_client
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: rabbit_common
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: xmerl
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: os_mon
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: inets
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: asn1
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: syntax_tools
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: mnesia
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: crypto
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: ranch
exited: stopped
type: temporary
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
application: compiler
exited: stopped
type: temporary
BOOT FAILED
===========
Timeout contacting cluster nodes: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2'].
BACKGROUND
==========
This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2']
rabbit@rabbitmq-1:
* unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)
rabbit@rabbitmq-2:
* unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)
current node details:
- node name: 'rabbit@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
=INFO REPORT==== 18-Jul-2021::11:06:01 ===
Timeout contacting cluster nodes: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2'].
BACKGROUND
==========
This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2']
rabbit@rabbitmq-1:
* unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)
rabbit@rabbitmq-2:
* unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)
current node details:
- node name: 'rabbit@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
{"init terminating in do_boot",timeout_waiting_for_tables}
init terminating in do_boot (timeout_waiting_for_tables)
Crash dump is being written to: erl_crash.dump...
What I tried but didn't work:
rabbitmqctl stop_app
rabbitmqctl force_boot
Removing the StatefulSet and re-installing it
Re-configuring the YAML file
Please try force_boot in the postStart script:
...
fi;
if [[ "$HOSTNAME" == "rabbitmq-0" ]]; then
rabbitmqctl stop_app;
rabbitmqctl force_boot;
fi;
until rabbitmqctl node_health_check; do sleep 1; done;
...
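If the pod keeps failing its postStart hook before you can edit it, the same recovery can be driven from outside. A minimal sketch with kubectl, assuming rabbitmq-0 from the manifest above is the wedged node; note that force_boot makes the node boot without waiting for rabbitmq-1/rabbitmq-2, so changes made on those nodes after rabbitmq-0 went down may be lost:

# Stop the rabbit application, mark the node to force-boot, then start it again
kubectl exec -it rabbitmq-0 -- rabbitmqctl stop_app
kubectl exec -it rabbitmq-0 -- rabbitmqctl force_boot
kubectl exec -it rabbitmq-0 -- rabbitmqctl start_app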
Related
Unable to connect to MongoDB: MongoNetworkError & MongoNetworkError connecting to Kubernetes MongoDB pod with mongoose
I am trying to connect to MongoDB in a microservice-based project using NodeJs, Kubernetes, Ingress, and skaffold. I got two errors on doing skaffold dev:

MongoNetworkError: failed to connect to server [auth-mongo-srv:21017] on first connect [MongoNetworkTimeoutError: connection timed out.

Mongoose default connection error: MongoNetworkError: MongoNetworkError: failed to connect to server [auth-mongo-srv:21017] on first connect [MongoNetworkTimeoutError: connection timed out at connectionFailureError.

My auth-mongo-deploy.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-mongo-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: auth-mongo
  template:
    metadata:
      labels:
        app: auth-mongo
    spec:
      containers:
      - name: auth-mongo
        image: mongo
---
apiVersion: v1
kind: Service
metadata:
  name: auth-mongo-srv
spec:
  selector:
    app: auth-mongo
  ports:
  - name: db
    protocol: TCP
    port: 27017
    targetPort: 27017

My server.ts:

const dbURI: string = "mongodb://auth-mongo-srv:21017/auth"
logger.debug(dbURI)
logger.info('connecting to database...')
// changing {} --> options change nothing!
mongoose.connect(dbURI, {}).then(() => {
    logger.info('Mongoose connection done')
    app.listen(APP_PORT, () => {
        logger.info(`server listening on ${APP_PORT}`)
    })
    console.clear();
}).catch((e) => {
    logger.info('Mongoose connection error')
    logger.error(e)
})

Additional information:

1. pod is created:

rhythm@vivobook:~/Documents/TicketResale/server$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
auth-deploy-595c6cbf6d-9wzt9         1/1     Running   0          5m53s
auth-mongo-deploy-6b96b7798c-9726w   1/1     Running   0          5m53s
tickets-deploy-675b7b9b58-f5bzs      1/1     Running   0          5m53s

2. pod description:

kubectl describe pod auth-mongo-deploy-6b96b7798c-9726w
Name:         auth-mongo-deploy-694b67f76d-ksw82
Namespace:    default
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Tue, 21 Jun 2022 14:11:47 +0530
Labels:       app=auth-mongo
              pod-template-hash=694b67f76d
              skaffold.dev/run-id=2f5d2142-0f1a-4fa4-b641-3f301f10e65a
Annotations:  <none>
Status:       Running
IP:           172.17.0.2
IPs:
  IP:  172.17.0.2
Controlled By:  ReplicaSet/auth-mongo-deploy-694b67f76d
Containers:
  auth-mongo:
    Container ID:   docker://fa43cd7e03ac32ed63c82419e5f9722deffd2f93206b6a0f2b25ae9be8f6cedf
    Image:          mongo
    Image ID:       docker-pullable://mongo@sha256:37e84d3dd30cdfb5472ec42b8a6b4dc6ca7cacd91ebcfa0410a54528bbc5fa6d
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 21 Jun 2022 14:11:52 +0530
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zw7s9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-zw7s9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  79s   default-scheduler  Successfully assigned default/auth-mongo-deploy-694b67f76d-ksw82 to minikube
  Normal  Pulling    79s   kubelet            Pulling image "mongo"
  Normal  Pulled     75s   kubelet            Successfully pulled image "mongo" in 4.429126953s
  Normal  Created    75s   kubelet            Created container auth-mongo
  Normal  Started    75s   kubelet            Started container auth-mongo

I have also tried:

kubectl describe service auth-mongo-srv
Name:              auth-mongo-srv
Namespace:         default
Labels:            skaffold.dev/run-id=2f5d2142-0f1a-4fa4-b641-3f301f10e65a
Annotations:       <none>
Selector:          app=auth-mongo
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.100.42.183
IPs:               10.100.42.183
Port:              db  27017/TCP
TargetPort:        27017/TCP
Endpoints:         172.17.0.2:27017
Session Affinity:  None
Events:            <none>

And then changed:

const dbURI: string = "mongodb://auth-mongo-srv:21017/auth"

to

const dbURI: string = "mongodb://172.17.0.2:27017:21017/auth"

which generated a different error of MongooseServerSelectionError.
The port in your connection string is a typo: 21017 should be MongoDB's default 27017, which is what your service exposes:

const dbURI: string = "mongodb://auth-mongo-srv:27017/auth"
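To confirm the service name and port resolve from inside the cluster, a throwaway pod works as a quick check (mongosh-test is just a hypothetical pod name; recent mongo images ship the mongosh client):

kubectl run mongosh-test --rm -it --restart=Never --image=mongo -- \
  mongosh "mongodb://auth-mongo-srv:27017/auth" --eval "db.runCommand({ ping: 1 })"

A { ok: 1 } response means DNS and the port are fine, and the original failure was purely the 21017 typo.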
Elastic Search upgrade to v8 on Kubernetes
I am having an Elasticsearch deployment on a Microsoft Kubernetes cluster that was deployed with a 7.x chart, and I changed the image to 8.x. This upgrade worked and both Elastic and Kibana were accessible, but now I need to enable the new security feature, which is included in the basic license from now on. The requirement for security first came from the need to enable APM Server/Agents.

I have the following values:

- name: cluster.initial_master_nodes
  value: elasticsearch-master-0,
- name: discovery.seed_hosts
  value: elasticsearch-master-headless
- name: cluster.name
  value: elasticsearch
- name: network.host
  value: 0.0.0.0
- name: cluster.deprecation_indexing.enabled
  value: 'false'
- name: node.roles
  value: data,ingest,master,ml,remote_cluster_client

The elasticsearch and kibana pods are able to start, but I am unable to set up the APM integration due to security. So I am enabling security using the below values:

- name: xpack.security.enabled
  value: 'true'

Then I am getting an error log from the elasticsearch pod: "Transport SSL must be enabled if security is enabled. Please set [xpack.security.transport.ssl.enabled] to [true] or disable security by setting [xpack.security.enabled] to [false]". So I am enabling SSL using the below values:

- name: xpack.security.transport.ssl.enabled
  value: 'true'

Then I am getting an error log from the elasticsearch pod: "invalid SSL configuration for xpack.security.transport.ssl - server ssl configuration requires a key and certificate, but these have not been configured; you must set either [xpack.security.transport.ssl.keystore.path] (p12 file), or both [xpack.security.transport.ssl.key] (pem file) and [xpack.security.transport.ssl.certificate] (pem key file)".

I start with Option 1: I am creating the keys using the below commands (no password / enter, enter / enter, enter, enter) and copying them to a persistent folder:

./bin/elasticsearch-certutil ca
./bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12
cp elastic-stack-ca.p12 data/elastic-stack-ca.p12
cp elastic-certificates.p12 data/elastic-certificates.p12

In addition I am also configuring the below values:

- name: xpack.security.transport.ssl.truststore.path
  value: '/usr/share/elasticsearch/data/elastic-certificates.p12'
- name: xpack.security.transport.ssl.keystore.path
  value: '/usr/share/elasticsearch/data/elastic-certificates.p12'

But the pod is still initializing. If I generate the certificates with a password, then I am getting an error log from the elasticsearch pod: "cannot read configured [PKCS12] keystore (as a truststore) [/usr/share/elasticsearch/data/elastic-certificates.p12] - this is usually caused by an incorrect password; (no password was provided)".

Then I go to Option 2: I am creating the keys using the below commands and copying them to a persistent folder:

./bin/elasticsearch-certutil ca --pem
unzip elastic-stack-ca.zip -d
cp ca.crt data/ca.crt
cp ca.key data/ca.key

In addition I am also configuring the below values:

- name: xpack.security.transport.ssl.key
  value: '/usr/share/elasticsearch/data/ca.key'
- name: xpack.security.transport.ssl.certificate
  value: '/usr/share/elasticsearch/data/ca.crt'

But the pod is still in the initializing state without producing any logs; as far as I know, while a pod is in the initializing state it does not produce any container logs. From the portal side, everything in events seems to be OK, except the elastic pod, which is not in a ready state.
At last, I posted the same issue on the Elasticsearch community forum, without any response: https://discuss.elastic.co/t/elasticsearch-pods-are-not-ready-when-xpack-security-enabled-is-configured/281709?u=s19k15

Here is my StatefulSet status and spec:

status:
  observedGeneration: 169
  replicas: 1
  updatedReplicas: 1
  currentRevision: elasticsearch-master-7449d7bd69
  updateRevision: elasticsearch-master-7d8c7b6997
  collisionCount: 0
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch-master
  template:
    metadata:
      name: elasticsearch-master
      creationTimestamp: null
      labels:
        app: elasticsearch-master
        chart: elasticsearch
        release: platform
    spec:
      initContainers:
        - name: configure-sysctl
          image: docker.elastic.co/elasticsearch/elasticsearch:8.1.2
          command:
            - sysctl
            - '-w'
            - vm.max_map_count=262144
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            runAsUser: 0
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.1.2
          ports:
            - name: http
              containerPort: 9200
              protocol: TCP
            - name: transport
              containerPort: 9300
              protocol: TCP
          env:
            - name: node.name
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: cluster.initial_master_nodes
              value: elasticsearch-master-0,
            - name: discovery.seed_hosts
              value: elasticsearch-master-headless
            - name: cluster.name
              value: elasticsearch
            - name: cluster.deprecation_indexing.enabled
              value: 'false'
            - name: ES_JAVA_OPTS
              value: '-Xmx512m -Xms512m'
            - name: node.roles
              value: data,ingest,master,ml,remote_cluster_client
            - name: xpack.license.self_generated.type
              value: basic
            - name: xpack.security.enabled
              value: 'true'
            - name: xpack.security.transport.ssl.enabled
              value: 'true'
            - name: xpack.security.transport.ssl.truststore.path
              value: /usr/share/elasticsearch/data/elastic-certificates.p12
            - name: xpack.security.transport.ssl.keystore.path
              value: /usr/share/elasticsearch/data/elastic-certificates.p12
            - name: xpack.security.http.ssl.enabled
              value: 'true'
            - name: xpack.security.http.ssl.truststore.path
              value: /usr/share/elasticsearch/data/elastic-certificates.p12
            - name: xpack.security.http.ssl.keystore.path
              value: /usr/share/elasticsearch/data/elastic-certificates.p12
            - name: logger.org.elasticsearch.discovery
              value: debug
            - name: path.logs
              value: /usr/share/elasticsearch/data
            - name: xpack.security.enrollment.enabled
              value: 'true'
          resources:
            limits:
              cpu: '1'
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 512Mi
          volumeMounts:
            - name: elasticsearch-master
              mountPath: /usr/share/elasticsearch/data
          readinessProbe:
            exec:
              command:
                - bash
                - '-c'
                - >
                  set -e

                  # If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=green&timeout=1s" )
                  # Once it has started only check that the node itself is responding
                  START_FILE=/tmp/.es_start_file

                  # Disable nss cache to avoid filling dentry cache when calling curl
                  # This is required with Elasticsearch Docker using nss < 3.52
                  export NSS_SDB_USE_CACHE=no

                  http () {
                    local path="${1}"
                    local args="${2}"
                    set -- -XGET -s
                    if [ "$args" != "" ]; then
                      set -- "$@" $args
                    fi
                    if [ -n "${ELASTIC_PASSWORD}" ]; then
                      set -- "$@" -u "elastic:${ELASTIC_PASSWORD}"
                    fi
                    curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
                  }

                  if [ -f "${START_FILE}" ]; then
                    echo 'Elasticsearch is already running, lets check the node is healthy'
                    HTTP_CODE=$(http "/" "-w %{http_code}")
                    RC=$?
                    if [[ ${RC} -ne 0 ]]; then
                      echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
                      exit ${RC}
                    fi
                    # ready if HTTP code 200, 503 is tolerable if ES version is 6.x
                    if [[ ${HTTP_CODE} == "200" ]]; then
                      exit 0
                    elif [[ ${HTTP_CODE} == "503" && "8" == "6" ]]; then
                      exit 0
                    else
                      echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
                      exit 1
                    fi
                  else
                    echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                    if http "/_cluster/health?wait_for_status=green&timeout=1s" "--fail" ; then
                      touch ${START_FILE}
                      exit 0
                    else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                    fi
                  fi
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 3
            failureThreshold: 3
          lifecycle:
            postStart:
              exec:
                command:
                  - bash
                  - '-c'
                  - >
                    #!/bin/bash
                    # Create the dev.general.logcreation.elasticsearchlogobject.v1.json index
                    ES_URL=http://localhost:9200
                    while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done
                    curl --request PUT --header 'Content-Type: application/json' "$ES_URL/dev.general.logcreation.elasticsearchlogobject.v1.json/" --data '{"mappings":{"properties":{"Properties":{"properties":{"StatusCode":{"type":"text"}}}}},"settings":{"index":{"number_of_shards":"1","number_of_replicas":"0"}}}'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              drop:
                - ALL
            runAsUser: 1000
            runAsNonRoot: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 120
      dnsPolicy: ClusterFirst
      automountServiceAccountToken: true
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - elasticsearch-master
              topologyKey: kubernetes.io/hostname
      schedulerName: default-scheduler
      enableServiceLinks: true
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: elasticsearch-master
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 4Gi
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: elasticsearch-master-headless
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  revisionHistoryLimit: 10

Any ideas?
Finally I found the answer; maybe it helps a lot of people who face something similar. When a pod is initializing endlessly, it is effectively sleeping. In my case, a piece of code inside my chart's StatefulSet started causing this once security became enabled:

while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done

This will never return 200, because the endpoint now also expects a user and a password to authenticate, so the hook sleeps forever. So if your pods are stuck in the initializing state, make sure there is no such while/sleep loop holding them up.
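For anyone adapting that hook, a minimal sketch of the same wait loop made security-aware, assuming the elastic superuser password is exposed to the container as ELASTIC_PASSWORD (as in the question's readiness probe) and that HTTP SSL is enabled:

# Wait until Elasticsearch answers 200 over HTTPS with basic auth,
# instead of spinning forever on the unauthenticated 401 response
ES_URL=https://localhost:9200
while [[ "$(curl -sk -o /dev/null -w '%{http_code}' -u "elastic:${ELASTIC_PASSWORD}" "$ES_URL")" != "200" ]]; do
  sleep 1
done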
AWS EKS terraform tutorial (with assumeRole) - k8s dashboard error
I followed the tutorial at https://learn.hashicorp.com/tutorials/terraform/eks. Everything works fine with a single IAM user with the required permissions as specified at https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/iam-permissions.md

But when I try to assumeRole in a cross-AWS-account scenario, I run into errors/failures. I started kubectl proxy as per step 5. However, when I try to access the k8s dashboard at http://127.0.0.1:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ (after completing steps 1-5), I get the following error message:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"kubernetes-dashboard\"",
  "reason": "ServiceUnavailable",
  "code": 503
}

I also got zero pods in READY state for the metrics server deployment in step 3 of the tutorial:

$ kubectl get deployment metrics-server -n kube-system
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   0/1     1            0           21m

My kube-dns too has zero pods in READY state and the status is:

kubectl -n kube-system -l=k8s-app=kube-dns get pod
NAME                       READY   STATUS    RESTARTS   AGE
coredns-55cbf8d6c5-5h8md   0/1     Pending   0          10m
coredns-55cbf8d6c5-n7wp8   0/1     Pending   0          10m

My terraform version info is as below:

$ terraform version
2021/03/06 21:18:18 [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility.
  Use TF_LOG=TRACE to see Terraform's internal logs.
  ----
2021/03/06 21:18:18 [INFO] Terraform version: 0.14.7
2021/03/06 21:18:18 [INFO] Go runtime version: go1.15.6
2021/03/06 21:18:18 [INFO] CLI args: []string{"/usr/local/bin/terraform", "version"}
2021/03/06 21:18:18 [DEBUG] Attempting to open CLI config file: /Users/user1/.terraformrc
2021/03/06 21:18:18 [DEBUG] File doesn't exist, but doesn't need to. Ignoring.
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory terraform.d/plugins
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory /Users/user1/.terraform.d/plugins
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory /Users/user1/Library/Application Support/io.terraform/plugins
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory /Library/Application Support/io.terraform/plugins
2021/03/06 21:18:18 [INFO] CLI command args: []string{"version"}
Terraform v0.14.7
+ provider registry.terraform.io/hashicorp/aws v3.31.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.0.2
+ provider registry.terraform.io/hashicorp/local v2.0.0
+ provider registry.terraform.io/hashicorp/null v3.0.0
+ provider registry.terraform.io/hashicorp/random v3.0.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Output of describe pods for the kube-system namespace:

$ kubectl describe pods -n kube-system
Name:               coredns-7dcf49c5dd-kffzw
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               <none>
Labels:             eks.amazonaws.com/component=coredns
                    k8s-app=kube-dns
                    pod-template-hash=7dcf49c5dd
Annotations:        eks.amazonaws.com/compute-type: ec2
                    kubernetes.io/psp: eks.privileged
Status:             Pending
IP:
Controlled By:      ReplicaSet/coredns-7dcf49c5dd
Containers:
  coredns:
    Image:       602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:        -conf /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-sqv8j (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-sqv8j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-sqv8j
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  34s (x16 over 15m)  default-scheduler  no nodes available to schedule pods

Name:               coredns-7dcf49c5dd-rdw94
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               <none>
Labels:             eks.amazonaws.com/component=coredns
                    k8s-app=kube-dns
                    pod-template-hash=7dcf49c5dd
Annotations:        eks.amazonaws.com/compute-type: ec2
                    kubernetes.io/psp: eks.privileged
Status:             Pending
IP:
Controlled By:      ReplicaSet/coredns-7dcf49c5dd
Containers:
  coredns:
    Image:       602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:        -conf /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-sqv8j (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-sqv8j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-sqv8j
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  35s (x16 over 15m)  default-scheduler  no nodes available to schedule pods

Name:               metrics-server-5889d4b758-2bmc4
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             k8s-app=metrics-server
                    pod-template-hash=5889d4b758
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:
Controlled By:      ReplicaSet/metrics-server-5889d4b758
Containers:
  metrics-server:
    Image:        k8s.gcr.io/metrics-server-amd64:v0.3.6
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /tmp from tmp-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-server-token-wsqkn (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  tmp-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  metrics-server-token-wsqkn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-server-token-wsqkn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  6s (x9 over 6m56s)  default-scheduler  no nodes available to schedule pods

Also,

$ kubectl get nodes
No resources found.

And,

$ kubectl describe nodes

returns nothing.

Can someone help me troubleshoot and fix this? TIA.
Self-documenting my solution.

Given my AWS setup is as follows

account1:user1:role1
account2:user2:role2

and the role setup is as below

arn:aws:iam::account2:role/role2
  << trust relationship >>
  eks.amazonaws.com
  ec2.amazonaws.com
  arn:aws:iam::account1:user/user1
  arn:aws:sts::account2:assumed-role/role2/user11

updating the eks-cluster.tf as below makes everything work:

map_roles = [
  {
    "groups": [
      "system:masters"
    ],
    "rolearn": "arn:aws:iam::account2:role/role2",
    "username": "role2"
  }
]

map_users = [
  {
    "groups": [
      "system:masters"
    ],
    "userarn": "arn:aws:iam::account1:user/user1",
    "username": "user1"
  },
  {
    "groups": [
      "system:masters"
    ],
    "userarn": "arn:aws:sts::account2:assumed-role/role2/user11",
    "username": "user1"
  }
]

p.s.: Yes, "user11" is a generated username: the account1 username "user1" suffixed with a "1".
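As a quick sanity check after changing the aws-auth mappings, something like the following (the cluster name is a placeholder) confirms the assumed role can reach the API server and that worker nodes have joined:

# Authenticate kubectl via the assumed role and verify access
aws eks update-kubeconfig --name my-eks-cluster --role-arn arn:aws:iam::account2:role/role2
# The map_roles/map_users entries land in the aws-auth ConfigMap; inspect it if access still fails
kubectl describe configmap aws-auth -n kube-system
# Nodes appearing here means coredns and metrics-server can finally be scheduled
kubectl get nodes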
How does the master bootstrap process work and how can I debug it?
I am working to stand up 3 instances of the yugabyte master and tserver in separate k8s clusters connected over LoadBalancer services on bare metal. However, on all three master instances it looks like the bootstrap process is failing:

I0531 19:50:28.081645 1 master_main.cc:94] NumCPUs determined to be: 2
I0531 19:50:28.082594 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.082682 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.082937 1 mem_tracker.cc:249] MemTracker: hard memory limit is 1.699219 GB
I0531 19:50:28.082963 1 mem_tracker.cc:251] MemTracker: soft memory limit is 1.444336 GB
I0531 19:50:28.083189 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.090148 1 server_base_options.cc:124] Updating master addrs to {yb-master-black.example.com:7100},{yb-master-blue.example.com:7100},{yb-master-white.example.com:7100},{:7100}
I0531 19:50:28.090863 1 rpc_server.cc:86] yb::server::RpcServer created at 0x1a7e210
I0531 19:50:28.090924 1 master.cc:146] yb::master::Master created at 0x7ffe2d4bd140
I0531 19:50:28.090958 1 master.cc:147] yb::master::TSManager created at 0x1a90850
I0531 19:50:28.090975 1 master.cc:148] yb::master::CatalogManager created at 0x1dea000
I0531 19:50:28.091152 1 master_main.cc:115] Initializing master server...
I0531 19:50:28.093097 1 server_base.cc:462] Could not load existing FS layout: Not found (yb/util/env_posix.cc:1482): /mnt/disk0/yb-data/master/instance: No such file or directory (system error 2)
I0531 19:50:28.093150 1 server_base.cc:463] Creating new FS layout
I0531 19:50:28.193439 1 fs_manager.cc:463] Generated new instance metadata in path /mnt/disk0/yb-data/master/instance: uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa" format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.238484 1 fs_manager.cc:463] Generated new instance metadata in path /mnt/disk1/yb-data/master/instance: uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa" format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.377483 1 fs_manager.cc:251] Opened local filesystem: /mnt/disk0,/mnt/disk1 uuid: "5f2f6ad78d27450b8cde9c8bcf40fefa" format_stamp: "Formatted at 2020-05-31 19:50:28 on yb-master-0"
I0531 19:50:28.378015 1 server_base.cc:245] Auto setting FLAGS_num_reactor_threads to 2
I0531 19:50:28.380707 1 thread_pool.cc:166] Starting thread pool { name: Master queue_limit: 10000 max_workers: 1024 }
I0531 19:50:28.382266 1 master_main.cc:118] Starting Master server...
I0531 19:50:28.382313 24 async_initializer.cc:74] Starting to init ybclient
I0531 19:50:28.382365 1 master_main.cc:119] ulimit cur(max)...
ulimit: core file size unlimited(unlimited) blks
ulimit: data seg size unlimited(unlimited) kb
ulimit: open files 1048576(1048576)
ulimit: file size unlimited(unlimited) blks
ulimit: pending signals 22470(22470)
ulimit: file locks unlimited(unlimited)
ulimit: max locked memory 64(64) kb
ulimit: max memory size unlimited(unlimited) kb
ulimit: stack size 8192(unlimited) kb
ulimit: cpu time unlimited(unlimited) secs
ulimit: max user processes unlimited(unlimited)
W0531 19:50:28.383322 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:50:28.383525 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
I0531 19:50:28.383685 1 service_pool.cc:148] yb.master.MasterBackupService: yb::rpc::ServicePoolImpl created at 0x1a82b40
I0531 19:50:28.384888 1 service_pool.cc:148] yb.master.MasterService: yb::rpc::ServicePoolImpl created at 0x1a83680
I0531 19:50:28.385342 1 service_pool.cc:148] yb.tserver.TabletServerService: yb::rpc::ServicePoolImpl created at 0x1a838c0
I0531 19:50:28.388526 1 thread_pool.cc:166] Starting thread pool { name: Master-high-pri queue_limit: 10000 max_workers: 1024 }
I0531 19:50:28.388588 1 service_pool.cc:148] yb.consensus.ConsensusService: yb::rpc::ServicePoolImpl created at 0x201eb40
I0531 19:50:28.393231 1 service_pool.cc:148] yb.tserver.RemoteBootstrapService: yb::rpc::ServicePoolImpl created at 0x201ed80
I0531 19:50:28.393501 1 webserver.cc:148] Starting webserver on 0.0.0.0:7000
I0531 19:50:28.393544 1 webserver.cc:153] Document root: /home/yugabyte/www
I0531 19:50:28.394471 1 webserver.cc:240] Webserver started. Bound to: http://0.0.0.0:7000/
I0531 19:50:28.394668 1 service_pool.cc:148] yb.server.GenericService: yb::rpc::ServicePoolImpl created at 0x201efc0
I0531 19:50:28.395015 1 rpc_server.cc:169] RPC server started. Bound to: 0.0.0.0:7100
I0531 19:50:28.420223 23 tcp_stream.cc:308] { local: 10.233.80.35:55710 remote: 172.16.0.34:7100 }: Recv failed: Network error (yb/util/net/socket.cc:537): recvmsg error: Connection refused (system error 111)
E0531 19:51:28.523921 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 293) passed its deadline 2074493.105s (passed: 60.140s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:51:29.524827 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:51:29.524914 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
E0531 19:52:29.524785 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/outbound_call.cc:512): Could not locate the leader master: GetMasterRegistration RPC (request call id 2359) to 172.29.1.1:7100 timed out after 0.033s
W0531 19:52:30.525079 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:52:30.525205 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:53:28.114395 36 master-path-handlers.cc:150] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
W0531 19:53:29.133951 36 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:53:30.625366 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074615.247s (passed: 60.099s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:53:31.625660 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:53:31.625742 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:53:34.024369 37 master-path-handlers.cc:150] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:54:31.870801 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 300) passed its deadline 2074676.348s (passed: 60.244s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:54:32.871065 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:54:32.871222 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
W0531 19:55:28.190217 41 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
W0531 19:55:31.745038 42 master-path-handlers.cc:1002] Illegal state (yb/master/catalog_manager.cc:6854): Unable to list Masters: Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
E0531 19:55:33.164300 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074737.593s (passed: 60.292s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)
W0531 19:55:34.164574 24 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node 5f2f6ad78d27450b8cde9c8bcf40fefa peer not initialized.
I0531 19:55:34.164667 24 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100,:7100]
E0531 19:56:34.315380 24 async_initializer.cc:84] Failed to initialize client: Timed out (yb/rpc/rpc.cc:213): Could not locate the leader master: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 299) passed its deadline 2074798.886s (passed: 60.150s): Not found (yb/master/master_rpc.cc:284): no leader found: GetLeaderMasterRpc(addrs: [yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, :7100], num_attempts: 1)

As far as connectivity goes, I am able to verify the LoadBalancer endpoints are responding across the different network boundaries by curling the same service endpoint but on the UI port:

[root@yb-master-0 yugabyte]# curl -I http://yb-master-blue.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *

[root@yb-master-0 yugabyte]# curl -I http://yb-master-white.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *

[root@yb-master-0 yugabyte]# curl -I http://yb-master-black.example.com:7000
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1975
Access-Control-Allow-Origin: *

What strategies are there to debug the bootstrap process?
EDIT: Here are the startup flags for the three masters:

/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-white.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=XXXX --placement_zone=XXXX

/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-blue.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=YYYY --placement_zone=YYYY

/home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 --server_broadcast_addresses=yb-master-black.example.com:7100 --master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, --replication_factor=3 --enable_ysql=true --rpc_bind_addresses=0.0.0.0:7100 --metric_node_name=yb-master-0 --memory_limit_hard_bytes=1824522240 --stderrthreshold=0 --num_cpus=2 --undefok=num_cpus,enable_ysql --default_memory_limit_to_ram_ratio=0.85 --leader_failure_max_missed_heartbeat_periods=10 --placement_cloud=AAAA --placement_region=ZZZZ --placement_zone=ZZZZ

For the sake of completeness, here is one of the k8s manifests that I've modified from one of the helm examples.
It is modified to utilize LoadBalancer for the master service:

---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "yb-masters"
  labels:
    app: "yb-master"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  type: LoadBalancer
  loadBalancerIP: 172.16.0.34
  ports:
    - name: "rpc-port"
      port: 7100
    - name: "ui"
      port: 7000
  selector:
    app: "yb-master"
---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "yb-tservers"
  labels:
    app: "yb-tserver"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  clusterIP: None
  ports:
    - name: "rpc-port"
      port: 7100
    - name: "ui"
      port: 9000
    - name: "yedis-port"
      port: 6379
    - name: "yql-port"
      port: 9042
    - name: "ysql-port"
      port: 5433
  selector:
    app: "yb-tserver"
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "yb-master"
  namespace: "yugabytedb"
  labels:
    app: "yb-master"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  serviceName: "yb-masters"
  podManagementPolicy: Parallel
  replicas: 1
  volumeClaimTemplates:
    - metadata:
        name: datadir0
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datadir1
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: "yb-master"
  template:
    metadata:
      labels:
        app: "yb-master"
        heritage: "Helm"
        release: "blue"
        chart: "yugabyte"
        component: "yugabytedb"
    spec:
      affinity:
        # Set the anti-affinity selector scope to YB masters.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - "yb-master"
                topologyKey: kubernetes.io/hostname
      containers:
        - name: "yb-master"
          image: "yugabytedb/yugabyte:2.1.6.0-b17"
          imagePullPolicy: IfNotPresent
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - >
                    mkdir -p /mnt/disk0/cores;
                    mkdir -p /mnt/disk0/yb-data/scripts;
                    if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
                      if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
                        cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
                      fi;
                    fi
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 2
              memory: 2Gi
            requests:
              cpu: 500m
              memory: 1Gi
          command:
            - "/home/yugabyte/bin/yb-master"
            - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
            - "--server_broadcast_addresses=yb-master-blue.example.com:7100"
            - "--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, "
            - "--replication_factor=3"
            - "--enable_ysql=true"
            - "--rpc_bind_addresses=0.0.0.0:7100"
            - "--metric_node_name=$(HOSTNAME)"
            - "--memory_limit_hard_bytes=1824522240"
            - "--stderrthreshold=0"
            - "--num_cpus=2"
            - "--undefok=num_cpus,enable_ysql"
            - "--default_memory_limit_to_ram_ratio=0.85"
            - "--leader_failure_max_missed_heartbeat_periods=10"
            - "--placement_cloud=AAAA"
            - "--placement_region=YYYY"
            - "--placement_zone=YYYY"
          ports:
            - containerPort: 7100
              name: "rpc-port"
            - containerPort: 7000
              name: "ui"
          volumeMounts:
            - name: datadir0
              mountPath: /mnt/disk0
            - name: datadir1
              mountPath: /mnt/disk1
        - name: yb-cleanup
          image: busybox:1.31
          env:
            - name: USER
              value: "yugabyte"
          command:
            - "/bin/sh"
            - "-c"
            - >
              mkdir /var/spool/cron;
              mkdir /var/spool/cron/crontabs;
              echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
              crond;
              while true; do sleep 86400; done
          volumeMounts:
            - name: datadir0
              mountPath: /home/yugabyte/
              subPath: yb-data
      volumes:
        - name: datadir0
          hostPath:
            path: /mnt/disks/ssd0
        - name: datadir1
          hostPath:
            path: /mnt/disks/ssd1
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "yb-tserver"
  namespace: "yugabytedb"
  labels:
    app: "yb-tserver"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  serviceName: "yb-tservers"
  podManagementPolicy: Parallel
  replicas: 1
  volumeClaimTemplates:
    - metadata:
        name: datadir0
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datadir1
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: "yb-tserver"
  template:
    metadata:
      labels:
        app: "yb-tserver"
        heritage: "Helm"
        release: "blue"
        chart: "yugabyte"
        component: "yugabytedb"
    spec:
      affinity:
        # Set the anti-affinity selector scope to YB masters.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - "yb-tserver"
                topologyKey: kubernetes.io/hostname
      containers:
        - name: "yb-tserver"
          image: "yugabytedb/yugabyte:2.1.6.0-b17"
          imagePullPolicy: IfNotPresent
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - >
                    mkdir -p /mnt/disk0/cores;
                    mkdir -p /mnt/disk0/yb-data/scripts;
                    if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
                      if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
                        cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
                      fi;
                    fi
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 2
              memory: 4Gi
            requests:
              cpu: 500m
              memory: 2Gi
          command:
            - "/home/yugabyte/bin/yb-tserver"
            - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
            - "--server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100"
            - "--rpc_bind_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
            - "--cql_proxy_bind_address=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
            - "--enable_ysql=true"
            - "--pgsql_proxy_bind_address=$(POD_IP):5433"
            - "--tserver_master_addrs=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100, "
            - "--metric_node_name=$(HOSTNAME)"
            - "--memory_limit_hard_bytes=3649044480"
            - "--stderrthreshold=0"
            - "--num_cpus=2"
            - "--undefok=num_cpus,enable_ysql"
            - "--leader_failure_max_missed_heartbeat_periods=10"
            - "--placement_cloud=AAAA"
            - "--placement_region=YYYY"
            - "--placement_zone=YYYY"
            - "--use_cassandra_authentication=false"
          ports:
            - containerPort: 7100
              name: "rpc-port"
            - containerPort: 9000
              name: "ui"
            - containerPort: 6379
              name: "yedis-port"
            - containerPort: 9042
              name: "yql-port"
            - containerPort: 5433
              name: "ysql-port"
          volumeMounts:
            - name: datadir0
              mountPath: /mnt/disk0
            - name: datadir1
              mountPath: /mnt/disk1
        - name: yb-cleanup
          image: busybox:1.31
          env:
            - name: USER
              value: "yugabyte"
          command:
            - "/bin/sh"
            - "-c"
            - >
              mkdir /var/spool/cron;
              mkdir /var/spool/cron/crontabs;
              echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
              crond;
              while true; do sleep 86400; done
          volumeMounts:
            - name: datadir0
              mountPath: /home/yugabyte/
              subPath: yb-data
      volumes:
        - name: datadir0
          hostPath:
            path: /mnt/disks/ssd0
        - name: datadir1
          hostPath:
            path: /mnt/disks/ssd1
This was mostly resolved (though it looks like I've now run into an unrelated issue) by dropping the extraneous trailing comma on the master addresses list, which was producing the empty {:7100} entry visible in the "Updating master addrs" log lines above:

--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100,

vs

--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100
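A minimal sketch for verifying the fix, assuming the pod and master addresses from the question; yb-admin's list_all_masters should show all three masters with one LEADER once the malformed {:7100} entry is gone:

kubectl exec -it yb-master-0 -n yugabytedb -- /home/yugabyte/bin/yb-admin \
  -master_addresses yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100 \
  list_all_masters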
K8s did not kill my airflow webserver pod
I have airflow running in k8s containers. The webserver encountered a DNS error (could not translate the URL for my db to an IP) and the webserver workers were killed. What is troubling me is that k8s did not attempt to kill the pod and start a new one in its place.

Pod log output:

OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
[2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
[2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
[2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
[2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
[2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
[2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
[2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.

The k8s status is RUNNING, but when I open an exec shell in the k8s UI I get the following output (gunicorn appears to realize it's dead):

root@webserver-373771664-3h4v9:/# ps -Al
F S UID  PID  PPID C PRI NI ADDR SZ     WCHAN TTY TIME     CMD
4 S 0    1    0    0 80  0  -    107153 -     ?   00:06:42 /usr/local/bin/
4 Z 0    13   1    0 80  0  -    0      -     ?   00:01:24 gunicorn: maste <defunct>
4 S 0    2206 0    0 80  0  -    4987   -     ?   00:00:00 bash
0 R 0    2224 2206 0 80  0  -    7486   -     ?   00:00:00 ps

The following is the YAML for my deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: webserver
  namespace: airflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow-webserver
    spec:
      volumes:
        - name: webserver-dags
          emptyDir: {}
      containers:
        - name: airflow-webserver
          image: my.custom.image:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 100m
            limits:
              cpu: 500m
          ports:
            - containerPort: 80
              protocol: TCP
          env:
            - name: AIRFLOW_HOME
              value: /var/lib/airflow
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: db1
                  key: sqlalchemy_conn
          volumeMounts:
            - mountPath: /var/lib/airflow/dags/
              name: webserver-dags
          command: ["airflow"]
          args: ["webserver"]
        - name: docker-s3-to-backup
          image: my.custom.image:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 50m
            limits:
              cpu: 500m
          env:
            - name: ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws
                  key: access_key_id
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: aws
                  key: secret_access_key
            - name: S3_PATH
              value: s3://my-s3-bucket/dags/
            - name: DATA_PATH
              value: /dags/
            - name: CRON_SCHEDULE
              value: "*/5 * * * *"
          volumeMounts:
            - mountPath: /dags/
              name: webserver-dags
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: webserver
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: webserver
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 75
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: webserver
  namespace: airflow
spec:
  type: NodePort
  ports:
    - port: 80
  selector:
    app: airflow-webserver
You need to define readiness and liveness probes for Kubernetes to detect the pod's status, as documented on this page: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe

- containerPort: 8080
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

(In your deployment the webserver container listens on containerPort 80, so point the probes there.)
Well, when a process dies in a container, that container will exit and the kubelet will restart the container on the same node / within the same pod. What happened here is by no means a fault of Kubernetes, but in fact a problem of your container. The main process that you launch in the container (be it from CMD or via ENTRYPOINT) needs to die for the above to happen, and the one you launched did not (it went zombie mode but was not reaped, which is an example of another issue altogether: zombie reaping). A liveness probe will help in this case (as mentioned by @sfgroups), as it will terminate the pod if it fails, but this is treating symptoms rather than the root cause (not that you shouldn't have probes defined in general, as a good practice).
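To make that concrete for the deployment in the question, a minimal sketch of a liveness probe on the airflow-webserver container; /health is the webserver's health endpoint on recent Airflow versions (fall back to a tcpSocket probe on older releases), port 80 matches the manifest's containerPort, and the timings are illustrative:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 5

With a probe like this, the kubelet restarts the container once the gunicorn master goes defunct and the endpoint stops responding, instead of leaving the pod Running with a dead process.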