cassandra cluster joining delay

cassandra cluster joining delay - cassandra

I am trying to install 3 cassandra using bosh release.I am getting error
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
On searching in net i found that we need to put some delay when cluster joins.Let me know how to introduced delay. Do we have any attribute for this ?
- name: cassandra_seed
templates:
- name: cassandra
release: cassandra
- name: collectd
release: metrics
- name: logstash-shipper
release: cassandra
- name: consul
release: consul
instances: 1
resource_pool: service-net-medium
persistent_disk: 10240
networks:
- name: ccc-service-net
default: [dns, gateway]
properties:
collectd:
plugin_templates: [cassandra]
cassandra:
broadcast_address: 0.cassandra-seed.ccc-service-net.<%= $deployment_name %>.microbosh
consul:
bootstrap_expect: 0
join_hosts: ["0.vault-consul.ccc-service-net.<%= $deployment_name %>.microbosh"]
service:
name: cassandra
process:
name: ps -ef |grep cassandra |grep -v grep || exit 2
server: false
default_recursor: 8.8.8.8
update:
serial: false
Error
root#9e3c9ac3-1832-48cf-a58c-3ef25ee17869:/var/vcap/sys/log/cassandra# vim cassandra.stderr.log
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:584)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:855)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:366)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:581)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:710)

Related

kube-proxy changes reverting after couple of minutes on my AKS cluster

I am experimenting and tweaking a bit on my sandbox AKS cluster with the intention to configure it in a production ready state. Regarding that, I am following a book where the writer is redeployig the initial kube-proxy daemonset with some modification (the only difference is that he is doing it on AWS EKS).
The problem is that the daemonset and pod are getting to the initial state after 2-3 minutes. AKS is just doing a rollback, what I can se when execute the rollback command
> kubectl rollout history daemonset kube-proxy -n kube-system
daemonset.apps/kube-proxy
REVISION CHANGE-CAUSE
2 <none>
8 <none>
10 <none>
14 <none>
16 <none>
I tried to redeploy the daemonset with my minor changes (changed cpu from 100m to 120m and changed the -v flag from 3 to 2) declaretively by applying following manifest
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
component: kube-proxy
tier: node
deployment: custom
name: kube-proxy
namespace: kube-system
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
component: kube-proxy
tier: node
template:
metadata:
creationTimestamp: null
labels:
component: kube-proxy
tier: node
deployedBy: Luka
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.azure.com/cluster
operator: Exists
- key: type
operator: NotIn
values:
- virtual-kubelet
- key: kubernetes.io/os
operator: In
values:
- linux
containers:
- command:
- kube-proxy
- --conntrack-max-per-core=0
- --metrics-bind-address=0.0.0.0:10249
- --kubeconfig=/var/lib/kubelet/kubeconfig
- --cluster-cidr=10.244.0.0/16
- --detect-local-mode=ClusterCIDR
- --pod-interface-name-prefix=
- --v=2
image: mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.23.12-hotfix.20220922.1
imagePullPolicy: IfNotPresent
name: kube-proxy
resources:
requests:
cpu: 120m
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet
name: kubeconfig
readOnly: true
- mountPath: /etc/kubernetes/certs
name: certificates
readOnly: true
- mountPath: /run/xtables.lock
name: iptableslock
- mountPath: /lib/modules
name: modules
dnsPolicy: ClusterFirst
hostNetwork: true
initContainers:
- command:
- /bin/sh
- -c
- |
SYSCTL=/proc/sys/net/netfilter/nf_conntrack_max
echo "Current net.netfilter.nf_conntrack_max: $(cat $SYSCTL)"
DESIRED=$(awk -F= '/net.netfilter.nf_conntrack_max/ {print $2}' /etc/sysctl.d/999-sysctl-aks.conf)
if [ -z "$DESIRED" ]; then
DESIRED=$((32768*$(nproc)))
if [ $DESIRED -lt 131072 ]; then
DESIRED=131072
fi
echo "AKS custom config for net.netfilter.nf_conntrack_max not set."
echo "Setting nf_conntrack_max to $DESIRED (32768 * $(nproc) cores, minimum 131072)."
echo $DESIRED > $SYSCTL
else
echo "AKS custom config for net.netfilter.nf_conntrack_max set to $DESIRED."
echo "Setting nf_conntrack_max to $DESIRED."
echo $DESIRED > $SYSCTL
fi
image: mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.23.12-hotfix.20220922.1
imagePullPolicy: IfNotPresent
name: kube-proxy-bootstrap
resources:
requests:
cpu: 100m
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/sysctl.d
name: sysctls
- mountPath: /lib/modules
name: modules
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- effect: NoExecute
operator: Exists
- effect: NoSchedule
operator: Exists
volumes:
- hostPath:
path: /var/lib/kubelet
type: ""
name: kubeconfig
- hostPath:
path: /etc/kubernetes/certs
type: ""
name: certificates
- hostPath:
path: /run/xtables.lock
type: FileOrCreate
name: iptableslock
- hostPath:
path: /etc/sysctl.d
type: Directory
name: sysctls
- hostPath:
path: /lib/modules
type: Directory
name: modules
updateStrategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 1
type: RollingUpdate
status:
currentNumberScheduled: 4
desiredNumberScheduled: 4
numberAvailable: 4
numberMisscheduled: 0
numberReady: 4
observedGeneration: 1
updatedNumberScheduled: 4
I tried it also by removing the initContainer. Even the solution by editing the daemonset, explained in this stackoverlow post didnt worked.
Do I miss something? Why is the kube-proxy daemonset always rolling back?

In Kubernetes rolling updates are the default strategy to update running version of the application
When I upgrade the pods from version 1 to 2 the deployment will creates the new ReplicaSet and increase the count of replicas and previous count goes to 0
After rolling update, the previous replica set is not deleted
If we try to execute another rolling update from version 2 to 3 we might notice that at the end of the upgrade we have two replica sets with 0 count
I have created the deployment file and deployed when I check the history of the daemonset I am able to see below results
kubectl rollout history daemonset kube-proxy -n kube-system
We can rollback to the specific version
kubectl rollout undo daemonset kube-proxy --to-revision=4 -n kube-system
After undo changes my replica revision changes to my daemonset look like below
kubectl rollout history daemonset kube-proxy -n kube-system
In the above command we have two columns 1 is revision and another is change-cause and it is always set to none
I have set the change-cause to 'Kube' as mentioned below and got below results
If I try to get the rollout history again
kubernetes.io/change-cause: "Kube" #for particular revision
kubectl apply -f filename
kubectl rollout history daemonset kube-proxy -n kube-system
Reference: To know more about the rolling updates use this kubernetes link

Elastic Search upgrade to v8 on Kubernetes

I am having an elastic search deployment on a Microsoft Kubernetes cluster that was deployed with a 7.x chart and I changed the image to 8.x. This upgrade worked and both elastic and Kibana was accessible, but now i need to enable THE new security feature which is included in the basic license from now on. The reason behind the security first came from the requirement to enable APM Server/Agents.
I have the following values:
- name: cluster.initial_master_nodes
value: elasticsearch-master-0,
- name: discovery.seed_hosts
value: elasticsearch-master-headless
- name: cluster.name
value: elasticsearch
- name: network.host
value: 0.0.0.0
- name: cluster.deprecation_indexing.enabled
value: 'false'
- name: node.roles
value: data,ingest,master,ml,remote_cluster_client
The elastic search and kibana pods are able to start but i am unable to set APM Integration due security. So I am enabling security using the below values:
- name: xpack.security.enabled
value: 'true'
Then i am getting an error log from the elasic search pod: "Transport SSL must be enabled if security is enabled. Please set [xpack.security.transport.ssl.enabled] to [true] or disable security by setting [xpack.security.enabled] to [false]". So i am enabling ssl using the below values:
- name: xpack.security.transport.ssl.enabled
value: 'true'
Then i am getting an error log from elastic search pod: "invalid SSL configuration for xpack.security.transport.ssl - server ssl configuration requires a key and certificate, but these have not been configured; you must set either [xpack.security.transport.ssl.keystore.path] (p12 file), or both [xpack.security.transport.ssl.key] (pem file) and [xpack.security.transport.ssl.certificate] (pem key file)".
I start with Option1, i am creating the keys using the below commands (no password / enter, enter / enter, enter, enter) and i am coping them to a persistent folder:
./bin/elasticsearch-certutil ca
./bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12
cp elastic-stack-ca.p12 data/elastic-stack-ca.p12
cp elastic-certificates.p12 data/elastic-certificates.p12
In addition I am also configuring the below values:
- name: xpack.security.transport.ssl.truststore.path
value: '/usr/share/elasticsearch/data/elastic-certificates.p12'
- name: xpack.security.transport.ssl.keystore.path
value: '/usr/share/elasticsearch/data/elastic-certificates.p12'
But the pod is still in initializing, if generate the certificates with password. then i am getting an error log from elastic search pod: "cannot read configured [PKCS12] keystore (as a truststore) [/usr/share/elasticsearch/data/elastic-certificates.p12] - this is usually caused by an incorrect password; (no password was provided)"
Then i go to Option2, i am creating the keys using the below commands and i am coping them to a persistent folder
./bin/elasticsearch-certutil ca --pem
unzip elastic-stack-ca.zip –d
cp ca.crt data/ca.crt
cp ca.key data/ca.key
In addition I am also configuring the below values:
- name: xpack.security.transport.ssl.key
value: '/usr/share/elasticsearch/data/ca.key'
- name: xpack.security.transport.ssl.certificate
value: '/usr/share/elasticsearch/data/ca.crt'
But the pod is still in initializing state without providing any logs, as i know while pod is in initializing state it does not produce any container logs. From portal side in events everything seems to be ok, except the elastic pod which is not in ready state.
At last i located the same issue to the eleastic search community, without any response: https://discuss.elastic.co/t/elasticsearch-pods-are-not-ready-when-xpack-security-enabled-is-configured/281709?u=s19k15
Here is my StatefullSet
status:
observedGeneration: 169
replicas: 1
updatedReplicas: 1
currentRevision: elasticsearch-master-7449d7bd69
updateRevision: elasticsearch-master-7d8c7b6997
collisionCount: 0
spec:
replicas: 1
selector:
matchLabels:
app: elasticsearch-master
template:
metadata:
name: elasticsearch-master
creationTimestamp: null
labels:
app: elasticsearch-master
chart: elasticsearch
release: platform
spec:
initContainers:
- name: configure-sysctl
image: docker.elastic.co/elasticsearch/elasticsearch:8.1.2
command:
- sysctl
- '-w'
- vm.max_map_count=262144
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
runAsUser: 0
containers:
- name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:8.1.2
ports:
- name: http
containerPort: 9200
protocol: TCP
- name: transport
containerPort: 9300
protocol: TCP
env:
- name: node.name
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: cluster.initial_master_nodes
value: elasticsearch-master-0,
- name: discovery.seed_hosts
value: elasticsearch-master-headless
- name: cluster.name
value: elasticsearch
- name: cluster.deprecation_indexing.enabled
value: 'false'
- name: ES_JAVA_OPTS
value: '-Xmx512m -Xms512m'
- name: node.roles
value: data,ingest,master,ml,remote_cluster_client
- name: xpack.license.self_generated.type
value: basic
- name: xpack.security.enabled
value: 'true'
- name: xpack.security.transport.ssl.enabled
value: 'true'
- name: xpack.security.transport.ssl.truststore.path
value: /usr/share/elasticsearch/data/elastic-certificates.p12
- name: xpack.security.transport.ssl.keystore.path
value: /usr/share/elasticsearch/data/elastic-certificates.p12
- name: xpack.security.http.ssl.enabled
value: 'true'
- name: xpack.security.http.ssl.truststore.path
value: /usr/share/elasticsearch/data/elastic-certificates.p12
- name: xpack.security.http.ssl.keystore.path
value: /usr/share/elasticsearch/data/elastic-certificates.p12
- name: logger.org.elasticsearch.discovery
value: debug
- name: path.logs
value: /usr/share/elasticsearch/data
- name: xpack.security.enrollment.enabled
value: 'true'
resources:
limits:
cpu: '1'
memory: 2Gi
requests:
cpu: 100m
memory: 512Mi
volumeMounts:
- name: elasticsearch-master
mountPath: /usr/share/elasticsearch/data
readinessProbe:
exec:
command:
- bash
- '-c'
- >
set -e
# If the node is starting up wait for the cluster to be ready
(request params: "wait_for_status=green&timeout=1s" )
# Once it has started only check that the node itself is
responding
START_FILE=/tmp/.es_start_file
# Disable nss cache to avoid filling dentry cache when calling
curl
# This is required with Elasticsearch Docker using nss < 3.52
export NSS_SDB_USE_CACHE=no
http () {
local path="${1}"
local args="${2}"
set -- -XGET -s
if [ "$args" != "" ]; then
set -- "$#" $args
fi
if [ -n "${ELASTIC_PASSWORD}" ]; then
set -- "$#" -u "elastic:${ELASTIC_PASSWORD}"
fi
curl --output /dev/null -k "$#" "http://127.0.0.1:9200${path}"
}
if [ -f "${START_FILE}" ]; then
echo 'Elasticsearch is already running, lets check the node is healthy'
HTTP_CODE=$(http "/" "-w %{http_code}")
RC=$?
if [[ ${RC} -ne 0 ]]; then
echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
exit ${RC}
fi
# ready if HTTP code 200, 503 is tolerable if ES version is 6.x
if [[ ${HTTP_CODE} == "200" ]]; then
exit 0
elif [[ ${HTTP_CODE} == "503" && "8" == "6" ]]; then
exit 0
else
echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
exit 1
fi
else
echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
if http "/_cluster/health?wait_for_status=green&timeout=1s" "--fail" ; then
touch ${START_FILE}
exit 0
else
echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
exit 1
fi
fi
initialDelaySeconds: 10
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 3
failureThreshold: 3
lifecycle:
postStart:
exec:
command:
- bash
- '-c'
- >
#!/bin/bash
# Create the
dev.general.logcreation.elasticsearchlogobject.v1.json index
ES_URL=http://localhost:9200
while [[ "$(curl -s -o /dev/null -w '%{http_code}\n'
$ES_URL)" != "200" ]]; do sleep 1; done
curl --request PUT --header 'Content-Type: application/json'
"$ES_URL/dev.general.logcreation.elasticsearchlogobject.v1.json/"
--data
'{"mappings":{"properties":{"Properties":{"properties":{"StatusCode":{"type":"text"}}}}},"settings":{"index":{"number_of_shards":"1","number_of_replicas":"0"}}}'
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
drop:
- ALL
runAsUser: 1000
runAsNonRoot: true
restartPolicy: Always
terminationGracePeriodSeconds: 120
dnsPolicy: ClusterFirst
automountServiceAccountToken: true
securityContext:
runAsUser: 1000
fsGroup: 1000
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- elasticsearch-master
topologyKey: kubernetes.io/hostname
schedulerName: default-scheduler
enableServiceLinks: true
volumeClaimTemplates:
- kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: elasticsearch-master
creationTimestamp: null
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 4Gi
volumeMode: Filesystem
status:
phase: Pending
serviceName: elasticsearch-master-headless
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
revisionHistoryLimit: 10
Any ideas?

Finally found the answer, maybe it helps lot of people in case they face something similar. When the pod is initializing endlessly is like sleeping. In my case a strange code inside my chart StatefullSet started causing this issue when security became enabled.
while [[ "$(curl -s -o /dev/null -w '%{http_code}\n'
$ES_URL)" != "200" ]]; do sleep 1; done
This will not return 200 as now the http excepts also a user and a password to authenticate and therefore is goes for a sleep.
So make sure that in case the pods are in initializing state and remaining there, there is no any while/sleep

AWS EKS terraform tutorial (with assumeRole) - k8s dashboard error

I followed the tutorial at https://learn.hashicorp.com/tutorials/terraform/eks.
Everything works fine with a single IAM user with the required permissions as specified at https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/iam-permissions.md
But when I try to assumeRole in a cross AWSAccount scenario I run into errors/failures.
I started kubectl proxy as per step 5.
However, when I try to access the k8s dashboard at http://127.0.0.1:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ (after completing steps 1-5), I get the error message as follows -
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "no endpoints available for service \"kubernetes-dashboard\"",
"reason": "ServiceUnavailable",
"code": 503
}
I also got zero pods in READY state for the metrics server deployment in step 3 of the tutorial -
$ kubectl get deployment metrics-server -n kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 0/1 1 0 21m
My kube dns too has zero pods in READY state and the status is -
kubectl -n kube-system -l=k8s-app=kube-dns get pod
NAME READY STATUS RESTARTS AGE
coredns-55cbf8d6c5-5h8md 0/1 Pending 0 10m
coredns-55cbf8d6c5-n7wp8 0/1 Pending 0 10m
My terraform version info is as below -
$ terraform version
2021/03/06 21:18:18 [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility.
Use TF_LOG=TRACE to see Terraform's internal logs.
----
2021/03/06 21:18:18 [INFO] Terraform version: 0.14.7
2021/03/06 21:18:18 [INFO] Go runtime version: go1.15.6
2021/03/06 21:18:18 [INFO] CLI args: []string{"/usr/local/bin/terraform", "version"}
2021/03/06 21:18:18 [DEBUG] Attempting to open CLI config file: /Users/user1/.terraformrc
2021/03/06 21:18:18 [DEBUG] File doesn't exist, but doesn't need to. Ignoring.
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory terraform.d/plugins
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory /Users/user1/.terraform.d/plugins
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory /Users/user1/Library/Application Support/io.terraform/plugins
2021/03/06 21:18:18 [DEBUG] ignoring non-existing provider search directory /Library/Application Support/io.terraform/plugins
2021/03/06 21:18:18 [INFO] CLI command args: []string{"version"}
Terraform v0.14.7
+ provider registry.terraform.io/hashicorp/aws v3.31.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.0.2
+ provider registry.terraform.io/hashicorp/local v2.0.0
+ provider registry.terraform.io/hashicorp/null v3.0.0
+ provider registry.terraform.io/hashicorp/random v3.0.0
+ provider registry.terraform.io/hashicorp/template v2.2.0
Output of describe pods for kube-system ns is -
$ kubectl describe pods -n kube-system
Name: coredns-7dcf49c5dd-kffzw
Namespace: kube-system
Priority: 2000000000
PriorityClassName: system-cluster-critical
Node: <none>
Labels: eks.amazonaws.com/component=coredns
k8s-app=kube-dns
pod-template-hash=7dcf49c5dd
Annotations: eks.amazonaws.com/compute-type: ec2
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
Controlled By: ReplicaSet/coredns-7dcf49c5dd
Containers:
coredns:
Image: 602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-sqv8j (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-sqv8j:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-sqv8j
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 34s (x16 over 15m) default-scheduler no nodes available to schedule pods
Name: coredns-7dcf49c5dd-rdw94
Namespace: kube-system
Priority: 2000000000
PriorityClassName: system-cluster-critical
Node: <none>
Labels: eks.amazonaws.com/component=coredns
k8s-app=kube-dns
pod-template-hash=7dcf49c5dd
Annotations: eks.amazonaws.com/compute-type: ec2
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
Controlled By: ReplicaSet/coredns-7dcf49c5dd
Containers:
coredns:
Image: 602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-sqv8j (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-sqv8j:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-sqv8j
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 35s (x16 over 15m) default-scheduler no nodes available to schedule pods
Name: metrics-server-5889d4b758-2bmc4
Namespace: kube-system
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: k8s-app=metrics-server
pod-template-hash=5889d4b758
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
Controlled By: ReplicaSet/metrics-server-5889d4b758
Containers:
metrics-server:
Image: k8s.gcr.io/metrics-server-amd64:v0.3.6
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/tmp from tmp-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from metrics-server-token-wsqkn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
tmp-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
metrics-server-token-wsqkn:
Type: Secret (a volume populated by a Secret)
SecretName: metrics-server-token-wsqkn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6s (x9 over 6m56s) default-scheduler no nodes available to schedule pods
Also,
$ kubectl get nodes
No resources found.
And,
$ kubectl describe nodes
returns nothing
Can someone help me troubleshoot and fix this ?
TIA.

Self documenting my solution
Given my AWS setup is as follows
account1:user1:role1
account2:user2:role2
and the role setup is as below -
arn:aws:iam::account2:role/role2
<< trust relationship >>
eks.amazonaws.com
ec2.amazonaws.com
arn:aws:iam::account1:user/user1
arn:aws:sts::account2:assumed-role/role2/user11
Updating the eks-cluster.tf as below -
map_roles = [
{
"groups": [ "system:masters" ],
"rolearn": "arn:aws:iam::account2:role/role2",
"username": "role2"
}
]
map_users = [
{
"groups": [ "system:masters" ],
"userarn": "arn:aws:iam::account1:user/user1",
"username": "user1"
},
{
"groups": [ "system:masters" ],
"userarn": "arn:aws:sts::account2:assumed-role/role2/user11",
"username": "user1"
}
]
p.s.: Yes "user11" is a generated username suffixed with a "1" to the account1 user with a username of "user1"
Makes everything work !

Task Instance State Task is in the 'queued' state which is not a valid state for execution. The task must be cleared in order to be run. Airflow task

Kubernetes Pod describes as above, and it says it is using local executor instead of Kubernetes executor and invalid image. Pod log shows as below
kubectl describe pod tablescreationschematablescreation-ecabd38a66664a33b6645a72ef056edc
Name: swedschematablescreationschematablescreation-ecabd38a66664a33b6645a72ef056edc
Namespace: default
Priority: 0
Node: 10.73.96.181
Start Time: Mon, 11 May 2020 18:22:15 +0530
Labels: airflow-worker=5888feda-6aee-49c8-a94b-39cbe5758062
airflow_version=1.10.10
dag_id=Swed-schema-tables-creation
execution_date=2020-05-11T12_52_09.829627_plus_00_00
kubernetes_executor=True
task_id=Schema_Tables_Creation
try_number=1
Annotations: <none>
Status: Pending
IP: 172.17.0.46
IPs:
IP: 172.17.0.46
Containers:
base:
Container ID:
Image: :
Image ID:
Port: <none>
Host Port: <none>
Command:
airflow
run
Swed-schema-tables-creation
Schema_Tables_Creation
2020-05-11T12:52:09.829627+00:00
--local
--pool
default_pool
-sd
/root/airflow/dags/User_Creation_dag.py
State: Waiting
Reason: InvalidImageName
Ready: False
Restart Count: 0
Environment:
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflowkube:airflowkube#10.73.96.181:5434/airflowkube
Mounts:
/root/airflow/dags from airflow-dags (ro)
/root/airflow/logs from airflow-logs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-64cxg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
airflow-dags:
Type: HostPath (bare host directory volume)
Path: /data/Naveen/Airflow/dags
HostPathType:
airflow-logs:
Type: HostPath (bare host directory volume)
Path: /data/Naveen/Airflow/Logs
HostPathType:
default-token-64cxg:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-64cxg
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/swedschematablescreationschematablescreation-ecabd38a66664a33b6645a72ef056edc to evblfnclnullnull1538
Warning Failed 2m15s (x12 over 4m28s) kubelet, evblfnclnullnull1538 Error: InvalidImageName
Warning InspectFailed 2m (x13 over 4m28s) kubelet, evblfnclnullnull1538 Failed to apply default image tag ":": couldn't parse image reference ":": invalid reference format
**strong text**
enter code here

Multi-broker Kafka on Kubernetes how to set KAFKA_ADVERTISED_HOST_NAME

My current Kafka deployment file with 3 Kafka brokers looks like this:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: kafka
spec:
selector:
matchLabels:
app: kafka
serviceName: kafka-headless
replicas: 3
updateStrategy:
type: RollingUpdate
podManagementPolicy: Parallel
template:
metadata:
labels:
app: kafka
spec:
containers:
- name: kafka-instance
image: wurstmeister/kafka
ports:
- containerPort: 9092
env:
- name: KAFKA_ADVERTISED_PORT
value: "9092"
- name: KAFKA_ADVERTISED_HOST_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KAFKA_ZOOKEEPER_CONNECT
value: "zookeeper-0.zookeeper-headless.default.svc.cluster.local:2181,\
zookeeper-1.zookeeper-headless.default.svc.cluster.local:2181,\
zookeeper-2.zookeeper-headless.default.svc.cluster.local:2181"
- name: BROKER_ID_COMMAND
value: "hostname | awk -F '-' '{print $2}'"
- name: KAFKA_CREATE_TOPICS
value: hello:2:1
volumeMounts:
- name: data
mountPath: /var/lib/kafka/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 50Gi
This creates 3 Kafka brokers as a Stateful Set and connects to the Zookeeper cluster using the Kubedns service with FQDN (Fully Qualified Domain Names) such as:
zookeeper-0.zookeeper-headless.default.svc.cluster.local:2181
Broker IDs are generated based on the pod name:
- name: BROKER_ID_COMMAND
value: "hostname | awk -F '-' '{print $2}'"
Result:
kafka-0 = 0
kafka-1 = 1
kafka-2 = 2
However, In order to use the Kubedns names for the Kafka brokers:
kafka-0.kafka-headless.default.svc.cluster.local:9092
kafka-1.kafka-headless.default.svc.cluster.local:9092
kafka-2.kafka-headless.default.svc.cluster.local:9092
I need to be able to set the KAFKA_ADVERTISED_HOST_NAME variable to the above FQDN values based on the name of the pod.
Currently I have the variable set to the name of the pod:
- name: KAFKA_ADVERTISED_HOST_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
Result:
KAFKA_ADVERTISED_HOST_NAME=kafka-0
KAFKA_ADVERTISED_HOST_NAME=kafka-1
KAFKA_ADVERTISED_HOST_NAME=kafka-2
But somehow I would need to append the rest of the DNS name.
Is there a way I could set the DNS value directly?
Something like that:
- name: KAFKA_ADVERTISED_HOST_NAME
valueFrom:
fieldRef:
fieldPath: kubedns.name

I managed to solve the problem with a command field inside the pod definition:
command:
- sh
- -c
- "export KAFKA_ADVERTISED_HOST_NAME=$(hostname).kafka-headless.default.svc.cluster.local &&
start-kafka.sh"
This runs a shell command which exports the advertised hostname environment variable based on the hostname value.

- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KAFKA_ZOOKEEPER_CONNECT
value: zook-zookeeper.zook.svc.cluster.local:2181
- name: KAFKA_PORT_NUMBER
value: "9092"
- name: KAFKA_LISTENERS
value: SASL_SSL://:$(KAFKA_PORT_NUMBER)
- name: KAFKA_ADVERTISED_LISTENERS
value: SASL_SSL://$(MY_POD_NAME).kafka-kafka-headless.kafka.svc.cluster.local:$(KAFKA_PORT_NUMBER)
The above config would create your FQDN.
You should be able to see those names in your Kafka logs when Kafka server starts.
NOTE: Kubernetes allows you to reference environment variables using the syntax $(VARIABLE)

None of the above worked for me; my setup it wurstmeister/kafka:2.12-2.5.0 and wurstmeister/zookeeper:3.4.6 in a single pod on Kubernetes 1.16 (don't ask); ClusterIp service on top which forwards 9092 to the Kafka container.
This set of environment variables works:
- name: KAFKA_LISTENERS
value: "INSIDE://:9094,OUTSIDE://:9092"
- name: KAFKA_ADVERTISED_LISTENERS
value: "INSIDE://:9094,OUTSIDE://my-service.my-namespace.svc.cluster.local:9092"
- name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
value: "INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT" # not production-ready!
- name: KAFKA_INTER_BROKER_LISTENER_NAME
value: INSIDE
- name: KAFKA_ZOOKEEPER_CONNECT
value: "localhost:2181" # since it's in the same pod
Sources: wurstmeister/kafka doc, Kafka doc
The inherent problem seems to be that Kafka itself needs to be an IP-ish thing to bind to and to talk to itself via, while clients need a DNS-ish name to connect to from the outside. The latter one can't contain the pod name for some reason. (Might be a separate configuration issue on my end.)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

cassandra cluster joining delay - cassandra

Related

kube-proxy changes reverting after couple of minutes on my AKS cluster

Elastic Search upgrade to v8 on Kubernetes

AWS EKS terraform tutorial (with assumeRole) - k8s dashboard error

Task Instance State Task is in the 'queued' state which is not a valid state for execution. The task must be cleared in order to be run. Airflow task

Multi-broker Kafka on Kubernetes how to set KAFKA_ADVERTISED_HOST_NAME

Categories

Resources