Rook and ceph on kubernetes

I am new to Kubernetes. I need to integrate Rook and Ceph, adding NFS as block storage. Does anyone have a working example? I followed this tutorial: https://www.digitalocean.com/community/tutorials/how-to-set-up-a-ceph-cluster-within-kubernetes-using-rook and I am getting errors (pods stuck at ContainerCreating, pods stuck at Init) while creating the Ceph cluster with Rook on Kubernetes. Any help would be appreciated.
kubectl get pod -n rook-ceph
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-provisioner-5bcd46f965-42f9r 0/5 ContainerCreating 0 12m
csi-cephfsplugin-provisioner-5bcd46f965-zszwz 5/5 Running 0 12m
csi-cephfsplugin-xcswb 3/3 Running 0 12m
csi-cephfsplugin-zwl9x 3/3 Running 0 12m
csi-rbdplugin-4mh9x 3/3 Running 0 12m
csi-rbdplugin-nlcjr 3/3 Running 0 12m
csi-rbdplugin-provisioner-6658cf554c-4xx9f 6/6 Running 0 12m
csi-rbdplugin-provisioner-6658cf554c-62xc2 0/6 ContainerCreating 0 12m
rook-ceph-detect-version-bwcmp 0/1 Init:0/1 0 9m18s
rook-ceph-operator-5dc456cdb6-n4tgm 1/1 Running 0 13m
rook-discover-l2r27 1/1 Running 0 13m
rook-discover-rxkv4 0/1 ContainerCreating 0 13m

Use kubectl describe pod <name> -n rook-ceph to see the list of events, which appears at the bottom of the output. The events will show where the pods get stuck.

It may also be the case that one of your nodes is in a bad state, since only some of the pod replicas are failing to start. You can confirm by running
kubectl get pod -n rook-ceph -o wide | grep -v Running
Possibly all the failing pods are running on the same node. If that is the case, you can inspect the problematic node with
kubectl describe node [node]
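
The same check can be sketched in Python, assuming the pod list has been exported with kubectl get pods -n rook-ceph -o json. The function name stuck_pods_by_node and the node names in the sample are hypothetical; the fields read (status.phase, spec.nodeName, metadata.name) are standard Pod fields. Pods shown as ContainerCreating or Init:0/1 report the phase Pending, so filtering on the phase catches them.

```python
from collections import defaultdict


def stuck_pods_by_node(pod_list):
    """Group pods that are neither Running nor Succeeded by their node."""
    by_node = defaultdict(list)
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase")
        if phase in ("Running", "Succeeded"):
            continue
        # Pods that have not been scheduled yet have no spec.nodeName.
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        by_node[node].append(pod["metadata"]["name"])
    return dict(by_node)


# Hypothetical sample shaped like `kubectl get pods -o json` output.
# The node names worker-1 and worker-2 are invented for illustration.
SAMPLE = {
    "items": [
        {"metadata": {"name": "csi-cephfsplugin-provisioner-5bcd46f965-42f9r"},
         "spec": {"nodeName": "worker-2"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "csi-cephfsplugin-provisioner-5bcd46f965-zszwz"},
         "spec": {"nodeName": "worker-1"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "rook-discover-rxkv4"},
         "spec": {"nodeName": "worker-2"},
         "status": {"phase": "Pending"}},
    ]
}

for node, pods in stuck_pods_by_node(SAMPLE).items():
    print(node, len(pods), ", ".join(pods))
```

If every stuck pod ends up under a single node, that node is the one to inspect with kubectl describe node. To run it on real data, load the exported file with json.load and pass the result to the function.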

Related

kubernetes networking: pod cannot reach nodes

I have a Kubernetes cluster with 3 masters and 7 workers. I use Calico as the CNI. When I deploy Calico, the calico-kube-controllers-xxx pod fails because it cannot reach 10.96.0.1:443.
2020-06-23 13:05:28.737 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0623 13:05:28.740128 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-06-23 13:05:28.742 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2020-06-23 13:05:38.742 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-06-23 13:05:38.742 [FATAL][1] main.go 114: Failed to initialize Calico datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
This is the situation in the kube-system namespace:
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-77d6cbc65f-6bmjg 0/1 CrashLoopBackOff 56 4h33m
calico-node-94pkr 1/1 Running 0 36m
calico-node-d8vc4 1/1 Running 0 36m
calico-node-fgpd4 1/1 Running 0 37m
calico-node-jqgkp 1/1 Running 0 37m
calico-node-m9lds 1/1 Running 0 37m
calico-node-n5qmb 1/1 Running 0 37m
calico-node-t46jb 1/1 Running 0 36m
calico-node-w6xch 1/1 Running 0 38m
calico-node-xpz8k 1/1 Running 0 37m
calico-node-zbw4x 1/1 Running 0 36m
coredns-5644d7b6d9-ms7gv 0/1 Running 0 4h33m
coredns-5644d7b6d9-thwlz 0/1 Running 0 4h33m
kube-apiserver-k8s01 1/1 Running 7 34d
kube-apiserver-k8s02 1/1 Running 9 34d
kube-apiserver-k8s03 1/1 Running 7 34d
kube-controller-manager-k8s01 1/1 Running 7 34d
kube-controller-manager-k8s02 1/1 Running 9 34d
kube-controller-manager-k8s03 1/1 Running 8 34d
kube-proxy-9dppr 1/1 Running 3 4d
kube-proxy-9hhm9 1/1 Running 3 4d
kube-proxy-9svfk 1/1 Running 1 4d
kube-proxy-jctxm 1/1 Running 3 4d
kube-proxy-lsg7m 1/1 Running 3 4d
kube-proxy-m257r 1/1 Running 1 4d
kube-proxy-qtbbz 1/1 Running 2 4d
kube-proxy-v958j 1/1 Running 2 4d
kube-proxy-x97qx 1/1 Running 2 4d
kube-proxy-xjkjl 1/1 Running 3 4d
kube-scheduler-k8s01 1/1 Running 7 34d
kube-scheduler-k8s02 1/1 Running 9 34d
kube-scheduler-k8s03 1/1 Running 8 34d
In addition, CoreDNS cannot reach the internal kubernetes service.
From a node, if I run wget -S 10.96.0.1:443, I receive a response.
wget -S 10.96.0.1:443
--2020-06-23 13:12:12-- http://10.96.0.1:443/
Connecting to 10.96.0.1:443... connected.
HTTP request sent, awaiting response...
HTTP/1.0 400 Bad Request
2020-06-23 13:12:12 ERROR 400: Bad Request.
But if I run wget -S 10.96.0.1:443 in a pod, I receive a timeout error.
Also, I cannot ping nodes from pods.
The cluster pod CIDR is 192.168.0.0/16.

I resolved this by recreating the cluster with a different pod CIDR.
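
A different pod CIDR can help when the original range collides with another network the pods need to reach. That is only one possible cause here, since the node addresses are not shown in the question. A minimal sketch of an overlap check using Python's ipaddress module follows; the node network 192.168.1.0/24 is a made-up example, and 10.96.0.0/12 is the kubeadm default service range.

```python
import ipaddress


def overlapping_networks(pod_cidr, other_cidrs):
    """Return the entries of other_cidrs that overlap with pod_cidr."""
    pod_net = ipaddress.ip_network(pod_cidr)
    return [
        cidr for cidr in other_cidrs
        if pod_net.overlaps(ipaddress.ip_network(cidr))
    ]


# Assumed values for illustration only.
NODE_NETWORK = "192.168.1.0/24"   # hypothetical node subnet
SERVICE_CIDR = "10.96.0.0/12"     # kubeadm default service range

print(overlapping_networks("192.168.0.0/16", [NODE_NETWORK, SERVICE_CIDR]))
print(overlapping_networks("10.244.0.0/16", [NODE_NETWORK, SERVICE_CIDR]))
```

With these assumed values the first call reports the node subnet as overlapping and the second reports nothing, which is the kind of situation where recreating the cluster with another pod CIDR resolves the routing problem.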

Can't do 'helm install' on cluster. Tiller was installed by GitLab

I created a cluster in GKE using GitLab and installed Helm & Tiller and some other things like ingress and GitLab Runner using GitLab's interface. But when I try to install something using helm from gcloud, it gives "Error: Transport is closing".
I did gcloud container clusters get-credentials ....
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default jaeger-deployment-59ffb979c8-lmjk5 1/1 Running 0 17h
gitlab-managed-apps certmanager-cert-manager-6c8cd9f9bf-67wnh 1/1 Running 0 17h
gitlab-managed-apps ingress-nginx-ingress-controller-75c4d99549-x66n4 1/1 Running 0 21h
gitlab-managed-apps ingress-nginx-ingress-default-backend-6f58fb5f56-pvv2f 1/1 Running 0 21h
gitlab-managed-apps prometheus-kube-state-metrics-6584885ccf-hr8fw 1/1 Running 0 22h
gitlab-managed-apps prometheus-prometheus-server-69b9f444df-htxsq 2/2 Running 0 22h
gitlab-managed-apps runner-gitlab-runner-56798d9d9d-nljqn 1/1 Running 0 22h
gitlab-managed-apps tiller-deploy-74f5d65d77-xk6cc 1/1 Running 0 22h
kube-system heapster-v1.6.0-beta.1-7bdb4fd8f9-t8bq9 2/2 Running 0 22h
kube-system kube-dns-7549f99fcc-bhg9t 4/4 Running 0 22h
kube-system kube-dns-autoscaler-67c97c87fb-4vz9t 1/1 Running 0 22h
kube-system kube-proxy-gke-cluster2-pool-1-05abcbc6-0s6j 1/1 Running 0 20h
kube-system kube-proxy-gke-cluster2-pool-2-67e57524-ht5p 1/1 Running 0 22h
kube-system metrics-server-v0.2.1-fd596d746-289nd 2/2 Running 0 22h
visual-react-10450736 production-847c7d879c-z4h5t 1/1 Running 0 22h
visual-react-10450736 production-postgres-64cfcf9464-jr74c 1/1 Running 0 22h
$ ./helm install stable/wordpress --tiller-namespace gitlab-managed-apps --name wordpress
E0127 10:27:29.790366 418 portforward.go:331] an error occurred forwarding 39113 -> 44134: error forwarding port 44134 to pod 86b33bdc7bc30c08d98fe44c0772517c344dd1bdfefa290b46e82bf84959cb6f, uid : exit status 1: 2019/01/27 04:57:29 socat[11124] E write(5, 0x14ed120, 186): Broken pipe
Error: transport is closing
Another example:
$ ./helm install incubator/jaeger --tiller-namespace gitlab-managed-apps --name jaeger --set elasticsearch.rbac.create=true --set provisionDataStore.cassandra=false --set provisionDataStore.elasticsearch=true --set storage.type=elasticsearch
E0127 10:30:24.591751 429 portforward.go:331] an error occurred forwarding 45597 -> 44134: error forwarding port 44134 to pod 86b33bdc7bc30c08d98fe44c0772517c344dd1bdfefa290b46e82bf84959cb6f, uid : exit status 1: 2019/01/27 05:00:24 socat[13937] E write(5, 0x233d120, 8192): Connection reset by peer
Error: transport is closing
I tried forwarding ports myself, but the command never returns to the prompt; it takes forever.
kubectl port-forward --namespace gitlab-managed-apps tiller-deploy 39113:44134
Apparently, installing anything from GitLab's UI uses Helm, and those installs do not fail. Yet doing the same from the shell fails. Please help me out.
Thanks in advance.

I know it's late, but I'll share this in case someone else struggles with this issue. I found an answer in the GitLab forums.
The trick is to export and decode the certificates from the tiller service account and pass them as arguments to helm like this:
helm list --tiller-connection-timeout 30 --tls --tls-ca-cert tiller-ca.crt --tls-cert tiller.crt --tls-key tiller.key --all --tiller-namespace gitlab-managed-apps
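
The "export and decode" step can be sketched as follows. This is an assumption-heavy illustration: it supposes the certificates live in a Kubernetes Secret exported as JSON (kubectl get secret <name> -n gitlab-managed-apps -o json) whose data keys are ca.crt, tls.crt and tls.key. The key names and the helper write_tls_files are not confirmed by the post; only the output file names come from the helm command above. Secret values are base64-encoded, so each one is decoded before being written.

```python
import base64
import tempfile
from pathlib import Path

# Assumed mapping from Secret data keys to the file names used by helm above.
KEY_TO_FILE = {
    "ca.crt": "tiller-ca.crt",
    "tls.crt": "tiller.crt",
    "tls.key": "tiller.key",
}


def write_tls_files(secret, out_dir):
    """Decode the base64 values in secret['data'] and write them to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for key, filename in KEY_TO_FILE.items():
        path = out / filename
        path.write_bytes(base64.b64decode(secret["data"][key]))
        written.append(path)
    return written


# Demo with placeholder content instead of real certificates.
fake_secret = {
    "data": {
        key: base64.b64encode(f"placeholder for {key}".encode()).decode()
        for key in KEY_TO_FILE
    }
}

with tempfile.TemporaryDirectory() as tmp:
    for path in write_tls_files(fake_secret, tmp):
        print(path.name, "->", path.read_text())
```

With real data, the three files written this way are the ones passed to --tls-ca-cert, --tls-cert and --tls-key.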

How to configure pod disruption budget to drain kubernetes node?

I'd like to configure cluster autoscaler on AKS. When scaling down it fails due to PDB:
I1207 14:24:09.523313 1 cluster.go:95] Fast evaluation: node aks-nodepool1-32797235-0 cannot be removed: no enough pod disruption budget to move kube-system/metrics-server-5cbc77f79f-44f9w
I1207 14:24:09.523413 1 cluster.go:95] Fast evaluation: node aks-nodepool1-32797235-3 cannot be removed: non-daemonset, non-mirrored, non-pdb-assignedkube-system pod present: cluster-autoscaler-84984799fd-22j42
I1207 14:24:09.523438 1 scale_down.go:490] 2 nodes found to be unremovable in simulation, will re-check them at 2018-12-07 14:29:09.231201368 +0000 UTC m=+8976.856144807
All system pods have a PDB with minAvailable: 1 assigned manually. I can imagine that this does not work for pods with only a single replica, like the metrics-server:
❯ k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-nodepool1-32797235-0 Ready agent 4h v1.11.4 10.240.0.4 <none> Ubuntu 16.04.5 LTS 4.15.0-1030-azure docker://3.0.1
aks-nodepool1-32797235-3 Ready agent 4h v1.11.4 10.240.0.6 <none> Ubuntu 16.04.5 LTS 4.15.0-1030-azure docker://3.0.1
❯ ks get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
cluster-autoscaler-84984799fd-22j42 1/1 Running 0 2h 10.244.1.5 aks-nodepool1-32797235-3 <none>
heapster-5d6f9b846c-g7qb8 2/2 Running 0 1h 10.244.0.16 aks-nodepool1-32797235-0 <none>
kube-dns-v20-598f8b78ff-8pshc 4/4 Running 0 3h 10.244.1.4 aks-nodepool1-32797235-3 <none>
kube-dns-v20-598f8b78ff-plfv8 4/4 Running 0 1h 10.244.0.15 aks-nodepool1-32797235-0 <none>
kube-proxy-fjvjv 1/1 Running 0 1h 10.240.0.6 aks-nodepool1-32797235-3 <none>
kube-proxy-szr8z 1/1 Running 0 1h 10.240.0.4 aks-nodepool1-32797235-0 <none>
kube-svc-redirect-2rhvg 2/2 Running 0 4h 10.240.0.4 aks-nodepool1-32797235-0 <none>
kube-svc-redirect-r2m4r 2/2 Running 0 4h 10.240.0.6 aks-nodepool1-32797235-3 <none>
kubernetes-dashboard-68f468887f-c8p78 1/1 Running 0 4h 10.244.0.7 aks-nodepool1-32797235-0 <none>
metrics-server-5cbc77f79f-44f9w 1/1 Running 0 4h 10.244.0.3 aks-nodepool1-32797235-0 <none>
tiller-deploy-57f988f854-z9qln 1/1 Running 0 4h 10.244.0.8 aks-nodepool1-32797235-0 <none>
tunnelfront-7cf9d447f9-56g7k 1/1 Running 0 4h 10.244.0.2 aks-nodepool1-32797235-0 <none>
What needs to be changed (number of replicas? PDB configuration?) for down-scaling to work?

Basically, this is an administrative issue that comes up when draining nodes whose pods are covered by a PDB (Pod Disruption Budget).
This is because evictions are forced to respect the PDB you specify.
You have two options:
Either force the hand:
kubectl drain foo --force --grace-period=0
Note that --force only lets the drain continue past pods that are not managed by a controller; evictions still honor the PDB. You can check the other options in the docs: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain
Or use the eviction API:
{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
In either case, drain and the eviction API attempt to delete the pods so that they can be scheduled elsewhere before the node is completely drained.
As mentioned in the docs:
the API can respond in one of three ways:
If the eviction is granted, then the pod is deleted just as if you had sent a DELETE request to the pod’s URL and you get back 200 OK.
If the current state of affairs wouldn’t allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests
If there is some kind of misconfiguration, like multiple budgets pointing at the same pod, you will get 500 Internal Server Error.
For a given eviction request, there are two cases:
There is no budget that matches this pod. In this case, the server always returns 200 OK.
There is at least one budget. In this case, any of the three above responses may apply.
If it gets stuck, you might need to do it manually.
You can read more here or here.
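
To see why a minAvailable: 1 budget blocks a single-replica pod such as metrics-server, it helps to work out how many evictions the budget allows. The sketch below is a simplified model, not the controller's code: it handles integer values only (no percentages), and the function name disruptions_allowed is made up. It follows the rule that the allowed disruptions are the currently healthy pods minus the number that must stay healthy, never below zero.

```python
def disruptions_allowed(current_healthy, desired_replicas,
                        min_available=None, max_unavailable=None):
    """Simplified PDB arithmetic for integer budgets (percentages not handled)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = max(desired_replicas - max_unavailable, 0)
    return max(current_healthy - desired_healthy, 0)


# Single replica with minAvailable: 1, as in the question: no eviction allowed.
print(disruptions_allowed(1, 1, min_available=1))    # 0
# Two replicas with minAvailable: 1: one pod may be evicted at a time.
print(disruptions_allowed(2, 2, min_available=1))    # 1
# Single replica with maxUnavailable: 1: the pod may be evicted.
print(disruptions_allowed(1, 1, max_unavailable=1))  # 1
```

Under this model, the scale-down can proceed once the budget allows at least one disruption, which means either running more than one replica or expressing the budget as maxUnavailable: 1 for single-replica workloads.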

Kubernetes on Raspberry Pi kube flannel CrashLoopBackOff and kube dns rpc error code = 2

I used this tutorial to set up a Kubernetes cluster on my Raspberry Pi 3.
I followed the instructions until the setup of flannel by:
curl -sSL https://rawgit.com/coreos/flannel/v0.7.0/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | kubectl create -f -
I get the following error message on kubectl get po --all-namespaces:
kube-system etcd-node01 1/1 Running 0 34m
kube-system kube-apiserver-node01 1/1 Running 0 34m
kube-system kube-controller-manager-node01 1/1 Running 0 34m
kube-system kube-dns-279829092-x4dc4 0/3 rpc error: code = 2 desc = failed to start container "de9b2094dbada10a0b44df97d25bb629d6fbc96b8ddc0c060bed1d691a308b37": Error response from daemon: {"message":"cannot join network of a non running container: af8e15c6ad67a231b3637c66fab5d835a150da7385fc403efc0a32b8fb7aa165"} 15 39m
kube-system kube-flannel-ds-zk17g 1/2 CrashLoopBackOff 11 35m
kube-system kube-proxy-6zwtb 1/1 Running 0 37m
kube-system kube-proxy-wbmz2 1/1 Running 0 39m
kube-system kube-scheduler-node01 1/1 Running
Interestingly, I have the same issue when installing Kubernetes with flannel on my laptop using another tutorial.
Version details are here:
Client Version: version.Info{Major:"1", Minor:"6",
GitVersion:"v1.6.3",
GitCommit:"0480917b552be33e2dba47386e51decb1a211df6",
GitTreeState:"clean", BuildDate:"2017-05-10T15:48:59Z",
GoVersion:"go1.8rc2", Compiler:"gc", Platform:"linux/arm"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3",
GitCommit:"0480917b552be33e2dba47386e51decb1a211df6",
GitTreeState:"clean", BuildDate:"2017-05-10T15:38:08Z",
GoVersion:"go1.8rc2", Compiler:"gc", Platform:"linux/arm"}
Any suggestions that might help?

I solved this issue by generating cluster-roles before setting up the pod network driver:
curl -sSL https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel-rbac.yml | sed "s/amd64/arm/g" | kubectl create -f -
Then setting up the pod network driver by:
curl -sSL https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | kubectl create -f -
Worked for me so far...

kube-dns stays in ContainerCreating status

I have 5 machines running Ubuntu 16.04.1 LTS. I want to set them up as a Kubernetes cluster. I'm trying to follow this getting started guide, which uses kubeadm.
It all worked fine until step 3/4, Installing a pod network. I looked at their addon page for a pod network and chose the flannel overlay network. I've copied the YAML file to the machine and executed:
root#up01:/home/up# kubectl apply -f flannel.yml
Which resulted in:
configmap "kube-flannel-cfg" created
daemonset "kube-flannel-ds" created
So I thought that it went OK, but when I list all the pods:
root#up01:/etc/kubernetes/manifests# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system dummy-2088944543-d5f50 1/1 Running 0 50m
kube-system etcd-up01 1/1 Running 0 48m
kube-system kube-apiserver-up01 1/1 Running 0 50m
kube-system kube-controller-manager-up01 1/1 Running 0 49m
kube-system kube-discovery-1769846148-jvx53 1/1 Running 0 50m
kube-system kube-dns-2924299975-prlgf 0/4 ContainerCreating 0 49m
kube-system kube-flannel-ds-jb1df 2/2 Running 0 32m
kube-system kube-proxy-rtcht 1/1 Running 0 49m
kube-system kube-scheduler-up01 1/1 Running 0 49m
The problem is that kube-dns stays in the ContainerCreating state. I don't know what to do.

It is very likely that you missed this critical piece of information from the guide:
If you want to use flannel as the pod network, specify
--pod-network-cidr 10.244.0.0/16 if you’re using the daemonset manifest below.
If you omit this kube-dns will never leave the ContainerCreating STATUS.
Your kubeadm init command should be:
# kubeadm init --pod-network-cidr 10.244.0.0/16
and not
# kubeadm init
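
A quick way to double-check this is to compare the network in flannel's configuration with the CIDR passed to kubeadm. The sketch below assumes the net-conf.json entry of the kube-flannel-cfg ConfigMap has been copied into a string; the helper name flannel_network_matches and the sample NET_CONF are illustrative, although the Network value shown is the one the flannel manifest uses.

```python
import ipaddress
import json


def flannel_network_matches(net_conf_json, pod_network_cidr):
    """Return True if flannel's Network equals the cluster pod network CIDR."""
    if not pod_network_cidr:
        # kubeadm init was run without --pod-network-cidr.
        return False
    flannel_net = ipaddress.ip_network(json.loads(net_conf_json)["Network"])
    return flannel_net == ipaddress.ip_network(pod_network_cidr)


# Sample net-conf.json content, assumed to mirror the flannel manifest.
NET_CONF = '{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan"}}'

print(flannel_network_matches(NET_CONF, "10.244.0.0/16"))  # matches
print(flannel_network_matches(NET_CONF, None))             # flag was omitted
```

A False result for a cluster that was initialized without the flag corresponds to the situation described in the question.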

Did you try restarting NetworkManager? It worked for me. It also worked when I disabled IPv6.
