Kops rolling-update fails with "Cluster did not pass validation" for master node - linux

For some reason my master node can no longer connect to my cluster after upgrading from Kubernetes 1.11.9 to 1.12.9 via kops (version 1.13.0). In the manifest I'm upgrading kubernetesVersion from 1.11.9 -> 1.12.9. This is the only change I'm making. However, when I run kops rolling-update cluster --yes I get the following error:
Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-01234567" has not yet joined cluster.
Cluster did not validate within 5m0s
After that, if I run kubectl get nodes, I no longer see that master node in my cluster.
Doing a little bit of debugging by SSHing into the disconnected master node instance, I found the following error in my api-server log by running sudo cat /var/log/kube-apiserver.log:
controller.go:135] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: connect: connection refused
I suspect the issue might be related to etcd, because when I run sudo netstat -nap | grep LISTEN | grep etcd there is no output.
Anyone have any idea how I can get my master node back in the cluster or have advice on things to try?

I have done some research and have a few ideas for you:
If there is no output from the etcd grep, it means that your etcd server is down. Check which etcd container has exited and look at its logs (see the sketch below).
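Something along these lines, assuming etcd on a kops master runs as a Docker container (the container ID is a placeholder):
# list exited containers and pick out etcd
docker ps -a | grep Exited | grep etcd
# inspect the logs of that container
docker logs <etcd-container-id>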
Try these instructions I found:
1 - I removed the old master from the etcd cluster using etcdctl. You will need to connect to the etcd-server container to do this.
2 - On the new master node I stopped the kubelet and protokube services.
3 - Empty the etcd data dirs (data and data-events).
4 - Edit /etc/kubernetes/manifests/etcd.manifest and etcd-events.manifest, changing ETCD_INITIAL_CLUSTER_STATE from new to existing.
5 - Get the name and PeerURLs from the new master and use etcdctl to add the new master to the cluster (etcdctl member add "name" "PeerURL"). You will need to connect to the etcd-server container to do this. A sketch of these etcdctl commands follows the list.
6 - Start the kubelet and protokube services on the new master.
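A minimal sketch of the etcdctl part of steps 1 and 5, to be run inside the etcd-server container (member name, ID and IP are placeholders, and this assumes the etcd v2 flavour of etcdctl that kops shipped at the time):
# step 1: find and remove the old master's member entry
etcdctl member list
etcdctl member remove <old-member-id>
# step 5: add the new master with its name and peer URL
etcdctl member add <name> http://<new-master-ip>:2380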
If that is not the case, then you might have a problem with the certs. They are provisioned during the creation of the cluster, and some of them contain the allowed master endpoints. If that is the case, you'd need to create new certs and roll them out for the API server/etcd clusters.
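For checking whether the current API server certificate actually contains the master's endpoints, something like this may help (a sketch only; the certificate path is an assumption and differs between kops versions):
# inspect the SANs of the API server certificate (path is an example, adjust to your setup)
sudo openssl x509 -in /srv/kubernetes/server.cert -noout -text | grep -A1 "Subject Alternative Name"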
Please let me know if that helped.

Related

Error connecting second elasticsearch node to single node cluster

While following the tutorial steps at https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html I've managed to create a single-node Elasticsearch cluster.
But when running the following command to add a second Elasticsearch node to the existing cluster:
docker run -e ENROLLMENT_TOKEN="<token>" --name es02 --net elastic -it docker.elastic.co/elasticsearch/elasticsearch:8.3.2
I get the following error:
Unable to communicate with the node on https://172.18.0.92:9200/_security/enroll/node. Error was Connection timed out.
ERROR: Aborting enrolling to cluster. Could not communicate with the node on any of the addresses from the enrollment token. All of [172.18.0.92:9200] were attempted.
I would greatly appreciate hearing whether others are getting the same error, or if you know how to fix this issue. Thanks.
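A couple of checks that might narrow this down (a sketch only; the container name es01 and the token lifetime are assumptions based on the tutorial defaults):
# confirm both containers are attached to the same Docker network
docker network inspect elastic
# enrollment tokens expire after about 30 minutes; if in doubt, generate a fresh one on the first node
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s node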

Kubernetes Unable to connect to the server: dial tcp x.x.x.x:6443: i/o timeout

I am using a test Kubernetes cluster (a kubeadm setup with 1 master and 2 nodes). My public IP changes from time to time, and when it changes I am unable to connect to the cluster and get the error below:
Kubernetes Unable to connect to the server: dial tcp x.x.x.x:6443: i/o timeout
I also have a private IP, 10.10.10.10, which stays the same all the time.
I created the Kubernetes cluster using the command below:
kubeadm init --control-plane-endpoint 10.10.10.10
But it still fails because the certificates are signed for the public IP, and below is the error:
The connection to the server x.x.x.x:6443 was refused - did you specify the right host or port?
Can someone help me set up kubeadm so that it allows all IPs (something like 0.0.0.0)? I am fine with that from a security point of view since it is a test setup. Any permanent fix would also be welcome.
Since @Vidya has already solved this issue by using a static IP address, I decided to provide a Community Wiki answer just for better visibility to other community members.
First of all, it is not recommended to have a frequently changing master/server IP address.
As we can find in the discussion on GitHub kubernetes/88648 - kubeadm does not provide an easy way to deal with this.
However, there are a few workarounds that can help us, when the IP address on the Kubernetes master node changes.
Based on the discussion Changing master IP address, I prepared a script that regenerates certificates and re-initializes the master node.
This script might be helpful, but I recommend running one command at a time (it will be safer).
In addition, you may need to customize some steps to your needs:
NOTE: In the example below, I'm using Docker as the container runtime.
root@kmaster:~# cat reinit_master.sh
#!/bin/bash
set -e
echo "Stopping kubelet and docker"
systemctl stop kubelet docker
echo "Making backup kubernetes data"
mv /etc/kubernetes /etc/kubernetes-backup
mv /var/lib/kubelet /var/lib/kubelet-backup
echo "Restoring certificates"
mkdir /etc/kubernetes
cp -r /etc/kubernetes-backup/pki /etc/kubernetes/
rm /etc/kubernetes/pki/{apiserver.*,etcd/peer.*}
echo "Starting docker"
systemctl start docker
echo "Reinitializing master node"
kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd
echo "Updating kubeconfig file"
cp /etc/kubernetes/admin.conf ~/.kube/config
Then you need to rejoin the worker nodes to the cluster.
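A minimal sketch of that rejoin step (assuming a standard kubeadm setup; the token and hash are placeholders printed by the first command):
# on the re-initialized master: print a fresh join command
kubeadm token create --print-join-command
# on each worker: reset the old state, then run the printed join command
kubeadm reset
kubeadm join 10.10.10.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>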

Could not get apiVersions from Kubernetes: Unable to retrieve the complete list of server APIs

While trying to deploy an application, I got the error below:
Error: UPGRADE FAILED: could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
The output of kubectl api-resources lists some resources, along with the same error at the end.
Environment: Azure Cloud, AKS Service
Solution:
The steps I followed are:
kubectl get apiservices: If the metric-server service is down with the error CrashLoopBackOff, try to follow step 2; otherwise just try to restart the metric-server service using kubectl delete apiservice/"service_name". For me it was v1beta1.metrics.k8s.io.
kubectl get pods -n kube-system: here I found out that pods like metrics-server and kubernetes-dashboard were down because the main CoreDNS pod was down.
For me it was:
NAME READY STATUS RESTARTS AGE
pod/coredns-85577b65b-zj2x2 0/1 CrashLoopBackOff 7 13m
Use kubectl describe pod/"pod_name" to check the error in the CoreDNS pod. If it is down because of /etc/coredns/Corefile:10 - Error during parsing: Unknown directive proxy, then we need to use forward instead of proxy in the YAML file where the CoreDNS config is, because the CoreDNS 1.5.x version used by the image does not support the proxy keyword anymore (see the sketch below).
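A minimal sketch of that change (assuming the default coredns ConfigMap name and the k8s-app=kube-dns pod label used by standard CoreDNS deployments):
# open the CoreDNS config for editing
kubectl -n kube-system edit configmap coredns
# in the Corefile, replace the line
#   proxy . /etc/resolv.conf
# with
#   forward . /etc/resolv.conf
# then restart the CoreDNS pods so they pick up the change
kubectl -n kube-system delete pod -l k8s-app=kube-dns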
This error commonly happens when your metrics-server pod is not reachable by the master node. Possible reasons are:
The metric-server pod is not running. This is the first thing you should check. Then look at the logs of the metric-server pod to check if it has permission issues while trying to get metrics (a sketch of these checks follows this list).
Try to confirm communication between the master and slave nodes.
Try running kubectl top nodes and kubectl top pods -A to see if metric-server runs OK.
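A minimal sketch of those checks (the k8s-app=metrics-server label and the deployment name are assumptions based on the default metrics-server manifests):
# is the metric-server pod running?
kubectl -n kube-system get pods -l k8s-app=metrics-server
# look at its logs for permission or connectivity errors
kubectl -n kube-system logs deploy/metrics-server
# does the metrics API answer?
kubectl top nodes
kubectl top pods -A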
From these points you can proceed further.

Cannot setup multi-host Docker overlay network with etcd

I am trying to connect two Docker hosts with an overlay network and am using etcd as a KV-store. etcd is running directly on the first host (not in a container). I finally managed to connect the Docker daemon of the first host to etcd, but cannot manage to do the same for the Docker daemon on the second host.
I downloaded etcd from the Github releases page and followed the instructions under the "Linux" section.
After starting etcd, it is listening to the following ports:
etcdmain: listening for peers on http://localhost:2380
etcdmain: listening for peers on http://localhost:7001
etcdmain: listening for client requests on http://localhost:2379
etcdmain: listening for client requests on http://localhost:4001
And I started the Docker daemon on the first host (on which etcd is running as well) like this:
docker daemon --cluster-advertise eth1:2379 --cluster-store etcd://127.0.0.1:2379
After that, I could also create an overlay network with:
docker network create -d overlay <network name>
But I can't figure out how to start the daemon on the second host. No matter which values I tried for --cluster-advertise and --cluster-store, I keep getting the following error message:
discovery error: client: etcd cluster is unavailable or misconfigured
Both my hosts are using the eth1 interface. The IP of host1 is 10.10.10.10 and the IP of host2 is 10.10.10.20. I already ran iperf to make sure they can connect to each other.
Any ideas?
So I finally figured out how to connect the two hosts and to be honest, I don't understand why it took me so long to solve the problem. But in case other people run into the same problem I will post my solution here. As mentioned earlier, I downloaded etcd from the Github release page and extracted the tar file.
I followed the instructions from the etcd documentation and applied them to my situation. Instead of running etcd with all the options directly from the command line, I created a simple bash script. This makes it a lot easier to adjust the options and rerun the command. Once you have figured out the right options, it would be handy to place them separately in a config file and run etcd as a service, as explained in this tutorial. So here is my bash script:
#!/bin/bash
./etcd --name infra0 \
--initial-advertise-peer-urls http://10.10.10.10:2380 \
--listen-peer-urls http://10.10.10.10:2380 \
--listen-client-urls http://10.10.10.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.10.10.10:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.10.10.10:2380,infra1=http://10.10.10.20:2380 \
--initial-cluster-state new
I placed this file in the etcd-vX.X.X-linux-amd64 directory (that I had just downloaded and extracted), which also contains the etcd binary. On the second host I did the same thing but changed the --name from infra0 to infra1 and adjusted the IPs to those of the second host (10.10.10.20). The --initial-cluster option is not modified (see the sketch below).
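For reference, the host2 variant would look something like this (a sketch following the description above; only the name and the local addresses change):
#!/bin/bash
./etcd --name infra1 \
--initial-advertise-peer-urls http://10.10.10.20:2380 \
--listen-peer-urls http://10.10.10.20:2380 \
--listen-client-urls http://10.10.10.20:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.10.10.20:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.10.10.10:2380,infra1=http://10.10.10.20:2380 \
--initial-cluster-state new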
Then I executed the script on host1 first and then on host2. I'm not sure if the order matters, but in my case I got an error message when I did it the other way round.
To make sure your cluster is set up correctly you can run:
./etcdctl cluster-health
If the output looks similar to this (listing the two members) it should work.
member 357e60d488ae5ab3 is healthy: got healthy result from http://10.10.10.10:2379
member 590f234979b9a5ee is healthy: got healthy result from http://10.10.10.20:2379
If you want to be really sure, add a value to your store on host1 and retrieve it on host2:
host1$ ./etcdctl set myKey myValue
host2$ ./etcdctl get myKey
Setting up docker overlay network
In order to set up a Docker overlay network I had to restart the Docker daemon with the --cluster-store and --cluster-advertise options. My solution is probably not the cleanest one but it works. So on both hosts I first stopped the Docker service and then restarted the daemon with the options:
sudo service docker stop
sudo /usr/bin/docker daemon --cluster-store=etcd://10.10.10.10:2379 --cluster-advertise=10.10.10.10:2379
Note that on host2 the IP addresses need to be adjusted. Then I created the overlay network like this on one of the hosts:
sudo docker network create -d overlay <network name>
If everything worked correctly, the overlay network can now be seen on the other host. Check with this command:
sudo docker network ls

Docker Registry Stays Pending After Deployment

I have installed OpenShift Enterprise as per the online guide (quick installation) but I'm stuck at deploying the registry.
https://docs.openshift.com/enterprise/3.0/admin_guide/install/docker_registry.html#deploy-registry
I create the registry
oadm registry --config=/etc/openshift/master/admin.kubeconfig \
--credentials=/etc/openshift/master/openshift-registry.kubeconfig \
--images='registry.access.redhat.com/openshift3/ose-${component}:${version}'
I check that it was configured
[justin@172 ~]$ oc get se docker-registry
NAME LABELS SELECTOR IP(S) PORT(S)
docker-registry docker-registry=default docker-registry=default 172.30.144.220 5000/TCP
But it never runs; it stays pending:
[justin@172 ~]$ oc get pods
NAME READY STATUS RESTARTS AGE
docker-registry-1-deploy 0/1 Pending 0 2h
I try to get some more info
[justin@172 ~]$ oc logs docker-registry-1-deploy
[justin@172 ~]$
but the logs command returns nothing
I had attempted an install with one node sharing the machine with the master.
My nodes looked like this:
[root@master ~]# oc get nodes
NAME LABELS STATUS
master.mydomain.com kubernetes.io/hostname=master.mydomain.com Ready,SchedulingDisabled
Note: SchedulingDisabled
I ran this command:
oc describe pod docker-registry-1-deploy
And it gave the reason for not being deployed, which was that there were no nodes to schedule a deployment on. Just to get things going quickly, I performed the install again and added a node on another VM.
Then
[root@master ~]# oc get nodes
NAME LABELS STATUS
master.mydomain.com kubernetes.io/hostname=master.mydomain.com Ready,SchedulingDisabled
node1.mydomain.com kubernetes.io/hostname=node1.mydomain.com Ready
and I managed to successfully deploy the registry.
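For anyone who wants to keep the single-machine setup instead of adding a node, marking the master as schedulable may also work; a sketch only, assuming the OpenShift 3.x oadm tooling shown above:
# allow pods to be scheduled on the master node
oadm manage-node master.mydomain.com --schedulable=true
# verify that SchedulingDisabled is gone
oc get nodes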
