We're using k8s 1.9.3 managed via kops 1.9.3 in AWS, with gossip-based DNS and the weave CNI network plugin.
I was doing a rolling update of the master instance groups to enable some additional admission controllers (PodNodeSelector and PodTolerationRestriction). I had done this in two other clusters with no problems. When the rolling update reached the third master (we run a three-master setup), it terminated the old instance and brought up a replacement, but the new master failed to join the cluster. On further investigation, and after several more attempts to roll the third master, I found that the failing master keeps trying to join the cluster under the old master's IP address, even though its actual IP address is different. Watching kubectl get nodes | grep master shows that the cluster still thinks the node has the old IP, and the join fails because the instance no longer owns that address. It seems that, for some reason, the gossip-based DNS is not being notified of the new master's IP address.
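For reference, this is roughly how I've been confirming the stale registration (node names and the interface are whatever your environment uses):
kubectl get nodes -o wide | grep master   # the new master is still listed with the old InternalIP
# on the new master instance itself:
ip -4 addr show eth0                      # the actual address differs from what the cluster reports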
This is causing problems because the kubernetes service still has the old master's IP address as an endpoint, so any API requests directed to that non-existent backend fail. It is also causing problems for etcd, which keeps trying to contact the member at the old IP address. Lots of logs like this:
2018-10-29 22:25:43.326966 W | etcdserver: failed to reach the peerURL(http://etcd-events-f.internal.kops-prod.k8s.local:2381) of member 3b7c45b923efd852 (Get http://etcd-events-f.internal.kops-prod.k8s.local:2381/version: dial tcp 10.34.6.51:2381: i/o timeout)
2018-10-29 22:25:43.327088 W | etcdserver: cannot get the version of member 3b7c45b923efd852 (Get http://etcd-events-f.internal.kops-prod.k8s.local:2381/version: dial tcp 10.34.6.51:2381: i/o timeout)
One odd thing: if I run etcdctl cluster-health against the etcd instances on the available masters, they all report the unhealthy member ID as f90faf39a4c5d077, but the etcd-events logs report the unhealthy member ID as 3b7c45b923efd852. So there seems to be some inconsistency in etcd.
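For reference, here is how I'm comparing the two etcd clusters' views (client ports assume the kops defaults of 4001 for the main etcd and 4002 for etcd-events; adjust if your manifests differ). It's possible the two IDs simply belong to the two different clusters:
etcdctl --endpoints=http://127.0.0.1:4001 cluster-health   # main etcd cluster
etcdctl --endpoints=http://127.0.0.1:4002 cluster-health   # etcd-events cluster
etcdctl --endpoints=http://127.0.0.1:4002 member list      # member IDs for etcd-events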
Since we are running a three-master setup with one master already down, we don't want to restart either of the other masters to try to fix the problem, because we're afraid of losing quorum on the etcd cluster.
We use weave 2.3.0 as our CNI network provider.
I noticed on the failing master that the weave CNI config /etc/cni/net.d/10-weave.conf isn't getting created, and the /etc/hosts files on the working masters aren't being updated with the new master's IP address. It seems like kube-proxy isn't receiving the update for some reason.
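These are the checks I've been running (the label and container names assume the standard weave-net DaemonSet):
ls /etc/cni/net.d/                                             # on the failing master, 10-weave.conf never appears
kubectl -n kube-system logs -l name=weave-net -c weave --tail=50
journalctl -u kubelet | grep -i cni | tail -n 20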
We're running the default Debian 8 (jessie) image that is provided with kops 1.9.
How can we get the master to properly update DNS with its new IP address?
My co-worker found that the fix was restarting the kube-dns and kube-dns-autoscaler pods. We're still not sure why they were failing to update DNS with the new master's IP, but after restarting them, adding the new master to the cluster worked fine.
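For anyone hitting the same thing, the restart amounted to deleting the pods and letting their Deployments recreate them (label selectors assume the stock kops kube-dns manifests):
kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n kube-system delete pod -l k8s-app=kube-dns-autoscaler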
I'm trying to set up a cluster of one machine for now. I know that I can get the API server running and listening on some ports.
I am looking to issue commands against the master machine from my laptop.
KUBECONFIG=/home/slackware/kubeconfig_of_master kubectl get nodes should send a request to the master machine, hit the API server, and get a response of the running nodes.
However, I am hitting permission issues. One is similar to x509: certificate is valid for 10.61.164.153, not 10.0.0.1. Another is a 403 when I hit the kubectl proxy --port=8080 that is running on the master machine.
I think two solutions are possible, with (B) being preferable:
A. Add my laptop's IP address to the list of accepted IP addresses that the API server, its certificates, or certificate agents hold. How would I do that? Is that something I can set in kubeadm init?
B. Add 127.0.0.1 to the list of accepted IP addresses that the API server, its certificates, or certificate agents hold. How would I do that? Is that something I can set in kubeadm init?
I think B would be better, because I could create an SSH tunnel from my laptop to the remote machine and allow my teammates (if I ever have any) to do the same.
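Concretely, what I have in mind for (B) is roughly this (the master address is a placeholder, and it assumes the API server listens on 6443 and that 127.0.0.1 ends up in the certificate's SANs):
ssh -N -L 6443:127.0.0.1:6443 slackware@<master-ip> &
# the kubeconfig's cluster entry would then point at https://127.0.0.1:6443
KUBECONFIG=/home/slackware/kubeconfig_of_master kubectl get nodes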
Thank you,
Slackware
You should add --apiserver-cert-extra-sans 10.0.0.1 to your kubeadm init command.
Refer to https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#options
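For example (a sketch only; adding 127.0.0.1 as well would cover your SSH-tunnel idea):
kubeadm init --apiserver-cert-extra-sans=10.0.0.1,127.0.0.1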
You can also use a config file instead:
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.16.2
apiServer:
  certSANs:
  - 10.0.0.1
You can find all relevant info here: https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2
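A sketch of how you would use it, assuming the snippet above is saved as kubeadm-config.yaml and the cluster hasn't been initialized yet; the openssl check afterwards confirms which SANs the API server certificate actually contains:
kubeadm init --config kubeadm-config.yaml
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'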
I've been tasked with maintaining a Rocks (CentOS 6.2-based) cluster where the head node has a static IP on the public network and acts as a NAT router for the compute nodes on the internal private network. The nodes are connected to the head node by standard Ethernet and also QDR InfiniBand.
Recently, the compute nodes have been unable to access an external data source to begin computations: DNS lookups fail when they use wget to pull down publicly available datasets. All compute nodes have the head node's IP in their /etc/resolv.conf, and I've checked the iptables firewall on the head node; nothing has changed. SSH works between all nodes and the head node. When I use the IP addresses of some of the data sources for manually initiated transfers, data flows again, but some of the applications cannot use IPs to grab data. I've tried restarting named and the iptables firewall, and so far nothing has fixed it. System logs (dmesg, /var/log/messages) show no sudden failures or error messages, I've made no recent configuration changes, and everything had worked fine for months until about two nights ago. The head node can access and resolve names fine; it's only the compute nodes behind the NAT head node that are not working.
I'm still unfamiliar with all the workings of Rocks and am not sure if there is some special rocks command I'm overlooking. What might I be missing to get DNS resolution working again?
Thanks in advance!
UPDATE: DNS is working internally between the compute nodes and the head node (e.g. compute-10-10 resolves to that node's IP address from all other nodes), so the head node is functioning properly as the cluster DNS. Lookups for domains outside the local zone are still failing (e.g. nslookup google.com fails) on all compute nodes.
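These are the checks I'm using to narrow it down (run on the head node; the forwarder address is a placeholder taken from /etc/named.conf):
dig google.com @127.0.0.1 +time=2 +tries=1        # ask the local named directly
grep -A 3 forwarders /etc/named.conf              # see which upstream servers it forwards to
dig google.com @<forwarder-ip> +time=2 +tries=1   # query each forwarder directly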
The root cause was a failed upstream DNS server. I reconfigured the forwarder options in /etc/named.conf to point at other servers, and all compute nodes could access external resources once again.
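A sketch of the change (the forwarder addresses below are placeholders, not the ones actually used):
# /etc/named.conf on the head node:
#   options {
#       ...
#       forwarders { 8.8.8.8; 8.8.4.4; };
#       forward only;
#   };
rndc reload    # or: service named restart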
I'm testing out MemSQL for a project by running it on a laptop in its simplest configuration. It was working fine at home with an IP address of 192.168.0.22. When I take the laptop in to work, it gets a different IP address (10.0.1.35), and when I start up the server, it's unable to bring the nodes online. I get this message in the Ops app:
192.168.0.22:3306: This MemSQL node is offline, but MemSQL Ops expects it to be online.
192.168.0.22:3307: This MemSQL node is offline, but MemSQL Ops expects it to be online.
Is there any way to change the IP addresses of the nodes so I can run memsql in either location?
To change the IP you probably need to update it in two places:
Ops: On the command line run:
memsql-ops memsql-unmonitor <old memsql id>
memsql-ops memsql-monitor [-h <HOST>] [-P <PORT>]
MemSQL: Connect to MemSQL and run
REMOVE LEAF '<old ip>':port FORCE;
ADD LEAF root@'<new ip>':port;
It sounds like you are running both nodes on the same machine, in which case you may want to use 127.0.0.1 as the IP to avoid issues with your machine's IP changing.
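Putting it together for a single-laptop setup, a rough sequence might look like this (the IDs, ports, and old IP are placeholders based on your messages above):
memsql-ops memsql-unmonitor <old memsql id>
memsql-ops memsql-monitor -h 127.0.0.1 -P 3307
# then, connected to the master aggregator:
#   REMOVE LEAF '192.168.0.22':3307 FORCE;
#   ADD LEAF root@'127.0.0.1':3307;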
I have a problem: once the host name is set, the cluster won't update its IP, even when the DNS record changes.
Or, what is the recommended way of making the application resilient to the fact that more nodes can be added to the DNS round robin and old nodes decommissioned?
I had the same thing with the Astyanax driver. It looks to me like it works this way:
The DNS name is used only when the initial connection to the cluster is created. At that point the driver collects data about the cluster nodes. This information is kept as IP addresses, and the DNS names are not used any more. Subsequent changes in the cluster topology are also propagated to the client using IP addresses.
So, when you add more nodes to the cluster, you do not actually have to assign domain names to them. Just adding a node to the cluster propagates its IP address into the cluster topology table, and this info is distributed among all cluster members and smart clients like the Java driver (some third-party clients might not have this info and will only pass queries through the seed nodes).
When you decommission a node it works the same way: all cluster nodes and smart clients receive the information that the node with that particular IP is no longer in the cluster. It can even be the initial seed node.
The domain name only matters for clients that haven't yet established a cluster connection.
If you really need to switch the IP, you have to (a command-level sketch follows the list):
Join a node with the new IP
Decommission the node with the old IP
Assign the DNS name to the new IP
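At the command level the steps above look roughly like this (a sketch; host names and DNS records are whatever your environment uses):
# 1. Bootstrap a node with the new IP, then confirm it joined:
nodetool status
# 2. On the node that still has the old IP, retire it:
nodetool decommission
# 3. Point the DNS record your clients use for initial contact points at the new IP.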
I created 3 instances with 3 Elastic IP addresses pointing to these instances.
I did a yum install of dsc:
dsc12.noarch 1.2.13-1 #datastax
And /etc/cassandra/default.conf/cassandra.yaml has:
- seeds: [Elastic IP list]
But when I start cassandra via "service cassandra start" I see in /var/log/cassandra/cassandra.log:
...
Exception encountered during startup: Unable to contact any seeds!
...
And sure enough "nodetool status" shows:
Failed to connect to '127.0.0.1:7199': Connection refused
BUT:
If I change the value of the seeds to use the instances' private IPs, Cassandra starts just fine. I would expect it to work just fine with the Elastic IPs, but it doesn't.
Do you know why that is?
The reason I want the Elastic IPs to work is that I know those addresses ahead of time, so when I provision a machine with Puppet, I can pre-populate the seeds in the cassandra.yaml file. I don't know the private IP address until after the machine has booted :(
This is almost a duplicate of: Cassandra on Amazon EC2 with Elastic IP addresses
I believe your problem comes from the seed IPs not being the same as the nodes' broadcast IPs. To change this, modify the following line in each of your cassandra.yaml files:
# Address to broadcast to other Cassandra nodes
# Leaving this blank will set it to the same value as listen_address
broadcast_address: <node's elastic ip> #uncomment this line
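A sketch of applying this on each node, assuming broadcast_address is set to that node's Elastic IP and listen_address stays on the private IP:
sudo service cassandra restart
nodetool status    # the nodes should now report the Elastic IPs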