Rocks cluster head node DNS failure: compute nodes unable to resolve hostnames

I've been tasked with maintaining a Rocks (CentOS 6.2-based) cluster in which the head node has a static IP on the public network and acts as a NAT router for the compute nodes on the internal private network. The nodes are connected to the head node by standard Ethernet and also by QDR InfiniBand.
Recently, the compute nodes have been unable to reach an external data source to begin computations: DNS lookup fails when they use wget to pull down publicly available datasets. Every compute node has the head node's IP in its /etc/resolv.conf, and I've checked the iptables firewall on the head node; nothing has changed. SSH works between all nodes and the head node. When I use the IP address of some of the data sources for manually initiated transfers, data flows again, but some of the applications cannot use raw IPs to grab data. I've tried restarting named and the iptables firewall, and so far nothing has fixed it. System logs (dmesg, /var/log/messages) show no sudden failures or error messages, I've made no recent configuration changes, and everything had worked fine for months until about two nights ago. The head node can access and resolve names fine; it's only the compute nodes behind the NAT head node that are failing.
I'm still unfamiliar with the inner workings of Rocks and am not sure whether there is some special Rocks command I'm overlooking. What might I be missing to get DNS resolution working again?
Thanks in advance!
UPDATE: DNS is working internally between the compute nodes and the head node (e.g. compute-10-10 resolves to that node's IP address from all other nodes), so the head node is functioning properly as the cluster DNS server. Requests for domains outside the local zone still fail (e.g. nslookup google.com) from all compute nodes.
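One way to narrow down where the lookup breaks is to query the head node's resolver and an external resolver directly from a compute node (a sketch; the IPs below are placeholders and assume dig/bind-utils is installed on the compute nodes):
# from a compute node: ask the head node's named for an external name
dig @10.1.1.1 google.com
# ask a public resolver directly, which tests plain UDP/53 forwarding through the NAT
dig @8.8.8.8 google.com
If the second query succeeds while the first fails, the problem is in named (or its upstream forwarders) on the head node rather than in the NAT/iptables rules.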

The root cause was a failed upstream DNS server. I reconfigured the forwarders in /etc/named.conf on the head node to point at other servers, and all compute nodes could access external resources once again.
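For reference, a minimal sketch of that change (the forwarder addresses below are placeholders, not the servers actually used):
# /etc/named.conf on the head node (relevant excerpt)
options {
        forwarders { 8.8.8.8; 8.8.4.4; };  # replace the dead upstream with reachable resolvers
        forward only;                       # optional: never fall back to querying root servers
};
Reload the service afterwards with rndc reload (or service named restart on CentOS 6).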

Related

kubernetes master failing to join cluster

We're using Kubernetes 1.9.3 managed via kops 1.9.3 in AWS, with gossip-based DNS and the Weave CNI network plugin.
I was doing a rolling update of the master instance groups to enable some additional admission controllers (PodNodeSelector and PodTolerationRestriction). I had done this in two other clusters with no problems. When the update got to the third master (we run a 3-master setup), it brought down the old instance and brought up a replacement, but the new master failed to join the cluster. On further investigation and subsequent attempts to roll the third master, I found that the failing master keeps trying to join the cluster under the old master's IP address, even though its own IP address is different. Watching kubectl get nodes | grep master shows that the cluster still thinks the node has the old IP address, and the join fails because it no longer has that IP. It seems that, for some reason, the cluster's gossip-based DNS is not being notified of the new master's IP address.
This causes problems because the kubernetes service still lists the old master's IP address, so any API requests directed to that non-existent backend fail. It also causes problems for etcd, which keeps trying to contact the member on the old IP address. We see lots of logs like this:
2018-10-29 22:25:43.326966 W | etcdserver: failed to reach the peerURL(http://etcd-events-f.internal.kops-prod.k8s.local:2381) of member 3b7c45b923efd852 (Get http://etcd-events-f.internal.kops-prod.k8s.local:2381/version: dial tcp 10.34.6.51:2381: i/o timeout)
2018-10-29 22:25:43.327088 W | etcdserver: cannot get the version of member 3b7c45b923efd852 (Get http://etcd-events-f.internal.kops-prod.k8s.local:2381/version: dial tcp 10.34.6.51:2381: i/o timeout)
One odd thing: if I run etcdctl cluster-health on the surviving masters' etcd instances, they all report the unhealthy member ID as f90faf39a4c5d077, but the etcd-events logs report the unhealthy member ID as 3b7c45b923efd852. So there seems to be some inconsistency in etcd.
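One way to see what each node actually has registered is to list the members from each etcd endpoint and compare the IDs (a sketch using etcdctl v2 syntax; the client ports assume the usual kops layout of 4001 for etcd-main and 4002 for etcd-events):
# run on each surviving master
etcdctl --endpoints http://127.0.0.1:4001 member list
etcdctl --endpoints http://127.0.0.1:4002 member list
Different IDs between the two commands would be expected, since etcd-main and etcd-events are separate clusters with independent member IDs.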
Since we run a three-master setup and one master is already down, we don't want to restart either of the other masters to try to fix the problem, because we're afraid of losing quorum on the etcd cluster.
We use weave 2.3.0 as our network CNI provider.
I noticed that on the failing master the Weave CNI config /etc/cni/net.d/10-weave.conf isn't getting created, and that the /etc/hosts files on the working masters aren't being updated with the new master's IP address. It seems like kube-proxy isn't getting the update for some reason.
We're running the default Debian 8 (jessie) image provided with kops 1.9.
How can we get the master to properly update DNS with its new IP address?
My co-worker found that the fix was restarting the kube-dns and kube-dns-autoscaler pods. We're still not sure why they were failing to update DNS with the new master's IP, but after restarting them, adding the new master to the cluster worked fine.
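For anyone hitting the same thing, "restarting" the pods just meant deleting them and letting their Deployments recreate them (a sketch; the label selectors assume the standard kube-dns manifests that kops deploys):
kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n kube-system delete pod -l k8s-app=kube-dns-autoscaler
# watch them come back and confirm the new master's address is picked up
kubectl -n kube-system get pods -l k8s-app=kube-dns -w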

How to change IP addresses of memsql nodes

I'm testing out MemSQL for a project by running it on a laptop in its simplest configuration. It was working fine at home with an IP address of 192.168.0.22. When I take the laptop into work, it gets a different IP address (10.0.1.35), and when I start up the server, it's unable to bring the nodes online. I get this message in the Ops app:
192.168.0.22:3306: This MemSQL node is offline, but MemSQL Ops expects it to be online.
192.168.0.22:3307: This MemSQL node is offline, but MemSQL Ops expects it to be online.
Is there any way to change the IP addresses of the nodes so I can run MemSQL in either location?
To change the IP you probably need to update it in two places:
Ops: On the command line run:
memsql-ops memsql-unmonitor <old memsql id>
memsql-ops memsql-monitor [-h <HOST>] [-P <PORT>]
MemSQL: Connect to MemSQL and run
REMOVE LEAF '<old ip>':port FORCE;
ADD LEAF root@'<new ip>':port;
It sounds like you are running both nodes on the same machine, in which case you may want to use 127.0.0.1 as the IP to avoid issues with your machine's IP changing.
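As a concrete illustration of the commands above, re-registering a local leaf on port 3307 might look like this (purely illustrative values; adjust the old IP, user, and port to your setup):
-- run against the master aggregator
REMOVE LEAF '192.168.0.22':3307 FORCE;
ADD LEAF root@'127.0.0.1':3307;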

Configuring hostname for memcached on EC2 instances

I'm using Memcached on each of my EC2 web server instances. I am not sure how to configure the various hostnames for the memcache nodes at the server level.
Consider the following example:
<?php
$mc = new Memcached();
$mc->addServer('node1', 11211);
$mc->addServer('node2', 11211);
$mc->addServer('node3', 11211);
How are node1, node2, node3 configured?
I've read about a few setups that configure the instance's hostname and update /etc/hosts with these entries. However, I'm not familiar enough with configuring such things.
I'm looking for a solution that scales, handles adding and removing instances, and is automatic.
The difficulty with this is keeping an up-to-date list of hosts within your application. When hosts can be added and removed, keeping that list current is a challenge. You may be able to use some sort of proxy, which would help by giving your application a constant endpoint.
If you can't use a proxy, I have a couple of ideas.
If the list of hosts is static, assign an Elastic IP to each memcached host and use its public DNS name; within the EC2 region, that name resolves to the private IP address of the host it's associated with. This gives you a constant list of hosts that your application can use.
If you are going to add/remove hosts on a regular basis, you need to be able to update the list of hosts your application uses dynamically. You can query the EC2 API for instances with a certain tag, then collect the IP addresses of all of those instances. Cache the list in memory or on disk and load it from your application. If you run this every minute, any host change should propagate within a minute, unless the EC2 API is slow to update.
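A minimal sketch of that lookup using the AWS CLI (the tag key/value are assumptions; use whatever tag your memcached instances actually carry):
aws ec2 describe-instances \
  --filters "Name=tag:role,Values=memcached" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text > /var/cache/memcached-hosts
Run it from cron every minute and have the application read the file (or the equivalent in-memory list) when it builds its Memcached server list.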

RabbitMQ Cluster on EC2: Hostname Issues

I want to set up a 3-node RabbitMQ cluster on EC2 (Amazon Linux). We'd like to have recovery implemented so that if we lose a server it can be replaced by a new server automagically. We can easily set the cluster up manually using the default hostname (ip-xx-xx-xx-xx), so that the broker id is rabbit@ip-xx-xx-xx-xx. This works because the hostname is resolvable over the network.
The problem is that this hostname changes if we lose or reboot a server, which invalidates the cluster. We haven't had luck setting a custom static hostname because such names are not resolvable by the other machines in the cluster; that's the only part of that article that doesn't make sense.
Has anyone accomplished a RabbitMQ Cluster on EC2 with a recovery implementation? Any advice is appreciated.
You could create three A records in an external DNS service for the three boxes and use those names in the config, e.g. rabbit1.alph486.com, rabbit2.alph486.com and rabbit3.alph486.com. These could even point at the EC2 private IP addresses; if all of the boxes are in the same region, it'll be faster and cheaper. If you lose a box, just update the DNS record.
Additionally, you could assign an Elastic IP to each of the three boxes. Then, when you lose a box, all you'd need to do is associate the Elastic IP with its replacement.
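Re-pointing an Elastic IP at the replacement instance is a single API call; for example, with the AWS CLI (the IDs below are placeholders):
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0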
Of course, if you have a small number of clients, you could just add entries into the /etc/hosts file on each box and update as needed.
From:
http://www.rabbitmq.com/ec2.html
Issues with hostname
RabbitMQ names the database directory using the current hostname of the system. If the hostname changes, a new empty database is created. To avoid data loss it's crucial to set up a fixed and resolvable hostname. For example:
sudo -s # become root
echo "rabbit" > /etc/hostname
echo "127.0.0.1 rabbit" >> /etc/hosts
hostname -F /etc/hostname
@Chrskly gave good answers that reflect the general consensus of the Rabbit community:
Init scripts that handle DNS or identification of the other servers are mostly what I hear about.
We could not get Elastic IPs to work without the aid of DNS or hostname aliases, because the internal IP/DNS names on Amazon still rotate, and the public IP/DNS names that stay static cannot be used as the hostname for Rabbit unless aliased properly.
Hosts-file manipulation via a script is also an option. It needs to be accompanied by a script that can identify the DNS names of the other servers at launch, so it doesn't save much work in terms of making the configuration more "solid state".
What I'm doing:
Due to some limitations on the DNS front, I am opting to use bootstrap scripts to initialize each machine and cluster it with any other available machines, using the default internal DNS name assigned at launch. If we lose a machine, a new one will come up, prepare Rabbit, and look up the DNS names of machines to cluster with. It will then remove the dead node from the cluster for housekeeping.
I'm using some homebrew init scripts in Python. However, this could easily be done with something like Chef/Puppet.
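The clustering part of such a bootstrap script boils down to a handful of rabbitmqctl calls; a rough sketch (node names are placeholders, and discovering the peer hostnames is whatever mechanism your script uses):
# on the newly launched node, once rabbitmq-server is running
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@ip-10-0-0-12    # any existing cluster member discovered at launch
rabbitmqctl start_app
# housekeeping: from any running member, drop the node that was replaced
rabbitmqctl forget_cluster_node rabbit@ip-10-0-0-99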

UnknownHostException on tasktracker in Hadoop cluster

I have set up a pseudo-distributed Hadoop cluster (with jobtracker, a tasktracker, and namenode all on the same box) per tutorial instructions and it's working fine. I am now trying to add in a second node to this cluster as another tasktracker.
When I examine the logs on Node 2, all the logs look fine except for the tasktracker's. I'm getting an infinite loop of the error message listed below. It seems that the TaskTracker is trying to use the hostname SSP-SANDBOX-1.mysite.com rather than the IP address. This hostname is not in /etc/hosts, so I'm guessing that's where the problem is coming from. I do not have root access, so I can't add it to /etc/hosts myself.
Is there any property or configuration I can change so that it will stop trying to connect using the hostname?
Thanks very much,
2011-01-18 17:43:22,896 ERROR org.apache.hadoop.mapred.TaskTracker:
Caught exception: java.net.UnknownHostException: unknown host: SSP-SANDBOX-1.mysite.com
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:850)
at org.apache.hadoop.ipc.Client.call(Client.java:720)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy5.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1033)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
This blog posting might be helpful:
http://western-skies.blogspot.com/2010/11/fix-for-exceeded-maxfaileduniquefetches.html
The short answer is that Hadoop performs reverse hostname lookups even if you specify IP addresses in your configuration files. In your environment, in order for you to make Hadoop work, SSP-SANDBOX-1.mysite.com must resolve to the IP address of that machine, and the reverse lookup for that IP address must resolve to SSP-SANDBOX-1.mysite.com.
So you'll need to talk to whoever administers those machines and either fudge the hosts file or provide a DNS server that will do the right thing.
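Once an entry or record is in place, you can verify both directions resolve from the new node before restarting the tasktracker (the IP below is a placeholder):
# forward lookup: hostname -> IP (goes through NSS, so an /etc/hosts entry counts)
getent hosts SSP-SANDBOX-1.mysite.com
# reverse lookup: IP -> hostname
getent hosts 10.1.2.3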
