UnknownHostException on tasktracker in Hadoop cluster - linux

I have set up a pseudo-distributed Hadoop cluster (with jobtracker, a tasktracker, and namenode all on the same box) per tutorial instructions and it's working fine. I am now trying to add in a second node to this cluster as another tasktracker.
When I examine the logs on Node 2, all of them look fine except for the tasktracker's, which shows an endless loop of the error message listed below. It seems that the tasktracker is trying to use the hostname SSP-SANDBOX-1.mysite.com rather than the IP address. This hostname is not in /etc/hosts, so I'm guessing this is where the problem is coming from. I do not have root access, so I cannot add it to /etc/hosts.
Is there any property or configuration I can change so that it will stop trying to connect using the hostname?
Thanks very much,
2011-01-18 17:43:22,896 ERROR org.apache.hadoop.mapred.TaskTracker:
Caught exception: java.net.UnknownHostException: unknown host: SSP-SANDBOX-1.mysite.com
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:850)
at org.apache.hadoop.ipc.Client.call(Client.java:720)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy5.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1033)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)

This blog posting might be helpful:
http://western-skies.blogspot.com/2010/11/fix-for-exceeded-maxfaileduniquefetches.html
The short answer is that Hadoop performs reverse hostname lookups even if you specify IP addresses in your configuration files. In your environment, in order for you to make Hadoop work, SSP-SANDBOX-1.mysite.com must resolve to the IP address of that machine, and the reverse lookup for that IP address must resolve to SSP-SANDBOX-1.mysite.com.
So you'll need to talk to whoever is administering those machines to either fudge the hosts file or to provide a DNS server that will do the right thing.
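Before going to the administrators, it may help to confirm which direction of the lookup is failing. Here is a minimal sketch, assuming Python 3 is available on the node; the function name check_forward_reverse and its injectable forward/reverse parameters are my own, not part of Hadoop:

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, detail): does hostname resolve to an IP, and does that
    IP resolve back to the same hostname?"""
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed: {exc}"
    try:
        name, aliases, _ = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup for {ip} failed: {exc}"
    # Compare case-insensitively against the primary name and any aliases.
    known = {n.lower() for n in [name, *aliases]}
    if hostname.lower() not in known:
        return False, f"{ip} resolves back to {name}, not {hostname}"
    return True, f"{hostname} <-> {ip}"
```

Calling check_forward_reverse("SSP-SANDBOX-1.mysite.com") on each node should show whether the forward or the reverse lookup is the broken one. The forward and reverse parameters exist only so the function can be exercised without a real resolver.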

Related

getting hostname of remote computers on the local network not setup in /etc/hosts

I'm learning something new: I was trying to get a hostname using Python's socket module.
From my MacBook I ran the code below:
socket.gethostbyaddr("192.168.1.111")
and I got ('rock64', [], ['192.168.1.111']). Then I tried the IP address of a computer that used to be on the network but no longer is:
socket.gethostbyaddr("192.168.1.189")
and it returned ('mint', [], ['192.168.1.189']). Then I realised the names were coming from the /etc/hosts file.
Now, in that hosts file I also have this entry:
/etc/hosts
172.217.25.3 google.com.hk
But if I try to get the host from the IP of a WAN address, I get a different result than expected:
socket.gethostbyaddr("172.217.25.3")
That returns ('hkg07s24-in-f3.1e100.net', ['3.25.217.172.in-addr.arpa'], ['172.217.25.3'])
So I am now wondering where the hostname comes from in the latter case of a WAN IP address, and why, for local computers' IPs, I get the hostname from the configured /etc/hosts file.
How can we get the hostnames of computers on the local network without socket.gethostbyaddr having to look in the /etc/hosts file, or by other means?
This is an opinion-based answer to the question "how to build a registry of network devices on your local network?"
The best way to build a registry of devices on your local network is to set up ntopng on your gateway. It uses DPI (Deep Packet Inspection) techniques to collect information about hosts.
ntopng has a nice user interface and displays host names (when possible).
You can assign aliases to specific hosts that do not leak their host names via any protocol.
For some reason, the ntopng developers did not include the alias in the JSON response for the request http://YOUR-SERVER:3000/lua/host_get_json.lua?ifid=2&host=IP-OF-DEVICE .
You can add it manually by adding the lines require "mac_utils" and hj["alias"]=getDeviceName(hj["mac_address"]) to the file /usr/share/ntopng/scripts/lua/host_get_json.lua
You can use the REST API to query ntopng and use the information it provides to build any script you need.
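Returning to the two outputs in the question: one visible difference is the alias list. The WAN result carries the reverse-DNS (PTR) query name 3.25.217.172.in-addr.arpa, while the local results have no aliases. The sketch below builds that PTR name with the standard ipaddress module and flags answers that include it. Treating its presence as a sign that the answer came from DNS rather than /etc/hosts is only a heuristic drawn from the outputs above, not documented behaviour, and the function names are hypothetical:

```python
import ipaddress
import socket


def ptr_query_name(ip):
    """Build the reverse-DNS (PTR) query name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer


def describe_reverse_lookup(ip, reverse=socket.gethostbyaddr):
    """Return (name, has_ptr_alias), or None if the lookup fails.

    has_ptr_alias is True when the alias list contains the PTR query name.
    Assumption: this is used here only as a hint that DNS answered.
    """
    try:
        name, aliases, _ = reverse(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return name, ptr_query_name(ip) in aliases
```

The reverse parameter is there so the function can be tried against recorded tuples, such as the ones in the question, without touching the network.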

Confluence in Docker can't see PostgreSQL in Docker

I'm trying to set up both Confluence and PostgreSQL in Docker. I've got them both up and running on my fully up to date CentOS 6 machine, with volume-mapping to the host file system so I can back them up easily. I can connect to PostgreSQL using pgAdmin from another machine just fine, and I can get into Confluence from a browser from that same machine. So, basically, both apps seem to be running as expected inside their respective containers and are accessible to the outside world, which of course eliminates a whole bunch of possibilities for my issue.
And that issue is that Confluence can't talk to PostgreSQL during initial setup, which is necessary for it to function. I'm getting connection failed errors (to be specific: "Can't reach database server or port : SQLState - 08001 org.postgresql.util.PSQLException: The connection attempt failed").
PostgreSQL is using the default 5432 port, which of course is exposed, otherwise I wouldn't be able to connect to it via pgAdmin, and of course I know the ID/password I'm trying is correct for the same reason (and besides, if it was an auth problem I wouldn't expect to see this error message). When I try to configure the database connection during Confluence's initial setup, I specify the IP address of the host machine, just like from pgAdmin on the other machine, but that doesn't work. I also tried some things that I basically knew wouldn't work (0.0.0.0, 127.0.0.1 and localhost).
I'm not sure what I need to do to make this work. Is there maybe some special method to specify the IP to a container from the same host machine, some nomenclature I'm not aware of?
At this point, I'm "okay" with Docker in terms of basic operations, but I'm far from an expert, so I'm a bit lost. I'm also not a big-time *nix user generally, though I can usually fumble my way through most things... but any hints would be greatly appreciated because I'm at a loss right now otherwise.
Thanks,
Frank
EDIT 1: As requested by someone below, here's my pg_hba.conf file, minus comments:
local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
host all all all md5
Try changing the second line of the pg_hba.conf file to the following:
host all all 0.0.0.0/0 trust
This will cause PostgreSQL to accept connections from any IPv4 source address (a /32 mask would match only the single address 0.0.0.0). Since a Docker container is technically not operating on localhost but on its own IP, the current configuration causes PostgreSQL to block connections to it.
Also check whether Confluence is looking for the database on localhost. If that is the case, change it to the IP of the host machine within the Docker network.
Success! The solution was to create a custom network and then use the container name in the connection settings when pointing the Confluence container at the PostgreSQL container. In other words, I ran this:
docker network create -d bridge docker-net
Then, on both of the docker run commands for the PostgreSQL and Confluence containers, I added:
--network=docker-net
That way, when I ran through the Confluence configuration wizard, when it asked for the hostname for the PostgreSQL server, I used postgres (the name I gave the container) rather than an IP address or actual hostname. Docker makes that work thanks to the custom network. This also leaves the containers available via the IP of the host machine, so for example I can still connect to PostgreSQL via 192.168.123.12:5432, and of course I can launch Confluence in the browser via 192.168.123.12:8080.
FYI, I didn't even have to alter the pg_hba.conf file, I just used the official PostgreSQL image (latest) as it was, which is ideal.
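For anyone who wants to confirm that name resolution is what changed: containers attached to the same user-defined bridge network can resolve each other by container name. A minimal sketch to run inside the Confluence container, assuming a Python 3 interpreter is available there (the official image may not ship one); resolve_service is a hypothetical helper name:

```python
import socket


def resolve_service(host, port):
    """Return the sorted list of IP addresses that host resolves to,
    or an empty list if the name cannot be resolved."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

With both containers on docker-net, resolve_service("postgres", 5432) would be expected to return the PostgreSQL container's address on that network; an empty list suggests the two containers are not on the same user-defined network.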
Thanks very much to RSloeserwij for the suggestions... while none of them proved to be the solution I needed, they did put me on the right track in the Docker docs, which, after some reading, led me to understand a few things I didn't before and figure out the config magic I needed.

Rocks cluster head node DNS failure. Compute nodes unable to resolve hostnames

I've been tasked with maintaining a Rocks (CentOS 6.2-based) cluster where the head node is configured with a static IP to the public network and acts as a NAT router for the compute nodes on the internal private network. The nodes are connected to the head node by standard Ethernet and also QDR InfiniBand.
Recently, the compute nodes have been unable to access an external data source to begin computations, as DNS lookup fails when they use wget to pull down publicly available datasets. All compute nodes are configured with the IP of the head node in their /etc/resolv.conf, and I've checked the iptables firewall on the head node and nothing has changed. SSH works between all nodes and the head node. When I use the IP address of some of the data sources for manually initiated transfers, data flows again, but some of the applications cannot use IPs to grab data. I've tried restarting named and the iptables firewall, and so far nothing has fixed it. System logs (dmesg, /var/log/messages) show no sudden failures or error messages, I've made no recent configuration changes, and everything had worked fine for multiple months until about two nights ago. The head node can access and resolve names fine; it's only the compute nodes behind the NAT head node that are not working.
I'm still unfamiliar with all the workings of Rocks and am not sure if there is some special rocks command(s) that I'm overlooking to get this to work again. What might I be missing to get DNS resolution working again?
Thanks in advance!
UPDATE: DNS is working internally between compute nodes and the head node (e.g. compute-10-10 resolves to the IP address of that node from all other nodes) so the head node is functioning as the cluster DNS properly. Requests to domains outside the local zone still are failing (e.g. nslookup google.com fails) for all compute nodes.
The root cause was a failed upstream DNS server. I reconfigured the forwarder options in /etc/named.conf to point to other servers, and all compute nodes could access external resources once again.
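The symptom pattern from the update (local zone names resolve, outside names do not) is what pointed at the forwarders. A small sketch of that triage logic, assuming Python 3 on a compute node; classify_dns and its messages are my own wording, and the resolve parameter is injectable purely for testing:

```python
import socket


def classify_dns(internal_name, external_name, resolve=socket.gethostbyname):
    """Compare resolution of a local-zone name and an outside name."""
    def works(name):
        try:
            resolve(name)
            return True
        except OSError:  # socket.gaierror is a subclass of OSError
            return False

    internal_ok = works(internal_name)
    external_ok = works(external_name)
    if internal_ok and external_ok:
        return "ok"
    if internal_ok:
        # The cluster DNS answers for its own zone but cannot forward.
        return "local zone ok, external failing: check forwarders/upstream DNS"
    if external_ok:
        return "external ok, local zone failing"
    return "nothing resolves: resolver unreachable or not answering"
```

For example, classify_dns("compute-10-10", "google.com") run on a compute node during the outage would have been expected to land in the second branch.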

Cassandra 3.9, how to remote access [duplicate]

I built Cassandra server 2.0.3 and then ran it. It starts and then stops with these messages:
X:\MyProjects\cassandra\apache-cassandra-2.0.3-src\bin>cassandra.bat >log.txt
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1160)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:416)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:608)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:576)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:475)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
What can I change to make it run?
I had a similar problem with my cassandra v2.0.4 cluster running a single node.
Check your cassandra.yaml and make sure that your "listen_address" and "seeds" values match, with the exception that the seeds value requires quotes around it.
You might get this problem if your private IP address is different than the public one (like on AWS). For example, the host thinks it's "172.31.0.2" when it's visible as "55.70.33.10".
The solution to this problem is to set the following in cassandra.yaml:
listen_address: 172.31.0.2
broadcast_address: 55.70.33.10
Make sure your cluster_name entry matches on all the nodes in the cluster (you may need to delete your storage if you changed the cluster name).
Verify that all nodes can ping each other.
broadcast_rpc_address and listen_address should be set to the local IP (not localhost or 127.0.0.1).
seeds should point to the IP address of the seed(s).
If you are on AWS and use the Ec2MultiRegionSnitch you will need to set the seeds to the public IP addresses rather than the private IPs.
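The checklist above can be sketched as a small script. This is only an illustration: it assumes you have already copied the relevant settings from each node's cassandra.yaml into plain dicts by hand (with seeds flattened to a single comma-separated string), and check_cluster is a hypothetical helper, not part of Cassandra:

```python
LOOPBACK = {"localhost", "127.0.0.1"}


def check_cluster(configs):
    """configs maps a node label to a dict of that node's settings.

    Returns a list of human-readable problems; an empty list means none
    of the checks below fired.
    """
    problems = []
    # cluster_name must match on every node.
    names = {cfg.get("cluster_name") for cfg in configs.values()}
    if len(names) > 1:
        problems.append(
            f"cluster_name differs across nodes: {sorted(map(str, names))}"
        )
    for node, cfg in configs.items():
        # These addresses should be the local IP, not a loopback value.
        for key in ("listen_address", "broadcast_rpc_address"):
            if cfg.get(key) in LOOPBACK:
                problems.append(f"{node}: {key} is set to a loopback value")
        # seeds should list the IP address(es) of the seed node(s).
        for seed in str(cfg.get("seeds", "")).split(","):
            if seed.strip() in LOOPBACK:
                problems.append(f"{node}: seeds contains a loopback value")
    return problems
```

It does not check reachability between nodes; that still needs a ping or port test from each side.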
I had the same problem on Ubuntu 16.04. I'm not sure which of these changes made it work. Below are selections from my cassandra.yaml, where XXX.XXX.XXX.XXX is your public-facing IP address:
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "XXX.XXX.XXX.XXX"
listen_address: XXX.XXX.XXX.XXX
broadcast_address: XXX.XXX.XXX.XXX
broadcast_rpc_address: XXX.XXX.XXX.XXX
listen_on_broadcast_address: true
start_rpc: true
rpc_address: XXX.XXX.XXX.XXX
I also needed to restart my Virtual Machine for some reason. ¯_(ツ)_/¯
For a quick single node setup on RHEL, I did the following:
Get info about your network interface setup:
# /sbin/ifconfig -a
It will list the interfaces and the ip addresses they are attached to.
Usually it will show an "Ethernet" interface and a "Local Loopback".
Get the associated ip addresses.
Then edit conf/cassandra.yaml:
rpc_address: [Local Loopback address]
broadcast_rpc_address: [Ethernet address]
listen_address: [Local Loopback address]
broadcast_address: [Ethernet address]
listen_on_broadcast_address: true
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "[Ethernet address]"
Then open the required ports on the Linux firewall: 9042, 7000 and 7001. More info about opening ports on Linux here:
http://ask.xmodulo.com/open-port-firewall-centos-rhel.html
In cassandra.yaml, I changed the seed from a domain name to an IP address, and it works.
This happened to me because the "initial_token" setting was specified in my configuration (I think because I had just copied the configuration file over from another cluster member). After clearing the data directory, commenting out the setting and restarting the node, it worked fine for me.
I experienced this error today...
I could not find any reason for the error other than timing issues.
I restarted many times and after a while it stuck. It looks like they expect bi-directional communication on the gossip channel, and if it does not happen quickly enough (which looks like a very small amount of time to me), they drop the line and generate that error.
In my case I just upgraded my software and restarted the computer. So it was clearly not a connection issue between the computers (I have firewalls and SSL, to complicate matters) and the node was connected before... So the one entry I found in that regard from datastax did not apply...
https://support.datastax.com/hc/en-us/articles/209691483-Bootstap-fails-with-Unable-to-gossip-with-any-seeds-yet-new-node-can-connect-to-seed-nodes
I got the same error. There can be more than one cause; hopefully my mistake is the same one you made.
I had my localhost IP pointing to a domain name (I did that so that my Spring Boot application's server context would be a domain name like www.example.com:8080 instead of localhost:8080), and I had the following entry in my hosts file on a Windows system:
127.0.0.1 www.example.com
My Cassandra batch file, however, was looking for localhost, which it could not find. So I added another entry for localhost to my hosts file:
127.0.0.1 localhost
127.0.0.1 www.example.com
After adding it, I opened a new command prompt, ran the Cassandra batch file from the Cassandra bin directory, and it worked.
Disable the firewall and SELinux and try again.
In our case SSL was enabled, and the cassandra.yaml configuration looked fine as per the comments above. We then enabled SSL debugging by adding the JVM parameter below in cassandra-env.sh: -Djavax.net.debug=ssl:handshake
After starting the node again, we noticed the following in the Cassandra log file:
MessagingService-Outgoing-geo2_host/xx.xx.xx.xx, Exception while
waiting for close javax.net.ssl.SSLHandshakeException: Received fatal
alert: certificate_unknown
After further investigating the SSL debug logs, we found that the certificate was not valid. After fixing this SSL issue, the node was able to join the cluster.
Thanks to elvingt.
His answer reminded me that I need to verify that all nodes are able to talk to each other.
https://support.datastax.com/hc/en-us/articles/209691483-Bootstap-fails-with-Unable-to-gossip-with-any-seeds-yet-new-node-can-connect-to-seed-nodes
Gossip communications must be bi-directional.
To verify, use this command, and test it from BOTH sides:
nc -vz {your_node_ip} 7000
Then I remembered that I had turned on my Ubuntu firewall last night. I opened the port with:
sudo ufw allow 7000/tcp
And it is working now.
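If nc is not installed on a node, the same check can be approximated in Python. A minimal sketch, assuming Python 3; port_open is a hypothetical helper name and the 3-second timeout is an arbitrary choice:

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, unresolved name
        return False
```

As with nc, run port_open("{your_node_ip}", 7000) from each node towards the other, since gossip needs the connection to work in both directions.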
Getting this error during startup/bootstrap:
Unable to gossip with any seeds
indicates there is some issue with broadcast_address. broadcast_address is responsible for communication with other nodes, not with clients.
This address must be set on the seed node (it is mandatory for a seed node). If you are using cloud VMs you might have different IPs (public and private), so it is recommended to use your private IPs for broadcast_address; this will save on network costs as well.
# Address to broadcast to other Cassandra nodes
# Leaving this blank will set it to the same value as listen_address
broadcast_address: 10.11.xx.xxx
In my scenario I was using IBM, and once I set broadcast_address on the seed nodes the issue was resolved.
Please make sure you start your seed node first and then the other nodes; this order is mandatory.
In cassandra.yaml, changing the listen_address value from localhost to the domain name solved my issue.
I had the same issue. I checked the port and used tcpdump and netcat to test connections, and it finally came down to expired SSL certificates used for internode_encryption. I set internode_encryption to 'none', restarted all nodes, and it worked.
Before that, all neighbor nodes showed as down, and the node repair command was failing with:
"Did not get positive replies from all endpoints"
P.S. Don't leave internode_encryption set to none for long; just regenerate the certificates and enable it again.

connection exception when using hadoop2 (YARN)

I have setup Hadoop (YARN) on ubuntu. The resource manager appears to be running. When I run the hadoop fs -ls command, I receive the following error:
14/09/22 15:52:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: Call From ubuntu-8.abcd/xxx.xxx.xxx.xxxx to ubuntu-8.testMachine:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I checked the URL suggested in the error message but could not figure out how to resolve the issue. I have tried setting the external IP address (as opposed to localhost) in my core-site.xml file (in etc/hadoop), but that has not resolved the issue. IPv6 has been disabled on the box. I am running the process as hduser (which has read/write access to the directory). Any thoughts on fixing this? I am running this on a single node.
bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.5.1
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL ##added because I was not sure about the line below
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
Your issue is not related to YARN. It is limited to HDFS usage.
There is a question describing a similar situation: the person who asked had port 9000 listening on the external IP interface, but the configuration was pointing to localhost. I'd advise first checking whether anything is listening on port 9000 at all, and on which interface. It looks like you have a service listening on a different IP interface from the one where you are looking for it. According to your logs, your client is trying ubuntu-8.testMachine:9000. Which IP does that resolve to? If it is assigned to 127.0.0.1 in /etc/hosts, you could have the same situation as in the question I mentioned: the client tries to access 127.0.0.1 but the service is waiting on the external IP. Or it could be the other way round. There is also a good table of default port mappings for Hadoop services.
Indeed, many similar cases have the same root cause: wrongly configured host interfaces. People often configure their workstation hostname and assign this hostname to localhost in /etc/hosts. Moreover, they write the short name first and the FQDN only after it. This means the IP resolves to the short hostname while the FQDN resolves to the IP (non-symmetric).
This in turn leads to a number of situations where services are started on the local 127.0.0.1 interface and people have serious connectivity issues (are you surprised? :-) ).
The right approach (at least the one I encourage, based on experience):
Assign at least one external interface that is visible to your cluster clients. If you have DHCP and don't want a static IP, bind your IP to the MAC address so that you still get a 'constant' IP value.
Write the local hostname into /etc/hosts so that it matches the external interface, with the FQDN first and then the short name.
If you can, make your DNS resolver resolve your FQDN to your IP. Don't worry about the short name.
For example, if you have the external IP interface 1.2.3.4 and the FQDN (fully qualified domain name) set to myhost.com, your /etc/hosts record MUST look like:
1.2.3.4 myhost.com myhost
And yes, it's better your DNS resolver knows about your name. Check both direct and reverse resolution with:
host myhost.com
host 1.2.3.4
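The two /etc/hosts mistakes described above (hostname mapped to loopback, short name before the FQDN) can be spotted mechanically. A rough sketch, assuming Python 3; check_hosts_line is a hypothetical helper, and treating any name containing a dot as an FQDN is a simplification:

```python
def check_hosts_line(line):
    """Return a list of warnings for one /etc/hosts-style line."""
    # Drop trailing comments, then split into the IP and its names.
    fields = line.split("#", 1)[0].split()
    if len(fields) < 2:
        return []  # blank line, comment, or entry without names
    ip, names = fields[0], fields[1:]
    warnings = []
    # A real hostname on a 127.x address makes services bind to loopback.
    non_local = [n for n in names if not n.startswith("localhost")]
    if ip.startswith("127.") and non_local:
        warnings.append(f"{non_local[0]} is mapped to loopback address {ip}")
    # The FQDN should come first, then the short name.
    first_fqdn = next((i for i, n in enumerate(names) if "." in n), None)
    if first_fqdn is not None and first_fqdn > 0:
        warnings.append("short name is listed before the FQDN")
    return warnings
```

Running it over each line of /etc/hosts gives a quick first pass; the host commands above remain the real test of forward and reverse resolution.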
Yes, clustering is not so easy in term of networking administration ;-). Never has been and shall never be.
Make sure that you have started all the necessary services: run start-all.sh, which starts all the services needed to connect to Hadoop.
After that, run jps to see all the services running under Hadoop, and finally check the ports those services have opened with netstat -plnet | grep java.
Hope this solves your issue.

Resources