Hadoop 2.2.0 multi-node cluster setup on EC2 - 4 identical Ubuntu 12.04 t2.micro instances - Linux

I have followed this tutorial to set up a Hadoop 2.2.0 multi-node cluster on Amazon EC2. I have had a number of issues with ssh and scp which I was able to either resolve or work around with the help of articles on Stack Overflow, but unfortunately I could not resolve the latest problem.
I am attaching the core configuration files (core-site.xml, hdfs-site.xml, etc.). I am also attaching a log file, which is the dump output from running the start-dfs.sh command. It is the final step for starting the cluster, and it gives a mix of errors that I don't have a clue what to do with.
So I have 4 nodes created from exactly the same AMI: Ubuntu 12.04 64-bit t2.micro 8GB instances.
Namenode
SecondaryNode (SNN)
Slave1
Slave2
The configuration is almost the same as suggested in the tutorial mentioned above.
I have been able to connect with WinSCP and ssh from one instance to the other. I have copied all the configuration files, the masters and slaves files, and the .pem files for security purposes, and the instances appear to be accessible from one another.
If someone could please look at the log, config files and .bashrc file and let me know what I am doing wrong.
The same security group, HadoopEC2SecurityGroup, is used for all the instances. All TCP traffic is allowed and the ssh port is open (screenshot in the attached zipped folder). I am able to ssh from the Namenode to the secondary namenode (SNN), and the same goes for the slaves, which means ssh is working, but when I start HDFS everything goes down. The error log is not throwing any useful exceptions either. All the files and screenshots can be found as a zipped folder here.
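For quick reference, in case you don't want to open the zip: the core-site.xml on every node follows the tutorial's usual pattern, roughly the sketch below (the namenode DNS name and port are placeholders, not my real instance):
# sketch of core-site.xml as written on each node (placeholder namenode DNS)
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com:9000</value>
  </property>
</configuration>
EOF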
An excerpt from the error output on the console looks like this:
Starting namenodes on [OpenJDK 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
ec2-54-72-106-167.eu-west-1.compute.amazonaws.com]
You: ssh: Could not resolve hostname you: Name or service not known
have: ssh: Could not resolve hostname have: Name or service not known
loaded: ssh: Could not resolve hostname loaded: Name or service not known
VM: ssh: Could not resolve hostname vm: Name or service not known
library: ssh: Could not resolve hostname library: Name or service not known
Server: ssh: Could not resolve hostname server: Name or service not known
warning:: ssh: Could not resolve hostname warning:: Name or service not known
which: ssh: Could not resolve hostname which: Name or service not known
guard.: ssh: Could not resolve hostname guard.: Name or service not known
have: ssh: Could not resolve hostname have: Name or service not known
might: ssh: Could not resolve hostname might: Name or service not known
.....

Add the following entries to .bashrc, where HADOOP_HOME is your Hadoop folder:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
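In context, a minimal sketch of the relevant .bashrc block, assuming Hadoop is installed under /usr/local/hadoop (adjust the path to your install), followed by reloading the shell and restarting HDFS:
# assuming Hadoop lives in /usr/local/hadoop; adjust to your install
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# reload the shell configuration and restart HDFS
source ~/.bashrc
$HADOOP_HOME/sbin/start-dfs.sh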
Hadoop 2.2.0 : "name or service not known" Warning
hadoop 2.2.0 64-bit installing but cannot start

Related

ssh: Could not resolve hostname sama5d27-som1-ek-sd:

I am trying to set up a remote connection between my desktop machine (Windows) and a remote machine (Linux).
For that, I am following the steps described here:
https://learn.microsoft.com/en-us/cpp/linux/set-up-fips-compliant-secure-remote-linux-development?view=vs-2019#to-create-and-use-an-rsa-key-file
But in step 2, when I try, from Windows, to copy the public key to the Linux machine using this command:
scp C:\Users\wiemz/.ssh/id_rsa.pub root@sama5d27-som1-ek-sd:
This error occurs:
ssh: Could not resolve hostname sama5d27-som1-ek-sd: Hôte Unknown.
lost connection
I verified my Linux machine's internet connection with the ping command and it is fine. Besides, when I typed this command on the Linux machine:
ssh root@sama5d27-som1-ek-sd
it says :
ssh: Could not resolve hostname sama5d27-som1-ek-sd: Temporary failure
in name resolution
How can I fix this problem, please?
The problem is that your OS can't resolve the hostname you are using. You should provide an FQDN like web.example.com or the IP address of the machine. For example:
scp file root@IP:
or
scp file root@web.example.com:
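As a concrete sketch, assuming the board's address turns out to be 192.168.1.50 (a made-up value; check it on the Linux machine first), the copy from Windows would look like this:
# on the Linux machine: print its IP addresses
hostname -I
# on Windows, using that IP instead of the hostname (192.168.1.50 is a placeholder)
scp C:\Users\wiemz/.ssh/id_rsa.pub root@192.168.1.50: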

AWS EMR: Spark - SparkException java IOException: Failed to create local dir in /tmp/blockmgr*

I have an AWS EMR cluster with Spark. I can connect to it (Spark):
from master node after SSHing into it
from another AWS EMR cluster
But NOT able to connect to it:
from my local machine (macOS Mojave)
from non-emr machines like Metabase and Redash
I have read the answers to this question. I have checked that folder permissions and disk space are fine on all the nodes. My assumption is that I'm facing a similar problem to the one James Wierzba asks about in the comments. However, I do not have enough reputation to add a comment there. Also, this might be a different problem, considering it is specific to AWS EMR.
Connection works fine after SSHing to master node.
# SSHed to master node
$ ssh -i ~/identityfile hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
# on master node
$ /usr/lib/spark/bin/beeline -u 'jdbc:hive2://localhost:10001/default'
# it connects fine and I can run commands, for e.g., 'show databases;'
# Beeline version 1.2.1-spark2-amzn-0 by Apache Hive
Connection to this node works fine from master node of another EMR cluster as well.
However, connection does not work from my local machine (macOS Mojave), Metabase and Redash.
My local machine:
# installed hive (for beeline)
$ brew install hive
# Beeline version 3.1.1 by Apache Hive
# connect directly
# I have checked that all ports are open for my IP
$ beeline -u 'jdbc:hive2://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:10001/default'
# ERROR: ConnectException: Operation timed out
#
# this connection timeout probably has something to do with spark accepting only localhost connections
# I have allowed all the ports in AWS security group for my IP
# connect via port forwarding
# open a port
$ ssh -i ~/identityfile -Nf -L 10001:localhost:10001 hadoop#ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
$ beeline -u 'jdbc:hive2://localhost:10001/default'
# Failed to connect to localhost:10001
# Required field 'client_protocol' is unset!
$ beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
# org.apache.http.ProtocolException: The server failed to respond with a valid HTTP response
I have set up Metabase and Redash in EC2.
Metabase → connect using data source Spark SQL → results in
java.sql.SQLException: org.apache.spark.SparkException: java.io.IOException: Failed to create local dir in /mnt/tmp/blockmgr*
Redash → connect using data source Hive → results in same error.
You need to update the inbound rules of the security group attached to the master node of the EMR cluster. You will need to add the public IP address of your network provider. You can find your public IP address on the following website:
What is my IP
For more details on how to update the inbound rules with your IP address refer following AWS documentation :
Authorizing Inbound Traffic for Your Linux Instances
You should also check the outbound rules of your own network in case you are working in a restricted network environment.
So make sure you have outbound access in your network and inbound access in your EMR's master node security group for all the ports you want to access.
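If you prefer the AWS CLI over the console, a rough sketch of adding that inbound rule (the security group ID is a placeholder for your master node's group, and 10001 is the Thrift port used above):
# look up your current public IP and allow it on port 10001 only
# sg-0123456789abcdef0 is a placeholder for the EMR master node's security group
MYIP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 10001 \
  --cidr ${MYIP}/32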

Starting Hadoop without ssh'ing to localhost

I have a very tricky situation on my hands. I'm installing Hadoop on a few nodes which run Ubuntu 12.04, and our IT guys have created a user "hadoop" for me to use on all the nodes. The issue with this user is that it does not allow ssh to localhost because of some security constraints. So I'm not able to start the Hadoop daemons at all.
I can connect to the machine itself using "ssh hadoop@hadoops_address" but not using the loopback address. I also cannot make any changes to /etc/hosts. Is there a way I can tell Hadoop to ssh to itself using "ssh hadoop@hadoops_address" instead of "ssh hadoop@localhost"?
Hadoop reads the hostnames from the "masters" and "slaves" files present inside the conf dir;
edit the files and change the value from localhost to hadoops_address.
This should fix your problem.
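A minimal sketch of that change, assuming an old-style conf/ layout under $HADOOP_HOME and the resolvable name hadoops_address from the question:
# from $HADOOP_HOME: point both files at the node's resolvable name
echo "hadoops_address" > conf/masters
echo "hadoops_address" > conf/slaves
# the start scripts will now ssh to hadoop@hadoops_address instead of hadoop@localhost
bin/start-all.sh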

Puppet agent can't find server

I'm new to Puppet, but picking it up quickly. Today, I'm running into an issue when trying to run the following:
$ puppet agent --no-daemonize --verbose --onetime
err: Could not request certificate: getaddrinfo: Name or service not known
Exiting; failed to retrieve certificate and waitforcert is disabled
It would appear the agent doesn't know what server to connect to. I could just specify --server on the command line, but that will be of no use to me when this runs as a daemon in production, so instead, I specify the server name in /etc/puppet/puppet.conf like so:
[main]
server = puppet.<my domain>
I do have a DNS entry for puppet.<my domain> and if I dig puppet.<my domain>, I see that the name resolves correctly.
All the Puppet documentation I have read states that the agent tries to connect to a puppet master at "puppet" by default, and that your options are hosts-file trickery or doing the right thing: create a CNAME in DNS and edit puppet.conf accordingly, which I have done.
So what am I missing? Any help is greatly appreciated!
D'oh! Need to sudo to do this! Then everything works.
I had to use the --server flag:
sudo puppet agent --server=puppet.example.org
I actually had the same error, but I was using the two Learning Puppet VMs and trying to run the 'puppet agent --test' command.
I solved the problem by opening the file /etc/hosts on both the master and the agent VM and editing the line
***.***.***.*** learn.localdomain learn puppet.localdomain puppet
The IP address (the asterisks) was originally some random number. I had to change this number on both VMs so that it was the IP address of the master node.
So I guess for experienced users my advice is to check the /etc/hosts file to make sure that the IP addresses in there for the master and agent not only match but are the same as the IP address of the master node.
For other noobs like me, my advice is to read the documentation more carefully. This was a step in the 'setting up an agent VM' process that I totally missed xD
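For example, the edited line would look something like this on both VMs (the address is a placeholder; use the master's actual IP):
# /etc/hosts on both the master and the agent VM -- placeholder IP for the master
192.168.1.10  learn.localdomain learn puppet.localdomain puppet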
In my case I was getting the same error, but it was due to the certificate which should have been signed for the node on the puppetmaster server.
To check pending certs, run the following:
puppet cert list
"node.domain.com" (SHA256) 8D:E5:8A:2*******"
Sign the cert for the node:
puppet cert sign node.domain.com
Had the same issue today with Puppet 2.6 on CentOS 6.4.
All I did to resolve the issue was to check the usual stuff such as hosts and resolv.conf to ensure they were as expected (compared with a working server), and then:
Removed the /var/lib/puppet directory: rm -rf /var/lib/puppet
Cleared the certificate on the puppet master: puppetca --clean servername
Restarted the network: service network restart
Re-ran puppet
Even though resolv.conf was identical to the working server's, puppet updated resolv.conf, immediately re-signed the certificate and replaced all the puppet lib files.
Everything was fine after that.
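Roughly, the sequence looked like this (servername is whatever certname the agent uses; run the clean step on the master and the rest on the agent):
# on the agent: wipe local puppet state
rm -rf /var/lib/puppet
# on the puppet master: clear the agent's old certificate
puppetca --clean servername
# back on the agent: restart networking and re-run puppet
service network restart
puppet agent --test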

How to change the host name of the ubuntu server running oracle xe

I have an Oracle 11g XE instance running under Ubuntu Server. I tried changing the hostname of the server by modifying the host name in /etc/hostname, /etc/hosts, tnsnames.ora and listener.ora, but the oracle-xe instance fails to start after a reboot. Any idea which configuration I am missing?
Sometimes Oracle starts with only certain services / functionalities not working properly. If that's the case and your Oracle instance partially failed to start, you can get some more information about running listeners by invoking the lsnrctl command-line utility and then using the status command.
You can also look for clues in the Oracle log files under <oracle-install>/app/oracle/diag/tnslsnr/<hostname>/listener/alert/log.xml - you should definitely have one for your old hostname and you might have another one created for your new hostname as well.
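For example, something along these lines (output details will vary with your install; run it as the oracle user):
# show which listeners are running and which services they know about
lsnrctl status
# if the listener is down after the hostname change, try starting it by hand
lsnrctl start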
I had this and solved it: just rename your listener.ora and restart, and it will change the settings for the new host name.
See my explanation here.
