I am new to Hadoop and am trying to install it on a multi-node cluster running Ubuntu 14.04 Server on VMs. All goes well until I try to list the files within HDFS using hadoop fs -ls /
I keep getting an error:
ls: unknown host: Hadoop-Master.
Initially I thought I had made some mistake in assigning the hostname, but I cross-checked /etc/hosts and /etc/hostname. The hostname is listed correctly as Hadoop-Master. I even removed the hostname altogether, leaving only the IP address.
Another post here suggested adding two lines to .bashrc:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
I tried doing that but still getting the same error.
Please find the relevant steps below, along with edits based on the information requested.
Check the IP address of the master with ifconfig
Add the host name and IP address to /etc/hosts and edit /etc/hostname to set the host name.
Add the relevant details to the masters and slaves files.
.bashrc File
export HADOOP_INSTALL=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export HIVE_HOME=/usr/local/Hive
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Java path
export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs:Hadoop-Master:9001</value>
</property>
</configuration>
hadoop-env.sh
export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
Edit mapred-site.xml to include the hostname and change the value to the number of nodes present.
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>2</value>
</property>
</configuration>
Edit hdfs-site.xml, changing the value to the number of data nodes present.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
</configuration>
whoami
simplilearn
/etc/hosts
localhost 127.0.0.1
Hadoop-Master 192.168.207.132
Hadoop-Slave 192.168.207.140
/etc/hostname
Hadoop-Master
Changes to be made:
1. /etc/hosts file:
Change Hadoop-Master to HadoopMaster
2. /etc/hostname file:
Change Hadoop-Master to HadoopMaster
3. core-site.xml:
Change this
hdfs:Hadoop-Master:9001
to this
hdfs://HadoopMaster:9001
NOTE: Change Hadoop-Master to HadoopMaster on all nodes, wherever it points to your IP. Change the slaves and masters files too (see the sketch below).
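For illustration, here is roughly what the corrected entries could look like (a sketch only; keep your own IPs, and note that the conventional /etc/hosts order is IP first, then host name):
/etc/hosts
127.0.0.1 localhost
192.168.207.132 HadoopMaster
192.168.207.140 Hadoop-Slave
core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9001</value>
</property>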
Related
I am using the YARN resource manager for Spark. After a restart of the YARN server, all completed jobs in the Spark web UI disappeared.
The two properties below are added in yarn-site.xml. Can someone explain what could be the reason, and is there any property to control this?
<property>
<name>yarn.log-aggregation-enable</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>86400</value>
</property>
Thanks.
You can persist application history on restarts if you set yarn.resourcemanager.recovery.enabled to true in your yarn-site.xml and set yarn.resourcemanager.store.class.
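For example, a minimal sketch of the relevant yarn-site.xml entries, assuming the filesystem-based state store and a hypothetical HDFS path for it:
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
<!-- hypothetical path; any URI the ResourceManager user can write to -->
<name>yarn.resourcemanager.fs.state-store.uri</name>
<value>hdfs:///rmstore</value>
</property>
Restart the ResourceManager after making the change.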
See ResourceManager Restart for further details.
Your other entries refer to logging and define how long you want completed logs to stay before they get cleaned out. You can read more about them in yarn-default.xml.
I'm attempting to set up Apache Nutch and Apache Solr so our site can have internal site search. I have followed so many guides, and while they are very useful, they lack what to do if an error occurs, and most seem outdated at this point.
I'm using JDK 131, Nutch 2.3.1, and Solr 6.5.1
This is the sequence of my actions as a non-root user:
sudo wget [java url] to /opt
sudo tar xvf java.tar.gz
export JAVA_HOME=/opt/java/
export JAVA_JRE=/opt/java/jre
export PATH=$PATH:/opt/java/bin:/opt/java/jre/bin
cd solr6.5.1/
sudo start runtime -e cloud -noprompt
sudo wget [solr url] to /root
sudo tar xvf solr.tar.gz
sudo wget [nutch url] to /opt
sudo tar xvf nutch.tar.gz
cd /opt/apache-nutch-2.3.1
sudo vi nutch-site.xml
add:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
<description> At the very least, I needed to add the parse-html, urlfilter-regex, and the indexer-solr.
</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.</description>
</property>
</configuration>
cd /opt/apache-nutch-2.3.1
mkdir urls
cd urls
sudo vi seed.txt
add [our site url]
[ESC]
:w
:q
cd ../conf
sudo vi regex-urlfilter.xml
add:
+^http://([a-zA-Z0-9]*\.)*[domain of our site].com/
[ESC]
:w
:q
cd ..
sudo ant runtime
sudo -E runtime/local/bin/nutch inject urls -crawlId 3
Then I get this:
InjectorJob: Injecting urlDir: urls
InjectorJob: java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
My questions are: why am I getting this error, and how do I resolve it? I saw in a lot of places to modify schema.xml in the Solr directory, but there is no schema.xml file anywhere in the Solr directory.
As you're using the SQL store as the Nutch back-end, did you edit ivy/ivy.xml and uncomment this line?
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
If not, uncomment this line and clean and build again (see the sketch below). If it's still not working, let me know your complete approach or the tutorial you followed.
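For reference, the clean and rebuild step typically looks like this from the Nutch source root (a sketch, matching the sudo usage above):
cd /opt/apache-nutch-2.3.1
sudo ant clean
sudo ant runtime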
Edit
As you said, you are using HBase as the store, so your nutch-site.xml property is supposed to be this:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
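Alongside that property, the HBase back-end usually also needs the gora-hbase dependency enabled in ivy/ivy.xml and the default datastore set in conf/gora.properties, roughly like this (a sketch; the rev value depends on the Gora version bundled with your Nutch release):
ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Then rebuild with ant runtime.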
Please follow the link you mentioned carefully.
We need to crawl data from a URL which is authenticated with a username and password.
1) We have configured httpclient-auth.xml with the following credentials:
<credentials username="xxxx" password="xxxxxx">
<default/>
</credentials>
2) We have configured nutch-site.xml with the following properties:
<property>
<name>http.agent.name</name>
<value>Nutch Crawl</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
</property>
When we try to fetch the data, we get only the one URL that is present in the seed.txt file. We didn't get any errors, but we are still getting only one crawled page.
What are we missing here?
I am new to Hadoop ecosystem.
I recently tried Hadoop (2.7.1) on a single-node cluster without any problems and decided to move on to a multi-node cluster having 1 namenode and 2 datanodes.
However, I am facing a weird issue. Whatever jobs I try to run are stuck with the following message:
on the web interface:
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register
and in the cli:
16/01/05 17:52:53 INFO mapreduce.Job: Running job: job_1451083949804_0001
They don't even start, and at this point I am not sure what changes I need to make in order to make it work.
Here's what I have tried in order to resolve this:
disabling firewall on all nodes
setting lower resource limits
configuring under different machines, routers and distros
I would really appreciate any help (even a minute hint) in the correct direction.
I have followed these instructions (configuration):
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
How To Setup Multi Node Hadoop 2 (YARN) Cluster
I finally got this solved. Posting detailed steps for future reference. (only for test environment)
Hadoop (2.7.1) Multi-Node cluster configuration
Make sure that you have a reliable network without host isolation. Static IP assignment is preferable, or at least have an extremely long DHCP lease. Additionally, all nodes (Namenode/master and Datanodes/slaves) should have a common user account with the same password; in case you don't, create such a user account on all nodes. Having the same username and password on all nodes makes things a bit less complicated.
[on all machines] First configure all nodes for single-node cluster. You can use my script that I have posted over here.
execute these commands in a new terminal
[on all machines] ↴
stop-dfs.sh;stop-yarn.sh;jps
rm -rf /tmp/hadoop-$USER
[on Namenode/master only] ↴
rm -rf ~/hadoop_store/hdfs/datanode
[on Datanodes/slaves only] ↴
rm -rf ~/hadoop_store/hdfs/namenode
[on all machines] Add IP addresses and corresponding Host names for all nodes in the cluster.
sudo nano /etc/hosts
hosts
xxx.xxx.xxx.xxx master
xxx.xxx.xxx.xxy slave1
xxx.xxx.xxx.xxz slave2
# Additionally you may need to remove lines like "xxx.xxx.xxx.xxx localhost", "xxx.xxx.xxx.xxy localhost", "xxx.xxx.xxx.xxz localhost" etc. if they exist.
# However, it's okay to keep lines like "127.0.0.1 localhost" and others.
[on all machines] Configure iptables
Allow default or custom ports that you plan to use for various Hadoop daemons through the firewall
OR
much easier, disable iptables
on RedHat like distros (Fedora, CentOS)
sudo systemctl disable firewalld
sudo systemctl stop firewalld
on Debian like distros (Ubuntu)
sudo ufw disable
[on Namenode/master only] Gain SSH access from the Namenode (master) to all Datanodes (slaves).
ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@slave2
confirm things by running ping slave1, ssh slave1, ping slave2, ssh slave2 etc. You should have a proper response. (Remember to exit each of your ssh sessions by typing exit or closing the terminal. To be on the safer side I also made sure that all nodes were able to access each other and not just the Namenode/master.)
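If the key pair doesn't exist yet on the Namenode/master, a minimal sketch of generating it first (assuming the default RSA key path and an empty passphrase):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa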
[on all machines] edit core-site.xml file
nano /usr/local/hadoop/etc/hadoop/core-site.xml
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
<description>NameNode URI</description>
</property>
</configuration>
[on all machines] edit yarn-site.xml file
nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
[on all machines] modify slaves file, remove the text "localhost" and add slave hostnames
nano /usr/local/hadoop/etc/hadoop/slaves
slaves
slave1
slave2
(I guess having this only on the Namenode/master will also work, but I did this on all machines anyway. Also note that in this configuration the master behaves only as a resource manager; this is how I intend it to be.)
[on all machines] modify the hdfs-site.xml file to change the value of the dfs.replication property to something > 1 (at least to the number of slaves in the cluster; here I have two slaves so I would set it to 2, as sketched below)
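hdfs-site.xml (relevant entry only; a sketch for the two-slave setup described here)
<property>
<name>dfs.replication</name>
<value>2</value>
</property>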
[on Namenode/master only] (re)format the HDFS through namenode
hdfs namenode -format
[optional]
remove the dfs.datanode.data.dir property from the master's hdfs-site.xml file.
remove the dfs.namenode.name.dir property from all slaves' hdfs-site.xml files.
TESTING (execute only on Namenode/master)
start-dfs.sh;start-yarn.sh
echo "hello world hello Hello" > ~/Downloads/test.txt
hadoop fs -mkdir /input
hadoop fs -put ~/Downloads/test.txt /input
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
wait for a few seconds and the mapper and reducer should begin.
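Once the job finishes, a quick sketch of checking the result (assuming the /output path used above and the default wordcount output file name):
hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000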
These links helped me with the issue:
https://stackoverflow.com/a/24544207/2534513
Hadoop YARN Installation: The definitive guide#Cluster Installation
I met the same problem when I ran
"hadoop jar hadoop-mapreduce-examples-2.6.4.jar wordcount /calculateCount/ /output"
This command stopped there.
I tracked the job and found "there are 15 missing blocks, and they are all corrupted".
Then I did the following:
1) ran "hdfs fsck / "
2) ran "hdfs fsck / -delete "
3) added "-A INPUT -p tcp -j ACCEPT" to /etc/sysconfig/iptables on the two datanodes
4) ran "stop-all.sh and start-all.sh"
Everything went well.
I think the firewall is the key point.
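For reference, a rough sketch of applying and persisting such a rule on a RHEL/CentOS-style system that still uses the legacy iptables service (the blanket TCP ACCEPT mirrors the rule above; in practice you would restrict it to the Hadoop ports you actually use):
sudo iptables -A INPUT -p tcp -j ACCEPT
sudo service iptables save
sudo service iptables restart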
I am trying to mount the HDFS file system as per the content in the URL http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
But at the final mount statement, I am getting the error mount.nfs: mount system call failed.
I got that output on executing the below command:
mount -t nfs -o vers=3,proto=tcp,nolock,noacl <HDFS server name>:/ <existing local directory>
I am running Hadoop in pseudo-distributed mode.
If you use root to mount the NFS share on the client, for example, you need to add the following configuration to core-site.xml under $HADOOP_HOME/etc/hadoop/:
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
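After adding the proxyuser settings, restart HDFS and the NFS gateway daemons before retrying the mount. A rough sketch, assuming a pseudo-distributed setup with the Hadoop scripts on the PATH and no system rpcbind/portmap already serving the gateway:
stop-dfs.sh && start-dfs.sh
# start the gateway daemons (portmap typically as root, nfs3 as the proxied user)
hdfs portmap
hdfs nfs3
# verify the export is visible, then retry the mount
showmount -e <HDFS server name>
mount -t nfs -o vers=3,proto=tcp,nolock,noacl <HDFS server name>:/ <existing local directory>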