Hadoop: each namenode and datanode only stays up for a moment - Linux

Using CentOS 5.4
Three virtual machines (using VMware Workstation): master, slave1, slave2. master is used for the namenode, and slave1 and slave2 are used for the datanodes.
The Hadoop version is hadoop-0.20.1.tar.gz. I have configured all the relevant files and stopped the firewall as root with the command /sbin/service iptables stop. Then I formatted the namenode and started Hadoop on the master (namenode) virtual machine with the following commands; no error was reported.
bin/hadoop namenode -format
bin/start-all.sh
Then I ran "jps" on the master machine right away and saw the expected result:
5144 JobTracker
4953 NameNode
5079 SecondaryNameNode
5216 Jps
But after a few seconds, when I ran "jps" again, all the virtual machines showed only one process: Jps. The following is the result displayed on the namenode (master):
5236 Jps
What's wrong here? How can I find out what caused it? Does it mean that no namenode or datanode can be found? Thank you.
Attachment: all the files I have modified:
hadoop-env.sh:
# set java environment
export JAVA_HOME=/usr/jdk1.6.0_13/
core-site.xml:
<configuration>
<property>
<name>master.node</name>
<value>namenode_master</value>
<description>master</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>local dir</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://${master.node}:9000</value>
<description> </description>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>${hadoop.tmp.dir}/hdfs/name</value>
<description>local dir</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/hdfs/data</value>
<description> </description>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>${master.node}:9001</value>
<description> </description>
</property>
<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description> </description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/mapred/system</value>
<description>hdfs dir</description>
</property>
</configuration>
master:
master
slaves:
slave1
slave2
/etc/hosts:
192.168.190.133 master
192.168.190.134 slave1
192.168.190.135 slave2

From the log files, I found that I should change namenode_master to master in core-site.xml, so that ${master.node} resolves to a hostname listed in /etc/hosts. Now it works.
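For anyone hitting the same symptom, the daemon logs are where the cause shows up. A minimal sketch of how they can be checked, assuming the install lives under /usr/local/hadoop (the logs path and the user in the file names will differ on other setups):
# each daemon writes hadoop-<user>-<daemon>-<host>.log; the namenode log is the first place to look
tail -n 50 /usr/local/hadoop/logs/hadoop-*-namenode-master.log
# scan all daemon logs for the usual suspects when a hostname does not resolve or a port cannot bind
grep -iE "UnknownHostException|BindException" /usr/local/hadoop/logs/*.log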

Related

cannot stat '/user/hadoop/logs/datanode-cluster

I am trying to run a multi-step job in which one of the steps is a script that uses PySpark/Apache Spark. I have a 4-node compute cluster with a SLURM job scheduler and am wondering how I can run them together. Currently, I have Spark on all the nodes (with the head node acting as the "master" and the remaining 3 compute nodes as "slaves") and Hadoop (with the head node as the namenode and secondary namenode, and the remaining 3 compute nodes as datanodes).
However, when I start Hadoop on the head node with start-all.sh, I only see a single datanode, and when I try to start it I get an error saying:
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.4': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.3': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.2': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.1': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out': No such file or directory
However, these files exist and seem to be readable/writable. Spark starts fine, and the 3 slave nodes can be started from the head node. Because of the error above, my job also fails when I submit it to SLURM. I would appreciate any advice on this issue and on the architecture of my process.
Edit 1: Hadoop config files
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://cluster-hn:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/s1/snagaraj/hadoop/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/s1/snagaraj/hadoop/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.https.port</name>
<value>50470</value>
<description>The https port where namenode binds</description>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>0</value>
</property>
</configuration>
Workers File
localhost
cluster-n1
cluster-n2
cluster-n3
I have been facing this same issue and could fix it by giving 775 permissions to the logs directory recursively, i.e., in my case: chmod -R 775 /home/admin/hadoop/logs
Now the "mv: cannot stat... .out': No such file or directory" error is gone.

failed to connect hadoop spark

I am a beginner with Spark jobs and Spark configuration.
I try to submit a Spark job; after a few minutes (the job is accepted and runs for a few minutes) it fails with a connection-refused error.
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 2
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to my.domain.com/myIp:portNumber
I also see this error on jobs that succeed:
ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
On my computer, the job runs fine in IntelliJ IDEA, so this is not a code mistake.
I have tried several times to change the configuration in yarn-site.xml and mapred-site.xml.
This is a Hadoop HDFS cluster with 3 nodes, 2 cores per node, and 8 GB RAM per node. I submit with this command line:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 --class MyClass --master yarn --deploy-mode cluster myJar.jar
mapred-site.xml:
<property>
<value>yarn</value>
<name>mapreduce.framework.name</name>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ipadress</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2000</value>
</property>
spark-default.conf
spark.master yarn
spark.driver.memory 1g
spark.history.fs.update.interval 30s
spark.history.ui.port port
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 2
spark.executor.memory 2g
spark.cores.max 2
spark.executor.cores 2
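A rough sketch of the first checks I would run for a FetchFailedException like this; the hostname, port, and application id below are placeholders for the values reported in the error and by YARN:
# verify the executor/shuffle port named in the error is reachable from another node
nc -zv my.domain.com portNumber
# pull the aggregated YARN logs for the application and look for the first connection failure
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX | grep -iE "FetchFailed|Connection refused" | head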

Yarn nodemanager not starting up. Getting no errors

I have Hadoop 2.7.4 installed on Ubuntu 16.04. I'm trying to run it in pseudo-distributed mode.
I have a '/hadoop' partition mounted for all my Hadoop files, including the NameNode and DataNode directories.
My core-site.xml is:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
My hdfs-site.xml is:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/nodes/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/nodes/datanode</value>
</property>
</configuration>
My mapred-site.xml is:
<configuration>
<property>
<name>Map-Reduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
My yarn-site.xml is:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>Map-Reduce_shuffle</value>
</property>
</configuration>
After running
$ start-dfs.sh
$ start-yarn.sh
$ jps
I get the following daemons running.
2800 ResourceManager
2290 NameNode
4242 Jps
2440 DataNode
2634 SecondaryNameNode
start-yarn.sh gives me:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-resourcemanager-abdy-hadoop.out
localhost: starting nodemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.out
The nodemanager daemon does not seem to start at all.
I've been trying to fix this issue for 2 days but cannot seem to find a fix. Can someone please guide me?
If you're starting the Hadoop daemons for the first time, you first have to format your namenode. Before formatting, make sure you delete the existing /hadoop/nodes/namenode and /hadoop/nodes/datanode folders, then execute:
hadoop namenode -format
Once the namenode is formatted, run the following commands:
start-dfs.sh
start-yarn.sh
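If the NodeManager still exits silently after that, its own log file (the .log next to the .out path printed by start-yarn.sh above) usually records the reason. A small sketch, reusing the paths from the question:
# the .out file only captures stdout; the .log file holds the stack trace of a failed startup
tail -n 100 /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.log
# confirm whether the daemon is actually running
jps | grep NodeManager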

Trying to run a spark-submit job on a yarn cluster but I keep getting the following warning. How do I fix the issue?

WARN YarnClusterScheduler: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
resources.
I have looked through similar questions and tried everything that was mentioned. When I look through the yarn-nodemanager log on HDFS, I see the following warning that might be causing the error. How do I fix these warnings?
2017-09-13 14:29:52,640 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The
Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class
org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'.
Because these are not the same tools trying to send ServiceData and read
Service Meta Data may have issues unless the refer to the name in the config.
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<description>
Number of seconds after an application finishes before the nodemanager's
DeletionService will delete the application's localized file directory
and log directory.
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://sandbox:9000</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader</name>
<value>true</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Please let me know if I am looking for a solution to my initial warning in the wrong direction, because the application keeps running but no data is sent to HDFS. Thank you!
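One thing worth confirming for the 'httpshuffle' warning is that yarn-site.xml declares the aux-service name together with its handler class. A hedged sketch, reusing the /usr/local/hadoop path from the classpath above (the property names are the standard Hadoop ones):
# show what the NodeManager actually reads for the aux-service settings
grep -A 2 "yarn.nodemanager.aux-services" /usr/local/hadoop/etc/hadoop/yarn-site.xml
# the usual pairing is:
#   yarn.nodemanager.aux-services                          -> mapreduce_shuffle
#   yarn.nodemanager.aux-services.mapreduce_shuffle.class  -> org.apache.hadoop.mapred.ShuffleHandler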

HMaster automatically stops in HBase

I installed and configured Hadoop (version 2.7.0) and HBase (version 1.2.3) in pseudo-distributed mode. I tested Hadoop with a test program (Word Count) and everything is OK. Before I enter the HBase shell and list the tables, HMaster is running, but when I enter the HBase shell and list the tables (or create a table), I see this error:
hbase(main):001:0> list
TABLE
ERROR: Can't get master address from ZooKeeper; znode data == null
Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:
hbase> list
hbase> list 'abc.*'
hbase> list 'ns:abc.*'
hbase> list 'ns:.*'
And when I exit the HBase shell and run jps, I see that HMaster is no longer running, but HRegionServer and ZooKeeper are still running.
Here is my hbase-site.xml configuration:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
Here is my core-site.xml Configuration:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
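When HMaster exits like this, its log normally says why. A minimal sketch of where to look, assuming a default HBase install layout (HBASE_HOME and the log file names may differ):
# the master log records why it shut down (e.g. trouble reaching HDFS or ZooKeeper)
tail -n 100 $HBASE_HOME/logs/hbase-*-master-*.log
# confirm HDFS is reachable at the same address/port used in hbase.rootdir
hdfs dfs -ls hdfs://localhost:54310/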
