Yarn nodemanager not starting up. Getting no errors - linux

I have Hadoop 2.7.4 installed on Ubuntu 16.04. I'm trying to run it in Pseudo Mode.
I have a '/hadoop' partition mounted for all my Hadoop files, including the NameNode and DataNode files.
My core-site.xml is:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
My hdfs-site.xml is:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/nodes/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/nodes/datanode</value>
</property>
</configuration>
My mapred-site.xml is:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
My yarn-site.xml is:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
After running
$ start-dfs.sh
$ start-yarn.sh
$ jps
I get the following daemons running.
2800 ResourceManager
2290 NameNode
4242 Jps
2440 DataNode
2634 SecondaryNameNode
start-yarn.sh gives me:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-resourcemanager-abdy-hadoop.out
localhost: starting nodemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.out
The NodeManager daemon does not seem to start at all.
I've been trying to fix this for two days but cannot find a solution. Can someone please guide me?
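One thing worth checking (a general debugging step, not part of the original post) is the NodeManager log itself; the .out file named in the start-yarn.sh output above normally has a matching .log file in the same directory that records why the daemon exited, for example:
tail -n 100 /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.log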

If you're starting the Hadoop daemons for the first time, you first have to format your NameNode.
Before formatting the NameNode, make sure you delete the existing
/hadoop/nodes/namenode and /hadoop/nodes/datanode folders.
Then execute:
hadoop namenode -format
Once the formatting of the NameNode is done,
execute the following commands:
start-dfs.sh
start-yarn.sh
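Putting the steps above together, a rough end-to-end sketch (paths taken from the question; note that this wipes the existing HDFS data, so it is only appropriate on a fresh setup):
rm -rf /hadoop/nodes/namenode /hadoop/nodes/datanode
hadoop namenode -format
start-dfs.sh
start-yarn.sh
jps    # NodeManager should now appear alongside the other daemons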

Related

cannot stat '/user/hadoop/logs/datanode-cluster

I am trying to run a multi-step job in which one of the steps is a script that uses PySpark/Apache Spark. I have a 4-node compute cluster with a SLURM job scheduler and am wondering how I can run them together. Currently, I have Spark on all the nodes (with the head node acting as the "master" and the remaining 3 compute nodes as "slaves") and Hadoop (with the head node as the namenode and secondary namenode, and the remaining 3 compute nodes as datanodes).
However, when I start Hadoop on the head node with start-all.sh, I only see a single datanode, and when I try to start it I get errors saying:
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.4': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.3': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.2': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.1': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out': No such file or directory
However, these files exist and seem to be readable/writable. Spark starts fine and the 3 slave nodes can be started from the head node. Because of the error mentioned before, when I submit my job to SLURM it throws the error above. I would appreciate any advice on this issue, and on the architecture of my process.
Edit 1: Hadoop config files
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://cluster-hn:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/s1/snagaraj/hadoop/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/s1/snagaraj/hadoop/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.https.port</name>
<value>50470</value>
<description>The https port where namenode binds</description>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>0</value>
</property>
</configuration>
Workers File
localhost
cluster-n1
cluster-n2
cluster-n3
I have been facing this same issue. I was able to fix it by giving 775 permissions to the logs directory recursively, i.e., in my case: chmod 775 -R /home/admin/hadoop/logs
Now the "mv: cannot stat... .out': No such file or directory" error is gone.

failed to connect hadoop spark

I am a beginner with Spark jobs and Spark configuration.
I am trying to submit a Spark job; after a few minutes (the job is accepted and runs for a few minutes) it fails with a connection refused error.
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 2
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to my.domain.com/myIp:portNumber
I also get this error on jobs that succeed:
ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
On my computer, the job runs fine in IntelliJ IDEA, so this is not a code mistake.
I have tried several times to change the configuration in yarn-site.xml and mapred-site.xml.
This is a Hadoop HDFS cluster with 3 nodes, 2 cores per node, and 8 GB RAM per node. I try to submit with this command line:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 --class MyClass --master yarn --deploy-mode cluster myJar.jar
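As a general debugging step (not something from the original thread), the aggregated YARN application logs usually show why the executor that owned the shuffle blocks died; this assumes YARN log aggregation is enabled, and <application_id> is a placeholder for the id printed by spark-submit:
yarn logs -applicationId <application_id> | grep -iE "error|killed|exit" | tail -n 40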
mapred-site.xml :
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ipadress</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2000</value>
</property>
spark-defaults.conf
spark.master yarn
spark.driver.memory 1g
spark.history.fs.update.interval 30s
spark.history.ui.port port
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 2
spark.executor.memory 2g
spark.cores.max 2
spark.executor.cores 2

Enabling dynamic allocation on spark on YARN mode

This question is similar to this but there was no answer.
I am trying to enable dynamic allocation for Spark in YARN mode. I have an 11-node cluster with 1 master node and 10 worker nodes. I am following the links below for instructions:
For setup in YARN:
http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
Config variables need to be set in spark-defaults.conf: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
I have also referred to the link below and a few other resources:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html#spark.dynamicAllocation.testing
Here are the steps I am doing:
Setting up config variables in spark-defaults.conf.
My spark-defaults.conf settings related to dynamic allocation and the shuffle service are:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
Making changes in yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/client/*,$HADOOP_MAPRED_HOME/share/hadoop/client/lib/*,/home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar </value>
</property>
All these steps are replicated on all worker nodes, i.e. spark-defaults.conf has the above-mentioned values and yarn-site.xml has these properties. I have made sure that /home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar exists on all worker nodes.
Then I run $SPARK_HOME/sbin/start-shuffle-service.sh on the worker nodes and the master node. On the master node, I restart YARN using stop-yarn.sh and then start-yarn.sh.
Then I run yarn node -list -all to see the worker nodes, but I am not able to see any nodes.
When I disable the property
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
I can see all the worker nodes as normal, so it seems the shuffle service is not properly configured.
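One way to narrow this down (a general suggestion, not from the original post) is to check a worker's NodeManager log after restarting YARN and look for the spark_shuffle aux-service registration, or for a ClassNotFoundException on org.apache.spark.network.yarn.YarnShuffleService; this assumes the logs live under $HADOOP_HOME/logs, so adjust the path if HADOOP_LOG_DIR points elsewhere:
grep -iE "spark_shuffle|YarnShuffleService|ClassNotFoundException" $HADOOP_HOME/logs/yarn-*-nodemanager-*.log | tail -n 20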

Trying to run a spark-submit job on a yarn cluster but I keep getting the following warning. How do I fix the issue?

WARN YarnClusterScheduler: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
resources.
I have looked through similar questions and tried everything else that was mentioned. When I look through the yarn-nodemanager log on HDFS, I see the following warning that might be causing the error. How do I fix these warnings?
2017-09-13 14:29:52,640 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The
Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class
org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'.
Because these are not the same tools trying to send ServiceData and read
Service Meta Data may have issues unless the refer to the name in the config.
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<description>
Number of seconds after an application finishes before the nodemanager's
DeletionService will delete the application's localized file directory
and log directory.
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://sandbox:9000</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader</name>
<value>true</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Please let me know if I am looking for a solution to my initial warning in the wrong direction, because the application keeps running but no data is sent to HDFS. Thank you!
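As a general check (not part of the original post), it can help to confirm that the NodeManagers have actually registered with the ResourceManager and are advertising memory and vcores before digging further into the shuffle-handler warning:
yarn node -list -all
yarn application -list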

Hadoop: each namenode and datanode only last for a momentary time

Using CentOS 5.4.
Three virtual machines (using VMware Workstation): master, slave1, slave2. master is used for the namenode, and slave1 and slave2 are used for the datanodes.
The Hadoop version is hadoop-0.20.1.tar.gz. I have configured all the relevant files and closed the firewall as the root user using the command /sbin/service iptables stop. Then I formatted the namenode and started Hadoop on the master (namenode) virtual machine with the following commands; no error was reported.
bin/hadoop namenode -format
bin/start-all.sh
Then I immediately ran the "jps" command on the master machine and found the expected result:
5144 JobTracker
4953 NameNode
5079 SecondaryNameNode
5216 Jps
But after several seconds, when I ran the "jps" command again, all the virtual machines showed only one process: Jps. The following is the result displayed on the namenode (master):
5236 Jps
What is wrong? How can I find out what caused it? Does it mean that it cannot find any namenode or datanode? Thank you.
Attachment: all the places I have modified:
hadoop-env.sh:
# set java environment
export JAVA_HOME=/usr/jdk1.6.0_13/
core-site.xml:
<configuration>
<property>
<name>master.node</name>
<value>namenode_master</value>
<description>master</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>local dir</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://${master.node}:9000</value>
<description> </description>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>${hadoop.tmp.dir}/hdfs/name</value>
<description>local dir</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/hdfs/data</value>
<description> </description>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>${master.node}:9001</value>
<description> </description>
</property>
<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description> </description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/mapred/system</value>
<description>hdfs dir</description>
</property>
</configuration>
master:
master
slaves:
slave1
slave2
/etc/hosts:
192.168.190.133 master
192.168.190.134 slave1
192.168.190.135 slave2
From the log files, I found that I should change namenode_master to master in core-site.xml. Now it works.
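For clarity, a sketch of what the corrected property in core-site.xml would look like, using the master hostname from the /etc/hosts entries above (the rest of the file stays the same):
<property>
<name>master.node</name>
<value>master</value>
<description>master</description>
</property>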
