cannot stat '/user/hadoop/logs/datanode-cluster

I am trying to run a multi-step job where one of the steps is a script that uses PySpark/Apache Spark. I have a 4-node compute cluster with a SLURM job scheduler and am wondering how I can run them together. Currently, I have Spark on all the nodes (with the head node acting as the "master" and the remaining 3 compute nodes as "slaves") and Hadoop (with the head node as the namenode and secondary namenode, and the remaining 3 compute nodes as datanodes).
However, when I start Hadoop on the head node with start-all.sh, I only see a single datanode, and when I try to start the others I get an error saying:
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.4': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.3': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.2': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out.1': No such file or directory
localhost: mv: cannot stat '/user/hadoop/logs/datanode-cluster-n1.out': No such file or directory
However, these files exist and appear to be readable/writable. Spark starts fine and the 3 slave nodes can be started from the head node. Because of the error mentioned before, when I submit my job to SLURM it throws the same error. I would appreciate any advice on this issue and on the architecture of my process.
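For context, the Spark step would be submitted from a SLURM batch script along these lines (the script name, resource numbers, and master port below are placeholders, not my exact setup; spark://cluster-hn:7077 assumes the default standalone master port):
#!/bin/bash
# Placeholder resources; the Spark driver runs inside this one allocation.
#SBATCH --job-name=pyspark-step
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
# Submit the PySpark script to the standalone master on the head node.
spark-submit --master spark://cluster-hn:7077 my_pyspark_step.py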
Edit 1: Hadoop config files
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://cluster-hn:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/s1/snagaraj/hadoop/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/s1/snagaraj/hadoop/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.https.port</name>
<value>50470</value>
<description>The https port where namenode binds</description>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>0</value>
</property>
</configuration>
Workers File
localhost
cluster-n1
cluster-n2
cluster-n3

I have been facing this same issue. I was able to fix it by giving 775 permissions to the logs directory recursively, i.e., in my case:
chmod 775 -R /home/admin/hadoop/logs
Now the "mv: cannot stat ... .out': No such file or directory" error is gone.


Yarn nodemanager not starting up. Getting no errors

I have Hadoop 2.7.4 installed on Ubuntu 16.04. I'm trying to run it in pseudo-distributed mode.
I have a '/hadoop' partition mounted for all my Hadoop files, including the NameNode and DataNode files.
My core-site.xml is:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
My hdfs-site.xml is:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/nodes/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/nodes/datanode</value>
</property>
</configuration>
My mapred-site.xml is:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
My yarn-site.xml is:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
After running
$ start-dfs.sh
$ start-yarn.sh
$ jps
I get the following daemons running.
2800 ResourceManager
2290 NameNode
4242 Jps
2440 DataNode
2634 SecondaryNameNode
start-yarn.sh gives me:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-resourcemanager-abdy-hadoop.out
localhost: starting nodemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.out
The nodemanager daemon does not seem to start at all.
I've tried for 2 days to fix this issue but I cannot seem to find a fix. Could someone please guide me?
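A first place to look is the nodemanager .log file that sits next to the .out file printed by start-yarn.sh; a sketch, assuming the same log directory and the usual .out/.log naming:
# the .out file is usually empty; the reason for the exit ends up in the .log file
tail -n 100 /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.log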
If you are going to start the Hadoop daemons for the first time, you first have to format your namenode.
Before formatting the namenode, make sure you delete the existing
/hadoop/nodes/namenode and /hadoop/nodes/datanode folders.
Then execute:
hadoop namenode -format
Once formatting of the namenode is done, execute the following commands:
start-dfs.sh
start-yarn.sh
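Put together, the sequence looks roughly like this (directory paths taken from the hdfs-site.xml in the question):
# remove the old NameNode/DataNode storage so the format starts clean
rm -rf /hadoop/nodes/namenode /hadoop/nodes/datanode
# reformat the namenode, then bring HDFS and YARN back up
hadoop namenode -format
start-dfs.sh
start-yarn.sh
# confirm that NodeManager now shows up
jps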

I am trying to run a spark-submit job on a YARN cluster but I keep getting the following warning. How do I fix the issue?

WARN YarnClusterScheduler: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
resources.
I have looked through similar questions and tried everything else that was mentioned. When I look through the yarn-nodemanager log on hdfs I see the following warning that might be causing the error. How do I fix these warnings?
2017-09-13 14:29:52,640 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The
Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class
org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'.
Because these are not the same tools trying to send ServiceData and read
Service Meta Data may have issues unless the refer to the name in the config.
yarn-site.xml log:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<description>
Number of seconds after an application finishes before the nodemanager's
DeletionService will delete the application's localized file directory
and log directory.
core-site.xml log:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://sandbox:9000</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader</name>
<value>true</value>
</property>
</configuration>
hdfs-site.xml log:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Please let me know if I am looking in the wrong direction for a solution to my initial warning, because the application keeps running but no data is sent to HDFS. Thank you!
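For reference, the usual way to make the aux-service name and its handler class explicit in yarn-site.xml looks like this (a sketch of the standard mapping, not necessarily the fix for the scheduling warning itself):
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>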

HMaster automatically stops in HBase

I installed and configured Hadoop (version 2.7.0) and HBase (version 1.2.3) in pseudo-distributed mode. I tested Hadoop with a test program (Word Count) and everything is OK. Before I enter the HBase shell and list the tables, HMaster is running, but when I enter the HBase shell and list the tables (or create a table), I see this error:
hbase(main):001:0> list
TABLE
ERROR: Can't get master address from ZooKeeper; znode data == null
Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:
hbase> list
hbase> list 'abc.*'
hbase> list 'ns:abc.*'
hbase> list 'ns:.*'
And when I come back from the HBase shell and run jps, I see that there is no HMaster running, but HRegionServer and ZooKeeper are still running.
Here is my hbase-site.xml configuration:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
Here is my core-site.xml configuration:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
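A quick way to see why HMaster died is to check its log and ask ZooKeeper whether a master address was ever registered; a diagnostic sketch (the log path and file naming below are the usual HBase defaults, not taken from the question):
# the exception that killed HMaster is normally at the end of its log
tail -n 100 $HBASE_HOME/logs/hbase-*-master-*.log
# check the znode that the error message refers to
$HBASE_HOME/bin/hbase zkcli
# then, inside the ZooKeeper shell:
#   ls /hbase
#   get /hbase/master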

Apache Zeppelin is started but there is a connection error at localhost:8080

After successfully building Apache Zeppelin on Ubuntu 14, I start Zeppelin and it says it started successfully, but when I go to localhost:8080 Firefox shows an "unable to connect" error, as if it hadn't started. Yet when I check the Zeppelin status from the terminal it says it is running. I also just copied the config file templates, so the config files are the defaults.
Update
I changed the port to 8090; here is the config file, but there is no change in the result.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>zeppelin.server.addr</name>
<value>0.0.0.0</value>
<description>Server address</description>
</property>
<property>
<name>zeppelin.server.port</name>
<value>8090</value>
<description>Server port. port+1 is used for web socket.</description>
</property>
<property>
<name>zeppelin.websocket.addr</name>
<value>0.0.0.0</value>
<description>Testing websocket address</description>
</property>
<!-- If the port value is negative, then it'll default to the server
port + 1.
-->
<property>
<name>zeppelin.websocket.port</name>
<value>-1</value>
<description>Testing websocket port</description>
</property>
<property>
<name>zeppelin.notebook.dir</name>
<value>notebook</value>
<description>path or URI for notebook persist</description>
</property>
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.VFSNotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
<property>
<name>zeppelin.interpreter.dir</name>
<value>interpreter</value>
<description>Interpreter implementation base directory</description>
</property>
<property>
<name>zeppelin.interpreters</name>
<value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.hive.HiveInterpreter,org.apache.zeppelin.tajo.TajoInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter</value>
<description>Comma separated interpreter configurations. First interpreter become a default</description>
</property>
<property>
<name>zeppelin.ssl</name>
<value>false</value>
<description>Should SSL be used by the servers?</description>
</property>
<property>
<name>zeppelin.ssl.client.auth</name>
<value>false</value>
<description>Should client authentication be used for SSL connections?</description>
</property>
<property>
<name>zeppelin.ssl.keystore.path</name>
<value>keystore</value>
<description>Path to keystore relative to Zeppelin configuration directory</description>
</property>
<property>
<name>zeppelin.ssl.keystore.type</name>
<value>JKS</value>
<description>The format of the given keystore (e.g. JKS or PKCS12)</description>
</property>
<property>
<name>zeppelin.ssl.keystore.password</name>
<value>change me</value>
<description>Keystore password. Can be obfuscated by the Jetty Password tool</description>
</property>
<!--
<property>
<name>zeppelin.ssl.key.manager.password</name>
<value>change me</value>
<description>Key Manager password. Defaults to keystore password. Can be obfuscated.</description>
</property>
-->
<property>
<name>zeppelin.ssl.truststore.path</name>
<value>truststore</value>
<description>Path to truststore relative to Zeppelin configuration directory. Defaults to the keystore path</description>
</property>
<property>
<name>zeppelin.ssl.truststore.type</name>
<value>JKS</value>
<description>The format of the given truststore (e.g. JKS or PKCS12). Defaults to the same type as the keystore type</description>
</property>
<!--
<property>
<name>zeppelin.ssl.truststore.password</name>
<value>change me</value>
<description>Truststore password. Can be obfuscated by the Jetty Password tool. Defaults to the keystore password</description>
</property>
-->
</configuration>
And here are the ports that are in the listening state after Zeppelin is started:
tcp6 0 0 :::8081 :::* LISTEN
tcp6 0 0 ::1:631 :::* LISTEN
tcp6 0 0 :::8091 :::* LISTEN
tcp6 0 0 :::9001 :::* LISTEN
and Zeppelin is running [ OK ]
is the response I get when I run the command bin/zeppelin-daemon.sh status.
Check whether you can reach it at 127.0.0.1:8080. This works for me even though localhost:8080 is not reachable.
Also check other Zeppelin files, like interpreter.json and the notebook files. They might have saved config values that override what you are setting in configuration.xsl.
I had a similar problem, mostly with the MASTER setting, but also with the port. I specified new values, but Zeppelin was ignoring them. I eventually discovered that Zeppelin had taken the value of the environment variable MASTER and, unknown to me, saved it into the interpreter.json file. You might try editing that file, or recreating your Zeppelin interpreters.
In my case, I decided not to mess with that and just did a complete reinstall of Zeppelin to ensure a clean slate. Then I added the following lines to the zeppelin-env.sh file before starting:
export MASTER=local[*]
export ZEPPELIN_PORT=8088
That worked.
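After editing zeppelin-env.sh, the daemon has to be restarted for the new port to take effect, e.g.:
bin/zeppelin-daemon.sh restart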
The conf in the zeppelin-site.xml file should be like you have it; also add the right Spark master address in the zeppelin-env.sh and interpreter.json files.
You can copy the Spark master address from the spark-master log file.
I did it like this and it is running fine.
I had the same issue. The solution that worked for me was to add the IP and host name in /etc/hosts.
If you go to the logs folder where Zeppelin is installed you may find more information. For me, that helped. The logs showed a "Caused by: java.net.UnknownHostException... Temporary failure in name resolution".
Adding the host name to /etc/hosts solved the issue.
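For example, an /etc/hosts entry of this shape (the hostname and address here are placeholders for the machine's actual values):
# map the machine's hostname to a local address so Zeppelin can resolve it
127.0.1.1   my-zeppelin-host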
In my case, the Spark and Zeppelin versions conflicted. My Zeppelin did not support Spark 2.2.0 (it is possible from a newer version; check out https://issues.apache.org/jira/browse/ZEPPELIN-2768). If you have no errors in the Zeppelin log and can't get to localhost, check that your Zeppelin supports your Spark version.

Hadoop: each namenode and datanode only lasts for a moment

Using CentOS 5.4.
Three virtual machines (using VMware Workstation): master, slave1, slave2. master is used for the namenode, and slave1 and slave2 are used for the datanodes.
The Hadoop version is hadoop-0.20.1.tar.gz. I have configured all the relevant files and shut down the firewall as the root user with the command /sbin/service iptables stop. Then I tried to format the namenode and start Hadoop on the master (namenode) virtual machine with the following commands, and no error was reported.
bin/hadoop namenode -format
bin/start-all.sh
Then I typed the command "jps" on the master machine right away, and found the expected result:
5144 JobTracker
4953 NameNode
5079 SecondaryNameNode
5216 Jps
But after several seconds, when I typed the "jps" command again, all the virtual machines had only one process: Jps. The following is the result displayed on the namenode (master):
5236 Jps
What's the matter? Or how can I find out what caused it? Does it mean that it cannot find any namenode or datanode? Thank you.
Attachment: all the places I have modified:
hadoop-env.sh:
# set java environment
export JAVA_HOME=/usr/jdk1.6.0_13/
core-site.xml:
<configuration>
<property>
<name>master.node</name>
<value>namenode_master</value>
<description>master</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>local dir</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://${master.node}:9000</value>
<description> </description>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>${hadoop.tmp.dir}/hdfs/name</value>
<description>local dir</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/hdfs/data</value>
<description> </description>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>${master.node}:9001</value>
<description> </description>
</property>
<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description> </description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/mapred/system</value>
<description>hdfs dir</description>
</property>
</configuration>
master:
master
slaves:
slave1
slave2
/etc/hosts:
192.168.190.133 master
192.168.190.134 slave1
192.168.190.135 slave2
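The daemon logs under the Hadoop logs/ directory usually show why the processes exit; a sketch of how one might check them (assuming Hadoop is installed under /usr/local/hadoop, as hadoop.tmp.dir suggests, with the standard hadoop-<user>-<daemon>-<hostname>.log naming):
# on the master: why did the NameNode / JobTracker stop?
tail -n 50 /usr/local/hadoop/logs/hadoop-*-namenode-*.log
# on a slave: why did the DataNode stop?
tail -n 50 /usr/local/hadoop/logs/hadoop-*-datanode-*.log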
From the log files, I found that I should change namenode_master to master in the core-site.xml file. Now it works.
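For clarity, the corrected property in core-site.xml would look like this, so that ${master.node} resolves to the name listed in /etc/hosts:
<property>
<name>master.node</name>
<value>master</value>
<description>master</description>
</property>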
