Datanode is not starting after deploying Hadoop cluster 2.6 - linux

ERROR received in the logs:
FATAL datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to hadooptest3/100.6.89.29:8020

There are 2 Possible Solutions to resolve
First:
Your namenode and datanode cluster ID does not match, make sure to make them the same.
In name node, change your cluster id in the file located in:
cat HADOOP_FILE_SYSTEM/namenode/current/VERSION
In data node you cluster id is stored in the file:
cat HADOOP_FILE_SYSTEM/datanode/current/VERSION
This locations are set in hdfs-site.xml file in the cluster.
Check your hdfs-site.xml file and check for dfs.datanode.data.dir and dfs.namenode.name.dir.
By going through those folders, Here I get the contents (in my pseudo-cluster)
clusterID=CID-483c19b1-b198-4806-93d2-af7508d1a5e5
You should have exactly same cluster-id.
Secondly:
Format the namenode:
Hadoop 1.x: hadoop namenode -format
Hadoop 2.x: hdfs namenode -format
Alternatively, remove hdfs root directory /tmp/hadoop-root/ (set up in conf files) - and format the namenode to initialize from begining.

Your config files looks fine. From the error logs that you commented Unexpected version of storage directory /home/hadoop/hdfs. Reported: -60. Expecting = -56. , it seems that data directory created inside /home/hadoop/hdfs is not reformated when you applied `hadoop namenode -format command.
So I suggest you to delete that data directory inside /home/hadoop/hdfs before you format namenode. Then apply format command and start hadoop cluster. It should be solved then.

Related

how do we copy file from hadoop to abfs remotely

how do we copy files from Hadoop to abfs (azure blob file system)
I want to copy from Hadoop fs to abfs file system but it throws an error
this is the command I ran
hdfs dfs -ls abfs://....
ls: No FileSystem for scheme "abfs"
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
any idea how this can be done ?
In the core-site.xml you need to add a config property for fs.abfs.impl with value org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem, and then add any other related authentication configurations it may need.
More details on installation/configuration here - https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
the abfs binding is already in core-default.xml for any release with the abfs client present. however, the hadoop-azure jar and dependency is not in the hadoop common/lib dir where it is needed (it is in HDI, CDH, but not the apache one)
you can tell the hadoop script to pick it and its dependencies up by setting the HADOOP_OPTIONAL_TOOLS env var; you can do this in ~/.hadoop-env; just try on your command line first
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
after doing that, download the latest cloudstore jar and use its storediag command to attempt to connect to an abfs URL; it's the place to start debugging classpath and config issues
https://github.com/steveloughran/cloudstore

"Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher" when running spark-submit or PySpark

I am trying to run the spark-submit command on my Hadoop cluster Here is a summary of my Hadoop Cluster:
The cluster is built using 5 VirtualBox VM's connected on an internal network
There is 1 namenode and 4 datanodes created.
All the VM's were built from the Bitnami Hadoop Stack VirtualBox image
I am trying to run one of the spark examples using the following spark-submit command
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
environment variables: HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ in spark-defaults.conf
I have uploaded the jars into hdfs (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ )
The logs accessible via the web interface (i.e. http://hadoop-namenode:8042 ) do not provide any further details about the error.
This section of the Spark documentation seems relevant to the error since the YARN libraries should be included, by default, but only if you've installed the appropriate Spark version
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn’s classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths in each of ResourceManager servers where JARs are available for all YARN applications (spark.yarn.jars or extra packages should get layered onto this)
Another problem could be file permissions. You probably shouldn't put Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put it under hdfs:///apps/spark/<version>, then give that 744 HDFS permissions
In the Spark / YARN UI, it should show the complete classpath of the application for further debugging
I figured out why I was getting this error. It turns out that I made an error while specifying spark.yarn.jars in spark-defaults.conf
The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
i.e. Basically, we need to specify the jar files as the value to this property and not the folder containing the jar files.

Hadoop copyToLocalFile failing in Yarn cluster mode

I was trying to copy a file to local from HDFS using Hadoop's copyToLocalFile function from my Spark2 application.
val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)
val src = new Path("/user/yxs7634/all.txt")
val dest = new Path("file:///home/yxs7634/all.txt")
hdfs.copyToLocalFile(src, dest)
The above code is working fine when I submit my spark application in Yarn client mode. But, It keeps failing with the below exception in Yarn cluster mode.
18/10/03 12:18:40 ERROR yarn.ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/yxs7634/all.txt (Permission denied)
In yarn-cluster mode the driver is also handled by yarn and the selected driver node may not be the one where you're submitting the job. Hence for this job to work in yarn-cluster mode I believe you need to place the local file in all the spark nodes in the cluster.
In yarn mode, the spark job is submitted through YARN.
The driver would be started on a different node.
To tackle this issue, you can use a distributed file system like HDFS to store your file and then giving the absolute path.
eg:
val src = new Path("hdfs://nameservicehost:8020/user/yxs7634/all.txt")
Looks like Spark server running under one user (for ex. "spark"), and file in code stored in other user "yxs7634" directory.
In cluster mode user "spark" does not allows to write in "yxs7634" user dir, and such exception occurs.
Additional permission for Spark user to write in "/home/yxs7634" is required.
In local mode worked fine, because Spark runs under "yxs7634" user.
You have a permission denied error, I mean, the user you are using to submit the job is not able to access the file. The directory should have at least read permission to user "other", something like this: -rw-rw-r--
Can you paste the permissions of the directory and the file? The command is
hdfs dfs -ls /your-directory/

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using SparkContext (sc) from spark-shell on the terminal. For some reason going through the Intellij application and Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
#param directory HDFS directory to monitor for new file
So, the method expects the path to a directory in the parameter.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure in the spark-submit command, you give only directory name and not the file name. Below is a sample command. Here, I am passing the directory name through the spark command as my first parameter. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar--master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at loss here (Spark newbie). I spun up an EC2 cluster, and submitted a Spark job which saves as text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the python file I'm submitting is /root. I cannot find the directory called september_2005, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The ec2 address is the master node where I'm ssh'ing to, but I don't have a folder /user/root.
Seems like Spark is creating the september_2015 directory somehwere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist in the master node filesystem?
You're not saving it in the local file system, you're saving it in the hdfs cluster. Try eph*-hdfs/bin/hadoop fs -ls /, then you should see your file. See eph*-hdfs/bin/hadoop help for more commands, eg. -copyToLocal.

Resources