NoClassDefFoundError org.apache.hadoop.fs.FSDataInputStream when executing spark-shell - apache-spark

I've downloaded the prebuilt version of Spark 1.4.0 without Hadoop (i.e. with user-provided Hadoop). When I ran the spark-shell command, I got this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)
    at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:111)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:106)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 7 more
I've searched on the Internet; it is said that HADOOP_HOME has not been set in spark-env.cmd. But I cannot find spark-env.cmd in the Spark installation folder.
I've traced the spark-shell command and it seems there is no HADOOP_CONFIG in there. I've tried adding HADOOP_HOME as an environment variable, but it still gives the same exception.
Actually, I don't really use Hadoop; I downloaded Hadoop as a workaround, as suggested in this question.
I am using Windows 8 and Scala 2.10.
Any help will be appreciated. Thanks.

The "without Hadoop" in the Spark's build name is misleading: it means the build is not tied to a specific Hadoop distribution, not that it is meant to run without it: the user should indicate where to find Hadoop (see https://spark.apache.org/docs/latest/hadoop-provided.html)
One clean way to fix this issue is to:
Obtain Hadoop Windows binaries. Ideally build them yourself, but this is painful (for some hints see: Hadoop on Windows Building/Installation Error). Otherwise Google some up; for instance, you can currently download 2.6.0 from here: http://www.barik.net/archive/2015/01/19/172716/
Create a spark-env.cmd file looking like this (modify Hadoop path to match your installation):
@echo off
set HADOOP_HOME=D:\Utils\hadoop-2.7.1
set PATH=%HADOOP_HOME%\bin;%PATH%
set SPARK_DIST_CLASSPATH=<paste here the output of %HADOOP_HOME%\bin\hadoop classpath>
Put this spark-env.cmd either in a conf folder located at the same level as your Spark base folder (which may look weird), or in a folder indicated by the SPARK_CONF_DIR environment variable.
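If you'd rather not paste the classpath by hand, a small sketch like the following (assuming HADOOP_HOME is set as above and the Hadoop path contains no spaces) captures the output of hadoop classpath directly inside spark-env.cmd:
@echo off
set HADOOP_HOME=D:\Utils\hadoop-2.7.1
set PATH=%HADOOP_HOME%\bin;%PATH%
rem Capture the single line printed by `hadoop classpath` into SPARK_DIST_CLASSPATH.
for /f "delims=" %%i in ('%HADOOP_HOME%\bin\hadoop classpath') do set SPARK_DIST_CLASSPATH=%%i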

I had the same problem; in fact, how to handle it is mentioned in Spark's documentation:
### in conf/spark-env.sh ###
# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
If you want to use your own Hadoop, follow one of the 3 options above and copy-paste it into your spark-env.sh file:
1- if you have hadoop on your PATH
2- if you want to point to the hadoop binary explicitly
3- if you want to point to a Hadoop configuration folder
http://spark.apache.org/docs/latest/hadoop-provided.html

I too had this issue;
export SPARK_DIST_CLASSPATH=`hadoop classpath`
resolved it.

I ran into the same error when trying to get familiar with Spark. My understanding of the error message is that while Spark doesn't need a Hadoop cluster to run, it does need some of the Hadoop classes. Since I was just playing around with Spark and didn't care which version of the Hadoop libraries was used, I downloaded a Spark binary pre-built with a version of Hadoop (2.6) and things started working fine.

Linux:
export SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Windows:
set SPARK_DIST_CLASSPATH=%HADOOP_HOME%\etc\hadoop\*;%HADOOP_HOME%\share\hadoop\common\lib\*;%HADOOP_HOME%\share\hadoop\common\*;%HADOOP_HOME%\share\hadoop\hdfs\*;%HADOOP_HOME%\share\hadoop\hdfs\lib\*;%HADOOP_HOME%\share\hadoop\yarn\lib\*;%HADOOP_HOME%\share\hadoop\yarn\*;%HADOOP_HOME%\share\hadoop\mapreduce\lib\*;%HADOOP_HOME%\share\hadoop\mapreduce\*;%HADOOP_HOME%\share\hadoop\tools\lib\*

Go into SPARK_HOME -> conf,
copy the spark-env.sh.template file and rename it to spark-env.sh.
Inside this file you can set the parameters for Spark, as sketched below.
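A minimal sketch, assuming hadoop is on your PATH and SPARK_HOME points at your Spark folder:
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
# Append the Hadoop classpath so Spark can find classes such as FSDataInputStream.
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> spark-env.sh
$SPARK_HOME/bin/spark-shell    # should now start without the NoClassDefFoundError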

Run the line below from your package directory just before running spark-submit:
export SPARK_DIST_CLASSPATH=`hadoop classpath`

I finally found a solution that removes the exception.
In spark-class2.cmd, add:
set HADOOP_CLASS1=%HADOOP_HOME%\share\hadoop\common\*
set HADOOP_CLASS2=%HADOOP_HOME%\share\hadoop\common\lib\*
set HADOOP_CLASS3=%HADOOP_HOME%\share\hadoop\mapreduce\*
set HADOOP_CLASS4=%HADOOP_HOME%\share\hadoop\mapreduce\lib\*
set HADOOP_CLASS5=%HADOOP_HOME%\share\hadoop\yarn\*
set HADOOP_CLASS6=%HADOOP_HOME%\share\hadoop\yarn\lib\*
set HADOOP_CLASS7=%HADOOP_HOME%\share\hadoop\hdfs\*
set HADOOP_CLASS8=%HADOOP_HOME%\share\hadoop\hdfs\lib\*
set CLASSPATH=%HADOOP_CLASS1%;%HADOOP_CLASS2%;%HADOOP_CLASS3%;%HADOOP_CLASS4%;%HADOOP_CLASS5%;%HADOOP_CLASS6%;%HADOOP_CLASS7%;%HADOOP_CLASS8%;%LAUNCH_CLASSPATH%
Then, change:
"%RUNNER%" -cp %CLASSPATH%;%LAUNCH_CLASSPATH% org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%
to:
"%RUNNER%" -Dhadoop.home.dir=*hadoop-installation-folder* -cp %CLASSPATH% %JAVA_OPTS% %*
It works fine for me, but I'm not sure this is the best solution.

You should add these jars to your code:
commons-cli-1.2.jar
hadoop-common-2.7.2.jar
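One way to supply them without changing your build is to pass them at submit time; a sketch (the jar locations and your_app.jar are placeholders, adjust them to wherever the jars and your application actually live):
spark-submit \
  --jars /path/to/commons-cli-1.2.jar,/path/to/hadoop-common-2.7.2.jar \
  your_app.jar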

Thank you so much. That worked great, but I had to add the Spark jars to the classpath as well:
;c:\spark\lib\*
Also, the last line of the cmd file is missing the word "echo"; so it should say:
echo %SPARK_CMD%

I had the same issue:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)...
Then I realized that I had installed the Spark version without Hadoop. I installed the "with Hadoop" version and the problem went away.

In my case:
Running a Spark job locally differs from running it on a cluster. On a cluster you might have a different dependency/context to follow, so in your pom.xml you might have dependencies declared as provided.
When running locally, those dependencies are not provided for you, so un-comment them (or drop the provided scope) and rebuild.

I encountered the same error. I wanted to install Spark on my Windows PC and therefore downloaded the "without Hadoop" version of Spark, but it turns out you need the Hadoop libraries! So download a Spark version pre-built for Hadoop and set the environment variables.

I got this error because the file was copied from Windows.
Resolve it using
dos2unix file_name

I think you need the spark-core Maven dependency. It worked fine for me.

I used:
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
It works for me!

I added hadoop-client-runtime-3.3.2.jar to my user library.

Related

Wrong JAVA_HOME in hadoop for spark-shell

I needed to install Hadoop in order to have Spark running on my WSL2 Ubuntu for school projects. I installed Hadoop 3.3.1 and Spark 3.2.1 following these two tutorials:
Hadoop Tutorial on Kontext.tech
Spark Tutorial on Kontext.tech
I correctly set up the env variables in my .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
export PATH=$PATH:$JAVA_HOME
export HADOOP_HOME=~/hadoop/hadoop-3.3.1
export SPARK_HOME=~/hadoop/spark-3.2.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:/usr/local/hadoop/bin/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$SPARK_HOME/bin:$PATH
# Configure Spark to use Hadoop classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
As well as in ~/hadoop/spark-3.2.1/conf/spark-env.sh.template:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/
However, when I launch spark-shell, I get this error:
/home/adrien/hadoop/spark-3.2.1/bin/spark-class: line 71: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin//bin/java: No such file or directory
/home/adrien/hadoop/spark-3.2.1/bin/spark-class: line 96: CMD: bad array subscript
There seems to be a mess-up in a redefinition of the $PATH variable, but I can't figure out where it is. Can you help me solve it, please? I don't know Hadoop; I know Spark well, but I never had to install them.
First, certain Spark packages come with Hadoop, so you don't need to download them separately. More specifically, Spark is built against Hadoop 3.2 for now, so using the latest Hadoop version might cause its own problems.
For your problem, JAVA_HOME should not end in /bin or /bin/java. Check the linked post again...
If you used apt install for java, you shouldn't really need to set JAVA_HOME or the PATH for Java, either, as the package manager will do this for you. Or you can use https://sdkman.io
Note: Java 11 is preferred
You also need to remove .template from any config files for them to actually be used... However, JAVA_HOME is automatically detected by spark-submit, so it's completely optional in spark-env.sh
Same applies for hadoop-env.sh
Also remove /usr/local/hadoop/bin/ from your PATH since it doesn't appear you've put anything in that location
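As a sketch, corrected exports using the asker's paths (adjust to your own layout; JAVA_HOME, if you set it at all, should be the JDK root, not the java binary):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64      # not .../jre/bin/java
export HADOOP_HOME=~/hadoop/hadoop-3.3.1
export SPARK_HOME=~/hadoop/spark-3.2.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
# Configure Spark to use the Hadoop classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)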

Can't find Spark Submit when using Spark shell

I installed Spark and am trying to run the file 'train.py' in the directory '/home/xxx/Desktop/BD_Project' from the shell, using the following command:
$SPARK_HOME/bin/spark-submit /home/xxx/Desktop/BD_Project/train.py > output.txt
My teammates who used the same page that I did for spark installations have no problem when running this. However, it throws up the following error for me:
bash: /bin/spark-submit: No such file or directory
You need to set SPARK_HOME to where your Spark is installed, typically /usr/local/spark; spark-submit then lives at $SPARK_HOME/bin/spark-submit.
Before you set it, make sure of where Spark is installed by going to the directory.
You can set it like this before running your command:
export SPARK_HOME=/usr/local/spark
If you are a Homebrew user, setting your SPARK_HOME to
/opt/homebrew/Cellar/apache-spark/3.3.1/libexec
would solve it. Sorry for responding so late; hoping this helps someone with this odd error.
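A minimal sketch (the install path is an assumption; use wherever your Spark actually lives):
export SPARK_HOME=/usr/local/spark      # or /opt/homebrew/Cellar/apache-spark/3.3.1/libexec on Homebrew
export PATH=$SPARK_HOME/bin:$PATH
# spark-submit is then resolved from $SPARK_HOME/bin
spark-submit /home/xxx/Desktop/BD_Project/train.py > output.txt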

Unable to start beeline client

I installed spark-1.5.1-bin-without-hadoop and am trying to start beeline using the following command from the Spark install directory:
./bin/beeline
I get "Error: Could not find or load main class org.apache.hive.beeline.BeeLine".
Not sure why the classpath is not working. I ran into same issue and ended up running java with the jars under lib_managed directory. Note that verbose option is used because no errors are shown in some NoClassDef cases.
java -cp lib_managed/jars/hive-exec-1.2.1.spark.jar:lib_managed/jars/hive-metastore-1.2.1.spark.jar:lib_managed/jars/httpcore-4.3.1.jar:lib_managed/jars/httpclient-4.3.2.jar:lib_managed/jars/libthrift-0.9.2.jar:lib_managed/jars/hive-beeline-1.2.1.spark.jar:lib_managed/jars/jline-2.12.jar:lib_managed/jars/commons-cli-1.2.jar:lib_managed/jars/super-csv-2.2.0.jar:lib_managed/jars/commons-logging-1.1.3.jar:lib_managed/jars/hive-jdbc-1.2.1.spark.jar:lib_managed/jars/hive-cli-1.2.1.spark.jar:lib_managed/jars/hive-service-1.2.1.spark.jar:assembly/target/scala-2.10/spark-assembly-1.5.3-SNAPSHOT-hadoop2.2.0.jar org.apache.hive.beeline.BeeLine -u jdbc:hive2://<thrift server public address>:10000/default --verbose=true
I had exactly the same problem. For me, setting the SPARK_HOME environment variable did it!
export SPARK_HOME=/Users/../Downloads/spark-2.1.1-bin-hadoop2.7
This is because, if you actually open the bin/beeline script file, you'll find this line:
# Figure out if SPARK_HOME is set
So, after setting SPARK_HOME to the proper location, beeline started working fine.

Installing Apache Spark on linux

I am installing Apache Spark on Linux. I already have Java, Scala and Spark downloaded, and they are all in the Downloads folder inside the Home folder, with the path /home/alex/Downloads/X where X = scala, java, spark; literally, that's what the folders are called.
I got Scala to work, but when I try to run Spark by typing ./bin/spark-shell it says:
/home/alex/Downloads/spark/bin/saprk-class: line 100: /usr/bin/java/bin/java: Not a directory
I have already included the file path by editing the bashrc with sudo gedit ~/.bashrc:
# JAVA
export JAVA_HOME=/home/alex/Downloads/java
export PATH=$PATH:$JAVA_HOME/bin
# scala
export SCALA_HOME=/home/alex/Downloads/scala
export PATH=$PATH:$SCALA_HOME/bin
# spark
export SPARK_HOME=/home/alex/Downloads/spark
export PATH=$PATH:$SPARK_HOME/bin
When I try to type sbt/sbt package in the spark folder, it also says no such file or directory is found. What should I do from here?
It seems you have a few issues: namely, your JAVA_HOME is not pointing to a directory with Java, and when you are running sbt in Spark you should run ./sbt/sbt (or in newer versions ./build/sbt). While you can download Java and Scala by hand, you may find that your system packages are sufficient (make sure to get JDK 7 or later).
Furthermore, after using system packages, as Holden points out, on Linux you may use the whereis command to make sure of the right paths.
Finally, the following link may prove useful:
http://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
Hope this helps.
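For instance, a quick sketch (assuming a Debian/Ubuntu-style JVM layout) for finding the right JAVA_HOME:
whereis java                         # shows where the java binary lives
readlink -f "$(which java)"          # e.g. /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # the JVM root, without the trailing /jre/bin/java
export PATH=$PATH:$JAVA_HOME/bin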
Note: it looks like there may be a configuration issue, a misspelling of the directory name:
/home/alex/Downloads/spark/bin/saprk-class: line 100: /usr/bin/java/bin/java: Not a directory
saprk-class
That could be a configuration issue only, but it's worth checking whether the script is called spark-class elsewhere to see if it's causing related issues.

Why does spark-submit and spark-shell fail with "Failed to find Spark assembly JAR. You need to build Spark before running this program."?

I was trying to run spark-submit and I get
"Failed to find Spark assembly JAR.
You need to build Spark before running this program."
When I try to run spark-shell I get the same error.
What do I have to do in this situation?
On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces.
Your Spark package doesn't include compiled Spark code. That's why you got the error message from the spark-submit and spark-shell scripts.
You have to download one of the pre-built versions from the "Choose a package type" section on the Spark download page.
Try running mvn -DskipTests clean package first to build Spark.
If your Spark binaries are in a folder whose name contains spaces (for example, "Program Files (x86)"), it won't work. I changed it to "Program_Files", and then the spark-shell command works in cmd.
In my case, I installed Spark via pip3 install pyspark on macOS, and the error was caused by an incorrect SPARK_HOME variable. It works when I run a command like the one below:
PYSPARK_PYTHON=python3 SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark python3 wordcount.py a.txt
Go to SPARK_HOME. Note that your SPARK_HOME variable should not include /bin at the end. Mind that when you're adding it to PATH, like this: export PATH=$SPARK_HOME/bin:$PATH
Run export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g" to allot more memory to maven.
Run ./build/mvn -DskipTests clean package and be patient. It took my system 1 hour and 17 minutes to finish this.
Run ./dev/make-distribution.sh --name custom-spark --pip. This is just for python/pyspark. You can add more flags for Hive, Kubernetes, etc.
Running pyspark or spark-shell will now start pyspark and spark respectively.
If you have downloaded the binary and are getting this exception,
then please check whether your SPARK_HOME path contains spaces, like "apache spark"/bin.
Just removing the spaces will work.
Just to add to #jurban1997's answer:
if you are running Windows, then make sure that the SPARK_HOME and SCALA_HOME environment variables are set up right. SPARK_HOME should point to the Spark installation folder, i.e. the folder that contains bin\spark-shell.cmd.
For a Windows machine with the pre-built version as of today (21.01.2022):
In order to verify all the edge cases you may have and avoid tedious guesswork about what exactly is not configured properly:
Find spark-class2.cmd and open it with a text editor.
Inspect the arguments of commands starting with call or if exists by typing the arguments in Command Prompt like this:
Open Command Prompt. (For PowerShell you need to print the variable another way.)
Copy-paste %SPARK_HOME%\bin\ as is and press Enter.
If you see something like bin\bin in the path displayed, then you have appended \bin in your environment variable %SPARK_HOME%.
Now you have to add the path to spark\bin to your PATH variable, or it will not find the spark-submit command.
Try out and correct every path variable that the script in this file uses, and you should be good to go.
After that, enter spark-submit ... and you may now encounter the missing hadoop winutils.exe; for that problem you can go get the tool and paste it where spark-submit.cmd is located.
Spark installation:
For a Windows machine:
Download spark-2.1.1-bin-hadoop2.7.tgz from this site: https://spark.apache.org/downloads.html
Unzip and paste your spark folder into the C:\ drive and set the environment variable.
If you don't have Hadoop,
you need to create a Hadoop folder, also create a bin folder in it, and then copy and paste the winutils.exe file into it.
Download the winutils file from https://codeload.github.com/gvreddy1210/64bit/zip/master
and paste the winutils.exe file into the Hadoop\bin folder, and set the environment variable for c:\hadoop\bin;
create a temp\hive folder in the C:\ drive and give full permission to this folder, like:
C:\Windows\system32>C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
Open a command prompt, first run C:\hadoop\bin> winutils.exe, then navigate to C:\spark\bin> and
run spark-shell (see the consolidated sketch below).
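Put together, a sketch of the resulting environment (the C:\ paths are the ones assumed in the steps above; adjust to your own layout):
set HADOOP_HOME=C:\hadoop
set SPARK_HOME=C:\spark
set PATH=%HADOOP_HOME%\bin;%SPARK_HOME%\bin;%PATH%
rem Grant permissions on the hive scratch directory, then start the shell.
C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
spark-shell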

Resources