How to run an Apache Spark Java program in standalone mode - apache-spark

I have written a Java program for Spark, but I am not able to run it from the command line.
I have followed the steps given in the Quick Start guide, but I am getting the following error. Please help me out with this problem.
Here is the error:
hadoopnod#hadoopnod:~/spark-1.2.1/bin$ ./run-example "SimpleApp " --master local /home/hadoopnod/Spark_Java/target/simple-project-1.0.jar
java.lang.ClassNotFoundException: org.apache.spark.examples.SimpleApp
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Create a JAR file using the following command. You can find the SimpleApp.class file in the "target/classes" folder; cd to that directory first.
jar cfve file.jar SimpleApp.class
Put this JAR file into the target directory of your project.
This JAR file contains your SimpleApp class and is what you pass to Spark when submitting your job.
cd to your Spark directory. I am using spark-1.4.0-bin-hadoop2.6, so the prompt looks like this:
spark-1.4.0-bin-hadoop2.6>
Submit your Spark program using spark-submit. If you have a package structure like the one Harsha explained in another answer, then provide
--class org.apache.spark.examples.SimpleApp
else
--class SimpleApp
Finally, submit your Spark program:
spark-1.4.0-bin-hadoop2.6>./bin/spark-submit --class SimpleApp --master local[2] /home/hadoopnod/Spark_Java/target/file.jar
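For reference, a minimal SimpleApp along the lines of the Spark Quick Start might look roughly like this (a sketch only, not the asker's actual code; the input file path and app name are assumptions, and it assumes Java 8 lambdas):

package org.apache.spark.examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
    public static void main(String[] args) {
        // Any text file readable by the driver works; this path is only an example.
        String logFile = "/home/hadoopnod/spark-1.2.1/README.md";
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> logData = sc.textFile(logFile).cache();
        long numAs = logData.filter(line -> line.contains("a")).count();
        long numBs = logData.filter(line -> line.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        sc.stop();
    }
}

Whatever the class looks like, the name you pass to --class must match its package declaration, as the other answers explain.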

The ./bin/run-example script is used to execute the examples included in the distribution. To run the "SparkPi" example, do this:
> cd /apps/spark-1.2.0
> ./bin/run-example SparkPi
If you look at how this script executes, it is just a user-friendly wrapper that actually calls spark-submit.
Here's an example that executes the same "SparkPi" example from above, but using spark-submit:
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/target/spark-examples_2.10-1.2.0.jar
You should use spark-submit to run your own code.

ClassNotFoundException: org.apache.spark.examples.SimpleApp
From the above error, it is clear that Spark cannot find the class you are trying to execute. Have you bundled your Java project into a JAR file? If your project has any other dependencies, you need to include them in the JAR as well.
Assume you have a project structure like this:
simpleapp
 - src/main/java
   - org.apache.spark.examples
     - SimpleApp.java
 - lib
   - dependent.jars (you can put all dependent jars inside the lib directory)
 - target
   - simpleapp.jar (after compiling your source)
You can use any build tool or IDE to bundle your source into a JAR file. After that, if you have added the spark/bin directory to your PATH, you can execute the command below from your project directory. You only need the --jars $(echo lib/*.jar | tr ' ' ',') part if SimpleApp.java has dependent libraries.
spark-submit --jars $(echo lib/*.jar | tr ' ' ',' ) --class org.apache.spark.examples.SimpleApp --master local[2] target/simpleapp.jar

I had the same issue. If you want to use the command provided by the Spark Quickstart, be sure your project has the same architecture:
find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
It may not be the case for you, but my pom.xml built my architecture like
./src/main/java/myGroupId/myArtifactId/SimpleApp.java
I moved my class to the default package and it worked fine afterwards.
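The underlying rule is that the --class argument must be the fully qualified name matching the package declaration in the source file. A rough sketch using the placeholder names from the structure above:

// ./src/main/java/myGroupId/myArtifactId/SimpleApp.java
package myGroupId.myArtifactId;

public class SimpleApp {
    public static void main(String[] args) {
        // With this package declaration, submit with:
        //   spark-submit --class myGroupId.myArtifactId.SimpleApp ...
        // With no package declaration (default package), --class SimpleApp is enough.
    }
}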

Related

How to access external property file in spark-submit job?

I am using Spark version 2.4.1 and Java 8.
I am trying to load an external property file while submitting my Spark job using spark-submit.
I am using the Typesafe Config dependency below to load my property file:
<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.1</version>
</dependency>
In my code I am using
public static Config loadEnvProperties(String environment) {
    Config appConf = ConfigFactory.load(); // loads "application.properties" from the "resources" folder
    return appConf.getConfig(environment);
}
To externalize this "application.properties" file, I tried the following spark-submit invocation, as suggested by an expert:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.debug \
--conf spark.driver.extraClassPath=. \
migration-0.0.1.jar sit
I placed the "log4j.properties" and "applicationNew.properties" files in the same folder from which I am running spark-submit.
1) In the above shell script, if I keep
--files /local/apps/log4j.properties, /local/apps/applicationNew.properties \
Error:
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps//applicationNew.properties
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
So what is wrong here?
2) Then I changed the above script as shown, i.e.
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
When I run the Spark job, I get the following error:
19/08/02 14:19:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)
So what is wrong here? Why is the applicationNew.properties file not loading?
3) When I debugged it as below,
i.e. printed "config.file":
String ss = System.getProperty("config.file");
logger.error("config.file : {}", ss);
Error:
19/08/02 14:19:09 ERROR Driver: config.file : null
19/08/02 14:19:09 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
So how do I set the "config.file" option from spark-submit?
How do I fix the above errors and load properties from the external applicationNew.properties file?
The proper way to list files for --files, --jars, and other similar arguments is to separate them with commas and no spaces (this is crucial, and you see the exception about an invalid main class precisely because of this):
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties
If file names themselves have spaces in it, you should use quotes to escape these spaces:
--files "/some/path with/spaces.properties,/another path with/spaces.properties"
Another issue is that you specify the same property twice:
...
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
...
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...
There is no way for spark-submit to know how to merge these values, so only one of them is used. This is why you see null for the config.file system property: the second --conf argument simply takes priority and overwrites extraJavaOptions with only the path to the log4j config file. The correct way is to specify all these values in one property:
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"
Note that because of quotes, the entire spark.driver.extraJavaOptions="..." is one command line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.
(I also changed the log4j.properties file to use a proper URI instead of a file. I recall that without this path being a URI it might not work, but you can try either way and check for sure.)
--files and SparkFiles.get
With --files you should access the resource using SparkFiles.get as follows:
$ ./bin/spark-shell --files README.md
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md
In other words, Spark will distribute the --files to executors, but the only way to know the path of the files is to use the SparkFiles utility.
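In a Java job like the one in the question, that would look roughly like this (a sketch; it assumes applicationNew.properties was shipped with --files and that the Typesafe Config jar is on the classpath):

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import org.apache.spark.SparkFiles;

import java.io.File;

public class PropertiesFromSparkFiles {
    public static Config load(String fileName) {
        // SparkFiles.get resolves the local path where Spark copied a file shipped via --files.
        String localPath = SparkFiles.get(fileName);
        return ConfigFactory.parseFile(new File(localPath));
    }
}

For example: Config envConf = PropertiesFromSparkFiles.load("applicationNew.properties").getConfig("sit");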
getResourceAsStream(resourceFile) and InputStream
The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of the CLASSPATH of the Spark app) and use the following trick:
this.getClass.getClassLoader.getResourceAsStream(resourceFile)
With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.
I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream as the way to read resource files.
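With Typesafe Config concretely, a hedged sketch of that trick in Java could look like this (the class and resource names are placeholders; the stream is wrapped in a Reader for ConfigFactory.parseReader):

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class PropertiesFromClasspath {
    public static Config load(String resourceFile) {
        // Works for any jar on the CLASSPATH that contains the resource.
        InputStream in = PropertiesFromClasspath.class.getClassLoader().getResourceAsStream(resourceFile);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + resourceFile);
        }
        return ConfigFactory.parseReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}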
You could also include the --files as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).

Remove JAR from Spark default classpath in EMR

I'm executing a spark-submit script in an EMR step that runs my super JAR's main class, like
spark-submit \
....
--class ${MY_CLASS} "${SUPER_JAR_S3_PATH}"
... etc
but Spark is by default loading the jar file:/usr/lib/spark/jars/guice-3.0.jar which contains com.google.inject.internal.InjectorImpl, a class that's also in the Guice-4.x jar which is in my super JAR. This results in a java.lang.IllegalAccessError when my service is booting up.
I've tried setting some Spark conf in the spark-submit to put my super jar in the classpath in hopes of it getting loaded first, before Spark loads guice-3.0.jar. It looks like:
--jars "${ASSEMBLY_JAR_S3_PATH}" \
--driver-class-path "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
--conf spark.executor.extraClassPath="/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
but this results in the same error.
Is there a way to remove that guice-3.0.jar from the default spark classpath so my code can use the InjectorImpl that's packaged in the Guice-4.x JAR? I'm also running Spark in client mode so I can't use spark.driver.userClassPathFirst or spark.executor.userClassPathFirst
One way is to point at the lib directory where the old Guice jar lives and then exclude it.
Sample shell script for spark-submit:
#!/bin/sh
export latestguicejar='your path to latest guice jar'

# build the list of all other dependent jars in OTHER_JARS, skipping the old guice jar
JARS=`find /usr/lib/spark/jars/ -name '*.jar'`
OTHER_JARS=""
for eachjarinlib in $JARS ; do
  # compare the file name (not the full path) so guice-3.0.jar is actually excluded
  if [ "$(basename "$eachjarinlib")" != "guice-3.0.jar" ]; then
    OTHER_JARS=$eachjarinlib,$OTHER_JARS
  fi
done
echo "--- final list of jars is: $OTHER_JARS"
echo $CLASSPATH

spark-submit --verbose --class <yourclass> \
  ... OTHER OPTIONS \
  --jars $OTHER_JARS,$latestguicejar,APPLICATIONJARTOBEADDEDSEPERATELY.JAR
Also see Holden's answer, and check what is available in your version of Spark.
As per the runtime environment docs, the userClassPathFirst properties are present in the latest version of Spark as of today:
spark.executor.userClassPathFirst
spark.driver.userClassPathFirst
To use these, you can build an uber jar with all application-level dependencies.

Spark-shell vs Spark-submit adding jar to classpath issue

I'm able to run a CREATE TEMPORARY FUNCTION testFunc using jar 'myJar.jar' query in hiveContext via spark-shell --jars myJar.jar -i some_script.scala, but I'm not able to run the same command via spark-submit --class com.my.DriverClass --jars myJar.jar target.jar.
Am I doing something wrong?
If you are using the local file system, the JAR must be in the same location on all nodes.
So you have 2 options:
Place the JAR on all nodes in the same directory, for example /home/spark/my.jar, and then use this path in the --jars option.
Use a distributed file system like HDFS.

How to append a resource jar for spark-submit?

My Spark application depends on adam_2.11-0.20.0.jar, so every time I have to package my application together with adam_2.11-0.20.0.jar as a fat jar to submit to Spark.
For example, my fat jar is myApp1-adam_2.11-0.20.0.jar,
and it is OK to submit it as follows:
spark-submit --class com.ano.adam.AnnoSp myApp1-adam_2.11-0.20.0.jar
When submitting with --jars instead:
spark-submit --class com.ano.adam.AnnoSp myApp1.jar --jars adam_2.11-0.20.0.jar
it reported:
Exception in thread "main" java.lang.NoClassDefFoundError: org/bdgenomics/adam/rdd
My question is how to submit using 2 separate jars without packaging them together:
spark-submit --class com.ano.adam.AnnoSp myApp1.jar adam_2.11-0.20.0.jar
Add all jars into one folder and then do as below.
Option 1: I think the better way of doing this is
$SPARK_HOME/bin/spark-submit \
--driver-class-path $(echo /usr/local/share/build/libs/*.jar | tr ' ' ':') \
--jars $(echo /usr/local/share/build/libs/*.jar | tr ' ' ',')
In this approach, you won't miss any jar by mistake in the classpath, hence no warning should appear.
Option 2: see my answer:
spark-submit-jars-arguments-wants-comma-list-how-to-declare-a-directory
Option 3: If you want to submit programmatically and add jars through the API, that is possible too; I am not going into the details of it here.
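For Option 3, a rough sketch using org.apache.spark.launcher.SparkLauncher might look like this (the jar paths, class name, and master URL are placeholders, and SPARK_HOME is assumed to be set so the launcher can find spark-submit):

import org.apache.spark.launcher.SparkLauncher;

public class SubmitProgrammatically {
    public static void main(String[] args) throws Exception {
        // Launches spark-submit as a child process, keeping the dependency jar separate from the app jar.
        Process spark = new SparkLauncher()
                .setAppResource("/path/to/myApp1.jar")      // main application jar
                .setMainClass("com.ano.adam.AnnoSp")         // main class inside the app jar
                .addJar("/path/to/adam_2.11-0.20.0.jar")     // extra dependency jar
                .setMaster("local[2]")
                .launch();
        spark.waitFor();
    }
}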

Spark submit fail on call setEntityResolver of XMLConfiguration on Apache Common Configuration

I have a problem when I try to submit my application with the spark-submit command:
/bin/spark-submit --class MyClass myjar.jar
I set master url programmatically.
I get following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.configuration.XMLConfiguration.setEntityResolver(Lorg/xml/sax/EntityResolver;)V
When I run my program in the IDE, everything works correctly and this problem does not arise.
It looks like the jar you are submitting may not have all of the dependencies it requires in it. The solution to this is to build an assembly jar, see https://maven.apache.org/plugins/maven-assembly-plugin/usage.html (for maven), or https://github.com/sbt/sbt-assembly (for sbt).
I have finally found the cause of the problem.
Spark uses hadoop-client 2.2.0, which uses hadoop-common, which in turn uses commons-configuration 1.6.
In my application I used commons-configuration 1.10, where XMLConfiguration.setEntityResolver is implemented; in version 1.6 of the library that method is not present.
When I run:
/bin/spark-submit --class MyClass myjar.jar
the XMLConfiguration class from commons-configuration 1.6 is loaded, and the JVM does not find the setEntityResolver method.
I resolved this by using commons-configuration 2.0-beta1 in my application.
