How to append a resource jar for spark-submit? - apache-spark

My Spark application depends on adam_2.11-0.20.0.jar, so every time I have to package my application together with adam_2.11-0.20.0.jar as a fat jar before submitting it to Spark.
For example, if my fat jar is myApp1-adam_2.11-0.20.0.jar,
it is fine to submit it as follows:
spark-submit --class com.ano.adam.AnnoSp myApp1-adam_2.11-0.20.0.jar
When I use --jars instead, it reports Exception in thread "main" java.lang.NoClassDefFoundError: org/bdgenomics/adam/rdd:
spark-submit --class com.ano.adam.AnnoSp myApp1.jar --jars adam_2.11-0.20.0.jar
My question is how to submit using two separate jars without packaging them together, something like:
spark-submit --class com.ano.adam.AnnoSp myApp1.jar adam_2.11-0.20.0.jar

Put all the jars in one folder and then do the following.
Option 1:
I think a better way of doing this is:
$SPARK_HOME/bin/spark-submit \
--driver-class-path $(echo /usr/local/share/build/libs/*.jar | tr ' ' ':') \
--jars $(echo /usr/local/share/build/libs/*.jar | tr ' ' ',')
With this approach you will not miss any jar on the classpath by mistake, so no warning should appear.
Option 2: see my answer to
spark-submit-jars-arguments-wants-comma-list-how-to-declare-a-directory
Option 3: If you want to submit programmatically, adding jars through the API is also possible. I am not going into the details here.
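As a rough sketch of Option 1: the helper below (join_jars is a made-up name, not part of Spark) builds both lists from one folder. It assumes a Linux/macOS classpath, where --jars takes commas and --driver-class-path takes colons, and it only echoes the command so you can check it first. The class and jar names are taken from the question.

```shell
# Join every *.jar in a directory with the given separator.
# Like the echo | tr approach above, this breaks on paths that contain spaces.
join_jars() {
  dir=$1
  sep=$2
  echo "$dir"/*.jar | tr ' ' "$sep"
}

# Directory from the answer above; override by exporting LIB_DIR.
LIB_DIR=${LIB_DIR:-/usr/local/share/build/libs}
JARS=$(join_jars "$LIB_DIR" ',')        # --jars expects a comma-separated list
DRIVER_CP=$(join_jars "$LIB_DIR" ':')   # a classpath expects ':' on Linux/macOS

# Options go before the application jar; anything after it is passed
# to the application's main method as an argument.
# echo only previews the command; remove it to actually submit.
echo spark-submit \
  --class com.ano.adam.AnnoSp \
  --jars "$JARS" \
  --driver-class-path "$DRIVER_CP" \
  myApp1.jar
```

Note the position of --jars: in the failing command from the question it comes after myApp1.jar, so spark-submit most likely handed it to the application instead of interpreting it.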

Related

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 on a YARN cluster.
I am trying to upload JARs to the YARN cluster and use them in place of the Spark JARs already installed on site.
I am trying to do so through spark-submit.
The question "Add jars to a Spark Job - spark-submit" and its answers are full of interesting points.
One helpful answer is the following:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" uploads the jars to each node.
"--driver-class-path" makes the uploaded jars available to the driver.
"--conf spark.executor.extraClassPath" makes the uploaded jars available to the executors.
While I control the file paths given to "--jars" in a spark-submit command, what will be the file paths of the uploaded JARs to use in "--driver-class-path", for example?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?
Thanks.
PS: I have tried
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (unless I made a mistake).
I have also tried:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path will overwrite original classpath, instead of prepending to it as one may expect.
--jars will automatically add the extra jars to the driver and executor classpath so you do not need to add its path manually.
It seems that in cluster mode --driver-class-path is ignored.
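Following that advice, the XXX:YYY question goes away in client mode, because nothing besides --jars needs a path. A minimal sketch, where client_submit_cmd is a hypothetical helper that only prints the command, using the paths from the question:

```shell
# Hypothetical helper: print a client-mode spark-submit command that relies
# on --jars alone, with no --driver-class-path or extraClassPath entries.
client_submit_cmd() {
  jars=$1
  main_class=$2
  app_jar=$3
  printf 'spark-submit --deploy-mode client --jars %s --class %s %s\n' \
    "$jars" "$main_class" "$app_jar"
}

# Paths, class and jar names are the ones used in the question.
client_submit_cmd /a/b/some1.jar,/a/b/c/some2.jar MyClass main-application.jar
```

Whether this is enough to shadow JARs that already ship with the cluster's Spark installation is a separate matter that the answer does not address.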

How to access external property file in spark-submit job?

I am using Spark 2.4.1 and Java 8.
I am trying to load an external property file when submitting my Spark job with spark-submit.
I am using the Typesafe Config dependency below to load my property file.
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.1</version>
In my code I am using
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
public static Config loadEnvProperties(String environment) {
    // Loads "application.properties" from my "resources" folder on the classpath
    Config appConf = ConfigFactory.load();
    return appConf.getConfig(environment);
}
To externalize this "application.properties" file, I tried the following spark-submit command, as suggested by an expert:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.debug \
--conf spark.driver.extraClassPath=. \
migration-0.0.1.jar sit
I placed the "log4j.properties" and "applicationNew.properties" files in the same folder where I run spark-submit.
1) If, in the above shell script, I keep
--files /local/apps/log4j.properties, /local/apps/applicationNew.properties \
I get this error:
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps//applicationNew.properties
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
So what is wrong here?
2) Then I changed the script to what is shown above, i.e.
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
When I run the Spark job, I get the following error:
19/08/02 14:19:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)
So what is wrong here? Why is the applicationNew.properties file not loaded?
3) When I debugged it by printing "config.file" as below
String ss = System.getProperty("config.file");
logger.error ("config.file : {}" , ss);
Error :
19/08/02 14:19:09 ERROR Driver: config.file : null
19/08/02 14:19:09 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
So how do I set the "config.file" option from spark-submit?
How do I fix the above errors and load properties from the external applicationNew.properties file?
The proper way to list files for --files, --jars and similar arguments is to separate them with commas and no spaces (this is crucial: you see the exception about the invalid main class precisely because of the space):
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties
If the file names themselves contain spaces, quote the whole list so the shell keeps it as one argument:
--files "/some/path with/spaces.properties,/another path with/spaces.properties"
Another issue is that you specify the same property twice:
...
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
...
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...
spark-submit has no way of knowing how to merge these values, so only one of them is used. This is why you see null for the config.file system property: the later --conf argument takes priority and overrides the extraJavaOptions property with a single path to the log4j config file. Thus, the correct way is to specify all these values as one property:
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"
Note that because of quotes, the entire spark.driver.extraJavaOptions="..." is one command line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.
(I also changed the log4j.properties value to a proper URI instead of a plain file path. I recall that it might not work unless this path is a URI, but you can try either way and check for sure.)
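Putting both fixes together, here is a minimal sketch of how the command could be assembled. comma_join is a made-up helper, the executor gets the same options as the driver only because the question set both, and the script echoes the command instead of running it:

```shell
# Join all arguments with commas and no spaces, as --files and --jars expect.
# The body runs in a subshell so the IFS change does not leak out.
comma_join() (
  IFS=,
  printf '%s\n' "$*"
)

FILES=$(comma_join /local/apps/log4j.properties /local/apps/applicationNew.properties)

# One value holding every -D flag, so a later --conf cannot overwrite an earlier one.
JAVA_OPTS="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"

# echo only previews the arguments; remove it to actually submit.
echo spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files "$FILES" \
  --conf "spark.driver.extraJavaOptions=$JAVA_OPTS" \
  --conf "spark.executor.extraJavaOptions=$JAVA_OPTS" \
  --class Driver \
  migration-0.0.1.jar sit
```

The double quotes around each --conf value keep the space-separated -D flags together as a single argument.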
--files and SparkFiles.get
With --files you should access the resource using SparkFiles.get as follows:
$ ./bin/spark-shell --files README.md
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md
In other words, Spark will distribute the --files to the executors, but the only way to know the path of the files is to use the SparkFiles utility.
getResourceAsStream(resourceFile) and InputStream
The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:
this.getClass.getClassLoader.getResourceAsStream(resourceFile)
With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.
I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream as the way to read resource files.
You could also include the --files as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).

Remove JAR from Spark default classpath in EMR

I'm executing a spark-submit script in an EMR step that runs the main class from my super JAR, like
spark-submit \
....
--class ${MY_CLASS} "${SUPER_JAR_S3_PATH}"
... etc
but Spark is by default loading the jar file:/usr/lib/spark/jars/guice-3.0.jar which contains com.google.inject.internal.InjectorImpl, a class that's also in the Guice-4.x jar which is in my super JAR. This results in a java.lang.IllegalAccessError when my service is booting up.
I've tried setting some Spark conf in the spark-submit to put my super jar in the classpath in hopes of it getting loaded first, before Spark loads guice-3.0.jar. It looks like:
--jars "${ASSEMBLY_JAR_S3_PATH}" \
--driver-class-path "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
--conf spark.executor.extraClassPath="/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
but this results in the same error.
Is there a way to remove that guice-3.0.jar from the default spark classpath so my code can use the InjectorImpl that's packaged in the Guice-4.x JAR? I'm also running Spark in client mode so I can't use spark.driver.userClassPathFirst or spark.executor.userClassPathFirst
One way is to enumerate the jars in the lib directory that contains the old Guice jar and exclude that jar.
Sample shell script for spark-submit:
#!/bin/sh
latestguicejar='your path to latest guice jar'
# Collect every jar under /usr/lib/spark/jars except guice-3.0.jar into OTHER_JARS
OTHER_JARS=""
for eachjarinlib in $(find /usr/lib/spark/jars/ -name '*.jar'); do
  # find prints full paths, so compare on the file name only
  if [ "$(basename "$eachjarinlib")" != "guice-3.0.jar" ]; then
    # append with a comma, but without a leading or trailing one
    OTHER_JARS="${OTHER_JARS:+$OTHER_JARS,}$eachjarinlib"
  fi
done
echo "--- final list of jars: $OTHER_JARS"
echo "$CLASSPATH"
spark-submit --verbose --class <yourclass> \
  ... OTHER OPTIONS \
  --jars "$OTHER_JARS,$latestguicejar,APPLICATIONJARTOBEADDEDSEPERATELY.JAR"
Also see Holden's answer, and check what is available in your version of Spark.
According to the runtime-environment section of the docs, the userClassPathFirst settings are present in the latest version of Spark as of today:
spark.executor.userClassPathFirst
spark.driver.userClassPathFirst
To use them, you can build an uber jar with all application-level dependencies.
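A sketch of what that could look like: user_first_cmd is a hypothetical helper, and com.example.Main and my-uber.jar are placeholder names. The Spark configuration docs describe both settings as experimental and the driver one as applying in cluster mode only, so check the docs for your version. The helper only prints the command.

```shell
# Hypothetical helper: print a spark-submit command that asks Spark to prefer
# classes from the user's jars over Spark's own copies.
user_first_cmd() {
  main_class=$1
  uber_jar=$2
  printf 'spark-submit --deploy-mode cluster'
  printf ' --conf spark.driver.userClassPathFirst=true'
  printf ' --conf spark.executor.userClassPathFirst=true'
  printf ' --class %s %s\n' "$main_class" "$uber_jar"
}

# Placeholder class and jar names; substitute your own.
user_first_cmd com.example.Main my-uber.jar
```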

adding external property file to classpath in spark

I am currently submitting my fat jar to the Spark cluster using the command below.
The application fat jar and related configuration are in the folder /home/myapplication
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf
Now my requirement is to add an external property file /home/myapplication/external-prop.properties to the classpath of both the driver and the worker nodes.
I searched a lot of resources but could not find the solution I am looking for.
Please help in resolving the issue. Thanks in advance.
What you need is the spark.executor.extraClassPath configuration, pointed at the directory that holds the properties file. Before that, as @philantrovert has pointed out, use the --files option to copy the property file to the worker nodes.
So your command should be as follows. Note that spark-submit options must come before the application jar, because anything after the jar is passed to the application as an argument:
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --files /home/myapplication/external-prop.properties --conf "spark.executor.extraClassPath=./" --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf

spark-submit in deploy mode client not reading all the jars

I'm trying to submit an application to my Spark cluster (standalone mode) through the spark-submit command. I'm following the official Spark documentation, as well as relying on this other one. Now the problem is that I get strange behavior. My setup is the following:
I have a directory where all the dependency jars for my application are located, that is /home/myuser/jars
The jar of my application is in the same directory (/home/myuser/jars), and is called dat-test.jar
The entry point class in dat-test.jar is at the package path my.package.path.Test
Spark master is at spark://master:7077
Now, I submit the application directly on the master node, thus using the client deploy mode, running the command
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/*
and I receive the following error:
java.lang.ClassNotFoundException: my.package.path.Test
If I activate verbose mode, I see that the primaryResource selected as the jar containing the entry point is the first jar in alphabetical order in /home/myuser/jars/ (which is not dat-test.jar), leading (I suppose) to the ClassNotFoundException. All the other jars in the same directory are loaded as arguments anyway.
Of course if I run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/dat-test.jar
it finds the Test class, but it doesn't find other classes contained in other jars. Finally, if I use the --jars flag and run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 --jars /home/myuser/jars/* /home/myuser/jars/dat-test.jar
I obtain the same result as the first option. First jar in /home/myuser/jars/ is loaded as primaryResource, leading to ClassNotFoundException for my.package.path.Test. Same if I add --jars /home/myuser/jars/*.jar.
Important points are:
I do not want to have a single jar with all the dependencies for development reasons
The jars in /home/myuser/jars/ are many. I'd like to know if there's a way to include them all instead of using the comma separated syntax
If I try to run the same commands with --deploy-mode cluster on the master node, I don't get the error, but the computation fails for some other reasons (but this is another problem).
What, then, is the correct way of running spark-submit in client mode?
Thanks
There is no way to include all jars at once using the --jars option; you will have to create a small script to enumerate them. This part is a bit sub-optimal.
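Such a script could look roughly like this. dependency_jars is a made-up helper name, the paths and options come from the question, and the command is only echoed so it can be inspected first. The wildcard attempts most likely fail because the shell expands /home/myuser/jars/* into space-separated paths, so --jars receives only the first one and the next path is taken as the application jar.

```shell
# Print a comma-separated list of every *.jar in a directory,
# leaving out the application jar itself.
dependency_jars() {
  dir=$1
  app_jar=$2
  list=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue            # no jars found: the glob stayed literal
    [ "$jar" = "$app_jar" ] && continue  # the application jar is passed separately
    list="${list:+$list,}$jar"
  done
  printf '%s\n' "$list"
}

JAR_DIR=/home/myuser/jars
APP_JAR=$JAR_DIR/dat-test.jar

# echo only previews the command; remove it to actually submit.
echo ./spark-submit \
  --class my.package.path.Test \
  --master spark://master:7077 \
  --executor-memory 5G \
  --total-executor-cores 10 \
  --jars "$(dependency_jars "$JAR_DIR" "$APP_JAR")" \
  "$APP_JAR"
```

Because the loop quotes each path, this version also copes with spaces in the directory name, which an echo-and-tr one-liner would not.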
