adding external property file to classpath in spark - apache-spark

I am currently submitting my fat jar to a Spark cluster using the command below.
The application fat jar and related configuration files are in the folder /home/myapplication.
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf
Now I need to add an external property file, /home/myapplication/external-prop.properties, to the classpath of both the driver and the worker nodes.
I have searched a lot of resources but could not find the solution I am looking for.
Please help me resolve this issue. Thanks in advance.

You can use the spark.executor.extraClassPath configuration to put the properties file on the executor classpath. Before that, as @philantrovert pointed out, use the --files option to copy the properties file to the worker nodes. Note that spark-submit options must come before the application jar, because everything after the jar is passed to the application as arguments.
So the command should be
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --files /home/myapplication/external-prop.properties --conf "spark.executor.extraClassPath=./" --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf

Related

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 (BTW) on a YARN cluster.
I am trying to upload JARs to the YARN cluster and use them to replace the Spark JARs that are already in place there.
I am trying to do so through spark-submit.
The question "Add jars to a Spark Job - spark-submit" and its answers are full of interesting points.
One helpful answer is the following one:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" uploads the JARs to each node.
"--driver-class-path" makes the uploaded JARs available to the driver.
"--conf spark.executor.extraClassPath" makes the uploaded JARs available to the executors.
While I control the file paths given to "--jars" in a spark-submit command, what will be the file paths of the uploaded JARs to use in "--driver-class-path", for example?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?
Thanks.
PS: I have tried
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (unless I made a mistake).
I have also tried:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path will overwrite original classpath, instead of prepending to it as one may expect.
--jars will automatically add the extra jars to the driver and executor classpaths, so you do not need to add their paths manually.
It seems that in cluster mode --driver-class-path is ignored.

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on a cluster.
The master and the workers have the same configuration.
I have a few questions about submitting the app to the cluster with spark-submit (you may find them comical or strange):
How can I give the path for third-party jars, such as lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and the required jars to the workers?
Do I need to host the application on all the workers?
How can I check the status of my application, given that I am working from the console?
I am using the following script for spark-submit.
spark-submit
--class <class-name>
--master spark://master:7077
--deploy-mode cluster
--supervise
--conf spark.driver.extraClassPath <jar1, jar2..jarn>
--executor-memory 4G
--total-executor-cores 8
<running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following properties in the file SPARK_HOME_PATH/conf/spark-defaults.conf (create it if it does not exist):
Don't forget the * at the end of the paths.
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpath/to/jar/folder/*
spark-submit reads the properties in spark-defaults.conf every time you run it.
Copy your jar files into that directory; when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded too.
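The two properties above can also be added from a script. The following is only a sketch: set_spark_default is a hypothetical helper name, and it assumes a plain "key value" properties file like the one shown above. It appends a property only when the key is not already present, so running it twice does not duplicate lines.

```shell
# Hypothetical helper: append "key value" to a Spark properties file
# unless a line whose first field equals the key already exists.
set_spark_default() {
  sd_key="$1"
  sd_value="$2"
  sd_file="$3"
  touch "$sd_file"    # create the file if it does not exist
  # awk exits 0 only if some line already starts with exactly this key
  if ! awk -v k="$sd_key" '$1 == k { found = 1 } END { exit !found }' "$sd_file"; then
    printf '%s %s\n' "$sd_key" "$sd_value" >> "$sd_file"
  fi
}

# Example usage (not executed here; quote the value so the shell does not expand *):
# set_spark_default spark.driver.extraClassPath   '/fullpath/to/jar/folder/*' "$SPARK_HOME/conf/spark-defaults.conf"
# set_spark_default spark.executor.extraClassPath '/fullpath/to/jar/folder/*' "$SPARK_HOME/conf/spark-defaults.conf"
```

The helper never overwrites an existing value; if the key is already set to something else, edit the file by hand.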
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
--jars transfers your jar files to the worker nodes and makes them available on both the driver and executor classpaths.
Please refer to the link below for more details.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
You can also build a fat jar containing all dependencies. The link below explains how.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html

Spark-shell vs Spark-submit adding jar to classpath issue

I'm able to run the query CREATE TEMPORARY FUNCTION testFunc using jar 'myJar.jar' in hiveContext via spark-shell --jars myJar.jar -i some_script.scala, but I'm not able to run the same query via spark-submit --class com.my.DriverClass --jars myJar.jar target.jar.
Am I doing something wrong?
If you are using the local file system, the jar must be in the same location on all nodes.
So you have 2 options:
Place the jar on all nodes in the same directory, for example /home/spark/my.jar, and then use this path in the --jars option.
Use a distributed file system such as HDFS.

What is the use of --driver-class-path in the spark command?

As per spark docs,
To get started you will need to include the JDBC driver for you particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Job is working fine without --driver-class-path. Then, what is the use of --driver-class-path in the spark command?
--driver-class-path or spark.driver.extraClassPath can be used to modify the classpath of the Spark driver only. This is useful for libraries that are not required by the executors (for example, any code that is used only locally).
By contrast, --jars or spark.jars will not only add jars to both the driver and executor classpaths, but also distribute the archives over the cluster. If a particular jar is used only by the driver, this is unnecessary overhead.
Let's say we run the following command with Spark 3.3.0:
spark-submit --driver-class-path DCP.jar --jars JARS.jar MAIN.jar
What the scripts will actually execute is:
java
-cp DCP.jar:spark/conf:spark/jars/*
org.apache.spark.deploy.SparkSubmit
--conf spark.driver.extraClassPath=DCP.jar
--jars JARS.jar
MAIN.jar
(I've removed the irrelevant bits.)
The surprise (for me) is that only DCP.jar is on the classpath. Neither JARS.jar nor MAIN.jar is on the JVM classpath. This means any JDBC driver registration from those jars will not be activated. You need to put the JDBC jar on --driver-class-path.
But you also want the workers to be able to do JDBC. So you need to put the JDBC jar on --jars too. Both are required, like the documentation says.
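As a rough sketch of that conclusion, the shell function below only prints a spark-submit command line that lists the same JDBC jar under both options; it does not run Spark. The function name jdbc_submit_cmd is a hypothetical helper, and the jar names are the ones used above.

```shell
# Hypothetical helper: print (do not run) a spark-submit command that puts one
# JDBC jar on the driver JVM classpath and also ships it to the executors.
jdbc_submit_cmd() {
  jc_jdbc_jar="$1"   # JDBC driver jar, needed by the driver and the executors
  jc_main_jar="$2"   # application jar
  printf 'spark-submit --driver-class-path %s --jars %s %s\n' \
    "$jc_jdbc_jar" "$jc_jdbc_jar" "$jc_main_jar"
}

# Example usage:
# jdbc_submit_cmd postgresql-9.4.1207.jar MAIN.jar
```

Printing the command first makes it easy to confirm that the jar appears twice before submitting anything.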

spark-submit in deploy mode client not reading all the jars

I'm trying to submit an application to my Spark cluster (standalone mode) through the spark-submit command. I'm following the official Spark documentation, as well as relying on this other one. Now the problem is that I get strange behaviors. My setup is the following:
I have a directory where all the dependency jars for my application are located, that is /home/myuser/jars
The jar of my application is in the same directory (/home/myuser/jars), and is called dat-test.jar
The entry point class in dat-test.jar is at the package path my.package.path.Test
Spark master is at spark://master:7077
Now, I submit the application directly on the master node, thus using the client deploy mode, running the command
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/*
and I receive the following error:
java.lang.ClassNotFoundException: my.package.path.Test
If I activate verbose mode, I see that the primaryResource selected as the jar containing the entry point is the first jar in alphabetical order in /home/myuser/jars/ (which is not dat-test.jar), leading (I suppose) to the ClassNotFoundException. All the other jars in the same directory are passed as arguments anyway.
Of course if I run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/dat-test.jar
it finds the Test class, but it doesn't find other classes contained in other jars. Finally, if I use the --jars flag and run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 --jars /home/myuser/jars/* /home/myuser/jars/dat-test.jar
I get the same result as with the first command: the first jar in /home/myuser/jars/ is loaded as primaryResource, leading to a ClassNotFoundException for my.package.path.Test. The same happens if I use --jars /home/myuser/jars/*.jar.
Important points are:
I do not want to have a single jar with all the dependencies, for development reasons.
There are many jars in /home/myuser/jars/. I'd like to know whether there is a way to include them all without using the comma-separated syntax.
If I try to run the same commands with --deploy-mode cluster on the master node, I don't get the error, but the computation fails for some other reason (but this is another problem).
Which is then the correct way of running a spark-submit in client mode?
Thanks
There is no way to include all the jars with a wildcard in the --jars option; you will have to write a small script to enumerate them. This part is a bit suboptimal.
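One possible shape for such a script is sketched below. The shell expands /home/myuser/jars/* into separate space-separated words, which is why the first jar ends up as the primary resource in the question; --jars instead expects a single comma-separated value. The function name join_jars and its optional second argument are hypothetical, not part of Spark.

```shell
# Hypothetical helper: join every *.jar in a directory into one
# comma-separated string, optionally skipping one file name
# (for example the application jar itself).
join_jars() {
  jj_dir="$1"
  jj_skip="${2:-}"
  jj_list=""
  for jj_jar in "$jj_dir"/*.jar; do
    # With no matching files the pattern stays literal, so skip it
    if [ ! -e "$jj_jar" ]; then
      continue
    fi
    if [ "$(basename "$jj_jar")" = "$jj_skip" ]; then
      continue
    fi
    # Add a comma only when the list already has an entry
    jj_list="${jj_list:+$jj_list,}$jj_jar"
  done
  printf '%s\n' "$jj_list"
}

# Example usage (not executed here):
# ./spark-submit --class my.package.path.Test --master spark://master:7077 \
#   --executor-memory 5G --total-executor-cores 10 \
#   --jars "$(join_jars /home/myuser/jars dat-test.jar)" \
#   /home/myuser/jars/dat-test.jar
```

Quoting the command substitution keeps the whole list as one argument, so the application jar stays the only positional resource.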
