Spark-shell vs Spark-submit adding jar to classpath issue - apache-spark

I'm able to run the query CREATE TEMPORARY FUNCTION testFunc USING JAR 'myJar.jar' in hiveContext via spark-shell --jars myJar.jar -i some_script.scala, but I'm not able to run the same query via spark-submit --class com.my.DriverClass --jars myJar.jar target.jar.
Am I doing something wrong?

If you are using the local file system, the JAR must be in the same location on all nodes.
So you have two options:
place the JAR in the same directory on every node, for example /home/spark/my.jar, and pass that path to the --jars option;
use a distributed file system such as HDFS.
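A minimal sketch of both options, reusing the class and jar names from the question (the HDFS path is just a placeholder):

# option 1: myJar.jar was copied to /home/spark/ on every node beforehand
spark-submit --class com.my.DriverClass --jars /home/spark/myJar.jar target.jar

# option 2: the jar lives on HDFS, so every node fetches it from the same URI
spark-submit --class com.my.DriverClass --jars hdfs:///libs/myJar.jar target.jar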

Related

Spark external jars and files on hdfs

I have a spark job that I run using the spark-submit command.
The jar that I use is hosted on hdfs and I call it from there directly in the spark-submit query using its hdfs file path.
With the same logic, I'm trying to do the same for the --jars option, the --files option and also the extraClassPath option (in spark.conf), but it seems there is an issue with the fact that they point to an HDFS file path.
My command looks like this:
spark-submit \
--class Main \
--jars 'hdfs://path/externalLib.jar' \
--files 'hdfs://path/log4j.xml' \
--properties-file './spark.conf' \
'hdfs://path/job_name.jar'
So not only does Spark raise an exception telling me that it can't find the method when I call something that refers to externalLib.jar, but from the start I also get these warning logs:
Source and destination file systems are the same. Not copying externalLib.jar
Source and destination file systems are the same. Not copying log4j.xml
It must come from the fact that I specify an HDFS path, because it works flawlessly when I refer to those jars on the local file system.
Maybe it isn't possible? What can I do?

Remove JAR from Spark default classpath in EMR

I'm executing a spark-submit script in an EMR step that runs my main class from my super JAR, like
spark-submit \
....
--class ${MY_CLASS} "${SUPER_JAR_S3_PATH}"
... etc
but Spark is by default loading the jar file:/usr/lib/spark/jars/guice-3.0.jar which contains com.google.inject.internal.InjectorImpl, a class that's also in the Guice-4.x jar which is in my super JAR. This results in a java.lang.IllegalAccessError when my service is booting up.
I've tried setting some Spark conf in the spark-submit to put my super jar in the classpath in hopes of it getting loaded first, before Spark loads guice-3.0.jar. It looks like:
--jars "${ASSEMBLY_JAR_S3_PATH}" \
--driver-class-path "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
--conf spark.executor.extraClassPath="/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
but this results in the same error.
Is there a way to remove that guice-3.0.jar from the default Spark classpath so my code can use the InjectorImpl that's packaged in the Guice-4.x JAR? I'm also running Spark in client mode, so I can't use spark.driver.userClassPathFirst or spark.executor.userClassPathFirst.
One way is to point at the lib directory where the old version of the Guice jar lives and then exclude it.
Sample shell script for spark-submit:
#!/bin/sh
export latestguicejar='your path to latest guice jar'
# build the list of all other dependent jars in OTHER_JARS, skipping guice-3.0.jar
JARS=$(find /usr/lib/spark/jars/ -name '*.jar')
OTHER_JARS=""
for eachjarinlib in $JARS ; do
  if [ "$(basename "$eachjarinlib")" != "guice-3.0.jar" ]; then
    OTHER_JARS="$eachjarinlib,$OTHER_JARS"
  fi
done
# drop the trailing comma left by the loop
OTHER_JARS="${OTHER_JARS%,}"
echo "---final list of jars is: $OTHER_JARS"
echo "$CLASSPATH"
spark-submit --verbose --class <yourclass> \
  ... OTHER OPTIONS \
  --jars $OTHER_JARS,$latestguicejar,APPLICATIONJARTOBEADDEDSEPERATELY.JAR
Also see Holden's answer, and check what is available in your version of Spark.
As per the runtime environment docs, the userClassPathFirst settings are present in the latest version of Spark as of today:
spark.executor.userClassPathFirst
spark.driver.userClassPathFirst
To use these, you can build an uber jar with all of your application-level dependencies.
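A minimal sketch of how those flags could be passed on the command line (the uber jar name is a placeholder):

# give classes bundled in the uber jar precedence over Spark's own jars
spark-submit --verbose --class <yourclass> \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  your-uber-jar-with-dependencies.jar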

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on cluster.
The master and workers have the same configuration.
I have a few questions about submitting an app on the cluster using spark-submit:
You may find them comical or strange.
How can I give the path for third-party jars like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and required jars to the workers?
Is it required to host the application on all the workers?
How can I know the status of my application, since I am working from the console?
I am using following script for Spark-submit.
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --conf spark.driver.extraClassPath <jar1, jar2..jarn> \
  --executor-memory 4G \
  --total-executor-cores 8 \
  <running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following attributes in the file SPARK_HOME_PATH/conf/spark-defaults.conf (create it if it doesn't exist).
Don't forget to use * at the end of the paths:
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpath/to/jar/folder/*
Spark will pick up these attributes from spark-defaults.conf when you use the spark-submit command.
Copy your jar files to that directory, and when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded too.
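A minimal sketch of that workflow, assuming /fullpath/to/jar/folder is the folder configured above and that it exists on the nodes that need it:

# copy the 30+ third-party jars into the folder referenced by spark-defaults.conf
cp lib/*.jar /fullpath/to/jar/folder/

# no --jars needed for these jars; the extraClassPath entries pick them up
spark-submit --class <class-name> --master spark://master:7077 <running-jar-file>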
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
--jars will transfer your jar files to the worker nodes and make them available on both the driver's and the executors' classpaths.
Please refer to the link below for more details.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
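If you prefer passing the jars explicitly, a minimal sketch for turning a lib/ folder into the comma-separated list that --jars expects (paths and class name are placeholders):

# build a comma-separated list of all jars under lib/ (assumes no spaces in file names)
THIRD_PARTY_JARS=$(echo lib/*.jar | tr ' ' ',')
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --jars "$THIRD_PARTY_JARS" \
  <running-jar-file>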
You can make a fat jar containing all dependencies. The link below helps you understand that.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html
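For example, a minimal sketch assuming an sbt project with the sbt-assembly plugin already configured (the output path depends on your Scala version and project name):

# build the fat jar with all application-level dependencies bundled in
sbt assembly

# submit it; no --jars needed, since the dependencies are inside the jar
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  target/scala-2.12/<app>-assembly.jar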

adding external property file to classpath in spark

I am currently submitting my fat jar to spark cluster using below command.
Application fat jar and related configuration are in the folder /home/myapplication
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf
Now my requirement is to add an external property file /home/myapplication/external-prop.properties to classpath of both driver and worker node.
I searched a lot of resources but could not find the right solution I am looking for.
Please help me resolve the issue. Thanks in advance.
Your requirement is to use the spark.executor.extraClassPath configuration to point to the properties file. But before that, as @philantrovert has pointed out, use the --files option to copy the property file to the worker nodes.
So your correct command should be:
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --files /home/myapplication/external-prop.properties --conf "spark.executor.extraClassPath=./" --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf

What is the use of --driver-class-path in the spark command?

As per spark docs,
To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
The job works fine without --driver-class-path. Then what is the use of --driver-class-path in the spark command?
--driver-class-path or spark.driver.extraClassPath can be used to modify the classpath only for the Spark driver. This is useful for libraries which are not required by the executors (for example, any code that is used only locally).
Compared to that, --jars or spark.jars will not only add jars to both the driver and executor classpaths, but also distribute the archives over the cluster. If a particular jar is used only by the driver, this is unnecessary overhead.
Let's say we run the following command with Spark 3.3.0:
spark-submit --driver-class-path DCP.jar --jars JARS.jar MAIN.jar
What the scripts will actually execute is:
java
-cp DCP.jar:spark/conf:spark/jars/*
org.apache.spark.deploy.SparkSubmit
--conf spark.driver.extraClassPath=DCP.jar
--jars JARS.jar
MAIN.jar
(I've removed the irrelevant bits.)
The surprise (for me) is that only DCP.jar is on the classpath. Neither JARS.jar nor MAIN.jar are on the JVM classpath. This means any JDBC driver registration from those jars will not be activated. You need to put the JDBC jar on --driver-class-path.
But you also want the workers to be able to do JDBC. So you need to put the JDBC jar on --jars too. Both are required, like the documentation says.
