Setting the spark.executor.extraClassPath option programmatically through SparkConf - apache-spark

My project jars conflict with jars that are already on EMR, so to fix this
I copied all of my jars to a custom location on the nodes through a bootstrap script. I have verified that the jars were copied to all executor nodes.
It works fine with spark-submit; my code picks up the new jars in the custom folder on every node.
/usr/bin/spark-submit --conf="spark.driver.extraClassPath=/usr/customJars/*" --conf="spark.executor.extraClassPath=/usr/customJars/*"
I want to do the same thing programmatically by updating the SparkConf object:
sparkConf.set("spark.driver.extraClassPath", "/usr/customJars/*");
sparkConf.set("spark.executor.extraClassPath", "/usr/customJars/*");
This does not work when I set the properties programmatically; my code does not pick up the updated jars in the custom location.
Any suggestions?

Most properties cannot be changed at runtime in Spark.
You can see the documentation for SparkConf:
Once a SparkConf is passed to the SparkContext constructor, the values
are cloned and can no longer be changed. This is a Spark limitation.
You need to make sure that you stop and restart your Spark session before testing new property changes.
As an additional comment from the documentation: Spark Configuration
For spark.executor.extraClassPath:
Extra classpath entries to prepend to the classpath of executors. This
exists primarily for backwards-compatibility with older versions of
Spark. Users typically should not need to set this option.
You can use spark.jars, which affects both the driver and the executors:
Comma-separated list of jars to include on the driver and executor
classpaths. Globs are allowed.
Make sure that your jars are available in the executors.
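For illustration, a minimal Scala sketch of how that could look, assuming the jars really are present under /usr/customJars on every node (the path and app name are just placeholders). The configuration has to be applied before the session is created, and any existing session stopped first:
import org.apache.spark.sql.SparkSession

// Stop any running session so the new configuration is actually picked up.
SparkSession.getActiveSession.foreach(_.stop())

// Build a fresh session with spark.jars set up front (globs are allowed).
val spark = SparkSession.builder()
  .appName("custom-jars-example")             // placeholder app name
  .config("spark.jars", "/usr/customJars/*")  // jars distributed to driver and executors
  .getOrCreate()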

Related

Prepend spark.jars to workers classpath

My use case is pretty simple: I want to override a few classes that are part of the Hadoop distribution. To do so I created a new jar that I ship from the driver to the worker nodes using the spark.jars property.
To make sure my new jar takes precedence on the workers' classpath, I want to add it to the spark.executor.extraClassPath property.
However, since I'm shipping these jars with spark.jars, their path on the workers is dynamic and includes the app-id and executor-id: <some-work-dir>/<app-id>/<executor-id>.
Is there a way around this? Is it possible to add a directory inside the app dir so that it comes first on the classpath?
Working with Spark 2.4.5, standalone client mode, Docker.
P.S. I'm aware of the option to add the jar to the worker image and then add it to the classpath, but then I'd have to keep updating the image with every code change.
You can enable this option on spark-submit:
spark.driver.userClassPathFirst=true
Check the spark-submit options documentation.
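As a hedged sketch of what that could look like (the class and jar names are placeholders): the executor-side counterpart spark.executor.userClassPathFirst also exists, and both properties are marked experimental in the Spark configuration docs.
spark-submit --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --class com.example.Main my-app.jar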

Possible to add extra jars to master/worker nodes AFTER spark submit at runtime?

I'm writing a service that runs inside a long-running Spark application started from a spark-submit. The service won't know which jars to put on the classpaths at the time of the initial spark-submit, so I can't include them using --jars. The service then listens for requests that can include extra jars, which I want to load onto my Spark nodes so that work can be done using those jars.
My goal is to call spark-submit only once, at the very beginning, to launch my service. Then I'm trying to add jars from requests to the Spark session by creating a new SparkConf and building a new SparkSession out of it, something like:
SparkConf conf = new SparkConf();
conf.set("spark.driver.extraClassPath", "someClassPath");
conf.set("spark.executor.extraClassPath", "someClassPath");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
I tried this approach, but it looks like the jars aren't getting loaded onto the executor classpaths, since my jobs don't recognize the UDFs from those jars. I'm running in Spark client mode right now.
Is there a way to add these jars AFTER a spark-submit has been called and just update the existing Spark application, or is it only possible with another spark-submit that includes the jars via --jars?
Would using cluster mode vs. client mode matter in this kind of situation?
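One possibility worth sketching (not a full answer, and the path below is a placeholder): SparkContext.addJar can ship a jar to the executors after the application has started, so tasks submitted afterwards can load classes from it. It does not change spark.executor.extraClassPath, so whether it is enough depends on how the UDF classes are resolved.
// Hypothetical sketch: distribute an extra jar to the executors at runtime.
spark.sparkContext.addJar("/path/to/request-provided.jar")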

User specific properties file in Spark (.hiverc equivalent)

We are trying to set some additional properties, such as adding custom-built Spark listeners and adding jars to the driver and executor classpaths, for each Spark job that gets submitted.
We found the implementations below:
Change the spark-submit launcher script to add these extra properties
Edit spark-env.sh and add these properties to the SPARK_SUBMIT_OPTS and SPARK_DIST_CLASSPATH variables
Add a --properties-file option to the spark-submit launcher script
We would like to know whether this can be done per user, something like .hiverc in Hive, instead of at the cluster level. That would allow us to A/B test the features we newly build.
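As a sketch of the per-user direction (the file name, paths, and listener class are hypothetical), each user could keep their own defaults file and point spark-submit at it with --properties-file. Note that when --properties-file is supplied, conf/spark-defaults.conf is not read, so any cluster-wide defaults would need to be repeated in the file.
# ~/.spark-user.conf  (hypothetical per-user defaults, spark-defaults.conf format)
spark.extraListeners            com.example.MyListener
spark.driver.extraClassPath     /home/alice/extra-jars/*
spark.executor.extraClassPath   /home/alice/extra-jars/*

spark-submit --properties-file ~/.spark-user.conf --class com.example.Job app.jar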

Spark-submit Executors are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and I have created a properties file under src/main/resources with properties like the batch interval, master URL, etc.
I copied this fat jar to the master and am submitting the application with spark-submit; below is my submit command.
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when run on the DSE Spark standalone cluster, the properties mentioned above, like the batch interval, become unavailable to the executors. I googled around and found that this is a common issue that many have solved, so I followed one of the solutions and created a fat jar and tried to run it, but my properties are still unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2,
and this is how I am loading the properties:
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was referring to the property file name from a system property (set in the main method from a command-line parameter), and when the code gets shipped and executed on a worker node that system property is not available (obviously!), so instead of using Typesafe ConfigFactory to load the property file I am now using plain Scala file reading.
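A minimal sketch of what plain file reading could look like (the path, file format, and key name are illustrative, and the file is assumed to be present on each node, e.g. shipped with --files):
// Read simple key=value pairs from a plain file instead of relying on ConfigFactory.
import scala.io.Source

val props: Map[String, String] =
  Source.fromFile("/usr/local/conf/qa_application.properties")
    .getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#") && line.contains("="))
    .map { line =>
      val Array(key, value) = line.split("=", 2)
      key.trim -> value.trim
    }
    .toMap

val batchInterval = props("batchInterval") // illustrative lookup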

Add CLASSPATH to Oozie workflow job

I wrote a SparkSQL application in Java that accesses Hive tables and packaged it as a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and coordinator, once I get the workflow working). When I try to do that, the job fails and I get the following in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was look for the jar in $HIVE_HOME/lib that contains that class, copy it into the lib path of my Oozie workflow root path, and add this to workflow.xml in the Spark action:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError pointing to another missing class, so I repeated the process of looking for the jar and copying it, ran the job, and the same thing happened all over again. It looks like it depends on many of the jars in my Hive lib.
What I don't understand is that when I use spark-submit in the shell with the same jar, it runs fine; I can SELECT and INSERT into my Hive tables. It is only under Oozie that this occurs. It looks like Spark can no longer see the Hive libraries when run inside an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes/jars on the Oozie path?
I am using the Cloudera QuickStart VM CDH 5.4.0, Spark 1.4.0, and Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably bare-bones, with just the core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to properly package the Spark dependencies in a ShareLib, and use the "blablah.system.libpath" flag.
[EDIT] If base Spark functionality works but you fail on the Hive format interface, then specify a list of ShareLibs including "HCatalog", e.g.
action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie action under <file> so that they are downloaded at run time into the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is a list of valid config files (Hive, Spark, whatever).
A better way to avoid the classpath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and update the Hive/Pig jars in the shared location (sometimes an existing jar in the Oozie shared location can mismatch the product jar):
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, pass the parameter
oozie.use.system.libpath=true
This tells Oozie to read the jars from the Hadoop shared location.
Once you have pointed to the shared location by setting this parameter to "true", you no longer need to mention each and every jar in workflow.xml.
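Putting the two answers together, a hypothetical job.properties for this kind of Oozie Spark action could look like the following (the HDFS application path is a placeholder):
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark,hcatalog
oozie.wf.application.path=hdfs:///user/cloudera/workflows/spark-hive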
