User specific properties file in Spark (.hiverc equivalent) - apache-spark

We are trying to set some additional properties, such as adding custom-built Spark listeners and adding jars to the driver and executor classpaths, for each Spark job that gets submitted.
We found the following approaches:
Change the spark-submit launcher script to add these extra properties
Edit spark-env.sh and add these properties to the SPARK_SUBMIT_OPTS and SPARK_DIST_CLASSPATH variables
Add a --properties-file option to the spark-submit launcher script
We would like to know whether this can be done per user, similar to .hiverc in Hive, instead of at the cluster level. That would allow us to A/B test newly built features.
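One way to approximate a per-user .hiverc from application code, as a minimal sketch: load a user-level properties file and apply its entries when building the SparkSession. The file name ~/.spark-user.conf, its example contents, and the listener class name below are assumptions, not an established Spark convention.

import java.io.{File, FileInputStream}
import java.util.Properties
import org.apache.spark.sql.SparkSession

// Hypothetical per-user file, e.g. $HOME/.spark-user.conf, with lines such as
//   spark.extraListeners com.mycompany.MyListener   (placeholder class name)
//   spark.jars /home/alice/libs/feature-x.jar       (placeholder path)
// java.util.Properties accepts both "key value" and "key=value" lines.
val rcFile = new File(sys.props("user.home"), ".spark-user.conf")
val builder = SparkSession.builder().appName("per-user-conf-demo")

if (rcFile.exists()) {
  val props = new Properties()
  val in = new FileInputStream(rcFile)
  try props.load(in) finally in.close()
  val names = props.stringPropertyNames().iterator()
  while (names.hasNext) {
    val key = names.next()
    builder.config(key, props.getProperty(key))
  }
}

val spark = builder.getOrCreate()

Settings that affect the driver JVM itself (for example spark.driver.extraClassPath, or anything you would normally put in SPARK_SUBMIT_OPTS) generally cannot be applied this way because the JVM has already started; for those, pointing spark-submit at a per-user file via --properties-file is the closer analogue to .hiverc.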

Related

Prepend spark.jars to workers classpath

My use case is pretty simple: I want to override a few classes that are part of the Hadoop distribution. To do so, I created a new jar that I ship from the driver to the worker nodes using the spark.jars property.
To make sure my new jar takes precedence on the workers' classpath, I want to add it to the spark.executor.extraClassPath property.
However, since I'm shipping these jars with spark.jars, their path on the workers is dynamic and includes the app-id and executor-id: <some-work-dir>/<app-id>/<executor-id>.
Is there a way around this? Is it possible to add a directory inside the app dir so that it comes first on the classpath?
Working with Spark 2.4.5, standalone client mode, in Docker.
P.S. I'm aware of the option to add the jar to the workers' image and then add it to the classpath, but then I'd have to keep updating the image with every code change.
You can enable this option on spark-submit:
spark.driver.userClassPathFirst=true
See the spark-submit options documentation.
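For the executor side specifically there is a matching spark.executor.userClassPathFirst setting. A minimal sketch of setting both when building the session (the jar path is a placeholder; per the Spark configuration docs both flags are marked experimental and the driver variant only applies in cluster mode):

import org.apache.spark.sql.SparkSession

// Give user-added jars (here distributed via spark.jars, as in the question)
// precedence over Spark's own jars when loading classes.
val spark = SparkSession.builder()
  .appName("user-classpath-first-demo")
  .config("spark.jars", "/path/to/my-overrides.jar")   // placeholder jar
  .config("spark.driver.userClassPathFirst", "true")   // cluster mode only
  .config("spark.executor.userClassPathFirst", "true")
  .getOrCreate()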

Loading Spark Config for testing Spark Applications

I've been trying to test a Spark application on my local laptop before deploying it to a cluster (to avoid having to package and deploy my entire application every time), but I'm struggling to load the Spark config file.
When I run my application on a cluster, I am usually providing a spark config file to the application (using spark-submit's --conf). This file has a lot of config options because this application interacts with Cassandra and HDFS. However, when I try to do the same on my local laptop, I'm not sure exactly how to load this config file. I know I can probably write a piece of code that takes the file path of the config file and just goes through and parses all the values and sets them in the config, but I'm just wondering if there are easier ways.
Current status:
I placed the desired config file in my SPARK_HOME/conf directory and called it spark-defaults.conf. This didn't get applied, although the exact same file works fine with spark-submit.
For local mode, I set the Spark master to "local[2]" when creating the Spark session, so I'm wondering if it's possible to create this session with a specified config file.
Did you add the --properties-file flag with the spark-defaults.conf path as a JVM argument in your IDE?
The official documentation (https://spark.apache.org/docs/latest/configuration.html) repeatedly refers to 'your default properties file'. Some options cannot be set inside your application because the JVM has already started, and since the conf directory is only read by spark-submit, I suppose you have to load the configuration file explicitly when running locally.
This problem has been discussed here:
How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?
Not sure if this will help anyone, but I ended up reading the conf file from a test resources directory and then setting all the values as system properties (adapted from the Spark source code):
// _sparkConfs is just a Map[String, String] populated from reading the conf file
for ((k, v) <- _sparkConfs) {
  System.setProperty(k, v)
}
This is essentially emulating the --properties-file option of spark-submit to a certain degree. By doing this, I was able to keep this logic in my test setup, and not need to modify the existing application code.
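For completeness, a sketch of how the conf file itself might be parsed into that map (the helper name loadSparkConf and the test-resources path are placeholders); spark-defaults.conf uses whitespace-separated key/value pairs, so it is easy to read by hand:

import scala.io.Source

// Hypothetical helper: parse a spark-defaults.conf-style file
// ("key value" per line, '#' for comments) into a Map[String, String].
def loadSparkConf(path: String): Map[String, String] =
  Source.fromFile(path).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#"))
    .map { line =>
      val Array(k, v) = line.split("\\s+", 2)
      k -> v
    }
    .toMap

// Apply the entries as system properties before the SparkSession is created,
// mirroring what spark-submit's --properties-file does.
loadSparkConf("src/test/resources/spark-defaults.conf")
  .foreach { case (k, v) => System.setProperty(k, v) }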

spark.executor.extraClassPath option setting programmatically through SparkConf

My project's jars conflict with jars that are already on EMR. To fix this,
I copied all of my jars to a custom location on the nodes through a bootstrap script, and I verified that the jars were copied to all executor nodes.
It works fine with spark-submit; my code picks up the new jars from the custom folder on all nodes:
/usr/bin/spark-submit --conf="spark.driver.extraClassPath=/usr/customJars/*" --conf="spark.executor.extraClassPath=/usr/customJars/*"
I want to do the same thing programmatically by updating the SparkConf object:
sparkConf.set("spark.driver.extraClassPath", "/usr/customJars/*");
sparkConf.set("spark.executor.extraClassPath", "/usr/customJars/*");
It does not work when I set these programmatically; my code does not pick up the updated jars from the custom location.
Any suggestions?
Most properties cannot be changed at runtime in Spark.
See the documentation for SparkConf:
Once SparkConf is passed to the SparkContext constructor, the values
are cloned and cannot be changed. This is a Spark limitation.
You need to make sure that you stop and start your Spark Session before testing new property changes.
As an additional comment from the documentation: Spark Configuration
For spark.executor.extraClassPath:
Extra classpath entries to prepend to the classpath of executors. This
exists primarily for backwards-compatibility with older versions of
Spark. Users typically should not need to set this option.
You can use spark.jars, which affects both the driver and the executors:
Comma-separated list of jars to include on the driver and executor
classpaths. Globs are allowed.
Make sure that your jars are available on the executors.
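Putting the answer together, a minimal sketch using the /usr/customJars path from the question. Note also that, per the Spark configuration docs, spark.driver.extraClassPath cannot be set from application code in client mode because the driver JVM has already started, which is likely why the programmatic approach did not pick up the jars.

import org.apache.spark.sql.SparkSession

// Any already-running session must be stopped first (spark.stop()),
// because these values are only read when the session is created.
val spark = SparkSession.builder()
  .appName("custom-jars-demo")
  .config("spark.jars", "/usr/customJars/*")   // driver and executor classpaths; globs allowed
  .config("spark.executor.extraClassPath", "/usr/customJars/*")
  .getOrCreate()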

Configuring master node in spark cluster

Apologies in advance as I am new to spark. I have created a spark cluster in standalone mode with 4 workers, and after successfully being able to configure worker properties, I wanted to know how to configure the master properties.
I am writing an application and connecting it to the cluster using SparkSession.builder (I do not want to submit it using spark-submit.)
I know that the workers can be configured in the conf/spark-env.sh file, which has parameters such as SPARK_WORKER_MEMORY and SPARK_WORKER_CORES.
My question is: how do I configure the properties for the master? There is no SPARK_MASTER_CORES or SPARK_MASTER_MEMORY in this file.
I thought about setting this in the spark-defaults.conf file, however it seems that this is only used for spark-submit.
I thought about setting it in the application using SparkConf().set("spark.driver.cores", "XX") however this only specifies the number of cores for this application to use.
Any help would be greatly appreciated.
Thanks.
There are three ways of setting the configuration of the Spark master node (driver) and the Spark worker nodes. I will show examples of setting the memory of the master node; other settings can be found here.
1- Programmatically, through the SparkConf class.
Example:
new SparkConf().set("spark.driver.memory", "8g")
2- Using spark-submit: make sure not to set the same configuration both in your code (programmatically, as in 1) and in spark-submit. If you have already configured a setting programmatically, any overlapping configuration passed to spark-submit will be ignored.
Example:
spark-submit --driver-memory 8g
3- Through spark-defaults.conf:
If neither of the above is set, these settings will be the defaults.
Example:
spark.driver.memory 8g
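A minimal sketch of option 1 (the master URL is a placeholder). One caveat, based on the Spark configuration docs: in client mode spark.driver.memory cannot be changed from application code because the driver JVM and its heap already exist, so for a driver started directly from your application you may need to set the heap via the launching JVM's options or spark-defaults.conf instead.

import org.apache.spark.sql.SparkSession

// Option 1: programmatic configuration, which takes precedence over
// spark-submit flags and spark-defaults.conf.
val spark = SparkSession.builder()
  .master("spark://master-host:7077")    // placeholder standalone master URL
  .appName("driver-memory-demo")
  .config("spark.driver.memory", "8g")   // see the caveat above for client mode
  .getOrCreate()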

Add CLASSPATH to Oozie workflow job

I wrote Spark SQL code in Java that accesses Hive tables, and packaged a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and coordinator, once I get the workflow working). When I try to do that, the job fails and I get this in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was look for the jar in $HIVE_HOME/lib that contains that class, copy that jar into the lib directory under my Oozie workflow root path, and add this to workflow.xml in the Spark action:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError pointing to another missing class, so I repeated the process of finding and copying the jar, re-ran the job, and the same thing happened all over again. It seems to depend on many of the jars in my Hive lib.
What I don't understand is that when I run the jar with spark-submit in the shell, it works fine; I can SELECT from and INSERT into my Hive tables. It is only when I use Oozie that this occurs. It looks like Spark can no longer see the Hive libraries when run inside an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to package Spark dependencies properly in a ShareLib, and use the "blablah.system.libpath" flag.
[EDIT] If base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs including "HCatalog", e.g.
action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie Action under <file> so that they are downloaded at run time in the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is a list of valid config files (Hive, Spark, whatever)
The better way to avoid the classpath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and update the Hive/Pig jars in the shared location (sometimes the existing jars in the Oozie shared location get mismatched with the product jars):
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, pass the parameter
"oozie.use.system.libpath=true"
This tells Oozie to read the jars from the Hadoop shared location.
Once you have enabled the shared location by setting this parameter to true, you no longer need to list each and every jar one by one in workflow.xml.
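Putting the two answers together, a minimal job.properties sketch; the property names follow the standard Oozie documentation (the second is the full form of the sharelib setting mentioned above), and the sharelib list is only an example for a Spark action that needs Hive/HCatalog classes:

# job.properties for the Oozie workflow
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark,hcatalog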
