My use case is pretty simple: I want to override a few classes that are part of the Hadoop distribution. To do so, I created a new jar that I ship from the driver to the worker nodes using the spark.jars property.
To make sure my new jar takes precedence on the workers' classpath, I want to add it to the spark.executor.extraClassPath property.
However, since I'm shipping these jars with spark.jars, their path on the workers is dynamic and includes the app-id and executor-id: <some-work-dir>/<app-id>/<executor-id>.
Is there a way around this? Is it possible to add a dir inside the app dir so that it comes first in the classpath?
Working with Spark 2.4.5 Standalone client mode - Docker.
P.S. I'm aware of the option to add the jar to the workers' image and then add it to the classpath, but then I'd have to keep updating the image with every code change.
You can enable this option on spark-submit:
spark.driver.userClassPathFirst=true
Check the spark-submit options documentation for details.
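For example, here is a rough sketch of passing this on spark-submit (the jar and class names are illustrative); note there is also an executor-side counterpart, spark.executor.userClassPathFirst, which matters here since the question is about the worker classpath:

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars my-hadoop-overrides.jar \
  --class com.example.Main my-app.jar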
Related
We are trying to set some additional properties, like adding custom-built Spark listeners and adding jars to the driver and executor classpaths, for each Spark job that gets submitted.
We found the implementations below:
Change the spark-submit launcher script to add these extra properties
Edit spark-env.sh and add these properties to the "SPARK_SUBMIT_OPTS" and "SPARK_DIST_CLASSPATH" variables
Add a --properties-file option to spark-submit launcher script
We would like to check if this can be done per user, something like .hiverc in Hive, instead of doing it at the cluster level. This would allow us to perform A/B testing of the features we newly build.
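One hedged sketch of a per-user approach (the file name, listener class, and paths are illustrative assumptions, not a confirmed recipe): keep a per-user defaults file and pass it with --properties-file at submit time. Note that when --properties-file is given, spark-submit typically reads it in place of conf/spark-defaults.conf.

# ~/.spark-user.conf
spark.extraListeners            com.example.MyCustomListener
spark.driver.extraClassPath     /home/alice/extra-libs/*
spark.executor.extraClassPath   /home/alice/extra-libs/*

spark-submit --properties-file ~/.spark-user.conf --class com.example.Main my-app.jar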
My project jars are conflicting with jars that are already on EMR, so to fix this I copied all of my updated jars to a custom location on the nodes through a bootstrap script. I have verified that the jars were copied onto all executor nodes.
It works fine with spark-submit; my code picks up the new jars from the custom folder on all nodes:
/usr/bin/spark-submit --conf="spark.driver.extraClassPath=/usr/customJars/*" --conf="spark.executor.extraClassPath=/usr/customJars/*"
I want to implement the same thing programmatically in the code by updating the SparkConf object:
sparkConf.set("spark.driver.extraClassPath", "/usr/customJars/*");
sparkConf.set("spark.executor.extraClassPath", "/usr/customJars/*");
It is not working when I implement it programmatically; my code is not picking up the updated jars from the custom location.
Any suggestions?
Most properties cannot be changed at runtime in Spark.
You can see the documentation for SparkConf: SparkConf
Once SparkConf is passed to the SparkContext constructor, the values
are cloned and cannot be changed. This is a Spark limitation.
You need to make sure that you stop and start your Spark Session before testing new property changes.
As an additional comment from the documentation: Spark Configuration
For spark.executor.extraClassPath:
Extra classpath entries to prepend to the classpath of executors. This
exists primarily for backwards-compatibility with older versions of
Spark. Users typically should not need to set this option.
You can use spark.jars, which will affect both the driver and the executors:
Comma-separated list of jars to include on the driver and executor
classpaths. Globs are allowed.
Make sure that your jars are available in the executors.
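To illustrate the timing point above, here is a minimal sketch (paths are the ones from the question, the rest is illustrative): spark.executor.extraClassPath set on the SparkConf before the SparkContext is created is applied when the executors are launched, whereas spark.driver.extraClassPath affects how the driver JVM itself is started, so it generally has to be supplied via spark-submit or spark-defaults.conf rather than in code.

import org.apache.spark.{SparkConf, SparkContext}

object ExtraClassPathExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("extra-classpath-example")
      // Executor classpath entries set here are picked up when the executors
      // are launched, because the SparkContext does not exist yet at this point.
      .set("spark.executor.extraClassPath", "/usr/customJars/*")
    // spark.driver.extraClassPath influences the driver JVM's own launch command,
    // so setting it here (the driver JVM is already running) usually has no effect;
    // pass it via spark-submit or spark-defaults.conf instead.

    val sc = new SparkContext(conf)
    // ... job logic referring to classes from /usr/customJars ...
    sc.stop()
  }
}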
I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and I have created a properties file under src/main/resources which has properties like the batch interval, master URL, etc.
I have copied this fat jar to the master and I am submitting the application with spark-submit; below is my submit command:
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when it runs on the DSE Spark standalone cluster, the properties mentioned above, like the batch interval, become unavailable to the executors. I googled and found that this is a common issue that many have solved, so I followed one of the solutions, created a fat jar, and tried to run it, but my properties are still unavailable to the executors.
Can someone please give me some pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2, and this is how I am loading the properties:
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was referring to the property file name via a system property (set in the main method from a command-line parameter), and when the code gets shipped to and executed on a worker node, that system property is not available (obviously..!!). So instead of using Typesafe ConfigFactory to load the property file, I am using simple Scala file reading.
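A minimal sketch of that workaround, assuming the property file sits as a plain "<env>_application.properties" file of key=value lines on the node running the code (the file location and format are assumptions):

import scala.io.Source

object PropsLoader {
  // Parse simple "key=value" lines from a plain properties file on the local
  // file system of whichever node runs this code.
  def load(path: String): Map[String, String] = {
    val src = Source.fromFile(path)
    try {
      src.getLines()
        .map(_.trim)
        .filter(line => line.nonEmpty && !line.startsWith("#") && line.contains("="))
        .map { line =>
          val Array(key, value) = line.split("=", 2)
          key.trim -> value.trim
        }
        .toMap
    } finally src.close()
  }
}

// Usage (illustrative): val props = PropsLoader.load(args(0) + "_application.properties")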
I am using Spark 1.6.0. I want to pass some properties files, like log4j.properties and some other custom properties files. I see that we can use --files, but I also saw that there is a method addFile on SparkContext. I would prefer to use --files instead of programmatically adding the files, assuming both options are the same?
I did not find much documentation about --files, so are --files and SparkContext.addFile the same?
References I found about --files and for SparkContext.addFile.
It depends whether your Spark application is running in client or cluster mode.
In client mode the driver (application master) is running locally and can access those files from your project, because they are available within the local file system. SparkContext.addFile should find your local files and work as expected.
If your application is running in cluster mode, the application is submitted via spark-submit. This means that your whole application is transferred to the Spark master or YARN, which starts the driver (application master) within the cluster on a specific node and within a separate environment. This environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode, they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
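A minimal sketch of the programmatic side, assuming a log4j.properties exists in the directory you submit from (the file name and the small job are illustrative): SparkContext.addFile distributes the file to every node, and SparkFiles.get resolves the local copy at run time; passing the same file with --files on spark-submit has the equivalent effect.

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("add-file-example"))

    // Distribute a local file to every node; passing the same file with
    // --files on spark-submit has the equivalent effect.
    sc.addFile("log4j.properties")

    sc.parallelize(1 to 4).foreach { _ =>
      // On each executor, SparkFiles.get resolves the local path of the
      // distributed copy.
      println("log4j.properties is at: " + SparkFiles.get("log4j.properties"))
    }

    sc.stop()
  }
}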
I coded a Spark SQL application, in Java, that accesses Hive tables, and packaged a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and coordinator, once I get the workflow working). When I try to do that, the job fails and I get this in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was look for the jar in $HIVE_HOME/lib that contains that class, copy that jar into the lib path of my Oozie workflow root path, and add this to workflow.xml in the Spark action:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError that points to another missing class, so I did the same process again of looking for the jar and copying it, ran the job, and the same thing happened all over again. It looks like it needs dependencies on many jars in my Hive lib.
What I don't understand is that when I run the jar with spark-submit in the shell, it runs OK; I can SELECT and INSERT into my Hive tables. It is only when I use Oozie that this occurs. It looks like Spark can't see the Hive libraries anymore when it is run inside an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to properly package the Spark dependencies in a ShareLib, and use the "blablah.system.libpath" flag.
[EDIT] If base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs including "hcatalog", e.g.:
oozie.action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie Action under <file> so that they are downloaded at run time in the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is a list of valid config files (Hive, Spark, whatever)
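For illustration only (the HDFS paths and file names below are hypothetical), the per-file references in the action mentioned above would look something like:

<file>hdfs:///user/me/oozie/conf/hive-site.xml</file>
<file>hdfs:///user/me/oozie/libs/datanucleus-api-jdo.jar</file>
<file>hdfs:///user/me/oozie/libs/datanucleus-core.jar</file>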
The better way to avoid the classpath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and update the Hive/Pig jars in the shared location (sometimes the existing jars in the Oozie shared location are mismatched with the product jars):
hdfs://hadoop:50070/user/oozie/share/lib/
Once the shared lib has been updated, pass the parameter
"oozie.use.system.libpath=true"
This will tell Oozie to read the jars from the Hadoop shared location.
Once you have enabled the shared location by setting this parameter to "true", you no longer need to mention each and every jar one by one in workflow.xml.
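Putting this together with the previous answer, a minimal job.properties might look like the following (the workflow path is illustrative):

oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark,hcatalog
oozie.wf.application.path=hdfs:///user/me/workflows/spark-hive-app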