Add CLASSPATH to Oozie workflow job - apache-spark

I wrote a Spark SQL application in Java that accesses Hive tables, and packaged it as a jar that runs fine with spark-submit.
Now I want to run this jar as an Oozie workflow (and a coordinator, once I get the workflow working). When I try that, the job fails and the Oozie job logs show
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was find the jar in $HIVE_HOME/lib that contains that class, copy it into the lib directory under my Oozie workflow root path, and add this to the Spark action in workflow.xml:
<spark-opts> --jars lib/*.jar</spark-opts>
But this just leads to another java.lang.NoClassDefFoundError for a different missing class, so I repeat the process of finding and copying the jar, run the job, and the same thing happens all over again. It looks like the job depends on many of the jars in my Hive lib directory.
What I don't understand is that when I run the jar with spark-submit from the shell, it works fine: I can SELECT and INSERT into my Hive tables. It is only under Oozie that this happens. It looks like Spark can no longer see the Hive libraries when it runs inside an Oozie workflow job. Can someone explain why?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.

Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to package the Spark dependencies properly in a ShareLib, and then just use the oozie.use.system.libpath flag.
[EDIT] If base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs that includes "hcatalog", e.g.
oozie.action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie action under <file> so that they are downloaded at run time into the working dir of the YARN container (see the sketch below).
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is the list of valid config files (Hive, Spark, whatever).
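A rough sketch of how those <file> entries might look inside the Spark action; every path, jar name, and the schema version here is a hypothetical example and may need adjusting to your Oozie version:
<action name="spark-hive-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>MySparkSQLJob</name>
        <class>com.example.MySparkSQLJob</class>
        <jar>${nameNode}/user/cloudera/apps/myapp/lib/myapp.jar</jar>
        <!-- each <file> is downloaded into the YARN container's working dir at run time -->
        <file>${nameNode}/user/cloudera/apps/myapp/conf/hive-site.xml#hive-site.xml</file>
        <file>${nameNode}/user/cloudera/apps/myapp/lib/hive-exec.jar#hive-exec.jar</file>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>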

A better way to avoid the class-not-found exceptions in Oozie is to install the Oozie ShareLib in the cluster and keep the Hive/Pig jars in the shared location up to date (sometimes the jars already present in the Oozie shared location don't match the product jars):
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, pass the parameter
"oozie.use.system.libpath=true"
This tells Oozie to read the jars from the Hadoop shared location.
Once you have pointed to the shared location by setting that parameter to "true", you no longer need to list each and every jar one by one in workflow.xml.
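As a hedged sketch, a job.properties along these lines combines both of the settings mentioned above (host names and paths are examples only, taken from the Cloudera Quickstart defaults):
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
oozie.wf.application.path=${nameNode}/user/cloudera/apps/myapp
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark,hcatalog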

Related

Spark doesn't load all the dependencies in the uber jar

I have a requirement to connect to Azure Blob Storage from a Spark application to read data. The idea is to access the storage using Hadoop filesystem support (i.e., the hadoop-azure and azure-storage dependencies, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.8.5).
We submit the job to Spark on a K8s cluster. The bundled Spark library doesn't come prepackaged with the required hadoop-azure jar, so I am building a fat jar with all the dependencies. The problem is that even though the library is part of the fat jar, Spark doesn't seem to load it, and hence I am getting the error "java.io.IOException: No FileSystem for scheme: wasbs".
The Spark version is 2.4.8 and the Hadoop version is 2.8.5. Is this behavior expected, that even though the dependency is part of the fat jar, Spark does not load it? How do I force Spark to load all the dependencies in the fat jar?
The same thing happened with another dependency, and I had to pass it manually using the --jars option (example below). However, the --jars option is not feasible as the application grows.
I tried adding the fat jar itself to the executor extraClassPath, but that causes a few other version conflicts.
Any information on this would be helpful.
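For reference, the --jars workaround mentioned above looks roughly like this; the class name, paths, and jar versions are placeholders, not a prescription:
spark-submit \
  --class com.example.BlobReader \
  --jars /path/to/hadoop-azure-2.8.5.jar,/path/to/azure-storage-5.4.0.jar \
  /path/to/my-app-fat.jar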
Thanks & Regards,
Swathi Desai

Prepend spark.jars to workers classpath

My use case is pretty simple: I want to override a few classes that are part of the Hadoop distribution. To do so I created a new jar that I ship from the driver to the worker nodes using the spark.jars property.
To make sure my new jar takes precedence on the workers' classpath, I want to add it to the spark.executor.extraClassPath property.
However, since I'm shipping these jars via spark.jars, their path on the workers is dynamic and includes the app-id and executor-id: <some-work-dir>/<app-id>/<executor-id>.
Is there a way around this? Is it possible to add a directory inside the app dir so that it comes first on the classpath?
Working with Spark 2.4.5, standalone client mode, in Docker.
P.S. I'm aware of the option of baking the jar into the worker image and then adding it to the classpath, but then I'd have to rebuild the image with every code change.
You can enable this option on spark-submit:
spark.driver.userClassPathFirst=true
Check here for the spark-submit options documentation.
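Since the goal here is precedence on the workers, the executor-side counterpart spark.executor.userClassPathFirst is usually what matters as well. A hedged spark-submit sketch, with example class and jar names:
spark-submit \
  --class com.example.MyApp \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars /path/to/my-hadoop-overrides.jar \
  /path/to/my-app.jar
Both flags are marked experimental in the Spark configuration docs, so watch for classpath conflicts after enabling them.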

Using different version of hadoop client library with apache spark

I'm trying to run two or more jobs in parallel. All jobs append data under the same output path; the problem is that the first job to finish runs cleanup and deletes the _temporary folder, which causes the other jobs to throw exceptions.
With hadoop-client 3 there is a configuration flag, mapreduce.fileoutputcommitter.cleanup.skipped, to disable automatic cleanup of this folder.
I was able to exclude the dependency from spark-core and add the new hadoop-client using Maven (sketched after this question). This runs fine with master=local, but I'm not convinced it is correct.
My questions are:
Is it possible to use a different hadoop-client library with Apache Spark (e.g. hadoop-client version 3 with Apache Spark 2.3), and what is the correct approach?
Is there a better way to run multiple jobs in parallel writing under the same path?
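A hedged sketch of the Maven setup described above; the artifact versions and the Scala suffix are examples and depend on your build:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
    <exclusions>
        <!-- drop the Hadoop client that Spark pulls in transitively -->
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <!-- bring in the Hadoop 3 client explicitly -->
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.0.0</version>
</dependency>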

Spark-submit Executers are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all the dependent jars, and a property file under src/main/resources that holds properties like the batch interval, master URL, etc.
I have copied this fat jar to the master and I am submitting the application with spark-submit; below is my submit command:
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when I run on the DSE Spark standalone cluster, the properties mentioned above (like the batch interval) become unavailable to the executors. I googled and found this is a common issue that many have solved, so I followed one of the solutions and created a fat jar and tried to run, but my properties are still unavailable to the executors.
Can someone please give some pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2,
and this is how I am loading the properties:
System.setProperty("env", args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was deriving the property file name from a system property (which I set in the main method from a command-line parameter). When the code gets shipped to and executed on a worker node, that system property is not available (obviously!), so instead of using Typesafe ConfigFactory to load the property file I am now reading it with plain Scala file/resource reading (a sketch follows).
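A minimal sketch of that idea, assuming a file named <env>_application.properties is packaged inside the fat jar; the object and file names are illustrative, not the poster's actual code:
import java.util.Properties

object PropsLoader {
  // Read "<env>_application.properties" from the application jar's classpath on the driver,
  // instead of relying on a JVM system property that never reaches the executors.
  def load(env: String): Properties = {
    val props = new Properties()
    val in = getClass.getResourceAsStream(s"/${env}_application.properties")
    try props.load(in) finally if (in != null) in.close()
    props
  }
}

// e.g. in main(): val props = PropsLoader.load(args(0))   // args(0) = "qa"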

Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?

In the Spark 2.0.0 release notes, it says:
Spark 2.0 no longer requires a fat assembly jar for production deployment.
Does this mean that we no longer need to make a fat jar for submitting jobs?
If so, how? In that case the documentation here isn't up to date.
Does this mean that we do not need to make a fat jar anymore for submitting jobs?
Sadly, no. You still have to create an uber JAR for your Spark deployment.
The title from the release notes is very misleading. The actual meaning is that Spark itself, as a dependency, is no longer compiled into an uber JAR, but instead behaves like a normal application JAR with its dependencies alongside. You can see this in more detail in SPARK-11157, "Allow Spark to be built without assemblies", and in the document "Replacing the Spark Assembly with good old jars", which describes the pros and cons of deploying Spark not as several huge JARs (Core, Streaming, SQL, etc.) but as several relatively regular-sized JARs containing the code, plus a lib/ directory with all the related dependencies.
If you really want the details, this pull request touches several key parts.
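For example, with sbt-assembly you would still build your application's uber JAR yourself, while marking Spark itself as provided; the plugin version and artifacts below are illustrative (a Maven shade setup works the same way):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt -- Spark stays "provided": the cluster supplies it, so only your
// code and non-Spark dependencies end up in the uber JAR produced by `sbt assembly`
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"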
