Spark doesn't load all the dependencies in the uber jar - Azure

I have a requirement to connect to Azure Blob Storage from a Spark application to read data. The idea is to access the storage through Hadoop filesystem support (i.e., the hadoop-azure and azure-storage dependencies, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.8.5).
We submit the job to a Spark on Kubernetes (K8s) cluster. The embedded Spark distribution doesn't come prepackaged with the required hadoop-azure jar, so I am building a fat jar with all the dependencies. The problem is that even though the library is part of the fat jar, Spark doesn't seem to load it, and I get the error "java.io.IOException: No FileSystem for scheme: wasbs".
The Spark version is 2.4.8 and the Hadoop version is 2.8.5. Is this behavior expected, that Spark does not load a dependency even though it is part of the fat jar? How can I force Spark to load all the dependencies in the fat jar?
The same thing happened with another dependency, and I had to pass it manually using the --jars option. However, the --jars option is not feasible as the application grows.
I tried adding the fat jar itself to the executor extraClassPath, but that causes a few other version conflicts.
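For reference, a minimal sketch of the read path, including explicitly mapping the wasbs scheme to its implementation class through the Spark configuration (the fs.wasbs.impl class name is taken from hadoop-azure 2.8.5 and is an assumption on my part; account, container, paths, and the environment variable are placeholders):

import org.apache.spark.sql.SparkSession

object AzureBlobReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("azure-blob-read")
      // "spark.hadoop.*" settings are copied into the Hadoop Configuration, so this maps
      // the wasbs scheme to the hadoop-azure implementation even if the ServiceLoader
      // registration (META-INF/services) was lost while building the fat jar.
      .config("spark.hadoop.fs.wasbs.impl",
        "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
      // Storage account key; <account> is a placeholder, AZURE_STORAGE_KEY is a
      // hypothetical environment variable holding the key.
      .config("spark.hadoop.fs.azure.account.key.<account>.blob.core.windows.net",
        sys.env.getOrElse("AZURE_STORAGE_KEY", ""))
      .getOrCreate()

    // <container> and <account> are placeholders for the real container and storage account.
    val df = spark.read.text("wasbs://<container>@<account>.blob.core.windows.net/data/input")
    println(s"rows: ${df.count()}")
    spark.stop()
  }
}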
Any information on this would be helpful.
Thanks & Regards,
Swathi Desai

Related

Overriding Apache Spark dependency (spark-hive)

Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose that the Spark app (CLUSTER MODE) already has the spark-hive (2.4.4) dependency (PROVIDED).
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
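For context, the dependency setup described above corresponds roughly to this build.sbt fragment (a sketch; the exact module list is an assumption):

// build.sbt (sketch)
val sparkVersion = "2.4.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  // The dependency to override. Provided scope means it is NOT packaged into the
  // assembled application jar; at runtime the spark-hive jar sitting on the
  // cluster's classpath (e.g. $SPARK_HOME/jars) is the one that gets loaded.
  "org.apache.spark" %% "spark-hive" % sparkVersion % Provided
)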
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server or container or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!

Understanding how Spark applications use dependencies

Let's say that we have a Spark application that writes to and reads from HDFS, and we have some additional dependency; let's call it dep.
Now, let's run spark-submit on our jar built with sbt. I know that spark-submit sends some jars (known as spark-libs). However, my questions are:
(1) How does the Spark version influence the dependencies that are sent? I mean, what is the difference between spark-with-hadoop/bin/spark-submit and spark-without-hadoop/bin/spark-submit?
(2) How does the version of Hadoop installed on the cluster influence the dependencies?
(3) Who is responsible for providing my dependency dep? Should I build a fat jar (assembly)?
Please note that the first two questions are about where the HDFS calls come from (I mean calls made by my Spark application, like write/read).
Thanks in advance
spark-without-hadoop refers only to the downloaded package, not application development.
The more correct phrasing is "Bring your own Hadoop," meaning you are still required to have the base Hadoop dependencies for any Spark application.
Should I build a fat jar (assembly)?
If you have libraries that are outside of hadoop-client and those provided by Spark (core, mllib, streaming), then yes.
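A minimal sketch of that with sbt (which the question mentions) and the sbt-assembly plugin; the coordinates of dep and the version numbers are illustrative assumptions:

// build.sbt (sketch)
// Spark and hadoop-client are supplied by the cluster / spark-submit, so they are marked
// Provided and stay out of the fat jar; "dep" is a regular compile dependency and is
// therefore bundled into the jar produced by `sbt assembly`.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "2.4.8" % Provided,
  "org.apache.spark"  %% "spark-sql"     % "2.4.8" % Provided,
  "org.apache.hadoop"  % "hadoop-client" % "2.8.5" % Provided,
  "com.example"        % "dep"           % "1.0.0" // the additional dependency "dep"
)

The fat jar that sbt assembly produces is then the application jar you pass to spark-submit.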

Spark-submit executors are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and a property file under src/main/resources which has properties like batch interval, master URL, etc.
I have copied this fat jar to the master, and I am submitting the application with spark-submit; below is my submit command:
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when I run on the DSE Spark standalone cluster, the properties mentioned above, like batch interval, become unavailable to the executors. I googled and found that this is a common issue many have solved, so I followed one of the solutions, created a fat jar, and tried to run, but my properties are still unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2.
This is how I am loading the properties:
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was deriving the property file name from a system property (which I set in the main method from a command-line parameter), and when the code gets shipped and executed on a worker node that system property is not available (obviously!). So, instead of using Typesafe ConfigFactory to load the property file, I am using plain Scala file reading.
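A rough sketch of that approach (the file name and the use of --files to ship the properties file with the job are assumptions, not the exact code from this answer):

import java.util.Properties
import scala.io.Source

// Load a plain .properties file shipped with the job, e.g. via
// `spark-submit --files qa_application.properties ...`, instead of resolving its
// name through a JVM system property that is only set on the driver.
object PropsLoader {
  def load(path: String): Properties = {
    val props  = new Properties()
    val source = Source.fromFile(path)
    try props.load(source.bufferedReader()) finally source.close()
    props
  }
}

// Usage (hypothetical file and key names):
// val props         = PropsLoader.load("qa_application.properties")
// val batchInterval = props.getProperty("batchIntervalSeconds", "10").toInt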

Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?

In Spark 2.0.0's release notes, it says:
Spark 2.0 no longer requires a fat assembly jar for production deployment.
Does this mean that we do not need to make a fat jar anymore for submitting jobs?
If yes, how? In that case, the documentation here isn't up to date.
Does this mean that we do not need to make a fat jar anymore for submitting jobs?
Sadly, no. You still have to create an uber JAR for Spark deployment.
The title from the release notes is very misleading. The actual meaning is that Spark itself, as a dependency, is no longer compiled into an uber JAR, but acts like a normal application JAR with dependencies. You can see this in more detail in SPARK-11157, which is called "Allow Spark to be built without assemblies", and read the paper called "Replacing the Spark Assembly with good old jars", which describes the pros and cons of deploying Spark not as several huge JARs (Core, Streaming, SQL, etc.) but as several relatively regular-sized JARs containing the code and a lib/ directory with all the related dependencies.
If you really want the details, this pull request touches several key parts.
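For what it's worth, with sbt (one common option; the Maven Shade plugin serves the same purpose on the Maven side), producing that application uber JAR usually means enabling the sbt-assembly plugin and running sbt assembly; the plugin version below is only an example:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")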

Add CLASSPATH to Oozie workflow job

I wrote Spark SQL code that accesses Hive tables, in Java, and packaged a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and coordinator, if I can get the workflow to work). When I try to do that, the job fails and I get this in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was look for the jar in $HIVE_HOME/lib that contains that class, copy that jar into the lib directory under my Oozie workflow root path, and add this to the Spark action in workflow.xml:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError that points to another missing class, so I repeated the process of looking for the jar and copying it, ran the job, and the same thing happened all over again. It looks like it needs many of the jars in my Hive lib.
What I don't understand is that when I run the jar with spark-submit in the shell, it runs fine: I can SELECT and INSERT into my Hive tables. It is only when I use Oozie that this occurs. It looks like Spark can't see the Hive libraries anymore when it runs inside an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to properly package the Spark dependencies in a ShareLib, and then use the "blablah.system.libpath" flag.
[EDIT] If base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs including "HCatalog", e.g.
oozie.action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie Action under <file> so that they are downloaded at run time in the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is a list of valid config files (Hive, Spark, whatever)
A better way to avoid the classpath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and keep the Hive/Pig jars updated in the shared location (sometimes the existing jars in the Oozie shared location get mismatched with the product jars).
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, please pass the parameter
"oozie.use.system.libpath = true"
This will inform Oozie to read the jars from the Hadoop shared location.
Once you have enabled the shared location by setting the parameter to "true", you no longer need to mention each and every jar one by one in workflow.xml.
