Understanding how spark applications use dependencies - apache-spark

Let's say that we have a Spark application that writes to/reads from HDFS, and we have some additional dependency, let's call it dep.
Now, let's do spark-submit on our jar built with sbt. I know that spark-submit sends some jars (known as spark-libs). However, my questions are:
(1) How does the Spark version influence the dependencies that get sent? I mean, what is the difference between spark-with-hadoop/bin/spark-submit and spark-without-hadoop/bin/spark-submit?
(2) How does the version of Hadoop installed on the cluster influence the dependencies?
(3) Who is responsible for providing my dependency dep? Should I build a fat jar (assembly)?
Please note that the first two questions are about where the HDFS calls made by my Spark application (like write/read) actually come from.
Thanks in advance

spark-without-hadoop refers only to the downloaded package, not application development.
The more correct phrasing is "Bring your own Hadoop," meaning you are still required to have the base Hadoop dependencies available for any Spark application.
Should I build a fat jar (assembly)?
If you have libraries that are outside of hadoop-client and those provided by Spark (core, mllib, streaming), then yes.
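For example, here is a minimal build.sbt sketch of that split; the "dep" coordinates are placeholders, and the sbt-assembly plugin is assumed to be available for producing the fat jar:

```scala
// build.sbt -- illustrative sketch; the "com.example" %% "dep" coordinates are placeholders
val sparkVersion = "2.4.4"

libraryDependencies ++= Seq(
  // Spark (and, through it, hadoop-client) is supplied by the cluster / spark-submit,
  // so it is marked "provided" and left out of the assembly
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
  // anything else -- your hypothetical "dep" -- must ship with the application jar
  "com.example"      %% "dep"        % "1.0.0"
)

// produce the fat jar with `sbt assembly` (requires sbt-assembly in project/plugins.sbt)
```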

Overriding Apache Spark dependency (spark-hive)

Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose that the Spark app (CLUSTER MODE) already has the spark-hive (2.4.4) dependency (PROVIDED).
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server or container or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!

Spark doesn't load all the dependencies in the uber jar

I have a requirement to connect to Azure Blob Storage from a Spark application to read data. The idea is to access the storage using Hadoop filesystem support (i.e., using the hadoop-azure and azure-storage dependencies, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.8.5).
We submit the job to Spark on a K8s cluster. The embedded Spark library doesn't come prepackaged with the required hadoop-azure jar, so I am building a fat jar with all the dependencies. The problem is that even though the library is part of the fat jar, Spark doesn't seem to load it, and hence I am getting the error "java.io.IOException: No FileSystem for scheme: wasbs".
The Spark version is 2.4.8 and the Hadoop version is 2.8.5. Is this behavior expected, that even though the dependency is part of the fat jar, Spark does not load it? How can I force Spark to load all the dependencies in the fat jar?
The same happened with another dependency, and I had to pass it manually using the --jars option. However, the --jars option is not feasible as the application grows.
I tried adding the fat jar itself to the executor extraClassPath; however, that causes a few other version conflicts.
Any information on this would be helpful.
Thanks & Regards,
Swathi Desai
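For reference, here is a minimal sketch of the kind of read described above. The storage account, container, path, and the AZURE_STORAGE_KEY environment variable are hypothetical placeholders, and the fs.wasbs.impl setting is a commonly suggested workaround rather than something from the original post:

```scala
import org.apache.spark.sql.SparkSession

object WasbsReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wasbs-read-sketch")
      // hypothetical storage account key; in practice this would come from a secret store
      .config("spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
              sys.env("AZURE_STORAGE_KEY"))
      // commonly suggested workaround (not from the original post): bind the wasbs scheme
      // to its implementation explicitly instead of relying on ServiceLoader discovery
      .config("spark.hadoop.fs.wasbs.impl",
              "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
      .getOrCreate()

    // wasbs://<container>@<account>.blob.core.windows.net/<path> -- placeholders
    val df = spark.read.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/data/input")
    df.show()

    spark.stop()
  }
}
```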

How to find installed libraries in hadoop server?

I am currently working with a Hadoop server. Now, I have to train a neural network with libraries like Keras, TensorFlow, etc. I know the Spark libs are already installed. I just want to check whether there are any other libs installed on the Hadoop server. Our company has its own Hadoop server in a remote location. I am not allowed to install any new libs and have to work with the existing ones. Can you please let me know how to check whether a given library is already installed on the Hadoop server?
Hadoop is not a single server, and you actually need to check all YARN NodeManagers for any libraries, as that's where Spark runs. In a large cluster, that's not an easy task...
When you submit a Spark job, you can freely add your own --files and --archives to bring in any dependencies to your classpath. These flags copy the files locally into your Spark execution space, taking precedence over what's already in the cluster.
By default, Spark just uses whatever built-in classes there are, and those are typically contained in an archive file. You would need to inspect your Spark configuration files to determine where that archive is, download it from HDFS, then extract it to see which libraries are available.
Or you can ask the cluster administrator what version of Spark is installed and whether any extra libraries were added (typically the answer to that would be none). With the version information, go download Spark yourself and inspect its contents.
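As a rough, purely diagnostic sketch (not part of the original answer), you can also print the JVM classpath seen by the driver and by an executor from inside a Spark job; note that this only reveals JVM-side jars, not Python libraries such as Keras or TensorFlow:

```scala
import org.apache.spark.sql.SparkSession

object ClasspathProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("classpath-probe").getOrCreate()
    val sc = spark.sparkContext

    // Classpath of the driver JVM
    println("Driver classpath:\n" + System.getProperty("java.class.path"))

    // Classpath as seen by one executor JVM (collect a single partition's result)
    val executorClasspath = sc.parallelize(Seq(1), 1)
      .map(_ => System.getProperty("java.class.path"))
      .collect()
      .head
    println("Executor classpath:\n" + executorClasspath)

    spark.stop()
  }
}
```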

Can I run spark 2.0.* artifact on a spark 2.2.* stand-alone cluster?

I am aware of the fact that with a change of Spark's major version (i.e. from 1.* to 2.*), there will be compile-time failures due to changes in existing APIs.
As far as I know, Spark guarantees that with a minor version update (i.e. 2.0.* to 2.2.*), changes will be backward compatible.
Although this eliminates the possibility of compile-time failures with the upgrade, would it be safe to assume that there won't be any runtime failures either if I submit a job on a Spark 2.2.* standalone cluster using an artifact (jar) created with 2.0.* dependencies?
would it be safe to assume that there won't be any runtime failure either if I submit a job on a 2.2.* cluster using an artifact (jar) created with 2.0.* dependencies?
Yes.
I'd even say that there's no concept of a Spark cluster unless we talk about the built-in Spark Standalone cluster.
In other words, you deploy a Spark application to a cluster, e.g. Hadoop YARN or Apache Mesos, as an application jar that may or may not contain Spark jars and so disregards what's already available in the environment.
If, however, you do think of Spark Standalone, things may have broken between releases, even between 2.0 and 2.2, because the jars in your Spark application have to be compatible with the ones already pre-loaded on the JVMs of the Spark workers.
I would not claim full compatibility between releases of Spark Standalone.
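As a quick sanity check (a sketch, not something the answer prescribes), you can log which Spark version and which jar actually serve your application at runtime, to see whether Spark classes come from the cluster installation or from your own artifact:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object VersionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("version-check").getOrCreate()

    // Version of the Spark runtime that actually answered the call
    println(s"Runtime Spark version: ${spark.version}")

    // Which jar the SparkContext class was loaded from; getCodeSource/getLocation
    // can be null for some classloaders, hence the Option handling
    val source = Option(classOf[SparkContext].getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
      .getOrElse("<unknown>")
    println(s"SparkContext loaded from: $source")

    spark.stop()
  }
}
```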

Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?

In the Spark 2.0.0 release notes, it says:
Spark 2.0 no longer requires a fat assembly jar for production deployment.
Does this mean that we no longer need to make a fat jar for submitting jobs?
If yes, how? In that case, the documentation here isn't up-to-date.
Does this mean that we no longer need to make a fat jar for submitting jobs?
Sadly, no. You still have to create an uber JAR for Spark deployment.
The title from the release notes is very misleading. The actual meaning is that Spark itself, as a dependency, is no longer compiled into an uber JAR, but acts like a normal application JAR with dependencies. You can see this in more detail in SPARK-11157, which is called "Allow Spark to be built without assemblies", and read the paper called "Replacing the Spark Assembly with good old jars", which describes the pros and cons of deploying Spark not as several huge JARs (Core, Streaming, SQL, etc.) but as several relatively regular-sized JARs containing the code, plus a lib/ directory with all the related dependencies.
If you really want the details, this pull request touches several key parts.
