Is there an option to disable specific jars when using spark-submit? - apache-spark

I am running a jar file through spark-submit (YARN, client mode).
When the jar runs, the libraries installed on the classpath of the Spark cluster servers are used.
However, I get an error because of version conflicts between certain libraries.
I don't want to change anything on the server classpath, because those components may be used elsewhere.
So the question is: is there an option to disable certain libraries on the Spark cluster servers only when running spark-submit?
Or is there any other way to solve my problem?
I look forward to your reply. Thank you.
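For reference, Spark does expose settings that make application-supplied jars take precedence over the cluster classpath; whether they resolve a given conflict depends on the libraries involved. A minimal spark-submit sketch, assuming the preferred library version is shipped with the application (the jar names below are hypothetical placeholders):

# sketch only; my-preferred-library.jar and my-app.jar are hypothetical names
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars my-preferred-library.jar \
  my-app.jar

Both userClassPathFirst settings are documented as experimental, so this needs testing against the specific conflict.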

Related

Overriding Apache Spark dependency (spark-hive)

Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose the Spark app (cluster mode) already has the spark-hive (2.4.4) dependency marked as PROVIDED.
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server or container or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!
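As a rough sketch of that swap on a server install (the jar name below follows a stock Spark 2.4.4 / Scala 2.11 layout but should be treated as a placeholder; back up the original first):

# back up the stock jar, then drop in the custom build (names are placeholders)
cd $SPARK_HOME/jars
mv spark-hive_2.11-2.4.4.jar /tmp/spark-hive_2.11-2.4.4.jar.bak
cp /path/to/custom/spark-hive_2.11-2.4.4.jar .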

Understanding how spark applications use dependencies

Let's say we have a Spark application that writes to and reads from HDFS, and that we have some additional dependency; let's call it dep.
Now let's run spark-submit on our jar built with sbt. I know that spark-submit sends some jars (known as spark-libs). However, my questions are:
(1) How does the Spark version influence the dependencies that get sent? I mean, what is the difference between spark-with-hadoop/bin/spark-submit and spark-without-hadoop/bin/spark-submit?
(2) How does the version of Hadoop installed on the cluster influence the dependencies?
(3) Who is responsible for providing my dependency dep? Should I build a fat jar (assembly)?
Please note that the first two questions are about where the HDFS calls come from (I mean the calls made by my Spark application, such as write/read).
Thanks in advance
spark-without-hadoop refers only to the downloaded package, not to application development.
The more accurate phrasing is "bring your own Hadoop," meaning you are still required to have the base Hadoop dependencies available for any Spark application.
Should I build a fat jar (assembly)?
If you have libraries that are outside of hadoop-client and those provided by Spark (core, mllib, streaming), then yes.
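As a sketch of what that looks like in build.sbt (versions and the dep coordinates are placeholders, and the assembly task comes from the sbt-assembly plugin):

// build.sbt (sketch): Spark and hadoop-client stay Provided because the cluster
// supplies them at runtime; only the extra dependency ends up in the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "2.4.4" % Provided,
  "org.apache.spark"  %% "spark-sql"     % "2.4.4" % Provided,
  "org.apache.hadoop"  % "hadoop-client" % "2.7.3" % Provided,
  "com.example"       %% "dep"           % "1.0.0"  // hypothetical extra dependency
)
// build the fat jar with: sbt assembly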

spark LOCAL and alluxio client

I'm running Spark in local mode and trying to get it to talk to Alluxio. I'm getting the error:
java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
I have looked at the page here:
https://www.alluxio.org/docs/master/en/Debugging-Guide.html#q-why-do-i-see-exceptions-like-javalangruntimeexception-javalangclassnotfoundexception-class-alluxiohadoopfilesystem-not-found
This page details the steps to take in this situation, but I'm not having any success.
According to the Spark documentation, I can instantiate a local Spark session like so:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .appName("App")
  .getOrCreate()
Then I can add the Alluxio client library like so:
sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
I have verified that the proper jar file exists in the right location on my local machine with:
logger.error(sparkSession.conf.get("spark.driver.extraClassPath"))
logger.error(sparkSession.conf.get("spark.executor.extraClassPath"))
But I still get the error. Is there anything else I can do to figure out why Spark is not picking the library up?
Please note I am not using spark-submit - I am aware of the methods for adding the client jar to a spark-submit job. My Spark instance is being created as local within my application and this is the use case I want to solve.
As an FYI, there is another application in the cluster that connects to my Alluxio using the fs client, and that all works fine. In that case, though, the fs client is packaged as part of the application through standard sbt dependencies.
Thanks
In the hope that this helps someone else:
My problem here was not that the library wasn't being loaded or wasn't on the classpath; it was that I was using the "fs" version of the client rather than the "hdfs" version.
I had been using the generic 1.4 client. At some point this client was split into an fs version and an hdfs version, and when I recently updated to 1.7 I mistakenly added the "fs" version.
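For anyone hitting the same thing from sbt, a sketch of the distinction (the artifact coordinates are my assumption based on how the Alluxio 1.7.x client is published; check the Alluxio docs for your version):

// Hadoop-compatible client, which provides alluxio.hadoop.FileSystem (assumed coordinates)
libraryDependencies += "org.alluxio" % "alluxio-core-client-hdfs" % "1.7.1"
// native filesystem API client, which does not register the Hadoop FileSystem
// libraryDependencies += "org.alluxio" % "alluxio-core-client-fs" % "1.7.1"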

How to find installed libraries in hadoop server?

I am currently working with a Hadoop server. I have to train a neural network with libraries such as Keras, TensorFlow, etc. I know the Spark libraries are already installed, and I just want to check whether any other libraries are installed on the Hadoop server. Our company has its own Hadoop server in a remote location; I am not allowed to install any new libraries and have to work with the existing ones. Can you please tell me how to check which libraries are already installed on the Hadoop server?
Hadoop is not a single server, and you actually need to check all YARN NodeManagers for any libraries, as that's where Spark runs. In a large cluster, that's not an easy task...
When you submit a Spark job, you can freely add your own --files and --archives to bring any dependencies onto your classpath. These flags copy the files locally into your Spark execution space, overriding what's already on the cluster.
By default, Spark just uses whatever built-in classes there are, and those are typically contained in an archive file. You would need to inspect your Spark configuration files to determine where that archive is, download it from HDFS, then extract it to see which libraries are available.
Or you can ask the cluster administrator what version of Spark is installed and whether any extra libraries were added (typically the answer would be none). With the version information, download Spark yourself and inspect its contents.
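As a rough sketch of that inspection (paths and file names are assumptions and will differ per cluster):

# list the jars shipped with the Spark client install, if you have shell access
ls $SPARK_HOME/jars

# check whether the cluster distributes a prebuilt archive of Spark jars
grep -E 'spark\.yarn\.(archive|jars)' $SPARK_HOME/conf/spark-defaults.conf

# if spark.yarn.archive points at HDFS, pull it down and list its contents
hdfs dfs -get hdfs:///some/path/spark-libs.zip .   # hypothetical path
unzip -l spark-libs.zip | head

# print the Hadoop/YARN classpath visible on this node
yarn classpath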

Add CLASSPATH to Oozie workflow job

I wrote a Spark SQL job in Java that accesses Hive tables, and packaged it as a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and a coordinator, once I get the workflow working). When I try to do that, the job fails and I get the following in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was look for the jar in $HIVE_HOME/lib that contains that class, copy it into the lib directory under my Oozie workflow root path, and add this to the Spark action in workflow.xml:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError pointing to another missing class, so I repeated the process of finding and copying the jar, ran the job, and the same thing happened all over again. It seems to depend on many of the jars in my Hive lib directory.
What I don't understand is that when I run the jar with spark-submit in the shell, it works fine; I can SELECT and INSERT into my Hive tables. It is only with Oozie that this occurs. It looks like Spark can no longer see the Hive libraries when it runs inside an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to package Spark dependencies properly in a ShareLib, and use the "blablah.system.libpath" flag.
[EDIT] If base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs including "HCatalog", e.g.
action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie Action under <file> so that they are downloaded at run time in the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; in that case, all you have to upload/download is a list of valid config files (Hive, Spark, whatever).
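In concrete terms, that usually ends up in the job.properties you pass to oozie job (a sketch; the full name of the per-action ShareLib override is oozie.action.sharelib.for.spark in the Oozie docs, and behavior may differ on CDH 5.4's Oozie 4.1.0):

# job.properties (sketch): use the system ShareLib and include both the
# spark and hcatalog ShareLibs for Spark actions in this workflow
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark,hcatalog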
The better way to avoid the classpath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and update the Hive/Pig jars in the shared location. (Sometimes the existing jars in the Oozie shared location get mismatched with the product jars.)
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, pass the parameter
"oozie.use.system.libpath = true"
This tells Oozie to read the jars from the Hadoop shared location.
Once you have pointed at the shared location by setting this parameter to true, you no longer need to mention each and every jar one by one in workflow.xml.
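As an illustration only (the element layout follows the uri:oozie:spark-action:0.1 schema; the paths, names, and class below are placeholders), a Spark action that relies on the ShareLib instead of listing jars in <spark-opts> might look like:

<action name="spark-hive-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>MySparkSQLJob</name>
        <class>com.example.MySparkSQLJob</class>
        <jar>${nameNode}/user/${wf:user()}/apps/myapp/lib/myapp.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>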
