How to find installed libraries on a Hadoop server? - apache-spark

I am currently working with a Hadoop server. Now I have to train a neural network with libraries like Keras, TensorFlow, etc. I know the Spark libraries are already installed. I just want to check whether any other libraries are installed on the Hadoop server. Our company has its own Hadoop server in a remote location. I am not allowed to install any new libraries and have to work with the existing ones. Can you please let me know how to check which libraries are already installed on the Hadoop server?

Hadoop is not a single server; you actually need to check all of the YARN NodeManagers for libraries, as that's where Spark executors run. In a large cluster, that's not an easy task...
When you submit a Spark job, you can freely add your own --files and --archives arguments to bring extra dependencies onto the classpath. These flags copy the files into each container's local working directory for that job, so they can shadow what's already installed on the cluster.
By default, Spark just uses whatever built-in classes are there, and those are typically contained in an archive file (for example, the one pointed to by spark.yarn.archive). You would need to inspect your Spark configuration files to determine where that is, download it from HDFS, then extract it to see which libraries are available.
Or you can ask the cluster administrator what version of Spark is installed and whether any extra libraries were added (typically the answer to that would be none). With the version information, go download Spark yourself and inspect its contents.
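
If PySpark is available on the cluster, one practical way to see which Python libraries the worker nodes can actually import is to run a small probe job on the executors. Below is a minimal sketch, assuming you are allowed to submit a job to YARN; the package list and app name are only illustrative.

```python
# check_libs.py - probe YARN executors for importable Python packages.
# Minimal sketch: assumes PySpark is available and you can submit to YARN.
import importlib

from pyspark.sql import SparkSession

# Packages to look for; adjust to whatever you need (illustrative list).
PACKAGES = ["tensorflow", "keras", "numpy", "pandas", "sklearn"]


def probe(_partition):
    """Runs on an executor: try importing each package and report its version."""
    results = []
    for name in PACKAGES:
        try:
            mod = importlib.import_module(name)
            results.append((name, getattr(mod, "__version__", "unknown")))
        except ImportError:
            results.append((name, "NOT INSTALLED"))
    return results


spark = SparkSession.builder.appName("library-probe").getOrCreate()
sc = spark.sparkContext

# Spread the probe over many partitions so it lands on several NodeManagers,
# then deduplicate the reports.
found = sc.parallelize(range(100), 100).mapPartitions(probe).distinct().collect()

for name, version in sorted(found):
    print(f"{name}: {version}")

spark.stop()
```

Submitted with spark-submit --master yarn check_libs.py, a single version per package means the NodeManagers agree; mixed versions or NOT INSTALLED entries show exactly where the gaps are.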

Related

Overriding Apache Spark dependency (spark-hive)

Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose that the Spark app (CLUSTER MODE) already has the spark-hive (2.4.4) dependency (PROVIDED).
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server, container, or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!
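
For the on-server case, a quick way to confirm which spark-hive jar the installation is actually shipping, before you swap it out, is to list $SPARK_HOME/jars. A minimal sketch, assuming the SPARK_HOME environment variable is set on the machine (or inside the image):

```python
# find_spark_hive.py - locate the spark-hive jar(s) under $SPARK_HOME/jars.
# Minimal sketch: assumes the SPARK_HOME environment variable is set.
import glob
import os

spark_home = os.environ.get("SPARK_HOME")
if not spark_home:
    raise SystemExit("SPARK_HOME is not set")

for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "spark-hive*.jar"))):
    print(jar)
```

Whatever this prints is the file to replace with your custom build (and to bake into the image if you run in a container or pod).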

Understanding how spark applications use dependencies

Let's say we have a Spark application that writes to and reads from HDFS, and we have some additional dependency; let's call it dep.
Now let's run spark-submit on our jar built with sbt. I know that spark-submit sends some jars (known as spark-libs). However, my questions are:
(1) How does the Spark version influence the dependencies that get sent? I mean, what is the difference between spark-with-hadoop/bin/spark-submit and spark-without-hadoop/bin/spark-submit?
(2) How does the Hadoop version installed on the cluster influence the dependencies?
(3) Who is responsible for providing my dependency dep? Should I build a fat jar (assembly)?
Please note that the first two questions are about where the HDFS calls made by my Spark application (like write/read) come from.
Thanks in advance
spark-without-hadoop refers only to the downloaded package, not application development.
The more correct phrasing is "Bring your own Hadoop," meaning you are still required to provide the base Hadoop dependencies for any Spark application.
Should I build a fat jar (assembly)?
If you have libraries that are outside of hadoop-client and those provided by Spark (core, mllib, streaming), then yes.
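
If you would rather not shade dep into an assembly, the usual alternative is to let Spark resolve it at submit time via spark.jars.packages. A minimal sketch, assuming dep is published to a Maven repository; the coordinate com.example:dep:1.0.0 is a placeholder:

```python
# Minimal sketch: pull an extra dependency at startup instead of building a fat jar.
# "com.example:dep:1.0.0" is a placeholder Maven coordinate.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dep-at-runtime")
    # Must be set before the session (and its JVM) starts.
    .config("spark.jars.packages", "com.example:dep:1.0.0")
    .getOrCreate()
)

# hadoop-client and the Spark modules (core, sql, mllib, streaming) are already
# provided by the cluster, so only dep itself needs to be shipped this way.
print(spark.sparkContext.getConf().get("spark.jars.packages"))

spark.stop()
```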

Spark Standalone Cluster: Configuring Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is just bundled with the Hadoop client for the specified version; you still need to install Hadoop if you want to use HDFS.
A2. Running with YARN means using YARN as Spark's resource manager (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications). So, in cases where you don't need a DFS, such as when you're only running Spark Streaming applications, you can still install Hadoop but run only the YARN processes to use its resource-management functionality.
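
Once HDFS (and optionally YARN) is running, loading and saving from a standalone Spark cluster is just a matter of using hdfs:// URIs. A minimal sketch, assuming a NameNode at namenode:8020 and a standalone master at spark-master:7077; both hostnames and the paths are placeholders:

```python
# Minimal sketch: read from and write to HDFS from a standalone Spark cluster.
# "spark-master", "namenode" and the paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("hdfs-io")
    .getOrCreate()
)

df = spark.read.csv("hdfs://namenode:8020/data/input.csv", header=True)
df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/output.parquet")

spark.stop()
```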

If I already have Hadoop installed, should I download Apache Spark WITH Hadoop or WITHOUT Hadoop?

I already have Hadoop 3.0.0 installed. Should I now install the with-hadoop or without-hadoop version of Apache Spark from this page?
I am following this guide to get started with Apache Spark.
It says:
Download the latest version of Apache Spark (pre-built according to your Hadoop version) from this link: ...
But I am confused. If I already have an instance of Hadoop running in my machine, and then I download, install and run Apache-Spark-WITH-Hadoop, won't it start another additional instance of Hadoop?
First off, Spark does not yet support Hadoop 3, as far as I know. You'll notice this by the lack of a download option for your Hadoop version.
You can try setting HADOOP_CONF_DIR and HADOOP_HOME in your spark-env.sh, though, regardless of which you download.
You should always download the version without Hadoop if you already have it.
won't it start another additional instance of Hadoop?
No. You still would need to explicitly configure and start that version of Hadoop.
That Spark package is already configured to use the Hadoop it includes, I believe.
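
One way to check that your existing Hadoop configuration is actually being picked up (rather than the bundled defaults) is to look at the Hadoop Configuration object Spark builds. A minimal sketch in PySpark; it goes through the internal _jsc handle, so treat it as a diagnostic rather than a stable API:

```python
# Minimal sketch: confirm HADOOP_CONF_DIR is honored by checking which default
# filesystem Spark's Hadoop configuration resolves to.
import os

from pyspark.sql import SparkSession

print("HADOOP_CONF_DIR =", os.environ.get("HADOOP_CONF_DIR"))

spark = SparkSession.builder.appName("hadoop-conf-check").getOrCreate()

# _jsc is an internal handle; handy for diagnostics, not a public API.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print("fs.defaultFS =", hadoop_conf.get("fs.defaultFS"))
# "file:///" usually means your core-site.xml was NOT picked up; an hdfs:// URI
# means Spark is reading your existing Hadoop configuration.

spark.stop()
```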
This is in addition to the answer by #cricket_007.
If you have Hadoop installed, you would normally not download Spark with Hadoop; however, since your Hadoop version is not yet supported by any Spark release, you will need to download the one with Hadoop bundled. You will then need to configure the bundled Hadoop version on your machine for Spark to run on. This means that all of your data on the Hadoop 3 installation will be LOST, so if you need that data, please take a backup before beginning your downgrade/re-configuration. I do not think you will be able to host two instances of Hadoop on the same system because of certain environment variables.

What is the difference between the package types of Spark on the download page?

What's the difference between the download package types of Spark:
1) pre-built for Hadoop 2.6.0 and later, and
2) source code (can be built for several Hadoop versions)?
Can I install the pre-built for Hadoop 2.6.0 and later package but work without using Hadoop, HDFS, or HBase?
PS: Hadoop 2.6.0 is already installed on my machine.
The last answer only addressed Q1, so I'm writing this.
The answer to your Q2 is yes: you can work with Spark without the Hadoop components installed, even if you use Spark pre-built for a specific Hadoop version. Spark will throw a bunch of errors while starting up the master/workers, which you (and Spark) can blissfully ignore as long as you see them up and running.
In terms of applications, it's never a problem.
The difference is the version of the Hadoop API they are built against. To interoperate with a Hadoop installation, Spark needs to be built against that API, e.g. the dreaded conflict between org.apache.hadoop.mapred and org.apache.hadoop.mapreduce.
If you're using Hadoop 2.6, get the binary version that matches your Hadoop installation.
You can also build Spark from source. That's what the source code download is for. If you want to build from source, follow the instructions listed here: https://spark.apache.org/docs/latest/building-spark.html
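
If you want to verify which Hadoop client version a given Spark binary was built with (useful when matching the pre-built packages to an existing installation), you can ask Hadoop's VersionInfo class through the py4j gateway. A minimal sketch; sc._jvm is an internal handle, so this is a diagnostic rather than a supported API:

```python
# Minimal sketch: print the Hadoop client version bundled with this Spark build.
# Relies on the internal py4j gateway (sc._jvm).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hadoop-version-check").getOrCreate()
sc = spark.sparkContext

print("Spark version: ", spark.version)
print("Hadoop version:", sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

spark.stop()
```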
