If I already have Hadoop installed, should I download Apache Spark WITH Hadoop or WITHOUT Hadoop? - apache-spark

I already have Hadoop 3.0.0 installed. Should I now install the with-hadoop or without-hadoop version of Apache Spark from this page?
I am following this guide to get started with Apache Spark.
It says
Download the latest version of Apache Spark (Pre-built according to
your Hadoop version) from this link:...
But I am confused. If I already have an instance of Hadoop running on my machine, and I then download, install and run Apache-Spark-WITH-Hadoop, won't it start another, additional instance of Hadoop?

First off, Spark does not yet support Hadoop 3, as far as I know. You'll notice this because there is no download option matching "your Hadoop version".
You can try setting HADOOP_CONF_DIR and HADOOP_HOME in your spark-env.sh, though, regardless of which you download.
You should always download the version without Hadoop if you already have it.
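For the "without Hadoop" build in particular, Spark's Hadoop-free-build docs say it has to be pointed at your existing installation. A minimal conf/spark-env.sh sketch (the paths are assumptions; adjust them to wherever your Hadoop 3.0.0 actually lives):

export HADOOP_HOME=/usr/local/hadoop              # assumption: your Hadoop install directory
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop    # so Spark picks up core-site.xml / hdfs-site.xml
export SPARK_DIST_CLASSPATH=$(hadoop classpath)   # required by the hadoop-free ("without Hadoop") build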
won't it start another additional instance of Hadoop?
No. You still would need to explicitly configure and start that version of Hadoop.
That Spark option is already configured to use the included Hadoop, I believe.

This is in addition to the answer by @cricket_007.
If you already have Hadoop installed, you would normally not download Spark with Hadoop. However, since your Hadoop version is still unsupported by any version of Spark, you will need to download the one bundled with Hadoop, and you will then need to configure that bundled Hadoop version on your machine for Spark to run on. This means that all your data on Hadoop 3 will be LOST, so if you need that data, please take a backup before beginning your downgrade/re-configuration. I do not think you will be able to host two instances of Hadoop on the same system because of conflicting environment variables.

Related

Spark Standalone Cluster: Configuring Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is just bundled with a Hadoop client of the specified version; you still need to install Hadoop yourself if you want to use HDFS.
A2. Running with YARN means you're using YARN as Spark's resource manager (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications). So, in cases where you don't need a DFS, for example when you're only running Spark Streaming applications, you can still install Hadoop but run only the YARN processes to use its resource-management functionality.
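As a rough sketch of what that looks like at submit time (the class and jar names are placeholders, and this assumes HADOOP_CONF_DIR/YARN_CONF_DIR already point at your cluster configuration):

# class and jar below are placeholders for your own application
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar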

How to find installed libraries in hadoop server?

I am currently working with a Hadoop server. Now I have to train a neural network with libraries like Keras, TensorFlow, etc. I know Spark libraries are already installed. I just want to check whether there are any other libraries installed on the Hadoop server. Our company has its own Hadoop server in a remote location. I am not allowed to install any new libraries and have to work with the existing ones. Can you please let me know how to check which libraries are already installed on the Hadoop server?
Hadoop is not a single server, and you actually need to check all YARN NodeManagers for any libraries, as that's where Spark runs. In a large cluster, that's not an easy task...
When you submit a Spark job, you can freely add your own --files and --archives to bring in any dependencies to your classpath. These flags will copy files locally into your Spark execution space, overwriting what's already in the cluster.
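For example (the file names here are hypothetical), shipping your own dependencies with a job might look like:

# settings.conf and deps.zip are hypothetical files of your own
spark-submit --master yarn \
  --files settings.conf \
  --archives deps.zip#deps \
  --class com.example.MyApp my-app.jar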
By default, Spark just uses whatever builtin classes there are, and those are typically contained in an archive file. You would need to inspect your Spark configuration files to determine where that is, download it from HDFS, then extract it out to determine any available libraries.
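A minimal sketch of that inspection, assuming the archive is referenced by a property such as spark.yarn.archive in spark-defaults.conf (the property name, config path and HDFS path all vary by setup):

grep -i 'spark.yarn' /etc/spark/conf/spark-defaults.conf   # assumption: config lives here
hdfs dfs -get hdfs:///some/path/spark-libs.zip .           # use the path found above
unzip -l spark-libs.zip | less                             # list the bundled jars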
Or you can ask the cluster administrator what version of Spark is installed, and whether any extra libraries were added (typically the answer to that would be none). With the version information, go download Spark yourself and inspect its contents.

installing spark 1.4 on google cloud cluster

I set up a Google Compute Engine cluster with Click to Deploy.
I want to use Spark 1.4, but I get Spark 1.1.0.
Does anyone know if it is possible to set up a cluster with Spark 1.4?
I also had issues with this. These are the steps I took:
Download a copy of GCE's bdutil from github https://github.com/GoogleCloudPlatform/bdutil
Download the Spark version you want, in this case spark-1.4.1, from the Spark website and store it in a Google Cloud Storage bucket that you control. Make sure it's a Spark build that supports the Hadoop version you'll also be deploying with bdutil.
Edit the spark env file https://github.com/GoogleCloudPlatform/bdutil/blob/master/extensions/spark/spark_env.sh
Change SPARK_HADOOP2_TARBALL_URI='gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz' to SPARK_HADOOP2_TARBALL_URI='gs://[YOUR SPARK PATH]'. I'm assuming you want Hadoop 2; if you want Hadoop 1, make sure you change the corresponding variable instead.
Once that's all done, build your Hadoop + Spark cluster from the modified bdutil; you should have a modern version of Spark after this.
Make sure you pass spark_env.sh with the -e flag when executing bdutil, and also add the hadoop2 env file if you're installing Hadoop 2, which I was as well.
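A rough end-to-end sketch of the sequence above (the bucket name is a placeholder, and the exact -e syntax can differ between bdutil versions):

# upload the Spark tarball you downloaded to a bucket you control (placeholder name)
gsutil cp spark-1.4.1-bin-hadoop2.6.tgz gs://your-bucket/
# after editing extensions/spark/spark_env.sh to point SPARK_HADOOP2_TARBALL_URI at it:
./bdutil -e hadoop2_env.sh,extensions/spark/spark_env.sh deploy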
One option would be to try http://spark-packages.org/package/sigmoidanalytics/spark_gce, which deploys Spark 1.2.0, but you could edit the file to deploy 1.4.0.

What is the difference between the package types of Spark on the download page?

What's the difference between the download package types of Spark:
1) Pre-built for Hadoop 2.6.0 and later, and
2) Source code (can build several Hadoop versions)?
Can I install "Pre-built for Hadoop 2.6.0 and later" but work without using Hadoop, HDFS, or HBase?
PS: Hadoop 2.6.0 is already installed on my machine.
The last answer only addressed Q1, so I'm writing this.
The answer to your Q2 is yes: you can work with Spark without Hadoop components installed, even if you use Spark pre-built for a specific Hadoop version. Spark will throw a bunch of errors while starting up the master/workers, which you (and Spark) can blissfully ignore as long as you see them up and running.
In terms of applications, it's never a problem.
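For instance (the host name is a placeholder), you can bring up a bare standalone cluster without any Hadoop daemons and just watch the logs:

./sbin/start-master.sh                      # prints a spark://<host>:7077 URL in its log
./sbin/start-slave.sh spark://myhost:7077   # placeholder master URL; HDFS-related warnings can be ignored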
The difference is the version of the Hadoop API they are built against. To interoperate with a Hadoop installation, Spark needs to be built against that API, e.g. the dreaded conflict of org.apache.hadoop.mapred vs. org.apache.hadoop.mapreduce.
If you're using Hadoop 2.6, get the pre-built binary that matches your Hadoop installation.
You can also build Spark from source; that's what the Source code download is for. If you want to build from source, follow the instructions listed here: https://spark.apache.org/docs/latest/building-spark.html
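As a sketch of such a build (the profiles and versions are examples; check the linked page for the ones that match your release):

# run from the extracted Spark source directory
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package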

Does any Cloudera Hadoop distribution support Apache Spark SQL?

I am new to Apache Spark. I heard that none of the versions of CDH support Apache Spark SQL as of now, and that the same is the case with the Hortonworks distribution. Is that true?
Another question: I have CDH 5.0.0 installed on my PC; which version of Apache Spark does my CDH support?
Also, could someone please provide me the steps to execute my Spark program in my CDH distribution? I have written some basic programs using Apache Spark 1.2 and I am not able to run them in the CDH environment; I am facing a very basic problem when I run a Spark program using the spark-submit command:
spark-submit: Command not found
Do I need to configure anything before running my Spark program?
Thanks in advance
All of the distributions of CDH include the whole Spark distribution, including Spark SQL.
EDIT: It is supported as of CDH 5.5.x.
CDH 5.0.x includes Spark 0.9.x. CDH 5.3.x includes Spark 1.2.x, and 5.4.x should ship 1.3.x since it is about to be released upstream.
spark-submit is already on your PATH if you are using CDH. If you're running it from somewhere else, you have to put it on your PATH or give the full path to it, the same as with any program. So, something is wrong with your setup.
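A quick sketch of checking and fixing that, assuming a parcel-based CDH install (the parcel path is typical but not guaranteed on your machine):

which spark-submit || echo "not on PATH"
# typical location for the active CDH parcel; adjust if your install lives elsewhere
export PATH=$PATH:/opt/cloudera/parcels/CDH/bin
spark-submit --version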
