Installing Spark/Zeppelin on Standalone node - apache-spark

I have a Cloudera cluster that is managed by an admin team. However, Zeppelin is not installed on the cluster.
I would like to install Zeppelin on a separate node and connect it to the Cloudera cluster.
Is it feasible to install Zeppelin on a node that is not part of the cluster and submit Spark jobs to the cluster from it?
Any references would be appreciated.
Thanks

Zeppelin is just another Spark client.
On the machine where you want to run Zeppelin, first make sure that spark-shell and spark-submit work as expected; the Zeppelin configuration then becomes much easier.
An easy way to manage that is to have the admins use Cloudera Manager to install the Spark (and Hive and Hadoop) client libraries on this standalone node. After that, I assume they either give you SSH access or you tell them how to install Zeppelin.
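
As a rough pre-flight check before touching the Zeppelin configuration, something like the following could be run on the standalone node. It is only a sketch: the function names are made up for illustration, and it assumes the client configuration is exposed through the HADOOP_CONF_DIR environment variable, which is what Spark reads when submitting to YARN.

```python
import os
import shutil


def check_spark_client(env=None, which=shutil.which):
    """Report whether the basic Spark client pieces are visible on this node."""
    env = os.environ if env is None else env
    return {
        # Both launchers ship in Spark's bin/ directory and must be on PATH.
        "spark-shell": which("spark-shell") is not None,
        "spark-submit": which("spark-submit") is not None,
        # Spark finds the cluster through the Hadoop/YARN client config files.
        "HADOOP_CONF_DIR": bool(env.get("HADOOP_CONF_DIR")),
    }


def missing_pieces(report):
    """Return the names of the checks that failed, in a stable order."""
    return sorted(name for name, ok in report.items() if not ok)
```

Calling `missing_pieces(check_spark_client())` on the node returns an empty list when all three pieces are in place. It does not prove that a job will actually run, so running spark-shell by hand is still worthwhile.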

Related

Do we need to install Spark on YARN to read data from HDFS into PySpark?

I have a Hadoop 3.1.1 multi-node cluster. I want to use PySpark to read files from HDFS for ETL operations and then load the results into target MySQL databases.
My questions:
Can I install Spark in standalone mode?
Do I need to install Spark on my YARN cluster first?
If not, how can I install Spark separately?

You can use any mode to communicate with HDFS and MySQL, including Kubernetes. Or you can just use --master="local[*]", in which case you don't need a scheduler at all. This is useful, for example, from a Jupyter Notebook.
YARN would be recommended, as you already have HDFS and therefore also the scripts to start the YARN processes.
You don't really "install Spark on YARN". Applications are submitted from clients to the YARN cluster. The archive at the spark.yarn.archive HDFS path is unpacked to provide the classes necessary to run the job.
See https://spark.apache.org/docs/latest/running-on-yarn.html
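
To make the local[*] option concrete, here is a minimal sketch of such an ETL job. The host names, paths, table and credentials are placeholders, the input is assumed to be CSV, and the MySQL Connector/J jar is assumed to be on Spark's classpath (for example via --jars); none of that comes from the question.

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the options Spark's JDBC writer expects for a MySQL target."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class of MySQL Connector/J 8.x (assumed to be on the classpath).
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def run_etl(hdfs_path, jdbc_options, master="local[*]"):
    """Read CSV files from HDFS and append the rows to a MySQL table."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).appName("hdfs-to-mysql").getOrCreate()
    try:
        df = spark.read.csv(hdfs_path, header=True, inferSchema=True)
        # The actual transformations of the ETL step would go here.
        df.write.format("jdbc").options(**jdbc_options).mode("append").save()
    finally:
        spark.stop()
```

A call would look like `run_etl("hdfs://namenode:8020/data/input", mysql_jdbc_options("mysql-host", "warehouse", "orders", "etl", "secret"))`. Passing `master="yarn"` instead should work the same way once HADOOP_CONF_DIR points at the cluster's client configuration.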

Airflow and Spark/Hadoop - a single cluster, or one for Airflow and another for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice? It looks a little complicated to deploy Spark remotely from another cluster, and doing so would duplicate some configuration files.

I believe you only need to configure a yarn-site.xml file for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once the Application Master is deployed within YARN, Spark runs locally to the Hadoop cluster.
If you really want, you could also submit an hdfs-site.xml and hive-site.xml from Airflow (if that's possible). Otherwise, at least the hdfs-site.xml file should be picked up from the YARN container classpath (not every NodeManager may have a Hive client installed).
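
A minimal sketch of how that submission could be assembled from Python, for instance inside an Airflow task. The helper names, the job path and the executor memory setting are illustrative assumptions; only the spark-submit flags and the HADOOP_CONF_DIR / YARN_CONF_DIR variables are standard.

```python
import shlex


def spark_submit_command(app, app_args=(), deploy_mode="client", conf=None):
    """Return the argv list for submitting `app` to YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    argv = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    # Sort the settings so the generated command is reproducible.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    return argv + [app, *app_args]


def submit_environment(conf_dir):
    """Variables that point spark-submit at the directory holding yarn-site.xml."""
    return {"HADOOP_CONF_DIR": conf_dir, "YARN_CONF_DIR": conf_dir}


# Hypothetical job path and argument, shown only to illustrate the shape.
cmd = spark_submit_command(
    "/opt/jobs/etl.py", ["2024-01-01"], conf={"spark.executor.memory": "2g"}
)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, env={**os.environ, **submit_environment("/etc/hadoop/conf")})` on the Airflow worker, or joined into a single string as above when a shell command is expected.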

I prefer submitting Spark jobs with SSHOperator and running the spark-submit command on the remote host, which saves you from copying yarn-site.xml around. Also, I would not create a cluster for Airflow if its only task is running Spark jobs; a single VM with LocalExecutor should be fine.

There are a variety of options for performing spark-submit remotely via Airflow:
EMR Step
Apache Livy (see this for hint)
SSH
Note that none of these are plug-and-play, and you'll have to write your own operators to get things done.
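
For the Livy route, a custom operator would essentially wrap an HTTP call to Livy's batch endpoint. The sketch below only builds the request and does not send it; the server URL and file location are placeholders, and it assumes a Livy server reachable from the Airflow host and an application file the cluster can read (for example on HDFS).

```python
import json
import urllib.request


def livy_batch_request(livy_url, app_file, args=(), conf=None, name=None):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {"file": app_file, "args": list(args)}
    if conf:
        payload["conf"] = dict(conf)
    if name:
        payload["name"] = name
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a matter of `urllib.request.urlopen(livy_batch_request(...))`. The JSON response describes the created batch, which a real operator would then poll until the job finishes.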

Spark Standalone Cluster: Configuring Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7.
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with YARN, and which is easier to install and configure (assuming fairly light data loads)?

A1. Right. The package you mentioned only bundles the Hadoop client libraries for the specified version; you still need to install Hadoop if you want to use HDFS.
A2. Running with YARN means you're using YARN as Spark's resource manager. (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications) So even when you don't need a distributed file system, for example when you're only running Spark Streaming applications, you can still install Hadoop but run only the YARN processes to use its resource management functionality.
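
Put differently, where the data lives and who schedules the job are two separate choices. The sketch below keeps them apart; the helper names and host names are invented for illustration, and the NameNode port 8020 is an assumption that depends on the Hadoop configuration.

```python
def master_url(mode, host=None, port=7077):
    """Return the value to pass to --master for a given cluster manager."""
    if mode == "local":
        return "local[*]"  # run in-process, no cluster manager at all
    if mode == "standalone":
        if not host:
            raise ValueError("standalone mode needs the master host")
        return f"spark://{host}:{port}"  # Spark's own master, default port 7077
    if mode == "yarn":
        return "yarn"  # the ResourceManager is located via HADOOP_CONF_DIR
    raise ValueError(f"unknown mode: {mode}")


def hdfs_uri(namenode, path, port=8020):
    """Fully qualified HDFS location; the port depends on the cluster setup."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"
```

A standalone cluster can read `hdfs_uri("namenode", "/data/in.csv")` just as a YARN-managed one can, as long as an HDFS installation is running somewhere the workers can reach.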

How to set up YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7 with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks

YARN is part of Hadoop, so a Hadoop installation is necessary to run Spark on YARN.
Check out the page on the Hadoop Cluster Setup.
Then you can use this documentation to learn about Spark on YARN.
Another way to quickly learn about Hadoop, YARN and Spark is to use the Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.

We are currently using a similar setup in AWS. AWS EMR is costly, so we set up our own cluster on EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions; we chose HDP.
The setup included the following.
Master setup:
Spark (along with the History Server)
YARN ResourceManager
HDFS NameNode
Livy server
Slave setup:
YARN NodeManager
HDFS DataNode
More information on manual installation can be found in the HDP documentation.
You can see part of that automation here.
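
The role split above can be written down as data, which is roughly what any automation has to encode. This is only an illustrative sketch: the service labels and host names are made up and are not the names used by the cookbook or by HDP.

```python
# Services per role, following the master/slave split described above.
ROLE_SERVICES = {
    "master": [
        "spark",
        "spark-history-server",
        "yarn-resourcemanager",
        "hdfs-namenode",
        "livy-server",
    ],
    "slave": ["yarn-nodemanager", "hdfs-datanode"],
}


def services_by_host(hosts):
    """Map each host name to the list of services its role should run."""
    return {host: list(ROLE_SERVICES[role]) for host, role in hosts.items()}
```

For the 1 master and 2 slaves asked about, `services_by_host({"node1": "master", "node2": "slave", "node3": "slave"})` gives the per-node checklist.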

Accessing Cassandra from Google Cloud Dataproc

I just set up a Spark cluster in Google Cloud using Dataproc, and I have a standalone installation of Cassandra running on a separate VM. I would like to install the DataStax spark-cassandra connector so I can connect to Cassandra from Spark. How can I do this?
The connector can be downloaded here:
https://github.com/datastax/spark-cassandra-connector
The instructions on building are here:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md
sbt is needed to build it.
Where can I find sbt for the Dataproc installation?
Would it be under $SPARK_HOME/bin? Where is Spark installed for Dataproc?

I'm going to follow up on the really helpful comment @angus-davis made not too long ago.
"Where can I find sbt for the Dataproc installation?"
At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation contains information on how to install sbt manually. If you need to re-install sbt on your clusters, I highly recommend you create an initialization action to install sbt when you create a cluster. After some research, it looks like sbt is covered under a BSD-3 license, which means we can probably (no promise) include it in Cloud Dataproc clusters.
"Would it be under $SPARK_HOME/bin? Where is Spark installed for Dataproc?"
The answer depends on what you mean.
binaries - /usr/bin
config - /etc/spark/conf
spark_home - /usr/lib/spark
Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
"I would like to install the DataStax spark-cassandra connector so I can connect to Cassandra from Spark. How can I do this?"
The Stack Overflow answer Angus sent is probably the easiest way if the connector can be used as a Spark package. Based on what I can find, however, this is probably not an option, which means you will need to install sbt and build the connector manually.

You can use Cassandra with the connector jar from DataStax: simply download the jar and pass it to the Dataproc cluster. A Google-provided template that I contributed to is available at this link [1]; it explains how to use the template to connect to Cassandra from Dataproc.
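
As a sketch of what "pass the jar to the cluster" could look like outside the template, the helper below assembles a gcloud job submission. The bucket, jar name, cluster name, region and Cassandra address are all placeholders, and it assumes the connector is available as a single assembly jar uploaded to Cloud Storage.

```python
import shlex


def dataproc_pyspark_command(script, cluster, region, connector_jar, cassandra_host):
    """Return the gcloud argv that submits `script` with the connector jar attached."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
        # The jar is shipped with this job only; nothing is installed on the cluster.
        f"--jars={connector_jar}",
        # Connector setting that names the Cassandra contact point.
        f"--properties=spark.cassandra.connection.host={cassandra_host}",
    ]


# Hypothetical names throughout; replace with your own resources.
cmd = dataproc_pyspark_command(
    "read_cassandra.py",
    cluster="my-cluster",
    region="us-central1",
    connector_jar="gs://my-bucket/jars/spark-cassandra-connector-assembly.jar",
    cassandra_host="10.128.0.5",
)
print(shlex.join(cmd))
```

Inside the submitted script, the table can then be read with `spark.read.format("org.apache.spark.sql.cassandra")`. The Cassandra VM has to be reachable from the Dataproc workers over the network for this to work.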

Resources