Mesos Configuration with existing Apache Spark standalone cluster - apache-spark

I am a beginner in Apache Spark!
I have set up a Spark standalone cluster using 4 PCs.
I want to use Mesos with the existing Spark standalone cluster, but I read that I need to install Mesos first and then configure Spark.
I have also seen the Spark documentation on setting up with Mesos, but it was not helpful for me.
So how do I configure Mesos with an existing Spark standalone cluster?

Mesos is an alternative cluster manager to the standalone Spark manager. You don't use it with the standalone manager, you use it instead of it.
To create a Mesos cluster, follow https://mesos.apache.org/gettingstarted/
Make sure the Mesos native library is available on the machine you use to submit jobs.
For cluster mode, start the Mesos dispatcher (sbin/start-mesos-dispatcher.sh).
Submit the application using the Mesos master URI (client mode) or the dispatcher URI (cluster mode), as in the sketch below.
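For example, a minimal sketch (the host names, ports, paths, and application class below are assumptions, not values from your cluster):

# Make the Mesos native library visible to spark-submit (path is an assumption)
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

# Client mode: submit directly against the Mesos master URI
spark-submit --master mesos://mesos-master:5050 --deploy-mode client \
  --class com.example.App /path/to/app.jar

# Cluster mode: start the dispatcher, then submit against the dispatcher URI
./sbin/start-mesos-dispatcher.sh --master mesos://mesos-master:5050
spark-submit --master mesos://dispatcher-host:7077 --deploy-mode cluster \
  --class com.example.App hdfs:///apps/app.jar

Note that in cluster mode the application jar must be reachable by the cluster (e.g. on HDFS or an HTTP server), since the driver is launched on a cluster node rather than on your machine.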

Related

Apache Spark and Livy cluster

Scenario:
I have a Spark cluster and I also want to use Livy.
I am new to Livy.
Problem:
I built my Spark cluster using Docker Swarm, and I will also create a service for Livy.
Can Livy communicate with an external Spark master and send a job to it? If so, which configuration needs to be done? Or should Livy be installed on the Spark master node?
I know this is a little late, but I hope this helps you.
Sorry for my English, I am Mexican. You can use Docker to send jobs via Livy, but you can also send jobs through the Livy REST API.
The Livy server can live outside of the Spark cluster; you only need to give Livy a conf file that points to your Spark cluster.
It looks like you are running Spark standalone. The easiest way to configure Livy is to have it live on the Spark master node. If you already have YARN on your cluster machines, you can install Livy on any node and run the Spark application in yarn-cluster or yarn-client mode.
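As a sketch of that approach (the host names, port, jar path, and class name below are placeholders), livy.conf on the Livy host just points at your standalone master, and a batch job can then be sent through the REST API:

# conf/livy.conf on the Livy host; the master URL is an assumption
livy.spark.master = spark://spark-master:7077
livy.spark.deploy-mode = client

# Submit a batch job through the Livy REST API (8998 is Livy's default port)
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "hdfs:///apps/my-app.jar", "className": "com.example.Main"}' \
  http://livy-host:8998/batches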

How to setup YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7, with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open-source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks.
YARN is part of Hadoop. So, a Hadoop installation is necessary to run Spark on YARN.
Check out the page on the Hadoop Cluster Setup.
Then you can use this documentation to learn about Spark on YARN.
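As a minimal sketch, assuming Hadoop and Spark are already installed and that HADOOP_CONF_DIR points at your Hadoop configuration (the paths and jar version below are placeholders), running the bundled SparkPi example on YARN in cluster mode looks like this:

export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.0.jar 100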
Another way to quickly learn about Hadoop, YARN, and Spark is to use the Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
We are currently using a similar setup in AWS. AWS EMR is costly, hence we set up our own cluster on EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions; however, we chose HDP.
The setup included the following:
Master Setup:
- Spark (along with the History Server)
- YARN ResourceManager
- HDFS NameNode
- Livy server
Slave Setup:
- YARN NodeManager
- HDFS DataNode
More information on manual installation can be found in the HDP documentation.
You can see part of that automation here.
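As a rough sketch of how those components map to daemons on a generic Hadoop 3.x layout (HDP uses its own service scripts, so the commands and install locations below are assumptions):

# On the master
hdfs --daemon start namenode
yarn --daemon start resourcemanager
$SPARK_HOME/sbin/start-history-server.sh
$LIVY_HOME/bin/livy-server start

# On each slave
hdfs --daemon start datanode
yarn --daemon start nodemanager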

How to run Spark driver in HA mode?

I have a Spark driver submitted to a Mesos cluster (with highly-available Mesos masters) in client mode (see this for client deploy mode).
I'd like to run the Spark driver in HA mode, too. How?
I could write my own implementation for this, but for now I'm looking for anything already available.
tl;dr Use cluster deploy mode with --supervise, e.g. spark-submit --deploy-mode cluster --supervise
Having an HA Spark driver in client mode is not possible, as described in the cited document:
In client mode, a Spark Mesos framework is launched directly on the client machine and waits for the driver output.
You'd have to somehow monitor the process on the client machine and check its exit code perhaps.
A much safer solution is to let Mesos do its job. You should use cluster deploy mode, in which it's Mesos that makes sure the driver runs (and gets restarted when it goes down). See the section Cluster mode:
Spark on Mesos also supports cluster mode, where the driver is launched in the cluster and the client can find the results of the driver from the Mesos Web UI.
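A sketch of such a submission (the dispatcher host, class name, and jar location are assumptions; in cluster mode the jar has to be reachable by the cluster, e.g. on HDFS):

spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyApp \
  hdfs:///apps/my-app.jar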

Apache Spark and Mesos running on a single node

I am interested in testing Spark running on Mesos. I created a Hadoop 2.6.0 single-node cluster in my Virtualbox and installed Spark on it. I can successfully process files in HDFS using Spark.
Then I installed Mesos Master and Slave on the same node. I tried to run Spark as a framework in Mesos using these instructions. I get the following error with Spark:
    WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The Spark shell is successfully registered as a framework in Mesos. Is there anything wrong with using a single-node setup? Or do I need to add more Spark worker nodes?
I am very new to Spark and my aim is to just test Spark, HDFS, and Mesos.
If you have allocated enough resources for the Spark slaves, the cause might be a firewall blocking the communication. Take a look at my other answer:
Apache Spark on Mesos: Initial job has not accepted any resources
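One quick way to rule out over-asking for resources on a single node is to cap what the Spark shell requests (the values and library path below are illustrative assumptions):

export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
spark-shell \
  --master mesos://localhost:5050 \
  --conf spark.cores.max=2 \
  --conf spark.executor.memory=1g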

YARN vs Spark processing engine based on real time application?

I understand YARN and Spark, but I want to know when I need to use YARN versus the Spark processing engine. What are the different case studies through which I can identify the difference between YARN and Spark?
You cannot compare YARN and Spark directly per se. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce is a feature that ships with YARN, while Spark is not.
If you mean comparing MapReduce and Spark, I suggest reading this other answer.
Apache Spark can be run on YARN, on Mesos, or in standalone mode.
Spark in standalone mode: all the resource management and job scheduling are handled by Spark itself.
Spark on YARN: YARN is a resource manager introduced in MRv2 that supports not only native Hadoop but also Spark, Kafka, Elasticsearch, and other custom applications.
Spark on Mesos: Spark also supports Mesos, which is one more type of resource manager.
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for Spark jobs. Only with YARN can Spark run against Kerberized Hadoop clusters, using secure authentication between its processes.
See the linked documentation on YARN and Spark for more.
To conclude: if you want to build a small and simple cluster independent of everything else, go for standalone mode. If you want to use an existing Hadoop cluster, go for YARN or Mesos.
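To make the comparison concrete, the same application can be pointed at any of the three cluster managers just by changing --master (host names and the jar are placeholders):

# Standalone
spark-submit --master spark://spark-master:7077 --class com.example.App app.jar
# YARN (requires HADOOP_CONF_DIR to point at the Hadoop configuration)
spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar
# Mesos
spark-submit --master mesos://mesos-master:5050 --class com.example.App app.jar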
