Running multiple Spark jobs on a Mesos cluster

I would like to run multiple Spark jobs on my Mesos cluster and have all of them share the same Spark framework. Is this possible?
I have tried running the MesosClusterDispatcher and having the Spark jobs connect to the dispatcher, but each Spark job launches its own "Spark Framework" (I have tried both client mode and cluster mode).
Is this the expected behaviour?
Is it possible to share the same Spark framework among multiple Spark jobs?

Yes, this is normal and it is the expected behaviour.
As far as I know, on Mesos the dispatcher (MesosClusterDispatcher) is only in charge of allocating resources for your Spark driver, and each driver then acts as its own framework. Once the Spark driver has been launched, it is responsible for talking to Mesos and accepting offers in order to allocate the executors where tasks will be executed.
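To make this concrete, here is a minimal sketch in Python that only assembles spark-submit command lines for cluster mode through the dispatcher; it does not run anything. The dispatcher URL, class names and jar locations are hypothetical placeholders. spark.cores.max is a standard Spark property that caps the cores one application requests (it applies to Mesos coarse-grained mode), and using it here is my assumption about how you might let several frameworks coexist, not something the question mentions.

```python
import shlex


def dispatcher_submit_command(dispatcher_url, main_class, app_jar, max_cores):
    """Assemble (but do not run) a spark-submit command for Mesos cluster mode."""
    return [
        "spark-submit",
        "--master", dispatcher_url,        # hypothetical dispatcher address
        "--deploy-mode", "cluster",        # the driver is launched via the dispatcher
        "--class", main_class,
        "--conf", f"spark.cores.max={max_cores}",  # cap the cores this application requests
        app_jar,
    ]


# Hypothetical jobs; each submission results in its own driver, hence its own framework.
jobs = [
    ("com.example.JobA", "http://repo.example/job-a.jar"),
    ("com.example.JobB", "http://repo.example/job-b.jar"),
]

commands = [
    dispatcher_submit_command("mesos://dispatcher.example:7077", cls, jar, 4)
    for cls, jar in jobs
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each command corresponds to one driver, so with two submissions you should expect two frameworks in the Mesos UI, each limited to the cores you allowed it.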

Related

Monitor Spark with Prometheus when Spark clusters are spun up only when needed

We run Spark on Kubernetes, and we spin up a Spark driver and executors for many of our tasks (our own tasks, not Spark tasks). After a task has finished we spin the cluster (on Kubernetes) down and spin up another one when needed (many can be running simultaneously).
The problem I have is that I can't monitor it with Prometheus, because I do not have a driver that is always "alive" from which I can pull information about the executors.
Is there a solution for that kind of architecture?

Is there a link between Spark Components and the Spark Ecosystem?

I read the Cluster mode overview (link: https://spark.apache.org/docs/latest/cluster-overview.html) and I was wondering how components such as the Driver, Executors and Worker nodes can be mapped onto the components of the Spark Ecosystem such as Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and the scheduling/cluster managers. Which of these components belong to the Drivers, the Executors and the Worker nodes?
So basically my question is whether there is a link between these two figures: the components of Spark (figure 1) and the ecosystem of Spark (figure 2). If so, can somebody please explain to me what belongs to the drivers/executors/worker nodes?
Figure 1: Components of Spark
Figure 2: Spark Ecosystem
The cluster manager in figure 1 (as mentioned in the question) corresponds to the Standalone Scheduler, YARN and Mesos box in figure 2.
The cluster manager can be any one of the cluster/resource managers such as YARN, Mesos, Kubernetes, etc.
Nodes, or worker nodes, are the machines that are part of the cluster on which you want to run your Spark application in a distributed manner. You cannot relate them to anything on the Spark ecosystem diagram.
Nodes/worker nodes are actual machines, like your computer/laptop.
The drivers and executors are processes that run on machines that are part of the cluster.
One of the nodes in the cluster is selected as the master/driver node. This is where the driver process runs; it creates the SparkContext, runs your main method and splits up your code so that it can be executed in a distributed fashion, by creating jobs, stages and tasks.
The other nodes in the cluster are selected as worker nodes, and the executor processes on these nodes run the tasks assigned to them by the driver process.
Spark Core is the component/framework that makes all of this communication, scheduling and data transfer between the driver node and the worker nodes happen, so that you don't have to worry about these things and can focus on your business logic to get the required work done.
Structured Streaming, Spark SQL, MLlib and GraphX are functionality implemented on top of Spark Core, so you get common functionality that you can use to make your life easier. You would have Spark installed on all the nodes, i.e. the driver node and the worker nodes, and all of these components are available on those nodes by default.
You cannot compare the two figures exactly, because one shows how a Spark application is executed when you submit your code to the cluster, and the other just shows the various components that the Spark framework as a whole provides.

Kafka Spark Streaming

I am trying to build a Kafka and Spark Streaming use case in which Spark Streaming consumes a stream from Kafka. We enhance the stream and store the enhanced stream in some target system.
My question is: does it make sense to run the Spark Streaming job in yarn-cluster or yarn-client mode? (Hadoop is not involved here.)
I think the Spark Streaming job should run only in local mode, but then my other question is how to improve the performance of the Spark Streaming job.
Thanks,
local[*]
This runs the job in local mode.
Usually we use this for POCs and on very small data.
You can debug the job to understand how each line of code is working.
But you need to be aware that, since the job is running on your local machine, you cannot get the most out of Spark's distributed architecture.
yarn-client
The driver program runs on the client machine where you type the command to submit the Spark application, but the tasks are still executed on the executors in the cluster.
yarn-cluster
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is generally the best way to run a Spark job if you want to benefit from the advantages provided by a cluster manager.
I hope this gives you some clarity on how you may want to deploy your Spark job.
In fact, Spark provides very clear documentation explaining the various deployment strategies, with examples.
https://spark.apache.org/docs/latest/running-on-yarn.html
The difference is that with yarn-client you force the Spark job to use the host where you run spark-submit as the driver, whereas with yarn-cluster the driver will not necessarily be on the same host every time you run it.
So the best choice is usually yarn-cluster, to avoid overloading a single host if you are going to submit multiple jobs from that host with yarn-client.
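The three modes above can be sketched as spark-submit arguments. This is only an illustration in Python that builds the argument lists without running them; the function name, the mode labels used as dictionary keys and the script name my_streaming_job.py are hypothetical. Note that, as far as I know, since Spark 2.0 the YARN modes are written as --master yarn plus --deploy-mode client or cluster, rather than as the older yarn-client / yarn-cluster master strings.

```python
# Where the driver process ends up for each mode discussed above.
DRIVER_LOCATION = {
    "local": "same JVM as the executors, on the submitting machine",
    "yarn-client": "on the machine where spark-submit is run",
    "yarn-cluster": "inside the YARN application master, on the cluster",
}

# spark-submit options for each mode (labels on the left are just keys for this sketch).
MODE_OPTIONS = {
    "local": ["--master", "local[*]"],
    "yarn-client": ["--master", "yarn", "--deploy-mode", "client"],
    "yarn-cluster": ["--master", "yarn", "--deploy-mode", "cluster"],
}


def submit_args(mode, app):
    """Return the spark-submit argument list for one of the modes above."""
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["spark-submit", *MODE_OPTIONS[mode], app]


for mode in MODE_OPTIONS:
    # my_streaming_job.py is a hypothetical application script.
    print(mode, "->", " ".join(submit_args(mode, "my_streaming_job.py")))
    print("    driver runs:", DRIVER_LOCATION[mode])
```

Only the location of the driver changes between the two YARN variants; in both of them the tasks run on executors in the cluster.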

Role of master in Spark standalone cluster

In a Spark standalone cluster, what exactly is the role of the master (a node started with the start_master.sh script)?
I understand that it is the node that receives the jobs from the submit-job.sh script, but what is its role when processing a job?
I'm seeing in the web UI that it always delivers the job to a slave (a node started with start_slave.sh) and does not participate in the processing. Am I right? In that case, should I also run the start_slave.sh script on the same machine as the master to take advantage of its resources (CPU and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
Yarn
The above are the modes through which resources are offered to Spark applications (Local runs everything on a single machine rather than on a cluster).
Spark standalone mode is a master/slave architecture: we have a Spark Master and Spark Workers. The Spark Master runs on one of the cluster nodes and the Spark Workers run on the slave nodes of the cluster.
The Spark Master (often written standalone Master) is the resource manager for the Spark standalone cluster; it allocates the resources (CPU, memory, disk, etc.) among the Spark applications. The resources are used to run the Spark driver and executors.
The Spark Workers report resource information about the slave nodes to the Spark Master.
Spark standalone comes with its own resource manager. Think of the Spark Master/Worker as the equivalent of the YARN ResourceManager/NodeManager.
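As a rough illustration of that division of labour, here is a toy model in Python. It is not Spark code: ToyMaster, Worker and the greedy fill order are all inventions of this sketch, and Spark's real scheduling policy is different. It only shows the idea that workers report their resources and the master hands them out without running any tasks itself.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """A worker that reports how many free cores its machine has."""
    name: str
    free_cores: int


class ToyMaster:
    """Toy stand-in for a standalone Master: tracks workers, grants cores, runs no tasks."""

    def __init__(self):
        self.workers = []

    def register(self, worker):
        # Workers report their resources to the master when they start.
        self.workers.append(worker)

    def allocate(self, cores_wanted):
        """Grant up to cores_wanted cores; return {worker name: cores granted}."""
        granted = {}
        for worker in self.workers:
            if cores_wanted == 0:
                break
            take = min(worker.free_cores, cores_wanted)
            if take > 0:
                worker.free_cores -= take
                granted[worker.name] = take
                cores_wanted -= take
        return granted


master = ToyMaster()
master.register(Worker("worker-1", 4))
master.register(Worker("worker-2", 4))

grant = master.allocate(6)  # an application asks for 6 cores
print(grant)
```

In this toy model the machine hosting the master contributes cores only if a worker is also registered for it, which matches the point that the Master itself does not execute tasks.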

Apache Spark and Mesos running on a single node

I am interested in testing Spark running on Mesos. I created a Hadoop 2.6.0 single-node cluster in VirtualBox and installed Spark on it. I can successfully process files in HDFS using Spark.
Then I installed a Mesos master and slave on the same node. I tried to run Spark as a framework in Mesos using these instructions. I get the following warning from Spark:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources
spark-shell is successfully registered as a framework in Mesos. Is there anything wrong with using a single-node setup, or do I need to add more Spark worker nodes?
I am very new to Spark and my aim is just to test Spark, HDFS and Mesos.
If you have allocated enough resources for the Spark slaves, the cause might be a firewall blocking the communication. Take a look at my other answer:
Apache Spark on Mesos: Initial job has not accepted any resources
