I read the Cluster mode overview (link: https://spark.apache.org/docs/latest/cluster-overview.html) and I was wondering how components such as the driver, executors and worker nodes map onto the components of the Spark ecosystem, such as Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and the scheduling/cluster managers. Which of these ecosystem components belong to the driver, the executors and the worker nodes?
So basically my question is whether there is a link between these two figures of the components of Spark (figure 1) and the ecosystem of Spark (figure 2). If so, can somebody please explain to me what belongs to the driver/executors/worker nodes?
Figure 1: Components of Spark
Figure 2: Spark Ecosystem
The cluster manager in Figure 1 (as mentioned in the question) corresponds to the Standalone Scheduler, YARN and Mesos in Figure 2 (as mentioned in the question).
The cluster manager can be any one of the cluster/resource managers such as YARN, Mesos, Kubernetes, etc.
Nodes, or worker nodes, are the machines that are part of the cluster on which you want to run your Spark application in a distributed manner. You cannot map them to anything on the Spark ecosystem diagram.
Nodes/worker nodes are actual physical machines, like your computer/laptop.
Now, the driver and the executors are processes that run on machines that are part of the cluster.
One of the nodes in the cluster is selected as the master/driver node, and this is where the driver process runs. It creates the SparkContext, runs your main method, and splits your code up so that it can be executed in a distributed fashion by creating jobs, stages and tasks.
The other nodes of the cluster act as worker nodes, and the executor processes on those nodes run the tasks assigned to them by the driver process.
Now coming to Spark Core: it is the component/framework that makes all of this communication, scheduling and data transfer between the driver node and the worker nodes happen, so you don't have to worry about those details and can just focus on your business logic to get the required work done.
Structured Streaming, Spark SQL, MLlib and GraphX are functionality implemented on top of Spark Core as the base, so you get common functionality that you can use to make your life easier. Spark is installed on all the nodes, i.e. the driver node and the worker nodes, and all of these components are available on those nodes by default.
You cannot compare the two figures exactly, because one shows how a Spark application is executed when you submit your code to a cluster, while the other just shows the various components that the Spark framework as a whole provides.
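To make the driver/executor split concrete, here is a minimal word-count sketch (the application name and input path are made-up examples): the main method runs in the driver process, while the transformations run as tasks inside the executor processes on the worker nodes.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Runs in the driver process: creates the SparkSession (and SparkContext).
    val spark = SparkSession.builder().appName("word-count").getOrCreate()

    // The driver splits these transformations into jobs, stages and tasks;
    // the tasks themselves execute inside executor processes on the worker nodes.
    val counts = spark.sparkContext
      .textFile("hdfs:///tmp/input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // collect() brings the final result back to the driver.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```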
We have a Hadoop cluster (HDP 2.6.5 with Ambari, with 25 datanode machines).
We are using a Spark Streaming application (Spark 2.1 running over Hortonworks 2.6.x).
The current situation is that the Spark Streaming application runs on all datanode machines,
but now we want the Spark Streaming application to run only on the first 10 datanode machines,
so the other 15 datanode machines will be restricted, and the Spark application will run only on the first 10 datanode machines.
Can this scenario be done with Ambari features or some other approach?
For example, we found
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html
and
http://crazyadmins.com/configure-node-labels-on-yarn/
but we are not sure whether node labels can help us.
@Jessica Yes, you are absolutely on the right path. YARN node labels and YARN queues are how Ambari administrators control team-level access to portions of the entire YARN cluster. You can start very basic with just a non-default queue, or get very in-depth with many queues for many different teams. Node labels take it to another level, allowing you to map queues and teams to specific nodes.
Here is a post with the syntax for Spark to use the YARN queue:
How to choose the queue for Spark job using spark-submit?
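As a rough sketch of how the pieces fit together once a queue is mapped to the labeled nodes (the queue name spark10 and the label first10nodes are made-up examples and must match whatever is configured in YARN/Ambari), the application can be pointed at them via configuration:

```scala
import org.apache.spark.sql.SparkSession

// Equivalent spark-submit flags:
//   spark-submit --queue spark10 \
//     --conf spark.yarn.am.nodeLabelExpression=first10nodes \
//     --conf spark.yarn.executor.nodeLabelExpression=first10nodes \
//     ...
val spark = SparkSession.builder()
  .appName("streaming-on-labeled-nodes")
  .config("spark.yarn.queue", "spark10")                              // submit to this YARN queue
  .config("spark.yarn.am.nodeLabelExpression", "first10nodes")        // where the application master may run
  .config("spark.yarn.executor.nodeLabelExpression", "first10nodes")  // where executors may run
  .getOrCreate()
```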
I tried to find the 2.6 version of these docs but was not able to... they have really mixed up the docs since the merger...
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ch_node_labels.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/data-operating-system/content/configuring_node_labels.html
The actual steps you have to take may be a combination of items from both; that has been my typical experience when working with Ambari HDP/HDF.
I am reading Spark: The Definitive Guide and am learning a great deal.
However, one thing I am confused about while reading is how many driver processes there are per Spark job. When the text first introduces driver and executor processes, it implies that there is a driver per machine, but in the discussion about broadcast variables, it sounds as though there is one driver per cluster.
This is because the text talks about the driver process sending the broadcast variable to every node in the cluster so that it can be cached there. That makes it sound as though there is only one driver process in the whole cluster that is responsible for this.
Which one is it: one driver process per cluster, or one per machine? Or can it be both? I think I am missing something here.
In the Spark architecture there is only one driver for your Spark application.
The Spark driver, as part of the Spark application, is responsible for instantiating a Spark session. The Spark driver has multiple responsibilities:
It communicates with the cluster manager (CM).
It requests resources from the CM for Spark's executor JVMs.
It transforms all Spark operations into DAG computations, schedules them, and distributes their execution as tasks across all Spark executors.
Its interaction with the CM is merely to get Spark executor resources.
So the workflow of running a Spark application on a cluster can be seen as follows:
The user submits an application using spark-submit.
spark-submit launches the driver program and invokes the main method specified by the user.
The driver program contacts the cluster manager to ask for resources to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD or Dataset operations in the program, the driver sends work to the executors in the form of tasks.
The tasks run in the executor processes to compute and save results.
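To tie this back to the broadcast-variable discussion in the question: it is that single driver process that ships the broadcast value to the executors, where it is cached. A minimal sketch (the lookup table is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-demo").getOrCreate()
val sc = spark.sparkContext

// Created once in the (single) driver process...
val countryNames = Map("NL" -> "Netherlands", "US" -> "United States")
val broadcastNames = sc.broadcast(countryNames)

// ...and read by tasks running in the executor processes,
// where the broadcast value is cached after it is first fetched.
val codes = sc.parallelize(Seq("NL", "US", "NL"))
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```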
What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers on Google, but most of them sound wrong to me, so I'm not sure where the truth is.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on the cluster nodes, e.g. YARN's slaves are called NodeManagers, and they may consume even more memory than Spark's slaves (Spark's default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
This goes against Spark Standalone # executor/cores control, which shows how you can specify the amount of resources an application consumes in Standalone mode.
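For instance, a Standalone-mode application can cap what it consumes with a few settings (a minimal sketch; the numbers are arbitrary examples):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("capped-standalone-app")
  .config("spark.cores.max", "8")        // total cores this application may take from the cluster
  .config("spark.executor.cores", "2")   // cores per executor
  .config("spark.executor.memory", "2g") // memory per executor
  .getOrCreate()
```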
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
This goes against the fact that Standalone mode can use Dynamic Allocation, and that you can specify spark.dynamicAllocation.minExecutors & spark.dynamicAllocation.maxExecutors. Also, I haven't found any note saying that Standalone doesn't support the FairScheduler.
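A sketch of enabling Dynamic Allocation in Standalone mode (it also requires the external shuffle service on the workers; the executor counts are arbitrary examples):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-standalone")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")      // external shuffle service must be running on the workers
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .getOrCreate()
```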
This answer
YARN directly handles rack and machine locality
How would YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore). Inside the Spark job I'm querying some-db.some-table. How would YARN know which executor is better for the job assignment?
UPD: I found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.
I would like to run multiple Spark jobs on my Mesos cluster and have all Spark jobs share the same Spark framework. Is this possible?
I have tried running the MesosClusterDispatcher and having the Spark jobs connect to the dispatcher, but each Spark job launches its own "Spark framework" (I have tried running in both client mode and cluster mode).
Is this the expected behaviour?
Is it possible to share the same spark-framework among multiple spark jobs?
It is normal, and it's the expected behaviour.
In Mesos, as far as I know, the SparkDispatcher is in charge of allocating resources for your Spark driver, which will act as a framework. Once the Spark driver has been allocated, it is responsible for talking to Mesos and accepting offers to allocate the executors where tasks will be executed.
I have an analytics node running, with the Spark SQL Thriftserver running on it. Now I can't run another Spark application with spark-submit.
It says it doesn't have resources. How do I configure the DSE node to be able to run both?
The SparkSqlThriftServer is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time:
Allocate only part of your resources to each application.
This is done by setting spark.cores.max to a smaller value than the maximum resources in your cluster (see the configuration sketch after these options).
See Spark Docs
Dynamic Allocation
This allows applications to change the amount of resources they use depending on how much work they are trying to do.
See Spark Docs
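A rough sketch of option 1 (the numbers and application name are arbitrary examples, and the startup command shown is the vanilla Spark script; DSE starts the Thrift server through its own tooling): cap the Thrift server at startup and cap the second application the same way, so both fit in the cluster.

```scala
import org.apache.spark.sql.SparkSession

// The Thrift server needs a similar cap when it is started, e.g. (vanilla Spark syntax):
//   sbin/start-thriftserver.sh --conf spark.cores.max=8
// The second application then only asks for what is left:
val spark = SparkSession.builder()
  .appName("second-application")
  .config("spark.cores.max", "4") // leave the remaining cores for the Thrift server
  .getOrCreate()
```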