Role of the Executors on the Spark master machine - apache-spark

In a Spark standalone cluster, does the Master node run tasks as well? I wasn't sure whether Executor processes are spun up on the Master node and do work alongside the Worker nodes.
Thanks!

Executors are only started on nodes that run at least one Worker daemon, i.e., no Executor is started on a node that does not serve as a Worker.
However, where you start the Master and the Workers is entirely up to you; there is no restriction preventing a Master and a Worker from co-locating on the same node.
To start a Worker daemon on the same machine as your Master, you can either edit the conf/slaves file to add the master's IP and use start-all.sh at startup, or start a worker on the master node at any time with start-slave.sh, supplying the Spark master URL (spark://master-host:7077).
Update (based on Daniel Darabos's suggestion):
In the Application Detail UI's Executors tab, you will also find a row with <driver> as its Executor ID. The driver it denotes is the process where your job is scheduled and monitored: it runs the main program you submitted to the Spark cluster, slices your transformations and actions on RDDs into stages, schedules the stages as TaskSets, and arranges for executors to run the tasks.
This <driver> is started on the node from which you call spark-submit in client mode, or on one of the worker nodes in cluster mode.
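For illustration, here is a minimal sketch (the app name, master URL and input path are hypothetical) of the kind of main program that <driver> refers to: it creates the SparkContext, declares transformations, and triggers an action that Spark splits into stages and tasks.

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DriverSketch")            // hypothetical application name
      .setMaster("spark://master-host:7077") // hypothetical standalone master URL
    val sc = new SparkContext(conf)

    // Transformations are only recorded here; the action below makes the driver
    // build stages, package them as TaskSets and hand the tasks to executors.
    val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println) // action: tasks run on executors, results return to the driver
    sc.stop()
  }
}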

Related

Does the Spark driver process run on the master node or on one of the worker nodes assigned by the cluster manager?

Some articles say the Spark driver process (program) runs on the master node, while elsewhere I read that the cluster manager launches the driver process on one of the worker nodes.
If it runs on one of the worker nodes, will only the cluster manager process run on the master node?
For example:
If we have 4 nodes with 1 master node and 3 worker nodes, and the cluster manager launches the driver program on one of the worker nodes, we are effectively left with only 2 worker nodes for executors, right?
I just want to understand where the driver program runs in cluster deploy mode: on the master node or on a worker node?

Do executors in Spark and the application master in YARN do the same job?

In Spark, there is a Driver and there are Executors. Without going into detail about what the driver and executors are, in one line: the driver manages the job flow and schedules tasks, and executors are processes on the worker nodes in charge of running individual tasks.
YARN is basically a resource manager that allocates resources to compute engines; the compute engine can be Spark, Tez, or MapReduce. What you need to understand here is that the units of resources YARN allocates are called containers.
When a Spark job is deployed on YARN, and assuming YARN has sufficient memory for the job to run, YARN first allocates a container for the Spark Application Master, which will host the driver program (in cluster mode). The Application Master then requests further resources for the Spark executors, which YARN also allocates as containers. So a Spark job ends up with multiple containers: one for the driver program and n containers for n executors. In computing terms, the fundamental difference between Spark running on a Spark standalone cluster and Spark running on YARN is the use of containers.
So on YARN, the executors and the Application Master run inside containers and do the same work as Spark does on a Spark standalone cluster.
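As a sketch of how this maps to configuration (the property names are real Spark settings, the values are made up), the driver can declare how many executor containers it wants YARN to allocate and how big each should be; the master ("yarn") and the deploy mode are normally passed on the spark-submit command line rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sizing: 3 executor containers of 2 GB / 2 cores each, plus the
// container YARN gives to the ApplicationMaster (which hosts the driver in cluster mode).
val conf = new SparkConf()
  .setAppName("YarnContainersSketch")
  .set("spark.executor.instances", "3") // n executors -> n executor containers
  .set("spark.executor.memory", "2g")   // memory requested per executor container
  .set("spark.executor.cores", "2")     // cores requested per executor container
val sc = new SparkContext(conf)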

What is the workflow that Hadoop uses to assign Master and Worker nodes through the Hadoop Configuration files?

I'm a bit confused about how master and worker nodes are assigned to the respective connected machines (VMs) on the network in Spark's cluster mode.
My question is: when I launch a Spark job (using spark-submit), what is the workflow responsible for assigning a master node and a worker node?
Thanks!
The Driver and the Executors request containers from YARN in order to launch and do their work. YARN takes care of the allocations for you, so you don't need to worry about where the master (driver) and the slaves (executors) are allocated.

Role of master in Spark standalone cluster

In a Spark standalone cluster, what exactly is the role of the master (a node started with the start-master.sh script)?
I understand that it is the node that receives the jobs submitted via spark-submit, but what is its role when processing a job?
I see in the web UI that it always hands the job to a slave (a node started with start-slave.sh) and does not participate in the processing. Am I right? In that case, should I also run the start-slave.sh script on the same machine as the master to take advantage of its resources (CPU and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
YARN
The above are the cluster modes that offer resources to Spark applications.
Spark standalone mode is a master-slave architecture: we have a Spark Master and Spark Workers. The Spark Master runs on one of the cluster nodes and the Spark Workers run on the slave nodes of the cluster.
The Spark Master (often called the standalone Master) is the resource manager for the Spark standalone cluster; it allocates resources (CPU, memory, disk, etc.) among the Spark applications. These resources are used to run the Spark Driver and the Executors.
Spark Workers report the resource information of the slave nodes to the Spark Master.
Spark standalone comes with its own resource manager. Think of the Spark Master/Worker as the YARN ResourceManager/NodeManager.
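For example, here is a minimal sketch (hypothetical host and values) of an application asking the standalone Master for resources; the Master grants CPU and memory from the Workers, and the application's executors are then launched there:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("StandaloneResourcesSketch")
  .setMaster("spark://master-host:7077") // the standalone Master, acting as resource manager
  .set("spark.executor.memory", "2g")    // memory granted per executor on a Worker
  .set("spark.cores.max", "4")           // cap on the total cores the Master may allocate to this app
val sc = new SparkContext(conf)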

Spark Driver in Apache Spark

I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3 as VirtualBox VMs) running Hadoop 1.0.0, and I installed Spark on each of these machines. ubuntu1 is my master node and the other nodes work as slaves. My questions are: what exactly is a Spark driver? Should we set an IP and port for the Spark driver via spark.driver.host? And where will it be executed and located (master or slave)?
The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark Master. In the case of a local (standalone) cluster, as in your case, the master URL is spark://<host>:<port>.
Its location is independent of the master/slaves. You can co-locate it with the master or run it from another node. The only requirement is that it must be on a network addressable from the Spark Workers.
This is what the configuration of your driver looks like:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://<host>:<port>")      // this is where the master is specified
  .setAppName("SparkExamplesMinimal")
  .set("spark.local.ip", "xx.xx.xx.xx")    // helps when multiple network interfaces are present; the driver must be on the same network as the master and slaves
  .set("spark.driver.host", "xx.xx.xx.xx") // same as above; this duality might disappear in a future version
val sc = new SparkContext(conf)
// etc...
To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver registers the application with the master, which allocates executors on the workers for it. The driver then creates tasks out of the RDD graph and submits them to those executors for execution, coordinating the different job stages.
The workers are where the tasks are actually executed. They must have the resources and network connectivity required to execute the operations requested on the RDDs.
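To make the division of labour concrete, here is a small sketch (a made-up job) with comments marking what happens on the driver and what runs as tasks on the workers:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("DriverVsWorkersSketch")) // master URL supplied via spark-submit

val numbers = sc.parallelize(1 to 1000000, 8) // driver: records the RDD and its 8 partitions, nothing runs yet
val squares = numbers.map(n => n.toLong * n)  // driver: records the transformation in the RDD graph
val total   = squares.reduce(_ + _)           // action: the driver builds tasks, the workers execute them

println(s"sum of squares = $total")           // the reduced result is sent back to the driver
sc.stop()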
Your question is related to Spark deployment on YARN; see "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html
Assume you start from a spark-submit --master yarn command:
The command will ask the YARN ResourceManager (RM) to start an ApplicationMaster (AM) process on one of your cluster machines (those with a YARN NodeManager installed on them).
Once the AM has started (in cluster mode), it will call your driver program's main method. So the driver is actually where you define your SparkContext, your RDDs, and your jobs. The driver contains the entry main method which starts the Spark computation.
The SparkContext will prepare an RPC endpoint for the executors to talk back to, and a lot of other things (memory store, disk block manager, Jetty server, ...).
The AM will request containers from the RM to run your Spark executors, with the driver RPC URL (something like spark://CoarseGrainedScheduler#ip:37444) specified in the executors' start command.
(In the Spark cluster overview diagram, the yellow box labelled "Spark context" is the Driver.)
A Spark driver is the process that creates and owns an instance of SparkContext. It is your Spark application that launches the main method in which the instance of SparkContext is created. It is the cockpit of job and task execution (using the DAGScheduler and the TaskScheduler). It hosts the Web UI for the environment.
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and overall execution of tasks.
In simple terms, the Spark driver is the program that contains the main method (the main method is the starting point of your program). So, in Java, the driver is the class that contains public static void main(String args[]).
In a cluster, you can run this program in either of the following ways:
1) On any remote host machine. Here you have to provide the remote host's details when submitting the driver program to it. The driver runs in a JVM process created on the remote machine and only comes back with the final result.
2) Locally from your client machine (your laptop). Here the driver program runs in a JVM process created locally on your machine. From there it sends the tasks to the remote hosts and waits for the result of each task.
If you set the deploy mode to cluster (config spark.submit.deployMode=cluster, or --deploy-mode cluster on spark-submit), your driver will be launched on one of your worker hosts (ubuntu2 or ubuntu3).
If the deploy mode is client, which is the default, the driver runs on the machine from which you submit your application.
And finally, you can see your application in the web UI at http://driverhost:driver_ui_port, where driver_ui_port defaults to 4040; you can change the port by setting the config "spark.ui.port".
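A minimal sketch of that port setting (the port value and app name are made up) from inside the driver program:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("UiPortSketch")
  .set("spark.ui.port", "4050") // serve the driver web UI on 4050 instead of the default 4040
val sc = new SparkContext(conf)
// The UI is then reachable at http://<driver-host>:4050 while the application runs.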
The Spark driver is where the application creates its SparkContext; the SparkContext is the connection to the compute resources.
Now, the driver can run on the box from which the job is submitted, or it can run on one of the cluster nodes when using a resource manager like YARN.
Both options (client/cluster mode) have tradeoffs, such as:
Access to the CPU/memory of one of the cluster nodes; sometimes this is good, because a cluster node can be big in terms of memory.
Driver logs end up on a cluster node instead of on the local box from which the job was submitted.
You should have a history server for cluster mode, otherwise driver-side logs are lost.
Sometimes it is hard to install a dependency (e.g. some native dependency) on the cluster, and running the Spark application in client mode comes to the rescue.
If you want to read more on the anatomy of a Spark job, this post could be useful: http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
(Diagram: Spark cluster components)
There are several useful things to note about this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
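As a sketch of that requirement (the addresses and port are made up), the driver's bind address and RPC port can be pinned so that the executors on the worker nodes can connect back to it, for example through a firewall:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DriverNetworkSketch")
  .set("spark.driver.host", "10.0.0.5") // address the workers can reach
  .set("spark.driver.port", "35000")    // fixed RPC port instead of a random one
val sc = new SparkContext(conf)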
