In Spark's client mode, the driver needs network access to remote executors? - apache-spark

When using spark at client mode (e.g. yarn-client), does the local machine that runs the driver communicates directly with the cluster worker nodes that run the remote executors?
If yes, does it mean the machine (that runs the driver) need to have network access to the worker nodes? So the master node requests resources from the cluster, and returns the IP addresses/ports of the worker nodes to the driver, so the driver can initiating the communication with the worker nodes?
If not, how does the client mode actually work?
If yes, does it mean that the client mode won't work if the cluster is configured in a way that the work nodes are not visible outside the cluster, and one will have to use cluster mode?
Thanks!

The Driver connects to the Spark Master, requests a context, and then the Spark Master passes the Spark Workers the details of the Driver to communicate and get instructions on what to do.
The means that the driver node must be available on the network to the workers, and it's IP must be one that's visible to them (i.e. if the driver is behind NAT, while the workers are in a different network, it won't work and you'll see errors on the workers that they fail to connect to the driver)

When you run Spark in client mode, the driver process runs locally.
In cluster mode, it runs remotely on an ApplicationMaster.
In other words you will need all the nodes to see each other. Spark driver definitely needs to communicate with all the worker nodes. If this is a problem try to use the yarn-cluster mode, then the driver will run inside your cluster on one of the nodes.

Related

What is the difference between Driver and Application manager in spark

I couldn't figure out what is the difference between Spark driver and application master. Basically the responsibilities in running an application, who does what?
In client mode, client machine has the driver and app master runs in one of the cluster nodes. In cluster mode, client doesn't have any, driver and app master runs in same node (one of the cluster nodes).
What exactly are the operations that driver do and app master do?
References:
Spark Driver memory and Application Master memory
Spark yarn cluster vs client - how to choose which one to use?
As per the spark documentation
Spark Driver :
The Driver(aka driver program) is responsible for converting a user
application to smaller execution units called tasks and then schedules
them to run with a cluster manager on executors. The driver is also
responsible for executing the Spark application and returning the
status/results to the user.
Spark Driver contains various components – DAGScheduler,
TaskScheduler, BackendScheduler and BlockManager. They are responsible
for the translation of user code into actual Spark jobs executed on
the cluster.
Where in Application Master is
The Application Master is responsible for the execution of a single
application. It asks for containers from the Resource Scheduler
(Resource Manager) and executes specific programs on the obtained containers.
Application Master is just a broker that negotiates resources with the Resource Manager and then after getting some container it make sure to launch tasks(which are picked from scheduler queue) on containers.
In a nutshell Driver program will translate your custom logic into stages, job and task.. and your application master will make sure to get enough resources from RM And also make sure to check the status of your tasks running in a container.
as it is already said in your provided references the only different between client and cluster mode is
In client, mode driver will run on the machine where we have executed/run spark application/job and AM runs in one of the cluster nodes.
(AND)
In cluster mode driver run inside application master, it means the application has much more responsibility.
References :
https://luminousmen.com/post/spark-anatomy-of-spark-application#:~:text=The%20Driver(aka%20driver%20program,status%2Fresults%20to%20the%20user.
https://www.edureka.co/community/1043/difference-between-application-master-application-manager#:~:text=The%20Application%20Master%20is%20responsible,class)%20on%20the%20obtained%20containers.

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I'm assuming there are two choices for a reason. If so, how do you choose which one to use?
Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.
There are a few similar questions on stackoverflow, however those questions focus on the difference between the two approaches, but don't focus on when one approach is more suitable than the other.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications." -- Submitting Applications
What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally with spark-submit, or, also in the cluster.
When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to master node) with 100 times the network speed, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode does not probably transfer the jar to the master. At that point the difference between the 2 is minimal. Probably client mode is better when the driver program is idle most of the time, to make full use of cores on the local machine and perhaps avoid transferring the jar to the master (even on loopback interface a big jar takes quite a bit of seconds). And with client mode you can transfer (rsync) the jar on any cluster node.
On the other hand, if the driver is very intensive, in cpu or I/O, cluster mode may be more appropriate, to better balance the cluster (in client mode, the local machine would run both the driver and as many workers as possible, making it over loaded and making it that local tasks will be slower, making it such that the whole job may end up waiting for a couple of tasks from the local machine).
Conclusion :
To sum up, if I am in the same local network with the cluster, I would
use the client mode and submit it from my laptop. If the cluster is
far away, I would either submit locally with cluster mode, or rsync
the jar to the remote cluster and submit it there, in client or
cluster mode, depending on how heavy the driver program is on
resources.*
AFAIK With the driver program running in the cluster, it is less vulnerable to remote disconnects crashing the driver and the entire spark job.This is especially useful for long running jobs such as stream processing type workloads.
Spark Jobs Running on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
Understanding the difference requires an understanding of YARN’s Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The application is responsible for requesting resources from the ResourceManager, and, when allocated them, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client — the process starting the application can go away and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.
yarn-cluster mode
The yarn-cluster mode is not well suited to using Spark interactively, but the yarn-client mode is. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:
yarn-client mode
This table offers a concise list of differences between these modes:
Reference: https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ - Apache Spark Resource Management and YARN App Models (web.archive.com mirror)
In yarn-cluster mode, the driver program will run on the node where application master is running where as in yarn-client mode the driver program will run on the node on which job is submitted on centralized gateway node.
Both answers by Ram and Chavati/Wai Lee are useful and helpful, but here I just want to added a couple of other points:
Much has been written about the proximity of driver to the executors, and that is a big factor. The other factors are:
Will the driver process be around until execution of job is finished?
How's returned data being handled?
#1. In spark client mode, the driver process must be up and running the whole time when the spark job is in execution. So if you have a truly long job that say take many hours to run, you need to make sure the driver process is still up and running, and that the driver session is not auto-logout.
On the other hand, after submitting a job to run in cluster mode, the process can go away. The cluster mode will keep running. So this is typically how a production job will run: the job can be triggered by a timer, or by an external event and then the job will run to its completion without worrying about the lifetime of the process submitting the spark job.
#2. In client mode, you can call sc.collect() to gather all the data back from all executors, and then write/save the returned data to a local Linux file on local disk. Now this may not work for cluster mode, as the 'driver' typically run in a different remote host. The data written up therefore need to be persisted in a common mounted volume (such as GPFS, NFS) or in distributed file system like HDFS. If the job is running under Hadoop/YARN, the more common way for cluster mode is simply ask each executor to persist the data to HDFS, and not to run collect( ) at all. Collect() actually have scalability issue when there are a large number of executors returning large amount of data - it can overwhelm the driver process.

Can driver process run outside of the Spark cluster?

I read an answer from What conditions should cluster deploy mode be used instead of client?,
(In client mode) You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
Also, the Spark Doc says,
In client mode, the driver is launched in the same process as the client that submits the application.
Does it mean that I can submit spark tasks from any machine, as long as it can be reachable from master and has Spark environment?
Or in other words, can driver process run outside of the Spark cluster?
Yes, the driver can run on your laptop. Keep in mind though:
The Spark driver will need the Hadoop configuration to be able to talk to YARN and HDFS. You could copy it from the cluster and point to it via HADOOP_CONF_DIR.
The Spark driver will listen on a lot of ports and expect the executors to be able to connect to it. It will advertise the hostname of your laptop. Make sure it can be resolved and all ports accessed from the cluster environment.
Yes, I'm running spark-submit jobs over the LAN using option --deploy-mode cluster. Currently running into this issue however: the server response (json object) isn't very descriptive.

What conditions should cluster deploy mode be used instead of client?

The doc https://spark.apache.org/docs/1.1.0/submitting-applications.html
describes deploy-mode as :
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
Using this diagram fig1 as a guide (taken from http://spark.apache.org/docs/1.2.0/cluster-overview.html) :
If I kick off a Spark job :
./bin/spark-submit \
--class com.driver \
--master spark://MY_MASTER:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/Driver.jar
Then the Driver Program will be MY_MASTER as specified in fig1 MY_MASTER
If instead I use --deploy-mode cluster then the Driver Program will be shared among the Worker Nodes ? If this is true then does this mean that the Driver Program box in fig1 can be dropped (as it is no longer utilized) as the SparkContext will also be shared among the worker nodes ?
What conditions should cluster be used instead of client ?
No, when deploy-mode is client, the Driver Program is not necessarily the master node. You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
On the contrary, when deploy-mode is cluster, then cluster manager (master node) is used to find a slave having enough available resources to execute the Driver Program. As a result, the Driver Program would run on one of the slave nodes. As its execution is delegated, you can not get the result from Driver Program, it must store its results in a file, database, etc.
Client mode
Want to get a job result (dynamic analysis)
Easier for developing/debugging
Control where your Driver Program is running
Always up application: expose your Spark job launcher as REST service or a Web UI
Cluster mode
Easier for resource allocation (let the master decide): Fire and forget
Monitor your Driver Program from Master Web UI like other workers
Stop at the end: one job is finished, allocated resources are freed
I think this may help you understand.In the document https://spark.apache.org/docs/latest/submitting-applications.html
It says " A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters or Python applications."
What about HADR?
In cluster mode, YARN restarts the driver without killing the executors.
In client mode, YARN automatically kills all executors if your driver is killed.

Spark Driver in Apache spark

I already have a cluster of 3 machines (ubuntu1,ubuntu2,ubuntu3 by VM virtualbox) running Hadoop 1.0.0. I installed spark on each of these machines. ub1 is my master node and the other nodes are working as slave. My question is what exactly a spark driver is? and should we set a IP and port to spark driver by spark.driver.host and where it will be executed and located? (master or slave)
The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark Master. In the case of a local cluster, like is your case, the master_url=spark://<host>:<port>
Its location is independent of the master/slaves. You could co-located with the master or run it from another node. The only requirement is that it must be in a network addressable from the Spark Workers.
This is how the configuration of your driver looks like:
val conf = new SparkConf()
.setMaster("master_url") // this is where the master is specified
.setAppName("SparkExamplesMinimal")
.set("spark.local.ip","xx.xx.xx.xx") // helps when multiple network interfaces are present. The driver must be in the same network as the master and slaves
.set("spark.driver.host","xx.xx.xx.xx") // same as above. This duality might disappear in a future version
val sc = new spark.SparkContext(conf)
// etc...
To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
The workers is where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
You question is related to spark deploy on yarn, see 1: http://spark.apache.org/docs/latest/running-on-yarn.html "Running Spark on YARN"
Assume you start from a spark-submit --master yarn cmd :
The cmd will request yarn Resource Manager (RM) to start a ApplicationMaster (AM)process on one of your cluster machines (those have yarn node manager installled on it).
Once the AM started, it will call your driver program's main method. So the driver is actually where you define your spark context, your rdd, and your jobs. The driver contains the entry main method which start the spark computation.
The spark context will prepare RPC endpoint for the executor to talk back, and a lot of other things(memory store, disk block manager, jetty server...)
The AM will request RM for containers to run your spark executors, with the driver RPC url (something like spark://CoarseGrainedScheduler#ip:37444) specified on the executor's start cmd.
The Yellow box "Spark context" is the Driver.
A Spark driver is the process that creates and owns an instance of SparkContext. It is your
Spark application that launches the main method in which the instance of SparkContext is
created. It is the cockpit of jobs and tasks execution (using DAGScheduler and Task
Scheduler). It hosts Web UI for the environment
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and overall execution of tasks.
In simple term, Spark driver is a program which contains the main method (main method is the starting point of your program). So, in Java, driver will be the Class which will contain public static void main(String args[]).
In a cluster, you can run this program in either one of the ways:
1) In any remote host machine. Here you'll have to provide the remote host machine details while submitting the driver program on to the remote host. The driver runs in the JVM process created in remote machine and only comes back with final result.
2) Locally from your client machine(Your laptop). Here the driver program runs in JVM process created in your machine locally. From here it sends the task to remote hosts and wait for the result from each tasks.
If you set config "spark.deploy.mode = cluster", then your driver will be launched at your worker hosts(ubuntu2 or ubuntu3).
If spark.deploy.mode=driver, which is the default value, then the driver will run on the machine your submit your application.
And finally, you can see your application on web UI: http://driverhost:driver_ui_port, where the driver_ui_port is default 4040, and you can change the port by set config "spark.ui.port"
Spark driver is node that allows application to create SparkContext, sparkcontext is connection to compute resource.
Now driver can run the box it is submitted or it can run on one of node of cluster when using some resource manager like YARN.
Both options of client/cluster has some tradeoff like
Access to CPU/Memory of once of the node on cluster, some time this is good because cluster node will be big in terms memory.
Driver logs are on cluster node vs local box from where job was submitted.
You should have history server for cluster mode other wise driver side logs are lost.
Some time it is hard to install dependency(i.e some native dependency) executor and running spark application in client mode comes to rescue.
If you want to read more on Spark Job anatomy then http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html post could be useful
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
Spark cluster components
There are several useful things to note about this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

Resources