Understanding Spark Submit Yarn Client vs Cluster mode [duplicate] - apache-spark

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?
We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:
A master machine, which also is where our application is run using spark-submit
2 identical worker machines
From the Spark Documentation, I read:
(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
However, I don't really understand the practical differences by reading this, and I don't get what are the advantages and disadvantages of the different deploy modes.
Additionally, when I start my application using start-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:
So I am not able to test both modes to see the practical differences. That being said, my questions are:
1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pro's and con's of using each one?
2) How to I choose which one my application is going to be running on, using spark-submit?

What are the practical differences between Spark Standalone client
deploy mode and cluster deploy mode? What are the pro's and con's of
using each one?
Let's try to look at the differences between client and cluster mode.
Client:
Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at it's disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of it's own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to reset it's execution.
Cluster:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader
Driver runs as a dedicated, standalone process inside the Worker.
Driver programs takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared place or in a folder for each of the workers.
Which one is better? Not sure, that's actually for you to experiment and decide. This is no better decision here, you gain things from the former and latter, it's up to you to see which one works better for your use-case.
How to I choose which one my application is going to be running on,
using spark-submit
The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:
/bin/spark-submit \
--class <main-class>
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

Let's say you are going to perform a spark submit in EMR by doing SSH to the master node.
If you are providing the option --deploy-mode cluster, then following things will happen.
You won't be able to see the detailed logs in the terminal.
Since driver is not created in the Master itself, you won't be able to terminate the job from the terminal.
But in case of --deploy-mode client:
You will be able to see the detailed logs in the terminal.
You will be able to terminate the job from the terminal itself.
These are the basic things that I have noticed till now.

I'm also having the same scenario, here master node use a standalone ec2 cluster. In this setup client mode is appropriate. In this driver is launched directly with in the spark-submit process which acts as a client to the cluster. The Input & output of the application is attached to the console.Thus, this mode is especially suitable for applications that involve REPL.
Else if your application is submitted from a machine far from the worker machines then it is quite common to use cluster mode to minimize the network latency b/w driver & executor.

Related

How are spark jobs submitted in cluster mode?

I know there is information worth 10 google pages on this but, all of them tell me to just put --master yarn in the spark-submit command. But, in cluster mode, how can my local laptop even know what that means? Let us say I have my laptop and a running dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to directly connect. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.

Apache Spark when and what creates the driver?

I am trying to understand the sequence of events related to the creation of a driver program on spark-submit in cluster and client mode
Spark-Submit
Let's say I am on my machine and I do a spark-submit with the Yarn resource manager and deploy mode is cluster
Now, when a driver is created? Is it before the execution of the main program? or is when Spark Session is being created?
My understanding:
The spark-submit bash script interacts with the resource manager and asks for a container for running the main program.
Once the container is initiated the spark-submit script runs the main program on the cluster container.
Once the main program is executed then spark context interacts with
the resource manager to create containers for executors.
Now, if this is a correct understanding then what happens when we simply run a python script on a local machine with cluster mode?
See https://blog.knoldus.com/understanding-the-working-of-spark-driver-and-executor/
I can't explain it any better than this. See also https://spark.apache.org/docs/latest/submitting-applications.html
This answers more than your question. An excellent read.
Let’s say a user submits a job using “spark-submit”.
“spark-submit” will in-turn launch the Driver which will execute the main() method of our code.
Driver contacts the cluster manager and requests for resources to launch the -Executors.
The cluster manager launches the Executors on behalf of the Driver.
Once the Executors are launched, they establish a direct connection with the Driver.
The driver determines the total number of Tasks by checking the Lineage.
The driver creates the Logical and Physical Plan.
Once the Physical Plan is generated, Spark allocates the Tasks to the Executors.
Task runs on Executor and each Task upon completion returns the result to the Driver.
Finally, when all Task is completed, the main() method running in the Driver exits, i.e. main() method invokes sparkContext.stop().
Finally, Spark releases all the resources from the Cluster Manager.
Spark has two deploy modes: client and cluster.
client mode is the mode where the computer you submitted Spark jobs is the driver. That could be your local computer, or usually, it could be a so-called "edge node". With this mode, the driver shares its resources with many other software, and most of the time it's not optimal and reliable (think about the case when you submit a job while running something super heavy in your computer at the same time)
cluster mode is the mode where YARN is the one who picks a node among the cluster's available nodes and makes it the driver. So it will try to pick the best one and you don't have to worry about its resources anymore.
what happens when we simply run a python script on a local machine with cluster mode?
You now probably have some sense about the answer to this question: if you simply run a python script on a local machine, it would be client mode, spark job will use that local computer resources as part of Spark computation. On the other hand, with cluster mode, another computer will run as driver, not your local machine.

how many JVMs are running when Spark Local mode is used?

newbie here, but really enjoyed Spark so far.
I did the following (using a laptop, running Windows 7):
start the master by using command prompt window:
spark-class org.apache.spark.deploy.master.Master
start one worker by typing the following:
spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
repeat step 2, in other words, start another worker by using the same above command.
Now, I have one master, two workers, all in the same physical machine. Based on what I have been reading, this should be considered as the "local mode"... not sure about this, hope someone can confirm?
Also, from what I have read, local mode should have master and workers in one SINGLE JVM. But by running a small utility code, I can see that master is in one JVM, and two workers each stays in one JVM, so there are three JVMs, and they are different JVMs.
Can someone tell me which part I did wrong, or, what is the problem with my understanding?
Also, for this local model, there is no cluster manager, right?
Many thanks!
Local mode is a single JVM. Local mode is when you specify the master, via --master command line switch, as local[*]. This can be done via spark-submit or spark-shell.
This explains it pretty well.
Now, I have one master, two workers, all in the same physical machine. Based on what I have been reading, this should be considered as the "local mode"... not sure about this, hope someone can confirm?
It is not. It is a standalone mode, where you Spark's own cluster manager. Unlike local it is fully distributed. It will use:
For cluster manager:
Single JVM for each cluster master (one in your case, more in HA mode).
Single JVM for each started worker.
Optionally additional JVMs for services like history server, or shuffle service.
For application:
Single JVM for the driver.
Single JVM for each executor.
In local mode there is only one JVM, as already stated by Greg

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I'm assuming there are two choices for a reason. If so, how do you choose which one to use?
Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.
There are a few similar questions on stackoverflow, however those questions focus on the difference between the two approaches, but don't focus on when one approach is more suitable than the other.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications." -- Submitting Applications
What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally with spark-submit, or, also in the cluster.
When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to master node) with 100 times the network speed, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode does not probably transfer the jar to the master. At that point the difference between the 2 is minimal. Probably client mode is better when the driver program is idle most of the time, to make full use of cores on the local machine and perhaps avoid transferring the jar to the master (even on loopback interface a big jar takes quite a bit of seconds). And with client mode you can transfer (rsync) the jar on any cluster node.
On the other hand, if the driver is very intensive, in cpu or I/O, cluster mode may be more appropriate, to better balance the cluster (in client mode, the local machine would run both the driver and as many workers as possible, making it over loaded and making it that local tasks will be slower, making it such that the whole job may end up waiting for a couple of tasks from the local machine).
Conclusion :
To sum up, if I am in the same local network with the cluster, I would
use the client mode and submit it from my laptop. If the cluster is
far away, I would either submit locally with cluster mode, or rsync
the jar to the remote cluster and submit it there, in client or
cluster mode, depending on how heavy the driver program is on
resources.*
AFAIK With the driver program running in the cluster, it is less vulnerable to remote disconnects crashing the driver and the entire spark job.This is especially useful for long running jobs such as stream processing type workloads.
Spark Jobs Running on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
Understanding the difference requires an understanding of YARN’s Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The application is responsible for requesting resources from the ResourceManager, and, when allocated them, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client — the process starting the application can go away and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.
yarn-cluster mode
The yarn-cluster mode is not well suited to using Spark interactively, but the yarn-client mode is. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:
yarn-client mode
This table offers a concise list of differences between these modes:
Reference: https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ - Apache Spark Resource Management and YARN App Models (web.archive.com mirror)
In yarn-cluster mode, the driver program will run on the node where application master is running where as in yarn-client mode the driver program will run on the node on which job is submitted on centralized gateway node.
Both answers by Ram and Chavati/Wai Lee are useful and helpful, but here I just want to added a couple of other points:
Much has been written about the proximity of driver to the executors, and that is a big factor. The other factors are:
Will the driver process be around until execution of job is finished?
How's returned data being handled?
#1. In spark client mode, the driver process must be up and running the whole time when the spark job is in execution. So if you have a truly long job that say take many hours to run, you need to make sure the driver process is still up and running, and that the driver session is not auto-logout.
On the other hand, after submitting a job to run in cluster mode, the process can go away. The cluster mode will keep running. So this is typically how a production job will run: the job can be triggered by a timer, or by an external event and then the job will run to its completion without worrying about the lifetime of the process submitting the spark job.
#2. In client mode, you can call sc.collect() to gather all the data back from all executors, and then write/save the returned data to a local Linux file on local disk. Now this may not work for cluster mode, as the 'driver' typically run in a different remote host. The data written up therefore need to be persisted in a common mounted volume (such as GPFS, NFS) or in distributed file system like HDFS. If the job is running under Hadoop/YARN, the more common way for cluster mode is simply ask each executor to persist the data to HDFS, and not to run collect( ) at all. Collect() actually have scalability issue when there are a large number of executors returning large amount of data - it can overwhelm the driver process.

What conditions should cluster deploy mode be used instead of client?

The doc https://spark.apache.org/docs/1.1.0/submitting-applications.html
describes deploy-mode as :
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
Using this diagram fig1 as a guide (taken from http://spark.apache.org/docs/1.2.0/cluster-overview.html) :
If I kick off a Spark job :
./bin/spark-submit \
--class com.driver \
--master spark://MY_MASTER:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/Driver.jar
Then the Driver Program will be MY_MASTER as specified in fig1 MY_MASTER
If instead I use --deploy-mode cluster then the Driver Program will be shared among the Worker Nodes ? If this is true then does this mean that the Driver Program box in fig1 can be dropped (as it is no longer utilized) as the SparkContext will also be shared among the worker nodes ?
What conditions should cluster be used instead of client ?
No, when deploy-mode is client, the Driver Program is not necessarily the master node. You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
On the contrary, when deploy-mode is cluster, then cluster manager (master node) is used to find a slave having enough available resources to execute the Driver Program. As a result, the Driver Program would run on one of the slave nodes. As its execution is delegated, you can not get the result from Driver Program, it must store its results in a file, database, etc.
Client mode
Want to get a job result (dynamic analysis)
Easier for developing/debugging
Control where your Driver Program is running
Always up application: expose your Spark job launcher as REST service or a Web UI
Cluster mode
Easier for resource allocation (let the master decide): Fire and forget
Monitor your Driver Program from Master Web UI like other workers
Stop at the end: one job is finished, allocated resources are freed
I think this may help you understand.In the document https://spark.apache.org/docs/latest/submitting-applications.html
It says " A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters or Python applications."
What about HADR?
In cluster mode, YARN restarts the driver without killing the executors.
In client mode, YARN automatically kills all executors if your driver is killed.

Resources